#archiveteam-bs log for 2021-09-02

00:01:22		dm4v quits [Read error: Connection reset by peer]
00:02:08		dm4v joins
00:02:10		dm4v is now authenticated as dm4v
00:02:10		dm4v quits [Changing host]
00:02:10		dm4v (dm4v) joins
00:02:27		Stiletto quits [Remote host closed the connection]
00:09:19		Stiletto joins
00:27:35		wickedplayer494 is now authenticated as wickedplayer494
00:41:56		britmob2563 quits [Ping timeout: 250 seconds]
00:42:21		qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
00:47:14		Megame (Megame) joins
00:48:46		qwertyasdfuiopghjkl joins
00:50:13		sonick quits [Client Quit]
00:53:28		britmob2563 joins
01:03:31		dm4v quits [Ping timeout: 252 seconds]
01:05:02		dm4v joins
01:05:04		dm4v is now authenticated as dm4v
01:05:04		dm4v quits [Changing host]
01:05:04		dm4v (dm4v) joins
01:10:15		Iki quits [Ping timeout: 244 seconds]
01:13:00		BlueMaxima joins
01:22:56	<pabs>	AK: I recommend using anarcat's feed2exec for that RSS project, it already has a plugin for using SPN
01:23:42	<pabs>	hmm, looks like it only uses /save/http.... though not POST to /save/
01:24:23	<pabs>	IIRC those are different?
01:24:44	<@arkiver>	nah, lets not use SPN for that
01:25:58	<pabs>	its easy to write another plugin for whatever other mechanism you want, the SPN plugin is just a few lines of Python
01:26:19	<@arkiver>	SPN is almost always over capacity
01:26:40	<@arkiver>	AK: ThreeHM: made an issue on urls-grab repo for it
01:27:06	<pabs>	am I remembering correctly that /save/https:... and POST to /save/ are different?
01:33:05	<Jake>	i believe they are different, but both are over capacity often if I'm not mistaken?
01:33:38	<@OrIdow6>	This is ArchiveTeam
01:34:00	<@OrIdow6>	There is no point bulk SPNing when you have a direct ability to put things into the Wayback Machine
01:34:44		jamesp leaves
01:37:45		Iki joins
01:43:02		wyatt8740 quits [Ping timeout: 250 seconds]
01:43:13		wyatt8740 joins
02:08:36		Megame quits [Ping timeout: 250 seconds]
02:50:08		sonick (sonick) joins
03:35:15		nicolas17 quits [Client Quit]
03:43:01		qw3rty_ joins
03:46:52		qw3rty__ quits [Ping timeout: 252 seconds]
04:04:13		DogsRNice quits [Read error: Connection reset by peer]
04:33:02		BlueMaxima_ joins
04:34:54		Wolfin quits [Quit: ZNC - https://znc.in]
04:35:40		wolfin (wolfin) joins
04:36:48		BlueMaxima quits [Ping timeout: 250 seconds]
04:40:14		BlueMaxima joins
04:44:04		BlueMaxima_ quits [Ping timeout: 252 seconds]
05:06:29		Megame (Megame) joins
05:19:57		sonick quits [Client Quit]
07:12:00		sonick (sonick) joins
08:09:16		Ruthalas quits [Ping timeout: 244 seconds]
08:13:30		Ruthalas (Ruthalas) joins
08:26:19		benjins quits [Ping timeout: 244 seconds]
08:30:27		BlueMaxima_ joins
08:34:31		BlueMaxima quits [Ping timeout: 252 seconds]
09:56:17		knecht420 quits [Client Quit]
09:58:13		knecht420 (knecht420) joins
10:02:52		knecht420 quits [Client Quit]
10:04:11		knecht420 (knecht420) joins
11:10:05		benjins joins
11:10:45		benjins is now authenticated as benjins
11:23:11		qwertyasdfuiopghjkl quits [Remote host closed the connection]
11:42:19		russss quits []
11:42:31		russss (russss) joins
11:43:32		anarcat (anarcat) joins
11:43:46	<anarcat>	i'm kind of afk but happy to discuss RSS feed archival with feed2exec, async
11:44:53	<pabs>	AK, arkiver, JAA: ^
11:47:32		JSharp quits []
11:47:44		JSharp (JSharp) joins
11:59:03		revi quits []
11:59:18		revi (revi) joins
11:59:37		Megame quits [Client Quit]
12:04:15		BlueMaxima_ quits [Client Quit]
12:11:02	<AK>	Ooh
12:11:42	<AK>	Possibly not a bad idea, feed2exec sending new stuff somewhere that we can then add to #// or whatever else is needed
13:02:21		knecht420 quits [Client Quit]
13:03:53		knecht420 (knecht420) joins
13:39:08		knecht420 quits [Client Quit]
13:40:20		knecht420 (knecht420) joins
13:45:32		Megame (Megame) joins
14:35:36		devsnek quits []
14:35:48		devsnek (devsnek) joins
14:42:06		knecht420 quits [Client Quit]
14:42:27		knecht420 (knecht420) joins
14:59:07		knecht420 quits [Read error: Connection reset by peer]
14:59:11		knecht4202 (knecht420) joins
15:00:43		knecht4202 is now known as knecht420
15:14:40		@EggplantN quits []
15:15:09		EggplantN joins
15:29:50		gazorpazorp quits [Remote host closed the connection]
15:30:04		gazorpazorp (gazorpazorp) joins
15:30:42		Arcorann quits [Ping timeout: 250 seconds]
15:35:28		ThreeHM quits [Ping timeout: 250 seconds]
15:36:36		ThreeHM (ThreeHeadedMonkey) joins
15:40:45		xkey quits [Quit: WeeChat 3.1]
15:41:34		xkey (eyo) joins
16:10:35		Megame quits [Client Quit]
16:41:46		sonick quits [Client Quit]
16:42:50		monoxane quits [Ping timeout: 244 seconds]
17:31:54	<anarcat>	well right now i have things like this:
17:32:05	<anarcat>	[anarcat-archive]
17:32:05	<anarcat>	url = https://anarc.at/recentchanges/archive.rss
17:32:05	<anarcat>	output = feed2exec.plugins.wayback
17:32:05	<anarcat>	filter = feed2exec.plugins.ikiwiki_recentchanges
17:32:29	<anarcat>	the "wayback" plugin is this thing https://gitlab.com/anarcat/feed2exec/-/blob/main/feed2exec/plugins/wayback.py
17:33:00	<anarcat>	which basically does a GET on web.archive.org/save/$URL
17:33:11	<anarcat>	i was told (by pabs) that i should be doing a POST instead
17:33:32	<anarcat>	but my understanding of the whole SPN thing is that it works better from a browser, because the browser triggers the inline resources archival
17:33:41	<anarcat>	and i don't really want to rewrite wpull here
17:33:50	<anarcat>	i guess i could have a wpull plugin or something, but then i have two problems
17:35:58	<@OrIdow6>	anarcat: What I think pabs is trying to tell you is that SPN2, which involves a POST request, uses a headless browser to make the capture
17:37:04	<anarcat>	oh, that actually happens server-side?
17:37:11	<anarcat>	i did tell pabs to send me a patch :p
17:37:38	<anarcat>	but i suspect it can't be as simple as this:
17:37:39	<anarcat>	- res = session.get(wayback_url, allow_redirects=True)
17:37:40	<anarcat>	+ res = session.post(wayback_url, allow_redirects=True)
17:37:42	<anarcat>	right?
17:37:59	<@OrIdow6>	I don't remember the exact API it uses
17:38:02	<@OrIdow6>	https://blog.archive.org/2019/10/23/the-wayback-machines-save-page-now-is-new-and-improved/
17:38:56	<@OrIdow6>	Officially if you want to use the API you're supposed to ask them, in practice it seems many people have watched the network traffic the web interface makes (the POST) and replicated that
17:39:29	<@OrIdow6>	Anyhow, if there is an ArchiveTeam RSS project, vs. personal RSS projects, I do not think it is a stretch to say that we will not be using SPN in any form
17:43:19	<anarcat>	for sure
17:43:37	<h2ibot>	Ryz edited List of websites excluded from the Wayback Machine (+27, Added https://palladiummag.com/): https://wiki.archiveteam.org/?diff=47095&oldid=47081
17:51:01		nicolas17 joins
17:51:52	<Iki>	Another site to add to the 'excluded' page: https://jonahbennett.com
17:52:21	<@JAA>	Iki: Yes, please add it. Anyone can make edits. :-)
17:55:40	<Ryz>	Added that~
17:56:34	<anarcat>	oh hi JAA
17:57:10	<@JAA>	Hey :-)
17:57:39	<h2ibot>	Ryz edited List of websites excluded from the Wayback Machine (+28, Added https://jonahbennett.com/): https://wiki.archiveteam.org/?diff=47096&oldid=47095
18:00:40	<h2ibot>	JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=47097&oldid=47096
18:22:09		spirit quits [Quit: Leaving]
18:58:20		DogsRNice (Webuser299) joins
19:31:19		Ryz quits [Remote host closed the connection]
19:32:21		Ryz (Ryz) joins
20:27:32		lennier1 quits [Client Quit]
20:28:13		lennier1 (lennier1) joins
20:45:57		britmob2563 quits [Client Quit]
21:16:53	<@arkiver>	anarcat: what is that?
21:17:19	<@arkiver>	yeah we're definitely not going to use SPN
21:18:18	<anarcat>	sorry, what is what?
21:18:41	<@JAA>	arkiver: Basically, we can use that to process RSS feeds and get URLs (or anything we want really) out. And then queue those to URLs.
21:19:02	<anarcat>	oh, feed2exec?
21:19:05	<anarcat>	it's a feed parser i wrote
21:19:11	<@JAA>	With or without human doublechecking etc.
21:19:22	<anarcat>	you add feeds to a .ini file, and you can run commands on the items, or run python code, or whatever
21:19:25	<@JAA>	At least I assume that's what the question was about.
21:19:29	<anarcat>	it's my rss swiss army knife
21:19:38	<@JAA>	:-)
21:20:04	<anarcat>	i used feed2imap for a while, and then wanted to run commands based on some feeds (e.g. "download the new mp3 or torrent on this feed") and nothing quite did that
21:20:05	<@arkiver>	JAA: why not just do that extraction with #//
21:20:15	<anarcat>	i use it to dump my feed on the wayback machine
21:20:17	<@arkiver>	i'm going to try and get newsgrabber back today with #//
21:20:19	<anarcat>	and check links
21:20:27	<@arkiver>	just need a machine to queue the links to #//
21:20:45	<@arkiver>	anarcat: do you have lists of old links? if yes, we can archive them through #//
21:20:55	<anarcat>	what is #//
21:21:03	<anarcat>	i don't have a list
21:21:13	<@JAA>	How would that work since we need to refetch the RSS feeds (and wherever else NG was getting the articles from) regularly? Deduping etc.
21:21:15	<anarcat>	i don't have any specific rss feed i want to archive, i think this was something pabs brought up
21:21:38	<anarcat>	feed2exec handles http-level caching (e.g. etag and 'last-modified') and has a client-side cache
21:21:38	<@arkiver>	JAA: queue as special item, and attach some randomness
21:21:38	<@JAA>	Unless you want ugly hacks like URL#20210902T21
21:21:53	<@arkiver>	i am completely fine with attaching that ugly hack
21:22:02	<@arkiver>	but might go with rss:URL or so
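
A minimal sketch of the two item-naming ideas floated above (the `URL#20210902T21` suffix and an `rss:URL` prefix), assuming the suffix is an hourly bucket in UTC. Both function names are hypothetical and not part of any ArchiveTeam tooling; how #// would actually interpret such items is not stated in this log.

```python
from datetime import datetime, timezone

def hourly_suffix_item(url, now=None):
    """Append a UTC hour bucket as a fragment, so re-queueing the same
    feed URL in a later hour produces a distinct item name.
    (Hypothetical helper; the bucket size is an assumption.)"""
    now = now or datetime.now(timezone.utc)
    return f"{url}#{now.strftime('%Y%m%dT%H')}"

def rss_prefix_item(url):
    """Mark a URL as a feed to be scanned, using the 'rss:' prefix idea.
    (Hypothetical helper; the exact item syntax is an assumption.)"""
    return f"rss:{url}"
```

With the suffix form, the thing that queues feeds needs no memory of earlier runs: the item name changes on its own once per hour.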
21:22:28	<@arkiver>	we have a lot of capacity in #// now, it's easy to use. just have to push in the items periodically
21:22:58	<@JAA>	Wasn't the idea with relaunching NewsGrabber to make it less hacky rather than more?
21:23:29	<@arkiver>	it'll be more stable
21:24:03	<@JAA>	Until the regular queueing breaks and we don't notice for weeks and fun like that. :-)
21:24:05	<AK>	For the rss feeds I was thinking something pretty much exactly like how feed2exec seems to work. We'll give a box every rss feed we can find, it then pipes them through to urls (somehow). Dunno if that's what arkiver is planning
21:24:28	<@arkiver>	AK: i mean, scanning the RSS feeds on #//
21:24:48	<AK>	Ahh, so we'd just redo the entire feed every x hours or whatever?
21:24:54	<@arkiver>	yeah
21:25:00	<@arkiver>	queue any found URLs back to #//
21:25:03	<@JAA>	Well, dedupe handles the duplicated articles.
21:25:12	<@arkiver>	yep
21:25:19	<@JAA>	Which means we don't need to keep state on the thing that processes the feeds.
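
The stateless approach described above can be sketched roughly as follows: on every run, parse the whole feed and emit every item link, leaving it to downstream deduplication to drop URLs already seen. This assumes a plain RSS 2.0 feed (Atom is not handled); `extract_item_links` and `SAMPLE_FEED` are illustrative names, not code from NewsGrabber, feed2exec, or #//.

```python
import xml.etree.ElementTree as ET

def extract_item_links(rss_xml):
    """Return every <item><link> URL in an RSS 2.0 document, in feed order.
    No state is kept between calls; duplicates across runs are expected
    and are assumed to be removed by a downstream deduplication step."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links

# Made-up feed used only to exercise the function.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example</title>
<item><title>A</title><link>https://example.org/a</link></item>
<item><title>B</title><link> https://example.org/b </link></item>
<item><title>No link</title></item>
</channel></rss>"""

if __name__ == "__main__":
    for url in extract_item_links(SAMPLE_FEED):
        print(url)
```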
21:25:24	<@arkiver>	so
21:25:28	<@JAA>	Although it's worth mentioning that not all of NG was RSS feeds.
21:25:41	<@arkiver>	how about a list of feeds on github to which people can PR and issue new URLs
21:25:58	<@arkiver>	i think we might keep that list in urls-items
21:26:02	<AK>	^^ That would be a really good start
21:26:20	<@arkiver>	going to try to get something done today
21:26:33	<@arkiver>	got some lists already? obviously i'll copy the list from newsgrabber
21:26:58	<nicolas17>	where does stuff fetched by #// go? WBM?
21:27:01	<AK>	Biggest one I like is https://azure.microsoft.com/cdn/en-gb/updates/feed/ lol
21:27:02	<@arkiver>	yes
21:27:23	<@arkiver>	AK: nice, lets do it
21:27:41	<@arkiver>	and #// now gets all outlinks as well, though that is still experimental for now (but it's happening)
21:27:48	<@arkiver>	no
21:27:49	<@arkiver>	sorry
21:27:53	<@arkiver>	i mean all page requisites
21:27:54	<AK>	Once the github file is setup, I'll pr in any more feeds I find
21:28:07	<@arkiver>	perfect
21:28:42	<@arkiver>	JAA: you somewhat fine with this as well, or do you have very strong reservations against using #// to scan?
21:29:22	<AK>	arkiver, ahh, so currently it would grab https://azure.microsoft.com/en-gb/updates/general-availability-azure-devops-august-2021-updates/ and requisites, but not any of the links from within the updates (using azure as an example)
21:29:23	<@JAA>	Nah, fine with me. I just want to see it up and running again. Although yeah, it won't replace NG entirely, so more work will be needed (or other special items in URLs).
21:30:01	<@arkiver>	AK: it archived page requisites
21:30:04	<@arkiver>	archives*
21:30:28	<@arkiver>	or do you want to go deeper?
21:30:37	<anarcat>	one caveat with feed2exec is there's no cache purge right now :/
21:30:49	<nicolas17>	arkiver: I was thinking about the iPhone IPSW files, I think making them IA items would be better than loading them into WBM
21:30:52	<@arkiver>	anarcat: i'm not sure we'll use feed2exec
21:30:53	<anarcat>	probably not hard to implement, but it's not there... so if you have lots of feeds with lots of items, storage will explode
21:31:16	<@arkiver>	nicolas17: are they from official static apple URLs?
21:31:17	<nicolas17>	does WBM even cope with multi-gigabyte files? I remember having problems with that, getting the download interrupted halfway through, but I don't know if the problem was downloading from WBM, or WBM's original fetch from the source
21:32:28	<AK>	arkiver, I don't know. In the azure example, these are short update posts, with links to the main info for each update. I've seen a few rss feeds slightly similar. So I think we'd need to go 1 deeper (Follow links from the first page) to get all the new info
21:32:37	<AK>	Don't know if that's possible or potentially going to cause issues though
21:32:44	<@arkiver>	hmm
21:32:51	<@arkiver>	will figure something out
21:34:04	<nicolas17>	AK: hm that may need something a bit more custom, you might not want to follow all links from the first page including the footer nav bar...
21:34:23	<@arkiver>	im not worried about the footer
21:34:38	<@arkiver>	AK: i'll allow a depth to be set here
21:34:41	<AK>	nicolas17, I think the deduper would mean we'd only archive that stuff once, which probably isn't bad to do
21:34:43	<nicolas17>	I guess if you can deduplicate it's fine
21:34:52	<@arkiver>	like 1, making it go one level deep, 2 for 2 etc.
21:35:01	<AK>	That would be awesome
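
A rough sketch of what a depth setting like the one proposed above could mean: depth 1 follows links from the first page, depth 2 follows links from those pages, and so on, with each URL visited once. The `outlinks` mapping is a hypothetical in-memory stand-in for real fetching and link extraction; nothing here reflects how #// actually implements depth.

```python
from collections import deque

def urls_within_depth(start, outlinks, max_depth):
    """Breadth-first walk returning each URL reachable from `start`
    within `max_depth` link hops, each URL listed once.

    `outlinks` maps a page URL to the list of URLs it links to
    (an assumed, pre-computed structure for illustration only)."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the configured depth
        for link in outlinks.get(url, []):
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order
```

The `seen` set plays the role that deduplication plays in the discussion: shared links such as a footer navigation bar are only returned the first time they are encountered.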
21:35:35	<@arkiver>	getting something in now
21:37:27	<@JAA>	nicolas17: The WBM doesn't mind serving big files from our WARCs, but SPN has a limit of 2 GiB and will truncate files that are bigger than that.
21:37:52	<nicolas17>	I see, maybe those files I saw before came from SPN then
21:37:55	<@JAA>	I.e. don't use SPN for big files.
21:38:30	<@JAA>	Also, I'm interested in learning where you get the IPSW URLs from.
21:40:43	<@arkiver>	again - SPN is almost always overloaded
21:40:58	<@arkiver>	#// is very far from being overloaded
21:41:09		qwertyasdfuiopghjkl joins
21:41:09	<@JAA>	Yeah, that too.
21:41:24	<@arkiver>	nicolas17: so for those apple files, are they official static URLs from apple?
21:41:46	<@arkiver>	AK: lets move this to #//
21:42:10	<nicolas17>	JAA: officially from Apple, there's an XML file fetched by iTunes that links to the last available version for each device, I'm not sure if Apple has any list of previous versions
21:42:48	<@JAA>	Ah nice. Can haz XML URL please?
21:42:49	<nicolas17>	but there's several third-party sites that have URLs (pointing at Apple's CDN, not mirrors) for everything
21:43:42	<@arkiver>	nicolas17: yeah we definitely want to archive those into the wayback machine
21:43:44	<@JAA>	Yeah, I've seen those lists before. The official source is always great to have though and often much harder to find.
21:43:51	<@arkiver>	and i'm fine with duplicating them to IA items as well
21:43:54	<@arkiver>	so you have both :)
21:44:05	<@arkiver>	but first wayback, we can always get them out of these into items
21:44:53	<nicolas17>	https://itunes.apple.com/WebObjects/MZStore.woa/wa/com.apple.jingle.appserver.client.MZITunesClientCheck/version it's a mess because it includes old-style iPods, phone carrier config updates, etc.
21:45:24	<@arkiver>	nice...
21:45:39	<nicolas17>	and afaik only mentions the last version (for a given device)
21:47:12	<nicolas17>	but when an update is released, it will be added to https://ipsw.me/ and https://www.theiphonewiki.com/wiki/Firmware/iPhone/14.x etc within hours :)
21:50:39	<nicolas17>	I think beta downloads are officially only linked from a paid-dev-account apple webpage, but that too gets promptly added to the wiki (lately by me), and once you have the CDN link, the downloads themselves aren't login-walled (except beta 1)
21:50:42	<nicolas17>	eg https://www.theiphonewiki.com/wiki/Beta_Firmware/iPhone/15.x
21:56:33	<nicolas17>	that itunes.apple.com xml seems to be a backwards-compatibility mess, Apple has a more consistent way to deliver "asset updates" now, eg. the ipsws for the new ARM Macs: https://mesu.apple.com/assets/macos/com_apple_macOSIPSW/com_apple_macOSIPSW.xml
21:57:40	<@JAA>	Thanks!
21:59:44		sonick (sonick) joins
22:01:21	<nicolas17>	I know of ~300 URLs in mesu.apple.com, I recently started archiving them into a git repository :) some have never changed, some seem to change every week ("LinguisticData", keyboard autocorrect stuff I think)
22:01:45	<nicolas17>	https://gitlab.com/nicolas17/mesu-archive/-/commit/79f4d982bb
22:02:44	<@JAA>	Very nice.
22:03:09	<@JAA>	I really need to get back to working on pywarc so I can launch some of my project ideas. This would fit nicely into one of those.
22:04:17	<nicolas17>	someone recently sent me a few more URLs he had found himself
22:04:41	<nicolas17>	"com_apple_MobileAsset_CharacterVoices.xml" mickey and minnie telling the time in 35 languages, for the mickey Apple Watch face
22:31:04		Stiletto quits [Ping timeout: 252 seconds]
22:42:08		Stiletto joins
23:24:01	<pabs>	anarcat: working on a patch now
23:31:46	<anarcat>	pabs: aaawesome
23:32:03	<anarcat>	JAA: would be curious to hear what your end result will be for rss feeds
23:35:25	<@arkiver>	just dump extracted URLs in #//, it'll do deduplication by itself
23:35:30	<@arkiver>	not much need for fancy stuff
23:37:19	<pabs>	are there any docs on the SPN2 API?
23:38:04	<pabs>	it doesn't seem to return proper HTTP codes and returns errors in HTML, I'm wondering if there is a way to get JSON responses
23:45:32	<anarcat>	what is #//?
23:50:28	<ArchivalEfforts>	After using grab-site for getting a site, which files should I upload to Internet Archive?
23:50:30	<ArchivalEfforts>	Both the main WARC as well as the -meta one and the .cdx seem relevant, but I can safely ignore the other files without losing information, right?
23:56:13		BlueMaxima joins
