00:02:39<lennier1>TheTechRobo: The library is using requests: https://github.com/digitalmethodsinitiative/itunes-app-scraper/blob/master/itunes_app_scraper/scraper.py
00:02:57<TheTechRobo>lennier1: So yeah, then those environment variables will work.
00:03:49<TheTechRobo>Keep in mind that disabling SSL verification, as CURL_CA_BUNDLE="" does (required for warcprox unless you modify the trusted CA store), will spam your terminal with "urllib3: Making insecure connection to localhost. Blablablabla."
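A minimal sketch of pointing requests-based code (like that scraper) at warcprox via environment variables, and silencing the warning spam mentioned above. The localhost:8000 proxy address is an assumption; adjust it to your warcprox setup:

```python
import os

import requests
import urllib3

# Assumption: warcprox is listening on localhost:8000.
# An empty CURL_CA_BUNDLE makes requests skip certificate verification
# (needed for warcprox's man-in-the-middle certificates); set these
# before the library makes any requests.
os.environ["CURL_CA_BUNDLE"] = ""
os.environ["HTTP_PROXY"] = "http://localhost:8000"
os.environ["HTTPS_PROXY"] = "http://localhost:8000"

# Silence the per-request "insecure connection" warnings.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
```

Since requests reads these variables at request time (when `trust_env` is on, the default), no changes to the scraper's own code are needed.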
00:04:32<lennier1>The bottleneck seems to be the request for similar apps. I get timed out for a while if I don't delay about a second between calls to get_similar_app_ids_for_app.
00:05:59<lennier1>Alternatively you could brute force it, but only about one in a thousand IDs in the used range are valid. Unless there's some way to predict which are used.
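A hedged sketch of that brute-force check against the public iTunes lookup endpoint, with the ~1 second delay discussed above. The helper names `is_valid_app_id` and `scan` are hypothetical, and the delay value is an assumption based on the rate limiting observed:

```python
import time

import requests

LOOKUP_URL = "https://itunes.apple.com/lookup"  # public iTunes lookup endpoint


def is_valid_app_id(app_id: int, session) -> bool:
    """Check whether an ID resolves to a listed app; only roughly one
    in a thousand IDs in the used range does."""
    resp = session.get(LOOKUP_URL, params={"id": app_id}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("resultCount", 0) > 0


def scan(ids, delay=1.0):
    """Yield valid IDs, sleeping ~1s between requests to avoid the
    rate-limit timeouts described above."""
    with requests.Session() as session:
        for app_id in ids:
            if is_valid_app_id(app_id, session):
                yield app_id
            time.sleep(delay)
```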
00:18:24bonga quits [Ping timeout: 252 seconds]
00:19:31bonga joins
00:24:31<lennier1>How do I use that sitemap? I downloaded https://apps.apple.com/sitemaps_apps_85_1_20220509.xml.gz and extracted the xml file. I figured it would be readable as text, but doesn't seem to be. Edge says it has an encoding error.
00:31:16driib08353670 (driib) joins
00:33:28<thuban>lennier1: file looks fine to me--extracts without issue, result is pure ascii xml.
00:34:12<thuban>is yours 23528629 bytes unzipped? what did you use to extract it?
00:34:21driib0835367 quits [Ping timeout: 252 seconds]
00:34:22driib08353670 is now known as driib0835367
00:40:04<lennier1>That's odd. Extracted with 7zip. sitemaps_apps_85_1_20220509.xml is 414,082 bytes.
00:41:15<lennier1>sitemaps_apps_85_1_20220509.xml.gz is 335,329 bytes. SHA1 92B44E456A3D05B89BB47191DD46E3C208611DC0
00:42:37<Jake>a9d2324c7bf34222824ee0f45200362f5811f008 *sitemaps_apps_85_1_20220509.xml
00:42:37<Jake>ce80bab6dd6b5af7a6213db77b5abe9bd12a8b18 *sitemaps_apps_85_1_20220509.xml.gz
00:43:11<thuban>ditto Jake
00:43:52<thuban>the gz file (_not_ the extracted xml) is 414082 bytes
00:44:19<thuban>i suggest double-checking whether you've invoked 7zip correctly / just using gunzip
00:48:24<thuban>(`7z e sitemaps_apps_85_1_20220509.xml.gz` works for me too)
00:50:08<lennier1>Wait, what was the size of the file you originally downloaded? I started with 335329 bytes and AFTER using 7zip it's 414082 bytes. Is that file still in a compressed format?
00:51:48<@JAA>The file at https://apps.apple.com/sitemaps_apps_85_1_20220509.xml.gz is 414082 bytes.
00:52:31<thuban>& maybe; see what `file` says or try extracting it again. (is it possible you downloaded it with some tool that got confused by gzip transfer-encoding?)
00:52:56<ThreeHM>I end up with 334,939 bytes if I gzip the .gz file again, so it might be compressed twice
00:53:06<lennier1>Yes, it's compressed twice.
00:53:34<@JAA>There is no TE or whatever, it sends it as a plain application/octet-stream.
00:53:50<lennier1>If I change the .xml extension of the extracted file to .gz, I can get the final .xml file (with no extension).
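For files that may have picked up an extra gzip layer this way, a small sketch that strips layers until the gzip magic bytes are gone, saving the rename dance:

```python
import gzip


def decompress_fully(data: bytes) -> bytes:
    """Strip gzip layers until the payload no longer starts with the
    gzip magic bytes (0x1f 0x8b) -- handles double-compressed downloads."""
    while data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    return data


# Example: a payload compressed twice round-trips back to the original,
# as with the sitemap Firefox saved double-compressed.
payload = b"<urlset></urlset>"
assert decompress_fully(gzip.compress(gzip.compress(payload))) == payload
```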
00:53:55<@JAA>At least when accessing it with curl. Maybe it does something weird if you use a browser?
00:53:57<thuban>er, *content-encoding
00:54:17<TheTechRobo>JAA: Nah, browser is dumb too.
00:54:20<@JAA>No CE either. Just chunked TE.
00:55:25<thuban>lol, weird. what _did_ you download it with, lennier1?
00:55:39<lennier1>Firefox
00:55:57<TheTechRobo>same
00:57:11AlsoHP_Archivist joins
00:58:30<ThreeHM>Chrome seems to handle it correctly
00:59:53<thuban>TheTechRobo: on windows? (lennier1, i assume you're on windows since you mentioned edge)
01:00:13<TheTechRobo>thuban: Firefox Developer Edition on Debian
01:00:21<TheTechRobo>I don't use Windows
01:00:54HP_Archivist quits [Ping timeout: 265 seconds]
01:01:02<lennier1>Windows, yes.
01:02:54dm4v quits [Client Quit]
01:02:55<lennier1>But yeah, there are a lot of app links in that file. I wonder if the sitemap includes all publicly listed apps.
01:03:02dm4v joins
01:03:04dm4v quits [Changing host]
01:03:04dm4v (dm4v) joins
01:06:42<ThreeHM>Found a bug report for this: https://bugzilla.mozilla.org/show_bug.cgi?id=610679 - Opened 12 years ago!
01:07:27<Jake>wow
01:08:05<thuban>JAA: content-encoding is gzip with `curl --compressed`
01:08:58<thuban>(and ofc in the browser)
01:17:00<@JAA>thuban: Yes, and it's Apple that recompresses it in that case.
01:17:14<@JAA>I also get the 335329 file with that.
01:20:03<@JAA>Which is actually the correct behaviour, I think.
01:21:12<@JAA>`Accept-Encoding: gzip` asks the server to send the requested resource in gzip-compressed form, so it does that. The fact that the resource is already compressed doesn't really come into play there.
01:22:38<@JAA>Doesn't make it any less confusing though. And so many times CE is abused when it's actually TE compression because many clients don't even support the latter.
01:26:23<@JAA>(`curl --compressed` also decompresses it again as it's supposed to.)
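The layering can be illustrated locally: the resource on the server is already a .gz file, and `Content-Encoding: gzip` wraps one more layer around it on the wire; a correct client removes only that outer layer. A sketch using `gzip.compress`/`gzip.decompress` to stand in for the server and client:

```python
import gzip

# The resource on the server is already a gzip file.
sitemap_gz = gzip.compress(b"<urlset></urlset>")

# With Accept-Encoding: gzip, the server may add a Content-Encoding layer:
on_the_wire = gzip.compress(sitemap_gz)

# A correct client (curl --compressed) undoes only the CE layer...
saved_to_disk = gzip.decompress(on_the_wire)
assert saved_to_disk == sitemap_gz  # ...leaving the resource's own gzip layer

# A client that skips that step (the Firefox bug above) would save the
# double-compressed on_the_wire bytes instead.
```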
01:34:14<thuban>i mean, arguably apple's server is incorrect to recompress a gzip (since 'Accept-Encoding: gzip' doesn't specifically forbid the 'identity' value)
01:34:17<thuban>but yes, curl handles it correctly and ff does not
01:34:40<@JAA>It's not incorrect, but it's certainly not optimal, yeah.
01:35:09<thuban>'correct' as in 'The Right Thing'
01:35:29<thuban>anyway, apparently this behavior was a workaround for some combination of bugs in apache and apache being easy to misconfigure. transfer protocols were a mistake :(
01:35:57<Jake>I feel like I've seen this brought up here before, haha.
01:36:35<@JAA>That's where 90+% of HTTP's quirks come from, browsers working around server bugs instead of throwing a brick at the people running broken servers.
01:44:58AlsoHP_Archivist quits [Client Quit]
02:16:17FaraiNL joins
02:19:27hackbug quits [Quit: Lost terminal]
02:19:33hackbug (hackbug) joins
02:20:44<h2ibot>Usernam edited List of websites excluded from the Wayback Machine (+27): https://wiki.archiveteam.org/?diff=48596&oldid=48547
02:21:40<FaraiNL>Hi. Quick question, I hope this is the right channel. I've created some warc files using grab-site which I would like to upload to the internet archive. According to the faq I have to use the subject keyword "archiveteam" (check) and to "let us know". How do I let the team know? Here via IRC? Also, do I upload just the .warc.gz file? Or also the
02:21:40<FaraiNL>meta-warc.gz? And how do I set the mediatype to web?
02:39:59bonga quits [Ping timeout: 265 seconds]
02:45:09bonga joins
02:56:55FaraiNL quits [Ping timeout: 266 seconds]
02:56:56Discant joins
02:57:21DiscantX quits [Ping timeout: 252 seconds]
03:00:45tbc1887 quits [Remote host closed the connection]
03:00:45Discant quits [Remote host closed the connection]
03:00:45TheTechRobo quits [Remote host closed the connection]
03:00:45gazorpazorp quits [Remote host closed the connection]
03:00:45driib0835367 quits [Client Quit]
03:00:45driib0835367 (driib) joins
03:00:46Discant joins
03:00:47TheTechRobo joins
03:01:12tbc1887 (tbc1887) joins
03:01:21gazorpazorp (gazorpazorp) joins
03:01:44gazorpazorp quits [Remote host closed the connection]
03:02:50gazorpazorp (gazorpazorp) joins
03:03:14gazorpazorp quits [Remote host closed the connection]
03:03:31gazorpazorp (gazorpazorp) joins
03:11:51<Jake>Just as a quick tip, uploads from normal users aren't going into the Wayback Machine anymore. Only specific whitelisted users can directly upload to be included in the Wayback Machine.
03:14:13<Jake>Oh, they left. :(
03:21:08bonga quits [Read error: Connection reset by peer]
03:22:09bonga joins
04:11:04Pingerfo- quits [Ping timeout: 252 seconds]
04:11:08Pingerfowder (Pingerfowder) joins
04:26:50pabs quits [Remote host closed the connection]
04:28:43pabs (pabs) joins
04:43:35Arcorann quits [Ping timeout: 265 seconds]
05:06:57lennier1 quits [Client Quit]
05:08:40lennier1 (lennier1) joins
05:11:52tbc1887 quits [Read error: Connection reset by peer]
05:22:52monoxane4 quits [Ping timeout: 265 seconds]
06:13:08TheTechRobo quits [Ping timeout: 265 seconds]
06:17:59TheTechRobo (TheTechRobo) joins
06:18:22TheTechRobo quits [Remote host closed the connection]
06:18:43TheTechRobo (TheTechRobo) joins
06:42:46monoxane4 (monoxane) joins
06:44:01BlueMaxima quits [Read error: Connection reset by peer]
06:54:20Arcorann (Arcorann) joins
09:08:35@dxrt quits [Ping timeout: 265 seconds]
10:50:21Cookie joins
10:52:43<Cookie>Hi. I made a post here: https://archive.org/post/1124209/can-i-find-out-how-much-of-a-website-has-already-been-archived
10:54:54<Cookie>I'd like to know how to find out, for any given website -- if any "whole site" archives/sweeps have been done. I notice there are wiki pages with this info for many sites, so that's helpful. But I'm interested in a more automated, methodical way of finding out this information without relying on a human updating a wiki page.
10:55:41<Sanqui>Cookie, in terms of Archive Team's projects, e.g. for fanfiction.net we try to upkeep a wikipage such as https://wiki.archiveteam.org/index.php/FanFiction.Net
10:56:48<Cookie>That is awesome! Anyone can scrape a website and create a WARC file though, right? Any they wouldn't necessarily create or update the relevant wiki page...
10:56:56<Sanqui>that's right
10:57:15<Sanqui>of course, when presented with a WARC file, it is impossible to simply determine whether it is "complete or not"
10:57:25<Sanqui>that's why it's important to have good metadata and record keeping as well
10:57:32<Cookie>Yes I have realised that after thinking about it
10:57:51<Cookie>But suppose there isn't good metadata (which is likely!)
10:58:06<Cookie>But they have uploaded it to the archive anyway
10:58:33<Sanqui>in general, archivebot jobs tend to be "complete" (unless aborted, run with aggressive ignores, hit high error rates, etc.)
10:58:38<Sanqui>and they are searchable here https://archive.fart.website/archivebot/viewer/
10:59:11G4te_Keep3r joins
10:59:17<Sanqui>so that's another place to check
11:01:30<Cookie>Ooh I hadn't seen that yet.
11:01:38<Cookie>So for example: https://archive.fart.website/archivebot/viewer/domain/www.fanfiction.net
11:02:03<Cookie>The first result looks "big" (2018). The others look "small"
11:02:44<Cookie>Can these results be related to the collections here: https://archive.org/details/archiveteam_fanfiction
11:03:10<Cookie>Oh, no those say 2012
11:03:11<Sanqui>Nope, ArchiveBot gets its own collection for all crawls
11:03:33<Sanqui>in fact, the warcs are partitioned, and uploaded in separate items along with other crawls running on the same machine
11:03:51<Sanqui>also, I suspect the first job is incomplete due to a crash. Mainly because the last segment (00070) was uploaded 7 months after the previous one
11:04:07<Sanqui>it might be possible to determine what exactly went on from irc logs, but I don't have time for that atm :)
11:05:09<Cookie>Okay is this the list of archivebot collections? https://archive.org/details/archivebot?&sort=-publicdate
11:05:18<Sanqui>aye
11:06:59<Cookie>What is a "GO Pack" ? Is that just what every item is called? And I have to examine the metadata to find out what's in it?
11:07:58<Sanqui>I believe the names are variable based on the pipeline that uploaded them (there are over 50 machines running archivebot instances)
11:08:15<Sanqui>and yeah, that's why the archivebot viewer exists, to make it easier to search through these
11:08:29<Cookie>If I search that collection for "fanfiction.net" it doesn't find anything: https://archive.org/details/archivebot?query=fanfiction.net&sort=-publicdate
11:09:24<Sanqui>yeah, the collection's not really intended for manual browsing, but the data is all there. the viewer shows you which jobs exist for a given domain, and which items contain the data from that job.
11:09:44<Cookie>I can't see fanfiction.net here: https://archive.fart.website/archivebot/viewer/domains/f
11:10:17<Sanqui>that's because it's www.fanfiction.net
11:11:08<Cookie>Ohh haha. Where do I log a suggestion to remove the "www" from sites in this index? ;-)
11:11:49<Cookie>Anyway the search found it alright
11:11:53<Sanqui>https://github.com/ArchiveTeam/ArchiveBot/issues possibly. I'm not even sure the viewer has its own repo
11:12:32<Sanqui>there's a lot of issues with the whole setup, but we're all volunteers here, it's difficult to overhaul the way things are built already
11:12:38<Cookie>Thanks for explaining this
11:12:51<Sanqui>np
11:13:18<Cookie>I'd still like to relate the results at https://archive.fart.website/archivebot/viewer/domain/www.fanfiction.net to the collections I've listed here: https://archive.org/post/1124209/can-i-find-out-how-much-of-a-website-has-already-been-archived
11:13:49<Cookie>In order to answer the question "is it efficient for me to archive fanfiction.net again, now?"
11:13:55<Sanqui>https://web.archive.org/web/collections/2022*/fanfiction.net
11:14:07<Cookie>also, what about incremental backups?
11:14:12<Sanqui>if you see an "ArchiveBot" collection here, then it's done by ArchiveBot
11:14:22<Sanqui>otherwise, none of the collections are related
11:14:46<Cookie>Okay, and if it's done by ArchiveBot then that means it's "complete"
11:14:47<Cookie>?
11:15:15<Sanqui>not really, actually, because it's possible ArchiveBot saw a link *to* fanfiction.net while archiving another website, and grabbed just one page
11:15:36<Sanqui>the only thing by archivebot that's close to complete is this job from 2018 https://archive.fart.website/archivebot/viewer/job/1bkfa
11:16:20<Sanqui>I would answer your question with a YES because there has been no major and documented effort to archive fanfiction.net since 2018
11:16:36<Cookie>Okay :-)
11:16:38<Sanqui>as long as you use best practices, update the wikipage, etc.
11:17:14<Cookie>So does this mean I should point the ArchiveBot at it? Or is it too large and I should explore different crawling methods?
11:17:31<Sanqui>archivebot with no offsite links may be sufficient
11:18:00<Cookie>Thank you!!
11:18:47<Sanqui>somebody (like me I suppose) will have to run the job for you, but you can help monitor it :)
11:19:09<Cookie>Sure. I'll read up on how this all works
11:19:27<Cookie>And I guess I will ask the archive.org people how their existing fanfiction.net collections can be managed
11:20:48<Sanqui>feel free to poke or pm me in the future
11:40:13fuzzy8021 leaves
11:41:11fuzzy8021 (fuzzy8021) joins
12:06:56qwertyasdfuiopghjkl joins
12:18:34Cookie quits [Remote host closed the connection]
12:50:26Discant quits [Ping timeout: 265 seconds]
13:01:25Arcorann quits [Ping timeout: 265 seconds]
13:14:11Cookie joins
13:16:31<Cookie>This is probably a dumb question, but if the data in "go" archives wasn't grouped by date crawled, but instead by name of website, wouldn't it be possible to use archive.org's native category and metadata system to search for a specific website archive?
13:18:23<Cookie>i.e. instead of this: https://archive.org/details/archiveteam_archivebot_go_20181027110001/ -- which contains sections of several different website archives, you might have: https://archive.org/details/archiveteam_archivebot_go_201810_fanfiction.net/ - which only contains a portion of one website. Additional crawls would either update that one,
13:18:23<Cookie>or add more archives in the same collection (or topic or tag or whatever)
13:27:36Arcorann (Arcorann) joins
13:36:33<Cookie>It would undoubtedly take more time and effort than is available. I was just wondering if there is anything specific preventing this from happening.
14:00:52Arcorann quits [Ping timeout: 265 seconds]
14:12:27Cookie quits [Remote host closed the connection]
14:31:43<bonga>https://www.androidcentral.com/apps-software/google-play-store-to-get-rid-of-nearly-900000-abandoned-apps
14:31:54<bonga>Google is officially removing the apps
14:32:15<bonga>But we don't need to worry like the app store
14:32:26<bonga>They are archived by the APK sites
14:32:39<bonga>Reviews are the only unarchived thing
14:42:24Cookie joins
15:10:06HP_Archivist (HP_Archivist) joins
15:12:23AK quits [Quit: AK]
15:13:43AK (AK) joins
15:27:40<@JAA>Cookie: We tried that in the past, and it didn't go well. The problem is that IA items have a size limit, and AB jobs can get *much* bigger than that.
15:28:25<@JAA>So then you need to coordinate items for the same domain between pipelines, which gets really fun...
15:28:49<@JAA>Ideally, IA would allow searching a collection for filenames, and then this would work automatically.
15:29:35<@JAA>Also, the grouping isn't by pipeline. Pipelines upload to a staging server, which groups files together until a set exceeds some size threshold, and then that gets uploaded.
15:30:14<@JAA>So it's more of a time slice of all AB pipelines, although that isn't entirely accurate either.
15:32:15march_happy (march_happy) joins
15:35:48bonga quits [Ping timeout: 252 seconds]
15:36:22bonga joins
15:37:27spirit joins
15:39:14<Cookie>JAA: Thanks for putting my mind at rest
15:39:57<Cookie>Filenames might work... But a metadata field specifically for "website name" would be the best.
15:42:57Cookie quits [Remote host closed the connection]
15:52:30Cookie joins
15:53:12<@JAA>win-raid.com is redirecting to the new Discourse forum since about 2022-05-13 23:50. I finally get to stop my continuous archival of that. :-)
16:20:20Cookie quits [Remote host closed the connection]
16:32:11bonga quits [Read error: Connection reset by peer]
16:33:56bonga joins
17:11:30HP_Archivist quits [Ping timeout: 252 seconds]
17:12:55HP_Archivist (HP_Archivist) joins
17:13:13HP_Archivist quits [Remote host closed the connection]
17:14:25HP_Archivist (HP_Archivist) joins
17:14:43HP_Archivist quits [Remote host closed the connection]
17:15:55HP_Archivist (HP_Archivist) joins
17:16:13HP_Archivist quits [Remote host closed the connection]
17:17:25HP_Archivist (HP_Archivist) joins
17:17:43HP_Archivist quits [Remote host closed the connection]
17:18:11HP_Archivist (HP_Archivist) joins
17:46:15Holder joins
17:46:35Holder quits [Remote host closed the connection]
18:08:29driib08353671 (driib) joins
18:10:45driib0835367 quits [Ping timeout: 265 seconds]
18:10:45driib08353671 quits [Read error: Connection reset by peer]
18:13:23driib0835367 (driib) joins
18:41:36<thuban>ah, they've left. but we've discussed re-archiving ffn on a few occasions; it is not archivebot-suitable
18:42:03<thuban>previously: https://hackint.logs.kiska.pw/archiveteam-ot/20210608#c289987, https://hackint.logs.kiska.pw/archiveteam-bs/20210523#c288184, https://hackint.logs.kiska.pw/archiveteam-bs/20210524
18:42:44<thuban>(last has details of site structure)
18:43:21<@JAA>Last time I checked (earlier this year), all story pages were behind Buttflare Attack mode, not just with a ridiculous rate limit but entirely.
18:44:01<thuban>oof
18:45:42thuban contemplates writing a chromebot that doesn't suck...
18:47:13<thuban>Sanqui, weren't you doing something with puppeteer and warcprox?
18:48:20HP_Archivist quits [Client Quit]
18:53:15bonga quits [Ping timeout: 252 seconds]
18:53:39bonga joins
19:05:47michaelblob (michaelblob) joins
19:11:20michaelblob quits [Client Quit]
19:11:54michaelblob (michaelblob) joins
19:14:27michaelblob quits [Client Quit]
19:14:46michaelblob (michaelblob) joins
19:54:17Iki joins
19:55:13DiscantX joins
20:27:51DiscantX quits [Ping timeout: 252 seconds]
20:57:38Nulo quits [Ping timeout: 265 seconds]
21:00:56bonga quits [Read error: Connection reset by peer]
21:01:24bonga joins
21:02:01<Doranwen>JAA: yes, I can confirm that ff.n is very much behind the worst mode ever - and it's gotten even worse recently
21:02:59<Doranwen>bad enough that even browsing it, it'll sit on a "checking you're human" page for ages - I've resorted to pasting the links of any story I want to read into FFDL because at least I don't have to watch it think, and I can check over and solve a captcha or two as necessary
21:03:25<@JAA>Doranwen: Oh, there's still room for it to get worse, but I won't mention it here in case they're lurking here and looking for ideas how to do so.
21:03:38<Doranwen>LOL
21:04:27<Doranwen>well, they should realize they've gotten it a little counterproductive at this point - I mean, I *used* to actually read the fics on the site and only d/l them when I was done - now it's a pain to do anything on the site
21:11:02bonga quits [Ping timeout: 265 seconds]
21:11:54bonga joins
21:28:49lennier2 joins
21:31:28lennier1 quits [Ping timeout: 265 seconds]
21:31:31lennier2 is now known as lennier1
22:10:47<bonga>https://www.androidcentral.com/apps-software/google-play-store-to-get-rid-of-nearly-900000-abandoned-apps
22:11:37<bonga>Google is removing "abandoned" apps from the play store. Let's archive them just in case the APK sites did not.
22:12:03<bonga>Let's especially archive html pages and reviews as csv of play store apps over 2 years old
22:45:18BlueMaxima joins
23:13:49Arcorann (Arcorann) joins
23:19:49Mateon1 quits [Remote host closed the connection]
23:20:00Mateon1 joins
23:32:39march_happy quits [Ping timeout: 252 seconds]