#archiveteam-bs log for 2021-08-05


00:00:52		dm4v quits [Read error: Connection reset by peer]
00:03:41		dm4v joins
00:03:43		dm4v is now authenticated as dm4v
00:03:43		dm4v quits [Changing host]
00:03:43		dm4v (dm4v) joins
00:16:36		j quits [Remote host closed the connection]
00:26:43		Campbell quits [Ping timeout: 252 seconds]
01:03:23		dm4v quits [Ping timeout: 244 seconds]
01:03:59		dm4v joins
01:04:01		dm4v is now authenticated as dm4v
01:04:01		dm4v quits [Changing host]
01:04:01		dm4v (dm4v) joins
01:41:47		nuroten joins
01:45:11	<nuroten>	thuban or anyone saving HK media sites: can we please add the new June 4th Museum website? just launched today, the physical museum was shut down so they switched to online exhibition. it's js-heavy though, so not sure how much can be saved https://8964museum.com/
01:48:50	<nuroten>	(it has archival images, timelines, etc. of the massacre. link's been added to the pad under Hong Kong Alliance in Support of Patriotic Democratic Movements of China, but I thought I'd mention it here in case it gets lost among the other ones already saved)
01:49:26		Arcorann (Arcorann) joins
01:54:31	<nuroten>	thanks!
02:08:30		lennier1 quits [Client Quit]
02:09:03		lennier1 (lennier1) joins
02:10:05		AntiLiberal joins
03:09:08		HP_Archivist (HP_Archivist) joins
03:20:24		Campbell joins
03:41:40		jamesp quits [Client Quit]
03:42:44		qw3rty_ joins
03:46:36		qw3rty__ quits [Ping timeout: 250 seconds]
03:54:55		AntiLiberal quits [Ping timeout: 244 seconds]
04:19:23		Atom quits [Read error: Connection reset by peer]
04:19:35		Atom joins
04:53:34		wizards_ joins
04:56:48		wizards quits [Ping timeout: 250 seconds]
06:48:56		qwertyasdfuiopghjkl quits [Client Quit]
06:58:56		qwertyasdfuiopghjkl joins
07:01:26		HP_Archivist quits [Ping timeout: 244 seconds]
07:10:15		BlueMaxima quits [Remote host closed the connection]
07:10:29		BlueMaxima joins
07:15:01		BlueMaxima quits [Read error: Connection reset by peer]
07:15:14		BlueMaxima joins
07:36:39		Eighty_ joins
07:38:00		Eighty quits [Ping timeout: 250 seconds]
07:53:54	<@OrIdow6>	Anyone have any examples of Google Drive files or folders that are maybe 3 GB - 10 GB?
07:54:14	<@OrIdow6>	Also other such cases, such as folders with millions of files inside them
07:54:19	<@OrIdow6>	Publicly accessible, obviously
07:58:14		knecht420 quits [Read error: Connection reset by peer]
07:58:16		knecht4207 (knecht420) joins
07:58:47		knecht4207 quits [Client Quit]
07:59:39		knecht4207 (knecht420) joins
07:59:47		knecht4207 is now known as knecht420
08:03:45		nimaje joins
08:29:41	<@HCross>	nuroten: hmm.. I wonder if this is a Brozzler affair
08:32:30	<gazorpazorp>	@OrIdow6: https://drive.google.com/drive/folders/1r8I5hpSPCf_9JWECwa6c4E4tQZELd3cx flash game zip files ranging from hundreds of MBs to tens of GB
08:37:26	<@OrIdow6>	Thank you gazorpazorp
08:41:21	<gazorpazorp>	https://drive.google.com/drive/folders/1oCMgJeBc55NuEasPcgwjx2FuPdQd8neu randomly found, different types of files, many nested folders
08:41:28	<gazorpazorp>	np
08:53:59	<gazorpazorp>	@OrIdow6: https://drive.google.com/drive/folders/1TuO-0XyxTVK7Jys2WW0gduRcoQMTpB9C there are lots and lots of files and nested folders. I have no idea how to calculate the total number of files and whether it's near a million or not
08:54:56	<gazorpazorp>	but I can't find a folder with millions of files inside that aren't in other nested folders
09:04:39		qwertyasdfuiopghjkl quits [Client Quit]
09:12:31		Wayward quits [Ping timeout: 252 seconds]
10:03:40		Video quits [Ping timeout: 252 seconds]
10:39:14		Video joins
11:43:07		spirit joins
12:28:03		Iki quits [Read error: Connection reset by peer]
12:31:01		Iki joins
13:13:36		BlueMaxima quits [Client Quit]
14:08:38		spirit quits [Client Quit]
14:14:45		Jonboy3451 quits [Read error: Connection reset by peer]
14:18:22		Jonboy345 joins
14:35:12		Doran is now known as Doranwen
14:53:30		Arcorann quits [Ping timeout: 250 seconds]
15:14:57	<nuroten>	@HCross: Brozzler? what do you mean?
15:16:27	<nuroten>	oh, https://github.com/internetarchive/brozzler ?
15:19:00	<nuroten>	yeah, I'm not sure, kind of imagining something that emulates browser clicking all the interactive elements and caching as it goes, if such a thing exists
15:22:22	<nuroten>	description sounds useful, link extraction
15:28:09		AntiLiberal joins
15:35:24		qwertyasdfuiopghjkl joins
15:39:08		AntiLiberal quits [Ping timeout: 244 seconds]
16:05:29		qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
16:08:36		qwertyasdfuiopghjkl joins
16:18:54		spirit joins
16:49:11	<Iki>	Is there a good way to archive a site a second time without huge overlaps? I'm thinking either: 1) have the original warc on-hand and don't add old pages and/or use the 'revisit' option, or 2) do a more limited comparison, such as comparing old archives against the current sitemap
16:49:45	<Iki>	Just curious if there is a tool that makes this straightforward. It's easily scriptable (such as by comparing a sitemap against IA's CDX output), but scriptable is not scalable
16:52:39	<Iki>	Okay. Gonna share options as I find them. Please let me know if any complications to them are known. I'll tag these thoughts with the keyword "repeatscrape"
16:53:32	<Iki>	repeatscrape 1: Looks like wpull allows use of a --database argument to track previously-visited URLs. Pretty good! Though it doesn't compare against the contents to check for changes
16:56:44	<Iki>	repeatscrape 2: Doesn't look like wget has wpull's database option, though it might be possible to use the --warc-dedup option and --warc-cdx options to basically do the same thing
17:00:04	<Iki>	repeatscrape 3: grab-site can take all wpull options via --wpull-arg. In addition, looks like it includes the dupespotter plugin, which maybe does the kind of comparison I'm looking for?
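Editorial sketch of the "compare a sitemap against IA's CDX output" idea Iki mentions above. This is only an illustration under stated assumptions, not a tool anyone in the channel used: the function names are hypothetical, the sitemap is assumed to be a plain sitemaps.org `urlset`, and the CDX text is assumed to be in the space-separated default layout where the third field is the original URL. Matching is by exact string, so scheme or trailing-slash differences would show up as "not archived".

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL in a sitemap, in document order."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def cdx_original_urls(cdx_text: str) -> set[str]:
    """Collect original URLs from space-separated CDX lines.

    Assumes the field order: urlkey timestamp original mimetype ...
    Lines with fewer than three fields are skipped.
    """
    urls = set()
    for line in cdx_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            urls.add(fields[2])
    return urls


def not_yet_archived(sitemap_xml: str, cdx_text: str) -> list[str]:
    """Sitemap URLs that have no exact match among the CDX originals."""
    seen = cdx_original_urls(cdx_text)
    return [url for url in sitemap_urls(sitemap_xml) if url not in seen]
```

The resulting list could be fed to a crawler as a URL list for the second pass. It only finds new URLs; it says nothing about pages whose content changed at an already-archived URL.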
18:28:11	<@JAA>	Iki: It's a hard problem to solve. The most accurate solution is to recrawl the entire site and write revisit records as appropriate. But due to dynamically generated sites (e.g. session IDs, timestamps), you'll end up with a lot of duplication anyway. Anything else would have to be specific to a particular site and make use of its structure since you'd have to know which URLs to refetch (e.g.
18:28:17	<@JAA>	sitemaps, article lists) and which not (e.g. articles that you previously covered).
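Editorial sketch of the point JAA makes above about revisit records and dynamically generated pages. It is a simplified illustration, not how any particular crawler implements deduplication: the function names and the `previous` mapping (URL to last known digest) are hypothetical, and the digest is assumed to be a Base32-encoded SHA-1 of the payload.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Base32-encoded SHA-1 of the payload, prefixed with the algorithm name."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def record_type(url: str, payload: bytes, previous: dict[str, str]) -> str:
    """Decide whether a refetched URL would be stored in full or as a revisit.

    `previous` is a hypothetical mapping of URL -> digest from the earlier crawl.
    An identical payload yields "revisit"; anything else yields "response".
    """
    if previous.get(url) == payload_digest(payload):
        return "revisit"
    return "response"


# Earlier crawl: one static page and one page that embeds a timestamp.
previous = {
    "https://example.org/static": payload_digest(b"<p>hello</p>"),
    "https://example.org/dynamic": payload_digest(b"<p>generated 2021-08-04</p>"),
}

# Recrawl: the static page deduplicates, the dynamic one does not,
# even though only the embedded date changed.
static_result = record_type("https://example.org/static", b"<p>hello</p>", previous)
dynamic_result = record_type(
    "https://example.org/dynamic", b"<p>generated 2021-08-05</p>", previous
)
```

A single changed byte (a session ID or timestamp) changes the digest, which is why a full recrawl with digest-based deduplication still stores many near-duplicate pages.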
18:45:52		HP_Archivist (HP_Archivist) joins
19:23:22		qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
19:23:57		qwertyasdfuiopghjkl joins
19:26:29		lennier1 quits [Client Quit]
19:27:22		lennier1 (lennier1) joins
19:35:12		sec^nd quits [Remote host closed the connection]
19:41:52		sec^nd (second) joins
20:01:16	<Frogging101>	[17:04:05] <Frogging101> https://www.misterpoll.com/directory/religion/pg/7
20:01:18	<Frogging101>	[17:04:10] <Frogging101> Fuck page 7 in particular, I guess.
20:01:40	<Frogging101>	posted that in -dev yesterday by mistake, instead of here. oops
20:15:54		@Fusl_ quits [Ping timeout: 250 seconds]
20:16:20		jonty quits [Ping timeout: 250 seconds]
20:20:37		Megame (Megame) joins
20:32:27		Fusl_ (Fusl) joins
20:32:27		@ChanServ sets mode: +o Fusl_
20:32:40		jonty (jonty) joins
20:32:40		Stilett0 joins
20:34:31		Stiletto quits [Ping timeout: 252 seconds]
20:40:52		wolfin (wolfin) joins
20:53:06		djsrv_ quits [Quit: ZNC 1.8.2 - https://znc.in]
20:54:35		djsrv (djsrv) joins
21:22:12		nuroten quits [Ping timeout: 244 seconds]
21:36:16		spirit quits [Client Quit]
21:42:47	<JensRex>	How does the dagensblaeser.net crawl manage to so far get 31GB data and 450K requests? There isn't that much content there.
21:48:08	<@JAA>	Off-site links, probably.
21:48:48	<JensRex>	Looks like it's just downloading the same js and css over and over again right now.
21:53:54	<@JAA>	Well, or that. Shitty site that doesn't know how to use caching headers.
21:58:03		Iki quits [Read error: Connection reset by peer]
22:00:41		Jonboy345 quits [Remote host closed the connection]
22:01:03		Jonboy345 joins
22:46:41		Megame quits [Client Quit]
22:59:26		SenileOvaltine joins
23:19:09		VerifiedJ quits [Quit: The Lounge - https://thelounge.chat]
23:20:09		VerifiedJ (VerifiedJ) joins
23:21:48		VerifiedJ quits [Client Quit]
23:22:36		VerifiedJ (VerifiedJ) joins
23:25:59		qwertyasdfuiopghjkl quits [Client Quit]
23:36:48		Stilett0 is now known as Stiletto
23:48:22		Arcorann (Arcorann) joins
