00:00:52dm4v quits [Read error: Connection reset by peer]
00:03:41dm4v joins
00:03:43dm4v quits [Changing host]
00:03:43dm4v (dm4v) joins
00:16:36j quits [Remote host closed the connection]
00:26:43Campbell quits [Ping timeout: 252 seconds]
01:03:23dm4v quits [Ping timeout: 244 seconds]
01:03:59dm4v joins
01:04:01dm4v quits [Changing host]
01:04:01dm4v (dm4v) joins
01:41:47nuroten joins
01:45:11<nuroten>thuban or anyone saving HK media sites: can we please add the new June 4th Museum website? just launched today, the physical museum was shut down so they switched to online exhibition. it's js-heavy though, so not sure how much can be saved https://8964museum.com/
01:48:50<nuroten>(it has archival images, timelines, etc. of the massacre. link's been added to the pad under Hong Kong Alliance in Support of Patriotic Democratic Movements of China, but I thought I'd mention it here in case it gets lost among the other ones already saved)
01:49:26Arcorann (Arcorann) joins
01:54:31<nuroten>thanks!
02:08:30lennier1 quits [Client Quit]
02:09:03lennier1 (lennier1) joins
02:10:05AntiLiberal joins
03:09:08HP_Archivist (HP_Archivist) joins
03:20:24Campbell joins
03:41:40jamesp quits [Client Quit]
03:42:44qw3rty_ joins
03:46:36qw3rty__ quits [Ping timeout: 250 seconds]
03:54:55AntiLiberal quits [Ping timeout: 244 seconds]
04:19:23Atom quits [Read error: Connection reset by peer]
04:19:35Atom joins
04:53:34wizards_ joins
04:56:48wizards quits [Ping timeout: 250 seconds]
06:48:56qwertyasdfuiopghjkl quits [Client Quit]
06:58:56qwertyasdfuiopghjkl joins
07:01:26HP_Archivist quits [Ping timeout: 244 seconds]
07:10:15BlueMaxima quits [Remote host closed the connection]
07:10:29BlueMaxima joins
07:15:01BlueMaxima quits [Read error: Connection reset by peer]
07:15:14BlueMaxima joins
07:36:39Eighty_ joins
07:38:00Eighty quits [Ping timeout: 250 seconds]
07:53:54<@OrIdow6>Anyone have any examples of Google Drive files or folders that are maybe 3 GB - 10 GB?
07:54:14<@OrIdow6>Also other such cases, such as folders with millions of files inside them
07:54:19<@OrIdow6>Publicly accessible, obviously
07:58:14knecht420 quits [Read error: Connection reset by peer]
07:58:16knecht4207 (knecht420) joins
07:58:47knecht4207 quits [Client Quit]
07:59:39knecht4207 (knecht420) joins
07:59:47knecht4207 is now known as knecht420
08:03:45nimaje joins
08:29:41<@HCross>nuroten: hmm.. I wonder if this is a Brozzler affair
08:32:30<gazorpazorp>@OrIdow6: https://drive.google.com/drive/folders/1r8I5hpSPCf_9JWECwa6c4E4tQZELd3cx flash game zip files ranging from hundreds of MBs to tens of GB
08:37:26<@OrIdow6>Thank you gazorpazorp
08:41:21<gazorpazorp>https://drive.google.com/drive/folders/1oCMgJeBc55NuEasPcgwjx2FuPdQd8neu randomly found, different types of files, many nested folders
08:41:28<gazorpazorp>np
08:53:59<gazorpazorp>@OrIdow6: https://drive.google.com/drive/folders/1TuO-0XyxTVK7Jys2WW0gduRcoQMTpB9C there are lots and lots of files and nested folders. I have no idea how to calculate the total number of files and whether it's near a million or not
08:54:56<gazorpazorp>but I can't find a folder with millions of files inside that aren't in other nested folders
09:04:39qwertyasdfuiopghjkl quits [Client Quit]
09:12:31Wayward quits [Ping timeout: 252 seconds]
10:03:40Video quits [Ping timeout: 252 seconds]
10:39:14Video joins
11:43:07spirit joins
12:28:03Iki quits [Read error: Connection reset by peer]
12:31:01Iki joins
13:13:36BlueMaxima quits [Client Quit]
14:08:38spirit quits [Client Quit]
14:14:45Jonboy3451 quits [Read error: Connection reset by peer]
14:18:22Jonboy345 joins
14:35:12Doran is now known as Doranwen
14:53:30Arcorann quits [Ping timeout: 250 seconds]
15:14:57<nuroten>@HCross: Brozzler? what do you mean?
15:16:27<nuroten>oh, https://github.com/internetarchive/brozzler ?
15:19:00<nuroten>yeah, I'm not sure, kind of imagining something that emulates browser clicking all the interactive elements and caching as it goes, if such a thing exists
15:22:22<nuroten>description sounds useful, link extraction
15:28:09AntiLiberal joins
15:35:24qwertyasdfuiopghjkl joins
15:39:08AntiLiberal quits [Ping timeout: 244 seconds]
16:05:29qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
16:08:36qwertyasdfuiopghjkl joins
16:18:54spirit joins
16:49:11<Iki>Is there a good way to archive a site a second time without huge overlaps? I'm thinking either: 1) have the original warc on-hand and don't add old pages and/or use the 'revisit' option, or 2) do a more limited comparison, such as comparing old archives against the current sitemap
16:49:45<Iki>Just curious if there is a tool that makes this straightforward. It's easily scriptable (such as by comparing a sitemap against IA's CDX output), but scriptable is not scalable
16:52:39<Iki>Okay. Gonna share options as I find them. Please let me know if any complications to them are known. I'll tag these thoughts with the keyword "repeatscrape"
16:53:32<Iki>repeatscrape 1: Looks like wpull allows use of a --database argument to track previously-visited URLs. Pretty good! Though it doesn't compare against the contents to check for changes
16:56:44<Iki>repeatscrape 2: Doesn't look like wget has wpull's database option, though it might be possible to use the --warc-dedup and --warc-cdx options to basically do the same thing
17:00:04<Iki>repeatscrape 3: grab-site can take all wpull options via --wpull-arg. In addition, looks like it includes the dupespotter plugin, which maybe does the kind of comparison I'm looking for?
18:28:11<@JAA>Iki: It's a hard problem to solve. The most accurate solution is to recrawl the entire site and write revisit records as appropriate. But due to dynamically generated sites (e.g. session IDs, timestamps), you'll end up with a lot of duplication anyway. Anything else would have to be specific to a particular site and make use of its structure since you'd have to know which URLs to refetch (e.g.
18:28:17<@JAA>sitemaps, article lists) and which not (e.g. articles that you previously covered).
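[editor's sketch] The "compare a sitemap against CDX output" idea Iki mentions could look roughly like the following. This is only a minimal sketch under assumptions that are not from the channel: the CDX text is space-separated with the original URL in the third column (the usual "urlkey timestamp original ..." layout), the sitemap is a plain urlset rather than a sitemap index, and the helper names (normalize, sitemap_urls, cdx_urls, uncaptured) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def normalize(url: str) -> str:
    """Crude comparison key: ignore scheme, host case and trailing slash."""
    parts = urlsplit(url)
    key = parts.netloc.lower() + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def sitemap_urls(sitemap_xml: str) -> set[str]:
    """Collect every <loc> value from a sitemap urlset document."""
    root = ET.fromstring(sitemap_xml)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def cdx_urls(cdx_text: str, url_field: int = 2) -> set[str]:
    """Collect original URLs from space-separated CDX lines.

    Assumes the original URL is in column `url_field` (0-based);
    a leading 'CDX ...' header line, if present, is skipped.
    """
    urls = set()
    for line in cdx_text.splitlines():
        if line.lstrip().startswith("CDX"):
            continue
        fields = line.split()
        if len(fields) > url_field:
            urls.add(fields[url_field])
    return urls


def uncaptured(sitemap_xml: str, cdx_text: str) -> list[str]:
    """Sitemap URLs with no matching capture in the CDX listing."""
    seen = {normalize(u) for u in cdx_urls(cdx_text)}
    return sorted(u for u in sitemap_urls(sitemap_xml) if normalize(u) not in seen)
```

The resulting list could be fed to a crawler as the set of URLs to fetch. As noted above, this only finds URLs that were never captured; it says nothing about pages whose content changed since the last crawl, and the normalization does not handle explicit ports such as ":80".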
18:45:52HP_Archivist (HP_Archivist) joins
19:23:22qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
19:23:57qwertyasdfuiopghjkl joins
19:26:29lennier1 quits [Client Quit]
19:27:22lennier1 (lennier1) joins
19:35:12sec^nd quits [Remote host closed the connection]
19:41:52sec^nd (second) joins
20:01:16<Frogging101>[17:04:05] <Frogging101> https://www.misterpoll.com/directory/religion/pg/7
20:01:18<Frogging101>[17:04:10] <Frogging101> Fuck page 7 in particular, I guess.
20:01:40<Frogging101>posted that in -dev yesterday by mistake, instead of here. oops
20:15:54@Fusl_ quits [Ping timeout: 250 seconds]
20:16:20jonty quits [Ping timeout: 250 seconds]
20:20:37Megame (Megame) joins
20:32:27Fusl_ (Fusl) joins
20:32:27@ChanServ sets mode: +o Fusl_
20:32:40jonty (jonty) joins
20:32:40Stilett0 joins
20:34:31Stiletto quits [Ping timeout: 252 seconds]
20:40:52wolfin (wolfin) joins
20:53:06djsrv_ quits [Quit: ZNC 1.8.2 - https://znc.in]
20:54:35djsrv (djsrv) joins
21:22:12nuroten quits [Ping timeout: 244 seconds]
21:36:16spirit quits [Client Quit]
21:42:47<JensRex>How does the dagensblaeser.net crawl manage to so far get 31GB data and 450K requests? There isn't that much content there.
21:48:08<@JAA>Off-site links, probably.
21:48:48<JensRex>Looks like it's just downloading the same js and css over and over again right now.
21:53:54<@JAA>Well, or that. Shitty site that doesn't know how to use caching headers.
21:58:03Iki quits [Read error: Connection reset by peer]
22:00:41Jonboy345 quits [Remote host closed the connection]
22:01:03Jonboy345 joins
22:46:41Megame quits [Client Quit]
22:59:26SenileOvaltine joins
23:19:09VerifiedJ quits [Quit: The Lounge - https://thelounge.chat]
23:20:09VerifiedJ (VerifiedJ) joins
23:21:48VerifiedJ quits [Client Quit]
23:22:36VerifiedJ (VerifiedJ) joins
23:25:59qwertyasdfuiopghjkl quits [Client Quit]
23:36:48Stilett0 is now known as Stiletto
23:48:22Arcorann (Arcorann) joins