| 00:00:52 | | dm4v quits [Read error: Connection reset by peer] |
| 00:03:41 | | dm4v joins |
| 00:03:43 | | dm4v is now authenticated as dm4v |
| 00:03:43 | | dm4v quits [Changing host] |
| 00:03:43 | | dm4v (dm4v) joins |
| 00:16:36 | | j quits [Remote host closed the connection] |
| 00:26:43 | | Campbell quits [Ping timeout: 252 seconds] |
| 01:03:23 | | dm4v quits [Ping timeout: 244 seconds] |
| 01:03:59 | | dm4v joins |
| 01:04:01 | | dm4v is now authenticated as dm4v |
| 01:04:01 | | dm4v quits [Changing host] |
| 01:04:01 | | dm4v (dm4v) joins |
| 01:41:47 | | nuroten joins |
| 01:45:11 | <nuroten> | thuban or anyone saving HK media sites: can we please add the new June 4th Museum website? just launched today, the physical museum was shut down so they switched to online exhibition. it's js-heavy though, so not sure how much can be saved https://8964museum.com/ |
| 01:48:50 | <nuroten> | (it has archival images, timelines, etc. of the massacre. link's been added to the pad under Hong Kong Alliance in Support of Patriotic Democratic Movements of China, but I thought I'd mention it here in case it gets lost among the other ones already saved) |
| 01:49:26 | | Arcorann (Arcorann) joins |
| 01:54:31 | <nuroten> | thanks! |
| 02:08:30 | | lennier1 quits [Client Quit] |
| 02:09:03 | | lennier1 (lennier1) joins |
| 02:10:05 | | AntiLiberal joins |
| 03:09:08 | | HP_Archivist (HP_Archivist) joins |
| 03:20:24 | | Campbell joins |
| 03:41:40 | | jamesp quits [Client Quit] |
| 03:42:44 | | qw3rty_ joins |
| 03:46:36 | | qw3rty__ quits [Ping timeout: 250 seconds] |
| 03:54:55 | | AntiLiberal quits [Ping timeout: 244 seconds] |
| 04:19:23 | | Atom quits [Read error: Connection reset by peer] |
| 04:19:35 | | Atom joins |
| 04:53:34 | | wizards_ joins |
| 04:56:48 | | wizards quits [Ping timeout: 250 seconds] |
| 06:48:56 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 06:58:56 | | qwertyasdfuiopghjkl joins |
| 07:01:26 | | HP_Archivist quits [Ping timeout: 244 seconds] |
| 07:10:15 | | BlueMaxima quits [Remote host closed the connection] |
| 07:10:29 | | BlueMaxima joins |
| 07:15:01 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 07:15:14 | | BlueMaxima joins |
| 07:36:39 | | Eighty_ joins |
| 07:38:00 | | Eighty quits [Ping timeout: 250 seconds] |
| 07:53:54 | <@OrIdow6> | Anyone have any examples of Google Drive files or folders that are maybe 3 GB - 10 GB? |
| 07:54:14 | <@OrIdow6> | Also other such cases, such as folders with millions of files inside them |
| 07:54:19 | <@OrIdow6> | Publicly accessible, obviously |
| 07:58:14 | | knecht420 quits [Read error: Connection reset by peer] |
| 07:58:16 | | knecht4207 (knecht420) joins |
| 07:58:47 | | knecht4207 quits [Client Quit] |
| 07:59:39 | | knecht4207 (knecht420) joins |
| 07:59:47 | | knecht4207 is now known as knecht420 |
| 08:03:45 | | nimaje joins |
| 08:29:41 | <@HCross> | nuroten: hmm.. I wonder if this is a Brozzler affair |
| 08:32:30 | <gazorpazorp> | @OrIdow6: https://drive.google.com/drive/folders/1r8I5hpSPCf_9JWECwa6c4E4tQZELd3cx flash game zip files ranging from hundreds of MBs to tens of GB |
| 08:37:26 | <@OrIdow6> | Thank you gazorpazorp |
| 08:41:21 | <gazorpazorp> | https://drive.google.com/drive/folders/1oCMgJeBc55NuEasPcgwjx2FuPdQd8neu randomly found, different types of files, many nested folders |
| 08:41:28 | <gazorpazorp> | np |
| 08:53:59 | <gazorpazorp> | @OrIdow6: https://drive.google.com/drive/folders/1TuO-0XyxTVK7Jys2WW0gduRcoQMTpB9C there are lots and lots of files and nested folders. I have no idea how to calculate the total number of files and whether it's near a million or not |
| 08:54:56 | <gazorpazorp> | but I can't find a folder with millions of files inside that aren't in other nested folders |
| 09:04:39 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 09:12:31 | | Wayward quits [Ping timeout: 252 seconds] |
| 10:03:40 | | Video quits [Ping timeout: 252 seconds] |
| 10:39:14 | | Video joins |
| 11:43:07 | | spirit joins |
| 12:28:03 | | Iki quits [Read error: Connection reset by peer] |
| 12:31:01 | | Iki joins |
| 13:13:36 | | BlueMaxima quits [Client Quit] |
| 14:08:38 | | spirit quits [Client Quit] |
| 14:14:45 | | Jonboy3451 quits [Read error: Connection reset by peer] |
| 14:18:22 | | Jonboy345 joins |
| 14:35:12 | | Doran is now known as Doranwen |
| 14:53:30 | | Arcorann quits [Ping timeout: 250 seconds] |
| 15:14:57 | <nuroten> | @HCross: Brozzler? what do you mean? |
| 15:16:27 | <nuroten> | oh, https://github.com/internetarchive/brozzler ? |
| 15:19:00 | <nuroten> | yeah, I'm not sure, kind of imagining something that emulates browser clicking all the interactive elements and caching as it goes, if such a thing exists |
| 15:22:22 | <nuroten> | description sounds useful, link extraction |
| 15:28:09 | | AntiLiberal joins |
| 15:35:24 | | qwertyasdfuiopghjkl joins |
| 15:39:08 | | AntiLiberal quits [Ping timeout: 244 seconds] |
| 16:05:29 | | qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds] |
| 16:08:36 | | qwertyasdfuiopghjkl joins |
| 16:18:54 | | spirit joins |
| 16:49:11 | <Iki> | Is there a good way to archive a site a second time without huge overlaps? I'm thinking either: 1) have the original warc on-hand and don't add old pages and/or use the 'revisit' option, or 2) do a more limited comparison, such as comparing old archives against the current sitemap |
| 16:49:45 | <Iki> | Just curious if there is a tool that makes this straightforward. It's easily scriptable (such as by comparing a sitemap against IA's CDX output), but scriptable is not scalable |
| 16:52:39 | <Iki> | Okay. Gonna share options as I find them. Please let me know if any complications to them are known. I'll tag these thoughts with the keyword "repeatscrape" |
| 16:53:32 | <Iki> | repeatscrape 1: Looks like wpull allows use of a --database argument to track previously-visited URLs. Pretty good! Though it doesn't compare against the contents to check for changes |
| 16:56:44 | <Iki> | repeatscrape 2: Doesn't look like wget has wpull's database option, though it might be possible to use the --warc-dedup option and --warc-cdx options to basically do the same thing |
| 17:00:04 | <Iki> | repeatscrape 3: grab-site can take all wpull options via --wpull-arg. In addition, looks like it includes the dupespotter plugin, which maybe does the kind of comparison I'm looking for? |
| 18:28:11 | <@JAA> | Iki: It's a hard problem to solve. The most accurate solution is to recrawl the entire site and write revisit records as appropriate. But due to dynamically generated sites (e.g. session IDs, timestamps), you'll end up with a lot of duplication anyway. Anything else would have to be specific to a particular site and make use of its structure since you'd have to know which URLs to refetch (e.g. |
| 18:28:17 | <@JAA> | sitemaps, article lists) and which not (e.g. articles that you previously covered). |
| 18:45:52 | | HP_Archivist (HP_Archivist) joins |
| 19:23:22 | | qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds] |
| 19:23:57 | | qwertyasdfuiopghjkl joins |
| 19:26:29 | | lennier1 quits [Client Quit] |
| 19:27:22 | | lennier1 (lennier1) joins |
| 19:35:12 | | sec^nd quits [Remote host closed the connection] |
| 19:41:52 | | sec^nd (second) joins |
| 20:01:16 | <Frogging101> | [17:04:05] <Frogging101> https://www.misterpoll.com/directory/religion/pg/7 |
| 20:01:18 | <Frogging101> | [17:04:10] <Frogging101> Fuck page 7 in particular, I guess. |
| 20:01:40 | <Frogging101> | posted that in -dev yesterday by mistake, instead of here. oops |
| 20:15:54 | | @Fusl_ quits [Ping timeout: 250 seconds] |
| 20:16:20 | | jonty quits [Ping timeout: 250 seconds] |
| 20:20:37 | | Megame (Megame) joins |
| 20:32:27 | | Fusl_ (Fusl) joins |
| 20:32:27 | | @ChanServ sets mode: +o Fusl_ |
| 20:32:40 | | jonty (jonty) joins |
| 20:32:40 | | Stilett0 joins |
| 20:34:31 | | Stiletto quits [Ping timeout: 252 seconds] |
| 20:40:52 | | wolfin (wolfin) joins |
| 20:53:06 | | djsrv_ quits [Quit: ZNC 1.8.2 - https://znc.in] |
| 20:54:35 | | djsrv (djsrv) joins |
| 21:22:12 | | nuroten quits [Ping timeout: 244 seconds] |
| 21:36:16 | | spirit quits [Client Quit] |
| 21:42:47 | <JensRex> | How does the dagensblaeser.net crawl manage to so far get 31GB data and 450K requests? There isn't that much content there. |
| 21:48:08 | <@JAA> | Off-site links, probably. |
| 21:48:48 | <JensRex> | Looks like it's just downloading the same js and css over and over again right now. |
| 21:53:54 | <@JAA> | Well, or that. Shitty site that doesn't know how to use caching headers. |
| 21:58:03 | | Iki quits [Read error: Connection reset by peer] |
| 22:00:41 | | Jonboy345 quits [Remote host closed the connection] |
| 22:01:03 | | Jonboy345 joins |
| 22:46:41 | | Megame quits [Client Quit] |
| 22:59:26 | | SenileOvaltine joins |
| 23:19:09 | | VerifiedJ quits [Quit: The Lounge - https://thelounge.chat] |
| 23:20:09 | | VerifiedJ (VerifiedJ) joins |
| 23:21:48 | | VerifiedJ quits [Client Quit] |
| 23:22:36 | | VerifiedJ (VerifiedJ) joins |
| 23:25:59 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 23:36:48 | | Stilett0 is now known as Stiletto |
| 23:48:22 | | Arcorann (Arcorann) joins |
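
Note on the "repeatscrape" thread (16:49 onward): Iki's idea of comparing a sitemap against IA's CDX output could look roughly like the sketch below. It is only an illustration under stated assumptions, not a tool anyone in the channel mentioned. It assumes the sitemap follows the sitemaps.org 0.9 schema and that the CDX data was fetched with `output=json`, where the first row is a header containing an `original` column. The function names and the trailing-slash normalization are hypothetical choices. Like wpull's `--database`, it only checks whether a URL was seen before, not whether its content changed.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by sitemaps that follow the sitemaps.org 0.9 schema (assumption).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(sitemap_xml):
    """Return the set of <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def archived_urls(cdx_json):
    """Return the set of original URLs from CDX output in JSON form.

    Assumes the first row is a header row that includes "original".
    """
    rows = json.loads(cdx_json)
    if not rows:
        return set()
    header, *records = rows
    col = header.index("original")
    return {record[col] for record in records}


def normalize(url):
    """Crude, hypothetical normalization: ignore a trailing slash."""
    return url.rstrip("/")


def urls_to_fetch(sitemap_xml, cdx_json):
    """List sitemap URLs that have no capture in the CDX data."""
    seen = {normalize(u) for u in archived_urls(cdx_json)}
    return sorted(u for u in sitemap_urls(sitemap_xml) if normalize(u) not in seen)
```

The resulting list could then be handed to a crawler as its input URLs. As JAA points out at 18:28, this only helps on sites whose structure tells you which URLs are new; pages that changed under an already-captured URL are missed.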