| 00:04:54 | | BlueMaxima joins |
| 00:31:18 | | BlueMaxima_ joins |
| 00:35:06 | | BlueMaxima__ joins |
| 00:35:19 | | BlueMaxima quits [Ping timeout: 258 seconds] |
| 00:39:06 | | BlueMaxima_ quits [Ping timeout: 250 seconds] |
| 00:44:18 | | lukash7 quits [Quit: The Lounge - https://thelounge.chat] |
| 01:03:48 | | dm4v quits [Ping timeout: 250 seconds] |
| 01:05:28 | | dm4v joins |
| 01:05:30 | | dm4v is now authenticated as dm4v |
| 01:05:30 | | dm4v quits [Changing host] |
| 01:05:30 | | dm4v (dm4v) joins |
| 01:13:33 | | lukash7 joins |
| 01:14:25 | | Iki quits [Ping timeout: 258 seconds] |
| 01:17:45 | | lukash7 quits [Client Quit] |
| 01:28:40 | | Iki joins |
| 01:33:42 | | pmlo1 quits [Ping timeout: 250 seconds] |
| 01:34:15 | | Arcorann (Arcorann) joins |
| 01:34:44 | | pmlo1 joins |
| 02:35:00 | | lukash7 joins |
| 02:37:30 | | Iki quits [Read error: Connection reset by peer] |
| 02:42:52 | | lukash7 quits [Client Quit] |
| 02:56:54 | | lukash7 joins |
| 03:23:52 | | wickedplayer494 quits [Remote host closed the connection] |
| 03:29:15 | | wickedplayer494 joins |
| 03:29:56 | | wickedplayer494 is now authenticated as wickedplayer494 |
| 03:34:26 | | qw3rty__ joins |
| 03:38:10 | | qw3rty_ quits [Ping timeout: 258 seconds] |
| 03:44:35 | | Iki joins |
| 03:48:31 | | Wayward quits [Ping timeout: 258 seconds] |
| 03:57:49 | | etnguyen03 quits [Client Quit] |
| 04:18:43 | | nepeat_ quits [Remote host closed the connection] |
| 04:18:43 | | notak joins |
| 04:21:12 | | RJHacker75162 joins |
| 04:23:06 | | RJHacker75162 is now known as nepeat |
| 04:23:08 | | notak quits [Ping timeout: 250 seconds] |
| 04:23:10 | | nepeat is now known as RJHacker82098 |
| 04:23:39 | | RJHacker82098 is now authenticated as nepeat |
| 05:45:49 | | Jack_Thompson quits [Ping timeout: 258 seconds] |
| 07:30:06 | | nuroten quits [Remote host closed the connection] |
| 07:40:17 | | BlueMaxima__ quits [Client Quit] |
| 07:45:45 | | BlueMaxima joins |
| 08:20:03 | | Jack_Thompson joins |
| 08:26:21 | | mutantmnky is now known as mutantmonkey |
| 08:27:52 | | Wayward (wayward) joins |
| 08:31:08 | | duce1337 (duce1337) joins |
| 09:00:24 | | godane2 quits [Client Quit] |
| 09:00:54 | | godane (godane) joins |
| 09:16:53 | | godane quits [Client Quit] |
| 09:23:05 | | betamax quits [Remote host closed the connection] |
| 09:30:26 | | betamax (betamax) joins |
| 09:31:36 | | Doran (Doranwen) joins |
| 09:31:59 | | Doranwen quits [Ping timeout: 258 seconds] |
| 10:36:42 | | spirit joins |
| 10:41:25 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 10:54:53 | | Daloader_ joins |
| 11:17:38 | | Mateon1 quits [Remote host closed the connection] |
| 11:31:38 | | duce1337_ (duce1337) joins |
| 11:31:38 | | duce1337 quits [Read error: Connection reset by peer] |
| 11:37:36 | | Mateon1 joins |
| 12:51:32 | | Kaz__ quits [Quit: Connection closed for inactivity] |
| 13:08:25 | | us3rrr joins |
| 13:10:56 | | onetruth quits [Ping timeout: 250 seconds] |
| 13:11:40 | | sec^nd quits [Remote host closed the connection] |
| 13:12:01 | | sec^nd (second) joins |
| 13:39:26 | | RJHacker82098 is now known as nepeat |
| 14:21:30 | | second (second) joins |
| 14:22:59 | | sec^nd- (second) joins |
| 14:25:04 | | sec^nd quits [Ping timeout: 255 seconds] |
| 14:25:05 | | sec^nd- is now known as sec^nd |
| 14:26:25 | | second quits [Ping timeout: 255 seconds] |
| 15:08:33 | | Arcorann quits [Ping timeout: 258 seconds] |
| 15:11:26 | | roxfan quits [Remote host closed the connection] |
| 15:14:48 | | duce1337_ quits [Read error: Connection reset by peer] |
| 15:15:06 | | duce1337 (duce1337) joins |
| 15:41:44 | | pcr leaves |
| 15:54:24 | | cmlow quits [Client Quit] |
| 16:18:46 | | pcr joins |
| 17:22:43 | | Barto quits [Ping timeout: 258 seconds] |
| 17:57:08 | | VerifiedJ quits [Quit: The Lounge - https://thelounge.chat] |
| 17:57:55 | | VerifiedJ (VerifiedJ) joins |
| 18:06:51 | | Larsenv quits [Quit: ZNC 1.8.0 - https://znc.in] |
| 18:16:26 | | ThreeHeadedMonkey quits [Ping timeout: 250 seconds] |
| 18:20:00 | | ThreeHeadedMonkey (ThreeHeadedMonkey) joins |
| 18:23:21 | | Larsenv (Larsenv) joins |
| 18:34:04 | | flashmeow joins |
| 18:43:16 | <betamax> | JAA: thank you for keeping on top of the election sites (and sorry for not being more involved) |
| 18:43:42 | <betamax> | I see that all the lib dem sites are set to larger delays - I assume this is due to rate limiting? |
| 18:44:06 | <@JAA> | Yeah, they're all hosted on the same IP and have silly rate limits. |
| 18:44:46 | <@JAA> | 1 request per second is already too much and gets your IP banned (connection refused). |
| 18:45:02 | <@JAA> | Some others are also slowed down for similar reasons. |
| 18:45:08 | <betamax> | there's also ~15 or so that are stuck with a very large delay (e.g: e6t5bxy2xr04kg98svkax7o7r with 300 second delay) - is this due to issues with the underlying pipeline? |
| 18:45:31 | <@JAA> | Those with a 300001 delay are all dead, see #archivebot just now. |
| 18:45:54 | <@JAA> | I'll take care of those when I clean it up. |
| 18:46:30 | <@JAA> | There's also a growing list of sites that need special treatment in some way. Some are LibDem sites we aborted, then there's a bunch of sites that embed another domain via an iframe etc. |
| 18:47:13 | <@JAA> | And no worries re involvement, you created the list. :-) Hope you don't mind the thousands of highlights in #archivebot though. :-P |
| 18:47:17 | <@EggplantN> | if required we can always just set up grab-site with a /24? JAA/betamax :) |
| 18:47:26 | <@EggplantN> | and do the random IP iptables rule |
| 18:48:10 | <@JAA> | If someone who isn't me does it, sure! :-) |
| 18:48:32 | <betamax> | JAA: I cleaned up the list, but it was the hundreds of volunteers through Democracy Club (https://democracyclub.org.uk/) that did the hard work crowdsourcing the data |
| 18:49:02 | <@EggplantN> | :P i mean, I can get the box + grab-site configured if you wanna put the jobs into it |
| 18:49:07 | <@EggplantN> | or just AB them as you are :D |
| 18:50:05 | | Barto (Barto) joins |
| 18:50:19 | <@JAA> | AB is fully automated at this point. So if I can throw a list of domains somewhere, that's fine with me. I don't really want to have to set up uploads, add ignores manually, etc. |
| 18:50:56 | <@EggplantN> | Fair enough ;) |
| 18:51:04 | <@JAA> | I occasionally go through to see if a job's running wild and check the list of finished jobs to see which ones went wrong. Otherwise, it's automatic. |
| 18:51:08 | <betamax> | I didn't mean for this to be such a big project - I (naively) thought it would be straightforward :) |
| 18:51:37 | <@JAA> | :-) |
| 18:52:00 | <@EggplantN> | It's a great idea for a long running project and it falls into the same category as another one i mentioned to JAA recently. It would be best looking into creating a specific tool for these projects. I'll have a brainstorm one night |
| 18:52:28 | <@JAA> | Aye |
| 18:53:10 | <@JAA> | Gov sites as well. A few found their way into this list, and it's amazing how much stuff on them has never been archived before. |
| 18:53:32 | <@JAA> | One was retrieving thousands of PDFs, and only like a hundred were in the WBM. |
| 18:54:48 | <betamax> | Yeah. I tried to do my own thing for the US 2018 midterm elections (archiving candidate sites) and ran into complexity issues pretty quickly. I ended up attempting to archive each site using wget with warc output, then aborting and keeping the partial archive if it took longer than 5 or so minutes, but that isn't a very good technique :D |
| 18:56:01 | | BerndLauert joins |
| 18:56:54 | <betamax> | JAA: it's amazing how much gov / council stuff is just deleted. I've been archiving the UK council webcasts for about a year now (https://archive.org/details/public-i-webcast-archive) but a lot is lost forever because webcasts are deleted after a completely arbitrary time period |
| 18:57:29 | <betamax> | (for that project I pull PDFs / meeting minutes from the local government sites if they've been mentioned in the metadata for the webcast) |
| 18:57:50 | <@JAA> | Yeah |
| 19:01:56 | | Daloader_ quits [Ping timeout: 250 seconds] |
| 19:04:33 | | duce1337_ (duce1337) joins |
| 19:04:33 | | duce1337 quits [Read error: Connection reset by peer] |
| 19:05:12 | <Ryz> | Mmm, there always be deleting, it's not just websites that suddenly go poof, no, the more nefarious stuff is stuff deleted while the website still looks alive s: |
| 19:05:28 | | icedice joins |
| 19:13:33 | <BerndLauert> | What's the team's opinion on archiving game modding sites? And porn? I've never seen this being discussed in the archival circles |
| 19:14:37 | <@JAA> | https://transfer.archivete.am/inline/bG4mu/aatt.png |
| 19:15:18 | <BerndLauert> | lol |
| 19:17:07 | <@JAA> | Porn depends a bit, but game modding absolutely. |
| 19:18:00 | <BerndLauert> | there's a degree of overlap there |
| 19:18:22 | <@Kaz> | we literally grabbed the porn bit of tumblr |
| 20:21:28 | | flashmeow quits [Remote host closed the connection] |
| 20:30:57 | | HackMii_ quits [Remote host closed the connection] |
| 20:31:28 | | HackMii_ (hacktheplanet) joins |
| 20:48:04 | | BerndLauert quits [Ping timeout: 244 seconds] |
| 20:58:59 | | superkuh quits [Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilayer] |
| 21:53:54 | | duce1337_ is now known as duce1337 |
| 22:25:13 | <aarchi> | Anyone know how frequently Stack Exchange uploads their dumps to IA? Only the most recent is kept, so I can't tell. https://archive.org/details/stackexchange |
| 22:39:04 | | duce1337 quits [Client Quit] |
| 22:47:55 | <Jake> | about every 80-90 days I think. |
| 22:59:04 | <@JAA> | Every three months per https://meta.stackexchange.com/questions/224873/all-stack-exchange-data-dumps |
| 23:15:20 | <@Kaz> | checks out. Last one 75 days ago, then 159 days, then 249 days, etc |
| 23:15:57 | <@JAA> | But also, gross that they only keep the latest dump. |
| 23:17:17 | <@JAA> | https://data.stackexchange.com/ has weekly dumps, by the way. |
| 23:19:36 | <@Kaz> | yeah, I was just poking around hoping the data was maybe shipped off elsewhere first or something, but doesn't seem to be |
| 23:19:56 | <@JAA> | Relevant discussion: https://meta.stackexchange.com/questions/224873/all-stack-exchange-data-dumps |
| 23:22:07 | <@JAA> | So one would have to dig through the logs for that IA item, get the torrent info hashes, and then hope that someone still seeds them. |
| 23:37:23 | | us3rrr quits [Client Quit] |
| 23:38:58 | | FalconK quits [Quit: WeeChat 3.1] |
| 23:39:39 | | FalconK (FalconK) joins |
| 23:42:01 | | FalconK quits [Client Quit] |
| 23:55:15 | | Iki quits [Ping timeout: 258 seconds] |
| 23:59:15 | | Ryz quits [Remote host closed the connection] |
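
The approach betamax describes at 18:54:48 (wget with WARC output, aborted after about five minutes, keeping whatever was written) could look roughly like the sketch below. This is not betamax's actual script: the function names `build_wget_command` and `run_with_deadline`, the choice of wget options, and the 300-second default are all assumptions made for illustration.

```python
import subprocess
import sys


def build_wget_command(url, warc_name):
    """Build a recursive wget invocation that records traffic to a WARC file.

    The option set is an assumption; the log only says "wget with warc output".
    """
    return [
        "wget",
        "--recursive",               # follow links within the site
        "--page-requisites",         # also fetch images, CSS, scripts
        f"--warc-file={warc_name}",  # wget appends the .warc.gz extension
        url,
    ]


def run_with_deadline(command, seconds):
    """Run a command; return True if it finished, False if it was aborted.

    subprocess.run kills the child when the timeout expires, so any output
    written so far (here, the partial WARC) stays on disk.
    """
    try:
        subprocess.run(command, timeout=seconds, check=False)
    except subprocess.TimeoutExpired:
        return False
    return True


if __name__ == "__main__":
    # Usage: python grab.py https://example.org/ example
    site_url, name = sys.argv[1], sys.argv[2]
    finished = run_with_deadline(build_wget_command(site_url, name), 300)
    print("complete" if finished else "aborted, partial WARC kept")
```

As betamax notes, this is a blunt technique: a job killed mid-write can leave a truncated final record in the compressed WARC, and a fixed deadline says nothing about how much of the site was actually covered.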