00:04:54BlueMaxima joins
00:31:18BlueMaxima_ joins
00:35:06BlueMaxima__ joins
00:35:19BlueMaxima quits [Ping timeout: 258 seconds]
00:39:06BlueMaxima_ quits [Ping timeout: 250 seconds]
00:44:18lukash7 quits [Quit: The Lounge - https://thelounge.chat]
01:03:48dm4v quits [Ping timeout: 250 seconds]
01:05:28dm4v joins
01:05:30dm4v quits [Changing host]
01:05:30dm4v (dm4v) joins
01:13:33lukash7 joins
01:14:25Iki quits [Ping timeout: 258 seconds]
01:17:45lukash7 quits [Client Quit]
01:28:40Iki joins
01:33:42pmlo1 quits [Ping timeout: 250 seconds]
01:34:15Arcorann (Arcorann) joins
01:34:44pmlo1 joins
02:35:00lukash7 joins
02:37:30Iki quits [Read error: Connection reset by peer]
02:42:52lukash7 quits [Client Quit]
02:56:54lukash7 joins
03:23:52wickedplayer494 quits [Remote host closed the connection]
03:29:15wickedplayer494 joins
03:34:26qw3rty__ joins
03:38:10qw3rty_ quits [Ping timeout: 258 seconds]
03:44:35Iki joins
03:48:31Wayward quits [Ping timeout: 258 seconds]
03:57:49etnguyen03 quits [Client Quit]
04:18:43nepeat_ quits [Remote host closed the connection]
04:18:43notak joins
04:21:12RJHacker75162 joins
04:23:06RJHacker75162 is now known as nepeat
04:23:08notak quits [Ping timeout: 250 seconds]
04:23:10nepeat is now known as RJHacker82098
05:45:49Jack_Thompson quits [Ping timeout: 258 seconds]
07:30:06nuroten quits [Remote host closed the connection]
07:40:17BlueMaxima__ quits [Client Quit]
07:45:45BlueMaxima joins
08:20:03Jack_Thompson joins
08:26:21mutantmnky is now known as mutantmonkey
08:27:52Wayward (wayward) joins
08:31:08duce1337 (duce1337) joins
09:00:24godane2 quits [Client Quit]
09:00:54godane (godane) joins
09:16:53godane quits [Client Quit]
09:23:05betamax quits [Remote host closed the connection]
09:30:26betamax (betamax) joins
09:31:36Doran (Doranwen) joins
09:31:59Doranwen quits [Ping timeout: 258 seconds]
10:36:42spirit joins
10:41:25BlueMaxima quits [Read error: Connection reset by peer]
10:54:53Daloader_ joins
11:17:38Mateon1 quits [Remote host closed the connection]
11:31:38duce1337_ (duce1337) joins
11:31:38duce1337 quits [Read error: Connection reset by peer]
11:37:36Mateon1 joins
12:51:32Kaz__ quits [Quit: Connection closed for inactivity]
13:08:25us3rrr joins
13:10:56onetruth quits [Ping timeout: 250 seconds]
13:11:40sec^nd quits [Remote host closed the connection]
13:12:01sec^nd (second) joins
13:39:26RJHacker82098 is now known as nepeat
14:21:30second (second) joins
14:22:59sec^nd- (second) joins
14:25:04sec^nd quits [Ping timeout: 255 seconds]
14:25:05sec^nd- is now known as sec^nd
14:26:25second quits [Ping timeout: 255 seconds]
15:08:33Arcorann quits [Ping timeout: 258 seconds]
15:11:26roxfan quits [Remote host closed the connection]
15:14:48duce1337_ quits [Read error: Connection reset by peer]
15:15:06duce1337 (duce1337) joins
15:41:44pcr leaves
15:54:24cmlow quits [Client Quit]
16:18:46pcr joins
17:22:43Barto quits [Ping timeout: 258 seconds]
17:57:08VerifiedJ quits [Quit: The Lounge - https://thelounge.chat]
17:57:55VerifiedJ (VerifiedJ) joins
18:06:51Larsenv quits [Quit: ZNC 1.8.0 - https://znc.in]
18:16:26ThreeHeadedMonkey quits [Ping timeout: 250 seconds]
18:20:00ThreeHeadedMonkey (ThreeHeadedMonkey) joins
18:23:21Larsenv (Larsenv) joins
18:34:04flashmeow joins
18:43:16<betamax>JAA: thank you for keeping on top of the election sites (and sorry for not being more involved)
18:43:42<betamax>I see that all the lib dem sites are set to larger delays - I assume this is due to rate limiting?
18:44:06<@JAA>Yeah, they're all hosted on the same IP and have silly rate limits.
18:44:46<@JAA>1 request per second is already too much and gets your IP banned (connection refused).
18:45:02<@JAA>Some others are also slowed down for similar reasons.
18:45:08<betamax>there's also ~15 or so that are stuck with a very large delay (e.g: e6t5bxy2xr04kg98svkax7o7r with 300 second delay) - is this due to issues with the underlying pipeline?
18:45:31<@JAA>Those with a 300001 delay are all dead, see #archivebot just now.
18:45:54<@JAA>I'll take care of those when I clean it up.
18:46:30<@JAA>There's also a growing list of sites that need special treatment in some way. Some are LibDem sites we aborted, then there's a bunch of sites that embed another domain via an iframe etc.
18:47:13<@JAA>And no worries re involvement, you created the list. :-) Hope you don't mind the thousands of highlights in #archivebot though. :-P
18:47:17<@EggplantN>if required we can always just set up grab-site with a /24? JAA/betamax :)
18:47:26<@EggplantN>and do the random IP iptables rule
18:48:10<@JAA>If someone who isn't me does it, sure! :-)
18:48:32<betamax>JAA: I cleaned up the list, but it was the hundreds of volunteers through Democracy Club (https://democracyclub.org.uk/) that did the hard work crowdsourcing the data
18:49:02<@EggplantN>:P i mean, I can get the box + grab-site configured if you wanna put the jobs into it
18:49:07<@EggplantN>or just AB them as you are :D
18:50:05Barto (Barto) joins
18:50:19<@JAA>AB is fully automated at this point. So if I can throw a list of domains somewhere, that's fine with me. I don't really want to have to set up uploads, add ignores manually, etc.
18:50:56<@EggplantN>Fair enough ;)
18:51:04<@JAA>I occasionally go through to see if a job's running wild and check the list of finished jobs to see which ones went wrong. Otherwise, it's automatic.
18:51:08<betamax>I didn't mean for this to be such a big project - I (naively) thought it would be straightforward :)
18:51:37<@JAA>:-)
18:52:00<@EggplantN>It's a great idea for a long running project and it falls into the same category as another one i mentioned to JAA recently. It would be best looking into creating a specific tool for these projects. I'll have a brainstorm one night
18:52:28<@JAA>Aye
18:53:10<@JAA>Gov sites as well. A few found their way into this list, and it's amazing how much stuff on them has never been archived before.
18:53:32<@JAA>One was retrieving thousands of PDFs, and only like a hundred were in the WBM.
18:54:48<betamax>Yeah. I tried to do my own thing for the US 2018 midterm elections (archiving candidate sites) and ran into complexity issues pretty quickly. I ended up attempting to archive each site using wget with warc output, then aborting and keeping the partial archive if it took longer than 5 or so minutes, but that isn't a very good technique :D
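[Editor's sketch] The approach betamax describes (wget with WARC output, aborted after a fixed time budget) could look roughly like the following. This is a minimal sketch and not betamax's actual script: the function names, the one-second wait, and the specific wget flag selection are assumptions made for illustration. Only `--warc-file`, `--recursive`, `--page-requisites`, and `--wait` are used, all of which are standard wget options. A job killed at the deadline may leave a truncated final record in the partial WARC.

```python
import subprocess
from typing import List


def build_wget_command(url: str, warc_prefix: str) -> List[str]:
    """Build a wget command line that crawls `url` and writes a WARC.

    The flag selection here is an assumption for illustration; the
    original script's exact options are not given in the log.
    """
    return [
        "wget",
        "--recursive",                 # follow links within the site
        "--page-requisites",           # also fetch images, CSS, scripts
        "--wait=1",                    # hypothetical politeness delay
        f"--warc-file={warc_prefix}",  # wget appends the .warc(.gz) suffix
        url,
    ]


def run_with_deadline(command: List[str], deadline_seconds: float) -> bool:
    """Run `command`; return True if it exited on its own, False if the
    deadline passed and the process was killed.

    subprocess.run kills the child when the timeout expires, so whatever
    output was written up to that point is kept on disk.
    """
    try:
        subprocess.run(command, timeout=deadline_seconds, check=False)
    except subprocess.TimeoutExpired:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical example site and output prefix; five-minute budget.
    cmd = build_wget_command("https://example.org/", "example-org")
    finished = run_with_deadline(cmd, deadline_seconds=5 * 60)
    print("finished" if finished else "aborted, partial WARC kept")
```

A hard time limit like this is simple, but it cuts off large sites arbitrarily, which is the weakness betamax points out.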
18:56:01BerndLauert joins
18:56:54<betamax>JAA: it's amazing how much gov / council stuff is just deleted. I've been archiving the UK council webcasts for about a year now (https://archive.org/details/public-i-webcast-archive) but a lot is lost forever because webcasts are deleted after a completely arbitrary time period
18:57:29<betamax>(for that project I pull PDFs / meeting minutes from the local government sites if they've been mentioned in the metadata for the webcast)
18:57:50<@JAA>Yeah
19:01:56Daloader_ quits [Ping timeout: 250 seconds]
19:04:33duce1337_ (duce1337) joins
19:04:33duce1337 quits [Read error: Connection reset by peer]
19:05:12<Ryz>Mmm, there'll always be deleting, it's not just websites that suddenly go poof, no, the more nefarious stuff is stuff deleted while the website still looks alive s:
19:05:28icedice joins
19:13:33<BerndLauert>What's the team's opinion on archiving game modding sites? And porn? I've never seen this being discussed in the archival circles
19:14:37<@JAA>https://transfer.archivete.am/inline/bG4mu/aatt.png
19:15:18<BerndLauert>lol
19:17:07<@JAA>Porn depends a bit, but game modding absolutely.
19:18:00<BerndLauert>there's a degree of overlap there
19:18:22<@Kaz>we literally grabbed the porn bit of tumblr
20:21:28flashmeow quits [Remote host closed the connection]
20:30:57HackMii_ quits [Remote host closed the connection]
20:31:28HackMii_ (hacktheplanet) joins
20:48:04BerndLauert quits [Ping timeout: 244 seconds]
20:58:59superkuh quits [Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilayer]
21:53:54duce1337_ is now known as duce1337
22:25:13<aarchi>Anyone know how frequently Stack Exchange uploads their dumps to IA? Only the most recent is kept, so I can't tell. https://archive.org/details/stackexchange
22:39:04duce1337 quits [Client Quit]
22:47:55<Jake>about every 80-90 days I think.
22:59:04<@JAA>Every three months per https://meta.stackexchange.com/questions/224873/all-stack-exchange-data-dumps
23:15:20<@Kaz>checks out. Last one 75 days ago, then 159 days, then 249 days, etc
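[Editor's sketch] The arithmetic behind "checks out" can be spelled out: the gaps between the three upload ages Kaz quotes are close to the 80-90 days Jake guessed and to a quarterly schedule. The function name below is hypothetical; the numbers are the ones from the line above.

```python
from typing import List


def gaps_between(days_ago: List[int]) -> List[int]:
    """Return the differences between consecutive 'days ago' values."""
    return [later - earlier for earlier, later in zip(days_ago, days_ago[1:])]


# Upload ages quoted in the log: 75, 159 and 249 days ago.
upload_ages = [75, 159, 249]
intervals = gaps_between(upload_ages)

print(intervals)                        # [84, 90]
print(sum(intervals) / len(intervals))  # 87.0
```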
23:15:57<@JAA>But also, gross that they only keep the latest dump.
23:17:17<@JAA>https://data.stackexchange.com/ has weekly dumps, by the way.
23:19:36<@Kaz>yeah, I was just poking around hoping the data was maybe shipped off elsewhere first or something, but doesn't seem to be
23:19:56<@JAA>Relevant discussion: https://meta.stackexchange.com/questions/224873/all-stack-exchange-data-dumps
23:22:07<@JAA>So one would have to dig through the logs for that IA item, get the torrent info hashes, and then hope that someone still seeds them.
23:37:23us3rrr quits [Client Quit]
23:38:58FalconK quits [Quit: WeeChat 3.1]
23:39:39FalconK (FalconK) joins
23:42:01FalconK quits [Client Quit]
23:55:15Iki quits [Ping timeout: 258 seconds]
23:59:15Ryz quits [Remote host closed the connection]