00:37:52Mineroboter joins
00:39:14Mineroboter_ quits [Ping timeout: 250 seconds]
00:55:17Arcorann_ quits [Ping timeout: 258 seconds]
01:02:54dm4v quits [Client Quit]
01:03:06dm4v joins
01:03:08dm4v quits [Changing host]
01:03:08dm4v (dm4v) joins
01:44:49Wayward quits [Read error: Connection reset by peer]
01:45:25Wayward (wayward) joins
01:52:01Wayward quits [Remote host closed the connection]
01:53:11Wayward (wayward) joins
02:21:53JackThompson joins
02:23:20monoxane9 (monoxane) joins
02:24:32monoxane quits [Ping timeout: 250 seconds]
02:24:32monoxane9 is now known as monoxane
02:25:45Jack_Thompson quits [Ping timeout: 258 seconds]
03:38:25qw3rty_ joins
03:42:02qw3rty__ quits [Ping timeout: 258 seconds]
03:43:11qw3rty_ quits [Ping timeout: 258 seconds]
04:02:28DogsRNice quits [Read error: Connection reset by peer]
04:12:16<@OrIdow6>Sanqui: A while ago I collected a big list of subdomains for that
04:12:55<@OrIdow6>Including from a defunct directory application partially saved in the WBM (which went down a long time ago) IIRC
04:21:05Arcorann_ joins
04:32:20rsn joins
04:34:32rsn_ quits [Ping timeout: 250 seconds]
04:39:18Doranwen quits [Ping timeout: 250 seconds]
04:39:40nuroten quits [Remote host closed the connection]
04:45:12Doranwen (Doranwen) joins
04:51:48nuroten joins
05:01:52fuzzy8021 quits [Killed (NickServ (GHOST command used by fuzzy802!~fuzzy8021@173-224-26-244.ptcnet.net))]
05:01:58fuzzy8021 (fuzzy8021) joins
05:05:35sonick (sonick) joins
05:38:25IKI quits [Ping timeout: 244 seconds]
05:46:02thuban quits [Ping timeout: 250 seconds]
06:36:54qw3rty joins
06:42:26thuban joins
06:46:02Sylirana quits [Read error: Connection reset by peer]
06:46:24Sylirana (Sylirana) joins
07:09:14Arcorann_ quits [Ping timeout: 250 seconds]
07:09:21<@OrIdow6>Sanqui: Also, one thing I'd like to see happen, but probably amn't going to do myself, is to try to make some systematic effort to get all the self-hosted webcomics that are around
07:09:24rsn_ joins
07:09:29<@OrIdow6>Most seem to disappear so quickly
07:10:32rsn quits [Ping timeout: 250 seconds]
07:12:07<@OrIdow6>Around 2016/2017, I have noticed (from what I've seen in the WBM) that at least one person bulk SPNd a bunch of webcomics (to the point that, in 2017, theduckwebcomics.com was the 4th most SPNd domain - https://shawnw.io/presentations/RESAW19_Slides.pdf, page 15)
07:12:23<@OrIdow6>(Though I can't confirm it was the same person, obviously)
07:12:47<@OrIdow6>Ideally some of these people could be gathered and brought here to run them that way
07:13:01<@OrIdow6>I.e. through ArchiveBot
07:15:17sonick quits [Client Quit]
07:29:12<@OrIdow6>So I looked into that 3D model thing that's removing submissions from unregistered accounts going down the 24th, and you need an account to see most submissions (not sure why)
07:30:43<@OrIdow6>arkiver: Have you started any work on Tinkercad? If not, I will (and early this time)
07:33:41<@OrIdow6>(Hopefully)
07:48:56VukkyWork (VukkyWork) joins
08:03:53<Sanqui>OrIdow6: fun stuff.... I would love to see a comic project happen, and I would participate, but I can't afford to spearhead it
08:20:42Arcorann_ joins
08:26:54<@OrIdow6>Sanqui: And I'm afraid I probably wouldn't be able to lead it either... ArchiveTeam pipe dream #44245
08:55:38sneezey (sneezey) joins
09:04:48Wayward quits [Ping timeout: 258 seconds]
09:05:00Wayward- (wayward) joins
09:10:21BlueMaxima_ joins
09:11:17rsn joins
09:13:36rsn_ quits [Ping timeout: 250 seconds]
09:14:23BlueMaxima quits [Ping timeout: 258 seconds]
09:25:36Doran (Doranwen) joins
09:25:53Doranwen quits [Ping timeout: 258 seconds]
09:55:45Wingy2 (Wingy) joins
09:56:30Wingy quits [Ping timeout: 250 seconds]
09:56:30Wingy2 is now known as Wingy
10:10:09duce1337 (duce1337) joins
10:10:22Mineroboter quits [Ping timeout: 250 seconds]
10:10:27Mineroboter_ joins
10:11:11TigerbotHesh quits [Quit: ZNC - https://znc.in]
10:11:18TigerbotHesh joins
10:27:25LeighR (LeighR) joins
10:29:05Mineroboter joins
10:29:41<LeighR>If I want to submit a list of sites to be archived via transfer.archivete.am, should I list them as !a commands including --explain, or just as a list of URLs?
10:29:51Mineroboter_ quits [Ping timeout: 250 seconds]
10:31:58TigerbotHesh quits [Client Quit]
10:32:30TigerbotHesh joins
10:33:44Mineroboter quits [Ping timeout: 258 seconds]
10:34:09Mineroboter joins
10:38:26Mineroboter_ joins
10:38:31Mineroboter quits [Ping timeout: 250 seconds]
10:40:47<@HCross>just a list
10:40:49<@HCross>of URLs
10:41:43Mineroboter joins
10:43:19Mineroboter_ quits [Ping timeout: 258 seconds]
10:45:34Mineroboter_ joins
10:46:19Mineroboter quits [Ping timeout: 250 seconds]
10:47:24BlueMaxima_ quits [Read error: Connection reset by peer]
10:53:55Mineroboter joins
10:54:07Mineroboter_ quits [Ping timeout: 250 seconds]
10:56:13Mineroboter_ joins
10:58:39Mineroboter quits [Ping timeout: 258 seconds]
11:00:37Mineroboter_ quits [Ping timeout: 250 seconds]
11:03:12Mineroboter joins
11:05:39IKI joins
11:05:42Mineroboter_ joins
11:07:51Mineroboter quits [Ping timeout: 258 seconds]
11:13:56Mineroboter joins
11:14:03Mineroboter_ quits [Ping timeout: 250 seconds]
11:16:46Mineroboter_ joins
11:18:23Mineroboter quits [Ping timeout: 250 seconds]
11:23:35Mineroboter_ quits [Ping timeout: 250 seconds]
11:26:59Mineroboter joins
11:31:23Mineroboter quits [Ping timeout: 250 seconds]
11:33:29Mineroboter joins
11:38:08Mineroboter quits [Ping timeout: 258 seconds]
11:42:21Mineroboter joins
11:45:15Mineroboter_ joins
11:46:57Mineroboter quits [Ping timeout: 258 seconds]
11:47:54katocala quits [Remote host closed the connection]
11:50:01Mineroboter_ quits [Ping timeout: 258 seconds]
11:55:21<@arkiver>OrIdow6: go ahead and get tinkercad ready :)
11:55:29<@arkiver>please ping me when you have something
11:55:45Mineroboter joins
12:04:46VukkyWork quits [Remote host closed the connection]
12:09:49katocala joins
12:17:14thuban quits [Read error: Connection reset by peer]
12:17:29thuban joins
12:37:13<@EggplantN>Any preliminary data on tinkercad?
12:46:40icedice joins
12:47:10icedice quits [Client Quit]
13:03:06<@EggplantN>HCross excellent news on tinkercad. its aws
13:03:33<@HCross>oh good
13:03:35<@HCross>we can go brr
13:03:47<@HCross>assuming they're not throttling like reuters
13:03:53<@HCross>the bastards
13:10:05duce1337_ (duce1337) joins
13:10:05duce1337 quits [Read error: Connection reset by peer]
13:42:44webdownload quits [Remote host closed the connection]
13:45:56Wayward (wayward) joins
13:46:10Wayward- quits [Ping timeout: 258 seconds]
14:19:44benjinsmith joins
14:23:00benjins quits [Ping timeout: 250 seconds]
14:29:33benjinsmith is now known as benjins
15:16:18Arcorann_ quits [Ping timeout: 250 seconds]
15:34:13spirit joins
15:40:54godane1 joins
15:42:54godane quits [Read error: Connection reset by peer]
15:54:00<@OrIdow6>Be advised that at least one aspect of Tinkercad (downloads of raw? data, i.e. the most important thing) seems to happen dynamically, though it's still on AWS and seems to be behind a cache (though I doubt the latter will do much for unpopular submissions)
16:01:40<@EggplantN>Got any examples of the cache OrIdow6 just trying to do pre-emptive infra planning
16:01:47<@EggplantN>I know its rare for us at AT
16:02:02<@HCross>ArchiveTeam being prepared
16:02:04<@HCross>what is this shit
16:02:51<@OrIdow6>On size, I saw something a while ago that said there were >100m submissions, though this could be different due to the series of acquisitions, rearrangements, etc. they've done; downloads seem to mostly be in the 100kB-1MB range, though I don't know about item size (possibly at least 3x copies for each)
16:03:06<@OrIdow6>Let me look
16:03:29<@OrIdow6>I haven't gotten to a fine level of detail yet, so it's always possible there'll be some quirk or alternate method I don't know about now
16:05:15<@OrIdow6>EggplantN: What sort of information do you want? You have to be logged in (there's a valid login on bugmenot) to download anything
16:05:44<@EggplantN>Just an example URL or hostname where we download from :D
16:07:29<@OrIdow6>Host (for 3d files, anyway) seems to be csg.tinkercad.com
16:07:44<@OrIdow6>Other 2 file types I haven't looked at in as much detail
16:08:22<@OrIdow6>Example URL is https://csg.tinkercad.com/things/dltfDKEiVFB/polysoup.stl?rev=-1 , it will give you some info in headers but not send anything useful unless you have login cookies
16:08:33<@EggplantN>aight perfect
16:08:35<@OrIdow6>"as much detail" meaning "hardly any at all"
16:08:41<@HCross>that url looks like it's behind Telia/Cogent
16:08:50<@HCross>us-east elb
16:11:40<@OrIdow6>Will try to work on this one quickly, because I have a bad feeling about it
16:11:46<@OrIdow6>But for now I need to sleep anyhow
16:11:52<@HCross>Goodnight
16:12:26<@OrIdow6>It's much moreso morning in my time zone, but thank you in any case
16:15:24<Jake>(Could we get a channel setup for this earlier rather than later?)
16:17:13<@EggplantN>give us a pun Jake
16:17:26<@EggplantN>for a channel name
16:17:30<@HCross>tinkercrap?
16:17:37<@EggplantN>too easy
16:17:43<@EggplantN>we need something big brain
16:19:29<Jake>god I suck at channel names haha
16:19:38<Jake>i have literally nothing good
16:21:27<masterX244>Titanicad?
16:21:31<Aoede>tinkerhad
16:21:40<lunik1>I can go even smaller brain: tinkerbad
16:21:54<@EggplantN>good suggestions. HCross your decision I like both
16:22:14<@HCross>tinkerhad sounds better
16:25:46LeGoupil joins
16:27:20@HCross quits [Changing host]
16:27:20HCross (HCross) joins
16:27:20guybrush.hackint.org sets mode: +o HCross
16:28:50HP_Archivist (HP_Archivist) joins
16:35:07<Sanqui>pff
16:35:12<Sanqui>how about
16:35:13<Sanqui>stinkercad
16:38:31<@arkiver>#tinkerhad
16:41:03<@arkiver>OrIdow6: ^
16:44:40spirit quits [Client Quit]
16:55:59Daloader joins
17:12:04duce1337_ quits [Read error: Connection reset by peer]
17:12:04duce1337 (duce1337) joins
17:17:48DogsRNice (Webuser299) joins
17:55:59<@JAA>wordpress.com apparently introduced some shitty intermediate page for images on certain blogs. Accessing the image URL directly shows an HTML page instead of the image. Example: https://tiffanyosborn.files.wordpress.com/2017/11/byu-law.jpg
17:57:16<@JAA>They seem to use the Accept header for this. While Origin is also listed as 'Vary', that doesn't appear to matter, nor does Referer. So it doesn't look like a hotlink protection. Not sure why the fuck they'd do that then...
17:57:30<@JAA>Anyway, something to keep in mind if we ever do a wordpress.com project I suppose.
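The Accept-header behaviour described above can be checked with a small probe. This is a minimal sketch using only the Python standard library; the function names are illustrative, and the log does not say which Accept values trigger the HTML page, so the caller has to supply the values to compare.

```python
import urllib.request


def build_request(url, accept):
    """Build a GET request that differs from the default only in its Accept header."""
    return urllib.request.Request(url, headers={"Accept": accept})


def is_html(content_type):
    """Return True if a Content-Type header value denotes an HTML document."""
    return content_type.split(";")[0].strip().lower() == "text/html"


def probe(url, accept, timeout=30):
    """Fetch the URL with the given Accept header and return the response Content-Type.

    The body is not read; only the response headers are inspected.
    """
    with urllib.request.urlopen(build_request(url, accept), timeout=timeout) as resp:
        return resp.headers.get("Content-Type", "")
```

Calling probe() on the same image URL with two different Accept values and passing each result to is_html() would show whether the server switches between the image and the intermediate page based on that header alone.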
17:58:07<Ryz>This is a thing that cropped up a couple of months ago, this also applies to the Tumblr images too
17:58:54<Ryz>It has been consistent with Tumblr but inconsistent wildly with Wordpress-powered blogs
17:59:48<@JAA>It's probably only wordpress.com, not Wordpress blogs in general.
18:03:00<Ryz>Also doesn't help that a couple of months ago, there's some Wordpress blogs that introduced 'sponsored' posts that's in-between the first latest post and second latest post~
18:08:22HP_Archivist quits [Ping timeout: 258 seconds]
18:24:07LeighR quits [Ping timeout: 244 seconds]
18:44:53Mateon2 joins
18:46:19Mateon1 quits [Ping timeout: 258 seconds]
18:46:19Mateon2 is now known as Mateon1
19:08:02spirit joins
19:08:53HP_Archivist (HP_Archivist) joins
19:09:42<betamax>JAA: update on the UK elections stuff
19:09:46<betamax>(1) the twitter scrape has finished, with 20.7 million URLs (yikes!). For the 2019 UK general election there were 8 million URLs in the twitter scrape and I did them in 8 batches of 1M - is that a reasonable size / too large?
19:09:52<betamax>(2) I've split the candidate websites into chunks of 100, the first of which is at https://www.tardis.ed.ac.uk/~andrewferguson/uk_elections_2021_betamax/candidate_sites_split/candidate_sites_sublist_00
19:09:57<betamax>I've then removed all sites that look like party sites (e.g: "North East Green Party"), and removed any subdomains for a party (e.g: <name>.greenparty.com). I think that will prevent the outlinks from being too much of an issue?
19:10:33<betamax>(by "removed", I mean moved to a separate list that will need to be done without outlinks)
19:11:24<betamax>actually..... I should really add one of the subdomains back in (since there won't be any issues if there is just one)
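The filtering step betamax describes (move party subdomains to a separate list, but keep one per party in the main list) could look roughly like this. It is a sketch under assumptions: the party domain list is hypothetical, every entry is assumed to be a full URL with a scheme, and this is not betamax's actual script.

```python
from urllib.parse import urlsplit


def split_party_subdomains(urls, party_domains):
    """Partition URLs into (main, deferred).

    The first URL seen for each party domain stays in the main list;
    further URLs on the same party domain go to the deferred list,
    which would be run separately without outlinks.
    """
    main, deferred = [], []
    seen_parties = set()
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        # Match the party domain itself or any subdomain of it.
        party = next(
            (d for d in party_domains if host == d or host.endswith("." + d)),
            None,
        )
        if party is None:
            main.append(url)
        elif party not in seen_parties:
            seen_parties.add(party)
            main.append(url)
        else:
            deferred.append(url)
    return main, deferred
```

Matching on the hostname suffix rather than a plain substring avoids catching unrelated sites whose names merely contain a party's domain.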
19:11:54<@JAA>IIRC, running with --no-offsite-links does *not* prevent that crosslink issue. If anything, it only makes it even worse.
19:13:07<@JAA>On the Twitter lists: is there a significant account overlap with the 2019 ones? If so, it might be worth filtering those out.
19:14:15<betamax>hmm, not sure. I will try and dedupe now
19:15:03<betamax>How does running with --no-offsite-links not fix the issue? Surely it means it won't attempt to grab any crosslinks, so the problem can't occur?
19:15:58<@JAA>It won't attempt to grab them, but they're still added to the URL queue and then dismissed.
19:16:22<@JAA>I'm not 100 % sure about this as it's been a while since I've experimented with this, but that's how I remember it.
19:17:32<betamax>oh, that's a pain
19:17:38<@JAA>So using my example from yesterday: if you !a < a list that has example.org and example.net, and then the former has a link to example.net/foo/ and gets retrieved before a page from example.net linking there, example.net/foo/ will be added to the URL queue and (silently) ignored because its domain does not match the root URL's.
19:18:15<betamax>I assume there's no easy way to feed a list of URLs into AB so that each ends up as a separate job
19:18:36<betamax>(I imagine that trying to run each site as its own job would be ... inefficient)
19:20:13<@JAA>Not directly, but we've done it before on the US elections last year.
19:20:58<betamax>Did the output still end up in the wayback?
19:22:03<@JAA>It's a separate bot that queues things to ArchiveBot.
19:22:05<@JAA>So yes.
19:22:21<@JAA>I can get that up and running again.
19:22:38<betamax>I don't want to cause additional work / headache, but that would be incredibly helpful
19:23:07Barto quits [Ping timeout: 258 seconds]
19:27:56<betamax>twitter scrape is 17M URLs after deduping with the 2019 general election scrape
19:28:18<@JAA>Well, it's an improvement. :-)
19:33:09<betamax>Should I go with 1M URLs per job, or cap it lower (500k?)
19:33:16pcr leaves
19:33:21Sylirana quits [Ping timeout: 244 seconds]
19:34:13pcr joins
19:34:18<@JAA>1M should be fine. Ideally somewhat grouped by domain so the cookie jar issue doesn't come into play as much.
19:34:19Sylirana (Sylirana) joins
19:35:09<betamax>They're sorted (required to dedupe them) so roughly sorted by domain.
19:35:51<betamax>I'll put one in now then see how it goes before adding additional ones.
19:35:53<@JAA>Makes sense. I thought you might've done that. (I usually dedupe without sorting.)
19:35:57<@JAA>Sounds good.
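Splitting an already sorted URL list into jobs of a fixed size keeps URLs from the same domain mostly together, which is the grouping suggested above. A minimal sketch; the function name is illustrative and the batch size is whatever the operator settles on (1M in this discussion).

```python
from itertools import islice


def batches(lines, size):
    """Yield consecutive lists of at most `size` items, preserving input order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Because the input is consumed lazily, this works on an open file handle as well as on a list, holding only one batch in memory at a time.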
19:37:43<betamax>How, out of curiosity? (I sorted both lists then did 'comm' but there's probably a much faster way)
19:39:32<@JAA>I trade memory for runtime, basically. Load the entire base file into memory, then check each line of the second file against that and print if not present.
19:40:00<@JAA>So it doesn't work well for huge files, but up to some millions is fine.
19:40:42<@JAA>https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/dedupe
19:40:54Barto (Barto) joins
19:42:08<betamax>Ah. I'm using a shared server with 4GB RAM, so that probably wouldn't have worked well :)
19:42:53<rewby>I usually use sort -u for deduping
19:43:08<@JAA>Another option would be using a probabilistic data structure (e.g. bloom or cuckoo filter), but that assumes you're fine with imperfect results.
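The memory-for-runtime approach described above (load the base file into memory, then print each line of the second file that is not present) can be sketched as follows. This illustrates the idea only and is not the linked dedupe script; the function name is made up, and duplicates within the second input are deliberately left alone.

```python
def dedupe_against(base_lines, new_lines):
    """Yield lines from new_lines that do not occur in base_lines.

    The whole base input is held in a set, so memory use grows with its size,
    but neither input needs to be sorted and the output keeps its original order.
    """
    seen = {line.rstrip("\n") for line in base_lines}
    for line in new_lines:
        key = line.rstrip("\n")
        if key not in seen:
            yield key
```

Unlike the sort-and-comm approach, this preserves the order of the second file, at the cost of keeping the first one entirely in RAM.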
19:51:02Daloader quits [Ping timeout: 250 seconds]
19:59:13<@JAA>betamax: So for the websites, that's the two 'reprocessed lists' linked on https://wiki.archiveteam.org/index.php/Elections/2021_UK_elections I guess?
20:02:12themadpro quits [Quit: Connection closed for inactivity]
20:06:26Barto quits [Ping timeout: 258 seconds]
20:12:10Barto (Barto) joins
20:19:36<betamax>Yes, that's the ones.
20:21:13<@JAA>Ok, I'll look into getting that started later.
20:21:48<betamax>Thanks very much!
20:41:04cmlow quits [Client Quit]
20:44:07LeGoupil quits [Client Quit]
20:48:06duce1337_ (duce1337) joins
20:48:06duce1337 quits [Read error: Connection reset by peer]
21:01:35nuroten quits [Remote host closed the connection]
21:16:10Sylirana quits [Ping timeout: 244 seconds]
21:16:59Sylirana (Sylirana) joins
21:20:42HP_Archivist quits [Client Quit]
21:27:07<Ryz>Welp, I better make a lot more room; I was about to toss in more forums into ArchiveBot but the elections thing is more important~
21:42:50<betamax>Ryz: thanks - hopefully it won't take too long to get all the election stuff through
21:43:20<betamax>and I've noticed that some of the election sites are gone already, it's a shame that more value isn't placed upon keeping such sites around
21:47:34tech234a quits [Quit: Connection closed for inactivity]
22:07:17redlizard leaves [Konversation terminated!]
22:42:47duce1337_ quits [Client Quit]
22:47:29nuroten joins
22:48:18tech234a (tech234a) joins
23:06:53Aoede quits [Ping timeout: 250 seconds]
23:07:06Aoede (Aoede) joins
23:07:45grawity quits [Ping timeout: 250 seconds]
23:07:58grawity (grawity) joins
23:10:21Matthww quits [Quit: Ping timeout (120 seconds)]
23:10:44Matthww joins
23:16:32Mineroboter quits [Client Quit]
23:18:52Mineroboter joins
23:19:30teej quits [Quit: Connection closed for inactivity]
23:30:10BlueMaxima joins
23:44:25DogsRNice quits [Read error: Connection reset by peer]