| 00:37:52 | | Mineroboter joins |
| 00:39:14 | | Mineroboter_ quits [Ping timeout: 250 seconds] |
| 00:55:17 | | Arcorann_ quits [Ping timeout: 258 seconds] |
| 01:02:54 | | dm4v quits [Client Quit] |
| 01:03:06 | | dm4v joins |
| 01:03:08 | | dm4v is now authenticated as dm4v |
| 01:03:08 | | dm4v quits [Changing host] |
| 01:03:08 | | dm4v (dm4v) joins |
| 01:44:49 | | Wayward quits [Read error: Connection reset by peer] |
| 01:45:25 | | Wayward (wayward) joins |
| 01:52:01 | | Wayward quits [Remote host closed the connection] |
| 01:53:11 | | Wayward (wayward) joins |
| 02:21:53 | | JackThompson joins |
| 02:23:20 | | monoxane9 (monoxane) joins |
| 02:24:32 | | monoxane quits [Ping timeout: 250 seconds] |
| 02:24:32 | | monoxane9 is now known as monoxane |
| 02:25:45 | | Jack_Thompson quits [Ping timeout: 258 seconds] |
| 03:38:25 | | qw3rty_ joins |
| 03:42:02 | | qw3rty__ quits [Ping timeout: 258 seconds] |
| 03:43:11 | | qw3rty_ quits [Ping timeout: 258 seconds] |
| 04:02:28 | | DogsRNice quits [Read error: Connection reset by peer] |
| 04:12:16 | <@OrIdow6> | Sanqui: A while ago I collected a big list of subdomains for that |
| 04:12:55 | <@OrIdow6> | Including from a defunct directory application partially saved in the WBM (which went down a long time ago) IIRC |
| 04:21:05 | | Arcorann_ joins |
| 04:32:20 | | rsn joins |
| 04:34:32 | | rsn_ quits [Ping timeout: 250 seconds] |
| 04:39:18 | | Doranwen quits [Ping timeout: 250 seconds] |
| 04:39:40 | | nuroten quits [Remote host closed the connection] |
| 04:45:12 | | Doranwen (Doranwen) joins |
| 04:51:48 | | nuroten joins |
| 05:01:52 | | fuzzy8021 quits [Killed (NickServ (GHOST command used by fuzzy802!~fuzzy8021@173-224-26-244.ptcnet.net))] |
| 05:01:58 | | fuzzy8021 (fuzzy8021) joins |
| 05:05:35 | | sonick (sonick) joins |
| 05:38:25 | | IKI quits [Ping timeout: 244 seconds] |
| 05:46:02 | | thuban quits [Ping timeout: 250 seconds] |
| 06:36:54 | | qw3rty joins |
| 06:42:26 | | thuban joins |
| 06:46:02 | | Sylirana quits [Read error: Connection reset by peer] |
| 06:46:24 | | Sylirana (Sylirana) joins |
| 07:09:14 | | Arcorann_ quits [Ping timeout: 250 seconds] |
| 07:09:21 | <@OrIdow6> | Sanqui: Also, one thing I'd like to see happen, but probably amn't going to do myself, is to try to make some systematic effort to get all the self-hosted webcomics that are around |
| 07:09:24 | | rsn_ joins |
| 07:09:29 | <@OrIdow6> | Most seem to disappear so quickly |
| 07:10:32 | | rsn quits [Ping timeout: 250 seconds] |
| 07:12:07 | <@OrIdow6> | Around 2016/2017, I have noticed (from what I've seen in the WBM) that at least one person bulk SPNd a bunch of webcomics (to the point that, in 2017, theduckwebcomics.com was the 4th most SPNd domain - https://shawnw.io/presentations/RESAW19_Slides.pdf, page 15) |
| 07:12:23 | <@OrIdow6> | (Though I can't confirm it was the same person, obviously) |
| 07:12:47 | <@OrIdow6> | Ideally some of these people could be gathered and brought here to run them that way |
| 07:13:01 | <@OrIdow6> | I.e. through ArchiveBot |
| 07:15:17 | | sonick quits [Client Quit] |
| 07:29:12 | <@OrIdow6> | So I looked into that 3D model thing that's removing submissions from unregistered accounts going down the 24th, and you need an account to see most submissions (not sure why) |
| 07:30:43 | <@OrIdow6> | arkiver: Have you started any work on Tinkercad? If not, I will (and early this time) |
| 07:33:41 | <@OrIdow6> | (Hopefully) |
| 07:48:56 | | VukkyWork (VukkyWork) joins |
| 08:03:53 | <Sanqui> | OrIdow6: fun stuff.... I would love to see a comic project happen, and I would participate, but I can't afford to spearhead it |
| 08:20:42 | | Arcorann_ joins |
| 08:26:54 | <@OrIdow6> | Sanqui: And I'm afraid I probably wouldn't be able to lead it either... ArchiveTeam pipe dream #44245 |
| 08:55:38 | | sneezey (sneezey) joins |
| 09:04:48 | | Wayward quits [Ping timeout: 258 seconds] |
| 09:05:00 | | Wayward- (wayward) joins |
| 09:10:21 | | BlueMaxima_ joins |
| 09:11:17 | | rsn joins |
| 09:13:36 | | rsn_ quits [Ping timeout: 250 seconds] |
| 09:14:23 | | BlueMaxima quits [Ping timeout: 258 seconds] |
| 09:25:36 | | Doran (Doranwen) joins |
| 09:25:53 | | Doranwen quits [Ping timeout: 258 seconds] |
| 09:55:45 | | Wingy2 (Wingy) joins |
| 09:56:30 | | Wingy quits [Ping timeout: 250 seconds] |
| 09:56:30 | | Wingy2 is now known as Wingy |
| 10:10:09 | | duce1337 (duce1337) joins |
| 10:10:22 | | Mineroboter quits [Ping timeout: 250 seconds] |
| 10:10:27 | | Mineroboter_ joins |
| 10:11:11 | | TigerbotHesh quits [Quit: ZNC - https://znc.in] |
| 10:11:18 | | TigerbotHesh joins |
| 10:27:25 | | LeighR (LeighR) joins |
| 10:29:05 | | Mineroboter joins |
| 10:29:41 | <LeighR> | If I want to submit a list of sites to be archived via transfer.archivete.am, should I list them as !a commands including --explain, or just as a list of URLs? |
| 10:29:51 | | Mineroboter_ quits [Ping timeout: 250 seconds] |
| 10:31:58 | | TigerbotHesh quits [Client Quit] |
| 10:32:30 | | TigerbotHesh joins |
| 10:33:44 | | Mineroboter quits [Ping timeout: 258 seconds] |
| 10:34:09 | | Mineroboter joins |
| 10:38:26 | | Mineroboter_ joins |
| 10:38:31 | | Mineroboter quits [Ping timeout: 250 seconds] |
| 10:40:47 | <@HCross> | just a list |
| 10:40:49 | <@HCross> | of URLs |
| 10:41:43 | | Mineroboter joins |
| 10:43:19 | | Mineroboter_ quits [Ping timeout: 258 seconds] |
| 10:45:34 | | Mineroboter_ joins |
| 10:46:19 | | Mineroboter quits [Ping timeout: 250 seconds] |
| 10:47:24 | | BlueMaxima_ quits [Read error: Connection reset by peer] |
| 10:53:55 | | Mineroboter joins |
| 10:54:07 | | Mineroboter_ quits [Ping timeout: 250 seconds] |
| 10:56:13 | | Mineroboter_ joins |
| 10:58:39 | | Mineroboter quits [Ping timeout: 258 seconds] |
| 11:00:37 | | Mineroboter_ quits [Ping timeout: 250 seconds] |
| 11:03:12 | | Mineroboter joins |
| 11:05:39 | | IKI joins |
| 11:05:42 | | Mineroboter_ joins |
| 11:07:51 | | Mineroboter quits [Ping timeout: 258 seconds] |
| 11:13:56 | | Mineroboter joins |
| 11:14:03 | | Mineroboter_ quits [Ping timeout: 250 seconds] |
| 11:16:46 | | Mineroboter_ joins |
| 11:18:23 | | Mineroboter quits [Ping timeout: 250 seconds] |
| 11:23:35 | | Mineroboter_ quits [Ping timeout: 250 seconds] |
| 11:26:59 | | Mineroboter joins |
| 11:31:23 | | Mineroboter quits [Ping timeout: 250 seconds] |
| 11:33:29 | | Mineroboter joins |
| 11:38:08 | | Mineroboter quits [Ping timeout: 258 seconds] |
| 11:42:21 | | Mineroboter joins |
| 11:45:15 | | Mineroboter_ joins |
| 11:46:57 | | Mineroboter quits [Ping timeout: 258 seconds] |
| 11:47:54 | | katocala quits [Remote host closed the connection] |
| 11:50:01 | | Mineroboter_ quits [Ping timeout: 258 seconds] |
| 11:55:21 | <@arkiver> | OrIdow6: go ahead and get tinkercad ready :) |
| 11:55:29 | <@arkiver> | please ping me when you have something |
| 11:55:45 | | Mineroboter joins |
| 12:04:46 | | VukkyWork quits [Remote host closed the connection] |
| 12:09:49 | | katocala joins |
| 12:10:18 | | katocala is now authenticated as katocala |
| 12:17:14 | | thuban quits [Read error: Connection reset by peer] |
| 12:17:29 | | thuban joins |
| 12:37:13 | <@EggplantN> | Any preliminary data on tinkercad? |
| 12:46:40 | | icedice joins |
| 12:47:10 | | icedice quits [Client Quit] |
| 13:03:06 | <@EggplantN> | HCross excellent news on tinkercad. it's AWS |
| 13:03:33 | <@HCross> | oh good |
| 13:03:35 | <@HCross> | we can go brr |
| 13:03:47 | <@HCross> | assuming they're not throttling like reuters |
| 13:03:53 | <@HCross> | the bastards |
| 13:10:05 | | duce1337_ (duce1337) joins |
| 13:10:05 | | duce1337 quits [Read error: Connection reset by peer] |
| 13:42:44 | | webdownload quits [Remote host closed the connection] |
| 13:45:56 | | Wayward (wayward) joins |
| 13:46:10 | | Wayward- quits [Ping timeout: 258 seconds] |
| 14:19:44 | | benjinsmith joins |
| 14:23:00 | | benjins quits [Ping timeout: 250 seconds] |
| 14:29:33 | | benjinsmith is now known as benjins |
| 14:29:34 | | benjins is now authenticated as benjins |
| 15:16:18 | | Arcorann_ quits [Ping timeout: 250 seconds] |
| 15:34:13 | | spirit joins |
| 15:40:54 | | godane1 joins |
| 15:42:54 | | godane quits [Read error: Connection reset by peer] |
| 15:54:00 | <@OrIdow6> | Be advised that at least one aspect of Tinkercad (downloads of raw? data, i.e. the most important thing) seems to happen dynamically, though it's still on AWS and seems to be behind a cache (though I doubt the latter will do much for unpopular submissions) |
| 16:01:40 | <@EggplantN> | Got any examples of the cache, OrIdow6? Just trying to do pre-emptive infra planning |
| 16:01:47 | <@EggplantN> | I know it's rare for us at AT |
| 16:02:02 | <@HCross> | ArchiveTeam being prepared |
| 16:02:04 | <@HCross> | what is this shit |
| 16:02:51 | <@OrIdow6> | On size, I saw something a while ago that said there were >100m submissions, though this could be different due to the series of acquisitions, rearrangements, etc. they've done; downloads seem to mostly be in the 100kB-1MB range, though I don't know about item size (possible at least 3x copies for each) |
| 16:03:06 | <@OrIdow6> | Let me look |
| 16:03:29 | <@OrIdow6> | I haven't gotten to a fine level of detail yet, so it's always possible there'll be some quirk or alternate method I don't know about now |
| 16:05:15 | <@OrIdow6> | EggplantN: What sort of information do you want? You have to be logged in (there's a valid one on bugmenot) to download anything |
| 16:05:44 | <@EggplantN> | Just an example URL or hostname where we download from :D |
| 16:07:29 | <@OrIdow6> | Host (for 3d files, anyway) seems to be csg.tinkercad.com |
| 16:07:44 | <@OrIdow6> | Other 2 file types I haven't looked at in as much detail |
| 16:08:22 | <@OrIdow6> | Example URL is https://csg.tinkercad.com/things/dltfDKEiVFB/polysoup.stl?rev=-1 , it will give you some info in headers but not send anything useful unless you have login cookies |
| 16:08:33 | <@EggplantN> | aight perfect |
| 16:08:35 | <@OrIdow6> | "as much detail" meaning "hardly any at all" |
| 16:08:41 | <@HCross> | that url looks like it's behind Telia/Cogent |
| 16:08:50 | <@HCross> | us-east elb |
| 16:11:40 | <@OrIdow6> | Will try to work on this one quickly, because I have a bad feeling about it |
| 16:11:46 | <@OrIdow6> | But for now I need to sleep anyhow |
| 16:11:52 | <@HCross> | Goodnight |
| 16:12:26 | <@OrIdow6> | It's much moreso morning in my time zone, but thank you in any case |
| 16:15:24 | <Jake> | (Could we get a channel setup for this earlier rather than later?) |
| 16:17:13 | <@EggplantN> | give us a pun Jake |
| 16:17:26 | <@EggplantN> | for a channel name |
| 16:17:30 | <@HCross> | tinkercrap? |
| 16:17:37 | <@EggplantN> | too easy |
| 16:17:43 | <@EggplantN> | we need something big brain |
| 16:19:29 | <Jake> | god I suck at channel names haha |
| 16:19:38 | <Jake> | i have literally nothing good |
| 16:21:27 | <masterX244> | Titanicad? |
| 16:21:31 | <Aoede> | tinkerhad |
| 16:21:40 | <lunik1> | I can go even smaller brain: tinkerbad |
| 16:21:54 | <@EggplantN> | good suggestions. HCross your decision I like both |
| 16:22:14 | <@HCross> | tinkerhad sounds better |
| 16:25:46 | | LeGoupil joins |
| 16:27:20 | | @HCross quits [Changing host] |
| 16:27:20 | | HCross (HCross) joins |
| 16:27:20 | | guybrush.hackint.org sets mode: +o HCross |
| 16:28:50 | | HP_Archivist (HP_Archivist) joins |
| 16:35:07 | <Sanqui> | pff |
| 16:35:12 | <Sanqui> | how about |
| 16:35:13 | <Sanqui> | stinkercad |
| 16:38:31 | <@arkiver> | #tinkerhad |
| 16:41:03 | <@arkiver> | OrIdow6: ^ |
| 16:44:40 | | spirit quits [Client Quit] |
| 16:55:59 | | Daloader joins |
| 17:12:04 | | duce1337_ quits [Read error: Connection reset by peer] |
| 17:12:04 | | duce1337 (duce1337) joins |
| 17:17:48 | | DogsRNice (Webuser299) joins |
| 17:55:59 | <@JAA> | wordpress.com apparently introduced some shitty intermediate page for images on certain blogs. Accessing the image URL directly shows an HTML page instead of the image. Example: https://tiffanyosborn.files.wordpress.com/2017/11/byu-law.jpg |
| 17:57:16 | <@JAA> | They seem to use the Accept header for this. While Origin is also listed as 'Vary', that doesn't appear to matter, nor does Referer. So it doesn't look like a hotlink protection. Not sure why the fuck they'd do that then... |
| 17:57:30 | <@JAA> | Anyway, something to keep in mind if we ever do a wordpress.com project I suppose. |
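
One way to check the Accept-header behaviour described above is to request the same image URL with different Accept values and compare the Content-Type that comes back. The following is a minimal sketch using Python's standard `urllib`; the helper names `build_request` and `content_type` are made up for illustration, and which Accept values trigger the HTML page is not stated in the log, so any values tried are guesses.

```python
import urllib.request


def build_request(url: str, accept: str) -> urllib.request.Request:
    """Build a GET request whose only custom header is Accept (hypothetical helper)."""
    return urllib.request.Request(url, headers={"Accept": accept})


def content_type(url: str, accept: str, timeout: float = 30.0) -> str:
    """Fetch the URL and return the Content-Type header of the response."""
    with urllib.request.urlopen(build_request(url, accept), timeout=timeout) as resp:
        return resp.headers.get("Content-Type", "")
```

Calling `content_type(url, "image/*")` and `content_type(url, "text/html")` on the example URL and comparing the two results would show whether the server is switching on Accept alone.
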
| 17:58:07 | <Ryz> | This is a thing that cropped up a couple of months ago; it also applies to Tumblr images |
| 17:58:54 | <Ryz> | It has been consistent with Tumblr but wildly inconsistent with Wordpress-powered blogs |
| 17:59:48 | <@JAA> | It's probably only wordpress.com, not Wordpress blogs in general. |
| 18:03:00 | <Ryz> | Also doesn't help that a couple of months ago, there's some Wordpress blogs that introduced 'sponsored' posts that's in-between the first latest post and second latest post~ |
| 18:08:22 | | HP_Archivist quits [Ping timeout: 258 seconds] |
| 18:24:07 | | LeighR quits [Ping timeout: 244 seconds] |
| 18:44:53 | | Mateon2 joins |
| 18:46:19 | | Mateon1 quits [Ping timeout: 258 seconds] |
| 18:46:19 | | Mateon2 is now known as Mateon1 |
| 19:08:02 | | spirit joins |
| 19:08:53 | | HP_Archivist (HP_Archivist) joins |
| 19:09:42 | <betamax> | JAA: update on the UK elections stuff |
| 19:09:46 | <betamax> | (1) the twitter scrape has finished, with 20.7 million URLs (yikes!). For the 2019 UK general election there were 8 million URLs in the twitter scrape and I did them in 8 batches of 1M - is that a reasonable size / too large? |
| 19:09:52 | <betamax> | (2) I've split the candidate websites into chunks of 100, the first of which is at https://www.tardis.ed.ac.uk/~andrewferguson/uk_elections_2021_betamax/candidate_sites_split/candidate_sites_sublist_00 |
| 19:09:57 | <betamax> | I've then removed all sites that look like party sites (e.g: "North East Green Party"), and removed any subdomains for a party (e.g: <name>.greenparty.com). I think that will prevent the outlinks from being too much of an issue? |
| 19:10:33 | <betamax> | (by "removed", I mean moved to a separate list that will need to be done without outlinks) |
| 19:11:24 | <betamax> | actually..... I should really add one of the subdomains back in (since there won't be any issues if there is just one) |
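
The subdomain part of the filtering betamax describes could look roughly like the sketch below. It only covers the "subdomain of a party domain" rule; spotting sites that merely look like party sites by name is a manual judgement it does not attempt. The function names and the `party_domains` input are assumptions for illustration, and URLs are assumed to carry a scheme so that `urlsplit` can find the hostname.

```python
from urllib.parse import urlsplit


def is_under(host: str, domain: str) -> bool:
    """True if host is the domain itself or any subdomain of it."""
    return host == domain or host.endswith("." + domain)


def split_sites(urls, party_domains):
    """Return (standalone, party): party holds URLs on a party domain or subdomain."""
    standalone, party = [], []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if any(is_under(host, d) for d in party_domains):
            party.append(url)       # to be run separately, without outlinks
        else:
            standalone.append(url)  # safe to keep in the main list
    return standalone, party
```
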
| 19:11:54 | <@JAA> | IIRC, running with --no-offsite-links does *not* prevent that crosslink issue. If anything, it only makes it even worse. |
| 19:13:07 | <@JAA> | On the Twitter lists: is there a significant account overlap with the 2019 ones? If so, it might be worth filtering those out. |
| 19:14:15 | <betamax> | hmm, not sure. I will try and dedupe now |
| 19:15:03 | <betamax> | How does running with --no-offsite-links not fix the issue? Surely it means it won't attempt to grab any crosslinks, so the problem can't occur? |
| 19:15:58 | <@JAA> | It won't attempt to grab them, but they're still added to the URL queue and then dismissed. |
| 19:16:22 | <@JAA> | I'm not 100 % sure about this as it's been a while since I've experimented with this, but that's how I remember it. |
| 19:17:32 | <betamax> | oh, that's a pain |
| 19:17:38 | <@JAA> | So using my example from yesterday: if you !a < a list that has example.org and example.net, and then the former has a link to example.net/foo/ and gets retrieved before a page from example.net linking there, example.net/foo/ will be added to the URL queue and (silently) ignored because its domain does not match the root URL's. |
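
A toy model of the ordering problem JAA describes, under the assumption (his recollection, which he says he is not fully sure of) that a URL is decided once, when it is first added to the queue, and that later sightings of the same URL change nothing. This is not ArchiveBot or wpull code; `simulate` and its input format are invented for the illustration, and the parent page's host stands in for the root URL's domain.

```python
from urllib.parse import urlsplit


def host(url: str) -> str:
    return urlsplit(url).hostname or ""


def simulate(discoveries):
    """discoveries: ordered (parent_url, found_url) pairs.

    Returns {found_url: "fetch" or "ignored"}, decided on first sighting only.
    """
    decided = {}
    for parent, found in discoveries:
        if found in decided:
            continue  # already in the queue; a later sighting is a duplicate
        decided[found] = "fetch" if host(parent) == host(found) else "ignored"
    return decided
```

In this model, if example.org links to example.net/foo/ before any example.net page does, the URL ends up "ignored" even though example.net is itself in the list; with the order reversed it is fetched.
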
| 19:18:15 | <betamax> | I assume there's no easy way to feed a list of URLs into AB so that each ends up as a separate job |
| 19:18:36 | <betamax> | (I imagine that trying to run each site as its own job would be ... inefficient) |
| 19:20:13 | <@JAA> | Not directly, but we've done it before on the US elections last year. |
| 19:20:58 | <betamax> | Did the output still end up in the wayback? |
| 19:22:03 | <@JAA> | It's a separate bot that queues things to ArchiveBot. |
| 19:22:05 | <@JAA> | So yes. |
| 19:22:21 | <@JAA> | I can get that up and running again. |
| 19:22:38 | <betamax> | I don't want to cause additional work / headache, but that would be incredibly helpful |
| 19:23:07 | | Barto quits [Ping timeout: 258 seconds] |
| 19:27:56 | <betamax> | twitter scrape is 17M URLs after deduping with the 2019 general election scrape |
| 19:28:18 | <@JAA> | Well, it's an improvement. :-) |
| 19:33:09 | <betamax> | Should I go with 1M URLs per job, or cap it lower (500k?) |
| 19:33:16 | | pcr leaves |
| 19:33:21 | | Sylirana quits [Ping timeout: 244 seconds] |
| 19:34:13 | | pcr joins |
| 19:34:18 | <@JAA> | 1M should be fine. Ideally somewhat grouped by domain so the cookie jar issue doesn't come into play as much. |
| 19:34:19 | | Sylirana (Sylirana) joins |
| 19:35:09 | <betamax> | They're sorted (required to dedupe them) so roughly sorted by domain. |
| 19:35:51 | <betamax> | I'll put one in now then see how it goes before adding additional ones. |
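
Splitting a large URL list into jobs of a fixed size while keeping each domain's URLs together might be sketched as follows. The 1M default comes from the discussion above; the function name `batches` and the choice to sort explicitly by hostname (rather than rely on a plain lexical sort, as betamax did) are assumptions.

```python
from itertools import islice
from urllib.parse import urlsplit


def batches(urls, size=1_000_000):
    """Yield lists of at most `size` URLs, ordered so each host's URLs are adjacent."""
    ordered = sorted(urls, key=lambda u: (urlsplit(u).hostname or "", u))
    it = iter(ordered)
    while chunk := list(islice(it, size)):
        yield chunk
```

A host's URLs can still straddle a batch boundary, so this reduces rather than removes the chance of one domain being split across two jobs.
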
| 19:35:53 | <@JAA> | Makes sense. I thought you might've done that. (I usually dedupe without sorting.) |
| 19:35:57 | <@JAA> | Sounds good. |
| 19:37:43 | <betamax> | How, out of curiosity? (I sorted both lists then did 'comm' but there's probably a much faster way) |
| 19:39:32 | <@JAA> | I trade memory for runtime, basically. Load the entire base file into memory, then check each line of the second file against that and print if not present. |
| 19:40:00 | <@JAA> | So it doesn't work well for huge files, but up to some millions is fine. |
| 19:40:42 | <@JAA> | https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/dedupe |
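
The memory-for-runtime approach JAA describes can be sketched in a few lines of Python. This is an illustration of the idea as stated in the chat, not the linked script; the function name `new_lines` is made up, and duplicates within the second file are deliberately left alone because the description does not say how they are handled.

```python
def new_lines(base_lines, candidate_lines):
    """Yield each candidate line that does not appear anywhere in base_lines.

    base_lines is loaded fully into a set; candidate_lines is streamed.
    """
    seen = {line.rstrip("\n") for line in base_lines}
    for line in candidate_lines:
        key = line.rstrip("\n")
        if key not in seen:
            yield key
```

With two open file objects, `for url in new_lines(base_file, new_file): print(url)` prints the lines unique to the second file without requiring either input to be sorted.
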
| 19:40:54 | | Barto (Barto) joins |
| 19:42:08 | <betamax> | Ah. I'm using a shared server with 4GB RAM, so that probably wouldn't have worked well :) |
| 19:42:53 | <rewby> | I usually use sort -u for deduping |
| 19:43:08 | <@JAA> | Another option would be using a probabilistic data structure (e.g. bloom or cuckoo filter), but that assumes you're fine with imperfect results. |
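
For reference, a Bloom filter along the lines JAA mentions can be sketched as below. It stores only a bit array, so memory stays fixed regardless of URL length, at the cost of occasional false positives: a URL that was never added may be reported as present and would then be wrongly dropped during deduping. The class, its parameters, and the SHA-256 double-hashing scheme are choices made for this illustration only.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```
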
| 19:51:02 | | Daloader quits [Ping timeout: 250 seconds] |
| 19:59:13 | <@JAA> | betamax: So for the websites, that's the two 'reprocessed lists' linked on https://wiki.archiveteam.org/index.php/Elections/2021_UK_elections I guess? |
| 20:02:12 | | themadpro quits [Quit: Connection closed for inactivity] |
| 20:06:26 | | Barto quits [Ping timeout: 258 seconds] |
| 20:12:10 | | Barto (Barto) joins |
| 20:19:36 | <betamax> | Yes, that's the ones. |
| 20:21:13 | <@JAA> | Ok, I'll look into getting that started later. |
| 20:21:48 | <betamax> | Thanks very much! |
| 20:41:04 | | cmlow quits [Client Quit] |
| 20:44:07 | | LeGoupil quits [Client Quit] |
| 20:48:06 | | duce1337_ (duce1337) joins |
| 20:48:06 | | duce1337 quits [Read error: Connection reset by peer] |
| 21:01:35 | | nuroten quits [Remote host closed the connection] |
| 21:16:10 | | Sylirana quits [Ping timeout: 244 seconds] |
| 21:16:59 | | Sylirana (Sylirana) joins |
| 21:20:42 | | HP_Archivist quits [Client Quit] |
| 21:27:07 | <Ryz> | Welp, I better make a lot more room; I was about to toss in more forums into ArchiveBot but the elections thing is more important~ |
| 21:42:50 | <betamax> | Ryz: thanks - hopefully it won't take too long to get all the election stuff through |
| 21:43:20 | <betamax> | and I've noticed that some of the election sites are gone already, it's a shame that more value isn't placed upon keeping such sites around |
| 21:47:34 | | tech234a quits [Quit: Connection closed for inactivity] |
| 22:07:17 | | redlizard leaves [Konversation terminated!] |
| 22:42:47 | | duce1337_ quits [Client Quit] |
| 22:47:29 | | nuroten joins |
| 22:48:18 | | tech234a (tech234a) joins |
| 23:06:53 | | Aoede quits [Ping timeout: 250 seconds] |
| 23:07:06 | | Aoede (Aoede) joins |
| 23:07:45 | | grawity quits [Ping timeout: 250 seconds] |
| 23:07:58 | | grawity (grawity) joins |
| 23:10:21 | | Matthww quits [Quit: Ping timeout (120 seconds)] |
| 23:10:44 | | Matthww joins |
| 23:16:32 | | Mineroboter quits [Client Quit] |
| 23:18:52 | | Mineroboter joins |
| 23:19:30 | | teej quits [Quit: Connection closed for inactivity] |
| 23:30:10 | | BlueMaxima joins |
| 23:44:25 | | DogsRNice quits [Read error: Connection reset by peer] |