| 00:17:03 | | Arcorann (Arcorann) joins |
| 00:20:13 | <@JAA> | Hmm, JCenter seems to be throwing 403 for all files for me. |
| 00:28:12 | <@OrIdow6> | Mass download any of it before? |
| 00:28:16 | <@OrIdow6> | Regular Bintray still working |
| 00:28:23 | <@OrIdow6> | So I think I have this in a working state |
| 00:30:49 | | Sylirana quits [Read error: Connection reset by peer] |
| 00:31:17 | | Sylirana (Sylirana) joins |
| 00:34:43 | <@JAA> | JCenter might return. It's supposed to stay working until next year. |
| 00:35:02 | <@JAA> | But yeah, let's get Bintray itself running ASAP. |
| 00:35:11 | <@OrIdow6> | Sort of multitasking here |
| 00:35:15 | <@JAA> | We can't download anything from JCenter without Bintray anyway. |
| 00:35:20 | <@OrIdow6> | Will try to get it uploaded in a few minutes |
| 00:44:58 | <@OrIdow6> | Alright https://github.com/OrIdow6/bintray-grab should be good aside from the cosmetic/branding thing (it's still SMMB in that area) and the backfeed URL |
| 00:45:40 | <@JAA> | arkiver: ^ |
| 00:52:18 | <etnguyen03> | Is there an IRC for bintray? |
| 00:52:33 | <etnguyen03> | *a channel |
| 00:52:46 | <@JAA> | Not so far. |
| 00:58:21 | | Mineroboter_ joins |
| 00:59:56 | | Mineroboter quits [Ping timeout: 250 seconds] |
| 01:00:12 | <@OrIdow6> | Very rough around the edges, of course, but it should get all info |
| 01:00:23 | <@OrIdow6> | Site used POST for a bunch of stuff anyway |
| 01:13:56 | <@OrIdow6> | So I am trying to use a Japanese IP address, with Accept-Language: ja, a realistic UA, am getting the "counter", am using a 10-second + random delay, and am being conservative w/ the URLs I visit, and am still being blocked |
| 01:14:03 | <@OrIdow6> | From Aimix-Z |
| 01:14:42 | <@JAA> | Oof |
| 01:21:25 | <nyany> | that's some world class automation detection if i've seen it |
| 01:22:03 | <nyany> | or are you behaving like a normal user |
| 01:22:12 | <@OrIdow6> | If it stays alive over the weekend I may have enough time to try it more |
| 01:22:16 | <nyany> | wow thanks irccloud |
| 01:22:31 | <nyany> | i meant to say are you jumping to extremes or are you behaving like a normal user |
| 01:22:37 | <@OrIdow6> | No, this is a crawler that a human reading logs could detect easily |
| 01:22:43 | <nyany> | fair |
| 01:22:57 | <@OrIdow6> | Unless you were a very methodical user of a text browser |
| 01:23:15 | <@OrIdow6> | *Unless you suspected they were |
| 01:33:33 | | rsn_ quits [Quit: Leaving] |
| 01:44:00 | | pcr leaves |
| 01:53:33 | | Wayward quits [Read error: Connection reset by peer] |
| 01:53:35 | | Wayward- (wayward) joins |
| 01:55:42 | | Iki joins |
| 02:12:36 | | pcr joins |
| 02:55:24 | <atphoenix> | maybe they have a human reading logs... and/or detect that only text resources are accessed and not other resources. Kind of like the inverse of a bot trap URL. (if the browser doesn't get all the resources normally accessed by a graphical browser, consider the user to be a bot) |
| 02:56:37 | <@OrIdow6> | I've tried to consider that |
| 02:57:00 | <@OrIdow6> | So it does get images, and I also get a "counter"/analytics URL that every page got but seemed not to have a purpose |
| 02:57:17 | <@OrIdow6> | But due to the nature of the grab setup there is a long delay in some cases |
| 02:59:04 | <atphoenix> | bintray -> binnedtray or spilledtray |
| 02:59:14 | <@OrIdow6> | Well it shuts down in 4 hours |
| 02:59:26 | <atphoenix> | Bintray's website says "UPDATE 4/27/2021: We listened to the community and will keep JCenter as a read-only repository indefinitely. Our customers and the community can continue to rely on JCenter as a reliable mirror for Java packages." |
| 02:59:36 | <@OrIdow6> | JCenter ~= main Bintray |
| 02:59:40 | <thuban> | how often does grab-site check the 'delay' file? (does it depend on the current delay?) |
| 02:59:46 | <@OrIdow6> | *!= |
| 03:00:04 | <@JAA> | JCenter is kind of integrated into Bintray but also standalone. |
| 03:00:30 | <@JAA> | You can't discover the content on JCenter once Bintray is down, even though it will still be there for a while. |
| 03:00:55 | <thuban> | i set it to something VERY large while i fixed up the ignores, but now i've set it back to 0 and it's not started again... any way to signal for a re-check? |
| 03:01:22 | <atphoenix> | I copied that from https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/ which has other date info too |
| 03:02:12 | <@JAA> | thuban: Yes, it depends on the current delay, and no, there isn't such a signal. |
| 03:02:44 | <thuban> | whoops. 15 minutes it is, i guess |
| 03:02:55 | <@JAA> | Not entirely sure how it's implemented exactly in grab-site, but I think it's similar to AB, which checks the settings after every URL. |
| 03:03:06 | <@JAA> | So yeah, don't go too high on the delay settings. :-) |
| 03:03:19 | <@JAA> | 15 minutes doesn't sound too bad though. |
| 03:03:30 | <@JAA> | We frequently use 3 or 5 minutes on AB. |
| 03:05:34 | <thuban> | i didn't actually calculate it, i just threw in 1000000 because 100000 seemed too small and i thought i'd be able to change it again |
| 03:09:14 | <@arkiver> | OrIdow6: hi, so this is not ready yet for warrior? |
| 03:09:19 | <@arkiver> | or is it |
| 03:09:30 | <@arkiver> | if yes I'll get it up |
| 03:09:32 | <@arkiver> | right 4 hours |
| 03:09:39 | <@arkiver> | please confirm ^ |
| 03:09:42 | <@arkiver> | or JAA ^ |
| 03:09:49 | <thuban> | oh, other question: i forgot to change to my external drive before starting the crawl :( fortunately i _think_ i've got enough space anyway, but is there a way to move data without losing the state? |
| 03:11:15 | <@JAA> | arkiver: I only looked over it briefly, no idea. |
| 03:11:45 | <@arkiver> | JAA: shut down in 4 hours? |
| 03:11:51 | <@arkiver> | shutdown* |
| 03:12:35 | <@JAA> | Unsure, I haven't seen a time announced anywhere, but maybe I missed it. |
| 03:12:42 | <@JAA> | But it's going down today (1 May). |
| 03:12:54 | <@arkiver> | alright |
| 03:13:15 | <@arkiver> | and we got a list of items? |
| 03:13:17 | <@JAA> | And they've been warning users with brown-outs and whatnot, so I don't expect it to stay longer. |
| 03:13:31 | <@JAA> | We have a list of users, and everything else can be discovered from there. |
| 03:13:43 | <@arkiver> | haha is that file actually named .zstandard :P |
| 03:14:09 | <@JAA> | Yep lol |
| 03:14:13 | <@JAA> | OrIdow6: For the future, it's .zst |
| 03:14:57 | <@OrIdow6> | arkiver: AFAIK it is, except for branding (which I can go and change now) and backfeed |
| 03:15:08 | <@OrIdow6> | Which needs a URL (right now it's example.com) |
| 03:15:19 | <@OrIdow6> | JAA: I know, but I couldn't remember at the time |
| 03:17:07 | | pcr leaves |
| 03:17:21 | <@OrIdow6> | Well, and technically the file: item type isn't implemented, but as discussed previously that's deliberate at this point as I suspect that's a lot of data that's mostly available elsewhere |
| 03:18:28 | <thuban> | JAA: no error messages, but my queue seems to be stuck. any idea why? |
| 03:19:32 | <@JAA> | ¯\_(ツ)_/¯ |
| 03:19:53 | <thuban> | ;_; |
| 03:21:08 | <@JAA> | By the way, there are also branded subdomains like google.bintray.com. The files are available under different URLs for those, either dl.bintray.com with an expiring token or that subdomain. |
| 03:28:32 | <thuban> | oh, it resumed. apparently readline timed out. |
| 03:30:46 | <@arkiver> | OrIdow6: alright, no time for me to test it much so I'll just fix those things and get it started now |
| 03:31:12 | <thuban> | (i kind of suspect one of the threads might still be stuck) |
| 03:32:05 | <@arkiver> | OrIdow6: stdout_sorted.txt is your item list right? |
| 03:32:35 | <@arkiver> | OrIdow6: what is all the aimix-z stuff? |
| 03:33:03 | <@arkiver> | right old code |
| 03:33:06 | <@OrIdow6> | JAA: Hm, thanks for pointing out that it downloads them differently, may need to handle that differently |
| 03:33:16 | <@OrIdow6> | arkiver: That's the intended item list, yes |
| 03:33:36 | <@OrIdow6> | Aimix-Z is another site that seems borderline impossible to archive because it aggressively bans people |
| 03:33:56 | <@arkiver> | ok |
| 03:35:56 | <@arkiver> | OrIdow6: i'm replacing zst with gz |
| 03:36:01 | <@OrIdow6> | I am thinking of trying to use backfeed to make a super-distributed crawl, where each item is just 3 urls and it recurses around |
| 03:36:46 | <@OrIdow6> | arkiver: Did I do zst wrong again? |
| 03:36:46 | <@arkiver> | OrIdow6: only thing that needs changing is backfeed? |
| 03:37:00 | <@arkiver> | OrIdow6: no, i'd rather use gz when we're not going dicts |
| 03:37:10 | <@arkiver> | wont save much, and gz is still the default for WARCs in general |
| 03:37:18 | <@arkiver> | doing* |
| 03:38:30 | <@OrIdow6> | arkiver: Give me a few minutes to quickly fix this thing J A A reminded me of |
| 03:38:46 | <@OrIdow6> | And this isn't the ideal final version, but I sort of ran out of time |
| 03:38:54 | <@OrIdow6> | And it should work nonetheless |
| 03:39:02 | <@arkiver> | not ideal is fine now |
| 03:39:27 | <@arkiver> | OrIdow6: ok please PR it to the archiveteam clone |
| 03:39:28 | <@OrIdow6> | Well, J A A told me of half and reminded me of the other half |
| 03:39:35 | <@OrIdow6> | Ok |
| 03:42:54 | <@arkiver> | OrIdow6: all fixed and pushed |
| 03:43:11 | <@OrIdow6> | Testing this change to see if it breaks anything |
| 03:43:15 | <@OrIdow6> | Thanks |
| 03:44:33 | <@arkiver> | OrIdow6: the change jaa proposed? |
| 03:45:11 | <@arkiver> | items queued |
| 03:45:30 | <@OrIdow6> | arkiver: He didn't propose a change, he told me about a corner case (roughly) |
| 03:46:04 | <@arkiver> | crap we need a target |
| 03:46:14 | <@arkiver> | EggplantN: you around? or HCross Kaz |
| 03:46:41 | <@arkiver> | OrIdow6: any rough size estimate? |
| 03:46:42 | <@arkiver> | TBs? |
| 03:46:45 | <@arkiver> | or not |
| 03:47:06 | <@JAA> | All the Brits are probably asleep. :-/ |
| 03:47:36 | <@OrIdow6> | I'd say high GBs or low TBs |
| 03:47:37 | <@arkiver> | yeah i should be as well |
| 03:47:48 | <@arkiver> | ok good |
| 03:47:49 | <@OrIdow6> | As this should reject all big files |
| 03:48:00 | <@OrIdow6> | Well, queue them as file:, which isn't implemented yet |
| 03:48:05 | <@JAA> | Another fun edge case: https://bintray.com/griffon/griffon-plugins?offset=16&max=8&repoPath=%2Fgriffon%2Fgriffon-plugins&sortBy=lowerCaseName&filterByPkgName= |
| 03:48:15 | <@JAA> | Links to a package that isn't under griffon/griffon-plugins. |
| 03:48:50 | <@OrIdow6> | Should be able to handle that |
| 03:49:01 | <@OrIdow6> | As items are users |
| 03:49:16 | | qw3rty__ joins |
| 03:49:35 | <@JAA> | Yeah, it shows up under sleonidy as well. |
| 03:50:18 | <@JAA> | In fact, bintray/jcenter is full of these. |
| 03:50:48 | <endrift> | I noticed bintray just showed up on the tracker. Is there a channel for that yet? |
| 03:51:01 | | endrift scrolls up |
| 03:51:06 | <@JAA> | I think that means those are also all available under two different URLs. |
| 03:51:06 | <endrift> | ah, not yet |
| 03:51:54 | <@OrIdow6> | What do you mean, shows up? |
| 03:52:08 | <@arkiver> | how does FOS work again |
| 03:52:10 | <@JAA> | It's included on the user's repos/packages. |
| 03:52:23 | <@JAA> | So it appears twice. |
| 03:52:50 | | qw3rty_ quits [Ping timeout: 250 seconds] |
| 03:53:24 | <@arkiver> | we'll use a target on FOS |
| 03:53:31 | <@JAA> | That'll be fun. |
| 03:53:39 | <@arkiver> | OrIdow6: excuse the ping, is the update ready? |
| 03:53:48 | <@OrIdow6> | Just making the PR |
| 03:53:54 | <@arkiver> | perfect |
| 03:55:08 | <@OrIdow6> | Alright https://github.com/ArchiveTeam/bintray-grab/pull/1 arkiver |
| 03:56:11 | <@OrIdow6> | strict.lua was a thing I was using during testing, that would crash upon reading from new variables instead of returning nil |
| 03:57:16 | | Eighty quits [Quit: leaving] |
| 03:57:17 | <@OrIdow6> | Which apparently I did actually add to git, but oh well |
| 03:57:34 | | Eighty (Eighty) joins |
| 03:58:13 | <@arkiver> | JAA: looks like FOS is still working :P |
| 03:59:11 | <@JAA> | Yeah, for now. |
| 03:59:19 | <@arkiver> | OrIdow6: started! |
| 03:59:21 | <@arkiver> | people should update |
| 04:00:16 | <@OrIdow6> | arkiver: Thanks |
| 04:00:26 | <@arkiver> | OrIdow6: why the if not something? |
| 04:00:34 | <@OrIdow6> | Where? |
| 04:00:46 | <@arkiver> | where we normally check if to_send is nil |
| 04:00:54 | <@arkiver> | before setting the first discovered item |
| 04:01:16 | <@OrIdow6> | Because strict.lua broke it for some reason |
| 04:01:35 | <@arkiver> | odd |
| 04:02:20 | <@arkiver> | OrIdow6: i think we can just make this multi item size 1 |
| 04:02:22 | <etnguyen03> | Is drone building a docker image? |
| 04:04:31 | <@arkiver> | changed to multi item size 1 |
| 04:04:36 | <@arkiver> | i'll be off now for some sleep |
| 04:04:40 | <@arkiver> | gotta get up early |
| 04:05:15 | <@OrIdow6> | arkiver: OK |
| 04:05:17 | <@OrIdow6> | Goodnight |
| 04:05:29 | <@OrIdow6> | I hope you're not getting up early for this project |
| 04:06:16 | <@arkiver> | no not for this project :) |
| 04:07:06 | <@arkiver> | thanks for the work on this, its good we at least archive something here |
| 04:07:41 | <@JAA> | I'll have a rough size estimate for the files in a bit. |
| 04:12:17 | | etnguyen03 quits [Client Quit] |
| 04:14:33 | <@JAA> | Extrapolated from a 1‰ sample of all users, there should be on the order of ten million files with a total size of 10 TB. May easily be off by quite a bit though since it's such a small sample. |
| 04:18:16 | <@OrIdow6> | Not too bad |
| 04:18:27 | <@OrIdow6> | Yeah, I'd expect it to vary a lot |
| 04:21:40 | <@JAA> | Looks like some users 404. |
| 04:21:58 | <@JAA> | Two examples: sfali, olacabs |
| 04:22:49 | <@OrIdow6> | It should deal with those correctly |
| 04:23:14 | <@OrIdow6> | Well, trying it out, I forgot to check for 200, so it makes 3 unnecessary requests |
| 04:23:27 | <@OrIdow6> | But it correctly succeeds |
| 04:24:33 | <@JAA> | :-) |
| 04:27:00 | <jodizzle> | Is there a recommended concurrency? |
| 04:27:53 | <@OrIdow6> | Not yet |
| 04:28:27 | <@JAA> | Go nuts. I haven't seen any issues at high concurrencies. |
| 04:28:38 | <@JAA> | (Not running this, but on that sample above.) |
| 04:31:03 | <@JAA> | Ok, they start 429ing somewhere between 50 and 100 concurrency with qwarc. |
| 04:32:45 | <jodizzle> | Got it |
| 04:32:50 | <jodizzle> | I'm getting some 401s, is that normal? |
| 04:32:57 | <@OrIdow6> | Where? |
| 04:33:13 | <jodizzle> | example: 80=401 https://api.bintray.com/maven/zdmytriv/vgs-aws-maven/aws-maven/;publish=1 |
| 04:33:24 | <jodizzle> | Makes the worker sleep |
| 04:33:53 | <thuban> | problem: as much as i'd like to have outlinks on this ah.com run, for context, i'm concerned there won't be time (and i've done enough of the priority content that i don't want to re-run as --no-offsite-links) |
| 04:33:56 | <thuban> | solution: add hacky negative-match ignore, then gs-dump-urls skipped and run them in a separate crawl (or even feed them to archivebot) when i'm done, y/n/q? |
| 04:34:01 | <@OrIdow6> | It shouldn't be going there |
| 04:34:40 | <@JAA> | thuban: That's what I've been doing on AB, yeah. Negative lookahead ignore. |
| 04:35:00 | <@JAA> | Make sure to not miss subdomains, URLs with ports, etc. |
| 04:36:29 | <thuban> | JAA: '^((?!alternatehistory.com).)*$' lgty? |
| 04:37:02 | <thuban> | lax but this is almost certainly io-bound so i don't know that it matters |
| 04:38:43 | <@JAA> | I suppose that would work. |
| 04:39:07 | <@JAA> | I usually do something like ^https?://(?!([^/]*\.)?example.org(:\d+)?/) |
| 04:39:18 | <@JAA> | Er, example\.org |
| 04:39:59 | <thuban> | welp, here goes |
| 04:41:57 | <jodizzle> | thuban: Might want to turn igon on to verify |
| 04:43:00 | <thuban> | jodizzle: thanks, but dashboard / gs-dump-urls in_progress look good and i don't want to slow it down |
| 04:43:42 | <@OrIdow6> | https://github.com/ArchiveTeam/bintray-grab/pull/2 - misc changes - can someone accept this? |
| 04:58:24 | | godane quits [Ping timeout: 258 seconds] |
| 04:58:47 | | Zopolis4 (Zopolis4) joins |
| 05:25:32 | <@OrIdow6> | JAA: Do you want to accept that? Fine if you defer |
| 05:25:51 | <@OrIdow6> | jodizzle: Seeing any more errors? I see it's slowed |
| 05:28:09 | <jodizzle> | OrIdow6: I was trying to stop the container gracefully to restart with higher concurrency, but it's still doing backoff from that 401 link. I guess I should just kill it? |
| 05:29:03 | <@OrIdow6> | Yeah |
| 05:29:14 | <@OrIdow6> | It will just abort anyway |
| 05:31:06 | <@JAA> | OrIdow6: Seems fine, merged. |
| 05:31:41 | <@OrIdow6> | Thanks JAA |
| 05:38:02 | <@JAA> | Some files are actually served directly on bintray.com, by the way. |
| 05:38:25 | <@JAA> | E.g. those in the package jfrog/jfrog-mission-control/mc-docker-installer |
| 05:39:12 | <@JAA> | Er actually, that's an EULA. Great. |
| 05:39:48 | <@OrIdow6> | AFAICT it does that with small files (threshold somewhere around 1 MB), so that's how I determine it |
| 05:40:21 | <@OrIdow6> | It gets files directly on the site in the user: item and then queues CDN ones as file: item |
| 05:41:15 | <@JAA> | Nope, I've seen plenty of small files get served via dl.bintray.com. |
| 05:41:30 | <@JAA> | But it's a 302 redirect. |
| 05:41:40 | <@OrIdow6> | That's what I mean |
| 05:41:46 | <@OrIdow6> | Oh, I see with the EULA |
| 05:42:05 | <@OrIdow6> | I thought you meant it was a license in a file |
| 05:42:18 | <@JAA> | Ah |
| 05:42:47 | <@JAA> | Yeah, no, intermediate page with a scripty button. |
| 05:46:06 | <@JAA> | And also, https://dl.bintray.com/jfrog/jfrog-mission-control/ is serving completely different files than what's listed on https://bintray.com/jfrog/jfrog-mission-control/mc-docker-installer |
| 05:47:03 | <@JAA> | Files that aren't even under any project. |
| 05:58:02 | <Vukky> | https://github.com/ArchiveTeam/seesaw-kit/pull/121 - there was an attempt to do a thing |
| 06:00:31 | <@JAA> | I'll leave that to someone else as I have zero experience with seesaw's web interface. |
| 06:00:55 | <Vukky> | Alright |
| 06:05:21 | | PFD (PFD) joins |
| 06:05:44 | <PFD> | where does this get logged to anyways |
| 06:07:46 | <@JAA> | A website that's currently down. |
| 06:09:52 | <PFD> | rip |
| 06:19:48 | | PFD quits [Client Quit] |
| 06:20:37 | <@HCross> | Good morning world, what is needed here |
| 06:24:37 | <Wayward-> | more hard drives |
| 06:25:01 | <@OrIdow6> | Hello HCross |
| 06:25:37 | <@OrIdow6> | Apropos of the hastily-started (which was my fault) Bintray project, workers and preferably a target that's not FOS |
| 06:26:18 | <@OrIdow6> | Well, for all I know FOS is fine |
| 06:27:15 | <@OrIdow6> | Shouldn't be much data, and site may shut down in half an hour anyway |
| 06:35:06 | <@HCross> | Let me get out of bed and I’ll throw workers at it |
| 06:35:22 | <@HCross> | Rate limits? |
| 06:35:40 | | spirit joins |
| 06:35:46 | <@JAA> | I started getting 429s between 50 and 100 concurrent with qwarc. |
| 06:35:58 | <@JAA> | No idea what that translates to. |
| 06:36:25 | <@HCross> | Size per item? |
| 06:36:39 | <@HCross> | Sorry, trying to size this |
| 06:37:15 | <@JAA> | 50 conc with qwarc corresponded to 65 req/s. |
| 06:37:35 | <@JAA> | Items aren't big, below 1 MB on average. |
| 06:38:58 | <thuban> | does grab-site retry on 'Connection closed' errors? |
| 06:40:30 | <@JAA> | Yes. Not 'Connection refused' though as far as I can see. |
| 06:43:41 | <thuban> | hm, ok. i'm seeing 0s erroring without corresponding 200s following; is that just an ordering issue? |
| 06:47:21 | <thuban> | (gs-dump-urls lists one such in 'error' rather than 'todo' or 'in_progress' but i'm not sure whether that status is intended as final) |
| 06:48:48 | <@HCross> | JAA: sir, I believe you asked for some archivism |
| 06:49:16 | <@HCross> | that has been delivered |
| 06:50:58 | <thuban> | "Note that, unlike wget, wpull puts retries at the end of the queue." oh, hopefully that's it. nts, check up on this |
| 06:52:43 | | HackMii_ quits [Remote host closed the connection] |
| 06:53:14 | | HackMii_ (hacktheplanet) joins |
| 06:59:16 | <@HCross> | I've turned it up a bit |
| 06:59:32 | <@OrIdow6> | Thanks HCross |
| 06:59:53 | <@HCross> | if we crash into FOS we can deal with that |
| 07:00:34 | <@JAA> | Nice |
| 07:01:01 | <@HCross> | I'm seeing some 502s |
| 07:01:17 | <@HCross> | methinks Bintray may be distressed |
| 07:01:18 | <@OrIdow6> | Midnight, seems it's still going |
| 07:01:25 | <@HCross> | but it's just crossed 8am London and we're still alive |
| 07:01:59 | <@JAA> | Average response time went from 1 to 7 seconds in the past couple minutes for me. |
| 07:02:21 | <@HCross> | yep |
| 07:02:32 | <@HCross> | im still pulling quite hard |
| 07:02:40 | <@HCross> | but let me know if you want me to back the truck off |
| 07:03:50 | <@JAA> | thisisfine.png :-) |
| 07:04:04 | <@HCross> | I'm about to drive the truck in even harder |
| 07:04:44 | <@JAA> | Response time has come down again to 2.5 s for me (one-minute average). |
| 07:06:24 | <@HCross> | Archiving Truck has been revved up |
| 07:06:39 | <@HCross> | and is now crashing head first into the Binary wall |
| 07:06:42 | <@HCross> | Bintray |
| 07:07:26 | <@JAA> | Yes Rico, kaboom. |
| 07:15:19 | <@HCross> | JAA: im getting some really big items |
| 07:15:20 | <@HCross> | is that normal |
| 07:15:39 | <@JAA> | Hmm |
| 07:15:52 | <@JAA> | Have some examples? |
| 07:16:05 | <@JAA> | Files shouldn't be downloaded yet as I understood it. |
| 07:17:22 | <@HCross> | unfortunately I don't, as it sped past |
| 07:17:53 | <@HCross> | are we discovering as we go |
| 07:17:58 | <@JAA> | Hmm yeah, I see now, average item size is 150-ish MiB. |
| 07:18:03 | <@JAA> | OrIdow6: Is that expected? |
| 07:18:42 | <@OrIdow6> | JAA: Items coming in are still mostly under 1 MiB; what do you mean |
| 07:18:46 | <@JAA> | There is backfeed, but I believe the initial list should already be virtually complete. |
| 07:19:05 | <@OrIdow6> | ? |
| 07:19:09 | <@HCross> | I have items in the thousands of URLs |
| 07:19:20 | <@HCross> | 2065=200 https://bintray.com/nus-ncl/generic/services-in-one/1-98bd8b8?versionPath=%2Fnus-ncl%2Fgeneric%2Fservices-in-one%2F1-98bd8b8 |
| 07:20:08 | <@HCross> | see how my items done count dropped, but the size shot up |
| 07:20:26 | <@OrIdow6> | I'm not sure what versionPath is, but that does look like it should correctly have 1000s of URLs |
| 07:20:31 | <@OrIdow6> | That item |
| 07:20:57 | <@JAA> | Oh yeah, I was misreading that graph. |
| 07:21:19 | <@JAA> | Some items are in the 10s of MB, but most are still small. |
| 07:21:29 | <@HCross> | give me a few minutes |
| 07:21:31 | <@HCross> | and I'll double again |
| 07:21:34 | | LeighR (LeighR) joins |
| 07:22:03 | <@HCross> | this will be like the opening minutes of Parler again |
| 07:22:19 | <@JAA> | Found an image of HCross: https://i.ytimg.com/vi/BvXxIWkcWrA/maxresdefault.jpg |
| 07:22:33 | <@HCross> | "where did all the items go, we queued a ton" "harry claimed them all" _brief pause_ "harry checked them all back in very quickly" |
| 07:22:56 | <@HCross> | EggplantN: "oh fuck, oh fuck... FUCK" |
| 07:23:17 | <@JAA> | lol |
| 07:23:41 | <@JAA> | I think JFrog's servers will fall over before ours this time. |
| 07:23:53 | <@JAA> | Unless they have some scaling going on. |
| 07:23:59 | <@HCross> | EggplantN actually phoned me to yell at me over that |
| 07:24:35 | <@JAA> | Looks like the main site's hosted in Dallas, by the way. |
| 07:25:02 | <@HCross> | so I'm hauling it all back to the EU |
| 07:25:03 | <@HCross> | woo |
| 07:25:23 | <@JAA> | And then back to FOS in California. lol |
| 07:25:58 | <@JAA> | Oh well, the real fun will be when/if we grab the actual files. |
| 07:26:14 | <@JAA> | Very rough estimate puts that at 10M files and 10 TB. |
| 07:30:18 | <@HCross> | if we get that, I'll move over to my California colo |
| 07:30:22 | <@HCross> | and start going BRRR |
| 07:31:54 | | sonick quits [Remote host closed the connection] |
| 07:32:17 | <@HCross> | this may be an ideal candidate for meta if we need more targets |
| 07:32:30 | <@HCross> | JAA: shall we make a channel? |
| 07:32:45 | <@JAA> | Those aren't going to Dallas, by the way. Amazon and Google CDN as far as I've seen. |
| 07:32:59 | <@OrIdow6> | It does |
| 07:33:09 | <@OrIdow6> | Because it has a token that expires |
| 07:33:20 | <@OrIdow6> | So what are queued aren't the CDN URLs, it's the redirects to them |
| 07:33:46 | | BlueMaxima_ joins |
| 07:34:08 | <@JAA> | Only very few have tokens. |
| 07:34:16 | <@JAA> | But I see re redirects. |
| 07:34:38 | <@OrIdow6> | All the ones I looked at had tokens |
| 07:34:44 | <@OrIdow6> | Can you give examples? |
| 07:36:09 | <@JAA> | About three quarters of the ones I've collected in a test run didn't have tokens. |
| 07:36:12 | <@OrIdow6> | Perhaps I was biased towards a certain type of file while manually exploring the site |
| 07:37:03 | <@JAA> | 176k of 239k plain dl.bintray.com URLs |
| 07:37:26 | <@JAA> | A couple random projects that have those: k8ty-app/maven/k8ty-nltk adfactory/maven/adfactory_android est7/maven/rx2errorhandler |
| 07:37:29 | | BlueMaxima quits [Ping timeout: 258 seconds] |
| 07:37:41 | <@HCross> | I do wonder if they've got an autoscaler that I can crash into harder |
| 07:37:50 | <@HCross> | if they're in "the cloud" :tm |
| 07:39:04 | <@JAA> | They seem to be using IBM's hosting. networklayer.com shows up prominently in the routes. |
| 07:39:12 | <@HCross> | yep |
| 07:39:52 | <@HCross> | they're hauling me all the way from London on the IBM backbone |
| 07:40:46 | <@JAA> | I'm going to NY via Level3 first. |
| 07:41:46 | <@OrIdow6> | JAA: Redirect from what to what? If you mean redirects to dl., it does follow those |
| 07:42:14 | <@HCross> | ah, I have direct peering with IBM in London |
| 07:42:32 | <@HCross> | so this is actually very cheap |
| 07:43:01 | <@JAA> | OrIdow6: .../download_file redirects to dl.bintray.com but without tokens in the latter URL for the majority of projects. |
| 07:43:48 | <@OrIdow6> | JAA: Oh, I see |
| 07:44:00 | <@JAA> | Are we already grabbing those? |
| 07:44:10 | <@OrIdow6> | So the dl. urls themselves can redirect to a CDN or get served directly from dl. |
| 07:44:29 | <@OrIdow6> | In the former case, they will be queued as file: items; in the latter, they will be fetched as part of the user: item |
| 07:44:40 | <@JAA> | OH |
| 07:44:58 | <@JAA> | Ok, that explains some things. |
| 07:45:05 | <Jake> | I've been getting some weird 400s on some funky urls. Not sure if this is normal? https://jakel.rocks/up/fd73e7198ba6777f/urls |
| 07:45:34 | <@OrIdow6> | With some nuance to account for custom subdomains |
| 07:46:03 | <@OrIdow6> | Jake: That doesn't look right |
| 07:46:21 | <Jake> | (As well as 403s on some S3 objects) https://jakel.rocks/up/d4713371c935c8cb/s3-403s |
| 07:46:40 | | hooway joins |
| 07:47:02 | <@HCross> | wee |
| 07:47:10 | <@HCross> | I appear to be downloading most of Kubernetes source code |
| 07:47:15 | <@JAA> | Right, so we're grabbing all the smaller files, but the larger ones that redirect to Cloudfront get queued to backfeed. |
| 07:47:29 | <@OrIdow6> | Jake: Do you have the full logs for the first one? |
| 07:47:31 | <@HCross> | 6701=200 https://dl.bintray.com/fabric8/fabric8/.images/de/de7821b9943bd0498290d6e45b0a5f336ca53cb0a101817f4858543fb936d3ae/layer.tar |
| 07:47:37 | <@HCross> | so I should be seeing that? |
| 07:47:49 | <Jake> | OrIdow6: No full logs for the first one. I'll see if I can get some. |
| 07:48:05 | <@OrIdow6> | The second one is an avatar URL that's had some problem extracting; as it's a 403 on S3, I think it's worth leaving the lenient extractor in |
| 07:48:19 | <@JAA> | I thought we were skipping files entirely for now. But yeah, that's expected then, HCross. |
| 07:48:31 | <@JAA> | And I guess the average item size will not stay below 1 MB in that case. |
| 07:48:34 | <@HCross> | ah right |
| 07:48:43 | <@HCross> | if we're getting a lot of these I may need to rethink a few things |
| 07:49:31 | <@JAA> | Though they're only the smaller files. Larger ones are on the CDN and not grabbed yet. |
| 07:49:46 | <@JAA> | Random example of such a CDN redirect: https://dl.bintray.com/kuende/k8s/kube-apiserver |
| 07:51:24 | <@EggplantN> | Y’all need fire power or is bincentre close to falling over |
| 07:51:27 | <Jake> | I found the 400 again OrIdow6: https://jakel.rocks/up/e0799a8d20ba0c30/bintray |
| 07:51:38 | <@HCross> | EggplantN: im getting backed off to 1024 seconds |
| 07:51:39 | <@HCross> | lol |
| 07:51:49 | <@HCross> | but im going to see if that was a one off |
| 07:51:53 | <@HCross> | and if I can push harder |
| 07:52:05 | <@EggplantN> | I was gonna bring the warriors |
| 07:52:26 | <Jake> | I think there's a few issues with the script first |
| 07:52:39 | <@OrIdow6> | Jake: Thanks |
| 07:53:07 | <@HCross> | EggplantN: not yet |
| 07:53:11 | <@HCross> | lets iron out the script |
| 07:53:17 | <@HCross> | and we'll need targets |
| 07:53:43 | <@JAA> | FOS seems fine so far? We will need targets for the large files though. |
| 07:54:16 | <@JAA> | But we don't even know yet when this all gets taken down. |
| 07:54:42 | <@JAA> | Also, I feel for the poor lad who will get the user:bintray item. |
| 07:54:46 | <@EggplantN> | HCross deploy at-offload |
| 07:55:25 | <@HCross> | will do when needed |
| 07:56:44 | <@HCross> | https://www.irccloud.com/pastebin/jYMVpd0h/ |
| 07:56:46 | <@HCross> | JAA: ^ |
| 07:56:49 | <@HCross> | I presume that's intentional |
| 07:57:04 | <@JAA> | Yep |
| 07:57:08 | <@JAA> | Those are the large files. |
| 07:58:51 | <@OrIdow6> | Jake: I am running the problem item you found with a bunch of debug output right now, may take a while |
| 07:59:02 | <Jake> | sure, no problem. Should we stop while you do that? |
| 07:59:21 | <@OrIdow6> | The problem here seems to be that it is getting too much rather than too little |
| 07:59:28 | | HackMii_ quits [Remote host closed the connection] |
| 07:59:58 | <@OrIdow6> | So unless or until stuff like this predominates, I think it's best to continue |
| 08:00:19 | <@OrIdow6> | Seeing as the site is apparently in the sort of limbo state where it's already supposed to have shut down |
| 08:00:31 | | HackMii_ (hacktheplanet) joins |
| 08:00:37 | <@OrIdow6> | I do think a channel would be a good idea |
| 08:00:54 | <@HCross> | https://twitter.com/steveonjava/status/1387072410868797440 |
| 08:00:56 | <@HCross> | JAA: ^^^ |
| 08:01:02 | <@HCross> | does that mean we have a reprieve |
| 08:01:26 | <@HCross> | but we should still go as hard as we can |
| 08:02:03 | <@OrIdow6> | ark iver writes a project: The channel is created before it's clear there's even going to be a project in the first place |
| 08:02:17 | <@OrIdow6> | I write a project: The channel may or may not be created after it starts running |
| 08:02:36 | <@HCross> | and I crawl out of bed, straight to my laptop and start |
| 08:03:01 | <@HCross> | im sat here in pyjamas, server wrangling |
| 08:03:10 | <@JAA> | HCross: JCenter != Bintray |
| 08:03:18 | <@HCross> | ahh |
| 08:03:36 | <@JAA> | JCenter's index is on Bintray, but it's still kind of separate. |
| 08:03:48 | <@JAA> | Once Bintray goes down, we can't discover JCenter's content. |
| 08:04:00 | <@HCross> | ew |
| 08:04:09 | <@JAA> | Or at least not as far as I know. |
| 08:04:17 | <@JAA> | They had directory listing before but removed that. |
| 08:06:53 | <@OrIdow6> | Ah, I see what the problem is |
| 08:09:00 | <@JAA> | Yeah, let's make a channel. binnedtray? |
| 08:10:04 | <Jake> | ashtray? I'm horrible at channel names. |
| 08:10:32 | <@OrIdow6> | If you want to be confusing, bitbucket |
| 08:10:56 | <@OrIdow6> | So I had two checks to make sure Jake's problem didn't happen, and made a mistake in one and messed up the other one with a later commit |
| 08:11:23 | <@OrIdow6> | Oh, "bin" != "bit" |
| 08:12:15 | <@JAA> | Channel name rules: no confusion, much pun. |
| 08:13:43 | <@JAA> | Oh, I see atphoenix already suggested binnedtray earlier. :-) |
| 08:17:09 | <atphoenix> | :) |
| 08:19:45 | <@HCross> | which one are we using |
| 08:22:04 | <@OrIdow6> | By the way, I just pushed commits to my copy of the repo, do *not* merge these, I made a mistake |
| 08:30:06 | <@OrIdow6> | So it turns out I had it right, I just messed up when reviewing my changes |
| 08:30:40 | <@JAA> | Since nobody's saying anything about the channel names, I'll just decide: #binnedtray |
| 08:30:44 | <@OrIdow6> | https://github.com/ArchiveTeam/bintray-grab/pull/3 |
| 08:31:14 | <Jake> | yeah, #binnedtray is good! |
| 08:41:18 | | LeighR quits [Ping timeout: 244 seconds] |
| 09:36:16 | | NF885 (NF885) joins |
| 09:58:23 | | LeighR (LeighR) joins |
| 09:59:50 | | NF885 quits [Ping timeout: 244 seconds] |
| 10:00:42 | | spirit quits [Client Quit] |
| 10:03:29 | <@arkiver> | OrIdow6: if you want a channel, create one |
| 10:25:08 | | blankie joins |
| 10:25:08 | | blankie is now authenticated as blankie |
| 10:25:08 | | blankie quits [Changing host] |
| 10:25:08 | | blankie (blankie) joins |
| 10:35:22 | | NF885 (NF885) joins |
| 10:43:38 | | hooway_ joins |
| 10:43:38 | | hooway quits [Read error: Connection reset by peer] |
| 11:00:02 | | BlueMaxima_ quits [Read error: Connection reset by peer] |
| 11:09:28 | | blankie quits [Ping timeout: 258 seconds] |
| 11:14:55 | | blankie (blankie) joins |
| 11:20:43 | | blankie quits [Remote host closed the connection] |
| 11:21:05 | | blankie joins |
| 11:21:05 | | blankie is now authenticated as blankie |
| 11:21:05 | | blankie quits [Changing host] |
| 11:21:05 | | blankie (blankie) joins |
| 11:50:03 | | Zopolis420 (Zopolis4) joins |
| 11:50:39 | | Zopolis420 leaves |
| 11:59:36 | | spirit joins |
| 12:22:30 | | etnguyen03 (etnguyen03) joins |
| 12:28:02 | | pcr joins |
| 12:29:12 | | blankie quits [Ping timeout: 258 seconds] |
| 12:29:20 | | blankie joins |
| 12:29:20 | | blankie is now authenticated as blankie |
| 12:29:20 | | blankie quits [Changing host] |
| 12:29:20 | | blankie (blankie) joins |
| 12:39:53 | | FMecha joins |
| 12:40:27 | | FMecha quits [Remote host closed the connection] |
| 12:47:57 | | hooway_ quits [Read error: Connection reset by peer] |
| 12:48:07 | | hooway joins |
| 12:48:25 | | hooway_ joins |
| 12:48:25 | | hooway quits [Read error: Connection reset by peer] |
| 12:57:26 | | pcr leaves |
| 12:57:30 | | pcr joins |
| 12:57:50 | | blankie quits [Remote host closed the connection] |
| 12:58:46 | | blankie (blankie) joins |
| 13:00:38 | | sonick joins |
| 13:10:08 | | blankie quits [Remote host closed the connection] |
| 13:11:14 | | blankie joins |
| 13:11:15 | | blankie is now authenticated as blankie |
| 13:11:15 | | blankie quits [Changing host] |
| 13:11:15 | | blankie (blankie) joins |
| 13:12:02 | | NF885 quits [Ping timeout: 244 seconds] |
| 13:41:33 | | Zopolis4 quits [Remote host closed the connection] |
| 13:55:47 | | NF885 (NF885) joins |
| 14:00:05 | | NF885 quits [Ping timeout: 244 seconds] |
| 14:05:48 | | hilda quits [Ping timeout: 258 seconds] |
| 14:12:00 | | spirit quits [Client Quit] |
| 14:47:35 | | blankie quits [Ping timeout: 258 seconds] |
| 14:48:02 | | blankie joins |
| 14:48:02 | | blankie is now authenticated as blankie |
| 14:48:02 | | blankie quits [Changing host] |
| 14:48:02 | | blankie (blankie) joins |
| 15:17:53 | | hilda joins |
| 15:21:45 | | LeighR quits [Client Quit] |
| 15:22:42 | | blankie quits [Ping timeout: 250 seconds] |
| 15:22:54 | | blankie joins |
| 15:22:54 | | blankie is now authenticated as blankie |
| 15:22:54 | | blankie quits [Changing host] |
| 15:22:54 | | blankie (blankie) joins |
| 15:26:14 | | minus quits [Quit: Bye] |
| 15:27:32 | | minus joins |
| 15:33:58 | | blankie quits [Ping timeout: 250 seconds] |
| 15:35:53 | | hilda quits [Ping timeout: 258 seconds] |
| 15:43:15 | | blankie joins |
| 15:43:15 | | blankie is now authenticated as blankie |
| 15:43:15 | | blankie quits [Changing host] |
| 15:43:15 | | blankie (blankie) joins |
| 15:48:20 | | hooway joins |
| 15:48:20 | | hooway_ quits [Read error: Connection reset by peer] |
| 16:05:36 | | Arcorann quits [Ping timeout: 250 seconds] |
| 16:05:51 | | Daloader joins |
| 16:28:26 | | Iki quits [Remote host closed the connection] |
| 16:41:26 | | blankie quits [Read error: Connection reset by peer] |
| 16:53:31 | | hilda joins |
| 17:11:38 | | mutantmnky (mutantmonkey) joins |
| 17:12:52 | | mutantmonkey quits [Ping timeout: 258 seconds] |
| 17:14:27 | <cpina> | In the big PATCHED geocities torrent there is the file geocities.archiveteam.torrent/WORKSHOP/SEEDS.tar.bz2 with the file inside all-seed-47 listing a site that I'd like to have (geocities.com/SiliconValley/Lakes/4468). Sadly in UPPERCASE/geocities-S-i.7z.001 there isn't the Lakes/4468 :-( . I've written a script and 7z-unzipped / tar listed all the files from the torrent: no luck for Lakes/4468 |
| 17:14:35 | <cpina> | Is there anywhere else that I could look at? |
| 17:29:49 | <cpina> | (also, thanks for the geocities torrent and other archiving efforts, it's fantastic) |
| 18:21:13 | | LeighR (LeighR) joins |
| 18:32:56 | | hilda quits [Ping timeout: 250 seconds] |
| 18:44:57 | | hilda joins |
| 18:48:21 | <thuban> | second pass of ah.com non-political chat is done; all threads 3 pages or under should now be completely saved :> |
| 18:50:35 | | Terbium quits [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.] |
| 19:12:21 | | Terbium joins |
| 19:16:54 | | hooway_ joins |
| 19:16:54 | | hooway quits [Read error: Connection reset by peer] |
| 19:26:18 | | NF885 (NF885) joins |
| 20:19:15 | | DogsRNice (Webuser299) joins |
| 20:22:34 | | Daloader quits [Ping timeout: 250 seconds] |
| 20:25:52 | | NF885 quits [Remote host closed the connection] |
| 20:28:17 | | NF885 (NF885) joins |
| 20:44:12 | | devsnek quits [Write error: Connection reset by peer] |
| 20:44:12 | | JSharp quits [Read error: Connection reset by peer] |
| 20:44:34 | | devsnek (devsnek) joins |
| 20:44:35 | | JSharp (JSharp) joins |
| 20:45:02 | | pcr leaves |
| 20:47:16 | | pcr joins |
| 20:53:57 | | chriscoffee quits [Read error: Connection reset by peer] |
| 20:55:02 | | chriscoffee (chriscoffee) joins |
| 20:57:33 | <thuban> | second pass of political chat and fourth pass of non-political chat are done |
| 21:15:52 | | pcr leaves |
| 21:21:19 | | NF885 quits [Ping timeout: 244 seconds] |
| 21:57:18 | | systwi quits [Ping timeout: 258 seconds] |
| 22:00:20 | | systwi (systwi) joins |
| 22:09:32 | <masterX244> | JAA: TM-exchange trackdata almost done. got to process that data locally to extract the User profiles for another pass |
| 22:15:03 | | LeighR quits [Client Quit] |
| 22:37:58 | <@JAA> | Lovely |
| 23:02:35 | | genofire quits [Ping timeout: 255 seconds] |
| 23:09:08 | | hooway_ quits [Client Quit] |
| 23:25:52 | | pcr joins |
| 23:28:02 | | BlueMaxima joins |
| 23:35:50 | | Jack_Thompson quits [Read error: Connection reset by peer] |
| 23:54:59 | | pcr leaves |
| 23:55:32 | | pcr joins |