00:17:03Arcorann (Arcorann) joins
00:20:13<@JAA>Hmm, JCenter seems to be throwing 403 for all files for me.
00:28:12<@OrIdow6>Mass download any of it before?
00:28:16<@OrIdow6>Regular Bintray still working
00:28:23<@OrIdow6>So I think I have this in a working state
00:30:49Sylirana quits [Read error: Connection reset by peer]
00:31:17Sylirana (Sylirana) joins
00:34:43<@JAA>JCenter might return. It's supposed to stay working until next year.
00:35:02<@JAA>But yeah, let's get Bintray itself running ASAP.
00:35:11<@OrIdow6>Sort of multitasking here
00:35:15<@JAA>We can't download anything from JCenter without Bintray anyway.
00:35:20<@OrIdow6>Will try to get it uploaded in a few minutes
00:44:58<@OrIdow6>Alright https://github.com/OrIdow6/bintray-grab should be good aside from the cosmetic/branding stuff (it's still SMMB in that area) and the backfeed URL
00:45:40<@JAA>arkiver: ^
00:52:18<etnguyen03>Is there an IRC for bintray?
00:52:33<etnguyen03>*a channel
00:52:46<@JAA>Not so far.
00:58:21Mineroboter_ joins
00:59:56Mineroboter quits [Ping timeout: 250 seconds]
01:00:12<@OrIdow6>Very rough around the edges, of course, but it should get all info
01:00:23<@OrIdow6>Site used POST for a bunch of stuff anyway
01:13:56<@OrIdow6>So I am trying to use a Japanese IP address, with Accept-Language: ja, a realistic UA, am getting the "counter", am using a 10-second + random delay, and am being conservative w/ the URLs I visit, and am still being blocked
01:14:03<@OrIdow6>From Aimix-Z
01:14:42<@JAA>Oof
01:21:25<nyany>that's some world class automation detection if i've seen it
01:22:03<nyany>or are you behaving like a normal user
01:22:12<@OrIdow6>If it stays alive over the weekend I may have enough time to try it more
01:22:16<nyany>wow thanks irccloud
01:22:31<nyany>i meant to say are you jumping to extremes or are you behaving like a normal user
01:22:37<@OrIdow6>No, this is a crawler that a human reading logs could detect easily
01:22:43<nyany>fair
01:22:57<@OrIdow6>Unless you were a very methodical user of a text browser
01:23:15<@OrIdow6>*Unless you suspected they were
01:33:33rsn_ quits [Quit: Leaving]
01:44:00pcr leaves
01:53:33Wayward quits [Read error: Connection reset by peer]
01:53:35Wayward- (wayward) joins
01:55:42Iki joins
02:12:36pcr joins
02:55:24<atphoenix>maybe they have a human reading logs... and/or detect that only text resources are accessed and not other resources. Kind of like the inverse of a bot trap URL. (if the browser doesn't get all the resources normally accessed by a graphical browser, consider the user to be a bot)
02:56:37<@OrIdow6>I've tried to consider that
02:57:00<@OrIdow6>So it does get images, and I also get a "counter"/analytics URL that every page got but seemed not to have a purpose
02:57:17<@OrIdow6>But due to the nature of the grab setup there is a long delay in some cases
02:59:04<atphoenix>bintray -> binnedtray or spilledtray
02:59:14<@OrIdow6>Well it shuts down in 4 hours
02:59:26<atphoenix>bintray their website says "UPDATE 4/27/2021: We listened to the community and will keep JCenter as a read-only repository indefinitely. Our customers and the community can continue to rely on JCenter as a reliable mirror for Java packages."
02:59:36<@OrIdow6>JCenter ~= main Bintray
02:59:40<thuban>how often does grab-site check the 'delay' file? (does it depend on the current delay?)
02:59:46<@OrIdow6>*!=
03:00:04<@JAA>JCenter is kind of integrated into Bintray but also standalone.
03:00:30<@JAA>You can't discover the content on JCenter once Bintray is down, even though it will still be there for a while.
03:00:55<thuban>i set it to something VERY large while i fixed up the ignores, but now i've set it back to 0 and it's not started again... any way to signal for a re-check?
03:01:22<atphoenix>I copied that from https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/ which has other date info too
03:02:12<@JAA>thuban: Yes, it depends on the current delay, and no, there isn't such a signal.
03:02:44<thuban>whoops. 15 minutes it is, i guess
03:02:55<@JAA>Not entirely sure how it's implemented exactly in grab-site, but I think it's similar to AB, which checks the settings after every URL.
03:03:06<@JAA>So yeah, don't go too high on the delay settings. :-)
03:03:19<@JAA>15 minutes doesn't sound too bad though.
03:03:30<@JAA>We frequently use 3 or 5 minutes on AB.
03:05:34<thuban>i didn't actually calculate it, i just threw in 1000000 because 100000 seemed too small and i thought i'd be able to change it again
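A minimal sketch of why the oversized delay could not be undone right away, assuming (as JAA describes for AB) that the delay setting is re-read only after each URL, and assuming the value is in milliseconds. The function name and the simplified loop are hypothetical; this is not grab-site's actual code.

```python
def first_fetch_after_change(initial_delay_ms, change_at_s):
    """Simulated time (in seconds) of the first fetch at or after the moment
    the delay setting is edited.

    Assumed model: fetch one URL (instantly), read the delay that is current
    at that moment, sleep for that long, repeat. An edit made during a sleep
    is therefore only noticed once that sleep has finished.
    """
    if initial_delay_ms <= 0:
        raise ValueError("initial_delay_ms must be positive")
    now = 0.0
    while now < change_at_s:
        # The setting was read before the edit, so the old delay still applies.
        now += initial_delay_ms / 1000
    return now


# A delay of 1000000 edited one minute into the sleep, reported in minutes.
print(first_fetch_after_change(1_000_000, 60) / 60)
```

Under these assumptions a delay of 1000000 keeps the crawl idle for roughly a quarter of an hour regardless of when the file is edited, which is in line with thuban's "15 minutes" guess.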
03:09:14<@arkiver>OrIdow6: hi, so this is not ready yet for warrior?
03:09:19<@arkiver>or is it
03:09:30<@arkiver>if yes I'll get it up
03:09:32<@arkiver>right 4 hours
03:09:39<@arkiver>please confirm ^
03:09:42<@arkiver>or JAA ^
03:09:49<thuban>oh, other question: i forgot to change to my external drive before starting the crawl :( fortunately i _think_ i've got enough space anyway, but is there a way to move data without losing the state?
03:11:15<@JAA>arkiver: I only looked over it briefly, no idea.
03:11:45<@arkiver>JAA: shut down in 4 hours?
03:11:51<@arkiver>shutdown*
03:12:35<@JAA>Unsure, I haven't seen a time announced anywhere, but maybe I missed it.
03:12:42<@JAA>But it's going down today (1 May).
03:12:54<@arkiver>alright
03:13:15<@arkiver>and we got a list of items?
03:13:17<@JAA>And they've been warning users with brown-outs and whatnot, so I don't expect it to stay longer.
03:13:31<@JAA>We have a list of users, and everything else can be discovered from there.
03:13:43<@arkiver>haha is that file actually named .zstandard :P
03:14:09<@JAA>Yep lol
03:14:13<@JAA>OrIdow6: For the future, it's .zst
03:14:57<@OrIdow6>arkiver: AFAIK it is, except for branding (which I can go and change now) and backfeed
03:15:08<@OrIdow6>Which needs a URL (right now it's example.com)
03:15:19<@OrIdow6>JAA: I know, but I couldn't remember at the time
03:17:07pcr leaves
03:17:21<@OrIdow6>Well, and technically the file: item type isn't implemented, but as discussed previously that's deliberate at this point as I suspect that's a lot of data that's mostly available elsewhere
03:18:28<thuban>JAA: no error messages, but my queue seems to be stuck. any idea why?
03:19:32<@JAA>¯\_(ツ)_/¯
03:19:53<thuban>;_;
03:21:08<@JAA>By the way, there are also branded subdomains like google.bintray.com. The files are available under different URLs for those, either dl.bintray.com with an expiring token or that subdomain.
03:28:32<thuban>oh, it resumed. apparently readline timed out.
03:30:46<@arkiver>OrIdow6: alright, no time for me to test it much so I'll just fix those things and get it started now
03:31:12<thuban>(i kind of suspect one of the threads might still be stuck)
03:32:05<@arkiver>OrIdow6: stdout_sorted.txt is your item list right?
03:32:35<@arkiver>OrIdow6: what is all the aimix-z stuff?
03:33:03<@arkiver>right old code
03:33:06<@OrIdow6>JAA: Hm, thanks for pointing out that it downloads them differently, may need to handle that differently
03:33:16<@OrIdow6>arkiver: That's the intended item list, yes
03:33:36<@OrIdow6>Aimix-Z is another site that seems borderline impossible to archive because it aggressively bans people
03:33:56<@arkiver>ok
03:35:56<@arkiver>OrIdow6: i'm replacing zst with gz
03:36:01<@OrIdow6>I am thinking of trying to use backfeed to make a super-distributed crawl, where each item is just 3 urls and it recurses around
03:36:46<@OrIdow6>arkiver: Did I do zst wrong again?
03:36:46<@arkiver>OrIdow6: only thing that needs changing is backfeed?
03:37:00<@arkiver>OrIdow6: no, i'd rather use gz when we're not going dicts
03:37:10<@arkiver>wont save much, and gz is still the default for WARCs in general
03:37:18<@arkiver>doing*
03:38:30<@OrIdow6>arkiver: Give me a few minutes to quickly fix this thing J A A reminded me of
03:38:46<@OrIdow6>And this isn't the ideal final version, but I sort of ran out of time
03:38:54<@OrIdow6>And it should work nonetheless
03:39:02<@arkiver>not ideal is fine now
03:39:27<@arkiver>OrIdow6: ok please PR it to the archiveteam clone
03:39:28<@OrIdow6>Well, J A A told me of half and reminded me of the other half
03:39:35<@OrIdow6>Ok
03:42:54<@arkiver>OrIdow6: all fixed and pushed
03:43:11<@OrIdow6>Testing this change to see if it breaks anything
03:43:15<@OrIdow6>Thanks
03:44:33<@arkiver>OrIdow6: the change jaa proposed?
03:45:11<@arkiver>items queued
03:45:30<@OrIdow6>arkiver: He didn't propose a change, he told me about a corner case (roughly)
03:46:04<@arkiver>crap we need a target
03:46:14<@arkiver>EggplantN: you around? or HCross Kaz
03:46:41<@arkiver>OrIdow6: any rough size estimate?
03:46:42<@arkiver>TBs?
03:46:45<@arkiver>or not
03:47:06<@JAA>All the Brits are probably asleep. :-/
03:47:36<@OrIdow6>I'd say high GBs or low TBs
03:47:37<@arkiver>yeah i should be as well
03:47:48<@arkiver>ok good
03:47:49<@OrIdow6>As this should reject all big files
03:48:00<@OrIdow6>Well, queue them as file:, which isn't implemented yet
03:48:05<@JAA>Another fun edge case: https://bintray.com/griffon/griffon-plugins?offset=16&max=8&repoPath=%2Fgriffon%2Fgriffon-plugins&sortBy=lowerCaseName&filterByPkgName=
03:48:15<@JAA>Links to a package that isn't under griffon/griffon-plugins.
03:48:50<@OrIdow6>Should be able to handle that
03:49:01<@OrIdow6>As items are users
03:49:16qw3rty__ joins
03:49:35<@JAA>Yeah, it shows up under sleonidy as well.
03:50:18<@JAA>In fact, bintray/jcenter is full of these.
03:50:48<endrift>I noticed bintray just showed up on the tracker. Is there a channel for that yet?
03:51:01endrift scrolls up
03:51:06<@JAA>I think that means those are also all available under two different URLs.
03:51:06<endrift>ah, not yet
03:51:54<@OrIdow6>What do you mean, shows up?
03:52:08<@arkiver>how does FOS work again
03:52:10<@JAA>It's included on the user's repos/packages.
03:52:23<@JAA>So it appears twice.
03:52:50qw3rty_ quits [Ping timeout: 250 seconds]
03:53:24<@arkiver>we'll use a target on FOS
03:53:31<@JAA>That'll be fun.
03:53:39<@arkiver>OrIdow6: excuse the ping, is the update ready?
03:53:48<@OrIdow6>Just making the PR
03:53:54<@arkiver>perfect
03:55:08<@OrIdow6>Alright https://github.com/ArchiveTeam/bintray-grab/pull/1 arkiver
03:56:11<@OrIdow6>strict.lua was a thing I was using during testing, that would crash upon reading from new variables instead of returning nil
03:57:16Eighty quits [Quit: leaving]
03:57:17<@OrIdow6>Which apparently I did actually add to git, but oh well
03:57:34Eighty (Eighty) joins
03:58:13<@arkiver>JAA: looks like FOS is still working :P
03:59:11<@JAA>Yeah, for now.
03:59:19<@arkiver>OrIdow6: started!
03:59:21<@arkiver>people should update
04:00:16<@OrIdow6>arkiver: Thanks
04:00:26<@arkiver>OrIdow6: why the if not something?
04:00:34<@OrIdow6>Where?
04:00:46<@arkiver>where we normally check if to_send is nil
04:00:54<@arkiver>before setting the first discovered item
04:01:16<@OrIdow6>Because strict.lua broke it for some reason
04:01:35<@arkiver>odd
04:02:20<@arkiver>OrIdow6: i think we can just make this multi item size 1
04:02:22<etnguyen03>Is drone building a docker image?
04:04:31<@arkiver>changed to multi item size 1
04:04:36<@arkiver>i'll be off now for some sleep
04:04:40<@arkiver>gotta get up early
04:05:15<@OrIdow6>arkiver: OK
04:05:17<@OrIdow6>Goodnight
04:05:29<@OrIdow6>I hope you're not getting up early for this project
04:06:16<@arkiver>no not for this project :)
04:07:06<@arkiver>thanks for the work on this, it's good we at least archive something here
04:07:41<@JAA>I'll have a rough size estimate for the files in a bit.
04:12:17etnguyen03 quits [Client Quit]
04:14:33<@JAA>Extrapolated from a 1‰ sample of all users, there should be on the order of ten million files with a total size of 10 TB. May easily be off by quite a bit though since it's such a small sample.
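The extrapolation JAA describes amounts to scaling the sample totals by the inverse of the sampling fraction. A rough sketch follows; the sample totals in it are hypothetical, picked only to match the quoted orders of magnitude, and are not JAA's actual measurements.

```python
def extrapolate(sample_total, sample_per_mille):
    """Scale a total measured on a random sample up to the whole population.

    sample_per_mille is the sample size in thousandths of the population,
    so 1 means a 1 per mille sample.
    """
    if sample_per_mille <= 0:
        raise ValueError("sample_per_mille must be positive")
    return sample_total * 1000 // sample_per_mille


# Hypothetical sample totals (assumptions, not measured values).
sample_files = 10_000
sample_bytes = 10 * 10**9  # 10 GB

print(extrapolate(sample_files, 1))           # estimated file count
print(extrapolate(sample_bytes, 1) / 10**12)  # estimated size in TB
```

With a sample this small, a handful of unusually large users inside or outside the sample can shift the estimate a lot, which is the caveat JAA gives.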
04:18:16<@OrIdow6>Not too bad
04:18:27<@OrIdow6>Yeah, I'd expect it to vary a lot
04:21:40<@JAA>Looks like some users 404.
04:21:58<@JAA>Two examples: sfali, olacabs
04:22:49<@OrIdow6>It should deal with those correctly
04:23:14<@OrIdow6>Well, trying it out, I forgot to check for 200, so it makes 3 unnecessary requests
04:23:27<@OrIdow6>But it correctly succeeds
04:24:33<@JAA>:-)
04:27:00<jodizzle>Is there a recommended concurrency?
04:27:53<@OrIdow6>Not yet
04:28:27<@JAA>Go nuts. I haven't seen any issues at high concurrencies.
04:28:38<@JAA>(Not running this, but on that sample above.)
04:31:03<@JAA>Ok, they start 429ing somewhere between 50 and 100 concurrency with qwarc.
04:32:45<jodizzle>Got it
04:32:50<jodizzle>I'm getting some 401s, is that normal?
04:32:57<@OrIdow6>Where?
04:33:13<jodizzle>example: 80=401 https://api.bintray.com/maven/zdmytriv/vgs-aws-maven/aws-maven/;publish=1
04:33:24<jodizzle>Makes the worker sleep
04:33:53<thuban>problem: as much as i'd like to have outlinks on this ah.com run, for context, i'm concerned there won't be time (and i've done enough of the priority content that i don't want to re-run as --no-offsite-links)
04:33:56<thuban>solution: add hacky negative-match ignore, then gs-dump-urls skipped and run them in a separate crawl (or even feed them to archivebot) when i'm done, y/n/q?
04:34:01<@OrIdow6>It shouldn't be going there
04:34:40<@JAA>thuban: That's what I've been doing on AB, yeah. Negative lookahead ignore.
04:35:00<@JAA>Make sure to not miss subdomains, URLs with ports, etc.
04:36:29<thuban>JAA: '^((?!alternatehistory.com).)*$' lgty?
04:37:02<thuban>lax but this is almost certainly io-bound so i don't know that it matters
04:38:43<@JAA>I suppose that would work.
04:39:07<@JAA>I usually do something like ^https?://(?!([^/]*\.)?example.org(:\d+)?/)
04:39:18<@JAA>Er, example\.org
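A small sketch comparing the two ignore patterns, assuming they are evaluated as Python regular expressions searched against the full URL. The helper names are hypothetical, and JAA's pattern is adapted here from example.org to thuban's domain for illustration.

```python
import re

# thuban's pattern: ignores a URL only if "alternatehistory.com" (with the
# dot matching any character) appears nowhere in it, not just in the host.
LAX = re.compile(r'^((?!alternatehistory.com).)*$')


def offsite_ignore(domain):
    """Build JAA's style of pattern: ignore every http(s) URL whose host is
    not the domain itself or one of its subdomains (optional port allowed)."""
    return re.compile(
        r'^https?://(?!([^/]*\.)?' + re.escape(domain) + r'(:\d+)?/)'
    )


def is_ignored(pattern, url):
    return pattern.search(url) is not None


STRICT = offsite_ignore("alternatehistory.com")

for url in (
    "https://www.alternatehistory.com/forum/",
    "http://alternatehistory.com:8080/x",
    "https://example.org/?ref=alternatehistory.com",
    "https://notalternatehistory.com/",
):
    print(url, is_ignored(LAX, url), is_ignored(STRICT, url))
```

The last two URLs show the difference: the lax pattern lets them through because the domain string appears somewhere in the URL, while the host-anchored pattern ignores them.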
04:39:59<thuban>welp, here goes
04:41:57<jodizzle>thuban: Might want to turn igon on to verify
04:43:00<thuban>jodizzle: thanks, but dashboard / gs-dump-urls in_progress look good and i don't want to slow it down
04:43:42<@OrIdow6>https://github.com/ArchiveTeam/bintray-grab/pull/2 - misc changes - can someone accept this?
04:58:24godane quits [Ping timeout: 258 seconds]
04:58:47Zopolis4 (Zopolis4) joins
05:25:32<@OrIdow6>JAA: Do you want to accept that? Fine if you defer
05:25:51<@OrIdow6>jodizzle: Seeing any more errors? I see it's slowed
05:28:09<jodizzle>OrIdow6: I was trying to stop the container gracefully to restart with higher concurrency, but it's still doing backoff from that 401 link. I guess I should just kill it?
05:29:03<@OrIdow6>Yeah
05:29:14<@OrIdow6>It will just abort anyway
05:31:06<@JAA>OrIdow6: Seems fine, merged.
05:31:41<@OrIdow6>Thanks JAA
05:38:02<@JAA>Some files are actually served directly on bintray.com, by the way.
05:38:25<@JAA>E.g. those in the package jfrog/jfrog-mission-control/mc-docker-installer
05:39:12<@JAA>Er actually, that's an EULA. Great.
05:39:48<@OrIdow6>AFAICT it does that with small files (threshold somewhere around 1 MB), so that's how I determine it
05:40:21<@OrIdow6>It gets files directly on the site in the user: item and then queues CDN ones as file: item
05:41:15<@JAA>Nope, I've seen plenty of small files get served via dl.bintray.com.
05:41:30<@JAA>But it's a 302 redirect.
05:41:40<@OrIdow6>That's what I mean
05:41:46<@OrIdow6>Oh, I see with the EULA
05:42:05<@OrIdow6>I thought you meant it was a license in a file
05:42:18<@JAA>Ah
05:42:47<@JAA>Yeah, no, intermediate page with a scripty button.
05:46:06<@JAA>And also, https://dl.bintray.com/jfrog/jfrog-mission-control/ is serving completely different files than what's listed on https://bintray.com/jfrog/jfrog-mission-control/mc-docker-installer
05:47:03<@JAA>Files that aren't even under any project.
05:58:02<Vukky>https://github.com/ArchiveTeam/seesaw-kit/pull/121 - there was an attempt to do a thing
06:00:31<@JAA>I'll leave that to someone else as I have zero experience with seesaw's web interface.
06:00:55<Vukky>Alright
06:05:21PFD (PFD) joins
06:05:44<PFD>where does this get logged to anyways
06:07:46<@JAA>A website that's currently down.
06:09:52<PFD>rip
06:19:48PFD quits [Client Quit]
06:20:37<@HCross>Good morning world, what is needed here
06:24:37<Wayward->more hard drives
06:25:01<@OrIdow6>Hello HCross
06:25:37<@OrIdow6>Apropos of the hastily-started (which was my fault) Bintray project, workers and preferably a target that's not FOS
06:26:18<@OrIdow6>Well, for all I know FOS is fine
06:27:15<@OrIdow6>Shouldn't be much data, and site may shut down in half an hour anyway
06:35:06<@HCross>Let me get out of bed and I’ll throw workers at it
06:35:22<@HCross>Rate limits?
06:35:40spirit joins
06:35:46<@JAA>I started getting 429s between 50 and 100 concurrent with qwarc.
06:35:58<@JAA>No idea what that translates to.
06:36:25<@HCross>Size per item?
06:36:39<@HCross>Sorry, trying to size this
06:37:15<@JAA>50 conc with qwarc corresponded to 65 req/s.
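One way to relate the two figures is Little's law (requests in flight = throughput × mean time per request). This is a back-of-the-envelope sketch that assumes all 50 slots were busy the whole time; the function name is hypothetical.

```python
def mean_latency_s(concurrency, requests_per_second):
    """Mean time per request implied by Little's law, in seconds."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return concurrency / requests_per_second


print(round(mean_latency_s(50, 65), 2))  # 0.77
```

Read the other way round, a worker that sustains N concurrent requests at about that latency would produce roughly 1.3 × N requests per second against the site.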
06:37:35<@JAA>Items aren't big, below 1 MB on average.
06:38:58<thuban>does grab-site retry on 'Connection closed' errors?
06:40:30<@JAA>Yes. Not 'Connection refused' though as far as I can see.
06:43:41<thuban>hm, ok. i'm seeing 0s erroring without corresponding 200s following; is that just an ordering issue?
06:47:21<thuban>(gs-dump-urls lists one such in 'error' rather than 'todo' or 'in_progress' but i'm not sure whether that status is intended as final)
06:48:48<@HCross>JAA: sir, I believe you asked for some archivism
06:49:16<@HCross>that has been delivered
06:50:58<thuban>"Note that, unlike wget, wpull puts retries at the end of the queue." oh, hopefully that's it. nts, check up on this
06:52:43HackMii_ quits [Remote host closed the connection]
06:53:14HackMii_ (hacktheplanet) joins
06:59:16<@HCross>I've turned it up a bit
06:59:32<@OrIdow6>Thanks HCross
06:59:53<@HCross>if we crash into FOS we can deal with that
07:00:34<@JAA>Nice
07:01:01<@HCross>I'm seeing some 502s
07:01:17<@HCross>methinks Bintray may be distressed
07:01:18<@OrIdow6>Midnight, seems it's still going
07:01:25<@HCross>but it's just crossed 8am London and we're still alive
07:01:59<@JAA>Average response time went from 1 to 7 seconds in the past couple minutes for me.
07:02:21<@HCross>yep
07:02:32<@HCross>im still pulling quite hard
07:02:40<@HCross>but let me know if you want me to back the truck off
07:03:50<@JAA>thisisfine.png :-)
07:04:04<@HCross>I'm about to drive the truck in even harder
07:04:44<@JAA>Response time has come down again to 2.5 s for me (one-minute average).
07:06:24<@HCross>Archiving Truck has been revved up
07:06:39<@HCross>and is now crashing head first into the Binary wall
07:06:42<@HCross>Bintray
07:07:26<@JAA>Yes Rico, kaboom.
07:15:19<@HCross>JAA: im getting some really big items
07:15:20<@HCross>is that normal
07:15:39<@JAA>Hmm
07:15:52<@JAA>Have some examples?
07:16:05<@JAA>Files shouldn't be downloaded yet as I understood it.
07:17:22<@HCross>unfortunately I don't, as it sped past
07:17:53<@HCross>are we discovering as we go
07:17:58<@JAA>Hmm yeah, I see now, average item size is 150-ish MiB.
07:18:03<@JAA>OrIdow6: Is that expected?
07:18:42<@OrIdow6>JAA: Items coming in are still mostly under 1 MiB; what do you mean
07:18:46<@JAA>There is backfeed, but I believe the initial list should already be virtually complete.
07:19:05<@OrIdow6>?
07:19:09<@HCross>I have items in the thousands of URLs
07:19:20<@HCross>2065=200 https://bintray.com/nus-ncl/generic/services-in-one/1-98bd8b8?versionPath=%2Fnus-ncl%2Fgeneric%2Fservices-in-one%2F1-98bd8b8
07:20:08<@HCross>see how my items done count dropped, but the size shot up
07:20:26<@OrIdow6>I'm not sure what versionPath is, but that does look like it should correctly have 1000s of URLs
07:20:31<@OrIdow6>That item
07:20:57<@JAA>Oh yeah, I was misreading that graph.
07:21:19<@JAA>Some items are in the 10s of MB, but most are still small.
07:21:29<@HCross>give me a few minutes
07:21:31<@HCross>and I'll double again
07:21:34LeighR (LeighR) joins
07:22:03<@HCross>this will be like the opening minutes of Parler again
07:22:19<@JAA>Found an image of HCross: https://i.ytimg.com/vi/BvXxIWkcWrA/maxresdefault.jpg
07:22:33<@HCross>"where did all the items go, we queued a ton" "harry claimed them all" _brief pause_ "harry checked them all back in very quickly"
07:22:56<@HCross>EggplantN: "oh fuck, oh fuck... FUCK"
07:23:17<@JAA>lol
07:23:41<@JAA>I think JFrog's servers will fall over before ours this time.
07:23:53<@JAA>Unless they have some scaling going on.
07:23:59<@HCross>EggplantN actually phoned me to yell at me over that
07:24:35<@JAA>Looks like the main site's hosted in Dallas, by the way.
07:25:02<@HCross>so I'm hauling it all back to the EU
07:25:03<@HCross>woo
07:25:23<@JAA>And then back to FOS in California. lol
07:25:58<@JAA>Oh well, the real fun will be when/if we grab the actual files.
07:26:14<@JAA>Very rough estimate puts that at 10M files and 10 TB.
07:30:18<@HCross>if we get that, I'll move over to my California colo
07:30:22<@HCross>and start going BRRR
07:31:54sonick quits [Remote host closed the connection]
07:32:17<@HCross>this may be an ideal candidate for meta if we need more targets
07:32:30<@HCross>JAA: shall we make a channel?
07:32:45<@JAA>Those aren't going to Dallas, by the way. Amazon and Google CDN as far as I've seen.
07:32:59<@OrIdow6>It does
07:33:09<@OrIdow6>Because it has a token that expires
07:33:20<@OrIdow6>So what are queued aren't the CDN URLs, it's the redirects to them
07:33:46BlueMaxima_ joins
07:34:08<@JAA>Only very few have tokens.
07:34:16<@JAA>But I see re redirects.
07:34:38<@OrIdow6>All the ones I looked at had tokens
07:34:44<@OrIdow6>Can you give examples?
07:36:09<@JAA>About three quarters of the ones I've collected in a test run didn't have tokens.
07:36:12<@OrIdow6>Perhaps I was biased towards a certain type of file while manually exploring the site
07:37:03<@JAA>176k of 239k plain dl.bintray.com URLs
07:37:26<@JAA>A couple random projects that have those: k8ty-app/maven/k8ty-nltk adfactory/maven/adfactory_android est7/maven/rx2errorhandler
07:37:29BlueMaxima quits [Ping timeout: 258 seconds]
07:37:41<@HCross>I do wonder if they've got an autoscaler that I can crash into harder
07:37:50<@HCross>if they're in "the cloud" :tm
07:39:04<@JAA>They seem to be using IBM's hosting. networklayer.com shows up prominently in the routes.
07:39:12<@HCross>yep
07:39:52<@HCross>they're hauling me all the way from London on the IBM backbone
07:40:46<@JAA>I'm going to NY via Level3 first.
07:41:46<@OrIdow6>JAA: Redirect from what to what? If you mean redirects to dl., it does follow those
07:42:14<@HCross>ah, I have direct peering with IBM in London
07:42:32<@HCross>so this is actually very cheap
07:43:01<@JAA>OrIdow6: .../download_file redirects to dl.bintray.com but without tokens in the latter URL for the majority of projects.
07:43:48<@OrIdow6>JAA: Oh, I see
07:44:00<@JAA>Are we already grabbing those?
07:44:10<@OrIdow6>So the dl. urls themselves can redirect to a CDN or get served directly from dl.
07:44:29<@OrIdow6>In the former case, they will be queued as file: items; in the latter, they will be fetched as part of the user: item
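The routing OrIdow6 describes can be sketched roughly as follows. This is an illustration only: the real project script is not shown in this log, the function and label names are hypothetical, the "file:" + URL item naming is an assumption, and the custom-subdomain nuance is left out.

```python
from urllib.parse import urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def route_download(url, status, location=None):
    """Decide how to handle the response for a dl.bintray.com-style URL.

    A redirect to a different host is treated as a CDN redirect: the original
    URL (not the possibly expiring CDN URL) is queued as a file: item.
    Anything else is fetched as part of the current user: item.
    """
    source_host = urlsplit(url).hostname
    if status in REDIRECT_CODES and location:
        target_host = urlsplit(location).hostname
        if target_host and target_host != source_host:
            return ("queue_file_item", "file:" + url)
    return ("fetch_in_user_item", url)


# Hypothetical CDN host used for illustration.
print(route_download("https://dl.bintray.com/kuende/k8s/kube-apiserver",
                     302, "https://cdn.example.net/some/object"))
```

Queuing the redirecting URL rather than its target matches the earlier remark that the queued entries are the redirects, since a CDN URL may carry a token that expires.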
07:44:40<@JAA>OH
07:44:58<@JAA>Ok, that explains some things.
07:45:05<Jake>I've been getting some weird 400s on some funky urls. Not sure if this is normal? https://jakel.rocks/up/fd73e7198ba6777f/urls
07:45:34<@OrIdow6>With some nuance to account for custom subdomains
07:46:03<@OrIdow6>Jake: That doesn't look right
07:46:21<Jake>(As well as 403s on some S3 objects) https://jakel.rocks/up/d4713371c935c8cb/s3-403s
07:46:40hooway joins
07:47:02<@HCross>wee
07:47:10<@HCross>I appear to be downloading most of Kubernetes source code
07:47:15<@JAA>Right, so we're grabbing all the smaller files, but the larger ones that redirect to Cloudfront get queued to backfeed.
07:47:29<@OrIdow6>Jake: Do you have the full logs for the first one?
07:47:31<@HCross>6701=200 https://dl.bintray.com/fabric8/fabric8/.images/de/de7821b9943bd0498290d6e45b0a5f336ca53cb0a101817f4858543fb936d3ae/layer.tar
07:47:37<@HCross>so I should be seeing that?
07:47:49<Jake>OrIdow6: No full logs for the first one. I'll see if I can get some.
07:48:05<@OrIdow6>The second one is an avatar URL that's had some problem extracting, as it's a 403 on S3 I think it's worth leaving the lenient extractor in
07:48:19<@JAA>I thought we were skipping files entirely for now. But yeah, that's expected then, HCross.
07:48:31<@JAA>And I guess the average item size will not stay below 1 MB in that case.
07:48:34<@HCross>ah right
07:48:43<@HCross>if we're getting a lot of these I may need to rethink a few things
07:49:31<@JAA>Though they're only the smaller files. Larger ones are on the CDN and not grabbed yet.
07:49:46<@JAA>Random example of such a CDN redirect: https://dl.bintray.com/kuende/k8s/kube-apiserver
07:51:24<@EggplantN>Y’all need fire power or is bincentre close to falling over
07:51:27<Jake>I found the 400 again OrIdow6: https://jakel.rocks/up/e0799a8d20ba0c30/bintray
07:51:38<@HCross>EggplantN: im getting backed off to 1024 seconds
07:51:39<@HCross>lol
07:51:49<@HCross>but im going to see if that was a one off
07:51:53<@HCross>and if I can push harder
07:52:05<@EggplantN>I was gonna bring the warriors
07:52:26<Jake>I think there's a few issues with the script first
07:52:39<@OrIdow6>Jake: Thanks
07:53:07<@HCross>EggplantN: not yet
07:53:11<@HCross>lets iron out the script
07:53:17<@HCross>and we'll need targets
07:53:43<@JAA>FOS seems fine so far? We will need targets for the large files though.
07:54:16<@JAA>But we don't even know yet when this all gets taken down.
07:54:42<@JAA>Also, I feel for the poor lad who will get the user:bintray item.
07:54:46<@EggplantN>HCross deploy at-offload
07:55:25<@HCross>will do when needed
07:56:44<@HCross>https://www.irccloud.com/pastebin/jYMVpd0h/
07:56:46<@HCross>JAA: ^
07:56:49<@HCross>I presume that's intentional
07:57:04<@JAA>Yep
07:57:08<@JAA>Those are the large files.
07:58:51<@OrIdow6>Jake: I am running the problem item you found with a bunch of debug output right now, may take a while
07:59:02<Jake>sure, no problem. Should we stop while you do that?
07:59:21<@OrIdow6>The problem here seems to be that it is getting too much rather than too little
07:59:28HackMii_ quits [Remote host closed the connection]
07:59:58<@OrIdow6>So unless or until stuff like this predominates, I think it's best to continue
08:00:19<@OrIdow6>Seeing as the site is apparently in the sort of limbo state where it's already supposed to have shut down
08:00:31HackMii_ (hacktheplanet) joins
08:00:37<@OrIdow6>I do think a channel would be a good idea
08:00:54<@HCross>https://twitter.com/steveonjava/status/1387072410868797440
08:00:56<@HCross>JAA: ^^^
08:01:02<@HCross>does that mean we have a reprieve
08:01:26<@HCross>but we should still go as hard as we can
08:02:03<@OrIdow6>ark iver writes a project: The channel is created before it's clear there's even going to be a project in the first place
08:02:17<@OrIdow6>I write a project: The channel may or may not be created after it starts running
08:02:36<@HCross>and I crawl out of bed, straight to my laptop and start
08:03:01<@HCross>im sat here in pyjamas, server wrangling
08:03:10<@JAA>HCross: JCenter != Bintray
08:03:18<@HCross>ahh
08:03:36<@JAA>JCenter's index is on Bintray, but it's still kind of separate.
08:03:48<@JAA>Once Bintray goes down, we can't discover JCenter's content.
08:04:00<@HCross>ew
08:04:09<@JAA>Or at least not as far as I know.
08:04:17<@JAA>They had directory listing before but removed that.
08:06:53<@OrIdow6>Ah, I see what the problem is
08:09:00<@JAA>Yeah, let's make a channel. binnedtray?
08:10:04<Jake>ashtray? I'm horrible at channel names.
08:10:32<@OrIdow6>If you want to be confusing, bitbucket
08:10:56<@OrIdow6>So I had two checks to make sure Jake's problem didn't happen, and made a mistake in one and messed up the other one with a later commit
08:11:23<@OrIdow6>Oh, "bin" != "bit"
08:12:15<@JAA>Channel name rules: no confusion, much pun.
08:13:43<@JAA>Oh, I see atphoenix already suggested binnedtray earlier. :-)
08:17:09<atphoenix>:)
08:19:45<@HCross>which one are we using
08:22:04<@OrIdow6>By the way, I just pushed commits to my copy of the repo, do *not* merge these, I made a mistake
08:30:06<@OrIdow6>So it turns out I had it right, I just messed up when reviewing my changes
08:30:40<@JAA>Since nobody's saying anything about the channel names, I'll just decide: #binnedtray
08:30:44<@OrIdow6>https://github.com/ArchiveTeam/bintray-grab/pull/3
08:31:14<Jake>yeah, #binnedtray is good!
08:41:18LeighR quits [Ping timeout: 244 seconds]
09:36:16NF885 (NF885) joins
09:58:23LeighR (LeighR) joins
09:59:50NF885 quits [Ping timeout: 244 seconds]
10:00:42spirit quits [Client Quit]
10:03:29<@arkiver>OrIdow6: if you want a channel, create one
10:25:08blankie joins
10:25:08blankie quits [Changing host]
10:25:08blankie (blankie) joins
10:35:22NF885 (NF885) joins
10:43:38hooway_ joins
10:43:38hooway quits [Read error: Connection reset by peer]
11:00:02BlueMaxima_ quits [Read error: Connection reset by peer]
11:09:28blankie quits [Ping timeout: 258 seconds]
11:14:55blankie (blankie) joins
11:20:43blankie quits [Remote host closed the connection]
11:21:05blankie joins
11:21:05blankie quits [Changing host]
11:21:05blankie (blankie) joins
11:50:03Zopolis420 (Zopolis4) joins
11:50:39Zopolis420 leaves
11:59:36spirit joins
12:22:30etnguyen03 (etnguyen03) joins
12:28:02pcr joins
12:29:12blankie quits [Ping timeout: 258 seconds]
12:29:20blankie joins
12:29:20blankie quits [Changing host]
12:29:20blankie (blankie) joins
12:39:53FMecha joins
12:40:27FMecha quits [Remote host closed the connection]
12:47:57hooway_ quits [Read error: Connection reset by peer]
12:48:07hooway joins
12:48:25hooway_ joins
12:48:25hooway quits [Read error: Connection reset by peer]
12:57:26pcr leaves
12:57:30pcr joins
12:57:50blankie quits [Remote host closed the connection]
12:58:46blankie (blankie) joins
13:00:38sonick joins
13:10:08blankie quits [Remote host closed the connection]
13:11:14blankie joins
13:11:15blankie quits [Changing host]
13:11:15blankie (blankie) joins
13:12:02NF885 quits [Ping timeout: 244 seconds]
13:41:33Zopolis4 quits [Remote host closed the connection]
13:55:47NF885 (NF885) joins
14:00:05NF885 quits [Ping timeout: 244 seconds]
14:05:48hilda quits [Ping timeout: 258 seconds]
14:12:00spirit quits [Client Quit]
14:47:35blankie quits [Ping timeout: 258 seconds]
14:48:02blankie joins
14:48:02blankie quits [Changing host]
14:48:02blankie (blankie) joins
15:17:53hilda joins
15:21:45LeighR quits [Client Quit]
15:22:42blankie quits [Ping timeout: 250 seconds]
15:22:54blankie joins
15:22:54blankie quits [Changing host]
15:22:54blankie (blankie) joins
15:26:14minus quits [Quit: Bye]
15:27:32minus joins
15:33:58blankie quits [Ping timeout: 250 seconds]
15:35:53hilda quits [Ping timeout: 258 seconds]
15:43:15blankie joins
15:43:15blankie quits [Changing host]
15:43:15blankie (blankie) joins
15:48:20hooway joins
15:48:20hooway_ quits [Read error: Connection reset by peer]
16:05:36Arcorann quits [Ping timeout: 250 seconds]
16:05:51Daloader joins
16:28:26Iki quits [Remote host closed the connection]
16:41:26blankie quits [Read error: Connection reset by peer]
16:53:31hilda joins
17:11:38mutantmnky (mutantmonkey) joins
17:12:52mutantmonkey quits [Ping timeout: 258 seconds]
17:14:27<cpina>In the big PATCHED geocities torrent there is the file geocities.archiveteam.torrent/WORKSHOP/SEEDS.tar.bz2 with the file inside all-seed-47 listing a website that I'd like to have (geocities.com/SiliconValley/Lakes/4468). Sadly in UPPERCASE/geocities-S-i.7z.001 there isn't the Lakes/4468 :-( . I've done a script and 7z-unzipped / tar listed all the files from the torrent: no luck for Lakes/4468
17:14:35<cpina>Is there anywhere else that I could look at?
17:29:49<cpina>(also, thanks for the geocities torrent and other archiving efforts, it's fantastic)
18:21:13LeighR (LeighR) joins
18:32:56hilda quits [Ping timeout: 250 seconds]
18:44:57hilda joins
18:48:21<thuban>second pass of ah.com non-political chat is done; all threads 3 pages or under should now be completely saved :>
18:50:35Terbium quits [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
19:12:21Terbium joins
19:16:54hooway_ joins
19:16:54hooway quits [Read error: Connection reset by peer]
19:26:18NF885 (NF885) joins
20:19:15DogsRNice (Webuser299) joins
20:22:34Daloader quits [Ping timeout: 250 seconds]
20:25:52NF885 quits [Remote host closed the connection]
20:28:17NF885 (NF885) joins
20:44:12devsnek quits [Write error: Connection reset by peer]
20:44:12JSharp quits [Read error: Connection reset by peer]
20:44:34devsnek (devsnek) joins
20:44:35JSharp (JSharp) joins
20:45:02pcr leaves
20:47:16pcr joins
20:53:57chriscoffee quits [Read error: Connection reset by peer]
20:55:02chriscoffee (chriscoffee) joins
20:57:33<thuban>second pass of political chat and fourth pass of non-political chat are done
21:15:52pcr leaves
21:21:19NF885 quits [Ping timeout: 244 seconds]
21:57:18systwi quits [Ping timeout: 258 seconds]
22:00:20systwi (systwi) joins
22:09:32<masterX244>JAA: TM-exchange trackdata almost done. got to process that data locally to extract the User profiles for another pass
22:15:03LeighR quits [Client Quit]
22:37:58<@JAA>Lovely
23:02:35genofire quits [Ping timeout: 255 seconds]
23:09:08hooway_ quits [Client Quit]
23:25:52pcr joins
23:28:02BlueMaxima joins
23:35:50Jack_Thompson quits [Read error: Connection reset by peer]
23:54:59pcr leaves
23:55:32pcr joins