00:09:11Wingy1139793760180 quits [Remote host closed the connection]
00:10:00Wingy1139793760180 (Wingy) joins
01:42:26tzt quits [Ping timeout: 265 seconds]
02:12:13tzt (tzt) joins
02:54:56wickedplayer494 quits [Ping timeout: 265 seconds]
02:55:54wickedplayer494 joins
03:10:48Wingy1139793760180 quits [Remote host closed the connection]
03:11:35Wingy1139793760180 (Wingy) joins
04:28:22BlueMaxima quits [Read error: Connection reset by peer]
04:51:27Wingy1139793760180 quits [Remote host closed the connection]
04:52:18Wingy1139793760180 (Wingy) joins
05:16:32<pabs>https://www.theregister.com/2022/08/04/gitlab_data_retention_policy/
05:17:12<pabs>(GitLab to delete inactive projects from users on the free tier)
05:19:16wickedplayer494 quits [Ping timeout: 240 seconds]
05:19:48wickedplayer494 joins
05:29:06<@OrIdow6>Highlights: "the policy is scheduled to come into force in September 2022"; "GitLab... will... give users weeks or months of warning"
05:34:35<@OrIdow6>Looking into discovery
05:47:09<@OrIdow6>Will look into discovery tomorrow
05:47:12<@OrIdow6>Anyhow thanks pabs
05:47:39<pabs>thanks
05:49:59Wingy1139793760180 quits [Client Quit]
06:23:02Megame (Megame) joins
06:23:13sec^nd quits [Remote host closed the connection]
06:23:49sec^nd (second) joins
06:29:16Gereon620 (Gereon) joins
06:36:21Wingy1139793760180 (Wingy) joins
06:42:07ghuntley joins
06:48:07<AK>The gitlab one could be a big one
06:49:16<Jake>Yup and incredibly sad if they actually end up going through with it
06:49:41<Jake>Somewhat ironic to me that saving $1M/y is worth sacrificing years of goodwill
06:49:58<AK>"GitLab is aware of the potential for angry opposition to the plan", sounds like they know it will piss people off, but I guess they're hoping it will make it worth it
06:54:26<ghuntley>GitLab's on-call person/SRE is about to have a bad couple of nights...
06:55:39<Jake>well, we'd hope to not cause any huge problems...
07:00:26<ghuntley>I wonder if the cost in network egress will go above $1M in storage costs...
07:00:35<ghuntley>(also compute...)
07:01:02<ghuntley>(also SRE salary/time / project planning in response to the new load)
07:01:28<Jake>Oh I mean, probably! Seems incredibly shortsighted to me, but /shrug.
07:03:09<ghuntley>time to pull out this article again - https://articles.uie.com/beans-and-noses/
07:03:22Wingy1139793760180 quits [Ping timeout: 265 seconds]
07:04:56<Jake>looks like GitLab runs under cloudflare... could end up being annoying.
07:07:35<Jake>trying to find public info on any ratelimits, but they seem to be generous right now.
07:31:39DiscantX joins
07:31:43DiscantX quits [Remote host closed the connection]
07:35:08DiscantX_pi quits [Quit: WeeChat 2.3]
07:38:10G4te_Keep3r34 quits [Ping timeout: 265 seconds]
07:43:36DiscantX joins
07:52:32Wingy1139793760180 (Wingy) joins
08:12:45<tech234a>https://gitlab.com/explore/projects?sort=latest_activity_asc
08:13:28<tech234a>https://gitlab.com/explore/projects?sort=created_asc
08:13:41<tech234a>First project in that second list is ID 450
08:13:52<tech234a>second one is 526
08:14:30<tech234a>IDs appear sequential
08:14:51<tech234a>Current max ID is approximately 38337314
08:15:01<tech234a>based on https://gitlab.com/explore/projects?sort=created_desc
08:17:15<tech234a>API endpoint for the ID ex: https://gitlab.com/api/v4/projects/526
08:18:07<tech234a>can also use ID without API: https://gitlab.com/projects/526
08:19:58<tech234a>so probably ~38.4 million items with many deleted/private
08:20:06<tech234a>OrIdow6: ^
08:21:38<tech234a>returns 404 for a private repo
08:23:07Megame quits [Client Quit]
08:23:09<Jake>I see a generous rate limit on that call of 2K/minute.
08:24:28<Jake>19.2K minutes if we use the full rate limit every minute on an IP, or about 13 days I think. Should be possible.
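Jake's back-of-envelope figure checks out; a quick sketch using the numbers quoted in the channel (~38.34M max project ID, ~2K requests/minute per IP — both approximations from the discussion, not authoritative limits):

```python
# Crawl-budget arithmetic from the discussion above.
MAX_PROJECT_ID = 38_337_314  # approximate current max ID (tech234a)
RATE_PER_MIN = 2_000         # observed API rate limit per IP (Jake)

minutes = MAX_PROJECT_ID / RATE_PER_MIN
days = minutes / (60 * 24)
print(f"~{minutes:,.0f} minutes, or ~{days:.1f} days per IP")
# prints: ~19,169 minutes, or ~13.3 days per IP
```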
08:27:47<tech234a>also as a note: in addition to public and private, there is also the now-discontinued "internal" visibility setting that some repos might still have. "Internal" means visible to any logged-in user.
08:29:16Wingy1139793760180 quits [Ping timeout: 240 seconds]
08:30:10<tech234a>found a project ID 143 as new lowest, it was archived so it wasn't originally visible in the project listing
08:35:29<tech234a>API for listing https://gitlab.com/api/v4/projects?sort=asc&order_by=id
08:38:08<tech234a>Docs for this: https://docs.gitlab.com/ee/api/projects.html#list-all-projects
08:39:08<tech234a>Also note that keyset-based pagination is likely needed to avoid running into a limit https://docs.gitlab.com/ee/api/projects.html#pagination-limits
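The listing endpoint plus keyset pagination that tech234a links can be driven with a short loop. A sketch, untested against the live API; the query parameters follow the linked docs, and the `next_url` helper parses the standard `Link` response header:

```python
# Enumerate public GitLab projects oldest-first via keyset pagination,
# per https://docs.gitlab.com/ee/api/projects.html#pagination-limits.
import json
import re
import urllib.request

BASE = "https://gitlab.com/api/v4/projects"

def next_url(link_header):
    """Extract the rel="next" URL from a Link header, or None."""
    if not link_header:
        return None
    m = re.search(r'<([^>]+)>;\s*rel="next"', link_header)
    return m.group(1) if m else None

def iter_projects():
    """Yield project records, following keyset-pagination Link headers."""
    url = f"{BASE}?pagination=keyset&per_page=100&order_by=id&sort=asc"
    while url:
        with urllib.request.urlopen(url, timeout=30) as resp:
            yield from json.load(resp)
            url = next_url(resp.headers.get("Link"))
```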
08:55:19Wingy1139793760180 (Wingy) joins
09:30:40G4te_Keep3r34 joins
09:31:56ghuntley quits [Client Quit]
09:43:31HackMii quits [Remote host closed the connection]
09:44:04HackMii (hacktheplanet) joins
10:38:19Wingy1139793760180 quits [Remote host closed the connection]
10:39:10Wingy1139793760180 (Wingy) joins
11:09:05<apache2_>would it be possible for someone here to archive vtda.org? it's got a ton of useful retrocomputing materials
12:10:00HackMii quits [Remote host closed the connection]
12:10:00sec^nd quits [Remote host closed the connection]
12:10:44HackMii (hacktheplanet) joins
12:10:58sec^nd (second) joins
12:25:13katocala quits [Remote host closed the connection]
12:46:09joepie91|m joins
12:46:30thuban quits [Read error: Connection reset by peer]
12:47:02thuban joins
12:54:17sec^nd quits [Remote host closed the connection]
12:54:51sec^nd (second) joins
13:07:19Wingy1139793760180 quits [Ping timeout: 265 seconds]
13:11:50Wingy1139793760180 (Wingy) joins
13:43:41qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
13:59:31jacobk quits [Ping timeout: 265 seconds]
14:09:46Arcorann quits [Ping timeout: 240 seconds]
14:45:33jacobk joins
14:46:53Wingy1139793760180 quits [Ping timeout: 265 seconds]
14:47:32Wingy1139793760180 (Wingy) joins
15:00:33Wingy1139793760180 quits [Remote host closed the connection]
15:01:28Wingy1139793760180 (Wingy) joins
15:07:04Nulo_ joins
15:08:04jacobk quits [Ping timeout: 240 seconds]
15:08:09Nulo quits [Ping timeout: 265 seconds]
15:08:09Nulo_ is now known as Nulo
15:08:19jacobk joins
15:09:30jacobk quits [Client Quit]
15:35:13Wingy1139793760180 quits [Ping timeout: 265 seconds]
15:45:24<systwi_>thuban: Continuing from #archiveteam-ot, the page I wanted to grab is not very large: https://www.cyberciti.biz/faq/debian-linux-install-openssh-sshd-server/
15:46:27<systwi_>OrIdow6: Continuing from #archiveteam-ot, yeah, that website ^ uses Clownflare and it's only accessible through either a real browser or a curl-impersonator request.
15:50:12<thuban>if you're not too worried about the header integrity issues that we retired chromebot over, you could try using its engine, crocoite: https://github.com/PromyLOPh/crocoite
15:52:25<thuban>i've never tried to run this myself and i don't know if it still works, but if it does it would be much more convenient than writing your own script/wpull plugin
15:53:43<systwi_>It's worth a shot. I'll install it and give it a try.
16:01:08<systwi_>Oh, if anyone already has an HTTrack instance up and running, could you please give that cyberciti.biz page I mentioned a try?
16:05:52Wingy1139793760180 (Wingy) joins
16:07:47Wingy1139793760180 quits [Remote host closed the connection]
16:08:35Wingy1139793760180 (Wingy) joins
16:14:19Wingy1139793760180 quits [Read error: Connection reset by peer]
16:16:02Wingy1139793760180 (Wingy) joins
16:22:39Wingy1139793760180 quits [Remote host closed the connection]
16:23:27Wingy1139793760180 (Wingy) joins
16:25:29spirit quits [Quit: Leaving]
16:40:28Wingy1139793760180 quits [Ping timeout: 240 seconds]
16:46:15spirit joins
16:49:30Wingy1139793760180 (Wingy) joins
16:53:00Wingy1139793760180 quits [Remote host closed the connection]
16:53:49Wingy1139793760180 (Wingy) joins
17:07:59Wingy1139793760180 quits [Remote host closed the connection]
17:10:40Wingy1139793760180 (Wingy) joins
17:22:59Wingy1139793760180 quits [Remote host closed the connection]
17:23:34jacobk joins
17:23:47Wingy1139793760180 (Wingy) joins
17:37:58Wingy1139793760180 quits [Read error: Connection reset by peer]
17:38:48Wingy1139793760180 (Wingy) joins
17:43:09mikael quits [Quit: ZNC - http://znc.in]
17:46:56<systwi_>crocoite didn't work. :-/
17:49:35jacobk quits [Ping timeout: 265 seconds]
17:49:48jacobk joins
17:53:03<thuban>well, wget and curl don't 'compose', and having looked at the wpull plugin api i don't think you can override the actual request implementation (is that an architectural limitation or just an oversight?)
17:53:07Wingy1139793760180 quits [Read error: Connection reset by peer]
17:54:19<jamesp>Should the GitLab project be under #gitgud (used for GitHub) or #gitlost? I'm thinking merge with #gitgud
17:56:06<thuban>systwi_: so i guess in your position i would write a little python script that calls curl-impersonate, saves the results, and after extracting links/urls from appropriate mime types derelativizes, dedupes, and queues them
17:56:27<thuban>the naïve implementation won't scale, but you don't need scale
18:04:16<systwi_>I'm thinking that's the next best step. I can already access the correct page with curl-impersonate, so it's up to me to do the rest.
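The script thuban describes can stay small; a sketch, assuming curl-impersonate's `curl_chrome104` wrapper is on PATH (the wrapper name, the output filenames, and the crude regex link extraction are all illustrative assumptions, not a robust parser):

```python
# Minimal single-site spider around curl-impersonate, as described above:
# fetch, save, extract links, derelativize, dedupe, queue.
import re
import subprocess
from collections import deque
from urllib.parse import urljoin, urldefrag

CURL = "curl_chrome104"  # curl-impersonate wrapper; adjust to your build
HREF = re.compile(r'href="([^"#]+)"')

def fetch(url):
    """Fetch with a browser-like TLS fingerprint via curl-impersonate."""
    out = subprocess.run([CURL, "-sL", url], capture_output=True, timeout=60)
    return out.stdout.decode("utf-8", errors="replace")

def extract_links(base, html):
    """Derelativize and dedupe hrefs found in the page."""
    links = set()
    for href in HREF.findall(html):
        absolute = urldefrag(urljoin(base, href))[0]
        if absolute.startswith("http"):
            links.add(absolute)
    return links

def crawl(start, limit=100):
    """Naive BFS crawl; as noted above, it won't scale, but needn't."""
    queue, done = deque([start]), set()
    while queue and len(done) < limit:
        url = queue.popleft()
        if url in done:
            continue
        done.add(url)
        html = fetch(url)
        open(f"page{len(done):04d}.html", "w").write(html)  # save result
        for link in extract_links(url, html):
            if link.startswith(start):  # stay under the starting prefix
                queue.append(link)
    return done
```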
18:06:22<thuban>seems damn silly to reimplement a whole spider/scraper, though. you'd think we'd have something for this
18:08:01<systwi_>Is the need to use tools like curl-impersonate on some pages relatively new? I never recall encountering such situations even five months ago.
18:08:34<@JAA>No, Buttflare has always been a pain in the butt.
18:08:39<systwi_>But yeah, it would be nice to see this in a nicer tool than something crummy that I'd make. :-P
18:09:00<@JAA>But apart from using a headless browser etc., there wasn't any tooling for it until curl-impersonate emerged.
18:09:45<systwi_>Yeah, but I mean in the sense of TLS fingerprinting that Cloudflare does.
18:09:55<systwi_>I see.
18:10:39<thuban>i personally ran into it at least a year ago
18:10:58<@JAA>They did increase the fingerprinting, but that was some time ago. I think around the same time they rolled out the new JS challenge.
18:11:14<thuban>JAA: got any insight on the wpull plugin question?
18:11:42<Jake>systwi_: using a bit of an experimental crawler I've been working on, we added https://github.com/refraction-networking/utls which impersonates Chrome's ClientHello. I captured that site in WARC for you here: https://jakel.rocks/up/b96ca2096dbcbeb2/ZENO-20220804175848709-00001-ATHENA.warc.gz
18:12:30<thuban>:0
18:13:29<systwi_>Woah, thanks so much, Jake! :-D
18:13:35<systwi_>Extremely helpful.
18:14:03<@JAA>thuban: It's complicated. Yeah, you can't override the actual requesting. But wpull is highly modular. In principle, it should be relatively easy to e.g. use just its scraping/link extraction part. I doubt this is documented much though.
18:15:52<@JAA>(And actually, you probably can override the requesting via a plugin, except that's currently broken due to a bug in the plugin system...)
18:16:05<thuban>(?)
18:19:32<Jake>np! Always happy to help. :)
18:19:33<@JAA>Doing that would be incredibly messy, but it should be possible. You can replace individual components of wpull in a plugin, see e.g. here: https://github.com/ArchiveTeam/ArchiveBot/blob/4a672dbff49597dd8a1f53d95ee60f6ff17a5c87/pipeline/archivebot/wpull/plugin.py#L71
18:20:08<@JAA>And the relevant bug is this: https://github.com/ArchiveTeam/wpull/issues/383
18:20:28<@JAA>(I really should get that giant PR merged already, huh?)
18:22:40<thuban>i read the first post on that pr and went 'but what if they need references to objects that don't exist yet?' then i read the second post on the pr :<
18:23:51<thuban>(yes. what were the "various issues" with 2.0.1?)
18:24:59<@JAA>Here's a selection from the laundry list: https://github.com/ArchiveTeam/wpull/pull/393
18:27:19<thuban>nice
18:27:39<@JAA>(That's the giant PR I was talking about.)
18:32:31<thuban>speaking of wpull, i was thinking again about the subdomains issue from the other day, and it occurred to me that it might be nice to add 'subdomains' to the `--span-hosts-allow` options and that this would be fairly simple from a code standpoint.
18:35:31<thuban>do you think i should put in a pr (assuming i can figure out the test suite)? chfoo's comments on https://github.com/ArchiveTeam/wpull/issues/373 suggest that `--span-hosts-allow` has a limited future given its ambiguities / the lack of power in its implementation (reservations i'm sympathetic to), but it's been five years, so...
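The matching logic thuban has in mind is indeed small; a hypothetical sketch (not wpull's actual code) of how a 'subdomains' rule could decide whether a candidate host is spannable:

```python
# Hypothetical 'subdomains' check for a --span-hosts-allow rule:
# accept a host if it equals an allowed host or is a subdomain of one.
def spans_subdomain(candidate, allowed_hosts):
    """Return True if candidate is an allowed host or a subdomain of one."""
    candidate = candidate.lower().rstrip(".")
    for base in allowed_hosts:
        base = base.lower().rstrip(".")
        # endswith("." + base) avoids matching e.g. "evilexample.com"
        if candidate == base or candidate.endswith("." + base):
            return True
    return False
```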
18:41:04sec^nd quits [Remote host closed the connection]
18:41:05HackMii quits [Write error: Broken pipe]
18:41:36<thuban>or, hm, i guess there's been no development to speak of since 2019. https://hackint.logs.kiska.pw/archiveteam-bs/20201028#c40220 i forgot we already talked about this
18:41:37sec^nd (second) joins
18:42:01<@JAA>Yeah
18:42:11<@JAA>That idea is something worth considering, I guess.
18:42:43HackMii (hacktheplanet) joins
18:46:51<@JAA>Although in general I think the focus should be on fixing the bugs first.
18:46:58<@JAA>Anyway, we're well into -dev territory. :-)
19:01:12HackMii quits [Remote host closed the connection]
19:03:14HackMii (hacktheplanet) joins
19:06:26HackMii quits [Remote host closed the connection]
19:06:57HackMii (hacktheplanet) joins
19:12:32ghuntley joins
19:13:18sec^nd quits [Remote host closed the connection]
19:13:18HackMii quits [Write error: Broken pipe]
19:14:07sec^nd (second) joins
19:14:26HackMii (hacktheplanet) joins
19:41:03HackMii quits [Remote host closed the connection]
19:41:03sec^nd quits [Remote host closed the connection]
19:41:12jacobk quits [Client Quit]
19:41:27HackMii (hacktheplanet) joins
19:41:36sec^nd (second) joins
19:49:32HackMii quits [Remote host closed the connection]
19:49:53HackMii (hacktheplanet) joins
19:59:46spirit quits [Client Quit]
20:13:01spirit joins
20:22:36Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ]
20:35:48Shjosan (Shjosan) joins
21:09:37Gereon620 quits [Client Quit]
21:09:52Gereon620 (Gereon) joins
21:13:40michaelblob quits [Ping timeout: 240 seconds]
21:19:54Ruka (Ruk8) joins
21:20:10<Ruka>Hello Everyone!
21:22:28Ruka quits [Read error: Connection reset by peer]
21:22:35Ruka (Ruk8) joins
21:28:35Ruka quits [Client Quit]
21:30:58<pabs>I note that IA's description of ArchiveTeam's ArchiveBot still mentions EFNet even though you moved to hackint: https://archive.org/details/archivebot
21:31:45Ruk8 (Ruk8) joins
21:32:51<pabs>hi Ruk8, welcome! Did you have a question?
21:36:05<Ruk8>Nothing in particular; like a few days ago, I have a list of URLs that need to be archived... Today's list is mainly composed of Italian scientific journals, and for the rest there are some videos hosted on a CDN.
21:37:21<Ruk8>(I'm the guy that requested the adobe archival of framemaker/robohelp installers)
21:41:00<thuban>yes, i remember. if you upload your list to transfer.archivete.am and paste the link here again, someone will queue it into archivebot
21:41:19Somebody2 (Somebody2) joins
21:44:10<Ruk8>Here's the list: https://transfer.archivete.am/10GlW3/urls.txt
21:47:18ghuntley quits [Client Quit]
21:48:17<thuban>thanks! someone will start the job soon
21:48:59<pabs>is there an advantage of using AB for that rather than SPN?
21:50:32<Ruk8>Thanks everyone! I'm glad to offer some help
21:54:40<thuban>pabs: ime, reliability, mostly. spn can be a bit flaky under load, especially for bulk submissions
21:55:47<pabs>ok. I've found the SPN email interface to be reasonable for those the couple of times I used it
22:01:34<thuban>there are also (as always) some edge cases--e.g. tumblr's image rewriting breaks most images on tumblr pages under spn, but archivebot can disable it with `--user-agent-alias=curl`
22:04:16HackMii quits [Ping timeout: 240 seconds]
22:06:45HackMii (hacktheplanet) joins
22:32:18cowsay-moo joins
22:32:50<Jake>all good. :) now oocities... I don't believe _we_ run it? Let me double check if we have any more info on that
22:33:18<cowsay-moo>no you don't run it.. I thought since you are in archive circles, that you may have heard something
22:35:05jacobk joins
22:39:05<cowsay-moo>on the oocities FAQ (under "can I download oocities.org"), they mention that they are interested in working with others who want to make backups of the content. If they didn't lose a server or something, maybe archiveteam could work with them at some point to make a backup? https://web.archive.org/web/20220619181802/http://www.oocities.org/geocities-archive/faq.html
22:40:30<cowsay-moo>out of all the geocities archives, they had the most content. geocities.ws is down permanently, so we already lost one archive. reocities is still up. I'd hate to see this information lost.. it's all in the hands of a single group
22:41:39qwertyasdfuiopghjkl joins
22:42:04Matthww quits [Ping timeout: 240 seconds]
22:42:42<thuban>i think the archiveteam torrent was more comprehensive
22:44:08<cowsay-moo>is everything in the torrent on the wayback machine? most of what I've tried to pull up in the past on WBM hasn't been archived
22:49:03<thuban>i don't think so--this was ages ago; afaict the project wasn't even in warc
22:50:00<Jake>yes, sorry. I was trying to see if we had any more information written down about oocities, but I can't seem to locate anything. This project wasn't WARC, so it won't be on the WBM, but rather in torrents.
22:50:00<cowsay-moo>boo... no seeders on the torrent
22:50:36<cowsay-moo>good to know thanks
22:50:44<thuban>https://archive.org/details/archiveteam-geocities
22:51:12Matthww joins
22:51:35<Jake>yeah, I believe we should have a full copy on IA somewhere, even if the torrent isn't seeding anymore.
22:51:50<cowsay-moo>yeah I have that page pulled up already, but there's no text search
22:52:11<thuban>https://wiki.archiveteam.org/index.php/GeoCities#How_can_I_find_a_page_or_website_I'm_looking_for?
22:53:33<cowsay-moo>thanks I'll do some more reading
22:54:39<cowsay-moo>I actually find more "real" information on old geocities archives now than I do on modern search engines, it seems. so much info scrubbed, or deranked, or pushed aside by AI-generated garbage. you'd be surprised how great a geocities search is in 2022... lol.
22:56:30Matthww quits [Ping timeout: 265 seconds]
22:57:36<thuban>(by the way, tbp's peer numbers are often unreliable; there were definitely seeders on the torrent quite recently)
22:59:01<cowsay-moo>I'll dig out a spare drive and give it a shot.. thanks
23:15:40wickedplayer494 quits [Ping timeout: 240 seconds]
23:16:11wickedplayer494 joins
23:16:28Terbium quits [Ping timeout: 240 seconds]
23:16:40Terbium joins
23:16:52pabs quits [Ping timeout: 240 seconds]
23:17:28pabs (pabs) joins
23:18:54katocala joins
23:19:20Matthww joins
23:27:26Arcorann (Arcorann) joins
23:41:18michaelblob (michaelblob) joins
23:42:30<Jake>https://twitter.com/gitlab/status/1555325376687226883
23:42:36<Jake>This sounds... less critical than before.
23:43:14<Jake>(as long as it's still accessible by the public, and not archived just for the repo owner...)