| 00:09:11 | | Wingy1139793760180 quits [Remote host closed the connection] |
| 00:10:00 | | Wingy1139793760180 (Wingy) joins |
| 01:42:26 | | tzt quits [Ping timeout: 265 seconds] |
| 02:12:13 | | tzt (tzt) joins |
| 02:54:56 | | wickedplayer494 quits [Ping timeout: 265 seconds] |
| 02:55:54 | | wickedplayer494 joins |
| 02:57:17 | | wickedplayer494 is now authenticated as wickedplayer494 |
| 03:10:48 | | Wingy1139793760180 quits [Remote host closed the connection] |
| 03:11:35 | | Wingy1139793760180 (Wingy) joins |
| 04:28:22 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 04:51:27 | | Wingy1139793760180 quits [Remote host closed the connection] |
| 04:52:18 | | Wingy1139793760180 (Wingy) joins |
| 05:16:32 | <pabs> | https://www.theregister.com/2022/08/04/gitlab_data_retention_policy/ |
| 05:17:12 | <pabs> | (GitLab to delete inactive projects from users on the free tier) |
| 05:19:16 | | wickedplayer494 quits [Ping timeout: 240 seconds] |
| 05:19:48 | | wickedplayer494 joins |
| 05:29:06 | <@OrIdow6> | Highlights: "the policy is scheduled to come into force in September 2022"; "GitLab... will... give users weeks or months of warning" |
| 05:34:35 | <@OrIdow6> | Looking into discovery |
| 05:47:09 | <@OrIdow6> | Will look into discovery tomorrow |
| 05:47:12 | <@OrIdow6> | Anyhow thanks pabs |
| 05:47:39 | <pabs> | thanks |
| 05:49:59 | | Wingy1139793760180 quits [Client Quit] |
| 06:23:02 | | Megame (Megame) joins |
| 06:23:13 | | sec^nd quits [Remote host closed the connection] |
| 06:23:49 | | sec^nd (second) joins |
| 06:29:16 | | Gereon620 (Gereon) joins |
| 06:36:21 | | Wingy1139793760180 (Wingy) joins |
| 06:42:07 | | ghuntley joins |
| 06:48:07 | <AK> | The gitlab one could be a big one |
| 06:49:16 | <Jake> | Yup and incredibly sad if they actually end up going through with it |
| 06:49:41 | <Jake> | Somewhat ironic to me that saving 1M/y is worth sacrificing years of goodwill |
| 06:49:58 | <AK> | "GitLab is aware of the potential for angry opposition to the plan", sounds like they know it will piss people off, but I guess they're hoping it will make it worth it |
| 06:54:26 | <ghuntley> | GitLab's on-call person/SRE is about to have a bad couple of nights... |
| 06:55:39 | <Jake> | well, we'd hope to not cause any huge problems... |
| 07:00:26 | <ghuntley> | I wonder if the cost in network egress will go above $1M in storage costs... |
| 07:00:35 | <ghuntley> | (also compute...) |
| 07:01:02 | <ghuntley> | (also SRE salary/time / project planning in response to the new load) |
| 07:01:28 | <Jake> | Oh I mean, probably! Seems incredibly shortsighted to me, but /shrug. |
| 07:03:09 | <ghuntley> | time to pull out this article again - https://articles.uie.com/beans-and-noses/ |
| 07:03:22 | | Wingy1139793760180 quits [Ping timeout: 265 seconds] |
| 07:04:56 | <Jake> | looks like GitLab runs under cloudflare... could end up being annoying. |
| 07:07:35 | <Jake> | trying to find public info on any ratelimits, but they seem to be generous right now. |
| 07:31:39 | | DiscantX joins |
| 07:31:43 | | DiscantX quits [Remote host closed the connection] |
| 07:35:08 | | DiscantX_pi quits [Quit: WeeChat 2.3] |
| 07:38:10 | | G4te_Keep3r34 quits [Ping timeout: 265 seconds] |
| 07:43:36 | | DiscantX joins |
| 07:52:32 | | Wingy1139793760180 (Wingy) joins |
| 08:12:45 | <tech234a> | https://gitlab.com/explore/projects?sort=latest_activity_asc |
| 08:13:28 | <tech234a> | https://gitlab.com/explore/projects?sort=created_asc |
| 08:13:41 | <tech234a> | First project in that second list is ID 450 |
| 08:13:52 | <tech234a> | second one is 526 |
| 08:14:30 | <tech234a> | IDs appear sequential |
| 08:14:51 | <tech234a> | Current max ID is approximately 38337314 |
| 08:15:01 | <tech234a> | based on https://gitlab.com/explore/projects?sort=created_desc |
| 08:17:15 | <tech234a> | API endpoint for the ID ex: https://gitlab.com/api/v4/projects/526 |
| 08:18:07 | <tech234a> | can also use ID without API: https://gitlab.com/projects/526 |
| 08:19:58 | <tech234a> | so probably about ~38.4 million items with many deleted/private |
| 08:20:06 | <tech234a> | OrIdow6: ^ |
| 08:21:38 | <tech234a> | returns 404 for a private repo |
| 08:23:07 | | Megame quits [Client Quit] |
| 08:23:09 | <Jake> | I see a generous rate limit on that call of 2K/minute. |
| 08:24:28 | <Jake> | 19.2K minutes if we use the full rate limit every minute on an ip or about 13 days I think. Should be possible. |
| 08:27:47 | <tech234a> | also as a note: in addition to public and private, there is also a now-discontinued "internal" visibility setting that some repos might still have. "Internal" means visible to any logged-in user. |
| 08:29:16 | | Wingy1139793760180 quits [Ping timeout: 240 seconds] |
| 08:30:10 | <tech234a> | found a project ID 143 as new lowest, it was archived so it wasn't originally visible in the project listing |
| 08:35:29 | <tech234a> | API for listing https://gitlab.com/api/v4/projects?sort=asc&order_by=id |
| 08:38:08 | <tech234a> | Docs for this: https://docs.gitlab.com/ee/api/projects.html#list-all-projects |
| 08:39:08 | <tech234a> | Also make note that keyset-based pagination is likely needed to avoid running into a limit https://docs.gitlab.com/ee/api/projects.html#pagination-limits |
| 08:55:19 | | Wingy1139793760180 (Wingy) joins |
| 09:30:40 | | G4te_Keep3r34 joins |
| 09:31:56 | | ghuntley quits [Client Quit] |
| 09:43:31 | | HackMii quits [Remote host closed the connection] |
| 09:44:04 | | HackMii (hacktheplanet) joins |
| 10:38:19 | | Wingy1139793760180 quits [Remote host closed the connection] |
| 10:39:10 | | Wingy1139793760180 (Wingy) joins |
| 11:09:05 | <apache2_> | would it be possible for someone here to archive vtda.org? it's got a ton of useful retrocomputing materials |
| 12:10:00 | | HackMii quits [Remote host closed the connection] |
| 12:10:00 | | sec^nd quits [Remote host closed the connection] |
| 12:10:44 | | HackMii (hacktheplanet) joins |
| 12:10:58 | | sec^nd (second) joins |
| 12:25:13 | | katocala quits [Remote host closed the connection] |
| 12:46:09 | | joepie91|m joins |
| 12:46:30 | | thuban quits [Read error: Connection reset by peer] |
| 12:47:02 | | thuban joins |
| 12:54:17 | | sec^nd quits [Remote host closed the connection] |
| 12:54:51 | | sec^nd (second) joins |
| 13:07:19 | | Wingy1139793760180 quits [Ping timeout: 265 seconds] |
| 13:11:50 | | Wingy1139793760180 (Wingy) joins |
| 13:43:41 | | qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds] |
| 13:59:31 | | jacobk quits [Ping timeout: 265 seconds] |
| 14:09:46 | | Arcorann quits [Ping timeout: 240 seconds] |
| 14:45:33 | | jacobk joins |
| 14:46:53 | | Wingy1139793760180 quits [Ping timeout: 265 seconds] |
| 14:47:32 | | Wingy1139793760180 (Wingy) joins |
| 15:00:33 | | Wingy1139793760180 quits [Remote host closed the connection] |
| 15:01:28 | | Wingy1139793760180 (Wingy) joins |
| 15:07:04 | | Nulo_ joins |
| 15:08:04 | | jacobk quits [Ping timeout: 240 seconds] |
| 15:08:09 | | Nulo quits [Ping timeout: 265 seconds] |
| 15:08:09 | | Nulo_ is now known as Nulo |
| 15:08:19 | | jacobk joins |
| 15:09:30 | | jacobk quits [Client Quit] |
| 15:35:13 | | Wingy1139793760180 quits [Ping timeout: 265 seconds] |
| 15:45:24 | <systwi_> | thuban: Continuing from #archiveteam-ot, the page I wanted to grab is not very large: https://www.cyberciti.biz/faq/debian-linux-install-openssh-sshd-server/ |
| 15:46:27 | <systwi_> | OrIdow6: Continuing from #archiveteam-ot, yeah, that website ^ uses Clownflare and it's only accessible through either a real browser or a curl-impersonator request. |
| 15:50:12 | <thuban> | if you're not too worried about the header integrity issues that we retired chromebot over, you could try using its engine, crocoite: https://github.com/PromyLOPh/crocoite |
| 15:52:25 | <thuban> | i've never tried to run this myself and i don't know if it still works, but if it does it would be much more convenient than writing your own script/wpull plugin |
| 15:53:43 | <systwi_> | It's worth a shot. I'll try installing it and giving it a try. |
| 15:53:59 | <systwi_> | *I'll install it and give it a try. |
| 16:01:08 | <systwi_> | Oh, if anyone already has an HTTrack instance up and running, could you please give that cyberciti.biz page I mentioned a try? |
| 16:05:52 | | Wingy1139793760180 (Wingy) joins |
| 16:07:47 | | Wingy1139793760180 quits [Remote host closed the connection] |
| 16:08:35 | | Wingy1139793760180 (Wingy) joins |
| 16:14:19 | | Wingy1139793760180 quits [Read error: Connection reset by peer] |
| 16:16:02 | | Wingy1139793760180 (Wingy) joins |
| 16:22:39 | | Wingy1139793760180 quits [Remote host closed the connection] |
| 16:23:27 | | Wingy1139793760180 (Wingy) joins |
| 16:25:29 | | spirit quits [Quit: Leaving] |
| 16:40:28 | | Wingy1139793760180 quits [Ping timeout: 240 seconds] |
| 16:46:15 | | spirit joins |
| 16:49:30 | | Wingy1139793760180 (Wingy) joins |
| 16:53:00 | | Wingy1139793760180 quits [Remote host closed the connection] |
| 16:53:49 | | Wingy1139793760180 (Wingy) joins |
| 17:07:59 | | Wingy1139793760180 quits [Remote host closed the connection] |
| 17:10:40 | | Wingy1139793760180 (Wingy) joins |
| 17:22:59 | | Wingy1139793760180 quits [Remote host closed the connection] |
| 17:23:34 | | jacobk joins |
| 17:23:47 | | Wingy1139793760180 (Wingy) joins |
| 17:37:58 | | Wingy1139793760180 quits [Read error: Connection reset by peer] |
| 17:38:48 | | Wingy1139793760180 (Wingy) joins |
| 17:43:09 | | mikael quits [Quit: ZNC - http://znc.in] |
| 17:46:56 | <systwi_> | crocoite didn't work. :-/ |
| 17:49:35 | | jacobk quits [Ping timeout: 265 seconds] |
| 17:49:48 | | jacobk joins |
| 17:53:03 | <thuban> | well, wget and curl don't 'compose', and having looked at the wpull plugin api i don't think you can override the actual request implementation (is that an architectural limitation or just an oversight)? |
| 17:53:07 | | Wingy1139793760180 quits [Read error: Connection reset by peer] |
| 17:54:19 | <jamesp> | Should the GitLab project be under #gitgud (used for GitHub) or #gitlost? I'm thinking merge with #gitgud |
| 17:56:06 | <thuban> | systwi_: so i guess in your position i would write a little python script that calls curl-impersonate, saves the results, and after extracting links/urls from appropriate mime types derelativizes, dedupes, and queues them |
| 17:56:27 | <thuban> | the naïve implementation won't scale, but you don't need scale |
| 18:04:16 | <systwi_> | I'm thinking that's the next best step. I can already access the correct page with curl-impersonate, so it's up to me to do the rest. |
| 18:06:22 | <thuban> | seems damn silly to reimplement a whole spiderer/scraper, though. you'd think we'd have something for this |
| 18:08:01 | <systwi_> | Is the need to use tools like curl-impersonate on some pages relatively new? I never recall encountering such situations even five months ago. |
| 18:08:34 | <@JAA> | No, Buttflare has always been a pain in the butt. |
| 18:08:39 | <systwi_> | But yeah, it would be nice to see this in a nicer tool than something crummy that I'd make. :-P |
| 18:09:00 | <@JAA> | But apart from using a headless browser etc., there wasn't any tooling for it until curl-impersonate emerged. |
| 18:09:45 | <systwi_> | Yeah, but I mean in the sense of TLS fingerprinting that Cloudflare does. |
| 18:09:55 | <systwi_> | I see. |
| 18:10:39 | <thuban> | i personally ran into it at least a year ago |
| 18:10:58 | <@JAA> | They did increase the fingerprinting, but that was some time ago. I think around the same time they rolled out the new JS challenge. |
| 18:11:14 | <thuban> | JAA: got any insight on the wpull plugin question? |
| 18:11:42 | <Jake> | systwi_: using a bit of an experimental crawler I've been working on, we added https://github.com/refraction-networking/utls which impersonates Chrome's ClientHello. I captured that site in WARC for you here: https://jakel.rocks/up/b96ca2096dbcbeb2/ZENO-20220804175848709-00001-ATHENA.warc.gz |
| 18:12:30 | <thuban> | :0 |
| 18:13:29 | <systwi_> | Woah, thanks so much, Jake! :-D |
| 18:13:35 | <systwi_> | Extremely helpful. |
| 18:14:03 | <@JAA> | thuban: It's complicated. Yeah, you can't override the actual requesting. But wpull is highly modular. In principle, it should be relatively easy to e.g. use just its scraping/link extraction part. I doubt this is documented much though. |
| 18:15:52 | <@JAA> | (And actually, you probably can override the requesting via a plugin, except that's currently broken due to a bug in the plugin system...) |
| 18:16:05 | <thuban> | (?) |
| 18:19:32 | <Jake> | np! Always happy to help. :) |
| 18:19:33 | <@JAA> | Doing that would be incredibly messy, but it should be possible. You can replace individual components of wpull in a plugin, see e.g. here: https://github.com/ArchiveTeam/ArchiveBot/blob/4a672dbff49597dd8a1f53d95ee60f6ff17a5c87/pipeline/archivebot/wpull/plugin.py#L71 |
| 18:20:08 | <@JAA> | And the relevant bug is this: https://github.com/ArchiveTeam/wpull/issues/383 |
| 18:20:28 | <@JAA> | (I really should get that giant PR merged already, huh?) |
| 18:22:40 | <thuban> | i read the first post on that pr and went 'but what if they need references to objects that don't exist yet?' then i read the second post on the pr :< |
| 18:23:51 | <thuban> | (yes. what were the "various issues" with 2.0.1?) |
| 18:24:59 | <@JAA> | Here's a selection from the laundry list: https://github.com/ArchiveTeam/wpull/pull/393 |
| 18:27:19 | <thuban> | nice |
| 18:27:39 | <@JAA> | (That's the giant PR I was talking about.) |
| 18:32:31 | <thuban> | speaking of wpull, i was thinking again about the subdomains issue from the other day, and it occurred to me that it might be nice to add 'subdomains' to the `--span-hosts-allow` options and that this would be fairly simple from a code standpoint. |
| 18:35:31 | <thuban> | do you think i should put in a pr (assuming i can figure out the test suite)? chfoo's comments on https://github.com/ArchiveTeam/wpull/issues/373 suggest that `--span-hosts-allow` has a limited future given its ambiguities / the lack of power in its implementation (reservations i'm sympathetic to), but it's been five years, so... |
| 18:41:04 | | sec^nd quits [Remote host closed the connection] |
| 18:41:05 | | HackMii quits [Write error: Broken pipe] |
| 18:41:36 | <thuban> | or, hm, i guess there's been no development to speak of since 2019. https://hackint.logs.kiska.pw/archiveteam-bs/20201028#c40220 i forgot we already talked about this |
| 18:41:37 | | sec^nd (second) joins |
| 18:42:01 | <@JAA> | Yeah |
| 18:42:11 | <@JAA> | That idea is something worth considering, I guess. |
| 18:42:43 | | HackMii (hacktheplanet) joins |
| 18:46:51 | <@JAA> | Although in general I think the focus should be on fixing the bugs first. |
| 18:46:58 | <@JAA> | Anyway, we're well into -dev territory. :-) |
| 19:01:12 | | HackMii quits [Remote host closed the connection] |
| 19:03:14 | | HackMii (hacktheplanet) joins |
| 19:06:26 | | HackMii quits [Remote host closed the connection] |
| 19:06:57 | | HackMii (hacktheplanet) joins |
| 19:12:32 | | ghuntley joins |
| 19:13:18 | | sec^nd quits [Remote host closed the connection] |
| 19:13:18 | | HackMii quits [Write error: Broken pipe] |
| 19:14:07 | | sec^nd (second) joins |
| 19:14:26 | | HackMii (hacktheplanet) joins |
| 19:41:03 | | HackMii quits [Remote host closed the connection] |
| 19:41:03 | | sec^nd quits [Remote host closed the connection] |
| 19:41:12 | | jacobk quits [Client Quit] |
| 19:41:27 | | HackMii (hacktheplanet) joins |
| 19:41:36 | | sec^nd (second) joins |
| 19:49:32 | | HackMii quits [Remote host closed the connection] |
| 19:49:53 | | HackMii (hacktheplanet) joins |
| 19:59:46 | | spirit quits [Client Quit] |
| 20:13:01 | | spirit joins |
| 20:22:36 | | Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ] |
| 20:35:48 | | Shjosan (Shjosan) joins |
| 21:09:37 | | Gereon620 quits [Client Quit] |
| 21:09:52 | | Gereon620 (Gereon) joins |
| 21:13:40 | | michaelblob quits [Ping timeout: 240 seconds] |
| 21:19:54 | | Ruka (Ruk8) joins |
| 21:20:10 | <Ruka> | Hello Everyone! |
| 21:22:28 | | Ruka quits [Read error: Connection reset by peer] |
| 21:22:35 | | Ruka (Ruk8) joins |
| 21:28:35 | | Ruka quits [Client Quit] |
| 21:30:58 | <pabs> | I note that IA's description of ArchiveTeam's ArchiveBot still mentions EFNet even though you moved to hackint: https://archive.org/details/archivebot |
| 21:31:45 | | Ruk8 (Ruk8) joins |
| 21:32:51 | <pabs> | hi Ruk8, welcome! Did you have a question? |
| 21:36:05 | <Ruk8> | Nothing in particular, like a few days ago I have a list of URLs that need to be archived... Today's list is mainly composed of Italian Scientific Journals and for the rest there are some videos hosted on a cdn. |
| 21:37:21 | <Ruk8> | (I'm the guy that requested the adobe archival of framemaker/robohelp installers) |
| 21:41:00 | <thuban> | yes, i remember. if you upload your list to transfer.archivete.am and paste the link here again, someone will queue it into archivebot |
| 21:41:19 | | Somebody2 (Somebody2) joins |
| 21:44:10 | <Ruk8> | Here's the list: https://transfer.archivete.am/10GlW3/urls.txt |
| 21:47:18 | | ghuntley quits [Client Quit] |
| 21:48:17 | <thuban> | thanks! someone will start the job soon |
| 21:48:59 | <pabs> | is there an advantage of using AB for that rather than SPN? |
| 21:50:32 | <Ruk8> | Thanks everyone! I'm glad to offer some help |
| 21:54:40 | <thuban> | pabs: ime, reliability, mostly. spn can be a bit flaky under load, especially for bulk submissions |
| 21:55:47 | <pabs> | ok. I've found the SPN email interface to be reasonable for those, the couple of times I used it |
| 22:01:34 | <thuban> | there are also (as always) some edge cases--e.g. tumblr's image rewriting breaks most images on tumblr pages under spn, but archivebot can disable it with `--user-agent-alias=curl` |
| 22:04:16 | | HackMii quits [Ping timeout: 240 seconds] |
| 22:06:45 | | HackMii (hacktheplanet) joins |
| 22:32:18 | | cowsay-moo joins |
| 22:32:50 | <Jake> | all good. :) now oocities... I don't believe _we_ run it? Let me double check if we have any more info on that |
| 22:33:18 | <cowsay-moo> | no you don't run it.. I thought since you are in archive circles, that you may have heard something |
| 22:35:05 | | jacobk joins |
| 22:39:05 | <cowsay-moo> | on the oocities FAQ (under "can I download oocities.org"), they mention that they are interested in working with others who want to make backups of the content. If they didn't lose a server or something, maybe archiveteam could get with them sometime to make a backup at some point? https://web.archive.org/web/20220619181802/http://www.oocities.org/geocities-archive/faq.html |
| 22:40:30 | <cowsay-moo> | out of all the geocities archives, they had the most content. geocities.ws is down permanently, so we already lost one archive. reocities is still up. I'd hate to see this information lost.. it's all in the hands of a single group |
| 22:41:39 | | qwertyasdfuiopghjkl joins |
| 22:42:04 | | Matthww quits [Ping timeout: 240 seconds] |
| 22:42:42 | <thuban> | i think the archiveteam torrent was more comprehensive |
| 22:44:08 | <cowsay-moo> | is everything in the torrent on the wayback machine? most of what I've tried to pull up in the past on WBM hasn't been archived |
| 22:49:03 | <thuban> | i don't think so--this was ages ago; afaict the project wasn't even in warc |
| 22:50:00 | <Jake> | yes, sorry. I was trying to see if we had any more information written down about oocities, but I can't seem to locate anything. This project wasn't WARC, so it won't be on the WBM, but rather in torrents. |
| 22:50:00 | <cowsay-moo> | boo... no seeders on the torrent |
| 22:50:36 | <cowsay-moo> | good to know thanks |
| 22:50:44 | <thuban> | https://archive.org/details/archiveteam-geocities |
| 22:51:12 | | Matthww joins |
| 22:51:35 | <Jake> | yeah, I believe we should have a full copy on IA somewhere, even if the torrent isn't seeding anymore. |
| 22:51:50 | <cowsay-moo> | yeah I have that page pulled up already, but there's no text search |
| 22:52:11 | <thuban> | https://wiki.archiveteam.org/index.php/GeoCities#How_can_I_find_a_page_or_website_I'm_looking_for? |
| 22:53:33 | <cowsay-moo> | thanks I'll do some more reading |
| 22:54:39 | <cowsay-moo> | I actually find more "real" information on old geocities archives now than I do on modern search engines, it seems. so much info scrubbed, or deranked, or pushed aside by AI-generated garbage. you'd be surprised how great a geocities search is in 2022... lol. |
| 22:56:30 | | Matthww quits [Ping timeout: 265 seconds] |
| 22:57:36 | <thuban> | (by the way, tbp's peer numbers are often unreliable; there were definitely seeders on the torrent quite recently) |
| 22:59:01 | <cowsay-moo> | I'll dig out a spare drive and give it a shot.. thanks |
| 23:09:12 | | wickedplayer494 is now authenticated as wickedplayer494 |
| 23:15:40 | | wickedplayer494 quits [Ping timeout: 240 seconds] |
| 23:16:11 | | wickedplayer494 joins |
| 23:16:28 | | Terbium quits [Ping timeout: 240 seconds] |
| 23:16:40 | | Terbium joins |
| 23:16:43 | | wickedplayer494 is now authenticated as wickedplayer494 |
| 23:16:52 | | pabs quits [Ping timeout: 240 seconds] |
| 23:17:28 | | pabs (pabs) joins |
| 23:18:54 | | katocala joins |
| 23:19:20 | | Matthww joins |
| 23:20:00 | | katocala is now authenticated as katocala |
| 23:27:26 | | Arcorann (Arcorann) joins |
| 23:41:18 | | michaelblob (michaelblob) joins |
| 23:42:30 | <Jake> | https://twitter.com/gitlab/status/1555325376687226883 |
| 23:42:36 | <Jake> | This sounds... less critical than before. |
| 23:43:14 | <Jake> | (as long as it's still accessible by the public, and not archived just for the repo owner...) |
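
Note on the GitLab discovery figures from earlier in the log (08:12 to 08:39): the following is a minimal sketch, not anyone's actual discovery script. It assumes the roughly 38.4 million ID ceiling and the 2K/minute rate limit quoted in the channel still hold, and it only builds URLs and does the arithmetic without making any requests. The helper names (`project_url`, `estimate_days`, `keyset_listing_url`) are hypothetical. The query parameters (`pagination=keyset`, `order_by`, `sort`, `per_page`, `id_after`) reflect my reading of the GitLab projects API docs linked above and should be checked against them before use.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint quoted in the log; everything else below is illustrative.
API_BASE = "https://gitlab.com/api/v4/projects"

MAX_PROJECT_ID = 38_400_000     # "~38.4 million items" per the log
RATE_LIMIT_PER_MINUTE = 2_000   # "2K/minute" as observed in the log


def project_url(project_id: int) -> str:
    """Per-ID API URL, e.g. .../projects/526 (404 for private or deleted IDs)."""
    return f"{API_BASE}/{project_id}"


def estimate_days(max_id: int, per_minute: int) -> float:
    """Days needed to probe every ID once from one IP at the full rate limit."""
    minutes = max_id / per_minute
    return minutes / (60 * 24)


def keyset_listing_url(id_after: Optional[int] = None, per_page: int = 100) -> str:
    """Listing URL using keyset pagination, ordered by ascending ID.

    Parameter names are assumptions based on the public API docs.
    """
    params = {
        "pagination": "keyset",
        "order_by": "id",
        "sort": "asc",
        "per_page": per_page,
    }
    if id_after is not None:
        params["id_after"] = id_after
    return f"{API_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    # 38.4M IDs / 2K per minute = 19,200 minutes, or about 13 days.
    print(round(estimate_days(MAX_PROJECT_ID, RATE_LIMIT_PER_MINUTE), 1))
    print(project_url(526))
    print(keyset_listing_url(id_after=526))
```

The listing endpoint only returns projects that exist and are visible, so it would likely need far fewer requests than probing every ID individually. The per-ID estimate is the worst case that matches the 19.2K minutes worked out in the channel.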