00:16:57 | | Wohlstand quits [Quit: Wohlstand] |
00:24:27 | | etnguyen03 quits [Client Quit] |
00:32:46 | | Chris5010 quits [Quit: Ping timeout (120 seconds)] |
00:33:04 | | Chris5010 (Chris5010) joins |
00:36:25 | | StarletCharlotte joins |
00:37:06 | | vitzli (vitzli) joins |
00:39:07 | <h2ibot> | PaulWise edited Finding subdomains (-79, SecurityTrails now paid only): https://wiki.archiveteam.org/?diff=55101&oldid=54541 |
00:39:47 | | chains joins |
00:58:26 | | etnguyen03 (etnguyen03) joins |
01:04:16 | | fionera quits [Quit: fionera] |
01:04:55 | | fionera (Fionera) joins |
01:04:59 | | fionera quits [Client Quit] |
01:28:40 | | StarletCharlotte quits [Remote host closed the connection] |
01:29:48 | <pabs> | JAA: re Infobox, I was thinking for the SoftwareHeritage wiki page, to add Libera #swh, hackint #codearchiver/#gitgud |
01:30:46 | <pabs> | !tell VoynichCR true, however the AT wiki page about SoftwareHeritage isn't about archiving SWH stuff, but about SWH's code archive |
01:30:47 | <eggdrop> | [tell] ok, I'll tell VoynichCR when they join next |
01:30:51 | <@JAA> | pabs: Eh, the three are separate projects really. |
01:32:07 | <pabs> | JAA: thanks for the radio4all link, how did you manage to find the right page? |
01:33:00 | <@JAA> | pabs: I looked for the items with a timestamp (in the title, not the upload date) shortly after the job finished. |
01:33:30 | <@JAA> | It'll usually be in the first one after that, as was the case with this one, too. Might occasionally be in a later pack. |
01:34:24 | <h2ibot> | PaulWise edited Software Heritage (+9, add Libera #swh IRC channel): https://wiki.archiveteam.org/?diff=55102&oldid=55098 |
01:34:43 | <pabs> | ok. should I file something about the indexing problem or will it resolve later? |
01:36:24 | <@JAA> | pabs: Not sure, the current viewer isn't even in the AB repo... |
01:36:27 | <@JAA> | chfoo: ^ |
01:41:26 | <h2ibot> | PaulWise edited Software Heritage (+124, SWH is mirrorable but not archivable): https://wiki.archiveteam.org/?diff=55103&oldid=55102 |
01:41:55 | <pabs> | !tell VoynichCR found a compromise for SoftwareHeritage, pointed archive status/type at the mirroring info https://www.softwareheritage.org/mirrors/ https://www.softwareheritage.org/2019/10/03/enea/ |
01:41:57 | <eggdrop> | [tell] ok, I'll tell VoynichCR when they join next |
01:43:27 | <h2ibot> | PaulWise edited Software Heritage (+32, mention bulk archiving restriction): https://wiki.archiveteam.org/?diff=55104&oldid=55103 |
01:44:15 | <pabs> | steering++ |
01:44:15 | <eggdrop> | [karma] 'steering' now has 44 karma! |
01:44:19 | <pabs> | that_lurker++ |
01:44:19 | <eggdrop> | [karma] 'that_lurker' now has 51 karma! |
01:44:36 | | chains_ joins |
01:48:46 | | chains quits [Ping timeout: 260 seconds] |
01:50:23 | <pabs> | JAA++ |
01:50:25 | <eggdrop> | [karma] 'JAA' now has 233 karma! |
01:52:16 | | chains_ quits [Client Quit] |
01:58:19 | | gust quits [Read error: Connection reset by peer] |
02:06:04 | | yasomi is now known as Xe |
02:09:23 | <Xe> | I work on software that protects websites against hyper-aggressive AI scrapers (via making them do a proof-of-work check) and I'm wondering how I can make sure that I don't accidentally block ArchiveTeam because your traffic patterns will obviously be flagged as anomalous. |
02:09:40 | <pabs> | which software are you working on? |
02:09:50 | <Xe> | https://github.com/TecharoHQ/anubis |
02:10:06 | <pabs> | ah we were just talking about that recently |
02:10:31 | <Xe> | right now it is super aggressive, super paranoid, and super block-y, but I want to figure out ways to tactically lessen its paranoia |
02:10:32 | <pabs> | also, please consider SWH, since Anubis is often used on code sites https://wiki.archiveteam.org/index.php/Software_Heritage |
02:10:44 | <Xe> | already talking with them :) |
02:10:52 | <pabs> | ah great |
02:11:19 | <pabs> | for ArchiveBot the non-Mozilla-UA bypass will work |
02:11:45 | <pabs> | it can set UA as the default AB one, or curl |
02:12:04 | <pabs> | for DPoS projects I'm not sure what UA they use |
02:12:08 | <Xe> | can you test on https://xeiaso.net? |
02:12:29 | <pabs> | yep |
02:12:48 | <Xe> | I care about making sure you pirate archivists can save culture, but I also am like so tired of AI scraper bot downtime lol |
02:13:48 | <pabs> | btw one problem with the non-Mozilla-UA bypass is that we often have to use -u firefox in order to bypass anti-bot stuff on other sites |
02:13:52 | <steering> | Xe++ |
02:13:53 | <eggdrop> | [karma] 'Xe' now has 1 karma! |
02:13:54 | <steering> | :) |
02:14:02 | <Xe> | pabs: yeah, it's...not ideal |
02:14:10 | <Xe> | that's why I'd like to make a more ideal solution |
02:14:19 | <Fijxu|m> | Thanks Xe for anubis |
02:14:38 | <Fijxu|m> | I now use it for https://inv.nadeko.net and has been working great |
02:14:40 | <Xe> | right now it's entirely a hack i did over a weekend because i kept getting paged by my git server going down |
02:14:58 | <Xe> | it's had to do about 6 months of software process maturity in 6 days :DDD |
02:15:29 | <Fijxu|m> | I'm also looking to add some features to Anubis ;3 |
02:15:31 | <pabs> | could allowlist AB IPs, but then they become public. and the list will change over time. |
02:15:56 | <pabs> | maybe for now you could allowlist all the AB User-Agents? |
02:16:14 | <Xe> | yeah, i'm thinking about writing some kind of RFC for this that would have the user agent point to a domain with the list of IP addresses in some kind of JSON file |
02:16:27 | <Xe> | just need to figure out like |
02:16:30 | <Xe> | all the hard parts |
02:16:33 | <Xe> | (it's all hard parts) |
02:16:48 | <pabs> | https://github.com/ArchiveTeam/ArchiveBot/tree/master/db/user_agents/ |
02:17:16 | <pabs> | that then becomes a way for sites to block AB though :) |
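[Editor's note: the idea Xe floats above — a User-Agent pointing at a domain that publishes its crawler IPs as JSON — could be checked on the blocker side roughly like this. The document format (`{"ranges": [...]}`) and the fetch URL are hypothetical; no such RFC exists yet.]

```python
import ipaddress
import json

def ip_allowed(client_ip: str, allowlist_doc: str) -> bool:
    """Check a client IP against a published JSON document of crawler ranges.

    Assumes a made-up format {"ranges": ["203.0.113.0/24", ...]}; the real
    format such an RFC would define is not settled. In practice the blocker
    would fetch the document from the domain named in the UA and cache it.
    """
    doc = json.loads(allowlist_doc)
    addr = ipaddress.ip_address(client_ip)
    # ip_network.__contains__ returns False on v4/v6 mismatch, so mixed
    # range lists are safe to scan.
    return any(addr in ipaddress.ip_network(r) for r in doc.get("ranges", []))

doc = '{"ranges": ["203.0.113.0/24", "2001:db8::/32"]}'
print(ip_allowed("203.0.113.7", doc))   # inside the published v4 range
print(ip_allowed("198.51.100.1", doc))  # outside every range
```

As pabs notes right after, publishing such a list cuts both ways: anything that lets a site *allow* the crawler also lets it *block* the crawler.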
02:17:21 | <Xe> | the thing that throws a wrench in all of this is residential proxy services |
02:17:29 | <Fijxu|m> | those are ancient user-agents.. |
02:18:16 | <pabs> | (AB UAs don't change very often, which also means using them becomes less effective at bypassing stuff over time) |
02:18:35 | <Xe> | yeah |
02:18:46 | <Xe> | you can see how this is like 99% hard parts though right lol |
02:18:52 | <pabs> | :) |
02:19:51 | <pabs> | arkiver JAA - ^ any thoughts on getting archiving bypass mechanisms into anti-bot tech? |
02:20:38 | <pabs> | I think allowlisting the AB UAs is a least-bad start |
02:20:43 | <Xe> | the extra extra hard part is that the code for the bot blocker is open source so the bad guys can easily just use it to bypass it |
02:21:46 | <steering> | it's an arms race like any other |
02:21:49 | <pabs> | did you add a non-JS PoW thing for us paranoid users btw? :) |
02:21:57 | <pabs> | also, one of your competitors: https://forge.lindenii.runxiyu.org/powxy/:/repos/powxy/ |
02:22:18 | <Xe> | i'm thinking about it, but like tbh, i'm starting to come to the conclusion that having first-party JS is a broken config |
02:22:36 | <Xe> | considering having the no-JS solution be a bunch of CCNA, Voight-Kampff and the like questions |
02:22:48 | | BennyOtt_ joins |
02:23:03 | <Xe> | er |
02:23:07 | <Xe> | not having first-party JS |
02:23:12 | <Xe> | i have had weird sleep lately lol |
02:23:27 | <Xe> | suddenly being a load-bearing part of the open source community is mildly terrifying tbh |
02:23:44 | <pabs> | I can only imagine :) |
02:24:04 | | BennyOtt quits [Ping timeout: 250 seconds] |
02:24:04 | | BennyOtt_ is now known as BennyOtt |
02:24:05 | | BennyOtt is now authenticated as BennyOtt |
02:24:23 | <pabs> | I was thinking the non-JS thing would just be a form and a sha256 command-line or something |
02:25:14 | <Xe> | i've considered that, but i don't want to leak implementation details |
02:25:34 | <Xe> | not to mention, actual malware is now using curl2bash as an exploit strategy |
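[Editor's note: pabs's "form and a sha256 command-line" suggestion is the classic hashcash shape — find a nonce whose hash has enough leading zero bits. A minimal sketch of that puzzle, solvable without a browser; the challenge format here is invented and does not reproduce Anubis's actual scheme.]

```python
import hashlib
import itertools

def solve_pow(challenge: str, difficulty_bits: int) -> int:
    """Find the smallest nonce such that sha256("challenge:nonce") starts
    with `difficulty_bits` zero bits. Expected work: ~2**difficulty_bits
    hashes, so difficulty tunes the cost per request."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        # Treat the digest as a 256-bit integer; top bits must be zero.
        if int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0:
            return nonce

def verify_pow(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: one hash to verify, regardless of difficulty."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0

nonce = solve_pow("example-challenge", 12)  # 12 bits: ~4096 hashes on average
print(verify_pow("example-challenge", nonce, 12))
```

The asymmetry (many hashes to solve, one to verify) is the whole point; Xe's stated concern is that shipping this as a documented shell one-liner leaks implementation details and resembles curl2bash patterns.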
02:26:06 | <Fijxu|m> | pabs: that looks neat |
02:26:20 | <Xe> | pabs: yeah i've been talking with runxiyu |
02:26:41 | <Fijxu|m> | pabs: that is possible |
02:26:55 | <Fijxu|m> | but it can be highly automated if you are getting specifically targeted |
02:27:26 | <pabs> | ok, DPoS projects seem to use a bunch of browser UAs, for eg imgur https://github.com/ArchiveTeam/imgur-grab/blob/master/user-agents.txt |
02:27:30 | <Xe> | pabs: nah part of me really wants to do CCNA questions |
02:27:41 | <Xe> | mostly because it would be funny |
02:27:42 | <pabs> | I would fail :( |
02:28:04 | <pabs> | anyway most of the sites Anubis is used on are JS-only GitLabs anyway |
02:28:04 | <steering> | I feel like AI would probably do better than me |
02:28:16 | <Fijxu|m> | I run a guestbook on my website, some dude started spamming it with a lot of IPs, so I added https://mcaptcha.org/ to it which has support for javascript and a non js client made on rust to solve the challenge without javascript |
02:28:42 | <Fijxu|m> | and then the dude used the client to solve and submit the captcha result... Spamming again |
02:28:47 | <@JAA> | We normally include 'ArchiveTeam' or 'Archive Team' (not necessarily at the beginning) in the UA, unless the site blocks that. That'd probably be the, uh, least unreliable method of identifying our traffic. |
02:29:37 | <Xe> | pabs: the extra ironic part is that i work for an object storage company where my job is to tell people how to copy files using the guise of generative AI |
02:29:38 | <@JAA> | A non-JS fallback would be much appreciated indeed. |
02:29:39 | <Xe> | my life is wild |
02:30:02 | <Fijxu|m> | JAA: https://github.com/TecharoHQ/anubis/issues/95 |
02:30:13 | <Fijxu|m> | Something will come out eventually |
02:30:14 | <@JAA> | Yeah, I've seen the issue. :-) |
02:32:42 | | sparky14921 (sparky1492) joins |
02:36:12 | | sparky1492 quits [Ping timeout: 250 seconds] |
02:36:13 | | sparky14921 is now known as sparky1492 |
02:40:14 | <pokechu22> | I believe archivebot's default UA includes mozilla in it |
02:40:45 | <pokechu22> | right, it's ArchiveTeam ArchiveBot/20210517.c1020e5 (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36 |
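[Editor's note: JAA's heuristic — 'ArchiveTeam' or 'Archive Team' somewhere in the UA, not necessarily at the start — can be sketched as a substring check. A match is "probably AT traffic", never proof: any client can copy a User-Agent string.]

```python
def looks_like_archiveteam(user_agent: str) -> bool:
    """Heuristic only: Archive Team tools usually include 'ArchiveTeam'
    or 'Archive Team' somewhere in the User-Agent, unless the target
    site blocks that string."""
    ua = user_agent.lower()
    return "archiveteam" in ua or "archive team" in ua

# ArchiveBot's default UA (quoted by pokechu22 above) matches; a plain
# browser UA, which jobs sometimes fall back to, does not.
ab_default = "ArchiveTeam ArchiveBot/20210517.c1020e5 (wpull 2.0.3)"
browser = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
           "(KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36")
print(looks_like_archiveteam(ab_default))  # True
print(looks_like_archiveteam(browser))     # False
```

The second case is exactly the gap pabs raised earlier: jobs restarted with `-u firefox` to bypass other anti-bot systems lose the identifying marker.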
02:57:14 | <Xe> | oh |
02:57:22 | <Xe> | exceedingly dumb question before the melatonin wins |
02:57:40 | <Xe> | would putting an Onion-Location header in the mix make it easier for archivebot |
02:57:51 | <Xe> | so that you could access the origin directly over tor hidden services |
02:59:48 | <pabs> | IIRC right now we don't have any Tor onion archiving, neither AB pipelines nor DPoS |
03:00:14 | <pabs> | I think there were both in the past though |
03:01:55 | <@JAA> | Correct |
03:02:28 | <@JAA> | Note that part of our effort is also URL preservation. A hidden service that isn't blocked is great, but nobody will find it in the Wayback Machine. |
03:03:15 | <pokechu22> | but to be clear, archivebot is generally monitored, and if we see 403s or other errors with the default user-agent we'll generally restart the job with a different user-agent |
03:03:35 | | etnguyen03 quits [Remote host closed the connection] |
03:03:46 | <pokechu22> | (you can see URLs being downloaded and the associated status codes on http://archivebot.com) |
03:07:41 | <h2ibot> | PaulWise edited Archiveteam:IRC/Relay (+453, add bot-heavy relay channels): https://wiki.archiveteam.org/?diff=55105&oldid=55022 |
03:09:48 | <pabs> | Xe: on that note, does Anubis use proper HTTP error codes? |
03:09:59 | <pabs> | is there one for humans-only?! |
03:15:17 | <pabs> | Xe: btw, the URLs project is probably the main DPoS that would hit Anubis instances, here are the UAs for it: https://github.com/ArchiveTeam/urls-grab/blob/master/user-agents.txt |
03:15:40 | <pabs> | https://wiki.archiveteam.org/index.php/URLs |
03:16:28 | <nicolas17> | and urls has a crazy scale https://tracker.archiveteam.org/urls/ |
03:17:08 | <nicolas17> | its requests are *hopefully* spread across tons of different websites being archived simultaneously, but there's no guarantees |
03:18:36 | <nicolas17> | speaking of residential proxies, a weird issue I've been having recently on my personal server is TCP SYNs from hundreds of IPs in the same residential-address block, often from brazil, netstat shows connections stuck in SYN_RECV state |
03:19:04 | <nicolas17> | yet it's nowhere near enough to actually affect my server like a DDoS... idk what they're trying to do |
03:19:39 | <pokechu22> | I heard about that happening (or something similar) to someone who ran a TOR exit node |
03:19:59 | <nicolas17> | I blocked the whole /16 and a few days later noticed something similar from another range |
03:19:59 | <steering> | nicolas17: mm, not just scanning for open ports? |
03:20:08 | <nicolas17> | no, it's all to my :443 |
03:20:54 | <nicolas17> | if they're trying to overload my webserver they're failing at it |
03:21:57 | <nicolas17> | pokechu22: I used to run a non-exit Tor node on that machine, but not even that anymore |
03:42:29 | <pabs> | IIRC the Tor thing was IP-spoofed TCP SYNs |
03:44:51 | <nicolas17> | this *could* be spoofed |
03:53:09 | | BluRS joins |
04:10:12 | <pokechu22> | It was ones spoofed as coming from your IP IIRC |
04:10:36 | <pokechu22> | https://delroth.net/posts/spoofed-mass-scan-abuse/ |
04:14:56 | <nicolas17> | oh scanning other people with your IP |
04:28:10 | | Webuser534462 joins |
05:01:38 | | fuzzy80211 quits [Read error: Connection reset by peer] |
05:02:14 | | fuzzy80211 (fuzzy80211) joins |
05:11:19 | | BlueMaxima quits [Read error: Connection reset by peer] |
05:20:30 | | fuzzy80211 quits [Read error: Connection reset by peer] |
05:22:14 | | AlsoHP_Archivist joins |
05:22:15 | | fuzzy80211 (fuzzy80211) joins |
05:24:36 | | HP_Archivist quits [Ping timeout: 260 seconds] |
05:49:36 | | ThetaDev quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
05:49:59 | | ThetaDev joins |
05:50:19 | <that_lurker> | steering: Could you also update the topics on #hackernews and #hackernews-firehose. |
05:51:02 | | midou quits [Read error: Connection reset by peer] |
05:51:08 | | midou joins |
05:52:49 | | loug83181422 joins |
05:57:24 | <pabs> | Xe: btw, Anubis is presumably bypassed by headless browser based bots? |
05:59:11 | | egallager quits [Quit: This computer has gone to sleep] |
06:00:17 | <steering> | that_lurker: oops, I gave you +ot on those instead of +* |
06:01:52 | | sec^nd quits [Remote host closed the connection] |
06:02:07 | | sec^nd (second) joins |
06:02:31 | | midou quits [Ping timeout: 260 seconds] |
06:03:04 | | fuzzy8021 (fuzzy80211) joins |
06:05:22 | | midou joins |
06:05:30 | | fuzzy80211 quits [Ping timeout: 250 seconds] |
06:07:17 | <that_lurker> | steering++ |
06:07:17 | <eggdrop> | [karma] 'steering' now has 45 karma! |
06:08:53 | | SootBector quits [Ping timeout: 276 seconds] |
06:14:55 | <pabs> | Xe: looks like the answer is yes. for example, here is AT's in-progress Brozzler based Mnbot bypassing Anubis https://mnbot.very-good-quality-co.de/item/72b3897a-b086-480e-94c7-d6194f638cf4 |
06:16:04 | <pabs> | https://wiki.archiveteam.org/index.php/User:TheTechRobo/Mnbot |
06:27:50 | | fuzzy80211 (fuzzy80211) joins |
06:30:13 | | fuzzy8021 quits [Read error: Connection reset by peer] |
06:38:35 | | egallager joins |
07:13:41 | | Island quits [Read error: Connection reset by peer] |
07:38:14 | | Sokar quits [Ping timeout: 250 seconds] |
07:46:21 | | midou quits [Ping timeout: 260 seconds] |
07:55:22 | | midou joins |
08:01:45 | | sec^nd quits [Remote host closed the connection] |
08:01:56 | | sec^nd (second) joins |
08:16:09 | | kuroger quits [Quit: ZNC 1.9.1 - https://znc.in] |
08:21:07 | | kuroger (kuroger) joins |
08:38:12 | <Xe> | pabs: it can't use proper HTTP error codes because using them makes websites re-enqueue things |
08:38:58 | <Xe> | makes crawlers re-enqueue things |
08:39:07 | <Xe> | all the challenge pages throw 200 |
08:54:24 | <@arkiver> | Xe: that is a major problem in archiving. if there is no sign that "rate limiting" (or however you want to call this) is happening, pages will simply be archived and end up in web archives as if those are the pages a regular user would see |
08:55:02 | <@arkiver> | i don't think we can make a fundamental distinction between web archive crawlers and LLM crawlers, without using lists of IPs etc. |
08:55:38 | <@arkiver> | ArchiveBot uses some static IPs which could be listed as being web archive crawlers, but many of Archive Team's projects use resources from all over the world, and those IPs change all the time |
08:56:16 | <@arkiver> | so the only way i see this happening at the moment is with handing over lists of IPs. of course we could do something like that. there could perhaps be some 'central repository' with these IPs. |
08:57:01 | <@arkiver> | however this may not work on 'public' Archive Team projects, since LLM crawlers could get their IPs added to this list by running some of our projects, and then further abuse the position of those IPs to start crawling for LLMs |
08:58:17 | <@arkiver> | i also suspect that any general measures to attempt to allow web archive crawlers, but prevent LLM crawlers, will eventually not work as LLM crawlers can pretend to be web archive crawlers. |
08:59:40 | <@arkiver> | the bottom line to all this would be that the 'open nature' or open source aspects of web archiving are not suitable for sites that employ anti-LLMs measures, and we'd much more need to employ closed source code, better control IPs and what they're used for, etc. |
09:00:39 | <@rewby> | I don't know if it's been mentioned yet, but came across https://forum.safeguar.de/t/end-of-atomic-spectroscopy-at-nist/501 |
09:00:48 | <@arkiver> | ... and, what i think will eventually happen is that LLM crawlers will find (expensive) ways to not require humans to do the proof of work, and eventually get around any blocking. meanwhile web archive crawlers will not be able to do this due to lack of resources. |
09:01:17 | <@arkiver> | what we may actually end up with is that the web is crawlable for LLM crawlers, but not for web archive crawlers. web archive crawlers would be the ultimate victim. |
09:02:08 | <@arkiver> | with the advances in AI and LLMs, work that can be done by a human can eventually be done by a machine. it can thus be done 'cheaply' by LLM crawlers, but not by web archive crawlers. |
09:03:21 | <@arkiver> | all that being said, i have the feeling that any attempts to keep content publicly accessible but not accessible to LLM crawlers will eventually fail, and companies/sites will either have to accept that and accept that LLM crawling happens, or information is going to move behind login and paywalls (i think that last one is most likely) |
09:04:23 | | Naruyoko5 joins |
09:06:11 | <BornOn420> | Xe You're famous https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/ |
09:07:30 | | Naruyoko quits [Ping timeout: 250 seconds] |
09:10:29 | <@arkiver> | Xe: since i noted that crawling will likely be much more centralized - Archive Team does have a centralized part, the tracker. i can see a possibility that the tracker sends out a message to somewhere when it hands out URLs to be archived; this message could note "in the next 15 minutes, expect these certain URLs to be archived", and the LLM defenses would take down their defenses for some time. |
09:11:53 | <@arkiver> | or... better even, "in the next 15 minutes, requests will be made with HTTP header 'X-Archive-Request: <RANDOM STRING>'", and the LLM firewall would, for those 15 minutes, allow requests with that HTTP header through without blocking |
09:12:51 | <@arkiver> | the point of trust then is with the (centralized) tracker at Archive Team. trust could happen through trusting a certain IP tightly controlled by the tracker. and it would still allow untrusted IPs to do the crawling, for a certain amount of time |
09:14:03 | <@arkiver> | additional tightening could be introduced, such as "at most x URLs will be requested with the random HTTP header", etc. |
09:16:02 | <@arkiver> | since the HTTP header value is random, the LLM firewall could help archiving by letting the crawler know when the value is expired - but only for random values it has seen before. so it'd receive a random value, allow x requests to be made for a duration of y seconds, after that keep the random string registered for z minutes, and during those z minutes return a certain message and status code to inform crawlers that the random string is expired |
09:16:48 | <@arkiver> | this will allow us to handle cases in which the random string is used for a longer time or for more URLs than expected, and allow us to prevent bad data from being recorded. |
09:22:15 | <@arkiver> | (i wrote this on the fly, bunch of thoughts, no thorough checking) |
09:22:30 | | pedantic-darwin quits [Quit: The Lounge - https://thelounge.chat] |
09:38:45 | | kuroger quits [Client Quit] |
09:46:46 | | kuroger (kuroger) joins |
10:01:51 | | egallager quits [Quit: This computer has gone to sleep] |
10:36:06 | | Webuser534462 quits [Quit: Ooops, wrong browser tab.] |
10:36:47 | | nothere quits [Read error: Connection reset by peer] |
10:45:05 | | egallager joins |
10:45:30 | | systwi_ quits [Quit: systwi_] |
10:48:23 | | Dango360 quits [Read error: Connection reset by peer] |
10:52:33 | | Dango360 (Dango360) joins |
11:00:02 | | Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat] |
11:02:48 | | Bleo18260072271962345 joins |
11:04:18 | | kuroger quits [Client Quit] |
11:11:15 | | kuroger (kuroger) joins |
11:15:39 | | Wohlstand (Wohlstand) joins |
11:26:51 | | thighs joins |
11:31:02 | <thighs> | Hello AT! I recently stumbled upon the admin password for a site called fembooru.jp, which is just sad, as it's a now run-down ~700 GB website that I believe some friends shared. I plan to contact the admin about this, but I'm worried the site might disappear soon since one of the admins hasn't posted since 2021. It hosts over 13,000 images, so losing them would be a real shame. Can anyone help? |
11:32:07 | <@arkiver> | just saw your email as well :) |
11:34:47 | <@arkiver> | (didn't have time to look into it yet) |
11:34:48 | | SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962] |
11:35:18 | | SkilledAlpaca418962 joins |
11:35:21 | <thighs> | Oh that's great, didn't think things would go this fast :D |
11:38:33 | <thighs> | I just discovered this site today, so I don't know much about it. It looks dated and run down a bit, but I couldn't even think about archiving it on my own for multiple reasons unfortunately. femboyfinancial.jp has been returning a 502 for a while from what I can see, though. |
11:39:34 | | gust joins |
12:04:44 | | nothere joins |
12:09:23 | | kuroger quits [Client Quit] |
12:13:21 | | BornOn420 quits [Remote host closed the connection] |
12:13:55 | | BornOn420 (BornOn420) joins |
12:23:03 | | kuroger (kuroger) joins |
12:35:08 | | Ketchup901 quits [Remote host closed the connection] |
12:35:24 | | Ketchup901 (Ketchup901) joins |
12:35:34 | | Wohlstand quits [Remote host closed the connection] |
12:41:42 | | Webuser424104 joins |
12:42:23 | | Webuser359951 joins |
12:49:19 | | Webuser359951 leaves |
13:08:25 | | kuroger quits [Client Quit] |
13:08:44 | | SootBector (SootBector) joins |
13:11:52 | | kuroger (kuroger) joins |
13:32:56 | | kuroger quits [Client Quit] |
13:44:54 | | kuroger (kuroger) joins |
14:02:10 | | anarcat quits [Quit: rebooting] |
14:03:07 | <@arkiver> | (replying to the email) |
14:03:34 | | anarcat (anarcat) joins |
14:05:51 | <@arkiver> | thighs: see email |
14:06:03 | <@arkiver> | (job started for https://fembooru.jp/ ) |
14:06:32 | | anarcat quits [Client Quit] |
14:07:53 | | sparky1492 quits [Remote host closed the connection] |
14:08:12 | | sparky1492 (sparky1492) joins |
14:08:19 | | anarcat (anarcat) joins |
14:23:13 | | Webuser424104 quits [Client Quit] |
14:27:27 | | notarobot16 joins |
14:29:28 | | notarobot1 quits [Ping timeout: 250 seconds] |
14:29:28 | | notarobot16 is now known as notarobot1 |
14:58:29 | | VoynichCR (VoynichCR) joins |
14:58:30 | <eggdrop> | [tell] VoynichCR: [2025-03-26T01:30:46Z] <pabs> true, however the AT wiki page about SoftwareHeritage isn't about archiving SWH stuff, but about SWH's code archive |
14:58:31 | <eggdrop> | [tell] VoynichCR: [2025-03-26T01:41:55Z] <pabs> found a compromise for SoftwareHeritage, pointed archive status/type at the mirroring info https://www.softwareheritage.org/mirrors/ https://www.softwareheritage.org/2019/10/03/enea/ |
15:00:39 | | FiTheArchiver joins |
15:06:10 | | sparky14921 (sparky1492) joins |
15:10:16 | | sparky1492 quits [Ping timeout: 260 seconds] |
15:10:16 | | sparky14921 is now known as sparky1492 |
15:18:38 | | Sokar joins |
15:22:31 | | NatTheCat quits [Ping timeout: 260 seconds] |
15:24:34 | | vitzli quits [Quit: Leaving] |
15:25:42 | | NatTheCat (NatTheCat) joins |
15:34:58 | | VoynichCR quits [Client Quit] |
15:38:37 | | lennier2 joins |
15:41:46 | | lennier2_ quits [Ping timeout: 260 seconds] |
15:49:37 | | arch quits [Remote host closed the connection] |
15:49:45 | | arch joins |
15:50:34 | | arch quits [Remote host closed the connection] |
15:50:43 | | arch joins |
15:56:06 | <h2ibot> | Manu edited Discourse/archived (+101, Queued discourse.criticalengineering.org): https://wiki.archiveteam.org/?diff=55106&oldid=54850 |
16:18:53 | | loug83181422 quits [Quit: The Lounge - https://thelounge.chat] |
16:19:15 | | chrismeller quits [Quit: chrismeller] |
16:19:28 | | loug83181422 joins |
16:19:50 | | chrismeller (chrismeller) joins |
16:20:16 | | kuroger quits [Client Quit] |
16:21:45 | | DLoader is now authenticated as DLoader |
16:21:45 | | DLoader quits [Changing host] |
16:21:45 | | DLoader (DLoader) joins |
16:24:45 | | kuroger (kuroger) joins |
16:24:52 | | egallager quits [Quit: This computer has gone to sleep] |
16:32:11 | | kuroger quits [Read error: Connection reset by peer] |
16:37:27 | | sparky14920 (sparky1492) joins |
16:37:46 | | sparky1492 quits [Ping timeout: 260 seconds] |
16:37:46 | | sparky14920 is now known as sparky1492 |
17:14:05 | | VoynichCR (VoynichCR) joins |
17:26:44 | | nomead joins |
17:30:59 | | sparky14925 (sparky1492) joins |
17:34:30 | | sparky1492 quits [Ping timeout: 250 seconds] |
17:34:31 | | sparky14925 is now known as sparky1492 |
17:50:59 | | Lord_Nightmare2 (Lord_Nightmare) joins |
17:51:50 | | Lord_Nightmare quits [Ping timeout: 250 seconds] |
17:51:50 | | Lord_Nightmare2 is now known as Lord_Nightmare |
18:12:30 | <h2ibot> | HadeanEon edited Deaths in 2025 (-5889, BOT - Updating page: {{saved}} (80),…): https://wiki.archiveteam.org/?diff=55107&oldid=55099 |
18:12:31 | <h2ibot> | HadeanEon edited Deaths in 2025/list (+257, BOT - Updating list): https://wiki.archiveteam.org/?diff=55108&oldid=55100 |
18:15:31 | <h2ibot> | VoynichCr edited Deaths in 2025 (-804): https://wiki.archiveteam.org/?diff=55109&oldid=55107 |
18:19:24 | | nomead quits [Client Quit] |
18:24:32 | | VoynichCR quits [Client Quit] |
18:37:54 | | AlsoHP_Archivist quits [Read error: Connection reset by peer] |
19:03:05 | | lflare quits [Quit: Ping timeout (120 seconds)] |
19:03:25 | | lflare (lflare) joins |
19:09:41 | <h2ibot> | Manu edited Discourse/archived (+81, Queued scanlines.xyz): https://wiki.archiveteam.org/?diff=55110&oldid=55106 |
19:17:36 | | loug83181422 quits [Ping timeout: 260 seconds] |
19:19:12 | | loug83181422 joins |
19:27:53 | | loug83181422 quits [Read error: Connection reset by peer] |
19:28:41 | | loug83181422 joins |
19:29:45 | | Dango360_ (Dango360) joins |
19:33:14 | | Dango360 quits [Ping timeout: 250 seconds] |
19:34:25 | <gareth48|m> | JAA: How's the queue - will we get a chance to scrape the images for https://store.vket.com/en soon? I've discovered a few additional things of merit since we last talked, mostly through my own webscraping efforts, which might adjust the strategy. Namely, there are a ton of "unlisted" links that don't appear in the search grid but can easily be derived by iterating through product IDs (thankfully there are only about 10,000 products, so it's not absurd). I've preserved the downloads of all the items which are free, but I don't have the webpages. |
19:35:24 | | loug83181422 quits [Ping timeout: 250 seconds] |
19:39:33 | | loug83181422 joins |
19:40:51 | <h2ibot> | Exorcism edited Discourse/archived (+93): https://wiki.archiveteam.org/?diff=55111&oldid=55110 |
19:43:16 | | loug831814229 joins |
19:43:33 | | Dango360_ quits [Client Quit] |
19:43:42 | | Dango360 (Dango360) joins |
19:45:25 | <Dango360> | welp, this is the moment i've been expecting for a while |
19:45:27 | <Dango360> | after april 2nd, it will no longer be possible to get roblox assets without authentication |
19:45:30 | <Dango360> | https://devforum.roblox.com/t/creator-action-required-new-asset-delivery-api-endpoints-for-community-tools/3574403 |
19:47:21 | | loug83181422 quits [Ping timeout: 260 seconds] |
19:47:21 | | loug831814229 is now known as loug83181422 |
19:49:05 | <@JAA> | → #robloxd |
19:49:41 | <Dango360> | reposting the message there, thanks |
19:56:41 | | CYBERDEV joins |
19:58:01 | | loug831814229 joins |
19:59:44 | | loug8318142298 joins |
20:00:26 | | loug83181422 quits [Read error: Connection reset by peer] |
20:00:26 | | loug8318142298 is now known as loug83181422 |
20:02:07 | | loug831814229 quits [Read error: Connection reset by peer] |
20:04:17 | <@JAA> | gareth48|m: Just ran a test job, looks like the lists need to be quite small to get through within the 10 minutes. I'll run them over the next few hours. |
20:05:18 | | loug83181422 quits [Ping timeout: 250 seconds] |
20:08:44 | <@JAA> | Actually, there very much is rate limiting in the form of HTTP 405. |
20:08:59 | <gareth48|m> | JAA: yeah they're really limiting the connection speed since they're shutting down soon.... (full message at <https://matrix.hackint.org/_irc/v1/media/download/AUJIIDrGJTppwaOemaPQBM7szzSpRwVElmayw2P6-He9tkM5sRvK_x5cR5uXPLeW5-FU77qTnSyxVExbbbQQ1aNCffQP5BVAAGhhY2tpbnQub3JnL3NzQnd4ZGVYdVhzQ2dSdWZVeEtmWEtOWA>) |
20:09:49 | <@JAA> | That's what I was going to do, yes, but it's not feasible with AB due to the rate limits. |
20:10:08 | <@JAA> | Kicks in after ~200 requests at full speed. |
20:10:38 | <gareth48|m> | JAA: Gotcha, just saw your test in archivebot. Yeah it would have to be split up over multiple jobs. |
20:10:56 | <@JAA> | I was already going to split it into 10 jobs each for en and ja. |
20:11:35 | <gareth48|m> | JAA: Not surprised you already thought of it y'all have much more experience than I do, just figured I'd suggest it. |
20:12:00 | <@JAA> | But even a list of 1k isn't going to finish within 10 minutes with that limiting, so this needs something else. |
20:16:23 | <gareth48|m> | JAA: Let me know what you think of. Thanks for taking point on tackling this! |
20:20:58 | <gareth48|m> | My own work has been going well, I've archived ~95% of the free items and their associated gallery images. Working on parsing the tags into json and downloading edge cases. |
20:22:05 | <@JAA> | The downloads are all loginwalled, right? |
20:22:43 | | BitsNBytesNBagels joins |
20:22:52 | <gareth48|m> | JAA: Yeah, they are. You have to log in and then also go to each site's purchase page and order it. |
20:24:14 | <gareth48|m> | Also the site requires google, line, apple, or microsoft authentication they dont have their own account system |
20:24:30 | <gareth48|m> | s/account/auth/ |
20:31:59 | | BlueMaxima joins |
20:57:18 | | sparky14924 (sparky1492) joins |
20:58:07 | <mgrandi> | https://www.theverge.com/news/635915/game-informer-return-gunzilla-games |
21:00:51 | | sparky1492 quits [Ping timeout: 260 seconds] |
21:00:51 | | sparky14924 is now known as sparky1492 |
21:11:22 | | aeg leaves |
21:32:07 | | lennier2_ joins |
21:33:01 | | egallager joins |
21:35:00 | | lennier2 quits [Ping timeout: 250 seconds] |
21:36:53 | | etnguyen03 (etnguyen03) joins |
21:41:16 | | Island joins |
21:51:51 | <@JAA> | gareth48|m: Interesting, some pages reference images that return 404s. Anyway, I have something running now with qwarc. |
21:54:22 | <@JAA> | Rough ETA 11-12 hours |
21:56:01 | <@JAA> | That's /en/items/$id and /ja/items/$id plus any URLs on those that contain 'X-Amz-Expires'. |
21:59:44 | <glassy> | how you keeping JAA :) |
22:02:00 | | thighs quits [Quit: Ooops, wrong browser tab.] |
22:07:27 | | Dango360 quits [Client Quit] |
22:07:45 | | Dango360 (Dango360) joins |
22:09:59 | | Island quits [Read error: Connection reset by peer] |
22:10:21 | | Island joins |
22:15:36 | | @imer quits [Quit: Oh no] |
22:16:07 | | imer (imer) joins |
22:16:07 | | @ChanServ sets mode: +o imer |
22:19:24 | <@JAA> | No 405s so far, so that's good. |
22:19:59 | <@JAA> | glassy: <thisisfine.png> as usual :-) |
22:20:21 | <glassy> | same sh*t different day? |
22:21:46 | <gareth48|m> | <JAA> "gareth48: Interesting, some..." <- Huh, that is weird. Quite a few of the products are 404'd due to being out of sale, removed by the owner, or deleted for being adverse
22:21:53 | | BitsNBytesNBagels quits [Quit: My MacBook has gone to sleep. ZZZzzz…] |
22:22:11 | <gareth48|m> | I haven't run statistics but I collected a list of all the codes associated with the various items for quick reference when scraping the actual products |
22:22:36 | <FiTheArchiver> | is wayback machine down? |
22:23:01 | <@JAA> | glassy: Aye |
22:23:16 | <glassy> | FiTheArchiver IA sad as awhole :( |
22:23:22 | <gareth48|m> | <JAA> "That's /en/items/$id and /ja/..." <- Yeah, that's solid logic; that should get the vast majority of it. Will those thumbnails populate back to the store browse pages?
22:23:22 | <gareth48|m> | If not those would be a good secondary target, but this is already 90% of the way there because the actual product information and photos are preserved |
22:23:35 | <@JAA> | gareth48|m: Yeah, I'm seeing lots of 404s on item pages, and I expected as much. |
22:23:59 | <@JAA> | The images are weird, but such is the web. |
22:24:21 | <@JAA> | Every page references a different signed URL, so no, those won't appear there and would have to be refetched for store pages. |
22:24:26 | <gareth48|m> | Sorry to crosstalk, but thanks so much for y'all's help. I'm already busting ass grabbing as much as I can of the available items; I would've been doomed trying to get a full scrape like this alone. I really appreciate the help. It'll be good to have this good of a record for an interesting part of VR history
22:24:44 | <gareth48|m> | JAA: makes sense |
22:24:47 | <@JAA> | In principle, it could be fixed with a userscript down the line, I suppose. |
22:24:48 | <FiTheArchiver> | ahh ok. thought my wayback downloader was broken but i think cdx is down |
22:25:25 | <@JAA> | There are broken items, too: https://store.vket.com/en/items/1042 |
22:27:01 | <gareth48|m> | JAA: This is one of those that bizarrely returns a 500 error right? |
22:27:33 | <@JAA> | Yeah |
22:28:03 | <datechnoman> | 20 bucks its a power outage at IA :P |
22:28:21 | <@JAA> | Yeah, usually is. |
22:28:26 | <gareth48|m> | yep lol, no idea what's up with those |
22:29:08 | <glassy> | just as urls was catching up |
22:30:21 | <gareth48|m> | It caused my web scraper to go into an infinite loop the first time around because I was only checking for 404 or 200 (I'm on the newer end of doing this lol)
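The failure mode described here, retrying forever on an unexpected status, can be avoided by classifying status codes explicitly and capping retries. A minimal sketch, with illustrative names and thresholds rather than the scraper's actual logic:

```python
# Classify HTTP status codes explicitly instead of assuming 200-or-404,
# so an unexpected 500 can't send the scraper into an infinite retry loop.
def classify_status(status, attempt, max_attempts=3):
    """Return 'ok', 'gone', 'retry', or 'give_up' for one fetch attempt."""
    if status == 200:
        return 'ok'
    if status in (404, 410):
        return 'gone'
    if attempt < max_attempts:
        return 'retry'      # e.g. 500 or 429: retry a bounded number of times
    return 'give_up'        # log the ID and move on after max_attempts

print(classify_status(200, 1))
print(classify_status(500, 1))
print(classify_status(500, 3))
```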
22:31:17 | <@JAA> | How many of these did you see? |
22:32:01 | <gareth48|m> | JAA: Is this in response to me mentioning the 500 errors? My client isn't rendering it as such
22:32:38 | <@JAA> | IRC doesn't have responses. :-) |
22:32:41 | <@JAA> | And yes |
22:33:13 | | etnguyen03 quits [Client Quit] |
22:34:22 | <gareth48|m> | right, good point. I'm very new to IRC, in case that wasn't obvious lol. I'm picking it up slowly but surely. |
22:34:22 | <gareth48|m> | 22 HTTP 500 errors in total, for all ~9800 items
22:35:36 | <gareth48|m> | Sorry, I meant 18; Ctrl+F user error there lol
22:35:55 | <@JAA> | I'm at 2 in 900, so that's similar enough of a rate. |
22:36:31 | | etnguyen03 (etnguyen03) joins |
22:37:19 | <gareth48|m> | Huh, interesting, my first was at 795 and my next was at 1042. |
22:37:33 | <@JAA> | I'm doing the IDs in random order. |
22:38:03 | <gareth48|m> | Oh gotcha, let me know the final count, I'll be curious to compare it against what I found |
22:38:51 | <@JAA> | Oh yeah, for items that don't exist, I only fetch the English page. |
22:39:23 | <@JAA> | Hopefully, there are no items that only exist under /ja/. :-P |
22:39:54 | <gareth48|m> | JAA: that would be a horrible edge case lol. I'll check because I'm curious and let you know
22:40:41 | <@JAA> | Hmm, maybe I should fix that and start over. |
22:41:50 | <@JAA> | I didn't intend to do it like that. Just forgot to adjust the logic when I added /ja/. |
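The logic fix JAA describes amounts to checking both locale pages before declaring an ID nonexistent. A self-contained sketch: `fetch_status` is a stub standing in for the real HTTP fetch, and the fake status table is invented for illustration:

```python
# Decide whether an item ID exists by checking both locale pages, so a
# hypothetical /ja/-only item isn't missed when /en/ returns 404.
# FAKE_STATUSES stubs out the real HTTP layer to keep the sketch runnable.
FAKE_STATUSES = {'/en/items/7': 404, '/ja/items/7': 200,
                 '/en/items/8': 404, '/ja/items/8': 404}

def fetch_status(path):
    """Stand-in for an actual HTTP request; unknown paths 404."""
    return FAKE_STATUSES.get(path, 404)

def item_exists(item_id):
    """An item counts as existing if either locale page returns 200."""
    return any(fetch_status(f'/{loc}/items/{item_id}') == 200
               for loc in ('en', 'ja'))

print(item_exists(7))  # ja-only item
print(item_exists(8))
```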
22:41:59 | <FiTheArchiver> | wayback machine just tweeted it is indeed power issues |
22:42:41 | <gareth48|m> | It'll be a few hours until I get that data, since I'm currently writing my tag parser and extractor. I'm using beautiful soup to grab the html and parse relevant tags into json (for eventual conversion into archive.org metadata for all free items) |
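The tag-parsing step described here, extracting tags from the HTML and emitting JSON, might look roughly like this. The scraper reportedly uses BeautifulSoup; this sketch uses the stdlib `html.parser` so it is self-contained, and the `class="tag"` markup is a guess, not the store's actual HTML:

```python
import json
from html.parser import HTMLParser

# Collect the text of elements whose class is "tag" and dump them as JSON.
# Class name and markup are illustrative; the real scraper uses BeautifulSoup.
class TagExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []
        self._in_tag = False

    def handle_starttag(self, name, attrs):
        if name == 'span' and ('class', 'tag') in attrs:
            self._in_tag = True

    def handle_data(self, data):
        if self._in_tag:
            self.tags.append(data.strip())

    def handle_endtag(self, name):
        if name == 'span':
            self._in_tag = False

html = '<span class="tag">avatar</span><span class="tag">VRChat</span>'
p = TagExtractor()
p.feed(html)
print(json.dumps(p.tags))
```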
22:43:09 | <@JAA> | There's enough time, so I think I will start over (but keep the data I already got just in case). |
22:43:14 | <gareth48|m> | Very grateful I'm able to dedicate nearly 100% of my effort towards getting the assets off because I'm getting really close to having pulled it off. I'm in the edge case / verification stage now. |
22:43:29 | <@JAA> | Dedupe would be annoying if I ran those separately later. |
22:43:45 | <@JAA> | Nice :-) |
22:43:48 | <gareth48|m> | Oh yeah I can only guess lol. |
22:44:34 | | Wohlstand (Wohlstand) joins |
22:44:44 | <gareth48|m> | All of the free assets on the store were actually smaller than I expected! 1543 assets at only 43 GB, which is way less than I thought
22:54:44 | | BornOn420 quits [Remote host closed the connection] |
22:55:18 | | BornOn420 (BornOn420) joins |
22:59:28 | | sparky14921 (sparky1492) joins |
23:02:58 | | sparky1492 quits [Ping timeout: 250 seconds] |
23:02:59 | | sparky14921 is now known as sparky1492 |
23:05:55 | | sparky14920 (sparky1492) joins |
23:09:28 | | sparky1492 quits [Ping timeout: 250 seconds] |
23:09:29 | | sparky14920 is now known as sparky1492 |
23:13:26 | | etnguyen03 quits [Client Quit] |
23:17:21 | | klg quits [Ping timeout: 260 seconds] |
23:17:33 | | klg (klg) joins |
23:33:29 | | egallager quits [Client Quit] |
23:50:56 | | CraftByte quits [Quit: Ping timeout (120 seconds)] |
23:53:01 | | etnguyen03 (etnguyen03) joins |
23:56:22 | | Wohlstand quits [Client Quit] |