00:16:57 | | Wohlstand quits [Quit: Wohlstand] |
00:24:27 | | etnguyen03 quits [Client Quit] |
00:32:46 | | Chris5010 quits [Quit: Ping timeout (120 seconds)] |
00:33:04 | | Chris5010 (Chris5010) joins |
00:36:25 | | StarletCharlotte joins |
00:37:06 | | vitzli (vitzli) joins |
00:39:07 | <h2ibot> | PaulWise edited Finding subdomains (-79, SecurityTrails now paid only): https://wiki.archiveteam.org/?diff=55101&oldid=54541 |
00:39:47 | | chains joins |
00:58:26 | | etnguyen03 (etnguyen03) joins |
01:04:16 | | fionera quits [Quit: fionera] |
01:04:55 | | fionera (Fionera) joins |
01:04:59 | | fionera quits [Client Quit] |
01:28:40 | | StarletCharlotte quits [Remote host closed the connection] |
01:29:48 | <pabs> | JAA: re Infobox, I was thinking for the SoftwareHeritage wiki page, to add Libera #swh, hackint #codearchiver/#gitgud |
01:30:46 | <pabs> | !tell VoynichCR true, however the AT wiki page about SoftwareHeritage isn't about archiving SWH stuff, but about SWH's code archive |
01:30:47 | <eggdrop> | [tell] ok, I'll tell VoynichCR when they join next |
01:30:51 | <@JAA> | pabs: Eh, the three are separate projects really. |
01:32:07 | <pabs> | JAA: thanks for the radio4all link, how did you manage to find the right page? |
01:33:00 | <@JAA> | pabs: I looked for the items with a timestamp (in the title, not the upload date) shortly after the job finished. |
01:33:30 | <@JAA> | It'll usually be in the first one after that, as was the case with this one, too. Might occasionally be in a later pack. |
01:34:24 | <h2ibot> | PaulWise edited Software Heritage (+9, add Libera #swh IRC channel): https://wiki.archiveteam.org/?diff=55102&oldid=55098 |
01:34:43 | <pabs> | ok. should I file something about the indexing problem or will it resolve later? |
01:36:24 | <@JAA> | pabs: Not sure, the current viewer isn't even in the AB repo... |
01:36:27 | <@JAA> | chfoo: ^ |
01:41:26 | <h2ibot> | PaulWise edited Software Heritage (+124, SWH is mirrorable but not archivable): https://wiki.archiveteam.org/?diff=55103&oldid=55102 |
01:41:55 | <pabs> | !tell VoynichCR found a compromise for SoftwareHeritage, pointed archive status/type at the mirroring info https://www.softwareheritage.org/mirrors/ https://www.softwareheritage.org/2019/10/03/enea/ |
01:41:57 | <eggdrop> | [tell] ok, I'll tell VoynichCR when they join next |
01:43:27 | <h2ibot> | PaulWise edited Software Heritage (+32, mention bulk archiving restriction): https://wiki.archiveteam.org/?diff=55104&oldid=55103 |
01:44:15 | <pabs> | steering++ |
01:44:15 | <eggdrop> | [karma] 'steering' now has 44 karma! |
01:44:19 | <pabs> | that_lurker++ |
01:44:19 | <eggdrop> | [karma] 'that_lurker' now has 51 karma! |
01:44:36 | | chains_ joins |
01:48:46 | | chains quits [Ping timeout: 260 seconds] |
01:50:23 | <pabs> | JAA++ |
01:50:25 | <eggdrop> | [karma] 'JAA' now has 233 karma! |
01:52:16 | | chains_ quits [Client Quit] |
01:58:19 | | gust quits [Read error: Connection reset by peer] |
02:06:04 | | yasomi is now known as Xe |
02:09:23 | <Xe> | I work on software that protects websites against hyper-aggressive AI scrapers (via making them do a proof-of-work check) and I'm wondering how I can make sure that I don't accidentally block ArchiveTeam because your traffic patterns will obviously be flagged as anomalous. |
02:09:40 | <pabs> | which software are you working on? |
02:09:50 | <Xe> | https://github.com/TecharoHQ/anubis |
02:10:06 | <pabs> | ah we were just talking about that recently |
02:10:31 | <Xe> | right now it is super aggressive, super paranoid, and super block-y, but I want to figure out ways to tactically lessen its paranoia |
02:10:32 | <pabs> | also, please consider SWH, since Anubis is often used on code sites https://wiki.archiveteam.org/index.php/Software_Heritage |
02:10:44 | <Xe> | already talking with them :) |
02:10:52 | <pabs> | ah great |
02:11:19 | <pabs> | for ArchiveBot the non-Mozilla-UA bypass will work |
02:11:45 | <pabs> | it can set UA as the default AB one, or curl |
02:12:04 | <pabs> | for DPoS projects I'm not sure what UA they use |
02:12:08 | <Xe> | can you test on https://xeiaso.net? |
02:12:29 | <pabs> | yep |
02:12:48 | <Xe> | I care about making sure you pirate archivists can save culture, but I also am like so tired of AI scraper bot downtime lol |
02:13:48 | <pabs> | btw one problem with the non-Mozilla-UA bypass is that we often have to use -u firefox in order to bypass anti-bot stuff on other sites |
02:13:52 | <steering> | Xe++ |
02:13:53 | <eggdrop> | [karma] 'Xe' now has 1 karma! |
02:13:54 | <steering> | :) |
02:14:02 | <Xe> | pabs: yeah, it's...not ideal |
02:14:10 | <Xe> | that's why I'd like to make a more ideal solution |
02:14:19 | <Fijxu|m> | Thanks Xe for anubis |
02:14:38 | <Fijxu|m> | I now use it for https://inv.nadeko.net and has been working great |
02:14:40 | <Xe> | right now it's entirely a hack i did over a weekend because i kept getting paged by my git server going down |
02:14:58 | <Xe> | it's had to do about 6 months of software process maturity in 6 days :DDD |
02:15:29 | <Fijxu|m> | I'm also looking to add some features to Anubis ;3 |
02:15:31 | <pabs> | could allowlist AB IPs, but then they become public. and the list will change over time. |
02:15:56 | <pabs> | maybe for now you could allowlist all the AB User-Agents? |
02:16:14 | <Xe> | yeah, i'm thinking about writing some kind of RFC for this that would have the user agent point to a domain with the list of IP addresses in some kind of JSON file |
02:16:27 | <Xe> | just need to figure out like |
02:16:30 | <Xe> | all the hard parts |
02:16:33 | <Xe> | (it's all hard parts) |
02:16:48 | <pabs> | https://github.com/ArchiveTeam/ArchiveBot/tree/master/db/user_agents/ |
02:17:16 | <pabs> | that then becomes a way for sites to block AB though :) |
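[Editor's note: the idea Xe floats above — a User-Agent pointing at a domain that publishes its crawler IPs as JSON — could be checked on the blocker side roughly like this. The document format (`{"ranges": [...]}`) and the fetch URL are hypothetical; no such RFC exists yet.]

```python
import ipaddress
import json

def ip_allowed(client_ip: str, allowlist_doc: str) -> bool:
    """Check a client IP against a published JSON document of crawler ranges.

    Assumes a made-up format {"ranges": ["203.0.113.0/24", ...]}; the real
    format such an RFC would define is not settled. In practice the blocker
    would fetch the document from the domain named in the UA and cache it.
    """
    doc = json.loads(allowlist_doc)
    addr = ipaddress.ip_address(client_ip)
    # ip_network.__contains__ returns False on v4/v6 mismatch, so mixed
    # range lists are safe to scan.
    return any(addr in ipaddress.ip_network(r) for r in doc.get("ranges", []))

doc = '{"ranges": ["203.0.113.0/24", "2001:db8::/32"]}'
print(ip_allowed("203.0.113.7", doc))   # inside the published v4 range
print(ip_allowed("198.51.100.1", doc))  # outside every range
```

As pabs notes right after, publishing such a list cuts both ways: anything that lets a site *allow* the crawler also lets it *block* the crawler.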
02:17:21 | <Xe> | the thing that throws a wrench in all of this is residential proxy services |
02:17:29 | <Fijxu|m> | those are ancient user-agents.. |
02:18:16 | <pabs> | (AB UAs don't change very often, which also means using them becomes less effective at bypassing stuff over time) |
02:18:35 | <Xe> | yeah |
02:18:46 | <Xe> | you can see how this is like 99% hard parts though right lol |
02:18:52 | <pabs> | :) |
02:19:51 | <pabs> | arkiver JAA - ^ any thoughts on getting archiving bypass mechanisms into anti-bot tech? |
02:20:38 | <pabs> | I think allowlisting the AB UAs is a least-bad start |
02:20:43 | <Xe> | the extra extra hard part is that the code for the bot blocker is open source so the bad guys can easily just use it to bypass it |
02:21:46 | <steering> | it's an arms race like any other |
02:21:49 | <pabs> | did you add a non-JS PoW thing for us paranoid users btw? :) |
02:21:57 | <pabs> | also, one of your competitors: https://forge.lindenii.runxiyu.org/powxy/:/repos/powxy/ |
02:22:18 | <Xe> | i'm thinking about it, but like tbh, i'm starting to come to the conclusion that having first-party JS is a broken config |
02:22:36 | <Xe> | considering having the no-JS solution be a bunch of CCNA, Voight-Kampff and the like questions |
02:22:48 | | BennyOtt_ joins |
02:23:03 | <Xe> | er |
02:23:07 | <Xe> | not having first-party JS |
02:23:12 | <Xe> | i have had weird sleep lately lol |
02:23:27 | <Xe> | suddenly being a load-bearing part of the open source community is mildly terrifying tbh |
02:23:44 | <pabs> | I can only imagine :) |
02:24:04 | | BennyOtt quits [Ping timeout: 250 seconds] |
02:24:04 | | BennyOtt_ is now known as BennyOtt |
02:24:05 | | BennyOtt is now authenticated as BennyOtt |
02:24:23 | <pabs> | I was thinking the non-JS thing would just be a form and a sha256 command-line or something |
02:25:14 | <Xe> | i've considered that, but i don't want to leak implementation details |
02:25:34 | <Xe> | not to mention, actual malware is now using curl2bash as an exploit strategy |
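[Editor's note: pabs's "form and a sha256 command-line" suggestion is the classic hashcash shape — find a nonce whose hash has enough leading zero bits. A minimal sketch of that puzzle, solvable without a browser; the challenge format here is invented and does not reproduce Anubis's actual scheme.]

```python
import hashlib
import itertools

def solve_pow(challenge: str, difficulty_bits: int) -> int:
    """Find the smallest nonce such that sha256("challenge:nonce") starts
    with `difficulty_bits` zero bits. Expected work: ~2**difficulty_bits
    hashes, so difficulty tunes the cost per request."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        # Treat the digest as a 256-bit integer; top bits must be zero.
        if int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0:
            return nonce

def verify_pow(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: one hash to verify, regardless of difficulty."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0

nonce = solve_pow("example-challenge", 12)  # 12 bits: ~4096 hashes on average
print(verify_pow("example-challenge", nonce, 12))
```

The asymmetry (many hashes to solve, one to verify) is the whole point; Xe's stated concern is that shipping this as a documented shell one-liner leaks implementation details and resembles curl2bash patterns.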
02:26:06 | <Fijxu|m> | pabs: that looks neat |
02:26:20 | <Xe> | pabs: yeah i've been talking with runxiyu |
02:26:41 | <Fijxu|m> | pabs: that is possible |
02:26:55 | <Fijxu|m> | but it can be highly automated if you are getting specifically targeted |
02:27:26 | <pabs> | ok, DPoS projects seem to use a bunch of browser UAs, for eg imgur https://github.com/ArchiveTeam/imgur-grab/blob/master/user-agents.txt |
02:27:30 | <Xe> | pabs: nah part of me really wants to do CCNA questions |
02:27:41 | <Xe> | mostly because it would be funny |
02:27:42 | <pabs> | I would fail :( |
02:28:04 | <pabs> | anyway most of the sites Anubis is used on are JS-only GitLabs anyway |
02:28:04 | <steering> | I feel like AI would probably do better than me |
02:28:16 | <Fijxu|m> | I run a guestbook on my website, some dude started spamming it with a lot of IPs, so I added https://mcaptcha.org/ to it which has support for javascript and a non js client made on rust to solve the challenge without javascript |
02:28:42 | <Fijxu|m> | and then the dude used the client to solve and submit the captcha result... Spamming again |
02:28:47 | <@JAA> | We normally include 'ArchiveTeam' or 'Archive Team' (not necessarily at the beginning) in the UA, unless the site blocks that. That'd probably be the, uh, least unreliable method of identifying our traffic. |
02:29:37 | <Xe> | pabs: the extra ironic part is that i work for an object storage company where my job is to tell people how to copy files using the guise of generative AI |
02:29:38 | <@JAA> | A non-JS fallback would be much appreciated indeed. |
02:29:39 | <Xe> | my life is wild |
02:30:02 | <Fijxu|m> | JAA: https://github.com/TecharoHQ/anubis/issues/95 |
02:30:13 | <Fijxu|m> | Something will come out eventually |
02:30:14 | <@JAA> | Yeah, I've seen the issue. :-) |
02:32:42 | | sparky14921 (sparky1492) joins |
02:36:12 | | sparky1492 quits [Ping timeout: 250 seconds] |
02:36:13 | | sparky14921 is now known as sparky1492 |
02:40:14 | <pokechu22> | I believe archivebot's default UA includes mozilla in it |
02:40:45 | <pokechu22> | right, it's ArchiveTeam ArchiveBot/20210517.c1020e5 (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36 |
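[Editor's note: JAA's heuristic — 'ArchiveTeam' or 'Archive Team' somewhere in the UA, not necessarily at the start — can be sketched as a substring check. A match is "probably AT traffic", never proof: any client can copy a User-Agent string.]

```python
def looks_like_archiveteam(user_agent: str) -> bool:
    """Heuristic only: Archive Team tools usually include 'ArchiveTeam'
    or 'Archive Team' somewhere in the User-Agent, unless the target
    site blocks that string."""
    ua = user_agent.lower()
    return "archiveteam" in ua or "archive team" in ua

# ArchiveBot's default UA (quoted by pokechu22 above) matches; a plain
# browser UA, which jobs sometimes fall back to, does not.
ab_default = "ArchiveTeam ArchiveBot/20210517.c1020e5 (wpull 2.0.3)"
browser = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
           "(KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36")
print(looks_like_archiveteam(ab_default))  # True
print(looks_like_archiveteam(browser))     # False
```

The second case is exactly the gap pabs raised earlier: jobs restarted with `-u firefox` to bypass other anti-bot systems lose the identifying marker.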
02:57:14 | <Xe> | oh |
02:57:22 | <Xe> | exceedingly dumb question before the melatonin wins |
02:57:40 | <Xe> | would putting an Onion-Location header in the mix make it easier for archivebot |
02:57:51 | <Xe> | so that you could access the origin directly over tor hidden services |
02:59:48 | <pabs> | IIRC right now we don't have any Tor onion archiving, neither AB pipelines nor DPoS |
03:00:14 | <pabs> | I think there were both in the past though |
03:01:55 | <@JAA> | Correct |
03:02:28 | <@JAA> | Note that part of our effort is also URL preservation. A hidden service that isn't blocked is great, but nobody will find it in the Wayback Machine. |
03:03:15 | <pokechu22> | but to be clear, archivebot is generally monitored, and if we see 403s or other errors with the default user-agent we'll generally restart the job with a different user-agent |
03:03:35 | | etnguyen03 quits [Remote host closed the connection] |
03:03:46 | <pokechu22> | (you can see URLs being downloaded and the associated status codes on http://archivebot.com) |
03:07:41 | <h2ibot> | PaulWise edited Archiveteam:IRC/Relay (+453, add bot-heavy relay channels): https://wiki.archiveteam.org/?diff=55105&oldid=55022 |
03:09:48 | <pabs> | Xe: on that note, does Anubis use proper HTTP error codes? |
03:09:59 | <pabs> | is there one for humans-only?! |
03:15:17 | <pabs> | Xe: btw, the URLs project is probably the main DPoS that would hit Anubis instances, here are the UAs for it: https://github.com/ArchiveTeam/urls-grab/blob/master/user-agents.txt |
03:15:40 | <pabs> | https://wiki.archiveteam.org/index.php/URLs |
03:16:28 | <nicolas17> | and urls has a crazy scale https://tracker.archiveteam.org/urls/ |
03:17:08 | <nicolas17> | its requests are *hopefully* spread across tons of different websites being archived simultaneously, but there's no guarantees |
03:18:36 | <nicolas17> | speaking of residential proxies, a weird issue I've been having recently on my personal server is TCP SYNs from hundreds of IPs in the same residential-address block, often from brazil, netstat shows connections stuck in SYN_RECV state |
03:19:04 | <nicolas17> | yet it's nowhere near enough to actually affect my server like a DDoS... idk what they're trying to do |
03:19:39 | <pokechu22> | I heard about that happening (or something similar) to someone who ran a TOR exit node |
03:19:59 | <nicolas17> | I blocked the whole /16 and a few days later noticed something similar from another range |
03:19:59 | <steering> | nicolas17: mm, not just scanning for open ports? |
03:20:08 | <nicolas17> | no, it's all to my :443 |
03:20:54 | <nicolas17> | if they're trying to overload my webserver they're failing at it |
03:21:57 | <nicolas17> | pokechu22: I used to run a non-exit Tor node on that machine, but not even that anymore |
03:42:29 | <pabs> | IIRC the Tor thing was IP-spoofed TCP SYNs |
03:44:51 | <nicolas17> | this *could* be spoofed |
03:53:09 | | BluRS joins |
04:10:12 | <pokechu22> | It was ones spoofed as coming from your IP IIRC |
04:10:36 | <pokechu22> | https://delroth.net/posts/spoofed-mass-scan-abuse/ |
04:14:56 | <nicolas17> | oh scanning other people with your IP |
04:28:10 | | Webuser534462 joins |
05:01:38 | | fuzzy80211 quits [Read error: Connection reset by peer] |
05:02:14 | | fuzzy80211 (fuzzy80211) joins |
05:11:19 | | BlueMaxima quits [Read error: Connection reset by peer] |
05:20:30 | | fuzzy80211 quits [Read error: Connection reset by peer] |
05:22:14 | | AlsoHP_Archivist joins |
05:22:15 | | fuzzy80211 (fuzzy80211) joins |
05:24:36 | | HP_Archivist quits [Ping timeout: 260 seconds] |
05:49:36 | | ThetaDev quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
05:49:59 | | ThetaDev joins |
05:50:19 | <that_lurker> | steering: Could you also update the topics on #hackernews and #hackernews-firehose. |
05:51:02 | | midou quits [Read error: Connection reset by peer] |
05:51:08 | | midou joins |
05:52:49 | | loug83181422 joins |
05:57:24 | <pabs> | Xe: btw, Anubis is presumably bypassed by headless browser based bots? |
05:59:11 | | egallager quits [Quit: This computer has gone to sleep] |
06:00:17 | <steering> | that_lurker: oops, I gave you +ot on those instead of +* |
06:01:52 | | sec^nd quits [Remote host closed the connection] |
06:02:07 | | sec^nd (second) joins |
06:02:31 | | midou quits [Ping timeout: 260 seconds] |
06:03:04 | | fuzzy8021 (fuzzy80211) joins |
06:05:22 | | midou joins |
06:05:30 | | fuzzy80211 quits [Ping timeout: 250 seconds] |
06:07:17 | <that_lurker> | steering++ |
06:07:17 | <eggdrop> | [karma] 'steering' now has 45 karma! |
06:08:53 | | SootBector quits [Ping timeout: 276 seconds] |
06:14:55 | <pabs> | Xe: looks like the answer is yes. for example, here is AT's in-progress Brozzler based Mnbot bypassing Anubis https://mnbot.very-good-quality-co.de/item/72b3897a-b086-480e-94c7-d6194f638cf4 |
06:16:04 | <pabs> | https://wiki.archiveteam.org/index.php/User:TheTechRobo/Mnbot |
06:27:50 | | fuzzy80211 (fuzzy80211) joins |
06:30:13 | | fuzzy8021 quits [Read error: Connection reset by peer] |
06:38:35 | | egallager joins |
07:13:41 | | Island quits [Read error: Connection reset by peer] |
07:38:14 | | Sokar quits [Ping timeout: 250 seconds] |
07:46:21 | | midou quits [Ping timeout: 260 seconds] |
07:55:22 | | midou joins |
08:01:45 | | sec^nd quits [Remote host closed the connection] |
08:01:56 | | sec^nd (second) joins |
08:16:09 | | kuroger quits [Quit: ZNC 1.9.1 - https://znc.in] |
08:21:07 | | kuroger (kuroger) joins |
08:38:12 | <Xe> | pabs: it can't use proper HTTP error codes because using them makes websites re-enqueue things |
08:38:58 | <Xe> | makes crawlers re-enqueue things |
08:39:07 | <Xe> | all the challenge pages throw 200 |
08:54:24 | <@arkiver> | Xe: that is a major problem in archiving. if there is no sign that "rate limiting" (or however you want to call this) is happening, pages will simply be archived and end up in web archives as if those are the pages a regular user would see |
08:55:02 | <@arkiver> | i don't think we can make a fundamental distinction between web archive crawlers and LLM crawlers, without using lists of IPs etc. |
08:55:38 | <@arkiver> | ArchiveBot uses some static IPs which could be listed as being web archive crawlers, but many of Archive Team's projects use resources from all over the world, and those IPs change all the time |
08:56:16 | <@arkiver> | so the only way i see this happening at the moment is with handing over lists of IPs. of course we could do something like that. there could perhaps be some 'central repository' with these IPs. |
08:57:01 | <@arkiver> | however this may not work on 'public' Archive Team projects, since LLM crawlers could get their IPs added to this list by running some of our projects, and then further abuse the position of those IPs to start crawling for LLMs |
08:58:17 | <@arkiver> | i also suspect that any general measures to attempt to allow web archive crawlers, but prevent LLM crawlers, will eventually not work as LLM crawlers can pretend to be web archive crawlers. |
08:59:40 | <@arkiver> | the bottom line to all this would be that the 'open nature' or open source aspects of web archiving are not suitable for sites that employ anti-LLMs measures, and we'd much more need to employ closed source code, better control IPs and what they're used for, etc. |
09:00:39 | <@rewby> | I don't know if it's been mentioned yet, but came across https://forum.safeguar.de/t/end-of-atomic-spectroscopy-at-nist/501 |
09:00:48 | <@arkiver> | ... and, what i think will eventually happen is that LLM crawlers will find (expensive) ways to not require humans to do the proof of work, and eventually get around any blocking. meanwhile web archive crawlers will not be able to do this due to lack of resources. |
09:01:17 | <@arkiver> | what we may actually end up with is that the web is crawlable for LLM crawlers, but not for web archive crawlers. web archive crawlers would be the ultimate victim. |
09:02:08 | <@arkiver> | with the advances in AI and LLMs, work that can be done by a human can eventually be done by a machine. it can thus be done 'cheaply' by LLM crawlers, but not by web archive crawlers. |
09:03:21 | <@arkiver> | all that being said, i have the feeling that any attempts to keep content publicly accessible but not accessible to LLM crawlers will eventually fail, and companies/sites will either have to accept that and accept that LLM crawling happens, or information is going to move behind login and paywalls (i think that last one is most likely) |
09:04:23 | | Naruyoko5 joins |
09:06:11 | <BornOn420> | Xe You're famous https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/ |
09:07:30 | | Naruyoko quits [Ping timeout: 250 seconds] |
09:10:29 | <@arkiver> | Xe: since i noted that crawling will likely be much more centralized - Archive Team does have a centralized part, the tracker. i can see a possibility that the tracker sends out a message to somewhere when it hands out URLs to be archived; this message could note "in the next 15 minutes, expect these certain URLs to be archived", and the LLM defenses would take down their defenses for some time. |
09:11:53 | <@arkiver> | or... better even, "in the next 15 minutes, requests will be made with HTTP header 'X-Archive-Request: <RANDOM STRING>'", and the LLM firewall would, for those 15 minutes, allow requests with that HTTP header through without blocking |
09:12:51 | <@arkiver> | the point of trust then is with the (centralized) tracker at Archive Team. trust could happen through trusting a certain IP tightly controlled by the tracker. and it would still allow untrusted IPs to do the crawling, for a certain amount of time |
09:14:03 | <@arkiver> | additional tightening could be introduced, such as "at most x URLs will be requested with the random HTTP header", etc. |
09:16:02 | <@arkiver> | since the HTTP header value is random, the LLM firewall could help archiving by letting the crawler know when the value is expired - but only for random values it has seen before. so it'd receive a random value, allow x requests to be made for a duration of y seconds, after that keep the random string registered for z minutes, and during those z minutes return a certain message and status code to inform crawlers that the random string is expired |
09:16:48 | <@arkiver> | this will allow us to handle cases in which the random string is used for a longer time or for more URLs than expected, and allow us to prevent bad data from being recorded. |
09:22:15 | <@arkiver> | (i wrote this on the fly, bunch of thoughts, no thorough checking) |
09:22:30 | | pedantic-darwin quits [Quit: The Lounge - https://thelounge.chat] |
09:38:45 | | kuroger quits [Client Quit] |
09:46:46 | | kuroger (kuroger) joins |
10:01:51 | | egallager quits [Quit: This computer has gone to sleep] |
10:36:06 | | Webuser534462 quits [Quit: Ooops, wrong browser tab.] |
10:36:47 | | nothere quits [Read error: Connection reset by peer] |
10:45:05 | | egallager joins |
10:45:30 | | systwi_ quits [Quit: systwi_] |
10:48:23 | | Dango360 quits [Read error: Connection reset by peer] |
10:52:33 | | Dango360 (Dango360) joins |
11:00:02 | | Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat] |
11:02:48 | | Bleo18260072271962345 joins |
11:04:18 | | kuroger quits [Client Quit] |
11:11:15 | | kuroger (kuroger) joins |
11:15:39 | | Wohlstand (Wohlstand) joins |
11:26:51 | | thighs joins |
11:31:02 | <thighs> | Hello AT! I recently stumbled upon the admin password for a site called fembooru.jp, which is just sad, as it's a now run-down ~700 GB website that I believe some friends shared. I plan to contact the admin about this, but I'm worried the site might disappear soon since one of the admins hasn't posted since 2021. It hosts over 13,000 images, so losing them would be a real shame. Can anyone help? |
11:32:07 | <@arkiver> | just saw your email as well :) |
11:34:47 | <@arkiver> | (didn't have time to look into it yet) |
11:34:48 | | SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962] |
11:35:18 | | SkilledAlpaca418962 joins |
11:35:21 | <thighs> | Oh that's great, didn't think things would go this fast :D |
11:38:33 | <thighs> | I just discovered this site today, so I don't know much about it. It looks dated and run down a bit, but I couldn't even think about archiving it on my own for multiple reasons unfortunately. femboyfinancial.jp has been returning a 502 for a while from what I can see, though. |
11:39:34 | | gust joins |
12:04:44 | | nothere joins |
12:09:23 | | kuroger quits [Client Quit] |
12:13:21 | | BornOn420 quits [Remote host closed the connection] |
12:13:55 | | BornOn420 (BornOn420) joins |
12:23:03 | | kuroger (kuroger) joins |
12:35:08 | | Ketchup901 quits [Remote host closed the connection] |
12:35:24 | | Ketchup901 (Ketchup901) joins |
12:35:34 | | Wohlstand quits [Remote host closed the connection] |
12:41:42 | | Webuser424104 joins |
12:42:23 | | Webuser359951 joins |
12:49:19 | | Webuser359951 leaves |
13:08:25 | | kuroger quits [Client Quit] |
13:08:44 | | SootBector (SootBector) joins |
13:11:52 | | kuroger (kuroger) joins |
13:32:56 | | kuroger quits [Client Quit] |
13:44:54 | | kuroger (kuroger) joins |
14:02:10 | | anarcat quits [Quit: rebooting] |
14:03:07 | <@arkiver> | (replying to the email) |
14:03:34 | | anarcat (anarcat) joins |
14:05:51 | <@arkiver> | thighs: see email |
14:06:03 | <@arkiver> | (job started for https://fembooru.jp/ ) |
14:06:32 | | anarcat quits [Client Quit] |
14:07:53 | | sparky1492 quits [Remote host closed the connection] |
14:08:12 | | sparky1492 (sparky1492) joins |
14:08:19 | | anarcat (anarcat) joins |
14:23:13 | | Webuser424104 quits [Client Quit] |
14:27:27 | | notarobot16 joins |
14:29:28 | | notarobot1 quits [Ping timeout: 250 seconds] |
14:29:28 | | notarobot16 is now known as notarobot1 |
14:58:29 | | VoynichCR (VoynichCR) joins |
14:58:30 | <eggdrop> | [tell] VoynichCR: [2025-03-26T01:30:46Z] <pabs> true, however the AT wiki page about SoftwareHeritage isn't about archiving SWH stuff, but about SWH's code archive |
14:58:31 | <eggdrop> | [tell] VoynichCR: [2025-03-26T01:41:55Z] <pabs> found a compromise for SoftwareHeritage, pointed archive status/type at the mirroring info https://www.softwareheritage.org/mirrors/ https://www.softwareheritage.org/2019/10/03/enea/ |
15:00:39 | | FiTheArchiver joins |
15:06:10 | | sparky14921 (sparky1492) joins |
15:10:16 | | sparky1492 quits [Ping timeout: 260 seconds] |
15:10:16 | | sparky14921 is now known as sparky1492 |
15:18:38 | | Sokar joins |
15:22:31 | | NatTheCat quits [Ping timeout: 260 seconds] |
15:24:34 | | vitzli quits [Quit: Leaving] |
15:25:42 | | NatTheCat (NatTheCat) joins |
15:34:58 | | VoynichCR quits [Client Quit] |
15:38:37 | | lennier2 joins |
15:41:46 | | lennier2_ quits [Ping timeout: 260 seconds] |
15:49:37 | | arch quits [Remote host closed the connection] |
15:49:45 | | arch joins |
15:50:34 | | arch quits [Remote host closed the connection] |
15:50:43 | | arch joins |
15:56:06 | <h2ibot> | Manu edited Discourse/archived (+101, Queued discourse.criticalengineering.org): https://wiki.archiveteam.org/?diff=55106&oldid=54850 |
16:18:53 | | loug83181422 quits [Quit: The Lounge - https://thelounge.chat] |
16:19:15 | | chrismeller quits [Quit: chrismeller] |
16:19:28 | | loug83181422 joins |
16:19:50 | | chrismeller (chrismeller) joins |
16:20:16 | | kuroger quits [Client Quit] |
16:21:45 | | DLoader is now authenticated as DLoader |
16:21:45 | | DLoader quits [Changing host] |
16:21:45 | | DLoader (DLoader) joins |
16:24:45 | | kuroger (kuroger) joins |
16:24:52 | | egallager quits [Quit: This computer has gone to sleep] |
16:32:11 | | kuroger quits [Read error: Connection reset by peer] |
16:37:27 | | sparky14920 (sparky1492) joins |
16:37:46 | | sparky1492 quits [Ping timeout: 260 seconds] |
16:37:46 | | sparky14920 is now known as sparky1492 |
17:14:05 | | VoynichCR (VoynichCR) joins |
17:26:44 | | nomead joins |
17:30:59 | | sparky14925 (sparky1492) joins |
17:34:30 | | sparky1492 quits [Ping timeout: 250 seconds] |
17:34:31 | | sparky14925 is now known as sparky1492 |
17:50:59 | | Lord_Nightmare2 (Lord_Nightmare) joins |
17:51:50 | | Lord_Nightmare quits [Ping timeout: 250 seconds] |
17:51:50 | | Lord_Nightmare2 is now known as Lord_Nightmare |
18:12:30 | <h2ibot> | HadeanEon edited Deaths in 2025 (-5889, BOT - Updating page: {{saved}} (80),…): https://wiki.archiveteam.org/?diff=55107&oldid=55099 |
18:12:31 | <h2ibot> | HadeanEon edited Deaths in 2025/list (+257, BOT - Updating list): https://wiki.archiveteam.org/?diff=55108&oldid=55100 |
18:15:31 | <h2ibot> | VoynichCr edited Deaths in 2025 (-804): https://wiki.archiveteam.org/?diff=55109&oldid=55107 |
18:19:24 | | nomead quits [Client Quit] |
18:24:32 | | VoynichCR quits [Client Quit] |
18:37:54 | | AlsoHP_Archivist quits [Read error: Connection reset by peer] |
19:03:05 | | lflare quits [Quit: Ping timeout (120 seconds)] |
19:03:25 | | lflare (lflare) joins |
19:09:41 | <h2ibot> | Manu edited Discourse/archived (+81, Queued scanlines.xyz): https://wiki.archiveteam.org/?diff=55110&oldid=55106 |
19:17:36 | | loug83181422 quits [Ping timeout: 260 seconds] |
19:19:12 | | loug83181422 joins |
19:27:53 | | loug83181422 quits [Read error: Connection reset by peer] |
19:28:41 | | loug83181422 joins |
19:29:45 | | Dango360_ (Dango360) joins |
19:33:14 | | Dango360 quits [Ping timeout: 250 seconds] |
19:34:25 | <gareth48|m> | JAA: How's the queue - will we get a chance to scrape the images for https://store.vket.com/en soon? I've discovered a few additional things of merit since we last talked, mostly through my own webscraping efforts, which might adjust the strategy. Namely, there are a ton of "unlisted" links that don't appear in the search grid but can easily be derived by iterating through product IDs (thankfully there are only about 10,000 products, so it's not absurd). I've preserved the downloads of all the items which are free, but I don't have the webpages. |
19:35:24 | | loug83181422 quits [Ping timeout: 250 seconds] |
19:39:33 | | loug83181422 joins |
19:40:51 | <h2ibot> | Exorcism edited Discourse/archived (+93): https://wiki.archiveteam.org/?diff=55111&oldid=55110 |
19:43:16 | | loug831814229 joins |
19:43:33 | | Dango360_ quits [Client Quit] |
19:43:42 | | Dango360 (Dango360) joins |
19:45:25 | <Dango360> | welp, this is the moment i've been expecting for a while |
19:45:27 | <Dango360> | after april 2nd, it will no longer be possible to get roblox assets without authentication |
19:45:30 | <Dango360> | https://devforum.roblox.com/t/creator-action-required-new-asset-delivery-api-endpoints-for-community-tools/3574403 |
19:47:21 | | loug83181422 quits [Ping timeout: 260 seconds] |
19:47:21 | | loug831814229 is now known as loug83181422 |
19:49:05 | <@JAA> | → #robloxd |
19:49:41 | <Dango360> | reposting the message there, thanks |
19:56:41 | | CYBERDEV joins |
19:58:01 | | loug831814229 joins |
19:59:44 | | loug8318142298 joins |
20:00:26 | | loug83181422 quits [Read error: Connection reset by peer] |
20:00:26 | | loug8318142298 is now known as loug83181422 |
20:02:07 | | loug831814229 quits [Read error: Connection reset by peer] |
20:04:17 | <@JAA> | gareth48|m: Just ran a test job, looks like the lists need to be quite small to get through within the 10 minutes. I'll run them over the next few hours. |
20:05:18 | | loug83181422 quits [Ping timeout: 250 seconds] |
20:08:44 | <@JAA> | Actually, there very much is rate limiting in the form of HTTP 405. |
20:08:59 | <gareth48|m> | JAA: yeah they're really limiting the connection speed since they're shutting down soon.... (full message at <https://matrix.hackint.org/_irc/v1/media/download/AUJIIDrGJTppwaOemaPQBM7szzSpRwVElmayw2P6-He9tkM5sRvK_x5cR5uXPLeW5-FU77qTnSyxVExbbbQQ1aNCffQP5BVAAGhhY2tpbnQub3JnL3NzQnd4ZGVYdVhzQ2dSdWZVeEtmWEtOWA>) |
20:09:49 | <@JAA> | That's what I was going to do, yes, but it's not feasible with AB due to the rate limits. |
20:10:08 | <@JAA> | Kicks in after ~200 requests at full speed. |
20:10:38 | <gareth48|m> | JAA: Gotcha, just saw your test in archivebot. Yeah it would have to be split up over multiple jobs. |
20:10:56 | <@JAA> | I was already going to split it into 10 jobs each for en and ja. |
20:11:35 | <gareth48|m> | JAA: Not surprised you already thought of it y'all have much more experience than I do, just figured I'd suggest it. |
20:12:00 | <@JAA> | But even a list of 1k isn't going to finish within 10 minutes with that limiting, so this needs something else. |
20:16:23 | <gareth48|m> | JAA: Let me know what you think of. Thanks for taking point on tackling this! |
20:20:58 | <gareth48|m> | My own work has been going well, I've archived ~95% of the free items and their associated gallery images. Working on parsing the tags into json and downloading edge cases. |
20:22:05 | <@JAA> | The downloads are all loginwalled, right? |
20:22:43 | | BitsNBytesNBagels joins |
20:22:52 | <gareth48|m> | JAA: Yeah, they are. You have to log in and then also go to each site's purchase page and order it. |
20:24:14 | <gareth48|m> | Also the site requires google, line, apple, or microsoft authentication they dont have their own account system |
20:24:30 | <gareth48|m> | s/account/auth/ |
20:31:59 | | BlueMaxima joins |
20:57:18 | | sparky14924 (sparky1492) joins |
20:58:07 | <mgrandi> | https://www.theverge.com/news/635915/game-informer-return-gunzilla-games |
21:00:51 | | sparky1492 quits [Ping timeout: 260 seconds] |
21:00:51 | | sparky14924 is now known as sparky1492 |
21:11:22 | | aeg leaves |
21:32:07 | | lennier2_ joins |
21:33:01 | | egallager joins |
21:35:00 | | lennier2 quits [Ping timeout: 250 seconds] |
21:36:53 | | etnguyen03 (etnguyen03) joins |
21:41:16 | | Island joins |
21:51:51 | <@JAA> | gareth48|m: Interesting, some pages reference images that return 404s. Anyway, I have something running now with qwarc. |
21:54:22 | <@JAA> | Rough ETA 11-12 hours |
21:56:01 | <@JAA> | That's /en/items/$id and /ja/items/$id plus any URLs on those that contain 'X-Amz-Expires'. |
21:59:44 | <glassy> | how you keeping JAA :) |
22:02:00 | | thighs quits [Quit: Ooops, wrong browser tab.] |
22:07:27 | | Dango360 quits [Client Quit] |
22:07:45 | | Dango360 (Dango360) joins |
22:09:59 | | Island quits [Read error: Connection reset by peer] |
22:10:21 | | Island joins |
22:15:36 | | @imer quits [Quit: Oh no] |
22:16:07 | | imer (imer) joins |
22:16:07 | | @ChanServ sets mode: +o imer |
22:19:24 | <@JAA> | No 405s so far, so that's good. |
22:19:59 | <@JAA> | glassy: <thisisfine.png> as usual :-) |
22:20:21 | <glassy> | same sh*t different day? |
22:21:46 | <gareth48|m> | <JAA> "gareth48: Interesting, some..." <- Huh, that is weird. Quite a few of the products are 404'd due to being out of sale, removed by the owner, or deleted for being adverse
22:21:53 | | BitsNBytesNBagels quits [Quit: My MacBook has gone to sleep. ZZZzzz…] |
22:22:11 | <gareth48|m> | I haven't run statistics but I collected a list of all the codes associated with the various items for quick reference when scraping the actual products |
22:22:36 | <FiTheArchiver> | is wayback machine down? |
22:23:01 | <@JAA> | glassy: Aye |
22:23:16 | <glassy> | FiTheArchiver IA sad as awhole :( |
22:23:22 | <gareth48|m> | <JAA> "That's /en/items/$id and /ja/..." <- Yeah, that's solid logic; that should get the vast majority of it. Will those thumbnails populate back to the store browse pages?
22:23:22 | <gareth48|m> | If not those would be a good secondary target, but this is already 90% of the way there because the actual product information and photos are preserved |
22:23:35 | <@JAA> | gareth48|m: Yeah, I'm seeing lots of 404s on item pages, and I expected as much. |
22:23:59 | <@JAA> | The images are weird, but such is the web. |
22:24:21 | <@JAA> | Every page references a different signed URL, so no, those won't appear there and would have to be refetched for store pages. |
22:24:26 | <gareth48|m> | Sorry to crosstalk, but thanks so much for y'all's help. I'm already busting ass grabbing as much as I can of the available items; I would've been doomed trying to get a full scrape like this alone. I really appreciate the help. It'll be good to have this good of a record for an interesting part of VR history
22:24:44 | <gareth48|m> | JAA: makes sense |
22:24:47 | <@JAA> | In principle, it could be fixed with a userscript down the line, I suppose. |
22:24:48 | <FiTheArchiver> | ahh ok. thought my wayback downloader was broken but i think cdx is down |
22:25:25 | <@JAA> | There are broken items, too: https://store.vket.com/en/items/1042 |
22:27:01 | <gareth48|m> | JAA: This is one of those that bizarrely returns a 500 error right? |
22:27:33 | <@JAA> | Yeah |
22:28:03 | <datechnoman> | 20 bucks its a power outage at IA :P |
22:28:21 | <@JAA> | Yeah, usually is. |
22:28:26 | <gareth48|m> | yep lol, no idea what's up with those |
22:29:08 | <glassy> | just as urls was catching up |
22:30:21 | <gareth48|m> | It caused my web scraper to go into an infinite loop the first time around because I was only checking for 404 or 200 (I'm on the newer end of doing this lol)
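The failure mode described here, retrying forever on an unexpected status, can be avoided by classifying status codes explicitly and capping retries. A minimal sketch, with illustrative names and thresholds rather than the scraper's actual logic:

```python
# Classify HTTP status codes explicitly instead of assuming 200-or-404,
# so an unexpected 500 can't send the scraper into an infinite retry loop.
def classify_status(status, attempt, max_attempts=3):
    """Return 'ok', 'gone', 'retry', or 'give_up' for one fetch attempt."""
    if status == 200:
        return 'ok'
    if status in (404, 410):
        return 'gone'
    if attempt < max_attempts:
        return 'retry'      # e.g. 500 or 429: retry a bounded number of times
    return 'give_up'        # log the ID and move on after max_attempts

print(classify_status(200, 1))
print(classify_status(500, 1))
print(classify_status(500, 3))
```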
22:31:17 | <@JAA> | How many of these did you see? |
22:32:01 | <gareth48|m> | JAA: Is this in response to me mentioning the 500 errors? My client isn't rendering it as such
22:32:38 | <@JAA> | IRC doesn't have responses. :-) |
22:32:41 | <@JAA> | And yes |
22:33:13 | | etnguyen03 quits [Client Quit] |
22:34:22 | <gareth48|m> | right, good point. I'm very new to IRC, in case that wasn't obvious lol. I'm picking it up slowly but surely. |
22:34:22 | <gareth48|m> | 22 HTTP 500 errors in total, for all ~9800 items
22:35:36 | <gareth48|m> | Sorry, I meant 18; Ctrl+F user error there lol
22:35:55 | <@JAA> | I'm at 2 in 900, so that's similar enough of a rate. |
22:36:31 | | etnguyen03 (etnguyen03) joins |
22:37:19 | <gareth48|m> | Huh, interesting, my first was at 795 and my next was at 1042. |
22:37:33 | <@JAA> | I'm doing the IDs in random order. |
22:38:03 | <gareth48|m> | Oh gotcha, let me know the final count, I'll be curious to compare it against what I found |
22:38:51 | <@JAA> | Oh yeah, for items that don't exist, I only fetch the English page. |
22:39:23 | <@JAA> | Hopefully, there are no items that only exist under /ja/. :-P |
22:39:54 | <gareth48|m> | JAA: that would be a horrible edge case lol. I'll check because I'm curious and let you know
22:40:41 | <@JAA> | Hmm, maybe I should fix that and start over. |
22:41:50 | <@JAA> | I didn't intend to do it like that. Just forgot to adjust the logic when I added /ja/. |
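The logic fix JAA describes amounts to checking both locale pages before declaring an ID nonexistent. A self-contained sketch: `fetch_status` is a stub standing in for the real HTTP fetch, and the fake status table is invented for illustration:

```python
# Decide whether an item ID exists by checking both locale pages, so a
# hypothetical /ja/-only item isn't missed when /en/ returns 404.
# FAKE_STATUSES stubs out the real HTTP layer to keep the sketch runnable.
FAKE_STATUSES = {'/en/items/7': 404, '/ja/items/7': 200,
                 '/en/items/8': 404, '/ja/items/8': 404}

def fetch_status(path):
    """Stand-in for an actual HTTP request; unknown paths 404."""
    return FAKE_STATUSES.get(path, 404)

def item_exists(item_id):
    """An item counts as existing if either locale page returns 200."""
    return any(fetch_status(f'/{loc}/items/{item_id}') == 200
               for loc in ('en', 'ja'))

print(item_exists(7))  # ja-only item
print(item_exists(8))
```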
22:41:59 | <FiTheArchiver> | wayback machine just tweeted it is indeed power issues |
22:42:41 | <gareth48|m> | It'll be a few hours until I get that data, since I'm currently writing my tag parser and extractor. I'm using beautiful soup to grab the html and parse relevant tags into json (for eventual conversion into archive.org metadata for all free items) |
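The tag-parsing step described here, extracting tags from the HTML and emitting JSON, might look roughly like this. The scraper reportedly uses BeautifulSoup; this sketch uses the stdlib `html.parser` so it is self-contained, and the `class="tag"` markup is a guess, not the store's actual HTML:

```python
import json
from html.parser import HTMLParser

# Collect the text of elements whose class is "tag" and dump them as JSON.
# Class name and markup are illustrative; the real scraper uses BeautifulSoup.
class TagExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []
        self._in_tag = False

    def handle_starttag(self, name, attrs):
        if name == 'span' and ('class', 'tag') in attrs:
            self._in_tag = True

    def handle_data(self, data):
        if self._in_tag:
            self.tags.append(data.strip())

    def handle_endtag(self, name):
        if name == 'span':
            self._in_tag = False

html = '<span class="tag">avatar</span><span class="tag">VRChat</span>'
p = TagExtractor()
p.feed(html)
print(json.dumps(p.tags))
```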
22:43:09 | <@JAA> | There's enough time, so I think I will start over (but keep the data I already got just in case). |
22:43:14 | <gareth48|m> | Very grateful I'm able to dedicate nearly 100% of my effort towards getting the assets off because I'm getting really close to having pulled it off. I'm in the edge case / verification stage now. |
22:43:29 | <@JAA> | Dedupe would be annoying if I ran those separately later. |
22:43:45 | <@JAA> | Nice :-) |
22:43:48 | <gareth48|m> | Oh yeah I can only guess lol. |
22:44:34 | | Wohlstand (Wohlstand) joins |
22:44:44 | <gareth48|m> | All of the free assets on the store were actually smaller than I expected! 1543 assets at only 43 GB, which is way less than I thought
22:54:44 | | BornOn420 quits [Remote host closed the connection] |
22:55:18 | | BornOn420 (BornOn420) joins |
22:59:28 | | sparky14921 (sparky1492) joins |
23:02:58 | | sparky1492 quits [Ping timeout: 250 seconds] |
23:02:59 | | sparky14921 is now known as sparky1492 |
23:05:55 | | sparky14920 (sparky1492) joins |
23:09:28 | | sparky1492 quits [Ping timeout: 250 seconds] |
23:09:29 | | sparky14920 is now known as sparky1492 |
23:13:26 | | etnguyen03 quits [Client Quit] |
23:17:21 | | klg quits [Ping timeout: 260 seconds] |
23:17:33 | | klg (klg) joins |
23:33:29 | | egallager quits [Client Quit] |
23:50:56 | | CraftByte quits [Quit: Ping timeout (120 seconds)] |
23:53:01 | | etnguyen03 (etnguyen03) joins |
23:56:22 | | Wohlstand quits [Client Quit] |