00:01:18ATinySpaceMarine joins
00:04:33DogsRNice joins
00:26:05<pabs>I recently saw some articles about Stack Exchange being obsoleted by AI, SO user engagement crashing, dying etc. they also added Cloudflare captchas for me, and I also saw some questions that are not yet in the WBM, and the data exports to IA items were discontinued due to "irresponsible" AI companies.
00:26:37<pabs>on the data exports: https://web.archive.org/web/20250425133147/https://meta.stackexchange.com/questions/401324/announcing-a-change-to-the-data-dump-process
00:26:51jspiros quits [Ping timeout: 260 seconds]
00:27:18<pabs>some posts: https://blog.pragmaticengineer.com/stack-overflow-is-almost-dead/ https://www.techzine.eu/news/devops/127669/stack-overflow-is-dying-is-it-being-replaced-by-ai/ https://devclass.com/2025/05/13/stack-overflow-seeks-rebrand-as-traffic-continues-to-plummet-which-is-bad-news-for-developers/
00:27:40<pabs>https://old.reddit.com/r/webdev/comments/116vvpp/saying_goodbye_to_stack_overflow/ https://codeandhack.com/stack-overflow-is-falling-apart/
00:28:33jspiros (jspiros) joins
00:29:36<pokechu22>The cloudflare captchas are weird; they happen when entering from duckduckgo but not when loading the page directly IIRC
00:31:05<pabs>I get them always, even when pasting a URL into a fresh browser profile
00:31:35<pabs>recent HN discussion about SO: https://news.ycombinator.com/item?id=43999125
00:33:08<steering>yeah, I've been getting captchas for a week or two.
00:39:11<ericgallager>for anyone with an extra 6.5TB sitting around: https://ddosecrets.com/article/psyclone-media
01:03:06<nicolas17>I have 59GB free :)
01:52:24<@JAA>pabs: #stackunderflow
01:55:12<h2ibot>PaulWise created StackOverflow (+28, add SO redirect): https://wiki.archiveteam.org/?title=StackOverflow
01:56:01<pabs>woops, a wiki search did not find the page for SO. added a redirect
01:57:16katocala quits [Ping timeout: 260 seconds]
01:59:13<h2ibot>PaulWise edited Stack Exchange (+274, add SO dying para): https://wiki.archiveteam.org/?diff=55781&oldid=54103
02:23:06dabs joins
02:26:40<pabs>pokechu22: hmm, AB 7qwz68jnobcw90l4utbhabotd old-fc2web-urls.txt isn't going to finish before the deadline. could you help me craft an ignore for offsite URLs?
02:27:14<pokechu22>!igd fc2web.com
02:27:43<pabs>its more than just that domain IIRC
02:27:52<pokechu22>Pabs: ^(http|ftp)s?://(?!([^/]*[@.])?fc2web\.com\.?(:\d+)?/)[^/]+/ for just fc2web.com and subdomains
02:28:31<pokechu22>uh, I'll check what else is in the list
02:30:20<pokechu22> cat old-fc2web-urls.txt | grep -Fv -e fc2web.com -e easter.ne.jp -e gooside.com -e k-free.net -e muvc.net -e 55street.net -e zero-city.com -e k-server.org -e ojiji.net -e zero-yen.com -e kt.fc2.com -e finito.fc2.com -e pimp.fc2.com -e happy-web.fc2.com -e ktplan.fc2.com
02:30:26<pokechu22>gives http://wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww http://momosakura21.zero-city http://pinkpink.zero-city http://tomoryoulove.zero-city http://halharuta.zero-city http://michi.zero-city http://largest.zero-city http://riorio.zero-city http://rabbit.zero-city http://taro.zero-city
02:30:28<pokechu22>http://natumi.zero-city http://lovelovecall.zero-city http://largest.finito-web.com
02:31:34<pokechu22>so allowing just those domains (well, all of fc2.com is probably easier)...
02:36:13<pokechu22>^(http|ftp)s?://(?!([^/]*[@.])?(fc2web\.com|easter\.ne\.jp|gooside\.com|k-free\.net|muvc\.net|55street\.net|zero-city\.com|k-server\.org|ojiji\.net|zero-yen\.com|fc2\.com)\.?(:\d+)?/)[^/]+/ should do it, apart from those broken URLs which I'll run separately
02:37:57<pokechu22>http://largest.finito-web.com/ might need more investigation actually, as there's other subdomains but that's the only one in that list
02:41:07dabs quits [Client Quit]
03:44:04<pokechu22>pabs: FYI, I didn't apply that ignore yet
03:45:15<pabs>maybe just allow all of finito-web.com, easiest option
03:46:10<pokechu22>eh, I don't think that's super useful since even if we allow it, it'd only recurse on http://largest.finito-web.com/ - finito-web.com will need its own !a < list. But I guess it's pretty cheap to add that too
03:47:45<pokechu22>ignore added
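[A minimal sketch of how that allowlist-style ignore pattern behaves, using Python's re module. The regex is the one pokechu22 posted above; the test URLs are hypothetical, added only to illustrate which requests get ignored versus kept.]

    import re

    # ArchiveBot-style ignore: matches (and thus ignores) any URL whose host
    # is NOT one of the allowed domains or one of their subdomains.
    IGNORE = re.compile(
        r'^(http|ftp)s?://'
        r'(?!([^/]*[@.])?'
        r'(fc2web\.com|easter\.ne\.jp|gooside\.com|k-free\.net|muvc\.net|'
        r'55street\.net|zero-city\.com|k-server\.org|ojiji\.net|zero-yen\.com|'
        r'fc2\.com)'
        r'\.?(:\d+)?/)'
        r'[^/]+/'
    )

    # Hypothetical test URLs, just to show the behaviour.
    tests = [
        'http://momosakura21.zero-city.com/index.html',  # allowed subdomain -> kept
        'http://kt.fc2.com/somepage.html',               # allowed via fc2.com -> kept
        'http://example.com/offsite.html',               # offsite -> ignored
    ]
    for url in tests:
        print(url, '->', 'ignored' if IGNORE.match(url) else 'kept')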
03:48:30<h2ibot>JustAnotherArchivist created WarriorBot (+471, Document that this sort of existed): https://wiki.archiveteam.org/?title=WarriorBot
03:48:31<h2ibot>Kevidryon2 created Glitch (+626, glitch is closing soon): https://wiki.archiveteam.org/?title=Glitch
03:48:33<pokechu22>hmm, it's only now getting robots.txt/sitemap.xml so it hasn't even recursed one layer deep :|
03:49:30<h2ibot>Nintendofan885 created National Archives Catalog (+973, create): https://wiki.archiveteam.org/?title=National%20Archives%20Catalog
03:49:31<h2ibot>Nintendofan885 edited NOAA (+25, no project but #UncleSamsArchive would make…): https://wiki.archiveteam.org/?diff=55785&oldid=54375
03:49:32<h2ibot>Nintendofan885 moved Archive Team press releases to Archiveteam:Press releases (move to project namespace): https://wiki.archiveteam.org/?title=Archiveteam%3APress%20releases
03:49:33<h2ibot>Nintendofan885 moved 2019-03-28 Help archive Google+ to Archiveteam:Press releases/2019-03-28 Help archive Google+ (move to subpage): https://wiki.archiveteam.org/?title=Archiveteam%3APress%20releases/2019-03-28%20Help%20archive%20Google%2B
03:51:30DogsRNice quits [Read error: Connection reset by peer]
03:56:32<h2ibot>JustAnotherArchivist created LIHKG (+280, Created page with "{{Infobox project | URL =…): https://wiki.archiveteam.org/?title=LIHKG
04:10:34<h2ibot>JustAnotherArchivist edited Mildom (+6): https://wiki.archiveteam.org/?diff=55791&oldid=53750
04:22:35<h2ibot>JustAnotherArchivist created Posts.cv (+541, Created page with "{{Infobox project | URL =…): https://wiki.archiveteam.org/?title=Posts.cv
04:23:36<h2ibot>Thezt edited List of websites excluded from the Wayback Machine/Partial exclusions (+43): https://wiki.archiveteam.org/?diff=55793&oldid=55638
04:23:37<h2ibot>JustAnotherArchivist edited Deathwatch (-46, Link to [[Glitch]] and [[Posts.cv]]): https://wiki.archiveteam.org/?diff=55794&oldid=55769
04:45:39<h2ibot>JustAnotherArchivist created Kinja (+1177, Created page with "{{Infobox project | URL =…): https://wiki.archiveteam.org/?title=Kinja
05:15:01Wohlstand quits [Ping timeout: 260 seconds]
05:44:31<pokechu22>I've started a separate archivebot job for finito-web.com, seeded with URLs from duckduckgo+google+IA CDX. The original list has http://www.finito.fc2.com/ stuff though
06:00:19<@arkiver>JAA: on closing the channels, feel free to just kick me out!
06:00:54<@JAA>arkiver: Yep, I'll kick everyone when I close them.
06:13:19Island quits [Read error: Connection reset by peer]
06:42:03<triplecamera|m>Hi. I'm looking for a file which hasn't been archived by the Wayback Machine. Is there a quick way to search through all archiving sites?
06:42:22ArchivalEfforts quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
06:42:31ArchivalEfforts joins
06:42:49<triplecamera|m>The file's URL is <http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-828-operating-system-engineering-fall-2006/labs/lab{1..6}handout.gz>, where 1 has been archived by wbm, but 2 to 6 haven't.
06:48:54<that_lurker>have you asked if the site maintainers still had them somewhere?
06:50:52<triplecamera|m>I will ask them if I can't find any copies from the Internet.
06:52:17<that_lurker>actually these might be the ones you are looking for https://dspace.mit.edu/bitstream/handle/1721.1/92292/6-828-fall-2006/contents/labs/index.htm
06:53:15<@JAA>https://pdos.csail.mit.edu/6.828/2006/index.html seems to have lab handouts for that lecture in that year.
06:53:25<triplecamera|m>Wikipedia says that there are many [web archiving initiatives](https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives), but I don't want to search one by one.
06:54:45<triplecamera|m>that_lurker: You are right. But unfortunately all tar.gz files from there are corrupted.
06:55:30<triplecamera|m>JAA: Yes, but the handouts for lab 5 & 6 are missing.
06:59:30<@JAA>Oof, yeah, the files on their DSpace have all high bytes replaced by U+FFFD. :-/
07:00:09<@JAA>They start with 1fefbfbd instead of 1f8b.
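[A small sketch of the corruption check JAA describes: a healthy gzip file starts with the magic bytes 1F 8B, while the damaged DSpace copies have every high byte replaced by U+FFFD (UTF-8: EF BF BD), so they begin 1F EF BF BD instead. The filename in the usage comment is hypothetical.]

    GZIP_MAGIC = b'\x1f\x8b'
    REPLACEMENT = b'\xef\xbf\xbd'  # UTF-8 encoding of U+FFFD

    def classify(path):
        """Report whether a .gz file looks intact or has had its high bytes mangled."""
        with open(path, 'rb') as f:
            head = f.read(4)
        if head.startswith(GZIP_MAGIC):
            return 'looks like a valid gzip file'
        if head.startswith(b'\x1f' + REPLACEMENT):
            return 'gzip header mangled: high bytes replaced by U+FFFD'
        return 'not gzip at all'

    # Example (hypothetical filename):
    # print(classify('lab2handout.gz'))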
08:09:31Dada joins
08:50:20janos777 joins
08:57:50janos777 quits [Ping timeout: 276 seconds]
09:19:45Megame (Megame) joins
09:40:35fmixolydian joins
09:43:17janos777 joins
09:51:25<fmixolydian>i just sent an email to a member of glitch's staff regarding which archiving method to use
09:51:52<fmixolydian>i've tried requests with bs4, but that didn't work (since the website is 99% generated with JS)
09:52:12<fmixolydian>should i just do a wget full website grab?
09:53:20<fmixolydian>(im kinda new to archiving websites, sorry)
10:02:55<fmixolydian>welp, i tried that and it doesnt work
10:03:35<fmixolydian>bc wget doesnt execute javascript, and its the included javascript that creates the links
10:06:41<fmixolydian>wait a second, i just noticed they have an api - maybe i could use that
10:08:06fmixolydian quits [Client Quit]
10:21:41Megame quits [Ping timeout: 276 seconds]
10:22:13Webuser309148 joins
10:23:32Webuser309148 quits [Client Quit]
10:40:31janos777 quits [Ping timeout: 260 seconds]
10:50:00fmixolydian joins
10:56:17<fmixolydian>i'd like to update yall on my glitch.com python dumping script
10:56:45<fmixolydian>i successfully managed to extract ~80 projects' data
10:56:54<fmixolydian>now i just need to speed it up
10:58:27<fmixolydian>unfortunately there are millions of projects, i doubt i'm gonna be able to grab all the data in time
11:00:03Bleo182600722719623455 quits [Quit: The Lounge - https://thelounge.chat]
11:02:46Bleo182600722719623455 joins
11:10:05janos777 joins
11:14:56tzt quits [Ping timeout: 260 seconds]
11:25:31fmixolydian quits [Client Quit]
11:29:53ducky quits [Remote host closed the connection]
11:32:29ducky (ducky) joins
11:34:01ducky quits [Read error: Connection reset by peer]
11:49:49ducky (ducky) joins
12:05:41janos777 quits [Ping timeout: 260 seconds]
12:37:05tzt (tzt) joins
12:37:11monoxane quits [Ping timeout: 260 seconds]
12:43:19ericgallager quits [Quit: Leaving]
12:46:04ericgallager joins
13:00:21ericgallager quits [Client Quit]
13:07:29fmixolydian joins
13:16:53fmixolydian quits [Client Quit]
13:35:02fmixolydian joins
13:36:24<fmixolydian>hello everyone, im doing a manual dump of glitch.com through a few scripts i made
13:36:56<fmixolydian>a few of the early userids (<20) have ~200 projects each on average
13:37:13<fmixolydian>however most userids now return 404s
13:38:04<fmixolydian>for now im only dumping project and user data (the project contents will have to be dumped differently)
13:45:39corentin joins
13:46:00<fmixolydian>hello
13:51:09@arkiver quits [Remote host closed the connection]
13:51:41arkiver (arkiver) joins
13:51:41@ChanServ sets mode: +o arkiver
13:52:07<fmixolydian>wb
13:54:17<fmixolydian>bruh i just got ratelimited from glitch
14:00:05<@arkiver>fmixolydian: for glitch.com, what would be really helpful is if you can make lists of the content that is on there. users, posts, projects, etc.
14:00:07<@arkiver>or find ways to list all of them for each type
14:01:52fuzzy8021 quits [Read error: Connection reset by peer]
14:01:58fuzzy80211 (fuzzy80211) joins
14:03:22fuzzy8021 (fuzzy80211) joins
14:03:24fuzzy80211 quits [Read error: Connection reset by peer]
14:05:16NotGLaDOS quits [Ping timeout: 260 seconds]
14:05:18terry joins
14:05:42<fmixolydian>arkiver: there are users, collections, emails, projects, and teams (afaik)
14:06:22<fmixolydian>here's the source code for my dumping script (if it helps): https://codeberg.org/fmixolydian/rpglitch
14:08:13<fmixolydian>dumpall.sh is a script that, for every user in a range, calls dumpuser.py (which outputs a list of project uuids), and calls dumpproj.py on each one
14:09:10<fmixolydian>the userids appear to go up to ~80 million, projects use UUID 4
14:10:49<fmixolydian>also, the main thing to dump would be the subdomains (<domain>.glitch.social), or `.domain` in each project's dumped JSON file
14:12:00<fmixolydian>they're mostly small and could be dumped with a simple wget dump - however, for dynamic projects you first have to make a request to "wake it up", kinda like vercel, wait ~10s, and then try to dump the contents with wget.
14:12:08<@arkiver>are user IDs sequential?
14:12:11<fmixolydian>yea
14:12:20<@arkiver>(i did not look into the site yet, just collecting info for when i do)
14:12:23<@arkiver>alright that is nice
14:12:26<fmixolydian>however, i've noticed large gaps in userids
14:12:33<@arkiver>and can all other item types be found through a user?
14:13:04<fmixolydian>rpglitch (the codename i gave to my dumping scripts) only looks for projects for each user
14:13:08<fmixolydian>but afaik yes
14:14:59<fmixolydian>(also, im not ratelimited anymore, but i'm gonna stop dumping glitch userids to not get banned)
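[A minimal sketch of the enumeration approach described above, not the actual rpglitch scripts. Only api.glitch.com itself is mentioned in the channel; the endpoint path, the response shape, and the hosting-domain suffix are assumptions and may need adjusting. Sequential user IDs with large gaps mean most requests will 404, so non-200 responses are simply skipped, and the loop sleeps between requests to stay under the rate limit.]

    import time
    import requests

    API = 'https://api.glitch.com'        # host from the chat; paths below are assumptions
    SLEEP_BETWEEN_REQUESTS = 1.0          # stay well under the rate limit

    def dump_user_projects(user_id):
        """Return the project list for one user ID, or [] on 404 / rate limit."""
        r = requests.get(f'{API}/users/{user_id}', timeout=30)  # hypothetical path
        if r.status_code != 200:
            return []
        return r.json().get('projects', [])   # assumed response shape

    def wake_and_mirror(domain):
        """Dynamic projects need a wake-up request before mirroring, per the chat."""
        url = f'https://{domain}.glitch.social/'  # subdomain pattern as given above; adjust if the real suffix differs
        requests.get(url, timeout=30)             # wake the container
        time.sleep(10)                            # give it ~10s to spin up
        # then hand the URL to wget / ArchiveBot for the actual grab

    for uid in range(1, 100):                     # tiny test range, not all ~80M IDs
        for project in dump_user_projects(uid):
            print(uid, project.get('domain'))
        time.sleep(SLEEP_BETWEEN_REQUESTS)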
14:23:37FiTheArchiver joins
14:24:05<fmixolydian>hello
14:25:04FiTheArchiver quits [Remote host closed the connection]
14:31:49<Dango360>seems that glitch project contents are sent via websockets (wss://api.glitch.com/{projectId}/ot?authorization={authIdThatIsGeneratedPerBrowserSession})
14:38:57ericgallager joins
14:43:50<@arkiver>oof i hope not
14:46:10fmixolydian quits [Client Quit]
14:48:07fmixolydian joins
14:49:16<Dango360>you can view the websocket messages in inspect element, network tab https://glitch.com/edit/#!/archiveteam-tracker-tgbot
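[A hedged sketch for peeking at that "ot" websocket. The URL shape (wss://api.glitch.com/{projectId}/ot?authorization=...) is taken from the chat; the project UUID and the session token are placeholders that would have to come from a browser session, and the message format is not documented here, so this just prints whatever frames arrive.]

    import asyncio
    import websockets  # pip install websockets

    PROJECT_ID = '00000000-0000-4000-8000-000000000000'  # hypothetical UUID
    AUTH = 'paste-a-browser-session-token-here'

    async def peek():
        url = f'wss://api.glitch.com/{PROJECT_ID}/ot?authorization={AUTH}'
        async with websockets.connect(url) as ws:
            for _ in range(5):          # just look at the first few frames
                print(await ws.recv())

    asyncio.run(peek())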
14:55:42ahm258760 quits [Quit: The Lounge - https://thelounge.chat]
14:56:30<fmixolydian>Dango360: you mean ot and logs?
14:56:53<Dango360>fmixolydian: ot
14:57:02<fmixolydian>yea
14:58:27<fmixolydian>i cant connect to the project
14:58:36<Dango360>try a different project
14:58:39<Dango360>it might be broken
14:59:01<fmixolydian>static projects work fine
14:59:08<fmixolydian>lemme try another dynamic one
14:59:52ahm258760 joins
15:03:42<fmixolydian>Dango360: my own dynamic project doesnt work either
15:15:55Mateon2 joins
15:18:11Mateon1 quits [Ping timeout: 260 seconds]
15:18:11Mateon2 is now known as Mateon1
15:28:29Mateon2 joins
15:29:09ericgallager quits [Client Quit]
15:31:01Mateon1 quits [Ping timeout: 260 seconds]
15:31:01Mateon2 is now known as Mateon1
15:41:53grill (grill) joins
15:50:14fmixolydian quits [Client Quit]
15:55:30ericgallager joins
16:05:11janos777 joins
16:19:31Island joins
16:27:21Naruyoko5 joins
16:30:45BornOn420_ quits [Remote host closed the connection]
16:31:06Naruyoko quits [Ping timeout: 260 seconds]
16:31:25BornOn420 (BornOn420) joins
17:12:55BornOn420 quits [Remote host closed the connection]
17:13:33BornOn420 (BornOn420) joins
17:15:39<h2ibot>HadeanEon edited Deaths in 2025 (+862, BOT - Updating page: {{saved}} (130),…): https://wiki.archiveteam.org/?diff=55796&oldid=55778
17:15:40<h2ibot>HadeanEon edited Deaths in 2025/list (+61, BOT - Updating list): https://wiki.archiveteam.org/?diff=55797&oldid=55779
17:20:21JayEmbee quits [Quit: WeeChat 4.1.1]
17:53:26grill quits [Ping timeout: 276 seconds]
17:59:00fmixolydian joins
18:02:03fmixolydian quits [Client Quit]
19:18:15Wohlstand (Wohlstand) joins
19:51:23janos777 quits [Read error: Connection reset by peer]
21:05:02etnguyen03 (etnguyen03) joins
21:30:03Naruyoko joins
21:33:16Naruyoko5 quits [Ping timeout: 260 seconds]
21:36:35Naruyoko5 joins
21:40:56Naruyoko quits [Ping timeout: 276 seconds]
21:42:18Naruyoko joins
21:44:56Naruyoko5 quits [Ping timeout: 260 seconds]
22:47:28Dada quits [Remote host closed the connection]
23:10:55etnguyen03 quits [Client Quit]
23:32:24etnguyen03 (etnguyen03) joins
23:42:32ATinySpaceMarine quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]