00:01:18ATinySpaceMarine joins
00:04:33DogsRNice joins
00:26:05<pabs>I recently saw some articles about Stack Exchange being obsoleted by AI, SO user engagement crashing, dying etc. they also added Cloudflare captchas for me, and I also saw some questions that are not yet in the WBM, and the data exports to IA items were discontinued due to "irresponsible" AI companies.
00:26:37<pabs>on the data exports: https://web.archive.org/web/20250425133147/https://meta.stackexchange.com/questions/401324/announcing-a-change-to-the-data-dump-process
00:26:51jspiros quits [Ping timeout: 260 seconds]
00:27:18<pabs>some posts: https://blog.pragmaticengineer.com/stack-overflow-is-almost-dead/ https://www.techzine.eu/news/devops/127669/stack-overflow-is-dying-is-it-being-replaced-by-ai/ https://devclass.com/2025/05/13/stack-overflow-seeks-rebrand-as-traffic-continues-to-plummet-which-is-bad-news-for-developers/
00:27:40<pabs>https://old.reddit.com/r/webdev/comments/116vvpp/saying_goodbye_to_stack_overflow/ https://codeandhack.com/stack-overflow-is-falling-apart/
00:28:33jspiros (jspiros) joins
00:29:36<pokechu22>The cloudflare captchas are weird; they happen when entering from duckduckgo but not when loading the page directly IIRC
00:31:05<pabs>I get them always, even when pasting a URL into a fresh browser profile
00:31:35<pabs>recent HN discussion about SO: https://news.ycombinator.com/item?id=43999125
00:33:08<steering>yeah, I've been getting captchas for a week or two.
00:39:11<ericgallager>for anyone with an extra 6.5TB sitting around: https://ddosecrets.com/article/psyclone-media
01:03:06<nicolas17>I have 59GB free :)
01:52:24<@JAA>pabs: #stackunderflow
01:55:12<h2ibot>PaulWise created StackOverflow (+28, add SO redirect): https://wiki.archiveteam.org/?title=StackOverflow
01:56:01<pabs>woops, a wiki search did not find the page for SO. added a redirect
01:57:16katocala quits [Ping timeout: 260 seconds]
01:59:13<h2ibot>PaulWise edited Stack Exchange (+274, add SO dying para): https://wiki.archiveteam.org/?diff=55781&oldid=54103
02:23:06dabs joins
02:26:40<pabs>pokechu22: hmm, AB 7qwz68jnobcw90l4utbhabotd old-fc2web-urls.txt isn't going to finish before the deadline. could you help me craft an ignore for offsite URLs?
02:27:14<pokechu22>!igd fc2web.com
02:27:43<pabs>its more than just that domain IIRC
02:27:52<pokechu22>Pabs: ^(http|ftp)s?://(?!([^/]*[@.])?fc2web\.com\.?(:\d+)?/)[^/]+/ for just fc2web.com and subdomains
02:28:31<pokechu22>uh, I'll check what else is in the list
02:30:20<pokechu22> cat old-fc2web-urls.txt | grep -Fv -e fc2web.com -e easter.ne.jp -e gooside.com -e k-free.net -e muvc.net -e 55street.net -e zero-city.com -e k-server.org -e ojiji.net -e zero-yen.com -e kt.fc2.com -e finito.fc2.com -e pimp.fc2.com -e happy-web.fc2.com -e ktplan.fc2.com
02:30:26<pokechu22>gives http://wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww http://momosakura21.zero-city http://pinkpink.zero-city http://tomoryoulove.zero-city http://halharuta.zero-city http://michi.zero-city http://largest.zero-city http://riorio.zero-city http://rabbit.zero-city http://taro.zero-city
02:30:28<pokechu22>http://natumi.zero-city http://lovelovecall.zero-city http://largest.finito-web.com
02:31:34<pokechu22>so allowing just those domains (well, all of fc2.com is probably easier)...
02:36:13<pokechu22>^(http|ftp)s?://(?!([^/]*[@.])?(fc2web\.com|easter\.ne\.jp|gooside\.com|k-free\.net|muvc\.net|55street\.net|zero-city\.com|k-server\.org|ojiji\.net|zero-yen\.com|fc2\.com)\.?(:\d+)?/)[^/]+/ should do it, apart from those broken URLs which I'll run separately
02:37:57<pokechu22>http://largest.finito-web.com/ might need more investigation actually, as there's other subdomains but that's the only one in that list
02:41:07dabs quits [Client Quit]
03:44:04<pokechu22>pabs: FYI, I didn't apply that ignore yet
03:45:15<pabs>maybe just allow all of finito-web.com, easiest option
03:46:10<pokechu22>eh, I don't think that's super useful since even if we allow it, it'd only recurse on http://largest.finito-web.com/ - finito-web.com will need its own !a < list. But I guess it's pretty cheap to add that too
03:47:45<pokechu22>ignore added
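[A minimal sketch of how that allowlist-style ignore pattern behaves, using Python's re module. The regex is the one pokechu22 posted above; the test URLs are hypothetical, added only to illustrate which requests get ignored versus kept.]

    import re

    # ArchiveBot-style ignore: matches (and thus ignores) any URL whose host
    # is NOT one of the allowed domains or one of their subdomains.
    IGNORE = re.compile(
        r'^(http|ftp)s?://'
        r'(?!([^/]*[@.])?'
        r'(fc2web\.com|easter\.ne\.jp|gooside\.com|k-free\.net|muvc\.net|'
        r'55street\.net|zero-city\.com|k-server\.org|ojiji\.net|zero-yen\.com|'
        r'fc2\.com)'
        r'\.?(:\d+)?/)'
        r'[^/]+/'
    )

    # Hypothetical test URLs, just to show the behaviour.
    tests = [
        'http://momosakura21.zero-city.com/index.html',  # allowed subdomain -> kept
        'http://kt.fc2.com/somepage.html',               # allowed via fc2.com -> kept
        'http://example.com/offsite.html',               # offsite -> ignored
    ]
    for url in tests:
        print(url, '->', 'ignored' if IGNORE.match(url) else 'kept')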
03:48:30<h2ibot>JustAnotherArchivist created WarriorBot (+471, Document that this sort of existed): https://wiki.archiveteam.org/?title=WarriorBot
03:48:31<h2ibot>Kevidryon2 created Glitch (+626, glitch is closing soon): https://wiki.archiveteam.org/?title=Glitch
03:48:33<pokechu22>hmm, it's only now getting robots.txt/sitemap.xml so it hasn't even recursed one layer deep :|
03:49:30<h2ibot>Nintendofan885 created National Archives Catalog (+973, create): https://wiki.archiveteam.org/?title=National%20Archives%20Catalog
03:49:31<h2ibot>Nintendofan885 edited NOAA (+25, no project but #UncleSamsArchive would make…): https://wiki.archiveteam.org/?diff=55785&oldid=54375
03:49:32<h2ibot>Nintendofan885 moved Archive Team press releases to Archiveteam:Press releases (move to project namespace): https://wiki.archiveteam.org/?title=Archiveteam%3APress%20releases
03:49:33<h2ibot>Nintendofan885 moved 2019-03-28 Help archive Google+ to Archiveteam:Press releases/2019-03-28 Help archive Google+ (move to subpage): https://wiki.archiveteam.org/?title=Archiveteam%3APress%20releases/2019-03-28%20Help%20archive%20Google%2B
03:51:30DogsRNice quits [Read error: Connection reset by peer]
03:56:32<h2ibot>JustAnotherArchivist created LIHKG (+280, Created page with "{{Infobox project | URL =…): https://wiki.archiveteam.org/?title=LIHKG
04:10:34<h2ibot>JustAnotherArchivist edited Mildom (+6): https://wiki.archiveteam.org/?diff=55791&oldid=53750
04:22:35<h2ibot>JustAnotherArchivist created Posts.cv (+541, Created page with "{{Infobox project | URL =…): https://wiki.archiveteam.org/?title=Posts.cv
04:23:36<h2ibot>Thezt edited List of websites excluded from the Wayback Machine/Partial exclusions (+43): https://wiki.archiveteam.org/?diff=55793&oldid=55638
04:23:37<h2ibot>JustAnotherArchivist edited Deathwatch (-46, Link to [[Glitch]] and [[Posts.cv]]): https://wiki.archiveteam.org/?diff=55794&oldid=55769
04:45:39<h2ibot>JustAnotherArchivist created Kinja (+1177, Created page with "{{Infobox project | URL =…): https://wiki.archiveteam.org/?title=Kinja
05:15:01Wohlstand quits [Ping timeout: 260 seconds]
05:44:31<pokechu22>I've started a separate archivebot job for finito-web.com, seeded with URLs from duckduckgo+google+IA CDX. The original list has http://www.finito.fc2.com/ stuff though
06:00:19<@arkiver>JAA: on closing the channels, feel free to just kick me out!
06:00:54<@JAA>arkiver: Yep, I'll kick everyone when I close them.
06:13:19Island quits [Read error: Connection reset by peer]
06:42:03<triplecamera|m>Hi. I'm looking for a file which hasn't been archived by the Wayback Machine. Is there a quick way to search through all archiving sites?
06:42:22ArchivalEfforts quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
06:42:31ArchivalEfforts joins
06:42:49<triplecamera|m>The file's URL is <http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-828-operating-system-engineering-fall-2006/labs/lab{1..6}handout.gz>, where 1 has been archived by wbm, but 2 to 6 haven't.
06:48:54<that_lurker>have you asked if the site maintainers still had them somewhere?
06:50:52<triplecamera|m>I will ask them if I can't find any copies from the Internet.
06:52:17<that_lurker>actually these might be the ones you are looking for https://dspace.mit.edu/bitstream/handle/1721.1/92292/6-828-fall-2006/contents/labs/index.htm
06:53:15<@JAA>https://pdos.csail.mit.edu/6.828/2006/index.html seems to have lab handouts for that lecture in that year.
06:53:25<triplecamera|m>Wikipedia says that there are many [web archiving initiatives](https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives), but I don't want to search one by one.
06:54:45<triplecamera|m>that_lurker: You are right. But unfortunately all tar.gz files from there are corrupted.
06:55:30<triplecamera|m>JAA: Yes, but the handouts for lab 5 & 6 are missing.
06:59:30<@JAA>Oof, yeah, the files on their DSpace have all high bytes replaced by U+FFFD. :-/
07:00:09<@JAA>They start with 1fefbfbd instead of 1f8b.
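[A small sketch of the corruption check JAA describes: a healthy gzip file starts with the magic bytes 1F 8B, while the damaged DSpace copies have every high byte replaced by U+FFFD (UTF-8: EF BF BD), so they begin 1F EF BF BD instead. The filename in the usage comment is hypothetical.]

    GZIP_MAGIC = b'\x1f\x8b'
    REPLACEMENT = b'\xef\xbf\xbd'  # UTF-8 encoding of U+FFFD

    def classify(path):
        """Report whether a .gz file looks intact or has had its high bytes mangled."""
        with open(path, 'rb') as f:
            head = f.read(4)
        if head.startswith(GZIP_MAGIC):
            return 'looks like a valid gzip file'
        if head.startswith(b'\x1f' + REPLACEMENT):
            return 'gzip header mangled: high bytes replaced by U+FFFD'
        return 'not gzip at all'

    # Example (hypothetical filename):
    # print(classify('lab2handout.gz'))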
08:09:31Dada joins
08:50:20janos777 joins
08:57:50janos777 quits [Ping timeout: 276 seconds]
09:19:45Megame (Megame) joins
09:40:35fmixolydian joins
09:43:17janos777 joins
09:51:25<fmixolydian>i just sent an email to a member of glitch's staff regarding which archiving method to use
09:51:52<fmixolydian>i've tried requests with bs4, but that didn't work (since the website is 99% generated with JS)
09:52:12<fmixolydian>should i just do a wget full website grab?
09:53:20<fmixolydian>(im kinda new to archiving websites, sorry)
10:02:55<fmixolydian>welp, i tried that and it doesnt work
10:03:35<fmixolydian>bc wget doesnt execute javascript, and its the included javascript that creates the links
10:06:41<fmixolydian>wait a second, i just noticed they have an api - maybe i could use that
10:08:06fmixolydian quits [Client Quit]
10:21:41Megame quits [Ping timeout: 276 seconds]
10:22:13Webuser309148 joins
10:23:32Webuser309148 quits [Client Quit]
10:40:31janos777 quits [Ping timeout: 260 seconds]
10:50:00fmixolydian joins
10:56:17<fmixolydian>i'd like to update yall on my glitch.com python dumping script
10:56:45<fmixolydian>i successfully managed to extract ~80 projects' data
10:56:54<fmixolydian>now i just need to speed it up
10:58:27<fmixolydian>unfortunately there are millions of projects, i doubt i'm gonna be able to grab all the data in time
11:00:03Bleo182600722719623455 quits [Quit: The Lounge - https://thelounge.chat]
11:02:46Bleo182600722719623455 joins
11:10:05janos777 joins
11:14:56tzt quits [Ping timeout: 260 seconds]
11:25:31fmixolydian quits [Client Quit]
11:29:53ducky quits [Remote host closed the connection]
11:32:29ducky (ducky) joins
11:34:01ducky quits [Read error: Connection reset by peer]
11:49:49ducky (ducky) joins
12:05:41janos777 quits [Ping timeout: 260 seconds]
12:37:05tzt (tzt) joins
12:37:11monoxane quits [Ping timeout: 260 seconds]
12:43:19ericgallager quits [Quit: Leaving]
12:46:04ericgallager joins
13:00:21ericgallager quits [Client Quit]
13:07:29fmixolydian joins
13:16:53fmixolydian quits [Client Quit]
13:35:02fmixolydian joins
13:36:24<fmixolydian>hello everyone, im doing a manual dump of glitch.com through a few scripts i made
13:36:56<fmixolydian>a few of the early userids (<20) have ~200 projects each on average
13:37:13<fmixolydian>however most userids now return 404s
13:38:04<fmixolydian>for now im only dumping project and user data (the project contents will have to be dumped differently)
13:45:39corentin joins
13:46:00<fmixolydian>hello
13:51:09@arkiver quits [Remote host closed the connection]
13:51:41arkiver (arkiver) joins
13:51:41@ChanServ sets mode: +o arkiver
13:52:07<fmixolydian>wb
13:54:17<fmixolydian>bruh i just got ratelimited from glitch
14:00:05<@arkiver>fmixolydian: for glitch.com, what would be really helpful is if you can make lists of the content that is on there. users, posts, projects, etc.
14:00:07<@arkiver>or find ways to list all of them for each type
14:01:52fuzzy8021 quits [Read error: Connection reset by peer]
14:01:58fuzzy80211 (fuzzy80211) joins
14:03:22fuzzy8021 (fuzzy80211) joins
14:03:24fuzzy80211 quits [Read error: Connection reset by peer]
14:05:16NotGLaDOS quits [Ping timeout: 260 seconds]
14:05:18terry joins
14:05:42<fmixolydian>arkiver: there are users, collections, emails, projects, and teams (afaik)
14:06:22<fmixolydian>here's the source code for my dumping script (if it helps): https://codeberg.org/fmixolydian/rpglitch
14:08:13<fmixolydian>dumpall.sh is a script that, for every user in a range, calls dumpuser.py (which outputs a list of project uuids), and calls dumpproj.py on each one
14:09:10<fmixolydian>the userids appear to go up to ~80 million, projects use UUID 4
14:10:49<fmixolydian>also, the main thing to dump would be the subdomains (<domain>.glitch.social), or `.domain` in each project's dumped JSON file
14:12:00<fmixolydian>they're mostly small and could be dumped with a simple wget dump - however, for dynamic projects you first have to make a request to "wake it up", kinda like vercel, wait ~10s, and then try to dump the contents with wget.
14:12:08<@arkiver>are user IDs sequential?
14:12:11<fmixolydian>yea
14:12:20<@arkiver>(i did not look into the site yet, just collecting info for when i do)
14:12:23<@arkiver>alright that is nice
14:12:26<fmixolydian>however, i've noticed large gaps in userids
14:12:33<@arkiver>and can all other item types be found through a user?
14:13:04<fmixolydian>rpglitch (the codename i gave to my dumping scripts) only looks for projects for each user
14:13:08<fmixolydian>but afaik yes
14:14:59<fmixolydian>(also, im not ratelimited anymore, but i'm gonna stop dumping glitch userids to not get banned)
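[A minimal sketch of the enumeration approach described above, not the actual rpglitch scripts. Only api.glitch.com itself is mentioned in the channel; the endpoint path, the response shape, and the hosting-domain suffix are assumptions and may need adjusting. Sequential user IDs with large gaps mean most requests will 404, so non-200 responses are simply skipped, and the loop sleeps between requests to stay under the rate limit.]

    import time
    import requests

    API = 'https://api.glitch.com'        # host from the chat; paths below are assumptions
    SLEEP_BETWEEN_REQUESTS = 1.0          # stay well under the rate limit

    def dump_user_projects(user_id):
        """Return the project list for one user ID, or [] on 404 / rate limit."""
        r = requests.get(f'{API}/users/{user_id}', timeout=30)  # hypothetical path
        if r.status_code != 200:
            return []
        return r.json().get('projects', [])   # assumed response shape

    def wake_and_mirror(domain):
        """Dynamic projects need a wake-up request before mirroring, per the chat."""
        url = f'https://{domain}.glitch.social/'  # subdomain pattern as given above; adjust if the real suffix differs
        requests.get(url, timeout=30)             # wake the container
        time.sleep(10)                            # give it ~10s to spin up
        # then hand the URL to wget / ArchiveBot for the actual grab

    for uid in range(1, 100):                     # tiny test range, not all ~80M IDs
        for project in dump_user_projects(uid):
            print(uid, project.get('domain'))
        time.sleep(SLEEP_BETWEEN_REQUESTS)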
14:23:37FiTheArchiver joins
14:24:05<fmixolydian>hello
14:25:04FiTheArchiver quits [Remote host closed the connection]
14:31:49<Dango360>seems that glitch project contents are sent via websockets (wss://api.glitch.com/{projectId}/ot?authorization={authIdThatIsGeneratedPerBrowserSession})
14:38:57ericgallager joins
14:43:50<@arkiver>oof i hope not
14:46:10fmixolydian quits [Client Quit]
14:48:07fmixolydian joins
14:49:16<Dango360>you can view the websocket messages in inspect element, network tab https://glitch.com/edit/#!/archiveteam-tracker-tgbot
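[A hedged sketch for peeking at that "ot" websocket. The URL shape (wss://api.glitch.com/{projectId}/ot?authorization=...) is taken from the chat; the project UUID and the session token are placeholders that would have to come from a browser session, and the message format is not documented here, so this just prints whatever frames arrive.]

    import asyncio
    import websockets  # pip install websockets

    PROJECT_ID = '00000000-0000-4000-8000-000000000000'  # hypothetical UUID
    AUTH = 'paste-a-browser-session-token-here'

    async def peek():
        url = f'wss://api.glitch.com/{PROJECT_ID}/ot?authorization={AUTH}'
        async with websockets.connect(url) as ws:
            for _ in range(5):          # just look at the first few frames
                print(await ws.recv())

    asyncio.run(peek())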
14:55:42ahm258760 quits [Quit: The Lounge - https://thelounge.chat]
14:56:30<fmixolydian>Dango360: you mean ot and logs?
14:56:53<Dango360>fmixolydian: ot
14:57:02<fmixolydian>yea
14:58:27<fmixolydian>i cant connect to the project
14:58:36<Dango360>try a different project
14:58:39<Dango360>it might be broken
14:59:01<fmixolydian>static projects work fine
14:59:08<fmixolydian>lemme try another dynamic one
14:59:52ahm258760 joins
15:03:42<fmixolydian>Dango360: my own dynamic project doesnt work either
15:15:55Mateon2 joins
15:18:11Mateon1 quits [Ping timeout: 260 seconds]
15:18:11Mateon2 is now known as Mateon1
15:28:29Mateon2 joins
15:29:09ericgallager quits [Client Quit]
15:31:01Mateon1 quits [Ping timeout: 260 seconds]
15:31:01Mateon2 is now known as Mateon1
15:41:53grill (grill) joins
15:50:14fmixolydian quits [Client Quit]
15:55:30ericgallager joins
16:05:11janos777 joins
16:19:31Island joins
16:27:21Naruyoko5 joins
16:30:45BornOn420_ quits [Remote host closed the connection]
16:31:06Naruyoko quits [Ping timeout: 260 seconds]
16:31:25BornOn420 (BornOn420) joins
17:12:55BornOn420 quits [Remote host closed the connection]
17:13:33BornOn420 (BornOn420) joins
17:15:39<h2ibot>HadeanEon edited Deaths in 2025 (+862, BOT - Updating page: {{saved}} (130),…): https://wiki.archiveteam.org/?diff=55796&oldid=55778
17:15:40<h2ibot>HadeanEon edited Deaths in 2025/list (+61, BOT - Updating list): https://wiki.archiveteam.org/?diff=55797&oldid=55779
17:20:21JayEmbee quits [Quit: WeeChat 4.1.1]
17:53:26grill quits [Ping timeout: 276 seconds]
17:59:00fmixolydian joins
18:02:03fmixolydian quits [Client Quit]
19:18:15Wohlstand (Wohlstand) joins
19:51:23janos777 quits [Read error: Connection reset by peer]
21:05:02etnguyen03 (etnguyen03) joins
21:30:03Naruyoko joins
21:33:16Naruyoko5 quits [Ping timeout: 260 seconds]
21:36:35Naruyoko5 joins
21:40:56Naruyoko quits [Ping timeout: 276 seconds]
21:42:18Naruyoko joins
21:44:56Naruyoko5 quits [Ping timeout: 260 seconds]
22:47:28Dada quits [Remote host closed the connection]
23:10:55etnguyen03 quits [Client Quit]
23:32:24etnguyen03 (etnguyen03) joins
23:42:32ATinySpaceMarine quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]