00:01:18 | | ATinySpaceMarine joins |
00:04:33 | | DogsRNice joins |
00:26:05 | <pabs> | I recently saw some articles about Stack Exchange being obsoleted by AI: SO user engagement crashing, the site dying, etc. They also added Cloudflare captchas for me, I also saw some questions that are not yet in the WBM, and the data exports to IA items were discontinued due to "irresponsible" AI companies. |
00:26:37 | <pabs> | on the data exports: https://web.archive.org/web/20250425133147/https://meta.stackexchange.com/questions/401324/announcing-a-change-to-the-data-dump-process |
00:26:51 | | jspiros quits [Ping timeout: 260 seconds] |
00:27:18 | <pabs> | some posts: https://blog.pragmaticengineer.com/stack-overflow-is-almost-dead/ https://www.techzine.eu/news/devops/127669/stack-overflow-is-dying-is-it-being-replaced-by-ai/ https://devclass.com/2025/05/13/stack-overflow-seeks-rebrand-as-traffic-continues-to-plummet-which-is-bad-news-for-developers/ |
00:27:40 | <pabs> | https://old.reddit.com/r/webdev/comments/116vvpp/saying_goodbye_to_stack_overflow/ https://codeandhack.com/stack-overflow-is-falling-apart/ |
00:28:33 | | jspiros (jspiros) joins |
00:29:36 | <pokechu22> | The cloudflare captchas are weird; they happen when entering from duckduckgo but not when loading the page directly IIRC |
00:31:05 | <pabs> | I get them always, even when pasting a URL into a fresh browser profile |
00:31:35 | <pabs> | recent HN discussion about SO: https://news.ycombinator.com/item?id=43999125 |
00:33:08 | <steering> | yeah, I've been getting captchas for a week or two. |
00:39:11 | <ericgallager> | for anyone with an extra 6.5TB sitting around: https://ddosecrets.com/article/psyclone-media |
01:03:06 | <nicolas17> | I have 59GB free :) |
01:52:24 | <@JAA> | pabs: #stackunderflow |
01:55:12 | <h2ibot> | PaulWise created StackOverflow (+28, add SO redirect): https://wiki.archiveteam.org/?title=StackOverflow |
01:56:01 | <pabs> | woops, a wiki search did not find the page for SO. added a redirect |
01:57:16 | | katocala quits [Ping timeout: 260 seconds] |
01:59:13 | <h2ibot> | PaulWise edited Stack Exchange (+274, add SO dying para): https://wiki.archiveteam.org/?diff=55781&oldid=54103 |
02:23:06 | | dabs joins |
02:26:40 | <pabs> | pokechu22: hmm, AB 7qwz68jnobcw90l4utbhabotd old-fc2web-urls.txt isn't going to finish before the deadline. could you help me craft an ignore for offsite URLs? |
02:27:14 | <pokechu22> | !igd fc2web.com |
02:27:43 | <pabs> | it's more than just that domain IIRC |
02:27:52 | <pokechu22> | Pabs: ^(http|ftp)s?://(?!([^/]*[@.])?fc2web\.com\.?(:\d+)?/)[^/]+/ for just fc2web.com and subdomains |
02:28:31 | <pokechu22> | uh, I'll check what else is in the list |
02:30:20 | <pokechu22> | cat old-fc2web-urls.txt | grep -Fv -e fc2web.com -e easter.ne.jp -e gooside.com -e k-free.net -e muvc.net -e 55street.net -e zero-city.com -e k-server.org -e ojiji.net -e zero-yen.com -e kt.fc2.com -e finito.fc2.com -e pimp.fc2.com -e happy-web.fc2.com -e ktplan.fc2.com |
02:30:26 | <pokechu22> | gives http://wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww http://momosakura21.zero-city http://pinkpink.zero-city http://tomoryoulove.zero-city http://halharuta.zero-city http://michi.zero-city http://largest.zero-city http://riorio.zero-city http://rabbit.zero-city http://taro.zero-city |
02:30:28 | <pokechu22> | http://natumi.zero-city http://lovelovecall.zero-city http://largest.finito-web.com |
02:31:34 | <pokechu22> | so allowing just those domains (well, all of fc2.com is probably easier)... |
02:36:13 | <pokechu22> | ^(http|ftp)s?://(?!([^/]*[@.])?(fc2web\.com|easter\.ne\.jp|gooside\.com|k-free\.net|muvc\.net|55street\.net|zero-city\.com|k-server\.org|ojiji\.net|zero-yen\.com|fc2\.com)\.?(:\d+)?/)[^/]+/ should do it, apart from those broken URLs which I'll run separately |
02:37:57 | <pokechu22> | http://largest.finito-web.com/ might need more investigation actually, as there's other subdomains but that's the only one in that list |
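[Editor's note: the ignore pattern pokechu22 posted can be sanity-checked with a short Python sketch. The regex is copied verbatim from the chat; the sample URLs are illustrative, not from the job's URL list.]

```python
import re

# ArchiveBot ignore pattern from the channel: it matches (and thus ignores)
# any URL whose host is NOT one of the allowed domains or a subdomain of them.
IGNORE = re.compile(
    r'^(http|ftp)s?://(?!([^/]*[@.])?'
    r'(fc2web\.com|easter\.ne\.jp|gooside\.com|k-free\.net|muvc\.net|'
    r'55street\.net|zero-city\.com|k-server\.org|ojiji\.net|zero-yen\.com|'
    r'fc2\.com)\.?(:\d+)?/)[^/]+/'
)

print(bool(IGNORE.match('http://example.com/page')))      # True: offsite, ignored
print(bool(IGNORE.match('http://www.fc2web.com/page')))   # False: allowed domain
print(bool(IGNORE.match('http://kt.fc2.com/site/')))      # False: fc2.com subdomain
```

The negative lookahead requires any hostname prefix to end in `@` or `.`, which is why `fc2web.com.evil.net` would still be ignored rather than treated as an allowed domain.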
02:41:07 | | dabs quits [Client Quit] |
03:44:04 | <pokechu22> | pabs: FYI, I didn't apply that ignore yet |
03:45:15 | <pabs> | maybe just allow all of finito-web.com, easiest option |
03:46:10 | <pokechu22> | eh, I don't think that's super useful since even if we allow it, it'd only recurse on http://largest.finito-web.com/ - finito-web.com will need its own !a < list. But I guess it's pretty cheap to add that too |
03:47:45 | <pokechu22> | ignore added |
03:48:30 | <h2ibot> | JustAnotherArchivist created WarriorBot (+471, Document that this sort of existed): https://wiki.archiveteam.org/?title=WarriorBot |
03:48:31 | <h2ibot> | Kevidryon2 created Glitch (+626, glitch is closing soon): https://wiki.archiveteam.org/?title=Glitch |
03:48:33 | <pokechu22> | hmm, it's only now getting robots.txt/sitemap.xml so it hasn't even recursed one layer deep :| |
03:49:30 | <h2ibot> | Nintendofan885 created National Archives Catalog (+973, create): https://wiki.archiveteam.org/?title=National%20Archives%20Catalog |
03:49:31 | <h2ibot> | Nintendofan885 edited NOAA (+25, no project but #UncleSamsArchive would make…): https://wiki.archiveteam.org/?diff=55785&oldid=54375 |
03:49:32 | <h2ibot> | Nintendofan885 moved Archive Team press releases to Archiveteam:Press releases (move to project namespace): https://wiki.archiveteam.org/?title=Archiveteam%3APress%20releases |
03:49:33 | <h2ibot> | Nintendofan885 moved 2019-03-28 Help archive Google+ to Archiveteam:Press releases/2019-03-28 Help archive Google+ (move to subpage): https://wiki.archiveteam.org/?title=Archiveteam%3APress%20releases/2019-03-28%20Help%20archive%20Google%2B |
03:51:30 | | DogsRNice quits [Read error: Connection reset by peer] |
03:56:32 | <h2ibot> | JustAnotherArchivist created LIHKG (+280, Created page with "{{Infobox project | URL =…): https://wiki.archiveteam.org/?title=LIHKG |
04:10:34 | <h2ibot> | JustAnotherArchivist edited Mildom (+6): https://wiki.archiveteam.org/?diff=55791&oldid=53750 |
04:22:35 | <h2ibot> | JustAnotherArchivist created Posts.cv (+541, Created page with "{{Infobox project | URL =…): https://wiki.archiveteam.org/?title=Posts.cv |
04:23:36 | <h2ibot> | Thezt edited List of websites excluded from the Wayback Machine/Partial exclusions (+43): https://wiki.archiveteam.org/?diff=55793&oldid=55638 |
04:23:37 | <h2ibot> | JustAnotherArchivist edited Deathwatch (-46, Link to [[Glitch]] and [[Posts.cv]]): https://wiki.archiveteam.org/?diff=55794&oldid=55769 |
04:45:39 | <h2ibot> | JustAnotherArchivist created Kinja (+1177, Created page with "{{Infobox project | URL =…): https://wiki.archiveteam.org/?title=Kinja |
05:15:01 | | Wohlstand quits [Ping timeout: 260 seconds] |
05:44:31 | <pokechu22> | I've started a separate archivebot job for finito-web.com, seeded with URLs from duckduckgo+google+IA CDX. The original list has http://www.finito.fc2.com/ stuff though |
06:00:19 | <@arkiver> | JAA: on closing the channels, feel free to just kick me out! |
06:00:54 | <@JAA> | arkiver: Yep, I'll kick everyone when I close them. |
06:13:19 | | Island quits [Read error: Connection reset by peer] |
06:42:03 | <triplecamera|m> | Hi. I'm looking for a file which hasn't been archived by the Wayback Machine. Is there a quick way to search through all archiving sites? |
06:42:22 | | ArchivalEfforts quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
06:42:31 | | ArchivalEfforts joins |
06:42:49 | <triplecamera|m> | The file's URL is <http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-828-operating-system-engineering-fall-2006/labs/lab{1..6}handout.gz>, where 1 has been archived by wbm, but 2 to 6 haven't. |
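[Editor's note: the `{1..6}` in triplecamera's URL is shell brace expansion covering six files. Enumerating the concrete URLs, e.g. to feed each into a Wayback CDX lookup, is a one-liner in Python:]

```python
# Expand the {1..6} range from the chat into the six concrete handout URLs.
BASE = ('http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/'
        '6-828-operating-system-engineering-fall-2006/labs/lab{n}handout.gz')
urls = [BASE.format(n=n) for n in range(1, 7)]
for u in urls:
    print(u)
```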
06:48:54 | <that_lurker> | have you asked if the site maintainers still had them somewhere? |
06:50:52 | <triplecamera|m> | I will ask them if I can't find any copies from the Internet. |
06:52:17 | <that_lurker> | actually these might be the ones you are looking for https://dspace.mit.edu/bitstream/handle/1721.1/92292/6-828-fall-2006/contents/labs/index.htm |
06:53:15 | <@JAA> | https://pdos.csail.mit.edu/6.828/2006/index.html seems to have lab handouts for that lecture in that year. |
06:53:25 | <triplecamera|m> | Wikipedia says that there are many [web archiving initiatives](https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives), but I don't want to search one by one. |
06:54:45 | <triplecamera|m> | that_lurker: You are right. But unfortunately all tar.gz files from there are corrupted. |
06:55:30 | <triplecamera|m> | JAA: Yes, but the handouts for lab 5 & 6 are missing. |
06:59:30 | <@JAA> | Oof, yeah, the files on their DSpace have all high bytes replaced by U+FFFD. :-/ |
07:00:09 | <@JAA> | They start with 1fefbfbd instead of 1f8b. |
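[Editor's note: JAA's observation expressed as a check. A real gzip stream starts with the magic bytes `1f 8b`; if a pipeline decodes the file as text, every high byte (>= 0x80) becomes U+FFFD, which UTF-8-encodes to `ef bf bd`, so the header turns into `1f ef bf bd`.]

```python
# Detect gzip files whose high bytes were mangled into U+FFFD replacement
# characters: 1f 8b becomes 1f ef bf bd after a bad text round-trip.
def looks_valid_gzip(data: bytes) -> bool:
    return data.startswith(b'\x1f\x8b')

def looks_mojibaked_gzip(data: bytes) -> bool:
    return data.startswith(b'\x1f\xef\xbf\xbd')

print(looks_mojibaked_gzip(bytes.fromhex('1fefbfbd00')))  # True: corrupted header
print(looks_valid_gzip(bytes.fromhex('1f8b0800')))        # True: intact gzip
```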
08:09:31 | | Dada joins |
08:50:20 | | janos777 joins |
08:57:50 | | janos777 quits [Ping timeout: 276 seconds] |
09:19:45 | | Megame (Megame) joins |
09:40:35 | | fmixolydian joins |
09:43:17 | | janos777 joins |
09:51:25 | <fmixolydian> | i just sent an email to a member of glitch's staff regarding which archiving method to use |
09:51:52 | <fmixolydian> | i've tried requests with bs4, but that didn't work (since the website is 99% generated with JS) |
09:52:12 | <fmixolydian> | should i just do a wget full website grab? |
09:53:20 | <fmixolydian> | (im kinda new to archiving websites, sorry) |
10:02:55 | <fmixolydian> | welp, i tried that and it doesn't work |
10:03:35 | <fmixolydian> | because wget doesn't execute JavaScript, and it's the included JavaScript that creates the links |
10:06:41 | <fmixolydian> | wait a second, i just noticed they have an api - maybe i could use that |
10:08:06 | | fmixolydian quits [Client Quit] |
10:21:41 | | Megame quits [Ping timeout: 276 seconds] |
10:22:13 | | Webuser309148 joins |
10:23:32 | | Webuser309148 quits [Client Quit] |
10:40:31 | | janos777 quits [Ping timeout: 260 seconds] |
10:50:00 | | fmixolydian joins |
10:56:17 | <fmixolydian> | i'd like to update yall on my glitch.com python dumping script |
10:56:32 | | hexagonwin is now authenticated as hexagonwin |
10:56:45 | <fmixolydian> | i successfully managed to extract ~80 projects' data |
10:56:54 | <fmixolydian> | now i just need to speed it up |
10:58:27 | <fmixolydian> | unfortunately there are millions of projects, i doubt i'm gonna be able to grab all the data in time |
11:00:03 | | Bleo182600722719623455 quits [Quit: The Lounge - https://thelounge.chat] |
11:02:46 | | Bleo182600722719623455 joins |
11:10:05 | | janos777 joins |
11:14:56 | | tzt quits [Ping timeout: 260 seconds] |
11:25:31 | | fmixolydian quits [Client Quit] |
11:29:53 | | ducky quits [Remote host closed the connection] |
11:32:29 | | ducky (ducky) joins |
11:34:01 | | ducky quits [Read error: Connection reset by peer] |
11:49:49 | | ducky (ducky) joins |
12:05:41 | | janos777 quits [Ping timeout: 260 seconds] |
12:37:05 | | tzt (tzt) joins |
12:37:11 | | monoxane quits [Ping timeout: 260 seconds] |
12:43:19 | | ericgallager quits [Quit: Leaving] |
12:46:04 | | ericgallager joins |
13:00:21 | | ericgallager quits [Client Quit] |
13:07:29 | | fmixolydian joins |
13:16:53 | | fmixolydian quits [Client Quit] |
13:35:02 | | fmixolydian joins |
13:36:24 | <fmixolydian> | hello everyone, im doing a manual dump of glitch.com through a few scripts i made |
13:36:56 | <fmixolydian> | a few of the early userids (<20) have ~200 projects each on average |
13:37:13 | <fmixolydian> | however most userids now return 404s |
13:38:04 | <fmixolydian> | for now im only dumping project and user data (the project contents will have to be dumped differently) |
13:45:39 | | corentin joins |
13:45:50 | | corentin is now authenticated as corentin |
13:46:00 | <fmixolydian> | hello |
13:51:09 | | @arkiver quits [Remote host closed the connection] |
13:51:41 | | arkiver (arkiver) joins |
13:51:41 | | @ChanServ sets mode: +o arkiver |
13:52:07 | <fmixolydian> | wb |
13:54:17 | <fmixolydian> | bruh i just got ratelimited from glitch |
14:00:05 | <@arkiver> | fmixolydian: for glitch.com, what would be really helpful is if you can make lists of content that is on there: users, posts, projects, etc. |
14:00:07 | <@arkiver> | or find ways to list all of them for each type |
14:01:52 | | fuzzy8021 quits [Read error: Connection reset by peer] |
14:01:58 | | fuzzy80211 (fuzzy80211) joins |
14:03:22 | | fuzzy8021 (fuzzy80211) joins |
14:03:24 | | fuzzy80211 quits [Read error: Connection reset by peer] |
14:05:16 | | NotGLaDOS quits [Ping timeout: 260 seconds] |
14:05:18 | | terry joins |
14:05:42 | <fmixolydian> | arkiver: there are users, collections, emails, projects, and teams (afaik) |
14:06:22 | <fmixolydian> | here's the source code for my dumping script (if it helps): https://codeberg.org/fmixolydian/rpglitch |
14:08:13 | <fmixolydian> | dumpall.sh is a script that, for every user in a range, calls dumpuser.py (which outputs a list of project uuids), and calls dumpproj.py on each one |
14:09:10 | <fmixolydian> | the userids appear to go up to ~80 million, projects use UUIDv4 |
14:10:49 | <fmixolydian> | also, the main thing to dump would be the subdomains (<domain>.glitch.social), or `.domain` in each project's dumped JSON file |
14:12:00 | <fmixolydian> | they're mostly small and could be grabbed with a simple wget run - however, for dynamic projects you first have to make a request to "wake them up", kinda like Vercel, wait ~10s, and then dump the contents with wget. |
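[Editor's note: a minimal Python sketch of the wake-then-grab flow fmixolydian describes for dynamic projects: one request to wake the container, a ~10 s wait, then a wget mirror. The `.glitch.social` host suffix is taken from the chat above; verify it against a project's `.domain` JSON field before relying on it.]

```python
import subprocess
import time
import urllib.request

def wget_cmd(domain: str) -> list[str]:
    # Build a wget mirror command for the project's subdomain.
    return ['wget', '--mirror', '--page-requisites', '--convert-links',
            f'https://{domain}.glitch.social/']

def wake_and_grab(domain: str) -> None:
    # First request wakes the container; the grab only works afterwards.
    urllib.request.urlopen(f'https://{domain}.glitch.social/', timeout=30)
    time.sleep(10)  # give the project container time to boot
    subprocess.run(wget_cmd(domain), check=True)
```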
14:12:08 | <@arkiver> | are user IDs sequential? |
14:12:11 | <fmixolydian> | yea |
14:12:20 | <@arkiver> | (i did not look into the site yet, just collecting info for when i do) |
14:12:23 | <@arkiver> | alright that is nice |
14:12:26 | <fmixolydian> | however, i've noticed large gaps in userids |
14:12:33 | <@arkiver> | and can all other item types be found through a user? |
14:13:04 | <fmixolydian> | rpglitch (the codename i gave to my dumping scripts) only looks for projects for each user |
14:13:08 | <fmixolydian> | but afaik yes |
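[Editor's note: the ID scheme described above — sequential integer user IDs with large gaps, UUIDv4 project IDs — means a scraper can enumerate users directly but must discover project UUIDs via the API. A small sketch of the distinction:]

```python
import uuid

def is_uuid4(s: str) -> bool:
    # Glitch project IDs are reportedly version-4 UUIDs.
    try:
        return uuid.UUID(s).version == 4
    except ValueError:
        return False

# User IDs are enumerable (many will 404 due to gaps); project IDs are not.
user_ids = range(1, 80_000_000)
print(is_uuid4(str(uuid.uuid4())))  # True
print(is_uuid4('not-a-uuid'))       # False
```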
14:14:59 | <fmixolydian> | (also, im not ratelimited anymore, but i'm gonna stop dumping glitch userids to not get banned) |
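[Editor's note: a hedged sketch of pacing that would help with the rate-limiting mentioned above: exponential backoff on HTTP 429. The delay values are guesses, not documented Glitch limits.]

```python
import time
import urllib.error
import urllib.request

def backoff_delays(base: float = 1.0, retries: int = 5) -> list[float]:
    # Delay schedule: base, 2*base, 4*base, ...
    return [base * 2 ** i for i in range(retries)]

def polite_fetch(url: str) -> bytes:
    # Retry on 429 with growing waits; re-raise any other HTTP error.
    for delay in backoff_delays():
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise
            time.sleep(delay)
    raise RuntimeError('still rate-limited after retries')
```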
14:23:37 | | FiTheArchiver joins |
14:24:05 | <fmixolydian> | hello |
14:25:04 | | FiTheArchiver quits [Remote host closed the connection] |
14:31:49 | <Dango360> | seems that glitch project contents are sent via websockets (wss://api.glitch.com/{projectId}/ot?authorization={authIdThatIsGeneratedPerBrowserSession}) |
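[Editor's note: the OT websocket endpoint Dango360 found, as a URL builder. The authorization value is generated per browser session, so a real client would have to capture one first; this only constructs the URL shape quoted above.]

```python
def ot_url(project_id: str, auth_token: str) -> str:
    # Endpoint shape from the chat; auth_token is a per-session placeholder.
    return f'wss://api.glitch.com/{project_id}/ot?authorization={auth_token}'

print(ot_url('archiveteam-tracker-tgbot', 'SESSION_TOKEN'))
```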
14:38:57 | | ericgallager joins |
14:43:50 | <@arkiver> | oof i hope not |
14:46:10 | | fmixolydian quits [Client Quit] |
14:48:07 | | fmixolydian joins |
14:49:16 | <Dango360> | you can view the websocket messages in inspect element, network tab https://glitch.com/edit/#!/archiveteam-tracker-tgbot |
14:55:42 | | ahm258760 quits [Quit: The Lounge - https://thelounge.chat] |
14:56:30 | <fmixolydian> | Dango360: you mean ot and logs? |
14:56:53 | <Dango360> | fmixolydian: ot |
14:57:02 | <fmixolydian> | yea |
14:58:27 | <fmixolydian> | i cant connect to the project |
14:58:36 | <Dango360> | try a different project |
14:58:39 | <Dango360> | it might be broken |
14:59:01 | <fmixolydian> | static projects work fine |
14:59:08 | <fmixolydian> | lemme try another dynamic one |
14:59:52 | | ahm258760 joins |
15:03:42 | <fmixolydian> | Dango360: my own dynamic project doesnt work either |
15:15:55 | | Mateon2 joins |
15:18:11 | | Mateon1 quits [Ping timeout: 260 seconds] |
15:18:11 | | Mateon2 is now known as Mateon1 |
15:28:29 | | Mateon2 joins |
15:29:09 | | ericgallager quits [Client Quit] |
15:31:01 | | Mateon1 quits [Ping timeout: 260 seconds] |
15:31:01 | | Mateon2 is now known as Mateon1 |
15:41:53 | | grill (grill) joins |
15:50:14 | | fmixolydian quits [Client Quit] |
15:55:30 | | ericgallager joins |
16:05:11 | | janos777 joins |
16:19:31 | | Island joins |
16:27:21 | | Naruyoko5 joins |
16:30:45 | | BornOn420_ quits [Remote host closed the connection] |
16:31:06 | | Naruyoko quits [Ping timeout: 260 seconds] |
16:31:25 | | BornOn420 (BornOn420) joins |
17:12:55 | | BornOn420 quits [Remote host closed the connection] |
17:13:33 | | BornOn420 (BornOn420) joins |
17:15:39 | <h2ibot> | HadeanEon edited Deaths in 2025 (+862, BOT - Updating page: {{saved}} (130),…): https://wiki.archiveteam.org/?diff=55796&oldid=55778 |
17:15:40 | <h2ibot> | HadeanEon edited Deaths in 2025/list (+61, BOT - Updating list): https://wiki.archiveteam.org/?diff=55797&oldid=55779 |
17:20:21 | | JayEmbee quits [Quit: WeeChat 4.1.1] |
17:53:26 | | grill quits [Ping timeout: 276 seconds] |
17:59:00 | | fmixolydian joins |
18:02:03 | | fmixolydian quits [Client Quit] |
19:18:15 | | Wohlstand (Wohlstand) joins |
19:51:23 | | janos777 quits [Read error: Connection reset by peer] |
21:05:02 | | etnguyen03 (etnguyen03) joins |
21:30:03 | | Naruyoko joins |
21:33:16 | | Naruyoko5 quits [Ping timeout: 260 seconds] |
21:36:35 | | Naruyoko5 joins |
21:40:56 | | Naruyoko quits [Ping timeout: 276 seconds] |
21:42:18 | | Naruyoko joins |
21:44:56 | | Naruyoko5 quits [Ping timeout: 260 seconds] |
22:09:51 | | nicolas17 is now authenticated as nicolas17 |
22:47:28 | | Dada quits [Remote host closed the connection] |
23:10:55 | | etnguyen03 quits [Client Quit] |
23:32:24 | | etnguyen03 (etnguyen03) joins |
23:42:32 | | ATinySpaceMarine quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |