00:00:06ducky_ (ducky) joins
00:01:45ducky quits [Ping timeout: 268 seconds]
00:01:45ducky_ is now known as ducky
00:03:35Shard6 (Shard) joins
00:04:55Shard quits [Ping timeout: 268 seconds]
00:04:56Shard6 is now known as Shard
00:13:29retrograde quits [Remote host closed the connection]
00:13:50retrograde (retrograde) joins
00:27:45Arcorann (Arcorann) joins
00:28:54etnguyen03 (etnguyen03) joins
00:29:28Webuser022567 quits [Client Quit]
00:48:17<pabs>nothing mentioned here https://wiki.archiveteam.org/index.php/Bluesky /cc Shard
00:56:20hyperreal (hyperreal) joins
01:35:29<h2ibot>Cooljeanius edited Archiwum Allegro (+16, /* Other archives */ use URL template): https://wiki.archiveteam.org/?diff=61091&oldid=60975
01:38:14polypept1 (polypeptide) joins
01:39:03etnguyen03 quits [Client Quit]
01:41:14polypeptide quits [Ping timeout: 260 seconds]
01:48:58cyanbox quits [Read error: Connection reset by peer]
02:10:11cyanbox joins
02:18:14<kline><klea> I wonder, is there any interest in archiving rsync:// urls?
02:18:48<kline>this would be good for source.org where i have a big filesystem copy, but it would need to support deltas as well as it's a constantly changing tree
02:19:37<@JAA>But are there large files where only small portions change, or do they just add/remove files?
02:19:45<@JAA>You'd only need delta for the former.
02:20:02<@JAA>Well, 'need'
02:21:10<kline>I'm not sure if there are changes, but there seem to be non-standard files (by which I mean symlinks etc.) when I saw the logs scrolling by. I'm not sure if these symlinks change
02:21:44etnguyen03 (etnguyen03) joins
02:22:13<@JAA>I don't think that'd be an issue (although figuring out how to represent them in WARC would be interesting).
02:22:22<kline>i think it would be reasonable to expect that this is something that an rsync archive _could_ cover in any case, or else you'd have limitations on what rsync is specifically good at
02:23:17<kline>i wonder if it could be modeled like http patch/range requests
02:23:35<@JAA>Well, it's also about the storage though. You'd need to store the binary delta and then also provide tooling that can combine it with the previous record(s) to produce the correct result again. Which would be very messy.
02:24:27<@JAA>If the vast majority of things being archived this way is of the 'public files that rarely change after first being added' category, that delta mechanism is mostly just unnecessary complexity.
02:24:37<kline>fair enough
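(A dry run could answer JAA's question about whether files change in place or are only added/removed. A minimal sketch, assuming rsync's standard `--itemize-changes` output format; the module URL is a placeholder, not a real source.org endpoint:)

```python
import subprocess

def classify_changes(itemized_lines):
    """Split `rsync --itemize-changes` output into newly added files vs.
    files changed in place. New files show '>f+++++++++'; changed ones
    show flags like '>f.st......' (size/time differ)."""
    new, changed = [], []
    for line in itemized_lines:
        if not line.startswith(">f"):
            continue  # ignore directories, symlinks, deletions, etc.
        flags, path = line.split(None, 1)
        (new if "+" in flags else changed).append(path)
    return new, changed

def dry_run(module_url, dest="."):
    """Itemized dry run against an rsync module; no data is transferred.
    'module_url' (e.g. 'rsync://example.org/pub/') is hypothetical."""
    out = subprocess.run(
        ["rsync", "-rtn", "--itemize-changes", module_url, dest],
        capture_output=True, text=True, check=True)
    return out.stdout.splitlines()
```

(If `changed` stays empty across runs while `new` grows, plain re-crawls cover the tree and no delta mechanism would be needed.)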
02:35:44SootBector quits [Remote host closed the connection]
02:36:11rohvani quits [Quit: Ping timeout (120 seconds)]
02:36:26rohvani joins
02:36:36<h2ibot>PaulWise edited Obstacles (+54, add trap_bots poisoner): https://wiki.archiveteam.org/?diff=61092&oldid=61084
02:36:47<pabs>arkiver: another poisoner https://maurycyz.com/projects/trap_bots/ - known instances are under https://maurycyz.com/babble/ and https://v6.maurycyz.com/babble/
02:36:55SootBector (SootBector) joins
02:42:39<triplecamera|m>"<@JAA> We have a new wiki user who chose the username 'User'..." <- It might be a good idea to ask him to pick a new username on his talk page
02:43:25<triplecamera|m>Usernames like "User" are typically against the username policy on large wikis
02:45:47grill quits [Ping timeout: 268 seconds]
02:46:10grill (grill) joins
03:02:56SootBector quits [Remote host closed the connection]
03:07:27sepro quits [Ping timeout: 268 seconds]
03:09:21SootBector (SootBector) joins
03:15:55<hyperreal>Hi all. I'm kind of fuzzy about how all this works. I'd love to help out. What is an rsync target exactly, and how would I help the effort by setting one up on my infra? https://wiki.archiveteam.org/index.php/Dev/Staging
03:17:16<hyperreal>I know what rsync is, but this seems like it's part of some kind of pipeline. I'm not sure what the pipeline consists of.
03:24:01Mateon1 quits [Ping timeout: 268 seconds]
03:24:39Mateon1 joins
03:34:11<pabs>hyperreal: the pipeline is website -> warrior -> rsync target -> archive.org. usually targets are only run by core AT folks. anyone can contribute a warrior though
03:34:12<pabs>https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
03:34:14<TheTechRobo>rsync target is where downloaded data is uploaded for packing into big WARCs to the Internet Archive
03:34:22<pabs>also see https://wiki.archiveteam.org/index.php/Archiveteam:Acronyms
03:34:34<TheTechRobo>haha, pabs beat me to it :P
03:34:52<hyperreal>Ah okay, thanks! Yeah I'm already running a warrior
03:36:54<hyperreal>On the Main_Page of the wiki, under Warrior-based projects, it shows a Telegram "for Warriors that don't have the multiple default projects update". What is the multiple default projects update?
03:39:52<TheTechRobo>A few weeks ago, there was an update to the Warrior code that allows a weighted randomness for the default project. Before the update, only one "Archiveteam's Choice" could exist; now multiple can exist with different percentages of Warriors running it
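(The weighted pick TheTechRobo describes can be sketched in a few lines; the names and percentage mapping here are illustrative, not the actual seesaw-kit code:)

```python
import random

def pick_default_project(percentages):
    """Choose a default project with probability proportional to its
    configured percentage, e.g. {'projA': 70, 'projB': 30}."""
    names = list(percentages)
    return random.choices(names, weights=[percentages[n] for n in names])[0]
```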
03:40:18chrismeller3 quits [Quit: chrismeller3]
03:40:55chrismeller3 (chrismeller) joins
03:43:46Nekroschizofrenetyk joins
03:43:56gosc joins
03:44:41<gosc>got a list of 5200 Brightcove Player urls, which channel do I send it to
03:45:37<hyperreal>TheTechRobo: I see. Would that be this? https://github.com/ArchiveTeam/warrior4-vm
03:58:07<TheTechRobo>hyperreal: -ish, the VM is really a wrapper around https://github.com/ArchiveTeam/warrior-dockerfile/ which is really a wrapper around https://github.com/ArchiveTeam/seesaw-kit/
03:58:29<TheTechRobo>(relevant commit: https://github.com/ArchiveTeam/seesaw-kit/commit/700fadd7d331ec6cbc9266d40ddca8c3fe9701bf)
04:00:29BearFortress quits [Ping timeout: 268 seconds]
04:00:44<hyperreal>TheTechRobo: I see
04:04:48n9nes quits [Ping timeout: 268 seconds]
04:04:51n9nes joins
04:21:50<h2ibot>PaulWise edited Ukraine (+207, add archive.org.ua): https://wiki.archiveteam.org/?diff=61093&oldid=60747
04:22:50<h2ibot>PaulWise edited Ukraine (+0, formatting typo): https://wiki.archiveteam.org/?diff=61094&oldid=61093
04:23:19etnguyen03 quits [Remote host closed the connection]
04:25:20Island joins
04:40:26sepro (sepro) joins
05:00:45nexussfan quits [Quit: Konversation terminated!]
05:01:40<pokechu22>gosc: is there anything about those that wouldn't work in archivebot?
05:04:22michaelblob764 quits [Quit: yoop]
05:07:13michaelblob7641 joins
05:15:45<pabs>arkiver: https://lists.opensuse.org/ uses an unknown JS-based PoW challenge thing
05:22:12<gosc>pokechu22, idk, I have no idea if it could; it works off of m3u8 files
05:22:35<pokechu22>What videos are they?
05:25:21<gosc>all of them come from Aniplex, extracted from here https://www.aniplex.co.jp/js/player/aniplex.videoplayer.js
05:25:50<gosc>most of them are promotional trailers, there's some radio stuff (no visuals), and some misc. videos
05:29:58<h2ibot>Exorcism edited Mailman/2 (-16): https://wiki.archiveteam.org/?diff=61095&oldid=61090
05:36:47<pokechu22>Do the m3u8 URLs/the video segments in them seem to be signed? I've got tooling to parse them from WARCs
05:44:59<gosc>well the urls start with "?fastly_token="
05:45:07<gosc>sample https://players.brightcove.net/4929511769001/SkbLowH7g_default/index.html?videoId=5086049326001
05:45:55Nekroschizofrenetyk quits [Client Quit]
05:55:29<pokechu22>Yeah, those will probably expire, though it's not immediately obvious how long they last
05:56:04<h2ibot>Exorcism edited Mailman/2 (+11, /* Status */): https://wiki.archiveteam.org/?diff=61096&oldid=61095
05:57:03<pokechu22>I could still try running them in AB I guess, but they'll probably expire before I get to them
05:59:21<gosc>ah rip
06:00:03<gosc>I'll probably also try to download the videos using yt-dlp and upload to ia just in case too
06:00:17<gosc>oh and also pokechu22, not all of the links work
06:00:33<gosc>some of the videos aren't publicly viewable I think? they just give an error
06:00:48<gosc>so slightly less than 5,000 videos
06:02:26<pokechu22>oh, no, AB won't work at all since that requires https://edge.api.brightcove.com/playback/v1/accounts/4929511769001/videos/5086049326001 and that requires a policy-key-raw header
06:03:05<h2ibot>Exorcism edited Mailman/2 (-65, /* Not yet archived */): https://wiki.archiveteam.org/?diff=61097&oldid=61096
06:40:16gosc quits [Client Quit]
06:42:17Island quits [Read error: Connection reset by peer]
07:13:35<@arkiver>pabs: thank you thank you
07:14:06<nicolas17>kline: is the source.org rsync data also available via HTTP(S)?
07:15:13<nicolas17>if so, we could use rsync to efficiently know when the tree changes, and then archive the changes alone via HTTP
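(The rsync-to-detect, HTTP-to-archive idea could look roughly like this; the HTTPS base URL and the path mapping are assumptions, since it isn't yet known how, or whether, source.org exposes the same tree over HTTP:)

```python
from urllib.parse import quote

def changed_paths_to_urls(paths, http_base="https://example.org/pub/"):
    """Map rsync-relative paths (e.g. from an itemized dry run) to the
    corresponding HTTPS URLs for archiving. 'http_base' is a placeholder."""
    return [http_base + quote(p) for p in paths]
```

(The resulting URL list could then go through the usual HTTP tooling, with no rsync-specific WARC records needed.)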
07:15:18Nekroschizofrenetyk joins
07:46:09Webuser579225 joins
07:46:48Webuser579225 quits [Client Quit]
07:50:39<Nekroschizofrenetyk>how does AB deal with Yandex (disk.yandex.ru specifically) captcha challenge?
07:51:19<Nekroschizofrenetyk>I tried saving one url via SPN and it saved a 302 to captcha
07:51:36<nicolas17>AB doesn't deal with any captcha challenges
07:52:17<nicolas17>it can present certain user agents, which may bypass challenges or blocks that identifying as "archivebot" would trigger, but if a regular browser is getting a captcha, AB probably can't deal with it
07:52:17<Nekroschizofrenetyk>is it likely that at least some pipelines have not been marked as bot IPs and could be used?
07:52:43<nicolas17>ah yes, if there's IP blocks then trying another pipeline can work
07:53:05<Nekroschizofrenetyk>I'm not getting it, but a page I tried to save was redirected to a captcha challenge on IA
07:53:09<Nekroschizofrenetyk>and the challenge was saved
07:59:00<Nekroschizofrenetyk>why I'm asking: while browsing through a CDX for *.narod.ru urls, I came across urls like this one: http://narod.ru:80/disk/5776712000/shabchr.djvu.html which, when opened now, redirects to https://disk.yandex.ru/d/d4vSMS7GTyKVi
07:59:32<Nekroschizofrenetyk>the slug is not required, too - http://narod.ru/disk/5776712000/ does just fine
08:01:34<Nekroschizofrenetyk>*I have an idea - I'll check in fart viewer
08:03:09<Nekroschizofrenetyk>https://archive.fart.website/archivebot/viewer/domain/disk.yandex.ru - I don't think that's all that is available
08:04:03<Nekroschizofrenetyk>(which tbh isn't surprising, given that the urls I checked at IA have not been saved... come to think of it, I'm slow)
08:05:03<Nekroschizofrenetyk>urls to the narod.ru/disk files appear sequential, don't they?
08:08:16<Nekroschizofrenetyk>maybe, then, it would be possible to: 1) archive the urls from the CDX that are still not behind a login wall 2) if some more rules of old narod.ru/disk url building are established, brute-force find any still-available link that has not been archived already?
08:09:56<Nekroschizofrenetyk>ad 2) - normally it would be 10 billion possible urls but indeed, from the structure of the urls, it seems like we could narrow it down by a lot
08:11:38<Nekroschizofrenetyk>however, given that SPN returns a captcha challenge (but my normal browsing does not), how likely is it that AB will utterly fail? also - to save the files themselves (not just the sites), do we just need to feed AB the urls and it will save the redirects along with all the content on a page?
08:12:03<Nekroschizofrenetyk>or do we have to first find the redirects and then run them?
08:28:49danwellby quits [Read error: Connection reset by peer]
08:29:00danwellby-1 joins
08:29:22danwellby-1 is now known as danwellby
09:21:46<cruller>The previous AB jobs failed to save the files themselves.
09:22:30<cruller>While a file itself can be retrieved with a simple GET request, obtaining its URL requires a somewhat complex POST request. Therefore, saving the files themselves would likely require at least a custom script?
09:24:16<Nekroschizofrenetyk>oh
09:25:38dada joins
09:26:06BearFortress joins
09:26:20<cruller>Also, if the restriction is only IP blocking, wouldn't #// be more appropriate for saving just the pages?
09:27:34<cruller>Personally, I don't think that's very meaningful, though.
09:35:05<Nekroschizofrenetyk>seems right re: #//
09:36:09<Nekroschizofrenetyk>if it cannot grab the files themselves, then agree, not that terribly meaningful (though still good to have, if the urls are pointed at somewhere online)
09:50:04<klea>cruller: How does the POST work: is it different per file, or is it a POST to the same location? (i.e., if saved properly, would it be playable properly on WBM or not)
09:51:04<klea>(re doing it on #//) Depends on how many disk.yandex.ru links we have to archive at once, since #// is able to DDoS sites.
09:51:24<Nekroschizofrenetyk>oh, I wouldn't be afraid of DDoSing yandex
09:51:35<Nekroschizofrenetyk>I guess
09:54:35<Nekroschizofrenetyk>there are 11 digits but they seem to have a structure
09:55:32<Nekroschizofrenetyk>at least for a vast majority of the urls, the ninth and tenth ones are 0s
09:55:45<Nekroschizofrenetyk>11th is either 0 or 1
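(Under that observed pattern — 11 digits with the 9th and 10th fixed to 0 and the 11th being 0 or 1 — the search space drops from 10^11 to 2×10^8 candidates. A sketch of the enumeration; the pattern itself is a hypothesis from a sample of urls, not a confirmed rule:)

```python
def candidate_disk_urls(prefix_digits=8):
    """Yield narod.ru/disk URLs for every ID matching the shape observed
    in the CDX sample: <8 free digits> + '00' + ('0' or '1').
    Hypothetical pattern, to be validated against more known-good IDs."""
    for head in range(10 ** prefix_digits):
        for tail in "01":
            yield f"http://narod.ru/disk/{head:0{prefix_digits}d}00{tail}/"
```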
10:24:00<cruller>klea: The POST target is always https://disk.yandex.ru/public/api/download-url AFAIK. I have no idea about replayability.
10:24:01<cruller>Downloading a directory also sends a POST to https://disk.yandex.ru/public/api/get-dir-size , but this may not be necessary.
10:27:18<klea>Annoying, that'd mean it won't replay well.
10:28:09<cruller>I also found https://disk.yandex.ru/public/api/album-download-url . An example of an album is https://disk.yandex.ru/a/14XUtEbH3T7vh3/5aa00f2b0c1ee4b63fbaca78
10:28:17<cruller>Further investigation will likely be required.
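(For reference, a sketch of what that POST might look like, built with urllib but not sent. The endpoint is the one cruller names above; the 'hash' field name and JSON content type are guesses that would need to be verified against real browser traffic:)

```python
import json
from urllib.request import Request

DOWNLOAD_URL_API = "https://disk.yandex.ru/public/api/download-url"

def build_download_url_request(public_key):
    """Construct (but do not send) the POST for a public file's download
    URL. The request body shape here is hypothetical."""
    body = json.dumps({"hash": public_key}).encode()
    return Request(DOWNLOAD_URL_API, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")
```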
10:44:09<cruller>https://web.archive.org/cdx/search/cdx?url=https://disk.yandex.ru/d/* There is clearly some kind of regularity here as well.
10:51:31lennier2_ joins
10:51:35Dango3607 (Dango360) joins
10:51:44cyanbox_ joins
10:51:47Starchives__ (Starchives) joins
10:51:50benjins3_ joins
10:52:03khaoohs_ joins
10:52:15ummmSokar joins
10:52:16scotrod2 quits [Quit: Ping timeout (120 seconds)]
10:52:16notSokar quits [Read error: Connection reset by peer]
10:52:16fuzzy80211 quits [Read error: Connection reset by peer]
10:52:29scotrod2 joins
10:52:34flotwig quits [Excess Flood]
10:52:36flotwig joins
10:54:53lennier2 quits [Ping timeout: 268 seconds]
10:54:54Dango360 quits [Ping timeout: 268 seconds]
10:54:54Starchives_ quits [Ping timeout: 268 seconds]
10:54:54Dango3607 is now known as Dango360
10:55:30cyanbox quits [Ping timeout: 268 seconds]
10:55:31bladem quits [Ping timeout: 268 seconds]
10:55:31benjins3 quits [Ping timeout: 268 seconds]
10:55:31khaoohs quits [Ping timeout: 268 seconds]
10:56:43bladem (bladem) joins
11:00:01Bleo1826007227196234552220110 quits [Quit: The Lounge - https://thelounge.chat]
11:02:45Bleo1826007227196234552220110 joins
11:25:56Nekroschizofrenetyk quits [Quit: Ooops, wrong browser tab.]
12:00:30etnguyen03 (etnguyen03) joins
12:07:57bigfren joins
12:08:20<bigfren>Good day!
12:14:19etnguyen03 quits [Client Quit]
12:29:27Wohlstand (Wohlstand) joins
12:58:50Nekroschizofrenetyk joins
13:04:03driib975 quits [Quit: The Lounge - https://thelounge.chat]
13:10:44driib975 (driib) joins
13:11:46Nekroschizofrenetyk quits [Client Quit]
13:11:57Nekroschizofrenetyk joins
13:44:38<katia>Good day!
13:45:41<Nekroschizofrenetyk>Hi!
13:52:27<katia>Hi!
13:52:35<klea>Hi!
14:02:07<bigfren>klea: please take a look at my new pull request, I really hope you like it
14:02:18<bigfren>https://github.com/ArchiveTeam/grab-site/pull/251 - Resume feature, better regex, fix for a buggy no-offsite-links, grabber.sh with resume & openvpn support
14:02:41nexussfan (nexussfan) joins
14:03:34<bigfren>most importantly, it has 3) a sophisticated-yet-easily-modifiable grabber.sh example script that supports: this wonderful resume feature for easily continuing existing incomplete crawls, auto-generating and using the above-mentioned regex when --no-offsite-links is used, and up to three OpenVPN config+auth pairs with auto-reconnect on any connection problems
14:04:27<klea>I think AT relies on re2, also, vpn????
14:07:04<klea>Also, bigfren if it's development related → #archiveteam-dev
14:09:23<bigfren>klea: I looked through the sources of grab-site and this seemed the only place that required a re2
14:10:10<bigfren>a VPN, used in a grabber script, lets you reliably get a website even if it blocks your IP or is blocked in your country
14:12:21<klea>Oh ignore that, I thought re2 was used in AB and you were contributing to wpull2.
14:16:43<bigfren>using this, I have restarted today's huge crawl and so far it seems to be going strong; 12 GB dumped already through the VPN
14:28:10Arcorann quits [Ping timeout: 268 seconds]
14:43:35TheEnbyperor_ quits [Ping timeout: 268 seconds]
14:43:40TheEnbyperor quits [Ping timeout: 268 seconds]
15:06:30TheEnbyperor joins
15:09:09atphoenix_ (atphoenix) joins
15:10:43atphoenix__ quits [Ping timeout: 268 seconds]
15:17:42TheEnbyperor_ (TheEnbyperor) joins
15:26:04bigfren quits [Quit: Ooops, wrong browser tab.]
15:37:13fuzzy80211 (fuzzy80211) joins
15:46:35redoste quits [Remote host closed the connection]
15:47:51redoste (redoste) joins
15:56:12<justauser>klea, pabs, Nekroschizofrenetyk: archive.org.ua was already fully AB'd: https://archive.fart.website/archivebot/viewer/domain/archive.org.ua
15:56:20<klea>Ack.
15:58:52etnguyen03 (etnguyen03) joins
16:01:55<Nekroschizofrenetyk>nice :)
16:10:17Nekroschizofrenetyk quits [Client Quit]
16:11:49Nekroschizofrenetyk joins
16:19:47pabs quits [Ping timeout: 268 seconds]
16:24:55pabs (pabs) joins
16:29:31dabs joins
16:30:27dabs quits [Remote host closed the connection]
16:30:39dabs joins
17:04:19etnguyen03 quits [Client Quit]
17:15:22Webuser291943 joins
17:25:14etnguyen03 (etnguyen03) joins
17:27:52Nekroschizofrenetyk quits [Client Quit]
17:47:00Cornelius705 quits [Quit: Cornelius705]
17:47:50Cornelius705 (Cornelius) joins
18:03:26Cornelius7052 (Cornelius) joins
18:04:00Cornelius705 quits [Ping timeout: 268 seconds]
18:04:00Cornelius7052 is now known as Cornelius705
18:08:56Dango360 quits [Ping timeout: 268 seconds]
18:38:09Dango360 (Dango360) joins
18:42:02<justauser>https://conf.tube/ 's CDN is back.
18:42:49<klea>Time to AB it?
18:44:05Dango360 quits [Ping timeout: 268 seconds]
18:44:28Dango360 (Dango360) joins
18:44:39<justauser>https://transfer.archivete.am/fZ6Xs/conf.tube_sb-conf-tube.b-cdn.net_videos.txt is the list I got when scraping metadata API, but it probably shouldn't be !ao'd directly.
18:44:40<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/fZ6Xs/conf.tube_sb-conf-tube.b-cdn.net_videos.txt
18:45:00<justauser>Because it has more than one file per video, and that would be wasteful.
18:45:03<klea>I've read b-cdn.net a few times when setting up ublock.
18:45:27<justauser>Bunny CDN - pretty good, but costs accordingly and thus unpopular.
18:46:01<justauser>Cloudflare does not do much worse for a fraction of the price, or even for free.
18:46:11<klea>Ah.
18:46:44<klea>I wonder which resolution we should try to archive.
18:56:56Yakov1 (Yakov) joins
18:58:53Yakov quits [Ping timeout: 268 seconds]
18:58:53Yakov1 is now known as Yakov
19:06:17owl quits [Ping timeout: 268 seconds]
19:06:47owl joins
19:07:22wotd joins
19:07:23wotd quits [Remote host closed the connection]
19:10:02etnguyen03 quits [Client Quit]
19:12:31<@JAA>Size estimate?
19:13:51<klea>399 1080p videos, or 568 720p.
19:16:15<pokechu22>How long? Are these hour-long talks or short summaries?
19:17:54<klea>Between ~5 minutes, 50 minutes, 4 hours, and 8 hours.
20:18:42Goofybally quits [Killed (NickServ (GHOST command used by Goofybally9!~Goofyball@167.100.239.73))]
20:18:47Goofybally joins
20:19:08Island joins
20:43:08BitByBit41 (BitByBit) joins
20:44:57BitByBit4 quits [Ping timeout: 268 seconds]
20:44:57BitByBit41 is now known as BitByBit4
20:53:29<@JAA>(Size = total data size)
20:55:11<klea>Wait a sec.
20:55:25<@JAA>Doesn't need to be super accurate.
20:55:31<@JAA>But a few hundred videos doesn't sound huge anyway.
20:56:27<klea>curl -s list | xargs -I: curl -Is : | grep content-length
21:05:19<klea>Gives https://transfer.archivete.am/RiIkx/t:fZ6Xs_cl_filesizes.txt.zst
21:05:55<klea>So ~398.236329921 GB
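(The same total can be reproduced from the header dump without re-fetching anything; a small sketch that parses 'content-length:' lines like the ones klea's curl pipeline printed:)

```python
def total_gb(header_lines):
    """Sum 'content-length: N' header lines, case-insensitively, and
    return the total in decimal gigabytes."""
    total = sum(int(line.split(":", 1)[1])
                for line in header_lines
                if line.lower().startswith("content-length:"))
    return total / 1e9
```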
21:06:27Nekroschizofrenetyk joins
21:11:39Nekroschizofrenetyk quits [Client Quit]
21:25:54Wohlstand quits [Quit: Wohlstand]
21:40:08Cupping1285 quits [Quit: bye]
21:52:01ducky_ (ducky) joins
21:53:24ducky quits [Ping timeout: 268 seconds]
21:53:24ducky_ is now known as ducky
22:11:38ericgallager quits [Quit: This computer has gone to sleep]
22:13:14iseaup (iseaup) joins
22:47:53Webuser560770 joins
22:48:00Webuser560770 quits [Client Quit]
23:05:17etnguyen03 (etnguyen03) joins
23:17:21igloo22225 quits [Ping timeout: 268 seconds]
23:27:25igloo22225 (igloo22225) joins
23:31:54Cupping1285 joins