00:00:12 | <lemuria> | but since i already have uploads on the internet archive for some other completely unrelated website where i made a newspaper with references to that streamer (as in, searching their name in text contents would get you results), and that there's no evidence she hasn't seen them yet... maybe she wouldn't notice?
00:00:26 | <nicolas17> | why is she removing videos?
00:00:26 | <lemuria> | the analysis paralysis is real, i just wish i had the necessary balls
00:00:39 | <lemuria> | channel reorganization or whatever, she didn't provide an explicit reason
00:02:39 | <lemuria> | "i've been cleaning up my og youtube" - partial quote from her
00:04:54 | | CuppyMan joins |
00:05:17 | <@OrIdow6> | ... personally I think it's more justifiable if it really is just that |
00:06:15 | | CuppyMan quits [Client Quit] |
00:06:21 | | Cuphead2527480 (Cuphead2527480) joins |
00:23:58 | <Guest> | might be how people remove their old videos for being "cringe" |
00:26:59 | <lemuria> | We must enforce "once it's on the internet, it's there forever" after all |
00:27:51 | <wickedplayer494> | ^^^ |
00:28:00 | <lemuria> | welp, time to press enter on the command. i hope i don't wake up to a ban from the streamer's community |
00:28:12 | <lemuria> | information wants to be free! |
00:29:31 | <lemuria> | and then it ends up in Community Texts and not Community Movies.. how does IA's collection system work anyway |
00:31:30 | <lemuria> | once my 35 Mbit/s internet upload speed catches up it should be at https://archive.org/details/nalani-proctor-baking-melodies |
00:31:38 | <@JAA> | You have to specify the collection at item creation time. |
00:32:15 | <@JAA> | opensource (= 'Community Texts') is the default. |
00:32:26 | <lemuria> | --metadata="collection:opensource_movies"?? |
00:32:34 | <@JAA> | If it's still on the first file upload, you could abort it and restart with ... that, yes. |
00:32:49 | <lemuria> | there's no going back now, description and info.json uploaded |
00:32:55 | <@JAA> | Ah, oh well |
00:39:34 | | etnguyen03 (etnguyen03) joins |
00:44:27 | <lemuria> | good night, here's to the upload not crashing and burning while i eep |
01:15:09 | <nicolas17> | lemuria: the mediatype and collection can only be changed by admins, you can email info@archive.org to get that fixed |
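The collection/mediatype discussion above can be sketched concretely: on archive.org's S3-like upload API, both values travel as `x-archive-meta-*` headers on the PUT that creates the item, which is why they are fixed at creation time and only admins can change them later. The header names are real; everything else here is an illustration, not a full upload client.

```python
# Sketch: collection and mediatype are set via x-archive-meta-* headers
# on the item-creating upload request (archive.org S3-like API).

def ia_creation_headers(collection: str, mediatype: str) -> dict:
    """Headers that fix an item's collection and mediatype at creation time."""
    return {
        "x-archive-meta-collection": collection,  # e.g. 'opensource' is the default ('Community Texts')
        "x-archive-meta-mediatype": mediatype,
    }

# What lemuria would have needed for 'Community Video' instead of the default:
headers = ia_creation_headers("opensource_movies", "movies")
# These headers only take effect on the request that creates the item;
# after that, changing collection/mediatype requires emailing info@archive.org.
```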
01:24:39 | <Ryz> | hexagonwin|m, Tistory websites generally have calendars that go on forever, which is why there's an ignore that caps them so they don't reach further than 2050 or, I think, further back than 2000
01:39:44 | | cyanbox joins |
01:57:05 | <hexagonwin|m> | @Ryz:hackint.org ah you mean like the calendar here on https://coconx.tistory.com/ ? But tistory has sitemap.xml which links to all post, so there should be no need to recursively download everything i believe.. |
02:15:37 | | aninternettroll quits [Ping timeout: 258 seconds] |
02:26:22 | | aninternettroll (aninternettroll) joins |
02:32:37 | | SootBector quits [Remote host closed the connection] |
02:33:48 | | SootBector (SootBector) joins |
02:33:49 | | Cuphead2527480 quits [Client Quit] |
02:46:31 | | etnguyen03 quits [Remote host closed the connection] |
02:57:22 | <Ryz> | hexagonwin|m, as long as it's set up so that calendars don't get crawled, because otherwise AB would've gotten up to https://coconx.tistory.com/archive/999912
03:01:09 | <hexagonwin|m> | I think it should be sufficient just downloading everything in https://coconx.tistory.com/sitemap.xml (and their page requisites), but the attachment files are also links so not sure about that. Maybe only download the links inside the article div?
03:02:15 | <hexagonwin|m> | btw, do we also get all the pages for article list? like https://coconx.tistory.com/category/%EC%82%AC%EB%8A%94%20%EC%9D%B4%EC%95%BC%EA%B8%B0 to https://coconx.tistory.com/category/%EC%82%AC%EB%8A%94%20%EC%9D%B4%EC%95%BC%EA%B8%B0?page=31 |
03:02:33 | <hexagonwin|m> | would be great to have but actual blog posts should be prioritized i guess.. |
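hexagonwin|m's sitemap-first approach can be sketched with stdlib-only parsing. The sample XML below is made up for illustration, and note that real Tistory sitemaps aren't guaranteed to list every post (arkiver later points out a post missing from quizbang.tistory.com's sitemap), so this is a starting seed list, not a complete crawl.

```python
# Sketch: pull post URLs out of a Tistory sitemap.xml (sitemaps.org format).
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list:
    """Return every <loc> URL in a sitemap urlset."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", NS)]

# Hypothetical sample; a real sitemap would come from
# https://coconx.tistory.com/sitemap.xml
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://coconx.tistory.com/1</loc></url>
  <url><loc>https://coconx.tistory.com/2</loc></url>
</urlset>"""
```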
03:04:13 | | Island quits [Read error: Connection reset by peer] |
03:22:02 | <hexagonwin|m> | (unrelated to tistory) if it's possible, could someone please share the log for androidfilehost.com on archivebot? (like wpull.log on grab-site) I need the list of URLs it downloaded, so that it can be compared to my first attempt and also extract the total list of FIDs. Thanks. |
03:23:33 | <@JAA> | hexagonwin|m: The log is in the -meta.warc.gz on IA sometime after the job finishes. |
03:28:18 | <nicolas17> | or the cdx's |
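Getting the list of downloaded URLs from the CDX, as nicolas17 suggests, is a one-pass text parse. This assumes the common space-separated CDX layout (` CDX N b a m s k r M S V g`) where the third field is the original URL; the sample lines are fabricated.

```python
# Sketch: extract the archived URL list from a CDX index file.
# Assumes the default field order where field 3 is the original URL.

def urls_from_cdx(lines) -> list:
    urls = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("CDX"):  # skip the header line
            continue
        fields = line.split(" ")
        if len(fields) >= 3:
            urls.append(fields[2])
    return urls

# Fabricated sample resembling ArchiveBot CDX output:
sample = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20250101000000 https://example.com/ text/html 200 ABC - - 123 456 file.warc.gz",
]
```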
03:32:40 | | wyatt8750 joins |
03:32:54 | | wyatt8740 quits [Ping timeout: 260 seconds] |
03:53:18 | | ScarlettStunningSpace quits [Read error: Connection reset by peer] |
04:01:49 | | Karlett (Karlett) joins |
04:10:37 | | APOLLO03 quits [Ping timeout: 258 seconds] |
04:12:29 | | APOLLO03 joins |
04:22:16 | | cyanbox_ joins |
04:25:16 | | Webuser654665 joins |
04:25:20 | | Webuser654665 quits [Client Quit] |
04:25:34 | | cyanbox quits [Ping timeout: 258 seconds] |
04:52:21 | <h2ibot> | Hans5958 edited Warrior projects (+9, Add #Y): https://wiki.archiveteam.org/?diff=57409&oldid=57403 |
04:53:21 | <h2ibot> | Hans5958 edited Warrior projects (-30, Put #Y on hiatus): https://wiki.archiveteam.org/?diff=57410&oldid=57409 |
04:56:22 | <h2ibot> | Hans5958 edited Warrior projects (+20, Put some 2025 projects started on 2025): https://wiki.archiveteam.org/?diff=57411&oldid=57410 |
05:51:59 | | BornOn420 quits [Quit: Textual IRC Client: www.textualapp.com] |
05:52:14 | | BornOn420 (BornOn420) joins |
06:17:59 | | BornOn420 quits [Ping timeout: 260 seconds] |
07:07:52 | <pabs> | hexagonwin|m: I do searching when archiving domains, for subdomains as well as related resources like twitter/github/mediawiki/etc |
07:16:42 | | Radzig2 joins |
07:18:04 | | Radzig quits [Ping timeout: 258 seconds] |
07:18:04 | | Radzig2 is now known as Radzig |
07:23:01 | | b3nzo joins |
07:27:20 | | Webuser754283 joins |
07:30:20 | | Webuser754283 quits [Client Quit] |
07:32:39 | | Commander001 quits [Ping timeout: 260 seconds] |
07:33:00 | | Commander001 joins |
07:44:31 | | Commander001 quits [Ping timeout: 258 seconds] |
07:45:03 | | Commander001 joins |
08:06:52 | | beastbg8_ joins |
08:09:59 | | beastbg8 quits [Ping timeout: 260 seconds] |
08:59:28 | | HP_Archivist (HP_Archivist) joins |
09:16:29 | | Suika quits [Ping timeout: 260 seconds] |
09:18:56 | | Suika joins |
09:32:13 | <@arkiver> | hexagonwin|m: please also make sure next time to put deadlines on https://wiki.archiveteam.org/index.php/Deathwatch |
09:35:18 | | Commander001 quits [Ping timeout: 258 seconds] |
09:36:10 | | Commander001 joins |
10:11:23 | | b3nz0 joins |
10:13:53 | | b3nzo quits [Client Quit] |
10:14:27 | | b3nz0 quits [Remote host closed the connection] |
10:18:47 | | b3nzo joins |
10:19:43 | | b3nz0 joins |
10:38:08 | <b3nzo> | sorry, sent it in #archiveteam. > JAA: whats the best way to pack many warc, meta-warc and cdx files? megawarc |
10:43:37 | <@JAA> | b3nzo: How many is 'many', and why do you want to pack them? |
10:45:56 | <b3nzo> | around 900GB, i want to upload them to the IA |
10:46:15 | <@JAA> | How many files? |
10:58:24 | <b3nzo> | around 90k |
11:00:03 | | Bleo182600722719623455222 quits [Quit: The Lounge - https://thelounge.chat] |
11:02:48 | | Bleo182600722719623455222 joins |
11:08:04 | <@arkiver> | b3nzo: i'd say pack them into 100 GB chunks and upload those |
11:08:08 | <@arkiver> | keep the meta WARCs separate |
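The batching step arkiver describes (90k small WARCs into ~100 GB chunks, each chunk then packed with megawarc) is simple greedy planning logic. This is only the grouping, not the packing itself; file names and sizes are placeholders.

```python
# Sketch: greedily group (path, size) pairs into batches of at most
# max_bytes each, preserving input order. Each batch would then be
# handed to megawarc for packing; meta WARCs are kept out of the input.

def batch_by_size(files, max_bytes=100 * 10**9):
    """files: iterable of (path, size_bytes) tuples. Returns list of path lists."""
    batches, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```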
11:11:35 | <@arkiver> | we'll focus on the blogs that had no activity for 3 years |
11:11:37 | <@arkiver> | or maybe 2 years |
11:19:34 | | Commander001 quits [Ping timeout: 258 seconds] |
11:19:59 | <masterx244|m> | found another useful detail on wplace.... |
11:19:59 | <masterx244|m> | https://github.com/murolem/wplace-archiver |
11:19:59 | <masterx244|m> | That guy noticed that there are ipv6 shenanigans doable to get more ratelimit per host |
11:22:47 | | Commander001 joins |
12:03:03 | | Dada joins |
12:04:30 | <b3nzo> | does grab-site crawl despite specifying --1? not for all but for a few urls, the warc files are huge, some wikipedia articles are over 500GB. and specifically "grab-site --1 https://www.radiofrance.fr/ecouter-musique" is around 7.6GB, and has a bunch of mp3, mp4 from the same host
12:53:48 | | nicolas17_ joins |
12:54:04 | | nicolas17 quits [Ping timeout: 260 seconds] |
13:07:26 | | nicolas17 joins |
13:08:39 | | nicolas17_ quits [Ping timeout: 260 seconds] |
13:37:11 | <@arkiver> | i have also posted the recent update in #archiveteam on opencollective |
13:40:58 | <@arkiver> | masterx244|m: is wplace something that is shutting down? |
13:47:48 | | HackMii quits [Remote host closed the connection] |
13:47:48 | | SootBector quits [Remote host closed the connection] |
13:48:10 | | HackMii (hacktheplanet) joins |
13:48:57 | | SootBector (SootBector) joins |
13:57:03 | | Oddly joins |
14:00:26 | <@arkiver> | let's make a channel for tistory |
14:00:30 | <@arkiver> | any ideas? |
14:01:05 | <@arkiver> | hexagonwin|m: not all posts on a tistory blog are in the sitemap, for example i don't see https://quizbang.tistory.com/3072 in the sitemap https://quizbang.tistory.com/sitemap.xml |
14:05:17 | <@imer> | just history is too simple I think, something with it is -> 'tis history? |
14:05:17 | <@imer> | "Behind The Name Tistory is a compound word consisting of T, the initial letter of Tattertools, and History. " #tatteredhistory? #tatteredstory? |
14:05:31 | | Oddly quits [Client Quit] |
14:07:08 | <@imer> | yeah that's all my creative juice used up I think |
14:10:08 | <@arkiver> | #tatteredstory is nice, nicely incorporates that story
14:10:11 | <@arkiver> | let's do that one |
14:10:21 | <@arkiver> | hexagonwin|m: FYI ^ |
14:10:31 | <@arkiver> | woah no JAA yet |
14:20:01 | <@arkiver> | hexagonwin|m: i see at the bottom of https://p.z80.kr/tistory_archiveteam.html you write about egloos. Archive Team did do a project for it as well, got 11 TB https://archive.org/details/archiveteam_egloos |
14:20:37 | <@arkiver> | but from the wiki page it looks like the site was only "partially saved" |
14:55:59 | | petrichor quits [Ping timeout: 260 seconds] |
14:58:42 | | petrichor (petrichor) joins |
15:08:13 | | Island joins |
15:26:02 | <h2ibot> | Arkiver uploaded File:Tistory-icon.png: https://wiki.archiveteam.org/?title=File%3ATistory-icon.png |
15:26:37 | | Karlett quits [Remote host closed the connection] |
15:33:17 | | cyanbox_ quits [Read error: Connection reset by peer] |
16:31:07 | <Hans5958> | <arkiver> "let's make a channel for tistory" <- There should be a one stop channel for anything blog services |
16:31:48 | <Hans5958> | I'd say #webroasting can be used, though it is from web hosting |
16:35:55 | <Hans5958> | Let's see if I would consistently call for using #webroasting for each web hosting that's going down in the future |
16:36:06 | <Hans5958> | I think it's been twice with this |
16:36:06 | <@arkiver> | Hans5958: i'm not sure about a one stop channel |
16:36:10 | <@arkiver> | the project can be very different |
16:36:36 | <@arkiver> | there could sure be a channel for more general coordination, but if we have a project for a certain service on a deadline, a dedicated temporary channel may be more fitting |
16:41:51 | <Hans5958> | I still think that every web hosting service shares the same features and will be handled in the same manner on AT, so that's why it would be nice to put it in one channel to keep those who are interested in these web hosts in one place, as well as having info in one place
16:42:43 | <Hans5958> | Though this is coming from me, who rarely uses the ephemeral nature of IRC, so I would agree to disagree on some motions
16:43:48 | <hexagonwin|m> | arkiver: yeah back when i was archiving egloos i've also chatted here, shared "some" stuff iirc on #eggos. i think archiveteam only got the rss feeds and website (before shutdown on 2023/06/16) but i got every post from known blogs by hammering their API long after shutdown (until about october)
16:44:10 | <katia> | Hans5958: generally each tracker project gets its channel; also sometimes other wishes-for-project gets their channel and sometimes these wishes are more broad |
16:44:22 | <katia> | i guess that’s what’s happening here |
17:36:46 | <masterx244|m> | arkiver: JAA mentioned wplace a while ago with an ArchiveBot crawl. it's a living thing that changes constantly, and if a crawl is pretty long it's not consistent at all. luckily some data dumps exist already. currently mirroring one that's stored on GitHub releases onto my own infra (might forward that to the IA as regular items once i get it synced)
17:41:30 | | Karlett (Karlett) joins |
17:42:14 | | @rewby quits [Ping timeout: 260 seconds] |
17:45:09 | | rewby (rewby) joins |
17:45:09 | | @ChanServ sets mode: +o rewby |
17:52:36 | <that_lurker> | easy channel name for tistory would be #thistory |
18:01:24 | | javascript17 joins |
18:02:00 | | HackMii quits [Ping timeout: 255 seconds] |
18:02:22 | | javascript17 quits [Client Quit] |
18:03:50 | | HackMii (hacktheplanet) joins |
18:07:47 | | javascript1 joins |
18:22:01 | | ATinySpaceMarine quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
18:22:28 | | ATinySpaceMarine joins |
18:26:55 | <lemuria> | Wplace? |
18:27:20 | <lemuria> | By the way the URL format for wplace tiles is https://backend.wplace.live/files/s0/tiles/X/Y.png where X and Y are numbers from 0-2048 |
18:27:25 | <lemuria> | 404 means tile is empty |
18:27:41 | <lemuria> | And it's zoom 11 |
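lemuria's tile scheme can be turned into a URL enumerator. The template is quoted from the chat; whether "0-2048" is endpoint-inclusive is not stated, so the default ranges below are an assumption.

```python
# Sketch: enumerate wplace tile URLs from the format given in the chat.
# A 404 response for a tile means that tile is empty.

TILE_URL = "https://backend.wplace.live/files/s0/tiles/{x}/{y}.png"

def tile_urls(xs=range(2048), ys=range(2048)):
    """Yield the URL of every tile in the given X/Y ranges."""
    for x in xs:
        for y in ys:
            yield TILE_URL.format(x=x, y=y)
```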
18:35:31 | <masterx244|m> | yeah but they got annoying rate limits. luckily they fucked up on the IPv6 end of those limits and limit by /128 and not /64. having a way to spread requests across a /64 bypasses that easily
18:36:13 | <masterx244|m> | (wrong buttflare config to our advantage since the tiles are cached on buttflare's HW) |
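The idea masterx244|m describes is that a rate limit keyed on the full /128 address is weak against a client that controls an entire /64, since it can rotate through the 2^64 host addresses within it. A minimal sketch of picking a random source address inside a /64; the prefix used is the IPv6 documentation range, purely illustrative.

```python
# Sketch: choose a random source address within an owned /64 prefix.
# 2001:db8::/64 is the documentation prefix, a stand-in for a real allocation.
import ipaddress
import random

net = ipaddress.IPv6Network("2001:db8::/64")

def random_source(rng=random) -> ipaddress.IPv6Address:
    """Random address in `net`: the /64 leaves 64 free host bits."""
    host = rng.getrandbits(64)
    return ipaddress.IPv6Address(int(net.network_address) + host)
```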
18:37:44 | <that_lurker> | Have they done a rug pull on the crypto yet? |
18:38:27 | <that_lurker> | or is it being archived proactively? |
18:40:17 | <masterx244|m> | was a proactive one afaik since the canvas evolves constantly |
18:45:30 | <b3nzo> | arkiver JAA: should i compress the megawarcs as gzip or zst? any preferred compression for the files |
18:46:30 | | emanuele6 quits [Read error: Connection reset by peer] |
18:47:35 | <masterx244|m> | zst requires pre-setup with a prepared dictionary and is more hassle for common users. .warc.gz is much easier to handle |
18:49:33 | <@JAA> | b3nzo: Why do you need to compress them? Are the input files not already compressed correctly? |
18:50:34 | <@JAA> | I don't believe megawarc lets you recompress with a different algorithm, and if there is any tooling out there, I'm not aware of and would be cautious about it. |
18:51:35 | <b3nzo> | not sure what you mean by input files (maybe single warc.gz files), they are compressed, but compressing thousands of files would be more efficient to store
18:52:26 | <masterx244|m> | WARC files are intentionally compressed per-file inside and not as a continuous stream. when replaying from WARCs you need to load files that might be far down the WARC without reading gigabytes of previous files |
18:52:32 | <@JAA> | per-record* |
18:53:22 | <@JAA> | You could compress whole for cold storage, I suppose, but yeah, for anything that's supposed to be accessible, it's not an option. |
18:53:40 | <@JAA> | (And for .warc.zst, the spec *requires* per-record compression.) |
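The per-record point can be demonstrated in a few lines: a .warc.gz is a sequence of independent gzip members, one per record, so a reader can seek straight to a member's byte offset and decompress only that record instead of streaming through everything before it. The two byte strings below stand in for WARC records.

```python
# Sketch: per-record compression as concatenated gzip members.
# Each record is its own gzip member; the members are simply concatenated.
import gzip

records = [b"record one", b"record two"]
members = [gzip.compress(r) for r in records]
stream = b"".join(members)  # this is the .warc.gz layout in miniature

# Seek directly to the second member's offset and decompress only it,
# without touching the bytes of the first record:
offset = len(members[0])
second = gzip.decompress(stream[offset:])
```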
18:54:20 | | emanuele6 (emanuele6) joins |
18:57:27 | <b3nzo> | ah, so even in case IA wants to index the archives, non-megawarc/single warc files are the way to go?
18:59:14 | <@JAA> | You can megawarc them, but the size should be exactly the sum of the small WARCs. A few bigger files are just much easier to manage than many small ones, both for you and for IA. |
19:00:29 | <b3nzo> | ah i see |
19:00:30 | <b3nzo> | thank you |
19:01:55 | <that_lurker> | always remember do not https://img.kuhaon.fun/u/tOxhpA.gif |
19:37:12 | | Rejoin_HP_Archivist joins |
19:40:39 | | HP_Archivist quits [Ping timeout: 260 seconds] |
19:53:13 | | makeworld4 is now known as makeworld |
20:01:54 | <h2ibot> | Pokechu22 edited Mailing Lists (+38, /* Software */ Sympa…): https://wiki.archiveteam.org/?diff=57413&oldid=57338 |
20:02:32 | <lemuria> | in random news, my router is definitely enjoying the upload speed exercise as i continue to archive what remains of nalani proctor's VODs |
20:03:13 | <lemuria> | deleting four years of VODs (and keeping the only copy on your backup hard drives; "your" referring to nalani's hard drives) is.. certainly a way to tear a hole in my historical record |
20:03:57 | <lemuria> | october 2024 complete. well, at least what remains of october 2024 |
20:46:44 | <Guest> | https://www.malwarebytes.com/blog/news/2025/08/national-public-data-returns-after-massive-social-security-number-leak |
20:46:55 | <Guest> | meant to post in #archiveteam-ot |
20:54:26 | | dabs joins |
21:09:41 | | dabs quits [Client Quit] |
21:12:35 | | atphoenix_ quits [Ping timeout: 258 seconds] |
21:13:44 | | atphoenix_ (atphoenix) joins |
21:39:48 | | emanuele6 quits [Read error: Connection reset by peer] |
21:47:46 | | emanuele6 (emanuele6) joins |
21:48:24 | | b3nz0 quits [Ping timeout: 260 seconds] |
21:50:27 | | etnguyen03 (etnguyen03) joins |
21:53:40 | | cyanbox joins |
22:17:09 | | atphoenix__ (atphoenix) joins |
22:19:19 | | atphoenix_ quits [Ping timeout: 260 seconds] |
22:19:58 | | Dada quits [Remote host closed the connection] |
22:23:44 | | BornOn420 (BornOn420) joins |
22:30:21 | | opl4 quits [Read error: Connection reset by peer] |
22:30:31 | | opl (opl) joins |
22:52:34 | | Church quits [Ping timeout: 260 seconds] |
23:11:27 | | etnguyen03 quits [Client Quit] |
23:17:58 | | Church (Church) joins |
23:31:34 | | HackMii quits [Remote host closed the connection] |
23:32:11 | | HackMii (hacktheplanet) joins |
23:42:44 | | luckcolors quits [Ping timeout: 260 seconds] |
23:47:43 | | nicolas17 is now authenticated as nicolas17 |
23:48:06 | | etnguyen03 (etnguyen03) joins |