00:00:03 | <opl> | out of curiosity and in case of any future lists, do the pages in warc files get deduplicated? if so, is it happening per job? |
00:01:23 | | ducky (ducky) joins |
00:01:29 | <pokechu22> | Archivebot doesn't do any deduplication at all. I think distributed projects do perform deduplication though |
00:02:18 | <opl> | ah. i realized that if it did it would've been smarter to prevent the id and slug urls from getting split into multiple lists, since the id just redirects to the slug. seems it doesn't matter though |
00:02:42 | <pokechu22> | also, hmm, it seems like some of these *are* large files and that's causing issues for other files. I think I'm going to restart it with separate lists for stuff on catalog.data.gov and on other sites |
00:03:03 | <opl> | wait, i guess it would've been smarter to just not include the slug urls, since the redirect from id would've archived the slug anyway... |
00:04:06 | <pokechu22> | hmm, yeah, that seems to be the case |
00:04:19 | <opl> | if that makes sense i can provide a new list with just the id urls |
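The point above — that queuing only the id URLs still archives the slug pages via the redirect — can be sketched as a simple list filter. This is a hypothetical illustration: the UUID-style id pattern and the example URLs are assumptions, not taken from opl's actual list.

```python
import re

# Assumed shape of a catalog.data.gov dataset id URL: /dataset/<uuid>.
# If every id URL redirects to its /dataset/<slug> page, archiving only
# the id form captures both the redirect and the slug page it lands on.
UUID_RE = re.compile(
    r"/dataset/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)

def keep_id_urls(urls):
    """Drop slug URLs, keeping only the UUID-based id URLs."""
    return [u for u in urls if UUID_RE.search(u)]

urls = [
    "https://catalog.data.gov/dataset/12345678-1234-1234-1234-1234567890ab",
    "https://catalog.data.gov/dataset/some-readable-slug",
]
print(keep_id_urls(urls))
```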
00:08:04 | | th3z0l4 joins |
00:08:27 | <pokechu22> | I think I'm going to ignore catalog.data.gov for the existing jobs, and then do some additional jobs for the pagination (in order) and the actual items (shuffled) |
00:10:40 | | th3z0l4_ quits [Ping timeout: 250 seconds] |
00:10:43 | | utulien_ joins |
00:11:59 | <pokechu22> | https://data.nist.gov/ seems to be offline |
00:12:25 | <nulldata> | I can reach it |
00:12:28 | <opl> | home page works for me |
00:12:37 | <pokechu22> | hmm |
00:12:49 | <pokechu22> | rather, seems like stuff like https://data.nist.gov/od/ds/ark:/88434/mds2-2939/CosmosIDDownload/MOSAICWGS_Stool-2_sub-25B/2020_04_23_22_45/Phages/1.1.0/MOSAICWGS_Stool-2_sub-25B_phages_filtered.tsv.sha256 is timing out :| |
00:13:45 | <nulldata> | That does indeed timeout for me |
00:16:34 | <opl> | lmao. ok, so that's from this monster of a dataset https://catalog.data.gov/dataset/continuous-mobile-manipulator-performance-experiment-02-01-2022 |
00:16:52 | <opl> | yes, the catalog server is failing with http 500 for me too |
00:19:20 | <opl> | apparently it has 8326 resources https://catalog.data.gov/harvest/object/df79019e-7280-4c18-98c6-8d0d58e5a883 |
00:21:46 | | th3z0l4_ joins |
00:23:40 | | th3z0l4 quits [Ping timeout: 250 seconds] |
00:24:06 | <@arkiver> | i have likely missed pings over the last half day... please do ping again if you think it's needed |
00:24:36 | | etnguyen03 (etnguyen03) joins |
00:27:34 | <@arkiver> | opl thank you for that list |
00:27:59 | <@imer> | regarding dailymotion from #archiveteam: assuming this is a new thing we have ~3 months from now until videos get "archived" and ~6 months until they're deleted |
00:27:59 | <@imer> | unsure what archived means for accessibility |
00:28:12 | <pokechu22> | ... ok, turns out the search has ratelimiting I think? |
00:28:22 | <@arkiver> | imer: i am guessing unavailable for the public and available to the creators only |
00:28:23 | <pokechu22> | the API seems fine |
00:28:40 | <@arkiver> | pokechu22: search on the list from opl? |
00:28:52 | <pokechu22> | Yeah |
00:28:56 | <@imer> | looking at the AT wiki on dailymotion, that links to https://www.dailymotion.com/archived/index.html which also uses the "archived" terminology |
00:29:10 | <@imer> | wouldnt trust that to be consistent though |
00:29:26 | <@imer> | sure would be convenient.. also a boatload of videos |
00:29:40 | <@arkiver> | yeah many hundreds of TBs again if not more |
00:30:24 | <@arkiver> | i'm not sure yet what to do on dailymotion - going to check back on that tomorrow |
00:30:30 | <@arkiver> | could be very big :/ |
00:33:45 | <@arkiver> | pokechu22: is AB able to capture the data from opl correctly, or do we need a custom project? |
00:34:02 | <pokechu22> | Looks like it is capturing things correctly now |
00:34:15 | <@arkiver> | pokechu22: also the data is actually linked to? |
00:34:31 | <@arkiver> | nice, a sequential identifier https://irma.nps.gov/DataStore/DownloadFile/569000 |
00:34:46 | <pokechu22> | Yes, there's 4 jobs doing that, which are *mostly* working (but some data is missed because javascript/AB quirks/the site is broken) |
00:35:10 | <@arkiver> | pokechu22: do you perhaps have an example of missing data? |
00:35:32 | <pokechu22> | https://data.nist.gov/od/ds/ark:/88434/mds2-3061/Continuous%20Mobile%20Manipulator%20Experiment%2002-01-2022/Pre-Test%20_Data/Code_Backup/catkin_ws2/build/ur_msgs/CMakeFiles/std_msgs_generate_messages_lisp.dir/build.make is timing out (as are a bunch of data.nist.gov things) |
00:35:43 | <@arkiver> | alright |
00:35:48 | <@arkiver> | let's keep an eye on that |
00:35:57 | <h2ibot> | Imer edited Deathwatch (+223, /* 2025 */ add dailymotion deleting inactive…): https://wiki.archiveteam.org/?diff=54297&oldid=54279 |
00:36:05 | <pokechu22> | https://data.ngdc.noaa.gov/platforms/ocean/nos/coast/H12001-H14000/H12593/BAG/H12593_MB_4m_MLLW_Combined.bagxyz.zip just got a 404 but it did exist in the past |
00:36:37 | <@imer> | added to deathwatch for may, might be good to have a better source, wasn't able to find anything from a cursory search though |
00:36:46 | <pokechu22> | I should add that the 4 jobs I'm doing are shuffled so there isn't an obvious pattern to whether everything on a site is gone or what (but it also means we're not going to spend ages on a single broken site without grabbing anything else) |
00:36:50 | <@arkiver> | thanks imer |
00:37:01 | <@arkiver> | i have not looked into it, but wonder how easily findable dailymotion videos are |
00:37:27 | <@arkiver> | if there's an indication of numbers of views over the last 12 months or so, we could somewhat easily decide what to archive and what not |
00:37:46 | <Flashfire42> | Dailymotion is gonna be fucking massive |
00:37:48 | <@arkiver> | if videos are not easily findable that may actually make a project more doable |
00:37:56 | <@arkiver> | Flashfire42: we'll very unlikely get a full copy |
00:38:51 | <Flashfire42> | dailymotion #down-the-tube equivalent wen |
00:39:04 | <@arkiver> | not now yet |
00:39:55 | <Flashfire42> | im still running deadtrickle before I move back over to Down the tube for more helping there |
00:40:56 | <opl> | btw, since catalog.data.gov is just an index of data from other places, i plan to diff the datasets the next time the number of available datasets changes to see what the changes actually are. i'm hoping it's all innocent enough |
00:42:33 | <pokechu22> | opl: there seems to be rate-limiting on e.g. https://catalog.data.gov/dataset/?q=&sort=title_string+asc&ext_location=&ext_bbox=&ext_prev_extent=&page=222 :| |
00:42:44 | | lennier2_ quits [Ping timeout: 250 seconds] |
00:42:47 | | utulien quits [Client Quit] |
00:43:23 | <opl> | hm. i was able to get all the pages from the api in about an hour with no concurrency? |
00:43:57 | <pokechu22> | The API itself seems to be fine at con=6, d=0 |
00:44:01 | <opl> | technically the api page size can be increased from 100, but i had the api timeout at 1000 so i decided not to risk it |
00:44:28 | <opl> | so that's at least something... |
00:44:38 | <pokechu22> | hmm actually, looks like it got annoyed at the very end (https://catalog.data.gov/api/3/action/package_search?rows=100&start=342500&sort=metadata_created+asc&include_deleted=true) but those are also empty |
00:45:00 | <opl> | oh. OH |
00:45:10 | <opl> | i'm dum. i set the page count too high |
00:45:29 | <pokechu22> | That's fine (I normally do that when making lists myself just in case new stuff gets added) |
00:45:45 | <pokechu22> | The 403s on https://catalog.data.gov/dataset/?q=&sort=title_string+asc&ext_location=&ext_bbox=&ext_prev_extent=&page=344 are on valid pages though :| |
00:46:44 | <pokechu22> | It looks like the API job only got 403s on out of bounds pages like that so it's probably complete? |
00:46:54 | <pokechu22> | Is there info on the non-API pages that's not also in the API? |
00:47:08 | <opl> | yeaaah, the search ends at 305k. i went all the way up to 360k, one page at a time. that's just a big derp |
00:47:18 | | lennier2_ joins |
00:47:57 | <opl> | "Is there info on the non-API pages that's not also in the API?" don't think so (which is why i mentioned the api not having rate limits) |
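The pagination being discussed — the CKAN package_search endpoint walked 100 rows at a time up past the ~305k real results — can be sketched by generating the page URLs up front. The row size and the offsets come from the URLs quoted above; whether the server actually honours include_deleted is not something this sketch verifies.

```python
# One API URL per page of the CKAN package_search endpoint, using the
# parameters visible in the URLs pasted in the channel.
BASE = ("https://catalog.data.gov/api/3/action/package_search"
        "?rows={rows}&start={start}&sort=metadata_created+asc&include_deleted=true")

def page_urls(total, rows=100):
    """Return one API URL per page, stopping at the reported result count."""
    return [BASE.format(rows=rows, start=start)
            for start in range(0, total, rows)]

# ~305k datasets at 100 per page -> 3050 requests; going to 360k, as
# described above, just yields empty pages (and eventually 403s) past the end.
urls = page_urls(305_000)
print(len(urls))
print(urls[-1])
```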
00:56:35 | <@arkiver> | let's make a channel for archiving the US government |
00:56:37 | <@arkiver> | any ideas? |
00:57:11 | <@arkiver> | pokechu22: do you know of any efforts to list all government sites at risk? |
00:57:39 | <pokechu22> | arkiver: no; I've just been grabbing things as I see them |
00:58:17 | <Flashfire42> | #UhShitArchival |
00:58:47 | <nulldata> | #FadingGlory |
01:00:16 | <LddPotato> | #USA #UncleSamsArchive |
01:01:08 | <@imer> | oh thats a good one |
01:01:09 | <TheTechRobo> | land of the something? |
01:01:19 | <@arkiver> | imer: which one? |
01:01:22 | <Flashfire42> | I like unclesamsarchive |
01:01:23 | <@imer> | uncle sams |
01:01:34 | <opl> | the reddit post anarcat linked to (/r/DataHoarder/comments/1idm9ii) links to https://eotarchive.org/ |
01:01:54 | <Flashfire42> | I am babysitting it if you guys want ops there before I go to work in like 15 minutes |
01:01:56 | <nulldata> | #UncleSamsClub :P |
01:01:57 | <opl> | they have a bunch of urls, including a github repo with root urls |
01:02:27 | <Flashfire42> | arkiver you like unclesamsarchive? join it If you do I can give you ops |
01:04:11 | <@arkiver> | let's do #unclesamsarchive |
01:04:29 | <@imer> | LddPotato++ |
01:04:30 | <eggdrop> | [karma] 'LddPotato' now has 1 karma! |
01:04:39 | <datechnoman> | LddPotato++ |
01:04:41 | <eggdrop> | [karma] 'LddPotato' now has 2 karma! |
01:04:47 | <Flashfire42> | LddPotato ++ |
01:04:47 | <eggdrop> | [karma] 'LddPotato' now has 3 karma! |
01:04:50 | <@arkiver> | congrats LddPotato :) |
01:04:52 | <TheTechRobo> | LddPotato++ |
01:04:52 | <eggdrop> | [karma] 'LddPotato' now has 4 karma! |
01:05:03 | <pokechu22> | https://eotarchive.org is an archive.org project IIRC (at least I've seen captures attributed to that) |
01:05:11 | <LddPotato> | My first contribution, besides some uploaded data... |
01:05:54 | <@arkiver> | our effort will be unrelated to eotarchive |
01:07:07 | | notarobot1 joins |
01:16:58 | | cascode quits [Ping timeout: 250 seconds] |
01:19:13 | | cascode joins |
01:24:19 | <@OrIdow6> | On dailymotion, https://faq.dailymotion.com/hc/en-us/articles/4403392706194-Dailymotion-inactive-policies claims this policy was in place from September 2024, but that page itself was only up for a week; and the dates don't line up with that screenshot either way |
01:25:09 | <@arkiver> | yeah i need to have a better look at this still |
01:25:33 | <@imer> | "Once your content is archived, you won't be able to access it anymore, but it will still exist in our database." that answers that at least |
01:26:46 | <@imer> | hopefully its a slow rollout then :| |
01:26:55 | <@OrIdow6> | I can't find any other deletion notices doing a really cursory look online |
01:27:06 | <h2ibot> | Imer edited Deathwatch (+100, /* 2025 */ add dailymotions inactive video…): https://wiki.archiveteam.org/?diff=54298&oldid=54297 |
01:27:51 | <@OrIdow6> | imer: Yeah, worst case is that they're "archiving" them already, based on view statistics before the policy change, and this is just the one notice that managed to make its way to us |
01:28:38 | <@arkiver> | would be nice to hear back from the person posting https://old.reddit.com/r/Archiveteam/comments/1idg2nh/dailymotion_start_deleting_inactive_videos/ on when they received that message |
01:32:09 | <@arkiver> | imer: is iMerRobin you? |
01:32:17 | <@imer> | yep, I can ask |
01:32:23 | <@arkiver> | yeah, was about to ask that |
01:33:08 | <@imer> | added, will see if they reply and if not have an attempt at messaging them if thats a thing you can do on reddit |
01:33:36 | <@arkiver> | ... i have no idea on reddit :P |
01:36:54 | | Webuser505650 joins |
01:37:01 | <TheTechRobo> | There's PMs on Reddit. In fact, there's two different ways of sending PMs. lol |
01:38:16 | <@OrIdow6> | imer: going thru https://www.dailymotion.com/archived/index.html, does seem like there's some kind of pattern there, in terms of video IDs being similar on videos "archived" at similar times |
01:41:24 | <@arkiver> | OrIdow6: i'm not sure if this is the same as what is being talked about in the announcement posted on reddit |
01:41:46 | <@arkiver> | i am guessing those "archived" videos should not be available anymore according to the reddit post |
01:41:51 | <@OrIdow6> | arkiver: I'm thinking it may offer an ability to enumerate IDs |
01:42:08 | <@OrIdow6> | Since it doesn't seem like they're random, unlike how Youtube seems to be |
01:42:16 | <@OrIdow6> | Entirely random |
01:42:16 | <@arkiver> | yeah there seems to be some sequential pattern |
01:43:19 | | Webuser505650 quits [Client Quit] |
01:46:16 | <@imer> | yep, looks like it at a glance |
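The enumeration idea above rests on the IDs not being random. Purely as speculation: Dailymotion video IDs look like "x" followed by a lowercase base-36 string, and if that holds, similar IDs decode to nearby integers, which would explain the clustering on the "archived" index page. The ID format here is an assumption, not confirmed in the channel.

```python
# Hypothetical decoding of a Dailymotion-style video ID ("x" + base-36).
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def decode_dm_id(vid):
    """Turn an assumed 'x'-prefixed base-36 ID into an integer."""
    assert vid.startswith("x")
    return int(vid[1:], 36)

def encode_dm_id(n):
    """Inverse of decode_dm_id, for walking an ID range."""
    out = ""
    while n:
        out = DIGITS[n % 36] + out
        n //= 36
    return "x" + (out or "0")

print(decode_dm_id("x7xyz"))
print(encode_dm_id(decode_dm_id("x7xyz")))
```

If the assumption held, a survey could walk integer ranges and re-encode them into candidate IDs.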
01:46:30 | <@arkiver> | this could be very big, multiple PBs |
01:46:48 | <@arkiver> | so really not sure yet what may be done, maybe smaller versions, maybe some limited scope |
01:47:26 | <@arkiver> | will wait for confirmation on this in some way, and when that message on reddit was actually received |
01:47:56 | <@arkiver> | we will very likely not archive PBs of this |
01:49:52 | <@arkiver> | so i'm really not sure yet, we can make a channel for dailymotion |
01:50:01 | <@arkiver> | if there's ideas? :P |
01:50:46 | | wickedplayer494 quits [Ping timeout: 250 seconds] |
01:51:03 | <@arkiver> | i'm slightly surprised there's no dailymotion channel yet on the wiki |
01:51:42 | <@OrIdow6> | arkiver: What I'm thinking, if there's no more official info, is that we could figure out how the IDs work, do a survey of the whole site, and then watch (sample/etc) that to figure out how they're being deleted |
01:51:47 | | wickedplayer494 joins |
01:51:54 | <@arkiver> | OrIdow6: yeah maybe |
01:51:56 | <@OrIdow6> | survey = HTML/API-only crawl of the whole site |
01:53:18 | | wickedplayer494 is now authenticated as wickedplayer494 |
01:56:23 | | cascode quits [Read error: Connection reset by peer] |
01:56:32 | <@arkiver> | i'm not sure "sampling" here would be very useful. if it's really a new policy, so the message on reddit was recent, and deletions have not started, there will be an initial huge wave of deletions, after which more deletions happen slowly |
01:56:46 | <@arkiver> | sampling to detect that wave will be too late |
01:57:04 | | cascode joins |
01:59:56 | <DigitalDragons> | i nominate #dailyfrozen or #dailydemotion |
02:00:07 | <@arkiver> | #dailydemotion is nice |
02:00:09 | <@arkiver> | let's do that |
02:00:18 | | utulien_ quits [Ping timeout: 260 seconds] |
02:00:43 | <@arkiver> | don't get too hyped up on a project for dailymotion yet please |
02:00:59 | <@arkiver> | it could be huge and needs good consideration before starting |
02:04:03 | <@imer> | DigitalDragons++ |
02:04:03 | <eggdrop> | [karma] 'DigitalDragons' now has 12 karma! |
02:21:08 | | wumpus joins |
02:23:40 | <wumpus> | I'm working with some folks to archive climate data from US government websites. Is it possible that some folks here would be interested in helping? |
02:24:43 | <@arkiver> | wumpus: yes, we have just set up #UncleSamsArchive |
02:24:49 | <@arkiver> | feel free to join |
02:30:01 | | HP_Archivist quits [Quit: Leaving] |
02:30:58 | | sec^nd quits [Remote host closed the connection] |
02:31:20 | | sec^nd (second) joins |
02:34:45 | | Wohlstand (Wohlstand) joins |
02:40:38 | | BlueMaxima joins |
02:58:43 | | Wohlstand quits [Client Quit] |
03:24:19 | | Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ] |
03:24:27 | <h2ibot> | Klorgbane edited Pomf.se/Clones (+982): https://wiki.archiveteam.org/?diff=54299&oldid=54009 |
03:25:06 | | Shjosan (Shjosan) joins |
03:26:45 | | etnguyen03 quits [Quit: Konversation terminated!] |
03:29:56 | | etnguyen03 (etnguyen03) joins |
03:31:28 | <h2ibot> | Klorgbane edited Pomf.se/Clones (+49): https://wiki.archiveteam.org/?diff=54300&oldid=54299 |
03:37:17 | | Wohlstand (Wohlstand) joins |
03:46:54 | | etnguyen03 quits [Remote host closed the connection] |
03:50:54 | | wumpus quits [Client Quit] |
04:37:58 | | Wohlstand quits [Remote host closed the connection] |
04:37:59 | | Wohlstand1 (Wohlstand) joins |
04:40:20 | | Wohlstand1 quits [Remote host closed the connection] |
05:05:20 | | cascode quits [Ping timeout: 250 seconds] |
05:06:07 | | cascode joins |
05:06:30 | | cascode quits [Read error: Connection reset by peer] |
05:06:38 | | cascode joins |
05:07:32 | | cascode quits [Read error: Connection reset by peer] |
05:07:41 | | cascode joins |
05:08:27 | | cascode quits [Read error: Connection reset by peer] |
05:08:42 | | cascode joins |
05:32:52 | <Stagnant_> | Could someone add an archivebot job for https://www.hevydevyforums.com? It's the official forum for the musician Devin Townsend. It has a lot of messages from 2004-2019 but it has been unmaintained for ~5 years and since last year it's been constantly filling with topics from spam bots. |
05:33:58 | | BlueMaxima quits [Read error: Connection reset by peer] |
05:36:03 | <pokechu22> | Stagnant_: job started. I've disabled offsite links because of that spam, but it'll grab everything on the forum |
05:42:16 | <Stagnant_> | Thanks! |
05:59:18 | | qinplus_phone joins |
06:03:55 | <h2ibot> | DigitalDragon created US Government (+1887, Created page with " == Discovery == An…): https://wiki.archiveteam.org/?title=US%20Government |
06:06:48 | | niemasd1 joins |
06:06:52 | | niemasd1 leaves |
06:07:13 | | niemasd joins |
06:07:32 | <niemasd> | Can someone help me trigger a backup of cdc.gov? |
06:09:01 | <niemasd> | I'm not sure how I can get channel operator or voice permissions, but we have reason to believe significant edits will be made to the website in the near future |
06:10:25 | <pokechu22> | niemasd: that seems like a good idea, but it's also a super large site :| |
06:11:02 | <niemasd> | How about portions of it, e.g. Fluview? |
06:11:21 | <niemasd> | And all sections related to the bird flu situation? |
06:11:21 | <pokechu22> | If there's some specific sections that seem particularly at risk I can do those first |
06:12:32 | <pokechu22> | I guess https://www.cdc.gov/bird-flu/wcms-auto-sitemap.xml https://www.cdc.gov/flu/wcms-auto-sitemap.xml https://www.cdc.gov/fluview/wcms-auto-sitemap.xml |
06:17:34 | <niemasd> | Sorry, juggling a few things, yes, those would be great |
06:18:17 | <niemasd> | According to sources I know, there may be edits made to any pages related to flu, LGBTQIA+ data (especially mpox, HIV, STIs), and more broadly possibly infectious disease data |
06:20:34 | <pokechu22> | Hmm, looking at https://www.cdc.gov/wcms-auto-sitemap-index.xml and everything it links to, there's a total of 43735 pages (which is large but doable). Though that number doesn't count images and other files (e.g. pdfs) |
06:21:08 | <niemasd> | I think information/text is most important |
06:21:46 | <pokechu22> | Yeah, it'll do everything linked directly from the sitemap first, and then stuff on those pages |
06:21:55 | <niemasd> | Wow, amazing; thank you so much! |
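The count above comes from summing the <loc> entries of each sitemap linked from the sitemap index. A minimal sketch of that counting step, using a stand-in XML snippet rather than the live cdc.gov files:

```python
import xml.etree.ElementTree as ET

# Sitemaps use this fixed namespace per the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def count_urls(sitemap_xml):
    """Count <loc> entries in a single urlset sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return len(root.findall("sm:url/sm:loc", NS))

# Stand-in content; the real files are the wcms-auto-sitemap.xml documents
# linked from https://www.cdc.gov/wcms-auto-sitemap-index.xml
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.cdc.gov/bird-flu/index.html</loc></url>
  <url><loc>https://www.cdc.gov/flu/index.html</loc></url>
</urlset>"""
print(count_urls(sample))
```

Summing this over every sitemap in the index gives the 43735-page total mentioned above (pages only, not images or PDFs).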
06:31:34 | <@JAA> | (Conversation moved to #UncleSamsArchive) |
07:19:09 | <h2ibot> | FMecha edited 4chan (+1477, /* Fuuka-based Archivers */ archived.moe search…): https://wiki.archiveteam.org/?diff=54303&oldid=53886 |
07:23:09 | <h2ibot> | JustAnotherArchivist edited Deathwatch (+570, /* 2025 */ Add Windows Themes): https://wiki.archiveteam.org/?diff=54304&oldid=54298 |
07:26:55 | | midou quits [Remote host closed the connection] |
07:27:09 | | midou joins |
07:48:16 | | cascode quits [Ping timeout: 250 seconds] |
07:49:28 | | cascode joins |
07:55:53 | | niemasd is now authenticated as niemasd |
07:56:45 | | niemasd quits [Client Quit] |
08:05:29 | | cascode quits [Read error: Connection reset by peer] |
08:05:38 | | cascode joins |
08:09:13 | | qinplus_phone quits [Client Quit] |
08:29:05 | | onetruth quits [Read error: Connection reset by peer] |
08:36:08 | | Megame joins |
08:38:21 | | Megame quits [Client Quit] |
09:09:51 | | Wohlstand (Wohlstand) joins |
09:15:20 | | nulldata quits [Quit: So long and thanks for all the fish!] |
09:15:52 | | nulldata (nulldata) joins |
09:25:13 | | Wohlstand quits [Client Quit] |
09:36:35 | | Wohlstand (Wohlstand) joins |
09:38:19 | | Wohlstand quits [Client Quit] |
10:10:36 | | Island quits [Read error: Connection reset by peer] |
10:17:53 | | nyakase quits [Ping timeout: 260 seconds] |
10:19:45 | | nyakase (nyakase) joins |
10:21:37 | | BearFortress_ quits [] |
10:30:43 | <h2ibot> | Klorgbane edited Pomf.se/Clones (+342): https://wiki.archiveteam.org/?diff=54305&oldid=54300 |
10:31:50 | | Dango360 (Dango360) joins |
11:08:44 | | BearFortress joins |
11:26:45 | <@JAA> | So I've been poking at BlogTalkRadio. The site is a bit of a mess. Audio playback probably won't work in the WBM because a random GUID is added to the MP3 URL with JS. I saw fake 404s. There are episodes that still exist but whose pages genuinely 404. |
11:31:42 | <@JAA> | A bunch of stuff also references external content. I've seen just about everything in that regard: onsite audio URLs redirecting to offsite, offsite audio URLs, offsite episode pages, offsite podcast Atom feeds, ... |
11:34:16 | <@JAA> | It's also hosted on an oversized potato. |
11:35:47 | <@JAA> | I do have some good news though: the audio that's actually hosted by them is all accessible easily through CloudFront. While their player URLs redirect to fancy signed URLs etc., all episodes I've tried seem to work just fine under another direct URL. |
11:36:03 | <@arkiver> | but could we get the signed URLs as well? |
11:36:58 | <@JAA> | Well, yes, but the problem is the potato bit. |
12:00:05 | | Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat] |
12:02:48 | | Bleo18260072271962345 joins |
12:11:18 | | cascode quits [Ping timeout: 250 seconds] |
12:12:26 | | cascode joins |
12:34:17 | | SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962] |
12:34:49 | | SkilledAlpaca418962 joins |
12:45:40 | | PotatoProton01 joins |
12:45:56 | | PotatoProton01 quits [Client Quit] |
13:12:48 | <@JAA> | 'Eh, the RSS feeds can't be that large, 10 MiB will be plenty...' - Nope. |
13:23:09 | | loug8318142 joins |
13:35:59 | <@JAA> | Ugh, there are different *kinds* of 404s, too. |
13:39:48 | | kitonthenet joins |
13:42:49 | | kitonthenet is now authenticated as kitonthenet |
14:00:14 | <@JAA> | I give up. This is a bottomless pit. What I have now needs to be good enough. |
14:02:23 | <@arkiver> | JAA: sounds like ten years of duct tape on top of duct tape, with lower levels of duct tape disintegrating |
14:03:40 | <@JAA> | arkiver: Yep. Totally not like our things! |
14:04:22 | <knecht> | is there a better way to do simple spn captures from scripts or similar than https://github.com/overcast07/wayback-machine-spn-scripts ? returning outlinks somehow stopped working and it seems a bit buggy overall |
14:05:02 | <knecht> | context is an irc bot archiving links it sees |
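One alternative to those shell scripts is calling the SPN2 HTTP API directly. This is a sketch based on Internet Archive's public SPN2 notes: the endpoint, the LOW key:secret auth header, and the capture_outlinks flag should all be checked against the current documentation, and the keys here are placeholders.

```python
import urllib.parse
import urllib.request

def build_spn_request(url, access_key, secret_key, outlinks=True):
    """Build (but don't send) a Save Page Now 2 capture request."""
    data = {"url": url}
    if outlinks:
        # Asks SPN to also capture pages the target links to.
        data["capture_outlinks"] = "1"
    return urllib.request.Request(
        "https://web.archive.org/save",
        data=urllib.parse.urlencode(data).encode(),
        headers={
            "Accept": "application/json",
            # S3-style keys from https://archive.org/account/s3.php
            "Authorization": f"LOW {access_key}:{secret_key}",
        },
        method="POST",
    )

req = build_spn_request("https://example.com/", "MYKEY", "MYSECRET")
print(req.full_url)
```

Sending it with `urllib.request.urlopen(req)` returns JSON containing a job id that can be polled for status and outlinks.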
14:06:30 | <@JAA> | Note that I am *not* fetching any audio right now. I don't have a good size estimate for that, but it'll be too big for a single system, probably. The AB job is at well over 6 TiB and nowhere near done, although it has duplicated some things, too. |
14:15:32 | <@JAA> | ~20% of episodes exist but their pages are gone, by the way. |
14:16:48 | | pold joins |
14:20:24 | | katocala is now authenticated as katocala |
14:23:30 | <@JAA> | Ok, turns out that the direct access for the audio does not actually work for all episodes. Welp. |
14:25:48 | <pold> | good day everyone. if someone has time, there are two websites that need immediate attention. the swiss branch of "Depot", a home decor shop with over 30 shops in switzerland (450 in germany), has filed for bankruptcy and announced yesterday evening that today is the final day of operation. I guess worth queuing up for a quick crawl: |
14:25:48 | <pold> | https://www.depot.ch |
14:27:49 | <@JAA> | I just read about that earlier, yeah. |
14:35:10 | <pold> | the 2nd website is way more important and I am way too late tbh. maybe more could have been done... pietsmiet.de is the website of one of the most influential early german gaming youtubers and going to shut down tomorrow. PietSmiet currently has 2.47 Million subscribers on YT and they started with let's plays all the way back in 2011. the website |
14:35:10 | <pold> | was not only used for gaming news articles and advertising their projects but also had a premium subscription model with bonus videos only viewable over the website. so everything cannot be saved anyway (even tho they claimed to have saved all of these videos and in their reddit several users started to save these videos) but i hope a crawl could |
14:35:10 | <pold> | save at least the articles publicly available. thank you in advance and have a nice afternoon :) |
14:36:45 | | BornOn420 quits [Remote host closed the connection] |
14:37:15 | | BornOn420 (BornOn420) joins |
14:47:20 | <@JAA> | pold: I've started ArchiveBot jobs for Depot. PietSmiet has been on our radar, but archiving it is messy. I'll see if I can poke it again later today. |
14:50:29 | | Mist8kenGAS (Mist8kenGAS) joins |
14:51:31 | <@JAA> | Rough size estimate on BTR: in a random sample of 1k episode IDs, I got about 600 'existing' episodes, though that includes many broken ones. The average audio size seems to be about 8.2 MB per episode ID (including the broken and nonexisting ones). Episode IDs go to about 12.4 million, which gives a rough total size estimate of 102 TB. |
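That back-of-the-envelope estimate works out as stated: an average of 8.2 MB per episode ID (averaged over all sampled IDs, broken and nonexistent ones included) multiplied by the full ID space. The figures below are the ones quoted in the message, not independently measured.

```python
# Reproducing the rough BlogTalkRadio size estimate from the sample.
ids_total = 12_400_000        # episode IDs go to about 12.4 million
avg_bytes_per_id = 8.2e6      # ~8.2 MB averaged over ALL sampled IDs

total_tb = ids_total * avg_bytes_per_id / 1e12
print(f"{total_tb:.0f} TB")   # ~102 TB
```

Note the existence rate (~600 of 1k) is already folded into the per-ID average, so it does not get applied a second time.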
14:54:29 | <pold> | thank you very much :) |
14:56:03 | | Mist8kenGAS quits [Remote host closed the connection] |
14:56:20 | | Mist8kenGAS (Mist8kenGAS) joins |
15:11:28 | <pabs> | knecht: I like the email interface to SPN the best, doesn't do outlinks though. |
15:23:49 | | Mist8kenGAS quits [Client Quit] |
15:28:13 | | kitonthenet quits [Ping timeout: 260 seconds] |
15:33:19 | | kitonthenet joins |
15:36:07 | | earl joins |
15:54:30 | | kitonthenet is now authenticated as kitonthenet |
15:57:48 | | pold quits [Client Quit] |
16:11:58 | | kitonthenet quits [Ping timeout: 260 seconds] |
16:16:02 | <h2ibot> | Nulldata uploaded File:USFlag.png: https://wiki.archiveteam.org/?title=File%3AUSFlag.png |
16:16:03 | <h2ibot> | Nulldata edited US Government (+268, Added infobox): https://wiki.archiveteam.org/?diff=54307&oldid=54301 |
16:18:27 | | kitonthenet joins |
16:21:03 | <h2ibot> | Nulldata edited Current Projects (+123, Add US to upcoming projects): https://wiki.archiveteam.org/?diff=54308&oldid=54205 |
16:22:03 | <h2ibot> | Nulldata edited US Government (+80, Add source to infobox): https://wiki.archiveteam.org/?diff=54309&oldid=54307 |
16:28:33 | | Wohlstand (Wohlstand) joins |
16:44:56 | | BornOn420 quits [Client Quit] |
17:02:57 | | scurvy_duck joins |
17:22:02 | | BornOn420 (BornOn420) joins |
17:22:15 | <h2ibot> | Arkiver uploaded File:Greater coat of arms of the United States.png: https://wiki.archiveteam.org/?title=File%3AGreater%20coat%20of%20arms%20of%20the%20United%20States.png |
17:24:22 | | lflare quits [Quit: Bye] |
17:24:47 | | lflare (lflare) joins |
17:48:54 | <balrog> | https://bsky.app/profile/ryanhatesthis.bsky.social/post/3lh2hvl6iqt2i |
17:48:57 | <balrog> | is someone on the CDC website? |
17:49:17 | <balrog> | seems like it |
17:51:54 | | kitonthenet quits [Ping timeout: 250 seconds] |
17:54:02 | | niemasd (niemasd) joins |
17:54:06 | <niemasd> | For the cdc.gov backup, I see in the progress that a lot of DOI URLs are getting backed up. It may be good to add an ignore pattern for now that ignores doi.org URLs (since those are less likely to go down soon, more likely to be links to external papers) |
17:55:20 | | niemasd quits [Client Quit] |
17:55:39 | <@arkiver> | yes |
17:58:24 | | cascode quits [Ping timeout: 250 seconds] |
17:59:02 | | cascode joins |
18:05:25 | <h2ibot> | Nulldata edited US Government (+35, Change logo): https://wiki.archiveteam.org/?diff=54311&oldid=54309 |
18:11:49 | <@arkiver> | nulldata: keeping a close eye on this i see, thanks :) |
18:30:26 | | midou quits [Remote host closed the connection] |
18:30:28 | | midou joins |
19:02:04 | | Radzig quits [Quit: ZNC 1.9.1 - https://znc.in] |
19:03:47 | | Radzig joins |
19:06:37 | | notarobot1 quits [Quit: The Lounge - https://thelounge.chat] |
19:07:07 | | notarobot1 joins |
19:40:41 | <@arkiver> | pokechu22: we need to speed up the AB government jobs |
19:40:43 | <@arkiver> | as fast as possible |
19:42:55 | <balrog> | https://cyberplace.social/@GossiTheDog/113921481331311737 -- looks like this likely affects computing.uga.edu, www.ic.gatech.edu, www.cs.uga.edu |
19:43:09 | <balrog> | oh wait that is being handled, nvm |
19:47:07 | | kitonthenet joins |
19:58:23 | | Wohlstand quits [Quit: Wohlstand] |
20:10:50 | | i_have_n0_idea5 (i_have_n0_idea) joins |
20:10:51 | | Wohlstand (Wohlstand) joins |
20:11:03 | | BornOn420 quits [Remote host closed the connection] |
20:11:38 | | BornOn420 (BornOn420) joins |
20:13:10 | | i_have_n0_idea quits [Ping timeout: 250 seconds] |
20:13:10 | | i_have_n0_idea5 is now known as i_have_n0_idea |
20:22:48 | <mgrandi> | https://www.politico.com/news/2025/01/31/usda-climate-change-websites-00201826 if not already handled |
20:23:38 | <balrog> | USDA seems to have already scrubbed lots of that |
20:23:39 | <balrog> | e.g. https://www.usda.gov/about-usda/general-information/staff-offices/office-chief-economist/office-energy-and-environmental-policy/climate-change |
20:23:57 | <balrog> | seems like that was grabbed in the past few weeks |
20:31:22 | | scurvy_duck quits [Ping timeout: 250 seconds] |
20:52:10 | | scurvy_duck joins |
20:53:10 | <mgrandi> | Article lists a few that work as of now |
21:01:37 | | DogsRNice joins |
21:11:02 | <h2ibot> | Imer edited Current Projects (+97, /* Short-term, urgent projects */ add US…): https://wiki.archiveteam.org/?diff=54312&oldid=54308 |
21:16:03 | <h2ibot> | DigitalDragon edited US Government (+1203, /* Content at risk */): https://wiki.archiveteam.org/?diff=54313&oldid=54311 |
21:18:03 | <h2ibot> | TheTechRobo edited Current Projects (-4, Linkify US Government): https://wiki.archiveteam.org/?diff=54314&oldid=54312 |
21:24:14 | | scurvy_duck quits [Ping timeout: 250 seconds] |
21:24:14 | <@imer> | TheTechRobo: thanks, somehow missed that when searching if we had a page |
21:24:48 | <@imer> | search is case sensitive. |
21:25:02 | <@imer> | at least the autocomplete part |
21:26:40 | | Wohlstand quits [Client Quit] |
21:28:35 | | earl quits [] |
21:35:06 | <h2ibot> | Nulldata edited Current Projects (-123, Remove duplicate US Government): https://wiki.archiveteam.org/?diff=54315&oldid=54314 |
21:36:07 | <@imer> | wow I did a shoddy job with that |
22:03:20 | | etnguyen03 (etnguyen03) joins |
22:19:18 | | scurvy_duck joins |
22:26:03 | <@OrIdow6> | I really don't think preventing confirmed users from moving pages on the wiki is necessary |
22:32:59 | | icedice (icedice) joins |
22:35:13 | <@JAA> | OrIdow6: Neither do I. If you tell me where the knob for that is, I can maybe fix it. |
22:54:49 | | PredatorIWD25 quits [Read error: Connection reset by peer] |
22:56:14 | | PredatorIWD25 joins |
22:56:30 | <@JAA> | After just under 9 hours, my BTR run is at 5.1% completion. :-| |
22:56:58 | | scurvy_duck quits [Ping timeout: 250 seconds] |
22:57:49 | | skyrocket quits [Quit: ZNC 1.8.2+deb2build5 - https://znc.in] |
23:00:43 | | SootBector quits [Remote host closed the connection] |
23:01:03 | | SootBector (SootBector) joins |
23:04:58 | | kitonthenet quits [Ping timeout: 260 seconds] |
23:07:18 | | abirkill quits [Ping timeout: 260 seconds] |
23:07:36 | | Webuser419634 joins |
23:07:50 | | abirkill- (abirkill) joins |
23:07:50 | | abirkill- is now known as abirkill |
23:11:22 | | abirkill- (abirkill) joins |
23:14:53 | | abirkill quits [Ping timeout: 260 seconds] |
23:14:54 | | abirkill- is now known as abirkill |
23:17:41 | | Webuser724606 joins |
23:18:03 | | scurvy_duck joins |
23:18:31 | | abirkill- (abirkill) joins |
23:20:43 | | abirkill quits [Ping timeout: 260 seconds] |
23:20:44 | | abirkill- is now known as abirkill |
23:21:32 | | tertu2 (tertu) joins |
23:24:16 | | tertu quits [Ping timeout: 250 seconds] |
23:27:39 | | Island joins |
23:28:10 | | abirkill quits [Ping timeout: 250 seconds] |
23:28:27 | | abirkill- (abirkill) joins |
23:29:01 | | abirkill- is now known as abirkill |
23:31:38 | <Webuser724606> | If I configure the warrior with a small disk limit like 3 GB, do the projects know to skip anything that won't fit? |
23:35:11 | <nstrom|m> | not that I'm aware |
23:35:16 | <nstrom|m> | pretty sure they'll just fail |
23:35:25 | <nstrom|m> | abort the item then move on to the next |
23:35:29 | <nstrom|m> | but probably after downloading a bunch |
23:35:34 | <opl> | i don't believe so either. for many items it wouldn't even be possible to determine the space required in advance |
23:38:31 | | kitonthenet joins |
23:38:38 | <@JAA> | And parallelism would require coordination between items, which isn't a thing. |
23:41:30 | <nicolas17> | a large item filling the disk could also cause other smaller items to fail with "disk full" |
23:45:52 | | lunik11 quits [Quit: :x] |
23:46:23 | | lunik11 joins |
23:48:58 | | abirkill quits [Ping timeout: 250 seconds] |
23:58:04 | | scurvy_duck quits [Ping timeout: 250 seconds] |