00:00:03 | <opl> | out of curiosity and in case of any future lists, do the pages in warc files get deduplicated? if so, is it happening per job? |
00:01:23 | | ducky (ducky) joins |
00:01:29 | <pokechu22> | Archivebot doesn't do any deduplication at all. I think distributed projects do perform deduplication though |
00:02:18 | <opl> | ah. i realized that if it did it would've been smarter to prevent the id and slug urls from getting split into multiple lists, since the id just redirects to the slug. seems it doesn't matter though |
00:02:42 | <pokechu22> | also, hmm, it seems like some of these *are* large files and that's causing issues for other files. I think I'm going to restart it with separate lists for stuff on catalog.data.gov and on other sites |
00:03:03 | <opl> | wait, i guess it would've been smarter to just not include the slug urls, since the redirect from id would've archived the slug anyway... |
00:04:06 | <pokechu22> | hmm, yeah, that seems to be the case |
00:04:19 | <opl> | if that makes sense i can provide a new list with just the id urls |
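The point above — that queuing only the id URLs still archives the slug pages via the redirect — can be sketched as a simple list filter. This is a hypothetical illustration: the UUID-style id pattern and the example URLs are assumptions, not taken from opl's actual list.

```python
import re

# Assumed shape of a catalog.data.gov dataset id URL: /dataset/<uuid>.
# If every id URL redirects to its /dataset/<slug> page, archiving only
# the id form captures both the redirect and the slug page it lands on.
UUID_RE = re.compile(
    r"/dataset/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)

def keep_id_urls(urls):
    """Drop slug URLs, keeping only the UUID-based id URLs."""
    return [u for u in urls if UUID_RE.search(u)]

urls = [
    "https://catalog.data.gov/dataset/12345678-1234-1234-1234-1234567890ab",
    "https://catalog.data.gov/dataset/some-readable-slug",
]
print(keep_id_urls(urls))
```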
00:08:04 | | th3z0l4 joins |
00:08:27 | <pokechu22> | I think I'm going to ignore catalog.data.gov for the existing jobs, and then do some additional jobs for the pagination (in order) and the actual items (shuffled) |
00:10:40 | | th3z0l4_ quits [Ping timeout: 250 seconds] |
00:10:43 | | utulien_ joins |
00:11:59 | <pokechu22> | https://data.nist.gov/ seems to be offline |
00:12:25 | <nulldata> | I can reach it |
00:12:28 | <opl> | home page works for me |
00:12:37 | <pokechu22> | hmm |
00:12:49 | <pokechu22> | rather, seems like stuff like https://data.nist.gov/od/ds/ark:/88434/mds2-2939/CosmosIDDownload/MOSAICWGS_Stool-2_sub-25B/2020_04_23_22_45/Phages/1.1.0/MOSAICWGS_Stool-2_sub-25B_phages_filtered.tsv.sha256 is timing out :| |
00:13:45 | <nulldata> | That does indeed timeout for me |
00:16:34 | <opl> | lmao. ok, so that's from this monster of a dataset https://catalog.data.gov/dataset/continuous-mobile-manipulator-performance-experiment-02-01-2022 |
00:16:52 | <opl> | yes, the catalog server is failing with http 500 for me too |
00:19:20 | <opl> | apparently it has 8326 resources https://catalog.data.gov/harvest/object/df79019e-7280-4c18-98c6-8d0d58e5a883 |
00:21:46 | | th3z0l4_ joins |
00:23:40 | | th3z0l4 quits [Ping timeout: 250 seconds] |
00:24:06 | <@arkiver> | i have likely missed pings over the last half day... please do ping again if you think it's needed |
00:24:36 | | etnguyen03 (etnguyen03) joins |
00:27:34 | <@arkiver> | opl thank you for that list |
00:27:59 | <@imer> | regarding dailymotion from #archiveteam: assuming this is a new thing we have ~3 months from now until videos get "archived" and ~6 months until they're deleted |
00:27:59 | <@imer> | unsure what archived means for accessibility |
00:28:12 | <pokechu22> | ... ok, turns out the search has ratelimiting I think? |
00:28:22 | <@arkiver> | imer: i am guessing unavailable for the public and available to the creators only |
00:28:23 | <pokechu22> | the API seems fine |
00:28:40 | <@arkiver> | pokechu22: search on the list from opl? |
00:28:52 | <pokechu22> | Yeah |
00:28:56 | <@imer> | looking at the AT wiki on dailymotion, that links to https://www.dailymotion.com/archived/index.html which also uses the "archived" terminology |
00:29:10 | <@imer> | wouldnt trust that to be consistent though |
00:29:26 | <@imer> | sure would be convenient.. also a boatload of videos |
00:29:40 | <@arkiver> | yeah many hundreds of TBs again if not more |
00:30:24 | <@arkiver> | i'm not sure yet what to do on dailymotion - going to check back on that tomorrow |
00:30:30 | <@arkiver> | could be very big :/ |
00:33:45 | <@arkiver> | pokechu22: is AB able to capture the data from opl correctly, or do we need a custom project? |
00:34:02 | <pokechu22> | Looks like it is capturing things correctly now |
00:34:15 | <@arkiver> | pokechu22: also the data is actually linked to? |
00:34:31 | <@arkiver> | nice, a sequential identifier https://irma.nps.gov/DataStore/DownloadFile/569000 |
00:34:46 | <pokechu22> | Yes, there's 4 jobs doing that, which are *mostly* working (but some data is missed because javascript/AB quirks/the site is broken) |
00:35:10 | <@arkiver> | pokechu22: do you perhaps have an example of missing data? |
00:35:32 | <pokechu22> | https://data.nist.gov/od/ds/ark:/88434/mds2-3061/Continuous%20Mobile%20Manipulator%20Experiment%2002-01-2022/Pre-Test%20_Data/Code_Backup/catkin_ws2/build/ur_msgs/CMakeFiles/std_msgs_generate_messages_lisp.dir/build.make is timing out (as are a bunch of data.nist.gov things) |
00:35:43 | <@arkiver> | alright |
00:35:48 | <@arkiver> | let's keep an eye on that |
00:35:57 | <h2ibot> | Imer edited Deathwatch (+223, /* 2025 */ add dailymotion deleting inactive…): https://wiki.archiveteam.org/?diff=54297&oldid=54279 |
00:36:05 | <pokechu22> | https://data.ngdc.noaa.gov/platforms/ocean/nos/coast/H12001-H14000/H12593/BAG/H12593_MB_4m_MLLW_Combined.bagxyz.zip just got a 404 but it did exist in the past |
00:36:37 | <@imer> | added to deathwatch for may, might be good to have a better source, wasn't able to find anything from a cursory search though |
00:36:46 | <pokechu22> | I should add that the 4 jobs I'm doing are shuffled so there isn't an obvious pattern to whether everything on a site is gone or what (but it also means we're not going to spend ages on a single broken site without grabbing anything else) |
00:36:50 | <@arkiver> | thanks imer |
00:37:01 | <@arkiver> | i have not looked into it, but wonder how easily findable dailymotion videos are |
00:37:27 | <@arkiver> | if there's an indication of numbers of views over the last 12 months or so, we could somewhat easily decide what to archive and what not |
00:37:46 | <Flashfire42> | Dailymotion is gonna be fucking massive |
00:37:48 | <@arkiver> | if videos are not easily findable that may actually make a project more doable |
00:37:56 | <@arkiver> | Flashfire42: we'll very unlikely get a full copy |
00:38:51 | <Flashfire42> | dailymotion #down-the-tube equivalent wen |
00:39:04 | <@arkiver> | not now yet |
00:39:55 | <Flashfire42> | im still running deadtrickle before I move back over to Down the tube for more helping there |
00:40:56 | <opl> | btw, since catalog.data.gov is just an index of data from other places, i plan to diff the datasets the next time the number of available datasets changes to see what the changes actually are. i'm hoping it's all innocent enough |
00:42:33 | <pokechu22> | opl: there seems to be rate-limiting on e.g. https://catalog.data.gov/dataset/?q=&sort=title_string+asc&ext_location=&ext_bbox=&ext_prev_extent=&page=222 :| |
00:42:44 | | lennier2_ quits [Ping timeout: 250 seconds] |
00:42:47 | | utulien quits [Client Quit] |
00:43:23 | <opl> | hm. i was able to get all the pages from the api in about an hour with no concurrency? |
00:43:57 | <pokechu22> | The API itself seems to be fine at con=6, d=0 |
00:44:01 | <opl> | technically the api page size can be increased from 100, but i had the api timeout at 1000 so i decided not to risk it |
00:44:28 | <opl> | so that's at least something... |
00:44:38 | <pokechu22> | hmm actually, looks like it got annoyed at the very end (https://catalog.data.gov/api/3/action/package_search?rows=100&start=342500&sort=metadata_created+asc&include_deleted=true) but those are also empty |
00:45:00 | <opl> | oh. OH |
00:45:10 | <opl> | i'm dum. i set the page count too high |
00:45:29 | <pokechu22> | That's fine (I normally do that when making lists myself just in case new stuff gets added) |
00:45:45 | <pokechu22> | The 403s on https://catalog.data.gov/dataset/?q=&sort=title_string+asc&ext_location=&ext_bbox=&ext_prev_extent=&page=344 are on valid pages though :| |
00:46:44 | <pokechu22> | It looks like the API job only got 403s on out of bounds pages like that so it's probably complete? |
00:46:54 | <pokechu22> | Is there info on the non-API pages that's not also in the API? |
00:47:08 | <opl> | yeaaah, the search ends at 305k. i went all the way up to 360k, one page at a time. that's just a big derp |
00:47:18 | | lennier2_ joins |
00:47:57 | <opl> | "Is there info on the non-API pages that's not also in the API?" don't think so (which is why i mentioned the api not having rate limits) |
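The pagination being discussed — the CKAN package_search endpoint walked 100 rows at a time up past the ~305k real results — can be sketched by generating the page URLs up front. The row size and the offsets come from the URLs quoted above; whether the server actually honours include_deleted is not something this sketch verifies.

```python
# One API URL per page of the CKAN package_search endpoint, using the
# parameters visible in the URLs pasted in the channel.
BASE = ("https://catalog.data.gov/api/3/action/package_search"
        "?rows={rows}&start={start}&sort=metadata_created+asc&include_deleted=true")

def page_urls(total, rows=100):
    """Return one API URL per page, stopping at the reported result count."""
    return [BASE.format(rows=rows, start=start)
            for start in range(0, total, rows)]

# ~305k datasets at 100 per page -> 3050 requests; going to 360k, as
# described above, just yields empty pages (and eventually 403s) past the end.
urls = page_urls(305_000)
print(len(urls))
print(urls[-1])
```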
00:56:35 | <@arkiver> | let's make a channel for archiving the US government |
00:56:37 | <@arkiver> | any ideas? |
00:57:11 | <@arkiver> | pokechu22: do you know of any efforts to list all government sites at risk? |
00:57:39 | <pokechu22> | arkiver: no; I've just been grabbing things as I see them |
00:58:17 | <Flashfire42> | #UhShitArchival |
00:58:47 | <nulldata> | #FadingGlory |
01:00:16 | <LddPotato> | #USA #UncleSamsArchive |
01:01:08 | <@imer> | oh thats a good one |
01:01:09 | <TheTechRobo> | land of the something? |
01:01:19 | <@arkiver> | imer: which one? |
01:01:22 | <Flashfire42> | I like unclesamsarchive |
01:01:23 | <@imer> | uncle sams |
01:01:34 | <opl> | the reddit post anarcat linked to (/r/DataHoarder/comments/1idm9ii) links to https://eotarchive.org/ |
01:01:54 | <Flashfire42> | I am babysitting it if you guys want ops there before I go to work in like 15 minutes |
01:01:56 | <nulldata> | #UncleSamsClub :P |
01:01:57 | <opl> | they have a bunch of urls, including a github repo with root urls |
01:02:27 | <Flashfire42> | arkiver you like unclesamsarchive? join it If you do I can give you ops |
01:04:11 | <@arkiver> | let's do #unclesamsarchive |
01:04:29 | <@imer> | LddPotato++ |
01:04:30 | <eggdrop> | [karma] 'LddPotato' now has 1 karma! |
01:04:39 | <datechnoman> | LddPotato++ |
01:04:41 | <eggdrop> | [karma] 'LddPotato' now has 2 karma! |
01:04:47 | <Flashfire42> | LddPotato ++ |
01:04:47 | <eggdrop> | [karma] 'LddPotato' now has 3 karma! |
01:04:50 | <@arkiver> | congrats LddPotato :) |
01:04:52 | <TheTechRobo> | LddPotato++ |
01:04:52 | <eggdrop> | [karma] 'LddPotato' now has 4 karma! |
01:05:03 | <pokechu22> | https://eotarchive.org is an archive.org project IIRC (at least I've seen captures attributed to that) |
01:05:11 | <LddPotato> | My first contribution, besides some uploaded data... |
01:05:54 | <@arkiver> | our effort will be unrelated to eotarchive |
01:07:07 | | notarobot1 joins |
01:16:58 | | cascode quits [Ping timeout: 250 seconds] |
01:19:13 | | cascode joins |
01:24:19 | <@OrIdow6> | On dailymotion, https://faq.dailymotion.com/hc/en-us/articles/4403392706194-Dailymotion-inactive-policies claims this policy was in place from September 2024, but that page itself was only up for a week; and the dates don't line up with that screenshot either way |
01:25:09 | <@arkiver> | yeah i need to have a better look at this still |
01:25:33 | <@imer> | "Once your content is archived, you won't be able to access it anymore, but it will still exist in our database." that answers that at least |
01:26:46 | <@imer> | hopefully its a slow rollout then :| |
01:26:55 | <@OrIdow6> | I can't find any other deletion notices doing a really cursory look online |
01:27:06 | <h2ibot> | Imer edited Deathwatch (+100, /* 2025 */ add dailymotions inactive video…): https://wiki.archiveteam.org/?diff=54298&oldid=54297 |
01:27:51 | <@OrIdow6> | imer: Yeah, worst case is that they're "archiving" them already, based on view statistics before the policy change, and this is just the one notice that managed to make its way to us |
01:28:38 | <@arkiver> | would be nice to hear back from the person posting https://old.reddit.com/r/Archiveteam/comments/1idg2nh/dailymotion_start_deleting_inactive_videos/ on when they received that message |
01:32:09 | <@arkiver> | imer: is iMerRobin you? |
01:32:17 | <@imer> | yep, I can ask |
01:32:23 | <@arkiver> | yeah, was about to ask that |
01:33:08 | <@imer> | added, will see if they reply and if not have an attempt at messaging them if thats a thing you can do on reddit |
01:33:36 | <@arkiver> | ... i have no idea on reddit :P |
01:36:54 | | Webuser505650 joins |
01:37:01 | <TheTechRobo> | There's PMs on Reddit. In fact, there's two different ways of sending PMs. lol |
01:38:16 | <@OrIdow6> | imer: going thru https://www.dailymotion.com/archived/index.html, does seem like there's some kind of pattern there, in terms of video IDs being similar on videos "archived" at similar times |
01:41:24 | <@arkiver> | OrIdow6: i'm not sure if this is the same as what is being talked about in the announcement posted on reddit |
01:41:46 | <@arkiver> | i am guessing those "archived" videos should not be available anymore according to the reddit post |
01:41:51 | <@OrIdow6> | arkiver: I'm thinking it may offer an ability to enumerate IDs |
01:42:08 | <@OrIdow6> | Since it doesn't seem like they're random, unlike how Youtube seems to be |
01:42:16 | <@OrIdow6> | Entirely random |
01:42:16 | <@arkiver> | yeah there seems to be some sequential pattern |
01:43:19 | | Webuser505650 quits [Client Quit] |
01:46:16 | <@imer> | yep, looks like it at a glance |
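The enumeration idea above rests on the IDs not being random. Purely as speculation: Dailymotion video IDs look like "x" followed by a lowercase base-36 string, and if that holds, similar IDs decode to nearby integers, which would explain the clustering on the "archived" index page. The ID format here is an assumption, not confirmed in the channel.

```python
# Hypothetical decoding of a Dailymotion-style video ID ("x" + base-36).
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def decode_dm_id(vid):
    """Turn an assumed 'x'-prefixed base-36 ID into an integer."""
    assert vid.startswith("x")
    return int(vid[1:], 36)

def encode_dm_id(n):
    """Inverse of decode_dm_id, for walking an ID range."""
    out = ""
    while n:
        out = DIGITS[n % 36] + out
        n //= 36
    return "x" + (out or "0")

print(decode_dm_id("x7xyz"))
print(encode_dm_id(decode_dm_id("x7xyz")))
```

If the assumption held, a survey could walk integer ranges and re-encode them into candidate IDs.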
01:46:30 | <@arkiver> | this could be very big, multiple PBs |
01:46:48 | <@arkiver> | so really not sure yet what may be done, maybe smaller versions, maybe some limited scope |
01:47:26 | <@arkiver> | will wait for confirmation on this in some way, and when that message on reddit was actually received |
01:47:56 | <@arkiver> | we will very likely not archive PBs of this |
01:49:52 | <@arkiver> | so i'm really not sure yet, we can make a channel for dailymotion |
01:50:01 | <@arkiver> | if there's ideas? :P |
01:50:46 | | wickedplayer494 quits [Ping timeout: 250 seconds] |
01:51:03 | <@arkiver> | i'm slightly surprised there's no dailymotion channel yet on the wiki |
01:51:42 | <@OrIdow6> | arkiver: What I'm thinking, if there's no more official info, is that we could figure out how the IDs work, do a survey of the whole site, and then watch (sample/etc) that to figure out how they're being deleted |
01:51:47 | | wickedplayer494 joins |
01:51:54 | <@arkiver> | OrIdow6: yeah maybe |
01:51:56 | <@OrIdow6> | survey = HTML/API-only crawl of the whole site |
01:53:18 | | wickedplayer494 is now authenticated as wickedplayer494 |
01:56:23 | | cascode quits [Read error: Connection reset by peer] |
01:56:32 | <@arkiver> | i'm not sure "sampling" here would be very useful. if it's really a new policy, so the message on reddit was recent, and deletions have not started, there will be an initial huge wave of deletions, after which more deletions happen slowly |
01:56:46 | <@arkiver> | sampling to detect that wave will be too late |
01:57:04 | | cascode joins |
01:59:56 | <DigitalDragons> | i nominate #dailyfrozen or #dailydemotion |
02:00:07 | <@arkiver> | #dailydemotion is nice |
02:00:09 | <@arkiver> | let's do that |
02:00:18 | | utulien_ quits [Ping timeout: 260 seconds] |
02:00:43 | <@arkiver> | don't get too hyped up on a project for dailymotion yet please |
02:00:59 | <@arkiver> | it could be huge and needs good consideration before starting |
02:04:03 | <@imer> | DigitalDragons++ |
02:04:03 | <eggdrop> | [karma] 'DigitalDragons' now has 12 karma! |
02:21:08 | | wumpus joins |
02:23:40 | <wumpus> | I'm working with some folks to archive climate data from US government websites. Is it possible that some folks here would be interested in helping? |
02:24:43 | <@arkiver> | wumpus: yes, we have just set up #UncleSamsArchive |
02:24:49 | <@arkiver> | feel free to join |
02:30:01 | | HP_Archivist quits [Quit: Leaving] |
02:30:58 | | sec^nd quits [Remote host closed the connection] |
02:31:20 | | sec^nd (second) joins |
02:34:45 | | Wohlstand (Wohlstand) joins |
02:40:38 | | BlueMaxima joins |
02:58:43 | | Wohlstand quits [Client Quit] |
03:24:19 | | Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ] |
03:24:27 | <h2ibot> | Klorgbane edited Pomf.se/Clones (+982): https://wiki.archiveteam.org/?diff=54299&oldid=54009 |
03:25:06 | | Shjosan (Shjosan) joins |
03:26:45 | | etnguyen03 quits [Quit: Konversation terminated!] |
03:29:56 | | etnguyen03 (etnguyen03) joins |
03:31:28 | <h2ibot> | Klorgbane edited Pomf.se/Clones (+49): https://wiki.archiveteam.org/?diff=54300&oldid=54299 |
03:37:17 | | Wohlstand (Wohlstand) joins |
03:46:54 | | etnguyen03 quits [Remote host closed the connection] |
03:50:54 | | wumpus quits [Client Quit] |
04:37:58 | | Wohlstand quits [Remote host closed the connection] |
04:37:59 | | Wohlstand1 (Wohlstand) joins |
04:40:20 | | Wohlstand1 quits [Remote host closed the connection] |
05:05:20 | | cascode quits [Ping timeout: 250 seconds] |
05:06:07 | | cascode joins |
05:06:30 | | cascode quits [Read error: Connection reset by peer] |
05:06:38 | | cascode joins |
05:07:32 | | cascode quits [Read error: Connection reset by peer] |
05:07:41 | | cascode joins |
05:08:27 | | cascode quits [Read error: Connection reset by peer] |
05:08:42 | | cascode joins |
05:32:52 | <Stagnant_> | Could someone add an archivebot job for https://www.hevydevyforums.com? It's the official forum for the musician Devin Townsend. It has a lot of messages from 2004-2019 but it has been unmaintained for ~5 years and since last year it's been constantly filling with topics from spam bots. |
05:33:58 | | BlueMaxima quits [Read error: Connection reset by peer] |
05:36:03 | <pokechu22> | Stagnant_: job started. I've disabled offsite links because of that spam, but it'll grab everything on the forum |
05:42:16 | <Stagnant_> | Thanks! |
05:59:18 | | qinplus_phone joins |
06:03:55 | <h2ibot> | DigitalDragon created US Government (+1887, Created page with " == Discovery == An…): https://wiki.archiveteam.org/?title=US%20Government |
06:06:48 | | niemasd1 joins |
06:06:52 | | niemasd1 leaves |
06:07:13 | | niemasd joins |
06:07:32 | <niemasd> | Can someone help me trigger a backup of cdc.gov? |
06:09:01 | <niemasd> | I'm not sure how I can get channel operator or voice permissions, but we have reason to believe significant edits will be made to the website in the near future |
06:10:25 | <pokechu22> | niemasd: that seems like a good idea, but it's also a super large site :| |
06:11:02 | <niemasd> | How about portions of it, e.g. Fluview? |
06:11:21 | <niemasd> | And all sections related to the bird flu situation? |
06:11:21 | <pokechu22> | If there's some specific sections that seem particularly at risk I can do those first |
06:12:32 | <pokechu22> | I guess https://www.cdc.gov/bird-flu/wcms-auto-sitemap.xml https://www.cdc.gov/flu/wcms-auto-sitemap.xml https://www.cdc.gov/fluview/wcms-auto-sitemap.xml |
06:17:34 | <niemasd> | Sorry, juggling a few things, yes, those would be great |
06:18:17 | <niemasd> | According to sources I know, there may be edits made to any pages related to flu, LGBTQIA+ data (especially mpox, HIV, STIs), and more broadly possibly infectious disease data |
06:20:34 | <pokechu22> | Hmm, looking at https://www.cdc.gov/wcms-auto-sitemap-index.xml and everything it links to, there's a total of 43735 pages (which is large but doable). Though that number doesn't count images and other files (e.g. pdfs) |
06:21:08 | <niemasd> | I think information/text is most important |
06:21:46 | <pokechu22> | Yeah, it'll do everything linked directly from the sitemap first, and then stuff on those pages |
06:21:55 | <niemasd> | Wow, amazing; thank you so much! |
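The count above comes from summing the <loc> entries of each sitemap linked from the sitemap index. A minimal sketch of that counting step, using a stand-in XML snippet rather than the live cdc.gov files:

```python
import xml.etree.ElementTree as ET

# Sitemaps use this fixed namespace per the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def count_urls(sitemap_xml):
    """Count <loc> entries in a single urlset sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return len(root.findall("sm:url/sm:loc", NS))

# Stand-in content; the real files are the wcms-auto-sitemap.xml documents
# linked from https://www.cdc.gov/wcms-auto-sitemap-index.xml
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.cdc.gov/bird-flu/index.html</loc></url>
  <url><loc>https://www.cdc.gov/flu/index.html</loc></url>
</urlset>"""
print(count_urls(sample))
```

Summing this over every sitemap in the index gives the 43735-page total mentioned above (pages only, not images or PDFs).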
06:31:34 | <@JAA> | (Conversation moved to #UncleSamsArchive) |
07:19:09 | <h2ibot> | FMecha edited 4chan (+1477, /* Fuuka-based Archivers */ archived.moe search…): https://wiki.archiveteam.org/?diff=54303&oldid=53886 |
07:23:09 | <h2ibot> | JustAnotherArchivist edited Deathwatch (+570, /* 2025 */ Add Windows Themes): https://wiki.archiveteam.org/?diff=54304&oldid=54298 |
07:26:55 | | midou quits [Remote host closed the connection] |
07:27:09 | | midou joins |
07:48:16 | | cascode quits [Ping timeout: 250 seconds] |
07:49:28 | | cascode joins |
07:55:53 | | niemasd is now authenticated as niemasd |
07:56:45 | | niemasd quits [Client Quit] |
08:05:29 | | cascode quits [Read error: Connection reset by peer] |
08:05:38 | | cascode joins |
08:09:13 | | qinplus_phone quits [Client Quit] |
08:29:05 | | onetruth quits [Read error: Connection reset by peer] |
08:36:08 | | Megame joins |
08:38:21 | | Megame quits [Client Quit] |
09:09:51 | | Wohlstand (Wohlstand) joins |
09:15:20 | | nulldata quits [Quit: So long and thanks for all the fish!] |
09:15:52 | | nulldata (nulldata) joins |
09:25:13 | | Wohlstand quits [Client Quit] |
09:36:35 | | Wohlstand (Wohlstand) joins |
09:38:19 | | Wohlstand quits [Client Quit] |
10:10:36 | | Island quits [Read error: Connection reset by peer] |
10:17:53 | | nyakase quits [Ping timeout: 260 seconds] |
10:19:45 | | nyakase (nyakase) joins |
10:21:37 | | BearFortress_ quits [] |
10:30:43 | <h2ibot> | Klorgbane edited Pomf.se/Clones (+342): https://wiki.archiveteam.org/?diff=54305&oldid=54300 |
10:31:50 | | Dango360 (Dango360) joins |
11:08:44 | | BearFortress joins |
11:26:45 | <@JAA> | So I've been poking at BlogTalkRadio. The site is a bit of a mess. Audio playback probably won't work in the WBM because a random GUID is added to the MP3 URL with JS. I saw fake 404s. There are episodes that still exist but whose pages genuinely 404. |
11:31:42 | <@JAA> | A bunch of stuff also references external content. I've seen just about everything in that regard: onsite audio URLs redirecting to offsite, offsite audio URLs, offsite episode pages, offsite podcast Atom feeds, ... |
11:34:16 | <@JAA> | It's also hosted on an oversized potato. |
11:35:47 | <@JAA> | I do have some good news though: the audio that's actually hosted by them is all accessible easily through CloudFront. While their player URLs redirect to fancy signed URLs etc., all episodes I've tried seem to work just fine under another direct URL. |
11:36:03 | <@arkiver> | but could we get the signed URLs as well? |
11:36:58 | <@JAA> | Well, yes, but the problem is the potato bit. |
12:00:05 | | Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat] |
12:02:48 | | Bleo18260072271962345 joins |
12:11:18 | | cascode quits [Ping timeout: 250 seconds] |
12:12:26 | | cascode joins |
12:34:17 | | SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962] |
12:34:49 | | SkilledAlpaca418962 joins |
12:45:40 | | PotatoProton01 joins |
12:45:56 | | PotatoProton01 quits [Client Quit] |
13:12:48 | <@JAA> | 'Eh, the RSS feeds can't be that large, 10 MiB will be plenty...' - Nope. |
13:23:09 | | loug8318142 joins |
13:35:59 | <@JAA> | Ugh, there are different *kinds* of 404s, too. |
13:39:48 | | kitonthenet joins |
13:42:49 | | kitonthenet is now authenticated as kitonthenet |
14:00:14 | <@JAA> | I give up. This is a bottomless pit. What I have now needs to be good enough. |
14:02:23 | <@arkiver> | JAA: sounds like ten years of duct tape on top of duct tape, with lower levels of duct tape disintegrating |
14:03:40 | <@JAA> | arkiver: Yep. Totally not like our things! |
14:04:22 | <knecht> | is there a better way to do simple spn captures from scripts or similar than https://github.com/overcast07/wayback-machine-spn-scripts ? returning outlinks somehow stopped working and it seems a bit buggy overall |
14:05:02 | <knecht> | context is an irc bot archiving links it sees |
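One alternative to those shell scripts is calling the SPN2 HTTP API directly. This is a sketch based on Internet Archive's public SPN2 notes: the endpoint, the LOW key:secret auth header, and the capture_outlinks flag should all be checked against the current documentation, and the keys here are placeholders.

```python
import urllib.parse
import urllib.request

def build_spn_request(url, access_key, secret_key, outlinks=True):
    """Build (but don't send) a Save Page Now 2 capture request."""
    data = {"url": url}
    if outlinks:
        # Asks SPN to also capture pages the target links to.
        data["capture_outlinks"] = "1"
    return urllib.request.Request(
        "https://web.archive.org/save",
        data=urllib.parse.urlencode(data).encode(),
        headers={
            "Accept": "application/json",
            # S3-style keys from https://archive.org/account/s3.php
            "Authorization": f"LOW {access_key}:{secret_key}",
        },
        method="POST",
    )

req = build_spn_request("https://example.com/", "MYKEY", "MYSECRET")
print(req.full_url)
```

Sending it with `urllib.request.urlopen(req)` returns JSON containing a job id that can be polled for status and outlinks.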
14:06:30 | <@JAA> | Note that I am *not* fetching any audio right now. I don't have a good size estimate for that, but it'll be too big for a single system, probably. The AB job is at well over 6 TiB and nowhere near done, although it has duplicated some things, too. |
14:15:32 | <@JAA> | ~20% of episodes exist but their pages are gone, by the way. |
14:16:48 | | pold joins |
14:20:24 | | katocala is now authenticated as katocala |
14:23:30 | <@JAA> | Ok, turns out that the direct access for the audio does not actually work for all episodes. Welp. |
14:25:48 | <pold> | good day everyone. if someone has time, there are two websites that need immediate attention. the swiss branch of "Depot", a home decor shop with over 30 shops in switzerland (450 in germany), has filed for bankruptcy and announced yesterday evening that today is the final day of operation. I guess worth queuing up for a quick crawl: |
14:25:48 | <pold> | https://www.depot.ch |
14:27:49 | <@JAA> | I just read about that earlier, yeah. |
14:35:10 | <pold> | the 2nd website is way more important and I am way too late tbh. maybe more could have been done... pietsmiet.de is the website of one of the most influential early german gaming youtubers and going to shut down tomorrow. PietSmiet currently has 2.47 Million subscribers on YT and they started with let's plays all the way back in 2011. the website |
14:35:10 | <pold> | was not only used for gaming news articles and advertising their projects but also had a premium subscription model with bonus videos only viewable over the website. so everything cannot be saved anyway (even tho they claimed to have saved all of these videos and in their reddit several users started to save these videos) but i hope a crawl could |
14:35:10 | <pold> | save at least the articles publicly available. thank you in advance and have a nice afternoon :) |
14:36:45 | | BornOn420 quits [Remote host closed the connection] |
14:37:15 | | BornOn420 (BornOn420) joins |
14:47:20 | <@JAA> | pold: I've started ArchiveBot jobs for Depot. PietSmiet has been on our radar, but archiving it is messy. I'll see if I can poke it again later today. |
14:50:29 | | Mist8kenGAS (Mist8kenGAS) joins |
14:51:31 | <@JAA> | Rough size estimate on BTR: in a random sample of 1k episode IDs, I got about 600 'existing' episodes, though that includes many broken ones. The average audio size seems to be about 8.2 MB per episode ID (including the broken and nonexisting ones). Episode IDs go to about 12.4 million, which gives a rough total size estimate of 102 TB. |
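That back-of-the-envelope estimate works out as stated: an average of 8.2 MB per episode ID (averaged over all sampled IDs, broken and nonexistent ones included) multiplied by the full ID space. The figures below are the ones quoted in the message, not independently measured.

```python
# Reproducing the rough BlogTalkRadio size estimate from the sample.
ids_total = 12_400_000        # episode IDs go to about 12.4 million
avg_bytes_per_id = 8.2e6      # ~8.2 MB averaged over ALL sampled IDs

total_tb = ids_total * avg_bytes_per_id / 1e12
print(f"{total_tb:.0f} TB")   # ~102 TB
```

Note the existence rate (~600 of 1k) is already folded into the per-ID average, so it does not get applied a second time.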
14:54:29 | <pold> | thank you very much :) |
14:56:03 | | Mist8kenGAS quits [Remote host closed the connection] |
14:56:20 | | Mist8kenGAS (Mist8kenGAS) joins |
15:11:28 | <pabs> | knecht: I like the email interface to SPN the best, doesn't do outlinks though. |
15:23:49 | | Mist8kenGAS quits [Client Quit] |
15:28:13 | | kitonthenet quits [Ping timeout: 260 seconds] |
15:33:19 | | kitonthenet joins |
15:36:07 | | earl joins |
15:54:30 | | kitonthenet is now authenticated as kitonthenet |
15:57:48 | | pold quits [Client Quit] |
16:11:58 | | kitonthenet quits [Ping timeout: 260 seconds] |
16:16:02 | <h2ibot> | Nulldata uploaded File:USFlag.png: https://wiki.archiveteam.org/?title=File%3AUSFlag.png |
16:16:03 | <h2ibot> | Nulldata edited US Government (+268, Added infobox): https://wiki.archiveteam.org/?diff=54307&oldid=54301 |
16:18:27 | | kitonthenet joins |
16:21:03 | <h2ibot> | Nulldata edited Current Projects (+123, Add US to upcoming projects): https://wiki.archiveteam.org/?diff=54308&oldid=54205 |
16:22:03 | <h2ibot> | Nulldata edited US Government (+80, Add source to infobox): https://wiki.archiveteam.org/?diff=54309&oldid=54307 |
16:28:33 | | Wohlstand (Wohlstand) joins |
16:44:56 | | BornOn420 quits [Client Quit] |
17:02:57 | | scurvy_duck joins |
17:22:02 | | BornOn420 (BornOn420) joins |
17:22:15 | <h2ibot> | Arkiver uploaded File:Greater coat of arms of the United States.png: https://wiki.archiveteam.org/?title=File%3AGreater%20coat%20of%20arms%20of%20the%20United%20States.png |
17:24:22 | | lflare quits [Quit: Bye] |
17:24:47 | | lflare (lflare) joins |
17:48:54 | <balrog> | https://bsky.app/profile/ryanhatesthis.bsky.social/post/3lh2hvl6iqt2i |
17:48:57 | <balrog> | is someone on the CDC website? |
17:49:17 | <balrog> | seems like it |
17:51:54 | | kitonthenet quits [Ping timeout: 250 seconds] |
17:54:02 | | niemasd (niemasd) joins |
17:54:06 | <niemasd> | For the cdc.gov backup, I see in the progress that a lot of DOI URLs are getting backed up. It may be good to add an ignore pattern for now that ignores doi.org URLs (since those are less likely to go down soon, more likely to be links to external papers) |
17:55:20 | | niemasd quits [Client Quit] |
17:55:39 | <@arkiver> | yes |
17:58:24 | | cascode quits [Ping timeout: 250 seconds] |
17:59:02 | | cascode joins |
18:05:25 | <h2ibot> | Nulldata edited US Government (+35, Change logo): https://wiki.archiveteam.org/?diff=54311&oldid=54309 |
18:11:49 | <@arkiver> | nulldata: keeping a close eye on this i see, thanks :) |
18:30:26 | | midou quits [Remote host closed the connection] |
18:30:28 | | midou joins |
19:02:04 | | Radzig quits [Quit: ZNC 1.9.1 - https://znc.in] |
19:03:47 | | Radzig joins |
19:06:37 | | notarobot1 quits [Quit: The Lounge - https://thelounge.chat] |
19:07:07 | | notarobot1 joins |
19:40:41 | <@arkiver> | pokechu22: we need to speed up the AB government jobs |
19:40:43 | <@arkiver> | as fast as possible |
19:42:55 | <balrog> | https://cyberplace.social/@GossiTheDog/113921481331311737 -- looks like this likely affects computing.uga.edu, www.ic.gatech.edu, www.cs.uga.edu |
19:43:09 | <balrog> | oh wait that is being handled, nvm |
19:47:07 | | kitonthenet joins |
19:58:23 | | Wohlstand quits [Quit: Wohlstand] |
20:10:50 | | i_have_n0_idea5 (i_have_n0_idea) joins |
20:10:51 | | Wohlstand (Wohlstand) joins |
20:11:03 | | BornOn420 quits [Remote host closed the connection] |
20:11:38 | | BornOn420 (BornOn420) joins |
20:13:10 | | i_have_n0_idea quits [Ping timeout: 250 seconds] |
20:13:10 | | i_have_n0_idea5 is now known as i_have_n0_idea |
20:22:48 | <mgrandi> | https://www.politico.com/news/2025/01/31/usda-climate-change-websites-00201826 if not already handled |
20:23:38 | <balrog> | USDA seems to have already scrubbed lots of that |
20:23:39 | <balrog> | e.g. https://www.usda.gov/about-usda/general-information/staff-offices/office-chief-economist/office-energy-and-environmental-policy/climate-change |
20:23:57 | <balrog> | seems like that was grabbed in the past few weeks |
20:31:22 | | scurvy_duck quits [Ping timeout: 250 seconds] |
20:52:10 | | scurvy_duck joins |
20:53:10 | <mgrandi> | Article lists a few that work as of now |
21:01:37 | | DogsRNice joins |
21:11:02 | <h2ibot> | Imer edited Current Projects (+97, /* Short-term, urgent projects */ add US…): https://wiki.archiveteam.org/?diff=54312&oldid=54308 |
21:16:03 | <h2ibot> | DigitalDragon edited US Government (+1203, /* Content at risk */): https://wiki.archiveteam.org/?diff=54313&oldid=54311 |
21:18:03 | <h2ibot> | TheTechRobo edited Current Projects (-4, Linkify US Government): https://wiki.archiveteam.org/?diff=54314&oldid=54312 |
21:24:14 | | scurvy_duck quits [Ping timeout: 250 seconds] |
21:24:14 | <@imer> | TheTechRobo: thanks, somehow missed that when searching if we had a page |
21:24:48 | <@imer> | search is case sensitive. |
21:25:02 | <@imer> | at least the autocomplete part |
21:26:40 | | Wohlstand quits [Client Quit] |
21:28:35 | | earl quits [] |
21:35:06 | <h2ibot> | Nulldata edited Current Projects (-123, Remove duplicate US Government): https://wiki.archiveteam.org/?diff=54315&oldid=54314 |
21:36:07 | <@imer> | wow I did a shoddy job with that |
22:03:20 | | etnguyen03 (etnguyen03) joins |
22:19:18 | | scurvy_duck joins |
22:26:03 | <@OrIdow6> | I really don't think preventing confirmed users from moving pages on the wiki is necessary |
22:32:59 | | icedice (icedice) joins |
22:35:13 | <@JAA> | OrIdow6: Neither do I. If you tell me where the knob for that is, I can maybe fix it. |
22:54:49 | | PredatorIWD25 quits [Read error: Connection reset by peer] |
22:56:14 | | PredatorIWD25 joins |
22:56:30 | <@JAA> | After just under 9 hours, my BTR run is at 5.1% completion. :-| |
22:56:58 | | scurvy_duck quits [Ping timeout: 250 seconds] |
22:57:49 | | skyrocket quits [Quit: ZNC 1.8.2+deb2build5 - https://znc.in] |
23:00:43 | | SootBector quits [Remote host closed the connection] |
23:01:03 | | SootBector (SootBector) joins |
23:04:58 | | kitonthenet quits [Ping timeout: 260 seconds] |
23:07:18 | | abirkill quits [Ping timeout: 260 seconds] |
23:07:36 | | Webuser419634 joins |
23:07:50 | | abirkill- (abirkill) joins |
23:07:50 | | abirkill- is now known as abirkill |
23:11:22 | | abirkill- (abirkill) joins |
23:14:53 | | abirkill quits [Ping timeout: 260 seconds] |
23:14:54 | | abirkill- is now known as abirkill |
23:17:41 | | Webuser724606 joins |
23:18:03 | | scurvy_duck joins |
23:18:31 | | abirkill- (abirkill) joins |
23:20:43 | | abirkill quits [Ping timeout: 260 seconds] |
23:20:44 | | abirkill- is now known as abirkill |
23:21:32 | | tertu2 (tertu) joins |
23:24:16 | | tertu quits [Ping timeout: 250 seconds] |
23:27:39 | | Island joins |
23:28:10 | | abirkill quits [Ping timeout: 250 seconds] |
23:28:27 | | abirkill- (abirkill) joins |
23:29:01 | | abirkill- is now known as abirkill |
23:31:38 | <Webuser724606> | If I configure the warrior with a small disk limit like 3 GB, do the projects know to skip anything that won't fit? |
23:35:11 | <nstrom|m> | not that I'm aware |
23:35:16 | <nstrom|m> | pretty sure they'll just fail |
23:35:25 | <nstrom|m> | abort the item then move on to the next |
23:35:29 | <nstrom|m> | but probably after downloading a bunch |
23:35:34 | <opl> | i don't believe so either. for many items it wouldn't even be possible to determine the space required in advance |
23:38:31 | | kitonthenet joins |
23:38:38 | <@JAA> | And parallelism would require coordination between items, which isn't a thing. |
23:41:30 | <nicolas17> | a large item filling the disk could also cause other smaller items to fail with "disk full" |
23:45:52 | | lunik11 quits [Quit: :x] |
23:46:23 | | lunik11 joins |
23:48:58 | | abirkill quits [Ping timeout: 250 seconds] |
23:58:04 | | scurvy_duck quits [Ping timeout: 250 seconds] |