00:10:46ericgallager quits [Quit: This computer has gone to sleep]
00:12:26cm quits [Ping timeout: 260 seconds]
00:38:41archivist99 quits [Ping timeout: 260 seconds]
00:43:39<h2ibot>JustAnotherArchivist edited Deathwatch (+306, /* 2025 */ Add 2025-05-15 Flickr download…): https://wiki.archiveteam.org/?diff=55556&oldid=55549
00:44:39<h2ibot>Bear edited List of websites excluded from the Wayback Machine/Time exclusions (-21, more recent date): https://wiki.archiveteam.org/?diff=55557&oldid=55555
01:03:48some_body quits [Quit: Leaving.]
01:04:46some_body joins
01:11:31ericgallager joins
01:14:22HP_Archivist (HP_Archivist) joins
01:16:31useretail quits [Quit: Leaving]
01:17:13useretail joins
01:29:16chunkynutz6 quits [Read error: Connection reset by peer]
01:29:41chunkynutz6 joins
02:04:06nine quits [Client Quit]
02:04:10<ericgallager>DailyDot in danger: https://bsky.app/profile/couts.bsky.social/post/3lo4kou3ttc2o
02:04:19nine joins
02:04:19nine quits [Changing host]
02:04:19nine (nine) joins
02:12:52<pokechu22>pabs: I started a job for pr.fc2.com. Note that I don't really have any fancy tooling for it; I just use find/replace in notepad++ (in regex mode, using ^ and $). I've also been splitting them at 40K lines (the spec says 50K is the max; I chose a smaller size to be safe); I noticed issues with larger ones even though I couldn't find anything in wpull that actually enforces
02:12:54<pokechu22>that 50K limit
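A minimal sketch (not pokechu22's actual notepad++ workflow) of the same idea: wrap a plain URL list into sitemap files capped at 40,000 entries each, escaping &, < and > so the XML stays valid. The file names are placeholders.

# Sketch only: split "urls.txt" (one URL per line) into sitemap_N.xml files
# of at most 40,000 entries, well under the 50,000-URL limit in the spec.
from xml.sax.saxutils import escape

CHUNK = 40_000

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for i in range(0, len(urls), CHUNK):
    with open(f"sitemap_{i // CHUNK}.xml", "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls[i:i + CHUNK]:
            # escape() turns & into &amp; so query strings don't break the XML
            out.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        out.write("</urlset>\n")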
02:37:40ericgallager quits [Client Quit]
02:44:21ericgallager joins
02:57:52ericgallager quits [Client Quit]
03:04:00<pabs>pokechu22: I see, thanks. have you tested using HTML instead of sitemaps btw?
03:15:16<h2ibot>JustAnotherArchivist edited Deathwatch (+249, /* 2025 */ Hayabusa.oRg update): https://wiki.archiveteam.org/?diff=55558&oldid=55556
03:18:17<h2ibot>JustAnotherArchivist edited Deathwatch (+192, /* 2025 */ Add SAGAPlus+): https://wiki.archiveteam.org/?diff=55559&oldid=55558
03:19:17<h2ibot>PaulWise edited ArchiveBot (+41, document the size limit): https://wiki.archiveteam.org/?diff=55560&oldid=55473
03:26:04ericgallager joins
03:26:12<nicolas17>"The staging server, known as Fortress of Solitude (FOS), is the place where all the WARC files are temporarily uploaded."
03:26:14<nicolas17>x_x
03:26:18<nicolas17>is that still a thing?
03:27:19<h2ibot>PaulWise edited Mailing Lists (+1201, more sympa lists): https://wiki.archiveteam.org/?diff=55561&oldid=55373
03:27:20<h2ibot>JustAnotherArchivist moved Fc2web to FC2WEB (All capitals is the canonical spelling): https://wiki.archiveteam.org/?title=FC2WEB
03:29:19<h2ibot>Nicolas17v2 edited ArchiveBot (+1, /* Components */ wiki markup: use a definition…): https://wiki.archiveteam.org/?diff=55564&oldid=55560
03:30:02<pabs>nicolas17: no, https://wiki.archiveteam.org/index.php/Fortress_of_Solitude from https://wiki.archiveteam.org/index.php/Archiveteam:Acronyms
03:30:21<nicolas17>archivebot page mentions FOS several times
03:30:48<pabs>probably outdated :)
03:36:01ericgallager quits [Client Quit]
03:37:07<pokechu22>pabs: I haven't, but the thing with HTML is that I'd need to write <a href="https://example.com">https://example.com</a> instead of just <url><loc>https://example.com</loc></url>
03:37:52<pabs>I see, you could write 1 2 3 for the text, but that's still an additional thing to add
03:38:18<@JAA>FOS does still exist.
03:38:20<h2ibot>JustAnotherArchivist edited Deathwatch (-1516, /* 2025 */ Cleanup, realphabetise): https://wiki.archiveteam.org/?diff=55565&oldid=55559
03:38:21<pabs>I assume it would bypass the 50K line limit though
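If an HTML list were used instead, a sketch like this could generate it from the same URL list; the file names are placeholders, and whether ArchiveBot handles a huge page of links better than a huge sitemap is exactly the open question above.

# Sketch: emit a bare-bones HTML page of links as an alternative to a sitemap.
# "urls.txt" and "links.html" are made-up names.
import html

with open("urls.txt") as f, open("links.html", "w", encoding="utf-8") as out:
    out.write("<!DOCTYPE html>\n<html><body>\n")
    for n, line in enumerate(f, 1):
        url = line.strip()
        if url:
            # the link text can be anything short, e.g. a running number
            out.write(f'<a href="{html.escape(url, quote=True)}">{n}</a>\n')
    out.write("</body></html>\n")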
03:38:53<pabs>FOS isn't used for AB though?
03:39:04<@JAA>What 50k line limit?
03:39:29<@JAA>No, AB stopped using FOS a few years ago.
03:40:51<pabs>pokechu22 says the [sitemap] spec says 50K is the max
03:41:49<pokechu22>Yeah, and when I tried a sitemap with several hundred thousand, AB didn't seem to load all of it
03:44:21<@JAA>I'm not aware of such a limit in the code.
03:44:22<h2ibot>JustAnotherArchivist edited Deathwatch (-17, /* 2025 */ Fix ref for MapleStory 2): https://wiki.archiveteam.org/?diff=55566&oldid=55565
03:49:23<h2ibot>JustAnotherArchivist edited Deathwatch (+432, /* 2025 */ Restore items falsely removed in…): https://wiki.archiveteam.org/?diff=55567&oldid=55566
03:53:09<pokechu22>Right, in https://ab2f.archivingyoursh.it/dakvgu6tnu591ks0tihjnnnp4.jsonl after it downloads https://transfer.archivete.am/TFYMu/www1.plala.or.jp_thru_www17.plala.or.jp_sitemap.xml queued went from 51 to 43053... which is a really odd number. Though that is also malformed
03:53:09<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/TFYMu/www1.plala.or.jp_thru_www17.plala.or.jp_sitemap.xml
04:02:32ericgallager joins
04:07:58DogsRNice quits [Read error: Connection reset by peer]
04:12:26ericgallager quits [Client Quit]
04:15:27<h2ibot>JustAnotherArchivist edited List of websites excluded from the Wayback Machine/Time exclusions (+174): https://wiki.archiveteam.org/?diff=55568&oldid=55557
04:15:28<h2ibot>JustAnotherArchivist edited List of websites excluded from the Wayback Machine/Former exclusions (+104): https://wiki.archiveteam.org/?diff=55569&oldid=53774
04:16:27<h2ibot>JustAnotherArchivist edited List of websites excluded from the Wayback Machine (-60, Move unexcluded sites to subpages): https://wiki.archiveteam.org/?diff=55570&oldid=55554
04:24:12<gamer191-1|m>https://x.com/grok/status/1917905876301824364 Grok has gone hilariously rogue and started criticising Elon (and possibly even leaked information about internal training data https://x.com/grok/status/1917990656922042386). Is it possible to quickly archive Grok's 7.4 million replies, in case Elon shuts Grok off?
04:24:13<eggdrop>nitter: https://nitter.net/grok/status/1917905876301824364 https://nitter.net/grok/status/1917990656922042386
04:25:09<nicolas17>gamer191-1|m: afaik we have no ability to archive anything from twitter atm
04:25:54<@JAA>We can grab individual tweets, but not at that scale, and there's no method for discovering them either.
04:26:11<@JAA>At least I think it works in #jseater anyway.
04:30:45atphoenix__ is now known as atphoenix
04:43:41cmlow quits [Ping timeout: 260 seconds]
05:00:26cyanbox joins
05:04:05<Vokun>Stuff like twitter makes it feel like it might be worth having a few archiveteam accounts just to go through things slowly with
05:09:54nine quits [Quit: See ya!]
05:10:07nine joins
05:10:07nine quits [Changing host]
05:10:07nine (nine) joins
05:10:31aninternettroll quits [Ping timeout: 260 seconds]
05:14:36katocala quits [Ping timeout: 260 seconds]
05:34:37lennier2 quits [Read error: Connection reset by peer]
05:34:52lennier2 joins
05:44:56nine quits [Ping timeout: 260 seconds]
06:04:51nine joins
06:04:52nine quits [Changing host]
06:04:52nine (nine) joins
06:08:07aninternettroll (aninternettroll) joins
06:13:20ericgallager joins
06:13:26aninternettroll quits [Remote host closed the connection]
06:13:36aninternettroll (aninternettroll) joins
06:22:46sec^nd quits [Remote host closed the connection]
06:23:07sec^nd (second) joins
06:46:31Wohlstand quits [Quit: Wohlstand]
06:57:16Jens quits []
06:57:45Jens (JensRex) joins
07:18:34Wohlstand (Wohlstand) joins
07:21:31Island quits [Read error: Connection reset by peer]
07:30:47Wohlstand quits [Client Quit]
08:17:41pabs quits [Read error: Connection reset by peer]
08:18:38pabs (pabs) joins
08:39:16<hlgs|m>[@pokechu22:hackint.org](https://matrix.to/#/@pokechu22:hackint.org): thank you!! hadn't realised that would make the difference
08:41:25flotwig quits [Read error: Connection reset by peer]
08:42:51flotwig joins
08:45:38<hlgs|m>[@JAA:hackint.org](https://matrix.to/#/@JAA:hackint.org): how does grabbing individual tweets go? i've got a list of a hundred or two tweets i've had to archive via archive.is because the wayback machine couldn't. i did get the media on there, but archive.is doesn't seem to give me the URL for it from its archives... it would be great to archive those tweets and have the media accessible from the tweet URL
08:48:50<pabs>with #jseater it worked, see the screenshots on https://mnbot.very-good-quality-co.de/item/8ead18bb-722e-4465-98bd-119c286abb62 https://mnbot.very-good-quality-co.de/item/0bc0d126-752d-4192-88a4-5c9541bc599e
08:49:08<pabs>mnbot is experimental and currently doesn't feed into the WBM though
08:49:43<pabs>and currently doesn't support recursive or bulk archiving, just one-at-a-time saves
08:50:45<pabs>looks like it didn't get any of the replies, just the tweets themselves
08:51:16<pabs>nitter.net can get the replies, but we don't have an instance of that for archiving
09:12:56FiTheArchiver joins
09:19:34<hlgs|m>the trouble i see with archiving nitter versions is that you need to know to look there. and it is in a way archiving a copy of the page rather than the page at its home address, which makes it less trustworthy as a source (not that i think nitter instances are changing tweets, but from a research point of view, it wouldn't be as good of a citation, e.g.)
09:24:34<steering>yeah, well, ask musk :)
09:24:54<pabs>btw, you can add the users to here for archiving when we can https://pad.notkiska.pw/p/archivebot-twitter
09:26:04<pabs>and https://wiki.archiveteam.org/index.php/Twitter
09:26:44<hlgs|m>can i add individual tweet links, rather than whole users?
09:27:04<hlgs|m>most of the links are still live, just at risk of going down if people delete their accounts
09:27:15<hlgs|m>(which more and more people are doing)
09:33:31<pabs>probably yes. maybe upload a text file of them to https://transfer.archivete.am/
09:35:21nulldata quits [Ping timeout: 260 seconds]
09:37:38<hlgs|m>👍
09:38:17<hlgs|m>i'll do that after i've added a few more in the coming weeks, finishing up a personal archiving project
09:47:23ducky quits [Remote host closed the connection]
09:51:40ducky (ducky) joins
09:56:00kedihacker (kedihacker) joins
10:02:02Lunarian1 (LunarianBunny1147) joins
10:05:41LunarianBunny1147 quits [Ping timeout: 260 seconds]
10:07:27nulldata (nulldata) joins
10:47:21SootBector quits [Remote host closed the connection]
10:48:32SootBector (SootBector) joins
11:00:01Bleo182600722719623455 quits [Quit: The Lounge - https://thelounge.chat]
11:02:47Bleo182600722719623455 joins
11:56:43Webuser885377 quits [Quit: Ooops, wrong browser tab.]
12:34:13<sensitiveParrot>Hi all, I've been lurking here for a few days, but today I came across this site which I find very valuable. Can someone help guide me on how best to mass download all the files (about 33k audio files)? I should say that I am not very familiar with web crawling or anything like that yet. Site: https://sound-effects.bbcrewind.co.uk/search
12:40:14riteo quits [Ping timeout: 258 seconds]
12:43:21<c3manu>sensitiveParrot: hi! would you like to download them for yourself, or would you like to get them into the wayback machine?
12:45:00<pabs>the site seems to need JavaScript, so needs something somewhat custom
12:45:35<sensitiveParrot>c3manu: Mostly for myself, but I think both would be good though
12:46:11<sensitiveParrot>pabs: Noted. I don't have the skills to set up something like that right now, but I might try and look into it. Any suggestions on where to start with that sort of thing?
12:47:48<pabs>use the browser devtools to find out what requests the JS makes, and figure out how to get a list of URLs from that. then either download them or we throw them into AB or both
12:48:22<pabs>this seems to be the main API URL https://sound-effects-api.bbcrewind.co.uk/api/sfx/cached/search
12:48:45<pabs>returns a small number of results, out of a total of 33066
12:51:09<pabs>and then request the JSON file for each item, for eg https://sound-effects-media.bbcrewind.co.uk/waveform/NHU05104246.json
12:51:33<pabs>and then request the MP3 file for each item, for eg https://sound-effects-media.bbcrewind.co.uk/waveform/NHU05104246.mp3
12:51:58<sensitiveParrot>Awesome, thank you! I will try and tinker with that when I have some extra time ^^
12:53:51<pabs>looks like to get more items, you do a POST to https://sound-effects-api.bbcrewind.co.uk/api/sfx/search with some JSON {"criteria":{"from":0,"size":20,"tags":null,"categories":null,"durations":null,"continents":null,"sortBy":null,"source":null,"recordist":null,"habitat":null}}
12:55:02<sensitiveParrot>Thanks for that added detail ^^
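A minimal Python sketch of that POST, using only the URL and body shape quoted above; it is untested against the live API and the response field names aren't documented here.

# Sketch of the search POST pabs describes above (untested).
import json
import urllib.request

body = {"criteria": {"from": 0, "size": 20, "tags": None, "categories": None,
                     "durations": None, "continents": None, "sortBy": None,
                     "source": None, "recordist": None, "habitat": None}}
req = urllib.request.Request(
    "https://sound-effects-api.bbcrewind.co.uk/api/sfx/search",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:  # passing data= makes this a POST
    results = json.load(resp)
print(list(results.keys()))  # inspect the response shape before relying on field names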
12:58:09<pabs>hmm, resending the same POST request returns no results
12:59:25<pabs>hmm, edit and resend doesn't work, but resend does
13:01:49<pabs>can only fetch 10000 at a time
13:02:11<pabs>the error also says "See the scroll api for a more efficient way to request large data sets."
13:04:00<pabs>10000 gives Internal server error
13:08:11<sensitiveParrot>Interesting. You really don't have to do all this now pabs, although I appreciate it! I guess fetching 10000 at a time is fine, then it can be done in 4 chunks, but maybe whatever the scroll api is, it's worth looking into
13:09:29<pabs>found a minimal curl command that works: curl 'https://sound-effects-api.bbcrewind.co.uk/api/sfx/search' -X POST -H 'Content-Type: application/json' --data-raw '{"criteria":{"from":0,"size":6000}}'
13:09:38<pabs>7000 gives an error too
13:10:46<pabs>so yeah, the WBM doesn't support POST requests, so anything you send there via AB etc won't work properly, but you could at least get all the files into the WBM in one job
13:11:20<sensitiveParrot>Okay, and what is WBM referring to?
13:12:35<pabs>wayback machine, web.archive.org. more jargon on https://wiki.archiveteam.org/index.php/Archiveteam:Acronyms
13:13:38<pabs>oh, they have both mp3 and wav downloads
13:13:41<sensitiveParrot>Oh right, of course
13:13:47<sensitiveParrot>Ahh yeah, I would strongly prefer wav
13:16:59<pabs>and they put the wav into a zip. there is also a license clickthrough, but no single-use tokens or anything like that
13:17:04<pabs>so this works https://sound-effects-media.bbcrewind.co.uk/zip/NHU05011147.wav.zip
13:17:17<pabs>full download URL being https://sound-effects-media.bbcrewind.co.uk/zip/NHU05011147.wav.zip?download&rename=BBC_Mandrill-B_NHU0501114
13:17:48<sensitiveParrot>So playing around with the curl command, I was able to do the following: curl 'https://sound-effects-api.bbcrewind.co.uk/api/sfx/search' -X POST -H 'Content-Type: application/json' --data-raw '{"criteria":{"from":0,"size":6119}}'. If I increase the size any further I get an Internal server error though
13:18:18<pabs>also non-zipped wav URLs exist https://sound-effects-media.bbcrewind.co.uk/wav/NHU05011147.wav
13:18:52<pabs>yep, so adjust that from and you should get the ones after the 6k
13:19:33<sensitiveParrot>Ahh, so then I just set the from to whatever the size was +1?
13:21:20<pabs>I think so yeah. try with size 1 to verify
13:21:42<pabs>this URL works, but it does a POST request too :( https://sound-effects.bbcrewind.co.uk/search?resultSize=400
13:22:20<pabs>ah resultSize is limited to 300
13:22:31<sensitiveParrot>Hmm, no, I get an error saying: Result window is too large, from + size must be less than or equal to: [10000] but was [61201]. It seems like instead of doing 6121, it's doing 61201 for some reason
13:23:29<pabs>ah. might need to use the other criteria (see above) to reduce the result count for each query
13:24:46<sensitiveParrot>which other criteria? Sorry if I'm a bit lost :D
13:25:34<pabs>"tags":null, etc from above
13:28:13<pabs>all the categories except Nature are under 6k, so that might be a good one to filter by
13:29:37<pabs>and then do nature filtered by each continent
13:31:25<pabs>curl 'https://sound-effects-api.bbcrewind.co.uk/api/sfx/search' -X POST -H 'Content-Type: application/json' ' -H 'Connection: keep-alive' --data-raw '{"criteria":{"from":0,"size":6119,"tags":null,"categories":["Nature"],"durations":null,"continents":["Africa"],"sortBy":null,"source":null,"recordist":null,"habitat":null}}'
13:33:25<pabs>err, stray ' quote char there
13:35:27<sensitiveParrot>Hmm, I get this: no matches found: keep-alive --data-raw {criteria:categories:[Nature]}
13:40:13T31M quits [Quit: ZNC - https://znc.in]
13:40:23<nimaje>sensitiveParrot: did you remove that stray ' between setting the Content-Type and Connection header?
13:41:06T31M joins
13:41:07<sensitiveParrot>Oh right, I removed the wrong '
13:41:09<sensitiveParrot>Thanks :D
13:42:44<sensitiveParrot>Yep, now that's working. So basically, if I run the following command first and then this one you just shared, I will capture everything?
13:42:44<sensitiveParrot>curl 'https://sound-effects-api.bbcrewind.co.uk/api/sfx/search' -X POST -H 'Content-Type: application/json' --data-raw '{"criteria":{"from":0,"size":6119}}'
13:44:01<sensitiveParrot>Or will that just be 6119 x 2, which means only 12238 out of the total 33066 files?
13:44:32<pabs>no you will need to do null continents, then vary the categories, then category Nature and vary the continents. that should get everything
13:44:46<sensitiveParrot>Ahh right, that makes sense
13:45:01<sensitiveParrot>Thank you so much, this has been very helpful!
13:45:02<pabs>and after that you can extract the IDs from the data and get the wav/mp3/zip
13:45:30<sensitiveParrot>yep, that part is pretty clear to me thankfully lol
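Putting the plan together, a sketch of how that enumeration and download could look. Only the Nature category, the 10,000-result window, and the zip URL pattern come from the discussion above; the other category/continent values and the response field names ("results", "id") are assumptions to check against the real API.

# Sketch of pabs' plan: query each category except Nature in one pass,
# then Nature split by continent, collect IDs, then print the wav zip URLs.
import json
import time
import urllib.request

API = "https://sound-effects-api.bbcrewind.co.uk/api/sfx/search"
MEDIA = "https://sound-effects-media.bbcrewind.co.uk/zip/{}.wav.zip"

def search(categories=None, continents=None, size=6000):
    body = {"criteria": {"from": 0, "size": size, "tags": None,
                         "categories": categories, "durations": None,
                         "continents": continents, "sortBy": None,
                         "source": None, "recordist": None, "habitat": None}}
    req = urllib.request.Request(API, data=json.dumps(body).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

ids = set()
# Hypothetical filter values; take the real lists from the site's own filter UI.
categories = ["Ambience", "Animals", "Machines", "Nature"]
continents = ["Africa", "Asia", "Europe", "North America", "Oceania", "South America"]

for cat in categories:
    if cat == "Nature":  # too big for one result window, so split by continent
        for cont in continents:
            for item in search(categories=["Nature"], continents=[cont]).get("results", []):
                ids.add(item["id"])
            time.sleep(1)  # be gentle with the API
    else:
        for item in search(categories=[cat]).get("results", []):
            ids.add(item["id"])
        time.sleep(1)

for sfx_id in sorted(ids):
    print(MEDIA.format(sfx_id))  # feed these URLs to a downloader or to ArchiveBot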
13:50:27ericgallager quits [Quit: This computer has gone to sleep]
14:45:37BearFortress quits [Read error: Connection reset by peer]
14:47:21grill (grill) joins
15:05:54BearFortress joins
15:08:15BearFortress quits [Read error: Connection reset by peer]
15:13:19riteo (riteo) joins
15:35:53cyanbox quits [Read error: Connection reset by peer]
15:38:11aninternettroll quits [Ping timeout: 260 seconds]
15:42:50aninternettroll (aninternettroll) joins
15:43:26nulldata quits [Ping timeout: 260 seconds]
16:14:05nulldata (nulldata) joins
16:41:42khaoohs quits [Read error: Connection reset by peer]
16:54:29khaoohs joins
17:08:26FiTheArchiver1 joins
17:11:38FiTheArchiver quits [Ping timeout: 258 seconds]
17:17:51simon816 quits [Remote host closed the connection]
17:20:51simon816 (simon816) joins
17:23:56flotwig quits [Read error: Connection reset by peer]
17:26:37flotwig joins
17:26:39ericgallager joins
17:33:06grill quits [Ping timeout: 260 seconds]
18:06:47<@arkiver>my setup is restored!
18:07:07<@arkiver>i did miss some pings, please re-ping me if there was something in the past few days
18:07:14<@arkiver>i'll go through channels as well though
18:07:31Island joins
18:16:52<@imer>welcome back!
18:19:00Wohlstand (Wohlstand) joins
18:21:45<@arkiver>thank you :)
18:25:37sec^nd quits [Ping timeout: 240 seconds]
18:27:09<pokechu22>pabs, JAA: I think I figured out what went wrong with https://transfer.archivete.am/TFYMu/www1.plala.or.jp_thru_www17.plala.or.jp_sitemap.xml that caused only ~40k lines of the ~478k to be queued: it's the unescaped &. The first time I made that mistake was in https://transfer.archivete.am/ILS4I/vesselhistory.marad.dot.gov_list.xml for
18:27:10<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/TFYMu/www1.plala.or.jp_thru_www17.plala.or.jp_sitemap.xml https://transfer.archivete.am/inline/ILS4I/vesselhistory.marad.dot.gov_list.xml
02:27:11<pokechu22>https://transfer.archivete.am/15MNt7/vesselhistory.marad.dot.gov_seed_urls.txt which didn't have any semicolons, and on that one the & was just removed (changing https://vesselhistory.marad.dot.gov/ShipHistory/ShipList?keywords=%23&pageNumber=1&matchFromStart=True&X-Requested-With=XMLHttpRequest to
18:27:13<pokechu22>https://vesselhistory.marad.dot.gov/ShipHistory/ShipList?keywords=%23pageNumber=1matchFromStart=TrueX-Requested-With=XMLHttpRequest). But the plala one includes semicolons in some URLs, so maybe it's treating the whole multi-line, multi-tag string from the & to the ; as one big invalid entity instead. Though if I remove &[^;]+; I only get ~27k lines so that doesn't entirely
18:27:15<pokechu22>add up
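A short illustration of the fix: the & characters in a query string have to be written as &amp; inside <loc>, which (if pokechu22's guess above is right) is what stops the parser from swallowing everything up to the next ; as a bogus entity. The URL is the one from the message above.

# Escape & as &amp; when writing sitemap entries.
from xml.sax.saxutils import escape

url = ("https://vesselhistory.marad.dot.gov/ShipHistory/ShipList"
       "?keywords=%23&pageNumber=1&matchFromStart=True"
       "&X-Requested-With=XMLHttpRequest")
print(f"<url><loc>{escape(url)}</loc></url>")
# -> ...keywords=%23&amp;pageNumber=1&amp;matchFromStart=True&amp;X-Requested-With=...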
18:31:11sec^nd (second) joins
18:39:57grill (grill) joins
18:46:52Megame (Megame) joins
18:49:00pokechu22 quits [Ping timeout: 258 seconds]
18:55:41pokechu22 (pokechu22) joins
19:06:42<h2ibot>Pokechu22 edited Web Roasting/ISP Hosting (+168, atw.hu): https://wiki.archiveteam.org/?diff=55571&oldid=54321
19:13:31<@JAA>pokechu22: Hmm yeah, that sounds plausible. I'm not sure what the parser does with that exactly. I just saw that it apparently uses the HTML parser for the sitemap, so...
19:14:37<pokechu22>I'm going to try a single large sitemap for atw.hu (well, probably 2 sitemaps, one for URLs discovered from search engines and one from CDX) and see how that goes
19:33:36cm joins
19:56:15notarobot1 quits [Quit: Ping timeout (120 seconds)]
19:56:30notarobot1 joins
19:58:00<h2ibot>Cooljeanius edited Deathwatch (+6, /* 2025 */ minor copyedits): https://wiki.archiveteam.org/?diff=55572&oldid=55567
19:58:21grill quits [Ping timeout: 260 seconds]
20:10:01Snivy quits [Ping timeout: 260 seconds]
20:15:29Snivy (Snivy) joins
20:32:03DogsRNice joins
21:31:41eroc1990 quits [Ping timeout: 260 seconds]
21:33:41eroc1990 (eroc1990) joins
21:54:25etnguyen03 (etnguyen03) joins
22:01:28nulldata2 (nulldata) joins
22:09:37nulldata2 quits [Client Quit]
22:10:06nulldata2 (nulldata) joins
22:11:46nulldata is now known as nulldata-alt
22:12:23nulldata2 is now known as nulldata
22:14:27etnguyen03 quits [Client Quit]
22:18:37BornOn420 quits [Remote host closed the connection]
22:19:10BornOn420 (BornOn420) joins
22:19:57lennier2_ joins
22:23:01lennier2 quits [Ping timeout: 260 seconds]
23:16:28FiTheArchiver1 quits [Read error: Connection reset by peer]
23:16:41<BlankEclair>does AT still archive reddit posts from a url list, or is that on hiatus?
23:21:04<that_lurker>IA archives reddit now as far as I know
23:27:58<BlankEclair>oh neat
23:36:03Wohlstand quits [Quit: Wohlstand]
23:39:15dabs joins