00:00:44<pokechu22>The meta one is the job log and should be uploaded. The .cdx file is normally derived from the WARC by IA itself, though I don't know if that always happens or only happens for items that get indexed by web.archive.org.
00:00:47tekulvw quits [Ping timeout: 272 seconds]
00:04:14<klea>I think it only happens for items that get indexed by web.archive.org? https://archive.org/download/limewire.com_d_7xNKB_NfXjrIqBWo
00:04:32<klea>Tho, maybe it was me not running the derive thing after every file
00:04:35<klea>lemme make it derive.
00:04:43<klea>(if i remember how to)
00:06:24<klea>It seems that if you have IA derive enabled (which is the default on the web uploader, I believe), it will make a CDX. <https://archive.org/log/5191197716> claims it will do a CDXIndex.
00:06:54<klea>Huh
00:06:58<klea>[ PST: 2026-02-16 16:05:08 ] Executing: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 600 /petabox/sw/bin/cdx_writer.pex 'WARCPROX-20260216205304743-00000-y1i40ow9.warc.gz' --file-prefix='limewire.com_d_7xNKB_NfXjrIqBWo' --exclude-list='/petabox/sw/wayback/web_excludes.txt' --stats-file='/f/_limewire.com_d_7xNKB_NfXjrIqBWo/cdxstats.json'>
00:06:58<klea>'/t/_limewire.com_d_7xNKB_NfXjrIqBWo/cdx.txt'
00:07:09<klea>Wait a second.
00:07:35<klea>Couldn't that be a way to bulk-check lots of URLs: make a WARC with records for lots of URLs, then get the CDX and see what's apparently missing?
00:07:50<klea>Then you'd request deletion of all that crap, because nobody wants it.
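[The bulk-check idea above can be sketched in a few lines. A CDX file is line-oriented, one record per capture, with space-separated fields; the sketch below assumes the common layout declared by a ` CDX N b a m s k ...` header line, where the third field is the original URL and the fifth is the HTTP status. The sample records are illustrative, not taken from the item in the log.]

```python
# Minimal CDX reader: list each captured URL with its HTTP status code.
# Assumes the common field order (urlkey, timestamp, original URL,
# MIME type, status, ...) produced by tools like cdx_writer.
def read_cdx(lines):
    records = []
    for line in lines:
        if line.startswith(" CDX"):  # header line declaring the field order
            continue
        fields = line.split()
        if len(fields) < 5:
            continue
        records.append((fields[2], fields[4]))  # (original URL, status)
    return records

sample = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20260216205304 http://example.com/ text/html 200 AAAA - - 1234 0 test.warc.gz",
    "com,example)/gone 20260216205310 http://example.com/gone unk 404 BBBB - - 567 1234 test.warc.gz",
]
for url, status in read_cdx(sample):
    print(status, url)
```

[Filtering the result for non-2xx statuses would give the "what's apparently missing" list.]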
00:19:07nine quits [Quit: See ya!]
00:19:20nine joins
00:19:20nine quits [Changing host]
00:19:20nine (nine) joins
00:23:56<cruller>TheoH7: I uploaded the entire output directory. https://archive.org/details/community.jisc.ac.uk-2026-02-16-35e53623-00000
00:32:50etnguyen03 quits [Client Quit]
00:40:14SootBector quits [Remote host closed the connection]
00:41:22SootBector (SootBector) joins
00:42:30<TheoH7>cruller: Thanks, have downloaded it.
00:43:16<TheoH7>It looks like I also managed to do one where hard-coded links to https://community.ja.net (old address from years ago) are clickable in the WARC. I will upload that to IA likely in a few hours.
00:44:16<TheoH7>To upload the whole directory, is the best way to zip and upload, or can you select a whole folder for upload?
00:48:36<pokechu22>You can upload multiple files at once within a directory (uploading directories/subdirectories might also be possible but I think is more complicated?)
00:50:57<TheoH7>pokechu22: Great, will do that.
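[Uploading a whole output directory can also be scripted with the `internetarchive` Python library instead of the web uploader. A sketch, where the identifier and local path are placeholders and credentials are assumed to be configured via `ia configure`:]

```python
import os

def collect_files(root):
    """Gather every file under root, keyed by its path relative to root,
    so subdirectory structure is preserved as remote filenames."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            files[os.path.relpath(full, root)] = full
    return files

# Placeholder identifier and directory; requires `pip install internetarchive`.
# from internetarchive import upload
# upload("my-grab-site-crawl-item", files=collect_files("./crawl-output"))
```

[The `files` mapping of remote name to local path lets subdirectories upload under their relative paths rather than flattened into the item root.]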
00:51:48<TheoH7>Seems one of my crawls has somehow managed to start crawling old versions of this site stored on the Wayback Machine, which is odd. I've added the pattern to ignores, but I'm just curious how grab-site would've found and started crawling such URLs.
00:52:11<TheoH7>I do already have 1 crawl done without that, though, and will only upload the 2nd one if it contains materially more content
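[grab-site reads live ignore patterns (Python regexes, one per line) from the `ignores` file in the crawl directory. The exact pattern used above isn't shown in the log; the one below is an assumed illustration of what skipping Wayback Machine URLs could look like:]

```python
import re

# Hypothetical regex one might append to grab-site's `ignores` file
# to stop the crawler from following Wayback Machine capture URLs:
WAYBACK_IGNORE = r"^https?://web\.archive\.org/web/"

pattern = re.compile(WAYBACK_IGNORE)
print(bool(pattern.match("https://web.archive.org/web/2015/https://community.ja.net/")))
print(bool(pattern.match("https://community.jisc.ac.uk/groups")))
```

[The first URL matches and would be ignored; the site's own pages do not match and keep being crawled.]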
01:02:20tekulvw (tekulvw) joins
01:03:29ducky quits [Ping timeout: 272 seconds]
01:04:19etnguyen03 (etnguyen03) joins
01:07:21tekulvw quits [Ping timeout: 268 seconds]
01:21:59tekulvw (tekulvw) joins
01:25:36Webuser614729 joins
01:26:29Webuser614729 quits [Client Quit]
01:27:43wotd joins
01:41:28pokechu22 quits [Quit: System maintenance]
02:22:10sec^nd quits [Remote host closed the connection]
02:22:35sec^nd (second) joins
02:36:40<nexussfan>There's a site dedicated to archiving Iranian series and films <https://nostalgik-tv.com/> which says they have 4 terabytes of videos. Would it be a good idea to archive it, or not now?
02:44:47APOLLO03 quits [Ping timeout: 268 seconds]
02:47:44ducky (ducky) joins
02:56:13nine quits [Ping timeout: 272 seconds]
02:58:38nine joins
02:58:40nine quits [Changing host]
02:58:40nine (nine) joins
03:13:09iPwnedYourIOTSmartdog quits [Ping timeout: 268 seconds]
03:13:46iPwnedYourIOTSmartdog joins
04:01:44etnguyen03 quits [Remote host closed the connection]
04:11:07tekulvw quits [Ping timeout: 268 seconds]
04:14:04tekulvw (tekulvw) joins
04:23:37tekulvw quits [Ping timeout: 272 seconds]
04:26:20Island quits [Read error: Connection reset by peer]