00:00:49<thuban>what are the odds of getting a super late dpos project in?
00:04:46<thuban>(js hell, but afaict all GET, so playback would work, and some reverse engineering's already been done)
00:31:50jasons quits [Ping timeout: 240 seconds]
01:09:57<fireonlive>-+rss- Flattr is closing down: https://flattr.com/ https://news.ycombinator.com/item?id=39040023
01:09:58<fireonlive>oh wow
01:10:18<fireonlive>https://dl.fireon.live/irc/0e497e22da4cc423/image.png < nice preview for flattr
01:26:20pedantic-darwin quits [Ping timeout: 240 seconds]
01:28:50khobragade quits [Ping timeout: 240 seconds]
01:33:46<project10>JAA: if I'm uploading a file to transfer, say via curl, and interrupt that with CTRL-C, will transfer clean up the partially transferred file? or does it leave a stub on e.g. wasabi
01:34:39<@JAA>project10: I think it should clean up, but I've never actually checked.
01:35:20jasons (jasons) joins
01:35:46<project10>OK I found myself accidentally uploading a plaintext file instead of zst, and I interrupted it after probably ~10GiB had already been transferred. I hope wasabi won't charge 90 day storage on that :(
01:42:51<@JAA>Found it. Looks like there's a metadata file left behind, but that's tiny. No idea about billing, but since it fails at the multipart upload step, I'd think it shouldn't count.
01:47:24<project10>Great, thanks for checking, in this case we actually do want the `rm` :D
01:50:40<Terbium>rewby: if you need some backup space at Hetzner temporarily, give me a ping
01:54:17pedantic-darwin joins
01:55:50parfait quits [Ping timeout: 240 seconds]
01:56:01sec^nd quits [Remote host closed the connection]
01:56:19sec^nd (second) joins
01:58:18BlueMaxima joins
02:26:41kiryu quits [Client Quit]
02:35:07jasons quits [Ping timeout: 272 seconds]
03:19:08<mgrandi>Not sure if anyone put this into whatever project already, but https://github.com/Andre0512/hon and https://github.com/Andre0512/pyhOn are going to be taken down really soon
03:19:29<@JAA>#gitgud
03:19:51<mgrandi>Thanks
03:20:51<@JAA>Looks like you got beat by 66 seconds. lol
03:32:05AlsoHP_Archivist joins
03:34:39HP_Archivist quits [Ping timeout: 272 seconds]
03:38:20jasons (jasons) joins
03:39:40<fireonlive>close!
03:59:57Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ]
04:00:34Shjosan (Shjosan) joins
04:02:08<pabs>https://www.reuters.com/markets/deals/reddit-seeks-launch-ipo-march-sources-2024-01-18/
04:15:04<fireonlive>"[ih] Dave Mills has passed away" ~ https://elists.isoc.org/pipermail/internet-history/2024-January/009265.html https://news.ycombinator.com/item?id=39051246
04:31:21bf_ quits [Remote host closed the connection]
04:48:07tzt quits [Ping timeout: 272 seconds]
04:49:43HP_Archivist (HP_Archivist) joins
04:51:17AlsoHP_Archivist quits [Ping timeout: 272 seconds]
05:05:51HP_Archivist quits [Ping timeout: 272 seconds]
05:25:42HP_Archivist (HP_Archivist) joins
05:33:52AlsoHP_Archivist joins
05:36:15HP_Archivist quits [Ping timeout: 272 seconds]
05:36:34DogsRNice quits [Read error: Connection reset by peer]
05:58:24HP_Archivist (HP_Archivist) joins
06:00:19AlsoHP_Archivist quits [Ping timeout: 272 seconds]
06:04:39AlsoHP_Archivist joins
06:05:23HP_Archivist quits [Ping timeout: 272 seconds]
06:10:21pabs quits [Client Quit]
06:11:05AlsoHP_Archivist quits [Ping timeout: 272 seconds]
06:15:45pabs (pabs) joins
06:29:13HP_Archivist (HP_Archivist) joins
06:29:24khobragade (khobragade) joins
06:30:43jasons quits [Ping timeout: 272 seconds]
06:32:31Doranwen quits [Remote host closed the connection]
06:32:57Doranwen (Doranwen) joins
06:33:53lennier2 quits [Ping timeout: 272 seconds]
06:35:23lennier2 joins
06:43:20khobragade quits [Ping timeout: 240 seconds]
06:47:10khobragade (khobragade) joins
07:27:05khobragade quits [Ping timeout: 272 seconds]
07:29:29Wohlstand (Wohlstand) joins
07:33:47jasons (jasons) joins
07:35:55BlueMaxima quits [Read error: Connection reset by peer]
07:50:37koon joins
08:20:04Dan joins
08:29:48Dan leaves
09:20:57icedice (icedice) joins
09:51:40<h2ibot>Flashfire42 edited List of websites excluded from the Wayback Machine (+21): https://wiki.archiveteam.org/?diff=51528&oldid=51527
10:00:03Bleo18260 quits [Client Quit]
10:00:42<h2ibot>JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=51529&oldid=51528
10:01:21Bleo18260 joins
10:07:36<that_lurker>https://www.thestack.technology/vmware-is-killing-off-56-products-including-vsphere-hypervisor-and-nsx/
10:38:27<that_lurker>Somewhat of a misleading title, as most are converted to varying subscription-based licenses
11:21:32bf_ joins
11:41:01bf_ quits [Remote host closed the connection]
11:45:28bf_ joins
11:52:27c3manu quits [Ping timeout: 272 seconds]
11:52:27shreyasminocha quits [Ping timeout: 272 seconds]
11:52:27evan quits [Ping timeout: 272 seconds]
11:54:59thehedgeh0g quits [Ping timeout: 272 seconds]
11:59:10evan joins
11:59:12thehedgeh0g (mrHedgehog0) joins
11:59:12c3manu (c3manu) joins
11:59:15shreyasminocha (shreyasminocha) joins
12:25:53Iki joins
12:28:33jasons quits [Ping timeout: 272 seconds]
13:01:50Arcorann quits [Ping timeout: 240 seconds]
13:28:54IRC2DC quits [Remote host closed the connection]
13:31:44jasons (jasons) joins
13:57:38IRC2DC joins
14:26:59jasons quits [Ping timeout: 272 seconds]
14:48:05Wohlstand quits [Client Quit]
14:55:26<nulldata>The https://irc.tl/ network is shutting down on February 26th
15:00:56<pabs>add to deathwatch?
15:01:11<nicolas17>what is there to archive tho? the website?
15:15:59IRC2DC quits [Remote host closed the connection]
15:16:17IRC2DC joins
15:30:23jasons (jasons) joins
15:58:25<jodizzle>re: Baltimore Sun in #archiveteam: there was a big crawl of it back in mid-2021. Big site, takes a long time to run.
15:59:05<jodizzle>We could re-run everything, but I was thinking I could pull out only the new articles from the sitemaps
16:00:37<jodizzle>Which I think should cover all the most important stuff.
16:18:14<@JAA>Also worth mentioning that we probably archived a bunch of it with #//.
16:19:00<jodizzle>Yeah, good point.
16:19:16<@JAA>It's among the stuff being retrieved regularly. The homepage every hour according to urls-sources, but that's outdated per arkiver.
16:29:13jasons quits [Ping timeout: 272 seconds]
16:31:40<fireonlive>anything dying should probably go on death watch, apparently third parties like it too
16:39:34<@JAA>++
16:41:18<ScenarioPlanet>What do you think about archiving old audio donations on DonationAlerts (~150 kB each, a lot of them)? The main problem is that it would require brute-forcing IDs (at least the server doesn't seem to ban IPs)
16:41:58<nicolas17>ScenarioPlanet: how big are the IDs?
16:42:24<@JAA>Also, some examples?
16:43:24<ScenarioPlanet>nicolas17, 10^7 to 10^8
16:44:27<ScenarioPlanet>JAA, http://static.donationalerts.ru/audiodonations/64886/64886380.wav
16:45:20<nicolas17>is this a russian site?
16:46:33<ScenarioPlanet>It is, a website used by livestreamers to provide donation alerts on live
16:46:41<@JAA>Is there metadata about the clips anywhere?
16:47:11<ScenarioPlanet>They have an API but it's auth-only
16:47:32<nicolas17>testing 64886
16:47:37<@JAA>Yeah, and 1 req/s, lol
16:47:58<nicolas17>found 39 audio files there (of 1000 possible IDs)
16:48:21<ScenarioPlanet>Also, I don't think it's possible to check the donations sent to other users
16:48:27<nicolas17>adding up to 4MB, so average 107KB
16:49:15<ScenarioPlanet>I've downloaded 3k just to test, average is 170k
16:49:37<ScenarioPlanet>Some of them can be PCM instead of Opus
16:51:07<@JAA>It always serves them with .wav?
16:53:58<ScenarioPlanet>Seems to be like that, yes
16:54:28<ScenarioPlanet>Even when it's actually a .webm, the server still serves it with a .wav extension
16:57:22<nicolas17>found 285 wavs in 10000 IDs, took about 2 minutes (aria2c -j20)
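[A back-of-envelope sizing sketch from the numbers above: IDs run from 10^7 to 10^8, nicolas17's 10k-ID sample hit 285 files, and ScenarioPlanet measured ~170 kB average. The URL layout (ID // 1000 as the directory) is inferred from the one sample link; the hit rate and average size are rough extrapolations, not measured totals.]

```python
def audio_url(clip_id: int) -> str:
    # e.g. 64886380 lives under .../audiodonations/64886/64886380.wav,
    # so the directory appears to be the ID with the last 3 digits dropped
    return f"http://static.donationalerts.ru/audiodonations/{clip_id // 1000}/{clip_id}.wav"

ID_SPACE = 10**8 - 10**7          # IDs run from 10^7 to 10^8 per the channel
HITS, SAMPLE = 285, 10_000        # files found in nicolas17's 10k-ID probe
AVG_SIZE = 170_000                # ~170 kB average from a 3k-file sample

est_files = ID_SPACE * HITS // SAMPLE     # integer math to avoid float drift
est_bytes = est_files * AVG_SIZE

print(audio_url(64886380))
print(f"~{est_files:,} files, ~{est_bytes / 1e9:.0f} GB")
```

[Roughly 2.6M files and a few hundred GB if the sampled density holds across the whole ID space, i.e. sizeable but not unreasonable for a DPoS-style project.]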
17:20:05<h2ibot>Usernam edited List of websites excluded from the Wayback Machine/Partial exclusions (+133): https://wiki.archiveteam.org/?diff=51530&oldid=51525
17:23:35AlsoHP_Archivist joins
17:24:50HP_Archivist quits [Ping timeout: 240 seconds]
17:25:17<aninternettroll>Hi, statistikk.lanekassen.no was moved under the main domain. To get the new pages, it's `curl https://lanekassen.no/api/mt1502/sitemap/lkno | jq -r '.sider[].contentLink.publicUri | select(. | startswith("/nb-NO/statistikk-og-analyse/"))'` Could someone throw those into archivebot?
17:25:47<aninternettroll>(PS, only /nb-NO/ pages should be added, /nn-NO/ and /en-US/ pages are just copies of the same thing)
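[An offline equivalent of the jq filter above, for anyone scripting the list instead: keep each page's publicUri when it sits under /nb-NO/statistikk-og-analyse/. The payload shape ("sider" -> objects with "contentLink.publicUri") is taken from the curl command; the sample data here is made up.]

```python
def statistics_pages(payload: dict) -> list[str]:
    # mirrors: .sider[].contentLink.publicUri
    #          | select(. | startswith("/nb-NO/statistikk-og-analyse/"))
    return [
        page["contentLink"]["publicUri"]
        for page in payload.get("sider", [])
        if page["contentLink"]["publicUri"].startswith("/nb-NO/statistikk-og-analyse/")
    ]

sample = {"sider": [
    {"contentLink": {"publicUri": "/nb-NO/statistikk-og-analyse/utdanningsstotte/"}},
    {"contentLink": {"publicUri": "/nn-NO/statistikk-og-analyse/utdanningsstotte/"}},
]}
print(statistics_pages(sample))
```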
17:26:26<@JAA>Them having the same content doesn't mean they shouldn't be archived.
17:27:28<aninternettroll>fair enough, then feel free to archive those as well. Same path, just that they start with a different language code
17:30:27<@JAA>Looks like a few other URLs are also new.
17:30:59<aninternettroll>Which ones?
17:31:27<aninternettroll>Oh like PDFs?
17:31:28<@JAA>https://lanekassen.no/nb-NO/laresteder/veiledning/informasjonsmateriell-til-larestedene/forste-regning/ and a bunch of stuff in /nb-NO/fellesinngang/.
17:31:40<@JAA>Didn't count the PDFs, but yes, those too.
17:31:51<aninternettroll>fellesinngang should all return 404 I think, it's not really ready
17:31:53jasons (jasons) joins
17:32:43<aninternettroll>Also lol just noticed the "X (Twitter)" in the footer
17:34:07<@JAA>Yeah, all 404 on fellesinngang.
17:36:06sec^nd quits [Ping timeout: 255 seconds]
17:38:14<@JAA>All done
17:38:49<aninternettroll>Thanks!
17:51:35sec^nd (second) joins
17:56:59<nicolas17>aninternettroll: if the pages are identical, and they are archived in the same AB job, they will be deduplicated so they don't waste storage
17:57:52<aninternettroll>I don't think they are literally 1:1 identical code wise, but for all intents and purposes they are the same
17:58:19<nicolas17>ah
17:59:44<aninternettroll>The same issue shows up for /en-US/laresteder pages, which are copies of /nb-NO/laresteder
18:03:07<pokechu22>I don't think archivebot deduplicates pages like that either, due to using .warc.gz, where each record is separately compressed and then concatenated. I think that deduplication only really works for .warc.zst (as used by DPoS)
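[A tiny stdlib illustration of pokechu22's point: when every record is its own independently compressed member (as in .warc.gz), two identical records cost twice the space, whereas one shared compression stream reuses the first copy. The "record" payload below is made up; zlib stands in for gzip members here.]

```python
import zlib

# A fake, highly compressible HTTP record body
record = (b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
          + b"<html>same page</html>" * 50)

per_record = len(zlib.compress(record)) * 2   # two independent members, .warc.gz style
one_stream = len(zlib.compress(record * 2))   # one shared stream sees the repetition

print(per_record, one_stream)
```

[The shared stream comes out far smaller than double, which is why duplicate pages in concatenated-gzip WARCs still pay full price unless revisit records are written.]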
18:05:01<@JAA>Well, in theory, it could write revisit records if the responses were identical, but AB does not support that.
18:05:33<@JAA>The custom dictionary in .warc.zst is unrelated.
18:05:36<nicolas17>revisit records is what I was thinking
18:05:40<nicolas17>I didn't know AB didn't support it
18:06:15<pokechu22>The custom dictionary does mean that duplicated or near-duplicated pages compress better than they do with .warc.gz, though, right?
18:06:31<nicolas17>not really, records are *still* independent
18:06:54<nicolas17>pages that have enough content in common *with the dictionary* will take less space
18:07:15<pokechu22>Isn't the dictionary generated from page content?
18:07:27<pokechu22>(for all of the pages)
18:07:36<@JAA>You need to train a dictionary for that. You can't start writing a .warc.zst before you have a dict.
18:07:57<@JAA>On DPoS projects, we regularly train a new dictionary with the uploaded data, so it gets better over time.
18:08:03sec^nd quits [Ping timeout: 255 seconds]
18:08:12<nicolas17>and that's semi-manual right?
18:08:17<pokechu22>Is there one dictionary per .warc.zst file or are they shared between them?
18:08:21<@JAA>Mostly automated, I believe.
18:08:42<pokechu22>I was assuming one dictionary per file generated at the same time as the file
18:08:57<@JAA>It's reused.
18:09:07<TheTechRobo>https://github.com/ArchiveTeam/zstd-dictionary-trainer
18:09:23<nicolas17>afaik the dictionary is stored in every record, so they can still be decompressed independently like .warc.gz can
18:09:34<@JAA>You could first write a WARC without a dictionary (or with a general-purpose dict), then train a specific dictionary for that dataset, then recompress. We don't do that anywhere currently.
18:09:41<TheTechRobo>nicolas17: I thought it's a skippable frame at the start
18:09:53<@JAA>^ Correct
18:09:54<nicolas17>TheTechRobo: at the start of the whole .warc.zst?
18:09:58<TheTechRobo>Or any time in the middle to switch IIRC
18:10:03<@JAA>No, only at the start.
18:10:09<@JAA>Before the first non-skippable frame.
18:10:21<TheTechRobo>Right, there was a discussion about supporting having it in the middle in zstd's PR
18:10:21<@JAA>Or actually, I think it has to be the first frame.
18:10:34<@JAA>Yeah, they wanted something more generic.
18:10:42<@JAA>.warc.zst is specified to only support it at the start though.
18:10:53<nicolas17>changing it in the middle would make things complicated if you want to get one record out of a remote .warc.zst, given the .cdx
18:11:17<@JAA>Yeah, you'd have to store the dict offset separately.
18:11:21<nicolas17>it sounds like currently I can use the .cdx to get the correct range of the .warc.zst with the data I want, *and* get the first few KB to get the dictionary
18:11:29<@JAA>Correct
18:11:54<nicolas17>if any record in the middle could change the dictionary, then I would need to download and parse the entire .zst
18:12:17<@JAA>In practice, if that were supported, the CDX would get an extra field for the dict location.
18:12:36<nicolas17>are there any unused letters left for CDX fields? :p
18:12:36<@JAA>But it's such an obscure edge case that we just didn't put it in the spec at all.
18:13:14<@JAA>Looks like there are a few left. :-P
18:13:26<@JAA>O, T, and W are currently unused.
18:13:41<@JAA>Also q, u, and w.
18:13:56<@JAA>And if we run out, we'll just declare it UTF-8 and use emojis!
18:13:59<TheTechRobo>xD
18:14:11<TheTechRobo>Nah, we still have more ASCII characters
18:14:13<nicolas17><skull emoji>
18:14:14<murb>and all the þ's
18:14:14<TheTechRobo>Like NUL, BEL, etc
18:14:20<TheTechRobo>s/BEL/whatever it actually is/
18:14:27<TheTechRobo>(BEL does not sound right)
18:14:32<@JAA>BEL is right.
18:14:33<nicolas17>it sounds like a bell
18:14:44<@JAA>
18:15:02<TheTechRobo>And then extended ASCII!
18:15:08<TheTechRobo>That might break shit though.
18:15:16<TheTechRobo>(Hopefully not, but...)
18:15:40<@JAA>We could also use a Unicode Private Use Area!
18:15:50<TheTechRobo>Like the Apple emoji lol
18:16:06<@JAA>Yep
18:16:50<nicolas17>huh, looks like putting these Samsung kernel source code dumps into a git repository might actually be feasible
18:16:56<TheTechRobo>"Hey, John, your script is breaking on the Apple emoji letter"
18:16:58<TheTechRobo>"The what?"
18:17:03<TheTechRobo>"Does that not look like an Apple emoji to you?"
18:17:40<nicolas17>TheTechRobo: I know someone who added an emoji to her last name in the company employee database and several internal apps broke *for everyone*
18:17:51<TheTechRobo>lmao
18:18:30<@JAA>Nice
18:20:10sec^nd (second) joins
18:31:50jasons quits [Ping timeout: 240 seconds]
18:43:18btcat joins
18:53:57riku quits [Quit: WeeChat 4.1.2]
18:55:32Megame (Megame) joins
18:58:53<fireonlive>nice :3
19:09:49DogsRNice joins
19:35:29jasons (jasons) joins
19:39:02<btcat>hi, i have an old warc file from the imdb boards project
19:39:37<btcat>it seems to be for the ghostbusters 2016 movie, and none of the threads appear on the WBM
19:41:43<btcat>what should i do with it? should i just delete it? run `shred` on it?
19:46:22lennier2_ joins
19:46:26UserH joins
19:48:50lennier2 quits [Ping timeout: 240 seconds]
19:49:19<UserH>Not sure if this is the correct channel but...Would the news of Pitchfork being folded into GQ be a cause for concern to archive Pitchfork's website? Or since it isn't being shutdown the team would not see it as an applicable project?
20:10:52riku joins
20:17:25tech234a quits [Quit: Connection closed for inactivity]
20:32:50jasons quits [Ping timeout: 240 seconds]
20:40:26<TheTechRobo>btcat: You could always upload it to archive.org. It won't show up in the Wayback Machine but it's better than nothing. Make sure to add a lot of metadata so it's discoverable!
20:47:41<@JAA>Yeah, uploading to IA is what I'd recommend.
21:07:44pedantic-darwin quits [Client Quit]
21:12:28AlsoHP_Archivist quits [Read error: Connection reset by peer]
21:24:50Maykil107 joins
21:25:13Maykil107 quits [Remote host closed the connection]
21:35:50BlueMaxima joins
21:36:37jasons (jasons) joins
22:13:27SootBector quits [Remote host closed the connection]
22:14:40Gooshka (Gooshka) joins
22:14:48<Gooshka>Probably it would be useful. Telegram show deleted messages: https://habr.com/ru/articles/787642/
22:15:01<Gooshka>Mhm, shows.
22:22:50SootBector (SootBector) joins
22:24:31Gooshka quits [Ping timeout: 267 seconds]
22:25:39Arcorann (Arcorann) joins
22:28:15<@OrIdow6^2>fireonlive: Curious what 3rd parties are interested in DW
22:31:59SootBector quits [Remote host closed the connection]
22:32:25SootBector (SootBector) joins
22:33:23jasons quits [Ping timeout: 272 seconds]
23:05:58<btcat>it finally finished uploading: https://archive.org/details/imdb-title_board_1289401-20170216-165023.warc
23:07:01<btcat>unfortunately it used a leftover file as the thumbnail, but at least it's out there now
23:07:50<fireonlive>:) thanks
23:08:01<fireonlive>i was wondering who that man was for a second haha
23:16:15decky_e joins
23:34:09btcat quits [Remote host closed the connection]
23:36:44jasons (jasons) joins
23:37:29<fireonlive>OrIdow6^2: not sure entirely; but I think I saw some pointers on the wiki long ago that other sites referenced us
23:37:40<fireonlive>long forgotten :(
23:55:54qwertyasdfuiopghjkl quits [Remote host closed the connection]
23:56:54Webuser230 joins
23:57:27<Webuser230>JAA bumping the discussion to this channel
23:57:39<@JAA>To summarise from #internetarchive: https://forums.mangadex.org/ is a moderately sized forum that has apparently gone down before. Not at immediate risk but would be nice to grab. Webuser230 has been doing that via SPN, but at 20M post IDs, not really feasible.
23:58:37<Webuser230>MangaDex proper went down (including the forum) https://en.wikipedia.org/wiki/MangaDex#Data_breach_and_codebase_rewrite
23:59:44<@JAA>Ah right, I remember that now.