00:00:49 | <thuban> | what are the odds of getting a super late dpos project in? |
00:04:46 | <thuban> | (js hell, but afaict all GET, so playback would work, and some reverse engineering's already been done) |
00:31:50 | | jasons quits [Ping timeout: 240 seconds] |
01:09:57 | <fireonlive> | -+rss- Flattr is closing down: https://flattr.com/ https://news.ycombinator.com/item?id=39040023 |
01:09:58 | <fireonlive> | oh wow |
01:10:18 | <fireonlive> | https://dl.fireon.live/irc/0e497e22da4cc423/image.png < nice preview for flattr |
01:26:20 | | pedantic-darwin quits [Ping timeout: 240 seconds] |
01:28:50 | | khobragade quits [Ping timeout: 240 seconds] |
01:33:46 | <project10> | JAA: if I'm uploading a file to transfer, say via curl, and interrupt that with CTRL-C, will transfer clean up the partially transferred file? or does it leave a stub on e.g. wasabi |
01:34:39 | <@JAA> | project10: I think it should clean up, but I've never actually checked. |
01:35:20 | | jasons (jasons) joins |
01:35:46 | <project10> | OK I found myself accidentally uploading a plaintext file instead of zst, and I interrupted it after probably ~10GiB had already been transferred. I hope wasabi won't charge 90 day storage on that :( |
01:42:51 | <@JAA> | Found it. Looks like there's a metadata file left behind, but that's tiny. No idea about billing, but since it fails at the multipart upload step, I'd think it shouldn't count. |
01:47:24 | <project10> | Great, thanks for checking, in this case we actually do want the `rm` :D |
01:50:40 | <Terbium> | rewby: if you need some backup space at Hetzner temporarily, give me a ping |
01:54:17 | | pedantic-darwin joins |
01:55:50 | | parfait quits [Ping timeout: 240 seconds] |
01:56:01 | | sec^nd quits [Remote host closed the connection] |
01:56:19 | | sec^nd (second) joins |
01:58:18 | | BlueMaxima joins |
02:26:41 | | kiryu quits [Client Quit] |
02:35:07 | | jasons quits [Ping timeout: 272 seconds] |
03:19:08 | <mgrandi> | Not sure if anyone put this into whatever project already but https://github.com/Andre0512/hon and https://github.com/Andre0512/pyhOn are going to be taken down really soon |
03:19:29 | <@JAA> | #gitgud |
03:19:51 | <mgrandi> | Thanks |
03:20:51 | <@JAA> | Looks like you got beat by 66 seconds. lol |
03:32:05 | | AlsoHP_Archivist joins |
03:34:39 | | HP_Archivist quits [Ping timeout: 272 seconds] |
03:38:20 | | jasons (jasons) joins |
03:39:40 | <fireonlive> | close! |
03:59:57 | | Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ] |
04:00:34 | | Shjosan (Shjosan) joins |
04:02:08 | <pabs> | https://www.reuters.com/markets/deals/reddit-seeks-launch-ipo-march-sources-2024-01-18/ |
04:15:04 | <fireonlive> | "[ih] Dave Mills has passed away" ~ https://elists.isoc.org/pipermail/internet-history/2024-January/009265.html https://news.ycombinator.com/item?id=39051246 |
04:31:21 | | bf_ quits [Remote host closed the connection] |
04:48:07 | | tzt quits [Ping timeout: 272 seconds] |
04:49:43 | | HP_Archivist (HP_Archivist) joins |
04:51:17 | | AlsoHP_Archivist quits [Ping timeout: 272 seconds] |
05:05:51 | | HP_Archivist quits [Ping timeout: 272 seconds] |
05:25:42 | | HP_Archivist (HP_Archivist) joins |
05:33:52 | | AlsoHP_Archivist joins |
05:36:15 | | HP_Archivist quits [Ping timeout: 272 seconds] |
05:36:34 | | DogsRNice quits [Read error: Connection reset by peer] |
05:58:24 | | HP_Archivist (HP_Archivist) joins |
06:00:19 | | AlsoHP_Archivist quits [Ping timeout: 272 seconds] |
06:04:39 | | AlsoHP_Archivist joins |
06:05:23 | | HP_Archivist quits [Ping timeout: 272 seconds] |
06:10:21 | | pabs quits [Client Quit] |
06:11:05 | | AlsoHP_Archivist quits [Ping timeout: 272 seconds] |
06:15:45 | | pabs (pabs) joins |
06:29:13 | | HP_Archivist (HP_Archivist) joins |
06:29:24 | | khobragade (khobragade) joins |
06:30:43 | | jasons quits [Ping timeout: 272 seconds] |
06:32:31 | | Doranwen quits [Remote host closed the connection] |
06:32:57 | | Doranwen (Doranwen) joins |
06:33:53 | | lennier2 quits [Ping timeout: 272 seconds] |
06:35:23 | | lennier2 joins |
06:43:20 | | khobragade quits [Ping timeout: 240 seconds] |
06:47:10 | | khobragade (khobragade) joins |
07:27:05 | | khobragade quits [Ping timeout: 272 seconds] |
07:29:29 | | Wohlstand (Wohlstand) joins |
07:33:47 | | jasons (jasons) joins |
07:35:55 | | BlueMaxima quits [Read error: Connection reset by peer] |
07:50:37 | | koon joins |
08:20:04 | | Dan joins |
08:29:48 | | Dan leaves |
09:20:57 | | icedice (icedice) joins |
09:51:40 | <h2ibot> | Flashfire42 edited List of websites excluded from the Wayback Machine (+21): https://wiki.archiveteam.org/?diff=51528&oldid=51527 |
10:00:03 | | Bleo18260 quits [Client Quit] |
10:00:42 | <h2ibot> | JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=51529&oldid=51528 |
10:01:21 | | Bleo18260 joins |
10:07:36 | <that_lurker> | https://www.thestack.technology/vmware-is-killing-off-56-products-including-vsphere-hypervisor-and-nsx/ |
10:38:27 | <that_lurker> | Somewhat of a misleading title, as most are converted to varying subscription-based licenses |
11:21:32 | | bf_ joins |
11:41:01 | | bf_ quits [Remote host closed the connection] |
11:45:28 | | bf_ joins |
11:52:27 | | c3manu quits [Ping timeout: 272 seconds] |
11:52:27 | | shreyasminocha quits [Ping timeout: 272 seconds] |
11:52:27 | | evan quits [Ping timeout: 272 seconds] |
11:54:59 | | thehedgeh0g quits [Ping timeout: 272 seconds] |
11:59:10 | | evan joins |
11:59:12 | | thehedgeh0g (mrHedgehog0) joins |
11:59:12 | | c3manu (c3manu) joins |
11:59:15 | | shreyasminocha (shreyasminocha) joins |
12:25:53 | | Iki joins |
12:28:33 | | jasons quits [Ping timeout: 272 seconds] |
13:01:50 | | Arcorann quits [Ping timeout: 240 seconds] |
13:28:54 | | IRC2DC quits [Remote host closed the connection] |
13:31:44 | | jasons (jasons) joins |
13:57:38 | | IRC2DC joins |
14:26:59 | | jasons quits [Ping timeout: 272 seconds] |
14:48:05 | | Wohlstand quits [Client Quit] |
14:55:26 | <nulldata> | The https://irc.tl/ network is shutting down on February 26th |
15:00:56 | <pabs> | add to deathwatch? |
15:01:11 | <nicolas17> | what is there to archive tho? the website? |
15:15:59 | | IRC2DC quits [Remote host closed the connection] |
15:16:17 | | IRC2DC joins |
15:30:23 | | jasons (jasons) joins |
15:58:25 | <jodizzle> | re: Baltimore Sun in #archiveteam: there was a big crawl of it back in mid 2021. Big site, takes a long time to run. |
15:59:05 | <jodizzle> | We could re-run everything, but I was thinking I could pull out only the new articles from the sitemaps |
16:00:37 | <jodizzle> | Which I think should cover all the most important stuff. |
16:18:14 | <@JAA> | Also worth mentioning that we probably archived a bunch of it with #//. |
16:19:00 | <jodizzle> | Yeah, good point. |
16:19:16 | <@JAA> | It's among the stuff being retrieved regularly. The homepage every hour according to urls-sources, but that's outdated per arkiver. |
16:29:13 | | jasons quits [Ping timeout: 272 seconds] |
16:31:40 | <fireonlive> | anything dying should probably go on death watch, apparently third parties like it too |
16:39:34 | <@JAA> | ++ |
16:41:18 | <ScenarioPlanet> | What do you think about archiving old audio donations on DonationAlerts (~150kb each, a lot of them)? The main problem is that it would require brute-forcing the ID space (at least the server doesn't seem to ban IPs) |
16:41:58 | <nicolas17> | ScenarioPlanet: how big are the IDs? |
16:42:24 | <@JAA> | Also, some examples? |
16:43:24 | <ScenarioPlanet> | nicolas17, 10^7 to 10^8 |
16:44:27 | <ScenarioPlanet> | JAA, http://static.donationalerts.ru/audiodonations/64886/64886380.wav |
16:45:20 | <nicolas17> | is this a russian site? |
16:46:33 | <ScenarioPlanet> | It is, a website used by livestreamers to provide donation alerts on live |
16:46:41 | <@JAA> | Is there metadata about the clips anywhere? |
16:47:11 | <ScenarioPlanet> | They have an API but it's auth-only |
16:47:32 | <nicolas17> | testing 64886 |
16:47:37 | <@JAA> | Yeah, and 1 req/s, lol |
16:47:58 | <nicolas17> | found 39 audio files there (of 1000 possible IDs) |
16:48:21 | <ScenarioPlanet> | Also, I don't think it makes it possible to check the donations sent to other users |
16:48:27 | <nicolas17> | adding up to 4MB, so average 107KB |
16:49:15 | <ScenarioPlanet> | I've downloaded 3k just to test, average is 170k |
16:49:37 | <ScenarioPlanet> | Some of them can be PCM instead of Opus |
16:51:07 | <@JAA> | It always serves them with .wav? |
16:53:58 | <ScenarioPlanet> | Seems to be like that, yes |
16:54:28 | <ScenarioPlanet> | Even when it's actually a .webm, the server still serves the file with a .wav name |
16:57:22 | <nicolas17> | found 285 wavs in 10000 IDs, took about 2 minutes (aria2c -j20) |
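[Editor's note: a minimal sketch of the kind of ID scan described above, not the exact commands used. The URL layout is inferred from the single sample link (ID 64886380 lives under directory 64886, i.e. `id // 1000`), so treat the path scheme as an assumption; the size estimate just multiplies the hit rate and average size quoted in the log.]

```python
# Hypothetical sketch of the DonationAlerts ID scan discussed above.
# URL layout inferred from the sample link:
#   http://static.donationalerts.ru/audiodonations/64886/64886380.wav
# i.e. directory = donation_id // 1000 (an assumption, not documented).
import concurrent.futures
import urllib.request

BASE = "http://static.donationalerts.ru/audiodonations"

def audio_url(donation_id: int) -> str:
    # e.g. 64886380 -> .../audiodonations/64886/64886380.wav
    return f"{BASE}/{donation_id // 1000}/{donation_id}.wav"

def estimated_total_bytes(hit_rate: float, avg_size: int, id_space: int) -> int:
    # Back-of-envelope scale check: 285 hits / 10000 IDs at ~170 KB
    # average over a 10^8 ID space comes to roughly half a terabyte.
    return int(hit_rate * avg_size * id_space)

def probe(donation_id: int):
    # HEAD request: we only care whether the file exists and how big it is.
    try:
        req = urllib.request.Request(audio_url(donation_id), method="HEAD")
        with urllib.request.urlopen(req, timeout=10) as resp:
            return donation_id, int(resp.headers.get("Content-Length", "0"))
    except OSError:
        return None

def scan(start: int, count: int, workers: int = 20):
    # ~20 concurrent probes, similar to the aria2c -j20 test above.
    with concurrent.futures.ThreadPoolExecutor(workers) as pool:
        return [r for r in pool.map(probe, range(start, start + count)) if r]
```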
17:20:05 | <h2ibot> | Usernam edited List of websites excluded from the Wayback Machine/Partial exclusions (+133): https://wiki.archiveteam.org/?diff=51530&oldid=51525 |
17:23:35 | | AlsoHP_Archivist joins |
17:24:50 | | HP_Archivist quits [Ping timeout: 240 seconds] |
17:25:17 | <aninternettroll> | Hi, statistikk.lanekassen.no was moved under the main domain. The new pages can be listed with `curl https://lanekassen.no/api/mt1502/sitemap/lkno | jq -r '.sider[].contentLink.publicUri | select(. | startswith("/nb-NO/statistikk-og-analyse/"))'` Could someone throw those into archivebot? |
17:25:47 | <aninternettroll> | (PS, only /nb-NO/ pages should be added, /nn-NO/ and /en-US/ pages are just copies of the same thing) |
17:26:26 | <@JAA> | Them having the same content doesn't mean they shouldn't be archived. |
17:27:28 | <aninternettroll> | fair enough, then feel free to archive those as well. Same path, just that they start with a different language code |
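[Editor's note: a sketch expanding the command above to keep all three language prefixes and emit absolute URLs. The response shape (`.sider[].contentLink.publicUri`) is taken from the command in the log; everything else is illustrative.]

```shell
# Sketch only: same sitemap query as above, filtered to the three
# language prefixes mentioned, with absolute URLs for ArchiveBot.
FILTER='.sider[].contentLink.publicUri | select(startswith("/nb-NO/") or startswith("/nn-NO/") or startswith("/en-US/"))'

# Live usage (hits the network):
#   curl -s 'https://lanekassen.no/api/mt1502/sitemap/lkno' \
#     | jq -r "$FILTER" | sed 's|^|https://lanekassen.no|'

# Dry run against a minimal payload with the assumed shape:
echo '{"sider":[{"contentLink":{"publicUri":"/nb-NO/statistikk-og-analyse/x/"}},{"contentLink":{"publicUri":"/de-DE/ignored/"}}]}' \
  | jq -r "$FILTER" | sed 's|^|https://lanekassen.no|'
```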
17:30:27 | <@JAA> | Looks like a few other URLs are also new. |
17:30:59 | <aninternettroll> | Which ones? |
17:31:27 | <aninternettroll> | Oh like PDFs? |
17:31:28 | <@JAA> | https://lanekassen.no/nb-NO/laresteder/veiledning/informasjonsmateriell-til-larestedene/forste-regning/ and a bunch of stuff in /nb-NO/fellesinngang/. |
17:31:40 | <@JAA> | Didn't count the PDFs, but yes, those too. |
17:31:51 | <aninternettroll> | fellesinngang should all return 404 I think, it's not really ready |
17:31:53 | | jasons (jasons) joins |
17:32:43 | <aninternettroll> | Also lol just noticed the "X (Twitter)" in the footer |
17:34:07 | <@JAA> | Yeah, all 404 on fellesinngang. |
17:36:06 | | sec^nd quits [Ping timeout: 255 seconds] |
17:38:14 | <@JAA> | All done |
17:38:49 | <aninternettroll> | Thanks! |
17:51:35 | | sec^nd (second) joins |
17:56:59 | <nicolas17> | aninternettroll: if the pages are identical, and they are archived in the same AB job, they will be deduplicated so they don't waste storage |
17:57:52 | <aninternettroll> | I don't think they are literally 1:1 identical code wise, but for all intents and purposes they are the same |
17:58:19 | <nicolas17> | ah |
17:59:44 | <aninternettroll> | The same issue shows up for /en-US/laresteder pages, which are copies of /nb-NO/laresteder |
18:03:07 | <pokechu22> | I don't think archivebot deduplicates pages like that either due to using .warc.gz where each record is separately compressed and then concatenated - I think that deduplication only really works for .warc.zst (as used by DPoS) |
18:05:01 | <@JAA> | Well, in theory, it could write revisit records if the responses were identical, but AB does not support that. |
18:05:33 | <@JAA> | The custom dictionary in .warc.zst is unrelated. |
18:05:36 | <nicolas17> | revisit records is what I was thinking |
18:05:40 | <nicolas17> | I didn't know AB didn't support it |
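[Editor's note: for reference, a WARC revisit record of the kind discussed here looks roughly like this under the WARC 1.1 spec. All URIs, UUIDs, and the digest below are illustrative placeholders, not taken from any real ArchiveBot output.]

```
WARC/1.1
WARC-Type: revisit
WARC-Target-URI: https://example.org/en-US/page
WARC-Date: 2024-01-19T18:05:00Z
WARC-Record-ID: <urn:uuid:11111111-1111-1111-1111-111111111111>
WARC-Refers-To: <urn:uuid:22222222-2222-2222-2222-222222222222>
WARC-Refers-To-Target-URI: https://example.org/nb-NO/page
WARC-Profile: http://netpreservation.org/warc/1.1/revisit/identical-payload-digest
WARC-Payload-Digest: sha1:EXAMPLEPLACEHOLDERDIGEST00000000
Content-Type: application/http; msgtype=response
Content-Length: 0
```

The record points at an earlier response with an identical payload digest instead of storing the body again, which is the deduplication mechanism being discussed.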
18:06:15 | <pokechu22> | The custom dictionary does mean that duplicated or near-duplicated pages compress better than they do with .warc.gz, though, right? |
18:06:31 | <nicolas17> | not really, records are *still* independent |
18:06:54 | <nicolas17> | pages that have enough content in common *with the dictionary* will take less space |
18:07:15 | <pokechu22> | Isn't the dictionary generated from page content? |
18:07:27 | <pokechu22> | (for all of the pages) |
18:07:36 | <@JAA> | You need to train a dictionary for that. You can't start writing a .warc.zst before you have a dict. |
18:07:57 | <@JAA> | On DPoS projects, we regularly train a new dictionary with the uploaded data, so it gets better over time. |
18:08:03 | | sec^nd quits [Ping timeout: 255 seconds] |
18:08:12 | <nicolas17> | and that's semi-manual right? |
18:08:17 | <pokechu22> | Is there one dictionary per .warc.zst file or are they shared between them? |
18:08:21 | <@JAA> | Mostly automated, I believe. |
18:08:42 | <pokechu22> | I was assuming one dictionary per file generated at the same time as the file |
18:08:57 | <@JAA> | It's reused. |
18:09:07 | <TheTechRobo> | https://github.com/ArchiveTeam/zstd-dictionary-trainer |
18:09:23 | <nicolas17> | afaik the dictionary is stored in every record, so they can still be decompressed independently like .warc.gz can |
18:09:34 | <@JAA> | You could first write a WARC without a dictionary (or with a general-purpose dict), then train a specific dictionary for that dataset, then recompress. We don't do that anywhere currently. |
18:09:41 | <TheTechRobo> | nicolas17: I thought it's a skippable frame at the start |
18:09:53 | <@JAA> | ^ Correct |
18:09:54 | <nicolas17> | TheTechRobo: at the start of the whole .warc.zst? |
18:09:58 | <TheTechRobo> | Or any time in the middle to switch IIRC |
18:10:03 | <@JAA> | No, only at the start. |
18:10:09 | <@JAA> | Before the first non-skippable frame. |
18:10:21 | <TheTechRobo> | Right, there was a discussion about supporting having it in the middle in zstd's PR |
18:10:21 | <@JAA> | Or actually, I think it has to be the first frame. |
18:10:34 | <@JAA> | Yeah, they wanted something more generic. |
18:10:42 | <@JAA> | .warc.zst is specified to only support it at the start though. |
18:10:53 | <nicolas17> | changing it in the middle would make things complicated if you want to get one record out of a remote .warc.zst, given the .cdx |
18:11:17 | <@JAA> | Yeah, you'd have to store the dict offset separately. |
18:11:21 | <nicolas17> | it sounds like currently I can use the .cdx to get the correct range of the .warc.zst with the data I want, *and* get the first few KB to get the dictionary |
18:11:29 | <@JAA> | Correct |
18:11:54 | <nicolas17> | if any record in the middle could change the dictionary, then I would need to download and parse the entire .zst |
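[Editor's note: a sketch of the record-extraction flow nicolas17 describes, under stated assumptions: the dictionary sits in a skippable frame at the very start of the .warc.zst (possibly itself zstd-compressed), and record offsets come from the CDX. The third-party `zstandard` package is assumed for the actual decompression; the frame parsing itself is plain stdlib.]

```python
# Sketch: read the shared dictionary from the leading skippable frame of
# a .warc.zst, then decompress one record at a CDX-supplied offset.
# Layout assumptions follow the convention discussed above.
import struct

ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # little-endian 0xFD2FB528
SKIPPABLE_MIN = 0x184D2A50        # skippable frame magics: 0x184D2A50..5F
SKIPPABLE_MAX = 0x184D2A5F

def read_dictionary(fh) -> bytes:
    """Return the dictionary stored in the leading skippable frame."""
    magic, size = struct.unpack("<II", fh.read(8))
    if not SKIPPABLE_MIN <= magic <= SKIPPABLE_MAX:
        raise ValueError("no leading skippable frame; not a dictionary .warc.zst?")
    payload = fh.read(size)
    if payload[:4] == ZSTD_MAGIC:
        # Dictionary itself stored zstd-compressed (assumed convention).
        import zstandard  # third-party; only needed on this path
        payload = zstandard.ZstdDecompressor().decompress(payload)
    return payload

def read_record(fh, offset: int, dictionary: bytes) -> bytes:
    """Decompress the zstd frame (one WARC record) starting at `offset`."""
    import zstandard  # third-party
    fh.seek(offset)
    dctx = zstandard.ZstdDecompressor(
        dict_data=zstandard.ZstdCompressionDict(dictionary))
    return dctx.stream_reader(fh).read()
```

Over HTTP this maps to two range requests: the first few KB for the dictionary frame, then the byte range the CDX gives for the record.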
18:12:17 | <@JAA> | In practice, if that were supported, the CDX would get an extra field for the dict location. |
18:12:36 | <nicolas17> | are there any unused letters left for CDX fields? :p |
18:12:36 | <@JAA> | But it's such an obscure edge case that we just didn't put it in the spec at all. |
18:13:14 | <@JAA> | Looks like there are a few left. :-P |
18:13:26 | <@JAA> | O, T, and W are currently unused. |
18:13:41 | <@JAA> | Also q, u, and w. |
18:13:56 | <@JAA> | And if we run out, we'll just declare it UTF-8 and use emojis! |
18:13:59 | <TheTechRobo> | xD |
18:14:11 | <TheTechRobo> | Nah, we still have more ASCII characters |
18:14:13 | <nicolas17> | <skull emoji> |
18:14:14 | <murb> | and all the þ's |
18:14:14 | <TheTechRobo> | Like NUL, BEL, etc |
18:14:20 | <TheTechRobo> | s/BEL/whatever it actually is/ |
18:14:27 | <TheTechRobo> | (BEL does not sound right) |
18:14:32 | <@JAA> | BEL is right. |
18:14:33 | <nicolas17> | it sounds like a bell |
18:14:44 | <@JAA> | ␇ |
18:15:02 | <TheTechRobo> | And then extended ASCII! |
18:15:08 | <TheTechRobo> | That might break shit though. |
18:15:16 | <TheTechRobo> | (Hopefully not, but...) |
18:15:40 | <@JAA> | We could also use a Unicode Private Use Area! |
18:15:50 | <TheTechRobo> | Like the Apple emoji lol |
18:16:06 | <@JAA> | Yep |
18:16:50 | <nicolas17> | huh, looks like putting these Samsung kernel source code dumps into a git repository might actually be feasible |
18:16:56 | <TheTechRobo> | "Hey, John, your script is breaking on the Apple emoji letter" |
18:16:58 | <TheTechRobo> | "The what?" |
18:17:03 | <TheTechRobo> | "Does that not look like an Apple emoji to you?" |
18:17:40 | <nicolas17> | TheTechRobo: I know someone who added an emoji to her last name in the company employee database and several internal apps broke *for everyone* |
18:17:51 | <TheTechRobo> | lmao |
18:18:30 | <@JAA> | Nice |
18:20:10 | | sec^nd (second) joins |
18:31:50 | | jasons quits [Ping timeout: 240 seconds] |
18:43:18 | | btcat joins |
18:53:57 | | riku quits [Quit: WeeChat 4.1.2] |
18:55:32 | | Megame (Megame) joins |
18:58:53 | <fireonlive> | nice :3 |
19:09:49 | | DogsRNice joins |
19:35:29 | | jasons (jasons) joins |
19:39:02 | <btcat> | hi, i have an old warc file from the imdb boards project |
19:39:37 | <btcat> | it seems to be for the ghostbusters 2016 movie, and none of the threads appear on the WBM |
19:41:43 | <btcat> | what should i do with it? should i just delete it? run `shred` on it? |
19:46:22 | | lennier2_ joins |
19:46:26 | | UserH joins |
19:48:50 | | lennier2 quits [Ping timeout: 240 seconds] |
19:49:19 | <UserH> | Not sure if this is the correct channel but...Would the news of Pitchfork being folded into GQ be a cause for concern to archive Pitchfork's website? Or since it isn't being shutdown the team would not see it as an applicable project? |
20:10:52 | | riku joins |
20:10:52 | | riku is now authenticated as riku |
20:17:25 | | tech234a quits [Quit: Connection closed for inactivity] |
20:32:50 | | jasons quits [Ping timeout: 240 seconds] |
20:40:26 | <TheTechRobo> | btcat: You could always upload it to archive.org. It won't show up in the Wayback Machine but it's better than nothing. Make sure to add a lot of metadata so it's discoverable! |
20:47:41 | <@JAA> | Yeah, uploading to IA is what I'd recommend. |
21:07:44 | | pedantic-darwin quits [Client Quit] |
21:12:28 | | AlsoHP_Archivist quits [Read error: Connection reset by peer] |
21:24:50 | | Maykil107 joins |
21:25:13 | | Maykil107 quits [Remote host closed the connection] |
21:35:50 | | BlueMaxima joins |
21:36:37 | | jasons (jasons) joins |
22:13:27 | | SootBector quits [Remote host closed the connection] |
22:14:40 | | Gooshka (Gooshka) joins |
22:14:48 | <Gooshka> | Probably it would be useful. Telegram show deleted messages: https://habr.com/ru/articles/787642/ |
22:15:01 | <Gooshka> | Mhm, shows. |
22:22:50 | | SootBector (SootBector) joins |
22:24:31 | | Gooshka quits [Ping timeout: 267 seconds] |
22:25:39 | | Arcorann (Arcorann) joins |
22:28:15 | <@OrIdow6^2> | fireonlive: Curious what 3rd parties are interested in DW |
22:31:59 | | SootBector quits [Remote host closed the connection] |
22:32:25 | | SootBector (SootBector) joins |
22:33:23 | | jasons quits [Ping timeout: 272 seconds] |
23:05:58 | <btcat> | it finally finished uploading: https://archive.org/details/imdb-title_board_1289401-20170216-165023.warc |
23:07:01 | <btcat> | unfortunately it used a leftover file as the thumbnail, but at least it's out there now |
23:07:50 | <fireonlive> | :) thanks |
23:08:01 | <fireonlive> | i was wondering who that man was for a second haha |
23:16:15 | | decky_e joins |
23:34:09 | | btcat quits [Remote host closed the connection] |
23:36:44 | | jasons (jasons) joins |
23:37:29 | <fireonlive> | OrIdow6^2: not sure entirely; but I think I saw some pointers on the wiki long ago that other sites referenced us |
23:37:40 | <fireonlive> | long forgotten :( |
23:55:54 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
23:56:54 | | Webuser230 joins |
23:57:27 | <Webuser230> | JAA bumping the discussion to this channel |
23:57:39 | <@JAA> | To summarise from #internetarchive: https://forums.mangadex.org/ is a moderately sized forum that has apparently gone down before. Not at immediate risk but would be nice to grab. Webuser230 has been doing that via SPN, but at 20M post IDs, not really feasible. |
23:58:37 | <Webuser230> | MangaDex proper went down (including the forum) https://en.wikipedia.org/wiki/MangaDex#Data_breach_and_codebase_rewrite |
23:59:44 | <@JAA> | Ah right, I remember that now. |