00:05:29<pabs><pabs> jbicha: when you say killing download.gnome.org, do you mean they plan to actually shut it down or just stop using it? is there a timeline for that? (I'd like to save it to archive.org before it dies)
00:05:30<pabs><jbicha> pabs: it's being suggested to stop pushing tarballs there. I'd expect it to be read-only for a while before it would actually be turned off. Also, I don't see this actually happening immediately since there are several details to figure out
00:05:30<pabs><pabs> jbicha: ok, please ping me once it becomes read-only :)
00:06:22<pabs><pabs> and I hope there is enough time between read-only and off to save the whole thing...
00:06:35<nicolas17>personally I'd keep using it but only uploading pristine tarballs generated by gitlab to it, so they match 1:1
00:07:19<pabs>AlsoJAA: <jbicha> GNOME doesn't actually do gpg signing for tarballs 🫤
00:07:30<pabs>er woops tab completion sorry :)
00:08:41Kinille quits [Client Quit]
00:09:52<@JAA>Fun
00:10:18lennier2 joins
00:10:21Kinille (Kinille) joins
00:10:46h3ndr1k_ quits [Remote host closed the connection]
00:11:59h3ndr1k (h3ndr1k) joins
00:12:04<@JAA>I'll get a total size.
00:13:16lennier1 quits [Ping timeout: 255 seconds]
00:14:34<fireonlive>pabs: april fools? :)
00:15:08<pabs>I think it's more a reaction to the xz thing
00:15:16<fireonlive>entire internet: lets turn off and delete everything
00:15:16<fireonlive>archiveteam: https://dl.fireon.live/irc/d4ba4e80d890fc50/wtf-delete.png
00:15:24<fireonlive>pabs: ah, makes sense
00:18:54<nicolas17>JAA: I thought "they probably have rsync so you could get the size more easily that way"
00:19:00<nicolas17>but it turns out it's password-protected
00:20:14<nicolas17>pabs: https://download.gnome.org/conspiracy/ what.
00:24:57<fireonlive>lmao
00:28:28AlsoHP_Archivist quits [Client Quit]
00:28:57<@JAA>Catching all the various mirrors would be quite painful.
00:29:24<nicolas17>deduplication would be a *must*
00:30:24<@JAA>Obviously
00:30:50<fireonlive>i'm still rooting for the big deduplication day across all of IA
00:31:08<@JAA>Maybe we could get a list of the mirrors from GNOME. At least I couldn't find a public one quickly. I do remember Mirrorbits having a feature for that though.
00:31:34<nicolas17>mirrorbrain did
00:32:44<@JAA>Ah, right
00:33:08<nicolas17>ah found it
00:33:11<nicolas17>JAA: https://download.gnome.org/README?mirrorlist
00:33:42<@JAA>Hah :-)
00:34:11<nicolas17>note that both mirrorbrain and mirrorbits allow partial mirrors
00:34:54<nicolas17>where mirrors can pick their own exclusions
00:34:56<@JAA>The MirrorBrain website still prominently lists download.gnome.org as one of the users.
00:35:08<nicolas17>mirrorbits then scans what's in each mirror
00:35:25<@JAA>Right
00:35:38<nicolas17>*in theory* a mirror could choose to exclude README so it won't appear in that list, yet it will have other files mirrored
00:35:55<nicolas17>but that seems pretty unlikely :P
00:36:33<fireonlive>'this readme file is too big'
00:36:43<fireonlive>xP
00:37:18<fireonlive>cool though
00:39:12hackbug quits [Remote host closed the connection]
00:43:01<pabs>nicolas17: yeah that was weird, been around for a while https://gitlab.gnome.org/Infrastructure/download-web/-/issues/1 https://www.reddit.com/r/gnome/comments/16jej5y/what_is_swedish_gnome_conspiracy/
00:43:12<pabs>https://wiki.gnome.org/SwedishConspiracy https://wiki.gnome.org/Outreach/GnomeMysteries
00:43:29<pabs>probably some joke that happened at a conference
00:44:29hackbug (hackbug) joins
00:47:49Wohlstand quits [Client Quit]
00:50:49<@JAA>This is bigger than I would've guessed at first.
00:51:29<pabs>multiple TB?
00:56:33<@JAA>As in big dir structure. Don't have a number yet.
00:58:20<@JAA>I've discovered 528 GiB in 290k files so far.
01:05:20JaffaCakes118 quits [Remote host closed the connection]
01:05:44JaffaCakes118 (JaffaCakes118) joins
01:09:01benjins3_ joins
01:09:09<nicolas17>JAA: download.kde.org is 175251 files and 844GB
01:10:29<fireonlive>is kde planning to yeet too?
01:10:37<nicolas17>not that I know of
01:10:41<nicolas17>just a point of reference
01:11:25<fireonlive>ah :)
01:19:33<nicolas17>hoarders, 4 new files on https://data.nicolas17.xyz/samsung-grab/
01:19:57<nicolas17>ow, the thelounge effect
01:22:11<eightthree>nicolas17: what is this effect? i know thelounge is an irc client
01:22:28<@JAA>Listing finished. 303k files, 543 GiB
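(JAA doesn't say how the listing was done; one rough way to get a comparable file count and total size from a mirror's standard autoindex pages is to walk them and sum Content-Length. This is purely illustrative, not the tooling used here.)

```python
# Rough sketch: count files and total bytes on a mirror by walking
# its autoindex-style directory pages. Purely illustrative; not the
# tooling actually used in this log.
import re
from urllib.parse import urljoin

import requests

def walk_index(url):
    """Recursively walk an autoindex page, summing Content-Length."""
    files, total = 0, 0
    html = requests.get(url, timeout=30).text
    for href in re.findall(r'href="([^"?]+)"', html):
        if href.startswith(('/', '#', '..')):
            continue  # skip absolute links, anchors, and the parent dir
        target = urljoin(url, href)
        if href.endswith('/'):
            sub_files, sub_total = walk_index(target)
            files, total = files + sub_files, total + sub_total
        else:
            head = requests.head(target, timeout=30)
            files += 1
            total += int(head.headers.get('Content-Length', 0))
    return files, total

count, size = walk_index('https://download.gnome.org/')
print(f'{count} files, {size / 2**30:.1f} GiB')
```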
01:22:56<nicolas17>eightthree: thelounge client from 10-20 people on the channel simultaneously requesting the link as soon as I post it, to do a link preview
01:41:01<@JAA>So up to 15-ish TiB of data to fetch to cover all mirrors (including the excluded ones). Not too horrible, really, but would take some time obviously.
01:41:20<pabs>why all mirrors?
01:41:22HP_Archivist (HP_Archivist) joins
01:42:15<@JAA>I'm sure there are references to the various mirrors out there.
01:43:05<@JAA>Of course, this would just result in 15 TiB of download, not 15 TiB of WARCs.
01:44:57<nicolas17>pabs: in case you want wayback machine to load https://ftp-chi.osuosl.org/pub/gnome/README
01:45:42<pabs>ah, so it would be deduped?
01:45:47<nicolas17>yes
01:45:52<@JAA>When done with the right tools, at least.
01:46:01<@JAA>I'd probably do this with qwarc.
01:46:02<nicolas17>you still have to *download* from every mirror
01:46:23<nicolas17>but then IA only stores one copy of each file
01:46:33<pabs>interesting, I thought IA didn't do dedup. so this is pre-upload dedup?
01:46:43<@JAA>Yes, during WARC writing.
01:47:22<@JAA>IA supports reading the deduped records. It doesn't dedupe anything on its own, at least as far as our uploads are concerned. (No idea what they do in SPN and their crawls.)
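(The mechanism JAA describes — write each payload in full once, then emit revisit records pointing back at the first capture — can be sketched with the warcio library. JAA would use qwarc, so this only illustrates the record-level idea.)

```python
# Sketch of dedup during WARC writing: full response records the
# first time a payload digest is seen, revisit records afterwards.
# warcio is used here for illustration; the actual tool is qwarc.
from warcio.warcwriter import WARCWriter

seen = {}  # WARC-Payload-Digest -> (URI, date) of the first capture

def write_deduped(writer, record):
    digest = record.rec_headers.get_header('WARC-Payload-Digest')
    uri = record.rec_headers.get_header('WARC-Target-URI')
    date = record.rec_headers.get_header('WARC-Date')
    if digest in seen:
        ref_uri, ref_date = seen[digest]
        # Revisit record: headers are kept, but the body refers back
        # to the earlier capture instead of being stored again.
        revisit = writer.create_revisit_record(
            uri, digest, ref_uri, ref_date,
            http_headers=record.http_headers)
        writer.write_record(revisit)
    else:
        seen[digest] = (uri, date)
        writer.write_record(record)

# usage: writer = WARCWriter(open('out.warc.gz', 'wb'), gzip=True)
```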
02:03:38Kinille quits [Client Quit]
02:20:46Kinille (Kinille) joins
02:54:23MrMcNuggets joins
03:07:36MrMcNuggets quits [Client Quit]
03:08:30SootBector quits [Ping timeout: 255 seconds]
03:10:26SootBector (SootBector) joins
03:10:50SootBector quits [Remote host closed the connection]
03:11:15SootBector (SootBector) joins
03:15:13nic8 quits [Client Quit]
03:16:03nic8 (nic) joins
03:41:53<fuzzy8021>jeez nicolas17 only 2 concurrent these days
03:46:23<nicolas17>there
03:46:33<nicolas17>when there are 4 files total
03:46:45<nicolas17>why not spread them to multiple users :P
03:46:54f joins
03:47:25f quits [Client Quit]
03:49:56<fuzzy8021>;)
03:52:17<nicolas17>btw once you start uploading it already stops counting as "pending"
03:53:51<fuzzy8021>good to know. luckily the upload doesn't take long
03:54:47<fuzzy8021>10 mins on the last one
04:34:48Island quits [Read error: Connection reset by peer]
04:37:18Perk quits [Client Quit]
04:41:53Craigle quits [Quit: The Lounge - https://thelounge.chat]
04:42:22Craigle (Craigle) joins
04:48:33icedice quits [Client Quit]
04:58:52Perk joins
05:34:30kiryu quits [Remote host closed the connection]
05:35:52kiryu (kiryu) joins
05:51:19BornOn420 quits [Client Quit]
05:54:02BornOn420 (BornOn420) joins
05:55:15midou joins
05:56:24JaffaCakes118 quits [Remote host closed the connection]
06:01:53Arcorann (Arcorann) joins
06:04:57JaffaCakes118 (JaffaCakes118) joins
06:05:31BlueMaxima_ quits [Client Quit]
07:05:02Unholy2361 quits [Remote host closed the connection]
07:05:55superkuh quits [Ping timeout: 255 seconds]
07:06:08Unholy2361 (Unholy2361) joins
07:07:41superkuh joins
07:09:14Unholy23615 (Unholy2361) joins
07:12:45Unholy2361 quits [Ping timeout: 272 seconds]
07:12:45Unholy23615 is now known as Unholy2361
08:24:56beastbg8_ joins
08:27:29beastbg8 quits [Ping timeout: 272 seconds]
09:00:04Bleo182600 quits [Client Quit]
09:01:21Bleo182600 joins
09:02:58neggles quits [Quit: bye friends - ZNC - https://znc.in]
09:03:58<steering>mgrandi: I dunno how official/public the API is, but yup, their own site uses it thankfully :)
09:17:37<steering>Page: 510
09:17:37<steering>106730 106730-aim_deagleaim.json 106730-300138-aim_deagleaim.rar MD5 mismatch 106730-300138-aim_deagleaim.rar!
09:17:42<steering>only 2000 more pages to go :'D
09:25:50grid joins
10:04:36<pabs>pokechu22: a jira https://tracker.crosswire.org/
10:42:28Exorcism2 (exorcism) joins
10:42:31qwertyasdfuiopghjkl quits [Client Quit]
10:42:31Exorcism quits [Read error: Connection reset by peer]
10:42:31Exorcism2 is now known as Exorcism
10:48:22qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
11:06:57<@arkiver>i see discourse pop up a lot now
11:10:30<c3manu>fireonlive: there will be a deduplication day? or is that still in the "we should do that someday" category? :)
11:14:46<@arkiver>current channel for discourse is #msgbored
11:14:57<@arkiver>but let's make a separate channel for discourse
11:15:00<@arkiver>anyone have ideas?
11:15:19qwertyasdfuiopghjkl73 joins
11:16:36qwertyasdfuiopghjkl73 is now known as qwertyasdfuiopghjkl_
11:18:01qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
11:19:27qwertyasdfuiopghjkl_ is now known as qwertyasdfuiopghjkl
11:22:07<c3manu>discourage? disconcert?
11:22:39<@arkiver>discourage sounds nice
11:22:50<@arkiver>#discourage for discourse
11:23:34<c3manu>aww i was just starting having fun with the dictionary :D
11:45:41grid quits [Client Quit]
11:48:17Wohlstand (Wohlstand) joins
12:08:14eroc19904 quits [Quit: The Lounge - https://thelounge.chat]
12:08:45eroc1990 (eroc1990) joins
12:10:40<pabs>what's the best solution for regularly archiving new content on http://meetbot.debian.net/ ? (currently down; it fairly regularly gets new IRC meeting logs)
12:10:48<pabs>I did an AB snapshot recently https://web.archive.org/web/20240218080351/http://meetbot.debian.net/
12:11:12<pabs>but it doesn't have more recent meetings like the DebConf24 team meetings, which they want to refer to
12:13:12<pabs>I guess same question for mailing lists and IRC logs in general https://wiki.archiveteam.org/index.php/IRC/Logs https://wiki.archiveteam.org/index.php/Mailman/2
12:59:10Arcorann quits [Ping timeout: 255 seconds]
13:08:41<c3manu>!ao https://schuermans.info
13:08:46<c3manu>ugh, sorry :D
13:20:33qwertyasdfuiopghjkl quits [Client Quit]
13:28:13qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
13:34:52systwi quits [Client Quit]
13:43:06icedice (icedice) joins
13:45:32neggles (neggles) joins
14:21:52<nyany>the infamous c3manu strikes again
14:28:04<c3manu>yeah yeah :(
14:28:34<c3manu>the irc police are already on my tail
15:03:57pixel leaves
15:03:58pixel (pixel) joins
15:13:40decky_e quits [Read error: Connection reset by peer]
15:13:41fuzzy8021 quits [Read error: Connection reset by peer]
15:14:04decky_e joins
15:14:09fuzzy8021 (fuzzy8021) joins
15:16:06icedice quits [Client Quit]
15:42:04katocala quits [Ping timeout: 255 seconds]
16:04:38<fireonlive>!a c3manu
16:05:07<fireonlive>none that i know of.. just pondering if it'll ever be done to save space
16:06:28<c3manu>so we don't actually know whether they're not doing that already?
16:08:33<fireonlive>iirc the answer is no for "everything" and unsure for other things
16:09:49<fireonlive>like if we uploaded a jpg and someone SPN'd the same jpg = duplication; or two people uploaded the same copy of windows XP
16:11:30<fireonlive>but if 10 people SPN'd the same jpg maybe only one is stored for the SPN collection? uncertain
16:14:46<c3manu>that's probably way easier said than done (both for the wayback machine and other items), since you'd want to keep the metadata of any duplicates and still have the file itself load properly in the interface
16:21:18<fireonlive>hmm yeah, it would be quite an undertaking especially if you had to rewrite warcs too
16:23:18knecht4 quits [Quit: knecht420]
16:24:35knecht4 joins
16:37:30<pokechu22>pabs: already done, see https://archive.fart.website/archivebot/viewer/?q=tracker.crosswire.org
16:39:06decky joins
16:41:55dave quits [Ping timeout: 255 seconds]
16:41:59katocala joins
16:42:11Bleo182600 quits [Client Quit]
16:42:18dave (dave) joins
16:42:22decky_e quits [Ping timeout: 255 seconds]
16:42:22kiska5 quits [Ping timeout: 255 seconds]
16:42:22project10 quits [Ping timeout: 255 seconds]
16:42:22@dxrt quits [Ping timeout: 255 seconds]
16:42:27project10_ (project10) joins
16:43:34Bleo182600 joins
16:44:22fuzzy8021 quits [Read error: Connection reset by peer]
16:48:10kiska5 joins
16:48:17fuzzy8021 (fuzzy8021) joins
16:48:30dxrt joins
16:48:32dxrt quits [Changing host]
16:48:32dxrt (dxrt) joins
16:48:32@ChanServ sets mode: +o dxrt
16:49:13Island joins
17:00:23<c3manu>rewriting warcs sounds like a big no-no actually
17:00:49<fireonlive>aiui there's no existing tooling for it
17:04:09<c3manu>a big value in the internet archive (at least that's my opinion) is that you can pretty much rely on its integrity. rewriting warcs would just *invite* manipulation attempts
17:12:05<fireonlive>i'd assume ia themselves would do it if it came to that
17:12:32<fireonlive>i'm curious if it would even be worth basic dedupe and then let's say.. deep dedupe
17:37:01<joepie91|m>the main problem with dedupe in archives is that it introduces a tradeoff in emergency accessibility
17:37:26<joepie91|m>the more you dedupe, the more complexity you introduce into the file processing pipeline, and the harder it becomes to verify that it's working correctly
17:38:11<joepie91|m>so if your infrastructure is on fire, you might suddenly find yourself unable to recover something not because the data isn't there, but because you no longer have the infrastructure/knowledge/capacity/whatever to piece it back together, and you may discover that actually some bits and pieces got lost in the process
17:38:49<joepie91|m>irritating edgecases like "we thought we had redundant copies but due to an error in dedupe bookkeeping it turns out we only had 1 copy even though the system thought we had multiple"
17:39:28<joepie91|m>tl;dr the more complexity you introduce into your core storage, the more likely it is that something goes horribly wrong at some point
17:39:49<joepie91|m>doesn't entirely rule out dedupe probably, but I imagine that IA might not consider it such a slam dunk for these reasons :p
18:43:50erkinalp joins
18:43:56<erkinalp>https://github.com/letsblockit/letsblockit is shutting down
18:49:25Wohlstand quits [Ping timeout: 272 seconds]
18:52:12MrMcNuggets joins
18:53:02VerifiedJ9 quits [Quit: The Lounge - https://thelounge.chat]
18:53:35VerifiedJ9 (VerifiedJ) joins
19:02:27MrMcNuggets quits [Client Quit]
19:20:19Barto quits [Ping timeout: 255 seconds]
19:21:37Barto (Barto) joins
19:21:37MrMcNuggets joins
19:26:11Matthww quits [Quit: The Lounge - https://thelounge.chat]
19:31:52Matthww joins
19:34:24MrMcNuggets quits [Ping timeout: 265 seconds]
19:40:36Wohlstand (Wohlstand) joins
19:48:32<mgrandi>@steering: maybe make a wiki page with your code if one doesn't exist already
20:03:13T31M quits [Quit: ZNC - https://znc.in]
20:03:33T31M joins
20:08:31<c3manu>joepie91|m: good point
20:09:52<c3manu>erkinalp: i posted it in #gitgud, which is the channel we use for archiving github repos
20:19:05<erkinalp>c3manu: it finished already :)
20:20:15<@OrIdow6>What level is this hypothetical dedup taking place on? FS level? Revisit records? Removing records entirely?
20:21:26<fireonlive>joepie91|m makes good points; it would have to be quite meticulous (and it would take quite a bit of power) and also ensure we keep the A/B of the current system
20:21:44<c3manu>erkinalp: ah, i must have missed that. sorry
20:21:46<fireonlive>OrIdow6: i was thinking both a surface-level and perhaps a deep-level
20:22:13<erkinalp>c3manu: yeah took just a few seconds and that's it
20:22:20<fireonlive>surface level say, 300 copies of the same youtube video = same sha512
20:22:44<fireonlive>maybe only need to store that 'once' (where once is 'once per datacentre/redundancy set')
20:23:48<fireonlive>deep-level/deep-clean perhaps (although no tooling exists for this, so we're in magic land now): stuff like all the verified warcs where IA, say, has thousands of copies of the same jpg, of the same homepage, of the same tweet
20:24:20<fireonlive>not sure how that would look or work on a warc level, though....
20:24:34<fireonlive>i suspect the answer is 'storage is cheap enough'
20:24:53<fireonlive>but it's a thought i come back to sometimes
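(The surface-level pass fireonlive describes is just content-hash grouping; a minimal sketch follows. The /data/items tree is made up, and this is nothing like IA's real storage layer.)

```python
# Minimal sketch of "surface level" dedup: group files by content
# hash and report the duplicates. The /data/items tree is made up;
# this is not how IA actually stores or dedupes anything.
import hashlib
from collections import defaultdict
from pathlib import Path

def sha512_of(path, bufsize=1 << 20):
    h = hashlib.sha512()
    with open(path, 'rb') as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

groups = defaultdict(list)
for p in Path('/data/items').rglob('*'):
    if p.is_file():
        groups[sha512_of(p)].append(p)

for digest, paths in groups.items():
    if len(paths) > 1:
        # keep paths[0]; the rest could become references to it
        print(f'{len(paths)} copies of {digest[:16]}..., e.g. {paths[0]}')
```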
20:25:03<@OrIdow6>Or for all we know, the WBM's storage is tied up in things with only one copy
20:25:32<fireonlive>esp. when WBM shows how many duplicates of X it has in say the * view and it's a lot down the same column
20:25:52<fireonlive>hmm yeah
20:26:23<fireonlive>i suppose revisit records would solve? the 'the URI X on Y timestamp was the same as it was on Z timestamp'
20:26:31<fireonlive>that would be a feature you wouldn't want to lose
20:27:02<fireonlive>sometimes being able to prove.. somehow that something remained the same is quite useful
20:28:32<@OrIdow6>Yeah, I was thinking earlier "oh, better to remove it, what if the record the revisit references fails!" but "this was captured, even though the body data has been lost" is more useful than no information
20:28:43<@OrIdow6>But those integrity concerns really are a good point
20:29:14<@OrIdow6>Also if done thru revisits could break the AT use case of "download this WARC and you'll have a complete copy of the website!"
20:29:43<fireonlive>ah yes indeed..
20:29:53<fireonlive>that is one thing i like about our WARCs
20:30:12<fireonlive>you can actually download them (for the most part) and it's a complete ship
20:43:05JaffaCakes118 quits [Remote host closed the connection]
20:43:29JaffaCakes118 (JaffaCakes118) joins
20:58:45Matthww quits [Client Quit]
21:04:10Matthww joins
21:05:15bladem (bladem) joins
21:09:00<erkinalp>fireonlive: except the archive.org download is incredibly slow
21:09:06<erkinalp>like isdn speeds
21:12:57<fireonlive>it does download eventually :p
21:13:03<@JAA>It depends strongly on where you are geographically/network-wise.
21:15:31<pabs>c3manu, fireonlive: re dedup, you would probably want rolling chunk dedup too (like restic and other backup tools do), for partially duplicated files like the same page with just a timestamp changed
21:16:51<@JAA>That's not currently supported by WARC.
21:17:20<@JAA>The relevant discussion has been open since late 2015: https://github.com/iipc/warc-specifications/issues/30
21:28:56<fireonlive>oh interesting
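(A toy version of the rolling-chunk idea pabs mentions: chunk boundaries are chosen by a rolling hash of the content, so inserting a timestamp only invalidates the chunk around it. restic uses a Rabin fingerprint; this simplified 32-bit hash just shows the principle.)

```python
# Toy content-defined chunking, illustrating the rolling-chunk dedup
# pabs describes. restic uses a Rabin fingerprint; this simplified
# 32-bit rolling hash only demonstrates the principle.
import hashlib
import os

MIN_CHUNK, MAX_CHUNK = 2048, 65536
MASK = (1 << 13) - 1  # cut when low 13 bits are zero: ~8 KiB chunks

def chunks(data: bytes):
    """Yield chunks whose boundaries depend on content, not offsets."""
    start = h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # old bytes shift out
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

# Two pages differing only in an embedded timestamp share almost
# every chunk, so only the changed chunk needs storing twice.
head, tail = os.urandom(50000), os.urandom(50000)
digests = [
    {hashlib.sha256(c).hexdigest() for c in chunks(head + stamp + tail)}
    for stamp in (b'timestamp=1', b'timestamp=2')
]
print(f'shared chunks: {len(digests[0] & digests[1])} of {len(digests[0])}')
```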
21:34:48BlueMaxima joins
21:40:38erkinalp quits [Client Quit]
21:46:54MrMcNuggets (MrMcNuggets) joins
21:47:33Perk quits [Client Quit]
21:50:38Perk joins
22:03:16MrMcNuggets quits [Ping timeout: 265 seconds]
22:10:53MrMcNuggets (MrMcNuggets) joins
22:15:01lflare quits [Quit: Bye]
22:15:59lflare (lflare) joins
22:17:16Island_ joins
22:19:27Island_ quits [Read error: Connection reset by peer]
22:19:44Island_ joins
22:20:46Island quits [Ping timeout: 255 seconds]
22:28:25midou quits [Ping timeout: 255 seconds]
22:51:16Dango360 quits [Read error: Connection reset by peer]
22:56:00Dango360 (Dango360) joins
22:58:07mr_sarge quits [Ping timeout: 255 seconds]
22:59:10mr_sarge (sarge) joins
23:14:23JaffaCakes118 quits [Remote host closed the connection]
23:14:47JaffaCakes118 (JaffaCakes118) joins
23:32:19<h2ibot>Archivst edited Stack Exchange (-59, They are releasing dumps again): https://wiki.archiveteam.org/?diff=51998&oldid=49889
23:33:14<@JAA>Indeed, they have been doing so for a while, but it looks like they might be monthly now? That's new.
23:33:36<fireonlive>oh good, we got angry enough?
23:33:37<@JAA>Or well, relatively new, last time they did that was well over a decade ago.
23:33:41<fireonlive>i forgot about that
23:34:29<fireonlive>ah yeah: https://meta.stackexchange.com/a/390023
23:34:29<@JAA>There's a 2024-04-01 dump uploaded today, last one was on 2024-03-02.
23:34:39<fireonlive>:) nice
23:34:53<fireonlive>IIRC they were overwriting the old ones last time were they not?
23:35:06MrMcNuggets quits [Ping timeout: 265 seconds]
23:35:30<fireonlive>>Additionally, a comment written by Stack Overflow founder Jeff Atwood under the official response reads (emphasis his):
23:35:30<fireonlive>>I have confirmation via email from Prashanth that this is, indeed, *the new official policy*. I'm glad to see it. Creative Commons is part of our contract with the community, and it should never be broken -- however, CC does need to address the AI issue in an updated license, in my personal opinion.
23:35:30<fireonlive>>-- Jeff Atwood
23:36:26MrMcNuggets (MrMcNuggets) joins
23:39:47<@JAA>They are overwriting, yes, which is why I'm regularly mirroring them to separate items.
23:40:26<@JAA>They are also oversimplifying the licence situation.
23:40:42<@JAA>See e.g. my latest mirror: https://archive.org/details/stackexchange_20240305
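(For reference, mirroring an upstream file that gets overwritten in place into a dated, separate item is a few lines with the internetarchive library. The local path and metadata below are assumptions; only the stackexchange_YYYYMMDD naming follows JAA's item.)

```python
# Sketch: mirror an upstream dump (which gets overwritten in place)
# into a dated, separate IA item. The local path and metadata are
# assumptions; only the stackexchange_YYYYMMDD naming follows JAA's.
from datetime import date

from internetarchive import upload

identifier = f'stackexchange_{date.today():%Y%m%d}'
upload(
    identifier,
    files=['stackoverflow.com-Posts.7z'],  # hypothetical local copy
    metadata={
        'title': f'Stack Exchange Data Dump, {date.today():%Y-%m-%d}',
        'mediatype': 'data',
    },
)
```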
23:44:42<fireonlive>JAA++
23:44:43<eggdrop>[karma] 'JAA' now has 31 karma!
23:45:08<@JAA>:-)
23:45:25<fireonlive>=]
23:58:26MrMcNuggets quits [Client Quit]
23:58:42MrMcNuggets (MrMcNuggets) joins