00:00:27etnguyen03 quits [Client Quit]
00:04:03fuzzy80211 quits [Read error: Connection reset by peer]
00:04:23fuzzy80211 (fuzzy80211) joins
00:11:12JaffaCakes118 quits [Remote host closed the connection]
00:11:45JaffaCakes118 (JaffaCakes118) joins
00:32:54nulldata quits [Quit: So long and thanks for all the fish!]
00:33:12Mateon2 joins
00:33:48nulldata (nulldata) joins
00:35:07Mateon1 quits [Ping timeout: 256 seconds]
00:35:08Mateon2 is now known as Mateon1
00:45:42etnguyen03 (etnguyen03) joins
00:48:52nulldata quits [Client Quit]
00:50:18nulldata (nulldata) joins
00:50:57nulldata quits [Client Quit]
00:53:31nulldata (nulldata) joins
01:36:00JaffaCakes118 quits [Remote host closed the connection]
01:36:02JaffaCakes118_2 (JaffaCakes118) joins
01:49:49Hackerpcs quits [Quit: Hackerpcs]
01:52:35Hackerpcs (Hackerpcs) joins
01:58:36Hackerpcs quits [Remote host closed the connection]
01:58:38gfgf joins
01:59:22gfgf quits [Client Quit]
02:00:09Hackerpcs (Hackerpcs) joins
02:05:58Hackerpcs quits [Remote host closed the connection]
02:07:28Hackerpcs (Hackerpcs) joins
02:12:35Hackerpcs quits [Ping timeout: 256 seconds]
02:13:52Irenes quits [Ping timeout: 255 seconds]
02:14:30Hackerpcs (Hackerpcs) joins
02:24:55Irenes (ireneista) joins
02:48:12etnguyen03 quits [Remote host closed the connection]
02:54:23hackbug quits [Remote host closed the connection]
02:55:19chains quits [Remote host closed the connection]
02:56:53hackbug (hackbug) joins
03:08:48hackbug quits [Remote host closed the connection]
03:10:00hackbug (hackbug) joins
03:13:09BlueMaxima quits [Read error: Connection reset by peer]
03:30:47ArchivalEfforts quits [Quit: No Ping reply in 180 seconds.]
03:31:58ArchivalEfforts joins
03:38:42JaffaCakes118_2 quits [Remote host closed the connection]
03:39:30JaffaCakes118_2 (JaffaCakes118) joins
03:46:02DogsRNice quits [Read error: Connection reset by peer]
03:49:01DogsRNice joins
03:53:46DogsRNice quits [Client Quit]
03:55:14igloo22225 quits [Quit: The Lounge - https://thelounge.chat]
04:05:44monoxane6 (monoxane) joins
04:05:45monoxane4 (monoxane) joins
04:05:56monoxane6 quits [Remote host closed the connection]
04:06:49monoxane quits [Ping timeout: 255 seconds]
04:06:50monoxane4 is now known as monoxane
04:18:17DogsRNice joins
04:19:33DogsRNice quits [Remote host closed the connection]
04:29:36<nulldata>.
04:30:04benjins2_ quits [Read error: Connection reset by peer]
04:31:09loug83 joins
04:32:32<nulldata>MariaDB acquired https://www.prnewswire.com/news-releases/k1-acquires-mariadb-a-leading-database-software-company-and-appoints-new-ceo-302243508.html
04:42:43<pabs>immibis: there are several mirrors for sourceforge file releases, they redirect requests to those mirrors
04:44:00<pabs>immibis: rolling hashes, like modern backup systems such as restic/borg use, might be good for dealing with the redundancy stuff
05:31:33ace24x quits [Remote host closed the connection]
05:51:36igloo22225 (igloo22225) joins
05:58:21grid joins
06:25:53lennier2 quits [Ping timeout: 256 seconds]
06:28:07Miori quits [Ping timeout: 255 seconds]
06:43:21<@arkiver>nicolas17: you're trying to save on space archiving apple releases?
06:43:43<@arkiver>are you storing these at IA? what are you doing to save storage?
06:44:08<@arkiver>wondering because if it's stored at IA, maybe in this case (on this specifically) one doesn't have to save on space
07:09:31Unholy236192464537713 quits [Ping timeout: 256 seconds]
07:19:34<immibis>pabs: good idea but it's not that simple since compression such as .gz has a cascading effect: one different uncompressed bit changes the rest of the file
07:20:54<@JAA>Or even the same input might result in different compression output.
07:21:19<immibis>i did some experiments like this and compressed ~400GB of minecraft mod-packs (a hopefully complete set from FTB) down to ~4GB. But you want to get the original files back at the end, especially if they are signed, and this means writing a reversible decompressor which compresses the file the same way it was compressed before.
07:24:38<immibis>original bit-for-bit identical files
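The restic/borg approach pabs mentions splits files at content-defined boundaries, so identical runs of bytes dedupe even when their offsets shift between releases. A minimal sketch of the idea in Python follows — the rolling hash, window, and cut mask are illustrative stand-ins, not the real restic parameters:

```python
# Toy content-defined chunking with a rolling hash (the idea behind
# restic/borg dedup); parameters here are illustrative only.
import hashlib

MIN_CHUNK = 2048      # never cut chunks smaller than this
MASK = (1 << 13) - 1  # cut when the low 13 bits are zero -> ~8 KiB average

def chunks(data: bytes):
    h = 0
    start = 0
    for i, b in enumerate(data):
        # a byte's influence shifts out of the 32-bit state after 32
        # steps, giving an implicit ~32-byte rolling window
        h = ((h << 1) ^ b) & 0xFFFFFFFF
        if i + 1 - start >= MIN_CHUNK and (h & MASK) == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

store = {}  # sha256 -> chunk bytes, stored once across all inputs

def add_file(data: bytes):
    """Return the file's 'recipe': the ordered digests of its chunks."""
    recipe = []
    for c in chunks(data):
        d = hashlib.sha256(c).hexdigest()
        store.setdefault(d, c)  # identical chunks are stored only once
        recipe.append(d)
    return recipe
```

As immibis points out, this only pays off on uncompressed payloads: one flipped bit ahead of a gzip stream changes everything after it, so getting bit-identical originals back also requires recompressing exactly the way the publisher did.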
07:32:44Miori joins
07:43:40penny joins
07:46:56penny quits [Client Quit]
07:57:29vvvvv666 joins
07:57:55vvvvv666 quits [Client Quit]
08:28:42<that_lurker>Recent Tor Exit Node Operator Raids and Legal Harassment in Germany https://forum.torproject.org/t/tor-relays-artikel-5-e-v-another-police-raid-in-germany-general-assembly-on-sep-21st-2024/14533 https://news.ycombinator.com/item?id=41505009
08:30:18<that_lurker>"Artikel 5 e.V. is now calling for a general assembly on Sep 21st 2024. We are looking for new board members (who take over and organize a new registered address and keep running exits) or discuss ALL alternative options.
08:30:20<that_lurker> These options include "just stop running exits" or even the most drastic step of liquidating the entire organization and distribution of the remaining budget to other German organizations (that would have to qualify under our non-profit by-laws)."
08:47:45grid quits [Client Quit]
09:58:42pixel leaves [Error from remote client]
10:02:42pixel (pixel) joins
10:04:08lennier2_ joins
10:09:16<magmaus3>"Recent Tor Exit Node Operator Raids and Legal Harassment in Germany" ← shouldn't the tor exit operators be counted as not responsible for the traffic? (like ISPs and etc)
10:09:33<immibis>yes but most law enforcement are violent criminals, especially in germany
10:09:49<magmaus3>that's like everywhere
10:09:59<immibis>it's especially in germany
10:10:04<immibis>denazification never happened
10:10:10<@JAA>This discussion stops now.
10:10:23<magmaus3>sure
10:10:27<immibis>IMO the biggest risk isn't that they'll raid your tor node, since that will be cleared by the court - it's that they could discover something else, illegal or not, when they go to raid your tor node.
10:10:37<magmaus3>yeah
10:10:54<immibis>why is it illegal to talk about legal threats to tor nodes here?
10:11:03<magmaus3>that was for a diff reason
10:11:04<immibis>it should be in -ot?
10:11:13<magmaus3>maybe
10:11:24<magmaus3>i think the issue was with the political discussion
10:11:29<@JAA>Correct
10:11:40<immibis>tor node raids are politics
10:12:43<@JAA>I've archived the public web stuff of Artikel 5 e.V. that I could find with AB.
10:12:57<magmaus3>cool :3
10:15:56<immibis>Artikel 5 e.V. is also a highly political organization
10:16:32<@JAA>Which is completely irrelevant for this channel.
10:24:23<immibis>it seems A5eV would be well served by renting a clubhouse so that the business premises do not default to being at the chairman's home.
10:36:17<@arkiver>immibis: it's the "denazification never happened" stuff. that comment is not welcome here
10:36:48<@arkiver>that has been brought up before.
10:38:24<@arkiver>it's part of labeling entire perceived "groups" of people as x, where x may be some word like "nazi"/"communist"/"fascist"/etc.
10:39:59<@arkiver>also when it is not meant literally, but rather in some symbolical way, it's not something Archive Team is the right place for.
11:00:03Bleo1826007227196 quits [Quit: The Lounge - https://thelounge.chat]
11:01:22Bleo1826007227196 joins
11:26:30<kpcyrd>immibis: it looks like they do have an office location, there are multiple companies registered at that same address (including ImmobilienScout24)
11:27:26<kpcyrd>but they can still figure out home addresses of people involved in an e.V. (and apparently they did)
11:32:21<immibis>well the search warrant says business premises so whoever executed it belongs in prison for burglary then
11:33:33<immibis>a search warrant for location X doesn't give you a pass to burglarize location Y
11:34:04<kpcyrd>*the search warrant published by Artikel 5 e.V.
11:34:16<kpcyrd>there may be more
11:36:05<kpcyrd>https://artikel5ev.de/home/hausdurchsuchung-am-fr-16-08-2024/
11:36:11<kpcyrd>idk, this is all very confusing
11:37:35<kpcyrd>on their website they write "the club doesn't have any dedicated space" and that's why their homes got searched, but they very clearly do have a dedicated space at Hatzper Str. 172B
11:38:22SkilledAlpaca4 quits [Quit: SkilledAlpaca4]
11:38:40<kpcyrd>make it make sense
11:40:04SkilledAlpaca4 joins
11:42:06decky quits [Read error: Connection reset by peer]
11:43:40<kpcyrd>one of the chairmen runs a company at that address, so maybe the e.V. was using a shared space
11:44:01loug83 quits [Read error: Connection reset by peer]
11:44:02<@arkiver>immibis: i'm pinging again though on what i wrote (and will leave it at that) - this has now happened a few times, i believe you have seen (or maybe not? let me know if not) the longer message i posted back then on this
11:44:05loug83 joins
11:45:30<immibis>i gather you want to ban me cause of what i said about search warrants. Fine. I'll stop all containers I may be running, delete their data without uploading it and leave IRC for good.
11:45:38<immibis>AT IRC, that is
11:46:03<@rewby>Ot'
11:46:08<@rewby>It's not about the warrants.
11:46:20<@rewby>It's about the nazi comments.
11:47:33<immibis>you already wrote about me saying denazification never happened
11:47:53<@arkiver>that is about symbolical nazi remarks - not search warrants
11:49:04<@arkiver>i do not have as a goal to ban you - i am hoping you would somewhat understand what i wrote. again, it's not about search warrant discussion, it's about symbolically labeling groups as "nazi"/"fascists"/etc.
11:49:27<@arkiver>for completeness, i will post the message i wrote some time ago about this https://transfer.archivete.am/inline/NzLSU/message.txt
11:50:51monoxane1 (monoxane) joins
11:50:53monoxane0 (monoxane) joins
11:51:05monoxane1 quits [Remote host closed the connection]
11:52:18monoxane quits [Ping timeout: 256 seconds]
11:52:18monoxane0 is now known as monoxane
11:53:43<@arkiver>(note the message contains some references to the context at the time, but i believe it is still clear and i stand behind it)
11:54:35<kpcyrd>"especially now with everything going on" is a very timeless thing to say
11:57:01<@arkiver>for context see https://hackint.logs.kiska.pw/archiveteam-ot/20230520#c345976 and the previous day(s?)
12:03:50<kpcyrd>"As a consequence, I am personally no longer willing to provide my personal address&office-space as registered address for our non-profit/NGO[...]" written by the chairman who has a company at that address
12:03:59<kpcyrd>(at https://forum.torproject.org/t/tor-relays-artikel-5-e-v-another-police-raid-in-germany-general-assembly-on-sep-21st-2024/14533)
12:04:07<immibis>denazification is not identifying specific people as nazis, it is the removal of nazi ideology from general perception in the entire country of germany
12:04:27<kpcyrd>so it seems the theory of "the e.V. only had a postbox at that address" tracks
12:05:25<immibis>i see that you want no politics not directly related to archive team, so the whole exit node raiding thing is not allowed, except for the statement that it happened and therefore this e.V.'s site could be at risk.
12:33:30benjins2 joins
12:38:07<nicolas17>arkiver: I know of two people with a giant NAS at home with apple releases (including many that apple already deleted from their servers)
12:38:37<nicolas17>and there's so much redundant data...
12:39:39<nicolas17>and the files keep getting bigger and more numerous https://theapplewiki.com/wiki/Beta_Firmware/iPhone/17.x
13:03:14SootBector quits [Remote host closed the connection]
13:03:35SootBector (SootBector) joins
13:10:44igloo22225 quits [Read error: Connection reset by peer]
13:10:52igloo22225 (igloo22225) joins
13:15:22xarph_ quits [Ping timeout: 255 seconds]
13:16:31xarph joins
13:42:19<@arkiver>nicolas17: are we actively archiving those apple CDN URLs into the wayback machine?
13:42:32<@arkiver>please feel free to at least with ArchiveBot (CC JAA )
14:27:40hackbug quits [Remote host closed the connection]
14:27:40thuban quits [Read error: Connection reset by peer]
14:28:00hackbug (hackbug) joins
14:28:00<corentin>arkiver: I grabbed them all
14:28:06thuban (thuban) joins
14:30:30Radzig2 joins
14:31:31Radzig quits [Ping timeout: 256 seconds]
14:32:20<@arkiver>corentin: as in, is that happening periodically?
14:46:14<corentin>arkiver: no no sorry, I mean I grabbed the ones in the URLs shared. There must be something "bigger" to do though
14:46:56<@arkiver>corentin: as in the ones on https://theapplewiki.com/wiki/Beta_Firmware/iPhone/17.x ?
14:47:05<@arkiver>nicolas17: were you archiving those over the long term?
15:01:59<@arkiver>what was that TLD again that got a warning?
15:02:16<@arkiver>for hosting too much spam or phishing addresses or something
15:02:58<monoxane>probably one of the freenom ones
15:03:02<monoxane>.tk et al
15:05:47MrMcNuggets (MrMcNuggets) joins
15:06:06<@arkiver>hmm yeah
15:06:22<@arkiver>maybe
15:14:15pocketwax17 (pocketwax17) joins
15:16:37pocketwax17 quits [Client Quit]
15:17:29pocketwax17 (pocketwax17) joins
15:18:21pocketwax17 quits [Client Quit]
15:18:34pocketwax17 (pocketwax17) joins
15:30:30<corentin>arkiver: yes, sorry I should have been more clear!
15:53:14<@arkiver>alright!
15:56:02<@arkiver>well so to be clear, feel free to continue archiving this with ArchiveBot, it is well worth the size i think
16:07:55<nicolas17>corentin: note the wiki may have duplicates (XS and XS Max use the same files but have separate tables on the wiki)
16:08:27<nicolas17>if you just grabbed the URLs from the wiki and fed it to AB, I'm not sure if AB dedups
16:09:34VerifiedJ9 quits [Remote host closed the connection]
16:10:12VerifiedJ9 (VerifiedJ) joins
16:10:55<nicolas17>arkiver: I don't have the disk space to archive this long term myself :P but I'm helping people who do
16:11:04<nicolas17>and yeah I was planning on discussing how to archive this properly
16:11:15<myself>how much space we talkin'?
16:11:28<nicolas17>an admin of theapplewiki started feeding some URLs to savepagenow and I told him that was probably not the best way
16:12:47<@arkiver>nicolas17: you can just feed lists into ArchiveBot!
16:12:50<@arkiver>IA definitely has the space for this
16:13:04<@arkiver>individual items on IA _next to that_ are also welcome, i can create a collection for you and others
16:13:24<nicolas17>https://archive.org/details/apple-ipsws already done that for some files that were already deleted from Apple
16:14:01<@arkiver>yeah feel free to put everything in that collection and in ArchiveBot!
16:15:07<@arkiver>if any help is needed, don't hesitate to ping me :)
16:16:22<nicolas17>hmmmm I imported info from appledb into a SQLite database, and "select sum(file_size) from sourcefile where type='ipsw'" returns 48TB, which seems low, I wonder if my import was excluding something important... I last looked into it in June or so
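If the 48 TB figure looks low, breaking the sum down by type would show what the `type='ipsw'` filter leaves out. A sketch against the same (assumed) schema — table `sourcefile`, columns `type` and `file_size`; the database filename is a placeholder:

```python
# Sanity-check the sum by grouping over all file types instead of
# filtering to ipsw only. Schema and filename assumed from the chat above.
import sqlite3

conn = sqlite3.connect("appledb.sqlite")  # placeholder path
rows = conn.execute(
    "SELECT type, SUM(file_size), COUNT(*) FROM sourcefile "
    "GROUP BY type ORDER BY SUM(file_size) DESC"
)
for ftype, total, count in rows:
    print(f"{ftype:>16}: {(total or 0) / 1e12:8.2f} TB in {count} files")
```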
16:16:40knecht quits [Quit: knecht]
16:17:12<@arkiver>nicolas17: i see on an item like https://archive.org/details/xcode-16.1-beta1 you included a `source` metadata field, is it possible to include that for ipsw items as well?
16:17:41<nicolas17>with the original URL? yes
16:17:57<@arkiver>yeah!
16:18:33<nicolas17>though there's some where the source will be "someone sent it to me and it seemed to be the right file but it has been gone from apple-cdn for 2 years now"
16:19:12<@arkiver>nicolas17: let's add a note to that, but feel free to include at your discretion
16:21:38<@arkiver>what does IPSW stand for? is the official name iPSW ? i can't actually easily find this info online...
16:22:40<nicolas17>afaik it was originally iPod Software Update, then iPhone Software Update with a different format
16:23:18<@JAA>nicolas17: AB only dedupes identical URLs.
16:23:22<nicolas17>nowadays even macOS uses iPhone-like IPSW files
16:23:32<nicolas17>JAA: oh that's fine for this case
16:23:45<@JAA>And even then, not on redirects.
16:23:53<@JAA>It does no content dedupe at all.
16:24:02<nicolas17>the wiki page has some URLs multiple times and I didn't know if corentin had dedup'd them, if AB dedups them that's enough
16:24:15<@JAA>Ah
16:24:34<@JAA>Last time we spoke about this, I think you said there were like 4 different URLs for each file.
16:24:59<nicolas17>for macOS InstallAssistant.pkg files, there's often 2 different URLs for each
16:26:01<nicolas17>someone uploaded WARCs of them as items to archive.org containing 4 copies, because they archived those two URLs, each on both http+https, with no content dedup :/
16:26:16<@JAA>Ah right, that's the one you mentioned, yeah.
16:26:49<nicolas17>afaik WBM doesn't care about http+https so we would only need to archive 2 anyway
16:27:05<@JAA>Correct, the WBM doesn't care about the scheme in general.
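Since AB only dedupes byte-identical URLs and the WBM ignores the scheme anyway, a wiki-scraped list can be pre-deduped scheme-agnostically before feeding it in. A sketch (filenames are placeholders):

```python
# Collapse a URL list the way the WBM treats schemes: http:// and
# https:// variants of the same URL count as one, preferring https.
from urllib.parse import urlsplit, urlunsplit

def scheme_agnostic_key(url: str) -> str:
    p = urlsplit(url)
    return urlunsplit(("", p.netloc.lower(), p.path, p.query, ""))

seen = {}
with open("urls.txt") as f:  # placeholder input
    for url in filter(None, map(str.strip, f)):
        key = scheme_agnostic_key(url)
        if key not in seen or url.startswith("https://"):
            seen[key] = url

with open("urls_deduped.txt", "w") as f:  # placeholder output
    f.writelines(u + "\n" for u in seen.values())
```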
16:27:21<@arkiver>nicolas17: so would you say the official name nowadays is just IPSW? even apple or wikipedia doesn't clearly mention anything else
16:27:31<@arkiver>i guess it has so many meanings now, that it's just IPSW
16:27:45<nicolas17>arkiver: I think macOS Finder shows .ipsw files as "Apple software update" nowadays :P
16:30:38<@arkiver>that is annoying
16:31:11<nicolas17>Apple has many misnomers due to scope growth tbh
16:31:57<nicolas17>the sharingd daemon used to deal with AirDrop (sharing files wirelessly), now it handles most of the Continuity features many of which have nothing to do with sharing
16:35:06<nicolas17>mail on mac: "Mail.app"; mail on iPhone: "MobileMail.app"; many MobileSomething names refer to iOS... then some features like MobileAssets get ported to macOS and nothing makes sense anymore
16:38:08<masterx244|m>too bad that sometime device manufacturers DMCA those archives off the net even though others are glad that those archives exist
16:38:57<masterx244|m>(got a strike due to that crap already, luckily the IA version was just a secondary location, my personal copy where the state is kept of what i got already is not visible on the open web for obvious reasons)
16:40:00<nicolas17>betas used to be restricted to members of the developer program
16:40:11<nicolas17>later only beta 1 was restricted that way
16:41:02<nicolas17>last year I got a DMCA takedown not for re-hosting the 17.0b1 ipsws, but for *tweeting a link* to someone else's website that re-hosted them
16:41:49<nicolas17>the lawyers later withdrew the claim but I had already deleted the tweet myself by then *shrug*
16:43:16<masterx244|m>mine was for the archival of the sena (motorcycle intercom) firmware files. Might have gotten a few files that they didn't want out (their update server is a bit of a leaky pipe, got a dumb monitoring of a few files into a git and that sometimes leaks stuff before release)
16:43:18<@arkiver>masterx244|m: yes :/
16:44:00<masterx244|m>got a few 0.X.X versions that way, too
16:44:04<@arkiver>i do advise keeping your own copies of very important data, next to storing it on IA, but in case of very large amounts of data that may not be practical
16:44:59<masterx244|m>luckily its pretty small, 20GB or so and for coldstorage a 7zip solid compression gets it down to 300MB or so due to quite a few cross-file-redundancies
16:45:37<nicolas17>I can't avoid feeling bad about duplication
16:45:39<masterx244|m>and yeah, got my local copy still since i have some automatic crawling, the IA copy was generated by that tooling, too with some CSV upload magic
16:45:55knecht (knecht) joins
16:46:17<masterx244|m>sometimes they had different language versions where the code part was identical. or different version of code but the audio snippets for the menu were the same
16:46:28<nicolas17>not like "me and Siguza and qwerty and archive.org having the same file", that's good for redundancy
16:46:37<nicolas17>but about "archive.org having the same file 3 times in separate captures/items"
16:46:57<nicolas17>waaaasteful >_<
16:47:28<nicolas17>"the file is only 10GB it's no big deal" yeah but wasteful >_<
16:48:30<@arkiver>nicolas17: if it makes 50 TB go times 4, that is a big problem
16:48:37<@arkiver>URL agnostic deduplication would help
16:49:18<masterx244|m>yeah, splitting the stuff into "headers" and payload and if a payload segment is == with another one just storing a pointer would be enough
16:50:55<@arkiver>i believe AB deduplicates (?)
16:51:12<@arkiver>Wget-AT can be run with the 4 URLs as input and URL agnostic deduplication turned on and it will handle it
16:53:19pocketwax17 quits [Client Quit]
16:59:57<@JAA>AB does not dedupe.
17:00:35tttt joins
17:02:15tttt quits [Client Quit]
17:03:07tettehe joins
17:03:55<@arkiver>ah
17:04:33<@arkiver>JAA: i do remember some messages about that streaming over archivebot.com in the past about duplicate content - was it a feature in the past?
17:06:10tettehe quits [Client Quit]
17:08:47<@JAA>arkiver: There is a dupe detection, but that's for stopping recursion on identical responses. It doesn't do anything about the WARC writing. It's also been broken for, uh, over 8 years, before I arrived here.
17:09:10<@arkiver>ah
17:09:23<@arkiver>thanks for clearing that up
17:12:24chains joins
17:14:05<chains>How do I know when a blog has been archived by frogger?
17:16:39<@rewby>I believe the bot has dedupe so you can probably just put it in and let the bot deal. That said, the only good way is to check if the Wayback machine contains the blog
17:18:54<nicolas17>masterx244|m: the WARC format supports that kind of deduplication (storing the request and response headers, and only a pointer to the previous response body), but archivebot doesn't use it
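For reference, a sketch of what that looks like with warcio's revisit support: when a payload digest has been seen before, write a revisit record pointing at the first capture instead of storing the body again. The surrounding bookkeeping (digest computation, fetch results) is assumed, and the warcio signatures are from memory — worth double-checking. The Wget-AT URL-agnostic dedup mode arkiver mentioned above does essentially the same thing at capture time.

```python
# Sketch: URL-agnostic dedup while writing WARCs, using warcio revisit
# records. Inputs and bookkeeping are assumed; verify API details.
from io import BytesIO
from warcio.warcwriter import WARCWriter

seen = {}  # payload digest -> (URL, WARC-Date) of the first capture

def write_response(writer: WARCWriter, url, http_headers, body: bytes,
                   digest: str, date: str):
    if digest in seen:
        ref_url, ref_date = seen[digest]
        # headers only, plus a pointer to the earlier identical payload
        record = writer.create_revisit_record(
            url, digest, refers_to_uri=ref_url, refers_to_date=ref_date,
            http_headers=http_headers)
    else:
        seen[digest] = (url, date)
        record = writer.create_warc_record(
            url, "response", payload=BytesIO(body),
            http_headers=http_headers)
    writer.write_record(record)
```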
17:26:38bill93 joins
17:27:41bill93 quits [Client Quit]
17:28:03<chains>rewby, gotcha thanks
17:35:20PredatorIWD2 joins
17:37:27<nicolas17>corentin: what did you do with theapplewiki? archivebot? I don't see the job running and I doubt it finished
17:47:46<corentin>nicolas17: no I did it myself
17:48:15<nicolas17>what does that mean
17:48:17<nicolas17>savepagenow?
17:48:39<nicolas17>or downloaded to your own disk? :P
17:49:22<h2ibot>Nulldata edited Deathwatch (+270, /* 2024 */ Added SteamRep (thanks PredatorIWD2)): https://wiki.archiveteam.org/?diff=53445&oldid=53444
17:49:39<corentin>I work at the Internet Archive, I write and maintain crawlers & crawls, I captured it with Zeno and I'll upload it at some point (when the upload process kicks off)
17:50:42<@rewby>Neat, I've not checked up on how zeno's been coming along
17:50:52<nulldata>corentin++
17:50:52<eggdrop>[karma] 'corentin' now has 1 karma!
17:51:00kokos- joins
17:51:05<@rewby>I remember trying to get heritrix to do stuff a few years ago and that was pain and suffering
17:51:09<@rewby>Mostly because java
17:51:10<corentin>It is
17:51:19<corentin>It's why I wrote Zeno hahaha
17:51:27<@rewby>Very understandable
17:52:09<@rewby>Is there any docs on zeno or just "look at the code and figure it out?
17:52:50<corentin>We've had some huge work done on it recently to try and address long-standing stability issues, because for a couple of years I was the only dev on it and so I was mostly using it for "experimental" crawls. Note: the WARC writing itself is very well tested and stable, I'm just talking about the crawling part. Anyway, a lot of work from me and a couple of colleagues on it recently to get it way more stable, and more expandable / solid for the future features we'll add.
17:53:24<corentin>Sadly for now, no documentation. :) But --help will help you. ./Zeno get url https://google.com, ./Zeno get list list.txt... and -h for all the options
17:53:26<@rewby>Neat. I might have a go at it Later TM and see how it works
17:53:38<corentin>I hope it will, if you see any weird behavior, any bug, please open an issue
17:54:01<masterx244|m>same. might even be useful for grabs for the own sanity. currently using grab-site for those sanity grabs
17:54:03<@rewby>Just from looking at the readme, is crawl hq that couch db thing I recall from eons ago?
17:54:28<corentin>Not at all, Crawl HQ is an internal queuing system. Internal as in IA internal.
17:55:03<corentin>At some point I'll write in the README that even if Zeno is fully OSS, it still has very IA-specific features sometimes, optional of course
17:55:14<@rewby>Yeah makes sense
17:55:32<@rewby>ooo it has an api
17:55:35<@rewby>Interesting
17:55:37<corentin>It's also opinionated, there are choices that are made so that it fits our usage more than anything else
17:55:48<@rewby>That's very fair
17:55:49<corentin>Well.. the API is mostly reading, nothing more yet
17:55:52<@rewby>Ah
17:56:00<corentin>But yeah of course I thought about like, adding URLs via the API etc
17:56:06<corentin>so many possibilities
17:56:18<@rewby>Is there a spec for crawlhq's api somewhere? Might be interesting to do an alt implementation of that to coordinate a small set of zeno instances.
17:56:20<corentin>any PR is welcome btw, there is so much to do
17:56:35<corentin>https://git.archive.org/wb/gocrawlhq this should be public
17:56:43<corentin>It's not an API doc per se
17:56:51<corentin>but it should be enough for a smart man to understand the endpoints
17:56:54<@rewby>Oh cool it supports headless browsers
17:57:11<@rewby>(I'm having a read through your cmd/ directory)
17:57:29<corentin>Well... no, it's very experimental. There is actually a PR opened for that. (idk why the --headless option made it to the --help)
17:57:42<corentin>Goal with that PR is to bring the capability of doing mixed crawls
17:57:48<corentin>where headless is only used on some domains
17:57:53<corentin>it's like 80% done
17:58:00<corentin>I'll get back to it at some point haha
17:58:07shgaqnyrjp_ is now known as shgaqnyrjp
17:58:52<corentin>about HQ, there is actually someone that wrote his own HQ-compliant system that uses MongoDB or whatever, just to interact with Zeno haha
17:58:59<@JAA>How do you deal with TLS in the headless browser case? Since MITM proxying is required to get a correct WARC capture, and I've heard that TLS config on headless browsers is a mild pain.
17:59:02<corentin>it's not open source tho I think
17:59:41<@arkiver>can we move this to #archiveteam-dev or #archiveteam-ot ?
17:59:47<nicolas17>corentin: what's the state of deduplication in zeno? :P
17:59:59<corentin>I'll answer you both in ot or dev
18:00:06<@JAA>-dev sounds fine.
18:00:12<@rewby>arkiver: Sorry <3
18:00:19<@arkiver>no worries, thanks :)
18:01:04MrMcNuggets quits [Client Quit]
18:01:22katia_ (katia) joins
18:03:51<nicolas17>JAA: https://transfer.archivete.am/inline/SYFTj/Screenshot_20240911_150109.png
18:04:10<nicolas17>nobody expects archiveteam scale :D
18:05:17<@rewby>lol
18:06:13<that_lurker>you should share the telegram numbers :-P
18:06:33<@rewby>Or reddit or #//
18:06:52<@rewby>Enjoy me some 8PiB of urls
18:17:12katia_ quits [Client Quit]
18:17:36kokos- quits [Client Quit]
18:18:21kokos- joins
18:28:13katia_ (katia) joins
18:38:34<IDK>https://www.pcmag.com/news/wix-to-block-russian-users-take-down-their-sites-in-wake-of-us-sanctions
18:52:06<IDK>Hi, the deadline would be sept 12
18:52:07<IDK>https://www.bleepingcomputer.com/news/legal/wix-to-block-russian-users-starting-september-12/
18:53:14<@arkiver>oof
18:53:21<@arkiver>today?
18:53:33<@arkiver>or no tomorrow, but yeah
18:53:34<nicolas17>how do we even find affected sites?
18:54:06<nulldata>Some discussion on this in #archivebot too
18:54:26<nulldata>"02:46 PM <@JAA> nyuuzyou shared a list of 167 presumably Russian Wix sites earlier. Needs a bit of cleanup. I was going to run it as !a <, but maybe separate jobs are better, not sure."
18:54:30<IDK>nicolas17: Дизайн этого сайта создан в конструкторе site:*.wixsite.com (the Russian "made with Wix" banner text, "This site's design was created in the builder", used as a search query)
19:25:24<pokechu22>I'm running https://transfer.archivete.am/inline/JRZXk/wixsite.com_russian_sites.txt which was obtained that way (though I also did a -site:wixsite.com search, which found a few results), apart from https://transfer.archivete.am/inline/55K2g/wix.txt which was sent by someone else and I don't know how they generated it
19:26:24<pokechu22>Finding more sites would be difficult because wix free sites are deliberately annoying and while https://woodland64.wixsite.com/mysite works, https://woodland64.wixsite.com/ and https://woodland64.wixsite.com/sitemap.xml are 404s
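One way to grow the list despite the 404ing sitemaps is to filter candidate wixsite.com URLs by the Russian "made with Wix" banner that IDK's search query is based on. A sketch — whether every Russian-locale free site carries that exact phrase is an assumption:

```python
# Filter candidate wixsite.com URLs down to apparently-Russian ones by
# looking for the Russian Wix banner text in the HTML. Phrase taken from
# the search query above; filenames are placeholders.
import requests

BANNER = "Дизайн этого сайта создан в конструкторе"

def looks_russian_wix(url: str) -> bool:
    try:
        r = requests.get(url, timeout=30)
        return r.status_code == 200 and BANNER in r.text
    except requests.RequestException:
        return False

with open("wixsite_candidates.txt") as f, open("russian_wix.txt", "w") as out:
    for url in filter(None, map(str.strip, f)):
        if looks_russian_wix(url):
            out.write(url + "\n")
```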
19:27:44<@rewby>JAA: Slight warning, GamersNexus just posted a news video calling for "datahoarders to pull down some of that" (anandtech) so we may see a bunch of people joining and asking.
19:30:10<@JAA>nicolas17: :-)
19:30:15<@JAA>rewby: Ack. No specific mention of us?
19:30:29<@rewby>No, but you know how this goes
19:30:36<@JAA>Yeah
19:30:59<@rewby>How is that AB job doing anyway?
19:31:10<@JAA>Put it in front in the #archiveteam topic, maybe it'll help.
19:31:14<@rewby>I don't see it on the dashboard but wiki says in progress
19:31:47<@rewby>I do see a forum job going
19:31:48<@JAA>The main site job finished, the forums are still going.
19:32:00<@rewby>I'll update the wiki
19:32:06<@JAA>No idea whether it's complete or there are complications.
19:32:13<@rewby>Ah
19:32:21<@rewby>We should probably check that first then
19:33:26<@rewby>JAA: How do we check if the job was successful?
19:34:10<@JAA>Browsing in the WBM, I guess, but not sure it's there yet. IA's upload processing is slow.
19:34:32<@JAA>Or poking around on the site to find things that are problematic with JS disabled etc.
19:35:03<that_lurker>someone could maybe contact GamersNexus and let them know
19:35:35<@rewby>Lemme see if we uploaded that WARC yet
19:36:44<@JAA>The WARCs are all uploaded.
19:38:35<@JAA>I think the last item(s) might still be deriving.
19:38:53<@JAA>And then there's another slight delay between derives and it showing up in the WBM, at least sometimes.
19:39:23<@rewby>It's starting to show up on wbm
19:40:17<@JAA>No surprises there, the first WARCs were uploaded over a week ago.
19:40:27<@rewby>The IA's search is just useless as per usual
19:41:57<@JAA>https://archive.fart.website/archivebot/viewer/job/20240901213047bvqa8
19:42:37<@JAA>Images (that were linked rather than embedded) were run in a separate job.
19:43:12<@JAA>https://archive.fart.website/archivebot/viewer/job/202409092003491pjfi
19:43:19<@JAA>That's definitely not in the WBM yet.
19:45:33<@rewby>Did we ever reenable auto upload?
19:46:53<@rewby>If not, we should
20:04:40<nicolas17>okay...
20:04:58<@JAA>(Not yet)
20:07:10<nicolas17>https://transfer.archivete.am/inline/pXoXh/swcdn.apple.com-missing.txt these files still exist on Apple's CDN, and are not on WBM; Safari is ~180MB, BridgeOS is ~500MB, InstallAssistant is ~12GB
20:07:20<nicolas17>I assume this is a bad time to AB due to the upload problems
20:08:29<nulldata>(and emergency Russian Wix jobs)
20:18:27<nicolas17>I'm now looking at those that *are* on WBM to see whether they are actually failed captures
20:20:56<@arkiver>nicolas17: the upload problems still exist?
20:21:07<nicolas17>idk, #archivebot topic says so
20:24:40<@rewby>It'll be fine tomorrow or so
20:24:45<nulldata>AB uploads are still being done by JAA, our mechanical turk, at the moment
20:24:54<nulldata>01:49 PM <@eggdrop> <@JAA> Normal operations wrt uploads should probably resume either tonight or tomorrow anyway.
20:25:08<@rewby>Yes.
20:25:42<@rewby>We've nearly cleared out 6T of backlog on atr3 so we have the iops again for AB
20:26:01lizardexile quits [Ping timeout: 255 seconds]
20:27:50lizardexile joins
20:32:34qwertyasdfuiopghjkl quits [Quit: Client closed]
20:50:33<nicolas17>uhh ok I have an urgent one
20:51:02<nicolas17>https://swcdn.apple.com/content/downloads/02/18/052-96238-A_V534Q7DYXO/lj721dkb4wvu0l3ucuhqfjk7i5uwq1s8tz/InstallAssistant.pkg this URL works intermittently depending on which CDN server I hit, I guess it was deleted from their origin
20:51:24<nicolas17>not sure how to deal with that; AB it, and if it gets 403, try again?
20:51:44<pokechu22>Yeah, that'll probably work
20:54:03<nicolas17>(I probably have the file content but that won't get us a WARC)
20:54:22<@JAA>nicolas17: I'm grabbing it with grab-site, working from there.
20:58:06<nicolas17>I guess there's also a chance that the cdn cache has a partial file, and then it will die halfway
20:59:40<@JAA>Nope, download just finished without issues.
20:59:57<@JAA>12407486945 bytes
21:00:18<nicolas17>repeatedly trying locally, in between a lot of 403s, I got some 200s that failed after ~15MB
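The retry approach sketched in Python, with a completeness check against Content-Length since some of those 200s die mid-stream; retry counts, pause, and the output filename are arbitrary:

```python
# Keep retrying an intermittently-available CDN URL until one response
# both returns 200 AND delivers the full advertised Content-Length.
import time
import requests

URL = ("https://swcdn.apple.com/content/downloads/02/18/052-96238-A_V534Q7DYXO/"
       "lj721dkb4wvu0l3ucuhqfjk7i5uwq1s8tz/InstallAssistant.pkg")

def fetch_with_retries(url, out_path, attempts=50, pause=5):
    for attempt in range(1, attempts + 1):
        try:
            with requests.get(url, stream=True, timeout=60) as r:
                if r.status_code != 200:
                    raise IOError(f"HTTP {r.status_code}")
                expected = int(r.headers["Content-Length"])
                got = 0
                with open(out_path, "wb") as f:
                    for chunk in r.iter_content(1 << 20):
                        f.write(chunk)
                        got += len(chunk)
                if got == expected:
                    return  # complete download
                raise IOError(f"truncated: {got} of {expected} bytes")
        except (IOError, requests.RequestException) as e:
            print(f"attempt {attempt} failed: {e}")
            time.sleep(pause)
    raise RuntimeError(f"gave up on {url}")

fetch_with_retries(URL, "InstallAssistant.pkg")
```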
21:01:19loug83 quits [Client Quit]
21:03:43<masterx244|m><nicolas17> "nobody expects archiveteam scale..." <- when AT goes to eleven it's really big hoses for pumping out the data....
21:04:12<masterx244|m>(and the fun starts once the banhammers are flying and get dodged)
21:05:15<@JAA>> Date: Wed, 11 Sep 2024 13:06:10 GMT
21:05:16<@JAA>Interesting
21:05:33<@JAA>I guess they cache that header, too.
21:05:46<nicolas17>lol
21:05:57<nicolas17>isn't that against spec?
21:06:03nepeat quits [Remote host closed the connection]
21:06:31<@rewby>masterx244|m: I remember the time we accidentally'd hetzner cloud's backbone
21:07:01nepeat (nepeat) joins
21:14:36<@JAA>nicolas17: I'm actually not sure. It's the 'date and time at which the message was originated' per RFC 9110. It then references RFC 5322 about 'Internet Message Format (IMF)', whatever that is. Sounds email-like. And that specifically mentions:
21:14:45<@JAA>> it is specifically not intended to convey the time that the message is actually transported, but rather the time at which the human or other creator of the message has put the message into its final form, ready for transport.
21:15:19<@JAA>But is that even the HTTP response's final form if the CDN then updates the Age header and a bunch of other things?
21:15:29<@JAA>¯\_(ツ)_/¯
21:16:22<steering>it doesn't matter, it's over 9000
21:16:31<@JAA>:-)
21:16:38<steering>also yeah it's the email Date header that it's referencing.
21:17:14<steering>I would say that it should be the time of the original response, since that's the "message"; both email and HTTP assume the headers will have been modified along the way (i.e. adding Received, Return-Path)
21:17:18<@JAA>TIL the official name of an email.
21:17:42<@JAA>Hmm, yeah, that makes sense.
21:18:13<nicolas17>HTTP's analogies to MIME/email was wrong all along
21:18:14<steering>of course there could be another argument that the spec is saying it should be the same as Last-Modified or perhaps file-birthtime :P
21:18:27<steering>nicolas17: indeed
21:20:06<@JAA>And then WARC was heavily based on HTTP including all its flaws in the old RFC.
21:20:26<steering>I wonder how many caches do it which way
21:20:49<nicolas17>JAA: imagine chunked encoding at the WARC-record level /o\
21:21:00<@JAA>nicolas17: You mean segmented records?
21:21:16<@JAA>Although at least they're not terminated by an empty record.
21:21:17<nicolas17>I regret my comment
21:21:28<@JAA>Also I'm not sure any software out there properly supports them.
21:21:44<@JAA>But yes, it is a thing. D:
21:21:47DLoader quits [Ping timeout: 256 seconds]
21:22:16DLoader (DLoader) joins
21:22:41<@JAA>There are some nice use cases, actually. Like splitting up huge responses into multiple WARCs. Or streaming to WARC without first buffering the full response.
21:23:00<@JAA>It's just that nobody seems to support reading such data, so it's not used on writing either.
21:29:01msrn_ quits [Ping timeout: 255 seconds]
21:30:22eightthree quits [Ping timeout: 255 seconds]
21:30:48mikael joins
21:31:15eightthree joins
21:43:51<klg>aiui Last-modified is about the body while Date is about generation of the message itself regardless of what transformations proxy/cache might further apply; e.g. at a different date there might be a different set of representations available at the origin and so the content negotiation might play out differently, last-modified value of any particular representation is independent of that
21:44:39<klg>after all other kinds of Internet messages would also get Received header prepended or their Path header updated etc
21:45:32<nicolas17>I always thought Date was simply the current time of the server, used to compensate for clock skew when looking at Last-Modified and Expires and such
21:46:19<klg>and so did the authors of httpsdate and probably most people in general
21:50:46<steering>the question is how that works with caches (and reverse proxies in general) :)
21:51:43<steering>the cache might not have the same clock skew as the origin after all
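The httpsdate-style reading nicolas17 describes treats Date as the server's current clock. A sketch of that skew estimate, corrected by the Age header (seconds spent in caches, per RFC 9111) — against a caching CDN like the one above it can still be off by however long the response sat in cache if Age isn't set consistently; the URL is a placeholder:

```python
# Estimate clock skew from the Date header, httpsdate-style, adjusted
# by the Age header when a cache reports one. A stale cached Date (as
# seen above) still breaks this if Age is missing or wrong.
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
import requests

r = requests.head("https://example.org/", timeout=10)  # placeholder URL
server_time = parsedate_to_datetime(r.headers["Date"])
server_time += timedelta(seconds=int(r.headers.get("Age", 0)))
skew = datetime.now(timezone.utc) - server_time
print(f"apparent skew vs server clock: {skew.total_seconds():+.1f}s")
```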
21:51:51etnguyen03 (etnguyen03) joins
22:00:45Notrealname1234 (Notrealname1234) joins
22:01:33Notrealname1234 quits [Client Quit]
22:08:07iohjp joins
22:15:31iohjp quits [Client Quit]
22:20:20BlueMaxima joins
22:35:30etnguyen03 quits [Client Quit]
22:40:08etnguyen03 (etnguyen03) joins
23:09:32etnguyen03 quits [Client Quit]
23:29:10etnguyen03 (etnguyen03) joins