00:00:27 | | etnguyen03 quits [Client Quit] |
00:04:03 | | fuzzy80211 quits [Read error: Connection reset by peer] |
00:04:23 | | fuzzy80211 (fuzzy80211) joins |
00:11:12 | | JaffaCakes118 quits [Remote host closed the connection] |
00:11:45 | | JaffaCakes118 (JaffaCakes118) joins |
00:32:54 | | nulldata quits [Quit: So long and thanks for all the fish!] |
00:33:12 | | Mateon2 joins |
00:33:48 | | nulldata (nulldata) joins |
00:35:07 | | Mateon1 quits [Ping timeout: 256 seconds] |
00:35:08 | | Mateon2 is now known as Mateon1 |
00:45:42 | | etnguyen03 (etnguyen03) joins |
00:48:52 | | nulldata quits [Client Quit] |
00:50:18 | | nulldata (nulldata) joins |
00:50:57 | | nulldata quits [Client Quit] |
00:53:31 | | nulldata (nulldata) joins |
01:36:00 | | JaffaCakes118 quits [Remote host closed the connection] |
01:36:02 | | JaffaCakes118_2 (JaffaCakes118) joins |
01:49:49 | | Hackerpcs quits [Quit: Hackerpcs] |
01:52:35 | | Hackerpcs (Hackerpcs) joins |
01:58:36 | | Hackerpcs quits [Remote host closed the connection] |
01:58:38 | | gfgf joins |
01:59:22 | | gfgf quits [Client Quit] |
02:00:09 | | Hackerpcs (Hackerpcs) joins |
02:05:58 | | Hackerpcs quits [Remote host closed the connection] |
02:07:28 | | Hackerpcs (Hackerpcs) joins |
02:12:35 | | Hackerpcs quits [Ping timeout: 256 seconds] |
02:13:52 | | Irenes quits [Ping timeout: 255 seconds] |
02:14:30 | | Hackerpcs (Hackerpcs) joins |
02:24:55 | | Irenes (ireneista) joins |
02:48:12 | | etnguyen03 quits [Remote host closed the connection] |
02:54:23 | | hackbug quits [Remote host closed the connection] |
02:55:19 | | chains quits [Remote host closed the connection] |
02:56:53 | | hackbug (hackbug) joins |
03:08:48 | | hackbug quits [Remote host closed the connection] |
03:10:00 | | hackbug (hackbug) joins |
03:13:09 | | BlueMaxima quits [Read error: Connection reset by peer] |
03:30:47 | | ArchivalEfforts quits [Quit: No Ping reply in 180 seconds.] |
03:31:58 | | ArchivalEfforts joins |
03:38:42 | | JaffaCakes118_2 quits [Remote host closed the connection] |
03:39:30 | | JaffaCakes118_2 (JaffaCakes118) joins |
03:46:02 | | DogsRNice quits [Read error: Connection reset by peer] |
03:49:01 | | DogsRNice joins |
03:53:46 | | DogsRNice quits [Client Quit] |
03:55:14 | | igloo22225 quits [Quit: The Lounge - https://thelounge.chat] |
04:05:44 | | monoxane6 (monoxane) joins |
04:05:45 | | monoxane4 (monoxane) joins |
04:05:56 | | monoxane6 quits [Remote host closed the connection] |
04:06:49 | | monoxane quits [Ping timeout: 255 seconds] |
04:06:50 | | monoxane4 is now known as monoxane |
04:18:17 | | DogsRNice joins |
04:19:33 | | DogsRNice quits [Remote host closed the connection] |
04:29:36 | <nulldata> | . |
04:30:04 | | benjins2_ quits [Read error: Connection reset by peer] |
04:31:09 | | loug83 joins |
04:32:32 | <nulldata> | MariaDB acquired https://www.prnewswire.com/news-releases/k1-acquires-mariadb-a-leading-database-software-company-and-appoints-new-ceo-302243508.html |
04:42:43 | <pabs> | immibis: there are several mirrors for sourceforge file releases, they redirect requests to those mirrors |
04:44:00 | <pabs> | immibis: rolling hashes, like modern backup systems such as restic/borg use, might be good for dealing with the redundancy stuff |
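For reference, a toy sketch of the content-defined chunking idea behind restic/borg-style dedup that pabs refers to — not their actual algorithm (they use incrementally updated buzhash/Rabin fingerprints plus min/max chunk sizes); all names and constants here are illustrative:

```python
# Toy content-defined chunking: boundaries depend only on the last WINDOW
# bytes, so a change early in a file only disturbs nearby chunks and the
# rest still dedups. Real tools update a rolling hash incrementally; this
# recomputes a hash per position purely for clarity.
import hashlib

WINDOW = 48
MASK = (1 << 12) - 1  # roughly 4 KiB average chunks for this toy


def boundary(window: bytes) -> bool:
    h = int.from_bytes(hashlib.sha1(window).digest()[:4], "big")
    return (h & MASK) == 0


def chunks(data: bytes):
    start = 0
    for i in range(WINDOW, len(data) + 1):
        if boundary(data[i - WINDOW:i]):
            yield data[start:i]
            start = i
    if start < len(data):
        yield data[start:]


def dedup(files):
    """Store each distinct chunk once, keyed by its SHA-256."""
    store = {}
    for data in files:
        for c in chunks(data):
            store.setdefault(hashlib.sha256(c).digest(), c)
    return store
```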
05:31:33 | | ace24x quits [Remote host closed the connection] |
05:51:36 | | igloo22225 (igloo22225) joins |
05:58:21 | | grid joins |
06:25:53 | | lennier2 quits [Ping timeout: 256 seconds] |
06:28:07 | | Miori quits [Ping timeout: 255 seconds] |
06:43:21 | <@arkiver> | nicolas17: you're trying to save on space archiving apple releases? |
06:43:43 | <@arkiver> | are you storing these at IA? what are you doing to save storage? |
06:44:08 | <@arkiver> | wondering because if it's stored at IA, maybe in this case (on this specifically) one doesn't have to save on space |
07:09:31 | | Unholy236192464537713 quits [Ping timeout: 256 seconds] |
07:19:34 | <immibis> | pabs: good idea but it's not that simple since compression such as .gz has a cascading effect: one different uncompressed bit changes the rest of the file |
07:20:54 | <@JAA> | Or even the same input might result in different compression output. |
07:21:19 | <immibis> | i did some experiments like this and compressed ~400GB of minecraft mod-packs (a hopefully complete set from FTB) down to ~4GB. But you want to get the original files back at the end, especially if they are signed, and this means writing a reversible decompressor which compresses the file the same way it was compressed before. |
07:24:38 | <immibis> | original bit-for-bit identical files |
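The cascading effect immibis mentions is easy to see directly; a small sketch with zlib (flip position and input sizes are arbitrary):

```python
# Change one byte near the start of the input and the DEFLATE output
# diverges near that point and generally stays different afterwards, so
# chunk-level dedup of the *compressed* bytes mostly stops matching.
import os
import zlib

original = os.urandom(1000) + b"A" * 1_000_000  # incompressible head, compressible tail
modified = bytearray(original)
modified[10] ^= 0xFF  # flip a single early byte

a = zlib.compress(original, 9)
b = zlib.compress(bytes(modified), 9)

# length of the identical prefix of the two compressed streams
common = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y), min(len(a), len(b)))
print(f"compressed sizes: {len(a)} vs {len(b)}, identical prefix: {common} bytes")
```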
07:32:44 | | Miori joins |
07:43:40 | | penny joins |
07:46:56 | | penny quits [Client Quit] |
07:57:29 | | vvvvv666 joins |
07:57:55 | | vvvvv666 quits [Client Quit] |
08:28:42 | <that_lurker> | Recent Tor Exit Node Operator Raids and Legal Harassment in Germany https://forum.torproject.org/t/tor-relays-artikel-5-e-v-another-police-raid-in-germany-general-assembly-on-sep-21st-2024/14533 https://news.ycombinator.com/item?id=41505009 |
08:30:18 | <that_lurker> | "Artikel 5 e.V. is now calling for a general assembly on Sep 21st 2024. We are looking for new board members (who take over and organize a new registered address and keep running exits) or discuss ALL alternative options. |
08:30:20 | <that_lurker> | These options include "just stop running exits" or even the most drastic step of liquidating the entire organization and distribution of the remaining budget to other German organizations (that would have to qualify under our non-profit by-laws)." |
08:47:45 | | grid quits [Client Quit] |
09:58:42 | | pixel leaves [Error from remote client] |
10:02:42 | | pixel (pixel) joins |
10:04:08 | | lennier2_ joins |
10:09:16 | <magmaus3> | "Recent Tor Exit Node Operator Raids and Legal Harassment in Germany" ← shouldn't the tor exit operators be counted as not responsible for the traffic? (like ISPs and etc) |
10:09:33 | <immibis> | yes but most law enforcement are violent criminals, especially in germany |
10:09:49 | <magmaus3> | that's like everywhere |
10:09:59 | <immibis> | it's especially in germany |
10:10:04 | <immibis> | denazification never happened |
10:10:10 | <@JAA> | This discussion stops now. |
10:10:23 | <magmaus3> | sure |
10:10:27 | <immibis> | IMO the biggest risk isn't that they'll raid your tor node, since that will be cleared by the court - it's that they could discover something else, illegal or not, when they go to raid your tor node. |
10:10:37 | <magmaus3> | yeah |
10:10:54 | <immibis> | why is it illegal to talk about legal threats to tor nodes here? |
10:11:03 | <magmaus3> | that was for a diff reason |
10:11:04 | <immibis> | it should be in -ot? |
10:11:13 | <magmaus3> | maybe |
10:11:24 | <magmaus3> | i think the issue was with the political discussion |
10:11:29 | <@JAA> | Correct |
10:11:40 | <immibis> | tor node raids are politics |
10:12:43 | <@JAA> | I've archived the public web stuff of Artikel 5 e.V. that I could find with AB. |
10:12:57 | <magmaus3> | cool :3 |
10:15:56 | <immibis> | Artikel 5 e.V. is also a highly political organization |
10:16:32 | <@JAA> | Which is completely irrelevant for this channel. |
10:24:23 | <immibis> | it seems A5eV would be well served by renting a clubhouse so that the business premises do not default to being at the chairman's home. |
10:36:17 | <@arkiver> | immibis: it's the "denazification never happened" stuff. that comment is not welcome here |
10:36:48 | <@arkiver> | that has been brought up before. |
10:38:24 | <@arkiver> | it's part of labeling entire perceived "groups" of people as x, where x may be some word like "nazi"/"communist"/"fascist"/etc. |
10:39:59 | <@arkiver> | also when it is not meant literally, but rather in some symbolical way, it's not something Archive Team is the right place for. |
11:00:03 | | Bleo1826007227196 quits [Quit: The Lounge - https://thelounge.chat] |
11:01:22 | | Bleo1826007227196 joins |
11:26:30 | <kpcyrd> | immibis: it looks like they do have an office location, there are multiple companies registered at that same address (including ImmobilienScout24) |
11:27:26 | <kpcyrd> | but they can still figure out home addresses of people involved in an e.V. (and apparently they did) |
11:32:21 | <immibis> | well the search warrant says business premises so whoever executed it belongs in prison for burglary then |
11:33:33 | <immibis> | a search warrant for location X doesn't give you a pass to burglarize location Y |
11:34:04 | <kpcyrd> | *the search warrant published by Artikel 5 e.V. |
11:34:16 | <kpcyrd> | there may be more |
11:36:05 | <kpcyrd> | https://artikel5ev.de/home/hausdurchsuchung-am-fr-16-08-2024/ |
11:36:11 | <kpcyrd> | idk, this is all very confusing |
11:37:35 | <kpcyrd> | on their website they write "the club doesn't have any dedicated space" and that's why their homes got searched, but they very clearly do have a dedicated space at Hatzper Str. 172B |
11:38:22 | | SkilledAlpaca4 quits [Quit: SkilledAlpaca4] |
11:38:40 | <kpcyrd> | make it make sense |
11:40:04 | | SkilledAlpaca4 joins |
11:42:06 | | decky quits [Read error: Connection reset by peer] |
11:43:40 | <kpcyrd> | one of the chairmen runs a company at that address, so maybe the e.V. was using a shared space |
11:44:01 | | loug83 quits [Read error: Connection reset by peer] |
11:44:02 | <@arkiver> | immibis: i'm pinging again though on what i wrote (and will leave it at that) - this has now happened a few times, i believe you have seen (or maybe not? let me know if not) the longer message i posted back then on this |
11:44:05 | | loug83 joins |
11:45:30 | <immibis> | i gather you want to ban me cause of what i said about search warrants. Fine. I'll stop all containers I may be running, delete their data without uploading it and leave IRC for good. |
11:45:38 | <immibis> | AT IRC, that is |
11:46:03 | <@rewby> | Ot' |
11:46:08 | <@rewby> | It's not about the warrants. |
11:46:20 | <@rewby> | It's about the nazi comments. |
11:47:33 | <immibis> | you already wrote about me saying denazification never happened |
11:47:53 | <@arkiver> | that is about symbolical nazi remarks - not search warrants |
11:49:04 | <@arkiver> | i do not have as a goal to ban you - i am hoping you would somewhat understand what i wrote. again, it's not about the search warrant discussion, it's about symbolically labeling groups as "nazi"/"fascists"/etc. |
11:49:27 | <@arkiver> | for completeness, i will post the message i wrote some time ago about this https://transfer.archivete.am/inline/NzLSU/message.txt |
11:50:51 | | monoxane1 (monoxane) joins |
11:50:53 | | monoxane0 (monoxane) joins |
11:51:05 | | monoxane1 quits [Remote host closed the connection] |
11:52:18 | | monoxane quits [Ping timeout: 256 seconds] |
11:52:18 | | monoxane0 is now known as monoxane |
11:53:43 | <@arkiver> | (note the message contains some references to the context at the time, but i believe it is still clear and i stand behind it) |
11:54:35 | <kpcyrd> | "especially now with everything going on" is a very timeless thing to say |
11:57:01 | <@arkiver> | for context see https://hackint.logs.kiska.pw/archiveteam-ot/20230520#c345976 and the previous days(s?) |
12:03:50 | <kpcyrd> | "As a consequence, I am personally no longer willing to provide my personal address&office-space as registered address for our non-profit/NGO[...]" written by the chairman who has a company at that address |
12:03:59 | <kpcyrd> | (at https://forum.torproject.org/t/tor-relays-artikel-5-e-v-another-police-raid-in-germany-general-assembly-on-sep-21st-2024/14533) |
12:04:07 | <immibis> | denazification is not identifying specific people as nazis, it is the removal of nazi ideology from general perception in the entire country of germany |
12:04:27 | <kpcyrd> | so it seems the theory of "the e.V. only had a postbox at that address" tracks |
12:05:25 | <immibis> | i see that you want no politics not directly related to archive team, so the whole exit node raiding thing is not allowed, except for the statement that it happened and therefore this e.V.'s site could be at risk. |
12:33:30 | | benjins2 joins |
12:38:07 | <nicolas17> | arkiver: I know of two people with a giant NAS at home with apple releases (including many that apple already deleted from their servers) |
12:38:37 | <nicolas17> | and there's so much redundant data... |
12:39:39 | <nicolas17> | and the files keep getting bigger and more numerous https://theapplewiki.com/wiki/Beta_Firmware/iPhone/17.x |
13:03:14 | | SootBector quits [Remote host closed the connection] |
13:03:35 | | SootBector (SootBector) joins |
13:10:44 | | igloo22225 quits [Read error: Connection reset by peer] |
13:10:52 | | igloo22225 (igloo22225) joins |
13:15:22 | | xarph_ quits [Ping timeout: 255 seconds] |
13:16:31 | | xarph joins |
13:42:19 | <@arkiver> | nicolas17: are we actively archiving those apple CDN URLs into the wayback machine? |
13:42:32 | <@arkiver> | please feel free to at least with ArchiveBot (CC JAA ) |
14:27:40 | | hackbug quits [Remote host closed the connection] |
14:27:40 | | thuban quits [Read error: Connection reset by peer] |
14:28:00 | | hackbug (hackbug) joins |
14:28:00 | <corentin> | arkiver: I grabbed them all |
14:28:06 | | thuban (thuban) joins |
14:30:30 | | Radzig2 joins |
14:31:31 | | Radzig quits [Ping timeout: 256 seconds] |
14:32:20 | <@arkiver> | corentin: as in, is that happening periodically? |
14:46:14 | <corentin> | arkiver: no no sorry, I mean I grabbed the ones in the URLs shared. There must be something "bigger" to do though |
14:46:56 | <@arkiver> | corentin: as in the ones on https://theapplewiki.com/wiki/Beta_Firmware/iPhone/17.x ? |
14:47:05 | <@arkiver> | nicolas17: were you archiving those on the long term? |
15:01:59 | <@arkiver> | what was that TLD again that got a warning? |
15:02:16 | <@arkiver> | for hosting too much spam or phishing addresses or something |
15:02:58 | <monoxane> | probably one of the freenom ones |
15:03:02 | <monoxane> | .tk et al |
15:05:47 | | MrMcNuggets (MrMcNuggets) joins |
15:06:06 | <@arkiver> | hmm yeah |
15:06:22 | <@arkiver> | maybe |
15:14:15 | | pocketwax17 (pocketwax17) joins |
15:16:37 | | pocketwax17 quits [Client Quit] |
15:17:29 | | pocketwax17 (pocketwax17) joins |
15:18:21 | | pocketwax17 quits [Client Quit] |
15:18:34 | | pocketwax17 (pocketwax17) joins |
15:30:30 | <corentin> | arkiver: yes, sorry I should have been more clear! |
15:53:14 | <@arkiver> | alright! |
15:56:02 | <@arkiver> | well so to be clear, feel free to continue archiving this with ArchiveBot, it is well worth the size i think |
16:07:55 | <nicolas17> | corentin: note the wiki may have duplicates (XS and XS Max use the same files but have separate tables on the wiki) |
16:08:27 | <nicolas17> | if you just grabbed the URLs from the wiki and fed them to AB, I'm not sure if AB dedups |
16:09:34 | | VerifiedJ9 quits [Remote host closed the connection] |
16:10:12 | | VerifiedJ9 (VerifiedJ) joins |
16:10:55 | <nicolas17> | arkiver: I don't have the disk space to archive this long term myself :P but I'm helping people who do |
16:11:04 | <nicolas17> | and yeah I was planning on discussing how to archive this properly |
16:11:15 | <myself> | how much space we talkin'? |
16:11:28 | <nicolas17> | an admin of theapplewiki started feeding some URLs to savepagenow and I told him that was probably not the best way |
16:12:47 | <@arkiver> | nicolas17: you can feed lists into ArchiveBot! |
16:12:50 | <@arkiver> | IA definitely has the space for this |
16:13:04 | <@arkiver> | individual items on IA _next to that_ are also welcome, i can create a collection for you and others |
16:13:24 | <nicolas17> | https://archive.org/details/apple-ipsws already done that for some files that were already deleted from Apple |
16:14:01 | <@arkiver> | yeah feel free to put everything in that collection and in ArchiveBot! |
16:15:07 | <@arkiver> | if any help is needed, don't hesitate to ping me :) |
16:16:22 | <nicolas17> | hmmmm I imported info from appledb into a SQLite database, and "select sum(file_size) from sourcefile where type='ipsw'" returns 48TB, which seems low, I wonder if my import was excluding something important... I last looked into it in June or so |
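In case it helps narrow down why the 48 TB total looks low, a hypothetical breakdown along the same lines — it reuses the sourcefile table and the type/file_size columns from the query above; the database filename is assumed:

```python
# Per-type totals plus a count of rows with missing sizes can show whether
# some category of files is being dropped or left NULL by the import.
import sqlite3

conn = sqlite3.connect("appledb.sqlite")  # assumed filename
for row in conn.execute(
    """
    SELECT type,
           COUNT(*)                        AS files,
           COUNT(*) - COUNT(file_size)     AS missing_size,
           ROUND(SUM(file_size) / 1e12, 2) AS tb
    FROM sourcefile
    GROUP BY type
    ORDER BY tb DESC
    """
):
    print(row)
```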
16:16:40 | | knecht quits [Quit: knecht] |
16:17:12 | <@arkiver> | nicolas17: i see on an item like https://archive.org/details/xcode-16.1-beta1 you included a `source` metadata field, is it possible to include that for ipsw items as well? |
16:17:41 | <nicolas17> | with the original URL? yes |
16:17:57 | <@arkiver> | yeah! |
16:18:33 | <nicolas17> | though there's some where the source will be "someone sent it to me and it seemed to be the right file but it has been gone from apple-cdn for 2 years now" |
16:19:12 | <@arkiver> | nicolas17: let's add a note to that, but feel free to include at your discretion |
16:21:38 | <@arkiver> | what does IPSW stand for? is the official name iPSW ? i can't actually easily find this info online... |
16:22:40 | <nicolas17> | afaik it was originally iPod Software Update, then iPhone Software Update with a different format |
16:23:18 | <@JAA> | nicolas17: AB only dedupes identical URLs. |
16:23:22 | <nicolas17> | nowadays even macOS uses iPhone-like IPSW files |
16:23:32 | <nicolas17> | JAA: oh that's fine for this case |
16:23:45 | <@JAA> | And even then, not on redirects. |
16:23:53 | <@JAA> | It does no content dedupe at all. |
16:24:02 | <nicolas17> | the wiki page has some URLs multiple times and I didn't know if corentin had dedup'd them, if AB dedups them that's enough |
16:24:15 | <@JAA> | Ah |
16:24:34 | <@JAA> | Last time we spoke about this, I think you said there were like 4 different URLs for each file. |
16:24:59 | <nicolas17> | for macOS InstallAssistant.pkg files, there's often 2 different URLs for each |
16:26:01 | <nicolas17> | someone uploaded WARCs of them as items to archive.org containing 4 copies, because they archived those two URLs, each on both http+https, with no content dedup :/ |
16:26:16 | <@JAA> | Ah right, that's the one you mentioned, yeah. |
16:26:49 | <nicolas17> | afaik WBM doesn't care about http+https so we would only need to archive 2 anyway |
16:27:05 | <@JAA> | Correct, the WBM doesn't care about the scheme in general. |
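Since AB only dedups byte-identical URLs and the WBM ignores http vs https, a URL list could be pre-deduped before submission; a small sketch (collapsing the scheme is the only normalization applied here):

```python
# Read URLs on stdin, print each distinct URL once, treating http/https
# pairs as the same URL. Keeps the first occurrence and preserves order.
import sys
from urllib.parse import urlsplit, urlunsplit


def key(url: str) -> str:
    # collapse the scheme so an http/https pair counts as one URL
    p = urlsplit(url)
    return urlunsplit(("https", p.netloc, p.path, p.query, p.fragment))


seen = set()
for line in sys.stdin:
    url = line.strip()
    if not url:
        continue
    k = key(url)
    if k not in seen:
        seen.add(k)
        print(url)
```

Usage would be along the lines of `python3 dedup_urls.py < wiki_urls.txt > deduped.txt`.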
16:27:21 | <@arkiver> | nicolas17: so would you say the official name nowadays is just IPSW? even apple or wikipedia doesn't clearly mention anything else |
16:27:31 | <@arkiver> | i guess it has so many meanings now, that it's just IPSW |
16:27:45 | <nicolas17> | arkiver: I think macOS Finder shows .ipsw files as "Apple software update" nowadays :P |
16:30:38 | <@arkiver> | that is annoying |
16:31:11 | <nicolas17> | Apple has many misnomers due to scope growth tbh |
16:31:57 | <nicolas17> | the sharingd daemon used to deal with AirDrop (sharing files wirelessly), now it handles most of the Continuity features many of which have nothing to do with sharing |
16:35:06 | <nicolas17> | mail on mac: "Mail.app"; mail on iPhone: "MobileMail.app"; many MobileSomething names refer to iOS... then some features like MobileAssets get ported to macOS and nothing makes sense anymore |
16:38:08 | <masterx244|m> | too bad that sometimes device manufacturers DMCA those archives off the net even though others are glad that those archives exist |
16:38:57 | <masterx244|m> | (got a strike due to that crap already, luckily the IA version was just a secondary location, my personal copy where the state is kept of what i got already is not visible on the open web for obvious reasons) |
16:40:00 | <nicolas17> | betas used to be restricted to members of the developer program |
16:40:11 | <nicolas17> | later only beta 1 was restricted that way |
16:41:02 | <nicolas17> | last year I got a DMCA takedown not for re-hosting the 17.0b1 ipsws, but for *tweeting a link* to someone else's website that re-hosted them |
16:41:49 | <nicolas17> | the lawyers later withdrew the claim but I had already deleted the tweet myself by then *shrug* |
16:43:16 | <masterx244|m> | mine was for the archival of the sena (motorcycle intercom) firmware files. Might have gotten a few files that they didn't want out (their update server is a bit of a leaky pipe, got a dumb monitoring of a few files into a git and that sometimes leaks stuff before release) |
16:43:18 | <@arkiver> | masterx244|m: yes :/ |
16:44:00 | <masterx244|m> | got a few 0.X.X versions that way, too |
16:44:04 | <@arkiver> | i do advise keeping your own copies of the data (very important), next to storing on IA, but in the case of very large amounts of data that may not be practical |
16:44:59 | <masterx244|m> | luckily it's pretty small, 20GB or so and for coldstorage a 7zip solid compression gets it down to 300MB or so due to quite a few cross-file-redundancies |
16:45:37 | <nicolas17> | I can't avoid feeling bad about duplication |
16:45:39 | <masterx244|m> | and yeah, got my local copy still since i have some automatic crawling, the IA copy was generated by that tooling, too with some CSV upload magic |
16:45:55 | | knecht (knecht) joins |
16:46:17 | <masterx244|m> | sometimes they had different language versions where the code part was identical. or different version of code but the audio snippets for the menu were the same |
16:46:28 | <nicolas17> | not like "me and Siguza and qwerty and archive.org having the same file", that's good for redundancy |
16:46:37 | <nicolas17> | but about "archive.org having the same file 3 times in separate captures/items" |
16:46:57 | <nicolas17> | waaaasteful >_< |
16:47:28 | <nicolas17> | "the file is only 10GB it's no big deal" yeah but wasteful >_< |
16:48:30 | <@arkiver> | nicolas17: if it makes 50 TB go times 4, that is a big problem |
16:48:37 | <@arkiver> | URL agnostic deduplication would help |
16:49:18 | <masterx244|m> | yeah, splitting the stuff into "headers" and payload and if a payload segment is == with another one just storing a pointer would be enough |
16:50:55 | <@arkiver> | i believe AB deduplicates (?) |
16:51:12 | <@arkiver> | Wget-AT can be run with the 4 URLs as input and URL agnostic deduplication turned on and it will handle it |
16:53:19 | | pocketwax17 quits [Client Quit] |
16:59:57 | <@JAA> | AB does not dedupe. |
17:00:35 | | tttt joins |
17:02:15 | | tttt quits [Client Quit] |
17:03:07 | | tettehe joins |
17:03:55 | <@arkiver> | ah |
17:04:33 | <@arkiver> | JAA: i do remember some messages about that streaming over archivebot.com in the past about duplicate content - was it a feature in the past? |
17:06:10 | | tettehe quits [Client Quit] |
17:08:47 | <@JAA> | arkiver: There is a dupe detection, but that's for stopping recursion on identical responses. It doesn't do anything about the WARC writing. It's also been broken for, uh, over 8 years, before I arrived here. |
17:09:10 | <@arkiver> | ah |
17:09:23 | <@arkiver> | thanks for clearing that up |
17:12:24 | | chains joins |
17:14:05 | <chains> | How do I know when a blog has been archived by frogger? |
17:16:39 | <@rewby> | I believe the bot has dedupe so you can probably just put it in and let the bot deal. That said, the only good way is to check if the Wayback machine contains the blog |
17:18:54 | <nicolas17> | masterx244|m: the WARC format supports that kind of deduplication (storing the request and response headers, and only a pointer to the previous response body), but archivebot doesn't use it |
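As a rough illustration of what nicolas17 describes, a sketch using the warcio library (assumed installed; assumes the records carry WARC-Payload-Digest headers) that finds response payloads stored more than once — each such duplicate is what a revisit record pointing at the first copy would have avoided:

```python
# Scan a WARC and group response records by payload digest; any digest
# seen more than once means the same body was stored multiple times.
from collections import defaultdict

from warcio.archiveiterator import ArchiveIterator


def duplicate_payloads(path):
    seen = defaultdict(list)
    with open(path, "rb") as f:
        for record in ArchiveIterator(f):
            if record.rec_type != "response":
                continue
            digest = record.rec_headers.get_header("WARC-Payload-Digest")
            uri = record.rec_headers.get_header("WARC-Target-URI")
            if digest:
                seen[digest].append(uri)
    return {d: uris for d, uris in seen.items() if len(uris) > 1}


for digest, uris in duplicate_payloads("capture.warc.gz").items():
    print(digest, len(uris), "copies:", uris[:3])
```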
17:26:38 | | bill93 joins |
17:27:41 | | bill93 quits [Client Quit] |
17:28:03 | <chains> | rewby, gotcha thanks |
17:35:20 | | PredatorIWD2 joins |
17:37:27 | <nicolas17> | corentin: what did you do with theapplewiki? archivebot? I don't see the job running and I doubt it finished |
17:47:46 | <corentin> | nicolas17: no I did it myself |
17:48:15 | <nicolas17> | what does that mean |
17:48:17 | <nicolas17> | savepagenow? |
17:48:39 | <nicolas17> | or downloaded to your own disk? :P |
17:49:22 | <h2ibot> | Nulldata edited Deathwatch (+270, /* 2024 */ Added SteamRep (thanks PredatorIWD2)): https://wiki.archiveteam.org/?diff=53445&oldid=53444 |
17:49:39 | <corentin> | I work at the Internet Archive, I write and maintain crawlers & crawls, I captured it with Zeno and I'll upload it at some point (when the upload process kicks off) |
17:50:42 | <@rewby> | Neat, I've not checked up on how zeno's been coming along |
17:50:52 | <nulldata> | corentin++ |
17:50:52 | <eggdrop> | [karma] 'corentin' now has 1 karma! |
17:51:00 | | kokos- joins |
17:51:05 | <@rewby> | I remember trying to get heritrix to do stuff a few years ago and that was pain and suffering |
17:51:09 | <@rewby> | Mostly because java |
17:51:10 | <corentin> | It is |
17:51:19 | <corentin> | It's why I wrote Zeno hahaha |
17:51:27 | <@rewby> | Very understandable |
17:52:09 | <@rewby> | Is there any docs on zeno or just "look at the code and figure it out? |
17:52:50 | <corentin> | We've had some huge work done on it recently to try and address long standing stability issues, because for a couple of years I was the only dev on it and so I was mostly using it for "experimental" crawls. Note: the WARC writing itself is very well tested and stable, I'm just talking about the crawling part. Anyway, a lot of work from me and a |
17:52:50 | <corentin> | couple of colleagues on it recently to get it way more stable, and more extensible / solid for the future features we'll add. |
17:53:24 | <corentin> | Sadly for now, no documentation. :) But --help will help you. ./Zeno get url https://google.com, ./Zeno get list list.txt... and -h for all the options |
17:53:26 | <@rewby> | Neat. I might have a go at it Later TM and see how it works |
17:53:38 | <corentin> | I hope it will, if you see any weird behavior, any bug, please open an issue |
17:54:01 | <masterx244|m> | same. might even be useful for grabs for one's own sanity. currently using grab-site for those sanity grabs |
17:54:03 | <@rewby> | Just from looking at the readme, is crawl hq that couch db thing I recall from eons ago? |
17:54:28 | <corentin> | Not at all, Crawl HQ is an internal queuing system. Internal as in IA internal. |
17:55:03 | <corentin> | At some point I'll write in the README that even if Zeno is fully OSS, it still has very IA-specific features sometimes, optional of course |
17:55:14 | <@rewby> | Yeah makes sense |
17:55:32 | <@rewby> | ooo it has an api |
17:55:35 | <@rewby> | Interesting |
17:55:37 | <corentin> | It's also opinionated, there are choices that are made so that it fits our usage more than anything else |
17:55:48 | <@rewby> | That's very fair |
17:55:49 | <corentin> | Well.. the API is mostly reading, nothing more yet |
17:55:52 | <@rewby> | Ah |
17:56:00 | <corentin> | But yeah of course I thought about like, adding URLs via the API etc |
17:56:06 | <corentin> | so many possibilities |
17:56:18 | <@rewby> | Is there a spec for crawlhq's api somewhere? Might be interesting to do an alt implementation of that to coordinate a small set of zeno instances. |
17:56:20 | <corentin> | any PR is welcome btw, there is so much to do |
17:56:35 | <corentin> | https://git.archive.org/wb/gocrawlhq this should be public |
17:56:43 | <corentin> | It's not an API doc per se |
17:56:51 | <corentin> | but it should be enough for a smart man to understand the endpoints |
17:56:54 | <@rewby> | Oh cool it supports headless browsers |
17:57:11 | <@rewby> | (I'm having a read through your cmd/ directory) |
17:57:29 | <corentin> | Well... no, it's very experimental. There is actually a PR opened for that. (idk why the --headless option made it to the --help) |
17:57:42 | <corentin> | Goal with that PR is to bring the capability of doing mixed crawls |
17:57:48 | <corentin> | where headless is only used on some domains |
17:57:53 | <corentin> | it's like 80% done |
17:58:00 | <corentin> | I'll get back to it at some point haha |
17:58:07 | | shgaqnyrjp_ is now known as shgaqnyrjp |
17:58:52 | <corentin> | about HQ, there is actually someone that wrote his own HQ-compliant system that uses like MongoDB or whatever, just to interact with Zeno haha |
17:58:59 | <@JAA> | How do you deal with TLS in the headless browser case? Since MITM proxying is required to get a correct WARC capture, and I've heard that TLS config on headless browsers is a mild pain. |
17:59:02 | <corentin> | it's not open source tho I think |
17:59:41 | <@arkiver> | can we move this to #archiveteam-dev or #archiveteam-ot ? |
17:59:47 | <nicolas17> | corentin: what's the state of deduplication in zeno? :P |
17:59:59 | <corentin> | I'll answer you both in ot or dev |
18:00:06 | <@JAA> | -dev sounds fine. |
18:00:12 | <@rewby> | arkiver: Sorry <3 |
18:00:19 | <@arkiver> | no worries, thanks :) |
18:01:04 | | MrMcNuggets quits [Client Quit] |
18:01:22 | | katia_ (katia) joins |
18:03:51 | <nicolas17> | JAA: https://transfer.archivete.am/inline/SYFTj/Screenshot_20240911_150109.png |
18:04:10 | <nicolas17> | nobody expects archiveteam scale :D |
18:05:17 | <@rewby> | lol |
18:06:13 | <that_lurker> | you should share the telegram numbers :-P |
18:06:33 | <@rewby> | Or reddit or #// |
18:06:52 | <@rewby> | Enjoy me some 8PiB of urls |
18:17:12 | | katia_ quits [Client Quit] |
18:17:36 | | kokos- quits [Client Quit] |
18:18:21 | | kokos- joins |
18:28:13 | | katia_ (katia) joins |
18:38:34 | <IDK> | https://www.pcmag.com/news/wix-to-block-russian-users-take-down-their-sites-in-wake-of-us-sanctions |
18:52:06 | <IDK> | Hi, the deadline would be sept 12 |
18:52:07 | <IDK> | https://www.bleepingcomputer.com/news/legal/wix-to-block-russian-users-starting-september-12/ |
18:53:14 | <@arkiver> | oof |
18:53:21 | <@arkiver> | today? |
18:53:33 | <@arkiver> | or no tomorrow, but yeah |
18:53:34 | <nicolas17> | how do we even find affected sites? |
18:54:06 | <nulldata> | Some discussion on this in #archivebot too |
18:54:26 | <nulldata> | "02:46 PM <@JAA> nyuuzyou shared a list of 167 presumably Russian Wix sites earlier. Needs a bit of cleanup. I was going to run it as !a <, but maybe separate jobs are better, not sure." |
18:54:30 | <IDK> | nicolas17: Дизайн этого сайта создан в конструкторе site:*.wixsite.com (the Russian phrase means "The design of this site was created in the builder") |
19:25:24 | <pokechu22> | I'm running https://transfer.archivete.am/inline/JRZXk/wixsite.com_russian_sites.txt which was obtained that way (though I also did a -site:wixsite.com search, which found a few results), apart from https://transfer.archivete.am/inline/55K2g/wix.txt which was sent by someone else and I don't know how they generated it |
19:26:24 | <pokechu22> | Finding more sites would be difficult because wix free sites are deliberately annoying and while https://woodland64.wixsite.com/mysite works, https://woodland64.wixsite.com/ and https://woodland64.wixsite.com/sitemap.xml are 404s |
19:27:44 | <@rewby> | JAA: Slight warning, GamersNexus just posted a news video calling for "datahoarders to pull down some of that" (anandtech) so we may see a bunch of people joining and asking. |
19:30:10 | <@JAA> | nicolas17: :-) |
19:30:15 | <@JAA> | rewby: Ack. No specific mention of us? |
19:30:29 | <@rewby> | No, but you know how this goes |
19:30:36 | <@JAA> | Yeah |
19:30:59 | <@rewby> | How is that AB job doing anyway? |
19:31:10 | <@JAA> | Put it in front in the #archiveteam topic, maybe it'll help. |
19:31:14 | <@rewby> | I don't see it on the dashboard but wiki says in progress |
19:31:47 | <@rewby> | I do see a forum job going |
19:31:48 | <@JAA> | The main site job finished, the forums are still going. |
19:32:00 | <@rewby> | I'll update the wiki |
19:32:06 | <@JAA> | No idea whether it's complete or there are complications. |
19:32:13 | <@rewby> | Ah |
19:32:21 | <@rewby> | We should probably check that first then |
19:33:26 | <@rewby> | JAA: How do we check if the job was successful? |
19:34:10 | <@JAA> | Browsing in the WBM, I guess, but not sure it's there yet. IA's upload processing is slow. |
19:34:32 | <@JAA> | Or poking around on the site to find things that are problematic with JS disabled etc. |
19:35:03 | <that_lurker> | someone could maybe contact GamersNexus and let them know |
19:35:35 | <@rewby> | Lemme see if we uploaded that WARC yet |
19:36:44 | <@JAA> | The WARCs are all uploaded. |
19:38:35 | <@JAA> | I think the last item(s) might still be deriving. |
19:38:53 | <@JAA> | And then there's another slight delay between derives and it showing up in the WBM, at least sometimes. |
19:39:23 | <@rewby> | It's starting to show up on wbm |
19:40:17 | <@JAA> | No surprises there, the first WARCs were uploaded over a week ago. |
19:40:27 | <@rewby> | The IA's search is just useless as per usual |
19:41:57 | <@JAA> | https://archive.fart.website/archivebot/viewer/job/20240901213047bvqa8 |
19:42:37 | <@JAA> | Images (that were linked rather than embedded) were run in a separate job. |
19:43:12 | <@JAA> | https://archive.fart.website/archivebot/viewer/job/202409092003491pjfi |
19:43:19 | <@JAA> | That's definitely not in the WBM yet. |
19:45:33 | <@rewby> | Did we ever reenable auto upload? |
19:46:53 | <@rewby> | If not, we should |
20:04:40 | <nicolas17> | okay... |
20:04:58 | <@JAA> | (Not yet) |
20:07:10 | <nicolas17> | https://transfer.archivete.am/inline/pXoXh/swcdn.apple.com-missing.txt these files still exist on Apple's CDN, and are not on WBM; Safari is ~180MB, BridgeOS is ~500MB, InstallAssistant is ~12GB |
20:07:20 | <nicolas17> | I assume this is a bad time to AB due to the upload problems |
20:08:29 | <nulldata> | (and emergency Russian Wix jobs) |
20:18:27 | <nicolas17> | I'm now looking at those that *are* on WBM to see whether they are actually failed captures |
20:20:56 | <@arkiver> | nicolas17: the upload problems still exist? |
20:21:07 | <nicolas17> | idk, #archivebot topic says so |
20:24:40 | <@rewby> | It'll be fine tomorrow or so |
20:24:45 | <nulldata> | AB uploads are still being done by JAA, our mechanical turk, at the moment |
20:24:54 | <nulldata> | 01:49 PM <@eggdrop> <@JAA> Normal operations wrt uploads should probably resume either tonight or tomorrow anyway. |
20:25:08 | <@rewby> | Yes. |
20:25:42 | <@rewby> | We've nearly cleared out 6T of backlog on atr3 so we have the iops again for AB |
20:26:01 | | lizardexile quits [Ping timeout: 255 seconds] |
20:27:50 | | lizardexile joins |
20:32:34 | | qwertyasdfuiopghjkl quits [Quit: Client closed] |
20:50:33 | <nicolas17> | uhh ok I have an urgent one |
20:51:02 | <nicolas17> | https://swcdn.apple.com/content/downloads/02/18/052-96238-A_V534Q7DYXO/lj721dkb4wvu0l3ucuhqfjk7i5uwq1s8tz/InstallAssistant.pkg this URL works intermittently depending on which CDN server I hit, I guess it was deleted from their origin |
20:51:24 | <nicolas17> | not sure how to deal with that; AB it, and if it gets 403, try again? |
20:51:44 | <pokechu22> | Yeah, that'll probably work |
20:54:03 | <nicolas17> | (I probably have the file content but that won't get us a WARC) |
20:54:22 | <@JAA> | nicolas17: I'm grabbing it with grab-site, working from there. |
20:58:06 | <nicolas17> | I guess there's also a chance that the cdn cache has a partial file, and then it will die halfway |
20:59:40 | <@JAA> | Nope, download just finished without issues. |
20:59:57 | <@JAA> | 12407486945 bytes |
21:00:18 | <nicolas17> | repeatedly trying locally, in between a lot of 403s, I got some 200s that failed after ~15MB |
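A sketch of the retry idea discussed above, using the requests library: keep retrying on 403, and also treat a 200 that stops short of Content-Length (the ~15 MB truncations) as a failure. Plain download only, no WARC output; retry count and sleep interval are arbitrary:

```python
import time

import requests


def fetch_flaky(url, dest, attempts=30, pause=5):
    """Retry until a complete 200 is received or attempts run out."""
    for i in range(1, attempts + 1):
        try:
            r = requests.get(url, stream=True, timeout=60)
            if r.status_code != 200:
                print(f"attempt {i}: HTTP {r.status_code}")
            else:
                expected = int(r.headers.get("Content-Length", 0))
                got = 0
                with open(dest, "wb") as f:
                    for block in r.iter_content(1 << 20):
                        f.write(block)
                        got += len(block)
                if expected == 0 or got == expected:
                    return dest
                print(f"attempt {i}: truncated at {got}/{expected} bytes")
        except requests.RequestException as e:
            print(f"attempt {i}: {e}")  # covers mid-transfer resets
        time.sleep(pause)
    raise RuntimeError("no complete copy after retries")
```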
21:01:19 | | loug83 quits [Client Quit] |
21:03:43 | <masterx244|m> | <nicolas17> "nobody expects archiveteam scale..." <- when AT goes to eleven it's really big hoses for pumping out the data.... |
21:04:12 | <masterx244|m> | (and the fun starts once the banhammers are flying and get dodged) |
21:05:15 | <@JAA> | > Date: Wed, 11 Sep 2024 13:06:10 GMT |
21:05:16 | <@JAA> | Interesting |
21:05:33 | <@JAA> | I guess they cache that header, too. |
21:05:46 | <nicolas17> | lol |
21:05:57 | <nicolas17> | isn't that against spec? |
21:06:03 | | nepeat quits [Remote host closed the connection] |
21:06:31 | <@rewby> | masterx244|m: I remember the time we accidentally'd hetzner cloud's backbone |
21:07:01 | | nepeat (nepeat) joins |
21:14:36 | <@JAA> | nicolas17: I'm actually not sure. It's the 'date and time at which the message was originated' per RFC 9110. It then references RFC 5322 about 'Internet Message Format (IMF)', whatever that is. Sounds email-like. And that specifically mentions: |
21:14:45 | <@JAA> | > it is specifically not intended to convey the time that the message is actually transported, but rather the time at which the human or other creator of the message has put the message into its final form, ready for transport. |
21:15:19 | <@JAA> | But is that even the HTTP response's final form if the CDN then updates the Age header and a bunch of other things? |
21:15:29 | <@JAA> | ¯\_(ツ)_/¯ |
21:16:22 | <steering> | it doesn't matter, it's over 9000 |
21:16:31 | <@JAA> | :-) |
21:16:38 | <steering> | also yeah it's the email Date header that it's referencing. |
21:17:14 | <steering> | I would say that it should be the time of the original response, since that's the "message"; both email and HTTP assume the headers will have been modified along the way (i.e. adding Received, Return-Path) |
21:17:18 | <@JAA> | TIL the official name of an email. |
21:17:42 | <@JAA> | Hmm, yeah, that makes sense. |
21:18:13 | <nicolas17> | HTTP's analogies to MIME/email was wrong all along |
21:18:14 | <steering> | of course there could be another argument that the spec is saying it should be the same as Last-Modified or perhaps file-birthtime :P |
21:18:27 | <steering> | nicolas17: indeed |
21:20:06 | <@JAA> | And then WARC was heavily based on HTTP including all its flaws in the old RFC. |
21:20:26 | <steering> | I wonder how many caches do it which way |
21:20:49 | <nicolas17> | JAA: imagine chunked encoding at the WARC-record level /o\ |
21:21:00 | <@JAA> | nicolas17: You mean segmented records? |
21:21:16 | <@JAA> | Although at least they're not terminated by an empty record. |
21:21:17 | <nicolas17> | I regret my comment |
21:21:28 | <@JAA> | Also I'm not sure any software out there properly supports them. |
21:21:44 | <@JAA> | But yes, it is a thing. D: |
21:21:47 | | DLoader quits [Ping timeout: 256 seconds] |
21:22:16 | | DLoader (DLoader) joins |
21:22:41 | <@JAA> | There are some nice use cases, actually. Like splitting up huge responses into multiple WARCs. Or streaming to WARC without first buffering the full response. |
21:23:00 | <@JAA> | It's just that nobody seems to support reading such data, so it's not used on writing either. |
21:29:01 | | msrn_ quits [Ping timeout: 255 seconds] |
21:30:22 | | eightthree quits [Ping timeout: 255 seconds] |
21:30:48 | | mikael joins |
21:31:15 | | eightthree joins |
21:43:51 | <klg> | aiui Last-modified is about the body while Date is about generation of the message itself regardless of what transformations proxy/cache might further apply; e.g. at a different date there might be a different set of representations available at the origin and so the content negotiation might play out differently, last-modified value of any particular representation is independent of that |
21:44:39 | <klg> | after all other kinds of Internet messages would also get Received header prepended or their Path header updated etc |
21:45:32 | <nicolas17> | I always thought Date was simply the current time of the server, used to compensate for clock skew when looking at Last-Modified and Expires and such |
21:46:19 | <klg> | and so did the authors of httpsdate and probably most people in general |
21:50:46 | <steering> | the question is how that works with caches (and reverse proxies in general) :) |
21:51:43 | <steering> | the cache might not have the same clock skew as the origin after all |
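For reference, the httpsdate-style use of the Date header that nicolas17 and klg describe boils down to something like the following sketch (requests assumed; adding Age is the usual correction for cached responses, though as the stale Apple CDN Date above shows it is not bulletproof):

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

import requests


def estimate_skew(url):
    """Rough local-clock skew estimate: (server Date + Age) - local UTC now."""
    r = requests.head(url, timeout=30)
    server = parsedate_to_datetime(r.headers["Date"])
    server += timedelta(seconds=int(r.headers.get("Age", 0)))  # cached responses
    return server - datetime.now(timezone.utc)


print(estimate_skew("https://example.com/"))
```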
21:51:51 | | etnguyen03 (etnguyen03) joins |
22:00:45 | | Notrealname1234 (Notrealname1234) joins |
22:01:33 | | Notrealname1234 quits [Client Quit] |
22:08:07 | | iohjp joins |
22:15:31 | | iohjp quits [Client Quit] |
22:20:20 | | BlueMaxima joins |
22:35:30 | | etnguyen03 quits [Client Quit] |
22:40:08 | | etnguyen03 (etnguyen03) joins |
23:09:32 | | etnguyen03 quits [Client Quit] |
23:29:10 | | etnguyen03 (etnguyen03) joins |