00:16:00<endrift>heh, I used to know someone who used the username ThreeHeadedMonkey in high school. Kiiinda doubt ThreeHM is the same person...but I did lose touch with him like a decade ago
00:17:22<endrift>I remember borrowing a copy of Final Fantasy Tactics from him and then never playing it
00:23:24cyanbox quits [Read error: Connection reset by peer]
00:24:41etnguyen03 quits [Client Quit]
00:36:39SootBector quits [Ping timeout: 255 seconds]
00:39:55SootBector (SootBector) joins
00:44:47Sluggs quits [Excess Flood]
00:45:47pedantic-darwin quits [Quit: The Lounge - https://thelounge.chat]
00:46:50Sluggs (Sluggs) joins
00:49:49pedantic-darwin joins
01:12:23etnguyen03 (etnguyen03) joins
01:47:38<endrift>I feel like I might have had this same thought like a decade ago already too
02:08:07etnguyen03 quits [Client Quit]
02:10:50etnguyen03 (etnguyen03) joins
02:26:34etnguyen03 quits [Remote host closed the connection]
02:42:14nine quits [Quit: See ya!]
02:42:26nine joins
02:42:27nine quits [Changing host]
02:42:27nine (nine) joins
02:49:10hackbug quits [Remote host closed the connection]
02:53:17hackbug (hackbug) joins
03:00:48<nicolas17>I need some archivebot help
03:01:29<nicolas17>if I run a recursive job for https://support.apple.com/guide/iphone/ and another for https://support.apple.com/guide/ipad/, then any common resources like images and stylesheets will be retrieved twice
03:01:30Gadelhas5628737844 quits [Ping timeout: 258 seconds]
03:01:53<nicolas17>so ideally I'd use a single job
03:02:11Gadelhas5628737844 joins
03:02:15<nicolas17>is there such thing as "!a <"?
03:03:35<nicolas17>"ArchiveBot does not ascend to parent links" how would it know what's the "root" if given multiple URLs?
03:05:02<nicolas17>hmmm I think this has a lot of pages and not that many resources so maybe I can just do separate jobs
03:05:26<nicolas17>the duplicated resources won't be too significant vs the unique html
03:06:58<nicolas17>JAA: any objections before I hit enter? :P
03:07:07<nicolas17>(and yes maybe I should have asked in #archivebot but there's usually too much bot noise)
03:10:01<@JAA>nicolas17: AB tracks where URLs are discovered. The first discovery of a URL wins and determines whether it's considered a link or a requisite, as well as its parent URL. You can trace this back to one of the root URLs in the input list (which is actually stored in the DB for performance). That's how --no-parent is implemented.
03:10:19<@JAA>Yes, there is an !a <, but it's undocumented because it's full of pitfalls and usually unsafe.
03:10:37<@JAA>Separate jobs sound fine to me.
03:10:56<nicolas17>oh so it's not "doesn't go to the parent of the URL you start with", but "doesn't go to the parent of any given page"?
03:12:07<nicolas17>if I !a https://example.com/root/ and it has a link to /root/a/b/ which in turn has a link to /root/c/, it won't follow that, because going from root/a/b/ to root/c/ is "ascending to a parent"?
03:12:59<nicolas17>even though it's below the URL I originally added
03:14:48<@JAA>It will; it's evaluated relative to the root URL.
03:15:36<@JAA>But if you !a < a list with https://example.org/foo/ and https://example.org/bar/ and /foo/ links to /bar/baz/, that won't be followed.
03:16:18<@JAA>Even if /bar/ links to that, too, assuming the /foo/ response is processed first.
03:16:40<nicolas17>so that's one of the !a< pitfalls
03:16:43<@JAA>Yep
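A minimal sketch of the behaviour JAA describes: each URL remembers the root it was first traced back to, first discovery wins, and a link is only followed if it stays under that root. This is illustrative Python only, not ArchiveBot's actual code; the example URLs mirror the /foo/ and /bar/ case above.

```python
from urllib.parse import urlparse

def under(url: str, root: str) -> bool:
    """True if url is on the same host and at or below root's path."""
    u, r = urlparse(url), urlparse(root)
    return u.netloc == r.netloc and u.path.startswith(r.path)

# Seed roots from the !a < input list; each root is its own ancestor.
roots = {u: u for u in ("https://example.org/foo/", "https://example.org/bar/")}

def discover(url: str, parent: str) -> bool:
    """Decide whether to queue `url`, which was found on the page `parent`."""
    if url in roots:
        return False              # first discovery already fixed this URL's root; later finds are ignored
    root = roots.get(parent)      # trace the parent back to the root it was discovered under
    if root is None or not under(url, root):
        return False              # would ascend above that root -> skipped (--no-parent)
    roots[url] = root
    return True

# The pitfall: /bar/baz/ found first on /foo/ traces back to the /foo/ root and is skipped,
# even though /bar/ is also in the input list.
assert discover("https://example.org/bar/baz/", "https://example.org/foo/") is False
```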
03:17:30<nicolas17>gonna hack into apple's webserver and add https://support.apple.com/guide/all.html with the root links I want
03:17:41SootBector quits [Remote host closed the connection]
03:17:42BornOn420 quits [Remote host closed the connection]
03:18:25BornOn420 (BornOn420) joins
03:18:53SootBector (SootBector) joins
03:20:28BornOn420 quits [Max SendQ exceeded]
03:21:03BornOn420 (BornOn420) joins
03:22:08dhinakg_ (dhinakg) joins
03:26:48evergreen quits [Ping timeout: 258 seconds]
03:27:25SootBector quits [Remote host closed the connection]
03:27:25BornOn420 quits [Remote host closed the connection]
03:27:57BornOn420 (BornOn420) joins
03:28:38SootBector (SootBector) joins
03:30:05BornOn420 quits [Max SendQ exceeded]
03:30:38BornOn420 (BornOn420) joins
03:32:45BornOn420 quits [Max SendQ exceeded]
03:38:27Island quits [Read error: Connection reset by peer]
03:58:13<pabs>nicolas17: sounds like you want the sitemap trick
03:59:35<pabs>also IIRC !a < https://example.org/foo https://example.org/bar is safe too
04:00:12<pabs>the sitemap thing: https://wiki.archiveteam.org/index.php?title=ArchiveBot#Usage_tips
04:17:19Kayla joins
04:44:09Hackerpcs quits [Ping timeout: 260 seconds]
05:06:43<pabs>https://news.artnet.com/market/intelligence-report-storm-2025-2684512
05:12:33cyanbox joins
05:36:22TheEnbyperor quits [Ping timeout: 258 seconds]
05:38:24TheEnbyperor_ quits [Ping timeout: 260 seconds]
05:41:54wickedplayer494 quits [Ping timeout: 260 seconds]
05:42:51wickedplayer494 joins
06:09:01TheEnbyperor joins
06:10:56TheEnbyperor_ (TheEnbyperor) joins
06:40:01youbanana quits [Read error: Connection reset by peer]
06:56:49b3nzo joins
07:05:31beardicus9 quits [Quit: Ping timeout (120 seconds)]
07:05:47beardicus9 (beardicus) joins
07:36:57HP_Archivist (HP_Archivist) joins
07:38:37HP_Archivist quits [Client Quit]
07:39:13HP_Archivist (HP_Archivist) joins
08:28:23nathang21 quits [Read error: Connection reset by peer]
08:28:57nathang21 joins
08:40:24TheEnbyperor_ quits [Ping timeout: 260 seconds]
08:40:45TheEnbyperor quits [Ping timeout: 258 seconds]
08:49:37Dada joins
08:50:02emanuele6 quits [Read error: Connection reset by peer]
09:00:47TheEnbyperor joins
09:02:39TheEnbyperor_ (TheEnbyperor) joins
09:10:09Bleo182600722719623455222 quits [Ping timeout: 260 seconds]
09:13:04Bleo182600722719623455222 joins
09:16:52<hexagonwin_>Hello. Recently I've been backing up AndroidFileHost which holds a lot of valuable files related to (mostly old) android devices. A few months ago the site became inoperable (downloads impossible+sql errors), though about 80% of the files came back now. I've first scraped their website which contains the file IDs (used to fetch download links), size, md5 etc. It's about 180TB total,
09:16:52<hexagonwin_>so I created a server-client system inspired by the AT tracker: the server holds a list of file IDs to download, assigns them to clients, tracks successful downloads, etc.
09:17:01<hexagonwin_>It's working pretty well, as of now we've downloaded 180545 files with 32843 left to do (43255 files to retry - most didn't have any available mirror links, or the download failed partway through). However, I'm not sure how to share them.
09:17:06<hexagonwin_>Could someone with experience please help us with this? I'm thinking they should be uploaded to the Internet Archive, but I'm not sure how. Some files are password protected archive files (many of which are actually relevant - somehow they're android firmware files with the archive password in the filename, perhaps reuploaded to AFH from another website?) which can't be uploaded to IA
09:17:06<hexagonwin_>directly, and since they would be 100+TB total perhaps there could be a better method. I've also thought of serving these files on my server and having archiveteam crawl my website into warc so it can be included on the wayback machine, but i'm not sure if this is a good idea.
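For context, a minimal sketch of the claim/report cycle such a tracker-style server needs. This is illustrative Python only; hexagonwin_'s actual implementation is not shown in the log, so all names and the timeout value are assumptions.

```python
import time

class Tracker:
    """Hands out file IDs to clients, records results, and re-queues stale claims."""
    def __init__(self, file_ids, claim_timeout=3600):
        self.todo = set(file_ids)
        self.claimed = {}          # fid -> (client name, claim time)
        self.done = set()
        self.failed = set()        # e.g. no mirror available; retried later
        self.timeout = claim_timeout

    def claim(self, client: str):
        """Give the client a file ID to download, or None if nothing is pending."""
        now = time.time()
        for fid, (_, t) in list(self.claimed.items()):
            if now - t > self.timeout:     # client never reported back; put it back in the queue
                self.todo.add(fid)
                del self.claimed[fid]
        if not self.todo:
            return None
        fid = self.todo.pop()
        self.claimed[fid] = (client, now)
        return fid

    def report(self, fid: str, ok: bool) -> None:
        """Record whether the download of `fid` succeeded."""
        self.claimed.pop(fid, None)
        (self.done if ok else self.failed).add(fid)
```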
09:30:59<@arkiver>hexagonwin_: are they fully public URLs, or do they require authentication to get?
09:33:05<@arkiver>you mention serving them from a different site so they can go into WARCs.
09:33:34<@arkiver>are they all public files like https://androidfilehost.com/?fid=4349826312261725872 ?
09:35:15<@arkiver>i see a POST request is made to a mirrors.otf.php endpoint, but it allows for arbitrary URL parameters
09:36:20<@arkiver>hexagonwin_: is there any information on why the website became inoperable, or did they mention anything around the website being available again? how "at risk" would you say the website is?
09:37:35<@arkiver>given a complete list of "fid"s, and if the website is truly at risk, i could set up a little Warrior project to get the data into WARCs and then into the Wayback Machine (of course, i can't promise the data will always remain fully publicly accessible), the site is not very complex.
09:38:43<@arkiver>i'd say it's better to crawl the original site and have the original URLs preserved, as opposed to archiving a rehosted dump
09:39:42<@arkiver>i do see this strange "watch the video and get access as a reward" thing
09:48:56<hexagonwin_>@arkiver: fully public urls. there's not much information on why it became inoperable, but it's pretty obvious, the android customization space isn't what it used to be and it no longer earns them enough money i believe. a person who was involved in its operation before had also commented on an xda thread about the site getting fewer visitors. the last update on their twitter was from
09:48:56<hexagonwin_>2022 or so, and the site hasn't seen any visible updates over the last few years.
09:50:32<hexagonwin_>the site is not very complex, but i've already downloaded most of it and i don't see any reason to duplicate the effort..
09:51:47<hexagonwin_>for the androidfilehost.com interface with file info etc i also think crawling the original site would be better, but there's not much reason to download that 180TB of data especially when they have an md5 to verify anyway
09:52:13<hexagonwin_>(i've downloaded androidfilehost.com using grab-site and it was less than 30gb)
09:52:54<@arkiver>understood yeah, and generally agreed
09:53:02<@arkiver>but i'm not sure how else we can get this into the Wayback Machine
09:54:00<@arkiver>the other way I guess would be uploading as single items to archive.org, perhaps packed as multiple files per item, or a single file per item (need to check if they are fine with 180k items for that)
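A hedged sketch of that per-item route using the `internetarchive` Python library's `upload()` call. The identifier, file paths, and metadata below are placeholders; the real item layout, naming scheme, and collection would need to be agreed with archive.org.

```python
from internetarchive import upload  # pip install internetarchive; credentials via `ia configure`

# Hypothetical layout: several files packed into one item per batch.
files = ["afh-batch-000/file1.zip", "afh-batch-000/file2.img"]   # placeholder paths
responses = upload(
    "androidfilehost-mirror-batch-000",                          # placeholder identifier
    files=files,
    metadata={
        "title": "AndroidFileHost mirror, batch 000",            # placeholder metadata
        "mediatype": "software",
    },
)
print([r.status_code for r in responses])                        # one HTTP response per uploaded file
```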
09:57:02<hexagonwin_>having them as single items would be better in terms of searching (people needing the files would probably search the filename and get it on a search engine) but their policy of not accepting password protected archive files is a bit problematic :/
09:57:54<@arkiver>if you have the password, is there a good reason for not unpacking the files?
09:57:57@arkiver is afk for 30 minutes
09:59:30<hexagonwin_>there's simply too many files. can't really have a human go through every password protected file manually, look at its filename, guess (or read, if it's obvious) the password, unpack and upload separately
10:03:31Wohlstand (Wohlstand) joins
10:37:59<@arkiver>hexagonwin_: if there is a password protected file, is all the information there to decrypt it? (for a human to interpret the information to decrypt)
10:47:52<masterx244|m>often the information is in the place the file was linked from, i.e. the forum thread that links to the data
10:50:24<hexagonwin_>arkiver: this site (AFH) allowed uploads from anyone in any format. There can be literally any case.. some files I've seen have the password in the filename, some passwords can only be found on another website, others we can't know at all.
11:00:05Bleo182600722719623455222 quits [Client Quit]
11:02:49Bleo182600722719623455222 joins
11:06:48<anonymoususer852>There is a way to test whether an archive is password protected: https://stackoverflow.com/a/56030344 - so it should be possible to script a search for such cases, in theory.
11:07:22Wohlstand quits [Client Quit]
11:07:24rohvani quits [Ping timeout: 260 seconds]
11:29:40<anonymoususer852>As a PoC, I downloaded the said "Pictures.zip", ran a rather long command, and it does indicate that it is password protected: https://transfer.archivete.am/inline/cuGqp/7z_passwd_protect_zip_batch_test.txt
11:30:46Commander001 quits [Remote host closed the connection]
11:31:02Commander001 joins
11:34:05<anonymoususer852>Password protected, "Pictures.zip": https://www.k8oms.net/document/password-protected-file
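A rough sketch of scripting that check with 7-Zip, along the lines of the linked Stack Overflow answer: list the archive with an empty password and look for encrypted entries. Illustrative only; a non-zero exit status can also just mean a corrupt or unsupported archive, so results would need sanity checking.

```python
import subprocess

def is_password_protected(path: str) -> bool:
    """Heuristic: 7z's technical listing marks encrypted entries with 'Encrypted = +';
    if even listing fails with an empty password, the header itself is likely encrypted."""
    proc = subprocess.run(
        ["7z", "l", "-slt", "-p", path],   # bare -p supplies an empty password, avoiding an interactive prompt
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        return True                        # header-encrypted archive (or a broken/unsupported file)
    return "Encrypted = +" in proc.stdout
```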
11:41:05etnguyen03 (etnguyen03) joins
11:49:26<hexagonwin_>it would be possible to find password protected files among all the downloaded files, but the problem is that not uploading the password protected files would be a loss of information since the password might be obtainable elsewhere (or from the filename)
11:56:53<anonymoususer852>Could probably try to test-extract files using the password obtained from the filename and, if successful, optionally recompress them without a password. Though long story short, it's still going to be a tedious process for the sake of hopefully saving future users from having to guess.
12:09:06<chrismrtn>The basic type of ZIP encryption (not the AES variants, as far as I recall) is vulnerable to a known-plaintext attack. I've cracked one in the past using (I think) 16 bytes of a file header, and it didn't take too long. At the moment, I can't check what type of encryption is used for your example ZIP, though.
12:10:05<hexagonwin_>it may not even be a zip. there's no restriction on the file format the site accepts, so it could be anything like zip, rar, 7z, etc
12:10:50<hexagonwin_>i even recall getting some weird archive file in this format from the site before https://en.wikipedia.org/wiki/B1_Free_Archiver?useskin=monobook
12:14:03<chrismrtn>Ah, that's unfortunate. Obscure file types are even more of a hassle for long-term preservation, and B1 is certainly a new one for me!
12:14:43<masterx244|m>and then we just need one case where it's an update file format that uses a passworded zip as its container, and then it's a "don't repack" case
12:49:32pabs quits [Read error: Connection reset by peer]
12:50:11pabs (pabs) joins
13:07:28Wohlstand (Wohlstand) joins
13:18:04monoxane quits [Ping timeout: 260 seconds]
13:18:51monoxane (monoxane) joins
13:25:12nine quits [Quit: See ya!]
13:25:31nine joins
13:25:32nine quits [Changing host]
13:25:32nine (nine) joins
13:27:45lemuria quits [Remote host closed the connection]
13:54:20hexagonwin joins
13:56:14hexagonwin_ quits [Ping timeout: 258 seconds]
14:44:22NeonGlitch quits [Read error: Connection reset by peer]
14:44:58NeonGlitch (NeonGlitch) joins
15:02:19<justauser|m>Does IA check nested archives? You could pack them into .tars by first letters, first byte of SHA etc.
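A small sketch of that packing idea, bucketing files into .tar archives by the first byte of their SHA-256. Illustrative Python only; the directory layout and archive naming are made up.

```python
import hashlib
import os
import tarfile

def sha256_prefix(path: str, nbytes: int = 1) -> str:
    """First byte(s) of the file's SHA-256 digest, as hex, used as a bucket name."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[: 2 * nbytes]

def pack_into_buckets(src_dir: str, out_dir: str) -> None:
    """Append each file in src_dir to a per-bucket tar, e.g. afh-a3.tar."""
    os.makedirs(out_dir, exist_ok=True)
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        if not os.path.isfile(path):
            continue
        bucket = sha256_prefix(path)
        tar_path = os.path.join(out_dir, f"afh-{bucket}.tar")
        with tarfile.open(tar_path, "a") as tar:   # append mode groups files per bucket
            tar.add(path, arcname=name)
```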
15:05:44HugsNotDrugs quits [Read error: Connection reset by peer]
15:08:44HugsNotDrugs joins
15:21:48HugsNotDrugs quits [Client Quit]
15:22:07HugsNotDrugs joins
15:22:24cyanbox quits [Read error: Connection reset by peer]
15:23:21nine quits [Client Quit]
15:23:40nine joins
15:23:41nine quits [Changing host]
15:23:41nine (nine) joins
15:38:32nine quits [Client Quit]
15:38:49nine joins
15:38:49nine quits [Changing host]
15:38:49nine (nine) joins
15:43:14Island joins
15:49:05lennier2_ joins
15:50:54lennier2 quits [Ping timeout: 260 seconds]
16:31:58dabs joins
16:49:45nyakase5 (nyakase) joins
16:51:48nyakase quits [Ping timeout: 258 seconds]
16:53:17dhinakg_ quits [Quit: dhinakg_]
16:53:21nyakase (nyakase) joins
16:55:04nyakase5 quits [Ping timeout: 260 seconds]
17:05:33datechnoman quits [Quit: Ping timeout (120 seconds)]
17:06:04datechnoman (datechnoman) joins
17:13:16Barto quits [Ping timeout: 258 seconds]
17:16:23igloo22225 quits [Quit: The Lounge - https://thelounge.chat]
17:17:41datechnoman quits [Client Quit]
17:18:04datechnoman (datechnoman) joins
17:18:09BornOn420 (BornOn420) joins
17:25:01Guest58 quits [Quit: My Mac has gone to sleep. ZZZzzz…]
17:31:21Barto (Barto) joins
17:37:09etnguyen03 quits [Quit: Konversation terminated!]
17:46:08etnguyen03 (etnguyen03) joins
17:46:13igloo22225 (igloo22225) joins
17:53:59Island quits [Ping timeout: 260 seconds]
18:01:37PredatorIWD25 quits [Read error: Connection reset by peer]
18:04:43Island joins
18:14:39etnguyen03 quits [Client Quit]
18:26:40Guest58 joins
18:44:07corentin quits [Ping timeout: 258 seconds]
18:44:47@AlsoJAA sets the topic to: Lengthy ArchiveTeam-related discussions, questions here | Offtopic: #archiveteam-ot | https://twitter.com/textfiles/status/1069715869994020867
19:09:24lennier2 joins
19:12:29lennier2_ quits [Ping timeout: 258 seconds]
20:01:04corentin joins
20:24:53HP_Archivist quits [Read error: Connection reset by peer]
20:55:59anonymoususer852 quits [Ping timeout: 258 seconds]
20:57:45anonymoususer852 (anonymoususer852) joins
21:19:45Barto quits [Ping timeout: 258 seconds]
21:24:45Barto (Barto) joins
21:38:15Dada quits [Remote host closed the connection]
21:39:02<h2ibot>Anonymoususer852 edited Anubis/uncategorized (+26, Added https://git.slackware.nl/): https://wiki.archiveteam.org/?diff=57279&oldid=56871
21:59:14b3nzo quits [Ping timeout: 258 seconds]
22:12:08dabs quits [Read error: Connection reset by peer]
22:35:39Suika quits [Ping timeout: 258 seconds]
22:43:59Suika joins
22:53:04Radzig2 joins
22:55:34Radzig quits [Ping timeout: 260 seconds]
22:55:34Radzig2 is now known as Radzig
23:07:51klg quits [Ping timeout: 258 seconds]
23:15:31@rewby quits [Ping timeout: 258 seconds]
23:29:27klg (klg) joins
23:35:50wickedplayer494 quits [Ping timeout: 258 seconds]
23:36:18wickedplayer494 joins
23:54:54dabs joins