00:16:00 | <endrift> | heh, I used to know someone who used the username ThreeHeadedMonkey in high school. Kiiinda down ThreeHM is the same person...but I did lose touch with him like a decade ago |
00:16:04 | <endrift> | *doubt |
00:17:22 | <endrift> | I remember borrowing a copy of Final Fantasy Tactics from him and then never playing it |
00:23:24 | | cyanbox quits [Read error: Connection reset by peer] |
00:24:41 | | etnguyen03 quits [Client Quit] |
00:36:39 | | SootBector quits [Ping timeout: 255 seconds] |
00:39:55 | | SootBector (SootBector) joins |
00:44:47 | | Sluggs quits [Excess Flood] |
00:45:47 | | pedantic-darwin quits [Quit: The Lounge - https://thelounge.chat] |
00:46:50 | | Sluggs (Sluggs) joins |
00:49:49 | | pedantic-darwin joins |
01:12:23 | | etnguyen03 (etnguyen03) joins |
01:47:38 | <endrift> | I feel like I might have had this same thought like a decade ago already too |
02:08:07 | | etnguyen03 quits [Client Quit] |
02:10:50 | | etnguyen03 (etnguyen03) joins |
02:26:34 | | etnguyen03 quits [Remote host closed the connection] |
02:42:14 | | nine quits [Quit: See ya!] |
02:42:26 | | nine joins |
02:42:27 | | nine is now authenticated as nine |
02:42:27 | | nine quits [Changing host] |
02:42:27 | | nine (nine) joins |
02:49:10 | | hackbug quits [Remote host closed the connection] |
02:53:17 | | hackbug (hackbug) joins |
03:00:48 | <nicolas17> | I need some archivebot help |
03:01:29 | <nicolas17> | if I run a recursive job for https://support.apple.com/guide/iphone/ and another for https://support.apple.com/guide/ipad/, then any common resources like images and stylesheets will be retrieved twice |
03:01:30 | | Gadelhas5628737844 quits [Ping timeout: 258 seconds] |
03:01:53 | <nicolas17> | so ideally I'd use a single job |
03:02:11 | | Gadelhas5628737844 joins |
03:02:15 | <nicolas17> | is there such thing as "!a <"? |
03:03:35 | <nicolas17> | "ArchiveBot does not ascend to parent links" how would it know what's the "root" if given multiple URLs? |
03:05:02 | <nicolas17> | hmmm I think this has a lot of pages and not that many resources so maybe I can just do separate jobs |
03:05:26 | <nicolas17> | the duplicated resources won't be too significant vs the unique html |
03:05:30 | | nicolas17 is now authenticated as nicolas17 |
03:06:58 | <nicolas17> | JAA: any objections before I hit enter? :P |
03:07:07 | <nicolas17> | (and yes maybe I should have asked in #archivebot but there's usually too much bot noise) |
03:10:01 | <@JAA> | nicolas17: AB tracks where URLs are discovered. The first discovery of a URL wins and determines whether it's considered a link or a requisite, as well as its parent URL. You can trace this back to one of the root URLs in the input list (which is actually stored in the DB for performance). That's how --no-parent is implemented. |
03:10:19 | <@JAA> | Yes, there is an !a <, but it's undocumented because it's full of pitfalls and usually unsafe. |
03:10:37 | <@JAA> | Separate jobs sound fine to me. |
03:10:56 | <nicolas17> | oh so it's not "doesn't go to the parent of the URL you start with", but "doesn't go to the parent of any given page"? |
03:12:07 | <nicolas17> | if I !a https://example.com/root/ and it has a link to /root/a/b/ which in turn has a link to /root/c/, it won't follow that, because going from root/a/b/ to root/c/ is "ascending to a parent"? |
03:12:59 | <nicolas17> | even though it's below the URL I originally added |
03:14:48 | <@JAA> | It will; it's evaluated relative to the root URL. |
03:15:36 | <@JAA> | But if you !a < a list with https://example.org/foo/ and https://example.org/bar/ and /foo/ links to /bar/baz/, that won't be followed. |
03:16:18 | <@JAA> | Even if /bar/ links to that, too, assuming the /foo/ response is processed first. |
03:16:40 | <nicolas17> | so that's one of the !a< pitfalls |
03:16:43 | <@JAA> | Yep |
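A toy model of the scoping behaviour JAA describes (illustrative only, with assumed names; not ArchiveBot's actual code) might look like this:

```python
from urllib.parse import urlparse

class Frontier:
    """Toy model of first-discovery-wins --no-parent scoping (not the real ArchiveBot)."""

    def __init__(self, roots):
        # Each root URL from the input list is its own root.
        self.root_of = {r: r for r in roots}

    def discover(self, url, parent):
        # The FIRST discovery of a URL fixes which root it traces back to,
        # even if that discovery turns out to be out of scope.
        # Later discoveries of the same URL are ignored.
        if url not in self.root_of:
            self.root_of[url] = self.root_of[parent]

    def should_fetch(self, url):
        # --no-parent is evaluated against the root URL the first discovery
        # traced back to, not against the immediate parent page.
        root, u = urlparse(self.root_of[url]), urlparse(url)
        return u.netloc == root.netloc and u.path.startswith(root.path)
```

Under this model, /root/c/ discovered from /root/a/b/ is still in scope because the check runs against the root URL, while /bar/baz/ first discovered from /foo/ stays permanently out of scope even if /bar/ links to it later.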
03:17:30 | <nicolas17> | gonna hack into apple's webserver and add https://support.apple.com/guide/all.html with the root links I want |
03:17:41 | | SootBector quits [Remote host closed the connection] |
03:17:42 | | BornOn420 quits [Remote host closed the connection] |
03:18:25 | | BornOn420 (BornOn420) joins |
03:18:53 | | SootBector (SootBector) joins |
03:20:28 | | BornOn420 quits [Max SendQ exceeded] |
03:21:03 | | BornOn420 (BornOn420) joins |
03:22:08 | | dhinakg_ (dhinakg) joins |
03:26:48 | | evergreen quits [Ping timeout: 258 seconds] |
03:27:25 | | SootBector quits [Remote host closed the connection] |
03:27:25 | | BornOn420 quits [Remote host closed the connection] |
03:27:57 | | BornOn420 (BornOn420) joins |
03:28:38 | | SootBector (SootBector) joins |
03:30:05 | | BornOn420 quits [Max SendQ exceeded] |
03:30:38 | | BornOn420 (BornOn420) joins |
03:32:45 | | BornOn420 quits [Max SendQ exceeded] |
03:38:27 | | Island quits [Read error: Connection reset by peer] |
03:58:13 | <pabs> | nicolas17: sounds like you want the sitemap trick |
03:59:35 | <pabs> | also IIRC !a < https://example.org/foo https://example.org/bar is safe too |
04:00:12 | <pabs> | the sitemap thing: https://wiki.archiveteam.org/index.php?title=ArchiveBot#Usage_tips |
04:17:19 | | Kayla joins |
04:44:09 | | Hackerpcs quits [Ping timeout: 260 seconds] |
05:06:43 | <pabs> | https://news.artnet.com/market/intelligence-report-storm-2025-2684512 |
05:12:33 | | cyanbox joins |
05:36:22 | | TheEnbyperor quits [Ping timeout: 258 seconds] |
05:38:24 | | TheEnbyperor_ quits [Ping timeout: 260 seconds] |
05:41:54 | | wickedplayer494 quits [Ping timeout: 260 seconds] |
05:42:51 | | wickedplayer494 joins |
06:09:01 | | TheEnbyperor joins |
06:10:56 | | TheEnbyperor_ (TheEnbyperor) joins |
06:40:01 | | youbanana quits [Read error: Connection reset by peer] |
06:56:49 | | b3nzo joins |
07:05:31 | | beardicus9 quits [Quit: Ping timeout (120 seconds)] |
07:05:47 | | beardicus9 (beardicus) joins |
07:36:57 | | HP_Archivist (HP_Archivist) joins |
07:38:37 | | HP_Archivist quits [Client Quit] |
07:39:13 | | HP_Archivist (HP_Archivist) joins |
08:28:23 | | nathang21 quits [Read error: Connection reset by peer] |
08:28:57 | | nathang21 joins |
08:40:24 | | TheEnbyperor_ quits [Ping timeout: 260 seconds] |
08:40:45 | | TheEnbyperor quits [Ping timeout: 258 seconds] |
08:49:37 | | Dada joins |
08:50:02 | | emanuele6 quits [Read error: Connection reset by peer] |
09:00:47 | | TheEnbyperor joins |
09:02:39 | | TheEnbyperor_ (TheEnbyperor) joins |
09:10:09 | | Bleo182600722719623455222 quits [Ping timeout: 260 seconds] |
09:13:04 | | Bleo182600722719623455222 joins |
09:16:52 | <hexagonwin_> | Hello. Recently I've been backing up AndroidFileHost, which holds a lot of valuable files related to (mostly old) android devices. A few months ago the site became inoperable (downloads impossible + SQL errors), though about 80% of the files have come back now. I first scraped their website, which contains the file IDs (used to fetch download links), size, md5 etc. It's about 180TB total, |
09:16:52 | <hexagonwin_> | so I created a server-client system inspired by the AT tracker: the server holds a list of file IDs to download, assigns them to clients, tracks successful downloads, etc. |
09:17:01 | <hexagonwin_> | It's working pretty well; as of now we've downloaded 180545 files with 32843 left to do (43255 files to retry - most didn't have any available mirror links, or failed partway through the download). However, I'm not sure how to share them. |
09:17:06 | <hexagonwin_> | Could someone with experience please help us on this? I'm thinking they should be uploaded to the Internet Archive, but I'm not sure how. Some files are password-protected archives (many of which are actually relevant - somehow they're android firmware files with the archive password in the filename, perhaps reuploaded to AFH from another website?) which can't be uploaded to IA |
09:17:06 | <hexagonwin_> | directly, and since they would be 100+TB total perhaps there could be a better method. I've also thought of serving these files on my server and having archiveteam crawl my website into warc so it can be included on the wayback machine, but i'm not sure if this is a good idea. |
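The setup hexagonwin_ describes could be sketched as a minimal in-memory work queue (all names hypothetical; the real AT-style tracker is an HTTP service with persistent state):

```python
class Tracker:
    """Minimal in-memory sketch of a tracker-style work queue (hypothetical, not the real AT tracker)."""

    def __init__(self, file_ids):
        self.todo = list(file_ids)   # file IDs waiting to be assigned
        self.claimed = {}            # file_id -> client currently downloading it
        self.done = set()
        self.retry = set()           # failed items queued for another attempt

    def request_item(self, client):
        # Hand the next unclaimed file ID to a client, or None when the queue is drained.
        if not self.todo:
            return None
        fid = self.todo.pop()
        self.claimed[fid] = client
        return fid

    def report(self, fid, md5_ok):
        # A client reports a result; items whose MD5 check failed go to the retry pool.
        self.claimed.pop(fid, None)
        if md5_ok:
            self.done.add(fid)
        else:
            self.retry.add(fid)
```

Tracking the claimant per item also makes it possible to requeue work from clients that go silent, which a persistent tracker would do with a timeout.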
09:30:59 | <@arkiver> | hexagonwin_: are they fully public URLs, or do they require authentication to get? |
09:33:05 | <@arkiver> | you mention serving them from a different site so they can go into WARCs. |
09:33:34 | <@arkiver> | are they all public files like https://androidfilehost.com/?fid=4349826312261725872 ? |
09:35:15 | <@arkiver> | i see a POST request is made to a mirrors.otf.php endpoint, but it allows for arbitrary URL parameters |
09:36:20 | <@arkiver> | hexagonwin_: is there any information on why the website became inoperable, or did they mention anything around the website being available again? how "at risk" would you say the website is? |
09:37:35 | <@arkiver> | given a complete list of "fid"s, and if the website is truly at risk, i could set up a little Warrior project to get the data into WARCs and then into the Wayback Machine (of course, i can't promise the data will always remain fully publicly accessible), the site is not very complex. |
09:38:43 | <@arkiver> | i'd say it's better to crawl the original site and have the original URLs preserved, as opposed to archiving a rehosted dump |
09:39:42 | <@arkiver> | i do see this strange "watch the video and get access as a reward" thing |
09:48:56 | <hexagonwin_> | @arkiver: fully public urls. there's not much information on why it became inoperable, but it's pretty obvious: the android customization space isn't what it used to be, and i believe it no longer earns them enough money. a person who was involved in its operation before had also commented on an xda thread about the site getting fewer visitors. the last update on their twitter was from |
09:48:56 | <hexagonwin_> | 2022 or so, and the site hasn't seen any visible updates over the last few years. |
09:50:32 | <hexagonwin_> | the site is not very complex, but i've already downloaded most of it and i don't see any reason to duplicate the effort.. |
09:51:47 | <hexagonwin_> | for the androidfilehost.com interface with file info etc i also think crawling the original site would be better, but there's not much reason to download that 180TB of data especially when they have an md5 to verify anyway |
09:52:13 | <hexagonwin_> | (i've downloaded androidfilehost.com using grab-site and it was less than 30gb) |
09:52:54 | <@arkiver> | understood yeah, and generally agreed |
09:53:02 | <@arkiver> | but i'm not sure how else we can get this into the Wayback Machine |
09:54:00 | <@arkiver> | the other way I guess would be uploading as single items to archive.org, perhaps packed as multiple files per item, or a single file per item (need to check if they are fine with 180k items for that) |
09:57:02 | <hexagonwin_> | having them as single items would be better in terms of searching (people needing the files would probably search the filename and get it on a search engine) but their policy of not accepting password protected archive files is a bit problematic :/ |
09:57:54 | <@arkiver> | if you have the password, is there a good reason for not unpacking the files? |
09:57:57 | | @arkiver is afk for 30 minutes |
09:59:30 | <hexagonwin_> | there's simply too many files. can't really have a human go through every password-protected file manually, look at its filename, guess (or read, if it's obvious) the password, unpack, and upload separately |
10:03:31 | | Wohlstand (Wohlstand) joins |
10:37:59 | <@arkiver> | hexagonwin_: if there is a password protected file, is all the information there to decrypt it? (for a human to interpret the information to decrypt) |
10:47:52 | <masterx244|m> | often the information is where the file is linked from in the forum thread that links to the data |
10:50:24 | <hexagonwin_> | arkiver: this site(AFH) allowed uploads from anyone in any format. There can be literally any case.. some files I've seen have the password in the filename, some passwords can be only found on another website, others we can't know at all. |
11:00:05 | | Bleo182600722719623455222 quits [Client Quit] |
11:02:49 | | Bleo182600722719623455222 joins |
11:06:48 | <anonymoususer852> | There is a way to test whether an archive is password protected: https://stackoverflow.com/a/56030344 - so in theory it is possible to script this to search for such cases. |
11:07:22 | | Wohlstand quits [Client Quit] |
11:07:24 | | rohvani quits [Ping timeout: 260 seconds] |
11:29:40 | <anonymoususer852> | As a PoC, I downloaded the said, "Pictures.zip", ran a rather long string of command, and it does indicate that it is password protected, https://transfer.archivete.am/inline/cuGqp/7z_passwd_protect_zip_batch_test.txt |
11:30:46 | | Commander001 quits [Remote host closed the connection] |
11:31:02 | | Commander001 joins |
11:34:05 | <anonymoususer852> | Password protected, "Pictures.zip": https://www.k8oms.net/document/password-protected-file |
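For plain ZIPs, the check anonymoususer852 scripted with 7z can also be done from Python's standard library by looking at the encryption bit in each member's general-purpose flags (ZIP only; for rar/7z and other formats one would still shell out to a tool like 7z, whose `l -slt` listing marks encrypted entries):

```python
import io
import zipfile

def zip_members_encrypted(data: bytes) -> bool:
    """Return True if any member of the ZIP sets the encryption flag (bit 0 of the general-purpose flags)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return any(info.flag_bits & 0x1 for info in zf.infolist())
```

Since the stdlib zipfile module cannot write encrypted archives, only the unencrypted case is easy to demonstrate locally; the flag check itself works for both ZipCrypto and AES-encrypted entries, as both set bit 0.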
11:41:05 | | etnguyen03 (etnguyen03) joins |
11:49:26 | <hexagonwin_> | it would be possible to find password protected files among all the downloaded files, but the problem is that not uploading the password protected files would be a loss of information since the password might be obtainable elsewhere (or from the filename) |
11:56:53 | <anonymoususer852> | Could probably try test-extracting files using the password obtained from the filename and, if successful, optionally recompress them without the password. Long story short, though: it's still going to be a tedious process, for the sake of hopefully sparing future users from having to guess. |
12:09:06 | <chrismrtn> | The basic type of ZIP encryption (not the AES variants, as far as I recall) is vulnerable to a known-plaintext attack. I've cracked one in the past using (I think) 16 bytes of a file header, and it didn't take too long. At the moment, I can't check what type of encryption is used for your example ZIP, though. |
12:10:05 | <hexagonwin_> | it may not even be a zip. there's no restriction on the file format the site accepts, so it could be anything like zip, rar, 7z, etc |
12:10:50 | <hexagonwin_> | i even recall getting some weird archive file in this format from the site before https://en.wikipedia.org/wiki/B1_Free_Archiver?useskin=monobook |
12:14:03 | <chrismrtn> | Ah, that's unfortunate. Obscure file types are even more of a hassle for long-term preservation, and B1 is certainly a new one for me! |
12:14:43 | <masterx244|m> | and then we just need one case where it's an update file format that uses a passworded zip as its container, and then it's a "don't repack" case |
12:49:32 | | pabs quits [Read error: Connection reset by peer] |
12:50:11 | | pabs (pabs) joins |
13:07:28 | | Wohlstand (Wohlstand) joins |
13:18:04 | | monoxane quits [Ping timeout: 260 seconds] |
13:18:51 | | monoxane (monoxane) joins |
13:25:12 | | nine quits [Quit: See ya!] |
13:25:31 | | nine joins |
13:25:32 | | nine is now authenticated as nine |
13:25:32 | | nine quits [Changing host] |
13:25:32 | | nine (nine) joins |
13:27:45 | | lemuria quits [Remote host closed the connection] |
13:54:20 | | hexagonwin joins |
13:56:14 | | hexagonwin_ quits [Ping timeout: 258 seconds] |
14:44:22 | | NeonGlitch quits [Read error: Connection reset by peer] |
14:44:58 | | NeonGlitch (NeonGlitch) joins |
15:02:19 | <justauser|m> | Does IA check nested archives? You could pack them into .tars by first letters, first byte of SHA etc. |
15:05:44 | | HugsNotDrugs quits [Read error: Connection reset by peer] |
15:08:44 | | HugsNotDrugs joins |
15:21:48 | | HugsNotDrugs quits [Client Quit] |
15:22:07 | | HugsNotDrugs joins |
15:22:24 | | cyanbox quits [Read error: Connection reset by peer] |
15:23:21 | | nine quits [Client Quit] |
15:23:40 | | nine joins |
15:23:41 | | nine is now authenticated as nine |
15:23:41 | | nine quits [Changing host] |
15:23:41 | | nine (nine) joins |
15:38:32 | | nine quits [Client Quit] |
15:38:49 | | nine joins |
15:38:49 | | nine is now authenticated as nine |
15:38:49 | | nine quits [Changing host] |
15:38:49 | | nine (nine) joins |
15:43:14 | | Island joins |
15:49:05 | | lennier2_ joins |
15:50:54 | | lennier2 quits [Ping timeout: 260 seconds] |
16:31:58 | | dabs joins |
16:49:45 | | nyakase5 (nyakase) joins |
16:51:48 | | nyakase quits [Ping timeout: 258 seconds] |
16:53:17 | | dhinakg_ quits [Quit: dhinakg_] |
16:53:21 | | nyakase (nyakase) joins |
16:55:04 | | nyakase5 quits [Ping timeout: 260 seconds] |
17:05:33 | | datechnoman quits [Quit: Ping timeout (120 seconds)] |
17:06:04 | | datechnoman (datechnoman) joins |
17:13:16 | | Barto quits [Ping timeout: 258 seconds] |
17:16:23 | | igloo22225 quits [Quit: The Lounge - https://thelounge.chat] |
17:17:41 | | datechnoman quits [Client Quit] |
17:18:04 | | datechnoman (datechnoman) joins |
17:18:09 | | BornOn420 (BornOn420) joins |
17:25:01 | | Guest58 quits [Quit: My Mac has gone to sleep. ZZZzzz…] |
17:31:21 | | Barto (Barto) joins |
17:37:09 | | etnguyen03 quits [Quit: Konversation terminated!] |
17:46:08 | | etnguyen03 (etnguyen03) joins |
17:46:13 | | igloo22225 (igloo22225) joins |
17:53:59 | | Island quits [Ping timeout: 260 seconds] |
18:01:37 | | PredatorIWD25 quits [Read error: Connection reset by peer] |
18:04:43 | | Island joins |
18:14:39 | | etnguyen03 quits [Client Quit] |
18:26:40 | | Guest58 joins |
18:44:07 | | corentin quits [Ping timeout: 258 seconds] |
18:44:47 | | @AlsoJAA sets the topic to: Lengthy ArchiveTeam-related discussions, questions here | Offtopic: #archiveteam-ot | https://twitter.com/textfiles/status/1069715869994020867 |
19:09:24 | | lennier2 joins |
19:12:29 | | lennier2_ quits [Ping timeout: 258 seconds] |
20:01:04 | | corentin joins |
20:24:53 | | HP_Archivist quits [Read error: Connection reset by peer] |
20:55:59 | | anonymoususer852 quits [Ping timeout: 258 seconds] |
20:57:45 | | anonymoususer852 (anonymoususer852) joins |
21:19:45 | | Barto quits [Ping timeout: 258 seconds] |
21:24:45 | | Barto (Barto) joins |
21:38:15 | | Dada quits [Remote host closed the connection] |
21:39:02 | <h2ibot> | Anonymoususer852 edited Anubis/uncategorized (+26, Added https://git.slackware.nl/): https://wiki.archiveteam.org/?diff=57279&oldid=56871 |
21:59:14 | | b3nzo quits [Ping timeout: 258 seconds] |
22:12:08 | | dabs quits [Read error: Connection reset by peer] |
22:35:39 | | Suika quits [Ping timeout: 258 seconds] |
22:43:59 | | Suika joins |
22:53:04 | | Radzig2 joins |
22:55:34 | | Radzig quits [Ping timeout: 260 seconds] |
22:55:34 | | Radzig2 is now known as Radzig |
23:07:51 | | klg quits [Ping timeout: 258 seconds] |
23:15:31 | | @rewby quits [Ping timeout: 258 seconds] |
23:29:27 | | klg (klg) joins |
23:35:50 | | wickedplayer494 quits [Ping timeout: 258 seconds] |
23:36:18 | | wickedplayer494 joins |
23:36:36 | | wickedplayer494 is now authenticated as wickedplayer494 |
23:54:54 | | dabs joins |