00:16:00 | <endrift> | heh, I used to know someone who used the username ThreeHeadedMonkey in high school. Kiiinda down ThreeHM is the same person...but I did lose touch with him like a decade ago |
00:16:04 | <endrift> | *doubt |
00:17:22 | <endrift> | I remember borrowing a copy of Final Fantasy Tactics from him and then never playing it |
00:23:24 | | cyanbox quits [Read error: Connection reset by peer] |
00:24:41 | | etnguyen03 quits [Client Quit] |
00:36:39 | | SootBector quits [Ping timeout: 255 seconds] |
00:39:55 | | SootBector (SootBector) joins |
00:44:47 | | Sluggs quits [Excess Flood] |
00:45:47 | | pedantic-darwin quits [Quit: The Lounge - https://thelounge.chat] |
00:46:50 | | Sluggs (Sluggs) joins |
00:49:49 | | pedantic-darwin joins |
01:12:23 | | etnguyen03 (etnguyen03) joins |
01:47:38 | <endrift> | I feel like I might have had this same thought like a decade ago already too |
02:08:07 | | etnguyen03 quits [Client Quit] |
02:10:50 | | etnguyen03 (etnguyen03) joins |
02:26:34 | | etnguyen03 quits [Remote host closed the connection] |
02:42:14 | | nine quits [Quit: See ya!] |
02:42:26 | | nine joins |
02:42:27 | | nine is now authenticated as nine |
02:42:27 | | nine quits [Changing host] |
02:42:27 | | nine (nine) joins |
02:49:10 | | hackbug quits [Remote host closed the connection] |
02:53:17 | | hackbug (hackbug) joins |
03:00:48 | <nicolas17> | I need some archivebot help |
03:01:29 | <nicolas17> | if I run a recursive job for https://support.apple.com/guide/iphone/ and another for https://support.apple.com/guide/ipad/, then any common resources like images and stylesheets will be retrieved twice |
03:01:30 | | Gadelhas5628737844 quits [Ping timeout: 258 seconds] |
03:01:53 | <nicolas17> | so ideally I'd use a single job |
03:02:11 | | Gadelhas5628737844 joins |
03:02:15 | <nicolas17> | is there such thing as "!a <"? |
03:03:35 | <nicolas17> | "ArchiveBot does not ascend to parent links" how would it know what's the "root" if given multiple URLs? |
03:05:02 | <nicolas17> | hmmm I think this has a lot of pages and not that many resources so maybe I can just do separate jobs |
03:05:26 | <nicolas17> | the duplicated resources won't be too significant vs the unique html |
03:05:30 | | nicolas17 is now authenticated as nicolas17 |
03:06:58 | <nicolas17> | JAA: any objections before I hit enter? :P |
03:07:07 | <nicolas17> | (and yes maybe I should have asked in #archivebot but there's usually too much bot noise) |
03:10:01 | <@JAA> | nicolas17: AB tracks where URLs are discovered. The first discovery of a URL wins and determines whether it's considered a link or a requisite, as well as its parent URL. You can trace this back to one of the root URLs in the input list (which is actually stored in the DB for performance). That's how --no-parent is implemented. |
03:10:19 | <@JAA> | Yes, there is an !a <, but it's undocumented because it's full of pitfalls and usually unsafe. |
03:10:37 | <@JAA> | Separate jobs sound fine to me. |
03:10:56 | <nicolas17> | oh so it's not "doesn't go to the parent of the URL you start with", but "doesn't go to the parent of any given page"? |
03:12:07 | <nicolas17> | if I !a https://example.com/root/ and it has a link to /root/a/b/ which in turn has a link to /root/c/, it won't follow that, because going from root/a/b/ to root/c/ is "ascending to a parent"? |
03:12:59 | <nicolas17> | even though it's below the URL I originally added |
03:14:48 | <@JAA> | It will; it's evaluated relative to the root URL. |
03:15:36 | <@JAA> | But if you !a < a list with https://example.org/foo/ and https://example.org/bar/ and /foo/ links to /bar/baz/, that won't be followed. |
03:16:18 | <@JAA> | Even if /bar/ links to that, too, assuming the /foo/ response is processed first. |
03:16:40 | <nicolas17> | so that's one of the !a< pitfalls |
03:16:43 | <@JAA> | Yep |
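A toy model of the scoping behaviour JAA describes (illustrative only, with assumed names; not ArchiveBot's actual code) might look like this:

```python
from urllib.parse import urlparse

class Frontier:
    """Toy model of first-discovery-wins --no-parent scoping (not the real ArchiveBot)."""

    def __init__(self, roots):
        # Each root URL from the input list is its own root.
        self.root_of = {r: r for r in roots}

    def discover(self, url, parent):
        # The FIRST discovery of a URL fixes which root it traces back to,
        # even if that discovery turns out to be out of scope.
        # Later discoveries of the same URL are ignored.
        if url not in self.root_of:
            self.root_of[url] = self.root_of[parent]

    def should_fetch(self, url):
        # --no-parent is evaluated against the root URL the first discovery
        # traced back to, not against the immediate parent page.
        root, u = urlparse(self.root_of[url]), urlparse(url)
        return u.netloc == root.netloc and u.path.startswith(root.path)
```

Under this model, /root/c/ discovered from /root/a/b/ is still in scope because the check runs against the root URL, while /bar/baz/ first discovered from /foo/ stays permanently out of scope even if /bar/ links to it later.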
03:17:30 | <nicolas17> | gonna hack into apple's webserver and add https://support.apple.com/guide/all.html with the root links I want |
03:17:41 | | SootBector quits [Remote host closed the connection] |
03:17:42 | | BornOn420 quits [Remote host closed the connection] |
03:18:25 | | BornOn420 (BornOn420) joins |
03:18:53 | | SootBector (SootBector) joins |
03:20:28 | | BornOn420 quits [Max SendQ exceeded] |
03:21:03 | | BornOn420 (BornOn420) joins |
03:22:08 | | dhinakg_ (dhinakg) joins |
03:26:48 | | evergreen quits [Ping timeout: 258 seconds] |
03:27:25 | | SootBector quits [Remote host closed the connection] |
03:27:25 | | BornOn420 quits [Remote host closed the connection] |
03:27:57 | | BornOn420 (BornOn420) joins |
03:28:38 | | SootBector (SootBector) joins |
03:30:05 | | BornOn420 quits [Max SendQ exceeded] |
03:30:38 | | BornOn420 (BornOn420) joins |
03:32:45 | | BornOn420 quits [Max SendQ exceeded] |
03:38:27 | | Island quits [Read error: Connection reset by peer] |
03:58:13 | <pabs> | nicolas17: sounds like you want the sitemap trick |
03:59:35 | <pabs> | also IIRC !a < https://example.org/foo https://example.org/bar is safe too |
04:00:12 | <pabs> | the sitemap thing: https://wiki.archiveteam.org/index.php?title=ArchiveBot#Usage_tips |
04:17:19 | | Kayla joins |
04:44:09 | | Hackerpcs quits [Ping timeout: 260 seconds] |
05:06:43 | <pabs> | https://news.artnet.com/market/intelligence-report-storm-2025-2684512 |
05:12:33 | | cyanbox joins |
05:36:22 | | TheEnbyperor quits [Ping timeout: 258 seconds] |
05:38:24 | | TheEnbyperor_ quits [Ping timeout: 260 seconds] |
05:41:54 | | wickedplayer494 quits [Ping timeout: 260 seconds] |
05:42:51 | | wickedplayer494 joins |
06:09:01 | | TheEnbyperor joins |
06:10:56 | | TheEnbyperor_ (TheEnbyperor) joins |
06:40:01 | | youbanana quits [Read error: Connection reset by peer] |
06:56:49 | | b3nzo joins |
07:05:31 | | beardicus9 quits [Quit: Ping timeout (120 seconds)] |
07:05:47 | | beardicus9 (beardicus) joins |
07:36:57 | | HP_Archivist (HP_Archivist) joins |
07:38:37 | | HP_Archivist quits [Client Quit] |
07:39:13 | | HP_Archivist (HP_Archivist) joins |
08:28:23 | | nathang21 quits [Read error: Connection reset by peer] |
08:28:57 | | nathang21 joins |
08:40:24 | | TheEnbyperor_ quits [Ping timeout: 260 seconds] |
08:40:45 | | TheEnbyperor quits [Ping timeout: 258 seconds] |
08:49:37 | | Dada joins |
08:50:02 | | emanuele6 quits [Read error: Connection reset by peer] |
09:00:47 | | TheEnbyperor joins |
09:02:39 | | TheEnbyperor_ (TheEnbyperor) joins |
09:10:09 | | Bleo182600722719623455222 quits [Ping timeout: 260 seconds] |
09:13:04 | | Bleo182600722719623455222 joins |
09:16:52 | <hexagonwin_> | Hello. Recently I've been backing up AndroidFileHost, which holds a lot of valuable files related to (mostly old) android devices. A few months ago the site became inoperable (downloads impossible + SQL errors), though about 80% of the files have come back now. I first scraped their website, which contains the file IDs (used to fetch download links), size, md5 etc. It's about 180TB total, |
09:16:52 | <hexagonwin_> | so I created a server-client system inspired by the AT tracker: the server holds a list of file IDs to download, assigns them to clients, tracks successful downloads, etc. |
09:17:01 | <hexagonwin_> | It's working pretty well; as of now we've downloaded 180545 files with 32843 left to do (43255 files to retry - most didn't have any available mirror links, or failed partway through the download). However, I'm not sure how to share them. |
09:17:06 | <hexagonwin_> | Could someone with experience please help us on this? I'm thinking they should be uploaded to the Internet Archive, but I'm not sure how. Some files are password-protected archives (many of which are actually relevant - somehow they're android firmware files with the archive password in the filename, perhaps reuploaded to AFH from another website?) which can't be uploaded to IA |
09:17:06 | <hexagonwin_> | directly, and since they would be 100+TB total perhaps there could be a better method. I've also thought of serving these files on my server and having archiveteam crawl my website into warc so it can be included on the wayback machine, but i'm not sure if this is a good idea. |
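The setup hexagonwin_ describes could be sketched as a minimal in-memory work queue (all names hypothetical; the real AT-style tracker is an HTTP service with persistent state):

```python
class Tracker:
    """Minimal in-memory sketch of a tracker-style work queue (hypothetical, not the real AT tracker)."""

    def __init__(self, file_ids):
        self.todo = list(file_ids)   # file IDs waiting to be assigned
        self.claimed = {}            # file_id -> client currently downloading it
        self.done = set()
        self.retry = set()           # failed items queued for another attempt

    def request_item(self, client):
        # Hand the next unclaimed file ID to a client, or None when the queue is drained.
        if not self.todo:
            return None
        fid = self.todo.pop()
        self.claimed[fid] = client
        return fid

    def report(self, fid, md5_ok):
        # A client reports a result; items whose MD5 check failed go to the retry pool.
        self.claimed.pop(fid, None)
        if md5_ok:
            self.done.add(fid)
        else:
            self.retry.add(fid)
```

Tracking the claimant per item also makes it possible to requeue work from clients that go silent, which a persistent tracker would do with a timeout.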
09:30:59 | <@arkiver> | hexagonwin_: are they fully public URLs, or do they require authentication to get? |
09:33:05 | <@arkiver> | you mention serving them from a different site so they can go into WARCs. |
09:33:34 | <@arkiver> | are they all public files like https://androidfilehost.com/?fid=4349826312261725872 ? |
09:35:15 | <@arkiver> | i see a POST request is made to a mirrors.otf.php endpoint, but it allows for arbitrary URL parameters |
09:36:20 | <@arkiver> | hexagonwin_: is there any information on why the website became inoperable, or did they mention anything around the website being available again? how "at risk" would you say the website is? |
09:37:35 | <@arkiver> | given a complete list of "fid"s, and if the website is truly at risk, i could set up a little Warrior project to get the data into WARCs and then into the Wayback Machine (of course, i can't promise the data will always remain fully publicly accessible), the site is not very complex. |
09:38:43 | <@arkiver> | i'd say it's better to crawl the original site and have the original URLs preserved, as opposed to archiving a rehosted dump |
09:39:42 | <@arkiver> | i do see this strange "watch the video and get access as a reward" thing |
09:48:56 | <hexagonwin_> | @arkiver: fully public urls. there's not much information on why it became inoperable, but it's pretty obvious: the android customization space isn't what it used to be, and i believe it no longer earns them enough money. a person who was involved in its operation before had also commented on an xda thread about the site getting fewer visitors. the last update on their twitter was from |
09:48:56 | <hexagonwin_> | 2022 or so, and the site hasn't seen any visible updates over the last few years. |
09:50:32 | <hexagonwin_> | the site is not very complex, but i've already downloaded most of it and i don't see any reason to duplicate the effort.. |
09:51:47 | <hexagonwin_> | for the androidfilehost.com interface with file info etc i also think crawling the original site would be better, but there's not much reason to download that 180TB of data especially when they have an md5 to verify anyway |
09:52:13 | <hexagonwin_> | (i've downloaded androidfilehost.com using grab-site and it was less than 30gb) |
09:52:54 | <@arkiver> | understood yeah, and generally agreed |
09:53:02 | <@arkiver> | but i'm not sure how else we can get this into the Wayback Machine |
09:54:00 | <@arkiver> | the other way I guess would be uploading as single items to archive.org, perhaps packed as multiple files per item, or a single file per item (need to check if they are fine with 180k items for that) |
09:57:02 | <hexagonwin_> | having them as single items would be better in terms of searching (people needing the files would probably search the filename and get it on a search engine) but their policy of not accepting password protected archive files is a bit problematic :/ |
09:57:54 | <@arkiver> | if you have the password, is there a good reason for not unpacking the files? |
09:57:57 | | @arkiver is afk for 30 minutes |
09:59:30 | <hexagonwin_> | there's simply too many files. can't really have a human go through every password-protected file manually, look at its filename, guess (or read, if it's obvious) the password, unpack, and upload separately |
10:03:31 | | Wohlstand (Wohlstand) joins |
10:37:59 | <@arkiver> | hexagonwin_: if there is a password protected file, is all the information there to decrypt it? (for a human to interpret the information to decrypt) |
10:47:52 | <masterx244|m> | often the information is where the file is linked from in the forum thread that links to the data |
10:50:24 | <hexagonwin_> | arkiver: this site(AFH) allowed uploads from anyone in any format. There can be literally any case.. some files I've seen have the password in the filename, some passwords can be only found on another website, others we can't know at all. |
11:00:05 | | Bleo182600722719623455222 quits [Client Quit] |
11:02:49 | | Bleo182600722719623455222 joins |
11:06:48 | <anonymoususer852> | There is a way to test whether an archive is password protected: https://stackoverflow.com/a/56030344 - so in theory it is possible to script this to search for such cases. |
11:07:22 | | Wohlstand quits [Client Quit] |
11:07:24 | | rohvani quits [Ping timeout: 260 seconds] |
11:29:40 | <anonymoususer852> | As a PoC, I downloaded the said, "Pictures.zip", ran a rather long string of command, and it does indicate that it is password protected, https://transfer.archivete.am/inline/cuGqp/7z_passwd_protect_zip_batch_test.txt |
11:30:46 | | Commander001 quits [Remote host closed the connection] |
11:31:02 | | Commander001 joins |
11:34:05 | <anonymoususer852> | Password protected, "Pictures.zip": https://www.k8oms.net/document/password-protected-file |
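For plain ZIPs, the check anonymoususer852 scripted with 7z can also be done from Python's standard library by looking at the encryption bit in each member's general-purpose flags (ZIP only; for rar/7z and other formats one would still shell out to a tool like 7z, whose `l -slt` listing marks encrypted entries):

```python
import io
import zipfile

def zip_members_encrypted(data: bytes) -> bool:
    """Return True if any member of the ZIP sets the encryption flag (bit 0 of the general-purpose flags)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return any(info.flag_bits & 0x1 for info in zf.infolist())
```

Since the stdlib zipfile module cannot write encrypted archives, only the unencrypted case is easy to demonstrate locally; the flag check itself works for both ZipCrypto and AES-encrypted entries, as both set bit 0.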
11:41:05 | | etnguyen03 (etnguyen03) joins |
11:49:26 | <hexagonwin_> | it would be possible to find password protected files among all the downloaded files, but the problem is that not uploading the password protected files would be a loss of information since the password might be obtainable elsewhere (or from the filename) |
11:56:53 | <anonymoususer852> | Could probably try test-extracting files using the password obtained from the filename and, if successful, optionally recompress them without the password. Long story short, though: it's still going to be a tedious process, for the sake of hopefully sparing future users from having to guess. |
12:09:06 | <chrismrtn> | The basic type of ZIP encryption (not the AES variants, as far as I recall) is vulnerable to a known-plaintext attack. I've cracked one in the past using (I think) 16 bytes of a file header, and it didn't take too long. At the moment, I can't check what type of encryption is used for your example ZIP, though. |
12:10:05 | <hexagonwin_> | it may not even be a zip. there's no restriction on the file format the site accepts, so it could be anything like zip, rar, 7z, etc |
12:10:50 | <hexagonwin_> | i even recall getting some weird archive file in this format from the site before https://en.wikipedia.org/wiki/B1_Free_Archiver?useskin=monobook |
12:14:03 | <chrismrtn> | Ah, that's unfortunate. Obscure file types are even more of a hassle for long-term preservation, and B1 is certainly a new one for me! |
12:14:43 | <masterx244|m> | and then we just need one case where it's an update file format that uses a passworded zip as its container, and then it's a "don't repack" case |
12:49:32 | | pabs quits [Read error: Connection reset by peer] |
12:50:11 | | pabs (pabs) joins |
13:07:28 | | Wohlstand (Wohlstand) joins |
13:18:04 | | monoxane quits [Ping timeout: 260 seconds] |
13:18:51 | | monoxane (monoxane) joins |
13:25:12 | | nine quits [Quit: See ya!] |
13:25:31 | | nine joins |
13:25:32 | | nine is now authenticated as nine |
13:25:32 | | nine quits [Changing host] |
13:25:32 | | nine (nine) joins |
13:27:45 | | lemuria quits [Remote host closed the connection] |
13:54:20 | | hexagonwin joins |
13:56:14 | | hexagonwin_ quits [Ping timeout: 258 seconds] |
14:44:22 | | NeonGlitch quits [Read error: Connection reset by peer] |
14:44:58 | | NeonGlitch (NeonGlitch) joins |
15:02:19 | <justauser|m> | Does IA check nested archives? You could pack them into .tars by first letters, first byte of SHA etc. |
15:05:44 | | HugsNotDrugs quits [Read error: Connection reset by peer] |
15:08:44 | | HugsNotDrugs joins |
15:21:48 | | HugsNotDrugs quits [Client Quit] |
15:22:07 | | HugsNotDrugs joins |
15:22:24 | | cyanbox quits [Read error: Connection reset by peer] |
15:23:21 | | nine quits [Client Quit] |
15:23:40 | | nine joins |
15:23:41 | | nine is now authenticated as nine |
15:23:41 | | nine quits [Changing host] |
15:23:41 | | nine (nine) joins |
15:38:32 | | nine quits [Client Quit] |
15:38:49 | | nine joins |
15:38:49 | | nine is now authenticated as nine |
15:38:49 | | nine quits [Changing host] |
15:38:49 | | nine (nine) joins |
15:43:14 | | Island joins |
15:49:05 | | lennier2_ joins |
15:50:54 | | lennier2 quits [Ping timeout: 260 seconds] |
16:31:58 | | dabs joins |
16:49:45 | | nyakase5 (nyakase) joins |
16:51:48 | | nyakase quits [Ping timeout: 258 seconds] |
16:53:17 | | dhinakg_ quits [Quit: dhinakg_] |
16:53:21 | | nyakase (nyakase) joins |
16:55:04 | | nyakase5 quits [Ping timeout: 260 seconds] |
17:05:33 | | datechnoman quits [Quit: Ping timeout (120 seconds)] |
17:06:04 | | datechnoman (datechnoman) joins |
17:13:16 | | Barto quits [Ping timeout: 258 seconds] |
17:16:23 | | igloo22225 quits [Quit: The Lounge - https://thelounge.chat] |
17:17:41 | | datechnoman quits [Client Quit] |
17:18:04 | | datechnoman (datechnoman) joins |
17:18:09 | | BornOn420 (BornOn420) joins |
17:25:01 | | Guest58 quits [Quit: My Mac has gone to sleep. ZZZzzz…] |
17:31:21 | | Barto (Barto) joins |
17:37:09 | | etnguyen03 quits [Quit: Konversation terminated!] |
17:46:08 | | etnguyen03 (etnguyen03) joins |
17:46:13 | | igloo22225 (igloo22225) joins |
17:53:59 | | Island quits [Ping timeout: 260 seconds] |
18:01:37 | | PredatorIWD25 quits [Read error: Connection reset by peer] |
18:04:43 | | Island joins |
18:14:39 | | etnguyen03 quits [Client Quit] |
18:26:40 | | Guest58 joins |
18:44:07 | | corentin quits [Ping timeout: 258 seconds] |
18:44:47 | | @AlsoJAA sets the topic to: Lengthy ArchiveTeam-related discussions, questions here | Offtopic: #archiveteam-ot | https://twitter.com/textfiles/status/1069715869994020867 |
19:09:24 | | lennier2 joins |
19:12:29 | | lennier2_ quits [Ping timeout: 258 seconds] |
20:01:04 | | corentin joins |
20:24:53 | | HP_Archivist quits [Read error: Connection reset by peer] |
20:55:59 | | anonymoususer852 quits [Ping timeout: 258 seconds] |
20:57:45 | | anonymoususer852 (anonymoususer852) joins |
21:19:45 | | Barto quits [Ping timeout: 258 seconds] |
21:24:45 | | Barto (Barto) joins |
21:38:15 | | Dada quits [Remote host closed the connection] |
21:39:02 | <h2ibot> | Anonymoususer852 edited Anubis/uncategorized (+26, Added https://git.slackware.nl/): https://wiki.archiveteam.org/?diff=57279&oldid=56871 |
21:59:14 | | b3nzo quits [Ping timeout: 258 seconds] |
22:12:08 | | dabs quits [Read error: Connection reset by peer] |
22:35:39 | | Suika quits [Ping timeout: 258 seconds] |
22:43:59 | | Suika joins |
22:53:04 | | Radzig2 joins |
22:55:34 | | Radzig quits [Ping timeout: 260 seconds] |
22:55:34 | | Radzig2 is now known as Radzig |
23:07:51 | | klg quits [Ping timeout: 258 seconds] |
23:15:31 | | @rewby quits [Ping timeout: 258 seconds] |
23:29:27 | | klg (klg) joins |
23:35:50 | | wickedplayer494 quits [Ping timeout: 258 seconds] |
23:36:18 | | wickedplayer494 joins |
23:36:36 | | wickedplayer494 is now authenticated as wickedplayer494 |
23:54:54 | | dabs joins |