00:02:00hackbug quits [Client Quit]
00:03:55hackbug (hackbug) joins
00:38:08whoami quits [Client Quit]
00:39:29whoami (whoami) joins
00:49:20programmerq quits [Ping timeout: 252 seconds]
00:53:29programmerq (programmerq) joins
02:17:56second (second) joins
02:18:21sec^nd quits [Ping timeout: 245 seconds]
02:18:22second is now known as sec^nd
02:44:38<pabs>I heard that someone inside Google has been trying to get rid of feedburner for years.
02:44:41<pabs>asked them to do a proper transition and also contact us before it goes away
03:14:32_19100 leaves
03:56:53Island quits [Read error: Connection reset by peer]
03:58:18user__ quits [Remote host closed the connection]
03:58:31user__ joins
04:05:35BlueMaxima quits [Read error: Connection reset by peer]
04:40:15TastyWiener95 (TastyWiener95) joins
04:42:54jamesatjaminit quits [Ping timeout: 252 seconds]
04:43:52geezabiscuit quits [Read error: Connection reset by peer]
04:44:19jamesatjaminit (jamesatjaminit) joins
04:44:40geezabiscuit (geezabiscuit) joins
05:02:12sonick (sonick) joins
05:03:59hackbug quits [Ping timeout: 265 seconds]
05:14:45<h2ibot>JustAnotherArchivist edited Zippyshare.com (+172, Update infobox): https://wiki.archiveteam.org/?diff=49604&oldid=49575
06:19:08Arcorann (Arcorann) joins
06:36:44user__ quits [Remote host closed the connection]
06:40:32umgr036 joins
06:41:20umgr036 quits [Remote host closed the connection]
06:41:33umgr036 joins
07:00:24DiscantX quits [Ping timeout: 252 seconds]
08:10:47qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
08:19:42benjins quits [Read error: Connection reset by peer]
08:19:54michaelblob quits [Read error: Connection reset by peer]
08:20:19michaelblob (michaelblob) joins
08:20:24benjins joins
08:21:02Jake quits [Client Quit]
08:21:46Jake (Jake) joins
08:21:59Lord_Nightmare quits [Quit: ZNC - http://znc.in]
08:22:20Lord_Nightmare (Lord_Nightmare) joins
09:43:29benjins quits [Remote host closed the connection]
09:43:38benjins joins
10:02:18<@OrIdow6^2>Shutterfly share sites is dead
10:03:12<@OrIdow6^2>I did not figure out in time how to generate image URLs
10:04:36TastyWiener95 quits [Client Quit]
10:05:13hitgrr8 joins
10:09:27<@OrIdow6^2>Appears that the DNS trick works to some extent
10:10:57<@OrIdow6^2>Well, I should go asleep now, but if it's still "up" in the morning I guess I'll just take a best guess at that image URL generation
10:12:56<@OrIdow6^2>Which probably isn't that hard, but I was trying to be a perfectionist :|
11:08:56dan_a quits [Client Quit]
11:12:51dan_a (dan_a) joins
11:24:12eroc1990 quits [Client Quit]
11:24:50eroc1990 (eroc1990) joins
11:30:45benjinsm joins
11:31:44benjins quits [Remote host closed the connection]
11:31:44qwertyasdfuiopghjkl quits [Client Quit]
11:38:06drin joins
11:38:35hitgrr8 quits [Client Quit]
11:38:35geezabiscuit quits [Client Quit]
11:38:35Terbium quits [Client Quit]
11:38:41Terbium joins
11:38:57drin is now known as geezabiscuit
11:39:43hitgrr8 joins
11:42:15qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
11:57:39hackbug (hackbug) joins
12:21:53<thuban>grab-site works with a pyenv 3.8 venv but not with a system 3.7 venv, because the latter looks for libre2.so.9 and chokes on libre2.so.10. why? idk -_-
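thuban's grab-site failure is a classic soname mismatch: the re2 extension compiled in one environment was linked against libre2.so.9, while the other environment's system library is libre2.so.10. A hedged diagnostic sketch — the `ldd` path shown in the comment is an assumption, not taken from the log:

```python
import re
from typing import Optional

def find_libre2_soname(ldd_output: str) -> Optional[str]:
    """Return the libre2 soname (e.g. 'libre2.so.9') referenced in `ldd` output.

    A compiled module that was linked against libre2.so.9 will fail to load
    on a system that only ships libre2.so.10 -- exactly the failure above.
    """
    m = re.search(r"(libre2\.so\.\d+)", ldd_output)
    return m.group(1) if m else None

# Hypothetical usage: inspect the re2 extension inside the broken venv, e.g.
#   ldd venv/lib/python3.7/site-packages/_re2*.so
# and feed the output to find_libre2_soname() to see which soname it wants.
```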
12:34:46Sluggs quits [Excess Flood]
12:36:25Sluggs joins
13:03:46Arcorann quits [Ping timeout: 252 seconds]
14:05:17_19100 (themillenniumbug) joins
14:05:55zhongfu quits [Ping timeout: 252 seconds]
14:12:59zhongfu (zhongfu) joins
14:22:49umgr036 quits [Remote host closed the connection]
14:41:54benjinsm is now known as benjins
14:54:18umgr036 joins
14:55:06umgr036 quits [Remote host closed the connection]
14:55:19umgr036 joins
14:59:08VerifiedJ quits [Quit: The Lounge - https://thelounge.chat]
14:59:33VerifiedJ (VerifiedJ) joins
15:00:33AnotherIki joins
15:04:24Iki1 quits [Ping timeout: 252 seconds]
15:27:28spirit quits [Client Quit]
15:36:22Island joins
15:45:01DopefishJustin quits [Remote host closed the connection]
15:45:53umgr036 quits [Remote host closed the connection]
15:49:35jacksonchen666 (jacksonchen666) joins
15:52:42DopefishJustin joins
15:56:20Sophira joins
16:01:47<Sophira>Hi there. I'm the owner of a site that's been running for the last 10-11 years or so dedicated to the TV Tropes ARG "The Wall Will Fall" ( http://twwf.info/ and its linked subdomains ). I intended to shut it down in December but haven't been able to bring myself to do so yet. The domain expires in a week, though, and I would prefer not to renew it if possible. Not all of the forum is archived on
16:01:53<Sophira>web.archive.org as for much of its life the forums had restrictive robots.txt files. I removed them a while back but there's a lot that still isn't in the archive. Is it possible to request that the sites be archived?
16:02:05<Sophira>As the owner I'm willing to help in any way I can for this to happen.
16:02:22<Sophira>(I was one of the original puppetmasters on the ARG.)
16:02:48<Sophira>Actually no my mistake, I think the forums had some kind of bot detection IIRC.
16:02:48Nulo quits [Read error: Connection reset by peer]
16:02:55Nulo joins
16:23:55<thuban>Sophira: yes, certainly! do you have a sitemap you can provide?
16:39:17<Sophira>thuban: I don't. I should be able to create one, I think, though it might take a while. Would a list of URLs suffice?
16:39:39<thuban>a list of urls would be perfect
16:39:48<Sophira>Also bear in mind that this will cover several different hostnames, though they're all under the umbrella domain of twwf.info.
16:41:11<thuban>that should be fine
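Since a plain list of URLs is all that's needed, one way to build it is a small same-host link walk. A minimal stdlib-only sketch of the extraction step (the fetching loop and any twwf.info specifics are left to the operator; the hostnames in the test are illustrative):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect absolute hrefs from an HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.add(urljoin(self.base_url, value))

def same_host_links(html, base_url):
    """Return links found in `html` that stay on `base_url`'s host."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    host = urlparse(base_url).netloc
    return sorted(u for u in parser.links if urlparse(u).netloc == host)
```

Running this over each page, queueing the unseen results, and printing the visited set gives exactly the kind of URL list requested above.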
16:41:32<Sophira>Okay. I'll do what I can, then! It might take a while though, as I say. Is there any kind of special processing you would normally do for phpBB forums and Wordpress sites?
16:42:06<pokechu22>wordpress and phpbb can both be done with archivebot without much issue
16:42:59<pokechu22>(in that the annoying stuff has mostly already been solved with some standard ignoresets)
16:44:05<Sophira>Awesome. One thing to bear in mind is that many of the sites will link to each other in forum posts and blog comments, so those 'external' links will need to be rewritten accordingly.
16:45:40<pokechu22>Yeah, archivebot doesn't do that super well - it only recurses within a single domain and saves individual outlinks. If each of the wordpress/phpBB forums has a front page where everything can be accessed that won't be as much of a problem though
16:46:45<pokechu22>there isn't any super good way to rewrite them with archivebot as-is :/
16:47:48<thuban>i think there's a miscommunication here--archivebot (like archiveteam) does not rewrite anything
16:48:31<Sophira>Oh, even links within the same site?
16:50:26<thuban>archivebot _follows_ links, but it won't alter anything. if an old post on foo.twwf.info links to bar.com (which is now copied at bar.twwf.info), it will be saved exactly as it is, including the link to bar.com
16:51:56<pokechu22>http://twwf.info/ says that links like that are already rewritten though so that might not be a problem
16:52:51<Sophira>Yeah, I use mod_filter on the server in order to do domain substitution like that.
16:53:46<Sophira>But yes, all links to sites that have been archived to a subdomain beneath twwf.info are rewritten automatically.
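Sophira's actual server configuration isn't shown in the log; a hypothetical Apache sketch of this kind of on-the-fly domain substitution, using mod_substitute dispatched through the output filter chain (the example domain on the left-hand side is made up):

```apache
# Rewrite archived off-site domains to their twwf.info mirrors in HTML output.
# Requires mod_filter and mod_substitute to be loaded.
<Location "/">
    AddOutputFilterByType SUBSTITUTE text/html
    Substitute "s|http://juliet-blog.example.com|http://juliet.ezblog.twwf.info|ni"
</Location>
```

Note that this rewrites the *served* pages, which is why the links arrive at ArchiveBot already pointing at the twwf.info subdomains.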
16:54:10<thuban>what did you then mean by "those 'external' links will need to be rewritten"?
16:54:52VerifiedJ quits [Client Quit]
16:55:32VerifiedJ (VerifiedJ) joins
16:56:27umgr036 joins
16:56:44<Sophira>I mean 'external' in that, for example, some users making comments on Romeo's blog site, romeo.ezblog.twwf.info, have links to Juliet's blog site, which are rewritten to juliet.ezblog.twwf.info automatically. From your point of view, juliet.ezblog.twwf.info will be a different site from romeo.ezblog.twwf.info, right?
16:57:50<Sophira>That's what I mean by 'external', and that's why I put the word in quotes - because they're still under twwf.info, but from the point of view that they use two different hostnames, they could be considered two different sites.
16:58:25<thuban>ah. so by "rewrite" you only mean 'consider as part of the same site'.
16:58:29<Sophira>The main page at http://twwf.info/ (sorry, no HTTPS) links to all the various sites and they should all be accessible.
16:58:38<pokechu22>Yes, but that won't cause issues with doing two separate jobs that recurse over all of http://romeo.ezblog.twwf.info/ and http://juliet.ezblog.twwf.info/ (the pages that are linked between them would get saved twice, but that's probably fine)
16:58:59<pokechu22>It'd be an issue for http://xovr.twwf.info/ though and any deep links that aren't reachable from the front page
16:59:40<thuban>archivebot's subdomain handling is complicated™, but a complete sitemap will render it moot
17:00:56<thuban>(or, failing that (eg for the forums), a complete list of subdomains)
17:01:42<pokechu22>My thought is that doing an !a on each of the forums and blogs would get good enough coverage of those; wordpress and phpbb are usually fine for discovering pages even without a sitemap (though wordpress generally generates a sitemap anyways; seems like there isn't one in this case (too old?))
17:02:03Island_ joins
17:04:12<pokechu22>I might as well just try it and see how it goes... Sophira, any parameters on rate-limiting? Archivebot's default is 3 concurrent requests, with a 250-375 millisecond wait after each request
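ArchiveBot's pacing as pokechu22 describes it — a fixed number of concurrent requests, each followed by a short randomized wait — can be sketched with an asyncio semaphore. This is an illustration of the policy, not ArchiveBot's actual code; `fetch` is a caller-supplied coroutine:

```python
import asyncio
import random

async def fetch_politely(urls, fetch, concurrency=3):
    """Fetch `urls` with at most `concurrency` in flight, sleeping
    250-375 ms after each request (ArchiveBot's stated default)."""
    sem = asyncio.Semaphore(concurrency)

    async def one(url):
        async with sem:
            result = await fetch(url)
            # Randomized politeness delay before releasing the slot.
            await asyncio.sleep(random.uniform(0.250, 0.375))
            return result

    # gather() preserves input order in its results.
    return await asyncio.gather(*(one(u) for u in urls))
```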
17:04:54hitgrr8_ joins
17:05:03VerifiedJ8 (VerifiedJ) joins
17:05:32Sophira_ joins
17:05:34<Sophira_>Okay. An example of a page on xovr.twwf.info, btw, would be http://xovr.twwf.info/i_xukb3tnd.php . Entering the password "Gurt" (case-sensitive) would then show an image. I assume in these cases I should give both the pages themselves and the image URLs.
17:05:53VerifiedJ quits [Client Quit]
17:05:53hitgrr8 quits [Client Quit]
17:05:53Sluggs quits [Client Quit]
17:05:53qwertyasdfuiopghjkl quits [Client Quit]
17:05:53AnotherIki quits [Remote host closed the connection]
17:05:53Island quits [Remote host closed the connection]
17:05:53Sophira quits [Remote host closed the connection]
17:05:53VerifiedJ8 is now known as VerifiedJ
17:06:00Sophira_ is now known as Sophira
17:06:41<Sophira>(also, the last thing I saw was thuban saying subdomain handling is complicated™.)
17:07:32<thuban>Sophira: https://hackint.logs.kiska.pw/archiveteam-bs/20230328#c340128
17:07:58qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
17:08:27Sluggs joins
17:10:13<Sophira>Ah, thank you! Odd that my message sent but I didn't see anybody else's. Oh well. As for rate-limiting, I imagine that'll be fine. The sites themselves aren't really used any more so there won't really be any disturbances.
17:12:11<Sophira>Subdomain-wise, I *think* all the subdomains are on twwf.info's front page. Let me just double-check.
17:17:15<Sophira>Yeah, they're all listed, I believe. Also just to note, the only sites that are Wordpress/phpBB-based are watchthefootage.twwf.info, forum.watchthefootage.twwf.info, and all the *.ezblog.twwf.info subdomains.
17:19:37<Sophira>Actually, that said, I would also like to archive the phpBB forum at forum.twwf.info. It's not listed in the main table because it only became a thing after the ARG itself, but it has a lot on it.
17:19:52<Sophira>(not so active any more though, heh)
17:21:44<pokechu22>Yeah, I can do that too. Last active Dec 31, 2022 is fairly good as far as inactive forums go :P
17:22:56<pokechu22>I've started on the ezblog ones
17:23:11<Sophira>Awesome, thank you <3
17:24:52<Sophira>Heee. I like the "and not" in the User-Agent string.
17:28:49<Sophira>So does this mean that, with regard to the sitemap, I don't need to bother with grabbing all the post URLs and such from the databases?
17:29:00<Sophira>Or should I do that anyway?
17:29:21<pokechu22>For the wordpress ones? It's probably not necessary
17:30:21<pokechu22>It might be useful after everything's been saved to verify that it's actually complete, though (but that would have to be in a few days)
17:35:48<Sophira>That makes sense. Okay.
17:52:55<pokechu22>Based on http://watchthefootage.twwf.info/ there's also several twitter accounts linked with it - I can save those via socialbot. Is there a more complete list than the ones in the sidebar?
17:54:38IDK (IDK) joins
17:58:29<Sophira>One moment...
18:00:50<kiska>I hate npm... I broke etherpad :(
18:03:06<Sophira>I can't think of any other Twitter accounts to archive. I think it's complete.
18:07:14jamesatjaminit quits [Client Quit]
18:07:24_19100 quits [Client Quit]
18:11:13_19100 (themillenniumbug) joins
18:32:15sadsa joins
18:32:22sadsa quits [Remote host closed the connection]
18:46:58automato83 quits [Ping timeout: 252 seconds]
18:53:03<kiska>Fuck that was annoying... pad.notkiska.pw is back online
19:07:51qwertyasdfuiopghjkl quits [Remote host closed the connection]
19:08:27qwertyasdfuiopghjkl joins
19:12:26jamesatjaminit (jamesatjaminit) joins
19:13:46jamesatjaminit quits [Remote host closed the connection]
19:20:30automato83 joins
19:21:37Barto quits [Ping timeout: 252 seconds]
19:24:03jamesatjaminit (jamesatjaminit) joins
19:28:39TastyWiener95 (TastyWiener95) joins
19:46:29AnotherIki joins
19:50:33Pingerfowder quits [Quit: ZNC - https://znc.in]
19:50:42Pingerfowder (Pingerfowder) joins
19:53:42@rewby quits [Ping timeout: 252 seconds]
20:01:37jacksonchen666 quits [Client Quit]
20:08:52dan_a quits [Client Quit]
20:12:00user__ joins
20:14:58umgr036 quits [Ping timeout: 252 seconds]
20:23:48Barto (Barto) joins
20:26:41<qwertyasdfuiopghjkl>Sophira: From searching for links to twitter.com on https://tvtropes.org/pmwiki/pmwiki.php/Recap/TheWallWillFall and the other tvtropes wiki pages linked from that, I found https://twitter.com/DeadCatInABox , https://twitter.com/GurtTheLimeMan and https://twitter.com/RADIOVOIDREBEL that look related but weren't listed on the sidebar
20:32:40dan_a (dan_a) joins
20:40:24dan_a quits [Client Quit]
20:47:47dan_a (dan_a) joins
21:06:43rewby (rewby) joins
21:06:43@ChanServ sets mode: +o rewby
21:11:26hitgrr8_ quits [Client Quit]
21:16:06user__ quits [Read error: Connection reset by peer]
21:16:29user__ joins
22:28:19<@OrIdow6^2>Shutterfly share sites is indeed usable with the DNS trick
22:32:24BlueMaxima joins
23:35:44<driib>Hi, I've been running a warrior on the telegrab project for some short while and got interested to check what kind of data ends up published on IA. I tried to download and inspect one package from https://archive.org/details/archiveteam_telegram but ran into some issues. 1) I cannot unzstd the megawarc file due to a "Decoding error (36) :
23:35:44<driib>Dictionary mismatch"; the internet says it's due to an external dictionary use but I can't seem to find one on https://archive.org/download/archiveteam_telegram_20230327203637_2cf0eb8f, for example. 2) https://github.com/internetarchive/warctools does not seem to include tools to deal with megawarcs or zstd, what CLI tools do you recommend if I
23:35:44<driib>want to look into the payload of a single item? Thank you all for your patience with my noob questions! Hope I put em in the right channel too.
23:37:12<pokechu22>Pretty sure this is the right channel but I don't have an answer beyond that
23:37:15<@OrIdow6^2>The dict is in a skippable frame at the beginning of the zstd
23:37:47<@OrIdow6^2>I wrote an awful tool to extract them a while back, I believe someone else wrote a better one, but if no one comes around with that in a bit I can give you the old one
23:39:31<@OrIdow6^2>(It's in the skippable frame, and furthermore it itself is compressed with vanilla zstd)
23:48:54user__ quits [Remote host closed the connection]
23:49:52umgr036 joins
23:55:12<@OrIdow6^2>driib: Alright, actually I've made a little new one without the dependency issue https://transfer.archivete.am/sW9PL/get_zstd_dict_simple.py
23:55:33<@OrIdow6^2>This takes the name of the .warc.zst as its argument and puts the compressed dict to stdout