00:22:04jacobk quits [Ping timeout: 240 seconds]
01:20:28@rewby quits [Ping timeout: 240 seconds]
01:20:42rewby (rewby) joins
01:20:43@ChanServ sets mode: +o rewby
01:30:05Mateon2 joins
01:30:45Mateon1 quits [Ping timeout: 255 seconds]
01:30:45Mateon2 is now known as Mateon1
02:10:43qwertyasdfuiopghjkl joins
02:40:00jacobk joins
03:03:32katocala joins
04:04:50michaelblob_ (michaelblob) joins
04:08:04michaelblob quits [Ping timeout: 240 seconds]
04:20:11<pabs>JAA: when I got stuck into this mailman2 stuff, one thing I noticed is sites where lists aren't on the main listinfo page but can be found via search engines, and some of those have public archives
04:20:49<pabs>so I think for each mailman site, searching for site:foo inurl:listinfo and site:foo inurl:pipermail might be good
04:21:49<pabs>in my list I'm recording individual lists where the main listinfo page had zero lists, but I just discovered that some sites have lists on the main listinfo page while other lists, findable via search engines, are missing from it
04:31:17<pabs>it's pretty common in the rabbit warren I entered: universities in Germany :)
04:53:25<pabs>JAA: ok, this is my final list
04:53:26<pabs>https://paste.debian.net/plain/1258270
04:53:48<pabs>don't want to get sucked into this too much more, so I'm going to stop now
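The search-based discovery pabs describes above can be sketched as a small script that emits the two web-search queries per mailman host; the host in the example is a placeholder.

```python
# A minimal sketch of the discovery approach described above: for
# each known mailman host, emit the search-engine queries that tend
# to surface lists hidden from the main listinfo overview page.
# The host in the example loop is a placeholder.

def discovery_queries(host):
    """Return search-engine queries for finding hidden mailman
    lists and public pipermail archives on a given host."""
    return [
        f"site:{host} inurl:listinfo",   # list info pages
        f"site:{host} inurl:pipermail",  # public archives
    ]

if __name__ == "__main__":
    for host in ["lists.example.org"]:  # placeholder host list
        for query in discovery_queries(host):
            print(query)
```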
06:35:58BlueMaxima quits [Read error: Connection reset by peer]
06:43:39katocala quits [Remote host closed the connection]
06:43:51katocala joins
06:46:24qwertyasdfuiopghjkl quits [Remote host closed the connection]
06:46:24lennier2 joins
06:46:35qwertyasdfuiopghjkl joins
06:48:37katocala quits [Ping timeout: 253 seconds]
06:49:09katocala joins
06:49:35Sluggs quits [Ping timeout: 260 seconds]
06:49:35lennier1 quits [Ping timeout: 260 seconds]
06:49:44lennier2 is now known as lennier1
06:50:42Sluggs joins
07:37:30lukash79 quits [Ping timeout: 255 seconds]
07:48:38<Ruk8>Update: The list of content not yet archived is *HUGE*; if someone wants to help with this I'll be glad. Especially for checking that all content is archived (I'm copying the links by hand, sanitizing them by hand, checking if previously archived versions are already available and in that case just skipping, by hand, and so on...)
07:48:38<Ruk8>Currently I'm doing the Windows versions of CS 6 and all the other products that are not che CS suite.
07:48:38<Ruk8>If someone wants to register a list of the other URLs (CS6's DMGs and previously available CS products), please help.
07:48:38<Ruk8>To do this work you'll need to set up a cookie in your browser and get a bit of hints.
07:49:59<Ruk8>The first step described obviously comes after the content is uploaded to the IA
07:57:44<Ruk8>I'm not considering asking for the creation of a new project because all of that could be done in 1-3 days by 2-3 people.
07:57:44<Ruk8>Honestly I could do it all by myself, but I'm giving up; checking previously archived URLs among a bunch of previously-403 pages is kinda exhausting.
08:00:51@rewby quits [Remote host closed the connection]
08:01:00<Ruk8>*che -> the
08:16:23rewby (rewby) joins
08:16:23@ChanServ sets mode: +o rewby
08:55:21abirkill quits [Remote host closed the connection]
08:57:39abirkill (abirkill) joins
08:58:16sonick quits [Client Quit]
09:18:52<@rewby>Found a blogging host that's going to close up shop soon: https://tweakblogs.net/
09:24:39<Ryz>rewby, when is it shutting down? oo;
09:25:09<Ryz>Looks like this article says '2023' via https://tweakers.net/plan/3726/we-nemen-afscheid-van-tweakblogs.html - so when that year is reached
09:25:38<asie4>I mentioned chomikuj.pl a long time ago. Well, it seems the time has come.
09:25:58<asie4>They are removing all "long unused" files on November 16th, 2022.
09:26:31<asie4>Some claim this is a yearly procedure, but I have never seen this happen in the past
09:26:50<asie4>I have a longer write-up about the site somewhere in the IRC logs...
09:27:15sonick (sonick) joins
09:28:55<@rewby>Ryz: Yeah, they say January 2023 is when they'll shut down
09:30:24<Ryz>asie4, the problem is, as far as I recall from poking around that website months ago, it requires payment just to download the files...? S:
09:30:34<asie4>Sort of.
09:30:39<asie4><1MB files are free without an account
09:30:45<asie4>And a free account can download 50MB a week
09:31:16<asie4>So, in theory, a scraping job for small files could be viable; for anything else, the best one can do is create an easily searchable list so people can figure out if anything's important.
09:31:28<asie4>(They have a search engine but it's bare-bones)
09:32:26<Ryz>The search for the juiciest loot
09:36:32<asie4>It is possible to automate file downloading given URLs - plenty of PoCs exist online
09:36:44<asie4>There's also some work being done on figuring out if any workarounds exist
09:53:53treora quits [Ping timeout: 268 seconds]
09:58:23treora joins
10:53:40treora quits [Ping timeout: 240 seconds]
11:09:44JackThompson quits [Ping timeout: 268 seconds]
11:11:50JackThompson joins
11:45:37treora joins
11:46:10adia quits [Quit: The Lounge - https://thelounge.chat]
11:46:23adia (adia) joins
11:46:45adia quits [Client Quit]
11:49:10adia (adia) joins
12:01:47JackThompson quits [Client Quit]
12:01:48dm4v quits [Client Quit]
12:01:54dm4v_ joins
12:02:05JackThompson joins
12:02:21dm4v_ is now known as dm4v
12:24:28treora quits [Ping timeout: 240 seconds]
12:25:37niku quits [Remote host closed the connection]
13:16:05<@JAA>pabs: Yeah, hidden lists are very common. Plenty of examples also in the internet infrastructure project I started a few months ago.
13:16:31<pabs>some of them aren't very well hidden :)
13:17:25<@JAA>Yeah. Some are cross-linked from emails in other lists, others are findable via web searches, some show up in the WBM.
13:18:08<@JAA>I built lists of mailing lists from searches and the WBM, ran the main job, then filtered out anything encountered there, and ran the rest manually.
13:18:53<@JAA>See e.g. https://wiki.archiveteam.org/index.php/Internet_infrastructure#Mailing_lists
13:19:10<pabs>ok. you probably got everything already, but will you check the list I prepared too?
13:19:28<@JAA>I absolutely didn't, only archived like two or three instances so far.
13:19:47<@JAA>I don't want to make this my life either. :-P
13:19:51<@JAA>But yeah, will do.
13:21:29<pabs>I see :) thanks
14:20:47tech_exorcist (tech_exorcist) joins
15:55:39sec^nd quits [Ping timeout: 255 seconds]
15:59:08sec^nd (second) joins
16:04:05qwertyasdfuiopghjkl quits [Remote host closed the connection]
16:08:20michaelblob (michaelblob) joins
16:12:04michaelblob_ quits [Ping timeout: 240 seconds]
16:12:31qwertyasdfuiopghjkl joins
17:04:13nostrum-tango quits [Remote host closed the connection]
17:04:29nostrum-tango joins
17:16:52Ketchup901 quits [Remote host closed the connection]
17:16:53mutantm0nkey quits [Remote host closed the connection]
17:17:30Ketchup901 (Ketchup901) joins
17:17:56mutantm0nkey (mutantmonkey) joins
17:27:54sec^nd quits [Ping timeout: 255 seconds]
17:30:50march_happy quits [Ping timeout: 268 seconds]
17:33:37sec^nd (second) joins
17:48:04jacobk quits [Ping timeout: 240 seconds]
18:16:30sec^nd quits [Ping timeout: 255 seconds]
18:26:13sec^nd (second) joins
18:26:28Ketchup901 quits [Client Quit]
18:32:42Ketchup901 (Ketchup901) joins
18:55:22<@rewby>JAA: I've been doing some snooping on tweakblogs.net and I think this might either be a good use of a qwarc run or a dpos project.
18:55:55<@rewby>So here's what I figured out:
18:56:24<@rewby>So https://tweakblogs.net/ is the main front page, but each blog is hosted on a subdomain
18:56:31<@rewby>Example, https://iplaygamesonyoutube.tweakblogs.net/blog/20122/gris-review
18:56:41<@rewby>There's only 20k or so posts I think
18:56:56<@rewby>But the main issue is discovering the subdomains, because ids are only valid on the subdomain they belong to
18:57:11<@rewby>One does not need the actual post slug, the ID is enough
18:57:30<@rewby>I.e. https://iplaygamesonyoutube.tweakblogs.net/blog/20122/ works just fine for accessing the post from before
18:58:03<@rewby>Example: This is a valid id on another blog: https://iplaygamesonyoutube.tweakblogs.net/blog/20118/
18:58:26<@rewby>Each subdomain, one may notice, includes a link to their profile on the main tweakers site.
18:59:05<@JAA>Yeah, just saw that, and those profile pages just take a numeric ID.
18:59:11<@rewby>Ding
18:59:15<@rewby>So https://tweakers.net/gallery/338396/ is the profile page
18:59:18<@JAA>The activiteit subpage lists the blog.
18:59:22<@rewby>If you go to activity https://tweakers.net/gallery/338396/activiteit/
18:59:24<@rewby>Yep
18:59:35<@rewby>And the ids are sequential
18:59:45<@rewby>And 500000 is invalid so I presume that means less than 500k accounts
18:59:58<@rewby>All of which seem rather doable numbers
19:00:33<@rewby>The actual blogs could even be scraped by just a bunch of !a commands on AB.
19:00:45<@JAA>IDs go higher, e.g. https://tweakers.net/gallery/1260706/
19:00:46<@rewby>It's very basic recursive crawling once you have the subdomains
19:00:48mutantm0nkey quits [Remote host closed the connection]
19:00:54<@rewby>Ah dangit
19:00:56<@JAA>But still reasonable
19:01:03<@rewby>I must've spot checked some numbers that didn't exist then
19:01:14<@JAA>'Geregistreerd op 3 oktober 2019' on that one.
19:01:33<@JAA>https://tweakers.net/gallery/1638202/ July 2021
19:01:44<@rewby>https://tweakers.net/gallery/1838350/activiteit/ was this month
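The enumeration rewby and JAA work out above (sequential tweakers.net profile IDs, with each activiteit page linking back to blog subdomains) could look roughly like this; the ID ceiling and the politeness delay are assumptions, and the link-extraction regex is illustrative.

```python
# Rough sketch of the subdomain discovery discussed above: walk
# sequential tweakers.net profile IDs and collect *.tweakblogs.net
# subdomains linked from each profile's activiteit page. The upper
# ID bound and the request delay are assumptions.
import re
import time
import urllib.request

SUBDOMAIN_RE = re.compile(r"https?://([a-z0-9-]+)\.tweakblogs\.net")

def extract_subdomains(html):
    """Return the set of tweakblogs subdomains linked in a page."""
    return set(SUBDOMAIN_RE.findall(html))

def crawl_profiles(start=1, stop=1_900_000):  # assumed ID range
    found = set()
    for profile_id in range(start, stop):
        url = f"https://tweakers.net/gallery/{profile_id}/activiteit/"
        try:
            with urllib.request.urlopen(url) as resp:
                found |= extract_subdomains(
                    resp.read().decode("utf-8", "replace"))
        except Exception:
            pass  # nonexistent profile IDs are expected
        time.sleep(0.5)  # politeness delay; real rate limit unknown
    return found
```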
19:01:53mutantm0nkey (mutantmonkey) joins
19:02:37<@JAA>Are those blog post IDs globally unique? Are there really only 20k blog posts in total?
19:02:44<@rewby>Think so
19:02:53<@JAA>Huh
19:03:01<@rewby>https://tweakers.net/plan/3726/we-nemen-afscheid-van-tweakblogs.html mentions 10k blog posts
19:03:04<@rewby>And 160k "reactions"
19:03:31jacobk joins
19:03:52<@JAA>Actually, all blog posts seem to have even IDs. I'm not sure I want to know why that is. :-)
19:03:53<@rewby>And only about 1000 blogs
19:03:57<joepie91|m>https://tweakblogs.net/?allWeblogs=1 is apparently relevant
19:04:02<joepie91|m>but I'm seeing some weird stuff
19:04:13<joepie91|m>sec
19:04:38<@rewby>Are we sure that's all of them?
19:05:49<joepie91|m>earlier today someone elsewhere reported seeing only 2 entries at https://coltrui.tweakblogs.net/blog/ and the link at https://twitter.com/arnoudwokke/status/244414155036176384 being dead, but neither of those seem to be the case now?
19:06:03<joepie91|m>seems like some flaky stuff going on here
19:06:23<@rewby>Potentially depending on what server you hit?
19:06:42<joepie91|m>possibly. means we need to be extra careful in indexing them though
19:07:12<@JAA>(Nevermind regarding the even IDs, only seems to apply to recent posts on https://tweakblogs.net/?extUpdate=1 )
19:07:55<joepie91|m>hm https://tweakers.net/plan/3726/we-nemen-afscheid-van-tweakblogs.html?showReaction=18078860#r_18078860
19:10:49Xesxen (Xesxen) joins
19:22:17Peetz0r|m joins
19:23:41jacobk quits [Ping timeout: 268 seconds]
19:33:02Ketchup901 quits [Remote host closed the connection]
19:35:29<Peetz0r|m>Hey, I came here because I heard tweakblogs was being discussed and I may be interested to help out
19:36:07<Peetz0r|m>I have time, bandwidth (1G) not much storage (I guess around 100G) and motivation
19:36:31<Peetz0r|m>(also my name is somewhere on https://tweakblogs.net/?allWeblogs=1 too)
19:49:42<asie4>Update on the Chomikuj.pl woes: The deletion in November seems to be limited to a specific type of files; searching around showed mostly files which are expensive and risky to host, like Blu-Ray movie ISOs. The site is apparently safe otherwise
19:58:52dm4v quits [Ping timeout: 265 seconds]
20:21:19<Xesxen>If context on tweakblogs wasn't provided yet: They're shutting down the service sometime in January 2023 and content will be purged/become unavailable. It contains 955 unique blogs, 9409 blog posts total as of earlier today, and ~160k comment posts.
20:22:29<Xesxen>Since I'm unfamiliar with how archiveteam projects are run/started: how does one do this?
20:26:39dm4v joins
20:41:06<@JAA>The blogs on https://tweakblogs.net/?allWeblogs=1 in an easier data format: https://transfer.archivete.am/5fNvt/tweakblogs.jsonl
20:42:59<@JAA>9462 posts per that. Close enough to the mentioned 10k that we can assume it's complete, I guess.
20:58:35<joepie91|m>JAA: you mentioned 20k earlier, though?
20:58:51<joepie91|m>or was that just based on the post IDs?
20:58:54<@JAA>That's post IDs. Maybe many were deleted.
20:58:59<joepie91|m>aha.
20:59:04sloop_ quits [Quit: ZNC 1.8.2 - https://znc.in]
20:59:06<joepie91|m>right
20:59:16sloop joins
20:59:19<@JAA>And the recent posts linked on https://tweakblogs.net/?extUpdate=1 all have even IDs, so maybe there are gaps due to $reasons as well.
21:00:01<joepie91|m>Xesxen: depends on the project, some are distributed (http://tracker.archiveteam.org/), some are just "someone throws a server at it"
21:00:20<@JAA>I'm contemplating just running these through ArchiveBot with queueh2ibot.
21:01:15<joepie91|m>what's queueh2ibot?
21:02:46tech_exorcist quits [Client Quit]
21:10:37<@JAA>A thing to continuously throw jobs into AB as others finish.
21:10:47<joepie91|m>ahhh
21:10:53<joepie91|m>wait, so a queue for the queue? 🤔
21:11:23<@JAA>Yeah, a low-priority queue if you will. The bot will back off if others start jobs, and I generally configure it to leave a few slots open at all times.
21:11:59<@JAA>More accurately, it submits a job whenever the total job counter is below a pre-configured value. And I set that value to a bit under the current capacity.
21:12:24<joepie91|m>ahh, right
21:12:55<@JAA>(And even more accurately, queueh2ibot is just another instance of http2irc, like h2ibot, and the thing actually doing the queueing is a script that uses curl to talk to it.)
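The mechanism JAA describes above (a script that keeps ArchiveBot topped up by submitting jobs through an http2irc instance whenever the total job count drops below a threshold) might be sketched like this; the endpoint URL, the threshold, and the job-count source are all placeholders, not the real setup.

```python
# Sketch of the low-priority queueing described above: submit an
# "!a <url>" job through the http2irc endpoint whenever the total
# ArchiveBot job count is below a threshold set a bit under current
# capacity. The endpoint URL and threshold here are placeholders;
# the real setup uses curl against a private http2irc instance.
import time
import urllib.request

H2IRC_ENDPOINT = "https://h2irc.example.org/send"  # placeholder
MAX_JOBS = 20  # set a bit under current AB capacity

def should_submit(current_jobs, max_jobs=MAX_JOBS):
    """Back off while jobs started by others fill the pipeline."""
    return current_jobs < max_jobs

def submit_job(url):
    """Send an ArchiveBot '!a' command via the http2irc endpoint."""
    urllib.request.urlopen(H2IRC_ENDPOINT, data=f"!a {url}".encode())

def drain(urls, get_job_count):
    """Feed URLs into AB as capacity frees up; get_job_count is a
    caller-supplied function returning the current total job count."""
    for url in urls:
        while not should_submit(get_job_count()):
            time.sleep(60)  # wait for slots to open up
        submit_job(url)
```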
21:15:03<@rewby>JAA: I can second queueh2ibot. Sounds like a perfect solution tbh.
21:15:26<@rewby>I'd say maybe not capture outlinks, depending on how that ends up?
21:15:28<joepie91|m>so a queue for a proxy to a queue :-)
21:15:53<@rewby>Although you may wanna check if pages with a ton of reactions need JS to load all reactions
21:16:02<@rewby>Judging by the rest of this site, I'm guessing no
21:16:09<@rewby>But it's worth a double check to be sure
21:16:11<@rewby>Since AB doesn't do JS
21:17:59LeanTo joins
21:18:27<Xesxen>Comments don't need JS to be visible, they will all show up at once on the post page itself
21:18:35<@rewby>Perfect
21:18:59<Ryz>How many of these websites are there to toss into ArchiveBot via queueh2ibot?
21:19:08<@rewby>Thousand or so
21:20:14<@rewby>955 blogs apparently
21:20:23<@rewby>But we have until January
21:20:37<@rewby>So if we can get this started within the week, we should be done well in time
21:23:30<LeanTo>Hey folks, don't know if there's any interest but I didn't see it on the death watch list...
21:23:57<LeanTo>A few days ago, there was an announcement that the Turner Classic Movies message boards were going to be taken down
21:24:38<LeanTo>"Our last effective day to participate on the TCM Forums will be Wednesday, November 30, 2022. However, the TCM Forums content will remain available to view until February 15, 2023, so users may continue to have access to content. After this time, all customer data will be deleted from the forums."
21:24:57<@rewby>We'd probably want to wait until the 1st of dec to archive it then
21:25:04<@rewby>To capture all of it
21:25:05<joepie91|m>hmm. is this related to the Cartoon Network nonsense?
21:25:18<LeanTo>Dunno, but the same company
21:25:24<joepie91|m>yep, hence my suspicion
21:25:25<LeanTo>Discover/WB own them
21:25:34<joepie91|m>I think I saw another thing being mentioned that WB was killing off
21:26:26<LeanTo>I think there might be interest in using the data to resurrect it, I think there's someone who is doing up a new messageboard for the exodus
21:27:41<LeanTo>Are they killing off some Cartoon Network message board I don't know about as well?
21:28:14<joepie91|m>I haven't been following it too closely, but the rumours are that they are just killing off Cartoon Network in its entirety
21:29:00<LeanTo>Ah, the thing I read regarding that was that they were folding some of the other animation studios in with it or something. Not sure.
21:29:39<joepie91|m>it being corporate shenanigans, those may evaluate to the same thing in the end :)
21:30:51<LeanTo>Haha yeah
21:31:20<LeanTo>ah here's the forum post if it's helpful at all: https://forums.tcm.com/topic/271979-tcm-message-boards-sunsetting-in-november-2022/
21:31:25<LeanTo>"sunsetting"
21:31:42<LeanTo>as far as I know when the sun sets, it usually comes back up in a couple hours
21:38:37BlueMaxima joins
21:50:18march_happy (march_happy) joins
21:52:11<@JAA>Please add it to Deathwatch so we won't forget.
21:52:56<@JAA>rewby: Yeah, I'll try one or two of the bigger blogs manually first to see whether this approach is workable.
21:54:45<LeanTo>Ok, I'll do that :)
21:59:06<@rewby>JAA: Sounds good. Let me know how that goes.
22:01:38qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
22:26:04<h2ibot>LeanTo edited Deathwatch (+212, Added Turner Classic Movies Message Boards to…): https://wiki.archiveteam.org/?diff=49119&oldid=49113
22:33:06jacobk joins
22:42:26<@arkiver>ah communication about tweakblogs here as well
22:44:43<@arkiver>rewby: do we have a list of tweakblogs? where are we on this?
22:45:40<@JAA>arkiver: Yes, and the current plan is to run them through ArchiveBot.
22:45:47<@arkiver>alright good
22:45:53<@arkiver>well I sent tweakers an email anyway :P
22:45:59<@arkiver>guess we'll see if anything is missing
22:46:08<@arkiver>less than 1000 blogs?
22:46:15<@JAA>Yeah, with under 10k posts.
22:46:41<LeanTo>Dunno if you could tell me off the bat, but any idea if the TCM message boards would be a warrior based project?
22:46:56<LeanTo>Might be possible to recruit some people from there to participate in getting the data
22:47:33<LeanTo>while it's still up and able to be posted to
22:48:25<@JAA>I can't even access them. HTTP 403 'The Amazon CloudFront distribution is configured to block access from your country.' lol
22:48:37<@arkiver>hah same here
22:48:42<@arkiver>LeanTo: ^
22:48:54<@JAA>Works from Canada.
22:48:57<LeanTo>Yeah I think they're actually blocked for some countries
22:49:00<joepie91|m>same, also cannot access
22:49:04<joepie91|m>I assume it's a GDPR block
22:49:40<@JAA>I'm not in the EU though.
22:49:49<LeanTo>No idea, I just saw someone mention it in a post
22:50:13<@JAA>Anyway, standard Invision forum. Could possibly be done with ArchiveBot, although it might be too big for that without aggressive ignores.
22:50:34<@JAA>If they don't have annoying rate limits, I can do it with qwarc, too.
22:50:49<LeanTo>Right on
22:51:52<@JAA>I'm pretty sure I already have code for archiving Invision with qwarc. But that'd be barebones without images etc.
22:52:15<@JAA>272k topic IDs is easy enough to cover in 2.5 months anyway.
22:52:31<@arkiver>JAA: any outlinks/stuff on other domains can go to #//
22:52:38<@JAA>Yeah
22:52:58<LeanTo>I'll probly reach out on there and let them know, and see if I can get in contact with the guy who is trying to implement a new board. Might be of use
22:53:23<LeanTo>might even be able to see the content from it from all over the world for once XD
22:55:23<@JAA>How many posts are there in total? Can't find it with some simple searches in curl|less and can't access it through a browser right now.
22:56:51<LeanTo>just looking at the numbers looks like probly over 1.5 million posts
22:58:07<LeanTo>going back to 2002 on the general discussion forum
22:59:51<@JAA>Ah yeah, just saw that the post ('comment') IDs go to 2.7M, so that sounds about right I guess.
23:03:10<LeanTo>actually surprised they have comments that far back given they did a redesign, probly a few redesigns since then
23:07:11<LeanTo>Targeting some forums seems like it would be more beneficial than others; the off-topic section is mostly just political babble from the last couple years. But like 600k posts.
23:07:40<LeanTo>440k rather
23:14:45<@JAA>Nah... https://transfer.archivete.am/inline/bG4mu/aatt.png
23:16:14<LeanTo>haha, as you wish :)
23:18:55<@JAA>It's easier to just iterate over all topic IDs anyway than recurse through the forum topic lists etc.
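Iterating topic IDs as JAA suggests could be sketched as below; the assumption that a dummy slug redirects to the canonical topic URL is common Invision behavior but worth verifying on this particular forum, and the ID ceiling comes from the 272k figure mentioned above.

```python
# Sketch of iterating Invision topic IDs instead of recursing the
# forum listings. Invision topic URLs are /topic/<id>-<slug>/; the
# dummy "-x" slug relies on the (assumed, verify first) Invision
# behavior of redirecting to the canonical slug. /page/N/ is the
# standard Invision pagination suffix.

BASE = "https://forums.tcm.com"
MAX_TOPIC_ID = 272_000  # per the topic ID range mentioned above

def topic_url(topic_id, page=1):
    """Build a candidate URL for one page of a topic."""
    url = f"{BASE}/topic/{topic_id}-x/"
    if page > 1:
        url += f"page/{page}/"
    return url

def all_topic_urls(max_id=MAX_TOPIC_ID):
    """Yield first-page URLs for every possible topic ID."""
    for topic_id in range(1, max_id + 1):
        yield topic_url(topic_id)
```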
23:26:40<LeanTo>Gotcha, I don't really know the ins and outs of it all, just figured it might mean more space savings and quality over quantity etc
23:29:01<LeanTo>In either case, I appreciate your work. I participated in the imdb warrior project some time ago and still use the boards that came out of it to look up movie info/suggestions
23:29:03<@JAA>Yeah, it's a valid concern when there isn't enough time to save everything, but that shouldn't be a problem here. Space is hardly ever relevant for text data since it's tiny compared to images/videos and compresses very well.
23:29:07<LeanTo>wouldn't be possible without you guys
23:29:50<@JAA>Hah, the IMDb boards were my first project here. It's been a while... :-)
23:30:15<LeanTo>cool :)
23:31:00<LeanTo>I'm sure there's some nice images in there, but the text is probly more valuable anyways so whatevs. People can just look up a movie and find nice images if they want elsewhere :P
23:55:59BlueMaxima quits [Read error: Connection reset by peer]