00:02:04<@JAA>Ubisoft's forums are slow and rate-limited. That should be fun...
00:11:24dvd_ joins
00:14:37dvd__ quits [Ping timeout: 252 seconds]
00:44:08klg quits [Ping timeout: 252 seconds]
01:06:14klg (klg) joins
01:11:33klg quits [Ping timeout: 265 seconds]
01:13:39lunik173 quits [Quit: Ping timeout (120 seconds)]
01:15:42lunik173 joins
01:16:39Sanqui_ joins
01:16:42Sanqui_ quits [Changing host]
01:16:42Sanqui_ (Sanqui) joins
01:16:42@ChanServ sets mode: +o Sanqui_
01:17:52@Sanqui quits [Ping timeout: 252 seconds]
01:27:04Arcorann (Arcorann) joins
01:32:59klg (klg) joins
01:44:58BlueMaxima joins
01:46:13hitgrr8 quits [Client Quit]
02:25:18qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
02:45:50Terbium quits [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
02:46:17Terbium joins
02:53:52umgr036 joins
02:54:39umgr036 quits [Remote host closed the connection]
02:54:51umgr036 joins
04:25:49<pabs>https://tech.slashdot.org/story/23/04/14/175246/valve-restricts-accounts-of-2500-users-who-marked-a-negative-game-review-useful
04:37:38BlueMaxima quits [Read error: Connection reset by peer]
05:01:15<Ryz>pabs, ...
05:01:26<Ryz>...Did my proactive web archiving on this a bit ago
05:03:22DigitalDragons quits [Ping timeout: 252 seconds]
05:11:31DigitalDragons joins
05:31:35user_ joins
05:31:36umgr036 quits [Remote host closed the connection]
05:49:28nicolas17 quits [Client Quit]
06:27:05<pabs>https://apnews.com/article/mexico-notimex-news-agency-lopez-obrador-ec777eb1858344c68b2906796b63f200
06:42:43<@OrIdow6>"Mexico’s president vows to eliminate national news agency"
07:01:43qwertyasdfuiopghjkl quits [Remote host closed the connection]
07:04:30hitgrr8 joins
07:04:41qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
07:06:48wyatt8740 quits [Ping timeout: 260 seconds]
07:06:53wyatt8750 joins
07:11:38wyatt8750 quits [Ping timeout: 265 seconds]
07:11:54wyatt8740 joins
07:30:41@Sanqui_ is now known as @Sanqui
07:33:31Island quits [Read error: Connection reset by peer]
09:25:33tbc1887_ quits [Read error: Connection reset by peer]
09:30:28retromouse (retromouse) joins
09:31:46<retromouse>I'm having a bit of trouble using dokuWikiDumper: Max retries exceeded with url: /wiki/lib/exe/ajax.php
09:32:22<retromouse>The thing is, on retrying it fails at different times with different pages; probably the server is overwhelmed even with 1 thread
09:34:45<retromouse>any way to increase retries?
09:41:09<retromouse>Just try the following command:
09:41:29<retromouse>dokuWikiDumper https://www.ff6hacking.com/wiki/doku.php?id=start --auto
10:28:53<retromouse>seems editing the hard-coded max retries did work, which calls for adding an option for it
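A minimal sketch of the kind of retry bump retromouse describes, assuming a plain requests session rather than dokuWikiDumper's actual internals; the retry budget and URL are illustrative only:

    # Illustrative only: a requests session with a larger, configurable retry budget;
    # this is not dokuWikiDumper's real code.
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    retries = Retry(total=10, backoff_factor=1,
                    status_forcelist=[429, 500, 502, 503, 504])
    session = requests.Session()
    session.mount('https://', HTTPAdapter(max_retries=retries))
    session.mount('http://', HTTPAdapter(max_retries=retries))

    # Hypothetical request against the endpoint that kept failing
    resp = session.get('https://www.ff6hacking.com/wiki/lib/exe/ajax.php', timeout=60)
    resp.raise_for_status()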
11:30:16nimaje quits [Quit: WeeChat 3.7]
11:32:19nimaje joins
11:51:00DiscantX quits [Ping timeout: 265 seconds]
11:52:07sec^nd quits [Ping timeout: 245 seconds]
11:52:32HackMii_ quits [Ping timeout: 245 seconds]
11:54:29DiscantX joins
11:54:33HackMii_ (hacktheplanet) joins
11:54:46sec^nd (second) joins
11:55:23dan_a_ quits [Quit: weboots]
11:56:59dan_a (dan_a) joins
12:10:15user_ quits [Remote host closed the connection]
12:13:38umgr036 joins
12:14:27umgr036 quits [Remote host closed the connection]
12:14:39umgr036 joins
12:40:58retromouse quits [Ping timeout: 252 seconds]
12:49:36ArcticCircleSys joins
12:50:56sec^nd quits [Remote host closed the connection]
12:51:23sec^nd (second) joins
12:53:32<ArcticCircleSys>http://otakuworld.com/ This hasn't been updated since 2014. Should I put this on Fire Drill?
13:00:58nimaje quits [Client Quit]
13:02:57nimaje joins
13:31:27nimaje quits [Client Quit]
13:32:07nimaje joins
13:34:26ArcticCircleSys quits [Ping timeout: 265 seconds]
13:53:13umgr036 quits [Remote host closed the connection]
13:53:15umgr036 joins
14:03:50Arcorann quits [Ping timeout: 252 seconds]
14:04:19user_ joins
14:04:20qwertyasdfuiopghjkl quits [Remote host closed the connection]
14:04:20umgr036 quits [Read error: Connection reset by peer]
14:06:32qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
14:35:55pie_ quits [Ping timeout: 252 seconds]
14:40:41pie_ joins
15:12:58pie_ quits [Client Quit]
15:13:12pie_ joins
15:50:01sec^nd quits [Ping timeout: 245 seconds]
16:26:43sec^nd (second) joins
16:52:19tzt quits [Ping timeout: 252 seconds]
17:11:46wyatt8750 joins
17:13:10wyatt8740 quits [Client Quit]
17:13:11pie_ quits [Client Quit]
17:13:18qwertyasdfuiopghjkl quits [Client Quit]
17:13:47pie_ joins
17:19:44qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
17:22:28retromouse (retromouse) joins
17:29:22Island joins
17:30:11retromouse-2 (retromouse) joins
17:31:25retromouse quits [Remote host closed the connection]
17:47:51qwertyasdfuiopghjkl quits [Remote host closed the connection]
17:48:00qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
18:13:48retromouse-3 (retromouse) joins
18:16:28retromouse-2 quits [Ping timeout: 252 seconds]
18:19:27BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
18:24:01retromouse-3 quits [Client Quit]
18:36:53dan_a quits [Client Quit]
18:39:53dan_a (dan_a) joins
19:14:42BearFortress joins
19:17:31lennier1 quits [Ping timeout: 252 seconds]
19:19:52lennier1 (lennier1) joins
19:33:01Craigle quits [Quit: The Lounge - https://thelounge.chat]
19:33:34Craigle (Craigle) joins
20:12:01retromouse (retromouse) joins
20:14:09RichieV joins
20:14:54<RichieV>Can I ask a question about the Geocities archive here?
20:16:37<@Sanqui>sure
20:17:13<RichieV>There are a couple of addresses I'd like to see added to it. How would I go about doing so?
20:20:20<@OrIdow6>I'm fairly sure the ArchiveTeam Geocities archive is complete in that nothing is going to be added to it
20:20:41<@OrIdow6>If you have some data of your own you should probably just upload it somewhere and try to make sure people can find it
20:21:21<@Sanqui>you can probably upload what you have to archive.org, and make a note on the wiki page
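A minimal sketch of that kind of upload using the internetarchive Python library; the item identifier, file name, and metadata are hypothetical, and it assumes archive.org credentials have been configured beforehand (e.g. with 'ia configure'):

    # Hypothetical identifier, file, and metadata; requires configured
    # archive.org credentials (e.g. set up with 'ia configure').
    from internetarchive import upload

    upload(
        'my-geocities-grab-2023',           # item identifier (hypothetical)
        files=['geocities-pages.warc.gz'],  # file(s) to upload (hypothetical)
        metadata={
            'title': 'Personal Geocities grab',
            'mediatype': 'web',
            'description': 'Pages saved from Geocities.',
        },
    )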
20:22:40RichieV leaves
20:39:56spirit quits [Quit: Leaving]
20:50:26dumbgoy joins
20:58:58hitgrr8 quits [Client Quit]
21:06:02onetruth joins
21:24:44tzt (tzt) joins
21:24:44<retromouse>Is there anything like the wiki dumpers for Discourse or bulletin-board forums?
21:36:23<pokechu22>I don't think there is - the wikis have the benefit of having Special:Export or other mechanisms to export (and corresponding mechanisms to import) but I don't think most forum software has a similar feature
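For reference, the Special:Export mechanism pokechu22 mentions can be driven over plain HTTP; a minimal sketch, with the wiki base URL and page titles as placeholders:

    # Minimal sketch: fetch current page revisions as XML via MediaWiki's Special:Export.
    # The base URL and page titles are placeholders.
    import requests

    base = 'https://wiki.example.org/index.php'
    pages = ['Main Page', 'Some other page']

    resp = requests.post(
        base,
        params={'title': 'Special:Export'},
        data={'pages': '\n'.join(pages), 'curonly': '1'},
        timeout=60,
    )
    resp.raise_for_status()
    with open('export.xml', 'wb') as f:
        f.write(resp.content)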
21:37:34<@JAA>I'm not aware of one. Someone here started working on a tool like that a while ago, but I don't recall who or how far they got.
21:37:46<@JAA>It's been on my wishlist for a good while though.
21:41:15<retromouse>I have the impression the best thing I can do is write a small crawler that lets me build an index, then feed the index into wpull to get a WARC
21:41:41<retromouse>that's not bad, it's treating them like any general website; it's just that more automated tools could be made for these
21:47:56<@JAA>Recursive crawling with forum-specific URL filters works pretty well for most forums.
21:48:47<@JAA>The software I was referring to would be to extract the actual contents into a standard format regardless of the backing forum software.
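A rough sketch of the crawl-then-archive approach described above: follow only forum-looking links (listings, topics, pagination), print every discovered URL, and hand the resulting list to wpull or another WARC-writing tool. The start URL and regex patterns are placeholders for whatever the target forum software uses:

    # Rough sketch: breadth-first spider that only follows forum-ish URLs
    # (listings, topics, pagination) and prints everything it finds.
    # START and FOLLOW are placeholders; adapt them to the target forum.
    import re
    from collections import deque
    from urllib.parse import urljoin, urldefrag

    import requests

    START = 'https://forum.example.org/'
    FOLLOW = re.compile(r'(viewforum|viewtopic)\.php|[?&]start=\d+')  # placeholder patterns

    seen, queue = {START}, deque([START])
    while queue:
        url = queue.popleft()
        print(url)  # this output becomes the URL list fed to wpull / wget-at
        try:
            html = requests.get(url, timeout=30).text
        except requests.RequestException:
            continue
        for href in re.findall(r'href="([^"]+)"', html):
            link = urldefrag(urljoin(url, href))[0]
            if link.startswith(START) and FOLLOW.search(link) and link not in seen:
                seen.add(link)
                queue.append(link)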
22:04:25dumbgoy quits [Read error: Connection reset by peer]
22:24:27user_ quits [Remote host closed the connection]
22:24:39user_ joins
22:34:48ArcticCircleSys joins
22:35:05ArcticCircleSys quits [Remote host closed the connection]
22:42:41<retromouse>I can't get wpull to work, neither the 2.x nor the 1.2.3 versions; at least under Ubuntu with Python 3.10 it's broken
22:42:41Iki1 joins
22:44:15<@JAA>retromouse: Known, 3.6 is the last supported currently.
22:45:58AnotherIki quits [Ping timeout: 252 seconds]
22:50:46<h2ibot>Themadprogramer edited Discourse (+46, /* Notable Discourses */ Added Obsidian): https://wiki.archiveteam.org/?diff=49670&oldid=49442
22:53:46m33katron joins
22:56:08<retromouse>Thanks JAA, I will try with 3.6
23:01:00<thuban>JAA: what's blocking 3.7+ support?
23:01:39<@JAA>thuban: Broken CI, and I'm actively looking into that again right now.
23:04:38<thuban>ah, that's good to hear. good luck!
23:05:46m33katron quits [Ping timeout: 252 seconds]
23:08:35Craigle8 (Craigle) joins
23:10:58pie_[bnc] joins
23:10:59BearFortress_ joins
23:12:00m33katron joins
23:12:04tzt_ (tzt) joins
23:12:12BearFortress quits [Client Quit]
23:12:13Craigle quits [Client Quit]
23:12:13pie_ quits [Client Quit]
23:12:13qwertyasdfuiopghjkl quits [Client Quit]
23:12:13<thuban>oh, btw: there were a couple of people doing forum downloaders
23:12:13<thuban>avoozl was working on https://github.com/fairuse/warceater (discussion https://hackint.logs.kiska.pw/archiveteam-bs/20230309#c338872)
23:12:13<thuban>and mikolaj|m had https://github.com/mikwielgus/forum-dl (discussion https://hackint.logs.kiska.pw/archiveteam-bs/20230308#c338789)
23:12:13automato83 quits [Remote host closed the connection]
23:12:13Hackerpcs quits [Remote host closed the connection]
23:12:13tzt quits [Remote host closed the connection]
23:12:13SketchCow quits [Remote host closed the connection]
23:12:13Craigle8 is now known as Craigle
23:12:17automato83 joins
23:12:26SketchCow joins
23:12:28Hackerpcs (Hackerpcs) joins
23:12:36qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
23:13:34<thuban>neither is finished afaict
23:16:52<mikolaj|m>I'm working intensely on mine, hoping to have a v0.1 release in a month or two
23:19:25<@JAA>Oh yeah, thanks. I remembered avoozl's, but that's for processing WARCs from existing crawls, so it avoids the 'how to retrieve all content' part of the problem.
23:23:36BlueMaxima joins
23:24:22<mikolaj|m>WARC reading is planned for v0.2
23:24:50<mikolaj|m>WARC writing - no idea
23:25:08<mikolaj|m>But I want to have it too
23:26:23m33katro1 joins
23:26:40m33katron quits [Ping timeout: 252 seconds]
23:27:09<thuban>mikolaj|m: does your tool always output JSON or is there an option to have it just spit out urls (like snscrape)?
23:27:12<@JAA>You might want to wait for pywarc (my WIP, ETA unknown) for writing anyway. warcio is ... suboptimal.
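For context, this is roughly what WARC writing with warcio looks like (the library JAA considers suboptimal); the URL and payload below are made up:

    # Minimal warcio writing example; the URL and payload are made up.
    from io import BytesIO

    from warcio.statusandheaders import StatusAndHeaders
    from warcio.warcwriter import WARCWriter

    with open('example.warc.gz', 'wb') as out:
        writer = WARCWriter(out, gzip=True)
        http_headers = StatusAndHeaders('200 OK', [('Content-Type', 'text/html')],
                                        protocol='HTTP/1.1')
        record = writer.create_warc_record(
            'https://forum.example.org/viewtopic.php?t=1',
            'response',
            payload=BytesIO(b'<html><body>post body</body></html>'),
            http_headers=http_headers,
        )
        writer.write_record(record)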
23:27:52<thuban>forum spiderer -> (url list) -> general archiving tool -> (warcs) -> forum reader seems like a good workflow
23:28:28<mikolaj|m>thuban: I'll add a switch to output URLs before I release v0.1 (atm it dumps all downloaded URLs to stdout by default)
23:28:30<@JAA>That won't work in the general case.
23:28:46<@JAA>For example, pagination and scripting require some interaction between spidering and archiving.
23:28:58<@JAA>But I definitely agree with the separation of concerns angle.
23:29:57<thuban>hm, you're right (esp wrt scripting)
23:30:53<thuban>but i'm curious why you point at pagination--ime almost all forums paginate fine in html. am i missing some, or is there something else?
23:31:04m33katro1 quits [Ping timeout: 252 seconds]
23:32:04<thuban>("in html" here meaning 'in such a way as to be functional after archival from a url list')
23:32:41<@JAA>Well, if the spiderer just descends forum listings and outputs the URLs for all topic pages, I guess that works. But it would depend on the listings being complete and including pagination info.
23:33:05<@JAA>Alternatively, the archiver would need to support limited recursion or similar.
23:36:20m33katron joins
23:38:56<thuban>i had the impression that the spiderer would indeed get all pages (either by generating them, which might only work for some forum software, or by actually following pagination links, which seems pretty universally applicable)
23:40:34<@JAA>That would mean that the spiderer would itself download most of the content, duplicating effort and introducing a race condition e.g. when more posts are made in a topic between spidering and archiving that topic.
23:41:20<@JAA>I like to think that qwarc could be useful here though. That would combine the spidering and archiving into a single step but still keep the tooling separate.
23:41:59<thuban>i forget, what's qwarc currently using for warc writing?
23:43:00<thuban>ah, looks like warcio
23:43:10<TheTechRobo>thuban: I don't think qwarc uses warcio
23:43:14<TheTechRobo>IIRC it's a custom solution
23:45:07<thuban>https://gitea.arpa.li/JustAnotherArchivist/qwarc/src/branch/master/qwarc/warc.py
23:45:52<@JAA>Yeah, I ripped out warcio a good while ago.
23:46:10<@JAA>Right, master diverged from the release.
23:46:29<thuban>i'm a bit confused, what's canonical now?
23:47:47<@JAA>The master branch was a WIP version 0.3, but then I had to urgently fix stuff in the released version, so there's a separate 0.2 branch. See also the tags.
23:48:06<@JAA>Version 0.2.6+ use the custom WARC writer.
23:48:11<thuban>gotcha.
23:48:32<retromouse>Can anyone give me a link to your wget fork, wget-at?
23:48:45<@JAA>Current master won't get released; I'll rewrite things on top of pywarc once that's ready.
23:49:03<@JAA>retromouse: It's in https://github.com/ArchiveTeam/wget-lua for historical reasons.
23:50:34<retromouse>Thanks JAA, I'm having trouble installing Python 3.6 to use wpull
23:50:52<retromouse>so I want to see if I can find another tool to create WARCs from a list of URLs
23:51:24<@JAA>pyenv!
23:51:50<@JAA>I haven't had a single 'how to install Python X.Y' problem since I started using it. It's glorious. :-)
23:54:02<retromouse>I just tried to use pyenv; the thing is I'm missing dependencies to install Python 3.6, and even after resolving most of them it's still missing stuff and pip breaks with a segmentation fault
23:54:25<retromouse>I know these are newbie problems
23:54:38<retromouse>But I'm not a Python person, even if I can read and write it
23:54:38<thuban>retromouse: what os are you on?
23:54:47<@JAA>Yeah, there are quite a few build dependencies you need to install once to get things running. The pyenv wiki has details.
23:54:48<retromouse>Ubuntu
23:55:06<@JAA>Look for 'common build problems' linked somewhere in the readme.
23:55:44<@JAA>Yet another approach would be a docker.io/library/python:3.6 container.
23:56:03m33katro1 joins
23:56:10<retromouse>Sure, virtualisation would be the easier approach
23:56:44<retromouse>I tend to favor Java for my projects because the JVM is easy to install and all dependencies can be bundled with the tools
23:57:11<retromouse>But most of the ArchiveTeam tools I've seen are Python-based
23:57:18<@JAA>Eww, Java :-)
23:58:36nicolas17 joins
23:59:07m33katron quits [Ping timeout: 252 seconds]