00:02:04<@JAA>Ubisoft's forums are slow and rate-limited. That should be fun...
00:11:24dvd_ joins
00:14:37dvd__ quits [Ping timeout: 252 seconds]
00:44:08klg quits [Ping timeout: 252 seconds]
01:06:14klg (klg) joins
01:11:33klg quits [Ping timeout: 265 seconds]
01:13:39lunik173 quits [Quit: Ping timeout (120 seconds)]
01:15:42lunik173 joins
01:16:39Sanqui_ joins
01:16:42Sanqui_ quits [Changing host]
01:16:42Sanqui_ (Sanqui) joins
01:16:42@ChanServ sets mode: +o Sanqui_
01:17:52@Sanqui quits [Ping timeout: 252 seconds]
01:27:04Arcorann (Arcorann) joins
01:32:59klg (klg) joins
01:44:58BlueMaxima joins
01:46:13hitgrr8 quits [Client Quit]
02:25:18qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
02:45:50Terbium quits [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
02:46:17Terbium joins
02:53:52umgr036 joins
02:54:39umgr036 quits [Remote host closed the connection]
02:54:51umgr036 joins
04:25:49<pabs>https://tech.slashdot.org/story/23/04/14/175246/valve-restricts-accounts-of-2500-users-who-marked-a-negative-game-review-useful
04:37:38BlueMaxima quits [Read error: Connection reset by peer]
05:01:15<Ryz>pabs, ...
05:01:26<Ryz>...Did my proactive web archiving on this a bit ago
05:03:22DigitalDragons quits [Ping timeout: 252 seconds]
05:11:31DigitalDragons joins
05:31:35user_ joins
05:31:36umgr036 quits [Remote host closed the connection]
05:49:28nicolas17 quits [Client Quit]
06:27:05<pabs>https://apnews.com/article/mexico-notimex-news-agency-lopez-obrador-ec777eb1858344c68b2906796b63f200
06:42:43<@OrIdow6>"Mexico’s president vows to eliminate national news agency"
07:01:43qwertyasdfuiopghjkl quits [Remote host closed the connection]
07:04:30hitgrr8 joins
07:04:41qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
07:06:48wyatt8740 quits [Ping timeout: 260 seconds]
07:06:53wyatt8750 joins
07:11:38wyatt8750 quits [Ping timeout: 265 seconds]
07:11:54wyatt8740 joins
07:30:41@Sanqui_ is now known as @Sanqui
07:33:31Island quits [Read error: Connection reset by peer]
09:25:33tbc1887_ quits [Read error: Connection reset by peer]
09:30:28retromouse (retromouse) joins
09:31:46<retromouse>I'm having a bit of trouble using dokuWikiDumper: Max retries exceeded with url: /wiki/lib/exe/ajax.php
09:32:22<retromouse>The thing is, on retrying it fails at different times with different pages; probably the server is overwhelmed even with 1 thread
09:34:45<retromouse>any way to increase retries?
09:41:09<retromouse>Just try the following command:
09:41:29<retromouse>dokuWikiDumper https://www.ff6hacking.com/wiki/doku.php?id=start --auto
10:28:53<retromouse>seems editing the hard-coded max retries did work, which calls for adding an option for it
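A minimal sketch of the kind of retry bump retromouse describes, assuming a plain requests session rather than dokuWikiDumper's actual internals; the retry budget and URL are illustrative only:

    # Illustrative only: a requests session with a larger, configurable retry budget;
    # this is not dokuWikiDumper's real code.
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    retries = Retry(total=10, backoff_factor=1,
                    status_forcelist=[429, 500, 502, 503, 504])
    session = requests.Session()
    session.mount('https://', HTTPAdapter(max_retries=retries))
    session.mount('http://', HTTPAdapter(max_retries=retries))

    # Hypothetical request against the endpoint that kept failing
    resp = session.get('https://www.ff6hacking.com/wiki/lib/exe/ajax.php', timeout=60)
    resp.raise_for_status()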
11:30:16nimaje quits [Quit: WeeChat 3.7]
11:32:19nimaje joins
11:51:00DiscantX quits [Ping timeout: 265 seconds]
11:52:07sec^nd quits [Ping timeout: 245 seconds]
11:52:32HackMii_ quits [Ping timeout: 245 seconds]
11:54:29DiscantX joins
11:54:33HackMii_ (hacktheplanet) joins
11:54:46sec^nd (second) joins
11:55:23dan_a_ quits [Quit: weboots]
11:56:59dan_a (dan_a) joins
12:10:15user_ quits [Remote host closed the connection]
12:13:38umgr036 joins
12:14:27umgr036 quits [Remote host closed the connection]
12:14:39umgr036 joins
12:40:58retromouse quits [Ping timeout: 252 seconds]
12:49:36ArcticCircleSys joins
12:50:56sec^nd quits [Remote host closed the connection]
12:51:23sec^nd (second) joins
12:53:32<ArcticCircleSys>http://otakuworld.com/ This hasn't been updated since 2014. Should I put this on Fire Drill?
13:00:58nimaje quits [Client Quit]
13:02:57nimaje joins
13:31:27nimaje quits [Client Quit]
13:32:07nimaje joins
13:34:26ArcticCircleSys quits [Ping timeout: 265 seconds]
13:53:13umgr036 quits [Remote host closed the connection]
13:53:15umgr036 joins
14:03:50Arcorann quits [Ping timeout: 252 seconds]
14:04:19user_ joins
14:04:20qwertyasdfuiopghjkl quits [Remote host closed the connection]
14:04:20umgr036 quits [Read error: Connection reset by peer]
14:06:32qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
14:35:55pie_ quits [Ping timeout: 252 seconds]
14:40:41pie_ joins
15:12:58pie_ quits [Client Quit]
15:13:12pie_ joins
15:50:01sec^nd quits [Ping timeout: 245 seconds]
16:26:43sec^nd (second) joins
16:52:19tzt quits [Ping timeout: 252 seconds]
17:11:46wyatt8750 joins
17:13:10wyatt8740 quits [Client Quit]
17:13:11pie_ quits [Client Quit]
17:13:18qwertyasdfuiopghjkl quits [Client Quit]
17:13:47pie_ joins
17:19:44qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
17:22:28retromouse (retromouse) joins
17:29:22Island joins
17:30:11retromouse-2 (retromouse) joins
17:31:25retromouse quits [Remote host closed the connection]
17:47:51qwertyasdfuiopghjkl quits [Remote host closed the connection]
17:48:00qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
18:13:48retromouse-3 (retromouse) joins
18:16:28retromouse-2 quits [Ping timeout: 252 seconds]
18:19:27BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
18:24:01retromouse-3 quits [Client Quit]
18:36:53dan_a quits [Client Quit]
18:39:53dan_a (dan_a) joins
19:14:42BearFortress joins
19:17:31lennier1 quits [Ping timeout: 252 seconds]
19:19:52lennier1 (lennier1) joins
19:33:01Craigle quits [Quit: The Lounge - https://thelounge.chat]
19:33:34Craigle (Craigle) joins
20:12:01retromouse (retromouse) joins
20:14:09RichieV joins
20:14:54<RichieV>Can I ask a question about the Geocities archive here?
20:16:37<@Sanqui>sure
20:17:13<RichieV>There are a couple of addresses I'd like to see added to it. How would I go about doing so?
20:20:20<@OrIdow6>I'm fairly sure the ArchiveTeam Geocities archive is complete in that nothing is going to be added to it
20:20:41<@OrIdow6>If you have some data of your own you should probably just upload it somewhere and try to make sure people can find it
20:21:21<@Sanqui>you can probably upload what you have to archive.org, and make a note on the wiki page
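A minimal sketch of that kind of upload using the internetarchive Python library; the item identifier, file name, and metadata are hypothetical, and it assumes archive.org credentials have been configured beforehand (e.g. with 'ia configure'):

    # Hypothetical identifier, file, and metadata; requires configured
    # archive.org credentials (e.g. set up with 'ia configure').
    from internetarchive import upload

    upload(
        'my-geocities-grab-2023',           # item identifier (hypothetical)
        files=['geocities-pages.warc.gz'],  # file(s) to upload (hypothetical)
        metadata={
            'title': 'Personal Geocities grab',
            'mediatype': 'web',
            'description': 'Pages saved from Geocities.',
        },
    )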
20:22:40RichieV leaves
20:39:56spirit quits [Quit: Leaving]
20:50:26dumbgoy joins
20:58:58hitgrr8 quits [Client Quit]
21:06:02onetruth joins
21:24:44tzt (tzt) joins
21:24:44<retromouse>Is there anything like the wiki dumpers for Discourse or bulletin-board forums?
21:36:23<pokechu22>I don't think there is - the wikis have the benefit of having Special:Export or other mechanisms to export (and corresponding mechanisms to import) but I don't think most forum software has a similar feature
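For reference, the Special:Export mechanism pokechu22 mentions can be driven over plain HTTP; a minimal sketch, with the wiki base URL and page titles as placeholders:

    # Minimal sketch: fetch current page revisions as XML via MediaWiki's Special:Export.
    # The base URL and page titles are placeholders.
    import requests

    base = 'https://wiki.example.org/index.php'
    pages = ['Main Page', 'Some other page']

    resp = requests.post(
        base,
        params={'title': 'Special:Export'},
        data={'pages': '\n'.join(pages), 'curonly': '1'},
        timeout=60,
    )
    resp.raise_for_status()
    with open('export.xml', 'wb') as f:
        f.write(resp.content)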
21:37:34<@JAA>I'm not aware of one. Someone here started working on a tool like that a while ago, but I don't recall who or how far they got.
21:37:46<@JAA>It's been on my wishlist for a good while though.
21:41:15<retromouse>I have the impression the best thing I can do is write a small crawler that lets me build an index, then feed the index into wpull to get a WARC
21:41:41<retromouse>that's not bad, it's treating them like any general website; it's just that more automated tools could be made for these
21:47:56<@JAA>Recursive crawling with forum-specific URL filters works pretty well for most forums.
21:48:47<@JAA>The software I was referring to would be to extract the actual contents into a standard format regardless of the backing forum software.
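A rough sketch of the crawl-then-archive approach described above: follow only forum-looking links (listings, topics, pagination), print every discovered URL, and hand the resulting list to wpull or another WARC-writing tool. The start URL and regex patterns are placeholders for whatever the target forum software uses:

    # Rough sketch: breadth-first spider that only follows forum-ish URLs
    # (listings, topics, pagination) and prints everything it finds.
    # START and FOLLOW are placeholders; adapt them to the target forum.
    import re
    from collections import deque
    from urllib.parse import urljoin, urldefrag

    import requests

    START = 'https://forum.example.org/'
    FOLLOW = re.compile(r'(viewforum|viewtopic)\.php|[?&]start=\d+')  # placeholder patterns

    seen, queue = {START}, deque([START])
    while queue:
        url = queue.popleft()
        print(url)  # this output becomes the URL list fed to wpull / wget-at
        try:
            html = requests.get(url, timeout=30).text
        except requests.RequestException:
            continue
        for href in re.findall(r'href="([^"]+)"', html):
            link = urldefrag(urljoin(url, href))[0]
            if link.startswith(START) and FOLLOW.search(link) and link not in seen:
                seen.add(link)
                queue.append(link)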
22:04:25dumbgoy quits [Read error: Connection reset by peer]
22:24:27user_ quits [Remote host closed the connection]
22:24:39user_ joins
22:34:48ArcticCircleSys joins
22:35:05ArcticCircleSys quits [Remote host closed the connection]
22:42:41<retromouse>I can't get wpull to work, neither the 2.x nor the 1.2.3 versions; at least under Ubuntu with Python 3.10 it's broken
22:42:41Iki1 joins
22:44:15<@JAA>retromouse: Known, 3.6 is the last supported currently.
22:45:58AnotherIki quits [Ping timeout: 252 seconds]
22:50:46<h2ibot>Themadprogramer edited Discourse (+46, /* Notable Discourses */ Added Obsidian): https://wiki.archiveteam.org/?diff=49670&oldid=49442
22:53:46m33katron joins
22:56:08<retromouse>Thanks JAA, I will try with 3.6
23:01:00<thuban>JAA: what's blocking 3.7+ support?
23:01:39<@JAA>thuban: Broken CI, and I'm actively looking into that again right now.
23:04:38<thuban>ah, that's good to hear. good luck!
23:05:46m33katron quits [Ping timeout: 252 seconds]
23:08:35Craigle8 (Craigle) joins
23:10:58pie_[bnc] joins
23:10:59BearFortress_ joins
23:12:00m33katron joins
23:12:04tzt_ (tzt) joins
23:12:12BearFortress quits [Client Quit]
23:12:13Craigle quits [Client Quit]
23:12:13pie_ quits [Client Quit]
23:12:13qwertyasdfuiopghjkl quits [Client Quit]
23:12:13<thuban>oh, btw: there were a couple of people doing forum downloaders
23:12:13<thuban>avoozl was working on https://github.com/fairuse/warceater (discussion https://hackint.logs.kiska.pw/archiveteam-bs/20230309#c338872)
23:12:13<thuban>and mikolaj|m had https://github.com/mikwielgus/forum-dl (discussion https://hackint.logs.kiska.pw/archiveteam-bs/20230308#c338789)
23:12:13automato83 quits [Remote host closed the connection]
23:12:13Hackerpcs quits [Remote host closed the connection]
23:12:13tzt quits [Remote host closed the connection]
23:12:13SketchCow quits [Remote host closed the connection]
23:12:13Craigle8 is now known as Craigle
23:12:17automato83 joins
23:12:26SketchCow joins
23:12:28Hackerpcs (Hackerpcs) joins
23:12:36qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
23:13:34<thuban>neither is finished afaict
23:16:52<mikolaj|m>I'm working intensely on mine, hoping to have a v0.1 release in a month or two
23:19:25<@JAA>Oh yeah, thanks. I remembered avoozl's, but that's for processing WARCs from existing crawls, so it avoids the 'how to retrieve all content' part of the problem.
23:23:36BlueMaxima joins
23:24:22<mikolaj|m>WARC reading is planned for v0.2
23:24:50<mikolaj|m>WARC writing - no idea
23:25:08<mikolaj|m>But I want to have it too
23:26:23m33katro1 joins
23:26:40m33katron quits [Ping timeout: 252 seconds]
23:27:09<thuban>mikolaj|m: does your tool always output JSON or is there an option to have it just spit out urls (like snscrape)?
23:27:12<@JAA>You might want to wait for pywarc (my WIP, ETA unknown) for writing anyway. warcio is ... suboptimal.
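For context, this is roughly what WARC writing with warcio looks like (the library JAA considers suboptimal); the URL and payload below are made up:

    # Minimal warcio writing example; the URL and payload are made up.
    from io import BytesIO

    from warcio.statusandheaders import StatusAndHeaders
    from warcio.warcwriter import WARCWriter

    with open('example.warc.gz', 'wb') as out:
        writer = WARCWriter(out, gzip=True)
        http_headers = StatusAndHeaders('200 OK', [('Content-Type', 'text/html')],
                                        protocol='HTTP/1.1')
        record = writer.create_warc_record(
            'https://forum.example.org/viewtopic.php?t=1',
            'response',
            payload=BytesIO(b'<html><body>post body</body></html>'),
            http_headers=http_headers,
        )
        writer.write_record(record)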
23:27:52<thuban>forum spiderer -> (url list) -> general archiving tool -> (warcs) -> forum reader seems like a good workflow
23:28:28<mikolaj|m>thuban: I'll add a switch to output URLs before I release v0.1 (atm it dumps all downloaded URLs to stdout by default)
23:28:30<@JAA>That won't work in the general case.
23:28:46<@JAA>For example, pagination and scripting require some interaction between spidering and archiving.
23:28:58<@JAA>But I definitely agree with the separation of concerns angle.
23:29:57<thuban>hm, you're right (esp wrt scripting)
23:30:53<thuban>but i'm curious why you point at pagination--ime almost all forums paginate fine in html. am i missing some, or is there something else?
23:31:04m33katro1 quits [Ping timeout: 252 seconds]
23:32:04<thuban>("in html" here meaning 'in such a way as to be functional after archival from a url list')
23:32:41<@JAA>Well, if the spiderer just descends forum listings and outputs the URLs for all topic pages, I guess that works. But it would depend on the listings being complete and including pagination info.
23:33:05<@JAA>Alternatively, the archiver would need to support limited recursion or similar.
23:36:20m33katron joins
23:38:56<thuban>i had the impression that the spiderer would indeed get all pages (either by generating them, which might only work for some forum software, or by actually following pagination links, which seems pretty universally applicable)
23:40:34<@JAA>That would mean that the spiderer would itself download most of the content, duplicating effort and introducing a race condition e.g. when more posts are made in a topic between spidering and archiving that topic.
23:41:20<@JAA>I like to think that qwarc could be useful here though. That would combine the spidering and archiving into a single step but still keep the tooling separate.
23:41:59<thuban>i forget, what's qwarc currently using for warc writing?
23:43:00<thuban>ah, looks like warcio
23:43:10<TheTechRobo>thuban: I don't think qwarc uses warcio
23:43:14<TheTechRobo>IIRC it's a custom solution
23:45:07<thuban>https://gitea.arpa.li/JustAnotherArchivist/qwarc/src/branch/master/qwarc/warc.py
23:45:52<@JAA>Yeah, I ripped out warcio a good while ago.
23:46:10<@JAA>Right, master diverged from the release.
23:46:29<thuban>i'm a bit confused, what's canonical now?
23:47:47<@JAA>The master branch was a WIP version 0.3, but then I had to urgently fix stuff in the released version, so there's a separate 0.2 branch. See also the tags.
23:48:06<@JAA>Version 0.2.6+ use the custom WARC writer.
23:48:11<thuban>gotcha.
23:48:32<retromouse>Can anyone give me a link to your wget fork, wget-at?
23:48:45<@JAA>Current master won't get released; I'll rewrite things on top of pywarc once that's ready.
23:49:03<@JAA>retromouse: It's in https://github.com/ArchiveTeam/wget-lua for historical reasons.
23:50:34<retromouse>Thanks JAA, I'm having trouble installing Python 3.6 to use wpull
23:50:52<retromouse>so I want to see if I can find another tool to create WARCs from a list of URLs
23:51:24<@JAA>pyenv!
23:51:50<@JAA>I haven't had a single 'how to install Python X.Y' problem since I started using it. It's glorious. :-)
23:54:02<retromouse>I just tried to use pyenv; the thing is I'm missing dependencies to install Python 3.6, and even after resolving most of them it's still missing stuff and pip breaks with a segmentation fault
23:54:25<retromouse>I know these are newbie problems
23:54:38<retromouse>But I'm not a Python person, even if I can read and write it
23:54:38<thuban>retromouse: what os are you on?
23:54:47<@JAA>Yeah, there are quite a few build dependencies you need to install once to get things running. The pyenv wiki has details.
23:54:48<retromouse>Ubuntu
23:55:06<@JAA>Look for 'common build problems' linked somewhere in the readme.
23:55:44<@JAA>Yet another approach would be a docker.io/library/python:3.6 container.
23:56:03m33katro1 joins
23:56:10<retromouse>Sure, virtualisation would be the easier approach
23:56:44<retromouse>I tend to favor Java for my projects because the JVM is easy to install and all dependencies can be bundled with the tools
23:57:11<retromouse>But most of the ArchiveTeam tools I've seen are Python-based
23:57:18<@JAA>Eww, Java :-)
23:58:36nicolas17 joins
23:59:07m33katron quits [Ping timeout: 252 seconds]