| 00:02:04 | <@JAA> | Ubisoft's forums are slow and rate-limited. That should be fun... |
| 00:11:24 | | dvd_ joins |
| 00:14:37 | | dvd__ quits [Ping timeout: 252 seconds] |
| 00:44:08 | | klg quits [Ping timeout: 252 seconds] |
| 01:06:14 | | klg (klg) joins |
| 01:11:33 | | klg quits [Ping timeout: 265 seconds] |
| 01:13:39 | | lunik173 quits [Quit: Ping timeout (120 seconds)] |
| 01:15:42 | | lunik173 joins |
| 01:16:39 | | Sanqui_ joins |
| 01:16:41 | | Sanqui_ is now authenticated as Sanqui |
| 01:16:42 | | Sanqui_ quits [Changing host] |
| 01:16:42 | | Sanqui_ (Sanqui) joins |
| 01:16:42 | | @ChanServ sets mode: +o Sanqui_ |
| 01:17:52 | | @Sanqui quits [Ping timeout: 252 seconds] |
| 01:27:04 | | Arcorann (Arcorann) joins |
| 01:32:59 | | klg (klg) joins |
| 01:44:58 | | BlueMaxima joins |
| 01:46:13 | | hitgrr8 quits [Client Quit] |
| 02:25:18 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 02:45:50 | | Terbium quits [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.] |
| 02:46:17 | | Terbium joins |
| 02:53:52 | | umgr036 joins |
| 02:54:39 | | umgr036 quits [Remote host closed the connection] |
| 02:54:51 | | umgr036 joins |
| 04:25:49 | <pabs> | https://tech.slashdot.org/story/23/04/14/175246/valve-restricts-accounts-of-2500-users-who-marked-a-negative-game-review-useful |
| 04:37:38 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 05:01:15 | <Ryz> | pabs, ... |
| 05:01:26 | <Ryz> | ...Did some proactive web archiving on this a bit ago |
| 05:03:22 | | DigitalDragons quits [Ping timeout: 252 seconds] |
| 05:11:31 | | DigitalDragons joins |
| 05:31:35 | | user_ joins |
| 05:31:36 | | umgr036 quits [Remote host closed the connection] |
| 05:49:28 | | nicolas17 quits [Client Quit] |
| 06:27:05 | <pabs> | https://apnews.com/article/mexico-notimex-news-agency-lopez-obrador-ec777eb1858344c68b2906796b63f200 |
| 06:42:43 | <@OrIdow6> | "Mexico’s president vows to eliminate national news agency" |
| 07:01:43 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 07:04:30 | | hitgrr8 joins |
| 07:04:41 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 07:06:48 | | wyatt8740 quits [Ping timeout: 260 seconds] |
| 07:06:53 | | wyatt8750 joins |
| 07:11:38 | | wyatt8750 quits [Ping timeout: 265 seconds] |
| 07:11:54 | | wyatt8740 joins |
| 07:30:41 | | @Sanqui_ is now known as @Sanqui |
| 07:33:31 | | Island quits [Read error: Connection reset by peer] |
| 09:25:33 | | tbc1887_ quits [Read error: Connection reset by peer] |
| 09:30:28 | | retromouse (retromouse) joins |
| 09:31:46 | <retromouse> | I'm having a bit of trouble using dokuWikiDumper: Max retries exceeded with url: /wiki/lib/exe/ajax.php |
| 09:32:22 | <retromouse> | The thing is, on retrying it fails at different times with different pages; probably the server is overwhelmed even with 1 thread |
| 09:34:45 | <retromouse> | any way to increase retries? |
| 09:41:09 | <retromouse> | I just tried the following command: |
| 09:41:29 | <retromouse> | dokuWikiDumper https://www.ff6hacking.com/wiki/doku.php?id=start --auto |
| 10:28:53 | <retromouse> | seems editing the hard-coded max retries did work, which calls for adding it as an option |
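The fix retromouse describes (raising a hard-coded retry limit) amounts to a configurable retry loop with backoff. A minimal sketch of that idea, assuming nothing about dokuWikiDumper's actual internals — `fetch_with_retries` and all of its parameters are hypothetical names, not the tool's real API:

```python
import time


def fetch_with_retries(fetch, url, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying up to max_retries times with exponential backoff.

    The backoff helps when the server is overwhelmed even with a single thread,
    as in the case above.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries; surface the last error
            sleep(base_delay * (2 ** attempt))  # back off before the next attempt
```

Exposing `max_retries` as a command-line option is exactly the kind of change being asked for here.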
| 11:30:16 | | nimaje quits [Quit: WeeChat 3.7] |
| 11:32:19 | | nimaje joins |
| 11:51:00 | | DiscantX quits [Ping timeout: 265 seconds] |
| 11:52:07 | | sec^nd quits [Ping timeout: 245 seconds] |
| 11:52:32 | | HackMii_ quits [Ping timeout: 245 seconds] |
| 11:54:29 | | DiscantX joins |
| 11:54:33 | | HackMii_ (hacktheplanet) joins |
| 11:54:46 | | sec^nd (second) joins |
| 11:55:23 | | dan_a_ quits [Quit: weboots] |
| 11:56:59 | | dan_a (dan_a) joins |
| 12:10:15 | | user_ quits [Remote host closed the connection] |
| 12:13:38 | | umgr036 joins |
| 12:14:27 | | umgr036 quits [Remote host closed the connection] |
| 12:14:39 | | umgr036 joins |
| 12:40:58 | | retromouse quits [Ping timeout: 252 seconds] |
| 12:49:36 | | ArcticCircleSys joins |
| 12:50:56 | | sec^nd quits [Remote host closed the connection] |
| 12:51:23 | | sec^nd (second) joins |
| 12:53:32 | <ArcticCircleSys> | http://otakuworld.com/ This hasn't been updated since 2014. Should I put this on Fire Drill? |
| 13:00:58 | | nimaje quits [Client Quit] |
| 13:02:57 | | nimaje joins |
| 13:31:27 | | nimaje quits [Client Quit] |
| 13:32:07 | | nimaje joins |
| 13:34:26 | | ArcticCircleSys quits [Ping timeout: 265 seconds] |
| 13:53:13 | | umgr036 quits [Remote host closed the connection] |
| 13:53:15 | | umgr036 joins |
| 14:03:50 | | Arcorann quits [Ping timeout: 252 seconds] |
| 14:04:19 | | user_ joins |
| 14:04:20 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 14:04:20 | | umgr036 quits [Read error: Connection reset by peer] |
| 14:06:32 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 14:35:55 | | pie_ quits [Ping timeout: 252 seconds] |
| 14:40:41 | | pie_ joins |
| 15:12:58 | | pie_ quits [Client Quit] |
| 15:13:12 | | pie_ joins |
| 15:50:01 | | sec^nd quits [Ping timeout: 245 seconds] |
| 16:26:43 | | sec^nd (second) joins |
| 16:52:19 | | tzt quits [Ping timeout: 252 seconds] |
| 17:11:46 | | wyatt8750 joins |
| 17:13:10 | | wyatt8740 quits [Client Quit] |
| 17:13:11 | | pie_ quits [Client Quit] |
| 17:13:18 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 17:13:47 | | pie_ joins |
| 17:19:44 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 17:22:28 | | retromouse (retromouse) joins |
| 17:29:22 | | Island joins |
| 17:30:11 | | retromouse-2 (retromouse) joins |
| 17:31:25 | | retromouse quits [Remote host closed the connection] |
| 17:47:51 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 17:48:00 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 18:13:48 | | retromouse-3 (retromouse) joins |
| 18:16:28 | | retromouse-2 quits [Ping timeout: 252 seconds] |
| 18:19:27 | | BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
| 18:24:01 | | retromouse-3 quits [Client Quit] |
| 18:36:53 | | dan_a quits [Client Quit] |
| 18:39:53 | | dan_a (dan_a) joins |
| 19:14:42 | | BearFortress joins |
| 19:17:31 | | lennier1 quits [Ping timeout: 252 seconds] |
| 19:19:52 | | lennier1 (lennier1) joins |
| 19:33:01 | | Craigle quits [Quit: The Lounge - https://thelounge.chat] |
| 19:33:34 | | Craigle (Craigle) joins |
| 20:12:01 | | retromouse (retromouse) joins |
| 20:14:09 | | RichieV joins |
| 20:14:54 | <RichieV> | Can I ask a question about the Geocities archive here? |
| 20:16:37 | <@Sanqui> | sure |
| 20:17:13 | <RichieV> | There are a couple of addresses I'd like to see added to it. How would I go about doing so? |
| 20:20:20 | <@OrIdow6> | I'm fairly sure the ArchiveTeam Geocities archive is complete in that nothing is going to be added to it |
| 20:20:41 | <@OrIdow6> | If you have some data of your own you should probably just upload it somewhere and try to make sure people can find it |
| 20:21:21 | <@Sanqui> | you can probably upload what you have to archive.org, and make a note on the wiki page |
| 20:22:40 | | RichieV leaves |
| 20:39:56 | | spirit quits [Quit: Leaving] |
| 20:50:26 | | dumbgoy joins |
| 20:58:58 | | hitgrr8 quits [Client Quit] |
| 21:06:02 | | onetruth joins |
| 21:24:44 | | tzt (tzt) joins |
| 21:30:11 | <retromouse> | Is there anything like the wiki dumpers for discourse or bb forums? |
| 21:36:23 | <pokechu22> | I don't think there is - the wikis have the benefit of having Special:Export or other mechanisms to export (and corresponding mechanisms to import) but I don't think most forum software has a similar feature |
| 21:37:34 | <@JAA> | I'm not aware of one. Someone here started working on software like that a while ago, but I don't recall who or how far they got. |
| 21:37:46 | <@JAA> | It's been on my wishlist for a good while though. |
| 21:41:15 | <retromouse> | I have the impression the best thing I can do is write a small crawler that allows me to build an index, then feed the index into wpull to get a WARC |
| 21:41:41 | <retromouse> | that's not bad, it's treating them like any general website; it's just that more automated tools could be made for these |
| 21:47:56 | <@JAA> | Recursive crawling with forum-specific URL filters works pretty well for most forums. |
| 21:48:47 | <@JAA> | The software I was referring to would be to extract the actual contents into a standard format regardless of the backing forum software. |
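The workflow retromouse outlines — a small crawler that builds a URL index to feed into wpull — can be sketched as a breadth-first walk with the forum-specific URL filters JAA mentions. This is illustrative only: the phpBB-style patterns and the `wanted`/`crawl` names are hypothetical, and the actual `fetch` and `link_extractor` callables are left to the caller:

```python
import re
from collections import deque
from urllib.parse import urljoin

# Forum-specific URL filters: keep topic and listing pages, skip the
# login/profile/search pages that bloat a crawl. phpBB-style paths are
# used here purely as an example.
ALLOW_PATTERNS = [re.compile(p) for p in (r"/viewtopic\.php", r"/viewforum\.php")]
DENY_PATTERNS = [re.compile(p) for p in (r"/ucp\.php", r"/search\.php", r"/memberlist\.php")]


def wanted(url):
    """Return True if the URL passes the forum-specific filters."""
    if any(p.search(url) for p in DENY_PATTERNS):
        return False
    return any(p.search(url) for p in ALLOW_PATTERNS)


def crawl(start_url, fetch, link_extractor, limit=1000):
    """Breadth-first crawl from start_url; fetch(url) returns the page and
    link_extractor(page) yields its hrefs. Returns the collected URL index."""
    seen, queue, index = {start_url}, deque([start_url]), []
    while queue and len(index) < limit:
        url = queue.popleft()
        index.append(url)
        for href in link_extractor(fetch(url)):
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen and wanted(absolute):
                seen.add(absolute)
                queue.append(absolute)
    return index
```

The resulting index could then be handed to a general archiving tool, e.g. via wpull's wget-style `--input-file` and `--warc-file` options.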
| 22:04:25 | | dumbgoy quits [Read error: Connection reset by peer] |
| 22:24:27 | | user_ quits [Remote host closed the connection] |
| 22:24:39 | | user_ joins |
| 22:34:48 | | ArcticCircleSys joins |
| 22:35:05 | | ArcticCircleSys quits [Remote host closed the connection] |
| 22:42:41 | <retromouse> | I can't get wpull to work with either the 2.x or 1.2.3 versions; at least under Ubuntu using Python 3.10 it's broken |
| 22:42:41 | | Iki1 joins |
| 22:44:15 | <@JAA> | retromouse: Known issue; Python 3.6 is the latest supported currently. |
| 22:45:58 | | AnotherIki quits [Ping timeout: 252 seconds] |
| 22:50:46 | <h2ibot> | Themadprogramer edited Discourse (+46, /* Notable Discourses */ Added Obsidian): https://wiki.archiveteam.org/?diff=49670&oldid=49442 |
| 22:53:46 | | m33katron joins |
| 22:56:08 | <retromouse> | Thanks JAA, I will try with 3.6 |
| 23:01:00 | <thuban> | JAA: what's blocking 3.7+ support? |
| 23:01:39 | <@JAA> | thuban: Broken CI, and I'm actively looking into that again right now. |
| 23:04:38 | <thuban> | ah, that's good to hear. good luck! |
| 23:05:46 | | m33katron quits [Ping timeout: 252 seconds] |
| 23:08:35 | | Craigle8 (Craigle) joins |
| 23:10:58 | | pie_[bnc] joins |
| 23:10:59 | | BearFortress_ joins |
| 23:12:00 | | m33katron joins |
| 23:12:04 | | tzt_ (tzt) joins |
| 23:12:12 | | BearFortress quits [Client Quit] |
| 23:12:13 | | Craigle quits [Client Quit] |
| 23:12:13 | | pie_ quits [Client Quit] |
| 23:12:13 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 23:12:13 | <thuban> | oh, btw: there were a couple of people doing forum downloaders |
| 23:12:13 | <thuban> | avoozl was working on https://github.com/fairuse/warceater (discussion https://hackint.logs.kiska.pw/archiveteam-bs/20230309#c338872) |
| 23:12:13 | <thuban> | and mikolaj|m had https://github.com/mikwielgus/forum-dl (discussion https://hackint.logs.kiska.pw/archiveteam-bs/20230308#c338789) |
| 23:12:13 | | automato83 quits [Remote host closed the connection] |
| 23:12:13 | | Hackerpcs quits [Remote host closed the connection] |
| 23:12:13 | | tzt quits [Remote host closed the connection] |
| 23:12:13 | | SketchCow quits [Remote host closed the connection] |
| 23:12:13 | | Craigle8 is now known as Craigle |
| 23:12:17 | | automato83 joins |
| 23:12:26 | | SketchCow joins |
| 23:12:28 | | Hackerpcs (Hackerpcs) joins |
| 23:12:36 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 23:13:34 | <thuban> | neither is finished afaict |
| 23:16:52 | <mikolaj|m> | I'm intensely working on mine, hoping to have a v0.1 release in a month or two |
| 23:19:25 | <@JAA> | Oh yeah, thanks. I remembered avoozl's, but that's for processing WARCs from existing crawls, so it avoids the 'how to retrieve all content' part of the problem. |
| 23:23:36 | | BlueMaxima joins |
| 23:24:22 | <mikolaj|m> | WARC reading is planned for v0.2 |
| 23:24:50 | <mikolaj|m> | WARC writing - no idea |
| 23:25:08 | <mikolaj|m> | But i want to have it too |
| 23:26:23 | | m33katro1 joins |
| 23:26:40 | | m33katron quits [Ping timeout: 252 seconds] |
| 23:27:09 | <thuban> | mikolaj|m: does your tool always output JSON or is there an option to have it just spit out urls (like snscrape)? |
| 23:27:12 | <@JAA> | You might want to wait for pywarc (my WIP, ETA unknown) for writing anyway. warcio is ... suboptimal. |
| 23:27:52 | <thuban> | forum spiderer -> (url list) -> general archiving tool -> (warcs) -> forum reader seems like a good workflow |
| 23:28:28 | <mikolaj|m> | thuban: I'll add a switch to output URLs before I release v0.1 (atm it dumps all downloaded URLs to stdout by default) |
| 23:28:30 | <@JAA> | That won't work in the general case. |
| 23:28:46 | <@JAA> | For example, pagination and scripting require some interaction between spidering and archiving. |
| 23:28:58 | <@JAA> | But I definitely agree with the separation of concerns angle. |
| 23:29:57 | <thuban> | hm, you're right (esp wrt scripting) |
| 23:30:53 | <thuban> | but i'm curious why you point at pagination--ime almost all forums paginate fine in html. am i missing some, or is there something else? |
| 23:31:04 | | m33katro1 quits [Ping timeout: 252 seconds] |
| 23:32:04 | <thuban> | ("in html" here meaning 'in such a way as to be functional after archival from a url list') |
| 23:32:41 | <@JAA> | Well, if the spiderer just descends forum listings and outputs the URLs for all topic pages, I guess that works. But it would depend on the listings being complete and including pagination info. |
| 23:33:05 | <@JAA> | Alternatively, the archiver would need to support limited recursion or similar. |
| 23:36:20 | | m33katron joins |
| 23:38:56 | <thuban> | i had the impression that the spiderer would indeed get all pages (either by generating them, which indeed might only work for some forum software, or by actually following pagination links, which seems pretty universally applicable) |
| 23:40:34 | <@JAA> | That would mean that the spiderer would itself download most of the content, duplicating effort and introducing a race condition e.g. when more posts are made in a topic between spidering and archiving that topic. |
| 23:41:20 | <@JAA> | I like to think that qwarc could be useful here though. That would combine the spidering and archiving into a single step but still keep the tooling separate. |
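The two pagination strategies debated above — generating page URLs up front (only works for forum software with predictable pagination) versus following "next" links (more general, but the spiderer then fetches most of the content itself, as JAA points out) — can be sketched as follows. All names here are hypothetical, and the `&start=N` offsets merely mimic phpBB-style pagination:

```python
def generate_pages(topic_url, post_count, per_page=20):
    """Generate page URLs up front, for forum software with predictable,
    offset-based pagination (illustrative phpBB-style `&start=N` offsets)."""
    return [f"{topic_url}&start={offset}" for offset in range(0, post_count, per_page)]


def follow_pagination(first_page_url, fetch, next_link):
    """Follow 'next page' links until there are none left; works for nearly
    any forum, but has to download every page along the way."""
    urls, url = [], first_page_url
    while url is not None:
        urls.append(url)
        url = next_link(fetch(url))  # next_link returns None on the last page
    return urls
```

The first strategy lets the listing pages alone drive the crawl; the second duplicates download effort and can race against new posts made between spidering and archiving, which is the trade-off discussed here.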
| 23:41:59 | <thuban> | i forget, what does qwarc currently use for warc writing? |
| 23:43:00 | <thuban> | ah, looks like warcio |
| 23:43:10 | <TheTechRobo> | thuban: I don't think qwarc uses warcio |
| 23:43:14 | <TheTechRobo> | IIRC it's a custom solution |
| 23:45:07 | <thuban> | https://gitea.arpa.li/JustAnotherArchivist/qwarc/src/branch/master/qwarc/warc.py |
| 23:45:52 | <@JAA> | Yeah, I ripped out warcio a good while ago. |
| 23:46:10 | <@JAA> | Right, master diverged from the release. |
| 23:46:29 | <thuban> | i'm a bit confused, what's canonical now? |
| 23:47:47 | <@JAA> | The master branch was a WIP version 0.3, but then I had to urgently fix stuff in the released version, so there's a separate 0.2 branch. See also the tags. |
| 23:48:06 | <@JAA> | Version 0.2.6+ use the custom WARC writer. |
| 23:48:11 | <thuban> | gotcha. |
| 23:48:32 | <retromouse> | Can anyone give me a link to your wget-at fork? |
| 23:48:45 | <@JAA> | Current master won't get released; I'll rewrite things on top of pywarc once that's ready. |
| 23:49:03 | <@JAA> | retromouse: It's in https://github.com/ArchiveTeam/wget-lua for historical reasons. |
| 23:50:34 | <retromouse> | Thanks JAA, I'm having trouble installing Python 3.6 to use wpull |
| 23:50:52 | <retromouse> | so I want to see if I can find another tool to create a WARC from a list of URLs |
| 23:51:24 | <@JAA> | pyenv! |
| 23:51:50 | <@JAA> | I haven't had a single 'how to install Python X.Y' problem since I started using it. It's glorious. :-) |
| 23:54:02 | <retromouse> | I just tried to use pyenv; the thing is I'm missing dependencies to install Python 3.6, and even after resolving most of them it's still missing stuff and pip breaks with a segmentation fault |
| 23:54:25 | <retromouse> | I know these are newbie problems |
| 23:54:38 | <retromouse> | But I'm not a Python person even if I can read it and write it |
| 23:54:38 | <thuban> | retromouse: what os are you on? |
| 23:54:47 | <@JAA> | Yeah, there are quite a few build dependencies you need to install once to get things running. The pyenv wiki has details. |
| 23:54:48 | <retromouse> | Ubuntu |
| 23:55:06 | <@JAA> | Look for 'common build problems' linked somewhere in the readme. |
| 23:55:44 | <@JAA> | Yet another approach would be a docker.io/library/python:3.6 container. |
| 23:56:03 | | m33katro1 joins |
| 23:56:10 | <retromouse> | Sure, virtualisation would be the easier approach |
| 23:56:44 | <retromouse> | I tend to favor Java for my projects because the JVM is easy to install and all dependencies can be bundled with the tools |
| 23:57:11 | <retromouse> | But most of the Archive Team tools I have seen are Python-based |
| 23:57:18 | <@JAA> | Eww, Java :-) |
| 23:58:36 | | nicolas17 joins |
| 23:59:07 | | m33katron quits [Ping timeout: 252 seconds] |