00:00:13<retromouse>I think I'm going to try this one: https://github.com/iipc/jwarc
00:00:29<retromouse>If it works I can start using it in my projects
00:02:00<pabs>retromouse: I forget what it supports, but there is forum-dl https://github.com/mikwielgus/forum-dl/
00:02:13<pabs>JAA: ^
00:02:31<pabs>ah, already mentioned by thuban
00:06:41<@JAA>retromouse: Based on the example in the readme, absolutely do not use that for writing WARCs unless you know exactly what you're doing.
00:10:49<retromouse>Why? Do you feel it's too low-level, or what, JAA?
00:11:47<@JAA>retromouse: Well, the low-levelness isn't the problem per se, but it's very easy to get it wrong and write invalid WARCs with that.
00:12:22<@JAA>For example, HTTP headers and transfer encoding must be preserved. That's not trivial to do with most HTTP libraries.
00:12:34<@JAA>Preserved exactly as sent by the server, that is.
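To illustrate what "preserved exactly as sent by the server" means in practice: the sketch below captures the response bytes verbatim over a plain socket, which is the form a WARC response record is supposed to contain. Most high-level HTTP clients normalize header casing and transparently decode chunked transfer encoding, so the original bytes are already gone by the time application code sees them. `build_request` and `fetch_raw` are hypothetical helpers for illustration, not part of jwarc or any WARC library.

```python
import socket

def build_request(host, path):
    # The exact request bytes also belong in the WARC (as a 'request' record).
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Connection: close\r\n"
        f"\r\n"
    ).encode("ascii")

def fetch_raw(host, path, port=80, timeout=30):
    """Return the server's response bytes exactly as sent: status line,
    headers with their original casing and order, and the body still in
    its original transfer encoding (e.g. chunked). This verbatim blob is
    what a WARC 'response' record expects."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

(TLS, redirects, and HTTP/2 are deliberately out of scope here; the point is only that nothing between the wire and the WARC writer may rewrite the bytes.)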
00:13:55<retromouse>I understand the preservation spirit, and some pieces of software are going to get it wrong. But isn't it better to still get some data than nothing at all?
00:14:28<retromouse>incorrect and invalid are different things
00:14:52<thuban>sure, but the data shouldn't claim to have accuracy that it actually lacks.
00:15:29<@JAA>^
00:15:47<@JAA>It being WARC implies certain things about the data, and if it doesn't have those, it shouldn't be stored as WARC.
00:15:50<retromouse>I wonder about archiveteam's wget fork to improve warc support. Why were the improvements not submitted to the original wget project?
00:17:42<@JAA>The most important improvement (fix of writing weird brackets) was suggested upstream with a very detailed writeup over three years ago. It's a three-character diff. It was never applied.
00:18:11<retromouse>That is sad
00:18:15<TheTechRobo>Yeah, wget doesn't maintain their warc stuff well
00:18:21<retromouse>I'm sorry to hear that
00:18:26<@JAA>I've repeatedly poked them about that over those three years as well to no avail.
00:18:38<@JAA>An additional factor is that wget 1 is effectively EOL.
00:19:06<TheTechRobo>So the other features (like zstd warc compression support) are definitely not happening. :/
00:19:08<@JAA>It still gets maintained for critical bugs, security issues, etc., but the effort is clearly going towards wget 2.
00:19:13<retromouse>wget 1 is EOL, but wget 2, even if faster, is less stable than 1
00:19:28<retromouse>so I end up using 1
00:19:29<@JAA>Less stable and lacking numerous features, including WARC.
00:40:35<thuban>JAA, to return to the earlier discussion... yes, full spidering would duplicate a lot of network traffic with archiving. (of course, we already accept that when snscrape does it!) on the other hand it would probably also duplicate some logic with semantic data extraction. i think which concerns to bundle is a question of how much you want to trade off efficiency for composability with other tools/data
00:41:14<thuban>(and i'm inclined to favor composability, or at least the option of it--it's good to be able to pass pages detected by a forum spiderer into a general-purpose archiver to get things like page requisites for wbm playback, and it's good to be able to do semantic extraction from warcs created by other tooling for forums that are already dead)
00:42:27<thuban>this is all reminding me that i never did finish that snscrape module for livejournal. ._. this summer maybe
00:42:32<@JAA>thuban: snscrape actually doesn't duplicate anything, because it deals with the search or profile page scrolling, which the archival step doesn't retrieve. But you're not wrong in principle of course.
00:42:53<@JAA>(At least when talking about Twitter, which is by far the most commonly used module.)
00:43:05<thuban>yeah, that's why i was careful to say "when"--i think it does to a bit in some of the non-twitter modules
00:43:10<thuban>(ninja'd)
00:43:13<thuban>*do
00:43:14<@JAA>:-)
01:08:19<Doranwen>the snscrape module for livejournal would make a LOT of people very, very happy XD
01:08:25<Doranwen>but I understand being busy :)
07:45:30<pabs>Microsoft deletes SwiftKey public forums https://mastodon.social/@mcc/110209163620520535 https://news.ycombinator.com/item?id=35597152
14:51:02<h2ibot>Bear edited Zippyshare (+48, [[:Category:Excluded from the Wayback Machine]]): https://wiki.archiveteam.org/?diff=49671&oldid=49643
14:51:03<h2ibot>Bear edited The Chive (+157, infobox): https://wiki.archiveteam.org/?diff=49672&oldid=49507
16:17:07<retromouse>I'm looking into the wget-lua repo, and I realised it's missing the configure script. How is it supposed to be built?
16:32:38<retromouse>Just realised there's a Docker build, never mind
16:55:05<Ryz>So, I was archiving https://steamcommunity.com/app/1675900/discussions/0/6980058383072286962/ a day or two ago, when Steam moderators deleting reviews for a game called Warlander was a thing, and what I discovered is that their forum posts can be deleted (unsure if by users themselves or moderators);
16:55:35<Ryz>This is the current post positioning, which I archived moments earlier: https://web.archive.org/web/20230417161630/https://steamcommunity.com/app/1675900/discussions/0/6980058383072286962/?ctp=9#c6980058383080102757
16:55:41<Ryz>This was before: https://web.archive.org/web/20230416051200/https://steamcommunity.com/app/1675900/discussions/0/6980058383072286962/?ctp=10#c6980058383080102757
16:55:47<Ryz>...There's a 27-post gap O_o;
16:56:23<Ryz>It was really a good damn thing I did a proactive archive there~
16:56:44<Ryz>Fortunately the forum post number IDs are preserved
17:00:20<Ryz>Again, thanks pabs for reporting this; means a lot
17:14:21<retromouse>JAA, where can I find the diffs of whatever was submitted to fix the official wget distribution? I'm looking at wget-lua, and there was a lot of nice work put into it
17:15:53<@JAA>retromouse: Nowhere, it all happened on their IRC channel. An actual diff was never submitted because it's completely trivial.
17:17:04<retromouse>The thing is, I can see that with the effort put into wget-lua to support the new tags and the WARC improvements, it could be used by anyone with less coding skill to crawl sites
17:17:26<@JAA>retromouse: They didn't want to revert their silly angle brackets change, so the suggestion was to instead move to WARC/1.1. So three chars: 1.0 → 1.1, remove leading and trailing angle brackets on WARC-Target-URI.
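A sketch of what that three-character change amounts to, expressed as a hypothetical post-processing function (the real fix lives inside wget's WARC writer; this only illustrates its effect on the two affected header fields):

```python
def fix_wget_warc_headers(version_line, target_uri):
    """Hypothetical illustration of the suggested fix: declare WARC/1.1
    instead of WARC/1.0, and drop the angle brackets wget 1.x writes
    around WARC-Target-URI (WARC/1.1 specifies the bare URI).
    Not the actual wget patch."""
    version_line = version_line.replace("WARC/1.0", "WARC/1.1")
    if target_uri.startswith("<") and target_uri.endswith(">"):
        target_uri = target_uri[1:-1]
    return version_line, target_uri

print(fix_wget_warc_headers("WARC/1.0", "<https://example.org/>"))
# → ('WARC/1.1', 'https://example.org/')
```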
17:19:24<retromouse>Is there any reason why the wget-lua build has been kept on Docker and the repo has not been cleaned up/documented, JAA?
17:20:21<@JAA>1) We use it exclusively in containers. 2) Changes were originally kept minimal in hopes of getting them merged back upstream. 3) Nobody had the time to do it.
17:21:16<@JAA>By the way, we have #archiveteam-dev for software dev discussions.
21:27:34<@JAA>Great engineering, much wow: https://www.ubisoft.com/forums/topic/104284/termin%C3%A9-maintenance-24-ao%C3%BBt-2021/1
21:27:50<@JAA>https://www.ubisoft.com/forums/topic/104284/termin%25C3%25A9-maintenance-24-ao%25C3%25BBt-2021/1 works...
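The working link is just the broken one with its percent-encoded path segment encoded a second time, which suggests the forum decodes the path twice. The double encoding is easy to reproduce with Python's urllib:

```python
from urllib.parse import quote

# The path segment as it should be: percent-encoded once
# ('Terminé' / 'août' in UTF-8).
once = "termin%C3%A9-maintenance-24-ao%C3%BBt-2021"

# Encoding it again turns every '%' into '%25', producing the only
# form the forum actually serves.
twice = quote(once, safe="-")
print(twice)  # termin%25C3%25A9-maintenance-24-ao%25C3%25BBt-2021
```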
21:37:19<@JAA>The trailing /1 appears to be an offset in the topic, but pagination actually uses a ?page=2 parameter, and it takes precedence over the offset number.
21:37:42<@JAA>Sometimes, the offset is omitted for the first page. Sometimes, the slash is still included.
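The observed precedence rule could be sketched like this; `effective_page` is a hypothetical helper (not Ubisoft's code), and the offset-to-page conversion assumes a fixed posts-per-page count, which is only a guess:

```python
from urllib.parse import urlparse, parse_qs

def effective_page(url, posts_per_page=20):
    """Guess which page a Ubisoft forum topic URL actually renders,
    per the behaviour observed above: an explicit ?page= parameter
    takes precedence over the trailing /<offset> path segment.
    posts_per_page is an assumed value for the offset conversion."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    if "page" in query:
        return int(query["page"][0])        # ?page= wins
    last = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    if last.isdigit():
        return int(last) // posts_per_page + 1  # offset → page
    return 1                                 # offset omitted entirely
```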
22:31:28<pabs>Ryz: might be worth continuously proactively archiving Steam forum things via #// ?
22:44:26<Ryz>pabs, all of my yes~
22:44:38<Ryz>Also Steam reviews
23:05:14<@JAA>Ubisoft uses HTTP 429 for 'yo, back off' and HTTP 479 with zero details (not even a reason in the HTTP header) for 'yo, fuck off'.
23:05:28<@arkiver>479?
23:06:03<@JAA>Yup
23:06:08<@JAA>¯\_(ツ)_/¯
23:08:00<@JAA>'HTTP/1.1 479 \r\nServer: awselb/2.0\r\nDate: Mon, 17 Apr 2023 22:56:56 GMT\r\nContent-Length: 0\r\nConnection: keep-alive\r\n\r\n'
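Parsing that raw response shows the oddity: a non-standard status code with an entirely empty reason phrase (note the space before the CRLF). A minimal sketch:

```python
def parse_status_line(raw):
    """Split the first line of a raw HTTP/1.1 response into version,
    status code, and reason phrase. The 479 response above has an
    empty reason phrase, which stricter parsers may choke on."""
    status_line = raw.split(b"\r\n", 1)[0].decode("ascii")
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason

raw = (b"HTTP/1.1 479 \r\nServer: awselb/2.0\r\n"
       b"Content-Length: 0\r\nConnection: keep-alive\r\n\r\n")
print(parse_status_line(raw))  # ('HTTP/1.1', 479, '')
```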
23:08:13<datechnoman>Haven't heard of that code before lol
23:08:35<datechnoman>Blackholing your traffic with the made-up 479 code lol
23:09:09<@JAA>LinkedIn uses 999 for that.
23:09:19<@JAA>I've also seen 666 somewhere before.
23:10:06<@JAA>I mean, it's better than 200s with an empty body or 404 or shit like that.
23:11:30<datechnoman>Dirty Telegram with the 200's and a gotcha :/
23:11:43<datechnoman>666 is a good one lol
23:20:19<@arkiver>telegram and tencent weibo are/were the worst
23:20:58<@arkiver>tencent weibo returning incomplete data under high load (while not indicating so), and sneakily returning regular web page data as a 302 response
23:21:12<@arkiver>and well telegram because they pretend an account doesn't exist when rate limited
23:21:55<datechnoman> *throws fist up at telegram*
23:22:00<@arkiver>actually, that makes tencent weibo worse: with telegram we at least know we can never trust it when an account seemingly does not exist; with tencent weibo we could not trust anything and had to come up with stupid shit to try and determine if something is complete
23:22:39<datechnoman>This is true. Very inconsistent and unreliable
23:22:54<@arkiver>there are various telegram archiving tools out there... I wonder how many people have already been burned by this without knowing. ("oh, this account apparently has very few posts" - meanwhile they're actually being rate limited, but they don't know and assume what they got is all there is)
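One defensive pattern for this failure mode, sketched with a hypothetical `check()` callback (not any real Telegram API): never trust a single "account does not exist" answer, since a rate-limited server can give the same response as a genuinely missing account.

```python
import time

def confirmed_missing(check, attempts=4, delay=60.0):
    """Only believe 'this account does not exist' after several spaced
    re-checks, since a rate limit can be indistinguishable from a
    genuinely missing account. `check` is a hypothetical callable that
    returns True when the account *appears* to be missing."""
    for attempt in range(attempts):
        if not check():
            return False          # account clearly exists
        if attempt < attempts - 1:
            time.sleep(delay)     # back off; a rate limit may lift
    return True                   # consistently missing across attempts
```

This reduces, but does not eliminate, the risk: a long-lived rate limit still fools every check, so a tool using this should log that it only *believes* the account is gone.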
23:23:09<@arkiver>JAA: ^ I would not be surprised if that has happened with snscrape FYI
23:24:08<@JAA>Oh, now I'm getting 478 responses, too.
23:24:54<@JAA>arkiver: Yeah, I bet it has. I need to implement that sometime.
23:25:26<@JAA>The 478 response has 'Server: nginx', not awselb.
23:25:43<@JAA>There are layers to this bullshit.
23:26:41<@arkiver>idea: a wiki page with weird status codes and where they were encountered :P
23:27:28<@JAA>Love it. Let's shame these people.
23:27:44<pabs>just put it on wikipedia :)
23:35:23<@JAA>Will continue with this tomorrow. In case it wasn't obvious, I'm trying to qwarc the Ubisoft forums.
23:35:56<@arkiver>of course :)
23:36:04<@JAA>Topic pages only as usual, and I don't pay any attention to the 'lang' parameter etc.
23:36:20<@JAA>It's a messy forum software, but it's Ubisoft's own thing, so that was expected.
23:36:32<@arkiver>as long as they return somewhat proper status codes (known bad or clearly odd status codes) for bad content it should be fine right?
23:37:34<@JAA>Yeah, and I always do basic content checks anyway.
23:37:44<@JAA>So I'd notice if they start playing dirty.
23:37:56<@JAA>Although, if they just loginwall me, then I might not.
23:38:04<@arkiver>yeah
23:38:09<@arkiver>sucks we have to do those checks :/