| 00:00:13 | <retromouse> | I think I'm going to try this one: https://github.com/iipc/jwarc |
| 00:00:29 | <retromouse> | If it works, I can start using it in my projects |
| 00:02:00 | <pabs> | retromouse: I forget what it supports, but there is forum-dl https://github.com/mikwielgus/forum-dl/ |
| 00:02:13 | <pabs> | JAA: ^ |
| 00:02:31 | <pabs> | ah, already mentioned by thuban |
| 00:06:41 | <@JAA> | retromouse: Based on the example in the readme, absolutely do not use that for writing WARCs unless you know exactly what you're doing. |
| 00:08:22 | | m33katron joins |
| 00:10:30 | | m33katro1 quits [Ping timeout: 265 seconds] |
| 00:10:49 | <retromouse> | Why? Do you feel it's too low-level or what, JAA? |
| 00:11:47 | <@JAA> | retromouse: Well, the low-levelness isn't the problem per se, but it's very easy to get it wrong and write invalid WARCs with that. |
| 00:12:22 | <@JAA> | For example, HTTP headers and transfer encoding must be preserved. That's not trivial to do with most HTTP libraries. |
| 00:12:34 | <@JAA> | Preserved exactly as sent by the server, that is. |
| 00:13:55 | <retromouse> | I understand the preservation spirit, and some pieces of software are going to get it wrong. But isn't it better to still get some data than nothing at all? |
| 00:14:28 | <retromouse> | incorrect and invalid are different things |
| 00:14:52 | <thuban> | sure, but the data shouldn't claim to have accuracy that it actually lacks. |
| 00:15:29 | <@JAA> | ^ |
| 00:15:47 | <@JAA> | It being WARC implies certain things about the data, and if it doesn't have those, it shouldn't be stored as WARC. |
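To illustrate the point about preserving the server's bytes exactly, here is a minimal Python sketch of wrapping an already-captured raw HTTP response in a WARC/1.1 `response` record. The function name `build_response_record` and the sample data are hypothetical, and the sketch assumes the raw bytes were captured off the wire unmodified (the hard part with most HTTP libraries). It omits digests, request and warcinfo records, and compression, so it is not a complete writer.

```python
import uuid
from datetime import datetime, timezone


def build_response_record(target_uri: str, raw_http: bytes) -> bytes:
    """Wrap raw HTTP response bytes in a minimal WARC/1.1 'response' record.

    raw_http must be the status line, headers and body exactly as sent by
    the server, including any chunked transfer encoding. Nothing is decoded.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        # WARC/1.1: the target URI is written without angle brackets.
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http;msgtype=response",
        # Length of the block as stored, i.e. of the untouched raw bytes.
        f"Content-Length: {len(raw_http)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record ends with two CRLFs after the block.
    return head + raw_http + b"\r\n\r\n"


# Hypothetical sample: a chunked response that must stay chunked in the record.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"5\r\nhello\r\n0\r\n\r\n"
)
record = build_response_record("https://example.org/", RAW)
```

If a library had already de-chunked the body or normalised the headers, `raw_http` would no longer be what the server sent, and the record would claim an accuracy it lacks.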
| 00:15:50 | <retromouse> | I wonder, if Archive Team forked wget to improve WARC support, why were the improvements not submitted to the original wget project? |
| 00:17:42 | <@JAA> | The most important improvement (fix of writing weird brackets) was suggested upstream with a very detailed writeup over three years ago. It's a three-character diff. It was never applied. |
| 00:18:11 | <retromouse> | That is sad |
| 00:18:14 | | m33katron quits [Ping timeout: 265 seconds] |
| 00:18:15 | <TheTechRobo> | Yeah, wget doesn't maintain their warc stuff well |
| 00:18:21 | <retromouse> | I'm sorry to hear that |
| 00:18:26 | <@JAA> | I've repeatedly poked them about that over those three years as well to no avail. |
| 00:18:38 | <@JAA> | An additional factor is that wget 1 is effectively EOL. |
| 00:18:55 | | m33katron joins |
| 00:19:06 | <TheTechRobo> | So the other features (like zstd warc compression support) are definitely not happening. :/ |
| 00:19:08 | <@JAA> | It still gets maintained for critical bugs, security issues, etc., but the effort is clearly going towards wget 2. |
| 00:19:13 | <retromouse> | wget 1 is EOL, but wget 2, even if faster, is less stable than 1 |
| 00:19:28 | <retromouse> | so I end up using 1 |
| 00:19:29 | <@JAA> | Less stable and lacking numerous features, including WARC. |
| 00:23:10 | | fullpwnmedia joins |
| 00:26:37 | | retromouse quits [Ping timeout: 252 seconds] |
| 00:27:38 | | m33katro1 joins |
| 00:30:28 | | m33katron quits [Ping timeout: 252 seconds] |
| 00:40:35 | <thuban> | JAA, to return to the earlier discussion... yes, full spidering would duplicate a lot of network traffic with archiving. (of course, we already accept that when snscrape does it!) on the other hand it would probably also duplicate some logic with semantic data extraction. i think which concerns to bundle is a question of how much you want to trade off efficiency for composability with other tools/data |
| 00:41:14 | <thuban> | (and i'm inclined to favor composability, or at least the option of it--it's good to be able to pass pages detected by a forum spiderer into a general-purpose archiver to get things like page requisites for wbm playback, and it's good to be able to do semantic extraction from warcs created by other tooling for forums that are already dead) |
| 00:41:21 | | m33katron joins |
| 00:42:27 | <thuban> | this is all reminding me that i never did finish that snscrape module for livejournal. ._. this summer maybe |
| 00:42:32 | <@JAA> | thuban: snscrape actually doesn't duplicate anything, because it deals with the search or profile page scrolling, which the archival step doesn't retrieve. But you're not wrong in principle of course. |
| 00:42:53 | <@JAA> | (At least when talking about Twitter, which is by far the most commonly used module.) |
| 00:43:05 | <thuban> | yeah, that's why i was careful to say "when"--i think it does to a bit in some of the non-twitter modules |
| 00:43:10 | <thuban> | (ninja'd) |
| 00:43:13 | <thuban> | *do |
| 00:43:14 | <@JAA> | :-) |
| 00:44:05 | | m33katron quits [Read error: Connection reset by peer] |
| 00:44:46 | | m33katro1 quits [Ping timeout: 252 seconds] |
| 00:46:20 | | m33katron joins |
| 00:51:56 | | m33katro1 joins |
| 00:53:56 | | m33katron quits [Ping timeout: 252 seconds] |
| 00:56:54 | | m33katron joins |
| 00:57:25 | | m33katro1 quits [Ping timeout: 252 seconds] |
| 01:02:22 | | m33katron quits [Ping timeout: 252 seconds] |
| 01:08:19 | <Doranwen> | the snscrape module for livejournal would make a LOT of people very, very happy XD |
| 01:08:25 | <Doranwen> | but I understand being busy :) |
| 01:16:55 | | tzt_ is now known as tzt |
| 01:44:02 | | nepeat quits [Quit: ZNC - https://znc.in] |
| 01:45:09 | | nepeat (nepeat) joins |
| 01:45:57 | | nepeat quits [Remote host closed the connection] |
| 01:46:53 | | nepeat (nepeat) joins |
| 02:04:27 | | m33katron joins |
| 02:12:47 | | qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds] |
| 04:26:34 | | fullpwnmedia quits [Remote host closed the connection] |
| 04:40:23 | | pabs quits [Remote host closed the connection] |
| 04:43:57 | | sonick quits [Client Quit] |
| 04:45:43 | | pabs (pabs) joins |
| 05:15:54 | | Sluggs quits [Excess Flood] |
| 05:18:33 | | Sluggs joins |
| 06:03:27 | | nicolas17 quits [Client Quit] |
| 06:09:11 | | Island quits [Read error: Connection reset by peer] |
| 06:48:53 | | m33katro1 joins |
| 06:52:09 | | m33katron quits [Ping timeout: 265 seconds] |
| 06:55:28 | | m33katro1 quits [Ping timeout: 252 seconds] |
| 07:00:03 | | nfriedly quits [Remote host closed the connection] |
| 07:05:56 | | Arcorann (Arcorann) joins |
| 07:07:00 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 07:17:27 | | m33katron joins |
| 07:25:08 | | m33katro1 joins |
| 07:28:06 | | m33katron quits [Ping timeout: 252 seconds] |
| 07:37:13 | | lennier2 joins |
| 07:40:00 | | lennier1 quits [Ping timeout: 265 seconds] |
| 07:40:06 | | lennier2 is now known as lennier1 |
| 07:45:27 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 07:45:30 | <pabs> | Microsoft deletes SwiftKey public forums https://mastodon.social/@mcc/110209163620520535 https://news.ycombinator.com/item?id=35597152 |
| 08:15:15 | | LeGoupil joins |
| 08:33:04 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 08:34:26 | | user_ quits [Remote host closed the connection] |
| 08:34:40 | | user_ joins |
| 08:53:10 | | abirkill quits [Ping timeout: 252 seconds] |
| 08:54:11 | | abirkill (abirkill) joins |
| 08:56:19 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 09:12:37 | | nfriedly joins |
| 09:16:57 | | abirkill- (abirkill) joins |
| 09:18:45 | | abirkill quits [Client Quit] |
| 09:18:45 | | abirkill- is now known as abirkill |
| 09:18:49 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 09:34:33 | | user_ quits [Remote host closed the connection] |
| 09:39:43 | | umgr036 joins |
| 09:40:30 | | umgr036 quits [Remote host closed the connection] |
| 09:40:43 | | umgr036 joins |
| 10:32:40 | | LeGoupil quits [Client Quit] |
| 10:34:37 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 11:32:56 | | sec^nd quits [Ping timeout: 245 seconds] |
| 11:46:09 | | retromouse (retromouse) joins |
| 11:46:22 | | retromouse quits [Client Quit] |
| 12:52:11 | | retromouse (retromouse) joins |
| 14:05:01 | | Arcorann quits [Ping timeout: 252 seconds] |
| 14:05:34 | | wessel1512 quits [Ping timeout: 252 seconds] |
| 14:08:10 | | wessel1512 joins |
| 14:16:00 | | sec^nd (second) joins |
| 14:17:29 | | sec^nd quits [Remote host closed the connection] |
| 14:18:29 | | sec^nd (second) joins |
| 14:39:40 | | Barto quits [Ping timeout: 252 seconds] |
| 14:39:55 | | Barto (Barto) joins |
| 14:51:02 | <h2ibot> | Bear edited Zippyshare (+48, [[:Category:Excluded from the Wayback Machine]]): https://wiki.archiveteam.org/?diff=49671&oldid=49643 |
| 14:51:03 | <h2ibot> | Bear edited The Chive (+157, infobox): https://wiki.archiveteam.org/?diff=49672&oldid=49507 |
| 14:54:01 | | hitgrr8 joins |
| 15:29:34 | | m33katron joins |
| 15:30:32 | | Island joins |
| 15:32:50 | | m33katro1 quits [Ping timeout: 252 seconds] |
| 15:34:07 | | m33katron quits [Ping timeout: 252 seconds] |
| 15:36:05 | | m33katron joins |
| 15:36:52 | | Matthww1 quits [Ping timeout: 252 seconds] |
| 15:38:05 | | sprydagger (sprydagger) joins |
| 15:38:56 | | Matthww1 joins |
| 15:47:45 | | m33katro1 joins |
| 15:50:06 | | m33katron quits [Ping timeout: 265 seconds] |
| 15:53:43 | | sprydagger quits [Remote host closed the connection] |
| 15:57:24 | | fuzzy8021 quits [Ping timeout: 252 seconds] |
| 16:08:44 | | fuzzy8021 (fuzzy8021) joins |
| 16:15:30 | | sec^nd quits [Remote host closed the connection] |
| 16:15:57 | | sec^nd (second) joins |
| 16:17:07 | <retromouse> | I'm looking into the wget-lua repo and realised it's missing the configure script. How is it supposed to be built? |
| 16:18:40 | | m33katro1 quits [Ping timeout: 252 seconds] |
| 16:32:38 | <retromouse> | Just noticed the Docker build, never mind |
| 16:37:39 | | dumbgoy joins |
| 16:55:05 | <Ryz> | So, I was archiving https://steamcommunity.com/app/1675900/discussions/0/6980058383072286962/ a day or two ago, when Steam moderators deleting reviews for a game called Warlander was a thing, and what I discovered is that their forum posts can be deleted too (unsure if by users themselves or moderators); |
| 16:55:35 | <Ryz> | This is the current post position, as I archived it moments ago: https://web.archive.org/web/20230417161630/https://steamcommunity.com/app/1675900/discussions/0/6980058383072286962/?ctp=9#c6980058383080102757 |
| 16:55:41 | <Ryz> | This was before: https://web.archive.org/web/20230416051200/https://steamcommunity.com/app/1675900/discussions/0/6980058383072286962/?ctp=10#c6980058383080102757 |
| 16:55:47 | <Ryz> | ...There's a 27-post gap O_o; |
| 16:56:23 | <Ryz> | It was really a good damn thing I did a proactive archive there~ |
| 16:56:44 | <Ryz> | Fortunately the forum post number IDs are preserved |
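The comparison above can be sketched in Python: pull the page number (`ctp`) and the post anchor out of each Wayback Machine capture URL and check that the same post ID now sits on an earlier page. The helper name `page_and_post` is hypothetical, and defaulting to page 1 when `ctp` is absent is an assumption.

```python
from urllib.parse import parse_qs, urlsplit


def page_and_post(capture_url: str) -> tuple[int, str]:
    """Return (page number, post anchor) from a discussion capture URL."""
    parts = urlsplit(capture_url)
    # 'ctp' is the page parameter seen in the URLs above; assume page 1 if absent.
    page = int(parse_qs(parts.query).get("ctp", ["1"])[0])
    return page, parts.fragment


BEFORE = (
    "https://web.archive.org/web/20230416051200/https://steamcommunity.com"
    "/app/1675900/discussions/0/6980058383072286962/?ctp=10#c6980058383080102757"
)
AFTER = (
    "https://web.archive.org/web/20230417161630/https://steamcommunity.com"
    "/app/1675900/discussions/0/6980058383072286962/?ctp=9#c6980058383080102757"
)

before_page, before_post = page_and_post(BEFORE)
after_page, after_post = page_and_post(AFTER)
# Same post ID, lower page number: earlier posts in the thread were removed.
moved_up = before_post == after_post and after_page < before_page
```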
| 17:00:20 | <Ryz> | Again, thanks pabs for reporting this; means a lot |
| 17:01:11 | | LeGoupil joins |
| 17:05:54 | | LeGoupil quits [Client Quit] |
| 17:14:21 | <retromouse> | JAA, where can I find the diffs of whatever was submitted to fix the official wget distribution? I'm looking at wget-lua and a lot of nice work was put into it |
| 17:15:53 | <@JAA> | retromouse: Nowhere, it all happened on their IRC channel. An actual diff was never submitted because it's completely trivial. |
| 17:17:04 | <retromouse> | The thing is, given the effort put into wget-lua to support the new tags and WARC improvements, I can see it being used by anyone with fewer coding skills to crawl sites |
| 17:17:26 | <@JAA> | retromouse: They didn't want to revert their silly angle brackets change, so the suggestion was to instead move to WARC/1.1. So three chars: 1.0 → 1.1, remove leading and trailing angle brackets on WARC-Target-URI. |
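The three-character change described here can be illustrated on WARC header text. This Python sketch is not the wget patch (which was never written up as a diff, per the discussion); the function name `apply_three_char_fix` and the sample header are hypothetical. It rewrites the version line from 1.0 to 1.1 and drops the leading and trailing angle brackets on `WARC-Target-URI` only.

```python
def apply_three_char_fix(header: str) -> str:
    """Turn a WARC/1.0 header with a bracketed target URI into WARC/1.1 form."""
    out = []
    for line in header.split("\r\n"):
        if line == "WARC/1.0":
            # Character 1: bump the version.
            line = "WARC/1.1"
        elif line.lower().startswith("warc-target-uri:"):
            name, _, value = line.partition(":")
            value = value.strip()
            # Characters 2 and 3: remove the surrounding angle brackets.
            if value.startswith("<") and value.endswith(">"):
                value = value[1:-1]
            line = f"{name}: {value}"
        out.append(line)
    return "\r\n".join(out)


# Hypothetical header in the bracketed style under discussion.
before = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Record-ID: <urn:uuid:00000000-0000-4000-8000-000000000000>\r\n"
    "WARC-Target-URI: <https://example.org/>"
)
after = apply_three_char_fix(before)
```

Other fields such as `WARC-Record-ID` keep their brackets; only the target URI line is touched.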
| 17:19:24 | <retromouse> | Is there any reason why the wget-lua build has been kept on Docker and the repo has not been cleaned up/documented, JAA? |
| 17:20:21 | <@JAA> | 1) We use it exclusively in containers. 2) Changes were originally kept minimal in hopes of getting them merged back upstream. 3) Nobody had the time to do it. |
| 17:21:16 | <@JAA> | By the way, we have #archiveteam-dev for software dev discussions. |
| 17:23:29 | | retromouse quits [Client Quit] |
| 17:23:59 | | retromouse (retromouse) joins |
| 17:24:44 | | retromouse-2 (retromouse) joins |
| 17:24:49 | | retromouse quits [Client Quit] |
| 17:26:04 | | retromouse-2 is now known as retromouse |
| 17:40:41 | | jamesatjaminit_ quits [Quit: ZNC 1.8.2 - https://znc.in] |
| 17:53:21 | | thuban quits [Ping timeout: 265 seconds] |
| 17:59:42 | | thuban joins |
| 18:02:04 | | nyany quits [Quit: (516): and then you went into taco bell without pants...and surprisingly you weren't the only one there without pants] |
| 18:03:48 | | nyany (nyany) joins |
| 18:28:21 | | nicolas17 joins |
| 19:11:35 | | retromouse quits [Client Quit] |
| 19:45:23 | | Jake quits [Quit: Leaving for a bit!] |
| 19:45:44 | | Jake (Jake) joins |
| 19:50:30 | | umgr036 quits [Remote host closed the connection] |
| 19:50:42 | | umgr036 joins |
| 20:22:30 | | umgr036 quits [Ping timeout: 252 seconds] |
| 20:54:34 | | hitgrr8 quits [Client Quit] |
| 21:06:33 | | retromouse (retromouse) joins |
| 21:22:16 | | Chris5010 quits [Ping timeout: 252 seconds] |
| 21:27:34 | <@JAA> | Great engineering, much wow: https://www.ubisoft.com/forums/topic/104284/termin%C3%A9-maintenance-24-ao%C3%BBt-2021/1 |
| 21:27:50 | <@JAA> | https://www.ubisoft.com/forums/topic/104284/termin%25C3%25A9-maintenance-24-ao%25C3%25BBt-2021/1 works... |
| 21:29:42 | | Chris5010 (Chris5010) joins |
| 21:37:19 | <@JAA> | The trailing /1 appears to be an offset in the topic, but pagination actually uses a ?page=2 parameter, and it takes precedence over the offset number. |
| 21:37:42 | <@JAA> | Sometimes, the offset is omitted for the first page. Sometimes, the slash is still included. |
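The working URL above is the slug percent-encoded twice. A small Python sketch reproduces it with `urllib.parse.quote`; the helper name `topic_url` is hypothetical, and always appending the `/1` offset while adding `?page=N` for later pages is an assumption based on the observations above, not documented behaviour.

```python
from urllib.parse import quote

BASE = "https://www.ubisoft.com/forums/topic"


def topic_url(topic_id: int, slug: str, page: int = 1) -> str:
    """Build a topic URL with the slug percent-encoded twice."""
    once = quote(slug, safe="")   # 'é' becomes '%C3%A9'
    twice = quote(once, safe="")  # '%' becomes '%25', giving '%25C3%25A9'
    url = f"{BASE}/{topic_id}/{twice}/1"
    if page > 1:
        # Pagination reportedly uses ?page=N, which wins over the offset.
        url += f"?page={page}"
    return url


first_page = topic_url(104284, "terminé-maintenance-24-août-2021")
second_page = topic_url(104284, "terminé-maintenance-24-août-2021", page=2)
```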
| 21:38:32 | | michaelblob (michaelblob) joins |
| 22:19:57 | | mtji joins |
| 22:21:39 | | mtji leaves |
| 22:25:03 | | mtji joins |
| 22:27:16 | | mtji quits [Remote host closed the connection] |
| 22:28:07 | | nicolas17 quits [Remote host closed the connection] |
| 22:31:28 | <pabs> | Ryz: might be worth continuously proactively archiving Steam forum things via #// ? |
| 22:44:24 | | klg quits [Ping timeout: 252 seconds] |
| 22:44:26 | <Ryz> | pabs, all of my yes~ |
| 22:44:38 | <Ryz> | Also Steam reviews |
| 22:44:40 | | klg (klg) joins |
| 23:05:14 | <@JAA> | Ubisoft uses HTTP 429 for 'yo, back off' and HTTP 479 with zero details (not even a reason in the HTTP header) for 'yo, fuck off'. |
| 23:05:28 | <@arkiver> | 479? |
| 23:06:03 | <@JAA> | Yup |
| 23:06:08 | <@JAA> | ¯\_(ツ)_/¯ |
| 23:08:00 | <@JAA> | 'HTTP/1.1 479 \r\nServer: awselb/2.0\r\nDate: Mon, 17 Apr 2023 22:56:56 GMT\r\nContent-Length: 0\r\nConnection: keep-alive\r\n\r\n' |
| 23:08:13 | <datechnoman> | Haven't heard of that code before lol |
| 23:08:35 | <datechnoman> | Blackhole your traffic to the 479 made up code lol |
| 23:09:09 | <@JAA> | LinkedIn uses 999 for that. |
| 23:09:19 | <@JAA> | I've also seen 666 somewhere before. |
| 23:10:06 | <@JAA> | I mean, it's better than 200s with an empty body or 404 or shit like that. |
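A rough Python sketch of handling such responses: parse the status line from the raw bytes (the reason phrase may be empty, as in the 479 response quoted above) and sort codes into buckets. The bucket names and the function names are hypothetical, and reading 479 as a block follows the interpretation given in this conversation, not any specification.

```python
from http import HTTPStatus


def parse_status_line(raw: bytes) -> tuple[str, int, str]:
    """Split the first line of a raw HTTP response into version, code, reason."""
    first_line = raw.split(b"\r\n", 1)[0].decode("iso-8859-1")
    version, _, rest = first_line.partition(" ")
    code, _, reason = rest.partition(" ")
    return version, int(code), reason


def classify(code: int) -> str:
    """Bucket a status code; 429 and 479 follow the meanings observed above."""
    if code == 429:
        return "back off"
    if code == 479:
        return "blocked"
    try:
        HTTPStatus(code)  # raises ValueError for codes Python does not know
    except ValueError:
        return "non-standard"
    return "standard"


RAW = (
    b"HTTP/1.1 479 \r\nServer: awselb/2.0\r\n"
    b"Date: Mon, 17 Apr 2023 22:56:56 GMT\r\n"
    b"Content-Length: 0\r\nConnection: keep-alive\r\n\r\n"
)
version, code, reason = parse_status_line(RAW)
```

Codes like 999 or 666 land in the "non-standard" bucket, which at least makes them easy to log and review.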
| 23:11:30 | <datechnoman> | Dirty Telegram with the 200's and a gotcha :/ |
| 23:11:43 | <datechnoman> | 666 is a good one lol |
| 23:20:19 | <@arkiver> | telegram and tencent weibo are/were the worst |
| 23:20:58 | <@arkiver> | tencent weibo returning incomplete data upon high load (while not indicating so), and tencent weibo sneakily returning regular web page data as 302 response |
| 23:21:12 | <@arkiver> | and well telegram because they pretend an account doesn't exist when rate limited |
| 23:21:55 | <datechnoman> | *throws fist up at telegram* |
| 23:22:00 | <@arkiver> | actually that makes tencent weibo worse, with telegram we at least know we can never trust it when an account seemingly does not exist. with tencent weibo we could not trust anything and had to come up with stupid shit to try and determine if something is complete |
| 23:22:39 | <datechnoman> | This is true. Very inconsistent and unreliable |
| 23:22:54 | <@arkiver> | there are various telegram archiving tools out there... I wonder how many people are already burned by this without knowing. ("oh this account apparently has very few posts" - meanwhile they're actually being rate limited but they don't know and assume what they got is all there is) |
| 23:23:09 | <@arkiver> | JAA: ^ I would not be surprised if that has happened with snscrape FYI |
| 23:23:53 | | sonick (sonick) joins |
| 23:24:08 | <@JAA> | Oh, now I'm getting 478 responses, too. |
| 23:24:54 | <@JAA> | arkiver: Yeah, I bet it has. I need to implement that sometime. |
| 23:25:26 | <@JAA> | The 478 response has 'Server: nginx', not awselb. |
| 23:25:43 | <@JAA> | There are layers to this bullshit. |
| 23:26:41 | <@arkiver> | idea: a wiki page with weird status codes and where they were encountered :P |
| 23:27:28 | <@JAA> | Love it. Let's shame these people. |
| 23:27:44 | <pabs> | just put it on wikipedia :) |
| 23:30:37 | | BlueMaxima joins |
| 23:35:23 | <@JAA> | Will continue with this tomorrow. In case it wasn't obvious, I'm trying to qwarc the Ubisoft forums. |
| 23:35:56 | <@arkiver> | of course :) |
| 23:36:04 | <@JAA> | Topic pages only as usual, and I don't pay any attention to the 'lang' parameter etc. |
| 23:36:20 | <@JAA> | It's a messy forum software, but it's Ubisoft's own thing, so that was expected. |
| 23:36:32 | <@arkiver> | as long as they return somewhat proper status codes (known bad or clearly odd status codes) for bad content it should be fine right? |
| 23:37:34 | <@JAA> | Yeah, and I always do basic content checks anyway. |
| 23:37:44 | <@JAA> | So I'd notice if they start playing dirty. |
| 23:37:56 | <@JAA> | Although, if they just loginwall me, then I might not. |
| 23:38:04 | <@arkiver> | yeah |
| 23:38:09 | <@arkiver> | sucks we have to do those checks :/ |
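The kind of basic content check mentioned above might look like the following Python sketch. Everything here is an assumption for illustration: the `Fetch` type, the `looks_complete` helper, and the marker strings are hypothetical, and real markers would have to be chosen per site after looking at genuine and blocked responses.

```python
from dataclasses import dataclass


@dataclass
class Fetch:
    status: int
    body: bytes


def looks_complete(
    fetch: Fetch,
    required_markers: list[bytes],
    block_markers: list[bytes],
) -> tuple[bool, str]:
    """Decide whether a response is plausibly the real, complete page."""
    if fetch.status != 200:
        return False, f"unexpected status {fetch.status}"
    for marker in block_markers:
        # e.g. a login form or rate-limit notice served with a 200
        if marker in fetch.body:
            return False, f"block marker present: {marker!r}"
    for marker in required_markers:
        # e.g. an element that every genuine topic page contains
        if marker not in fetch.body:
            return False, f"required marker missing: {marker!r}"
    return True, "ok"


# Hypothetical markers for a forum topic page.
REQUIRED = [b'class="topic-post"', b"</html>"]
BLOCKED = [b'id="login-form"']

good = Fetch(200, b'<html><div class="topic-post">hi</div></html>')
walled = Fetch(200, b'<html><form id="login-form"></form></html>')
truncated = Fetch(200, b'<html><div class="topic-post">hi')
```

A truncated body fails the `</html>` check, and a login wall served as 200 fails the block-marker check, which covers the two silent failure modes discussed above.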