| 00:01:02 | | qwertyasdfuiopghjkl joins |
| 00:13:46 | | Matthww quits [Client Quit] |
| 00:13:59 | | ircuser (scowlee) joins |
| 00:14:48 | | Matthww joins |
| 00:16:02 | | ircuser is now known as scowlee |
| 00:17:49 | | Matthww quits [Client Quit] |
| 00:18:09 | <scowlee> | duolingo is killing their forums at the end of march, public announcement to come but a ton of user-created language guides and resources will be gone |
| 00:18:54 | | Matthww joins |
| 00:25:58 | | Mateon1 quits [Remote host closed the connection] |
| 00:26:11 | | Mateon1 joins |
| 00:35:39 | | lennier1 quits [Client Quit] |
| 00:38:17 | | lennier1 (lennier1) joins |
| 00:44:12 | <@OrIdow6> | We should rename ArchiveTeam to ForumTeam if this keeps up |
| 00:45:17 | <@OrIdow6> | Anything public so far scowlee or is this inside info? And if the latter when can we expect a public announcement? |
| 00:51:43 | <jamesp> | OrIdow6: We should remain Archive Team, but Forum Team should specialize in forums. |
| 00:52:28 | <jamesp> | On the Wiki home, it doesn't mention Fandom. |
| 00:55:43 | <@OrIdow6> | jamesp: FWIW there has been a never-implemented idea to turn #msgbored into something like that, hence its topic |
| 01:00:02 | | dm4v quits [Client Quit] |
| 01:00:05 | | chrismeller (chrismeller) joins |
| 01:03:40 | | dm4v joins |
| 01:03:42 | | dm4v is now authenticated as dm4v |
| 01:03:42 | | dm4v quits [Changing host] |
| 01:03:42 | | dm4v (dm4v) joins |
| 01:55:46 | <@OrIdow6> | Beware, Duolingo sometimes returns inaccurate 200s (with body "500 Internal Server Error"), I suspect there's other status code weirdness too |
| 01:55:52 | <@OrIdow6> | *Duolingo forums API |
| 01:56:22 | <@OrIdow6> | Doing a quick estimate - does look like there are a few 10m posts |
| 02:03:35 | | dm4v_ joins |
| 02:03:51 | | dm4v quits [Ping timeout: 265 seconds] |
| 02:03:51 | | dm4v_ is now known as dm4v |
| 02:03:52 | | dm4v is now authenticated as dm4v |
| 02:03:52 | | dm4v quits [Changing host] |
| 02:03:52 | | dm4v (dm4v) joins |
| 02:04:40 | <jamesp> | !sa https://youtu.be/lqYTX7parRw |
| 02:36:22 | | sonick quits [Client Quit] |
| 02:37:02 | | LegitSi joins |
| 02:48:29 | | DogsRNice (Webuser299) joins |
| 03:04:54 | | onetruth joins |
| 03:28:12 | | onetruth quits [Read error: Connection reset by peer] |
| 03:42:51 | | march_happy (march_happy) joins |
| 03:58:55 | | onetruth joins |
| 04:17:55 | | DogsRNice quits [Read error: Connection reset by peer] |
| 05:06:27 | | march_happy quits [Remote host closed the connection] |
| 05:09:27 | | HP_Archivist quits [Ping timeout: 265 seconds] |
| 05:13:15 | | Jens quits [Quit: Jens] |
| 05:14:02 | | JensRex (JensRex) joins |
| 05:14:41 | | march_happy (march_happy) joins |
| 05:29:16 | | Eighty quits [Ping timeout: 265 seconds] |
| 05:36:40 | | Eighty joins |
| 05:36:40 | | Eighty is now authenticated as Eighty |
| 05:36:40 | | Eighty quits [Changing host] |
| 05:36:40 | | Eighty (Eighty) joins |
| 06:25:41 | <@JAA> | The MapleTip Forums have vanished in the past few hours. The AB job did not manage to grab everything in time, but it looks like a good majority of the content was covered. |
| 07:39:51 | | fiftysix_k_modem (fiftysix_k_modem) joins |
| 07:42:46 | | fiftysix_k_modem leaves |
| 09:35:26 | | march_happy quits [Ping timeout: 240 seconds] |
| 09:36:05 | | Eighty quits [Ping timeout: 252 seconds] |
| 09:52:50 | | Eighty (Eighty) joins |
| 10:26:23 | | DLoa joins |
| 10:29:29 | | knecht420 quits [Client Quit] |
| 10:30:37 | | knecht420 (knecht420) joins |
| 10:39:58 | | BlueMaxima quits [Client Quit] |
| 10:43:25 | | march_happy (march_happy) joins |
| 12:19:58 | | sonick (sonick) joins |
| 13:13:53 | | Arcorann quits [Ping timeout: 252 seconds] |
| 13:56:13 | | march_happy quits [Remote host closed the connection] |
| 13:56:55 | | Iki1 joins |
| 13:58:17 | | march_happy (march_happy) joins |
| 13:59:21 | | Iki quits [Ping timeout: 252 seconds] |
| 14:15:23 | <scowlee> | OrIdow6: i think it should be announced within a week or so |
| 14:23:42 | | DLoa quits [Remote host closed the connection] |
| 15:29:40 | <daxxy> | JAA, looked into using tapatalk for getting missing / machine-readable content from the technologyguide forums some more, turns out we *can* use unauthenticated GET requests for everything |
| 15:30:01 | | HP_Archivist (HP_Archivist) joins |
| 15:31:17 | <daxxy> | unless there's stuff only visible when logged in? all I've seen that requires login is attachment data (metadata/thumbnails are open) |
| 15:42:14 | <daxxy> | I'd be about ready to write a script for myself, is there any interest in running this as an "ArchiveTeam crawl", even though the HTML (sans broken pages) is already done? (I probably couldn't grab everything, nor have it end up in WBM) |
| 15:49:25 | <h2ibot> | Arkiver uploaded File:Pinger-logo.png: https://wiki.archiveteam.org/?title=File%3APinger-logo.png |
| 15:50:27 | | lennier1 quits [Ping timeout: 252 seconds] |
| 15:52:33 | | lennier1 (lennier1) joins |
| 15:56:05 | | lennier2 joins |
| 15:58:46 | | lennier1 quits [Ping timeout: 240 seconds] |
| 16:02:46 | | lennier2 quits [Ping timeout: 240 seconds] |
| 16:09:47 | | lennier2 joins |
| 16:09:49 | | lennier2 is now known as lennier1 |
| 16:18:23 | | march_happy quits [Ping timeout: 265 seconds] |
| 16:20:44 | <@arkiver> | rewby: can we get a target for pinger.pl ? |
| 16:20:51 | <@arkiver> | it would be archiveteam_pinger |
| 16:20:52 | <@arkiver> | pinger_ |
| 16:20:58 | <@arkiver> | Archive Team pinger: |
| 16:36:39 | <@rewby> | arkiver: Sure. What kind of file size you thinking and what kind of rate? + Is there a channel for this? |
| 16:37:50 | | Mateon1 quits [Remote host closed the connection] |
| 16:39:02 | | Mateon1 joins |
| 16:40:37 | | chrismeller quits [Ping timeout: 265 seconds] |
| 16:42:00 | | IDK quits [Client Quit] |
| 16:48:38 | | godane1 joins |
| 16:49:33 | <@arkiver> | rewby: i think it will not be large at all |
| 16:49:44 | <@arkiver> | no channel at the moment, but we can think of one |
| 16:50:01 | <@rewby> | I set the target in the project |
| 16:50:11 | <@rewby> | *tracker |
| 16:50:20 | <@arkiver> | yep, already pushed the first items into it |
| 16:51:06 | | godane2 quits [Ping timeout: 240 seconds] |
| 17:04:50 | | Megame (Megame) joins |
| 17:29:32 | <ThreeHM> | No docker image yet for pinger? |
| 17:31:23 | <@rewby> | ThreeHM: I'll go make one |
| 17:31:58 | <ThreeHM> | Thanks! |
| 17:32:14 | | nostalgebraist joins |
| 17:33:18 | <@rewby> | It's building, give it a few minutes and it'll be at the usual address |
| 17:34:19 | <@rewby> | ThreeHM: Build done |
| 17:47:45 | | nostalgebraist quits [Client Quit] |
| 17:51:23 | <Craigle> | Pinger just started returning a ton of 400's |
| 17:51:44 | <Craigle> | arkiver ^ |
| 17:51:59 | <@arkiver> | 403? |
| 17:52:18 | <Craigle> | Some, but I was seeing a wall of 400's with a few 403's and 200's |
| 17:52:21 | <@arkiver> | looks fine to me |
| 17:52:24 | <@arkiver> | hmm |
| 17:52:40 | <@arkiver> | the site is pretty unstable yeah :/ |
| 17:52:50 | <Craigle> | Just picked back up |
| 17:53:01 | <Craigle> | Yeah, that was my thought |
| 17:53:25 | <@arkiver> | lets hope it stays online a little after the 31st |
| 17:53:31 | <@arkiver> | will see about contacting then |
| 17:53:34 | <@arkiver> | them* |
| 17:54:30 | <@arkiver> | anyone have ideas for pinger channel name? |
| 17:54:43 | <Sanqui> | pingas |
| 17:55:13 | <Sanqui> | sorry, I thought at first it would be a project for long term pings... idk lol |
| 17:57:49 | <@OrIdow6> | Not on Deathwatch? |
| 17:58:13 | <monika> | #pinged maybe? |
| 17:58:30 | <@OrIdow6> | #pingedout |
| 17:58:39 | <@arkiver> | lets do #pinged |
| 17:58:46 | <@arkiver> | saw yours too late OrIdow6 :P |
| 18:02:50 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 18:06:45 | | qwertyasdfuiopghjkl joins |
| 18:11:53 | <h2ibot> | OrIdow6 edited Deathwatch (+112, /* 2022 */ Add pinger.pl): https://wiki.archiveteam.org/?diff=48225&oldid=48215 |
| 18:13:58 | <@OrIdow6> | Did anything happen to forum.chip.de after the AB job got banned? Looks like they've made their change |
| 18:14:36 | <@OrIdow6> | I'm going to move its category anyhow |
| 18:16:51 | <@JAA> | I archived it fully, and it completed ten minutes before they added rules to their Buttflare configuration blocking most automated access. |
| 18:16:54 | <h2ibot> | OrIdow6 edited Deathwatch (+12, Forum.chip.de has made its changes): https://wiki.archiveteam.org/?diff=48226&oldid=48225 |
| 18:17:01 | <@OrIdow6> | Oh, good |
| 18:23:02 | | HP_Archivist quits [Remote host closed the connection] |
| 18:23:24 | | HP_Archivist (HP_Archivist) joins |
| 18:51:36 | | onetruth quits [Read error: Connection reset by peer] |
| 18:53:17 | | DLoa joins |
| 18:58:28 | <DLoa> | Hi, I joined today and my Warrior VM has been running for over 6hrs. I'd like to back up forum threads which are of interest to me on the NotebookReview forums, which are closing for good in 2 days. Is there a way to selectively apply my Warrior VM to this and contribute to NBR archiving? I believe @JAA is already working on archiving this and I'd like to |
| 18:58:29 | <DLoa> | help. Thank you |
| 19:05:25 | <@JAA> | DLoa: There is no distributed project for TechnologyGuide, so no, you can't. I have already archived (nearly) the entire four forums, only a few dozen threads missing that I will be looking into tonight. |
| 19:12:26 | | lennier1 quits [Client Quit] |
| 19:13:49 | | lennier1 (lennier1) joins |
| 19:17:22 | | godane1 quits [Client Quit] |
| 19:34:12 | | qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds] |
| 20:02:41 | <daxxy> | JAA, do you want to grab those threads yourself? I've written down my notes here https://gist.github.com/drdaxxy/b7731fb4217a56604956bcaa45641648 |
| 20:07:03 | <@JAA> | daxxy: Brilliant, thanks! Sorry for the delay, didn't have time to look into it yet. |
| 20:07:28 | | chrismeller (chrismeller) joins |
| 20:08:55 | <daxxy> | no worries :) what sorta resources / time did the HTML crawl take? |
| 20:11:09 | | hello joins |
| 20:11:22 | <@JAA> | About a day for all four forums with decent parallelism and multiple IPs. Not sure whether the IPs were actually needed or not. |
| 20:11:43 | <@JAA> | Also, yes, there are threads that require logging in. I'm not sure whether they're accessible to normal users or only mods or similar though. |
| 20:11:52 | <@JAA> | We generally only archive things that are publicly accessible. |
| 20:12:41 | <@JAA> | returnHtml=1 on get_thread renders the BBCode as HTML. |
| 20:13:06 | <@JAA> | Well, partially, anyway. [url=...] is not transformed apparently. |
| 20:14:13 | <daxxy> | neither is [quote] |
| 20:14:22 | | hello quits [Remote host closed the connection] |
| 20:15:00 | <daxxy> | nor img, so I have no idea if they actually render any BBCode or just newlines and maybe entities :v |
| 20:15:43 | <@JAA> | I'm seeing some <i> stuff as well. |
| 20:15:54 | <@JAA> | But yeah, it's weird. |
| 20:16:11 | <@JAA> | Smilies aren't translated into img tags either. |
| 20:28:52 | <daxxy> | okay, at least [b] just gets removed if returnHtml=0, see post 540654 in thread 75253 for example |
| 20:32:13 | | BlueMaxima joins |
| 20:32:52 | <@JAA> | Aw, there's a get_raw_post method, but that only works for users who can edit the post (i.e. poster/mods). |
| 20:42:15 | <daxxy> | yeah, I saw that, but now that you say it... I should talk to the mods, they seem interested in archival |
| 20:43:39 | <daxxy> | but since I figure this definitely isn't the place for crawling with a mod account - would you recommend the *-grab template for "outsiders" right now, or would I likely be better off hacking something together on my own? |
| 20:45:26 | <@JAA> | The -grab template is really only applicable to distributed projects, which is a major part of AT but not the only thing we do. I used my own tool (qwarc) for archiving the forums, but I can't recommend it to anyone as it's very much not user-friendly. |
| 20:45:47 | <@JAA> | And yeah, crawling with a mod account is not going to happen. |
| 20:45:56 | <@JAA> | (... here) |
| 20:48:38 | <@JAA> | I think I'll regrab all threads with get_thread, probably with returnHtml=0 but haven't decided yet. |
| 20:52:10 | <@JAA> | Trying to figure out where that transformation happens, but haven't quite found it. |
| 20:56:09 | <daxxy> | library/Tapatalk/Bridge.php, library/Tapatalk/BbCode/Formatter/Tapatalk.php, mobiquo/mbqClass/lib/read/MbqRdEtForumPost.php are the relevant places I've found |
| 20:56:39 | <@JAA> | Ah, push/TapatalkPush.php cleanPost, but it delegates to Tapatalk_BbCode_Formatter_Tapatalk which isn't in the plugin. |
| 20:57:16 | <daxxy> | it's in the archive, you may have only extracted the mobiquo folder |
| 20:58:01 | <@JAA> | Oh, right. I was grepping inside mobiquo, yeah. |
| 20:59:16 | <@JAA> | Wow, this code is a mess. |
| 20:59:19 | <daxxy> | hah |
| 20:59:31 | <@JAA> | Random indentation is exactly why I love Python. |
| 21:00:10 | | HP_Archivist quits [Ping timeout: 265 seconds] |
| 21:00:50 | <hexa-> | python2* |
| 21:01:09 | | @JAA slaps hexa- around a bit with a large trout |
| 21:01:26 | | hexa- slaps JAA back with python2.7 … BEST BEFORE 2Y AGO |
| 21:01:42 | <@JAA> | Great, thanks, now I have food poisoning. :-( |
| 21:01:55 | <hexa-> | I'm burnt, I do a lot of python packaging in NixOS :( |
| 21:03:24 | <@JAA> | [b] and [i] get stripped, [u] gets converted to <u>, [color] becomes a font tag, [img] should get stripped in both settings if I'm reading the code correctly. |
| 21:05:57 | <daxxy> | img stripped? where are you seeing that? |
| 21:07:38 | <@JAA> | Nevermind, it gets treated specially it seems. |
| 21:07:46 | <@JAA> | library/Tapatalk/BbCode/Formatter/Tapatalk.php is what I'm looking at. |
| 21:08:04 | <@JAA> | Specifically the getTags function. |
| 21:08:25 | | Megame quits [Client Quit] |
| 21:12:23 | | cadence quits [Ping timeout: 252 seconds] |
| 21:13:06 | <daxxy> | tbh, I don't think there's a need to analyze this properly right now -- we're not gonna get a lossless copy anyway, and clearly they only leave bbcode in that matches the parser in their app |
| 21:13:49 | <daxxy> | (the android app uses returnHtml=1, btw) |
| 21:15:10 | <@JAA> | Hmm, it would be neat if we could archive it in a way that someone could simply plug a Wayback Machine URL into the app and it all plays back correctly. But getting that to work would be quite a challenge. |
| 21:15:24 | <@JAA> | And it'd probably break anyway. |
| 21:15:43 | <daxxy> | you mean the tapatalk app? |
| 21:16:15 | <daxxy> | definitely not gonna work |
| 21:19:13 | | Mateon1 quits [Remote host closed the connection] |
| 21:19:14 | <daxxy> | for one, unless there's a way to force it into using the JSON API (doubt it, since the JSON API is newer, it ought to be preferred if client and server support it already), it POSTs to the xml-rpc interface and there's no way to make it request different URLs for different content |
| 21:20:11 | | Mateon1 joins |
| 21:23:45 | <@JAA> | Right |
| 21:26:22 | <daxxy> | writing a new (entirely client-side) webapp that reads everything from WBM (plus an externally hosted search index file, if you wanna get fancy) would work though, and not even take that much effort I think |
| 21:27:01 | | Jake quits [Quit: Leaving for a bit!] |
| 21:28:02 | <daxxy> | when you're not supporting 2 protocols in 8 codebases over 3 inheritance levels, this does not have to be complex software :P |
| 21:28:08 | <@JAA> | :-) |
| 21:29:48 | <@JAA> | It would have to be in the WBM though due to (the lack of) CORS. |
| 21:30:04 | <daxxy> | yeah, wasn't sure about that |
| 21:30:27 | <@JAA> | Anyway, that's something for the future. First step is getting the data. |
| 21:30:33 | <daxxy> | ...but then you can always just put your site into WBM, right? ^^ |
| 21:30:45 | | Jake (Jake) joins |
| 21:31:05 | <@JAA> | Also, someone here was working on a forum archive ingestion thingy a while ago. Not sure what happened to that idea. |
| 21:31:27 | <@JAA> | Yes, that's what I did with the Picosong data finder thingy. |
| 21:32:08 | | qw3rty_ joins |
| 21:32:23 | | atphoenix quits [Remote host closed the connection] |
| 21:33:38 | | AK quits [Client Quit] |
| 21:33:40 | | flashfire42 quits [Client Quit] |
| 21:34:17 | | @EggplantN quits [Quit: Ping timeout (120 seconds)] |
| 21:34:31 | <@JAA> | I'm going with returnHtml=0. As far as I can tell, it preserves a bit more data than =1 does, and the conversion should be easy enough. |
| 21:34:33 | | nyany quits [Quit: (516): and then you went into taco bell without pants...and surprisingly you weren't the only one there without pants] |
| 21:34:37 | | datechnoman1 (datechnoman) joins |
| 21:34:43 | | nyany (nyany) joins |
| 21:34:52 | <daxxy> | huh, what does it preserve that =1 doesn't? |
| 21:34:56 | | vukky quits [Remote host closed the connection] |
| 21:35:02 | <@JAA> | [b] [i] |
| 21:35:07 | | scowlee quits [Ping timeout: 252 seconds] |
| 21:35:07 | <daxxy> | hang on |
| 21:35:14 | | flashfire42 (flashfire42) joins |
| 21:35:26 | | EggplantN (EggplantN) joins |
| 21:35:26 | | @ChanServ sets mode: +o EggplantN |
| 21:35:29 | | mutantmnky quits [Ping timeout: 252 seconds] |
| 21:35:47 | | Matthww quits [Client Quit] |
| 21:35:51 | | qw3rty quits [Ping timeout: 252 seconds] |
| 21:35:56 | <daxxy> | no, =1 transforms them to HTML, but =0 strips them completely |
| 21:36:05 | | vukky (Vukky) joins |
| 21:36:13 | | HackMii quits [Ping timeout: 252 seconds] |
| 21:36:16 | <@JAA> | Huh |
| 21:36:35 | | ThreeHM quits [Ping timeout: 252 seconds] |
| 21:36:48 | | Matthww joins |
| 21:37:11 | | scowlee (scowlee) joins |
| 21:37:13 | | mutantmnky (mutantmonkey) joins |
| 21:37:19 | | datechnoman quits [Ping timeout: 252 seconds] |
| 21:37:19 | | datechnoman1 is now known as datechnoman |
| 21:37:31 | <@JAA> | Oh |
| 21:37:40 | | HackMii (hacktheplanet) joins |
| 21:38:20 | | ThreeHM (ThreeHeadedMonkey) joins |
| 21:39:48 | <daxxy> | any idea about the timeframe? if I (get the mods to) grab anything more, I'd rather do that after you've done your thing (especially with the missing posts) so my traffic won't get in your way |
| 21:41:16 | | Jake quits [Client Quit] |
| 21:42:38 | <@JAA> | Ok yeah, =1 it is I guess. |
| 21:44:33 | <@JAA> | I need to leave for a bit but will get it up and running in the next 1-2 hours. |
| 21:44:37 | | Jake (Jake) joins |
| 21:45:09 | | Jake quits [Remote host closed the connection] |
| 21:45:09 | <daxxy> | nice |
| 21:45:21 | | Jake (Jake) joins |
| 21:45:34 | | Jake quits [Remote host closed the connection] |
| 21:45:44 | | Jake (Jake) joins |
| 21:45:59 | | Jake quits [Remote host closed the connection] |
| 21:51:13 | | march_happy (march_happy) joins |
| 21:52:20 | | Jake (Jake) joins |
| 21:52:36 | | Jake quits [Remote host closed the connection] |
| 21:55:08 | | Jake (Jake) joins |
| 22:00:03 | | thelounge31 quits [Ping timeout: 252 seconds] |
| 22:35:44 | | HP_Archivist (HP_Archivist) joins |
| 22:36:50 | | chrismeller quits [Ping timeout: 265 seconds] |
| 23:20:11 | <DLoa> | DLoa: JAA This is great news. I was having trouble with a Python3 script using wkhtmlto/pdfkit to save each page of the threads of interest to pdf without running into issues after one or two pages (blocked?). I hope that it'll be possible to download to pdf for offline use. I have an account and can look into the remaining threads that could |
| 23:20:11 | <DLoa> | be saved. |
| 23:25:39 | <@JAA> | daxxy: The Tapatalk API has an ... interesting behaviour on thread redirects, e.g. thread 826132 on NotebookReview which redirects to 795536. It returns the data for the merged(?) thread and a positive total_post_num but no actual posts. |
| 23:25:58 | <daxxy> | huh |
| 23:26:29 | <daxxy> | I was wondering what it'd do with merged/moved threads but hadn't come across any, thanks |
| 23:27:17 | <@JAA> | Also, on threads that require logging in, it returns a 'Need valid topic id!' error, e.g. 247631 on NotebookReview. |
| 23:28:26 | <DLoa> | I can log in on NBR if it helps. |
| 23:35:19 | | march_happy quits [Ping timeout: 265 seconds] |
| 23:49:16 | <@JAA> | daxxy: Welp: http://forum.notebookreview.com/mobiquo/tapatalk.php?method_name=get_thread&topicId=763489&returnHtml=1&page=1&perPage=100 |
| 23:50:00 | <@JAA> | That one works fine through the website: http://forum.notebookreview.com/threads/why-arent-laptop-gpus-officially-sold.763489/ |
| 23:51:02 | <daxxy> | weird |
| 23:51:10 | <@JAA> | There are plenty more like that, it seems. Just running a little test with random IDs right now and immediately hit three like it. |
| 23:51:32 | <@JAA> | 738640 and 742521 are the other two. |
| 23:51:53 | <@JAA> | Their WAF is very, very odd. |
| 23:53:02 | <@JAA> | Haven't documented this anywhere yet, but any request containing 'temp' as a word gets blocked. Same for 'tmp' and one other I can't remember right now. And anything with 'nessus' results in a connection reset. |
| 23:54:35 | <@JAA> | But yeah, can't even get everything through the API. WTF? |
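
The Tapatalk quirks discussed above (unauthenticated GET requests to get_thread, redirected threads that report a positive total_post_num but carry no posts, and the 'Need valid topic id!' error on login-only threads) could be sketched roughly as follows. This is only an illustration, not the script anyone in the channel actually ran. The URL and its parameters are taken from the link JAA posted; the response field names `posts` and `result_text`, the helper names, and the category labels are assumptions made for the example, since the log does not show the response format.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the URL posted in the log.
BASE = "http://forum.notebookreview.com/mobiquo/tapatalk.php"


def get_thread_url(topic_id, page=1, per_page=100, return_html=1):
    """Build an unauthenticated GET URL for the get_thread method."""
    params = {
        "method_name": "get_thread",
        "topicId": topic_id,
        "returnHtml": return_html,  # 1 was the setting settled on in the log
        "page": page,
        "perPage": per_page,
    }
    return BASE + "?" + urlencode(params)


def classify_thread_response(data):
    """Sort an already-decoded response dict into rough categories.

    'total_post_num' is named in the log; 'posts' and 'result_text'
    are ASSUMED field names used only for this sketch.
    """
    if "Need valid topic id!" in str(data.get("result_text", "")):
        # Seen in the log on threads that require logging in.
        return "login_required_or_invalid"
    posts = data.get("posts") or []
    total = int(data.get("total_post_num") or 0)
    if total > 0 and not posts:
        # Behaviour described in the log for redirected/merged threads.
        return "redirect_or_empty"
    return "ok"
```

A crawler built on this would still need to treat the "redirect_or_empty" and "login_required_or_invalid" cases as candidates for a retry through the regular website, since the log shows some threads failing through the API while loading fine in a browser.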