00:01:02qwertyasdfuiopghjkl joins
00:13:46Matthww quits [Client Quit]
00:13:59ircuser (scowlee) joins
00:14:48Matthww joins
00:16:02ircuser is now known as scowlee
00:17:49Matthww quits [Client Quit]
00:18:09<scowlee>duolingo is killing their forums at the end of march, public announcement to come but a ton of user-created language guides and resources will be gone
00:18:54Matthww joins
00:25:58Mateon1 quits [Remote host closed the connection]
00:26:11Mateon1 joins
00:35:39lennier1 quits [Client Quit]
00:38:17lennier1 (lennier1) joins
00:44:12<@OrIdow6>We should rename ArchiveTeam to ForumTeam if this keeps up
00:45:17<@OrIdow6>Anything public so far scowlee or is this inside info? And if the latter when can we expect a public announcement?
00:51:43<jamesp>OrIdow6: We should remain Archive Team, but Forum Team should specialize in forums.
00:52:28<jamesp>On the Wiki home, it doesn't mention Fandom.
00:55:43<@OrIdow6>jamesp: FWIW there has been a never-implemented idea to turn #msgbored into something like that, hence its topic
01:00:02dm4v quits [Client Quit]
01:00:05chrismeller (chrismeller) joins
01:03:40dm4v joins
01:03:42dm4v quits [Changing host]
01:03:42dm4v (dm4v) joins
01:55:46<@OrIdow6>Beware, Duolingo sometimes returns inaccurate 200s (with body "500 Internal Server Error"), I suspect there's other status code weirdness too
01:55:52<@OrIdow6>*Duolingo forums API
01:56:22<@OrIdow6>Doing a quick estimate - does look like there are a few 10m posts
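A minimal sketch of how a crawler might catch the "inaccurate 200" responses mentioned above, where the status code is 200 but the body is an error page. The function name and the rule of matching any leading 5xx code are assumptions for illustration; only the literal body "500 Internal Server Error" comes from the log.

```python
import re

# Matches bodies that start with a 5xx code followed by text,
# e.g. "500 Internal Server Error". Generalising beyond 500 is an assumption.
_ERROR_BODY = re.compile(r"5\d\d\s+\w")


def is_soft_error(status_code: int, body: str) -> bool:
    """Return True when a 200 response carries an error page in its body."""
    if status_code != 200:
        # Real error statuses are handled by the normal retry logic.
        return False
    return _ERROR_BODY.match(body.strip()) is not None
```

A caller would treat a True result like a failed request and retry later instead of saving the response as valid data.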
02:03:35dm4v_ joins
02:03:51dm4v quits [Ping timeout: 265 seconds]
02:03:51dm4v_ is now known as dm4v
02:03:52dm4v quits [Changing host]
02:03:52dm4v (dm4v) joins
02:04:40<jamesp>!sa https://youtu.be/lqYTX7parRw
02:36:22sonick quits [Client Quit]
02:37:02LegitSi joins
02:48:29DogsRNice (Webuser299) joins
03:04:54onetruth joins
03:28:12onetruth quits [Read error: Connection reset by peer]
03:42:51march_happy (march_happy) joins
03:58:55onetruth joins
04:17:55DogsRNice quits [Read error: Connection reset by peer]
05:06:27march_happy quits [Remote host closed the connection]
05:09:27HP_Archivist quits [Ping timeout: 265 seconds]
05:13:15Jens quits [Quit: Jens]
05:14:02JensRex (JensRex) joins
05:14:41march_happy (march_happy) joins
05:29:16Eighty quits [Ping timeout: 265 seconds]
05:36:40Eighty joins
05:36:40Eighty quits [Changing host]
05:36:40Eighty (Eighty) joins
06:25:41<@JAA>The MapleTip Forums have vanished in the past few hours. The AB job did not manage to grab everything in time, but it looks like a good majority of the content was covered.
07:39:51fiftysix_k_modem (fiftysix_k_modem) joins
07:42:46fiftysix_k_modem leaves
09:35:26march_happy quits [Ping timeout: 240 seconds]
09:36:05Eighty quits [Ping timeout: 252 seconds]
09:52:50Eighty (Eighty) joins
10:26:23DLoa joins
10:29:29knecht420 quits [Client Quit]
10:30:37knecht420 (knecht420) joins
10:39:58BlueMaxima quits [Client Quit]
10:43:25march_happy (march_happy) joins
12:19:58sonick (sonick) joins
13:13:53Arcorann quits [Ping timeout: 252 seconds]
13:56:13march_happy quits [Remote host closed the connection]
13:56:55Iki1 joins
13:58:17march_happy (march_happy) joins
13:59:21Iki quits [Ping timeout: 252 seconds]
14:15:23<scowlee>OrIdow6: i think it should be announced within a week or so
14:23:42DLoa quits [Remote host closed the connection]
15:29:40<daxxy>JAA, looked into using tapatalk for getting missing / machine-readable content from the technologyguide forums some more, turns out we *can* use unauthenticated GET requests for everything
15:30:01HP_Archivist (HP_Archivist) joins
15:31:17<daxxy>unless there's stuff only visible when logged in? all I've seen that requires login is attachment data (metadata/thumbnails are open)
15:42:14<daxxy>I'd be about ready to write a script for myself, is there any interest in running this as an "ArchiveTeam crawl", even though the HTML (sans broken pages) is already done? (I probably couldn't grab everything, nor have it end up in WBM)
15:49:25<h2ibot>Arkiver uploaded File:Pinger-logo.png: https://wiki.archiveteam.org/?title=File%3APinger-logo.png
15:50:27lennier1 quits [Ping timeout: 252 seconds]
15:52:33lennier1 (lennier1) joins
15:56:05lennier2 joins
15:58:46lennier1 quits [Ping timeout: 240 seconds]
16:02:46lennier2 quits [Ping timeout: 240 seconds]
16:09:47lennier2 joins
16:09:49lennier2 is now known as lennier1
16:18:23march_happy quits [Ping timeout: 265 seconds]
16:20:44<@arkiver>rewby: can we get a target for pinger.pl ?
16:20:51<@arkiver>it would be archiveteam_pinger
16:20:52<@arkiver>pinger_
16:20:58<@arkiver>Archive Team pinger:
16:36:39<@rewby>arkiver: Sure. What kind of file size you thinking and what kind of rate? + Is there a channel for this?
16:37:50Mateon1 quits [Remote host closed the connection]
16:39:02Mateon1 joins
16:40:37chrismeller quits [Ping timeout: 265 seconds]
16:42:00IDK quits [Client Quit]
16:48:38godane1 joins
16:49:33<@arkiver>rewby: i think it will not be large at all
16:49:44<@arkiver>no channel at the moment, but we can think of one
16:50:01<@rewby>I set the target in the project
16:50:11<@rewby>*tracker
16:50:20<@arkiver>yep, already pushed the first items into it
16:51:06godane2 quits [Ping timeout: 240 seconds]
17:04:50Megame (Megame) joins
17:29:32<ThreeHM>No docker image yet for pinger?
17:31:23<@rewby>ThreeHM: I'll go make one
17:31:58<ThreeHM>Thanks!
17:32:14nostalgebraist joins
17:33:18<@rewby>It's building, give it a few minutes and it'll be at the usual address
17:34:19<@rewby>ThreeHM: Build done
17:47:45nostalgebraist quits [Client Quit]
17:51:23<Craigle>Pinger just started returning a ton of 400's
17:51:44<Craigle>arkiver ^
17:51:59<@arkiver>403?
17:52:18<Craigle>Some, but I was seeing a wall of 400's with a few 403's and 200's
17:52:21<@arkiver>looks fine to me
17:52:24<@arkiver>hmm
17:52:40<@arkiver>the site is pretty unstable yeah :/
17:52:50<Craigle>Just picked back up
17:53:01<Craigle>Yeah, that was my thought
17:53:25<@arkiver>lets hope it stays online a little after the 31st
17:53:31<@arkiver>will see about contacting then
17:53:34<@arkiver>them*
17:54:30<@arkiver>anyone have ideas for pinger channel name?
17:54:43<Sanqui>pingas
17:55:13<Sanqui>sorry, I thought at first it would be a project for long term pings... idk lol
17:57:49<@OrIdow6>Not on Deathwatch?
17:58:13<monika>#pinged maybe?
17:58:30<@OrIdow6>#pingedout
17:58:39<@arkiver>lets do #pinged
17:58:46<@arkiver>saw yours too late OrIdow6 :P
18:02:50qwertyasdfuiopghjkl quits [Client Quit]
18:06:45qwertyasdfuiopghjkl joins
18:11:53<h2ibot>OrIdow6 edited Deathwatch (+112, /* 2022 */ Add pinger.pl): https://wiki.archiveteam.org/?diff=48225&oldid=48215
18:13:58<@OrIdow6>Did anything happen to forum.chip.de after the AB job got banned? Looks like they've made their change
18:14:36<@OrIdow6>I'm going to move its category anyhow
18:16:51<@JAA>I archived it fully, and it completed ten minutes before they added rules to their Buttflare configuration blocking most automated access.
18:16:54<h2ibot>OrIdow6 edited Deathwatch (+12, Forum.chip.de has made its changes): https://wiki.archiveteam.org/?diff=48226&oldid=48225
18:17:01<@OrIdow6>Oh, good
18:23:02HP_Archivist quits [Remote host closed the connection]
18:23:24HP_Archivist (HP_Archivist) joins
18:51:36onetruth quits [Read error: Connection reset by peer]
18:53:17DLoa joins
18:58:28<DLoa>Hi, I joined today and my Warrior VM has been running for over 6hrs. I'd like to back up forum threads which are of interest to me on the NotebookReview forums, which are closing for good in 2 days. Is there a way to selectively apply my Warrior VM to this and contribute to NBR archiving? I believe @JAA is already working on archiving this and I'd like to
18:58:29<DLoa>help. Thank you
19:05:25<@JAA>DLoa: There is no distributed project for TechnologyGuide, so no, you can't. I have already archived (nearly) the entire four forums, only a few dozen threads missing that I will be looking into tonight.
19:12:26lennier1 quits [Client Quit]
19:13:49lennier1 (lennier1) joins
19:17:22godane1 quits [Client Quit]
19:34:12qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
20:02:41<daxxy>JAA, do you want to grab those threads yourself? I've written down my notes here https://gist.github.com/drdaxxy/b7731fb4217a56604956bcaa45641648
20:07:03<@JAA>daxxy: Brilliant, thanks! Sorry for the delay, didn't have time to look into it yet.
20:07:28chrismeller (chrismeller) joins
20:08:55<daxxy>no worries :) what sorta resources / time did the HTML crawl take?
20:11:09hello joins
20:11:22<@JAA>About a day for all four forums with decent parallelism and multiple IPs. Not sure whether the IPs were actually needed or not.
20:11:43<@JAA>Also, yes, there are threads that require logging in. I'm not sure whether they're accessible to normal users or only mods or similar though.
20:11:52<@JAA>We generally only archive things that are publicly accessible.
20:12:41<@JAA>returnHtml=1 on get_thread renders the BBCode as HTML.
20:13:06<@JAA>Well, partially, anyway. [url=...] is not transformed apparently.
20:14:13<daxxy>neither is [quote]
20:14:22hello quits [Remote host closed the connection]
20:15:00<daxxy>nor img, so I have no idea if they actually render any BBCode or just newlines and maybe entities :v
20:15:43<@JAA>I'm seeing some <i> stuff as well.
20:15:54<@JAA>But yeah, it's weird.
20:16:11<@JAA>Smilies aren't translated into img tags either.
20:28:52<daxxy>okay, at least [b] just gets removed if returnHtml=0, see post 540654 in thread 75253 for example
20:32:13BlueMaxima joins
20:32:52<@JAA>Aw, there's a get_raw_post method, but that only works for users who can edit the post (i.e. poster/mods).
20:42:15<daxxy>yeah, I saw that, but now that you say it... I should talk to the mods, they seem interested in archival
20:43:39<daxxy>but since I figure this definitely isn't the place for crawling with a mod account - would you recommend the *-grab template for "outsiders" right now, or would I likely be better off hacking something together on my own?
20:45:26<@JAA>The -grab template is really only applicable to distributed projects, which is a major part of AT but not the only thing we do. I used my own tool (qwarc) for archiving the forums, but I can't recommend it to anyone as it's very much not user-friendly.
20:45:47<@JAA>And yeah, crawling with a mod account is not going to happen.
20:45:56<@JAA>(... here)
20:48:38<@JAA>I think I'll regrab all threads with get_thread, probably with returnHtml=0 but haven't decided yet.
20:52:10<@JAA>Trying to figure out where that transformation happens, but haven't quite found it.
20:56:09<daxxy>library/Tapatalk/Bridge.php, library/Tapatalk/BbCode/Formatter/Tapatalk.php, mobiquo/mbqClass/lib/read/MbqRdEtForumPost.php are the relevant places I've found
20:56:39<@JAA>Ah, push/TapatalkPush.php cleanPost, but it delegates to Tapatalk_BbCode_Formatter_Tapatalk which isn't in the plugin.
20:57:16<daxxy>it's in the archive, you may have only extracted the mobiquo folder
20:58:01<@JAA>Oh, right. I was grepping inside mobiquo, yeah.
20:59:16<@JAA>Wow, this code is a mess.
20:59:19<daxxy>hah
20:59:31<@JAA>Random indentation is exactly why I love Python.
21:00:10HP_Archivist quits [Ping timeout: 265 seconds]
21:00:50<hexa->python2*
21:01:09@JAA slaps hexa- around a bit with a large trout
21:01:26hexa- slaps JAA back with python2.7 … BEST BEFORE 2Y AGO
21:01:42<@JAA>Great, thanks, now I have food poisoning. :-(
21:01:55<hexa->I'm burnt, I do a lot of python packaging in NixOS :(
21:03:24<@JAA>[b] and [i] get stripped, [u] gets converted to <u>, [color] becomes a font tag, [img] should get stripped in both settings if I'm reading the code correctly.
21:05:57<daxxy>img stripped? where are you seeing that?
21:07:38<@JAA>Nevermind, it gets treated specially it seems.
21:07:46<@JAA>library/Tapatalk/BbCode/Formatter/Tapatalk.php is what I'm looking at.
21:08:04<@JAA>Specifically the getTags function.
21:08:25Megame quits [Client Quit]
21:12:23cadence quits [Ping timeout: 252 seconds]
21:13:06<daxxy>tbh, I don't think there's a need to analyze this properly right now -- we're not gonna get a lossless copy anyway, and clearly they only leave bbcode in that matches the parser in their app
21:13:49<daxxy>(the android app uses returnHtml=1, btw)
21:15:10<@JAA>Hmm, it would be neat if we could archive it in a way that someone could simply plug a Wayback Machine URL into the app and it all plays back correctly. But getting that to work would be quite a challenge.
21:15:24<@JAA>And it'd probably break anyway.
21:15:43<daxxy>you mean the tapatalk app?
21:16:15<daxxy>definitely not gonna work
21:19:13Mateon1 quits [Remote host closed the connection]
21:19:14<daxxy>for one, unless there's a way to force it into using the JSON API (doubt it, since the JSON API is newer, it ought to be preferred if client and server support it already), it POSTs to the xml-rpc interface and there's no way to make it request different URLs for different content
21:20:11Mateon1 joins
21:23:45<@JAA>Right
21:26:22<daxxy>writing a new (entirely client-side) webapp that reads everything from WBM (plus an externally hosted search index file, if you wanna get fancy) would work though, and not even take that much effort I think
21:27:01Jake quits [Quit: Leaving for a bit!]
21:28:02<daxxy>when you're not supporting 2 protocols in 8 codebases over 3 inheritance levels, this does not have to be complex software :P
21:28:08<@JAA>:-)
21:29:48<@JAA>It would have to be in the WBM though due to (the lack of) CORS.
21:30:04<daxxy>yeah, wasn't sure about that
21:30:27<@JAA>Anyway, that's something for the future. First step is getting the data.
21:30:33<daxxy>...but then you can always just put your site into WBM, right? ^^
21:30:45Jake (Jake) joins
21:31:05<@JAA>Also, someone here was working on a forum archive ingestion thingy a while ago. Not sure what happened to that idea.
21:31:27<@JAA>Yes, that's what I did with the Picosong data finder thingy.
21:32:08qw3rty_ joins
21:32:23atphoenix quits [Remote host closed the connection]
21:33:38AK quits [Client Quit]
21:33:40flashfire42 quits [Client Quit]
21:34:17@EggplantN quits [Quit: Ping timeout (120 seconds)]
21:34:31<@JAA>I'm going with returnHtml=0. As far as I can tell, it preserves a bit more data than =1 does, and the conversion should be easy enough.
21:34:33nyany quits [Quit: (516): and then you went into taco bell without pants...and surprisingly you weren't the only one there without pants]
21:34:37datechnoman1 (datechnoman) joins
21:34:43nyany (nyany) joins
21:34:52<daxxy>huh, what does it preserve that =1 doesn't?
21:34:56vukky quits [Remote host closed the connection]
21:35:02<@JAA>[b] [i]
21:35:07scowlee quits [Ping timeout: 252 seconds]
21:35:07<daxxy>hang on
21:35:14flashfire42 (flashfire42) joins
21:35:26EggplantN (EggplantN) joins
21:35:26@ChanServ sets mode: +o EggplantN
21:35:29mutantmnky quits [Ping timeout: 252 seconds]
21:35:47Matthww quits [Client Quit]
21:35:51qw3rty quits [Ping timeout: 252 seconds]
21:35:56<daxxy>no, =1 transforms them to HTML, but =0 strips them completely
21:36:05vukky (Vukky) joins
21:36:13HackMii quits [Ping timeout: 252 seconds]
21:36:16<@JAA>Huh
21:36:35ThreeHM quits [Ping timeout: 252 seconds]
21:36:48Matthww joins
21:37:11scowlee (scowlee) joins
21:37:13mutantmnky (mutantmonkey) joins
21:37:19datechnoman quits [Ping timeout: 252 seconds]
21:37:19datechnoman1 is now known as datechnoman
21:37:31<@JAA>Oh
21:37:40HackMii (hacktheplanet) joins
21:38:20ThreeHM (ThreeHeadedMonkey) joins
21:39:48<daxxy>any idea about the timeframe? if I (get the mods to) grab anything more, I'd rather do that after you've done your thing (especially with the missing posts) so my traffic won't get in your way
21:41:16Jake quits [Client Quit]
21:42:38<@JAA>Ok yeah, =1 it is I guess.
21:44:33<@JAA>I need to leave for a bit but will get it up and running in the next 1-2 hours.
21:44:37Jake (Jake) joins
21:45:09Jake quits [Remote host closed the connection]
21:45:09<daxxy>nice
21:45:21Jake (Jake) joins
21:45:34Jake quits [Remote host closed the connection]
21:45:44Jake (Jake) joins
21:45:59Jake quits [Remote host closed the connection]
21:51:13march_happy (march_happy) joins
21:52:20Jake (Jake) joins
21:52:36Jake quits [Remote host closed the connection]
21:55:08Jake (Jake) joins
22:00:03thelounge31 quits [Ping timeout: 252 seconds]
22:35:44HP_Archivist (HP_Archivist) joins
22:36:50chrismeller quits [Ping timeout: 265 seconds]
23:20:11<DLoa>DLoa: JAA This is great news. I was having trouble with a Python3 script using wkhtmlto/pdfkit to save each page of the threads of interest to pdf without running into issues after one or two pages (blocked?). I hope that it'll be possible to download to pdf for offline use. I have an account and can look into the remaining threads that could
23:20:11<DLoa>be saved.
23:25:39<@JAA>daxxy: The Tapatalk API has an ... interesting behaviour on thread redirects, e.g. thread 826132 on NotebookReview which redirects to 795536. It returns the data for the merged(?) thread and a positive total_post_num but no actual posts.
23:25:58<daxxy>huh
23:26:29<daxxy>I was wondering what it'd do with merged/moved threads but hadn't come across any, thanks
23:27:17<@JAA>Also, on threads that require logging in, it returns a 'Need valid topic id!' error, e.g. 247631 on NotebookReview.
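The two quirks just described (a redirected thread reporting a positive total_post_num with no posts, and login-only threads returning 'Need valid topic id!') could be flagged with something like the sketch below. It assumes the get_thread response has already been parsed into a dict; the key names "posts" and "result_text" and the category labels are hypothetical, while total_post_num and the error string are taken from the log.

```python
def classify_thread_response(data: dict) -> str:
    """Sort a parsed get_thread response into a rough category."""
    # "result_text" is an assumed location for the error message.
    if "Need valid topic id!" in str(data.get("result_text", "")):
        return "login_required_or_missing"

    posts = data.get("posts") or []  # "posts" is an assumed key name
    total = int(data.get("total_post_num", 0) or 0)

    if total > 0 and not posts:
        # Matches the redirect behaviour: a post count but no actual posts.
        return "redirect_suspected"
    if posts:
        return "ok"
    return "empty"
```

Threads landing in "redirect_suspected" would then need their target ID resolved some other way, for example through the website.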
23:28:26<DLoa>I can log in on NBR if it helps.
23:35:19march_happy quits [Ping timeout: 265 seconds]
23:49:16<@JAA>daxxy: Welp: http://forum.notebookreview.com/mobiquo/tapatalk.php?method_name=get_thread&topicId=763489&returnHtml=1&page=1&perPage=100
23:50:00<@JAA>That one works fine through the website: http://forum.notebookreview.com/threads/why-arent-laptop-gpus-officially-sold.763489/
23:51:02<daxxy>weird
23:51:10<@JAA>There are plenty more like that, it seems. Just running a little test with random IDs right now and immediately hit three like it.
23:51:32<@JAA>738640 and 742521 are the other two.
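For retesting IDs like these, the get_thread URL shown above can be rebuilt from its parts. This is only a sketch: the helper name and its defaults are assumptions, and the parameter names are copied from the URL in the log.

```python
from urllib.parse import urlencode


def get_thread_url(base: str, topic_id: int, page: int = 1,
                   per_page: int = 100, return_html: bool = True) -> str:
    """Build an unauthenticated Tapatalk get_thread URL for one page."""
    params = {
        "method_name": "get_thread",
        "topicId": topic_id,
        "returnHtml": 1 if return_html else 0,
        "page": page,
        "perPage": per_page,
    }
    # urlencode keeps the dict's insertion order, so the query string
    # matches the layout of the URL quoted in the log.
    return f"{base.rstrip('/')}/mobiquo/tapatalk.php?{urlencode(params)}"
```

Calling `get_thread_url("http://forum.notebookreview.com", 738640)` gives the equivalent request for the first of the two other IDs.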
23:51:53<@JAA>Their WAF is very, very odd.
23:53:02<@JAA>Haven't documented this anywhere yet, but any request containing 'temp' as a word gets blocked. Same for 'tmp' and one other I can't remember right now. And anything with 'nessus' results in a connection reset.
23:54:35<@JAA>But yeah, can't even get everything through the API. WTF?
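A rough way to predict which URLs the WAF rules described above would trip, so they can be queued separately. The exact matching the WAF performs is unknown; treating "as a word" as a regex word boundary, matching case-insensitively, and the function name are all assumptions. The example hosts in the usage are placeholders.

```python
import re

# 'temp' and 'tmp' as whole words; the third blocked word is not
# named in the log, so it is left out here.
_BLOCKED_WORDS = re.compile(r"\b(?:temp|tmp)\b", re.IGNORECASE)


def waf_risk(url: str) -> str:
    """Guess how the WAF would react to a request for this URL."""
    if "nessus" in url.lower():
        return "reset"    # log: anything with 'nessus' gets a connection reset
    if _BLOCKED_WORDS.search(url):
        return "blocked"  # log: 'temp' or 'tmp' as a word gets blocked
    return "ok"
```

Note that a regex word boundary does not fire next to an underscore, so a slug like `cpu_temp` would be reported as "ok" here even if the real WAF blocks it.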