00:00:42dm4v quits [Read error: Connection reset by peer]
00:01:08dm4v joins
00:01:10dm4v quits [Changing host]
00:01:10dm4v (dm4v) joins
00:05:52<Ryz>Heya folks, I may need some help on finding the much older parts of CNN because of https://old.reddit.com/r/forgottenwebsites/comments/4xxcxj/cnn_forgot_to_update_some_parts_of_its_website/ - one of which is http://www.cnn.com/FOOD/resources/ (which I'm throwing into AB)
00:06:06<Ryz>It seems the rest of the links are dead or redirected back to CNN modern stuff s:
00:07:16<Ryz>There's also apparently this version of that page, http://edition.cnn.com/FOOD/resources/
00:08:31<Ryz>This would be harder to find those kinds of links since they're not really subdomains
00:11:50AlsoHP_Archivist joins
00:14:44HP_Archivist quits [Ping timeout: 250 seconds]
00:19:43AlsoHP_Archivist quits [Ping timeout: 258 seconds]
00:31:03Mateon1 quits [Client Quit]
00:31:14Mateon1 joins
00:32:21Mateon1 quits [Client Quit]
00:32:27Mateon2 joins
00:34:57Mateon2 is now known as Mateon1
00:35:13Ruthalas6 (Ruthalas) joins
00:36:24Ruthalas quits [Ping timeout: 250 seconds]
00:36:24Ruthalas6 is now known as Ruthalas
01:03:14dm4v quits [Read error: Connection reset by peer]
01:04:12dm4v joins
01:04:14dm4v quits [Changing host]
01:04:14dm4v (dm4v) joins
01:13:25Mateon1 quits [Client Quit]
01:13:26Mateon2 joins
01:13:49tzt joins
01:15:55Mateon2 is now known as Mateon1
02:00:57AlsoHP_Archivist joins
02:12:48Barto quits [Ping timeout: 258 seconds]
02:38:52ThreeHM quits [Ping timeout: 258 seconds]
02:40:32ThreeHM (ThreeHeadedMonkey) joins
02:42:54BlueMaxima joins
02:50:27Barto (Barto) joins
03:04:38<katocala>Ryz there is also http://www.cnn.com/FOOD/news/, /restaurants/, /key.ingredient/cardamom/
03:13:17Mateon1 quits [Client Quit]
03:13:21Mateon1 joins
03:14:05Mateon1 quits [Remote host closed the connection]
03:14:09Mateon1 joins
03:14:46Mateon1 quits [Client Quit]
03:14:57<Ryz>Ooo, ooo, katocala, keep 'em coming if you manage to find more 'em~
03:15:23Mateon1 joins
03:16:23bsmith093 quits [Client Quit]
03:16:23Mateon1 quits [Read error: Connection reset by peer]
03:16:51Mateon1 joins
03:17:26Mateon1 quits [Client Quit]
03:17:31Mateon2 joins
03:18:39Mateon2 quits [Client Quit]
03:18:43Mateon1 joins
03:21:18Megame quits [Client Quit]
03:21:27<@JAA>Ryz: 'More key ingredients' selector half-way down the page on that last one has some more. Also 'related stories' at the end of the main text goes to another section.
03:32:25qw3rty__ joins
03:35:59qw3rty_ quits [Ping timeout: 258 seconds]
04:04:54bsmith093 joins
04:07:27lennier1 quits [Quit: Going offline, see ya! (www.adiirc.com)]
04:13:12lennier1 (lennier1) joins
04:25:28DogsRNice quits [Read error: Connection reset by peer]
04:25:54Jonboy3451 joins
04:26:35Jonboy345 quits [Ping timeout: 258 seconds]
05:34:49abcde joins
05:37:11<abcde>Hi folks. LIHKG (lihkg.com), Hong Kong's version of Reddit, is now at risk of being shut down - the HK government singled them out as *the* site that needs to be investigated for "endangering national security" (https://unwire.hk/2021/07/07/hk-socialmedia/life-tech/). Is there any way to start backing it up?
05:41:55<thuban>very javascript-heavy, seems likely to need a dedicated warrior project.
05:42:12<thuban>- do you have a rough idea of how big the site is?
05:43:07<@JAA>Current thread IDs are just over 2.6 million.
05:43:37<thuban>oh, they do seem to be sequential
05:43:42<thuban>that's good
05:44:06BinzyBoi quits [Read error: Connection reset by peer]
05:44:58<thuban>- what's the important content? threads seem straightforward; is there anything else (like user profiles) we should try and get?
05:46:32<abcde>Threads are pretty much the only thing that needs to be backed up imo; user profiles aren't important, and apart from these two there are no other features on LIHKG
05:48:31Atom joins
05:48:40<thuban>looks like thread data is at https://lihkg.com/api_v2/thread/<n>/page/<n>?order=reply_time
05:50:57<thuban>response text is in .response.item_data[<n>].msg; contains inline html (including images which would be nice to get)
05:51:41Atom-- quits [Ping timeout: 258 seconds]
05:52:36<thuban>ugh, behind cloudflare protection. that's bad :/
05:55:22<thuban>JAA, thoughts? i'm not sure what options we have besides contacting the operators and/or jury-rigging a webdriver setup
05:57:09<@JAA>I'll have a look when I'm awake again. What rate limiting are you seeing?
05:58:40<thuban>"Error 1020: Access denied" on straight attempts to access json in the browser; captchas on copying the legit request as curl.
05:59:01<@JAA>Mhm
06:00:56<thuban>(.response.total_page is all we should need for pagination)
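[Editor's sketch] The API shape thuban describes above (the thread/page URL pattern, `.response.item_data[<n>].msg`, and `.response.total_page`) could be exercised roughly as follows. This is a minimal sketch and not a tested grabber: it makes no network requests, the function names are hypothetical, and it assumes page numbers start at 1 and that `total_page` is an integer, neither of which is confirmed in the discussion.

```python
from typing import Any, Dict, Iterator, List

# URL pattern quoted in the channel; the placeholder names are ours.
API_TEMPLATE = "https://lihkg.com/api_v2/thread/{thread_id}/page/{page}?order=reply_time"


def page_url(thread_id: int, page: int) -> str:
    """Build the JSON API URL for one page of one thread."""
    return API_TEMPLATE.format(thread_id=thread_id, page=page)


def thread_page_urls(thread_id: int, total_page: int) -> Iterator[str]:
    """Yield the URL of every page of a thread.

    Assumption: pages are numbered 1..total_page inclusive.
    """
    for page in range(1, total_page + 1):
        yield page_url(thread_id, page)


def messages(payload: Dict[str, Any]) -> List[str]:
    """Extract the raw (inline-HTML) message bodies from one API response."""
    return [item["msg"] for item in payload["response"]["item_data"]]


def total_pages(payload: Dict[str, Any]) -> int:
    """Read the page count used for pagination from one API response."""
    return int(payload["response"]["total_page"])


if __name__ == "__main__":
    # Hypothetical response, shaped only after the fields named in the channel.
    sample = {"response": {"total_page": 2, "item_data": [{"msg": "hello"}]}}
    for url in thread_page_urls(123, total_pages(sample)):
        print(url)
    print(messages(sample))
```

Since thread IDs appear to be sequential, a work-item generator would presumably loop `thread_id` over the known ID range and fetch page 1 first to learn `total_page`.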
06:05:37<@OrIdow6>J A A knows all
06:07:21<@JAA>Lately, I've been pretty dumbfounded by Buttflare's bullshit. They're getting more annoying to deal with. Not that they were pleasant before...
06:14:27Mateon1 quits [Remote host closed the connection]
06:14:31Mateon1 joins
06:15:50AlsoHP_Archivist quits [Ping timeout: 258 seconds]
06:20:14<@OrIdow6>Because of the political sensitivity of this, I think it might be nice to try to get it without contacting them
06:23:50<@OrIdow6>abcde: Any idea on when it may be shut down?
06:53:00Mateon1 quits [Client Quit]
06:53:06Mateon1 joins
07:20:38xit joins
07:40:20mutantmonkey quits [Remote host closed the connection]
07:40:20HackMii quits [Write error: Broken pipe]
07:40:54mutantmonkey (mutantmonkey) joins
07:40:58HackMii (hacktheplanet) joins
08:42:00Arcorann (Arcorann) joins
09:00:59BinzyBoi joins
09:21:21noteness (noteness) joins
10:00:25BlueMaxima quits [Client Quit]
10:47:23<@OrIdow6>Any general ideas on how to solve CAPTCHAS, specifically CloudFlare ones?
10:48:48<@OrIdow6>Specifically as applicable to the previously discussed, but presumably could be useful in the future
10:53:06<thuban>ideally we could mimic browser behavior successfully enough not to receive captchas. (this might be via brozzler/webdriver+warcprox, or might be via a very low-level http implementation--the latter presumably much faster but much more difficult to keep updated.)
10:53:17<thuban>but i don't know whether that would be a complete solution or cloudflare sometimes throws captchas anyway just to keep its hand in; some hybrid approach might be needed...
11:00:08<thuban>(if the latter there's always the 'farm it out' option like we did with yahoo groups. captcha-solving services are kinda sketchy, but depending on how often captchas show up/how well we manage overall throughput, it might be practical to work just with volunteers)
11:02:45<@OrIdow6>This is assuming it gives captchas
11:02:55<@OrIdow6>Which is almost bound to happen even if you run headless
11:03:12<@OrIdow6>And IIRC some sites have set themselves to always have captchas no matter what
11:03:27<thuban>mmm
11:03:33<@OrIdow6>(Though it might be questionable to scrape those then, but that can be dealt with when it happens)
11:03:38<@OrIdow6>Yeah
11:04:10<@OrIdow6>Might be best to try to cruise just under the captcha rate on most?
11:04:41<rewby>That would require people be very careful about running too many of them
11:04:42<@OrIdow6>'Hybrid approach' sounds nice
11:19:02Barto quits [Ping timeout: 250 seconds]
11:20:12wizards quits [Ping timeout: 258 seconds]
11:22:00wizards joins
11:31:09Megame (Megame) joins
11:41:44<AK>Another option (Which is not gonna be easy), is take advantage of the privacy pass: https://support.cloudflare.com/hc/en-us/articles/115001992652-Using-Privacy-Pass-with-Cloudflare
11:42:11<AK>You can get passes for completing the captchas, that then allow you to automatically get through other captchas
11:42:24<AK>Might work
11:42:44<thuban>"To help mitigate malicious usage of this, we automatically disable Privacy Pass anytime a domain is placed into 'I'm Under Attack!' mode."
11:50:05<AK>If we're in under attack mode, everyone gets a captcha, no matter how slow we go
11:50:57<thuban>ah, i guess there are intermediate levels of protection. is the documentation for that accessible?
12:26:23LeGoupil joins
12:33:23<AK>https://support.cloudflare.com/hc/en-us/articles/200170056-Understanding-the-Cloudflare-Security-Level Some here
12:59:32<rewby>I'm thinking that if we can solve the captcha issue, that lihkg website should be archivable by wget-at?
13:08:38<thuban>not recursively, but if we generated and piped in the json & resource urls, then sure--but i doubt the premise; i don't see a path to bypassing captcha that is compatible with making requests through wget
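[Editor's sketch] The "generate the resource URLs and pipe them in" idea could start from something like the following, which pulls image URLs out of a message's inline HTML using only the Python standard library. It is an illustration under assumptions: the names `ImgSrcCollector` and `image_urls` are hypothetical, and it assumes images appear as ordinary `<img src="...">` tags, which the discussion does not confirm. It does nothing about the Cloudflare problem raised above.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        # Also reached for self-closing tags (<img ... />), because the default
        # handle_startendtag delegates to handle_starttag.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_urls(msg_html: str) -> List[str]:
    """Return image URLs found in one message body, in document order."""
    collector = ImgSrcCollector()
    collector.feed(msg_html)
    collector.close()
    return collector.sources


def url_list(urls: List[str]) -> str:
    """Join URLs one per line, de-duplicated, preserving first-seen order."""
    return "\n".join(dict.fromkeys(urls))


if __name__ == "__main__":
    # Hypothetical message body for demonstration only.
    html = 'text <img src="https://example.com/a.png" /> more text'
    print(url_list(image_urls(html)))
```

The resulting newline-separated list, combined with the JSON page URLs, is the kind of input a non-recursive fetcher would need.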
13:14:00<AK>I don't think we'll be able to bypass the captcha either. Cloudflare have worked pretty hard to make that very hard except by having humans complete the captcha
13:16:56<thuban>i think it's _doable_, jut impractical at scale with our resources
13:16:59<thuban>*just
13:21:23LeGoupil quits [Client Quit]
13:28:17<@arkiver>how long are CAPTCHA cookies working until a new one is required?
13:28:25<@arkiver>is this based on seconds or number of requests?
13:28:41AK shrugs
13:30:55thuban quits [Ping timeout: 258 seconds]
13:30:58<AK>Wouldn't be surprised if it's a mix of both
13:31:18<AK>As well as looking at other requests from ips+asns
13:31:42Barto (Barto) joins
13:50:18aphitex22 joins
14:11:06Jonboy3451 quits [Read error: Connection reset by peer]
14:15:36Jonboy345 joins
14:34:09Arcorann quits [Read error: Connection reset by peer]
14:34:29Arcorann (Arcorann) joins
14:50:34Astrid_ joins
14:55:18thuban joins
15:08:40Arcorann quits [Ping timeout: 258 seconds]
15:37:13lunik1 quits [Client Quit]
15:46:03lunik1 joins
16:05:49lunik1 quits [Client Quit]
16:12:37spirit joins
16:18:18lunik1 joins
16:36:48Astrid_ quits [Remote host closed the connection]
16:56:50AlsoHP_Archivist joins
17:11:57nuroten joins
17:18:23<nuroten>thuban: how are the saves of websites/Facebook pages of political parties going? there's been a wave of resignations following an announcement of potential wage clawback for those who risk being disqualified from their positions in the district or legislative councils. a lot of the parties likely won't be around as they are much longer
17:19:58<nuroten>(heard the sad news about LIHKG, thanks for doing whatever you all can to save it)
17:34:46<nuroten>the writing is on the wall for democrats in Macau, 21 candidates barred from upcoming legislative elections. source: https://hongkongfp.com/2021/07/10/macau-bans-21-democrats-from-legislative-elections/ putting it out there if it would be worthwhile preemptively saving the corresponding party websites/socials of those candidates
17:35:24<nuroten>(I can look up the urls if so)
18:15:57<duce1337>would it be possible to archive roblox.com?
18:16:18<duce1337>the catalog, some games <100 players and more?
18:26:16<Jake>I don't believe it's at risk of going away? Any specific reason we should?
18:39:53phiresky quits [Ping timeout: 258 seconds]
19:04:38phiresky joins
19:14:07<@JAA>thuban: The problem isn't just a low-level HTTP implementation. It's TLS as well these days.
19:14:41<@JAA>AK, thuban: 'I'm Under Attack' mode is not captchas but only JS challenge. Which is also a blocker at the moment but can be circumvented at least in theory.
19:38:48spirit quits [Client Quit]
20:00:54<thuban>nuroten: unfortunately, there is very little we can do about facebook at this time
20:01:12<thuban>i think the facebook rate limiting is just generous enough that with careful monitoring and a slow pace, it should be possible to access some recent posts from each page of interest, but i don't know if we can save them as warcs in a way that ia could accept.
20:01:22<thuban>(viz., archivebot, #//, etc are b&. i for one am willing to sit here with warcprox or whatever and make requests by hand if it comes to that, but i'm not whitelisted for wayback machine ingestion--and i'm not sure ia whitelists anyone for such artisanal setups)
20:01:36<thuban>as for websites, they're chugging along. we're mostly still on media outlets, though.
20:01:44<thuban>speaking of, i see the inmediahk job was aborted. anyone know whether we got decent coverage first? (was ddos protection always on or did they activate it in response to us?)
20:02:12<thuban>JAA: you are of course correct; i misspoke
20:02:22<nuroten>thuban: all right, thanks :)
20:04:37<thuban>nuroten: i'm about to do another round of checking on jobs and adding stuff to the hong kong media wiki page; if you care to grab those macau urls and dump them (in that same etherpad, maybe?) we'll see about archiving them as well
20:04:41<@JAA>thuban: inmediahk.net blocked AB shortly after the job started. Buttflare wasn't enabled at the time. curl on the same machine worked fine, even with identical headers...
20:05:21<thuban>JAA: gotcha, thanks
20:06:39<@JAA>That last part in particular is why I think TLS matters since recently. I'm not actually sure it's TLS, but when curl and AB send the exact same HTTP request down to the header order, it's the only thing I can think of.
20:08:26<duce1337>><Jake> I don't believe it's at risk of going away? Any specific reason we should?
20:08:37<duce1337>no, but just in case to preserve history
20:10:54<thuban>JAA: see discussion on 23 june https://hackint.logs.kiska.pw/archiveteam-bs/20210623#c292082
20:11:37<@JAA>Yup
20:17:36<thuban>looks like the passiontimes job probably needs a high delay or abort as well :/
20:34:30<thuban>thanks to whoever took care of that!
20:35:40<duce1337>xtube is shutting down https://www.ynot.com/xtube-close-abruptly-after-13-years/
20:37:37<@EggplantN>-> #nevermind
20:43:03<thuban>nuroten: did you cross out the twitter links in the 'larger parties' section, and if so, why? they look okay to me
20:43:21<nuroten>thuban: not me
20:44:40<thuban>huh
21:18:51<nuroten>thuban: Macau links at the very bottom of the pad. as I'm unfamiliar with the situation in Macau, maybe someone with knowledge of things there will come by and amend/add to it
21:30:22<Megame>Crossed out twitter links was prob me. Just meant I grabbed them in AB
21:32:23noteness quits [Ping timeout: 258 seconds]
21:37:46<Jake>duce1337: sure, but projects require a lot of time and storage space. Roblox, I imagine would take quite a bit of both. The OPs here will obviously consider it.
21:39:41<duce1337>ok
21:40:33<@JAA>I'm not opposed to it (I mean... https://transfer.archivete.am/inline/bG4mu/aatt.png ), but lower priority than a bunch of other things.
21:43:42<thuban>thank you, nuroten and Megame
21:43:46<thuban>AK: what's the story with the ab jobs you ran for https://612fund.hk/ on the 23rd? should we revisit or is the second one good?
21:44:49<Jake>I agree
21:49:05AlsoHP_Archivist quits [Client Quit]
21:49:10noteness (noteness) joins
21:49:25HP_Archivist (HP_Archivist) joins
22:40:37superkuh_ quits [Read error: Connection reset by peer]
22:40:41superkuh joins
22:57:08HP_Archivist quits [Ping timeout: 250 seconds]
23:07:26dm4v_ joins
23:07:27dm4v quits [Ping timeout: 258 seconds]
23:07:38dm4v_ is now known as dm4v
23:07:40dm4v quits [Changing host]
23:07:40dm4v (dm4v) joins
23:27:05Atom quits [Read error: Connection reset by peer]
23:38:04Mateon1 quits [Remote host closed the connection]
23:39:07Mateon1 joins
23:58:08Mateon1 quits [Remote host closed the connection]
23:58:25Mateon1 joins