| 00:00:42 | | dm4v quits [Read error: Connection reset by peer] |
| 00:01:08 | | dm4v joins |
| 00:01:10 | | dm4v is now authenticated as dm4v |
| 00:01:10 | | dm4v quits [Changing host] |
| 00:01:10 | | dm4v (dm4v) joins |
| 00:05:52 | <Ryz> | Heya folks, I may need some help on finding the much older parts of CNN because of https://old.reddit.com/r/forgottenwebsites/comments/4xxcxj/cnn_forgot_to_update_some_parts_of_its_website/ - one of which is http://www.cnn.com/FOOD/resources/ (which I'm throwing into AB) |
| 00:06:06 | <Ryz> | It seems the rest of the links are dead or redirected back to CNN modern stuff s: |
| 00:07:16 | <Ryz> | There's also apparently this version of that page, http://edition.cnn.com/FOOD/resources/ |
| 00:08:31 | <Ryz> | It would be harder to find those kinds of links since they're not really subdomains |
| 00:11:50 | | AlsoHP_Archivist joins |
| 00:14:44 | | HP_Archivist quits [Ping timeout: 250 seconds] |
| 00:19:43 | | AlsoHP_Archivist quits [Ping timeout: 258 seconds] |
| 00:31:03 | | Mateon1 quits [Client Quit] |
| 00:31:14 | | Mateon1 joins |
| 00:32:21 | | Mateon1 quits [Client Quit] |
| 00:32:27 | | Mateon2 joins |
| 00:34:57 | | Mateon2 is now known as Mateon1 |
| 00:35:13 | | Ruthalas6 (Ruthalas) joins |
| 00:36:24 | | Ruthalas quits [Ping timeout: 250 seconds] |
| 00:36:24 | | Ruthalas6 is now known as Ruthalas |
| 01:03:14 | | dm4v quits [Read error: Connection reset by peer] |
| 01:04:12 | | dm4v joins |
| 01:04:14 | | dm4v is now authenticated as dm4v |
| 01:04:14 | | dm4v quits [Changing host] |
| 01:04:14 | | dm4v (dm4v) joins |
| 01:13:25 | | Mateon1 quits [Client Quit] |
| 01:13:26 | | Mateon2 joins |
| 01:13:49 | | tzt joins |
| 01:15:55 | | Mateon2 is now known as Mateon1 |
| 02:00:57 | | AlsoHP_Archivist joins |
| 02:12:48 | | Barto quits [Ping timeout: 258 seconds] |
| 02:38:52 | | ThreeHM quits [Ping timeout: 258 seconds] |
| 02:40:32 | | ThreeHM (ThreeHeadedMonkey) joins |
| 02:42:54 | | BlueMaxima joins |
| 02:50:27 | | Barto (Barto) joins |
| 03:04:38 | <katocala> | Ryz there is also http://www.cnn.com/FOOD/news/, /restaurants/, /key.ingredient/cardamom/ |
| 03:13:17 | | Mateon1 quits [Client Quit] |
| 03:13:21 | | Mateon1 joins |
| 03:14:05 | | Mateon1 quits [Remote host closed the connection] |
| 03:14:09 | | Mateon1 joins |
| 03:14:46 | | Mateon1 quits [Client Quit] |
| 03:14:57 | <Ryz> | Ooo, ooo, katocala, keep 'em coming if you manage to find more 'em~ |
| 03:15:23 | | Mateon1 joins |
| 03:16:23 | | bsmith093 quits [Client Quit] |
| 03:16:23 | | Mateon1 quits [Read error: Connection reset by peer] |
| 03:16:51 | | Mateon1 joins |
| 03:17:26 | | Mateon1 quits [Client Quit] |
| 03:17:31 | | Mateon2 joins |
| 03:18:39 | | Mateon2 quits [Client Quit] |
| 03:18:43 | | Mateon1 joins |
| 03:21:18 | | Megame quits [Client Quit] |
| 03:21:27 | <@JAA> | Ryz: 'More key ingredients' selector half-way down the page on that last one has some more. Also 'related stories' at the end of the main text goes to another section. |
| 03:32:25 | | qw3rty__ joins |
| 03:35:59 | | qw3rty_ quits [Ping timeout: 258 seconds] |
| 04:04:54 | | bsmith093 joins |
| 04:04:54 | | bsmith093 is now authenticated as bsmith093 |
| 04:07:27 | | lennier1 quits [Quit: Going offline, see ya! (www.adiirc.com)] |
| 04:13:12 | | lennier1 (lennier1) joins |
| 04:25:28 | | DogsRNice quits [Read error: Connection reset by peer] |
| 04:25:54 | | Jonboy3451 joins |
| 04:26:35 | | Jonboy345 quits [Ping timeout: 258 seconds] |
| 05:34:49 | | abcde joins |
| 05:37:11 | <abcde> | Hi folks. LIHKG (lihkg.com), Hong Kong's version of Reddit, is now at risk of being shut down - the HK government singled them out as *the* site that needs to be investigated for "endangering national security" (https://unwire.hk/2021/07/07/hk-socialmedia/life-tech/). Is there any way to start backing it up? |
| 05:41:55 | <thuban> | very javascript-heavy, seems likely to need a dedicated warrior project. |
| 05:42:12 | <thuban> | - do you have a rough idea of how big the site is? |
| 05:43:07 | <@JAA> | Current thread IDs are just over 2.6 million. |
| 05:43:37 | <thuban> | oh, they do seem to be sequential |
| 05:43:42 | <thuban> | that's good |
| 05:44:06 | | BinzyBoi quits [Read error: Connection reset by peer] |
| 05:44:58 | <thuban> | - what's the important content? threads seem straightforward; is there anything else (like user profiles) we should try and get? |
| 05:46:32 | <abcde> | Threads are pretty much the only thing that needs to be backed up imo; user profiles aren't important, and apart from these two, there are no other features on LIHKG |
| 05:48:31 | | Atom joins |
| 05:48:40 | <thuban> | looks like thread data is at https://lihkg.com/api_v2/thread/<n>/page/<n>?order=reply_time |
| 05:50:57 | <thuban> | response text is in .response.item_data[<n>].msg; contains inline html (including images which would be nice to get) |
| 05:51:41 | | Atom-- quits [Ping timeout: 258 seconds] |
| 05:52:36 | <thuban> | ugh, behind cloudflare protection. that's bad :/ |
| 05:55:22 | <thuban> | JAA, thoughts? i'm not sure what options we have besides contacting the operators and/or jury-rigging a webdriver setup |
| 05:57:09 | <@JAA> | I'll have a look when I'm awake again. What rate limiting are you seeing? |
| 05:58:40 | <thuban> | "Error 1020: Access denied" on straight attempts to access json in the browser; captchas on copying the legit request as curl. |
| 05:59:01 | <@JAA> | Mhm |
| 06:00:56 | <thuban> | (.response.total_page is all we should need for pagination) |
| 06:05:37 | <@OrIdow6> | J A A knows all |
| 06:07:21 | <@JAA> | Lately, I've been pretty dumbfounded by Buttflare's bullshit. They're getting more annoying to deal with. Not that they were pleasant before... |
| 06:14:27 | | Mateon1 quits [Remote host closed the connection] |
| 06:14:31 | | Mateon1 joins |
| 06:15:50 | | AlsoHP_Archivist quits [Ping timeout: 258 seconds] |
| 06:20:14 | <@OrIdow6> | Because of the political sensitivity of this, I think it might be nice to try to get it without contacting them |
| 06:23:50 | <@OrIdow6> | abcde: Any idea on when it may be shut down? |
| 06:53:00 | | Mateon1 quits [Client Quit] |
| 06:53:06 | | Mateon1 joins |
| 07:20:38 | | xit joins |
| 07:40:20 | | mutantmonkey quits [Remote host closed the connection] |
| 07:40:20 | | HackMii quits [Write error: Broken pipe] |
| 07:40:54 | | mutantmonkey (mutantmonkey) joins |
| 07:40:58 | | HackMii (hacktheplanet) joins |
| 08:42:00 | | Arcorann (Arcorann) joins |
| 09:00:59 | | BinzyBoi joins |
| 09:21:21 | | noteness (noteness) joins |
| 10:00:25 | | BlueMaxima quits [Client Quit] |
| 10:47:23 | <@OrIdow6> | Any general ideas on how to solve CAPTCHAs, specifically Cloudflare ones? |
| 10:48:48 | <@OrIdow6> | Specifically as applicable to the previously discussed, but presumably could be useful in the future |
| 10:53:06 | <thuban> | ideally we could mimic browser behavior successfully enough not to receive captchas. (this might be via brozzler/webdriver+warcprox, or might be via a very low-level http implementation--the latter presumably much faster but much more difficult to keep updated.) |
| 10:53:17 | <thuban> | but i don't know whether that would be a complete solution or whether cloudflare sometimes throws captchas anyway just to keep its hand in; some hybrid approach might be needed... |
| 11:00:08 | <thuban> | (if the latter there's always the 'farm it out' option like we did with yahoo groups. captcha-solving services are kinda sketchy, but depending on how often captchas show up/how well we manage overall throughput, it might be practical to work just with volunteers) |
| 11:02:45 | <@OrIdow6> | This is assuming it gives captchas |
| 11:02:55 | <@OrIdow6> | Which is almost bound to happen even if you run headless |
| 11:03:12 | <@OrIdow6> | And IIRC some sites have set themselves to always have captchas no matter what |
| 11:03:27 | <thuban> | mmm |
| 11:03:33 | <@OrIdow6> | (Though it might be questionable to scrape those then, but that can be dealt with when it happens) |
| 11:03:38 | <@OrIdow6> | Yeah |
| 11:04:10 | <@OrIdow6> | Might be best to try to cruise just under the captcha rate on most? |
| 11:04:41 | <rewby> | That would require people be very careful about running too many of them |
| 11:04:42 | <@OrIdow6> | "Hybrid approach" sounds nice |
| 11:19:02 | | Barto quits [Ping timeout: 250 seconds] |
| 11:20:12 | | wizards quits [Ping timeout: 258 seconds] |
| 11:22:00 | | wizards joins |
| 11:31:09 | | Megame (Megame) joins |
| 11:41:44 | <AK> | Another option (which is not gonna be easy) is to take advantage of Privacy Pass: https://support.cloudflare.com/hc/en-us/articles/115001992652-Using-Privacy-Pass-with-Cloudflare |
| 11:42:11 | <AK> | You can get passes for completing the captchas, that then allow you to automatically get through other captchas |
| 11:42:24 | <AK> | Might work |
| 11:42:44 | <thuban> | "To help mitigate malicious usage of this, we automatically disable Privacy Pass anytime a domain is placed into 'I'm Under Attack!' mode." |
| 11:50:05 | <AK> | If we're in under attack mode, everyone gets a captcha, no matter how slow we go |
| 11:50:57 | <thuban> | ah, i guess there are intermediate levels of protection. is the documentation for that accessible? |
| 12:26:23 | | LeGoupil joins |
| 12:33:23 | <AK> | https://support.cloudflare.com/hc/en-us/articles/200170056-Understanding-the-Cloudflare-Security-Level Some here |
| 12:59:32 | <rewby> | I'm thinking that if we can solve the captcha issue, that lihkg website should be archivable by wget-at? |
| 13:08:38 | <thuban> | not recursively, but if we generated and piped in the json & resource urls, then sure--but i doubt the premise; i don't see a path to bypassing captcha that is compatible with making requests through wget |
| 13:14:00 | <AK> | I don't think we'll be able to bypass the captcha either. Cloudflare have worked pretty hard to make that very hard except by having humans complete the captcha |
| 13:16:56 | <thuban> | i think it's _doable_, just impractical at scale with our resources |
| 13:21:23 | | LeGoupil quits [Client Quit] |
| 13:28:17 | <@arkiver> | how long do CAPTCHA cookies work before a new one is required? |
| 13:28:25 | <@arkiver> | is this based on seconds or number of requests? |
| 13:28:41 | | AK shrugs |
| 13:30:55 | | thuban quits [Ping timeout: 258 seconds] |
| 13:30:58 | <AK> | Wouldn't be surprised if it's a mix of both |
| 13:31:18 | <AK> | As well as looking at other requests from ips+asns |
| 13:31:42 | | Barto (Barto) joins |
| 13:50:18 | | aphitex22 joins |
| 14:11:06 | | Jonboy3451 quits [Read error: Connection reset by peer] |
| 14:15:36 | | Jonboy345 joins |
| 14:34:09 | | Arcorann quits [Read error: Connection reset by peer] |
| 14:34:29 | | Arcorann (Arcorann) joins |
| 14:50:34 | | Astrid_ joins |
| 14:55:18 | | thuban joins |
| 15:08:40 | | Arcorann quits [Ping timeout: 258 seconds] |
| 15:37:13 | | lunik1 quits [Client Quit] |
| 15:46:03 | | lunik1 joins |
| 16:05:49 | | lunik1 quits [Client Quit] |
| 16:12:37 | | spirit joins |
| 16:18:18 | | lunik1 joins |
| 16:36:48 | | Astrid_ quits [Remote host closed the connection] |
| 16:56:50 | | AlsoHP_Archivist joins |
| 17:11:57 | | nuroten joins |
| 17:18:23 | <nuroten> | thuban: how are the saves of websites/Facebook pages of political parties going? there's been a wave of resignations following an announcement of potential wage clawback for those who risk being disqualified from their positions in the district or legislative councils. a lot of the parties likely won't be around as they are much longer |
| 17:19:58 | <nuroten> | (heard the sad news about LIHKG, thanks for doing whatever you all can to save it) |
| 17:34:46 | <nuroten> | the writing is on the wall for democrats in Macau, 21 candidates barred from upcoming legislative elections. source: https://hongkongfp.com/2021/07/10/macau-bans-21-democrats-from-legislative-elections/ putting it out there if it would be worthwhile preemptively saving the corresponding party websites/socials of those candidates |
| 17:35:24 | <nuroten> | (I can look up the urls if so) |
| 18:15:57 | <duce1337> | would it be possible to archive roblox.com? |
| 18:16:18 | <duce1337> | the catalog, some games <100 players and more? |
| 18:26:16 | <Jake> | I don't believe it's at risk of going away? Any specific reason we should? |
| 18:39:53 | | phiresky quits [Ping timeout: 258 seconds] |
| 19:04:38 | | phiresky joins |
| 19:14:07 | <@JAA> | thuban: The problem isn't just a low-level HTTP implementation. It's TLS as well these days. |
| 19:14:41 | <@JAA> | AK, thuban: 'I'm Under Attack' mode is not captchas but only JS challenge. Which is also a blocker at the moment but can be circumvented at least in theory. |
| 19:38:48 | | spirit quits [Client Quit] |
| 20:00:54 | <thuban> | nuroten: unfortunately, there is very little we can do about facebook at this time |
| 20:01:12 | <thuban> | i think the facebook rate limiting is just generous enough that with careful monitoring and a slow pace, it should be possible to access some recent posts from each page of interest, but i don't know if we can save them as warcs in a way that ia could accept. |
| 20:01:22 | <thuban> | (viz., archivebot, #//, etc are b&. i for one am willing to sit here with warcprox or whatever and make requests by hand if it comes to that, but i'm not whitelisted for wayback machine ingestion--and i'm not sure ia whitelists anyone for such artisanal setups) |
| 20:01:36 | <thuban> | as for websites, they're chugging along. we're mostly still on media outlets, though. |
| 20:01:44 | <thuban> | speaking of, i see the inmediahk job was aborted. anyone know whether we got decent coverage first? (was ddos protection always on or did they activate it in response to us?) |
| 20:02:12 | <thuban> | JAA: you are of course correct; i misspoke |
| 20:02:22 | <nuroten> | thuban: all right, thanks :) |
| 20:04:37 | <thuban> | nuroten: i'm about to do another round of checking on jobs and adding stuff to the hong kong media wiki page; if you care to grab those macau urls and dump them (in that same etherpad, maybe?) we'll see about archiving them as well |
| 20:04:41 | <@JAA> | thuban: inmediahk.net blocked AB shortly after the job started. Buttflare wasn't enabled at the time. curl on the same machine worked fine, even with identical headers... |
| 20:05:21 | <thuban> | JAA: gotcha, thanks |
| 20:06:39 | <@JAA> | That last part in particular is why I think TLS has started to matter recently. I'm not actually sure it's TLS, but when curl and AB send the exact same HTTP request down to the header order, it's the only thing I can think of. |
| 20:08:26 | <duce1337> | ><Jake> I don't believe it's at risk of going away? Any specific reason we should? |
| 20:08:37 | <duce1337> | no, but just in case to preserve history |
| 20:10:54 | <thuban> | JAA: see discussion on 23 june https://hackint.logs.kiska.pw/archiveteam-bs/20210623#c292082 |
| 20:11:37 | <@JAA> | Yup |
| 20:17:36 | <thuban> | looks like the passiontimes job probably needs a high delay or abort as well :/ |
| 20:34:30 | <thuban> | thanks to whoever took care of that! |
| 20:35:40 | <duce1337> | xtube is shutting down https://www.ynot.com/xtube-close-abruptly-after-13-years/ |
| 20:37:37 | <@EggplantN> | -> #nevermind |
| 20:43:03 | <thuban> | nuroten: did you cross out the twitter links in the 'larger parties' section, and if so, why? they look okay to me |
| 20:43:21 | <nuroten> | thuban: not me |
| 20:44:40 | <thuban> | huh |
| 21:18:51 | <nuroten> | thuban: Macau links at the very bottom of the pad. as I'm unfamiliar with the situation in Macau, maybe someone with knowledge of things there will come by and amend/add to it |
| 21:30:22 | <Megame> | Crossed out twitter links was prob me. Just meant I grabbed them in AB |
| 21:32:23 | | noteness quits [Ping timeout: 258 seconds] |
| 21:37:46 | <Jake> | duce1337: sure, but projects require a lot of time and storage space. Roblox, I imagine would take quite a bit of both. The OPs here will obviously consider it. |
| 21:39:41 | <duce1337> | ok |
| 21:40:33 | <@JAA> | I'm not opposed to it (I mean... https://transfer.archivete.am/inline/bG4mu/aatt.png ), but lower priority than a bunch of other things. |
| 21:43:42 | <thuban> | thank you, nuroten and Megame |
| 21:43:46 | <thuban> | AK: what's the story with the ab jobs you ran for https://612fund.hk/ on the 23rd? should we revisit or is the second one good? |
| 21:44:49 | <Jake> | I agree |
| 21:49:05 | | AlsoHP_Archivist quits [Client Quit] |
| 21:49:10 | | noteness (noteness) joins |
| 21:49:25 | | HP_Archivist (HP_Archivist) joins |
| 22:40:37 | | superkuh_ quits [Read error: Connection reset by peer] |
| 22:40:41 | | superkuh joins |
| 22:57:08 | | HP_Archivist quits [Ping timeout: 250 seconds] |
| 23:07:26 | | dm4v_ joins |
| 23:07:27 | | dm4v quits [Ping timeout: 258 seconds] |
| 23:07:38 | | dm4v_ is now known as dm4v |
| 23:07:40 | | dm4v is now authenticated as dm4v |
| 23:07:40 | | dm4v quits [Changing host] |
| 23:07:40 | | dm4v (dm4v) joins |
| 23:27:05 | | Atom quits [Read error: Connection reset by peer] |
| 23:38:04 | | Mateon1 quits [Remote host closed the connection] |
| 23:39:07 | | Mateon1 joins |
| 23:58:08 | | Mateon1 quits [Remote host closed the connection] |
| 23:58:25 | | Mateon1 joins |
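
The thread API layout thuban describes between 05:48 and 06:00 (the `api_v2/thread/<n>/page/<n>?order=reply_time` URL, message text in `.response.item_data[<n>].msg`, and `.response.total_page` for pagination) could be turned into a URL list for a fetcher roughly as follows. This is only a sketch under stated assumptions: it does no network access and does nothing about the Cloudflare challenge discussed above; the function names `page_urls` and `extract` are hypothetical; pages are assumed to be numbered from 1; `total_page` is assumed to be an integer; and inline images are assumed to appear as ordinary `<img src=...>` tags inside `msg`. None of those assumptions is confirmed in the log.

```python
from html.parser import HTMLParser

# URL shape quoted in the log; thread_id and page fill the two <n> slots.
API_TEMPLATE = "https://lihkg.com/api_v2/thread/{thread_id}/page/{page}?order=reply_time"


def page_urls(thread_id, total_page):
    """Build one API URL per page of a thread (ASSUMPTION: pages start at 1)."""
    return [
        API_TEMPLATE.format(thread_id=thread_id, page=page)
        for page in range(1, total_page + 1)
    ]


class _ImgCollector(HTMLParser):
    """Collect src attributes of <img> tags found in inline HTML."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


def extract(payload):
    """Return (total_page, messages, image_urls) from one parsed JSON response.

    ASSUMPTION: payload follows the shape described in the log, i.e.
    payload["response"]["total_page"] and payload["response"]["item_data"][i]["msg"].
    """
    response = payload["response"]
    messages = [item["msg"] for item in response["item_data"]]
    collector = _ImgCollector()
    for msg in messages:
        collector.feed(msg)
    return response["total_page"], messages, collector.srcs
```

Given the first page of a thread already parsed into a dict, `extract` yields the page count and any image URLs, and `page_urls` produces the remaining API URLs; both lists could then be written one per line and handed to whatever fetcher ends up being usable, in the spirit of the "generate and pipe in the json & resource urls" idea at 13:08.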