| 00:02:02 | | dm4v quits [Read error: Connection reset by peer] |
| 00:02:10 | | dm4v_ joins |
| 00:02:36 | | dm4v_ is now known as dm4v |
| 00:02:36 | | dm4v is now authenticated as dm4v |
| 00:02:36 | | dm4v quits [Changing host] |
| 00:02:36 | | dm4v (dm4v) joins |
| 00:03:30 | | wyatt8740 quits [Ping timeout: 258 seconds] |
| 00:03:59 | | wyatt8740 joins |
| 00:31:20 | | Megame quits [Read error: Connection reset by peer] |
| 00:31:38 | | Megame joins |
| 00:44:04 | | lunik1 joins |
| 01:02:37 | | dm4v_ joins |
| 01:02:42 | | dm4v quits [Ping timeout: 250 seconds] |
| 01:02:49 | | dm4v_ is now known as dm4v |
| 01:02:52 | | dm4v is now authenticated as dm4v |
| 01:02:52 | | dm4v quits [Changing host] |
| 01:02:52 | | dm4v (dm4v) joins |
| 01:10:50 | | wessel1512 quits [Read error: Connection reset by peer] |
| 01:10:53 | | wessel1512 joins |
| 01:15:37 | | wessel15126 joins |
| 01:15:54 | | wessel1512 quits [Read error: Connection reset by peer] |
| 01:15:54 | | wessel15126 is now known as wessel1512 |
| 01:16:54 | <jacobk> | Hello, I'm interested in archiving the wordlists, images, and audios from wordplay.com. As far as I can tell, the website loads almost everything dynamically with a couple of javascript files, so view-source: does not get the information. I tried using a tool called phantomjs to render the page and save the modified DOM plus a png image, which works in most cases (but sometimes it stops before the page is fully loaded). I also |
| 01:16:54 | <jacobk> | called the API more directly with curl -i and saved the json response of each call. Neither of these methods get the images and audio though, just links to them (they're on a different domain). What would be the best way to download the linked images/audio, and also make sure that I redownload all of the pages that stopped before finishing loading (should return "not found" if the course/lesson actually doesn't exist, whereas |
| 01:16:54 | <jacobk> | it will say "loading" if it hasn't finished loading)? |
| 01:45:13 | | Megame quits [Client Quit] |
| 02:17:26 | | lennier1 (lennier1) joins |
| 02:23:00 | <thuban> | nuroten: grabbing _31 this week_ (https://podcast.rthk.hk/podcast/tv_thisweek2014_i.xml) now; will follow up with _open line open view_ (https://podcast.rthk.hk/podcast/radio1_openline_openview.xml). is that correct, and is the talk show available through the podcast site? |
| 02:23:44 | | nostalgebraist joins |
| 02:24:24 | <thuban> | also, thanks JAA for the monitor, and thanks nuroten for adding the twitter links |
| 02:24:28 | <nuroten> | thuban: thanks! yeah, that's the one, about ~1k videos |
| 02:24:41 | <nuroten> | sorry, not videos, audio I think |
| 02:25:00 | <nuroten> | 31 is video |
| 02:26:27 | <thuban> | note to self, update script to handle filetype correctly... |
| 02:32:44 | <thuban> | ah, easier than i was expecting |
| 02:34:43 | <nuroten> | a chunk of the groups/items from parties onwards don't have twitter accounts, included where found |
| 02:39:46 | <nuroten> | I'm wondering whether to save certain renowned figures' twitter timelines as part of this whole thing ... Agnes Chow's facebook page is gone (unclear whether she removed it herself), but twitter is still up |
| 02:39:54 | | HP_Archivist (HP_Archivist) joins |
| 02:42:42 | <abccc> | nuroten I'd say saving twitter and fb timelines is important, lots of fb accounts have been disappearing lately (ex: Civic Party HK, the 2nd largest pro democracy party in HK). |
| 02:44:10 | <nuroten> | (a bit of background: she's an activist, arrested along with other key figures and served sentence for unauthorized assembly, recently released) |
| 02:45:05 | <nuroten> | abccc: that's unfortunate, about Civic Party HK's Facebook page |
| 02:45:35 | <abccc> | nuroten Agnes Chow's fb page was removed most likely due to the ongoing "endangering national security" case against her |
| 02:47:34 | <nuroten> | yeah ... but twitter account would be removed too? a lot of Japanese followers after all |
| 02:48:53 | <nuroten> | either way, yeah, it might be important. it does not bode well |
| 02:50:33 | <abccc> | nuroten yes because according to the HK government, having a Twitter account means you are "colluding with foreign forces". No joke, this was the reason they used to arrest Jimmy Lai, the owner of Apple Daily. |
| 02:51:07 | <abccc> | and of course "colluding with foreign forces" is a national security threat tantamount to treason. |
| 02:55:40 | <@OrIdow6> | EggplantN: 10.7 thousand |
| 02:55:58 | <@OrIdow6> | jacobk: Is the site at risk? |
| 02:58:44 | <nuroten> | abccc: yeah, sad times |
| 03:00:34 | <@JAA> | OrIdow6: https://wordplay.com/notice |
| 03:01:05 | <@JAA> | 'After more than 10 years of operation, Worpdlay will be shutting down permanently at the end of this school year. The last day of operation will be July 1, 2021.' |
| 03:01:06 | <@OrIdow6> | Thanks JAA |
| 03:01:18 | <@OrIdow6> | 2 days |
| 03:01:43 | <@JAA> | It's already the 30th in Europe, so possibly less than one. |
| 03:03:12 | <@OrIdow6> | I don't think the site is European |
| 03:04:46 | <@OrIdow6> | But in any case very close |
| 03:05:30 | <@JAA> | Uh yeah, probably not. I was thinking 'Spanish' as in the country, not the language, for some reason. |
| 03:13:38 | | Doranwen quits [Client Quit] |
| 03:20:09 | | HP_Archivist quits [Ping timeout: 258 seconds] |
| 03:33:10 | <@OrIdow6> | Trying to enumerate it now |
| 03:33:18 | <@OrIdow6> | I.e. trying different IDs or whatever they are |
| 03:37:37 | <jacobk> | I think the IDs for everything in wordplay are sequential, <9000 courses, <130000. I checked by just trying a bunch of IDs and creating a few courses/lessons myself. |
| 03:37:58 | <jacobk> | *<130000 lessons |
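If the IDs really are sequential, as jacobk suggests, the work list can be generated rather than discovered. Below is a minimal sketch under that assumption. The `course:`/`lesson:` item-name format and the choice to start at ID 1 are hypothetical; only the two upper bounds come from the chat, and the lesson URL shape matches the `https://wordplay.com/lesson/100204` link that appears later in this log.

```python
from typing import Iterator

COURSE_LIMIT = 9_000     # "<9000 courses", per jacobk
LESSON_LIMIT = 130_000   # "<130000 lessons", per jacobk


def item_names() -> Iterator[str]:
    """Yield one hypothetical item name per candidate ID."""
    # Assumption: IDs start at 1; the chat does not say whether 0 is valid.
    for course_id in range(1, COURSE_LIMIT):
        yield f"course:{course_id}"
    for lesson_id in range(1, LESSON_LIMIT):
        yield f"lesson:{lesson_id}"


def lesson_url(lesson_id: int) -> str:
    """Build a lesson page URL in the shape seen later in the log."""
    return f"https://wordplay.com/lesson/{lesson_id}"
```

IDs that were never assigned simply come back as "not found", so over-generating candidates costs a few wasted requests rather than missed content.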
| 03:38:53 | <@OrIdow6> | Oh, you created them? |
| 03:39:46 | <jacobk> | I created a few just for testing; most are made by teachers I think. |
| 03:39:53 | <@OrIdow6> | Are there any other units of the site besides courses and lessons? |
| 03:40:29 | <jacobk> | There's users and classes, but those aren't publicly accessible. |
| 03:40:38 | <@OrIdow6> | Oh |
| 03:40:48 | <jacobk> | There's words, but I don't think those can be individually requested. |
| 03:41:56 | | qw3rty__ joins |
| 03:44:00 | <jacobk> | <91000 words |
| 03:44:57 | <jacobk> | or, more accurately, "tiles", which contain a pair of strings ("targetText" and "nativeText") that are supposed to mean the same thing. |
| 03:45:50 | | qw3rty_ quits [Ping timeout: 258 seconds] |
| 03:58:03 | <@EggplantN> | We’ve got 2 days? |
| 04:01:01 | <jacobk> | Yeah |
| 04:01:33 | <jacobk> | Unless Wordplay stays up after what they say the last day of operation will be. |
| 04:01:54 | <@OrIdow6> | Or less |
| 04:02:04 | <@OrIdow6> | I expect it to be small, shouldn't need backfeed |
| 04:02:17 | <jacobk> | Maybe they'll close the visible site but keep the API up for longer. And the images/audios are stored on Cloudfront, so those may last a little while longer as well. |
| 04:02:32 | <jacobk> | Not that either of those things would happen intentionally. |
| 04:03:20 | <@OrIdow6> | The latter is fairly common, due to the way this site is set up the former is not |
| 04:03:48 | <@EggplantN> | Oridow6 I can setup backfeed keys if needed |
| 04:04:06 | <jacobk> | I've already saved all of the "printable word lists" with phantomjs. |
| 04:04:29 | <jacobk> | The HTML doesn't look quite right, but the text is there, and I also had phantomjs render a png which usually looks fine. |
| 04:04:44 | <@OrIdow6> | EggplantN: Well, if it's easy, I may use it anyway |
| 04:05:02 | <@OrIdow6> | jacobk: We do things properly here |
| 04:05:21 | <jacobk> | I don't know what I'm doing |
| 04:05:38 | <jacobk> | What would the proper way be? |
| 04:06:10 | <@JAA> | We archive the raw HTTP requests/responses as WARCs so that they can be ingested into the Wayback Machine. |
| 04:06:15 | <@OrIdow6> | Or, more thoroughly |
| 04:06:17 | <@OrIdow6> | With warcs |
| 04:07:17 | <jacobk> | I tried using wget with --warc-file="wp", but it seemed to be missing a lot for some reason. |
| 04:07:47 | <@OrIdow6> | But in any case saving sites like this is the whole "point" of ArchiveTeam |
| 04:09:46 | <@OrIdow6> | So ArchiveTeam will do it for you |
| 04:09:51 | | DogsRNice quits [Read error: Connection reset by peer] |
| 04:09:55 | <jacobk> | Oh |
| 04:10:30 | <@OrIdow6> | But on JS... there are no easy solutions. The only sure way (when working with warcs) is to make the right requests and save those |
| 04:11:37 | <@OrIdow6> | There are various things that e.g. reserialize the finished page, put it into images, run a headless browser and capture all traffic, etc., but nothing works 100% of the time |
| 04:18:17 | <jacobk> | With Wordplay, it should be feasible to programmatically determine whether the page loaded properly or not, so a script that checks saved pages and resaves broken ones might make archival more reliable. |
| 04:19:40 | <jacobk> | If the DOM contains "Loading..." then it didn't finish, and if it says "not found" then the resource doesn't exist. |
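The check jacobk describes can be sketched as a small classifier over saved DOM dumps. This is an illustration, not the script anyone in the channel ran: the function and enum names are hypothetical, and matching the two marker strings case-insensitively is an assumption.

```python
from enum import Enum


class PageState(Enum):
    INCOMPLETE = "incomplete"  # still showing the loading placeholder
    MISSING = "missing"        # the course/lesson does not exist
    OK = "ok"                  # neither marker present


def classify_page(dom_text: str) -> PageState:
    """Classify a saved DOM dump by the two markers mentioned in the chat."""
    lowered = dom_text.lower()
    # Check the loading marker first: a half-rendered page must be re-fetched
    # regardless of what else it happens to contain.
    if "loading..." in lowered:
        return PageState.INCOMPLETE
    if "not found" in lowered:
        return PageState.MISSING
    return PageState.OK


def pages_to_refetch(saved: dict[int, str]) -> list[int]:
    """Return the IDs whose saved DOM never finished loading, in order."""
    return sorted(
        page_id
        for page_id, dom in saved.items()
        if classify_page(dom) is PageState.INCOMPLETE
    )
```

A plain substring test can misfire if a word list itself contains the phrase "not found", so a real checker would probably want to look at the specific element that carries the message.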
| 04:40:01 | | AntiLiberal joins |
| 04:42:08 | | AntiLiberal quits [Remote host closed the connection] |
| 04:42:20 | | AntiLiberal joins |
| 04:47:51 | | AntiLiberal quits [Remote host closed the connection] |
| 04:48:03 | | AntiLiberal joins |
| 04:59:54 | | aaaaa quits [Remote host closed the connection] |
| 05:00:54 | | HP_Archivist (HP_Archivist) joins |
| 05:10:52 | | Doranwen (Doranwen) joins |
| 05:59:28 | <@OrIdow6> | Yes |
| 06:00:52 | <@OrIdow6> | Can someone tell me what this does? https://transfer.archivete.am/inline/142eSP/spanish_thing_function_2.js Pe is a function that does some string transformation |
| 06:09:39 | <thuban> | OrIdow6: it munges the article a bit |
| 06:09:58 | <thuban> | OrIdow6: specifically, if the string e begins with '(el)' or '(el/la)' it returns 'el' + t; '(los)' or '(los/las)', 'los' + t; '(la)', 'la' + t; '(las)', 'las' + t; otherwise, just t |
| 06:11:15 | <thuban> | (weird way to implement that) |
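thuban's reading of the function can be restated as a lookup table. The sketch below follows the chat description only; the original JavaScript is not reproduced in this log, so the name `pe`, the argument names `e` and `t`, and the plain concatenation without a separator are taken from the description, not from the source file.

```python
# Prefix markers mapped to the article they produce, per thuban's description.
ARTICLE_BY_MARKER = {
    "(el)": "el",
    "(el/la)": "el",
    "(los)": "los",
    "(los/las)": "los",
    "(la)": "la",
    "(las)": "las",
}


def pe(e: str, t: str) -> str:
    """Prepend the article implied by e's leading marker to t, else return t."""
    for marker, article in ARTICLE_BY_MARKER.items():
        # Each marker ends in ")", so "(la)" cannot accidentally match "(las)".
        if e.startswith(marker):
            return article + t
    return t
```

Because every marker includes its closing parenthesis, the order of the table entries does not affect the result.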
| 06:13:35 | <@OrIdow6> | Thank you thuban |
| 06:13:53 | <thuban> | yw |
| 06:27:12 | | vela quits [Quit: vela] |
| 06:27:41 | | vela (vela) joins |
| 07:15:13 | | BlueMaxima quits [Client Quit] |
| 07:19:30 | | pbm joins |
| 07:36:59 | | HP_Archivist quits [Ping timeout: 258 seconds] |
| 08:00:00 | | shoghicp quits [Ping timeout: 250 seconds] |
| 08:00:16 | | shoghicp (shoghicp) joins |
| 08:35:18 | | nuroten quits [Remote host closed the connection] |
| 08:56:46 | | Megame (Megame) joins |
| 09:03:56 | | HP_Archivist (HP_Archivist) joins |
| 09:10:08 | | HP_Archivist quits [Ping timeout: 258 seconds] |
| 09:42:16 | | shoghicp quits [Ping timeout: 250 seconds] |
| 09:44:27 | | shoghicp (shoghicp) joins |
| 10:01:46 | | shoghicp quits [Ping timeout: 250 seconds] |
| 10:06:56 | | shoghicp (shoghicp) joins |
| 10:10:33 | <@OrIdow6> | Alright wordplay should be ready in a bit |
| 10:13:01 | <rewby> | Cool. How are we doing this? AB, DPOS project, qwarc? |
| 10:13:14 | <@OrIdow6> | Warrior |
| 10:13:19 | <@OrIdow6> | I.e. DPOs |
| 10:13:39 | <rewby> | Aight. Let me know when you've got a channel or need a target |
| 10:14:28 | <@OrIdow6> | Ok |
| 10:14:47 | <rewby> | I'll do targets for this one. |
| 10:36:20 | <@OrIdow6> | arkiver: Can I get quick approval on https://github.com/OrIdow6/wordplay-grab ? Says it shuts down the 1st |
| 10:38:13 | <thuban> | will we be making this the warrior default? |
| 10:39:20 | <Megame> | "Kazakh human rights activist in need of data backup after Youtube channel was deleted, but reinstated back." |
| 10:39:21 | <Megame> | https://twitter.com/HumanKazakh/status/1409680883838185478 |
| 10:39:24 | <@OrIdow6> | Needs a backfeed key BTW |
| 10:39:45 | <@OrIdow6> | thuban: Not my decision, if you're asking me |
| 10:40:11 | <thuban> | nope, general question/suggestion |
| 10:40:25 | <@OrIdow6> | Oh |
| 10:53:50 | <rewby> | EggplantN: ^ Can you get OrIdow6 a backfeed key |
| 11:14:30 | <@EggplantN> | Not for a couple hours. |
| 11:14:39 | <@EggplantN> | Fusl can or arkiver can sorry I’ve had to run out |
| 11:19:47 | <@arkiver> | OrIdow6: checking |
| 11:19:59 | <@arkiver> | let me handle the backfeed key |
| 11:23:19 | <@arkiver> | OrIdow6: code looks good, pretty nice improvement over the old code of wikidot |
| 11:23:48 | <@arkiver> | i didnt test it, but will assume that its complete if you say so |
| 11:24:01 | <@arkiver> | (and dont have time to test for another few hours - so we'll just start that project |
| 11:27:47 | | pbm quits [Remote host closed the connection] |
| 11:34:49 | <@arkiver> | OrIdow6: you're admin on the wordplay tracker |
| 11:35:05 | <@arkiver> | if the project goes smooth, lets do it without a special channel |
| 11:35:13 | <@arkiver> | if more attention is needed, we'll create a channel |
| 11:35:37 | <rewby> | If you give me some vars I'll prepare a target |
| 11:40:31 | | aryashahnaughty joins |
| 11:40:56 | | aryashahnaughty quits [Remote host closed the connection] |
| 11:54:01 | <@arkiver> | rewby: thanks! its |
| 11:54:10 | <@arkiver> | archiveteam_wordplay_ |
| 11:54:12 | <@arkiver> | wordplay_ |
| 11:54:12 | <nyany> | lmk when that's up and i'll do a nyany |
| 11:54:16 | <@arkiver> | Archive Team Worldplay: |
| 11:54:18 | <@arkiver> | oops |
| 11:54:23 | <@arkiver> | last one is wrong |
| 11:54:26 | <@arkiver> | last one should be |
| 11:54:32 | <@arkiver> | Archive Team Wordplay: |
| 11:54:37 | <@arkiver> | nyany: its done |
| 11:54:41 | <rewby> | All right. What kind of speed are we expecting? |
| 11:55:18 | <@arkiver> | no idea |
| 11:55:36 | <nyany> | arkiver: brilliant. |
| 11:56:23 | <rewby> | Aight. Give me like 5 minutes and there'll be a target in the project |
| 12:01:01 | <@EggplantN> | Noice |
| 12:02:14 | <@arkiver> | started |
| 12:02:17 | <@arkiver> | all items have been queued |
| 12:03:54 | <rewby> | I'm setting the limit to 0 until I'm done with the target |
| 12:04:09 | <@HCross> | please leave it at 0 |
| 12:04:12 | <@HCross> | until target is ready |
| 12:14:34 | <rewby> | We've kicked off |
| 12:16:22 | <nyany> | well if I could get a build to succeed I'd help |
| 12:16:42 | <nyany> | i cannot get docker to run properly and buster-backports refuses to work |
| 12:22:59 | <nyany> | HCross: might be worth noting that the pubkeys for buster-backports need to be imported now |
| 12:24:05 | <AK> | https://univis.univie.ac.at/ausschreibungsstellensuche/ look recognisable to anyone? |
| 12:24:21 | <nyany> | yeah, looks like a pretty 404 |
| 12:24:37 | <nyany> | https://usercontent.irccloud-cdn.com/file/uQIJfdfW/image.png |
| 12:24:38 | <AK> | arkiver, the 404 you took out of #//, was it for that url above? |
| 12:26:56 | <@HCross> | OrIdow6: can we cut the exponential backoff? |
| 12:27:05 | <AK> | I got a lovely message from a Kevin at the university of Austria asking me to stop ddosing them lol https://share.aktheknight.co.uk/riJe0/bIyuFEku64.png/raw |
| 12:27:12 | <AK> | Apparently we're visiting that url a lot |
| 12:27:18 | <AK> | With unique params each time |
| 12:27:31 | <AK> | He'd like to know if we could tone it down, so they don't have to ban us |
| 12:29:32 | <@arkiver> | AK: can you forward that email to arkiver@protonmail.com? or else PM me the URL they linked in that email |
| 12:29:48 | <@HCross> | arkiver: OrIdow6 wordplay, we seem to be exponentially backing off on courses that don't exist |
| 12:29:55 | <AK> | It's via twitter, I'll send you all the info now |
| 12:33:32 | <@arkiver> | HCross: oof 12 max tires |
| 12:33:33 | <@arkiver> | tries |
| 12:33:55 | <@arkiver> | OrIdow6: please check the 400s |
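To see why backing off exponentially on nonexistent courses hurts, it helps to add up the waits. The sketch below is not the project's actual retry code (that lives in the grab scripts, which this log does not quote). The 1-second base delay and the rule that any 4xx except 429 is terminal are assumptions made for illustration; only the 12 max tries comes from the chat.

```python
def backoff_delays(max_tries: int, base_seconds: float = 1.0) -> list[float]:
    """Delay before each retry: base, 2*base, 4*base, ...

    With max_tries attempts there are max_tries - 1 waits between them.
    """
    return [base_seconds * 2 ** n for n in range(max_tries - 1)]


def should_retry(status: int) -> bool:
    """Decide whether a response is worth retrying."""
    # Assumption: a 4xx here means the item does not exist, so retrying
    # cannot help. 429 (rate limiting) is the usual exception.
    if status == 429:
        return True
    return not 400 <= status < 500
```

Under these assumptions, an item that always returns 400 would cost `sum(backoff_delays(12))` seconds of waiting before giving up, whereas treating the 400 as final costs a single request.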
| 12:34:25 | <rewby> | We've slowed down massively, I'm barely getting 10mbps inbound on average on the target. |
| 12:34:59 | <@arkiver> | HCross: should be better |
| 12:43:44 | <nyany> | nice |
| 12:58:47 | <@HCross> | arkiver: I never saw any backed in use |
| 13:02:00 | <@HCross> | arkiver: Can we have a multi-size of 2 please |
| 13:02:34 | <@HCross> | I think 4XX errors are cancelling the entire multi-item |
| 13:06:56 | | PlsNoJava quits [Quit: ttfn] |
| 13:16:01 | | univie-kd joins |
| 13:17:50 | <univie-kd> | Hi guys! I was told to mention this issue here: Some of your researchers are unintentionally DoSing one of our subdomains by crawling the same URL over and over again, sending thousands of requests, because the URL utilizes a uniq flow-id for every request :/ |
| 13:19:01 | <nyany> | univie-kd: is this that subdomain: https://univis.univie.ac.at/ausschreibungsstellensuche/ |
| 13:19:19 | <univie-kd> | yep, exactly |
| 13:19:28 | <nyany> | I believe this is being looked into, but AK arkiver |
| 13:19:30 | <rewby> | arkiver, AK: It looks like someone from that uni has showed up. |
| 13:21:07 | <univie-kd> | Yeah, I contacted one of your guys via twitter, because we could identify his IP and we don´t want to block the IPs on the firewall, as your project is definitely not harmful :) Just wanted to drop by here and spread the word, so you can fix this, if possible |
| 13:21:15 | <rewby> | univie-kd: We've just pinged the people who usually deal with these situations. They'll be around soon. |
| 13:21:22 | <Jake> | univie-kd: any idea on the useragent? |
| 13:22:36 | <@arkiver> | univie-kd: yes i've been pinged about it, thanks for joining |
| 13:22:38 | <@arkiver> | looking inot it |
| 13:22:40 | <@arkiver> | into* |
| 13:23:05 | <univie-kd> | Nope, I was only given a list of IPs and sent on the hunt :) I can ask the Sysadmins if they can tell me the user-agents |
| 13:23:10 | <univie-kd> | Thanks for looking into it! |
| 13:24:11 | <@arkiver> | univie-kd: useragent is a browser agent |
| 13:24:16 | <@arkiver> | how did you find it was us? |
| 13:24:45 | <@arkiver> | right i see the URLs |
| 13:25:02 | <rewby> | Maybe just put in an ignore or something? |
| 13:25:09 | <univie-kd> | tracked down the IPs and a few of them led to one of "your guys" :) via twitter contacted him and he said that he is doing stuff for your project |
| 13:25:11 | <@arkiver> | jsessionid and/or _flowexecutionkey are the problem here i guess |
| 13:25:18 | <@arkiver> | nice |
| 13:25:23 | <@arkiver> | yeah, putting in a block for this one |
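The fix applied here was a block on the URL. A complementary approach, sketched below, is to strip session parameters before deduplicating so that the same page is not queued again under a fresh key. This is not what was deployed; the function name is hypothetical, and the two parameter names are the ones arkiver mentions, compared case-insensitively.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter names taken from the chat; matched case-insensitively.
SESSION_PARAMS = {"jsessionid", "_flowexecutionkey"}

# jsessionid is often carried as a path parameter: /page;jsessionid=ABC
_PATH_JSESSIONID = re.compile(r";jsessionid=[^/?#;]*", re.IGNORECASE)


def canonical_url(url: str) -> str:
    """Return url with session-only parameters removed, for use as a dedup key."""
    parts = urlsplit(url)
    path = _PATH_JSESSIONID.sub("", parts.path)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode(kept), parts.fragment)
    )
```

Two requests that differ only in their flow key then map to the same canonical URL, so a crawler that tracks seen URLs by this key would fetch the page once.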
| 13:25:44 | <univie-kd> | Thanks for reacting so quickly, really appreciate it |
| 13:26:15 | <rewby> | We're just here to preserve the internet, not to cause it to go down. ;) |
| 13:26:48 | <thuban> | was this via #//? what was queueing it? |
| 13:28:07 | <univie-kd> | Haha, yeah we just wanted to get in touch with you, because as I said, we don´t mind crawlers archiving our sites at all - would be a shame to block your IPs, as you don´t have any harmful intention! |
| 13:28:43 | <@arkiver> | thuban: yes this is #// |
| 13:29:05 | <@arkiver> | we indeed dont have harmful intentions :) |
| 13:29:19 | <h3ndr1k> | univie-kd: I think all of the projects contain a hint to this channel in case there are problems. That way you could have found us earlier. But it seems to have worked out anyway. |
| 13:29:20 | <@arkiver> | opposite - we're trying to preserve your content instead of getting it offline |
| 13:29:40 | <h3ndr1k> | univie-kd: *All of the projects User Agents. |
| 13:29:51 | <rewby> | I think urls might be using a chrome UA or something |
| 13:29:58 | <@arkiver> | yeah rewby is correct |
| 13:30:12 | <AK> | Oh Hi univie-kd o/ ~Alex |
| 13:30:13 | <h3ndr1k> | oh ok nevermind then |
| 13:30:26 | <@arkiver> | univie-kd: should be fixed, may take a few hours before you see no requests. the crawling is distributed over several machines, they all need to update |
| 13:30:36 | <univie-kd> | Thanks for the hint about the user agent. As I said, I was given only a list of IPs without further information. But it is good to know for the future - thank you very much! |
| 13:30:59 | <univie-kd> | Thanks for fixing it so fast - awesome community here, I am impressed |
| 13:31:03 | <univie-kd> | Hi Alex :) |
| 13:31:41 | <Jake> | thank you for alerting us to the problem! |
| 13:31:56 | <@arkiver> | AK: ^ :) |
| 13:31:58 | <rewby> | A conversation is always better than a block. Sadly, we often get blocked. |
| 13:32:07 | <AK> | Glad we got it all sorted :) |
| 13:36:29 | <rewby> | arkiver: Can you set multi-item on wordplay to something like 2 or even disable it entirely? Me and HCross think that some 4xx responses are cancelling whole multi-items instead of single items. And we've kinda ran out of to-do. |
| 13:39:35 | <Jake> | I believe it is, a ton of mine got cancelled for one 400 |
| 13:41:23 | <rewby> | Yeah, we're kinda working around it by requeueing often, that causes some smaller items to be issued which get completed |
| 13:41:29 | <rewby> | So we're not at a complete standstill |
| 13:41:43 | <rewby> | But I'd rather we just turn the multi-item way down or off |
| 13:42:20 | <@arkiver> | rewby: yes |
| 13:42:50 | <rewby> | There's only a few items left and I'm not worried about small files |
| 13:43:27 | <@arkiver> | done |
| 13:43:57 | <@arkiver> | OrIdow6: in general, it's better to use a low multi item size (or disable it entirely) when there is not a large number of todo items |
| 13:44:16 | <@arkiver> | the 150k in this case is a good example of a small number of items, multi item size 1 is fine for those |
| 13:46:27 | <rewby> | arkiver: Thanks! This is going much better |
| 13:47:49 | <@arkiver> | good! |
| 13:48:16 | <@HCross> | arkiver: have you set it so that we mark 400s as done in a few minutes |
| 13:48:19 | <@HCross> | after that. we'll be done |
| 13:48:24 | <@HCross> | *can you not have you |
| 13:49:46 | | Nikos410 joins |
| 13:50:42 | <rewby> | arkiver: Me and HCross think that we're done with non 400-ing items. Nothing's being handed in anymore and on his cluster it looks like all of the urls are 400ing. |
| 13:51:22 | <Jake> | all my new items are 400s as well |
| 13:52:36 | | univie-kd leaves |
| 14:04:59 | | britmob joins |
| 14:17:34 | | britmob quits [Ping timeout: 258 seconds] |
| 14:20:24 | <rewby> | arkiver: Yeah, I think we're done with succeeding items. Do you think you can rig it to accept the 4xx errors like HCross suggested? I'd like to wrap the project up |
| 14:21:08 | | lennier1 quits [Client Quit] |
| 14:23:46 | | lennier1 (lennier1) joins |
| 14:32:12 | | nostalgebraist quits [Client Quit] |
| 14:45:36 | | britmob joins |
| 14:53:50 | | benjins quits [Ping timeout: 250 seconds] |
| 15:01:07 | | nostalgebraist joins |
| 15:10:05 | | benjins joins |
| 15:10:38 | | benjins is now authenticated as benjins |
| 15:20:03 | | Arcorann__ quits [Ping timeout: 258 seconds] |
| 15:47:05 | | Nikos410 quits [Remote host closed the connection] |
| 15:54:39 | | nuroten joins |
| 17:06:59 | | somerando3 joins |
| 17:08:05 | <somerando3> | Does anyone have a list of ALL the facebook videos from https://www.facebook.com/hongkongfp/? I have a quick and dirty javascript that can grab all the live videos if I can scroll down through all of them, but the main videos page quickly gets gummed up and I haven't been able to scroll past ~1yr ago on the main video pages after several attempts. |
| 17:08:06 | <somerando3> | abccc was able to post a list like this for https://www.facebook.com/standnewshk/ several days ago. |
| 17:08:32 | <somerando3> | If anyone's interested, these are the lists I can provide. They're of live FB videos only (so missing some stuff). The lists have some metadata, which may be useful since no downloader I've tried has consistently gotten the FB metadata right: hkfp: https://www92.zippyshare.com/v/NoG8X74B/file.html; standnewshk:(probably incomplete, but less |
| 17:08:32 | <somerando3> | incomplete than my last list): https://www92.zippyshare.com/v/5VuglMD5/file.html |
| 17:14:40 | <nuroten> | thuban: just a heads-up, The Pulse has been axed https://podcast.rthk.hk/podcast/item.php?pid=205 co-host of Backchat, Stephen Vines (who used to host The Pulse), has resigned from the show. not sure how long it will continue to run https://podcast.rthk.hk/podcast/item.php?pid=177 |
| 17:16:00 | <nuroten> | (that's https://podcast.rthk.hk/podcast/thepulse_i.xml and https://podcast.rthk.hk/podcast/backchat.xml) |
| 17:26:25 | | spirit joins |
| 17:32:18 | <@JAA> | somerando3's files rehosted: https://transfer.archivete.am/12eR5x/hkfp.fb.live.videos.2021-6-30.tsv https://transfer.archivete.am/oR83S/standnewshk.fb.live_videos.2.tsv |
| 17:33:22 | <Jake> | I tried to download a bunch of facebook videos from the reddit list, some crazy ratelimits after 60 videos |
| 17:33:48 | <@JAA> | somerando3: There is probably no complete list. Facebook's a PITA. Their scrolling stuff was already broken sometimes, but the rate limits just render archiving any significant amount of stuff virtually impossible. |
| 18:11:43 | <@OrIdow6> | https://github.com/ArchiveTeam/wordplay-grab/pull/1 should fix the Wordplay problems |
| 18:14:44 | <@OrIdow6> | Since I do have tracker admin, I have set the minimum |
| 18:15:18 | <@OrIdow6> | EggplantN HCross etc. ^ |
| 18:18:15 | <tech234a> | https://twitter.com/minecraftearth/status/1410282240781828100 |
| 18:26:28 | <@OrIdow6> | For the benefit of those of us that don't have Electron IRC clients or whatever it is... |
| 18:26:30 | <@OrIdow6> | " Today we say farewell to Minecraft Earth. We are so incredibly thankful for this wonderful community and all the memories we have built together." |
| 19:03:27 | | noteness_ quits [Remote host closed the connection] |
| 19:03:45 | | noteness (noteness) joins |
| 19:08:20 | | HP_Archivist (HP_Archivist) joins |
| 19:12:30 | <@EggplantN> | Done oridow6 |
| 19:13:03 | <@OrIdow6> | Thank you |
| 19:24:24 | <@OrIdow6> | And it looks like it's done |
| 19:25:12 | <@OrIdow6> | Thank you everyone |
| 19:25:26 | <@OrIdow6> | Thank you for setting it up arkiver, I will keep the multiitem thing in mind |
| 19:25:30 | <rewby> | Cool. Can I clean up the targets? |
| 19:26:55 | <@OrIdow6> | Yes |
| 19:26:57 | <Jake> | good job! |
| 19:28:14 | <rewby> | I have received 17399830125 bytes of data from warriors. |
| 19:28:19 | <rewby> | Good job everyone |
| 19:31:37 | <jacobk> | Does that include all of the images and audio? |
| 19:33:32 | <jacobk> | (Based on wordplay.lua, it seems like it should have) |
| 19:34:08 | <rewby> | I'm not sure, someone'll have to double check the warc files |
| 19:34:59 | <jacobk> | Are those publicly downloadable yet? |
| 19:35:05 | <rewby> | Uh. Let me double check. |
| 19:36:14 | <jacobk> | I see something on archive.org uploaded at 12:45 today |
| 19:36:17 | <rewby> | I think our items are set to restricted by default... They'll show up in the wayback machine in a few days one way or another. But for direct download, that's something to ask arkiver. |
| 19:36:36 | <rewby> | Yeah, this is one of the megawarcs, https://archive.org/details/archiveteam_wordplay_20210630124018_80a396be |
| 19:38:05 | <@OrIdow6> | jacobk: Yes, otherwise it wouldn't be a proper archive |
| 19:41:59 | <@OrIdow6> | And I will say, that some resources 403, but this happens in the live version as well |
| 19:43:05 | <rewby> | Final upload in progress. https://s3.services.ams.aperture-laboratories.science/rewby/public/e27d35e6-bf8f-41f1-aec4-7a33a331c6f0/1625082175.3618443.png |
| 19:44:28 | <rewby> | And here's the other pack! https://archive.org/details/archiveteam_wordplay_20210630213237_4db6cf0c |
| 19:44:46 | <rewby> | If I'd known this wasn't gonna be that big I'd have upped my chunk size to like 25G |
| 19:45:39 | <jacobk> | OrIdow6: 403? I remember getting 400s, but not 403s. Do you have an example of a 403 page? Probably not an actual problem, just curious. |
| 19:46:26 | <@OrIdow6> | jacobk: Audio for (el) centro comunitario on https://wordplay.com/lesson/100204 |
| 19:46:59 | <@OrIdow6> | 403 is cloudfront.net's way of saying 404 in thsi cas |
| 19:47:01 | <@OrIdow6> | e |
| 19:47:22 | <rewby> | Here's a link to the megawarc if you want to double check playback, https://ia601506.us.archive.org/9/items/archiveteam_wordplay_20210630213237_4db6cf0c/wordplay_20210630213237_4db6cf0c.megawarc.warc.gz |
| 19:47:59 | <jacobk> | Oh yeah, I forgot about cloudfront. |
| 19:48:01 | | Hackerpcs quits [Quit: Hackerpcs] |
| 19:48:04 | <@arkiver> | OrIdow6: wordplay done? |
| 19:48:06 | <@arkiver> | HCross: rewby: looks like OrIdow6 fixed it :) |
| 19:48:21 | <rewby> | arkiver: Yep, he did. I'm just cleaning down |
| 19:48:34 | <@arkiver> | alright, taking project off the tracker |
| 19:49:00 | <@arkiver> | off the front page that is |
| 19:49:23 | | Hackerpcs (Hackerpcs) joins |
| 19:49:26 | <jacobk> | rewby: That link says not available |
| 19:49:29 | <rewby> | Bye bye target-0f9b709f, you did well. |
| 19:49:37 | <rewby> | jacobk: I think the item got set to restricted automatically |
| 19:49:54 | <rewby> | arkiver: Any reason that the wordplay items are access restricted on archive.org? |
| 19:50:32 | <@arkiver> | rewby: they are? |
| 19:50:38 | <@arkiver> | checking |
| 19:50:41 | <jacobk> | I suppose it's possible that the teachers who created lists for students didn't intend for them to be public. |
| 19:50:44 | <rewby> | See my links from earleir |
| 19:50:46 | <rewby> | *earlier |
| 19:50:53 | <jacobk> | But, they should be just words |
| 19:52:13 | <rewby> | We're checking. :) |
| 19:52:14 | <jacobk> | Teachers can search for other teachers' lists. Wordplay documentation might suggest that sharing courses is optional, but I didn't see any such option when I logged in as a teacher. |
| 19:52:35 | <@OrIdow6> | jacobk: This is something different |
| 19:52:52 | <@arkiver> | rewby: it'll become public when i move it out of the inbox collection |
| 19:53:18 | <rewby> | arkiver: Ah okay. Fair enough |
| 19:58:27 | <jacobk> | Oh nvm, the "share a course" link says to copy the URL and send it to a colleague. So the courses are probably searchable by default. |
| 20:10:19 | | spirit quits [Client Quit] |
| 20:15:21 | | Mikesky joins |
| 20:25:10 | | abccc quits [Remote host closed the connection] |
| 20:25:43 | | abcde joins |
| 20:34:12 | | @dxrt quits [Client Quit] |
| 20:35:41 | | dxrt joins |
| 20:35:43 | | dxrt is now authenticated as dxrt |
| 20:35:43 | | dxrt quits [Changing host] |
| 20:35:43 | | dxrt (dxrt) joins |
| 20:35:43 | | @ChanServ sets mode: +o dxrt |
| 20:47:08 | | DogsRNice (Webuser299) joins |
| 20:50:29 | | nostalgebraist quits [Ping timeout: 258 seconds] |
| 20:55:51 | | DogsRNice quits [Ping timeout: 258 seconds] |
| 21:02:17 | | abcde quits [Remote host closed the connection] |
| 21:02:59 | | DogsRNice (Webuser299) joins |
| 21:15:17 | | sec^nd quits [Remote host closed the connection] |
| 21:20:15 | | genofire quits [Quit: Gateway shutdown] |
| 21:20:16 | | pcr quits [Quit: Gateway shutdown] |
| 21:28:35 | | sec^nd (second) joins |
| 21:54:37 | | nertzy__ joins |
| 21:57:11 | | nertzy_ quits [Ping timeout: 258 seconds] |
| 22:29:34 | | Mikesky quits [Read error: Connection reset by peer] |
| 22:51:00 | | abcde joins |
| 22:53:31 | | abcde quits [Remote host closed the connection] |
| 22:55:49 | | abcdef joins |
| 22:57:21 | | hexa- quits [Quit: WeeChat 3.2] |
| 23:00:00 | | hexa- (hexa-) joins |
| 23:08:22 | <somerando3> | nuroten & thuban, I looked at some of those RTHK podcasts. It looks like the official RSS lists only include 1000 episodes, so for instance the 自由風自由PHONE RSS only goes back about 6 months. Actual audio is still available on their archive server, though. For instance this site has a much more complete list of 自由風自由PHONE (6139 |
| 23:08:23 | <somerando3> | episodes, going back to Jan 2014): https://www.podchaser.com/podcasts/phone-174719/about |
| 23:13:46 | <nuroten> | somerando3: the rss go back to a year ... thanks, that's a very nice find! |
| 23:17:02 | <nuroten> | some like The Pulse do go back longer, but if everything is on the archive server then it would make sense to fetch from there |
| 23:23:31 | <somerando3> | It looks like podchaser has an API with a free tier, but if that doesn't work the naming convention for the files looks pretty regular, however it seems there might have been a format change from mp3 to m4a at some point. |
| 23:27:25 | <somerando3> | e.g. I found this old link on some website to https://archive.rthk.hk/mp3/radio/contentIndex/radio1/openline_openview/mp3/20180503_2.mp3 |
| 23:29:11 | <somerando3> | this is a newer link: https://archive.rthk.hk/mp3/radio/contentIndex/radio1/openline_openview/m4a/20210615_4.m4a All the stuff I checked only seems to be in either one format or the other. |
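The naming convention somerando3 spotted can be turned into a candidate-URL generator. This is a sketch built only from the two links quoted above; reading the trailing number as a part or segment index is an assumption, as are the function and parameter names. Since each episode seems to exist in only one format, the sketch returns both candidates and leaves it to the caller to see which one resolves.

```python
from datetime import date

ARCHIVE_BASE = "https://archive.rthk.hk/mp3/radio/contentIndex"


def episode_urls(channel: str, programme: str, day: date, part: int) -> list[str]:
    """Return the mp3 and m4a candidate URLs for one episode part."""
    stamp = day.strftime("%Y%m%d")
    return [
        # The format name appears twice: as a directory and as the extension.
        f"{ARCHIVE_BASE}/{channel}/{programme}/{ext}/{stamp}_{part}.{ext}"
        for ext in ("mp3", "m4a")
    ]
```

For example, `episode_urls("radio1", "openline_openview", date(2018, 5, 3), 2)` yields the older mp3 link quoted above as its first element.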
| 23:33:18 | | Megame quits [Client Quit] |
| 23:49:58 | | abcdef quits [Remote host closed the connection] |