| 00:00:52 | | hupool quits [Ping timeout: 244 seconds] |
| 00:11:27 | <nuroten> | thuban: not necessarily suggesting this one for consideration, but to maybe convey a sense of what the site has that could be lost: https://podcast.rthk.hk/podcast/item.php?pid=205&year=2014&lang=en-US news commentary about The Umbrella movement, 25th anniversary of June 4th (a topic censored in mainland China) |
| 00:17:47 | | billy549 quits [Remote host closed the connection] |
| 00:21:03 | <nuroten> | along with a few social topics like housing strategy, academic freedom. as you mentioned, there's a lot of content, so whatever else (if anything) you decide might be interesting or worthwhile |
| 00:21:30 | | Arcorann (Arcorann) joins |
| 00:23:47 | <thuban> | unfortunately their server seems to be very, very slow (at least for me) |
| 00:24:04 | <nuroten> | a lot of the podcasts listed run for maybe a year or two and are complete/discontinued |
| 00:25:07 | <nuroten> | maybe their servers are being flooded with people trying to save bits and pieces of old shows :) |
| 00:25:10 | <thuban> | the first video apparently downloaded for 40 minutes and then hit an ECONNRESET |
| 00:25:41 | | billy549 (Billy549) joins |
| 00:25:57 | <thuban> | maybe a geo thing? anyone have servers nearby? |
| 00:29:19 | <thuban> | hm, or maybe i should be faking a useragent; it didn't seem to be this bad in the browser |
| 00:30:22 | | Arcorann quits [Ping timeout: 250 seconds] |
| 00:30:29 | | Sylirana quits [Read error: Connection reset by peer] |
| 00:30:34 | | blankie joins |
| 00:30:34 | | blankie is now authenticated as blankie |
| 00:30:34 | | blankie quits [Changing host] |
| 00:30:34 | | blankie (blankie) joins |
| 00:31:08 | | Sylirana (Sylirana) joins |
| 00:31:28 | <nuroten> | think they're streaming from akamai, at least in the browser |
| 00:34:18 | <thuban> | the site uses akamai; the xml feed links to a file (not a playlist) on archive.rthk.hk--but i was able to load a video from it in the browser earlier |
| 00:34:24 | <thuban> | can't seem to now, though |
| 00:34:43 | <thuban> | or maybe a little, just really incredibly slow |
| 00:35:18 | <thuban> | i guess i can rewrite to grab the akamai version |
| 00:35:34 | | hupool joins |
| 00:40:49 | <Inhonion> | T-minus 3:20 until the Y!A shutdown right? |
| 00:51:20 | | Mineroboter joins |
| 00:51:30 | | hupool quits [Ping timeout: 244 seconds] |
| 00:53:20 | | Mineroboter_ quits [Ping timeout: 250 seconds] |
| 00:56:35 | <@JAA> | arkiver: The thing you're probably thinking of hasn't been in operation for a while now, unfortunately. |
| 00:58:09 | <@arkiver> | i see |
| 01:02:46 | | dm4v quits [Client Quit] |
| 01:02:55 | | dm4v joins |
| 01:02:58 | | dm4v is now authenticated as dm4v |
| 01:02:58 | | dm4v quits [Changing host] |
| 01:02:58 | | dm4v (dm4v) joins |
| 01:28:54 | | pcr leaves |
| 01:28:56 | | pcr joins |
| 01:36:15 | | hupool joins |
| 01:59:42 | | hupool quits [Ping timeout: 244 seconds] |
| 02:08:57 | | cmlow (cmlow) joins |
| 02:32:33 | | hupool joins |
| 02:36:28 | <@JAA> | MeriStation Comunidad Zonaforo qwarc grab is started. I'm only retrieving the thread pages. Their servers are horribly slow at an average response time of 4 seconds, so we'll see how that goes. |
| 02:38:21 | <@OrIdow6> | arkiver: From what I've experienced (have not systematically tested it), there's a short ban of between 10 hours and a day; then if you continue after that, it's permanent (or long enough that I haven't been unbanned yet) |
| 02:49:49 | | Webuser639 quits [Ping timeout: 244 seconds] |
| 03:14:28 | <@JAA> | I can't go very hard at MeriStation. Starting to see timeouts and DB errors at only 200 connections. Average response time also increased to 6.5 seconds. This is the most I can get out of it I think. |
| 03:15:51 | <@JAA> | Gives an ETA of 46 hours or so. Not fast enough, sadly. |
| 03:16:24 | <@JAA> | Less than 2 days of lead time after 21 years... :-| |
| 03:20:53 | <jodizzle> | :( |
| 03:46:01 | | qw3rty_ joins |
| 03:49:14 | | hupool quits [Ping timeout: 244 seconds] |
| 03:49:57 | | qw3rty__ quits [Ping timeout: 258 seconds] |
| 03:54:22 | | hupool joins |
| 04:07:44 | | etnguyen03 quits [Client Quit] |
| 04:20:45 | | hupool quits [Ping timeout: 244 seconds] |
| 04:26:47 | | DogsRNice quits [Read error: Connection reset by peer] |
| 05:09:18 | | HackMii_ quits [Remote host closed the connection] |
| 05:09:42 | | HackMii_ (hacktheplanet) joins |
| 05:25:37 | | NF885 (NF885) joins |
| 05:37:55 | | Ella quits [Remote host closed the connection] |
| 05:44:47 | <thuban> | hey avoozl, how's your xenforo support? |
| 05:45:48 | <avoozl> | Currently, non-existent, but adding new parsers is pretty doable |
| 05:47:49 | <thuban> | i have some warcs if you want raw material |
| 05:51:49 | <avoozl> | thuban: I currently am working first on getting it to go a bit faster so I can build the yahoo answers index, but pointers are always welcome. |
| 05:52:11 | <avoozl> | thuban: if you want to take a look how the parsers are currently implemented... the league of legends forum parser looks like this: https://paste.ofcode.org/QHnHH4ErUvsnmCptW4SiH2 |
| 05:52:30 | <avoozl> | thuban: basically a bunch of selectors to get the right bits from the page, construct them into a Post object, and the indexer takes it from there |
| 05:53:39 | <thuban> | whoa, go :o |
| 05:54:00 | <avoozl> | yeah I figured for something self-hosted that'd be easiest and most compact |
| 05:54:03 | <thuban> | been meaning to get into that, i'll have a look |
| 05:55:00 | <avoozl> | that html sanitization is probably going to be removed here, I'll make that a task of the front-end serving the html instead. It is all still pretty much in motion |
| 05:55:31 | | thuban nods |
| 05:55:52 | <avoozl> | for yahoo answers things are much more complex, as there are json payloads and other odd things in there (multiple payload types that each refer to the same type of data) |
| 06:26:29 | <avoozl> | I'm parsing the yahoo scrape at around 7MB/sec, most of that is spent on CPU-limited tasks, and my download speed from archive.org is pretty low so I'm still 9 days behind (20210423 is downloaded) |
| 06:30:26 | | NF885 quits [Ping timeout: 244 seconds] |
| 06:43:00 | | Sylirana quits [Remote host closed the connection] |
| 06:43:00 | | Sylirana joins |
| 06:43:00 | | Sylirana is now authenticated as Sylirana |
| 06:43:00 | | Sylirana quits [Changing host] |
| 06:43:00 | | Sylirana (Sylirana) joins |
| 06:44:51 | | NF885 (NF885) joins |
| 06:45:19 | | LeighR (LeighR) joins |
| 06:53:16 | | Webuser743 joins |
| 06:53:31 | | Webuser743 is now known as VukkyWork |
| 06:55:10 | | VukkyWork quits [Remote host closed the connection] |
| 06:57:17 | | VukkyWork (VukkyWork) joins |
| 07:05:03 | | NF885 quits [Ping timeout: 244 seconds] |
| 07:05:58 | | blankie quits [Client Quit] |
| 07:06:10 | | blankie joins |
| 07:06:10 | | blankie is now authenticated as blankie |
| 07:06:10 | | blankie quits [Changing host] |
| 07:06:10 | | blankie (blankie) joins |
| 07:11:03 | <LeighR> | Re: ArchiveBot - is it better for me to do an initial run with grab-site to see if there's some giant forum archive off to the side, or if a blind pull of the site ends up pulling a lot of external, uninteresting content (like an awful lot of 3rd party Wordpress controls or gstatic.com fonts?) |
| 07:12:12 | <LeighR> | Because one of the sites jo-dizzle put in for me just seems to be ballooning |
| 07:13:04 | <thuban> | if you have voice (which you can probably just ask nicely for) you can alter ignoresets on the fly as with grab-site |
| 07:14:04 | <jodizzle> | Yes, we regularly check in on crawls and add ignores as appropriate |
| 07:14:55 | | x9fff00 quits [Client Quit] |
| 07:15:05 | <jodizzle> | Leaving the forum in that job was intentional LeighR, but you're right that it probably needs some ignores |
| 07:15:53 | <LeighR> | ok |
| 07:16:39 | | x9fff00 (x9fff00) joins |
| 07:18:46 | <LeighR> | is it better in cases like this to instruct AB to ignore off-site links? |
| 07:19:50 | <LeighR> | at least in the forums? |
| 07:20:17 | <LeighR> | I know that it needs them to make the site itself display correctly |
| 07:25:37 | <jodizzle> | For very large forums it's usually better to ignore off-site links. I guess we'll see if this requires that. |
| 07:26:50 | <jodizzle> | Note though that if using `--no-offsite-links` when launching the job, AB will still pick up off-site page dependencies, like stylesheets and such |
| 07:27:56 | | Kinille (Kinille) joins |
| 07:36:11 | <LeighR> | for something like dwiggie.com, offsite links to individual files (images) would have made sense to keep, but probably not the complete contents needed to render an Amazon page for a book someone in a forum post recommended |
| 07:39:05 | <LeighR> | is there a way to make AB ignore offsite links for specific paths, or to only allow them for specific paths? |
| 07:39:26 | <LeighR> | or should a site be broken into separate jobs? |
| 07:47:00 | <thuban> | you can use regex ignores to manage offsite links (but it won't be as simple as applying one set of rules to same-domain urls and another to offsites) |
| 07:53:28 | <LeighR> | would it make more sense to break a site like this into "not-forum" and "forums"? |
| 07:53:46 | <LeighR> | the stuff under "not-forum" is the important part |
| 08:24:34 | | Arcorann (Arcorann) joins |
| 08:28:53 | <avoozl> | thuban: just in case, I've added a #warceater channel for anything related to this code I'm building |
| 08:36:33 | | avoozl is now authenticated as avoozl |
| 08:47:58 | | VukkyWork quits [Client Quit] |
| 08:47:58 | | nuroten quits [Client Quit] |
| 08:47:58 | | Inhonion quits [Client Quit] |
| 08:47:58 | | LeighR quits [Client Quit] |
| 08:47:59 | | Sylirana quits [Remote host closed the connection] |
| 08:48:05 | | Sylirana (Sylirana) joins |
| 09:12:28 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 09:24:00 | | Wayward (wayward) joins |
| 09:24:39 | | VukkyWork (VukkyWork) joins |
| 09:52:58 | | Jonboy345 quits [Ping timeout: 258 seconds] |
| 10:02:54 | | hupool joins |
| 10:17:46 | | hupool quits [Ping timeout: 244 seconds] |
| 10:19:59 | | kiskaLogBot quits [Remote host closed the connection] |
| 10:21:23 | | VukkyWork quits [Ping timeout: 244 seconds] |
| 10:25:14 | | VukkyWork (VukkyWork) joins |
| 10:33:13 | | kiskaLogBot joins |
| 11:25:27 | | Jonboy345 joins |
| 11:32:20 | | UnderMybrella joins |
| 11:32:39 | | Inhonion joins |
| 11:32:43 | | VukkyWork quits [Remote host closed the connection] |
| 11:53:44 | | kiskaLogBot quits [Ping timeout: 250 seconds] |
| 11:55:07 | | kiskaLogBot joins |
| 12:09:57 | <s-crypt> | Is there a dashboard to view the staging server progress anywhere? |
| 12:11:31 | <s-crypt> | http://fos.textfiles.com/pipeline.html doesn't seem to contain anything yahoo related afaik |
| 12:12:29 | <rewby> | If you're looking for Yahoo Answers related status, then no. There isn't a statuspage that shows the upload progress. |
| 12:23:21 | <@EggplantN> | yeah we dont have that public on any projects |
| 12:29:45 | | etnguyen03 (etnguyen03) joins |
| 12:33:26 | | Stiletto joins |
| 12:38:21 | <avoozl> | does anyone know how large the yahoo answers set will be in total? I may have to clean out some space for that |
| 12:39:18 | <Jake> | Tracker says 4.75TiB compressed for the new project, and 30TB (uncompressed) for the 2016 project. |
| 12:42:28 | <avoozl> | 4.75TiB sounds good. I've got around 2.5 downloaded so I will need to create some extra space |
| 12:42:32 | <avoozl> | thanks |
| 12:43:01 | <avoozl> | I'll probably need around 3TiB for the index as well. this will be interesting juggling some free space |
| 12:59:58 | | serx joins |
| 13:34:09 | | Webuser639 joins |
| 13:45:46 | <@HCross> | arkiver: isolario is on its way to the IA :) |
| 13:48:52 | <@EggplantN> | so is a random 3TB of webs, 3.8TB of bintray (once I have vars) |
| 13:48:57 | <@EggplantN> | and im sure im about to find more crap |
| 13:49:41 | <@HCross> | bets on finding a folder of G+ somewhere? |
| 13:50:21 | <@EggplantN> | i dont think i have anything that old |
| 14:01:24 | <@arkiver> | Google+ was a nice one |
| 14:01:28 | <@arkiver> | sounds good HCross! |
| 14:18:10 | | Arcorann quits [Remote host closed the connection] |
| 14:18:35 | | Arcorann (Arcorann) joins |
| 14:19:03 | | Webuser639 quits [Ping timeout: 244 seconds] |
| 14:20:40 | | Arcorann quits [Remote host closed the connection] |
| 14:21:02 | | Arcorann (Arcorann) joins |
| 14:34:25 | | NF885 (NF885) joins |
| 14:42:06 | <hilda> | here's another favicon idea with a 3.5" floppy: https://i.imgur.com/ChCYwKs.png https://i.imgur.com/XDfSEOv.png |
| 14:42:21 | | NF885 quits [Remote host closed the connection] |
| 14:59:54 | | Hackerpcs quits [Quit: Hackerpcs] |
| 15:01:53 | | Hackerpcs (Hackerpcs) joins |
| 15:06:23 | | Hackerpcs quits [Client Quit] |
| 15:07:34 | | Hackerpcs (Hackerpcs) joins |
| 15:07:42 | | Webuser639 joins |
| 15:22:38 | | Matthww quits [Ping timeout: 258 seconds] |
| 15:24:33 | | Arcorann quits [Ping timeout: 258 seconds] |
| 15:33:08 | | Matthww joins |
| 15:41:46 | <@JAA> | So that MeriStation archive didn't go well... Slowed to a crawl, then I got banned. |
| 15:42:32 | | PlsNoJava7 (ROpdebee) joins |
| 15:43:07 | | PlsNoJava quits [Read error: Connection reset by peer] |
| 15:43:07 | | PlsNoJava7 is now known as PlsNoJava |
| 15:43:58 | | Webuser639 quits [Remote host closed the connection] |
| 15:48:07 | <@JAA> | Looks like I only got maybe 7 % of it up to that point. |
| 15:50:37 | | blankie quits [Ping timeout: 258 seconds] |
| 15:56:57 | | nuroten joins |
| 16:06:20 | <serx> | the meristation case is literally incredible |
| 16:11:32 | | spirit joins |
| 16:46:03 | | DogsRNice (Webuser299) joins |
| 16:52:07 | | hooway joins |
| 17:04:00 | <nuroten> | thuban: is there a way to feed the rss xml to wget or some app and have it download the links inside, then format it for upload to IA? I tried to download a few audio podcast episodes manually (leaving aside for a moment the file descriptions in the xml being cropped so still have to find a way to fetch those) |
| 17:07:18 | <nuroten> | the AT wiki page on wget has a command for webpages I'm trying as well, but haven't managed to adjust it to narrow down fetching to just the pages related to a single podcast |
| 17:17:45 | | marked quits [Client Quit] |
| 17:18:37 | | marked joins |
| 17:25:58 | | nschmeller joins |
| 17:29:39 | <nschmeller> | Hi! I hope this is the right channel for this question--the Clash of Clans forums are being shut down in a couple months, and I'm wondering how I can archive them. I saw that there was a script for getting Yahoo Answers on the Internet Archive based off its sitemap, does anyone know where to find that script? |
| 17:33:18 | <@arkiver> | nschmeller: what is the URL for the forum |
| 17:33:59 | <@JAA> | https://forum.supercell.com/ |
| 17:34:07 | <@JAA> | https://forum.supercell.com/showthread.php/1953693-End-of-the-Official-Supercell-Forums |
| 17:34:59 | <@JAA> | Read-only in June, shutdown in August |
| 17:36:12 | <nschmeller> | Yup, ^^ |
| 17:36:36 | <@arkiver> | looks like sequential IDs |
| 17:36:45 | <nschmeller> | If i'm reading correctly, someone with permissions will have to point the archive bot at the main webpage and it'll get everything? |
| 17:36:58 | <@arkiver> | even the members have sequential IDs |
| 17:37:14 | <@arkiver> | JAA: is this small enough for archivebot? |
| 17:37:14 | <@JAA> | Yup, standard forum, but with session ID hell. |
| 17:37:51 | <nschmeller> | What is session ID hell? |
| 17:38:00 | <@JAA> | Too big for AB, but I can do it with qwarc. |
| 17:39:08 | <@JAA> | nschmeller: When you access it without cookies, it adds an 's' parameter to every link. As the session expires after a while, it inevitably devolves into a huge mess of different session IDs being crawled etc. |
| 17:40:13 | <nschmeller> | Interesting, sounds annoying. Does that mean that the same page might be archived multiple times once a session expires? |
| 17:41:01 | <@JAA> | Yeah |
| 17:41:12 | <@JAA> | It would keep recursing through the site endlessly. |
| 17:41:18 | <nschmeller> | Doesn't sound good |
| 17:41:37 | <nschmeller> | What can I do to help? |
| 17:48:12 | <@JAA> | I'll get this sorted. :-) |
| 17:50:51 | <nschmeller> | Awesome!! I'm surprised I haven't come across this group earlier, I've been religiously contributing to the IA since 2016 |
| 17:57:05 | | nschmeller quits [Ping timeout: 244 seconds] |
| 18:01:00 | | aarchi quits [Read error: Connection reset by peer] |
| 18:01:44 | | aarchi (aarchi) joins |
| 18:10:55 | | nico_32 quits [Ping timeout: 258 seconds] |
| 18:24:51 | | LeGoupil joins |
| 18:26:11 | <Ryz> | Uhhh, should we do a proactive archiving of Giant Bomb? https://www.giantbomb.com/ More and more people are leaving Giant Bomb, 3 notable people are Vinny Caravella, Alex Navarro, and Brad Shoemaker |
| 18:26:42 | <Ryz> | Ever since it was acquired from CBS Interactive, it has been bleeding talent over time S: |
| 18:27:07 | <Ryz> | Apparently, there's only 2 notable people left :/ |
| 18:36:07 | <thuban> | nuroten: that is more or less what i'm doing |
| 18:36:50 | <thuban> | the trouble is that their video _and_ their web pages _and_ apparently their CDN are all a bit flaky, so there's a lot of retrying involved |
| 18:38:21 | <nuroten> | thuban: nice. yeah, their servers are slow |
| 18:39:24 | <nuroten> | did you manage to get the equivalent akamai urls? not that it's less flaky, hoped it would be a bit faster |
| 18:39:53 | <thuban> | i did |
| 18:42:02 | <thuban> | the 2020 ones were fine but 2019 (and presumably earlier) are giving streamlink problems; can't investigate now but will look at it this evening |
| 18:42:02 | <nuroten> | that's good, is that exposed/extractable via browser inspector? I saw some m3u8 playlist files with *.ts fragments but not sure how to put it back together (or maybe that's not it) |
| 18:43:01 | <thuban> | there are tools to handle those but, as i say, problems |
| 18:43:56 | <nuroten> | okay ... wouldn't be too surprised if the 2019 ones are flaky, it was one of the more eventful years |
| 18:45:14 | <nuroten> | thanks a lot for your work on this! |
| 18:46:30 | | Doranwen quits [Remote host closed the connection] |
| 18:47:16 | <nuroten> | I still have to check Youtube, if the quality is identical maybe grabbing from there is another option |
| 18:51:11 | | Barto quits [Quit: WeeChat 3.1] |
| 18:52:17 | <@arkiver> | thuban: are you archiving those RTHK videos? |
| 18:53:07 | <thuban> | arkiver: yes |
| 18:53:28 | <@arkiver> | thuban: alright, any details on what is being archived exactly and how? |
| 18:54:24 | <thuban> | podcast episodes + thumbnails + metadata (scraped from xml feed and episode pages) |
| 18:54:42 | <@arkiver> | and videos? |
| 18:54:49 | <@arkiver> | or are those videos |
| 18:55:02 | <thuban> | that's what i meant by "episodes" |
| 18:55:06 | <thuban> | i can throw the episode pages into archivebot too if we want provenance |
| 18:55:33 | <@arkiver> | yeah try to get everything into the Wayback Machine at least |
| 18:55:42 | <thuban> | k, will do |
| 18:55:43 | <@arkiver> | that is also the audio/video files themselves |
| 18:56:27 | <thuban> | that is likely to be problematic but i will generate the list |
| 18:59:24 | <@arkiver> | right i see podcast/rthk.hk |
| 19:04:58 | | goodtime joins |
| 19:08:49 | | ragu_ joins |
| 19:11:22 | | Daloader quits [Quit: Leaving] |
| 19:12:16 | | ragu quits [Ping timeout: 250 seconds] |
| 19:12:20 | | Doranwen (Doranwen) joins |
| 19:15:16 | | Barto (Barto) joins |
| 19:16:47 | | Barto quits [Client Quit] |
| 19:17:00 | | Barto (Barto) joins |
| 19:22:20 | <thuban> | oof, i see the problem: episodes more than a year old aren't on their cdn at all; they also come in a playlist version, but it's self-hosted as well. if i can get one down i will compare the quality to the 'archive' mp4 and act accordingly |
| 19:24:41 | | Barto quits [Remote host closed the connection] |
| 19:24:53 | | Barto (Barto) joins |
| 19:29:09 | | redlizard joins |
| 19:38:03 | <mgrandi> | Is all of their stuff not on youtube? |
| 19:38:18 | <mgrandi> | I just checked a recent video and it's just a youtube embed |
| 19:39:42 | | etnguyen03 quits [Read error: Connection reset by peer] |
| 19:41:06 | | spirit quits [Client Quit] |
| 19:41:13 | <thuban> | these videos are not youtube embeds |
| 19:41:18 | <mgrandi> | I don't see any indication that the site is going anywhere but it's good to get a backup |
| 19:41:46 | <masterX244> | yeah, better to back stuff up than to be forced into an emergency rescue |
| 19:41:55 | <thuban> | they have a playlist for "hong kong connection" (this show), but many, many of the videos are unavailable https://www.youtube.com/playlist?list=PLuwJy35eAVaJ-DaWHYe8PK6Yg-cyEMVo1 |
| 19:42:19 | <@JAA> | Ryz: Yes re: Giant Bomb. |
| 19:43:03 | <Ryz> | Giant Bomb has forums, a wiki, and has premium content (requires a subscription to access that kind of content) |
| 19:43:14 | <Ryz> | On top of being a news and media website for video games |
| 19:43:39 | <Ryz> | This should expand to the other related websites that are under Red Ventures |
| 19:43:42 | <mgrandi> | https://www.giantbomb.com/shows/returnal/2970-21070 |
| 19:43:59 | <mgrandi> | So checking their recent video lists, I'd say 3/4 of them are on youtube |
| 19:44:09 | <mgrandi> | And some of them on the site are youtube embeds, such as ^ |
| 19:45:12 | <Ryz> | This isn't the first time there have been calls to archive this stuff; Jason Scott posted a message on Twitter encouraging ArchiveTeam to do it |
| 19:45:28 | <mgrandi> | But yes there are some that are not on youtube, such as https://www.giantbomb.com/shows/4-30-2021-g-is-for-golden/2970-21074 |
| 19:46:28 | <mgrandi> | I can get their recent twitch videos as a low res backup copy, since they will most likely end up on youtube as a higher res copy and hard drives are expensive now :-\ |
| 19:48:17 | <lunik1> | ouch, youtube-dl does not like that link. Has a GiantBomb extractor but maybe it's unmaintained/broken? |
| 19:48:39 | <thuban> | ytdl is not known for keeping up with its prs; try youtube-dlc? |
| 19:49:06 | <mgrandi> | Isn't there another one besides that that is even more up to date |
| 19:49:27 | <lunik1> | youtube-dlc hasn't had a commit to master since October |
| 19:50:18 | <lunik1> | *December |
| 19:50:41 | <Ajay> | yt-dlp |
| 19:51:27 | <lunik1> | there is a download link but it's only for the audio, but the video just seems to be a placeholder |
| 19:53:05 | | hooway_ joins |
| 19:53:05 | | hooway quits [Read error: Connection reset by peer] |
| 19:53:13 | <mgrandi> | https://github.com/yt-dlp/yt-dlp |
| 19:58:59 | | Barto quits [Client Quit] |
| 19:59:06 | | Barto (Barto) joins |
| 20:03:00 | | LeighR (LeighR) joins |
| 20:10:00 | <goodtime> | from #archiveteam: |
| 20:10:03 | <goodtime> | Game site with ~13 years of history has 3 of its founders leaving. No word yet on if videos are going anywhere. Videos are hosted on their site as well as youtube.com, in most cases. Tons and tons of 2h+ video. As a fan, I think the biggest risk is that the site jettisons some of its less visible/profitable features, like its extensive wiki. |
| 20:10:03 | <goodtime> | Old videos (older ones may not be on youtube?) may also get deleted for storage reasons. https://www.resetera.com/threads/vinny-caravella-alex-navarro-brad-shoemaker-announce-theyre-leaving... |
| 20:10:04 | <goodtime> | One of the people leaving: "We are still a website... in a time when websites kind of don't exist anymore". Storm clouds on the wiki: "Are they gonna be on our forum? Are they gonna be on discords?" Founder, still staying: "Do we still need a website? I've been asking for 5 years" |
| 20:10:16 | | etnguyen03 (etnguyen03) joins |
| 20:10:26 | <goodtime> | tldr old videos (not on youtube) and non videos are the highest risk imo |
| 20:37:05 | <mgrandi> | Probably easiest to list the web pages to scrape and then get a listing of all the videos and download them somehow |
| 20:37:56 | <LeighR> | Holy Cow did I have an instinct for a site at risk - pemberley.com is unresponsive, and its old IP address is a parking page |
| 20:38:34 | <masterX244> | Got it just in time? |
| 20:38:50 | <LeighR> | apparently! |
| 20:39:37 | <LeighR> | there wasn't anything on the site that announced it going away, so this might just be a temporary hiccup, but given how unresponsive it was, I felt its days were numbered |
| 20:40:05 | <LeighR> | Hope AB didn't knock it out (I don't seriously think AB knocked it out) |
| 20:41:15 | <mgrandi> | And if someone writes code to get a listing of GB's pages , that should be put on GitHub and linked on the wiki so it can be rerun in the future :) |
| 20:44:10 | <masterX244> | Did something similar for the TM-exchange. Dumped the URLLists to archive.org and added the source code of the tool into that item, too. Better to have the code at multiple locations |
| 20:44:52 | <masterX244> | URLList dump makes it easier to do an incremental update since replays don't need to be redownloaded after the initial download, and no need to redo the POST search if you already got the IDs |
| 20:45:24 | | Jake quits [Ping timeout: 258 seconds] |
| 20:51:11 | | nico_32 joins |
| 20:54:38 | | LeGoupil quits [Client Quit] |
| 20:55:44 | <LeighR> | aside from downloading the whole WARC myself, is there a way to spot-check some URLs? Most of the stories in that site were indexed in a single, slightly mangled table that was de-mangled for viewers one page at a time |
| 20:56:52 | <LeighR> | (site is back up, but still slow as heck) |
| 20:58:10 | <masterX244> | each WARC has a cdx which is like a ToC |
| 21:01:03 | | nschmeller joins |
| 21:09:33 | | britmob quits [Ping timeout: 258 seconds] |
| 21:10:41 | | nschmeller quits [Remote host closed the connection] |
| 21:16:07 | <LeighR> | WRPlayer choked on the metadata WARC |
| 21:17:53 | <LeighR> | downloading the WARC from https://archive.fart.website/archivebot/viewer/job/b8mfh isn't eating into someone's monthly bandwidth allotment? |
| 21:19:17 | <@JAA> | It's just an index for the AB collection on IA. |
| 21:19:22 | <LeighR> | oh, good |
| 21:21:42 | <LeighR> | if those pages end up not being in there, what is the best way to archive the list of URLs I parse from the slightly mangled list? |
| 21:22:52 | <masterX244> | how is it mangled? |
| 21:23:50 | <LeighR> | https:\/\/pemberley.com\/derby\/ariane1.cim.html |
| 21:23:51 | <masterX244> | sidenote: Just noticed that on the Wikiteam dump the last upload was 2016. |
| 21:24:07 | <masterX244> | grep all out and replace \/ with / |
| 21:24:08 | <LeighR> | no big deal to clean up in PowerShell or whatever |
| 21:24:19 | <masterX244> | yeah, scripting or some quick C# code is the best way sometimes |
| 21:24:20 | <LeighR> | (to pull out of the table) |
| 21:25:02 | <masterX244> | *last upload of wikimedia commons |
| 21:25:03 | <@JAA> | sed 's,\\/,/,g' |
| 21:25:05 | <LeighR> | I thought it would be some serious JS BS, but no, I can see them all clear as day when I pull that page with curl |
| 21:25:35 | <@JAA> | Slashes are often unnecessarily escaped in JS strings (including embedded JSON). |
| 21:25:41 | <LeighR> | they're stuck in a table, but a regular enough pattern. Not sure if ArchiveBot would have caught this. |
| 21:25:59 | <masterX244> | probably nope since the backslashes hide it |
| 21:26:13 | <masterX244> | unless it got some unmangling code for that |
| 21:26:30 | <masterX244> | but easiest to verify by crosschecking that list with the cdx of the WARC file |
| 21:26:34 | <LeighR> | I get the feeling that some of this might have been done to prevent just the sort of thing we just did |
| 21:27:02 | <masterX244> | still better than __doPostBack aspx pagination that doesn't use the URL |
| 21:27:14 | <LeighR> | but their main fear was probably the stories being posted on fanfiction.net or the like under different authors' names |
| 21:27:18 | <@JAA> | If it's JS, wpull handles that by calling json.loads. |
| 21:27:26 | <LeighR> | nice |
| 21:27:44 | <masterX244> | whats the initial URL where the table resides? |
| 21:28:00 | <LeighR> | https://pemberley.com/?page_id=5270 |
| 21:30:30 | <LeighR> | if it turns out that AB didn't get them, I'll clean them up and put them in a list - no reason for y'all to bother |
| 21:31:14 | <masterX244> | just curious on the fuckery hidden in that page |
| 21:34:09 | <LeighR> | it's a site that was started before Google was |
| 21:34:38 | <LeighR> | all I can guess is that it's some effort to prevent low-effort web scraping |
| 21:35:03 | <masterX244> | script tag with a CDATA wrapper around, not sure if wpull expects a variable assignment containing the essential data |
| 21:37:19 | <LeighR> | what's the polite way to get AB to pull a list of links that are all on the same site, but aren't the only thing on that site? |
| 21:38:12 | <@JAA> | Oh, I see, it's HTML in JS strings. Yeah, that isn't processed by wpull I think. |
| 21:38:27 | <LeighR> | you probably don't want several hundred !ao messages in the channel |
| 21:39:06 | <@JAA> | Create a file containing one URL per line, upload that to https://transfer.archivete.am/ (with a good filename!), then use !ao < LISTURL. |
| 21:39:30 | <LeighR> | and you don't need several hundred copies of that obnoxious background image |
| 21:39:46 | <LeighR> | that was probably very classy in 1997 |
| 21:39:52 | <LeighR> | great! |
| 21:39:54 | <masterX244> | is transfer.archivete.am required, or does any deeplinkable host work? |
| 21:40:54 | <@JAA> | Anything works. Anything with good filenames (e.g. not Pastebin) is acceptable. transfer.archivete.am is strongly recommended. |
| 21:41:01 | <LeighR> | I need to check, but I think some of them might just be the first chapter of multi-chaptered stories, linked in who knows what pattern |
| 21:41:16 | <@JAA> | (This might change in the future, we'll see.) |
| 21:41:35 | <masterX244> | also: got this link https://app.box.com/s/6b9wmjvr582c95uzma1136exumk6p989/folder/136698646305 via this tweet: https://twitter.com/simoncarless/status/1389297530341519362 |
| 21:42:23 | <masterX244> | Apple Vs Epic Lawsuit Extended stuff. (not directly in the RECAP archive which pipes to archive.org) |
| 22:02:28 | | hooway_ quits [Client Quit] |
| 22:14:43 | | Jake (Jake) joins |
| 22:20:51 | | britmob joins |
| 22:31:36 | <@arkiver> | LeighR: if we know of any people, would be good to get in contact with |
| 22:31:54 | <LeighR> | https://thediplomat.com/2021/04/hong-kongs-activists-in-exile/ |
| 22:32:35 | <LeighR> | but those are perhaps not as archive-oriented |
| 22:35:19 | <LeighR> | I remember some folks in college who were from Taiwan (important because they and HKers can read the full Traditional Chinese character set, while the mainland uses Simplified Chinese) |
| 22:37:03 | <LeighR> | This group would probably be delighted with your help: https://www.2021hkcharter.com/ |
| 22:41:47 | <LeighR> | I'll do some more looking for who might be able to make best use of AT's help |
| 22:42:18 | <goodtime> | for Giant Bomb we could probably amass a collection of premium subscribers who want to make sure the content is archived. premium subs get download URLs which are supposedly checked for abuse (i.e. no mass downloads, i think an api key is involved) |
| 22:42:24 | | goodtime quits [Remote host closed the connection] |
| 22:46:57 | | ragu_ quits [Client Quit] |
| 23:01:30 | | thuban quits [Ping timeout: 250 seconds] |
| 23:07:33 | | thuban joins |
| 23:13:35 | | BlueMaxima joins |
| 23:21:47 | | murmur quits [Quit: leaving] |
| 23:24:11 | | LeighR quits [Client Quit] |
| 23:25:26 | | murmur joins |
| 23:29:51 | | Zerote_ joins |
| 23:29:52 | | qw3rty__ joins |
| 23:30:07 | | serx quits [Remote host closed the connection] |
| 23:32:32 | | HackMii_ quits [Ping timeout: 258 seconds] |
| 23:32:55 | | qw3rty_ quits [Ping timeout: 258 seconds] |
| 23:32:55 | | Zerote quits [Ping timeout: 258 seconds] |
| 23:33:42 | | HackMii_ (hacktheplanet) joins |
| 23:42:40 | | Jack_Thompson joins |
| 23:47:05 | | hupool joins |
| 23:48:58 | | hupool quits [Remote host closed the connection] |
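
The URL clean-up discussed around 21:23 to 21:39 (links stored with `\/` instead of `/` inside a script block, then submitted as a one-URL-per-line file for `!ao < LISTURL`) can be sketched in Python as below. This is only an illustration, not what anyone in the channel ran: the function names `extract_urls` and `to_url_list` and the regular expression are assumptions made here, and the pattern is deliberately simple (it stops at whitespace, quotes, angle brackets, and any backslash not followed by a slash).

```python
import re

# Assumed pattern: an http(s) URL whose slashes may be written as "\/",
# as is common in JS strings and embedded JSON.
ESCAPED_URL = re.compile(r'https?:(?:\\?/){2}(?:[^\s"\'<>\\]|\\/)+')


def extract_urls(page_source):
    """Return the unique URLs in first-seen order, with "\\/" turned back into "/"."""
    seen = {}
    for match in ESCAPED_URL.finditer(page_source):
        # Same substitution as the sed one-liner from the log: s,\\/,/,g
        url = match.group(0).replace("\\/", "/")
        seen.setdefault(url, None)
    return list(seen)


def to_url_list(urls):
    """Format URLs as one per line, the layout asked for with `!ao < LISTURL`."""
    return "\n".join(urls) + "\n"


if __name__ == "__main__":
    # Sample input built from the mangled link quoted in the log.
    sample = r'<td><a href=\"https:\/\/pemberley.com\/derby\/ariane1.cim.html\">story<\/a><\/td>'
    print(to_url_list(extract_urls(sample)), end="")
```

The resulting text would be saved under a descriptive filename and uploaded, after which the list URL is passed to `!ao <`. Cross-checking the list against the job's CDX first, as suggested in the log, avoids re-submitting pages ArchiveBot already captured.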