| 00:12:44 | | Ruthalas (Ruthalas) joins |
| 00:31:18 | | Arcorann_ joins |
| 00:41:31 | | Mineroboter joins |
| 00:44:04 | | Mineroboter_ quits [Ping timeout: 250 seconds] |
| 00:46:47 | | pcr leaves |
| 00:46:49 | | pcr joins |
| 01:01:12 | | howardad (howardad) joins |
| 01:01:41 | | howardad quits [Client Quit] |
| 01:02:10 | | howardad (howardad) joins |
| 01:03:06 | | dm4v quits [Read error: Connection reset by peer] |
| 01:03:55 | | dm4v joins |
| 01:03:57 | | dm4v is now authenticated as dm4v |
| 01:03:57 | | dm4v quits [Changing host] |
| 01:03:58 | | dm4v (dm4v) joins |
| 01:12:27 | | Wayward quits [Remote host closed the connection] |
| 02:03:29 | | lennier2 joins |
| 02:06:57 | | lennier1 quits [Quit: Going offline, see ya! (www.adiirc.com)] |
| 03:14:11 | | Wayward (wayward) joins |
| 03:40:19 | | qw3rty_ joins |
| 03:43:58 | | qw3rty__ quits [Ping timeout: 258 seconds] |
| 05:01:29 | | DogsRNice quits [Read error: Connection reset by peer] |
| 05:25:33 | | nerdguy1138 quits [Ping timeout: 258 seconds] |
| 05:40:34 | | nerdguy1138 (nerdguy1138) joins |
| 07:43:52 | | duce1337 (duce1337) joins |
| 08:04:07 | | bobbyb joins |
| 08:04:38 | <bobbyb> | Is there a place to submit whole websites for archiving? |
| 08:04:55 | <bobbyb> | I just came across https://www.tapeheads.net/forums.php that has little presence on the Wayback Machine |
| 08:05:42 | <bobbyb> | I just archived a very interesting thread on the merits of demagnetizing tape heads that had no backups |
| 08:06:56 | <bobbyb> | Oh, I think I may be in the wrong channel |
| 08:07:16 | <thuban> | bobbyb: no no, here is fine |
| 08:11:14 | <thuban> | JAA: vbulletin forum, just under 1M posts. is that too big for an archivebot job? qwarc? |
| 08:22:04 | <thuban> | in other news, older rthk videos stored on its internal servers (at least for _hong kong connection_) went down around the same time they started deleting youtube videos :( |
| 08:22:06 | <thuban> | streaming versions timed out for a while and are now 404ing; 'archive' versions just 500 |
| 08:23:18 | <thuban> | the episode web pages are still up, weirdly (even though they're now broken...) so i'm just going to grab thumbnails and descriptions for as many of the remaining episodes as i can |
| 08:56:32 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 09:00:13 | | shoghicp quits [Ping timeout: 258 seconds] |
| 09:02:07 | | shoghicp (shoghicp) joins |
| 09:17:05 | | treora quits [Ping timeout: 258 seconds] |
| 09:48:48 | | sliccricc (sliccricc) joins |
| 10:22:38 | | sliccricc quits [Remote host closed the connection] |
| 10:22:57 | | sliccricc (sliccricc) joins |
| 10:44:23 | | duce1337_ (duce1337) joins |
| 10:44:23 | | duce1337 quits [Read error: Connection reset by peer] |
| 11:26:33 | <betamax> | I've now finished pre-processing the URLs related to the recent UK elections. Specifically, I now have: |
| 11:26:37 | <betamax> | * 3904 Twitter usernames of political parties / candidates |
| 11:26:39 | <betamax> | * 4175 Facebook pages (likely not possible to be archived due to FB's rate limiting?) |
| 11:26:42 | <betamax> | * 318 Instagram profiles (again, likely not possible due to rate limiting?) |
| 11:26:45 | <betamax> | * 89 Youtube channels (and a Dailymotion channel) used by parties / candidates |
| 11:26:48 | <betamax> | * 530 Political Party websites |
| 11:26:51 | <betamax> | * 1273 Candidate Websites |
| 11:26:53 | <betamax> | What is the best way to get these archived? |
| 11:26:53 | <betamax> | The candidate web pages can go in archivebot with "archiveonly". |
| 11:26:53 | <betamax> | Should the party / candidate websites be split into groups / chunks to avoid overloading a single archivebot job? |
| 11:26:56 | <betamax> | * 2503 Candidate Web pages (these have been linked to separately and include manifestos, etc... - there will likely be a large overlap between these pages and the candidate websites, but I think it is best to archive these as well). |
| 11:27:00 | <betamax> | Is there a way to feed ~4000 twitter usernames into socialbot (without overloading it!)? |
| 11:37:00 | <Barto> | this belongs in a wiki page |
| 11:37:10 | <AK> | Maybe some of that could go into urls, means ab doesn't have to do it all |
| 11:40:19 | | forkwhilefork quits [Remote host closed the connection] |
| 11:40:39 | | forkwhilefork (forkwhilefork) joins |
| 11:40:44 | <betamax> | Barto: OK, I'll make up a wiki page with the information. |
| 11:42:05 | <Barto> | i mean, the idea is to at least write it somewhere |
| 11:42:58 | <betamax> | Yup, seems sensible :) |
| 11:46:12 | <@HCross> | betamax: can you share the list please? I can also get started on them |
| 11:49:04 | <Barto> | ^ that's exactly why I think writing it somewhere is great |
| 11:50:08 | <betamax> | Yup, just about to. Can I upload txt files to the wiki, or is it best to host them on my own server, then wayback them and link to the archived list? |
| 11:50:49 | <betamax> | Ah, seems the wiki supports generic file upload. |
| 11:51:02 | <betamax> | lol, no it doesn't |
| 11:58:28 | <betamax> | wiki page (with links to lists): https://wiki.archiveteam.org/index.php/Elections/2021_UK_elections |
| 11:58:38 | <betamax> | HCross: ^ |
| 11:58:46 | | treora joins |
| 12:00:51 | | sliccricc quits [Remote host closed the connection] |
| 12:02:41 | | nuroten quits [Remote host closed the connection] |
| 12:05:03 | <masterX244> | i would have used an archive.org item for storing the files (did that with some other discovery results at my end earlier) |
| 12:47:14 | | Daloader joins |
| 14:12:05 | | treeplant quits [Ping timeout: 244 seconds] |
| 14:25:39 | | nuroten joins |
| 14:25:48 | <thuban> | rthk update: streaming videos appear to be back up; we'll see how long that lasts. also, 2+-year-old episodes appear to be hosted on yet another subdomain, which is still working. i am cautiously optimistic... |
| 14:27:08 | <nuroten> | thuban: thanks for the update. not sure how long they'll be up, but given what happened to the youtube videos better safe than regretful |
| 14:31:02 | <nuroten> | yesterday I found another commercial news outlet had pulled most of their short clips from a documentary series off their website (they fired a bunch of staff in december last year, including the entire team behind production of said series) |
| 14:33:02 | <nuroten> | the series was called News Lancet and was a bit like Hong Kong Connection in that it explored social and political topics |
| 14:37:18 | | britm0b quits [Read error: Connection reset by peer] |
| 14:40:03 | | duce1337_ quits [Read error: Connection reset by peer] |
| 14:40:18 | | duce1337 (duce1337) joins |
| 14:40:23 | <nuroten> | fingers crossed that as much of the site as possible can be saved before the purge |
| 14:42:45 | <nuroten> | thuban: looks like most are on archive.* and some older ones on podcast.* ... ipv6 is down on archive.* |
| 14:44:11 | | britmob joins |
| 14:45:43 | <thuban> | nuroten: there's actually a second set of sources for the newer eps (visible through the web page but not in the feed), what i've been calling 'streaming'; they are higher quality so i've been getting them where i can |
| 14:46:04 | <nuroten> | the ones from akamai? |
| 14:47:06 | <thuban> | the _new_ new ones are on akamai, but the old new ones are self-hosted (stmw.rthk.hk) |
| 14:48:23 | <nuroten> | thanks, didn't know that one ... how are speeds on it, as slow as archive.* ? |
| 14:48:55 | <thuban> | similar, yeah |
| 14:51:36 | | LeGoupil joins |
| 14:52:54 | | etnguyen03 (etnguyen03) joins |
| 14:53:25 | <nuroten> | okay. I've been trying to download select ones, and do a pass for all items in a feed later. full pass is more systematic, but resorting to cherry-pick due to slow speeds and unknown expiry |
| 14:53:40 | <thuban> | mhm |
| 14:55:37 | <thuban> | you may want to check against https://archive.org/details/hk-connection-2016-2020 (and other stuff under the "Hong Kong Connection" subject; i think there are a bunch of individual episodes, but i don't read chinese) and/or the contents of the torrent linked here https://old.reddit.com/r/DataHoarder/comments/n3pvnm/hong_kong_broadcaster_rthk_to_delete_shows_over_a/ |
| 14:56:27 | <nuroten> | the 2016-2020 if it is what it says would be great, that's halfway there |
| 14:56:56 | <thuban> | yeah, i'm going to try and get 2010-2016 if i can |
| 14:58:04 | <nuroten> | there's another playlist that is select episodes, not complete |
| 14:58:34 | <nuroten> | on IA that is, besides the 2016-2020 |
| 15:00:37 | <thuban> | (more: https://lih.kg/sMMkrnX (zh), https://docs.google.com/spreadsheets/d/1JPyevWnxvoq_xzY4ptOgaTaTYSLE9oMva66kxm66K0k/edit#gid=1023959306) |
| 15:04:11 | <nuroten> | from the little I can understand, poster said they already downloaded most of the english version |
| 15:04:34 | <nuroten> | and the rest of the post is instructions on downloading (via youtube-dl) |
| 15:04:49 | | LeGoupil quits [Client Quit] |
| 15:07:45 | <thuban> | there seems to be a looot of info in the spreadsheet and the google doc it links to, so if you or anyone with good enough chinese could read it and (a) recommend items for download or (b) get copies to go to ia, that would be incredibly helpful! |
| 15:08:40 | <nuroten> | there was a suggestion in the thread to grab The Pulse and another Chinese show, I'll pull up a link for the 2nd one |
| 15:14:52 | <thuban> | i am going afk for a while (download script will keep running), but ping me for anything i should look at when i get back |
| 15:15:34 | <nuroten> | okay, thanks, cheers ... I'll leave a few links here for shows |
| 15:18:39 | | LeGoupil joins |
| 15:19:02 | | icedice joins |
| 15:19:19 | | icedice quits [Remote host closed the connection] |
| 15:25:51 | | Arcorann_ quits [Ping timeout: 258 seconds] |
| 15:30:25 | | HP_Archivist (HP_Archivist) joins |
| 15:31:38 | <nuroten> | The Pulse https://podcast.rthk.hk/podcast/item.php?pid=205&lang=en-US (English version, there's a Chinese one of the same name with only 3 videos, not all that interesting) |
| 15:46:37 | <nuroten> | 五夜講場 (English title "Philosophy Night"), 2020 and 2021 page: https://podcast.rthk.hk/podcast/item.php?pid=1734&year=2020&lang=zh-CN |
| 15:53:58 | <nuroten> | the series is split across 13 rss feeds (by year and subtheme) about 30-40 videos each, not sure which one the poster wanted |
| 16:11:34 | | icedice joins |
| 16:12:35 | | spirit quits [Client Quit] |
| 16:13:21 | <nuroten> | about the google docs spreadsheet, check the 3rd tab (the 1st one in Chinese) for a list of titles they're looking to back up |
| 16:18:55 | <nuroten> | (rthk + the programme name in web search should pull up the podcast feed in most cases) ... another thing is someone apparently put the entire collection of HKC on torrent, but I don't know if it really is complete |
| 16:21:09 | <nuroten> | the other 2 tabs in Chinese are show categories (e.g. "current affairs" and "culture") and have links to mega of select episodes people have uploaded |
| 16:28:33 | | pcr leaves |
| 16:46:40 | | pcr joins |
| 16:50:57 | | treeplant joins |
| 17:05:35 | | rsn_ joins |
| 17:08:12 | | rsn quits [Ping timeout: 258 seconds] |
| 17:12:49 | | DogsRNice (Webuser299) joins |
| 17:25:27 | | Ruthalas quits [Ping timeout: 258 seconds] |
| 17:26:42 | | Ruthalas (Ruthalas) joins |
| 18:10:51 | | duce1337_ (duce1337) joins |
| 18:10:51 | | duce1337 quits [Read error: Connection reset by peer] |
| 18:13:29 | <@JAA> | thuban, bobbyb: 1M posts is fine with AB. It'll take a little while though and is around the point where I typically start ignoring the URLs for individual posts to not let it blow up too much. |
| 18:18:29 | | webdownload joins |
| 18:18:51 | <webdownload> | I'm now working on archiving the TED Talks website. |
| 18:25:58 | <webdownload> | I figured that it would be more feasible to archive than Vimeo. |
| 18:48:15 | | LeGoupil quits [Ping timeout: 258 seconds] |
| 19:08:12 | | LeGoupil joins |
| 19:19:54 | | Daloader quits [Ping timeout: 250 seconds] |
| 19:31:11 | | LeGoupil quits [Ping timeout: 258 seconds] |
| 19:34:38 | | HP_Archivist quits [Ping timeout: 258 seconds] |
| 20:23:52 | | nuroten quits [Remote host closed the connection] |
| 20:25:57 | | nuroten joins |
| 20:37:55 | | LeGoupil joins |
| 20:44:24 | | tzt quits [Ping timeout: 258 seconds] |
| 21:09:31 | | LeGoupil quits [Client Quit] |
| 21:25:35 | <betamax> | I thought I'd "bump" my earlier question about archiving a large number of websites and twitter accounts (from the recent UK elections) - is there a good / recommended way of archiving ~4000 twitter users and ~2500 websites? |
| 21:25:39 | <betamax> | For the twitters, I figure either (a) socialbot (does it support that many accounts?) or (b) manually running snscrape then putting the list of tweets into AB separately. |
| 21:25:45 | <betamax> | For the websites, they should go in AB, but I don't know if / how I should split the list into more manageable chunks. |
| 21:26:22 | <betamax> | (oops, wrong numbers - ~1700 websites, not ~2500) |
| 21:31:10 | <jodizzle> | I believe what we did last time was run snscrape manually as in (b), split the results into a few chunks, and run all the websites one by one |
| 21:32:24 | <betamax> | I've done that before and can certainly do that again. |
| 21:33:19 | <jodizzle> | For the twitters, I would definitely do (b). Just make sure to run snscrape with options for getting outlinks. |
| 21:34:01 | <jodizzle> | ~1700 websites is a lot, though. I wonder if we've ever done that much before for an election. |
| 21:34:51 | <jodizzle> | We could simplify the process by splitting into chunks and using '!a <', but I'm not sure if that's the best either. |
| 21:35:37 | <betamax> | '!a <' was my plan, and I wasn't sure if it should be split into chunks (and if so, what size) |
| 21:36:11 | <betamax> | Smaller allows for easier monitoring (e.g: applying custom ignoresets) but obviously increases the number of jobs required |
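
The chunking trade-off discussed above could be explored with a few lines of Python. This is only a minimal sketch, not part of ArchiveBot's tooling: `chunk_urls` is a hypothetical helper, the chunk size of 3 is arbitrary, and the `example.org` URLs are placeholders. Each resulting chunk would presumably still have to be written to its own list file before being handed to a '!a <' job.

```python
from typing import Iterator, List


def chunk_urls(urls: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


if __name__ == "__main__":
    # Placeholder URLs; a real run would read the candidate/party site list.
    sites = [f"https://example.org/site{i}" for i in range(7)]
    for number, chunk in enumerate(chunk_urls(sites, 3), start=1):
        print(f"chunk {number}: {len(chunk)} URLs")
```

Smaller values of `size` give more jobs that are each easier to monitor; larger values give fewer, heavier jobs.
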
| 21:38:19 | <jodizzle> | One thing I recall helping with the last round of U.K. election work was that some of the candidate sites weren't independent domains, but rather just pages on a host site, e.g., https://www.partyof.wales/. In that circumstance, we could just grab the host site. Is that true this time around? |
| 21:39:52 | <betamax> | There's likely to be some of that, yes. I did try to filter out that as much as possible but with over a thousand sites I will have missed things (and I don't have a good way to filter to find pages on a domain) |
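
One possible way to spot candidate pages that live on a shared host is to group the list by hostname and look at hosts that occur more than once. The sketch below is an illustration only: `group_by_host` and `shared_hosts` are hypothetical names, the sample paths are made up, and it assumes every URL carries a scheme such as `https://` (URLs without one end up under an empty host).

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def group_by_host(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Map each hostname (lower-cased, leading 'www.' dropped) to its URLs."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        if host.startswith("www."):
            host = host[len("www."):]
        groups[host].append(url)
    return dict(groups)


def shared_hosts(urls: Iterable[str], min_count: int = 2) -> Dict[str, List[str]]:
    """Keep only the hosts that appear at least `min_count` times."""
    return {
        host: members
        for host, members in group_by_host(urls).items()
        if len(members) >= min_count
    }


if __name__ == "__main__":
    # Made-up sample paths on the host mentioned in the discussion.
    sample = [
        "https://www.partyof.wales/candidate_a",
        "https://www.partyof.wales/candidate_b",
        "https://example.org/",
    ]
    for host, members in shared_hosts(sample).items():
        print(host, len(members))
```

Hosts reported by `shared_hosts` are candidates for grabbing once as a whole site rather than page by page, though each would still need a manual look.
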
| 21:40:32 | <betamax> | jodizzle: what's the option in snscrape to get twitter outlinks? |
| 21:42:56 | <jodizzle> | I use `snscrape --format '{url} {tcooutlinksss} {outlinksss}' twitter-user <username>`. That produces up to three links per line, which you then need to put newline breaks between. You could do that by piping it into `tr " " "\n" | sed '/^$/d'` or something similar. |
| 21:43:43 | <betamax> | Ah, thanks. Wasn't aware of the --format option |
| 21:44:30 | <jodizzle> | Also add to the top of each file a link to just the profile page, i.e., https://twitter.com/<username>. If you do that as well, you'll produce lists exactly like snscrape does. |
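
The post-processing described above (one URL per line, blank entries dropped, profile link at the top) can also be sketched in Python instead of `tr` and `sed`. This is a rough illustration under stated assumptions: `build_url_list` is a hypothetical helper, the input lines are assumed to be the space-separated output of the snscrape command quoted above, and the sample usernames and URLs are invented.

```python
from typing import Iterable, List


def build_url_list(username: str, snscrape_lines: Iterable[str]) -> List[str]:
    """Return the profile URL followed by every URL found in the given lines.

    Each input line may hold several whitespace-separated URLs; empty
    fields and empty lines are skipped.
    """
    urls = [f"https://twitter.com/{username}"]
    for line in snscrape_lines:
        # str.split() with no argument splits on any whitespace run and
        # never yields empty strings, so blank fields disappear here.
        urls.extend(line.split())
    return urls


if __name__ == "__main__":
    # Invented sample lines standing in for real snscrape output.
    lines = [
        "https://twitter.com/someuser/status/1 https://t.co/abc https://example.org/a",
        "https://twitter.com/someuser/status/2  ",
        "",
    ]
    print("\n".join(build_url_list("someuser", lines)))
```
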
| 21:46:51 | <betamax> | I'll start that running now, and also feed in the list of candidate pages into archivebot (with "!archiveonly"). |
| 21:47:10 | <jodizzle> | I think lists of a couple hundred-thousand twitter links are pretty good for '!ao <' jobs. An alternative would be to hand them to the URLs warrior project. |
| 21:51:26 | <@JAA> | I don't think the URLs project would get anything useful. |
| 21:52:12 | <@JAA> | We could do one (or a few) !a < job for the websites, but that's probably not going to end well. |
| 21:52:48 | <jodizzle> | Why wouldn't the URLs project be good for twitter links? |
| 21:54:41 | <@JAA> | I believe it'd get the new Twitter UI, which is useless without JS. |
| 21:55:07 | <@JAA> | AB still gets the old UI, which actually works. |
| 21:55:28 | <jodizzle> | Ah, good point |
| 21:56:43 | | tzt joins |
| 22:00:36 | <betamax> | JAA: what are the main issues with running !a < on the party / candidate websites? |
| 22:02:14 | <@JAA> | betamax: Ignores are hell, there will be so many outlinks that Python's cookie jar will slow everything to a crawl after a while, and probably simply too big for one job in general. |
| 22:03:17 | <betamax> | do you mean outlinks to external sites (which could be turned off I believe) or links to different pages in the same site? |
| 22:04:02 | <@JAA> | External ones. They could be turned off, yeah, but I think they'd also be great to capture. |
| 22:04:38 | <betamax> | They would, but if they stop the entire thing from being captured... |
| 22:06:23 | <betamax> | JAA: is there anything I could / should be doing with the facebook / instagram URLs? (e.g: running the scrape very slowly over a few months to avoid ratelimiting, etc..) or is it a lost cause? |
| 22:06:49 | <@JAA> | It's pretty much a lost cause at this point, unfortunately. |
| 22:09:00 | <betamax> | OK - would it be worth !ao < all the links just to record that the pages existed (or will that trigger rate limiting) |
| 22:10:07 | <@JAA> | All AB pipelines are banned from Facebook and Instagram as of last time I checked. So it would only grab redirects to the login pages. |
| 22:11:18 | <betamax> | Ah, I now see the scope of their rate limiting.... OK, that's not going to work! |
| 22:13:15 | <@JAA> | Yeah, you get banned from both services just by accessing like 2 or 3 pages at a perfectly humanly reasonable speed. It's ridiculous. |
| 22:52:58 | | superkuh quits [Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilayer] |
| 23:04:16 | | HP_Archivist (HP_Archivist) joins |
| 23:17:58 | | duce1337_ quits [Client Quit] |
| 23:33:59 | | Arcorann_ joins |
| 23:44:26 | | IKI quits [Remote host closed the connection] |
| 23:49:39 | | BlueMaxima joins |