| 00:00:31 | <nuroten> | 100Most is the flagship magazine/yt video channel |
| 00:01:11 | | dm4v quits [Read error: Connection reset by peer] |
| 00:01:11 | | dm4v_ joins |
| 00:01:17 | <thuban> | gotcha, fixed |
| 00:01:37 | | dm4v_ is now known as dm4v |
| 00:01:37 | | dm4v is now authenticated as dm4v |
| 00:01:37 | | dm4v quits [Changing host] |
| 00:01:37 | | dm4v (dm4v) joins |
| 00:02:09 | <@arkiver> | if anything very big needs to be saved, let me know |
| 00:02:28 | <@arkiver> | very big as in too big for AB or for an individual here to archive |
| 00:27:10 | | sec^nd quits [Ping timeout: 255 seconds] |
| 00:35:28 | <abccc> | Is there a way to generate a sitemap for a website (ex: inmediahk.net)? Or put another way, is there a way to get a list of all URLs that exist for this domain? |
| 00:39:49 | | sec^nd (second) joins |
| 00:43:23 | <@JAA> | Only the site operator would know *everything* that exists. sitemap.xml, recursive crawls, etc. may always be incomplete. |
| 00:54:12 | | HP_Archivist (HP_Archivist) joins |
| 01:07:55 | <nuroten> | thuban: tvmost (651 videos) https://www.youtube.com/channel/UCiJnCs2K5gP-DXnMxlstC9A and D100 (15881 videos) https://www.youtube.com/user/D100HK they slipped through the cracks, I had thought D100 yt was already archived |
| 01:09:07 | <thuban> | they may already have been, i'll check |
| 01:09:28 | <nuroten> | your table made it easier to see the blanks haha |
| 01:10:20 | | ddd joins |
| 01:10:46 | | ddd quits [Client Quit] |
| 01:28:02 | | katocala quits [Ping timeout: 258 seconds] |
| 01:31:00 | | katocala joins |
| 01:31:28 | | katocala is now authenticated as katocala |
| 01:36:28 | | HP_Archivist quits [Ping timeout: 258 seconds] |
| 01:41:50 | | h3ndr1k quits [Ping timeout: 258 seconds] |
| 01:42:03 | | h3ndr1k (h3ndr1k) joins |
| 02:01:38 | <SketchTheCow> | https://atdash.meo.ws/ is down, I assume this is known |
| 02:07:54 | <@JAA> | Yeah, it's known. |
| 02:33:04 | <abccc> | JAA would something like this work to find all (or the vast majority) of URLs on a site? https://github.com/maurosoria/dirsearch |
| 02:34:48 | <@JAA> | abccc: Extremely unlikely that it would find everything. |
| 02:35:59 | <abccc> | Would it find the vast majority of (well-trafficked) pages though? If so, I'm wondering if it's worth deploying to get a list containing the vast majority of articles |
| 02:36:08 | <@OrIdow6> | Depending on how many trillion years you expect the site to stay up |
| 02:39:34 | <@OrIdow6> | Search space is way too big unless you have 8.3 filenames or something |
| 02:41:35 | <@JAA> | Even with 8.3 and assuming all uppercase letters, you'd still have 3.7 quadrillion combinations... |
| 02:42:23 | <@JAA> | Well, for all possible extensions. 'Only' 209 billion per extension, so trying a few common ones is still in the trillions. |
| 02:43:21 | <@JAA> | In this case, the site uses slugs of varying lengths with Chinese characters, so there will be so many combinations that I don't even know what the names for those numbers are. |
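The 8.3 figures quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes exactly 8 uppercase letters for the base name and exactly 3 for the extension (shorter names, digits, and other legal characters are ignored); the variable names are illustrative only.

```python
# Size of the 8.3 filename search space under a simplifying assumption:
# exactly 8 uppercase letters for the base name, exactly 3 for the extension.
ALPHABET = 26

per_extension = ALPHABET ** 8        # base names to try for one fixed extension
extensions = ALPHABET ** 3           # possible three-letter extensions
total = per_extension * extensions   # every base name with every extension

print(f"{per_extension:,}")  # 208,827,064,576 (about 209 billion)
print(f"{total:,}")          # 3,670,344,486,987,776 (about 3.7 quadrillion)
```

Trying only a handful of common extensions multiplies the 209 billion figure by that handful, which is how the count ends up in the trillions.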
| 02:44:34 | | Ruthalas quits [Ping timeout: 250 seconds] |
| 02:46:00 | | Ruthalas (Ruthalas) joins |
| 02:50:01 | <thuban> | how are people getting video counts for youtube channels? |
| 02:51:43 | <@OrIdow6> | 3.7 quadrillion... just put it in #// |
| 02:54:58 | <@JAA> | thuban: Channel upload playlist is the easiest fairly reliable method, I believe. Get the channel ID (UCabc...), replace the first C with a U, and request /playlist?list=UUabc... |
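The channel-ID trick described above can be sketched as a small helper. This assumes the channel ID starts with "UC", as in the example; the function name `uploads_playlist_url` is hypothetical, and the host is taken from the channel links posted earlier in the log.

```python
def uploads_playlist_url(channel_id: str) -> str:
    """Build the uploads-playlist URL for a YouTube channel ID.

    Assumes IDs of the form "UC..."; swapping the first C for a U
    gives the ID of the channel's uploads playlist ("UU...").
    """
    if not channel_id.startswith("UC"):
        raise ValueError(f"expected a channel ID starting with 'UC': {channel_id!r}")
    playlist_id = "UU" + channel_id[2:]
    return f"https://www.youtube.com/playlist?list={playlist_id}"


# Example with the tvmost channel ID mentioned earlier in the log.
print(uploads_playlist_url("UCiJnCs2K5gP-DXnMxlstC9A"))
```

The video count then has to be read off the playlist page itself; the helper only builds the URL.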
| 02:55:47 | <thuban> | thanks! |
| 02:57:37 | | britmob joins |
| 03:01:11 | | britm0b quits [Ping timeout: 258 seconds] |
| 03:04:56 | | lennier1 quits [Ping timeout: 250 seconds] |
| 03:08:58 | | AntiLiberal joins |
| 03:12:59 | | AntiLiberal quits [Remote host closed the connection] |
| 03:13:12 | | AntiLiberal joins |
| 03:14:30 | <@JAA> | thuban: Thanks for creating the wiki page, by the way! Looks good. :-) |
| 03:15:48 | <thuban> | you're welcome! i'm adding more stuff from the etherpad now. |
| 03:20:32 | | AntiLiberal quits [Remote host closed the connection] |
| 03:20:44 | | AntiLiberal joins |
| 03:23:49 | | AntiLiberal quits [Remote host closed the connection] |
| 03:24:02 | | AntiLiberal joins |
| 03:26:08 | | AntiLiberal quits [Remote host closed the connection] |
| 03:26:18 | | AntiLiberal joins |
| 03:32:48 | <thuban> | abccc: you suggested archiving evck.wikia.org, but it's 404ing. is that a recent loss or do we have the wrong url? |
| 03:33:13 | <abccc> | evchk.wikia.org - sorry for the typo |
| 03:33:43 | <thuban> | np, thanks! |
| 03:34:43 | <thuban> | does that site have a title? |
| 03:35:04 | <thuban> | oh wait, i see it |
| 03:42:58 | | qw3rty_ joins |
| 03:45:22 | <thuban> | memehk.com is down; not sure whether that might be temporary |
| 03:46:48 | | qw3rty__ quits [Ping timeout: 258 seconds] |
| 03:48:17 | <@JAA> | They were online as of a bit over a day ago according to Google's cache. I'll set up a monitor for #nodeping. |
| 03:48:59 | <thuban> | sounds good, thanks |
| 03:53:11 | | HP_Archivist (HP_Archivist) joins |
| 04:12:10 | | DogsRNice quits [Read error: Connection reset by peer] |
| 04:26:07 | | aleph quits [Quit: WeeChat info:version] |
| 04:28:03 | <thuban> | polymerhk.com is "under maintenance"; can we monitor that similarly? |
| 04:29:39 | <thuban> | JAA: ^ |
| 04:30:25 | <thuban> | (it's all 503 if that matters) |
| 04:40:34 | <thuban> | also, anyone know if it has a youtube channel? |
| 04:43:55 | | mutantmonkey quits [Ping timeout: 258 seconds] |
| 04:44:24 | | mutantmonkey (mutantmonkey) joins |
| 04:51:07 | <tzt> | thuban: https://www.youtube.com/channel/UC-lVeL_4vOoRDCzpOWkB-Cw |
| 04:51:16 | <thuban> | tzt: thanks! |
| 04:51:17 | <tzt> | dormant since 2015 |
| 04:51:46 | <thuban> | ny idea about db channel? |
| 04:51:48 | <thuban> | *any |
| 04:52:21 | <tzt> | db? |
| 04:52:51 | <tzt> | what is that? |
| 04:53:06 | <tzt> | nvm |
| 04:53:15 | <thuban> | https://www.dbchannel.hk/ (offline; https://www.facebook.com/dbchannel.hk) |
| 04:56:10 | <tzt> | dbchannel used facebook video |
| 04:56:43 | <thuban> | i see, thanks |
| 05:38:12 | <thuban> | ok, i have added everything from the etherpad up to the 'political parties' section to the wiki page. |
| 05:39:07 | <thuban> | - corrections, additional information, etc are welcome |
| 05:40:17 | <thuban> | - should be reasonably clear from table which archivebot and/or youtubearchive jobs are still needed |
| 05:44:27 | <thuban> | - i know we can't do facebook/instagram (so i'm not sure how much we'll be able to do for the facebook-only student groups), but it might be worth looking for twitter handles belonging to the larger news orgs |
| 05:47:13 | | HP_Archivist quits [Client Quit] |
| 07:11:04 | | britmob quits [Read error: Connection reset by peer] |
| 08:08:39 | | sec^nd quits [Remote host closed the connection] |
| 08:15:31 | | sec^nd (second) joins |
| 08:26:24 | | nuroten quits [Remote host closed the connection] |
| 08:39:15 | <wizards> | is there a simple way to clone a git repository and pull *all* branches such that *everything* in the repository is available without the need for any further network access? |
| 08:41:12 | | Viniter6 quits [Ping timeout: 250 seconds] |
| 08:42:55 | <h3ndr1k> | wizards: I remember that there is a wiki page about that. The mentioned command would clone branches belonging to pull requests and one had to write something in the repo config so git does not garbage collect those, as they had no connection to the main branches. |
| 08:43:42 | <h3ndr1k> | Possibly on the github wiki page. |
| 09:14:12 | | Viniter6 (Viniter) joins |
| 09:40:50 | | Nikos410 joins |
| 10:01:42 | | HugsNotDrugs quits [Ping timeout: 258 seconds] |
| 11:27:53 | <Jake> | https://wiki.archiveteam.org/index.php/GitHub#Backup_tools |
| 12:05:01 | | BlueMaxima quits [Client Quit] |
| 12:19:12 | | HugsNotDrugs joins |
| 13:01:32 | | pbm joins |
| 13:01:47 | <pbm> | Hey. |
| 13:02:41 | <pbm> | So here's the story: I found your website when doing some research on how to archive a bunch of websites on the Web Archive... And was hoping for some assistance... |
| 13:03:08 | <@EggplantN> | what assistance pbm |
| 13:03:22 | <pbm> | So in Poland there is this pretty big tourist organisation (pttk.pl) with a bunch of chapters and soon they will be doing some server migrations. |
| 13:03:42 | <pbm> | And I'm afraid that during that they will sunset some of the not-so-active sites they host |
| 13:03:52 | <pbm> | So I was able to pull a list of the domains they have |
| 13:03:59 | <pbm> | But not sure how to go about archiving it |
| 13:04:02 | <@EggplantN> | sure have you got the list? |
| 13:06:11 | <pbm> | https://docs.google.com/spreadsheets/d/1Y5gpgzyOquAZAGtHcSjhI-A9MhyKef6xbayI4XgatJQ/edit?usp=sharing |
| 13:06:25 | <pbm> | So that's a list of all subdomains. |
| 13:06:50 | <pbm> | Might contain bs like mail. or ns. or even webservers that will return errors |
| 13:07:00 | <pbm> | It's about 800 domains |
| 13:07:54 | <@EggplantN> | aight lemme take a look at a few :) |
| 13:08:05 | <pbm> | Also I would assume http by default, not https, as they're a bit stuck in the 90s... ;) |
| 13:09:26 | <@EggplantN> | hop in #archivebot im gonna queue up http://www.ostrowiec-radwan.pttk.pl/ as a test |
| 13:10:51 | <pbm> | ty |
| 13:12:17 | <pbm> | The whole concern here is that the sites that are actively managed will get migrated, but ones that are abandoned will be most likely dropped |
| 13:15:07 | | pbm quits [Remote host closed the connection] |
| 13:16:36 | | pbm joins |
| 13:28:27 | | aleph joins |
| 13:33:40 | | balrog quits [Client Quit] |
| 13:34:14 | | balrog (balrog) joins |
| 13:45:00 | | nuroten joins |
| 13:50:44 | <pbm> | EggplantN ? |
| 13:51:20 | <@EggplantN> | hey yeah I'll check it shortly |
| 13:51:49 | <pbm> | I got disconnected earlier and probably missed what was going on |
| 13:51:54 | <pbm> | sorry.. :) |
| 14:13:45 | | dav3 joins |
| 14:24:49 | | dm4v quits [Read error: Connection reset by peer] |
| 14:25:32 | | dm4v joins |
| 14:25:34 | | dm4v is now authenticated as dm4v |
| 14:25:34 | | dm4v quits [Changing host] |
| 14:25:34 | | dm4v (dm4v) joins |
| 14:38:07 | | dm4v_ joins |
| 14:38:07 | | dm4v quits [Read error: Connection reset by peer] |
| 14:38:19 | | dm4v_ is now known as dm4v |
| 14:38:21 | | dm4v is now authenticated as dm4v |
| 14:38:21 | | dm4v quits [Changing host] |
| 14:38:21 | | dm4v (dm4v) joins |
| 14:53:23 | | dm4v_ joins |
| 14:53:48 | | dm4v quits [Ping timeout: 258 seconds] |
| 14:53:48 | | dm4v_ is now known as dm4v |
| 14:53:48 | | dm4v is now authenticated as dm4v |
| 14:53:48 | | dm4v quits [Changing host] |
| 14:53:48 | | dm4v (dm4v) joins |
| 14:56:52 | | Arcorann__ quits [Ping timeout: 258 seconds] |
| 15:10:22 | <@JAA> | thuban: Check for https://polymerhk.com/ added. |
| 15:16:02 | | dav3 quits [Remote host closed the connection] |
| 15:16:32 | | nertzy_ joins |
| 15:18:20 | | nertzy__ quits [Ping timeout: 258 seconds] |
| 15:39:26 | <nuroten> | thuban: I included some twitter links on the etherpad (line 21), for existing news sites where I could find them |
| 15:39:55 | <nuroten> | and yeah, I didn't have much luck with the student union ones |
| 15:43:39 | <nuroten> | zero really ... if/when you have them on the wiki page, I can do a cleanup and merge the twitter/youtube lists up to the parties section |
| 15:45:23 | <nuroten> | in the meantime, I'll try to find twitter links for parties onwards |
| 16:18:46 | <abccc> | Is there a way to dump *all* twitter tweets of a user? I think that for most twitter scraping tools, there is a limit on the number of tweets you can pull |
| 16:19:16 | <AK> | I think our socialbot gets everything? |
| 16:27:46 | <Jake> | I think it does |
| 16:28:40 | <abccc> | Nice. Do you know any non-AB tools that can pull all tweets too? I think (?) most tools like twitter-scraper for python have a limit. |
| 16:29:34 | <@JAA> | snscrape does that. (socialbot's just an IRC bot wrapper around snscrape.) |
| 16:31:15 | <AK> | Ooh I didn't know that part |
| 16:32:42 | | kiskaLogBot quits [Ping timeout: 258 seconds] |
| 16:34:11 | | kiskaLogBot joins |
| 16:35:08 | | Krownest quits [Read error: Connection reset by peer] |
| 16:40:08 | | Nikos410 quits [Remote host closed the connection] |
| 16:42:10 | | superkuh_ joins |
| 16:44:22 | | superkuh quits [Ping timeout: 250 seconds] |
| 16:48:40 | | Megame (Megame) joins |
| 17:03:03 | <Megame> | https://hackint.logs.kiska.pw/ went down |
| 17:04:53 | <kiska> | It's back up |
| 17:06:23 | <Megame> | That was quick. Thanks. |
| 17:07:45 | <kiska> | My Vultr instance got restarted, presumably they migrated me |
| 17:10:19 | <kiska> | For logbot visibility |
| 17:10:20 | <kiska> | [2021-06-29T16:28:40.925Z] <abccc> Nice. Do you know any non-AB tools that can pull all tweets too? I think (?) most tools like twitter-scraper for python have a limit. |
| 17:10:20 | <kiska> | [2021-06-29T16:29:35.073Z] <@JAA> snscrape does that. (socialbot's just an IRC bot wrapper around snscrape.) |
| 17:10:20 | <kiska> | [2021-06-29T16:31:15.365Z] <AK> Ooh I didn't know that part |
| 17:16:23 | <abccc> | JAA does snscrape work on facebook pages too? I know facebook has really aggressive rate limiting so wondering if snscrape can bypass that. |
| 17:16:32 | <AK> | I've turned off IRC notifications; my phone was getting sad at the pings. Pinging me in AB will get lost in the noise. If anyone needs me, try DMing me and I'll reply at some point |
| 17:17:24 | <kiska> | You know you can add exceptions in thelounge :D |
| 17:17:24 | <@JAA> | abccc: snscrape will quickly run into those rate limits, and I don't know of a way to circumvent them. |
| 17:18:20 | <AK> | kiska, but only words right? |
| 17:18:48 | <kiska> | This is what I have set mine to ignore https://server8.kiska.pw/uploads/40d7becc1a8a6798/image.png |
| 17:19:18 | <@JAA> | AK: An exception for pttk.pl might work since that's in the --explain. |
| 17:19:27 | <kiska> | This should be in -ot :D |
| 17:19:30 | <@JAA> | Yeah |
| 17:22:03 | | pbm quits [Remote host closed the connection] |
| 17:23:26 | | pbm joins |
| 17:47:26 | | lennier1 (lennier1) joins |
| 17:50:49 | | TheTechRobo joins |
| 17:51:16 | | Krownest (Krownest) joins |
| 17:51:30 | | TheTechRobo is now authenticated as TheTechRobo |
| 17:52:50 | <TheTechRobo> | How would I go about archiving a website that seems to rely heavily on JavaScript? |
| 17:53:32 | <TheTechRobo> | My first impulse was to look at crocoite, but I read that it has data integrity issues. |
| 17:54:09 | <TheTechRobo> | I then checked out webrecorder.io or whatever it is, but it doesn't seem to have an auto-scraper - I'd have to click all the buttons manually. |
| 17:59:55 | <TheTechRobo> | Also, I just read about WACZ (https://github.com/webrecorder/wacz-format)... what are people's thoughts on it? |
| 18:06:13 | <@OrIdow6> | AFAIK not used at all outside WebRecorder, don't use it for anything serious |
| 18:06:51 | <@OrIdow6> | You could use a headless browser with warcprox like Brozzler or whatever it is |
| 18:07:15 | <@OrIdow6> | Keep in mind that if a site uses Javascript, playing it back may be as much trouble as capturing it |
| 18:08:21 | <@OrIdow6> | The normal thing to do here would be to look at the requests the site makes, and then write a script to simulate the client |
| 18:08:55 | <@OrIdow6> | Which has the advantage of giving you a finer level of control or knowledge over playback |
| 18:09:06 | | @dxrt quits [Quit: ZNC - http://znc.sourceforge.net] |
| 18:09:24 | | dxrt joins |
| 18:09:27 | | dxrt is now authenticated as dxrt |
| 18:09:27 | | dxrt quits [Changing host] |
| 18:09:27 | | dxrt (dxrt) joins |
| 18:09:27 | | @ChanServ sets mode: +o dxrt |
| 18:10:27 | | SketchTheCow quits [Ping timeout: 258 seconds] |
| 18:11:44 | <TheTechRobo> | OrIdow6 Thanks! I didn't notice Brozzler, will keep that in mind |
| 18:14:24 | <@OrIdow6> | Yeah |
| 18:15:21 | | SketchTheCow joins |
| 18:15:22 | <@JAA> | FWIW, webrecorder/pywb also has data integrity issues. |
| 18:18:16 | <TheTechRobo> | JAA Oh, I didn't know that... Good thing I haven't archived anything with it! ^^" |
| 18:19:26 | <@JAA> | Namely these: https://github.com/webrecorder/warcio/issues/128 https://github.com/webrecorder/warcio/issues/129 |
| 18:23:21 | | TheTechRobo quits [Remote host closed the connection] |
| 18:33:08 | | AntiLiberal quits [Ping timeout: 250 seconds] |
| 18:38:19 | <nuroten> | thuban: have we picked up the RTHK show Open Line Open View (CN: 自由風自由PHONE) yet? show host sacked, not sure how long the show will continue https://podcast.rthk.hk/podcast/item.php?pid=289 |
| 18:38:21 | | DogsRNice (Webuser299) joins |
| 18:39:41 | <@OrIdow6> | JAA: So I'd assume 10.7k subdomains (Framasite and Framawiki) is too much for queueh2ibot |
| 18:40:49 | | dm4v_ joins |
| 18:41:08 | <nuroten> | thuban: someone uploaded 123 files of the show to IA, podcast.rthk.hk page says 1k files |
| 18:42:14 | | dm4v quits [Ping timeout: 250 seconds] |
| 18:42:14 | | dm4v_ is now known as dm4v |
| 18:42:14 | | dm4v is now authenticated as dm4v |
| 18:42:14 | | dm4v quits [Changing host] |
| 18:42:14 | | dm4v (dm4v) joins |
| 18:43:02 | <@JAA> | OrIdow6: Oof, yeah. |
| 18:43:39 | <@JAA> | Especially in a week. |
| 18:43:52 | <@OrIdow6> | Alright, hopefully a project shouldn't be so bad |
| 18:44:45 | <AK> | Dammit OrIdow6, I thought I was doing well at 800, then you just bring out 10.7k lmao |
| 18:49:46 | <nuroten> | thuban: RTHK 31 This Week (CN: 視點31) show suspended (372 videos) https://podcast.rthk.hk/podcast/item.php?pid=636 |
| 18:57:28 | <AK> | If I say University of Vienna does anyone go "Ooh we archived one of their sites"? |
| 19:03:59 | | lunik1 quits [Quit: :x] |
| 19:07:47 | <nuroten> | RTHK Talk Show (CN: 五夜講場) also axed, but looks like IA has most (all?) of it. source: https://hongkongfp.com/2021/06/29/hong-kongs-rthk-fires-veteran-radio-phone-in-host-as-more-shows-are-axed/ |
| 19:41:36 | | lennier1 quits [Ping timeout: 250 seconds] |
| 20:22:47 | <@arkiver> | OrIdow6: what are the 10.7k domains? |
| 20:23:24 | <@OrIdow6> | arkiver: I haven't enumerated them, just added the numbers on https://frama.site/ |
| 20:27:24 | <@arkiver> | isee |
| 20:27:27 | <@arkiver> | i see* |
| 20:27:39 | <@arkiver> | do they have some full list of sites? |
| 20:27:45 | <@arkiver> | this would also be a good one again for #Y |
| 20:27:57 | <@OrIdow6> | I haven't checked |
| 20:27:59 | <@OrIdow6> | And yes |
| 20:30:52 | | HP_Archivist (HP_Archivist) joins |
| 20:43:07 | | lunik1 joins |
| 20:55:50 | <@EggplantN> | 10.7k? |
| 21:20:31 | | renibear90 joins |
| 21:20:38 | | renibear90 quits [Remote host closed the connection] |
| 21:21:14 | | renibear88 joins |
| 21:21:50 | | renibear88 quits [Remote host closed the connection] |
| 21:23:47 | | Ryz quits [Remote host closed the connection] |
| 21:25:10 | | Ruthalas quits [Ping timeout: 250 seconds] |
| 21:25:29 | | Ryz (Ryz) joins |
| 21:31:38 | | pbm quits [Remote host closed the connection] |
| 21:57:49 | | dm4v_ joins |
| 21:58:02 | | BlueMaxima joins |
| 21:59:50 | | dm4v quits [Ping timeout: 250 seconds] |
| 21:59:50 | | dm4v_ is now known as dm4v |
| 21:59:50 | | dm4v is now authenticated as dm4v |
| 21:59:50 | | dm4v quits [Changing host] |
| 21:59:50 | | dm4v (dm4v) joins |
| 22:00:27 | | Doranwen quits [Ping timeout: 258 seconds] |
| 22:03:59 | | Ruthalas (Ruthalas) joins |
| 22:04:47 | | Doranwen (Doranwen) joins |
| 22:10:19 | | HP_Archivist quits [Client Quit] |
| 22:20:34 | | HP_Archivist (HP_Archivist) joins |
| 22:39:35 | | lunik1 quits [Client Quit] |
| 22:43:03 | | HP_Archivist quits [Client Quit] |
| 23:53:27 | | Arcorann__ joins |