00:00:31<nuroten>100Most is the flagship magazine/yt video channel
00:01:11dm4v quits [Read error: Connection reset by peer]
00:01:11dm4v_ joins
00:01:17<thuban>gotcha, fixed
00:01:37dm4v_ is now known as dm4v
00:01:37dm4v quits [Changing host]
00:01:37dm4v (dm4v) joins
00:02:09<@arkiver>if anything very big needs to be saved, let me know
00:02:28<@arkiver>very big, as in too big for AB or for an individual here to archive
00:27:10sec^nd quits [Ping timeout: 255 seconds]
00:35:28<abccc> Is there a way to generate a sitemap for a website (ex: inmediahk.net)? Or put another way, is there a way to get a list of all URLs that exist for this domain?
00:39:49sec^nd (second) joins
00:43:23<@JAA>Only the site operator would know *everything* that exists. sitemap.xml, recursive crawls, etc. may always be incomplete.
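For the sitemap.xml route mentioned above, a minimal sketch of pulling the listed URLs out of a sitemap document. The sample XML and the function name `sitemap_urls` are illustrative assumptions, and, as noted, whatever a sitemap lists may still be incomplete.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol for <urlset>, <url> and <loc>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Hypothetical sample document, not taken from any real site.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/article/1</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE):
        print(url)
```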
00:54:12HP_Archivist (HP_Archivist) joins
01:07:55<nuroten>thuban: tvmost (651 videos) https://www.youtube.com/channel/UCiJnCs2K5gP-DXnMxlstC9A and D100 (15881 videos) https://www.youtube.com/user/D100HK they slipped through the cracks, I had thought D100 yt was already archived
01:09:07<thuban>they may already have been, i'll check
01:09:28<nuroten>your table made it easier to see the blanks haha
01:10:20ddd joins
01:10:46ddd quits [Client Quit]
01:28:02katocala quits [Ping timeout: 258 seconds]
01:31:00katocala joins
01:36:28HP_Archivist quits [Ping timeout: 258 seconds]
01:41:50h3ndr1k quits [Ping timeout: 258 seconds]
01:42:03h3ndr1k (h3ndr1k) joins
02:01:38<SketchTheCow>https://atdash.meo.ws/ is down, I assume this is known
02:07:54<@JAA>Yeah, it's known.
02:33:04<abccc>JAA would something like this work to find all (or the vast majority) of URLs on a site? https://github.com/maurosoria/dirsearch
02:34:48<@JAA>abccc: Extremely unlikely that it would find everything.
02:35:59<abccc>Would it find the vast majority of (well-trafficked) pages though? If so, I'm wondering whether it's worth deploying to get a list containing the vast majority of articles
02:36:08<@OrIdow6>Depending on how many trillion years you expect the site to stay up
02:39:34<@OrIdow6>Search space is way too big unless you have 8.3 filenames or something
02:41:35<@JAA>Even with 8.3 and assuming all uppercase letters, you'd still have 3.7 quadrillion combinations...
02:42:23<@JAA>Well, for all possible extensions. 'Only' 209 billion per extension, so trying a few common ones is still in the trillions.
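The figures above can be reproduced with a few lines of arithmetic. This sketch assumes names of exactly 8 characters and extensions of exactly 3 characters, drawn only from the 26 uppercase letters; shorter names and digits are ignored.

```python
# Assumption: exactly 8 name characters and 3 extension characters, A-Z only.
LETTERS = 26

names_per_extension = LETTERS ** 8      # possible 8-letter base names
extensions = LETTERS ** 3               # possible 3-letter extensions
total = names_per_extension * extensions

print(f"{names_per_extension:,}")  # 208,827,064,576 (about 209 billion)
print(f"{total:,}")                # 3,670,344,486,987,776 (about 3.7 quadrillion)
```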
02:43:21<@JAA>In this case, the site uses slugs of varying lengths with Chinese characters, so there will be so many combinations that I don't even know what the names for those numbers are.
02:44:34Ruthalas quits [Ping timeout: 250 seconds]
02:46:00Ruthalas (Ruthalas) joins
02:50:01<thuban>how are people getting video counts for youtube channels?
02:51:43<@OrIdow6>3.7 quadrillion... just put it in #//
02:54:58<@JAA>thuban: Channel upload playlist is the easiest fairly reliable method, I believe. Get the channel ID (UCabc...), replace the first C with a U, and request /playlist?list=UUabc...
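The channel-ID trick described above could be sketched like this. The function name `uploads_playlist_url` is made up for illustration, and the host is assumed to be www.youtube.com as in the channel links earlier in the log.

```python
def uploads_playlist_url(channel_id):
    """Turn a channel ID (UC...) into the URL of its uploads playlist (UU...)."""
    if not channel_id.startswith("UC"):
        raise ValueError("expected a channel ID starting with 'UC'")
    # Replace the first 'C' with a 'U': UCabc... becomes UUabc...
    playlist_id = "UU" + channel_id[2:]
    return "https://www.youtube.com/playlist?list=" + playlist_id


if __name__ == "__main__":
    # Channel ID taken from the tvmost link earlier in the log.
    print(uploads_playlist_url("UCiJnCs2K5gP-DXnMxlstC9A"))
```

Whether the video count shown on that playlist page is complete is not guaranteed; it is described above only as "fairly reliable".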
02:55:47<thuban>thanks!
02:57:37britmob joins
03:01:11britm0b quits [Ping timeout: 258 seconds]
03:04:56lennier1 quits [Ping timeout: 250 seconds]
03:08:58AntiLiberal joins
03:12:59AntiLiberal quits [Remote host closed the connection]
03:13:12AntiLiberal joins
03:14:30<@JAA>thuban: Thanks for creating the wiki page, by the way! Looks good. :-)
03:15:48<thuban>you're welcome! i'm adding more stuff from the etherpad now.
03:20:32AntiLiberal quits [Remote host closed the connection]
03:20:44AntiLiberal joins
03:23:49AntiLiberal quits [Remote host closed the connection]
03:24:02AntiLiberal joins
03:26:08AntiLiberal quits [Remote host closed the connection]
03:26:18AntiLiberal joins
03:32:48<thuban>abccc: you suggested archiving evck.wikia.org, but it's 404ing. is that a recent loss or do we have the wrong url?
03:33:13<abccc>evchk.wikia.org - sorry for the typo
03:33:43<thuban>np, thanks!
03:34:43<thuban>does that site have a title?
03:35:04<thuban>oh wait, i see it
03:42:58qw3rty_ joins
03:45:22<thuban>memehk.com is down; not sure whether that might be temporary
03:46:48qw3rty__ quits [Ping timeout: 258 seconds]
03:48:17<@JAA>They were online as of a bit over a day ago according to Google's cache. I'll set up a monitor for #nodeping.
03:48:59<thuban>sounds good, thanks
03:53:11HP_Archivist (HP_Archivist) joins
04:12:10DogsRNice quits [Read error: Connection reset by peer]
04:26:07aleph quits [Quit: WeeChat info:version]
04:28:03<thuban>polymerhk.com is "under maintenance"; can we monitor that similarly?
04:29:39<thuban>JAA: ^
04:30:25<thuban>(it's all 503 if that matters)
04:40:34<thuban>also, anyone know if it has a youtube channel?
04:43:55mutantmonkey quits [Ping timeout: 258 seconds]
04:44:24mutantmonkey (mutantmonkey) joins
04:51:07<tzt>thuban: https://www.youtube.com/channel/UC-lVeL_4vOoRDCzpOWkB-Cw
04:51:16<thuban>tzt: thanks!
04:51:17<tzt>dormant since 2015
04:51:46<thuban>any idea about db channel?
04:52:21<tzt>db?
04:52:51<tzt>what is that?
04:53:06<tzt>nvm
04:53:15<thuban>https://www.dbchannel.hk/ (offline; https://www.facebook.com/dbchannel.hk)
04:56:10<tzt>dbchannel used facebook video
04:56:43<thuban>i see, thanks
05:38:12<thuban>ok, i have added everything from the etherpad up to the 'political parties' section to the wiki page.
05:39:07<thuban>- corrections, additional information, etc are welcome
05:40:17<thuban>- should be reasonably clear from table which archivebot and/or youtubearchive jobs are still needed
05:44:27<thuban>- i know we can't do facebook/instagram (so i'm not sure how much we'll be able to do for the facebook-only student groups), but it might be worth looking for twitter handles belonging to the larger news orgs
05:47:13HP_Archivist quits [Client Quit]
07:11:04britmob quits [Read error: Connection reset by peer]
08:08:39sec^nd quits [Remote host closed the connection]
08:15:31sec^nd (second) joins
08:26:24nuroten quits [Remote host closed the connection]
08:39:15<wizards>is there a simple way to clone a git repository and pull *all* branches such that *everything* in the repository is available without the need for any further network access?
08:41:12Viniter6 quits [Ping timeout: 250 seconds]
08:42:55<h3ndr1k>wizards: I remember that there is a wiki page about that. The mentioned command would clone branches belonging to pull requests and one had to write something in the repo config so git does not garbage collect those, as they had no connection to the main branches.
08:43:42<h3ndr1k>Possibly on the github wiki page.
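One common approach to the question above is `git clone --mirror`, which creates a bare copy containing all refs of the remote, not just the default branch. This is a sketch under that assumption and is not necessarily the command the wiki page describes; the helper names and the example URL are hypothetical.

```python
import subprocess


def mirror_clone_command(repo_url, dest_dir):
    """Build the argument list for a full mirror clone (all refs, bare repository)."""
    return ["git", "clone", "--mirror", repo_url, dest_dir]


def mirror_clone(repo_url, dest_dir):
    """Run the mirror clone; raises CalledProcessError if git exits non-zero."""
    subprocess.run(mirror_clone_command(repo_url, dest_dir), check=True)


if __name__ == "__main__":
    # Hypothetical repository URL, printed rather than cloned.
    print(" ".join(mirror_clone_command("https://example.org/project.git", "project.git")))
```

The result is a bare repository, so a working tree has to be checked out from it separately (for example by cloning the local mirror).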
09:14:12Viniter6 (Viniter) joins
09:40:50Nikos410 joins
10:01:42HugsNotDrugs quits [Ping timeout: 258 seconds]
11:27:53<Jake>https://wiki.archiveteam.org/index.php/GitHub#Backup_tools
12:05:01BlueMaxima quits [Client Quit]
12:19:12HugsNotDrugs joins
13:01:32pbm joins
13:01:47<pbm>Hey.
13:02:41<pbm>So here's the story: I found your website when doing some research on how to archive a bunch of websites on the Web Archive... And was hoping for some assistance...
13:03:08<@EggplantN>what assistance pbm
13:03:22<pbm>So in Poland there is this pretty big tourist organisation (pttk.pl) with a bunch of chapters and soon they will be doing some server migrations.
13:03:42<pbm>And I'm afraid that during that they will sunset some of the not-so-active sites they host
13:03:52<pbm>So I was able to pull a list of the domains they have
13:03:59<pbm>But not sure how to go about archiving it
13:04:02<@EggplantN>sure have you got the list?
13:06:11<pbm>https://docs.google.com/spreadsheets/d/1Y5gpgzyOquAZAGtHcSjhI-A9MhyKef6xbayI4XgatJQ/edit?usp=sharing
13:06:25<pbm>So that's a list of all subdomains.
13:06:50<pbm>Might contain bs like mail. or ns. or even webservers that will return errors
13:07:00<pbm>It's about 800 domains
13:07:54<@EggplantN>aight lemme take a look at a few :)
13:08:05<pbm>Also I would assume http by default, not https, as they're a bit stuck in the 90s... ;)
13:09:26<@EggplantN>hop in #archivebot im gonna queue up http://www.ostrowiec-radwan.pttk.pl/ as a test
13:10:51<pbm>ty
13:12:17<pbm>The whole concern here is that the sites that are actively managed will get migrated, but ones that are abandoned will be most likely dropped
13:15:07pbm quits [Remote host closed the connection]
13:16:36pbm joins
13:28:27aleph joins
13:33:40balrog quits [Client Quit]
13:34:14balrog (balrog) joins
13:45:00nuroten joins
13:50:44<pbm>EggplantN ?
13:51:20<@EggplantN>hey yeah I'll check it shortly
13:51:49<pbm>I got disconnected earlier and probably missed what was going on
13:51:54<pbm>sorry.. :)
14:13:45dav3 joins
14:24:49dm4v quits [Read error: Connection reset by peer]
14:25:32dm4v joins
14:25:34dm4v quits [Changing host]
14:25:34dm4v (dm4v) joins
14:38:07dm4v_ joins
14:38:07dm4v quits [Read error: Connection reset by peer]
14:38:19dm4v_ is now known as dm4v
14:38:21dm4v quits [Changing host]
14:38:21dm4v (dm4v) joins
14:53:23dm4v_ joins
14:53:48dm4v quits [Ping timeout: 258 seconds]
14:53:48dm4v_ is now known as dm4v
14:53:48dm4v quits [Changing host]
14:53:48dm4v (dm4v) joins
14:56:52Arcorann__ quits [Ping timeout: 258 seconds]
15:10:22<@JAA>thuban: Check for https://polymerhk.com/ added.
15:16:02dav3 quits [Remote host closed the connection]
15:16:32nertzy_ joins
15:18:20nertzy__ quits [Ping timeout: 258 seconds]
15:39:26<nuroten>thuban: I included some twitter links on the etherpad (line 21), for existing news sites where I could find them
15:39:55<nuroten>and yeah, I didn't have much luck with the student union ones
15:43:39<nuroten>zero really ... if/when you have them on the wiki page, I can do a cleanup and merge the twitter/youtube lists up to the parties section
15:45:23<nuroten>in the meantime, I'll try to find twitter links for parties onwards
16:18:46<abccc>Is there a way to dump *all* twitter tweets of a user? I think that for most twitter scraping tools, there is a limit on the number of tweets you can pull
16:19:16<AK>I think our socialbot gets everything?
16:27:46<Jake>I think it does
16:28:40<abccc>Nice. Do you know any non-AB tools that can pull all tweets too? I think (?) most tools like twitter-scraper for python have a limit.
16:29:34<@JAA>snscrape does that. (socialbot's just an IRC bot wrapper around snscrape.)
16:31:15<AK>Ooh I didn't know that part
16:32:42kiskaLogBot quits [Ping timeout: 258 seconds]
16:34:11kiskaLogBot joins
16:35:08Krownest quits [Read error: Connection reset by peer]
16:40:08Nikos410 quits [Remote host closed the connection]
16:42:10superkuh_ joins
16:44:22superkuh quits [Ping timeout: 250 seconds]
16:48:40Megame (Megame) joins
17:03:03<Megame>https://hackint.logs.kiska.pw/ went down
17:04:53<kiska>It's back up
17:06:23<Megame>That was quick. Thanks.
17:07:45<kiska>My Vultr instance got restarted, presumably they migrated me
17:10:19<kiska>For logbot visibility
17:10:20<kiska>[2021-06-29T16:28:40.925Z] <abccc> Nice. Do you know any non-AB tools that can pull all tweets too? I think (?) most tools like twitter-scraper for python have a limit.
17:10:20<kiska>[2021-06-29T16:29:35.073Z] <@JAA> snscrape does that. (socialbot's just an IRC bot wrapper around snscrape.)
17:10:20<kiska>[2021-06-29T16:31:15.365Z] <AK> Ooh I didn't know that part
17:16:23<abccc>JAA does snscrape work on facebook pages too? I know facebook has really aggressive rate limiting so wondering if snscrape can bypass that.
17:16:32<AK>I've turned off IRC notifications; my phone was getting sad at the pings. Pinging me in AB will get lost in the noise. If anyone needs me, try DMing me and I'll reply at some point
17:17:24<kiska>You know you can add exceptions in thelounge :D
17:17:24<@JAA>abccc: snscrape will quickly run into those rate limits, and I don't know of a way to circumvent them.
17:18:20<AK>kiska, but only words right?
17:18:48<kiska>This is what I have set mine to ignore https://server8.kiska.pw/uploads/40d7becc1a8a6798/image.png
17:19:18<@JAA>AK: An exception for pttk.pl might work since that's in the --explain.
17:19:27<kiska>This should be in -ot :D
17:19:30<@JAA>Yeah
17:22:03pbm quits [Remote host closed the connection]
17:23:26pbm joins
17:47:26lennier1 (lennier1) joins
17:50:49TheTechRobo joins
17:51:16Krownest (Krownest) joins
17:52:50<TheTechRobo>How would I go about archiving a website that seems to rely heavily on JavaScript?
17:53:32<TheTechRobo>My first impulse was to look at crocoite, but I read that it has data integrity issues.
17:54:09<TheTechRobo>I then checked out webrecorder.io or whatever it is, but it doesn't seem to have an auto-scraper - I'd have to click all the buttons manually.
17:59:55<TheTechRobo>Also, I just read about WACZ (https://github.com/webrecorder/wacz-format)... what are people's thoughts on it?
18:06:13<@OrIdow6>AFAIK not used at all outside WebRecorder, don't use it for anything serious
18:06:51<@OrIdow6>You could use a headless browser with warcprox like Brozzler or whatever it is
18:07:15<@OrIdow6>Keep in mind that if a site uses Javascript, playing it back may be as much trouble as capturing it
18:08:21<@OrIdow6>The normal thing to do here would be to look at the requests the site makes, and then write a script to simulate the client
18:08:55<@OrIdow6>Which has the advantage of giving you a finer level of control or knowledge over playback
18:09:06@dxrt quits [Quit: ZNC - http://znc.sourceforge.net]
18:09:24dxrt joins
18:09:27dxrt quits [Changing host]
18:09:27dxrt (dxrt) joins
18:09:27@ChanServ sets mode: +o dxrt
18:10:27SketchTheCow quits [Ping timeout: 258 seconds]
18:11:44<TheTechRobo>OrIdow6 Thanks! I didn't notice Brozzler, will keep that in mind
18:14:24<@OrIdow6>Yeah
18:15:21SketchTheCow joins
18:15:22<@JAA>FWIW, webrecorder/pywb also has data integrity issues.
18:18:16<TheTechRobo>JAA Oh, I didn't know that... Good thing I haven't archived anything with it! ^^"
18:19:26<@JAA>Namely these: https://github.com/webrecorder/warcio/issues/128 https://github.com/webrecorder/warcio/issues/129
18:23:21TheTechRobo quits [Remote host closed the connection]
18:33:08AntiLiberal quits [Ping timeout: 250 seconds]
18:38:19<nuroten>thuban: have we picked up the RTHK show Open Line Open View (CN: 自由風自由PHONE) yet? show host sacked, not sure how long the show will continue https://podcast.rthk.hk/podcast/item.php?pid=289
18:38:21DogsRNice (Webuser299) joins
18:39:41<@OrIdow6>JAA: So I'd assume 10.7k subdomains (Framasite and Framawiki) is too much for queueh2ibot
18:40:49dm4v_ joins
18:41:08<nuroten>thuban: someone uploaded 123 files of the show to IA, podcast.rthk.hk page says 1k files
18:42:14dm4v quits [Ping timeout: 250 seconds]
18:42:14dm4v_ is now known as dm4v
18:42:14dm4v quits [Changing host]
18:42:14dm4v (dm4v) joins
18:43:02<@JAA>OrIdow6: Oof, yeah.
18:43:39<@JAA>Especially in a week.
18:43:52<@OrIdow6>Alright, hopefully a project shouldn't be so bad
18:44:45<AK>Dammit OrIdow6, I thought I was doing well at 800, then you just bring out 10.7k lmao
18:49:46<nuroten>thuban: RTHK 31 This Week (CN: 視點31) show suspended (372 videos) https://podcast.rthk.hk/podcast/item.php?pid=636
18:57:28<AK>If I say University of Vienna does anyone go "Ooh we archived one of their sites"?
19:03:59lunik1 quits [Quit: :x]
19:07:47<nuroten>RTHK Talk Show (CN: 五夜講場) also axed, but looks like IA has most (all?) of it. source: https://hongkongfp.com/2021/06/29/hong-kongs-rthk-fires-veteran-radio-phone-in-host-as-more-shows-are-axed/
19:41:36lennier1 quits [Ping timeout: 250 seconds]
20:22:47<@arkiver>OrIdow6: what are the 10.7k domains?
20:23:24<@OrIdow6>arkiver: I haven't enumerated them, just added the numbers on https://frama.site/
20:27:24<@arkiver>i see
20:27:39<@arkiver>do they have some full list of sites?
20:27:45<@arkiver>this would also be a good one again for #Y
20:27:57<@OrIdow6>I haven't checked
20:27:59<@OrIdow6>And yes
20:30:52HP_Archivist (HP_Archivist) joins
20:43:07lunik1 joins
20:55:50<@EggplantN>10.7k?
21:20:31renibear90 joins
21:20:38renibear90 quits [Remote host closed the connection]
21:21:14renibear88 joins
21:21:50renibear88 quits [Remote host closed the connection]
21:23:47Ryz quits [Remote host closed the connection]
21:25:10Ruthalas quits [Ping timeout: 250 seconds]
21:25:29Ryz (Ryz) joins
21:31:38pbm quits [Remote host closed the connection]
21:57:49dm4v_ joins
21:58:02BlueMaxima joins
21:59:50dm4v quits [Ping timeout: 250 seconds]
21:59:50dm4v_ is now known as dm4v
21:59:50dm4v quits [Changing host]
21:59:50dm4v (dm4v) joins
22:00:27Doranwen quits [Ping timeout: 258 seconds]
22:03:59Ruthalas (Ruthalas) joins
22:04:47Doranwen (Doranwen) joins
22:10:19HP_Archivist quits [Client Quit]
22:20:34HP_Archivist (HP_Archivist) joins
22:39:35lunik1 quits [Client Quit]
22:43:03HP_Archivist quits [Client Quit]
23:53:27Arcorann__ joins