00:04:15wyatt8740 joins
00:45:04britmob quits [Read error: Connection reset by peer]
00:54:26MaxG-1 quits [Remote host closed the connection]
00:55:19britmob joins
01:02:20dm4v quits [Client Quit]
01:03:05dm4v joins
01:03:07dm4v quits [Changing host]
01:03:07dm4v (dm4v) joins
01:30:04britmob quits [Read error: Connection reset by peer]
01:42:30britmob joins
01:44:23Edsavoie_srv quits [Ping timeout: 244 seconds]
02:01:55Iki quits [Read error: Connection reset by peer]
02:57:45britmob quits [Read error: Connection reset by peer]
02:59:34ThreeHM quits [Ping timeout: 250 seconds]
03:01:35ThreeHM (ThreeHeadedMonkey) joins
03:05:42<atphoenix>this may or may not be at risk: https://viking.tv/ . To my knowledge it originated as part of Viking's response to the covid pandemic, and was regularly featured during pre-show ad rolls on PBS Masterpiece episodes. It has been replaced with a new pre-show ad roll.
03:22:24Krownest quits [Read error: Connection reset by peer]
03:25:11Krownest (Krownest) joins
03:27:31teej (teej) joins
03:40:02teej quits [Client Quit]
03:52:11qw3rty_ joins
03:55:54qw3rty__ quits [Ping timeout: 250 seconds]
03:58:30DogsRNice quits [Read error: Connection reset by peer]
04:21:08save_fn joins
04:21:53save_fn quits [Client Quit]
04:22:02AntiLiberal joins
04:23:23CrasherTN joins
04:24:01CrasherTN quits [Remote host closed the connection]
04:24:13AntiLiberal is now known as save_fn
05:12:21rsn quits [Ping timeout: 258 seconds]
05:38:06forkwhilefork quits [Quit: The Lounge - https://irc.rekt.app]
06:13:03nertzy_ joins
06:14:08nertzy__ quits [Ping timeout: 250 seconds]
06:50:07sec^nd quits [Ping timeout: 255 seconds]
06:50:49sec^nd (second) joins
07:43:07HP_Archivist (HP_Archivist) joins
07:45:36wyatt8750 joins
07:46:00wyatt8740 quits [Ping timeout: 250 seconds]
07:54:26BlueMaxima quits [Read error: Connection reset by peer]
09:44:33wizards_ joins
09:47:20wizards quits [Ping timeout: 251 seconds]
09:54:01LeGoupil joins
11:59:19LeGoupil quits [Client Quit]
12:39:55yano quits [Quit: WeeChat, the better IRC client, https://weechat.org/]
12:40:03yanome quits [Quit: The Lounge - https://thelounge.chat]
12:40:14ThreeHM quits [Ping timeout: 250 seconds]
12:41:12yanome (yano) joins
12:41:49yano (yano) joins
12:46:59ThreeHM (ThreeHeadedMonkey) joins
12:50:40KRG joins
13:17:18KRG` joins
13:17:39KRG quits [Ping timeout: 258 seconds]
13:18:59<jodizzle>(from #archiveteam) https://hk.appledaily.com/ has been running in AB for a few days, and I've now some other domains
13:19:35<jodizzle>One important detail from the reddit post: "Many articles contain videos, but youtube-dl doesn't seem to work. I'm out of ideas on how to get them." (I haven't verified this myself.)
13:21:49KRG` is now known as KRG
13:23:25<Jake>The videos on the pages are just m3u8s from a JS script tag
13:26:57KRG quits [Remote host closed the connection]
13:31:39achivarin (achivarin) joins
13:38:37ats (ats) joins
13:48:05<jodizzle>Okay, thanks. Looks like it should be possible to script something to assemble the .ts files.
13:49:37<@OrIdow6>If it's normally set up, you should just be able to pass the m3u8s into ffmpeg
13:52:22<@OrIdow6>Well, give it the url
13:54:41<Jake>It is normally setup ;)
13:56:34KRG joins
13:57:30<jodizzle>Trying it and it does seem to work. Thanks!
13:58:27<jodizzle>I usually like to get the raw files in AB as well, though, but that's not too much work.
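For grabbing the raw files as well, the segment URLs have to be read out of the playlist. The following is a minimal sketch of that step, not something posted in the channel: it assumes an ordinary HLS media playlist in which every non-comment line is a segment URI, and the `example.invalid` URL and segment names are placeholders.

```python
from urllib.parse import urljoin


def segment_urls(playlist_text, playlist_url):
    """Return absolute URLs for every URI line in an HLS playlist."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Blank lines and lines starting with '#' are tags or comments, not URIs.
        if not line or line.startswith("#"):
            continue
        # Segment URIs may be relative to the playlist's own URL.
        urls.append(urljoin(playlist_url, line))
    return urls


# Hypothetical playlist content, for illustration only.
EXAMPLE = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg0.ts
#EXTINF:10.0,
seg1.ts
#EXT-X-ENDLIST
"""

if __name__ == "__main__":
    for url in segment_urls(EXAMPLE, "https://example.invalid/video/index.m3u8"):
        print(url)
```

If the file is a master playlist, the URI lines point at variant playlists rather than segments, so the same function would need to be applied once more to each of those.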
13:59:04<jodizzle>One problem might be iterating all the articles. Sitemaps don't seem to be complete.
13:59:32Mateon1 quits [Ping timeout: 250 seconds]
14:05:46<Jake>(I also wonder if the videos on the site are also on YouTube?)
14:08:24ragu__ joins
14:09:03<jodizzle>Ah, maybe.
14:11:02Jonboy345 quits [Read error: Connection reset by peer]
14:12:05ragu_ quits [Ping timeout: 258 seconds]
14:12:53ragu_ joins
14:14:00ragu__ quits [Ping timeout: 258 seconds]
14:14:42Mateon1 joins
14:14:54ragu__ joins
14:15:18Jonboy345 joins
14:18:59ragu_ quits [Ping timeout: 258 seconds]
14:22:54Iki joins
14:30:05britmob joins
14:41:31<@EggplantN>hkdaily is AB job 2u3qbx8mpv42jxi76wq27fbb2
14:41:37<@EggplantN>but it seems quite slow
14:48:11<@EggplantN>*cracks knuckles*
14:48:15<@EggplantN>sends IA 6Gbit >_>
14:55:17<rewby>Is this where we need to acquire some china telecom transit?
14:56:55<@EggplantN>i think its just AB isnt the tool for the job
14:57:04<@EggplantN>or needs some more fine tuning
14:57:46<@OrIdow6>If you want to go really fast, 2 options I know of are Qwarc, and hackish backfeed warrior recursion
14:59:28<rewby>qwarc would go fast. But we'd need somewhere with good china connectivity. That said, I agree AB isn't the right tool for the job
14:59:56<@EggplantN>rewby sir?
15:00:00<@EggplantN>its akamai?
15:00:02<@EggplantN>not CT
15:00:14<rewby>Oh it's akamai?
15:00:18<rewby>I thought it was hosted in china
15:00:22<rewby>Did I look up the wrong domain
15:00:27<@EggplantN>no, even if it was it would be HK
15:00:31<@EggplantN>and HK is weird
15:00:38<rewby>That's a fair point
15:01:18<@EggplantN>does it have a usable sitemap or a way to find all the articles
15:01:20<rewby>If it's akamai, then yeah throw qwarc at the problem. I personally don't have amazing throughput to them but I'm sure someone here has
15:02:06<rewby>It appears to have a sitemap https://hk.appledaily.com/robots.txt
15:02:52<@OrIdow6>curl 'https://hk.appledaily.com/sitemap002.xml' | grep loc | wc -l gets me 14664, scale seems reasonable
15:03:28Jonboy345 quits [Read error: Connection reset by peer]
15:03:35<rewby>OrIdow6: try the sitemap-index
15:03:41<rewby>It's got a ton of additional sitemaps listed
15:04:17<@OrIdow6>I noticed
15:04:40<@OrIdow6>You know, if we really want to panic grab
15:04:55<@OrIdow6>We can generate an URL list from the sitemaps and feed them into #//
15:05:41<rewby>I have scripts to do this...
15:05:58<@OrIdow6>To get the sitemaps?
15:06:03Jonboy345 joins
15:06:05<rewby>Yeah, to take sitemaps and turn them into urls
15:06:16<rewby>Including recursive sitemaps
15:06:22<@OrIdow6>Nice
15:06:50<@OrIdow6>Do we have an idea for the timescale for this?
15:07:01<rewby>It's not very fast, only single threaded. But give it like an hour or two and it'll extract the urls
15:07:06<rewby>*from a map this size ish
15:07:09<@OrIdow6>For the shutdown, I mean
15:07:12<@OrIdow6>Hm
15:07:49<@OrIdow6>Would it work to extract from the big list with grep, and then do the others in parallel from a bash script or something?
15:08:06<@OrIdow6>Cheap parallelization with &, I mean
15:08:15<@EggplantN>if you grabbed all onsite content via a warrior project + offsite links to #// ?
15:08:21<@EggplantN>we could scale up to like
15:08:24<@EggplantN>infinity
15:08:31<@EggplantN>have this done quick AF
15:08:36<rewby>OrIdow6: Not quite with grep. The format is a bit weird so you have to actually python parse it
15:08:40<@OrIdow6>Warrior projects still take a while to set up
15:08:42<@OrIdow6>rewby: Oh
15:08:50<rewby>I do actual xml parsing because spacing is not consistent
15:08:58<rewby>And sometimes there's multiple items on one line
15:09:02<rewby>And other times there's weird encodings
15:09:11<rewby>Better to just let lxml deal with it
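The point about using a real XML parser instead of grep can be sketched as follows. This is not rewby's script: it swaps lxml for the standard library's `xml.etree.ElementTree` so that it runs without extra packages, and the `fetch` callable (URL in, bytes out) is a hypothetical stand-in for whatever HTTP client is used.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_bytes):
    """Return (is_index, locs) for one sitemap or sitemap index document."""
    root = ET.fromstring(xml_bytes)
    # The parser handles inconsistent spacing, several items on one line,
    # and the declared encoding; stray whitespace inside <loc> is stripped.
    locs = [el.text.strip() for el in root.iter(NS + "loc") if el.text]
    return root.tag == NS + "sitemapindex", locs


def enumerate_urls(start_url, fetch):
    """Walk nested sitemaps; return (page URLs, sorted sitemap URLs visited)."""
    pending = [start_url]
    seen = set()
    pages = []
    while pending:
        url = pending.pop()
        if url in seen:
            continue
        seen.add(url)
        is_index, locs = parse_sitemap(fetch(url))
        if is_index:
            pending.extend(locs)  # entries are further sitemaps
        else:
            pages.extend(locs)  # entries are page URLs
    return pages, sorted(seen)
```

Like the approach described above, this is single-threaded; passing `fetch` in as a parameter keeps the parsing logic testable without touching the network.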
15:09:40<@OrIdow6>EggplantN: Maybe do an initial pass of the sitemap with #//, and then set up something more complicated?
15:09:58<rewby>The problem with #// is that it selects items from the queue randomly, doesn't it?
15:10:08<rewby>so we wouldn't guarantee it actually finishes those urls in time
15:10:11<@OrIdow6>The last-~hour increase in people talking about this sounds to me like social media panic, but you never know
15:10:47<@EggplantN>uh
15:10:55<@EggplantN>rewby we can make it do it quickly-ish
15:10:55<@OrIdow6>I haven't been paying much attention to #// recently, is it at capacity?
15:11:03<@EggplantN>no but its not in an amazing state
15:11:26<rewby>Can we just duplicate the urls project code and run it on its own tracker with just these urls?
15:11:32<rewby>That's an "easy" way to scale to the moon quickly
15:11:45<achivarin>Hi. For the sitemap idea, check out my reddit thread for more information: https://www.reddit.com/r/DataHoarder/comments/o4r4jv/help_wanted_hong_kongs_prodemocracy_newspaper_in/
15:12:31<rewby>Oh hm paywall.
15:12:38<rewby>Well, we can just put in the right cookie and that's fine
15:13:01<@EggplantN>oh
15:13:06<@EggplantN>i wonder if thats fucked AB over
15:13:18<rewby>I think it depends on whether you have the cookie or not?
15:13:21<rewby>Not sure
15:13:39<@EggplantN>it is a 200 response code
15:13:49<rewby>Oh I hate websites that do this
15:15:22<achivarin>Maybe try the googlebot user agent? It's also mentioned in the reddit thread.
15:15:31<@EggplantN>yeah that exact one didnt work for me
15:15:37<@EggplantN>plus it looks like a JS based paywall
15:16:09<@OrIdow6>So the correct text still gets sent in the response?
15:16:28<rewby>EggplantN: If I have adblock turned on I don't get paywalled
15:16:36<@EggplantN>i have adblock
15:17:03<@OrIdow6>Yeah, browsing around with Noscript gets me the right material
15:17:28<achivarin>Strange. In my testing I found that reading mode in Firefox just bypasses the wall, and if you go into developer console to delete the paywall box you can read the whole article.
15:17:41<rewby>Yeah so as long as we don't store cookies we should be fine
15:17:46<@OrIdow6>achivarin: Yes, that's what would be expected from a JS paywall
15:17:55<rewby>Heck, I think most of our tools ignore JS anyway
15:18:10<achivarin>Oh my bad
15:19:07<@arkiver>i believe JAA is able to archive lists of URLs at high speed into WARCs
15:19:18<@OrIdow6>Looks like the cookie may be set through JS anyway
15:19:25Jonboy345 quits [Read error: Connection reset by peer]
15:19:35<@OrIdow6>Though I didn't really check that thoroughly
15:20:23<@OrIdow6>arkiver: We were talking about doing it with #//
15:20:30<@arkiver>right
15:20:37<@OrIdow6>For a rough pass in case it shuts down next few hrs
15:20:48<@arkiver>how do they check if you read the first article without paywall?
15:20:51<@arkiver>do they set a cookie?
15:21:13<@EggplantN>yes but the paywall is JS based anyway
15:21:24<@arkiver>right so content is there anyway
15:21:31<@arkiver>in that case - what is the cookie talk about/
15:21:31<@arkiver>?
15:22:04<rewby>I'm enumerating the urls, FYI
15:22:09<rewby>From the sitemap, at least
15:22:09<@arkiver>rewby: thanks
15:22:13<@OrIdow6>Well, first we had to figure out if it was JS-based
15:22:21Jonboy345 joins
15:22:24<@OrIdow6>But now it's just been reduced to a playback concern
15:22:27<@arkiver>rewby: make sure to get the sitemap URLs themselves as well :)
15:22:37Arcorann quits [Ping timeout: 258 seconds]
15:22:55<rewby>arkiver: Working on it!
15:24:33<rewby>There's a lot of submaps
15:24:41<rewby>So my scripts are having to send a ton of requests
15:30:25<achivarin>After this fire is hopefully put out, other independent media outlets are in the crosshairs too: https://thestandnews.com/ https://www.hkcnews.com/ https://hongkongfp.com/
15:30:42<achivarin>And they also have large YouTube channels
15:33:56<@JAA>arkiver: qwarc isn't currently very good at archiving lists of URLs because it quickly gets bogged down by the DB locking. Can't go much beyond a couple thousand items per minute or so, and one item per URL would be the most reasonable approach with URL lists. Can be worked around though.
15:34:05<rewby>If these urls are structured the way I'm expecting, it's somewhere in 2008's sitemaps working its way up to today
15:34:17<rewby>Doing about a month every 4 seconds
15:34:53Mateon1 quits [Ping timeout: 258 seconds]
15:38:42Mateon1 joins
15:41:38<@EggplantN>is there much documentation on qwarc
15:42:13<rewby>I'm midway through 2016 with extracting things
15:42:29<@EggplantN>just extracting for now?
15:42:36<rewby>Just extracting URLs from sitemaps
15:42:39<@JAA>EggplantN: How much is zero?
15:43:58<@EggplantN>ok so everything i've made
15:54:29missmega joins
15:54:32<missmega>Yo
15:54:40<missmega>So the megaupload archive was a failure?
15:55:09<missmega>I'm trying to find some content linked in this video at this moment: https://youtu.be/fD7X9SCn0To?t=398 It's all megaupload links. Sadly I can't find it with archive.org
15:55:35<Jake>https://wiki.archiveteam.org/index.php/MegaUpload The status is listed as Lost, so I imagine we didn't do a project.
15:56:05<missmega>That really sucks
15:56:07<missmega>:|
16:00:19ragu_ joins
16:03:28<missmega>OrIdow6
16:03:30<missmega>I am here
16:04:01ragu__ quits [Ping timeout: 258 seconds]
16:04:12<@OrIdow6>Oh
16:04:24ragu__ joins
16:05:09<@OrIdow6>I didn't notice you were the same person
16:05:10ragu_ quits [Ping timeout: 258 seconds]
16:06:58<@OrIdow6>Sorry
16:08:09<@OrIdow6>EggplantN: "everything I've made"?
16:13:45Jonboy345 quits [Remote host closed the connection]
16:14:01Jonboy345 joins
16:20:34<Jake>I believe it was a joke on a lack of documentation on what he codes
16:23:02<rewby>OrIdow6, EggplantN, arkiver: I've parsed the hk.appledaily.com sitemaps. Here's the urls extracted: https://transfer.archivete.am/fzhO1/sorted_urls.txt and here's the urls of the sitemaps I pulled them from: https://transfer.archivete.am/14ZEQ8/sitemaps.txt
16:23:13<rewby>I've got 3221709 urls from that
16:23:16<rewby>That's quite a lot
16:28:42<@EggplantN>rewby feeding into urls now
16:30:31<@EggplantN>backfeed go brrrrr
16:31:18<@JAA>It looks like the AB job for AppleDaily will be incomplete. I'm seeing countless parsing warnings in the log, for example: 2021-06-20 21:53:40,365 - wpull.scraper.html - WARNING - Failed to read document at ‘https://hk.appledaily.com/racing/20190505/JX6MZ2JBWZR4BXTPPXLS6A4DLQ/’: 'utf-8' codec can't decode byte 0xe8 in position 46: unexpected end of data
16:31:38<@EggplantN>yeah we're doing an emergency via #//
16:31:49<rewby>EggplantN: Are you somehow giving the urls priority over the rest of the queue?
16:31:52<@EggplantN>nope
16:32:03<@EggplantN>i'm just gonna scale up instead
16:32:39<rewby>Mkay
16:32:42<rewby>Good luck
16:32:44<@EggplantN>hrm
16:32:45<@EggplantN>okay
16:32:49<@EggplantN>i've found an issue perhaps
16:32:57<rewby>Oh no
16:33:40<@JAA>502
16:33:44<@JAA>?
16:33:48<@EggplantN>nah
16:33:53<@EggplantN>its backfeed related
16:34:14<@JAA>Ah, AB Job started 502ing a bit the same moment you said you'd start queueing. lol
16:34:35<rewby>Are you hug-of-death-ing it Eggplant
16:34:56<@JAA>Yay, hugs.
16:35:27<@JAA>But yeah, let's try not to murder it.
16:35:39<@EggplantN>FYI
16:35:49<@EggplantN>i've also removed the :maxtries from #//
16:35:57<@EggplantN>until i can verify we've grabbed everything
16:37:34<missmega>So
16:38:03<missmega>megaupload's data is gone
16:38:17<@EggplantN>link?
16:39:10<@arkiver>rewby: does this include embedded images?
16:40:22<Jake>missmega: I believe so, yes. Unless someone out there has a copy.
16:40:44<@arkiver>from what i see no embedded images from pages
16:41:04<@arkiver>EggplantN: i can quickly turn on getting embedded images
16:41:12<@EggplantN>sure if you want
16:41:24<@EggplantN>i forgot that was enabled now
16:41:29<@EggplantN>🤦
16:41:37<rewby>arkiver: No, I think it's just links to stories.
16:42:15<rewby>Also, I'm seeing 502s while I'm trying to tweak my scripts. So we're going a smidge too fast for them I think
16:42:23<@arkiver>EggplantN: its not enabled now
16:42:28<@arkiver>i'm enabling it now
16:42:30<@EggplantN>sorry *supported
16:42:34<@EggplantN>wrong word
16:44:56<@arkiver>this site doesnt embed images like most sites do
16:45:12<@arkiver>Wget-AT cant easily extract them (without us parsing the HTML)
16:45:22<@arkiver>that is custom extraction
16:46:16<@arkiver>let's just finish this run now
16:46:25<@arkiver>we can maybe do a second run later to get the images
16:48:14<@arkiver>that would mean duplicating HTML pages, but thats fine with me
17:05:14<@EggplantN>yeah that was my thinking
17:05:17<@EggplantN>lets just get what we can
17:08:59<rewby>arkiver: In case it's useful to you, I put my sitemap scraper on github. https://github.com/rewbycraft/sitemap-enumerator
17:09:06<rewby>It's mostly a wrapper/partial reimplementation of a library
17:09:10<rewby>But this one can go nyooom
17:13:46britm0b joins
17:15:32Webuser431 joins
17:15:42britmob quits [Ping timeout: 258 seconds]
17:19:15<Ryz>On the subject on looking out other Hong Kong journalist/news websites, https://en.wikipedia.org/wiki/List_of_newspapers_in_Hong_Kong could be a good starting point
17:24:45<@JAA>Looks like en.appledaily.com needs some work as well. It was run through AB twice (once a few days ago, once today), but those jobs finished surprisingly quickly. It has infinite scrolling, and the sitemap only seems to cover the past month. :-/
17:25:16Daloader joins
17:32:11<achivarin>Ryz: You have the right idea but most papers on that list are pro-Beijing rags. Reposting what I said above:
17:32:16<achivarin>After this fire is hopefully put out, other independent media outlets are in the crosshairs too: https://thestandnews.com/ https://www.hkcnews.com/ https://hongkongfp.com/ And they also have large YouTube channels
18:29:16CZ joins
18:36:01CZ quits [Remote host closed the connection]
18:36:23Cz joins
19:10:25Jonboy345 quits [Read error: Connection reset by peer]
19:13:46Jonboy345 joins
19:16:27Jonboy345 quits [Read error: Connection reset by peer]
19:18:52Jonboy345 joins
19:20:12Daloader quits [Ping timeout: 250 seconds]
19:29:14missmega quits [Ping timeout: 244 seconds]
19:30:10Cz quits [Remote host closed the connection]
19:34:12Jonboy3451 joins
19:34:12Jonboy345 quits [Read error: Connection reset by peer]
19:51:52AlsoHP_Archivist joins
19:51:52HP_Archivist quits [Read error: Connection reset by peer]
20:03:48lennier1 quits [Quit: Going offline, see ya! (www.adiirc.com)]
20:04:49lennier1 (lennier1) joins
20:29:58ThreeHM quits [Ping timeout: 250 seconds]
20:33:16ThreeHM (ThreeHeadedMonkey) joins
20:47:44lunik1 quits [Ping timeout: 250 seconds]
21:05:30lunik1 joins
21:18:06EdSavoie joins
21:20:42AlsoHP_Archivist quits [Read error: Connection reset by peer]
21:21:10AlsoHP_Archivist joins
21:38:46<@JAA>Bethesda forum and mod comment API data retrieval is running. Specifically, I'm fetching /community/api/topic/TOPICID + /community/api/topic/SLUG?page=PAGE for topics and https://api.bethesda.net/mods/ugc-workshop/content/get?content_id=MODID + /community/comments/get/mods/mods_MODID/0 (and .../1 etc. for pagination) for mod comments. /community/* URLs are on bethesda.net.
21:39:35<@JAA>The list of mods also comes from the API, namely https://api.bethesda.net/mods/ugc-workshop/list/?number_results=20&order=desc&page=PAGE&platform=&product=&sort=published&text=
21:49:55@EggplantN quits [Quit: Ping timeout (120 seconds)]
21:50:15EggplantN joins
21:53:14<@JAA>Uh, looks like I'm getting zero mod comments. Maybe they already disabled that part without another announcement. :-(
21:53:41<@JAA>https://bethesda.net/en/mods/fallout4/mod-detail/911793 had a lot of comments as of a month ago, for example.
22:37:14DogsRNice (Webuser299) joins
22:42:44EggplantN quits [Changing host]
22:42:44EggplantN (EggplantN) joins
22:42:44@ChanServ sets mode: +o EggplantN
22:44:09<@EggplantN>when tf did my connection die
22:44:23<@JAA>21:49:55
22:44:36<@EggplantN>is it 22:44 now for you
22:44:47<@JAA>One time zone to rule them all (UTC). Yes
22:45:02<@EggplantN>that
22:45:05<@EggplantN>kinda doesnt make sense
22:45:07<@EggplantN>but ok
22:45:10Muad_Dib quits [Ping timeout: 250 seconds]
22:55:36Muad-Dib joins
23:21:47<@JAA>The forums part of the Bethesda crawl is done. I'm sending them an email about the mod comments.
23:23:40AlsoHP_Archivist quits [Client Quit]
23:23:56HP_Archivist (HP_Archivist) joins
23:42:40BlueMaxima joins
23:52:05Specular joins
23:53:04Arcorann (Arcorann) joins
23:55:21Specular quits [Client Quit]
23:55:39nerdguy1138 quits [Quit: Leaving.]
23:58:25<@JAA>Bethesda forum stats in my WARC: 297121 topics, 363778 topic pages, 2851308 posts. The topic numbers on the forum homepage add up to 299359; close enough. Post IDs go to 3.3 million, but there are plenty of deleted topics (IDs go to just over 456789), so that sounds good enough as well.