00:06:03lunik1 quits [Quit: Ping timeout (120 seconds)]
00:06:12lunik1 joins
00:08:50lunik1 quits [Client Quit]
00:08:59lunik1 joins
00:54:40balrog quits [Quit: Bye]
01:01:37dm4v_ joins
01:01:37dm4v quits [Read error: Connection reset by peer]
01:01:49dm4v_ is now known as dm4v
01:01:52dm4v quits [Changing host]
01:01:52dm4v (dm4v) joins
01:05:18balrog (balrog) joins
01:17:49Atom joins
01:46:43mgrytbak4 joins
01:48:17mgrytbak quits [Read error: Connection reset by peer]
01:48:17mgrytbak4 is now known as mgrytbak
01:58:07onetruth joins
02:02:39dm4v quits [Read error: Connection reset by peer]
02:03:46dm4v joins
02:03:48dm4v quits [Changing host]
02:03:48dm4v (dm4v) joins
02:46:08themadpro (themadpro) joins
02:46:27<themadpro>Hey someone on Reddit brought my attention to the fact that the Google Video collections on IA have been locked from downloads https://archive.org/details/GVID-20110417095014-crawl340
02:46:51<themadpro>Does anyone know why? https://usercontent.irccloud-cdn.com/file/zmKufJSE/image.png
02:47:06<themadpro>Relevant thread https://www.reddit.com/r/Archiveteam/comments/rhd00j/google_video_and_yahoo_video_archives/hoq3np9/
04:09:55systwi quits [Ping timeout: 258 seconds]
04:27:05Arcorann__ quits [Ping timeout: 265 seconds]
04:29:45qw3rty_ joins
04:33:22qw3rty__ quits [Ping timeout: 265 seconds]
04:58:31Arcorann__ joins
05:03:49HP_Archivist quits [Ping timeout: 265 seconds]
05:50:21Arcorann__ quits [Ping timeout: 258 seconds]
06:10:31ragu joins
06:51:14HackMii quits [Remote host closed the connection]
06:51:49HackMii (hacktheplanet) joins
07:00:54Atom-- joins
07:02:25Atom quits [Ping timeout: 258 seconds]
07:08:39systwi (systwi) joins
07:22:29Megame (Megame) joins
07:52:04benjins quits [Read error: Connection reset by peer]
08:01:53BlueMaxima quits [Client Quit]
08:07:29lennier1 quits [Ping timeout: 265 seconds]
08:08:08lennier1 (lennier1) joins
08:25:20Arcorann__ joins
09:06:40HackMii quits [Remote host closed the connection]
09:07:40HackMii (hacktheplanet) joins
09:17:33laser quits [Remote host closed the connection]
09:17:53laser joins
09:26:04monoxane4 quits [Quit: Ping timeout (120 seconds)]
09:26:34monoxane4 (monoxane) joins
09:53:57monoxane4 quits [Client Quit]
10:03:36monoxane4 (monoxane) joins
10:28:10pmlo3 quits [Ping timeout: 244 seconds]
11:03:47benjins joins
11:30:28qwertyasdfuiopghjkl joins
11:46:28Nulo quits [Ping timeout: 258 seconds]
11:53:27<pabs>Fedora retiring one of their services: https://communityblog.fedoraproject.org/retiring-taiga-instance-on-teams-fedoraproject-org/
11:53:36<pabs>the https://teams.fedoraproject.org/ service
11:53:40godane1 joins
11:57:33godane quits [Ping timeout: 265 seconds]
12:07:53laser quits [Remote host closed the connection]
12:11:52laser joins
12:15:31laser quits [Remote host closed the connection]
12:19:22Timestarter joins
12:19:34laser joins
12:19:38Timestarter quits [Remote host closed the connection]
12:19:50qwertyasdfuiopghjkl quits [Client Quit]
12:22:32Nulo joins
12:23:12laser quits [Remote host closed the connection]
12:29:22laser joins
12:36:18Nulo quits [Ping timeout: 258 seconds]
13:10:03Arcorann__ quits [Ping timeout: 265 seconds]
13:13:00HackMii quits [Remote host closed the connection]
13:13:59HackMii (hacktheplanet) joins
13:23:18march_happy (march_happy) joins
13:26:28<march_happy>How do you guys handle pages with a login wall when saving to WARC?
13:27:25<march_happy>Some pages on Baidu Tieba have a login wall, so the operator's account name ends up in the capture
13:39:42<march_happy>Example WACZ: https://www.dropbox.com/s/42vgjklbuvuzu0w/my-web-archive.wacz
13:41:24<march_happy>Notice there is "Catme0w" (The operator's name), and when you click on 2 in "1 2 ..." (page numbers)
13:41:42<march_happy>The layout completely breaks
13:43:17<march_happy>The page "感觉最近没啥动力,来开个进度贴…" ("Been feeling unmotivated lately, so I'm starting a progress thread…") is a good example of the current situation.
13:46:35<march_happy>We wish to archive the Chinese furry fandom's history on Baidu Tieba (Baidu Tieba is China's Reddit; it's the biggest online forum here).
13:46:55<march_happy>And if it works out, we have the ambition to archive a full copy of Baidu Tieba as it stands today.
13:47:49<march_happy>But it just seems that WARC isn't a good enough format...
13:54:16<march_happy>For Chinese websites, would it be better to write a separate frontend and save the data into a database?
13:54:50<march_happy>Login walls are very common here, and yes, I hate them.
14:04:41<thuban>march_happy: there isn't really a good option for getting around login walls. archiveteam usually avoids targeting login-walled or otherwise non-public data in its own projects. but that's partly because our stuff goes into the wayback machine and we don't want to misrepresent what a site would have looked like to a guest; in special cases where that doesn't apply, and in
14:04:43<thuban>personal archives made by individuals, we usually use burner accounts created for the purpose.
14:06:52<thuban>march_happy: as for WARC, we recommend it really strongly because of its fidelity (including metadata) and institutional support as a standard. it can be difficult to recreate dynamic content, but the file you posted actually looks fine to me in replayweb.page, including pagination in the thread you named; are you sure the problem isn't with your viewer?
14:10:26<march_happy>Maybe... But what can I do to hide the operator's name?
14:11:57<march_happy>If WARC retains metadata and isn't editable, the operator's account name is leaked
14:12:20<march_happy>And it'd be easy to trace that to a real-world identity
14:13:17<thuban>you could create a new account (or multiple accounts) and run the archiving operations under that. is there some reason that wouldn't be effective for baidu?
14:13:42qwertyasdfuiopghjkl joins
14:15:28<march_happy>The login wall mentioned above. Browsing anonymously only lets you view the first 50 comments on a post, and replies to comments are lost.
14:18:15<march_happy>Baidu Tieba doesn't have a comment tree hierarchy like Reddit. Instead, it only has three levels
14:18:35<march_happy>Post -> Comment -> Replies to a Comment
14:19:04<thuban>i understand "browsing anonymously" to mean browsing without being logged into an account at all. but i'm talking about creating an account (not identifiable as you), logging into that one (instead of your usual one), and doing the archiving that way. would that not work?
14:20:57<march_happy>Nope. I've already seen many massive database leaks from Chinese websites. And it's mandatory to use a phone number and upload national ID card info to register an account in mainland China...
14:21:27<march_happy>It's really just a matter of time before someone finds out who you are...
14:22:45<march_happy>The only way to stay anonymous is to not log in, or to clean up the crawled data and save it in a database.
14:23:08<thuban>i'm sorry, that must be incredibly difficult :(
14:23:27<thuban>is it possible to register accounts outside of mainland china?
14:25:31<march_happy>I guess it wouldn't be long before the account got banned. And I don't have that many Google Voice numbers! (Theoretically I shouldn't have one as a Chinese citizen, but I got mine before Google tightened GV registration policies)
14:25:34<march_happy>XD
14:26:36<march_happy>In this case, would crawling both to the Wayback Machine and as archive.org user-uploaded content work?
14:29:03<thuban>sorry, what do you mean by "this case"?
14:29:22<march_happy>Saving something like a summary to the Wayback Machine; for posts over the brain-dead 50-comment limit, check the user-uploaded database.
14:29:49<march_happy>Of course I'll crawl assets like pictures
14:32:04HP_Archivist (HP_Archivist) joins
14:32:52<thuban>the wayback machine generally doesn't accept WARCs from third parties, since the internet archive can't verify the contents. but they would love to have your data as an archive.org item; a WARC of the anonymous view sounds like a great idea, and a database of the rest is definitely better than nothing.
14:34:15<march_happy>Or what if I use the Wayback Machine web crawler? I'm not sure feeding it 5K+ URLs will work
14:34:40<march_happy>It'd be as if I were DDoS-ing the Wayback Machine
14:34:59<thuban>yeah, that's not recommended (it can't handle the strain).
14:35:54<march_happy>Does it have an explicit rate limit?
14:36:07<march_happy>For a single IP?
14:36:43<thuban>no; ime if the load is too high it will just start failing.
14:38:34<march_happy>Since I'm trying to recover old posts as a discussion group's admin, I also need to be careful with Baidu ( ゚∀。)
14:39:13<march_happy>If I (or the other operators) get banned by Baidu, then...
14:42:49<march_happy>That is why I hate Web 2.0...
14:43:00<march_happy>Everything is up to the platform
14:44:19<march_happy>And when you, as an end-user, can't bring money to the platform, or there's a crazy operator deleting posts inside a discussion group, as in my case...
14:45:44<march_happy>Your uploaded content is nuked, no matter how much effort you spent creating it...
14:46:12<thuban>yep, non-self-hosted platforms sure are like that :)
14:46:23<thuban>two comments, fwiw:
14:48:23HackMii quits [Remote host closed the connection]
14:49:29HackMii (hacktheplanet) joins
14:51:59<thuban>- tieba's dynamic layout means that the thread contents themselves (and their assets) are in separate network requests from the main site. if those requests aren't identifiable to the account, you could consider including WARCs of those requests, but not the rest, in your database.
14:56:53<thuban>- tieba (or at least its anonymous view) sounds like something to keep in mind as a potential archiveteam project (and WARCs from archiveteam projects _do_ go in the wayback machine). we usually prioritize sites that we know are going to disappear, but we recently started preemptively archiving reddit, and surely those posts are in less danger! so maybe someday...
15:18:34Minkafighter quits [Quit: The Lounge - https://thelounge.chat]
15:19:13Minkafighter joins
15:20:29Minkafighter quits [Client Quit]
15:40:33march_happy quits [Remote host closed the connection]
16:08:42march_happy (march_happy) joins
16:59:16march_happy quits [Ping timeout: 258 seconds]
17:20:34Megame quits [Client Quit]
19:02:42lennier1 quits [Ping timeout: 258 seconds]
19:04:06lennier2 joins
19:07:27lennier2 is now known as lennier1
19:14:56godane1 quits [Read error: Connection reset by peer]
19:43:44lunik1 quits [Client Quit]
19:43:53lunik1 joins
20:00:24lennier1 quits [Client Quit]
20:02:14AK quits [Quit: Ping timeout (120 seconds)]
20:02:32lennier1 (lennier1) joins
20:02:37AK (AK) joins
20:14:00AK quits [Ping timeout: 258 seconds]
20:17:50AK (AK) joins
20:26:19<@OrIdow6>I assume there's some sort of auth cookie or similar on the thread content requests, so you'd either have to restrict raw warc access or go with privacy by obscurity
20:26:52<@OrIdow6>If you want to do the two-tier approach, ArchiveTeam may be able to get the public stuff, instead of abusing SPN
20:31:11jamesp (jamesp) joins
20:53:33<@OrIdow6>I would like to do more stuff from China and similar countries
21:03:08thermospheric quits [Quit: Ping timeout (120 seconds)]
21:03:49<@OrIdow6>Reminds me - arkiver: can I get some sort of permission to directly edit https://github.com/archiveteam/wget-lua/wiki/Wget-with-Lua-hooks , or should I follow something like https://stackoverflow.com/questions/10642928/how-to-pull-request-a-wiki-page-on-github ?
21:04:09thermospheric (Thermospheric) joins
21:10:18<jamesp>Should the topic for #archiveteam be updated to remove YouTube since that project is already done?
21:13:27<IDK>Wondering if the #// URL project handles JS better than #archivebot
21:17:46Larsenv (Larsenv) joins
21:22:37<@OrIdow6>IDK: No
21:22:45<@OrIdow6>Though neither actually runs Javascript
21:23:27<IDK>Oh welp
21:23:50supersonichub1 joins
21:24:04<@OrIdow6>Is there anything specific you're trying to do?
21:25:03<supersonichub1>Can you point me in the right direction of archiving Twitch streams with Warrior?
21:29:24<supersonichub1>Also wondering how I can run Warrior with podman instead of docker, since the latter's daemon is quite temperamental.
21:30:23march_happy (march_happy) joins
21:30:47BlueMaxima joins
21:31:06<IDK>OrIdow6: #archivebot
21:31:14<jamesp>OrIdow6: I also meant to keep "thanks, we know about alexa" in the topic too
21:31:30<supersonichub1>Thanks!
21:31:39<IDK>I mean, it's not really that important, but I know it's all going to be deleted in 6 days
21:32:52<@OrIdow6>supersonichub1: I don't think anyone here so far was replying to you. The warrior is for large, centralized projects, and there's no way to use it to save a site without setting up a "project" involving multiple people
21:33:09<@OrIdow6>I know this has been discussed in the past, I believe youtube-dl or whatever the latest fork is may work
21:33:19<@OrIdow6>If you want more info I can look through my logs
21:33:24<@OrIdow6>IDK: What is?
21:33:44<@rewby>yt-dlp is the youtube-dl I usually use
21:34:01Atom joins
21:34:06<@OrIdow6>jamesp: I know, I decided that myself
21:34:10supersonichub1 quits [Remote host closed the connection]
21:34:29<IDK>https://usercontent.irccloud-cdn.com/file/lLVXW4ep/1639690463.JPG
21:34:30<jamesp>OrIdow6: but why?
21:34:31<@OrIdow6>Decided to remove Alexa since it's been a week and seems people have stopped talking about it
21:34:53Atom-- quits [Ping timeout: 258 seconds]
21:35:07<IDK>I think most alexa features are locked behind a login wall
21:35:12<jamesp>Well, people are gonna keep mentioning it there.
21:35:25<@rewby>I've not heard a peep about alexa in like a week, just like OrIdow6.
21:35:28<jamesp>IDK: That doesn't stop us from collecting the data
21:35:32<@OrIdow6>Yes, but that's not what "Thanks, we know about..." is for
21:36:04<IDK>You mean alexa?
21:36:08<jamesp>yes
21:36:15<@rewby>Also, I agree with OrIdow6's decision to remove the alexa bit
21:36:55<@OrIdow6>"Thanks, we know about... " is for when something announces a shutdown and it gets to the top page of Reddit and several people come on here to tell about it, not for every possible topic of discussion in -bs
21:36:58<jamesp>We can still focus on it a little later (a few months), near the end
21:37:50<IDK>sometimes things just get thrown in ab or urls project
21:38:24<@OrIdow6>march_happy: (Repeating since you went offline without me noticing) If you want to do the two-tier approach, ArchiveTeam may be able to get the public stuff, instead of abusing SPN
21:39:11<@rewby>Most of our stuff is not located within china, so I'm curious as to how quickly we get blocked
21:39:43<jamesp>rewby: Why would we get blocked because it's NOT in China?
21:40:04<IDK>rewby: warning that the Chinese firewall blocks suspicious requests or throttles most traffic to under 500 kbps
21:40:08<@rewby>Because they want to archive a chinese website?
21:40:39<jamesp>So we get blocked for trying to archive chinese websites
21:40:40<@rewby>I just kinda jumped on O.rIdow6's message to m.arch_happy
21:40:59<jamesp>The CCP doesn't want us to archive their propaganda
21:41:06<@rewby>China's known for a) having a not-so-great interconnect to the outside world and b) being block happy
21:42:40<@rewby>Iki: Yeah, that's why I wonder how quickly they'd pick up on any grab scripts we write. Since those requests can look rather funny. Especially since most of the warriors are in !china
21:42:43<@rewby>Er
21:42:48<@rewby>IDK: ^
21:43:28<IDK>rewby: CCP do poison some DNS records
21:43:39<IDK>On purpose
21:45:31<@rewby>Yeah, this all sounds like a "fun" project
21:45:39<@rewby>Aka "a lot of work"
21:50:38<@OrIdow6>So maybe reclassify this one to "only if necessary" then
21:51:33<@rewby>I'm not saying you can't give it a try
21:51:46<@rewby>Just be aware that it'll need a close eye on data integrity
22:30:11<@arkiver>OrIdow6: what are you planning on changing?
22:30:23<@arkiver>a PR always works
22:31:11<mpeter|m>yt-dlp is pretty OK for it. Be sure to increase the download thread count with the -N param. It helps up to at most ~32 threads, but only if your ISP allows 500 Mb/s downloads.
22:31:11<mpeter|m>also, you might want to include the option to save the info.json.
22:31:11<mpeter|m>you'll get chapter data too, both in info.json and embedded in the output video file.
22:31:11<mpeter|m>I also have [a fork](https://github.com/mpeter50/yt-dlp/tree/twitchvod-livechat) that allows you to download the live chat too (use the option `--sub-lang live_chat`; be aware that `--sub-lang all` does not imply it), though it's a bit clunky; we haven't yet worked out with the maintainer what the proper yt-dlp way is to do this uniformly for all services
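[editor's note: the yt-dlp suggestions above could be combined into one invocation. A minimal sketch that only assembles the argument list; `-N`, `--write-info-json`, and `--embed-chapters` are mainline yt-dlp flags, while `--sub-langs live_chat` assumes mpeter|m's fork, and the VOD URL is a placeholder:]

```python
# Sketch: build the yt-dlp command line for archiving a Twitch VOD,
# per the suggestions above. The live_chat subtitle option refers to
# mpeter|m's fork, not mainline yt-dlp; the URL is a placeholder.
def build_ytdlp_cmd(vod_url, threads=16):
    return [
        "yt-dlp",
        "-N", str(threads),          # concurrent fragment downloads (~32 max useful)
        "--write-info-json",         # save metadata next to the video
        "--embed-chapters",          # chapter data embedded in the output file
        "--sub-langs", "live_chat",  # fork-only: also grab the live chat
        vod_url,
    ]

print(" ".join(build_ytdlp_cmd("https://www.twitch.tv/videos/123456789")))
```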
22:31:39mutantmnky quits [Remote host closed the connection]
22:31:40sec^nd quits [Write error: Broken pipe]
22:31:40HackMii quits [Write error: Broken pipe]
22:31:57mutantmnky (mutantmonkey) joins
22:31:58HackMii (hacktheplanet) joins
22:32:00sec^nd (second) joins
22:48:07<@JAA>arkiver: PRs don't work for GitHub wikis.
22:48:24<@JAA>They're full Git repos, but you can't create a PR for them.
22:57:39<@arkiver>right the wiki
22:58:45wyatt8740 quits [Ping timeout: 265 seconds]
23:07:35<@OrIdow6>arkiver: Just looking to add some of the additions that have been made to the API, and maybe improve clarity in some places
23:28:08Arcorann__ joins
23:52:53march_happy quits [Remote host closed the connection]
23:53:09march_happy (march_happy) joins