| 00:06:03 | | lunik1 quits [Quit: Ping timeout (120 seconds)] |
| 00:06:12 | | lunik1 joins |
| 00:08:50 | | lunik1 quits [Client Quit] |
| 00:08:59 | | lunik1 joins |
| 00:54:40 | | balrog quits [Quit: Bye] |
| 01:01:37 | | dm4v_ joins |
| 01:01:37 | | dm4v quits [Read error: Connection reset by peer] |
| 01:01:49 | | dm4v_ is now known as dm4v |
| 01:01:52 | | dm4v is now authenticated as dm4v |
| 01:01:52 | | dm4v quits [Changing host] |
| 01:01:52 | | dm4v (dm4v) joins |
| 01:05:18 | | balrog (balrog) joins |
| 01:17:49 | | Atom joins |
| 01:46:43 | | mgrytbak4 joins |
| 01:48:17 | | mgrytbak quits [Read error: Connection reset by peer] |
| 01:48:17 | | mgrytbak4 is now known as mgrytbak |
| 01:58:07 | | onetruth joins |
| 02:02:39 | | dm4v quits [Read error: Connection reset by peer] |
| 02:03:46 | | dm4v joins |
| 02:03:48 | | dm4v is now authenticated as dm4v |
| 02:03:48 | | dm4v quits [Changing host] |
| 02:03:48 | | dm4v (dm4v) joins |
| 02:46:08 | | themadpro (themadpro) joins |
| 02:46:27 | <themadpro> | Hey someone on Reddit brought my attention to the fact that the Google Video collections on IA have been locked from downloads https://archive.org/details/GVID-20110417095014-crawl340 |
| 02:46:51 | <themadpro> | Does anyone know why? https://usercontent.irccloud-cdn.com/file/zmKufJSE/image.png |
| 02:47:06 | <themadpro> | Relevant thread https://www.reddit.com/r/Archiveteam/comments/rhd00j/google_video_and_yahoo_video_archives/hoq3np9/ |
| 04:09:55 | | systwi quits [Ping timeout: 258 seconds] |
| 04:27:05 | | Arcorann__ quits [Ping timeout: 265 seconds] |
| 04:29:45 | | qw3rty_ joins |
| 04:33:22 | | qw3rty__ quits [Ping timeout: 265 seconds] |
| 04:58:31 | | Arcorann__ joins |
| 05:03:49 | | HP_Archivist quits [Ping timeout: 265 seconds] |
| 05:50:21 | | Arcorann__ quits [Ping timeout: 258 seconds] |
| 06:10:31 | | ragu joins |
| 06:10:58 | | ragu is now authenticated as ragu |
| 06:51:14 | | HackMii quits [Remote host closed the connection] |
| 06:51:49 | | HackMii (hacktheplanet) joins |
| 07:00:54 | | Atom-- joins |
| 07:02:25 | | Atom quits [Ping timeout: 258 seconds] |
| 07:08:39 | | systwi (systwi) joins |
| 07:22:29 | | Megame (Megame) joins |
| 07:52:04 | | benjins quits [Read error: Connection reset by peer] |
| 08:01:53 | | BlueMaxima quits [Client Quit] |
| 08:07:29 | | lennier1 quits [Ping timeout: 265 seconds] |
| 08:08:08 | | lennier1 (lennier1) joins |
| 08:25:20 | | Arcorann__ joins |
| 09:06:40 | | HackMii quits [Remote host closed the connection] |
| 09:07:40 | | HackMii (hacktheplanet) joins |
| 09:17:33 | | laser quits [Remote host closed the connection] |
| 09:17:53 | | laser joins |
| 09:26:04 | | monoxane4 quits [Quit: Ping timeout (120 seconds)] |
| 09:26:34 | | monoxane4 (monoxane) joins |
| 09:53:57 | | monoxane4 quits [Client Quit] |
| 10:03:36 | | monoxane4 (monoxane) joins |
| 10:28:10 | | pmlo3 quits [Ping timeout: 244 seconds] |
| 11:03:47 | | benjins joins |
| 11:30:28 | | qwertyasdfuiopghjkl joins |
| 11:46:28 | | Nulo quits [Ping timeout: 258 seconds] |
| 11:53:27 | <pabs> | Fedora retiring one of their services: https://communityblog.fedoraproject.org/retiring-taiga-instance-on-teams-fedoraproject-org/ |
| 11:53:36 | <pabs> | the https://teams.fedoraproject.org/ service |
| 11:53:40 | | godane1 joins |
| 11:57:33 | | godane quits [Ping timeout: 265 seconds] |
| 12:07:53 | | laser quits [Remote host closed the connection] |
| 12:11:52 | | laser joins |
| 12:15:31 | | laser quits [Remote host closed the connection] |
| 12:19:22 | | Timestarter joins |
| 12:19:34 | | laser joins |
| 12:19:38 | | Timestarter quits [Remote host closed the connection] |
| 12:19:50 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 12:22:32 | | Nulo joins |
| 12:23:12 | | laser quits [Remote host closed the connection] |
| 12:29:22 | | laser joins |
| 12:36:18 | | Nulo quits [Ping timeout: 258 seconds] |
| 13:10:03 | | Arcorann__ quits [Ping timeout: 265 seconds] |
| 13:13:00 | | HackMii quits [Remote host closed the connection] |
| 13:13:59 | | HackMii (hacktheplanet) joins |
| 13:23:18 | | march_happy (march_happy) joins |
| 13:26:28 | <march_happy> | How do you guys handle pages having login wall when saving to watch? |
| 13:26:36 | <march_happy> | *WARC |
| 13:27:25 | <march_happy> | Some pages at Baidu Tieba have a login wall, so the operator's account name appears in the saved page |
| 13:39:42 | <march_happy> | Example WACZ: https://www.dropbox.com/s/42vgjklbuvuzu0w/my-web-archive.wacz |
| 13:41:24 | <march_happy> | Notice there is "Catme0w" (The operator's name), and when you click on 2 in "1 2 ..." (page numbers) |
| 13:41:42 | <march_happy> | The layout completely breaks |
| 13:43:17 | <march_happy> | The page "感觉最近没啥动力,来开个进度贴…" is a good example showing current situation. |
| 13:46:35 | <march_happy> | We wish to archive China Furry Fandom's history at Baidu Tieba (As Baidu Tieba is Reddit in China, it's the biggest online forum here). |
| 13:46:55 | <march_happy> | And if it works out, we have a big ambition to archive a full copy of Baidu Tieba up to now. |
| 13:47:49 | <march_happy> | But it just seems that WARC isn't a good enough format... |
| 13:54:16 | <march_happy> | For Chinese websites, would it be better to write a separate frontend and save the data into a database? |
| 13:54:50 | <march_happy> | Login wall is very popular here, and yes, I hate it. |
| 14:04:41 | <thuban> | march_happy: there isn't really a good option for getting around login walls. archiveteam usually avoids targeting login-walled or otherwise non-public data in its own projects. but that's partly because our stuff goes into the wayback machine and we don't want to misrepresent what a site would have looked like to a guest; in special cases where that doesn't apply, and in |
| 14:04:43 | <thuban> | personal archives made by individuals, we usually use burner accounts created for the purpose. |
| 14:06:52 | <thuban> | march_happy: as for WARC, we recommend it really strongly because of its fidelity (including metadata) and institutional support as a standard. it can be difficult to recreate dynamic content, but the file you posted actually looks fine to me in replayweb.page, including pagination in the thread you named; are you sure the problem isn't with your viewer? |
| 14:10:26 | <march_happy> | Maybe... But what can I do to hide operator's name? |
| 14:11:57 | <march_happy> | If WARC retains metadata and is not editable, the operator's account name is leaked |
| 14:12:20 | <march_happy> | And it'll be easy to find identity in real world |
| 14:13:17 | <thuban> | you could create a new account (or multiple accounts) and run the archiving operations under that. is there some reason that wouldn't be effective for baidu? |
| 14:13:42 | | qwertyasdfuiopghjkl joins |
| 14:15:28 | <march_happy> | Login wall mentioned above. Browsing anonymously only allows you viewing the first 50 posts |
| 14:15:28 | <march_happy> | And replies to a post is lost. |
| 14:16:18 | <march_happy> | Sorry let me say it again |
| 14:17:15 | <march_happy> | Browsing anonymously only allows you to view the first 50 comments on a post, and replies to comments are lost. |
| 14:18:15 | <march_happy> | Baidu Tieba doesn't have a comment tree hierarchy like Reddit. Instead, it only has three levels |
| 14:18:35 | <march_happy> | Post -> Comment -> Replies to a Comment |
| 14:19:04 | <thuban> | i understand "browsing anonymously" to mean browsing without being logged into an account at all. but i'm talking about creating an account (not identifiable as you), logging into that one (instead of your usual one), and doing the archiving that way. would that not work? |
| 14:20:57 | <march_happy> | Nope. I've already seen many massive database leaks of Chinese websites. And it's mandatory to use a phone number and upload national ID card info to register an account in mainland China... |
| 14:21:27 | <march_happy> | It's really just a matter of time finding out who you are... |
| 14:22:45 | <march_happy> | The only way to stay anonymous is to not log in, or to clean up the crawled data and save it in a database. |
| 14:23:08 | <thuban> | i'm sorry, that must be incredibly difficult :( |
| 14:23:27 | <thuban> | is it possible to register accounts outside of mainland china? |
| 14:25:31 | <march_happy> | I guess it won't be long before getting banned. And I don't have that many Google Voice numbers! (Theoretically I shouldn't have one, being a Chinese citizen, but I already got it before Google tightened GV registration policies) |
| 14:25:34 | <march_happy> | XD |
| 14:26:36 | <march_happy> | In this case will crawling both to WayBack Machine and as an archive.org user-uploaded content work? |
| 14:29:03 | <thuban> | sorry, what do you mean by "this case"? |
| 14:29:22 | <march_happy> | Saving like a summary to WayBack Machine, for posts more than the brain-dead 50 comments limit, check the user-uploaded database. |
| 14:29:49 | <march_happy> | Of course I'll crawl assets like pictures |
| 14:32:04 | | HP_Archivist (HP_Archivist) joins |
| 14:32:52 | <thuban> | the wayback machine generally doesn't accept WARCs from third parties, since the internet archive can't verify the contents. but they would love to have your data as an archive.org item; a WARC of the anonymous view sounds like a great idea, and a database of the rest is definitely better than nothing. |
| 14:34:15 | <march_happy> | Or if I use WayBack Machine web crawler? I am not sure if feeding it with 5K+ URLs will work |
| 14:34:40 | <march_happy> | As if I am DDoS-ing WayBack Machine |
| 14:34:59 | <thuban> | yeah, that's not recommended (it can't handle the strain). |
| 14:35:54 | <march_happy> | Does it have explicit operation rate-limit? |
| 14:36:07 | <march_happy> | For a single IP? |
| 14:36:43 | <thuban> | no; ime if the load is too high it will just start failing. |
| 14:38:34 | <march_happy> | As I am trying to recover old posts being a discussion group's admin, I also need to be careful with Baidu ( ゚∀。) |
| 14:39:13 | <march_happy> | If I (Or the other operators) get banned by Baidu then... |
| 14:42:49 | <march_happy> | That is why I hate Web 2.0... |
| 14:43:00 | <march_happy> | Everything is up to the platform |
| 14:44:19 | <march_happy> | And when you being an end-user couldn't bring money to the platform, or there's a crazy operator deleting posts inside a discussion group in my case... |
| 14:45:44 | <march_happy> | Your uploaded contents are nuked, no matter how much effort you spent creating that thing... |
| 14:46:12 | <thuban> | yep, non-self-hosted platforms sure are like that :) |
| 14:46:23 | <thuban> | two comments, fwiw: |
| 14:48:23 | | HackMii quits [Remote host closed the connection] |
| 14:49:29 | | HackMii (hacktheplanet) joins |
| 14:51:59 | <thuban> | - tieba's dynamic layout means that the thread contents themselves (and their assets) are in separate network requests from the main site. if those requests aren't identifiable to the account, you could consider including WARCs of those requests, but not the rest, in your database. |
| 14:56:53 | <thuban> | - tieba (or at least its anonymous view) sounds like something to keep in mind as a potential archiveteam project (and WARCs from archiveteam projects _do_ go in the wayback machine). we usually prioritize sites that we know are going to disappear, but we recently started preemptively archiving reddit, and surely those posts are in less danger! so maybe someday... |
| 15:18:34 | | Minkafighter quits [Quit: The Lounge - https://thelounge.chat] |
| 15:19:13 | | Minkafighter joins |
| 15:20:29 | | Minkafighter quits [Client Quit] |
| 15:40:33 | | march_happy quits [Remote host closed the connection] |
| 16:08:42 | | march_happy (march_happy) joins |
| 16:59:16 | | march_happy quits [Ping timeout: 258 seconds] |
| 17:20:34 | | Megame quits [Client Quit] |
| 19:02:42 | | lennier1 quits [Ping timeout: 258 seconds] |
| 19:04:06 | | lennier2 joins |
| 19:07:27 | | lennier2 is now known as lennier1 |
| 19:14:56 | | godane1 quits [Read error: Connection reset by peer] |
| 19:43:44 | | lunik1 quits [Client Quit] |
| 19:43:53 | | lunik1 joins |
| 20:00:24 | | lennier1 quits [Client Quit] |
| 20:02:14 | | AK quits [Quit: Ping timeout (120 seconds)] |
| 20:02:32 | | lennier1 (lennier1) joins |
| 20:02:37 | | AK (AK) joins |
| 20:12:33 | | benjins is now authenticated as benjins |
| 20:14:00 | | AK quits [Ping timeout: 258 seconds] |
| 20:17:50 | | AK (AK) joins |
| 20:26:19 | <@OrIdow6> | I assume there's some sort of auth cookie or similar on the thread content requests, so you'd either have to restrict raw warc access or go with privacy by obscurity |
| 20:26:52 | <@OrIdow6> | If you want to do the two-tier approach, ArchiveTeam may be able to get the public stuff, instead of abusing SPN |
| 20:31:11 | | jamesp (jamesp) joins |
| 20:53:33 | <@OrIdow6> | I would like to do more stuff from China and similar countries |
| 21:03:08 | | thermospheric quits [Quit: Ping timeout (120 seconds)] |
| 21:03:49 | <@OrIdow6> | Reminds me - arkiver: can I get some sort of permission to directly edit https://github.com/archiveteam/wget-lua/wiki/Wget-with-Lua-hooks , or should I follow something like https://stackoverflow.com/questions/10642928/how-to-pull-request-a-wiki-page-on-github ? |
| 21:04:09 | | thermospheric (Thermospheric) joins |
| 21:10:18 | <jamesp> | Should the topic for #archiveteam be updated to remove YouTube since that project is already done? |
| 21:13:27 | <IDK> | Wondering if #// URL project handles JS better than #archivebot |
| 21:17:46 | | Larsenv (Larsenv) joins |
| 21:22:37 | <@OrIdow6> | IDK: No |
| 21:22:45 | <@OrIdow6> | Though neither actually runs Javascript |
| 21:23:27 | <IDK> | Oh welp |
| 21:23:50 | | supersonichub1 joins |
| 21:24:04 | <@OrIdow6> | Is there anything specific you're trying to do? |
| 21:25:03 | <supersonichub1> | Can you point me in the right direction of archiving Twitch streams with Warrior? |
| 21:29:24 | <supersonichub1> | Also wondering how I can run Warrior with podman instead of docker since the latter's daemon is quite temperamental. |
| 21:30:23 | | march_happy (march_happy) joins |
| 21:30:47 | | BlueMaxima joins |
| 21:31:06 | <IDK> | OrIdow6: #archivebot |
| 21:31:14 | <jamesp> | OrIdow6: I also meant to keep "thanks, we know about alexa" in the topic too |
| 21:31:30 | <supersonichub1> | Thanks! |
| 21:31:39 | <IDK> | I mean not really that important but it's just I know it's all going to be deleted in 6 days |
| 21:32:52 | <@OrIdow6> | supersonichub1: I don't think anyone here so far was replying to you. The warrior is for large, centralized projects, and there's no way to use it to save a site without setting up a "project" involving multiple people |
| 21:33:09 | <@OrIdow6> | I know this has been discussed in the past, I believe youtube-dl or whatever the latest fork is may work |
| 21:33:19 | <@OrIdow6> | If you want more info I can look through my logs |
| 21:33:24 | <@OrIdow6> | IDK: What is? |
| 21:33:44 | <@rewby> | yt-dlp is the youtube-dl I usually use |
| 21:34:01 | | Atom joins |
| 21:34:06 | <@OrIdow6> | jamesp: I know, I decided that myself |
| 21:34:10 | | supersonichub1 quits [Remote host closed the connection] |
| 21:34:29 | <IDK> | https://usercontent.irccloud-cdn.com/file/lLVXW4ep/1639690463.JPG |
| 21:34:30 | <jamesp> | OrIdow6: but why? |
| 21:34:31 | <@OrIdow6> | Decided to remove Alexa since it's been a week and seems people have stopped talking about it |
| 21:34:53 | | Atom-- quits [Ping timeout: 258 seconds] |
| 21:35:07 | <IDK> | I think most alexa features are locked behind a login wall |
| 21:35:12 | <jamesp> | Well people are gonna keep mentioning about it there. |
| 21:35:25 | <@rewby> | I've not heard a peep about alexa in like a week, just like OrIdow6. |
| 21:35:28 | <jamesp> | IDK: That doesn't stop us from collecting the data |
| 21:35:32 | <@OrIdow6> | Yes, but that's not what "Thanks, we know about..." is for |
| 21:36:04 | <IDK> | You mean alexa? |
| 21:36:08 | <jamesp> | yes |
| 21:36:15 | <@rewby> | Also, I agree with OrIdow6's decision to remove the alexa bit |
| 21:36:55 | <@OrIdow6> | "Thanks, we know about... " is for when something announces a shutdown and it gets to the top page of Reddit and several people come on here to tell about it, not for every possible topic of discussion in -bs |
| 21:36:58 | <jamesp> | We can still focus on it a little later (a few months), near the end |
| 21:37:50 | <IDK> | sometimes things just get thrown in ab or urls project |
| 21:38:24 | <@OrIdow6> | march_happy: (Repeating since you went offline without me noticing) If you want to do the two-tier approach, ArchiveTeam may be able to get the public stuff, instead of abusing SPN |
| 21:39:11 | <@rewby> | Most of our stuff is not located within china, so I'm curious as to how quickly we get blocked |
| 21:39:43 | <jamesp> | rewby: Why would we get blocked because it's NOT in China? |
| 21:40:04 | <IDK> | Rewby: warning that chinese firewall blocks all suspicious request or throttle most stuff under 500kbps |
| 21:40:08 | <@rewby> | Because they want to archive a chinese website? |
| 21:40:39 | <jamesp> | So we get blocked for trying to archive chinese websites |
| 21:40:40 | <@rewby> | I just kinda jumped on O.rIdow6's message to m.arch_happy |
| 21:40:59 | <jamesp> | The CCP doesn't want us to archive their propaganda |
| 21:41:06 | <@rewby> | China's known for a) having a not-so-great interconnect to the outside world and b) being block happy |
| 21:42:40 | <@rewby> | Iki: Yeah, that's why I wonder how quickly they'd pick up on any grab scripts we write. Since those requests can look rather funny. Especially since most of the warriors are in !china |
| 21:42:43 | <@rewby> | Er |
| 21:42:48 | <@rewby> | IDK: ^ |
| 21:43:28 | <IDK> | rewby: CCP do poison some DNS records |
| 21:43:39 | <IDK> | On purpose |
| 21:45:31 | <@rewby> | Yeah, this all sounds like a "fun" project |
| 21:45:39 | <@rewby> | Aka "a lot of work" |
| 21:50:38 | <@OrIdow6> | So maybe reclassify this one to "only if necessary" then |
| 21:51:33 | <@rewby> | I'm not saying you can't give it a try |
| 21:51:46 | <@rewby> | Just be aware that it'll need a close eye on data integrity |
| 22:30:11 | <@arkiver> | OrIdow6: what are you planning on changing? |
| 22:30:23 | <@arkiver> | a PR always works |
| 22:31:11 | <mpeter|m> | yt-dlp is pretty ok for it. be sure to increase download thread number with the -N param. it makes sense until at most ~32 threads, but only if your ISP allows 500 Mb/s downloads. |
| 22:31:11 | <mpeter|m> | also, you might want to include the option to save the info.json. |
| 22:31:11 | <mpeter|m> | you'll get chapter data too, both in info.json and embedded to the output video file. |
| 22:31:11 | <mpeter|m> | I also have [a fork](https://github.com/mpeter50/yt-dlp/tree/twitchvod-livechat) that allows you to download the live chat too (use the option `--sub-lang live_chat`; be aware that `--sub-lang all` does not imply it, though). It's a bit clunky: the maintainer and I haven't yet worked out the proper yt-dlp way of doing that uniformly for all services |
| 22:31:39 | | mutantmnky quits [Remote host closed the connection] |
| 22:31:40 | | sec^nd quits [Write error: Broken pipe] |
| 22:31:40 | | HackMii quits [Write error: Broken pipe] |
| 22:31:57 | | mutantmnky (mutantmonkey) joins |
| 22:31:58 | | HackMii (hacktheplanet) joins |
| 22:32:00 | | sec^nd (second) joins |
| 22:48:07 | <@JAA> | arkiver: PRs don't work for GitHub wikis. |
| 22:48:24 | <@JAA> | They're full Git repos, but you can't create a PR for them. |
| 22:57:39 | <@arkiver> | right the wiki |
| 22:58:45 | | wyatt8740 quits [Ping timeout: 265 seconds] |
| 23:07:35 | <@OrIdow6> | arkiver: Just looking to add some of the additions that have been made to the API, and maybe improve clarity in some places |
| 23:28:08 | | Arcorann__ joins |
| 23:52:53 | | march_happy quits [Remote host closed the connection] |
| 23:53:09 | | march_happy (march_happy) joins |
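
A rough sketch of the yt-dlp invocation mpeter|m describes at 22:31 (more download threads via `-N`, plus saving the info.json). `-N` and `--write-info-json` are existing yt-dlp options; the helper name `build_ytdlp_command`, the default of 8 threads, and the placeholder URL are assumptions made for illustration only. The 32-thread cap simply mirrors the rule of thumb given in the log and is not a limit enforced by yt-dlp.

```python
import shlex


def build_ytdlp_command(url, threads=8, write_info_json=True):
    """Return a yt-dlp argument list (hypothetical helper, not part of yt-dlp)."""
    # The log suggests more threads stop helping past roughly 32.
    if not 1 <= threads <= 32:
        raise ValueError("threads should be between 1 and 32")
    # -N sets the number of fragments downloaded concurrently.
    cmd = ["yt-dlp", "-N", str(threads)]
    if write_info_json:
        # Save the video metadata next to the download as <name>.info.json.
        cmd.append("--write-info-json")
    cmd.append(url)
    return cmd


if __name__ == "__main__":
    # Placeholder URL; substitute a real VOD address.
    print(shlex.join(build_ytdlp_command("https://www.twitch.tv/videos/0000000000")))
```

The list form can be passed directly to `subprocess.run` if yt-dlp is installed; printing it first makes it easy to check the options before starting a long download.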