#archiveteam-bs log for 2021-12-16

00:06:03		lunik1 quits [Quit: Ping timeout (120 seconds)]
00:06:12		lunik1 joins
00:08:50		lunik1 quits [Client Quit]
00:08:59		lunik1 joins
00:54:40		balrog quits [Quit: Bye]
01:01:37		dm4v_ joins
01:01:37		dm4v quits [Read error: Connection reset by peer]
01:01:49		dm4v_ is now known as dm4v
01:01:52		dm4v is now authenticated as dm4v
01:01:52		dm4v quits [Changing host]
01:01:52		dm4v (dm4v) joins
01:05:18		balrog (balrog) joins
01:17:49		Atom joins
01:46:43		mgrytbak4 joins
01:48:17		mgrytbak quits [Read error: Connection reset by peer]
01:48:17		mgrytbak4 is now known as mgrytbak
01:58:07		onetruth joins
02:02:39		dm4v quits [Read error: Connection reset by peer]
02:03:46		dm4v joins
02:03:48		dm4v is now authenticated as dm4v
02:03:48		dm4v quits [Changing host]
02:03:48		dm4v (dm4v) joins
02:46:08		themadpro (themadpro) joins
02:46:27	<themadpro>	Hey someone on Reddit brought my attention to the fact that the Google Video collections on IA have been locked from downloads https://archive.org/details/GVID-20110417095014-crawl340
02:46:51	<themadpro>	Does anyone know why? https://usercontent.irccloud-cdn.com/file/zmKufJSE/image.png
02:47:06	<themadpro>	Relevant thread https://www.reddit.com/r/Archiveteam/comments/rhd00j/google_video_and_yahoo_video_archives/hoq3np9/
04:09:55		systwi quits [Ping timeout: 258 seconds]
04:27:05		Arcorann__ quits [Ping timeout: 265 seconds]
04:29:45		qw3rty_ joins
04:33:22		qw3rty__ quits [Ping timeout: 265 seconds]
04:58:31		Arcorann__ joins
05:03:49		HP_Archivist quits [Ping timeout: 265 seconds]
05:50:21		Arcorann__ quits [Ping timeout: 258 seconds]
06:10:31		ragu joins
06:10:58		ragu is now authenticated as ragu
06:51:14		HackMii quits [Remote host closed the connection]
06:51:49		HackMii (hacktheplanet) joins
07:00:54		Atom-- joins
07:02:25		Atom quits [Ping timeout: 258 seconds]
07:08:39		systwi (systwi) joins
07:22:29		Megame (Megame) joins
07:52:04		benjins quits [Read error: Connection reset by peer]
08:01:53		BlueMaxima quits [Client Quit]
08:07:29		lennier1 quits [Ping timeout: 265 seconds]
08:08:08		lennier1 (lennier1) joins
08:25:20		Arcorann__ joins
09:06:40		HackMii quits [Remote host closed the connection]
09:07:40		HackMii (hacktheplanet) joins
09:17:33		laser quits [Remote host closed the connection]
09:17:53		laser joins
09:26:04		monoxane4 quits [Quit: Ping timeout (120 seconds)]
09:26:34		monoxane4 (monoxane) joins
09:53:57		monoxane4 quits [Client Quit]
10:03:36		monoxane4 (monoxane) joins
10:28:10		pmlo3 quits [Ping timeout: 244 seconds]
11:03:47		benjins joins
11:30:28		qwertyasdfuiopghjkl joins
11:46:28		Nulo quits [Ping timeout: 258 seconds]
11:53:27	<pabs>	Fedora retiring one of their services: https://communityblog.fedoraproject.org/retiring-taiga-instance-on-teams-fedoraproject-org/
11:53:36	<pabs>	the https://teams.fedoraproject.org/ service
11:53:40		godane1 joins
11:57:33		godane quits [Ping timeout: 265 seconds]
12:07:53		laser quits [Remote host closed the connection]
12:11:52		laser joins
12:15:31		laser quits [Remote host closed the connection]
12:19:22		Timestarter joins
12:19:34		laser joins
12:19:38		Timestarter quits [Remote host closed the connection]
12:19:50		qwertyasdfuiopghjkl quits [Client Quit]
12:22:32		Nulo joins
12:23:12		laser quits [Remote host closed the connection]
12:29:22		laser joins
12:36:18		Nulo quits [Ping timeout: 258 seconds]
13:10:03		Arcorann__ quits [Ping timeout: 265 seconds]
13:13:00		HackMii quits [Remote host closed the connection]
13:13:59		HackMii (hacktheplanet) joins
13:23:18		march_happy (march_happy) joins
13:26:28	<march_happy>	How do you guys handle pages having login wall when saving to watch?
13:26:36	<march_happy>	*WARC
13:27:25	<march_happy>	Some pages at Baidu Tieba have a login wall, so the operator's account name shows up in the capture
13:39:42	<march_happy>	Example WACZ: https://www.dropbox.com/s/42vgjklbuvuzu0w/my-web-archive.wacz
13:41:24	<march_happy>	Notice there is "Catme0w" (The operator's name), and when you click on 2 in "1 2 ..." (page numbers)
13:41:42	<march_happy>	The layout completely breaks
13:43:17	<march_happy>	The page "感觉最近没啥动力，来开个进度贴…" is a good example showing current situation.
13:46:35	<march_happy>	We wish to archive China Furry Fandom's history at Baidu Tieba (As Baidu Tieba is Reddit in China, it's the biggest online forum here).
13:46:55	<march_happy>	And if it works out, we have a big ambition to archive a full copy of Baidu Tieba up to now.
13:47:49	<march_happy>	But it just seems that WARC isn't a good enough format...
13:54:16	<march_happy>	For Chinese websites, will it be better for writing a separate frontend, and saving data into a database?
13:54:50	<march_happy>	Login wall is very popular here, and yes, I hate it.
14:04:41	<thuban>	march_happy: there isn't really a good option for getting around login walls. archiveteam usually avoids targeting login-walled or otherwise non-public data in its own projects. but that's partly because our stuff goes into the wayback machine and we don't want to misrepresent what a site would have looked like to a guest; in special cases where that doesn't apply, and in
14:04:43	<thuban>	personal archives made by individuals, we usually use burner accounts created for the purpose.
14:06:52	<thuban>	march_happy: as for WARC, we recommend it really strongly because of its fidelity (including metadata) and institutional support as a standard. it can be difficult to recreate dynamic content, but the file you posted actually looks fine to me in replayweb.page, including pagination in the thread you named; are you sure the problem isn't with your viewer?
14:10:26	<march_happy>	Maybe... But what can I do to hide operator's name?
14:11:57	<march_happy>	If WARC retains metadata and is not editable, the operator's account name is leaked
14:12:20	<march_happy>	And it'll be easy to find identity in real world
14:13:17	<thuban>	you could create a new account (or multiple accounts) and run the archiving operations under that. is there some reason that wouldn't be effective for baidu?
14:13:42		qwertyasdfuiopghjkl joins
14:15:28	<march_happy>	Login wall mentioned above. Browsing anonymously only allows you viewing the first 50 posts
14:15:28	<march_happy>	And replies to a post is lost.
14:16:18	<march_happy>	Sorry let me say it again
14:17:15	<march_happy>	Browsing anonymously only allows you to view the first 50 comments on a post, and replies to comments are lost.
14:18:15	<march_happy>	Baidu Tieba doesn't have a comment tree hierarchy like Reddit. Instead, it only has three levels
14:18:35	<march_happy>	Post -> Comment -> Replies to a Comment
14:19:04	<thuban>	i understand "browsing anonymously" to mean browsing without being logged into an account at all. but i'm talking about creating an account (not identifiable as you), logging into that one (instead of your usual one), and doing the archiving that way. would that not work?
14:20:57	<march_happy>	Nope. I've already seen many massive database leaks of Chinese websites. And it's mandatory to use a phone number and upload national ID card info for registering account in mainland China...
14:21:27	<march_happy>	It's really just a matter of time finding out who you are...
14:22:45	<march_happy>	The only way to stay anonymous is to not log in, or to clean up the crawled data and save it in a database.
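
The "clean up and save in a database" option could look roughly like the sketch below. It is only an illustration under assumptions: the table names, column names, and the shape of the parsed `thread` dict are hypothetical, not Tieba's actual data model. It mirrors the three levels described above (Post -> Comment -> Replies to a Comment) and stores only content fields, so viewer-specific page chrome such as the logged-in account name never reaches the database.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE comments (
    id INTEGER PRIMARY KEY,
    post_id INTEGER NOT NULL REFERENCES posts(id),
    author TEXT NOT NULL,
    body TEXT NOT NULL
);
CREATE TABLE replies (
    id INTEGER PRIMARY KEY,
    comment_id INTEGER NOT NULL REFERENCES comments(id),
    author TEXT NOT NULL,
    body TEXT NOT NULL
);
"""


def open_archive(path=":memory:"):
    """Create a database with the three-level layout and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_thread(conn, thread):
    """Store one already-parsed thread and return the new post id.

    Only the title, comments, and replies are read from `thread`; any other
    key (for example information about the crawling account) is ignored.
    """
    cur = conn.cursor()
    cur.execute("INSERT INTO posts (title) VALUES (?)", (thread["title"],))
    post_id = cur.lastrowid
    for comment in thread["comments"]:
        cur.execute(
            "INSERT INTO comments (post_id, author, body) VALUES (?, ?, ?)",
            (post_id, comment["author"], comment["body"]),
        )
        comment_id = cur.lastrowid
        # Replies are optional; a comment without them stores nothing here.
        for reply in comment.get("replies", []):
            cur.execute(
                "INSERT INTO replies (comment_id, author, body) VALUES (?, ?, ?)",
                (comment_id, reply["author"], reply["body"]),
            )
    conn.commit()
    return post_id
```

A flat three-table layout is enough here because Tieba, as described, has no deeper comment tree. Parsing the crawled pages into the `thread` dict is left out on purpose, since it depends on page markup not shown in this log.
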
14:23:08	<thuban>	i'm sorry, that must be incredibly difficult :(
14:23:27	<thuban>	is it possible to register accounts outside of mainland china?
14:25:31	<march_happy>	I guess it won't be long before getting banned. And I don't have that many Google Voice numbers! (Theoretically I shouldn't have one as a Chinese citizen, but I got it before Google tightened its GV registration policies)
14:25:34	<march_happy>	XD
14:26:36	<march_happy>	In this case will crawling both to WayBack Machine and as an archive.org user-uploaded content work?
14:29:03	<thuban>	sorry, what do you mean by "this case"?
14:29:22	<march_happy>	Saving like a summary to WayBack Machine, for posts more than the brain-dead 50 comments limit, check the user-uploaded database.
14:29:49	<march_happy>	Of course I'll crawl assets like pictures
14:32:04		HP_Archivist (HP_Archivist) joins
14:32:52	<thuban>	the wayback machine generally doesn't accept WARCs from third parties, since the internet archive can't verify the contents. but they would love to have your data as an archive.org item; a WARC of the anonymous view sounds like a great idea, and a database of the rest is definitely better than nothing.
14:34:15	<march_happy>	Or if I use WayBack Machine web crawler? I am not sure if feeding it with 5K+ URLs will work
14:34:40	<march_happy>	As if I am DDoS-ing WayBack Machine
14:34:59	<thuban>	yeah, that's not recommended (it can't handle the strain).
14:35:54	<march_happy>	Does it have explicit operation rate-limit?
14:36:07	<march_happy>	For a single IP?
14:36:43	<thuban>	no; ime if the load is too high it will just start failing.
14:38:34	<march_happy>	As I am trying to recover old posts being a discussion group's admin, I also need to be careful with Baidu ( ﾟ∀。)
14:39:13	<march_happy>	If I (Or the other operators) get banned by Baidu then...
14:42:49	<march_happy>	That is why I hate Web 2.0...
14:43:00	<march_happy>	Everything is up to the platform
14:44:19	<march_happy>	And when you being an end-user couldn't bring money to the platform, or there's a crazy operator deleting posts inside a discussion group in my case...
14:45:44	<march_happy>	Your uploaded contents are nuked, no matter how much effort you spent creating that thing...
14:46:12	<thuban>	yep, non-self-hosted platforms sure are like that :)
14:46:23	<thuban>	two comments, fwiw:
14:48:23		HackMii quits [Remote host closed the connection]
14:49:29		HackMii (hacktheplanet) joins
14:51:59	<thuban>	- tieba's dynamic layout means that the thread contents themselves (and their assets) are in separate network requests from the main site. if those requests aren't identifiable to the account, you could consider including WARCs of those requests, but not the rest, in your database.
14:56:53	<thuban>	- tieba (or at least its anonymous view) sounds like something to keep in mind as a potential archiveteam project (and WARCs from archiveteam projects _do_ go in the wayback machine). we usually prioritize sites that we know are going to disappear, but we recently started preemptively archiving reddit, and surely those posts are in less danger! so maybe someday...
15:18:34		Minkafighter quits [Quit: The Lounge - https://thelounge.chat]
15:19:13		Minkafighter joins
15:20:29		Minkafighter quits [Client Quit]
15:40:33		march_happy quits [Remote host closed the connection]
16:08:42		march_happy (march_happy) joins
16:59:16		march_happy quits [Ping timeout: 258 seconds]
17:20:34		Megame quits [Client Quit]
19:02:42		lennier1 quits [Ping timeout: 258 seconds]
19:04:06		lennier2 joins
19:07:27		lennier2 is now known as lennier1
19:14:56		godane1 quits [Read error: Connection reset by peer]
19:43:44		lunik1 quits [Client Quit]
19:43:53		lunik1 joins
20:00:24		lennier1 quits [Client Quit]
20:02:14		AK quits [Quit: Ping timeout (120 seconds)]
20:02:32		lennier1 (lennier1) joins
20:02:37		AK (AK) joins
20:12:33		benjins is now authenticated as benjins
20:14:00		AK quits [Ping timeout: 258 seconds]
20:17:50		AK (AK) joins
20:26:19	<@OrIdow6>	I assume there's some sort of auth cookie or similar on the thread content requests, so you'd either have to restrict raw warc access or go with privacy by obscurity
20:26:52	<@OrIdow6>	If you want to do the two-tier approach, ArchiveTeam may be able to get the public stuff, instead of abusing SPN
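
One way to check the assumption about auth cookies before publishing raw captures is to scan the recorded headers of each thread-content request. The sketch below is hedged: the list of sensitive header names is an assumption, not an audit of what Tieba actually sends, and the function names are made up for illustration. Identifiers can also hide in URLs or response bodies, which this does not inspect.

```python
# Header names that commonly carry session or account identifiers.
# This set is an assumption, not a complete list for any particular site.
SENSITIVE_HEADERS = {"cookie", "set-cookie", "authorization"}


def identifying_headers(headers):
    """Return the sorted, lowercased names of possibly identifying headers.

    `headers` is a list of (name, value) pairs captured for one request or
    response; header names are compared case-insensitively.
    """
    return sorted(
        {name.lower() for name, _value in headers if name.lower() in SENSITIVE_HEADERS}
    )


def looks_publishable(request_headers, response_headers):
    """True only if neither side of the exchange carries a sensitive header."""
    return not identifying_headers(request_headers) and not identifying_headers(
        response_headers
    )
```

A capture that fails this check would fall under the "restrict raw warc access" option mentioned above; passing it is a necessary condition, not proof of anonymity.
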
20:31:11		jamesp (jamesp) joins
20:53:33	<@OrIdow6>	I would like to do more stuff from China and similar countries
21:03:08		thermospheric quits [Quit: Ping timeout (120 seconds)]
21:03:49	<@OrIdow6>	Reminds me - arkiver: can I get some sort of permission to directly edit https://github.com/archiveteam/wget-lua/wiki/Wget-with-Lua-hooks , or should I follow something like https://stackoverflow.com/questions/10642928/how-to-pull-request-a-wiki-page-on-github ?
21:04:09		thermospheric (Thermospheric) joins
21:10:18	<jamesp>	Should the topic for #archiveteam be updated to remove YouTube since that project is already done?
21:13:27	<IDK>	Wondering if #// URL project handles JS better than #archivebot
21:17:46		Larsenv (Larsenv) joins
21:22:37	<@OrIdow6>	IDK: No
21:22:45	<@OrIdow6>	Though neither actually runs Javascript
21:23:27	<IDK>	Oh welp
21:23:50		supersonichub1 joins
21:24:04	<@OrIdow6>	Is there anything specific you're trying to do?
21:25:03	<supersonichub1>	Can you point me in the right direction of archiving Twitch streams with Warrior?
21:29:24	<supersonichub1>	Also wondering how I can run Warrior with podman instead of docker since the latter's daemon is quite temperamental.
21:30:23		march_happy (march_happy) joins
21:30:47		BlueMaxima joins
21:31:06	<IDK>	OrIdow6: #archivebot
21:31:14	<jamesp>	OrIdow6: I also meant to keep "thanks, we know about alexa" in the topic too
21:31:30	<supersonichub1>	Thanks!
21:31:39	<IDK>	I mean not really that important but its just I know its all going to be deleted in 6 days
21:32:52	<@OrIdow6>	supersonichub1: I don't think anyone here so far was replying to you. The warrior is for large, centralized projects, and there's no way to use it to save a site without setting up a "project" involving multiple people
21:33:09	<@OrIdow6>	I know this has been discussed in the past, I believe youtube-dl or whatever the latest fork is may work
21:33:19	<@OrIdow6>	If you want more info I can look through my logs
21:33:24	<@OrIdow6>	IDK: What is?
21:33:44	<@rewby>	yt-dlp is the youtube-dl I usually use
21:34:01		Atom joins
21:34:06	<@OrIdow6>	jamesp: I know, I decided that myself
21:34:10		supersonichub1 quits [Remote host closed the connection]
21:34:29	<IDK>	https://usercontent.irccloud-cdn.com/file/lLVXW4ep/1639690463.JPG
21:34:30	<jamesp>	OrIdow6: but why?
21:34:31	<@OrIdow6>	Decided to remove Alexa since it's been a week and seems people have stopped talking about it
21:34:53		Atom-- quits [Ping timeout: 258 seconds]
21:35:07	<IDK>	I think most alexa features are locked behind a login wall
21:35:12	<jamesp>	Well people are gonna keep mentioning about it there.
21:35:25	<@rewby>	I've not heard a peep about alexa in like a week, just like OrIdow6.
21:35:28	<jamesp>	IDK: That doesn't stop us from collecting the data
21:35:32	<@OrIdow6>	Yes, but that's not what "Thanks, we know about..." is for
21:36:04	<IDK>	You mean alexa?
21:36:08	<jamesp>	yes
21:36:15	<@rewby>	Also, I agree with OrIdow6's decision to remove the alexa bit
21:36:55	<@OrIdow6>	"Thanks, we know about... " is for when something announces a shutdown and it gets to the top page of Reddit and several people come on here to tell about it, not for every possible topic of discussion in -bs
21:36:58	<jamesp>	We can still focus on it a little later (a few months), near the end
21:37:50	<IDK>	sometimes things just get thrown in ab or urls project
21:38:24	<@OrIdow6>	march_happy: (Repeating since you went offline without me noticing) If you want to do the two-tier approach, ArchiveTeam may be able to get the public stuff, instead of abusing SPN
21:39:11	<@rewby>	Most of our stuff is not located within china, so I'm curious as to how quickly we get blocked
21:39:43	<jamesp>	rewby: Why would we get blocked because it's NOT in China?
21:40:04	<IDK>	Rewby: warning that the Chinese firewall blocks all suspicious requests or throttles most stuff to under 500kbps
21:40:08	<@rewby>	Because they want to archive a chinese website?
21:40:39	<jamesp>	So we get blocked for trying to archive chinese websites
21:40:40	<@rewby>	I just kinda jumped on O.rIdow6's message to m.arch_happy
21:40:59	<jamesp>	The CCP doesn't want us to archive their propaganda
21:41:06	<@rewby>	China's known for a) having a not-so-great interconnect to the outside world and b) being block happy
21:42:40	<@rewby>	Iki: Yeah, that's why I wonder how quickly they'd pick up on any grab scripts we write. Since those requests can look rather funny. Especially since most of the warriors are in !china
21:42:43	<@rewby>	Er
21:42:48	<@rewby>	IDK: ^
21:43:28	<IDK>	rewby: CCP do poison some DNS records
21:43:39	<IDK>	On purpose
21:45:31	<@rewby>	Yeah, this all sounds like a "fun" project
21:45:39	<@rewby>	Aka "a lot of work"
21:50:38	<@OrIdow6>	So maybe reclassify this one to "only if necessary" then
21:51:33	<@rewby>	I'm not saying you can't give it a try
21:51:46	<@rewby>	Just be aware that it'll need a close eye on data integrity
22:30:11	<@arkiver>	OrIdow6: what are you planning on changing?
22:30:23	<@arkiver>	a PR always works
22:31:11	<mpeter|m>	yt-dlp is pretty ok for it. be sure to increase download thread number with the -N param. it makes sense until at most ~32 threads, but only if your ISP allows 500 Mb/s downloads.
22:31:11	<mpeter|m>	also, you might want to include the option to save the info.json.
22:31:11	<mpeter|m>	you'll get chapter data too, both in info.json and embedded to the output video file.
22:31:11	<mpeter|m>	I also have [a fork](https://github.com/mpeter50/yt-dlp/tree/twitchvod-livechat) that allows you to download the live chat too (use the option `--sub-lang live_chat`, be aware that `--sub-lang all` does not imply it though), though it's a bit clunky, we haven't yet found out with the maintainer what is the proper yt-dlp way of doing that uniformly for all services
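
The yt-dlp advice above can be turned into a small command builder. This is a minimal sketch, not a recommended configuration: `build_ytdlp_command` and the example URL are hypothetical, and the cap of 32 simply encodes the rough "at most ~32 threads" guideline from the message above. The flags themselves are real yt-dlp options: `-N` (`--concurrent-fragments`) and `--write-info-json`. The fork's live-chat option is left out because it is not part of mainline yt-dlp.

```python
import shlex


def build_ytdlp_command(url, fragments=8):
    """Build a yt-dlp argument list for archiving one video.

    -N sets how many fragments are downloaded in parallel, and
    --write-info-json saves the metadata file next to the video.
    """
    # Soft limit taken from the chat advice, not enforced by yt-dlp itself.
    if not 1 <= fragments <= 32:
        raise ValueError("fragments should be between 1 and 32")
    return ["yt-dlp", "-N", str(fragments), "--write-info-json", url]


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    cmd = build_ytdlp_command("https://www.twitch.tv/videos/123456789", fragments=16)
    print(shlex.join(cmd))
```

Returning a list rather than a single string means the result can be passed straight to `subprocess.run` without shell quoting issues.
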
22:31:39		mutantmnky quits [Remote host closed the connection]
22:31:40		sec^nd quits [Write error: Broken pipe]
22:31:40		HackMii quits [Write error: Broken pipe]
22:31:57		mutantmnky (mutantmonkey) joins
22:31:58		HackMii (hacktheplanet) joins
22:32:00		sec^nd (second) joins
22:48:07	<@JAA>	arkiver: PRs don't work for GitHub wikis.
22:48:24	<@JAA>	They're full Git repos, but you can't create a PR for them.
22:57:39	<@arkiver>	right the wiki
22:58:45		wyatt8740 quits [Ping timeout: 265 seconds]
23:07:35	<@OrIdow6>	arkiver: Just looking to add some of the additions that have been made to the API, and maybe improve clarity in some places
23:28:08		Arcorann__ joins
23:52:53		march_happy quits [Remote host closed the connection]
23:53:09		march_happy (march_happy) joins
