#archiveteam-bs log for 2024-02-13

00:01:29		Ketchup901 quits [Remote host closed the connection]
00:01:36		Ketchup901 (Ketchup901) joins
00:32:30	<@OrIdow6>	Do we know why Mastodon fails to play back in SPN etc? Looking at a live instance there doesn't seem to be POST, nondeterminism, etc
00:59:24		datechnoman quits [Client Quit]
01:05:06		datechnoman (datechnoman) joins
01:15:52		Island quits [Read error: Connection reset by peer]
01:23:02		Island joins
02:10:21		jasons quits [Ping timeout: 272 seconds]
02:23:01		f_ quits [Read error: Connection reset by peer]
02:23:53		wyatt8750 joins
02:24:45		f_ (funderscore) joins
02:27:27		wyatt8740 quits [Ping timeout: 272 seconds]
02:41:01	<pabs>	what tools are people using for dumping URLs from AB meta-WARCs? in the past I just did ghetto grep/sed, but uhhh :)
02:41:48	<@JAA>	Ghetto grep :-)
02:42:15		Naruyoko joins
02:42:26	<@JAA>	archivebot-log-extract-ignores and wpull2-log-extract-errors in little-things, for example
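
The "ghetto grep" approach above can be sketched in Python. This is only a minimal illustration, not the little-things scripts JAA mentions: it assumes the meta-WARC is gzip-compressed and that URLs appear as plain text in the embedded log lines. The names `extract_urls` and `urls_from_meta_warc` are hypothetical.

```python
import gzip
import re
from typing import Iterable, Iterator

# Assumption: URLs in the log text are delimited by whitespace, quotes, or angle brackets.
URL_RE = re.compile(rb"https?://[^\s\"'<>]+")


def extract_urls(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield each distinct URL found in the given byte lines, in first-seen order."""
    seen = set()
    for line in lines:
        for match in URL_RE.finditer(line):
            url = match.group(0).decode("utf-8", errors="replace")
            if url not in seen:
                seen.add(url)
                yield url


def urls_from_meta_warc(path: str) -> Iterator[str]:
    """Scan a gzip-compressed meta-WARC line by line (hypothetical helper)."""
    # gzip.open also reads files made of several concatenated gzip members.
    with gzip.open(path, "rb") as fh:
        yield from extract_urls(fh)
```

Because this is a plain text scan, it will also pick up URLs that were only mentioned in log messages (for example ignored or failed ones), so the output may need further filtering.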
02:50:55		Island quits [Read error: Connection reset by peer]
03:01:41		Island joins
03:13:15		jasons (jasons) joins
03:19:08		eggdrop quits []
03:20:08		eggdrop (eggdrop) joins
03:42:02		tech234a (tech234a) joins
04:10:20		jasons quits [Ping timeout: 240 seconds]
04:20:22		Naruyoko quits [Client Quit]
04:24:53		Naruyoko joins
04:40:25		Darken quits [Read error: Connection reset by peer]
04:40:45		Darken (Darken) joins
04:47:25		monoxane quits [Ping timeout: 272 seconds]
05:05:41		Wohlstand quits [Client Quit]
05:13:44		jasons (jasons) joins
05:38:37		Island quits [Read error: Connection reset by peer]
05:40:43		Wohlstand (Wohlstand) joins
05:54:47	<h2ibot>	FireonLive edited List of websites excluded from the Wayback Machine (+31, add showyourdick.org): https://wiki.archiveteam.org/?diff=51720&oldid=51699
06:00:48	<h2ibot>	JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=51721&oldid=51720
06:13:20		jasons quits [Ping timeout: 240 seconds]
06:35:13		BlueMaxima quits [Client Quit]
06:43:13		sonick (sonick) joins
07:16:49		jasons (jasons) joins
07:42:02		Arcorann (Arcorann) joins
08:17:50		jasons quits [Ping timeout: 240 seconds]
08:22:27		qwertyasdfuiopghjkl quits [Remote host closed the connection]
08:24:36		qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
08:50:50		Wohlstand quits [Ping timeout: 240 seconds]
09:00:02		ThetaDev quits [Client Quit]
09:00:47		ThetaDev joins
09:21:23		jasons (jasons) joins
09:24:14		emtee quits [Remote host closed the connection]
10:00:02		Bleo18260 quits [Client Quit]
10:01:21		Bleo18260 joins
10:33:14		sonick quits [Client Quit]
11:04:46		pedantic-darwin quits [Quit: The Lounge - https://thelounge.chat]
11:26:30	<thuban>	OrIdow6: presumably the webpack is too obfuscated for wbm's rewriter (wombat.js, i believe) to recognize and handle
11:32:41	<thuban>	(i don't think it _either_ follows the chunk loading _or_ makes any api calls--the api response is there on the backend, but the rewritten js in the wbm's frontend can't find it)
11:33:25	<thuban>	i think we can safely bet that gotm.io won't work either
11:39:08	<eightthree>	otherwise public content obtained through a logged in account is offlimits? i.e. code search results on github?
11:43:36		qwertyasdfuiopghjkl quits [Remote host closed the connection]
11:51:53		qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
11:54:08	<@OrIdow6>	thuban: Ah, not really a JS person, that's a shame
11:54:23	<@OrIdow6>	At some point I'll have to see if it works in my hackish WBM-as-HTTP-proxy thing
11:56:04	<@OrIdow6>	eightthree: In general yes - Github code search is more in the category of "why would we archive this in the first place?" but for instance a forum that requires users to be logged in is definitely no-go
11:57:03	<@OrIdow6>	No hard and fast rule but for instance NSFW is usually an exception, basically since that's not a desire for privacy by the users/owner of the site, just them protecting themselves from legal stuff
11:59:28	<@OrIdow6>	And there's a big gray area cause sites use accounts for bot protection, to hide lists of posts that are themselves public, etc.
11:59:58	<thuban>	OrIdow6: yeah
12:00:01	<thuban>	ot1h, i see why the wbm works the way it does given its constraints (especially historically)
12:00:11	<thuban>	otoh, if you can do it conveniently, proxying is sure to be much more effective!
12:00:16	<eightthree>	OrIdow6: am i understanding correctly that websites that are or have nsfw content and require logging in to view them, in this case the archiving is ok for AT? this is the case for all major projects, reddit, twitter, or even some of the full-on porn sites that were archived?
12:01:04	<@OrIdow6>	thuban: Yeah, it plays back a lot better than the web version
12:01:40	<thuban>	eightthree: we've occasionally made that exception in the past for nsfw-focused sites (https://wiki.archiveteam.org/index.php/FurAffinity)
12:02:03	<thuban>	but increasingly rarely and reluctantly
12:02:43	<thuban>	it doesn't apply to any major or current projects
12:03:10	<@OrIdow6>	I've looked at trying to make something out of Webkit or similar that passes all HTTP traffic thru the WBM (something less hackish than my solution, which is a Python script + several minutes of configuring a Firefox profile) but IIRC most of them were dead-set on doing their own networking
12:03:33	<@OrIdow6>	*HTTP/S
12:04:30	<eightthree>	thuban: so nsfw or login-to-view content on youtube, x.com, reddit,telegram etc is not archived, right?
12:05:54	<thuban>	eightthree: it's a bit complicated, since 'nsfw' and 'login to view' are generally not the same or even similar
12:06:19		benjinsm quits [Ping timeout: 272 seconds]
12:09:21	<thuban>	if it's nsfw but can be viewed freely or with a freely available mechanism (like reddit's 'i'm over 18' cookie), we'll get it
12:09:55	<thuban>	if it requires login to view, we won't, whether or not it's nsfw
12:10:24	<eightthree>	Im guessing Arkiver2 isn't a single human user but is a generic account where various people use it to contribute?
12:11:05	<eightthree>	they seem to have most of the commits to the projectname-grab and projectname-items repos
12:11:24	<thuban>	arkiver is very much a real person :)
12:11:33	<eightthree>	!!!
12:11:45	<Maakuth|m>	on this very channel too
12:11:58	<eightthree>	busfactor = 1.0x?
12:12:37	<eightthree>	will the project survive if they can no longer contribute?
12:12:50		jasons quits [Ping timeout: 240 seconds]
12:13:38	<thuban>	not _quite_ that bad; there are other people who understand the relevant infra even if they presently contribute much less
12:14:37	<thuban>	but yeah it's a known issue
12:16:47	<thuban>	OrIdow6: implement a browser in your browser so you can make http requests while you make http requests! that's what webassembly was intended for, right?
12:21:37	<eightthree>	ah
12:25:09		benjins joins
12:37:59		Arcorann quits [Ping timeout: 272 seconds]
12:38:41	<@OrIdow6>	thuban: I actually considered that at one point, it seemed like too much work though
12:48:23	<h2ibot>	OrIdow6 edited Parler (+1215, I think enough time has passed to write a bit…): https://wiki.archiveteam.org/?diff=51722&oldid=49713
13:09:27	<h2ibot>	OrIdow6 uploaded File:Archiveteam and archiveteam-bs unique users, 2012-2024.png (A Pyplot of the number of unique users talking…): https://wiki.archiveteam.org/?title=File%3AArchiveteam%20and%20archiveteam-bs%20unique%20users%2C%202012-2024.png
13:13:28	<h2ibot>	OrIdow6 edited Parler (+158, /* Grab */ Add image showing the people coming…): https://wiki.archiveteam.org/?diff=51724&oldid=51722
13:16:19		jasons (jasons) joins
13:23:35	<TheTechRobo>	Service workers would also work, I think
14:00:08		etnguyen03 (etnguyen03) joins
14:03:57		SootBector quits [Remote host closed the connection]
14:04:17		SootBector (SootBector) joins
14:08:22		SootBector quits [Remote host closed the connection]
14:08:41		SootBector (SootBector) joins
14:12:20		jasons quits [Ping timeout: 240 seconds]
14:18:17		lucas joins
14:26:04	<lucas>	Hi! I was wondering if anybody thought about archiving Twitter threads that appear on threadreaderapp.com? It seems it's already a way to crowdsource which collections of tweets have been deemed of interest, and maybe it's even easier than scraping from twitter.com. Thanks!
14:29:26		Darken quits [Read error: Connection reset by peer]
14:34:23	<@arkiver>	aninternettroll: what you want to archive would be mastodon? in general?
14:34:51	<@arkiver>	eightthree: i'm real yes :)
14:53:05		lucas quits [Remote host closed the connection]
14:59:27	<aninternettroll>	arkiver: first thought went to speedrun.com actually, which has a big community of api users with which you could get a lot of URLs
14:59:55	<aninternettroll>	(also an excuse to do something in rust)
15:00:25	<aninternettroll>	anyway, those plans are very young and might go nowhere
15:03:12		lucas joins
15:03:51		lucas is now known as ketsapiwiq
15:12:25	<eightthree>	arkiver: awesome! how/where do I grep, and for what, to know which projects include/avoid nsfw or "login-needed-to-view" content
15:14:10	<eightthree>	I tried nsfw already but not much found, the second I have no idea how (or if) you write that in issues/commits personally. I dont have code search, so if thats the way, I will have to...download all the projects and grep them locally?
15:15:37		jasons (jasons) joins
15:47:11	<aninternettroll>	tbh mastodon wouldn't be a bad idea either, though it would require some lua knowledge and some analysing first. Sounds fun though
15:48:54	<that_lurker>	You could maybe do mastodon through the rss feed by adding .rss after the username
15:51:21	<aninternettroll>	the more interesting goal would be to archive everything the web ui needs for a given link
16:04:33	<DigitalDragons>	if not doing a full page grab it would probably be better to save the activitypub objects instead of the rss feed
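
The two options discussed above (the RSS feed versus the ActivityPub objects) differ mainly in what is requested. A minimal sketch follows, assuming an instance that serves profiles at `/@username`; the helper names, the example instance, and the exact URL layout are illustrative assumptions, and no request is actually sent.

```python
import urllib.request


def rss_feed_url(instance: str, username: str) -> str:
    """Build the RSS feed URL by appending .rss to the profile URL."""
    return f"https://{instance}/@{username}.rss"


def activitypub_request(instance: str, username: str) -> urllib.request.Request:
    """Prepare (but do not send) a request asking for the ActivityPub JSON form."""
    # Assumption: the server uses content negotiation on the profile URL.
    return urllib.request.Request(
        f"https://{instance}/@{username}",
        headers={"Accept": "application/activity+json"},
    )


if __name__ == "__main__":
    print(rss_feed_url("example.social", "alice"))
    req = activitypub_request("example.social", "alice")
    print(req.full_url, req.get_header("Accept"))
```

Neither of these captures what the web UI itself loads for a given link, which is the goal aninternettroll describes.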
16:11:17	<thuban>	eightthree: the premise of your question seems kind of confused (if we're not saving some portion of a site, we... don't write code to save it). what exactly are you trying to do here? what is your interest in nsfw specifically?
16:14:53		ketsapiwiq quits [Ping timeout: 265 seconds]
16:16:20		jasons quits [Ping timeout: 240 seconds]
16:16:43		pedantic-darwin joins
16:20:28	<eightthree>	thuban: nsfw isnt just lewd stuff...there's also relatively controversial content, proof of war crimes etc...
16:21:06	<thuban>	right, but why are you asking about it?
16:22:15	<eightthree>	ive also seen relatively tame text only content (incorrectly?) tagged with nsfw, like people would tag it as such just in case
16:24:58	<thuban>	right, but again, what of it?
16:25:58	<thuban>	i don't mean to be interrogatory, it just seems like there's a lingering misunderstanding here
16:32:21	<imer>	as far as I remember there wasn't anything "special" about nsfw content per se that would stop us archiving it, I believe that wasn't always the stance?
16:32:21	<imer>	nsfw usually means porn though.. and porn videos are big and there's many of them, which makes archiving those not feasible
16:32:21	<imer>	there's also the issue of things locked behind a login.. which is adjacent, but a different thing altogether
16:32:49	<eightthree>	im trying to get a feel for how things are included or excluded from the archive created vs the original i.e. mapping out the differences (for my personal understanding)
16:33:19	<thuban>	eightthree: if we're saving a site, and something's publicly posted on it, we try and save it
16:33:27	<thuban>	there's no additional level of moderation
16:33:39	<eightthree>	censorship is also very subjective
16:34:26	<imer>	given constraints of data size of course, and we do sometimes skip stuff or grab only the lower quality versions if time is running out until shutdown
16:35:00	<thuban>	yeah, hence "try"
16:41:49		etnguyen03 quits [Ping timeout: 272 seconds]
17:02:47		lucas joins
17:04:29		lucas quits [Remote host closed the connection]
17:20:15		jasons (jasons) joins
18:06:24		qwertyasdfuiopghjkl quits [Ping timeout: 256 seconds]
18:15:09		Wohlstand (Wohlstand) joins
18:19:50		jasons quits [Ping timeout: 240 seconds]
18:39:51		etnguyen03 (etnguyen03) joins
19:04:57		Hackerpcs quits [Quit: Hackerpcs]
19:07:02		Hackerpcs (Hackerpcs) joins
19:09:22		Island joins
19:25:20		etnguyen03 quits [Ping timeout: 240 seconds]
19:28:33		jasons (jasons) joins
19:51:02		icedice quits [Client Quit]
20:20:50		jasons quits [Ping timeout: 240 seconds]
21:23:47		qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
21:24:27		jasons (jasons) joins
21:33:08		threedeeitguy398 (threedeeitguy) joins
21:34:25		threedeeitguy39 quits [Ping timeout: 272 seconds]
21:34:25		threedeeitguy398 is now known as threedeeitguy39
21:51:31		BlueMaxima joins
21:57:40		threedeeitguy397 (threedeeitguy) joins
21:58:20		threedeeitguy39 quits [Ping timeout: 240 seconds]
21:58:20		threedeeitguy397 is now known as threedeeitguy39
22:22:20		jasons quits [Ping timeout: 240 seconds]
22:34:33		neggles quits [Quit: bye friends - ZNC - https://znc.in]
22:34:46		neggles (neggles) joins
23:15:20		jasons (jasons) joins