00:13:15Lord_Nightmare quits [Quit: ZNC - http://znc.in]
00:18:45Lord_Nightmare (Lord_Nightmare) joins
00:30:40Arcorann (Arcorann) joins
01:17:29DiscantX joins
01:44:33myself quits [Quit: The Lounge - https://thelounge.chat]
01:48:05myself joins
01:51:07tbc1887 (tbc1887) joins
01:51:42DiscantX quits [Client Quit]
02:08:27DiscantX joins
02:15:52www2 quits [Ping timeout: 252 seconds]
02:31:49tzt quits [Ping timeout: 252 seconds]
02:33:25tzt (tzt) joins
02:40:40DiscantX quits [Client Quit]
02:45:49DiscantX joins
02:59:41hackbug (hackbug) joins
03:08:40hackbug quits [Ping timeout: 252 seconds]
03:11:09hackbug (hackbug) joins
03:17:55BlueMaxima quits [Read error: Connection reset by peer]
03:19:25tzt quits [Remote host closed the connection]
03:19:47tzt (tzt) joins
03:32:40tbc1887 quits [Client Quit]
05:04:10hackbug quits [Ping timeout: 252 seconds]
05:44:11umgr036 quits [Remote host closed the connection]
05:44:25umgr036 joins
06:45:40<@OrIdow6>If a question in #archiveteam could be resolved in one or a few answers I don't think sending them to #bs is necessary
06:45:42<@OrIdow6>*messages
06:48:16<pokechu22>I directed them to #bs because I wasn't sure if an answer would be available anytime soon (and so far it seems like nobody's had an answer) - better to let them know to check elsewhere immediately than to have it just be silent for a while and like an hour later have someone redirect them IMO
06:49:56michaelblob_ quits [Read error: Connection reset by peer]
06:50:51<@OrIdow6>Well I see the point with making it clear to them that someone in the room's not dead
06:53:00<@OrIdow6>Channel for that is I believe #nintendone BTW, no substantial activity for over a year, but status on that specific service I don't know
06:53:43michaelblob (michaelblob) joins
08:13:39hitgrr8 joins
08:16:38DLoader joins
08:26:56Arachnophine quits [Remote host closed the connection]
08:27:39Arachnophine (Arachnophine) joins
10:09:07www2 joins
10:25:14<Jake>#nintendone was originally for Super Mario Maker, rather than the eShop. My understanding: the eShop is copyrighted games that aren't publicly accessible, so I'm not sure there's anything that we can save there.
10:51:39www2 quits [Client Quit]
11:22:23jacksonchen666 (jacksonchen666) joins
11:57:39hackbug (hackbug) joins
11:58:05hitgrr8 quits [Client Quit]
12:24:53HP_Archivist (HP_Archivist) joins
13:00:15HP_Archivist quits [Client Quit]
13:15:45jacksonchen666 quits [Client Quit]
13:20:16Arcorann quits [Ping timeout: 252 seconds]
13:26:44katocala quits [Remote host closed the connection]
15:00:01Arachnophine quits [Client Quit]
15:00:01qwertyasdfuiopghjkl quits [Remote host closed the connection]
15:00:05Arachnophine (Arachnophine) joins
15:01:45gydgyd joins
15:03:50Arachnophine quits [Client Quit]
15:03:55Arachnophine (Arachnophine) joins
15:10:28benjins2 joins
15:13:50<gydgyd>Hey there, hope this is allowed! I'm trying to get a single photo from the Panoramio archive for my father.
15:13:57<gydgyd>I know the photo ID and I generally see how the WARC files are stored in sets of 10: panoramio-photos_xxxxx0-xxxxx9
15:14:07<gydgyd>So I was just wondering if there is any kind of searchable database or something, because right now I just manually open the collection pages
15:14:12<gydgyd>Sometimes I get close (it's a 9 digit number), but yeah it's slow
15:14:18<gydgyd>Once again sorry if it's not allowed here
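[Editor's note: the lookup described above can be sketched as a small helper. The `panoramio-photos_<low>-<high>` naming pattern is taken from gydgyd's description of the collection and is an assumption; verify it against the actual archive items before relying on it.]

```python
def panoramio_bucket(photo_id: int) -> str:
    """Return the name of the bucket-of-ten collection item that should
    contain the given photo ID, following the panoramio-photos_<low>-<high>
    pattern described in the chat (assumed, not verified)."""
    low = (photo_id // 10) * 10
    return f"panoramio-photos_{low}-{low + 9}"

# A 9-digit ID like the one gydgyd mentions:
print(panoramio_bucket(123456789))  # panoramio-photos_123456780-123456789
```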
15:18:08<@arkiver>gydgyd: absolutely allowed
15:18:21<@arkiver>you could PM me the ID privately?
15:19:03<gydgyd>Yes!
15:24:02qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
15:28:42umgr036 quits [Remote host closed the connection]
15:33:25umgr036 joins
15:34:14umgr036 quits [Remote host closed the connection]
15:34:28umgr036 joins
16:37:15<cm>so I notice that this page doesn't have a paywall, at least not with my cookies/ip: https://www.ft.com/content/80818949-cbf6-4830-8703-0e561e2fead7
16:37:43<cm>but the archive.org page shows a paywall: http://web.archive.org/web/20230302015333/https://www.ft.com/content/80818949-cbf6-4830-8703-0e561e2fead7
16:38:01<cm>I get a similar paywall if I wget the url
16:39:18<cm>I wonder if this reveals a shortcoming in the warc format? if I can record my interaction with a web site, I should be able to emulate that web site by serving some archive file, no?
16:40:25<cm>perhaps it's not an issue with the archive format, maybe the crawler is just not getting the non-paywalled version due to its IP being known?
16:41:28<cm>I believe getting past the paywall requires running some JS and possibly allowing the JS to make requests; do archive.org or other archive tools record that part of the interaction?
16:43:10<pokechu22>web.archive.org does run JS when saving; archivebot doesn't
16:43:34<cm>archive.is does a better job for this article: https://archive.is/7MyEE
16:44:25<cm>so what's the difference between the methods of web.archive.org and archive.is?
16:44:43<cm>if the former runs JS, is archive.is doing something more to save the page?
16:45:39<pokechu22>It's possible that ft.com has archive.org blacklisted to always get the paywall, not sure
16:45:58<pokechu22>(or they have some kind of per-IP limit, and archive.org has hit that while archive.is hasn't)
16:47:10<cm>archive.is seems to store the page in a rendered form that doesn't include e.g. the sticky navigation bar or the cookie banner
16:47:46<pokechu22>Yeah, they don't store it in WARC and it's not suitable for a normal replay
16:47:53<qwertyasdfuiopghjkl>I'm getting a paywall on that ft.com link and I have JS enabled. Maybe it's a geoblocking thing
16:47:55<pokechu22>They're also signed in to accounts on some sites (e.g. github)
16:48:11<cm>kind of seems like archive.is converts the state of the dom into html so that you can reproduce the page without running the same JS
16:48:13<pokechu22>archive.org does have an option to save a screenshot: https://web.archive.org/web/20230317164443/http://web.archive.org/screenshot/https://www.ft.com/content/80818949-cbf6-4830-8703-0e561e2fead7 - but it looks like that didn't help
16:49:25<cm>I would like to investigate the archive.is storage format but the "download .zip" function 404s
16:52:39<klg_>idk what heuristics ft uses, but it shows the paywall to me as well; archive.is used to go to great lengths to avoid getting blocked by facebook etc, so maybe they just know which IP to access ft from to get the content
17:03:50<cm>since I'm able to get the content in my browser, I wonder if there's an archive tool I could run on my computer to produce an archive with the page content?
17:41:45<thuban>cm: yes. the tool we most commonly recommend for this purpose is warcprox: https://github.com/internetarchive/warcprox
17:42:57<thuban>once you've recorded a warc, you can view it in a player like replayweb.page: https://replayweb.page/, https://github.com/webrecorder/replayweb.page
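[Editor's note: a minimal sketch of the workflow thuban describes, routing requests through a locally running warcprox instance so they get written to a WARC. The port is an assumption; check `warcprox --help` or the project README for the actual default on your install.]

```python
import urllib.request

# warcprox (https://github.com/internetarchive/warcprox) runs as a local
# MITM proxy and writes everything passing through it to WARC files.
# Port 8000 is assumed here -- adjust to match your warcprox invocation.
proxy = urllib.request.ProxyHandler({
    "http": "http://localhost:8000",
    "https": "http://localhost:8000",
})
opener = urllib.request.build_opener(proxy)
# opener.open("https://example.com/")  # each request/response lands in the WARC
print(proxy.proxies["http"])
```

In practice you would point a real browser at the proxy instead, so the site's JS runs and its subresource requests are captured too; the resulting WARC can then be replayed in replayweb.page as noted above.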
18:21:07user_ joins
18:21:35Arachnophine9 (Arachnophine) joins
18:22:24Arachnophine quits [Client Quit]
18:22:24umgr036 quits [Remote host closed the connection]
18:22:24gydgyd quits [Client Quit]
18:22:24qwertyasdfuiopghjkl quits [Client Quit]
18:22:24Arachnophine9 is now known as Arachnophine
18:22:43<@OrIdow6>Jake: I believe there was talk of using it for other Nintendo-related things
18:24:17<@OrIdow6>But what you say on the EShop sounds right
18:27:36gydgyd joins
18:31:50qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
18:41:14gydgyd quits [Remote host closed the connection]
19:18:36<JensRex>Remember when we backed up all of the Intel Download Center? Is there a convenient way to search that? I tried putting the model number into archive.org, but I'm not getting useful results.
19:19:23<JensRex>wb machine, I mean.
19:34:22<pokechu22>https://web.archive.org/web/*/example.com/* might help for that depending on how the URLs are structured, but it also might not (and there's a 10k limit for that view)
19:39:48<@OrIdow6>Looks like someone SPNd the list pages e.g. https://web.archive.org/web/20190901145929/https://downloadcenter.intel.com/product/80939/Graphics-Drivers
19:40:07<@OrIdow6>To the effect that the "show more" thing works
19:42:33<JensRex>Specifically looking for an updated BIOS for the DCP847SKE (aka DCP847DYE for some reason). I found a newer one on some download site, but with no documentation, and I have no idea if it's the latest :(
19:43:29<JensRex>Axing all of that was such a brain damaged move by Intel.
19:47:06Maddie_ quits [Quit: upgrade]
19:57:01<@JAA>There's no convenient way for searching, but we should have everything that existed at the time of archival. IIRC, we previously saw that things had been removed already.
19:59:02<@OrIdow6>https://web.archive.org/web/20190901155237/https://downloadcenter.intel.com/product/71620/Intel-NUC-Board-DCP847SKE ?
20:29:47<JensRex>OrIdow6: You did what I could not. Thanks.
20:36:27<mgrandi>https://twitter.com/lancereddick he passed away today, dunno if he has an official site too
21:06:37Maddie joins
21:41:23katocala joins
21:44:02anon89 joins
21:44:15anon89 quits [Remote host closed the connection]
22:30:51voltagex|m joins
22:31:45<voltagex|m>Hey, is there a way to archive someone else's Twitter account, including media?
22:33:08<pokechu22>socialbot can save tweets, things linked from tweets, and images in tweets (but not videos in tweets) and upload it to web.archive.org automatically. It uses snscrape internally which is generally available
22:33:49<@JAA>socialbot lives in #archivebot except it's down for maintenance right now.
22:34:46<pokechu22>Plus you need to be voiced to use it, but I'd be happy to run the command for you when it's back online
22:36:25<voltagex|m>Is it able to at least flag which videos are there? I can feed a list to yt-dlp
22:36:45<voltagex|m>Anyway, thanks, I didn't know about snscrape
22:37:02<@JAA>snscrape extracts videos (but doesn't download them itself). socialbot does not.
22:38:52<voltagex|m>Oh whoops, I wonder how "reacts" are translated here
22:39:36<pokechu22>https://github.com/JustAnotherArchivist/snscrape for reference
22:45:21BlueMaxima joins
23:12:55Pichu0202 joins
23:14:16Pichu0102 quits [Ping timeout: 252 seconds]
23:27:12<cm>thuban: I tried wget with the --warc-file option, since it's packaged for my distro
23:27:32<cm>but the result does not include the page content
23:28:14<thuban>that's why i suggested warcprox
23:28:18<@JAA>cm: Upstream wget's WARC code is buggy.
23:28:20<cm>do you think warcprox is more likely to be able to fetch the page content?
23:28:27<cm>JAA: still?
23:28:31<@JAA>Yes, still.
23:28:32<thuban>yes, because it relies on your browser to decide what to request
23:28:51<@JAA>The bump to WARC/1.1 never happened, sadly.
23:29:05<cm>is it still going to?
23:29:13<@JAA>And they're reluctant to fix their misleading WARC/1.0 output.
23:29:43<@JAA>Not a clue. I've poked darnir about it a couple times without success.
23:30:27<cm>thuban: ahh that makes sense
23:30:29<@JAA>Most work is going towards wget2 now, I think, which doesn't support WARC yet at all.
23:31:14<cm>so there's not really a command line tool that will pretend to be a browser and run JS in order to get the full version of a page?
23:31:42<cm>anyone know how archive.org does it?
23:31:48<cm><pokechu22> web.archive.org does run JS when saving; archivebot doesn't
23:32:12<pokechu22>I've heard of something called brozzler or something but I'm not sure if that's what they use or any details
23:32:41<thuban>it is https://github.com/internetarchive/brozzler
23:33:18<@JAA>Yeah, the current iteration of SPN uses brozzler.
23:33:26<@JAA>(As far as I know, anyway.)
23:33:28<thuban>(it's possible to run this locally, but this requires rethinkdb and i would not call it convenient)
23:33:57<@JAA>Everything that comes from the webrecorder people should be avoided for capturing traffic.
23:34:17<cm>the webrecorder people?
23:34:54<@JAA>So yeah, brozzler or something with a similar approach (browser with automation + a MITM proxy that writes to WARC) is the way to go. I'm not aware of anything other than brozzler that doesn't come from webrecorder.
23:35:05<@JAA>https://github.com/webrecorder
23:35:29<cm>cool I see
23:35:35<@JAA>https://github.com/webrecorder/warcio/issues/created_by/JustAnotherArchivist for a selection of reasons why their stuff needs to be avoided.
23:35:59<@JAA>They write inaccurate WARCs and don't seem to care too much about it.
23:36:53<cm>hm I see
23:37:23<thuban>unfortunately, it's not really possible to "run JS in order to get the full version of a page" in the general case, due to site-specific interactive elements, etc.
23:37:23<cm>thanks for helping me understand how this stuff works
23:38:00<cm>I think I might email the archive.is guy to ask how he generates his archives
23:39:06<thuban>(although the general approaches are different, both archive.today and archiveteam's own projects involve some degree of manual tailoring per site.)
23:40:27<pokechu22>example: http://archive.today/2023.03.16-070651/https://github.com/ is signed in
23:40:48<@JAA>WTF is that date format? lol
23:41:23<pokechu22>It also apparently supports a saner one but that's the one you get from the share link :/
23:42:00<pokechu22>(https://archive.ph/20230316070651/https://github.com - just get rid of the punctuation)
23:42:03<@JAA>Also interesting, that leaks some private repos. Hmmm...
23:42:38<pokechu22>You've also got http://archive.today/20230302022032/https://github.com/notifications?query=is:unread
23:43:33<cm>lol
23:43:48<cm>does it leak the code for the site somewhere?
23:44:46<@JAA>Aw: https://archive.ph/4Niz4
23:45:19<cm>what's volth?
23:45:38<@JAA>The account used for those logged-in snapshots.
23:51:25DLoader_ joins
23:56:09DLoader quits [Ping timeout: 265 seconds]
23:56:17DLoader_ is now known as DLoader
23:58:50<cm>I guess there is _some_ justification to keep things like archive.today closed source, lest certain sites take counter-measures against the various methods of obtaining content
23:59:14<cm>I wonder if archiveteam is ever worried about that?
23:59:30Arcorann (Arcorann) joins
23:59:38<cm>ripcord, the unauthorized discord/slack client, is in a similar position