00:22:29Arcorann_ quits [Client Quit]
00:22:58Arcorann (Arcorann) joins
01:52:25sonick quits [Client Quit]
02:34:58Stiletto quits [Remote host closed the connection]
02:39:26Stiletto joins
02:45:07eroc1990 quits [Ping timeout: 252 seconds]
02:54:47eroc1990 (eroc1990) joins
03:50:11<h2ibot>Tomodachi94 created CurseForge (+576, Create page): https://wiki.archiveteam.org/?title=CurseForge
03:50:12<h2ibot>Tomodachi94 edited Deathwatch (+15, /* 2022 */ CurseForge: Link out to page): https://wiki.archiveteam.org/?diff=49454&oldid=49445
04:34:03<@JAA>So, as I understand it, keybase.pub is just a web interface to certain directories within the Keybase File System (KBFS). Only that web interface is going away for now. It would probably be good to grab a complete copy of the KBFS though, if that is even possible. It seems only a matter of time before more stuff gets shut down. As far as I can tell, the data is stored centrally on a closed-source
04:34:09<@JAA>server. I haven't been able to find any documentation of the protocol, only high-level descriptions (https://book.keybase.io/docs/files and https://book.keybase.io/docs/files/details ) and details on the cryptography (https://book.keybase.io/docs/crypto/kbfs ). The relevant source code (https://github.com/keybase/client/tree/master/go/kbfs ) is massive, so I quickly gave up trying to dig around there.
04:34:15<@JAA>And the first step to archiving KBFS would be to figure out how we can retrieve its data, other than running the thing locally and then tarring up stuff from /keybase/public, which is suboptimal to say the least (doesn't preserve the signatures etc.).
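For reference, the "suboptimal" local approach JAA describes could look like the sketch below. It assumes the Keybase client is running and KBFS is mounted at /keybase (the default mount point); as noted above, this captures file contents only, not the KBFS signatures or metadata.

```python
import os
import tarfile

def archive_public_dir(mount_root, user, out_path):
    """Naively tar up a user's public KBFS directory.

    This is the approach described above as suboptimal: it preserves
    file contents but loses the KBFS signatures and metadata.
    mount_root would typically be "/keybase" on a machine running
    the Keybase client.
    """
    src = os.path.join(mount_root, "public", user)
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src, arcname=user)
    return out_path
```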
05:08:29Icyelut (Icyelut) joins
06:09:29Ketchup901 quits [Remote host closed the connection]
06:09:46Ketchup901 (Ketchup901) joins
06:21:28hackbug quits [Client Quit]
06:42:01hackbug (hackbug) joins
06:44:29Icyelut|2 (Icyelut) joins
06:48:51Icyelut quits [Ping timeout: 265 seconds]
06:53:17<tomodachi94>JAA: There are a few Go submodules listed at <https://github.com/keybase/client/tree/master/go/kbfs#architecture> which could be used to interface with Keybase. I'm not sure how well-documented they are, though...
06:53:36<tomodachi94>*Keybase Filesystem
06:54:00Ketchup901 quits [Remote host closed the connection]
06:54:27Ketchup901 (Ketchup901) joins
06:54:55<tomodachi94>You're probably interested in libkbfs.
06:54:56<@JAA>tomodachi94: Yeah, and libkbfs is probably the relevant component, but that's about as far as I got.
07:12:33<tomodachi94>JAA: I found what looks like a list of public methods exposed by `libkbfs` at <https://pkg.go.dev/github.com/keybase/kbfs/libkbfs>; I'm not sure how useful that would be, but here it is anyways.
07:19:03<Jake>I feel like keybase.pub would be by far the easiest way to get the data. Somewhat unrelated, but is the code for keybase.pub public? Might be a good place to start?
07:19:16<@JAA>It is not.
07:19:35<@JAA>There's a large and old issue about open-sourcing the server components of Keybase.
07:19:44<Jake>frustrating
07:20:08<@JAA>https://github.com/keybase/client/issues/24105
07:22:30Stiletto quits [Ping timeout: 252 seconds]
07:22:36Stiletto joins
07:27:16Stiletto quits [Ping timeout: 252 seconds]
07:33:49<h2ibot>Sidpatchy edited Tripod (+873, Add info on domain discovery and downloading in…): https://wiki.archiveteam.org/?diff=49456&oldid=28799
07:57:41<tomodachi94>There are a bunch of Japan and Japanese-related files up at <http://ftp.edrdg.org/pub/Nihongo/00INDEX.html>; I've uploaded a few of them to items (the Mac compression-related ones at the very top) in the Internet Archive, but I'm not sure about the rest.
07:58:21<tomodachi94>Included are copies of JMdict, one of the first and most well-regarded open-access digital Japanese dictionaries.
07:59:43<tomodachi94>!a http://ftp.edrdg.org/pub/Nihongo/00INDEX.html#java_r
08:34:39hitgrr8 joins
08:50:10pabs quits [Ping timeout: 265 seconds]
08:52:02pabs (pabs) joins
08:52:31tzt quits [Ping timeout: 252 seconds]
09:01:29tzt (tzt) joins
09:10:49Island quits [Read error: Connection reset by peer]
09:31:38user_ joins
09:34:52umgr036 quits [Ping timeout: 252 seconds]
09:34:55Fatal-Error joins
09:35:42Fatal-Error quits [Remote host closed the connection]
09:39:21Ketchup901 quits [Remote host closed the connection]
09:48:58Ketchup901 (Ketchup901) joins
10:05:01user_ quits [Remote host closed the connection]
10:05:14user_ joins
10:30:34Arcorann quits [Remote host closed the connection]
10:30:34gazorpazorp quits [Remote host closed the connection]
10:30:34user_ quits [Remote host closed the connection]
10:30:38user__ (gazorpazorp) joins
10:30:47user_ joins
10:34:08dan_a quits [Quit: weboooot]
10:36:29Arcorann (Arcorann) joins
10:37:11dan_a (dan_a) joins
12:06:38mut4ntmonkey quits [Client Quit]
12:19:23VerifiedJ quits [Quit: The Lounge - https://thelounge.chat]
12:20:02VerifiedJ (VerifiedJ) joins
12:36:59immibis_ quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
12:43:13sonick (sonick) joins
14:20:18adamus1red quits [Quit: SigTerm]
14:21:43adamus1red (adamus1red) joins
14:25:16Arcorann quits [Ping timeout: 252 seconds]
14:44:14AlsoTheTechRobo joins
14:44:29<AlsoTheTechRobo>Do we have a way of archiving iFastNet pages?
14:45:01<AlsoTheTechRobo>Free hosts like byet.host and infinityfree, etc that use ifastnet have a JavaScript challenge that prevents even SPN from saving the page
14:45:39<AlsoTheTechRobo>Example: I am trying to archive http://gliczide.rf.gd/ ; this is the SPN capture: https://web.archive.org/web/20230211143746/http://gliczide.rf.gd/?i=1
14:46:03<AlsoTheTechRobo>It's unusable, plus in SPN logs, it's clear it was redirected after ?i=3 to a Google "how to enable cookies" page.
14:46:59<AlsoTheTechRobo>So even if it successfully went to ?i=2 and ?i=3, it wouldn't have been a good capture.
14:47:51<AlsoTheTechRobo>More specifically, this is what the "Downloaded elements" box shows: https://pastebin.com/jz40udBn
14:48:22<AlsoTheTechRobo>(that's truncated, but you get the point)
14:50:07<AlsoTheTechRobo>Using curl's default user agent yields an empty response; using the latest chrome user agent yields this reply from `curl -v`: https://pastebin.com/ND0QiPaT
15:04:33<ThreeHM>Using the googlebot user agent gives me the correct page without the JS challenge. In fact, any UA containing the string "Googlebot" seems to work.
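A minimal sketch of the workaround ThreeHM describes: per the observation above, any User-Agent string containing "Googlebot" reportedly skips the iFastNet JS challenge. The full UA string used here is Google's published crawler UA, chosen only as an example; any string containing "Googlebot" should behave the same.

```python
import urllib.request

# Google's published crawler User-Agent, used as an example;
# per the discussion, any UA containing "Googlebot" works.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def googlebot_request(url):
    """Build a request that presents a Googlebot User-Agent,
    which reportedly bypasses the iFastNet JS challenge."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})

req = googlebot_request("http://gliczide.rf.gd/")
```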
15:05:40<AlsoTheTechRobo>Oh yeah, that's a good catch
15:47:44AlsoTheTechRobo quits [Remote host closed the connection]
15:51:04sidpatchy quits [Ping timeout: 252 seconds]
16:05:01<@JAA>Ah yes, the aes.js/SlowAES/f655ba9d09a112d4968c63579db590b4 challenge.
16:05:08lennier2 joins
16:06:08lennier1 quits [Ping timeout: 265 seconds]
16:06:13lennier2 is now known as lennier1
16:07:02michaelblob_ (michaelblob) joins
16:09:13michaelblob quits [Ping timeout: 252 seconds]
16:21:12AlsoTheTechRobo joins
16:21:24<AlsoTheTechRobo>In any case, SPN allows you to set a custom user agent on captures, so that's nice :-)
16:21:31<AlsoTheTechRobo>does archivebot support googlebot user agent?
16:29:20sidpatchy joins
16:55:55katocala quits [Ping timeout: 265 seconds]
17:20:00lennier2 joins
17:22:22lennier1 quits [Ping timeout: 252 seconds]
17:22:34lennier2 quits [Read error: Connection reset by peer]
17:22:51lennier2 joins
17:22:52lennier2 is now known as lennier1
17:26:01DontKnow joins
17:26:12<DontKnow>hello!
17:26:49DontKnow quits [Remote host closed the connection]
17:27:51<@JAA>AlsoTheTechRobo: Not currently, but it could be added with a trivial PR.
17:28:22<AlsoTheTechRobo>wouldn't that require pipelines to be upgraded or just the irc bot?
17:30:48<@JAA>Neither. The UA aliases are stored on the control node, so it can be deployed quickly.
17:36:07<AlsoTheTechRobo>oh, even better!
17:36:09katocala joins
17:38:32AlsoTheTechRobo quits [Remote host closed the connection]
17:38:32qw3rty_ joins
17:38:45fuzzy8021 quits [Killed (NickServ (GHOST command used by fuzzy802!~fuzzy8021@173.224.25.67))]
17:38:45fuzzy8021 (fuzzy8021) joins
17:38:48DLoader_ joins
17:39:19celestial joins
17:39:30benjins2_ joins
17:41:50DLoader quits [Ping timeout: 265 seconds]
17:42:00DLoader_ is now known as DLoader
17:43:17qw3rty quits [Ping timeout: 265 seconds]
17:43:17benjins2__ quits [Ping timeout: 265 seconds]
17:44:54fishingforsoup joins
17:54:25Atom joins
17:57:18Atom__ quits [Ping timeout: 265 seconds]
18:45:08<@JAA>I've started spidering Issuu for users. Already have nearly a million after ten minutes.
18:46:15<@JAA>It's a simple subscription traversal starting from the 'issuu' account, using the undocumented API endpoint.
19:00:38<@JAA>FWIW, don't see rate limits so far at 200 req/s.
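The subscription traversal JAA describes is essentially a breadth-first walk over the follow graph. A generic sketch, with `get_subscriptions` standing in for the (undocumented, and here entirely hypothetical) Issuu API call that lists the accounts a user subscribes to:

```python
from collections import deque

def crawl_subscriptions(seed, get_subscriptions):
    """Breadth-first traversal of a subscription graph.

    Starts from `seed` (e.g. the 'issuu' account) and collects every
    account reachable via subscription links. `get_subscriptions` is a
    placeholder for the real API lookup, which is not documented here.
    """
    seen = {seed}
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        for other in get_subscriptions(user):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen
```

In a real crawler the lookup would be paginated and throttled (the log above reports no rate limiting observed at 200 req/s, but that is specific to this run).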
19:24:30<tomodachi94>Does anyone know what tools to use for extracting .lha files?
19:24:46<tomodachi94>The file in question contains some vintage Amiga software.
19:25:01<@JAA>File Formats Wiki to the rescue: http://fileformats.archiveteam.org/wiki/LHA
19:25:51<tomodachi94>JAA: Thanks, I forgot about File Formats Wiki.
19:26:14<tomodachi94>Oh good, 7-zip supports it.
19:28:55Island joins
19:35:53<balrog>was the docstoc archive ingested into the WBM, and if not, is there a way to find a document by id?
20:00:46tzt quits [Ping timeout: 252 seconds]
20:07:54tzt (tzt) joins
20:08:29Larsenv quits [Quit: ZNC 1.8.2+deb2build5 - https://znc.in]
20:11:07Larsenv (Larsenv) joins
20:19:47tzt quits [Read error: Connection reset by peer]
20:20:35tzt (tzt) joins
20:25:34luna joins
20:36:05<tomodachi94>Would something like <https://archive.org/details/soder> go under Community Texts or Community Datasets?
20:39:00<pokechu22>Looks more like datasets to me IMO
20:39:18<pokechu22>especially since the licensing note says "This dataset"
20:45:45<thuban>_does_ spn allow you to set a custom user agent?
20:47:33<thuban>there's no indication of it on the web form, i can't find any api documentation, and my understanding of the 'user_agent' param on waybackpy (etc) is that it's the ua supplied by python to the spn endpoint, not by spn to the captured page
20:51:38<@JAA>Not that I heard of, and it definitely doesn't reuse the agent supplied by the user.
20:51:52<@JAA>Old SPN did at one point, I believe.
20:52:11<thuban>thanks, that's what i thought.
20:53:35<@JAA>`curl -A 'SomeAgent' https://web.archive.org/save/https://httpbingo.org/get` → https://web.archive.org/web/20230211205054/https://httpbingo.org/get
20:54:13<thuban>(:
20:54:22<@JAA>Interesting 'Via' header.
20:55:09<thuban>kind of a shame, really--it seems like more and more social media sites like to redirect image requests in a way ia can't handle
21:02:11<pokechu22>I think SPN does respect the Accept header or something, or at least when embedding images in a page it'll request it as an image instead of html for those sites that do stupid stuff
21:02:40<pokechu22>more for the request to web.archive.org/web/1234im_/url redirecting to saving it
21:10:28<tomodachi94>pokechu22: How would I get that changed from Texts to Datasets?
21:10:56<pokechu22>I don't know how to change it unfortunately - not sure if you can (there's a few I've incorrectly set myself)
21:16:05<thuban>pokechu22: doesn't seem to be the case--
21:16:56<thuban>`curl -A 'TestAgent' -H 'Accept: image/*, */*' https://web.archive.org/save/https://httpbingo.org/get` comes back accepting "text/html,application/xhtml+xml,application/xml;q=0.9,image/apng,*/*;q=0.8;v=b3".
21:17:32<@JAA>You probably get my snapshot, not a new one, due to the 45-minute resave limit.
21:17:59<thuban>oh, maybe.
21:18:53<thuban>at any rate i know for a fact that its attempt to request embedded images as actual images doesn't work consistently
21:19:57<thuban>e.g.: https://web.archive.org/web/20230211201009/https://stonedrunkwizard.tumblr.com/post/708989727879102464/just-another-really-big-doodle-dump
21:20:20<tech234a>Has https://twittercommunity.com/ been run recently? See https://techcrunch.com/2023/02/09/twitter-puts-its-developer-community-website-behind-a-login-after-announcing-new-api-pricing/
21:20:35<tech234a>The site has apparently been reopened
21:21:23<@JAA>It was run through AB last April.
21:24:13<@JAA>(Has it really been 10 months since Elon's initial bid to buy Twitter‽)
21:24:58<@JAA>Might be worth another run, yeah. 80 GiB in 11 days at the time.
21:25:06<@JAA>Ryz: ^
21:25:30<Ryz>o#O;
21:26:31<thuban>^ yeah, `curl -H 'Accept: image/*' https://web.archive.org/save/<some tumblr image>` doesn't work either
21:32:28<pokechu22>I think it's something like /save/embed that works
21:32:31<pokechu22>but I'm not 100% sure
21:38:49<thuban>i can't find any evidence of such a thing
21:40:50<pokechu22>I'm pretty sure I've seen it sometimes when loading a website where the HTML was saved but images weren't
21:41:02<pokechu22>but I don't remember the details and I doubt it's documented anywhere
21:43:09<@JAA>Yeah, I think there's an underscore somewhere as well, either _embed or embed_.
21:43:29<@JAA>You get there if you request /web/1234im_/ and the image isn't saved, I believe.
21:43:42<@JAA>It's not really visible unless you look at the HTTP requests.
21:51:09<thuban>it's /save/_embed/, but i can't get that to work either
21:59:10Larsenv quits [Client Quit]
21:59:11<thuban>(archivebot, of course, _can_ set the user-agent and would handle this perfectly, and i've considered batching up all the old urls from my logs that i know didn't work well through spn and submitting them there. but half the reason i started auto-archiving in the first place is the speed at which tumblr stuff disappears...)
22:01:43Larsenv (Larsenv) joins
22:17:31john123521 joins
22:22:31nico_32_ quits [Remote host closed the connection]
22:33:14TheTechRobo (TheTechRobo) joins
22:58:30john123521 quits [Remote host closed the connection]
22:58:36spirit joins
23:01:06hitgrr8 quits [Client Quit]
23:13:57BlueMaxima joins
23:37:13Arcorann joins
23:37:13Arcorann quits [Changing host]
23:37:13Arcorann (Arcorann) joins
23:47:17<fishingforsoup>Help.
23:47:40<fishingforsoup>I have a server link for a game I am pretty sure is going to shut down soon. Is there any way to scrape it?
23:47:40<fishingforsoup>http://jdnowweb-s.cdn.ubi.com/
23:48:58<tomodachi94>fishingforsoup: What's the game in question?
23:49:04<fishingforsoup>Just Dance Now.
23:57:59user_ quits [Remote host closed the connection]