00:22:29Arcorann_ quits [Client Quit]
00:22:58Arcorann (Arcorann) joins
01:52:25sonick quits [Client Quit]
02:34:58Stiletto quits [Remote host closed the connection]
02:39:26Stiletto joins
02:45:07eroc1990 quits [Ping timeout: 252 seconds]
02:54:47eroc1990 (eroc1990) joins
03:50:11<h2ibot>Tomodachi94 created CurseForge (+576, Create page): https://wiki.archiveteam.org/?title=CurseForge
03:50:12<h2ibot>Tomodachi94 edited Deathwatch (+15, /* 2022 */ CurseForge: Link out to page): https://wiki.archiveteam.org/?diff=49454&oldid=49445
04:34:03<@JAA>So, as I understand it, keybase.pub is just a web interface to certain directories within the Keybase File System (KBFS). Only that web interface is going away for now. It would probably be good to grab a complete copy of the KBFS though, if that is even possible. It seems only a matter of time before more stuff gets shut down. As far as I can tell, the data is stored centrally on a closed-source
04:34:09<@JAA>server. I haven't been able to find any documentation of the protocol, only high-level descriptions (https://book.keybase.io/docs/files and https://book.keybase.io/docs/files/details ) and details on the cryptography (https://book.keybase.io/docs/crypto/kbfs ). The relevant source code (https://github.com/keybase/client/tree/master/go/kbfs ) is massive, so I quickly gave up trying to dig around there.
04:34:15<@JAA>And the first step to archiving KBFS would be to figure out how we can retrieve its data, other than running the thing locally and then tarring up stuff from /keybase/public, which is suboptimal to say the least (doesn't preserve the signatures etc.).
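For reference, the "suboptimal" local approach JAA describes could look like the sketch below. It assumes the Keybase client is running and KBFS is mounted at /keybase (the default mount point); as noted above, this captures file contents only, not the KBFS signatures or metadata.

```python
import os
import tarfile

def archive_public_dir(mount_root, user, out_path):
    """Naively tar up a user's public KBFS directory.

    This is the approach described above as suboptimal: it preserves
    file contents but loses the KBFS signatures and metadata.
    mount_root would typically be "/keybase" on a machine running
    the Keybase client.
    """
    src = os.path.join(mount_root, "public", user)
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src, arcname=user)
    return out_path
```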
05:08:29Icyelut (Icyelut) joins
06:09:29Ketchup901 quits [Remote host closed the connection]
06:09:46Ketchup901 (Ketchup901) joins
06:21:28hackbug quits [Client Quit]
06:42:01hackbug (hackbug) joins
06:44:29Icyelut|2 (Icyelut) joins
06:48:51Icyelut quits [Ping timeout: 265 seconds]
06:53:17<tomodachi94>JAA: There are a few Go submodules listed at <https://github.com/keybase/client/tree/master/go/kbfs#architecture> which could be used to interface with Keybase. I'm not sure how well-documented they are, though...
06:53:36<tomodachi94>*Keybase Filesystem
06:54:00Ketchup901 quits [Remote host closed the connection]
06:54:27Ketchup901 (Ketchup901) joins
06:54:55<tomodachi94>You're probably interested in libkbfs.
06:54:56<@JAA>tomodachi94: Yeah, and libkbfs is probably the relevant component, but that's about as far as I got.
07:12:33<tomodachi94>JAA: I found what looks like a list of public methods exposed by `libkbfs` at <https://pkg.go.dev/github.com/keybase/kbfs/libkbfs>; I'm not sure how useful that would be, but here it is anyways.
07:19:03<Jake>I feel like keybase.pub would be by far the easiest way to get the data. Somewhat unrelated, but is the code for keybase.pub public? Might be a good place to start?
07:19:16<@JAA>It is not.
07:19:35<@JAA>There's a large and old issue about open-sourcing the server components of Keybase.
07:19:44<Jake>frustrating
07:20:08<@JAA>https://github.com/keybase/client/issues/24105
07:22:30Stiletto quits [Ping timeout: 252 seconds]
07:22:36Stiletto joins
07:27:16Stiletto quits [Ping timeout: 252 seconds]
07:33:49<h2ibot>Sidpatchy edited Tripod (+873, Add info on domain discovery and downloading in…): https://wiki.archiveteam.org/?diff=49456&oldid=28799
07:57:41<tomodachi94>There are a bunch of Japan and Japanese-related files up at <http://ftp.edrdg.org/pub/Nihongo/00INDEX.html>; I've uploaded a few of them to items (the Mac compression-related ones at the very top) in the Internet Archive, but I'm not sure about the rest.
07:58:21<tomodachi94>Included are copies of JMdict, one of the first and most well-regarded open-access digital Japanese dictionaries.
07:59:43<tomodachi94>!a http://ftp.edrdg.org/pub/Nihongo/00INDEX.html#java_r
08:34:39hitgrr8 joins
08:50:10pabs quits [Ping timeout: 265 seconds]
08:52:02pabs (pabs) joins
08:52:31tzt quits [Ping timeout: 252 seconds]
09:01:29tzt (tzt) joins
09:10:49Island quits [Read error: Connection reset by peer]
09:31:38user_ joins
09:34:52umgr036 quits [Ping timeout: 252 seconds]
09:34:55Fatal-Error joins
09:35:42Fatal-Error quits [Remote host closed the connection]
09:39:21Ketchup901 quits [Remote host closed the connection]
09:48:58Ketchup901 (Ketchup901) joins
10:05:01user_ quits [Remote host closed the connection]
10:05:14user_ joins
10:30:34Arcorann quits [Remote host closed the connection]
10:30:34gazorpazorp quits [Remote host closed the connection]
10:30:34user_ quits [Remote host closed the connection]
10:30:38user__ (gazorpazorp) joins
10:30:47user_ joins
10:34:08dan_a quits [Quit: weboooot]
10:36:29Arcorann (Arcorann) joins
10:37:11dan_a (dan_a) joins
12:06:38mut4ntmonkey quits [Client Quit]
12:19:23VerifiedJ quits [Quit: The Lounge - https://thelounge.chat]
12:20:02VerifiedJ (VerifiedJ) joins
12:36:59immibis_ quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
12:43:13sonick (sonick) joins
14:20:18adamus1red quits [Quit: SigTerm]
14:21:43adamus1red (adamus1red) joins
14:25:16Arcorann quits [Ping timeout: 252 seconds]
14:44:14AlsoTheTechRobo joins
14:44:29<AlsoTheTechRobo>Do we have a way of archiving iFastNet pages?
14:45:01<AlsoTheTechRobo>Free hosts like byet.host and infinityfree, etc that use ifastnet have a JavaScript challenge that prevents even SPN from saving the page
14:45:39<AlsoTheTechRobo>Example: I am trying to archive http://gliczide.rf.gd/ ; this is the SPN capture: https://web.archive.org/web/20230211143746/http://gliczide.rf.gd/?i=1
14:46:03<AlsoTheTechRobo>It's unusable, plus in SPN logs, it's clear it was redirected after ?i=3 to a Google "how to enable cookies" page.
14:46:59<AlsoTheTechRobo>So even if it successfully went to ?i=2 and ?i=3, it wouldn't have been a good capture.
14:47:51<AlsoTheTechRobo>More specifically, this is what the "Downloaded elements" box shows: https://pastebin.com/jz40udBn
14:48:22<AlsoTheTechRobo>(that's truncated, but you get the point)
14:50:07<AlsoTheTechRobo>Using curl's default user agent yields an empty response; using the latest chrome user agent yields this reply from `curl -v`: https://pastebin.com/ND0QiPaT
15:04:33<ThreeHM>Using the googlebot user agent gives me the correct page without the JS challenge. In fact, any UA containing the string "Googlebot" seems to work.
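A minimal sketch of the workaround ThreeHM describes: per the observation above, any User-Agent string containing "Googlebot" reportedly skips the iFastNet JS challenge. The full UA string used here is Google's published crawler UA, chosen only as an example; any string containing "Googlebot" should behave the same.

```python
import urllib.request

# Google's published crawler User-Agent, used as an example;
# per the discussion, any UA containing "Googlebot" works.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def googlebot_request(url):
    """Build a request that presents a Googlebot User-Agent,
    which reportedly bypasses the iFastNet JS challenge."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})

req = googlebot_request("http://gliczide.rf.gd/")
```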
15:05:40<AlsoTheTechRobo>Oh yeah, that's a good catch
15:47:44AlsoTheTechRobo quits [Remote host closed the connection]
15:51:04sidpatchy quits [Ping timeout: 252 seconds]
16:05:01<@JAA>Ah yes, the aes.js/SlowAES/f655ba9d09a112d4968c63579db590b4 challenge.
16:05:08lennier2 joins
16:06:08lennier1 quits [Ping timeout: 265 seconds]
16:06:13lennier2 is now known as lennier1
16:07:02michaelblob_ (michaelblob) joins
16:09:13michaelblob quits [Ping timeout: 252 seconds]
16:21:12AlsoTheTechRobo joins
16:21:24<AlsoTheTechRobo>In any case, SPN allows you to set a custom user agent on captures, so that's nice :-)
16:21:31<AlsoTheTechRobo>does archivebot support googlebot user agent?
16:29:20sidpatchy joins
16:55:55katocala quits [Ping timeout: 265 seconds]
17:20:00lennier2 joins
17:22:22lennier1 quits [Ping timeout: 252 seconds]
17:22:34lennier2 quits [Read error: Connection reset by peer]
17:22:51lennier2 joins
17:22:52lennier2 is now known as lennier1
17:26:01DontKnow joins
17:26:12<DontKnow>hello!
17:26:49DontKnow quits [Remote host closed the connection]
17:27:51<@JAA>AlsoTheTechRobo: Not currently, but it could be added with a trivial PR.
17:28:22<AlsoTheTechRobo>wouldn't that require pipelines to be upgraded or just the irc bot?
17:30:48<@JAA>Neither. The UA aliases are stored on the control node, so it can be deployed quickly.
17:36:07<AlsoTheTechRobo>oh, even better!
17:36:09katocala joins
17:38:32AlsoTheTechRobo quits [Remote host closed the connection]
17:38:32qw3rty_ joins
17:38:45fuzzy8021 quits [Killed (NickServ (GHOST command used by fuzzy802!~fuzzy8021@173.224.25.67))]
17:38:45fuzzy8021 (fuzzy8021) joins
17:38:48DLoader_ joins
17:39:19celestial joins
17:39:30benjins2_ joins
17:41:50DLoader quits [Ping timeout: 265 seconds]
17:42:00DLoader_ is now known as DLoader
17:43:17qw3rty quits [Ping timeout: 265 seconds]
17:43:17benjins2__ quits [Ping timeout: 265 seconds]
17:44:54fishingforsoup joins
17:54:25Atom joins
17:57:18Atom__ quits [Ping timeout: 265 seconds]
18:45:08<@JAA>I've started spidering Issuu for users. Already have nearly a million after ten minutes.
18:46:15<@JAA>It's a simple subscription traversal starting from the 'issuu' account, using the undocumented API endpoint.
19:00:38<@JAA>FWIW, don't see rate limits so far at 200 req/s.
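The subscription traversal JAA describes is essentially a breadth-first walk over the follow graph. A generic sketch, with `get_subscriptions` standing in for the (undocumented, and here entirely hypothetical) Issuu API call that lists the accounts a user subscribes to:

```python
from collections import deque

def crawl_subscriptions(seed, get_subscriptions):
    """Breadth-first traversal of a subscription graph.

    Starts from `seed` (e.g. the 'issuu' account) and collects every
    account reachable via subscription links. `get_subscriptions` is a
    placeholder for the real API lookup, which is not documented here.
    """
    seen = {seed}
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        for other in get_subscriptions(user):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen
```

In a real crawler the lookup would be paginated and throttled (the log above reports no rate limiting observed at 200 req/s, but that is specific to this run).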
19:24:30<tomodachi94>Does anyone know what tools to use for extracting .lha files?
19:24:46<tomodachi94>The file in question contains some vintage Amiga software.
19:25:01<@JAA>File Formats Wiki to the rescue: http://fileformats.archiveteam.org/wiki/LHA
19:25:51<tomodachi94>JAA: Thanks, I forgot about File Formats Wiki.
19:26:14<tomodachi94>Oh good, 7-zip supports it.
19:28:55Island joins
19:35:53<balrog>was the docstoc archive ingested into the WBM, and if not, is there a way to find a document by id?
20:00:46tzt quits [Ping timeout: 252 seconds]
20:07:54tzt (tzt) joins
20:08:29Larsenv quits [Quit: ZNC 1.8.2+deb2build5 - https://znc.in]
20:11:07Larsenv (Larsenv) joins
20:19:47tzt quits [Read error: Connection reset by peer]
20:20:35tzt (tzt) joins
20:25:34luna joins
20:36:05<tomodachi94>Would something like <https://archive.org/details/soder> go under Community Texts or Community Datasets?
20:39:00<pokechu22>Looks more like datasets to me IMO
20:39:18<pokechu22>especially since the licensing note says "This dataset"
20:45:45<thuban>_does_ spn allow you to set a custom user agent?
20:47:33<thuban>there's no indication of it on the web form, i can't find any api documentation, and my understanding of the 'user_agent' param on waybackpy (etc) is that it's the ua supplied by python to the spn endpoint, not by spn to the captured page
20:51:38<@JAA>Not that I heard of, and it definitely doesn't reuse the agent supplied by the user.
20:51:52<@JAA>Old SPN did at one point, I believe.
20:52:11<thuban>thanks, that's what i thought.
20:53:35<@JAA>`curl -A 'SomeAgent' https://web.archive.org/save/https://httpbingo.org/get` → https://web.archive.org/web/20230211205054/https://httpbingo.org/get
20:54:13<thuban>(:
20:54:22<@JAA>Interesting 'Via' header.
20:55:09<thuban>kind of a shame, really--it seems like more and more social media sites like to redirect image requests in a way ia can't handle
21:02:11<pokechu22>I think SPN does respect the Accept header or something, or at least when embedding images in a page it'll request it as an image instead of html for those sites that do stupid stuff
21:02:40<pokechu22>more for the request to web.archive.org/web/1234im_/url redirecting to saving it
21:10:28<tomodachi94>pokechu22: How would I get that changed from Texts to Datasets?
21:10:56<pokechu22>I don't know how to change it unfortunately - not sure if you can (there's a few I've incorrectly set myself)
21:16:05<thuban>pokechu22: doesn't seem to be the case--
21:16:56<thuban>`curl -A 'TestAgent' -H 'Accept: image/*, */*' https://web.archive.org/save/https://httpbingo.org/get` comes back accepting "text/html,application/xhtml+xml,application/xml;q=0.9,image/apng,*/*;q=0.8;v=b3".
21:17:32<@JAA>You probably get my snapshot, not a new one, due to the 45-minute resave limit.
21:17:59<thuban>oh, maybe.
21:18:53<thuban>at any rate i know for a fact that its attempt to request embedded images as actual images doesn't work consistently
21:19:57<thuban>e.g.: https://web.archive.org/web/20230211201009/https://stonedrunkwizard.tumblr.com/post/708989727879102464/just-another-really-big-doodle-dump
21:20:20<tech234a>Has https://twittercommunity.com/ been run recently? See https://techcrunch.com/2023/02/09/twitter-puts-its-developer-community-website-behind-a-login-after-announcing-new-api-pricing/
21:20:35<tech234a>The site has apparently been reopened
21:21:23<@JAA>It was run through AB last April.
21:24:13<@JAA>(Has it really been 10 months since Elon's initial bid to buy Twitter‽)
21:24:58<@JAA>Might be worth another run, yeah. 80 GiB in 11 days at the time.
21:25:06<@JAA>Ryz: ^
21:25:30<Ryz>o#O;
21:26:31<thuban>^ yeah, `curl -H 'Accept: image/*' https://web.archive.org/save/<some tumblr image>` doesn't work either
21:32:28<pokechu22>I think it's something like /save/embed that works
21:32:31<pokechu22>but I'm not 100% sure
21:38:49<thuban>i can't find any evidence of such a thing
21:40:50<pokechu22>I'm pretty sure I've seen it sometimes when loading a website where the HTML was saved but images weren't
21:41:02<pokechu22>but I don't remember the details and I doubt it's documented anywhere
21:43:09<@JAA>Yeah, I think there's an underscore somewhere as well, either _embed or embed_.
21:43:29<@JAA>You get there if you request /web/1234im_/ and the image isn't saved, I believe.
21:43:42<@JAA>It's not really visible unless you look at the HTTP requests.
21:51:09<thuban>it's /save/_embed/, but i can't get that to work either
21:59:10Larsenv quits [Client Quit]
21:59:11<thuban>(archivebot, of course, _can_ set the user-agent and would handle this perfectly, and i've considered batching up all the old urls from my logs that i know didn't work well through spn and submitting them there. but half the reason i started auto-archiving in the first place is the speed at which tumblr stuff disappears...)
22:01:43Larsenv (Larsenv) joins
22:17:31john123521 joins
22:22:31nico_32_ quits [Remote host closed the connection]
22:33:14TheTechRobo (TheTechRobo) joins
22:58:30john123521 quits [Remote host closed the connection]
22:58:36spirit joins
23:01:06hitgrr8 quits [Client Quit]
23:13:57BlueMaxima joins
23:37:13Arcorann joins
23:37:13Arcorann quits [Changing host]
23:37:13Arcorann (Arcorann) joins
23:47:17<fishingforsoup>Help.
23:47:40<fishingforsoup>I have a server link for a game I am pretty sure is going to shut down soon. Is there any way to scrape it?
23:47:40<fishingforsoup>http://jdnowweb-s.cdn.ubi.com/
23:48:58<tomodachi94>fishingforsoup: What's the game in question?
23:49:04<fishingforsoup>Just Dance Now.
23:57:59user_ quits [Remote host closed the connection]