| 00:22:29 | | Arcorann_ quits [Client Quit] |
| 00:22:58 | | Arcorann (Arcorann) joins |
| 01:52:25 | | sonick quits [Client Quit] |
| 02:34:58 | | Stiletto quits [Remote host closed the connection] |
| 02:39:26 | | Stiletto joins |
| 02:45:07 | | eroc1990 quits [Ping timeout: 252 seconds] |
| 02:54:47 | | eroc1990 (eroc1990) joins |
| 03:50:11 | <h2ibot> | Tomodachi94 created CurseForge (+576, Create page): https://wiki.archiveteam.org/?title=CurseForge |
| 03:50:12 | <h2ibot> | Tomodachi94 edited Deathwatch (+15, /* 2022 */ CurseForge: Link out to page): https://wiki.archiveteam.org/?diff=49454&oldid=49445 |
| 04:34:03 | <@JAA> | So, as I understand it, keybase.pub is just a web interface to certain directories within the Keybase File System (KBFS). Only that web interface is going away for now. It would probably be good to grab a complete copy of the KBFS though, if that is even possible. It seems only a matter of time before more stuff gets shut down. As far as I can tell, the data is stored centrally on a closed-source |
| 04:34:09 | <@JAA> | server. I haven't been able to find any documentation of the protocol, only high-level descriptions (https://book.keybase.io/docs/files and https://book.keybase.io/docs/files/details ) and details on the cryptography (https://book.keybase.io/docs/crypto/kbfs ). The relevant source code (https://github.com/keybase/client/tree/master/go/kbfs ) is massive, so I quickly gave up trying to dig around there. |
| 04:34:15 | <@JAA> | And the first step to archiving KBFS would be to figure out how we can retrieve its data, other than running the thing locally and then taring up stuff from /keybase/public, which is suboptimal to say the least (doesn't preserve the signatures etc.). |
| 05:08:29 | | Icyelut (Icyelut) joins |
| 06:09:29 | | Ketchup901 quits [Remote host closed the connection] |
| 06:09:46 | | Ketchup901 (Ketchup901) joins |
| 06:21:28 | | hackbug quits [Client Quit] |
| 06:42:01 | | hackbug (hackbug) joins |
| 06:44:29 | | Icyelut|2 (Icyelut) joins |
| 06:48:51 | | Icyelut quits [Ping timeout: 265 seconds] |
| 06:53:17 | <tomodachi94> | JAA: There are a few Go submodules listed at <https://github.com/keybase/client/tree/master/go/kbfs#architecture> which could be used to interface with Keybase. I'm not sure how well-documented they are, though... |
| 06:53:36 | <tomodachi94> | *Keybase Filesystem |
| 06:54:00 | | Ketchup901 quits [Remote host closed the connection] |
| 06:54:27 | | Ketchup901 (Ketchup901) joins |
| 06:54:55 | <tomodachi94> | You're probably interested in libkbfs. |
| 06:54:56 | <@JAA> | tomodachi94: Yeah, and libkbfs is probably the relevant component, but that's about as far as I got. |
| 07:12:33 | <tomodachi94> | JAA: I found what looks like a list of public methods exposed by `libkbfs` at <https://pkg.go.dev/github.com/keybase/kbfs/libkbfs>; I'm not sure how useful that would be, but here it is anyways. |
| 07:19:03 | <Jake> | I feel like keybase.pub would be by far the easiest way to get the data. Somewhat unrelated, but is the code for keybase.pub public? Might be a good place to start? |
| 07:19:16 | <@JAA> | It is not. |
| 07:19:35 | <@JAA> | There's a large and old issue about open-sourcing the server components of Keybase. |
| 07:19:44 | <Jake> | frustrating |
| 07:20:08 | <@JAA> | https://github.com/keybase/client/issues/24105 |
| 07:22:30 | | Stiletto quits [Ping timeout: 252 seconds] |
| 07:22:36 | | Stiletto joins |
| 07:27:16 | | Stiletto quits [Ping timeout: 252 seconds] |
| 07:33:49 | <h2ibot> | Sidpatchy edited Tripod (+873, Add info on domain discovery and downloading in…): https://wiki.archiveteam.org/?diff=49456&oldid=28799 |
| 07:57:41 | <tomodachi94> | There are a bunch of Japan and Japanese-related files up at <http://ftp.edrdg.org/pub/Nihongo/00INDEX.html>; I've uploaded a few of them to items (the Mac compression-related ones at the very top) in the Internet Archive, but I'm not sure about the rest. |
| 07:58:21 | <tomodachi94> | Included are copies of JMdict, one of the most well-regarded and one of the first open-access digital Japanese dictionaries. |
| 07:59:43 | <tomodachi94> | !a http://ftp.edrdg.org/pub/Nihongo/00INDEX.html#java_r |
| 08:34:39 | | hitgrr8 joins |
| 08:50:10 | | pabs quits [Ping timeout: 265 seconds] |
| 08:52:02 | | pabs (pabs) joins |
| 08:52:31 | | tzt quits [Ping timeout: 252 seconds] |
| 09:01:29 | | tzt (tzt) joins |
| 09:10:49 | | Island quits [Read error: Connection reset by peer] |
| 09:31:38 | | user_ joins |
| 09:34:52 | | umgr036 quits [Ping timeout: 252 seconds] |
| 09:34:55 | | Fatal-Error joins |
| 09:35:42 | | Fatal-Error quits [Remote host closed the connection] |
| 09:39:21 | | Ketchup901 quits [Remote host closed the connection] |
| 09:48:58 | | Ketchup901 (Ketchup901) joins |
| 10:05:01 | | user_ quits [Remote host closed the connection] |
| 10:05:14 | | user_ joins |
| 10:30:34 | | Arcorann quits [Remote host closed the connection] |
| 10:30:34 | | gazorpazorp quits [Remote host closed the connection] |
| 10:30:34 | | user_ quits [Remote host closed the connection] |
| 10:30:38 | | user__ (gazorpazorp) joins |
| 10:30:47 | | user_ joins |
| 10:34:08 | | dan_a quits [Quit: weboooot] |
| 10:36:29 | | Arcorann (Arcorann) joins |
| 10:37:11 | | dan_a (dan_a) joins |
| 12:06:38 | | mut4ntmonkey quits [Client Quit] |
| 12:19:23 | | VerifiedJ quits [Quit: The Lounge - https://thelounge.chat] |
| 12:20:02 | | VerifiedJ (VerifiedJ) joins |
| 12:36:59 | | immibis_ quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
| 12:43:13 | | sonick (sonick) joins |
| 14:20:18 | | adamus1red quits [Quit: SigTerm] |
| 14:21:43 | | adamus1red (adamus1red) joins |
| 14:25:16 | | Arcorann quits [Ping timeout: 252 seconds] |
| 14:44:14 | | AlsoTheTechRobo joins |
| 14:44:29 | <AlsoTheTechRobo> | Do we have a way of archiving iFastNet pages? |
| 14:45:01 | <AlsoTheTechRobo> | Free hosts like byet.host and infinityfree, etc that use ifastnet have a JavaScript challenge that prevents even SPN from saving the page |
| 14:45:39 | <AlsoTheTechRobo> | Example: I am trying to archive http://gliczide.rf.gd/ ; this is the SPN capture: https://web.archive.org/web/20230211143746/http://gliczide.rf.gd/?i=1 |
| 14:46:03 | <AlsoTheTechRobo> | It's unusable, plus in SPN logs, it's clear it was redirected after ?i=3 to a Google "how to enable cookies" page. |
| 14:46:59 | <AlsoTheTechRobo> | So even if it successfully went to ?i=2 and ?i=3, it wouldn't have been a good capture. |
| 14:47:51 | <AlsoTheTechRobo> | More specifically, this is what the "Downloaded elements" box shows: https://pastebin.com/jz40udBn |
| 14:48:22 | <AlsoTheTechRobo> | (that's truncated, but you get the point) |
| 14:50:07 | <AlsoTheTechRobo> | Using curl's default user agent yields an empty response; using the latest chrome user agent yields this reply from `curl -v`: https://pastebin.com/ND0QiPaT |
| 15:04:33 | <ThreeHM> | Using the googlebot user agent gives me the correct page without the JS challenge. In fact, any UA containing the string "Googlebot" seems to work. |
| 15:05:40 | <AlsoTheTechRobo> | Oh yeah, that's a good catch |
| 15:47:44 | | AlsoTheTechRobo quits [Remote host closed the connection] |
| 15:51:04 | | sidpatchy quits [Ping timeout: 252 seconds] |
| 16:05:01 | <@JAA> | Ah yes, the aes.js/SlowAES/f655ba9d09a112d4968c63579db590b4 challenge. |
| 16:05:08 | | lennier2 joins |
| 16:06:08 | | lennier1 quits [Ping timeout: 265 seconds] |
| 16:06:13 | | lennier2 is now known as lennier1 |
| 16:07:02 | | michaelblob_ (michaelblob) joins |
| 16:09:13 | | michaelblob quits [Ping timeout: 252 seconds] |
| 16:21:12 | | AlsoTheTechRobo joins |
| 16:21:24 | <AlsoTheTechRobo> | In any case, SPN allows you to set a custom user agent on captures, so that's nice :-) |
| 16:21:31 | <AlsoTheTechRobo> | does archivebot support googlebot user agent? |
| 16:29:20 | | sidpatchy joins |
| 16:55:55 | | katocala quits [Ping timeout: 265 seconds] |
| 17:20:00 | | lennier2 joins |
| 17:22:22 | | lennier1 quits [Ping timeout: 252 seconds] |
| 17:22:34 | | lennier2 quits [Read error: Connection reset by peer] |
| 17:22:51 | | lennier2 joins |
| 17:22:52 | | lennier2 is now known as lennier1 |
| 17:26:01 | | DontKnow joins |
| 17:26:12 | <DontKnow> | hello! |
| 17:26:49 | | DontKnow quits [Remote host closed the connection] |
| 17:27:51 | <@JAA> | AlsoTheTechRobo: Not currently, but it could be added with a trivial PR. |
| 17:28:22 | <AlsoTheTechRobo> | wouldn't that require pipelines to be upgraded or just the irc bot? |
| 17:30:48 | <@JAA> | Neither. The UA aliases are stored on the control node, so it can be deployed quickly. |
| 17:36:07 | <AlsoTheTechRobo> | oh, even better! |
| 17:36:08 | | AlsoTheTechRobo is now authenticated as TheTechRobo |
| 17:36:09 | | katocala joins |
| 17:36:29 | | katocala is now authenticated as katocala |
| 17:38:32 | | AlsoTheTechRobo quits [Remote host closed the connection] |
| 17:38:32 | | qw3rty_ joins |
| 17:38:45 | | fuzzy8021 quits [Killed (NickServ (GHOST command used by fuzzy802!~fuzzy8021@173.224.25.67))] |
| 17:38:45 | | fuzzy8021 (fuzzy8021) joins |
| 17:38:48 | | DLoader_ joins |
| 17:39:19 | | celestial joins |
| 17:39:30 | | benjins2_ joins |
| 17:41:50 | | DLoader quits [Ping timeout: 265 seconds] |
| 17:42:00 | | DLoader_ is now known as DLoader |
| 17:43:17 | | qw3rty quits [Ping timeout: 265 seconds] |
| 17:43:17 | | benjins2__ quits [Ping timeout: 265 seconds] |
| 17:44:54 | | fishingforsoup joins |
| 17:46:29 | | fishingforsoup is now authenticated as fishingforsoup |
| 17:54:25 | | Atom joins |
| 17:57:18 | | Atom__ quits [Ping timeout: 265 seconds] |
| 18:45:08 | <@JAA> | I've started spidering Issuu for users. Already have nearly a million after ten minutes. |
| 18:46:15 | <@JAA> | It's a simple subscription traversal starting from the 'issuu' account, using the undocumented API endpoint. |
| 19:00:38 | <@JAA> | FWIW, don't see rate limits so far at 200 req/s. |
| 19:24:30 | <tomodachi94> | Does anyone know what tools to use for extracting .lha files? |
| 19:24:46 | <tomodachi94> | The file in question contains some vintage Amiga software. |
| 19:25:01 | <@JAA> | File Formats Wiki to the rescue: http://fileformats.archiveteam.org/wiki/LHA |
| 19:25:51 | <tomodachi94> | JAA: Thanks, I forgot about File Formats Wiki. |
| 19:26:14 | <tomodachi94> | Oh good, 7-zip supports it. |
| 19:28:55 | | Island joins |
| 19:35:53 | <balrog> | was the docstoc archive ingested into the WBM, and if not, is there a way to find a document by id? |
| 20:00:46 | | tzt quits [Ping timeout: 252 seconds] |
| 20:07:54 | | tzt (tzt) joins |
| 20:08:29 | | Larsenv quits [Quit: ZNC 1.8.2+deb2build5 - https://znc.in] |
| 20:11:07 | | Larsenv (Larsenv) joins |
| 20:19:47 | | tzt quits [Read error: Connection reset by peer] |
| 20:20:35 | | tzt (tzt) joins |
| 20:25:34 | | luna joins |
| 20:36:05 | <tomodachi94> | Would something like <https://archive.org/details/soder> go under Community Texts or Community Datasets? |
| 20:39:00 | <pokechu22> | Looks more like datasets to me IMO |
| 20:39:18 | <pokechu22> | especially since the licensing note says "This dataset" |
| 20:45:45 | <thuban> | _does_ spn allow you to set a custom user agent? |
| 20:47:33 | <thuban> | there's no indication of it on the web form, i can't find any api documentation, and my understanding of the 'user_agent' param on waybackpy (etc) is that it's the ua supplied by python to the spn endpoint, not by spn to the captured page |
| 20:51:38 | <@JAA> | Not that I heard of, and it definitely doesn't reuse the agent supplied by the user. |
| 20:51:52 | <@JAA> | Old SPN did at one point, I believe. |
| 20:52:11 | <thuban> | thanks, that's what i thought. |
| 20:53:35 | <@JAA> | `curl -A 'SomeAgent' https://web.archive.org/save/https://httpbingo.org/get` → https://web.archive.org/web/20230211205054/https://httpbingo.org/get |
| 20:54:13 | <thuban> | (: |
| 20:54:22 | <@JAA> | Interesting 'Via' header. |
| 20:55:09 | <thuban> | kind of a shame, really--it seems like more and more social media sites like to redirect image requests in a way ia can't handle |
| 21:02:11 | <pokechu22> | I think SPN does respect the Accept header or something, or at least when embedding images in a page it'll request it as an image instead of html for those sites that do stupid stuff |
| 21:02:40 | <pokechu22> | more for the request to web.archive.org/web/1234im_/url redirecting to saving it |
| 21:10:28 | <tomodachi94> | pokechu22: How would I get that changed from Texts to Datasets? |
| 21:10:56 | <pokechu22> | I don't know how to change it unfortunately - not sure if you can (there's a few I've incorrectly set myself) |
| 21:16:05 | <thuban> | pokechu22: doesn't seem to be the case-- |
| 21:16:56 | <thuban> | `curl -A 'TestAgent' -H 'Accept: image/*, */*' https://web.archive.org/save/https://httpbingo.org/get` comes back accepting "text/html,application/xhtml+xml,application/xml;q=0.9,image/apng,*/*;q=0.8;v=b3". |
| 21:17:32 | <@JAA> | You probably get my snapshot, not a new one, due to the 45-minute resave limit. |
| 21:17:59 | <thuban> | oh, maybe. |
| 21:18:53 | <thuban> | at any rate i know for a fact that its attempt to request embedded images as actual images doesn't work consistently |
| 21:19:57 | <thuban> | e.g.: https://web.archive.org/web/20230211201009/https://stonedrunkwizard.tumblr.com/post/708989727879102464/just-another-really-big-doodle-dump |
| 21:20:20 | <tech234a> | Has https://twittercommunity.com/ been run recently? See https://techcrunch.com/2023/02/09/twitter-puts-its-developer-community-website-behind-a-login-after-announcing-new-api-pricing/ |
| 21:20:35 | <tech234a> | The site has apparently been reopened |
| 21:21:23 | <@JAA> | It was run through AB last April. |
| 21:24:13 | <@JAA> | (Has it really been 10 months since Elon's initial bid to buy Twitter‽) |
| 21:24:58 | <@JAA> | Might be worth another run, yeah. 80 GiB in 11 days at the time. |
| 21:25:06 | <@JAA> | Ryz: ^ |
| 21:25:30 | <Ryz> | o#O; |
| 21:26:31 | <thuban> | ^ yeah, `curl -H 'Accept: image/*' https://web.archive.org/save/<some tumblr image>` doesn't work either |
| 21:32:28 | <pokechu22> | I think it's something like /save/embed that works |
| 21:32:31 | <pokechu22> | but I'm not 100% sure |
| 21:38:49 | <thuban> | i can't find any evidence of such a thing |
| 21:40:50 | <pokechu22> | I'm pretty sure I've seen it sometimes when loading a website where the HTML was saved but images weren't |
| 21:41:02 | <pokechu22> | but I don't remember the details and I doubt it's documented anywhere |
| 21:43:09 | <@JAA> | Yeah, I think there's an underscore somewhere as well, either _embed or embed_. |
| 21:43:29 | <@JAA> | You get there if you request /web/1234im_/ and the image isn't saved, I believe. |
| 21:43:42 | <@JAA> | It's not really visible unless you look at the HTTP requests. |
| 21:51:09 | <thuban> | it's /save/_embed/, but i can't get that to work either |
| 21:59:10 | | Larsenv quits [Client Quit] |
| 21:59:11 | <thuban> | (archivebot, of course, _can_ set the user-agent and would handle this perfectly, and i've considered batching up all the old urls from my logs that i know didn't work well through spn and submitting them there. but half the reason i started auto-archiving in the first place is the speed at which tumblr stuff disappears...) |
| 22:01:43 | | Larsenv (Larsenv) joins |
| 22:17:31 | | john123521 joins |
| 22:22:31 | | nico_32_ quits [Remote host closed the connection] |
| 22:33:14 | | TheTechRobo (TheTechRobo) joins |
| 22:58:30 | | john123521 quits [Remote host closed the connection] |
| 22:58:36 | | spirit joins |
| 23:01:06 | | hitgrr8 quits [Client Quit] |
| 23:13:57 | | BlueMaxima joins |
| 23:37:13 | | Arcorann joins |
| 23:37:13 | | Arcorann is now authenticated as Arcorann |
| 23:37:13 | | Arcorann quits [Changing host] |
| 23:37:13 | | Arcorann (Arcorann) joins |
| 23:47:17 | <fishingforsoup> | Help. |
| 23:47:40 | <fishingforsoup> | I have a server link for a game I am pretty sure is going to shut down soon. Is there any way to scrape it? |
| 23:47:40 | <fishingforsoup> | http://jdnowweb-s.cdn.ubi.com/ |
| 23:48:58 | <tomodachi94> | fishingforsoup: What's the game in question? |
| 23:49:04 | <fishingforsoup> | Just Dance Now. |
| 23:57:59 | | user_ quits [Remote host closed the connection] |
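
The iFastNet discussion above (14:44 to 16:05) comes down to two checks: whether a response is the JavaScript challenge page, and whether sending a user agent containing "Googlebot" avoids it. A minimal Python sketch of both follows. It assumes the challenge page can be recognised by the `aes.js`/SlowAES names JAA mentions, which is a heuristic and not documented behaviour. The helper names and the exact Googlebot string are illustrative and do not come from the log.

```python
import urllib.request

# Assumption: per the 15:04 message, any UA containing "Googlebot" seemed to
# skip the challenge. This particular string is only an example.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Heuristic markers based on the challenge named at 16:05 (aes.js / SlowAES).
# Matching on them is an assumption, not a documented fingerprint.
CHALLENGE_MARKERS = ("aes.js", "slowaes")


def build_request(url, user_agent=GOOGLEBOT_UA):
    """Return a urllib Request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def looks_like_js_challenge(body):
    """Guess whether an HTML body is the JS challenge rather than the real page."""
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def fetch(url, user_agent=GOOGLEBOT_UA, timeout=30):
    """Fetch a URL with the chosen user agent and return the decoded body."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `looks_like_js_challenge(fetch("http://gliczide.rf.gd/"))` once with a browser user agent and once with the default would show whether the workaround still applies. The result depends on the host's current behaviour, so none is given here.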