| 00:01:22 | | dm4v quits [Read error: Connection reset by peer] |
| 00:02:08 | | dm4v joins |
| 00:02:10 | | dm4v is now authenticated as dm4v |
| 00:02:10 | | dm4v quits [Changing host] |
| 00:02:10 | | dm4v (dm4v) joins |
| 00:02:27 | | Stiletto quits [Remote host closed the connection] |
| 00:09:19 | | Stiletto joins |
| 00:27:35 | | wickedplayer494 is now authenticated as wickedplayer494 |
| 00:41:56 | | britmob2563 quits [Ping timeout: 250 seconds] |
| 00:42:21 | | qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds] |
| 00:47:14 | | Megame (Megame) joins |
| 00:48:46 | | qwertyasdfuiopghjkl joins |
| 00:50:13 | | sonick quits [Client Quit] |
| 00:53:28 | | britmob2563 joins |
| 01:03:31 | | dm4v quits [Ping timeout: 252 seconds] |
| 01:05:02 | | dm4v joins |
| 01:05:04 | | dm4v is now authenticated as dm4v |
| 01:05:04 | | dm4v quits [Changing host] |
| 01:05:04 | | dm4v (dm4v) joins |
| 01:10:15 | | Iki quits [Ping timeout: 244 seconds] |
| 01:13:00 | | BlueMaxima joins |
| 01:22:56 | <pabs> | AK: I recommend using anarcat's feed2exec for that RSS project, it already has a plugin for using SPN |
| 01:23:42 | <pabs> | hmm, looks like it only uses /save/http.... though not POST to /save/ |
| 01:24:23 | <pabs> | IIRC those are different? |
| 01:24:44 | <@arkiver> | nah, lets not use SPN for that |
| 01:25:58 | <pabs> | its easy to write another plugin for whatever other mechanism you want, the SPN plugin is just a few lines of Python |
| 01:26:19 | <@arkiver> | SPN is almost always over capacity |
| 01:26:40 | <@arkiver> | AK: ThreeHM: made an issue on urls-grab repo for it |
| 01:27:06 | <pabs> | am I remembering correctly that /save/https:... and POST to /save/ are different? |
| 01:33:05 | <Jake> | i believe they are different, but both are over capacity often if I'm not mistaken? |
| 01:33:38 | <@OrIdow6> | This is ArchiveTeam |
| 01:34:00 | <@OrIdow6> | There is no point bulk SPNing when you have a direct ability to put things into the Wayback Machine |
| 01:34:44 | | jamesp leaves |
| 01:37:45 | | Iki joins |
| 01:43:02 | | wyatt8740 quits [Ping timeout: 250 seconds] |
| 01:43:13 | | wyatt8740 joins |
| 02:08:36 | | Megame quits [Ping timeout: 250 seconds] |
| 02:50:08 | | sonick (sonick) joins |
| 03:35:15 | | nicolas17 quits [Client Quit] |
| 03:43:01 | | qw3rty_ joins |
| 03:46:52 | | qw3rty__ quits [Ping timeout: 252 seconds] |
| 04:04:13 | | DogsRNice quits [Read error: Connection reset by peer] |
| 04:33:02 | | BlueMaxima_ joins |
| 04:34:54 | | Wolfin quits [Quit: ZNC - https://znc.in] |
| 04:35:40 | | wolfin (wolfin) joins |
| 04:36:48 | | BlueMaxima quits [Ping timeout: 250 seconds] |
| 04:40:14 | | BlueMaxima joins |
| 04:44:04 | | BlueMaxima_ quits [Ping timeout: 252 seconds] |
| 05:06:29 | | Megame (Megame) joins |
| 05:19:57 | | sonick quits [Client Quit] |
| 07:12:00 | | sonick (sonick) joins |
| 08:09:16 | | Ruthalas quits [Ping timeout: 244 seconds] |
| 08:13:30 | | Ruthalas (Ruthalas) joins |
| 08:26:19 | | benjins quits [Ping timeout: 244 seconds] |
| 08:30:27 | | BlueMaxima_ joins |
| 08:34:31 | | BlueMaxima quits [Ping timeout: 252 seconds] |
| 09:56:17 | | knecht420 quits [Client Quit] |
| 09:58:13 | | knecht420 (knecht420) joins |
| 10:02:52 | | knecht420 quits [Client Quit] |
| 10:04:11 | | knecht420 (knecht420) joins |
| 11:10:05 | | benjins joins |
| 11:10:45 | | benjins is now authenticated as benjins |
| 11:23:11 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 11:42:19 | | russss quits [] |
| 11:42:31 | | russss (russss) joins |
| 11:43:32 | | anarcat (anarcat) joins |
| 11:43:46 | <anarcat> | i'm kind of afk but happy to discuss RSS feed archival with feed2exec, async |
| 11:44:53 | <pabs> | AK, arkiver, JAA: ^ |
| 11:47:32 | | JSharp quits [] |
| 11:47:44 | | JSharp (JSharp) joins |
| 11:59:03 | | revi quits [] |
| 11:59:18 | | revi (revi) joins |
| 11:59:37 | | Megame quits [Client Quit] |
| 12:04:15 | | BlueMaxima_ quits [Client Quit] |
| 12:11:02 | <AK> | Ooh |
| 12:11:42 | <AK> | Possibly not a bad idea, feed2exec sending new stuff somewhere that we can then add to #// or whatever else is needed |
| 13:02:21 | | knecht420 quits [Client Quit] |
| 13:03:53 | | knecht420 (knecht420) joins |
| 13:39:08 | | knecht420 quits [Client Quit] |
| 13:40:20 | | knecht420 (knecht420) joins |
| 13:45:32 | | Megame (Megame) joins |
| 14:35:36 | | devsnek quits [] |
| 14:35:48 | | devsnek (devsnek) joins |
| 14:42:06 | | knecht420 quits [Client Quit] |
| 14:42:27 | | knecht420 (knecht420) joins |
| 14:59:07 | | knecht420 quits [Read error: Connection reset by peer] |
| 14:59:11 | | knecht4202 (knecht420) joins |
| 15:00:43 | | knecht4202 is now known as knecht420 |
| 15:14:40 | | @EggplantN quits [] |
| 15:15:09 | | EggplantN joins |
| 15:29:50 | | gazorpazorp quits [Remote host closed the connection] |
| 15:30:04 | | gazorpazorp (gazorpazorp) joins |
| 15:30:42 | | Arcorann quits [Ping timeout: 250 seconds] |
| 15:35:28 | | ThreeHM quits [Ping timeout: 250 seconds] |
| 15:36:36 | | ThreeHM (ThreeHeadedMonkey) joins |
| 15:40:45 | | xkey quits [Quit: WeeChat 3.1] |
| 15:41:34 | | xkey (eyo) joins |
| 16:10:35 | | Megame quits [Client Quit] |
| 16:41:46 | | sonick quits [Client Quit] |
| 16:42:50 | | monoxane quits [Ping timeout: 244 seconds] |
| 17:31:54 | <anarcat> | well right now i have things like this: |
| 17:32:05 | <anarcat> | [anarcat-archive] |
| 17:32:05 | <anarcat> | url = https://anarc.at/recentchanges/archive.rss |
| 17:32:05 | <anarcat> | output = feed2exec.plugins.wayback |
| 17:32:05 | <anarcat> | filter = feed2exec.plugins.ikiwiki_recentchanges |
| 17:32:29 | <anarcat> | the "wayback" plugin is this thing https://gitlab.com/anarcat/feed2exec/-/blob/main/feed2exec/plugins/wayback.py |
| 17:33:00 | <anarcat> | which basically does a GET on web.archive.org/save/$URL |
| 17:33:11 | <anarcat> | i was told (by pabs) that i should be doing a POST instead |
| 17:33:32 | <anarcat> | but my understanding of the whole SPN thing is that it works better from a browser, because the browser triggers the inline resources archival |
| 17:33:41 | <anarcat> | and i don't really want to rewrite wpull here |
| 17:33:50 | <anarcat> | i guess i could have a wpull plugin or something, but then i have two problems |
| 17:35:58 | <@OrIdow6> | anarcat: What I think pabs is trying to tell you is that SPN2, which involves a POST request, uses a headless browser to make the capture |
| 17:37:04 | <anarcat> | oh, that actually happens server-side? |
| 17:37:11 | <anarcat> | i did tell pabs to send me a patch :p |
| 17:37:38 | <anarcat> | but i suspect it can't be as simple as this: |
| 17:37:39 | <anarcat> | - res = session.get(wayback_url, allow_redirects=True) |
| 17:37:40 | <anarcat> | + res = session.post(wayback_url, allow_redirects=True) |
| 17:37:42 | <anarcat> | right? |
| 17:37:59 | <@OrIdow6> | I don't remember the exact API it uses |
| 17:38:02 | <@OrIdow6> | https://blog.archive.org/2019/10/23/the-wayback-machines-save-page-now-is-new-and-improved/ |
| 17:38:56 | <@OrIdow6> | Officially if you want to use the API you're supposed to ask them, in practice it seems many people have watched the network traffic the web interface makes (the POST) and replicated that |
| 17:39:29 | <@OrIdow6> | Anyhow, if there is an ArchiveTeam RSS project, vs. personal RSS projects, I do not think it is a stretch to say that we will not be using SPN in any form |
| 17:43:19 | <anarcat> | for sure |
| 17:43:37 | <h2ibot> | Ryz edited List of websites excluded from the Wayback Machine (+27, Added https://palladiummag.com/): https://wiki.archiveteam.org/?diff=47095&oldid=47081 |
| 17:51:01 | | nicolas17 joins |
| 17:51:52 | <Iki> | Another site to add to the 'excluded' page: https://jonahbennett.com |
| 17:52:21 | <@JAA> | Iki: Yes, please add it. Anyone can make edits. :-) |
| 17:55:40 | <Ryz> | Added that~ |
| 17:56:34 | <anarcat> | oh hi JAA |
| 17:57:10 | <@JAA> | Hey :-) |
| 17:57:39 | <h2ibot> | Ryz edited List of websites excluded from the Wayback Machine (+28, Added https://jonahbennett.com/): https://wiki.archiveteam.org/?diff=47096&oldid=47095 |
| 18:00:40 | <h2ibot> | JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=47097&oldid=47096 |
| 18:22:09 | | spirit quits [Quit: Leaving] |
| 18:58:20 | | DogsRNice (Webuser299) joins |
| 19:31:19 | | Ryz quits [Remote host closed the connection] |
| 19:32:21 | | Ryz (Ryz) joins |
| 20:27:32 | | lennier1 quits [Client Quit] |
| 20:28:13 | | lennier1 (lennier1) joins |
| 20:45:57 | | britmob2563 quits [Client Quit] |
| 21:16:53 | <@arkiver> | anarcat: what is that? |
| 21:17:19 | <@arkiver> | yeah we're definitely not going to use SPN |
| 21:18:18 | <anarcat> | sorry, what is what? |
| 21:18:41 | <@JAA> | arkiver: Basically, we can use that to process RSS feeds and get URLs (or anything we want really) out. And then queue those to URLs. |
| 21:19:02 | <anarcat> | oh, feed2exec? |
| 21:19:05 | <anarcat> | it's a feed parser i wrote |
| 21:19:11 | <@JAA> | With or without human doublechecking etc. |
| 21:19:22 | <anarcat> | you add feeds to a .ini file, and you can run commands on the items, or run python code, or whatever |
| 21:19:25 | <@JAA> | At least I assume that's what the question was about. |
| 21:19:29 | <anarcat> | it's my rss swiss army knife |
| 21:19:38 | <@JAA> | :-) |
| 21:20:04 | <anarcat> | i used feed2imap for a while, and then wanted to run commands based on some feeds (e.g. "download the new mp3 or torrent on this feed") and nothing quite did that |
| 21:20:05 | <@arkiver> | JAA: why not just do that extraction with #// |
| 21:20:15 | <anarcat> | i use it to dump my feed on the wayback machine |
| 21:20:17 | <@arkiver> | i'm going to try and get newsgrabber back today with #// |
| 21:20:19 | <anarcat> | and check links |
| 21:20:27 | <@arkiver> | just need a machine to queue the links to #// |
| 21:20:45 | <@arkiver> | anarcat: do you have lists of old links? if yes, we can archive them through #// |
| 21:20:55 | <anarcat> | what is #// |
| 21:21:03 | <anarcat> | i don't have a list |
| 21:21:13 | <@JAA> | How would that work since we need to refetch the RSS feeds (and wherever else NG was getting the articles from) regularly? Deduping etc. |
| 21:21:15 | <anarcat> | i don't have any specific rss feed i want to archive, i think this was something pabs brought up |
| 21:21:38 | <anarcat> | feed2exec handles http-level caching (e.g. etag and 'last-modified') and has a client-side cache |
| 21:21:38 | <@arkiver> | JAA: queue as special item, and attach some randomness |
| 21:21:38 | <@JAA> | Unless you want ugly hacks like URL#20210902T21 |
| 21:21:53 | <@arkiver> | i am completely fine with attaching that ugly hack |
| 21:22:02 | <@arkiver> | but might go with rss:URL or so |
| 21:22:28 | <@arkiver> | we have a lot of capacity in #// now, it's easy to use. just have to push in the items periodically |
| 21:22:58 | <@JAA> | Wasn't the idea with relaunching NewsGrabber to make it *less* hacky rather than more? |
| 21:23:29 | <@arkiver> | it'll be more stable |
| 21:24:03 | <@JAA> | Until the regular queueing breaks and we don't notice for weeks and fun like that. :-) |
| 21:24:05 | <AK> | For the rss feeds I was thinking something pretty much exactly like how feed2exec seems to work. We'll give a box every rss feed we can find, it then pipes them through to urls (somehow). Dunno if that's what arkiver is planning |
| 21:24:28 | <@arkiver> | AK: i mean, scanning the RSS feeds on #// |
| 21:24:48 | <AK> | Ahh, so we'd just redo the entire feed every x hours or whatever? |
| 21:24:54 | <@arkiver> | yeah |
| 21:25:00 | <@arkiver> | queue any found URLs back to #// |
| 21:25:03 | <@JAA> | Well, dedupe handles the duplicated articles. |
| 21:25:12 | <@arkiver> | yep |
| 21:25:19 | <@JAA> | Which means we don't need to keep state on the thing that processes the feeds. |
| 21:25:24 | <@arkiver> | so |
| 21:25:28 | <@JAA> | Although it's worth mentioning that not all of NG was RSS feeds. |
| 21:25:41 | <@arkiver> | how about a list of feeds on github to which people can PR and issue new URLs |
| 21:25:58 | <@arkiver> | i think we might keep that list in urls-items |
| 21:26:02 | <AK> | ^^ That would be a really good start |
| 21:26:20 | <@arkiver> | going to try to get something done today |
| 21:26:33 | <@arkiver> | got some lists already? obviously i'll copy the list from newsgrabber |
| 21:26:58 | <nicolas17> | where does stuff fetched by #// go? WBM? |
| 21:27:01 | <AK> | Biggest one I like is https://azure.microsoft.com/cdn/en-gb/updates/feed/ lol |
| 21:27:02 | <@arkiver> | yes |
| 21:27:23 | <@arkiver> | AK: nice, lets do it |
| 21:27:41 | <@arkiver> | and #// now gets all outlinks as well, though that is still experimental for now (but it's happening) |
| 21:27:48 | <@arkiver> | no |
| 21:27:49 | <@arkiver> | sorry |
| 21:27:53 | <@arkiver> | i mean all page requisites |
| 21:27:54 | <AK> | Once the github file is setup, I'll pr in any more feeds I find |
| 21:28:07 | <@arkiver> | perfect |
| 21:28:42 | <@arkiver> | JAA: you somewhat fine with this as well, or do you have very strong reservations against using #// to scan? |
| 21:29:22 | <AK> | arkiver, ahh, so currently it would grab https://azure.microsoft.com/en-gb/updates/general-availability-azure-devops-august-2021-updates/ and requisites, but not any of the links from within the updates (using azure as an example) |
| 21:29:23 | <@JAA> | Nah, fine with me. I just want to see it up and running again. Although yeah, it won't replace NG entirely, so more work will be needed (or other special items in URLs). |
| 21:30:01 | <@arkiver> | AK: it archived page requisites |
| 21:30:04 | <@arkiver> | archives* |
| 21:30:28 | <@arkiver> | or do you want to go deeper? |
| 21:30:37 | <anarcat> | one caveat with feed2exec is there's no cache purge right now :/ |
| 21:30:49 | <nicolas17> | arkiver: I was thinking about the iPhone IPSW files, I think making them IA items would be better than loading them into WBM |
| 21:30:52 | <@arkiver> | anarcat: i'm not sure we'll use feed2exec |
| 21:30:53 | <anarcat> | probably not hard to implement, but it's not there... so if you have lots of feeds with lots of items, storage will explode |
| 21:31:16 | <@arkiver> | nicolas17: are they from official static apple URLs? |
| 21:31:17 | <nicolas17> | does WBM even cope with multi-gigabyte files? I remember having problems with that, getting the download interrupted halfway through, but I don't know if the problem was downloading from WBM, or WBM's original fetch from the source |
| 21:32:28 | <AK> | arkiver, I don't know. In the azure example, these are short update posts, with links to the main info for each update. I've seen a few rss feeds slightly similar. So I think we'd need to go 1 deeper (Follow links from the first page) to get all the new info |
| 21:32:37 | <AK> | Don't know if that's possible or potentially going to cause issues though |
| 21:32:44 | <@arkiver> | hmm |
| 21:32:51 | <@arkiver> | will figure something out |
| 21:34:04 | <nicolas17> | AK: hm that may need something a bit more custom, you might not want to follow *all* links from the first page including the footer nav bar... |
| 21:34:23 | <@arkiver> | im not worried about the footer |
| 21:34:38 | <@arkiver> | AK: i'll allow a depth to be set here |
| 21:34:41 | <AK> | nicolas17, I think the deduper would mean we'd only archive that stuff once, which probably isn't bad to do |
| 21:34:43 | <nicolas17> | I guess if you can deduplicate it's fine |
| 21:34:52 | <@arkiver> | like 1, making it go one level deep, 2 for 2 etc. |
| 21:35:01 | <AK> | That would be awesome |
| 21:35:35 | <@arkiver> | getting something in now |
| 21:37:27 | <@JAA> | nicolas17: The WBM doesn't mind serving big files from our WARCs, but SPN has a limit of 2 GiB and will truncate files that are bigger than that. |
| 21:37:52 | <nicolas17> | I see, maybe those files I saw before came from SPN then |
| 21:37:55 | <@JAA> | I.e. don't use SPN for big files. |
| 21:38:30 | <@JAA> | Also, I'm interested in learning where you get the IPSW URLs from. |
| 21:40:43 | <@arkiver> | again - SPN is almost always overloaded |
| 21:40:58 | <@arkiver> | #// is very far from being overloaded |
| 21:41:09 | | qwertyasdfuiopghjkl joins |
| 21:41:09 | <@JAA> | Yeah, that too. |
| 21:41:24 | <@arkiver> | nicolas17: so for those apple files, are they official static URLs from apple? |
| 21:41:46 | <@arkiver> | AK: lets move this to #// |
| 21:42:10 | <nicolas17> | JAA: officially from Apple, there's an XML file fetched by iTunes that links to the *last* available version for each device, I'm not sure if Apple has any list of previous versions |
| 21:42:48 | <@JAA> | Ah nice. Can haz XML URL please? |
| 21:42:49 | <nicolas17> | but there's several third-party sites that have URLs (pointing at Apple's CDN, not mirrors) for everything |
| 21:43:42 | <@arkiver> | nicolas17: yeah we definitely want to archive those into the wayback machine |
| 21:43:44 | <@JAA> | Yeah, I've seen those lists before. The official source is always great to have though and often much harder to find. |
| 21:43:51 | <@arkiver> | and i'm fine with duplicating them to IA items as well |
| 21:43:54 | <@arkiver> | so you have both :) |
| 21:44:05 | <@arkiver> | but first wayback, we can always get them out of these into items |
| 21:44:53 | <nicolas17> | https://itunes.apple.com/WebObjects/MZStore.woa/wa/com.apple.jingle.appserver.client.MZITunesClientCheck/version it's a mess because it includes old-style iPods, phone carrier config updates, etc. |
| 21:45:24 | <@arkiver> | nice... |
| 21:45:39 | <nicolas17> | and afaik only mentions the last version (for a given device) |
| 21:47:12 | <nicolas17> | but when an update is released, it will be added to https://ipsw.me/ and https://www.theiphonewiki.com/wiki/Firmware/iPhone/14.x etc within hours :) |
| 21:50:39 | <nicolas17> | I think beta downloads are officially only linked from a paid-dev-account apple webpage, but that too gets promptly added to the wiki (lately by me), and once you have the CDN link, the downloads themselves aren't login-walled (except beta 1) |
| 21:50:42 | <nicolas17> | eg https://www.theiphonewiki.com/wiki/Beta_Firmware/iPhone/15.x |
| 21:56:33 | <nicolas17> | that itunes.apple.com xml seems to be a backwards-compatibility mess, Apple has a more consistent way to deliver "asset updates" now, eg. the ipsws for the new ARM Macs: https://mesu.apple.com/assets/macos/com_apple_macOSIPSW/com_apple_macOSIPSW.xml |
| 21:57:40 | <@JAA> | Thanks! |
| 21:59:44 | | sonick (sonick) joins |
| 22:01:21 | <nicolas17> | I know of ~300 URLs in mesu.apple.com, I recently started archiving them into a git repository :) some have never changed, some seem to change every week ("LinguisticData", keyboard autocorrect stuff I think) |
| 22:01:45 | <nicolas17> | https://gitlab.com/nicolas17/mesu-archive/-/commit/79f4d982bb |
| 22:02:44 | <@JAA> | Very nice. |
| 22:03:09 | <@JAA> | I really need to get back to working on pywarc so I can launch some of my project ideas. This would fit nicely into one of those. |
| 22:04:17 | <nicolas17> | someone recently sent me a few more URLs he had found himself |
| 22:04:41 | <nicolas17> | "com_apple_MobileAsset_CharacterVoices.xml" mickey and minnie telling the time in 35 languages, for the mickey Apple Watch face |
| 22:31:04 | | Stiletto quits [Ping timeout: 252 seconds] |
| 22:42:08 | | Stiletto joins |
| 23:24:01 | <pabs> | anarcat: working on a patch now |
| 23:31:46 | <anarcat> | pabs: aaawesome |
| 23:32:03 | <anarcat> | JAA: would be curious to hear what your end result will be for rss feeds |
| 23:35:25 | <@arkiver> | just dump extracted URLs in #//, it'll do deduplication by itself |
| 23:35:30 | <@arkiver> | not much need for fancy stuff |
| 23:37:19 | <pabs> | are there any docs on the SPN2 API? |
| 23:38:04 | <pabs> | it doesn't seem to return proper HTTP codes and returns errors in HTML, I'm wondering if there is a way to get JSON responses |
| 23:45:32 | <anarcat> | what is #//? |
| 23:50:28 | <ArchivalEfforts> | After using grab-site to get a site, which files should I upload to Internet Archive? |
| 23:50:30 | <ArchivalEfforts> | Both the main WARC as well as the -meta one and the .cdx seem relevant, but I can safely ignore the other files without losing information, right? |
| 23:56:13 | | BlueMaxima joins |