#archiveteam-bs log for 2023-02-22


00:00:41		lennier1 quits [Client Quit]
00:35:05	<TheTechRobo>	JAA, arkiver: Sounds good!
00:50:24	<h2ibot>	JustAnotherArchivist edited Issuu (+1697): https://wiki.archiveteam.org/?diff=49478&oldid=30144
01:03:00		Pingerfowder quits [Client Quit]
01:03:12		Pingerfowder (Pingerfowder) joins
01:05:01		monoxane quits [Ping timeout: 252 seconds]
01:05:14		monoxane (monoxane) joins
01:42:33		useretail joins
02:14:46	<pabs>	once the site comes back online, clarkesworldmagazine.com should probably get archived, as they are inundated with ChatGPT spam https://twitter.com/clarkesworld/status/1627711728245960704 https://news.ycombinator.com/item?id=34887478
02:17:11	<pabs>	hmm, https works but http doesn't...
02:18:27	<pabs>	oh, now their entire site is 403 with some sort of firewall enabled
02:18:42	<pabs>	aw "Your IP address was temporarily blocked by our IDS."
02:19:06	<pabs>	probably because I did wget without a U-A
02:20:18		qwertyasdfuiopghjkl quits [Remote host closed the connection]
02:24:39	<pabs>	now they are back, going to AB
02:26:53		fuzzy8021 quits [Read error: Connection reset by peer]
02:27:13		fuzzy8021 (fuzzy8021) joins
02:27:18		qwertyasdfuiopghjkl joins
02:29:28		ArchivalEfforts quits [Read error: Connection reset by peer]
02:30:16		sidpatchy quits [Ping timeout: 252 seconds]
02:30:30		ArchivalEfforts joins
02:30:38		anarcat quits [Ping timeout: 252 seconds]
02:31:01		sidpatchy joins
02:31:11		anarcat (anarcat) joins
02:34:01	<TheTechRobo>	Maybe we should add another "Project status" thing for when a service is partially going down, cutting functionality, or adding paywalls (like with Issuu)? "Special case" to me sounds like "the website is perfectly fine but we're archiving it anyway"
02:44:17		leo60228- quits [Quit: ZNC 1.8.2 - https://znc.in]
02:44:39		leo60228 (leo60228) joins
02:45:12		drin joins
02:45:40		geezabiscuit quits [Ping timeout: 252 seconds]
02:46:03		drin is now known as geezabiscuit
02:51:00		fishingforsoup_ quits [Read error: Connection reset by peer]
02:51:22		fishingforsoup_ joins
02:51:26		fuzzy8021 quits [Read error: Connection reset by peer]
02:51:57		ave quits [Quit: Ping timeout (120 seconds)]
02:51:57		lun4 quits [Quit: Ping timeout (120 seconds)]
02:53:06		nepeat quits [Quit: ZNC - https://znc.in]
02:53:22		fuzzy8021 (fuzzy8021) joins
02:54:30		ave (ave) joins
02:56:49		lun4 (lun4) joins
02:59:31		nepeat (nepeat) joins
03:05:52		lennier1 (lennier1) joins
03:13:03		fishingforsoup__ joins
03:16:28		fishingforsoup_ quits [Ping timeout: 252 seconds]
03:21:07		ell quits [Client Quit]
03:22:12		ell (ell) joins
03:43:49		Ketchup901 quits [Client Quit]
03:47:04		Ketchup901 (Ketchup901) joins
04:41:53		Island quits [Read error: Connection reset by peer]
04:43:30		user_ quits [Remote host closed the connection]
04:50:18		umgr036 joins
05:13:44		useretail_ joins
05:13:46		wyatt8750 joins
05:13:46		useretail quits [Remote host closed the connection]
05:13:46		wyatt8740 quits [Client Quit]
06:21:15		Arcorann (Arcorann) joins
07:40:22		hitgrr8 joins
07:46:26	<pabs>	https://techcrunch.com/2023/02/21/soylent-acquired-starco-brands-nutrition/
08:35:06		treora quits [Remote host closed the connection]
08:35:06		treora joins
09:26:11		Gereon6200 quits [Client Quit]
09:26:11		useretail_ quits [Remote host closed the connection]
09:26:11		ell quits [Client Quit]
09:26:11		Arcorann quits [Remote host closed the connection]
09:26:11		Gereon62005 (Gereon) joins
09:26:11		Gereon62005 is now known as Gereon6200
09:26:17		treora quits [Client Quit]
09:26:18		qwertyasdfuiopghjkl quits [Client Quit]
09:27:17		treora joins
09:30:44		Arcorann (Arcorann) joins
09:48:37		raxxy-137409 quits [Ping timeout: 252 seconds]
09:48:40		raxxy-137409 joins
10:21:18		LeGoupil joins
11:56:46		LeGoupil quits [Ping timeout: 252 seconds]
12:49:36		LeGoupil joins
12:51:13		Arcorann quits [Ping timeout: 252 seconds]
13:08:58	<audrooku\|m>	It's possible to grab warcs of individual pages directly from the wayback machine, right? How do you do that?
13:56:50		HP_Archivist (HP_Archivist) joins
14:58:55		HP_Archivist quits [Client Quit]
15:35:17		Island joins
15:57:29	<@arkiver>	kaz: HCross: are you able to reach rewby regarding issuu?
16:18:47	<hitgrr8>	Is there a Flash archive for website banners and such that were on websites during the early days?
16:19:13	<hitgrr8>	I hate that archive.org wasn't able to archive Flash files on older websites :(
16:44:11		LeGoupil quits [Client Quit]
16:52:42		sonick quits [Client Quit]
16:52:43		nstrom joins
17:30:04		Gereon6200 quits [Ping timeout: 252 seconds]
17:33:01		Gereon6200 (Gereon) joins
17:46:42		sec^nd quits [Remote host closed the connection]
17:46:51		charles joins
18:03:37		lennier1 quits [Ping timeout: 252 seconds]
18:04:51		lennier1 (lennier1) joins
18:05:38		lennier2 joins
18:06:17		umgr036 quits [Remote host closed the connection]
18:09:18		lennier1 quits [Ping timeout: 252 seconds]
18:09:20		lennier2 is now known as lennier1
18:11:09		umgr036 joins
18:41:14		fl0w joins
18:43:13		fl0w_ quits [Ping timeout: 252 seconds]
19:19:53		nstrom quits [Client Quit]
19:23:55		wyatt8750 quits [Ping timeout: 252 seconds]
19:25:10		wyatt8740 joins
19:51:43	<@JAA>	audrooku\|m: No, that's not generally possible. A lot of data isn't publicly accessible.
19:52:23	<@JAA>	But you can see the item and WARC name in the response headers, and if it's in an open collection, you can access it that way. You'd still need to figure out where in the WARC the record is using the CDX.
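
The lookup JAA describes (find the record's position in the WARC via the CDX) could be sketched roughly as follows. This is a minimal sketch, assuming the item provides a CDX index in the common 11-column layout (`CDX N b a m s k r M S V g`), where the last three columns are the compressed record length, the byte offset, and the WARC filename. The names `RecordLocation`, `locate_record`, and `range_header` are hypothetical, and the sample line is made up for illustration.

```python
from typing import NamedTuple


class RecordLocation(NamedTuple):
    filename: str  # WARC file that holds the record
    offset: int    # byte offset of the compressed record in that file
    length: int    # compressed length of the record in bytes


def locate_record(cdx_line: str) -> RecordLocation:
    """Pull filename, offset and length out of one 11-column CDX line."""
    fields = cdx_line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 CDX fields, got {len(fields)}")
    length, offset, filename = fields[8], fields[9], fields[10]
    return RecordLocation(filename, int(offset), int(length))


def range_header(loc: RecordLocation) -> str:
    """Build an HTTP Range header value covering exactly this record."""
    return f"bytes={loc.offset}-{loc.offset + loc.length - 1}"


# Hypothetical CDX line, not taken from a real item.
EXAMPLE_LINE = (
    "com,example)/ 20230222093528 https://example.com/ text/html 200 "
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 - - 1234 5678 example-00000.warc.gz"
)

if __name__ == "__main__":
    loc = locate_record(EXAMPLE_LINE)
    print(loc.filename, range_header(loc))
```

A ranged request for those bytes against the WARC file in an open collection would return a single gzip member containing the record. That only works when the item is publicly downloadable, which, as noted above, is often not the case.
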
20:24:03		dan_a quits [Quit: webootsesit]
20:25:40		dan_a (dan_a) joins
20:31:49		qwertyasdfuiopghjkl joins
20:40:12	<audrooku\|m>	..oh ;-;
20:40:44	<audrooku\|m>	You can view the original webpage without the urls being changed to wayback versions at least though, right?
20:41:22	<pokechu22>	Yes
20:42:28	<@arkiver>	yes
20:42:37	<pokechu22>	https://web.archive.org/web/20230222093528/https://example.com/ for wayback toolbar, https://web.archive.org/web/20230222093528if_/https://example.com/ for no wayback toolbar but still links rewritten ("f" = frame, this is embedded by the other version), https://web.archive.org/web/20230222093528id_/https://example.com/ or
20:42:39	<pokechu22>	https://web.archive.org/web/20230222093528im_/https://example.com/ for no changing ("d" = data, "im" = image), there's a few other variants too
20:42:55	<@arkiver>	(was just about to write that, what pokechu22 says yes)
20:43:20	<@arkiver>	id_ is the general way of getting the original data
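
The URL variants listed above follow one pattern: the modifier is appended directly to the 14-digit timestamp. A minimal sketch of building them is shown here; the helper name `wayback_url` and the `MODIFIERS` table are hypothetical, and only the modifiers mentioned in this discussion are included (other variants exist).

```python
# Modifiers named in the discussion above, with the behaviour described there.
MODIFIERS = {
    "": "Wayback toolbar, links rewritten",
    "if_": "no toolbar, links still rewritten (frame)",
    "id_": "original data, nothing rewritten",
    "im_": "image, nothing rewritten",
}


def wayback_url(timestamp: str, url: str, modifier: str = "") -> str:
    """Build a Wayback Machine snapshot URL for the given modifier."""
    if modifier not in MODIFIERS:
        raise ValueError(f"unknown modifier: {modifier!r}")
    return f"https://web.archive.org/web/{timestamp}{modifier}/{url}"


if __name__ == "__main__":
    for mod, meaning in MODIFIERS.items():
        print(wayback_url("20230222093528", "https://example.com/", mod), "#", meaning)
```

For scraping, `id_` is usually the one to use, since it returns the page body as it was captured.
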
20:43:43	<audrooku\|m>	Ooh, ok
20:45:11	<pokechu22>	Note that the link rewriting is important for things like CSS and images; compare https://web.archive.org/web/20230222092154if_/https://en.wikipedia.org/wiki/Main_Page with https://web.archive.org/web/20230222092154id_/https://en.wikipedia.org/wiki/Main_Page (I'm not 100% sure why any images work tbh)
20:45:50	<pokechu22>	Oh, right, also relative links - <a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a> isn't going to work nicely
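
The relative-link problem can be illustrated with Python's standard `urllib.parse.urljoin`. This is only a sketch of standard URL resolution, not of anything the Wayback Machine does internally: a root-relative link in an unrewritten `id_` page resolves against `web.archive.org` itself, so a scraper has to resolve it against the original page URL instead.

```python
from urllib.parse import urljoin

ORIGINAL = "https://en.wikipedia.org/wiki/Main_Page"
SNAPSHOT = "https://web.archive.org/web/20230222092154id_/" + ORIGINAL
HREF = "/wiki/Wikipedia"  # root-relative link found in the raw page

# Resolved the way a browser would when viewing the raw snapshot:
# the path replaces everything after the archive host.
naive = urljoin(SNAPSHOT, HREF)

# Resolved against the URL the page was originally served from.
intended = urljoin(ORIGINAL, HREF)

if __name__ == "__main__":
    print(naive)
    print(intended)
```

Here `naive` comes out as `https://web.archive.org/wiki/Wikipedia`, while `intended` is `https://en.wikipedia.org/wiki/Wikipedia`.
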
20:47:34	<@JAA>	The WBM does some weird magic for absolute URLs. That's why the images get redirected to snapshots.
20:47:55	<@arkiver>	weird magic
20:48:06	<@arkiver>	WBM is voodoo basically
20:48:57	<@JAA>	E.g. https://web.archive.org/static/images/icons/wikipedia.png redirects to https://web.archive.org/web/20230222092154/https://en.wikipedia.org/static/images/icons/wikipedia.png
20:49:14	<@arkiver>	oh that weird magic
20:49:18	<@JAA>	It's not referrer-based either.
20:49:24	<pokechu22>	For relative URLs, not absolute ones, right? Absolute ones are blocked by the content security policy or something like that?
20:50:36	<@JAA>	I mean relative URLs that are relative to the document root, yeah.
20:50:51	<@JAA>	My brain mixed that up with absolute paths. :-)
20:56:34	<@JAA>	Despite CSP, I've had WBM snapshots try to access external stuff before, by the way. uMatrix to the rescue.
21:21:21		sec^nd (second) joins
21:24:09	<audrooku\|m>	> <pokechu22> Note that the link rewriting is important for things like CSS and images
21:24:09	<audrooku\|m>	Yes I understand, I'm interested in scraping some metadata from page archives
22:24:48		BlueMaxima joins
22:33:10		p65 joins
22:33:14		p65 quits [Remote host closed the connection]
22:33:26	<h2ibot>	Arkiver uploaded File:Issuu-icon.png: https://wiki.archiveteam.org/?title=File%3AIssuu-icon.png
22:34:00		hitgrr8 quits [Client Quit]
23:04:58		lennier1 quits [Client Quit]
23:06:06		lennier1 (lennier1) joins
