00:00:45Matthww quits [Client Quit]
00:03:21Matthww joins
00:03:56Matthww quits [Remote host closed the connection]
00:13:21etnguyen03 (etnguyen03) joins
00:18:03loug8318142 quits [Quit: The Lounge - https://thelounge.chat]
00:26:53etnguyen03 quits [Client Quit]
00:38:10HP_Archivist quits [Quit: Leaving]
00:43:50beardicus (beardicus) joins
01:01:33etnguyen03 (etnguyen03) joins
01:15:08beardicus quits [Ping timeout: 260 seconds]
01:23:19beardicus (beardicus) joins
01:33:19hackbug quits [Remote host closed the connection]
01:34:12hackbug (hackbug) joins
01:48:39nine quits [Quit: See ya!]
01:48:52nine joins
01:48:52nine quits [Changing host]
01:48:52nine (nine) joins
02:07:03beardicus quits [Ping timeout: 260 seconds]
02:22:56beardicus (beardicus) joins
02:30:48sec^nd quits [Remote host closed the connection]
02:30:52SootBector quits [Remote host closed the connection]
02:30:53HP_Archivist (HP_Archivist) joins
02:31:08sec^nd (second) joins
02:31:11SootBector (SootBector) joins
02:31:59StarletCharlotte quits [Read error: Connection reset by peer]
02:33:01<pokechu22>Do we have any good way of recording a bunch of POST requests with known data and URLs into a WARC? I generated a list of those for https://sleepnomoreauction.com/ yesterday, but I'm only able to save the images with archivebot since everything else is POST (note that they also require an origin header, and possibly some other headers)
02:42:08etnguyen03 quits [Client Quit]
02:49:44<TheTechRobo>Should be theoretically easy with qwarc from my previous research, but /me doesn't currently have time to write a spec file
02:50:43pabs quits [Read error: Connection reset by peer]
02:52:05<@OrIdow6>Could also be done with wget-at
02:52:17graham9 joins
02:52:25pabs (pabs) joins
02:53:06etnguyen03 (etnguyen03) joins
02:58:10<@JAA>Yeah, easy with qwarc.
02:59:18<pabs>-feed/#hackernews- Society for Technical Communication to permanently close its doors https://www.stc.org/ https://news.ycombinator.com/item?id=42867324
03:01:47<h2ibot>OrIdow6 edited Niconico (+432, /* Nico Nico Seiga */ Nico Nico Shunga has done…): https://wiki.archiveteam.org/?diff=54287&oldid=54264
03:03:16<TheTechRobo>AttributeError: module 'asyncio' has no attribute 'coroutine'. Did you mean: 'coroutines'?
03:03:16<TheTechRobo>What version of Python should I use for qwarc?
03:03:43<TheTechRobo>(I'm on the latest commit in the 0.2 branch)
03:04:38<@JAA>Hmm yeah, that was removed in 3.11.
03:04:58<@JAA>FWIW, that isn't used in qwarc's code, so it'll be from a dependency.
03:05:06<@JAA>Probably the ancient aiohttp.
03:05:09<TheTechRobo>Ah, yeah, aiohttp
03:06:26<@JAA>The aiohttp code is ugly because it doesn't expose the raw HTTP traffic, so it hard-depends on that ancient version.
03:06:56<@JAA>I usually run my things under 3.6, but I know 3.8 works fine. Not sure about newer ones.
03:06:56<TheTechRobo>On 3.9 I get
03:06:56<TheTechRobo> class CeilTimeout(Timeout):
03:06:57<TheTechRobo>TypeError: function() argument 'code' must be code, not str
03:07:01<TheTechRobo>Also in aiohttp.
03:07:12<@JAA>Not in async-timeout?
03:07:25<TheTechRobo>/home/thetechrobo/qwarc/venv/lib/python3.9/site-packages/aiohttp/helpers.py
03:07:36<@JAA>Hmm yeah, I guess that's where the error happens.
03:07:40<@JAA>You need async-timeout==3.0.1.
03:07:49<h2ibot>OrIdow6 edited Web Roasting (+283, Explain what it is a bit): https://wiki.archiveteam.org/?diff=54288&oldid=30443
03:08:09<@JAA>https://github.com/aio-libs/aiohttp/issues/6320
03:08:11pie_ quits []
03:09:43<nicolas17>let's build our own library
03:09:54<nicolas17>with h11, blackjack and hookers
03:10:02<@JAA>That's the plan, yes.
03:10:52<TheTechRobo>Can I make pyenv rebuild the sqlite part of python without removing and reinstalling the entire version? Turns out I didn't have the sqlite headers installed when I installed 3.9.
03:11:36<@JAA>This was never intended to be long-lived. Remember that qwarc in its current form is basically the code I wrote for one specific project years ago, repackaged into something somewhat reusable.
03:11:56<@JAA>TheTechRobo: As far as I know, no.
03:12:40<@JAA>qwarc also used to use warcio. I ripped that out in record time when I discovered its intentional data mangling.
03:12:53<@JAA>So now it's bespoke custom WARC-writing code.
03:12:54<TheTechRobo>I have wondered, are WARCs made by warcio still in the WBM?
03:13:06<TheTechRobo>Not just for qwarc, but also for other things
03:13:34<@JAA>Replacing that is at the top of my qwarc todo list, hence pywarc.
03:13:54<@JAA>I'm sure there's warcio data in the WBM, yeah.
03:15:06<TheTechRobo>Are the old qwarc grabs still in the WBM?
03:15:16<@JAA>I believe so.
03:17:50<h2ibot>TheTechRobo edited Qwarc (+248, Add dependency information): https://wiki.archiveteam.org/?diff=54289&oldid=53904
03:18:36<nicolas17>optane10 is on fire
03:27:42<nicolas17>optane10 is consistently returning "max connections -1" on youtube, and "connection refused" on blogger
03:36:53<h2ibot>PaulWise edited ArchiveBot/Ignore (+30, better facebook/instagram ignore): https://wiki.archiveteam.org/?diff=54290&oldid=54271
03:37:12<@JAA>That's been mentioned in the project channels, yes.
03:39:54<h2ibot>PaulWise edited ArchiveBot/Ignore (+193, add wordpress junk): https://wiki.archiveteam.org/?diff=54291&oldid=54290
03:39:55<h2ibot>PaulWise edited ArchiveBot/Ignore (+2, ignore trailing / too): https://wiki.archiveteam.org/?diff=54292&oldid=54291
03:40:54<h2ibot>PaulWise edited ArchiveBot/Ignore (+6, pinterest ignore other language subdomains too): https://wiki.archiveteam.org/?diff=54293&oldid=54292
03:43:04<TheTechRobo>JAA: I assume in the generate(cls) function, whatever I queue has to be a string?
03:43:39<@JAA>TheTechRobo: Yes
03:43:56<@JAA>Also, ensure there are no dupes.
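[Sketch: one way to satisfy both constraints — items as strings, deduplicated — assuming generate may be written as a generator and that urls.txt is the input list (both assumptions, not from qwarc's docs):]
    @classmethod
    def generate(cls):
        seen = set()
        with open('urls.txt') as f:  # hypothetical input file
            for line in f:
                url = line.strip()
                if url and url not in seen:  # strings only, no dupes
                    seen.add(url)
                    yield url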
03:58:07graham9 quits [Client Quit]
03:58:40<TheTechRobo>Does qwarc write to stdout?
04:01:25pixel (pixel) joins
04:04:18pixel leaves
04:04:22pixel (pixel) joins
04:04:31<@JAA>TheTechRobo: Only if your spec file does.
04:05:03<@JAA>qwarc on its own, no.
04:08:42Wohlstand quits [Quit: Wohlstand]
04:10:06etnguyen03 quits [Remote host closed the connection]
04:11:12<@JAA>(I do sometimes output things on FD 3 or similar for scripting around qwarc.)
04:13:54ljcool2006_ quits [Quit: Leaving]
04:33:55<TheTechRobo>AttributeError: type object '_asyncio.Task' has no attribute 'current_task' on Python 3.9
04:34:16<@JAA>Welp
04:35:35<@JAA>Oh yeah, deprecated in 3.7, removed in 3.9.
04:35:52<@JAA>Again, not used in qwarc, so I bet it's aiohttp.
04:36:23<TheTechRobo>Yup
04:37:25<TheTechRobo>You said pywarc will provide an API for HTTP requests/responses, right? I assume it'll also do weird things to aiohttp?
04:39:55<TheTechRobo>Er, this might be a stupid question, but is there a way to override qwarc's user agent? You can set one in `headers`, but then you'll have two.
04:43:07<h2ibot>TheTechRobo edited Qwarc (+52): https://wiki.archiveteam.org/?diff=54294&oldid=54289
04:45:03<@JAA>No, pywarc won't use aiohttp. It'll probably be h11 with sync and async wrappers.
04:46:25<@JAA>Heh, there's been a todo comment in the code since 2019 about header overriding.
04:47:43<TheTechRobo>I'll take that as a no then. :-)
04:47:44<@JAA>The default headers are stored in the item's `headers` attribute. You can manipulate that from `__init__`, for example (*after* the `super().__init__` call).
04:47:56<TheTechRobo>spoke too soon
04:48:11<@JAA>E.g. `def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs); self.headers = []`
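[Sketch: the same idea spread over multiple lines, replacing only the default User-Agent rather than clearing everything — assuming `headers` is a list of (name, value) pairs, which isn't confirmed above:]
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # must come first, per JAA
        # Drop the default User-Agent, then add our own (header layout assumed)
        self.headers = [(k, v) for k, v in self.headers if k.lower() != 'user-agent']
        self.headers.append(('User-Agent', 'my-archive-bot/1.0'))  # hypothetical value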
05:02:38beardicus quits [Ping timeout: 260 seconds]
05:08:52<TheTechRobo>I like how I said I didn't have time to write a spec file, then proceeded to spend two hours writing my first one.
05:08:56<TheTechRobo>Procrastination is fun. lol
05:11:28beardicus (beardicus) joins
05:14:16<TheTechRobo>If it's useful to anyone, this pretty much just takes a list of URL + body data + HTTP verb, and requests it all. No retries, but there's a JSON log with the status code to stdout. https://transfer.archivete.am/inline/103umH/yoink.py
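[Sketch: the rough shape of such a spec file — not the actual yoink.py; qwarc's exact API (Item base class, itemValue, fetch signature and return value) is assumed here rather than checked against the 0.2 branch:]
    import json
    import qwarc

    class PostItem(qwarc.Item):
        itemType = 'post'

        @classmethod
        def generate(cls):
            # One item string per request: JSON-encoded [verb, url, body] (layout is this sketch's choice)
            with open('requests.jsonl') as f:  # hypothetical input file
                yield from {line.strip() for line in f if line.strip()}

        async def process(self):
            verb, url, body = json.loads(self.itemValue)
            # method=/data= kwargs and the aiohttp-style response object are assumptions
            response = await self.fetch(url, method=verb, data=body.encode())
            print(json.dumps({'url': url, 'status': response.status}))  # status log to stdout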
05:21:41BlueMaxima quits [Read error: Connection reset by peer]
05:29:51<pokechu22>I don't have the ability to upload WARCs that end up in WBM (though I guess for a WARC of POSTs that's not relevant). The data in question is all of the URLs with # in them in https://transfer.archivete.am/inline/TnzuJ/sleepnomoreauction.com_urls_2.txt and the headers from line 50 of https://transfer.archivete.am/twnvK/auction.io_sleepnomoreauction.com_process_2.py
05:29:52<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/twnvK/auction.io_sleepnomoreauction.com_process_2.py
05:34:08<TheTechRobo>!remindme 8h do thing
05:34:09<eggdrop>[remind] ok, i'll remind you at 2025-01-30T13:34:08Z
05:38:17Webuser884331 joins
05:38:49<Webuser884331>question.... Yahoo Briefcase
05:39:58beardicus quits [Ping timeout: 260 seconds]
05:40:13<@OrIdow6>Webuser884331: We didn't get it, sorry
05:40:33<@OrIdow6>"...but the warning time was roughly 60 days, which is long by Yahoo standards but hardly ideal for a service up for nearly a decade" per https://wiki.archiveteam.org/index.php/Yahoo!_Briefcase
05:43:15<Webuser884331>what about someone's hotmail
05:43:36<Webuser884331>my mum died
05:43:54<Webuser884331>i want any photos she saved
05:46:17<@JAA>I think that was before AT even existed, although only barely.
06:20:18Webuser884331 quits [Client Quit]
06:23:20<@JAA>Actually, not quite. AT emerged in January 2009, domain registration on 2009-01-06. I thought it was a bit later that year.
06:54:17Dango360 quits [Read error: Connection reset by peer]
07:55:09pabs quits [Read error: Connection reset by peer]
07:55:37pabs (pabs) joins
08:30:25SootBector quits [Remote host closed the connection]
08:30:42SootBector (SootBector) joins
08:33:09`
08:40:08Emitewiki joins
08:40:24<Emitewiki>Anything we can do about this, or is it outside our purview/already done? https://bsky.app/profile/bobpony.com/post/3lgvxot2kos2j
08:41:04<pabs>"Microsoft will be removing the downloads for old Windows Themes in the future."
08:41:10<pabs>https://support.microsoft.com/windows/windows-themes-94880287-6046-1d35-6d2f-35dee759701e
08:42:17<pabs>looks like it will work in AB
08:42:51<Emitewiki>Sweet.
08:45:19<pabs>seems to be working, but some of the themes are already 404, including from a browser
08:45:46<Emitewiki>💀
08:46:19<pabs>it likely can't save the Windows Store stuff, which is behind a weird link ms-windows-store://collection/?collectionid=WindowsThemes
08:52:05<Emitewiki>Dang. Any way for us to manually do some shenanigans to save it?
08:54:19<pabs>reading the page again, that part isn't in danger, just the direct links, which are being saved
08:55:09<pabs>should be on archive.org in a few days
08:59:47<Emitewiki>Ah, you're right. Cool cool, thanks
08:59:48<Emitewiki>!
09:31:14Island quits [Read error: Connection reset by peer]
09:56:35scurvy_duck joins
10:00:32<Emitewiki>Anyone mind sending this through AB? The dev is starting to delist some of their games from stores, and this usually precedes a website shutdown, so I just want to be extra safe. https://www.catsoulstudios.com/
10:03:16<that_lurker>sure
10:03:47<Emitewiki>Thanks.
10:03:51<that_lurker>is there a news article about that somewhere?
10:04:25<Emitewiki>It's an announcement on their Steam game notices.
10:04:34<Emitewiki>So, like, within the Steam interface itself.
10:04:43<that_lurker>aa ok
10:06:46Emitewiki quits [Client Quit]
11:30:15Webuser220882 joins
11:30:26Webuser220882 quits [Client Quit]
11:34:13PotatoProton01 joins
11:51:00PotatoProton01 quits [Client Quit]
12:00:03Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat]
12:02:49Bleo18260072271962345 joins
12:13:29lennier2_ joins
12:16:03lennier2 quits [Ping timeout: 260 seconds]
12:16:46icedice (icedice) joins
12:24:59pie_ (pie_) joins
12:28:33pie_ quits [Client Quit]
12:30:14pie_ (pie_) joins
12:30:21pie_ quits [Client Quit]
12:30:33pie_ (pie_) joins
12:35:20SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962]
12:35:36etnguyen03 (etnguyen03) joins
12:35:52SkilledAlpaca418962 joins
13:07:02beardicus (beardicus) joins
13:17:35iram quits [Quit: ~]
13:18:05iram joins
13:26:24Naruyoko5 joins
13:27:36Naruyoko quits [Ping timeout: 250 seconds]
13:27:48scurvy_duck quits [Ping timeout: 260 seconds]
13:34:08<eggdrop>[remind] TheTechRobo: do thing
14:04:38Wohlstand (Wohlstand) joins
14:06:49<TheTechRobo>pokechu22: So an example request might be a POST to https://auctionsoftware.net/mobileapi/fetchLocation with the body {"countryCode": 62} ?
14:07:20<TheTechRobo>Is there any link extraction needed or do they just have to be grabbed?
14:10:42katocala joins
14:25:20hexa- quits [Quit: WeeChat 4.4.3]
14:26:27hexa- (hexa-) joins
14:36:50BornOn420 quits [Remote host closed the connection]
14:37:23BornOn420 (BornOn420) joins
14:55:21<anarcat>is this on someone's radar? https://www.reddit.com/r/DataHoarder/comments/1idm9ii/datagov_is_currently_being_scrubbed/
14:55:49<anarcat>i'm getting kind of exhausted at the "okay, this fascist government is in, and they're going to destroy the entire digital infrastructure of country foo, let's crawl"
14:59:23<kiska>Some 2.2k datasets have been removed; what they are, I don't know
15:00:06kansei quits [Quit: ZNC 1.9.1 - https://znc.in]
15:01:42kansei (kansei) joins
15:34:15holbrooke joins
15:35:46riteo (riteo) joins
15:46:34earl joins
16:17:04Wohlstand quits [Remote host closed the connection]
16:17:36Wohlstand (Wohlstand) joins
16:23:58katocala quits [Ping timeout: 260 seconds]
16:24:12katocala joins
16:25:30midou quits [Remote host closed the connection]
16:25:39midou joins
16:39:56loug8318142 joins
16:40:49Wohlstand quits [Client Quit]
16:53:53SootBector quits [Remote host closed the connection]
16:54:17SootBector (SootBector) joins
17:22:54katocala quits [Ping timeout: 250 seconds]
17:23:08katocala joins
17:26:30Wohlstand (Wohlstand) joins
17:33:18holbrooke quits [Client Quit]
17:49:07<pokechu22>TheTechRobo: I've already done all of the link extraction (that's what the other lines in the file are); they just need to be grabbed. (But also the additional headers are needed; you should get a JSON response, not an HTML response)
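[Sketch: a one-off sanity check of a single request from the list, using TheTechRobo's example below — the Origin value is an assumption; the real extra headers are on line 50 of pokechu22's process script:]
    import requests

    r = requests.post(
        'https://auctionsoftware.net/mobileapi/fetchLocation',
        json={'countryCode': 62},
        headers={'Origin': 'https://sleepnomoreauction.com'},  # assumed value
    )
    print(r.status_code, r.json())  # expect JSON, not an HTML error page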
18:09:05scurvy_duck joins
18:13:59etnguyen03 quits [Quit: Konversation terminated!]
18:24:09<TheTechRobo>pokechu22: Any rate limiting?
18:24:27<pokechu22>I didn't run into any with my script
18:25:09<pokechu22>(but the script was at concurrency 1 effectively)
18:28:55etnguyen03 (etnguyen03) joins
18:32:10holbrooke joins
18:45:14holbrooke quits [Ping timeout: 250 seconds]
18:52:43icedice quits [Quit: Leaving]
18:56:52<@JAA>From #hackint: 14:02:38 < i> They say that https://data.gov/ is getting deleted as we speak, losing 1000 datasets a day.
19:07:46notarobot1 quits [Ping timeout: 250 seconds]
19:08:40<TheTechRobo>pokechu22: I think I got all of them downloaded to WARC. I don't have WBM permission either, but as you said, it's all POST, so kind of a moot point.
19:10:05<pokechu22>Thanks. It's probably worth doing it a second time near when the auctions finish (which I thought was tomorrow, but it looks like they extended it to Feb 2? or maybe I just confused myself)
19:10:22<pokechu22>!remindme 3d https://sleepnomoreauction.com/ auctions close shortly cc TheTechRobo
19:10:23<eggdrop>[remind] ok, i'll remind you at 2025-02-02T19:10:22Z
19:13:17<TheTechRobo>pokechu22: Will the URLs be the same the second time around?
19:13:35<TheTechRobo>+ POST data
19:13:56<pokechu22>They should. I'll re-run my script and make sure but unless they list new items (which seems unlikely given that they're closing) it shouldn't change
19:14:38<TheTechRobo>Ack
19:16:16<TheTechRobo>Waiting for book_op.php to decide I'm not a serial killer...
19:20:29beardicus quits [Remote host closed the connection]
19:20:30<TheTechRobo>Up at https://archive.org/details/warc-sleepnomoreauction-post-urls
19:26:22scurvy_duck quits [Client Quit]
19:26:22moth_ quits [Read error: Connection reset by peer]
19:26:49moth_ joins
21:03:16cascode joins
21:04:45earl quits []
21:09:12moth_ quits [Read error: Connection reset by peer]
21:13:24etnguyen03 quits [Client Quit]
22:03:31balrog_ is now known as balrog
22:07:30<h2ibot>TheTechRobo edited Qwarc (+1780, Add some basic documentation): https://wiki.archiveteam.org/?diff=54295&oldid=54294
22:07:31loug8318142 quits [Quit: The Lounge - https://thelounge.chat]
22:07:56balrog quits [Quit: Bye]
22:08:14balrog (balrog) joins
22:14:31<h2ibot>TheTechRobo edited Qwarc (-1): https://wiki.archiveteam.org/?diff=54296&oldid=54295
22:42:16graham9 joins
22:43:27graham9 quits [Client Quit]
22:45:00graham9 joins
22:46:31graham9 quits [Client Quit]
22:47:01graham9 joins
22:49:48graham9 quits [Client Quit]
23:00:43SootBector quits [Remote host closed the connection]
23:00:53SootBector (SootBector) joins
23:03:12utulien joins
23:09:07etnguyen03 (etnguyen03) joins
23:09:52<opl>made url lists for catalog.data.gov. apparently the fluctuating amount of available datasets is kinda normal, as it's been happening in older versions available through the wayback machine. doesn't matter though, as it still yielded a bunch of new urls
23:10:57<opl>here's a (mostly complete) list of urls linked to through the website's resources. as the catalog doesn't actually store the data by itself, all of this is hosted on other websites https://transfer.archivete.am/11tMVS/catalog.data.gov-found-urls.txt
23:10:59<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/11tMVS/catalog.data.gov-found-urls.txt
23:12:47<opl>and then there are the urls on catalog.data.gov, to allow browsing it: https://transfer.archivete.am/76FH3/catalog.data.gov-urls.txt (initial api search + html catalog pages sorted by title + datasets) and harvest metadata which allows finding out where the data got indexed from https://transfer.archivete.am/gQgQy/catalog.data.gov-harvest-urls.txt
23:12:48<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/76FH3/catalog.data.gov-urls.txt https://transfer.archivete.am/inline/gQgQy/catalog.data.gov-harvest-urls.txt
23:31:22<pokechu22>opl: which do you think is most important to run first?
23:31:44<pokechu22>... and, is there any rate-limiting?
23:32:00<opl>i have no idea :)
23:32:09Wohlstand quits [Quit: Wohlstand]
23:32:10<opl>ok, more seriously
23:32:51<opl>the catalog seems to just be a search engine for externally hosted data. i have no idea what datasets get deleted, nor why
23:34:05<opl>it could be that the reason the datasets are disappearing is because they disappeared from the websites the dataset in the catalog links to, in which case urls from both lists might be disappearing at the same time
23:35:47<opl>that is, if i understand how this works correctly. i'm making some assumptions: if you go to a random dataset, scroll down to the "metadata source", click on the "harvested from" link, and then go to the "about" tab, you'll find that it links to a .json file on some external website
23:35:52<pokechu22>I think I might concatenate all 3 into one big list, shuffle it, and then run that
23:36:20<opl>and it seems those jsons linked to harvests are the source of truth
23:38:57<opl>yeah, shuffling it all seems sane enough. the api and frontend catalog urls are sorted by creation date and title respectively, so some stuff might be lost to pagination if the datasets happen to get updated while the urls are in the queue
23:39:55<opl>but there are links to the individual datasets in there too, so ultimately whatever datasets exist now will still get hit
23:41:31etnguyen03 quits [Client Quit]
23:46:34<pokechu22>I've started 4 archivebot jobs for it (each with different parts of the shuffled list). Hopefully there won't be any issues with that (if there are super large files then the high concurrency I'm using might cause problems but we can deal with that later)
23:47:53<opl>nice, thanks
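[Sketch: the concatenate/shuffle/split step described above — assumes the three lists are saved locally under their transfer.archivete.am filenames, and splits four ways to match the AB jobs:]
    import random

    urls = set()  # a set also dedupes across the three lists
    for name in ('catalog.data.gov-found-urls.txt',
                 'catalog.data.gov-urls.txt',
                 'catalog.data.gov-harvest-urls.txt'):
        with open(name) as f:
            urls.update(line.strip() for line in f if line.strip())

    shuffled = list(urls)
    random.shuffle(shuffled)
    for i in range(4):  # one chunk per archivebot job
        with open(f'chunk-{i}.txt', 'w') as out:
            out.writelines(u + '\n' for u in shuffled[i::4])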
23:51:57Island joins
23:54:06<pokechu22>looks like things aren't going to be perfect, e.g. https://data.usaid.gov/d/vaeq-cj7j redirects to https://data.usaid.gov/Basic-Education/Nepal-Early-Grade-Reading-Program-EGRP-/vaeq-cj7j/about_data and archivebot will reject that due to the no-parent rule (which is a thing for !ao < list jobs even though that doesn't really make sense), though it probably also won't work due to javascript
23:54:27<pokechu22>but we'll still end up saving some stuff at least
23:56:11<opl>yeah, a lot of the datasets are just links to websites with more information about them, urls to map tile services, and other stuff like that
23:56:58<opl>information about most of the datasets' existence doesn't exist in the wayback machine at all though, so at least that'll get preserved
23:58:13ducky quits [Ping timeout: 260 seconds]