#archiveteam-bs log for 2021-08-31

00:01:44		dm4v_ joins
00:01:44		dm4v quits [Read error: Connection reset by peer]
00:01:56		dm4v_ is now known as dm4v
00:01:58		dm4v is now authenticated as dm4v
00:01:58		dm4v quits [Changing host]
00:01:58		dm4v (dm4v) joins
00:47:35		dasineura joins
00:48:11	<dasineura>	https://www.23andmeforums.com/discussions shutting down 9/6/21. site requires login to access, wondering if it can still be archived?
00:48:28	<dasineura>	" - On September 6, 2021, the forum content will be deleted and 23andmeforums.com will redirect to 23andme.com" from the website
01:02:29		dm4v quits [Read error: Connection reset by peer]
01:03:08		dm4v joins
01:03:10		dm4v is now authenticated as dm4v
01:03:10		dm4v quits [Changing host]
01:03:10		dm4v (dm4v) joins
01:08:02		lennier2 joins
01:10:36		lennier1 quits [Ping timeout: 250 seconds]
01:10:43		lennier2 is now known as lennier1
01:45:10	<Jake>	alliew: mediatype:web would be perfect, and community data is the correct collection I believe, but it will go to the WARCzone automatically ( https://archive.org/details/warczone ). the Internet Archive isn't accepting WARCs for ingestion into the WBM from anyone but a very specific set of whitelisted users right now.
01:46:22	<alliew>	aight! archiveteam keyword yay or nay?
02:05:17	<Jake>	I think it's fine?
02:18:46	<@OrIdow6>	AFAIK it doesn't affect anything
02:26:18		ThreeHM quits [Ping timeout: 244 seconds]
02:28:33		ThreeHM (ThreeHeadedMonkey) joins
02:47:29	<alliew>	uploading ^^
02:48:02	<alliew>	ultimate-guitar has had a "80% off on pro accounts" banner for like a year and i am Nervous about its financial situation because of it
02:48:59	<@OrIdow6>	dasineura: Generally the answer is no for login-only stuff like that
02:49:50	<dasineura>	yea was afraid of that
02:54:22	<@OrIdow6>	alliew: Is ultimate-guitar one of these things you uploaded?
02:54:30		dasineura leaves
02:54:34	<@JAA>	Translation: it can be done and has happened before, but it will generally never go into the Wayback Machine. Also, making the data publicly available is only really an option if anyone can get access to the forums with a simple registration and no other barriers, and even then it's murky.
02:54:39	<@JAA>	Welp
02:55:24	<alliew>	OrIdow6, nope. i'm interested in archiving it
02:55:43	<Frogging101>	So my misterpoll won't go into WBM, then
02:55:50	<Frogging101>	right?
02:55:56	<alliew>	their paging doesn't go past 100 pages but they have a static artists list
02:56:04	<alliew>	so i think i can scrape all the tab urls through that
02:57:12	<Frogging101>	"making the data publicly available is only really an option if anyone can get access to the forums with a simple registration and no other barriers, and even then it's murky."
02:57:14	<Frogging101>	what does this mean
02:57:25	<Frogging101>	does it mean that if it wasn't public before, it's unethical to make it public?
02:57:36	<@OrIdow6>	For me it says the sale ends in 4 hours, is that just a gimmick?
02:58:48	<@OrIdow6>	Or is the end of a months long timer really in 4 hours?
03:00:15	<@OrIdow6>	Anyhow, if you think there is a serious risk of it going down or if it is small and/or simple and/or especially important (e.g. political) ArchiveTeam may be able to take it on itself
03:00:18	<@JAA>	Frogging101: Basically yeah. Only public stuff goes into the WBM. If there's a login wall that anyone can bypass simply by creating an account, it's effectively public, but it still generally wouldn't go into the WBM. If there's any barrier beyond that (say, subscription fee to access part of the site, like on SomethingAwful, or something like reputation/minimum number of posts/etc.), the data
03:00:24	<@JAA>	probably shouldn't be public at all.
03:00:55		alliew quits [Ping timeout: 244 seconds]
03:01:32	<@OrIdow6>	Missed everything I said to them, of course
03:01:48	<@JAA>	As is tradition.
03:02:34		alliew joins
03:02:45	<@OrIdow6>	Oh
03:02:59	<@JAA>	Frogging101: Doesn't mean such things shouldn't be archived, of course. It's the data sharing/publication that's the problematic side of things.
03:03:07	<@OrIdow6>	alliew: For me it says the sale ends in 4 hours, is that just a gimmick, or is the end of a months long timer really in 4 hours? Anyhow, if you think there is a serious risk of it going down or if it is small and/or simple and/or especially important (e.g. political) ArchiveTeam may be able to take it on itself
03:03:25	<alliew>	oh, they just keep changing the banner lol
03:03:29	<Frogging101>	If it's archived but nobody is allowed to see it then how useful is it?
03:05:14	<@OrIdow6>	Because "nobody is allowed to see it" is presumed to be relaxed in the future or under special circumstances
03:05:30	<Frogging101>	true enough
03:05:47	<alliew>	as for the importance, i find it pretty important as a resource
03:05:59	<alliew>	i'll probably have a url scrape going tomorrow
03:06:10	<alliew>	it should be around like, 1.7 million?
03:06:16	<alliew>	which with http2 is doable solo
03:07:16	<@JAA>	WARC doesn't support HTTP/2 though.
03:08:18	<alliew>	most HTTP/2 responses you get are HTTP/1.1 valid
03:08:34	<h2ibot>	JustAnotherArchivist edited Deathwatch (+190, /* 2021 */ Add 23andMe forums): https://wiki.archiveteam.org/?diff=47092&oldid=47087
03:08:37	<@JAA>	Er, no, because they say 'HTTP/2'.
03:08:46	<alliew>	i mean, yeah
03:08:53	<alliew>	i'm saying the actual content
03:08:58	<alliew>	fits the HTTP/1.1 spec
03:09:05	<alliew>	though the transport is different
03:09:17	<@JAA>	Yeah, but the status line doesn't, so it isn't allowed by the WARC specifications.
03:09:26	<nicolas17>	yeah so it can be converted, doesn't mean it's simply compatible
03:09:31	<@JAA>	(Please don't write non-compliant or faked WARCs.)
03:09:31	<alliew>	yep
03:10:37	<alliew>	this is a personal take, as someone who works in archival, so yknow, grains of salt
03:11:42	<alliew>	but personally, if it makes saving data where it's possible while keeping all the content fine, i'm pro?
03:12:00	<alliew>	like, if it's a convertible response, HTTP/2 allows for really fast scraping
03:13:23	<@JAA>	Let's get the performance argument out of the way: I'm regularly grabbing hundreds of responses per second from a single slowish machine over HTTP/1.1. (All you need is lots of parallel connections.)
03:14:18	<alliew>	i've found HTTP/2 multiplexing + multithreading significantly faster than conn pooling
03:14:28	<@JAA>	The issue is a different one. These WARC files are supposed to stay around for decades (or hopefully 'forever'). The further they stray from the official standard, the messier things get down the line. So it's best to stick to the standard. And if that's not possible, one should work on extending the standard, like we've done with the zstd compression for example.
03:14:45	<alliew>	oh yeah, agreed on that
03:15:26	<@JAA>	Since the standard specifically refers to HTTP/1.1 everywhere, that's all that's allowed, really. Technically, you can't even archive HTTP/0.9 or /1.0 I think (which I guess should be fixed as well).
03:15:27		alliew quits [Read error: Connection reset by peer]
03:15:52		alliew joins
03:16:15	<nicolas17>	alliew: what's the last message you saw? >.>
03:16:43	<alliew>	changing a status line for a compatible response, with proper flagging, personally to me keeps to both the standard in usability and preservation, again, only if it's straight up compatible except for the status line
03:16:45	<alliew>	uhh
03:16:50	<alliew>	my "oh yeah, agreed on that"
03:17:03	<@JAA>	03:15:26 <@JAA> Since the standard specifically refers to HTTP/1.1 everywhere, that's all that's allowed, really. Technically, you can't even archive HTTP/0.9 or /1.0 I think (which I guess should be fixed as well).
03:17:29	<@JAA>	'Proper flagging' isn't possible within the standard either.
03:17:45	<@JAA>	Really WARC should get HTTP/2 support (and /3 in the future). Everything else is insanity.
03:17:58	<alliew>	if your archive's staying around i sure goddamn hope your metadata is staying around
03:18:13	<alliew>	just for the sake of the intern in a century who's having to re-index it :p
03:18:15	<nicolas17>	uh how do I tell WBM to archive a page now? it already exists in WBM, it's just old
03:18:45	<nicolas17>	IIRC there's a "save page now" button that appears when a page isn't in WBM yet
03:18:49	<@JAA>	I meant a clean way of marking such records, e.g. in a WARC header.
03:19:14	<Frogging101>	JAA: what are the chances of my misterpoll grab getting mediatype: web
03:19:24	<@JAA>	But well, I still think such transformations should always be avoided entirely.
03:19:41	<alliew>	i guess it comes down to letter-of-the-law or not, right
03:20:12	<@OrIdow6>	nicolas17: web.archive.org/save
03:20:34	<@JAA>	WARCs are already a huge mess to correctly process due to ambiguities in the standard and everyone ignoring the standard (cf. the payload digest debacle). Let's not make it worse.
03:21:00	<alliew>	oh, by the way y'all
03:21:33	<@JAA>	As I said, performance-wise, meh, not a big deal. It's rare to find sites that happily handle thousands of requests per second from one IP anyway.
03:21:52	<Frogging101>	JAA: speaking of payload hash, this never got merged :( https://github.com/ArchiveTeam/wpull/pull/360
03:22:15	<alliew>	if you're interested in a nightmare to archive, may i suggest the Hemeroteca Digital Brasileira? it's brazil's online newspaper archive, and has recently come down for days at a time. it's also ASPX and written so badly,,
03:22:26	<@JAA>	Frogging101: You can set the mediatype yourself. That's not an issue. It's also not the only factor for ingestion into the WBM though.
03:22:41	<alliew>	i've worked on scraping it before, and jesus christ i want to talk to whoever wrote this
03:23:08	<alliew>	just ask if they're ok. if ASPX has them bound to it through a curse that forces them to write in it.
03:23:54	<alliew>	JAA: yeah, i get your position on that
03:23:54	<@JAA>	Frogging101: Ugh yeah, right, that one. Reminds me of the warcio bugs I recently discovered.
03:27:27	<@JAA>	alliew: Oh, that clusterfuck. I remember looking at that briefly when the Museu Nacional went up in flames, I think.
03:27:42	<alliew>	god, rip museu nacional
03:27:56	<alliew>	being an archivist in this country is Pain
03:28:08	<alliew>	our film archive caught fire too recently
03:28:36	<@JAA>	Oof. I missed that.
03:29:24	<alliew>	didn't get as much stuff as the museu nacional fire but was still painful
03:30:10	<nicolas17>	yikes
03:34:16	<alliew>	oh also doing a wget --mirror warc grab of a-infos, found it on the fire drill page
03:43:27	<alliew>	anyways, gotta sleep and i haven't set my IRC bouncer back up yet. have a g'night
03:43:31		alliew quits [Client Quit]
03:45:41		qw3rty_ joins
03:49:12		qw3rty__ quits [Ping timeout: 250 seconds]
04:32:25	<@OrIdow6>	Anyone know of any Google Docs filetypes that are neither documents, spreadsheets, presentations, sites, nor regular (bytestring) files?
04:33:28	<@OrIdow6>	And technically folders
04:34:55	<@OrIdow6>	Apparently there's also drawings, tables, and forms, don't know if those are downloadable or not
04:35:57	<@OrIdow6>	Whatever the case, I intend to write this conservatively enough that if an unexpected type comes up, it will fail and make that known
05:06:59		benjins quits [Ping timeout: 244 seconds]
05:22:13		nicolas17 quits [Ping timeout: 252 seconds]
05:36:28		BlueMaxima_ joins
05:40:22		BlueMaxima quits [Ping timeout: 252 seconds]
06:13:07		BlueMaxima_ quits [Ping timeout: 244 seconds]
10:25:05		benjins joins
10:25:18		benjins is now authenticated as benjins
11:44:35		Iki joins
11:55:27		Iki1 joins
11:56:29		gurubob joins
11:59:17		Iki quits [Ping timeout: 244 seconds]
12:01:01		gurubob quits [Remote host closed the connection]
12:42:30		knecht420 quits [Quit: The Lounge - https://thelounge.chat]
12:42:45		knecht420 (knecht420) joins
12:54:56		nertzy__ joins
13:11:54		nertzy__ quits [Client Quit]
13:26:47		Megame quits [Client Quit]
14:34:07		qwertyasdfuiopghjkl joins
14:54:45		jonst123 joins
15:05:17		Arcorann quits [Ping timeout: 244 seconds]
15:13:28		hexa- quits [Quit: WeeChat 3.1]
15:14:56		hexa- (hexa-) joins
15:15:07	<@OrIdow6>	arkiver: Any updates on the review of google-drive-grab?
16:03:54		Gereon6 (Gereon) joins
16:16:00	<@JAA>	For visibility: ArchiveBot is currently broken and cannot accept new jobs for the time being.
16:26:54		alliew (alliew) joins
16:37:54		nicolas17 joins
17:03:39	<@arkiver>	OrIdow6: not yet, in a few hours i hope
17:05:32	<@arkiver>	checking it now
17:06:52	<@arkiver>	nice one on the two checks in checkip
17:08:30	<@arkiver>	OrIdow6: do you have items?
17:09:33	<@arkiver>	for a size estimate, i don't think we have to go through all folders first
17:10:13	<@arkiver>	we simply take a sample out of the complete pool of folders and files, and check the total size after archiving those
17:11:01	<@arkiver>	(assuming not many files in the folders are listed in our raw list outside of the folders)
17:16:51	<@arkiver>	OrIdow6: why are you adding an ignore for protocol_and_domain_and_port?
17:17:00	<@arkiver>	line 91-94
17:20:06	<@OrIdow6>	arkiver: I have test items, I do not have item list yet
17:23:27	<@OrIdow6>	You're right it's wrong, I think that line in add_ignore may be missing a $ at the end
17:28:30	<@OrIdow6>	I've also noticed I'm not checking status_code in "if req_callbacks[url] ~= nil then"
17:29:55	<@arkiver>	could be, still checking it
17:31:08	<@arkiver>	OrIdow6: why do we return false in allowed on start_urls_inverted[url] ?
17:33:28		LeGoupil joins
17:34:40	<@OrIdow6>	arkiver: Because if start_urls_inverted[url] is true, url is in start_urls, which I'm using to set new items
17:34:54	<@OrIdow6>	So there's a risk of setting a new item if start_urls_inverted[url]
17:37:38	<@arkiver>	right, yeah i see allowed is not used to check if a URL is in-scope in httploop_result
17:37:43	<@arkiver>	(because all URLs are i believe)
17:43:24	<@OrIdow6>	Yes, I'm trying to have more detailed control over the retry process
17:59:00	<@OrIdow6>	And I will say I have not bothered with previews, there are something like 8 different types I'd have to write and they are of marginal benefit
18:01:30		qwertyasdfuiopghjkl31 joins
18:01:59		qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
18:02:22		qwertyasdfuiopghjkl31 is now known as qwertyasdfuiopghjkl
18:41:37		alliew quits [Client Quit]
18:46:23	<@arkiver>	alright
19:02:53	<@arkiver>	OrIdow6: what are your test items?
19:03:05	<@arkiver>	nice one on the POST requests :)
19:09:43	<systwi>	JAA: What broke with ArchiveBot?
19:10:09	<Jake>	some Redis issue I believe.
19:21:39		lennier1 quits [Client Quit]
19:21:49	<spirit>	maybe use a webscale database next time?
19:24:42	<@OrIdow6>	arkiver: folder:1r8I5hpSPCf_9JWECwa6c4E4tQZELd3cx folder:1oCMgJeBc55NuEasPcgwjx2FuPdQd8neu folder:0B7z5EDsKyEsGfkEybGh2Y0tuc0dpMTVCbDZ4N1RXTGZMbnhwWEZqcnJmMzVYcy10SEplSlE
19:25:37	<@OrIdow6>	As well as subfolders etc. if you want more
19:32:00		onetruth joins
19:37:32	<pcr>	OrIdow6: You might want to add in something for Google Colab notebooks, those are stored in drive
19:38:40	<pcr>	Sample item: https://colab.research.google.com/drive/11z58bl3meSzo6kFqkahMa35G5jmh2Wgt
19:42:26		lennier1 (lennier1) joins
19:45:58	<pcr>	You might also need handling for Jamboard
19:50:41	<@OrIdow6>	Thank you pcr, do you have an example of the second?
19:55:26		wessel1512 is now authenticated as wessel1512
20:04:57	<pcr>	https://jamboard.google.com/d/1uJgeZf69HWVLATuEKjrAHCyDW1ADPmvDPJkjrHs6-j8/edit
20:16:14	<@OrIdow6>	Thanks
20:16:33	<@OrIdow6>	FYI it looks like colab will not need special handling, but jamboard will need a bit
20:22:09	<@OrIdow6>	Able to download what I assume is a canonical JSON vs export as PDF
20:25:59	<@OrIdow6>	If I have time may try to work on the web interface directly, but I think it's unlikely it will work
20:37:11		dewdrop quits [Quit: The Lounge - https://thelounge.chat]
20:39:50	<@JAA>	systwi: Redis broke. Check your logs in #archivebot for the details.
20:40:35		onetruth quits [Read error: Connection reset by peer]
20:42:34		dewdrop (dewdrop) joins
20:51:08	<@arkiver>	lets get a google drive channel going!
20:51:09	<@arkiver>	any ideas?
20:51:34	<@JAA>	googlecrash
20:51:46	<@arkiver>	nice
20:53:46	<Jake>	i like it :)
20:54:22	<@arkiver>	#googlecrash it is
21:28:17		LeGoupil quits [Client Quit]
21:36:14		Stiletto quits [Read error: Connection reset by peer]
21:36:28		Stiletto joins
21:40:01		qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
21:48:16		qwertyasdfuiopghjkl joins
22:22:34		Megame (Megame) joins
23:03:43		phiresky quits [Ping timeout: 252 seconds]
23:06:05		phiresky joins
23:31:08		Arcorann (Arcorann) joins
23:35:45		qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
23:46:36		dserve quits [Ping timeout: 244 seconds]
23:49:42		BlueMaxima joins
