#archiveteam-bs log for 2021-08-23


00:00:43		dm4v quits [Read error: Connection reset by peer]
00:01:10		dm4v joins
00:01:12		dm4v is now authenticated as dm4v
00:01:12		dm4v quits [Changing host]
00:01:12		dm4v (dm4v) joins
00:07:44		nertzy quits [Client Quit]
00:10:04		BlueMaxima joins
00:14:01	<@arkiver>	OrIdow6: if mods.io/mods is really everything they have, i'm not sure if it needs a highly customized project
00:14:10	<@arkiver>	perhaps something for JAA's tool?
00:20:23	<@arkiver>	OrIdow6: it would be good to try to simulate the browser calls for google drive, which i guess would include the multipart post request if possible
00:22:18	<@arkiver>	for overall size estimate, if you're talking about everything, we're probably looking at 100s of TBs
00:22:28	<@JAA>	Possibly re mods.io. I did mean to look into it more. The downloads are obfuscated in an annoying way.
00:22:45	<@arkiver>	and i'm not sure about storing 100s of TBs of Google Drive
00:23:14	<@arkiver>	for a size estimate you could always check sizes of random samples and make an estimate using the lists you have
00:25:02	<@arkiver>	we could also just finish the Google Drive project code, start it on some random items, and estimate size using those
00:40:44		noguarantees joins
00:47:06	<@arkiver>	does anyone know who Janub is?
01:03:03		dm4v_ joins
01:03:56		jacobk quits [Ping timeout: 250 seconds]
01:04:31		dm4v quits [Ping timeout: 252 seconds]
01:04:31		dm4v_ is now known as dm4v
01:04:31		dm4v is now authenticated as dm4v
01:04:31		dm4v quits [Changing host]
01:04:31		dm4v (dm4v) joins
01:14:25		wyatt8740 quits [Ping timeout: 252 seconds]
01:15:25		wyatt8740 joins
01:21:00		wyatt8750 joins
01:21:34		wyatt8740 quits [Ping timeout: 252 seconds]
01:32:38	<@OrIdow6>	arkiver: You're probably right on that; I don't think there should be much conflict between getting the POSTs that the browser gets and the GETs that make it accessible
01:32:54	<@OrIdow6>	Since the bulk of data from this proj will be files, and not those API calls
01:34:23	<@OrIdow6>	With the size estimate, one thing I am also looking to do is be able to come up with "if we only get files under 5 MB, total size will be XXX" sort of things
01:35:08		lennier1 quits [Client Quit]
01:35:23	<@OrIdow6>	The problem with making an estimate from a basic sample is that items are not independent, folder items (planned) contain files
01:35:57	<@OrIdow6>	Maybe it would work to do a 2-stage project, where first we get all file metadata, then we decide on a threshold or whatever for which file content to get?
01:36:59		lennier1 (lennier1) joins
01:38:33	<@OrIdow6>	Think I like that
01:39:02	<@OrIdow6>	Sort of does implement that idea of just surveying everything
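
The "if we only get files under 5 MB, total size will be XXX" report could be produced from stage-one metadata roughly as follows. This is a minimal sketch, not project code: the function name `totals_by_threshold`, the decimal `MB` constant, and the sample sizes are all made up for illustration, and it assumes the metadata stage yields one byte count per file.

```python
from bisect import bisect_left
from itertools import accumulate

def totals_by_threshold(sizes, thresholds):
    """Map each threshold to the total bytes of files strictly smaller than it."""
    ordered = sorted(sizes)
    # prefix[i] is the combined size of the i smallest files.
    prefix = [0] + list(accumulate(ordered))
    return {t: prefix[bisect_left(ordered, t)] for t in thresholds}

MB = 1000 * 1000  # assumption: decimal megabytes
# Hypothetical file sizes (bytes) as collected by a metadata-only first stage.
file_sizes = [200_000, 3 * MB, 4 * MB, 80 * MB, 2_000 * MB]
report = totals_by_threshold(file_sizes, [5 * MB, 100 * MB])

for threshold, total in report.items():
    print(f"files under {threshold // MB} MB: {total / MB:.1f} MB total")
```

Sorting once and using prefix sums keeps each extra threshold cheap, which fits the idea of choosing the cutoff only after the survey is done.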
02:23:50		nertzy (nertzy) joins
02:33:37		Stilett0 quits [Ping timeout: 252 seconds]
02:40:13		nertzy quits [Client Quit]
02:50:41		nertzy (nertzy) joins
03:12:33		benjinsmith joins
03:14:14		jacobk joins
03:14:22		benjins quits [Ping timeout: 250 seconds]
03:42:34		noguarantees quits [Remote host closed the connection]
03:44:12		Barto quits [Ping timeout: 244 seconds]
03:46:00		nertzy quits [Client Quit]
03:53:44		qw3rty_ joins
03:57:13		qw3rty__ quits [Ping timeout: 252 seconds]
04:13:39		aleph quits [Ping timeout: 244 seconds]
04:15:29		nertzy (nertzy) joins
04:17:36		aleph joins
04:18:28		nertzy_ joins
04:21:32		nertzy quits [Ping timeout: 250 seconds]
04:22:40		Barto (Barto) joins
04:47:10	<Ryz>	Nexus Mods are getting more and more heat lately, as I stumbled upon this video: https://www.youtube.com/watch?v=e1wZS0VO0c8
04:47:35	<Ryz>	Anyone interested in trying to scrape content from it?
04:48:04	<Ryz>	It's kinda like Mods.io but huge and more complicated
05:02:15		DogsRNice quits [Read error: Connection reset by peer]
05:02:37		qwertyasdfuiopghjkl quits [Client Quit]
05:16:07		qwertyasdfuiopghjkl joins
06:00:12		BlueMaxima_ joins
06:03:42		BlueMaxima quits [Ping timeout: 244 seconds]
06:10:16		nertzy_ quits [Client Quit]
06:32:45		BlueMaxima__ joins
06:36:44		BlueMaxima_ quits [Ping timeout: 250 seconds]
06:40:00		nertzy_ joins
06:42:27		qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
06:55:17		nertzy_ quits [Client Quit]
06:59:27		knecht420 quits [Read error: Connection reset by peer]
06:59:31		knecht4206 (knecht420) joins
07:00:21		knecht4206 quits [Client Quit]
07:02:20		knecht4206 (knecht420) joins
07:02:59		knecht4206 is now known as knecht420
07:28:12		Megame quits [Client Quit]
07:37:11		BlueMaxima_ joins
07:41:04		BlueMaxima__ quits [Ping timeout: 252 seconds]
07:54:01		BlueMaxima__ joins
07:57:46		BlueMaxima_ quits [Ping timeout: 250 seconds]
08:09:56		BlueMaxima_ joins
08:13:48		BlueMaxima__ quits [Ping timeout: 250 seconds]
09:16:26		wessel1512 quits [Read error: Connection reset by peer]
09:16:46		wessel1512 joins
10:01:46		BlueMaxima_ quits [Client Quit]
10:24:22		benjinsmith is now known as benjins
10:24:24		benjins is now authenticated as benjins
10:31:37		pabs quits [Remote host closed the connection]
10:50:34		pabs (pabs) joins
11:11:19	<@arkiver>	OrIdow6: why not just implement folder support now as well and start the test with that?
11:12:07	<@arkiver>	i know folder support will discover new items, but overall (assuming the files in folders are not usually referenced on their own in the initial item list), this will still create a nice size estimate
11:59:10		nertzy_ joins
12:14:22		nertzy_ quits [Client Quit]
12:25:18	<@OrIdow6>	arkiver: If I'm understanding you right, that's roughly what I meant - get metadata of all files, including by expanding folders, and then get the estimate and decide on a threshold
12:27:08	<@OrIdow6>	I think it would be plausible to load all files and folders into the tracker, expand folders into files, and then take a random sample of the file items in the queue?
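
The approach discussed above (expand folders into files, then sample the file items) might look like the sketch below. Everything here is hypothetical: the `folders` mapping, the item IDs, and `estimate_total` are illustrative names, not tracker or project APIs, and the estimate is just sample mean times file count.

```python
import random

def expand(item, folders):
    """Yield the file IDs reachable from item, descending into nested folders."""
    stack, seen = [item], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in folders:
            stack.extend(folders[current])  # a folder: queue its children
        else:
            yield current  # a plain file

def estimate_total(queue, folders, size_of, sample_size, rng):
    """Return (file count, estimated total bytes) from a random sample of files."""
    # Deduplicate so a file listed both directly and inside a folder counts once.
    files = sorted({f for item in queue for f in expand(item, folders)})
    if not files:
        return 0, 0.0
    sample = rng.sample(files, min(sample_size, len(files)))
    mean = sum(size_of(f) for f in sample) / len(sample)
    return len(files), mean * len(files)

# Hypothetical data: dirA contains f1, f2 and a nested folder dirB.
folders = {"dirA": ["f1", "f2", "dirB"], "dirB": ["f3"]}
sizes = {"f1": 100, "f2": 200, "f3": 300, "f4": 400}
count, estimate = estimate_total(
    ["dirA", "f2", "f4"], folders, sizes.__getitem__, 4, random.Random(0)
)
print(count, estimate)
```

Sampling after expansion sidesteps the non-independence problem raised earlier, since every sampled unit is then a single file rather than a folder of unknown size.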
14:21:17		Webuser470 joins
14:23:48	<Webuser470>	I'm sure it's been crawled a million times over but snopes.com is running a "save us" donation drive and implies in the FAQ that the site in its current form may disappear if goals aren't met. A bit vague but it might be worth having a quick look at if no one has mentioned it yet. https://www.snopes.com/sos/
14:34:42		Arcorann quits [Ping timeout: 250 seconds]
14:51:06	<@JAA>	Grabbing the mods.io downloads now. Shouldn't take long as there are only a couple hundred mods. The obfuscation of the download URL is actually just a simple cumulative xor.
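
For readers unfamiliar with the term, one common reading of "cumulative xor" is that each output byte is the xor of all input bytes so far. The log does not give the actual mods.io scheme, so the sketch below is only an assumption about the general technique; the function names and the `seed` parameter are made up.

```python
def cumulative_xor_encode(data: bytes, seed: int = 0) -> bytes:
    """Each output byte is the running xor of the seed and all input bytes so far."""
    out = bytearray()
    running = seed
    for byte in data:
        running ^= byte
        out.append(running)
    return bytes(out)

def cumulative_xor_decode(data: bytes, seed: int = 0) -> bytes:
    """Invert the encoding: xor each byte with the previous encoded byte."""
    out = bytearray()
    previous = seed
    for byte in data:
        out.append(byte ^ previous)
        previous = byte
    return bytes(out)

# Hypothetical path, not a real mods.io URL.
secret = b"/download/123"
assert cumulative_xor_decode(cumulative_xor_encode(secret)) == secret
```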
14:51:45	<@JAA>	Not sure it'll work in the WBM though.
15:10:09	<@JAA>	Done
15:28:42		qwertyasdfuiopghjkl joins
15:52:14		atphoenix_ is now known as atphoenix
16:18:03		Stiletto joins
16:53:41	<Ryz>	Oh, Gamasutra's gonna be revamped and renamed to Game Developer: https://gamasutra.com/view/news/387227/Gamasutra_is_becoming_Game_Developer.php S:
16:54:11	<Ryz>	They're gonna make the move on Thursday D:
16:54:32	<Ryz>	And transfer the stuff meanwhile~
16:55:09	<Ryz>	Can someone look into this? This is most likely a huge website
16:57:45	<Frogging101>	I think it was put through archivebot a few years back
16:58:04	<Frogging101>	and IIRC it was not being actively updated at the time
17:00:10	<Ryz>	...I looked at https://archive.fart.website/archivebot/viewer/domain/gamasutra.com - all of these are shallow grabs, "!ao" grabs
17:02:47	<Frogging101>	Hmm. I see. Thanks for checking.
17:03:50	<Frogging101>	There's this too, but I don't see any major grabs here either. https://archive.fart.website/archivebot/viewer/domain/www.gamasutra.com
17:03:56	<Frogging101>	Maybe I was mistaken or confused it with another website
17:04:51	<Ryz>	Sigh, unfortunately if this will be archived in its current state before the move, it won't get the comments because they changed how it was rendered a couple of years ago
17:07:21	<Frogging101>	The pages also load really slowly for some reason. The network activity finishes quickly but it takes seconds before content appears
17:07:24	<Frogging101>	maybe it's my browser, idk
17:07:34	<Ryz>	No, that's also a thing too...
17:07:43	<Ryz>	It used to really load fast~
17:07:49	<Frogging101>	or they're doing... something onLoad()
17:14:11	<@JAA>	There's an opacity:0 rule on the main div that probably gets removed with JS. I hate that trend, and it's fucking everywhere...
17:16:00	<Ryz>	...Ugh...
17:16:40	<@JAA>	Apart from the comments, at least that announcement page actually works okay, too...
17:18:02	<Ryz>	JAA, is there a way to extract offsite links while having --no-offsite-links on a job? I want to make sure hopefully Gamasutra finishes archiving before the switchover, and run the offsite goods in a separate section, since this is one of the main places for game developer stuff with low coverage on the more niche stuff
17:18:02	<@JAA>	(Without JS, after removing that rule, I mean.)
17:19:20	<@JAA>	Ryz: It's possible but not easy. I believe wpull still records offsite URLs in the DB even when that option is present, so that might work. Otherwise, the WARCs would have to be post-processed, e.g. with rewby's tooling.
17:20:12	<Ryz>	Option is present, as in when the option is available or used?
17:20:19	<@JAA>	Used
17:23:31	<Ryz>	Threw the thing onto AB, with --no-offsite-links
17:28:17		jacobk quits [Ping timeout: 244 seconds]
17:29:43		LeGoupil joins
17:29:50		wyatt8750 quits [Ping timeout: 244 seconds]
17:30:08		wyatt8740 joins
17:30:39	<Ryz>	I really really want the comments to be saved in some way properly, because right now they just load dynamically, as noted when checking the archived version on WBM... :C
17:31:26	<Ryz>	There's comments in the game maker pages like https://www.gamasutra.com/blogs/author/WilliamGrosso/997973/ - but the longer ones are truncated or squished~
17:35:24	<rewby>	Ryz, JAA: I was facing a similar issue on another grab. I ended up doing the grab with heritrix3 and having it log the results from the scope check to disk. That way I had a list of "accept" and "reject" for each url it finds. And I could just filter for the rejected ones.
17:36:02	<@JAA>	You successfully used Heritrix? I heard it's a massive pain to set up, so congratulations. lol
17:36:38	<rewby>	It's not that bad tbh
17:36:43		wyatt8740 quits [Ping timeout: 252 seconds]
17:36:49	<rewby>	You just have to wrap your head around it
17:37:05		wyatt8740 joins
17:37:26	<@JAA>	But yeah, guess that'd be another way. wpull does record off-site URLs in the DB even when not fetching them, so I can get them that way.
17:37:57	<rewby>	Interesting.
17:38:09	<rewby>	Care to share the code you use for that?
17:38:30	<rewby>	Could be useful when I'm doing stuff with grab-site
17:40:25	<@JAA>	Haven't actually done it yet. But basically I'll run a little SQL query for all skipped URLs that aren't inline with a LIKE filter for the primary domain. Doing it for a job with multiple root URLs would be more complicated.
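
The query described above could be sketched like this. Note the heavy assumptions: the single `urls` table and its `status` and `is_inline` columns are a toy schema invented for this example and do not match the real wpull database layout, and the URLs are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema (an assumption, not wpull's actual tables).
conn.executescript("""
CREATE TABLE urls (url TEXT PRIMARY KEY, status TEXT, is_inline INTEGER);
INSERT INTO urls VALUES
  ('https://www.gamasutra.com/blogs/', 'done', 0),
  ('https://example.org/article', 'skipped', 0),
  ('https://cdn.example.net/logo.png', 'skipped', 1),
  ('https://example.com/?ref=gamasutra.com', 'skipped', 0),
  ('https://gamasutra.com/old-page', 'skipped', 0);
""")

# Skipped, non-inline URLs whose host is not the primary domain or a subdomain.
offsite = [row[0] for row in conn.execute("""
    SELECT url FROM urls
    WHERE status = 'skipped'
      AND is_inline = 0
      AND url NOT LIKE 'http%://gamasutra.com/%'
      AND url NOT LIKE 'http%://%.gamasutra.com/%'
    ORDER BY url
""")]
print(offsite)
```

A LIKE filter is only a rough host check; parsing each URL and comparing hostnames would be stricter, and a job with multiple root URLs would need one pair of patterns per root.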
17:41:35	<@JAA>	The Gamasutra sitemap.xml is obviously very incomplete. That's a shame.
17:54:18		tommyshinebox joins
18:05:03		qwertyasdfuiopghjkl quits [Client Quit]
18:34:53		Viniter692 (Viniter) joins
18:38:14		Viniter69 quits [Ping timeout: 250 seconds]
18:38:14		Viniter692 is now known as Viniter69
18:56:46	<h2ibot>	Entartet edited List of websites excluded from the Wayback Machine (+26, Added ummjackson.com.): https://wiki.archiveteam.org/?diff=47056&oldid=47037
18:57:43		@AlsoJAA quits [Ping timeout: 258 seconds]
18:57:46	<h2ibot>	Rewby edited ISP Hosting (-1): https://wiki.archiveteam.org/?diff=47057&oldid=46400
18:57:47	<h2ibot>	Kyndigs uploaded File:Featured-content-itunes-u 2x.png: https://wiki.archiveteam.org/?title=File%3AFeatured-content-itunes-u%202x.png
18:57:48	<h2ibot>	Kyndigs edited ITunes U (+196, /* Archiving Status */): https://wiki.archiveteam.org/?diff=47060&oldid=46973
18:57:49		AlsoJAA (JAA) joins
18:57:49		@ChanServ sets mode: +o AlsoJAA
18:58:07		sknebel quits [Quit: No Ping reply in 180 seconds.]
18:58:17	<rewby>	Woo thanks for approving my edit
18:59:29		sknebel (sknebel) joins
18:59:47	<h2ibot>	Wolfin edited Deathwatch (+167, /* 2021 */): https://wiki.archiveteam.org/?diff=47061&oldid=47036
19:03:47	<h2ibot>	Nintendofan885 edited YouTube (+26, /* Removed or blocked channels */ Link to…): https://wiki.archiveteam.org/?diff=47062&oldid=47047
19:03:48	<h2ibot>	Themadprogramer edited Discourse (+51, /* Active Discourses */): https://wiki.archiveteam.org/?diff=47063&oldid=47006
19:08:37		nertzy_ joins
19:09:48	<h2ibot>	JustAnotherArchivist edited Deathwatch (+163, /* 2021 */ Refs for Magic Legends and Tom…): https://wiki.archiveteam.org/?diff=47064&oldid=47061
19:10:51	<@JAA>	MetroLyrics, one of the major song lyrics sites, vanished in late June.
19:15:49	<h2ibot>	JustAnotherArchivist edited Deathwatch (+121, /* 2021 */ MetroLyrics is ded): https://wiki.archiveteam.org/?diff=47065&oldid=47064
19:23:56		nertzy_ quits [Client Quit]
19:27:12	<nicolas17>	so how's progress with iTunes U?
19:27:50	<nicolas17>	I have ways to intercept iOS TLS traffic easily if that'd help get URLs
19:29:42		swebb quits [Ping timeout: 244 seconds]
19:39:27	<nicolas17>	daaamn, it sent 24 cookies that I don't know the source of on the first request (probably my preexisting login from the appstore)
19:53:10		Webuser470 quits [Remote host closed the connection]
19:57:17		Stiletto quits [Remote host closed the connection]
19:58:54		Megame (Megame) joins
20:05:16		Stiletto joins
20:12:42		jacobk joins
20:54:31		qwertyasdfuiopghjkl joins
21:01:08		LeGoupil quits [Client Quit]
21:11:29		voltagex_ quits [Ping timeout: 244 seconds]
21:11:48		voltagex joins
21:15:09		Megame quits [Client Quit]
21:28:32		Iki quits [Ping timeout: 244 seconds]
21:34:14	<h2ibot>	Hosseinifard edited Deathwatch (+0, /* 2021 */ Add internal link for ZeroShell): https://wiki.archiveteam.org/?diff=47066&oldid=47065
21:48:39		Iki joins
22:06:50		Justin[home] is now known as DopefishJustin
22:07:52		DopefishJustin quits [Remote host closed the connection]
22:08:11		DopefishJustin joins
22:08:11		DopefishJustin is now authenticated as DopefishJustin
22:48:28		Arcorann (Arcorann) joins
23:43:07		fionera quits [.net .split]
23:44:39		fionera joins
23:45:04		fionera is now known as RJHacker63902
23:46:28		RJHacker63902 quits [Signing in (RJHacker63902)]
23:46:28		RJHacker63902 (Fionera) joins
23:56:50	<h2ibot>	Nicolas17v2 edited Discourse (-10, fix heading levels, fix some links (eg. CG…): https://wiki.archiveteam.org/?diff=47067&oldid=47063
