00:00:43dm4v quits [Read error: Connection reset by peer]
00:01:10dm4v joins
00:01:12dm4v quits [Changing host]
00:01:12dm4v (dm4v) joins
00:07:44nertzy quits [Client Quit]
00:10:04BlueMaxima joins
00:14:01<@arkiver>OrIdow6: if mods.io/mods is really everything they have, i'm not sure it needs a highly customized project
00:14:10<@arkiver>perhaps something for JAA's tool?
00:20:23<@arkiver>OrIdow6: it would be good to try to simulate the browser calls for google drive, which i guess would include the multipart post request if possible
00:22:18<@arkiver>for overall size estimate, if you're talking about *everything*, we're probably looking at 100s of TBs
00:22:28<@JAA>Possibly re mods.io. I did mean to look into it more. The downloads are obfuscated in an annoying way.
00:22:45<@arkiver>and i'm not sure about storing 100s of TBs of Google Drive
00:23:14<@arkiver>for a size estimate you could always check sizes of random samples and make an estimate using the lists you have
00:25:02<@arkiver>we could also just finish the Google Drive project code, start it on some random items, and estimate size using those
00:40:44noguarantees joins
00:47:06<@arkiver>does anyone know who Janub is?
01:03:03dm4v_ joins
01:03:56jacobk quits [Ping timeout: 250 seconds]
01:04:31dm4v quits [Ping timeout: 252 seconds]
01:04:31dm4v_ is now known as dm4v
01:04:31dm4v quits [Changing host]
01:04:31dm4v (dm4v) joins
01:14:25wyatt8740 quits [Ping timeout: 252 seconds]
01:15:25wyatt8740 joins
01:21:00wyatt8750 joins
01:21:34wyatt8740 quits [Ping timeout: 252 seconds]
01:32:38<@OrIdow6>arkiver: You're probably right on that; I don't think there should be much conflict between getting the POSTs that the browser gets and the GETs that make it accessible
01:32:54<@OrIdow6>Since the bulk of data from this proj will be files, and not those API calls
01:34:23<@OrIdow6>With the size estimate, one thing I am also looking to do is be able to come up with "if we only get files under 5 MB, total size will be XXX" sort of things
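A minimal sketch of the kind of estimate discussed above (sample some sizes, extrapolate, optionally apply a size cutoff). It assumes the sample is drawn uniformly and independently from the file list, which may not hold for real items; the function name, the sample values, and the population count are all made up for illustration.

```python
def estimate_total_bytes(sample_sizes, population_count, max_size=None):
    """Extrapolate a total size from a simple random sample of file sizes.

    If max_size is given, sampled files at or above it count as 0 bytes,
    which models "only get files under max_size".
    """
    if not sample_sizes:
        raise ValueError("sample must not be empty")
    kept = [s if max_size is None or s < max_size else 0 for s in sample_sizes]
    # Mean bytes fetched per item, averaged over ALL sampled items.
    mean_kept = sum(kept) / len(sample_sizes)
    return mean_kept * population_count


# Hypothetical numbers, for illustration only.
MB = 1024 * 1024
sample = [1 * MB, 2 * MB, 4 * MB, 100 * MB]
everything = estimate_total_bytes(sample, population_count=1000)
under_5mb = estimate_total_bytes(sample, population_count=1000, max_size=5 * MB)
print(everything / MB, under_5mb / MB)
```

With a heavy-tailed size distribution, a small sample can miss the few huge files that dominate the total, so the unrestricted estimate is much noisier than the thresholded one.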
01:35:08lennier1 quits [Client Quit]
01:35:23<@OrIdow6>The problem with making an estimate from a basic sample is that items are not independent, folder items (planned) contain files
01:35:57<@OrIdow6>Maybe it would work to do a 2-stage project, where first we get all file metadata, then we decide on a threshold or whatever for which file content to get?
01:36:59lennier1 (lennier1) joins
01:38:33<@OrIdow6>Think I like that
01:39:02<@OrIdow6>Sort of does implement that idea of just surveying everything
02:23:50nertzy (nertzy) joins
02:33:37Stilett0 quits [Ping timeout: 252 seconds]
02:40:13nertzy quits [Client Quit]
02:50:41nertzy (nertzy) joins
03:12:33benjinsmith joins
03:14:14jacobk joins
03:14:22benjins quits [Ping timeout: 250 seconds]
03:42:34noguarantees quits [Remote host closed the connection]
03:44:12Barto quits [Ping timeout: 244 seconds]
03:46:00nertzy quits [Client Quit]
03:53:44qw3rty_ joins
03:57:13qw3rty__ quits [Ping timeout: 252 seconds]
04:13:39aleph quits [Ping timeout: 244 seconds]
04:15:29nertzy (nertzy) joins
04:17:36aleph joins
04:18:28nertzy_ joins
04:21:32nertzy quits [Ping timeout: 250 seconds]
04:22:40Barto (Barto) joins
04:47:10<Ryz>Nexus Mods are getting more and more heat lately, as I stumbled upon this video: https://www.youtube.com/watch?v=e1wZS0VO0c8
04:47:35<Ryz>Anyone interested in trying to scrape content from it?
04:48:04<Ryz>It's kinda like Mods.io but huge and more complicated
05:02:15DogsRNice quits [Read error: Connection reset by peer]
05:02:37qwertyasdfuiopghjkl quits [Client Quit]
05:16:07qwertyasdfuiopghjkl joins
06:00:12BlueMaxima_ joins
06:03:42BlueMaxima quits [Ping timeout: 244 seconds]
06:10:16nertzy_ quits [Client Quit]
06:32:45BlueMaxima__ joins
06:36:44BlueMaxima_ quits [Ping timeout: 250 seconds]
06:40:00nertzy_ joins
06:42:27qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
06:55:17nertzy_ quits [Client Quit]
06:59:27knecht420 quits [Read error: Connection reset by peer]
06:59:31knecht4206 (knecht420) joins
07:00:21knecht4206 quits [Client Quit]
07:02:20knecht4206 (knecht420) joins
07:02:59knecht4206 is now known as knecht420
07:28:12Megame quits [Client Quit]
07:37:11BlueMaxima_ joins
07:41:04BlueMaxima__ quits [Ping timeout: 252 seconds]
07:54:01BlueMaxima__ joins
07:57:46BlueMaxima_ quits [Ping timeout: 250 seconds]
08:09:56BlueMaxima_ joins
08:13:48BlueMaxima__ quits [Ping timeout: 250 seconds]
09:16:26wessel1512 quits [Read error: Connection reset by peer]
09:16:46wessel1512 joins
10:01:46BlueMaxima_ quits [Client Quit]
10:24:22benjinsmith is now known as benjins
10:31:37pabs quits [Remote host closed the connection]
10:50:34pabs (pabs) joins
11:11:19<@arkiver>OrIdow6: why not just implement folder support now as well and start the test with that?
11:12:07<@arkiver>i know folder support will discover new items, but overall (assuming the files in folders are not usually referenced on their own in the initial item list), this will still create a nice size estimate
11:59:10nertzy_ joins
12:14:22nertzy_ quits [Client Quit]
12:25:18<@OrIdow6>arkiver: If I'm understanding you right, that's roughly what I meant - get metadata of all files, including by expanding folders, and then get the estimate and decide on a threshold
12:27:08<@OrIdow6>I think it would be plausible to load all files and folders into the tracker, expand folders into files, and then take a random sample of the file items in the queue?
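A rough sketch of the "expand folders into files, then sample" idea from the message above. The folder mapping, item names, and the `expand` helper are hypothetical stand-ins; nothing here reflects the actual tracker or Google Drive project code, and it assumes folders contain no cycles.

```python
import random


def expand(item, folders):
    """Yield file IDs for an item, recursing into folders (assumes no cycles)."""
    if item in folders:
        for child in folders[item]:
            yield from expand(child, folders)
    else:
        yield item


# Hypothetical data: folder ID -> children (files or other folders).
folders = {"folderA": ["file1", "file2", "folderB"], "folderB": ["file3"]}
initial_items = ["folderA", "file4", "file1"]

# Deduplicate, since a file can be listed on its own and inside a folder.
files = sorted({f for item in initial_items for f in expand(item, folders)})

# Seeded RNG only so the sketch is repeatable.
sample = random.Random(0).sample(files, 2)
print(files, sample)
```

Sampling after expansion and deduplication gives each file the same chance of being picked, which is what a per-file size estimate needs.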
14:21:17Webuser470 joins
14:23:48<Webuser470>I'm sure it's been crawled a million times over, but snopes.com is running a "save us" donation drive and implies in the FAQ that the site in its current form may disappear if goals aren't met. A bit vague, but it might be worth a quick look if no one has mentioned it yet. https://www.snopes.com/sos/
14:34:42Arcorann quits [Ping timeout: 250 seconds]
14:51:06<@JAA>Grabbing the mods.io downloads now. Shouldn't take long as there are only a couple hundred mods. The obfuscation of the download URL is actually just a simple cumulative xor.
14:51:45<@JAA>Not sure it'll work in the WBM though.
15:10:09<@JAA>Done
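For readers unfamiliar with the term, one possible reading of "cumulative xor" is sketched below: each output byte is the XOR of all input bytes up to that position, and decoding XORs each byte with the previous obfuscated byte. This is an assumption about what the phrase means in general; the log does not describe the actual mods.io scheme, and the function names and example URL are invented.

```python
def cumulative_xor_encode(data: bytes) -> bytes:
    """out[i] = data[0] ^ data[1] ^ ... ^ data[i]"""
    out = bytearray()
    acc = 0
    for b in data:
        acc ^= b
        out.append(acc)
    return bytes(out)


def cumulative_xor_decode(obfuscated: bytes) -> bytes:
    """Inverse of the above: data[i] = out[i] ^ out[i-1], with out[-1] = 0."""
    out = bytearray()
    prev = 0
    for b in obfuscated:
        out.append(b ^ prev)
        prev = b
    return bytes(out)


# Hypothetical URL, for illustration only.
url = b"https://example.invalid/download/123"
roundtrip = cumulative_xor_decode(cumulative_xor_encode(url))
print(roundtrip == url)
```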
15:28:42qwertyasdfuiopghjkl joins
15:52:14atphoenix_ is now known as atphoenix
16:18:03Stiletto joins
16:53:41<Ryz>Oh, Gamasutra's gonna be revamped and renamed to Game Developer: https://gamasutra.com/view/news/387227/Gamasutra_is_becoming_Game_Developer.php S:
16:54:11<Ryz>They're gonna make the move on Thursday D:
16:54:32<Ryz>And transfer the stuff meanwhile~
16:55:09<Ryz>Can someone look into this? This is most likely a huge website
16:57:45<Frogging101>I think it was put through archivebot a few years back
16:58:04<Frogging101>and IIRC it was not being actively updated at the time
17:00:10<Ryz>...I looked at https://archive.fart.website/archivebot/viewer/domain/gamasutra.com - all of these are shallow grabs, "!ao" grabs
17:02:47<Frogging101>Hmm. I see. Thanks for checking.
17:03:50<Frogging101>There's this too, but I don't see any major grabs here either. https://archive.fart.website/archivebot/viewer/domain/www.gamasutra.com
17:03:56<Frogging101>Maybe I was mistaken or confused it with another website
17:04:51<Ryz>Sigh, unfortunately if this will be archived in its current state before the move, it won't get the comments because they changed how it was rendered a couple of years ago
17:07:21<Frogging101>The pages also load really slowly for some reason. The network activity finishes quickly but it takes seconds before content appears
17:07:24<Frogging101>maybe it's my browser, idk
17:07:34<Ryz>No, that's also a thing too...
17:07:43<Ryz>It used to really load fast~
17:07:49<Frogging101>or they're doing... something onLoad()
17:14:11<@JAA>There's an opacity:0 rule on the main div that probably gets removed with JS. I hate that trend, and it's fucking everywhere...
17:16:00<Ryz>...Ugh...
17:16:40<@JAA>Apart from the comments, at least that announcement page actually works okay, too...
17:18:02<Ryz>JAA, is there a way to extract offsite links while having --no-offsite-links on a job? I want to make sure hopefully Gamasutra finishes archiving before the switchover, and run the offsite goods in a separate section, since this is one of the main places for game developer stuff with low coverage on the more niche stuff
17:18:02<@JAA>(Without JS, after removing that rule, I mean.)
17:19:20<@JAA>Ryz: It's possible but not easy. I believe wpull still records offsite URLs in the DB even when that option is present, so that might work. Otherwise, the WARCs would have to be post-processed, e.g. with rewby's tooling.
17:20:12<Ryz>Option is present, as in when the option is available or used?
17:20:19<@JAA>Used
17:23:31<Ryz>Threw the thing onto AB, with --no-offsite-links
17:28:17jacobk quits [Ping timeout: 244 seconds]
17:29:43LeGoupil joins
17:29:50wyatt8750 quits [Ping timeout: 244 seconds]
17:30:08wyatt8740 joins
17:30:39<Ryz>I really really want the comments to be saved in some way properly, because right now they just load dynamically, as noted when checking the archived version on WBM... :C
17:31:26<Ryz>There's comments in the game maker pages like https://www.gamasutra.com/blogs/author/WilliamGrosso/997973/ - but the longer ones are truncated or squished~
17:35:24<rewby>Ryz, JAA: I was facing a similar issue on another grab. I ended up doing the grab with heritrix3 and having it log the results from the scope check to disk. That way I had a list of "accept" and "reject" for each url it finds. And I could just filter for the rejected ones.
17:36:02<@JAA>You successfully used Heritrix? I heard it's a massive pain to set up, so congratulations. lol
17:36:38<rewby>It's not that bad tbh
17:36:43wyatt8740 quits [Ping timeout: 252 seconds]
17:36:49<rewby>You just have to wrap your head around it
17:37:05wyatt8740 joins
17:37:26<@JAA>But yeah, guess that'd be another way. wpull does record off-site URLs in the DB even when not fetching them, so I can get them that way.
17:37:57<rewby>Interesting.
17:38:09<rewby>Care to share the code you use for that?
17:38:30<rewby>Could be useful when I'm doing stuff with grab-site
17:40:25<@JAA>Haven't actually done it yet. But basically I'll run a little SQL query for all skipped URLs that aren't inline with a LIKE filter for the primary domain. Doing it for a job with multiple root URLs would be more complicated.
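A sketch of the kind of query described above, run against a toy in-memory SQLite table. The table and column names (`urls`, `status`, `is_inline`) are invented for illustration and are NOT wpull's real schema; check the actual database layout before adapting this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema for illustration; this is not wpull's actual table layout.
conn.execute("CREATE TABLE urls (url TEXT, status TEXT, is_inline INTEGER)")
conn.executemany(
    "INSERT INTO urls VALUES (?, ?, ?)",
    [
        ("https://www.gamasutra.com/view/news/1", "done", 0),
        ("https://www.gamasutra.com/view/news/2", "skipped", 0),
        ("https://example.org/article", "skipped", 0),
        ("https://cdn.example.net/logo.png", "skipped", 1),
        ("https://example.com/post", "skipped", 0),
    ],
)


def offsite_skipped_urls(conn, domain):
    """Return skipped, non-inline URLs whose host is not the primary domain."""
    rows = conn.execute(
        """
        SELECT url FROM urls
        WHERE status = 'skipped'
          AND is_inline = 0
          AND url NOT LIKE ?   -- bare domain
          AND url NOT LIKE ?   -- any subdomain
        ORDER BY url
        """,
        (f"%://{domain}/%", f"%://%.{domain}/%"),
    )
    return [row[0] for row in rows]


offsite = offsite_skipped_urls(conn, "gamasutra.com")
print(offsite)
```

The LIKE patterns are crude: they would also drop an off-site URL that happens to embed the primary domain in its query string. With multiple root URLs, one pair of patterns per root domain would be needed.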
17:41:35<@JAA>The Gamasutra sitemap.xml is obviously very incomplete. That's a shame.
17:54:18tommyshinebox joins
18:05:03qwertyasdfuiopghjkl quits [Client Quit]
18:34:53Viniter692 (Viniter) joins
18:38:14Viniter69 quits [Ping timeout: 250 seconds]
18:38:14Viniter692 is now known as Viniter69
18:56:46<h2ibot>Entartet edited List of websites excluded from the Wayback Machine (+26, Added ummjackson.com.): https://wiki.archiveteam.org/?diff=47056&oldid=47037
18:57:43@AlsoJAA quits [Ping timeout: 258 seconds]
18:57:46<h2ibot>Rewby edited ISP Hosting (-1): https://wiki.archiveteam.org/?diff=47057&oldid=46400
18:57:47<h2ibot>Kyndigs uploaded File:Featured-content-itunes-u 2x.png: https://wiki.archiveteam.org/?title=File%3AFeatured-content-itunes-u%202x.png
18:57:48<h2ibot>Kyndigs edited ITunes U (+196, /* Archiving Status */): https://wiki.archiveteam.org/?diff=47060&oldid=46973
18:57:49AlsoJAA (JAA) joins
18:57:49@ChanServ sets mode: +o AlsoJAA
18:58:07sknebel quits [Quit: No Ping reply in 180 seconds.]
18:58:17<rewby>Woo thanks for approving my edit
18:59:29sknebel (sknebel) joins
18:59:47<h2ibot>Wolfin edited Deathwatch (+167, /* 2021 */): https://wiki.archiveteam.org/?diff=47061&oldid=47036
19:03:47<h2ibot>Nintendofan885 edited YouTube (+26, /* Removed or blocked channels */ Link to…): https://wiki.archiveteam.org/?diff=47062&oldid=47047
19:03:48<h2ibot>Themadprogramer edited Discourse (+51, /* Active Discourses */): https://wiki.archiveteam.org/?diff=47063&oldid=47006
19:08:37nertzy_ joins
19:09:48<h2ibot>JustAnotherArchivist edited Deathwatch (+163, /* 2021 */ Refs for Magic Legends and Tom…): https://wiki.archiveteam.org/?diff=47064&oldid=47061
19:10:51<@JAA>MetroLyrics, one of the major song lyrics sites, vanished in late June.
19:15:49<h2ibot>JustAnotherArchivist edited Deathwatch (+121, /* 2021 */ MetroLyrics is ded): https://wiki.archiveteam.org/?diff=47065&oldid=47064
19:23:56nertzy_ quits [Client Quit]
19:27:12<nicolas17>so how's progress with iTunes U?
19:27:50<nicolas17>I have ways to intercept iOS TLS traffic easily if that'd help get URLs
19:29:42swebb quits [Ping timeout: 244 seconds]
19:39:27<nicolas17>daaamn, it sent 24 cookies that I don't know the source of on the first request (probably my preexisting login from the appstore)
19:53:10Webuser470 quits [Remote host closed the connection]
19:57:17Stiletto quits [Remote host closed the connection]
19:58:54Megame (Megame) joins
20:05:16Stiletto joins
20:12:42jacobk joins
20:54:31qwertyasdfuiopghjkl joins
21:01:08LeGoupil quits [Client Quit]
21:11:29voltagex_ quits [Ping timeout: 244 seconds]
21:11:48voltagex joins
21:15:09Megame quits [Client Quit]
21:28:32Iki quits [Ping timeout: 244 seconds]
21:34:14<h2ibot>Hosseinifard edited Deathwatch (+0, /* 2021 */ Add internal link for ZeroShell): https://wiki.archiveteam.org/?diff=47066&oldid=47065
21:48:39Iki joins
22:06:50Justin[home] is now known as DopefishJustin
22:07:52DopefishJustin quits [Remote host closed the connection]
22:08:11DopefishJustin joins
22:48:28Arcorann (Arcorann) joins
23:43:07fionera quits [*.net *.split]
23:44:39fionera joins
23:45:04fionera is now known as RJHacker63902
23:46:28RJHacker63902 quits [Signing in (RJHacker63902)]
23:46:28RJHacker63902 (Fionera) joins
23:56:50<h2ibot>Nicolas17v2 edited Discourse (-10, fix heading levels, fix some links (eg. CG…): https://wiki.archiveteam.org/?diff=47067&oldid=47063