| 00:00:43 | | dm4v quits [Read error: Connection reset by peer] |
| 00:01:10 | | dm4v joins |
| 00:01:12 | | dm4v is now authenticated as dm4v |
| 00:01:12 | | dm4v quits [Changing host] |
| 00:01:12 | | dm4v (dm4v) joins |
| 00:07:44 | | nertzy quits [Client Quit] |
| 00:10:04 | | BlueMaxima joins |
| 00:14:01 | <@arkiver> | OrIdow6: if mods.io/mods is really everything they have, i'm not sure if it needs a highly customized project |
| 00:14:10 | <@arkiver> | perhaps something for JAA's tool? |
| 00:20:23 | <@arkiver> | OrIdow6: it would be good to try to simulate the browser calls for google drive, which i guess would include the multipart post request if possible |
| 00:22:18 | <@arkiver> | for overall size estimate, if you're talking about *everything*, we're probably looking at 100s of TBs |
| 00:22:28 | <@JAA> | Possibly re mods.io. I did mean to look into it more. The downloads are obfuscated in an annoying way. |
| 00:22:45 | <@arkiver> | and i'm not sure about storing 100s of TBs of Google Drive |
| 00:23:14 | <@arkiver> | for a size estimate you could always check sizes of random samples and make an estimate using the lists you have |
| 00:25:02 | <@arkiver> | we could also just finish the Google Drive project code, start it on some random items, and estimate size using those |
| 00:40:44 | | noguarantees joins |
| 00:47:06 | <@arkiver> | does anyone know who Janub is? |
| 01:03:03 | | dm4v_ joins |
| 01:03:56 | | jacobk quits [Ping timeout: 250 seconds] |
| 01:04:31 | | dm4v quits [Ping timeout: 252 seconds] |
| 01:04:31 | | dm4v_ is now known as dm4v |
| 01:04:31 | | dm4v is now authenticated as dm4v |
| 01:04:31 | | dm4v quits [Changing host] |
| 01:04:31 | | dm4v (dm4v) joins |
| 01:14:25 | | wyatt8740 quits [Ping timeout: 252 seconds] |
| 01:15:25 | | wyatt8740 joins |
| 01:21:00 | | wyatt8750 joins |
| 01:21:34 | | wyatt8740 quits [Ping timeout: 252 seconds] |
| 01:32:38 | <@OrIdow6> | arkiver: You're probably right on that; I don't think there should be much conflict between getting the POSTs that the browser gets and the GETs that make it accessible |
| 01:32:54 | <@OrIdow6> | Since the bulk of data from this proj will be files, and not those API calls |
| 01:34:23 | <@OrIdow6> | With the size estimate, one thing I am also looking to do is be able to come up with "if we only get files under 5 MB, total size will be XXX" sort of things |
| 01:35:08 | | lennier1 quits [Client Quit] |
| 01:35:23 | <@OrIdow6> | The problem with making an estimate from a basic sample is that items are not independent, folder items (planned) contain files |
| 01:35:57 | <@OrIdow6> | Maybe it would work to do a 2-stage project, where first we get all file metadata, then we decide on a threshold or whatever for which file content to get? |
| 01:36:59 | | lennier1 (lennier1) joins |
| 01:38:33 | <@OrIdow6> | Think I like that |
| 01:39:02 | <@OrIdow6> | Sort of does implement that idea of just surveying everything |
| 02:23:50 | | nertzy (nertzy) joins |
| 02:33:37 | | Stilett0 quits [Ping timeout: 252 seconds] |
| 02:40:13 | | nertzy quits [Client Quit] |
| 02:50:41 | | nertzy (nertzy) joins |
| 03:12:33 | | benjinsmith joins |
| 03:14:14 | | jacobk joins |
| 03:14:22 | | benjins quits [Ping timeout: 250 seconds] |
| 03:42:34 | | noguarantees quits [Remote host closed the connection] |
| 03:44:12 | | Barto quits [Ping timeout: 244 seconds] |
| 03:46:00 | | nertzy quits [Client Quit] |
| 03:53:44 | | qw3rty_ joins |
| 03:57:13 | | qw3rty__ quits [Ping timeout: 252 seconds] |
| 04:13:39 | | aleph quits [Ping timeout: 244 seconds] |
| 04:15:29 | | nertzy (nertzy) joins |
| 04:17:36 | | aleph joins |
| 04:18:28 | | nertzy_ joins |
| 04:21:32 | | nertzy quits [Ping timeout: 250 seconds] |
| 04:22:40 | | Barto (Barto) joins |
| 04:47:10 | <Ryz> | Nexus Mods are getting more and more heat lately, as I stumbled upon this video: https://www.youtube.com/watch?v=e1wZS0VO0c8 |
| 04:47:35 | <Ryz> | Anyone interested in trying to scrape content from it? |
| 04:48:04 | <Ryz> | It's kinda like Mods.io but huge and more complicated |
| 05:02:15 | | DogsRNice quits [Read error: Connection reset by peer] |
| 05:02:37 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 05:16:07 | | qwertyasdfuiopghjkl joins |
| 06:00:12 | | BlueMaxima_ joins |
| 06:03:42 | | BlueMaxima quits [Ping timeout: 244 seconds] |
| 06:10:16 | | nertzy_ quits [Client Quit] |
| 06:32:45 | | BlueMaxima__ joins |
| 06:36:44 | | BlueMaxima_ quits [Ping timeout: 250 seconds] |
| 06:40:00 | | nertzy_ joins |
| 06:42:27 | | qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds] |
| 06:55:17 | | nertzy_ quits [Client Quit] |
| 06:59:27 | | knecht420 quits [Read error: Connection reset by peer] |
| 06:59:31 | | knecht4206 (knecht420) joins |
| 07:00:21 | | knecht4206 quits [Client Quit] |
| 07:02:20 | | knecht4206 (knecht420) joins |
| 07:02:59 | | knecht4206 is now known as knecht420 |
| 07:28:12 | | Megame quits [Client Quit] |
| 07:37:11 | | BlueMaxima_ joins |
| 07:41:04 | | BlueMaxima__ quits [Ping timeout: 252 seconds] |
| 07:54:01 | | BlueMaxima__ joins |
| 07:57:46 | | BlueMaxima_ quits [Ping timeout: 250 seconds] |
| 08:09:56 | | BlueMaxima_ joins |
| 08:13:48 | | BlueMaxima__ quits [Ping timeout: 250 seconds] |
| 09:16:26 | | wessel1512 quits [Read error: Connection reset by peer] |
| 09:16:46 | | wessel1512 joins |
| 10:01:46 | | BlueMaxima_ quits [Client Quit] |
| 10:24:22 | | benjinsmith is now known as benjins |
| 10:24:24 | | benjins is now authenticated as benjins |
| 10:31:37 | | pabs quits [Remote host closed the connection] |
| 10:50:34 | | pabs (pabs) joins |
| 11:11:19 | <@arkiver> | OrIdow6: why not just implement folder support now as well and start the test with that? |
| 11:12:07 | <@arkiver> | i know folder support will discover new items, but overall (assuming the files in folders are not usually referenced on their own in the initial item list), this will still create a nice size estimate |
| 11:59:10 | | nertzy_ joins |
| 12:14:22 | | nertzy_ quits [Client Quit] |
| 12:25:18 | <@OrIdow6> | arkiver: If I'm understanding you right, that's roughly what I meant - get metadata of all files, including by expanding folders, and then get the estimate and decide on a threshold |
| 12:27:08 | <@OrIdow6> | I think it would be plausible to load all files and folders into the tracker, expand folders into files, and then take a random sample of the file items in the queue? |
| 14:21:17 | | Webuser470 joins |
| 14:23:48 | <Webuser470> | I'm sure it's been crawled a million times over but snopes.com is running a "save us" donation drive and implies in the FAQ that the site in its current form may disappear if goals aren't met. A bit vague but it might be worth having a quick look at if no one has mentioned it yet. https://www.snopes.com/sos/ |
| 14:34:42 | | Arcorann quits [Ping timeout: 250 seconds] |
| 14:51:06 | <@JAA> | Grabbing the mods.io downloads now. Shouldn't take long as there are only a couple hundred mods. The obfuscation of the download URL is actually just a simple cumulative xor. |
| 14:51:45 | <@JAA> | Not sure it'll work in the WBM though. |
| 15:10:09 | <@JAA> | Done |
| 15:28:42 | | qwertyasdfuiopghjkl joins |
| 15:52:14 | | atphoenix_ is now known as atphoenix |
| 16:18:03 | | Stiletto joins |
| 16:53:41 | <Ryz> | Oh, Gamasutra's gonna be revamped and renamed to Game Developer: https://gamasutra.com/view/news/387227/Gamasutra_is_becoming_Game_Developer.php S: |
| 16:54:11 | <Ryz> | They're gonna make the move on Thursday D: |
| 16:54:32 | <Ryz> | And transfer the stuff meanwhile~ |
| 16:55:09 | <Ryz> | Can someone look into this? This is most likely a huge website |
| 16:57:45 | <Frogging101> | I think it was put through archivebot a few years back |
| 16:58:04 | <Frogging101> | and IIRC it was not being actively updated at the time |
| 17:00:10 | <Ryz> | ...I looked at https://archive.fart.website/archivebot/viewer/domain/gamasutra.com - all of these are shallow grabs, "!ao" grabs |
| 17:02:47 | <Frogging101> | Hmm. I see. Thanks for checking. |
| 17:03:50 | <Frogging101> | There's this too, but I don't see any major grabs here either. https://archive.fart.website/archivebot/viewer/domain/www.gamasutra.com |
| 17:03:56 | <Frogging101> | Maybe I was mistaken or confused it with another website |
| 17:04:51 | <Ryz> | Sigh, unfortunately if this will be archived in its current state before the move, it won't get the comments because they changed how it was rendered a couple of years ago |
| 17:07:21 | <Frogging101> | The pages also load really slowly for some reason. The network activity finishes quickly but it takes seconds before content appears |
| 17:07:24 | <Frogging101> | maybe it's my browser, idk |
| 17:07:34 | <Ryz> | No, that's also a thing too... |
| 17:07:43 | <Ryz> | It used to really load fast~ |
| 17:07:49 | <Frogging101> | or they're doing... something onLoad() |
| 17:14:11 | <@JAA> | There's an opacity:0 rule on the main div that probably gets removed with JS. I hate that trend, and it's fucking everywhere... |
| 17:16:00 | <Ryz> | ...Ugh... |
| 17:16:40 | <@JAA> | Apart from the comments, at least that announcement page actually works okay, too... |
| 17:18:02 | <Ryz> | JAA, is there a way to extract offsite links while having --no-offsite-links on a job? I want to make sure hopefully Gamasutra finishes archiving before the switchover, and run the offsite goods in a separate section, since this is one of the main places for game developer stuff with low coverage on the more niche stuff |
| 17:18:02 | <@JAA> | (Without JS, after removing that rule, I mean.) |
| 17:19:20 | <@JAA> | Ryz: It's possible but not easy. I believe wpull still records offsite URLs in the DB even when that option is present, so that might work. Otherwise, the WARCs would have to be post-processed, e.g. with rewby's tooling. |
| 17:20:12 | <Ryz> | Option is present, as in when the option is available or used? |
| 17:20:19 | <@JAA> | Used |
| 17:23:31 | <Ryz> | Threw the thing onto AB, with --no-offsite-links |
| 17:28:17 | | jacobk quits [Ping timeout: 244 seconds] |
| 17:29:43 | | LeGoupil joins |
| 17:29:50 | | wyatt8750 quits [Ping timeout: 244 seconds] |
| 17:30:08 | | wyatt8740 joins |
| 17:30:39 | <Ryz> | I really really want the comments to be saved in some way properly, because right now they just load dynamically, as noted when checking the archived version on WBM... :C |
| 17:31:26 | <Ryz> | There's comments in the game maker pages like https://www.gamasutra.com/blogs/author/WilliamGrosso/997973/ - but the longer ones are truncated or squished~ |
| 17:35:24 | <rewby> | Ryz, JAA: I was facing a similar issue on another grab. I ended up doing the grab with heritrix3 and having it log the results from the scope check to disk. That way I had a list of "accept" and "reject" for each url it finds. And I could just filter for the rejected ones. |
| 17:36:02 | <@JAA> | You successfully used Heritrix? I heard it's a massive pain to set up, so congratulations. lol |
| 17:36:38 | <rewby> | It's not that bad tbh |
| 17:36:43 | | wyatt8740 quits [Ping timeout: 252 seconds] |
| 17:36:49 | <rewby> | You just have to wrap your head around it |
| 17:37:05 | | wyatt8740 joins |
| 17:37:26 | <@JAA> | But yeah, guess that'd be another way. wpull does record off-site URLs in the DB even when not fetching them, so I can get them that way. |
| 17:37:57 | <rewby> | Interesting. |
| 17:38:09 | <rewby> | Care to share the code you use for that? |
| 17:38:30 | <rewby> | Could be useful when I'm doing stuff with grab-site |
| 17:40:25 | <@JAA> | Haven't actually done it yet. But basically I'll run a little SQL query for all skipped URLs that aren't inline with a LIKE filter for the primary domain. Doing it for a job with multiple root URLs would be more complicated. |
| 17:41:35 | <@JAA> | The Gamasutra sitemap.xml is obviously very incomplete. That's a shame. |
| 17:54:18 | | tommyshinebox joins |
| 18:05:03 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 18:34:53 | | Viniter692 (Viniter) joins |
| 18:38:14 | | Viniter69 quits [Ping timeout: 250 seconds] |
| 18:38:14 | | Viniter692 is now known as Viniter69 |
| 18:56:46 | <h2ibot> | Entartet edited List of websites excluded from the Wayback Machine (+26, Added ummjackson.com.): https://wiki.archiveteam.org/?diff=47056&oldid=47037 |
| 18:57:43 | | @AlsoJAA quits [Ping timeout: 258 seconds] |
| 18:57:46 | <h2ibot> | Rewby edited ISP Hosting (-1): https://wiki.archiveteam.org/?diff=47057&oldid=46400 |
| 18:57:47 | <h2ibot> | Kyndigs uploaded File:Featured-content-itunes-u 2x.png: https://wiki.archiveteam.org/?title=File%3AFeatured-content-itunes-u%202x.png |
| 18:57:48 | <h2ibot> | Kyndigs edited ITunes U (+196, /* Archiving Status */): https://wiki.archiveteam.org/?diff=47060&oldid=46973 |
| 18:57:49 | | AlsoJAA (JAA) joins |
| 18:57:49 | | @ChanServ sets mode: +o AlsoJAA |
| 18:58:07 | | sknebel quits [Quit: No Ping reply in 180 seconds.] |
| 18:58:17 | <rewby> | Woo thanks for approving my edit |
| 18:59:29 | | sknebel (sknebel) joins |
| 18:59:47 | <h2ibot> | Wolfin edited Deathwatch (+167, /* 2021 */): https://wiki.archiveteam.org/?diff=47061&oldid=47036 |
| 19:03:47 | <h2ibot> | Nintendofan885 edited YouTube (+26, /* Removed or blocked channels */ Link to…): https://wiki.archiveteam.org/?diff=47062&oldid=47047 |
| 19:03:48 | <h2ibot> | Themadprogramer edited Discourse (+51, /* Active Discourses */): https://wiki.archiveteam.org/?diff=47063&oldid=47006 |
| 19:08:37 | | nertzy_ joins |
| 19:09:48 | <h2ibot> | JustAnotherArchivist edited Deathwatch (+163, /* 2021 */ Refs for Magic Legends and Tom…): https://wiki.archiveteam.org/?diff=47064&oldid=47061 |
| 19:10:51 | <@JAA> | MetroLyrics, one of the major song lyrics sites, vanished in late June. |
| 19:15:49 | <h2ibot> | JustAnotherArchivist edited Deathwatch (+121, /* 2021 */ MetroLyrics is ded): https://wiki.archiveteam.org/?diff=47065&oldid=47064 |
| 19:23:56 | | nertzy_ quits [Client Quit] |
| 19:27:12 | <nicolas17> | so how's progress with iTunes U? |
| 19:27:50 | <nicolas17> | I have ways to intercept iOS TLS traffic easily if that'd help get URLs |
| 19:29:42 | | swebb quits [Ping timeout: 244 seconds] |
| 19:39:27 | <nicolas17> | daaamn, it sent 24 cookies that I don't know the source of on the first request (probably my preexisting login from the appstore) |
| 19:53:10 | | Webuser470 quits [Remote host closed the connection] |
| 19:57:17 | | Stiletto quits [Remote host closed the connection] |
| 19:58:54 | | Megame (Megame) joins |
| 20:05:16 | | Stiletto joins |
| 20:12:42 | | jacobk joins |
| 20:54:31 | | qwertyasdfuiopghjkl joins |
| 21:01:08 | | LeGoupil quits [Client Quit] |
| 21:11:29 | | voltagex_ quits [Ping timeout: 244 seconds] |
| 21:11:48 | | voltagex joins |
| 21:15:09 | | Megame quits [Client Quit] |
| 21:28:32 | | Iki quits [Ping timeout: 244 seconds] |
| 21:34:14 | <h2ibot> | Hosseinifard edited Deathwatch (+0, /* 2021 */ Add internal link for ZeroShell): https://wiki.archiveteam.org/?diff=47066&oldid=47065 |
| 21:48:39 | | Iki joins |
| 22:06:50 | | Justin[home] is now known as DopefishJustin |
| 22:07:52 | | DopefishJustin quits [Remote host closed the connection] |
| 22:08:11 | | DopefishJustin joins |
| 22:08:11 | | DopefishJustin is now authenticated as DopefishJustin |
| 22:48:28 | | Arcorann (Arcorann) joins |
| 23:43:07 | | fionera quits [*.net *.split] |
| 23:44:39 | | fionera joins |
| 23:45:04 | | fionera is now known as RJHacker63902 |
| 23:46:28 | | RJHacker63902 quits [Signing in (RJHacker63902)] |
| 23:46:28 | | RJHacker63902 (Fionera) joins |
| 23:56:50 | <h2ibot> | Nicolas17v2 edited Discourse (-10, fix heading levels, fix some links (eg. CG…): https://wiki.archiveteam.org/?diff=47067&oldid=47063 |
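
The Google Drive size-estimate idea discussed earlier in this log (arkiver suggests checking sizes of random samples; OrIdow6 wants "if we only get files under 5 MB, total size will be XXX" figures) could look roughly like the sketch below. It is not project code: the function name, the sample sizes, and the queue count are all hypothetical, and it assumes the sample was drawn at random from file items after folders have been expanded, as OrIdow6 proposes.

```python
def estimate_total_size(sampled_sizes, total_files, max_size=None):
    """Extrapolate total bytes from a random sample of file sizes.

    sampled_sizes: sizes in bytes of randomly sampled file items (hypothetical data)
    total_files:   number of file items in the queue after folder expansion
    max_size:      optional cutoff in bytes; files at or above it count as skipped
    """
    if not sampled_sizes:
        raise ValueError("need at least one sampled size")
    if max_size is None:
        kept = sampled_sizes
    else:
        kept = [s for s in sampled_sizes if s < max_size]
    # Average bytes per sampled file, where skipped files contribute zero,
    # scaled up to the full number of file items.
    mean_per_file = sum(kept) / len(sampled_sizes)
    return mean_per_file * total_files


# Made-up sample: four file sizes in bytes, queue of 1000 file items.
sample = [1_000_000, 2_000_000, 10_000_000, 3_000_000]
everything = estimate_total_size(sample, total_files=1000)
under_5mb = estimate_total_size(sample, total_files=1000, max_size=5_000_000)
print(f"all files: {everything / 1e9:.1f} GB, under 5 MB only: {under_5mb / 1e9:.1f} GB")
```

Sampling file items rather than raw list entries matters here because, as OrIdow6 notes, folder items contain files, so entries in the initial list are not independent of each other.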