| 00:03:24 | <nirv> | appledash, I think it might finally be extracting the second set of "uppercase" files using this line |
| 00:03:26 | <nirv> | 7zz -y x "/mnt/linux2tb/uppercaseextracted" -mmt=18 -o/mnt/linux5tb/uppercaseextracted2 > /mnt/linux5tb/uppercase2.log |
| 00:10:19 | <SketchTheCow> | Hooray, let's kill FOS again |
| 00:17:08 | | Niklink quits [Ping timeout: 244 seconds] |
| 00:27:23 | | Hifihedgehog joins |
| 00:29:39 | <@OrIdow6> | Still haven't found a way to map sequential IDs to images on radikal.ru, though I haven't expressly been looking for one |
| 00:30:15 | <nirv> | appledash, nevermind. here come the errors. it's fucked. https://cdn.discordapp.com/attachments/872443255281844236/949463752674263100/unknown.png |
| 00:31:41 | <Hifihedgehog> | Hello everyone! Thank you again for your tremendous efforts to preserve the Technology Guide sites. I have a request. As you are aware Russia is beginning to close off Internet to exterior sources such as Facebook. There is therefore valid concern within the Sonic the Hedgehog community that Sonic SCANF will eventually be inaccessible to |
| 00:31:41 | <Hifihedgehog> | non-Russian traffic. As such, this site will not necessarily cease to exist but the content might not ever be accessible again to the outside world. Would you be interested in gathering the site content? Many official, classic Sonic the Hedgehog media including artwork in print quality resolution can only be downloaded from this site. Here is the |
| 00:31:42 | <Hifihedgehog> | site: https://sonicscanf.org/ |
| 00:32:13 | <nirv> | oughta be widescreen |
| 00:32:27 | <nicolas17> | scanf is hosted in Russia? |
| 00:32:45 | <Hifihedgehog> | I haven't checked the IP yet. |
| 00:32:47 | <nirv> | sonic 1 NTSC-J has parallax scrolling in the clouds. NTSC-U does not |
| 00:33:40 | <nicolas17> | seems to be behind cloudflare |
| 00:35:18 | <@OrIdow6> | Doesn't look too big |
| 00:36:18 | <nicolas17> | images from the gallery may be the big part |
| 00:36:38 | | Terbium quits [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.] |
| 00:37:17 | <nicolas17> | URL structure in all subdomains is quite "clean" |
| 00:37:29 | | Niklink joins |
| 00:38:06 | <Hifihedgehog> | If you could, that'd be great! :D |
| 00:38:15 | <nicolas17> | for the forum you probably want to exclude the post permalinks (.../?do=findComment&comment=214035) |
| 00:38:24 | | sonick quits [Client Quit] |
| 00:40:04 | <nicolas17> | aside from ?do= URLs, probably one of the most "wget -r"-friendly forums I've seen :P |
| 00:44:48 | <@OrIdow6> | Luckily we won't be using bare wget for this |
| 00:45:35 | <@OrIdow6> | So yes Hifihedgehog, though I'm not the one to do it presumably we can capture this site |
| 00:45:35 | <nicolas17> | yeah I'm sure you have better tools |
| 00:46:15 | | Hifihedgehog quits [Remote host closed the connection] |
| 00:47:25 | <@OrIdow6> | So is there a consensus yet on whether we consider all Russian sites at-risk? And whether that should be dealt with in the Ukraine channel |
| 00:48:16 | <@OrIdow6> | I told the above yes because the site looked small, and of course radikal.ru is shutting down in a normal manner, so I have had it easy |
| 00:50:11 | <@JAA> | Given that we're archiving all these things due to the situation in Ukraine, I think it makes most sense to keep it in that channel. |
| 00:56:08 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 00:57:54 | | HP_Archivist (HP_Archivist) joins |
| 01:04:40 | | Niklink quits [Ping timeout: 244 seconds] |
| 01:13:22 | <@OrIdow6> | I suppose I'm generally in favor of keeping it there at the current level of activity |
| 01:17:32 | | HP_Archivist quits [Client Quit] |
| 01:26:43 | | michaelblob quits [Read error: Connection reset by peer] |
| 01:28:56 | | michaelblob (michaelblob) joins |
| 02:02:34 | | dm4v quits [Read error: Connection reset by peer] |
| 02:02:56 | | dm4v joins |
| 02:02:59 | | dm4v is now authenticated as dm4v |
| 02:02:59 | | dm4v quits [Changing host] |
| 02:02:59 | | dm4v (dm4v) joins |
| 02:09:13 | | wyatt8750 joins |
| 02:10:32 | | wyatt8740 quits [Ping timeout: 265 seconds] |
| 02:51:03 | | Niklink joins |
| 03:54:56 | | Terbium joins |
| 04:01:02 | | Terbium quits [Client Quit] |
| 04:01:42 | | Terbium joins |
| 04:20:34 | <@OrIdow6> | So for radikal.ru I am thinking of starting with a design that's fairly liberal in what it gets (separate metadata request for each image etc) then constricting it as seems necessary |
| 04:20:46 | <@OrIdow6> | Close to an estimate for images by the way |
| 04:21:52 | <thuban> | OrIdow6: what are you using for discovery? the GetGalleryPage endpoint? |
| 04:22:02 | <@OrIdow6> | Yes thuban |
| 04:22:34 | <@OrIdow6> | Why? |
| 04:22:43 | <thuban> | < thuban> i'm a little concerned by the 'IsPublicImg' field in the results of the GetUserImgs endpoint used by the user page--it's sometimes false. any idea what this means? (will these images not show up in GetGalleryPage?) |
| 04:23:37 | <thuban> | i have no idea whether there's a way to enumerate users |
| 04:26:37 | <@OrIdow6> | Can you give an example of where it's false? |
| 04:27:12 | <thuban> | sure, one min |
| 04:28:56 | <thuban> | OrIdow6: https://radikal.ru/users/josif43#alb=kartinki&img=159871870 |
| 04:34:18 | <@OrIdow6> | Thank you |
| 04:34:24 | <@OrIdow6> | Alright, so, |
| 04:35:04 | <@OrIdow6> | Vaguely looks like images are somewhere between 200 and 500 TiB total |
| 04:36:01 | <nicolas17> | wew |
| 04:37:52 | <@OrIdow6> | I don't know for videos yet |
| 04:38:18 | <@OrIdow6> | So I don't think IA will let us upload everything |
| 04:42:52 | <@OrIdow6> | Exploring some possibilities for this |
| 04:43:21 | <@OrIdow6> | With of course the big possibility being that this is a moot issue because the site doesn't have the capacity to serve everything in any case |
| 04:44:58 | <@OrIdow6> | If we replace some big images with thumbnails it might help |
| 04:46:17 | <@OrIdow6> | If we could magically coerce all above-average-size images to be exactly average, it could cut the total down by about 68% |
| 04:46:48 | <@OrIdow6> | Although the further I push this data towards hypotheticals the less I trust it to be accurate |
| 04:48:05 | <thuban> | you could use width/height as a proxy for filesize |
| 04:48:59 | <@OrIdow6> | Yes, presumably that's what we'd do |
| 04:49:08 | <thuban> | could even do a full run of the thumbnails first and then come back for the full images, prioritizing as appropriate |
| 04:50:14 | <@OrIdow6> | I don't know that there's much of a way to prioritize besides basic metadata |
| 04:50:41 | <@OrIdow6> | But that does sound like it could be a nice approach |
| 04:51:35 | <thuban> | i was thinking by size... but wasn't sure in which direction ('more images total' vs 'images most damaged by thumbnailing') |
| 04:51:50 | <thuban> | are we making any attempt at wbm playback or is it going to be just images/metadata? |
| 04:52:59 | <@OrIdow6> | Not really possible because it's POST everywhere |
| 04:53:15 | <@OrIdow6> | Since it's me I may make an attempt to make it nominally possible |
| 04:53:20 | <thuban> | i thought wbm did post? |
| 04:53:42 | <@OrIdow6> | But I am thinking that most of these images, people have linked them directly rather than the creaky preview pages this site uses |
| 04:53:45 | <@OrIdow6> | Since when? |
| 04:54:04 | | Niklink quits [Ping timeout: 244 seconds] |
| 04:55:23 | <appledash> | nirv: :( |
| 04:55:37 | <appledash> | My download is 92% |
| 04:56:19 | <thuban> | my bad, i must have misremembered |
| 04:58:06 | <thuban> | was thinking about the user pages, but if it's not going to work anyway... |
| 04:59:22 | <@OrIdow6> | On prioritization, your second option sounds the best to me, question is how reliable of an indicator size (itself determined indirectly via dimensions) is for loss due to thumbnailing |
| 05:00:02 | <@OrIdow6> | Will the big ones contain a bunch of text, or will they be needlessly (from our perspective) high-res iPhone photos? |
| 05:01:02 | <thuban> | highly unreliable, but i don't think anything better is available to us |
| 05:01:56 | <@OrIdow6> | The uploader by default claims to reduce images to 960 pixels, whatever dimension that's in |
| 05:02:23 | <@OrIdow6> | I was thinking we could deprioritize stuff above that threshold, or something? (Not exactly sure how that would work) |
| 05:03:21 | <@OrIdow6> | Or more broadly maybe there's a sweet spot between "needlessly big" and "no additional information content vs thumbnail" |
| 05:04:24 | <@OrIdow6> | My opinion now is to go with your first-stage-thumbnails idea, then make this decision once we have the data from that |
| 05:05:21 | <thuban> | sounds good. |
| 05:06:48 | <thuban> | are you using the -t thumbnails (180x149, with edge effect and some kind of caption) or the -x (160x120, same edge effect but no caption)? |
| 05:07:03 | <@OrIdow6> | Don't know |
| 05:07:33 | <@OrIdow6> | I mean, I'm not using anything in software yet, this is just talking |
| 05:09:24 | | Niklink joins |
| 05:10:07 | <@OrIdow6> | The responses call -t "PublicPrevUrl" and -x "InternalPrevUrl", don't see the caption you're talking about |
| 05:10:30 | <thuban> | https://i035.radikal.ru/0908/78/15f27a5b7f0ft.jpg ? |
| 05:12:08 | <thuban> | ("Увеличить", 'increase, enlarge') |
| 05:14:46 | <@OrIdow6> | Looks like it's only on some of them? E.g. https://d.radikal.ru/d15/2203/5f/b71483b01175t.jpg |
| 05:15:40 | <@OrIdow6> | Ignoring size, disadvantage of -t is that they sometimes have the caption, advantage is that we may keep some image in forum posts etc. |
| 05:15:53 | <thuban> | huh. don't know what to make of that |
| 05:16:01 | <@OrIdow6> | *images |
| 05:16:04 | <@OrIdow6> | That's from https://radikal.ru/Img/ShowGallery#aid=6262837563&rnd=2&sm=true |
| 05:17:11 | <@OrIdow6> | (I like this image of the shutdown notice https://radikal.ru/Img/ShowGallery#aid=6262837568&rnd=4&sm=true ) |
| 05:20:00 | <thuban> | (lol) |
| 05:20:11 | <thuban> | that one's much newer; perhaps they changed it at some point |
| 05:20:15 | <@OrIdow6> | An alternate way to prioritize could be ones associated with accounts, on the grounds that those people are more likely to be storing photos there as opposed to just posting them to share them a single time |
| 05:21:05 | <thuban> | oh, there are images without accounts? what's the proportion? |
| 05:21:21 | <@OrIdow6> | I can't read Russian, what's the difference? |
| 05:21:24 | <@OrIdow6> | Let me see |
| 05:22:14 | <@OrIdow6> | About 86% of images I found have no owner |
| 05:22:41 | <@OrIdow6> | Incidentally, a lot (~1/3) of the images I found seem broken too, haven't figured out why |
| 05:23:41 | <@OrIdow6> | I.e. they throw a 404 (I think) when you try to visit them |
| 05:24:23 | <@OrIdow6> | "Them" being the raw image file URLs; the preview page stutters for a bit then shows what I assume is some kind of error notice |
| 05:24:36 | <thuban> | deleted, i guess. (definitely 404s and not the 301s that were being discussed?) |
| 05:25:03 | <thuban> | oh, what's the error message say? dump it in the google |
| 05:25:26 | <@OrIdow6> | 404s |
| 05:27:12 | <@OrIdow6> | The "error" is text in a placeholder it uses https://radikal.ru/Content/Images/Design/full-error.png |
| 05:27:21 | <@OrIdow6> | OCR is failing me |
| 05:28:39 | <@OrIdow6> | "Picture temporarily unavailable" |
| 05:28:49 | <@OrIdow6> | That's according to Google Translate |
| 05:28:54 | | tbc1887 (tbc1887) joins |
| 05:29:30 | <@OrIdow6> | You can see these browsing the gallery page by hand, but not at the rate I got them |
| 05:31:07 | <thuban> | 'картинка временно недоступна'--yeah |
| 05:32:57 | <thuban> | i suppose the files themselves may or may not have a similar failure rate |
| 05:33:38 | <thuban> | did you get image id 3c4559309af041bc8608f430d83a750d, by the way? |
| 05:34:19 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 05:35:30 | <@OrIdow6> | That's what I've been looking at |
| 05:35:53 | <@OrIdow6> | And not yet |
| 05:37:44 | <thuban> | if it turns out that not all images make it into the gallery (for whatever reason), we can extract author ids and use GetUserImgs to do a second round of discovery |
| 05:38:04 | <@OrIdow6> | So it looks like it's not in the main list |
| 05:38:04 | <thuban> | but if most files aren't associated with an account that's going to be quite limited :< |
| 05:38:23 | <@OrIdow6> | So you were right |
| 05:38:41 | <@OrIdow6> | I was going to probably do accounts anyway to be on the safe side |
| 05:39:30 | <thuban> | what about 4d805bfc3a0a4fc1a83c2cc20ec284fd? |
| 05:40:07 | <@OrIdow6> | What's that? |
| 05:40:44 | <thuban> | different image in the same album, but with IsPublicImg true |
| 05:41:11 | <thuban> | or, i guess, how many total images did you find? |
| 05:42:45 | <@OrIdow6> | I didn't check that but https://radikal.ru/users/josif43#alb=kartinki&img=293468357&rnd=2 fits that description and it is present in the main listing |
| 05:43:03 | <@OrIdow6> | I only took a sample of random 10-minute upload windows |
| 05:43:20 | <thuban> | oic |
| 05:44:14 | <thuban> | i was trying to get at whether the gallery endpoint omitted images for any other reasons |
| 05:57:17 | | audy quits [Remote host closed the connection] |
| 06:07:39 | | nicolas17 quits [Client Quit] |
| 06:25:56 | | dvd quits [Read error: Connection reset by peer] |
| 06:57:37 | | Megame (Megame) joins |
| 06:59:37 | | Niklink quits [Ping timeout: 244 seconds] |
| 07:24:29 | | tzt quits [Remote host closed the connection] |
| 07:24:53 | | tzt (tzt) joins |
| 07:48:18 | | tbc1887 quits [Read error: Connection reset by peer] |
| 08:25:23 | | Megame quits [Client Quit] |
| 08:32:44 | | sonick (sonick) joins |
| 08:41:25 | | spirit quits [Quit: Leaving] |
| 10:29:19 | | Mayk quits [Ping timeout: 252 seconds] |
| 10:32:17 | | benjinsmith joins |
| 10:34:39 | | benjins quits [Ping timeout: 265 seconds] |
| 11:31:46 | | qwertyasdfuiopghjkl joins |
| 12:09:06 | | march_happy quits [Ping timeout: 244 seconds] |
| 12:09:16 | | march_happy (march_happy) joins |
| 12:20:05 | | alva joins |
| 12:21:45 | <alva> | Hi, does anyone have up-to-date torrents for downloading the whole libgen collection? |
| 12:22:25 | | sonick quits [Client Quit] |
| 12:29:13 | | alva quits [Remote host closed the connection] |
| 12:31:34 | | sonick (sonick) joins |
| 12:38:55 | | Mayk78 joins |
| 12:48:39 | | shoghicp quits [Ping timeout: 252 seconds] |
| 12:56:45 | | pie_ quits [Ping timeout: 265 seconds] |
| 12:58:59 | | pie_ joins |
| 12:59:05 | | shoghicp (shoghicp) joins |
| 13:07:23 | | pie_ quits [Ping timeout: 265 seconds] |
| 13:11:13 | | pie_ joins |
| 13:16:40 | | benjinsmith is now known as benjins |
| 13:16:41 | | benjins is now authenticated as benjins |
| 13:31:04 | | G4te_Keep3r quits [Ping timeout: 265 seconds] |
| 13:34:52 | | march_happy quits [Ping timeout: 244 seconds] |
| 13:43:31 | | march_happy (march_happy) joins |
| 13:44:11 | | Megame (Megame) joins |
| 13:56:35 | | G4te_Keep3r joins |
| 14:01:39 | | Niklink joins |
| 14:02:29 | | Arcorann quits [Ping timeout: 265 seconds] |
| 14:14:01 | | pawbs quits [Read error: Connection reset by peer] |
| 14:14:29 | | pawbs joins |
| 14:48:55 | | Niklink22 joins |
| 14:49:16 | | Niklink quits [Ping timeout: 244 seconds] |
| 14:59:35 | | lennier2 joins |
| 15:01:27 | | lennier1 quits [Ping timeout: 265 seconds] |
| 15:01:36 | | lennier2 is now known as lennier1 |
| 15:22:20 | | Niklink22 quits [Ping timeout: 244 seconds] |
| 15:28:42 | | lennier2 joins |
| 15:31:54 | | lennier1 quits [Ping timeout: 265 seconds] |
| 15:32:04 | | lennier2 is now known as lennier1 |
| 15:41:41 | | Niklink joins |
| 16:33:34 | | jtagcat6 quits [Quit: Bye!] |
| 16:33:47 | | jtagcat6 (jtagcat) joins |
| 16:34:23 | | jtagcat6 quits [Client Quit] |
| 16:35:56 | | jtagcat6 (jtagcat) joins |
| 16:38:55 | | hexa- quits [Quit: WeeChat 3.3] |
| 16:40:17 | | hexa- (hexa-) joins |
| 16:41:10 | | bleb quits [Remote host closed the connection] |
| 17:13:06 | | pie_ quits [Client Quit] |
| 17:44:56 | | h3ndr1k quits [Quit: ] |
| 17:45:13 | | cm joins |
| 18:09:42 | | pie_ joins |
| 18:16:09 | | hackbug quits [Remote host closed the connection] |
| 18:16:48 | | pie_ quits [Client Quit] |
| 18:17:49 | | pie_ joins |
| 18:18:55 | | hackbug (hackbug) joins |
| 18:35:12 | <@OrIdow6> | So preliminary estimate for the -t thumbnails is a total of... 8 TiB |
| 18:35:23 | <@OrIdow6> | On radikal.ru |
| 18:35:59 | <@OrIdow6> | As we determined last time, that doesn't seem to include some pictures in user galleries, but I expect those to be negligible |
| 18:36:15 | <@OrIdow6> | Going to change how I'm doing samples, may change this estimate |
| 18:36:25 | <@OrIdow6> | And of course I haven't gotten started on video yet |
| 18:43:20 | | CowboyBunny joins |
| 18:46:46 | | h3ndr1k (h3ndr1k) joins |
| 18:48:29 | | Niklink quits [Ping timeout: 244 seconds] |
| 18:50:03 | | CowboyBunny quits [Client Quit] |
| 18:52:25 | | Niklink joins |
| 18:52:29 | | Mateon1 quits [Ping timeout: 265 seconds] |
| 18:52:42 | | Mateon1 joins |
| 19:01:32 | | CowboyBunny joins |
| 19:21:15 | | nicolas17 joins |
| 19:21:59 | | dvd (dvd) joins |
| 19:23:20 | | Terbium quits [Client Quit] |
| 19:27:17 | | tzt quits [Ping timeout: 265 seconds] |
| 19:32:06 | | tzt (tzt) joins |
| 19:32:25 | | sonick quits [Client Quit] |
| 19:34:15 | <@OrIdow6> | Weirdness with the radikal.ru date search but I'm finding patterns in it |
| 19:49:33 | <@arkiver> | radikal.ru has been dying for a long time |
| 19:49:41 | <@arkiver> | OrIdow6: do you want to get a project up and running? |
| 20:01:33 | <@OrIdow6> | arkiver: I'd like to once I have something written, I have a free Saturday so hopefully that will be the case by the end of the day |
| 20:01:39 | <@arkiver> | perfect! |
| 20:01:49 | <@arkiver> | if you do, let's get a channel going for radikal.ru |
| 20:01:53 | <@arkiver> | i have no ideas on names |
| 20:06:37 | <@OrIdow6> | I don't know enough about chemistry to make a pun on "free radical" |
| 20:07:10 | <@JAA> | I was just thinking about that but can't come up with anything really. There isn't an antonym to it or similar. |
| 20:10:55 | <@OrIdow6> | Something related to "deradicalize", at risk of it being political? |
| 20:28:48 | <@OrIdow6> | So for the structure (and this is just pictures), I am thinking of dividing the images in the "gallery" into ranges of somewhere between 10 minutes and an hour, and using those as items; then each worker gets the list API data and thumbnail images in its range |
| 20:29:14 | <@OrIdow6> | Maybe doing something separate for user albums |
| 20:29:45 | | Niklink quits [Ping timeout: 244 seconds] |
| 20:29:55 | <@OrIdow6> | But the main thing about this is, there would be only one per-image request, which would be for the thumbnail |
| 20:30:40 | <@OrIdow6> | Since I'm worried about the site's ability to handle nearly a billion search queries in 5 days |
| 20:31:57 | <@OrIdow6> | Then this finishes, we can get the individual image IDs out of that data (after the fact WARC analysis or send it to somewhere on the workers) and decide what to do with them as discussed previously |
| 20:32:06 | <@OrIdow6> | *if this finishes |
| 20:48:58 | <SketchTheCow> | 15:06 <@OrIdow6> I don't know enough about chemistry to make a pun on "free radical" |
| 20:49:33 | <SketchTheCow> | Freeradical is your best bet |
| 20:59:00 | | Terbium joins |
| 20:59:58 | | Terbium quits [Client Quit] |
| 21:00:43 | | Terbium joins |
| 21:03:31 | | gazorpazorp quits [Remote host closed the connection] |
| 21:03:42 | | gazorpazorp (gazorpazorp) joins |
| 21:04:20 | <@arkiver> | OrIdow6: JAA: shall we do #deradikalize |
| 21:04:26 | <@arkiver> | it is a nice one |
| 21:05:07 | <@OrIdow6> | arkiver: Well, Sketch The Cow said it's the best, so I think that makes it the best |
| 21:16:54 | | ThreeHM_ is now known as ThreeHM |
| 21:29:40 | <@OrIdow6> | I can't read |
| 21:32:00 | <@OrIdow6> | But people are already in the channel, so that'll be it |
| 21:32:26 | | Niklink joins |
| 22:00:29 | <thuban> | join #deradikalize |
| 22:00:31 | <thuban> | whoops |
| 22:08:32 | | wickedplayer494 quits [Remote host closed the connection] |
| 22:12:31 | | wickedplayer494 joins |
| 22:13:09 | | wickedplayer494 is now authenticated as wickedplayer494 |
| 22:28:42 | | BlueMaxima joins |
| 22:45:08 | | michaelblob_ (michaelblob) joins |
| 22:48:13 | | michaelblob quits [Ping timeout: 244 seconds] |
| 22:51:27 | | fuzzy8021 quits [Ping timeout: 252 seconds] |
| 23:35:00 | | lennier2 joins |
| 23:37:49 | | lennier1 quits [Ping timeout: 244 seconds] |
| 23:37:51 | | lennier2 is now known as lennier1 |
| 23:41:01 | | Megame quits [Client Quit] |
| 23:50:55 | | rszn quits [Remote host closed the connection] |
| 23:53:02 | | rsn joins |
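
The item structure OrIdow6 describes at 20:28:48 (splitting the gallery into upload-time ranges of between 10 minutes and an hour, one item per range) could be sketched roughly as follows. This is only an illustration, not anything from the actual project: the function names, the `gallery:` item-name format, and the example dates are all hypothetical, and nothing here talks to radikal.ru.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple

Window = Tuple[datetime, datetime]


def time_window_items(start: datetime, end: datetime, minutes: int = 10) -> Iterator[Window]:
    """Yield consecutive half-open [lo, hi) windows covering start..end.

    The last window is shortened if the span is not a multiple of `minutes`.
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    step = timedelta(minutes=minutes)
    cur = start
    while cur < end:
        nxt = min(cur + step, end)
        yield cur, nxt
        cur = nxt


def item_name(window: Window) -> str:
    """Build a hypothetical item name for one window (format is an assumption)."""
    lo, hi = window
    return f"gallery:{lo:%Y%m%dT%H%M}-{hi:%Y%m%dT%H%M}"


if __name__ == "__main__":
    # Example: one hour of uploads split into 10-minute items.
    for w in time_window_items(datetime(2022, 3, 1, 0, 0), datetime(2022, 3, 1, 1, 0), 10):
        print(item_name(w))
```

Each worker would then take one such item, fetch the list data for its window, and request only the thumbnails found there, which keeps the per-image load on the site to a single request as discussed at 20:29:55.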