00:03:24<nirv>appledash, I think it might finally be extracting the second set of "uppercase" files using this line
00:03:26<nirv>7zz -y x "/mnt/linux2tb/uppercaseextracted" -mmt=18 -o/mnt/linux5tb/uppercaseextracted2 > /mnt/linux5tb/uppercase2.log
00:10:19<SketchTheCow>Hooray, let's kill FOS again
00:17:08Niklink quits [Ping timeout: 244 seconds]
00:27:23Hifihedgehog joins
00:29:39<@OrIdow6>Still haven't found a way to map sequential IDs to images on radikal.ru, though I haven't expressly been looking for one
00:30:15<nirv>appledash, nevermind. here come the errors. it's fucked. https://cdn.discordapp.com/attachments/872443255281844236/949463752674263100/unknown.png
00:31:41<Hifihedgehog>Hello everyone! Thank you again for your tremendous efforts to preserve the Technology Guide sites. I have a request. As you are aware Russia is beginning to close off Internet to exterior sources such as Facebook. There is therefore valid concern within the Sonic the Hedgehog community that Sonic SCANF will eventually be inaccessible to
00:31:41<Hifihedgehog>non-Russian traffic. As such, this site will not necessarily cease to exist but the content might not ever be accessible again to the outside world. Would you be interested in gathering the site content? Many official, classic Sonic the Hedgehog media including artwork in print quality resolution can only be downloaded from this site. Here is the
00:31:42<Hifihedgehog>site: https://sonicscanf.org/
00:32:13<nirv>outta be widescreen
00:32:27<nicolas17>scanf is hosted in Russia?
00:32:45<Hifihedgehog>I haven't checked the IP yet.
00:32:47<nirv>sonic 1 NTSC-J has parallax scrolling in the clouds. NTSC-U does not
00:33:40<nicolas17>seems to be behind cloudflare
00:35:18<@OrIdow6>Doesn't look too big
00:36:18<nicolas17>images from the gallery may be the big part
00:36:38Terbium quits [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
00:37:17<nicolas17>URL structure in all subdomains is quite "clean"
00:37:29Niklink joins
00:38:06<Hifihedgehog>If you could, that'd be great! :D
00:38:15<nicolas17>for the forum you probably want to exclude the post permalinks (.../?do=findComment&comment=214035)
00:38:24sonick quits [Client Quit]
00:40:04<nicolas17>aside from ?do= URLs, probably one of the most "wget -r"-friendly forums I've seen :P
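[Editor's sketch] The exclusion suggested above (skipping forum URLs that carry a ?do= query parameter, such as post permalinks) could look roughly like the following. This is a minimal illustration, not the tooling anyone in the channel used; the function names and the forum.example.org host are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def is_action_url(url: str) -> bool:
    """Return True if the URL has a 'do' query parameter (e.g. ?do=findComment)."""
    query = urlsplit(url).query
    # keep_blank_values=True so that a bare "?do=" is still detected.
    return "do" in parse_qs(query, keep_blank_values=True)


def filter_crawl_urls(urls):
    """Keep only URLs without a 'do' parameter (hypothetical helper)."""
    return [u for u in urls if not is_action_url(u)]


if __name__ == "__main__":
    # Hypothetical example URLs; only the query-string shape comes from the log.
    candidates = [
        "https://forum.example.org/topic/1-test/",
        "https://forum.example.org/topic/1-test/?do=findComment&comment=214035",
        "https://forum.example.org/topic/1-test/?page=2",
    ]
    for url in filter_crawl_urls(candidates):
        print(url)
```

Parsing the query string rather than matching on the substring "do=" avoids accidentally dropping URLs whose other parameters merely end in "do".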
00:44:48<@OrIdow6>Luckily we won't be using bare wget for this
00:45:35<@OrIdow6>So yes Hifihedgehog, though I'm not the one to do it, presumably we can capture this site
00:45:35<nicolas17>yeah I'm sure you have better tools
00:46:15Hifihedgehog quits [Remote host closed the connection]
00:47:25<@OrIdow6>So is there a consensus yet on whether we consider all Russian sites at-risk? And whether that should be dealt with in the Ukraine channel
00:48:16<@OrIdow6>I told the above yes because the site looked small, and of course radikal.ru is shutting down in a normal manner, so I have had it easy
00:50:11<@JAA>Given that we're archiving all these things due to the situation in Ukraine, I think it makes most sense to keep it in that channel.
00:56:08qwertyasdfuiopghjkl quits [Remote host closed the connection]
00:57:54HP_Archivist (HP_Archivist) joins
01:04:40Niklink quits [Ping timeout: 244 seconds]
01:13:22<@OrIdow6>I suppose I'm generally in favor of keeping it there at the current level of activity
01:17:32HP_Archivist quits [Client Quit]
01:26:43michaelblob quits [Read error: Connection reset by peer]
01:28:56michaelblob (michaelblob) joins
02:02:34dm4v quits [Read error: Connection reset by peer]
02:02:56dm4v joins
02:02:59dm4v quits [Changing host]
02:02:59dm4v (dm4v) joins
02:09:13wyatt8750 joins
02:10:32wyatt8740 quits [Ping timeout: 265 seconds]
02:51:03Niklink joins
03:54:56Terbium joins
04:01:02Terbium quits [Client Quit]
04:01:42Terbium joins
04:20:34<@OrIdow6>So for radikal.ru I am thinking of starting with a design that's fairly liberal in what it gets (separate metadata request for each image etc) then constricting it as seems necessary
04:20:46<@OrIdow6>Close to an estimate for images by the way
04:21:52<thuban>OrIdow6: what are you using for discovery? the GetGalleryPage endpoint?
04:22:02<@OrIdow6>Yes thuban
04:22:34<@OrIdow6>Why?
04:22:43<thuban>< thuban> i'm a little concerned by the 'IsPublicImg' field in the results of the GetUserImgs endpoint used by the user page--it's sometimes false. any idea what this means? (will these images not show up in GetGalleryPage?)
04:23:37<thuban>i have no idea whether there's a way to enumerate users
04:26:37<@OrIdow6>Can you give an example of where it's false?
04:27:12<thuban>sure, one min
04:28:56<thuban>OrIdow6: https://radikal.ru/users/josif43#alb=kartinki&img=159871870
04:34:18<@OrIdow6>Thank you
04:34:24<@OrIdow6>Alright, so,
04:35:04<@OrIdow6>Vaguely looks like images are somewhere between 200 and 500 TiB total
04:36:01<nicolas17>wew
04:37:52<@OrIdow6>I don't know for videos yet
04:38:18<@OrIdow6>So I don't think IA will let us upload everything
04:42:52<@OrIdow6>Exploring some possibilities for this
04:43:21<@OrIdow6>With of course the big possibility being that this is a moot issue because the site doesn't have the capacity to serve everything in any case
04:44:58<@OrIdow6>If we replace some big images with thumbnails it might help
04:46:17<@OrIdow6>If we could magically coerce all above-average-size images to be exactly average, we could cut it down by about 68%
04:46:48<@OrIdow6>Although the further I push this data towards hypotheticals the less I trust it to be accurate
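[Editor's sketch] The hypothetical above (shrinking every above-average image to exactly the average size) is simple arithmetic over a list of file sizes. The function below shows the calculation on made-up numbers; it does not reproduce the 68% figure, which came from sample data not shown in the log, and the function name is an assumption.

```python
def fraction_saved_by_capping(sizes):
    """Fraction of total bytes saved if every above-average size is cut to the mean.

    `sizes` is a sequence of file sizes in any consistent unit.
    """
    if not sizes:
        return 0.0
    total = sum(sizes)
    if total == 0:
        return 0.0
    mean = total / len(sizes)
    # Files at or below the mean are unchanged; larger ones are capped at the mean.
    capped_total = sum(min(size, mean) for size in sizes)
    return 1 - capped_total / total


if __name__ == "__main__":
    # Made-up, heavily skewed sizes: four small files and one very large one.
    example_sizes = [1, 1, 1, 1, 96]
    print(f"{fraction_saved_by_capping(example_sizes):.0%} saved")
```

The more skewed the size distribution, the larger the saving, which is why a few very large files can dominate the total.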
04:48:05<thuban>you could use width/height as a proxy for filesize
04:48:59<@OrIdow6>Yes, presumably that's what we'd do
04:49:08<thuban>could even do a full run of the thumbnails first and then come back for the full images, prioritizing as appropriate
04:50:14<@OrIdow6>I don't know that there's much of a way to prioritize besides basic metadata
04:50:41<@OrIdow6>But that does sound like it could be a nice approach
04:51:35<thuban>i was thinking by size... but wasn't sure in which direction ('more images total' vs 'images most damaged by thumbnailing')
04:51:50<thuban>are we making any attempt at wbm playback or is it going to be just images/metadata?
04:52:59<@OrIdow6>Not really possible because it's POST everywhere
04:53:15<@OrIdow6>Since it's me I may make an attempt to make it nominally possible
04:53:20<thuban>i thought wbm did post?
04:53:42<@OrIdow6>But I am thinking that most of these images, people have linked them directly rather than the creaky preview pages this site uses
04:53:45<@OrIdow6>Since when?
04:54:04Niklink quits [Ping timeout: 244 seconds]
04:55:23<appledash>nirv: :(
04:55:37<appledash>My download is 92%
04:56:19<thuban>my bad, i must have misremembered
04:58:06<thuban>was thinking about the user pages, but if it's not going to work anyway...
04:59:22<@OrIdow6>On prioritization, your second option sounds the best to me, question is how reliable of an indicator size (itself determined indirectly via dimensions) is for loss due to thumbnailing
05:00:02<@OrIdow6>Will the big ones contain a bunch of text, or will they be needlessly (from our perspective) high-res iPhone photos?
05:01:02<thuban>highly unreliable, but i don't think anything better is available to us
05:01:56<@OrIdow6>The uploader by default claims to reduce images to 960 pixels, whatever dimension that's in
05:02:23<@OrIdow6>I was thinking we could deprioritize stuff above that threshold, or something? (Not exactly sure how that would work)
05:03:21<@OrIdow6>Or more broadly maybe there's a sweet spot between "needlessly big" and "no additional information content vs thumbnail"
05:04:24<@OrIdow6>My opinion now is to go with your first-stage-thumbnails idea, then make this decision once we have the data from that
05:05:21<thuban>sounds good.
05:06:48<thuban>are you using the -t thumbnails (180x149, with edge effect and some kind of caption) or the -x (160x120, same edge effect but no caption)?
05:07:03<@OrIdow6>Don't know
05:07:33<@OrIdow6>I mean, I'm not using anything in software yet, this is just talking
05:09:24Niklink joins
05:10:07<@OrIdow6>The responses call -t "PublicPrevUrl" and -x "InternalPrevUrl", don't see the caption you're talking about
05:10:30<thuban>https://i035.radikal.ru/0908/78/15f27a5b7f0ft.jpg ?
05:12:08<thuban>("Увеличить", 'increase, enlarge')
05:14:46<@OrIdow6>Looks like it's only on some of them? E.g. https://d.radikal.ru/d15/2203/5f/b71483b01175t.jpg
05:15:40<@OrIdow6>Ignoring size, disadvantage of -t is that they sometimes have the caption, advantage is that we may keep some image in forum posts etc.
05:15:53<thuban>huh. don't know what to make of that
05:16:01<@OrIdow6>*images
05:16:04<@OrIdow6>That's from https://radikal.ru/Img/ShowGallery#aid=6262837563&rnd=2&sm=true
05:17:11<@OrIdow6>(I like this image of the shutdown notice https://radikal.ru/Img/ShowGallery#aid=6262837568&rnd=4&sm=true )
05:20:00<thuban>(lol)
05:20:11<thuban>that one's much newer; perhaps they changed it at some point
05:20:15<@OrIdow6>An alternate way to prioritize could be ones associated with accounts, on the grounds that those people are more likely to be storing photos there as opposed to just posting them to share them a single time
05:21:05<thuban>oh, there are images without accounts? what's the proportion?
05:21:21<@OrIdow6>I can't read Russian, what's the difference?
05:21:24<@OrIdow6>Let me see
05:22:14<@OrIdow6>About 86% of images I found have no owner
05:22:41<@OrIdow6>Incidentally, a lot (~1/3) of the images I found seem broken too, haven't figured out why
05:23:41<@OrIdow6>I.e. they throw a 404 (I think) when you try to visit them
05:24:23<@OrIdow6>"Them" being the raw image file URLs; the preview page stutters for a bit then shows what I assume is some kind of error notice
05:24:36<thuban>deleted, i guess. (definitely 404s and not the 301s that were being discussed?)
05:25:03<thuban>oh, what's the error message say? dump it in the google
05:25:26<@OrIdow6>404s
05:27:12<@OrIdow6>The "error" is text in a placeholder it uses https://radikal.ru/Content/Images/Design/full-error.png
05:27:21<@OrIdow6>OCR is failing me
05:28:39<@OrIdow6>"Picture temporarily unavailable"
05:28:49<@OrIdow6>That's according to Google Translate
05:28:54tbc1887 (tbc1887) joins
05:29:30<@OrIdow6>You can see these browsing the gallery page by hand, but not at the rate I got them
05:31:07<thuban>'картинка временно недоступна'--yeah
05:32:57<thuban>i suppose the files themselves may or may not have a similar failure rate
05:33:38<thuban>did you get image id 3c4559309af041bc8608f430d83a750d, by the way?
05:34:19BlueMaxima quits [Read error: Connection reset by peer]
05:35:30<@OrIdow6>That's what I've been looking at
05:35:53<@OrIdow6>And not yet
05:37:44<thuban>if it turns out that not all images make it into the gallery (for whatever reason), we can extract author ids and use GetUserImgs to do a second round of discovery
05:38:04<@OrIdow6>So it looks like it's not in the main list
05:38:04<thuban>but if most files aren't associated with an account that's going to be quite limited :<
05:38:23<@OrIdow6>So you were right
05:38:41<@OrIdow6>I was going to probably do accounts anyway to be on the safe side
05:39:30<thuban>what about 4d805bfc3a0a4fc1a83c2cc20ec284fd?
05:40:07<@OrIdow6>What's that?
05:40:44<thuban>different image in the same album, but with IsPublicImg true
05:41:11<thuban>or, i guess, how many total images did you find?
05:42:45<@OrIdow6>I didn't check that but https://radikal.ru/users/josif43#alb=kartinki&img=293468357&rnd=2 fits that description and it is present in the main listing
05:43:03<@OrIdow6>I only took a sample of random 10-minute upload windows
05:43:20<thuban>oic
05:44:14<thuban>i was trying to get at whether the gallery endpoint omitted images for any other reasons
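[Editor's sketch] The log mentions sampling random 10-minute upload windows but does not say how the samples were turned into a site-wide estimate. One plausible approach, assumed here purely for illustration, is to multiply the mean bytes per sampled window by the number of such windows in the period of interest. All names and numbers below are hypothetical.

```python
from datetime import datetime, timedelta


def estimate_total(sample_totals, start, end, window=timedelta(minutes=10)):
    """Extrapolate a total from per-window sample totals.

    sample_totals: bytes (or image counts) observed in each sampled window.
    start, end:    bounds of the period being estimated.
    window:        length of each sampled window.
    """
    if not sample_totals:
        raise ValueError("need at least one sampled window")
    # Dividing one timedelta by another yields a float number of windows.
    window_count = (end - start) / window
    mean_per_window = sum(sample_totals) / len(sample_totals)
    return mean_per_window * window_count


if __name__ == "__main__":
    # Made-up figures: three sampled windows over a single day.
    total = estimate_total(
        [10, 20, 30],
        datetime(2020, 1, 1),
        datetime(2020, 1, 2),
    )
    print(total)
```

A uniform mean like this ignores any trend in upload volume over time, so a real estimate would likely need stratified samples across the site's history.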
05:57:17audy quits [Remote host closed the connection]
06:07:39nicolas17 quits [Client Quit]
06:25:56dvd quits [Read error: Connection reset by peer]
06:57:37Megame (Megame) joins
06:59:37Niklink quits [Ping timeout: 244 seconds]
07:24:29tzt quits [Remote host closed the connection]
07:24:53tzt (tzt) joins
07:48:18tbc1887 quits [Read error: Connection reset by peer]
08:25:23Megame quits [Client Quit]
08:32:44sonick (sonick) joins
08:41:25spirit quits [Quit: Leaving]
10:29:19Mayk quits [Ping timeout: 252 seconds]
10:32:17benjinsmith joins
10:34:39benjins quits [Ping timeout: 265 seconds]
11:31:46qwertyasdfuiopghjkl joins
12:09:06march_happy quits [Ping timeout: 244 seconds]
12:09:16march_happy (march_happy) joins
12:20:05alva joins
12:21:45<alva>Hi, does anyone have the up-to-date torrents to download the whole libgen collections?
12:22:25sonick quits [Client Quit]
12:29:13alva quits [Remote host closed the connection]
12:31:34sonick (sonick) joins
12:38:55Mayk78 joins
12:48:39shoghicp quits [Ping timeout: 252 seconds]
12:56:45pie_ quits [Ping timeout: 265 seconds]
12:58:59pie_ joins
12:59:05shoghicp (shoghicp) joins
13:07:23pie_ quits [Ping timeout: 265 seconds]
13:11:13pie_ joins
13:16:40benjinsmith is now known as benjins
13:31:04G4te_Keep3r quits [Ping timeout: 265 seconds]
13:34:52march_happy quits [Ping timeout: 244 seconds]
13:43:31march_happy (march_happy) joins
13:44:11Megame (Megame) joins
13:56:35G4te_Keep3r joins
14:01:39Niklink joins
14:02:29Arcorann quits [Ping timeout: 265 seconds]
14:14:01pawbs quits [Read error: Connection reset by peer]
14:14:29pawbs joins
14:48:55Niklink22 joins
14:49:16Niklink quits [Ping timeout: 244 seconds]
14:59:35lennier2 joins
15:01:27lennier1 quits [Ping timeout: 265 seconds]
15:01:36lennier2 is now known as lennier1
15:22:20Niklink22 quits [Ping timeout: 244 seconds]
15:28:42lennier2 joins
15:31:54lennier1 quits [Ping timeout: 265 seconds]
15:32:04lennier2 is now known as lennier1
15:41:41Niklink joins
16:33:34jtagcat6 quits [Quit: Bye!]
16:33:47jtagcat6 (jtagcat) joins
16:34:23jtagcat6 quits [Client Quit]
16:35:56jtagcat6 (jtagcat) joins
16:38:55hexa- quits [Quit: WeeChat 3.3]
16:40:17hexa- (hexa-) joins
16:41:10bleb quits [Remote host closed the connection]
17:13:06pie_ quits [Client Quit]
17:44:56h3ndr1k quits [Quit: ]
17:45:13cm joins
18:09:42pie_ joins
18:16:09hackbug quits [Remote host closed the connection]
18:16:48pie_ quits [Client Quit]
18:17:49pie_ joins
18:18:55hackbug (hackbug) joins
18:35:12<@OrIdow6>So preliminary estimate for the -t thumbnails is a total of... 8 TiB
18:35:23<@OrIdow6>On radikal.ru
18:35:59<@OrIdow6>As we determined last time, that doesn't seem to include some pictures in user galleries, but I expect those to be negligible
18:36:15<@OrIdow6>Going to change how I'm doing samples, may change this estimate
18:36:25<@OrIdow6>And of course I haven't gotten started on video yet
18:43:20CowboyBunny joins
18:46:46h3ndr1k (h3ndr1k) joins
18:48:29Niklink quits [Ping timeout: 244 seconds]
18:50:03CowboyBunny quits [Client Quit]
18:52:25Niklink joins
18:52:29Mateon1 quits [Ping timeout: 265 seconds]
18:52:42Mateon1 joins
19:01:32CowboyBunny joins
19:21:15nicolas17 joins
19:21:59dvd (dvd) joins
19:23:20Terbium quits [Client Quit]
19:27:17tzt quits [Ping timeout: 265 seconds]
19:32:06tzt (tzt) joins
19:32:25sonick quits [Client Quit]
19:34:15<@OrIdow6>Weirdness with the radikal.ru date search but I'm finding patterns in it
19:49:33<@arkiver>radikal.ru has been dying for a long time
19:49:41<@arkiver>OrIdow6: do you want to get a project up and running?
20:01:33<@OrIdow6>arkiver: I'd like to once I have something written, I have a free Saturday so hopefully that will be the case by the end of the day
20:01:39<@arkiver>perfect!
20:01:49<@arkiver>if you do, let's get a channel going for radikal.ru
20:01:53<@arkiver>i have no ideas on names
20:06:37<@OrIdow6>I don't know enough about chemistry to make a pun on "free radical"
20:07:10<@JAA>I was just thinking about that but can't come up with anything really. There isn't an antonym to it or similar.
20:10:55<@OrIdow6>Something related to "deradicalize", at risk of it being political?
20:28:48<@OrIdow6>So for the structure (and this is just pictures), I am thinking of dividing the images in the "gallery" into ranges of somewhere between 10 minutes and an hour, and using those as items; then each worker gets the list API data and thumbnail images in its range
20:29:14<@OrIdow6>Maybe doing something separate for user albums
20:29:45Niklink quits [Ping timeout: 244 seconds]
20:29:55<@OrIdow6>But the main thing about this is, there would only be one one-per-image request, which would be for the thumbnail
20:30:40<@OrIdow6>Since I'm worried about the site's ability to handle nearly a billion search queries in 5 days
20:31:57<@OrIdow6>Then this finishes, we can get the individual image IDs out of that data (after the fact WARC analysis or send it to somewhere on the workers) and decide what to do with them as discussed previously
20:32:06<@OrIdow6>*if this finishes
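[Editor's sketch] The item structure described above (splitting the gallery timeline into ranges of between 10 minutes and an hour, one item per range) can be illustrated with a small generator. This is only a sketch of the partitioning step under assumed names and dates; it says nothing about how the actual project encoded its items or queried the site.

```python
from datetime import datetime, timedelta


def window_items(start, end, window=timedelta(minutes=10)):
    """Yield (window_start, window_end) pairs covering the half-open range [start, end)."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    current = start
    while current < end:
        # The final window is truncated so it never runs past `end`.
        nxt = min(current + window, end)
        yield current, nxt
        current = nxt


if __name__ == "__main__":
    # Hypothetical 25-minute span, split into 10-minute items.
    for begin, finish in window_items(
        datetime(2022, 3, 1, 0, 0), datetime(2022, 3, 1, 0, 25)
    ):
        print(begin.isoformat(), finish.isoformat())
```

Using half-open ranges means adjacent items share a boundary without overlapping, so no upload time is assigned to two items.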
20:48:58<SketchTheCow>15:06 <@OrIdow6> I don't know enough about chemistry to make a pun on "free radical"
20:49:33<SketchTheCow>Freeradical is your best bet
20:59:00Terbium joins
20:59:58Terbium quits [Client Quit]
21:00:43Terbium joins
21:03:31gazorpazorp quits [Remote host closed the connection]
21:03:42gazorpazorp (gazorpazorp) joins
21:04:20<@arkiver>OrIdow6: JAA: shall we do #deradikalize
21:04:26<@arkiver>it is a nice one
21:05:07<@OrIdow6>arkiver: Well, Sketch The Cow said it's the best, so I think that makes it the best
21:16:54ThreeHM_ is now known as ThreeHM
21:29:40<@OrIdow6>I can't read
21:32:00<@OrIdow6>But people are already in the channel, so that'll be it
21:32:26Niklink joins
22:00:29<thuban>join #deradikalize
22:00:31<thuban>whoops
22:08:32wickedplayer494 quits [Remote host closed the connection]
22:12:31wickedplayer494 joins
22:28:42BlueMaxima joins
22:45:08michaelblob_ (michaelblob) joins
22:48:13michaelblob quits [Ping timeout: 244 seconds]
22:51:27fuzzy8021 quits [Ping timeout: 252 seconds]
23:35:00lennier2 joins
23:37:49lennier1 quits [Ping timeout: 244 seconds]
23:37:51lennier2 is now known as lennier1
23:41:01Megame quits [Client Quit]
23:50:55rszn quits [Remote host closed the connection]
23:53:02rsn joins