00:09:58Arcorann (Arcorann) joins
02:52:48fl0w_ joins
02:56:22fl0w quits [Ping timeout: 252 seconds]
04:16:43LegitSi joins
04:32:21<LegitSi>It's a crime that I never mentioned this earlier, but I might as well say it now. I am here to talk about StripGenerator, an old website that lasted from 2005-2020.
04:32:46<LegitSi>StripGenerator was a place where you could make comics rather easily, and as far as I'm told had a thriving community around 2012.
04:33:12<LegitSi>"as far as I'm told" because I was really late to the party on this, which is why I haven't talked about it in so long. It's a sin which I'm still repenting for.
04:34:01<LegitSi>Although it ran on Flash and was thus vulnerable to the end of Flash support, the website actually broke two months earlier than expected. SG had already been in decline for years (since 2015, I'm told) due to neglect from its creators.
04:34:36<LegitSi>The reason I'm even able to share this story today is that some SG members made their own copy of SG, with the characters and all, and I was the one that got the legal greenlight from the creators to do so.
04:35:07<LegitSi>That website, though, is heading toward an eerily similar outcome to SG's: its creator abandoned the site months ago and the problems are slowly starting to pile up.
04:36:52<LegitSi>Anyway, why am I here if the website shut down 3 years ago? Because it turns out, most of the 11 million or so strips have just been sitting on an Amazon AWS database all this time.
04:37:45<LegitSi>We knew this for a while, but only now (thanks to inactivity) have we realized that this database is quite vulnerable, and if it goes poof just like the original website did, there goes almost 15 years of history and 11 million strips.
04:38:07<LegitSi>The database isn't perfect. There's plenty of gaps, but it's the best we have. It only has the images as far as we know, so no metadata, which is a damn shame.
04:42:41<LegitSi>Knowledge of StripGenerator has mostly been lost to time, with not a whole lot of people talking about it today. Look it up, you'll get age-old results talking about how good it is, not people talking about what once was.
04:45:06<pokechu22>11 million URLs is a lot, but probably doable (it's probably too big for archivebot as one big job at least, but I'm pretty sure there are other tools that would work)
04:45:55<@JAA>Is it just an open S3 bucket or something more complicated?
04:49:17<LegitSi>I believe it's an open S3 bucket, but I'm not knowledgeable at all on that subject. Can I post a link of a strip as demonstration?
04:49:24<@JAA>Sure
04:49:24<pokechu22>Sure
04:49:40<@JAA>Get ninja'd! :-)
04:50:11<LegitSi>http://s3.amazonaws.com/stripgenerator/strip/50/72/19/00/00/full.png
04:50:31<LegitSi>This particular strip's ID is 912705. The path is the ID zero-padded to 10 digits, written in reverse, and split into two-digit folders.
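The mapping described here can be sketched in Python. This is only an illustration inferred from the one example URL above: the ten-digit zero padding is an assumption based on the five two-digit path segments, and the function names `strip_url` and `strip_id_from_url` are made up for this sketch.

```python
# Sketch of the ID <-> URL mapping inferred from the sample URL.
# Assumption: IDs are zero-padded to ten digits before being reversed.
BASE = "http://s3.amazonaws.com/stripgenerator/strip/"


def strip_url(strip_id: int, filename: str = "full.png") -> str:
    """Build the S3 URL for a strip ID (hypothetical helper)."""
    reversed_digits = f"{strip_id:010d}"[::-1]
    # Split the reversed digits into two-character path segments.
    pairs = [reversed_digits[i:i + 2] for i in range(0, 10, 2)]
    return BASE + "/".join(pairs) + "/" + filename


def strip_id_from_url(url: str) -> int:
    """Recover the strip ID from a URL built by strip_url."""
    if not url.startswith(BASE):
        raise ValueError("not a stripgenerator strip URL")
    segments = url[len(BASE):].split("/")[:5]
    # Joining the segments and reversing again undoes the encoding.
    return int("".join(segments)[::-1])


print(strip_url(912705))
```

Running it for 912705 should reproduce the link posted above, and `strip_id_from_url` goes the other way when all you have is an image URL.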
04:51:04<@JAA>Ok, it's not *open* open, but if the URLs can be generated, that's fine.
04:51:45<LegitSi>Yeah, in fact one of our members made a program to extract such in a presentable form, and then I modified the code a bit to make it easier. It's on Codepen, do you mind if I share a link?
04:52:32<LegitSi>This program has the ability to generate a random strip out of the 11.8 million, and while there are plenty of gaps, it's still a pretty good resource.
04:52:47<LegitSi>You can also search up any strip ID if you so please.
04:53:22<@JAA>Apart from full.png, there's also bribed.png and detail.png thumbnails.
04:53:42<@JAA>And thumb.png
04:53:42<pokechu22>OK, so it's about 11.8 million strips, but what's the highest valid ID?
04:53:50<LegitSi>Point is, I believe we can generate all of the possible URLs up to the last known strip ever made (#1180869). The first couple thousand or so we're unable to access, though.
04:53:50<pokechu22>And, sure, feel free to share any links you think are relevant
04:54:34<LegitSi>#1180869 is the latest strip on the last working Wayback Machine snapshot before the website's collapse in early November, and looking forward a while, I can't find any strip later than that.
04:54:52<pokechu22>http://s3.amazonaws.com/stripgenerator/strip/96/80/81/10/00/full.png being the last known one, and http://s3.amazonaws.com/stripgenerator/strip/07/80/81/10/00/full.png being the missing one after that? Assuming I've guessed the pattern correctly?
04:55:27<LegitSi>Yes, that one. There are gaps in the data, which can be anything from a deleted strip to one that failed to render (more common the later it is).
04:55:37<@JAA>Where do the 11 million come from if the highest ID is under 1.2 million?
04:56:03<pokechu22>Oh, that's a good point, 1180869 is 1,180,869
04:56:12<LegitSi>I dun messed up.
04:56:23<LegitSi>I may or may not have missed a digit, my bad.
04:56:38<LegitSi>1.1m is still quite a bit, at least in my book.
04:56:53<@JAA>Ah ok, so 6 million if we also grab all the thumbnails or 1.2 million for just the full strips. Easy.
04:57:31<LegitSi>Yeah. There's also the metadata, which as far as I know can only be recovered from the 1% of comics that got a Wayback Machine snapshot.
04:57:41<LegitSi>Unfortunately, I believe those might just be lost unless the database has a neat surprise for us.
04:57:50<pokechu22>Can you give an example of that metadata?
04:57:56<LegitSi>Metadata as in title, description, likes, comments, transcription, etc.
04:58:21<LegitSi>http://web.archive.org/web/20201029201327/http://stripgenerator.com/strip/912705/the-man-with-the-little-gold-statue
04:58:23<@JAA>It's entirely plausible they were just using S3 for hosting static files (like the rendered strips).
04:58:42<@JAA>So yeah, the metadata may be lost.
04:58:47<LegitSi>This is an archive of a StripGenerator webpage, you can see what metadata I'm referring to.
04:58:57<LegitSi>Unless I've been using the word metadata incorrectly all this time.
04:59:05<@JAA>(The metadata that wasn't captured at the time, that is.)
04:59:08<LegitSi>I'm a little scatterbrained, I spent 3 hours coding in two unfamiliar languages. Forgive me.
05:00:11<LegitSi>Honestly, I can settle for a 99% metadata loss if we can get all the strips into the Internet Archive.
05:00:22<pokechu22>Yeah, metadata is the right term for that, but if it was only hosted via http://stripgenerator.com/ itself and that's dead then it's probably gone
05:00:38<LegitSi>The metadata that has been archived happens to be from Wayback Machine snapshots, part of the Internet Archive.
05:01:01<LegitSi>So that's not going anywhere any time soon, unless it is, then we have way bigger problems.
05:01:02<@JAA>1.2 million images can easily be run through AB. It should even be quite fast.
05:01:11<pokechu22>"384,903 Members 1,064,245 Strips" at the top, too
05:01:14<@JAA>Would probably be done in like half a day or so.
05:01:43<pokechu22>There are also user thumbnails, e.g. http://s3.amazonaws.com/stripgenerator/avatar/49/46/22/00/1393980595/thumb.png but I'm not sure if there's a practical way to generate those
05:01:45<LegitSi>That's fantastic. I feel quite guilty not bringing it to your attention earlier when the website was near collapse, but better late than never I guess.
05:02:10<LegitSi>In any case, you guys know how to explore Amazon AWS databases so if there's anything we missed, please let me know.
05:02:45<@JAA>This isn't a database, it's object storage, and there's nothing to explore really since they're blocking access to the bucket listing.
05:03:04<@JAA>http://s3.amazonaws.com/stripgenerator/ would otherwise show a list of the first 1k objects in the bucket.
05:03:21<@JAA>(And then you can paginate to get all of them.)
05:03:31<@JAA>But competent operators will lock that down.
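For a bucket that does allow listing, the pagination mentioned above could look roughly like this in Python. It is a sketch, not something that works against the stripgenerator bucket (listing is blocked there). It assumes the original S3 list API, where a response holds up to 1,000 `Key` entries plus an `IsTruncated` flag and the next page is requested with a `marker` parameter set to the last key seen. The helper names and the sample XML are invented for illustration, and no network request is made.

```python
import urllib.parse
import xml.etree.ElementTree as ET
from typing import List, Tuple

# XML namespace used by S3 bucket listings.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def parse_listing(xml_text: str) -> Tuple[List[str], bool]:
    """Return (object keys, whether more pages exist) for one listing page."""
    root = ET.fromstring(xml_text)
    keys = [el.text or "" for el in root.findall("s3:Contents/s3:Key", S3_NS)]
    truncated = root.findtext("s3:IsTruncated", default="false", namespaces=S3_NS)
    return keys, truncated == "true"


def next_page_url(bucket_url: str, last_key: str) -> str:
    """URL of the page that starts after last_key (hypothetical helper)."""
    return bucket_url + "?" + urllib.parse.urlencode({"marker": last_key})


# Made-up listing page, shaped like an open bucket's response.
SAMPLE = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <IsTruncated>true</IsTruncated>
  <Contents><Key>a/one.png</Key></Contents>
  <Contents><Key>a/two.png</Key></Contents>
</ListBucketResult>"""

keys, more = parse_listing(SAMPLE)
if more:
    print(next_page_url("http://s3.amazonaws.com/example-bucket/", keys[-1]))
```

A real crawler would fetch each page, collect the keys, and repeat until `IsTruncated` comes back false.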
05:03:46<LegitSi>There is a timeline where I simply ask them to unlock it. When we were discussing making that SG clone website, I was in a Twitter conversation with the creators that led to us getting the legal greenlight so long as we didn't use the source code.
05:03:54<LegitSi>But that's not happening anytime soon.
05:03:59<pokechu22>e.g. http://s3.amazonaws.com/MinecraftDownload and http://s3.amazonaws.com/MinecraftDownload/resources/music/calm1.ogg
05:04:02<@JAA>It's how half of the 'this data was leaked!!1!' situations happen these days.
05:04:25<LegitSi>I've heard.
05:04:59<pokechu22>Looks like there's also metadata on https://web.archive.org/web/20200709114857/http://mrterrorkiller.stripgenerator.com/gallery - probably not as much as the strip's own page but it's still better than nothing
05:05:05<LegitSi>Something that might be way harder would be using the strip IDs to find every strip that has a snapshot and pulling the metadata from those.
05:05:36<pokechu22>https://archive.org/developers/wayback-cdx-server.html is the better way to find what snapshots exist
05:06:10<LegitSi>My basic search found that 100,000 or so links existed under /strip, which is where all the strips were hosted.
05:06:18<LegitSi>Although there's probably something I've missed.
05:08:25<LegitSi>The MIME-types count for http://stripgenerator.com/strip/ lists 332,444 captures, 127,573 URLs, and 104,659 new URLs under text/html, if that helps.
05:08:53<LegitSi>http://web.archive.org/details/http://stripgenerator.com/strip/ if you're curious as to where I got that from.
05:09:30BlueMaxima quits [Client Quit]
05:09:46<pokechu22>I've never used that one before - there's https://web.archive.org/*/http://stripgenerator.com/strip/* but by default it's limited to 10k pages (with https://archive.org/developers/wayback-cdx-server.html being the less-user-friendly way of getting more)
05:10:09<@JAA>I love shell utils. Simple one-liner to generate all the full.png URLs. :-)
05:10:21<@JAA>`seq 1180869 | xargs printf '%010d\n' | rev | sed 's,..,&/,g; s,^,https://s3.amazonaws.com/stripgenerator/strip/,; s,$,full.png,'`
05:10:58<pokechu22>Huh, TIL of rev - I've seen tac but not that
05:11:13<LegitSi>Did that one-liner take into account empty listings? Otherwise (according to the strip IDs and last snapshot), you've got about 100,000 blank entries.
05:11:30<LegitSi>1.18 million comics with an ID, with 1.06 million listed on the last snapshot.
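The "about 100,000" figure can be checked with the two numbers quoted in this conversation. This is just the arithmetic, and it assumes the "Strips" counter on the archived page reflects strips that still existed at the time. The counter and the highest ID come from different snapshots, so the result is approximate.

```python
# Figures quoted in the conversation above.
highest_known_id = 1_180_869   # last strip ID found in the final working snapshot
strips_in_counter = 1_064_245  # "Strips" counter shown on the archived page

# IDs in the range that the site counter does not account for.
missing = highest_known_id - strips_in_counter
share = missing / highest_known_id

print(f"{missing:,} IDs without a counted strip ({share:.1%} of the ID range)")
```

That works out to 116,624 IDs, a bit under a tenth of the range, which is in line with the rough estimate above.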
05:11:33<@JAA>It's simply all IDs from 1 through 1180869.
05:11:57<LegitSi>Got it. You might also have trouble with the first few thousand entries, we haven't found a way to get those.
05:12:22<@JAA>So it goes https://s3.amazonaws.com/stripgenerator/strip/10/00/00/00/00/full.png https://s3.amazonaws.com/stripgenerator/strip/20/00/00/00/00/full.png ... https://s3.amazonaws.com/stripgenerator/strip/86/80/81/10/00/full.png https://s3.amazonaws.com/stripgenerator/strip/96/80/81/10/00/full.png
05:12:27<pokechu22>If you're referring to e.g. http://s3.amazonaws.com/stripgenerator/strip/76/80/81/10/00/full.png then that would still be downloaded, but probably the fastest way to check if it exists is to just download it, so eh, nothing special needed there
05:12:36<LegitSi>Everything after 14133 should be good.
05:12:53<@JAA>Yeah, let's just run through those, not a concern.
05:13:03<LegitSi>Got it.
05:14:20<@JAA>It's in AB. It'll take a while to load the URL list, but then it'll go brrrrr. :-)
05:15:18<pokechu22>If you want to try to look for things with metadata, https://web.archive.org/cdx/search/cdx?url=http://stripgenerator.com/strip/1&output=json&matchType=prefix&collapse=urlkey / https://web.archive.org/cdx/search/cdx?url=http://stripgenerator.com/strip/2&output=json&matchType=prefix&collapse=urlkey / similar should be a good starting point I think
05:15:53<pokechu22>(again, there's documentation for that at https://archive.org/developers/wayback-cdx-server.html)
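A rough Python sketch of working with those CDX queries: it builds the same URLs as the links above and turns a JSON response into strip IDs. It assumes the `output=json` format is a list of rows whose first row is the field names, and that strip page URLs look like `/strip/<id>/<slug>` as in the archived page linked earlier. The helper names are made up, the sample response is a hand-written placeholder (its digest and length are not real), and nothing is fetched.

```python
import json
import re
import urllib.parse
from typing import Dict, List, Optional

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(id_prefix: str) -> str:
    """Build a prefix query for strip pages whose ID starts with id_prefix."""
    params = {
        "url": f"http://stripgenerator.com/strip/{id_prefix}",
        "output": "json",
        "matchType": "prefix",
        "collapse": "urlkey",
    }
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def parse_cdx_json(text: str) -> List[Dict[str, str]]:
    """Convert the header-plus-rows JSON into a list of dicts."""
    rows = json.loads(text) if text.strip() else []
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def strip_id(original_url: str) -> Optional[int]:
    """Pull the numeric strip ID out of a captured page URL, if present."""
    match = re.search(r"/strip/(\d+)", original_url)
    return int(match.group(1)) if match else None


# Placeholder response: only timestamp and original come from this log.
SAMPLE = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,stripgenerator)/strip/912705/the-man-with-the-little-gold-statue",
     "20201029201327",
     "http://stripgenerator.com/strip/912705/the-man-with-the-little-gold-statue",
     "text/html", "200", "PLACEHOLDERDIGEST", "0"],
])

for record in parse_cdx_json(SAMPLE):
    print(strip_id(record["original"]), record["timestamp"])
```

Looping `cdx_query_url` over the prefixes 1 through 9 would cover every strip ID, and the resulting IDs can then be matched against the image URLs.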
05:16:28<@JAA>You could also download a list of all known stripgenerator.com URLs to discover possible other sources of metadata (like the subdomains mentioned earlier).
05:16:47<@JAA>And don't trust the documentation too much. There are quirks and bugs.
05:16:48<LegitSi>Thanks.
05:17:36<@JAA>There's a script to help with this in my little-things repo, and I'd do this: `ia-cdx-search --concurrency 4 --tries 10 'url=stripgenerator.com&matchType=domain&collapse=urlkey&fl=original'`
05:18:04<@JAA>https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/ia-cdx-search
05:18:53<LegitSi>Thanks a lot. I'm gonna go to bed now and check on this when I get back, maybe even spend hours looking for the metadata.
05:19:52<LegitSi>Goodnight everyone.
05:20:00LegitSi quits [Remote host closed the connection]
05:25:02IDK (IDK) joins
05:25:21ell (ell) joins
06:04:06hackbug quits [Ping timeout: 252 seconds]
06:08:20sec^nd quits [Ping timeout: 276 seconds]
06:11:16sec^nd (second) joins
06:19:34umgr036 quits [Remote host closed the connection]
06:27:10umgr036 joins
07:34:14IDK quits [Client Quit]
07:41:08hitgrr8 joins
09:31:39fl0w_ quits [Remote host closed the connection]
09:31:39umgr036 quits [Remote host closed the connection]
09:31:49umgr036 joins
09:31:59fl0w_ joins
09:46:40Barto quits [Read error: Connection reset by peer]
09:55:42Barto (Barto) joins
10:43:35IDK (IDK) joins
11:35:01Ruthalas5 quits [Ping timeout: 252 seconds]
12:57:38hackbug (hackbug) joins
13:08:29<pabs>OrIdow6: re PopJam, maybe their site has some APIs that could be saved?
13:21:28Ketchup901 quits [Remote host closed the connection]
13:22:09Ketchup901 (Ketchup901) joins
13:49:13Arcorann quits [Ping timeout: 252 seconds]
14:01:02Ketchup902 (Ketchup901) joins
14:01:34Ketchup901 quits [Remote host closed the connection]
14:01:37Ketchup902 quits [Remote host closed the connection]
14:01:39Ketchup902_ (Ketchup901) joins
14:19:31treora quits [Remote host closed the connection]
14:19:32treora joins
14:19:47treora quits [Remote host closed the connection]
14:19:48treora joins
14:22:20adia quits [Quit: The Lounge - https://thelounge.chat]
14:22:45IDK quits [Client Quit]
14:23:59adia (adia) joins
14:38:19treora quits [Remote host closed the connection]
14:38:19treora joins
14:39:27treora quits [Remote host closed the connection]
14:39:28treora joins
14:39:41treora quits [Remote host closed the connection]
14:39:41treora joins
15:00:07treora quits [Remote host closed the connection]
15:00:08treora joins
16:22:59ell quits [Client Quit]
18:02:27TheTechRobo quits [Client Quit]
18:28:10treora quits [Remote host closed the connection]
18:28:12treora joins
18:30:38HackMii quits [Ping timeout: 276 seconds]
18:31:10HackMii (hacktheplanet) joins
18:38:20hackbug quits [Remote host closed the connection]
18:40:30hackbug (hackbug) joins
18:51:49treora quits [Remote host closed the connection]
18:51:49treora joins
18:51:50treora quits [Read error: Connection reset by peer]
18:51:53treora joins
18:51:59treora quits [Remote host closed the connection]
18:52:00treora joins
18:54:01katocala quits [Remote host closed the connection]
19:08:06<@OrIdow6>pabs: Yeah I assume so
19:08:43<@OrIdow6>I found some APIs in CDX (whose recency makes me wonder if it's someone trying to SPN the site, but I didn't check which item they were in)
19:10:07<@OrIdow6>And also tried to decompile their Android app, in which I'm fairly out of my depth
19:12:00<pokechu22>Hmm, something related I've noticed: after starting a job with archivebot, the page the job was started on also seems to get SPN'd, though I'm not sure if that's actually an automatic thing or just someone doing it
19:14:47<@OrIdow6>Yeah that's been going on a while
19:15:05<@OrIdow6>We tried to figure out who was doing it a while ago but I don't think that succeeded
19:18:15<@OrIdow6>Wonder if there was a gap in it when the dashboard URL was switched, that could help narrow down the method
19:22:51Ketchup902_ quits [Remote host closed the connection]
19:23:31Ketchup901 (Ketchup901) joins
19:38:28umgr036 quits [Remote host closed the connection]
19:41:45katocala joins
19:45:54umgr036 joins
20:19:32wyatt8740 quits [Ping timeout: 252 seconds]
20:21:22umgr036 quits [Ping timeout: 252 seconds]
20:28:54wyatt8740 joins
21:01:07Wingy quits [Quit: The Lounge - https://thelounge.chat]
21:09:24eroc1990 quits [Ping timeout: 252 seconds]
21:09:46eroc1990 (eroc1990) joins
21:14:14<@kaz>iirc it's not archivebot jobs, it's just _urls posted in #archivebot_ ?
21:14:40<@kaz>i would test that theory but I don't have a webserver to hand
21:26:27<@OrIdow6>I believe that was one thing we tried
21:26:53<@OrIdow6>And part of the difficulty was that many people's IRC clients fetched all URLs they saw
21:27:02<@OrIdow6>"we" = someone else while I watched
21:28:16Island joins
21:53:34<thuban>is that still going on or did it stop a while back?
22:22:29Ketchup901 quits [Remote host closed the connection]
22:22:57Ketchup901 (Ketchup901) joins
22:25:49tzt quits [Remote host closed the connection]
22:26:12tzt (tzt) joins
22:45:36BlueMaxima joins
22:55:24<@OrIdow6>Looking through my AB logs, it seems like it's still going on and indeed on all URLs in the chan, but only sometimes? SPN restrictions preventing them all from being saved, maybe?
22:56:28<@OrIdow6>Also, we had a Joaquinit visit on the 13th and nobody told me?
23:07:38<thuban>huh, dunno then. if it had stopped i was going to say it was me--i wrote an auto-spn script that initially watched my entire irc client. but i realized this was silly and restricted it some time ago (plus i'm rarely in #archivebot). perhaps someone has something similar?
23:19:57hitgrr8 quits [Client Quit]
23:44:44Arcorann (Arcorann) joins