| 00:09:58 | | Arcorann (Arcorann) joins |
| 02:52:48 | | fl0w_ joins |
| 02:56:22 | | fl0w quits [Ping timeout: 252 seconds] |
| 04:16:43 | | LegitSi joins |
| 04:32:21 | <LegitSi> | It's a crime that I never mentioned this earlier, but I might as well say it now. I am here to talk about StripGenerator, an old website that lasted from 2005-2020. |
| 04:32:46 | <LegitSi> | StripGenerator was a place where you could make comics rather easily, and as far as I'm told had a thriving community around 2012. |
| 04:33:12 | <LegitSi> | "as far as I'm told" because I was really late to the party on this, which is why I haven't talked about it in so long. It's a sin which I'm still repenting for. |
| 04:34:01 | <LegitSi> | Although it operated on Flash and was thus vulnerable to the Flash Crash, the website actually broke two months earlier than expected. This is because SG had already been in decline for years (ever since 2015, I'm told) due to neglect from its creators. |
| 04:34:36 | <LegitSi> | The reason I'm even able to share this story today is that some SG members made their own copy of SG, with the characters and all, and I was the one that got the legal greenlight from the creators to do so. |
| 04:35:07 | <LegitSi> | Although that website too is experiencing an eerily similar outcome to what SG had, its creator having abandoned the site months ago and the problems slowly starting to pile up. |
| 04:36:52 | <LegitSi> | Anyway, why am I here if the website shut down 3 years ago? Because it turns out, most of the 11 million or so strips have just been sitting on an Amazon AWS database all this time. |
| 04:37:45 | <LegitSi> | We knew this for a while, but only now (thanks to inactivity) have we realized that this database is quite vulnerable, and if it goes poof just like the original website did, there goes almost 15 years of history and 11 million strips. |
| 04:38:07 | <LegitSi> | The database isn't perfect. There's plenty of gaps, but it's the best we have. It only has the images as far as we know, so no metadata, which is a damn shame. |
| 04:42:41 | <LegitSi> | The knowledge of StripGenerator has mostly been lost to time, with not a whole lot of people talking about it today. Look it up, you'll get age-old results talking about how good it is, not people talking about what once was. |
| 04:45:06 | <pokechu22> | 11 million URLs is a lot, but probably doable (it's probably too big for archivebot as one big job at least, but I'm pretty sure there are other tools that would work) |
| 04:45:55 | <@JAA> | Is it just an open S3 bucket or something more complicated? |
| 04:49:17 | <LegitSi> | I believe it's an open S3 bucket, but I'm not knowledgeable at all on that subject. Can I post a link of a strip as demonstration? |
| 04:49:24 | <@JAA> | Sure |
| 04:49:24 | <pokechu22> | Sure |
| 04:49:40 | <@JAA> | Get ninja'd! :-) |
| 04:50:11 | <LegitSi> | http://s3.amazonaws.com/stripgenerator/strip/50/72/19/00/00/full.png |
| 04:50:31 | <LegitSi> | This particular strip's ID is 912705. It's filled in reverse base 10 order. |
| 04:51:04 | <@JAA> | Ok, it's not *open* open, but if the URLs can be generated, that's fine. |
| 04:51:45 | <LegitSi> | Yeah, in fact one of our members made a program to extract such in a presentable form, and then I modified the code a bit to make it easier. It's on Codepen, do you mind if I share a link? |
| 04:52:32 | <LegitSi> | This program has the ability to generate a random strip out of the 11.8 million, and while there are plenty of gaps, it's still a pretty good resource. |
| 04:52:47 | <LegitSi> | You can also search up any strip ID if you so please. |
| 04:53:22 | <@JAA> | Apart from full.png, there's also bribed.png and detail.png thumbnails. |
| 04:53:42 | <@JAA> | And thumb.png |
| 04:53:42 | <pokechu22> | OK, so it's about 11.8 million strips, but what's the highest valid ID? |
| 04:53:50 | <LegitSi> | Point is, I believe we can generate all of the possible URLs up to the last known strip ever made (#1180869). The first couple thousand or so we're unable to access, though. |
| 04:53:50 | <pokechu22> | And, sure, feel free to share any links you think are relevant |
| 04:54:34 | <LegitSi> | #1180869 is the latest strip on the last working Wayback Machine snapshot before the website's collapse in early November, and looking forward a while, I can't find any strip later than that. |
| 04:54:52 | <pokechu22> | http://s3.amazonaws.com/stripgenerator/strip/96/80/81/10/00/full.png being the last known one, and http://s3.amazonaws.com/stripgenerator/strip/07/80/81/10/00/full.png being the missing one after that? Assuming I've guessed the pattern correctly? |
| 04:55:27 | <LegitSi> | Yes, that one. There are gaps in the data, which can be anything from a deleted strip to one that failed to render (more common the later it is). |
| 04:55:37 | <@JAA> | Where do the 11 million come from if the highest ID is under 1.2 million? |
| 04:56:03 | <pokechu22> | Oh, that's a good point, 1180869 is 1,180,869 |
| 04:56:12 | <LegitSi> | I dun messed up. |
| 04:56:23 | <LegitSi> | I may or may not have missed a digit, my bad. |
| 04:56:38 | <LegitSi> | 1.1m is still quite a bit, at least in my book. |
| 04:56:53 | <@JAA> | Ah ok, so 6 million if we also grab all the thumbnails or 1.2 million for just the full strips. Easy. |
| 04:57:31 | <LegitSi> | Yeah. There's also the metadata, which, as far as I know, can only be recovered from the 1% of comics that got a Wayback Machine snapshot. |
| 04:57:41 | <LegitSi> | Unfortunately, I believe those might just be lost unless the database has a neat surprise for us. |
| 04:57:50 | <pokechu22> | Can you give an example of that metadata? |
| 04:57:56 | <LegitSi> | Metadata as in title, description, likes, comments, transcription, etc. |
| 04:58:21 | <LegitSi> | http://web.archive.org/web/20201029201327/http://stripgenerator.com/strip/912705/the-man-with-the-little-gold-statue |
| 04:58:23 | <@JAA> | It's entirely plausible they were just using S3 for hosting static files (like the rendered strips). |
| 04:58:42 | <@JAA> | So yeah, the metadata may be lost. |
| 04:58:47 | <LegitSi> | This is an archive of a StripGenerator webpage, you can see what metadata I'm referring to. |
| 04:58:57 | <LegitSi> | Unless I've been using the word metadata incorrectly all this time. |
| 04:59:05 | <@JAA> | (The metadata that wasn't captured at the time, that is.) |
| 04:59:08 | <LegitSi> | I'm a little scatterbrained, I spent 3 hours coding in two unfamiliar languages. Forgive me. |
| 05:00:11 | <LegitSi> | Honestly, I can settle with a 99% metadata loss if we can get all the strips into the Internet Archive. |
| 05:00:22 | <pokechu22> | Yeah, metadata is the right term for that, but if it was only hosted via http://stripgenerator.com/ itself and that's dead then it's probably gone |
| 05:00:38 | <LegitSi> | The metadata that has been archived happens to be from Wayback Machine snapshots, part of the Internet Archive. |
| 05:01:01 | <LegitSi> | So that's not going anywhere any time soon, unless it is, then we have way bigger problems. |
| 05:01:02 | <@JAA> | 1.2 million images can easily be run through AB. It should even be quite fast. |
| 05:01:11 | <pokechu22> | "384,903 Members 1,064,245 Strips" at the top, too |
| 05:01:14 | <@JAA> | Would probably be done in like half a day or so. |
| 05:01:43 | <pokechu22> | There's also user thumbnails e.g. http://s3.amazonaws.com/stripgenerator/avatar/49/46/22/00/1393980595/thumb.png but I'm not sure if there's a practical way to generate those |
| 05:01:45 | <LegitSi> | That's fantastic. I feel quite guilty not bringing it to your attention earlier when the website was near collapse, but better late than never I guess. |
| 05:02:10 | <LegitSi> | In any case, you guys know how to explore Amazon AWS databases so if there's anything we missed, please let me know. |
| 05:02:45 | <@JAA> | This isn't a database, it's object storage, and there's nothing to explore really since they're blocking access to the content listing. |
| 05:03:04 | <@JAA> | http://s3.amazonaws.com/stripgenerator/ would otherwise show a list of the first 1k objects in the bucket. |
| 05:03:21 | <@JAA> | (And then you can paginate to get all of them.) |
| 05:03:31 | <@JAA> | But competent operators will lock that down. |
| 05:03:46 | <LegitSi> | There is a timeline where I simply ask them to unlock it. When we were discussing making that SG clone website, I was in a Twitter conversation with the creators that led to us getting the legal greenlight so long as we didn't use the source code. |
| 05:03:54 | <LegitSi> | But that's not happening anytime soon. |
| 05:03:59 | <pokechu22> | e.g. http://s3.amazonaws.com/MinecraftDownload and http://s3.amazonaws.com/MinecraftDownload/resources/music/calm1.ogg |
| 05:04:02 | <@JAA> | It's how half of the 'this data was leaked!!1!' situations happen these days. |
| 05:04:25 | <LegitSi> | I've heard. |
| 05:04:59 | <pokechu22> | Looks like there's also metadata on https://web.archive.org/web/20200709114857/http://mrterrorkiller.stripgenerator.com/gallery - probably not as much as the strip's own page but it's still better than nothing |
| 05:05:05 | <LegitSi> | Something that might be way harder would be using the strip IDs to find every strip that does have an available snapshot and acquire the metadata from those. |
| 05:05:36 | <pokechu22> | https://archive.org/developers/wayback-cdx-server.html is the better way to find what snapshots exist |
| 05:06:10 | <LegitSi> | My basic search found that 100,000 or so links existed under /strip, which is where all the strips were hosted. |
| 05:06:18 | <LegitSi> | Although there's probably something I've missed. |
| 05:08:25 | <LegitSi> | http://stripgenerator.com/strip/ MIME-types Count has 332,444 captures, 127,573 URLs, and 104,659 new URLs listed under text/html, if that helps. |
| 05:08:53 | <LegitSi> | http://web.archive.org/details/http://stripgenerator.com/strip/ if you're curious as to where I got that from. |
| 05:09:30 | | BlueMaxima quits [Client Quit] |
| 05:09:46 | <pokechu22> | I've never used that one before - there's https://web.archive.org/*/http://stripgenerator.com/strip/* but by default it's limited to 10k pages (with https://archive.org/developers/wayback-cdx-server.html being the less-user-friendly way of getting more) |
| 05:10:09 | <@JAA> | I love shell utils. Simple one-liner to generate all the full.png URLs. :-) |
| 05:10:21 | <@JAA> | `seq 1180869 | xargs printf '%010d\n' | rev | sed 's,..,&/,g; s,^,https://s3.amazonaws.com/stripgenerator/strip/,; s,$,full.png,'` |
| 05:10:58 | <pokechu22> | Huh, TIL of rev - I've seen tac but not that |
| 05:11:13 | <LegitSi> | Did that one-liner take into account empty listings? Otherwise (according to the strip IDs and last snapshot), you've got about 100,000 blank entries. |
| 05:11:30 | <LegitSi> | 1.18 million comics with an ID, with 1.06 million listed on the last snapshot. |
| 05:11:33 | <@JAA> | It's simply all IDs from 1 through 1180869. |
| 05:11:57 | <LegitSi> | Got it. You might also have trouble with the first few thousand entries, we haven't found a way to get those. |
| 05:12:22 | <@JAA> | So it goes https://s3.amazonaws.com/stripgenerator/strip/10/00/00/00/00/full.png https://s3.amazonaws.com/stripgenerator/strip/20/00/00/00/00/full.png ... https://s3.amazonaws.com/stripgenerator/strip/86/80/81/10/00/full.png https://s3.amazonaws.com/stripgenerator/strip/96/80/81/10/00/full.png |
| 05:12:27 | <pokechu22> | If you're referring to e.g. http://s3.amazonaws.com/stripgenerator/strip/76/80/81/10/00/full.png then that would still be downloaded, but probably the fastest way to check if it exists is to just download it, so eh, nothing special needed there |
| 05:12:36 | <LegitSi> | Everything after 14133 should be good. |
| 05:12:53 | <@JAA> | Yeah, let's just run through those, not a concern. |
| 05:13:03 | <LegitSi> | Got it. |
| 05:14:20 | <@JAA> | It's in AB. It'll take a while to load the URL list, but then it'll go brrrrr. :-) |
| 05:15:18 | <pokechu22> | If you want to try to look for things with metadata, https://web.archive.org/cdx/search/cdx?url=http://stripgenerator.com/strip/1&output=json&matchType=prefix&collapse=urlkey / https://web.archive.org/cdx/search/cdx?url=http://stripgenerator.com/strip/2&output=json&matchType=prefix&collapse=urlkey / similar should be a good starting point I think |
| 05:15:53 | <pokechu22> | (again, there's documentation for that at https://archive.org/developers/wayback-cdx-server.html) |
| 05:16:28 | <@JAA> | You could also download a list of all known stripgenerator.com URLs to discover possible other sources of metadata (like the subdomains mentioned earlier). |
| 05:16:47 | <@JAA> | And don't trust the documentation too much. There are quirks and bugs. |
| 05:16:48 | <LegitSi> | Thanks. |
| 05:17:36 | <@JAA> | There's a script to help with this in my little-things repo, and I'd do this: `ia-cdx-search --concurrency 4 --tries 10 'url=stripgenerator.com&matchType=domain&collapse=urlkey&fl=original'` |
| 05:18:04 | <@JAA> | https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/ia-cdx-search |
| 05:18:53 | <LegitSi> | Thanks a lot. I'm gonna go to bed now and check on this when I get back, maybe even spend hours looking for the metadata. |
| 05:19:52 | <LegitSi> | Goodnight everyone. |
| 05:20:00 | | LegitSi quits [Remote host closed the connection] |
| 05:25:02 | | IDK (IDK) joins |
| 05:25:21 | | ell (ell) joins |
| 06:04:06 | | hackbug quits [Ping timeout: 252 seconds] |
| 06:08:20 | | sec^nd quits [Ping timeout: 276 seconds] |
| 06:11:16 | | sec^nd (second) joins |
| 06:19:34 | | umgr036 quits [Remote host closed the connection] |
| 06:27:10 | | umgr036 joins |
| 07:34:14 | | IDK quits [Client Quit] |
| 07:41:08 | | hitgrr8 joins |
| 09:31:39 | | fl0w_ quits [Remote host closed the connection] |
| 09:31:39 | | umgr036 quits [Remote host closed the connection] |
| 09:31:49 | | umgr036 joins |
| 09:31:59 | | fl0w_ joins |
| 09:46:40 | | Barto quits [Read error: Connection reset by peer] |
| 09:55:42 | | Barto (Barto) joins |
| 10:43:35 | | IDK (IDK) joins |
| 11:35:01 | | Ruthalas5 quits [Ping timeout: 252 seconds] |
| 12:57:38 | | hackbug (hackbug) joins |
| 13:08:29 | <pabs> | OrIdow6: re PopJam, maybe their site has some APIs that could be saved? |
| 13:21:28 | | Ketchup901 quits [Remote host closed the connection] |
| 13:22:09 | | Ketchup901 (Ketchup901) joins |
| 13:49:13 | | Arcorann quits [Ping timeout: 252 seconds] |
| 14:01:02 | | Ketchup902 (Ketchup901) joins |
| 14:01:34 | | Ketchup901 quits [Remote host closed the connection] |
| 14:01:37 | | Ketchup902 quits [Remote host closed the connection] |
| 14:01:39 | | Ketchup902_ (Ketchup901) joins |
| 14:19:31 | | treora quits [Remote host closed the connection] |
| 14:19:32 | | treora joins |
| 14:19:47 | | treora quits [Remote host closed the connection] |
| 14:19:48 | | treora joins |
| 14:22:20 | | adia quits [Quit: The Lounge - https://thelounge.chat] |
| 14:22:45 | | IDK quits [Client Quit] |
| 14:23:59 | | adia (adia) joins |
| 14:38:19 | | treora quits [Remote host closed the connection] |
| 14:38:19 | | treora joins |
| 14:39:27 | | treora quits [Remote host closed the connection] |
| 14:39:28 | | treora joins |
| 14:39:41 | | treora quits [Remote host closed the connection] |
| 14:39:41 | | treora joins |
| 15:00:07 | | treora quits [Remote host closed the connection] |
| 15:00:08 | | treora joins |
| 16:22:59 | | ell quits [Client Quit] |
| 18:02:27 | | TheTechRobo quits [Client Quit] |
| 18:28:10 | | treora quits [Remote host closed the connection] |
| 18:28:12 | | treora joins |
| 18:30:38 | | HackMii quits [Ping timeout: 276 seconds] |
| 18:31:10 | | HackMii (hacktheplanet) joins |
| 18:38:20 | | hackbug quits [Remote host closed the connection] |
| 18:40:30 | | hackbug (hackbug) joins |
| 18:51:49 | | treora quits [Remote host closed the connection] |
| 18:51:49 | | treora joins |
| 18:51:50 | | treora quits [Read error: Connection reset by peer] |
| 18:51:53 | | treora joins |
| 18:51:59 | | treora quits [Remote host closed the connection] |
| 18:52:00 | | treora joins |
| 18:54:01 | | katocala quits [Remote host closed the connection] |
| 19:08:06 | <@OrIdow6> | pabs: Yeah I assume so |
| 19:08:43 | <@OrIdow6> | I found some APIs in CDX (whose recency makes me wonder if it's someone trying to SPN the site, but I didn't check which item they were in) |
| 19:10:07 | <@OrIdow6> | And also tried to decompile their Android app, in which I'm fairly out of my depth |
| 19:12:00 | <pokechu22> | Hmm, something related I've noticed: after starting a job with archivebot, the page the job was started on also seems to get SPN'd, though I'm not sure if that's actually an automatic thing or just someone doing it |
| 19:14:47 | <@OrIdow6> | Yeah that's been going on a while |
| 19:15:05 | <@OrIdow6> | We tried to figure out who was doing it a while ago but I don't think that succeeded |
| 19:18:15 | <@OrIdow6> | Wonder if there was a gap in it when the dashboard URL was switched, that could help narrow down the method |
| 19:22:51 | | Ketchup902_ quits [Remote host closed the connection] |
| 19:23:31 | | Ketchup901 (Ketchup901) joins |
| 19:38:28 | | umgr036 quits [Remote host closed the connection] |
| 19:41:45 | | katocala joins |
| 19:42:03 | | katocala is now authenticated as katocala |
| 19:45:54 | | umgr036 joins |
| 20:19:32 | | wyatt8740 quits [Ping timeout: 252 seconds] |
| 20:21:22 | | umgr036 quits [Ping timeout: 252 seconds] |
| 20:28:54 | | wyatt8740 joins |
| 21:01:07 | | Wingy quits [Quit: The Lounge - https://thelounge.chat] |
| 21:09:24 | | eroc1990 quits [Ping timeout: 252 seconds] |
| 21:09:46 | | eroc1990 (eroc1990) joins |
| 21:14:14 | <@kaz> | iirc it's not archivebot jobs, it's just _urls posted in #archivebot_ ? |
| 21:14:40 | <@kaz> | i would test that theory but I don't have a webserver to hand |
| 21:26:27 | <@OrIdow6> | I believe that was one thing we tried |
| 21:26:53 | <@OrIdow6> | And part of the difficulty was that many people's IRC clients fetched all URLs they saw |
| 21:27:02 | <@OrIdow6> | "we" = someone else while I watched |
| 21:28:16 | | Island joins |
| 21:53:34 | <thuban> | is that still going on or did it stop a while back? |
| 22:22:29 | | Ketchup901 quits [Remote host closed the connection] |
| 22:22:57 | | Ketchup901 (Ketchup901) joins |
| 22:25:49 | | tzt quits [Remote host closed the connection] |
| 22:26:12 | | tzt (tzt) joins |
| 22:45:36 | | BlueMaxima joins |
| 22:55:24 | <@OrIdow6> | Looking through my AB logs, it seems like it's still going on and indeed on all URLs in the chan, but only sometimes? SPN restrictions preventing them all from being saved, maybe? |
| 22:56:28 | <@OrIdow6> | Also, we had a Joaquinit visit on the 13th and nobody told me? |
| 23:07:38 | <thuban> | huh, dunno then. if it had stopped i was going to say it was me--i wrote an auto-spn script that initially watched my entire irc client. but i realized this was silly and restricted it some time ago (plus i'm rarely in #archivebot). perhaps someone has something similar? |
| 23:19:57 | | hitgrr8 quits [Client Quit] |
| 23:44:44 | | Arcorann (Arcorann) joins |
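
Note on the strip URL scheme discussed between 04:50 and 05:12: the strip ID appears to be zero-padded to ten digits, reversed as a string, and split into five two-digit directories. The following is a minimal Python sketch of that mapping and its inverse, written only from the examples in the log. The function names `strip_url` and `strip_id_from_url` are hypothetical, and the ten-digit padding is an assumption inferred from the one-liner's `%010d` rather than anything confirmed by the site's operators.

```python
# Sketch of the ID <-> URL mapping inferred from the log; not an official API.
BASE = "https://s3.amazonaws.com/stripgenerator/strip/"


def strip_url(strip_id: int, name: str = "full.png") -> str:
    """Build the S3 URL for a strip ID (assumes 10-digit zero padding)."""
    if not 0 < strip_id < 10**10:
        raise ValueError("strip_id must be a positive integer of at most 10 digits")
    # Zero-pad to ten digits, then reverse the digit string.
    digits = f"{strip_id:010d}"[::-1]
    # Split the reversed string into five two-digit path components.
    pairs = [digits[i:i + 2] for i in range(0, 10, 2)]
    return BASE + "/".join(pairs) + "/" + name


def strip_id_from_url(url: str) -> int:
    """Recover the strip ID from a URL of the shape produced above."""
    path = url.split("/strip/", 1)[1]
    # The first five components are the reversed, padded digits.
    parts = path.split("/")[:5]
    return int("".join(parts)[::-1])


if __name__ == "__main__":
    print(strip_url(912705))
    print(strip_id_from_url(
        "http://s3.amazonaws.com/stripgenerator/strip/96/80/81/10/00/full.png"
    ))
```

Passing a different `name` (for example `thumb.png`, mentioned at 04:53) would give the thumbnail URL for the same ID, assuming thumbnails sit in the same directory as `full.png`, which is what the log suggests.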