| 00:22:04 | | jacobk quits [Ping timeout: 240 seconds] |
| 01:20:28 | | @rewby quits [Ping timeout: 240 seconds] |
| 01:20:42 | | rewby (rewby) joins |
| 01:20:43 | | @ChanServ sets mode: +o rewby |
| 01:30:05 | | Mateon2 joins |
| 01:30:45 | | Mateon1 quits [Ping timeout: 255 seconds] |
| 01:30:45 | | Mateon2 is now known as Mateon1 |
| 02:10:43 | | qwertyasdfuiopghjkl joins |
| 02:40:00 | | jacobk joins |
| 03:03:32 | | katocala joins |
| 03:03:56 | | katocala is now authenticated as katocala |
| 04:04:50 | | michaelblob_ (michaelblob) joins |
| 04:08:04 | | michaelblob quits [Ping timeout: 240 seconds] |
| 04:20:11 | <pabs> | JAA: when I got stuck into this mailman2 stuff, one thing I noticed are sites where lists aren't on the main listinfo page but can be found via search engines and some of those have public archives |
| 04:20:49 | <pabs> | so I think for each mailman site, searching for site:foo inurl:listinfo and site:foo inurl:pipermail might be good |
| 04:21:49 | <pabs> | in my list I'm recording individual lists where the main listinfo page had zero lists, but I just discovered some sites have lists on the main listinfo page but some lists in search engines are missing from the main listinfo |
| 04:31:17 | <pabs> | it's pretty common in the rabbit warren I entered, universities in Germany :) |
| 04:53:25 | <pabs> | JAA: ok, this is my final list |
| 04:53:26 | <pabs> | https://paste.debian.net/plain/1258270 |
| 04:53:48 | <pabs> | don't want to get sucked into this too much more, so I'm going to stop now |
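
The per-site searches pabs suggests can be generated mechanically. A minimal sketch, assuming a plain list of Mailman host names; `mailman_search_queries` and the example host are hypothetical, and only the `site:` / `inurl:` query shapes come from the discussion above:

```python
def mailman_search_queries(hosts):
    """Build the two search-engine queries suggested above for each Mailman host."""
    queries = []
    for host in hosts:
        # Lists that are missing from the main listinfo page
        queries.append(f"site:{host} inurl:listinfo")
        # Public archives that may exist for those lists
        queries.append(f"site:{host} inurl:pipermail")
    return queries


if __name__ == "__main__":
    # Hypothetical host name, for illustration only
    for query in mailman_search_queries(["lists.example.org"]):
        print(query)
```

The queries still have to be run by hand or through whatever search tooling is at hand; the sketch only saves retyping them per site.
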
| 06:35:58 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 06:43:39 | | katocala quits [Remote host closed the connection] |
| 06:43:51 | | katocala joins |
| 06:46:24 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 06:46:24 | | lennier2 joins |
| 06:46:35 | | qwertyasdfuiopghjkl joins |
| 06:48:37 | | katocala quits [Ping timeout: 253 seconds] |
| 06:49:09 | | katocala joins |
| 06:49:35 | | Sluggs quits [Ping timeout: 260 seconds] |
| 06:49:35 | | lennier1 quits [Ping timeout: 260 seconds] |
| 06:49:44 | | lennier2 is now known as lennier1 |
| 06:50:42 | | Sluggs joins |
| 07:37:30 | | lukash79 quits [Ping timeout: 255 seconds] |
| 07:48:38 | <Ruk8> | Update: The list of content not yet archived is *HUGE*; if someone wants to help with this I'll be glad. Especially with checking that all content is archived (I'm copying the links by hand, sanitizing them by hand, checking whether previously archived versions are already available and, if so, skipping them by hand, and so on...) |
| 07:48:38 | <Ruk8> | Currently I'm doing the Windows versions of CS 6 and all the other products that are not che CS suite. |
| 07:48:38 | <Ruk8> | If someone wants to register a list of the other URLs (CS6's DMGs and previously available CS products) please help. |
| 07:48:38 | <Ruk8> | To do this work you'll need to set up a cookie in your browser and get a few hints. |
| 07:49:59 | <Ruk8> | The step first described obv comes after the content will be updated to the IA |
| 07:57:44 | <Ruk8> | I'm not considering asking for the creation of a new project because all of that could be done in 1-3 days by 2-3 people. |
| 07:57:44 | <Ruk8> | Honestly I could do it all by myself but I'm giving up, checking previously archived urls, among a bunch of previously 403 pages is kinda exhausting. |
| 08:00:51 | | @rewby quits [Remote host closed the connection] |
| 08:01:00 | <Ruk8> | *che -> the |
| 08:16:23 | | rewby (rewby) joins |
| 08:16:23 | | @ChanServ sets mode: +o rewby |
| 08:55:21 | | abirkill quits [Remote host closed the connection] |
| 08:57:39 | | abirkill (abirkill) joins |
| 08:58:16 | | sonick quits [Client Quit] |
| 09:18:52 | <@rewby> | Found a blogging host that's going to close up shop soon: https://tweakblogs.net/ |
| 09:24:39 | <Ryz> | rewby, when is it shutting down? oo; |
| 09:25:09 | <Ryz> | Looks like this article says '2023' via https://tweakers.net/plan/3726/we-nemen-afscheid-van-tweakblogs.html - so when that year is reached |
| 09:25:38 | <asie4> | I mentioned chomikuj.pl a long time ago. Well, it seems the time has come. |
| 09:25:58 | <asie4> | They are removing all "long unused" files on November 16th, 2022. |
| 09:26:31 | <asie4> | Some claim this is a yearly procedure, but I have never seen this happen in the past |
| 09:26:50 | <asie4> | I have a longer write-up about the site somewhere in the IRC logs... |
| 09:27:15 | | sonick (sonick) joins |
| 09:28:55 | <@rewby> | Ryz: Yeah, they say January 2023 is when they'll shut down |
| 09:30:24 | <Ryz> | asie4, the problem is, as far as I recall from poking around that website months ago, requires payment just to download the files...? S: |
| 09:30:34 | <asie4> | Sort of. |
| 09:30:39 | <asie4> | <1MB files are free without an account |
| 09:30:45 | <asie4> | And a free account can download 50MB a week |
| 09:31:16 | <asie4> | So, in theory, a scraping job for small files could be viable; for anything else, the best one can do is create an easily searchable list so people can figure out if anything's important. |
| 09:31:28 | <asie4> | (They have a search engine but it's bare-bones) |
| 09:32:26 | <Ryz> | The search for the juiciest loot |
| 09:36:32 | <asie4> | It is possible to automate file downloading given URLs - plenty of PoCs exist online |
| 09:36:44 | <asie4> | There's also some work being done on figuring out if any workarounds exist |
| 09:53:53 | | treora quits [Ping timeout: 268 seconds] |
| 09:58:23 | | treora joins |
| 10:53:40 | | treora quits [Ping timeout: 240 seconds] |
| 11:09:44 | | JackThompson quits [Ping timeout: 268 seconds] |
| 11:11:50 | | JackThompson joins |
| 11:45:37 | | treora joins |
| 11:46:10 | | adia quits [Quit: The Lounge - https://thelounge.chat] |
| 11:46:23 | | adia (adia) joins |
| 11:46:45 | | adia quits [Client Quit] |
| 11:49:10 | | adia (adia) joins |
| 12:01:47 | | JackThompson quits [Client Quit] |
| 12:01:48 | | dm4v quits [Client Quit] |
| 12:01:54 | | dm4v_ joins |
| 12:02:05 | | JackThompson joins |
| 12:02:21 | | dm4v_ is now known as dm4v |
| 12:09:23 | | katocala is now authenticated as katocala |
| 12:24:28 | | treora quits [Ping timeout: 240 seconds] |
| 12:25:37 | | niku quits [Remote host closed the connection] |
| 13:16:05 | <@JAA> | pabs: Yeah, hidden lists are very common. Plenty of examples also in the internet infrastructure project I started a few months ago. |
| 13:16:31 | <pabs> | some of them aren't very well hidden :) |
| 13:17:25 | <@JAA> | Yeah. Some are cross-linked from emails in other lists, others are findable via web searches, some show up in the WBM. |
| 13:18:08 | <@JAA> | I built lists of mailing lists from searches and the WBM, ran the main job, then filtered out anything encountered there, and ran the rest manually. |
| 13:18:53 | <@JAA> | See e.g. https://wiki.archiveteam.org/index.php/Internet_infrastructure#Mailing_lists |
| 13:19:10 | <pabs> | ok. you probably got everything already, but will you check the list I prepared too? |
| 13:19:28 | <@JAA> | I absolutely didn't, only archived like two or three instances so far. |
| 13:19:47 | <@JAA> | I don't want to make this my life either. :-P |
| 13:19:51 | <@JAA> | But yeah, will do. |
| 13:21:29 | <pabs> | I see :) thanks |
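
The filtering JAA describes (build a candidate list from searches and the WBM, run the main job, then handle whatever it never encountered) amounts to a set difference. A small sketch under that reading; the function name and the example list names are hypothetical:

```python
def lists_to_run_manually(discovered, encountered_in_main_job):
    """Return lists found via searches or the WBM that the main job never saw."""
    return sorted(set(discovered) - set(encountered_in_main_job))


if __name__ == "__main__":
    # Hypothetical list names, for illustration only
    discovered = ["announce", "hidden-dev", "users"]
    encountered = ["announce", "users"]
    print(lists_to_run_manually(discovered, encountered))
```
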
| 14:20:47 | | tech_exorcist (tech_exorcist) joins |
| 15:55:39 | | sec^nd quits [Ping timeout: 255 seconds] |
| 15:59:08 | | sec^nd (second) joins |
| 16:04:05 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 16:08:20 | | michaelblob (michaelblob) joins |
| 16:12:04 | | michaelblob_ quits [Ping timeout: 240 seconds] |
| 16:12:31 | | qwertyasdfuiopghjkl joins |
| 17:04:13 | | nostrum-tango quits [Remote host closed the connection] |
| 17:04:29 | | nostrum-tango joins |
| 17:16:52 | | Ketchup901 quits [Remote host closed the connection] |
| 17:16:53 | | mutantm0nkey quits [Remote host closed the connection] |
| 17:17:30 | | Ketchup901 (Ketchup901) joins |
| 17:17:56 | | mutantm0nkey (mutantmonkey) joins |
| 17:27:54 | | sec^nd quits [Ping timeout: 255 seconds] |
| 17:30:50 | | march_happy quits [Ping timeout: 268 seconds] |
| 17:33:37 | | sec^nd (second) joins |
| 17:48:04 | | jacobk quits [Ping timeout: 240 seconds] |
| 18:16:30 | | sec^nd quits [Ping timeout: 255 seconds] |
| 18:26:13 | | sec^nd (second) joins |
| 18:26:28 | | Ketchup901 quits [Client Quit] |
| 18:32:42 | | Ketchup901 (Ketchup901) joins |
| 18:55:22 | <@rewby> | JAA: I've been doing some snooping on tweakblogs.net and I think this might either be a good use of a qwarc run or a dpos project. |
| 18:55:55 | <@rewby> | So here's what I figured out: |
| 18:56:24 | <@rewby> | So https://tweakblogs.net/ is the main front page, but each blog is hosted on a subdomain |
| 18:56:31 | <@rewby> | Example, https://iplaygamesonyoutube.tweakblogs.net/blog/20122/gris-review |
| 18:56:41 | <@rewby> | There's only 20k or so posts I think |
| 18:56:56 | <@rewby> | But the main issue is discovering the subdomains, because ids are only valid on the subdomain they belong to |
| 18:57:11 | <@rewby> | One does not need the actual post slug, the ID is enough |
| 18:57:30 | <@rewby> | I.e. https://iplaygamesonyoutube.tweakblogs.net/blog/20122/ works just fine for accessing the post from before |
| 18:58:03 | <@rewby> | Example: This is a valid id on another blog: https://iplaygamesonyoutube.tweakblogs.net/blog/20118/ |
| 18:58:26 | <@rewby> | Each subdomain, one may notice, includes a link to their profile on the main tweakers site. |
| 18:59:05 | <@JAA> | Yeah, just saw that, and those profile pages just take a numeric ID. |
| 18:59:11 | <@rewby> | Ding |
| 18:59:15 | <@rewby> | So https://tweakers.net/gallery/338396/ is the profile page |
| 18:59:18 | <@JAA> | The activiteit subpage lists the blog. |
| 18:59:22 | <@rewby> | If you go to activity https://tweakers.net/gallery/338396/activiteit/ |
| 18:59:24 | <@rewby> | Yep |
| 18:59:35 | <@rewby> | And the ids are sequential |
| 18:59:45 | <@rewby> | And 500000 is invalid so I presume that means less than 500k accounts |
| 18:59:58 | <@rewby> | All of which seem rather doable numbers |
| 19:00:33 | <@rewby> | The actual blogs could even be scraped by just a bunch of !a commands on AB. |
| 19:00:45 | <@JAA> | IDs go higher, e.g. https://tweakers.net/gallery/1260706/ |
| 19:00:46 | <@rewby> | It's very basic recursive crawling once you have the subdomains |
| 19:00:48 | | mutantm0nkey quits [Remote host closed the connection] |
| 19:00:54 | <@rewby> | Ah dangit |
| 19:00:56 | <@JAA> | But still reasonable |
| 19:01:03 | <@rewby> | I must've spot checked some numbers that didn't exist then |
| 19:01:14 | <@JAA> | 'Geregistreerd op 3 oktober 2019' on that one. |
| 19:01:33 | <@JAA> | https://tweakers.net/gallery/1638202/ July 2021 |
| 19:01:44 | <@rewby> | https://tweakers.net/gallery/1838350/activiteit/ was this month |
| 19:01:53 | | mutantm0nkey (mutantmonkey) joins |
| 19:02:37 | <@JAA> | Are those blog post IDs globally unique? Are there really only 20k blog posts in total? |
| 19:02:44 | <@rewby> | Think so |
| 19:02:53 | <@JAA> | Huh |
| 19:03:01 | <@rewby> | https://tweakers.net/plan/3726/we-nemen-afscheid-van-tweakblogs.html mentions 10k blog posts |
| 19:03:04 | <@rewby> | And 160k "reactions" |
| 19:03:31 | | jacobk joins |
| 19:03:52 | <@JAA> | Actually, all blog posts seem to have even IDs. I'm not sure I want to know why that is. :-) |
| 19:03:53 | <@rewby> | And only about 1000 blogs |
| 19:03:57 | <joepie91|m> | https://tweakblogs.net/?allWeblogs=1 is apparently relevant |
| 19:04:02 | <joepie91|m> | but I'm seeing some weird stuff |
| 19:04:13 | <joepie91|m> | sec |
| 19:04:38 | <@rewby> | Are we sure that's all of them? |
| 19:05:49 | <joepie91|m> | earlier today someone elsewhere reported seeing only 2 entries at https://coltrui.tweakblogs.net/blog/ and the link at https://twitter.com/arnoudwokke/status/244414155036176384 being dead, but neither of those seem to be the case now? |
| 19:06:03 | <joepie91|m> | seems like some flaky stuff going on here |
| 19:06:23 | <@rewby> | Potentially depending on what server you hit? |
| 19:06:42 | <joepie91|m> | possibly. means we need to be extra careful in indexing them though |
| 19:07:12 | <@JAA> | (Nevermind regarding the even IDs, only seems to apply to recent posts on https://tweakblogs.net/?extUpdate=1 ) |
| 19:07:55 | <joepie91|m> | hm https://tweakers.net/plan/3726/we-nemen-afscheid-van-tweakblogs.html?showReaction=18078860#r_18078860 |
| 19:10:49 | | Xesxen (Xesxen) joins |
| 19:22:17 | | Peetz0r|m joins |
| 19:23:41 | | jacobk quits [Ping timeout: 268 seconds] |
| 19:33:02 | | Ketchup901 quits [Remote host closed the connection] |
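
The discovery route rewby and JAA worked out above (sequential profile IDs, each with an `activiteit` page that links the user's blog) could be sketched roughly as follows. This is only an illustration: the URL pattern is taken from the examples in the log, while the assumption that the blog appears in the page as a plain `*.tweakblogs.net` link, and all function names, are hypothetical. Fetching is left out on purpose.

```python
import re

# URL pattern as seen in the examples above
ACTIVITY_URL = "https://tweakers.net/gallery/{user_id}/activiteit/"

# Assumption: the blog shows up in the HTML as a link to <name>.tweakblogs.net
BLOG_HOST_RE = re.compile(r"https?://([a-z0-9-]+)\.tweakblogs\.net", re.IGNORECASE)


def activity_urls(first_id, last_id):
    """Yield the activity page URL for every profile ID in the inclusive range."""
    for user_id in range(first_id, last_id + 1):
        yield ACTIVITY_URL.format(user_id=user_id)


def blog_subdomains(html):
    """Extract the distinct blog subdomains linked from a fetched page."""
    return sorted({name.lower() for name in BLOG_HOST_RE.findall(html)})


if __name__ == "__main__":
    for url in activity_urls(338396, 338398):
        print(url)
    sample = '<a href="https://iplaygamesonyoutube.tweakblogs.net/blog/20122/gris-review">post</a>'
    print(blog_subdomains(sample))
```

Since the highest ID seen in the discussion is above 1.8 million, the range is larger than the first 500k estimate but, as noted above, still reasonable.
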
| 19:35:29 | <Peetz0r|m> | Hey, I came here because I heard tweakblogs was being discussed and I may be interested to help out |
| 19:36:07 | <Peetz0r|m> | I have time, bandwidth (1G) not much storage (I guess around 100G) and motivation |
| 19:36:31 | <Peetz0r|m> | (also my name is somewhere on https://tweakblogs.net/?allWeblogs=1 too) |
| 19:49:42 | <asie4> | Update on the Chomikuj.pl woes: The deletion in November seems to be limited to a specific type of files; searching around showed mostly files which are expensive and risky to host, like Blu-Ray movie ISOs. The site is apparently safe otherwise |
| 19:58:52 | | dm4v quits [Ping timeout: 265 seconds] |
| 20:21:19 | <Xesxen> | If context on tweakblogs wasn't provided yet: They're shutting down the service sometime in January 2023 and content will be purged/become unavailable. It contains 955 unique blogs, 9409 blog posts total as of earlier today and ~160k comment posts. |
| 20:22:29 | <Xesxen> | Since I'm unfamiliar with how archiveteam projects are run/started: how does one do this? |
| 20:26:39 | | dm4v joins |
| 20:41:06 | <@JAA> | The blogs on https://tweakblogs.net/?allWeblogs=1 in an easier data format: https://transfer.archivete.am/5fNvt/tweakblogs.jsonl |
| 20:42:59 | <@JAA> | 9462 posts per that. Close enough to the mentioned 10k that we can assume it's complete, I guess. |
| 20:58:35 | <joepie91|m> | JAA: you mentioned 20k earlier, though? |
| 20:58:51 | <joepie91|m> | or was that just based on the post IDs? |
| 20:58:54 | <@JAA> | That's post IDs. Maybe many were deleted. |
| 20:58:59 | <joepie91|m> | aha. |
| 20:59:04 | | sloop_ quits [Quit: ZNC 1.8.2 - https://znc.in] |
| 20:59:06 | <joepie91|m> | right |
| 20:59:16 | | sloop joins |
| 20:59:19 | <@JAA> | And the recent posts linked on https://tweakblogs.net/?extUpdate=1 all have even IDs, so maybe there are gaps due to $reasons as well. |
| 21:00:01 | <joepie91|m> | Xesxen: depends on the project, some are distributed (http://tracker.archiveteam.org/), some are just "someone throws a server at it" |
| 21:00:20 | <@JAA> | I'm contemplating just running these through ArchiveBot with queueh2ibot. |
| 21:01:15 | <joepie91|m> | what's queueh2ibot? |
| 21:02:46 | | tech_exorcist quits [Client Quit] |
| 21:10:37 | <@JAA> | A thing to continuously throw jobs into AB as others finish. |
| 21:10:47 | <joepie91|m> | ahhh |
| 21:10:53 | <joepie91|m> | wait, so a queue for the queue? 🤔 |
| 21:11:23 | <@JAA> | Yeah, a low-priority queue if you will. The bot will back off if others start jobs, and I generally configure it to leave a few slots open at all times. |
| 21:11:59 | <@JAA> | More accurately, it submits a job whenever the total job counter is below a pre-configured value. And I set that value to a bit under the current capacity. |
| 21:12:24 | <joepie91|m> | ahh, right |
| 21:12:55 | <@JAA> | (And even more accurately, queueh2ibot is just another instance of http2irc, like h2ibot, and the thing actually doing the queueing is a script that uses curl to talk to it.) |
| 21:15:03 | <@rewby> | JAA: I can second queueh2ibot. Sounds like a perfect solution tbh. |
| 21:15:26 | <@rewby> | I'd say maybe not capture outlinks, depending on how that ends up? |
| 21:15:28 | <joepie91|m> | so a queue for a proxy to a queue :-) |
| 21:15:53 | <@rewby> | Although you may wanna check if pages with a ton of reactions need JS to load all reactions |
| 21:16:02 | <@rewby> | Judging by the rest of this site, I'm guessing no |
| 21:16:09 | <@rewby> | But it's worth a double check to be sure |
| 21:16:11 | <@rewby> | Since AB doesn't do JS |
| 21:17:59 | | LeanTo joins |
| 21:18:27 | <Xesxen> | Comments don't need JS to be visible, they will all show up at once on the post page itself |
| 21:18:35 | <@rewby> | Perfect |
| 21:18:59 | <Ryz> | How many of these websites to toss into ArchiveBot via queueh2ibot? |
| 21:19:08 | <@rewby> | Thousand or so |
| 21:20:14 | <@rewby> | 955 blogs apparently |
| 21:20:23 | <@rewby> | But we have until January |
| 21:20:37 | <@rewby> | So if we can get this started within the week, then it should be done well in time |
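
JAA's description of the queueing behaviour above (submit a job whenever the total job counter is below a pre-configured value set a bit under capacity) can be illustrated with a few lines. This is not the actual script, which per the log uses curl to talk to an http2irc instance; the names `should_submit` and `drain` and the numbers in the example are hypothetical.

```python
def should_submit(total_jobs, threshold):
    """A new job may be submitted only while the total job count is below the threshold."""
    return total_jobs < threshold


def drain(pending, total_jobs, threshold):
    """Take jobs from the front of `pending` until the threshold is reached.

    Returns the jobs that would be submitted now; `pending` is modified in place.
    """
    submitted = []
    while pending and should_submit(total_jobs, threshold):
        submitted.append(pending.pop(0))
        total_jobs += 1  # each submitted job counts towards the total
    return submitted


if __name__ == "__main__":
    # Hypothetical numbers: 58 jobs running, threshold of 60
    queue = ["blog-a", "blog-b", "blog-c"]
    print(drain(queue, total_jobs=58, threshold=60))
    print(queue)
```

Because the check uses the total job count rather than only the bot's own jobs, the queue naturally backs off when other people start jobs, which matches the "low-priority queue" description.
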
| 21:23:30 | <LeanTo> | Hey folks, don't know if there's any interest but I didn't see it on the death watch list... |
| 21:23:57 | <LeanTo> | A few days ago, there was an announcement that the Turner Classic Movies message boards were going to be taken down |
| 21:24:38 | <LeanTo> | "Our last effective day to participate on the TCM Forums will be Wednesday, November 30, 2022. However, the TCM Forums content will remain available to view until February 15, 2023, so users may continue to have access to content. After this time, all customer data will be deleted from the forums." |
| 21:24:57 | <@rewby> | We'd probably want to wait until the 1st of dec to archive it then |
| 21:25:04 | <@rewby> | To capture all of it |
| 21:25:05 | <joepie91|m> | hmm. is this related to the Cartoon Network nonsense? |
| 21:25:18 | <LeanTo> | Dunno, but the same company |
| 21:25:24 | <joepie91|m> | yep, hence my suspicion |
| 21:25:25 | <LeanTo> | Discover/WB own them |
| 21:25:34 | <joepie91|m> | I think I saw another thing being mentioned that WB was killing off |
| 21:26:26 | <LeanTo> | I think there might be interest in using the data to resurrect it, I think there's someone who is doing up a new messageboard for the exodus |
| 21:27:41 | <LeanTo> | Are they killing off some Cartoon Network message board I don't know about as well? |
| 21:28:14 | <joepie91|m> | I haven't been following it too closely, but the rumours are that they are just killing off Cartoon Network in its entirety |
| 21:29:00 | <LeanTo> | Ah, the thing I read regarding that was that they were folding some of the other animation studios in with it or something. Not sure. |
| 21:29:39 | <joepie91|m> | it being corporate shenanigans, those may evaluate to the same thing in the end :) |
| 21:30:51 | <LeanTo> | Haha yeah |
| 21:31:20 | <LeanTo> | ah here's the forum post if it's helpful at all: https://forums.tcm.com/topic/271979-tcm-message-boards-sunsetting-in-november-2022/ |
| 21:31:25 | <LeanTo> | "sunsetting" |
| 21:31:42 | <LeanTo> | as far as I know when the sun sets, it usually comes back up in a couple hours |
| 21:38:37 | | BlueMaxima joins |
| 21:50:18 | | march_happy (march_happy) joins |
| 21:52:11 | <@JAA> | Please add it to Deathwatch so we won't forget. |
| 21:52:56 | <@JAA> | rewby: Yeah, I'll try one or two of the bigger blogs manually first to see whether this approach is workable. |
| 21:54:45 | <LeanTo> | Ok, I'll do that :) |
| 21:59:06 | <@rewby> | JAA: Sounds good. Let me know how that goes. |
| 22:01:38 | | qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds] |
| 22:26:04 | <h2ibot> | LeanTo edited Deathwatch (+212, Added Turner Classic Movies Message Boards to…): https://wiki.archiveteam.org/?diff=49119&oldid=49113 |
| 22:33:06 | | jacobk joins |
| 22:42:26 | <@arkiver> | ah communication about tweakblogs here as well |
| 22:44:43 | <@arkiver> | rewby: do we have a list of tweakblogs? where are we on this? |
| 22:45:40 | <@JAA> | arkiver: Yes, and the current plan is to run them through ArchiveBot. |
| 22:45:47 | <@arkiver> | alright good |
| 22:45:53 | <@arkiver> | well I sent tweakers an email anyway :P |
| 22:45:59 | <@arkiver> | guess we'll see if anything is missing |
| 22:46:08 | <@arkiver> | less than 1000 blogs? |
| 22:46:15 | <@JAA> | Yeah, with under 10k posts. |
| 22:46:41 | <LeanTo> | Dunno if you could tell me off the bat, but any idea if the TCM message boards would be a warrior based project? |
| 22:46:56 | <LeanTo> | Might be possible to recruit some people from there to participate in getting the data |
| 22:47:33 | <LeanTo> | while it's still up and able to be posted to |
| 22:48:25 | <@JAA> | I can't even access them. HTTP 403 'The Amazon CloudFront distribution is configured to block access from your country.' lol |
| 22:48:37 | <@arkiver> | hah same here |
| 22:48:42 | <@arkiver> | LeanTo: ^ |
| 22:48:54 | <@JAA> | Works from Canada. |
| 22:48:57 | <LeanTo> | Yeah I think they're actually blocked for some countries |
| 22:49:00 | <joepie91|m> | same, also cannot access |
| 22:49:04 | <joepie91|m> | I assume it's a GDPR block |
| 22:49:40 | <@JAA> | I'm not in the EU though. |
| 22:49:49 | <LeanTo> | No idea, I just saw someone mention it in a post |
| 22:50:13 | <@JAA> | Anyway, standard Invision forum. Could possibly be done with ArchiveBot, although it might be too big for that without aggressive ignores. |
| 22:50:34 | <@JAA> | If they don't have annoying rate limits, I can do it with qwarc, too. |
| 22:50:49 | <LeanTo> | Right on |
| 22:51:52 | <@JAA> | I'm pretty sure I already have code for archiving Invision with qwarc. But that'd be barebones without images etc. |
| 22:52:15 | <@JAA> | 272k topic IDs is easy enough to cover in 2.5 months anyway. |
| 22:52:31 | <@arkiver> | JAA: any outlinks/stuff on other domains can go to #// |
| 22:52:38 | <@JAA> | Yeah |
| 22:52:58 | <LeanTo> | I'll probly reach out on there and let them know, and see if I can get in contact with the guy who is trying to implement a new board. Might be of use |
| 22:53:23 | <LeanTo> | might even be able to see the content from it from all over the world for once XD |
| 22:55:23 | <@JAA> | How many posts are there in total? Can't find it with some simple searches in curl|less and can't access it through a browser right now. |
| 22:56:51 | <LeanTo> | just looking at the numbers looks like probly over 1.5 million posts |
| 22:58:07 | <LeanTo> | going back to 2002 on the general discussion forum |
| 22:59:51 | <@JAA> | Ah yeah, just saw that the post ('comment') IDs go to 2.7M, so that sounds about right I guess. |
| 23:03:10 | <LeanTo> | actually surprised they have comments that far back given they did a redesign, probly a few redesigns since then |
| 23:07:11 | <LeanTo> | Targeting some forums seems like it would be more beneficial than others, the off-topic section is mostly just political babble from the last couple years. But like 600k posts. |
| 23:07:40 | <LeanTo> | 440k rather |
| 23:14:45 | <@JAA> | Nah... https://transfer.archivete.am/inline/bG4mu/aatt.png |
| 23:16:14 | <LeanTo> | haha, as you wish :) |
| 23:18:55 | <@JAA> | It's easier to just iterate over all topic IDs anyway than recurse through the forum topic lists etc. |
| 23:26:40 | <LeanTo> | Gotcha, I don't really know the ins and outs of it all, just figured if it meant more space savings and quality over quantity etc |
| 23:29:01 | <LeanTo> | In either case, I appreciate your work. I participated in the imdb warrior project some time ago and still use the boards that came out of it to look up movie info/suggestions |
| 23:29:03 | <@JAA> | Yeah, it's a valid concern when there isn't enough time to save everything, but that shouldn't be a problem here. Space is hardly ever relevant for text data since it's tiny compared to images/videos and compresses very well. |
| 23:29:07 | <LeanTo> | wouldn't be possible without you guys |
| 23:29:50 | <@JAA> | Hah, the IMDb boards were my first project here. It's been a while... :-) |
| 23:30:15 | <LeanTo> | cool :) |
| 23:31:00 | <LeanTo> | I'm sure there's some nice images in there, but the text is probly more valuable anyways so whatevs. People can just look up a movie and find nice images if they want elsewhere :P |
| 23:55:59 | | BlueMaxima quits [Read error: Connection reset by peer] |
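
JAA's remarks about the TCM forums (272k topic IDs, roughly 2.5 months, iterating over IDs rather than recursing through topic lists) can be put into numbers. A rough sketch: the 75-day figure is an approximation of 2.5 months, and the URL template is an assumption, since the log only shows a topic URL with its slug and whether the forum resolves a topic without the slug is not confirmed here.

```python
def required_rate(total_items, days):
    """Average requests per second needed to fetch total_items within the given days."""
    return total_items / (days * 24 * 60 * 60)


def topic_urls(first_id, last_id, template="https://forums.tcm.com/topic/{topic_id}/"):
    """Yield one candidate URL per topic ID (template is an unverified assumption)."""
    for topic_id in range(first_id, last_id + 1):
        yield template.format(topic_id=topic_id)


if __name__ == "__main__":
    rate = required_rate(272_000, 75)
    print(f"{rate:.4f} requests/s, one request every {1 / rate:.1f} s")
    for url in topic_urls(271978, 271979):
        print(url)
```

At about one topic every 24 seconds on average this is a very low request rate, which fits the "easy enough" assessment; multi-page topics add further requests per topic, so the real figure would be somewhat higher.
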