00:25:59 | | sidpatchy joins |
02:25:43 | | threedeeitguy quits [Client Quit] |
02:46:16 | | threedeeitguy (threedeeitguy) joins |
03:05:06 | | nulldata quits [Quit: The Lounge - https://thelounge.chat] |
03:13:16 | | nulldata joins |
03:32:19 | | IDK quits [Client Quit] |
04:34:32 | | eroc1990 quits [Client Quit] |
04:35:29 | | eroc1990 (eroc1990) joins |
05:12:58 | | Ryz quits [Ping timeout: 265 seconds] |
05:15:59 | | qw3rty quits [Ping timeout: 252 seconds] |
05:24:39 | <@phuzion> | albertlarsan68: Generally, you can either add it to the wiki page or just report the URL shortener here, and someone will get to adding it eventually. |
05:24:53 | <@phuzion> | albertlarsan68: What is the shortener you'd like to add? |
05:39:07 | | Ryz (Ryz) joins |
05:51:48 | <albertlarsan68> | I was thinking of adding the sk.mu shortener, already documented on the wiki in the Skyblog page. |
05:53:37 | <albertlarsan68> | I'm not sure it is a good idea, since it seems to be deterministic, and we can know where it points to just by looking at it. |
06:02:44 | | Ryz quits [Ping timeout: 252 seconds] |
06:09:38 | | Ryz (Ryz) joins |
06:15:52 | <@phuzion> | albertlarsan68: Yeah, looking at the info about how the URLs are generated, it's unlikely that sk.mu would be a good candidate to run in terroroftinytown |
06:16:56 | <albertlarsan68> | Maybe finding a way to generate the redirects could be a way to archive those URLs nonetheless? |
06:21:26 | <@phuzion> | albertlarsan68: I'd say that it's probably better for the project to do this, because the project is going to have a starting point of some sort. URLTeam projects tend to be better suited when there's a relatively small URL space to work with. If the shortener slugs were say, 6 or 7 characters long, I'd probably fire it off tonight. But with sluts that are a-zA-Z0-9, 11 characters long, we're looking at an ENORMOUS scope of URLs to run, and |
06:21:26 | <@phuzion> | if we were to start absolutely slamming their site with 404 requests, tens of thousands of times per second, they'd ban our user agent in a heartbeat. |
06:21:41 | <@phuzion> | slugs, wow, what a typo |
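
To put the size argument above in numbers, here is a minimal Python sketch that counts the possible slugs for a given alphabet and length. The name `keyspace_size` is made up for illustration; the a-zA-Z0-9 alphabet and the 6, 7 and 11 character lengths come from the discussion. The result is only an upper bound, since not every slug is necessarily in use.

```python
import string

# Alphabet named in the discussion: a-z, A-Z, 0-9 (62 symbols).
ALPHABET = string.ascii_letters + string.digits


def keyspace_size(alphabet_size: int, slug_length: int) -> int:
    """Return the number of distinct slugs of exactly slug_length characters."""
    if alphabet_size <= 0 or slug_length < 0:
        raise ValueError("alphabet_size must be positive and slug_length non-negative")
    return alphabet_size ** slug_length


if __name__ == "__main__":
    # Compare the "small" lengths mentioned in the log with the 11-character case.
    for length in (6, 7, 11):
        print(f"{length:>2} chars: {keyspace_size(len(ALPHABET), length):,} possible slugs")
```

Going from 6 to 11 characters multiplies the space by 62^5, roughly 900 million times, which is why a blind enumeration is out of reach.
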
06:26:44 | <albertlarsan68> | I was proposing generating the potential response you would get if you asked the server, to keep the short links working. I agree it would be better to do this in a cross-project effort, i.e. maybe sending the slugs over and collecting them here. I'm not sure the range is the whole of the a-zA-Z0-9 * 11 spectrum, because not all IDs are filled, but it would |
06:26:44 | <albertlarsan68> | still be a huge space IMO. |
06:27:09 | <albertlarsan68> | Its stream is #bowlofpetunias FYI |
06:30:04 | <@phuzion> | albertlarsan68: It's late, and it's entirely possible I'm lacking some information to be able to fully understand what you're talking about, but I fail to see why the shortlinks need to be crawled if they're deterministic per the wiki page? |
06:31:09 | <albertlarsan68> | They do not, that's why I propose we **generate ourselves** the 302/301 that occurs. |
06:31:19 | <albertlarsan68> | Not asking the server. |
06:32:29 | <@flashfire42> | You are suggesting archiving the results of the shortlinks, which will be long links of Skyblog, which we plan to grab anyway? |
06:33:27 | <albertlarsan68> | Yep |
06:35:00 | | datechnoman quits [Quit: The Lounge - https://thelounge.chat] |
06:35:50 | <@phuzion> | albertlarsan68: So, let me make sure I am on the same page as you. Your suggestion is to generate a list of short URLs to scrape, and capture the 301/302, right? |
06:36:26 | <albertlarsan68> | My suggestion is to have a list of short URLs, and create the 302/301/... out of thin air. |
06:36:34 | | datechnoman (datechnoman) joins |
06:36:45 | <@phuzion> | albertlarsan68: Where would we get this list of short URLs? |
06:37:33 | <albertlarsan68> | Thin air??? Or the skyblog project, generate them from the Blog and Post IDs. |
06:37:52 | <albertlarsan68> | The goal would be for IA to be able to keep the short links alive. |
06:39:23 | | flashfire42|m joins |
06:39:47 | <@phuzion> | So you're suggesting that we brute force 100 billion post IDs, convert them to shortURLs, and then cram those into IA even though 99.99% of them will be 404s? |
06:40:26 | <albertlarsan68> | We might even (not backed by testing or research) be able to see if a post is put in "secret" mode, depending on the implementation: whether it redirects to the page but the page does not work, or the shortlink itself does not work. |
06:41:27 | <@phuzion> | Also, we tend not to generate data ourselves. Even if we know in principle how the shortener system works, we don't know for certain that there's not a bug that affects certain URLs, or other URLs break the rule of how the shortener works with manual overrides or something. |
06:42:07 | <albertlarsan68> | We would not cram all of those into IA, but the valid ones yes. Since we gather valid blog+post IDs in the skyblog project, we could transfer them to urlteam. |
06:42:38 | <@flashfire42> | I am not sure you understand how any of this works? |
06:42:43 | | @flashfire42 sets mode: +o flashfire42|m |
06:43:37 | <albertlarsan68> | flashfire42: Who are you talking to? |
06:43:56 | <@flashfire42> | You. Phuzion has been here longer than me if memory serves |
06:44:49 | <@phuzion> | I've had an account on the Wiki for almost 9 years, for whatever that's worth. |
06:45:32 | <albertlarsan68> | Anyway, it was just a kinda random thought that I have developed with you all, no problem if it won't work. |
06:46:19 | <albertlarsan68> | It is not something that I am attached to, I just want to be sure that no big part of the French internet history disappears. |
06:46:52 | <@flashfire42> | As it is, the shortlinks don't get directly archived at all. They are stored in a text document and put on IA. There are a few people slowly grepping them and putting them into the URL project, but yeah, that idea is not how any of this works |
06:48:18 | <@phuzion> | It's not that it won't work. Your idea of pulling post IDs from the project grab might be feasible. I was just confused earlier when you were talking about pulling a list of post IDs out of thin air, because if that was the case, we would have to scrape at 25,000 URLs per second until the shutdown, and that's a super quick way to get banned. |
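
The rate concern can be checked with some quick arithmetic. This is only a sketch: `days_to_enumerate` is a hypothetical helper, and the inputs are the figures quoted in the log (100 billion post IDs, 25,000 URLs per second), not measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_enumerate(total_ids: int, urls_per_second: int) -> float:
    """Return how many days a full sweep takes at a constant request rate."""
    if urls_per_second <= 0:
        raise ValueError("urls_per_second must be positive")
    return total_ids / urls_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # Figures taken from the conversation; both are rough estimates.
    days = days_to_enumerate(100_000_000_000, 25_000)
    print(f"{days:.1f} days of sustained requests")
```

With those inputs the sweep needs roughly 46 days of uninterrupted requests at that rate, which matches the worry about getting the user agent banned.
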
06:50:03 | <albertlarsan68> | It should be tested: if a "secret" post is created and its shortlink is accessed, what happens if we're not logged in? a) A normal 404 b) A redirect to the post page, but the post page errors c) Something else. |
06:50:31 | <albertlarsan68> | Since we don't know what blog an ID belongs to, this is what we should bruteforce. |
06:50:38 | <@phuzion> | Honestly, this sounds more and more like something that should be discussed in the project channel. |
06:51:26 | <@phuzion> | The URLTeam project is fairly scope-limited. We have a lot of URL shorteners that don't get scraped because we don't have the resources to develop custom code for each one of them. |
06:51:48 | <@phuzion> | But if someone is going to write a seesaw pipeline for this project, your idea about checking the URL shorteners could be implemented as a step. |
06:52:10 | <albertlarsan68> | OK, I'll try to move the ideas to the #bowlofpetunias channel. |
07:03:40 | | IDK (IDK) joins |
07:11:32 | | someone1 joins |
07:13:37 | | qw3rty joins |
07:26:42 | | masterX244 (masterX244) joins |
07:47:47 | | nulldata quits [Ping timeout: 252 seconds] |
07:56:42 | | nulldata joins |
07:57:33 | | nulldata is now authenticated as nulldata |
11:45:55 | | Ryz quits [Ping timeout: 265 seconds] |
12:13:28 | | Ryz (Ryz) joins |
12:38:29 | | PredatorIWD quits [Read error: Connection reset by peer] |
12:41:58 | | PredatorIWD joins |
13:51:45 | | W7RFa6AbNFz joins |
13:51:51 | | W7RFa6AbNFz quits [Remote host closed the connection] |
14:00:38 | | atphoenix__ quits [Remote host closed the connection] |
14:01:19 | | atphoenix__ (atphoenix) joins |
14:02:47 | | atphoenix__ is now known as atphoenix |
15:07:09 | | nulldata quits [Ping timeout: 258 seconds] |
15:17:25 | | nulldata joins |
15:24:42 | | VerifiedJ quits [Quit: The Lounge - https://thelounge.chat] |
15:25:22 | | VerifiedJ (VerifiedJ) joins |
16:01:47 | | driib quits [Client Quit] |
16:03:22 | | driib (driib) joins |
16:18:03 | | driib quits [Client Quit] |
16:19:56 | | driib (driib) joins |
17:15:12 | | T31M quits [Read error: Connection reset by peer] |
17:15:31 | | T31M joins |
17:16:43 | | T31M is now authenticated as T31M |
17:51:35 | | TheTechRobo quits [Client Quit] |
17:51:55 | | TheTechRobo (TheTechRobo) joins |
17:54:16 | | TheTechRobo quits [Remote host closed the connection] |
17:54:33 | | TheTechRobo (TheTechRobo) joins |
18:09:27 | | atphoenix quits [Read error: Connection reset by peer] |
18:10:09 | | atphoenix (atphoenix) joins |
18:50:11 | | nulldata quits [Client Quit] |
18:50:52 | | nulldata joins |
19:00:22 | | IDK quits [Client Quit] |
19:31:41 | | threedeeitguy quits [Client Quit] |
19:32:32 | | threedeeitguy (threedeeitguy) joins |
20:28:56 | | someone1 quits [Client Quit] |
21:58:04 | | driib quits [Client Quit] |
21:58:04 | | kiska quits [Quit: Ping timeout (120 seconds)] |
21:58:04 | | @flashfire42 quits [Quit: Ping timeout (120 seconds)] |
21:58:04 | | VerifiedJ quits [Client Quit] |
21:58:04 | | andrew quits [Quit: Ping timeout (120 seconds)] |
21:58:04 | | Matthww1 quits [Quit: Ping timeout (120 seconds)] |
21:58:04 | | ave quits [Quit: Ping timeout (120 seconds)] |
21:58:26 | | VerifiedJ (VerifiedJ) joins |
21:58:26 | | ave (ave) joins |
21:58:31 | | driib (driib) joins |
21:58:39 | | Matthww1 joins |
21:58:54 | | andrew (andrew) joins |
21:59:09 | | flashfire42 (flashfire42) joins |
21:59:09 | | @ChanServ sets mode: +o flashfire42 |
22:00:03 | | kiska (kiska) joins |