#urlteam log for 2023-07-04


00:25:59		sidpatchy joins
02:25:43		threedeeitguy quits [Client Quit]
02:46:16		threedeeitguy (threedeeitguy) joins
03:05:06		nulldata quits [Quit: The Lounge - https://thelounge.chat]
03:13:16		nulldata joins
03:32:19		IDK quits [Client Quit]
04:34:32		eroc1990 quits [Client Quit]
04:35:29		eroc1990 (eroc1990) joins
05:12:58		Ryz quits [Ping timeout: 265 seconds]
05:15:59		qw3rty quits [Ping timeout: 252 seconds]
05:24:39	<@phuzion>	albertlarsan68: Generally, you can either add it to the wiki page or just report the URL shortener here, and someone will get to adding it eventually.
05:24:53	<@phuzion>	albertlarsan68: What is the shortener you'd like to add?
05:39:07		Ryz (Ryz) joins
05:51:48	<albertlarsan68>	I was thinking of adding the sk.mu shortener, already documented on the wiki in the Skyblog page.
05:53:37	<albertlarsan68>	I'm not sure it is a good idea, since it seems to be deterministic, and we can know where it points to just by looking at it.
06:02:44		Ryz quits [Ping timeout: 252 seconds]
06:09:38		Ryz (Ryz) joins
06:15:52	<@phuzion>	albertlarsan68: Yeah, looking at the info about how the URLs are generated, it's unlikely that sk.mu would be a good candidate to run in terroroftinytown
06:16:56	<albertlarsan68>	Maybe finding a way to generate the redirects would be a way to archive those URLs nonetheless?
06:21:26	<@phuzion>	albertlarsan68: I'd say that it's probably better for the project to do this, because the project is going to have a starting point of some sort. URLTeam projects tend to be better suited when there's a relatively small URL space to work with. If the shortener slugs were say, 6 or 7 characters long, I'd probably fire it off tonight. But with sluts that are a-zA-Z0-9, 11 characters long, we're looking at an ENORMOUS scope of URLs to run, and
06:21:26	<@phuzion>	if we were to start absolutely slamming their site with 404 requests, tens of thousands of times per second, they'd ban our user agent in a heartbeat.
06:21:41	<@phuzion>	slugs, wow, what a typo
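For a sense of the scale being discussed, here is a minimal sketch of the keyspace arithmetic. It assumes, as stated in the conversation, that slugs draw from a-zA-Z0-9 (62 symbols) and that every position can take any symbol; the function name `keyspace` is made up for illustration and is not part of terroroftinytown.

```python
import string

# Assumption: slugs use a-zA-Z0-9, as described in the discussion above.
ALPHABET = string.ascii_letters + string.digits


def keyspace(slug_length: int, alphabet_size: int = len(ALPHABET)) -> int:
    """Number of distinct slugs of exactly slug_length characters."""
    return alphabet_size ** slug_length


if __name__ == "__main__":
    # Compare the "comfortable" lengths with the 11-character case.
    for length in (6, 7, 11):
        print(f"{length} chars: {keyspace(length):,} possible slugs")
```

Under these assumptions, 6 characters give about 5.7e10 slugs, 7 characters about 3.5e12, and 11 characters about 5.2e19, which is why an exhaustive run is treated as out of reach.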
06:26:44	<albertlarsan68>	I was proposing generating the potential response you would get if you asked the server, to keep the short links working. I agree it would be better to do this in a cross-project effort, i.e. maybe sending slugs over and collecting them here. I'm not sure the range is the whole of the a-zA-Z0-9 * 11 spectrum, because not all IDs are filled, but it would
06:26:44	<albertlarsan68>	still be a huge space IMO.
06:27:09	<albertlarsan68>	Its stream is #bowlofpetunias FYI
06:30:04	<@phuzion>	albertlarsan68: It's late, and it's entirely possible I'm lacking some information to be able to fully understand what you're talking about, but I fail to see why the shortlinks need to be crawled if they're deterministic per the wiki page?
06:31:09	<albertlarsan68>	They do not, that's why I propose we generate the 302/301 ourselves.
06:31:19	<albertlarsan68>	Not asking the server.
06:32:29	<@flashfire42>	You are suggesting archiving the results of the shortlinks, which will be long links of Skyblog which we plan to grab anyway?
06:33:27	<albertlarsan68>	Yep
06:35:00		datechnoman quits [Quit: The Lounge - https://thelounge.chat]
06:35:50	<@phuzion>	albertlarsan68: So, let me make sure I am on the same page as you. Your suggestion is to generate a list of short URLs to scrape, and capture the 301/302, right?
06:36:26	<albertlarsan68>	My suggestion is to have a list of short URLs, and create the 302/301/... out of thin air.
06:36:34		datechnoman (datechnoman) joins
06:36:45	<@phuzion>	albertlarsan68: Where would we get this list of short URLs?
06:37:33	<albertlarsan68>	Thin air??? Or the skyblog project, generate them from the Blog and Post IDs.
06:37:52	<albertlarsan68>	The goal would be for IA to be able to keep the short links alive.
06:39:23		flashfire42\|m joins
06:39:47	<@phuzion>	So you're suggesting that we brute force 100 billion post IDs, convert them to shortURLs, and then cram those into IA even though 99.99% of them will be 404s?
06:40:26	<albertlarsan68>	We could even (not backed by testing or research) be able to see if a post is put in "secret" mode, depending on the implementation: either it redirects to the page but the page does not work, or the shortlink itself does not work.
06:41:27	<@phuzion>	Also, we tend not to generate data ourselves. Even if we know in principle how the shortener system works, we don't know for certain that there's not a bug that affects certain URLs, or other URLs break the rule of how the shortener works with manual overrides or something.
06:42:07	<albertlarsan68>	We would not cram all of those into IA, but the valid ones yes. Since we gather valid blog+post IDs in the skyblog project, we could transfer them to urlteam.
06:42:38	<@flashfire42>	I am not sure you understand how any of this works?
06:42:43		@flashfire42 sets mode: +o flashfire42\|m
06:43:37	<albertlarsan68>	flashfire42: Who are you talking to?
06:43:56	<@flashfire42>	You. Phuzion has been here longer than me if memory serves
06:44:49	<@phuzion>	I've had an account on the Wiki for almost 9 years, for whatever that's worth.
06:45:32	<albertlarsan68>	Anyway, it was just a kinda random thought that grew out of talking with you all, no problem if it won't work.
06:46:19	<albertlarsan68>	It is not something that I am attached to, I just want to be sure that no big part of the French internet history disappears.
06:46:52	<@flashfire42>	As it is, the shortlinks don't get directly archived at all. They are stored in a text document and put on IA. There are a few people slowly grepping them and putting them into the URL project, but yeah, that idea is not how any of this works
06:48:18	<@phuzion>	It's not that it won't work. Your idea of pulling post IDs from the project grab might be feasible. I was just confused earlier when you were talking about pulling a list of post IDs out of thin air, because if that was the case, we would have to scrape at 25,000 URLs per second until the shutdown, and that's a super quick way to get banned.
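The two figures quoted in this exchange (100 billion post IDs, 25,000 URLs per second) can be related with a quick back-of-the-envelope calculation. This is only a sketch: the log does not say how the rate was derived or when the shutdown is, and `days_to_scan` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 86_400


def days_to_scan(total_ids: int, requests_per_second: float) -> float:
    """Days needed to request every ID once at a constant request rate."""
    return total_ids / requests_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # Figures taken from the conversation; the link between them is an assumption.
    total_ids = 100_000_000_000
    rate = 25_000
    print(f"{days_to_scan(total_ids, rate):.1f} days at {rate:,} requests/second")
```

At that sustained rate a full sweep would take a little over 46 days, which illustrates why brute forcing the ID space is described as a quick way to get banned.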
06:50:03	<albertlarsan68>	It should be tested what happens if a "secret" post is created and its shortlink is accessed while not logged in: a) A normal 404 b) A redirect to the post page, but the post page errors c) Something else.
06:50:31	<albertlarsan68>	Since we don't know what blog an ID belongs to, this is what we should bruteforce.
06:50:38	<@phuzion>	Honestly, this sounds more and more like something that should be discussed in the project channel.
06:51:26	<@phuzion>	The URLTeam project is fairly scope-limited. We have a lot of URL shorteners that don't get scraped because we don't have the resources to develop custom code for each one of them.
06:51:48	<@phuzion>	But if someone is going to write a seesaw pipeline for this project, your idea about checking the URL shorteners could be implemented as a step.
06:52:10	<albertlarsan68>	OK, I'll try to move the ideas to the #bowlofpetunias channel.
07:03:40		IDK (IDK) joins
07:11:32		someone1 joins
07:13:37		qw3rty joins
07:26:42		masterX244 (masterX244) joins
07:47:47		nulldata quits [Ping timeout: 252 seconds]
07:56:42		nulldata joins
07:57:33		nulldata is now authenticated as nulldata
11:45:55		Ryz quits [Ping timeout: 265 seconds]
12:13:28		Ryz (Ryz) joins
12:38:29		PredatorIWD quits [Read error: Connection reset by peer]
12:41:58		PredatorIWD joins
13:51:45		W7RFa6AbNFz joins
13:51:51		W7RFa6AbNFz quits [Remote host closed the connection]
14:00:38		atphoenix__ quits [Remote host closed the connection]
14:01:19		atphoenix__ (atphoenix) joins
14:02:47		atphoenix__ is now known as atphoenix
15:07:09		nulldata quits [Ping timeout: 258 seconds]
15:17:25		nulldata joins
15:24:42		VerifiedJ quits [Quit: The Lounge - https://thelounge.chat]
15:25:22		VerifiedJ (VerifiedJ) joins
16:01:47		driib quits [Client Quit]
16:03:22		driib (driib) joins
16:18:03		driib quits [Client Quit]
16:19:56		driib (driib) joins
17:15:12		T31M quits [Read error: Connection reset by peer]
17:15:31		T31M joins
17:16:43		T31M is now authenticated as T31M
17:51:35		TheTechRobo quits [Client Quit]
17:51:55		TheTechRobo (TheTechRobo) joins
17:54:16		TheTechRobo quits [Remote host closed the connection]
17:54:33		TheTechRobo (TheTechRobo) joins
18:09:27		atphoenix quits [Read error: Connection reset by peer]
18:10:09		atphoenix (atphoenix) joins
18:50:11		nulldata quits [Client Quit]
18:50:52		nulldata joins
19:00:22		IDK quits [Client Quit]
19:31:41		threedeeitguy quits [Client Quit]
19:32:32		threedeeitguy (threedeeitguy) joins
20:28:56		someone1 quits [Client Quit]
21:58:04		driib quits [Client Quit]
21:58:04		kiska quits [Quit: Ping timeout (120 seconds)]
21:58:04		@flashfire42 quits [Quit: Ping timeout (120 seconds)]
21:58:04		VerifiedJ quits [Client Quit]
21:58:04		andrew quits [Quit: Ping timeout (120 seconds)]
21:58:04		Matthww1 quits [Quit: Ping timeout (120 seconds)]
21:58:04		ave quits [Quit: Ping timeout (120 seconds)]
21:58:26		VerifiedJ (VerifiedJ) joins
21:58:26		ave (ave) joins
21:58:31		driib (driib) joins
21:58:39		Matthww1 joins
21:58:54		andrew (andrew) joins
21:59:09		flashfire42 (flashfire42) joins
21:59:09		@ChanServ sets mode: +o flashfire42
22:00:03		kiska (kiska) joins
