00:10:18<Somebody2>We can make a combined page that automatically includes them all.
00:10:27<Somebody2>Actually, I'm going to do that now.
00:11:47<Ryz>o.o;
00:22:59<@JAA>Hmm, true.
00:23:33<@JAA>I forgot that you can include pages, not just templates.
00:29:10<Somebody2>yep, any arbitrary page
00:29:13<Somebody2>(or pages)
01:02:24<Somebody2>OK, that's the two biggest pieces of the page split off. I'm not sure if the other tables should be split or not, so I'll wait for comments on that.
01:02:34<Somebody2>But now it should be easier to keep those ones auto-alphabetized, I think?
01:02:42<Somebody2>JAA?
01:21:41tbc1887 (tbc1887) joins
01:31:16<@JAA>Somebody2: Yeah, that should be easier. I see that there are technically separate lists for each starting letter. I don't think we care about that, do we? Visually, it only adds a couple pixels anyway, barely noticeable.
01:35:16<@JAA>(If it were a single list, the logic would be much simpler.)
01:48:02arkiver2 (arkiver) joins
01:48:02@ChanServ sets mode: +o arkiver2
01:48:02le0n quits [Ping timeout: 286 seconds]
01:48:02@arkiver quits [Remote host closed the connection]
01:48:02n9nes quits [Quit: ZNC 1.8.2 - https://znc.in]
01:48:03fuzzy8021 quits [Remote host closed the connection]
01:48:03yano quits [Remote host closed the connection]
01:48:03tbc1887 quits [Remote host closed the connection]
01:48:03Craigle quits [Quit: Ping timeout (120 seconds)]
01:48:03driib08353 quits [Client Quit]
01:48:03VerifiedJ quits [Client Quit]
01:48:03Ryz quits [Quit: Ping timeout (120 seconds)]
01:48:03@chfoo quits [Write error: Broken pipe]
01:48:03@OrIdow6 quits [Write error: Broken pipe]
01:48:09tbc1887 (tbc1887) joins
01:48:13OrIdow6 (OrIdow6) joins
01:48:13@ChanServ sets mode: +o OrIdow6
01:48:17driib083536 (driib) joins
01:48:26VerifiedJ9 (VerifiedJ) joins
01:48:29le0n (le0n) joins
01:48:29Craigle (Craigle) joins
01:48:31chfoo (chfoo) joins
01:48:31@ChanServ sets mode: +o chfoo
01:48:46Ryz (Ryz) joins
01:49:18n9nes joins
01:49:35yano (yano) joins
01:50:09@arkiver2 is now known as @arkiver
01:58:07fuzzy8021 (fuzzy8021) joins
02:00:32tbc1887 quits [Client Quit]
02:01:22tbc1887 (tbc1887) joins
02:17:27<Somebody2>Yeah, one list is fine
02:17:41<Somebody2>I think I separated them that way to make it easier to *manually* order them, but that's not a problem now
02:31:22<@JAA>Yeah, thought so. :-)
02:31:47<Ryz>So uhh, help with http://man.ac.uk/ Somebody2, since I think you're back? o:
02:35:30<Ryz>Would prefer not to leave it hanging; as I'll probably have to hold the 3 weird URL shorteners for a later time
02:49:26<@phuzion>Ryz: lemme take a look at man.ac.uk
02:50:11<@phuzion>Ryz: what exactly is going on with it?
02:51:00<Ryz>Uhh, I mentioned what I went through when trying to add that as a project
02:51:06<Ryz>In here, in the logs,
02:51:34<Ryz>Uhh, basically, a non-www link got redirected to a www-link, but it was counted as found
02:51:50<Ryz>Even though it didn't find the resulting URL
02:51:59<@phuzion>Do all non-www links get redirected to www links before getting redirected to the actual destination URL?
02:52:54<Ryz>Here's what it looks like for a valid URL: https://wheregoes.com/trace/20222460984/
02:53:00<Ryz>Doesn't look like it~
02:53:27<Ryz>The invalid entry in comparison: https://wheregoes.com/trace/20222460989/
02:53:29<@phuzion>Alright, I'll add man.ac.uk as an ignore
02:54:13<@phuzion>Actually, hang on
02:54:15<@phuzion>This isn't consistent.
02:54:16<Ryz>Oh?
02:54:44<@phuzion>Oh wait
02:54:54<@phuzion>Maybe it does consistently redirect to www.man.ac.uk if it's a 404?
02:55:32<Ryz>To note on http://man.ac.uk/ - it's a URL shortener but also it has non-URL shortener stuff
02:55:54<Ryz>Which was why at the time as I was going through this, I was hesitant to add it as a project
02:56:10<@phuzion>Yeah we might wanna ignore this one for now tbh
02:56:43<@phuzion>I'm not 100% sure but I think we can add man.ac.uk and www.man.ac.uk as redirect ignores, which should theoretically make it work.
02:56:48<Ryz>Oh, welp~
02:56:51<@phuzion>But like I said I'm not 100% sure about that.
02:57:09<Ryz>And I thought the next 3 I have would be much more difficult
02:57:21<Ryz>Not sure how long you'll be around to deal with that >.>;
02:57:33<@phuzion>I'll be around for a bit.
02:57:44@JAA sets the topic to: URL shorteners were a terrible idea | http://urlte.am/ | IRC Logs: https://archive.fart.website/bin/irclogger_log/urlteam https://logs.kiska.pw/urlteam/today | https://tracker.archiveteam.org:1338/
02:59:12<Ryz>Okay, hold on
03:00:03<Ryz>The first one from what I feel is the easiest to hardest of the non-normal looking URL shorteners is https://4fun.tw/ - the sample URL is https://4fun.tw/iedi
03:01:10<@phuzion>Ok so that one is probably going to require a lot of work to get going.
03:01:26<@phuzion>Because it appears to use javascript to actually fetch the link to redirect to.
03:01:47<Ryz>Their security check, when going through it, yeah; and I'm not sure if it applies to all entries or not~
03:06:41<Ryz>Mmm, this is going to be annoying, needing custom code s:
03:07:14<@phuzion>If it's even doable. If they start popping recaptchas after 10 hits per hour, that'll basically make it impossible.
03:08:59<Ryz>Oh, the reCAPTCHA stuff, oh no s:
03:10:26<Ryz>Is there a way to check something like that without running a project like this?
03:15:51<@phuzion>I mean, you could just automate hitting dozens of URLs with the same IP to check if they start hitting you with captchas or something.
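phuzion's probe idea could be sketched as below: request the same shortener repeatedly from one IP and scan each response body for captcha markup. This is a hypothetical sketch; the marker strings are assumptions about common captcha widgets, not anything 4fun.tw is known to serve.

```python
# Hypothetical captcha probe helper: check whether a response body
# appears to contain a captcha widget. The marker strings below are
# assumptions about common captcha markup, not 4fun.tw's actual HTML.
CAPTCHA_MARKERS = ("g-recaptcha", "h-captcha", "cf-challenge")

def looks_like_captcha(body: str) -> bool:
    """Return True if a response body appears to contain a captcha."""
    lower = body.lower()
    return any(marker in lower for marker in CAPTCHA_MARKERS)

# In practice you would fetch the same URL a few dozen times in a loop
# (e.g. with urllib.request) and stop once looks_like_captcha() trips.
```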
03:17:09<Ryz>Hmm, could also make a longer delay, although hmm...
03:17:42<Ryz>Is https://4fun.tw/ something we could do?
03:18:06<@phuzion>I'm gonna say let's pass on that one for now.
03:18:06<Ryz>Sampling of the IDs is that they're 4 character letters o.o;
03:18:37<Ryz>Oof, I did say these 3 are tough <_>;
03:20:18<@phuzion>Honestly, what I'd probably do is find the easier ones (anything marked as a YOURLS shortener is gonna be very easy) and knock as many of those out as possible.
03:24:25<Ryz>Mm, I wanted to try and work on the 3 because I found them myself at the time, and I'd rather not have them disappear too; found http://onelink.to/3q657x in 2018 November (the hardest one I feel), https://4fun.tw/iedi was found in 2021 May (easiest out of the three, welp), and http://rlu.ru/2KGu was found in 2021 July (2nd hardest)
03:25:35<Ryz>Gonna have to make a well of URL shorteners ideally when they are cleared out~
03:28:28<@phuzion>Ryz: rlu.ru is probably going to be achievable. Basically, you need to write a regular expression to extract the URL from the page that they serve you.
03:30:18<Ryz>Huh, interesting; the reason why I felt that it would be the second hardest is because JAA found out that an ID that redirects to another URL shortener gives a different prompt: http://rlu.ru/2YdDt
03:30:40<Ryz>Unless that's just the same somehow
03:30:45<@phuzion>Yeah that is because is.gd is another shortener.
03:31:00<@phuzion>So, you could theoretically write a regex that handles either of those.
03:34:07<Ryz>Huh...uhh, even with me doing a ton of ignores that have regexes in ArchiveBot, I don't really have a strong enough understanding; or maybe there's a different method to take here~
03:34:30<Ryz>Since it's not the usual 301 or 302 thing s:
03:36:00<@JAA>phuzion: Sometimes, it also redirects directly. It's a very 'fun' shortener.
03:36:12<@phuzion>Oh what the fuck lmao
03:36:35<@JAA>I wasn't able to reproduce it, but I saw it once earlier.
03:36:40<@phuzion>Ryz: https://tracker.archiveteam.org:1338/project/gkurl-us/settings here's a project that uses the "Content body regular expression" feature
03:36:43<@phuzion>It's a starting point.
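The "Content body regular expression" approach phuzion points at pulls the destination out of the HTML body instead of a Location header. A minimal sketch, assuming the interstitial page embeds the target as an ordinary absolute href (rlu.ru's real markup may differ and would need its own pattern):

```python
import re

# Assumed markup: the destination appears as the first absolute href
# in the page body. Adjust the pattern to the shortener's real HTML.
BODY_RE = re.compile(r'href="(https?://[^"]+)"')

def extract_target(body: str):
    """Return the first absolute URL linked in the page body, if any."""
    match = BODY_RE.search(body)
    return match.group(1) if match else None
```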
03:42:34<Ryz>Hmm, I'm trying to see how this exactly works in this case
03:43:52<Ryz>I'm trying to find a sample URL for http://gkurl.us/ to do view-source to see how it would react. and the three sample URLs from the wiki page don't work anymore
03:44:25<Ryz>Makes it hard for me to visualize how to do a regex correctly
03:54:29<@JAA>JAABot will now keep URLTeam/Dead sorted. It runs every hour.
03:57:14<Ryz>Yaaaaaaaay o:
03:58:49<Ryz>So phuzion, besides trying to find very easy URL shorteners, what other guidelines to keep in mind?
03:59:19<@phuzion>Keep an eye on the results and errors page. If things break, turn them off and troubleshoot what happened.
03:59:33<@phuzion>Honestly, this is a project that once you've got it set up, it kinda runs on its own.
03:59:49<@phuzion>As far as I know, people haven't been doing a lot here for months, and the projects that we have active just keep chugging along.
04:00:40<Ryz>Yeah, I've been seeing some 503s on the chng-it project but it's more sprinkles than constant~
04:08:21<Ryz>Hmm, checking https://wiki.archiveteam.org/index.php/URLTeam#Common_numbers - should I avoid IDs with 6 character letters?
04:24:08<Ryz>Hmm, I guess while waiting or killing time, guess I'll work on https://lttr.ai/tsAs since I recently mined a bunch of potential URL shorteners from https://archive.org/download/archiveteam_urls_20220412133750_0a3fe79f/urls_20220412133750_0a3fe79f.1606352862.megawarc.warc.os.cdx.gz
04:26:18<Ryz>Checking https://web.archive.org/web/*/https://lttr.ai/* - seems to be the standard alphabet
04:27:10<Ryz>Checking https://wheregoes.com/trace/20222461865/ - 302 Redirect
04:27:36<Ryz>Checking https://wheregoes.com/trace/20222461872/ for invalid ID, it's a 404
04:28:47<Ryz>phuzion, if you're there, is there a distinction between "No redirect status codes" and "Unavailable status codes"?
04:28:55<@phuzion>I'm not 100% sure to be honest.
04:28:58<Ryz>Somebody2 says it's the same oo;
04:30:00<Ryz>Well crap, even my main mentor supposed to guide me doesn't know <#>;
04:30:32<@phuzion>Let me poke around in some source code and see if I can differentiate the two
04:30:50<Ryz>Nothing else to say meanwhile on https://lttr.ai/ - saving settings...
04:31:09<Ryz>I guess besides https://lttr.ai/ redirects to https://missinglettr.com/
04:31:42<Ryz>And auto queue and queue~
04:32:46<Ryz>Woo, instant results, valid links too O:
04:33:10<@phuzion>A redirect status code means there is an unshortened URL. A no-redirect status code means that there is no unshortened URL. An unavailable status code means that the shortener has content issues (like a DMCA notice) with the unshortened URL. A banned status code means that the scraper client has been banned.
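The four categories phuzion describes map onto per-project sets of status codes in the tracker settings. A sketch with illustrative values; the actual sets are configured per shortener and will differ:

```python
# Illustrative status-code sets; real projects configure these per
# shortener in the tracker settings, so treat these as assumptions.
REDIRECT_CODES = {301, 302, 307}   # an unshortened URL exists
NO_REDIRECT_CODES = {404, 410}     # the shortcode was never assigned
UNAVAILABLE_CODES = {403, 451}     # target removed (e.g. a DMCA notice)
BANNED_CODES = {429}               # the scraper client has been banned

def classify(status: int) -> str:
    """Map an HTTP status code onto a tracker result category."""
    if status in REDIRECT_CODES:
        return "redirect"
    if status in NO_REDIRECT_CODES:
        return "no redirect"
    if status in UNAVAILABLE_CODES:
        return "unavailable"
    if status in BANNED_CODES:
        return "banned"
    return "unexpected"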
04:38:25<Ryz>Oh, so I guess I was kinda right initially, I thought "No redirect status codes" meant that the ID isn't valid <#>;
04:42:02<Ryz>Regarding https://lttr.ai/ phuzion, it looks like it might be sequential since it seems it keeps getting valid results
04:42:29<Ryz>There is a sparkle in the gap difference though o:
04:45:14<Ryz>Now working on https://l8r.it/eAL4
04:45:36<Ryz>https://l8r.it/ redirects to https://later.com/
04:47:25<Ryz>Checked https://web.archive.org/web/*/l8r.it/* - usual alphabet
04:48:43<Ryz>For valid, per https://wheregoes.com/trace/20222461993/ - 301 Redirect
04:49:14<Ryz>For not available or invalid, per https://wheregoes.com/trace/20222462003/ - it's a 400, not distinctly a 404
04:49:42<Ryz>Should I add 400 in "No redirect status codes", phuzion?
04:50:23<@phuzion>HTTP 400 is bad request, are you sure you're not just triggering a 400 somehow?
04:51:22<@phuzion>Yeah that's weird that they're returning 400 instead of 404 or something.
04:51:28<Ryz>Looking at dev tools, I see a 400 from my side as well
04:51:33<Ryz>In case https://wheregoes.com/trace/20222462003/ was wrong
04:51:50<@phuzion>But yes, add that to the no redirect status codes.
04:52:13<Ryz>Added it indeedy
04:53:33<Ryz>Time to run it as yet another project :p
04:57:19<Ryz>Hmm, phuzion, did I do something wrong? The entry is still showing X and Y, not numbers
04:58:41<Ryz>It seems to be taking longer than usual
04:58:43<@phuzion>Ryz: Just means we haven't gotten any results back yet as far as I can tell
04:58:53<@phuzion>Is the shortener known to be slow or rate-limiting?
04:59:31<Ryz>I'm not sure~
05:07:19<Ryz>Should I pause the project? It's still at X and Y
05:14:36<Ryz>phuzion?
05:15:15<@phuzion>Yeah disable it
05:16:00<Ryz>I now disabled it
05:16:26<Ryz>Meanwhile I added another entry via https://wiki.archiveteam.org/index.php/URLTeam/Warrior - even though it's not supposed to be pink
05:17:26<Ryz>And uhh, alphabetizing those new entries is awful, it's probably worse than having to alphabetize the dead URL shorteners...
05:18:41<Ryz>JAA, is it possible to use JAABot to alphabetize https://wiki.archiveteam.org/index.php/URLTeam/Warrior ? That was really awful and annoying for me having to sort the 6 new entries
05:22:37<Ryz>Uhhh, uhhh, phuzion, I see a new message on the main tracker thing, a red box saying "Too many error reports! Figure out what went wrong."
05:22:49<@phuzion>Lemme see what's up
05:23:01<@phuzion>Ok, so stuf.in is giving us 500s now
05:23:20<@phuzion>And toi.in is as well
05:23:43<Ryz>I was seeing a ton of errors regarding l8r-it
05:24:50<@phuzion>Ryz: I'd check and see if 500 is a possible "client banned" response
05:25:17<Ryz>I see 500, welp
05:26:31<@phuzion>It looks like maybe stuf.in is indicating issues with those particular URLs, but it's hard to say for sure.
05:28:19<@phuzion>stuf.in was causing the majority of the errors.
05:28:29<@phuzion>I disabled it and nuked the error reports, we seem to be good to go now.
05:31:58<Ryz>Woo~
05:33:32<Ryz>So phuzion, regarding https://wiki.archiveteam.org/index.php/URLTeam#Common_numbers - should I just stay away from URL shorteners with IDs being up to 6 character letters?
05:34:03<@phuzion>Depends on the situation
05:34:47<Ryz>Mm, because chng-it has yet to get even 1 URL that's found s:
05:35:08<Ryz>Meanwhile, still sucks that regarding https://l8r.it/ - it all ended with 500s just as it started
05:37:41<@phuzion>Might be worth it to disable it then.
05:37:54<@phuzion>Allow some of the other shorteners to get some attention paid to them.
05:39:54<Ryz>Yeah, disabling it
05:40:18<Ryz>Also, uhh, eye opener, checking https://web.archive.org/web/*/chng.it/* - it's /seven/ character letters X_x;
05:42:47michaelblob_ quits [Ping timeout: 265 seconds]
05:43:22<Ryz>Would it be better to adjust it to make sure it only grabs only 7 character letters?
05:43:37michaelblob (michaelblob) joins
05:47:52<Ryz>Probably gonna do one more, this time it's https://postly.app/Td0
05:49:25<Ryz>Checked https://web.archive.org/web/*/https://postly.app/* - again, usual alphabet
05:49:49<Ryz>...Oddly, https://postly.app/ on its own gives a 404
05:50:32<Ryz>Checking https://wheregoes.com/trace/20222462458/ for valid URL, it's a 307 Redirect this time
05:50:49<Ryz>Okay that's auto-covered~
05:51:21<Ryz>For not valid URL via https://wheregoes.com/trace/20222462464/ - it's a 404
05:53:14<Ryz>And time to run it~ >#<;
05:56:52<Ryz>Awwn, I think the postly-app project is a bust, phuzion
05:57:07<Ryz>Seems to be the same dealio as l8r-it...
05:58:06tbc1887 quits [Read error: Connection reset by peer]
05:58:09<Ryz>Gonna have to stop it just like l8r-it s:
05:59:12<Ryz>Wha...? It's 405
07:18:56qwertyasdfuiopghjkl joins
07:40:51driib083536 quits [Ping timeout: 252 seconds]
08:36:30tbc1887 (tbc1887) joins
10:16:44Terbium quits [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
10:17:10Terbium joins
11:14:29qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
11:30:19knecht42079 is now known as knecht420
11:33:52tbc1887 quits [Read error: Connection reset by peer]
12:15:45driib083536 (driib) joins
12:26:48TheTechRobo quits [Remote host closed the connection]
12:28:23TheTechRobo (TheTechRobo) joins
12:29:50TheTechRobo quits [Read error: Connection reset by peer]
12:30:27TheTechRobo (TheTechRobo) joins
12:30:43TheTechRobo quits [Remote host closed the connection]
12:31:05TheTechRobo (TheTechRobo) joins
12:47:50qwertyasdfuiopghjkl joins
12:56:43<Somebody2>Ryz: nice work -- everything looks good to me
13:07:35TheTechRobo quits [Ping timeout: 265 seconds]
13:10:33TheTechRobo (TheTechRobo) joins
13:15:28<Ryz>Uh, what are you referring to Somebody2? oo;
13:26:18<Ryz>Working on https://dy.si/y9FX7
13:27:17<Ryz>Checked https://web.archive.org/web/*/https://dy.si/* - seems to be the usual alphabet
13:27:35<Ryz>Checking valid URL as per https://wheregoes.com/trace/20222465805/ - it's a redirect
13:29:02<Ryz>For invalid URL as per https://wheregoes.com/trace/20222465832/ - uhh, interesting, it's a redirect to http://www.dynamicsignal.com/ - looks like I'll have to set up an ignore just for that
13:29:38<Ryz>Although what if there's a valid URL redirect that links to any link under http://www.dynamicsignal.com/ ?
13:30:26<Ryz>To clarify on redirect on valid URL, it's a 302 Redirect
13:34:23<Ryz>Hmm, phuzion or Somebody2, could something like (https?://dynamicsignal\.com|https?://www.dynamicsignal\.com)$ work under "Location header reject regular expression" entry?
13:35:01<Ryz>Mmm, it's leaving out the '/' thing before the '$'
13:35:17<Ryz>Redirects can be weird on having with or without that
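The trailing-slash worry above can be handled by making the slash optional before the anchor. A sketch of the reject pattern under test; how the tracker applies its "Location header reject regular expression" (anchored or not) is an assumption here:

```python
import re

# Reject pattern for dy.si's soft-404 redirect, tolerating an optional
# www. prefix and an optional trailing slash. Whether the tracker
# anchors the pattern itself is an assumption in this sketch.
REJECT_RE = re.compile(r'^https?://(?:www\.)?dynamicsignal\.com/?$')

def is_soft_404(location: str) -> bool:
    """True if the redirect target is the shortener's bare homepage."""
    return REJECT_RE.match(location) is not None
```

Because the pattern is anchored with `$` after an optional `/`, a hypothetical valid deep link such as `http://www.dynamicsignal.com/careers` would still be kept, which addresses the concern about legitimate redirects into the same domain.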
14:39:07<Ryz>While waiting,
14:39:11<Ryz>Working on https://ed.gr/d0laa
14:42:27<Ryz>Checking https://web.archive.org/web/*/https://ed.gr/* - interesting, uhh, the alphabet doesn't use uppercase letters; but uhh, https://ed.gr/d0laa and https://ed.gr/d0lAA are the same
14:45:26<Ryz>For valid, checking https://wheregoes.com/trace/20222466298/ - 301 Redirect
14:46:05<Ryz>For unavailable, checking https://wheregoes.com/trace/20222466350/ - uhh, there's a 302 Redirect followed by a 200 with https://rebrandly.com/404
14:47:33<Ryz>So with "Location header reject regular expression" entry, it would be https?://rebrandly\.com/404$
14:47:54<Ryz>Guess I'll have to wait for a response for this; phuzion or Somebody2 ^
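For ed.gr the dead marker is not a status code but a hop in the redirect chain, so a check has to look at each hop's Location. A sketch that classifies a trace given a list of (status, location) pairs, as a trace tool like wheregoes.com reports them; the regex and chain format are assumptions based on the traces above:

```python
import re

# Soft-404 marker observed for ed.gr: a redirect to rebrandly.com/404.
SOFT404_RE = re.compile(r'^https?://(?:www\.)?rebrandly\.com/404/?$')

def classify_chain(hops):
    """Classify a redirect trace given (status, location) hop pairs."""
    for status, location in hops:
        if 300 <= status < 400 and location:
            if SOFT404_RE.match(location):
                return "no redirect"   # ID resolves to the 404 page
            return "redirect"          # genuine unshortened URL
    return "error"                     # no redirect hop seen at all
```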
15:33:51fuzzy8021 quits [Ping timeout: 252 seconds]
16:20:54fuzzy8021 (fuzzy8021) joins
17:03:02<Ryz>Uhh, there's a red box that says "Too many results! Run the export script.", what the hell does that mean...?
17:23:17<h2ibot>[AT] URLTeam tracker https://tracker.archiveteam.org:1338/api/health is down (507,HTTP status returned: 507)
17:27:39<@JAA>Ryz: Sorting URLTeam/Warrior is a *lot* more complicated than URLTeam/Dead. Not going to happen anytime soon I'm afraid.
17:28:59<@JAA>chfoo: 'Too many results! Run the export script.'
17:29:20<Ryz>How often does that happen?
17:30:16<@JAA>Rarely, like a few times since I've been here. But now it might be related to the addition of various shorteners producing more results than previously.
17:30:49<@JAA>There's an automatic thing in the background that periodically exports the results and uploads them to IA.
17:31:21<Ryz>Yeah, I've been adding more and more URL shorteners since I got access, sometimes with mixed results, oof
17:33:37<Ryz>Just waiting for one of the two to aid me since I added more URL shorteners but haven't ran them yet
17:43:08<Ryz>Oh, interesting, it seems to stop at 15 million results
19:20:27n9nes quits [Ping timeout: 252 seconds]
19:21:31n9nes joins
19:28:09JackThompson05 quits [Ping timeout: 252 seconds]
20:20:49<Somebody2>I was just saying: "good job" in general with regards to your additions.
20:30:36<Ryz>Aaaaah~ >#<;
20:30:49<Ryz>So yeah, I can't really do much since the results bin is too full <#>;
20:34:07<@chfoo>i changed the crontab entry to be more frequent and i'm running the export manually now.
20:35:03<Ryz>Oh hey, "Export in Progress" followed by "We're currently exporting URLs to the Internet Archive. Please check back later; we'll be back soon!"; how long will this take, curious? o:
20:37:23<@chfoo>it blocks the entire system for about 10 minutes, if i recall correctly
20:46:57<Ryz>Woo, it's finished~
20:47:09<h2ibot>[AT] URLTeam tracker https://tracker.archiveteam.org:1338/api/health is up (200,Success)
20:48:17<Ryz>crontab, what are you talking about, chfoo? oo;
20:53:39<@chfoo>the export script just gets run by cron every few hours. nothing special about it.
20:55:43<Ryz>Oh, it's so that you don't have to export manually~
20:56:08<Ryz>Heheh, yeah, I may be throwing more and more URL shorteners <.<;
20:56:19<Ryz>*export manually as frequently
23:26:01qwertyasdfuiopghjkl quits [Remote host closed the connection]
23:46:05<datechnoman>Great work on expanding the project Ryz. Great to see it moving right along with new shorteners as it has sat relatively stale for the last year or so. Keep up the good work :)
23:46:20<Ryz>Woo >#<;
23:46:41<Ryz>I still need some help with some of the URL shorteners I set up but need further guidance
23:47:53<datechnoman>That's all good. Once you get some help i'm sure you will keep motoring through the lists
23:48:31<Ryz>Yeah, mainly focusing on the earlier URL shorteners ideally >#<;
23:51:05<datechnoman>Yeah and keep working up from there. Sounds like a good plan. Least there is plenty of capacity for more shorteners
23:56:37<Ryz>I think before doing any further tackling, I think it would be a good idea to review the existing projects, not just the active ones~
23:56:51<Ryz>Because I'm not sure when was the last time that was extensively worked on
23:58:58qwertyasdfuiopghjkl joins