05:42:56fuzzy8021 quits [Ping timeout: 252 seconds]
05:47:44fuzzy8021 (fuzzy8021) joins
09:00:32Maturion joins
09:25:20Maturion quits [Remote host closed the connection]
15:34:13kiryu quits [Remote host closed the connection]
21:06:31<thuban>arkiver: how is orange going? something come up?
21:11:47<imer>soon™
22:06:47Peroniko quits [Client Quit]
22:33:33<@arkiver>thuban: what characters can the 'slug' have?
22:34:55<pokechu22>For subdomains, there's a wide variety of possible ones, some of which are actually garbage that don't work at all (including one that's malformed IDN which is my fault, based on an attempt at cleaning up something else that was also junky)
22:36:16<pokechu22>a lot of old CDX data likes having trailing %0A in URLs too for some reason - note that I used CDX data without limiting it to 2XXs. I generally cleaned that up but the 2GB of stuff I sent also includes the junky versions
22:38:54<@arkiver>but the slug
22:38:59<@arkiver>so like the username/blogname
22:39:04<@arkiver>what characters can that have?
22:41:09<@arkiver>due to the small size i'm not going to care much here about duplicates, and will queue back every type of URL individually
22:45:45<pokechu22>Only things that are valid as subdomains I'm pretty sure, though I think there are some examples with accented characters (not sure if those were ever valid or not)
22:46:46<pokechu22>Some of the existing data has special characters but I don't think most of those are valid (I'm pretty sure some of them trigger a 400 bad request, maybe + characters? I don't remember exactly)
22:51:29<pokechu22>If you have a link like http://perso.wanadoo.fr/shihtzupassion or http://perso.orange.fr/shihtzupassion it'll only be meaningful to do something with it if https://shihtzupassion.pagesperso-orange.fr/ is valid - I haven't seen any counterexamples
22:58:55<@arkiver>what a mess
22:59:01<@arkiver>i'm just going to match everything with everything
22:59:12<@arkiver>if that becomes problematic due to site performance we can change it
23:00:08<pokechu22>There is a specific set of rules though, which I'll try to write up
23:01:48<pokechu22>http://perso.wanadoo.fr/slug and http://slug.perso.orange.fr/ and http://perso.orange.fr/slug and http://slug.perso.orange.fr/ and http://pagesperso-orange.fr/slug all become http://slug.pagesperso-orange.fr (https if slug does not contain any periods)
23:02:10<pokechu22>er, http://slug.perso.wanadoo.fr/
23:03:32<pokechu22>http://monsite.wanadoo.fr/slug and http://slug.monsite.wanadoo.fr/ and http://monsite.orange.fr/slug and http://slug.monsite.orange.fr/ and http://monsite-orange.fr/slug all become http://slug.monsite-orange.fr/ (https if slug does not contain any periods)
23:05:37<pokechu22>http://pro.wanadoo.fr/slug and http://slug.pro.wanadoo.fr/ and http://pro.orange.fr/slug and http://slug.pro.orange.fr and http://pros.orange.fr/slug and http://slug.pros.orange.fr/ and http://pagespro-orange.fr/slug all map to http://slug.pagespro-orange.fr/
23:05:58<pokechu22>As far as I know, there is no sharing between pagesperso-orange.fr, monsite-orange.fr, and pagespro-orange.fr
23:07:11<pokechu22>but pagespro-orange.fr does have a slight complication: https://slug.pagespro-orange.fr/ == https://slug.mairie.pagespro-orange.fr/ == https://slug.assoc.pagespro-orange.fr/ == https://slug.ecole.pagespro-orange.fr/ and THOSE all have the same content. It's probably fine to duplicate work there though since there are far fewer pro sites than anything else
23:07:45<pokechu22>still the rule seems to be https if slug doesn't have periods, http if slug does, just with ".assoc", ".mairie", and ".ecole" not being part of the slug
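[editor's note] The mapping rules pokechu22 lists above can be sketched roughly as follows. This is only an illustration, not the code the project used: the names `LEGACY_HOSTS`, `PRO_CATEGORIES` and `normalize` are hypothetical, lowercasing the slug is an assumption, and hosts such as pages.perso.orange.fr (the shared error page mentioned later in the log) are not special-cased and would be treated like any other slug.

```python
from urllib.parse import urlsplit

# Old host -> current host, taken from the rules in the chat (names are hypothetical).
LEGACY_HOSTS = {
    "perso.wanadoo.fr": "pagesperso-orange.fr",
    "perso.orange.fr": "pagesperso-orange.fr",
    "pagesperso-orange.fr": "pagesperso-orange.fr",
    "monsite.wanadoo.fr": "monsite-orange.fr",
    "monsite.orange.fr": "monsite-orange.fr",
    "monsite-orange.fr": "monsite-orange.fr",
    "pro.wanadoo.fr": "pagespro-orange.fr",
    "pro.orange.fr": "pagespro-orange.fr",
    "pros.orange.fr": "pagespro-orange.fr",
    "pagespro-orange.fr": "pagespro-orange.fr",
}

# Labels that sit between the slug and pagespro-orange.fr but are not part of the slug.
PRO_CATEGORIES = ("assoc", "mairie", "ecole")


def normalize(url):
    """Return the current-form URL for a known Orange/Wanadoo URL, or None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    for old, new in LEGACY_HOSTS.items():
        if host == old:
            # Path form: http://old-host/slug/rest
            segments = parts.path.lstrip("/").split("/", 1)
            slug = segments[0].lower()  # assumption: slugs are case-insensitive
            rest = "/" + segments[1] if len(segments) > 1 else "/"
            break
        if host.endswith("." + old):
            # Subdomain form: http://slug.old-host/rest
            slug = host[: -len(old) - 1]
            rest = parts.path or "/"
            break
    else:
        return None  # not one of the hosts covered by the rules
    if not slug:
        return None

    # For the https/http decision, ignore a trailing .assoc/.mairie/.ecole on pro sites.
    bare = slug
    if new == "pagespro-orange.fr":
        for category in PRO_CATEGORIES:
            if bare.endswith("." + category):
                bare = bare[: -len(category) - 1]
                break

    # https if the slug contains no periods, http otherwise.
    scheme = "http" if "." in bare else "https"
    if parts.query:
        rest += "?" + parts.query
    return f"{scheme}://{slug}.{new}{rest}"
```

For example, `normalize("http://perso.wanadoo.fr/shihtzupassion")` gives the `https://shihtzupassion.pagesperso-orange.fr/` form from earlier in the log, while a slug containing a period stays on plain http.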
23:09:10<pokechu22>Also, subdomains under orange.fr and wanadoo.fr either redirect to a page telling you about the new site (without redirecting you to the new site directly) or don't load at all (the case for perso.wanadoo.fr and pro.wanadoo.fr, but monsite.wanadoo.fr does still give that redirect page)
23:13:00<@arkiver>i'll use exactly what you wrote
23:14:39<fireonlive>jesus christ
23:16:19<pokechu22>Also worth noting that I attempted to normalize some URLs already, but the 2GB of URLs includes both forms already. It'd be necessary for the crawl to know these rules though since some sites use absolute links to older forms too :|
23:17:47<@arkiver>can you post your new list please?
23:19:34<pokechu22>The one I linked before is https://transfer.archivete.am/Ijfl4/Orange_all_lists.7z (230MB compressed) - the ones with "seed_urls" or similar are the ones I cleaned up based on the other lists, but my organization is a mess
23:19:46<pokechu22>(and I don't 100% remember what all of the files are :|)
23:22:09<@arkiver>shall i just queue everything with both http and https
23:24:14<pokechu22>That's probably fine since it'll just redirect to the correct form if you get it wrong (and if you get it wrong with https it'll have an invalid cert)
23:24:53<pokechu22>I'd be a bit concerned about rate-limiting though
23:25:07<@arkiver>did you find any limits?
23:25:11<pokechu22>Yes!
23:25:17<@arkiver>what do they look like?
23:25:19<pokechu22>1 request per second
23:25:35<pokechu22>You can go faster temporarily but sustained higher speeds will result in timeouts for 24 hours
23:25:41<@arkiver>right
23:26:04<pokechu22>I don't know for sure whether http <--> https redirects count to the rate limits but it seems best to assume they will
23:26:31<pokechu22>the redirects to the single shared 404 error page counted at least (though that's something you can skip fortunately)
23:27:15<@arkiver>do they have any interesting status codes?
23:27:27<@arkiver>what do they give when you're banned/rate limited?
23:27:54<pokechu22>No status code at all, the request just times out. There might also be cases of refused connections, I think
23:28:03<@arkiver>alright
23:28:06<pokechu22>I don't think I saw any 429s
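[editor's note] A minimal sketch of pacing requests to the roughly 1 request per second limit described above. The `Pacer` class is hypothetical and not taken from any crawler mentioned in the log; the clock and sleep functions are injectable only so the behaviour can be checked without real waiting. It assumes every request, including redirects, counts against the limit, as pokechu22 suggests assuming.

```python
import time


class Pacer:
    """Space out calls so that at most one happens per min_interval seconds."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self.clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self.sleep(delay)
            now = self.clock()
        self._next_allowed = now + self.min_interval
        return delay
```

Calling `pacer.wait()` before every request keeps the sustained rate at or below one per second. Since a ban shows up as timeouts rather than a 429, a real crawl would probably also want to stop or back off when several requests in a row time out.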
23:32:15<pokechu22>There are also some sites that aren't public (or something like that) which redirect to a common error page as well (https://pages.perso.orange.fr/pages-perso-error&r=403 - probably through e.orange.fr or something like that but I'd need to double-check)
23:32:54<pokechu22>ah, there are also 401s, e.g. https://cath.monsite-orange.fr
23:33:16<@arkiver>we're going to only accept 200 and 3xx initially
23:33:19<pokechu22>... and a 403: https://callune.monsite-orange.fr
23:34:25<pokechu22>ok, https://30-rue-louis-pons.monsite-orange.fr/ redirects directly to https://pages.perso.orange.fr/pages-perso-error&r=403 while the more common situation is that https://30villard90vercors.monsite-orange.fr goes to https://r.orange.fr/r/Oerreur_404 goes to https://e.orange.fr/error404.html
23:34:43<pokechu22>I guess if you queue 3xxs as new tasks instead of directly following them that'd deduplicate things without extra work
23:36:47<@arkiver>we'll queue them individually, but not follow the redirect
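[editor's note] The approach agreed on above (accept only 200 and 3xx, and queue redirect targets as new items instead of following them) could look something like this. `handle_response`, `queue` and `seen` are hypothetical names for illustration; the actual project tooling is not shown in the log, and the specific 3xx code the site returns is not stated there either.

```python
from collections import deque
from urllib.parse import urljoin


def handle_response(url, status, location, queue, seen):
    """Decide whether to keep a response; queue redirect targets once each.

    Returns True for 200 and 3xx responses, False for anything else.
    """
    if status == 200:
        return True
    if 300 <= status < 400:
        if location:
            # Resolve relative Location headers against the requested URL.
            target = urljoin(url, location)
            if target not in seen:
                # Queued as its own item, not followed immediately, so a shared
                # error page is only fetched once however many sites point at it.
                seen.add(target)
                queue.append(target)
        return True
    return False
```

With this shape, many sites redirecting to the same error page add that page to the queue only the first time it is seen.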