| 05:42:56 | | fuzzy8021 quits [Ping timeout: 252 seconds] |
| 05:47:44 | | fuzzy8021 (fuzzy8021) joins |
| 09:00:32 | | Maturion joins |
| 09:25:20 | | Maturion quits [Remote host closed the connection] |
| 15:34:13 | | kiryu quits [Remote host closed the connection] |
| 21:06:31 | <thuban> | arkiver: how is orange going? something come up? |
| 21:11:47 | <imer> | soon™ |
| 22:06:47 | | Peroniko quits [Client Quit] |
| 22:33:33 | <@arkiver> | thuban: what characters can the 'slug' have? |
| 22:34:55 | <pokechu22> | For subdomains, there's a wide variety of possible ones, some of which are actually garbage that don't work at all (including one that's malformed IDN which is my fault, based on an attempt at cleaning up something else that was also junky) |
| 22:36:16 | <pokechu22> | a lot of old CDX data likes having trailing %0A in URLs too for some reason - note that I used CDX data without limiting it to 2XXs. I generally cleaned that up but the 2GB of stuff I sent also includes the junky versions |
| 22:38:54 | <@arkiver> | but the slug |
| 22:38:59 | <@arkiver> | so like the username/blogname |
| 22:39:04 | <@arkiver> | what characters can that have? |
| 22:41:09 | <@arkiver> | due to the small size i'm not going to care much here about duplicates, and will queue back every type of URL individually |
| 22:45:45 | <pokechu22> | Only things that are valid as subdomains I'm pretty sure, though I think there are some examples with accented characters (not sure if those were ever valid or not) |
| 22:46:46 | <pokechu22> | Some of the existing data has special characters but I don't think most of those are valid (I'm pretty sure some of them trigger a 400 bad request, maybe + characters? I don't remember exactly) |
| 22:51:29 | <pokechu22> | If you have a link like http://perso.wanadoo.fr/shihtzupassion or http://perso.orange.fr/shihtzupassion it'll only be meaningful to do something with it if https://shihtzupassion.pagesperso-orange.fr/ is valid - I haven't seen any counterexamples |
| 22:58:55 | <@arkiver> | what a mess |
| 22:59:01 | <@arkiver> | i'm just going to match everything with everything |
| 22:59:12 | <@arkiver> | if that becomes problematic due to site performance we can change it |
| 23:00:08 | <pokechu22> | There is a specific set of rules though, which I'll try to write up |
| 23:01:48 | <pokechu22> | http://perso.wanadoo.fr/slug and http://slug.perso.orange.fr/ and http://perso.orange.fr/slug and http://slug.perso.orange.fr/ and http://pagesperso-orange.fr/slug all become http://slug.pagesperso-orange.fr (https if slug does not contain any periods) |
| 23:02:10 | <pokechu22> | er, http://slug.perso.wanadoo.fr/ |
| 23:03:32 | <pokechu22> | http://monsite.wanadoo.fr/slug and http://slug.monsite.wanadoo.fr/ and http://monsite.orange.fr/slug and http://slug.monsite.orange.fr/ and http://monsite-orange.fr/slug all become http://slug.monsite-orange.fr/ (https if slug does not contain any periods) |
| 23:05:37 | <pokechu22> | http://pro.wanadoo.fr/slug and http://slug.pro.wanadoo.fr/ and http://pro.orange.fr/slug and http://slug.pro.orange.fr/ and http://pros.orange.fr/slug and http://slug.pros.orange.fr/ and http://pagespro-orange.fr/slug all map to http://slug.pagespro-orange.fr/ |
| 23:05:58 | <pokechu22> | As far as I know, there is no sharing between pagesperso-orange.fr, monsite-orange.fr, and pagespro-orange.fr |
| 23:07:11 | <pokechu22> | but pagespro-orange.fr does have a slight complication: https://slug.pagespro-orange.fr/ == https://slug.mairie.pagespro-orange.fr/ == https://slug.assoc.pagespro-orange.fr/ == https://slug.ecole.pagespro-orange.fr/ and THOSE all have the same content. It's probably fine to duplicate work there though since there are far fewer pro sites than anything else |
| 23:07:45 | <pokechu22> | still the rule seems to be https if slug doesn't have periods, http if slug does, just with ".assoc", ".mairie", and ".ecole" not being part of the slug |
| 23:09:10 | <pokechu22> | Also, subdomains under orange.fr and wanadoo.fr either redirect to a page telling you about the new site (without redirecting you to the new site directly) or don't load at all (the case for perso.wanadoo.fr and pro.wanadoo.fr, but monsite.wanadoo.fr does still give that redirect page) |
| 23:13:00 | <@arkiver> | i'll use exactly what you wrote |
| 23:14:39 | <fireonlive> | jesus christ |
| 23:16:19 | <pokechu22> | Also worth noting that I attempted to normalize some URLs already, but the 2GB of URLs includes both forms already. It'd be necessary for the crawl to know these rules though since some sites use absolute links to older forms too :| |
| 23:17:47 | <@arkiver> | can you post your new list please? |
| 23:19:34 | <pokechu22> | The one I linked before is https://transfer.archivete.am/Ijfl4/Orange_all_lists.7z (230MB compressed) - the ones with "seed_urls" or similar are the ones I cleaned up based on the other lists, but my organization is a mess |
| 23:19:46 | <pokechu22> | (and I don't 100% remember what all of the files are :|) |
| 23:22:09 | <@arkiver> | shall i just queue everything with both http and https |
| 23:24:14 | <pokechu22> | That's probably fine since it'll just redirect to the correct form if you get it wrong (and if you get it wrong with https it'll have an invalid cert) |
| 23:24:53 | <pokechu22> | I'd be a bit concerned about rate-limiting though |
| 23:25:07 | <@arkiver> | did you find any limits? |
| 23:25:11 | <pokechu22> | Yes! |
| 23:25:17 | <@arkiver> | what do they look like? |
| 23:25:19 | <pokechu22> | 1 request per second |
| 23:25:35 | <pokechu22> | You can go faster temporarily but sustained higher speeds will result in timeouts for 24 hours |
| 23:25:41 | <@arkiver> | right |
| 23:26:04 | <pokechu22> | I don't know for sure whether http <--> https redirects count to the rate limits but it seems best to assume they will |
| 23:26:31 | <pokechu22> | the redirects to the single shared 404 error page counted at least (though that's something you can skip fortunately) |
| 23:27:15 | <@arkiver> | do they have any interesting status codes? |
| 23:27:27 | <@arkiver> | what do they give when you're banned/rate limited? |
| 23:27:54 | <pokechu22> | No status code at all, the request just times out. There might also be cases of refused connections, I think |
| 23:28:03 | <@arkiver> | alright |
| 23:28:06 | <pokechu22> | I don't think I saw any 429s |
| 23:32:15 | <pokechu22> | There are also some sites that aren't public (or something like that) which redirect to a common error page as well (https://pages.perso.orange.fr/pages-perso-error&r=403 - probably through e.orange.fr or something like that but I'd need to double-check) |
| 23:32:54 | <pokechu22> | ah, there are also 401s, e.g. https://cath.monsite-orange.fr |
| 23:33:16 | <@arkiver> | we're going to only accept 200 and 3xx initially |
| 23:33:19 | <pokechu22> | ... and a 403: https://callune.monsite-orange.fr |
| 23:34:25 | <pokechu22> | ok, https://30-rue-louis-pons.monsite-orange.fr/ redirects directly to https://pages.perso.orange.fr/pages-perso-error&r=403 while the more common situation is that https://30villard90vercors.monsite-orange.fr goes to https://r.orange.fr/r/Oerreur_404 goes to https://e.orange.fr/error404.html |
| 23:34:43 | <pokechu22> | I guess if you queue 3xxs as new tasks instead of directly following them that'd deduplicate things without extra work |
| 23:36:47 | <@arkiver> | we'll queue them individually, but not follow the redirect |
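
The mapping rules pokechu22 wrote up (23:01 to 23:07) can be sketched roughly as below. This is only an illustration, not the code the project used: the names `FAMILIES`, `PRO_INFIXES`, and `normalize` are made up here, the function returns just the site root, and it does not validate slugs or clean up junk such as a trailing %0A. It assumes the "https if the slug has no periods" rule applies to all three families, and that ".assoc", ".mairie", and ".ecole" stay in the hostname but are ignored when picking the scheme.

```python
from urllib.parse import urlsplit

# Legacy base host -> current base host, as listed in the discussion.
FAMILIES = {
    "perso.wanadoo.fr": "pagesperso-orange.fr",
    "perso.orange.fr": "pagesperso-orange.fr",
    "pagesperso-orange.fr": "pagesperso-orange.fr",
    "monsite.wanadoo.fr": "monsite-orange.fr",
    "monsite.orange.fr": "monsite-orange.fr",
    "monsite-orange.fr": "monsite-orange.fr",
    "pro.wanadoo.fr": "pagespro-orange.fr",
    "pro.orange.fr": "pagespro-orange.fr",
    "pros.orange.fr": "pagespro-orange.fr",
    "pagespro-orange.fr": "pagespro-orange.fr",
}

# Labels that are not part of the slug on pagespro-orange.fr.
PRO_INFIXES = (".assoc", ".mairie", ".ecole")


def normalize(url):
    """Return the current-form site root for a legacy URL, or None."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    for base, target in FAMILIES.items():
        if host == base:
            # Path form: http://base/slug[/...]
            slug = parts.path.lstrip("/").split("/", 1)[0].lower()
        elif host.endswith("." + base):
            # Subdomain form: http://slug.base/
            slug = host[: -len(base) - 1]
        else:
            continue
        if not slug:
            return None
        # Decide the scheme on the slug without any pro-only infix.
        bare = slug
        if target == "pagespro-orange.fr":
            for infix in PRO_INFIXES:
                if bare.endswith(infix):
                    bare = bare[: -len(infix)]
                    break
        scheme = "http" if "." in bare else "https"
        return f"{scheme}://{slug}.{target}/"
    return None


if __name__ == "__main__":
    print(normalize("http://perso.wanadoo.fr/shihtzupassion"))
```

With the example from 22:51, `normalize("http://perso.wanadoo.fr/shihtzupassion")` gives `https://shihtzupassion.pagesperso-orange.fr/`. A URL whose host is not in any of the three families returns `None`, so unrelated links can be left alone.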