00:52:04Jake quits [Quit: Ping timeout (120 seconds)]
00:58:02Jake (Jake) joins
01:01:38Jake quits [Client Quit]
01:12:50Jake (Jake) joins
02:03:03TheTechRobo quits [Excess Flood]
02:06:24TheTechRobo (TheTechRobo) joins
05:59:01TastyWiener95 quits [Client Quit]
06:09:50Jake quits [Ping timeout: 265 seconds]
06:14:56Jake (Jake) joins
07:13:23IDK (IDK) joins
09:32:21ThreeHM_ quits [Ping timeout: 265 seconds]
09:39:54ThreeHM_ (ThreeHeadedMonkey) joins
09:54:14ThreeHM_ quits [Ping timeout: 252 seconds]
10:00:34ThreeHM_ (ThreeHeadedMonkey) joins
11:13:20ThreeHM_ is now known as ThreeHM
13:39:19imer quits [Killed (NickServ (GHOST command used by imer8))]
13:40:04imer (imer) joins
14:35:51Terbium quits [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
14:36:29Terbium joins
14:43:58Terbium quits [Client Quit]
14:46:53Terbium joins
14:57:20@HCross quits [Read error: Connection reset by peer]
14:57:20IDK quits [Read error: Connection reset by peer]
14:57:21@rewby|backup quits [Read error: Connection reset by peer]
14:57:34@hook54321 quits [Read error: Connection reset by peer]
14:59:16HCross (HCross) joins
14:59:16@ChanServ sets mode: +o HCross
14:59:42IDK (IDK) joins
14:59:43rewby|backup (rewby) joins
14:59:43@ChanServ sets mode: +o rewby|backup
14:59:54hook54321 (hook54321) joins
14:59:54@ChanServ sets mode: +o hook54321
16:56:03Peroniko (Peroniko) joins
18:58:42mtw_ joins
19:01:31mtw quits [Ping timeout: 258 seconds]
19:01:44<pokechu22>arkiver: I should warn you that when I said "assorted" I really mean over 200 files, some duplicates, without much clarity in organization :)
19:01:51<pokechu22>I'm working on compressing it
19:02:47<pokechu22>Also worth noting that there's probably extra data that can be obtained from the AB jobs, particularly the ones for which the database was saved before being aborted; I haven't looked into the database for those yet
19:03:08mtw_ quits [Ping timeout: 252 seconds]
19:05:58mtw (mtw) joins
19:06:19thuban joins
19:08:25<thuban>hello, should i repost the orange.fr lists?
19:08:47<@arkiver>yes
19:08:56<@arkiver>i'm going to repurpose #Y for a bit especially for this
19:09:05<pokechu22>Your lists probably are mixed in with my lists but still worth mixing in
19:09:21<thuban>yeah, deduper should be able to handle things
19:09:24<@arkiver>all will be deduplicated so just dump them all
19:09:29<pokechu22>I guess one other thing that a special-purpose grab can do that AB can't is that it can ignore the redirects to 404/403 pages and avoid deduping them
19:09:48<pokechu22>https://transfer.archivete.am/Ijfl4/Orange_all_lists.7z (230MB, zst was larger)
19:10:07<@arkiver>yay
19:10:43<@arkiver>thuban: pokechu22: do you have a ball park size estimate?
19:10:45<pokechu22>some links will be malformed or in a weird format (including IA's tendency to add :80 to some CDX data); I normalized some of that when making my own lists but all of the data is in there
19:10:52<pokechu22>For the overall job?
19:10:53<@arkiver>pokechu22: that is fine
19:10:59<@arkiver>the :80 was for older URLs i believe at IA
19:11:05<@arkiver>but technically :80 is totally correct
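A minimal sketch of the ":80" normalization mentioned above, assuming the URLs already carry a scheme (the posted lists partly do not). The function name `strip_default_port` and the `DEFAULT_PORTS` table are illustrative and not taken from anyone's actual tooling in the channel.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports per scheme; an explicit default port is redundant in a URL.
DEFAULT_PORTS = {"http": 80, "https": 443}

def strip_default_port(url: str) -> str:
    """Drop an explicit default port (e.g. ':80' on http) from a URL."""
    parts = urlsplit(url)
    try:
        port = parts.port
    except ValueError:
        # Malformed port: leave the URL untouched rather than guess.
        return url
    if port is not None and port == DEFAULT_PORTS.get(parts.scheme):
        suffix = f":{port}"
        if parts.netloc.endswith(suffix):
            netloc = parts.netloc[: -len(suffix)]
            return urlunsplit(parts._replace(netloc=netloc))
    return url
```

Both forms address the same resource, so this only helps with deduplicating list entries; it does not change what gets fetched.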
19:11:14<@arkiver>pokechu22: yes
19:11:39<thuban>arkiver: estimate 1e5 sites
19:12:28<@arkiver>sorry, i meant GBs
19:12:45<pokechu22>AB got like 130 GB so far, but that's at a surface-level crawl
19:12:55<@arkiver>few TB at most?
19:12:58<pokechu22>I doubt it would be more than a few TB total, yeah
19:13:03<thuban>site size seems to be power-law-ish (many sites with just a single page, some with thousands), so not sure re total number of pages, but maybe in the 1e6 range? so yeah, few tb i think
19:13:10<@arkiver>alright
19:13:39<pokechu22>Oh, one other thing is that there are a few sites that use backslashes in links, which AB didn't like (browsers correct them to forward slashes but if backslashes are used the server gets mad)
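One way the backslash correction could be sketched, assuming it is enough to mimic browsers by turning backslashes into forward slashes before the query string or fragment. `fix_backslashes` is a hypothetical helper, not something AB or the project code is known to provide.

```python
def fix_backslashes(link: str) -> str:
    """Replace backslashes with forward slashes in the path part of a link.

    The query string and fragment are left alone, since only the
    path portion is being corrected here.
    """
    cut = len(link)
    for sep in "?#":
        idx = link.find(sep)
        if idx != -1:
            cut = min(cut, idx)
    return link[:cut].replace("\\", "/") + link[cut:]
```

This would be applied to extracted links before they are resolved against the page URL and queued.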
19:14:37<thuban>not sure if pokechu22's "all" list includes my lists, reposting just in case:
19:15:11<thuban>https://transfer.archivete.am/bE5jI/orangefr_raw.txt.zst 156864 raw urls taken from their directory
19:15:44<thuban>https://transfer.archivete.am/SB82D/orangefr_scrubbed.txt.zst 159650 cleaned urls (where i tried to fix malformed entries)
19:17:51<thuban>https://transfer.archivete.am/eabTo/orangefr_online_raw.txt.zst 48298 raw urls taken from their directory with accessible=true and status=active (subset of first list)
19:18:20<thuban>https://transfer.archivete.am/7QXLi/orangefr_online_scrubbed.txt.zst 49007 cleaned urls
19:18:59<@arkiver>what is the 'official' name of this? just 'Orange'? or something else?
19:21:29<thuban>arkiver: company is called "Orange S.A."; hosting seems to be referred to as "Pages perso" (apparently covering all of pagesperso-orange, pagespro-orange, and monsite-orange)
19:21:55<thuban>^ those lists are missing the protocol prefix btw
19:23:09<@arkiver>I see "Pages Perso Orange", shall we do that?
19:24:00<thuban>works for me
19:24:15<@arkiver>rewby: can we have a target here please?
19:24:25<@arkiver>this would be archiveteam_pagespersoorange_
19:24:30<@arkiver>pagepersoorange_
19:24:35<@arkiver>Archive Team Pages Perso Orange:
19:24:51<@rewby|backup>I will when I get to a place with mains power
19:24:56<@rewby|backup>So give me an hour or so
19:24:58<@arkiver>and... let's make this flow to IA directly if you have the right infrastructure in place for that still! (we have a bit of short term relief at IA now luckily)
19:25:02<@arkiver>rewby|backup: sure!
19:25:16<@arkiver>i'll copy (i know, the horror) and repurpose #Y for orange
19:25:24<@arkiver>since we don't have a general #Y functional yet
19:28:08<pokechu22>I'll try to come up with a list of rules to follow regarding redirects and stuff (and whether https needs to be added or not)
19:30:15Sluggs joins
19:31:06<thuban>the main rule is s/(monsite|pagesperso|pagespro)\.orange\.fr\/(.*)$/\2.\1-orange.fr/\3/
19:32:43<thuban>(to recap, orange.fr has a pseudo-redirect: monsite.orange.fr/<slug>/ redirects to <slug>.monsite.orange.fr, which is just a landing page with a link to <slug>.monsite-orange.fr)
19:33:17<pokechu22>There's also http://monsite-orange.fr/abcdarchi/page4/index.html -> https://abcdarchi.monsite-orange.fr/page4/index.html but a redirect exists for that
19:34:44<thuban>sites with slugs that do not contain '.' work under (and redirect to) https; others are http-only
19:35:28<@arkiver>alright we'll get both
19:36:12<pokechu22>Here's another one: http://perso.wanadoo.fr/shihtzupassion and http://shihtzupassion.perso.wanadoo.fr/ no longer work, but the relevant content exists at https://shihtzupassion.pagesperso-orange.fr/ instead
19:37:18<thuban>ah yeah, ditto monsite.wanadoo.fr and pro.wanadoo.fr
19:37:40<pokechu22>I guess one other edge case with http vs https is that https://f6ikyradioamateur.assoc.pagespro-orange.fr/ also works (the cert covers *.assoc.pagespro-orange.fr), but http://geza.roheim.assoc.pagespro-orange.fr/ doesn't
19:39:15<thuban>we can just use http for everything, since https redirects apply where valid, right?
19:40:32<pokechu22>Yeah, though it'd be better to skip that redirect when we know it's safe since it would count against the rate-limiting
19:40:45<pokechu22>or at least I don't have evidence that it doesn't count against the rate-limiting
19:41:20<pokechu22>https://geza.roheim.assoc.pagespro-orange.fr/ also has a bad cert but the server is still willing to load it if you ignore the cert error (and it redirects to http afterwards in that case)
19:41:38<thuban>if we have enough workers that might not matter (presuming the site doesn't just fall over)
19:42:44<thuban>also, wow, i mis-copied my regex earlier
19:45:40<@arkiver>so we want to accept bad certificates?
19:46:23<pokechu22>I don't think there's any case where a site will actually link to a page with a bad certificate
19:46:36<pokechu22>and hopefully my list doesn't contain any like that
19:47:05<pokechu22>There *are* sites that link to wanadoo.fr or perso.orange.fr or similar though and those need to be worked around
19:50:15<@arkiver>if we don't accept bad certificates, this would error out on https://geza.roheim.assoc.pagespro-orange.fr/
19:51:24<pokechu22>I don't think any of those would naturally occur, though since AB ignored bad certs some might be in there? That's probably worth checking
19:51:31phaeton (phaeton) joins
19:54:47<pokechu22>in any case, they've got a cert for: monsite-orange.fr, *.monsite-orange.fr, pagesperso-orange.fr, *.pagesperso-orange.fr, *.assoc.pagespro-orange.fr, *.ecole.pagespro-orange.fr, *.mairie.pagespro-orange.fr, *.pagespro-orange.fr, pagespro-orange.fr, assoc.pagespro-orange.fr, ecole.pagespro-orange.fr, mairie.pagespro-orange.fr
19:55:51<pokechu22>so anything that would be valid under that cert should be https and anything else should be http, I think. (Another example I only noticed now: https://ambialet.mairie.pagespro-orange.fr/ is indeed https)
19:56:11<pokechu22>err, actually
19:56:36<pokechu22>https://ambialet.pagespro-orange.fr/ == https://ambialet.mairie.pagespro-orange.fr/ == https://ambialet.assoc.pagespro-orange.fr/ == https://ambialet.ecole.pagespro-orange.fr/ :|
19:57:03<pokechu22>I'm pretty sure everything under pagespro-orange.fr is complete for one of those forms so it probably doesn't matter
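The "https if the cert covers it, http otherwise" rule could be sketched as below. The names in `CERT_NAMES` are the ones listed in the message above; the function names are illustrative, and the sketch assumes the usual behaviour that a wildcard covers exactly one leftmost label.

```python
# Names on the certificate, as listed in the channel.
CERT_NAMES = [
    "monsite-orange.fr", "*.monsite-orange.fr",
    "pagesperso-orange.fr", "*.pagesperso-orange.fr",
    "*.assoc.pagespro-orange.fr", "*.ecole.pagespro-orange.fr",
    "*.mairie.pagespro-orange.fr", "*.pagespro-orange.fr",
    "pagespro-orange.fr", "assoc.pagespro-orange.fr",
    "ecole.pagespro-orange.fr", "mairie.pagespro-orange.fr",
]

def covered_by_cert(host: str) -> bool:
    """True if host matches one of CERT_NAMES (wildcard = one label)."""
    host = host.lower().rstrip(".")
    first_label, dot, remainder = host.partition(".")
    for name in CERT_NAMES:
        if name.startswith("*."):
            # A wildcard only stands in for a single leftmost label.
            if dot and first_label and remainder == name[2:]:
                return True
        elif host == name:
            return True
    return False

def preferred_scheme(host: str) -> str:
    """Pick https when the certificate is valid for host, else http."""
    return "https" if covered_by_cert(host) else "http"
```

Under this rule a slug containing a dot, such as geza.roheim under assoc.pagespro-orange.fr, falls outside every wildcard and ends up as http, which matches the behaviour described earlier.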
20:05:02<thuban>here are the pseudo-redirect regexes, as a sed script-file: https://transfer.archivete.am/M4ZOG/orangefr_regexes.txt
20:05:07<thuban>(actually tested this time)
20:05:55<thuban>has to be three different ones because 'monsite' lacks 'pages' and it's actually 'pros.orange.fr' but 'pro.wanadoo.fr'
20:08:52<thuban>(the wanadoo urls _don't_ have any redirect, but _do_ sometimes still work at the corresponding -orange url)
20:10:44<pokechu22>I think I might have seen both actually?
20:11:11<pokechu22>Yeah, both pro.orange.fr and pros.orange.fr existed at some point
20:11:21<pokechu22>but only ever pro.wanadoo.fr based on CDX
20:11:33<thuban>gross and bad
20:11:37<thuban>anyway, both are handled
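A rough Python sketch of the rewrite rules as described in the discussion, not a copy of the linked sed script (its contents are not reproduced here). It assumes monsite maps to monsite-orange.fr, perso to pagesperso-orange.fr, and pro/pros to pagespro-orange.fr, for both the orange.fr and wanadoo.fr hosts, and it always emits http so that any https redirect is left to the server. All names are illustrative.

```python
import re

# Legacy host keyword -> current hosting domain (assumed from the chat).
NEW_SUFFIX = {
    "monsite": "monsite-orange.fr",
    "perso": "pagesperso-orange.fr",
    "pro": "pagespro-orange.fr",
    "pros": "pagespro-orange.fr",
}

_KIND = r"(?P<kind>monsite|perso|pros?)\.(?:orange|wanadoo)\.fr"
_PATTERNS = [
    # Path style: perso.wanadoo.fr/<slug>/rest
    re.compile(rf"^(?:https?://)?{_KIND}/(?P<slug>[^/?#]+)/?(?P<rest>.*)$", re.I),
    # Subdomain style: <slug>.perso.wanadoo.fr/rest
    re.compile(rf"^(?:https?://)?(?P<slug>[^/?#]+)\.{_KIND}/?(?P<rest>.*)$", re.I),
]

def rewrite_legacy(url: str) -> str:
    """Map a legacy orange.fr/wanadoo.fr URL to its -orange.fr form."""
    for pattern in _PATTERNS:
        m = pattern.match(url)
        if m:
            suffix = NEW_SUFFIX[m["kind"].lower()]
            return f"http://{m['slug']}.{suffix}/{m['rest']}"
    # Not a recognised legacy form: return unchanged.
    return url
```

For example, `rewrite_legacy("http://perso.wanadoo.fr/shihtzupassion")` gives `http://shihtzupassion.pagesperso-orange.fr/`. Whether the rewritten URL still resolves has to be checked per site, since the wanadoo forms only sometimes survive at the -orange address.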
20:23:29systwi__ is now known as systwi
22:19:24<@rewby>arkiver: You need to create a tracker project
22:20:36<@arkiver>rewby: sorry about that, tracker is up
23:10:50<flashfire42|m>What tracker now?
23:15:35@Sanqui quits [Ping timeout: 252 seconds]
23:16:44<TheTechRobo>probably not a publicly-listed one
23:16:52<TheTechRobo>maybe http://tracker.archiveteam.org/orange/
23:17:10<TheTechRobo>I think r.ewby just needs a tracker to plug the rsync targets into
23:34:53Sanqui joins
23:34:55Sanqui quits [Changing host]
23:34:55Sanqui (Sanqui) joins
23:34:55@ChanServ sets mode: +o Sanqui
23:49:50project10 (project10) joins