| 00:52:04 | | Jake quits [Quit: Ping timeout (120 seconds)] |
| 00:58:02 | | Jake (Jake) joins |
| 01:01:38 | | Jake quits [Client Quit] |
| 01:12:50 | | Jake (Jake) joins |
| 02:03:03 | | TheTechRobo quits [Excess Flood] |
| 02:06:24 | | TheTechRobo (TheTechRobo) joins |
| 05:59:01 | | TastyWiener95 quits [Client Quit] |
| 06:09:50 | | Jake quits [Ping timeout: 265 seconds] |
| 06:14:56 | | Jake (Jake) joins |
| 07:13:23 | | IDK (IDK) joins |
| 09:32:21 | | ThreeHM_ quits [Ping timeout: 265 seconds] |
| 09:39:54 | | ThreeHM_ (ThreeHeadedMonkey) joins |
| 09:54:14 | | ThreeHM_ quits [Ping timeout: 252 seconds] |
| 10:00:34 | | ThreeHM_ (ThreeHeadedMonkey) joins |
| 11:13:20 | | ThreeHM_ is now known as ThreeHM |
| 13:39:19 | | imer quits [Killed (NickServ (GHOST command used by imer8))] |
| 13:40:04 | | imer (imer) joins |
| 14:35:51 | | Terbium quits [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.] |
| 14:36:29 | | Terbium joins |
| 14:43:58 | | Terbium quits [Client Quit] |
| 14:46:53 | | Terbium joins |
| 14:57:20 | | @HCross quits [Read error: Connection reset by peer] |
| 14:57:20 | | IDK quits [Read error: Connection reset by peer] |
| 14:57:21 | | @rewby|backup quits [Read error: Connection reset by peer] |
| 14:57:34 | | @hook54321 quits [Read error: Connection reset by peer] |
| 14:59:16 | | HCross (HCross) joins |
| 14:59:16 | | @ChanServ sets mode: +o HCross |
| 14:59:42 | | IDK (IDK) joins |
| 14:59:43 | | rewby|backup (rewby) joins |
| 14:59:43 | | @ChanServ sets mode: +o rewby|backup |
| 14:59:54 | | hook54321 (hook54321) joins |
| 14:59:54 | | @ChanServ sets mode: +o hook54321 |
| 16:56:03 | | Peroniko (Peroniko) joins |
| 18:58:42 | | mtw_ joins |
| 19:01:31 | | mtw quits [Ping timeout: 258 seconds] |
| 19:01:44 | <pokechu22> | arkiver: I should warn you that when I said "assorted" I really mean over 200 files, some duplicates, without much clarity in organization :) |
| 19:01:51 | <pokechu22> | I'm working on compressing it |
| 19:02:47 | <pokechu22> | Also worth noting that there's probably extra data that can be obtained from the AB jobs, particularly the ones for which the database was saved before being aborted; I haven't looked into the database for those yet |
| 19:03:08 | | mtw_ quits [Ping timeout: 252 seconds] |
| 19:05:58 | | mtw (mtw) joins |
| 19:06:19 | | thuban joins |
| 19:08:25 | <thuban> | hello, should i repost the orange.fr lists? |
| 19:08:47 | <@arkiver> | yes |
| 19:08:56 | <@arkiver> | i'm going to repurpose #Y for a bit especially for this |
| 19:09:05 | <pokechu22> | Your lists probably are mixed in with my lists but still worth mixing in |
| 19:09:21 | <thuban> | yeah, deduper should be able to handle things |
| 19:09:24 | <@arkiver> | all will be deduplicated so just dump them all |
| 19:09:29 | <pokechu22> | I guess one other thing that a special-purpose grab can do that AB can't is that it can ignore the redirects to 404/403 pages and avoid deduping them |
| 19:09:48 | <pokechu22> | https://transfer.archivete.am/Ijfl4/Orange_all_lists.7z (230MB, zst was larger) |
| 19:10:07 | <@arkiver> | yay |
| 19:10:43 | <@arkiver> | thuban: pokechu22: do you have a ball park size estimate? |
| 19:10:45 | <pokechu22> | some links will be malformed or in a weird format (including IA's tendency to add :80 to some CDX data); I normalized some of that when making my own lists but all of the data is in there |
| 19:10:52 | <pokechu22> | For the overall job? |
| 19:10:53 | <@arkiver> | pokechu22: that is fine |
| 19:10:59 | <@arkiver> | the :80 was for older URLs i believe at IA |
| 19:11:05 | <@arkiver> | but technically :80 is totally correct |
| 19:11:14 | <@arkiver> | pokechu22: yes |
| 19:11:39 | <thuban> | arkiver: estimate 1e5 sites |
| 19:12:28 | <@arkiver> | sorry, i meant GBs |
| 19:12:45 | <pokechu22> | AB got like 130 GB so far, but that's at a surface-level crawl |
| 19:12:55 | <@arkiver> | few TB at most? |
| 19:12:58 | <pokechu22> | I doubt it would be more than a few TB total, yeah |
| 19:13:03 | <thuban> | site size seems to be power-law-ish (many sites with just a single page, some with thousands), so not sure re total number of pages, but maybe in the 1e6 range? so yeah, few tb i think |
| 19:13:10 | <@arkiver> | alright |
| 19:13:39 | <pokechu22> | Oh, one other thing is that there are a few sites that use backslashes in links, which AB didn't like (browsers correct them to forward slashes but if backslashes are used the server gets mad) |
| 19:14:37 | <thuban> | not sure if pokechu22's "all" list includes my lists, reposting just in case: |
| 19:15:11 | <thuban> | https://transfer.archivete.am/bE5jI/orangefr_raw.txt.zst 156864 raw urls taken from their directory |
| 19:15:44 | <thuban> | https://transfer.archivete.am/SB82D/orangefr_scrubbed.txt.zst 159650 cleaned urls (where i tried to fix malformed entries) |
| 19:17:51 | <thuban> | https://transfer.archivete.am/eabTo/orangefr_online_raw.txt.zst 48298 raw urls taken from their directory with accessible=true and status=active (subset of first list) |
| 19:18:20 | <thuban> | https://transfer.archivete.am/7QXLi/orangefr_online_scrubbed.txt.zst 49007 cleaned urls |
| 19:18:59 | <@arkiver> | what is the 'official' name of this? just 'Orange'? or something else? |
| 19:21:29 | <thuban> | arkiver: company is called "Orange S.A."; hosting seems to be referred to as "Pages perso" (apparently covering all of pagesperso-orange, pagespro-orange, and monsite-orange) |
| 19:21:55 | <thuban> | ^ those lists are missing the protocol prefix btw |
| 19:23:09 | <@arkiver> | I see "Pages Perso Orange", shall we do that? |
| 19:24:00 | <thuban> | works for me |
| 19:24:15 | <@arkiver> | rewby: can we have a target here please? |
| 19:24:25 | <@arkiver> | this would be archiveteam_pagespersoorange_ |
| 19:24:30 | <@arkiver> | pagepersoorange_ |
| 19:24:35 | <@arkiver> | Archive Team Pages Perso Orange: |
| 19:24:51 | <@rewby|backup> | I will when I get to a place with mains power |
| 19:24:56 | <@rewby|backup> | So give me an hour or so |
| 19:24:58 | <@arkiver> | and... let's make this flow to IA directly if you have the right infrastructure in place for that still! (we have a bit of short term relief at IA now luckily) |
| 19:25:02 | <@arkiver> | rewby|backup: sure! |
| 19:25:16 | <@arkiver> | i'll copy (i know, the horror) and repurpose #Y for orange |
| 19:25:24 | <@arkiver> | since we don't have a general #Y functional yet |
| 19:28:08 | <pokechu22> | I'll try to come up with a list of rules to follow regarding redirects and stuff (and whether https needs to be added or not) |
| 19:30:15 | | Sluggs joins |
| 19:31:06 | <thuban> | the main rule is s/(monsite|pagesperso|pagespro)\.orange\.fr\/(.*)$/\2.\1-orange.fr/\3/ |
| 19:32:43 | <thuban> | (to recap, orange.fr has a pseudo-redirect: monsite.orange.fr/<slug>/ redirects to <slug>.monsite.orange.fr, which is just a landing page with a link to <slug>.monsite-orange.fr) |
| 19:33:17 | <pokechu22> | There's also http://monsite-orange.fr/abcdarchi/page4/index.html -> https://abcdarchi.monsite-orange.fr/page4/index.html but a redirect exists for that |
| 19:34:44 | <thuban> | sites with slugs that do not contain '.' work under (and redirect to) https; others are http-only |
| 19:35:28 | <@arkiver> | alright we'll get both |
| 19:36:12 | <pokechu22> | Here's another one: http://perso.wanadoo.fr/shihtzupassion and http://shihtzupassion.perso.wanadoo.fr/ no longer work, but the relevant content exists at https://shihtzupassion.pagesperso-orange.fr/ instead |
| 19:37:18 | <thuban> | ah yeah, ditto monsite.wanadoo.fr and pro.wanadoo.fr |
| 19:37:40 | <pokechu22> | I guess one other edge case with http vs https is that https://f6ikyradioamateur.assoc.pagespro-orange.fr/ also works (the cert covers *.assoc.pagespro-orange.fr), but http://geza.roheim.assoc.pagespro-orange.fr/ doesn't |
| 19:39:15 | <thuban> | we can just use http for everything, since https redirects apply where valid, right? |
| 19:40:32 | <pokechu22> | Yeah, though it'd be better to skip that redirect when we know it's safe since it would count against the rate-limiting |
| 19:40:45 | <pokechu22> | or at least I don't have evidence that it doesn't count against the rate-limiting |
| 19:41:20 | <pokechu22> | https://geza.roheim.assoc.pagespro-orange.fr/ also has a bad cert but the server is still willing to load it if you ignore the cert error (and it redirects to http afterwards in that case) |
| 19:41:38 | <thuban> | if we have enough workers that might not matter (presuming the site doesn't just fall over) |
| 19:42:44 | <thuban> | also, wow, i mis-copied my regex earlier |
| 19:45:40 | <@arkiver> | so we want to accept bad certificates? |
| 19:46:23 | <pokechu22> | I don't think there's any case where a site will actually link to a page with a bad certificate |
| 19:46:36 | <pokechu22> | and hopefully my list doesn't contain any like that |
| 19:47:05 | <pokechu22> | There *are* sites that link to wanadoo.fr or perso.orange.fr or similar though and those need to be worked around |
| 19:50:15 | <@arkiver> | if we don't accept bad certificates, this would error out on https://geza.roheim.assoc.pagespro-orange.fr/ |
| 19:51:24 | <pokechu22> | I don't think any of those would naturally occur, though since AB ignored bad certs some might be in there? That's probably worth checking |
| 19:51:31 | | phaeton (phaeton) joins |
| 19:54:47 | <pokechu22> | in any case, they've got a cert for: monsite-orange.fr, *.monsite-orange.fr, pagesperso-orange.fr, *.pagesperso-orange.fr, *.assoc.pagespro-orange.fr, *.ecole.pagespro-orange.fr, *.mairie.pagespro-orange.fr, *.pagespro-orange.fr, pagespro-orange.fr, assoc.pagespro-orange.fr, ecole.pagespro-orange.fr, mairie.pagespro-orange.fr |
| 19:55:51 | <pokechu22> | so anything that would be valid under that cert should be https and anything else should be http, I think. (Another example I only noticed now: https://ambialet.mairie.pagespro-orange.fr/ is indeed https) |
| 19:56:11 | <pokechu22> | err, actually |
| 19:56:36 | <pokechu22> | https://ambialet.pagespro-orange.fr/ == https://ambialet.mairie.pagespro-orange.fr/ == https://ambialet.assoc.pagespro-orange.fr/ == https://ambialet.ecole.pagespro-orange.fr/ :| |
| 19:57:03 | <pokechu22> | I'm pretty sure everything under pagespro-orange.fr is complete for one of those forms so it probably doesn't matter |
| 20:05:02 | <thuban> | here are the pseudo-redirect regexes, as a sed script-file: https://transfer.archivete.am/M4ZOG/orangefr_regexes.txt |
| 20:05:07 | <thuban> | (actually tested this time) |
| 20:05:55 | <thuban> | has to be three different ones because 'monsite' lacks 'pages' and it's actually 'pros.orange.fr' but 'pro.wanadoo.fr' |
| 20:08:52 | <thuban> | (the wanadoo urls _don't_ have any redirect, but _do_ sometimes still work at the corresponding -orange url) |
| 20:10:44 | <pokechu22> | I think I might have seen both actually? |
| 20:11:11 | <pokechu22> | Yeah, both pro.orange.fr and pros.orange.fr existed at some point |
| 20:11:21 | <pokechu22> | but only ever pro.wanadoo.fr based on CDX |
| 20:11:33 | <thuban> | gross and bad |
| 20:11:37 | <thuban> | anyway, both are handled |
| 20:23:29 | | systwi__ is now known as systwi |
| 22:19:24 | <@rewby> | arkiver: You need to create a tracker project |
| 22:20:36 | <@arkiver> | rewby: sorry about that, tracker is up |
| 23:10:50 | <flashfire42|m> | What tracker now? |
| 23:15:35 | | @Sanqui quits [Ping timeout: 252 seconds] |
| 23:16:44 | <TheTechRobo> | probably not a publicly-listed one |
| 23:16:52 | <TheTechRobo> | maybe http://tracker.archiveteam.org/orange/ |
| 23:17:10 | <TheTechRobo> | I think r.ewby just needs a tracker to plug the rsync targets into |
| 23:34:53 | | Sanqui joins |
| 23:34:55 | | Sanqui is now authenticated as Sanqui |
| 23:34:55 | | Sanqui quits [Changing host] |
| 23:34:55 | | Sanqui (Sanqui) joins |
| 23:34:55 | | @ChanServ sets mode: +o Sanqui |
| 23:49:50 | | project10 (project10) joins |
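
The URL rules discussed in the log (legacy orange.fr / wanadoo.fr hosts mapping to the `-orange.fr` domains, `:80` in CDX data, lists without a protocol prefix, and https only where the certificate applies) could be sketched roughly as follows. This is not thuban's sed script, which is only linked above; it is an illustrative reconstruction. The names `LEGACY_TO_CURRENT`, `CERT_NAMES`, `rewrite_legacy`, `cert_covers` and `normalize` are hypothetical, and treating both `pro.` and `pros.` as business-page hosts is an assumption based on the messages at 20:05 and 20:11.

```python
import re

# Legacy host label -> current hosting domain, as described in the log.
# Assumption: "pro" and "pros" both map to pagespro-orange.fr.
LEGACY_TO_CURRENT = {
    "perso": "pagesperso-orange.fr",
    "pro": "pagespro-orange.fr",
    "pros": "pagespro-orange.fr",
    "monsite": "monsite-orange.fr",
}

# Certificate names quoted by pokechu22 at 19:54:47.
CERT_NAMES = [
    "monsite-orange.fr", "*.monsite-orange.fr",
    "pagesperso-orange.fr", "*.pagesperso-orange.fr",
    "*.assoc.pagespro-orange.fr", "*.ecole.pagespro-orange.fr",
    "*.mairie.pagespro-orange.fr", "*.pagespro-orange.fr",
    "pagespro-orange.fr", "assoc.pagespro-orange.fr",
    "ecole.pagespro-orange.fr", "mairie.pagespro-orange.fr",
]

_LEGACY_HOST = re.compile(
    r"(?:(?P<slug>.+)\.)?(?P<kind>perso|pros?|monsite)\.(?:orange|wanadoo)\.fr"
)


def rewrite_legacy(url):
    """Return (host, path) with legacy hosts mapped to <slug>.<kind>-orange.fr."""
    rest = re.sub(r"^[a-z]+://", "", url.strip(), flags=re.IGNORECASE)
    host, _, path = rest.partition("/")
    host = host.lower()
    if host.endswith(":80"):  # CDX data sometimes carries an explicit :80
        host = host[:-3]
    match = _LEGACY_HOST.fullmatch(host)
    if match is None:
        return host, path  # already in the current form, or unrelated
    slug = match.group("slug")
    if slug is None:
        # Path form: perso.wanadoo.fr/<slug>/<rest>
        slug, _, path = path.partition("/")
    domain = LEGACY_TO_CURRENT[match.group("kind")]
    return (f"{slug}.{domain}" if slug else domain), path


def cert_covers(host):
    """True if host matches a certificate name; '*' covers exactly one label."""
    for name in CERT_NAMES:
        if name.startswith("*."):
            base = name[2:]
            if host.endswith("." + base) and "." not in host[: -len(base) - 1]:
                return True
        elif host == name:
            return True
    return False


def normalize(url):
    """Rewrite a listed URL and pick https only where the cert would be valid."""
    host, path = rewrite_legacy(url)
    scheme = "https" if cert_covers(host) else "http"
    return f"{scheme}://{host}/{path}"
```

Under these assumptions, `normalize("http://perso.wanadoo.fr/shihtzupassion")` yields the pagesperso-orange.fr form mentioned at 19:36:12, while a host such as geza.roheim.assoc.pagespro-orange.fr stays on http because the wildcard does not span two labels. Choosing the scheme up front, rather than always requesting http and following the redirect, follows pokechu22's concern that the extra redirect may count against rate limiting.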