| 00:00:25 | <thuban> | arkiver, pokechu22: here's my list of 156864 raw orange.fr urls: https://transfer.archivete.am/bE5jI/orangefr_raw.txt.zst |
| 00:01:23 | <pokechu22> | Will look at this shortly, thanks |
| 00:01:44 | | Naruyoko5 quits [Read error: Connection reset by peer] |
| 00:01:50 | <thuban> | here's my list of 159650 'cleaned' urls (where i cleaned up whitespace, handled transformations like monsite.orange.fr/<slug> -> <slug>.monsite-orange.fr, and otherwise took my best guess at anything malformed): https://transfer.archivete.am/SB82D/orangefr_scrubbed.txt.zst |
| 00:02:49 | <thuban> | and here's a list of 61667 'bad' urls (which is just the raw list minus the cleaned list): https://transfer.archivete.am/vCBTZ/orangefr_badraw.txt.zst |
| 00:03:02 | | imer4 quits [Ping timeout: 252 seconds] |
| 00:05:34 | <thuban> | (the cleaned list is longer than the raw list because i (a) generated <site> if i only had <site>/path.ext, to avoid no-parent issues, and (b) generated multiple guesses for some malformed urls where i had only the username) |
| 00:06:36 | <pokechu22> | And this is based on scraping a list that they provide, right? So most of the pages should exist? |
| 00:07:06 | <thuban> | yes; no |
| 00:07:25 | <thuban> | unfortunately a lot of the pages in the directory are down |
| 00:07:57 | <pokechu22> | I'm a bit worried because two of my !a < list jobs for monsite-orange.fr both seem to have resulted in the site banning it (possibly because of too many requests to nonexistent pages, but maybe just because it was running too fast) which is annoying... |
| 00:08:19 | <thuban> | the api had 'accessible' and 'status' parameters; i am not sure what the distinction is and chose the values that gave me the largest list |
| 00:08:22 | <thuban> | oof :/ |
| 00:09:29 | <thuban> | i can change those params and get you a shorter list to prioritize, if that would help |
| 00:09:31 | <pokechu22> | An additional annoyance is that each page that doesn't exist redirects twice (https://yachtlink.pagesperso-orange.fr/ -> https://r.orange.fr/r/Oerreur_404 -> https://e.orange.fr/error404.html) |
| 00:09:43 | <thuban> | ye |
| 00:09:58 | <pokechu22> | Sure, that'd be helpful as it'd be pretty easy to run that list first and then run the remaining stuff not on that list |
| 00:11:08 | <thuban> | ok, will do. probably take a few hours |
| 00:11:16 | <pokechu22> | Alright |
| 00:15:09 | <thuban> | list is going to be about 1/3 the size of the big one |
| 00:15:30 | | imer1 (imer) joins |
| 00:18:59 | | imer quits [Ping timeout: 252 seconds] |
| 00:18:59 | | imer1 is now known as imer |
| 00:31:36 | <pokechu22> | thuban: how exactly did you make the badraw list? http://acf.luis.pagesperso-orange.fr/ is valid for instance (it just doesn't work with https) |
| 00:35:09 | <thuban> | literally just raw minus scrubbed. that site had a trailing slash in the raw list ("acf.luis.pagesperso-orange.fr/"); i removed those if they were directly on the domain (for deduping purposes) |
| 00:36:48 | <pokechu22> | Oh, not links that seemed like complete junk |
| 00:38:24 | <thuban> | yeah, the idea was mostly to have the originals for discoverability (esp for the changed domains) |
| 00:42:22 | | imer6 (imer) joins |
| 00:45:32 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 00:46:08 | | imer quits [Ping timeout: 265 seconds] |
| 00:46:09 | | imer6 is now known as imer |
| 00:48:43 | | dumbgoy_ joins |
| 00:51:59 | | dumbgoy quits [Ping timeout: 252 seconds] |
| 00:55:30 | | fangfufu joins |
| 00:55:32 | | fangfufu quits [Remote host closed the connection] |
| 00:56:40 | | fangfufu joins |
| 00:56:57 | | fangfufu is now authenticated as fangfufu |
| 01:04:12 | | imer0 (imer) joins |
| 01:05:33 | | nexusxe quits [Client Quit] |
| 01:07:56 | | imer quits [Ping timeout: 252 seconds] |
| 01:07:56 | | imer0 is now known as imer |
| 01:21:20 | <pokechu22> | thuban: some (as in several thousand?) of the ones you have aren't in my list at all, which means there's no archive.org coverage. Unfortunately my organization is a mess and I now have 2GB of lists of URLs so it'll be a bit before I can actually run stuff though... and make sure I'm actually looking at all of this correctly :| |
| 01:29:52 | <thuban> | that's ok, take your time! the priority list will probably be done in another 30-45 minutes, if that helps |
| 01:30:58 | <nicolas17> | my VPS has 659GB unused bandwidth for the rest of the month |
| 01:51:54 | | miki_57 quits [Client Quit] |
| 01:59:02 | <fireonlive> | DogsRNice: trouble with factorio? or just proactive? |
| 02:00:01 | <DogsRNice> | no idea i just noticed someone was doing the factorio sites and didnt do the forums |
| 02:02:06 | | sec^nd quits [Ping timeout: 245 seconds] |
| 02:04:18 | <fireonlive> | ah ok |
| 02:13:20 | | imer6 (imer) joins |
| 02:16:41 | | imer quits [Ping timeout: 252 seconds] |
| 02:16:41 | | imer6 is now known as imer |
| 02:21:36 | | imer1 (imer) joins |
| 02:24:53 | | nic quits [Quit: The Lounge - https://thelounge.chat] |
| 02:25:12 | <pokechu22> | I skipped the forums because they're somewhat large - it'd make sense to do them later but I'd rather not start a multi-day proactive thing just yet |
| 02:25:29 | | imer quits [Ping timeout: 252 seconds] |
| 02:25:29 | | imer1 is now known as imer |
| 02:25:44 | <pokechu22> | If we want to do one it's fine but eh |
| 02:26:21 | <thuban> | arkiver, pokechu22: here are my 'priority' lists (scraped with accessible=true and status=active; sites should all be online). these lists are a strict subset of those previously posted |
| 02:26:44 | <thuban> | 48298 raw urls: https://transfer.archivete.am/eabTo/orangefr_online_raw.txt.zst |
| 02:27:12 | <thuban> | 49007 cleaned urls: https://transfer.archivete.am/7QXLi/orangefr_online_scrubbed.txt.zst |
| 02:29:09 | | nic (nic) joins |
| 02:30:38 | <thuban> | the 'bad' urls all either had trailing slashes or were of the old *.(orange|wanadoo).fr format with quasi-redirects. trailing slashes are transparent for our purposes, so instead of the entire 'bad' list here are just the redirects |
| 02:31:01 | <thuban> | 7440 redirect urls: https://transfer.archivete.am/LUa27/orangefr_online_redirect.txt.zst |
| 02:32:03 | <pokechu22> | I'm going to run this with entries like 08.pagesperso-orange.fr/odp/index.htm stripped out (leaving only 08.pagesperso-orange.fr) for now since having both is the kind of situation that can lead to really weird no-parent behavior |
| 02:32:27 | <thuban> | hmm, ok |
| 02:32:28 | <pokechu22> | AB also needs either http:// or https:// before each URL; I'll add http to ones with multiple dots and https to ones without |
| 02:33:09 | <thuban> | ah, i never remember that. do you want me to do that / any other processing? |
| 02:33:11 | | dumbgoy_ quits [Ping timeout: 252 seconds] |
| 02:33:26 | <pokechu22> | I can handle it - I've already built some jank regexes for it :) |
| 02:33:46 | <thuban> | ok! |
| 02:34:25 | <pokechu22> | first prefix everything with http:// and then replace ^http://([^/\.]+\.[^/\.]+-orange\.fr)$ with https://\1 |
| 02:37:09 | | sambo joins |
| 02:38:52 | | sambo quits [Remote host closed the connection] |
| 02:43:58 | | imer2 (imer) joins |
| 02:47:56 | | imer quits [Ping timeout: 265 seconds] |
| 02:47:56 | | imer2 is now known as imer |
| 03:10:04 | | Larsenv76 is now known as Larsenv |
| 03:19:59 | | sec^nd (second) joins |
| 03:35:10 | | sec^nd quits [Remote host closed the connection] |
| 03:35:49 | | sec^nd (second) joins |
| 03:46:43 | | sec^nd quits [Remote host closed the connection] |
| 03:58:01 | | sec^nd (second) joins |
| 04:00:50 | | dumbgoy_ joins |
| 04:15:36 | | lukash96 joins |
| 04:15:54 | | lukash9 quits [Ping timeout: 245 seconds] |
| 04:15:54 | | lukash96 is now known as lukash9 |
| 04:24:58 | | DogsRNice quits [Read error: Connection reset by peer] |
| 04:43:40 | <pokechu22> | I don't think the orange stuff is going to finish on time - running at more than 1 page/second seemed to result in blocks, and after going through about 4.5K seed URLs of 45K URLs we're already at ~125K queued or a day and a half. So at that rate it'd be 15 days to finish, which we don't have. And that's just for this smaller list. Any ideas about how to handle that? |
| 04:49:09 | | nic0 (nic) joins |
| 04:52:20 | | nic quits [Ping timeout: 252 seconds] |
| 04:52:20 | | nic0 is now known as nic |
| 04:58:54 | <thuban> | i guess i would suggest either seeing if you can reduce the delay (i know it's different infra, but i was able to do all my scraping with 0.5s delay and didn't get banned) or trying to parallelize the load across multiple pipelines |
| 05:00:25 | <pokechu22> | If .5s is fine I can do that - it was originally .25-.375 at con=1 |
| 05:00:42 | <pokechu22> | I'm not sure how long they ban for though which makes me nervous about experimenting |
| 05:02:24 | <thuban> | as i said, different infra (and it involved a token which i just yoinked from the browser), so can't be sure based just on that. could you try testing with a sacrificial ip, like a home connection? |
| 05:03:07 | <pokechu22> | I guess I could - though I don't have quite the same infra either |
| 05:03:14 | <thuban> | i mean on their end |
| 05:03:44 | <thuban> | i.e., the directory api being different from the actual page servers |
| 05:04:21 | <pokechu22> | What host is the directory API on? |
| 05:04:47 | <thuban> | api.annuaire-pp.orange.fr |
| 05:06:29 | <pokechu22> | ah, yeah, might have different rate-limiting then :| |
| 05:09:11 | <thuban> | multiple pipelines is probably easiest/safest, but idk what wrangling them is like |
| 05:09:40 | <thuban> | (alas, this is really a job for #Y...) |
| 05:11:49 | <pokechu22> | Theoretically I could just run e.g. all of the pagespro-orange.fr jobs on one pipeline, pagesperso-orange.fr on a second, and monsite-orange.fr on a third (that's trivial by just using different lists), and that's what I originally planned on doing, but it's not easy to do that for in-progress jobs |
| 05:12:41 | <pokechu22> | I'm going to try running pagespro-orange.fr locally since there's no job for that yet (beyond the ones you have in your list) |
| 05:29:21 | <pokechu22> | The other thing that would help is if we could just skip the 2-step redirect chain, but there's no way to apply ignores onto redirect targets so it's going to redownload https://r.orange.fr/r/Oerreur_404 and https://e.orange.fr/error404.html every time it hits a 404 :| |
| 05:51:11 | | katocala quits [Ping timeout: 252 seconds] |
| 05:51:23 | | katocala joins |
| 05:54:16 | | Island quits [Read error: Connection reset by peer] |
| 05:58:53 | | Exorcism (exorcism) joins |
| 06:28:09 | | BigBrain quits [Remote host closed the connection] |
| 06:29:48 | | BigBrain (bigbrain) joins |
| 07:00:08 | | nfriedly quits [Remote host closed the connection] |
| 07:01:16 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 07:06:42 | | Unholy2361316618085 (Unholy2361) joins |
| 07:09:54 | | nulldata quits [Ping timeout: 265 seconds] |
| 07:12:51 | | nulldata (nulldata) joins |
| 07:13:37 | | Krume (Krume) joins |
| 07:32:45 | <AntoninDelFabbro|m> | pokechu22: If I can help, I will! |
| 07:54:06 | | Arcorann (Arcorann) joins |
| 07:54:51 | | dazld quits [Ping timeout: 265 seconds] |
| 08:21:20 | | nulldata quits [Ping timeout: 252 seconds] |
| 08:24:38 | | nulldata (nulldata) joins |
| 09:04:10 | | Exorcism quits [Remote host closed the connection] |
| 09:07:31 | | Exorcism (exorcism) joins |
| 09:22:42 | | parfait quits [Client Quit] |
| 09:23:29 | | appledash quits [Ping timeout: 252 seconds] |
| 09:25:31 | | appledash joins |
| 09:26:00 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 09:47:48 | | erkinalp joins |
| 10:00:00 | | nfriedly joins |
| 10:00:01 | | railen63 quits [Remote host closed the connection] |
| 10:00:18 | | railen63 joins |
| 10:13:31 | | miki_57 joins |
| 10:14:02 | | miki_57 quits [Max SendQ exceeded] |
| 10:14:05 | | miki_57 joins |
| 10:14:36 | | miki_57 quits [Max SendQ exceeded] |
| 10:14:39 | | miki_57 joins |
| 10:15:10 | | miki_57 quits [Max SendQ exceeded] |
| 10:15:13 | | miki_57 joins |
| 10:15:44 | | miki_57 quits [Max SendQ exceeded] |
| 10:15:47 | | miki_57 joins |
| 10:16:19 | | miki_57 quits [Max SendQ exceeded] |
| 10:16:21 | | miki_57 joins |
| 10:16:52 | | miki_57 quits [Max SendQ exceeded] |
| 10:16:54 | | miki_57 joins |
| 10:17:26 | | miki_57 quits [Max SendQ exceeded] |
| 10:17:29 | | miki_57 joins |
| 10:18:00 | | miki_57 quits [Max SendQ exceeded] |
| 10:18:03 | | miki_57 joins |
| 10:18:34 | | miki_57 quits [Max SendQ exceeded] |
| 10:18:37 | | miki_57 joins |
| 10:19:08 | | miki_57 quits [Max SendQ exceeded] |
| 10:19:10 | | miki_57 joins |
| 10:19:42 | | miki_57 quits [Max SendQ exceeded] |
| 10:19:44 | | miki_57 joins |
| 10:20:16 | | miki_57 quits [Max SendQ exceeded] |
| 10:20:19 | | miki_57 joins |
| 10:20:50 | | miki_57 quits [Max SendQ exceeded] |
| 10:20:53 | | miki_57 joins |
| 10:21:24 | | miki_57 quits [Max SendQ exceeded] |
| 10:21:27 | | miki_57 joins |
| 10:21:58 | | miki_57 quits [Max SendQ exceeded] |
| 10:22:01 | | miki_57 joins |
| 10:22:32 | | miki_57 quits [Max SendQ exceeded] |
| 10:22:34 | | miki_57 joins |
| 10:23:06 | | miki_57 quits [Max SendQ exceeded] |
| 10:23:08 | | miki_57 joins |
| 10:23:40 | | miki_57 quits [Max SendQ exceeded] |
| 10:23:42 | | miki_57 joins |
| 10:24:14 | | miki_57 quits [Max SendQ exceeded] |
| 10:24:17 | | miki_57 joins |
| 10:24:48 | | miki_57 quits [Max SendQ exceeded] |
| 10:24:51 | | miki_57 joins |
| 10:25:22 | | miki_57 quits [Max SendQ exceeded] |
| 10:25:25 | | miki_57 joins |
| 10:25:56 | | miki_57 quits [Max SendQ exceeded] |
| 10:25:59 | | miki_57 joins |
| 10:26:30 | | miki_57 quits [Max SendQ exceeded] |
| 10:26:33 | | miki_57 joins |
| 10:26:34 | | Earendil7 quits [Client Quit] |
| 10:27:04 | | miki_57 quits [Max SendQ exceeded] |
| 10:27:07 | | miki_57 joins |
| 10:27:38 | | miki_57 quits [Max SendQ exceeded] |
| 10:27:41 | | miki_57 joins |
| 10:27:59 | | Earendil7 (Earendil7) joins |
| 10:28:12 | | miki_57 quits [Max SendQ exceeded] |
| 10:28:15 | | miki_57 joins |
| 10:28:46 | | miki_57 quits [Max SendQ exceeded] |
| 10:28:49 | | miki_57 joins |
| 10:29:20 | | miki_57 quits [Max SendQ exceeded] |
| 10:29:23 | | miki_57 joins |
| 10:29:29 | | wickedplayer494 quits [Ping timeout: 252 seconds] |
| 10:29:48 | | wickedplayer494 joins |
| 10:29:54 | | miki_57 quits [Max SendQ exceeded] |
| 10:29:57 | | miki_57 joins |
| 10:30:28 | | miki_57 quits [Max SendQ exceeded] |
| 10:30:31 | | miki_57 joins |
| 10:31:02 | | miki_57 quits [Max SendQ exceeded] |
| 10:31:04 | | miki_57 joins |
| 10:31:36 | | miki_57 quits [Max SendQ exceeded] |
| 10:31:39 | | miki_57 joins |
| 10:32:10 | | miki_57 quits [Max SendQ exceeded] |
| 10:32:13 | | miki_57 joins |
| 10:32:44 | | miki_57 quits [Max SendQ exceeded] |
| 10:32:47 | | miki_57 joins |
| 10:33:18 | | miki_57 quits [Max SendQ exceeded] |
| 10:33:21 | | miki_57 joins |
| 10:33:52 | | miki_57 quits [Max SendQ exceeded] |
| 10:33:55 | | miki_57 joins |
| 10:34:26 | | miki_57 quits [Max SendQ exceeded] |
| 10:34:29 | | miki_57 joins |
| 10:35:00 | | miki_57 quits [Max SendQ exceeded] |
| 10:35:03 | | miki_57 joins |
| 10:35:34 | | miki_57 quits [Max SendQ exceeded] |
| 10:35:36 | | miki_57 joins |
| 10:36:08 | | miki_57 quits [Max SendQ exceeded] |
| 10:36:11 | | miki_57 joins |
| 10:36:42 | | miki_57 quits [Max SendQ exceeded] |
| 10:36:45 | | miki_57 joins |
| 10:37:16 | | miki_57 quits [Max SendQ exceeded] |
| 10:37:18 | | miki_57 joins |
| 10:37:50 | | miki_57 quits [Max SendQ exceeded] |
| 10:37:53 | | miki_57 joins |
| 10:38:24 | | miki_57 quits [Max SendQ exceeded] |
| 10:38:27 | | miki_57 joins |
| 10:38:58 | | miki_57 quits [Max SendQ exceeded] |
| 10:39:01 | | miki_57 joins |
| 11:23:01 | <erkinalp> | pokechu22: wowturkey still down |
| 11:23:12 | | railen69 joins |
| 11:24:08 | | wickedplayer494 quits [Ping timeout: 265 seconds] |
| 11:24:29 | | wickedplayer494 joins |
| 11:24:36 | | wickedplayer494 is now authenticated as wickedplayer494 |
| 11:25:36 | | railen63 quits [Ping timeout: 265 seconds] |
| 11:53:52 | | Miki57 joins |
| 11:56:46 | | Earendil7 quits [Client Quit] |
| 11:57:05 | | Earendil7 (Earendil7) joins |
| 12:00:04 | | katocala is now authenticated as katocala |
| 12:04:54 | | yo joins |
| 12:05:20 | | yo quits [Remote host closed the connection] |
| 12:11:47 | | Dango360 quits [Ping timeout: 252 seconds] |
| 12:27:48 | | icedice (icedice) joins |
| 12:28:54 | | le0n quits [Ping timeout: 265 seconds] |
| 12:30:53 | | ethan joins |
| 12:31:31 | | ethan quits [Remote host closed the connection] |
| 12:32:20 | | Exorcism quits [Client Quit] |
| 12:33:26 | | Exorcism (exorcism) joins |
| 12:52:06 | | Exorcism quits [Ping timeout: 245 seconds] |
| 12:55:54 | | Exorcism (exorcism) joins |
| 13:08:45 | | Icyelut|2 quits [Quit: bye] |
| 13:29:37 | | nic quits [Client Quit] |
| 13:33:15 | | nic (nic) joins |
| 13:42:18 | | benjins2 joins |
| 14:02:23 | | bf_ joins |
| 14:07:50 | | Arcorann quits [Ping timeout: 252 seconds] |
| 14:17:51 | | erkinalp quits [Remote host closed the connection] |
| 14:18:21 | | le0n (le0n) joins |
| 14:46:36 | | Island joins |
| 14:51:48 | | miki_57 quits [Client Quit] |
| 15:07:35 | | khaosfox quits [Quit: leaving] |
| 15:10:51 | | LeGoupil joins |
| 15:18:03 | | Core4657 joins |
| 15:18:07 | | Core4657 quits [Remote host closed the connection] |
| 16:14:02 | | kiryu quits [Read error: Connection reset by peer] |
| 16:37:35 | <pokechu22> | So unfortunately, a 500-500 delay results in a ban. Happened to me on my residential connection overnight and happened to one of the jobs (not the priority one) I changed yesterday too. I guess the 1-second delay is the only safe one :| |
| 16:48:27 | <pokechu22> | I did, however, build a list of stuff under pagespro-orange.fr that's valid |
| 17:03:55 | <fireonlive> | 09:58:59 AM -+rss- Fig Has Joined AWS: https://fig.io/blog/post/fig-joins-aws https://news.ycombinator.com/item?id=37296401 |
| 17:07:27 | | aninternettroll quits [Remote host closed the connection] |
| 17:10:19 | | aninternettroll (aninternettroll) joins |
| 17:26:06 | | railen69 quits [Remote host closed the connection] |
| 17:29:21 | | railen63 joins |
| 17:42:21 | <@JAA> | So, what channel do we use for ZOWA? |
| 17:43:06 | <@JAA> | The ideas from yesterday: zowch z-oww-a nowa zowwa zowaah zowie (plus one that shall not be named) |
| 18:31:47 | | DogsRNice joins |
| 18:36:43 | | AmAnd0A quits [Ping timeout: 265 seconds] |
| 18:37:21 | | AmAnd0A joins |
| 18:46:48 | | wyatt8740 quits [Remote host closed the connection] |
| 18:56:26 | | yts98 leaves |
| 18:56:31 | | yts98 joins |
| 18:59:04 | | Unholy2361316618085 quits [Remote host closed the connection] |
| 19:01:29 | | Unholy2361316618085 (Unholy2361) joins |
| 19:04:34 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 19:04:47 | | AmAnd0A joins |
| 19:16:06 | <fireonlive> | ooh! ooh! the shall not be named one! |
| 19:17:47 | <fireonlive> | in absence of that, zowch |
| 19:19:51 | <DigitalDragons> | +1 zowch |
| 19:20:01 | | Exorcism quits [Ping timeout: 245 seconds] |
| 19:20:26 | | sec^nd quits [Ping timeout: 245 seconds] |
| 19:20:51 | | BigBrain quits [Ping timeout: 245 seconds] |
| 19:21:46 | <h2ibot> | FireonLive edited Current Projects (+121, add ZOWA): https://wiki.archiveteam.org/?diff=50608&oldid=50551 |
| 19:22:04 | | Exorcism (exorcism) joins |
| 19:22:44 | | BigBrain (bigbrain) joins |
| 19:24:08 | <fireonlive> | one day i'll go through and make 300,000 edits with the https://www.mediawiki.org/wiki/Help:Magic_words#formatdate thing |
| 19:24:21 | <fireonlive> | too bad there doesn't seem to be one for time |
| 19:25:05 | <fireonlive> | hmmm |
| 19:25:44 | | sec^nd (second) joins |
| 19:26:09 | <fireonlive> | yeah sadly {{#formatdate:2023-09-29T03:00Z}} doesn't appear to work |
| 19:27:02 | | LeGoupil quits [Client Quit] |
| 19:28:47 | <h2ibot> | FireonLive edited Current Projects (+16, use formatdate for ZOWA, more to come): https://wiki.archiveteam.org/?diff=50609&oldid=50608 |
| 19:31:33 | <fireonlive> | i found {{#time}} but what the fuck is this: 2023-09-29UTC03:000 |
| 19:32:10 | <fireonlive> | i'll look more into it later :p |
| 19:32:49 | <fireonlive> | mediawiki is really something |
| 19:39:50 | | Exorcism quits [Client Quit] |
| 19:47:47 | <@JAA> | #time doesn't seem to account for user preferences. |
| 19:49:50 | <h2ibot> | Yts98 edited ZOWA (+24, Update project status): https://wiki.archiveteam.org/?diff=50610&oldid=50195 |
| 19:51:50 | | leo60228 quits [Quit: ZNC 1.8.2 - https://znc.in] |
| 19:52:12 | | leo60228 (leo60228) joins |
| 19:54:25 | <fireonlive> | ah, darn |
| 19:54:39 | <fireonlive> | thanks yts98 :) |
| 20:01:01 | <@JAA> | Perhaps we should just have a simple template to render datetimes in a consistent manner. {{datetime|2023-08-28|22:00|CEST|+2}} → {{#formatdate:2023-08-28}} 22:00 CEST (UTC+2) or similar |
| 20:01:39 | <fireonlive> | i'd be up for something that's consistent |
| 20:01:56 | <@JAA> | The last two parameters could be optional, and the default would be UTC. |
| 20:02:23 | <fireonlive> | people wildly get confused with named timezones though so perhaps we could leave that out |
| 20:02:38 | <fireonlive> | EST vs EDT, even big streamers scheduling things |
| 20:03:05 | <fireonlive> | 'hey you know it's DT over there now.. so is it happening at 7 or 8?' |
| 20:03:15 | <fireonlive> | seems to come up a lot lol |
| 20:04:36 | <@JAA> | 'ET' |
| 20:04:42 | <@JAA> | (ノಥ益ಥ)ノ彡┻━┻ |
| 20:04:53 | <fireonlive> | too bad we can't just link them all to something like (js-ridden) https://www.timeanddate.com/worldclock/converter.html?iso=20230831T030000&p1=1440 |
| 20:04:55 | <fireonlive> | :P |
| 20:05:06 | <fireonlive> | 'type where you are and see what it is' |
| 20:05:29 | <fireonlive> | https://mkx9delh5a.execute-api.ca-central-1.amazonaws.com/uploads/e5654758afc913ec/image.png (i added Ottawa in this example) |
| 20:07:39 | <fireonlive> | the frowny faces are because it's mainly used for figuring out when to meet i guess |
| 20:08:42 | <fireonlive> | JAA: can we pls kill DST everywhere tks |
| 20:08:45 | <fireonlive> | T_T |
| 20:09:00 | <fireonlive> | inb4 perma-dst everywhere because i guess that sounds nicer to politicians |
| 20:16:54 | <@JAA> | Yes please |
| 20:19:40 | <fireonlive> | as long as it's gone i'll accept it |
| 20:19:49 | <fireonlive> | :D |
| 20:20:04 | <fireonlive> | (the DST vs ST 'final time' debate) |
| 20:20:30 | <@JAA> | Same, I don't even care anymore which one is chosen, just get rid of the stupid transition twice per year. |
| 20:21:33 | <fireonlive> | for sure |
| 20:32:09 | <thuban> | pokechu22: that sucks. multiple pipelines, then? i know you can't really do that with the jobs already in progress, but i don't think duplicating some of the work would hurt |
| 20:32:13 | <thuban> | (i also don't see any reason it needs to be done by domain--seems better to just split evenly) |
| 20:34:35 | <pokechu22> | Yeah, there's no real reason to split by domain, other than how I was building up my own lists originally. If it were an !a < list job for example.com/foo example.com/bar example.org/baz example.org/quux it would make sense to split example.com and example.org into two jobs to fully avoid !a < list issues, but we've already got multiple subdomains and multiple domains doesn't |
| 20:34:38 | <pokechu22> | make much of a difference |
| 20:35:08 | <pokechu22> | Unfortunately there are only 6 different sets of pipelines with distinct IPs, of which 3 are banned and 2 currently have jobs running on them |
| 20:35:21 | <thuban> | oof |
| 20:35:51 | <pokechu22> | the remaining one is also basically always full since it effectively only has 4 slots at the moment and they're usually filled with long-running jobs :| |
| 20:36:41 | <pokechu22> | Hopefully the bans don't last too long and we can get the other ones back into use |
| 20:36:46 | <thuban> | :I |
| 20:36:48 | <thuban> | yeah |
| 20:38:05 | <thuban> | at least we'll definitely get through all the front pages from the priority list (and probably their assets as well) |
| 20:40:11 | <pokechu22> | Yeah |
| 20:53:17 | <vokunal|m> | +1 zowch |
| 20:53:44 | | Unholy2361316618085 quits [Ping timeout: 252 seconds] |
| 21:03:47 | <nicolas17> | what's ZOWA |
| 21:04:03 | <@JAA> | https://wiki.archiveteam.org/index.php/ZOWA |
| 21:04:41 | <nicolas17> | oh yikes, video... any idea of size? |
| 21:05:51 | | Unholy2361316618085 (Unholy2361) joins |
| 21:06:50 | <@JAA> | #zowch for ZOWA |
| 21:07:28 | <nicolas17> | anyone updating channel on wiki? |
| 21:07:38 | <appledash> | Does archiveteam accept donations? if so, I hope they all go to the guy responsible for coming up with the channel names |
| 21:07:42 | <appledash> | he's got a hard jo |
| 21:07:43 | <appledash> | b |
| 21:07:43 | <flashfire42> | is the telegram thing still going nuts? |
| 21:08:07 | <flashfire42> | Like is the redoing everything thing still active or is it back to normal? |
| 21:09:19 | <fireonlive> | so many OWASP channels |
| 21:11:13 | <@JAA> | appledash: https://wiki.archiveteam.org/index.php/Donate |
| 21:12:22 | <appledash> | wtf, the fact that someone who has only donated $40 is top 15 is a travesty |
| 21:12:28 | <appledash> | Remind me to contribute when I get paid |
| 21:13:55 | <nstrom|m> | Can someone fill me in on the owasp drama? Maybe in -ot |
| 21:14:37 | <flashfire42> | I have no fucking idea I just jumped on the bandwagon |
| 21:14:50 | <@JAA> | appledash: It has only been in use and publicised since a couple months ago during the Imgur project, although the page has existed for years. |
| 21:14:59 | <appledash> | Ahhh |
| 21:16:30 | <h2ibot> | Switchnode edited ZOWA (+5, add irc channel): https://wiki.archiveteam.org/?diff=50611&oldid=50610 |
| 21:16:53 | <pokechu22> | I queued one more job for orange.fr URLs that aren't found on archive.org at all, though whether or not the pipeline slot will free up remains to be seen |
| 21:33:10 | <h2ibot> | JustAnotherArchivist edited ZOWA (+56, Reference for shutdown): https://wiki.archiveteam.org/?diff=50612&oldid=50611 |
| 21:44:40 | <nicolas17> | rewby: how are the targets and IA doing? do you have a giant backlog in temporary storage again? |
| 21:50:16 | <@rewby> | nicolas17: I have about 31.2TiB in temp storage. And another 200 or so TiB left on it. |
| 21:50:30 | <@rewby> | Targets are fine at the moment |
| 21:50:47 | <@rewby> | It's just that all active projects managed to hit bugs all at once as far as I can tell |
| 21:51:32 | <@rewby> | Based on what I've read (and I'm not an authority here): shreddit is paused due to some concern around image capture maybe not working right |
| 21:51:39 | <@rewby> | deadcat is just mostly done |
| 21:51:57 | <nicolas17> | oh, I thought shreddit was still paused to give capacity to gfycat/xuite |
| 21:51:59 | <@rewby> | (and waiting for an update for the last few items) |
| 21:52:03 | <@rewby> | xuite is just slow |
| 21:52:14 | <@rewby> | (something something asia is a pain to get data in and out of) |
| 21:52:29 | <@rewby> | If you have ipv6, I think xuite could use your help |
| 21:52:49 | <@rewby> | telegram was provided offload capacity but I don't know if it's being used yet |
| 21:53:10 | <nicolas17> | telegram seems to have 0 in todo |
| 21:53:25 | <@rewby> | Actually, tg is slowly returning stuff |
| 21:53:28 | <@rewby> | So looks to be working |
| 21:53:42 | <@rewby> | Uh... what else... urls is still paused |
| 21:54:06 | <nicolas17> | I think a bunch of stuff in tg was stashed away, maybe it needs to be brought back, but idk status, I wasn't even in the channel the last few days |
| 21:54:09 | <@rewby> | Although that's been hooked up to offload too in case arkiver wants to have a go at it (although probably not at full speed to conserve space) |
| 21:54:31 | <@rewby> | And yeah... that's about it? |
| 21:55:24 | <fireonlive> | shreddit was paused while i.reddit.com's new javascript/etc fuckery is checked to ensure the data we save is good |
| 21:55:28 | <fireonlive> | AIUI |
| 21:55:48 | <nicolas17> | if there's "free" capacity we can slightly open the faucet on imgur (: |
| 21:56:03 | <fireonlive> | imgur is slowly deleting images off of the CDN now, per BigBrain |
| 21:56:11 | <fireonlive> | 302s are rising from the canary list |
| 21:56:19 | <@rewby> | Ah |
| 21:56:23 | <@rewby> | I'll add it to offload I guess |
| 21:56:28 | <@rewby> | And then it's up to arkiver and JAA to turn that on and off |
| 21:56:34 | <fireonlive> | :) thanks |
| 21:56:50 | <@rewby> | Mind you, I've only got like a quarter of a PiB of space |
| 21:56:58 | <@rewby> | And that has to last us until the IA comes back |
| 21:57:11 | <nicolas17> | are you not uploading anything to IA right now? |
| 21:57:16 | <@rewby> | Not yet |
| 21:57:18 | <@rewby> | Code's not ready for it |
| 21:57:51 | <vokunal|m> | It's nice to see nearly 200M items in queue and realize for once it's only like ~75GiB |
| 21:58:13 | <nicolas17> | vokunal|m: lol, in what project? |
| 21:58:22 | <fireonlive> | xuite if i had to guess |
| 21:58:58 | <vokunal|m> | Imgur. Though is it probably the item size avg bugged after being offline so long? |
| 21:59:10 | <@rewby> | nicolas17: Getting code ready for uploading to IA is a lower prio than actually capturing data atm |
| 21:59:11 | <thuban> | telegram is still running (so items submitted to the bot are still processed), but its backlog was stashed and since other projects are paused it's not receiving items from outlinks (which were the majority of its volume) |
| 21:59:38 | | Megame (Megame) joins |
| 22:00:38 | <fireonlive> | ah |
| 22:00:44 | <nicolas17> | vokunal|m: that math doesn't look right :P |
| 22:01:02 | <nicolas17> | item size is 367 KB |
| 22:01:20 | | BlueMaxima joins |
| 22:02:25 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 22:02:31 | | BlueMaxima joins |
| 22:02:40 | <flashfire42> | arkiver is the deduplication still turned off for telegram? |
| 22:03:49 | <nicolas17> | rewby: imgur has a lot of 'redo' that will probably have low success rate, so we can also regulate speed that way |
| 22:03:59 | <nicolas17> | move some stuff from redo to todo to slow down, ask me to add a bruteforced list to speed up :P |
| 22:04:09 | <vokunal|m> | 73TB? I think i divided instead of multiplied |
| 22:05:03 | <nicolas17> | vokunal|m: yes that's the right multiplication, but note a lot of those 200M are retries and will fail |
| 22:12:19 | <h2ibot> | FireonLive edited Current Projects (+27, add IRC channel for ZOWA): https://wiki.archiveteam.org/?diff=50613&oldid=50609 |
| 22:12:24 | | ymgve_ joins |
| 22:13:44 | | ymgve quits [Ping timeout: 265 seconds] |
| 22:22:00 | <@arkiver> | flashfire42: yes, i'll turn that on shortly again |
| 22:23:47 | <flashfire42|m> | Probably a good idea |
| 22:28:48 | <fireonlive> | https://wiki.archiveteam.org/index.php/Template:@ |
| 22:28:51 | <fireonlive> | interesting template |
| 22:29:20 | <fireonlive> | (it's an image!) |
| 22:29:31 | <fireonlive> | oh, for emails |
| 22:30:31 | <fireonlive> | (well one email :3) |
| 22:31:04 | <flashfire42|m> | I wonder if we will ever find out the reason behind the ingestion issues |
| 22:32:07 | <flashfire42|m> | And are we slowly pushing from the offload storage or is it just sitting quietly? |
| 22:33:52 | <fireonlive> | not uploading to IA from offload atm, code needs to be written (rewby mentioned it above) |
| 22:34:35 | <@rewby> | My plan is to spend some time later this week getting uploading going |
| 22:35:24 | <h2ibot> | FireonLive edited Template:IRC-Hackint (+22, +deleteme in favour of Template:IRC): https://wiki.archiveteam.org/?diff=50614&oldid=41452 |
| 22:35:29 | <fireonlive> | i have no idea what i went to wiki.archiveteam.org for initially, but it ended in that |
| 22:41:26 | <h2ibot> | FireonLive edited YouTube (-2, #youtubearchive → on haitus): https://wiki.archiveteam.org/?diff=50615&oldid=50569 |
| 22:41:39 | <fireonlive> | it wasn't that either |
| 22:41:43 | <fireonlive> | oh well :D |
| 23:04:29 | | fangfufu quits [Ping timeout: 265 seconds] |
| 23:05:48 | <thuban> | front pages of 'online' orange.fr sites are done :D |
| 23:06:10 | | fangfufu joins |
| 23:06:16 | | fangfufu is now authenticated as fangfufu |
| 23:07:25 | <thuban> | ~8 days' worth of requests remaining in queue, so front page assets at least should just finish before shutdown |
| 23:09:00 | <fireonlive> | awesome |
| 23:09:08 | <fireonlive> | ^_^ |
| 23:46:51 | | Darken (Darken) joins |
| 23:56:46 | | Darken quits [Remote host closed the connection] |
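
The scheme-prefixing step pokechu22 describes at 02:34:25 (prefix everything with http://, then upgrade hosts with a single label before the -orange.fr domain to https://) could be sketched roughly as below. This is only an illustration, not the regexes actually used: the function name `add_scheme` is hypothetical, and it assumes each list entry is a bare host or host/path with no scheme, as in the lists posted above.

```python
import re

# Pattern quoted from the log: exactly one label in front of a
# "*-orange.fr" domain, and no path after the host.
SINGLE_LABEL = re.compile(r"^http://([^/\.]+\.[^/\.]+-orange\.fr)$")


def add_scheme(entry: str) -> str:
    """Prefix a bare list entry with a scheme (hypothetical helper).

    Assumes the entry has no scheme yet. Everything gets http:// first;
    entries matching SINGLE_LABEL are then rewritten to https://.
    """
    url = "http://" + entry.strip()
    return SINGLE_LABEL.sub(r"https://\1", url)


if __name__ == "__main__":
    for entry in (
        "08.pagesperso-orange.fr",
        "acf.luis.pagesperso-orange.fr",
        "08.pagesperso-orange.fr/odp/index.htm",
    ):
        print(add_scheme(entry))
```

Hosts with extra dots, such as acf.luis.pagesperso-orange.fr (noted at 00:31:36 as not working over https), are left on http://, as is any entry that still carries a path.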