00:42:58testttt quits [Ping timeout: 265 seconds]
01:47:44Naruyoko quits [Ping timeout: 265 seconds]
01:57:55Exorcism quits [Remote host closed the connection]
01:59:09Exorcism (exorcism) joins
02:02:00qwertyasdfuiopghjkl quits [Client Quit]
04:13:49<@arkiver>up and running again!
04:18:05<thuban>arkiver: returning nil from queue_all if the site slug matches *\.(mairie|assoc|ecole) fails to queue any of those sites (in any variant). that's not the intended behavior, is it?
04:19:43<@arkiver>no
04:19:52<@arkiver>see where the queue_all function is called
04:20:25<@arkiver>hmm
04:20:32<@arkiver>actually i see
04:20:45<@arkiver>let me just strip it
04:20:48<thuban>yeah
04:20:53<@arkiver>good point
04:29:46<@arkiver>thuban: done
04:31:11<@arkiver>DLoader: project10: fuzzy8021: FYI we're back up!
04:31:25<@arkiver>according to findings by imer limits will be lowered tomorrow
04:31:40<project10>ty, updating ;)
04:32:03<@arkiver>project10: you have no auto-update stuff in place right?
04:32:55<project10>I have a few machines that were building the image manually, most with watchtower
04:32:56<@arkiver>i'll be off now
04:33:02<@arkiver>project10: alright!
04:33:19<@arkiver>so i'll ping you then in case of important updates
04:33:39<project10>nah, I'll just switch to mainstream updates. I just did that to jump the gun before the drone/image-poker got poked.
04:33:51<@arkiver>ah! i see
04:34:04<project10>(I guess warrior can git pull the grab project itself? People's warriors seemed to start working right away)
04:34:23<@arkiver>yeah, it auto updates
04:34:31<@arkiver>pulls the repo actually, not the image
04:35:08<thuban>arkiver: do we want to restrict stripping to the case where sub == "pagespro-orange"? theoretically possible that people could use those suffixes on the other domains
04:35:59<thuban>ah yep, there's at least one online http://kerglaw.ecole.pagesperso-orange.fr/
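A minimal sketch of the restriction thuban suggests above, assuming that "stripping" means removing a trailing .mairie/.assoc/.ecole from the site slug, and only when the domain part is "pagespro-orange". The function name normalize_slug and its arguments are hypothetical; the project's real queue_all code is not shown in this log.

```python
# Suffixes named in the discussion above.
SUFFIXES = (".mairie", ".assoc", ".ecole")


def normalize_slug(slug: str, sub: str) -> str:
    """Strip a category suffix from the slug, but only on pagespro-orange.

    Hypothetical helper: `slug` is the site name, `sub` is the domain part
    ("pagespro-orange" or "pagesperso-orange").
    """
    if sub == "pagespro-orange":
        for suffix in SUFFIXES:
            if slug.endswith(suffix):
                return slug[: -len(suffix)]
    # On other domains the suffix may be a genuine part of the site name.
    return slug
```

With this, "kerglaw.ecole" on pagesperso-orange is left untouched, while "mairie-ballainvilliers.mairie" on pagespro-orange becomes "mairie-ballainvilliers".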
04:37:02<project10>arkiver: thx for your efforts getting this going again
04:41:15<@arkiver>project10: thanks
04:41:19<@arkiver>thuban: yeah
04:50:23<@arkiver>thuban: also in
04:50:29<@arkiver>project10: updated again :)
04:51:40<project10>just as all the containers finally spun up ;)
04:51:47<@arkiver>ohno :P
04:51:53<thuban>lgtm!
04:51:56<project10>(back on watchtower mainline)
04:52:00<thuban>thanks arkiver :)
04:52:10<@arkiver>thanks for the watchful eye thuban
04:52:24<@arkiver>project10: so do i still need to ping in case of an update?
04:53:14<project10>nope, thanks for the ping. Maybe you could ping if you are making a major adjustment to the rates/limits such that people with weird setups might get banned :)
04:53:30<@arkiver>alright!
04:53:31<@arkiver>and yeah
04:53:33<project10>ref imer's findings, not sure what that's about
04:53:46<@arkiver>we'll put that in tomorrow
04:54:34<project10>were the (mairie|assoc|ecole) items removed from backfeed queue? or did my eyes deceive, and they were never there?
04:56:12<@arkiver>they're being moved away
05:07:13<project10>3=200 https://mairie-ballainvilliers.mairie.pagespro-orange.fr/style.css
05:07:17<project10>cool :)
05:55:34yts98 leaves
05:55:53yts98 joins
05:59:49Exorcism quits [Remote host closed the connection]
06:01:50Exorcism (exorcism) joins
06:51:52Exorcism2 (exorcism) joins
06:53:27Exorcism quits [Read error: Connection reset by peer]
06:53:27Exorcism2 is now known as Exorcism
07:40:34wickedplayer494 quits [Ping timeout: 265 seconds]
09:13:21toss (toss) joins
09:35:39qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
09:38:21Exorcism quits [Remote host closed the connection]
09:39:07Exorcism (exorcism) joins
10:11:53Exorcism quits [Remote host closed the connection]
10:12:39Exorcism (exorcism) joins
10:30:39Hans5958 (Hans5958) joins
11:06:14Peroniko (Peroniko) joins
11:25:42kallemarc joins
11:27:26kallemarc quits [Remote host closed the connection]
11:27:47toss quits [Read error: Connection reset by peer]
11:38:02levomi joins
12:06:53sonick (sonick) joins
12:39:40Maturion joins
12:57:20Peroniko quits [Ping timeout: 252 seconds]
12:58:05Peroniko (Peroniko) joins
14:03:33<nulldata>found item url:https://sibel.reunion.pagesperso-orange.fr///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////parapharmacie/parapharmacie_kln.html
14:04:05<nulldata>How many slashes could one need? lol
14:05:29<imer>Yes
14:23:19<imer>4-5 day eta at current speed, so gotta go a bit faster yet ;)
14:30:59<@arkiver>nulldata: ouch - i'll add a little extra rule for that
14:56:28Exorcism quits [Remote host closed the connection]
14:57:58Exorcism (exorcism) joins
15:29:25<@arkiver>imer: with your research, what would be safe sleeping times?
15:29:29<@arkiver>we now sleep 2 seconds on 200
15:29:33<@arkiver>and 6 seconds on non-200
15:33:00<imer>arkiver: I'd probably keep the 6s for error redirects (if you have access to that info), but everything else should be fine with barely any sleeping, maybe half a second or less (I was doing 10 req/s without seeing bans)?
15:33:13<imer>Just checking my banned ips, those are unbanned again
15:33:24<imer>bit late, meant to check those at the 24h mark
15:34:30<@arkiver>i can reduce the sleep on 200 to 1 second
15:34:48<@arkiver>if we keep the non-200 sleep on 6 seconds, it won't change much since the majority we go through is non-200
15:35:04<imer>can you sleep depending on if it's an error redirect or do you not have the info there?
15:35:22<@arkiver>i have that info!
15:35:25<@arkiver>so yes we can do that
15:35:40<imer>lets do that then :)
15:36:07<imer>6s only for error redirect and then everything else low (1s for a start I guess? although I think we can go lower)
15:36:13<@arkiver>alright!
15:54:49Exorcism quits [Remote host closed the connection]
15:54:56yts98 leaves
15:55:11yts98 joins
15:55:32Exorcism (exorcism) joins
15:59:40qwertyasdfuiopghjkl quits [Remote host closed the connection]
16:25:52qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
16:38:56threedeeitguy397 (threedeeitguy) joins
16:40:38threedeeitguy39 quits [Ping timeout: 252 seconds]
16:45:02threedeeitguy397 quits [Ping timeout: 252 seconds]
16:52:21threedeeitguy39 (threedeeitguy) joins
17:17:47<pokechu22>ArchiveBot reduces all repeated slashes to just 1 slash so we probably never noticed that (archivebot's behavior is bad for URLs that contain other URLs in them since it converts https://example.com/proxy/https://example.com into https://example.com/proxy/https:/example.com which some things don't like, but that's probably not a concern here)
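The slash-collapsing behaviour pokechu22 describes can be illustrated roughly as follows. This is a sketch of the idea only, not ArchiveBot's actual code; collapse_slashes is a hypothetical name, and the sketch collapses slashes in the query string too, which a real normalizer might not.

```python
import re


def collapse_slashes(url: str) -> str:
    """Reduce every run of slashes after the scheme separator to one slash."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        # No scheme present: collapse across the whole string.
        return re.sub(r"/{2,}", "/", url)
    return scheme + sep + re.sub(r"/{2,}", "/", rest)


# The embedded URL loses one of its two slashes, as noted in the message above.
print(collapse_slashes("https://example.com/proxy/https://example.com"))
# → https://example.com/proxy/https:/example.com
```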
17:29:01qwertyasdfuiopghjkl quits [Client Quit]
17:54:57flashfire42 joins
17:56:03kiska (kiska) joins
17:58:29<BornOn420>20 separate docker instances here for 4 hours already without bans.
17:59:02<pokechu22>BornOn420: On the same IP?
18:01:37<pokechu22>and actually separate instances, as in they don't know about each other, instead of using the standard concurrency mechanism? (Using the standard one adjusts the delays for this project to compensate)
18:05:59<BornOn420>yes, one IP, all concurrency=1
18:06:47<pokechu22>Hmm
18:11:26kiska5 joins
18:13:51<imer>40 resulted in a ban, so don't go too high :D
18:36:50<BornOn420>I do see an increase in 'bad response' answers, so no ban, but not all roses either.
19:11:25<@arkiver>update is in
19:11:28<@arkiver>to reduce sleep times
19:11:42<@arkiver>project10: DLoader: FYI ^
19:12:47<@arkiver>6 second sleep for redirect to 404
19:12:49<@arkiver>else 1 second
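The sleep policy arkiver describes (6 seconds after a redirect to the 404 page, 1 second otherwise) could be sketched like this. The names are hypothetical, and how the grab script detects an error redirect is not shown in this log, so it is passed in as a flag here.

```python
ERROR_REDIRECT_SLEEP = 6.0  # seconds, after a redirect to the 404 page
DEFAULT_SLEEP = 1.0         # seconds, after any other response


def sleep_seconds(is_error_redirect: bool) -> float:
    """Return how long to wait before the next request.

    `is_error_redirect` is assumed to be computed elsewhere.
    """
    return ERROR_REDIRECT_SLEEP if is_error_redirect else DEFAULT_SLEEP
```

Compared with the earlier 2 s / 6 s split on 200 versus non-200, only error redirects keep the long delay, which matters because most responses were non-200.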
19:28:40wickedplayer494 joins
19:37:23<imer>the speeeed :D
19:37:35<@arkiver>:)
19:42:11Peroniko quits [Ping timeout: 265 seconds]
19:42:39Peroniko (Peroniko) joins
19:42:39Peroniko quits [Max SendQ exceeded]
19:48:28tzt quits [Ping timeout: 265 seconds]
19:51:20tzt (tzt) joins
20:08:48@ChanServ sets mode: +o flashfire42
20:51:48sonick quits [Client Quit]
21:09:21<imer>2.5day eta at current rate, more like it
21:20:03<levomi>arkiver : ////// url's still show up in my logs : Archiving item url:http://pagesperso-orange.fr/sibel.reunion////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
21:20:03<levomi>//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////images/gammes_produits/gam_carasa.jpg
21:25:34<BornOn420>Item url:https://pagesperso-orange.fr/sibel.reunion////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////images
21:25:34<BornOn420>/logos/logo_oligo.jpg is aborted.
21:26:12<pokechu22>If they're immediately aborted that seems plausible
21:26:29Carnildo joins
21:29:54<levomi>Ah, i assumed that they would be filtered/corrected at the tracker side
21:29:58<BornOn420>Aha of course. That's the output of 'Starting SetBadUrl'
21:45:27<@JAA>Maybe they should be filtered out tracker-side instead?
21:45:43<@arkiver>i'll add that in a bit, finishing up something else
21:54:52Exorcism quits [Remote host closed the connection]
21:55:37Exorcism (exorcism) joins
21:57:36<@arkiver>they're being filtered out now
21:57:45<@arkiver>periodically
21:57:52<@arkiver>at ////
22:09:36<pokechu22>Oh, are you doing anything to normalize for sites that use backslashes instead of front slashes?
22:09:58<pokechu22>e.g. https://hansi.pagespro-orange.fr/
22:10:14<pokechu22>IIRC the pages don't work if you use backslashes, only if you use forward slashes, but I'm not 100% sure of that
22:14:07kalle joins
22:32:19kalle quits [Remote host closed the connection]
22:38:02tarsubo joins
22:39:22tarsubo quits [Remote host closed the connection]
22:50:11Exorcism quits [Remote host closed the connection]
22:50:52Exorcism (exorcism) joins
23:29:11<thuban>i believe the repeated slashes are due to ~malformed paths in the page source
23:29:16<thuban>eg, on sibel.reunion, a bunch of the img srcs begin with "..//"
23:29:22<thuban>as a result the second slash is captured as part of the `rest` group
23:29:26<thuban>and as it's collapsed by the browser but distinct to the deduper, the crawl just keeps bouncing back and forth adding more slashes
23:30:21<thuban>replacing the `rest`-preceding '/' with '/+' should fix this i think
23:31:17<thuban>("by the browser"--well, you know what i mean)
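thuban's proposed fix can be illustrated with a simplified pattern. The regexes below are assumptions made for this sketch, not the project's actual patterns; only the `rest` group name and the '/' to '/+' change come from the discussion above.

```python
import re

# Hypothetical patterns: a single '/' before `rest` versus '/+'.
STRICT = re.compile(r"^https?://(?P<site>[^/]+)\.pagesperso-orange\.fr/(?P<rest>.*)$")
LENIENT = re.compile(r"^https?://(?P<site>[^/]+)\.pagesperso-orange\.fr/+(?P<rest>.*)$")


def rest_of(pattern: re.Pattern, url: str) -> str:
    """Return the `rest` group of a URL, or raise if the URL does not match."""
    match = pattern.match(url)
    if match is None:
        raise ValueError(f"URL does not match: {url}")
    return match.group("rest")


url = "https://sibel.reunion.pagesperso-orange.fr//images/logo.jpg"
print(rest_of(STRICT, url))   # → /images/logo.jpg  (extra slash leaks into rest)
print(rest_of(LENIENT, url))  # → images/logo.jpg
```

With '/+', any number of leading slashes yields the same `rest`, so the deduper sees one URL instead of an ever-growing family. Runs of slashes further inside the path are not affected by this change.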
23:35:49Maturion quits [Remote host closed the connection]
23:53:45<@arkiver>since when does Pages Perso Orange exist? and what is a reference for this?