| 00:00:44 | | Chris5010 quits [Remote host closed the connection] |
| 00:25:50 | <@flashfire42> | My warriors are spinning and ready |
| 00:53:39 | <fireonlive> | 🥷🏻 |
| 01:17:48 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 01:54:04 | | pabs quits [Quit: Don't rest until all the world is paved in moss and greenery.] |
| 01:55:04 | | pabs (pabs) joins |
| 02:24:21 | <@arkiver> | alright i think the changes as given by pokechu22 are in |
| 02:30:15 | | Peroniko quits [Ping timeout: 265 seconds] |
| 02:30:43 | | Peroniko (Peroniko) joins |
| 02:31:34 | <thuban> | arkiver: i don't see them on github |
| 02:32:52 | <@arkiver> | doing final test |
| 02:33:08 | <thuban> | oic |
| 02:38:01 | <@arkiver> | seems to be working well |
| 02:47:58 | <@arkiver> | thuban: pushed |
| 02:50:49 | <@flashfire42> | tracker limited seems slightly better than no item received |
| 02:51:23 | <@arkiver> | i paused it |
| 02:51:27 | <@arkiver> | items are being queued |
| 02:51:47 | <project10> | will JAA need to poke something for a docker image build? |
| 02:52:20 | <pokechu22> | Looks good. Theoretically we still don't need to try both protocols but doing both is still fine |
| 02:53:04 | <pokechu22> | arkiver: err, " local maxtries = tries - 1"? |
| 02:53:59 | <pokechu22> | I don't think that makes sense, as `tries > maxtries` will always be true (barring integer overflow maybe, which shouldn't matter). Though for maxtries = 1 it's probably fine |
| 02:56:18 | <@arkiver> | i know |
| 02:56:25 | <@arkiver> | it's just to not retry |
| 02:56:38 | <@arkiver> | while keeping original code somewhat in place for if we want to start doing more stuff with that |
| 02:56:50 | <pokechu22> | Does maxtries = 1 not do the same thing? |
| 02:58:18 | <@arkiver> | no because tries will be 1 and the condition after is a > condition |
| 02:59:49 | <@flashfire42> | looks like we are loaded |
| 03:09:06 | <thuban> | i'm unsure about skipping the working redirect domains (*.orange.fr, monsite.wanadoo.fr) |
| 03:09:09 | <thuban> | they are low-value, but they do preserve discoverability in the wbm |
| 03:09:40 | <thuban> | might it be better to 'pass through' urls we receive (either manually queued or through backfeed) at the redirect domains, but not generate them from other formats? |
| 03:09:48 | <@flashfire42> | could queue them if we have time? |
| 03:10:00 | <@flashfire42> | We are already on quite borrowed time here |
| 03:11:48 | <thuban> | we could, but |
| 03:16:48 | <thuban> | manually generating and queuing an e.g. monsite.orange version of every monsite-orange url is equivalent to (but less convenient than) the original queue_all_versions logic |
| 03:16:55 | <thuban> | and it fails to distinguish between monsite.orange urls actually discovered in the wild (for which preserving the redirect may be useful) and those we merely conjecture (less so) |
| 03:21:39 | <thuban> | my previous suggestion was to put the old domains in a secondary queue, but idk whether the tracker can actually do pattern-based sorting of backfed items |
| 03:21:51 | <nstrom|m> | so I'm around for a little bit if this gets started within the next 45 mins or so but otherwise will need to get to bed |
| 03:24:19 | <@JAA> | arkiver: So, do I need to poke Drone? |
| 03:44:12 | <@flashfire42> | Atm it looks like its still queueing items |
| 03:48:00 | <nstrom|m> | oh well I'll check back in tomorrow, gl |
| 04:00:45 | | kiryu quits [Client Quit] |
| 04:01:45 | <@arkiver> | JAA: yes please |
| 04:03:29 | <@arkiver> | thuban: i'll add them |
| 04:12:46 | <@arkiver> | it's running |
| 04:12:57 | <@arkiver> | for now i'll keep it running on a low rate, i'm off to bed |
| 04:14:54 | <thuban> | sounds good! |
| 04:15:06 | <@arkiver> | so 1000 items/min |
| 04:15:10 | <@arkiver> | we'll optimize that tomorrow |
| 04:15:26 | <thuban> | thanks for your hard work, good night :) |
| 04:17:58 | <@arkiver> | well sorry for the delays |
| 04:18:01 | <@arkiver> | but i think we'll make it |
| 04:18:04 | <@arkiver> | and thanks! |
| 04:18:10 | <@arkiver> | good day/night to you too |
| 04:44:46 | <@flashfire42> | https://server8.kiska.pw/uploads/8ce843fb94aa421f/image.png |
| 04:55:31 | <@flashfire42> | https://server8.kiska.pw/uploads/0699e0ade83c7397/image.png |
| 04:55:41 | <pokechu22> | I'm seeing items on 9=0 http://perso.wanadoo.fr/haikal/_themes/zero/zerbul1a.gif - that doesn't seem particularly useful since perso.wanadoo.fr always times out |
| 04:56:17 | <fireonlive> | docker: Error response from daemon: manifest for atdr.meo.ws/archiveteam/pagespersoorange-grab:latest not found: manifest unknown: manifest unknown. |
| 04:56:19 | <fireonlive> | UwU |
| 05:03:31 | | kiryu (kiryu) joins |
| 05:04:49 | <@flashfire42> | https://transfer.archivete.am/QrrL3/Failed%20SetBadUrls%20for%20Item%20urlhttpp.txt |
| 05:07:26 | <@flashfire42> | Traceback (most recent call last): |
| 05:07:26 | <@flashfire42> | File "/usr/local/lib/python3.9/site-packages/seesaw/task.py", line 88, in enqueue |
| 05:07:26 | <@flashfire42> | self.process(item) |
| 05:07:26 | <@flashfire42> | File "<string>", line 158, in process |
| 05:07:26 | <@flashfire42> | ValueError: 'url:http://perso.orange.fr/imagetransfert/mur du son/mur du sonframeset-1.htm/robots.txt' is not in list |
| 05:07:43 | <@flashfire42> | Seems its freaking out when the URL isnt in the list? |
| 05:12:45 | <pokechu22> | My guess is it's the spaces maybe? |
| 05:14:18 | <pokechu22> | Here's a bigger problem: |
| 05:14:21 | <pokechu22> | Archiving item url:http://pagesperso-orange.fr/closdominant/copains.htm |
| 05:14:23 | <pokechu22> | 5=301 http://pagesperso-orange.fr/closdominant/copains.htm |
| 05:14:25 | <pokechu22> | Server returned bad response. Skipping. |
| 05:14:27 | <pokechu22> | Aborting item url:http://pagesperso-orange.fr/closdominant/copains.htm. |
| 05:15:53 | <pokechu22> | http://pagesperso-orange.fr/xxxx will ALWAYS redirect to xxxx.pagesperso-orange.fr, but we're aborting those redirects. There's no point in grabbing the page in that case. This might also affect on-site links that redirect (I don't have any examples of these, but maybe ones to an open directory where the slash is added would have that?) (Aborting for redirects makes some sense |
| 05:15:53 | <@flashfire42> | at an educated guess anything "aborted" goes into the backlog |
| 05:15:56 | <pokechu22> | for redirects to the redirect to the 404 page though) |
| 05:16:50 | <@flashfire42> | and yeah I would say its the spaces making it freak out |
| 05:17:26 | <pokechu22> | Yeah, but we still want to save that redirect, otherwise there's no point in downloading it in the first place :) |
| 05:19:32 | <thuban> | https://github.com/ArchiveTeam/pagespersoorange-grab/blob/aaeb1a7b35f02c3d56d944651cc4dd7c655f9553/pagespersoorange.lua#L221 yeah, whoops |
| 05:23:50 | <@flashfire42> | wait so they wont be retried later? |
| 05:23:53 | <@flashfire42> | I am hella confused |
| 05:49:19 | <@flashfire42> | No fair project10 we dont all have a fleet XD |
| 05:50:01 | <project10> | :3 |
| 05:50:24 | <fireonlive> | 💪 |
| 05:50:35 | <fireonlive> | did you just build the image yourself |
| 05:50:38 | <project10> | I think arkiver said 1k items/min is the throttle rate? we're doing like 10/min |
| 05:50:43 | <project10> | I did yes |
| 05:50:49 | <fireonlive> | ah :3 |
| 05:50:55 | <project10> | well via docker-compose build directive |
| 05:59:25 | <project10> | ETA ~1000d |
| 07:02:02 | | magmaus3 (magmaus3) joins |
| 07:11:51 | <@flashfire42> | Traceback (most recent call last): |
| 07:11:51 | <@flashfire42> | File "/usr/local/lib/python3.9/site-packages/seesaw/task.py", line 88, in enqueue |
| 07:11:51 | <@flashfire42> | self.process(item) |
| 07:11:51 | <@flashfire42> | File "<string>", line 156, in process |
| 07:11:51 | <@flashfire42> | File "/usr/local/lib/python3.9/codecs.py", line 322, in decode |
| 07:11:51 | <@flashfire42> | (result, consumed) = self._buffer_decode(data, self.errors, final) |
| 07:11:51 | <@flashfire42> | UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 290: invalid continuation byte |
| 07:59:29 | | Peroniko quits [Client Quit] |
| 08:47:52 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 11:55:27 | <phaeton> | was a docker container built for this? |
| 11:55:49 | <phaeton> | image |
| 12:03:11 | | Exorcism quits [Read error: Connection reset by peer] |
| 12:06:06 | | Exorcism (exorcism) joins |
| 12:26:43 | <nulldata> | phaeton - doesn't seem so yet |
| 12:33:09 | <imer> | Im sure we'll speed up plenty with docker and once this is set to default warrior project |
| 12:35:16 | <phaeton> | looks like a high percentage of my items are getting dumped due to that line 88 error. Seems to be unable to match spacing in one direction or another |
| 12:36:49 | <phaeton> | https://transfer.archivete.am/3AZPk/line88.txt |
| 12:44:47 | | Exorcism quits [Remote host closed the connection] |
| 12:47:10 | | Exorcism (exorcism) joins |
| 13:32:41 | <phaeton> | Same line, different error. Looks similar to one earlier in the chat. https://transfer.archivete.am/FFij7/line88UnicodeDecodeError |
| 13:36:42 | <imer> | arkiver: ^ |
| 13:44:26 | | Maturion joins |
| 13:47:21 | | Maturion quits [Remote host closed the connection] |
| 15:01:26 | <project10> | Archiving item url:http://pagesperso-orange.fr/philippe.dornbusch/analyses/games/1?1/=2<30444=4243>4b.<.a-,.-?*=...<+,=+)< |
| 15:07:15 | <imer> | JAA: reminder to poke ci :) |
| 15:07:45 | <@JAA> | imer: I did, but it's not working correctly. |
| 15:07:59 | <imer> | ok, thanks! |
| 15:08:20 | <imer> | I assume you've tried poking it harder? |
| 15:08:24 | <imer> | longer stick? |
| 15:08:51 | <@JAA> | I think I need a sharper one instead. |
| 15:08:59 | <imer> | ooh, good idea |
| 15:20:35 | <nulldata> | Maybe needs to be slapped around a bit with a large trout? |
| 15:45:09 | <@arkiver> | pokechu22: so perso.wanadoo.fr is completely gone? what else is completely gone? |
| 15:45:45 | <@arkiver> | fixes coming |
| 16:15:57 | <project10> | arkiver: pro.wanadoo.fr is also NXDOMAIN |
| 16:25:32 | <pokechu22> | arkiver: perso.wanadoo.fr is completely gone, and I guess pro is too, but http://monsite.wanadoo.fr/ *isn't*. It was like this before the original deadline as well. (Though the content that was on perso.wanadoo.fr is still on pagesperso-orange.fr.) |
| 16:26:34 | <pokechu22> | The last capture IA has for perso.wanadoo.fr is in May 2023 |
| 16:33:43 | | Exorcism0 (exorcism) joins |
| 16:34:27 | | Exorcism quits [Read error: Connection reset by peer] |
| 16:34:27 | | Exorcism0 is now known as Exorcism |
| 16:50:34 | <Exorcism> | https://thelounge.exorcism.repl.co/uploads/97722943a80c6a32/image.png 😭 |
| 17:03:38 | <nulldata> | Exorcism - yeah, JA A is finding a sharper stick to poke drone with as of this morning |
| 17:03:58 | | VoynichCR (VoynichCR) joins |
| 17:07:25 | <fireonlive> | 🔪 |
| 17:32:43 | <Exorcism> | 👹 |
| 17:45:10 | | VoynichCR quits [Remote host closed the connection] |
| 18:07:59 | <@arkiver> | pokechu22: "i guess pro is too" - can you be specific? |
| 18:08:09 | <@arkiver> | pro.wanadoo.fr? |
| 18:08:13 | <@arkiver> | because i'll filter these out now |
| 18:09:16 | <pokechu22> | Yeah, http://pro.wanadoo.fr/ and http://perso.wanadoo.fr/ should be filtered out |
| 18:09:46 | <pokechu22> | They're already marked as "skip" in the script though, so I guess that's just not working? |
| 18:10:54 | <pokechu22> | The other thing I noticed is that the script aborted http://pagesperso-orange.fr/closdominant/copains.htm because it gave a 301, but it was a 301 to https://closdominant.pagesperso-orange.fr/copains.htm which is the kind of 301 we definitely want to be saving (otherwise there's no point in doing requests to http://pagesperso-orange.fr/xxx as all of those will redirect) |
| 18:11:23 | | @flashfire42 quits [Client Quit] |
| 18:11:23 | | kiska quits [Client Quit] |
| 18:14:12 | | flashfire42 joins |
| 18:15:26 | | kiska (kiska) joins |
| 18:24:30 | | flashfire42 quits [Client Quit] |
| 18:24:31 | | kiska quits [Client Quit] |
| 18:50:23 | | Peroniko (Peroniko) joins |
| 18:56:35 | | flashfire42 joins |
| 18:57:37 | | kiska (kiska) joins |
| 19:03:43 | | Exorcism quits [Remote host closed the connection] |
| 19:04:31 | | Exorcism (exorcism) joins |
| 19:28:12 | | Exorcism quits [Remote host closed the connection] |
| 19:28:50 | | Exorcism (exorcism) joins |
| 19:29:08 | | Exorcism quits [Remote host closed the connection] |
| 19:29:51 | | Exorcism (exorcism) joins |
| 20:13:44 | | Chris5010 (Chris5010) joins |
| 20:15:51 | | Chris5010 quits [Client Quit] |
| 20:25:20 | | Chris5010 (Chris5010) joins |
| 21:03:33 | <@arkiver> | pokechu22: the "skip" is for something else |
| 21:03:44 | <@arkiver> | so we have directly queued URLs from your dumps, and URLs discovered through others |
| 21:03:55 | <@arkiver> | those marked "skip" will not be queued from other URLs |
| 21:04:08 | <@arkiver> | (while still seen as 'being part' of the project) |
| 21:04:28 | <pokechu22> | Alright, that makes sense |
| 21:05:39 | <pokechu22> | Are pages that don't exist/aren't valid still recorded into the WARC (e.g. https://aquilon.pagesperso-orange.fr/)? It feels like it would be useful to save that into the WBM especially if you queued all of the stuff in my files already (recording that a file existed in the past but not now is still interesting) |
| 21:08:40 | <@arkiver> | pokechu22: to confirm - a ban is not a 301 to some "ban page" right? |
| 21:08:49 | <pokechu22> | Right |
| 21:08:49 | <@arkiver> | if i remember correctly you said they just don't respond |
| 21:09:15 | <pokechu22> | The 301s to error pages are normal behavior for 404s and 401s/403s |
| 21:09:24 | <pokechu22> | it just won't respond or will refuse connections when banned |
| 21:09:28 | <@arkiver> | alright we'll accept all 301s as being fine |
| 21:09:31 | <@arkiver> | good |
| 21:10:24 | <pokechu22> | There's also 302s, not 100% sure what the pattern is between 301 and 302 |
| 21:12:01 | <pokechu22> | might just be that 301 goes to a new location that's valid (or that's the page for the site having moved to a new location), while 302s go to the 404/403 page (example of a 302 to the 403 page: http://paroisse.wambrechies.pagesperso-orange.fr/Files/Image/Nouveau%20dossiervie%20tous%20les%20jours/presbytere%20izi.jpg) |
| 21:12:27 | <@arkiver> | yeah |
| 21:12:31 | <@arkiver> | we'll accept both 301 and 302 |
| 21:14:17 | | colona quits [Ping timeout: 252 seconds] |
| 21:14:33 | | colona (colona) joins |
| 21:20:55 | <@arkiver> | pokechu22: do we still need to queue all versions if there is a redirect to the 404 page? |
| 21:21:21 | <pokechu22> | No, that's probably not necessary |
| 21:21:26 | <@arkiver> | alright |
| 21:23:18 | <@arkiver> | pokechu22: do all types go to https://r.orange.fr/r/Oerreur_404 in case of a 404? |
| 21:23:39 | <pokechu22> | I'm pretty sure they do |
| 21:23:47 | <pokechu22> | though there's a complication to that |
| 21:23:50 | <pokechu22> | of course |
| 21:24:11 | <@arkiver> | of course :P |
| 21:25:13 | <pokechu22> | http://pagesperso-orange.fr/convoi/css/index.css doesn't exist, but http://perso.orange.fr/convoi/css/index.css redirects to http://convoi.perso.orange.fr/ (but interestingly, http://convoi.perso.orange.fr/css/index.css does just give a redirect to the 404 directly? Going to try a second one to get more data) |
| 21:26:10 | <pokechu22> | I guess relatedly: http://18-25ans.perso.orange.fr/ existing does not in any way imply that https://18-25ans.pagesperso-orange.fr/ exists :| |
| 21:28:19 | <pokechu22> | ok this is dumb: http://ovine.sngtv.pagesperso-orange.fr/Enterites%20infectieuses.pdf exists |
| 21:28:22 | <pokechu22> | http://pagesperso-orange.fr/ovine.sngtv/Enterites%20infectieuses.pdf redirect to http://ovine.sngtv.pagesperso-orange.fr/Enterites%20infectieuses.pdf |
| 21:28:24 | <pokechu22> | http://perso.orange.fr/ovine.sngtv/Enterites%20infectieuses.pdf redirect to http://ovine.sngtv.perso.orange.fr/ |
| 21:28:26 | <pokechu22> | http://ovine.sngtv.perso.orange.fr/Enterites%20infectieuses.pdf 404 despite existing on http://pagesperso-orange.fr/ |
| 21:31:17 | <@arkiver> | pokechu22: so... reading that we should still queue all versions even if we get a redirect to a 404 |
| 21:31:37 | <pokechu22> | Yeah :| |
| 21:32:05 | <pokechu22> | at least for ones on perso.orange.fr |
| 21:33:00 | <pokechu22> | It's probably not super useful to queue URLs on perso.orange.fr if the site doesn't explicitly link into those forms, but those that have already been found need to be queued into the other forms even if they give a 404 or redirect on perso.orange.fr |
| 21:33:25 | <pokechu22> | err, "give a 404" won't happen, rather if they redirect to the front page or if they redirect to the 404 page |
| 21:34:30 | <pokechu22> | But just doing everything is fine too, assuming we have time, which may or may not be the case :| |
| 21:40:26 | <@arkiver> | what a mess |
| 21:43:05 | <nstrom|m> | what's the docker image for this one called? the one I tried didn't work |
| 21:43:17 | <@arkiver> | that is unfortunately still having problems :/ |
| 21:43:23 | <@arkiver> | JAA: we did not find a solution yet right? |
| 21:44:48 | <nstrom|m> | I can build it myself, just wasn't sure how the warrior users were getting it but I guess that doesn't use docker, it gets w git |
| 21:44:58 | <@arkiver> | excactly, yes |
| 21:45:04 | <@arkiver> | exactly* |
| 21:45:06 | <@arkiver> | update is coming |
| 21:50:16 | <@JAA> | arkiver: Nope :-| |
| 21:50:32 | <@JAA> | Maybe another commit will fix it. |
| 21:50:42 | <@JAA> | Which it seems is coming anyway. |
| 21:51:15 | <@arkiver> | yep |
| 21:58:39 | <@arkiver> | JAA: it's building |
| 21:58:48 | <@JAA> | :-) |
| 21:59:06 | <@arkiver> | this is now the warrior default |
| 22:04:33 | <@arkiver> | imer: do you think 4 seconds pause on a non-200 is safe? |
| 22:06:01 | <@arkiver> | pokechu22: i'm thinking of not queuing all versions anymore |
| 22:06:34 | <imer> | arkiver: i'd go higher maybe, not sure how many error redirects we're expecting though |
| 22:06:51 | <imer> | start off high and lower once we know how things are going is safer than getting everyone banned :D |
| 22:07:08 | <@arkiver> | right i'm at 6 second now for non-200 |
| 22:07:11 | <@arkiver> | 2 second for 200 |
| 22:07:28 | <pokechu22> | arkiver: that's probably fine, queueing all versions isn't super important, but queueing the right version from a URL in an older format is pretty important |
| 22:07:44 | <pokechu22> | e.g. going from http://ovine.sngtv.perso.orange.fr/Enterites%20infectieuses.pdf to http://ovine.sngtv.pagesperso-orange.fr/Enterites%20infectieuses.pdf is important but not the other way around |
| 22:07:59 | <@arkiver> | hmm okey |
| 22:08:12 | | anewarchiverlol2 joins |
| 22:08:47 | <pokechu22> | and as for http versus https, I'm still fairly confident in the rule of whether or not a dot is present determining whether the site redirects to or from https, but some of the pagespro stuff complicates that |
| 22:09:02 | <@arkiver> | pokechu22: in https://github.com/ArchiveTeam/pagespersoorange-grab/blob/master/pagespersoorange.lua#L140-L159 can you please indicate which i should mark "skip"? |
| 22:09:13 | <@arkiver> | "skip" means we will not queue this URL if another version of it is found |
| 22:10:05 | <pokechu22> | Probably "perso.orange", "monsite.orange", and "pro.orange"/"pros.orange" |
| 22:10:19 | <pokechu22> | ... and "monsite.wanadoo" |
| 22:10:45 | <@arkiver> | okey then we're left with queuing _to_ |
| 22:10:50 | <@JAA> | (Permalink for future reference: https://github.com/ArchiveTeam/pagespersoorange-grab/blob/48fc3b422ca345fb3fb14c855304efc0e96bfd69/pagespersoorange.lua#L140-L159 ) |
| 22:10:56 | <@arkiver> | pagesperso-orange |
| 22:11:01 | <@arkiver> | monsite-orange |
| 22:11:06 | <@arkiver> | pagespro-orange |
| 22:11:11 | <@arkiver> | and the three *.pagespro-orange |
| 22:11:14 | <pokechu22> | Yeah |
| 22:11:18 | <@arkiver> | JAA: what key do i have to press again to get that? |
| 22:11:21 | <pokechu22> | y |
| 22:11:22 | <@arkiver> | that URL |
| 22:11:25 | <@arkiver> | ah |
| 22:11:26 | <@arkiver> | thanks |
| 22:11:27 | <nstrom|m> | does it make sense to run this one at any higher than 1 concurrency w the throttling? |
| 22:11:35 | <@arkiver> | pokechu22: alright that change is rolling out now |
| 22:11:37 | <@arkiver> | nstrom|m: no |
| 22:11:41 | <@arkiver> | nstrom|m: we need IPs here |
| 22:11:45 | <nstrom|m> | thought so |
| 22:12:16 | <@arkiver> | update is in for that |
| 22:12:34 | <@arkiver> | nstrom|m: if you have multiple concurrent this will just increase the sleep time per request |
| 22:12:58 | <thuban> | arkiver: with this change, will we still retrieve (eg) monsite.orange urls queued manually and/or through backfeed? |
| 22:13:15 | <@arkiver> | thuban: yes |
| 22:13:38 | <@arkiver> | but we will not during archiving queue the domains pokechu22 just listed |
| 22:13:41 | <thuban> | ok, cool |
| 22:13:49 | <@arkiver> | (we will still accept them though as being part of the project) |
| 22:16:48 | <@arkiver> | pokechu22: since they opened it up again for us, could we ask them to lift the rate limiting? |
| 22:17:18 | <pokechu22> | Maybe? Let me double check who contacted support... |
| 22:17:30 | <@arkiver> | imer: on the 200, how confident are you that 1 second is fine? |
| 22:18:12 | <thuban> | pokechu22: it was plcp |
| 22:19:30 | <@arkiver> | project10: nice speed :) |
| 22:19:56 | <imer> | arkiver: not very, I can retest if you'd like |
| 22:20:36 | <@arkiver> | we can give it a day and see how this progresses before we do risky stuff |
| 22:20:37 | <imer> | I did get one ip banned if I remember right, although the 2nd at 1s run didnt |
| 22:20:49 | <@arkiver> | i have a 6 second sleep at non-200 |
| 22:20:53 | <@arkiver> | do you think that should be safe? |
| 22:21:12 | <imer> | Maybe? I don't think I tested higher than 4 for error redirects |
| 22:21:24 | <@arkiver> | and you were banned at 4? |
| 22:21:28 | <imer> | yeah |
| 22:21:34 | <imer> | thats only hitting error redirects though |
| 22:21:42 | <@arkiver> | hmm okey |
| 22:21:48 | <@arkiver> | we're hitting a few of those |
| 22:22:03 | <imer> | I can also test non-error redirects, could only delay for error ones if they are the limit |
| 22:22:13 | <@arkiver> | yes please! |
| 22:22:21 | <@arkiver> | if you can that would be greatly appreciated |
| 22:22:27 | <@arkiver> | we do have the most of that i believe |
| 22:22:36 | <imer> | sure. will report back in an hour or so, give this time to run |
| 22:22:51 | <@arkiver> | thanks a lot! |
| 22:29:46 | | sepro (sepro) joins |
| 22:30:47 | | BornOn420 (BornOn420) joins |
| 22:31:34 | | @JAA sets the topic to: Pages Perso Orange: 1 concurrency recommended, strict rate limits, needs lots of IPs | Finding ISP web hosting services before the Grim Reaper finds them. | https://archiveteam.org/index.php?title=ISP_Hosting |
| 22:35:04 | <Ryz> | o#o; |
| 22:36:21 | <fireonlive> | owo |
| 22:37:13 | <Ryz> | Any idea when the Pages Perso Orange stuff is gonna shut down? Apparently it's supposed to shut down on 2023 September 05, but it didn't happen yet? |
| 22:37:38 | <imer> | Ryz: there was an extension until the 5th? of october |
| 22:37:41 | <@arkiver> | pokechu22: how important are the mairie, assoc, and ecole URLs? |
| 22:37:53 | <@arkiver> | i wonder if we can skip them too |
| 22:38:04 | <@arkiver> | if yes, we have only one URL per 'type' left that will be queued |
| 22:38:20 | <pokechu22> | I've seen them all in the wild |
| 22:38:51 | <pokechu22> | we don't necessarily need to queue between them, but sites using one form should probably have all of the pages in that form |
| 22:38:53 | <@arkiver> | right, but should we convert URLs _to_ them as well? |
| 22:39:18 | <@arkiver> | all of the pages - i guess those would be linked from each other right? if yes, they would already be found and queued that way |
| 22:39:20 | <pokechu22> | It's probably not needed |
| 22:39:42 | <pokechu22> | on the other hand there's a lot less content in the pros section in the first place so it's probably not *that* important |
| 22:39:57 | <@arkiver> | shall i set in that third 'type' on pagespro-orange to "queue" and the other 3 after that to "skip"? |
| 22:40:19 | <@arkiver> | (sorry that was a badly written sentence - maybe still clear) |
| 22:40:22 | <pokechu22> | Yeah, that's probably fine. Maybe come back and do the other ones if we have more time? |
| 22:40:30 | <@arkiver> | that might be problematic |
| 22:40:41 | <project10> | arkiver: you awoke the sleeping DLoader |
| 22:40:50 | <@arkiver> | project10: indeed :) |
| 22:40:56 | <@arkiver> | now we're all doomed |
| 22:41:17 | <pokechu22> | The main reason why mairie/assoc/ecole matter is for determining whether to use http versus https, which we're just doing both of anyways so that doesn't really matter I think |
| 22:41:19 | <@arkiver> | almost 4k items/min now :) |
| 22:41:19 | <DLoader> | :D |
| 22:41:23 | <imer> | guess you'll have to scale up more project10, the battle is on >:D |
| 22:41:55 | <DLoader> | I'm missing a /26 :/ |
| 22:41:56 | <@arkiver> | the battle of biggest IP ranges :P |
| 22:42:12 | <fireonlive> | time to start bgp hijacking |
| 22:42:18 | <@arkiver> | DLoader: did you check your pockets? |
| 22:42:42 | <project10> | it's nice to see the ETA less than two years |
| 22:42:50 | <DLoader> | lol |
| 22:42:55 | <DLoader> | will get it online tomorrow |
| 22:43:27 | <imer> | so it _was_ in your pockets? |
| 22:43:46 | <DLoader> | maybe |
| 22:43:47 | <@arkiver> | :P |
| 22:43:50 | <fireonlive> | *checks where i am on the leaderboard* |
| 22:43:51 | <fireonlive> | oof |
| 22:44:17 | <fireonlive> | need me an /8 |
| 22:44:37 | <project10> | fireonlive: https://kagi.com/proxy/ignorance-is-bliss-cypher.gif |
| 22:44:41 | <project10> | err :/ |
| 22:44:46 | <project10> | https://c.tenor.com/RZ1wlnUXbskAAAAC/ignorance-is-bliss-cypher.gif |
| 22:44:56 | <fireonlive> | true :D |
| 22:48:27 | <anewarchiverlol2> | Sorry if I'm in the wrong spot to ask (I just started today): if I need to stop running the Warrior appliance, I should press "shutdown" and wait for it to say it is finished before stopping, right? I think I understand this, but I want to check so I do it right. |
| 22:48:46 | <@arkiver> | anewarchiverlol2: that is the nice way yeah! |
| 22:49:07 | <@arkiver> | anewarchiverlol2: if you make it stop immediately though, it will not be too bad |
| 22:49:35 | <imer> | ^ uncompleted items will get retried eventually |
| 22:49:49 | <imer> | no bans so far, which is odd. I expected at least one |
| 22:50:01 | <@arkiver> | imer: sounds good |
| 22:50:37 | <@arkiver> | so we have 3 or 4 types of URLs: |
| 22:50:38 | <project10> | I didn't see the topic, am running conc 20, not seeing anything odd |
| 22:50:38 | <@arkiver> | - 200 |
| 22:50:41 | <anewarchiverlol2> | Thank you, arkiver |
| 22:50:50 | <@arkiver> | - 3xx to 404 |
| 22:50:51 | <@JAA> | project10: It'll throttle itself automatically. |
| 22:51:01 | <@arkiver> | - 3xx to https version from http |
| 22:51:12 | <project10> | JAA: throttle on what status code though, if any? |
| 22:51:15 | <@arkiver> | - 3xx from x.domain.fr to domain.fr/x/ version |
| 22:51:41 | <@arkiver> | project10: 2*concurrency on 200 |
| 22:51:51 | <@arkiver> | project10: 6*concurrency on non-200 |
| 22:52:04 | <@arkiver> | imer: if checking if we can lower the factor we multiply with the concurrency |
| 22:52:18 | <imer> | the inverse as well http://pagesperso-orange.fr/dupui/wow/trombinoscope_wow_038.htm redirects to http://dupui.pagesperso-orange.fr/... (and then 404, but thats beside the point :D) |
| 22:52:25 | <@arkiver> | oh yeah! |
| 22:52:45 | <@arkiver> | so for all those types i could add separate sleeping times if we know what the rates are there |
| 22:54:31 | <imer> | there's two(?) error pages I've seen https://r.orange.fr/r/Oerreur_40X redirects to https://e.orange.fr/error40X.html (replacing X with the code) |
| 22:55:45 | <pokechu22> | I don't think I've seen any examples of 3xx from x.domain.fr to domain.fr/x/ version, only the other direction, but not 100% sure |
| 22:55:59 | <imer> | https://r.orange.fr/r/Oerreur_404 doesnt seem to like curl, always redirects me to 403, but when I open that in browser I end up on 404 |
| 22:56:01 | <project10> | is it just my eyes, or is todo:backfeed growing faster than todo queue is emptying? |
| 22:56:29 | <imer> | project10: yes, ~2.6x faster currently |
| 22:58:15 | <@arkiver> | imer: yeah on the two error pages |
| 23:03:35 | <fireonlive> | alert: the NotAlexes are multiplying 😱 |
| 23:05:03 | <imer> | current status, all unbanned, testing: |
| 23:05:03 | <imer> | http://fabienne.oreb.pagesperso-orange.fr/photos/andalousie/af_andalousie.html 200 on (req/min) 90, 70, 60, 50 |
| 23:05:03 | <imer> | http://imagesdeparfums.perso.orange.fr/Cartier/TN_SoPretty98.JPG redirect to 404 on 12, 10, 6, 4 |
| 23:05:03 | <imer> | http://pagesperso-orange.fr/dupui/wow/trombinoscope_wow_038.htm normal redirect (not following the 2nd/3rd one to 404 here) on 90, 70, 60, 30 |
| 23:06:27 | <imer> | gonna try an error page on a high rate to see if bans are still working.. |
| 23:10:44 | <@arkiver> | well that sounds pretty promising |
| 23:16:37 | <imer> | 5=400 http://baggio%20.monsite-orange.fr/ yep, thats bad request indeed :D |
| 23:16:57 | <@arkiver> | whoops.. |
| 23:16:59 | <imer> | well, when I tested last I would have been banned by now |
| 23:17:09 | <imer> | so, uh, not sure what that means |
| 23:17:20 | <pokechu22> | Yeah, the list of URLs includes some junk like that :| |
| 23:17:32 | <@arkiver> | pokechu22: i tried to filter some junk out |
| 23:17:36 | <@arkiver> | but there may still be left |
| 23:17:49 | <pokechu22> | I think I did include ones with %20 like that filtered out but I'm not 100% sure |
| 23:18:19 | <imer> | don't have any other 400 in my logs so should be fine |
| 23:18:43 | <pokechu22> | Yeah, https://baggio.monsite-orange.fr is in some of the lists |
| 23:18:48 | <@arkiver> | well interesting, maybe they increased the allowed rate |
| 23:23:08 | <imer> | been trying the error page on 5req/s for over 15min now and nothing, took ~7min at 1req/s last time |
| 23:23:28 | <imer> | going to stop that experiment now I think, is a bit rude |
| 23:24:08 | <imer> | yeah, no bans. weird |
| 23:25:01 | <@arkiver> | imer: please feel free to continue the experiment |
| 23:25:13 | <imer> | unless they do ip reputation stuff, not sure which ones I used last time around |
| 23:25:24 | <@arkiver> | if we increase the limits on our side, they'll get more requests anyway |
| 23:26:07 | <imer> | anything you'd like me to try in particular? |
| 23:26:22 | <@arkiver> | i guess just the 4 different cases |
| 23:26:34 | <@arkiver> | i could try lowering the timeout if you see no bans at all... |
| 23:31:04 | <imer> | bans *are* still working, my experimental vm just got banned (running 40x conc 1 containers on one ip) |
| 23:32:05 | <imer> | took about 20min |
| 23:39:12 | | Exorcism quits [Remote host closed the connection] |
| 23:40:00 | | Exorcism (exorcism) joins |
| 23:44:19 | | octylFractal|m joins |
| 23:45:42 | <imer> | running (different urls) for 200 in req/min: 600, 300, 150, 60. 404 redirect 120, 60, 30, 15. normal redirect: 600 300 200 60 |
| 23:46:09 | <imer> | got a ban on 2/s 404 redirect for https://tatudream.pagesperso-orange.fr |
| 23:46:59 | <imer> | after 460s |
| 23:47:58 | <project10> | imer: what do bans look like here? 429, dropped SYNs, or RST? |
| 23:48:04 | <imer> | connection timeouts |
| 23:48:33 | <project10> | thanx |
| 23:49:24 | <imer> | 1/s error got banned after 680s |
| 23:51:41 | <project10> | too bad they aren't listening on ipv6 >:P |
| 23:54:37 | <@arkiver> | :P |
| 23:58:47 | <imer> | 0.5/s error got banned after 1200s |
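
The three ban timings imer reported for error (404 redirect) requests can be turned into rough request counts. This is only a sketch: it assumes each test ran at a constant rate from start to ban, and the names `observations` and `requests_before_ban` are made up for illustration. It does not claim to describe how the server actually decides on a ban.

```python
# Ban timings reported in the chat for error (404 redirect) requests:
# (requests per second, seconds until the ban was observed)
observations = [(2.0, 460), (1.0, 680), (0.5, 1200)]


def requests_before_ban(rate_per_s, seconds):
    """Approximate number of requests sent before the ban, assuming a constant rate."""
    return rate_per_s * seconds


estimates = [requests_before_ban(rate, secs) for rate, secs in observations]

for (rate, secs), count in zip(observations, estimates):
    print(f"{rate} req/s for {secs}s -> roughly {count:.0f} requests")
```

Under that assumption the counts come out at roughly 920, 680 and 600 requests, so a slower rate delayed the ban but did not avoid it in these runs.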