00:03:29AmAnd0A quits [Ping timeout: 252 seconds]
00:04:17AmAnd0A joins
00:05:09Void0 (Void0) joins
00:05:56DLoader_ joins
00:07:10Void0 quits [Client Quit]
00:07:10DLoader quits [Ping timeout: 252 seconds]
00:07:17DLoader_ is now known as DLoader
00:12:07Arcorann (Arcorann) joins
00:14:58fullpwnmedia joins
00:15:35<fullpwnmedia>Hey! Me again. Just checking up on the status of that Windows Update url list I sent.
00:18:50<fullpwnmedia>JAA did you chuck it into ArchiveBot?
00:22:56le0n_ quits [Ping timeout: 265 seconds]
00:24:01icedice (icedice) joins
00:26:35icedice quits [Client Quit]
00:26:41BigBrain quits [Ping timeout: 245 seconds]
00:41:29AmAnd0A quits [Read error: Connection reset by peer]
00:41:38le0n (le0n) joins
00:41:45AmAnd0A joins
00:55:47fullpwnmedia quits [Remote host closed the connection]
01:35:37imer0 (imer) joins
01:36:53imer quits [Ping timeout: 265 seconds]
01:36:53imer0 is now known as imer
01:53:24systwi__ is now known as systwi
01:55:57fullpwn quits [Read error: Connection reset by peer]
01:56:13fullpwn joins
02:07:00xkey quits [Quit: xkey]
02:12:31BlueMaxima joins
02:49:36sec^nd quits [Ping timeout: 245 seconds]
03:05:13<tomodachi94>More useragents for ArchiveBot, yay!!! https://github.com/ArchiveTeam/ArchiveBot/pull/556
03:05:32dumbgoy_ quits [Ping timeout: 252 seconds]
03:22:35<@JAA>fullpwnmedia: I didn't, I was going to keep this list in my stash for the software binaries project I've been meaning to launch. That requires software that doesn't exist yet, and there's no ETA for it. Does Microsoft typically get rid of these files? I thought they kept them for quite a long time.
03:36:55Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ]
03:37:33Shjosan (Shjosan) joins
03:45:57Sluggs quits [Excess Flood]
03:46:21Sluggs joins
03:59:07railen63 quits [Remote host closed the connection]
04:00:00aGerman quits [Quit: The Lounge - https://thelounge.chat]
04:00:01treora quits [Quit: blub blub.]
04:01:31treora joins
04:03:00aGerman (aGerman) joins
04:04:30Dango360 quits [Read error: Connection reset by peer]
04:05:28railen63 joins
04:06:28railen63 quits [Remote host closed the connection]
04:06:42Dango360 (Dango360) joins
04:06:43railen63 joins
04:16:10railen63 quits [Remote host closed the connection]
04:19:13railen63 joins
04:20:10railen63 quits [Remote host closed the connection]
04:20:23railen63 joins
04:32:16decky_e quits [Ping timeout: 252 seconds]
04:32:36decky_e (decky_e) joins
04:42:00decky_e quits [Ping timeout: 265 seconds]
04:42:40sonick (sonick) joins
04:42:45decky_e (decky_e) joins
04:46:08<sonick>It was announced in an email newsletter that https://technote.ipros.jp/ will end on June 15, 2023.
04:46:44<sonick>It seems large enough to run in AB.
04:55:05nicolas17 quits [Client Quit]
04:55:24nicolas17 joins
05:23:18xkey (xkey) joins
05:57:24systwi_ quits [Quit: systwi_]
05:57:24nothere quits [Quit: Leaving]
05:58:47BlueMaxima quits [Read error: Connection reset by peer]
06:07:20Island quits [Read error: Connection reset by peer]
06:08:33systwi_ joins
06:08:48nothere joins
06:17:49BigBrain (bigbrain) joins
07:06:16bf_ joins
07:36:23Ivan226 joins
08:38:17fuzzy8021 quits [Read error: Connection reset by peer]
09:27:32Minkafighter quits [Quit: The Lounge - https://thelounge.chat]
09:27:48Minkafighter joins
09:40:21bloom joins
09:53:54driib quits [Quit: The Lounge - https://thelounge.chat]
09:54:17driib (driib) joins
09:57:00bloom quits [Remote host closed the connection]
10:05:21thehedgeh0g quits [Remote host closed the connection]
10:05:22shreyasminocha quits [Remote host closed the connection]
10:05:22evan quits [Remote host closed the connection]
10:06:31evan joins
10:06:34thehedgeh0g (mrHedgehog0) joins
10:06:34shreyasminocha (shreyasminocha) joins
10:07:09AK quits [Quit: AK]
10:13:20<that_lurker>Would it be possible to do a crawl of http://porn.serverbear.com/ it's the blog of serverbear that closed in 2016 and the site is slowly deteriorating. If the crawl speed can be slowed down that would be best as load times are longish.
10:13:35<that_lurker>Has some cool stuff in it https://twitter.com/mikko_2013/status/1664534090324869120
10:14:37<that_lurker>most photos of the site are at least "safe" on tumblr
10:24:30decky_e quits [Remote host closed the connection]
10:28:29icedice (icedice) joins
10:29:16icedice2 (icedice) joins
10:52:43AK (AK) joins
11:01:18JohnnyJ quits [Client Quit]
11:01:59JohnnyJ joins
11:05:46Ruthalas5 quits [Ping timeout: 265 seconds]
11:09:27Ruthalas5 (Ruthalas) joins
11:33:34CreaZyp154 joins
11:37:25<CreaZyp154>I've got a few ideas to avoid redirection loops: 1) save redirects directly instead of queuing them; 2) have the warrior follow the redirect to check for a loop before queuing; 3) add a URL parameter to the API endpoint carrying the redirect history, maybe hashed so that long URLs don't cause issues
11:39:43<CreaZyp154>or maybe just a redirection count for the last one
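[Editor's note: the loop check sketched in idea 2 above can be illustrated in a few lines. This is a minimal sketch, not warrior code; `redirect_map` and `find_redirect_loop` are hypothetical stand-ins for issuing a HEAD request and reading the Location header.]

```python
def find_redirect_loop(start_url, redirect_map, max_hops=10):
    """Follow redirects via a lookup table; return the hop list,
    or None if a loop (or an excessive chain) is detected.

    `redirect_map` maps a URL to the URL its Location header points
    at -- a hypothetical stand-in for a real HEAD request.
    """
    seen = {start_url}
    url = start_url
    hops = [url]
    while url in redirect_map:
        url = redirect_map[url]
        if url in seen or len(hops) >= max_hops:
            return None  # loop detected: don't queue this URL
        seen.add(url)
        hops.append(url)
    return hops
```

A redirect count, as suggested in the follow-up message, is the `max_hops` cap alone, without tracking the `seen` set.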
11:44:33fuzzy8021 (fuzzy8021) joins
11:54:36icedice2 quits [Client Quit]
11:54:56icedice2 (icedice) joins
12:00:41Ruthalas5 quits [Ping timeout: 252 seconds]
12:35:43Ruthalas5 (Ruthalas) joins
12:38:20BigBrain quits [Remote host closed the connection]
12:39:35BigBrain (bigbrain) joins
12:41:57Ruthalas5 quits [Ping timeout: 265 seconds]
12:47:35sec^nd (second) joins
12:59:33geezabiscuit quits [Read error: Connection reset by peer]
12:59:54geezabiscuit (geezabiscuit) joins
13:06:05Ruthalas5 (Ruthalas) joins
13:07:33geezabiscuit quits [Read error: Connection reset by peer]
13:07:42geezabiscuit (geezabiscuit) joins
13:13:18geezabiscuit quits [Ping timeout: 252 seconds]
13:13:33railen63 quits [Remote host closed the connection]
13:16:28railen63 joins
13:17:28railen63 quits [Remote host closed the connection]
13:17:42railen63 joins
13:26:02geezabiscuit (geezabiscuit) joins
13:34:59katocala quits [Remote host closed the connection]
14:00:58HP_Archivist quits [Client Quit]
14:08:52hitgrr8 joins
14:41:20CreaZyp154 quits [Ping timeout: 265 seconds]
15:00:50lennier1 quits [Read error: Connection reset by peer]
15:01:09lennier1 (lennier1) joins
15:02:27HP_Archivist (HP_Archivist) joins
15:25:27HP_Archivist quits [Client Quit]
15:34:51za3k joins
15:35:08Island joins
15:39:37katocala joins
15:44:25Arcorann quits [Remote host closed the connection]
15:50:38Arcorann (Arcorann) joins
16:10:56Arcorann quits [Read error: Connection reset by peer]
16:19:44dumbgoy_ joins
16:21:32hitgrr8 quits [Client Quit]
16:56:02Matthww1 quits [Ping timeout: 252 seconds]
16:57:48Matthww1 joins
17:08:28<that_lurker>Would it be possible to do a crawl of http://porn.serverbear.com/ it's a blog of serverbear that has some cool computer history and whatnot. serverbear closed in 2016 and the site is slowly deteriorating. If the crawl speed can be slowed down that would be best as load times are longish.
17:09:14<that_lurker>The site has some cool stuff in it like this https://twitter.com/mikko_2013/status/1664534090324869120 and at least most of the images on the site are on tumblr so they are "safe"
17:19:21hitgrr8 joins
17:24:21<pokechu22>that_lurker: It looks like it was saved back in 2016: https://archive.fart.website/archivebot/viewer/job/khrwx - is there new content since then?
17:24:24nicolas17 quits [Remote host closed the connection]
17:24:43nicolas17 joins
17:25:17<that_lurker>pokechu22: Most likely not. Thanks, I didn't know it was archived
17:26:16<pokechu22>Looking at view-source:https://porn.serverbear.com/ it does seem like that's a tumblr-based blog, but it does seem a lot laggier compared to most of those (e.g. https://just-shower-thoughts.com) which is odd
17:27:49<that_lurker>most likely using some old code that causes javascript loops
17:28:17<that_lurker>though some pages refuse to load completely
17:31:03decky_e (decky_e) joins
17:43:47<klg>it is tumblr and the tumblr part seems to work normally to me, but they have a bunch of assets in their theme from outside of tumblr, like that blocking javascript from blog.serverbear.com which times out for me; but anyway, no new posts since 2013
17:46:38decky_e quits [Ping timeout: 252 seconds]
17:47:11decky_e (decky_e) joins
17:59:06flashfire42 quits [Read error: Connection reset by peer]
17:59:06s-crypt quits [Read error: Connection reset by peer]
17:59:06Ryz2 quits [Remote host closed the connection]
17:59:06kiska quits [Read error: Connection reset by peer]
17:59:18Ryz2 (Ryz) joins
17:59:19s-crypt (s-crypt) joins
17:59:24flashfire42 (flashfire42) joins
18:00:37kiska (kiska) joins
18:12:52qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
18:14:08decky_e quits [Ping timeout: 252 seconds]
18:25:56spirit quits [Client Quit]
18:58:58<nicolas17>JAA: I want to make POST requests and save the request and response in WARCs, qwarc only supports GET right?
18:59:32<@JAA>nicolas17: qwarc supports GET, POST, and HEAD.
19:00:27<nicolas17>oh I see, write_client_response is writing the request body too :)
19:00:52<@JAA>Remember to use 0.2.6 or higher, not the master branch.
19:01:53<nicolas17>the github repo is outdated btw
19:02:19decky_e (decky_e) joins
19:03:43<@JAA>Yep, the proper repo is linked there though.
19:03:57<nicolas17>why did you get rid of warcio?
19:04:10<@JAA>Because warcio is buggy and shouldn't be used for any WARC writing.
19:04:37<nicolas17>warcio's own docs sound... self-confident :P
19:04:40<@JAA>https://github.com/webrecorder/warcio/issues/created_by/JustAnotherArchivist
19:09:14<fireonlive>leave it to "Wario" to be evil
19:09:20<fireonlive>(or 'bad')
19:10:46<@JAA>The fact that lots of development happens at webrecorder but these bugs in the core library that corrupt data are getting ignored means that I strongly recommend against using any webrecorder software for archival. It's fine for playback though.
19:11:48<fireonlive>are they a 'big company' or 'startup' or is it just sort of a group or people
19:11:55<fireonlive>s/or/of/
19:13:11<nicolas17>sooo how do I use qwarc? :P
19:13:20<@hook54321>fireonlive: non-profit iirc
19:13:44decky_e quits [Ping timeout: 252 seconds]
19:13:58<@JAA>nicolas17: *two heavy breaths* Good luck! *hangs up*
19:14:03<fireonlive>ahh ok
19:14:19<nicolas17>this invalidates your "warcio does <undocumented thing>" issues! /s
19:14:27<@JAA>:-P
19:14:28decky_e (decky_e) joins
19:14:31<fireonlive>x3
19:15:20<fireonlive>JAA reads RFCs recreationally; amazing
19:15:34<fireonlive>big thumbs up from me
19:15:47<@hook54321>one of the people involved with Rhizome's web archiving ethics conference was also angry that an ArchiveTeam member was archiving public social media
19:15:47<@JAA>Yeah, I care too much about standards and compliance, I guess.
19:15:58<fireonlive>they are fun!
19:16:08<@JAA>I also tried to implement an IRC client based on the RFCs. Yeah, that didn't go well...
19:16:32<fireonlive>haha
19:16:40<fireonlive>reminds me of trying to make my own 'services'
19:16:41<nicolas17>I started making a Wireshark dissector for Matter before realizing what "the spec is 900 pages" really means
19:16:48<fireonlive>linking up with unrealircd however long ago
19:17:18<fireonlive>was fun in either case though
19:17:53<TheTechRobo>hook54321: do you have a video link to that conference?
19:20:10<nicolas17>okay warcio it is then /s
19:20:51<@hook54321>TheTechRobo: https://invidious.snopyta.org/channel/UCxT4WqoDaO3B_Hvhr6rpB6Q
19:21:01<@hook54321>unsure if that's all of the talks or not
19:21:35<@hook54321>wait that's a different conference i think
19:21:57Megame (Megame) joins
19:22:19<@JAA>nicolas17: qwarc is self-documenting in that it copies the code into the meta WARC. You can find examples in my IA uploads.
19:22:48<@JAA>(There's also a mechanism to copy further dependencies beyond just the spec file itself.)
19:23:43<fireonlive>we don't need no documentation
19:23:48<fireonlive>we don't need no thought control
19:24:01<fireonlive>🎶
19:24:26<@JAA>:-)
19:24:44za3k quits [Remote host closed the connection]
19:24:53<fireonlive>:D
19:29:08za3k joins
19:30:26BigBrain quits [Ping timeout: 245 seconds]
19:49:41sonick quits [Client Quit]
20:12:51za3k quits [Remote host closed the connection]
20:38:44<icedice2>https://old.reddit.com/r/Piracy/comments/13yglih/stop_crying_wipe_your_tears_introducing_nqrarbg/?sort=top
20:38:46<icedice2>Based
20:39:02icedice2 quits [Client Quit]
20:39:23icedice2 (icedice) joins
20:39:35icedice2 quits [Remote host closed the connection]
20:47:16<icedice>Fuck, they have a Discord server
20:47:24<icedice>That's going to fuck them over eventually
20:49:11icedice quits [Client Quit]
20:49:31<Terbium>Yep...
20:49:43<Terbium>also they have cloudflare, that's going to be a pain to archive
20:58:57nickofni1 is now known as nickofnicks
20:59:49icedice (icedice) joins
21:00:45Unholy2361 quits [Quit: The Lounge - https://thelounge.chat]
21:01:27Unholy2361 (Unholy2361) joins
21:04:05Dango360 quits [Ping timeout: 252 seconds]
21:04:22<masterX244>s/cloudflare/buttflare/g
21:04:36Dango360 (Dango360) joins
21:11:47<nicolas17>how the heck does reddit allow this content?
21:14:57BigBrain (bigbrain) joins
21:16:27<fireonlive>links to torrent trackers?
21:17:17<fireonlive>interesting load balancing.. https://whatever -> https://s<n>.whatever
21:18:06<fireonlive>..it's done in javascript
21:18:53<fireonlive>https://transfer.archivete.am/W0YBg/ngrarbg.txt
21:18:57<fireonlive>that's... a way
21:19:36<nicolas17>fireonlive: a whole r/Piracy subreddit, there *has* to be content in there that can't be justified with "well it's actually just a link to a search engine yadda yadda"
21:19:51lunik173 quits [Remote host closed the connection]
21:20:00<fireonlive>ahh
21:20:06<fireonlive>https://old.reddit.com/r/Piracy/comments/13yglih/stop_crying_wipe_your_tears_introducing_nqrarbg/jmnip44/?context=1
21:20:11<fireonlive>oops
21:27:44<fireonlive>https://github.com/Not-Quite-RARBG/api/commits/main/search.php ; ah
21:27:54za3k joins
21:27:58spirit joins
21:30:11<Terbium>i don't think that fixes the problem...
21:32:04<systwi_>Glad to see RARBG is getting the love it deserves. :-)
21:33:53<fireonlive>Terbium: yeahhh.........
21:34:25<fireonlive>'ok PDO..' 'oh like that?'
21:34:27<fireonlive>'what?'
21:37:02<joepie91|m>what is this, 2012?
21:37:17<fireonlive>ikr
21:37:23<fireonlive>i was half expecting to see mysql_query
21:38:16<joepie91|m>so like, these people are on Discord, putting their code on Github, and it contains a 2010s-era SQLi
21:38:21<fireonlive>'s3' from their 'load balancing algorithm' just redirects back to apex (after about 20 years)
21:38:30<fireonlive>oh it also leaks their real server hostname, oops
21:38:34<joepie91|m>I can't help but get a "literal 13 year olds with no experience" vibe from this
21:38:36<Terbium>*Today on Code Review with AT*: We'll review how not to protect against SQL Injection in your PHP Code
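[Editor's note: the vulnerable code under discussion is PHP, in the linked repo. As an illustration of the parameterized-query fix being joked about, here is a minimal sketch in Python/sqlite3 (not their stack); the table and payload are made up for the example.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE torrents (name TEXT)")
conn.execute("INSERT INTO torrents VALUES ('example')")

def search_unsafe(q):
    # Vulnerable: attacker-controlled input is spliced into the SQL text,
    # so quotes in `q` can rewrite the statement.
    return conn.execute(
        f"SELECT name FROM torrents WHERE name LIKE '%{q}%'").fetchall()

def search_safe(q):
    # Parameterized: the driver binds the value separately, so quotes
    # in `q` cannot change the statement's structure.
    return conn.execute(
        "SELECT name FROM torrents WHERE name LIKE ?", (f"%{q}%",)).fetchall()

# A classic payload: turns the unsafe query into "... LIKE '%' OR '1'='1%'",
# which matches every row; bound as a parameter it is inert.
payload = "' OR '1'='1"
```

`search_unsafe(payload)` returns all rows; `search_safe(payload)` returns nothing, since no name contains the literal payload.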
21:38:42<joepie91|m>which is Bad News
21:38:48<joepie91|m>(also for them)
21:39:21<fireonlive>i don't want to say it in this logged chat
21:39:23<fireonlive>but curl -D - https://s3.nq-rarbg.to
21:39:32<fireonlive>and look at the returned 'host:'
21:39:42<fireonlive>make a cup of tea while it's loading
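[Editor's note: `curl -D -` dumps the response headers to stdout. The check described above, spotting a `Host:` line in a *response*, can be scripted against that dump; the hostname in the test is hypothetical, not the real leaked value.]

```python
def find_leaked_host(raw_headers):
    """Scan raw HTTP response headers for a 'Host:' line.

    'Host' is a request header; seeing one echoed back in a response,
    as described above, suggests a misconfigured front-end leaking
    the origin server's hostname.
    """
    for line in raw_headers.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            return value.strip()
    return None
```

Usage: pipe `curl -sD - <url> -o /dev/null` into a script calling this on stdin.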
21:40:09<fireonlive>is that their actual behind-the-flare server?
21:40:12<fireonlive>hm
21:42:18<fireonlive>yeah it's some free netlify/heroku-like thing
21:42:21<fireonlive>(with paid plans)
21:42:58<Terbium>why not run the site on Github pages at this point :P
21:42:59<fireonlive>(also why is host: in the response header?)
21:42:59<masterX244>odd, no host appearing for me
21:44:52<fireonlive>pm'd
21:45:04<Terbium>you're right, it's probably netlify
21:45:31<fireonlive>looks like now that it's 500ing instead of 302ing there's no host header
21:48:02<fireonlive>based on absolutely nothing looks like they're all using the same web host thing
21:48:10<fireonlive>well based on the x-nf-request-id header
21:48:56<fireonlive>unless all their servers so happen to be fronted by netlify :p
21:49:05<fireonlive>and then by cloudflare
22:01:49<TheTechRobo>I cant load their website _at all_.
22:02:08<nicolas17>JAA: qwarc requires aiohttp version exactly 2.3.10, and with that version I can't even "import aiohttp", nor understand the error >_<
22:02:24<TheTechRobo>Oh, s3 works. s2 and s1 seem to be borked.
22:02:47<nicolas17>JAA: https://paste.debian.net/1281844/
22:03:08<nicolas17>why does it need such a specific version of aiohttp?
22:03:20<@JAA>nicolas17: Ah yeah, that.
22:03:27<@JAA>That's not caused by aiohttp.
22:03:43<TheTechRobo>Oh, great, they *already* shut down their Discord.
22:03:50<@JAA>The aiohttp requirement is because I'm monkeypatching internals to get access to the raw data streams.
22:03:57<nicolas17>ah seems I need to pin to an older async-timeout
22:04:08<fireonlive>s3 just redirects for me
22:04:13<@JAA>But that error is because you need async-timeout==3.0.1.
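[Editor's note: collecting the pins from this exchange, a working install per the conversation would look like the following; versions are taken verbatim from the messages above, and the qwarc floor matches the "0.2.6 or higher, not master" advice earlier.]

```shell
# Pinned environment for qwarc, per the discussion above.
# aiohttp must be exactly 2.3.10 (qwarc monkeypatches its internals),
# and async-timeout must be held back to 3.0.1 to avoid the import error.
pip install 'aiohttp==2.3.10' 'async-timeout==3.0.1' 'qwarc>=0.2.6'
```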
22:04:23<TheTechRobo>fireonlive: https://s3.nq-rarbg.to/ ?
22:04:47<fireonlive>ye, trying again though
22:05:36<fireonlive>ah, 500s now
22:05:39<TheTechRobo>Eventually s1 and s2 give me empty responses.
22:05:44<TheTechRobo>oh yep, 502
22:05:46<fireonlive>wonder if someone sqli'd them
22:06:14<TheTechRobo>lol
22:06:25<TheTechRobo>either that or they're running this on a potato that rivals imgur's
22:06:42<fireonlive>anyone here can slide into my DM for the host leak if they wish
22:07:59<nicolas17>$ qwarc --version
22:08:00<nicolas17>qwarc 0.2.8
22:08:01<nicolas17>yay
22:12:23hitgrr8 quits [Client Quit]
22:16:14<nicolas17>JAA: looking at soundcloud-tracks spec file https://paste.debian.net/1281845/ it seems generate() splits the range 0-679999999 into 10000-sized chunks, then _process splits them again into 200-sized chunks to send the requests? if you can request 200 IDs at a time, why not split into 200-sized chunks since the beginning?
22:19:03<@JAA>nicolas17: I don't remember whether it was an issue on that specific one, but lock contention becomes an issue at high throughput.
22:19:30<@JAA>Each item getting checked out or back into the DB requires a lock.
22:20:13<nicolas17>ah, I don't plan to use high concurrency so can I make process() just run a single fetch?
22:20:40<@JAA>Sure
22:21:22<@JAA>As a rough number, on my old potato that runs most of these crawls, the contention becomes an issue at a couple thousand items per minute.
22:22:08<@JAA>And I frequently do several hundred requests per second, so items with very few requests just aren't going to work there.
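[Editor's note: the two-level split discussed above, coarse DB items to keep the checkout/checkin lock rate low, finer batches per request, can be sketched generically. The sizes mirror the ones mentioned in the conversation; the helper names are made up for illustration.]

```python
def chunk_range(start, stop, size):
    """Yield (lo, hi) half-open subranges of [start, stop)."""
    for lo in range(start, stop, size):
        yield lo, min(lo + size, stop)

def batches_for_item(item_lo, item_hi, batch_size=200):
    """Split one coarse DB item into per-request batches.

    Coarse items of 10000 IDs mean one lock acquisition covers 50
    requests, instead of one per request.
    """
    return list(chunk_range(item_lo, item_hi, batch_size))
```

With single-request items, every fetch would pay the item checkout/checkin lock; at several hundred requests per second that lock becomes the bottleneck, as JAA notes.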
22:24:07<nicolas17>so if I understand this correctly, --concurrency changes the number of simultaneous async tasks in one process, if I want to use multiple processes I just run multiple instances with the same spec and db?
22:25:53<@JAA>Correct
22:26:44<nicolas17>but if I want to use multiple computers/IPs I'm on my own? :D
22:27:20<@JAA>Yeah, coordination only works between processes on the same machine.
22:28:02<@JAA>I have rough plans about that, but not sure when that will happen.
22:38:34<nicolas17>fuuuuuuck
22:38:47<nicolas17>aiohttp calls asyncio.Task.current_task
22:39:04<nicolas17>"This method is *deprecated* and will be removed in Python 3.9. Use the asyncio.current_task() function instead."
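[Editor's note: the deprecated call and its replacement, for reference. The module-level `asyncio.current_task()` has been available since Python 3.7; the class method `asyncio.Task.current_task()` was removed in 3.9, which is why aiohttp's use of it breaks on 3.9.]

```python
import asyncio

async def who_am_i():
    # Removed in Python 3.9: asyncio.Task.current_task()
    # Replacement, available since 3.7:
    task = asyncio.current_task()
    return task is not None

# Inside a running coroutine, current_task() returns the driving Task.
print(asyncio.run(who_am_i()))  # → True
```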
22:39:25<fireonlive>are you using greater than 3.9? :D
22:39:33<nicolas17>I'm on 3.9.2
22:39:33fullpwn quits [Read error: Connection reset by peer]
22:39:34<@JAA>Oh yes, I run this under 3.6 or something ancient like that. Haven't had time to maintain it recently.
22:40:12fullpwn joins
22:40:46<@JAA>And it's an ancient aiohttp version, of course.
22:40:59<fireonlive>*pulls out worksfornow stamp*
22:41:02<@JAA>Migrating that to something more recent isn't exactly straightforward since there have been two new major versions since.
22:41:55<nicolas17>can be done slowly :P
22:42:10<@JAA>I've been meaning to either push the aiohttp maintainers to add proper access to the raw data streams or replace it with h11 + own asyncio code.
22:42:21<@JAA>But well, ENOTIME
22:43:07<fireonlive>we gotta clone JAA
22:43:10<fireonlive>we don't have the technology
22:43:37<@JAA>:-)
22:49:08fullpwn quits [Ping timeout: 252 seconds]
22:58:48<nicolas17>seems this works successfully, unless I pass "--concurrency 2", in which case it downloads everything it has to download and exits with asyncio.exceptions.CancelledError (?!)
23:00:30<nicolas17>that was python 3.8, gonna try 3.7...
23:04:44<nicolas17>3.7 works better
23:07:44<nicolas17>JAA: aww this thing can't re-request stuff and deduplicate with revisit records right?
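[Editor's note: on the revisit-record question, deduplication works by matching the payload digest of a new capture against an earlier record. The conventional `WARC-Payload-Digest` form is sha1 in base32, which is short to compute:]

```python
import base64
import hashlib

def warc_payload_digest(payload: bytes) -> str:
    """Compute a WARC-Payload-Digest value in the conventional
    'sha1:<base32>' form used to match a revisit record against
    the original capture it deduplicates."""
    digest = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(digest).decode("ascii")
```

A revisit record stores this digest plus a pointer to the earlier capture (`WARC-Refers-To`), instead of repeating the payload.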
23:14:05HP_Archivist (HP_Archivist) joins
23:27:30railen63 quits [Remote host closed the connection]
23:27:45railen63 joins
23:27:59Lord_Nightmare quits [Quit: ZNC - http://znc.in]
23:31:39Lord_Nightmare (Lord_Nightmare) joins
23:41:39HP_Archivist quits [Client Quit]
23:46:45lunik173 joins
23:58:16Unholy2361 quits [Remote host closed the connection]