01:22:52rocketdive quits [Ping timeout: 252 seconds]
01:41:00<h2ibot>JustAnotherArchivist edited Onlyfiles (+432, It's dead, Jim): https://wiki.archiveteam.org/?diff=49404&oldid=49375
01:42:00<h2ibot>JustAnotherArchivist edited Onlyfiles (+59): https://wiki.archiveteam.org/?diff=49405&oldid=49404
01:44:00<h2ibot>Mentalist edited Deathwatch (-17): https://wiki.archiveteam.org/?diff=49406&oldid=49397
02:54:12katocala quits [Remote host closed the connection]
02:55:08<audrooku|m>Re: WBM CDX Stuffs
02:55:08<audrooku|m>JAA: Thanks for the response, I really appreciate it.
02:55:08<audrooku|m>Do even privileged users in archiveteam have access to the wbm grab items or is that something I have no chance of touching?
02:55:37<@JAA>I'm sure Jason has access, but he works at IA, so...
02:56:02<@JAA>Otherwise, I'm almost certain the answer is no.
03:14:29Test joins
03:24:21Test quits [Ping timeout: 265 seconds]
03:28:14jacksonchen666 quits [Ping timeout: 276 seconds]
03:30:09jacksonchen666 (jacksonchen666) joins
03:52:45katocala joins
03:58:23Hackerpcs quits [Quit: Hackerpcs]
04:05:49Hackerpcs (Hackerpcs) joins
05:27:54BlueMaxima quits [Read error: Connection reset by peer]
05:52:50Island quits [Read error: Connection reset by peer]
05:58:44<h2ibot>JustAnotherArchivist edited Deathwatch (+164, /* 2023 */ Add MyWOT forum): https://wiki.archiveteam.org/?diff=49407&oldid=49406
05:58:54<@JAA>Haha qwarc goes brrrrr :-)
06:09:01<@JAA>Looks good, I should have a complete copy of the topic pages in 20-ish minutes.
06:12:10le0n quits [Ping timeout: 252 seconds]
06:28:46le0n (le0n) joins
06:30:57<@JAA>Done, and apart from one topic with an SQL error, it ran almost suspiciously smoothly.
06:33:16<pabs>for sites with a bunch of JS links on one page, is it possible to pass AB multiple entry points for !a instead of just one?
06:34:49<@JAA>pabs: !a < exists, but it has various sharp edges that can backfire badly with regards to recursion, which is why it's undocumented. Depending on the URL list, site structure, and retrieval order, it might recurse further or less far than you'd think.
06:36:34<pabs>for eg I wanted to archive http://www.hungry.com/ but http://www.hungry.com/people/ only has JS links
06:42:02le0n_ (le0n) joins
06:44:04le0n quits [Ping timeout: 252 seconds]
06:44:41le0n_ quits [Client Quit]
06:49:51<pokechu22>The challenge is that http://www.hungry.com/people/ gives links to e.g. http://www.hungry.com/~alves/ and http://www.hungry.com/~beers/
06:50:47<pokechu22>In this case, an !a < list would probably work, since it's (at least assumed to be) safe if you can do e.g. http://www.hungry.com/~alves http://www.hungry.com/~beers etc (all URLs without a slash in the path)
06:53:01<pokechu22>Looks like http://www.hungry.com/hungries/hungries.js stores all of that stuff. AB does sometimes directly extract links from javascript, but I'm not sure if it would extract these or not (I think it will extract the images like /hungries/watkins0.jpg but I'm less sure if it extracts full URLs for whatever reason)
07:00:50<audrooku|m>Re: wbm cdx stuff again
07:00:50<audrooku|m>Thanks JAA, figures haha, didn't realize Jason was considered part of AT
07:01:05le0n (le0n) joins
07:02:10<@JAA>audrooku|m: Well, he founded it, so yeah. :-) He isn't around much these days though.
07:02:32<pokechu22>JAA: How does qwarc handle threads with multiple pages?
07:03:15<audrooku|m>Ah huh, didn't know that haha, makes a lot of sense though ;)
07:03:27<pokechu22>(my impression is that it just indiscriminately downloads a series of URLs, and you'd generate an incremental list of thread IDs and download those, in which case you'd need special logic to know if multiple pages exist. but I might be wrong on this)
07:03:53jacksonchen666 quits [Remote host closed the connection]
07:05:05<@JAA>pokechu22: qwarc's just a framework for grabbing stuff with very little overhead. It doesn't do anything on its own. You need to write code (called a spec file, I might change that term at some point) to tell it what to actually do. And yes, my code for MyWOT handled topic pagination (and session IDs and the language switcher).
07:06:25<pokechu22>Ah, it's not just a version of !ao < list that doesn't extract images, it actually does know how to look at the response to extract more links (if you tell it how to). OK, good to know
07:06:42<@JAA>No, it does not. You need to write that code to make it do that.
07:07:28<pokechu22>Right, but I was assuming that it was *just* capable of doing lists without any way of expanding on it
07:08:07<@JAA>It provides an interface for 'fetch this URL' with a callback to decide what to do with the response (accept, retry; write to WARC or not). And it handles the HTTP and WARC writing. Plus there's a CLI for actually running it, with concurrency and parallel processes and stuff.
07:08:20<@JAA>All other logic is user-provided in the spec file.
07:09:03<@JAA>I did write a spec file a long while ago that does recursive crawls. That's what I was referring to in #archivebot earlier.
07:09:23<@JAA>But on its own, qwarc doesn't even look at the HTTP body at all. It just downloads it and (unless prevented) writes it to the WARC.
07:09:54<@JAA>That's also part of why it's so efficient.
07:12:10<@JAA>This is what the spec file for the MyWOT Forums looks like: https://transfer.archivete.am/GDKC3/forum.mywot.com.py
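For readers following along: the architecture JAA describes (a framework that only fetches and writes WARCs, with all decisions delegated to a user callback) can be illustrated with a minimal sketch. This is NOT qwarc's actual API — every name below (`Action`, `fetch`, `run`, `my_callback`) is invented for illustration; the real spec-file interface differs.

```python
# Hypothetical sketch of the fetch-plus-callback pattern described above.
# All names are invented for illustration; this is not qwarc's real API.
from enum import Enum

class Action(Enum):
    ACCEPT = "accept"  # keep the response and write it to the WARC
    RETRY = "retry"    # fetch the same URL again
    SKIP = "skip"      # keep going, but don't write this response

def fetch(url):
    # Stand-in for a real HTTP fetch; returns (status, body).
    return 200, b"<html>page for %s</html>" % url.encode()

def run(urls, callback, max_retries=3):
    # The "framework" part: fetch each URL and let the user-provided
    # callback decide what happens. It never inspects the body itself.
    written = []
    for url in urls:
        for _attempt in range(max_retries):
            status, body = callback_input = fetch(url)
            action = callback(url, status, body)
            if action is Action.RETRY:
                continue
            if action is Action.ACCEPT:
                # A real framework would write a WARC record here.
                written.append((url, status))
            break
    return written

# User-provided logic, analogous to a spec file: accept 2xx, retry the rest.
def my_callback(url, status, body):
    return Action.ACCEPT if 200 <= status < 300 else Action.RETRY

print(run(["http://example.com/"], my_callback))
```

The design point being illustrated: all per-site logic (pagination, session IDs, link extraction) lives in the user's callback code, which is why the framework itself stays small and fast.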
07:23:33<@arkiver>JAA: as always feel free to put the outlinks from that site in #// :)
07:25:38<@JAA>arkiver: Yup, and I'll have a bunch from others as well. :-)
07:25:53<@arkiver>awesome!
07:35:46IDK (IDK) joins
07:42:01hitgrr8 joins
08:03:10Icyelut (Icyelut) joins
08:05:28Icyelut|2 quits [Ping timeout: 252 seconds]
08:06:29<pabs>pokechu22: hmm, how would one find out? guess I could just try !a http://www.hungry.com/hungries/hungries.js ? and separately !a the other stuff
08:08:59<pokechu22>I think the extraction behaves differently if you save the js file itself versus saving a page containing it, so I'm not sure exactly. Probably the easiest thing to do would be just !a http://www.hungry.com/ and watch the log to see whether it finds those pages or not
08:16:34<pokechu22>In any case, I'm going to sleep
09:45:06IDK quits [Client Quit]
10:06:23le0n quits [Client Quit]
10:07:31le0n (le0n) joins
10:25:20DLoader_ joins
10:26:26DLoader quits [Ping timeout: 264 seconds]
10:26:30DLoader_ is now known as DLoader
10:46:04TransoniskGravitation joins
10:49:50TransonicGravity quits [Ping timeout: 264 seconds]
11:39:32<@arkiver>did we run all known twu.net sites through AB?
11:39:36<@arkiver>I never got a reply from them, bbtw
11:39:39<@arkiver>btw*
11:42:34<@arkiver>oh
11:43:27<@arkiver>i see a reply in my spam saying the email to support@twu.net could not be delivered
11:47:42knecht420 quits [Quit: The Lounge - https://thelounge.chat]
11:48:03knecht420 (knecht420) joins
11:59:13DLoader quits [Client Quit]
12:48:02lennier1 quits [Ping timeout: 264 seconds]
12:50:46lennier1 (lennier1) joins
12:52:41rocketdive joins
13:26:54HP_Archivist (HP_Archivist) joins
13:47:34Wingy quits [Ping timeout: 252 seconds]
13:51:35Wingy (Wingy) joins
14:31:43IDK (IDK) joins
14:35:27user__ quits [Read error: Connection reset by peer]
14:36:00user__ (gazorpazorp) joins
14:40:32qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
14:57:56sonick (sonick) joins
15:33:05Island joins
16:04:34DLoader joins
16:21:47Megame (Megame) joins
16:42:46ano joins
16:45:57<ano>i'd have thought there'd be more available logs for ircs around here -- must be a good sum of wisdom, resources, and history that passes through
17:14:03Megame quits [Client Quit]
17:23:09spirit joins
17:30:10<audrooku|m>I thought so too
17:43:13HP_Archivist quits [Client Quit]
17:46:49Wingy quits [Ping timeout: 252 seconds]
17:47:43Wingy (Wingy) joins
18:07:55Iki1 joins
18:11:45Iki quits [Ping timeout: 265 seconds]
18:15:39jacksonchen666 (jacksonchen666) joins
18:20:51Iki joins
18:22:42lennier1 quits [Client Quit]
18:23:07Iki1 quits [Ping timeout: 252 seconds]
18:23:22lennier1 (lennier1) joins
18:23:37Iki1 joins
18:24:16jacksonchen666 quits [Remote host closed the connection]
18:24:52jacksonchen666 (jacksonchen666) joins
18:27:42Iki quits [Ping timeout: 265 seconds]
19:00:04jacksonchen666 quits [Remote host closed the connection]
19:02:25jacksonchen666 (jacksonchen666) joins
19:47:24<Ryz>Hmm, is there an ongoing project where there's constant archiving of Pastebin and Pastebin-like services?
20:08:09rocketdives joins
20:11:08rocketdive quits [Ping timeout: 265 seconds]
20:11:23Retrofan joins
20:11:41<Retrofan>Hi
20:13:33<Retrofan>I was wondering... what is the (Readline timed out) error?
20:13:41<Retrofan>and how to fix it?
20:19:15sonick quits [Client Quit]
20:27:13balrog quits [Quit: Bye]
20:29:36<pokechu22>That just means that the site didn't respond in time. If it's happening for everything on the site, the site might have blocked you. If it's happening somewhat randomly, and in particular the site has lots of large files, try setting the concurrency to 1
20:31:36<Retrofan>Oh, yeah, it's happened randomly. I'll give your solution a shot
20:33:55balrog (balrog) joins
20:48:50Iki1 quits [Ping timeout: 265 seconds]
20:49:32sonick (sonick) joins
20:52:57<Retrofan>pokechu22: Thank you, that makes it work even better... but I've noticed that some "soft 404" pages (or redirects) on other sites return "Readline timeout"
20:54:37<pokechu22>Hmm, that probably depends on the site as well
20:55:02<pokechu22>At least with archivebot, pages that give an error like that are automatically retried after everything finishes (and will be attempted up to 3 times total)
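The behaviour pokechu22 describes — failed URLs re-queued after the main crawl finishes, with up to three attempts total — can be sketched as a deferred-retry queue. This is a simplified illustration, not ArchiveBot's actual code:

```python
# Simplified sketch of a deferred-retry crawl queue; not ArchiveBot's
# real implementation. `fetch` is any callable that raises TimeoutError
# on a readline-style timeout.
from collections import deque

def crawl(urls, fetch, max_attempts=3):
    # First pass covers every URL once; timeouts go to the back of the
    # queue, so retries only happen after everything else has been tried.
    queue = deque((url, 1) for url in urls)
    results, failed = {}, []
    while queue:
        url, attempt = queue.popleft()
        try:
            results[url] = fetch(url)
        except TimeoutError:
            if attempt < max_attempts:
                queue.append((url, attempt + 1))  # deferred retry
            else:
                failed.append(url)  # gave up after max_attempts tries
    return results, failed
```

Pushing retries to the back of the queue (rather than retrying immediately) gives a slow or rate-limiting site time to recover before the URL is attempted again.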
21:04:36<Retrofan>It would be better if it archived those "soft error" pages so that the website experience (via WARC) is more similar to the original... 
21:08:55venom joins
21:09:13venom quits [Remote host closed the connection]
21:09:17<pokechu22>If it's an actual 404 (or 3XX or similar) error code, it'll be saved. An example of an actual readline timeout is... e.g. http://www.holypeak.com/talent/voiceactor/shuka_saito.html which isn't something that it would make sense to save
21:10:38<pokechu22>hmm, actually, that one eventually redirected to a cloudflare error page (after a minute), not a perfect example as that could be saved, but archivebot gives up after 30 seconds (IIRC)
21:11:31<pokechu22>http://www.ja.net/company/policies/aup.html might be a better example then?
21:23:15<Retrofan>Can I extend the archive bot's timeout?
21:28:17<pokechu22>I don't know. I assume you're using grab-site, which may or may not let you change that easily; I'm not familiar with it. (#archivebot via IRC doesn't support changing the timeout though)
21:30:44<Retrofan>I am not using grab-site; I am using the full archive bot system (with pipelines)
21:33:25BlueMaxima joins
22:12:12<Retrofan>Anyway, I was thinking of making archivebot check the website (perhaps for a keyword or something) and archive it if true, and skip it if false.
22:12:31<Retrofan>but I don't know how to do it...
22:12:39<Retrofan>*that
22:16:55Retrofan quits [Remote host closed the connection]
22:17:18Retrofan joins
22:35:10hitgrr8 quits [Client Quit]
23:02:18Megame (Megame) joins
23:29:21Retrofan quits [Remote host closed the connection]
23:48:38tzt quits [Ping timeout: 265 seconds]
23:49:55tzt (tzt) joins