| 01:22:52 | | rocketdive quits [Ping timeout: 252 seconds] |
| 01:41:00 | <h2ibot> | JustAnotherArchivist edited Onlyfiles (+432, It's dead, Jim): https://wiki.archiveteam.org/?diff=49404&oldid=49375 |
| 01:42:00 | <h2ibot> | JustAnotherArchivist edited Onlyfiles (+59): https://wiki.archiveteam.org/?diff=49405&oldid=49404 |
| 01:44:00 | <h2ibot> | Mentalist edited Deathwatch (-17): https://wiki.archiveteam.org/?diff=49406&oldid=49397 |
| 02:54:12 | | katocala quits [Remote host closed the connection] |
| 02:54:13 | | Somebody2 is now authenticated as Somebody2 |
| 02:55:08 | <audrooku|m> | Re: WBM CDX Stuffs |
| 02:55:08 | <audrooku|m> | JAA: Thanks for the response, I really appreciate it. |
| 02:55:08 | <audrooku|m> | Do even privileged users in archiveteam have access to the wbm grab items or is that something I have no chance of touching? |
| 02:55:37 | <@JAA> | I'm sure Jason has access, but he works at IA, so... |
| 02:56:02 | <@JAA> | Otherwise, I'm almost certain the answer is no. |
| 03:14:29 | | Test joins |
| 03:24:21 | | Test quits [Ping timeout: 265 seconds] |
| 03:28:14 | | jacksonchen666 quits [Ping timeout: 276 seconds] |
| 03:30:09 | | jacksonchen666 (jacksonchen666) joins |
| 03:52:45 | | katocala joins |
| 03:53:13 | | katocala is now authenticated as katocala |
| 03:58:23 | | Hackerpcs quits [Quit: Hackerpcs] |
| 04:05:49 | | Hackerpcs (Hackerpcs) joins |
| 05:27:54 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 05:52:50 | | Island quits [Read error: Connection reset by peer] |
| 05:58:44 | <h2ibot> | JustAnotherArchivist edited Deathwatch (+164, /* 2023 */ Add MyWOT forum): https://wiki.archiveteam.org/?diff=49407&oldid=49406 |
| 05:58:54 | <@JAA> | Haha qwarc goes brrrrr :-) |
| 06:09:01 | <@JAA> | Looks good, I should have a complete copy of the topic pages in 20-ish minutes. |
| 06:12:10 | | le0n quits [Ping timeout: 252 seconds] |
| 06:28:46 | | le0n (le0n) joins |
| 06:30:57 | <@JAA> | Done, and apart from one topic with an SQL error, it ran almost suspiciously smoothly. |
| 06:33:16 | <pabs> | for sites with a bunch of JS links on one page, is it possible to pass AB multiple entry points for !a instead of just one? |
| 06:34:49 | <@JAA> | pabs: !a < exists, but it has various sharp edges that can backfire badly with regards to recursion, which is why it's undocumented. Depending on the URL list, site structure, and retrieval order, it might recurse further or less far than you'd think. |
| 06:36:34 | <pabs> | for eg I wanted to archive http://www.hungry.com/ but http://www.hungry.com/people/ only has JS links |
| 06:42:02 | | le0n_ (le0n) joins |
| 06:44:04 | | le0n quits [Ping timeout: 252 seconds] |
| 06:44:41 | | le0n_ quits [Client Quit] |
| 06:49:51 | <pokechu22> | The challenge is that http://www.hungry.com/people/ gives links to e.g. http://www.hungry.com/~alves/ and http://www.hungry.com/~beers/ |
| 06:50:47 | <pokechu22> | In this case, an !a < list would probably work, since it's (at least assumed to be) safe if you can do e.g. http://www.hungry.com/~alves http://www.hungry.com/~beers etc (all URLs without a slash in the path) |
| 06:53:01 | <pokechu22> | Looks like http://www.hungry.com/hungries/hungries.js stores all of that stuff. AB does sometimes directly extract links from javascript, but I'm not sure if it would extract these or not (I think it will extract the images like /hungries/watkins0.jpg but I'm less sure if it extracts full URLs for whatever reason) |
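The kind of link extraction from JavaScript being discussed could be sketched roughly like this. To be clear, these regex patterns and the assumed structure of hungries.js are illustrative guesses, not ArchiveBot's actual extraction rules:

```python
import re

# Hypothetical extractor: pull quoted path-like strings and absolute URLs out
# of a JS file, in the spirit of a crawler scraping links from JavaScript.
# Both patterns are illustrative, not ArchiveBot's real extraction logic.
PATH_RE = re.compile(r'["\'](/[^"\']+\.(?:jpg|png|gif|html?))["\']')
URL_RE = re.compile(r'["\'](https?://[^"\']+)["\']')

def extract_candidates(js_text, base="http://www.hungry.com"):
    urls = set(URL_RE.findall(js_text))
    # Root-relative paths get resolved against the assumed site root.
    urls.update(base + p for p in PATH_RE.findall(js_text))
    return sorted(urls)

sample = 'img.src = "/hungries/watkins0.jpg"; var home = "http://www.hungry.com/~alves/";'
print(extract_candidates(sample))
```

Whether AB's real extractor would catch the same strings is exactly the open question in the discussion above; watching the job log is the practical way to find out.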
| 07:00:50 | <audrooku|m> | Re: wbm cdx stuff again |
| 07:00:50 | <audrooku|m> | Thanks JAA, figures haha, didn't realize Jason was considered part of AT |
| 07:01:05 | | le0n (le0n) joins |
| 07:02:10 | <@JAA> | audrooku|m: Well, he founded it, so yeah. :-) He isn't around much these days though. |
| 07:02:32 | <pokechu22> | JAA: How does qwarc handle threads with multiple pages? |
| 07:03:15 | <audrooku|m> | Ah huh, didn't know that haha, makes a lot of sense though ;) |
| 07:03:27 | <pokechu22> | (my impression is that it just indiscriminately downloads a series of URLs, and you'd generate an incremental list of thread IDs and download those, in which case you'd need special logic to know if multiple pages exist. but I might be wrong on this) |
| 07:03:53 | | jacksonchen666 quits [Remote host closed the connection] |
| 07:05:05 | <@JAA> | pokechu22: qwarc's just a framework for grabbing stuff with very little overhead. It doesn't do anything on its own. You need to write code (called a spec file, I might change that term at some point) to tell it what to actually do. And yes, my code for MyWOT handled topic pagination (and session IDs and the language switcher). |
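The pagination handling JAA mentions could look conceptually like this generic enumerate-pages loop. This is not qwarc's API; the `fetch` callable and the next-page regex are stand-ins for whatever the spec file actually does:

```python
import re

# Hypothetical next-page link pattern; a real spec file would match the
# forum's actual markup.
NEXT_RE = re.compile(r'href="([^"]*\?page=\d+)"')

def paginate(topic_url, fetch):
    """Yield (url, body) for every page of a topic, following next-page links.
    fetch(url) -> body is injected, so this runs without any network access."""
    seen = set()
    queue = [topic_url]
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        body = fetch(url)
        yield url, body
        queue.extend(m for m in NEXT_RE.findall(body) if m not in seen)
```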
| 07:06:25 | <pokechu22> | Ah, it's not just a version of !ao < list that doesn't extract images, it actually does know how to look at the response to extract more links (if you tell it how to). OK, good to know |
| 07:06:42 | <@JAA> | No, it does not. You need to write that code to make it do that. |
| 07:07:28 | <pokechu22> | Right, but I was assuming that it was *just* capable of doing lists without any way of expanding on it |
| 07:08:07 | <@JAA> | It provides an interface for 'fetch this URL' with a callback to decide what to do with the response (accept, retry; write to WARC or not). And it handles the HTTP and WARC writing. Plus there's a CLI for actually running it, with concurrency and parallel processes and stuff. |
| 07:08:20 | <@JAA> | All other logic is user-provided in the spec file. |
| 07:09:03 | <@JAA> | I did write a spec file a long while ago that does recursive crawls. That's what I was referring to in #archivebot earlier. |
| 07:09:23 | <@JAA> | But on its own, qwarc doesn't even look at the HTTP body at all. It just downloads it and (unless prevented) writes it to the WARC. |
| 07:09:54 | <@JAA> | That's also part of why it's so efficient. |
| 07:12:10 | <@JAA> | This is what the spec file for the MyWOT Forums looks like: https://transfer.archivete.am/GDKC3/forum.mywot.com.py |
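The interface JAA describes (fetch a URL, let a user callback decide accept/retry and whether to write the response out) can be sketched generically. This shows the pattern only; the names below are invented for illustration and are not qwarc's actual classes or methods:

```python
from enum import Enum

class Action(Enum):
    ACCEPT = 1   # keep the response and write it out (e.g. to a WARC)
    RETRY = 2    # request again
    SKIP = 3     # accept the fetch but don't record it

def run(urls, fetch, decide, write, max_tries=3):
    """Fetch each URL; the user-supplied decide() callback controls what
    happens next. fetch/decide/write stand in for the framework internals."""
    for url in urls:
        for attempt in range(max_tries):
            response = fetch(url)
            action = decide(url, response, attempt)
            if action is Action.RETRY:
                continue
            if action is Action.ACCEPT:
                write(url, response)
            break
```

All site-specific logic (pagination, session IDs, etc.) would live in the `decide`-style callbacks, which matches the "all other logic is user-provided in the spec file" description.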
| 07:23:33 | <@arkiver> | JAA: as always feel free to put the outlinks from that site in #// :) |
| 07:25:38 | <@JAA> | arkiver: Yup, and I'll have a bunch from others as well. :-) |
| 07:25:53 | <@arkiver> | awesome! |
| 07:35:46 | | IDK (IDK) joins |
| 07:42:01 | | hitgrr8 joins |
| 08:03:10 | | Icyelut (Icyelut) joins |
| 08:05:28 | | Icyelut|2 quits [Ping timeout: 252 seconds] |
| 08:06:29 | <pabs> | pokechu22: hmm, how would one find out? guess I could just try !a http://www.hungry.com/hungries/hungries.js ? and separately !a the other stuff |
| 08:08:59 | <pokechu22> | I think the extraction behaves differently if you save the js file itself versus saving a page containing it, so I'm not sure exactly. Probably the easiest thing to do would be just !a http://www.hungry.com/ and watch the log to see whether it finds those pages or not |
| 08:16:34 | <pokechu22> | In any case, I'm going to sleep |
| 09:45:06 | | IDK quits [Client Quit] |
| 10:06:23 | | le0n quits [Client Quit] |
| 10:07:31 | | le0n (le0n) joins |
| 10:25:20 | | DLoader_ joins |
| 10:26:26 | | DLoader quits [Ping timeout: 264 seconds] |
| 10:26:30 | | DLoader_ is now known as DLoader |
| 10:46:04 | | TransoniskGravitation joins |
| 10:49:50 | | TransonicGravity quits [Ping timeout: 264 seconds] |
| 11:39:32 | <@arkiver> | did we run all known twu.net sites through AB? |
| 11:39:36 | <@arkiver> | I never got a reply from them, btw |
| 11:42:34 | <@arkiver> | oh |
| 11:43:27 | <@arkiver> | i see a reply in my spam saying the email to support@twu.net could not be delivered |
| 11:47:42 | | knecht420 quits [Quit: The Lounge - https://thelounge.chat] |
| 11:48:03 | | knecht420 (knecht420) joins |
| 11:59:13 | | DLoader quits [Client Quit] |
| 12:48:02 | | lennier1 quits [Ping timeout: 264 seconds] |
| 12:50:46 | | lennier1 (lennier1) joins |
| 12:52:41 | | rocketdive joins |
| 13:26:54 | | HP_Archivist (HP_Archivist) joins |
| 13:47:34 | | Wingy quits [Ping timeout: 252 seconds] |
| 13:51:35 | | Wingy (Wingy) joins |
| 14:31:43 | | IDK (IDK) joins |
| 14:35:27 | | user__ quits [Read error: Connection reset by peer] |
| 14:36:00 | | user__ (gazorpazorp) joins |
| 14:40:32 | | qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds] |
| 14:57:56 | | sonick (sonick) joins |
| 15:33:05 | | Island joins |
| 16:04:34 | | DLoader joins |
| 16:21:47 | | Megame (Megame) joins |
| 16:42:46 | | ano joins |
| 16:45:57 | <ano> | i'd have thought there'd be more available logs for ircs around here -- must be a good sum of wisdom, resources, and history that passes through |
| 17:14:03 | | Megame quits [Client Quit] |
| 17:23:09 | | spirit joins |
| 17:30:10 | <audrooku|m> | I thought so too |
| 17:43:13 | | HP_Archivist quits [Client Quit] |
| 17:46:49 | | Wingy quits [Ping timeout: 252 seconds] |
| 17:47:43 | | Wingy (Wingy) joins |
| 18:07:55 | | Iki1 joins |
| 18:11:45 | | Iki quits [Ping timeout: 265 seconds] |
| 18:15:39 | | jacksonchen666 (jacksonchen666) joins |
| 18:20:51 | | Iki joins |
| 18:22:42 | | lennier1 quits [Client Quit] |
| 18:23:07 | | Iki1 quits [Ping timeout: 252 seconds] |
| 18:23:22 | | lennier1 (lennier1) joins |
| 18:23:37 | | Iki1 joins |
| 18:24:16 | | jacksonchen666 quits [Remote host closed the connection] |
| 18:24:52 | | jacksonchen666 (jacksonchen666) joins |
| 18:27:42 | | Iki quits [Ping timeout: 265 seconds] |
| 19:00:04 | | jacksonchen666 quits [Remote host closed the connection] |
| 19:02:25 | | jacksonchen666 (jacksonchen666) joins |
| 19:47:24 | <Ryz> | Hmm, is there an ongoing project where there's constant archiving of Pastebin and Pastebin-like services? |
| 20:08:09 | | rocketdives joins |
| 20:11:08 | | rocketdive quits [Ping timeout: 265 seconds] |
| 20:11:23 | | Retrofan joins |
| 20:11:41 | <Retrofan> | Hi |
| 20:13:33 | <Retrofan> | I was wondering... what is the (Readline timed out) error? |
| 20:13:41 | <Retrofan> | and how do I fix it? |
| 20:19:15 | | sonick quits [Client Quit] |
| 20:27:13 | | balrog quits [Quit: Bye] |
| 20:29:36 | <pokechu22> | That just means that the site didn't respond in time. If it's happening for everything on the site, the site might have blocked you. If it's happening somewhat randomly, and in particular the site has lots of large files, try setting the concurrency to 1 |
| 20:31:36 | <Retrofan> | Oh, yeah, it's happened randomly. I'll give your solution a shot |
| 20:33:55 | | balrog (balrog) joins |
| 20:48:50 | | Iki1 quits [Ping timeout: 265 seconds] |
| 20:49:32 | | sonick (sonick) joins |
| 20:52:57 | <Retrofan> | pokechu22: Thank you, that made it work much better... but I've noticed that some "soft 404" pages (or redirects) on other sites also return "Readline timeout" |
| 20:54:37 | <pokechu22> | Hmm, that probably depends on the site as well |
| 20:55:02 | <pokechu22> | At least with archivebot, pages that give an error like that are automatically retried after everything finishes (and will be attempted up to 3 times total) |
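The deferred-retry behaviour described above (failed pages re-queued after everything else finishes, up to 3 attempts total) can be sketched like this. This is a conceptual model, not ArchiveBot's source; `fetch` is a stand-in that would enforce the ~30-second timeout mentioned earlier:

```python
MAX_ATTEMPTS = 3  # "up to 3 times total", per the discussion above

def crawl_with_deferred_retries(urls, fetch):
    """Fetch every URL; timed-out ones are retried only after the whole
    current pass completes, up to MAX_ATTEMPTS attempts each."""
    results, attempts = {}, {u: 0 for u in urls}
    pending = list(urls)
    while pending:
        retry_later = []
        for url in pending:
            attempts[url] += 1
            try:
                results[url] = fetch(url)  # real code would apply a ~30 s timeout
            except TimeoutError:
                if attempts[url] < MAX_ATTEMPTS:
                    retry_later.append(url)  # deferred to the next pass
        pending = retry_later
    return results, attempts
```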
| 21:04:36 | <Retrofan> | It would be better if it archived those "soft error" pages so that the website experience (via WARC) is more similar to the original... |
| 21:08:55 | | venom joins |
| 21:09:13 | | venom quits [Remote host closed the connection] |
| 21:09:17 | <pokechu22> | If it's an actual 404 (or 3XX or similar) error code, it'll be saved. An example of an actual readline timeout is... e.g. http://www.holypeak.com/talent/voiceactor/shuka_saito.html which isn't something that it would make sense to save |
| 21:10:38 | <pokechu22> | hmm, actually, that one eventually redirected to a cloudflare error page (after a minute), not a perfect example as that could be saved, but archivebot gives up after 30 seconds (IIRC) |
| 21:11:31 | <pokechu22> | http://www.ja.net/company/policies/aup.html might be a better example then? |
| 21:23:15 | <Retrofan> | Can I extend the archive bot's timeout? |
| 21:28:17 | <pokechu22> | I don't know. I assume you're using grab-site, which may or may not let you change that easily; I'm not familiar with it. (#archivebot via IRC doesn't support changing the timeout though) |
| 21:30:44 | <Retrofan> | I am not using grab-site; I am using the full archive bot system (with pipelines) |
| 21:33:25 | | BlueMaxima joins |
| 22:12:12 | <Retrofan> | Anyway, I was thinking of making ArchiveBot check the website (perhaps for a keyword or something) and archive it only if the check passes, and skip it otherwise. |
| 22:12:31 | <Retrofan> | but I don't know how to do that... |
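The keyword-check idea could be prototyped outside ArchiveBot as a simple pre-filter over a URL list; the survivors would then be queued as usual. This is a standalone sketch, not an ArchiveBot feature, and `fetch` is injected so the logic runs without network access:

```python
def filter_by_keyword(urls, keyword, fetch):
    """Keep only URLs whose page body contains the keyword (case-insensitive).
    fetch(url) -> str is user-supplied, e.g. a urllib wrapper with a timeout."""
    keep = []
    for url in urls:
        try:
            body = fetch(url)
        except OSError:
            continue  # unreachable pages are simply skipped
        if keyword.lower() in body.lower():
            keep.append(url)
    return keep
```

In practice `fetch` could be something like `lambda u: urllib.request.urlopen(u, timeout=30).read().decode(errors="replace")`.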
| 22:16:55 | | Retrofan quits [Remote host closed the connection] |
| 22:17:18 | | Retrofan joins |
| 22:35:10 | | hitgrr8 quits [Client Quit] |
| 23:02:18 | | Megame (Megame) joins |
| 23:29:21 | | Retrofan quits [Remote host closed the connection] |
| 23:48:38 | | tzt quits [Ping timeout: 265 seconds] |
| 23:49:55 | | tzt (tzt) joins |