| 01:22:52 | | rocketdive quits [Ping timeout: 252 seconds] |
| 01:41:00 | <h2ibot> | JustAnotherArchivist edited Onlyfiles (+432, It's dead, Jim): https://wiki.archiveteam.org/?diff=49404&oldid=49375 |
| 01:42:00 | <h2ibot> | JustAnotherArchivist edited Onlyfiles (+59): https://wiki.archiveteam.org/?diff=49405&oldid=49404 |
| 01:44:00 | <h2ibot> | Mentalist edited Deathwatch (-17): https://wiki.archiveteam.org/?diff=49406&oldid=49397 |
| 02:54:12 | | katocala quits [Remote host closed the connection] |
| 02:54:13 | | Somebody2 is now authenticated as Somebody2 |
| 02:55:08 | <audrooku|m> | Re: WBM CDX Stuffs |
| 02:55:08 | <audrooku|m> | JAA: Thanks for the response, I really appreciate it. |
| 02:55:08 | <audrooku|m> | Do even privileged users in archiveteam have access to the wbm grab items or is that something I have no chance of touching? |
| 02:55:37 | <@JAA> | I'm sure Jason has access, but he works at IA, so... |
| 02:56:02 | <@JAA> | Otherwise, I'm almost certain the answer is no. |
| 03:14:29 | | Test joins |
| 03:24:21 | | Test quits [Ping timeout: 265 seconds] |
| 03:28:14 | | jacksonchen666 quits [Ping timeout: 276 seconds] |
| 03:30:09 | | jacksonchen666 (jacksonchen666) joins |
| 03:52:45 | | katocala joins |
| 03:53:13 | | katocala is now authenticated as katocala |
| 03:58:23 | | Hackerpcs quits [Quit: Hackerpcs] |
| 04:05:49 | | Hackerpcs (Hackerpcs) joins |
| 05:27:54 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 05:52:50 | | Island quits [Read error: Connection reset by peer] |
| 05:58:44 | <h2ibot> | JustAnotherArchivist edited Deathwatch (+164, /* 2023 */ Add MyWOT forum): https://wiki.archiveteam.org/?diff=49407&oldid=49406 |
| 05:58:54 | <@JAA> | Haha qwarc goes brrrrr :-) |
| 06:09:01 | <@JAA> | Looks good, I should have a complete copy of the topic pages in 20-ish minutes. |
| 06:12:10 | | le0n quits [Ping timeout: 252 seconds] |
| 06:28:46 | | le0n (le0n) joins |
| 06:30:57 | <@JAA> | Done, and apart from one topic with an SQL error, it ran almost suspiciously smoothly. |
| 06:33:16 | <pabs> | for sites with a bunch of JS links on one page, is it possible to pass AB multiple entry points for !a instead of just one? |
| 06:34:49 | <@JAA> | pabs: !a < exists, but it has various sharp edges that can backfire badly with regards to recursion, which is why it's undocumented. Depending on the URL list, site structure, and retrieval order, it might recurse further or less far than you'd think. |
| 06:36:34 | <pabs> | for eg I wanted to archive http://www.hungry.com/ but http://www.hungry.com/people/ only has JS links |
| 06:42:02 | | le0n_ (le0n) joins |
| 06:44:04 | | le0n quits [Ping timeout: 252 seconds] |
| 06:44:41 | | le0n_ quits [Client Quit] |
| 06:49:51 | <pokechu22> | The challenge is that http://www.hungry.com/people/ gives links to e.g. http://www.hungry.com/~alves/ and http://www.hungry.com/~beers/ |
| 06:50:47 | <pokechu22> | In this case, an !a < list would probably work, since it's (at least assumed to be) safe if you can do e.g. http://www.hungry.com/~alves http://www.hungry.com/~beers etc (all URLs without a slash in the path) |
| 06:53:01 | <pokechu22> | Looks like http://www.hungry.com/hungries/hungries.js stores all of that stuff. AB does sometimes directly extract links from javascript, but I'm not sure if it would extract these or not (I think it will extract the images like /hungries/watkins0.jpg but I'm less sure if it extracts full URLs for whatever reason) |
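The kind of link extraction from JavaScript being discussed could be sketched roughly like this. To be clear, these regex patterns and the assumed structure of hungries.js are illustrative guesses, not ArchiveBot's actual extraction rules:

```python
import re

# Hypothetical extractor: pull quoted path-like strings and absolute URLs out
# of a JS file, in the spirit of a crawler scraping links from JavaScript.
# Both patterns are illustrative, not ArchiveBot's real extraction logic.
PATH_RE = re.compile(r'["\'](/[^"\']+\.(?:jpg|png|gif|html?))["\']')
URL_RE = re.compile(r'["\'](https?://[^"\']+)["\']')

def extract_candidates(js_text, base="http://www.hungry.com"):
    urls = set(URL_RE.findall(js_text))
    # Root-relative paths get resolved against the assumed site root.
    urls.update(base + p for p in PATH_RE.findall(js_text))
    return sorted(urls)

sample = 'img.src = "/hungries/watkins0.jpg"; var home = "http://www.hungry.com/~alves/";'
print(extract_candidates(sample))
```

Whether AB's real extractor would catch the same strings is exactly the open question in the discussion above; watching the job log is the practical way to find out.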
| 07:00:50 | <audrooku|m> | Re: wbm cdx stuff again |
| 07:00:50 | <audrooku|m> | Thanks JAA, figures haha, didn't realize Jason was considered part of AT |
| 07:01:05 | | le0n (le0n) joins |
| 07:02:10 | <@JAA> | audrooku|m: Well, he founded it, so yeah. :-) He isn't around much these days though. |
| 07:02:32 | <pokechu22> | JAA: How does qwarc handle threads with multiple pages? |
| 07:03:15 | <audrooku|m> | Ah huh, didn't know that haha, makes a lot of sense though ;) |
| 07:03:27 | <pokechu22> | (my impression is that it just indiscriminately downloads a series of URLs, and you'd generate an incremental list of thread IDs and download those, in which case you'd need special logic to know if multiple pages exist. but I might be wrong on this) |
| 07:03:53 | | jacksonchen666 quits [Remote host closed the connection] |
| 07:05:05 | <@JAA> | pokechu22: qwarc's just a framework for grabbing stuff with very little overhead. It doesn't do anything on its own. You need to write code (called a spec file, I might change that term at some point) to tell it what to actually do. And yes, my code for MyWOT handled topic pagination (and session IDs and the language switcher). |
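The pagination handling JAA mentions could look conceptually like this generic enumerate-pages loop. This is not qwarc's API; the `fetch` callable and the next-page regex are stand-ins for whatever the spec file actually does:

```python
import re

# Hypothetical next-page link pattern; a real spec file would match the
# forum's actual markup.
NEXT_RE = re.compile(r'href="([^"]*\?page=\d+)"')

def paginate(topic_url, fetch):
    """Yield (url, body) for every page of a topic, following next-page links.
    fetch(url) -> body is injected, so this runs without any network access."""
    seen = set()
    queue = [topic_url]
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        body = fetch(url)
        yield url, body
        queue.extend(m for m in NEXT_RE.findall(body) if m not in seen)
```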
| 07:06:25 | <pokechu22> | Ah, it's not just a version of !ao < list that doesn't extract images, it actually does know how to look at the response to extract more links (if you tell it how to). OK, good to know |
| 07:06:42 | <@JAA> | No, it does not. You need to write that code to make it do that. |
| 07:07:28 | <pokechu22> | Right, but I was assuming that it was *just* capable of doing lists without any way of expanding on it |
| 07:08:07 | <@JAA> | It provides an interface for 'fetch this URL' with a callback to decide what to do with the response (accept, retry; write to WARC or not). And it handles the HTTP and WARC writing. Plus there's a CLI for actually running it, with concurrency and parallel processes and stuff. |
| 07:08:20 | <@JAA> | All other logic is user-provided in the spec file. |
| 07:09:03 | <@JAA> | I did write a spec file a long while ago that does recursive crawls. That's what I was referring to in #archivebot earlier. |
| 07:09:23 | <@JAA> | But on its own, qwarc doesn't even look at the HTTP body at all. It just downloads it and (unless prevented) writes it to the WARC. |
| 07:09:54 | <@JAA> | That's also part of why it's so efficient. |
| 07:12:10 | <@JAA> | This is what the spec file for the MyWOT Forums looks like: https://transfer.archivete.am/GDKC3/forum.mywot.com.py |
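The interface JAA describes (fetch a URL, let a user callback decide accept/retry and whether to write the response out) can be sketched generically. This shows the pattern only; the names below are invented for illustration and are not qwarc's actual classes or methods:

```python
from enum import Enum

class Action(Enum):
    ACCEPT = 1   # keep the response and write it out (e.g. to a WARC)
    RETRY = 2    # request again
    SKIP = 3     # accept the fetch but don't record it

def run(urls, fetch, decide, write, max_tries=3):
    """Fetch each URL; the user-supplied decide() callback controls what
    happens next. fetch/decide/write stand in for the framework internals."""
    for url in urls:
        for attempt in range(max_tries):
            response = fetch(url)
            action = decide(url, response, attempt)
            if action is Action.RETRY:
                continue
            if action is Action.ACCEPT:
                write(url, response)
            break
```

All site-specific logic (pagination, session IDs, etc.) would live in the `decide`-style callbacks, which matches the "all other logic is user-provided in the spec file" description.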
| 07:23:33 | <@arkiver> | JAA: as always feel free to put the outlinks from that site in #// :) |
| 07:25:38 | <@JAA> | arkiver: Yup, and I'll have a bunch from others as well. :-) |
| 07:25:53 | <@arkiver> | awesome! |
| 07:35:46 | | IDK (IDK) joins |
| 07:42:01 | | hitgrr8 joins |
| 08:03:10 | | Icyelut (Icyelut) joins |
| 08:05:28 | | Icyelut|2 quits [Ping timeout: 252 seconds] |
| 08:06:29 | <pabs> | pokechu22: hmm, how would one find out? guess I could just try !a http://www.hungry.com/hungries/hungries.js ? and separately !a the other stuff |
| 08:08:59 | <pokechu22> | I think the extraction behaves differently if you save the js file itself versus saving a page containing it, so I'm not sure exactly. Probably the easiest thing to do would be just !a http://www.hungry.com/ and watch the log to see whether it finds those pages or not |
| 08:16:34 | <pokechu22> | In any case, I'm going to sleep |
| 09:45:06 | | IDK quits [Client Quit] |
| 10:06:23 | | le0n quits [Client Quit] |
| 10:07:31 | | le0n (le0n) joins |
| 10:25:20 | | DLoader_ joins |
| 10:26:26 | | DLoader quits [Ping timeout: 264 seconds] |
| 10:26:30 | | DLoader_ is now known as DLoader |
| 10:46:04 | | TransoniskGravitation joins |
| 10:49:50 | | TransonicGravity quits [Ping timeout: 264 seconds] |
| 11:39:32 | <@arkiver> | did we run all known twu.net sites through AB? |
| 11:39:36 | <@arkiver> | I never got a reply from them, btw |
| 11:42:34 | <@arkiver> | oh |
| 11:43:27 | <@arkiver> | i see a reply in my spam saying the email to support@twu.net could not be delivered |
| 11:47:42 | | knecht420 quits [Quit: The Lounge - https://thelounge.chat] |
| 11:48:03 | | knecht420 (knecht420) joins |
| 11:59:13 | | DLoader quits [Client Quit] |
| 12:48:02 | | lennier1 quits [Ping timeout: 264 seconds] |
| 12:50:46 | | lennier1 (lennier1) joins |
| 12:52:41 | | rocketdive joins |
| 13:26:54 | | HP_Archivist (HP_Archivist) joins |
| 13:47:34 | | Wingy quits [Ping timeout: 252 seconds] |
| 13:51:35 | | Wingy (Wingy) joins |
| 14:31:43 | | IDK (IDK) joins |
| 14:35:27 | | user__ quits [Read error: Connection reset by peer] |
| 14:36:00 | | user__ (gazorpazorp) joins |
| 14:40:32 | | qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds] |
| 14:57:56 | | sonick (sonick) joins |
| 15:33:05 | | Island joins |
| 16:04:34 | | DLoader joins |
| 16:21:47 | | Megame (Megame) joins |
| 16:42:46 | | ano joins |
| 16:45:57 | <ano> | i'd have thought there'd be more available logs for ircs around here -- must be a good sum of wisdom, resources, and history that passes through |
| 17:14:03 | | Megame quits [Client Quit] |
| 17:23:09 | | spirit joins |
| 17:30:10 | <audrooku|m> | I thought so too |
| 17:43:13 | | HP_Archivist quits [Client Quit] |
| 17:46:49 | | Wingy quits [Ping timeout: 252 seconds] |
| 17:47:43 | | Wingy (Wingy) joins |
| 18:07:55 | | Iki1 joins |
| 18:11:45 | | Iki quits [Ping timeout: 265 seconds] |
| 18:15:39 | | jacksonchen666 (jacksonchen666) joins |
| 18:20:51 | | Iki joins |
| 18:22:42 | | lennier1 quits [Client Quit] |
| 18:23:07 | | Iki1 quits [Ping timeout: 252 seconds] |
| 18:23:22 | | lennier1 (lennier1) joins |
| 18:23:37 | | Iki1 joins |
| 18:24:16 | | jacksonchen666 quits [Remote host closed the connection] |
| 18:24:52 | | jacksonchen666 (jacksonchen666) joins |
| 18:27:42 | | Iki quits [Ping timeout: 265 seconds] |
| 19:00:04 | | jacksonchen666 quits [Remote host closed the connection] |
| 19:02:25 | | jacksonchen666 (jacksonchen666) joins |
| 19:47:24 | <Ryz> | Hmm, is there an ongoing project where there's constant archiving of Pastebin and Pastebin-like services? |
| 20:08:09 | | rocketdives joins |
| 20:11:08 | | rocketdive quits [Ping timeout: 265 seconds] |
| 20:11:23 | | Retrofan joins |
| 20:11:41 | <Retrofan> | Hi |
| 20:13:33 | <Retrofan> | I was wondering... what is the (Readline timed out) error? |
| 20:13:41 | <Retrofan> | and how do I fix it? |
| 20:19:15 | | sonick quits [Client Quit] |
| 20:27:13 | | balrog quits [Quit: Bye] |
| 20:29:36 | <pokechu22> | That just means that the site didn't respond in time. If it's happening for everything on the site, the site might have blocked you. If it's happening somewhat randomly, and in particular the site has lots of large files, try setting the concurrency to 1 |
| 20:31:36 | <Retrofan> | Oh, yeah, it's happened randomly. I'll give your solution a shot |
| 20:33:55 | | balrog (balrog) joins |
| 20:48:50 | | Iki1 quits [Ping timeout: 265 seconds] |
| 20:49:32 | | sonick (sonick) joins |
| 20:52:57 | <Retrofan> | pokechu22: Thank you, that made it work much better... but I've noticed that some "soft 404" pages (or redirects) on other sites also return "Readline timeout" |
| 20:54:37 | <pokechu22> | Hmm, that probably depends on the site as well |
| 20:55:02 | <pokechu22> | At least with archivebot, pages that give an error like that are automatically retried after everything finishes (and will be attempted up to 3 times total) |
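The deferred-retry behaviour described above (failed pages re-queued after everything else finishes, up to 3 attempts total) can be sketched like this. This is a conceptual model, not ArchiveBot's source; `fetch` is a stand-in that would enforce the ~30-second timeout mentioned earlier:

```python
MAX_ATTEMPTS = 3  # "up to 3 times total", per the discussion above

def crawl_with_deferred_retries(urls, fetch):
    """Fetch every URL; timed-out ones are retried only after the whole
    current pass completes, up to MAX_ATTEMPTS attempts each."""
    results, attempts = {}, {u: 0 for u in urls}
    pending = list(urls)
    while pending:
        retry_later = []
        for url in pending:
            attempts[url] += 1
            try:
                results[url] = fetch(url)  # real code would apply a ~30 s timeout
            except TimeoutError:
                if attempts[url] < MAX_ATTEMPTS:
                    retry_later.append(url)  # deferred to the next pass
        pending = retry_later
    return results, attempts
```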
| 21:04:36 | <Retrofan> | It would be better if it archived those "soft error" pages so that the website experience (via WARC) is more similar to the original... |
| 21:08:55 | | venom joins |
| 21:09:13 | | venom quits [Remote host closed the connection] |
| 21:09:17 | <pokechu22> | If it's an actual 404 (or 3XX or similar) error code, it'll be saved. An example of an actual readline timeout is... e.g. http://www.holypeak.com/talent/voiceactor/shuka_saito.html which isn't something that it would make sense to save |
| 21:10:38 | <pokechu22> | hmm, actually, that one eventually redirected to a cloudflare error page (after a minute), not a perfect example as that could be saved, but archivebot gives up after 30 seconds (IIRC) |
| 21:11:31 | <pokechu22> | http://www.ja.net/company/policies/aup.html might be a better example then? |
| 21:23:15 | <Retrofan> | Can I extend the archive bot's timeout? |
| 21:28:17 | <pokechu22> | I don't know. I assume you're using grab-site, which may or may not let you change that easily; I'm not familiar with it. (#archivebot via IRC doesn't support changing the timeout though) |
| 21:30:44 | <Retrofan> | I am not using grab-site; I am using the full archive bot system (with pipelines) |
| 21:33:25 | | BlueMaxima joins |
| 22:12:12 | <Retrofan> | Anyway, I was thinking of making ArchiveBot check the website (perhaps for a keyword or something) and archive it only if the check passes, and skip it otherwise. |
| 22:12:31 | <Retrofan> | but I don't know how to do that... |
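The keyword-check idea could be prototyped outside ArchiveBot as a simple pre-filter over a URL list; the survivors would then be queued as usual. This is a standalone sketch, not an ArchiveBot feature, and `fetch` is injected so the logic runs without network access:

```python
def filter_by_keyword(urls, keyword, fetch):
    """Keep only URLs whose page body contains the keyword (case-insensitive).
    fetch(url) -> str is user-supplied, e.g. a urllib wrapper with a timeout."""
    keep = []
    for url in urls:
        try:
            body = fetch(url)
        except OSError:
            continue  # unreachable pages are simply skipped
        if keyword.lower() in body.lower():
            keep.append(url)
    return keep
```

In practice `fetch` could be something like `lambda u: urllib.request.urlopen(u, timeout=30).read().decode(errors="replace")`.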
| 22:16:55 | | Retrofan quits [Remote host closed the connection] |
| 22:17:18 | | Retrofan joins |
| 22:35:10 | | hitgrr8 quits [Client Quit] |
| 23:02:18 | | Megame (Megame) joins |
| 23:29:21 | | Retrofan quits [Remote host closed the connection] |
| 23:48:38 | | tzt quits [Ping timeout: 265 seconds] |
| 23:49:55 | | tzt (tzt) joins |