00:18:20tech234a (tech234a) joins
00:23:18michaelblob (michaelblob) joins
00:52:40Arcorann (Arcorann) joins
01:48:17sarge quits [Client Quit]
01:57:35g-man joins
02:00:43<g-man>hi, i'm trying to figure out how to use cookies with wget to rip a website that needs an account. i have my cookies exported from firefox.
02:00:57<g-man>this works on curl (even with multiple invocations in a row), but curl isn't a crawler: `--cookie cookies.txt --cookie-jar cookies.txt`
02:01:11<g-man>neither of these is working in wget: `--load-cookies cookies.txt --keep-session-cookies` , `--load-cookies cookies.txt --keep-session-cookies --save-cookies cookies.txt`
02:01:21<g-man>thoughts?
02:06:31<thuban>g-man: do the problematic cookies start with #HttpOnly_ in the cookies.txt file?
02:11:47<g-man>someone else actually just suggested that to me moments ago. i wonder if you're the same person. lol
02:11:57<g-man>i'm about to look into that. gotta grab a fresh cookie file
02:13:34<g-man>yes, they do. i will try stripping that. if that fixes it, that's pretty cool
02:15:31<g-man>follow up question though
02:16:39<g-man>for a single invocation of wget (say for crawling/mirroring mode), should i be using `--save-cookies` ? and if i do, should it get the exact same file argument as `--load-cookies` ? it works that way conceptually with curl, but i couldn't find that clearly explained anywhere
02:21:04<g-man>apparently this is a wordpress-based site. hm
02:21:37<g-man>so i want to go from `#HttpOnly_example.com FALSE /wp-admin ...` to `example.com FALSE /wp-admin ...`? removing the underscore?
02:22:02<thuban>right
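[editor's note: the prefix-stripping step discussed above can be sketched in Python; this is a minimal example, and the filenames in the usage comment are placeholders]

```python
PREFIX = "#HttpOnly_"

def strip_httponly(lines):
    """Drop the nonstandard #HttpOnly_ marker from Netscape cookies.txt
    lines so wget stops treating them as comments."""
    return [l[len(PREFIX):] if l.startswith(PREFIX) else l for l in lines]

# Usage (filenames are placeholders):
#   fixed = strip_httponly(open("cookies.txt").read().splitlines())
#   open("cookies-wget.txt", "w").write("\n".join(fixed) + "\n")
```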
02:22:54<g-man>trying now. thanks
02:23:51nicolas17 joins
02:23:54<g-man>also i have `--exclude-directories='/membership-login,/subscribe'` since those paths contain logout links. hopefully i'm using that correctly
02:26:11<g-man>holy cow, i think that worked. what a simple solution to days of frustration and investigating other tools (ones that can save plain files with rewritten relative urls instead of WARC)
02:26:20<g-man>thanks bro
02:27:25<g-man>also, for another project, i have WARC files, but i'm having a hell of a time extracting files from them. almost every tool i've tried chokes on some of them, and they're almost all insanely slow.
02:27:50<thuban>np. it's a nonstandard extension to the format; some tools (including the firefox cookies.txt extension and curl) support it, but some (including wget) don't (instead treating the whole line as a comment)
02:27:57<g-man>have you ever used jwat-tools? it seems the best written of the bunch, but i can't figure out how to get real filenames out of the damn thing
02:28:14<g-man>it outputs like `extracted.001` `extracted.002` ...
02:28:25<g-man>using the `extract` subcommand
02:32:08<thuban>as for saving cookies, it's up to you. you don't need `--save-cookies` to have cookies persist during a crawling session; that's the default (you can turn it off with `--no-cookies`), saving cookies just writes them to a file. you can set it to whatever, and i would guess it's safe to write over the same file you loaded from, but i've never actually used this feature
02:33:40<thuban>i don't do much warc processing and definitely don't know that tool, but JAA might be able to recommend something
02:33:42<g-man>i found that in curl, a cookies file worked for a single invocation, but it would fail on subsequent invocations. the addition of `--cookie-jar` (which writes cookies to a file) fixed it for me. probably session cookie changes with browsing
02:34:02<g-man>so i'm guessing that the same is true for wget if you do multiple invocations to keep the cookie file valid
02:34:22<g-man>but you probably don't strictly need multiple invocations since wget can crawl
02:36:40<thuban>if you think that session cookie changes are an issue (and that you'll need multiple invocations), you should also use `--keep-session-cookies`
02:37:20<g-man>right
02:38:12<g-man>then i'll use all three (load, keep, save) unless i find that it breaks something
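[editor's note: putting the thread's pieces together, a single-invocation mirror with all three cookie options might look like this; a sketch only, with placeholder URL and paths, assuming the `#HttpOnly_` prefix has already been stripped from cookies.txt]

```shell
wget --mirror --convert-links --page-requisites \
     --load-cookies cookies.txt \
     --keep-session-cookies \
     --save-cookies cookies.txt \
     --exclude-directories='/membership-login,/subscribe' \
     https://example.com/
```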
03:02:02AlsoHP_Archivist joins
03:02:45IDK_ quits [Client Quit]
03:02:45balrog quits [Client Quit]
03:02:45HP_Archivist quits [Remote host closed the connection]
03:02:45g-man quits [Remote host closed the connection]
03:02:52qwertyasdfuiopghjkl quits [Remote host closed the connection]
03:02:56IDK_ joins
03:03:01balrog (balrog) joins
03:10:25benjins2 joins
03:18:40benjins2 quits [Ping timeout: 252 seconds]
03:21:57qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
03:24:10nicolas17 quits [Ping timeout: 252 seconds]
03:26:52Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ]
03:27:30Shjosan (Shjosan) joins
03:29:07dumbgoy quits [Ping timeout: 252 seconds]
03:42:28g-man joins
03:53:12<g-man>JAA any recommendations for WARC extraction utils?
04:15:35superkuh_ quits [Remote host closed the connection]
04:15:52superkuh_ joins
04:26:24dumbgoy joins
04:27:58Maddie_ quits [Ping timeout: 252 seconds]
04:35:20Island quits [Read error: Connection reset by peer]
04:52:32sec^nd quits [Ping timeout: 245 seconds]
04:56:29sec^nd (second) joins
05:00:06qwertyasdfuiopghjkl quits [Remote host closed the connection]
05:08:29superkuh__ joins
05:09:43superkuh_ quits [Read error: Connection reset by peer]
05:24:18fuzzy8021 quits [Read error: Connection reset by peer]
05:26:20fuzzy8021 (fuzzy8021) joins
05:38:33fuzzy8021 quits [Read error: Connection reset by peer]
05:40:06fuzzy8021 (fuzzy8021) joins
05:59:24hitgrr8 joins
06:05:19g-man quits [Ping timeout: 265 seconds]
06:21:16thuban quits [Ping timeout: 265 seconds]
06:27:08BlueMaxima quits [Read error: Connection reset by peer]
06:52:19fuzzy8021 quits [Read error: Connection reset by peer]
06:52:38fuzzy8021 (fuzzy8021) joins
08:07:32thuban joins
08:18:42LeGoupil joins
08:22:02Shjosan quits [Client Quit]
09:04:37LeGoupil quits [Ping timeout: 252 seconds]
09:06:44LeGoupil joins
09:16:51Shjosan (Shjosan) joins
09:29:27Minkafighter7225 quits [Client Quit]
09:29:44Minkafighter7225 joins
09:37:39Hajdar58 joins
09:38:53Hajdar58 quits [Remote host closed the connection]
09:55:33umgr036 quits [Remote host closed the connection]
09:55:46umgr036 joins
10:17:04umgr036 quits [Remote host closed the connection]
10:20:16umgr036 joins
10:21:05umgr036 quits [Remote host closed the connection]
10:21:17umgr036 joins
10:22:10LeGoupil quits [Ping timeout: 252 seconds]
10:25:26HackMii quits [Ping timeout: 245 seconds]
10:26:41HackMii (hacktheplanet) joins
11:00:22kiskaLogBot quits [Quit: kiskaLogBot]
11:06:31kiskaLogBot joins
11:12:28user_ joins
11:14:05umgr036 quits [Remote host closed the connection]
11:14:06IDK_ quits [Client Quit]
11:14:06fuzzy8021 quits [Remote host closed the connection]
11:14:15IDK_ joins
11:15:10fuzzy8021 (fuzzy8021) joins
11:30:56BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
11:49:05LeGoupil joins
11:52:37BearFortress joins
12:07:57AlsoHP_Archivist quits [Client Quit]
12:09:40mr_sarge (sarge) joins
12:10:22LeGoupil quits [Read error: Connection reset by peer]
12:10:28LeGoupil joins
12:11:24<h2ibot>Bzc6p edited Indafotó (-316, Had to reorganize and postpone - for technical…): https://wiki.archiveteam.org/?diff=49638&oldid=49451
12:23:26qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
12:41:15sarge (sarge) joins
12:43:35mr_sarge quits [Ping timeout: 265 seconds]
12:46:35HP_Archivist (HP_Archivist) joins
12:53:10HP_Archivist quits [Client Quit]
12:56:41sarge quits [Read error: Connection reset by peer]
13:14:18Dalek quits [Quit: Dalek]
13:17:01Dalek (Dalek) joins
13:17:04LeGoupil quits [Ping timeout: 252 seconds]
13:31:16HackMii quits [Ping timeout: 245 seconds]
13:33:42Nulo quits [Read error: Connection reset by peer]
13:33:51Nulo joins
13:38:32HackMii (hacktheplanet) joins
13:49:09LeGoupil joins
13:49:45LeGoupil quits [Client Quit]
13:49:48useretail quits [Remote host closed the connection]
14:08:37benjins2 joins
14:13:10benjins2 quits [Ping timeout: 252 seconds]
14:41:46Arcorann quits [Ping timeout: 252 seconds]
14:53:11IDK_8 joins
14:53:11fuzzy802 joins
14:53:11fuzzy8021 quits [Remote host closed the connection]
14:53:11IDK_ quits [Client Quit]
14:53:11IDK_8 is now known as IDK_
14:53:24user_ quits [Remote host closed the connection]
15:39:22Island joins
16:26:03nicolas17 joins
16:47:01HP_Archivist (HP_Archivist) joins
16:47:06qwertyasdfuiopghjkl quits [Remote host closed the connection]
16:47:07HP_Archivist quits [Max SendQ exceeded]
16:47:30HP_Archivist (HP_Archivist) joins
17:05:31dumbgoy quits [Read error: Connection reset by peer]
17:06:56dumbgoy joins
17:16:25dvd quits [Remote host closed the connection]
17:18:09dvd joins
17:53:57benjins2 joins
17:56:27BearFortress quits [Read error: Connection reset by peer]
17:56:38BearFortress joins
18:00:29<h2ibot>JAABot edited CurrentWarriorProject (+0): https://wiki.archiveteam.org/?diff=49639&oldid=49629
18:02:42benjins2 quits [Ping timeout: 252 seconds]
18:13:38<@JAA>thuban, g-man: The short version is that I don't know of any WARC processing tooling that doesn't suck.
18:19:58dvd quits [Remote host closed the connection]
18:21:39dvd joins
18:38:04dvd quits [Remote host closed the connection]
18:39:30<thuban>i was afraid you were going to say that :)
18:39:35dvd joins
18:40:05<@JAA>I've been working on improving the situation, but don't hold your breath, there's no ETA yet.
18:46:10AlsoHP_Archivist joins
18:48:59HP_Archivist quits [Ping timeout: 243 seconds]
18:50:38<h2ibot>JustAnotherArchivist edited YouTube/Technical details (+153, /* Playlists */ Add UULV): https://wiki.archiveteam.org/?diff=49640&oldid=49228
18:53:38<h2ibot>Tech234a edited DPReview (+51, Project has begun): https://wiki.archiveteam.org/?diff=49641&oldid=49600
18:54:38<h2ibot>Tech234a edited Current Projects (+0, Move DPReview to current): https://wiki.archiveteam.org/?diff=49642&oldid=49597
18:54:39<h2ibot>JustAnotherArchivist moved Zippyshare.com to Zippyshare (Disambiguation unnecessary, there has only been…): https://wiki.archiveteam.org/?title=Zippyshare
18:54:40<h2ibot>JustAnotherArchivist deleted Zippyshare (Deleted to make way for move from…)
19:00:40<h2ibot>Switchnode edited Current Projects (-173, remove zippyshare): https://wiki.archiveteam.org/?diff=49645&oldid=49642
19:04:20qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
19:05:51Jake quits [Quit: Leaving for a bit!]
19:06:11Jake (Jake) joins
19:07:32benjins2 joins
19:13:28benjins2 quits [Ping timeout: 252 seconds]
20:00:52<h2ibot>JAABot edited CurrentWarriorProject (+0): https://wiki.archiveteam.org/?diff=49646&oldid=49639
20:13:55<h2ibot>Usernam edited List of websites excluded from the Wayback Machine/Partial exclusions (+35): https://wiki.archiveteam.org/?diff=49647&oldid=49593
20:14:55<h2ibot>DoomTay uploaded File:Screenshot 2023-04-02 at 13-04-47 Home.png: https://wiki.archiveteam.org/?title=File%3AScreenshot%202023-04-02%20at%2013-04-47%20Home.png
20:20:56<h2ibot>JustAnotherArchivist edited List of websites excluded from the Wayback Machine/Partial exclusions (-35, Undo revision 49647 by…): https://wiki.archiveteam.org/?diff=49650&oldid=49647
20:21:56<h2ibot>JustAnotherArchivist edited List of websites excluded from the Wayback Machine (+35, Import from [[/Partial exclusions]]): https://wiki.archiveteam.org/?diff=49651&oldid=49592
20:28:15g-man joins
20:28:19g-man quits [Remote host closed the connection]
20:28:35g-man joins
20:28:52<g-man>that's kind of what i figured. i tried like every tool i could find
20:29:10<g-man>and was unable to get the job done at the end of the day
20:30:57<h2ibot>Sanqui edited Deathwatch (+175, add vulpine.club): https://wiki.archiveteam.org/?diff=49652&oldid=49636
20:31:01<g-man>if getting files out is a pipe dream, what should i be using to host them for replay? i need something suitable for other tools to connect to for scraping
20:31:36<g-man>what is the tool most likely to "just work"? i need to prioritize "just working" and "flexibility" over "easy"
20:32:28<@JAA>I've had decent experiences with pywb for playback.
20:34:03<g-man>ok
20:35:17<g-man>so it is evident that something about warc is not conducive to getting files out. like maybe the spec is not firm
20:35:46<g-man>does this extend to problems for playback too? because that would just be too much for me to put up with lol
20:36:08<g-man>might just flip a table
20:36:46<@JAA>The spec is mostly fine, although it does have ambiguities and whatnot.
20:36:53<g-man>or does playback "just work" (at least without crashing or something)
20:37:01<@JAA>It's just that nobody has written solid software for it.
20:37:37<@JAA>WARCs mostly just get thrown into a Wayback Machine-like thing, either IA's or OpenWayback or pywb with no other processing.
20:37:37<g-man>i see
20:37:48<@JAA>Also, WARC has no concept of 'files'.
20:38:15<@JAA>It contains records, and the relevant ones here are request and response records, containing a raw HTTP request and response, respectively.
20:38:37<g-man>so how do external resources outside of the warc get handled vs internal? are URIs rewritten to be 'relative' in some way? and is that in the warc, or does that happen at playback?
20:38:38<g-man>just curious
20:38:49<@JAA>So extraction requires the same step an HTTP client would usually do, constructing some kind of filename from the URL, HTTP headers, contents, etc.
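[editor's note: a minimal illustration of that filename-construction step, using the third-party warcio library (an assumption; the tools discussed above work differently internally); the URL-to-path mapping deliberately ignores query strings, duplicate captures, and Content-Type-based naming, which is exactly where real extractors get messy]

```python
import os
from urllib.parse import urlsplit

def url_to_path(url):
    # Naive URL -> relative path mapping; ignores query strings,
    # repeated retrievals of the same URL, and extension fixes
    # based on Content-Type.
    parts = urlsplit(url)
    path = parts.path.lstrip("/") or "index.html"
    if path.endswith("/"):
        path += "index.html"
    return f"{parts.netloc}/{path}"

def extract_warc(warc_path, out_dir="extracted"):
    # Requires the third-party warcio library (pip install warcio).
    from warcio.archiveiterator import ArchiveIterator
    with open(warc_path, "rb") as f:
        for record in ArchiveIterator(f):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            dest = os.path.join(out_dir, *url_to_path(url).split("/"))
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            with open(dest, "wb") as out:
                # content_stream() yields the decoded HTTP payload.
                out.write(record.content_stream().read())
```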
20:39:16<@JAA>That happens at playback, yes.
20:40:05<@JAA>Not sure how it's implemented in pywb exactly, but the IA WBM rewrites all hrefs, srcs, etc. with an absolute URL prefixed with the WBM stuff.
20:40:09<g-man>i see. all of this was probably a waste of time then. i should have just written my tools to crawl a playback
20:40:22<g-man>it's just that i have some file dumps too
20:40:30<g-man>and that's kind of simpler to work with
20:40:38<g-man>since you can list dirs
20:41:09<@JAA>Yeah, it certainly is for usual consumption. It doesn't make for a great archival format though since you need to keep the metadata separately somehow and make sure it doesn't get lost on data transfers etc.
20:41:34<@JAA>Plus it can't really handle repeated retrievals of the same URL and various special cases.
20:48:00<g-man>ignoring the possibility that some warc tools like grab-site or browsertrix-crawler may be better for archiving, is httrack probably the best tool for making relative 'just files' siterips?
20:50:28<@JAA>Not a clue, I haven't used HTTrack in 15-ish years. You could also use wget or wpull to produce both WARCs and file trees.
20:50:57<g-man>yeah, have had some success with wget. never tried wpull or httrack
20:51:51<@JAA>Not sure if it's been mentioned already, but upstream wget's WARC output is buggy and doesn't work with a lot of WARC software.
20:52:13<@JAA>(Attempts to get that fixed have been ... not very successful.)
20:52:26<g-man>yes some of the warcs i'm dealing with are old stuff from wget
20:52:49<g-man>and some of that is stuff that.. i actually made myself years ago.. and i didn't save the normal files because i thought the warc was equivalent
20:53:14<g-man>having had no experience with warc back then
20:53:20<@JAA>Well, it's not equivalent, it actually contains *more* information than plain files.
20:53:36<@JAA>But yeah, not as convenient for interactive usage.
20:53:48<g-man>yes i meant equivalent to the extent that i didn't need to save the plain files
20:53:58<g-man>when backing things up
20:54:17<g-man>so i'm kicking myself
20:54:25myself kicks g-man
20:54:31<g-man>ty
20:55:12<@JAA>The plain file tree can in theory be produced from the WARCs, at least. So yeah, you didn't lose anything, it's just a PITA.
20:55:19<@Sanqui>don't worry, warcs make me sad too
20:56:20<@Sanqui>I'd be in favor of switching to something like mitmproxy, just with a better format than mitmproxy actually has
20:56:56<@Sanqui>warcs try too hard to be clever and transmogrify the http metadata for no good reason
20:57:02<@Sanqui>but then again, they're a particularly old spec
20:57:20<@Sanqui>dating back to alexa, I guess
20:57:57<g-man>probably anything switched to would need a guaranteed way to convert to and from warc
20:58:38<@JAA>Not sure how mitmdump or similar would improve the situation. As long as you have a serialised format (which is pretty much required), it's going to be a pain to use in casual settings where people 'just want files'.
20:58:49<@Sanqui>converting mitmproxy to warc is easy as mitmproxy contains a strict superset of information, by design
20:59:05<@JAA>And WARC isn't that old. ARC was the predecessor and is a format from hell.
20:59:11<g-man>what can you use mitmproxy for today? just glancing at it. can you just browse through it, and it archives everything as you browse? is that the idea?
20:59:21<@JAA>WARC/1.0 is from 2008 I think? And 1.1 from 2017.
20:59:46<@Sanqui>it would improve the tooling situation, as it feels like warcs have just too many rough edges, but yeah, it still wouldn't make getting "just the files" easy
21:00:24<@Sanqui>g-man: mitmproxy is not designed for archival, but it can serve that purpose, with different caveats than warcs
21:01:03<h2ibot>JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=49653&oldid=49651
21:01:16<@Sanqui>but imho it should form the basis of a (wishful) warc contender in the web archival space
21:03:11fuzzy8021 (fuzzy8021) joins
21:03:23<@Sanqui>it's a bigger discussion, and I probably need to make some longer writeup if I want to see progress on this front, but it's probably futile given IA's role so I'll do more productive things
21:04:01nicolas17 quits [Remote host closed the connection]
21:04:01IDK_ quits [Client Quit]
21:04:01fuzzy802 quits [Remote host closed the connection]
21:04:01<g-man>"just the files, ma'am"
21:04:01<g-man>well, if you ever make a warc extractor that just works, you'll be a hero basically
21:04:01g-man quits [Client Quit]
21:04:01qwertyasdfuiopghjkl quits [Client Quit]
21:04:16<@JAA>My issues with WARC are limited to its slow development (still no HTTP/2 or WebSocket) and the minor bugs that can be fixed relatively easily. Other than that, I think it's 'fine'.
21:04:22g-man joins
21:04:22nicolas17 joins
21:04:31<@JAA>The rest is just a matter of tooling.
21:04:40<@JAA>And as mentioned, something for that is in progress.
21:05:01<@JAA>It won't directly allow 'extract just the files', but it'll make working with WARCs *much* easier, and an extractor could easily be written on top of it.
21:09:21dan_a quits [Ping timeout: 265 seconds]
21:09:21programmerq quits [Ping timeout: 265 seconds]
21:09:21g-man quits [Ping timeout: 265 seconds]
21:09:25programmerq (programmerq) joins
21:09:29g-man joins
21:09:30qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
21:09:34dan_a (dan_a) joins
21:09:41<g-man>does pywb "just work" for those mega site rips that can be found on archive.org that consist of dozens of warcgz files (~2GiB split seems common), or must they be combined first somehow?
21:11:58<@JAA>It should work fine. It can definitely handle multiple WARCs. Note that it copies the WARCs into its own storage thingy on adding (unless that changed within the last couple years).
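[editor's note: for reference, the pywb workflow for a directory of split .warc.gz files looks roughly like this, per pywb's documented wb-manager collection workflow; the collection name and paths are placeholders]

```shell
pip install pywb
wb-manager init mirror                   # create a collection
wb-manager add mirror warcs/*.warc.gz    # index and copy the WARCs in
wayback                                  # browse at http://localhost:8080/mirror/
```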
21:12:48<g-man>ty
21:18:48hitgrr8 quits [Client Quit]
21:23:52michaelblob quits [Read error: Connection reset by peer]
21:27:56g-man quits [Remote host closed the connection]
21:31:37michaelblob (michaelblob) joins
22:40:59umgr036 joins
22:41:45umgr036 quits [Remote host closed the connection]
22:41:58umgr036 joins
23:37:46benjins quits [Read error: Connection reset by peer]
23:38:09benjins joins
23:51:24benjins2 joins
23:56:10benjins2 quits [Ping timeout: 252 seconds]