00:18:20tech234a (tech234a) joins
00:23:18michaelblob (michaelblob) joins
00:52:40Arcorann (Arcorann) joins
01:48:17sarge quits [Client Quit]
01:57:35g-man joins
02:00:43<g-man>hi, i'm trying to figure out how to use cookies with wget to rip a website that needs an account. i have my cookies exported from firefox.
02:00:57<g-man>this works on curl (even with multiple invocations in a row), but curl isn't a crawler: `--cookie cookies.txt --cookie-jar cookies.txt`
02:01:11<g-man>neither of these is working in wget: `--load-cookies cookies.txt --keep-session-cookies` , `--load-cookies cookies.txt --keep-session-cookies --save-cookies cookies.txt`
02:01:21<g-man>thoughts?
02:06:31<thuban>g-man: do the problematic cookies start with #HttpOnly_ in the cookies.txt file?
02:11:47<g-man>someone else actually just suggested that to me moments ago. i wonder if you're the same person. lol
02:11:57<g-man>i'm about to look into that. gotta grab a fresh cookie file
02:13:34<g-man>yes, they do. i will try stripping that. if that fixes it, that's pretty cool
02:15:31<g-man>follow up question though
02:16:39<g-man>for a single invocation of wget (say for crawling/mirroring mode), should i be using `--save-cookies` ? and if i do, should it get the exact same file argument as `--load-cookies` ? it works that way conceptually with curl, but i couldn't find that clearly explained anywhere
02:21:04<g-man>apparently this is a wordpress-based site. hm
02:21:37<g-man>so i want to go from `#HttpOnly_example.com FALSE /wp-admin ...` to `example.com FALSE /wp-admin ...`? removing the underscore?
02:22:02<thuban>right
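[editor's note: the prefix-stripping step discussed above can be sketched in Python; this is a minimal example, and the filenames in the usage comment are placeholders]

```python
PREFIX = "#HttpOnly_"

def strip_httponly(lines):
    """Drop the nonstandard #HttpOnly_ marker from Netscape cookies.txt
    lines so wget stops treating them as comments."""
    return [l[len(PREFIX):] if l.startswith(PREFIX) else l for l in lines]

# Usage (filenames are placeholders):
#   fixed = strip_httponly(open("cookies.txt").read().splitlines())
#   open("cookies-wget.txt", "w").write("\n".join(fixed) + "\n")
```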
02:22:54<g-man>trying now. thanks
02:23:51nicolas17 joins
02:23:54<g-man>also i have `--exclude-directories='/membership-login,/subscribe'` since those paths contain logout links. hopefully i'm using that correctly
02:26:11<g-man>holy cow, i think that worked. what a simple solution to days of frustration and investigating other tools (ones that can save plain files with rewritten relative urls instead of WARC)
02:26:20<g-man>thanks bro
02:27:25<g-man>also, for another project, i have WARC files, but i'm having a hell of a time extracting files from them. almost every tool i've tried chokes on some of them, and they're almost all insanely slow.
02:27:50<thuban>np. it's a nonstandard extension to the format; some tools (including the firefox cookies.txt extension and curl) support it, but some (including wget) don't (instead treating the whole line as a comment)
02:27:57<g-man>have you ever used jwat-tools? it seems the best written of the bunch, but i can't figure out how to get real filenames out of the damn thing
02:28:14<g-man>it outputs like `extracted.001` `extracted.002` ...
02:28:25<g-man>using the `extract` subcommand
02:32:08<thuban>as for saving cookies, it's up to you. you don't need `--save-cookies` to have cookies persist during a crawling session; that's the default (you can turn it off with `--no-cookies`), saving cookies just writes them to a file. you can set it to whatever, and i would guess it's safe to write over the same file you loaded from, but i've never actually used this feature
02:33:40<thuban>i don't do much warc processing and definitely don't know that tool, but JAA might be able to recommend something
02:33:42<g-man>i found that in curl, a cookies file worked for a single invocation, but it would fail on subsequent invocations. the addition of `--cookie-jar` (which writes cookies to a file) fixed it for me. probably session cookie changes with browsing
02:34:02<g-man>so i'm guessing that the same is true for wget if you do multiple invocations to keep the cookie file valid
02:34:22<g-man>but you probably don't strictly need multiple invocations since wget can crawl
02:36:40<thuban>if you think that session cookie changes are an issue (and that you'll need multiple invocations), you should also use `--keep-session-cookies`
02:37:20<g-man>right
02:38:12<g-man>then i'll use all three (load, keep, save) unless i find that it breaks something
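[editor's note: putting the thread's pieces together, a single-invocation mirror with all three cookie options might look like this; a sketch only, with placeholder URL and paths, assuming the `#HttpOnly_` prefix has already been stripped from cookies.txt]

```shell
wget --mirror --convert-links --page-requisites \
     --load-cookies cookies.txt \
     --keep-session-cookies \
     --save-cookies cookies.txt \
     --exclude-directories='/membership-login,/subscribe' \
     https://example.com/
```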
03:02:02AlsoHP_Archivist joins
03:02:45IDK_ quits [Client Quit]
03:02:45balrog quits [Client Quit]
03:02:45HP_Archivist quits [Remote host closed the connection]
03:02:45g-man quits [Remote host closed the connection]
03:02:52qwertyasdfuiopghjkl quits [Remote host closed the connection]
03:02:56IDK_ joins
03:03:01balrog (balrog) joins
03:10:25benjins2 joins
03:18:40benjins2 quits [Ping timeout: 252 seconds]
03:21:57qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
03:24:10nicolas17 quits [Ping timeout: 252 seconds]
03:26:52Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ]
03:27:30Shjosan (Shjosan) joins
03:29:07dumbgoy quits [Ping timeout: 252 seconds]
03:42:28g-man joins
03:53:12<g-man>JAA any recommendations for WARC extraction utils?
04:15:35superkuh_ quits [Remote host closed the connection]
04:15:52superkuh_ joins
04:26:24dumbgoy joins
04:27:58Maddie_ quits [Ping timeout: 252 seconds]
04:35:20Island quits [Read error: Connection reset by peer]
04:52:32sec^nd quits [Ping timeout: 245 seconds]
04:56:29sec^nd (second) joins
05:00:06qwertyasdfuiopghjkl quits [Remote host closed the connection]
05:08:29superkuh__ joins
05:09:43superkuh_ quits [Read error: Connection reset by peer]
05:24:18fuzzy8021 quits [Read error: Connection reset by peer]
05:26:20fuzzy8021 (fuzzy8021) joins
05:38:33fuzzy8021 quits [Read error: Connection reset by peer]
05:40:06fuzzy8021 (fuzzy8021) joins
05:59:24hitgrr8 joins
06:05:19g-man quits [Ping timeout: 265 seconds]
06:21:16thuban quits [Ping timeout: 265 seconds]
06:27:08BlueMaxima quits [Read error: Connection reset by peer]
06:52:19fuzzy8021 quits [Read error: Connection reset by peer]
06:52:38fuzzy8021 (fuzzy8021) joins
08:07:32thuban joins
08:18:42LeGoupil joins
08:22:02Shjosan quits [Client Quit]
09:04:37LeGoupil quits [Ping timeout: 252 seconds]
09:06:44LeGoupil joins
09:16:51Shjosan (Shjosan) joins
09:29:27Minkafighter7225 quits [Client Quit]
09:29:44Minkafighter7225 joins
09:37:39Hajdar58 joins
09:38:53Hajdar58 quits [Remote host closed the connection]
09:55:33umgr036 quits [Remote host closed the connection]
09:55:46umgr036 joins
10:17:04umgr036 quits [Remote host closed the connection]
10:20:16umgr036 joins
10:21:05umgr036 quits [Remote host closed the connection]
10:21:17umgr036 joins
10:22:10LeGoupil quits [Ping timeout: 252 seconds]
10:25:26HackMii quits [Ping timeout: 245 seconds]
10:26:41HackMii (hacktheplanet) joins
11:00:22kiskaLogBot quits [Quit: kiskaLogBot]
11:06:31kiskaLogBot joins
11:12:28user_ joins
11:14:05umgr036 quits [Remote host closed the connection]
11:14:06IDK_ quits [Client Quit]
11:14:06fuzzy8021 quits [Remote host closed the connection]
11:14:15IDK_ joins
11:15:10fuzzy8021 (fuzzy8021) joins
11:30:56BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
11:49:05LeGoupil joins
11:52:37BearFortress joins
12:07:57AlsoHP_Archivist quits [Client Quit]
12:09:40mr_sarge (sarge) joins
12:10:22LeGoupil quits [Read error: Connection reset by peer]
12:10:28LeGoupil joins
12:11:24<h2ibot>Bzc6p edited Indafotó (-316, Had to reorganize and postpone - for technical…): https://wiki.archiveteam.org/?diff=49638&oldid=49451
12:23:26qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
12:41:15sarge (sarge) joins
12:43:35mr_sarge quits [Ping timeout: 265 seconds]
12:46:35HP_Archivist (HP_Archivist) joins
12:53:10HP_Archivist quits [Client Quit]
12:56:41sarge quits [Read error: Connection reset by peer]
13:14:18Dalek quits [Quit: Dalek]
13:17:01Dalek (Dalek) joins
13:17:04LeGoupil quits [Ping timeout: 252 seconds]
13:31:16HackMii quits [Ping timeout: 245 seconds]
13:33:42Nulo quits [Read error: Connection reset by peer]
13:33:51Nulo joins
13:38:32HackMii (hacktheplanet) joins
13:49:09LeGoupil joins
13:49:45LeGoupil quits [Client Quit]
13:49:48useretail quits [Remote host closed the connection]
14:08:37benjins2 joins
14:13:10benjins2 quits [Ping timeout: 252 seconds]
14:41:46Arcorann quits [Ping timeout: 252 seconds]
14:53:11IDK_8 joins
14:53:11fuzzy802 joins
14:53:11fuzzy8021 quits [Remote host closed the connection]
14:53:11IDK_ quits [Client Quit]
14:53:11IDK_8 is now known as IDK_
14:53:24user_ quits [Remote host closed the connection]
15:39:22Island joins
16:26:03nicolas17 joins
16:47:01HP_Archivist (HP_Archivist) joins
16:47:06qwertyasdfuiopghjkl quits [Remote host closed the connection]
16:47:07HP_Archivist quits [Max SendQ exceeded]
16:47:30HP_Archivist (HP_Archivist) joins
17:05:31dumbgoy quits [Read error: Connection reset by peer]
17:06:56dumbgoy joins
17:16:25dvd quits [Remote host closed the connection]
17:18:09dvd joins
17:53:57benjins2 joins
17:56:27BearFortress quits [Read error: Connection reset by peer]
17:56:38BearFortress joins
18:00:29<h2ibot>JAABot edited CurrentWarriorProject (+0): https://wiki.archiveteam.org/?diff=49639&oldid=49629
18:02:42benjins2 quits [Ping timeout: 252 seconds]
18:13:38<@JAA>thuban, g-man: The short version is that I don't know of any WARC processing tooling that doesn't suck.
18:19:58dvd quits [Remote host closed the connection]
18:21:39dvd joins
18:38:04dvd quits [Remote host closed the connection]
18:39:30<thuban>i was afraid you were going to say that :)
18:39:35dvd joins
18:40:05<@JAA>I've been working on improving the situation, but don't hold your breath, there's no ETA yet.
18:46:10AlsoHP_Archivist joins
18:48:59HP_Archivist quits [Ping timeout: 243 seconds]
18:50:38<h2ibot>JustAnotherArchivist edited YouTube/Technical details (+153, /* Playlists */ Add UULV): https://wiki.archiveteam.org/?diff=49640&oldid=49228
18:53:38<h2ibot>Tech234a edited DPReview (+51, Project has begun): https://wiki.archiveteam.org/?diff=49641&oldid=49600
18:54:38<h2ibot>Tech234a edited Current Projects (+0, Move DPReview to current): https://wiki.archiveteam.org/?diff=49642&oldid=49597
18:54:39<h2ibot>JustAnotherArchivist moved Zippyshare.com to Zippyshare (Disambiguation unnecessary, there has only been…): https://wiki.archiveteam.org/?title=Zippyshare
18:54:40<h2ibot>JustAnotherArchivist deleted Zippyshare (Deleted to make way for move from…)
19:00:40<h2ibot>Switchnode edited Current Projects (-173, remove zippyshare): https://wiki.archiveteam.org/?diff=49645&oldid=49642
19:04:20qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
19:05:51Jake quits [Quit: Leaving for a bit!]
19:06:11Jake (Jake) joins
19:07:32benjins2 joins
19:13:28benjins2 quits [Ping timeout: 252 seconds]
20:00:52<h2ibot>JAABot edited CurrentWarriorProject (+0): https://wiki.archiveteam.org/?diff=49646&oldid=49639
20:13:55<h2ibot>Usernam edited List of websites excluded from the Wayback Machine/Partial exclusions (+35): https://wiki.archiveteam.org/?diff=49647&oldid=49593
20:14:55<h2ibot>DoomTay uploaded File:Screenshot 2023-04-02 at 13-04-47 Home.png: https://wiki.archiveteam.org/?title=File%3AScreenshot%202023-04-02%20at%2013-04-47%20Home.png
20:20:56<h2ibot>JustAnotherArchivist edited List of websites excluded from the Wayback Machine/Partial exclusions (-35, Undo revision 49647 by…): https://wiki.archiveteam.org/?diff=49650&oldid=49647
20:21:56<h2ibot>JustAnotherArchivist edited List of websites excluded from the Wayback Machine (+35, Import from [[/Partial exclusions]]): https://wiki.archiveteam.org/?diff=49651&oldid=49592
20:28:15g-man joins
20:28:19g-man quits [Remote host closed the connection]
20:28:35g-man joins
20:28:52<g-man>that's kind of what i figured. i tried like every tool i could find
20:29:10<g-man>and was unable to get the job done at the end of the day
20:30:57<h2ibot>Sanqui edited Deathwatch (+175, add vulpine.club): https://wiki.archiveteam.org/?diff=49652&oldid=49636
20:31:01<g-man>if getting files out is a pipe dream, what should i be using to host them for replay? i need something suitable for other tools to connect to for scraping
20:31:36<g-man>what is the tool most likely to "just work"? i need to prioritize "just working" and "flexibility" over "easy"
20:32:28<@JAA>I've had decent experiences with pywb for playback.
20:34:03<g-man>ok
20:35:17<g-man>so it is evident that something about warc is not conducive to getting files out. like maybe the spec is not firm
20:35:46<g-man>does this extend to problems for playback too? because that would just be too much for me to put up with lol
20:36:08<g-man>might just flip a table
20:36:46<@JAA>The spec is mostly fine, although it does have ambiguities and whatnot.
20:36:53<g-man>or does playback "just work" (at least without crashing or something)
20:37:01<@JAA>It's just that nobody has written solid software for it.
20:37:37<@JAA>WARCs mostly just get thrown into a Wayback Machine-like thing, either IA's or OpenWayback or pywb with no other processing.
20:37:37<g-man>i see
20:37:48<@JAA>Also, WARC has no concept of 'files'.
20:38:15<@JAA>It contains records, and the relevant ones here are request and response records, containing a raw HTTP request and response, respectively.
20:38:37<g-man>so how do external resources outside of the warc get handled vs internal? are URIs rewritten to be 'relative' in some way? and is that in the warc, or does that happen at playback?
20:38:38<g-man>just curious
20:38:49<@JAA>So extraction requires the same step an HTTP client would usually do, constructing some kind of filename from the URL, HTTP headers, contents, etc.
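[editor's note: a minimal illustration of that filename-construction step, using the third-party warcio library (an assumption; the tools discussed above work differently internally); the URL-to-path mapping deliberately ignores query strings, duplicate captures, and Content-Type-based naming, which is exactly where real extractors get messy]

```python
import os
from urllib.parse import urlsplit

def url_to_path(url):
    # Naive URL -> relative path mapping; ignores query strings,
    # repeated retrievals of the same URL, and extension fixes
    # based on Content-Type.
    parts = urlsplit(url)
    path = parts.path.lstrip("/") or "index.html"
    if path.endswith("/"):
        path += "index.html"
    return f"{parts.netloc}/{path}"

def extract_warc(warc_path, out_dir="extracted"):
    # Requires the third-party warcio library (pip install warcio).
    from warcio.archiveiterator import ArchiveIterator
    with open(warc_path, "rb") as f:
        for record in ArchiveIterator(f):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            dest = os.path.join(out_dir, *url_to_path(url).split("/"))
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            with open(dest, "wb") as out:
                # content_stream() yields the decoded HTTP payload.
                out.write(record.content_stream().read())
```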
20:39:16<@JAA>That happens at playback, yes.
20:40:05<@JAA>Not sure how it's implemented in pywb exactly, but the IA WBM rewrites all hrefs, srcs, etc. with an absolute URL prefixed with the WBM stuff.
20:40:09<g-man>i see. all of this was probably a waste of time then. i should have just written my tools to crawl a playback
20:40:22<g-man>it's just that i have some file dumps too
20:40:30<g-man>and that's kind of simpler to work with
20:40:38<g-man>since you can list dirs
20:41:09<@JAA>Yeah, it certainly is for usual consumption. It doesn't make for a great archival format though since you need to keep the metadata separately somehow and make sure it doesn't get lost on data transfers etc.
20:41:34<@JAA>Plus it can't really handle repeated retrievals of the same URL and various special cases.
20:48:00<g-man>ignoring the possibility that some warc tools like grab-site or browsertrix-crawler may be better for archiving, is httrack probably the best tool for making relative 'just files' siterips?
20:50:28<@JAA>Not a clue, I haven't used HTTrack in 15-ish years. You could also use wget or wpull to produce both WARCs and file trees.
20:50:57<g-man>yeah, have had some success with wget. never tried wpull or httrack
20:51:51<@JAA>Not sure if it's been mentioned already, but upstream wget's WARC output is buggy and doesn't work with a lot of WARC software.
20:52:13<@JAA>(Attempts to get that fixed have been ... not very successful.)
20:52:26<g-man>yes some of the warcs i'm dealing with are old stuff from wget
20:52:49<g-man>and some of that is stuff that.. i actually made myself years ago.. and i didn't save the normal files because i thought the warc was equivalent
20:53:14<g-man>having had no experience with warc back then
20:53:20<@JAA>Well, it's not equivalent, it actually contains *more* information than plain files.
20:53:36<@JAA>But yeah, not as convenient for interactive usage.
20:53:48<g-man>yes i meant equivalent to the extent that i didn't need to save the plain files
20:53:58<g-man>when backing things up
20:54:17<g-man>so i'm kicking myself
20:54:25myself kicks g-man
20:54:31<g-man>ty
20:55:12<@JAA>The plain file tree can in theory be produced from the WARCs, at least. So yeah, you didn't lose anything, it's just a PITA.
20:55:19<@Sanqui>don't worry, warcs make me sad too
20:56:20<@Sanqui>I'd be in favor of switching to something like mitmproxy, just with a better format than mitmproxy actually has
20:56:56<@Sanqui>warcs try too hard to be clever and transmogrify the http metadata for no good reason
20:57:02<@Sanqui>but then again, they're a particularly old spec
20:57:20<@Sanqui>dating back to alexa, I guess
20:57:57<g-man>probably anything switched to would need a guaranteed way to convert to and from warc
20:58:38<@JAA>Not sure how mitmdump or similar would improve the situation. As long as you have a serialised format (which is pretty much required), it's going to be a pain to use in casual settings where people 'just want files'.
20:58:49<@Sanqui>converting mitmproxy to warc is easy as mitmproxy contains a strict superset of information, by design
20:59:05<@JAA>And WARC isn't that old. ARC was the predecessor and is a format from hell.
20:59:11<g-man>what can you use mitmproxy for today? just glancing at it. can you just browse through it, and it archives everything as you browse? is that the idea?
20:59:21<@JAA>WARC/1.0 is from 2008 I think? And 1.1 from 2017.
20:59:46<@Sanqui>it would improve the tooling situation, as it feels like warcs have just too many rough edges, but yeah, it still wouldn't make getting "just the files" easy
21:00:24<@Sanqui>g-man: mitmproxy is not designed for archival, but it can serve that purpose, with different caveats than warcs
21:01:03<h2ibot>JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=49653&oldid=49651
21:01:16<@Sanqui>but imho it should form the basis of a (wishful) warc contender in the web archival space
21:03:11fuzzy8021 (fuzzy8021) joins
21:03:23<@Sanqui>it's a bigger discussion, and I probably need to make some longer writeup if I want to see progress on this front, but it's probably futile given IA's role so I'll do more productive things
21:04:01nicolas17 quits [Remote host closed the connection]
21:04:01IDK_ quits [Client Quit]
21:04:01fuzzy802 quits [Remote host closed the connection]
21:04:01<g-man>"just the files, ma'am"
21:04:01<g-man>well, if you ever make a warc extractor that just works, you'll be a hero basically
21:04:01g-man quits [Client Quit]
21:04:01qwertyasdfuiopghjkl quits [Client Quit]
21:04:16<@JAA>My issues with WARC are limited to its slow development (still no HTTP/2 or WebSocket) and the minor bugs that can be fixed relatively easily. Other than that, I think it's 'fine'.
21:04:22g-man joins
21:04:22nicolas17 joins
21:04:31<@JAA>The rest is just a matter of tooling.
21:04:40<@JAA>And as mentioned, something for that is in progress.
21:05:01<@JAA>It won't directly allow 'extract just the files', but it'll make working with WARCs *much* easier, and an extractor could easily be written on top of it.
21:09:21dan_a quits [Ping timeout: 265 seconds]
21:09:21programmerq quits [Ping timeout: 265 seconds]
21:09:21g-man quits [Ping timeout: 265 seconds]
21:09:25programmerq (programmerq) joins
21:09:29g-man joins
21:09:30qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
21:09:34dan_a (dan_a) joins
21:09:41<g-man>does pywb "just work" for those mega site rips that can be found on archive.org that consist of dozens of warcgz files (~2GiB split seems common), or must they be combined first somehow?
21:11:58<@JAA>It should work fine. It can definitely handle multiple WARCs. Note that it copies the WARCs into its own storage thingy on adding (unless that changed within the last couple years).
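[editor's note: for reference, the pywb workflow for a directory of split .warc.gz files looks roughly like this, per pywb's documented wb-manager collection workflow; the collection name and paths are placeholders]

```shell
pip install pywb
wb-manager init mirror                   # create a collection
wb-manager add mirror warcs/*.warc.gz    # index and copy the WARCs in
wayback                                  # browse at http://localhost:8080/mirror/
```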
21:12:48<g-man>ty
21:18:48hitgrr8 quits [Client Quit]
21:23:52michaelblob quits [Read error: Connection reset by peer]
21:27:56g-man quits [Remote host closed the connection]
21:31:37michaelblob (michaelblob) joins
22:40:59umgr036 joins
22:41:45umgr036 quits [Remote host closed the connection]
22:41:58umgr036 joins
23:37:46benjins quits [Read error: Connection reset by peer]
23:38:09benjins joins
23:51:24benjins2 joins
23:56:10benjins2 quits [Ping timeout: 252 seconds]