00:00:59dm4v quits [Read error: Connection reset by peer]
00:02:05dm4v joins
00:02:07dm4v quits [Changing host]
00:02:07dm4v (dm4v) joins
00:36:15AlsoHP_Archivist quits [Client Quit]
00:36:32HP_Archivist (HP_Archivist) joins
00:53:53qwertyasdfuiopghjkl87 joins
00:55:15<h2ibot>JustAnotherArchivist edited Template:Infobox project sandbox (+449, Add irc_network parameter (cf. edit 41606) and…): https://wiki.archiveteam.org/?diff=47287&oldid=31242
00:55:46qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
01:02:47dm4v_ joins
01:03:49dm4v quits [Ping timeout: 252 seconds]
01:03:49dm4v_ is now known as dm4v
01:03:49dm4v quits [Changing host]
01:03:49dm4v (dm4v) joins
01:17:22mgrytbak joins
01:20:21sonick quits [Client Quit]
01:21:25qwertyasdfuiopghjkl87 quits [Remote host closed the connection]
01:23:45hexa- quits [Quit: WeeChat 3.1]
01:25:08hexa- (hexa-) joins
01:48:30<systwi>I'm assuming this is the correct place to ask.
01:48:34<systwi>I'm trying to save a webpage with `grab-site' under Debian Bullseye, and I need to import a cookies.txt so the page is grabbed as if I were logged in.
01:48:39<systwi>I try using the following command: grab-site --1 --wpull-args='--load-cookies=/data/cookies.txt' 'https://example.com/'
01:49:06<systwi>But for some reason the page grabbed does not show me signed in.
01:49:43<systwi>The cookies.txt was exported from Librewolf using https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/
01:50:15<@JAA>Is the relevant cookie in the first line of cookies.txt?
01:51:06<systwi>The site in question writes multiple cookies. The first line in particular is "# Netscape HTTP Cookie File"
01:51:47<@JAA>Hmm, ok, so not that wpull bug then.
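(A minimal sketch of the file format being discussed: wpull has its own cookie loader, so this only illustrates what a valid Netscape-format cookies.txt looks like, using the stdlib's MozillaCookieJar — which, like the first-line check asked about above, is strict about the magic header on line 1. The file contents here are made up for illustration.)

```python
import http.cookiejar
import os
import tempfile

# A minimal Netscape-format cookies.txt. The "# Netscape HTTP Cookie File"
# magic line must be the first line, or MozillaCookieJar refuses to load it.
cookies_txt = (
    "# Netscape HTTP Cookie File\n"
    # Fields: domain, domain_specified, path, secure, expires, name, value
    "example.com\tFALSE\t/\tFALSE\t2147483647\tsession\tabc123\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(cookies_txt)
    path = f.name

jar = http.cookiejar.MozillaCookieJar()
jar.load(path, ignore_discard=True, ignore_expires=True)
print([(c.name, c.value, c.domain) for c in jar])  # [('session', 'abc123', 'example.com')]
os.remove(path)
```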
01:53:06<systwi>Passing through the same user agent and using the same IP as how I logged into the site through the browser didn't make a difference.
01:53:49<@JAA>Are the cookies in the request record in the WARC?
01:54:13<systwi>I see a cookies.txt inside the output directory, but it's significantly smaller than the one I specified.
01:54:19<systwi>If that's what you meant.
01:54:26<wickedplayer494>---> #archiveteam-bs, now.
01:54:35<@JAA>That's where we are, wickedplayer494. lol
01:54:36<wickedplayer494>oops thought this was #archiveteam nvm
01:55:22<@JAA>systwi: Open the .warc.gz file with zless and look for the first `WARC-Type: request` record. It should have some `Cookie: X` line.
01:58:02<systwi>It does have a line like that, yes.
01:59:33<@JAA>Well, then at least the cookie loading itself works I guess.
01:59:55<systwi>It looks like some cookies match, but there are also new cookies in the WARC not present in the cookies.txt. Maybe from grabbing outlinks.
02:00:24<@JAA>Yes, and grab-site also has some default cookies I think. Not sure if those get loaded if you specify your own --load-cookies though.
02:10:35<systwi>For context, I'm trying to grab a Quizlet page.
02:11:21<systwi>Looking it over closer, the WARC seems like it has every cookie specified that cookies.txt has.
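(The manual `zless` check above can also be scripted. A stdlib-only sketch — the function name and the synthetic WARC record are illustrative, not part of grab-site or wpull:)

```python
import gzip
import io

def find_request_cookies(warc_bytes):
    # Walk the WARC line by line; Python's gzip module transparently reads
    # the concatenated per-record gzip members that make up a .warc.gz.
    cookies = []
    in_request = False
    with gzip.open(io.BytesIO(warc_bytes), "rt",
                   encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("WARC-Type:"):
                in_request = line.split(":", 1)[1].strip() == "request"
            elif in_request and line.lower().startswith("cookie:"):
                cookies.append(line.split(":", 1)[1].strip())
    return cookies

# Tiny synthetic WARC with a single request record:
raw = (b"WARC/1.0\r\nWARC-Type: request\r\n\r\n"
       b"GET / HTTP/1.1\r\nHost: example.com\r\n"
       b"Cookie: session=abc123\r\n\r\n")
print(find_request_cookies(gzip.compress(raw)))  # ['session=abc123']
```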
02:14:30<h2ibot>JustAnotherArchivist edited Template:Infobox project sandbox (+214, archiving_type: s/warrior/dpos/, add archivebot…): https://wiki.archiveteam.org/?diff=47288&oldid=47287
02:43:06tzt (tzt) joins
02:43:28<systwi>Know of anything else I could check/try?
02:49:25tzt quits [Ping timeout: 252 seconds]
03:01:46tzt (tzt) joins
03:07:49qw3rty_ joins
03:11:31qw3rty__ quits [Ping timeout: 258 seconds]
03:38:17Earendil (Cobalt17) joins
03:39:40Earendil quits [Client Quit]
03:40:58qwertyasdfuiopghjkl joins
04:00:01treora quits [Quit: blub blub.]
04:01:18treora joins
04:18:31HP_Archivist quits [Ping timeout: 252 seconds]
05:06:00sonick (sonick) joins
05:13:25BlueMaxima quits [Read error: Connection reset by peer]
06:08:56pabs quits [Quit: Don't rest until all the world is paved in moss and greenery.]
06:14:08pabs (pabs) joins
07:08:52qwertyasdfuiopghjkl quits [Remote host closed the connection]
07:46:22Jake quits [Ping timeout: 258 seconds]
08:28:29tzt quits [Ping timeout: 265 seconds]
09:28:43Barto quits [Ping timeout: 252 seconds]
10:11:07driib73 (driib) joins
10:15:06driib7 quits [Ping timeout: 258 seconds]
10:15:06driib73 is now known as driib7
10:34:49Terbium quits [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
10:35:32Terbium joins
10:37:21Barto (Barto) joins
11:05:45sonick quits [Client Quit]
12:44:44mgrytbak8 joins
12:47:04mgrytbak quits [Ping timeout: 265 seconds]
12:47:04mgrytbak8 is now known as mgrytbak
13:29:40HP_Archivist (HP_Archivist) joins
13:31:06Dark_Hunter quits [Remote host closed the connection]
13:34:50spirit quits [Client Quit]
14:01:31arkhive quits [Ping timeout: 252 seconds]
14:01:55arkhive joins
14:12:23Arcorann quits [Ping timeout: 258 seconds]
14:42:09<rewby>systwi: Have you considered whether the website is doing something like personalizing with javascript instead of sending you different pages?
14:58:57Wingy quits [Remote host closed the connection]
14:59:54Wingy (Wingy) joins
15:09:18Jake (Jake) joins
15:43:05paul2520 (paul2520) joins
16:39:36Wingy quits [Read error: Connection reset by peer]
16:40:29Wingy (Wingy) joins
16:53:24HP_Archivist quits [Client Quit]
16:53:43HP_Archivist (HP_Archivist) joins
18:06:33HP_Archivist quits [Ping timeout: 265 seconds]
18:35:29Aoede quits [Quit: ZNC - https://znc.in]
18:40:59Aoede (Aoede) joins
18:50:28tzt (tzt) joins
19:24:15<@OrIdow6>systwi: Play around with the storage inspector of your browser
19:24:51<@OrIdow6>Or, one thing I find really useful is Firefox's "copy as curl" action in the network inspector, then eliminate curl args until you reproduce it
19:25:02<@OrIdow6>That's assuming you've already done what rewby said
19:49:44Wingy quits [Remote host closed the connection]
19:50:39Wingy (Wingy) joins
19:50:58HP_Archivist (HP_Archivist) joins
20:11:11driib7 quits [Ping timeout: 258 seconds]
20:13:53Wingy quits [Read error: Connection reset by peer]
20:14:47Wingy (Wingy) joins
20:18:31Wingy quits [Remote host closed the connection]
20:20:14Wingy (Wingy) joins
20:21:42Wingy quits [Remote host closed the connection]
20:22:37Wingy (Wingy) joins
20:47:59Wingy quits [Ping timeout: 258 seconds]
20:51:42TheTechRobo3641 joins
20:54:16nimaje quits [Ping timeout: 265 seconds]
20:55:16TheTechRobo quits [Ping timeout: 258 seconds]
21:16:00TheTechRobo3641 is now known as TheTechRobo
21:24:55qwertyasdfuiopghjkl joins
21:32:45<TheTechRobo>What is a warc.zst in the warrior projects and why is it different to a gz?
21:34:47<@JAA>zstandard compression instead of gzip. gzip does decent compression, zstd is black magic.
21:35:19<TheTechRobo>Do you mean a better ratio?
21:35:23<@JAA>We also use a dictionary with the zstd WARCs (would be possible with gzip but not well-supported by the tooling), which makes it *much* more efficient.
21:35:27<@JAA>Yes, better compression ratio.
21:35:43<TheTechRobo>Ah
21:36:32<@JAA>Here's my little script for decompressing .warc.zst: https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/zstdwarccat
21:37:02<@JAA>(Requires the zstd tools in PATH.)
21:37:12<TheTechRobo>On a related note, I find lrzip works wonders for compressing warc.gz files for archival. High compression windows are great! Wouldn't work here, though.
21:37:37<@JAA>They won't be compressed properly though. Each record needs to be compressed individually to allow for random access.
21:37:48<TheTechRobo>Exactly, that's why it wouldn't work here
21:37:49<@JAA>And that's what wrecks the compression ratio.
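(The per-record compression being described can be sketched with the stdlib: a .warc.gz is a sequence of independent gzip members, so a reader can seek straight to a record's byte offset — the kind of offset a CDX index stores — and decompress just that member. The record contents here are made up:)

```python
import gzip
import zlib

records = [b"record one", b"record two", b"record three"]

# Build a .warc.gz-style blob: one independent gzip member per record,
# remembering each member's starting byte offset.
offsets, blob = [], b""
for rec in records:
    offsets.append(len(blob))
    blob += gzip.compress(rec)

# Random access: jump to the second member and decompress only it.
# The decompressor stops at that member's end of stream.
d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # expect a gzip wrapper
second = d.decompress(blob[offsets[1]:])
print(second)  # b'record two'
```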
21:38:11<TheTechRobo>Even if that did work, lrzip only really works well with large files (>50 MB), which most warc entries won't be
21:38:22<@JAA>The custom dictionary on the zstd WARCs fixes this because that way you *can* compress the similar parts between records by shoving them in the dictionary.
21:38:23<TheTechRobo>so it's only good for compressing whole warcs
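(The dictionary point above can be demonstrated with the stdlib: zlib's preset-dictionary support is a simpler analogue of zstd's dictionary mechanism. Each record is still compressed independently, but the deflate streams can reference the shared boilerplate in the dictionary, recovering the cross-record redundancy that per-record compression otherwise loses. The records and dictionary here are made up for illustration:)

```python
import zlib

# 50 synthetic WARC-ish records sharing near-identical headers:
records = [
    b"WARC/1.0\r\nWARC-Type: response\r\n"
    b"Content-Type: application/http\r\n\r\nbody-%d" % i
    for i in range(50)
]

# Dictionary holding the boilerplate that every record repeats:
dictionary = (b"WARC/1.0\r\nWARC-Type: response\r\n"
              b"Content-Type: application/http\r\n\r\n")

def total_compressed_size(recs, zdict=None):
    # Compress each record independently, as random access requires.
    total = 0
    for rec in recs:
        if zdict is not None:
            c = zlib.compressobj(level=9, zdict=zdict)
        else:
            c = zlib.compressobj(level=9)
        total += len(c.compress(rec) + c.flush())
    return total

plain = total_compressed_size(records)
with_dict = total_compressed_size(records, dictionary)
print(plain, with_dict)  # the dictionary version is markedly smaller
```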
21:39:10<@JAA>Unfortunately, tooling for zstd WARCs so far is ... scarce.
21:41:12<@JAA>Really wget-at is the only tool that can write them, and IA's CDX-Writer and related software are the only things that can read them.
21:41:28<TheTechRobo>really? wpull can't write?
21:41:46<@JAA>I don't think there has been a commit to the wpull repo since we invented .warc.zst. lol
21:42:04<TheTechRobo>Good point
21:42:10<@JAA>wpull only produces gzipped WARCs.
21:42:22<TheTechRobo>I was tempted to use ludios_wpull for my project since it's the only one decently maintained
21:42:56<TheTechRobo>Ended up just compiling python 3.6 and using wpull with it
21:44:54<@JAA>pyenv ftw :-)
21:45:22<TheTechRobo>I was going to, but I can't be bothered to install it :P
21:46:45<TheTechRobo>IIRC I gave up on ludios_wpull because it didn't download anything
21:47:05nimaje joins
21:47:13<TheTechRobo>Probably would have worked if I had fallen back to my 3.7 install (debian buster ftw) but I like to live on the edge with 3.9 :P
21:47:27<TheTechRobo>I'll have to recompile soon tho, 3.10 just came out
21:49:06<@JAA>cd ~/.pyenv; git pull; pyenv install 3.10
21:49:07<@JAA>Done
21:49:08<@JAA>:-P
21:49:39<@JAA>3.10.0 *
21:50:18<TheTechRobo>Shouldn't 3.10 be aliased to the latest 3.10.*?
21:52:43<@JAA>I don't think pyenv has such aliases, but not sure.
21:56:06paul2520 quits [Remote host closed the connection]
22:08:02sonick (sonick) joins
22:10:58Arcorann (Arcorann) joins
22:42:15BlueMaxima joins
22:51:49Wingy (Wingy) joins
22:57:59Wingy quits [Read error: Connection reset by peer]
22:58:52Wingy (Wingy) joins
23:05:49Wingy quits [Remote host closed the connection]
23:08:31Wingy (Wingy) joins
23:15:07IDK quits [Quit: Connection closed for inactivity]
23:20:36Wingy quits [Read error: Connection reset by peer]
23:21:25Wingy (Wingy) joins
23:25:02Wingy quits [Remote host closed the connection]
23:25:58Wingy (Wingy) joins
23:28:20Arcorann quits [Read error: Connection reset by peer]
23:32:49Wingy quits [Ping timeout: 258 seconds]
23:46:45Wingy (Wingy) joins
23:58:30Wingy quits [Ping timeout: 258 seconds]