| 00:00:59 | | dm4v quits [Read error: Connection reset by peer] |
| 00:02:05 | | dm4v joins |
| 00:02:07 | | dm4v is now authenticated as dm4v |
| 00:02:07 | | dm4v quits [Changing host] |
| 00:02:07 | | dm4v (dm4v) joins |
| 00:36:15 | | AlsoHP_Archivist quits [Client Quit] |
| 00:36:32 | | HP_Archivist (HP_Archivist) joins |
| 00:53:53 | | qwertyasdfuiopghjkl87 joins |
| 00:55:15 | <h2ibot> | JustAnotherArchivist edited Template:Infobox project sandbox (+449, Add irc_network parameter (cf. edit 41606) and…): https://wiki.archiveteam.org/?diff=47287&oldid=31242 |
| 00:55:46 | | qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds] |
| 01:02:47 | | dm4v_ joins |
| 01:03:49 | | dm4v quits [Ping timeout: 252 seconds] |
| 01:03:49 | | dm4v_ is now known as dm4v |
| 01:03:49 | | dm4v is now authenticated as dm4v |
| 01:03:49 | | dm4v quits [Changing host] |
| 01:03:49 | | dm4v (dm4v) joins |
| 01:17:22 | | mgrytbak joins |
| 01:20:21 | | sonick quits [Client Quit] |
| 01:21:25 | | qwertyasdfuiopghjkl87 quits [Remote host closed the connection] |
| 01:23:45 | | hexa- quits [Quit: WeeChat 3.1] |
| 01:25:08 | | hexa- (hexa-) joins |
| 01:48:30 | <systwi> | I'm assuming this is the correct place to ask. |
| 01:48:34 | <systwi> | I'm trying to save a webpage with `grab-site` under Debian Bullseye, and I need to import a cookies.txt so the page is grabbed as if I were logged in. |
| 01:48:39 | <systwi> | I try using the following command: grab-site --1 --wpull-args='--load-cookies=/data/cookies.txt' 'https://example.com/' |
| 01:49:06 | <systwi> | But for some reason the page grabbed does not show me signed in. |
| 01:49:43 | <systwi> | The cookies.txt was exported from Librewolf using https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/ |
| 01:50:15 | <@JAA> | Is the relevant cookie in the first line of cookies.txt? |
| 01:51:06 | <systwi> | The site in question writes multiple cookies. The first line in particular is "# Netscape HTTP Cookie File" |
| 01:51:47 | <@JAA> | Hmm, ok, so not that wpull bug then. |
| 01:53:06 | <systwi> | Passing through the same user agent and using the same IP as how I logged into the site through the browser didn't make a difference. |
| 01:53:49 | <@JAA> | Are the cookies in the request record in the WARC? |
| 01:54:13 | <systwi> | I see a cookies.txt inside the output directory, but it's significantly smaller than the one I specified. |
| 01:54:19 | <systwi> | If that's what you meant. |
| 01:54:26 | <wickedplayer494> | ---> #archiveteam-bs, now. |
| 01:54:35 | <@JAA> | That's where we are, wickedplayer494. lol |
| 01:54:36 | <wickedplayer494> | oops thought this was #archiveteam nvm |
| 01:55:22 | <@JAA> | systwi: Open the .warc.gz file with zless and look for the first `WARC-Type: request` record. It should have some `Cookie: X` line. |
| 01:58:02 | <systwi> | It does have a line like that, yes. |
| 01:59:33 | <@JAA> | Well, then at least the cookie loading itself works I guess. |
| 01:59:55 | <systwi> | It looks like some cookies match, but there are also new cookies in the WARC not present in the cookies.txt. Maybe from grabbing outlinks. |
| 02:00:24 | <@JAA> | Yes, and grab-site also has some default cookies I think. Not sure if those get loaded if you specify your own --load-cookies though. |
| 02:10:35 | <systwi> | For context, I'm trying to grab a Quizlet page. |
| 02:11:21 | <systwi> | Looking it over closer, the WARC seems like it has every cookie specified that cookies.txt has. |
| 02:14:30 | <h2ibot> | JustAnotherArchivist edited Template:Infobox project sandbox (+214, archiving_type: s/warrior/dpos/, add archivebot…): https://wiki.archiveteam.org/?diff=47288&oldid=47287 |
| 02:43:06 | | tzt (tzt) joins |
| 02:43:28 | <systwi> | Know of anything else I could check/try? |
| 02:49:25 | | tzt quits [Ping timeout: 252 seconds] |
| 03:01:46 | | tzt (tzt) joins |
| 03:07:49 | | qw3rty_ joins |
| 03:11:31 | | qw3rty__ quits [Ping timeout: 258 seconds] |
| 03:38:17 | | Earendil (Cobalt17) joins |
| 03:39:40 | | Earendil quits [Client Quit] |
| 03:40:58 | | qwertyasdfuiopghjkl joins |
| 04:00:01 | | treora quits [Quit: blub blub.] |
| 04:01:18 | | treora joins |
| 04:18:31 | | HP_Archivist quits [Ping timeout: 252 seconds] |
| 05:06:00 | | sonick (sonick) joins |
| 05:13:25 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 06:08:56 | | pabs quits [Quit: Don't rest until all the world is paved in moss and greenery.] |
| 06:14:08 | | pabs (pabs) joins |
| 07:08:52 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 07:46:22 | | Jake quits [Ping timeout: 258 seconds] |
| 08:28:29 | | tzt quits [Ping timeout: 265 seconds] |
| 09:28:43 | | Barto quits [Ping timeout: 252 seconds] |
| 10:11:07 | | driib73 (driib) joins |
| 10:15:06 | | driib7 quits [Ping timeout: 258 seconds] |
| 10:15:06 | | driib73 is now known as driib7 |
| 10:34:49 | | Terbium quits [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.] |
| 10:35:32 | | Terbium joins |
| 10:37:21 | | Barto (Barto) joins |
| 11:05:45 | | sonick quits [Client Quit] |
| 12:44:44 | | mgrytbak8 joins |
| 12:47:04 | | mgrytbak quits [Ping timeout: 265 seconds] |
| 12:47:04 | | mgrytbak8 is now known as mgrytbak |
| 13:29:40 | | HP_Archivist (HP_Archivist) joins |
| 13:31:06 | | Dark_Hunter quits [Remote host closed the connection] |
| 13:34:50 | | spirit quits [Client Quit] |
| 14:01:31 | | arkhive quits [Ping timeout: 252 seconds] |
| 14:01:55 | | arkhive joins |
| 14:12:23 | | Arcorann quits [Ping timeout: 258 seconds] |
| 14:42:09 | <rewby> | systwi: Have you considered whether the website is doing something like personalizing with javascript instead of sending you different pages? |
| 14:58:57 | | Wingy quits [Remote host closed the connection] |
| 14:59:54 | | Wingy (Wingy) joins |
| 15:09:18 | | Jake (Jake) joins |
| 15:43:05 | | paul2520 (paul2520) joins |
| 16:39:36 | | Wingy quits [Read error: Connection reset by peer] |
| 16:40:29 | | Wingy (Wingy) joins |
| 16:53:24 | | HP_Archivist quits [Client Quit] |
| 16:53:43 | | HP_Archivist (HP_Archivist) joins |
| 18:06:33 | | HP_Archivist quits [Ping timeout: 265 seconds] |
| 18:35:29 | | Aoede quits [Quit: ZNC - https://znc.in] |
| 18:40:59 | | Aoede (Aoede) joins |
| 18:50:28 | | tzt (tzt) joins |
| 19:24:15 | <@OrIdow6> | systwi: Play around with the storage inspector of your browser |
| 19:24:51 | <@OrIdow6> | Or, one thing I find really useful is Firefox's "copy as curl" action in the network inspector, then eliminate curl args until you reproduce it |
| 19:25:02 | <@OrIdow6> | That's assuming you've already done what rewby's said |
| 19:49:44 | | Wingy quits [Remote host closed the connection] |
| 19:50:39 | | Wingy (Wingy) joins |
| 19:50:58 | | HP_Archivist (HP_Archivist) joins |
| 20:11:11 | | driib7 quits [Ping timeout: 258 seconds] |
| 20:13:53 | | Wingy quits [Read error: Connection reset by peer] |
| 20:14:47 | | Wingy (Wingy) joins |
| 20:18:31 | | Wingy quits [Remote host closed the connection] |
| 20:20:14 | | Wingy (Wingy) joins |
| 20:21:42 | | Wingy quits [Remote host closed the connection] |
| 20:22:37 | | Wingy (Wingy) joins |
| 20:47:59 | | Wingy quits [Ping timeout: 258 seconds] |
| 20:51:42 | | TheTechRobo3641 joins |
| 20:54:16 | | nimaje quits [Ping timeout: 265 seconds] |
| 20:55:16 | | TheTechRobo quits [Ping timeout: 258 seconds] |
| 21:16:00 | | TheTechRobo3641 is now known as TheTechRobo |
| 21:16:13 | | TheTechRobo is now authenticated as TheTechRobo |
| 21:24:55 | | qwertyasdfuiopghjkl joins |
| 21:32:45 | <TheTechRobo> | What is a warc.zst in the warrior projects and why is it different to a gz? |
| 21:34:47 | <@JAA> | zstandard compression instead of gzip. gzip does decent compression, zstd is black magic. |
| 21:35:19 | <TheTechRobo> | Do you mean a better ratio? |
| 21:35:23 | <@JAA> | We also use a dictionary with the zstd WARCs (would be possible with gzip but not well-supported by the tooling), which makes it *much* more efficient. |
| 21:35:27 | <@JAA> | Yes, better compression ratio. |
| 21:35:43 | <TheTechRobo> | Ah |
| 21:36:32 | <@JAA> | Here's my little script for decompressing .warc.zst: https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/zstdwarccat |
| 21:37:02 | <@JAA> | (Requires the zstd tools in PATH.) |
| 21:37:12 | <TheTechRobo> | On a related note, I find lrzip works wonders for compressing warc.gz files for archival. High compression windows are great! Wouldn't work here, though. |
| 21:37:37 | <@JAA> | They won't be compressed properly though. Each record needs to be compressed individually to allow for random access. |
| 21:37:48 | <TheTechRobo> | Exactly, that's why it wouldn't work here |
| 21:37:49 | <@JAA> | And that's what wrecks the compression ratio. |
| 21:38:11 | <TheTechRobo> | Even if that did work, lrzip only really works well with large files (>50MB) which most warc entries won't be |
| 21:38:22 | <@JAA> | The custom dictionary on the zstd WARCs fixes this because that way you *can* compress the similar parts between records by shoving them in the dictionary. |
| 21:38:23 | <TheTechRobo> | so it's only good for compressing whole warcs |
| 21:39:10 | <@JAA> | Unfortunately, tooling for zstd WARCs so far is ... scarce. |
| 21:41:12 | <@JAA> | Really, wget-at is the only tool that can write them, and IA's CDX-Writer and related software are the only things that can read them. |
| 21:41:28 | <TheTechRobo> | really? wpull can't write? |
| 21:41:46 | <@JAA> | I don't think there has been a commit to the wpull repo since we invented .warc.zst. lol |
| 21:42:04 | <TheTechRobo> | Good point |
| 21:42:10 | <@JAA> | wpull only produces gzipped WARCs. |
| 21:42:22 | <TheTechRobo> | I was tempted to use ludios_wpull for my project since it's the only one decently maintained |
| 21:42:56 | <TheTechRobo> | Ended up just compiling python 3.6 and using wpull with it |
| 21:44:54 | <@JAA> | pyenv ftw :-) |
| 21:45:22 | <TheTechRobo> | I was going to, but I can't be bothered to install it :P |
| 21:46:45 | <TheTechRobo> | IIRC I gave up on ludios_wpull because it didn't download anything |
| 21:47:05 | | nimaje joins |
| 21:47:13 | <TheTechRobo> | Probably would have worked if I had fallen back to my 3.7 install (debian buster ftw) but I like to live on the edge with 3.9 :P |
| 21:47:27 | <TheTechRobo> | I'll have to recompile soon tho, 3.10 just came out |
| 21:49:06 | <@JAA> | cd ~/.pyenv; git pull; pyenv install 3.10 |
| 21:49:07 | <@JAA> | Done |
| 21:49:08 | <@JAA> | :-P |
| 21:49:39 | <@JAA> | 3.10.0 * |
| 21:50:18 | <TheTechRobo> | Shouldn't 3.10 be aliased to the latest 3.10.*? |
| 21:52:43 | <@JAA> | I don't think pyenv has such aliases, but not sure. |
| 21:56:06 | | paul2520 quits [Remote host closed the connection] |
| 22:08:02 | | sonick (sonick) joins |
| 22:10:58 | | Arcorann (Arcorann) joins |
| 22:42:15 | | BlueMaxima joins |
| 22:51:49 | | Wingy (Wingy) joins |
| 22:57:59 | | Wingy quits [Read error: Connection reset by peer] |
| 22:58:52 | | Wingy (Wingy) joins |
| 23:05:49 | | Wingy quits [Remote host closed the connection] |
| 23:08:31 | | Wingy (Wingy) joins |
| 23:15:07 | | IDK quits [Quit: Connection closed for inactivity] |
| 23:20:36 | | Wingy quits [Read error: Connection reset by peer] |
| 23:21:25 | | Wingy (Wingy) joins |
| 23:25:02 | | Wingy quits [Remote host closed the connection] |
| 23:25:58 | | Wingy (Wingy) joins |
| 23:28:20 | | Arcorann quits [Read error: Connection reset by peer] |
| 23:32:49 | | Wingy quits [Ping timeout: 258 seconds] |
| 23:46:45 | | Wingy (Wingy) joins |
| 23:58:30 | | Wingy quits [Ping timeout: 258 seconds] |
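
The compression discussion above (per-record compression for random access, and a shared dictionary to win back the ratio) can be sketched roughly as follows. This is only an illustration under stated assumptions, not how wget-at or the .warc.zst format actually work: it uses Python's standard `zlib` module with a preset dictionary (`zdict`) as a stand-in for zstd's dictionary support, the records are made-up WARC-like blobs, and `make_record`, `compress_each` and `decompress_one` are hypothetical helper names.

```python
import zlib


def make_record(i: int) -> bytes:
    """Build a fake, WARC-like record; most of its bytes are shared boilerplate."""
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: https://example.com/page/{i}\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        "\r\n"
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html; charset=utf-8\r\n"
        "Server: nginx\r\n"
        "\r\n"
    )
    body = (
        f"<html><head><title>Page {i}</title></head>"
        f"<body><p>Item {i}</p></body></html>"
    )
    return (headers + body + "\r\n\r\n").encode()


def compress_each(records, zdict=None):
    """Compress every record as its own zlib stream, optionally with a preset dictionary."""
    out = []
    for record in records:
        if zdict is None:
            comp = zlib.compressobj(level=9)
        else:
            comp = zlib.compressobj(level=9, zdict=zdict)
        out.append(comp.compress(record) + comp.flush())
    return out


def decompress_one(blob, zdict=None):
    """Decompress a single record without touching any other record."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


records = [make_record(i) for i in range(1, 51)]
# Assumption: a single sample record serves as the dictionary. Real zstd
# dictionaries are trained on many samples; this is the crudest possible version.
zdict = make_record(0)

plain = compress_each(records)             # per record, no dictionary
with_dict = compress_each(records, zdict)  # per record, shared dictionary
whole = zlib.compress(b"".join(records), 9)  # one stream, no random access

print("raw bytes:            ", sum(len(r) for r in records))
print("per record, no dict:  ", sum(len(b) for b in plain))
print("per record, with dict:", sum(len(b) for b in with_dict))
print("whole file, one stream:", len(whole))
```

Any single entry of `with_dict` can still be decoded on its own with `decompress_one(with_dict[7], zdict)`, which is the random-access property that whole-file compressors such as lrzip give up.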