00:19:07nine quits [Quit: See ya!]
00:19:20nine joins
00:19:20nine quits [Changing host]
00:19:20nine (nine) joins
00:32:50etnguyen03 quits [Client Quit]
00:40:14SootBector quits [Remote host closed the connection]
00:40:22<steering>person: "it's now 1:30" - youtube auto captions: "it's now 01:30 hours"
00:41:22SootBector (SootBector) joins
01:03:29ducky quits [Ping timeout: 272 seconds]
01:04:19etnguyen03 (etnguyen03) joins
02:22:10sec^nd quits [Remote host closed the connection]
02:22:35sec^nd (second) joins
02:44:47APOLLO03 quits [Ping timeout: 268 seconds]
02:47:44ducky (ducky) joins
02:56:13nine quits [Ping timeout: 272 seconds]
02:58:38nine joins
02:58:40nine quits [Changing host]
02:58:40nine (nine) joins
03:13:09iPwnedYourIOTSmartdog quits [Ping timeout: 268 seconds]
03:13:48iPwnedYourIOTSmartdog joins
04:01:44etnguyen03 quits [Remote host closed the connection]
04:08:23<Doranwen>Regex101 says this will extract the link but I just tested it and it's *not* extracting one of the links from an html file (and almost certainly other similar ones): `for f in *.html; do grep -Po '(?<=href=")[^"]*' "$f" >> links1.txt; done`
04:08:43<Doranwen>Sample from the html is this: `Read it here: <a href="http://unfortunateggs.livejournal.com/125304.html" target="_self"><strong>http://unfortunateggs.livejournal.com/125304.html</strong></a> </article> </div>`
04:12:13<Doranwen>It *should* be pulling that LJ link from the first bit - but it's not. I'm only seeing links from the *last* few html files rather than all of them, for some reason.
04:15:18<@JAA>Doranwen: Do you see any 'binary file matches' in the output? Are you using GNU grep (`grep --version`)? If so, add the `-a` option.
04:15:41<@JAA>Also, no need for a loop.
04:15:50<Doranwen>It did not give me any error about it.
04:16:04<Doranwen>It does need a loop if there's 700 html files, I would think?
04:16:04<@JAA>`grep -Pho '(?<=href=")[^"]*' *.html >>links1.txt` is equivalent.
04:16:14<Doranwen>Or not?
04:16:45<@JAA>You might need something special when you have many thousands of files. A loop wouldn't be ideal though.
04:16:47<Doranwen>The folder also has pdf files in it, so I thought it needed to have html specified to look at.
04:17:44<Doranwen>And sometimes there are csv files that I didn't want searched.
04:17:54<@JAA>I still use the same `*.html` glob as you do.
04:18:03<Doranwen>Ah
04:18:04<@JAA>`grep` can take multiple filename arguments.
04:18:15<@JAA>You're thinking of recursive mode instead.
04:18:31<@JAA>The `-h` option is to suppress the filename prefix on each line.
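A minimal sketch of the loop-free approach described above, including the "something special" case of a very large number of files. It assumes GNU grep built with PCRE support (`-P`) and a `find` that supports `-maxdepth`; `extract_links` is a hypothetical helper name, not something from the conversation.

```shell
#!/usr/bin/env bash
# Sketch: print every href target found in the .html files of one directory.
# extract_links is a hypothetical helper; GNU grep with -P support is assumed.
extract_links() {
    local dir=${1:-.}
    # find passes the filenames to grep in batches ("{} +"), so a very long
    # file list does not run into the argument-length limit of a plain glob.
    # -a: treat input as text, -h: no filename prefix, -o: print only the match.
    find "$dir" -maxdepth 1 -type f -name '*.html' \
        -exec grep -aPho '(?<=href=")[^"]*' {} +
}
```

Running `extract_links . >> links1.txt` would then append the links to the same output file as the original loop. Only `*.html` files are passed to grep, so PDF and CSV files in the same folder are not searched.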
04:19:08<@JAA>And 'binary file matches' wouldn't be an error. It'd go into the output file.
04:19:19<@JAA>(Is that annoying? Yes.)
04:19:25<Doranwen>The output file opens without any issues, so I doubt it found any.
04:19:40<@JAA>I mean the literal string, not binary data in the output file.
04:19:41<Doranwen>When it has binary stuff it usually gives my text editor fits trying to open it, lol.
04:19:54<Doranwen>Ahh
04:19:55<@JAA>> printf '\0meow' | grep meow
04:19:55<@JAA>grep: (standard input): binary file matches
04:20:11<Doranwen>Nope. The word "binary" is not in the output file.
04:20:22<@JAA>Then I'd need an example file.
04:20:57<@JAA>I.e. one that has a link which isn't extracted.
04:21:24<Doranwen>Give me a sec and I'll dump it up.
04:22:24<Doranwen>https://transfer.archivete.am/GcD4I/sample.zip
04:22:25<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/GcD4I/sample.zip
04:22:41<Doranwen>That's only a subset of them, but has several with links that were missing when I tried it.
04:22:58<Doranwen>I can give you the entire set if you want it.
04:23:29<@JAA>Ah, your HTML files are UTF-16, not UTF-8.
04:23:49Doranwen sighs.
04:23:56<Doranwen>I figured it had to be the encoding switch.
04:24:09<@JAA>> xxd jane-maura-ff-100043.html | head -n 2
04:24:09<@JAA>00000000: fffe 3c00 6800 7400 6d00 6c00 3e00 3c00 ..<.h.t.m.l.>.<.
04:24:09<@JAA>00000010: 6800 6500 6100 6400 3e00 3c00 7400 6900 h.e.a.d.>.<.t.i.
04:24:14<Doranwen>Before this all the stuff we got seemed to be in ISO whatever it is - because any Russian characters ended up as ??????
04:24:43<Doranwen>And we pestered the guy who made the archiving tool and he fixed that - but apparently his fix was to make it utf-16 rather than utf-8.
04:26:11<Doranwen>I'll pester him to fix that, LOL.
04:26:46<Doranwen>Thank you for helping me diagnose the issue!
04:26:54<Doranwen>I wondered why everything seemed to be working fine until today, lol.
04:28:08<@JAA>You can also use `iconv` to convert it. `iconv -f utf16 -t utf8` is probably sufficient.
04:28:44<@JAA>But you'll want to only pass the UTF-16 files into that, of course.
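One way to pass only the UTF-16 files to `iconv` is to check each file for a byte order mark first. The following is a sketch under that assumption (files without a BOM are left untouched, even if they happen to be UTF-16); `has_utf16_bom` and `convert_utf16_html` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: convert .html files that start with a UTF-16 BOM to UTF-8 in place.
# has_utf16_bom and convert_utf16_html are hypothetical helper names.

has_utf16_bom() {
    # Hex of the first two bytes: "fffe" (little-endian) or "feff" (big-endian).
    local bom
    bom=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$bom" = "fffe" ] || [ "$bom" = "feff" ]
}

convert_utf16_html() {
    local dir=${1:-.} f
    for f in "$dir"/*.html; do
        [ -f "$f" ] || continue
        if has_utf16_bom "$f"; then
            # With "-f UTF-16", iconv uses the BOM to pick the byte order.
            # Write to a temporary file first so that a failed conversion
            # does not overwrite the original.
            iconv -f UTF-16 -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
        fi
    done
}
```

Calling `convert_utf16_html .` rewrites the matching files in the current directory, so it is worth trying on a copy first.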
04:29:28<Doranwen>How do I tell which ones are UTF-16 and which are UTF-8? I'm thinking because I was able to extract *some* links that some of them have to be utf-8?
04:30:52<Doranwen>The ones I got links from weren't in the sample I sent you.
04:31:17<@JAA>Well, they don't necessarily have to be UTF-8, but yeah.
04:31:35<@JAA>Clearly not all files in your directory are the same encoding, and everything gets painful at that point.
04:32:52<nicolas17>ff fe prefix is a pretty sure sign of UTF-16
04:34:05<Doranwen>Yeah, it might be easier to just re-download this comm (and the other one I grabbed with the new version) with an old version of the archiving tool.
04:34:39<Doranwen>I'll lose a few hours of downloading but it'll be a lot less hassle.
04:34:52<Doranwen>I still am not sure how I would - in one fell swoop - see which files are utf-8 and which are utf-16.
04:35:01<@JAA>Yeah, UTF-16-encoded data usually starts with a Byte Order Mark, either FF FE or FE FF depending on the flavour. (It's why that codepoint even exists.)
04:42:45<Doranwen>Thanks again for the help. I was pulling my hair out trying to figure out why this script didn't seem to be working anymore, lol.
04:43:25<Doranwen>And I'll replace that loop with the grep command then.
04:45:21<@JAA>It's immediately visible in `less` (because it renders out all the NUL bytes), and `file` should probably report it as well.
04:46:02<@JAA>Or `xxd $filename | head` as I did above.
04:46:10<@JAA>Just some general tips for this kind of situation. :-)
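To see in one pass which files are UTF-16 and which are not, the BOM check can be turned into a small report. This is a sketch, not a full encoding detector: it only looks at the first three bytes, it cannot tell BOM-less UTF-8 from ASCII or a legacy 8-bit encoding, and it does not distinguish UTF-32. `bom_kind` and `list_encodings` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: label each .html file by its leading bytes.
# bom_kind and list_encodings are hypothetical helper names.

bom_kind() {
    # Hex of the first three bytes, e.g. "fffe3c" for UTF-16LE starting with "<".
    local hex
    hex=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case $hex in
        fffe*)  echo "utf-16le" ;;
        feff*)  echo "utf-16be" ;;
        efbbbf) echo "utf-8-bom" ;;
        *)      echo "no-bom" ;;  # BOM-less UTF-8, ASCII, a legacy encoding, ...
    esac
}

list_encodings() {
    local dir=${1:-.} f
    for f in "$dir"/*.html; do
        [ -f "$f" ] || continue
        # One line per file: label, tab, path.
        printf '%s\t%s\n' "$(bom_kind "$f")" "$f"
    done
}
```

Something like `list_encodings . | cut -f1 | sort | uniq -c` would then give a count per label for the current directory.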
05:01:17HackMii quits [Remote host closed the connection]
05:01:36HackMii (hacktheplanet) joins
05:32:43sec^nd quits [Remote host closed the connection]
05:33:05sec^nd (second) joins
05:46:10teppum quits [Remote host closed the connection]
06:17:13ArchivalEfforts quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
06:17:23ArchivalEfforts joins
08:52:56ducky quits [Ping timeout: 268 seconds]
08:53:09ducky (ducky) joins
08:54:25Dango360 quits [Quit: The Lounge - https://thelounge.chat]
09:30:09cipherrot quits [Ping timeout: 272 seconds]
09:37:48Snivy quits [Quit: The Lounge - https://thelounge.chat]
09:38:11petrichor (petrichor) joins
09:38:18Snivy (Snivy) joins
09:38:23Snivy quits [Remote host closed the connection]
09:39:43Snivy (Snivy) joins
10:03:54rohvani quits [Quit: The Lounge - https://thelounge.chat]
10:09:54@arkiver quits [Remote host closed the connection]
10:10:21arkiver (arkiver) joins
10:10:21@ChanServ sets mode: +o arkiver
10:14:12fireatseaparks quits [Remote host closed the connection]
10:14:52fireatseaparks (fireatseaparks) joins
10:26:37APOLLO03 joins
11:39:38Cornelius7 (Cornelius) joins
11:41:15Cornelius quits [Ping timeout: 272 seconds]
11:41:15Cornelius7 is now known as Cornelius
11:47:37irisfreckles13 joins
11:58:52APOLLO03 quits [Read error: Connection reset by peer]
11:59:51APOLLO03 joins
12:00:03Bleo1826007227196234552220 quits [Quit: The Lounge - https://thelounge.chat]
12:02:46Bleo1826007227196234552220 joins
12:37:37petrichor quits [Client Quit]
13:23:15petrichor (petrichor) joins
14:03:07irisfreckles13 quits [Ping timeout: 272 seconds]
14:12:44Dada joins
14:18:22Dango360 (Dango360) joins
15:09:03irisfreckles13 joins