| 00:19:07 | | nine quits [Quit: See ya!] |
| 00:19:20 | | nine joins |
| 00:19:20 | | nine is now authenticated as nine |
| 00:19:20 | | nine quits [Changing host] |
| 00:19:20 | | nine (nine) joins |
| 00:32:50 | | etnguyen03 quits [Client Quit] |
| 00:40:14 | | SootBector quits [Remote host closed the connection] |
| 00:40:22 | <steering> | person: "it's now 1:30" - youtube auto captions: "it's now 01:30 hours" |
| 00:41:22 | | SootBector (SootBector) joins |
| 01:03:29 | | ducky quits [Ping timeout: 272 seconds] |
| 01:04:19 | | etnguyen03 (etnguyen03) joins |
| 02:22:10 | | sec^nd quits [Remote host closed the connection] |
| 02:22:35 | | sec^nd (second) joins |
| 02:44:47 | | APOLLO03 quits [Ping timeout: 268 seconds] |
| 02:47:44 | | ducky (ducky) joins |
| 02:56:13 | | nine quits [Ping timeout: 272 seconds] |
| 02:58:38 | | nine joins |
| 02:58:40 | | nine is now authenticated as nine |
| 02:58:40 | | nine quits [Changing host] |
| 02:58:40 | | nine (nine) joins |
| 03:13:09 | | iPwnedYourIOTSmartdog quits [Ping timeout: 268 seconds] |
| 03:13:48 | | iPwnedYourIOTSmartdog joins |
| 04:01:44 | | etnguyen03 quits [Remote host closed the connection] |
| 04:08:23 | <Doranwen> | Regex101 says this will extract the link but I just tested it and it's *not* extracting one of the links from an html file (and almost certainly other similar ones): `for f in *.html; do grep -Po '(?<=href=")[^"]*' "$f" >> links1.txt; done` |
| 04:08:43 | <Doranwen> | Sample from the html is this: `Read it here: <a href="http://unfortunateggs.livejournal.com/125304.html" target="_self"><strong>http://unfortunateggs.livejournal.com/125304.html</strong></a> </article> </div>` |
| 04:12:13 | <Doranwen> | It *should* be pulling that LJ link from the first bit - but it's not. I'm only seeing links from the *last* few html files rather than all of them, for some reason. |
| 04:15:18 | <@JAA> | Doranwen: Do you see any 'binary file matches' in the output? Are you using GNU grep (`grep --version`)? If so, add the `-a` option. |
| 04:15:41 | <@JAA> | Also, no need for a loop. |
| 04:15:50 | <Doranwen> | It did not give me any error about it. |
| 04:16:04 | <Doranwen> | It does need a loop if there's 700 html files, I would think? |
| 04:16:04 | <@JAA> | `grep -Pho '(?<=href=")[^"]*' *.html >>links1.txt` is equivalent. |
| 04:16:14 | <Doranwen> | Or not? |
| 04:16:45 | <@JAA> | You might need something special when you have many thousands of files. A loop wouldn't be ideal though. |
| 04:16:47 | <Doranwen> | The folder also has pdf files in it, so I thought it needed to have html specified to look at. |
| 04:17:44 | <Doranwen> | And sometimes there are csv files that I didn't want searched. |
| 04:17:54 | <@JAA> | I still use the same `*.html` glob as you do. |
| 04:18:03 | <Doranwen> | Ah |
| 04:18:04 | <@JAA> | `grep` can take multiple filename arguments. |
| 04:18:15 | <@JAA> | You're thinking of recursive mode instead. |
| 04:18:31 | <@JAA> | The `-h` option is to suppress the filename prefix on each line. |
| 04:19:08 | <@JAA> | And 'binary file matches' wouldn't be an error. It'd go into the output file. |
| 04:19:19 | <@JAA> | (Is that annoying? Yes.) |
| 04:19:25 | <Doranwen> | The output file opens without any issues, so I doubt it found any. |
| 04:19:40 | <@JAA> | I mean the literal string, not binary data in the output file. |
| 04:19:41 | <Doranwen> | When it has binary stuff it usually gives my text editor fits trying to open it, lol. |
| 04:19:54 | <Doranwen> | Ahh |
| 04:19:55 | <@JAA> | > printf '\0meow' | grep meow |
| 04:19:55 | <@JAA> | grep: (standard input): binary file matches |
| 04:20:11 | <Doranwen> | Nope. The word "binary" is not in the output file. |
| 04:20:22 | <@JAA> | Then I'd need an example file. |
| 04:20:57 | <@JAA> | I.e. one that has a link which isn't extracted. |
| 04:21:24 | <Doranwen> | Give me a sec and I'll dump it up. |
| 04:22:24 | <Doranwen> | https://transfer.archivete.am/GcD4I/sample.zip |
| 04:22:25 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/GcD4I/sample.zip |
| 04:22:41 | <Doranwen> | That's only a subset of them, but has several with links that were missing when I tried it. |
| 04:22:58 | <Doranwen> | I can give you the entire set if you want it. |
| 04:23:29 | <@JAA> | Ah, your HTML files are UTF-16, not UTF-8. |
| 04:23:49 | | Doranwen sighs. |
| 04:23:56 | <Doranwen> | I figured it had to be the encoding switch. |
| 04:24:09 | <@JAA> | > xxd jane-maura-ff-100043.html | head -n 2 |
| 04:24:09 | <@JAA> | 00000000: fffe 3c00 6800 7400 6d00 6c00 3e00 3c00 ..<.h.t.m.l.>.<. |
| 04:24:09 | <@JAA> | 00000010: 6800 6500 6100 6400 3e00 3c00 7400 6900 h.e.a.d.>.<.t.i. |
| 04:24:14 | <Doranwen> | Before this all the stuff we got seemed to be in ISO whatever it is - because any Russian characters ended up as ?????? |
| 04:24:43 | <Doranwen> | And we pestered the guy who made the archiving tool and he fixed that - but apparently his fix was to make it utf-16 rather than utf-8. |
| 04:26:11 | <Doranwen> | I'll pester him to fix that, LOL. |
| 04:26:46 | <Doranwen> | Thank you for helping me diagnose the issue! |
| 04:26:54 | <Doranwen> | I wondered why everything seemed to be working fine until today, lol. |
| 04:28:08 | <@JAA> | You can also use `iconv` to convert it. `iconv -f utf16 -t utf8` is probably sufficient. |
| 04:28:44 | <@JAA> | But you'll want to only pass the UTF-16 files into that, of course. |
| 04:29:28 | <Doranwen> | How do I tell which ones are UTF-16 and which are UTF-8? I'm thinking because I was able to extract *some* links that some of them have to be utf-8? |
| 04:30:52 | <Doranwen> | The ones I got links from weren't in the sample I sent you. |
| 04:31:17 | <@JAA> | Well, they don't necessarily have to be UTF-8, but yeah. |
| 04:31:35 | <@JAA> | Clearly not all files in your directory are the same encoding, and everything gets painful at that point. |
| 04:32:52 | <nicolas17> | ff fe prefix is a pretty sure sign of UTF-16 |
| 04:34:05 | <Doranwen> | Yeah, it might be easier to just re-download this comm (and the other one I grabbed with the new version) with an old version of the archiving tool. |
| 04:34:39 | <Doranwen> | I'll lose a few hours of downloading but it'll be a lot less hassle. |
| 04:34:52 | <Doranwen> | I still am not sure how I would - in one fell swoop - see which files are utf-8 and which are utf-16. |
| 04:35:01 | <@JAA> | Yeah, UTF-16-encoded data usually starts with a Byte Order Mark, either FF FE or FE FF depending on the flavour. (It's why that codepoint even exists.) |
| 04:42:45 | <Doranwen> | Thanks again for the help. I was pulling my hair out trying to figure out why this script didn't seem to be working anymore, lol. |
| 04:43:25 | <Doranwen> | And I'll replace that loop with the grep command then. |
| 04:45:21 | <@JAA> | It's immediately visible in `less` (because it renders out all the NUL bytes), and `file` should probably report it as well. |
| 04:46:02 | <@JAA> | Or `xxd $filename | head` as I did above. |
| 04:46:10 | <@JAA> | Just some general tips for this kind of situation. :-) |
| 05:01:17 | | HackMii quits [Remote host closed the connection] |
| 05:01:36 | | HackMii (hacktheplanet) joins |
| 05:32:43 | | sec^nd quits [Remote host closed the connection] |
| 05:33:05 | | sec^nd (second) joins |
| 05:46:10 | | teppum quits [Remote host closed the connection] |
| 06:17:13 | | ArchivalEfforts quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
| 06:17:23 | | ArchivalEfforts joins |
| 08:52:56 | | ducky quits [Ping timeout: 268 seconds] |
| 08:53:09 | | ducky (ducky) joins |
| 08:54:25 | | Dango360 quits [Quit: The Lounge - https://thelounge.chat] |
| 09:30:09 | | cipherrot quits [Ping timeout: 272 seconds] |
| 09:37:48 | | Snivy quits [Quit: The Lounge - https://thelounge.chat] |
| 09:38:11 | | petrichor (petrichor) joins |
| 09:38:18 | | Snivy (Snivy) joins |
| 09:38:23 | | Snivy quits [Remote host closed the connection] |
| 09:39:43 | | Snivy (Snivy) joins |
| 10:03:54 | | rohvani quits [Quit: The Lounge - https://thelounge.chat] |
| 10:09:54 | | @arkiver quits [Remote host closed the connection] |
| 10:10:21 | | arkiver (arkiver) joins |
| 10:10:21 | | @ChanServ sets mode: +o arkiver |
| 10:14:12 | | fireatseaparks quits [Remote host closed the connection] |
| 10:14:52 | | fireatseaparks (fireatseaparks) joins |
| 10:26:37 | | APOLLO03 joins |
| 11:39:38 | | Cornelius7 (Cornelius) joins |
| 11:41:15 | | Cornelius quits [Ping timeout: 272 seconds] |
| 11:41:15 | | Cornelius7 is now known as Cornelius |
| 11:47:37 | | irisfreckles13 joins |
| 11:58:52 | | APOLLO03 quits [Read error: Connection reset by peer] |
| 11:59:51 | | APOLLO03 joins |
| 12:00:03 | | Bleo1826007227196234552220 quits [Quit: The Lounge - https://thelounge.chat] |
| 12:02:46 | | Bleo1826007227196234552220 joins |
| 12:37:37 | | petrichor quits [Client Quit] |
| 13:23:15 | | petrichor (petrichor) joins |
| 14:03:07 | | irisfreckles13 quits [Ping timeout: 272 seconds] |
| 14:12:44 | | Dada joins |
| 14:18:22 | | Dango360 (Dango360) joins |
| 15:09:03 | | irisfreckles13 joins |
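
Editor's note on the 04:08 to 04:46 exchange: the question of how to tell, in one pass, which files are UTF-16 and which are not was answered only with manual tips (`less`, `file`, `xxd`). The following is a minimal sketch of one way to automate it, not something anyone in the channel posted. It assumes GNU grep (for `-P`) and an `iconv` that accepts the `UTF-16` and `UTF-8` names, and it assumes every UTF-16 file starts with a byte order mark, as the sample at 04:24 did. The function names `detect_bom`, `to_utf8`, and `extract_links` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check each file for a UTF-16 byte order mark, convert only those
# files to UTF-8 on the fly, then extract href values with the same regex
# used in the log. Function names are illustrative, not from the log.

# Print "utf16" if the file starts with FF FE or FE FF, otherwise "other".
detect_bom() {
  local bom
  # Read the first two bytes and render them as lowercase hex, e.g. "fffe".
  bom=$(head -c 2 < "$1" | od -An -tx1 | tr -d ' \n')
  case "$bom" in
    fffe|feff) echo utf16 ;;
    *)         echo other ;;
  esac
}

# Write the file to stdout as UTF-8; files without a UTF-16 BOM pass through unchanged.
to_utf8() {
  if [ "$(detect_bom "$1")" = utf16 ]; then
    iconv -f UTF-16 -t UTF-8 < "$1"
  else
    cat < "$1"
  fi
}

# Print every href="..." value found in the given files, one per line.
extract_links() {
  local f
  for f in "$@"; do
    # -a treats input as text; grep exits 1 when a file has no links,
    # which is not an error here, hence the "|| true".
    to_utf8 "$f" | grep -aPo '(?<=href=")[^"]*' || true
  done
}
```

With these defined, `extract_links *.html >> links1.txt` would stand in for the original loop. A loop is used here only because each file may need its own conversion; files that lack a BOM but are still UTF-16 would not be caught by this check.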