00:19:07nine quits [Quit: See ya!]
00:19:20nine joins
00:19:20nine quits [Changing host]
00:19:20nine (nine) joins
00:32:50etnguyen03 quits [Client Quit]
00:40:14SootBector quits [Remote host closed the connection]
00:40:22<steering>person: "it's now 1:30" - youtube auto captions: "it's now 01:30 hours"
00:41:22SootBector (SootBector) joins
01:03:29ducky quits [Ping timeout: 272 seconds]
01:04:19etnguyen03 (etnguyen03) joins
02:22:10sec^nd quits [Remote host closed the connection]
02:22:35sec^nd (second) joins
02:44:47APOLLO03 quits [Ping timeout: 268 seconds]
02:47:44ducky (ducky) joins
02:56:13nine quits [Ping timeout: 272 seconds]
02:58:38nine joins
02:58:40nine quits [Changing host]
02:58:40nine (nine) joins
03:13:09iPwnedYourIOTSmartdog quits [Ping timeout: 268 seconds]
03:13:48iPwnedYourIOTSmartdog joins
04:01:44etnguyen03 quits [Remote host closed the connection]
04:08:23<Doranwen>Regex101 says this will extract the link but I just tested it and it's *not* extracting one of the links from an html file (and almost certainly other similar ones): `for f in *.html; do grep -Po '(?<=href=")[^"]*' "$f" >> links1.txt; done`
04:08:43<Doranwen>Sample from the html is this: `Read it here: <a href="http://unfortunateggs.livejournal.com/125304.html" target="_self"><strong>http://unfortunateggs.livejournal.com/125304.html</strong></a> </article> </div>`
04:12:13<Doranwen>It *should* be pulling that LJ link from the first bit - but it's not. I'm only seeing links from the *last* few html files rather than all of them, for some reason.
04:15:18<@JAA>Doranwen: Do you see any 'binary file matches' in the output? Are you using GNU grep (`grep --version`)? If so, add the `-a` option.
04:15:41<@JAA>Also, no need for a loop.
04:15:50<Doranwen>It did not give me any error about it.
04:16:04<Doranwen>It does need a loop if there's 700 html files, I would think?
04:16:04<@JAA>`grep -Pho '(?<=href=")[^"]*' *.html >>links1.txt` is equivalent.
04:16:14<Doranwen>Or not?
04:16:45<@JAA>You might need something special when you have many thousands of files. A loop wouldn't be ideal though.
04:16:47<Doranwen>The folder also has pdf files in it, so I thought it needed to have html specified to look at.
04:17:44<Doranwen>And sometimes there are csv files that I didn't want searched.
04:17:54<@JAA>I still use the same `*.html` glob as you do.
04:18:03<Doranwen>Ah
04:18:04<@JAA>`grep` can take multiple filename arguments.
04:18:15<@JAA>You're thinking of recursive mode instead.
04:18:31<@JAA>The `-h` option is to suppress the filename prefix on each line.
04:19:08<@JAA>And 'binary file matches' wouldn't be an error. It'd go into the output file.
04:19:19<@JAA>(Is that annoying? Yes.)
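[editor's note: a minimal, self-contained sketch of the loop-free extraction JAA describes above, assuming GNU grep (`-P`, `-h`, and `-a` are GNU extensions); the scratch directory and file contents are made up for illustration:]

```shell
# Work in a scratch directory with hypothetical sample files.
dir=$(mktemp -d)
printf '<a href="http://example.com/1.html">one</a>\n' > "$dir/page1.html"
printf 'no links here\n' > "$dir/page2.html"

# -P: Perl-compatible regex (needed for the lookbehind);
# -h: suppress the filename prefix on each output line;
# -o: print only the matched text;
# -a: treat binary-looking input (e.g. files with NUL bytes) as text,
#     so matches go to the output instead of "binary file matches".
grep -Phoa '(?<=href=")[^"]*' "$dir"/*.html >> "$dir/links1.txt"
cat "$dir/links1.txt"    # prints http://example.com/1.html
```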
04:19:25<Doranwen>The output file opens without any issues, so I doubt it found any.
04:19:40<@JAA>I mean the literal string, not binary data in the output file.
04:19:41<Doranwen>When it has binary stuff it usually gives my text editor fits trying to open it, lol.
04:19:54<Doranwen>Ahh
04:19:55<@JAA>> printf '\0meow' | grep meow
04:19:55<@JAA>grep: (standard input): binary file matches
04:20:11<Doranwen>Nope. The word "binary" is not in the output file.
04:20:22<@JAA>Then I'd need an example file.
04:20:57<@JAA>I.e. one that has a link which isn't extracted.
04:21:24<Doranwen>Give me a sec and I'll dump it up.
04:22:24<Doranwen>https://transfer.archivete.am/GcD4I/sample.zip
04:22:25<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/GcD4I/sample.zip
04:22:41<Doranwen>That's only a subset of them, but has several with links that were missing when I tried it.
04:22:58<Doranwen>I can give you the entire set if you want it .
04:23:29<@JAA>Ah, your HTML files are UTF-16, not UTF-8.
04:23:49Doranwen sighs.
04:23:56<Doranwen>I figured it had to be the encoding switch.
04:24:09<@JAA>> xxd jane-maura-ff-100043.html | head -n 2
04:24:09<@JAA>00000000: fffe 3c00 6800 7400 6d00 6c00 3e00 3c00 ..<.h.t.m.l.>.<.
04:24:09<@JAA>00000010: 6800 6500 6100 6400 3e00 3c00 7400 6900 h.e.a.d.>.<.t.i.
04:24:14<Doranwen>Before this all the stuff we got seemed to be in ISO whatever it is - because any Russian characters ended up as ??????
04:24:43<Doranwen>And we pestered the guy who made the archiving tool and he fixed that - but apparently his fix was to make it utf-16 rather than utf-8.
04:26:11<Doranwen>I'll pester him to fix that, LOL.
04:26:46<Doranwen>Thank you for helping me diagnose the issue!
04:26:54<Doranwen>I wondered why everything seemed to be working fine until today, lol.
04:28:08<@JAA>You can also use `iconv` to convert it. `iconv -f utf16 -t utf8` is probably sufficient.
04:28:44<@JAA>But you'll want to only pass the UTF-16 files into that, of course.
04:29:28<Doranwen>How do I tell which ones are UTF-16 and which are UTF-8? I'm thinking because I was able to extract *some* links that some of them have to be utf-8?
04:30:52<Doranwen>The ones I got links from weren't in the sample I sent you.
04:31:17<@JAA>Well, they don't necessarily have to be UTF-8, but yeah.
04:31:35<@JAA>Clearly not all files in your directory are the same encoding, and everything gets painful at that point.
04:32:52<nicolas17>ff fe prefix is a pretty sure sign of UTF-16
04:34:05<Doranwen>Yeah, it might be easier to just re-download this comm (and the other one I grabbed with the new version) with an old version of the archiving tool.
04:34:39<Doranwen>I'll lose a few hours of downloading but it'll be a lot less hassle.
04:34:52<Doranwen>I still am not sure how I would - in one fell swoop - see which files are utf-8 and which are utf-16.
04:35:01<@JAA>Yeah, UTF-16-encoded data usually starts with a Byte Order Mark, either FF FE or FE FF depending on the flavour. (It's why that codepoint even exists.)
04:42:45<Doranwen>Thanks again for the help. I was pulling my hair out trying to figure out why this script didn't seem to be working anymore, lol.
04:43:25<Doranwen>And I'll replace that loop with the grep command then.
04:45:21<@JAA>It's immediately visible in `less` (because it renders out all the NUL bytes), and `file` should probably report it as well.
04:46:02<@JAA>Or `xxd $filename | head` as I did above.
04:46:10<@JAA>Just some general tips for this kind of situation. :-)
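[editor's note: the BOM check and `iconv` conversion discussed above can be sketched as one loop; file names and contents are illustrative, and `iconv -f UTF-16` uses the leading BOM (FF FE or FE FF) to pick the byte order:]

```shell
dir=$(mktemp -d)
# One UTF-16 file (iconv emits a BOM) and one plain UTF-8 file.
printf '<html>hi</html>' | iconv -f UTF-8 -t UTF-16 > "$dir/a.html"
printf '<html>hi</html>' > "$dir/b.html"

for f in "$dir"/*.html; do
  # Read the first two bytes as hex; FF FE / FE FF mark UTF-16 LE/BE.
  bom=$(head -c 2 "$f" | od -An -tx1 | tr -d ' \n')
  case "$bom" in
    fffe|feff)
      # Convert in place; the BOM is consumed and not re-emitted.
      iconv -f UTF-16 -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f" ;;
  esac
done
```

Only the files that actually start with a BOM are touched, which matches JAA's caveat about passing only the UTF-16 files into `iconv`.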
05:01:17HackMii quits [Remote host closed the connection]
05:01:36HackMii (hacktheplanet) joins
05:32:43sec^nd quits [Remote host closed the connection]
05:33:05sec^nd (second) joins
05:46:10teppum quits [Remote host closed the connection]
06:17:13ArchivalEfforts quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
06:17:23ArchivalEfforts joins
08:52:56ducky quits [Ping timeout: 268 seconds]
08:53:09ducky (ducky) joins
08:54:25Dango360 quits [Quit: The Lounge - https://thelounge.chat]
09:30:09cipherrot quits [Ping timeout: 272 seconds]
09:37:48Snivy quits [Quit: The Lounge - https://thelounge.chat]
09:38:11petrichor (petrichor) joins
09:38:18Snivy (Snivy) joins
09:38:23Snivy quits [Remote host closed the connection]
09:39:43Snivy (Snivy) joins
10:03:54rohvani quits [Quit: The Lounge - https://thelounge.chat]
10:09:54@arkiver quits [Remote host closed the connection]
10:10:21arkiver (arkiver) joins
10:10:21@ChanServ sets mode: +o arkiver
10:14:12fireatseaparks quits [Remote host closed the connection]
10:14:52fireatseaparks (fireatseaparks) joins
10:26:37APOLLO03 joins
11:39:38Cornelius7 (Cornelius) joins
11:41:15Cornelius quits [Ping timeout: 272 seconds]
11:41:15Cornelius7 is now known as Cornelius
11:47:37irisfreckles13 joins
11:58:52APOLLO03 quits [Read error: Connection reset by peer]
11:59:51APOLLO03 joins
12:00:03Bleo1826007227196234552220 quits [Quit: The Lounge - https://thelounge.chat]
12:02:46Bleo1826007227196234552220 joins
12:37:37petrichor quits [Client Quit]
13:23:15petrichor (petrichor) joins
14:03:07irisfreckles13 quits [Ping timeout: 272 seconds]
14:12:44Dada joins
14:18:22Dango360 (Dango360) joins
15:09:03irisfreckles13 joins
16:40:37Goofybally9 quits [Quit: The Lounge - https://thelounge.chat]
16:41:24Goofybally joins
16:42:12Goofybally quits [Client Quit]
16:42:44Goofybally (Goofybally) joins
16:52:03DogsRNice joins
17:54:22corentin quits [Ping timeout: 268 seconds]
18:10:39Webuser810542 joins
18:27:13@rewby quits [Ping timeout: 272 seconds]
18:29:20rewby (rewby) joins
18:29:20@ChanServ sets mode: +o rewby
19:04:40APOLLO03 quits [Ping timeout: 268 seconds]
19:04:43APOLLO03 joins
19:33:43twiswist quits [Ping timeout: 272 seconds]
19:54:53APOLLO03a joins
19:57:42APOLLO03 quits [Ping timeout: 268 seconds]
20:05:58<steering>Doranwen: linux command line max length is in megabytes, meaning unless your filenames are spectacularly long you can fit thousands of them in a command line
20:06:57<steering>(over 100,000 files if they're all 20 characters)
20:08:07<klea>Well, it also relies on you not using absolute paths for everything too :p
20:08:13<klea>since then it'd add up.
20:08:51<steering>even if your paths are 100 chars long you can still have 20,971 files
20:09:02<steering>(minus the command name etc)
20:09:46<steering>so yeah, not really something to worry about until you have many thousands of files
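[editor's note: steering's figures can be checked directly; `getconf ARG_MAX` reports the kernel's combined argv+environment limit for exec(), commonly 2 MiB on Linux (the exact per-exec budget also depends on stack size):]

```shell
# Maximum combined length of arguments + environment for exec().
getconf ARG_MAX
# On typical Linux systems this prints 2097152 (2 MiB); at roughly
# 20 bytes per filename that is on the order of 100,000 arguments,
# which matches the estimate above.
```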
20:09:47<klea>btw, pro tip, don't put random unicode characters down your user's throats.
20:10:14<klea> | gsub("\n"; "\u2028") -> | gsub("\n"; "↳") fixed part of a rendering issue in my own crap program using jq.
20:10:18<steering>💩
20:10:38Doranwen quits [Read error: Connection reset by peer]
20:10:42<klea>no, don't ask me why I make a shell script that uses jq for most of its work, and then proceeds to call fzf to have it call itself again.
20:10:47<klea>s/make/made/
20:10:50<steering>anyway once you get to enough files to exceed command line length (on linux), you won't want to turn to a for loop anyway :P
20:11:01Doranwen (Doranwen) joins
20:11:02<steering>aaaand how dare you peer
20:11:06rohvani joins
20:11:23<klea>peer?
20:15:40<steering>find -exec ... {} + is super useful when you can use it
20:15:59<klea>2026-02-17 20:10:50 <steering> anyway once you get to enough files to exceed command line length (on linux), you won't want to turn to a for loop anyway :P <- I should make something like 4096 character long machine id, and then do things like grep 'thing'
20:15:59<klea>/srv/archivthingy/data/machine/${MACHINE_ID}/project/${PROJECT_ID}/work/items/${WORKITEM_ID}/mass_datas/{$MASSDATAID1,$MASSDATAID2,$MASSDATAID3}/*/*.txt :p
20:16:10<steering>(can't use it if every filename needs to be used with some switch or if you need some other arg at the end...)
20:16:27<klea>s/+/\;/ in those cases.
20:17:07<steering>yeah and then you get to start a million new processes instead of a hundred :P
20:17:47<klea>(of course, every ID is at least 4096 unicode characters long, and it's full of multibyte strings (yes, everyone loves a 'k' that has some other letters attached)
20:18:11<steering>yeah, that's what's nice about find, it doesn't matter what weird characters are in the filename :P
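[editor's note: a sketch of the `find -exec ... {} +` pattern steering mentions; the directory and pattern are illustrative. `{} +` appends batches of paths to a single command invocation (staying under ARG_MAX), whereas `\;` forks one process per file, and `find` passes filenames as arguments, so spaces and other odd characters never go through the shell:]

```shell
dir=$(mktemp -d)
printf '<a href="http://example.com/x.html">x</a>\n' > "$dir/one.html"
printf 'plain text\n' > "$dir/two.txt"

# -type f -name '*.html' selects files without shell globbing;
# '{} +' batches as many paths per grep invocation as will fit.
find "$dir" -type f -name '*.html' \
  -exec grep -Pho '(?<=href=")[^"]*' {} +
```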
20:18:56klea sticks a '\0' in steering's filenames.
20:19:17<steering>good luck with that
20:19:25<@JAA>The one thing you can't do.
20:19:28<steering>linux might be unhappy before my program gets a chance to be
20:19:36<klea>linux isn't everyone else :p
20:19:48<@JAA>Which file system supports NULs in filenames?
20:19:50<steering>i'm pretty sure it applies to everyone else too
20:19:58<steering>but i also don't really care
20:20:46<klea>Any file system made for WASI should :p https://github.com/WebAssembly/WASI/blob/main/proposals/filesystem/wit-0.3.0-draft/types.wit#L6-L7
20:21:11<steering>a system supporting NULs in filenames is either a useless point of nerdery or completely unlike any current OS to the point where there's no expectation that it would even have a find program or be in any way compatible
20:21:25<klea>Yeah
20:21:28<steering>considering that you can't access such a file through any normal system call
20:21:33<klea>https://stackoverflow.com/a/54208483
20:21:51iPwnedYourIOTSmartdog quits [Ping timeout: 272 seconds]
20:22:09<klea>Apparently unixes and windows don't support it, but you can stick U+0000 somehow into some stuffs, which will be fun if you parse incorrectly.
20:22:15<steering>>useless point of nerdery
20:22:29<klea>yep
20:23:50steering is not entirely sure how one would get windows to "treat" a filename as utf-8
20:24:19<steering>i thought it was all utf-16 but idk anything about windows programming :P
20:25:34<klea>Neither do I :p
20:42:01<klea>https://groups.google.com/g/mozilla.dev.security.policy/c/JFwqZx7RLL0/m/T-J9gxUNAQAJ#:~:text=The%20GDPR%20issues%20surrounding%20WHOIS%20and%20RDAP%20have%20already%20led%20it%20to%20be%20compelling%20in%20its%20own%20right%2E%20Most
20:42:10<steering>hmm
20:42:25<klea>Why not have rdap have a array of sha512 hashed email addresses?
20:42:39<steering>i wonder if python (surrogateescape) presents any problems with such a situation
20:43:12<steering>i guess one would have to re-encode with ignore or something
20:44:17<steering>hashing doesn't give you privacy with email addresses
20:45:37<klea>unless you hash in some specific way?
20:45:51<klea>like, if CA hashing standard is to hash email with some text before and after the email?
20:46:37<klea>with that specific text, knowing an email address would still let you check each record to see if it's on some specific domain, yes, but it wouldn't let you correlate two hashes made for different purposes?
20:47:09<klea>(the texts would have to either be also given with the hash, or be static)
20:48:36<katia>what is going on in here
20:49:55<steering>that doesn't help.
20:50:34<steering>what you end up doing if you want to maintain any privacy with hashes like that is truncating the hash to only the first few bits
20:51:06<steering>because, as it turns out, publishing something that's publicly confirmable is the literal opposite of maintaining info privately
20:55:03<steering>i can go try 10GH/s of sha256 on my computer after all, and besides that the very fact that it will confirm a suspected email would be a problem
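[editor's note: steering's point that a published hash is publicly confirmable takes one line to demonstrate; the address is made up:]

```shell
# Publishing sha256("alice@example.com") lets anyone confirm a guessed
# address by hashing it themselves and comparing digests, with or
# without a static prefix/suffix; that is why truncating the hash to a
# few bits (making many addresses collide) is needed for any privacy.
printf 'alice@example.com' | sha256sum | cut -d' ' -f1
```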
20:58:30<klea>oh
20:58:49twiswist (twiswist) joins
21:00:23<@JAA>Putting C0 codes into filenames in general is nerdery. But it wasn't historically restricted, so we still have to write our code to handle it.
21:00:48<@JAA>IIRC, there's some movement towards banning LF in POSIX.
21:01:13<klea>Sorry, I thought privacy was defined as something else.
21:03:07<@JAA>Hashing is irrelevant under the GDPR as well. As long as it's possible to correlate the values, it's still personal data.
21:25:49Ryz quits [Ping timeout: 272 seconds]
21:25:56Ryz (Ryz) joins
21:27:59<klea>oh
21:56:56Dada quits [Remote host closed the connection]
22:11:52Dada joins
22:21:25<nukke>https://www.gamedeveloper.com/programming/godot-co-founder-says-ai-slop-pull-requests-have-become-overwhelming
22:22:16<nukke>also this setting spawned because of all the AI crap https://github.blog/changelog/2026-02-13-new-repository-settings-for-configuring-pull-request-access/
22:23:25<klea>Oh I no longer need to keep my account always in lockdown mode?
22:23:26<klea>yay!
22:23:49<steering>or just don't use github :P
22:23:57<klea>I use it for forks :p
22:24:03<klea>to contribute back stuff.
22:24:29<klea>and then I have to do the thing of pushing my own stub commit to set as default commit to disable PRs, because else everyone who contributed to the original repo when I forked would be considered a contributor of my fork.
22:28:26<klea>Also, wut, why is there an agents tab. https://github.com/notklea/causalpkgs/
22:28:46<klea>Oh, maybe I should move that repo to Codeberg later.
22:44:28Dada quits [Remote host closed the connection]
22:59:33APOLLO03a quits [Ping timeout: 272 seconds]
22:59:33^ quits [Ping timeout: 272 seconds]
23:03:56^ (^) joins
23:13:51APOLLO03 joins
23:14:26Dada joins
23:22:20etnguyen03 (etnguyen03) joins
23:35:12Dada quits [Remote host closed the connection]