| 00:19:07 | | nine quits [Quit: See ya!] |
| 00:19:20 | | nine joins |
| 00:19:20 | | nine is now authenticated as nine |
| 00:19:20 | | nine quits [Changing host] |
| 00:19:20 | | nine (nine) joins |
| 00:32:50 | | etnguyen03 quits [Client Quit] |
| 00:40:14 | | SootBector quits [Remote host closed the connection] |
| 00:40:22 | <steering> | person: "it's now 1:30" - youtube auto captions: "it's now 01:30 hours" |
| 00:41:22 | | SootBector (SootBector) joins |
| 01:03:29 | | ducky quits [Ping timeout: 272 seconds] |
| 01:04:19 | | etnguyen03 (etnguyen03) joins |
| 02:22:10 | | sec^nd quits [Remote host closed the connection] |
| 02:22:35 | | sec^nd (second) joins |
| 02:44:47 | | APOLLO03 quits [Ping timeout: 268 seconds] |
| 02:47:44 | | ducky (ducky) joins |
| 02:56:13 | | nine quits [Ping timeout: 272 seconds] |
| 02:58:38 | | nine joins |
| 02:58:40 | | nine is now authenticated as nine |
| 02:58:40 | | nine quits [Changing host] |
| 02:58:40 | | nine (nine) joins |
| 03:13:09 | | iPwnedYourIOTSmartdog quits [Ping timeout: 268 seconds] |
| 03:13:48 | | iPwnedYourIOTSmartdog joins |
| 04:01:44 | | etnguyen03 quits [Remote host closed the connection] |
| 04:08:23 | <Doranwen> | Regex101 says this will extract the link but I just tested it and it's *not* extracting one of the links from an html file (and almost certainly other similar ones): `for f in *.html; do grep -Po '(?<=href=")[^"]*' "$f" >> links1.txt; done` |
| 04:08:43 | <Doranwen> | Sample from the html is this: `Read it here: <a href="http://unfortunateggs.livejournal.com/125304.html" target="_self"><strong>http://unfortunateggs.livejournal.com/125304.html</strong></a> </article> </div>` |
| 04:12:13 | <Doranwen> | It *should* be pulling that LJ link from the first bit - but it's not. I'm only seeing links from the *last* few html files rather than all of them, for some reason. |
| 04:15:18 | <@JAA> | Doranwen: Do you see any 'binary file matches' in the output? Are you using GNU grep (`grep --version`)? If so, add the `-a` option. |
| 04:15:41 | <@JAA> | Also, no need for a loop. |
| 04:15:50 | <Doranwen> | It did not give me any error about it. |
| 04:16:04 | <Doranwen> | It does need a loop if there's 700 html files, I would think? |
| 04:16:04 | <@JAA> | `grep -Pho '(?<=href=")[^"]*' *.html >>links1.txt` is equivalent. |
| 04:16:14 | <Doranwen> | Or not? |
| 04:16:45 | <@JAA> | You might need something special when you have many thousands of files. A loop wouldn't be ideal though. |
| 04:16:47 | <Doranwen> | The folder also has pdf files in it, so I thought it needed to have html specified to look at. |
| 04:17:44 | <Doranwen> | And sometimes there are csv files that I didn't want searched. |
| 04:17:54 | <@JAA> | I still use the same `*.html` glob as you do. |
| 04:18:03 | <Doranwen> | Ah |
| 04:18:04 | <@JAA> | `grep` can take multiple filename arguments. |
| 04:18:15 | <@JAA> | You're thinking of recursive mode instead. |
| 04:18:31 | <@JAA> | The `-h` option is to suppress the filename prefix on each line. |
| 04:19:08 | <@JAA> | And 'binary file matches' wouldn't be an error. It'd go into the output file. |
| 04:19:19 | <@JAA> | (Is that annoying? Yes.) |
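JAA's loop-free suggestion, with the `-a` and `-h` flags he mentions folded in, could be sketched like this (the `links1.txt` output name is carried over from Doranwen's original command; `-P` assumes GNU grep):

```shell
# -P  Perl-compatible regex with lookbehind (GNU grep only)
# -h  suppress the "filename:" prefix on each output line
# -o  print only the matched part, not the whole line
# -a  treat binary-looking files as text, so a match is printed
#     instead of the literal line "binary file matches"
grep -Paho '(?<=href=")[^"]*' *.html >> links1.txt
```

Because `grep` accepts multiple filename arguments, the shell expands `*.html` directly and no `for` loop is needed.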
| 04:19:25 | <Doranwen> | The output file opens without any issues, so I doubt it found any. |
| 04:19:40 | <@JAA> | I mean the literal string, not binary data in the output file. |
| 04:19:41 | <Doranwen> | When it has binary stuff it usually gives my text editor fits trying to open it, lol. |
| 04:19:54 | <Doranwen> | Ahh |
| 04:19:55 | <@JAA> | > printf '\0meow' | grep meow |
| 04:19:55 | <@JAA> | grep: (standard input): binary file matches |
| 04:20:11 | <Doranwen> | Nope. The word "binary" is not in the output file. |
| 04:20:22 | <@JAA> | Then I'd need an example file. |
| 04:20:57 | <@JAA> | I.e. one that has a link which isn't extracted. |
| 04:21:24 | <Doranwen> | Give me a sec and I'll dump it up. |
| 04:22:24 | <Doranwen> | https://transfer.archivete.am/GcD4I/sample.zip |
| 04:22:25 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/GcD4I/sample.zip |
| 04:22:41 | <Doranwen> | That's only a subset of them, but has several with links that were missing when I tried it. |
| 04:22:58 | <Doranwen> | I can give you the entire set if you want it . |
| 04:23:29 | <@JAA> | Ah, your HTML files are UTF-16, not UTF-8. |
| 04:23:49 | | Doranwen sighs. |
| 04:23:56 | <Doranwen> | I figured it had to be the encoding switch. |
| 04:24:09 | <@JAA> | > xxd jane-maura-ff-100043.html | head -n 2 |
| 04:24:09 | <@JAA> | 00000000: fffe 3c00 6800 7400 6d00 6c00 3e00 3c00 ..<.h.t.m.l.>.<. |
| 04:24:09 | <@JAA> | 00000010: 6800 6500 6100 6400 3e00 3c00 7400 6900 h.e.a.d.>.<.t.i. |
| 04:24:14 | <Doranwen> | Before this all the stuff we got seemed to be in ISO whatever it is - because any Russian characters ended up as ?????? |
| 04:24:43 | <Doranwen> | And we pestered the guy who made the archiving tool and he fixed that - but apparently his fix was to make it utf-16 rather than utf-8. |
| 04:26:11 | <Doranwen> | I'll pester him to fix that, LOL. |
| 04:26:46 | <Doranwen> | Thank you for helping me diagnose the issue! |
| 04:26:54 | <Doranwen> | I wondered why everything seemed to be working fine until today, lol. |
| 04:28:08 | <@JAA> | You can also use `iconv` to convert it. `iconv -f utf16 -t utf8` is probably sufficient. |
| 04:28:44 | <@JAA> | But you'll want to only pass the UTF-16 files into that, of course. |
| 04:29:28 | <Doranwen> | How do I tell which ones are UTF-16 and which are UTF-8? I'm thinking because I was able to extract *some* links that some of them have to be utf-8? |
| 04:30:52 | <Doranwen> | The ones I got links from weren't in the sample I sent you. |
| 04:31:17 | <@JAA> | Well, they don't necessarily have to be UTF-8, but yeah. |
| 04:31:35 | <@JAA> | Clearly not all files in your directory are the same encoding, and everything gets painful at that point. |
| 04:32:52 | <nicolas17> | ff fe prefix is a pretty sure sign of UTF-16 |
| 04:34:05 | <Doranwen> | Yeah, it might be easier to just re-download this comm (and the other one I grabbed with the new version) with an old version of the archiving tool. |
| 04:34:39 | <Doranwen> | I'll lose a few hours of downloading but it'll be a lot less hassle. |
| 04:34:52 | <Doranwen> | I still am not sure how I would - in one fell swoop - see which files are utf-8 and which are utf-16. |
| 04:35:01 | <@JAA> | Yeah, UTF-16-encoded data usually starts with a Byte Order Mark, either FF FE or FE FF depending on the flavour. (It's why that codepoint even exists.) |
| 04:42:45 | <Doranwen> | Thanks again for the help. I was pulling my hair out trying to figure out why this script didn't seem to be working anymore, lol. |
| 04:43:25 | <Doranwen> | And I'll replace that loop with the grep command then. |
| 04:45:21 | <@JAA> | It's immediately visible in `less` (because it renders out all the NUL bytes), and `file` should probably report it as well. |
| 04:46:02 | <@JAA> | Or `xxd $filename | head` as I did above. |
| 04:46:10 | <@JAA> | Just some general tips for this kind of situation. :-) |
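The BOM check and `iconv` conversion discussed above could be combined into one pass; this is a sketch, and the loop plus the `.utf8.html` output naming are illustrative, not something from the thread:

```shell
# For each .html file, read the first two bytes; an FF FE or FE FF
# Byte Order Mark marks it as UTF-16 (little- or big-endian).
# Convert only those files to UTF-8 copies; iconv's UTF-16 mode
# uses the BOM to pick the right byte order.
for f in *.html; do
    bom=$(head -c 2 "$f" | xxd -p)
    if [ "$bom" = "fffe" ] || [ "$bom" = "feff" ]; then
        iconv -f UTF-16 -t UTF-8 "$f" > "${f%.html}.utf8.html"
    fi
done
```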
| 05:01:17 | | HackMii quits [Remote host closed the connection] |
| 05:01:36 | | HackMii (hacktheplanet) joins |
| 05:32:43 | | sec^nd quits [Remote host closed the connection] |
| 05:33:05 | | sec^nd (second) joins |
| 05:46:10 | | teppum quits [Remote host closed the connection] |
| 06:17:13 | | ArchivalEfforts quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
| 06:17:23 | | ArchivalEfforts joins |
| 08:52:56 | | ducky quits [Ping timeout: 268 seconds] |
| 08:53:09 | | ducky (ducky) joins |
| 08:54:25 | | Dango360 quits [Quit: The Lounge - https://thelounge.chat] |
| 09:30:09 | | cipherrot quits [Ping timeout: 272 seconds] |
| 09:37:48 | | Snivy quits [Quit: The Lounge - https://thelounge.chat] |
| 09:38:11 | | petrichor (petrichor) joins |
| 09:38:18 | | Snivy (Snivy) joins |
| 09:38:23 | | Snivy quits [Remote host closed the connection] |
| 09:39:43 | | Snivy (Snivy) joins |
| 10:03:54 | | rohvani quits [Quit: The Lounge - https://thelounge.chat] |
| 10:09:54 | | @arkiver quits [Remote host closed the connection] |
| 10:10:21 | | arkiver (arkiver) joins |
| 10:10:21 | | @ChanServ sets mode: +o arkiver |
| 10:14:12 | | fireatseaparks quits [Remote host closed the connection] |
| 10:14:52 | | fireatseaparks (fireatseaparks) joins |
| 10:26:37 | | APOLLO03 joins |
| 11:39:38 | | Cornelius7 (Cornelius) joins |
| 11:41:15 | | Cornelius quits [Ping timeout: 272 seconds] |
| 11:41:15 | | Cornelius7 is now known as Cornelius |
| 11:47:37 | | irisfreckles13 joins |
| 11:58:52 | | APOLLO03 quits [Read error: Connection reset by peer] |
| 11:59:51 | | APOLLO03 joins |
| 12:00:03 | | Bleo1826007227196234552220 quits [Quit: The Lounge - https://thelounge.chat] |
| 12:02:46 | | Bleo1826007227196234552220 joins |
| 12:37:37 | | petrichor quits [Client Quit] |
| 13:23:15 | | petrichor (petrichor) joins |
| 14:03:07 | | irisfreckles13 quits [Ping timeout: 272 seconds] |
| 14:12:44 | | Dada joins |
| 14:18:22 | | Dango360 (Dango360) joins |
| 15:09:03 | | irisfreckles13 joins |
| 16:40:37 | | Goofybally9 quits [Quit: The Lounge - https://thelounge.chat] |
| 16:41:24 | | Goofybally joins |
| 16:42:12 | | Goofybally quits [Client Quit] |
| 16:42:44 | | Goofybally (Goofybally) joins |
| 16:52:03 | | DogsRNice joins |
| 17:54:22 | | corentin quits [Ping timeout: 268 seconds] |
| 18:10:39 | | Webuser810542 joins |
| 18:27:13 | | @rewby quits [Ping timeout: 272 seconds] |
| 18:29:20 | | rewby (rewby) joins |
| 18:29:20 | | @ChanServ sets mode: +o rewby |
| 19:04:40 | | APOLLO03 quits [Ping timeout: 268 seconds] |
| 19:04:43 | | APOLLO03 joins |
| 19:33:43 | | twiswist quits [Ping timeout: 272 seconds] |
| 19:54:53 | | APOLLO03a joins |
| 19:57:42 | | APOLLO03 quits [Ping timeout: 268 seconds] |
| 20:05:58 | <steering> | Doranwen: linux command line max length is in megabytes, meaning unless your filenames are spectacularly long you can fit thousands of them in a command line |
| 20:06:57 | <steering> | (over 100,000 files if they're all 20 characters) |
| 20:08:07 | <klea> | Well, it also relies on you not using absolute paths for everything too :p |
| 20:08:13 | <klea> | since then it'd add up. |
| 20:08:51 | <steering> | even if your paths are 100 chars long you can still have 20,971 files |
| 20:09:02 | <steering> | (minus the command name etc) |
| 20:09:46 | <steering> | so yeah, not really something to worry about until you have many thousands of files |
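The limit steering is estimating against can be queried directly; the exact value varies by system, so the figure in the comment is only a common Linux default:

```shell
# Kernel limit on the combined size of argv + environment for one exec.
# On Linux this is commonly 2097152 bytes (2 MiB), which is why tens of
# thousands of ~100-byte paths fit on a single command line.
getconf ARG_MAX
```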
| 20:09:47 | <klea> | btw, pro tip, don't put random unicode characters down your user's throats. |
| 20:10:14 | <klea> | | gsub("\n"; "\u2028") -> | gsub("\n"; "↳") fixed part of a rendering issue in my own crap program using jq. |
| 20:10:18 | <steering> | 💩 |
| 20:10:38 | | Doranwen quits [Read error: Connection reset by peer] |
| 20:10:42 | <klea> | no, don't ask me why I make a shell script that uses jq for most of its work, and then proceeds to call fzf to have it call itself again. |
| 20:10:47 | <klea> | s/make/made/ |
| 20:10:50 | <steering> | anyway once you get to enough files to exceed command line length (on linux), you won't want to turn to a for loop anyway :P |
| 20:11:01 | | Doranwen (Doranwen) joins |
| 20:11:02 | <steering> | aaaand how dare you peer |
| 20:11:06 | | rohvani joins |
| 20:11:23 | <klea> | peer? |
| 20:15:40 | <steering> | find -exec ... {} + is super useful when you can use it |
| 20:15:59 | <klea> | 2026-02-17 20:10:50 <steering> anyway once you get to enough files to exceed command line length (on linux), you won't want to turn to a for loop anyway :P <- I should make something like 4096 character long machine id, and then do things like grep 'thing' |
| 20:15:59 | <klea> | /srv/archivthingy/data/machine/${MACHINE_ID}/project/${PROJECT_ID}/work/items/${WORKITEM_ID}/mass_datas/{$MASSDATAID1,$MASSDATAID2,$MASSDATAID3}/*/*.txt :p |
| 20:16:10 | <steering> | (can't use it if every filename needs to be used with some switch or if you need some other arg at the end...) |
| 20:16:27 | <klea> | s/+/\;/ in those cases. |
| 20:17:07 | <steering> | yeah and then you get to start a million new processes instead of a hundred :P |
| 20:17:47 | <klea> | (of course, every ID is at least 4096 unicode characters long, and it's full of multibyte strings (yes, everyone loves a 'k' that has some other letters attached)) |
| 20:18:11 | <steering> | yeah, that's what's nice about find, it doesn't matter what weird characters are in the filename :P |
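The `find -exec ... {} +` pattern steering recommends could be sketched as below, reusing the earlier href regex; `some-command --input` in the second form is a hypothetical placeholder:

```shell
# Batch form (+): find packs as many filenames as fit into each grep
# invocation, so only a handful of processes start even for huge trees,
# and weird characters in filenames never pass through a shell.
find . -maxdepth 1 -name '*.html' \
    -exec grep -Paho '(?<=href=")[^"]*' {} + >> links1.txt

# Per-file form (\;): one process per file - needed when the filename
# can't be the last argument, but far slower at scale.
# find . -name '*.html' -exec some-command --input {} \;
```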
| 20:18:56 | | klea sticks a '\0' in steering's filenames. |
| 20:19:17 | <steering> | good luck with that |
| 20:19:25 | <@JAA> | The one thing you can't do. |
| 20:19:28 | <steering> | linux might be unhappy before my program gets a chance to be |
| 20:19:36 | <klea> | linux isn't everyone else :p |
| 20:19:48 | <@JAA> | Which file system supports NULs in filenames? |
| 20:19:50 | <steering> | i'm pretty sure it applies to everyone else too |
| 20:19:58 | <steering> | but i also don't really care |
| 20:20:46 | <klea> | Any file system made for WASI should :p https://github.com/WebAssembly/WASI/blob/main/proposals/filesystem/wit-0.3.0-draft/types.wit#L6-L7 |
| 20:21:11 | <steering> | a system supporting NULs in filenames is either a useless point of nerdery or completely unlike any current OS to the point where there's no expectation that it would even have a find program or be in any way compatible |
| 20:21:25 | <klea> | Yeah |
| 20:21:28 | <steering> | considering that you can't access such a file through any normal system call |
| 20:21:33 | <klea> | https://stackoverflow.com/a/54208483 |
| 20:21:51 | | iPwnedYourIOTSmartdog quits [Ping timeout: 272 seconds] |
| 20:22:09 | <klea> | Apparently unixes and windows don't support it, but you can stick U+0000 somehow into some stuffs, which will be fun if you parse incorrectly. |
| 20:22:15 | <steering> | >useless point of nerdery |
| 20:22:29 | <klea> | yep |
| 20:23:50 | | steering is not entirely sure how one would get windows to "treat" a filename as utf-8 |
| 20:24:19 | <steering> | i thought it was all utf-16 but idk anything about windows programming :P |
| 20:25:34 | <klea> | Neither do I :p |
| 20:42:01 | <klea> | https://groups.google.com/g/mozilla.dev.security.policy/c/JFwqZx7RLL0/m/T-J9gxUNAQAJ#:~:text=The%20GDPR%20issues%20surrounding%20WHOIS%20and%20RDAP%20have%20already%20led%20it%20to%20be%20compelling%20in%20its%20own%20right%2E%20Most |
| 20:42:10 | <steering> | hmm |
| 20:42:25 | <klea> | Why not have rdap have a array of sha512 hashed email addresses? |
| 20:42:39 | <steering> | i wonder if python (surrogateescape) presents any problems with such a situation |
| 20:43:12 | <steering> | i guess one would have to re-encode with ignore or something |
| 20:44:17 | <steering> | hashing doesn't give you privacy with email addresses |
| 20:45:37 | <klea> | unless you hash in some specific way? |
| 20:45:51 | <klea> | like, if CA hashing standard is to hash email with some text before and after the email? |
| 20:46:37 | <klea> | that specific text would make you knowing an email address be to check each record for if it's on some specific domain, yes, but wouldn't make you be able to correlate two hashes for different purposes? |
| 20:47:09 | <klea> | (the texts would have to either be also given with the hash, or be static) |
| 20:48:36 | <katia> | what is going on in here |
| 20:49:55 | <steering> | that doesn't help. |
| 20:50:34 | <steering> | what you end up doing if you want to maintain any privacy with hashes like that is truncating the hash to only the first few bits |
| 20:51:06 | <steering> | because, as it turns out, publishing something that's publicly confirmable is the literal opposite of maintaining info privately |
| 20:55:03 | <steering> | i can go try 10GH/s of sha256 on my computer after all, and besides that the very fact that it will confirm a suspected email would be a problem |
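steering's point that a published unsalted hash is publicly confirmable takes one line to demonstrate; the address here is made up:

```shell
# Anyone holding a suspected address can recompute the digest and
# compare it with the published value, so sha512(email) confirms
# the address rather than protecting it. Only truncation (or a
# secret salt) breaks that one-to-one check.
printf '%s' 'alice@example.com' | sha512sum | cut -d' ' -f1
```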
| 20:58:30 | <klea> | oh |
| 20:58:49 | | twiswist (twiswist) joins |
| 21:00:23 | <@JAA> | Putting C0 codes into filenames in general is nerdery. But it wasn't historically restricted, so we still have to write our code to handle it. |
| 21:00:48 | <@JAA> | IIRC, there's some movement towards banning LF in POSIX. |
| 21:01:13 | <klea> | Sorry, I thought privacy was defined as something else. |
| 21:03:07 | <@JAA> | Hashing is irrelevant under the GDPR as well. As long as it's possible to correlate the values, it's still personal data. |
| 21:25:49 | | Ryz quits [Ping timeout: 272 seconds] |
| 21:25:56 | | Ryz (Ryz) joins |
| 21:27:59 | <klea> | oh |
| 21:56:56 | | Dada quits [Remote host closed the connection] |
| 22:11:52 | | Dada joins |
| 22:21:25 | <nukke> | https://www.gamedeveloper.com/programming/godot-co-founder-says-ai-slop-pull-requests-have-become-overwhelming |
| 22:22:16 | <nukke> | also this setting spawned because of all the AI crap https://github.blog/changelog/2026-02-13-new-repository-settings-for-configuring-pull-request-access/ |
| 22:23:25 | <klea> | Oh I no longer need to keep my account always in lockdown mode? |
| 22:23:26 | <klea> | yay! |
| 22:23:49 | <steering> | or just don't use github :P |
| 22:23:57 | <klea> | I use it for forks :p |
| 22:24:03 | <klea> | to contribute back stuff. |
| 22:24:29 | <klea> | and then I have to do the thing of pushing my own stub commit to set as default commit to disable PRs, because else everyone who contributed to the original repo when I forked would be considered a contributor of my fork. |
| 22:28:26 | <klea> | Also, wut, why is there an agents tab. https://github.com/notklea/causalpkgs/ |
| 22:28:46 | <klea> | Oh, maybe I should move that repo to Codeberg later. |
| 22:44:28 | | Dada quits [Remote host closed the connection] |
| 22:59:33 | | APOLLO03a quits [Ping timeout: 272 seconds] |
| 22:59:33 | | ^ quits [Ping timeout: 272 seconds] |
| 23:03:56 | | ^ (^) joins |
| 23:13:51 | | APOLLO03 joins |
| 23:14:26 | | Dada joins |
| 23:22:20 | | etnguyen03 (etnguyen03) joins |
| 23:35:12 | | Dada quits [Remote host closed the connection] |