| 00:19:07 | | nine quits [Quit: See ya!] |
| 00:19:20 | | nine joins |
| 00:19:20 | | nine is now authenticated as nine |
| 00:19:20 | | nine quits [Changing host] |
| 00:19:20 | | nine (nine) joins |
| 00:32:50 | | etnguyen03 quits [Client Quit] |
| 00:40:14 | | SootBector quits [Remote host closed the connection] |
| 00:40:22 | <steering> | person: "it's now 1:30" - youtube auto captions: "it's now 01:30 hours" |
| 00:41:22 | | SootBector (SootBector) joins |
| 01:03:29 | | ducky quits [Ping timeout: 272 seconds] |
| 01:04:19 | | etnguyen03 (etnguyen03) joins |
| 02:22:10 | | sec^nd quits [Remote host closed the connection] |
| 02:22:35 | | sec^nd (second) joins |
| 02:44:47 | | APOLLO03 quits [Ping timeout: 268 seconds] |
| 02:47:44 | | ducky (ducky) joins |
| 02:56:13 | | nine quits [Ping timeout: 272 seconds] |
| 02:58:38 | | nine joins |
| 02:58:40 | | nine is now authenticated as nine |
| 02:58:40 | | nine quits [Changing host] |
| 02:58:40 | | nine (nine) joins |
| 03:13:09 | | iPwnedYourIOTSmartdog quits [Ping timeout: 268 seconds] |
| 03:13:48 | | iPwnedYourIOTSmartdog joins |
| 04:01:44 | | etnguyen03 quits [Remote host closed the connection] |
| 04:08:23 | <Doranwen> | Regex101 says this will extract the link but I just tested it and it's *not* extracting one of the links from an html file (and almost certainly other similar ones): `for f in *.html; do grep -Po '(?<=href=")[^"]*' "$f" >> links1.txt; done` |
| 04:08:43 | <Doranwen> | Sample from the html is this: `Read it here: <a href="http://unfortunateggs.livejournal.com/125304.html" target="_self"><strong>http://unfortunateggs.livejournal.com/125304.html</strong></a> </article> </div>` |
| 04:12:13 | <Doranwen> | It *should* be pulling that LJ link from the first bit - but it's not. I'm only seeing links from the *last* few html files rather than all of them, for some reason. |
| 04:15:18 | <@JAA> | Doranwen: Do you see any 'binary file matches' in the output? Are you using GNU grep (`grep --version`)? If so, add the `-a` option. |
| 04:15:41 | <@JAA> | Also, no need for a loop. |
| 04:15:50 | <Doranwen> | It did not give me any error about it. |
| 04:16:04 | <Doranwen> | It does need a loop if there's 700 html files, I would think? |
| 04:16:04 | <@JAA> | `grep -Pho '(?<=href=")[^"]*' *.html >>links1.txt` is equivalent. |
| 04:16:14 | <Doranwen> | Or not? |
| 04:16:45 | <@JAA> | You might need something special when you have many thousands of files. A loop wouldn't be ideal though. |
| 04:16:47 | <Doranwen> | The folder also has pdf files in it, so I thought it needed to have html specified to look at. |
| 04:17:44 | <Doranwen> | And sometimes there are csv files that I didn't want searched. |
| 04:17:54 | <@JAA> | I still use the same `*.html` glob as you do. |
| 04:18:03 | <Doranwen> | Ah |
| 04:18:04 | <@JAA> | `grep` can take multiple filename arguments. |
| 04:18:15 | <@JAA> | You're thinking of recursive mode instead. |
| 04:18:31 | <@JAA> | The `-h` option is to suppress the filename prefix on each line. |
| 04:19:08 | <@JAA> | And 'binary file matches' wouldn't be an error. It'd go into the output file. |
| 04:19:19 | <@JAA> | (Is that annoying? Yes.) |
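JAA's loop-free suggestion, with the `-a` and `-h` flags he mentions folded in, could be sketched like this (the `links1.txt` output name is carried over from Doranwen's original command; `-P` assumes GNU grep):

```shell
# -P  Perl-compatible regex with lookbehind (GNU grep only)
# -h  suppress the "filename:" prefix on each output line
# -o  print only the matched part, not the whole line
# -a  treat binary-looking files as text, so a match is printed
#     instead of the literal line "binary file matches"
grep -Paho '(?<=href=")[^"]*' *.html >> links1.txt
```

Because `grep` accepts multiple filename arguments, the shell expands `*.html` directly and no `for` loop is needed.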
| 04:19:25 | <Doranwen> | The output file opens without any issues, so I doubt it found any. |
| 04:19:40 | <@JAA> | I mean the literal string, not binary data in the output file. |
| 04:19:41 | <Doranwen> | When it has binary stuff it usually gives my text editor fits trying to open it, lol. |
| 04:19:54 | <Doranwen> | Ahh |
| 04:19:55 | <@JAA> | > printf '\0meow' | grep meow |
| 04:19:55 | <@JAA> | grep: (standard input): binary file matches |
| 04:20:11 | <Doranwen> | Nope. The word "binary" is not in the output file. |
| 04:20:22 | <@JAA> | Then I'd need an example file. |
| 04:20:57 | <@JAA> | I.e. one that has a link which isn't extracted. |
| 04:21:24 | <Doranwen> | Give me a sec and I'll dump it up. |
| 04:22:24 | <Doranwen> | https://transfer.archivete.am/GcD4I/sample.zip |
| 04:22:25 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/GcD4I/sample.zip |
| 04:22:41 | <Doranwen> | That's only a subset of them, but has several with links that were missing when I tried it. |
| 04:22:58 | <Doranwen> | I can give you the entire set if you want it . |
| 04:23:29 | <@JAA> | Ah, your HTML files are UTF-16, not UTF-8. |
| 04:23:49 | | Doranwen sighs. |
| 04:23:56 | <Doranwen> | I figured it had to be the encoding switch. |
| 04:24:09 | <@JAA> | > xxd jane-maura-ff-100043.html | head -n 2 |
| 04:24:09 | <@JAA> | 00000000: fffe 3c00 6800 7400 6d00 6c00 3e00 3c00 ..<.h.t.m.l.>.<. |
| 04:24:09 | <@JAA> | 00000010: 6800 6500 6100 6400 3e00 3c00 7400 6900 h.e.a.d.>.<.t.i. |
| 04:24:14 | <Doranwen> | Before this all the stuff we got seemed to be in ISO whatever it is - because any Russian characters ended up as ?????? |
| 04:24:43 | <Doranwen> | And we pestered the guy who made the archiving tool and he fixed that - but apparently his fix was to make it utf-16 rather than utf-8. |
| 04:26:11 | <Doranwen> | I'll pester him to fix that, LOL. |
| 04:26:46 | <Doranwen> | Thank you for helping me diagnose the issue! |
| 04:26:54 | <Doranwen> | I wondered why everything seemed to be working fine until today, lol. |
| 04:28:08 | <@JAA> | You can also use `iconv` to convert it. `iconv -f utf16 -t utf8` is probably sufficient. |
| 04:28:44 | <@JAA> | But you'll want to only pass the UTF-16 files into that, of course. |
| 04:29:28 | <Doranwen> | How do I tell which ones are UTF-16 and which are UTF-8? I'm thinking because I was able to extract *some* links that some of them have to be utf-8? |
| 04:30:52 | <Doranwen> | The ones I got links from weren't in the sample I sent you. |
| 04:31:17 | <@JAA> | Well, they don't necessarily have to be UTF-8, but yeah. |
| 04:31:35 | <@JAA> | Clearly not all files in your directory are the same encoding, and everything gets painful at that point. |
| 04:32:52 | <nicolas17> | ff fe prefix is a pretty sure sign of UTF-16 |
| 04:34:05 | <Doranwen> | Yeah, it might be easier to just re-download this comm (and the other one I grabbed with the new version) with an old version of the archiving tool. |
| 04:34:39 | <Doranwen> | I'll lose a few hours of downloading but it'll be a lot less hassle. |
| 04:34:52 | <Doranwen> | I still am not sure how I would - in one fell swoop - see which files are utf-8 and which are utf-16. |
| 04:35:01 | <@JAA> | Yeah, UTF-16-encoded data usually starts with a Byte Order Mark, either FF FE or FE FF depending on the flavour. (It's why that codepoint even exists.) |
| 04:42:45 | <Doranwen> | Thanks again for the help. I was pulling my hair out trying to figure out why this script didn't seem to be working anymore, lol. |
| 04:43:25 | <Doranwen> | And I'll replace that loop with the grep command then. |
| 04:45:21 | <@JAA> | It's immediately visible in `less` (because it renders out all the NUL bytes), and `file` should probably report it as well. |
| 04:46:02 | <@JAA> | Or `xxd $filename | head` as I did above. |
| 04:46:10 | <@JAA> | Just some general tips for this kind of situation. :-) |
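The BOM check and `iconv` conversion discussed above could be combined into one pass; this is a sketch, and the loop plus the `.utf8.html` output naming are illustrative, not something from the thread:

```shell
# For each .html file, read the first two bytes; an FF FE or FE FF
# Byte Order Mark marks it as UTF-16 (little- or big-endian).
# Convert only those files to UTF-8 copies; iconv's UTF-16 mode
# uses the BOM to pick the right byte order.
for f in *.html; do
    bom=$(head -c 2 "$f" | xxd -p)
    if [ "$bom" = "fffe" ] || [ "$bom" = "feff" ]; then
        iconv -f UTF-16 -t UTF-8 "$f" > "${f%.html}.utf8.html"
    fi
done
```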
| 05:01:17 | | HackMii quits [Remote host closed the connection] |
| 05:01:36 | | HackMii (hacktheplanet) joins |
| 05:32:43 | | sec^nd quits [Remote host closed the connection] |
| 05:33:05 | | sec^nd (second) joins |
| 05:46:10 | | teppum quits [Remote host closed the connection] |
| 06:17:13 | | ArchivalEfforts quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
| 06:17:23 | | ArchivalEfforts joins |
| 08:52:56 | | ducky quits [Ping timeout: 268 seconds] |
| 08:53:09 | | ducky (ducky) joins |
| 08:54:25 | | Dango360 quits [Quit: The Lounge - https://thelounge.chat] |
| 09:30:09 | | cipherrot quits [Ping timeout: 272 seconds] |
| 09:37:48 | | Snivy quits [Quit: The Lounge - https://thelounge.chat] |
| 09:38:11 | | petrichor (petrichor) joins |
| 09:38:18 | | Snivy (Snivy) joins |
| 09:38:23 | | Snivy quits [Remote host closed the connection] |
| 09:39:43 | | Snivy (Snivy) joins |
| 10:03:54 | | rohvani quits [Quit: The Lounge - https://thelounge.chat] |
| 10:09:54 | | @arkiver quits [Remote host closed the connection] |
| 10:10:21 | | arkiver (arkiver) joins |
| 10:10:21 | | @ChanServ sets mode: +o arkiver |
| 10:14:12 | | fireatseaparks quits [Remote host closed the connection] |
| 10:14:52 | | fireatseaparks (fireatseaparks) joins |
| 10:26:37 | | APOLLO03 joins |
| 11:39:38 | | Cornelius7 (Cornelius) joins |
| 11:41:15 | | Cornelius quits [Ping timeout: 272 seconds] |
| 11:41:15 | | Cornelius7 is now known as Cornelius |
| 11:47:37 | | irisfreckles13 joins |
| 11:58:52 | | APOLLO03 quits [Read error: Connection reset by peer] |
| 11:59:51 | | APOLLO03 joins |
| 12:00:03 | | Bleo1826007227196234552220 quits [Quit: The Lounge - https://thelounge.chat] |
| 12:02:46 | | Bleo1826007227196234552220 joins |
| 12:37:37 | | petrichor quits [Client Quit] |
| 13:23:15 | | petrichor (petrichor) joins |
| 14:03:07 | | irisfreckles13 quits [Ping timeout: 272 seconds] |
| 14:12:44 | | Dada joins |
| 14:18:22 | | Dango360 (Dango360) joins |
| 15:09:03 | | irisfreckles13 joins |
| 16:40:37 | | Goofybally9 quits [Quit: The Lounge - https://thelounge.chat] |
| 16:41:24 | | Goofybally joins |
| 16:42:12 | | Goofybally quits [Client Quit] |
| 16:42:44 | | Goofybally (Goofybally) joins |
| 16:52:03 | | DogsRNice joins |
| 17:54:22 | | corentin quits [Ping timeout: 268 seconds] |
| 18:10:39 | | Webuser810542 joins |
| 18:27:13 | | @rewby quits [Ping timeout: 272 seconds] |
| 18:29:20 | | rewby (rewby) joins |
| 18:29:20 | | @ChanServ sets mode: +o rewby |
| 19:04:40 | | APOLLO03 quits [Ping timeout: 268 seconds] |
| 19:04:43 | | APOLLO03 joins |
| 19:33:43 | | twiswist quits [Ping timeout: 272 seconds] |
| 19:54:53 | | APOLLO03a joins |
| 19:57:42 | | APOLLO03 quits [Ping timeout: 268 seconds] |
| 20:05:58 | <steering> | Doranwen: linux command line max length is in megabytes, meaning unless your filenames are spectacularly long you can fit thousands of them in a command line |
| 20:06:57 | <steering> | (over 100,000 files if they're all 20 characters) |
| 20:08:07 | <klea> | Well, it also relies on you not using absolute paths for everything too :p |
| 20:08:13 | <klea> | since then it'd add up. |
| 20:08:51 | <steering> | even if your paths are 100 chars long you can still have 20,971 files |
| 20:09:02 | <steering> | (minus the command name etc) |
| 20:09:46 | <steering> | so yeah, not really something to worry about until you have many thousands of files |
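The limit steering is estimating against can be queried directly; the exact value varies by system, so the figure in the comment is only a common Linux default:

```shell
# Kernel limit on the combined size of argv + environment for one exec.
# On Linux this is commonly 2097152 bytes (2 MiB), which is why tens of
# thousands of ~100-byte paths fit on a single command line.
getconf ARG_MAX
```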
| 20:09:47 | <klea> | btw, pro tip, don't put random unicode characters down your user's throats. |
| 20:10:14 | <klea> | | gsub("\n"; "\u2028") -> | gsub("\n"; "↳") fixed part of a rendering issue in my own crap program using jq. |
| 20:10:18 | <steering> | 💩 |
| 20:10:38 | | Doranwen quits [Read error: Connection reset by peer] |
| 20:10:42 | <klea> | no, don't ask me why I make a shell script that uses jq for most of its work, and then proceeds to call fzf to have it call itself again. |
| 20:10:47 | <klea> | s/make/made/ |
| 20:10:50 | <steering> | anyway once you get to enough files to exceed command line length (on linux), you won't want to turn to a for loop anyway :P |
| 20:11:01 | | Doranwen (Doranwen) joins |
| 20:11:02 | <steering> | aaaand how dare you peer |
| 20:11:06 | | rohvani joins |
| 20:11:23 | <klea> | peer? |
| 20:15:40 | <steering> | find -exec ... {} + is super useful when you can use it |
| 20:15:59 | <klea> | 2026-02-17 20:10:50 <steering> anyway once you get to enough files to exceed command line length (on linux), you won't want to turn to a for loop anyway :P <- I should make something like 4096 character long machine id, and then do things like grep 'thing' |
| 20:15:59 | <klea> | /srv/archivthingy/data/machine/${MACHINE_ID}/project/${PROJECT_ID}/work/items/${WORKITEM_ID}/mass_datas/{$MASSDATAID1,$MASSDATAID2,$MASSDATAID3}/*/*.txt :p |
| 20:16:10 | <steering> | (can't use it if every filename needs to be used with some switch or if you need some other arg at the end...) |
| 20:16:27 | <klea> | s/+/\;/ in those cases. |
| 20:17:07 | <steering> | yeah and then you get to start a million new processes instead of a hundred :P |
| 20:17:47 | <klea> | (of course, every ID is at least 4096 unicode characters long, and it's full of multibyte strings (yes, everyone loves a 'k' that has some other letters attached)) |
| 20:18:11 | <steering> | yeah, that's what's nice about find, it doesn't matter what weird characters are in the filename :P |
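The `find -exec ... {} +` pattern steering recommends could be sketched as below, reusing the earlier href regex; `some-command --input` in the second form is a hypothetical placeholder:

```shell
# Batch form (+): find packs as many filenames as fit into each grep
# invocation, so only a handful of processes start even for huge trees,
# and weird characters in filenames never pass through a shell.
find . -maxdepth 1 -name '*.html' \
    -exec grep -Paho '(?<=href=")[^"]*' {} + >> links1.txt

# Per-file form (\;): one process per file - needed when the filename
# can't be the last argument, but far slower at scale.
# find . -name '*.html' -exec some-command --input {} \;
```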
| 20:18:56 | | klea sticks a '\0' in steering's filenames. |
| 20:19:17 | <steering> | good luck with that |
| 20:19:25 | <@JAA> | The one thing you can't do. |
| 20:19:28 | <steering> | linux might be unhappy before my program gets a chance to be |
| 20:19:36 | <klea> | linux isn't everyone else :p |
| 20:19:48 | <@JAA> | Which file system supports NULs in filenames? |
| 20:19:50 | <steering> | i'm pretty sure it applies to everyone else too |
| 20:19:58 | <steering> | but i also don't really care |
| 20:20:46 | <klea> | Any file system made for WASI should :p https://github.com/WebAssembly/WASI/blob/main/proposals/filesystem/wit-0.3.0-draft/types.wit#L6-L7 |
| 20:21:11 | <steering> | a system supporting NULs in filenames is either a useless point of nerdery or completely unlike any current OS to the point where there's no expectation that it would even have a find program or be in any way compatible |
| 20:21:25 | <klea> | Yeah |
| 20:21:28 | <steering> | considering that you can't access such a file through any normal system call |
| 20:21:33 | <klea> | https://stackoverflow.com/a/54208483 |
| 20:21:51 | | iPwnedYourIOTSmartdog quits [Ping timeout: 272 seconds] |
| 20:22:09 | <klea> | Apparently unixes and windows don't support it, but you can stick U+0000 somehow into some stuffs, which will be fun if you parse incorrectly. |
| 20:22:15 | <steering> | >useless point of nerdery |
| 20:22:29 | <klea> | yep |
| 20:23:50 | | steering is not entirely sure how one would get windows to "treat" a filename as utf-8 |
| 20:24:19 | <steering> | i thought it was all utf-16 but idk anything about windows programming :P |
| 20:25:34 | <klea> | Neither do I :p |
| 20:42:01 | <klea> | https://groups.google.com/g/mozilla.dev.security.policy/c/JFwqZx7RLL0/m/T-J9gxUNAQAJ#:~:text=The%20GDPR%20issues%20surrounding%20WHOIS%20and%20RDAP%20have%20already%20led%20it%20to%20be%20compelling%20in%20its%20own%20right%2E%20Most |
| 20:42:10 | <steering> | hmm |
| 20:42:25 | <klea> | Why not have rdap have a array of sha512 hashed email addresses? |
| 20:42:39 | <steering> | i wonder if python (surrogateescape) presents any problems with such a situation |
| 20:43:12 | <steering> | i guess one would have to re-encode with ignore or something |
| 20:44:17 | <steering> | hashing doesn't give you privacy with email addresses |
| 20:45:37 | <klea> | unless you hash in some specific way? |
| 20:45:51 | <klea> | like, if CA hashing standard is to hash email with some text before and after the email? |
| 20:46:37 | <klea> | that specific text would make you knowing an email address be to check each record for if it's on some specific domain, yes, but wouldn't make you be able to correlate two hashes for different purposes? |
| 20:47:09 | <klea> | (the texts would have to either be also given with the hash, or be static) |
| 20:48:36 | <katia> | what is going on in here |
| 20:49:55 | <steering> | that doesn't help. |
| 20:50:34 | <steering> | what you end up doing if you want to maintain any privacy with hashes like that is truncating the hash to only the first few bits |
| 20:51:06 | <steering> | because, as it turns out, publishing something that's publicly confirmable is the literal opposite of maintaining info privately |
| 20:55:03 | <steering> | i can go try 10GH/s of sha256 on my computer after all, and besides that the very fact that it will confirm a suspected email would be a problem |
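steering's point that a published unsalted hash is publicly confirmable takes one line to demonstrate; the address here is made up:

```shell
# Anyone holding a suspected address can recompute the digest and
# compare it with the published value, so sha512(email) confirms
# the address rather than protecting it. Only truncation (or a
# secret salt) breaks that one-to-one check.
printf '%s' 'alice@example.com' | sha512sum | cut -d' ' -f1
```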
| 20:58:30 | <klea> | oh |
| 20:58:49 | | twiswist (twiswist) joins |
| 21:00:23 | <@JAA> | Putting C0 codes into filenames in general is nerdery. But it wasn't historically restricted, so we still have to write our code to handle it. |
| 21:00:48 | <@JAA> | IIRC, there's some movement towards banning LF in POSIX. |
| 21:01:13 | <klea> | Sorry, I thought privacy was defined as something else. |
| 21:03:07 | <@JAA> | Hashing is irrelevant under the GDPR as well. As long as it's possible to correlate the values, it's still personal data. |
| 21:25:49 | | Ryz quits [Ping timeout: 272 seconds] |
| 21:25:56 | | Ryz (Ryz) joins |
| 21:27:59 | <klea> | oh |
| 21:56:56 | | Dada quits [Remote host closed the connection] |
| 22:11:52 | | Dada joins |
| 22:21:25 | <nukke> | https://www.gamedeveloper.com/programming/godot-co-founder-says-ai-slop-pull-requests-have-become-overwhelming |
| 22:22:16 | <nukke> | also this setting spawned because of all the AI crap https://github.blog/changelog/2026-02-13-new-repository-settings-for-configuring-pull-request-access/ |
| 22:23:25 | <klea> | Oh I no longer need to keep my account always in lockdown mode? |
| 22:23:26 | <klea> | yay! |
| 22:23:49 | <steering> | or just don't use github :P |
| 22:23:57 | <klea> | I use it for forks :p |
| 22:24:03 | <klea> | to contribute back stuff. |
| 22:24:29 | <klea> | and then I have to do the thing of pushing my own stub commit to set as default commit to disable PRs, because else everyone who contributed to the original repo when I forked would be considered a contributor of my fork. |
| 22:28:26 | <klea> | Also, wut, why is there an agents tab. https://github.com/notklea/causalpkgs/ |
| 22:28:46 | <klea> | Oh, maybe I should move that repo to Codeberg later. |
| 22:44:28 | | Dada quits [Remote host closed the connection] |
| 22:59:33 | | APOLLO03a quits [Ping timeout: 272 seconds] |
| 22:59:33 | | ^ quits [Ping timeout: 272 seconds] |
| 23:03:56 | | ^ (^) joins |
| 23:13:51 | | APOLLO03 joins |
| 23:14:26 | | Dada joins |
| 23:22:20 | | etnguyen03 (etnguyen03) joins |
| 23:35:12 | | Dada quits [Remote host closed the connection] |