00:23:33<@JAA>I'm compiling a list of www.people.vcu.edu user directories now for an !a < job.
00:23:59<@JAA>Turns out the WBM has ~10M URLs from the domain.
00:35:17<@JAA>That resulted in about 889 ~foo directories, including some noise I won't attempt to clean up.
00:35:36<@JAA>For context, the directory page only has 158.
01:14:55pabs quits [Ping timeout: 255 seconds]
03:27:14<@JAA>I've started an !a < job with 897 directories collected from the faculty directory at https://atoz.vcu.edu/personal , the CDX API, a DuckDuckGo search, and a Reddit search.
03:27:20<@JAA>I don't think the latter two added anything, but why not.
03:28:11<@JAA>The WBM data has a lot of noise as usual. Plenty of broken links, long-gone sites, etc.
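Editor's sketch of the CDX listing step mentioned above. It only builds a query URL for the Wayback Machine CDX endpoint and does not fetch anything. The helper name `build_cdx_url` and the choice of parameters are assumptions for illustration, not the query JAA actually ran; with roughly 10M captures, a real listing would also need paging.

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX search endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(host, limit=None):
    """Build a CDX query URL listing original URLs captured under a host.

    Hypothetical helper: the parameter selection is an assumption.
    """
    params = {
        "url": host + "/",       # everything below the host root
        "matchType": "prefix",   # treat `url` as a prefix
        "fl": "original",        # return only the original URL field
        "collapse": "urlkey",    # one row per distinct URL
    }
    if limit is not None:
        params["limit"] = str(limit)
    return CDX_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_cdx_url("www.people.vcu.edu", limit=5))
```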
03:32:38Jake quits [Quit: Leaving for a bit!]
03:32:48<@JAA>I mostly just listed these things and then did a `grep -Po '/(%7[Ee]|~)[^/?]+' | sed 's,%7[Ee],~,'`.
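Editor's sketch of the same extraction in Python, for readers without GNU grep's `-P`. The regex is copied from the pipeline above. The deduplication and sorting are an assumption (the log does not say how duplicates were handled), and the sample URLs with `alice` and `bob` are made up.

```python
import re

# Same pattern as the grep: a slash, a literal or percent-encoded tilde,
# then the directory name up to the next "/" or "?".
USER_DIR = re.compile(r"/(?:%7[Ee]|~)[^/?]+")
ENCODED_TILDE = re.compile(r"%7[Ee]")


def extract_user_dirs(lines):
    """Return the sorted, deduplicated /~user prefixes found in the lines."""
    found = set()
    for line in lines:
        for match in USER_DIR.findall(line.rstrip("\n")):
            # Like the sed step: normalise the first %7E or %7e to "~".
            found.add(ENCODED_TILDE.sub("~", match, count=1))
    return sorted(found)


if __name__ == "__main__":
    sample = [
        "https://www.people.vcu.edu/~alice/index.html",
        "http://www.people.vcu.edu/%7Ebob/pubs/?page=2",
        "http://www.people.vcu.edu/%7ealice",
    ]
    print(extract_user_dirs(sample))
```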
03:32:57<pokechu22>hmm, I think I found a few extra ones on Google
03:33:47<pokechu22>(in the 300 it lets you see)
03:34:23Jake (Jake) joins
03:35:44<@JAA>I don't expect there to be a whole lot of extra ones, so we can probably just do individual !a for those?
03:36:47<pokechu22>I currently have 13 and *maybe* there'll be a few more
03:37:08<@JAA>And probably also without the trailing slash so they can fan out. We can add ignores if it comes across big ones from my list.
03:37:16<@JAA>Hmm, maybe another !a < then. :-)
03:37:40<@JAA>I'm also scraping Bing, but it's always a bit random whether that yields anything useful.
03:38:25<@JAA>(And it's taking a while, and I wanted to start the primary job.)
03:41:46<pokechu22>OK, I got 14 more
03:42:34<@JAA>I also kept the list of actual URLs from DDG and Reddit, will probably !ao < those just in case they're not found by the main job for some reason.
04:15:13<@JAA>Bing search yielded nothing extra.
04:51:38pabs (pabs) joins
04:53:46pabs quits [Read error: Connection reset by peer]
04:54:21pabs (pabs) joins
04:55:45pabs quits [Remote host closed the connection]
04:56:11pabs (pabs) joins
05:38:56datechnoman quits [Quit: The Lounge - https://thelounge.chat]
05:41:02datechnoman (datechnoman) joins
06:01:33<pokechu22>Doesn't DDG use Bing internally?
06:02:41<@JAA>Yeah, among others, details clear as mud.
06:03:12<@JAA>I have previously got extra things from Bing because DDG usually only lets you get a few pages into the results.
09:15:38nulldata quits [Client Quit]
09:17:21nulldata (nulldata) joins
22:15:51Maturion joins
23:33:22Maturion quits [Remote host closed the connection]