00:23:33 | <@JAA> | I'm compiling a list of www.people.vcu.edu user directories now for an !a < job. |
00:23:59 | <@JAA> | Turns out the WBM has ~10M URLs from the domain. |
00:35:17 | <@JAA> | That resulted in about 889 ~foo directories, including some noise I won't attempt to clean up. |
00:35:36 | <@JAA> | For context, the directory page only has 158. |
01:14:55 | | pabs quits [Ping timeout: 255 seconds] |
03:27:14 | <@JAA> | I've started an !a < job with 897 directories collected from the faculty directory at https://atoz.vcu.edu/personal , the CDX API, a DuckDuckGo search, and a Reddit search. |
03:27:20 | <@JAA> | I don't think the latter two added anything, but why not. |
03:28:11 | <@JAA> | The WBM data has a lot of noise as usual. Plenty of broken links, long-gone sites, etc. |
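As a rough illustration of the CDX API step mentioned above, here is a minimal sketch that only builds a query URL for the Wayback Machine CDX server and does not fetch anything. The function name `build_cdx_query` and the choice of `matchType=host`, `fl=original`, and `collapse=urlkey` are assumptions for illustration; the log does not say which parameters were actually used.

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX server endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(host, limit=None):
    """Build a CDX query URL listing captured URLs for one host.

    Assumed parameters (not taken from the log):
      matchType=host   -> every capture on exactly this hostname
      fl=original      -> return only the original URL column
      collapse=urlkey  -> one row per distinct URL key
    """
    params = {
        "url": host,
        "matchType": "host",
        "fl": "original",
        "collapse": "urlkey",
    }
    if limit is not None:
        params["limit"] = str(limit)
    return CDX_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_cdx_query("www.people.vcu.edu", limit=5))
```

With roughly 10M URLs on the domain, a real run would likely need to page through the results rather than rely on a single request.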
03:32:38 | | Jake quits [Quit: Leaving for a bit!] |
03:32:48 | <@JAA> | I mostly just listed these things and then did a `grep -Po '/(%7[Ee]|~)[^/?]+' | sed 's,%7[Ee],~,'`. |
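For readers who prefer Python over the shell pipeline, the same extraction could be sketched roughly as follows. The regex is the one from the `grep -Po` call and the substitution mirrors the `sed` step; the function name `extract_user_dirs`, the deduplication, and the sorting are additions for illustration and are not part of the quoted command.

```python
import re

# Same pattern as the grep -Po call: a slash, then "~" or its
# percent-encoded form (%7E / %7e), then everything up to the next
# "/" or "?".
TILDE_DIR = re.compile(r"/(?:%7[Ee]|~)[^/?]+")


def extract_user_dirs(urls):
    """Return the sorted, deduplicated /~user path segments found in urls."""
    found = set()
    for url in urls:
        for match in TILDE_DIR.finditer(url.strip()):
            # Mirror sed 's,%7[Ee],~,': normalise the first encoded tilde.
            found.add(re.sub(r"%7[Ee]", "~", match.group(0), count=1))
    return sorted(found)


if __name__ == "__main__":
    sample = [
        "http://www.people.vcu.edu/~foo/index.html",
        "http://www.people.vcu.edu/%7Efoo/",
        "http://www.people.vcu.edu/%7ebar?x=1",
        "http://www.people.vcu.edu/about",
    ]
    for path in extract_user_dirs(sample):
        print(path)
```

The sample URLs are made up; `~foo` and `~bar` are placeholders in the same spirit as the "~foo directories" mentioned earlier in the log.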
03:32:57 | <pokechu22> | hmm, I think I found a few extra ones on Google |
03:33:47 | <pokechu22> | (in the 300 it lets you see) |
03:34:23 | | Jake (Jake) joins |
03:35:44 | <@JAA> | I don't expect there to be a whole lot of extra ones, so we can probably just do individual !a for those? |
03:36:47 | <pokechu22> | I currently have 13 and *maybe* there'll be a few more |
03:37:08 | <@JAA> | And probably also without the trailing slash so they can fan out. We can add ignores if it comes across big ones from my list. |
03:37:16 | <@JAA> | Hmm, maybe another !a < then. :-) |
03:37:40 | <@JAA> | I'm also scraping Bing, but that's always a bit random whether it yields anything useful. |
03:38:25 | <@JAA> | (And it's taking a while, and I wanted to start the primary job.) |
03:41:46 | <pokechu22> | OK, I got 14 more |
03:42:34 | <@JAA> | I also kept the list of actual URLs from DDG and Reddit, will probably !ao < those just in case they're not found by the main job for some reason. |
04:15:13 | <@JAA> | Bing search yielded nothing extra. |
04:51:38 | | pabs (pabs) joins |
04:53:46 | | pabs quits [Read error: Connection reset by peer] |
04:54:21 | | pabs (pabs) joins |
04:55:45 | | pabs quits [Remote host closed the connection] |
04:56:11 | | pabs (pabs) joins |
05:38:56 | | datechnoman quits [Quit: The Lounge - https://thelounge.chat] |
05:41:02 | | datechnoman (datechnoman) joins |
06:01:33 | <pokechu22> | Doesn't DDG use Bing internally? |
06:02:41 | <@JAA> | Yeah, among others, details clear as mud. |
06:03:12 | <@JAA> | I have previously got extra things from Bing because DDG usually only lets you get a few pages into the results. |
09:15:38 | | nulldata quits [Client Quit] |
09:17:21 | | nulldata (nulldata) joins |
22:15:51 | | Maturion joins |
23:33:22 | | Maturion quits [Remote host closed the connection] |