| 00:08:11 | <pokechu22> | https://people.umass.edu/pelliott/web2/web2a.html is pretty great | 
| 00:08:45 | <@JAA> | Oh neat! | 
| 00:11:07 | <@JAA> | This looks fairly decent: `grep -Po '^https?://([^/?]+\.)?people\.umass\.edu\.?(:\d+)?/\K[^/?]+' | sed 's,^%7E,,i; s,^%E2%88%BC,,i; s,^~,,' | grep -E '^[a-zA-Z0-9_.-]+$'` | 
| 00:11:58 | <@JAA> | 181 of 116k URLs from the CDX API don't make it through that, and they all look weird. | 
| 00:13:41 | <@JAA> | In case someone feels like hand-cleaning these, although probably most are already in the list in another form anyway: https://transfer.archivete.am/inline/y7TxB/people.umass.edu-cdx-api-weird-names.txt | 
| 00:14:17 | <@JAA> | After uniqify, I'm getting 156 unique dir names. | 
| 00:14:37 | <@JAA> | Er, no, I forgor to disable -v again. | 
| 00:14:45 | <@JAA> | 15429 dir name candidates | 
| 00:16:46 | <@JAA> | I was going to check a little sample, but turns out my OVH IP is banned. | 
| 00:18:42 | <pokechu22> | OK, here's my searching results: https://transfer.archivete.am/inline/mP5Jg/courses.umass.edu_people.umass.edu_urls.txt | 
| 00:25:48 | <pokechu22> | https://transfer.archivete.am/inline/vJhZt/people.umass.edu_weird_names_tidied.txt (some still have spaces, which may or may not be valid; I suspect program files and application data are junk, but they won't cause problems if they're in the list) | 
| 00:52:39 | <@JAA> | datechnoman: Can you check what you have for people.umass.edu and courses.umass.edu in your large URL lists? Simple grep/noisy data is fine. | 
| 02:34:38 |  | wickedplayer494 quits [Remote host closed the connection] | 
| 03:22:43 |  | wickedplayer494 joins | 
| 04:42:29 | <datechnoman> | JAA - Can do mate. Will take a while as the data set is so huge, but more than happy to provide what I can :) | 
| 04:42:35 | <datechnoman> | Will kick off some searches when I get home | 
| 04:44:23 | <@JAA> | datechnoman: Lovely, thanks. What time scale are we talking? I might already start a big job and then deal with additional stuff later. | 
| 04:47:11 | <datechnoman> | There are at least 30-40 billion URLs at this time in my collection. It will be a week or so to go through it all, but I can dump progress outputs as we go that you can use as seed URLs if that helps? JAA | 
| 04:47:35 | <datechnoman> | What timeframe are we looking at? | 
| 04:47:47 | <@JAA> | soon™ | 
| 04:47:57 | <@JAA> | No stated deadline anywhere. | 
| 04:48:26 | <@JAA> | 2024-11-06 16:56:36 < tmbr> no public announcement, but confirmed with their IT: the legacy tilde-style shared hosting at University of Massachusetts Amherst will be sunsetted "soon" | 
| 04:48:30 | <@JAA> | 2024-11-06 16:56:45 < tmbr> I've asked whether they have a timeline for pulling that server offline and if they'll be announcing, but so far all I have received is "the days are numbered" | 
| 04:48:55 | <datechnoman> | Ahhh good. I'll kick off a few streams of grep and see where we go. It's about 50TB of compressed (zst) text files :P | 
| 04:49:16 | <datechnoman> | Makes the HDDs go brrrrrrr and cough | 
| 04:49:18 | <@JAA> | And I thought the terabytes of JSONL I'm dealing with currently was bad. lol | 
| 04:50:05 | <datechnoman> | Yeah, the scale I've hoarded at far exceeds what's practical to process lol | 
| 04:50:20 | <datechnoman> | It's worth noting that is just cleaned URLs with no garbage >.< | 
| 04:51:41 | <datechnoman> | I really need to start sorting them into TLDs so I don't have to scan over everything each time | 
| 04:51:45 | <datechnoman> | It just takes too long | 
| 04:51:46 | <@JAA> | I think I'll get something started with what we already have (CDX and DDG) now. | 
| 04:52:39 | <@JAA> | The lists for com and net would still be massive. But yeah, hard thing to deal with. | 
| 04:53:17 | <datechnoman> | Oh for sure, but at this point any saving (even a few billion URLs) will be well worth it | 
| 04:54:04 | <@JAA> | You'll also have IPs, which you'd probably want to organise by subnet, not last octet. | 
| 04:55:58 | <@JAA> | I don't expect there to be a huge number of cross-user links based on how other personal uni sites tend to work, and a bit of duplication is fine, so might also just run whatever you (or others) find as another full recursive job. | 
| 04:56:21 | <@JAA> | These sites tend to not be huge because there's usually only a low amount of disk space available per user. | 
| 06:57:48 | <datechnoman> | TLDR - not many urls will be found. Gotcha! Will be home shortly to kick off some grep jobs  | 
| 07:06:27 | <@JAA> | My lists from the CDX API with the grep/sed processing mentioned above: https://transfer.archivete.am/lOuOC/people.umass.edu-unique-usernames-cdx.txt https://transfer.archivete.am/114dL/courses.umass.edu-unique-dirnames-cdx.txt | 
| 07:06:29 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/lOuOC/people.umass.edu-unique-usernames-cdx.txt https://transfer.archivete.am/inline/114dL/courses.umass.edu-unique-dirnames-cdx.txt | 
| 07:08:58 | <@JAA> | I will combine mP5Jg vJhZt lOuOC 114dL into one list per domain, each username on people with and without tilde (courses doesn't seem to use tildes), and start two !a < jobs for them. | 
| 07:09:34 | <@JAA> | Future lists can then be deduped against that and run as additional !a < jobs. | 
| 07:11:12 | <@JAA> | pokechu22: It might be worth running your DDG list as !ao < as well; I've seen cases where DDG returned things that weren't discoverable from links on the site. | 
| 07:12:40 | <pokechu22> | Yeah, that or check the meta-warc for duplication | 
| 07:19:03 | <@JAA> | https://transfer.archivete.am/TANTs/people.umass.edu-seeds-cdx-ddg.txt.zst | 
| 07:19:37 | <@JAA> | There will be lots of 404s. | 
| 07:41:46 | <@JAA> | There's some weirdness where it seems that 404s trigger 403s instead. See #archivebot in the past several minutes for details. | 
| 07:42:41 | <@JAA> | We'll have to do a manual spot check of those later. | 
| 07:46:17 | <pokechu22> | https://people.umass.edu/bzecchi seems to be a genuine 403 | 
| 07:48:08 | <@JAA> | Yeah, the ones that go 301→403 seem to be legitimate. | 
| 07:48:28 | <@JAA> | Those might be users that exist but don't have an index.html or similar. | 
| 07:48:49 | <@JAA> | https://transfer.archivete.am/1bREw/courses.umass.edu-seeds-cdx-ddg.txt.zst | 
| 09:15:17 |  | nulldata quits [Client Quit] | 
| 09:17:13 |  | nulldata (nulldata) joins | 
| 09:56:15 |  | pokechu22 quits [Ping timeout: 260 seconds] | 
| 10:17:33 |  | pokechu22 (pokechu22) joins | 
| 14:55:05 |  | Chris50100 (Chris5010) joins | 
| 14:56:18 |  | Chris5010 quits [Ping timeout: 240 seconds] | 
| 14:56:18 |  | Chris50100 is now known as Chris5010 | 
| 16:57:19 |  | qwertyasdfuiopghjkl18 joins | 
| 17:00:22 |  | qwertyasdfuiopghjkl quits [Ping timeout: 255 seconds] | 
| 19:15:58 |  | Aoede_ is now known as Aoede | 
| 19:26:38 |  | ThreeHM quits [Ping timeout: 240 seconds] | 
| 19:28:58 |  | ThreeHM (ThreeHeadedMonkey) joins | 
| 20:16:46 |  | DigitalDragons quits [Read error: Connection reset by peer] | 
| 20:19:30 |  | DigitalDragons (DigitalDragons) joins | 
| 20:51:07 |  | DigitalDragons quits [Client Quit] | 
| 20:55:29 |  | DigitalDragons (DigitalDragons) joins | 
| 22:56:29 | <seacow> | Here's another list of files from umass: https://transfer.archivete.am/TRnvI/umass.txt | 
| 22:56:29 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/TRnvI/umass.txt | 
| 22:56:29 | <seacow> | Most of these will probably be 403/404s | 
| 23:26:19 | <Ryz> | Any further updates, pokechu22 and JAA? Guess we're still waiting on datechnoman since it'll be a week to grind through all of those URLs to find interesting stuff >#<; | 
| 23:26:40 | <@JAA> | seacow: Thanks, will check that in a bit. | 
| 23:26:56 | <@JAA> | Ryz: The AB jobs have grabbed a lot of stuff already. The courses one finished, too. | 
| 23:26:58 | <pokechu22> | Ryz: we're already running the !a < list jobs | 
| 23:55:15 | <Ryz> | Ah, got it, got it, yeah, there's still https://transfer.archivete.am/TANTs/people.umass.edu-seeds-cdx-ddg.txt - about over halfway finished looking at the bar in beta~ | 
| 23:55:16 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/TANTs/people.umass.edu-seeds-cdx-ddg.txt | 
| 23:56:04 | <@JAA> | Yeah, the bar is useless. :-) |
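
Editorial sketch, not part of the log: the workflow discussed above (the grep/sed pipeline at 00:11:07, the with/without-tilde seed lists at 07:08:58, and deduping later lists at 07:09:34) could be approximated in Python roughly as follows. The function names are hypothetical, the seed URL shape (HTTPS, no trailing slash) is an assumption based on the `https://people.umass.edu/bzecchi` example, and this is not the script anyone in the channel actually ran.

```python
import re

# Mirrors the grep -Po pattern: optional subdomain, optional trailing dot and
# port, then capture the first path segment (what \K isolates in the original).
URL_RE = re.compile(
    r'^https?://(?:[^/?]+\.)?people\.umass\.edu\.?(?::[0-9]+)?/([^/?]+)'
)
# Mirrors the three sed substitutions, applied in the same order, once each.
PREFIX_RES = [
    re.compile(r'^%7E', re.IGNORECASE),         # percent-encoded "~"
    re.compile(r'^%E2%88%BC', re.IGNORECASE),   # percent-encoded tilde operator
    re.compile(r'^~'),                          # literal "~"
]
# Mirrors the final grep -E filter on plausible directory names.
VALID_RE = re.compile(r'[a-zA-Z0-9_.-]+')


def extract_username(url):
    """Return the candidate directory name for a URL, or None if it is filtered out."""
    match = URL_RE.match(url.strip())
    if match is None:
        return None
    name = match.group(1)
    for prefix in PREFIX_RES:
        name = prefix.sub('', name, count=1)
    return name if VALID_RE.fullmatch(name) else None


def seed_urls(usernames, host='people.umass.edu'):
    """Emit each name with and without a tilde (assumed URL shape, see lead-in)."""
    seeds = []
    for name in usernames:
        seeds.append(f'https://{host}/{name}')
        seeds.append(f'https://{host}/~{name}')
    return seeds


def new_seeds(candidates, already_run):
    """Keep candidates not in an earlier list, preserving order, without repeats."""
    seen = set(already_run)
    fresh = []
    for url in candidates:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

For example, `extract_username` could be mapped over a CDX API dump, the unique non-`None` results passed to `seed_urls`, and any later list (such as datechnoman's grep output) run through `new_seeds` against the seeds already submitted before starting another job.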