00:08:11<pokechu22>https://people.umass.edu/pelliott/web2/web2a.html is pretty great
00:08:45<@JAA>Oh neat!
00:11:07<@JAA>This looks fairly decent: `grep -Po '^https?://([^/?]+\.)?people\.umass\.edu\.?(:\d+)?/\K[^/?]+' | sed 's,^%7E,,i; s,^%E2%88%BC,,i; s,^~,,' | grep -E '^[a-zA-Z0-9_.-]+$'`
00:11:58<@JAA>181 of 116k URLs from the CDX API don't make it through that, and they all look weird.
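A sketch of that pipeline end to end, assuming the URL list comes from the Wayback Machine's CDX API (the query parameters and output file name below are assumptions; only the grep/sed/grep filter is verbatim from the message above):
```sh
# Fetch every captured people.umass.edu URL from the CDX API, reduce each
# to the first path segment, strip leading tildes (raw, %7E-encoded, or the
# occasional %E2%88%BC "∼"), and keep only plausible-looking usernames.
curl -s 'https://web.archive.org/cdx/search/cdx?url=people.umass.edu&matchType=domain&fl=original&collapse=urlkey' \
  | grep -Po '^https?://([^/?]+\.)?people\.umass\.edu\.?(:\d+)?/\K[^/?]+' \
  | sed 's,^%7E,,i; s,^%E2%88%BC,,i; s,^~,,' \
  | grep -E '^[a-zA-Z0-9_.-]+$' \
  | sort -u > people.umass.edu-usernames.txt
```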
00:13:41<@JAA>In case someone feels like hand-cleaning these, although probably most are already in the list in another form anyway: https://transfer.archivete.am/inline/y7TxB/people.umass.edu-cdx-api-weird-names.txt
00:14:17<@JAA>After uniqify, I'm getting 156 unique dir names.
00:14:37<@JAA>Er, no, I forgot to disable -v again.
00:14:45<@JAA>15429 dir name candidates
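The -v mixup above inverts the final filter, counting the weird rejects instead of the keepers; a quick sketch of both counts (candidates.txt is a placeholder for the output of the grep/sed stage):
```sh
# Count names that pass the whitelist filter versus the "weird" rejects;
# -v inverts the match, -c counts matching lines.
grep -cE  '^[a-zA-Z0-9_.-]+$' <(sort -u candidates.txt)   # keepers
grep -cvE '^[a-zA-Z0-9_.-]+$' <(sort -u candidates.txt)   # rejects
```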
00:16:46<@JAA>I was going to check a little sample, but turns out my OVH IP is banned.
00:18:42<pokechu22>OK, here's my searching results: https://transfer.archivete.am/inline/mP5Jg/courses.umass.edu_people.umass.edu_urls.txt
00:25:48<pokechu22>https://transfer.archivete.am/inline/vJhZt/people.umass.edu_weird_names_tidied.txt (some still have spaces, which may or may not be valid; I suspect program files and application data are junk, but they won't cause problems if they're in the list)
00:52:39<@JAA>datechnoman: Can you check what you have for people.umass.edu and courses.umass.edu in your large URL lists? Simple grep/noisy data is fine.
02:34:38wickedplayer494 quits [Remote host closed the connection]
03:22:43wickedplayer494 joins
04:42:29<datechnoman>JAA - Can do mate. It will take a while as the data set is so huge, but more than happy to provide what I can :)
04:42:35<datechnoman>Will kick off some searches when I get home
04:44:23<@JAA>datechnoman: Lovely, thanks. What time scale are we talking? I might already start a big job and then deal with additional stuff later.
04:47:11<datechnoman>There are at least 30-40 billion URLs in my collection at this time. It will take a week or so to go through it all, but I can dump progress outputs as we go that you can use as seed URLs if that helps? JAA
04:47:35<datechnoman>What timeframe are we looking at?
04:47:47<@JAA>soon™
04:47:57<@JAA>No stated deadline anywhere.
04:48:26<@JAA>2024-11-06 16:56:36 < tmbr> no public announcement, but confirmed with their IT: the legacy tilde-style shared hosting at University of Massachusetts Amherst will be sunsetted "soon"
04:48:30<@JAA>2024-11-06 16:56:45 < tmbr> I've asked whether they have a timeline for pulling that server offline and if they'll be announcing, but so far all I have received is "the days are numbered"
04:48:55<datechnoman>Ahhh good. I'll kick off a few streams of grep and see where we go. It's about 50 TB of compressed (zst) text files :P
04:49:16<datechnoman>Makes the HDDs go brrrrrrr and cough
04:49:18<@JAA>And I thought the terabytes of JSONL I'm currently dealing with were bad. lol
04:50:05<datechnoman>Yeah, the scale I've hoarded at far exceeds any practical processing of the data lol
04:50:20<datechnoman>It's worth noting that's just cleaned URLs with no garbage >.<
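A sketch of how such a scan might look, assuming one URL per line in the .zst files (the directory layout, pattern, and degree of parallelism are all made up):
```sh
# Scan a collection of zstd-compressed URL lists for the target hosts,
# eight files at a time. zstdgrep ships with zstd and forwards options
# to grep; writing one .hits file per input avoids interleaved output
# from parallel workers.
find /data/url-lists -name '*.zst' -print0 |
  xargs -0 -P8 -I{} sh -c \
    'zstdgrep -Eh "https?://[^/]*\.umass\.edu(:[0-9]+)?/" "$1" > "$1.hits"' _ {}
find /data/url-lists -name '*.hits' -exec cat {} + | sort -u > umass-hits.txt
```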
04:51:41<datechnoman>I really need to start sorting them into TLDs so I don't have to scan over everything each time
04:51:45<datechnoman>It just takes too long
04:51:46<@JAA>I think I'll get something started with what we already have (CDX and DDG) now.
04:52:39<@JAA>The lists for com and net would still be massive. But yeah, hard thing to deal with.
04:53:17<datechnoman>Oh for sure, but at this point any saving (even a few billion URLs) will be well worth it
04:54:04<@JAA>You'll also have IPs, which you'd probably want to organise by subnet, not last octet.
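A rough sketch of the bucketing being discussed, assuming one URL per line (host parsing here is crude and all file paths are placeholders; bare-IP hosts go to their own bucket for later splitting by subnet, as suggested above):
```sh
# Split a URL list into one output file per TLD in a single pass.
# gawk keeps the output files open, so this needs a generous fd limit.
mkdir -p by-tld
zstdcat urls.txt.zst | awk -F/ '
{
  host = $3
  sub(/:.*/, "", host)                    # drop any :port
  if (host ~ /^[0-9.]+$/) tld = "ip"      # bare IPv4 host; split by subnet later
  else { n = split(host, p, "."); tld = p[n] }
  print > ("by-tld/" tld ".txt")
}'
```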
04:55:58<@JAA>I don't expect there to be a huge number of cross-user links based on how other personal uni sites tend to work, and a bit of duplication is fine, so might also just run whatever you (or others) find as another full recursive job.
04:56:21<@JAA>These sites tend not to be huge because there's usually only a small amount of disk space available per user.
06:57:48<datechnoman>TL;DR - not many URLs will be found. Gotcha! Will be home shortly to kick off some grep jobs
07:06:27<@JAA>My lists from the CDX API with the grep/sed processing mentioned above: https://transfer.archivete.am/lOuOC/people.umass.edu-unique-usernames-cdx.txt https://transfer.archivete.am/114dL/courses.umass.edu-unique-dirnames-cdx.txt
07:06:29<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/lOuOC/people.umass.edu-unique-usernames-cdx.txt https://transfer.archivete.am/inline/114dL/courses.umass.edu-unique-dirnames-cdx.txt
07:08:58<@JAA>I will combine mP5Jg, vJhZt, lOuOC, and 114dL into one list per domain, with each username on people listed both with and without a tilde (courses doesn't seem to use tildes), and start two !a < jobs for them.
07:09:34<@JAA>Future lists can then be deduped against that and run as additional !a < jobs.
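A sketch of that merge-and-seed step, with placeholder input file names standing in for the four lists above, plus the dedup step for future lists:
```sh
# Merge the username lists into one seed URL per name, with and without
# the tilde for people.umass.edu (courses.umass.edu doesn't appear to
# use tildes, so it only gets the bare form).
sort -u people-usernames-*.txt | while IFS= read -r u; do
  printf 'https://people.umass.edu/%s/\n'  "$u"
  printf 'https://people.umass.edu/~%s/\n' "$u"
done > people.umass.edu-seeds.txt

sort -u courses-dirnames-*.txt \
  | sed 's,^,https://courses.umass.edu/,; s,$,/,' > courses.umass.edu-seeds.txt

# Later lists can be deduped against what has already been queued:
comm -13 <(sort people.umass.edu-seeds.txt) <(sort new-seeds.txt) > additional-seeds.txt
```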
07:11:12<@JAA>pokechu22: It might be worth running your DDG list as !ao < as well; I've seen cases where DDG returned things that weren't discoverable from links on the site.
07:12:40<pokechu22>Yeah, that or check the meta-warc for duplication
07:19:03<@JAA>https://transfer.archivete.am/TANTs/people.umass.edu-seeds-cdx-ddg.txt.zst
07:19:37<@JAA>There will be lots of 404s.
07:41:46<@JAA>There's some weirdness where it seems that 404s trigger 403s instead. See #archivebot in the past several minutes for details.
07:42:41<@JAA>We'll have to do a manual spot check of those later.
07:46:17<pokechu22>https://people.umass.edu/bzecchi seems to be a genuine 403
07:48:08<@JAA>Yeah, the ones that go 301→403 seem to be legitimate.
07:48:28<@JAA>Those might be users that exist but don't have an index.html or similar.
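A minimal way to spot-check those chains later, printing each hop's status code (the sample size and seed file name are placeholders):
```sh
# Print the redirect/status chain for a random sample of seeds, so
# genuine 301→403s can be told apart from 404s masquerading as 403s.
shuf -n 20 people.umass.edu-seeds.txt | while IFS= read -r url; do
  printf '%s: ' "$url"
  curl -sIL "$url" | awk '/^HTTP/ {printf "%s ", $2} END {print ""}'
done
```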
07:48:49<@JAA>https://transfer.archivete.am/1bREw/courses.umass.edu-seeds-cdx-ddg.txt.zst
09:15:17nulldata quits [Client Quit]
09:17:13nulldata (nulldata) joins
09:56:15pokechu22 quits [Ping timeout: 260 seconds]
10:17:33pokechu22 (pokechu22) joins
14:55:05Chris50100 (Chris5010) joins
14:56:18Chris5010 quits [Ping timeout: 240 seconds]
14:56:18Chris50100 is now known as Chris5010
16:57:19qwertyasdfuiopghjkl18 joins
17:00:22qwertyasdfuiopghjkl quits [Ping timeout: 255 seconds]
19:15:58Aoede_ is now known as Aoede
19:26:38ThreeHM quits [Ping timeout: 240 seconds]
19:28:58ThreeHM (ThreeHeadedMonkey) joins
20:16:46DigitalDragons quits [Read error: Connection reset by peer]
20:19:30DigitalDragons (DigitalDragons) joins
20:51:07DigitalDragons quits [Client Quit]
20:55:29DigitalDragons (DigitalDragons) joins
22:56:29<seacow>Here's another list of files from umass: https://transfer.archivete.am/TRnvI/umass.txt
22:56:29<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/TRnvI/umass.txt
22:56:29<seacow>Most of these will probably be 403/404s
23:26:19<Ryz>Any further updates, pokechu22 and JAA? Guess we're still waiting on datechnoman since it'll take a week to grind through all of those URLs to find interesting stuff >#<;
23:26:40<@JAA>seacow: Thanks, will check that in a bit.
23:26:56<@JAA>Ryz: The AB jobs have grabbed a lot of stuff already. The courses one finished, too.
23:26:58<pokechu22>Ryz: we're already running the !a < list jobs
23:55:15<Ryz>Ah, got it, got it, yeah, there's still https://transfer.archivete.am/TANTs/people.umass.edu-seeds-cdx-ddg.txt - a bit over halfway finished, looking at the bar in beta~
23:55:16<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/TANTs/people.umass.edu-seeds-cdx-ddg.txt
23:56:04<@JAA>Yeah, the bar is useless. :-)