00:08:11 | <pokechu22> | https://people.umass.edu/pelliott/web2/web2a.html is pretty great |
00:08:45 | <@JAA> | Oh neat! |
00:11:07 | <@JAA> | This looks fairly decent: `grep -Po '^https?://([^/?]+\.)?people\.umass\.edu\.?(:\d+)?/\K[^/?]+' | sed 's,^%7E,,i; s,^%E2%88%BC,,i; s,^~,,' | grep -E '^[a-zA-Z0-9_.-]+$'` |
00:11:58 | <@JAA> | 181 of 116k URLs from the CDX API don't make it through that, and they all look weird. |
00:13:41 | <@JAA> | In case someone feels like hand-cleaning these, although probably most are already in the list in another form anyway: https://transfer.archivete.am/inline/y7TxB/people.umass.edu-cdx-api-weird-names.txt |
00:14:17 | <@JAA> | After uniquifying, I'm getting 156 unique dir names. |
00:14:37 | <@JAA> | Er, no, I forgot to disable -v again. |
00:14:45 | <@JAA> | 15429 dir name candidates |
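For reference, JAA's extraction pipeline above reads URLs on stdin and emits candidate directory/user names. A minimal sketch of running it end-to-end against the Wayback CDX API; the exact query parameters and output file name are assumptions, not quoted from the channel:

```sh
# Pull captured people.umass.edu URLs from the Wayback CDX API and reduce
# them to candidate usernames. matchType=domain also returns subdomains;
# collapse=urlkey drops repeat captures of the same URL. The sed strips
# the three tilde spellings (%7E, %E2%88%BC, and a literal ~).
curl -s 'https://web.archive.org/cdx/search/cdx?url=people.umass.edu&matchType=domain&fl=original&collapse=urlkey' \
  | grep -Po '^https?://([^/?]+\.)?people\.umass\.edu\.?(:\d+)?/\K[^/?]+' \
  | sed 's,^%7E,,i; s,^%E2%88%BC,,i; s,^~,,' \
  | grep -E '^[a-zA-Z0-9_.-]+$' \
  | sort -u > people.umass.edu-usernames.txt
```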
00:16:46 | <@JAA> | I was going to check a little sample, but turns out my OVH IP is banned. |
00:18:42 | <pokechu22> | OK, here's my searching results: https://transfer.archivete.am/inline/mP5Jg/courses.umass.edu_people.umass.edu_urls.txt |
00:25:48 | <pokechu22> | https://transfer.archivete.am/inline/vJhZt/people.umass.edu_weird_names_tidied.txt (some still have spaces, which may or may not be valid; I suspect program files and application data are junk, but they won't cause problems if they're in the list) |
00:52:39 | <@JAA> | datechnoman: Can you check what you have for people.umass.edu and courses.umass.edu in your large URL lists? Simple grep/noisy data is fine. |
02:34:38 | | wickedplayer494 quits [Remote host closed the connection] |
03:22:43 | | wickedplayer494 joins |
04:42:29 | <datechnoman> | JAA - Can do mate. Will take a while as the data set is so huge, but more than happy to provide what I can :) |
04:42:35 | <datechnoman> | Will kick off some searches when I get home |
04:44:23 | <@JAA> | datechnoman: Lovely, thanks. What time scale are we talking? I might already start a big job and then deal with additional stuff later. |
04:47:11 | <datechnoman> | There are at least 30-40 billion URLs in my collection at this time. It will take a week or so to go through it all, but I can dump progress outputs as we go that you can use as seed URLs, if that helps? JAA |
04:47:35 | <datechnoman> | What timeframe are we looking at? |
04:47:47 | <@JAA> | soon™ |
04:47:57 | <@JAA> | No stated deadline anywhere. |
04:48:26 | <@JAA> | 2024-11-06 16:56:36 < tmbr> no public announcement, but confirmed with their IT: the legacy tilde-style shared hosting at University of Massachusetts Amherst will be sunsetted "soon" |
04:48:30 | <@JAA> | 2024-11-06 16:56:45 < tmbr> I've asked whether they have a timeline for pulling that server offline and if they'll be announcing, but so far all I have received is "the days are numbered" |
04:48:55 | <datechnoman> | Ahhh good. I'll kick off a few streams of grep and see where we go. It's about 50 TB of compressed (zst) text files :P |
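A sketch of what one such grep stream over zstd-compressed URL lists might look like; the paths and pattern are assumptions, and since JAA said noisy data is fine, the pattern doesn't try to be precise:

```sh
# Scan every .zst list for umass URLs without unpacking anything to disk
# (zstdgrep decompresses on the fly). -h drops file-name prefixes so the
# output is a plain URL list; running several of these over disjoint
# subsets of the files gives parallel streams.
find /data/url-lists -name '*.zst' -print0 \
  | xargs -0 zstdgrep -hE '//(people|courses)\.umass\.edu' \
  > umass-hits.txt
```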
04:49:16 | <datechnoman> | Makes the HDDs go brrrrrrr and cough |
04:49:18 | <@JAA> | And I thought the terabytes of JSONL I'm currently dealing with were bad. lol |
04:50:05 | <datechnoman> | Yeah, the scale I've hoarded at far exceeds what I can practically process lol |
04:50:20 | <datechnoman> | It's worth noting that's just cleaned URLs with no garbage >.< |
04:51:41 | <datechnoman> | I really need to start sorting them into TLDs so I don't have to scan over everything each time |
04:51:45 | <datechnoman> | It just takes too long |
04:51:46 | <@JAA> | I think I'll get something started with what we already have (CDX and DDG) now. |
04:52:39 | <@JAA> | The lists for com and net would still be massive. But yeah, hard thing to deal with. |
04:53:17 | <datechnoman> | Oh for sure, but at this point any saving (even a few billion URLs) will be well worth it |
04:54:04 | <@JAA> | You'll also have IPs, which you'd probably want to organise by subnet, not last octet. |
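A rough sketch of that bucketing, including the point about grouping bare IPs by subnet rather than by last label; the directory layout and the /8 grouping are assumptions (gawk multiplexes the many output files automatically):

```sh
mkdir -p by-tld
awk -F/ '{
    host = $3                                # scheme://host/... -> field 3
    sub(/:[0-9]+$/, "", host)                # drop any port
    if (host ~ /^([0-9]+\.){3}[0-9]+$/) {    # bare IPv4: bucket by first octet
        split(host, o, ".")
        out = "by-tld/ip-" o[1] ".txt"
    } else {
        n = split(host, p, ".")
        tld = tolower(p[n])
        if (tld !~ /^[a-z0-9-]+$/) tld = "other"
        out = "by-tld/" tld ".txt"
    }
    print > out
}' urls.txt
```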
04:55:58 | <@JAA> | I don't expect there to be a huge number of cross-user links based on how other personal uni sites tend to work, and a bit of duplication is fine, so might also just run whatever you (or others) find as another full recursive job. |
04:56:21 | <@JAA> | These sites tend not to be huge because there's usually only a small amount of disk space available per user. |
06:57:48 | <datechnoman> | TL;DR - not many URLs will be found. Gotcha! Will be home shortly to kick off some grep jobs |
07:06:27 | <@JAA> | My lists from the CDX API with the grep/sed processing mentioned above: https://transfer.archivete.am/lOuOC/people.umass.edu-unique-usernames-cdx.txt https://transfer.archivete.am/114dL/courses.umass.edu-unique-dirnames-cdx.txt |
07:06:29 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/lOuOC/people.umass.edu-unique-usernames-cdx.txt https://transfer.archivete.am/inline/114dL/courses.umass.edu-unique-dirnames-cdx.txt |
07:08:58 | <@JAA> | I will combine mP5Jg vJhZt lOuOC 114dL into one list per domain, with each people username both with and without a tilde (courses doesn't seem to use tildes), and start two !a < jobs for them. |
07:09:34 | <@JAA> | Future lists can then be deduped against that and run as additional !a < jobs. |
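A simplified sketch of that combine-and-dedupe flow, assuming the inputs have already been reduced to bare usernames (all file names here are placeholders):

```sh
# Build the people.umass.edu seed list: every username with and without a tilde.
sort -u people-usernames-*.txt | while read -r user; do
    printf 'https://people.umass.edu/%s/\n' "$user"
    printf 'https://people.umass.edu/~%s/\n' "$user"
done | sort -u > people.umass.edu-seeds.txt

# Dedupe a future list against what has already been queued:
# comm -13 keeps only lines unique to the second (sorted) input.
comm -13 people.umass.edu-seeds.txt <(sort -u new-list.txt) > additional-seeds.txt
```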
07:11:12 | <@JAA> | pokechu22: It might be worth running your DDG list as !ao < as well; I've seen cases where DDG returned things that weren't discoverable from links on the site. |
07:12:40 | <pokechu22> | Yeah, that or check the meta-warc for duplication |
07:19:03 | <@JAA> | https://transfer.archivete.am/TANTs/people.umass.edu-seeds-cdx-ddg.txt.zst |
07:19:37 | <@JAA> | There will be lots of 404s. |
07:41:46 | <@JAA> | There's some weirdness where it seems that 404s trigger 403s instead. See #archivebot in the past several minutes for details. |
07:42:41 | <@JAA> | We'll have to do a manual spot check of those later. |
07:46:17 | <pokechu22> | https://people.umass.edu/bzecchi seems to be a genuine 403 |
07:48:08 | <@JAA> | Yeah, the ones that go 301→403 seem to be legitimate. |
07:48:28 | <@JAA> | Those might be users that exist but don't have an index.html or similar. |
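For the manual spot check mentioned above, a small sketch that prints each URL's redirect status chain; the input file name is hypothetical, and it uses HEAD requests, which the server could conceivably treat differently from GET:

```sh
# "301 403 -> URL" suggests a real user dir without an index page;
# a plain "404 -> URL" suggests the username doesn't exist.
while read -r url; do
    chain=$(curl -sIL "$url" | awk '/^HTTP/ { gsub(/\r/, ""); printf "%s ", $2 }')
    printf '%s-> %s\n' "$chain" "$url"
done < suspect-urls.txt
```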
07:48:49 | <@JAA> | https://transfer.archivete.am/1bREw/courses.umass.edu-seeds-cdx-ddg.txt.zst |
09:15:17 | | nulldata quits [Client Quit] |
09:17:13 | | nulldata (nulldata) joins |
09:56:15 | | pokechu22 quits [Ping timeout: 260 seconds] |
10:17:33 | | pokechu22 (pokechu22) joins |
14:55:05 | | Chris50100 (Chris5010) joins |
14:56:18 | | Chris5010 quits [Ping timeout: 240 seconds] |
14:56:18 | | Chris50100 is now known as Chris5010 |
16:57:19 | | qwertyasdfuiopghjkl18 joins |
17:00:22 | | qwertyasdfuiopghjkl quits [Ping timeout: 255 seconds] |
19:15:58 | | Aoede_ is now known as Aoede |
19:26:38 | | ThreeHM quits [Ping timeout: 240 seconds] |
19:28:58 | | ThreeHM (ThreeHeadedMonkey) joins |
20:16:46 | | DigitalDragons quits [Read error: Connection reset by peer] |
20:19:30 | | DigitalDragons (DigitalDragons) joins |
20:51:07 | | DigitalDragons quits [Client Quit] |
20:55:29 | | DigitalDragons (DigitalDragons) joins |
22:56:29 | <seacow> | Here's another list of files from umass: https://transfer.archivete.am/TRnvI/umass.txt |
22:56:29 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/TRnvI/umass.txt |
22:56:29 | <seacow> | Most of these will probably be 403/404s |
23:26:19 | <Ryz> | Any further updates, pokechu22 and JAA? Guess we're still waiting on datechnoman, since it'll take a week to grind through all of those URLs to find interesting stuff >#<; |
23:26:40 | <@JAA> | seacow: Thanks, will check that in a bit. |
23:26:56 | <@JAA> | Ryz: The AB jobs have grabbed a lot of stuff already. The courses one finished, too. |
23:26:58 | <pokechu22> | Ryz: we're already running the !a < list jobs |
23:55:15 | <Ryz> | Ah, got it, got it, yeah, there's still https://transfer.archivete.am/TANTs/people.umass.edu-seeds-cdx-ddg.txt - it's a bit over halfway finished, looking at the bar in beta~ |
23:55:16 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/TANTs/people.umass.edu-seeds-cdx-ddg.txt |
23:56:04 | <@JAA> | Yeah, the bar is useless. :-) |