00:08:11 | <pokechu22> | https://people.umass.edu/pelliott/web2/web2a.html is pretty great |
00:08:45 | <@JAA> | Oh neat! |
00:11:07 | <@JAA> | This looks fairly decent: `grep -Po '^https?://([^/?]+\.)?people\.umass\.edu\.?(:\d+)?/\K[^/?]+' | sed 's,^%7E,,i; s,^%E2%88%BC,,i; s,^~,,' | grep -E '^[a-zA-Z0-9_.-]+$'` |
00:11:58 | <@JAA> | 181 of 116k URLs from the CDX API don't make it through that, and they all look weird. |
00:13:41 | <@JAA> | In case someone feels like hand-cleaning these, although probably most are already in the list in another form anyway: https://transfer.archivete.am/inline/y7TxB/people.umass.edu-cdx-api-weird-names.txt |
00:14:17 | <@JAA> | After uniquifying, I'm getting 156 unique dir names. |
00:14:37 | <@JAA> | Er, no, I forgot to disable -v again. |
00:14:45 | <@JAA> | 15429 dir name candidates |
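For reference, JAA's extraction pipeline above reads URLs on stdin and emits candidate directory/user names. A minimal sketch of running it end-to-end against the Wayback CDX API; the exact query parameters and output file name are assumptions, not quoted from the channel:

```sh
# Pull captured people.umass.edu URLs from the Wayback CDX API and reduce
# them to candidate usernames. matchType=domain also returns subdomains;
# collapse=urlkey drops repeat captures of the same URL. The sed strips
# the three tilde spellings (%7E, %E2%88%BC, and a literal ~).
curl -s 'https://web.archive.org/cdx/search/cdx?url=people.umass.edu&matchType=domain&fl=original&collapse=urlkey' \
  | grep -Po '^https?://([^/?]+\.)?people\.umass\.edu\.?(:\d+)?/\K[^/?]+' \
  | sed 's,^%7E,,i; s,^%E2%88%BC,,i; s,^~,,' \
  | grep -E '^[a-zA-Z0-9_.-]+$' \
  | sort -u > people.umass.edu-usernames.txt
```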
00:16:46 | <@JAA> | I was going to check a little sample, but turns out my OVH IP is banned. |
00:18:42 | <pokechu22> | OK, here's my searching results: https://transfer.archivete.am/inline/mP5Jg/courses.umass.edu_people.umass.edu_urls.txt |
00:25:48 | <pokechu22> | https://transfer.archivete.am/inline/vJhZt/people.umass.edu_weird_names_tidied.txt (some still have spaces, which may or may not be valid; I suspect program files and application data are junk, but they won't cause problems if they're in the list) |
00:52:39 | <@JAA> | datechnoman: Can you check what you have for people.umass.edu and courses.umass.edu in your large URL lists? Simple grep/noisy data is fine. |
02:34:38 | | wickedplayer494 quits [Remote host closed the connection] |
03:22:43 | | wickedplayer494 joins |
04:42:29 | <datechnoman> | JAA - Can do mate. Will take a while as the data set is so huge, but more than happy to provide what I can :) |
04:42:35 | <datechnoman> | Will kick off some searches when I get home |
04:44:23 | <@JAA> | datechnoman: Lovely, thanks. What time scale are we talking? I might already start a big job and then deal with additional stuff later. |
04:47:11 | <datechnoman> | There are at least 30-40 billion URLs in my collection at this time. It will take a week or so to go through it all, but I can dump progress outputs as we go that you can use as seed URLs, if that helps? JAA |
04:47:35 | <datechnoman> | What timeframe are we looking at? |
04:47:47 | <@JAA> | soon™ |
04:47:57 | <@JAA> | No stated deadline anywhere. |
04:48:26 | <@JAA> | 2024-11-06 16:56:36 < tmbr> no public announcement, but confirmed with their IT: the legacy tilde-style shared hosting at University of Massachusetts Amherst will be sunsetted "soon" |
04:48:30 | <@JAA> | 2024-11-06 16:56:45 < tmbr> I've asked whether they have a timeline for pulling that server offline and if they'll be announcing, but so far all I have received is "the days are numbered" |
04:48:55 | <datechnoman> | Ahhh good. I'll kick off a few streams of grep and see where we go. It's about 50 TB of compressed (zst) text files :P |
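A sketch of what one such grep stream over zstd-compressed URL lists might look like; the paths and pattern are assumptions, and since JAA said noisy data is fine, the pattern doesn't try to be precise:

```sh
# Scan every .zst list for umass URLs without unpacking anything to disk
# (zstdgrep decompresses on the fly). -h drops file-name prefixes so the
# output is a plain URL list; running several of these over disjoint
# subsets of the files gives parallel streams.
find /data/url-lists -name '*.zst' -print0 \
  | xargs -0 zstdgrep -hE '//(people|courses)\.umass\.edu' \
  > umass-hits.txt
```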
04:49:16 | <datechnoman> | Makes the HDDs go brrrrrrr and cough |
04:49:18 | <@JAA> | And I thought the terabytes of JSONL I'm currently dealing with were bad. lol |
04:50:05 | <datechnoman> | Yeah, the scale I've hoarded at far exceeds what I can practically process lol |
04:50:20 | <datechnoman> | It's worth noting that's just cleaned URLs with no garbage >.< |
04:51:41 | <datechnoman> | I really need to start sorting them into TLDs so I don't have to scan over everything each time |
04:51:45 | <datechnoman> | It just takes too long |
04:51:46 | <@JAA> | I think I'll get something started with what we already have (CDX and DDG) now. |
04:52:39 | <@JAA> | The lists for com and net would still be massive. But yeah, hard thing to deal with. |
04:53:17 | <datechnoman> | Oh for sure, but at this point any saving (even a few billion URLs) will be well worth it |
04:54:04 | <@JAA> | You'll also have IPs, which you'd probably want to organise by subnet, not last octet. |
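A rough sketch of that bucketing, including the point about grouping bare IPs by subnet rather than by last label; the directory layout and the /8 grouping are assumptions (gawk multiplexes the many output files automatically):

```sh
mkdir -p by-tld
awk -F/ '{
    host = $3                                # scheme://host/... -> field 3
    sub(/:[0-9]+$/, "", host)                # drop any port
    if (host ~ /^([0-9]+\.){3}[0-9]+$/) {    # bare IPv4: bucket by first octet
        split(host, o, ".")
        out = "by-tld/ip-" o[1] ".txt"
    } else {
        n = split(host, p, ".")
        tld = tolower(p[n])
        if (tld !~ /^[a-z0-9-]+$/) tld = "other"
        out = "by-tld/" tld ".txt"
    }
    print > out
}' urls.txt
```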
04:55:58 | <@JAA> | I don't expect there to be a huge number of cross-user links based on how other personal uni sites tend to work, and a bit of duplication is fine, so might also just run whatever you (or others) find as another full recursive job. |
04:56:21 | <@JAA> | These sites tend not to be huge because there's usually only a small amount of disk space available per user. |
06:57:48 | <datechnoman> | TL;DR - not many URLs will be found. Gotcha! Will be home shortly to kick off some grep jobs |
07:06:27 | <@JAA> | My lists from the CDX API with the grep/sed processing mentioned above: https://transfer.archivete.am/lOuOC/people.umass.edu-unique-usernames-cdx.txt https://transfer.archivete.am/114dL/courses.umass.edu-unique-dirnames-cdx.txt |
07:06:29 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/lOuOC/people.umass.edu-unique-usernames-cdx.txt https://transfer.archivete.am/inline/114dL/courses.umass.edu-unique-dirnames-cdx.txt |
07:08:58 | <@JAA> | I will combine mP5Jg vJhZt lOuOC 114dL into one list per domain, with each people username both with and without a tilde (courses doesn't seem to use tildes), and start two !a < jobs for them. |
07:09:34 | <@JAA> | Future lists can then be deduped against that and run as additional !a < jobs. |
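A simplified sketch of that combine-and-dedupe flow, assuming the inputs have already been reduced to bare usernames (all file names here are placeholders):

```sh
# Build the people.umass.edu seed list: every username with and without a tilde.
sort -u people-usernames-*.txt | while read -r user; do
    printf 'https://people.umass.edu/%s/\n' "$user"
    printf 'https://people.umass.edu/~%s/\n' "$user"
done | sort -u > people.umass.edu-seeds.txt

# Dedupe a future list against what has already been queued:
# comm -13 keeps only lines unique to the second (sorted) input.
comm -13 people.umass.edu-seeds.txt <(sort -u new-list.txt) > additional-seeds.txt
```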
07:11:12 | <@JAA> | pokechu22: It might be worth running your DDG list as !ao < as well; I've seen cases where DDG returned things that weren't discoverable from links on the site. |
07:12:40 | <pokechu22> | Yeah, that or check the meta-warc for duplication |
07:19:03 | <@JAA> | https://transfer.archivete.am/TANTs/people.umass.edu-seeds-cdx-ddg.txt.zst |
07:19:37 | <@JAA> | There will be lots of 404s. |
07:41:46 | <@JAA> | There's some weirdness where it seems that 404s trigger 403s instead. See #archivebot in the past several minutes for details. |
07:42:41 | <@JAA> | We'll have to do a manual spot check of those later. |
07:46:17 | <pokechu22> | https://people.umass.edu/bzecchi seems to be a genuine 403 |
07:48:08 | <@JAA> | Yeah, the ones that go 301→403 seem to be legitimate. |
07:48:28 | <@JAA> | Those might be users that exist but don't have an index.html or similar. |
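For the manual spot check mentioned above, a small sketch that prints each URL's redirect status chain; the input file name is hypothetical, and it uses HEAD requests, which the server could conceivably treat differently from GET:

```sh
# "301 403 -> URL" suggests a real user dir without an index page;
# a plain "404 -> URL" suggests the username doesn't exist.
while read -r url; do
    chain=$(curl -sIL "$url" | awk '/^HTTP/ { gsub(/\r/, ""); printf "%s ", $2 }')
    printf '%s-> %s\n' "$chain" "$url"
done < suspect-urls.txt
```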
07:48:49 | <@JAA> | https://transfer.archivete.am/1bREw/courses.umass.edu-seeds-cdx-ddg.txt.zst |
09:15:17 | | nulldata quits [Client Quit] |
09:17:13 | | nulldata (nulldata) joins |
09:56:15 | | pokechu22 quits [Ping timeout: 260 seconds] |
10:17:33 | | pokechu22 (pokechu22) joins |
14:55:05 | | Chris50100 (Chris5010) joins |
14:56:18 | | Chris5010 quits [Ping timeout: 240 seconds] |
14:56:18 | | Chris50100 is now known as Chris5010 |
16:57:19 | | qwertyasdfuiopghjkl18 joins |
17:00:22 | | qwertyasdfuiopghjkl quits [Ping timeout: 255 seconds] |
19:15:58 | | Aoede_ is now known as Aoede |
19:26:38 | | ThreeHM quits [Ping timeout: 240 seconds] |
19:28:58 | | ThreeHM (ThreeHeadedMonkey) joins |
20:16:46 | | DigitalDragons quits [Read error: Connection reset by peer] |
20:19:30 | | DigitalDragons (DigitalDragons) joins |
20:51:07 | | DigitalDragons quits [Client Quit] |
20:55:29 | | DigitalDragons (DigitalDragons) joins |
22:56:29 | <seacow> | Here's another list of files from umass: https://transfer.archivete.am/TRnvI/umass.txt |
22:56:29 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/TRnvI/umass.txt |
22:56:29 | <seacow> | Most of these will probably be 403/404s |
23:26:19 | <Ryz> | Any further updates, pokechu22 and JAA? Guess we're still waiting on datechnoman, since it'll take a week to grind through all of those URLs to find interesting stuff >#<; |
23:26:40 | <@JAA> | seacow: Thanks, will check that in a bit. |
23:26:56 | <@JAA> | Ryz: The AB jobs have grabbed a lot of stuff already. The courses one finished, too. |
23:26:58 | <pokechu22> | Ryz: we're already running the !a < list jobs |
23:55:15 | <Ryz> | Ah, got it, got it, yeah, there's still https://transfer.archivete.am/TANTs/people.umass.edu-seeds-cdx-ddg.txt - it's a bit over halfway finished, looking at the bar in beta~ |
23:55:16 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/TANTs/people.umass.edu-seeds-cdx-ddg.txt |
23:56:04 | <@JAA> | Yeah, the bar is useless. :-) |