00:12:31G4te_Keep3r quits [Client Quit]
00:12:48G4te_Keep3r joins
00:25:03<Somebody2>TheTechRobo: Note that this is just that tag; there are likely more categorized as cemetery, etc.
00:25:36<Somebody2>And what's wrong with just running them thru a normal spider, like ArchiveBot?
00:42:58Stiletto joins
00:48:56<TheTechRobo>Somebody2: Yeah, there's nothing wrong with a normal spider. But that takes time with ignores, etc. It also doesn't get any XHR data,
00:50:55<TheTechRobo>Which is why I'm doing targeted crawls.
00:51:37<thuban>what is it you're using?
00:51:59<TheTechRobo>Seesaw
01:01:30tbc1887 (tbc1887) joins
01:02:43dm4v quits [Read error: Connection reset by peer]
01:03:39dm4v joins
01:03:41dm4v quits [Changing host]
01:03:41dm4v (dm4v) joins
01:07:55<TheTechRobo>It's great practice, actually. I'm getting pretty good with Seesaw, wget+lua, and lua in general! :-)
01:12:01<thuban>hmm, more of us should do that
01:13:23tbc1887 quits [Client Quit]
01:14:26bonga joins
01:16:09<Somebody2>agreed
01:16:15<Somebody2>thanks for doing that, TheTechRobo !
01:16:55<TheTechRobo>I'm hoping eventually I'll be able to help with official AT projects.
01:17:21<Somebody2>Out of the 3k sites, I bet a number of them use the same framework for their obits. You might want to try doing some target spidering of all of them, to see if you can
01:17:33<Somebody2>... identify some common tools used, then write custom code for those.
01:17:58<Somebody2>Have you posted your code up somewhere?
01:20:35Megame quits [Client Quit]
01:33:51<TheTechRobo>Somebody2: I can, if you want
01:33:55<TheTechRobo>It's very ugly and terrible, though
01:37:00<TheTechRobo>Somebody2: https://git.thetechrobo.ca/TheTechRobo/funeralhomes-grab
01:37:30<TheTechRobo>Somebody2: yeah, I'm sure there are a lot of similar frameworks out there, though :/
01:37:40<TheTechRobo>also 90% of html parsing is per site
01:37:55<TheTechRobo>it could be easily changed to allow other sites, though
01:38:18<TheTechRobo>(basically to prevent infinite queuing, it makes sure it's on a specific page of the website which includes the domain name. but i can just add another part to the conditional)
01:38:52<TheTechRobo>Unfortunately, all the commit dates are gone
01:39:20<TheTechRobo>let's move to #archiveteam-dev for further programming conversation
01:47:26tzt quits [Remote host closed the connection]
01:48:34tzt (tzt) joins
02:02:09march_happy quits [Ping timeout: 252 seconds]
02:03:20march_happy (march_happy) joins
02:21:24march_happy quits [Ping timeout: 252 seconds]
02:22:59march_happy (march_happy) joins
02:57:22michaelblob_ quits [Read error: Connection reset by peer]
02:59:09michaelblob (michaelblob) joins
02:59:16michaelblob_ (michaelblob) joins
03:00:27march_happy quits [Ping timeout: 252 seconds]
03:00:33michaelblob quits [Client Quit]
03:01:21march_happy (march_happy) joins
03:03:43lunik1 quits [Ping timeout: 265 seconds]
03:05:24Sanqui quits [Ping timeout: 252 seconds]
03:13:15lunik1 joins
03:15:27Sanqui joins
04:07:19<lennier1>I've extracted 13,492,859 links from the Apple App Store sitemap. That includes all international storefronts where an app is listed, so there aren't actually that many apps. https://transfer.archivete.am/AHJmS/appStoreLinks.zip
04:08:20<lennier1>Is that feasible to archive? From spot checking, the Wayback Machine coverage is definitely incomplete.
04:12:17eroc1990 quits [Quit: The Lounge - https://thelounge.chat]
04:40:10eroc1990 (eroc1990) joins
05:14:42Arcorann quits [Ping timeout: 265 seconds]
05:50:18<tech234a>lennier1: out of curiosity how many unique apps (regardless of region)?
05:53:00<tech234a>Also from https://apps.apple.com/robots.txt is a list of App Store "stories", not sure how valuable those are https://apps.apple.com/sitemaps_story_index_1.xml
05:55:34<tech234a>Example stories: https://apps.apple.com/us/story/whats-in-a-name-when-its-io/id1273615890 https://apps.apple.com/us/story/seamlessly-scan-everything/id1469965100
05:56:13<tech234a>Surprisingly a lot of them are intended for regions other than the US
05:59:52<tech234a>There's also https://itunes.apple.com/robots.txt for other parts of the iTunes store
06:05:24<tech234a>1,610,224 app IDs btw (based on extracting and deduping the last part of the URL paths)
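The dedup tech234a describes (13.5M storefront links down to 1.6M app IDs) works because the trailing `/idNNNNNN` path segment is the same in every regional storefront. A sketch:

```python
def unique_app_ids(urls):
    """Dedupe App Store links down to app IDs: the same app appears once
    per storefront, but the trailing /idNNNNNN segment is region-independent."""
    ids = set()
    for url in urls:
        last = url.rstrip("/").rsplit("/", 1)[-1]
        if last.startswith("id") and last[2:].isdigit():
            ids.add(last[2:])
    return ids

links = [
    "https://apps.apple.com/us/app/hangouts/id643496868",
    "https://apps.apple.com/gb/app/hangouts/id643496868",
    "https://apps.apple.com/de/app/hangouts/id643496868",
]
```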
06:06:41<tech234a>The number seems about right for the total size of the App Store given all the removals they've been doing https://en.wikipedia.org/wiki/App_Store_(iOS/iPadOS)
06:08:20<tech234a>https://www.apple.com/app-store/ also approximately backs up the numbers and says there should be 20k+ stories
06:12:25<tech234a>Also worth noting that unlisted apps became a thing recently, Google Hangouts is one of those apps and is not included in the sitemap https://apps.apple.com/us/app/hangouts/id643496868
06:16:03Arcorann (Arcorann) joins
06:44:14<lennier1>Interesting. Is there any way to find unlisted apps?
07:07:12<tech234a>lennier1: if they were previously public they would have kept their same IDs/links
07:08:24<tech234a>otherwise unless the link is published elsewhere probably not unless you want to iterate through all the possible IDs
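Sweeping the ID space, as tech234a mentions, would just mean generating candidate URLs for a contiguous range; this sketch assumes the slug-less `/app/idNNNNNN` URL form resolves, and says nothing about the mostly-404 request volume such a sweep implies.

```python
def candidate_urls(start_id, end_id, storefront="us"):
    """Yield App Store URLs for a contiguous ID range. Unlisted apps keep
    their original IDs, so sweeping a range is the only way to find links
    that aren't published anywhere."""
    for app_id in range(start_id, end_id):
        yield f"https://apps.apple.com/{storefront}/app/id{app_id}"
```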
07:27:16march_happy quits [Ping timeout: 265 seconds]
07:28:21march_happy (march_happy) joins
07:32:42march_happy quits [Ping timeout: 252 seconds]
07:33:52march_happy (march_happy) joins
08:02:24march_happy quits [Ping timeout: 252 seconds]
08:03:23march_happy (march_happy) joins
08:12:18march_happy quits [Ping timeout: 252 seconds]
08:12:32march_happy (march_happy) joins
08:53:21adia quits [Quit: The Lounge - https://thelounge.chat]
08:53:36adia (adia) joins
09:20:36kn1002 joins
09:22:39kn100 quits [Ping timeout: 265 seconds]
09:22:39kn1002 is now known as kn100
10:28:45qwertyasdfuiopghjkl quits [Client Quit]
10:55:26march_happy quits [Remote host closed the connection]
11:02:49march_happy (march_happy) joins
11:10:08qwertyasdfuiopghjkl joins
11:47:47qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
11:54:43<TheTechRobo>Somebody2: Very sorry about the inconvenience; the repository was set to private. You should now be able to access the funeralhomes link
12:25:51march_happy quits [Ping timeout: 252 seconds]
12:33:27march_happy (march_happy) joins
12:55:00BlueMaxima quits [Read error: Connection reset by peer]
13:07:41yano quits [Quit: WeeChat, the better IRC client, https://weechat.org/]
13:09:28march_happy quits [Ping timeout: 265 seconds]
13:10:12march_happy (march_happy) joins
13:11:37yano (yano) joins
13:55:23<Somebody2>Nice -- I hadn't gotten around to looking at it, so no harm done.
13:57:30<AK>Hmm, does every domain in osm get at least one archive in the wbm I wonder
14:17:36<Jake>I can't remember, but I know at one point we wanted to throw OSM links into #//
14:21:21Arcorann quits [Ping timeout: 265 seconds]
14:42:37Barto quits [Ping timeout: 265 seconds]
14:43:20Barto (Barto) joins
14:48:51adia quits [Client Quit]
14:48:51kn100 quits [Client Quit]
14:48:51driib quits [Client Quit]
14:48:51G4te_Keep3r quits [Client Quit]
14:48:51yano quits [Remote host closed the connection]
14:48:51dm4v quits [Client Quit]
14:48:56adia (adia) joins
14:48:56dm4v joins
14:48:57dm4v quits [Changing host]
14:48:57dm4v (dm4v) joins
14:48:58driib (driib) joins
14:49:05G4te_Keep3r joins
14:49:13yano (yano) joins
14:49:37kn100 joins
14:49:53<Somebody2>If we haven't done that, we should.
15:00:09march_happy quits [Ping timeout: 265 seconds]
15:07:00bonga quits [Ping timeout: 252 seconds]
15:07:30bonga joins
15:19:40<AK>If anyone fancies doing it you can make a PR to urls-sources to get all the urls added and auto queued: https://github.com/ArchiveTeam/urls-sources
15:28:05<Somebody2>AK: sweet, I hadn't noticed that project
15:28:09<Jake>https://github.com/ArchiveTeam/urls-sources/issues/1 if this is anything to go by, looks like we didn't. I think it's a great idea :-)
15:33:52bonga quits [Read error: Connection reset by peer]
15:35:04bonga joins
16:09:33Mayk78 quits [Quit: ZNC 1.7.5+deb4 - https://znc.in]
16:27:52HP_Archivist quits [Client Quit]
17:31:47klg quits [Ping timeout: 265 seconds]
17:40:23<thuban>i am sorry to say that mcgraw-hill seems to have banned the ip i was using to brute-force the glencoe sites.
17:43:28<thuban>thoughts? i got about a fifth of the way through, so it might be possible to finish from just a few more ips (unless they deploy the banhammer quicker now that they've noticed), but it seems like it might be better to distribute it--
17:48:07<thuban>how, though? seesaw? it seems a waste to write a script for something so simple. #//? no recursion, so the site scrapes would have to be done separately from the brute-force; fine in principle, but i'm not sure how doable it would be to extract the list of hits from the results.
17:49:25<thuban>(also it's a _lot_ of candidates, don't know how much that would eat into general archiving)
18:13:58<thuban>oh, no, #// items can recurse if queued appropriately
18:21:28<ThreeHM>Maybe #Y could work for this?
18:21:34klg (klg) joins
18:32:26<thuban>oh, is that actually running now?
18:32:50<ThreeHM>It did at some point, not sure what the current status is
18:34:17<AK>thuban, do you have a list of sites?
18:36:31<thuban>AK: i have a list of urls at which there might be sites. (404 pages have no outlinks, fwiw.) i can produce a list of sites within the first 1800000000 urls from my partial results.
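thuban's observation that the 404 pages carry no outlinks suggests a simple way to extract the list of hits from brute-force results: any fetched page containing at least one `href` is a real site. A sketch, assuming the response bodies are available:

```python
import re

HREF = re.compile(r'href\s*=\s*["\'][^"\']+["\']', re.I)

def is_live_site(html):
    """Classify a fetched page: the 404 pages in this URL space have no
    outlinks, so any href at all marks a real site worth a full scrape."""
    return bool(HREF.search(html))
```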
18:37:02<AK>If you can grab a list, we can probably tag ark_iver and ask for it to be thrown in
18:37:10<AK>*thrown in #Y
18:38:04<thuban>the list of urls or the partial list of sites?
18:40:49<AK>Hmm, list of sites might make more sense
19:07:59<h2ibot>52,700 edits edited Deathwatch (+92, /* 2022 */ add Hashbase): https://wiki.archiveteam.org/?diff=48598&oldid=48568
19:31:05<@JAA>Travis CI is breaking shit again, as is tradition. Sometime recently, travis-ci.com/user/repo was removed without redirecting to the new format app.travis-ci.com/github/user/repo.
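Since Travis no longer redirects between the two forms JAA describes, anything holding old links has to rewrite them itself; a minimal sketch of that mapping (naive: it assumes every `travis-ci.com` path is a GitHub `user/repo`):

```python
def migrate_travis_url(url):
    """Rewrite the retired travis-ci.com/user/repo form to the
    app.travis-ci.com/github/user/repo form, which Travis itself
    no longer redirects to."""
    prefix = "https://travis-ci.com/"
    if url.startswith(prefix):
        return "https://app.travis-ci.com/github/" + url[len(prefix):]
    return url
```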
19:32:02<h2ibot>Entartet edited List of websites excluded from the Wayback Machine (+27, Added miloush.net.): https://wiki.archiveteam.org/?diff=48599&oldid=48596
20:00:06<h2ibot>JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=48600&oldid=48599
20:03:57qwertyasdfuiopghjkl joins
20:41:57Ruthalas quits [Ping timeout: 252 seconds]
20:46:55Ruthalas (Ruthalas) joins
20:56:27bonga quits [Remote host closed the connection]
20:56:41bonga joins
21:07:16<h2ibot>Entartet edited List of websites excluded from the Wayback Machine (-14, Fixed two broken HTML comments. Removed…): https://wiki.archiveteam.org/?diff=48601&oldid=48600
21:07:17<h2ibot>Entartet edited List of lost Twitter accounts (-18, Removed [[Category:List]] because this page is…): https://wiki.archiveteam.org/?diff=48602&oldid=46735
21:07:18<h2ibot>Entartet edited List of volatile messages (-18, Removed [[Category:List]] because this page is…): https://wiki.archiveteam.org/?diff=48603&oldid=45200
21:07:19<h2ibot>Entartet edited List of book databases (+96, Added fantasticfiction.com and isfdb.org.): https://wiki.archiveteam.org/?diff=48604&oldid=47164
21:46:12march_happy (march_happy) joins
22:31:04Pingerfowder quits [Remote host closed the connection]
22:31:13Pingerfowder (Pingerfowder) joins
23:00:58BlueMaxima joins
23:05:55Arcorann (Arcorann) joins
23:22:32lennier1 quits [Quit: Going offline, see ya! (www.adiirc.com)]
23:28:57lennier1 (lennier1) joins
23:34:39march_happy quits [Ping timeout: 252 seconds]
23:35:54march_happy (march_happy) joins
23:51:49bonga quits [Ping timeout: 265 seconds]
23:54:23bonga joins