00:12:31G4te_Keep3r quits [Client Quit]
00:12:48G4te_Keep3r joins
00:25:03<Somebody2>TheTechRobo: Note that this is just that tag; there are likely more categorized as cemetery, etc.
00:25:36<Somebody2>And what's wrong with just running them thru a normal spider, like ArchiveBot?
00:42:58Stiletto joins
00:48:56<TheTechRobo>Somebody2: Yeah, there's nothing wrong with a normal spider. But that takes time with ignores, etc. It also doesn't get any XHR data,
00:50:55<TheTechRobo>Which is why I'm doing targeted crawls.
00:51:37<thuban>what is it you're using?
00:51:59<TheTechRobo>Seesaw
01:01:30tbc1887 (tbc1887) joins
01:02:43dm4v quits [Read error: Connection reset by peer]
01:03:39dm4v joins
01:03:41dm4v quits [Changing host]
01:03:41dm4v (dm4v) joins
01:07:55<TheTechRobo>It's great practice, actually. I'm getting pretty good with Seesaw, wget+lua, and lua in general! :-)
01:12:01<thuban>hmm, more of us should do that
01:13:23tbc1887 quits [Client Quit]
01:14:26bonga joins
01:16:09<Somebody2>agreed
01:16:15<Somebody2>thanks for doing that, TheTechRobo !
01:16:55<TheTechRobo>I'm hoping eventually I'll be able to help with official AT projects.
01:17:21<Somebody2>Out of the 3k sites, I bet a number of them use the same framework for their obits. You might want to try doing some target spidering of all of them, to see if you can
01:17:33<Somebody2>... identify some common tools used, then write custom code for those.
01:17:58<Somebody2>Have you posted your code up somewhere?
01:20:35Megame quits [Client Quit]
01:33:51<TheTechRobo>Somebody2: I can, if you want
01:33:55<TheTechRobo>It's very ugly and terrible, though
01:37:00<TheTechRobo>Somebody2: https://git.thetechrobo.ca/TheTechRobo/funeralhomes-grab
01:37:30<TheTechRobo>Somebody2: yeah, I'm sure there are a lot of similar frameworks out there, though :/
01:37:40<TheTechRobo>also 90% of html parsing is per site
01:37:55<TheTechRobo>it could be easily changed to allow other sites, though
01:38:18<TheTechRobo>(basically to prevent infinite queuing, it makes sure it's on a specific page of the website which includes the domain name. but i can just add another part to the conditional)
01:38:52<TheTechRobo>Unfortunately, all the commit dates are gone
01:39:20<TheTechRobo>let's move to #archiveteam-dev for further programming conversation
01:47:26tzt quits [Remote host closed the connection]
01:48:34tzt (tzt) joins
02:02:09march_happy quits [Ping timeout: 252 seconds]
02:03:20march_happy (march_happy) joins
02:21:24march_happy quits [Ping timeout: 252 seconds]
02:22:59march_happy (march_happy) joins
02:57:22michaelblob_ quits [Read error: Connection reset by peer]
02:59:09michaelblob (michaelblob) joins
02:59:16michaelblob_ (michaelblob) joins
03:00:27march_happy quits [Ping timeout: 252 seconds]
03:00:33michaelblob quits [Client Quit]
03:01:21march_happy (march_happy) joins
03:03:43lunik1 quits [Ping timeout: 265 seconds]
03:05:24Sanqui quits [Ping timeout: 252 seconds]
03:13:15lunik1 joins
03:15:27Sanqui joins
04:07:19<lennier1>I've extracted 13,492,859 links from the Apple App Store sitemap. That includes all international storefronts where an app is listed, so there aren't actually that many apps. https://transfer.archivete.am/AHJmS/appStoreLinks.zip
04:08:20<lennier1>Is that feasible to archive? From spot checking, the Wayback Machine coverage is definitely incomplete.
04:12:17eroc1990 quits [Quit: The Lounge - https://thelounge.chat]
04:40:10eroc1990 (eroc1990) joins
05:14:42Arcorann quits [Ping timeout: 265 seconds]
05:50:18<tech234a>lennier1: out of curiosity how many unique apps (regardless of region)?
05:53:00<tech234a>Also from https://apps.apple.com/robots.txt is a list of App Store "stories", not sure how valuable those are https://apps.apple.com/sitemaps_story_index_1.xml
05:55:34<tech234a>Example stories: https://apps.apple.com/us/story/whats-in-a-name-when-its-io/id1273615890 https://apps.apple.com/us/story/seamlessly-scan-everything/id1469965100
05:56:13<tech234a>Surprisingly a lot of them are intended for regions other than the US
05:59:52<tech234a>There's also https://itunes.apple.com/robots.txt for other parts of the iTunes store
06:05:24<tech234a>1,610,224 app IDs btw (based on extracting and deduping the last part of the URL paths)
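The dedup tech234a describes (13.5M storefront links down to 1.6M app IDs) works because the trailing `/idNNNNNN` path segment is the same in every regional storefront. A sketch:

```python
def unique_app_ids(urls):
    """Dedupe App Store links down to app IDs: the same app appears once
    per storefront, but the trailing /idNNNNNN segment is region-independent."""
    ids = set()
    for url in urls:
        last = url.rstrip("/").rsplit("/", 1)[-1]
        if last.startswith("id") and last[2:].isdigit():
            ids.add(last[2:])
    return ids

links = [
    "https://apps.apple.com/us/app/hangouts/id643496868",
    "https://apps.apple.com/gb/app/hangouts/id643496868",
    "https://apps.apple.com/de/app/hangouts/id643496868",
]
```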
06:06:41<tech234a>The number seems about right for the total size of the App Store given all the removals they've been doing https://en.wikipedia.org/wiki/App_Store_(iOS/iPadOS)
06:08:20<tech234a>https://www.apple.com/app-store/ also approximately backs up the numbers and says there should be 20k+ stories
06:12:25<tech234a>Also worth noting that unlisted apps became a thing recently, Google Hangouts is one of those apps and is not included in the sitemap https://apps.apple.com/us/app/hangouts/id643496868
06:16:03Arcorann (Arcorann) joins
06:44:14<lennier1>Interesting. Is there any way to find unlisted apps?
07:07:12<tech234a>lennier1: if they were previously public they would have kept their same IDs/links
07:08:24<tech234a>otherwise unless the link is published elsewhere probably not unless you want to iterate through all the possible IDs
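Sweeping the ID space, as tech234a mentions, would just mean generating candidate URLs for a contiguous range; this sketch assumes the slug-less `/app/idNNNNNN` URL form resolves, and says nothing about the mostly-404 request volume such a sweep implies.

```python
def candidate_urls(start_id, end_id, storefront="us"):
    """Yield App Store URLs for a contiguous ID range. Unlisted apps keep
    their original IDs, so sweeping a range is the only way to find links
    that aren't published anywhere."""
    for app_id in range(start_id, end_id):
        yield f"https://apps.apple.com/{storefront}/app/id{app_id}"
```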
07:27:16march_happy quits [Ping timeout: 265 seconds]
07:28:21march_happy (march_happy) joins
07:32:42march_happy quits [Ping timeout: 252 seconds]
07:33:52march_happy (march_happy) joins
08:02:24march_happy quits [Ping timeout: 252 seconds]
08:03:23march_happy (march_happy) joins
08:12:18march_happy quits [Ping timeout: 252 seconds]
08:12:32march_happy (march_happy) joins
08:53:21adia quits [Quit: The Lounge - https://thelounge.chat]
08:53:36adia (adia) joins
09:20:36kn1002 joins
09:22:39kn100 quits [Ping timeout: 265 seconds]
09:22:39kn1002 is now known as kn100
10:28:45qwertyasdfuiopghjkl quits [Client Quit]
10:55:26march_happy quits [Remote host closed the connection]
11:02:49march_happy (march_happy) joins
11:10:08qwertyasdfuiopghjkl joins
11:47:47qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
11:54:43<TheTechRobo>Somebody2: Very sorry about the inconvenience; the repository was set to private. You should now be able to access the funeralhomes link
12:25:51march_happy quits [Ping timeout: 252 seconds]
12:33:27march_happy (march_happy) joins
12:55:00BlueMaxima quits [Read error: Connection reset by peer]
13:07:41yano quits [Quit: WeeChat, the better IRC client, https://weechat.org/]
13:09:28march_happy quits [Ping timeout: 265 seconds]
13:10:12march_happy (march_happy) joins
13:11:37yano (yano) joins
13:55:23<Somebody2>Nice -- I hadn't gotten around to looking at it, so no harm done.
13:57:30<AK>Hmm, does every domain in osm get at least one archive in the wbm I wonder
14:17:36<Jake>I can't remember, but I know at one point we wanted to throw OSM links into #//
14:21:21Arcorann quits [Ping timeout: 265 seconds]
14:42:37Barto quits [Ping timeout: 265 seconds]
14:43:20Barto (Barto) joins
14:48:51adia quits [Client Quit]
14:48:51kn100 quits [Client Quit]
14:48:51driib quits [Client Quit]
14:48:51G4te_Keep3r quits [Client Quit]
14:48:51yano quits [Remote host closed the connection]
14:48:51dm4v quits [Client Quit]
14:48:56adia (adia) joins
14:48:56dm4v joins
14:48:57dm4v quits [Changing host]
14:48:57dm4v (dm4v) joins
14:48:58driib (driib) joins
14:49:05G4te_Keep3r joins
14:49:13yano (yano) joins
14:49:37kn100 joins
14:49:53<Somebody2>If we haven't done that, we should.
15:00:09march_happy quits [Ping timeout: 265 seconds]
15:07:00bonga quits [Ping timeout: 252 seconds]
15:07:30bonga joins
15:19:40<AK>If anyone fancies doing it you can make a PR to urls-sources to get all the urls added and auto queued: https://github.com/ArchiveTeam/urls-sources
15:28:05<Somebody2>AK: sweet, I hadn't noticed that project
15:28:09<Jake>https://github.com/ArchiveTeam/urls-sources/issues/1 if this is anything to go by, looks like we didn't. I think it's a great idea :-)
15:33:52bonga quits [Read error: Connection reset by peer]
15:35:04bonga joins
16:09:33Mayk78 quits [Quit: ZNC 1.7.5+deb4 - https://znc.in]
16:27:52HP_Archivist quits [Client Quit]
17:31:47klg quits [Ping timeout: 265 seconds]
17:40:23<thuban>i am sorry to say that mcgraw-hill seems to have banned the ip i was using to brute-force the glencoe sites.
17:43:28<thuban>thoughts? i got about a fifth of the way through, so it might be possible to finish from just a few more ips (unless they deploy the banhammer quicker now that they've noticed), but it seems like it might be better to distribute it--
17:48:07<thuban>how, though? seesaw? it seems a waste to write a script for something so simple. #//? no recursion, so the site scrapes would have to be done separately from the brute-force; fine in principle, but i'm not sure how doable it would be to extract the list of hits from the results.
17:49:25<thuban>(also it's a _lot_ of candidates, don't know how much that would eat into general archiving)
18:13:58<thuban>oh, no, #// items can recurse if queued appropriately
18:21:28<ThreeHM>Maybe #Y could work for this?
18:21:34klg (klg) joins
18:32:26<thuban>oh, is that actually running now?
18:32:50<ThreeHM>It did at some point, not sure what the current status is
18:34:17<AK>thuban, do you have a list of sites?
18:36:31<thuban>AK: i have a list of urls at which there might be sites. (404 pages have no outlinks, fwiw.) i can produce a list of sites within the first 1800000000 urls from my partial results.
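thuban's observation that the 404 pages carry no outlinks suggests a simple way to extract the list of hits from brute-force results: any fetched page containing at least one `href` is a real site. A sketch, assuming the response bodies are available:

```python
import re

HREF = re.compile(r'href\s*=\s*["\'][^"\']+["\']', re.I)

def is_live_site(html):
    """Classify a fetched page: the 404 pages in this URL space have no
    outlinks, so any href at all marks a real site worth a full scrape."""
    return bool(HREF.search(html))
```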
18:37:02<AK>If you can grab a list, we can probably tag ark_iver and ask for it to be thrown in
18:37:10<AK>*thrown in #Y
18:38:04<thuban>the list of urls or the partial list of sites?
18:40:49<AK>Hmm, list of sites might make more sense
19:07:59<h2ibot>52,700 edits edited Deathwatch (+92, /* 2022 */ add Hashbase): https://wiki.archiveteam.org/?diff=48598&oldid=48568
19:31:05<@JAA>Travis CI is breaking shit again, as is tradition. Sometime recently, travis-ci.com/user/repo was removed without redirecting to the new format app.travis-ci.com/github/user/repo.
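Since Travis no longer redirects between the two forms JAA describes, anything holding old links has to rewrite them itself; a minimal sketch of that mapping (naive: it assumes every `travis-ci.com` path is a GitHub `user/repo`):

```python
def migrate_travis_url(url):
    """Rewrite the retired travis-ci.com/user/repo form to the
    app.travis-ci.com/github/user/repo form, which Travis itself
    no longer redirects to."""
    prefix = "https://travis-ci.com/"
    if url.startswith(prefix):
        return "https://app.travis-ci.com/github/" + url[len(prefix):]
    return url
```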
19:32:02<h2ibot>Entartet edited List of websites excluded from the Wayback Machine (+27, Added miloush.net.): https://wiki.archiveteam.org/?diff=48599&oldid=48596
20:00:06<h2ibot>JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=48600&oldid=48599
20:03:57qwertyasdfuiopghjkl joins
20:41:57Ruthalas quits [Ping timeout: 252 seconds]
20:46:55Ruthalas (Ruthalas) joins
20:56:27bonga quits [Remote host closed the connection]
20:56:41bonga joins
21:07:16<h2ibot>Entartet edited List of websites excluded from the Wayback Machine (-14, Fixed two broken HTML comments. Removed…): https://wiki.archiveteam.org/?diff=48601&oldid=48600
21:07:17<h2ibot>Entartet edited List of lost Twitter accounts (-18, Removed [[Category:List]] because this page is…): https://wiki.archiveteam.org/?diff=48602&oldid=46735
21:07:18<h2ibot>Entartet edited List of volatile messages (-18, Removed [[Category:List]] because this page is…): https://wiki.archiveteam.org/?diff=48603&oldid=45200
21:07:19<h2ibot>Entartet edited List of book databases (+96, Added fantasticfiction.com and isfdb.org.): https://wiki.archiveteam.org/?diff=48604&oldid=47164
21:46:12march_happy (march_happy) joins
22:31:04Pingerfowder quits [Remote host closed the connection]
22:31:13Pingerfowder (Pingerfowder) joins
23:00:58BlueMaxima joins
23:05:55Arcorann (Arcorann) joins
23:22:32lennier1 quits [Quit: Going offline, see ya! (www.adiirc.com)]
23:28:57lennier1 (lennier1) joins
23:34:39march_happy quits [Ping timeout: 252 seconds]
23:35:54march_happy (march_happy) joins
23:51:49bonga quits [Ping timeout: 265 seconds]
23:54:23bonga joins