| 00:12:31 | | G4te_Keep3r quits [Client Quit] |
| 00:12:48 | | G4te_Keep3r joins |
| 00:25:03 | <Somebody2> | TheTechRobo: Note that this is just that tag; there are likely more categorized as cemetery, etc. |
| 00:25:36 | <Somebody2> | And what's wrong with just running them thru a normal spider, like ArchiveBot? |
| 00:42:58 | | Stiletto joins |
| 00:48:56 | <TheTechRobo> | Somebody2: Yeah, there's nothing wrong with a normal spider. But that takes time with ignores, etc. It also doesn't get any XHR data, |
| 00:50:55 | <TheTechRobo> | Which is why I'm doing targeted crawls. |
| 00:51:37 | <thuban> | what is it you're using? |
| 00:51:59 | <TheTechRobo> | Seesaw |
| 01:01:30 | | tbc1887 (tbc1887) joins |
| 01:02:43 | | dm4v quits [Read error: Connection reset by peer] |
| 01:03:39 | | dm4v joins |
| 01:03:41 | | dm4v is now authenticated as dm4v |
| 01:03:41 | | dm4v quits [Changing host] |
| 01:03:41 | | dm4v (dm4v) joins |
| 01:07:55 | <TheTechRobo> | It's great practice, actually. I'm getting pretty good with Seesaw, wget+lua, and lua in general! :-) |
| 01:12:01 | <thuban> | hmm, more of us should do that |
| 01:13:23 | | tbc1887 quits [Client Quit] |
| 01:14:26 | | bonga joins |
| 01:16:09 | <Somebody2> | agreed |
| 01:16:15 | <Somebody2> | thanks for doing that, TheTechRobo ! |
| 01:16:55 | <TheTechRobo> | I'm hoping eventually I'll be able to help with official AT projects. |
| 01:17:21 | <Somebody2> | Out of the 3k sites, I bet a number of them use the same framework for their obits. You might want to try doing some targeted spidering of all of them, to see if you can |
| 01:17:33 | <Somebody2> | ... identify some common tools used, then write custom code for those. |
| 01:17:58 | <Somebody2> | Have you posted your code up somewhere? |
| 01:20:35 | | Megame quits [Client Quit] |
| 01:33:51 | <TheTechRobo> | Somebody2: I can, if you want |
| 01:33:55 | <TheTechRobo> | It's very ugly and terrible, though |
| 01:37:00 | <TheTechRobo> | Somebody2: https://git.thetechrobo.ca/TheTechRobo/funeralhomes-grab |
| 01:37:30 | <TheTechRobo> | Somebody2: yeah, I'm sure there's a lot of similar frameworks. probably lots out there, though :/ |
| 01:37:40 | <TheTechRobo> | also 90% of html parsing is per site |
| 01:37:55 | <TheTechRobo> | it could be easily changed to allow other sites, though |
| 01:38:18 | <TheTechRobo> | (basically to prevent infinite queuing, it makes sure it's on a specific page of the website which includes the domain name. but i can just add another part to the conditional) |
| 01:38:52 | <TheTechRobo> | Unfortunately, all the commit dates are gone |
| 01:39:20 | <TheTechRobo> | let's move to #archiveteam-dev for further programming conversation |
| 01:47:26 | | tzt quits [Remote host closed the connection] |
| 01:48:34 | | tzt (tzt) joins |
| 02:02:09 | | march_happy quits [Ping timeout: 252 seconds] |
| 02:03:20 | | march_happy (march_happy) joins |
| 02:21:24 | | march_happy quits [Ping timeout: 252 seconds] |
| 02:22:59 | | march_happy (march_happy) joins |
| 02:57:22 | | michaelblob_ quits [Read error: Connection reset by peer] |
| 02:59:09 | | michaelblob (michaelblob) joins |
| 02:59:16 | | michaelblob_ (michaelblob) joins |
| 03:00:27 | | march_happy quits [Ping timeout: 252 seconds] |
| 03:00:33 | | michaelblob quits [Client Quit] |
| 03:01:21 | | march_happy (march_happy) joins |
| 03:03:43 | | lunik1 quits [Ping timeout: 265 seconds] |
| 03:05:24 | | Sanqui quits [Ping timeout: 252 seconds] |
| 03:13:15 | | lunik1 joins |
| 03:15:27 | | Sanqui joins |
| 03:15:29 | | Sanqui is now authenticated as Sanqui |
| 04:07:19 | <lennier1> | I've extracted 13,492,859 links from the Apple App Store sitemap. That includes all international storefronts where an app is listed, so there aren't actually that many apps. https://transfer.archivete.am/AHJmS/appStoreLinks.zip |
| 04:08:20 | <lennier1> | Is that feasible to archive? From spot checking, the Wayback Machine coverage is definitely incomplete. |
| 04:12:17 | | eroc1990 quits [Quit: The Lounge - https://thelounge.chat] |
| 04:40:10 | | eroc1990 (eroc1990) joins |
| 05:14:42 | | Arcorann quits [Ping timeout: 265 seconds] |
| 05:50:18 | <tech234a> | lennier1: out of curiosity how many unique apps (regardless of region)? |
| 05:53:00 | <tech234a> | Also from https://apps.apple.com/robots.txt is a list of App Store "stories", not sure how valuable those are https://apps.apple.com/sitemaps_story_index_1.xml |
| 05:55:34 | <tech234a> | Example stories: https://apps.apple.com/us/story/whats-in-a-name-when-its-io/id1273615890 https://apps.apple.com/us/story/seamlessly-scan-everything/id1469965100 |
| 05:56:13 | <tech234a> | Surprisingly a lot of them are intended for regions other than the US |
| 05:59:52 | <tech234a> | There's also https://itunes.apple.com/robots.txt for other parts of the iTunes store |
| 06:05:24 | <tech234a> | 1,610,224 app IDs btw (based on extracting and deduping the last part of the URL paths) |
| 06:06:41 | <tech234a> | The number seems about right for the total size of the App Store given all the removals they've been doing https://en.wikipedia.org/wiki/App_Store_(iOS/iPadOS) |
| 06:08:20 | <tech234a> | https://www.apple.com/app-store/ also approximately backs up the numbers and says there should be 20k+ stories |
| 06:12:25 | <tech234a> | Also worth noting that unlisted apps became a thing recently, Google Hangouts is one of those apps and is not included in the sitemap https://apps.apple.com/us/app/hangouts/id643496868 |
| 06:16:03 | | Arcorann (Arcorann) joins |
| 06:44:14 | <lennier1> | Interesting. Is there any way to find unlisted apps? |
| 07:07:12 | <tech234a> | lennier1: if they were previously public they would have kept their same IDs/links |
| 07:08:24 | <tech234a> | otherwise unless the link is published elsewhere probably not unless you want to iterate through all the possible IDs |
| 07:27:16 | | march_happy quits [Ping timeout: 265 seconds] |
| 07:28:21 | | march_happy (march_happy) joins |
| 07:32:42 | | march_happy quits [Ping timeout: 252 seconds] |
| 07:33:52 | | march_happy (march_happy) joins |
| 08:02:24 | | march_happy quits [Ping timeout: 252 seconds] |
| 08:03:23 | | march_happy (march_happy) joins |
| 08:12:18 | | march_happy quits [Ping timeout: 252 seconds] |
| 08:12:32 | | march_happy (march_happy) joins |
| 08:53:21 | | adia quits [Quit: The Lounge - https://thelounge.chat] |
| 08:53:36 | | adia (adia) joins |
| 09:20:36 | | kn1002 joins |
| 09:22:39 | | kn100 quits [Ping timeout: 265 seconds] |
| 09:22:39 | | kn1002 is now known as kn100 |
| 10:28:45 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 10:55:26 | | march_happy quits [Remote host closed the connection] |
| 11:02:49 | | march_happy (march_happy) joins |
| 11:10:08 | | qwertyasdfuiopghjkl joins |
| 11:47:47 | | qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds] |
| 11:54:43 | <TheTechRobo> | Somebody2: Very sorry about the inconvenience; the repository was set to private. you should now be able to access the funeralhomes link |
| 12:25:51 | | march_happy quits [Ping timeout: 252 seconds] |
| 12:33:27 | | march_happy (march_happy) joins |
| 12:55:00 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 13:07:41 | | yano quits [Quit: WeeChat, the better IRC client, https://weechat.org/] |
| 13:09:28 | | march_happy quits [Ping timeout: 265 seconds] |
| 13:10:12 | | march_happy (march_happy) joins |
| 13:11:37 | | yano (yano) joins |
| 13:55:23 | <Somebody2> | Nice -- I hadn't gotten around to looking at it, so no harm done. |
| 13:57:30 | <AK> | Hmm, does every domain in osm get at least one archive in the wbm I wonder |
| 14:17:36 | <Jake> | I can't remember, but I know at one point we wanted to throw OSM links into #// |
| 14:21:21 | | Arcorann quits [Ping timeout: 265 seconds] |
| 14:42:37 | | Barto quits [Ping timeout: 265 seconds] |
| 14:43:20 | | Barto (Barto) joins |
| 14:48:51 | | adia quits [Client Quit] |
| 14:48:51 | | kn100 quits [Client Quit] |
| 14:48:51 | | driib quits [Client Quit] |
| 14:48:51 | | G4te_Keep3r quits [Client Quit] |
| 14:48:51 | | yano quits [Remote host closed the connection] |
| 14:48:51 | | dm4v quits [Client Quit] |
| 14:48:56 | | adia (adia) joins |
| 14:48:56 | | dm4v joins |
| 14:48:57 | | dm4v is now authenticated as dm4v |
| 14:48:57 | | dm4v quits [Changing host] |
| 14:48:57 | | dm4v (dm4v) joins |
| 14:48:58 | | driib (driib) joins |
| 14:49:05 | | G4te_Keep3r joins |
| 14:49:13 | | yano (yano) joins |
| 14:49:37 | | kn100 joins |
| 14:49:53 | <Somebody2> | If we haven't done that, we should. |
| 15:00:09 | | march_happy quits [Ping timeout: 265 seconds] |
| 15:07:00 | | bonga quits [Ping timeout: 252 seconds] |
| 15:07:30 | | bonga joins |
| 15:19:40 | <AK> | If anyone fancies doing it you can make a PR to urls-sources to get all the urls added and auto queued: https://github.com/ArchiveTeam/urls-sources |
| 15:28:05 | <Somebody2> | AK: sweet, I hadn't noticed that project |
| 15:28:09 | <Jake> | https://github.com/ArchiveTeam/urls-sources/issues/1 if this is anything to go by, looks like we didn't. I think it's a great idea :-) |
| 15:33:52 | | bonga quits [Read error: Connection reset by peer] |
| 15:35:04 | | bonga joins |
| 16:09:33 | | Mayk78 quits [Quit: ZNC 1.7.5+deb4 - https://znc.in] |
| 16:27:52 | | HP_Archivist quits [Client Quit] |
| 17:31:47 | | klg quits [Ping timeout: 265 seconds] |
| 17:40:23 | <thuban> | i am sorry to say that mcgraw-hill seems to have banned the ip i was using to brute-force the glencoe sites. |
| 17:43:28 | <thuban> | thoughts? i got about a fifth of the way through, so it might be possible to finish from just a few more ips (unless they deploy the banhammer quicker now that they've noticed), but it seems like it might be better to distribute it-- |
| 17:48:07 | <thuban> | how, though? seesaw? it seems a waste to write a script for something so simple. #//? no recursion, so the site scrapes would have to be done separately from the brute-force; fine in principle, but i'm not sure how doable it would be to extract the list of hits from the results. |
| 17:49:25 | <thuban> | (also it's a _lot_ of candidates, don't know how much that would eat into general archiving) |
| 18:13:58 | <thuban> | oh, no, #// items can recurse if queued appropriately |
| 18:21:28 | <ThreeHM> | Maybe #Y could work for this? |
| 18:21:34 | | klg (klg) joins |
| 18:32:26 | <thuban> | oh, is that actually running now? |
| 18:32:50 | <ThreeHM> | It did at some point, not sure what the current status is |
| 18:34:17 | <AK> | thuban, do you have a list of sites? |
| 18:36:31 | <thuban> | AK: i have a list of urls at which there might be sites. (404 pages have no outlinks, fwiw.) i can produce a list of sites within the first 1800000000 urls from my partial results. |
| 18:37:02 | <AK> | If you can grab a list, we can probably tag ark_iver and ask for it to be thrown in |
| 18:37:10 | <AK> | *thrown in #Y |
| 18:38:04 | <thuban> | the list of urls or the partial list of sites? |
| 18:40:49 | <AK> | Hmm, list of sites might make more sense |
| 19:07:59 | <h2ibot> | 52,700 edits edited Deathwatch (+92, /* 2022 */ add Hashbase): https://wiki.archiveteam.org/?diff=48598&oldid=48568 |
| 19:31:05 | <@JAA> | Travis CI is breaking shit again, as is tradition. Sometime recently, travis-ci.com/user/repo was removed without redirecting to the new format app.travis-ci.com/github/user/repo. |
| 19:32:02 | <h2ibot> | Entartet edited List of websites excluded from the Wayback Machine (+27, Added miloush.net.): https://wiki.archiveteam.org/?diff=48599&oldid=48596 |
| 20:00:06 | <h2ibot> | JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=48600&oldid=48599 |
| 20:03:57 | | qwertyasdfuiopghjkl joins |
| 20:41:57 | | Ruthalas quits [Ping timeout: 252 seconds] |
| 20:46:55 | | Ruthalas (Ruthalas) joins |
| 20:56:27 | | bonga quits [Remote host closed the connection] |
| 20:56:41 | | bonga joins |
| 21:07:16 | <h2ibot> | Entartet edited List of websites excluded from the Wayback Machine (-14, Fixed two broken HTML comments. Removed…): https://wiki.archiveteam.org/?diff=48601&oldid=48600 |
| 21:07:17 | <h2ibot> | Entartet edited List of lost Twitter accounts (-18, Removed [[Category:List]] because this page is…): https://wiki.archiveteam.org/?diff=48602&oldid=46735 |
| 21:07:18 | <h2ibot> | Entartet edited List of volatile messages (-18, Removed [[Category:List]] because this page is…): https://wiki.archiveteam.org/?diff=48603&oldid=45200 |
| 21:07:19 | <h2ibot> | Entartet edited List of book databases (+96, Added fantasticfiction.com and isfdb.org.): https://wiki.archiveteam.org/?diff=48604&oldid=47164 |
| 21:46:12 | | march_happy (march_happy) joins |
| 22:31:04 | | Pingerfowder quits [Remote host closed the connection] |
| 22:31:13 | | Pingerfowder (Pingerfowder) joins |
| 23:00:58 | | BlueMaxima joins |
| 23:05:55 | | Arcorann (Arcorann) joins |
| 23:22:32 | | lennier1 quits [Quit: Going offline, see ya! (www.adiirc.com)] |
| 23:28:57 | | lennier1 (lennier1) joins |
| 23:34:39 | | march_happy quits [Ping timeout: 252 seconds] |
| 23:35:54 | | march_happy (march_happy) joins |
| 23:51:49 | | bonga quits [Ping timeout: 265 seconds] |
| 23:54:23 | | bonga joins |
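
The 06:05:24 message from tech234a describes counting unique apps by extracting and deduping the last part of each URL path. A minimal sketch of that step follows, assuming the link list is a plain sequence of URL strings shaped like the examples in the log (`https://apps.apple.com/<region>/app/<name>/id<digits>`). The function names `extract_app_id` and `unique_app_ids` are hypothetical, and this is not necessarily the script that produced the 1,610,224 figure.

```python
import re
from urllib.parse import urlparse

# Assumption: the final path segment of an app URL looks like "id643496868".
_ID_SEGMENT = re.compile(r"^id(\d+)$")


def extract_app_id(url):
    """Return the numeric app ID from an App Store app URL, or None.

    Story URLs (".../story/...") also end in an id segment, so only
    paths that contain an "app" segment before the ID are accepted.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments or "app" not in segments[:-1]:
        return None
    match = _ID_SEGMENT.match(segments[-1])
    return match.group(1) if match else None


def unique_app_ids(urls):
    """Collapse per-storefront URLs into a set of distinct app IDs."""
    ids = set()
    for url in urls:
        app_id = extract_app_id(url)
        if app_id is not None:
            ids.add(app_id)
    return ids


if __name__ == "__main__":
    sample = [
        "https://apps.apple.com/us/app/hangouts/id643496868",
        "https://apps.apple.com/gb/app/hangouts/id643496868",
        "https://apps.apple.com/us/story/seamlessly-scan-everything/id1469965100",
    ]
    print(len(unique_app_ids(sample)))
```

Using a set keeps memory proportional to the number of distinct apps rather than the 13 million storefront links, which is why the same app listed in many regions is only counted once.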