00:02:51<@arkiver>thuban: maybe, needs a better look
00:02:55<@arkiver>do you have some examples?
00:06:17<thuban>arkiver: http://glencoe.mheducation.com/sites/0078807239/, http://glencoe.mheducation.com/sites/0000001840, http://glencoe.mheducation.com/sites/1777777444/, http://glencoe.mheducation.com/sites/007895312x/, etc
00:07:40<thuban>if you poke around through a browser, you may see some 'password protection', but it's client-side--everything i've spot-checked is accessible through the sitemap html, although it may be hidden with js
00:12:50<thuban>(Jake: the ids usually appear to be the isbn of the book the site is associated with, but not always--and i can't necessarily tell whether they're using a non-isbn id for a particular site or they really did assign an invalid isbn to a particular edition/resource (as publishers sometimes do) and worldcat just doesn't have it)
00:13:31<thuban>(the point is just that we can't use the isbn-10 spec to narrow the search space without losing data)
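For reference, the ISBN-10 check digit rule thuban is alluding to can be sketched in Python (the function name is illustrative; the weighted-sum-mod-11 rule itself is the standard ISBN-10 checksum):

```python
def is_valid_isbn10(s: str) -> bool:
    """ISBN-10 checksum: sum(value_i * (10 - i)) mod 11 == 0,
    where a trailing 'X' stands for the value 10."""
    s = s.strip().upper()
    if len(s) != 10:
        return False
    total = 0
    for i, ch in enumerate(s):
        if ch == "X" and i == 9:
            value = 10
        elif ch.isdigit():
            value = int(ch)
        else:
            return False
        total += value * (10 - i)
    return total % 11 == 0
```

This shows the problem with narrowing the search space: some real site IDs (e.g. 2222555555) fail the check while others (e.g. 0073513326, 007895312x) pass, so filtering on validity would drop live pages.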
00:14:17HackMii quits [Remote host closed the connection]
00:14:47<Jake>Sorry, I was more saying, rather than trying to bruteforce _everything_, grab a list of ISBNs from the company. Bruteforcing that large of a space sounds inefficient to me.
00:15:34<h2ibot>Ryz edited URLTeam/Warrior (+321, /* Warrior projects */ Added fal-cn): https://wiki.archiveteam.org/?diff=48605&oldid=48583
00:16:08HackMii (hacktheplanet) joins
00:17:14<thuban>quite, but unfortunately i don't know of anywhere to get such a list (and even if we did it might not account for all the sites).
00:35:06<Jake>(I'm still looking but https://glencoe.mheducation.com/sites/dl/free/* through CDX gets you a few.)
00:36:02Ruthalas3 (Ruthalas) joins
00:36:58bonga quits [Ping timeout: 265 seconds]
00:37:07bonga joins
00:38:04Ruthalas quits [Ping timeout: 265 seconds]
00:38:04Ruthalas3 is now known as Ruthalas
00:39:08<@JAA>Yeah, this doesn't seem like a valid ISBN: http://glencoe.mheducation.com/sites/2222555555/
00:40:56<@JAA>And here's another page for ... the same book and edition but a different cover? http://glencoe.mheducation.com/sites/0073513326/
00:48:44bonga quits [Read error: Connection reset by peer]
00:49:06bonga joins
00:52:28Mateon2 joins
00:53:46Mateon1 quits [Ping timeout: 252 seconds]
00:53:47Mateon2 is now known as Mateon1
00:58:22DiscantX quits [Ping timeout: 265 seconds]
00:59:42march_happy quits [Remote host closed the connection]
01:00:07<angelika>thuban: seems you can list all their books with the search engine on the main site https://www.mheducation.com/search.html?page=1&sortby=relevance&order=desc&bu=seg
01:00:31<angelika>and those aren't ISBN but what they call "MHID"
01:01:34march_happy (march_happy) joins
01:01:59<angelika>not sure if that would cover everything, 2222555555 is not listed for example
01:02:46<thuban>yeah, all of the ones i've checked are not listed (presumably because they're no longer sold)
01:09:37<Jake>(Sorting that page by oldest copyright gives 2020, and I believe this site is much older.)
01:09:50<TheTechRobo>thuban: Happy to run a script or something to help. :-)
01:14:16<angelika>found old catalog here https://web.archive.org/web/20110102075038/http://catalogs.mhhe.com/mhhe/home.do
01:17:58<HP_Archivist>!a https://www.universaltesttarget.com/index.php --explain "Universal Target Test"
01:18:08<HP_Archivist>oops
01:20:49<Jake>That's where I started, but couldn't find a real complete list. :(
02:07:16<angelika>old domain used to be novella.mhhe.com, CDX has interesting results
02:15:38<pabs>an acquisition: https://glitch.com/ https://www.fastly.com/blog/fastly-announces-acquisition-of-glitch-a-future-of-yes-code-at-the-edge https://news.ycombinator.com/item?id=31434084
02:16:51tyler090 joins
02:18:54<tyler090>Rapid7 has locked access to their internet wide scans, Project Sonar. The wiki should probably be updated to reflect that this is no longer online. Does anyone have archives of this? I'd be willing to mirror the data. I have 42TB free
02:19:05<tyler090>Wiki link: https://wiki.archiveteam.org/index.php/Project_Sonar
02:20:02<@JAA>Unfortunately, they were already very restrictive before that. It would literally have taken years to mirror it all.
02:20:31<@JAA>But yeah, it's a shame that it's locked down almost entirely now.
02:21:49<angelika>the free sonar data was hosted on s3
02:21:56<angelika>for certain scans they would only let you grab most recent ones
02:22:04<@JAA>It was on Backblaze.
02:22:22<tyler090>i doubt those links are still active?
02:22:35<tyler090>otherwise we could use the sonar to find the sonar
02:22:59<@JAA>You could never access it directly. Required a signature, and those expired after a short while.
02:23:55<angelika>the DNS datasets were most useful, real shame they're gone
02:24:03<tyler090>Very much a shame
02:24:09<tyler090>data going back to 2013
02:24:18<@JAA>The signatures were given out by Rapid7's servers (API or website), and you were limited to I think 30 files per 24 hours or something like that.
02:24:35<tyler090>they are not really accepting requests from individuals either
02:25:19<tyler090>so it feels like a rug pull to enrich their business under a false flag of PII concerns
02:25:48<tyler090>seeing as the project was started by the University of Michigan
02:29:53<h2ibot>Ka edited WattPad (+198, templates): https://wiki.archiveteam.org/?diff=48606&oldid=45555
02:29:54<h2ibot>Ka edited Coronavirus/Notable deaths (+395, /* Table */): https://wiki.archiveteam.org/?diff=48607&oldid=48228
02:29:55<h2ibot>Ka edited List of book databases (+16): https://wiki.archiveteam.org/?diff=48608&oldid=48604
02:37:55HackMii quits [Client Quit]
02:39:15Arcorann quits [Ping timeout: 265 seconds]
02:40:46HackMii (hacktheplanet) joins
02:42:18<angelika>tyler090: search engine here https://dns.bufferover.run/dns?q=google.com
02:42:22<angelika>so whoever runs it will have a copy
02:44:56<tyler090>thanks
02:45:30<angelika>there is https://zonefiles.io/ but it costs money
02:46:28<@JAA>There are plenty of sites like that. Rapid7's collection was one of the very few (if not the only major one) that was freely accessible.
02:46:45<angelika>old sonar data here https://archive.org/details/internet-mapping
02:49:32<tyler090>its better than nothing
02:49:42<tyler090>datasets are mostly 2014-15
02:51:22<tyler090>rapid7 collection was unique in that it was raw data, theres a handful of aggregates out there
02:56:39<angelika>certificate transparency logs are another alternative
03:08:08<angelika>as a bulk sub-domain dataset, that should cover about 90% of what they had
03:14:19<angelika>for running internet-wide scans yourself: digitalocean does not care, linode does not care if you tell them first
03:16:40<tzt>https://scans.io/json
03:23:05sonick quits [Quit: Connection closed for inactivity]
05:03:08sonick (sonick) joins
05:24:33hisofhy joins
05:35:05hisofhy quits [Remote host closed the connection]
05:51:47HistoryOfHyrule joins
05:51:59<HistoryOfHyrule>Hey! Thanks
05:52:09<Ryz>Regarding the archive of https://historyofhyrule.com/ from last year via ArchiveBot, I noticed some pages like https://web.archive.org/web/2021*/https://historyofhyrule.com/index.php?frame=publications/index_translationproject.php didn't appear to get archived through ArchiveBot
05:52:25<Ryz>The only instance where it was saved was through the Wayback Machine's Save Page Now feature
05:52:37<Ryz>At least for that year of 2021
05:53:34<Ryz>I think it was one of those websites that I had in my backlog that I was hesitant to archive because of problems like that, stuff that would only be realized if the website was checked through Wayback Machine
05:53:37<HistoryOfHyrule>I feel like such a novice asking this but is there a way I can check which pages were not archived? There's a chance they're not anything I consider vital to preservation
05:54:44<HistoryOfHyrule>Yeah, back in 2014 or whenever I did that build I, uh, lol, had a lot more I needed to learn before I should have attempted it XD
05:54:59<Ryz>Heheheh x3
05:56:10<HistoryOfHyrule>XD Thank you so much for working on this with me though! Someone suggested I ask about this and said you guys had always been super friendly. I'm so happy to discover how true that is
05:56:17<Ryz>Well, you could check through https://archive.fart.website/archivebot/viewer/job/7xywy - this is the instance where the website was archived by ArchiveBot, but I highly recommend sticking around for a better answer because I don't think it lists URLs that aren't archived, only the ones that were
05:56:48<Ryz>Heheh, you're welcome~ >#<;
05:57:05<HistoryOfHyrule>Good to know, thank you!
05:58:07<Ryz>Yeah, for me, I've been trying to archive older websites through ArchiveBot because one day they would just disappear without announcing they'll shut down s:
05:58:33<Ryz>And something like https://historyofhyrule.com/ felt like it could be one of them because of how older looking it is compared to the websites of today ><;
05:59:09<Ryz>It's getting more nostalgic seeing a website design like that
05:59:28<HistoryOfHyrule>OMG, yeah, I hear you! I run into that a lot since I also hunt for Zelda art to complete collections and it's shocking how much is just GONE
05:59:30<Ryz>Or even more so with probably multiple generations now
06:00:22<Ryz>Mm, the likes of DeviantArt has gone through a bunch s:
06:00:47<Ryz>It seems Twitter is more popular for seeing fanart nowadays, with people moving away from that website, I hear... z:
06:00:50<HistoryOfHyrule>Like I've been around so long I remember all the old coppermine galleries with art and like only 2 are around any more, a couple of the old sites still have hand-made galleries up but -oof- I have no idea when they'll drop off, just like you said
06:01:20march_happy quits [Remote host closed the connection]
06:01:27<Ryz>Oh, do you have links to them? Could archive them through ArchiveBot if it hasn't been tried before
06:01:56<Ryz>Been trying to grind through my backlog of stuff to archive in the first place, while being distracted by unusual or obscure goodies to dive into and archive too
06:02:11<HistoryOfHyrule>lol, I actually still maintain my very original site if you want to see a REALLY old site XD Look at this layout, lol: https://historyofhyrule.com/old/
06:02:24<HistoryOfHyrule>You want the old coppermine galleries?
06:02:58<HistoryOfHyrule>Here's one that's still online: http://www.zeldalegends.net/gallery/?cat=80
06:02:58<Ryz>Yeah, I'm assuming it's not part of your website and found elsewhere o:
06:03:05march_happy (march_happy) joins
06:03:15<HistoryOfHyrule>That's not mine, but I was an admin there a billion years ago
06:03:55<Ryz>I randomly came across a similar fansite that has an art gallery of fanart but for Mega Man, which was apparently around since 2000: https://themechanicalmaniacs.com/
06:04:05<Ryz>*Earlier today, I randomly
06:04:36<HistoryOfHyrule>Joshua of Zelda Universe has also been keeping a list of active and inactive Zelda sites and if you want to check on back-ups of these, it's a really lovingly kept resource: https://zeldaarchive.org/category/active/
06:05:25Arcorann (Arcorann) joins
06:06:20<HistoryOfHyrule>Wait, random question, but since maybe you know about this then: if I have my old phpbb2 forum backups, is there a way to preserve them and make them accessible without me having to actually- get them running again? Because I have a shocking amount of artist and publication info on my last one that's basically lost unless I get around to doing that
06:07:11<HistoryOfHyrule>I am absolutely in love with that Mega Man site you linked to
06:07:38<HistoryOfHyrule>They obviously had a lot of fun making that
06:08:04<Ryz>The forum backups? Oh, uhh, my apologies, I'm not familiar with running a forum myself or a way to open it; please, I highly recommend sticking around to get a better answer for it
06:08:22<Ryz>Right now admittedly there isn't a lot of activity at this time of the day compared to other times at least from my experience
06:08:51<HistoryOfHyrule>Will do. I might not be around until tomorrow evening but I can leave this open (unless I'm an idiot and hit shutdown as force of habit)
06:09:29<Ryz>As for http://www.zeldalegends.net/gallery/?cat=80 - I archived http://www.zeldalegends.net/ back in 2019 November, hmm, trying to look at the coppermine gallery pages...
06:12:42<Ryz>Well, this is unfortunately very concerning: while something like http://www.zeldalegends.net/gallery/displayimage.php?album=980&pos=1 has been archived, the full versions of the images haven't really been archived...
06:13:08<Ryz>Yeah, https://web.archive.org/web/2019*/http://www.zeldalegends.net/gallery/displayimage.php?album=980&pos=1 - this is the only instance in the Wayback Machine where this is archived
06:14:22<Ryz>I think I may have a solution, but it's unfortunately a manual thing, as in it would require some extra steps to ensure a better or more complete archival
06:23:56<HistoryOfHyrule>Oh, it's okay! Honestly I probably have better scans of everything on there by now and am going to upload image packets to archive.org and am talking to some other archivist groups and orgs to make sure redundancies exist
06:25:33<HistoryOfHyrule>It was just one of those things that, if it wasn't too hard, seemed worth mentioning
06:25:43<Ryz>Mhm, for me, it's trying to make sure stuff is more accessible through the Wayback Machine, because otherwise people might give up because of obscurities or the amount of friction to find stuff since the Wayback Machine is a stupidly huge maze to untangle~
06:26:58<HistoryOfHyrule>It makes a lot of sense. Even with what I do, having old broken links still allows me to find more info on the wayback machine sometimes. But, yes, it can get really complicated
06:27:28<Ryz>Yeah, I have a ton of broken links I keep around, they're like broken keys but they point to other sorts of goodies Oo;
06:28:51<Ryz>So many opportunities for finding such goodies <#>;
06:30:36<HistoryOfHyrule>Yussssss
06:31:47<Ryz>Soon, I think I'll have to focus or have a project to dig into that's all about archiving websites that have coppermine gallery images <#>;
06:32:04<Ryz>Gallery software technology that's getting older and less used :C
06:32:19<HistoryOfHyrule>Do you have like a twitter handle or anything? I love keeping in contact with people like you who love this kind of stuff too
06:32:37<HistoryOfHyrule>No pressure, but I wanted to ask just in case
06:34:32<Ryz>My apologies, I don't really have a Twitter account, much of my archiving activities and chattering comes from here in ArchiveTeam~
06:36:41<HistoryOfHyrule><3 no worries! It's kind of the same for me: I always have my head buried in some project so I'm just never online much (except there) so I figured no harm in asking!
06:37:08march_happy quits [Remote host closed the connection]
06:37:30thuban quits [Ping timeout: 252 seconds]
06:38:18<Ryz>Heheh, just like me falling into seemingly endless amount of rabbit holes that originated from archiving >#<;
06:38:40<Ryz>I seriously wish there were multiple me's that would grind through my backlog and more
06:39:46<HistoryOfHyrule>Hahah, same here XD I'll spend 3 days trying to figure out some rare book and then need another few weeks for scanning and auction hunting and then I still need to get back to editing art for the gallery XD
06:40:01<HistoryOfHyrule>I need one of me for each thing XD
06:41:11<Ryz>I have a ton of potlinks (tis a silly word for myself, like a pot full of links) from browsing countless websites for which I would find even more potlinks to be pulled into, and more and more, and the cycle seems to be endless ><;
06:42:29<Ryz>And no, it's not a pot full of multiple Links from Zelda >.>;
06:44:13march_happy (march_happy) joins
06:45:45<HistoryOfHyrule>Haha, it almost makes a good pun for that though. Considering how much Link loves pots
06:46:59<HistoryOfHyrule>Well, I am out for the night but it was beautiful talking to you and thank you. I will leave this open for a while in case someone has any info on how I might be able to archive a phpbb2 forum without getting it running on my server again
06:49:36<Ryz>It was really great talking to you HistoryOfHyrule O:
06:50:52<HistoryOfHyrule><3 <3 <3 Thanks for all you do, you're kind of a Hero of History!
06:51:47<Ryz>Don't hesitate to suggest more websites to archive beyond your own website >#<;
07:27:11<Sanqui|m>Good job with the fansites, I also notice them dropping like flies year by year
07:27:31<Sanqui|m>I wish I had better ways to search for regional ones
07:31:47thuban joins
08:16:44qwertyasdfuiopghjkl quits [Remote host closed the connection]
08:48:54atphoenix_ (atphoenix) joins
08:49:43bonga quits [Remote host closed the connection]
08:49:53bonga joins
08:51:25atphoenix quits [Ping timeout: 265 seconds]
09:15:22march_happy quits [Remote host closed the connection]
09:19:08march_happy (march_happy) joins
09:29:27syntaxx quits [Client Quit]
09:29:33syntaxx (syntaxx) joins
09:31:29driib quits [Client Quit]
09:31:39driib (driib) joins
09:33:17tbc1887 (tbc1887) joins
10:13:30DiscantX joins
10:58:12Iki quits [Ping timeout: 252 seconds]
12:02:07tbc1887 quits [Read error: Connection reset by peer]
12:47:56wyatt8740 joins
12:57:33march_happy quits [Ping timeout: 252 seconds]
12:58:44march_happy (march_happy) joins
13:23:41<TheTechRobo><HistoryOfHyrule> I feel like such a novice asking this but is there a way I can check which pages were not archived? There's a chance they're not anything I consider vital to preservation
13:24:22<TheTechRobo>There's not really a good way of doing this that I know of, but maybe I can figure out something with the Wayback Machine's CDX API.
14:04:42LeGoupil joins
14:13:21bonga quits [Remote host closed the connection]
14:15:59bonga joins
14:44:14LeGoupil quits [Client Quit]
14:44:21tyler090 quits [Remote host closed the connection]
14:45:50<spirit>hard to get a complete overview but https://web.archive.org/web/*/https://historyofhyrule.com/* and http://web.archive.org/web/sitemap/https://historyofhyrule.com/ can help
14:48:20<@JAA>Well, it's logically impossible to tell whether an archive is complete. You need intimate knowledge of the target site so that you can generate a list of all URLs that exist and then compare that to what was covered.
14:48:53yoshino101 joins
14:48:53<@JAA>And even that only reliably works for static pages.
14:49:06<TheTechRobo>Yeah. But seeing which pages weren't saved at all by ArchiveBot, but were saved by something else can help
14:49:14<TheTechRobo>to give a general idea
14:50:37<@JAA>HistoryOfHyrule: You can find a list of all URLs that were retrieved by ArchiveBot in the job's CDX files. For each WARC listed on the viewer, there is a corresponding warc.os.cdx.gz file. Each line represents one HTTP response, and one of the fields in there is the URL that was retrieved. So compare that to what you know exists, I suppose.
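JAA's suggestion of reading the job's warc.os.cdx.gz files could be sketched like this (assuming the usual space-separated CDX layout, where a header line starting with "CDX" names the columns and the letter 'a' marks the original-URL field; field order can vary between tools, hence reading the header rather than hardcoding a column):

```python
def archived_urls(cdx_lines):
    """Yield the original URL from each record of a CDX file,
    given its decompressed text lines (header line first)."""
    it = iter(cdx_lines)
    header = next(it).split()         # e.g. ['CDX', 'N', 'b', 'a', 'm', 's']
    url_col = header.index("a") - 1   # -1 skips the leading 'CDX' token
    for line in it:
        fields = line.split()
        if len(fields) > url_col:
            yield fields[url_col]
```

Typical use (the filename here is illustrative, not a real job file): open the file with `gzip.open(path, "rt")` and pass the handle to `archived_urls`, then compare the resulting URL set against the list of pages known to exist on the site.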
14:52:21<@JAA>Regarding the phpBB forums: well, you could share the database publicly, but then you need to be very careful to first remove the data that should absolutely not be made public (email addresses, passwords or their hashes, IP addresses, activity dates, etc.). The safer way is to set it up again and then crawl it anonymously.
14:54:09bonga quits [Ping timeout: 252 seconds]
14:54:31<@JAA>Alternatively, I guess you could write a little script that parses the backup and produces a static site that basically shows what an anonymous user of a phpBB installation would see. Not sure if there's anything like that already out there, but if not, that's probably not trivial.
14:55:51bonga joins
15:00:50qwertyasdfuiopghjkl joins
15:12:28angelika quits [Client Quit]
15:12:59monoxane4 (monoxane) joins
15:27:16Arcorann quits [Ping timeout: 265 seconds]
15:50:55yoshino101 quits [Remote host closed the connection]
16:09:43godane (godane) joins
16:26:51lennier1 quits [Ping timeout: 265 seconds]
16:28:19lennier1 (lennier1) joins
17:01:30<HistoryOfHyrule>Thanks everyone, I appreciate knowing. The phpbb thing is pretty much what I figured but never hurts to ask. As for the site, I know every page of it, so it's not too hard for me to tell. It's more about the publication image files being saved because that's what people use it for the most. It's a little bit slow but honestly maybe it's just faster for me to go through and check each publication manually.
17:06:15<@JAA>If there aren't too many, that's a valid option I guess. And manual verification would make sure that the archive is indeed correct as well, not just 'something was saved', which may have been a reverse HTTP proxy error or whatever.
17:06:35<@JAA>(You know, broken servers presenting errors as HTTP 200 and all that.)
17:11:06march_happy quits [Ping timeout: 252 seconds]
17:27:14atphoenix_ quits [Remote host closed the connection]
17:28:21atphoenix_ (atphoenix) joins
17:28:37atphoenix_ quits [Remote host closed the connection]
17:29:51atphoenix_ (atphoenix) joins
17:30:07atphoenix_ quits [Remote host closed the connection]
17:31:21atphoenix_ (atphoenix) joins
17:31:37atphoenix_ quits [Remote host closed the connection]
17:32:51atphoenix_ (atphoenix) joins
17:33:08atphoenix_ quits [Remote host closed the connection]
17:34:21atphoenix_ (atphoenix) joins
17:34:38atphoenix_ quits [Remote host closed the connection]
17:35:51atphoenix_ (atphoenix) joins
17:36:08atphoenix_ quits [Remote host closed the connection]
17:37:21atphoenix_ (atphoenix) joins
17:37:38atphoenix_ quits [Remote host closed the connection]
17:38:51atphoenix_ (atphoenix) joins
17:39:08atphoenix_ quits [Remote host closed the connection]
17:39:45atphoenix_ (atphoenix) joins
18:10:51HistoryOfHyrule quits [Remote host closed the connection]
18:11:44DiscantX quits [Ping timeout: 265 seconds]
18:21:19<h2ibot>JustAnotherArchivist edited Deathwatch (+158, /* 2022 */ Add AnyNowhere): https://wiki.archiveteam.org/?diff=48609&oldid=48598
19:18:31DiscantX joins
19:21:27<TheTechRobo>Only 1257 more funeral home pages in my queue. I'll have to write scraping code for more websites soon.
19:30:30<h2ibot>Jurta created .ps (+254, Created page with "{{Infobox project | logo =…): https://wiki.archiveteam.org/?title=.ps
19:30:31<h2ibot>Jurta uploaded File:Palestinian National Internet Naming Authority logo.png: https://wiki.archiveteam.org/?title=File%3APalestinian%20National%20Internet%20Naming%20Authority%20logo.png
19:30:32<h2ibot>Ka edited Coronavirus/Notable deaths (+326): https://wiki.archiveteam.org/?diff=48612&oldid=48607
19:44:56JackThompson05 joins
20:26:21DiscantX quits [Ping timeout: 252 seconds]
21:37:06sepro (sepro) joins
21:55:48lennier1 quits [Client Quit]
21:58:28lennier1 (lennier1) joins
22:13:47pie_ quits []
22:14:49pie_ joins
22:18:39<@JAA>AnyNowhere's sister site 80.style may be in danger or not (cf. https://old.reddit.com/r/Archiveteam/comments/utze7f/anynowherecom_an_early_2000s_bulletin_board_full/). It's a JS mess and uses POST for everything.
22:31:12eroc1990 quits [Ping timeout: 252 seconds]
22:40:33AlsoHP_Archivist joins
22:43:51march_happy (march_happy) joins
22:44:24HP_Archivist quits [Ping timeout: 252 seconds]
22:49:01monoxane43 (monoxane) joins
22:51:27monoxane4 quits [Ping timeout: 265 seconds]
22:51:28monoxane43 is now known as monoxane4
23:25:02eroc1990 (eroc1990) joins
23:25:06<Jake>boo!
23:35:41monoxane47 (monoxane) joins
23:36:43BlueMaxima joins
23:37:30monoxane4 quits [Ping timeout: 265 seconds]
23:37:31monoxane47 is now known as monoxane4
23:39:17HP_Archivist (HP_Archivist) joins
23:42:12AlsoHP_Archivist quits [Ping timeout: 265 seconds]
23:44:45HP_Archivist quits [Ping timeout: 265 seconds]
23:49:04jtagcat6 quits [Quit: Bye!]
23:50:08HP_Archivist (HP_Archivist) joins
23:51:04jtagcat6 (jtagcat) joins
23:57:32AlsoHP_Archivist joins
23:57:36AlsoHP_Archivist quits [Remote host closed the connection]