| 00:03:44 | <Nexus> | Alright, good to know. Have three workers spun up on two different VMs. The Oracle Cloud free tier VMs seem to be quite capable of running warriors; I have 2 running on one and it seems to be handling it quite well |
| 00:05:50 | | IDK quits [Client Quit] |
| 00:27:02 | | @Sanqui quits [Ping timeout: 264 seconds] |
| 00:27:46 | | simon8162 quits [Quit: ZNC 1.8.2 - https://znc.in] |
| 00:28:03 | | simon816 (simon816) joins |
| 00:28:48 | <simon816> | https://web.archive.org/web/20230000000000*/https://burg.biz/ one for the exclusions list I think? |
| 00:34:01 | | lunik17 quits [Ping timeout: 252 seconds] |
| 00:37:03 | <h2ibot> | Markass edited List of websites excluded from the Wayback Machine (+67, sanctionedsuicide.com and sanctioned-suicide.org): https://wiki.archiveteam.org/?diff=49399&oldid=49350 |
| 01:00:07 | <h2ibot> | JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=49400&oldid=49399 |
| 01:05:19 | | Sanqui joins |
| 01:05:21 | | Sanqui is now authenticated as Sanqui |
| 01:05:21 | | Sanqui quits [Changing host] |
| 01:05:21 | | Sanqui (Sanqui) joins |
| 01:05:21 | | @ChanServ sets mode: +o Sanqui |
| 01:11:38 | | lunik17 joins |
| 01:26:41 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 01:32:51 | | rocketdive quits [Ping timeout: 265 seconds] |
| 01:34:32 | <@OrIdow6> | It appears that the endpoint for Zhihu Circles comments gives a 404 |
| 02:23:30 | | katocala quits [Remote host closed the connection] |
| 02:32:29 | | katocala joins |
| 02:32:50 | | katocala is now authenticated as katocala |
| 02:55:47 | | Nexus quits [Client Quit] |
| 03:02:15 | <qwertyasdfuiopghjkl> | Ryz: Related to Issuu, https://issuu.com/categories probably won't find everything, since there are unlisted pages ( https://help.issuu.com/hc/en-us/articles/5772764718363-Unlisted-Content ). |
| 03:32:19 | | fishingforsoup joins |
| 03:32:31 | <fishingforsoup> | I need serious help in finding the videos that were on these URLs. |
| 03:32:32 | <fishingforsoup> | http://youtube.com/watch?v=XFMS0Hr1Ub4 |
| 03:32:47 | <fishingforsoup> | http://youtube.com/watch?v=yj8MTPX8zDE |
| 03:33:06 | <fishingforsoup> | http://youtube.com/watch?v=JDyCsTDoKc0 |
| 03:33:42 | <fishingforsoup> | From the metadata I can find, they include songs. Songs that are lost. |
| 05:10:26 | | Craigle quits [Quit: The Lounge - https://thelounge.chat] |
| 05:10:53 | | Craigle (Craigle) joins |
| 05:52:59 | <@JAA> | FWIW, I poked at Issuu a bit, but I couldn't find anything that would allow us to enumerate the documents. I did find an SWF reader, but it takes a document ID of the form 170214070741-cc557127661564b84205d23b42cf67f1, obviously still not bruteforceable. |
| 05:53:18 | <@JAA> | That's at https://static.issuu.com/webembed/viewers/style1/v2/IssuuReader.swf?mode=mini&documentId=170214070741-cc557127661564b84205d23b42cf67f1 if anyone wants to dig further. |
| 05:56:10 | <@JAA> | That's revisionId-publicationId per some JSON in the document pages. |
| 06:22:00 | | benjinsm joins |
| 06:23:16 | | benjins2_ joins |
| 06:25:14 | | benjins quits [Ping timeout: 264 seconds] |
| 06:25:50 | | benjins2 quits [Ping timeout: 264 seconds] |
| 06:26:25 | | Craigle quits [Client Quit] |
| 06:27:04 | | Craigle (Craigle) joins |
| 06:41:15 | | Ruthalas5 (Ruthalas) joins |
| 07:08:26 | | wyatt8750 quits [Ping timeout: 264 seconds] |
| 07:10:41 | | hitgrr8 joins |
| 07:13:49 | <audrooku|m> | JAA: In regards to your messages in #onlyfails |
| 07:13:49 | <audrooku|m> | It didn't really click for me how those IA collections of megawarcs and that data being in the WBM were related, so I guess they're basically indexed by the WBM and the various grabs end up in various collections? Both by the IA, AT, and other partners? I've found quite a lot of use through the WBM CDX API since I've learned of it, though searching through a local set of CDX data could be beneficial to me, especially since sometimes the CDX API |
| 07:13:49 | <audrooku|m> | appears to miss some results (sometimes more old grabs appear at a later date). Do you know roughly how often these grab collections are publicly available? For example, there's a collection of deviantart grabs that is not; if most collections are public, it might be worth my time to scrape the CDXs from many collections. |
| 07:24:32 | | wyatt8740 joins |
| 07:29:04 | | wyatt8740 quits [Ping timeout: 265 seconds] |
| 07:29:15 | | wyatt8740 joins |
| 08:04:52 | | Icyelut|2 (Icyelut) joins |
| 08:06:38 | | Icyelut quits [Ping timeout: 264 seconds] |
| 08:32:09 | | us3rrr quits [Read error: Connection reset by peer] |
| 08:50:30 | | Island quits [Read error: Connection reset by peer] |
| 09:01:00 | | Mateon1 quits [Remote host closed the connection] |
| 09:01:08 | | Mateon1 joins |
| 09:26:09 | | Minkafighter72 quits [Quit: The Lounge - https://thelounge.chat] |
| 09:27:53 | | Minkafighter72 joins |
| 10:12:47 | | jacksonchen666 quits [Ping timeout: 276 seconds] |
| 10:14:53 | | jacksonchen666 (jacksonchen666) joins |
| 11:18:47 | | IDK (IDK) joins |
| 11:45:14 | | leo60228 quits [Ping timeout: 265 seconds] |
| 11:47:14 | | leo60228 (leo60228) joins |
| 11:48:30 | | Jackster joins |
| 11:51:32 | <Jackster> | I am using heritrix to archive a few websites. Because I had to restart it a few times, I have ended up with a couple of .warc.gz files. Looking to merge them into a single file if possible. I have tried https://github.com/arquivo/mergeWARCs but it creates 3 files? https://github.com/maturban/WARCMerge won't run. Any suggestions or am I doing it wrong? |
| 11:53:44 | <Jackster> | If pywb can use these multiple files, then I don't need to do this |
| 12:02:40 | <Jackster> | Ah yes I don't need to do that. |
| 12:02:43 | <Jackster> | ggwp |
| 12:02:45 | | Jackster quits [Remote host closed the connection] |
| 12:06:33 | | benjinsm is now known as benjins |
| 12:06:34 | | benjins is now authenticated as benjins |
| 12:56:10 | <h2ibot> | Bzc6p edited Kepfeltoltes.eu (-20, more accurate statistics, 2022 in progress): https://wiki.archiveteam.org/?diff=49401&oldid=49353 |
| 14:06:29 | | rocketdive joins |
| 14:33:14 | | ok joins |
| 14:47:27 | | ok quits [Ping timeout: 265 seconds] |
| 15:22:09 | | ok joins |
| 15:23:20 | | ok quits [Remote host closed the connection] |
| 15:26:28 | | ok joins |
| 15:26:34 | <ok> | hi |
| 15:43:00 | | HP_Archivist (HP_Archivist) joins |
| 15:43:28 | <@arkiver> | OrIdow6: yeah i was looking into zhihu circles |
| 15:43:44 | <@arkiver> | but wasn't exactly sure what part was shutting down |
| 15:43:52 | <@arkiver> | it may be good though to do a project for the entire thing in general |
| 15:47:40 | | ok74 joins |
| 15:47:47 | <@arkiver> | for zhihu, only the /club/* pages on zhihu.com are shutting down? |
| 15:49:10 | <@arkiver> | so, if I read this correctly, access to https://www.zhihu.com/club/explore will be disabled at the end of this month, and all actual posts will no longer be accessible at the end of march? |
| 15:49:49 | <ok74> | hey |
| 15:49:55 | <@arkiver> | hi |
| 15:50:16 | <ok74> | just found out my twitter account was targeted by your org |
| 15:50:40 | | ok quits [Remote host closed the connection] |
| 15:52:03 | <ok74> | don't get the idea of putting a bot to stalk every one of my tweets |
| 15:54:03 | <ok74> | can someone give me some explanation please |
| 15:54:12 | <madpro|m> | Hi ok74, |
| 15:54:18 | <ok74> | hi |
| 15:54:23 | <madpro|m> | That's IRC for you, it can take a few hours for people to respond. |
| 15:54:30 | <thuban> | ok74: we archive a very broad range of material. if you don't want your account archived, it's best to contact archive.org and ask for it to be excluded. that will prevent everyone from saving it there, not just us https://help.archive.org/help/how-do-i-request-to-remove-something-from-archive-org/ |
| 15:55:26 | <madpro|m> | Rest assured Archive Team is a bona fide volunteer organisation. Its mission is to preserve digital heritage, without disturbing people if possible. |
| 15:56:44 | <madpro|m> | As thuban mentioned, Archive Team scrapes of Twitter are redirected to archive.org (unaffiliated). So you can ask for a take-down from archive.org directly. |
| 15:56:55 | <ok74> | man i only tweeted 5 times in 2 years, you archived all of them, including 3 tweets from a conversation in a space where i wrote some location to a friend (the tweets lasted literally 2 min) |
| 15:56:59 | <ok74> | it's insane |
| 15:58:06 | <ok74> | based on the fact i'm an osint volunteer it can be so harmful |
| 15:58:33 | <madpro|m> | Yeah, we have a lot of overlap. |
| 15:58:37 | <madpro|m> | If you are familiar with DEFCON |
| 15:58:55 | <ok74> | yes |
| 15:58:56 | <madpro|m> | With OS-INT, I mean. |
| 15:59:06 | <ok74> | i know that dude |
| 16:00:40 | <h2ibot> | Bzc6p edited EOldal (+238, Finished.): https://wiki.archiveteam.org/?diff=49402&oldid=49291 |
| 16:04:00 | <madpro|m> | ok74: An AT member once put it like this |
| 16:04:21 | <madpro|m> | "Our priority should be sites where user **content was solicited** and then **provided.**" |
| 16:06:02 | <madpro|m> | And that is a good maxim. It trusts the original author's consent and even will in making their content public. |
| 16:06:27 | <madpro|m> | Every now and again, a tracker may slip up. Or someone may dump links important to them into the crawler. |
| 16:07:28 | <madpro|m> | That's why archive.org has its take-down form. That's how we uphold "the right to be forgotten" |
| 16:09:02 | <madpro|m> | Compared to CommonCrawl and larger projects, I would say that Archive Team is much more reasonable. |
| 16:09:05 | <madpro|m> | Wouldn't you agree? |
| 16:09:49 | | ivan leaves |
| 16:15:30 | <ok74> | i don't know much about commoncrawl or larger projects, but you're right, it was a mistake to use a non-randomised pseudonym and to publicly tweet a location to meet a friend |
| 16:16:49 | <madpro|m> | Mistakes happen, don't be ashamed. :) |
| 16:16:51 | <madpro|m> | and this is how we try to remedy it |
| 16:17:24 | <madpro|m> | I'm sorry I cannot offer anything more |
| 16:19:09 | <ok74> | don't worry i will burn name and account, so sad it was so cool, but security before everything |
| 16:23:45 | | HP_Archivist quits [Client Quit] |
| 16:34:46 | <h2ibot> | Bzc6p edited EOldal (-16, /* Archiving */ 874 GiB total): https://wiki.archiveteam.org/?diff=49403&oldid=49402 |
| 16:51:51 | | ok74 quits [Remote host closed the connection] |
| 17:06:39 | | jacksonchen666_ (jacksonchen666) joins |
| 17:08:08 | | jacksonchen666 quits [Ping timeout: 276 seconds] |
| 17:11:32 | | jacksonchen666_ is now known as jacksonchen666 |
| 17:22:44 | | spirit quits [Quit: Leaving] |
| 18:36:05 | <@OrIdow6> | arkiver: I'm not sure. The link to https://zhuanlan.zhihu.com/p/201000811 suggests it is /club/, though that page mainly refers to circles as being the pages accessed through interface elements I don't see. Pessimistically "all the entrances of the circle except the personal page will be closed" (Google Translate of https://zhuanlan.zhihu.com/p/585385202) could mean that they become inaccessible to the general public, though I do think your |
| 18:36:07 | <@OrIdow6> | interpretation is more likely. |
| 18:49:12 | <@JAA> | audrooku|m: Yes, the WBM is basically just an interface to WARC files (and, for ancient data, ARC files) stored in IA items. This includes a wide variety of sources, including IA's own crawls and our stuff. Our data is normally publicly accessible, but IA's almost always isn't. For other sources, it varies. The WBM CDX API returns matches for anything that can be found in the WBM. It's just more |
| 18:49:18 | <@JAA> | convenient to download a handful of CDX files from the items when that's possible. The API has some annoying quirks, is rate-limited, and doesn't scale well. |
| 18:50:29 | <pokechu22> | Incidentally, https://wayback.archive-it.org/5902/*/http://www.nsf.gov/statistics/nsf01313/patterns.htm is an interesting thing to compare with https://web.archive.org/web/*/http://www.nsf.gov/statistics/nsf01313/patterns.htm - I vaguely knew of archive-it.org but didn't realize it had a separate interface |
| 18:53:15 | <pokechu22> | hmm, also interesting: https://wayback.archive-it.org/5902/timemap/cdx?url=http://www.nsf.gov/statistics/nsf01313/patterns.htm exists but https://archive.org/download/ARCHIVEIT-5902-ONE_TIME-JOB162185-00000 has CDX files restricted |
| 18:54:20 | <@JAA> | Well, yeah, you can get the same info from the CDX API as well. The API doesn't allow you to list all entries from a particular WARC file though, as the CDX file would. |
| 18:55:20 | <@JAA> | It makes sense for SPN, I guess. I don't really understand why they restrict their web-wide crawls though. |
| 18:56:02 | <@JAA> | Archive-It partners might not want the URL lists to be available either, so maybe that's somewhat reasonable as well. |
| 19:02:40 | | hitgrr8 quits [Read error: Connection reset by peer] |
| 19:03:48 | | hitgrr8 joins |
| 19:39:02 | | igloo22225 quits [Ping timeout: 264 seconds] |
| 19:41:15 | | mut4ntm0nkey quits [Remote host closed the connection] |
| 19:41:55 | | mut4ntm0nkey (mutantmonkey) joins |
| 19:53:32 | <avoozl> | JAA: do you know if I can see any of the telegram pages online anywhere? I'm parsing them and attempting to construct some data extractors for them, but have a bit of trouble coming up with the right element selectors on the html pages |
| 19:57:26 | <@JAA> | avoozl: Should all be in the WBM, unless you're looking at a very recent upload (as it takes a bit until a newly indexed WARC actually shows up in the WBM). |
| 20:02:03 | <avoozl> | I'll have a look, maybe I was looking in the wrong place. Just pick a random url from the json.gz and plug that into wbm? |
| 20:02:46 | | igloo22225 (igloo22225) joins |
| 20:03:48 | | Island joins |
| 20:04:50 | <@JAA> | From the CDX |
| 20:06:26 | <avoozl> | ahh ok |
| 20:06:50 | <avoozl> | I feel like I forgot a lot of stuff since last time I touched this :) |
| 20:10:38 | <avoozl> | I see a lot of things captured but I'm unsure where to find actual content, maybe I'm just unlucky |
| 20:10:40 | <avoozl> | https://web.archive.org/web/20221225004205/https://t.me/Aallbany/1257 |
| 20:14:14 | <@JAA> | Yeah, that's just the awkward Telegram web interface. Try one of the /s/ URLs. |
| 20:14:47 | <@JAA> | I believe there are other cases, too. But for specific stuff rather than general WARC parsing etc., #telegrab |
| 20:55:01 | | Chris5010 quits [Ping timeout: 252 seconds] |
| 21:10:30 | | Chris5010 (Chris5010) joins |
| 22:29:16 | | BlueMaxima joins |
| 22:33:02 | | hitgrr8 quits [Client Quit] |
| 23:18:00 | | IDK quits [Client Quit] |