00:03:44<Nexus>Alright, good to know. Have three workers spun up on two different VMs. The Oracle Cloud free tier VMs seem to be quite capable of running warriors; I have 2 running on one and it seems to be handling it quite well
00:05:50IDK quits [Client Quit]
00:27:02@Sanqui quits [Ping timeout: 264 seconds]
00:27:46simon8162 quits [Quit: ZNC 1.8.2 - https://znc.in]
00:28:03simon816 (simon816) joins
00:28:48<simon816>https://web.archive.org/web/20230000000000*/https://burg.biz/ one for the exclusions list I think?
00:34:01lunik17 quits [Ping timeout: 252 seconds]
00:37:03<h2ibot>Markass edited List of websites excluded from the Wayback Machine (+67, sanctionedsuicide.com and sanctioned-suicide.org): https://wiki.archiveteam.org/?diff=49399&oldid=49350
01:00:07<h2ibot>JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=49400&oldid=49399
01:05:19Sanqui joins
01:05:21Sanqui quits [Changing host]
01:05:21Sanqui (Sanqui) joins
01:05:21@ChanServ sets mode: +o Sanqui
01:11:38lunik17 joins
01:26:41BlueMaxima quits [Read error: Connection reset by peer]
01:32:51rocketdive quits [Ping timeout: 265 seconds]
01:34:32<@OrIdow6>It appears that the endpoint for Zhihu Circles comments gives a 404
02:23:30katocala quits [Remote host closed the connection]
02:32:29katocala joins
02:55:47Nexus quits [Client Quit]
03:02:15<qwertyasdfuiopghjkl>Ryz: Related to Issuu, https://issuu.com/categories probably won't find everything, since there are unlisted pages ( https://help.issuu.com/hc/en-us/articles/5772764718363-Unlisted-Content ).
03:32:19fishingforsoup joins
03:32:31<fishingforsoup>I need serious help in finding the videos that were on these URLs.
03:32:32<fishingforsoup>http://youtube.com/watch?v=XFMS0Hr1Ub4
03:32:47<fishingforsoup>http://youtube.com/watch?v=yj8MTPX8zDE
03:33:06<fishingforsoup>http://youtube.com/watch?v=JDyCsTDoKc0
03:33:42<fishingforsoup>From the metadata I can find, they include songs. Songs that are lost.
05:10:26Craigle quits [Quit: The Lounge - https://thelounge.chat]
05:10:53Craigle (Craigle) joins
05:52:59<@JAA>FWIW, I poked at Issuu a bit, but I couldn't find anything that would allow us to enumerate the documents. I did find an SWF reader, but it takes a document ID of the form 170214070741-cc557127661564b84205d23b42cf67f1, obviously still not brute-forceable.
05:53:18<@JAA>That's at https://static.issuu.com/webembed/viewers/style1/v2/IssuuReader.swf?mode=mini&documentId=170214070741-cc557127661564b84205d23b42cf67f1 if anyone wants to dig further.
05:56:10<@JAA>That's revisionId-publicationId per some JSON in the document pages.
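A minimal sketch of splitting such a document ID into the two parts named above. The function name `split_issuu_document_id` is hypothetical, and the pattern (digits, one hyphen, 32 hexadecimal characters) is an assumption inferred only from the single example ID quoted in the log.

```python
import re

# Pattern inferred from the one example ID in the log (an assumption):
# digits, a hyphen, then 32 lowercase hexadecimal characters.
_DOCUMENT_ID = re.compile(r"(\d+)-([0-9a-f]{32})")


def split_issuu_document_id(document_id: str) -> tuple[str, str]:
    """Return (revision_id, publication_id) for an ID of that shape."""
    match = _DOCUMENT_ID.fullmatch(document_id)
    if match is None:
        raise ValueError(f"unexpected document ID format: {document_id!r}")
    return match.group(1), match.group(2)


revision_id, publication_id = split_issuu_document_id(
    "170214070741-cc557127661564b84205d23b42cf67f1"
)
# 32 hexadecimal characters encode 128 bits, which is why guessing
# publication IDs is not a practical way to enumerate documents.
publication_id_bits = len(publication_id) * 4
```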
06:22:00benjinsm joins
06:23:16benjins2_ joins
06:25:14benjins quits [Ping timeout: 264 seconds]
06:25:50benjins2 quits [Ping timeout: 264 seconds]
06:26:25Craigle quits [Client Quit]
06:27:04Craigle (Craigle) joins
06:41:15Ruthalas5 (Ruthalas) joins
07:08:26wyatt8750 quits [Ping timeout: 264 seconds]
07:10:41hitgrr8 joins
07:13:49<audrooku|m>JAA: In regards to your messages in #onlyfails
07:13:49<audrooku|m>It didn't really click for me how those IA collections of megawarcs and that data being in the WBM were related, so I guess they're basically indexed by the WBM and the various grabs end up in various collections? By the IA, AT, and other partners alike? I've found quite a lot of use for the WBM CDX API since I learned of it, though searching through a local set of CDX data could be beneficial to me, especially since sometimes the CDX API
07:13:49<audrooku|m>appears to miss some results (sometimes more old grabs appear at a later date), do you know roughly how often these grab collections are publicly available? For example, there's a collection of DeviantArt grabs that is not. If most collections are public, it might be worth my time to scrape the CDXs from many collections.
07:24:32wyatt8740 joins
07:29:04wyatt8740 quits [Ping timeout: 265 seconds]
07:29:15wyatt8740 joins
08:04:52Icyelut|2 (Icyelut) joins
08:06:38Icyelut quits [Ping timeout: 264 seconds]
08:32:09us3rrr quits [Read error: Connection reset by peer]
08:50:30Island quits [Read error: Connection reset by peer]
09:01:00Mateon1 quits [Remote host closed the connection]
09:01:08Mateon1 joins
09:26:09Minkafighter72 quits [Quit: The Lounge - https://thelounge.chat]
09:27:53Minkafighter72 joins
10:12:47jacksonchen666 quits [Ping timeout: 276 seconds]
10:14:53jacksonchen666 (jacksonchen666) joins
11:18:47IDK (IDK) joins
11:45:14leo60228 quits [Ping timeout: 265 seconds]
11:47:14leo60228 (leo60228) joins
11:48:30Jackster joins
11:51:32<Jackster>I am using Heritrix to archive a few websites. Because I had to restart it a few times, I have ended up with a couple of .warc.gz files. I'm looking to merge them into a single file if possible. I have tried https://github.com/arquivo/mergeWARCs but it creates 3 files? https://github.com/maturban/WARCMerge won't run. Any suggestions, or am I doing it wrong?
11:53:44<Jackster>If pywb can use these multiple files, then I don't need to do this.
12:02:40<Jackster>Ah yes I don't need to do that.
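For cases where a single file really is wanted, a minimal sketch: .warc.gz files that are compressed one gzip member per record can generally be joined by plain byte concatenation, because a sequence of gzip members is itself a valid gzip stream. The helper name `concatenate_gzip_files` is hypothetical, and per-record compression is an assumption about the input files, not something confirmed for the files discussed here.

```python
import gzip
import shutil
from pathlib import Path


def concatenate_gzip_files(sources: list[Path], destination: Path) -> None:
    """Append the raw bytes of each source file to destination, in order."""
    with destination.open("wb") as out:
        for source in sources:
            with source.open("rb") as src:
                shutil.copyfileobj(src, out)


# In-memory demonstration, with two tiny gzip members standing in for
# two separate .warc.gz files (the payloads are placeholders, not WARC data).
part_a = gzip.compress(b"record one\r\n")
part_b = gzip.compress(b"record two\r\n")
merged = part_a + part_b

# Decompressing the joined bytes yields both payloads back to back.
roundtrip = gzip.decompress(merged)
```

This only joins the files; it does not deduplicate records or rewrite any index, so a CDX index for the merged file would have to be regenerated.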
12:02:43<Jackster>ggwp
12:02:45Jackster quits [Remote host closed the connection]
12:06:33benjinsm is now known as benjins
12:56:10<h2ibot>Bzc6p edited Kepfeltoltes.eu (-20, more accurate statistics, 2022 in progress): https://wiki.archiveteam.org/?diff=49401&oldid=49353
14:06:29rocketdive joins
14:33:14ok joins
14:47:27ok quits [Ping timeout: 265 seconds]
15:22:09ok joins
15:23:20ok quits [Remote host closed the connection]
15:26:28ok joins
15:26:34<ok>hi
15:43:00HP_Archivist (HP_Archivist) joins
15:43:28<@arkiver>OrIdow6: yeah i was looking into zhihu circles
15:43:44<@arkiver>but wasn't exactly sure what part was shutting down
15:43:52<@arkiver>it may be good though to do a project for the entire thing in general
15:47:40ok74 joins
15:47:47<@arkiver>for zhihu, only the /club/* pages on zhihu.com are shutting down?
15:49:10<@arkiver>so, if I read this correctly, access to https://www.zhihu.com/club/explore will be disabled at the end of this month, and all actual posts will no longer be accessible at the end of March?
15:49:49<ok74>hey
15:49:55<@arkiver>hi
15:50:16<ok74>just found out my twitter account was targeted by your org
15:50:40ok quits [Remote host closed the connection]
15:52:03<ok74>don't get the idea of putting a bot to stalk every one of my tweets
15:54:03<ok74>can someone give me some explanation please
15:54:12<madpro|m>Hi ok74,
15:54:18<ok74>hi
15:54:23<madpro|m>That's IRC for you, it can take a few hours for people to respond.
15:54:30<thuban>ok74: we archive a very broad range of material. if you don't want your account archived, it's best to contact archive.org and ask for it to be excluded. that will prevent everyone from saving it there, not just us https://help.archive.org/help/how-do-i-request-to-remove-something-from-archive-org/
15:55:26<madpro|m>Rest assured Archive Team is a bona fide volunteer organisation. Its mission is to preserve digital heritage, without disturbing people if possible.
15:56:44<madpro|m>as thuban: mentioned, Archive Team scrapes of Twitter are redirected to the archive.org (unaffiliated). So you can ask for a take-down from archive.org, directly.
15:56:55<ok74>man, i only tweeted 5 times in 2 years and you archived them all, including 3 tweets from a conversation in a space where i wrote some location to a friend (the tweets lasted literally 2 min)
15:56:59<ok74>it's insane
15:58:06<ok74>given that i'm an osint volunteer it can be so harmful
15:58:33<madpro|m>Yeah, we have a lot of overlap.
15:58:37<madpro|m>If you are familiar with DEFCON
15:58:55<ok74>yes
15:58:56<madpro|m>With OSINT, I mean.
15:59:06<ok74>i know that dude
16:00:40<h2ibot>Bzc6p edited EOldal (+238, Finished.): https://wiki.archiveteam.org/?diff=49402&oldid=49291
16:04:00<madpro|m>ok74: An AT member once put it like this
16:04:21<madpro|m>"Our priority should be sites where user **content was solicited** and then **provided**."
16:06:02<madpro|m>And that is a good maxim. It trusts the original author's consent and even will in making their content public.
16:06:27<madpro|m>Every now and again, a tracker may slip-up. Or someone may dump links important to them into the crawler.
16:07:28<madpro|m>That's why archive.org has its take-down form. That's how we uphold "the right to be forgotten"
16:09:02<madpro|m>Compared to CommonCrawl and larger projects, I would say that Archive Team is much more reasonable.
16:09:05<madpro|m>Wouldn't you agree?
16:09:49ivan leaves
16:15:30<ok74>i don't know much about commoncrawl or larger projects, but you're right, it was a mistake to use a non-randomised pseudonym and to publicly tweet a location to meet a friend
16:16:49<madpro|m>Mistakes happen, don't be ashamed. :)
16:16:51<madpro|m>and this is how we try to remedy it
16:17:24<madpro|m>I'm sorry I cannot offer anything more
16:19:09<ok74>don't worry, i will burn the name and account. so sad, it was so cool, but security before everything
16:23:45HP_Archivist quits [Client Quit]
16:34:46<h2ibot>Bzc6p edited EOldal (-16, /* Archiving */ 874 GiB total): https://wiki.archiveteam.org/?diff=49403&oldid=49402
16:51:51ok74 quits [Remote host closed the connection]
17:06:39jacksonchen666_ (jacksonchen666) joins
17:08:08jacksonchen666 quits [Ping timeout: 276 seconds]
17:11:32jacksonchen666_ is now known as jacksonchen666
17:22:44spirit quits [Quit: Leaving]
18:36:05<@OrIdow6>arkiver: I'm not sure. The link to https://zhuanlan.zhihu.com/p/201000811 suggests it is /club/, though that page mainly refers to circles as being the pages accessed through interface elements I don't see. Pessimistically "all the entrances of the circle except the personal page will be closed" (Google Translate of https://zhuanlan.zhihu.com/p/585385202) could mean that they become inaccessible to the general public, though I do think your
18:36:07<@OrIdow6>interpretation is more likely.
18:49:12<@JAA>audrooku|m: Yes, the WBM is basically just an interface to WARC files (and, for ancient data, ARC files) stored in IA items. This includes a wide variety of sources, including IA's own crawls and our stuff. Our data is normally publicly accessible, but IA's almost always isn't. For other sources, it varies. The WBM CDX API returns matches for anything that can be found in the WBM. It's just more
18:49:18<@JAA>convenient to download a handful of CDX files from the items when that's possible. The API has some annoying quirks, is rate-limited, and doesn't scale well.
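A minimal sketch of building a query against the WBM CDX API mentioned above. The endpoint and the `url`, `matchType`, `output`, and `limit` parameters are those of the public Wayback CDX server; the function name `build_cdx_query`, its defaults, and the `example.org/` target are hypothetical choices for illustration.

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url: str, limit: int = 50, match_type: str = "exact") -> str:
    """Build a CDX API request URL for captures of the given URL."""
    params = {
        "url": url,
        "matchType": match_type,  # exact, prefix, host, or domain
        "output": "json",         # JSON rows instead of space-separated text
        "limit": limit,           # keep requests small; the API is rate-limited
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


# Example: up to 10 captures of anything under example.org/ (placeholder URL).
query = build_cdx_query("example.org/", limit=10, match_type="prefix")
```

Fetching `query` with any HTTP client returns a JSON array whose first row holds the field names. For bulk work, downloading the CDX files from the items, where they are public, avoids the rate limit entirely.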
18:50:29<pokechu22>Incidentally, https://wayback.archive-it.org/5902/*/http://www.nsf.gov/statistics/nsf01313/patterns.htm is an interesting thing to compare with https://web.archive.org/web/*/http://www.nsf.gov/statistics/nsf01313/patterns.htm - I vaguely knew of archive-it.org but didn't realize it had a separate interface
18:53:15<pokechu22>hmm, also interesting: https://wayback.archive-it.org/5902/timemap/cdx?url=http://www.nsf.gov/statistics/nsf01313/patterns.htm exists but https://archive.org/download/ARCHIVEIT-5902-ONE_TIME-JOB162185-00000 has CDX files restricted
18:54:20<@JAA>Well, yeah, you can get the same info from the CDX API as well. The API doesn't allow you to list all entries from a particular WARC file though, as the CDX file would.
18:55:20<@JAA>It makes sense for SPN, I guess. I don't really understand why they restrict their web-wide crawls though.
18:56:02<@JAA>Archive-It partners might not want the URL lists to be available either, so maybe that's somewhat reasonable as well.
19:02:40hitgrr8 quits [Read error: Connection reset by peer]
19:03:48hitgrr8 joins
19:39:02igloo22225 quits [Ping timeout: 264 seconds]
19:41:15mut4ntm0nkey quits [Remote host closed the connection]
19:41:55mut4ntm0nkey (mutantmonkey) joins
19:53:32<avoozl>JAA: do you know if I can see any of the telegram pages online anywhere? I'm parsing them and attempting to construct some data extractors for them, but have a bit of trouble coming up with the right element selectors on the html pages
19:57:26<@JAA>avoozl: Should all be in the WBM, unless you're looking at a very recent upload (as it takes a bit until a newly indexed WARC actually shows up in the WBM).
20:02:03<avoozl>I'll have a look, maybe I was looking in the wrong place. Just pick a random url from the json.gz and plug that into wbm?
20:02:46igloo22225 (igloo22225) joins
20:03:48Island joins
20:04:50<@JAA>From the CDX
20:06:26<avoozl>ahh ok
20:06:50<avoozl>I feel like I forgot a lot of stuff since last time I touched this :)
20:10:38<avoozl>I see a lot of things captured but I'm unsure where to find actual content, maybe I'm just unlucky
20:10:40<avoozl>https://web.archive.org/web/20221225004205/https://t.me/Aallbany/1257
20:14:14<@JAA>Yeah, that's just the awkward Telegram web interface. Try one of the /s/ URLs.
20:14:47<@JAA>I believe there are other cases, too. But for specific stuff rather than general WARC parsing etc., #telegrab
20:55:01Chris5010 quits [Ping timeout: 252 seconds]
21:10:30Chris5010 (Chris5010) joins
22:29:16BlueMaxima joins
22:33:02hitgrr8 quits [Client Quit]
23:18:00IDK quits [Client Quit]