#archiveteam-bs log for 2023-01-22

00:03:44	<Nexus>	Alright, good to know. Have three workers spun up on two different VMs. The Oracle Cloud free tier VMs seem to be quite capable of running warriors; I have 2 running on one and it seems to be handling it quite well
00:05:50		IDK quits [Client Quit]
00:27:02		@Sanqui quits [Ping timeout: 264 seconds]
00:27:46		simon8162 quits [Quit: ZNC 1.8.2 - https://znc.in]
00:28:03		simon816 (simon816) joins
00:28:48	<simon816>	https://web.archive.org/web/20230000000000*/https://burg.biz/ one for the exclusions list I think?
00:34:01		lunik17 quits [Ping timeout: 252 seconds]
00:37:03	<h2ibot>	Markass edited List of websites excluded from the Wayback Machine (+67, sanctionedsuicide.com and sanctioned-suicide.org): https://wiki.archiveteam.org/?diff=49399&oldid=49350
01:00:07	<h2ibot>	JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=49400&oldid=49399
01:05:19		Sanqui joins
01:05:21		Sanqui is now authenticated as Sanqui
01:05:21		Sanqui quits [Changing host]
01:05:21		Sanqui (Sanqui) joins
01:05:21		@ChanServ sets mode: +o Sanqui
01:11:38		lunik17 joins
01:26:41		BlueMaxima quits [Read error: Connection reset by peer]
01:32:51		rocketdive quits [Ping timeout: 265 seconds]
01:34:32	<@OrIdow6>	It appears that the endpoint for Zhihu Circles comments gives a 404
02:23:30		katocala quits [Remote host closed the connection]
02:32:29		katocala joins
02:32:50		katocala is now authenticated as katocala
02:55:47		Nexus quits [Client Quit]
03:02:15	<qwertyasdfuiopghjkl>	Ryz: Related to Issuu, https://issuu.com/categories probably won't find everything, since there are unlisted pages ( https://help.issuu.com/hc/en-us/articles/5772764718363-Unlisted-Content ).
03:32:19		fishingforsoup joins
03:32:31	<fishingforsoup>	I need serious help in finding the videos that were on these URLs.
03:32:32	<fishingforsoup>	http://youtube.com/watch?v=XFMS0Hr1Ub4
03:32:47	<fishingforsoup>	http://youtube.com/watch?v=yj8MTPX8zDE
03:33:06	<fishingforsoup>	http://youtube.com/watch?v=JDyCsTDoKc0
03:33:42	<fishingforsoup>	From the metadata I can find, they include songs. Songs that are lost.
05:10:26		Craigle quits [Quit: The Lounge - https://thelounge.chat]
05:10:53		Craigle (Craigle) joins
05:52:59	<@JAA>	FWIW, I poked at Issuu a bit, but I couldn't find anything that would allow us to enumerate the documents. I did find an SWF reader, but it takes a document ID of the form 170214070741-cc557127661564b84205d23b42cf67f1, obviously still not bruteforcable.
05:53:18	<@JAA>	That's at https://static.issuu.com/webembed/viewers/style1/v2/IssuuReader.swf?mode=mini&documentId=170214070741-cc557127661564b84205d23b42cf67f1 if anyone wants to dig further.
05:56:10	<@JAA>	That's revisionId-publicationId per some JSON in the document pages.
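
JAA's observation that the document ID has the form revisionId-publicationId can be sketched as a small parser. This is only a minimal sketch, assuming the ID always has exactly that two-part shape with a single hyphen as separator; the names `IssuuDocumentId` and `split_issuu_document_id` are hypothetical and not part of any Issuu API.

```python
from typing import NamedTuple


class IssuuDocumentId(NamedTuple):
    revision_id: str
    publication_id: str


def split_issuu_document_id(document_id: str) -> IssuuDocumentId:
    """Split 'revisionId-publicationId' at the first hyphen (assumed format)."""
    revision_id, sep, publication_id = document_id.partition("-")
    if not sep or not revision_id or not publication_id:
        raise ValueError(f"not of the form revisionId-publicationId: {document_id!r}")
    return IssuuDocumentId(revision_id, publication_id)


# The ID quoted in the log above.
parts = split_issuu_document_id("170214070741-cc557127661564b84205d23b42cf67f1")
print(parts.revision_id)
print(parts.publication_id)
```

Splitting the ID does not make enumeration any easier; it only separates the two components for anyone digging further.
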
06:22:00		benjinsm joins
06:23:16		benjins2_ joins
06:25:14		benjins quits [Ping timeout: 264 seconds]
06:25:50		benjins2 quits [Ping timeout: 264 seconds]
06:26:25		Craigle quits [Client Quit]
06:27:04		Craigle (Craigle) joins
06:41:15		Ruthalas5 (Ruthalas) joins
07:08:26		wyatt8750 quits [Ping timeout: 264 seconds]
07:10:41		hitgrr8 joins
07:13:49	<audrooku|m>	JAA: In regards to your messages in #onlyfails
07:13:49	<audrooku|m>	It didn't really click for me how those IA collections of megawarcs and that data being in the wbm were related, so I guess they're basically indexed by the WBM and the various grabs end up in various collections? Both by the IA, AT, and other partners? I've found quite a lot of use through the WBM CDX API since I've learned of it, though searching through a local set of CDX data could be beneficial to me, especially since sometimes the CDX API
07:13:49	<audrooku|m>	appears to miss some results (sometimes older grabs appear at a later date), do you know roughly how often these grab collections are publicly available? For example there's a collection of deviantart grabs that is not; if most collections are public it might be worth my time to scrape the CDXs from many collections.
07:24:32		wyatt8740 joins
07:29:04		wyatt8740 quits [Ping timeout: 265 seconds]
07:29:15		wyatt8740 joins
08:04:52		Icyelut|2 (Icyelut) joins
08:06:38		Icyelut quits [Ping timeout: 264 seconds]
08:32:09		us3rrr quits [Read error: Connection reset by peer]
08:50:30		Island quits [Read error: Connection reset by peer]
09:01:00		Mateon1 quits [Remote host closed the connection]
09:01:08		Mateon1 joins
09:26:09		Minkafighter72 quits [Quit: The Lounge - https://thelounge.chat]
09:27:53		Minkafighter72 joins
10:12:47		jacksonchen666 quits [Ping timeout: 276 seconds]
10:14:53		jacksonchen666 (jacksonchen666) joins
11:18:47		IDK (IDK) joins
11:45:14		leo60228 quits [Ping timeout: 265 seconds]
11:47:14		leo60228 (leo60228) joins
11:48:30		Jackster joins
11:51:32	<Jackster>	I am using heritrix to archive a few websites. Because I had to restart it a few times, I have ended up with a couple of .warc.gz files. Looking to make them into a single file if possible. I have tried https://github.com/arquivo/mergeWARCs but it creates 3 files? https://github.com/maturban/WARCMerge won't run. Any suggestions or am I doing it wrong?
11:53:44	<Jackster>	If pywb can use these multiple files, then I don't need to do this
12:02:40	<Jackster>	Ah yes I don't need to do that.
12:02:43	<Jackster>	ggwp
12:02:45		Jackster quits [Remote host closed the connection]
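
As Jackster concluded, merging is optional. For cases where a single file is still wanted, the following is a minimal sketch, assuming each .warc.gz is an ordinary gzip stream (typically one gzip member per record), so that plain byte concatenation yields a valid multi-member gzip file. The function name and file names are hypothetical, and the demonstration uses placeholder payloads rather than real WARC records.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concatenate_gzip_files(inputs, output):
    """Copy each input's raw bytes into output, in order, without recompressing."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Demonstration with placeholder payloads (not real WARC records).
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "part-1.warc.gz"   # hypothetical file names
    second = Path(tmp) / "part-2.warc.gz"
    merged = Path(tmp) / "merged.warc.gz"
    with gzip.open(first, "wb") as f:
        f.write(b"record one\n")
    with gzip.open(second, "wb") as f:
        f.write(b"record two\n")
    concatenate_gzip_files([first, second], merged)
    # gzip.open reads every member of a multi-member stream in sequence.
    with gzip.open(merged, "rb") as f:
        combined = f.read()

print(combined)
```

This does not deduplicate records or rewrite any headers; it only joins the files end to end.
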
12:06:33		benjinsm is now known as benjins
12:06:34		benjins is now authenticated as benjins
12:56:10	<h2ibot>	Bzc6p edited Kepfeltoltes.eu (-20, more accurate statistics, 2022 in progress): https://wiki.archiveteam.org/?diff=49401&oldid=49353
14:06:29		rocketdive joins
14:33:14		ok joins
14:47:27		ok quits [Ping timeout: 265 seconds]
15:22:09		ok joins
15:23:20		ok quits [Remote host closed the connection]
15:26:28		ok joins
15:26:34	<ok>	hi
15:43:00		HP_Archivist (HP_Archivist) joins
15:43:28	<@arkiver>	OrIdow6: yeah i was looking into zhihu circles
15:43:44	<@arkiver>	but wasn't exactly sure what part was shutting down
15:43:52	<@arkiver>	it may be good though to do a project for the entire thing in general
15:47:40		ok74 joins
15:47:47	<@arkiver>	for zhihu, only the /club/* pages on zhihu.com are shutting down?
15:49:10	<@arkiver>	so, if I read this correctly, access to https://www.zhihu.com/club/explore will be disabled at the end of this month, and all actual posts will no longer be accessible at the end of march?
15:49:49	<ok74>	hey
15:49:55	<@arkiver>	hi
15:50:16	<ok74>	just found out my twitter account was targeted by your org
15:50:40		ok quits [Remote host closed the connection]
15:52:03	<ok74>	don't get the idea to put a bot to stalk every one of my tweets
15:54:03	<ok74>	can someone give me some explanation please
15:54:12	<madpro|m>	Hi ok74,
15:54:18	<ok74>	hi
15:54:23	<madpro|m>	That's IRC for you, it can take a few hours for people to respond.
15:54:30	<thuban>	ok74: we archive a very broad range of material. if you don't want your account archived, it's best to contact archive.org and ask for it to be excluded. that will prevent everyone from saving it there, not just us https://help.archive.org/help/how-do-i-request-to-remove-something-from-archive-org/
15:55:26	<madpro|m>	Rest assured Archive Team is a bona fide volunteer organisation. Its mission is to preserve digital heritage, without disturbing people if possible.
15:56:44	<madpro|m>	As thuban mentioned, Archive Team scrapes of Twitter are sent to archive.org (unaffiliated). So you can ask archive.org for a take-down directly.
15:56:55	<ok74>	man i only tweeted 5 times in 2 years you archived all including 3 tweets from a conversation in a space where i wrote to a friend some location (the tweets lasted literally 2 min)
15:56:59	<ok74>	it's insane
15:58:06	<ok74>	based on the fact i'm osint volunteer it can be so harmful
15:58:33	<madpro|m>	Yeah, we have a lot of overlap.
15:58:37	<madpro|m>	If you are familiar with DEFCON
15:58:55	<ok74>	yes
15:58:56	<madpro|m>	With OS-INT, I mean.
15:59:06	<ok74>	i know that dude
16:00:40	<h2ibot>	Bzc6p edited EOldal (+238, Finished.): https://wiki.archiveteam.org/?diff=49402&oldid=49291
16:04:00	<madpro|m>	ok74: An AT member once put it like this
16:04:21	<madpro|m>	"Our priority should be sites where user content was solicited and then provided.."
16:06:02	<madpro|m>	And that is a good maxim. It trusts the original author's consent and even will in making their content public.
16:06:27	<madpro|m>	Every now and again, a tracker may slip up. Or someone may dump links important to them into the crawler.
16:07:28	<madpro|m>	That's why archive.org has its take-down form. That's how we uphold "the right to be forgotten"
16:09:02	<madpro|m>	Compared to CommonCrawl and larger projects, I would say that Archive Team is much more reasonable.
16:09:05	<madpro|m>	Wouldn't you agree?
16:09:49		ivan leaves
16:15:30	<ok74>	i don't know much about commoncrawl or larger projects, but you're right it was a mistake to use a non-randomised pseudonym and to publicly tweet a location to meet a friend
16:16:49	<madpro|m>	Mistakes happen, don't be ashamed. :)
16:16:51	<madpro|m>	and this is how we try to remedy it
16:17:24	<madpro|m>	I'm sorry I cannot offer anything more
16:19:09	<ok74>	don't worry i will burn name and account, so sad it was so cool, but security before everything
16:23:45		HP_Archivist quits [Client Quit]
16:34:46	<h2ibot>	Bzc6p edited EOldal (-16, /* Archiving */ 874 GiB total): https://wiki.archiveteam.org/?diff=49403&oldid=49402
16:51:51		ok74 quits [Remote host closed the connection]
17:06:39		jacksonchen666_ (jacksonchen666) joins
17:08:08		jacksonchen666 quits [Ping timeout: 276 seconds]
17:11:32		jacksonchen666_ is now known as jacksonchen666
17:22:44		spirit quits [Quit: Leaving]
18:36:05	<@OrIdow6>	arkiver: I'm not sure. The link to https://zhuanlan.zhihu.com/p/201000811 suggests it is /club/, though that page mainly refers to circles as being the pages accessed through interface elements I don't see. Pessimistically "all the entrances of the circle except the personal page will be closed" (Google Translate of https://zhuanlan.zhihu.com/p/585385202) could mean that they become inaccessible to the general public, though I do think your
18:36:07	<@OrIdow6>	interpretation is more likely.
18:49:12	<@JAA>	audrooku|m: Yes, the WBM is basically just an interface to WARC files (and, for ancient data, ARC files) stored in IA items. This includes a wide variety of sources, including IA's own crawls and our stuff. Our data is normally publicly accessible, but IA's almost always isn't. For other sources, it varies. The WBM CDX API returns matches for anything that can be found in the WBM. It's just more
18:49:18	<@JAA>	convenient to download a handful of CDX files from the items when that's possible. The API has some annoying quirks, is rate-limited, and doesn't scale well.
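
For readers unfamiliar with the WBM CDX API that JAA describes, here is a minimal sketch of building a query URL and turning a JSON response into dictionaries. It assumes the public endpoint at `https://web.archive.org/cdx/search/cdx` with its `url`, `output`, `matchType`, and `limit` parameters; the helper names are hypothetical, and the sample response row is a made-up placeholder rather than a real capture. No network request is made.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, limit=None, match_type=None):
    """Build a CDX API query URL that asks for JSON output."""
    params = {"url": url, "output": "json"}
    if match_type is not None:
        params["matchType"] = match_type
    if limit is not None:
        params["limit"] = str(limit)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def rows_to_dicts(payload):
    """Convert the JSON output (a header row followed by data rows) to dicts."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


query = build_cdx_query("example.com", limit=5, match_type="prefix")
print(query)

# Placeholder response for illustration only; not a real capture.
sample = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20230122000000", "http://example.com/", "text/html", "200", "PLACEHOLDERDIGEST", "1234"],
])
captures = rows_to_dicts(sample)
print(captures[0]["timestamp"], captures[0]["original"])
```

As JAA notes, the API is rate-limited, so where an item's CDX files are public, downloading them is likely the better choice for bulk work.
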
18:50:29	<pokechu22>	Incidentally, https://wayback.archive-it.org/5902//http://www.nsf.gov/statistics/nsf01313/patterns.htm is an interesting thing to compare with https://web.archive.org/web//http://www.nsf.gov/statistics/nsf01313/patterns.htm - I vaguely knew of archive-it.org but didn't realize it had a separate interface
18:53:15	<pokechu22>	hmm, also interesting: https://wayback.archive-it.org/5902/timemap/cdx?url=http://www.nsf.gov/statistics/nsf01313/patterns.htm exists but https://archive.org/download/ARCHIVEIT-5902-ONE_TIME-JOB162185-00000 has CDX files restricted
18:54:20	<@JAA>	Well, yeah, you can get the same info from the CDX API as well. The API doesn't allow you to list all entries from a particular WARC file though, as the CDX file would.
18:55:20	<@JAA>	It makes sense for SPN, I guess. I don't really understand why they restrict their web-wide crawls though.
18:56:02	<@JAA>	Archive-It partners might not want the URL lists to be available either, so maybe that's somewhat reasonable as well.
19:02:40		hitgrr8 quits [Read error: Connection reset by peer]
19:03:48		hitgrr8 joins
19:39:02		igloo22225 quits [Ping timeout: 264 seconds]
19:41:15		mut4ntm0nkey quits [Remote host closed the connection]
19:41:55		mut4ntm0nkey (mutantmonkey) joins
19:53:32	<avoozl>	JAA: do you know if I can see any of the telegram pages online anywhere? I'm parsing them and attempting to construct some data extractors for them, but have a bit of trouble coming up with the right element selectors on the html pages
19:57:26	<@JAA>	avoozl: Should all be in the WBM, unless you're looking at a very recent upload (as it takes a bit until a newly indexed WARC actually shows up in the WBM).
20:02:03	<avoozl>	I'll have a look, maybe I was looking in the wrong place. Just pick a random url from the json.gz and plug that into wbm?
20:02:46		igloo22225 (igloo22225) joins
20:03:48		Island joins
20:04:50	<@JAA>	From the CDX
20:06:26	<avoozl>	ahh ok
20:06:50	<avoozl>	I feel like I forgot a lot of stuff since last time I touched this :)
20:10:38	<avoozl>	I see a lot of things captured but I'm unsure where to find actual content, maybe I'm just unlucky
20:10:40	<avoozl>	https://web.archive.org/web/20221225004205/https://t.me/Aallbany/1257
20:14:14	<@JAA>	Yeah, that's just the awkward Telegram web interface. Try one of the /s/ URLs.
20:14:47	<@JAA>	I believe there are other cases, too. But for specific stuff rather than general WARC parsing etc., #telegrab
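
JAA's suggestion to try one of the /s/ URLs can be sketched as a small URL rewrite. This is a minimal sketch, assuming the /s/ form is simply the t.me path with `/s` prefixed and that WBM capture URLs follow the `https://web.archive.org/web/<timestamp>/<url>` pattern seen in the log; both helper names are hypothetical, and nothing here checks that a capture actually exists.

```python
from urllib.parse import urlsplit, urlunsplit


def to_preview_url(url):
    """Rewrite a t.me URL to its /s/ form (assumed mapping)."""
    parts = urlsplit(url)
    if parts.netloc != "t.me":
        raise ValueError(f"not a t.me URL: {url!r}")
    path = parts.path
    if not path.startswith("/s/"):
        path = "/s" + path
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def wayback_url(timestamp, url):
    """Build a WBM URL for a given CDX timestamp and original URL."""
    return f"https://web.archive.org/web/{timestamp}/{url}"


preview = to_preview_url("https://t.me/Aallbany/1257")
print(preview)
# The timestamp below is reused from the log purely to show the URL shape.
print(wayback_url("20221225004205", preview))
```

In practice the timestamp and URL should come from the CDX data, as JAA says, rather than being guessed.
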
20:55:01		Chris5010 quits [Ping timeout: 252 seconds]
21:10:30		Chris5010 (Chris5010) joins
22:29:16		BlueMaxima joins
22:33:02		hitgrr8 quits [Client Quit]
23:18:00		IDK quits [Client Quit]