#archiveteam-bs log for 2024-06-16

00:10:45		Earendil7 quits [Ping timeout: 272 seconds]
00:10:47		Earendil7_ (Earendil7) joins
00:11:47		Arcorann (Arcorann) joins
00:13:00		nertzy joins
00:15:29		pedantic-darwin quits [Quit: The Lounge - https://thelounge.chat]
00:15:36	<BornOn420>	no, straight from the Netherlands
00:15:45		pedantic-darwin joins
00:16:40	<BornOn420>	Unable to find image 'containrrr/watchtower:latest' locally
00:16:40	<BornOn420>	docker: Error response from daemon: Head "https://registry-1.docker.io/v2/containrrr/watchtower/manifests/latest": unauthorized: incorrect username or password.
00:26:05		DogsRNice joins
00:28:51	<Notrealname1234>	Can we archive LEGO Life?
00:29:03	<Notrealname1234>	It's not so popular anyway
00:29:33	<Notrealname1234>	If we can get the API calls
00:34:44	<nulldata>	Notrealname1234 - Yeah would need to figure out API calls. Slightly harder as I think it's app only - no browser version.
00:35:07		nothere joins
00:35:08	<Notrealname1234>	Yeah, it's app only.
00:35:16	<Notrealname1234>	Use fiddler?
00:58:36		pedantic-darwin quits [Client Quit]
00:58:52		pedantic-darwin joins
01:04:56		Notrealname1234 quits [Client Quit]
01:08:48	<nulldata>	!tell Notrealname123 "Maybe. A lot of apps these days use certificate pinning, which makes it hard to MITM as they don't allow self-signed certs."
01:08:48	<eggdrop>	[tell] ok, I'll tell Notrealname123 when they join next
02:24:25	<h2ibot>	PaulWise edited SmolNet (+88, add some probably missed finger URLs): https://wiki.archiveteam.org/?diff=52361&oldid=52339
04:34:56	<pabs>	for archiving things that redirect to random URLs from a non-public list, do we have a way to archive the randomisation URL many times to extract the whole list?
04:35:12	<pabs>	for eg https://theforest.link/ https://theforest.link/go-for-a-walk
05:02:11	<pokechu22>	https://theforest.link/go-for-a-walk?1 https://theforest.link/go-for-a-walk?2 https://theforest.link/go-for-a-walk?3 https://theforest.link/go-for-a-walk?4 etc, but if it's redirecting to a ton of offsite domains that might cause cookie problems for archivebot
05:06:07	<pabs>	cookie problems?
05:07:12	<pabs>	as in too many cookies in the cookie jar, leading to slowdowns?
05:10:44	<pokechu22>	Yeah, due to inefficiencies with how Python's cookie jar implementation works
05:11:10	<pokechu22>	something about it not removing entries for expired cookies from the dictionary containing cookies
05:11:27	<pokechu22>	it'd probably be fine for an !ao < list job for discovery but not recursion... probably
05:13:17	<pabs>	yeah I was just going to do front pages
05:13:35	<pabs>	who knows how many links are in it :)
05:14:12	<pabs>	hmm, wonder if that python cookie jar thing is fixed in newer python
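The approach pokechu22 describes at 05:02:11 (requesting the same randomising URL many times, each with a different dummy query string so every request counts as a distinct URL) could be sketched roughly as below. This is only an illustration: the function name `randomiser_urls` and the count of 5 are assumptions, and nothing in the log confirms how many requests would be needed to see the whole list or that the site ignores the query string.

```python
def randomiser_urls(base, count):
    """Return `count` variants of `base`, each with a distinct dummy query string.

    The query string (?1, ?2, ...) is assumed to be ignored by the site; it only
    makes each URL unique so that each one is fetched separately.
    """
    return [f"{base}?{i}" for i in range(1, count + 1)]


if __name__ == "__main__":
    # Print one URL per line, suitable for saving as a plain-text list.
    for url in randomiser_urls("https://theforest.link/go-for-a-walk", 5):
        print(url)
```

The printed list would then be used for a non-recursive list job, as suggested at 05:11:27, rather than a recursive crawl.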
05:47:25		Sophira joins
05:50:18	<Sophira>	Hi there. This is a bit of a long shot since I don't think what I'm about to ask about is actually an Archive Team project, but there's a YouTube video from 2008 that I have reason to believe would be contained within the "YouTube Video Crawldata" collections on archive.org, which have "Internet Archive" listed as the contributor. Does anyone know who I'd need to contact to get access to this?
05:50:55	<Sophira>	(the downloads on these items are restricted, presumably for bandwidth reasons)
05:53:02	<Sophira>	(I can see an email address for the operator in the data, but I don't know if that's something I should be using or not.)
05:55:50	<imer>	Sophira: try https://findyoutubevideo.thetechrobo.ca/
05:58:22		mighty-dob (mighty-dob) joins
05:58:48	<Sophira>	Ah, thank you!
06:03:15	<Sophira>	Sadly, it doesn't appear to be available in any of the services searched by that link. It says the metadata is available in the Internet Archive but I don't believe that's actually the case.
06:03:26	<Sophira>	Thank you for the link though <3 I'll save that.
06:08:37	<Sophira>	(In the Wayback Machine, rather.)
06:30:02		HP_Archivist quits [Read error: Connection reset by peer]
06:36:02		DogsRNice quits [Read error: Connection reset by peer]
06:50:20		BlueMaxima quits [Read error: Connection reset by peer]
07:06:11		Unholy23619246453771 (Unholy2361) joins
07:52:27		Earendil7_ quits [Ping timeout: 272 seconds]
07:52:35		Earendil7 (Earendil7) joins
08:08:43		Wohlstand (Wohlstand) joins
08:09:22		aninternettroll quits [Ping timeout: 255 seconds]
08:24:03	<mighty-dob>	hi people. sensing the upcoming apocalypse I started to think the same way as you. I am not really a cool hacker but I managed to make a snapshot of the Arduino code database and was looking to make a GitHub archive, I even bought a 10TB NAS for personal archives. but now I found your community and it looks like you already did the whole job
08:31:17		aninternettroll (aninternettroll) joins
08:52:44		Island quits [Read error: Connection reset by peer]
09:00:01		Bleo1826007227196 quits [Client Quit]
09:01:21		Bleo1826007227196 joins
09:04:10		Wohlstand quits [Client Quit]
09:15:25		Earendil7 quits [Ping timeout: 272 seconds]
09:28:43		nulldata quits [Ping timeout: 272 seconds]
09:33:18	<yarrow>	#archivebot request: please archive https://callchelseaperetti.tumblr.com/archive if you can. Reason: proactive grab.
09:53:19		shgaqnyrjp quits [Remote host closed the connection]
09:53:22		shgaqnyrjp_ (shgaqnyrjp) joins
09:56:55		Ryz quits [Ping timeout: 255 seconds]
09:58:45		Ryz (Ryz) joins
10:22:12	<pabs>	mighty-dob: for code archiving, see #gitgud #codearchiver (hackint) #swh (libera)
10:22:59	<pabs>	https://wiki.archiveteam.org/index.php/Codearchiver https://www.softwareheritage.org/
10:23:34	<pabs>	mighty-dob: if you've got websites you want on archive.org, ArchiveBot can save them, list sites and reasons here
10:23:58	<pabs>	personal archives are also good to have too of course :)
10:24:17	<pabs>	also check out https://wiki.archiveteam.org/index.php/Warrior
10:24:48	<pabs>	the apocalypse is ongoing, websites die every day https://wiki.archiveteam.org/index.php/Deathwatch
10:28:42		JaffaCakes118 (JaffaCakes118) joins
10:42:45		muklumsum quits [Client Quit]
10:42:54	<mighty-dob>	pabs: ty I'll check it
10:48:37	<mighty-dob>	what do you think causes websites to close?
10:50:13	<pabs>	lots of reasons, usually money or people got bored or some drama
10:50:45		muklumsum joins
11:02:32		yarrow quits [Read error: Connection reset by peer]
11:05:26		yarrow (yarrow) joins
11:35:55		qwertyasdfuiopghjkl2 joins
11:42:02	<qwertyasdfuiopghjkl2>	https://www.connectseward.org/connect-seward-services-shutting-down/ "After 27 years of offering free e-mail and website hosting for many businesses, organizations, and individuals in Seward County, Connect Seward County will be shutting down effective June 30th, 2024."
11:46:31	<mighty-dob>	https://www.geeksforgeeks.org/ the best C++ self-learning online book I've found. it doesn't close but I've been interested in getting an offline copy
11:46:54	<mighty-dob>	*contains a lot of javascript
11:47:39	<qwertyasdfuiopghjkl2>	From https://www.google.com/search?q=%22Hosted+by+Connect+Seward+County%22 there seems to be a lot of sites that will be affected by the shutdown, but I'm guessing that search probably won't find all of them. (I don't currently have the time to look into it more)
11:48:31		kiryu__ quits [Ping timeout: 255 seconds]
12:08:19		mighty-dob quits [Ping timeout: 272 seconds]
13:04:00		mighty-dob (mighty-dob) joins
13:16:45		kiryu joins
13:16:45		kiryu is now authenticated as kiryu
13:16:45		kiryu quits [Changing host]
13:16:45		kiryu (kiryu) joins
13:17:29		shgaqnyrjp_ is now known as shgaqnyrjp
13:31:55		nertzy quits [Ping timeout: 272 seconds]
13:50:54		nertzy joins
13:52:49		Arcorann quits [Ping timeout: 272 seconds]
13:54:25		Notrealname1234 (Notrealname1234) joins
13:58:44		Notrealname1234 quits [Client Quit]
14:34:42		nulldata (nulldata) joins
15:04:54		grid joins
15:25:25	<myself>	mighty-dob: I'd love to learn more about your Arduino stuff, do you mean you grabbed the Arduino-as-an-organization's own repos for the Arduino-branded IDE and stuff? Or were you able to spider all the libraries and board-support packages?
15:36:01		Guest54 joins
15:40:43		Guest54 quits [Ping timeout: 255 seconds]
15:55:30		Lord_Nightmare quits [Quit: ZNC - http://znc.in]
15:58:59		Lord_Nightmare (Lord_Nightmare) joins
16:13:10		JaffaCakes118 quits [Remote host closed the connection]
16:30:21		JaffaCakes118 (JaffaCakes118) joins
16:31:48		JaffaCakes118_2 (JaffaCakes118) joins
16:32:12		ymgve quits [Quit: Leaving]
16:33:15		JaffaCakes118_2 quits [Read error: Connection reset by peer]
16:34:13		JaffaCakes118 quits [Remote host closed the connection]
17:12:18		superkuh joins
17:14:38		grid quits [Client Quit]
17:22:57	<mighty-dob>	myself: I wrote a bash script crawler that walked across the entire API (public JSON file) and downloaded all libraries one-by-one. then I packed them into ~28 zip archives (A..Z) as otherwise moving 30k files across drives was impossible. later I found the open ZIM project and decided that it could be handy to pack Arduino library files into a ZIM package so it's easier to work with but didn't make that yet
17:24:36	<myself>	niiiiiice. That plus all the board support stuff would be amazing to have reliably archived offline.
17:29:16	<mighty-dob>	yep. ZIM format could be great for it. I have the entire wikipedia and a lot of other useful web resources downloaded in ZIM format on my NAS for worst case scenarios. I can use them offline or share with the other people as it has webserver to access it
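The grouping step mighty-dob mentions at 17:22:57 (walking a public JSON index and splitting the downloads into per-letter archives) might look something like the sketch below. mighty-dob's actual script was bash and is not shown in the log, so this is not their code. The index layout (a top-level `libraries` list whose entries have `name` and `url` keys), the function name `bucket_by_initial`, and the catch-all `other` bucket are all assumptions made for illustration; the sample entries are made up and no download is performed.

```python
from collections import defaultdict


def bucket_by_initial(index):
    """Group download URLs by the first letter of each library name.

    `index` is assumed to look like {"libraries": [{"name": ..., "url": ...}]}.
    Names that do not start with an ASCII letter go into an "other" bucket.
    """
    buckets = defaultdict(list)
    for lib in index.get("libraries", []):
        first = lib["name"][:1].upper()
        key = first if first.isascii() and first.isalpha() else "other"
        buckets[key].append(lib["url"])
    return dict(buckets)


if __name__ == "__main__":
    # Hypothetical index entries, standing in for the real JSON file.
    sample = {
        "libraries": [
            {"name": "AlphaLib", "url": "https://example.invalid/a.zip"},
            {"name": "alphaTwo", "url": "https://example.invalid/b.zip"},
            {"name": "9axis", "url": "https://example.invalid/c.zip"},
        ]
    }
    for letter, urls in sorted(bucket_by_initial(sample).items()):
        print(letter, len(urls))
```

Each bucket could then be downloaded and written into its own zip file, which keeps the number of loose files small when moving the archive between drives.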
17:31:53	<that_lurker>	remember to take a look at the publishing policy of openzim https://openzim.org/wiki/Content_team#Publishing
17:32:42	<mighty-dob>	right
17:32:46	<that_lurker>	If you want to make them official that is.
17:33:07	<that_lurker>	https://openzim.org/wiki/Build_your_ZIM_file
17:34:10	<mighty-dob>	perhaps I'll need Arduino permission to publish their database
17:35:51	<mighty-dob>	or just share it via torrent
17:40:31	<that_lurker>	If it's public data then at least push it to Internet Archive
17:40:39		nertzy quits [Client Quit]
17:43:04	<mighty-dob>	I don't know much about your movement yet, didn't figure out how you make archives and how to use them
17:43:36	<mighty-dob>	I am just lurking so far in free time
17:46:20		coderobe quits [Quit: Killed (K-Lined)]
17:52:04		BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
19:04:41		@arkiver is back from vacation :)
19:05:38	<imer>	welcome back!
19:15:36	<that_lurker>	did you have a nice vacation
19:27:58		mighty-dob quits [Ping timeout: 255 seconds]
20:18:49		coderobe (coderobe) joins
20:31:58		JayEmbee quits [Quit: WeeChat 2.3]
20:43:46	<fireonlive>	welcome back arkiver :3 hope you had a good time
21:10:17		fuzzy8021 is now known as fuzzy80211
21:10:24		fuzzy80211 is now known as group
21:10:28		group is now known as fuzzy80211
21:11:05		fuzzy80211 is now authenticated as *
21:11:05		fuzzy80211 is now authenticated as fuzzy80211
21:24:04		Chris5010 quits [Ping timeout: 255 seconds]
21:28:46		simon8162 quits [Quit: ZNC 1.8.2 - https://znc.in]
21:30:50		simon816 (simon816) joins
21:56:34		JaffaCakes118 (JaffaCakes118) joins
22:02:04		Chris5010 (Chris5010) joins
22:11:15		Wohlstand (Wohlstand) joins
22:11:56		coderobe quits [Client Quit]
23:01:55		knecht4 quits [Ping timeout: 272 seconds]
23:12:41		knecht4 joins
23:15:13		ats quits [Ping timeout: 255 seconds]
23:17:20		wyatt8750 quits [Remote host closed the connection]
23:17:51		wyatt8740 joins
23:19:01		@Sanqui quits [Ping timeout: 272 seconds]
23:23:27		thuban quits [Ping timeout: 272 seconds]
23:24:17		thuban (thuban) joins
23:36:07	<@arkiver>	thank you :)
23:40:19	<@arkiver>	https://www.wired.com/story/the-fight-against-ai-comes-to-a-foundational-data-set/
23:40:55	<@arkiver>	https://www.businessinsider.com/new-york-times-content-removed-common-crawl-ai-training-dataset-2023-11
23:41:04	<@arkiver>	> The New York Times discovered that Common Crawl, one of the largest AI training datasets, contained millions of URLs linking to its paywalled articles and other copyrighted content.
23:41:17	<nicolas17>	>discovered
23:41:25	<nicolas17>	it seems kind of obvious that CC would have NYT?
23:41:32	<@arkiver>	well sure
23:42:01	<@arkiver>	The main problem here is that web archivists behind CC are seen as data collectors for LLM training.
23:42:21	<nicolas17>	also doesn't CC only have links, so if you want to train your AI with it, you have to actually download them off the original source again?
23:43:42	<@arkiver>	They have WARCs available I believe.
23:44:13	<@arkiver>	example https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz
23:44:24	<nicolas17>	hm I see
23:45:00	<@arkiver>	But, the Common Crawl case is an example here. Unfortunately this also affects Archive Team and our WARCs.
23:45:33		BlueMaxima joins
23:47:19	<katia>	:\ youtube?
23:55:47		Wohlstand quits [Client Quit]
23:55:54		loug4 quits [Client Quit]
23:57:16	<@arkiver>	katia: yes, that is an example.
23:57:31		yarrow quits [Read error: Connection reset by peer]
23:57:38	<nicolas17>	is that why youtube warcs were blocked recently?
23:58:11		Sanqui joins
23:58:13		Sanqui is now authenticated as Sanqui
23:58:13		Sanqui quits [Changing host]
23:58:13		Sanqui (Sanqui) joins
23:58:13		@ChanServ sets mode: +o Sanqui