00:11:30lunik17 quits [Quit: Ping timeout (120 seconds)]
00:11:35lunik17 joins
00:12:29<myself>maybe we can ask the admin to disable some stuff like hit counters or whatever that's slowing it down.... or just hand us a database dump
00:30:26sec^nd quits [Ping timeout: 245 seconds]
00:30:57sec^nd (second) joins
01:23:19xYantix joins
01:31:15xYantix quits [Remote host closed the connection]
02:27:28Icyelut (Icyelut) joins
02:49:52lunik17 quits [Client Quit]
02:49:54lunik173 joins
02:54:02mr_sarge quits [Read error: Connection reset by peer]
02:56:22mr_sarge (sarge) joins
02:58:33BlueMaxima joins
03:06:02HP_Archivist quits [Read error: Connection reset by peer]
03:07:09eroc1990 quits [Client Quit]
03:37:56sec^nd quits [Ping timeout: 245 seconds]
03:50:58sec^nd (second) joins
04:01:35eroc1990 (eroc1990) joins
04:31:58pabs quits [Client Quit]
04:32:14pabs (pabs) joins
04:40:21Dj-Wawa quits [Remote host closed the connection]
04:40:21qwertyasdfuiopghjkl quits [Client Quit]
04:41:01Dj-Wawa joins
04:44:44sonick (sonick) joins
04:50:46eroc1990 quits [Client Quit]
04:51:18qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
04:55:15lennier1 (lennier1) joins
04:58:32dvd_ quits [Ping timeout: 252 seconds]
05:01:41<pabs>https://www.theguardian.com/media/2023/apr/12/npr-leaves-twitter-elon-musk-state-media
05:03:07dumbgoy__ quits [Ping timeout: 252 seconds]
05:04:11sec^nd quits [Ping timeout: 245 seconds]
05:08:52sec^nd (second) joins
05:09:23<pabs>a few have already been added to AB, others are findable with search engines: site:twitter.com -inurl:status npr
05:09:29pabs ENOTIME to do this
05:09:47atphoenix quits [Read error: Connection reset by peer]
05:12:23atphoenix (atphoenix) joins
05:17:04eroc1990 (eroc1990) joins
05:24:20atphoenix quits [Read error: Connection reset by peer]
05:27:27atphoenix (atphoenix) joins
05:31:10DLoader quits [Ping timeout: 252 seconds]
05:32:06sec^nd quits [Ping timeout: 245 seconds]
05:38:54sec^nd (second) joins
05:45:17<pabs>here is a list posted on #archivebot https://transfer.archivete.am/mzdmm/npr-pbs-twitter-accounts.txt
05:48:21sec^nd quits [Ping timeout: 245 seconds]
05:48:27fuzzy8021 quits [Read error: Connection reset by peer]
05:48:56sec^nd (second) joins
05:49:26fuzzy8021 (fuzzy8021) joins
05:54:15user_ quits [Remote host closed the connection]
05:54:28user_ joins
06:18:16Island quits [Read error: Connection reset by peer]
06:23:55hitgrr8 joins
06:25:28BlueMaxima quits [Client Quit]
06:29:02eroc1990 quits [Client Quit]
06:29:22eroc1990 (eroc1990) joins
07:19:52DLoader joins
07:51:14wickedplayer494 quits [Ping timeout: 252 seconds]
07:57:16Arcorann (Arcorann) joins
08:59:04Doomaholic quits [Read error: Connection reset by peer]
08:59:19Doomaholic joins
09:00:26sec^nd quits [Ping timeout: 245 seconds]
09:00:57sec^nd (second) joins
10:07:12sonick quits [Client Quit]
11:17:48Iki1 joins
11:20:58Iki quits [Ping timeout: 252 seconds]
11:25:09eroc1990 quits [Client Quit]
11:33:20eroc1990 (eroc1990) joins
11:58:45Icyelut|2 (Icyelut) joins
12:01:08dumbgoy__ joins
12:03:03Icyelut quits [Ping timeout: 265 seconds]
12:25:17Iki1 quits [Ping timeout: 265 seconds]
12:41:46HP_Archivist (HP_Archivist) joins
13:00:34Iki joins
13:00:46retromouse joins
13:07:30retromouse-2 joins
13:09:38jacksonchen666 (jacksonchen666) joins
13:09:45retromouse-3 joins
13:09:54retromouse-3 quits [Remote host closed the connection]
13:09:54retromouse-2 quits [Client Quit]
13:10:14retromouse quits [Ping timeout: 265 seconds]
13:10:15retromouse-2 joins
13:13:19retromouse-2 is now known as retromouse
13:21:12jacksonchen666 quits [Client Quit]
13:23:37Iki quits [Ping timeout: 252 seconds]
13:31:42ehmry quits [Client Quit]
13:32:06ehmry joins
13:46:03HP_Archivist quits [Client Quit]
14:04:52DiscantX quits [Ping timeout: 252 seconds]
14:08:23DiscantX joins
14:15:23<retromouse>Greetings everyone,
14:15:23<retromouse>I have lately used dokuwiki-dumper and mediawiki-scraper and learned about archive team.
14:15:23<retromouse>I'm looking for a place where hopefully I can learn to use tools that make my crawls useful for more people.
14:16:09<retromouse>Does Archive Team have any way to create WARCs from existing static pages you have on your drive?
14:28:34Iki joins
14:29:02<retromouse>If someone can drop me a link to the right tool, even if it is an elaborate process, I would appreciate it
14:33:28qwertyasdfuiopghjkl quits [Client Quit]
14:54:57geezabiscuit (geezabiscuit) joins
15:01:31Arcorann quits [Ping timeout: 252 seconds]
15:06:20dvd joins
15:06:39<@arkiver>retromouse: that likely won't be possible, unless you have the original request and response headers saved and have various metadata
15:08:38jacksonchen666 (jacksonchen666) joins
15:08:49<retromouse>If you need to replay the whole process, I could just deploy the website on a static server so that a robot can browse around.
15:09:22<retromouse>So I can deploy the website and make any fixes needed for it to display correctly from the static files
15:09:37<retromouse>then I can browse the website from localhost
15:09:44<retromouse>would that be enough?
15:12:22<retromouse>What do you say @arkiver?
15:15:28<@Sanqui>no that is not possible, or desirable, because in that case you are "making up" data
15:15:30dvd_ joins
15:15:32<@Sanqui>since you don't have the data necessary for a warc, I have to ask what is your motivation to make a warc?
15:15:39<@Sanqui>your data is still useful the way it is, no need to "soup it up" with fake information
15:16:22dvd quits [Ping timeout: 252 seconds]
15:16:34<tech234a>Google Currents (remainder of Google Plus) shuts down July 5 https://workspaceupdates.googleblog.com/2023/04/new-community-features-for-google-chat-and-an-update-currents%20.html
15:21:36<retromouse>To have it in a standard format for easy distribution, sanqui. I can of course note the changes I made to make everything browsable without accounts or captchas, etc.
15:22:35<retromouse>I don't want anyone to feel I'm faking anything; I just made the minimum changes to make everything accessible without having the original server
15:26:16Iki quits [Ping timeout: 252 seconds]
15:27:20Iki joins
15:32:42Island joins
15:35:01dvd__ joins
15:35:26Iki1 joins
15:35:51geezabiscuit quits [Client Quit]
15:35:51dvd_ quits [Remote host closed the connection]
15:35:51Iki quits [Remote host closed the connection]
15:36:27geezabiscuit (geezabiscuit) joins
16:04:14user_ quits [Remote host closed the connection]
16:04:27user_ joins
16:13:21jacksonchen666 quits [Ping timeout: 245 seconds]
16:23:41hackbug quits [Remote host closed the connection]
16:27:29hackbug (hackbug) joins
16:29:52<retromouse>So, could someone point me in the right direction?
16:31:23<pokechu22>For mediawiki, the benefit of the dump is that you can import it into a new wiki, and that it's compact for that purpose
16:32:16<retromouse>Well, in this case I'm talking about general pages, not wikis
16:32:27<pokechu22>Warc is a bit of an awkward format - the main benefit is that it works on web.archive.org, but if you're generating it instead of scraping an actual site it probably won't be added there. It's also basically a list of pages and responses, and things like searches won't work on it
16:32:36<retromouse>the wiki dumps done with mediawiki-scraper and dokuwiki-dumper are great
16:33:06<retromouse>But I have crawled sites using my own spider and ended up with a set of static pages
16:33:18<@JAA>Thou shalt never 'create' WARCs from anything but direct interaction with the original server.
16:33:30<pokechu22>Ah, a list of static files you want to browse - if it works as a folder, it's probably easier to just upload a 7z file of that and extract it and then browse that
16:34:26<pokechu22>The big thing WARCs have going over a compressed folder is that they include the original HTTP metadata - which isn't something you can get if you're starting with a folder
16:34:35<@JAA>Static files in a tar, ZIP, or similar is fine, I suppose.
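Illustrating pokechu22's point about the original HTTP metadata: a genuine WARC response record carries the server's status line and headers alongside the body, which is exactly what a folder of saved static files no longer has. A minimal sketch of inspecting that, assuming the third-party warcio library (not part of this conversation) and a hypothetical file name:

    # List each response record's target URI plus the captured HTTP status
    # line and headers; these values came from the live server exchange.
    from warcio.archiveiterator import ArchiveIterator

    with open('example.warc.gz', 'rb') as stream:   # hypothetical WARC file
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                print(record.rec_headers.get_header('WARC-Target-URI'))
                print(record.http_headers.statusline)           # e.g. "200 OK"
                for name, value in record.http_headers.headers:
                    print(f'  {name}: {value}')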
16:36:10<retromouse>Well, if the original server didn't allow you to access all the information, the original WARC isn't that useful
16:36:22jacksonchen666 (jacksonchen666) joins
16:36:31<retromouse>I'm looking for a standard way of sharing some sites
16:36:42<retromouse>and WARC I think provides that
16:37:07<@JAA>Yes, WARC is the international standard for that, but again, you can't create that after the fact.
16:37:35<retromouse>I guess I can mount the site on a local server and use a robot to crawl it JAA
16:37:57<@JAA>Yes, that would be a WARC of your localhost or whatever, so it wouldn't be associated with the original site.
16:38:04<retromouse>The question is what tools you have that I can use to create the WARC easily
16:38:26<retromouse>That is fine, it can be a WARC of localhost
16:38:32<@JAA>And if you use /etc/hosts trickery etc., we're in faking WARC territory again.
16:39:31<retromouse>I feel you are more worried about me trying to fake something than about helping me achieve the goal of building a standard distribution file
16:39:45<retromouse>when you get over it
16:39:53<retromouse>please point me to the tools you have for it
16:40:00<retromouse>if you want
16:40:39<pokechu22>The main point is that creating a warc won't add anything useful you don't already have and will probably be more annoying for other people to navigate than just a zip of the original files
16:41:27<pokechu22>https://github.com/ArchiveTeam/wpull/ is a tool that can crawl websites to create WARCs, and is used by archivebot; there are probably other tools too
16:42:00<joepie91|m>The question about standard distribution formats was already answered; an archive file such as 7z. The only thing that WARC adds is the ability to replay server responses, which is not relevant when you don't *have* those server responses.
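A minimal sketch of the archive-file route suggested above, packing a crawl directory into a gzipped tar with Python's standard tarfile module (tar or ZIP as JAA mentioned; 7z would need an external tool). The paths are hypothetical:

    import tarfile

    # Pack the crawler's output directory into one compressed file for
    # distribution; no HTTP metadata is claimed or implied.
    with tarfile.open('my-crawl.tar.gz', 'w:gz') as tar:
        tar.add('crawl-output/', arcname='crawl-output')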
16:42:30ymgve joins
16:43:23<retromouse>Thanks for your explanations <pokechu22> , <joepie91|m>
16:43:35<@JAA>So there's wget, but it produces buggy WARCs that don't play well with other tooling. wget-at is our fork with (among other things) fixes for that. wpull is a more or less compatible reimplementation; version 2.x is buggy, so you'd want to use 1.2.3 probably. All of these can take a list of URLs as input and then retrieve those into WARCs with the right options.
16:43:37jacksonc1 (jacksonchen666) joins
16:43:46jacksonchen666 quits [Ping timeout: 245 seconds]
16:44:12<@JAA>(wpull also has some bugs in the WARC writing, but they're not *as* bad as wget's or warcio's.)
16:44:36<retromouse>That is great, so instead of crawling the site I can give a list with the exact meaningful URLs so the tool doesn't get lost, JAA?
16:44:49<@JAA>Sure
16:48:52<retromouse>Because that is one of my biggest problems with using wget or similar: it just runs through all links recursively. If I could give a list and get a WARC, that would be wonderful
16:50:57<retromouse>JAA you mean this wpull -> https://github.com/ArchiveTeam/wpull
16:53:19<retromouse>From the example of the options it's not obvious to me how to provide the list of URLs
16:54:25<retromouse>it seems to have the typical recursive option, which is what I would rather avoid
16:56:19<pokechu22>You want -i (AKA --input-file) I think
16:56:36<retromouse>Yep! Found it in the manual, thanks pokechu22
16:56:44yts98 joins
16:57:00<retromouse>I think with this I'm set to create warcs from well known urls
16:57:04<pokechu22>and also probably -p (AKA --page-requisites)
16:57:25<retromouse>sure
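A minimal sketch of the invocation being discussed, driving wpull from Python with the options mentioned above (--input-file and --page-requisites); --warc-file is assumed here from wget's option of the same name and should be checked against the wpull --help output. File names are hypothetical:

    import subprocess

    # Fetch exactly the URLs listed in urls.txt (no recursion), plus each
    # page's embedded resources, and record the captures into a WARC.
    subprocess.run([
        'wpull',
        '--input-file', 'urls.txt',   # one URL per line
        '--page-requisites',          # also fetch images, CSS, JS per page
        '--warc-file', 'my-site',     # typically produces my-site.warc.gz
    ], check=True)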
16:57:53ehmry quits [Ping timeout: 265 seconds]
16:57:56jacksonc1 quits [Ping timeout: 245 seconds]
16:58:07<retromouse>I'm just a bit tired of needing to write my own crawlers; if I can learn to use a good set of tools and crawlers and create WARCs, that would save me time and make things easier to distribute, I think
16:58:24ymgve_ joins
16:58:37<retromouse>when I write my own crawlers I end up with a bunch of static files
17:00:41ehmry joins
17:00:52ymgve quits [Ping timeout: 252 seconds]
17:01:53ymgve__ joins
17:04:01<yts98>The blog service Xuite is going to shut down on August 31. Could somebody launch a warrior-based project similar to Wretch?
17:05:16ymgve_ quits [Ping timeout: 252 seconds]
17:08:43<@JAA>yts98: Please add it to the Deathwatch wiki page so we don't forget.
17:12:08MrRadar_ (MrRadar) joins
17:12:23MrRadar quits [Ping timeout: 265 seconds]
17:12:48jacksonc1 (jacksonchen666) joins
17:18:40<yts98>Okay I just submitted an edit and it's pending review.
17:20:13jacksonc1 quits [Client Quit]
17:27:33ymgve__ is now known as ymgve
17:31:07hitgrr8 quits [Ping timeout: 252 seconds]
17:34:50hitgrr8 joins
17:39:21<pokechu22>https://www.bloomberg.com/news/articles/2023-04-12/pbs-joins-npr-in-quitting-twitter-over-state-backed-designation#xj4y7vzkg?leadSource=reddit_wall - oh boy, more twitter stuff to deal with
17:45:57ymgve_ joins
17:48:43ymgve quits [Ping timeout: 252 seconds]
18:32:25retromouse-2 (retromouse) joins
18:34:55retromouse quits [Ping timeout: 252 seconds]
18:35:26sec^nd quits [Ping timeout: 245 seconds]
18:36:21retromouse-2 quits [Remote host closed the connection]
18:36:27retromouse-3 (retromouse) joins
18:37:14retromouse-3 quits [Client Quit]
18:37:26dvd__ quits [Remote host closed the connection]
18:42:43sec^nd (second) joins
18:43:26Mateon1 quits [Remote host closed the connection]
18:44:35ymgve_ is now known as ymgve
18:56:32Mateon1 joins
19:03:43ehmry quits [Read error: Connection reset by peer]
19:05:20ehmry joins
19:09:17yts98 quits [Remote host closed the connection]
19:10:46dvd joins
19:10:47dvd_ joins
19:11:10dvd_ quits [Remote host closed the connection]
19:24:20ThreeHM quits [Ping timeout: 265 seconds]
19:31:15ThreeHM (ThreeHeadedMonkey) joins
19:44:38Barto quits [Ping timeout: 265 seconds]
19:51:17BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
19:53:04Ketchup901 (Ketchup901) joins
20:33:14Barto (Barto) joins
21:08:00tech234a quits [Quit: Connection closed for inactivity]
21:11:17<h2ibot>Yts98 edited Deathwatch (+187): https://wiki.archiveteam.org/?diff=49661&oldid=49660
21:11:18<h2ibot>Yts98 created Xuite (+2230, Created page with "{{Infobox project | title =…): https://wiki.archiveteam.org/?title=Xuite
21:20:01Ketchup901 quits [Ping timeout: 245 seconds]
21:21:15Ketchup901 (Ketchup901) joins
21:23:39<@arkiver>If anyone has ideas for projects, feel free to bring them up anytime :)
21:25:25<pokechu22>arkiver: https://wiki.archiveteam.org/index.php/Enjin is closing fairly soon
21:25:32<@arkiver>oh right!
21:28:47<@arkiver>alright time for a bunch of projects
21:30:09<pokechu22>Regarding Enjin, there is a bit of jank with their forum software which is exacerbated by ArchiveBot's jank - I should look into my notes from the AB jobs I did and document that properly, but there's e.g. a normally invisible <a href=""> tag on some pages which leads to really dumb behavior
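For context on the invisible <a href="">: by standard URL resolution an empty relative reference resolves back to the page it appears on, which is one way such a tag can send a crawler in circles; what exactly happens between Enjin and ArchiveBot isn't spelled out here. A small sketch with a hypothetical URL:

    from urllib.parse import urljoin

    # An empty href resolves to the current page itself, so a naive crawler
    # may queue the page it just fetched as if it were a new link.
    page = 'https://example.com/forum/viewthread/123'
    print(urljoin(page, ''))   # -> https://example.com/forum/viewthread/123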
21:47:23<h2ibot>Z.abdain90 edited ArchiveBot/Educational institutions/list (+92, /* Unsorted */ add napata college): https://wiki.archiveteam.org/?diff=49663&oldid=49515
22:03:46sec^nd quits [Ping timeout: 245 seconds]
22:08:36sec^nd (second) joins
22:16:03BlueMaxima joins
22:18:20BlueMaxima quits [Read error: Connection reset by peer]
22:32:17hitgrr8 quits [Client Quit]
22:39:18@Sanqui quits [Ping timeout: 252 seconds]
22:39:40lunik173 quits [Ping timeout: 252 seconds]
22:43:18HP_Archivist (HP_Archivist) joins
22:52:44onetruth joins
22:59:04Sanqui joins
22:59:06Sanqui quits [Changing host]
22:59:06Sanqui (Sanqui) joins
22:59:06@ChanServ sets mode: +o Sanqui
23:05:19HP_Archivist quits [Client Quit]
23:06:25sarge (sarge) joins
23:09:21tzt quits [Client Quit]
23:10:04mr_sarge quits [Ping timeout: 265 seconds]
23:10:56tzt (tzt) joins
23:36:56<dvd>ArchiveBot/Educational institutions: is it a list of all (big) educational institutions' public-facing websites, or only ones closing soon?
23:37:34@Sanqui quits [Client Quit]
23:37:55Sanqui joins
23:37:57Sanqui quits [Changing host]
23:37:57Sanqui (Sanqui) joins
23:37:57@ChanServ sets mode: +o Sanqui
23:38:31<@JAA>dvd: It's intended as the former. Pretty sure it's *far* from complete though.
23:42:06sec^nd quits [Ping timeout: 245 seconds]
23:42:42sec^nd (second) joins
23:43:43lunik173 joins