00:11:30lunik17 quits [Quit: Ping timeout (120 seconds)]
00:11:35lunik17 joins
00:12:29<myself>maybe we can ask the admin to disable some stuff like hit counters or whatever that's slowing it down.... or just hand us a database dump
00:30:26sec^nd quits [Ping timeout: 245 seconds]
00:30:57sec^nd (second) joins
01:23:19xYantix joins
01:31:15xYantix quits [Remote host closed the connection]
02:27:28Icyelut (Icyelut) joins
02:49:52lunik17 quits [Client Quit]
02:49:54lunik173 joins
02:54:02mr_sarge quits [Read error: Connection reset by peer]
02:56:22mr_sarge (sarge) joins
02:58:33BlueMaxima joins
03:06:02HP_Archivist quits [Read error: Connection reset by peer]
03:07:09eroc1990 quits [Client Quit]
03:37:56sec^nd quits [Ping timeout: 245 seconds]
03:50:58sec^nd (second) joins
04:01:35eroc1990 (eroc1990) joins
04:31:58pabs quits [Client Quit]
04:32:14pabs (pabs) joins
04:40:21Dj-Wawa quits [Remote host closed the connection]
04:40:21qwertyasdfuiopghjkl quits [Client Quit]
04:41:01Dj-Wawa joins
04:44:44sonick (sonick) joins
04:50:46eroc1990 quits [Client Quit]
04:51:18qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
04:55:15lennier1 (lennier1) joins
04:58:32dvd_ quits [Ping timeout: 252 seconds]
05:01:41<pabs>https://www.theguardian.com/media/2023/apr/12/npr-leaves-twitter-elon-musk-state-media
05:03:07dumbgoy__ quits [Ping timeout: 252 seconds]
05:04:11sec^nd quits [Ping timeout: 245 seconds]
05:08:52sec^nd (second) joins
05:09:23<pabs>a few have already been added to AB, others are findable with search engines: site:twitter.com -inurl:status npr
05:09:29pabs ENOTIME to do this
05:09:47atphoenix quits [Read error: Connection reset by peer]
05:12:23atphoenix (atphoenix) joins
05:17:04eroc1990 (eroc1990) joins
05:24:20atphoenix quits [Read error: Connection reset by peer]
05:27:27atphoenix (atphoenix) joins
05:31:10DLoader quits [Ping timeout: 252 seconds]
05:32:06sec^nd quits [Ping timeout: 245 seconds]
05:38:54sec^nd (second) joins
05:45:17<pabs>here is a list posted on #archivebot https://transfer.archivete.am/mzdmm/npr-pbs-twitter-accounts.txt
05:48:21sec^nd quits [Ping timeout: 245 seconds]
05:48:27fuzzy8021 quits [Read error: Connection reset by peer]
05:48:56sec^nd (second) joins
05:49:26fuzzy8021 (fuzzy8021) joins
05:54:15user_ quits [Remote host closed the connection]
05:54:28user_ joins
06:18:16Island quits [Read error: Connection reset by peer]
06:23:55hitgrr8 joins
06:25:28BlueMaxima quits [Client Quit]
06:29:02eroc1990 quits [Client Quit]
06:29:22eroc1990 (eroc1990) joins
07:19:52DLoader joins
07:51:14wickedplayer494 quits [Ping timeout: 252 seconds]
07:57:16Arcorann (Arcorann) joins
08:59:04Doomaholic quits [Read error: Connection reset by peer]
08:59:19Doomaholic joins
09:00:26sec^nd quits [Ping timeout: 245 seconds]
09:00:57sec^nd (second) joins
10:07:12sonick quits [Client Quit]
11:17:48Iki1 joins
11:20:58Iki quits [Ping timeout: 252 seconds]
11:25:09eroc1990 quits [Client Quit]
11:33:20eroc1990 (eroc1990) joins
11:58:45Icyelut|2 (Icyelut) joins
12:01:08dumbgoy__ joins
12:03:03Icyelut quits [Ping timeout: 265 seconds]
12:25:17Iki1 quits [Ping timeout: 265 seconds]
12:41:46HP_Archivist (HP_Archivist) joins
13:00:34Iki joins
13:00:46retromouse joins
13:07:30retromouse-2 joins
13:09:38jacksonchen666 (jacksonchen666) joins
13:09:45retromouse-3 joins
13:09:54retromouse-3 quits [Remote host closed the connection]
13:09:54retromouse-2 quits [Client Quit]
13:10:14retromouse quits [Ping timeout: 265 seconds]
13:10:15retromouse-2 joins
13:13:19retromouse-2 is now known as retromouse
13:21:12jacksonchen666 quits [Client Quit]
13:23:37Iki quits [Ping timeout: 252 seconds]
13:31:42ehmry quits [Client Quit]
13:32:06ehmry joins
13:46:03HP_Archivist quits [Client Quit]
14:04:52DiscantX quits [Ping timeout: 252 seconds]
14:08:23DiscantX joins
14:15:23<retromouse>Greetings everyone,
14:15:23<retromouse>I have lately used dokuwiki-dumper and mediawiki-scraper and learned about archive team.
14:15:23<retromouse>I'm looking for a place where hopefully I can learn to use tools that make my crawls useful for more people.
14:16:09<retromouse>Does Archive Team have any way to create WARCs from existing static pages you have on your drive?
14:28:34Iki joins
14:29:02<retromouse>If someone can drop me a link to the right tool, even if it is an elaborate process, I would appreciate it
14:33:28qwertyasdfuiopghjkl quits [Client Quit]
14:54:57geezabiscuit (geezabiscuit) joins
15:01:31Arcorann quits [Ping timeout: 252 seconds]
15:06:20dvd joins
15:06:39<@arkiver>retromouse: that likely won't be possible, unless you have the original request and response headers saved and have various metadata
15:08:38jacksonchen666 (jacksonchen666) joins
15:08:49<retromouse>If you need to replay the whole process, I could just deploy the website on a static server so that a robot can browse around.
15:09:22<retromouse>So I can deploy the website and make any fixes needed for it to display correctly from the static files
15:09:37<retromouse>then I can browse the website from localhost
15:09:44<retromouse>would that be enough?
15:12:22<retromouse>What do you say @arkiver?
15:15:28<@Sanqui>no that is not possible, or desirable, because in that case you are "making up" data
15:15:30dvd_ joins
15:15:32<@Sanqui>since you don't have the data necessary for a warc, I have to ask what is your motivation to make a warc?
15:15:39<@Sanqui>your data is still useful the way it is, no need to "soup it up" with fake information
15:16:22dvd quits [Ping timeout: 252 seconds]
15:16:34<tech234a>Google Currents (remainder of Google Plus) shuts down July 5 https://workspaceupdates.googleblog.com/2023/04/new-community-features-for-google-chat-and-an-update-currents%20.html
15:21:36<retromouse>To have it in a standard format for easy distribution, sanqui. I can of course note the changes I made to make everything browsable without accounts or captchas, etc.
15:22:35<retromouse>I don't want anyone to feel I'm faking anything; I just made the minimum changes to make everything accessible without having the original server
15:26:16Iki quits [Ping timeout: 252 seconds]
15:27:20Iki joins
15:32:42Island joins
15:35:01dvd__ joins
15:35:26Iki1 joins
15:35:51geezabiscuit quits [Client Quit]
15:35:51dvd_ quits [Remote host closed the connection]
15:35:51Iki quits [Remote host closed the connection]
15:36:27geezabiscuit (geezabiscuit) joins
16:04:14user_ quits [Remote host closed the connection]
16:04:27user_ joins
16:13:21jacksonchen666 quits [Ping timeout: 245 seconds]
16:23:41hackbug quits [Remote host closed the connection]
16:27:29hackbug (hackbug) joins
16:29:52<retromouse>So, could someone point me in the right direction?
16:31:23<pokechu22>For mediawiki, the benefit of the dump is that you can import it into a new wiki, and that it's compact for that purpose
16:32:16<retromouse>Well, in this case I'm talking about general pages, not wikis
16:32:27<pokechu22>Warc is a bit of an awkward format - the main benefit is that it works on web.archive.org, but if you're generating it instead of scraping an actual site it probably won't be added there. It's also basically a list of pages and responses, and things like searches won't work on it
16:32:36<retromouse>the wiki dumps done with mediawiki-scraper and dokuwiki-dumper are great
16:33:06<retromouse>But I have crawled sites using my own spider and ended up with a set of static pages
16:33:18<@JAA>Thou shalt never 'create' WARCs from anything but direct interaction with the original server.
16:33:30<pokechu22>Ah, a list of static files you want to browse - if it works as a folder, it's probably easier to just upload a 7z file of that and extract it and then browse that
16:34:26<pokechu22>The big thing WARCs have going over a compressed folder is that they include the original HTTP metadata - which isn't something you can get if you're starting with a folder
16:34:35<@JAA>Static files in a tar, ZIP, or similar is fine, I suppose.
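Illustrating pokechu22's point about the original HTTP metadata: a genuine WARC response record carries the server's status line and headers alongside the body, which is exactly what a folder of saved static files no longer has. A minimal sketch of inspecting that, assuming the third-party warcio library (not part of this conversation) and a hypothetical file name:

    # List each response record's target URI plus the captured HTTP status
    # line and headers; these values came from the live server exchange.
    from warcio.archiveiterator import ArchiveIterator

    with open('example.warc.gz', 'rb') as stream:   # hypothetical WARC file
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                print(record.rec_headers.get_header('WARC-Target-URI'))
                print(record.http_headers.statusline)           # e.g. "200 OK"
                for name, value in record.http_headers.headers:
                    print(f'  {name}: {value}')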
16:36:10<retromouse>Well, if the original server didn't allow you to access all the information, the original WARC isn't that useful
16:36:22jacksonchen666 (jacksonchen666) joins
16:36:31<retromouse>I'm looking for a standard way of sharing some sites
16:36:42<retromouse>and WARC I think provides that
16:37:07<@JAA>Yes, WARC is the international standard for that, but again, you can't create that after the fact.
16:37:35<retromouse>I guess I can mount the site on a local server and use a robot to crawl it JAA
16:37:57<@JAA>Yes, that would be a WARC of your localhost or whatever, so it wouldn't be associated with the original site.
16:38:04<retromouse>The question is what tools you have that I can use to create the WARC easily
16:38:26<retromouse>That is fine, it can be a WARC of localhost
16:38:32<@JAA>And if you use /etc/hosts trickery etc., we're in faking WARC territory again.
16:39:31<retromouse>I feel you are more worried about me trying to fake something than about helping me achieve the goal of building a standard distribution file
16:39:45<retromouse>when you get over it
16:39:53<retromouse>please point me to the tools you have for it
16:40:00<retromouse>if you want
16:40:39<pokechu22>The main point is that creating a warc won't add anything useful you don't already have and will probably be more annoying for other people to navigate than just a zip of the original files
16:41:27<pokechu22>https://github.com/ArchiveTeam/wpull/ is a tool that can crawl websites to create WARCs, and is used by archivebot; there are probably other tools too
16:42:00<joepie91|m>The question about standard distribution formats was already answered; an archive file such as 7z. The only thing that WARC adds is the ability to replay server responses, which is not relevant when you don't *have* those server responses.
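A minimal sketch of the archive-file route suggested above, packing a crawl directory into a gzipped tar with Python's standard tarfile module (tar or ZIP as JAA mentioned; 7z would need an external tool). The paths are hypothetical:

    import tarfile

    # Pack the crawler's output directory into one compressed file for
    # distribution; no HTTP metadata is claimed or implied.
    with tarfile.open('my-crawl.tar.gz', 'w:gz') as tar:
        tar.add('crawl-output/', arcname='crawl-output')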
16:42:30ymgve joins
16:43:23<retromouse>Thanks for your explanations <pokechu22> , <joepie91|m>
16:43:35<@JAA>So there's wget, but it produces buggy WARCs that don't play well with other tooling. wget-at is our fork with (among other things) fixes for that. wpull is a more or less compatible reimplementation; version 2.x is buggy, so you'd want to use 1.2.3 probably. All of these can take a list of URLs as input and then retrieve those into WARCs with the right options.
16:43:37jacksonc1 (jacksonchen666) joins
16:43:46jacksonchen666 quits [Ping timeout: 245 seconds]
16:44:12<@JAA>(wpull also has some bugs in the WARC writing, but they're not *as* bad as wget's or warcio's.)
16:44:36<retromouse>That is great, so instead of crawling the site I can give a list with the exact meaningful URLs so the tool doesn't get lost, JAA?
16:44:49<@JAA>Sure
16:48:52<retromouse>Because that is one of my biggest problems with using wget or similar: it just runs through all links recursively. If I could give a list and get a WARC, that would be wonderful
16:50:57<retromouse>JAA you mean this wpull -> https://github.com/ArchiveTeam/wpull
16:53:19<retromouse>From the example of the options it's not obvious to me how to provide the list of URLs
16:54:25<retromouse>it seems to have the typical recursive option, which is what I would rather avoid
16:56:19<pokechu22>You want -i (AKA --input-file) I think
16:56:36<retromouse>Yep! Found it in the manual, thanks pokechu22
16:56:44yts98 joins
16:57:00<retromouse>I think with this I'm set to create warcs from well known urls
16:57:04<pokechu22>and also probably -p (AKA --page-requisites)
16:57:25<retromouse>sure
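A minimal sketch of the invocation being discussed, driving wpull from Python with the options mentioned above (--input-file and --page-requisites); --warc-file is assumed here from wget's option of the same name and should be checked against the wpull --help output. File names are hypothetical:

    import subprocess

    # Fetch exactly the URLs listed in urls.txt (no recursion), plus each
    # page's embedded resources, and record the captures into a WARC.
    subprocess.run([
        'wpull',
        '--input-file', 'urls.txt',   # one URL per line
        '--page-requisites',          # also fetch images, CSS, JS per page
        '--warc-file', 'my-site',     # typically produces my-site.warc.gz
    ], check=True)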
16:57:53ehmry quits [Ping timeout: 265 seconds]
16:57:56jacksonc1 quits [Ping timeout: 245 seconds]
16:58:07<retromouse>I'm just a bit tired of needing to write my own crawlers; if I can learn to use a good set of tools and crawlers and create WARCs, that would save me time and make things easier to distribute, I think
16:58:24ymgve_ joins
16:58:37<retromouse>when I write my own crawlers I end up with a bunch of static files
17:00:41ehmry joins
17:00:52ymgve quits [Ping timeout: 252 seconds]
17:01:53ymgve__ joins
17:04:01<yts98>The blog service Xuite is going to shut down on August 31. Could somebody launch a warrior-based project similar to Wretch?
17:05:16ymgve_ quits [Ping timeout: 252 seconds]
17:08:43<@JAA>yts98: Please add it to the Deathwatch wiki page so we don't forget.
17:12:08MrRadar_ (MrRadar) joins
17:12:23MrRadar quits [Ping timeout: 265 seconds]
17:12:48jacksonc1 (jacksonchen666) joins
17:18:40<yts98>Okay I just submitted an edit and it's pending review.
17:20:13jacksonc1 quits [Client Quit]
17:27:33ymgve__ is now known as ymgve
17:31:07hitgrr8 quits [Ping timeout: 252 seconds]
17:34:50hitgrr8 joins
17:39:21<pokechu22>https://www.bloomberg.com/news/articles/2023-04-12/pbs-joins-npr-in-quitting-twitter-over-state-backed-designation#xj4y7vzkg?leadSource=reddit_wall - oh boy, more twitter stuff to deal with
17:45:57ymgve_ joins
17:48:43ymgve quits [Ping timeout: 252 seconds]
18:32:25retromouse-2 (retromouse) joins
18:34:55retromouse quits [Ping timeout: 252 seconds]
18:35:26sec^nd quits [Ping timeout: 245 seconds]
18:36:21retromouse-2 quits [Remote host closed the connection]
18:36:27retromouse-3 (retromouse) joins
18:37:14retromouse-3 quits [Client Quit]
18:37:26dvd__ quits [Remote host closed the connection]
18:42:43sec^nd (second) joins
18:43:26Mateon1 quits [Remote host closed the connection]
18:44:35ymgve_ is now known as ymgve
18:56:32Mateon1 joins
19:03:43ehmry quits [Read error: Connection reset by peer]
19:05:20ehmry joins
19:09:17yts98 quits [Remote host closed the connection]
19:10:46dvd joins
19:10:47dvd_ joins
19:11:10dvd_ quits [Remote host closed the connection]
19:24:20ThreeHM quits [Ping timeout: 265 seconds]
19:31:15ThreeHM (ThreeHeadedMonkey) joins
19:44:38Barto quits [Ping timeout: 265 seconds]
19:51:17BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
19:53:04Ketchup901 (Ketchup901) joins
20:33:14Barto (Barto) joins
21:08:00tech234a quits [Quit: Connection closed for inactivity]
21:11:17<h2ibot>Yts98 edited Deathwatch (+187): https://wiki.archiveteam.org/?diff=49661&oldid=49660
21:11:18<h2ibot>Yts98 created Xuite (+2230, Created page with "{{Infobox project | title =…): https://wiki.archiveteam.org/?title=Xuite
21:20:01Ketchup901 quits [Ping timeout: 245 seconds]
21:21:15Ketchup901 (Ketchup901) joins
21:23:39<@arkiver>If anyone has ideas for projects, feel free to bring them up anytime :)
21:25:25<pokechu22>arkiver: https://wiki.archiveteam.org/index.php/Enjin is closing fairly soon
21:25:32<@arkiver>oh right!
21:28:47<@arkiver>alright time for a bunch of projects
21:30:09<pokechu22>Regarding Enjin, there is a bit of jank with their forum software which is exacerbated by ArchiveBot's jank - I should look into my notes from the AB jobs I did and document that properly, but there's e.g. a normally invisible <a href=""> tag on some pages which leads to really dumb behavior
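For context on the invisible <a href="">: by standard URL resolution an empty relative reference resolves back to the page it appears on, which is one way such a tag can send a crawler in circles; what exactly happens between Enjin and ArchiveBot isn't spelled out here. A small sketch with a hypothetical URL:

    from urllib.parse import urljoin

    # An empty href resolves to the current page itself, so a naive crawler
    # may queue the page it just fetched as if it were a new link.
    page = 'https://example.com/forum/viewthread/123'
    print(urljoin(page, ''))   # -> https://example.com/forum/viewthread/123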
21:47:23<h2ibot>Z.abdain90 edited ArchiveBot/Educational institutions/list (+92, /* Unsorted */ add napata college): https://wiki.archiveteam.org/?diff=49663&oldid=49515
22:03:46sec^nd quits [Ping timeout: 245 seconds]
22:08:36sec^nd (second) joins
22:16:03BlueMaxima joins
22:18:20BlueMaxima quits [Read error: Connection reset by peer]
22:32:17hitgrr8 quits [Client Quit]
22:39:18@Sanqui quits [Ping timeout: 252 seconds]
22:39:40lunik173 quits [Ping timeout: 252 seconds]
22:43:18HP_Archivist (HP_Archivist) joins
22:52:44onetruth joins
22:59:04Sanqui joins
22:59:06Sanqui quits [Changing host]
22:59:06Sanqui (Sanqui) joins
22:59:06@ChanServ sets mode: +o Sanqui
23:05:19HP_Archivist quits [Client Quit]
23:06:25sarge (sarge) joins
23:09:21tzt quits [Client Quit]
23:10:04mr_sarge quits [Ping timeout: 265 seconds]
23:10:56tzt (tzt) joins
23:36:56<dvd>ArchiveBot/Educational institutions: is it a list of all (big) educational institutions' public-facing websites, or only ones closing soon?
23:37:34@Sanqui quits [Client Quit]
23:37:55Sanqui joins
23:37:57Sanqui quits [Changing host]
23:37:57Sanqui (Sanqui) joins
23:37:57@ChanServ sets mode: +o Sanqui
23:38:31<@JAA>dvd: It's intended as the former. Pretty sure it's *far* from complete though.
23:42:06sec^nd quits [Ping timeout: 245 seconds]
23:42:42sec^nd (second) joins
23:43:43lunik173 joins