| 00:11:30 | | lunik17 quits [Quit: Ping timeout (120 seconds)] |
| 00:11:35 | | lunik17 joins |
| 00:12:29 | <myself> | maybe we can ask the admin to disable some stuff like hit counters or whatever that's slowing it down.... or just hand us a database dump |
| 00:30:26 | | sec^nd quits [Ping timeout: 245 seconds] |
| 00:30:57 | | sec^nd (second) joins |
| 01:23:19 | | xYantix joins |
| 01:31:15 | | xYantix quits [Remote host closed the connection] |
| 02:27:28 | | Icyelut (Icyelut) joins |
| 02:49:52 | | lunik17 quits [Client Quit] |
| 02:49:54 | | lunik173 joins |
| 02:54:02 | | mr_sarge quits [Read error: Connection reset by peer] |
| 02:56:22 | | mr_sarge (sarge) joins |
| 02:58:33 | | BlueMaxima joins |
| 03:06:02 | | HP_Archivist quits [Read error: Connection reset by peer] |
| 03:07:09 | | eroc1990 quits [Client Quit] |
| 03:37:56 | | sec^nd quits [Ping timeout: 245 seconds] |
| 03:50:58 | | sec^nd (second) joins |
| 04:01:35 | | eroc1990 (eroc1990) joins |
| 04:31:58 | | pabs quits [Client Quit] |
| 04:32:14 | | pabs (pabs) joins |
| 04:40:21 | | Dj-Wawa quits [Remote host closed the connection] |
| 04:40:21 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 04:41:01 | | Dj-Wawa joins |
| 04:41:01 | | Dj-Wawa is now authenticated as Dj-Wawa |
| 04:44:44 | | sonick (sonick) joins |
| 04:50:46 | | eroc1990 quits [Client Quit] |
| 04:51:18 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 04:55:15 | | lennier1 (lennier1) joins |
| 04:58:32 | | dvd_ quits [Ping timeout: 252 seconds] |
| 05:01:41 | <pabs> | https://www.theguardian.com/media/2023/apr/12/npr-leaves-twitter-elon-musk-state-media |
| 05:03:07 | | dumbgoy__ quits [Ping timeout: 252 seconds] |
| 05:04:11 | | sec^nd quits [Ping timeout: 245 seconds] |
| 05:08:52 | | sec^nd (second) joins |
| 05:09:23 | <pabs> | a few have already been added to AB, others are findable with search engines: site:twitter.com -inurl:status npr |
| 05:09:29 | | pabs ENOTIME to do this |
| 05:09:47 | | atphoenix quits [Read error: Connection reset by peer] |
| 05:12:23 | | atphoenix (atphoenix) joins |
| 05:17:04 | | eroc1990 (eroc1990) joins |
| 05:24:20 | | atphoenix quits [Read error: Connection reset by peer] |
| 05:27:27 | | atphoenix (atphoenix) joins |
| 05:31:10 | | DLoader quits [Ping timeout: 252 seconds] |
| 05:32:06 | | sec^nd quits [Ping timeout: 245 seconds] |
| 05:38:54 | | sec^nd (second) joins |
| 05:45:17 | <pabs> | here is a list posted on #archivebot https://transfer.archivete.am/mzdmm/npr-pbs-twitter-accounts.txt |
| 05:48:21 | | sec^nd quits [Ping timeout: 245 seconds] |
| 05:48:27 | | fuzzy8021 quits [Read error: Connection reset by peer] |
| 05:48:56 | | sec^nd (second) joins |
| 05:49:26 | | fuzzy8021 (fuzzy8021) joins |
| 05:54:15 | | user_ quits [Remote host closed the connection] |
| 05:54:28 | | user_ joins |
| 06:18:16 | | Island quits [Read error: Connection reset by peer] |
| 06:23:55 | | hitgrr8 joins |
| 06:25:28 | | BlueMaxima quits [Client Quit] |
| 06:29:02 | | eroc1990 quits [Client Quit] |
| 06:29:22 | | eroc1990 (eroc1990) joins |
| 07:19:52 | | DLoader joins |
| 07:51:14 | | wickedplayer494 quits [Ping timeout: 252 seconds] |
| 07:57:16 | | Arcorann (Arcorann) joins |
| 08:59:04 | | Doomaholic quits [Read error: Connection reset by peer] |
| 08:59:19 | | Doomaholic joins |
| 09:00:26 | | sec^nd quits [Ping timeout: 245 seconds] |
| 09:00:57 | | sec^nd (second) joins |
| 10:07:12 | | sonick quits [Client Quit] |
| 11:17:48 | | Iki1 joins |
| 11:20:58 | | Iki quits [Ping timeout: 252 seconds] |
| 11:25:09 | | eroc1990 quits [Client Quit] |
| 11:33:20 | | eroc1990 (eroc1990) joins |
| 11:58:45 | | Icyelut|2 (Icyelut) joins |
| 12:01:08 | | dumbgoy__ joins |
| 12:03:03 | | Icyelut quits [Ping timeout: 265 seconds] |
| 12:25:17 | | Iki1 quits [Ping timeout: 265 seconds] |
| 12:41:46 | | HP_Archivist (HP_Archivist) joins |
| 13:00:34 | | Iki joins |
| 13:00:46 | | retromouse joins |
| 13:07:30 | | retromouse-2 joins |
| 13:09:38 | | jacksonchen666 (jacksonchen666) joins |
| 13:09:45 | | retromouse-3 joins |
| 13:09:54 | | retromouse-3 quits [Remote host closed the connection] |
| 13:09:54 | | retromouse-2 quits [Client Quit] |
| 13:10:14 | | retromouse quits [Ping timeout: 265 seconds] |
| 13:10:15 | | retromouse-2 joins |
| 13:12:50 | | retromouse-2 is now authenticated as retromouse |
| 13:13:19 | | retromouse-2 is now known as retromouse |
| 13:21:12 | | jacksonchen666 quits [Client Quit] |
| 13:23:37 | | Iki quits [Ping timeout: 252 seconds] |
| 13:31:42 | | ehmry quits [Client Quit] |
| 13:32:06 | | ehmry joins |
| 13:46:03 | | HP_Archivist quits [Client Quit] |
| 14:04:52 | | DiscantX quits [Ping timeout: 252 seconds] |
| 14:08:23 | | DiscantX joins |
| 14:15:23 | <retromouse> | Greetings everyone, |
| 14:15:23 | <retromouse> | I have lately used dokuwiki-dumper and mediawiki-scraper and learned about archive team. |
| 14:15:23 | <retromouse> | I'm looking for a place where hopefully I can learn the tools that make my crawls useful for more people. |
| 14:16:09 | <retromouse> | Does Archive Team have any way to create WARCs from existing static pages you have on your drive? |
| 14:28:34 | | Iki joins |
| 14:29:02 | <retromouse> | If someone can drop me a link to the right tool, even if it is an elaborate process, I would appreciate it |
| 14:33:28 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 14:54:57 | | geezabiscuit (geezabiscuit) joins |
| 15:01:31 | | Arcorann quits [Ping timeout: 252 seconds] |
| 15:06:20 | | dvd joins |
| 15:06:39 | <@arkiver> | retromouse: that likely won't be possible, unless you have the original request and response headers saved and have various metadata |
| 15:08:38 | | jacksonchen666 (jacksonchen666) joins |
| 15:08:49 | <retromouse> | If you need to replay the whole process, I could just deploy the website on a static server so a robot can browse around. |
| 15:09:22 | <retromouse> | So I can deploy the website and make any fixes to it to display correctly from the static files |
| 15:09:37 | <retromouse> | then I can browse the website from localhost |
| 15:09:44 | <retromouse> | would that be enough? |
| 15:12:22 | <retromouse> | What do you say @arkiver? |
| 15:15:28 | <@Sanqui> | no that is not possible, or desirable, because in that case you are "making up" data |
| 15:15:30 | | dvd_ joins |
| 15:15:32 | <@Sanqui> | since you don't have the data necessary for a warc, I have to ask what is your motivation to make a warc? |
| 15:15:39 | <@Sanqui> | your data is still useful the way it is, no need to "soup it up" with fake information |
| 15:16:22 | | dvd quits [Ping timeout: 252 seconds] |
| 15:16:34 | <tech234a> | Google Currents (remainder of Google Plus) shuts down July 5 https://workspaceupdates.googleblog.com/2023/04/new-community-features-for-google-chat-and-an-update-currents%20.html |
| 15:21:36 | <retromouse> | To have it in a standard format for easy distribution, Sanqui. I can of course note the changes I made to make everything browsable without accounts or captchas, etc. |
| 15:22:35 | <retromouse> | I don't want anyone to feel I'm faking anything, I just made the minimum changes to make everything accessible without having the original server |
| 15:26:16 | | Iki quits [Ping timeout: 252 seconds] |
| 15:27:20 | | Iki joins |
| 15:32:42 | | Island joins |
| 15:35:01 | | dvd__ joins |
| 15:35:26 | | Iki1 joins |
| 15:35:51 | | geezabiscuit quits [Client Quit] |
| 15:35:51 | | dvd_ quits [Remote host closed the connection] |
| 15:35:51 | | Iki quits [Remote host closed the connection] |
| 15:36:27 | | geezabiscuit (geezabiscuit) joins |
| 16:04:14 | | user_ quits [Remote host closed the connection] |
| 16:04:27 | | user_ joins |
| 16:13:21 | | jacksonchen666 quits [Ping timeout: 245 seconds] |
| 16:23:41 | | hackbug quits [Remote host closed the connection] |
| 16:27:29 | | hackbug (hackbug) joins |
| 16:29:52 | <retromouse> | So, could someone point me in the right direction? |
| 16:31:23 | <pokechu22> | For mediawiki, the benefit of the dump is that you can import it into a new wiki, and that it's compact for that purpose |
| 16:32:16 | <retromouse> | Well, in this case I'm talking about general pages, not wikis |
| 16:32:27 | <pokechu22> | Warc is a bit of an awkward format - the main benefit is that it works on web.archive.org, but if you're generating it instead of scraping an actual site it probably won't be added there. It's also basically a list of pages and responses, and things like searches won't work on it |
| 16:32:36 | <retromouse> | the wiki dumps done with mediawiki-scraper and dokuwiki-dumper are great |
| 16:33:06 | <retromouse> | But I have crawled sites using my own spider and ended up with a set of static pages |
| 16:33:18 | <@JAA> | Thou shalt never 'create' WARCs from anything but direct interaction with the original server. |
| 16:33:30 | <pokechu22> | Ah, a list of static files you want to browse - if it works as a folder, it's probably easier to just upload a 7z file of that and extract it and then browse that |
| 16:34:26 | <pokechu22> | The big thing WARCs have going over a compressed folder is that they include the original HTTP metadata - which isn't something you can get if you're starting with a folder |
| 16:34:35 | <@JAA> | Static files in a tar, ZIP, or similar is fine, I suppose. |
| 16:36:10 | <retromouse> | Well, if the original server didn't allow you to access all the information, the original WARC is not that useful |
| 16:36:22 | | jacksonchen666 (jacksonchen666) joins |
| 16:36:31 | <retromouse> | I'm looking for a standard way of sharing some sites |
| 16:36:42 | <retromouse> | and WARC I think provides that |
| 16:37:07 | <@JAA> | Yes, WARC is the international standard for that, but again, you can't create that after the fact. |
| 16:37:35 | <retromouse> | I guess I can mount the site on a local server and use a robot to crawl it JAA |
| 16:37:57 | <@JAA> | Yes, that would be a WARC of your localhost or whatever, so it wouldn't be associated with the original site. |
| 16:38:04 | <retromouse> | The question is what tools do you have I can use to create the WARC easily |
| 16:38:26 | <retromouse> | That is fine, can be a warc of a localhost |
| 16:38:32 | <@JAA> | And if you use /etc/hosts trickery etc., we're in faking WARC territory again. |
| 16:39:31 | <retromouse> | I feel you are more worried about me trying to fake something than about helping me achieve the goal of building a standard distribution file |
| 16:39:45 | <retromouse> | when you get over it |
| 16:39:53 | <retromouse> | please point me to the tools you have for it |
| 16:40:00 | <retromouse> | if you want |
| 16:40:39 | <pokechu22> | The main point is that creating a warc won't add anything useful you don't already have and will probably be more annoying for other people to navigate than just a zip of the original files |
| 16:41:27 | <pokechu22> | https://github.com/ArchiveTeam/wpull/ is a tool that can crawl websites to create WARCs, and is used by archivebot; there are probably other tools too |
| 16:42:00 | <joepie91|m> | The question about standard distribution formats was already answered; an archive file such as 7z. The only thing that WARC adds is the ability to replay server responses, which is not relevant when you don't *have* those server responses. |
| 16:42:30 | | ymgve joins |
| 16:43:23 | <retromouse> | Thanks for your explanations <pokechu22> , <joepie91|m> |
| 16:43:35 | <@JAA> | So there's wget, but it produces buggy WARCs that don't play well with other tooling. wget-at is our fork with (among other things) fixes for that. wpull is a more or less compatible reimplementation; version 2.x is buggy, so you'd want to use 1.2.3 probably. All of these can take a list of URLs as input and then retrieve those into WARCs with the right options. |
| 16:43:37 | | jacksonc1 (jacksonchen666) joins |
| 16:43:46 | | jacksonchen666 quits [Ping timeout: 245 seconds] |
| 16:44:12 | <@JAA> | (wpull also has some bugs in the WARC writing, but they're not *as* bad as wget's or warcio's.) |
| 16:44:36 | <retromouse> | That is great, so instead of crawling the site I can give a list with the exact meaningful URLs so the tool doesn't get lost, JAA? |
| 16:44:49 | <@JAA> | Sure |
| 16:48:52 | <retromouse> | Because that is one of my biggest problems with using wget or similar: it just runs through all links recursively. If I could give a list and get a WARC, that would be wonderful |
| 16:50:57 | <retromouse> | JAA you mean this wpull -> https://github.com/ArchiveTeam/wpull |
| 16:53:19 | <retromouse> | From the example of the options it is not obvious to me how to provide the list of URLs |
| 16:54:25 | <retromouse> | it seems to have the typical recursive option, which is what I would rather avoid |
| 16:56:19 | <pokechu22> | You want -i (AKA --input-file) I think |
| 16:56:36 | <retromouse> | Yep! Found on the manual, thanks pokechu22 |
| 16:56:44 | | yts98 joins |
| 16:57:00 | <retromouse> | I think with this I'm set to create warcs from well known urls |
| 16:57:04 | <pokechu22> | and also probably -p (AKA --page-requisites) |
| 16:57:25 | <retromouse> | sure |
| 16:57:53 | | ehmry quits [Ping timeout: 265 seconds] |
| 16:57:56 | | jacksonc1 quits [Ping timeout: 245 seconds] |
| 16:58:07 | <retromouse> | I'm just a bit tired of needing to write my own crawlers. If I can learn to use a good set of tools and crawlers and create WARCs, it would save me time and make things easier to distribute, I think |
| 16:58:24 | | ymgve_ joins |
| 16:58:37 | <retromouse> | when I write my own crawlers I end up with a bunch of static files |
| 17:00:41 | | ehmry joins |
| 17:00:52 | | ymgve quits [Ping timeout: 252 seconds] |
| 17:01:53 | | ymgve__ joins |
| 17:04:01 | <yts98> | The blog service Xuite is going to shut down on August 31. Could somebody launch a warrior-based project similar to Wretch? |
| 17:05:16 | | ymgve_ quits [Ping timeout: 252 seconds] |
| 17:08:43 | <@JAA> | yts98: Please add it to the Deathwatch wiki page so we don't forget. |
| 17:12:08 | | MrRadar_ (MrRadar) joins |
| 17:12:23 | | MrRadar quits [Ping timeout: 265 seconds] |
| 17:12:48 | | jacksonc1 (jacksonchen666) joins |
| 17:18:40 | <yts98> | Okay I just submitted an edit and it's pending review. |
| 17:20:13 | | jacksonc1 quits [Client Quit] |
| 17:27:33 | | ymgve__ is now known as ymgve |
| 17:31:07 | | hitgrr8 quits [Ping timeout: 252 seconds] |
| 17:34:50 | | hitgrr8 joins |
| 17:39:21 | <pokechu22> | https://www.bloomberg.com/news/articles/2023-04-12/pbs-joins-npr-in-quitting-twitter-over-state-backed-designation#xj4y7vzkg?leadSource=reddit_wall - oh boy, more twitter stuff to deal with |
| 17:45:57 | | ymgve_ joins |
| 17:48:43 | | ymgve quits [Ping timeout: 252 seconds] |
| 18:32:25 | | retromouse-2 (retromouse) joins |
| 18:34:55 | | retromouse quits [Ping timeout: 252 seconds] |
| 18:35:26 | | sec^nd quits [Ping timeout: 245 seconds] |
| 18:36:21 | | retromouse-2 quits [Remote host closed the connection] |
| 18:36:27 | | retromouse-3 (retromouse) joins |
| 18:37:14 | | retromouse-3 quits [Client Quit] |
| 18:37:26 | | dvd__ quits [Remote host closed the connection] |
| 18:42:43 | | sec^nd (second) joins |
| 18:43:26 | | Mateon1 quits [Remote host closed the connection] |
| 18:44:35 | | ymgve_ is now known as ymgve |
| 18:56:32 | | Mateon1 joins |
| 19:03:43 | | ehmry quits [Read error: Connection reset by peer] |
| 19:05:20 | | ehmry joins |
| 19:09:17 | | yts98 quits [Remote host closed the connection] |
| 19:10:46 | | dvd joins |
| 19:10:47 | | dvd_ joins |
| 19:11:10 | | dvd_ quits [Remote host closed the connection] |
| 19:24:20 | | ThreeHM quits [Ping timeout: 265 seconds] |
| 19:31:15 | | ThreeHM (ThreeHeadedMonkey) joins |
| 19:44:38 | | Barto quits [Ping timeout: 265 seconds] |
| 19:51:17 | | BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
| 19:53:04 | | Ketchup901 (Ketchup901) joins |
| 20:33:14 | | Barto (Barto) joins |
| 21:08:00 | | tech234a quits [Quit: Connection closed for inactivity] |
| 21:11:17 | <h2ibot> | Yts98 edited Deathwatch (+187): https://wiki.archiveteam.org/?diff=49661&oldid=49660 |
| 21:11:18 | <h2ibot> | Yts98 created Xuite (+2230, Created page with "{{Infobox project | title =…): https://wiki.archiveteam.org/?title=Xuite |
| 21:20:01 | | Ketchup901 quits [Ping timeout: 245 seconds] |
| 21:21:15 | | Ketchup901 (Ketchup901) joins |
| 21:23:39 | <@arkiver> | If anyone has ideas for projects, feel free to always bring them up :) |
| 21:25:25 | <pokechu22> | arkiver: https://wiki.archiveteam.org/index.php/Enjin is closing fairly soon |
| 21:25:32 | <@arkiver> | oh right! |
| 21:28:47 | <@arkiver> | alright time for a bunch of projects |
| 21:30:09 | <pokechu22> | Regarding Enjin there is a bit of jank with their forum software which is exacerbated with ArchiveBot's jank - I should look into my notes from the AB jobs I did and document that properly, but there's e.g. a normally invisible <a href=""> tag on some pages which leads to really dumb behavior |
| 21:47:23 | <h2ibot> | Z.abdain90 edited ArchiveBot/Educational institutions/list (+92, /* Unsorted */ add napata college): https://wiki.archiveteam.org/?diff=49663&oldid=49515 |
| 22:03:46 | | sec^nd quits [Ping timeout: 245 seconds] |
| 22:08:36 | | sec^nd (second) joins |
| 22:16:03 | | BlueMaxima joins |
| 22:18:20 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 22:32:17 | | hitgrr8 quits [Client Quit] |
| 22:39:18 | | @Sanqui quits [Ping timeout: 252 seconds] |
| 22:39:40 | | lunik173 quits [Ping timeout: 252 seconds] |
| 22:43:18 | | HP_Archivist (HP_Archivist) joins |
| 22:52:44 | | onetruth joins |
| 22:59:04 | | Sanqui joins |
| 22:59:06 | | Sanqui is now authenticated as Sanqui |
| 22:59:06 | | Sanqui quits [Changing host] |
| 22:59:06 | | Sanqui (Sanqui) joins |
| 22:59:06 | | @ChanServ sets mode: +o Sanqui |
| 23:05:19 | | HP_Archivist quits [Client Quit] |
| 23:06:25 | | sarge (sarge) joins |
| 23:09:21 | | tzt quits [Client Quit] |
| 23:10:04 | | mr_sarge quits [Ping timeout: 265 seconds] |
| 23:10:56 | | tzt (tzt) joins |
| 23:36:56 | <dvd> | Is ArchiveBot/Educational institutions a list of all (big) educational institutions' public-facing websites, or only of those closing soon? |
| 23:37:34 | | @Sanqui quits [Client Quit] |
| 23:37:55 | | Sanqui joins |
| 23:37:57 | | Sanqui is now authenticated as Sanqui |
| 23:37:57 | | Sanqui quits [Changing host] |
| 23:37:57 | | Sanqui (Sanqui) joins |
| 23:37:57 | | @ChanServ sets mode: +o Sanqui |
| 23:38:31 | <@JAA> | dvd: It's intended as the former. Pretty sure it's *far* from complete though. |
| 23:42:06 | | sec^nd quits [Ping timeout: 245 seconds] |
| 23:42:42 | | sec^nd (second) joins |
| 23:43:43 | | lunik173 joins |
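
Note on the wpull exchange (16:43 to 16:57): the options named there, -i (--input-file) and -p (--page-requisites), can be combined to fetch a fixed list of URLs instead of crawling recursively. The following is a minimal sketch, not a tested recipe from the channel. The --warc-file option, the file name urls.txt, the WARC name, and the localhost URLs are assumptions that do not appear in the log; the helper names are hypothetical. As JAA points out, a WARC made this way records the host that was actually crawled (here localhost), not the original site.

```python
from pathlib import Path


def write_url_list(urls, path):
    """Write one URL per line, the usual layout for an --input-file list."""
    path = Path(path)
    path.write_text("".join(f"{url}\n" for url in urls), encoding="utf-8")
    return path


def build_wpull_command(url_list, warc_name):
    """Build an argv list for a wpull run over a fixed URL list (no recursion)."""
    return [
        "wpull",
        "--input-file", str(url_list),  # -i, as suggested in the log
        "--page-requisites",            # -p, as suggested in the log
        "--warc-file", warc_name,       # assumption: not mentioned in the log
    ]


if __name__ == "__main__":
    # Hypothetical URLs standing in for a locally served copy of the files.
    urls = ["http://localhost:8000/", "http://localhost:8000/about.html"]
    list_path = write_url_list(urls, "urls.txt")
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(build_wpull_command(list_path, "localhost-site")))
```

The script only prints the command line; passing the same list to subprocess.run would execute it, provided wpull is installed (JAA recommends version 1.2.3 over 2.x).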