| 00:43:38 | <mgrandi> | @fionera: is the data there but the headers are bad? |
| 00:44:07 | <fionera> | I don't know if the zip is readable but I can't read the directory index |
| 00:44:14 | <fionera> | In zip it's at the end |
| 00:44:46 | | BlueMaxima_ joins |
| 00:48:38 | | BlueMaxima quits [Ping timeout: 250 seconds] |
| 01:05:56 | | Ajay11 is now known as Ajay |
| 01:18:16 | | Hyenadae joins |
| 01:18:53 | | Hyenadae quits [Remote host closed the connection] |
| 01:19:27 | | Hyenadae joins |
| 01:31:38 | | AlsoHP_Archivist joins |
| 01:34:52 | | HP_Archivist quits [Ping timeout: 258 seconds] |
| 01:41:00 | | AlsoHP_Archivist quits [Ping timeout: 258 seconds] |
| 01:41:50 | | AlsoHP_Archivist joins |
| 01:47:22 | | Mineroboter_ joins |
| 01:49:49 | | Mineroboter quits [Ping timeout: 258 seconds] |
| 02:15:53 | | AlsoHP_Archivist quits [Ping timeout: 258 seconds] |
| 02:16:41 | | AlsoHP_Archivist joins |
| 02:23:50 | | AlsoHP_Archivist quits [Client Quit] |
| 02:24:18 | | HP_Archivist (HP_Archivist) joins |
| 02:29:59 | | sliccricc (sliccricc) joins |
| 02:38:22 | | nathan joins |
| 03:06:54 | | DogsRNice quits [Read error: Connection reset by peer] |
| 03:13:14 | | pcr leaves |
| 03:13:20 | | pcr joins |
| 03:21:24 | <@OrIdow6> | arkiver: I see you wrote it; is it fine with you if I look into fixing up sourceforge-grab? (Within a scale of weeks, not right now) |
| 03:22:42 | <@arkiver> | OrIdow6: thats some old code :P |
| 03:23:03 | <@OrIdow6> | Yeah |
| 03:23:12 | <@arkiver> | but yeah sure, though rewrite here may be best |
| 03:23:33 | <@arkiver> | cant promise we can get a copy of all of sourceforge though, if it's 100s of TBs |
| 03:25:20 | <@arkiver> | i sourceforge underwent some changes, so it may not be useful at all |
| 03:25:23 | <@arkiver> | i think* |
| 03:26:11 | <@OrIdow6> | That's fine, I think a subset of each project would serve well enough |
| 03:26:24 | <@OrIdow6> | Yeah, I tried to run it, and it recursed to Facebook and got stuck |
| 03:26:49 | <@OrIdow6> | (Through a share link IIRC) |
| 03:27:15 | <@arkiver> | yeah i'd just start over, basing it on a recent project |
| 03:27:51 | <@arkiver> | well ping me in case of anything :) |
| 03:27:58 | <@OrIdow6> | Alright |
| 03:28:46 | <@JAA> | #sourceforget |
| 03:38:09 | | qw3rty_ joins |
| 03:41:45 | | qw3rty__ quits [Ping timeout: 258 seconds] |
| 03:58:19 | <mgrandi> | Is sourceforge in danger? Or just preemptive |
| 03:59:14 | <atphoenix> | it has signs of decline |
| 03:59:57 | <atphoenix> | like a tree with cracks, broken branches, dead limbs, but still some leaves |
| 04:02:15 | <Ajay> | it was bought by the same company that bought slashdot |
| 04:02:24 | <Ajay> | not sure how they survive buying declining websites |
| 04:07:15 | | etnguyen03 quits [Client Quit] |
| 04:10:00 | | lennier1 quits [Client Quit] |
| 04:10:12 | | lennier1 (lennier1) joins |
| 04:14:31 | | Stilett0 joins |
| 04:18:22 | | Stiletto quits [Ping timeout: 250 seconds] |
| 04:29:25 | <purplebot> | Twitch.tv edited by Pokechu22 (+1, /* Twitch Chat Downloader */ fix …) 1 minute ago -- https://www.archiveteam.org/?diff=46496&oldid=46482 |
| 04:50:12 | | NIC007a83 joins |
| 04:50:57 | | Stiletto joins |
| 04:53:49 | | Stilett0 quits [Ping timeout: 258 seconds] |
| 05:13:51 | | Stilett0 joins |
| 05:17:12 | | Stiletto quits [Ping timeout: 258 seconds] |
| 06:00:23 | <Wayward> | SF also has a permanently tarnished reputation for adware bundling in windows binaries (using a wrapper around the official .exe and same icon). |
| 06:01:10 | <Wayward> | it truly amazes me that open source projects still use SF |
| 06:04:22 | <Wayward> | "If Windows users can't be bothered to compile their own source code, then they deserve to get their passwords and credit card information stolen by malicious adware that can't be uninstalled." -- SourceForge, maybe. |
| 06:11:47 | <Wayward> | Apparently GIMP had even removed themselves from SF, and so SF created their own GIMP profile and added it back. |
| 06:31:32 | | qc joins |
| 06:38:12 | <atphoenix> | I seem to remember that bad move. Similar to how ImgBurn and others had adware in some installers. Not everyone saw the installers prompt to install the junk, as that stuff usually downloaded during install, meaning that disconnecting from the Internet, or having restricted network firewalls and/or HOSTS lists could cause those "reach out and download crap" steps to fail. |
| 06:41:57 | | wessel1512 joins |
| 06:42:21 | <Wayward> | atphoenix: re > "ImgBurn had adware." Here's probably why... http://web.archive.org/web/20140409025848/https://sourceforge.net/projects/imgburn/ |
| 06:42:59 | <Wayward> | a very very short lived existence on SF. |
| 06:47:30 | <Wayward> | but SF weren't the only ones. If I remember rightly, there was also c|net and download.com pumping users with PUPs |
| 06:48:00 | <atphoenix> | it was an 'in thing' for a while |
| 06:48:02 | <Wayward> | It's just that SF also sees itself as a "microsoft free alternative to github." |
| 06:48:45 | <Wayward> | I could never imagine FOSS projects like Brave Browser on SF. But also there was a similar row on FOSSHub and PUPware |
| 06:49:24 | <atphoenix> | SF of 2001 =/= SF of 2021 |
| 06:49:52 | <Wayward> | SF of 2014~2016 was the "period of pure evil" |
| 06:50:16 | <Wayward> | and new management iirc. |
| 07:03:46 | <tech234a> | FYI there seems to be some kind of issue with caching of redirect pages on the Wiki: for example if you search and go directly to a page that redirects to another page, you will get an outdated version of the target page. For example: https://wiki.archiveteam.org/index.php/Yahoo_Answers (no exclamation point) https://wiki.archiveteam.org/index.php/Warrior (as opposed to ArchiveTeam_Warrior) etc. |
| 07:04:05 | <tech234a> | Perhaps a bug in LiteSpeedCache? |
| 07:08:58 | <tech234a> | These outdated cached pages can also be reached when searching |
| 07:12:50 | <tech234a> | Probably need to clear the cache for pages that redirect to a given page when an article is saved: https://github.com/litespeedtech/lscache_mediawiki/blob/master/LiteSpeedCache/LiteSpeedCache_body.php#L117-L127 |
| 07:19:42 | | hooway joins |
| 07:32:34 | | Mineroboter_ quits [Client Quit] |
| 07:34:44 | | Mineroboter joins |
| 08:15:58 | | Arcorann (Arcorann) joins |
| 08:26:34 | | Arcorann quits [Remote host closed the connection] |
| 08:26:53 | | Arcorann joins |
| 08:26:53 | | Arcorann is now authenticated as Arcorann |
| 08:30:20 | | Arcorann_ joins |
| 08:32:44 | | Arcorann quits [Ping timeout: 250 seconds] |
| 08:34:01 | | Arcorann__ joins |
| 08:34:15 | <Ryz> | Heya folks, I need help whether http://furani.jp/kunio/ can be found from http://furani.jp/ - or else I would have to archive both of the sections separately in AB |
| 08:36:12 | | Arcorann_ quits [Ping timeout: 250 seconds] |
| 08:37:34 | | Arcorann_ joins |
| 08:39:49 | <Wayward> | weee no line breaks in page source |
| 08:40:06 | | Arcorann__ quits [Ping timeout: 250 seconds] |
| 08:42:04 | | Arcorann_ quits [Read error: Connection reset by peer] |
| 08:42:26 | <Wayward> | "Toy Box" header link has a link to kunio/index.html with the image images/toybox/kuniomania-preview.jpg |
| 08:42:27 | | Arcorann_ joins |
| 08:43:12 | <Wayward> | first/left item in the second section down. |
| 08:44:25 | <Ryz> | Thank you Wayward, I appreciate it~ |
| 08:44:37 | <Wayward> | you tricked me! i thought this furani site would have furry animation art. |
| 08:45:48 | <Ryz> | Huh o.o; |
| 08:46:17 | | Arcorann__ joins |
| 08:46:22 | <Ryz> | I've been going through my backlog of links I wanted to send it to archiving on AB, I'm doing a kinda decent job getting through but there's some time consuming ones like that one above~ |
| 08:50:20 | | Arcorann_ quits [Ping timeout: 258 seconds] |
| 09:14:24 | | qc quits [Remote host closed the connection] |
| 09:24:49 | | Arcorann (Arcorann) joins |
| 09:28:17 | | Arcorann__ quits [Ping timeout: 258 seconds] |
| 09:29:47 | <Ryz> | Huh, spending the previous 30 minutes and ongoing on typing up something regarding WBM and AB that I'll post here later (since it's not as active right now) oo; |
| 09:41:19 | | BlueMaxima_ quits [Client Quit] |
| 09:44:16 | <Sanqui> | Ryz: I would subscribe to your newsletter |
| 10:14:42 | | billy549 quits [Quit: ZNC - https://znc.in] |
| 10:23:38 | | billy549 (Billy549) joins |
| 10:35:14 | | Barto quits [Client Quit] |
| 10:35:20 | | Barto (Barto) joins |
| 10:35:21 | | Barto quits [Client Quit] |
| 10:35:25 | | Barto (Barto) joins |
| 12:56:26 | | NIC007a83 quits [Ping timeout: 258 seconds] |
| 12:59:40 | | Wayward quits [Ping timeout: 250 seconds] |
| 13:07:27 | | etnguyen03 (etnguyen03) joins |
| 14:35:27 | | DopefishJustin quits [Remote host closed the connection] |
| 14:36:01 | | DopefishJustin joins |
| 14:36:01 | | DopefishJustin is now authenticated as DopefishJustin |
| 14:39:19 | | bananapotato joins |
| 15:22:10 | <@JAA> | jrwr: There's a cache bug on the wiki, see tech234a's messages above at 07:03 ff. |
| 15:27:32 | | AlsoIDK joins |
| 15:27:50 | | IDK joins |
| 15:39:26 | | NIC007a83 joins |
| 16:05:11 | | dm4v (dm4v) joins |
| 16:07:42 | | Eighty (Eighty) joins |
| 16:16:48 | | IDK quits [Remote host closed the connection] |
| 16:25:46 | | Arcorann quits [Remote host closed the connection] |
| 16:26:04 | | Arcorann (Arcorann) joins |
| 16:49:26 | <purplebot> | Twitch.tv edited by Mgrandi (+94, added AOC to VOD expiry exception …) just now -- https://www.archiveteam.org/?diff=46497&oldid=46496 |
| 16:49:28 | | IDK joins |
| 16:50:26 | <purplebot> | Super Mario Maker Bookmark edited by Znak (+166, /* Archives */ Link tachyo's 2018 …) just now -- https://www.archiveteam.org/?diff=46498&oldid=46470 |
| 16:59:25 | <purplebot> | Yahoo! Answers edited by Ajay (+264, Add info for 2021) just now -- https://www.archiveteam.org/?diff=46501&oldid=46490 |
| 17:13:11 | | IDK quits [Remote host closed the connection] |
| 17:15:25 | <purplebot> | Super Mario Maker Bookmark edited by JustAnotherArchivist (+36), JustAnotherArchivist (+0) 24 minutes ago -- https://www.archiveteam.org/?diff=46500&oldid=46498 |
| 17:18:48 | | Stilett0 quits [Ping timeout: 250 seconds] |
| 17:21:10 | <Ryz> | Heya folks, I'm not sure if this is worth posting here in #archiveteam-bs or #internetarchive - but here's my question and my thought process: Is there a way to have an ArchiveBot-like equivalent job running within the Wayback Machine? Like as a bot or specialized tool? |
| 17:21:14 | <Ryz> | Incoming large bricks of text! |
| 17:21:21 | <Ryz> | Lemme explain, I have http://www.spiv.cz/ in my backlog of links I wanna archive into AB since 2019 November and poking around it, as there were a bunch of image entry pages that rendered the same image that made me curious if it's just that or there's something more. It turns out that there are images or content that are currently hidden because of how those links output their stuff |
| 17:21:28 | <Ryz> | For example, currently something like http://www.spiv.cz/mn2.php?id0=050625fireman&id1=2 is broken since that link along with a bunch of the other links just fetch the same exact image as http://www.spiv.cz/mn2.php?id1=0 |
| 17:21:41 | <Ryz> | Whereas https://web.archive.org/web/20070220125030/http://www.spiv.cz/mn2.php?id0=050625fireman&id1=2 there's actually a different image (and I went there, the earliest copy of that link, WBM managed to fetch a link of that image when I went in that link, note the one capture https://web.archive.org/web/20210406085215/http://www.spiv.cz/050625fireman.jpg because of my action visiting there) |
| 17:21:51 | <Ryz> | So currently, if I do an AB job on http://www.spiv.cz/ as-is, it won't grab the images that were previously linked, because those links just don't point to the actual images anymore. So what if there could be an archive run with a WBM page as its starting point, done recursively within WBM? |
| 17:22:00 | <Ryz> | Like, I would try to point a specific timestamp and have it roam through - something like https://web.archive.org/web/20110905054758/http://www.spiv.cz/ - since according to https://web.archive.org/web/20111101143021/http://www.spiv.cz/mn2.php?id1=0 - the last image that was uploaded when checking the number '100520' translates to 2010 May 20 |
| 17:22:08 | <Ryz> | Alternatively you may have a bot just explore through all the links in chronological order from going through something like https://web.archive.org/web/*/http://www.spiv.cz/* and ideally target a specific date, |
| 17:22:22 | <Ryz> | Or do it hardcore and check through every iteration of those multiple saved pages and see if there were any links that weren't found in other versions/revisions, something that AB can't do since it can only archive what's currently live (this is the main reason I feel a bot or some kind of specialized tool should exist!) |
| 17:22:33 | <Ryz> | Of course, a notable limitation is that it can't tell if there was supposed to be something that used to exist if, say, it stumbles on a new link that's never been archived in WBM before; so basically virtually lost unless there's a lot of brute-forcing to find the right link |
| 17:22:39 | <Ryz> | Has something like this been asked before? |
| 17:22:56 | <Ryz> | Sanqui ^ - since you wanted to read my 'newspaper' :p |
| 17:26:34 | | Daloader_ joins |
| 17:29:18 | | NIC007a83 quits [Read error: Connection reset by peer] |
| 17:29:38 | | Stiletto joins |
| 17:39:01 | | Arcorann_ joins |
| 17:41:16 | <@hook54321> | Not sure if i understand or not. Do you want to construct a WARC from data crawled from the WBM? I'm pretty sure there's tools to do that, but I wouldn't use them. Or do you want to use pages in the WBM to discover new URLs to archive? |
| 17:42:47 | | Arcorann quits [Ping timeout: 258 seconds] |
| 17:51:55 | <@OrIdow6> | The second, I think |
| 17:54:52 | <@OrIdow6> | I don't think it would be too hard to do something like that (if I'm reading correctly) - download all pages under a domain or prefix in the WBM, extract links (or images, etc.) from them, deduplicate those links against what's already in the WBM, output the unique ones - is that basically right, Ryz? |
| 17:55:35 | | LeGoupil joins |
| 17:55:49 | <Ryz> | Most likely the latter, since again, I can't archive http://www.spiv.cz/ in AB because those images are currently hidden whereas there were there before checking in WBM |
| 17:56:29 | <Ryz> | Since I can't do something like https://web.archive.org/web/20110905054758/http://www.spiv.cz/ - since there's links that won't match this targeted URL at all |
| 17:56:38 | <Ryz> | The timestamp in particular |
| 17:57:44 | | thuban joins |
| 18:03:51 | <@OrIdow6> | With the list I meant one after the other, not as alternatives |
| 18:09:12 | <@hook54321> | you could use something like https://github.com/hartator/wayback-machine-downloader and then grep for what you're looking for |
| 18:14:43 | | YazofArc joins |
| 18:23:59 | | sliccricc_ (sliccricc) joins |
| 18:25:43 | | sliccricc quits [Ping timeout: 258 seconds] |
| 18:29:44 | <Ryz> | Does it have the option to list down or create a list of links that haven't been crawled under a targeted URL? Rather confused as I look onto the link, since I don't use Python in general |
| 18:30:40 | <Ryz> | Like, ideally I would want it as a bot to do it, but this kind of request is really specialized, like once in a blue moon; the only way I can think of extending the use is maybe for websites that changed a lot and has information somewhere at one point that still exists but it's just hidden |
| 18:37:15 | <YazofArc> | So I have a question before I get into archiving as a warrior. I see that the VM client has a hard limit of like 60gb, but is there a way I can lower that? |
| 18:37:57 | <YazofArc> | Also does the program delete the data after it is uploaded. |
| 18:39:41 | <@OrIdow6> | #warrior |
| 18:39:51 | <@OrIdow6> | But in short, yes unless something goes wrong, and I don't know |
| 19:00:04 | | lennier1 quits [Client Quit] |
| 19:00:57 | | lennier1 (lennier1) joins |
| 19:15:56 | | Arcorann_ quits [Ping timeout: 258 seconds] |
| 19:23:04 | | nepeat is now authenticated as nepeat |
| 19:26:32 | <@OrIdow6> | Ryz: So I did try to do something like that https://gist.github.com/OrIdow6/1b82c175809e910070fdf0f8066856c3 https://gist.github.com/OrIdow6/9fa3f07524eab84a4db19136c9413d60 |
| 19:32:42 | <Ryz> | For the latter link OrIdow6, I'm trying to see if there's even a single image or link that hasn't been archived in WBM at all |
| 19:33:13 | <Ryz> | Is there a way to filter out links that have been archived versus those that haven't at all? |
| 19:34:43 | <atphoenix> | I see 3 parts here. 1.) figuring out the likely names of the images (maybe by extracting from WBM) 2.) generating the correct URLs for those images. 3.) sending the URL list to AB via !ao < list.txt |
| 19:35:29 | <@OrIdow6> | The problem was that I didn't normalize them aggressively enough |
| 19:35:38 | <@OrIdow6> | Mostly in filtering out the www. |
| 19:36:24 | <atphoenix> | so URL with id0=050625fireman -> 050625fireman.jpg as that is the pattern you used in working example. |
| 19:36:30 | <Ryz> | Eeeeeh, I wouldn't necessarily just filter out the 'www', I encountered a domain where the 'www' and non-'www' versions of the website are considered different domains Z: |
| 19:38:46 | <thuban> | i'm cleaning up the wiki page for the warrior a bit, since some of the troubleshooting suggestions are outdated. does anyone object to my moving the docker instructions below the toc now that the vm is fixed? |
| 19:40:32 | <@OrIdow6> | Oh, and because I was collapsing by hash from the CDX server, non-www and www pages of the same thing would get collapsed together, which meant that one of the two wasn't added to the list of already-captured URLs |
| 19:40:59 | <atphoenix> | I don't think Docker instructions need to be at top anymore of warrior page. |
| 19:41:14 | <atphoenix> | they were up there because warrior was out of service mostly |
| 19:41:41 | <atphoenix> | (I think you're only talking about the link to the Docker instructions) |
| 19:42:13 | <atphoenix> | as the docker instructions themselves had their own page that I put a lot of details on |
| 19:42:14 | | HP_Archivist quits [Ping timeout: 250 seconds] |
| 19:42:15 | <Ryz> | OrIdow6, here's an example I got back in 2020 October: http://www.mcom.com/ and http://mcom.com/ |
| 19:42:15 | <thuban> | (no, there's a very long section of docker instructions on the warrior page itself, above the faq) |
| 19:42:33 | <atphoenix> | looking |
| 19:43:08 | <thuban> | (oh, it seems to be a copy of the "basic usage" section from that page? not sure who put that in) |
| 19:46:02 | <atphoenix> | seems someone copied the docker section from https://wiki.archiveteam.org/index.php/Running_Archive_Team_Projects_with_Docker |
| 19:46:16 | <atphoenix> | https://wiki.archiveteam.org/index.php?title=ArchiveTeam_Warrior&action=history |
| 19:47:03 | <atphoenix> | I'm not sure those docker details are relevant to running the warrior in general |
| 19:47:22 | <thuban> | tech234a: i think it's time to remove that again; y/n/q? |
| 19:48:56 | <atphoenix> | I'd vote for keeping the Warrior article focused on normal warrior usage, and the Docker article on docker usage. Even though Warrior uses docker internally, it's not really meant for interactive use (in my understanding). |
| 19:49:58 | <@OrIdow6> | Ryz: Well, I think this site has them as identical? Otherwise there's a lot of duplication |
| 19:50:12 | <@OrIdow6> | But if you want me to make this thing more flexible, that can be an option |
| 19:51:37 | <Ryz> | OrIdow6, it'll definitely be an obscure option on top of said solution, just in case something really exceptionally exceptional like the one I stumbled upon happens~ |
| 19:53:38 | | DogsRNice (Webuser299) joins |
| 19:54:38 | <Ryz> | Also OrIdow6, instead of maybe filtering out www or non-www, there would be another option just to combo the two and print it out at www or non-www (or those unfortunate numbered www's like www2 or something) |
| 19:56:32 | <@hook54321> | Ajay: it won't let me approve your edit because there's no changes |
| 19:59:49 | <Ajay> | Ah darn, I tried to remove a new line |
| 20:00:09 | <@OrIdow6> | Ryz: For that I think you'd have to specify the domain to turn into www |
| 20:00:47 | <@OrIdow6> | Because else if you had images.website.com, it would try to turn that into www.images.website.com |
| 20:01:45 | <Ryz> | Oh, welp, that's interesting and more complicated; maybe just follow under the targeted link? So that it wouldn't convert the other ones that aren't under the targeted link into www or non-www |
| 20:03:56 | <@OrIdow6> | I suppose that would work |
| 20:05:23 | <Ryz> | Mm, honestly, I'm a bit miffed that when archiving websites, I have to keep in mind whether it's supposed to be www or non-www...and even then, there's certain links that rendered them as the opposite |
| 20:05:51 | <Ryz> | Like in a primarily www website, there's one lonely non-www link in there S: |
| 20:14:38 | | x9fff00 joins |
| 20:16:32 | <atphoenix> | Ryz, that problem is why I wish "groups of domains" could be given to AB together. I've used HTTrack to do groups of related domains that were not consistent in their www/non-www usage. |
| 20:17:16 | <atphoenix> | intended use-case is for closely related domains and subdomains that should all be considered part of one job |
| 20:17:34 | <tech234a> | thuban: I recently added the Docker instructions to that page to provide an alternate method for setting up the Warrior |
| 20:18:16 | <tech234a> | While it was based on the instructions for running a general project, some things were adjusted for the Warrior image specifically |
| 20:18:25 | <atphoenix> | Re: Warrior: is the docker option considered to be a warrior? |
| 20:18:33 | <thuban> | so |
| 20:18:52 | <thuban> | the docker instructions on the 'running projects with docker' page are for running _projects_ with docker |
| 20:19:01 | <thuban> | not for running the warrior docker image |
| 20:19:05 | <tech234a> | Yes |
| 20:19:13 | <tech234a> | Individual project images |
| 20:19:37 | <tech234a> | The Warrior VM is now just a wrapper for Warrior Docker image |
| 20:19:43 | <thuban> | it's my understanding that watchtower, etc is totally redundant if you're actually running the warrior docker image |
| 20:20:06 | <tech234a> | Not entirely... I think dependencies aren’t automatically updated? |
| 20:20:17 | <atphoenix> | I don't consider docker workers to be 'The Warrior'. I consider the VMs to be 'The Warrior' |
| 20:21:00 | <atphoenix> | The VMs = self-contained preconfigured environment. In contrast to running docker yourself, you need to setup docker on some host OS |
| 20:21:19 | <tech234a> | True... but the result is identical |
| 20:21:33 | <atphoenix> | results sure, but user experience is different |
| 20:21:41 | <thuban> | you also need to set up a virtualization application on your host os, so i think that's a distinction without a difference... |
| 20:21:41 | <tech234a> | Is there a way to create collapsible sections on the wiki? |
| 20:21:58 | <tech234a> | Perhaps we can collapse the command explanations to make it shorter |
| 20:22:52 | <tech234a> | When I was updating the Warrior instructions recently, I was even considering putting Docker first before the VM |
| 20:23:31 | <thuban> | anyway, what i planned to do is copy the instructions for running the _warrior_ dockerfile from the github readme, add material (where appropriate) under the various faq/troubleshooting sections for docker (just like we have material for both virtualbox and vmware), and remove the material copied from 'running projects with docker' entirely, since it is not in fact related to the warrior (but retain a link and a mention that you can run scripts in docker containers just like you can run them directly) |
| 20:24:19 | <atphoenix> | Docker has less overhead. VM is more compartmentalized. Virtualbox VM is easy to get running on Windows host. If the user is running Linux, well then, maybe the docker option is just as easy. |
| 20:24:39 | <tech234a> | It's also fairly easy to install Docker desktop nowadays |
| 20:24:44 | <thuban> | but can someone weigh in on the watchtower/dependency issue? ^ seems like an important thing to know about |
| 20:24:45 | <@JAA> | The warrior is simultaneously the VM but also the part of seesaw that runs that web interface which lets you switch between projects etc. |
| 20:24:52 | <@JAA> | Just to clarify the terminology. |
| 20:26:08 | <atphoenix> | Warrior VM doesn't require *any* command line interaction = lower barrier of entry |
| 20:26:34 | <tech234a> | Perhaps we could write a small program to do the required Docker commands? |
| 20:27:03 | <tech234a> | Starting/stopping/deleting/viewing logs/getting a shell of containers can be done from Docker desktop |
| 20:27:38 | <tech234a> | Also restarting containers |
| 20:27:53 | <tech234a> | and viewing stats/opening the browser to the web interface |
| 20:28:05 | <atphoenix> | I can suspend my Virtualbox VMs, reboot my host, and resume my VMs. |
| 20:28:17 | <tech234a> | True, that is a limitation of Docker |
| 20:28:41 | <tech234a> | thuban: I thought I added Docker stuff to the FAQ recently; is anything missing? |
| 20:29:12 | <tech234a> | (also I commented out a few very outdated FAQ questions, perhaps we could remove them from the page source altogether) |
| 20:30:49 | <thuban> | tech234a: some stuff, yes (and some is inaccurate) |
| 20:31:31 | <tech234a> | thuban: just to make sure you're viewing the current version of the page (and not a cached old version): the new version mentions VM version 3.2 at the top of the page (as opposed to being unable to run wget-at) |
| 20:31:45 | <thuban> | yes, i'm aware |
| 20:31:51 | <tech234a> | just checking :) |
| 20:33:17 | <atphoenix> | what old FAQ questions are you referring to? |
| 20:33:32 | <Hyenadae> | Does it work on raspberry pi? |
| 20:33:58 | <Hyenadae> | Was curious if there's ARM 32bit & 64bit 'warrior' stuff |
| 20:34:23 | <atphoenix> | <!--=== How can I run the warrior without a virtual machine? (The VM has too much overhead for a VPS!) === |
| 20:34:24 | <atphoenix> | ? Not sure why that would be irrelevant |
| 20:34:29 | <thuban> | Hyenadae: yes https://github.com/ArchiveTeam/warrior-dockerfile#raspberry-pi |
| 20:34:58 | <Hyenadae> | Oh nice, I didn't see much mention of it on the wiki (at a glance). Nice |
| 20:35:01 | <tech234a> | atphoenix: I think it had outdated Docker information(?) |
| 20:35:14 | <atphoenix> | I haven't seen this problem myself: <!--=== I just imported the ova image and the warrior is stuck on "Preparing the data partition". === |
| 20:37:30 | <tech234a> | I think that was for Warrior 2 |
| 20:38:52 | <tech234a> | I copied "Recovering from a ungraceful virtual machine/Docker container stop" over from the running projects with Docker page; the path in the command might be inaccurate for the Warrior image |
| 20:40:08 | <tech234a> | Also: it might be possible to limit disk space usage on Docker, but it seemed complicated. If anyone knows something about that, please feel free to add it. |
| 20:43:46 | | Daloader_ quits [Ping timeout: 250 seconds] |
| 20:53:26 | <purplebot> | Running Archive Team Projects with Docker edited by Tech234a (+88, Add explanation of Watchtower image …) just now -- https://www.archiveteam.org/?diff=46502&oldid=46487 |
| 20:54:26 | <purplebot> | ArchiveTeam Warrior edited by Tech234a (+404, Make Docker command explanations …) just now -- https://www.archiveteam.org/?diff=46503&oldid=46488 |
| 20:55:33 | <tech234a> | thuban, atphoenix: This should make the Docker section a little easier to read, especially since the users don't need to change anything in the commands. (FYI if the Wiki ever gets a mobile theme the collapsible section might not work on that theme) |
| 20:56:02 | <thuban> | argh, i was in the middle of a big edit |
| 20:56:39 | <tech234a> | Oh sorry... my only changes were wrap the instruction sections with a small block of code: |
| 20:57:24 | <tech234a> | and added one line in the command explanation for Watchtower |
| 20:57:54 | <tech234a> | Perhaps just integrate the changes from the diff into your draft? https://www.archiveteam.org/?diff=46503&oldid=46488 |
| 20:58:22 | <thuban> | i see, thx. this suggests that you *do* need watchtower when using the docker container (but not when using the vm)... i'm still hoping someone can confirm or deny that |
| 20:59:01 | <tech234a> | (when I updated the VM to the new Warrior image I also configured it to use Watchtower) |
| 20:59:16 | <thuban> | (aha) |
| 20:59:59 | <tech234a> | Updates are rare but do still occasionally happen: https://atdrone.meo.ws/ArchiveTeam/warrior-dockerfile |
| 21:01:13 | <tech234a> | chfoo probably knows the most about this though |
| 21:28:16 | | LeGoupil quits [Client Quit] |
| 21:30:10 | | DogsRNice quits [Read error: Connection reset by peer] |
| 21:30:32 | | DogsRNice (Webuser299) joins |
| 21:33:59 | <thuban> | >The minimum concurrency value is 1, the default concurrency value is 3, and maximum recommended concurrency value is 5, and the maximum allowed concurrency value is 20. |
| 21:34:31 | <thuban> | web interface still says 6 is the max (although i'm not sure whether it enforces that properly) |
| 21:34:40 | <thuban> | is this a frontend/backend disagreement? |
| 21:39:17 | <tech234a> | not sure... I know that when running projects individually I could do up to 20... perhaps the Warrior has a different limit? |
| 21:39:42 | <tech234a> | code is here btw: https://github.com/ArchiveTeam/seesaw-kit |
| 21:41:38 | | hnds joins |
| 21:42:00 | | hooway quits [Client Quit] |
| 21:44:21 | | LeighR (LeighR) joins |
| 21:45:01 | <@OrIdow6> | Ryz: It picks up on "URLs" like "http://spiv.cz/mn4.php?id0=031124czert&id1=CZERT&id2=14&id3=I FOUND COLORS :)". Should those be discarded, or can AB handle them? |
| 21:45:14 | | hnds quits [Remote host closed the connection] |
| 21:45:25 | | @Fusl quits [Max SendQ exceeded] |
| 21:45:45 | | Fusl (Fusl) joins |
| 21:45:45 | | @ChanServ sets mode: +o Fusl |
| 21:45:53 | <@OrIdow6> | Anyhow extraction is very poor on this, I am just extracting href's from a's and src's from img's |
| 21:46:32 | <Ryz> | Ah, not necessarily using maybe wpull or something similar; and I think AB will handle them if they're actually 404s~ o.o; |
| 21:50:47 | <tech234a> | thuban: Concurrency limit is 6 in web UI https://github.com/ArchiveTeam/seesaw-kit/blob/development/seesaw/warrior.py#L206-L214. 20 when running project manually https://github.com/ArchiveTeam/seesaw-kit/blob/development/seesaw/warrior.py#L206-L214 |
| 21:51:28 | | jut quits [Quit: Ping timeout (120 seconds)] |
| 21:51:46 | | jut (jut) joins |
| 21:52:02 | <thuban> | gotcha |
| 21:52:56 | <thuban> | er, that's the same url twice there, did you mean to link something else? |
| 21:53:02 | <tech234a> | oops |
| 21:53:12 | <tech234a> | https://github.com/ArchiveTeam/seesaw-kit/blob/development/seesaw/script/run_pipeline.py#L56-L74 |
| 21:53:27 | <thuban> | thx |
| 21:53:43 | <tech234a> | It seems that if you manually edit the configuration file/use Docker environment variable the limit of 6 is not enforced |
| 21:54:34 | <tech234a> | not fully sure though |
| 21:56:40 | <thuban> | does `docker run` work without a preceding `docker pull`? docker wiki page implies yes but warrior-dockerfile readme implies no |
| 21:57:01 | <LeighR> | thuban: yes. |
| 21:57:07 | <thuban> | thanks |
| 21:57:14 | <tech234a> | Yeah it pulls if needed |
| 21:57:41 | <LeighR> | docker pull is mostly to make sure you have a valid image present before going on in a script |
| 21:58:08 | <LeighR> | or if you want to do some inspection of the image before running |
| 21:58:25 | <tech234a> | thuban: see also https://hackint.logs.kiska.pw/archiveteam/20210204 |
| 21:58:47 | <tech234a> | (the VM has been fixed since then) |
| 21:58:58 | <thuban> | (was about to ask, lol) |
| 22:01:30 | | frontier_mycologist joins |
| 22:02:30 | | frontier_mycologist leaves |
| 22:09:18 | <@OrIdow6> | Ryz: Here's the new list https://gist.github.com/OrIdow6/25739114f03e8edf562fcf3d908dad82 - www's stripped out during comparison and output |
| 22:09:27 | | HP_Archivist (HP_Archivist) joins |
| 22:10:25 | <@OrIdow6> | As you can see, it's mostly weird stuff, but there are a few good ones |
| 22:16:13 | <Ryz> | Huh, OrIdow6, is this just a sample of what the tool would grab? Or like recursive? |
| 22:16:50 | <@OrIdow6> | Ryz: This should be everything it would get under *.spiv.cz |
| 22:17:09 | | @EggplantN quits [Quit: Ping timeout (120 seconds)] |
| 22:17:17 | <@OrIdow6> | I used the CDX server to find all WBM captures for that domain, and downloaded all of those |
| 22:17:32 | | EggplantN joins |
| 22:20:36 | <@OrIdow6> | Well, "gets", not "would get", as this is the actual script output |
| 22:21:00 | <@OrIdow6> | It doesn't need to be recursive, because the CDX server lets you list all the (non-recent) WBM captures for a domain or prefix |
| 22:24:41 | | LeighR leaves |
| 22:25:21 | <Ryz> | OrIdow6, I'm trying to figure out if it's grabbing links that don't have copies in WBM before; there's https://web.archive.org/web/20210406180527/http://spiv.cz/050529mechbug.gif - which there's one singular copy, but I'm not sure if it's because of your script coming to the page that links which makes WBM fetch the still alive image |
| 22:26:03 | <@OrIdow6> | Ryz: That wasn't the script, that was me manually going to a page that had that image a few hours ago in a web browser |
| 22:26:21 | <@OrIdow6> | The script shouldn't trigger SPN |
| 22:26:38 | <Ryz> | Oh~ |
| 22:27:25 | <Ryz> | The reason why I say recursive is like I would want the bot to explore the new links within the WBM copy and find new content, not just get it from the CDX server listings |
| 22:28:38 | <@OrIdow6> | Recurse onto the live site? |
| 22:28:43 | <Ryz> | Yeah~ |
| 22:28:56 | <@OrIdow6> | Oh |
| 22:29:45 | <@OrIdow6> | I suppose that's possible |
| 22:31:17 | | Omni78 quits [Ping timeout: 244 seconds] |
| 22:31:22 | <Ryz> | There was an earlier example that I did (which at the time I did that manually), where there were images or links that were hidden currently but still exist, and I had to fetch those images/links through WBM manually (this is way before I got into AT) |
| 22:34:35 | <Ryz> | This is why I wanna do a recursive, because of even the remote possibility of the WBM version of a page being the only one or couple of pages that have a link that's just currently hidden, think of it as a treasure trove OrIdow6 |
| 22:36:51 | <@OrIdow6> | I think I see |
| 22:37:20 | <@OrIdow6> | There's nothing impossible about it, it's just that dealing with arbitrary sites online requires a lot more caution than only dealing with the WBM |
| 22:38:07 | <@OrIdow6> | In that you have to set timeouts, be prepared for them to send invalid headers or have bad certs, etc. |
| 22:38:40 | <Ryz> | To give another example, if say I archive http://www.afn.org/~twagner/ - it wouldn't stumble upon http://www.afn.org/~twagner/Breitungen/ (which I archived earlier) on its own, and the only way at the time was to have it run a specialized tool I suggested (but it would unfortunately work only if there was an existing copy in WBM) |
| 22:39:14 | <Ryz> | Unfortunately that's the true limitation, if it's not found and archived in WBM, it just doesn't exist to us :c |
| 22:40:07 | <Ryz> | Hardcore mode would be trying to find new links from each and every version of the archived page |
| 22:40:28 | <@OrIdow6> | I see, that is something that would be nice to do |
| 22:40:48 | <@OrIdow6> | Right now it sort of does a compromise, where it deduplicates by hashes of pages |
| 22:41:05 | <@OrIdow6> | So it does fetch all versions of pages, but only one copy per version |
| 22:41:33 | <@OrIdow6> | Ideally I think the solution would be to make AB more flexible, where you could have a seed list or something like that, but obviously that's beyond the scope of this |
| 22:43:50 | <Ryz> | There is always the talk of wanting AB to be improved and make stuff easier~ |
| 22:44:30 | <Ryz> | You know, since ArchiveTeam links, the Wayback Machine links are ignored globally, I'm surprised those links weren't just converted back to non-WBM links |
| 22:45:36 | <Ryz> | *You know, since ArchiveBot has the Wayback Machine links |
| 22:50:43 | <Ryz> | Or alternatively, have ArchiveBot act on the WBM links it stumbles upon as if they're the targeted links it's recursing through, while trying to get the live version too |
| 22:54:08 | <Ryz> | Makes me wonder if someone has mentioned about my main idea before in the past... |
| 22:57:07 | | EggplantN quits [Ping timeout: 258 seconds] |
| 22:58:00 | <@OrIdow6> | I don't know about that specifically, but I do think that "AB should be more flexible" is a perennial topic |
| 22:58:33 | <@OrIdow6> | At least as the superset of all those "it would be nice if AB would do this", etc. things |
| 23:00:42 | <Ryz> | What's holding AB back from being worked on more? |
| 23:02:14 | | bananapotato quits [Client Quit] |
| 23:14:03 | <@OrIdow6> | Don't know, not my area |
| 23:21:19 | <@JAA> | Time, mostly. |
| 23:29:10 | <thuban> | ok, one last question for the wiki page: we have one old question that's pretty chill about people turning off the warrior (since items will just get retried), and one newer one that's not at all chill (asks people to leave the warrior turned off and report it in irc, even if it was a deliberate hard-stop, since there might not be time to retry items). |
| 23:29:43 | <@arkiver> | so whats the question? |
| 23:30:03 | | Arcorann_ joins |
| 23:30:12 | | BlueMaxima joins |
| 23:31:20 | <thuban> | what line do we take? 'never hard-stop/always report', 'ok to hard-stop', 'use your judgment and report failures / avoid hard-stops only if items are huge'? |
| 23:45:20 | | @Fusl quits [Excess Flood] |
| 23:45:39 | | Fusl (Fusl) joins |
| 23:45:39 | | @ChanServ sets mode: +o Fusl |
| 23:50:47 | | Arcorann_ quits [Ping timeout: 258 seconds] |
| 23:53:28 | | Sylirana quits [Read error: Connection reset by peer] |
| 23:54:15 | | Sylirana (Sylirana) joins |
| 23:56:55 | <mgrandi> | DPoS project items don't get retried manually, which I think is an improvement, like reporting back to the server if an error occurred or the process is exiting |
| 23:57:35 | <mgrandi> | But I think they are usually small enough to be retried, so I don't see why we would need to do anything there |
| 23:58:54 | <Ryz> | Uhhhhhhh, did something happen to Photobucket? It appears it has quietly transformed into just another host for Shutterstock now |
| 23:59:58 | <Ryz> | There's no access to the accounts individually anymore... |