00:43:38<mgrandi>@fionera: is the data there but the headers are bad?
00:44:07<fionera>I don't know if the zip is readable but I can't read the directory index
00:44:14<fionera>In zip it's at the end
00:44:46BlueMaxima_ joins
00:48:38BlueMaxima quits [Ping timeout: 250 seconds]
01:05:56Ajay11 is now known as Ajay
01:18:16Hyenadae joins
01:18:53Hyenadae quits [Remote host closed the connection]
01:19:27Hyenadae joins
01:31:38AlsoHP_Archivist joins
01:34:52HP_Archivist quits [Ping timeout: 258 seconds]
01:41:00AlsoHP_Archivist quits [Ping timeout: 258 seconds]
01:41:50AlsoHP_Archivist joins
01:47:22Mineroboter_ joins
01:49:49Mineroboter quits [Ping timeout: 258 seconds]
02:15:53AlsoHP_Archivist quits [Ping timeout: 258 seconds]
02:16:41AlsoHP_Archivist joins
02:23:50AlsoHP_Archivist quits [Client Quit]
02:24:18HP_Archivist (HP_Archivist) joins
02:29:59sliccricc (sliccricc) joins
02:38:22nathan joins
03:06:54DogsRNice quits [Read error: Connection reset by peer]
03:13:14pcr leaves
03:13:20pcr joins
03:21:24<@OrIdow6>arkiver: I see you wrote it; is it fine with you if I look into fixing up sourceforge-grab? (Within a scale of weeks, not right now)
03:22:42<@arkiver>OrIdow6: thats some old code :P
03:23:03<@OrIdow6>Yeah
03:23:12<@arkiver>but yeah sure, though rewrite here may be best
03:23:33<@arkiver>cant promise we can get a copy of all of sourceforge though, if it's 100s of TBs
03:25:20<@arkiver>i think sourceforge underwent some changes, so it may not be useful at all
03:26:11<@OrIdow6>That's fine, I think a subset of each project would serve well enough
03:26:24<@OrIdow6>Yeah, I tried to run it, and it recursed to Facebook and got stuck
03:26:49<@OrIdow6>(Through a share link IIRC)
03:27:15<@arkiver>yeah i'd just start over, basing it on a recent project
03:27:51<@arkiver>well ping me in case of anything :)
03:27:58<@OrIdow6>Alright
03:28:46<@JAA>#sourceforget
03:38:09qw3rty_ joins
03:41:45qw3rty__ quits [Ping timeout: 258 seconds]
03:58:19<mgrandi>Is sourceforge in danger? Or just preemptive
03:59:14<atphoenix>it has signs of decline
03:59:57<atphoenix>like a tree with cracks, broken branches, dead limbs, but still some leaves
04:02:15<Ajay>it was bought by the same company that bought slashdot
04:02:24<Ajay>not sure how they survive buying declining websites
04:07:15etnguyen03 quits [Client Quit]
04:10:00lennier1 quits [Client Quit]
04:10:12lennier1 (lennier1) joins
04:14:31Stilett0 joins
04:18:22Stiletto quits [Ping timeout: 250 seconds]
04:29:25<purplebot>Twitch.tv edited by Pokechu22 (+1, /* Twitch Chat Downloader */ fix …) 1 minute ago -- https://www.archiveteam.org/?diff=46496&oldid=46482
04:50:12NIC007a83 joins
04:50:57Stiletto joins
04:53:49Stilett0 quits [Ping timeout: 258 seconds]
05:13:51Stilett0 joins
05:17:12Stiletto quits [Ping timeout: 258 seconds]
06:00:23<Wayward>SF also has a permanently tarnished reputation for adware bundling in windows binaries (using a wrapper around the official .exe and same icon).
06:01:10<Wayward>it truly amazes me that open source projects still use SF
06:04:22<Wayward>"If Windows users can't be bothered to compile their own source code, then they deserve to get their passwords and credit card information stolen by malicious adware that can't be uninstalled." -- SourceForge, maybe.
06:11:47<Wayward>Apparently GIMP had even removed themselves from SF, and so SF created their own GIMP profile and added it back.
06:31:32qc joins
06:38:12<atphoenix>I seem to remember that bad move. Similar to how ImgBurn and others had adware in some installers. Not everyone saw the installers prompt to install the junk, as that stuff usually downloaded during install, meaning that disconnecting from the Internet, or having restricted network firewalls and/or HOSTS lists could cause those "reach out and download crap" steps to fail.
06:41:57wessel1512 joins
06:42:21<Wayward>atphoenix: re > "ImgBurn had adware." Here's probably why... http://web.archive.org/web/20140409025848/https://sourceforge.net/projects/imgburn/
06:42:59<Wayward>a very very short lived existence on SF.
06:47:30<Wayward>but SF weren't the only ones. If I remember rightly, there was also c|net and download.com pumping users with PUPs
06:48:00<atphoenix>it was an 'in thing' for a while
06:48:02<Wayward>It's just that SF also sees itself as a "microsoft free alternative to github."
06:48:45<Wayward>I could never imagine FOSS projects like Brave Browser on SF. But also there was a similar row on FOSSHub and PUPware
06:49:24<atphoenix>SF of 2001 =/= SF of 2021
06:49:52<Wayward>SF of 2014~2016 was the "period of pure evil"
06:50:16<Wayward>and new management iirc.
07:03:46<tech234a>FYI there seems to be some kind of issue with caching of redirect pages on the Wiki: for example if you search and go directly to a page that redirects to another page, you will get an outdated version of the target page. For example: https://wiki.archiveteam.org/index.php/Yahoo_Answers (no exclamation point) https://wiki.archiveteam.org/index.php/Warrior (as opposed to ArchiveTeam_Warrior) etc.
07:04:05<tech234a>Perhaps a bug in LiteSpeedCache?
07:08:58<tech234a>These outdated cached pages can also be reached when searching
07:12:50<tech234a>Probably need to clear the cache for pages that redirect to a given page when an article is saved: https://github.com/litespeedtech/lscache_mediawiki/blob/master/LiteSpeedCache/LiteSpeedCache_body.php#L117-L127
07:19:42hooway joins
07:32:34Mineroboter_ quits [Client Quit]
07:34:44Mineroboter joins
08:15:58Arcorann (Arcorann) joins
08:26:34Arcorann quits [Remote host closed the connection]
08:26:53Arcorann joins
08:30:20Arcorann_ joins
08:32:44Arcorann quits [Ping timeout: 250 seconds]
08:34:01Arcorann__ joins
08:34:15<Ryz>Heya folks, I need help figuring out whether http://furani.jp/kunio/ can be reached from http://furani.jp/ - or else I would have to archive both of the sections separately in AB
08:36:12Arcorann_ quits [Ping timeout: 250 seconds]
08:37:34Arcorann_ joins
08:39:49<Wayward>weee no line breaks in page source
08:40:06Arcorann__ quits [Ping timeout: 250 seconds]
08:42:04Arcorann_ quits [Read error: Connection reset by peer]
08:42:26<Wayward>"Toy Box" header link has a link to kunio/index.html with the image images/toybox/kuniomania-preview.jpg
08:42:27Arcorann_ joins
08:43:12<Wayward>first/left item in the second section down.
08:44:25<Ryz>Thank you Wayward, I appreciate it~
08:44:37<Wayward>you tricked me! i thought this furani site would have furry animation art.
08:45:48<Ryz>Huh o.o;
08:46:17Arcorann__ joins
08:46:22<Ryz>I've been going through my backlog of links I wanted to send it to archiving on AB, I'm doing a kinda decent job getting through but there's some time consuming ones like that one above~
08:50:20Arcorann_ quits [Ping timeout: 258 seconds]
09:14:24qc quits [Remote host closed the connection]
09:24:49Arcorann (Arcorann) joins
09:28:17Arcorann__ quits [Ping timeout: 258 seconds]
09:29:47<Ryz>Huh, spending the previous 30 minutes and ongoing on typing up something regarding WBM and AB that I'll post here later (since it's not as active right now) oo;
09:41:19BlueMaxima_ quits [Client Quit]
09:44:16<Sanqui>Ryz: I would subscribe to your newsletter
10:14:42billy549 quits [Quit: ZNC - https://znc.in]
10:23:38billy549 (Billy549) joins
10:35:14Barto quits [Client Quit]
10:35:20Barto (Barto) joins
10:35:21Barto quits [Client Quit]
10:35:25Barto (Barto) joins
12:56:26NIC007a83 quits [Ping timeout: 258 seconds]
12:59:40Wayward quits [Ping timeout: 250 seconds]
13:07:27etnguyen03 (etnguyen03) joins
14:35:27DopefishJustin quits [Remote host closed the connection]
14:36:01DopefishJustin joins
14:39:19bananapotato joins
15:22:10<@JAA>jrwr: There's a cache bug on the wiki, see tech234a's messages above at 07:03 ff.
15:27:32AlsoIDK joins
15:27:50IDK joins
15:39:26NIC007a83 joins
16:05:11dm4v (dm4v) joins
16:07:42Eighty (Eighty) joins
16:16:48IDK quits [Remote host closed the connection]
16:25:46Arcorann quits [Remote host closed the connection]
16:26:04Arcorann (Arcorann) joins
16:49:26<purplebot>Twitch.tv edited by Mgrandi (+94, added AOC to VOD expiry exception …) just now -- https://www.archiveteam.org/?diff=46497&oldid=46496
16:49:28IDK joins
16:50:26<purplebot>Super Mario Maker Bookmark edited by Znak (+166, /* Archives */ Link tachyo's 2018 …) just now -- https://www.archiveteam.org/?diff=46498&oldid=46470
16:59:25<purplebot>Yahoo! Answers edited by Ajay (+264, Add info for 2021) just now -- https://www.archiveteam.org/?diff=46501&oldid=46490
17:13:11IDK quits [Remote host closed the connection]
17:15:25<purplebot>Super Mario Maker Bookmark edited by JustAnotherArchivist (+36), JustAnotherArchivist (+0) 24 minutes ago -- https://www.archiveteam.org/?diff=46500&oldid=46498
17:18:48Stilett0 quits [Ping timeout: 250 seconds]
17:21:10<Ryz>Heya folks, I'm not sure if this is worth posting here in #archiveteam-bs or in #internetarchive - but here's my question and my thought process: Is there a way to have an ArchiveBot-like job running within the Wayback Machine? Like as a bot or specialized tool?
17:21:14<Ryz>Incoming large bricks of text!
17:21:21<Ryz>Lemme explain, I have http://www.spiv.cz/ in my backlog of links I wanna archive into AB since 2019 November and poking around it, as there were a bunch of image entry pages that rendered the same image that made me curious if it's just that or there's something more. It turns out that there are images or content that are currently hidden because of how those links output their stuff
17:21:28<Ryz>For example, currently something like http://www.spiv.cz/mn2.php?id0=050625fireman&id1=2 is broken since that link along with a bunch of the other links just fetch the same exact image as http://www.spiv.cz/mn2.php?id1=0
17:21:41<Ryz>Whereas https://web.archive.org/web/20070220125030/http://www.spiv.cz/mn2.php?id0=050625fireman&id1=2 there's actually a different image (and I went there, the earliest copy of that link, WBM managed to fetch a link of that image when I went in that link, note the one capture https://web.archive.org/web/20210406085215/http://www.spiv.cz/050625fireman.jpg because of my action visiting there)
17:21:51<Ryz>So currently, if I do an AB job on http://www.spiv.cz/ as-is, it won't grab the images that were previously linked, because those pages just don't link to the actual images anymore. So what if there could be an archive run that uses a WBM page as its starting point and recurses within WBM?
17:22:00<Ryz>Like, I would try to point a specific timestamp and have it roam through - something like https://web.archive.org/web/20110905054758/http://www.spiv.cz/ - since according to https://web.archive.org/web/20111101143021/http://www.spiv.cz/mn2.php?id1=0 - the last image that was uploaded when checking the number '100520' translates to 2010 May 20
17:22:08<Ryz>Alternatively you may have a bot just explore through all the links in chronological order from going through something like https://web.archive.org/web/*/http://www.spiv.cz/* and ideally target a specific date,
17:22:22<Ryz>Or do it hardcore and check through every iteration of those multiple saved pages and see if there were any links that weren't found in other versions/revisions, something that AB can't do since it can only archive what's currently live (this is the main reason I feel a bot or some kind of specialized tool should exist!)
17:22:33<Ryz>Of course, a notable limitation is that it can't tell if there was supposed to be something that used to exist: if, say, it stumbles on a new link that's never been archived in WBM before, that content is virtually lost unless there's a lot of brute-forcing to find the right link
17:22:39<Ryz>Has something like this been asked before?
17:22:56<Ryz>Sanqui ^ - since you wanted to read my 'newspaper' :p
17:26:34Daloader_ joins
17:29:18NIC007a83 quits [Read error: Connection reset by peer]
17:29:38Stiletto joins
17:39:01Arcorann_ joins
17:41:16<@hook54321>Not sure if i understand or not. Do you want to construct a WARC from data crawled from the WBM? I'm pretty sure there's tools to do that, but I wouldn't use them. Or do you want to use pages in the WBM to discover new URLs to archive?
17:42:47Arcorann quits [Ping timeout: 258 seconds]
17:51:55<@OrIdow6>The second, I think
17:54:52<@OrIdow6>I don't think it would be too hard to do something like that (if I'm reading correctly) - download all pages under a domain or prefix in the WBM, extract links (or images, etc.) from them, deduplicate those links against what's already in the WBM, output the unique ones - is that basically right, Ryz?
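A minimal Python sketch of the steps OrIdow6 lists above: extract links and image sources from pages already fetched from the WBM, then keep only those not in the set of already-captured URLs. The names LinkExtractor and uncaptured_links are hypothetical, fetching is left out entirely, and this is not anyone's actual script from the channel.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links against the page's original URL.
                self.links.append(urljoin(self.base_url, value))


def uncaptured_links(pages, already_captured):
    """Return links found in pages that are not in already_captured.

    pages: iterable of (original_url, html) pairs, assumed to have been
           downloaded from the WBM beforehand.
    already_captured: set of URLs the WBM is known to hold.
    The result keeps first-seen order and contains no duplicates.
    """
    seen = set()
    result = []
    for base_url, html in pages:
        parser = LinkExtractor(base_url)
        parser.feed(html)
        for link in parser.links:
            if link not in already_captured and link not in seen:
                seen.add(link)
                result.append(link)
    return result
```

Comparison here is by exact string; any normalization (for example www vs non-www hosts) would have to be applied to both the extracted links and the captured set before calling uncaptured_links.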
17:55:35LeGoupil joins
17:55:49<Ryz>Most likely the latter, since again, I can't archive http://www.spiv.cz/ in AB because those images are currently hidden, whereas they were there before according to WBM
17:56:29<Ryz>Since I can't do something like https://web.archive.org/web/20110905054758/http://www.spiv.cz/ - since there's links that won't match this targeted URL at all
17:56:38<Ryz>The timestamp in particular
17:57:44thuban joins
18:03:51<@OrIdow6>With the list I meant one after the other, not as alternatives
18:09:12<@hook54321>you could use something like https://github.com/hartator/wayback-machine-downloader and then grep for what you're looking for
18:14:43YazofArc joins
18:23:59sliccricc_ (sliccricc) joins
18:25:43sliccricc quits [Ping timeout: 258 seconds]
18:29:44<Ryz>Does it have the option to list down or create a list of links that haven't been crawled under a targeted URL? Rather confused as I look onto the link, since I don't use Python in general
18:30:40<Ryz>Like, ideally I would want it as a bot to do it, but this kind of request is really specialized, like once in a blue moon; the only way I can think of extending the use is maybe for websites that changed a lot and has information somewhere at one point that still exists but it's just hidden
18:37:15<YazofArc>So I have a question before I get into archiving as a warrior. I see that the VM client has a hard limit of like 60gb, but is there a way I can lower that?
18:37:57<YazofArc>Also does the program delete the data after it is uploaded.
18:39:41<@OrIdow6>#warrior
18:39:51<@OrIdow6>But in short, yes unless something goes wrong, and I don't know
19:00:04lennier1 quits [Client Quit]
19:00:57lennier1 (lennier1) joins
19:15:56Arcorann_ quits [Ping timeout: 258 seconds]
19:26:32<@OrIdow6>Ryz: So I did try to do something like that https://gist.github.com/OrIdow6/1b82c175809e910070fdf0f8066856c3 https://gist.github.com/OrIdow6/9fa3f07524eab84a4db19136c9413d60
19:32:42<Ryz>For the latter link OrIdow6, I'm trying to see if there's even a single image or link that hasn't been archived in WBM at all
19:33:13<Ryz>Is there a way to filter out links that have been archived versus those that haven't at all?
19:34:43<atphoenix>I see 3 parts here. 1.) figuring out the likely names of the images (maybe by extracting from WBM) 2.) generating the correct URLs for those images. 3.) sending the URL list to AB via !ao < list.txt
19:35:29<@OrIdow6>The problem was that I didn't normalize them aggressively enough
19:35:38<@OrIdow6>Mostly in filtering out the www.
19:36:24<atphoenix>so URL with id0=050625fireman -> 050625fireman.jpg as that is the pattern you used in working example.
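A small Python sketch of the id0 -> image URL pattern atphoenix describes. It assumes the image sits at the site root and is named after the id0 value; the function name guess_image_url is hypothetical, and the extension is a parameter because .jpg is only what the one working example showed.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def guess_image_url(page_url, extension=".jpg"):
    """Guess a direct image URL from a page URL that carries an id0 parameter.

    Assumption (not confirmed for every page): the image is stored at the
    site root as <id0><extension>. Returns None when id0 is missing.
    """
    parts = urlsplit(page_url)
    ids = parse_qs(parts.query).get("id0")
    if not ids:
        return None
    return urlunsplit((parts.scheme, parts.netloc, "/" + ids[0] + extension, "", ""))


if __name__ == "__main__":
    print(guess_image_url("http://www.spiv.cz/mn2.php?id0=050625fireman&id1=2"))
```

The guessed URLs could then be written one per line to a list file for the !ao < list.txt step mentioned above; whether each guess actually exists still has to be checked against the live site.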
19:36:30<Ryz>Eeeeeh, I wouldn't necessarily just filter out the 'www', I encountered a domain where the 'www' and non-'www' versions of the website are considered different domains Z:
19:38:46<thuban>i'm cleaning up the wiki page for the warrior a bit, since some of the troubleshooting suggestions are outdated. does anyone object to my moving the docker instructions below the toc now that the vm is fixed?
19:40:32<@OrIdow6>Oh, and because I was collapsing by hash from the CDX server, non-www and www pages of the same thing would get collapsed together, which meant that one of the two wasn't added to the list of already-captured URLs
19:40:59<atphoenix>I don't think Docker instructions need to be at top anymore of warrior page.
19:41:14<atphoenix>they were up there because warrior was out of service mostly
19:41:41<atphoenix>(I think you're only talking about the link to the Docker instructions)
19:42:13<atphoenix>as the docker instructions themselves had their own page that I put a lot of details on
19:42:14HP_Archivist quits [Ping timeout: 250 seconds]
19:42:15<Ryz>OrIdow6, here's an example I got back in 2020 October: http://www.mcom.com/ and http://mcom.com/
19:42:15<thuban>(no, there's a very long section of docker instructions on the warrior page itself, above the faq)
19:42:33<atphoenix>looking
19:43:08<thuban>(oh, it seems to be a copy of the "basic usage" section from that page? not sure who put that in)
19:46:02<atphoenix>seems someone copied the docker section from https://wiki.archiveteam.org/index.php/Running_Archive_Team_Projects_with_Docker
19:46:16<atphoenix>https://wiki.archiveteam.org/index.php?title=ArchiveTeam_Warrior&action=history
19:47:03<atphoenix>I'm not sure those docker details are relevant to running the warrior in general
19:47:22<thuban>tech234a: i think it's time to remove that again; y/n/q?
19:48:56<atphoenix>I'd vote for keeping the Warrior article focused on normal warrior usage, and the Docker article on docker usage. Even though Warrior uses docker internally, it's not really meant for interactive use (in my understanding).
19:49:58<@OrIdow6>Ryz: Well, I think this site has them as identical? Otherwise there's a lot of duplication
19:50:12<@OrIdow6>But if you want me to make this thing more flexible, that can be an option
19:51:37<Ryz>OrIdow6, it'll definitely be an obscure option on top of said solution, just in case something really exceptionally exceptional like the one I stumbled upon happens~
19:53:38DogsRNice (Webuser299) joins
19:54:38<Ryz>Also OrIdow6, instead of maybe filtering out www or non-www, there would be another option just to combo the two and print it out at www or non-www (or those unfortunate numbered www's like www2 or something)
19:56:32<@hook54321>Ajay: it won't let me approve your edit because there's no changes
19:59:49<Ajay>Ah darn, I tried to remove a new line
20:00:09<@OrIdow6>Ryz: For that I think you'd have to specify the domain to turn into www
20:00:47<@OrIdow6>Because else if you had images.website.com, it would try to turn that into www.images.website.com
20:01:45<Ryz>Oh, welp, that's interesting and more complicated; maybe just follow under the targeted link? So that it wouldn't convert the other ones that aren't under the targeted link into www or non-www
20:03:56<@OrIdow6>I suppose that would work
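A hedged Python sketch of the idea just agreed on: rewrite www/non-www only for the targeted domain and leave other hosts such as images.website.com untouched. The function name normalize_www and its parameters are hypothetical. Since some sites serve different content on www and non-www, this kind of rewriting is assumed to be opt-in.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_www(url, target_domain, prefer_www=True):
    """Rewrite the host only if it is target_domain or www.target_domain.

    Any other host (e.g. a subdomain like images.<target>) is returned
    unchanged. Userinfo in the URL, if any, is dropped by this sketch.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    bare = target_domain.lower()
    if bare.startswith("www."):
        bare = bare[4:]
    if host not in (bare, "www." + bare):
        return url  # not the targeted site, leave it alone
    new_host = ("www." + bare) if prefer_www else bare
    if parts.port is not None:
        new_host = f"{new_host}:{parts.port}"
    return urlunsplit((parts.scheme, new_host, parts.path, parts.query, parts.fragment))
```

Applying the same function to both the extracted links and the list of already-captured URLs keeps the comparison consistent.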
20:05:23<Ryz>Mm, honestly, I'm a bit miffed that when archiving websites, I have to keep in mind whether it's supposed to be www or non-www...and even then, there's certain links that rendered them as the opposite
20:05:51<Ryz>Like in a primarily www website, there's one lonely non-www link in there S:
20:14:38x9fff00 joins
20:16:32<atphoenix>Ryz, that problem is why I wish "groups of domains" could be given to AB together. I've used HTTrack to do groups of related domains that were not consistent in their www/non-www usage.
20:17:16<atphoenix>intended use-case is for closely related domains and subdomains that should all be considered part of one job
20:17:34<tech234a>thuban: I recently added the Docker instructions to that page to provide an alternate method for setting up the Warrior
20:18:16<tech234a>While it was based on the instructions for running a general project, some things were adjusted for the Warrior image specifically
20:18:25<atphoenix>Re: Warrior: is the docker option considered to be a warrior?
20:18:33<thuban>so
20:18:52<thuban>the docker instructions on the 'running projects with docker' page are for running _projects_ with docker
20:19:01<thuban>not for running the warrior docker image
20:19:05<tech234a>Yes
20:19:13<tech234a>Individual project images
20:19:37<tech234a>The Warrior VM is now just a wrapper for Warrior Docker image
20:19:43<thuban>it's my understanding that watchtower, etc is totally redundant if you're actually running the warrior docker image
20:20:06<tech234a>Not entirely... I think dependencies aren’t automatically updated?
20:20:17<atphoenix>I don't consider docker workers to be 'The Warrior'. I consider the VMs to be 'The Warrior'
20:21:00<atphoenix>The VMs = self-contained preconfigured environment. In contrast to running docker yourself, you need to setup docker on some host OS
20:21:19<tech234a>True... but the result is identical
20:21:33<atphoenix>results sure, but user experience is different
20:21:41<thuban>you also need to set up a virtualization application on your host os, so i think that's a distinction without a difference...
20:21:41<tech234a>Is there a way to create collapsible sections on the wiki?
20:21:58<tech234a>Perhaps we can collapse the command explanations to make it shorter
20:22:52<tech234a>When I was updating the Warrior instructions recently, I was even considering putting Docker first before the VM
20:23:31<thuban>anyway, what i planned to do is copy the instructions for running the _warrior_ dockerfile from the github readme, add material (where appropriate) under the various faq/troubleshooting sections for docker (just like we have material for both virtualbox and vmware), and remove the material copied from 'running projects with docker' entirely, since it is not in fact related to the warrior (but retain a link and a mention that you can run scripts in docker containers just like you can run them directly)
20:24:19<atphoenix>Docker has less overhead. VM is more compartmentalized. Virtualbox VM is easy to get running on Windows host. If the user is running Linux, well then, maybe the docker option is just as easy.
20:24:39<tech234a>It's also fairly easy to install Docker desktop nowadays
20:24:44<thuban>but can someone weigh in on the watchtower/dependency issue? ^ seems like an important thing to know about
20:24:45<@JAA>The warrior is simultaneously the VM but also the part of seesaw that runs that web interface which lets you switch between projects etc.
20:24:52<@JAA>Just to clarify the terminology.
20:26:08<atphoenix>Warrior VM doesn't require *any* command line interaction = lower barrier of entry
20:26:34<tech234a>Perhaps we could write a small program to do the required Docker commands?
20:27:03<tech234a>Starting/stopping/deleting/viewing logs/getting a shell of containers can be done from Docker desktop
20:27:38<tech234a>Also restarting containers
20:27:53<tech234a>and viewing stats/opening the browser to the web interface
20:28:05<atphoenix>I can suspend my Virtualbox VMs, reboot my host, and resume my VMs.
20:28:17<tech234a>True, that is a limitation of Docker
20:28:41<tech234a>thuban: I thought I added Docker stuff to the FAQ recently; is anything missing?
20:29:12<tech234a>(also I commented out a few very outdated FAQ questions, perhaps we could remove them from the page source altogether)
20:30:49<thuban>tech234a: some stuff, yes (and some is inaccurate)
20:31:31<tech234a>thuban: just to make sure you're viewing the current version of the page (and not a cached old version): the new version mentions VM version 3.2 at the top of the page (as opposed to being unable to run wget-at)
20:31:45<thuban>yes, i'm aware
20:31:51<tech234a>just checking :)
20:33:17<atphoenix>what old FAQ questions are you referring to?
20:33:32<Hyenadae>Does it work on raspberry pi?
20:33:58<Hyenadae>Was curious if there's ARM 32bit & 64bit 'warrior' stuff
20:34:23<atphoenix><!--=== How can I run the warrior without a virtual machine? (The VM has too much overhead for a VPS!) ===
20:34:24<atphoenix> ? Not sure why that would be irrelevant
20:34:29<thuban>Hyenadae: yes https://github.com/ArchiveTeam/warrior-dockerfile#raspberry-pi
20:34:58<Hyenadae>Oh nice, I didn't see much mention of it on the wiki (at a glance). Nice
20:35:01<tech234a>atphoenix: I think it had outdated Docker information(?)
20:35:14<atphoenix>I haven't seen this problem myself: <!--=== I just imported the ova image and the warrior is stuck on "Preparing the data partition". ===
20:37:30<tech234a>I think that was for Warrior 2
20:38:52<tech234a>I copied "Recovering from a ungraceful virtual machine/Docker container stop" over from the running projects with Docker page; the path in the command might be inaccurate for the Warrior image
20:40:08<tech234a>Also: it might be possible to limit disk space usage on Docker, but it seemed complicated. If anyone knows something about that, please feel free to add it.
20:43:46Daloader_ quits [Ping timeout: 250 seconds]
20:53:26<purplebot>Running Archive Team Projects with Docker edited by Tech234a (+88, Add explanation of Watchtower image …) just now -- https://www.archiveteam.org/?diff=46502&oldid=46487
20:54:26<purplebot>ArchiveTeam Warrior edited by Tech234a (+404, Make Docker command explanations …) just now -- https://www.archiveteam.org/?diff=46503&oldid=46488
20:55:33<tech234a>thuban, atphoenix: This should make the Docker section a little easier to read, especially since the users don't need to change anything in the commands. (FYI if the Wiki ever gets a mobile theme the collapsible section might not work on that theme)
20:56:02<thuban>argh, i was in the middle of a big edit
20:56:39<tech234a>Oh sorry... my only changes were wrap the instruction sections with a small block of code:
20:57:24<tech234a>and added one line in the command explanation for Watchtower
20:57:54<tech234a>Perhaps just integrate the changes from the diff into your draft? https://www.archiveteam.org/?diff=46503&oldid=46488
20:58:22<thuban>i see, thx. this suggests that you *do* need watchtower when using the docker container (but not when using the vm)... i'm still hoping someone can confirm or deny that
20:59:01<tech234a>(when I updated the VM to the new Warrior image I also configured it to use Watchtower)
20:59:16<thuban>(aha)
20:59:59<tech234a>Updates are rare but do still occasionally happen: https://atdrone.meo.ws/ArchiveTeam/warrior-dockerfile
21:01:13<tech234a>chfoo probably knows the most about this though
21:28:16LeGoupil quits [Client Quit]
21:30:10DogsRNice quits [Read error: Connection reset by peer]
21:30:32DogsRNice (Webuser299) joins
21:33:59<thuban>>The minimum concurrency value is 1, the default concurrency value is 3, and maximum recommended concurrency value is 5, and the maximum allowed concurrency value is 20.
21:34:31<thuban>web interface still says 6 is the max (although i'm not sure whether it enforces that properly)
21:34:40<thuban>is this a frontend/backend disagreement?
21:39:17<tech234a>not sure... I know that when running projects individually I could do up to 20... perhaps the Warrior has a different limit?
21:39:42<tech234a>code is here btw: https://github.com/ArchiveTeam/seesaw-kit
21:41:38hnds joins
21:42:00hooway quits [Client Quit]
21:44:21LeighR (LeighR) joins
21:45:01<@OrIdow6>Ryz: It picks up on "URLs" like "http://spiv.cz/mn4.php?id0=031124czert&id1=CZERT&id2=14&id3=I FOUND COLORS :)". Should those be discarded, or can AB handle them?
21:45:14hnds quits [Remote host closed the connection]
21:45:25@Fusl quits [Max SendQ exceeded]
21:45:45Fusl (Fusl) joins
21:45:45@ChanServ sets mode: +o Fusl
21:45:53<@OrIdow6>Anyhow extraction is very poor on this, I am just extracting href's from a's and src's from img's
21:46:32<Ryz>Ah, not necessarily using maybe wpull or something similar; and I think AB will handle them if they're actually 404s~ o.o;
21:50:47<tech234a>thuban: Concurrency limit is 6 in web UI https://github.com/ArchiveTeam/seesaw-kit/blob/development/seesaw/warrior.py#L206-L214. 20 when running project manually https://github.com/ArchiveTeam/seesaw-kit/blob/development/seesaw/warrior.py#L206-L214
21:51:28jut quits [Quit: Ping timeout (120 seconds)]
21:51:46jut (jut) joins
21:52:02<thuban>gotcha
21:52:56<thuban>er, that's the same url twice there, did you mean to link something else?
21:53:02<tech234a>oops
21:53:12<tech234a>https://github.com/ArchiveTeam/seesaw-kit/blob/development/seesaw/script/run_pipeline.py#L56-L74
21:53:27<thuban>thx
21:53:43<tech234a>It seems that if you manually edit the configuration file/use Docker environment variable the limit of 6 is not enforced
21:54:34<tech234a>not fully sure though
21:56:40<thuban>does `docker run` work without a preceding `docker pull`? docker wiki page implies yes but warrior-dockerfile readme implies no
21:57:01<LeighR>thuban: yes.
21:57:07<thuban>thanks
21:57:14<tech234a>Yeah it pulls if needed
21:57:41<LeighR>docker pull is mostly to make sure you have a valid image present before going on in a script
21:58:08<LeighR>or if you want to do some inspection of the image before running
21:58:25<tech234a>thuban: see also https://hackint.logs.kiska.pw/archiveteam/20210204
21:58:47<tech234a>(the VM has been fixed since then)
21:58:58<thuban>(was about to ask, lol)
22:01:30frontier_mycologist joins
22:02:30frontier_mycologist leaves
22:09:18<@OrIdow6>Ryz: Here's the new list https://gist.github.com/OrIdow6/25739114f03e8edf562fcf3d908dad82 - www's stripped out during comparison and output
22:09:27HP_Archivist (HP_Archivist) joins
22:10:25<@OrIdow6>As you can see, it's mostly weird stuff, but there are a few good ones
22:16:13<Ryz>Huh, OrIdow6, is this just a sample of what the tool would grab? Or like recursive?
22:16:50<@OrIdow6>Ryz: This should be everything it would get under *.spiv.cz
22:17:09@EggplantN quits [Quit: Ping timeout (120 seconds)]
22:17:17<@OrIdow6>I used the CDX server to find all WBM captures for that domain, and downloaded all of those
22:17:32EggplantN joins
22:20:36<@OrIdow6>Well, "gets", not "would get", as this is the actual script output
22:21:00<@OrIdow6>It doesn't need to be recursive, because the CDX server lets you list all the (non-recent) WBM captures for a domain or prefix
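For illustration, a Python sketch that builds a Wayback CDX server query listing captures for a whole domain, which is why no recursion is needed. The helper name cdx_query_url and the chosen field list are assumptions; the sketch only constructs the URL and makes no request.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain, collapse_by_digest=True):
    """Build a CDX query URL for all captures under a domain.

    matchType=domain asks for the host and its subdomains.
    collapse=digest drops adjacent rows that share a content digest, so
    identical copies of a page are listed once (the "one copy per
    version" behaviour discussed in the channel).
    """
    params = [
        ("url", domain),
        ("matchType", "domain"),
        ("output", "json"),
        ("fl", "original,timestamp,digest"),
    ]
    if collapse_by_digest:
        params.append(("collapse", "digest"))
    return CDX_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(cdx_query_url("spiv.cz"))
```

One caveat already raised above: collapsing by digest can merge www and non-www captures of the same content, so the full list of captured URLs may need a second query with collapse_by_digest=False.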
22:24:41LeighR leaves
22:25:21<Ryz>OrIdow6, I'm trying to figure out if it's grabbing links that don't have copies in WBM before; there's https://web.archive.org/web/20210406180527/http://spiv.cz/050529mechbug.gif - which there's one singular copy, but I'm not sure if it's because of your script coming to the page that links which makes WBM fetch the still alive image
22:26:03<@OrIdow6>Ryz: That wasn't the script, that was me manually going to a page that had that image a few hours ago in a web browser
22:26:21<@OrIdow6>The script shouldn't trigger SPN
22:26:38<Ryz>Oh~
22:27:25<Ryz>The reason why I say recursive is like I would want the bot to explore the new links within the WBM copy and find new content, not just get it from the CDX server listings
22:28:38<@OrIdow6>Recurse onto the live site?
22:28:43<Ryz>Yeah~
22:28:56<@OrIdow6>Oh
22:29:45<@OrIdow6>I suppose that's possible
22:31:17Omni78 quits [Ping timeout: 244 seconds]
22:31:22<Ryz>There was an earlier example that I did (which at the time I did that manually), where there were images or links that were hidden currently but still exist, and I had to fetch those images/links through WBM manually (this is way before I got into AT)
22:34:35<Ryz>This is why I wanna do a recursive, because of even the remote possibility of the WBM version of a page being the only one or couple of pages that have a link that's just currently hidden, think of it as a treasure trove OrIdow6
22:36:51<@OrIdow6>I think I see
22:37:20<@OrIdow6>There's nothing impossible about it, it's just that dealing with arbitrary sites online requires a lot more caution than only dealing with the WBM
22:38:07<@OrIdow6>In that you have to set timeouts, be prepared for them to send invalid headers or have bad certs, etc.
22:38:40<Ryz>To give another example, if say I archive http://www.afn.org/~twagner/ - it wouldn't stumble upon http://www.afn.org/~twagner/Breitungen/ (which I archived earlier) on its own, and the only way to find it would be to run the specialized tool I suggested (but it would unfortunately work only if there was an existing copy in WBM)
22:39:14<Ryz>Unfortunately that's the true limitation, if it's not found and archived in WBM, it just doesn't exist to us :c
22:40:07<Ryz>Hardcore mode would be trying to find new links from each and every version of the archived page
22:40:28<@OrIdow6>I see, that is something that would be nice to do
22:40:48<@OrIdow6>Right now it sort of does a compromise, where it deduplicates by hashes of pages
22:41:05<@OrIdow6>So it does fetch all versions of pages, but only one copy per version
22:41:33<@OrIdow6>Ideally I think the solution would be to make AB more flexible, where you could have a seed list or something like that, but obviously that's beyond the scope of this
22:43:50<Ryz>There is always the talk of wanting AB to be improved and make stuff easier~
22:44:30<Ryz>You know, since ArchiveBot has the Wayback Machine links ignored globally, I'm surprised those links weren't just converted back to non-WBM links
22:50:43<Ryz>Or alternatively, have ArchiveBot act on the WBM links it stumbles upon as if they're the targeted links it's recursing through, while trying to get the live version too
22:54:08<Ryz>Makes me wonder if someone has mentioned about my main idea before in the past...
22:57:07EggplantN quits [Ping timeout: 258 seconds]
22:58:00<@OrIdow6>I don't know about that specifically, but I do think that "AB should be more flexible" is a perennial topic
22:58:33<@OrIdow6>At least as the superset of all those "it would be nice if AB would do this", etc. things
23:00:42<Ryz>What's holding back from AB being worked on more?
23:02:14bananapotato quits [Client Quit]
23:14:03<@OrIdow6>Don't know, not my area
23:21:19<@JAA>Time, mostly.
23:29:10<thuban>ok, one last question for the wiki page: we have one old question that's pretty chill about people turning off the warrior (since items will just get retried), and one newer one that's not at all chill (asks people to leave the warrior turned off and report it in irc, even if it was a deliberate hard-stop, since there might not be time to retry items).
23:29:43<@arkiver>so whats the question?
23:30:03Arcorann_ joins
23:30:12BlueMaxima joins
23:31:20<thuban>what line do we take? 'never hard-stop/always report', 'ok to hard-stop', 'use your judgment and report failures / avoid hard-stops only if items are huge'?
23:45:20@Fusl quits [Excess Flood]
23:45:39Fusl (Fusl) joins
23:45:39@ChanServ sets mode: +o Fusl
23:50:47Arcorann_ quits [Ping timeout: 258 seconds]
23:53:28Sylirana quits [Read error: Connection reset by peer]
23:54:15Sylirana (Sylirana) joins
23:56:55<mgrandi>DPoS project items don't get retried manually, which I think is an improvement, like reporting back to the server if an error occurred or the process is exiting
23:57:35<mgrandi>But I think they are usually small enough to be retried, so I don't see why we would need to do anything there
23:58:54<Ryz>Uhhhhhhh, did something happen at Photobucket? It appears it's quietly transformed into just another host for Shutterstock now
23:59:58<Ryz>There's no access to the accounts individually anymore...