00:00:52nexussfan (nexussfan) joins
00:11:14mr_sarge quits [Read error: Connection reset by peer]
00:11:47StarletCharlotte joins
00:11:58<StarletCharlotte>What's the best way to upload large files to the Internet Archive?
00:12:22<StarletCharlotte>Because I'm trying to upload an archive of ftp://ftp.funcom.com and it's stuck at 4.9 GB. It's been several hours.
00:12:39<StarletCharlotte>My internet isn't the best but I don't think it's that bad.
00:14:01<Yakov>reading some of AB's source, I think it supports ftp..?
00:14:08<@imer>StarletCharlotte: there's some tips here (if you've not seen it yet) https://wiki.archiveteam.org/index.php/Internet_Archive#Upload_speed
00:17:20<pokechu22>ab doesn't interact nicely with ftp - there's some code for it but it crashes most of the time and as such is mostly disabled at this point
00:18:11<StarletCharlotte>imer I'll take a look
00:26:25<@OrIdow6>What's the current tracker architecture? I found old logs talking about it being an Nginx(Lua) proxy that talks to the original tracker, but doesn't directly talk to Redis - is that still the case?
00:28:36<nicolas17>StarletCharlotte: are you on Linux?
00:28:46<StarletCharlotte>yeah
00:29:47<nicolas17>in my experience "sudo sysctl net.ipv4.tcp_congestion_control=bbr" makes uploads to archive.org significantly faster
00:29:50<nicolas17>won't help with ongoing connections/uploads though, you'd have to start over
00:30:21<StarletCharlotte>Got it. Should I turn it off after though?
00:30:46<nicolas17>I didn't notice any negative effects on the rest of my internet use tbh
00:30:57<StarletCharlotte>got it
00:31:01<nicolas17>but you could run "sudo sysctl net.ipv4.tcp_congestion_control" to see what your current value is
00:31:05<nicolas17>and restore it afterwards
00:31:07<StarletCharlotte>whoops
00:31:12<StarletCharlotte>i uh... already did it
00:31:17<StarletCharlotte>oh well
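An editorial aside on the tip above: the usual pattern is to record the current algorithm before switching, so it can be restored afterwards (the change also reverts on reboot). BBR needs the `tcp_bbr` kernel module, which recent kernels include. A minimal sketch, Linux only:

```shell
# 1. Record the current congestion control algorithm so it can be restored:
orig=$(sysctl -n net.ipv4.tcp_congestion_control)
echo "current algorithm: $orig"

# 2. Switch to BBR for the duration of the upload (needs root):
sudo sysctl net.ipv4.tcp_congestion_control=bbr

# 3. ...run the upload to archive.org...

# 4. Restore the previous value:
sudo sysctl "net.ipv4.tcp_congestion_control=$orig"
```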
00:32:59<BlankEclair>out of curiosity, does that only make IA uploads fast, or does it make all tcp connections go faster
00:34:26<nicolas17>there's *something* in archive.org's networking that doesn't interact well with the default congestion control algorithm, I don't understand the details
00:35:34<klea>https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html#:~:text=tcp%5Fcongestion%5Fcontrol%20%2D%20STRING isn't very clear about what that does.
00:40:24<nicolas17>https://en.wikipedia.org/wiki/TCP_congestion_control
00:40:44<StarletCharlotte>> /usr/bin/python: Error while finding module specification for 'ia-upload-stream.py' (ModuleNotFoundError: __path__ attribute not found on 'ia-upload-stream' while trying to find 'ia-upload-stream.py'). Try using 'ia-upload-stream' instead of 'ia-upload-stream.py' as the module name.
00:40:49<StarletCharlotte>not sure what's going on here
00:41:20<StarletCharlotte>Removing .py also fails
00:41:42<StarletCharlotte>same with removing -m
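An aside on that traceback (my reading, not something stated in the log): `python -m` resolves its argument as an importable module name, so both the `.py` suffix and the hyphen in `ia-upload-stream` trip it up; running the file as a script path avoids the import machinery entirely.

```shell
# -m wants a module name, so a hyphenated or path-like argument fails
# with a "No module named ..." error instead of running the file:
python3 -m no-such-module 2>&1 || true
# Running the script by path works regardless of its filename, e.g.:
#   python3 ./ia-upload-stream.py --help
```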
00:47:47<klea>nicolas17: what version does setting the variable to bbr set it to, BBRv1, BBRv2 or BBRv3?
00:49:12<StarletCharlotte>yeah I can't figure out how to run this. The example on the wiki just doesn't work for some reason
00:50:00<@JAA>Why is that command trying to run it as a module? (I either never knew or forgot that my uploader is even listed there.)
00:51:16<StarletCharlotte>Good question, but not running it as a module also fails.
00:51:18<@JAA>And what's that bit about installing the ia package? ia-upload-stream only depends on requests.
00:51:53<StarletCharlotte>https://pastebin.com/s88c8eJr
00:53:03<@JAA>Hmm yeah, I suppose.
00:53:22<@JAA>That does run the script correctly though.
00:53:46<@JAA>You can specify the S3 credentials via IA_S3_ACCESS and IA_S3_SECRET environment variables as well.
00:54:14<@JAA>And ia-s3-auth can get you those values without `ia configure`.
00:55:52etnguyen03 quits [Client Quit]
00:56:42<StarletCharlotte>S3?
00:57:16<StarletCharlotte>Okay I guess
00:57:24<klea>They're available on the web at https://archive.org/account/s3.php too
00:57:32<klea>It's an S3-like API
00:57:34<StarletCharlotte>oh okay thanks
00:59:27<StarletCharlotte>Tried again, same error. It's asking about a config file or something?
00:59:43<@JAA>To explain that error referencing `ia configure`: `ia-upload-stream` reads ia's config file if it's available (and not overridden by the environment variable). There's no actual dependency on `ia`.
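A sketch of that precedence (the config file path and the `access` key name here are assumptions on my part, not taken from the log): the environment variable is checked first, and the file is only a fallback.

```shell
# Returns the S3 access key: environment variable wins; ia's config file
# (path and key name assumed; see `ia configure`) is only the fallback.
ia_access() {
    if [ -n "${IA_S3_ACCESS:-}" ]; then
        printf '%s\n' "$IA_S3_ACCESS"
    else
        awk -F' *= *' '$1 == "access" {print $2; exit}' \
            "$HOME/.config/internetarchive/ia.ini" 2>/dev/null
    fi
}
```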
01:00:29<StarletCharlotte>I assume ia is from python-internetarchive?
01:00:55<StarletCharlotte>I set the environment variables for the S3 credentials so it's not that.
01:01:27<TheTechRobo>StarletCharlotte: the sysctl option should go back to what it was before after a reboot, FWIW, so don't worry about losing it
01:01:39<StarletCharlotte>got it
01:01:41<@JAA>Sounds like you didn't set them correctly then. It won't even reach that code when they're set.
01:02:03<TheTechRobo>(ia comes from https://pypi.org/project/internetarchive BTW)
01:02:40<StarletCharlotte>Huh, I guess set just sets the shell variables and not environment variables? I think?
01:02:48<@JAA>Yes
01:02:50<klea>try to export.
01:02:53<TheTechRobo>export IA_S3_ACCESS=...
01:03:07<@JAA>Either run it as `IA_S3_ACCESS=... IA_S3_SECRET=... ./ia-upload-stream ...` or `export` them.
01:03:49<StarletCharlotte>There it goes. thank you
01:03:50<@JAA>And `set` sets the arguments, not variables.
01:03:54<StarletCharlotte>that explains a lot
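To spell out the distinction from the exchange above: a plain assignment creates a shell variable that child processes never see, `export` copies it into the environment, and a `VAR=value command` prefix sets it for that one command only.

```shell
FOO=bar                                  # shell variable only
sh -c 'printf "[%s]\n" "$FOO"'           # child prints [] - FOO is invisible to it
export FOO                               # promote FOO to an environment variable
sh -c 'printf "[%s]\n" "$FOO"'           # child prints [bar]
BAZ=qux sh -c 'printf "[%s]\n" "$BAZ"'   # one-shot: set only for this command
```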
01:04:34StarletCharlotte quits [Client Quit]
01:11:50pabs (pabs) joins
01:13:49LddPotato quits [Read error: Connection reset by peer]
01:14:30LddPotato (LddPotato) joins
01:15:12roverinexile joins
01:17:41rover quits [Ping timeout: 272 seconds]
01:18:31etnguyen03 (etnguyen03) joins
01:24:27LddPotato quits [Read error: Connection reset by peer]
01:25:09LddPotato (LddPotato) joins
01:34:57LddPotato quits [Read error: Connection reset by peer]
01:35:51LddPotato (LddPotato) joins
01:36:03petrichor quits [Ping timeout: 272 seconds]
01:44:13fangfufu quits [Client Quit]
01:45:53LddPotato quits [Read error: Connection reset by peer]
01:46:34LddPotato (LddPotato) joins
01:50:08fangfufu joins
01:50:28kansei- (kansei) joins
01:51:52kansei quits [Ping timeout: 256 seconds]
02:03:57LddPotato quits [Read error: Connection reset by peer]
02:05:31LddPotato (LddPotato) joins
02:29:50pokechu22 quits [Ping timeout: 256 seconds]
02:40:35pokechu22 (pokechu22) joins
02:52:14ducky_ (ducky) joins
02:53:04ducky quits [Ping timeout: 256 seconds]
02:53:04ducky_ is now known as ducky
02:53:29thalia quits [Quit: Connection closed for inactivity]
03:06:40ducky quits [Ping timeout: 256 seconds]
03:08:16ducky (ducky) joins
03:30:58nexussfan quits [Quit: Konversation terminated!]
03:36:42Godzfire quits [Quit: Ooops, wrong browser tab.]
03:47:30nexussfan (nexussfan) joins
04:08:05etnguyen03 quits [Remote host closed the connection]
04:08:17fireatseaparks quits [Quit: Textual IRC Client: www.textualapp.com]
04:16:13fireatseaparks (fireatseaparks) joins
04:39:57Island quits [Read error: Connection reset by peer]
04:46:18cyanbox joins
04:55:14DogsRNice quits [Read error: Connection reset by peer]
05:04:32n9nes quits [Ping timeout: 256 seconds]
05:05:03khaoohs quits [Ping timeout: 272 seconds]
05:06:01n9nes joins
05:06:36khaoohs joins
05:08:58nexussfan quits [Client Quit]
05:15:33steering wonders how thoroughly wikipedia links have been archived
05:23:34<steering>i know there's bots that try and point links to archives when they're dead but is there stuff going through and SPN'ing links for example
05:24:26<BlankEclair>wikipedia-eventstream or something
05:24:55<BlankEclair>https://archive.org/details/wikipedia-eventstream?tab=about
05:27:59<pokechu22>Yeah, my understanding is that there's a project that does that (that isn't by archiveteam). Looking at https://archive.org/details/wikipedia-eventstream?tab=collection&sort=-publicdate it seems like stuff is run weeklyish?
05:35:01<steering>ah good :)
06:08:23Snivy quits [Ping timeout: 272 seconds]
06:15:57petrichor (petrichor) joins
06:25:00fionera quits [Ping timeout: 256 seconds]
06:29:23BennyOtt (BennyOtt) joins
06:40:59Wohlstand1 (Wohlstand) joins
06:43:24Wohlstand1 is now known as Wohlstand
06:51:24Wohlstand quits [Client Quit]
07:12:09Snivy (Snivy) joins
08:30:53rohvani quits [Ping timeout: 272 seconds]
08:55:44ducky quits [Ping timeout: 256 seconds]
08:57:25<ericgallager>https://en.wikipedia.org/wiki/User:GreenC_bot does archiving of Wikipedia links
08:57:40<ericgallager>https://en.wikipedia.org/wiki/User:GreenC/WaybackMedic
08:59:51<ericgallager>oh and this one too: https://en.wikipedia.org/wiki/User:InternetArchiveBot
09:14:57ducky (ducky) joins
09:32:19sec^nd quits [Ping timeout: 244 seconds]
09:34:36sec^nd (second) joins
09:58:55BornOn420 quits [Ping timeout: 272 seconds]
10:41:42TheEnbyperor quits [Ping timeout: 256 seconds]
10:41:59TheEnbyperor_ quits [Ping timeout: 272 seconds]
10:46:16TheEnbyperor (TheEnbyperor) joins
10:51:29TheEnbyperor quits [Ping timeout: 272 seconds]
10:57:47TheEnbyperor joins
10:59:34TheEnbyperor_ (TheEnbyperor) joins
11:02:13Dada joins
11:05:11Dada quits [Remote host closed the connection]
11:40:27APOLLO03a joins
11:42:54APOLLO03 quits [Ping timeout: 256 seconds]
11:59:46StarletCharlotte joins
12:00:03Bleo1826007227196234552220 quits [Quit: The Lounge - https://thelounge.chat]
12:02:12<StarletCharlotte>Good news: ia-upload-stream.py works! Bad news: I can't edit the metadata to say I finished uploading the actual file instead of the placeholder, because it turns out the Internet Archive REALLY doesn't like it when an item identifier has dots in it. But it only tells you that breaks things AFTER you've made it the name of your item, when you try to edit the item. https://archive.org/details/ftp.funcom.com
12:02:15<StarletCharlotte>Not sure what to do.
12:02:48Bleo1826007227196234552220 joins
13:03:07StarletCharlotte quits [Client Quit]
13:19:32Webuser302981 joins
13:19:39<Webuser302981>What
13:20:06Webuser302981 quits [Client Quit]
13:20:22@imer nods
13:22:51Arcorann_ quits [Ping timeout: 272 seconds]
13:42:17ice quits [Quit: WeeChat 4.7.1]
13:42:29oxtyped quits [Ping timeout: 272 seconds]
13:54:00mgrytbak8 joins
13:54:50ice joins
13:55:09mgrytbak quits [Ping timeout: 272 seconds]
13:55:09mgrytbak8 is now known as mgrytbak
14:15:02oxtyped joins
14:34:13Webuser247771 joins
14:34:57Webuser247771 quits [Client Quit]
14:40:07oxtyped quits [Ping timeout: 272 seconds]
14:49:40oxtyped joins
14:51:57GodzFire joins
14:58:28<GodzFire>pokechu22 I was watching the crawler and noticed it was seemingly scraping some production websites, so I checked the productionmusic.fandom.com_articles_and_outlinks.txt list. There's a crap ton that should be removed. I went through and took out 17000 links. Here's an updated txt that only has ProdMusic Wiki stuff; could you restart it with this?: https://litter.catbox.moe/gke9wfo08aoe2dpx.txt
15:00:18<GodzFire>I was wondering why it pulled 111 GB when the site is only 12 GB total.
15:04:21FiTheArchiver joins
15:04:39FiTheArchiver quits [Remote host closed the connection]
15:14:18Dada joins
15:19:02Webuser963758 joins
15:19:30Webuser963758 quits [Client Quit]
15:20:51<aaq|m>That would compress down well at least
15:21:51<justauser>GodzFire: That's fine, our motto is "Archive All The Things".
15:22:29<justauser>IA is willing to store the junk.
15:24:28<justauser>However, it only pulled 7GB so far - where is your number from?
15:26:33<justauser>Oh, nevermind - it's my number that came from a frozen dashboard.
15:33:39<GodzFire>justauser well, judging from what the other 17000 links are, it's all collections of actual music files on big production music websites, which are licensed and could get us in trouble. I would really prefer it if the job could please just be restarted with only the ProdMusic Wiki links.
15:34:53<GodzFire>It doesn't feel right otherwise.
15:37:21BornOn420 (BornOn420) joins
15:43:27<klea>IA excludes stuff from WBM, so that's fine I believe?
16:02:02Island joins
16:02:45polduran joins
16:08:38Wohlstand (Wohlstand) joins
16:10:14Dada quits [Remote host closed the connection]
16:15:45AK quits [Quit: AK]
16:30:07Boppen_ quits [Read error: Connection reset by peer]
16:34:18Boppen (Boppen) joins
16:52:00Dada joins
16:53:16janos777 joins
16:53:23janos778 joins
17:00:23<polduran>hello there. Normally I only stumble in here when I hear of the approaching end of a website, so it gets queued for the archivebot; my understanding of your projects and how you work is therefore very limited. Anyway, you may have already heard that archive.today is apparently using site visits for DDoS attacks, which makes its already endangered future even worse. On the English Wikipedia and the German Wikipedia (and probably several other languages as well) discussions have started about banning the URL, as it is obviously bad to link to a malicious website. The problem is that there are almost 700'000 existing links to pages archived on archive.today and its mirrors, most often websites that have not been archived (properly) in the Wayback Machine. I was hoping some of you might be interested in joining the discussion and might offer some ideas for how to preserve the archived information somewhere else. Here is the link to the English discussion: https://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment/Archive.is_RFC_5 By any chance, is anyone already working on or planning a project to somehow rescue data from there? After all, there are 700k existing links, and I cannot imagine how many dead links there must be in the Wikipedias which are only archived on archive.today.
17:27:04<@arkiver>archive.is is definitely on my radar
17:30:58lennier2_ joins
17:33:06lennier2 quits [Ping timeout: 256 seconds]
17:46:19ThreeHM quits [Ping timeout: 272 seconds]
17:47:59ThreeHM (ThreeHeadedMonkey) joins
17:48:10<pokechu22>GodzFire: I chose to scrape the outlinks because that's what we would have done for a recursive job - I included them so that if you clicked on links on the site, those would also be saved... and given that most of them seem to have previously *not* been saved, it feels like it's useful to save them
17:49:41<pokechu22>It should only be files that are public previews - if archivebot is somehow finding music that you're supposed to pay for, then they've done something really weird
17:53:19<@JAA>!tell StarletCharlotte Dots in IA item names are perfectly fine; I use them all the time. And there's a script in little-things for metadata as well. Feel free to ask in #internetarchive if you have more questions.
17:53:19<eggdrop>[tell] ok, I'll tell StarletCharlotte when they join next
17:54:36<pokechu22>(I was expecting it to be outlinks with text information about albums only, but if they have previews, might as well get them too)
18:05:19sg72 quits [Ping timeout: 272 seconds]
18:08:37sg72 joins
18:26:41Webuser043121 joins
18:27:41SootBector quits [Remote host closed the connection]
18:28:49SootBector (SootBector) joins
18:35:04Webuser043121 quits [Client Quit]
18:36:18<GodzFire>pokechu22 everything from ProdMusic Wiki seems to be erroring
18:37:10<pokechu22>Yeah, that's something fandom does - it has a lot of things that look like relative links to scripts to archivebot, but actually aren't
18:37:51<pokechu22>I'll add an ignore for some of them, but it's annoying to deal with (which is one reason why I did an !ao < list job like this instead of a fully recursive !a job)
18:38:14<GodzFire>Is there any way to see how far along it is and how many it's done/have left?
18:41:34<pokechu22>Yes, though the information is presented in a more obvious way on http://archivebot.com/3. It's processed 225k URLs and has another 275k URLs to go (though some are ignored or otherwise not relevant). That means it's saved the HTML for everything in my original list of 53k URLs, and is now saving images/scripts/media files embedded in those pages
18:42:14<GodzFire>I truly appreciate what you're doing and helping, but I really do have a worry about all the sound files. If you wanted to separate this into two jobs, where one is all the ProdMusicWiki stuff and another is all the links to other sites, that would make me feel a lot better. One so the ProdMusicWiki can get done by itself, but the other is because the sheer amount of filesize and music those others link to is insanity.
18:43:05Webuser567384 joins
18:43:29<GodzFire>For example one music library alone is easily 50 gigs of samples and thousands of pages, and currently it's trying to pull probably a hundred different music libraries
18:43:37<pokechu22>Yeah, I probably should have split it off initially and would have if I'd thought of the media files... but archivebot doesn't have a good way of doing that without aborting the job and starting from scratch (which would mean I'd duplicate the 131 GB of media already saved)
18:47:10<pokechu22>I could add ignores for the media files and then !ao < list them afterwards, but I don't really feel like that makes much of a difference (archivebot jobs get split into 5GB chunks that are uploaded as they finish, so just because it's downloaded over 100 GB doesn't mean there's a single 100 GB file sitting on the machine or any risk of running out of disk space)
18:47:31<GodzFire>Would it be possible to just do a separate additional job for just the ProdMusicWiki stuff since it's only 12 gigs? That way this other one can keep going. Then it will just see the ProdMusicWiki stuff is already uploaded to wayback and skip it.
18:47:43<nicolas17>it does not skip stuff that way
18:48:32<nicolas17>note that 85GB was *already* uploaded to archive.org
19:11:17ericgallager quits [Read error: Connection reset by peer]
19:14:57ericgallager joins
19:19:25mls quits [Ping timeout: 272 seconds]
19:20:35mls (mls) joins
19:34:16UwU quits [Remote host closed the connection]
19:34:52UwU joins
19:46:27UwU quits [Remote host closed the connection]
19:47:46UwU joins
19:53:08polduran quits [Quit: Ooops, wrong browser tab.]
19:57:11UwU quits [Remote host closed the connection]
19:58:21UwU joins
20:09:47UwU quits [Remote host closed the connection]
20:10:22UwU joins
20:15:08<klea>https://codeberg.org/lindenii/sethrawall - sethrawall is a small HTTP reverse proxy with SSH-based authentication.
20:26:39UwU quits [Remote host closed the connection]
20:27:16UwU joins
20:44:01UwU quits [Remote host closed the connection]
20:44:36UwU joins
21:10:54GodzFire quits [Quit: Ooops, wrong browser tab.]
21:11:39UwU quits [Remote host closed the connection]
21:12:16UwU joins
21:20:05Webuser567384 quits [Client Quit]
21:30:32UwU quits [Remote host closed the connection]
21:31:13UwU joins
21:32:13Dj-Wawa quits []
21:34:58Dj-Wawa joins
21:35:04Dada quits [Ping timeout: 256 seconds]
21:43:18fionera joins
21:43:18fionera quits [Changing host]
21:43:18fionera (Fionera) joins
21:45:27Hackerpcs quits [Quit: Hackerpcs]
21:46:22Hackerpcs (Hackerpcs) joins
21:57:23Dj-Wawa quits [Client Quit]
21:57:30Dj-Wawa joins
22:00:45Dj-Wawa quits [Client Quit]
22:00:53Dj-Wawa joins
22:11:35UwU quits [Remote host closed the connection]
22:12:20UwU joins
22:31:39Webuser851055 joins
22:42:30G4te_Keep3r34924156 quits [Ping timeout: 256 seconds]
22:44:20UwU quits [Client Quit]
22:44:50G4te_Keep3r34924156 joins
22:44:59UwU joins
22:46:55thedude joins
22:47:48etnguyen03 (etnguyen03) joins
22:48:22<thedude>I'm trying to recover a webpage from archive.today archives. Are there any tools out there that can do this?
22:50:00<thedude>I'd rather not try to hack something together in selenium myself
22:50:36<klea>What's the current approach around archiving Dropbox links? (i'm interested in archiving https://www.dropbox.com/s/l8yoah76t7nq04y/mueller-report.pdf from a url list I found somewhere on the web)
22:52:44<pokechu22>https://www.dropbox.com/s/l8yoah76t7nq04y/mueller-report.pdf?dl=1 and https://dl.dropboxusercontent.com/s/l8yoah76t7nq04y/mueller-report.pdf - note that www.dropbox.com is excluded from WBM
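The two rewrites pokechu22 lists can be scripted when feeding a longer URL list to AB; the transformation below is a sketch based only on the pattern shown here, not on documented Dropbox behaviour.

```shell
url="https://www.dropbox.com/s/l8yoah76t7nq04y/mueller-report.pdf"

# Variant 1: append ?dl=1 to force a download (assumes no existing query string):
dl1="${url}?dl=1"

# Variant 2: swap the host for the direct-content domain:
direct=$(printf '%s\n' "$url" | sed 's#^https://www\.dropbox\.com/#https://dl.dropboxusercontent.com/#')

printf '%s\n%s\n' "$dl1" "$direct"
```

Both forms matter because, as noted, www.dropbox.com itself is excluded from the WBM.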
22:59:13<klea>so shoving those two into AB?
23:00:18<pokechu22>Yeah, I'll do that
23:01:58Arcorann_ (Arcorann) joins
23:02:07<klea>Thanks
23:02:56thedude quits [Client Quit]
23:16:45atphoenix__ (atphoenix) joins
23:19:20atphoenix_ quits [Ping timeout: 256 seconds]
23:23:32Webuser851055 quits [Client Quit]
23:46:41ericgallager quits [Ping timeout: 272 seconds]
23:47:57nicolas17 quits [Ping timeout: 272 seconds]
23:48:30nicolas17 (nicolas17) joins
23:57:45rohvani joins