00:09:00<SketchCow>WE'LL FIND OUT
00:24:00<godane>SketchCow: I'm grabbing all of the official xbox magazine podcasts
00:24:00<godane>there are like 311 podcasts
00:25:00<godane>i'm uploading the rest of no bs podcast now too
02:49:00<dashcloud>so, I've got all the laptop service manuals from Dell's FTP - does anyone have a place I can upload them to?
11:37:00<joepie91>alard: is btinternet a warrior project yet?
11:38:00<alard>Yes, it's more or less ready (barring any new insights) but it's not actually on the warrior.
11:39:00<alard>https://github.com/ArchiveTeam/btinternet-grab
11:39:00<alard>Is ready to go.
11:39:00<alard>(Almost.)
11:40:00<alard>Why?
11:42:00<joepie91>well, when it's done, my warrior has something important to do :P
11:43:00<alard>We should keep looking for more usernames, though.
11:43:00<alard>I added the sites from DMOZ, from the wayback machine and am waiting for the btinternet links on tvtropes.org.
11:45:00<joepie91>alright
11:49:00<alard>I'm now downloading the wikipedia dump as well.
11:50:00<joepie91>wikipedia dump? as in, find btinternet links on wikipedia?
11:50:00<joepie91>speaking of which.. I'll have a look in the stackexchange dump
11:50:00<joepie91>I have it here locally
11:53:00<alard>joepie91: Yes, bunzip2 | grep ...
11:54:00<alard>It seems that there are a few links on Wikipedia: https://encrypted.google.com/search?hl=en&q=site%3Awikipedia.org%20btinternet.co.uk
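For anyone wanting to run the same kind of search locally, a minimal sketch; the dump filename is illustrative and the pattern is the one joepie91 posts later in this log:

    # stream-decompress the dump and pull out btinternet hostnames/usernames, deduplicated
    bunzip2 -c enwiki-latest-pages-articles.xml.bz2 \
      | grep -Po "(([^\s(/]+)\.)?btinternet\.(com|co\.uk)(/~([^/ %?]+))?" \
      | sort -u > btinternet-links.txt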
11:55:00<joepie91>oh goddamnit, I removed the stackexchange data dump a few days ago
11:55:00<joepie91>redownload time
11:59:00<SmileyG>alard: I think "all Projects" tab in warrior should be "Choose Project" ?
12:00:00<alard>SmileyG: Perhaps. But "Choose" is a verb. "Settings" is not. Is "Available projects" a solution?
12:01:00<SmileyG>Yeah that works
12:01:00<SmileyG>Currently I'd think "All Projects" would select all projects... make sense?
12:03:00<alard>Yes, I think I understand your point. (Although you could also say that it's a tab, not a button, so it shows you "all projects", like it does.)
12:03:00<SmileyG>Hehe
12:03:00<joepie91>UI design is hard :P
12:04:00<SmileyG>Well I have a habit of reading things differently to others, but I was good at it at uni. :S
12:04:00<alard>It's fun.
12:15:00<alard>http://tracker.archiveteam.org/btinternet/
12:16:00<alard>(Don't go too fast.)
12:16:00<BlueMaxim>thanks for reminding me to see how webshots was doing :P
12:16:00<BlueMaxim>underscor with 2364GB.
12:16:00<BlueMaxim>I'm going to kill him one day
12:19:00<balrog_>why does it say only 8 items done so far? :P
12:19:00<balrog_>oh I see...
12:19:00<balrog_>nvm :P
12:21:00<alard>balrog_: You could be number 1 with 9!
12:22:00<balrog_>alard: do I have to use warrior?
12:22:00<balrog_>:|
12:23:00<alard>What's wrong with the warrior? It's a small project.
12:23:00<balrog_>takes up more ram and cpu on my side :/
12:23:00<BlueMaxim>It's pretty minimal how much it takes up.
12:24:00<joepie91>BlueMaxim: not exactly
12:24:00<joepie91>it uses up to 20% of my 4GB of RAM
12:24:00<SmileyG>how long til bt dies?
12:24:00<joepie91>usually around 13
12:24:00<BlueMax>joepie91, seriously? I thought it only needed 256MB of RAM
12:24:00<joepie91>BlueMax: that's the VM itself - apparently virtualbox adds a bunch of overhead on top of that
12:24:00<joepie91>also
12:24:00<joepie91>or something
12:24:00<joepie91>it's quite heavy on CPU
12:24:00<joepie91>on my shitty notebook i3
12:25:00<joepie91>2 x 1.3GHz
12:25:00<Cameron_D>hm, these are syncing like 6kb
12:25:00<Cameron_D>I guess if there is only one page
12:26:00<Cameron_D>Oh, 404 error, even smaller
12:26:00<BlueMax>guess I didn't notice
12:27:00<BlueMax>My computer must be better at this than I thought :P
12:28:00<SmileyG>can the tracker show something more precise than 0MB?
12:30:00<Deewiant>I ran into virtualbox using 7 gigabytes of ram before it got OOM-killed
12:30:00<Deewiant>While running the warrior a few days back
12:31:00<joepie91>lot of 0MBs
12:31:00<joepie91>lol
12:31:00<BlueMax>memory leak to the max :P
12:32:00<SmileyG>alard: when should it start new processes :S, I've got it set to 6 but it still only shows 4?
12:33:00<alard>SmileyG: When an item finishes the warrior checks the number to see how many new items there should be.
12:33:00<SmileyG>hmmm k
12:33:00<SmileyG>one's just finished, let's see if it works this time
12:33:00<SmileyG>Also I've changed to BT but the banner still shows webshots (I presume because some of the jobs are still webshots).
12:34:00<joepie91>have there ever been archiving/warrior projects where the warriors were throttled/rate-limited/blocked?
12:34:00<joepie91>SmileyG: it will first finish the webshots jobs
12:34:00<joepie91>then move on to BT
12:34:00<SmileyG>I rate limit mine joepie91 :P
12:34:00<joepie91>oooo, 39MB
12:34:00<alard>The warrior can't run multiple projects at the same time, so yes, it waits for webshots to complete.
12:35:00<SmileyG>ok, makes sense :D
12:35:00<alard>(Also: why not keep it on webshots? I expect btinternet won't take long.)
12:35:00<BlueMax>it'd be cool if it could multitask
12:35:00<BlueMax>one process on one project, four on another
12:35:00<SmileyG>I have a webshots running at work on 5Mbit, this is amazingly slow compared to that ;)
12:42:00<joepie91>alard: http://www.quickonlinetips.com/archives/2012/09/google-feedburner-shutting-down/
12:43:00<joepie91>not sure if there's any useful data on feedburner
12:43:00<joepie91>but sure looks like signs of imminent death
12:43:00<joepie91>also http://searchenginewatch.com/article/2213759/Google-Shutting-Down-AdSense-for-Feeds-Classic-Plus-More-Services?utm_source=twitterfeed&utm_medium=twitter
12:43:00<alard>Isn't that just a proxy/cache/stats service?
12:43:00<Cameron_D>Yeah, it is a stats tracking service for RSS feeds
12:44:00<Cameron_D>So thousands of RSS feeds will break
12:44:00<Cameron_D>but they don't really host much data
12:44:00<joepie91>this may also be a problem for THQ-related sites: http://www.gamearena.com.au/news/read.php/5116588
12:44:00<joepie91>THQ Asia Pacific shutting down
12:44:00<godane>i got to grab my t3 magazine podcast then
12:45:00<joepie91>are there any THQ Asia Pacific-run sites that have user content?
12:45:00<Cameron_D>looking now
12:46:00<godane>Cameron_D: links to a lot of podcasts and stuff could be lost
12:47:00<godane>http://feeds.feedburner.com/T3/podcast
12:48:00<Cameron_D>feedburner just acts as a proxy though (To collect stats)
12:48:00<Cameron_D>Somewhere on the t3 site is the actual feed
12:48:00<Cameron_D>At least that is how I remember it working
12:49:00<godane>but that feed i think doesn't go back that far
12:50:00<joepie91>Cameron_D: also as an aggregator afaik
12:50:00<godane>their only feed is from feedburner
13:07:00<balrog_>the warrior image has issues
13:08:00<balrog_>first off, vmware complains that it doesn't meet ova specs
13:08:00<balrog_>second, I get an error that there's an ide slave with no master
13:08:00<alard>balrog_: Which image?
13:09:00<alard>20121008?
13:09:00<balrog_>archiveteam-warrior-v2-20121008
13:09:00<balrog_>yes
13:09:00<Cameron_D>http://dmorton.staff.hostgator.com/archiveteam-warrior-vmware.ova vmware-compatible (albeit an older version)
13:09:00<balrog_>why did this one break?
13:10:00<alard>I don't know about the ova specs. There previously was a problem with the filename. I had exported the image as archiveteam-warrior-v2.ova, and then renamed it to include the date. This new image is exported with the correct name.
13:10:00<alard>And IDE slave with no master, that seems to be a virtualbox - vmware incompatibility.
13:10:00<balrog_>The import failed because /path/to/archiveteam-warrior-v2-20121008.ova did not pass OVF specification conformance or virtual hardware compliance checks. Click Retry to relax OVF specification and virtual hardware compliance checks and try the import again, or click Cancel to cancel the import. If you retry the import, you might not be able to use the virtual machine in VMware Fusion.
13:11:00<alard>I've added two disks in VirtualBox, but for some reason VMware ends up with two controllers: 1-master for disk 1, 2-slave for disk 2.
13:11:00<balrog_>and then ... There is an IDE slave with no master at ide1:1. This configuration does not work correctly in virtual machines. Move the disk/CD-ROM from ide1:1 to ide1:0 using the configuration editor.
13:12:00<balrog_>I wouldn't be surprised if VBox is malforming the ova
13:12:00<balrog_>VBox is unfortunately full of bugs
13:13:00<Cameron_D>heh, ESXi still rejects the file too http://i.imgur.com/z3Kox.png
13:14:00<balrog_>hm, they have an OVF tool
13:16:00<S[h]O[r]T>balrog_
13:16:00<S[h]O[r]T>are you running vmware workstation?
13:16:00<balrog_>no, fusion
13:16:00<balrog_>which is basically the mac version of workstation
13:17:00<S[h]O[r]T>when i first imported archiveteam-warrior-v2-20120813 i got the error about it not being valid. then i just imported again and it worked.
13:17:00<S[h]O[r]T>i got the ide error as well after that too
13:17:00<balrog_>yeah but I keep getting the ide error
13:17:00<S[h]O[r]T>you just have to go into the settings and change the second drive to ide0:1
13:17:00<S[h]O[r]T>from ide 1:0
13:23:00<balrog_>hmm
13:23:00<balrog_>what if someone imported the vm into vmware, fixed it, and exported it?
13:23:00<balrog_>I wonder if the ova file would be more up-to-spec
13:25:00<S[h]O[r]T>you'd probably want to export as a vmdk or whatever the vmware equivalent is. you can always just rar up the vmdk files and if someone uses them vmware will just ask if they copied it
13:25:00<joepie91>alard: btinternet\.(com|co\.uk)
13:25:00<joepie91>right?
13:25:00<balrog_>ova is better if it's compatible
13:25:00<balrog_>err, compliant
13:25:00<balrog_>apparently vbox doesn't produce compliant files
13:26:00<joepie91>bingo
13:26:00<joepie91>http://www.btinternet.com/~se16/hgb/statjoke.htm
13:26:00<joepie91>se16 :P
13:27:00<godane>uploaded: http://archive.org/details/cdrom-linuxformatmagazine-76
13:27:00<alard>joepie91: Yes, and then www\.(.+)\.btinternet or /~([^%?/]+)
13:28:00<SmileyG>Final webshots rsync finishes in a few min and then bt :D
13:29:00<joepie91>alard: I've also seen a few *without* www in front
13:29:00<joepie91>and just the username
13:31:00<joepie91>alard: 7z e -so *.7z | grep -P "(([^\s(/]+)\.)?btinternet\.(com|co\.uk)(\/~([^/ %?]+))?"
13:31:00<joepie91>:)
13:31:00<joepie91>will take a few hours for the torrent to finish downloading
13:31:00<joepie91>after that, that will yield all the relevant entries
13:36:00<joepie91>better:
13:36:00<joepie91>7z e -so *.7z 2> /dev/null | grep -Po "(([^\s(/]+)\.)?btinternet\.(com|co\.uk)(\/~([^/ %?]+))?"
13:57:00<balrog_>how well does warrior handle a network connection change?
14:01:00<balrog_>how well does warrior handle a network connection change?
14:01:00<balrog_>also, why no rsync with continue?
14:05:00<SmileyG>balrog_: it should back off then continue once it figures it out
14:06:00<balrog_>you mean with the wget?
14:06:00<balrog_>rsync seems to lack continue though...
14:08:00<alard>Doesn't --partial-dir enable --partial?
14:08:00<alard>(Just rsync --partial is dangerous in this case, since SketchCow will move any file in the upload directory.)
14:22:00<willwill>Hey there, if you see my name on an uncompleted webshots job please release the lock.
14:25:00<alard>willwill: No problem. (There will probably be other failed jobs, so I'll requeue them all at once later.)
14:46:00<SmileyG>balrog_: rsync, continue?
14:46:00<SmileyG>rsync knows what it's sent and it doesn't require continue
14:46:00<balrog_>resume rather
14:47:00<balrog_>--partial or -P switch
14:47:00<SmileyG>doesn't need it...
14:47:00<SmileyG>partial does partial files
14:48:00<SmileyG>rsync checks for each file as it goes
14:48:00<balrog_>yeah well a single .warc is pretty large
14:48:00<balrog_>and if it gets interrupted, whole thing has to start over
14:48:00<SmileyG>yeah true, then you're screwed :S
14:52:00<alard>I've added --partial to btinternet, so the next project will have it too.
14:52:00<SmileyG>Isn't that going to cause issues as you highlighted earlier?
14:52:00<alard>No, because --partial-dir keeps the partial files in a separate directory.
14:53:00<alard>They're uploaded to the .rsync-tmp/ subdirectory and moved when they're complete.
14:54:00<alard>I thought --partial-dir would be enough, but apparently you need --partial too.
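A sketch of what the resulting upload command presumably looks like; the destination module and paths are illustrative:

    # interrupted transfers are kept in .rsync-tmp/ so a retry can resume them,
    # and half-finished files never sit where SketchCow's scripts would sweep them up
    rsync --partial --partial-dir=.rsync-tmp -av example.warc.gz fos.textfiles.com::warrior/uploads/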
14:55:00<SmileyG>oooo
14:55:00<SmileyG>heh that's random devs for you
14:59:00<joepie91>alard: the title in the btinternet pipeline.py is still webshots
14:59:00<joepie91>;)
15:02:00<alard>I see. And apparently the title isn't used anywhere.
15:03:00<alard>Wikipedia produced 933 new btinternet names.
15:04:00<joepie91>:D
15:04:00<joepie91>I'm searching math stackexchange now
15:04:00<SmileyG>wikipedia? :o
15:04:00<joepie91>alard: stats stackexchange produced "se16" as only username
15:06:00<joepie91>it's referenced a *lot* on math. as well
15:06:00<joepie91>seems like a pretty important site
15:06:00<joepie91>ha
15:06:00<joepie91>Think twice before using BT as an ISP.
15:06:00<joepie91>on the homepage of that site
15:06:00<joepie91>BT used to provide its internet subscribers with a small amount of personal webspace, but did not promote the service so only the oldest most loyal customers used it. Now it no longer wishes to satisfy these customers and is closing the service down. So this page and others of mine, which have received over 2 million hits in 13 years, have to move.
15:06:00<joepie91>If your browser does not automatically go to http://www.se16.info/index.htm within a few seconds, you may want to go to the destination manually.
15:06:00<joepie91>My conclusion is that if you ever consider BT as a possible ISP for some reason, you should not expect that reason to last.
15:07:00<SmileyG>yah
15:09:00<alard>joepie91: We already had it. :) Processed items: 1, added to main queue: 0
15:12:00<joepie91>alright :P
15:12:00<joepie91>brb
15:14:00<DoubleJ>alard: Quick question about the warrior: If there are multiple warcs waiting to upload, how does it decide which one goes next?
15:15:00<alard>LIFO, I think, but if you really want to know you should check here: https://github.com/ArchiveTeam/seesaw-kit/blob/master/seesaw/task.py#L72-107
15:17:00<DoubleJ>I... have no idea what I'm looking at.
15:18:00<DoubleJ>But since it looks like array manipulation, I'm guessing my request to do smallest file first is a no-go.
15:19:00<alard>That would be hard, I think. Then the queueing thing would have to know about file sizes.
15:19:00<alard>And does it really matter?
15:19:00<DoubleJ>Kinda-maybe. It'd free up more threads to download quicker.
15:20:00<DoubleJ>As it is there are times when all my worker threads are waiting for one upload to finish so they can go.
15:20:00<DoubleJ>Of course then you'd have a problem with large files never uploading, but you could conceivably have that with LIFO as well and I haven't seen it happen yet.
15:22:00<alard>Maybe the upload limit should just go.
15:23:00<alard>Some people wanted it in the previous warrior.
15:23:00<SmileyG>I limit the VM, shrug.
15:23:00<DoubleJ>Upload limit, as in throughput, or as in waiting turns?
15:24:00<alard>Waiting turns. I think the thinking then was that one rsync uploads faster, so it can start downloading sooner.
15:24:00<alard>The opposite of what you say now, basically. :)
15:24:00<DoubleJ>I can kinda see that, since the overhead for switching wouldn't help overall.
15:24:00<SmileyG>wasn't it because the upload location was really slow at one point?
15:24:00<SmileyG>and no one could finish anything :D
15:24:00<SmileyG>ended up eating all the space on the warriors.
15:25:00<DoubleJ>Is there someplace I can set it to let 2 upload at once, see if there are any wins to be had that way?
15:26:00<SmileyG>yup
15:26:00<SmileyG>you running vm?
15:26:00<SmileyG>I have upto 6 uploads at once.
15:26:00<DoubleJ>Yes.
15:26:00<SmileyG>ok, on the vm window
15:26:00<SmileyG>alt+F3
15:26:00<DoubleJ>OK, log in to the VM. Got that.
15:26:00<SmileyG>nano -w /home/warrior/projects/webshots/pipeline.py
15:27:00<SmileyG>ctrl+w
15:27:00<DoubleJ>(Well, I will have that, about 6:00 tonight. can't access the VM from work :) )
15:27:00<SmileyG>Ah ok
15:27:00<SmileyG>I need to do a page on this on the wiki
15:27:00<DoubleJ>But keep going. I'll check the scrollback tonight.
15:29:00<DoubleJ>alard: Dunno what project it was requested for, but webshots may just be a different critter. Large variation in upload sizes. Waiting is probably still good, we just might want to be smarter about the criteria for deciding who's next :)
15:29:00<DoubleJ>But the current warrior wins on simplicity.
15:29:00<alard>Is it worth removing the limit?
15:29:00<SmileyG>type LimitConcurrent and hit enter, and change the 1 to 6 (or whatever figure)
15:29:00<DoubleJ>(At least, I think it does. I can read Python about as well as I can read Japanese. (Not at all.))
15:30:00<DoubleJ>I'll try mine tonight. It may let smaller files squeak out, but it may also take longer because of drive-spinning at either end.
15:32:00<alard>Word of caution: if you change the pipeline.py in your warrior, you may break future updates. (If git can't figure out how to apply the update to your modified version.)
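For reference, the edit being described amounts to a one-character change in pipeline.py; something like this sketch, assuming the upload step is wrapped in a LimitConcurrent(1, ...) call as SmileyG's instructions suggest:

    # bump concurrent rsync uploads from 1 to 6; per alard's warning, local edits
    # like this can make future 'git pull' project updates fail
    sed -i 's/LimitConcurrent(1,/LimitConcurrent(6,/' /home/warrior/projects/webshots/pipeline.py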
15:32:00<SmileyG>heh, i seem to have broken it anyway ¬_¬
15:32:00<SmileyG>still getting no output
15:33:00<alard>Stop the project, go into your warrior and use git pull to figure out what's wrong?
15:33:00<DoubleJ>Understood. But define "break". Update won't apply, warrior will conk out, house burns down, what?
15:33:00<alard>I think you can expect the SmileyG problem.
15:34:00<DoubleJ>Ah.
15:34:00<SmileyG>webserver runs, nothing else does :D
15:34:00<alard>So you'll have to login, use git pull to figure out what's going wrong.
15:34:00<DoubleJ>And as we're talking about it my 261-meg user finishes :)
15:35:00<primus>alard, would it work to just delete project and restart warrior?
15:35:00<SmileyG>alard: I'd vote for keep the limit, but add option to change it.
15:35:00<alard>SmileyG: Is that worth stopping every warrior? (That's what happens if I push an update. Every warrior will finish its current task and restart the project.)
15:36:00<alard>primus: That would work.
15:36:00<SmileyG>alard: can't you just do the update and let them pull it in time?
15:36:00<DoubleJ>Yeah, restarting warriors on this project I think is worse.
15:36:00<alard>Define "in time"?
15:36:00<SmileyG>when ever they restart their vm?
15:36:00<alard>No. They check for updates on github.
15:36:00<SmileyG>Also, add "Check for updates" button to settings page?
15:36:00<DoubleJ>Heh. Like Windows Update. "Updates to this warrior are now available. Apply? This may require your warrior to restart."
15:36:00<primus>lol
15:37:00<SmileyG>where do I run the git pull?
15:37:00<alard>What we should have, in a future version, is a gradual update.
15:37:00<alard>cd /home/warrior/projects/$project/
15:37:00<alard>(perhaps su -u warrior first)
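Put together, recovering from a locally modified pipeline.py would look roughly like this (project name illustrative):

    cd /home/warrior/projects/webshots
    git checkout -- pipeline.py   # discard the local edit that blocks the update
    git pull                      # fetch the current project code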
15:38:00<SmileyG>hmmm it's moaning about the changes in pipeline
15:39:00SmileyG changes it back and git pulls
15:39:00<DoubleJ>It'd probably be an awful bitch, but would the multiple-project stuff be useful for that? So /home/warrior/projects/$project.$version instead? Let one run out while the new one sees threads disappear and spins up?
15:39:00<DoubleJ>s/stuff/idea/
15:40:00<SmileyG>alard: ok I see the new rsync code...
15:40:00<SmileyG>need to restart the warrior for web interface to update?
15:41:00<SmileyG>or is it only set via the code (And won't this then cause git to explode again?)
15:41:00<SmileyG>:O
15:41:00<SmileyG>ITS GONE CRAZY
15:41:00<SmileyG>15 users and counting on one screen
15:43:00<SmileyG>There we go...
15:43:00<SmileyG>that is bonkers when it first starts up
15:43:00<SmileyG>you just see hundreds of boxes popping up
15:44:00<SmileyG>alard: I remember - The script to create the 50Gb tars couldn't keep up for FortuneCity, that's why the rsync got limited.
15:54:00<alard>DoubleJ: Yes, that's similar. (I was thinking it might be better to have the cloned git repo in /home/warrior/projects/$project, as the most up-to-date version, then do a clone to /data/projects/$project.$version before starting a project.)
16:37:00<alard>Have we killed fos?
16:38:00<SmileyG>:O
16:39:00<SmileyG>2Kb/s! \o/
16:39:00<SmileyG>Oh its coming back now
16:40:00<SmileyG>Planned Delivery Date
16:40:00<SmileyG>Wednesday 10th October
16:40:00<SmileyG>Planned Delivery Time
16:40:00<SmileyG>Between 07:30 and 17:30
16:40:00<SmileyG>Wed Oct 10 17:40:33 BST 2012
16:40:00<SmileyG>HERP?
17:08:00<joepie91>HEY
17:08:00<SmileyG>yeah the uploads are totally dead?
17:08:00<joepie91>primus
17:08:00<joepie91>:(
17:08:00<joepie91>you've overtaken me
17:08:00<joepie91>SmileyG: ?
17:08:00<SmileyG>4587520 39% 12.21kB/s 0:09:45
17:08:00<SmileyG>[sender] io timeout after 300 seconds -- exiting
17:09:00<joepie91>sec
17:09:00<joepie91>wtf, mine is dead
17:09:00<SmileyG>Retrying RsyncUpload for Item jpr.tree after 30 seconds...
17:13:00<SmileyG>.... brokeyd :D
17:13:00<SmileyG>alard: did you break something :(
17:21:00<joepie91>my rsyncs are dying..
17:21:00<joepie91>rsync: failed to connect to fos.textfiles.com: Connection timed out (110)
17:21:00<joepie91>Process RsyncUpload returned exit code 10 for Item andrewjjstanley
17:21:00<joepie91>Retrying RsyncUpload for Item andrewjjstanley after 30 seconds...
17:21:00<joepie91>rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.7]
17:22:00<SmileyG>yah
17:22:00<SmileyG>:<
17:23:00<SmileyG>they retry, but still its killed all progress :<
17:23:00<joepie91>oh
17:23:00<joepie91>they run now
17:24:00<alard>http://isup.me/fos.textfiles.com
17:26:00<alard>I think this is a SketchCow problem.
17:27:00<SmileyG>:<
17:27:00<alard>(The warriors will retry 50 times with 30 second pauses before they fail.)
17:28:00<SmileyG>:< herp.
17:34:00<joepie91>alard: it responds to ping
17:46:00<SmileyG>alard: se16 0MB << hey look :D
18:21:00<joepie91>SmileyG: mmm
18:21:00<joepie91>it's probably because he replaced the index page
18:22:00<SmileyG>joepie91: yeah I figured it might be that.
18:22:00<SmileyG>well it makes sense, the script forwards you off site.
18:41:00<underscor>fos is currently down-ish
18:41:00<underscor>fyi
18:41:00<chronomex>ish
18:41:00<chronomex>how can a box be down-ish
18:42:00<SketchCow>He's mincing words.
18:42:00<underscor>it still pings
18:42:00<SketchCow>It's down.
18:42:00<SketchCow>It's superdown.
18:42:00<underscor>VMs at archive have 3 states. Up, nossh/services, and noping
18:43:00<underscor>anyway, yeah, it's turbofucked
18:46:00<Nemo_bis>how does tpb fetch Google Books' stuff? does it accept suggestions? http://lists.wikimedia.org/pipermail/wikisource-l/2012-October/001204.html
18:49:00<underscor>wait
18:49:00<underscor>how is rsync still working if fos is down :O
19:13:00<SketchCow>OKAY HI
19:13:00<SketchCow>NEED HELP
19:14:00<SketchCow>https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0
19:14:00<SketchCow>OK, that's a listing of all archiveteam projects on archive.org.
19:14:00<SketchCow>1. Please see if I missed any.
19:15:00<SketchCow>(i.e. just browse through the archiveteam set to see)
19:28:00<underscor>haha, I love the item counts
19:28:00<underscor>26, 70, 29, 3956
19:35:00<chronomex>is IA down? not working for me.
19:39:00<godane>its not working for me too
19:39:00<chronomex>k
19:42:00<SmileyG>SketchCow: you missed the most famous of all - geocities.
19:45:00<joepie91>heh
19:45:00<joepie91>okay, maybe a recursive grep through my entire repository folder was a bad idea
19:46:00<alard>Geocities isn't warc.
19:46:00<underscor>IA is fucked right now
19:46:00<underscor>please leave a message after the beep
19:46:00<underscor>:D
19:46:00chronomex waits for the beep
19:46:00<underscor>boop
19:46:00SmileyG hears helicopters
19:47:00<underscor>But yeah, it's down. One of the core boxes decided to take a dump all over everything, people are working on fixing now
19:47:00<chronomex>ok, I'm not in a hurry
19:47:00<joepie91>underscor: wat
19:47:00<joepie91>IA went down?
19:48:00<underscor>it's down right now
19:48:00<SmileyG>we broke it ¬_¬
19:48:00<underscor>lol
19:49:00<joepie91>oh wow
19:50:00<alard>Can't edit the list, but Cinch is missing. City of Heroes (two items, I think: boards and www).
19:52:00<alard>Qaudio.
20:04:00<joepie91>god I hate efnet
20:05:00<joepie91>anyway
20:05:00<joepie91>is anyone up for testing a useful script?
20:05:00<joepie91>wrote a script that takes a glob pattern, then tries to figure out (from extension) what kind of archive each file is, and prints the decompressed contents to stdout using the appropriate application, without actually unpacking it
20:05:00<joepie91>consider it a 'cat' for archives :)
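joepie91's actual script is Python, but the dispatch-on-extension idea looks roughly like this shell sketch:

    #!/bin/bash
    # 'cat' for archives: print the decompressed contents of each file to stdout
    for f in "$@"; do
      case "$f" in
        *.tar.gz|*.tgz) tar -xzOf "$f" ;;          # -O streams member contents to stdout
        *.tar.bz2)      tar -xjOf "$f" ;;
        *.gz)           gunzip -c "$f" ;;
        *.bz2)          bunzip2 -c "$f" ;;
        *.zip)          unzip -p "$f" ;;
        *.7z)           7z e -so "$f" 2>/dev/null ;;
        *)              echo "unknown archive type: $f" >&2 ;;
      esac
    done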
20:14:00<SmileyG>so like zcat?
20:15:00<chronomex>igelritte: you know you can be in multiple channels at once, right?
20:15:00<underscor>igelritte: yeah, most of us are in both
20:16:00<chronomex>well, actually, I don't know how to do it with pidgin
20:16:00<chronomex>but I think you can
20:16:00<underscor>just /j #channel1 and /j #channel2
20:16:00<underscor>they open up as tabs
20:16:00<underscor>at least in my pidgin
20:17:00<igelritte>yeah, I didn't think about it
20:17:00<igelritte>whateve's. I'm here now
20:18:00<chronomex>k
20:19:00<igelritte>so, tell me more about your structure and how one can plug in.
20:21:00<igelritte>Is it some starry-eyed-open-source-free-for-all? Or is there a process wherein you tell a gatekeeper what you can do, what you're experienced with, and then they tell you where you can start helping?
20:22:00<chronomex>freeforall.
20:23:00<igelritte>I've seen Mr. Scott's presentation at Defcon on how AT is going to save your shit...which sounds good to me...but that doesn't tell me a lot about how the group is organized.
20:23:00<SmileyG>some people write code
20:23:00<SmileyG>I appear and make comments
20:23:00<SmileyG>most people run some sort of downloaders
20:23:00<SmileyG>godane is ..... well I don't know :D
20:24:00<mistym>There are often projects you can help in by running code written by others, basically volunteering your bandwidth to help out.
20:24:00<chronomex>godane is affiliated but mostly works on solo projects
20:24:00<mistym>Those are usually advertised on the wiki and IRC, plus I think there's a mailing list for it now too.
20:24:00<igelritte>Unfortunately, I'm not really in a good position at the moment to run downloaders or anything else that requires a 24 hour network connection.
20:24:00<SmileyG>If you haven't got bandwidth, then you can help with the wiki and possibly coding...
20:25:00<SmileyG>doesn't need 24hr, it'll work when you can
20:25:00<SmileyG>up to a point
20:25:00<DFJustin>joepie91: that already exists as lsar in The Unarchiver, although it's all built-in and not invoking other apps
20:25:00<igelritte>I'm following this silly dream about living in Germany which means that my current address is--shall we say--fluid.
20:25:00<DFJustin>oh wait I'm wrong nm
20:26:00<DFJustin>keep forgetting unix cat is not the same as apple II cat :)
20:26:00<igelritte>Are most people in North America?
20:26:00<chronomex>a good number but by no means all
20:26:00<SmileyG>i'm UK
20:27:00<igelritte>I got that from the presentation. Something about a kid of 15 in Australia being threatened with legal action for downloading poetry.
20:27:00<DFJustin>igelritte: jason is in the gatekeeper role more or less, or cat herder if you prefer
20:27:00<chronomex>in order probably US, UK, AU, .eu
20:27:00<igelritte>Jason seems to do a lot.
20:29:00<DFJustin>but there's a lot of empowerment if you see something to just do it yourself
20:29:00<igelritte>Well, I can definitely help with the wiki
20:30:00<igelritte>when you say, 'coding', what do you mean?
20:30:00<soultcer>Programming stuff that downloads stuff
20:30:00<igelritte>I have a fair amount of experience with BASH scripting
20:31:00<igelritte>what are you guys using to download stuff?
20:31:00<DFJustin>perfect
20:31:00<DFJustin>primarily wget
20:31:00<igelritte>oh, hold on there, soldier, my BASH scripting is far from perfect
20:31:00<joepie91>DFJustin: The Unarchiver sounds like a comic hero :P
20:31:00<DFJustin>it's like a real life superhero
20:31:00<igelritte>but I have written some stuff using wget to batch download stuff for myself
20:32:00<DFJustin>the main difference is we use a parameter to wget to have it produce .warc files which are a full record of HTTP headers etc. suitable for going into the wayback machine
20:32:00<igelritte>lectures from the opencourse ware project at MIT
20:32:00<igelritte>hmmm
20:33:00<alard>Yes, so if you download anything for archiving, use the --warc-file option (available in Wget 1.14).
20:34:00<igelritte>hmmm. It appears that the wget that comes with Ubuntu these days is 1.13
20:34:00<igelritte>at least, so says dpkg
20:35:00<mistym>You'll need to build it yourself then (or grab a newer package). .warc support wasn't added until 1.14.
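Building it from source is short; a sketch assuming a Debian-ish box (package names and the mirror URL may vary):

    sudo apt-get install build-essential libgnutls-dev zlib1g-dev
    wget http://ftp.gnu.org/gnu/wget/wget-1.14.tar.gz
    tar xzf wget-1.14.tar.gz && cd wget-1.14
    ./configure --with-ssl=gnutls   # SSL for https, zlib for .warc.gz support
    make && sudo make install       # installs to /usr/local/bin/wget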
20:35:00<DFJustin>for our big multi-user projects we supply a ready-made VM with everything all set up and just a go button to push
20:35:00<igelritte>okay
20:35:00<igelritte>um, what are warc files and why use them?
20:36:00<DFJustin>warc is a standardized format for web archives, it includes all the HTTP response data from the server (not just the file contents) so that you can "play it back" with a proxy and duplicate the original site exactly
20:36:00<igelritte>You'all are interested in full HTTP headers, or the way back machine?
20:36:00<igelritte>interesting
20:37:00<igelritte>very interesting
20:37:00<DFJustin>the main impetus is that it's a requirement for wayback to integrate the data (proper timestamps are a necessity, for example)
20:37:00<igelritte>Okay, I can see what you're saying
20:38:00<DFJustin>everyone grabbed geocities kind of higgledy-piggledy and it's hard to pin down the dates for anything because of filesystems, time zones, modification time vs download time etc
20:39:00<DFJustin>so the later projects have been standardized on warc
20:39:00<igelritte>The Geocities project was quite an accomplishment
20:41:00<DFJustin>warc is big with the pointy-headed academic world because of formal documentation etc. so that gives us an in with that crowd too
20:41:00<DFJustin>unfortunately the end user tools for it are not great yet
20:43:00<igelritte>I loved Jason's picture of the datacenter where the nine terabytes were housed. It reminded me of this scene from 'Connections'--that interesting spin on discovery and invention that came out in the 70's by James Burke--where he holds up an old tape cartridge and expounds: "this device holds one million characters," in that tone of voice like the audience is supposed to piss themselves in amazement. You then do the math and realize that
20:43:00<joepie91>DFJustin: is there a format specification for warc?
20:43:00<joepie91>one that is publicly accessible
20:44:00<DFJustin>ISO 28500
20:45:00<joepie91>CHF 122,00
20:45:00<joepie91>eh.
20:46:00<joepie91>DFJustin: anything or any place that *doesn't* want to see the inside of my wallet?
20:46:00<joepie91>:|
20:46:00<DFJustin>obviously, you can google it just as well as I can though
20:46:00<joepie91>yes, and I only get drafts
20:47:00<joepie91>do I seriously have to pirate a document to figure out what warc looks like
20:47:00<joepie91>:|
20:47:00<igelritte>I have to say that you folks seem downright Edwardian in your manners. Most of my experiences in chatrooms with tech-savvy folks have not been so pleasant.
20:48:00<SmileyG>:D
20:48:00<SmileyG>Most people suck.
20:48:00<SmileyG>I think the fact everyone is here because they care about it helps, rather than being here because of "work" or other reasons.
20:49:00<DFJustin>my suspicion is that the 0.18 draft is the same as the final because international standards move slow but I'll defer that to somone whose head is pointier :)
20:49:00<igelritte>I was working on Linux from Scratch a few years back; their IRC...well, let's just say that you need a thick skin.
20:49:00<alard>I believe the bib-something site has a PDF of a draft of the warc spec.
20:49:00<alard>The warc people at archive.org assured me that that's what they use.
20:49:00<igelritte>And none of those people were there for work...
20:49:00<DFJustin>http://bibnum.bnf.fr/WARC/warc_ISO_DIS_28500.pdf
20:50:00<SmileyG>ah yeah hmm
20:50:00<alard>That's it. Just replace the version header WARC/0.18 with WARC/1.0, or something.
20:50:00<SmileyG>igelritte: I've been "both" sides of the arguement
20:50:00<alard>There's also a warc implementation guidelines document somewhere.
20:51:00<joepie91>alard: the draft is representative?
20:51:00joepie91 really hates 'standards' that you can't just view
20:52:00<alard>Yes, I believe so. The Heritrix implementation is based on the same draft, so that's something.
20:52:00<igelritte>Tell me about it joepie91. I worked in telecom for years. Any idea what they want for a membership to the ITU?
20:52:00<alard>http://netpreserve.org/publications/WARC_Guidelines_v1.pdf
20:52:00<joepie91>igelritte: not sure I even want to know the amount of digits
20:53:00<igelritte>It's pretty gross
20:53:00<joepie91>alard: that 404s
20:53:00<joepie91>anyhow, I'll use the bibnum one then
20:54:00<alard>Does it? I just copied the link I put on the wiki months ago. :)
20:54:00<SmileyG>http://archiveteam.org/index.php?title=BT_Internet C-, needs work
20:54:00<SmileyG>:D
20:54:00<alard>http://www.netpreserve.org/resources/warc-implementation-guidelines-v1
20:54:00<alard>http://www.netpreserve.org/sites/default/files/resources/WARC_Guidelines_v1.pdf
20:55:00<joepie91>thankies
20:55:00<alard>(It's pretty silly that an "internet preservation consortium" doesn't have stable urls.)
20:55:00<DFJustin>one of the nice things about WARC though is it's basically human readable, you open it up and bam headers
20:55:00<DFJustin>so it's reasonably future-proof
20:57:00<joepie91>lol alard
21:00:00<SmileyG>Can't upload images to wiki?
21:00:00<igelritte>When you watch Jason's presentation at Defcon, you know that other people are involved and that recruits are needed, but the specifics are still a little vague. I guess that I've spent so much time interacting with organizations by being told what to do that the free-for-all comes off as very chaotic. Still not very sure where I can plug in.
21:00:00<SmileyG>why didn't I see "upload file" ? XD
21:00:00<joepie91>hmm, interesting... http://www.webarchivingbucket.com/
21:00:00<joepie91>igelritte: link to presentation?
21:01:00<igelritte>sure
21:02:00<DFJustin>well our formal projects now are all "run the warrior VM" where we tell your computer exactly what to do
21:02:00<joepie91>www.btinternet.com/~catechnology
21:02:00<joepie91>www.btinternet.com/~ted.power
21:02:00<joepie91>www.dgsgardening.btinternet.co.uk
21:02:00<joepie91>www.mstracey.btinternet.co.uk
21:02:00<joepie91>cc alard
21:02:00<DFJustin>it's just that on top of that people have their own archiving side projects that are related to the mission in varying degrees
21:02:00<alard>joepie91: http://tracker.archiveteam.org/webshots/rescue-me
21:03:00<joepie91>alard: webshots?
21:03:00<joepie91>shouldn't that be btinternet?
21:03:00<alard>Oops, sorry, http://tracker.archiveteam.org/btinternet/rescue-me
21:03:00<joepie91>:P
21:03:00<DFJustin>is that expecting urls or user names
21:03:00<alard>usernames
21:04:00<joepie91>0 items added to the queue
21:04:00<joepie91>Thanks for your help!
21:04:00<joepie91>lol
21:04:00<alard>Heh.
21:05:00<alard>The tracker really appreciates your contribution, it just wasn't useful. :)
21:05:00<joepie91>haha
21:06:00<joepie91>looks like catarc works well :)
21:06:00<joepie91>http://sebsauvage.net/paste/?9e695a09848493ea#Yy3GjmiyMI4bfhUcKv9vahutcX48KTJBHLivJh8l2BU=
21:06:00<DFJustin>nice regex
21:07:00<underscor><igelritte> I got that from the presentation. Something about a kid of 15 in Australia being threatened with legal action for downloading poetry.
21:07:00<underscor>I can't remember
21:07:00<underscor>hahahahahaha
21:07:00<underscor>was that bluemax?
21:08:00<SmileyG>what happened with that o_O
21:08:00<underscor>joepie91: we conform to the draft fyi
21:09:00<SmileyG>http://archiveteam.org/index.php?title=BT_Internet <<< wtf is with the no description below the image
21:09:00<joepie91>ok, thanks :P
21:09:00<underscor>we being archive.org
21:09:00<underscor>SmileyG: lulu poetry's IT department sent a scary letter to him
21:09:00<underscor>"scary" "letter"
21:10:00<SmileyG>o
21:11:00<joepie91>igelritte: does a video of the defcon presentation exist?
21:11:00<joepie91>I can't find it
21:14:00<alard>SmileyG: The "No description" comes from the image, I think.
21:14:00<SmileyG>except it has a description :/
21:16:00<Dark-Star>problems with the archive? I'm getting "rsync: failed to connect to fos.textfiles.com: Connection timed out (110)" all the time
21:16:00<alard>SmileyG: Oh. Then maybe it's in the template? http://archiveteam.org/index.php?title=Template:Infobox_project&action=edit
21:18:00<underscor>Dark-Star: it's down atm
21:19:00<Dark-Star>ah okay. I'll just leave the Warrior running overnight then. I guess it'll automatically resume the upload later
21:23:00<SmileyG>alard: ah yeah hmmm :S
21:24:00<SmileyG>weird because the mobile me one doesn't do it
21:26:00<igelritte>right on...I'm not as stupid as I originally suspected
21:26:00<igelritte>GNU Wget 1.14 built on linux-gnu.
21:27:00<igelritte>I now have the ability to support warc
21:27:00<igelritte>though, my dpkg still thinks that I'm working with 1.13
21:28:00<igelritte>It's probably been six months or more since I've compiled and installed anything from scratch. It's funny how quickly you forget that shit.
21:28:00<alard>igelritte: I don't want to temper your enthusiasm and sense of achievement, but you might want to check if your new Wget includes gzip and SSL support. It's in wget -V, I think.
21:30:00<igelritte>well, I'm pretty sure that it does because I kept getting an SSL error and had to dig into why and then install libcurl and libgnutls dev packages in order to get wget to compile correctly
21:30:00<igelritte>but I will check
21:30:00<alard>Ah good, then it'll probably work.
21:30:00<alard>soultcer: Starting TinyBack for Item
21:31:00<alard>(Hint: the git clone is very slow if there's no .git in the repository url: https://github.com/soult/tinyback.git )
21:32:00<soultcer>It is? Damn, I always felt so clever because I had to type 4 characters less
21:32:00<igelritte>well, right under the version number, you get the following list: +digest +https +ipv6 +iri +large-file +nls -ntlm +opie +ssl/gnutls
21:32:00<soultcer>http://tracker.tinyarchive.org/v1/ <-- "ranking"
21:33:00<alard>soultcer: It's strange, because it does seem to work, but it just takes a long time. I was wondering what my warrior was doing.
21:33:00<igelritte>I'm not sure about the 'wget-V, I' syntax...is that supposed to be 'wget -V -I'?
21:33:00<igelritte>or really a comma
21:33:00<alard>Heh. The comma and I are part of the sentence. :)
21:33:00igelritte laughs at self
21:34:00<primus>igelritte: if you're interested in downloading you can download the ArchiveTeam Warrior virtual machine - it has everything already set up. http://archive.org/details/archiveteam-warrior
21:35:00<alard>To check if you have gzip support, use: wget --help | grep warc-compression and see if it returns something. If it does, it works.
21:35:00<igelritte>I'm a little limited on what I can do with downloading at the moment. This network connection is not really my own.
21:36:00<DFJustin><joepie91> igelritte: does a video of the defcon presentation exist? <-- https://www.youtube.com/watch?v=-2ZTmuX3cog
21:38:00<igelritte>alard: I get the "no-warc-compression"; I'm guessing that warc uses gzip for compression
21:38:00<igelritte>?
21:40:00<alard>Then your Wget is in top condition. The thing with gzip is: you can make .warc and .warc.gz files. It is much better to do the gzip compression in Wget than to do it afterwards. Wget makes a new gzip record for each downloaded file, so it's possible to extract only part of the .warc.gz. If you use the gzip utility to compress your warc afterwards, you can only decompress everything at once.
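Concretely, with the flags Wget 1.14 provides (URL illustrative):

    wget --warc-file=example "http://example.com/"
    # writes example.warc.gz, one gzip member per warc record

    wget --no-warc-compression --warc-file=example "http://example.com/"
    # writes a plain example.warc you can read directly in a pager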
21:43:00<igelritte>Just performed a quick little test where I ran the following: wget --warc-file test http://en.wikipedia.org/wiki/Jason_Scott_Sadofsky. This seems to have created the 'test' file that I asked for.
21:43:00<igelritte>-rw-rw-r-- 1 23386 Oct 10 23:41 test.warc.gz
21:44:00<joepie91>quick question to alard: how does one write a setup.py where the resulting install package will copy a python file to the bin directory?
21:44:00<joepie91>/usr/bin etc
21:44:00<alard>gunzip -c test.warc.gz to look inside
21:45:00<alard>Why do you think I would know? I'm a copy-paste setup.py writer. :)
21:45:00<alard>scripts, I think: https://github.com/ArchiveTeam/seesaw-kit/blob/master/setup.py#L41-44
21:46:00<joepie91>well, seesaw does it :P
21:46:00<joepie91>and alright, thanks
21:46:00<alard>I thought you were the python distribution / pip / pypi expert. :)
21:47:00<igelritte>very interesting. That seems to have worked. I DO have an HTTP document. It doesn't look anything like a wiki, but I'm guessing I know why that is.
21:48:00<joepie91>alard: oh, not at all
21:49:00<joepie91>I just know how to package up a module with an existing setup.py
21:49:00<joepie91>:P
21:49:00<joepie91>and that's it
21:57:00<igelritte>so, when I unpack this archive file (warc) I should expect to find nothing but pure HTTP?
21:57:00<alard>You'll find warc records, some of which have a HTTP body.
21:58:00<igelritte>hmmm
21:58:00<alard>You get some warc headers identifying the record (type, target-uri, timestamp etc.), then the http request or response.
21:58:00<alard>There are special types of warc records with metadata, such as the wget command line and log.
21:59:00<alard>So it's not the most user-friendly format, you need to work to get the data out.
21:59:00<alard>The good thing is that everything is in the file, so you *can* get it out.
22:00:00<igelritte>This is all just for my education; so, feel free to tell me to fuck off when you lose patience. But, where can I find these headers? When I open the file with a text editor, it appears to be just HTML.
22:01:00<alard>You'll have to look better then, they're in there.
22:01:00<alard>It starts with WARC/1.0 or something, then there's WARC-Target-URI, etc.
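An illustrative response record, to show what to look for (all values made up):

    WARC/1.0
    WARC-Type: response
    WARC-Target-URI: http://www.btinternet.com/~example/index.html
    WARC-Date: 2012-10-10T21:58:00Z
    WARC-Record-ID: <urn:uuid:12345678-1234-1234-1234-123456789abc>
    Content-Type: application/http; msgtype=response
    Content-Length: 1234

    HTTP/1.1 200 OK
    Content-Type: text/html
    ...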
22:04:00<SketchCow>Hey, so my commentary before.
22:06:00<alard>It has scrolled away. :)
22:06:00<DFJustin>SketchCow: http://archive.org/details/archiveteam-city-of-heroes-www is not on the list
22:07:00<igelritte>craziness...I just used vi on the test.warc.gz file and the headers you mentioned showed up. Vi also showed me all the decompressed content. I didn't know that vi could do that...
22:07:00<SmileyG>SketchCow: geocities - there's a dump on the ia but I can't find it anymore (and it was searchable.... we really need to make those links more accessible...)
22:08:00<alard>http://archive.org/details/archiveteam-qaudio-rescue
22:08:00<alard>http://archive.org/details/archiveteam-cinch
22:08:00<joepie91>wait wait wait wait, what? Jeroenz0r is/was part of urlteam?
22:09:00<SketchCow>Only WARC items. So Geocities predates that.
22:10:00<SmileyG>ah k
22:12:00<igelritte>Perhaps I'm really thick here...and that wouldn't be a surprise...but I'm still not seeing how I can contribute. Is there a list of "shit that needs to get done and we'd be thrilled if you'd take it on" somewhere?
22:12:00<SketchCow>Both added, alard
22:12:00<SketchCow>What's your skillset, igelritte?
22:12:00<DFJustin>various godane grabs(tm) at https://archive.org/search.php?query=warc%20uploader%3A%22slaxemulator%40gmail.com%22
22:13:00<alard>There are some groklaw.net warcs: http://archive.org/details/groklaw.net-pdfs-2004-20120827
22:13:00<igelritte>Well, I've done some BASH scripts. I'm trilingual. I've done lots of networking.
22:13:00<igelritte>And there's a bunch of voip in there too
22:13:00<alard>http://archive.org/search.php?query=groklaw%20warc
22:14:00<joepie91>igelritte: is there any chance you can turn the install script for the webshots script, into something more sane?
22:14:00<joepie91>because I suck at bash :P
22:14:00<igelritte>I'm not that awesome at it either, but I can look at it.
22:14:00<joepie91>current script is at http://cryto.net/projects/webshots/webshots_debian.sh
22:14:00<alard>http://archive.org/search.php?query=warc%20journalstar (but it's getting more obscure now)
22:14:00<joepie91>thanks :)
22:16:00<igelritte>Hmmm...
22:16:00<nintendud>joepie91: you can set a trap on error to avoid all the conditionals
22:16:00<igelritte>this could use some commenting and perhaps a header
22:16:00<nintendud>and then have it print "Error on line x". Not as nice of a message though.
22:17:00<igelritte>who wrote this? And why are they doing an apt-get at the beginning?
22:17:00<joepie91>igelritte: I did
22:17:00<joepie91>and the apt-get is to install dependencies
22:18:00<alard>http://archive.org/search.php?query=uploader%3A%28slaxemulator%40gmail.com%29%20AND%20warc
22:18:00<DFJustin>is there an echo in here
22:18:00<igelritte>I think I see what you're doing here, and I understand why you would do an apt-get update before doing an install
22:18:00<alard>DFJustin: Oh, sorry. :)
22:18:00<igelritte>but, I don't think I understand enough of the purpose here to understand why you would do that in a script
22:19:00<joepie91>igelritte: it's apt-get update, not upgrade
22:19:00<joepie91>just updates the package list
22:19:00<igelritte>I'm guessing that my ignorance is to blame
22:19:00<igelritte>right
22:19:00<nintendud>joepie91 / igelritte: here's a nice article on BASH traps, btw. http://phaq.phunsites.net/2010/11/22/trap-errors-exit-codes-and-line-numbers-within-a-bash-script/
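The pattern from that article, as a minimal sketch:

    #!/bin/bash
    set -e                                          # abort on the first failing command
    trap 'echo "Error on line $LINENO" >&2' ERR     # report where it died

    apt-get update
    apt-get -y install wget rsync   # any failure here now triggers the trap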
22:19:00<igelritte>typo on my part
22:19:00<joepie91>had it break for some people because the package lists weren't up to date, so that's why update is there :)
22:21:00<nintendud>joepie91: also, why are you using useradd? On Debian, you're supposed to use the adduser command afaik
22:21:00<joepie91>adduser is interactive
22:21:00<nintendud>Doesn't have to be
22:21:00<nintendud>At least, I think you can make it a one-liner
22:21:00<joepie91>iirc I haven't found a way to make it not interactive
22:21:00<joepie91>:P
22:22:00<joepie91>anyway, any particular reason not to use useradd?
22:22:00<nintendud>Does useradd make the home directory?
22:22:00<joepie91>yes
22:22:00<nintendud>o
22:22:00<nintendud>Welp, adduser just follows a nice configuration file that specifies things like the permissions to set on the home directory among other things
22:23:00<nintendud>But I guess useradd works OK. I was just curious. :-)
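For what it's worth, adduser can be made non-interactive too; a sketch (username illustrative):

    # Debian adduser without prompts: no password, empty GECOS field
    adduser --disabled-password --gecos "" downloader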
22:36:00<DFJustin>SketchCow: there are more qaudio items, http://archive.org/details/archiveteam-qaudio-archive-1 through http://archive.org/details/archiveteam-qaudio-archive-7
22:39:00<DFJustin>also fan fiction http://archive.org/search.php?query=%22fan%20fiction%22%20archiveteam
22:41:00<joepie91>right
22:41:00<joepie91>pip install catarc
22:41:00<joepie91>:)
22:41:00<joepie91>cat for archives
22:48:00<SketchCow>OK, so I got out of a meeting about incorporating archive team stuff into wayback
22:48:00<SketchCow>NATURALLY it's slightly more complicated in some cases.
22:49:00<SketchCow>Let me make some changes to the thing.
22:52:00<chronomex>of course it is
22:52:00<chronomex>what kind of changes do they want?
22:55:00<SketchCow>Look at the document again. All green ones are cleared for takeoff.
22:55:00<chronomex>wow, awesome
22:57:00<chronomex>so looks like they can just suck in warc-in-nothing, yes?
22:59:00<SketchCow>Yes
22:59:00<SketchCow>They cannot suck in warc-in-archives
22:59:00<SketchCow>So, next step is to look at the archives ones and see if there's not too many WARCs in it, say less than 100
22:59:00<chronomex>I mean "just suck in" as in "point the ingestor at"
22:59:00<DFJustin>good thing we didn't upload 250tb of that XD
23:00:00<chronomex>lol yes
23:01:00<chronomex>mobileme: 280T of .tar containing .warc.gz
23:01:00<chronomex>soooo
23:02:00<SketchCow>We're aware of it and there'll be a project to deal with that.
23:02:00<SketchCow>But I don't want to rush it.
23:02:00<SketchCow>So Brewster's letting me make doubled files for weird ones.
23:06:00<DFJustin>even if there's a shitload of warcs inside they can all be cat-ed together into one megawarc right
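That works because a .warc.gz is a series of standalone gzip members, and gzip streams concatenate cleanly; a sketch:

    # the result is itself a valid .warc.gz
    cat user1.warc.gz user2.warc.gz user3.warc.gz > megawarc.warc.gz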
23:07:00<arkhive>is there a webshots tracker I can check the progress? (I'm unable to help, I'm just curious how it's going)
23:07:00<DFJustin>http://tracker.archiveteam.org/webshots/
23:07:00<arkhive>thank you :)
23:08:00<DFJustin>underscor making his isp cry again
23:08:00<SketchCow>YEah, but the machine is still down
23:08:00<SketchCow>so I don't know what's going on
23:08:00<SketchCow>DFJustin: Yes, exactly.
23:13:00<joepie91>alard: what about an 'assorted' warrior project
23:13:00<joepie91>with things that are small or heavily rate-limited (like some urlteam targets)
23:13:00<joepie91>that the warrior automatically switches to whenever it has nothing else to do
23:14:00<chronomex>that sounds cool.
23:14:00<joepie91>for example, if the current selected project is done
23:14:00<joepie91>a "let's not waste any time or bandwith that we have" mode, so to say :P
23:14:00<chronomex>urlteam is a basically-no-bandwidth project, it might actually make more sense to run it in the background always.
23:15:00<joepie91>maybe have an 'always running' *and* 'assorted' project
23:15:00<chronomex>yeah
23:15:00<joepie91>separate projects... one always runs, like urlteam
23:15:00<joepie91>and assorted is filled with whatever small project is happening that doesn't warrant its own separate project, really
23:15:00<chronomex>'assorted' would be filler for "let archiveteam choose"
23:15:00<joepie91>as a fallback when it has nothing better to do
23:15:00<joepie91>well yes, but the thing is
23:16:00<joepie91>say that I've got it configured for btinternet
23:16:00<joepie91>the moment btinternet is done, which will be soon
23:16:00<joepie91>my warrior will be bored out of its skull, no?
23:17:00<chronomex>yes
23:18:00<joepie91>would be good if it switched to 'assorted' then :P
23:18:00<joepie91>'let archiveteam choose' has a pretty different function
23:18:00<joepie91>that option should always refer to the most urgent project
23:18:00<joepie91>such as, in this case, webshots
23:18:00<joepie91>assorted would have the stuff that isn't really urgent or significant, but has to be done anyway
23:18:00<joepie91>at some point in time
23:21:00<chronomex>ah
23:36:00<flaushy>hi, is fos.textfiles.com down?
23:36:00<joepie91>it is
23:37:00<flaushy>rsync will happily retry until it reappears, right?
23:37:00<joepie91>if I recall correctly, it will retry 50 times
23:37:00<joepie91>before giving up
23:37:00<joepie91>alard can probably confirm on that
23:37:00<flaushy>:( 50k link user in queue
23:37:00<SketchCow>Fortress of Solitude is Back
23:37:00<joepie91>ouch
23:37:00<joepie91>oh, it is?
23:38:00<joepie91>SketchCow: my warrior disagrees
23:38:00<joepie91>rsync: failed to connect to fos.textfiles.com: Connection timed out (110)
23:38:00<flaushy>same here, but i guess it will work soon then :)
23:39:00<flaushy>probably we are just hammering it currently
23:39:00<flaushy>and thx for the info!
23:39:00<joepie91>aaaaand there it went
23:39:00<joepie91>:D
23:41:00<SketchCow>Hooray, 517 rsync connections.
23:41:00<joepie91>lol
23:41:00<flaushy>working for me now too :)
23:42:00<joepie91>:|
23:42:00<joepie91>uploads just died
23:42:00<joepie91>like, literally flatlined
23:42:00<joepie91>ah, it resumed
23:42:00<joepie91>and flatlined again
23:42:00<joepie91>wat
23:43:00<DFJustin>alard: you wanna run through the usernames in these https://en.wikipedia.org/wiki/Wikipedia:Bot_requests#btinternet
23:43:00<igelritte>so, from the following, I can assume that fos = fortress of solitude and that this is some place where folks are trying to rsync there current downloads to. Feel free to direct me to a link that will shut me up.
23:43:00<igelritte>*thier
23:43:00<igelritte>or maybe their
23:44:00<joepie91>igelritte: yes, fos is where the uploads go
23:44:00<igelritte>At some point grammar will come back to me
23:44:00<chronomex>until then
23:44:00<igelritte>indeed
23:46:00<flaushy>phew... seems like some 1 gb stuff is in queue on nooon
23:50:00<joepie91>DFJustin: http://pastie.org/5032511
23:51:00<joepie91>is the clean version
23:51:00<joepie91>of all usernames for both .com and .co.uk
23:51:00<joepie91>sorted, unique
23:51:00<joepie91>also cc alard, idk if that list is already in the tracker
23:51:00<joepie91>k, time to sleep
23:51:00<joepie91>goodnight all :)
23:51:00<DFJustin>nice thanks
23:59:00<SketchCow>Well, FOS is getting CRUSHED, we'll see how long this lasts.
23:59:00<SketchCow>848 rsync connections
23:59:00<nintendud>lol