00:00:00<nintendud>ah, I'm seeing timeouts in my warrior
00:00:00<nintendud>it must really be getting crushed
00:00:00<primus>what does FOS stand for?
00:01:00<nintendud>Free and Open Source? Maybe?
00:02:00<sankin>just curious, what are the hardware specs for fos?
00:02:00<SketchCow>This is really bad.
00:02:00<nintendud>it's a raspberry pi hooked up to a RAID array
00:02:00<SketchCow>It shouldn't be this hammered.
00:02:00<nintendud>Oh?
00:04:00<nintendud>speaking of 'pi's, apparently you can colocate a pi in Austria for free: https://www.edis.at/en/server/colocation/austria/raspberrypi/
00:04:00<SketchCow>FOS stands for Fortress of Solitude
00:04:00<SketchCow>It replaced a machine named Batcave
00:04:00<nintendud>Hah, nice
00:04:00<SketchCow>FOS became a way to refer to it easily.
00:04:00<primus>:-) awesome, thanks
00:05:00<S[h]O[r]T>i always thought it was a fun take on FiOS because verizon sponsored it :P
00:05:00<nintendud>iFOS. By Apple.
00:05:00<S[h]O[r]T>even though i know that is nowhere near true
00:17:00<nintendud>I wonder if these fixed 30 second retries have us all hammering FOS at the same time.
00:17:00<chronomex>thundering herd effect?
00:18:00<nintendud>TIL that term. Essentially that's it, although more than one client can rsync at a time.
00:18:00<nintendud>it's why random backoff in ethernet is a thing
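(A minimal Python sketch of the randomized-backoff idea being discussed; attempt_upload is a hypothetical callable, and this is not the warrior's actual retry code:)

    import random
    import time

    def retry_with_jitter(attempt_upload, base=30, cap=300):
        # attempt_upload is a hypothetical callable returning True on success.
        # Sleeping a random fraction of the current delay desynchronizes
        # clients so they don't all retry in fixed 30-second waves.
        delay = base
        while not attempt_upload():
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, cap)  # exponential backoff, capped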
00:19:00<nintendud>I keep getting about 5% uploaded before it dies
00:24:00<SketchCow>Machine is seriously getting hammered.
00:24:00<SketchCow>Not sure what to do yet.
00:24:00<SketchCow>Might set rsync.
00:25:00<nintendud>Is it coming in 30 second waves?
00:25:00<nintendud>Or is it just a constant surge of traffic?
00:25:00<SketchCow>ha ha you act like pressing keys makes anything happen.
00:25:00<nintendud>Oh right. The tubes. They are clogged.
00:26:00<S[h]O[r]T>if you have access to the switch or firewall in front of it you can block certain IP ranges or ports to slow down the flow of traffic in
00:26:00<SketchCow>I like where you said that too.
00:26:00<SketchCow>All these suggestions are well meaning and useless.
00:26:00<SketchCow>I'm going to implement a max connections as soon as I can get vi to respond.
00:27:00<S[h]O[r]T>well if you had access to the switch you could just deny all rsync or anything else and allow ssh :p
00:27:00<S[h]O[r]T>that wouldnt be useless
00:28:00<SketchCow>Yes.
00:28:00<SketchCow>So.....
00:28:00<SketchCow>If only we could turn lead into gold, we could solve a number of problems.
00:28:00<SketchCow>But the impossibility of that makes it useless.
00:30:00<SketchCow>Realize my temper is going to be short while I wrestle with a machine with over 1,100 rsync connections active.
00:30:00<nintendud>Yup. Good luck, soldier.
00:31:00<SketchCow>And advice along the line of "to fix the problem, you should fix the problem" is a brain fart
00:32:00<SketchCow>It has been trying to open a vi session for 4 minutes.
00:32:00<SketchCow>That's how bad it is.
00:32:00<SketchCow>I have two other windows, trying to set up a killing of rsync
00:34:00<S[h]O[r]T>im guessing you didnt want any advice then and are just venting
00:34:00<DoubleJ>Mine finally timed out so I was able to pause the VM. So my minuscule part of the load is off.
00:39:00<SketchCow>I set it to 20
00:43:00<SketchCow>Now doing a megakill
00:44:00<SketchCow>Bitches
00:44:00<SketchCow>ps -ef | grep rsync | awk '{print $2}' | xargs kill
00:47:00<nintendud>no killall?
00:47:00<chronomex>or skill
00:48:00<SketchCow>shhh, I'm oldschool
00:48:00chronomex nods knowingly
00:49:00<chronomex>you have legitimate claim to the phrase "I have underwear that's older than your home directory"
00:56:00<igelritte>nice
00:57:00<SketchCow>root@teamarchive-1:/etc# killall rsync
00:57:00<igelritte>though I think if he had used 'ps -aux | grep'... that would have been better
00:59:00<igelritte>looks like it's time for bed. Gettin' a little punchy.
00:59:00<igelritte>later
01:00:00<SketchCow>Machine is pretty hosed.
01:06:00<SketchCow>FOS crashed.
01:07:00<BlueMax>Wow, what happened
01:22:00<SketchCow>DJ Smiley remix of the main page of archiveteam.org now in place.
01:32:00<SketchCow>fos is back
01:32:00<SketchCow>now running with some severe rsync limits while we get shit in shape.
02:18:00<godane>i'm uploading issue 150 dvd of linux format
03:47:00<bsmith096>@ERROR: max connections (5) reached -- try again later
03:47:00<bsmith096>Starting RsyncUpload for Item woodp
03:47:00<bsmith096>getting a whole mess of these
03:47:00<bsmith096>rsync error: error starting client-server protocol (code 5) at main.c(1534) [sender=3.0.9]
03:49:00<S[h]O[r]T>the server (fos) that stuff rsyncs to is limited to 5 rsync connections atm, it was having issues earlier. SketchCow should update it once it's all good at some point
03:51:00<bsmith096>so is the script gonna continue at some point cause it just keeps trying to send those 2 users over and over
03:51:00<S[h]O[r]T>yeah it will keep trying until it gets through
03:51:00<S[h]O[r]T>can just leave it running
03:52:00<underscor>I thought it only tries 50 times
03:52:00<underscor>and then gives up?
03:54:00<S[h]O[r]T>if it does thats 25min and there must be a bug?
03:55:00<S[h]O[r]T>thats good tho :P
03:58:00<S[h]O[r]T>i looked awhile back and again just a bit ago; pretty sure the rsync in the pipeline doesn't have a lot of overhead, but i could be wrong. i know there are some options to turn off compression and use lower-cost encryption that generates less cpu usage.
03:58:00<S[h]O[r]T>for client and server
04:24:00<underscor>S[h]O[r]T: Well, I'm just saying
04:24:00<underscor>with the rate limit on fos
04:24:00<underscor>it's very likely you could not get in in 25m
04:24:00<underscor>and then the thing will just give up
04:24:00<underscor>and your work is wasted
04:24:00<underscor>D:
04:43:00<S[h]O[r]T>im saying its been more than 25min and i havent got in and its still trying
04:47:00<underscor>oic
04:48:00<underscor>maybe I'm wrong
04:48:00<underscor>I just overheard someone say that
04:48:00<underscor>looks like SketchCow upped it to 10
04:48:00<underscor>none of my threads are doing any work still, though
04:48:00<underscor>hopefully we can reopen the floodgates soon
04:48:00<underscor>otherwise we're definitely not going to do well with webshots XD
04:51:00<underscor>yay!
04:51:00<underscor>finally got one in
04:51:00<underscor>w00t
05:13:00<S[h]O[r]T>i dont see it got upped to 10 :P
06:27:00<ivan`>is anyone in the rehosting-dead-pastebins business?
06:27:00<ivan`>100K pastes from paste.lisp.org would be better off googlable
06:33:00<chronomex>do you have them??
06:42:00<ivan`>yes
06:42:00<ivan`>http://ludios.org/tmp/paste.lisp.org.7z
06:42:00<ivan`>chronomex: ^
06:43:00<deathy>something up with warrior upload? getting "@ERROR: max connections (10) reached -- try again later"
06:45:00<chronomex><3
06:45:00<Cameron_D>The server we rsync to is currently limited because it was having problems earlier
06:46:00<chronomex>thanks ivan`! are you involved with paste.lisp.org?
06:46:00<ivan`>no, I think stassats runs it, but his reply did not indicate interest in restoring them
06:46:00<chronomex>aye.
06:47:00<chronomex>wow, this is a lot of files
06:47:00<ivan`>heh
06:47:00<deathy>hoping limit gets increased/lifted... almost all warriors waiting for upload :|
06:47:00<SketchCow>WHY HELLO
06:48:00chronomex shoves this into IA
06:48:00<SketchCow>You crying sallybags.
06:48:00<chronomex>wassap brah
06:48:00<SketchCow>You whip a virtual server within an inch of its life, and then woah, you all want it jogging around the track 5 minutes later.
06:49:00<SketchCow>Also, I like Underscor whining on 3 channels about me making a reasonable attempt to prevent the machine dying.
06:49:00<SketchCow>948 simultaneous rsyncs.
06:49:00<SketchCow>Think about that.
06:49:00<chronomex>o_O
06:49:00<SketchCow>You know what you did.
06:49:00<chronomex>bitches gotta bitch
06:49:00SketchCow gets the newspaper
06:49:00<deathy>good job team! :)
06:49:00<Cameron_D>haha
06:50:00<soultcer>We need support for distributing uploads to multiple servers. Next one to complain about fos being unreachable will be volunteered to code that into the seesaw kit.
06:50:00<chronomex>ivan`: can you share some info about this file? when was it captured, was the paste dead at the time, is it complete, etc
06:52:00<SketchCow>Tomorrow, FOS goes down when one of the admins increases its swap from 2gb to 6gb.
06:59:00<ivan`>chronomex: pastes were captured 2011-11-14 and 2012-05-01 and 2012-10-06 (though perhaps I should strip those); not complete, I don't have pastes 129789-131892
07:00:00<chronomex>ok
07:00:00<chronomex>:D
07:04:00<chronomex>http://archive.org/details/paste_lisp_org
07:06:00<SketchCow>So, basically I have a couple days to prepare some more archiveteam items for ingestion into the wayback.
07:09:00<SketchCow>188,329,776 14.0M/s eta 3h 58m
07:09:00<SketchCow>Now that's a spicy meatball
07:10:00<SketchCow>1,067,816,696 17.7M/s eta 4h 19m
07:10:00<SketchCow>Downloaded a gig. Going to take 4 hours. It's like that.
07:14:00<SketchCow>With luck, I can make a lot more of these things green.
07:14:00<SketchCow>If this all works, all the green ones go into the wayback machine instantly.
07:15:00<SketchCow>Instant SOPA review end of the month!
07:15:00<SketchCow>That'd be nice.
07:41:00<SketchCow>Re-initiated uploads from fos to archive.org of webshots loads.
07:46:00<alard>Hi there. I've looked up the "rsync retries 50 times" claim that I made yesterday. I now think that's wrong and that it retries indefinitely, so your warriors will just wait until they can upload.
08:37:00<SketchCow>OK, napping.
13:07:00<joepie91>SketchCow: I'm not quite sure what to do with this, but I archived the videos that someone (I think bsmith096) linked me to a while ago as rare footage: http://aarnist.cryto.net:81/youtube/all/
13:07:00<joepie91>flv/mp4/webm format
13:17:00<balrog_>having a hard time with rsync with warrior
13:17:00<balrog_>getting "max connections reached" errors
13:22:00<joepie91>same
13:22:00<balrog_>alard: ya there?
13:24:00<S[h]O[r]T>the server the scripts rsync to is currently limited because it was having problems earlier
13:24:00<balrog_>yeah, but how do I keep the warrior going?
13:24:00<balrog_>I have this bandwidth which otherwise isn't going to get used
13:24:00<S[h]O[r]T>just have to wait, it will keep retrying uploads :\
13:25:00<balrog_>need to shorten the wait time from 30 seconds to more like 5 then
13:42:00<balrog_>is there any way I can tweak this? :/
13:45:00<SmileyG>more concurrent threads?
13:45:00<SmileyG>problem is we are all downloading it faster than FOS can accept it back in
13:45:00<balrog_>yeah
13:46:00<SmileyG>The fix is FOS accepting it faster, or us having larger caches.
13:46:00<SmileyG>larger caches are possible if you do more concurrent downloads, however depending on how fast you download relative to the max upload, you're still going to get stuck eventually
13:47:00<SmileyG>joepie91: I'd upload them to IA and give SketchCow a link,
13:47:00<SmileyG>thats what I've done with the fish ezine I get each month.
13:48:00<SmileyG>[08:46:31] <@alard> Hi there. I've looked up the "rsync retries 50 times" claim that I made yesterday. I now think that's wrong and that it retries indefinitely, so your warriors will just wait until they can upload.
13:48:00<SmileyG>Phew! that was a worry for me.
14:02:00<balrog_>5 connections seems a bit low
14:09:00<alard>Hey, "need to shorten the wait time from 30 seconds to more like 5 then" is not a good idea. If we all do that, it will just increase the load on the server, but won't increase the number of uploads.
14:10:00<flaushy>the problem is that users like me with slow uploads (max 1 MiB/s) will use the slots for a long time :/
14:10:00<alard>We'll just have to wait until the server can handle more connections. (Or we'll have to find some other server we can upload to, to spread the load.)
14:10:00<flaushy>right
14:11:00<flaushy>can we, until then, increase the warrior concurrency to a higher max than 6?
14:13:00<alard>No, that would require a lot of updates. (And I also don't really see how that would help. It would just add more waiting uploads.)
14:13:00<alard>Just be patient. :)
14:13:00<flaushy>okie :)
14:14:00<flaushy>at least we don't lose the queues, which is great
14:15:00<soultcer>alard: What would the requirements of such a server be?
14:15:00<SmileyG>we need a mini IA just for our upoloads lol
14:16:00<flaushy>i mean underscor looks like he has enough bandwidth to act as a caching server for the smaller guys. But i might be wrong
14:22:00<alard>soultcer: It would need downstream and upstream bandwidth, and a not too small disk to receive files before it packages and uploads them to archive.org.
14:22:00<alard>Uploads are 50GB batches, so a multiple of that.
14:24:00<flaushy>would 100 mbit be enough?
14:24:00<soultcer>Maybe renting a cheap server from hetzner or ovh for a month would work
14:25:00<alard>Yes, 100mbit would be enough (we also don't have to send all uploads to one server).
14:26:00<SmileyG>the bt ones are the issue right?
14:26:00<SmileyG>because they are so short.... ?
14:26:00<SmileyG>Shame we can't package multiple users together on the warrior?
14:26:00<alard>I do not know what the issue is. It can't be bt, I would think, since we have only a few thousand small users there.
14:27:00<SmileyG>alard: but most of them finish in sub-30 seconds
14:27:00<SmileyG>thats a lot of rsync processes spawning constantly for such small transfers.
14:27:00<alard>Yes, so there aren't many active at the same time. But I don't know what the issue is, really. It could be the number of processes spawning, or just the number of concurrent uploads.
14:28:00<alard>Resumed uploads are probably also more expensive than new uploads.
14:28:00<alard>(There would have been a few of them when the server came back up, I suspect.)
14:29:00<alard>It doesn't have to be rsync, by the way. That's just what fos currently has.
14:30:00<alard>Anyway, I'll be back later.
14:30:00<SmileyG>o/
14:30:00<soultcer>Does the bundling script rely on the partial setting? You could use --inplace, then it won't have to rename/move files after finishing
14:34:00<SmileyG>partial works for the webshots but makes no sense with the BT ones
14:37:00<SmileyG>14254080 52% 166.45kB/s 0:01:18
14:41:00<SketchCow>Back once again.
14:44:00<flaushy>meh, i need a couple of minutes mostly
14:44:00<flaushy>alard: i ask at my university
14:44:00<flaushy>whether i can crawl with the pools at night, and whether a dump would be acceptable
14:48:00<tef_>alard, DFJustin: 0.18 and 1.0 warcs are the same bar the version number, yes. (I have this from one of the authors of the warc spec)
14:49:00<tef_>pps warc2warc in warctools can recompress warcs record by record. warc2warc *.warc.gz -O output.warc.gz
14:53:00<soultcer>If it can recompress warcs, can it also concatenate them? Simply create one big warc file instead of tarring multiple warc files. Would make it easier for IA to use?
15:15:00<DFJustin>so SketchCow / underscor, can you pull bt usernames out of the wayback database, I can do stuff like http://wayback.archive.org/web/*/http://www.btinternet.com/~* but I only get a few hundred at a time and it will take forever
15:19:00<alard>DFJustin: underscor sent a list from the wayback database yesterday.
15:20:00<DFJustin>well I was getting usernames just now that rescue-me didn't know, although I think most of them are long gone
15:21:00<alard>Ah, I don't know what they searched for.
15:23:00<alard>soultcer: I think --partial or --inplace doesn't really matter (moving a file on the same disk isn't that expensive, is it?)
15:25:00<alard>I was playing with this for the one-big-warc problem: https://github.com/alard/megawarc Any good suggestions?
15:25:00<SketchCow>http://24.media.tumblr.com/tumblr_m9dvjezOvX1qm3r26o1_500.jpg
15:25:00<soultcer>When you have a big file half-uploaded, and then continue without --inplace, it will first make a temp copy of the already existing stuff, then write to that temp copy
15:25:00<soultcer>Only when it has finished uploading, will it remove that copy
15:26:00<soultcer>I had trouble transfering a file because rsync took more than 1.5 times the size of the file when I didn't use inplace
15:27:00<alard>In any case, --inplace can't be used here, because then half-uploaded files could be moved by the postprocessing script.
15:28:00<soultcer>alard: What do we need the original tar file for?
15:28:00<alard>It's nice to be able to find the per-user files.
15:28:00<SketchCow>yes
15:28:00<alard>And for mobileme there are wget.logs and other files.
15:32:00<alard>So even though you'd probably never want the original tar file back, it's useful to keep the data somewhere. The 'restore' function demonstrates that there's no data lost.
15:48:00<tef_>alard: if you have extra logs to put in, warc can handle that with metadata records
15:50:00<alard>tef_: I know. The new projects have one single warc file per user. The older projects, mobileme, have the logs and a few other files besides the warcs.
15:51:00<alard>(And even with mobileme the wget log is also in the warc files, I think.)
15:51:00<tef_>nice
15:51:00<tef_>but yeah converting from .tar to warc.gz could happily convert non-warc records into warc records in the final output
15:52:00<alard>Yes, so you could make one file that has everything.
15:52:00<SketchCow>Here's a hilarious one - the fortunecity collection. It's warcs AND straight html.
15:52:00<tef_>SketchCow: warc records can be of 'resource' instead of 'response' :-)
15:52:00<alard>We could put the tar file in the big warc.
15:59:00<tef_>heh
16:19:00<underscor>SketchCow: I wasn't whining!
16:22:00<underscor>alard: Does the seesaw kit support round-robining rsync servers?
16:22:00<underscor>Because I have 12 boxes at archive.org we could rr between
16:28:00<alard>underscor: Not yet, but it could. I think it would be even better to do it with HTTP PUT uploads, though. That would make round-robining easier. (And it might be less stressful for the server.)
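(For illustration, a minimal Python sketch of the kind of HTTP PUT upload being proposed; the checksum header and endpoint are hypothetical, not a settled design:)

    import hashlib
    import urllib.request

    def put_file(url, path):
        with open(path, 'rb') as f:
            data = f.read()
        # X-Checksum-SHA1 is a made-up header name; a real deployment would
        # use whatever integrity check client and server agree on.
        req = urllib.request.Request(
            url, data=data, method='PUT',
            headers={'X-Checksum-SHA1': hashlib.sha1(data).hexdigest()})
        with urllib.request.urlopen(req) as resp:
            return resp.status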
16:28:00<underscor>Hmm, as safe as rsync though?
16:28:00<underscor>(checksum-wise)
16:28:00<SmileyG>hmmmm
16:28:00<SketchCow>alard: First test of megawarc coming up
16:28:00<alard>Does rsync make many checksums?
16:29:00<underscor>I thought it did a checksum
16:29:00<underscor>But actually, no
16:29:00<SmileyG>it does
16:29:00<underscor>In write only mode, it doesn't
16:29:00<alard>Only if you allow it, I thought. (Other than the filesize thing.)
16:29:00<SmileyG>files to check #0/1
16:29:00<SmileyG>currently it appears to check the writes...
16:30:00<SmileyG>can you just use dns RR too?....
16:30:00<underscor>Yeah, but that requires waiting for propagation, etc
16:30:00<underscor>Also a lot of places (RIT included) munge the results
16:31:00<underscor>and only return one of them until the cache expires
16:31:00<SmileyG>o
16:31:00<SmileyG>ttl 5
16:31:00<SmileyG>:D
16:31:00<underscor>haha
16:31:00<underscor>They ignore ttl :(
16:31:00<SmileyG>just make sure your dns server can take it
16:31:00<SmileyG>wut ¬_¬
16:31:00<underscor>Yeah
16:31:00<underscor>sux
16:32:00<SmileyG>ok, have the tracker hand out upload details?
16:32:00<SmileyG>along with username?
16:33:00<underscor>alard: Setting up a PUT server for testing
16:34:00<alard>We could write a tiny node.js PUT server with checksums. :)
16:36:00<soultcer>Why complicate it further by introducing another programming language?
16:36:00<alard>Good question.
16:41:00<soultcer>Is there no simple point to point file transfer protocol with checksumming?
16:43:00<alard>Do we need checksums? (If we're uncomplicating. :)
16:44:00<underscor>Nah.
16:44:00<underscor>I was just putting up a put accepter in nginx
16:44:00<underscor>since I already have it on these boxen
16:44:00<alard>After all, once it's on that server we uploaded to we'll be using the non-checksummed s3 api to bring it to archive.org.
16:46:00<alard>underscor: Do you happen to know if there's a way to distinguish uploaded from still-uploading files?
16:46:00<underscor>No idea. Let me see.
16:48:00<alard>That's useful to know for the postprocessing. (The current packaging script moves any file it can find.)
16:49:00<deathy>"A file uploaded with the PUT method is first written to a temporary file, then a file is renamed. Starting from version 0.8.9 temporary files and the persistent store can be put on different file systems but be aware that in this case a file is copied across two file systems instead of the cheap rename operation."
16:49:00<deathy>apparently from "ngx_http_dav_module" docs
16:50:00<alard>Ah, that's promising.
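(A Python sketch of how a packaging script could avoid grabbing half-uploaded files, assuming the rename-on-completion behavior quoted above; the min_age margin is an extra precaution, not part of nginx:)

    import os
    import time

    def completed_uploads(incoming_dir, min_age=60):
        # nginx's dav module writes PUTs to a temporary file and renames on
        # completion, so a file that is visible here and untouched for
        # min_age seconds should be safe for the packing script to move.
        now = time.time()
        for name in os.listdir(incoming_dir):
            path = os.path.join(incoming_dir, name)
            if os.path.isfile(path) and now - os.path.getmtime(path) > min_age:
                yield path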
16:54:00<S[h]O[r]T>FTP
16:55:00<soultcer>FTP? What are we, farmers?
16:57:00<S[h]O[r]T>lol
16:57:00<underscor>http://p.defau.lt/?SBDTYn8UhfxVvm4rSmlydw
16:57:00<underscor>cc alard
16:57:00<underscor>:D
16:57:00<underscor>and it didn't appear until after the upload fully finished
16:57:00<alard>Nice.
16:58:00<alard>Does it make directories? (As in /webshots/underscor/something.warc.gz ?)
16:58:00<underscor>I can enable it
16:59:00<underscor>So if you put to http://bt-download00.us.archive.org:8302/webshots/some/path/here/libtorrent-packages.tar.gz
16:59:00<underscor>it will create /some/path/here on the fly
16:59:00<alard>It's not necessary, but with the rsync uploads I generally let every download upload to a separate directory.
17:00:00<alard>Doesn't really serve a purpose.
17:00:00<alard>I'll be back later.
17:01:00<underscor>alard: option enabled.
17:02:00<underscor>Holler at me when you get back if you think this would be a better idea going forward, and I can push out to the rest of the boxes
17:19:00<godane>i got up to episode 43 of t3 podcast
17:36:00<joepie91>S[h]O[r]T: no, absolutely not FTP
17:36:00<joepie91>lol
18:51:00<joepie91>very relevant: I don't have time for silliness. Just let me know if you're removing our footage, or if I'm forwarding this to our attorneys. I'm not interested in your creative commons bs (which those of us who actually work in media refers to as amateur licensing) and I have told you that we do not want our work in any of your videos. Let me repeat: we want NONE of our work in ANY of your or any third party
18:51:00<joepie91>videos, and our exclusive licensing agreements exist specifically so that is enforcable.
18:51:00<joepie91>er
18:51:00<joepie91>fail
18:51:00<joepie91>http://arstechnica.com/tech-policy/2012/10/court-rules-book-scanning-is-fair-use-suggesting-google-books-victory/
18:51:00<joepie91>ignore the above blob of text, it was an earlier copypaste from a pastebin :P
18:53:00<chronomex>now I'm curious
18:53:00<chronomex>however I have work to do
20:08:00<SketchCow>alard's not here, is he?
20:08:00<SketchCow>I think eh went awayyyy
20:10:00<alard>Hello!
20:12:00<SketchCow>Hey, my net went wonky.
20:12:00<SketchCow>ImportError: No module named ordereddict
20:13:00<SketchCow>How do I fix that?
20:13:00<alard>Python 2.6?
20:14:00<alard>wget https://bitbucket.org/wooparadog/zkit/raw/4ce69af1742f/ordereddict.py
20:14:00<SketchCow>root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python2.7 megawarc.py
20:14:00<SketchCow>Traceback (most recent call last):
20:14:00<SketchCow>File "megawarc.py", line 64, in <module>
20:14:00<SketchCow>from ordereddict import OrderedDict
20:14:00<SketchCow>ImportError: No module named ordereddict
20:15:00<soultcer>OrderedDict is in collections for py 2.7
20:15:00<SketchCow>Bear in mind I am a perl guy at best.
20:15:00<SketchCow>We do it differently there.
20:18:00<soultcer>SketchCow: Replace "from ordereddict import OrderedDict" with this: http://pastebin.com/dQdZ0wX8
20:18:00<soultcer>Should work fine in py 2.7, and for py 2.6 you can download the ordereddict module alard told you about
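(The pastebin's contents aren't shown, but the usual compatibility shim for this looks like:)

    try:
        from collections import OrderedDict  # built in since Python 2.7
    except ImportError:
        from ordereddict import OrderedDict  # standalone module for Python 2.6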
20:20:00<SketchCow>OK
20:20:00<SketchCow>So I just wasted some time trying that.
20:20:00<soultcer>alard: You are only using the ordered dict for cosmetics anyway, right?
20:21:00<alard>Yes.
20:21:00<SketchCow>Alard, please put it in the megawarc github if it works
20:21:00<SketchCow>because damn, I don't edit python very well.
20:21:00<chronomex>spaces, no tabs
20:21:00<chronomex>though it pains me to say so
20:21:00<SketchCow>Yeah, no, like I don't do python
20:21:00<PepsiMax>omh
20:22:00<SketchCow>And the github should be improved, not my local copy of it anyway
20:22:00<chronomex>:)
20:24:00<alard>SketchCow: I've updated the github repository. Try again. (It worked for me before and it still works now.)
20:32:00<SketchCow>root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc
20:32:00<SketchCow>Usage: megawarc [--verbose] build FILE megawarc [--verbose] restore FILE
20:32:00<SketchCow>root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc build BOARDS-COH-05.tar
20:32:00<SketchCow>Looking much better.
20:33:00<SketchCow>Now, let's see if the 11gb file that results is good.
20:33:00<SketchCow>Do you account for things being in subdirectories in the .tar?
20:43:00<alard>Well, it doesn't care. What it does is this: it walks through the tar, one entry at a time. If it is a file *and* the filename ends with .warc.gz, it checks to see if it is an extractable gzip. If all that is OK, the warc file is added to the big warc. In all other cases (directories, unreadable warcs, other files) the entry is added to the leftover tar.
20:43:00<alard>For the tar reconstruction, it pastes together the content from the leftover tar, the tar headers and parts from the warc. So directories don't matter.
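(A simplified Python sketch of the pass alard describes; the real megawarc also records tar headers and byte offsets in a json index:)

    import gzip
    import shutil
    import tarfile

    def is_valid_gzip(fileobj):
        # Read the stream to the end; any error means the gzip is unreadable.
        try:
            with gzip.GzipFile(fileobj=fileobj) as g:
                while g.read(1 << 20):
                    pass
            return True
        except (OSError, EOFError):
            return False

    def build_megawarc(tar_path):
        with tarfile.open(tar_path) as tar, \
             open(tar_path + '.megawarc.warc.gz', 'wb') as big_warc, \
             tarfile.open(tar_path + '.megawarc.tar', 'w') as leftover:
            for entry in tar:
                if entry.isfile() and entry.name.endswith('.warc.gz') \
                        and is_valid_gzip(tar.extractfile(entry)):
                    # Valid gzip members concatenate cleanly, so the warc
                    # can simply be appended to the big warc.
                    shutil.copyfileobj(tar.extractfile(entry), big_warc)
                else:
                    # Directories, unreadable warcs and other files go to
                    # the leftover tar.
                    leftover.addfile(entry, tar.extractfile(entry) if entry.isfile() else None)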
20:53:00<SketchCow>root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc build BOARDS-COH-05.tar
20:53:00<SketchCow>root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# ls -l
20:53:00<SketchCow>total 21136664
20:53:00<SketchCow>-rw-r--r-- 1 root root 10822246400 Oct 11 19:26 BOARDS-COH-05.tar
20:53:00<SketchCow>-rw-r--r-- 1 root root 84149 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.json.gz
20:53:00<SketchCow>-rw-r--r-- 1 root root 10240 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.tar
20:53:00<SketchCow>-rw-r--r-- 1 root root 10821470898 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.warc.gz
20:53:00<SketchCow>OK, so.
20:53:00<SketchCow>That worked... but there was no progress bar, and no updates.
20:53:00<SketchCow>So I'll use this for now, but I would definitely add something to indicate work is being done.
20:58:00<alard>SketchCow: Add --verbose
20:59:00<alard>It won't show a progress bar, but it will tell you what's taking so long.
20:59:00<alard>underscor: Do you have a /webshots/alard/webshots.com-user-siebertphotoshop-20121011-225722.warc.gz ?
21:01:00<SketchCow>Oh!
21:07:00<joepie91><SmileyG>joepie91: I'd upload them to IA and give SketchCow a link,
21:08:00<joepie91>that's a bit hard
21:08:00<joepie91>they're on a server
21:08:00<joepie91>:P
21:08:00<joepie91>can't get to them now anyway, that server seems offline
21:13:00<underscor>alard: http://p.defau.lt/?rZSWA6hEE1SROnuBQKxtAQ
21:13:00<underscor>Lookin awesome :D
21:13:00<alard>Great. Ready for more?
21:14:00<underscor>joepie91: :o what was that mispaste about :D
21:15:00<underscor>alard: yep!
21:15:00<underscor>Shall I roll out to bt-download01-11 now too?
21:15:00<underscor>(for roundrobining)
21:15:00<alard>That would be nice. The tracker picks one of the urls from a list, so it's possible to remove/add urls later.
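(Conceptually the tracker-side choice is this simple; the target list below is illustrative, modeled on the bt-download00 URL mentioned earlier:)

    import random

    # Targets can be added or removed while warriors keep running.
    UPLOAD_TARGETS = [
        "http://bt-download00.us.archive.org:8302/webshots/",
        "http://bt-download01.us.archive.org:8302/webshots/",
    ]

    def pick_upload_target():
        return random.choice(UPLOAD_TARGETS)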
21:16:00<underscor>Ah, nice!
21:16:00<underscor>ok, I'll work on pushing the config
21:16:00<flaushy>is the limit of rsync only for webshots or for all projects?
21:16:00<underscor>I'll need the "cleanup" script too
21:16:00<underscor>flaushy: bt is set to 5 right now, webshots 10
21:17:00<flaushy>would it make sense to switch underscor?
21:17:00<flaushy>from webshots to bt?
21:18:00<flaushy>or are the rsyncs on bt crowded as well?
21:18:00<alard>Webshots is now uploading over HTTP (once your warrior gets the update).
21:18:00<soultcer>Sweet
21:18:00<flaushy>so warrior restart time :)
21:18:00<flaushy>awesome
21:18:00<SketchCow>What?
21:19:00<SketchCow>So wait, stuff is now banging directly into archive? Or something else.
21:19:00<alard>SketchCow: underscor wants it.
21:19:00<underscor>SketchCow: well, I have 12 machines we can load balance between
21:19:00<SketchCow>Underscor wants a lot of things, but I like to be included while I'm over here trying to make this machine function.
21:19:00<underscor>so I thought it might be a better idea
21:21:00<SketchCow>Please at least tell me it's going into http://archive.org/details/webshots-freeze-frame with the same format structure
21:21:00<alard>(We've been discussing this for a while, but we can change it again if you think it's not a good idea.)
21:21:00<alard>It's exactly the same.
21:21:00<underscor>It's exactly the same, just that it is roundrobined between 12 boxes instead of a single one
21:21:00<SketchCow>I trust you'll do the right thing, but if we're using an environment, I just want to know, with my name being mentioned, we're going to shift gears.
21:22:00<SketchCow>Because then I can focus on it as a "clear out the rest of what we have" instead of "work my ass off on this box trying to make it function for the time being"
21:23:00<joepie91>http://p.defau.lt/?rZSWA6hEE1SROnuBQKxtAQ
21:23:00<alard>Ah, yes. It won't change immediately: the current warriors are still trying to rsync and will keep trying, until they succeed or until they're restarted.
21:23:00<joepie91>er
21:23:00<joepie91><underscor>joepie91: :o what was that mispaste about :D
21:23:00<joepie91>is what I wanted to paste
21:23:00<joepie91>anyway
21:23:00<joepie91>wtf is with my clipboard today
21:24:00<joepie91>tl;dr guy makes movie about occupy protests, then starts demanding that videos that reuse parts of his movie are taken down
21:24:00<joepie91>let me find the full paste
21:25:00<SketchCow>drop to -bs
21:26:00<SketchCow>Anyway, I am all for solutions that increase the bandwidth away from FOS, which is meant to be a buffer of 20tb for incoming data, but doesn't function as well as it could as something to blow 50tb of insanity in, do operations on, and blow out.
21:27:00<joepie91>SketchCow: what is the main bottleneck for fos?
21:27:00<SketchCow>I just need to know that's what's going on so I know I'm bailing water out of a bathtub for a little, and not trying to rescue a sinking ship.
21:27:00<SketchCow>FOS is a virtual box that does about 20 things.
21:27:00<SketchCow>So the bottleneck for FOS is FOS
21:27:00<SketchCow>Oversubscription.
21:27:00<SketchCow>In this particular case, we had the same disk being used for file writes, file compilation, and file reads
21:28:00<SketchCow>Which is normally not THAT big a deal but it was doing a LOT, and we had 900+ rsyncs
21:28:00<SketchCow>Eventually swap exploded
21:28:00<underscor>and everything goes to hell
21:28:00<alard>Webshots on FOS is now sizzling out, but bt internet is still using rsync. But that's so small it's probably something to keep there?
21:28:00<SketchCow>I expect so, yes.
21:28:00<SketchCow>Webshots is TOO DAMN HIGH
21:29:00<joepie91>so basically disk I/O is the bottleneck?
21:29:00<joepie91>or the main one, at least
21:29:00<SketchCow>It's one of them.
21:29:00<joepie91>hmm
21:29:00<joepie91>let me think about this for a moment
21:29:00<SketchCow>I guess if we're looking to find out, we can circle the sizzling wreck and waste a few days determining why.
21:29:00<SketchCow>No, don't think about it.
21:29:00<SketchCow>Think about things and projects that need saving.
21:29:00<joepie91>there's not much ability to save if the library is burning down
21:30:00<flaushy>is there a script for bt as well?
21:30:00<SketchCow>underscor has twice your brainpower, and 400x your resources (200x mine) and has an unhealthy compulsion to optimize.
21:30:00<SketchCow>He'll fix it.
21:30:00underscor giggles giddily
21:31:00<SketchCow>He literally cuddles with the internet archive infrastructure.
21:31:00underscor whistles innocently
21:31:00<joepie91>... not sure why you seem so strongly opposed to my decision to invest some of my _own_ time and thought into finding a possible solution
21:31:00<underscor>but, but, but, petaboxen are so cute~
21:31:00<joepie91>I personally don't really care who has more brainpower or infrastructure - more people thinking about it instead of watching random series because boredom, means more chance of a solution
21:31:00<SketchCow>This was a rare case where miscommunication, exacerbated by a red-eye flight, meant that I fell out of the loop of a solution set.
21:31:00<SketchCow>And got surprised, and whined.
21:32:00<chronomex>SketchCow can't stand WWIC
21:34:00<SketchCow>The teamarchive/FOS machine will now get 8gb of swap instead of 2.
21:34:00<underscor>SketchCow: What script do you use to inject these into IA?
21:34:00<underscor>(and can I get it plz)
21:35:00<SketchCow>I have a custom injector that uses a s3 call.
21:35:00SmileyG sighs
21:35:00<SmileyG>still borked? :(
21:35:00<SketchCow>Before we do this with your round-robin thing.
21:35:00<SketchCow>What's still borked.
21:36:00<SketchCow>Anyway, before we do this with your round-robin thing, I think we need to decide if megawarc is ready for production.
21:36:00<SmileyG>my bt uploads by the look of things - looking at backlog now
21:36:00<SketchCow>Not borked.
21:36:00<SketchCow>It was being held at a limit, a limit which I will shortly lift as we move webshots over to a round-robin, and as FOS gets 4x the allocated RAM
21:38:00<SmileyG>Ah ok, I presumed it was the number of rsyncs due to the BT one being so fast that was the issue (i'd fill my queue in 30~ seconds).
21:39:00<SketchCow>Also
21:39:00<SketchCow>http://blog.archive.org/2012/10/10/the-ten-petabyte-party/
21:39:00<SketchCow>If you're in SF, go eat some foods
21:39:00<SmileyG>I wish.
21:40:00joepie91 is practically on the other side of the world
21:40:00<SketchCow>Now, I want to discuss the format we put webshots in.
21:40:00<mistym>Probably every non-SF person here is wishing they'd be there now :b
21:40:00<SketchCow>My attention is gripped a little by seeing what the result of the megawarc program is.
21:40:00<SketchCow>http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5
21:41:00<SketchCow>So first, let us see what the result of the derive is.
21:41:00<SketchCow>It's an 11gb megawarc, so it will take a few minutes.
21:42:00<joepie91>what is a megawarc?
21:42:00<soultcer>Could you teach the deriver to unpack the tar files?
21:42:00<SketchCow>soultcer: No
21:42:00<SketchCow>I sat in meetings across a week on it.
21:42:00<chronomex>teaching the deriver anything is a major undertaking
21:42:00<SketchCow>It's not the deriver, it's the wayback machine.
21:42:00<SketchCow>It's a mess.
21:42:00<chronomex>ah
21:43:00<SketchCow>So it's easier to generate a .warc.gz file that cats up all the other warcs in a specific way.
21:43:00<chronomex>the way I take it, WBM indexes tar files that remain on petaboxes?
21:43:00<chronomex>thus there's one copy of the WBM data or something
21:43:00<SketchCow>No, it's weirder.
21:43:00<SketchCow>It's all so weird.
21:43:00<chronomex>s/tar/warc.gz/
21:43:00<SketchCow>As much as we want me to go into the substance of this, here we go.
21:43:00<SketchCow>I see three audiences for our data.
21:43:00<SketchCow>1. Wayback Machine
21:44:00<SketchCow>2. The individuals who had their data on the thing, wanting their shit back
21:44:00<SketchCow>3. Historians from The Future, with The Future being 1 hour to forever from now.
21:44:00<chronomex>yeap
21:44:00<SmileyG>agreed.
21:45:00<SketchCow>So, the problem is, 1. is very, very, very old school and was designed from the ground up along a whole range of very carefully decided "things".
21:45:00<SketchCow>It is also, being from a non-profit, not overly packed with dev teams.
21:45:00<SketchCow>This translates to "it takes things a certain way"
21:45:00<chronomex>picky, brittle.
21:45:00<SketchCow>It's possible to go 'well, leave things as they are, and make a second version'
21:45:00<SketchCow>And we're doing that for the moment with some items, for the sake of stepping into it slowly.
21:46:00<SketchCow>Obviously that doesn't work with MobileMe.
21:46:00<SketchCow>Now, I asked MobileMe to miss this current load-in to Wayback, because whatever decision/process is made becomes a 274tb decision.
21:47:00<flaushy>do you have slides for a 5 minute presentation on why you should join the archiveteam? i am going to a small ccc congress tomorrow
21:47:00<SketchCow>No, just links to my talks.
21:48:00<SmileyG>flaushy: hmmmmm not that I'm aware of - watch Jason's defcon speech and talk about Soy Sauce?
21:48:00<flaushy>could be good enough :)
21:48:00<SketchCow>http://www.us.archive.org/log_show.php?task_id=127610039
21:48:00<chronomex>soy sauce itself is >5 minutes :P
21:48:00<SketchCow>Can you guys see that?
21:48:00<SmileyG>yes SketchCow
21:48:00<chronomex>I can
21:48:00<SketchCow>Ok, so that's the deriver working with a megawarc.
21:48:00<flaushy>need login
21:49:00<SketchCow>Get a damn library card, buddy!
21:49:00<underscor>^
21:49:00<SketchCow>They're freeeeee
21:49:00<underscor>sweet, 1.8gb already on the first node!
21:49:00<underscor>cc alard, SketchCow
21:49:00<SketchCow>OK, so turning from that experiment, and still waiting to make sure it works.....
21:49:00<SketchCow>...I'd like to consider a process where we generate the megawarc by default.
21:50:00<SketchCow>And upload THOSE as webshots.
21:50:00<SketchCow>So my current process is "grab 50gb of these delightful picture warcs, .tar them, and then shove them on the internet archive."
21:51:00<alard>underscor: My uploads are going really fast.
21:51:00<SketchCow>But those .tars are good for 2. (the individuals), with a LOT of help from additional alard scripts, and 3. And not good for 1.
21:51:00<underscor>alard: that's a good thing, right?
21:51:00<underscor>hehe
21:51:00<SketchCow>Your uploads are going to boxes that aren't maxed out to misery
21:51:00<alard>underscor: Yes. :)
21:51:00<underscor>:D
21:51:00<underscor>SketchCow: hahahah
21:51:00<alard>We'll see how long it lasts.
21:51:00<chronomex>if we start with megawarcs, it's possible to make a tool that does range-requests and gets chunks in the middle
21:52:00<underscor>http://maelstrom.foxfag.com/munin/us.archive.org/bt-download00.us.archive.org/if_eth0.html
21:52:00<SmileyG>SketchCow: can we create some kind of "index" of the megawarc which we could feed into 1. (and use for 2.)
21:52:00<SketchCow>So I guess the question I pose to alard is, if we generate megawarcs, how hard would it be to write something that takes a link to the megawarc and returns your warc inside it?
21:52:00<SketchCow>SmileyG: The megawarc, by DEFINITION, works with 1. and 3.
21:52:00<SketchCow>And if it's in the Wayback, it helps 2.
21:53:00<SmileyG>SketchCow: ah duur failing to read.
21:53:00<chronomex>SmileyG: yes, there is an index. it's called a cdx.
21:53:00<SketchCow>So in THEORY, this would be fine.
21:53:00<chronomex>deriver makes it iirc
21:53:00<alard>The json file tells you where each file is, with byte ranges.
21:53:00<SmileyG>This is why I shouldn't irc while dying.
21:53:00<chronomex>how about not dying
21:53:00<alard>So it will tell you that user-x.warc.gz is in the big-warc from bytes a-b. This byte range you can feed to http://warctozip.archive.org/, for example. (This is how the tabblo/mobileme search things work.)
21:54:00<SketchCow>OK.
21:54:00<alard>Or you could do a curl with a byte range to get the warc.gz, if you don't like zip.
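(A Python sketch of that byte-range fetch; the offset and size would come from the .megawarc.json.gz index, and all names here are illustrative:)

    import urllib.request

    def fetch_member(megawarc_url, offset, size, out_path):
        # HTTP Range is inclusive on both ends, hence the -1.
        rng = 'bytes=%d-%d' % (offset, offset + size - 1)
        req = urllib.request.Request(megawarc_url, headers={'Range': rng})
        with urllib.request.urlopen(req) as resp, open(out_path, 'wb') as out:
            out.write(resp.read())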
21:54:00<SketchCow>So it SHOULD be possible with current tools to assist 2.
21:54:00<SketchCow>Or some minor scripting to access current tools.
21:54:00<alard>Yes.
21:54:00<chronomex>current tools or minor changes, yes
21:54:00<SketchCow>Ok.
21:54:00<SketchCow>Then yes, we're going to:
21:54:00SmileyG has other things on his plate hes thinking about. Time to disappear again.
21:55:00<SketchCow>1. Start pushing webshots up from underscor's Circle-Jerk to archive.org as native megawarcs
21:55:00<SketchCow>2. See about (carefully) converting both previous webshots and mobileme to native megawarcs.
21:55:00<SmileyG>99. Geocities?
21:56:00<SketchCow>Geocities as we did it will never go into the wayback.
21:56:00<SmileyG>Never? drat
21:56:00<SketchCow>As we did it.
21:56:00<chronomex>nope, we didn't manage to collect enough metadata to put it into warc
21:56:00<SketchCow>In THEORY, we could generate warcs with some sort of obviousness that it could pull in.
21:56:00<SmileyG>we can't "redo" it though so....
21:56:00<SmileyG>hmmm, as long as its as accessible as the others then shrug.
21:57:00<SketchCow>But man, I don't want to stress the IA infrastructure with THAT project this exact moment.
21:57:00<SketchCow>And by infrastructure I mean people.
21:57:00<SmileyG>wtf is hitting my keyboard o_O
21:57:00<SketchCow>sperm
21:57:00<SmileyG>worrying.
21:57:00<SketchCow>It dries
21:57:00<SmileyG>then its all crispy and the keys get stuck :<
21:58:00<SketchCow>check #archiveteam-spermclean
21:58:00<SketchCow>Read the FAQ
21:58:00<joepie91>stupid idea: set up haproxy on shitty unmetered gbit vps, proxy to various backends
21:58:00<SmileyG>lol sorry, dragging this off topic ¬_¬; Really am going away, just gonna watch the convo unless someone on the internet turns out to be wrong.
21:58:00<joepie91>upload over HTTP
21:58:00<SketchCow>joepie91: We did that way back when
21:58:00<SmileyG>shitty unmetered gbit vps <<< How much $$$?
21:58:00<SketchCow>It was hilarrrrrrrrrrrious
21:59:00<joepie91>SmileyG: not necessarily that much
21:59:00<joepie91>expect disk I/O etc to suck though, but that doesn't matter if it's just a proxy
21:59:00<soultcer>joepie91: Shitty unmetered VPS have one problem: In the end they are still a shitty VPS.
21:59:00<SketchCow>We did it on batcave, as I recall
21:59:00<joepie91>soultcer: shitty in the sense of everything but the bandwidth sucks
21:59:00<joepie91>:p
21:59:00<joepie91>SketchCow: what were the results?
21:59:00<SketchCow>Oh, it was very effective
21:59:00<alard>That was to fix network weirdness, where uploads directly to s3.us.archive.org were much slower than uploads proxied through batcave, also on archive.org.
22:00:00<alard>joepie91: But I think the HTTP upload from the warriors works fine now, without proxy stuff. The tracker redirects to one of the upload servers.
22:01:00<soultcer>SketchCow: Does the Internet Archive hire remote workers?
22:01:00<joepie91>alard: alright
22:01:00<joepie91>so... the upload problems should be solved, or?
22:01:00<alard>Yes, for the time being. :)
22:02:00<alard>Update your Webshots scripts, if you aren't using a warrior.
22:06:00<SketchCow>So, alard, is there a way to make a megawarc generator that just takes a directory instead of a tar?
22:08:00<alard>That depends on your definition of "megawarc". As it is now, the json contains tar headers and the position of the warcs in the original tar file. You could leave that out, though, and keep the properties that are useful for indexing the big-warc.
22:08:00<alard>What would be the best way to get the filenames to the megawarc script? Use find and pipe to stdin?
22:09:00<alard>(There may be too many files to go as command line arguments.)
22:09:00<SketchCow>I am comfortable with doing a tar to stdin... :)
22:10:00<SketchCow>or to stdout, I guess you might say
22:10:00<soultcer>Make the script recursively search a directory for warcs?
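(soultcer's suggestion, sketched in Python; the helper name is made up:)

    import os

    def find_warcs(root):
        # Recursively yield .warc.gz paths under root, in a stable order.
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()
            for name in sorted(filenames):
                if name.endswith('.warc.gz'):
                    yield os.path.join(dirpath, name)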
22:10:00<alard>Well, piping tar into the megawarc script won't easily work, since the script needs two passes over each warc file. (Once to check if it can be decompressed, once to copy it to the big-warc.)
22:11:00<SketchCow>Well, I assumed a different script taking a different approach.
22:11:00<alard>Yes, but I think you want the gzip test. If you don't have that test, one invalid file can ruin the whole warc.
22:11:00<SketchCow>I mean, let's back it up. What I'd like is a way to take a directory, instead of a .tar, and make it a megawarc.
22:11:00<SketchCow>However that's done, I approve.
22:12:00<soultcer>So you want to stop even creating the tars for new projects and just upload warcs to IA plus a small tar for the logfiles?
22:12:00<SketchCow>It's expensive as shit, but making a .tar file, and then running megawarc against it, then removing the tar file and uploading the megawarc files....I could live with that.
22:12:00<SketchCow>that might be smartest.
22:13:00<alard>I think the tar isn't really necessary, especially not if you don't want to 'reconstruct' a tar that never was from the megawarc.
22:13:00<alard>Or you could use made-up tar headers.
22:13:00<SketchCow>Well, let's think about it.
22:14:00<SketchCow>DO we want the .tar file? By reconstructing later, you have a nice standard collection of the files.
22:14:00<SketchCow>And you can pull things from it.
22:14:00<soultcer>Well we need some way to store "these records in the warc file belong to a single user"
22:18:00<SketchCow>So, how do we feel about that? I think a .tar existing somewhere along the line works very well for what we want to do.
22:18:00<SketchCow>because then the .tars can go into The Next Thing After Internet Archive
22:19:00<underscor>TNTAIA, for short
22:19:00<soultcer>But then you have to store both the tar file and the megawarc
22:19:00<SketchCow>No, no.
22:19:00<SketchCow>you are using a .tar as the intermediary instead of the file directory to generate the megawarc
22:20:00<SketchCow>So in come the piles o' files
22:20:00<SketchCow>At some point, you have a 50gb collection (say)
22:20:00<SketchCow>You make it a .tar
22:20:00<SketchCow>You megawarc the .tar
22:20:00<SketchCow>you upload the megawarc.
22:20:00<SketchCow>Now the thing's been standardized out past the filesystem
22:20:00<SketchCow>And can be turned into 50gb chunks in the future on your holocube 2000x
22:21:00<soultcer>How about instead of creating a .tar and megawarcing it, you directly create the megawarc from the 50gb of source files?
22:21:00<SketchCow>This is what we just discussed.
22:22:00<soultcer>Oh, I thought you wanted to keep the "create a tar and megawarc it" step
22:22:00<SketchCow>I asked about that possibility, but it does lead to concerns.
22:22:00<SketchCow>by making something a .tar and then making it a megawarc, we have an intermediary thing it's converted back into that's able to be manipulated by other programs.
22:23:00<SketchCow>And I am saying, I think this is a good idea for future extensibility.
22:23:00<SketchCow>1. 2. and 3. are all handled.
22:23:00<soultcer>Even if you skip the tar step, you can later convert it back to a tar file.
22:24:00<soultcer>Though as long as going through the tar step doesn't create much of a bottleneck, it is probably nice to use tools that already exist and that do one thing well
22:24:00<SketchCow>http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5
22:24:00<SketchCow>OK, so update.
22:24:00<SketchCow>It definitely made a CDX
22:25:00<underscor>nice!
22:25:00<underscor>SketchCow: So is that what you want to do? fill up 50gb->tar->megawarc->ingest->rinse/repeat?
22:26:00<SketchCow>That is my proposal - I can give you scripts that I wrote and which alard wrote.
22:26:00<underscor>okay, awesome
22:26:00<SketchCow>But first, I want to have us discuss 50gb->megawarc->ingest
22:26:00<SketchCow>Because that was also on the table. pros and cons.
22:26:00<underscor>alard gave me the "watcher" script
22:26:00<underscor>but I don't have a tar/s3-er
22:26:00<underscor>that moves things into a temp dir
22:27:00<SketchCow>Right.
22:27:00<SketchCow>No, I'll give you those, but first I want this decided.
22:27:00<SketchCow>Also, I asked people to verify the CDX just generated.
22:27:00<SketchCow>Because if it just made borscht, more borscht is not a buddy.
22:27:00<SketchCow>I'm also about to restore that megawarc-5 to see what happens.
22:29:00<soultcer>SketchCow: As I said, it would be easy to modify alard's megawarc creator so it directly takes a directory of small warcs/wget logs and creates the same output (minus some tar metadata, that isn't necessary)
22:30:00<SketchCow>As alard said, one corrupted gz makes it not work
22:31:00<alard>soultcer: Just thought that it should be possible to add tar headers as well. Let Python create them, as if it is making a tar.
22:31:00<alard>I'll have a look.
22:31:00<SketchCow>Let's put it this way.
22:31:00<alard>Other question: it's possible to put the extra-tar and the json inside the big-warc. Is that useful?
22:31:00<soultcer>alard: Would work, but why would we need the tar headers anyway? It's mostly metadata about the filesystem on fos.
22:31:00<alard>You'd have one file, but the index would be less accessible.
22:32:00<SketchCow>alard: It makes it harder to decipher later. I'd keep it outside
22:32:00<alard>soultcer: It has timestamps.
22:32:00<alard>and it makes it easier to make a tar.
22:33:00<soultcer>alard: I don't really see why we would need any of the tar metadata. It would of course be possible to create some of it, but I have no idea how to create the tar header string you have in the json file
22:34:00<joepie91>SketchCow: I'm still having rsync issues for btinternet - does that happen for everyone?
22:34:00<SketchCow>The issue is that not everything we add is a warc.gz
22:34:00<SketchCow>Sometimes it's going to be additional 'stuff'
22:35:00<joepie91>for every single job: @ERROR: max connections (5) reached -- try again later
22:35:00<soultcer>SketchCow: The additional files are simply put into a single tar archive
22:35:00<SketchCow>btinternet just got more love
22:36:00<joepie91>looks fixed now, thanks
22:36:00<joepie91>:P
22:36:00<soultcer>Together with the metadata from the json file, the tar archive with the additional files and the megawarc file, you can recreate the original directory structure, or create a tar archive with all files
22:36:00<joepie91>whoa
22:36:00<joepie91>starts scrolling like mad
22:36:00<joepie91>lol
22:37:00<joepie91>looks like people had a lot in queue
22:37:00<SketchCow>soultcer: I'm going to again defer to alard's opinion on this.
22:37:00<joepie91>especially Sue
22:37:00<joepie91>*cough*
22:37:00<SketchCow>Incoming crap - ends up as megawarc
22:37:00<SketchCow>I just want the megawarc that results to be useful to the historians and the individuals as much as it is to wayback.
22:38:00<soultcer>And the wget logs I assume?
22:39:00<SketchCow>I don't want anything being lost
22:41:00<joepie91>lol, at this pace, btinternet will be done in 30 minutes
22:47:00<soultcer>alard: So what do you think? Bundle as tar, then megawarc; or directly create the megawarc?
22:50:00<SketchCow>Give him a moment, I see he's been coding some stuff related to uploads.
22:51:00<soultcer>Sure.
22:53:00<soultcer>The thing with the tar is: It includes a lot of metadata on who created the tar file (uid/gid), when it was created (mtime/ctime) and the filesystem permissions. I am not sure if we want to keep those, or not
23:06:00<SketchCow>I've reconstructed a .tar from the megawarc.
23:06:00<SketchCow>Now unpacking it to see if everything comes out ok.
23:07:00<soultcer>They should be bit-for-bit copies I think
23:07:00<SketchCow>Absolutely.
23:08:00<SketchCow>Regardless, I am doing what someone in 10 years would be doing.
23:17:00<Sue>i'm sorry for what i did to btinternet
23:17:00<chronomex>they got graped
23:18:00<Sue>i had like
23:18:00<Sue>300-400 rsync jobs queued up
23:18:00<Sue>apparently 17G worth
23:18:00<chronomex>jesus
23:18:00<Sue>it's about to be done
23:21:00<alard>New version: https://github.com/alard/megawarc
23:21:00<alard>I renamed megawarc build TAR to megawarc convert TAR (seemed more logical).
23:22:00<alard>There's now also a megawarc pack TAR FILE-1 FILE-2 ... option that packs files/paths directly.
23:22:00<alard>You still need to provide TAR to make the file names, but that tar doesn't exist.
23:23:00<alard>E.g. ./megawarc pack webshots-12345.tar 12345/ should work.
23:24:00<alard>Then ./megawarc restore webshots-12345.tar would give you a tar file.
23:24:00<soultcer>alard: Nice work. I was just thinking about simply using the TarInfo class to create the tar_header structure, but I see you not only thought of it faster, you implemented it while I was still thinking about the details ;-)
23:27:00<alard>I copied most of it from Python's tarfile.py.
23:27:00<soultcer>Good programmers code, better programmers reuse ;-)
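(For the curious, the TarInfo route looks roughly like this; the field choices are illustrative:)

    import tarfile
    import time

    def fake_tar_header(name, size):
        # Build a tar header for a file that never sat in a real tar, so a
        # tar can still be reconstructed from the megawarc later.
        info = tarfile.TarInfo(name=name)
        info.size = size
        info.mtime = int(time.time())
        return info.tobuf(format=tarfile.GNU_FORMAT)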
23:28:00<Sue>btinternet is now in the negative
23:29:00<joepie91>ooo
23:29:00<joepie91>100MB btinternet user incoming
23:29:00<joepie91>... wat
23:29:00<joepie91>how's that even possible?
23:29:00<SketchCow>means they paid for premium
23:29:00<joepie91>SketchCow: but premium users are on a separate server
23:29:00<SketchCow>A la geocities and a few others, the old address is kept while the premium address goes up.
23:29:00<chronomex>recursion!
23:29:00<SketchCow>We found 1gb geocities users
23:29:00<joepie91>ah
23:30:00<alard>Time to find more usernames then. (There are also 1185 usernames still claimed, over 1000 by Sue.)
23:30:00<joepie91>wonder how they did that though
23:30:00<joepie91>because there's a separate server for all premium users
23:30:00<joepie91>two IPs away from the free server
23:30:00<Sue>over 1k by me? must be a glitch
23:30:00<alard>Are all your instances finished?
23:31:00<Sue>i'm still doing probably 20-30
23:31:00<Sue>the screen isn't full of "no item received" yet
23:31:00<joepie91>mine is
23:31:00<joepie91>or well
23:31:00<joepie91>alternating between no item received and tracker rate limiting
23:31:00<joepie91>lol
23:33:00<Sue>can you release items per user? that's strange that i have so many...
23:33:00<alard>I've put them back in the queue.
23:34:00<Sue>ok
23:34:00<alard>And with that I'm off to bed. Bye!
23:34:00<SketchCow>Thanks again, alard
23:34:00<Sue>i hunger for more
23:35:00<joepie91>suddenly, 5mbit!
23:35:00<joepie91>goodnight alard
23:35:00<joepie91>:)
23:39:00<DFJustin>wow so unless we find way more users, all of btinternet will fit on a microsd card
23:39:00<DFJustin>INCREASING COSTS
23:40:00<SketchCow>Huh, someone recorded alard's process for programming new code.
23:40:00<SketchCow>http://www.youtube.com/watch?feature=fvwp&NR=1&v=8VTW1iUn3Bg
23:40:00<SketchCow>Screencap's gotten good
23:40:00<SketchCow>(He's the one in the glasses)
23:43:00<Sue>i'm out of users, 14 left downloading
23:50:00<SketchCow>That's to be expected.
23:57:00<SketchCow>Internet Archive's teams have signed off on the megawarcs.
23:57:00<SketchCow>So guess what - FOS is making a ton of fucking megawarcs tonight.