#archiveteam-bs log for 2023-11-17


00:00:59	<thuban>	thank you!
00:20:50	<@JAA>	1.26M dirs done, 2.83M remaining. It exploded a bit again ~15 minutes ago.
00:34:27		BlueMaxima joins
01:38:39		HP_Archivist (HP_Archivist) joins
01:46:58		BlueMaxima quits [Read error: Connection reset by peer]
01:51:40		qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
02:05:24		sdomi (sdomi) joins
02:09:00		rafaeletc joins
02:13:18	<rafaeletc>	hi folks, I need some help understanding the archives of the "fotolog" website. I am doing research on a young Brazilian movement that used the site a lot. Is this the right place for me to ask?
02:15:43	<@JAA>	Over 3M todo now, so I'll stop updating here until it's closer to completion.
02:15:51	<@JAA>	Hi rafaeletc, yeah, here is fine to ask.
02:17:49	<rafaeletc>	Oh, hello JAA, nice to meet u
02:18:52	<nicolas17>	JAA: any idea how much data is there?
02:19:15	<nicolas17>	I guess you're not extracting file sizes yet
02:20:21	<rafaeletc>	oh, a lot of gigabytes
02:20:37	<nicolas17>	oh I meant the mozilla archive he's currently working on :P
02:21:41	<rafaeletc>	nicolas17: sorry, thought it was for me the question
02:23:58	<@JAA>	nicolas17: No idea yet, no, just fetching the dir listings so far.
02:24:00		hdgdrqer joins
02:24:15		hdgdrqer quits [Remote host closed the connection]
02:24:19	<@arkiver>	rafaeletc: it would be good to know the question :P
02:24:30	<@arkiver>	but the archives are browsable in the Wayback Machine
02:26:17	<rafaeletc>	I'm trying to recover some content from "fotolog.com" via the Wayback Machine but I'm finding a lot of broken content, then I found this: https://archive.org/details/archiveteam_fotolog - my question is how to extract them in a way that lets me identify where the files of the profiles I want to analyze are, if you search in this link you will see
02:26:17	<rafaeletc>	that they are very large files to download fully and look for a needle in a haystack.
02:27:18	<@arkiver>	i do see the site is marked "partially saved"
02:27:25	<@arkiver>	rafaeletc: do you have any URLs?
02:27:33	<@arkiver>	you could look those up in the Wayback Machine
02:27:39	<rafaeletc>	I managed to understand the metadata you have in the json of each upload, do you have a way to automate the download only of these json?
02:28:13	<@arkiver>	rafaeletc: the CDX files may be useful for you, they contain lists of URLs saved in the archives - though i don't know what you are looking for
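
The CDX lookup arkiver mentions can be sketched roughly as follows. This is a minimal sketch, not the tooling used in the channel: it assumes a space-delimited CDX file whose first line is the usual ` CDX N b a m s k r M S V g` legend, where `a` is the original URL, `V` the offset of the compressed record, `S` its compressed size, and `g` the WARC file name. The function name `filter_cdx` and the profile URL in the usage note are hypothetical.

```python
def filter_cdx(lines, url_prefix):
    """Yield one dict per CDX row whose original URL starts with url_prefix.

    Assumes a space-delimited CDX file whose header line looks like
    " CDX N b a m s k r M S V g" (one-letter field codes).
    """
    fields = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(" CDX "):
            # Header line: remember the field codes that follow "CDX".
            fields = line.split()[1:]
            continue
        if fields is None:
            raise ValueError("no CDX header line seen before data rows")
        row = dict(zip(fields, line.split(" ")))
        # "a" is the original URL in the standard CDX legend.
        if row.get("a", "").startswith(url_prefix):
            yield row
```

With a CDX file opened as text, something like `for row in filter_cdx(fh, "http://www.fotolog.com/example_user/"): print(row["g"], row["V"], row["S"])` would list which WARC file holds each matching record, plus its offset and size. The URL shape is an assumption, so check it against real rows first.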
02:28:41	<@arkiver>	as for downloading files, you can for example use the `internetarchive` (`ia`) library for this
02:29:06	<@arkiver>	https://archive.org/developers/internetarchive/
02:29:26	<@arkiver>	i'll be off to bed now though, but you can leave a message and me or JAA or someone else might know what to do
02:30:21	<rafaeletc>	arkiver: I would like to search the comments and photos of what was archived, but not everything, only of some profiles
02:31:17	<@arkiver>	rafaeletc: i'm afraid 'searching' is not exactly possible, as the content of the records is not really indexed, only the URLs and some other metadata
02:31:32	<@arkiver>	so if you know some URLs for what you are looking for, you could look those up in the Wayback Machine
02:31:57	<@arkiver>	but searching all comments/pages for some terms is not possible at the moment, unless someone makes this searchable (which could be expensive)
02:33:07	<rafaeletc>	arkiver: so major it's static content
02:33:17	<rafaeletc>	but this application that you linked is well documented, I believe will help me a lot, via browser is very difficult
02:35:25	<rafaeletc>	I am researching a cultural movement that between 2003 and 2006 used the website for dissemination and communication
02:37:28	<rafaeletc>	therefore my interest in the comments, but as they were dynamic contents, I believe I will not be able to recover them, but getting the publications I believe that can help
02:41:00	<rafaeletc>	pardon my bad english, I am Brazilian and natively speak portuguese. but thank you very much, the indicated application illuminated the path I need to follow to research more
02:50:29		DogsRNice_ quits [Read error: Connection reset by peer]
02:52:31	<@JAA>	rafaeletc: `ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'` should work for downloading all of the megawarc JSON files.
02:52:56	<rafaeletc>	JAA: is it possible to use these files: https://archive.org/download/archiveteam_fotolog_20160211111954 with https://replayweb.page/ ?
02:53:56	<@JAA>	rafaeletc: Yes, that should work.
02:55:23	<rafaeletc>	JAA: I mean, the warc of profiles are compressed right? each page archived is a warc inside the compressed file?
02:57:02	<nicolas17>	each page archived is a compressed record inside the warc file
02:57:22	<@JAA>	rafaeletc: It's a compressed file, and each record is compressed individually. replayweb.page should support reading the compressed .warc.gz file directly.
02:57:51	<rafaeletc>	JAA: but a warc.gz with 36gb?
02:57:52	<@JAA>	In other words, you do not need to decompress it.
02:58:07	<@JAA>	Yeah, they are large and not exactly nice to work with.
02:59:00	<@JAA>	Using the metadata in the JSON files, you can download just the part of the file you're interested in.
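
A rough sketch of what JAA describes: because each record is its own gzip member, a byte range of the `.warc.gz` can be fetched and decompressed without touching the rest of the file. It assumes each line of the megawarc `.json.gz` is a JSON object with `target.offset` and `target.size` giving the record's position in the compressed WARC; verify that against a real file before relying on it. The helper names `locate`, `range_header`, and `read_record` are made up for illustration.

```python
import gzip
import json


def locate(json_line):
    """Return (offset, size) of one record in the .warc.gz.

    Assumption: the megawarc JSON line carries these under "target".
    """
    target = json.loads(json_line)["target"]
    return target["offset"], target["size"]


def range_header(offset, size):
    """Build an HTTP Range header covering exactly `size` bytes from `offset`."""
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    return {"Range": f"bytes={offset}-{offset + size - 1}"}


def read_record(member_bytes):
    """Decompress a single, individually gzipped WARC record."""
    return gzip.decompress(member_bytes)


# Offline demonstration with a tiny two-record stand-in for a .warc.gz.
records = [b"WARC/1.0\r\n\r\nfirst record", b"WARC/1.0\r\n\r\nsecond record"]
members = [gzip.compress(r) for r in records]
blob = b"".join(members)
entry = json.dumps({"target": {"container": "warc",
                               "offset": len(members[0]),
                               "size": len(members[1])}})
offset, size = locate(entry)
second = read_record(blob[offset:offset + size])
```

Against the real files, the slice `blob[offset:offset + size]` would instead be the body of an HTTP request sent with `range_header(offset, size)`, for example via `urllib.request.Request(url, headers=range_header(offset, size))`, provided the server honours range requests.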
02:59:15	<rafaeletc>	but I have to download it or just paste the link at replayweb.page?
02:59:56	<@JAA>	I'm not sure. I haven't used ReplayWeb.Page before. I just know that it exists.
03:03:27		nicolas17 quits [Client Quit]
03:03:27		CraftByte quits [Client Quit]
03:03:30		nicolas17_ joins
03:03:37		CraftByte (DragonSec|CraftByte) joins
03:04:31		qwertyasdfuiopghjkl quits [Client Quit]
03:04:31		rafaeletc quits [Client Quit]
03:04:40		rafaeletc joins
03:06:00		nicolas17_ is now known as nicolas17
03:09:28		CraftByte quits [Client Quit]
03:09:28	<rafaeletc>	oh, i figured out, downloaded a small file and opened it on replayweb and saw the content that was archived, luckily the comments are saved, now is to find where are the warcs of the profiles I want to search
03:09:28	<rafaeletc>	thank you all, very very much
03:09:28		rafaeletc quits [Client Quit]
03:09:41		aismallard quits [Client Quit]
03:10:32	<@JAA>	:-)
03:10:38		aismallard joins
03:12:02	<@JAA>	I'm trying to get a partial size estimate for what I've crawled so far of archive.mozilla.org, but it makes my CPU sad.
03:12:21	<@JAA>	I'm already at 5 GiB of compressed WARCs for the listings so far...
03:12:26	<nicolas17>	x_x
03:12:31	<nicolas17>	}
03:12:36	<nicolas17>	I wonder where it's hosted
03:13:10	<@JAA>	Somewhere that doesn't care a bit about me hammering it for hours at least, so that's good.
03:13:21		rafaeletc joins
03:13:27	<nicolas17>	some massive storage server exposing all this data as a single filesystem?
03:13:29	<@JAA>	28.35.117.34.bc.googleusercontent.com
03:13:35	<nulldata>	Why in the <i>cloud</i>, of course!
03:15:20	<nicolas17>	x-goog-storage-class: NEARLINE
03:15:29	<nicolas17>	they must have their own thing for file listings then huh
03:15:33	<rafaeletc>	JAA: I accidentally closed the browser tab and did not copy the parameters of ia that you had suggested, can you paste to me again?
03:15:54	<@JAA>	rafaeletc: `ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'`
03:16:35	<rafaeletc>	JAA: thank you, again
03:16:58	<@JAA>	Happy to help. :-)
03:17:08	<nicolas17>	yep, files have the x-goog headers suggesting they're on Google Cloud Storage, but file listings don't, so they probably have a reverse proxy / load balancer / thing redirecting file listing requests to somewhere else
03:17:11	<@JAA>	I just crossed 3M dirs fetched. There's another 3.15M in the queue.
03:17:34	<@JAA>	And I thought my 4.8M symlinks were bad...
03:22:20	<@JAA>	Always fun to optimise grep/awk/sed/... pipelines to get best throughput.
03:23:18		rafaeletc quits [Remote host closed the connection]
03:23:20		lennier2 joins
03:24:42	<@JAA>	In the first 3M-ish dirs listed (breadth-first recursion), I got 71 million files.
03:25:15	<@JAA>	20.7M of those are over 1 MB.
03:26:29	<@JAA>	19M are over 10 MB.
03:26:33		lennier2_ quits [Ping timeout: 265 seconds]
03:26:38	<@JAA>	2.2M are over 100 MB.
03:27:26	<@JAA>	Those 2.2M alone add up to 391.29 TiB.
03:28:10	<@JAA>	arkiver: ^ First numbers, and the listing isn't even halfway through the discovered dirs yet.
03:29:46	<@JAA>	Minor correction, those are over 1/10/100 MiB, not MB.
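
The MiB/MB distinction in that correction matters at the boundaries: 1 MiB is 1,048,576 bytes, so a 1,000,000-byte (1 MB) file does not count as "over 1 MiB". A minimal sketch of the bucketing, assuming only a list of file sizes in bytes; the names `count_over` and `total_tib_over` are hypothetical and this is not the grep/awk/sed pipeline JAA actually used.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def count_over(sizes, thresholds_mib=(1, 10, 100)):
    """Count files strictly larger than each threshold, given in MiB."""
    return {t: sum(1 for s in sizes if s > t * MIB) for t in thresholds_mib}


def total_tib_over(sizes, threshold_mib):
    """Sum the sizes of files over the threshold, expressed in TiB."""
    return sum(s for s in sizes if s > threshold_mib * MIB) / TIB
```

For example, `count_over([500_000, 1_000_000, 2 * MIB, 50 * MIB, 200 * MIB])` counts three files over 1 MiB, two over 10 MiB, and one over 100 MiB; the 1,000,000-byte file falls below the 1 MiB line.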
03:50:32	<@JAA>	Todo is finally less than done. 3.37M remaining though.
03:55:12	<nicolas17>	is todo going down? :D
03:59:56		Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ]
04:00:33		Shjosan (Shjosan) joins
04:01:46	<@JAA>	No, up, but slower than done at least. :-P
04:08:16	<nicolas17>	great
04:08:20	<nicolas17>	similar to telegrab right now
04:08:49	<nicolas17>	completing 12400 items/min, todo going down at 900 items/min but at least it's going down
04:11:18	<@JAA>	Yeah, here I'm grabbing 10k/min but todo grows by 8k...
04:13:13	<nicolas17>	ow
04:24:40	<fireonlive>	https://x.com/discordpreviews/status/1725240412023959844?s=12
04:24:40	<eggdrop>	nitter: https://nitter.net/discordpreviews/status/1725240412023959844
04:24:46	<fireonlive>	waste of money is being shut down
04:25:01	<fireonlive>	no action required other than a “lulz”
04:32:53	<@JAA>	In other news, the Canucks forums are still online and active. It was supposed to shut down at the end of September. The announcement has since been edited to read:
04:32:56	<@JAA>	> It will be closed on __________.
04:33:08	<@JAA>	https://forum.canucks.com/announcement/25-forum-closure/
04:34:44	<nicolas17>	literally underscores?
04:34:47	<@JAA>	Yep
04:35:06	<nicolas17>	they can't find the power button
04:35:40	<@JAA>	The only forum admin has no idea what's going on either. :-)
04:36:09		dumbgoy quits [Ping timeout: 272 seconds]
04:56:45		Wohlstand quits [Client Quit]
04:59:13		neggles quits [Quit: bye friends - ZNC - https://znc.in]
05:01:17		neggles (neggles) joins
05:02:12		neggles quits [Client Quit]
05:31:52		etnguyen03 quits [Client Quit]
06:29:43		systwi_ quits [Quit: systwi_]
06:29:43		nothere quits [Quit: Leaving]
06:50:44		nothere joins
06:52:35		nicolas17 quits [Client Quit]
07:13:13		Island quits [Read error: Connection reset by peer]
08:03:09		neggles (neggles) joins
10:00:02		Bleo18 quits [Client Quit]
10:01:23		Bleo18 joins
10:34:57		qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
12:35:02		nicolas17 joins
13:20:33		HP_Archivist quits [Ping timeout: 272 seconds]
13:50:57		Arcorann quits [Ping timeout: 272 seconds]
13:59:50		Megame (Megame) joins
14:26:17		etnguyen03 (etnguyen03) joins
14:35:26		katocala quits [Remote host closed the connection]
15:30:06		lunik173 quits [Client Quit]
15:32:55		Kitty quits [Ping timeout: 272 seconds]
15:40:10		Kitty (Kitty) joins
15:40:22		lunik173 joins
16:08:55		Island joins
16:38:28		Shjosan quits [Client Quit]
16:38:28		Island quits [Remote host closed the connection]
16:38:36		Island joins
16:38:38		Shjosan_ (Shjosan) joins
16:51:09		dumbgoy joins
17:00:19		etnguyen03 quits [Ping timeout: 272 seconds]
17:07:19		nulldata quits [Quit: The Lounge - https://thelounge.chat]
17:09:33		nulldata (nulldata) joins
17:14:55	<h2ibot>	Megame edited Deathwatch (+265, /* 2023 */ GCN+ - Dec 19): https://wiki.archiveteam.org/?diff=51160&oldid=51153
17:27:46	<vokunal|m>	There's a post on reddit saying pannchoa.com is likely going to be taken down in 10 hours. I can't find any evidence for an actual shutdown online. There's lots of people saying it should though
17:27:47	<vokunal|m>	https://www.reddit.com/r/Archiveteam/comments/17x5qdr/pannchoacom_website_likely_being_taken_down_in_24/
17:28:40	<vokunal|m>	Just based on the number of pages they have, it looks like they probably have ~10779 posts
17:30:56	<vokunal|m>	Not sure whether this is a Deathwatch or more of a Firedrill
17:32:06		icedice (icedice) joins
17:37:02	<AK>	Already in AB by the looks of it, "!status wjfqc3qnj93820c7o97ea1vw"
17:39:22	<@JAA>	My archive.mozilla.org listing finished after 9494060 dirs. A handful of errors I need to look at. At least one of those dirs probably just can't be listed.
17:39:50	<@JAA>	14.3 GiB of listings in compressed WARC...
17:47:11		rohvani quits [Ping timeout: 272 seconds]
17:55:56	<fireonlive>	holy crap
18:00:07		rohvani joins
18:13:41		etnguyen03 (etnguyen03) joins
18:37:13		eggdrop quits [Ping timeout: 272 seconds]
18:50:21		eggdrop (eggdrop) joins
18:51:47		superkuh__ quits [Ping timeout: 272 seconds]
18:57:29		Bleo18 quits [Client Quit]
18:57:30		qwertyasdfuiopghjkl quits [Remote host closed the connection]
18:57:44		Bleo18 joins
19:00:26		Wohlstand (Wohlstand) joins
19:02:03		c3manu (c3manu) joins
19:52:50		qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
20:12:04		razul quits [Quit: Bye -]
20:13:15		razul joins
20:40:31		icedice quits [Client Quit]
20:48:26		BlueMaxima joins
20:55:37	<h2ibot>	Manu edited Political parties/Germany/Hamburg (+12464, SPD (not even finished yet)): https://wiki.archiveteam.org/?diff=51161&oldid=51116
21:22:14		cdreimanu (c3manu) joins
21:22:30		Bleo18 quits [Client Quit]
21:22:30		c3manu quits [Remote host closed the connection]
21:22:30		razul quits [Client Quit]
21:22:30		rohvani quits [Client Quit]
21:22:33		rohvani joins
21:22:41		razul joins
21:22:42		Bleo18 joins
21:31:21		HP_Archivist (HP_Archivist) joins
22:05:52	<h2ibot>	Manu edited Political parties/Germany/Hamburg (+4072, /* Sozialdemokratische Partei Deutschlands…): https://wiki.archiveteam.org/?diff=51162&oldid=51161
22:07:10		DogsRNice joins
22:25:51		Megame quits [Ping timeout: 272 seconds]
22:28:29		ScenarioPlanet (ScenarioPlanet) joins
22:32:49		yasomi quits [Ping timeout: 272 seconds]
22:36:57	<h2ibot>	Manu edited Political parties/Germany/Hamburg (+50, /* SPD Hamburg-Nord */): https://wiki.archiveteam.org/?diff=51163&oldid=51162
23:07:39		etnguyen03 quits [Ping timeout: 272 seconds]
23:16:17		driib quits [Quit: The Lounge - https://thelounge.chat]
23:17:25		driib (driib) joins
23:17:52		etnguyen03 (etnguyen03) joins
23:22:30		Arcorann (Arcorann) joins
23:43:07		Wohlstand quits [Ping timeout: 272 seconds]
23:46:39		rohvani4 joins
23:46:40		razul2 joins
23:46:44		razul quits [Client Quit]
23:46:44		rohvani quits [Client Quit]
23:46:44		Bleo18 quits [Client Quit]
23:46:44		ScenarioPlanet quits [Remote host closed the connection]
23:46:44		Arcorann quits [Remote host closed the connection]
23:46:44		DogsRNice quits [Remote host closed the connection]
23:46:44		razul2 is now known as razul
23:46:44		rohvani4 is now known as rohvani
23:46:54		DogsRNice joins
23:46:55		Bleo18 joins
23:52:28		Arcorann (Arcorann) joins
