00:00:59<thuban>thank you!
00:20:50<@JAA>1.26M dirs done, 2.83M remaining. It exploded a bit again ~15 minutes ago.
00:34:27BlueMaxima joins
01:38:39HP_Archivist (HP_Archivist) joins
01:46:58BlueMaxima quits [Read error: Connection reset by peer]
01:51:40qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
02:05:24sdomi (sdomi) joins
02:09:00rafaeletc joins
02:13:18<rafaeletc>hi folks, I need help understanding the archives of the "fotolog" website. I am doing research on a young Brazilian movement that used the site a lot; is here the right place for me to ask?
02:15:43<@JAA>Over 3M todo now, so I'll stop updating here until it's closer to completion.
02:15:51<@JAA>Hi rafaeletc, yeah, here is fine to ask.
02:17:49<rafaeletc>Oh, hello JAA, nice to meet u
02:18:52<nicolas17>JAA: any idea how much data is there?
02:19:15<nicolas17>I guess you're not extracting file sizes yet
02:20:21<rafaeletc>oh, a lot of gigabytes
02:20:37<nicolas17>oh I meant the mozilla archive he's currently working on :P
02:21:41<rafaeletc>nicolas17: sorry, I thought the question was for me
02:23:58<@JAA>nicolas17: No idea yet, no, just fetching the dir listings so far.
02:24:00hdgdrqer joins
02:24:15hdgdrqer quits [Remote host closed the connection]
02:24:19<@arkiver>rafaeletc: it would be good to know the question :P
02:24:30<@arkiver>but the archives are browsable in the Wayback Machine
02:26:17<rafaeletc>I'm trying to recover some content from "fotolog.com" via the Wayback Machine, but I'm finding a lot of broken content. Then I found this: https://archive.org/details/archiveteam_fotolog - my question is how to extract them in a way that lets me identify where the files of the profiles I want to analyze are; if you search in this link you will see
02:26:17<rafaeletc>that they are very large files to download fully just to look for a needle in a haystack.
02:27:18<@arkiver>i do see the site is marked "partially saved"
02:27:25<@arkiver>rafaeletc: do you have any URLs?
02:27:33<@arkiver>you could look those up in the Wayback Machine
02:27:39<rafaeletc>I managed to understand the metadata you have in the JSON of each upload; is there a way to automate downloading only these JSON files?
02:28:13<@arkiver>rafaeletc: the CDX files may be useful for you, they contain lists of URLs saved in the archives - though i don't know what you are looking for
02:28:41<@arkiver>as for downloading files, you can for example use the `internetarchive` (`ia`) library for this
02:29:06<@arkiver>https://archive.org/developers/internetarchive/
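A minimal sketch of that workflow with the `ia` CLI, for reference: the collection name is from the discussion above, but the glob patterns and the presence of per-item CDX files are assumptions, so check an item's file list first.
```
# install the client and list the item identifiers in the collection
pip install internetarchive
ia search 'collection:archiveteam_fotolog' --itemlist

# fetch only the small metadata/index files from one item
# (this identifier appears later in the log; the globs are assumptions)
ia download archiveteam_fotolog_20160211111954 --glob '*.json.gz'
ia download archiveteam_fotolog_20160211111954 --glob '*.cdx*'

# a CDX index has one line per archived URL, so profiles can be grepped by name
zcat archiveteam_fotolog_20160211111954/*.cdx.gz | grep -i 'profilename'
```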
02:29:26<@arkiver>i'll be off to bed now though, but you can leave a message and me or JAA or someone else might know what to do
02:30:21<rafaeletc>arkiver: I would like to search the comments and photos of what was archived, but not everything, only of some profiles
02:31:17<@arkiver>rafaeletc: i'm afraid 'searching' is not exactly possible, as the content of the records is not really indexed, only the URLs and some other metadata
02:31:32<@arkiver>so if you know some URLs for what you are looking for, you could look those up in the Wayback Machine
02:31:57<@arkiver>but searching all comments/pages for some terms is not possible at the moment, unless someone makes this searchable (which could be expensive)
02:33:07<rafaeletc>arkiver: so most of it is static content
02:33:17<rafaeletc>but this tool that you linked is well documented; I believe it will help me a lot, as doing this via the browser is very difficult
02:35:25<rafaeletc>I am researching a cultural movement that between 2003 and 2006 used the website for dissemination and communication
02:37:28<rafaeletc>hence my interest in the comments. As they were dynamic content, I believe I will not be able to recover them, but getting the publications should still help
02:41:00<rafaeletc>pardon my bad English, I am Brazilian and natively speak Portuguese. But thank you very much, the tool you pointed me to illuminated the path I need to follow to research further
02:50:29DogsRNice_ quits [Read error: Connection reset by peer]
02:52:31<@JAA>rafaeletc: `ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'` should work for downloading all of the megawarc JSON files.
02:52:56<rafaeletc>JAA: is it possible to use these files: https://archive.org/download/archiveteam_fotolog_20160211111954 with https://replayweb.page/ ?
02:53:56<@JAA>rafaeletc: Yes, that should work.
02:55:23<rafaeletc>JAA: I mean, the WARCs of the profiles are compressed, right? Is each archived page a WARC inside the compressed file?
02:57:02<nicolas17>each page archived is a compressed record inside the warc file
02:57:22<@JAA>rafaeletc: It's a compressed file, and each record is compressed individually. replayweb.page should support reading the compressed .warc.gz file directly.
02:57:51<rafaeletc>JAA: but a warc.gz of 36 GB?
02:57:52<@JAA>In other words, you do not need to decompress it.
02:58:07<@JAA>Yeah, they are large and not exactly nice to work with.
02:59:00<@JAA>Using the metadata in the JSON files, you can download just the part of the file you're interested in.
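A sketch of that partial download, assuming you have pulled one record's byte offset and compressed length out of the megawarc JSON; the file name and values below are hypothetical. Because each record in a .warc.gz is its own gzip member, the downloaded slice decompresses as a standalone record.
```
# offset/length of one record, from the megawarc JSON metadata (hypothetical values)
OFFSET=123456789
LENGTH=54321
ITEM=archiveteam_fotolog_20160211111954
FILE=example.megawarc.warc.gz   # hypothetical file name inside the item

# fetch just that byte range over HTTP and decompress it on its own
curl -sL -r "${OFFSET}-$((OFFSET + LENGTH - 1))" \
  "https://archive.org/download/${ITEM}/${FILE}" > record.warc.gz
zcat record.warc.gz | less
```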
02:59:15<rafaeletc>but do I have to download it, or can I just paste the link into replayweb.page?
02:59:56<@JAA>I'm not sure. I haven't used ReplayWeb.Page before. I just know that it exists.
03:03:27nicolas17 quits [Client Quit]
03:03:27CraftByte quits [Client Quit]
03:03:30nicolas17_ joins
03:03:37CraftByte (DragonSec|CraftByte) joins
03:04:31qwertyasdfuiopghjkl quits [Client Quit]
03:04:31rafaeletc quits [Client Quit]
03:04:40rafaeletc joins
03:06:00nicolas17_ is now known as nicolas17
03:09:28CraftByte quits [Client Quit]
03:09:28<rafaeletc>oh, I figured it out: I downloaded a small file, opened it in replayweb, and saw the content that was archived. Luckily the comments are saved; now I just need to find which WARCs hold the profiles I want to search
03:09:28<rafaeletc>thank you all, very very much
03:09:28rafaeletc quits [Client Quit]
03:09:41aismallard quits [Client Quit]
03:10:32<@JAA>:-)
03:10:38aismallard joins
03:12:02<@JAA>I'm trying to get a partial size estimate for what I've crawled so far of archive.mozilla.org, but it makes my CPU sad.
03:12:21<@JAA>I'm already at 5 GiB of compressed WARCs for the listings so far...
03:12:26<nicolas17>x_x
03:12:31<nicolas17>}
03:12:36<nicolas17>I wonder where it's hosted
03:13:10<@JAA>Somewhere that doesn't care a bit about me hammering it for hours at least, so that's good.
03:13:21rafaeletc joins
03:13:27<nicolas17>some massive storage server exposing all this data *as a single filesystem*?
03:13:29<@JAA>28.35.117.34.bc.googleusercontent.com
03:13:35<nulldata>Why in the <i>cloud</i>, of course!
03:15:20<nicolas17>x-goog-storage-class: NEARLINE
03:15:29<nicolas17>they must have their own thing for file listings then huh
03:15:33<rafaeletc>JAA: I accidentally closed the browser tab and did not copy the ia parameters that you had suggested, can you paste them again?
03:15:54<@JAA>rafaeletc: `ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'`
03:16:35<rafaeletc>JAA: thank you, again
03:16:58<@JAA>Happy to help. :-)
03:17:08<nicolas17>yep, files have the x-goog headers suggesting they're on Google Cloud Storage, but file listings don't, so they probably have a reverse proxy / load balancer / thing redirecting file listing requests to somewhere else
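That split is easy to verify from a shell; the file path below is just an assumed example under the same host.
```
# directory listings: no storage headers expected
curl -sI 'https://archive.mozilla.org/pub/' | grep -i '^x-goog' || echo 'no x-goog headers'

# files: x-goog-* response headers point at Google Cloud Storage (path is an assumption)
curl -sI 'https://archive.mozilla.org/pub/firefox/releases/119.0/SHA256SUMS' | grep -i '^x-goog'
```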
03:17:11<@JAA>I just crossed 3M dirs fetched. There's another 3.15M in the queue.
03:17:34<@JAA>And I thought my 4.8M symlinks were bad...
03:22:20<@JAA>Always fun to optimise grep/awk/sed/... pipelines to get the best throughput.
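The usual tricks, for anyone curious: force the byte-oriented C locale, prefer fixed-string matching, and collapse multi-stage pipelines into a single pass. The file name and patterns are placeholders.
```
# the C locale skips slow UTF-8 handling in grep/awk/sed
LC_ALL=C grep -cF '.tar.bz2' listings.txt

# one awk pass instead of a grep | sed | awk chain
LC_ALL=C awk '/\.tar\.bz2$/ { n++ } END { print n }' listings.txt
```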
03:23:18rafaeletc quits [Remote host closed the connection]
03:23:20lennier2 joins
03:24:42<@JAA>In the first 3M-ish dirs listed (breadth-first recursion), I got 71 million files.
03:25:15<@JAA>20.7M of those are over 1 MB.
03:26:29<@JAA>19M are over 10 MB.
03:26:33lennier2_ quits [Ping timeout: 265 seconds]
03:26:38<@JAA>2.2M are over 100 MB.
03:27:26<@JAA>Those 2.2M alone add up to 391.29 TiB.
03:28:10<@JAA>arkiver: ^ First numbers, and the listing isn't even halfway through the discovered dirs yet.
03:29:46<@JAA>Minor correction, those are over 1/10/100 MiB, not MB.
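Counts like these fall out of a single awk pass over a size listing; the two-column `<size-in-bytes> <path>` input format here is an assumption.
```
awk '{
  files++
  if ($1 > 2^20)       m1++                     # > 1 MiB
  if ($1 > 10 * 2^20)  m10++                    # > 10 MiB
  if ($1 > 100 * 2^20) { m100++; bytes += $1 }  # > 100 MiB, and sum their size
}
END {
  printf "files=%d  >1MiB=%d  >10MiB=%d  >100MiB=%d (%.2f TiB)\n",
         files, m1, m10, m100, bytes / 2^40
}' sizes.txt
```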
03:50:32<@JAA>Todo is finally less than done. 3.37M remaining though.
03:55:12<nicolas17>is todo going down? :D
03:59:56Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ]
04:00:33Shjosan (Shjosan) joins
04:01:46<@JAA>No, up, but slower than done at least. :-P
04:08:16<nicolas17>great
04:08:20<nicolas17>similar to telegrab right now
04:08:49<nicolas17>completing 12400 items/min, todo going down at 900 items/min *but at least it's going down*
04:11:18<@JAA>Yeah, here I'm grabbing 10k/min but todo grows by 8k...
04:13:13<nicolas17>ow
04:24:40<fireonlive>https://x.com/discordpreviews/status/1725240412023959844?s=12
04:24:40<eggdrop>nitter: https://nitter.net/discordpreviews/status/1725240412023959844
04:24:46<fireonlive>waste of money is being shut down
04:25:01<fireonlive>no action required other than a “lulz”
04:32:53<@JAA>In other news, the Canucks forums are still online and active. It was supposed to shut down at the end of September. The announcement has since been edited to read:
04:32:56<@JAA>> It will be closed on __________.
04:33:08<@JAA>https://forum.canucks.com/announcement/25-forum-closure/
04:34:44<nicolas17>literally underscores?
04:34:47<@JAA>Yep
04:35:06<nicolas17>they can't find the power button
04:35:40<@JAA>The only forum admin has no idea what's going on either. :-)
04:36:09dumbgoy quits [Ping timeout: 272 seconds]
04:56:45Wohlstand quits [Client Quit]
04:59:13neggles quits [Quit: bye friends - ZNC - https://znc.in]
05:01:17neggles (neggles) joins
05:02:12neggles quits [Client Quit]
05:31:52etnguyen03 quits [Client Quit]
06:29:43systwi_ quits [Quit: systwi_]
06:29:43nothere quits [Quit: Leaving]
06:50:44nothere joins
06:52:35nicolas17 quits [Client Quit]
07:13:13Island quits [Read error: Connection reset by peer]
08:03:09neggles (neggles) joins
10:00:02Bleo18 quits [Client Quit]
10:01:23Bleo18 joins
10:34:57qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
12:35:02nicolas17 joins
13:20:33HP_Archivist quits [Ping timeout: 272 seconds]
13:50:57Arcorann quits [Ping timeout: 272 seconds]
13:59:50Megame (Megame) joins
14:26:17etnguyen03 (etnguyen03) joins
14:35:26katocala quits [Remote host closed the connection]
15:30:06lunik173 quits [Client Quit]
15:32:55Kitty quits [Ping timeout: 272 seconds]
15:40:10Kitty (Kitty) joins
15:40:22lunik173 joins
16:08:55Island joins
16:38:28Shjosan quits [Client Quit]
16:38:28Island quits [Remote host closed the connection]
16:38:36Island joins
16:38:38Shjosan_ (Shjosan) joins
16:51:09dumbgoy joins
17:00:19etnguyen03 quits [Ping timeout: 272 seconds]
17:07:19nulldata quits [Quit: The Lounge - https://thelounge.chat]
17:09:33nulldata (nulldata) joins
17:14:55<h2ibot>Megame edited Deathwatch (+265, /* 2023 */ GCN+ - Dec 19): https://wiki.archiveteam.org/?diff=51160&oldid=51153
17:27:46<vokunal|m>There's a post on reddit saying pannchoa.com is likely going to be taken down in 10 hours. I can't find any evidence of an actual shutdown online, though there are lots of people saying it should happen
17:27:47<vokunal|m>https://www.reddit.com/r/Archiveteam/comments/17x5qdr/pannchoacom_website_likely_being_taken_down_in_24/
17:28:40<vokunal|m>Just based on the number of pages they have, it looks like they probably have ~10779 posts
17:30:56<vokunal|m>Not sure whether this is a Deathwatch or more of a Firedrill
17:32:06icedice (icedice) joins
17:37:02<AK>Already in AB by the looks of it, "!status wjfqc3qnj93820c7o97ea1vw"
17:39:22<@JAA>My archive.mozilla.org listing finished after 9494060 dirs. A handful of errors I need to look at. At least one of those dirs probably just can't be listed.
17:39:50<@JAA>14.3 GiB of listings in compressed WARC...
17:47:11rohvani quits [Ping timeout: 272 seconds]
17:55:56<fireonlive>holy crap
18:00:07rohvani joins
18:13:41etnguyen03 (etnguyen03) joins
18:37:13eggdrop quits [Ping timeout: 272 seconds]
18:50:21eggdrop (eggdrop) joins
18:51:47superkuh__ quits [Ping timeout: 272 seconds]
18:57:29Bleo18 quits [Client Quit]
18:57:30qwertyasdfuiopghjkl quits [Remote host closed the connection]
18:57:44Bleo18 joins
19:00:26Wohlstand (Wohlstand) joins
19:02:03c3manu (c3manu) joins
19:52:50qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
20:12:04razul quits [Quit: Bye -]
20:13:15razul joins
20:40:31icedice quits [Client Quit]
20:48:26BlueMaxima joins
20:55:37<h2ibot>Manu edited Political parties/Germany/Hamburg (+12464, SPD (not even finished yet)): https://wiki.archiveteam.org/?diff=51161&oldid=51116
21:22:14cdreimanu (c3manu) joins
21:22:30Bleo18 quits [Client Quit]
21:22:30c3manu quits [Remote host closed the connection]
21:22:30razul quits [Client Quit]
21:22:30rohvani quits [Client Quit]
21:22:33rohvani joins
21:22:41razul joins
21:22:42Bleo18 joins
21:31:21HP_Archivist (HP_Archivist) joins
22:05:52<h2ibot>Manu edited Political parties/Germany/Hamburg (+4072, /* Sozialdemokratische Partei Deutschlands…): https://wiki.archiveteam.org/?diff=51162&oldid=51161
22:07:10DogsRNice joins
22:25:51Megame quits [Ping timeout: 272 seconds]
22:28:29ScenarioPlanet (ScenarioPlanet) joins
22:32:49yasomi quits [Ping timeout: 272 seconds]
22:36:57<h2ibot>Manu edited Political parties/Germany/Hamburg (+50, /* SPD Hamburg-Nord */): https://wiki.archiveteam.org/?diff=51163&oldid=51162
23:07:39etnguyen03 quits [Ping timeout: 272 seconds]
23:16:17driib quits [Quit: The Lounge - https://thelounge.chat]
23:17:25driib (driib) joins
23:17:52etnguyen03 (etnguyen03) joins
23:22:30Arcorann (Arcorann) joins
23:43:07Wohlstand quits [Ping timeout: 272 seconds]
23:46:39rohvani4 joins
23:46:40razul2 joins
23:46:44razul quits [Client Quit]
23:46:44rohvani quits [Client Quit]
23:46:44Bleo18 quits [Client Quit]
23:46:44ScenarioPlanet quits [Remote host closed the connection]
23:46:44Arcorann quits [Remote host closed the connection]
23:46:44DogsRNice quits [Remote host closed the connection]
23:46:44razul2 is now known as razul
23:46:44rohvani4 is now known as rohvani
23:46:54DogsRNice joins
23:46:55Bleo18 joins
23:52:28Arcorann (Arcorann) joins