00:00:59 | <thuban> | thank you! |
00:20:50 | <@JAA> | 1.26M dirs done, 2.83M remaining. It exploded a bit again ~15 minutes ago. |
00:34:27 | | BlueMaxima joins |
01:38:39 | | HP_Archivist (HP_Archivist) joins |
01:46:58 | | BlueMaxima quits [Read error: Connection reset by peer] |
01:51:40 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
02:05:24 | | sdomi (sdomi) joins |
02:09:00 | | rafaeletc joins |
02:13:18 | <rafaeletc> | hi folks, I need some help understanding the archives of the "fotolog" website. I am doing research on a young Brazilian movement that used the site a lot. Is this the right place for me to ask? |
02:15:43 | <@JAA> | Over 3M todo now, so I'll stop updating here until it's closer to completion. |
02:15:51 | <@JAA> | Hi rafaeletc, yeah, here is fine to ask. |
02:17:49 | <rafaeletc> | Oh, hello JAA, nice to meet u |
02:18:52 | <nicolas17> | JAA: any idea how much data is there? |
02:19:15 | <nicolas17> | I guess you're not extracting file sizes yet |
02:20:21 | <rafaeletc> | oh, a lot of gigabytes |
02:20:37 | <nicolas17> | oh I meant the mozilla archive he's currently working on :P |
02:21:41 | <rafaeletc> | nicolas17: sorry, I thought the question was for me |
02:23:58 | <@JAA> | nicolas17: No idea yet, no, just fetching the dir listings so far. |
02:24:00 | | hdgdrqer joins |
02:24:15 | | hdgdrqer quits [Remote host closed the connection] |
02:24:19 | <@arkiver> | rafaeletc: it would be good to know the question :P |
02:24:30 | <@arkiver> | but the archives are browsable in the Wayback Machine |
02:26:17 | <rafaeletc> | I'm trying to recover some content from "fotolog.com" via the Wayback Machine but I'm finding a lot of broken content, then I found this: https://archive.org/details/archiveteam_fotolog - my question is how to extract them in a way that lets me identify where the files of the profiles I want to analyze are. If you search at this link you will see |
02:26:17 | <rafaeletc> | that they are very large files to download in full and then look for a needle in a haystack. |
02:27:18 | <@arkiver> | i do see the site is marked "partially saved" |
02:27:25 | <@arkiver> | rafaeletc: do you have any URLs? |
02:27:33 | <@arkiver> | you could look those up in the Wayback Machine |
02:27:39 | <rafaeletc> | I managed to understand the metadata you have in the JSON of each upload. Is there a way to automate downloading only these JSON files? |
02:28:13 | <@arkiver> | rafaeletc: the CDX files may be useful for you, they contain lists of URLs saved in the archives - though i don't know what you are looking for |
02:28:41 | <@arkiver> | as for downloading files, you can for example use the `internetarchive` (`ia`) library for this |
02:29:06 | <@arkiver> | https://archive.org/developers/internetarchive/ |
02:29:26 | <@arkiver> | i'll be off to bed now though, but you can leave a message, and JAA or i or someone else might know what to do |
02:30:21 | <rafaeletc> | arkiver: I would like to search the comments and photos of what was archived, but not everything, only of some profiles |
02:31:17 | <@arkiver> | rafaeletc: i'm afraid 'searching' is not exactly possible, as the content of the records is not really indexed, only the URLs and some other metadata |
02:31:32 | <@arkiver> | so if you know some URLs for what you are looking for, you could look those up in the Wayback Machine |
02:31:57 | <@arkiver> | but searching all comments/pages for some terms is not possible at the moment, unless someone makes this searchable (which could be expensive) |
02:33:07 | <rafaeletc> | arkiver: so mostly it's static content |
02:33:17 | <rafaeletc> | but this application that you linked is well documented, I believe it will help me a lot; doing this via the browser is very difficult |
02:35:25 | <rafaeletc> | I am researching a cultural movement that between 2003 and 2006 used the website for dissemination and communication |
02:37:28 | <rafaeletc> | hence my interest in the comments, but as they were dynamic content, I believe I will not be able to recover them; still, I believe getting the posts can help |
02:41:00 | <rafaeletc> | pardon my bad English, I am Brazilian and natively speak Portuguese. but thank you very much, the application you pointed to illuminated the path I need to follow to research more |
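arkiver's suggestion above of looking up known URLs in the Wayback Machine can be scripted. The following is a minimal sketch, not something proposed in the channel: it builds a query URL for the public Wayback Machine CDX API (a different thing from the per-item CDX files arkiver mentions) to list captures under a URL prefix. The profile name `some_profile` and the `fotolog.com/<profile>/` URL layout are assumptions for illustration only.

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX API endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(url_prefix, year_from=None, year_to=None, limit=None):
    """Build a CDX API URL listing captures whose URL starts with url_prefix."""
    params = [
        ("url", url_prefix),
        ("matchType", "prefix"),  # match everything below the prefix
        ("output", "json"),       # ask for JSON rows instead of plain text
    ]
    if year_from is not None:
        params.append(("from", str(year_from)))
    if year_to is not None:
        params.append(("to", str(year_to)))
    if limit is not None:
        params.append(("limit", str(limit)))
    return CDX_ENDPOINT + "?" + urlencode(params)


# "some_profile" is a made-up profile name, and the URL layout is assumed.
print(cdx_query_url("fotolog.com/some_profile/", 2003, 2006, limit=50))
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return one row per capture, which can then be viewed in the Wayback Machine by timestamp and original URL.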
02:50:29 | | DogsRNice_ quits [Read error: Connection reset by peer] |
02:52:31 | <@JAA> | rafaeletc: `ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'` should work for downloading all of the megawarc JSON files. |
02:52:56 | <rafaeletc> | JAA: is it possible to use these files: https://archive.org/download/archiveteam_fotolog_20160211111954 with https://replayweb.page/ ? |
02:53:56 | <@JAA> | rafaeletc: Yes, that should work. |
02:55:23 | <rafaeletc> | JAA: I mean, the WARCs of the profiles are compressed, right? is each archived page a WARC inside the compressed file? |
02:57:02 | <nicolas17> | each page archived is a compressed record inside the warc file |
02:57:22 | <@JAA> | rafaeletc: It's a compressed file, and each record is compressed individually. replayweb.page should support reading the compressed .warc.gz file directly. |
02:57:51 | <rafaeletc> | JAA: but a warc.gz with 36gb? |
02:57:52 | <@JAA> | In other words, you do not need to decompress it. |
02:58:07 | <@JAA> | Yeah, they are large and not exactly nice to work with. |
02:59:00 | <@JAA> | Using the metadata in the JSON files, you can download just the part of the file you're interested in. |
02:59:15 | <rafaeletc> | but do I have to download it, or can I just paste the link into replayweb.page? |
02:59:56 | <@JAA> | I'm not sure. I haven't used ReplayWeb.Page before. I just know that it exists. |
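JAA's point above about using the JSON metadata to download only part of a large file can be sketched roughly as follows. This is an illustration under assumptions, not a confirmed recipe: it assumes each line of a megawarc `.json.gz` file is a JSON object with a `target` holding `container`, `offset` and `size`, and a `header_fields.name` with the original file name. The sample lines and file names below are made up. Real files are gzip-compressed, so they would be read with something like `gzip.open(path, "rt")`.

```python
import json


def find_entries(json_lines, needle):
    """Yield (name, offset, size) for entries whose name contains needle."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        target = entry.get("target", {})
        # Assumed layout: only entries stored in the WARC container are usable here.
        if target.get("container") != "warc":
            continue
        name = entry.get("header_fields", {}).get("name", "")
        if needle in name:
            yield name, target["offset"], target["size"]


def range_header(offset, size):
    """HTTP Range header selecting `size` bytes starting at `offset`."""
    # Range end positions are inclusive, hence the -1.
    return {"Range": f"bytes={offset}-{offset + size - 1}"}


# Made-up sample lines in the assumed format.
sample = [
    '{"target": {"container": "warc", "offset": 0, "size": 1000}, '
    '"header_fields": {"name": "fotolog.com-alice-20160211.warc.gz"}}',
    '{"target": {"container": "warc", "offset": 1000, "size": 2500}, '
    '"header_fields": {"name": "fotolog.com-bob-20160211.warc.gz"}}',
]

for name, offset, size in find_entries(sample, "bob"):
    print(name, range_header(offset, size))
```

Sending the resulting header with a GET request for the large `.warc.gz` on archive.org should return just that slice. Since each record is compressed individually, as noted above, the slice should itself be readable as a small `.warc.gz`.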
03:03:27 | | nicolas17 quits [Client Quit] |
03:03:27 | | CraftByte quits [Client Quit] |
03:03:30 | | nicolas17_ joins |
03:03:37 | | CraftByte (DragonSec|CraftByte) joins |
03:04:31 | | qwertyasdfuiopghjkl quits [Client Quit] |
03:04:31 | | rafaeletc quits [Client Quit] |
03:04:40 | | rafaeletc joins |
03:06:00 | | nicolas17_ is now known as nicolas17 |
03:09:28 | | CraftByte quits [Client Quit] |
03:09:28 | <rafaeletc> | oh, I figured it out: I downloaded a small file, opened it in replayweb and saw the content that was archived. luckily the comments are saved, now I need to find where the WARCs of the profiles I want to search are |
03:09:28 | <rafaeletc> | thank you all, very very much |
03:09:28 | | rafaeletc quits [Client Quit] |
03:09:41 | | aismallard quits [Client Quit] |
03:10:32 | <@JAA> | :-) |
03:10:38 | | aismallard joins |
03:12:02 | <@JAA> | I'm trying to get a partial size estimate for what I've crawled so far of archive.mozilla.org, but it makes my CPU sad. |
03:12:21 | <@JAA> | I'm already at 5 GiB of compressed WARCs for the listings so far... |
03:12:26 | <nicolas17> | x_x |
03:12:31 | <nicolas17> | } |
03:12:36 | <nicolas17> | I wonder where it's hosted |
03:13:10 | <@JAA> | Somewhere that doesn't care a bit about me hammering it for hours at least, so that's good. |
03:13:21 | | rafaeletc joins |
03:13:27 | <nicolas17> | some massive storage server exposing all this data *as a single filesystem*? |
03:13:29 | <@JAA> | 28.35.117.34.bc.googleusercontent.com |
03:13:35 | <nulldata> | Why in the *cloud*, of course! |
03:15:20 | <nicolas17> | x-goog-storage-class: NEARLINE |
03:15:29 | <nicolas17> | they must have their own thing for file listings then huh |
03:15:33 | <rafaeletc> | JAA: I accidentally closed the browser tab and did not copy the ia parameters that you had suggested, can you paste them again? |
03:15:54 | <@JAA> | rafaeletc: `ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'` |
03:16:35 | <rafaeletc> | JAA: thank you, again |
03:16:58 | <@JAA> | Happy to help. :-) |
03:17:08 | <nicolas17> | yep, files have the x-goog headers suggesting they're on Google Cloud Storage, but file listings don't, so they probably have a reverse proxy / load balancer / thing redirecting file listing requests to somewhere else |
03:17:11 | <@JAA> | I just crossed 3M dirs fetched. There's another 3.15M in the queue. |
03:17:34 | <@JAA> | And I thought my 4.8M symlinks were bad... |
03:22:20 | <@JAA> | Always fun to optimise grep/awk/sed/... pipelines to get best throughput. |
03:23:18 | | rafaeletc quits [Remote host closed the connection] |
03:23:20 | | lennier2 joins |
03:24:42 | <@JAA> | In the first 3M-ish dirs listed (breadth-first recursion), I got 71 million files. |
03:25:15 | <@JAA> | 20.7M of those are over 1 MB. |
03:26:29 | <@JAA> | 19M are over 10 MB. |
03:26:33 | | lennier2_ quits [Ping timeout: 265 seconds] |
03:26:38 | <@JAA> | 2.2M are over 100 MB. |
03:27:26 | <@JAA> | Those 2.2M alone add up to 391.29 TiB. |
03:28:10 | <@JAA> | arkiver: ^ First numbers, and the listing isn't even halfway through the discovered dirs yet. |
03:29:46 | <@JAA> | Minor correction, those are over 1/10/100 MiB, not MB. |
03:50:32 | <@JAA> | Todo is finally less than done. 3.37M remaining though. |
03:55:12 | <nicolas17> | is todo going down? :D |
03:59:56 | | Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ] |
04:00:33 | | Shjosan (Shjosan) joins |
04:01:46 | <@JAA> | No, up, but slower than done at least. :-P |
04:08:16 | <nicolas17> | great |
04:08:20 | <nicolas17> | similar to telegrab right now |
04:08:49 | <nicolas17> | completing 12400 items/min, todo going down at 900 items/min *but at least it's going down* |
04:11:18 | <@JAA> | Yeah, here I'm grabbing 10k/min but todo grows by 8k... |
04:13:13 | <nicolas17> | ow |
04:24:40 | <fireonlive> | https://x.com/discordpreviews/status/1725240412023959844?s=12 |
04:24:40 | <eggdrop> | nitter: https://nitter.net/discordpreviews/status/1725240412023959844 |
04:24:46 | <fireonlive> | waste of money is being shut down |
04:25:01 | <fireonlive> | no action required other than a “lulz” |
04:32:53 | <@JAA> | In other news, the Canucks forums are still online and active. It was supposed to shut down at the end of September. The announcement has since been edited to read: |
04:32:56 | <@JAA> | > It will be closed on __________. |
04:33:08 | <@JAA> | https://forum.canucks.com/announcement/25-forum-closure/ |
04:34:44 | <nicolas17> | literally underscores? |
04:34:47 | <@JAA> | Yep |
04:35:06 | <nicolas17> | they can't find the power button |
04:35:40 | <@JAA> | The only forum admin has no idea what's going on either. :-) |
04:36:09 | | dumbgoy quits [Ping timeout: 272 seconds] |
04:56:45 | | Wohlstand quits [Client Quit] |
04:59:13 | | neggles quits [Quit: bye friends - ZNC - https://znc.in] |
05:01:17 | | neggles (neggles) joins |
05:02:12 | | neggles quits [Client Quit] |
05:31:52 | | etnguyen03 quits [Client Quit] |
06:29:43 | | systwi_ quits [Quit: systwi_] |
06:29:43 | | nothere quits [Quit: Leaving] |
06:50:44 | | nothere joins |
06:52:35 | | nicolas17 quits [Client Quit] |
07:13:13 | | Island quits [Read error: Connection reset by peer] |
08:03:09 | | neggles (neggles) joins |
10:00:02 | | Bleo18 quits [Client Quit] |
10:01:23 | | Bleo18 joins |
10:34:57 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
12:35:02 | | nicolas17 joins |
13:20:33 | | HP_Archivist quits [Ping timeout: 272 seconds] |
13:50:57 | | Arcorann quits [Ping timeout: 272 seconds] |
13:59:50 | | Megame (Megame) joins |
14:26:17 | | etnguyen03 (etnguyen03) joins |
14:35:26 | | katocala quits [Remote host closed the connection] |
15:30:06 | | lunik173 quits [Client Quit] |
15:32:55 | | Kitty quits [Ping timeout: 272 seconds] |
15:40:10 | | Kitty (Kitty) joins |
15:40:22 | | lunik173 joins |
16:08:55 | | Island joins |
16:38:28 | | Shjosan quits [Client Quit] |
16:38:28 | | Island quits [Remote host closed the connection] |
16:38:36 | | Island joins |
16:38:38 | | Shjosan_ (Shjosan) joins |
16:51:09 | | dumbgoy joins |
17:00:19 | | etnguyen03 quits [Ping timeout: 272 seconds] |
17:07:19 | | nulldata quits [Quit: The Lounge - https://thelounge.chat] |
17:09:33 | | nulldata (nulldata) joins |
17:14:55 | <h2ibot> | Megame edited Deathwatch (+265, /* 2023 */ GCN+ - Dec 19): https://wiki.archiveteam.org/?diff=51160&oldid=51153 |
17:27:46 | <vokunal|m> | There's a post on reddit saying pannchoa.com is likely going to be taken down in 10 hours. I can't find any evidence for an actual shutdown online. There's lots of people saying it should though |
17:27:47 | <vokunal|m> | https://www.reddit.com/r/Archiveteam/comments/17x5qdr/pannchoacom_website_likely_being_taken_down_in_24/ |
17:28:40 | <vokunal|m> | Just based on the number of pages they have, it looks like they probably have ~10779 posts |
17:30:56 | <vokunal|m> | Not sure whether this is a Deathwatch or more of a Firedrill |
17:32:06 | | icedice (icedice) joins |
17:37:02 | <AK> | Already in AB by the looks of it, "!status wjfqc3qnj93820c7o97ea1vw" |
17:39:22 | <@JAA> | My archive.mozilla.org listing finished after 9494060 dirs. A handful of errors I need to look at. At least one of those dirs probably just can't be listed. |
17:39:50 | <@JAA> | 14.3 GiB of listings in compressed WARC... |
17:47:11 | | rohvani quits [Ping timeout: 272 seconds] |
17:55:56 | <fireonlive> | holy crap |
18:00:07 | | rohvani joins |
18:13:41 | | etnguyen03 (etnguyen03) joins |
18:37:13 | | eggdrop quits [Ping timeout: 272 seconds] |
18:50:21 | | eggdrop (eggdrop) joins |
18:51:47 | | superkuh__ quits [Ping timeout: 272 seconds] |
18:57:29 | | Bleo18 quits [Client Quit] |
18:57:30 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
18:57:44 | | Bleo18 joins |
19:00:26 | | Wohlstand (Wohlstand) joins |
19:02:03 | | c3manu (c3manu) joins |
19:52:50 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
20:12:04 | | razul quits [Quit: Bye -] |
20:13:15 | | razul joins |
20:40:31 | | icedice quits [Client Quit] |
20:48:26 | | BlueMaxima joins |
20:55:37 | <h2ibot> | Manu edited Political parties/Germany/Hamburg (+12464, SPD (not even finished yet)): https://wiki.archiveteam.org/?diff=51161&oldid=51116 |
21:22:14 | | cdreimanu (c3manu) joins |
21:22:30 | | Bleo18 quits [Client Quit] |
21:22:30 | | c3manu quits [Remote host closed the connection] |
21:22:30 | | razul quits [Client Quit] |
21:22:30 | | rohvani quits [Client Quit] |
21:22:33 | | rohvani joins |
21:22:41 | | razul joins |
21:22:42 | | Bleo18 joins |
21:31:21 | | HP_Archivist (HP_Archivist) joins |
22:05:52 | <h2ibot> | Manu edited Political parties/Germany/Hamburg (+4072, /* Sozialdemokratische Partei Deutschlands…): https://wiki.archiveteam.org/?diff=51162&oldid=51161 |
22:07:10 | | DogsRNice joins |
22:25:51 | | Megame quits [Ping timeout: 272 seconds] |
22:28:29 | | ScenarioPlanet (ScenarioPlanet) joins |
22:32:49 | | yasomi quits [Ping timeout: 272 seconds] |
22:36:57 | <h2ibot> | Manu edited Political parties/Germany/Hamburg (+50, /* SPD Hamburg-Nord */): https://wiki.archiveteam.org/?diff=51163&oldid=51162 |
23:07:39 | | etnguyen03 quits [Ping timeout: 272 seconds] |
23:16:17 | | driib quits [Quit: The Lounge - https://thelounge.chat] |
23:17:25 | | driib (driib) joins |
23:17:52 | | etnguyen03 (etnguyen03) joins |
23:22:30 | | Arcorann (Arcorann) joins |
23:43:07 | | Wohlstand quits [Ping timeout: 272 seconds] |
23:46:39 | | rohvani4 joins |
23:46:40 | | razul2 joins |
23:46:44 | | razul quits [Client Quit] |
23:46:44 | | rohvani quits [Client Quit] |
23:46:44 | | Bleo18 quits [Client Quit] |
23:46:44 | | ScenarioPlanet quits [Remote host closed the connection] |
23:46:44 | | Arcorann quits [Remote host closed the connection] |
23:46:44 | | DogsRNice quits [Remote host closed the connection] |
23:46:44 | | razul2 is now known as razul |
23:46:44 | | rohvani4 is now known as rohvani |
23:46:54 | | DogsRNice joins |
23:46:55 | | Bleo18 joins |
23:52:28 | | Arcorann (Arcorann) joins |