00:00:45dm4v quits [Read error: Connection reset by peer]
00:01:11dm4v joins
00:01:13dm4v quits [Changing host]
00:01:13dm4v (dm4v) joins
00:26:06Core1427 (Cobalt17) joins
00:28:13Earendil quits [Ping timeout: 258 seconds]
00:58:59Stiletto joins
01:00:37Arcorann_ joins
01:02:43dm4v quits [Ping timeout: 258 seconds]
01:03:06dm4v joins
01:03:09dm4v quits [Changing host]
01:03:09dm4v (dm4v) joins
01:09:55Earendil (Cobalt17) joins
01:11:55Core1427 quits [Ping timeout: 258 seconds]
02:03:40HP_Archivist quits [Ping timeout: 258 seconds]
02:09:09<@JAA>So the anniversary of the death of Comicogs & Co. is approaching, and they *still* haven't managed to upload the dumps... *facepalm* https://comics.discogs.com/
02:21:42<Ryz>You know the vid dot me stuff, non-zero chance the companies that have embedded links to that website will just delete the articles instead
02:22:09<@JAA>Not much we can do about that though.
02:26:13<pabs>could use a search engine to find links to vid.me and archive the articles?
02:28:48<@JAA>They're embeds, not links. Is there any search engine that indexes those?
02:30:57<pabs>hmm
02:31:30<@JAA>Also, scraping search engines is hard to impossible. They don't like that at all. Only Bing kind of tolerates a slow speed.
02:31:45<@JAA>So... $£€$£€
02:55:45Megame (Megame) joins
03:05:45HP_Archivist (HP_Archivist) joins
03:15:46qw3rty__ joins
03:19:34qw3rty_ quits [Ping timeout: 258 seconds]
03:43:02Iki joins
04:16:35Ruthalas quits [Read error: Connection reset by peer]
04:16:51Ruthalas (Ruthalas) joins
04:18:12AntiLiberal joins
04:28:34AntiLiberal quits [Ping timeout: 258 seconds]
04:30:43AntiLiberal joins
04:34:33<h2ibot>Bauerbach edited SCP Foundation (+100): https://wiki.archiveteam.org/?diff=47007&oldid=46701
04:40:26DogsRNice quits [Read error: Connection reset by peer]
04:46:40BlueMaxima quits [Client Quit]
04:47:21AntiLiberal quits [Ping timeout: 258 seconds]
05:17:15Iki quits [Ping timeout: 258 seconds]
05:21:00Core8615 (Cobalt17) joins
05:21:00Earendil quits [Read error: Connection reset by peer]
05:33:36Stiletto quits [Ping timeout: 250 seconds]
05:44:00driib quits [Ping timeout: 250 seconds]
05:54:00driib (driib) joins
06:01:23Matthww80 joins
06:02:12Matthww8 quits [Ping timeout: 250 seconds]
06:02:12Matthww80 is now known as Matthww8
06:06:59Earendil (Cobalt17) joins
06:10:52Core8615 quits [Ping timeout: 250 seconds]
06:39:27vela quits [Quit: vela]
06:43:32vela (vela) joins
07:27:54Core2100 (Cobalt17) joins
07:29:18Earendil quits [Ping timeout: 250 seconds]
07:32:14spirit joins
07:33:43tzt quits [Ping timeout: 258 seconds]
07:57:06swebb quits [Ping timeout: 258 seconds]
07:59:09swebb joins
08:02:08<h2ibot>Sanqui created Sweb.cz (+215, Created page with "Czech freehost provided by…): https://wiki.archiveteam.org/?title=Sweb.cz
08:39:39HP_Archivist quits [Ping timeout: 258 seconds]
09:27:38Matthww89 joins
09:28:54Matthww8 quits [Ping timeout: 250 seconds]
09:28:54Matthww89 is now known as Matthww8
09:35:47Matthww83 joins
09:36:42Matthww8 quits [Ping timeout: 250 seconds]
09:36:42Matthww83 is now known as Matthww8
10:11:54ragu__ joins
10:13:34ragu_ quits [Ping timeout: 258 seconds]
10:14:01Stiletto joins
10:18:17Stiletto quits [Read error: Connection reset by peer]
10:18:34Stiletto joins
12:06:38TheRealZago (TheRealZago) joins
12:19:18@OrIdow6 quits [Ping timeout: 258 seconds]
12:28:21TheRealZago quits [Read error: Connection reset by peer]
12:37:50fuzzy8021 quits [Ping timeout: 250 seconds]
12:51:23Core2100 quits [Remote host closed the connection]
12:51:39Earendil (Cobalt17) joins
13:03:18fuzzy8021 (fuzzy8021) joins
13:50:09AntiLiberal joins
13:50:10Matthww88 joins
13:51:04Matthww8 quits [Ping timeout: 250 seconds]
13:51:04Matthww88 is now known as Matthww8
13:53:29AntiLiberal2 joins
13:56:17AntiLiberal quits [Ping timeout: 258 seconds]
14:03:38abcde quits [Ping timeout: 244 seconds]
14:09:50Matthww86 joins
14:11:52Matthww8 quits [Ping timeout: 250 seconds]
14:11:52Matthww86 is now known as Matthww8
14:46:43marked10 joins
14:48:16marked1 quits [Ping timeout: 250 seconds]
14:48:16marked10 is now known as marked1
14:54:32Gereon quits [Quit: The Lounge - https://thelounge.chat]
14:59:09Arcorann_ quits [Ping timeout: 258 seconds]
15:03:52Barto quits [Ping timeout: 250 seconds]
15:04:08Barto (Barto) joins
15:07:26Gereon (Gereon) joins
15:40:41Ryz quits [Remote host closed the connection]
15:41:11Ryz (Ryz) joins
17:06:10ragu_ joins
17:06:17tzt joins
17:09:52ragu__ quits [Ping timeout: 258 seconds]
18:23:56HP_Archivist (HP_Archivist) joins
19:02:11Earendil quits [Read error: Connection reset by peer]
19:02:33Earendil (Cobalt17) joins
19:03:33DogsRNice (Webuser299) joins
19:05:22<@JAA>FloydHub's a JS hellhole, so archiving it is tricky. The search accepts an empty query: https://www.floydhub.com/search/projects?page=0&query=
19:05:23Megame quits [Client Quit]
19:07:10<Jake>11806 pages on the empty search for projects
19:07:25<@JAA>Fun
19:08:07<Jake>5738 pages of datasets
19:09:26<Jake>8644 pages of users
19:09:31<@JAA>lol
19:09:35<@JAA>It's all in their main JS file.
19:09:42<@JAA>Because SPA.
19:10:10Stiletto quits [Remote host closed the connection]
19:10:11<@JAA>Oh wait no, those are other hits. Odd
19:10:38<Jake>SPA :(
19:11:11GNU_world joins
19:11:12S0V3R3IGNTY joins
19:11:19jamesp joins
19:12:10graf__ joins
19:12:13<jamesp>wait, who is @JAA?
19:12:20<@JAA>Jake: How did you find those numbers so quickly?
19:12:33<Jake>just casually went through it, starting with bigger numbers
19:12:56<rewby>jamesp: He's just another archivist.
19:13:04<@JAA>:-)
19:13:35<jamesp>I'm just wondering about textfiles. Does he come on?
19:13:53<rewby>He's around. Usually.
19:14:12<rewby>Don't ping him though, that's like poking a sleeping bear
19:14:31<jamesp>he isn't on
19:14:47<rewby>His IRC username isn't textfiles
19:15:28<jamesp>then what is it? put a space before the last character, like james p
19:16:57<@JAA>Projects API endpoint is this: https://www.floydhub.com/api/v1/projects/search?query=&limit=15&offset=15 (for the second page) Limit maxes out at a bit over 1000.
19:17:24<@JAA>1023 is the maximum allowed limit, to be precise.
19:18:13<Jake>some of floydhub seems partially broken already, https://www.floydhub.com/fastai/projects/lesson1_dogs_cats the project is 'empty', but has a few jobs, some of which have files
19:18:29<@JAA>Yeah, I haven't found any non-empty project yet, actually.
19:19:02<Jake>I think they are all displaying as empty for some odd reason. This job seems to show the code for the project: https://www.floydhub.com/fastai/projects/lesson1_dogs_cats/13/code
19:21:20<Jake>max offsets projects: https://www.floydhub.com/api/v1/projects/search?query=&limit=15&offset=177090 datasets: https://www.floydhub.com/api/v1/datasets/search?query=&limit=15&offset=86070 users: https://www.floydhub.com/api/v1/profile/search?query=&limit=15&offset=129660
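A minimal sketch of how the limit/offset pagination discussed above could be enumerated. The endpoint paths, the maximum limit of 1023, and the maximum offsets are taken from the messages above; the function name `search_urls` is hypothetical, and it is an assumption that the API accepts arbitrary offsets rather than only multiples of 15.

```python
from urllib.parse import urlencode

API_BASE = "https://www.floydhub.com/api/v1"
MAX_LIMIT = 1023  # maximum allowed limit, as reported in the channel


def search_urls(kind, max_offset, limit=MAX_LIMIT):
    """Yield search URLs whose offsets cover 0..max_offset inclusive.

    `kind` is the path segment seen in the channel: "projects",
    "datasets" or "profile". Hypothetical helper, not FloydHub tooling.
    """
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    for offset in range(0, max_offset + 1, limit):
        # An empty query string matches everything, per the empty search.
        params = urlencode({"query": "", "limit": limit, "offset": offset})
        yield f"{API_BASE}/{kind}/search?{params}"
```

For example, `list(search_urls("projects", 177090))` covers the whole project listing in far fewer requests than paging through it 15 items at a time.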
19:23:58<Jake>I'll run through all the datasets and get a size estimate real quick.
19:24:05Core7846 (Cobalt17) joins
19:25:38<@JAA>Datasets are separate from job outputs, it seems?
19:26:28Earendil quits [Ping timeout: 250 seconds]
19:26:28<Jake>I believe so
19:26:30<@JAA>But I imagine the datasets will be much larger.
19:32:12<h2ibot>JustAnotherArchivist edited Deathwatch (+141, /* 2021 */ Add FloydHub): https://wiki.archiveteam.org/?diff=47009&oldid=47002
19:33:22Iki joins
19:33:57bsmith093 joins
19:37:02<Jake>script started. might be a bit.
19:41:37m0nika quits [Remote host closed the connection]
19:41:51m0nika (m0nika) joins
19:43:45m0nika quits [Client Quit]
19:43:59Earendil (Cobalt17) joins
19:44:01m0nika (m0nika) joins
19:45:58Core7846 quits [Ping timeout: 250 seconds]
19:46:03m0nika quits [Client Quit]
19:46:50m0nika (m0nika) joins
19:58:06AntiLiberal2 quits [Ping timeout: 250 seconds]
20:00:02<Ryz>Tumblr is taking a page from Patreon and adding Tumblr account posts that can only be accessed by paying: https://cdn.discordapp.com/attachments/455120412460974104/868582804609458197/E68AvqbVcAI2FRt.png - https://cdn.discordapp.com/attachments/455120412460974104/868582834409984010/E68AzrMVgAEOLlE.png
20:02:38<Jake>alright, FloydHub datasets, I got 35363216376832 bytes for total size, or around 35 terabytes.
20:02:52qw3rty__ quits [Ping timeout: 250 seconds]
20:03:53<Jake>Will extract a full list of URLs to use later in a minute.
20:04:35qw3rty joins
20:04:36<Ryz>https://techcrunch.com/2021/07/21/tumblr-debuts-post-a-subscription-service-for-gen-z-creators/ - https://techcrunch.com/2021/07/22/tumblr-community-lash-out-post-plus-subscription/
20:04:37Myself (myself) joins
20:13:05rsn joins
20:15:13<@JAA>35 TB doesn't sound too bad. I wonder how much duplication there is.
20:15:13Ajay_m quits [Read error: Connection reset by peer]
20:15:23<jamesp>Wow...the videos are still up!
20:15:28<@JAA>Dude...
20:16:11Ajay_m joins
20:16:15<jamesp>sorry wrong channel
20:16:32spirit quits [Client Quit]
20:17:23<Jake>I also used totalSizeBytes rather than latestSizeBytes, so that might count however many versions exist. Most seem to have one version, though. I'll do another run with latest as well.
20:31:55abcde joins
20:52:35Iki quits [Ping timeout: 258 seconds]
20:57:02Doranwen quits [Ping timeout: 250 seconds]
20:57:34Ajay_m quits [Ping timeout: 258 seconds]
21:01:04Ajay_m joins
21:01:13<VerifiedJ>Looks like there is a fair bit of duplication. There are ~1K forks(?) of the dog-breed-images dataset. One version of that dataset is 700 MB.
21:01:41<VerifiedJ>lists of users, datasets and projects https://verifiedjoseph.com/archiveteam/website-discovery/floydhub.com/
21:07:26Core7292 (Cobalt17) joins
21:07:47<Jake>beat me to it! :-)
21:08:11<Jake>total size with just the latest version is 29575463216128 bytes, or about 29.6 TB.
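A rough sketch of the two size estimates above (all versions versus latest version only). The field names `totalSizeBytes` and `latestSizeBytes` come from the channel; the assumption that each dataset is a flat JSON object carrying both fields, and the helper names, are hypothetical.

```python
def total_bytes(datasets, field):
    """Sum one size field over dataset records, counting missing values as 0."""
    return sum(int(record.get(field) or 0) for record in datasets)


def to_terabytes(num_bytes):
    """Convert bytes to decimal terabytes (10**12 bytes)."""
    return num_bytes / 10**12
```

Calling `total_bytes(records, "totalSizeBytes")` and `total_bytes(records, "latestSizeBytes")` on the same records gives the two figures to compare.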
21:10:04<Jake>my version: https://transfer.archivete.am/c9djz/dataset_ids as well as all of the JSON from the datasets: https://transfer.archivete.am/TID3N/dataset_full_json
21:11:22Earendil quits [Ping timeout: 258 seconds]
21:19:49HP_Archivist quits [Read error: Connection reset by peer]
21:20:08HP_Archivist (HP_Archivist) joins
21:24:28<@JAA>There's obviously no good way to detect duplicates just from this, but summing up unique latestSizeBytes over 1 GiB gives 8.2 TiB.
21:25:32jacobk joins
21:25:51<@JAA>Well, all datasets over 1 GiB are 8.5 TiB though, so I guess there's not too much duplication, maybe.
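The duplication heuristic above can be sketched as follows: compare the sum of all sizes over 1 GiB with the sum of the distinct size values over 1 GiB. This treats two datasets with an identical byte count as likely copies, which is only a guess, as noted above. The function name and the input shape (a plain list of `latestSizeBytes` integers) are assumptions.

```python
GIB = 2**30


def large_dataset_totals(sizes, threshold=GIB):
    """Return (sum of all sizes, sum of distinct sizes) above the threshold.

    A large gap between the two numbers hints at duplicated datasets;
    equal sizes do not prove identical content.
    """
    large = [size for size in sizes if size > threshold]
    return sum(large), sum(set(large))
```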
21:29:26<jamesp>I checked the tracker and I don't see it moving. What's happening
21:30:47<@JAA>jamesp: Still the wrong channel.
21:31:06<jamesp>oops. I keep forgetting
21:36:34ArchivalEfforts joins
21:42:24Stiletto joins
22:00:21Matthww81 joins
22:00:49Core7292 quits [Read error: Connection reset by peer]
22:01:03Earendil (Cobalt17) joins
22:01:12Matthww8 quits [Ping timeout: 258 seconds]
22:01:12Matthww81 is now known as Matthww8
22:01:33Core7787 (Cobalt17) joins
22:05:25Earendil quits [Ping timeout: 258 seconds]
22:08:29Core7787 quits [Ping timeout: 258 seconds]
22:18:04jacobk quits [Ping timeout: 250 seconds]
22:31:43driib8 (driib) joins
22:35:24driib quits [Ping timeout: 250 seconds]
22:35:24driib8 is now known as driib
22:39:25jamesp quits [Remote host closed the connection]
22:47:53jamesp joins
22:53:50Iki joins
22:57:04AntiLiberal joins
23:05:36driib quits [Ping timeout: 258 seconds]
23:31:22Earendil (Cobalt17) joins
23:34:25Core2270 (Cobalt17) joins
23:35:53Iki quits [Ping timeout: 258 seconds]
23:37:48Ajay_m quits [Ping timeout: 258 seconds]
23:38:11Earendil quits [Ping timeout: 258 seconds]
23:45:39WarHawk80 joins
23:45:55<WarHawk80>hello can't rsync my warrior anymore
23:46:05<WarHawk80>I'm getting this from my client
23:46:24<@JAA>It's being worked on.
23:47:12<WarHawk80>ah ok...cool...wasn't sure if I borked something up
23:47:25<WarHawk80>thanks...I'll leave it running then
23:47:26<WarHawk80>thanks
23:49:05<WarHawk80>cool looks like it's working now...port changed on the rsync...nice! Thanks a lot!
23:52:21WarHawk80 leaves
23:59:28HP_Archivist quits [Ping timeout: 250 seconds]