00:00:40Arcorann__ joins
00:00:40dm4v quits [Read error: Connection reset by peer]
00:01:10dm4v joins
00:01:13dm4v quits [Changing host]
00:01:13dm4v (dm4v) joins
00:28:01lorwp quits [Ping timeout: 258 seconds]
00:35:43lorwp (lorwp) joins
00:58:18britmob quits [Ping timeout: 250 seconds]
01:01:22TheTechRobo (TheTechRobo) joins
01:02:19<TheTechRobo>Hey, how would I go about downloading a Facebook profile? I'm getting an error with snscrape. Is there another specialised tool?
01:17:35<thuban>TheTechRobo: based on https://github.com/JustAnotherArchivist/snscrape/issues it looks like you've now got it working; have you?
01:18:11<@JAA>thuban: Issue 208
01:18:19<thuban>if you're still having trouble with facebook's rate limiting / ip banning, then no, we don't currently have a good solution for that
01:18:19<@JAA>Facebook's rate limiting is an arse.
01:18:27<thuban>yeah. is this a use case for #Y?
01:18:46<@JAA>Unlikely
01:19:16<@JAA>And I'd be surprised if any #// workers were still not banned from Facebook and Instagram.
01:19:45<@JAA>I'm not aware of any way that actually works. Their limits are beyond ridiculous and easily triggered even manually with a browser.
01:19:54<thuban>cause of the special handling it would need? yeah, makes sense.
01:20:07<@JAA>They just want to force people to create an account and log in.
01:20:30<thuban>there's definitely valuable content on facebook that we just can't get anywhere else, though. i wish we had something for it.
01:21:13<@JAA>Yeah, absolutely.
01:22:39Iki joins
01:24:45Iki1 quits [Ping timeout: 258 seconds]
01:26:39somerando3 joins
01:29:43<thuban>do we actually know whether distribution would help? like, does facebook clam up after n pagination requests for the same page from the same ip, or just n requests for the same page? (obvs the latter wouldn't do for very popular pages, but i wouldn't put it past them to do some form of load monitoring)
01:40:57<TheTechRobo>thuban: What is #Y?
01:41:48<h3ndr1k>The channel for the new project for a distributed archivebot.
01:42:19<TheTechRobo>JAA: You say that they're forcing people to log in; in that case, would using cookies help? I do have an account, I don't use it at all but it exists.
01:42:33<TheTechRobo>h3ndr1k: Oh, that makes sense.
01:42:34<h3ndr1k>Basically. If I followed past conversations correctly.
01:42:37BlueMaxima joins
01:45:11<TheTechRobo>Instagram seems to suck too... I keep getting redirected to the login page with instaloader!
01:52:24britmob joins
01:52:26<@JAA>TheTechRobo: Possibly. snscrape doesn't support that though.
01:53:30<@JAA>thuban: It would certainly help for archiving the actual posts, videos, etc. At least last time I looked into it, those were limited per IP.
01:54:15<@JAA>But yes, they also block on the pagination. I haven't experimented with slower (faked) scrolling yet.
02:06:32Barto quits [Ping timeout: 258 seconds]
02:20:05lennier2 joins
02:23:01lennier1 quits [Ping timeout: 258 seconds]
02:23:07lennier2 is now known as lennier1
02:23:20somerando3 quits [Remote host closed the connection]
02:40:13somerando3 joins
02:45:16<somerando3>Actually, I just realized the wayback machine has fairly regular captures of RTHK's podcast RSS
02:45:23<somerando3>e.g. https://web.archive.org/web/20140912061846/http://podcast.rthk.hk/podcast/radio1_openline_openview.xml
02:46:25<somerando3>seems like it should be enough to reconstruct authoritative metadata going back quite some time, with few gaps.
02:47:31<somerando3>However the older scrapes have audio links to mp3s on http://podcast.rthk.org.hk/, which don't work anymore
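[Editor's sketch, not part of the original conversation: one way to enumerate the Wayback Machine captures of the RSS feed mentioned above, so that each snapshot can later be fetched and its metadata merged. It assumes the public Wayback CDX endpoint and its JSON output, where the first row is a header. The function names `build_cdx_query` and `parse_cdx_json` are hypothetical, and the network fetch itself is left out.]

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed endpoint: the public Wayback Machine CDX server.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, from_year=None, to_year=None):
    """Build a CDX query URL that lists HTTP 200 captures of `url` as JSON."""
    params = {"url": url, "output": "json", "filter": "statuscode:200"}
    if from_year is not None:
        params["from"] = str(from_year)
    if to_year is not None:
        params["to"] = str(to_year)
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_json(text):
    """Convert a CDX JSON response (header row, then data rows) into dicts.

    Each dict gains a parsed UTC `datetime` and a `replay_url` pointing at
    the capture in the Wayback Machine.
    """
    rows = json.loads(text) if text.strip() else []
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    captures = []
    for row in data:
        record = dict(zip(header, row))
        record["datetime"] = datetime.strptime(
            record["timestamp"], "%Y%m%d%H%M%S"
        ).replace(tzinfo=timezone.utc)
        record["replay_url"] = "https://web.archive.org/web/{}/{}".format(
            record["timestamp"], record["original"]
        )
        captures.append(record)
    return captures
```

[Usage: fetch the URL returned by `build_cdx_query("podcast.rthk.hk/podcast/radio1_openline_openview.xml")` with any HTTP client and pass the response body to `parse_cdx_json`. The dead mp3 links on podcast.rthk.org.hk would have to be checked for captures separately.]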
02:47:51<mgrandi>https://twitter.com/alexstamos/status/1410674022405181440
02:48:31<mgrandi>If we want to start a preemptive project for this
02:54:56Barto (Barto) joins
03:06:00abcd quits [Remote host closed the connection]
03:08:07<@OrIdow6>^ Twitter post about "Gettr", new social media network
03:19:11Megame quits [Client Quit]
03:27:27<tech234a>What are the crossed-out items I'm seeing on some of the trackers now? Are these failed items?
03:28:23HP_Archivist (HP_Archivist) joins
03:28:25<@OrIdow6>Duplicates
03:29:33<tech234a>So items were already queued but were distributed again?
03:29:39<tech234a>*items that were
03:30:10<@OrIdow6>AFAIK
03:30:30<tech234a>Hmm... ok
03:31:22lorwp quits [Client Quit]
03:39:31<SketchTheCow>Can someone do a full grab I asked for on #archivebot
03:39:53qw3rty__ joins
03:43:31qw3rty_ quits [Ping timeout: 258 seconds]
03:55:10lorwp (lorwp) joins
04:02:03qwertyasdfuiopghjkl quits [Remote host closed the connection]
04:18:36<somerando3>I just tried to save a URL from "https://hongkongfp.com" to the wayback machine and got an error: https://web.archive.org/web/20210702041507/https://hongkongfp.com/2021/06/29/hong-kongs-rthk-fires-veteran-radio-phone-in-host-as-more-shows-are-axed/
04:18:57<somerando3>"Sorry. This URL has been excluded from the Wayback Machine."
04:19:09KRG quits [Remote host closed the connection]
04:19:54<somerando3>Wasn't that site scraped as part of https://wiki.archiveteam.org/index.php/Hong_Kong_media? What will happen to the data if it's not accessible on the wayback machine?
04:22:21<@JAA>Yeah, it's still running actually and will take at least a few more days.
04:23:08<@JAA>The data will remain on the Internet Archive, but accessing it would require downloading it and ingesting it into pywb or similar, i.e. definitely too much for the casual user.
04:39:04<somerando3>Where could I download it eventually? Can it be downloaded in a big chunk from IA by domain? Or is it going to be mixed with a bunch of other stuff?
04:39:30nuroten quits [Ping timeout: 244 seconds]
04:41:49<thuban>somerando3: you'll be able to find it on http://archive.fart.website/archivebot/viewer/ when it's done.
04:41:57<@JAA>The ArchiveBot crawl is here: http://archive.fart.website/archivebot/viewer/job/6jjdq
04:42:19<@JAA>More files will still appear there, of course.
04:44:32<thuban>the internet archive items will include a mix of stuff (other warcs) from other domains, but those .warc.gz files will be all hongkongfp.com.
04:47:24<somerando3>ok, thanks! I'll take a look
06:39:43superkuh joins
06:39:43superkuh_ quits [Read error: Connection reset by peer]
06:53:35BlueMaxima quits [Read error: Connection reset by peer]
07:05:32Matthww8 quits [Ping timeout: 258 seconds]
07:08:14Matthww8 joins
07:13:58fuzzy8021 quits [Ping timeout: 258 seconds]
07:17:23Matthww80 joins
07:19:12Matthww8 quits [Ping timeout: 250 seconds]
07:19:12Matthww80 is now known as Matthww8
07:28:38fuzzy8021 (fuzzy8021) joins
07:41:18HP_Archivist quits [Ping timeout: 250 seconds]
07:42:16HP_Archivist (HP_Archivist) joins
07:46:56HP_Archivist quits [Ping timeout: 250 seconds]
08:46:44lorwp quits [Ping timeout: 258 seconds]
08:55:27lorwp (lorwp) joins
09:00:10lorwp quits [Ping timeout: 250 seconds]
09:10:26lorwp (lorwp) joins
09:14:01lorwpp (lorwp) joins
09:14:54lorwp quits [Ping timeout: 250 seconds]
09:18:56lorwpp quits [Ping timeout: 258 seconds]
09:22:16lorwp (lorwp) joins
09:38:06lorwp quits [Ping timeout: 258 seconds]
10:07:13lorwp (lorwp) joins
10:11:20<rewby>Apparently that new gettr social media thing already got hacked. https://twitter.com/IanColdwater/status/1410788066252443649
10:13:45lorwp quits [Ping timeout: 258 seconds]
10:38:19lorwp (lorwp) joins
10:47:07Megame (Megame) joins
11:20:27wizards quits [Ping timeout: 258 seconds]
11:22:12wizards joins
11:42:41ThreeHM quits [Ping timeout: 258 seconds]
11:44:43ThreeHM (ThreeHeadedMonkey) joins
12:02:45<h3ndr1k>rewby: Do you have more advanced sources? All I can see on twitter is deobfuscated source code for the frontend it seems, but that is not a hack.
12:37:56EdSavoie quits [Ping timeout: 244 seconds]
12:53:12<rewby>h3ndr1k: Apparently someone was posting nsfw on people's accounts.
12:53:18<rewby>I can't confirm it really
13:04:56EdSavoie joins
13:09:46<h3ndr1k>oh right, that has to be a hack
13:21:10TheTechRobo quits [Remote host closed the connection]
13:29:46TheTechRobo joins
14:17:31<rewby>I'm halfway considering archiving the gettr trash fire before it disappears from the face of the web
14:19:15<rewby>On the plus side: The api endpoints are nicely listed in the leaked source code. On the down side: There doesn't appear to be a good way to enumerate anything. Although doing live-discovery doesn't look too terrible to pull off
14:19:40<rewby>But not sure if it's "worth" saving
14:20:04<rewby>Will anyone want to look back on this? There's almost no content on it. It's less than a day old
14:20:22<rewby>On the other hand, it's the subject of the current political news cycle
14:43:48Arcorann__ quits [Ping timeout: 250 seconds]
15:01:17spirit joins
16:59:54Eighty quits [Remote host closed the connection]
17:04:58Eighty (Eighty) joins
17:59:14luckcolors quits [Ping timeout: 250 seconds]
18:05:38HP_Archivist (HP_Archivist) joins
18:15:35DogsRNice (Webuser299) joins
18:20:54jacobk quits [Ping timeout: 250 seconds]
18:22:21Lee joins
18:22:50Lee is now known as Lee303
18:23:58<Lee303>Hello, I have a question about the warrior.. It's been doing some Google Sites job for a week now and I have to restart the machine it's on. Any way to not lose the progress? Thanks!
18:26:11<Jake>If you have to restart it, no. Someone else will get the job after some time.
18:27:52<Lee303>Alright, thank you. Not really sure why someone made a roofing google site with >60000 things to scrape in the first place.
18:30:00HP_Archivist quits [Ping timeout: 250 seconds]
18:38:33HP_Archivist (HP_Archivist) joins
18:42:58jacobk joins
18:51:14jacobk quits [Ping timeout: 250 seconds]
18:58:42Iki1 joins
19:02:45Iki quits [Ping timeout: 258 seconds]
19:17:55Jonboy345 quits [Read error: Connection reset by peer]
19:40:28Lee303 quits [Remote host closed the connection]
19:50:36jacobk joins
20:16:21jacobk quits [Ping timeout: 258 seconds]
20:29:18jacobk joins
20:34:22jacobk quits [Ping timeout: 250 seconds]
20:36:58jacobk joins
20:43:58Stilett0 joins
20:45:38Stiletto quits [Ping timeout: 250 seconds]
20:51:29Eighty quits [Remote host closed the connection]
20:51:38fionera quits [*.net *.split]
20:56:01Eighty (Eighty) joins
20:57:23fionera joins
20:57:35fionera is now known as RJHacker15913
21:01:30<@OrIdow6>Esp. since they've presumably wised up about cloud providers, I think it will stay up
21:01:58<@OrIdow6>I haven't seen news about this except when linked from ArchiveTeam and related
21:08:06@OrIdow6 quits [Ping timeout: 258 seconds]
21:08:16RJHacker15913 quits [*.net *.split]
21:10:23OrIdow6^2 (OrIdow6) joins
21:10:23@ChanServ sets mode: +o OrIdow6^2
21:42:14mgrandi quits [Read error: Connection reset by peer]
21:43:18<@arkiver>there are posts on GETTR going back to 2014
21:43:26<@arkiver>well according to what GETTR shows
21:43:32<@arkiver>maybe they're showing bad info
21:43:44<russss>I think I read that they imported twitter feeds
21:43:54<@arkiver>i see
21:44:04<@arkiver>russss: do you have a link to that?
21:44:46<russss>nope sorry, it scrolled past on twitter a couple days ago
21:47:58jacobk quits [Ping timeout: 258 seconds]
21:48:15mgrandi (mgrandi) joins
21:53:04<@JAA>Apparently, there's an option to import your Twitter feed when you sign up. Can't find official documentation on that, but it's mentioned in a bunch of news articles, e.g. https://www.theregister.com/2021/07/02/gettr/
21:55:54Jonboy345 joins
22:08:24<h3ndr1k>arkiver: what did the domains-grab project do exactly? Can't find anything about it and it's hard to understand what goes on in the code.
22:08:24<h3ndr1k>Is it a predecessor to urls?
22:12:00spirit quits [Client Quit]
22:12:43<@OrIdow6^2>#Y
22:15:20Ajay_m joins
22:16:00<@OrIdow6^2>And in short, I can't remember what the channel was, but it just recursed over an entire domain
22:16:21Ajay_m quits [Client Quit]
22:16:32Ajay_m joins
22:16:41<@arkiver>i believe it was for flash domains
22:16:52Ajay_m quits [Client Quit]
22:17:03Ajay_m joins
22:17:20<@OrIdow6^2>And .eu
22:19:26<@arkiver>yeah and that
22:19:28<h3ndr1k>ah ok. In the readme it mentions domains on the wiki, but that page does not exist :)
22:46:14<@EggplantN>arkiver it was .eu
22:46:26<@EggplantN>flash was same code different repo
22:46:39<@arkiver>right yep, i see it
22:53:07fionera (Fionera) joins
22:54:35AlsoHP_Archivist joins
22:58:14HP_Archivist quits [Ping timeout: 250 seconds]
23:28:06EdSavoie quits [Remote host closed the connection]
23:35:57teej (teej) joins
23:52:26lunik1 quits [Client Quit]