| 00:00:40 | | Arcorann__ joins |
| 00:00:40 | | dm4v quits [Read error: Connection reset by peer] |
| 00:01:10 | | dm4v joins |
| 00:01:13 | | dm4v is now authenticated as dm4v |
| 00:01:13 | | dm4v quits [Changing host] |
| 00:01:13 | | dm4v (dm4v) joins |
| 00:28:01 | | lorwp quits [Ping timeout: 258 seconds] |
| 00:35:43 | | lorwp (lorwp) joins |
| 00:58:18 | | britmob quits [Ping timeout: 250 seconds] |
| 01:01:22 | | TheTechRobo (TheTechRobo) joins |
| 01:02:19 | <TheTechRobo> | Hey, how would I go about downloading a Facebook profile? I'm getting an error with snscrape. Is there another specialised tool? |
| 01:17:35 | <thuban> | TheTechRobo: based on https://github.com/JustAnotherArchivist/snscrape/issues it looks like you've now got it working; have you? |
| 01:18:11 | <@JAA> | thuban: Issue 208 |
| 01:18:19 | <thuban> | if you're still having trouble with facebook's rate limiting / ip banning, then no, we don't currently have a good solution for that |
| 01:18:19 | <@JAA> | Facebook's rate limiting is an arse. |
| 01:18:27 | <thuban> | yeah. is this a use case for #Y? |
| 01:18:46 | <@JAA> | Unlikely |
| 01:19:16 | <@JAA> | And I'd be surprised if any #// workers were still not banned from Facebook and Instagram. |
| 01:19:45 | <@JAA> | I'm not aware of any way that actually works. Their limits are beyond ridiculous and easily triggered even manually with a browser. |
| 01:19:54 | <thuban> | cause of the special handling it would need? yeah, makes sense. |
| 01:20:07 | <@JAA> | They just want to force people to create an account and log in. |
| 01:20:30 | <thuban> | there's definitely valuable content on facebook that we just can't get anywhere else, though. i wish we had something for it. |
| 01:21:13 | <@JAA> | Yeah, absolutely. |
| 01:22:39 | | Iki joins |
| 01:24:45 | | Iki1 quits [Ping timeout: 258 seconds] |
| 01:26:39 | | somerando3 joins |
| 01:29:43 | <thuban> | do we actually know whether distribution would help? like, does facebook clam up after n pagination requests for the same page from the same ip, or just n requests for the same page? (obvs the latter wouldn't do for very popular pages, but i wouldn't put it past them to do some form of load monitoring) |
| 01:40:57 | <TheTechRobo> | thuban: What is #Y? |
| 01:41:48 | <h3ndr1k> | The channel for the new project for a distributed archivebot. |
| 01:42:19 | <TheTechRobo> | JAA: You say that they're forcing people to log in; in that case, would using cookies help? I do have an account, I don't use it at all but it exists. |
| 01:42:33 | <TheTechRobo> | h3ndr1k: Oh, that makes sense. |
| 01:42:34 | <h3ndr1k> | Basically. If I followed past conversations correctly. |
| 01:42:37 | | BlueMaxima joins |
| 01:45:11 | <TheTechRobo> | Instagram seems to suck too... I keep getting redirected to the login page with instaloader! |
| 01:52:24 | | britmob joins |
| 01:52:26 | <@JAA> | TheTechRobo: Possibly. snscrape doesn't support that though. |
| 01:53:30 | <@JAA> | thuban: It would certainly help for archiving the actual posts, videos, etc. At least last time I looked into it, those were limited per IP. |
| 01:54:15 | <@JAA> | But yes, they also block on the pagination. I haven't experimented with slower (faked) scrolling yet. |
| 02:06:32 | | Barto quits [Ping timeout: 258 seconds] |
| 02:20:05 | | lennier2 joins |
| 02:23:01 | | lennier1 quits [Ping timeout: 258 seconds] |
| 02:23:07 | | lennier2 is now known as lennier1 |
| 02:23:20 | | somerando3 quits [Remote host closed the connection] |
| 02:40:13 | | somerando3 joins |
| 02:45:16 | <somerando3> | Actually, I just realized the wayback machine has fairly regular captures of RTHK's podcast RSS |
| 02:45:23 | <somerando3> | e.g. https://web.archive.org/web/20140912061846/http://podcast.rthk.hk/podcast/radio1_openline_openview.xml |
| 02:46:25 | <somerando3> | seems like it should be enough to reconstruct authoritative metadata going back quite some time, with few gaps. |
| 02:47:31 | <somerando3> | However, the older scrapes have audio links to mp3s on http://podcast.rthk.org.hk/, which don't work anymore |
| 02:47:51 | <mgrandi> | https://twitter.com/alexstamos/status/1410674022405181440 |
| 02:48:31 | <mgrandi> | If we want to start a preemptive project for this |
| 02:54:56 | | Barto (Barto) joins |
| 03:06:00 | | abcd quits [Remote host closed the connection] |
| 03:08:07 | <@OrIdow6> | ^ Twitter post about "Gettr", new social media network |
| 03:19:11 | | Megame quits [Client Quit] |
| 03:27:27 | <tech234a> | What are the crossed-out items I'm seeing on some of the trackers now? Are these failed items? |
| 03:28:23 | | HP_Archivist (HP_Archivist) joins |
| 03:28:25 | <@OrIdow6> | Duplicates |
| 03:29:33 | <tech234a> | So items were already queued but were distributed again? |
| 03:29:39 | <tech234a> | *items that were |
| 03:30:10 | <@OrIdow6> | AFAIK |
| 03:30:30 | <tech234a> | Hmm... ok |
| 03:31:22 | | lorwp quits [Client Quit] |
| 03:39:31 | <SketchTheCow> | Can someone do the full grab I asked for on #archivebot? |
| 03:39:53 | | qw3rty__ joins |
| 03:43:31 | | qw3rty_ quits [Ping timeout: 258 seconds] |
| 03:55:10 | | lorwp (lorwp) joins |
| 04:02:03 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 04:18:36 | <somerando3> | I just tried to save a URL from "https://hongkongfp.com" to the wayback machine and got an error: https://web.archive.org/web/20210702041507/https://hongkongfp.com/2021/06/29/hong-kongs-rthk-fires-veteran-radio-phone-in-host-as-more-shows-are-axed/ |
| 04:18:57 | <somerando3> | "Sorry. This URL has been excluded from the Wayback Machine." |
| 04:19:09 | | KRG quits [Remote host closed the connection] |
| 04:19:54 | <somerando3> | Wasn't that site scraped as part of https://wiki.archiveteam.org/index.php/Hong_Kong_media? What will happen to the data if it's not accessible on the wayback machine? |
| 04:22:21 | <@JAA> | Yeah, it's still running actually and will take at least a few more days. |
| 04:23:08 | <@JAA> | The data will remain on the Internet Archive, but accessing it would require downloading it and ingesting it into pywb or similar, i.e. definitely too much for the casual user. |
| 04:39:04 | <somerando3> | Where could I download it eventually? Can it be downloaded in a big chunk from IA by domain? Or is it going to be mixed with a bunch of other stuff? |
| 04:39:30 | | nuroten quits [Ping timeout: 244 seconds] |
| 04:41:49 | <thuban> | somerando3: you'll be able to find it on http://archive.fart.website/archivebot/viewer/ when it's done. |
| 04:41:57 | <@JAA> | The ArchiveBot crawl is here: http://archive.fart.website/archivebot/viewer/job/6jjdq |
| 04:42:19 | <@JAA> | More files will still appear there, of course. |
| 04:44:32 | <thuban> | the internet archive items will include a mix of stuff (other warcs) from other domains, but those .warc.gz files will be all hongkongfp.com. |
| 04:47:24 | <somerando3> | ok, thanks! I'll take a look |
| 06:39:43 | | superkuh joins |
| 06:39:43 | | superkuh_ quits [Read error: Connection reset by peer] |
| 06:53:35 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 07:05:32 | | Matthww8 quits [Ping timeout: 258 seconds] |
| 07:08:14 | | Matthww8 joins |
| 07:13:58 | | fuzzy8021 quits [Ping timeout: 258 seconds] |
| 07:17:23 | | Matthww80 joins |
| 07:19:12 | | Matthww8 quits [Ping timeout: 250 seconds] |
| 07:19:12 | | Matthww80 is now known as Matthww8 |
| 07:28:38 | | fuzzy8021 (fuzzy8021) joins |
| 07:41:18 | | HP_Archivist quits [Ping timeout: 250 seconds] |
| 07:42:16 | | HP_Archivist (HP_Archivist) joins |
| 07:46:56 | | HP_Archivist quits [Ping timeout: 250 seconds] |
| 08:46:44 | | lorwp quits [Ping timeout: 258 seconds] |
| 08:55:27 | | lorwp (lorwp) joins |
| 09:00:10 | | lorwp quits [Ping timeout: 250 seconds] |
| 09:10:26 | | lorwp (lorwp) joins |
| 09:14:01 | | lorwpp (lorwp) joins |
| 09:14:54 | | lorwp quits [Ping timeout: 250 seconds] |
| 09:18:56 | | lorwpp quits [Ping timeout: 258 seconds] |
| 09:22:16 | | lorwp (lorwp) joins |
| 09:38:06 | | lorwp quits [Ping timeout: 258 seconds] |
| 10:07:13 | | lorwp (lorwp) joins |
| 10:11:20 | <rewby> | Apparently that new gettr social media thing already got hacked. https://twitter.com/IanColdwater/status/1410788066252443649 |
| 10:13:45 | | lorwp quits [Ping timeout: 258 seconds] |
| 10:38:19 | | lorwp (lorwp) joins |
| 10:47:07 | | Megame (Megame) joins |
| 11:20:27 | | wizards quits [Ping timeout: 258 seconds] |
| 11:22:12 | | wizards joins |
| 11:42:41 | | ThreeHM quits [Ping timeout: 258 seconds] |
| 11:44:43 | | ThreeHM (ThreeHeadedMonkey) joins |
| 12:02:45 | <h3ndr1k> | rewby: Do you have more advanced sources? All I can see on twitter is deobfuscated source code for the frontend it seems, but that is not a hack. |
| 12:37:56 | | EdSavoie quits [Ping timeout: 244 seconds] |
| 12:53:12 | <rewby> | h3ndr1k: Apparently someone was posting nsfw on people's accounts. |
| 12:53:18 | <rewby> | I can't confirm it really |
| 13:04:56 | | EdSavoie joins |
| 13:09:46 | <h3ndr1k> | oh right, that has to be a hack |
| 13:21:10 | | TheTechRobo quits [Remote host closed the connection] |
| 13:29:46 | | TheTechRobo joins |
| 13:30:06 | | TheTechRobo is now authenticated as TheTechRobo |
| 14:17:31 | <rewby> | I'm halfway considering archiving the gettr trash fire before it disappears from the face of the web |
| 14:19:15 | <rewby> | On the plus side: The api endpoints are nicely listed in the leaked source code. On the down side: There doesn't appear to be a good way to enumerate anything. Although doing live-discovery doesn't look too terrible to pull off |
| 14:19:40 | <rewby> | But not sure if it's "worth" saving |
| 14:20:04 | <rewby> | Will anyone want to look back on this? There's almost no content on it. It's less than a day old |
| 14:20:22 | <rewby> | On the other hand, it's the subject of the current political news cycle |
| 14:43:48 | | Arcorann__ quits [Ping timeout: 250 seconds] |
| 15:01:17 | | spirit joins |
| 16:59:54 | | Eighty quits [Remote host closed the connection] |
| 17:04:58 | | Eighty (Eighty) joins |
| 17:59:14 | | luckcolors quits [Ping timeout: 250 seconds] |
| 18:05:38 | | HP_Archivist (HP_Archivist) joins |
| 18:15:35 | | DogsRNice (Webuser299) joins |
| 18:20:54 | | jacobk quits [Ping timeout: 250 seconds] |
| 18:22:21 | | Lee joins |
| 18:22:50 | | Lee is now known as Lee303 |
| 18:23:58 | <Lee303> | Hello, I have a question about the warrior.. It's been doing some Google Sites job for a week now and I have to restart the machine it's on. Any way to not lose the progress? Thanks! |
| 18:26:11 | <Jake> | If you have to restart it, no. Someone else will get the job after some time. |
| 18:27:52 | <Lee303> | Alright, thank you. Not really sure why someone made a roofing google site with >60000 things to scrape in the first place. |
| 18:30:00 | | HP_Archivist quits [Ping timeout: 250 seconds] |
| 18:38:33 | | HP_Archivist (HP_Archivist) joins |
| 18:42:58 | | jacobk joins |
| 18:51:14 | | jacobk quits [Ping timeout: 250 seconds] |
| 18:58:42 | | Iki1 joins |
| 19:02:45 | | Iki quits [Ping timeout: 258 seconds] |
| 19:17:55 | | Jonboy345 quits [Read error: Connection reset by peer] |
| 19:40:28 | | Lee303 quits [Remote host closed the connection] |
| 19:50:36 | | jacobk joins |
| 20:16:21 | | jacobk quits [Ping timeout: 258 seconds] |
| 20:29:18 | | jacobk joins |
| 20:34:22 | | jacobk quits [Ping timeout: 250 seconds] |
| 20:36:58 | | jacobk joins |
| 20:43:58 | | Stilett0 joins |
| 20:45:38 | | Stiletto quits [Ping timeout: 250 seconds] |
| 20:51:29 | | Eighty quits [Remote host closed the connection] |
| 20:51:38 | | fionera quits [*.net *.split] |
| 20:56:01 | | Eighty (Eighty) joins |
| 20:57:23 | | fionera joins |
| 20:57:35 | | fionera is now known as RJHacker15913 |
| 21:01:30 | <@OrIdow6> | Esp. since they've presumably wised up about cloud providers, I think it will stay up |
| 21:01:58 | <@OrIdow6> | I haven't seen news about this except when linked from ArchiveTeam and related |
| 21:08:06 | | @OrIdow6 quits [Ping timeout: 258 seconds] |
| 21:08:16 | | RJHacker15913 quits [*.net *.split] |
| 21:10:23 | | OrIdow6^2 (OrIdow6) joins |
| 21:10:23 | | @ChanServ sets mode: +o OrIdow6^2 |
| 21:42:14 | | mgrandi quits [Read error: Connection reset by peer] |
| 21:43:18 | <@arkiver> | there are posts on GETTR going back to 2014 |
| 21:43:26 | <@arkiver> | well according to what GETTR shows |
| 21:43:32 | <@arkiver> | maybe they're showing bad info |
| 21:43:44 | <russss> | I think I read that they imported twitter feeds |
| 21:43:54 | <@arkiver> | i see |
| 21:44:04 | <@arkiver> | russss: do you have a link to that? |
| 21:44:46 | <russss> | nope sorry, it scrolled past on twitter a couple days ago |
| 21:47:58 | | jacobk quits [Ping timeout: 258 seconds] |
| 21:48:15 | | mgrandi (mgrandi) joins |
| 21:53:04 | <@JAA> | Apparently, there's an option to import your Twitter feed when you sign up. Can't find official documentation on that, but it's mentioned in a bunch of news articles, e.g. https://www.theregister.com/2021/07/02/gettr/ |
| 21:55:54 | | Jonboy345 joins |
| 22:08:24 | <h3ndr1k> | arkiver: what did the domains-grab project do exactly? Can't find anything about it and it's hard to understand what goes on in the code. |
| 22:08:24 | <h3ndr1k> | Is it a predecessor to urls? |
| 22:12:00 | | spirit quits [Client Quit] |
| 22:12:43 | <@OrIdow6^2> | #Y |
| 22:15:20 | | Ajay_m joins |
| 22:16:00 | <@OrIdow6^2> | And in short, I can't remember what the channel was, but it just recursed over an entire domain |
| 22:16:21 | | Ajay_m quits [Client Quit] |
| 22:16:32 | | Ajay_m joins |
| 22:16:41 | <@arkiver> | i believe it was for flash domains |
| 22:16:52 | | Ajay_m quits [Client Quit] |
| 22:17:03 | | Ajay_m joins |
| 22:17:20 | <@OrIdow6^2> | And .eu |
| 22:19:26 | <@arkiver> | yeah and that |
| 22:19:28 | <h3ndr1k> | ah ok. In the readme it mentions domains on the wiki, but that page does not exist :) |
| 22:46:14 | <@EggplantN> | arkiver it was .eu |
| 22:46:26 | <@EggplantN> | flash was same code different repo |
| 22:46:39 | <@arkiver> | right yep, i see it |
| 22:53:07 | | fionera (Fionera) joins |
| 22:54:35 | | AlsoHP_Archivist joins |
| 22:58:14 | | HP_Archivist quits [Ping timeout: 250 seconds] |
| 23:28:06 | | EdSavoie quits [Remote host closed the connection] |
| 23:35:57 | | teej (teej) joins |
| 23:52:26 | | lunik1 quits [Client Quit] |