#archiveteam-bs log for 2021-07-02

00:00:40		Arcorann__ joins
00:00:40		dm4v quits [Read error: Connection reset by peer]
00:01:10		dm4v joins
00:01:13		dm4v is now authenticated as dm4v
00:01:13		dm4v quits [Changing host]
00:01:13		dm4v (dm4v) joins
00:28:01		lorwp quits [Ping timeout: 258 seconds]
00:35:43		lorwp (lorwp) joins
00:58:18		britmob quits [Ping timeout: 250 seconds]
01:01:22		TheTechRobo (TheTechRobo) joins
01:02:19	<TheTechRobo>	Hey, how would I go about downloading a Facebook profile? I'm getting an error with snscrape. Is there another specialised tool?
01:17:35	<thuban>	TheTechRobo: based on https://github.com/JustAnotherArchivist/snscrape/issues it looks like you've now got it working; have you?
01:18:11	<@JAA>	thuban: Issue 208
01:18:19	<thuban>	if you're still having trouble with facebook's rate limiting / ip banning, then no, we don't currently have a good solution for that
01:18:19	<@JAA>	Facebook's rate limiting is an arse.
01:18:27	<thuban>	yeah. is this a use case for #Y?
01:18:46	<@JAA>	Unlikely
01:19:16	<@JAA>	And I'd be surprised if any #// workers were still not banned from Facebook and Instagram.
01:19:45	<@JAA>	I'm not aware of any way that actually works. Their limits are beyond ridiculous and easily triggered even manually with a browser.
01:19:54	<thuban>	cause of the special handling it would need? yeah, makes sense.
01:20:07	<@JAA>	They just want to force people to create an account and log in.
01:20:30	<thuban>	there's definitely valuable content on facebook that we just can't get anywhere else, though. i wish we had something for it.
01:21:13	<@JAA>	Yeah, absolutely.
01:22:39		Iki joins
01:24:45		Iki1 quits [Ping timeout: 258 seconds]
01:26:39		somerando3 joins
01:29:43	<thuban>	do we actually know whether distribution would help? like, does facebook clam up after n pagination requests for the same page from the same ip, or just n requests for the same page? (obvs the latter wouldn't do for very popular pages, but i wouldn't put it past them to do some form of load monitoring)
01:40:57	<TheTechRobo>	thuban: What is #Y?
01:41:48	<h3ndr1k>	The channel for the new project for a distributed archivebot.
01:42:19	<TheTechRobo>	JAA: You say that they're forcing people to log in; in that case, would using cookies help? I do have an account, I don't use it at all but it exists.
01:42:33	<TheTechRobo>	h3ndr1k: Oh, that makes sense.
01:42:34	<h3ndr1k>	Basically. If I followed past conversations correctly.
01:42:37		BlueMaxima joins
01:45:11	<TheTechRobo>	Instagram seems to suck too... I keep getting redirected to the login page with instaloader!
01:52:24		britmob joins
01:52:26	<@JAA>	TheTechRobo: Possibly. snscrape doesn't support that though.
01:53:30	<@JAA>	thuban: It would certainly help for archiving the actual posts, videos, etc. At least last time I looked into it, those were limited per IP.
01:54:15	<@JAA>	But yes, they also block on the pagination. I haven't experimented with slower (faked) scrolling yet.
02:06:32		Barto quits [Ping timeout: 258 seconds]
02:20:05		lennier2 joins
02:23:01		lennier1 quits [Ping timeout: 258 seconds]
02:23:07		lennier2 is now known as lennier1
02:23:20		somerando3 quits [Remote host closed the connection]
02:40:13		somerando3 joins
02:45:16	<somerando3>	Actually, I just realized the wayback machine has fairly regular captures of RTHK's podcast RSS
02:45:23	<somerando3>	e.g. https://web.archive.org/web/20140912061846/http://podcast.rthk.hk/podcast/radio1_openline_openview.xml
02:46:25	<somerando3>	seems like it should be enough to reconstruct authoritative metadata going back quite some time, with few gaps.
02:47:31	<somerando3>	However, the older scrapes have audio links to mp3s on http://podcast.rthk.org.hk/, which don't work anymore
02:47:51	<mgrandi>	https://twitter.com/alexstamos/status/1410674022405181440
02:48:31	<mgrandi>	If we want to start a preemptive project for this
02:54:56		Barto (Barto) joins
03:06:00		abcd quits [Remote host closed the connection]
03:08:07	<@OrIdow6>	^ Twitter post about "Gettr", new social media network
03:19:11		Megame quits [Client Quit]
03:27:27	<tech234a>	What are the crossed-out items I'm seeing on some of the trackers now? Are these failed items?
03:28:23		HP_Archivist (HP_Archivist) joins
03:28:25	<@OrIdow6>	Duplicates
03:29:33	<tech234a>	So items were already queued but were distributed again?
03:29:39	<tech234a>	*items that were
03:30:10	<@OrIdow6>	AFAIK
03:30:30	<tech234a>	Hmm... ok
03:31:22		lorwp quits [Client Quit]
03:39:31	<SketchTheCow>	Can someone do a full grab I asked for on #archivebot
03:39:53		qw3rty__ joins
03:43:31		qw3rty_ quits [Ping timeout: 258 seconds]
03:55:10		lorwp (lorwp) joins
04:02:03		qwertyasdfuiopghjkl quits [Remote host closed the connection]
04:18:36	<somerando3>	I just tried to save a URL from "https://hongkongfp.com" to the wayback machine and got an error: https://web.archive.org/web/20210702041507/https://hongkongfp.com/2021/06/29/hong-kongs-rthk-fires-veteran-radio-phone-in-host-as-more-shows-are-axed/
04:18:57	<somerando3>	"Sorry. This URL has been excluded from the Wayback Machine."
04:19:09		KRG quits [Remote host closed the connection]
04:19:54	<somerando3>	Wasn't that site scraped as part of https://wiki.archiveteam.org/index.php/Hong_Kong_media? What will happen to the data if it's not accessible on the wayback machine?
04:22:21	<@JAA>	Yeah, it's still running actually and will take at least a few more days.
04:23:08	<@JAA>	The data will remain on the Internet Archive, but accessing it would require downloading it and ingesting it into pywb or similar, i.e. definitely too much for the casual user.
04:39:04	<somerando3>	Where could I download it eventually? Can it be downloaded in a big chunk from IA by domain? Or is it going to be mixed with a bunch of other stuff?
04:39:30		nuroten quits [Ping timeout: 244 seconds]
04:41:49	<thuban>	somerando3: you'll be able to find it on http://archive.fart.website/archivebot/viewer/ when it's done.
04:41:57	<@JAA>	The ArchiveBot crawl is here: http://archive.fart.website/archivebot/viewer/job/6jjdq
04:42:19	<@JAA>	More files will still appear there, of course.
04:44:32	<thuban>	the internet archive items will include a mix of stuff (other warcs) from other domains, but those .warc.gz files will be all hongkongfp.com.
04:47:24	<somerando3>	ok, thanks! I'll take a look
06:39:43		superkuh joins
06:39:43		superkuh_ quits [Read error: Connection reset by peer]
06:53:35		BlueMaxima quits [Read error: Connection reset by peer]
07:05:32		Matthww8 quits [Ping timeout: 258 seconds]
07:08:14		Matthww8 joins
07:13:58		fuzzy8021 quits [Ping timeout: 258 seconds]
07:17:23		Matthww80 joins
07:19:12		Matthww8 quits [Ping timeout: 250 seconds]
07:19:12		Matthww80 is now known as Matthww8
07:28:38		fuzzy8021 (fuzzy8021) joins
07:41:18		HP_Archivist quits [Ping timeout: 250 seconds]
07:42:16		HP_Archivist (HP_Archivist) joins
07:46:56		HP_Archivist quits [Ping timeout: 250 seconds]
08:46:44		lorwp quits [Ping timeout: 258 seconds]
08:55:27		lorwp (lorwp) joins
09:00:10		lorwp quits [Ping timeout: 250 seconds]
09:10:26		lorwp (lorwp) joins
09:14:01		lorwpp (lorwp) joins
09:14:54		lorwp quits [Ping timeout: 250 seconds]
09:18:56		lorwpp quits [Ping timeout: 258 seconds]
09:22:16		lorwp (lorwp) joins
09:38:06		lorwp quits [Ping timeout: 258 seconds]
10:07:13		lorwp (lorwp) joins
10:11:20	<rewby>	Apparently that new gettr social media thing already got hacked. https://twitter.com/IanColdwater/status/1410788066252443649
10:13:45		lorwp quits [Ping timeout: 258 seconds]
10:38:19		lorwp (lorwp) joins
10:47:07		Megame (Megame) joins
11:20:27		wizards quits [Ping timeout: 258 seconds]
11:22:12		wizards joins
11:42:41		ThreeHM quits [Ping timeout: 258 seconds]
11:44:43		ThreeHM (ThreeHeadedMonkey) joins
12:02:45	<h3ndr1k>	rewby: Do you have more advanced sources? All I can see on twitter is deobfuscated source code for the frontend it seems, but that is not a hack.
12:37:56		EdSavoie quits [Ping timeout: 244 seconds]
12:53:12	<rewby>	h3ndr1k: Apparently someone was posting nsfw on people's accounts.
12:53:18	<rewby>	I can't confirm it really
13:04:56		EdSavoie joins
13:09:46	<h3ndr1k>	oh right, that has to be a hack
13:21:10		TheTechRobo quits [Remote host closed the connection]
13:29:46		TheTechRobo joins
13:30:06		TheTechRobo is now authenticated as TheTechRobo
14:17:31	<rewby>	I'm halfway considering archiving the gettr trash fire before it disappears from the face of the web
14:19:15	<rewby>	On the plus side: The api endpoints are nicely listed in the leaked source code. On the down side: There doesn't appear to be a good way to enumerate anything. Although doing live-discovery doesn't look too terrible to pull off
14:19:40	<rewby>	But not sure if it's "worth" saving
14:20:04	<rewby>	Will anyone want to look back on this? There's almost no content on it. It's less than a day old
14:20:22	<rewby>	On the other hand, it's the subject of the current political news cycle
14:43:48		Arcorann__ quits [Ping timeout: 250 seconds]
15:01:17		spirit joins
16:59:54		Eighty quits [Remote host closed the connection]
17:04:58		Eighty (Eighty) joins
17:59:14		luckcolors quits [Ping timeout: 250 seconds]
18:05:38		HP_Archivist (HP_Archivist) joins
18:15:35		DogsRNice (Webuser299) joins
18:20:54		jacobk quits [Ping timeout: 250 seconds]
18:22:21		Lee joins
18:22:50		Lee is now known as Lee303
18:23:58	<Lee303>	Hello, I have a question about the warrior.. It's been doing some Google Sites job for a week now and I have to restart the machine it's on. Any way to not lose the progress? Thanks!
18:26:11	<Jake>	If you have to restart it, no. Someone else will get the job after some time.
18:27:52	<Lee303>	Alright, thank you. Not really sure why someone made a roofing google site with >60000 things to scrape in the first place.
18:30:00		HP_Archivist quits [Ping timeout: 250 seconds]
18:38:33		HP_Archivist (HP_Archivist) joins
18:42:58		jacobk joins
18:51:14		jacobk quits [Ping timeout: 250 seconds]
18:58:42		Iki1 joins
19:02:45		Iki quits [Ping timeout: 258 seconds]
19:17:55		Jonboy345 quits [Read error: Connection reset by peer]
19:40:28		Lee303 quits [Remote host closed the connection]
19:50:36		jacobk joins
20:16:21		jacobk quits [Ping timeout: 258 seconds]
20:29:18		jacobk joins
20:34:22		jacobk quits [Ping timeout: 250 seconds]
20:36:58		jacobk joins
20:43:58		Stilett0 joins
20:45:38		Stiletto quits [Ping timeout: 250 seconds]
20:51:29		Eighty quits [Remote host closed the connection]
20:51:38		fionera quits [.net .split]
20:56:01		Eighty (Eighty) joins
20:57:23		fionera joins
20:57:35		fionera is now known as RJHacker15913
21:01:30	<@OrIdow6>	Esp. since they've presumably wised up about cloud providers, I think it will stay up
21:01:58	<@OrIdow6>	I haven't seen news about this except when linked from ArchiveTeam and related
21:08:06		@OrIdow6 quits [Ping timeout: 258 seconds]
21:08:16		RJHacker15913 quits [.net .split]
21:10:23		OrIdow6^2 (OrIdow6) joins
21:10:23		@ChanServ sets mode: +o OrIdow6^2
21:42:14		mgrandi quits [Read error: Connection reset by peer]
21:43:18	<@arkiver>	there are posts on GETTR going back to 2014
21:43:26	<@arkiver>	well according to what GETTR shows
21:43:32	<@arkiver>	maybe they're showing bad info
21:43:44	<russss>	I think I read that they imported twitter feeds
21:43:54	<@arkiver>	i see
21:44:04	<@arkiver>	russss: do you have a link to that?
21:44:46	<russss>	nope sorry, it scrolled past on twitter a couple days ago
21:47:58		jacobk quits [Ping timeout: 258 seconds]
21:48:15		mgrandi (mgrandi) joins
21:53:04	<@JAA>	Apparently, there's an option to import your Twitter feed when you sign up. Can't find official documentation on that, but it's mentioned in a bunch of news articles, e.g. https://www.theregister.com/2021/07/02/gettr/
21:55:54		Jonboy345 joins
22:08:24	<h3ndr1k>	arkiver: what did the domains-grab project do exactly? Can't find anything about it and it's hard to understand what goes on in the code.
22:08:24	<h3ndr1k>	Is it a predecessor to urls?
22:12:00		spirit quits [Client Quit]
22:12:43	<@OrIdow6^2>	#Y
22:15:20		Ajay_m joins
22:16:00	<@OrIdow6^2>	And in short, I can't remember what the channel was, but it just recursed over an entire domain
22:16:21		Ajay_m quits [Client Quit]
22:16:32		Ajay_m joins
22:16:41	<@arkiver>	i believe it was for flash domains
22:16:52		Ajay_m quits [Client Quit]
22:17:03		Ajay_m joins
22:17:20	<@OrIdow6^2>	And .eu
22:19:26	<@arkiver>	yeah and that
22:19:28	<h3ndr1k>	ah ok. In the readme it mentions domains on the wiki, but that page does not exist :)
22:46:14	<@EggplantN>	arkiver it was .eu
22:46:26	<@EggplantN>	flash was same code different repo
22:46:39	<@arkiver>	right yep, i see it
22:53:07		fionera (Fionera) joins
22:54:35		AlsoHP_Archivist joins
22:58:14		HP_Archivist quits [Ping timeout: 250 seconds]
23:28:06		EdSavoie quits [Remote host closed the connection]
23:35:57		teej (teej) joins
23:52:26		lunik1 quits [Client Quit]