| 00:01:45 | | Nulo joins |
| 00:32:08 | | sonick (sonick) joins |
| 00:47:07 | | fishingf0rpie quits [Quit: Leaving] |
| 00:47:22 | | fishingforpie joins |
| 01:02:16 | | march_happy quits [Ping timeout: 265 seconds] |
| 01:02:36 | | march_happy (march_happy) joins |
| 01:28:40 | | AnotherIki joins |
| 01:29:49 | | tzt quits [Ping timeout: 265 seconds] |
| 01:32:53 | | Iki1 quits [Ping timeout: 268 seconds] |
| 01:41:09 | | fishingforpie quits [Remote host closed the connection] |
| 01:41:09 | | onetruth quits [Remote host closed the connection] |
| 01:41:28 | | onetruth joins |
| 01:41:28 | | fishingforpie joins |
| 02:27:56 | | sepro quits [Quit: Bye!] |
| 02:28:39 | | sepro (sepro) joins |
| 03:08:49 | | Craigle quits [Quit: The Lounge - https://thelounge.chat] |
| 03:09:24 | | Craigle (Craigle) joins |
| 03:18:25 | | tzt (tzt) joins |
| 03:27:39 | | michaelblob_ (michaelblob) joins |
| 03:31:16 | | michaelblob quits [Ping timeout: 240 seconds] |
| 04:14:38 | | us3rrr joins |
| 04:15:02 | | onetruth quits [Remote host closed the connection] |
| 04:15:02 | | fishingforpie quits [Remote host closed the connection] |
| 04:15:13 | | fishingforpie joins |
| 04:24:54 | | Craigle quits [Client Quit] |
| 04:25:28 | | Craigle (Craigle) joins |
| 04:30:45 | | Iki1 joins |
| 04:34:48 | | AnotherIki quits [Ping timeout: 268 seconds] |
| 04:54:04 | | datechnoman quits [Ping timeout: 240 seconds] |
| 04:57:36 | | Craigle quits [Read error: Connection reset by peer] |
| 04:58:09 | | Craigle (Craigle) joins |
| 04:58:43 | | fishingforpie quits [Read error: Connection reset by peer] |
| 04:59:21 | | datechnoman (datechnoman) joins |
| 04:59:23 | | fishingforpie joins |
| 05:09:39 | | Craigle quits [Client Quit] |
| 05:10:15 | | Craigle (Craigle) joins |
| 05:19:06 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 05:21:46 | | HackMii_ quits [Remote host closed the connection] |
| 05:22:20 | | HackMii_ (hacktheplanet) joins |
| 05:27:35 | | sec^nd quits [Remote host closed the connection] |
| 05:28:12 | | sec^nd (second) joins |
| 05:38:24 | | fishingforpie quits [Read error: Connection reset by peer] |
| 05:39:16 | | fishingforpie joins |
| 05:40:37 | | HackMii_ quits [Remote host closed the connection] |
| 05:40:50 | | HackMii_ (hacktheplanet) joins |
| 06:20:39 | | michaelblob (michaelblob) joins |
| 06:24:04 | | michaelblob_ quits [Ping timeout: 240 seconds] |
| 06:24:46 | | Iki1 quits [Remote host closed the connection] |
| 06:24:46 | | fishingforpie quits [Remote host closed the connection] |
| 06:24:46 | | michaelblob quits [Remote host closed the connection] |
| 06:24:58 | | fishingforpie joins |
| 06:24:59 | | Iki1 joins |
| 06:25:01 | | michaelblob (michaelblob) joins |
| 06:48:53 | | michaelblob_ (michaelblob) joins |
| 06:49:04 | | datechnoman quits [Client Quit] |
| 06:49:04 | | michaelblob quits [Remote host closed the connection] |
| 06:49:22 | | datechnoman (datechnoman) joins |
| 07:19:01 | <thuban> | aw fuck, bash.im is gone :( |
| 07:27:34 | <thuban> | normal as of 24 february, blacked out 25 february, then: https://web.archive.org/web/20220227133517/https://bash.im/ "NO WAR" |
| 07:32:57 | <thuban> | down at the end of march. back up for one day in may: https://web.archive.org/web/20220516193943/https://bash.im/ "Do backups while you can." |
| 07:36:33 | <thuban> | there are a bunch of spn and even archiveteam captures from the period the message was up (the latter from outlinks, i guess), but the last archivebot crawl was in 2019; we didn't get it in may. |
| 07:36:53 | <thuban> | i hope somebody did & that the ops are OK |
| 08:59:19 | | march_happy quits [Ping timeout: 265 seconds] |
| 08:59:38 | | march_happy (march_happy) joins |
| 09:51:08 | | tech_exorcist (tech_exorcist) joins |
| 09:57:54 | | tech_exorcist quits [Ping timeout: 255 seconds] |
| 10:21:14 | <nimaje> | about warcs: how about trying to get warc support into curl? |
| 11:28:43 | <spirit> | https://curl.se/mail/archive-2022-06/0016.html |
| 12:25:13 | | datechnoman9 (datechnoman) joins |
| 12:25:56 | | datechnoman quits [Client Quit] |
| 12:25:56 | | datechnoman9 is now known as datechnoman |
| 12:52:15 | | HackMii_ quits [Remote host closed the connection] |
| 12:52:48 | | HackMii_ (hacktheplanet) joins |
| 13:02:04 | | Arcorann quits [Ping timeout: 240 seconds] |
| 13:54:41 | | spirit quits [Quit: Leaving] |
| 14:56:24 | | Czechball4 joins |
| 14:56:28 | | Czechball quits [Ping timeout: 240 seconds] |
| 14:56:30 | | Czechball4 is now known as Czechball |
| 15:58:40 | | march_happy quits [Remote host closed the connection] |
| 15:59:26 | | march_happy (march_happy) joins |
| 16:01:07 | | march_happy quits [Remote host closed the connection] |
| 16:03:43 | | march_happy (march_happy) joins |
| 16:32:51 | | tech_exorcist (tech_exorcist) joins |
| 17:12:39 | | tech_exorcist quits [Remote host closed the connection] |
| 17:13:40 | | tech_exorcist (tech_exorcist) joins |
| 17:40:05 | <@JAA> | Yeah, so we'd just have to maintain the WARC implementation there because nobody else will do it. |
| 17:40:06 | | yawkat quits [Client Quit] |
| 17:40:26 | | yawkat (yawkat) joins |
| 17:42:20 | <@JAA> | I'm also not sure curl's the right place to put it. We'd just end up with yet another individual software that supports WARC. The proxy route seems more feasible to me, because as long as it supports the usual proxy methods, almost anything can be made to work with it. |
| 17:42:36 | <@JAA> | The other approach is a tcpdump2warc converter, but ... yeah, I won't touch that. |
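The WARC-writing proxy idea above boils down to serializing captured HTTP traffic into WARC records. A minimal sketch of the WARC/1.1 record framing in pure Python (this illustrates the format only, not the implementation of any existing proxy; the helper name is made up):

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(url: str, http_bytes: bytes) -> bytes:
    """Serialize one captured HTTP response as a WARC/1.1 'response' record.

    A WARC record is a header block and a content block separated by a
    blank line (CRLF CRLF), terminated by two more CRLFs.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = "\r\n".join([
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {url}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_bytes)}",
    ])
    return headers.encode("utf-8") + b"\r\n\r\n" + http_bytes + b"\r\n\r\n"

# Example: a tiny captured response, as a proxy might see it on the wire.
raw = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi"
record = warc_response_record("https://example.com/", raw)
```

A real tool would also emit corresponding `request` records and gzip each record individually, but the framing above is the core of it.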
| 17:45:56 | | Larsenv quits [Quit: ZNC 1.8.2+deb2build5 - https://znc.in] |
| 17:46:19 | | HackMii_ quits [Remote host closed the connection] |
| 17:47:01 | | Larsenv (Larsenv) joins |
| 17:47:07 | | HackMii_ (hacktheplanet) joins |
| 17:47:18 | | Larsenv quits [Remote host closed the connection] |
| 17:52:15 | | Larsenv (Larsenv) joins |
| 18:14:11 | | katocala quits [Remote host closed the connection] |
| 18:14:15 | | HackMii_ quits [Ping timeout: 255 seconds] |
| 18:16:16 | <Terbium> | I ported ludios wpull to 3.10+ a while back. Had to make more changes than I had expected to get it to work, but it wasn't too bad. |
| 18:17:05 | <Terbium> | Python made quite a few changes to async behavior from 3.6 -> 3.10, which affected a bunch of code for libraries such as tornado and for the tests |
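A typical example of the 3.6 -> 3.10 async breakage Terbium describes is the removal of the explicit `loop` argument and the move away from implicit event-loop creation (this is a generic illustration, not the actual wpull diff):

```python
import asyncio

async def fetch_all(coros):
    # 3.6-era code often looked like:
    #   loop = asyncio.get_event_loop()
    #   results = loop.run_until_complete(asyncio.gather(*coros, loop=loop))
    # The `loop` parameter was removed from most asyncio APIs in 3.10, and
    # get_event_loop() no longer reliably creates a loop outside a running one.
    return await asyncio.gather(*coros)

async def work(n):
    await asyncio.sleep(0)  # stand-in for real I/O
    return n * 2

# Modern entry point: asyncio.run() creates and closes the loop itself.
results = asyncio.run(fetch_all([work(1), work(2), work(3)]))
```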
| 18:20:29 | | HackMii_ (hacktheplanet) joins |
| 20:04:15 | | sonick quits [Client Quit] |
| 20:11:44 | | AK quits [Remote host closed the connection] |
| 20:12:39 | | AK (AK) joins |
| 20:15:08 | | AK quits [Remote host closed the connection] |
| 20:57:22 | | TastyWiener95 (TastyWiener95) joins |
| 21:13:52 | | andrew joins |
| 21:18:27 | <andrew> | is there any actual effort going on right now to archive Twitter? |
| 21:25:11 | | tech_exorcist quits [Client Quit] |
| 21:44:30 | <h2ibot> | TheTechRobo edited MuseScore (+42): https://wiki.archiveteam.org/?diff=49153&oldid=43517 |
| 21:46:38 | | BlueMaxima joins |
| 21:48:57 | | fuzzy8021 quits [Ping timeout: 268 seconds] |
| 21:52:53 | | fuzzy8021 (fuzzy8021) joins |
| 21:53:20 | <lennier1> | andrew: Individual users get run through archivebot sometimes, but there is no large-scale Twitter archiving at the moment. |
| 21:56:51 | <andrew> | lennier1: hmm, would such an archival task involve archiving entire webpages? it feels like the only way this would be remotely feasible is if we only archived the API responses or something |
| 21:57:16 | <andrew> | regardless, given what we know I think it would be prudent to start archiving it soon |
| 22:00:24 | <lennier1> | With archivebot it's individual web pages--each post has its own URL that goes into the Wayback Machine. I do think they compress reasonably well. Has anyone made an estimate of how much space/bandwidth that would actually require? |
| 22:03:05 | <andrew> | I see various stats floating around on the internet, let's just call it around 500 million tweets per day |
| 22:04:41 | <andrew> | Wayback Machine seems to archive the mobile edition's HTML, which is around 38 KB compressed for just the HTML |
| 22:05:23 | <andrew> | that's 19 terabytes per day of tweets |
| 22:08:17 | <lennier1> | Is that original tweets, or does it include retweets? |
| 22:09:03 | <lennier1> | Of course archiving images and videos would push the data up. |
| 22:09:14 | <andrew> | no idea, let me check Twitter's SEC filings |
| 22:10:03 | <lennier1> | And tweets from private accounts couldn't be archived. |
| 22:10:50 | <andrew> | I'd be willing to bet that the vast majority of tweets were made from public accounts |
| 22:11:00 | <andrew> | and that 500 million per day figure appears to be from 2014 |
| 22:12:38 | | sonick (sonick) joins |
| 22:12:52 | <andrew> | yeah it's difficult to find actual stats about this |
| 22:13:44 | <andrew> | my best guess is that archiving every tweet's HTML would cost 69 PB (nice?) |
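The back-of-envelope arithmetic behind the 19 TB/day and 69 PB figures checks out, assuming the quoted (and dated) 500M tweets/day rate held constant for roughly a decade and counting compressed HTML only, no media:

```python
tweets_per_day = 500_000_000   # figure quoted above, apparently from ~2014
bytes_per_tweet = 38_000       # compressed mobile HTML only, no media

per_day = tweets_per_day * bytes_per_tweet        # bytes per day
per_day_tb = per_day / 1e12                       # -> 19 TB/day

# ~10 years of tweets at a (very rough) constant rate:
total_pb = per_day * 365 * 10 / 1e15              # -> ~69 PB
```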
| 22:14:33 | <andrew> | but retweets don't actually get their own dedicated page and it's probably important to record those as well |
| 22:16:48 | <andrew> | I think what IA has done is special-cased Twitter in such a way that it only records a static snapshot of the HTML after it has loaded |
| 22:17:14 | <andrew> | Twitter doesn't give you the actual tweet in the HTML anymore, you have to run the page in a web browser |
| 22:20:45 | <lennier1> | Any estimate would really depend on the growth rate over time and the loss rate as accounts and tweets are deleted. I think maybe you can't even get all retweets. Like if you go to a user's page and keep scrolling back, it won't load more than a few thousand tweets. You can use the search API to find older original tweets, but not retweets. |
| 22:22:23 | <andrew> | which is why I think the best way to get an exhaustive Twitter archive is to save the API responses, using the Snowflake generation to discover IDs |
| 22:24:52 | <@JAA> | I wouldn't call that 'best' but rather 'least bad'. |
| 22:26:13 | | BlueMaxima_ joins |
| 22:26:32 | <@JAA> | It's an insane number of IDs to bruteforce your way through. |
| 22:27:10 | <@JAA> | Someone did the calculations on it a way back, and you'd need like a couple hundred API tokens to just keep up with newly posted tweets. |
| 22:27:15 | <@JAA> | a while back* |
| 22:28:30 | | andrew quits [Remote host closed the connection] |
| 22:28:30 | | BlueMaxima quits [Remote host closed the connection] |
| 22:28:45 | | andrew joins |
| 22:29:04 | <lennier1> | Is there a way to list all users, or would they just have to be discovered in progress? |
| 22:29:31 | <@JAA> | If you want to go through the past tweets, that's 12 years of IDs (snowflakes were introduced in 2010). Let's say you want to bruteforce that in 3 months. That then requires something like 10k API tokens. |
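The "insane number of IDs" can be made concrete. Per the commonly documented snowflake layout (41-bit millisecond timestamp above 22 low bits of worker and sequence counters; details hedged, this is a sketch):

```python
TWITTER_EPOCH_MS = 1288834974657  # 2010-11-04, start of the snowflake scheme

def snowflake_to_ms(snowflake_id: int) -> int:
    """The upper bits of a tweet ID are milliseconds since Twitter's epoch;
    the low 22 bits are machine + sequence counters."""
    return (snowflake_id >> 22) + TWITTER_EPOCH_MS

# Raw ID space per day: every millisecond times 2**22 low-bit combinations.
# ~3.6e14 candidate IDs per day, which is why a naive bruteforce is hopeless
# without batched lookup endpoints.
ids_per_day = 86_400_000 * 2**22
```

In practice one would scan by timestamp and batch candidate IDs through a lookup endpoint, which is where the API-token arithmetic above comes from.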
| 22:30:18 | <andrew> | lennier1: I don't know of any reasonable method of enumerating users unless they have some sequential ID that you can poll |
| 22:30:47 | <@JAA> | lennier1: There used to be a profile directory, but I think it was removed a long time ago. |
| 22:31:21 | <andrew> | it appears users do have a small numeric identifier |
| 22:31:33 | <andrew> | you may be able to query for it somehow |
| 22:31:35 | <@JAA> | Old ones have a sequential ID, but newer ones don't. |
| 22:31:52 | <@JAA> | New IDs are also snowflake-like, although I don't know the details. |
| 22:32:03 | <andrew> | I recall seeing some thing about bypassing the Twitter API rate limit by using some mobile app's key? |
| 22:33:07 | <andrew> | also, it appears Twitter has a per IP/Onion circuit rate limit of 500 requests per 5 minutes. 10k IPv4 addresses is actually pretty doable |
| 22:33:22 | <andrew> | (that is, using the website's internal API to grab information) |
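Taking andrew's quoted per-IP limit at face value (500 requests per 5-minute window, a figure from the conversation, not independently verified), the aggregate throughput of 10k addresses works out as:

```python
ips = 10_000
requests_per_window = 500
window_s = 5 * 60  # 5-minute rate-limit window

req_per_s = ips * requests_per_window / window_s        # ~16,667 req/s
req_per_day = ips * requests_per_window * 86_400 // window_s
# ~1.44e9 requests/day: enough to keep pace with ~500M new tweets/day,
# but backfilling 12 years of IDs is another matter entirely.
```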
| 22:35:32 | <IDK> | JAA: also, due to indexing limitations, older tweets don't come up in the search |
| 22:36:09 | <IDK> | You could just "discover" accounts from replies, following/followed, retweets, etc. |
| 22:36:12 | <@JAA> | IDK: That's not true. |
| 22:36:27 | | Hackerpcs_1 (Hackerpcs) joins |
| 22:36:37 | <@JAA> | andrew: Where can I get 10k IPv4 addresses cheaply? :-) |
| 22:36:39 | | nico_32 quits [Remote host closed the connection] |
| 22:36:39 | | fenugrec_ quits [Remote host closed the connection] |
| 22:36:39 | | Hackerpcs quits [Remote host closed the connection] |
| 22:36:39 | | hackbug quits [Remote host closed the connection] |
| 22:36:39 | | fenugrec_ joins |
| 22:36:46 | | hackbug (hackbug) joins |
| 22:36:47 | | nico_32 (nico) joins |
| 22:36:51 | <andrew> | JAA: Archiveteam Warrior instances :) |
| 22:37:07 | <lennier1> | This blog post says Twitter has nearly an exabyte of data: https://blog.twitter.com/engineering/en_us/topics/infrastructure/2022/scaling-data-access-by-moving-an-exabyte-of-data-to-google-cloud |
| 22:37:12 | <@JAA> | ... are nowhere near that. |
| 22:39:59 | <andrew> | JAA: there are also proxy services like Luminati and Stormproxies |
| 22:40:19 | <IDK> | JAA: to my knowledge, someone probably tweeted the word internet pre 2010 https://usercontent.irccloud-cdn.com/file/sVYYzAIG/internet.png |
| 22:40:45 | <IDK> | nvm I did it wrong |
| 22:41:14 | <@JAA> | andrew: Keyword being 'cheaply'. None of those services (that I looked at) were cheap. |
| 22:42:19 | <IDK> | We are really low on IP addresses rn |
| 22:42:24 | <@JAA> | IDK: Well, you did, but also their web interface sucks. It frequently shows 'no results' even though there are results on the underlying API because it stops the pagination too early. |
| 22:42:31 | <@JAA> | (snscrape handles that.) |
| 22:42:41 | <andrew> | https://twitter.com/search?q=(internet)%20until%3A2010-01-01&src=typed_query |
| 22:43:32 | <IDK> | tbh, I don't even think many providers offer a huge IP block |
| 22:44:15 | <IDK> | the biggest I saw was a /24 block assigned for 256 dollars extra |
| 22:44:53 | <andrew> | JAA: (note: this is not an endorsement) Stormproxies claims to give you a fresh IP address for every connection you make out of a claimed pool of 200k IPs |
| 22:45:55 | <audrooku|m> | Wow |
| 22:46:47 | <andrew> | I'd be willing to bet however that much of those IPs have horrible reputations or something given the nature of the service |
| 22:46:58 | <IDK> | How abt IPv6, how do they exactly block that |
| 22:47:03 | <andrew> | Twitter does not support IPv6 |
| 22:47:20 | <andrew> | if they did I'd literally just ask my friend for access to his 20 /48 blocks |
| 22:47:25 | <@JAA> | Yeah, most likely, and we'd need a large number of concurrent connections as well. Most they offer is 200. |
| 22:48:03 | <andrew> | you might be able to get around that with HTTP/2 multiplexing |
| 22:48:54 | <@JAA> | Partially maybe, but I don't think it'd scale to what we need directly. |
| 22:49:22 | <IDK> | it will probably be like the telegram project |
| 22:50:31 | <IDK> | where there is always a never ending todo list |
| 23:01:42 | | dasineura2 quits [Read error: Connection reset by peer] |
| 23:03:06 | | AK (AK) joins |
| 23:09:11 | <andrew> | ugh, is it even worth attempting to archive all of Twitter? |
| 23:22:37 | | HackMii_ quits [Remote host closed the connection] |
| 23:26:09 | | HackMii_ (hacktheplanet) joins |
| 23:27:05 | | HackMii_ quits [Remote host closed the connection] |
| 23:28:35 | | HackMii_ (hacktheplanet) joins |
| 23:28:52 | | fuzzy8021 quits [Ping timeout: 240 seconds] |
| 23:42:08 | | fuzzy8021 (fuzzy8021) joins |
| 23:43:56 | | HackMii_ quits [Remote host closed the connection] |
| 23:44:56 | | HackMii_ (hacktheplanet) joins |
| 23:55:03 | | HackMii_ quits [Remote host closed the connection] |
| 23:55:35 | | HackMii_ (hacktheplanet) joins |
| 23:56:15 | | sec^nd quits [Ping timeout: 255 seconds] |
| 23:56:56 | | sec^nd (second) joins |