00:01:45Nulo joins
00:32:08sonick (sonick) joins
00:47:07fishingf0rpie quits [Quit: Leaving]
00:47:22fishingforpie joins
01:02:16march_happy quits [Ping timeout: 265 seconds]
01:02:36march_happy (march_happy) joins
01:28:40AnotherIki joins
01:29:49tzt quits [Ping timeout: 265 seconds]
01:32:53Iki1 quits [Ping timeout: 268 seconds]
01:41:09fishingforpie quits [Remote host closed the connection]
01:41:09onetruth quits [Remote host closed the connection]
01:41:28onetruth joins
01:41:28fishingforpie joins
02:27:56sepro quits [Quit: Bye!]
02:28:39sepro (sepro) joins
03:08:49Craigle quits [Quit: The Lounge - https://thelounge.chat]
03:09:24Craigle (Craigle) joins
03:18:25tzt (tzt) joins
03:27:39michaelblob_ (michaelblob) joins
03:31:16michaelblob quits [Ping timeout: 240 seconds]
04:14:38us3rrr joins
04:15:02onetruth quits [Remote host closed the connection]
04:15:02fishingforpie quits [Remote host closed the connection]
04:15:13fishingforpie joins
04:24:54Craigle quits [Client Quit]
04:25:28Craigle (Craigle) joins
04:30:45Iki1 joins
04:34:48AnotherIki quits [Ping timeout: 268 seconds]
04:54:04datechnoman quits [Ping timeout: 240 seconds]
04:57:36Craigle quits [Read error: Connection reset by peer]
04:58:09Craigle (Craigle) joins
04:58:43fishingforpie quits [Read error: Connection reset by peer]
04:59:21datechnoman (datechnoman) joins
04:59:23fishingforpie joins
05:09:39Craigle quits [Client Quit]
05:10:15Craigle (Craigle) joins
05:19:06BlueMaxima quits [Read error: Connection reset by peer]
05:21:46HackMii_ quits [Remote host closed the connection]
05:22:20HackMii_ (hacktheplanet) joins
05:27:35sec^nd quits [Remote host closed the connection]
05:28:12sec^nd (second) joins
05:38:24fishingforpie quits [Read error: Connection reset by peer]
05:39:16fishingforpie joins
05:40:37HackMii_ quits [Remote host closed the connection]
05:40:50HackMii_ (hacktheplanet) joins
06:20:39michaelblob (michaelblob) joins
06:24:04michaelblob_ quits [Ping timeout: 240 seconds]
06:24:46Iki1 quits [Remote host closed the connection]
06:24:46fishingforpie quits [Remote host closed the connection]
06:24:46michaelblob quits [Remote host closed the connection]
06:24:58fishingforpie joins
06:24:59Iki1 joins
06:25:01michaelblob (michaelblob) joins
06:48:53michaelblob_ (michaelblob) joins
06:49:04datechnoman quits [Client Quit]
06:49:04michaelblob quits [Remote host closed the connection]
06:49:22datechnoman (datechnoman) joins
07:19:01<thuban>aw fuck, bash.im is gone :(
07:27:34<thuban>normal as of 24 february, blacked out 25 february, then: https://web.archive.org/web/20220227133517/https://bash.im/ "NO WAR"
07:32:57<thuban>down at the end of march. back up for one day in may: https://web.archive.org/web/20220516193943/https://bash.im/ "Do backups while you can."
07:36:33<thuban>there are a bunch of spn and even archiveteam captures from the period the message was up (the latter from outlinks, i guess), but the last archivebot crawl was in 2019; we didn't get it in may.
07:36:53<thuban>i hope somebody did & that the ops are OK
08:59:19march_happy quits [Ping timeout: 265 seconds]
08:59:38march_happy (march_happy) joins
09:51:08tech_exorcist (tech_exorcist) joins
09:57:54tech_exorcist quits [Ping timeout: 255 seconds]
10:21:14<nimaje>about warcs: how about trying to get warc support into curl?
11:28:43<spirit>https://curl.se/mail/archive-2022-06/0016.html
12:25:13datechnoman9 (datechnoman) joins
12:25:56datechnoman quits [Client Quit]
12:25:56datechnoman9 is now known as datechnoman
12:52:15HackMii_ quits [Remote host closed the connection]
12:52:48HackMii_ (hacktheplanet) joins
13:02:04Arcorann quits [Ping timeout: 240 seconds]
13:54:41spirit quits [Quit: Leaving]
14:56:24Czechball4 joins
14:56:28Czechball quits [Ping timeout: 240 seconds]
14:56:30Czechball4 is now known as Czechball
15:58:40march_happy quits [Remote host closed the connection]
15:59:26march_happy (march_happy) joins
16:01:07march_happy quits [Remote host closed the connection]
16:03:43march_happy (march_happy) joins
16:32:51tech_exorcist (tech_exorcist) joins
17:12:39tech_exorcist quits [Remote host closed the connection]
17:13:40tech_exorcist (tech_exorcist) joins
17:40:05<@JAA>Yeah, so we'd just have to maintain the WARC implementation there because nobody else will do it.
17:40:06yawkat quits [Client Quit]
17:40:26yawkat (yawkat) joins
17:42:20<@JAA>I'm also not sure curl's the right place to put it. We'd just end up with yet another individual software that supports WARC. The proxy route seems more feasible to me, because as long as it supports the usual proxy methods, almost anything can be made to work with it.
17:42:36<@JAA>The other approach is a tcpdump2warc converter, but ... yeah, I won't touch that.
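For context on what any of these routes (a curl patch, a recording proxy, or a tcpdump converter) would ultimately have to produce, here is a minimal sketch of writing a single response record with the warcio library; the URL and payload are placeholders, not part of any existing tool.

    import io
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    with open('example.warc.gz', 'wb') as fh:
        writer = WARCWriter(fh, gzip=True)
        # Reconstruct the HTTP response headers as observed on the wire.
        http_headers = StatusAndHeaders('200 OK',
                                        [('Content-Type', 'text/html')],
                                        protocol='HTTP/1.1')
        record = writer.create_warc_record('http://example.com/', 'response',
                                           payload=io.BytesIO(b'<html>hello</html>'),
                                           http_headers=http_headers)
        writer.write_record(record)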
17:45:56Larsenv quits [Quit: ZNC 1.8.2+deb2build5 - https://znc.in]
17:46:19HackMii_ quits [Remote host closed the connection]
17:47:01Larsenv (Larsenv) joins
17:47:07HackMii_ (hacktheplanet) joins
17:47:18Larsenv quits [Remote host closed the connection]
17:52:15Larsenv (Larsenv) joins
18:14:11katocala quits [Remote host closed the connection]
18:14:15HackMii_ quits [Ping timeout: 255 seconds]
18:16:16<Terbium>I ported ludios wpull to 3.10+ a while back. Had to make more changes than I had expected to get it to work, but it wasn't too bad.
18:17:05<Terbium>Python made quite a few changes to async behavior from 3.6 -> 3.10, which affected a bunch of code for libraries such as tornado and for the tests
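A minimal illustration of the kind of asyncio change involved (assuming typical pre-3.10 code; not taken from the wpull port itself): the explicit loop argument that older code passed around was removed from most of the high-level API in 3.10.

    import asyncio

    async def fetch():
        await asyncio.sleep(0.1)
        return 'done'

    # Old 3.6-era style (raises TypeError on 3.10+, since loop= was removed):
    #   loop = asyncio.get_event_loop()
    #   loop.run_until_complete(asyncio.gather(fetch(), fetch(), loop=loop))

    async def main():
        # 3.10+ style: primitives bind to the running loop automatically.
        return await asyncio.gather(fetch(), fetch())

    print(asyncio.run(main()))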
18:20:29HackMii_ (hacktheplanet) joins
20:04:15sonick quits [Client Quit]
20:11:44AK quits [Remote host closed the connection]
20:12:39AK (AK) joins
20:15:08AK quits [Remote host closed the connection]
20:57:22TastyWiener95 (TastyWiener95) joins
21:13:52andrew joins
21:18:27<andrew>is there any actual effort going on right now to archive Twitter?
21:25:11tech_exorcist quits [Client Quit]
21:44:30<h2ibot>TheTechRobo edited MuseScore (+42): https://wiki.archiveteam.org/?diff=49153&oldid=43517
21:46:38BlueMaxima joins
21:48:57fuzzy8021 quits [Ping timeout: 268 seconds]
21:52:53fuzzy8021 (fuzzy8021) joins
21:53:20<lennier1>andrew: Individual users get run through archivebot sometimes, but there is no large-scale Twitter archiving at the moment.
21:56:51<andrew>lennier1: hmm, would such an archival task involve archiving entire webpages? it feels like the only way this would be remotely feasible is if we only archived the API responses or something
21:57:16<andrew>regardless, given what we know I think it would be prudent to start archiving it soon
22:00:24<lennier1>With archivebot it's individual web pages--each post has its own URL that goes into the Wayback Machine. I do think they compress reasonably well. Has anyone made an estimate of how much space/bandwidth that would actually require?
22:03:05<andrew>I see various stats floating around on the internet, let's just call it around 500 million tweets per day
22:04:41<andrew>Wayback Machine seems to archive the mobile edition's HTML, which is around 38 KB compressed for just the HTML
22:05:23<andrew>that's 19 terabytes per day of tweets
22:08:17<lennier1>Is that original tweets, or does it include retweets?
22:09:03<lennier1>Of course archiving images and videos would push the data up.
22:09:14<andrew>no idea, let me check Twitter's SEC filings
22:10:03<lennier1>And tweets from private accounts couldn't be archived.
22:10:50<andrew>I'd be willing to bet that the vast majority of tweets were made from public accounts
22:11:00<andrew>and that 500 million per day figure appears to be from 2014
22:12:38sonick (sonick) joins
22:12:52<andrew>yeah it's difficult to find actual stats about this
22:13:44<andrew>my best guess is that archiving every tweet's HTML would cost 69 PB (nice?)
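The back-of-envelope behind those figures (using the circa-2014 number of 500 million tweets/day and ~38 KB of compressed HTML per tweet, both quoted above):

    tweets_per_day = 500_000_000       # circa-2014 figure quoted above
    bytes_per_tweet = 38_000           # ~38 KB of compressed mobile HTML

    per_day = tweets_per_day * bytes_per_tweet      # 1.9e13 B ~= 19 TB/day
    per_decade = per_day * 365 * 10                 # ~6.9e16 B ~= 69 PB
    print(f"{per_day/1e12:.0f} TB/day, {per_decade/1e15:.0f} PB over ~10 years")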
22:14:33<andrew>but retweets don't actually get their own dedicated page, and it's probably important to record those as well
22:16:48<andrew>I think what IA has done is special-cased Twitter in such a way that it only records a static snapshot of the HTML after it has loaded
22:17:14<andrew>Twitter doesn't give you the actual tweet in the HTML anymore, you have to run the page in a web browser
22:20:45<lennier1>Any estimate would really depend on the growth rate over time and the loss rate as accounts and tweets are deleted. I think maybe you can't even get all retweets. Like if you go to a user's page and keep scrolling back, it won't load more than a few thousand tweets. You can use the search API to find older original tweets, but not retweets.
22:22:23<andrew>which is why I think the best way to get an exhaustive Twitter archive is to save the API responses, using the snowflake ID generation scheme to discover IDs
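For reference, Twitter snowflake IDs encode their creation time in the top bits, which is what makes time-window enumeration possible; a small sketch (the epoch constant and bit layout are the publicly documented ones):

    TWITTER_EPOCH_MS = 1288834974657   # 2010-11-04, start of the snowflake epoch

    def snowflake_timestamp_ms(snowflake_id: int) -> int:
        # Bits 22 and up are milliseconds since the Twitter epoch;
        # the low 22 bits are worker id + sequence counter.
        return (snowflake_id >> 22) + TWITTER_EPOCH_MS

    def min_snowflake_for(ts_ms: int) -> int:
        # Smallest ID that could have been generated at this millisecond;
        # useful as a lower bound when walking IDs over a time window.
        return (ts_ms - TWITTER_EPOCH_MS) << 22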
22:24:52<@JAA>I wouldn't call that 'best' but rather 'least bad'.
22:26:13BlueMaxima_ joins
22:26:32<@JAA>It's an insane number of IDs to bruteforce your way through.
22:27:10<@JAA>Someone did the calculations on it a while back, and you'd need like a couple hundred API tokens to just keep up with newly posted tweets.
22:28:30andrew quits [Remote host closed the connection]
22:28:30BlueMaxima quits [Remote host closed the connection]
22:28:45andrew joins
22:29:04<lennier1>Is there a way to list all users, or would they just have to be discovered in progress?
22:29:31<@JAA>If you want to go through the past tweets, that's 12 years of IDs (snowflakes were introduced in 2010). Let's say you want to bruteforce that in 3 months. That then requires something like 10k API tokens.
22:30:18<andrew>lennier1: I don't know of any reasonable method of enumerating users unless they have some sequential ID that you can poll
22:30:47<@JAA>lennier1: There used to be a profile directory, but I think it was removed a long time ago.
22:31:21<andrew>it appears users do have a small numeric identifier
22:31:33<andrew>you may be able to query for it somehow
22:31:35<@JAA>Old ones have a sequential ID, but newer ones don't.
22:31:52<@JAA>New IDs are also snowflake-like, although I don't know the details.
22:32:03<andrew>I recall seeing something about bypassing the Twitter API rate limit by using some mobile app's key?
22:33:07<andrew>also, it appears Twitter has a per IP/Onion circuit rate limit of 500 requests per 5 minutes. 10k IPv4 addresses is actually pretty doable
22:33:22<andrew>(that is, using the website's internal API to grab information)
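Roughly what that per-IP limit works out to across a pool of that size (a back-of-envelope only; it ignores bans, retries, and the multiple requests a single tweet or thread can need):

    requests_per_ip_per_5min = 500
    pool_size = 10_000

    per_second = requests_per_ip_per_5min * pool_size / 300    # ~16,700 req/s
    per_day = per_second * 86_400                               # ~1.4e9 req/day
    print(f"{per_second:,.0f} req/s, {per_day:,.0f} req/day across the pool")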
22:35:32<IDK>JAA: also, due to indexing limitations, older tweets don't come up in the search
22:36:09<IDK>You could just "discover" accounts from replies, following/followed, retweets, etc.
22:36:12<@JAA>IDK: That's not true.
22:36:27Hackerpcs_1 (Hackerpcs) joins
22:36:37<@JAA>andrew: Where can I get 10k IPv4 addresses cheaply? :-)
22:36:39nico_32 quits [Remote host closed the connection]
22:36:39fenugrec_ quits [Remote host closed the connection]
22:36:39Hackerpcs quits [Remote host closed the connection]
22:36:39hackbug quits [Remote host closed the connection]
22:36:39fenugrec_ joins
22:36:46hackbug (hackbug) joins
22:36:47nico_32 (nico) joins
22:36:51<andrew>JAA: Archiveteam Warrior instances :)
22:37:07<lennier1>This blog post says Twitter has nearly an exabyte of data: https://blog.twitter.com/engineering/en_us/topics/infrastructure/2022/scaling-data-access-by-moving-an-exabyte-of-data-to-google-cloud
22:37:12<@JAA>... are nowhere near that.
22:39:59<andrew>JAA: there are also proxy services like Luminati and Stormproxies
22:40:19<IDK>JAA: to my knowledge, someone probably tweeted the word internet pre 2010 https://usercontent.irccloud-cdn.com/file/sVYYzAIG/internet.png
22:40:45<IDK>nvm I did it wrong
22:41:14<@JAA>andrew: Keyword being 'cheaply'. None of those services (that I looked at) were cheap.
22:42:19<IDK>We are really low on IP addresses rn
22:42:24<@JAA>IDK: Well, you did, but also their web interface sucks. It frequently shows 'no results' even though there are results on the underlying API because it stops the pagination too early.
22:42:31<@JAA>(snscrape handles that.)
22:42:41<andrew>https://twitter.com/search?q=(internet)%20until%3A2010-01-01&src=typed_query
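For the record, a search like the one linked above can also be scraped from the web interface with snscrape's Python module, which works around the early-pagination problem mentioned above; a minimal sketch (the query string mirrors the linked search, and the result handling is purely illustrative):

    import snscrape.modules.twitter as sntwitter

    scraper = sntwitter.TwitterSearchScraper('internet until:2010-01-01')
    for i, tweet in enumerate(scraper.get_items()):
        print(tweet.date, tweet.id, tweet.url)
        if i >= 9:   # just sample the first few results
            break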
22:43:32<IDK>tbh, I don't even think many providers offer a huge IP block
22:44:15<IDK>the biggest I saw was a /24 block assigned for 256 dollars extra
22:44:53<andrew>JAA: (note: this is not an endorsement) Stormproxies claims to give you a fresh IP address for every connection you make out of a claimed pool of 200k IPs
22:45:55<audrooku|m>Wow
22:46:47<andrew>I'd be willing to bet however that many of those IPs have horrible reputations or something given the nature of the service
22:46:58<IDK>How abt IPv6, how do they exactly block that
22:47:03<andrew>Twitter does not support IPv6
22:47:20<andrew>if they did I'd literally just ask my friend for access to his 20 /48 blocks
22:47:25<@JAA>Yeah, most likely, and we'd need a large number of concurrent connections as well. The most they offer is 200.
22:48:03<andrew>you might be able to get around that with HTTP/2 multiplexing
22:48:54<@JAA>Partially maybe, but I don't think it'd scale to what we need directly.
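A small sketch of what HTTP/2 multiplexing looks like from the client side, e.g. with httpx; this only illustrates sharing one connection for many concurrent requests, and as noted above it would not by itself get around per-IP limits. The URLs are placeholders.

    import asyncio
    import httpx

    async def main():
        # http2=True needs the optional dependency: pip install 'httpx[http2]'
        async with httpx.AsyncClient(http2=True) as client:
            urls = [f'https://example.com/item/{i}' for i in range(20)]
            responses = await asyncio.gather(*(client.get(u) for u in urls))
            # All of these requests can share one TCP connection when the
            # server speaks HTTP/2.
            print({r.http_version for r in responses})

    asyncio.run(main())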
22:49:22<IDK>it will probably be like the telegram project
22:50:31<IDK>where there is always a never ending todo list
23:01:42dasineura2 quits [Read error: Connection reset by peer]
23:03:06AK (AK) joins
23:09:11<andrew>ugh, is it even worth attempting to archive all of Twitter?
23:22:37HackMii_ quits [Remote host closed the connection]
23:26:09HackMii_ (hacktheplanet) joins
23:27:05HackMii_ quits [Remote host closed the connection]
23:28:35HackMii_ (hacktheplanet) joins
23:28:52fuzzy8021 quits [Ping timeout: 240 seconds]
23:42:08fuzzy8021 (fuzzy8021) joins
23:43:56HackMii_ quits [Remote host closed the connection]
23:44:56HackMii_ (hacktheplanet) joins
23:55:03HackMii_ quits [Remote host closed the connection]
23:55:35HackMii_ (hacktheplanet) joins
23:56:15sec^nd quits [Ping timeout: 255 seconds]
23:56:56sec^nd (second) joins