| 00:01:45 | | Nulo joins |
| 00:32:08 | | sonick (sonick) joins |
| 00:47:07 | | fishingf0rpie quits [Quit: Leaving] |
| 00:47:22 | | fishingforpie joins |
| 01:02:16 | | march_happy quits [Ping timeout: 265 seconds] |
| 01:02:36 | | march_happy (march_happy) joins |
| 01:28:40 | | AnotherIki joins |
| 01:29:49 | | tzt quits [Ping timeout: 265 seconds] |
| 01:32:53 | | Iki1 quits [Ping timeout: 268 seconds] |
| 01:41:09 | | fishingforpie quits [Remote host closed the connection] |
| 01:41:09 | | onetruth quits [Remote host closed the connection] |
| 01:41:28 | | onetruth joins |
| 01:41:28 | | fishingforpie joins |
| 02:27:56 | | sepro quits [Quit: Bye!] |
| 02:28:39 | | sepro (sepro) joins |
| 03:08:49 | | Craigle quits [Quit: The Lounge - https://thelounge.chat] |
| 03:09:24 | | Craigle (Craigle) joins |
| 03:18:25 | | tzt (tzt) joins |
| 03:27:39 | | michaelblob_ (michaelblob) joins |
| 03:31:16 | | michaelblob quits [Ping timeout: 240 seconds] |
| 04:14:38 | | us3rrr joins |
| 04:15:02 | | onetruth quits [Remote host closed the connection] |
| 04:15:02 | | fishingforpie quits [Remote host closed the connection] |
| 04:15:13 | | fishingforpie joins |
| 04:24:54 | | Craigle quits [Client Quit] |
| 04:25:28 | | Craigle (Craigle) joins |
| 04:30:45 | | Iki1 joins |
| 04:34:48 | | AnotherIki quits [Ping timeout: 268 seconds] |
| 04:54:04 | | datechnoman quits [Ping timeout: 240 seconds] |
| 04:57:36 | | Craigle quits [Read error: Connection reset by peer] |
| 04:58:09 | | Craigle (Craigle) joins |
| 04:58:43 | | fishingforpie quits [Read error: Connection reset by peer] |
| 04:59:21 | | datechnoman (datechnoman) joins |
| 04:59:23 | | fishingforpie joins |
| 05:09:39 | | Craigle quits [Client Quit] |
| 05:10:15 | | Craigle (Craigle) joins |
| 05:19:06 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 05:21:46 | | HackMii_ quits [Remote host closed the connection] |
| 05:22:20 | | HackMii_ (hacktheplanet) joins |
| 05:27:35 | | sec^nd quits [Remote host closed the connection] |
| 05:28:12 | | sec^nd (second) joins |
| 05:38:24 | | fishingforpie quits [Read error: Connection reset by peer] |
| 05:39:16 | | fishingforpie joins |
| 05:40:37 | | HackMii_ quits [Remote host closed the connection] |
| 05:40:50 | | HackMii_ (hacktheplanet) joins |
| 06:20:39 | | michaelblob (michaelblob) joins |
| 06:24:04 | | michaelblob_ quits [Ping timeout: 240 seconds] |
| 06:24:46 | | Iki1 quits [Remote host closed the connection] |
| 06:24:46 | | fishingforpie quits [Remote host closed the connection] |
| 06:24:46 | | michaelblob quits [Remote host closed the connection] |
| 06:24:58 | | fishingforpie joins |
| 06:24:59 | | Iki1 joins |
| 06:25:01 | | michaelblob (michaelblob) joins |
| 06:48:53 | | michaelblob_ (michaelblob) joins |
| 06:49:04 | | datechnoman quits [Client Quit] |
| 06:49:04 | | michaelblob quits [Remote host closed the connection] |
| 06:49:22 | | datechnoman (datechnoman) joins |
| 07:19:01 | <thuban> | aw fuck, bash.im is gone :( |
| 07:27:34 | <thuban> | normal as of 24 february, blacked out 25 february, then: https://web.archive.org/web/20220227133517/https://bash.im/ "NO WAR" |
| 07:32:57 | <thuban> | down at the end of march. back up for one day in may: https://web.archive.org/web/20220516193943/https://bash.im/ "Do backups while you can." |
| 07:36:33 | <thuban> | there are a bunch of spn and even archiveteam captures from the period the message was up (the latter from outlinks, i guess), but the last archivebot crawl was in 2019; we didn't get it in may. |
| 07:36:53 | <thuban> | i hope somebody did & that the ops are OK |
| 08:59:19 | | march_happy quits [Ping timeout: 265 seconds] |
| 08:59:38 | | march_happy (march_happy) joins |
| 09:51:08 | | tech_exorcist (tech_exorcist) joins |
| 09:57:54 | | tech_exorcist quits [Ping timeout: 255 seconds] |
| 10:21:14 | <nimaje> | about warcs: how about trying to get warc support into curl? |
| 11:28:43 | <spirit> | https://curl.se/mail/archive-2022-06/0016.html |
| 12:25:13 | | datechnoman9 (datechnoman) joins |
| 12:25:56 | | datechnoman quits [Client Quit] |
| 12:25:56 | | datechnoman9 is now known as datechnoman |
| 12:52:15 | | HackMii_ quits [Remote host closed the connection] |
| 12:52:48 | | HackMii_ (hacktheplanet) joins |
| 13:02:04 | | Arcorann quits [Ping timeout: 240 seconds] |
| 13:54:41 | | spirit quits [Quit: Leaving] |
| 14:56:24 | | Czechball4 joins |
| 14:56:28 | | Czechball quits [Ping timeout: 240 seconds] |
| 14:56:30 | | Czechball4 is now known as Czechball |
| 15:58:40 | | march_happy quits [Remote host closed the connection] |
| 15:59:26 | | march_happy (march_happy) joins |
| 16:01:07 | | march_happy quits [Remote host closed the connection] |
| 16:03:43 | | march_happy (march_happy) joins |
| 16:32:51 | | tech_exorcist (tech_exorcist) joins |
| 17:12:39 | | tech_exorcist quits [Remote host closed the connection] |
| 17:13:40 | | tech_exorcist (tech_exorcist) joins |
| 17:40:05 | <@JAA> | Yeah, so we'd just have to maintain the WARC implementation there because nobody else will do it. |
| 17:40:06 | | yawkat quits [Client Quit] |
| 17:40:26 | | yawkat (yawkat) joins |
| 17:42:20 | <@JAA> | I'm also not sure curl's the right place to put it. We'd just end up with yet another individual software that supports WARC. The proxy route seems more feasible to me, because as long as it supports the usual proxy methods, almost anything can be made to work with it. |
| 17:42:36 | <@JAA> | The other approach is a tcpdump2warc converter, but ... yeah, I won't touch that. |
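The WARC-writing proxy idea above boils down to serializing captured HTTP traffic into WARC records. A minimal sketch of the WARC/1.1 record framing in pure Python (this illustrates the format only, not the implementation of any existing proxy; the helper name is made up):

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(url: str, http_bytes: bytes) -> bytes:
    """Serialize one captured HTTP response as a WARC/1.1 'response' record.

    A WARC record is a header block and a content block separated by a
    blank line (CRLF CRLF), terminated by two more CRLFs.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = "\r\n".join([
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {url}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_bytes)}",
    ])
    return headers.encode("utf-8") + b"\r\n\r\n" + http_bytes + b"\r\n\r\n"

# Example: a tiny captured response, as a proxy might see it on the wire.
raw = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi"
record = warc_response_record("https://example.com/", raw)
```

A real tool would also emit corresponding `request` records and gzip each record individually, but the framing above is the core of it.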
| 17:45:56 | | Larsenv quits [Quit: ZNC 1.8.2+deb2build5 - https://znc.in] |
| 17:46:19 | | HackMii_ quits [Remote host closed the connection] |
| 17:47:01 | | Larsenv (Larsenv) joins |
| 17:47:07 | | HackMii_ (hacktheplanet) joins |
| 17:47:18 | | Larsenv quits [Remote host closed the connection] |
| 17:52:15 | | Larsenv (Larsenv) joins |
| 18:14:11 | | katocala quits [Remote host closed the connection] |
| 18:14:15 | | HackMii_ quits [Ping timeout: 255 seconds] |
| 18:16:16 | <Terbium> | I ported ludios wpull to 3.10+ a while back. Had to make more changes than I had expected to get it to work, but it wasn't too bad. |
| 18:17:05 | <Terbium> | Python made quite a few changes to async behavior from 3.6 -> 3.10, which affected a bunch of code for libraries such as tornado and for the tests |
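A typical example of the 3.6 -> 3.10 async breakage Terbium describes is the removal of the explicit `loop` argument and the move away from implicit event-loop creation (this is a generic illustration, not the actual wpull diff):

```python
import asyncio

async def fetch_all(coros):
    # 3.6-era code often looked like:
    #   loop = asyncio.get_event_loop()
    #   results = loop.run_until_complete(asyncio.gather(*coros, loop=loop))
    # The `loop` parameter was removed from most asyncio APIs in 3.10, and
    # get_event_loop() no longer reliably creates a loop outside a running one.
    return await asyncio.gather(*coros)

async def work(n):
    await asyncio.sleep(0)  # stand-in for real I/O
    return n * 2

# Modern entry point: asyncio.run() creates and closes the loop itself.
results = asyncio.run(fetch_all([work(1), work(2), work(3)]))
```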
| 18:20:29 | | HackMii_ (hacktheplanet) joins |
| 20:04:15 | | sonick quits [Client Quit] |
| 20:11:44 | | AK quits [Remote host closed the connection] |
| 20:12:39 | | AK (AK) joins |
| 20:15:08 | | AK quits [Remote host closed the connection] |
| 20:57:22 | | TastyWiener95 (TastyWiener95) joins |
| 21:13:52 | | andrew joins |
| 21:18:27 | <andrew> | is there any actual effort going on right now to archive Twitter? |
| 21:25:11 | | tech_exorcist quits [Client Quit] |
| 21:44:30 | <h2ibot> | TheTechRobo edited MuseScore (+42): https://wiki.archiveteam.org/?diff=49153&oldid=43517 |
| 21:46:38 | | BlueMaxima joins |
| 21:48:57 | | fuzzy8021 quits [Ping timeout: 268 seconds] |
| 21:52:53 | | fuzzy8021 (fuzzy8021) joins |
| 21:53:20 | <lennier1> | andrew: Individual users get run through archivebot sometimes, but there is no large-scale Twitter archiving at the moment. |
| 21:56:51 | <andrew> | lennier1: hmm, would such an archival task involve archiving entire webpages? it feels like the only way this would be remotely feasible is if we only archived the API responses or something |
| 21:57:16 | <andrew> | regardless, given what we know I think it would be prudent to start archiving it soon |
| 22:00:24 | <lennier1> | With archivebot it's individual web pages--each post has its own URL that goes into the Wayback Machine. I do think they compress reasonably well. Has anyone made an estimate of how much space/bandwidth that would actually require? |
| 22:03:05 | <andrew> | I see various stats floating around on the internet, let's just call it around 500 million tweets per day |
| 22:04:41 | <andrew> | Wayback Machine seems to archive the mobile edition's HTML, which is around 38 KB compressed for just the HTML |
| 22:05:23 | <andrew> | that's 19 terabytes per day of tweets |
| 22:08:17 | <lennier1> | Is that original tweets, or does it include retweets? |
| 22:09:03 | <lennier1> | Of course archiving images and videos would push the data up. |
| 22:09:14 | <andrew> | no idea, let me check Twitter's SEC filings |
| 22:10:03 | <lennier1> | And tweets from private accounts couldn't be archived. |
| 22:10:50 | <andrew> | I'd be willing to bet that the vast majority of tweets were made from public accounts |
| 22:11:00 | <andrew> | and that 500 million per day figure appears to be from 2014 |
| 22:12:38 | | sonick (sonick) joins |
| 22:12:52 | <andrew> | yeah it's difficult to find actual stats about this |
| 22:13:44 | <andrew> | my best guess is that archiving every tweet's HTML would cost 69 PB (nice?) |
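The back-of-envelope arithmetic behind the 19 TB/day and 69 PB figures checks out, assuming the quoted (and dated) 500M tweets/day rate held constant for roughly a decade and counting compressed HTML only, no media:

```python
tweets_per_day = 500_000_000   # figure quoted above, apparently from ~2014
bytes_per_tweet = 38_000       # compressed mobile HTML only, no media

per_day = tweets_per_day * bytes_per_tweet        # bytes per day
per_day_tb = per_day / 1e12                       # -> 19 TB/day

# ~10 years of tweets at a (very rough) constant rate:
total_pb = per_day * 365 * 10 / 1e15              # -> ~69 PB
```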
| 22:14:33 | <andrew> | but retweets don't actually get their own dedicated page and it's probably important to record those as well |
| 22:16:48 | <andrew> | I think what IA has done is special-cased Twitter in such a way that it only records a static snapshot of the HTML after it has loaded |
| 22:17:14 | <andrew> | Twitter doesn't give you the actual tweet in the HTML anymore, you have to run the page in a web browser |
| 22:20:45 | <lennier1> | Any estimate would really depend on the growth rate over time and the loss rate as accounts and tweets are deleted. I think maybe you can't even get all retweets. Like if you go to a user's page and keep scrolling back, it won't load more than a few thousand tweets. You can use the search API to find older original tweets, but not retweets. |
| 22:22:23 | <andrew> | which is why I think the best way to get an exhaustive Twitter archive is to save the API responses, using the Snowflake generation to discover IDs |
| 22:24:52 | <@JAA> | I wouldn't call that 'best' but rather 'least bad'. |
| 22:26:13 | | BlueMaxima_ joins |
| 22:26:32 | <@JAA> | It's an insane number of IDs to bruteforce your way through. |
| 22:27:10 | <@JAA> | Someone did the calculations on it a way back, and you'd need like a couple hundred API tokens to just keep up with newly posted tweets. |
| 22:27:15 | <@JAA> | a while back* |
| 22:28:30 | | andrew quits [Remote host closed the connection] |
| 22:28:30 | | BlueMaxima quits [Remote host closed the connection] |
| 22:28:45 | | andrew joins |
| 22:29:04 | <lennier1> | Is there a way to list all users, or would they just have to be discovered in progress? |
| 22:29:31 | <@JAA> | If you want to go through the past tweets, that's 12 years of IDs (snowflakes were introduced in 2010). Let's say you want to bruteforce that in 3 months. That then requires something like 10k API tokens. |
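The "insane number of IDs" can be made concrete. Per the commonly documented snowflake layout (41-bit millisecond timestamp above 22 low bits of worker and sequence counters; details hedged, this is a sketch):

```python
TWITTER_EPOCH_MS = 1288834974657  # 2010-11-04, start of the snowflake scheme

def snowflake_to_ms(snowflake_id: int) -> int:
    """The upper bits of a tweet ID are milliseconds since Twitter's epoch;
    the low 22 bits are machine + sequence counters."""
    return (snowflake_id >> 22) + TWITTER_EPOCH_MS

# Raw ID space per day: every millisecond times 2**22 low-bit combinations.
# ~3.6e14 candidate IDs per day, which is why a naive bruteforce is hopeless
# without batched lookup endpoints.
ids_per_day = 86_400_000 * 2**22
```

In practice one would scan by timestamp and batch candidate IDs through a lookup endpoint, which is where the API-token arithmetic above comes from.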
| 22:30:18 | <andrew> | lennier1: I don't know of any reasonable method of enumerating users unless they have some sequential ID that you can poll |
| 22:30:47 | <@JAA> | lennier1: There used to be a profile directory, but I think it was removed a long time ago. |
| 22:31:21 | <andrew> | it appears users do have a small numeric identifier |
| 22:31:33 | <andrew> | you may be able to query for it somehow |
| 22:31:35 | <@JAA> | Old ones have a sequential ID, but newer ones don't. |
| 22:31:52 | <@JAA> | New IDs are also snowflake-like, although I don't know the details. |
| 22:32:03 | <andrew> | I recall seeing some thing about bypassing the Twitter API rate limit by using some mobile app's key? |
| 22:33:07 | <andrew> | also, it appears Twitter has a per IP/Onion circuit rate limit of 500 requests per 5 minutes. 10k IPv4 addresses is actually pretty doable |
| 22:33:22 | <andrew> | (that is, using the website's internal API to grab information) |
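Taking andrew's quoted per-IP limit at face value (500 requests per 5-minute window, a figure from the conversation, not independently verified), the aggregate throughput of 10k addresses works out as:

```python
ips = 10_000
requests_per_window = 500
window_s = 5 * 60  # 5-minute rate-limit window

req_per_s = ips * requests_per_window / window_s        # ~16,667 req/s
req_per_day = ips * requests_per_window * 86_400 // window_s
# ~1.44e9 requests/day: enough to keep pace with ~500M new tweets/day,
# but backfilling 12 years of IDs is another matter entirely.
```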
| 22:35:32 | <IDK> | JAA: also, due to indexing limitations, older tweets don't come up in the search |
| 22:36:09 | <IDK> | You could just "discover" accounts from replies, following/followed, retweets, etc. |
| 22:36:12 | <@JAA> | IDK: That's not true. |
| 22:36:27 | | Hackerpcs_1 (Hackerpcs) joins |
| 22:36:37 | <@JAA> | andrew: Where can I get 10k IPv4 addresses cheaply? :-) |
| 22:36:39 | | nico_32 quits [Remote host closed the connection] |
| 22:36:39 | | fenugrec_ quits [Remote host closed the connection] |
| 22:36:39 | | Hackerpcs quits [Remote host closed the connection] |
| 22:36:39 | | hackbug quits [Remote host closed the connection] |
| 22:36:39 | | fenugrec_ joins |
| 22:36:46 | | hackbug (hackbug) joins |
| 22:36:47 | | nico_32 (nico) joins |
| 22:36:51 | <andrew> | JAA: Archiveteam Warrior instances :) |
| 22:37:07 | <lennier1> | This blog post says Twitter has nearly an exabyte of data: https://blog.twitter.com/engineering/en_us/topics/infrastructure/2022/scaling-data-access-by-moving-an-exabyte-of-data-to-google-cloud |
| 22:37:12 | <@JAA> | ... are nowhere near that. |
| 22:39:59 | <andrew> | JAA: there are also proxy services like Luminati and Stormproxies |
| 22:40:19 | <IDK> | JAA: to my knowledge, someone probably tweeted the word internet pre 2010 https://usercontent.irccloud-cdn.com/file/sVYYzAIG/internet.png |
| 22:40:45 | <IDK> | nvm I did it wrong |
| 22:41:14 | <@JAA> | andrew: Keyword being 'cheaply'. None of those services (that I looked at) were cheap. |
| 22:42:19 | <IDK> | We are really low on IP addresses rn |
| 22:42:24 | <@JAA> | IDK: Well, you did, but also their web interface sucks. It frequently shows 'no results' even though there are results on the underlying API because it stops the pagination too early. |
| 22:42:31 | <@JAA> | (snscrape handles that.) |
| 22:42:41 | <andrew> | https://twitter.com/search?q=(internet)%20until%3A2010-01-01&src=typed_query |
| 22:43:32 | <IDK> | tbh, I don't even think many providers offer a huge IP block |
| 22:44:15 | <IDK> | the biggest I saw was a /24 block assigned for 256 dollars extra |
| 22:44:53 | <andrew> | JAA: (note: this is not an endorsement) Stormproxies claims to give you a fresh IP address for every connection you make out of a claimed pool of 200k IPs |
| 22:45:55 | <audrooku|m> | Wow |
| 22:46:47 | <andrew> | I'd be willing to bet however that much of those IPs have horrible reputations or something given the nature of the service |
| 22:46:58 | <IDK> | How abt IPv6, how do they exactly block that |
| 22:47:03 | <andrew> | Twitter does not support IPv6 |
| 22:47:20 | <andrew> | if they did I'd literally just ask my friend for access to his 20 /48 blocks |
| 22:47:25 | <@JAA> | Yeah, most likely, and we'd need a large number of concurrent connections as well. Most they offer is 200. |
| 22:48:03 | <andrew> | you might be able to get around that with HTTP/2 multiplexing |
| 22:48:54 | <@JAA> | Partially maybe, but I don't think it'd scale to what we need directly. |
| 22:49:22 | <IDK> | it will probably be like the telegram project |
| 22:50:31 | <IDK> | where there is always a never ending todo list |
| 23:01:42 | | dasineura2 quits [Read error: Connection reset by peer] |
| 23:03:06 | | AK (AK) joins |
| 23:09:11 | <andrew> | ugh, is it even worth attempting to archive all of Twitter? |
| 23:22:37 | | HackMii_ quits [Remote host closed the connection] |
| 23:26:09 | | HackMii_ (hacktheplanet) joins |
| 23:27:05 | | HackMii_ quits [Remote host closed the connection] |
| 23:28:35 | | HackMii_ (hacktheplanet) joins |
| 23:28:52 | | fuzzy8021 quits [Ping timeout: 240 seconds] |
| 23:42:08 | | fuzzy8021 (fuzzy8021) joins |
| 23:43:56 | | HackMii_ quits [Remote host closed the connection] |
| 23:44:56 | | HackMii_ (hacktheplanet) joins |
| 23:55:03 | | HackMii_ quits [Remote host closed the connection] |
| 23:55:35 | | HackMii_ (hacktheplanet) joins |
| 23:56:15 | | sec^nd quits [Ping timeout: 255 seconds] |
| 23:56:56 | | sec^nd (second) joins |