00:06:59<TheTechRobo>Do we have a good system in place for archiving websocket stuff?
00:07:05<TheTechRobo>https://place.gd is the site I'm referring to
00:10:45<TheTechRobo>...It's firebase. (Or at least it looks like it.)
00:16:09<@JAA>We don't, and WARC doesn't even support it at all. Best bet might be a tcpdump-ish thing with a browser (and dumping TLS keys).
00:16:50<TheTechRobo>(Also, has Heritrix 3 been audited?)
00:17:56<@JAA>mitmdump (from mitmproxy) might be another option, but I don't know if it actually dumps all the information required (cf. transfer encoding).
00:18:24<ivan>SingleFile the rendered page
00:18:47<ivan>or pull JavaScript representation out of memory and reconstruct it with custom software lol
00:18:56<TheTechRobo>ivan: But it's currently still live...
00:19:09<ivan>SingleFile it every hour
00:20:16<ivan>I guess you could just pipe the websocket messages to a file and deal with it later somehow
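A minimal sketch of the "pipe the websocket messages to a file" idea from the line above, in Python. It covers only the logging side: each message is appended as one JSON line with a timestamp and a direction. How the messages are captured (browser devtools, a proxy, a client library) is left open, and `record_message`, `read_log`, and the field names are hypothetical. It does not preserve the HTTP upgrade handshake or frame boundaries, so it is not a substitute for a packet capture.

```python
import base64
import json
import time


def record_message(fh, direction, payload, now=None):
    """Append one websocket message to fh as a JSON line.

    direction: "send" or "recv" (hypothetical labels).
    payload: str for text messages, bytes for binary messages.
    now: optional timestamp override, mainly useful for testing.
    """
    if direction not in ("send", "recv"):
        raise ValueError("direction must be 'send' or 'recv'")
    entry = {
        "ts": time.time() if now is None else now,
        "dir": direction,
    }
    if isinstance(payload, bytes):
        # JSON cannot hold raw bytes, so binary payloads are base64-encoded.
        entry["type"] = "binary"
        entry["data"] = base64.b64encode(payload).decode("ascii")
    else:
        entry["type"] = "text"
        entry["data"] = payload
    fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    fh.flush()  # keep the log usable if the capture process dies


def read_log(fh):
    """Yield (ts, direction, payload) tuples from a log written above."""
    for line in fh:
        entry = json.loads(line)
        data = entry["data"]
        if entry["type"] == "binary":
            data = base64.b64decode(data)
        yield entry["ts"], entry["dir"], data
```

Whatever captures the messages would call `record_message(fh, "recv", msg)` for each one; `read_log` is the "deal with it later" half.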
00:40:13jacobk joins
01:02:48BlueMaxima_ joins
01:03:38BlueMaxima quits [Remote host closed the connection]
01:03:38XanaAdmin quits [Client Quit]
01:10:48Arcorann (Arcorann) joins
01:12:28sonick quits [Client Quit]
01:17:41Celluloid joins
01:18:35Celluloid quits [Remote host closed the connection]
01:49:21<TheTechRobo>Is there a good WARC library for Java?
01:49:32<TheTechRobo>Does Heritrix export its WARC stuff?
02:01:19cascode joins
02:02:29tomorrowRemoval joins
02:02:38<tomorrowRemoval>god, the ship is sinking fast https://twitter.com/jasonbaumgartne/status/1593573576346517504
02:06:48<joepie91|m>tomorrowRemoval: there is a mastodon instance exclusively for former Twitter employees, macaw.social, and it currently might actually have more Twitter employees than Twitter does
02:06:55<joepie91|m>for some further illustration on how things are going
02:07:33<tomorrowRemoval>are we... too late?
02:07:35<tomorrowRemoval>good god.
02:07:52<joepie91|m>well, your provisional deadline is Sunday
02:07:55<joepie91|m>that is when the world cup starts
02:08:11<joepie91|m>and therefore likely when the whole thing will blow up
02:08:17<andrew>someone please suggest a good Fediverse instance
02:10:09<joepie91|m>andrew: highly personal thing IMO, fedi instances tend to be strongly based around community. my recommendation would be to just pick one from https://joinmastodon.org/servers (ideally not a massive one), test the waters, and hop around until you find one that suits you
02:10:26<joepie91|m>account migration is pretty easy
02:10:49<joepie91|m>(more about this should probably go in -ot)
02:11:12<andrew>infosec.exchange looks fun :)
02:11:43<joepie91|m>seems to be a pretty okay instance from what I've seen
02:12:17<joepie91|m>do make sure to read the rules (https://infosec.exchange/about) because fedi is very different from twitter
02:12:30<joepie91|m>also https://gist.github.com/joepie91/f924e846c24ec7ed82d6d554a7e7c9a8 may be helpful
02:30:57<madpro|m>Never thought I'd see the day #archiveteam-bs would be talking about mastodon instances
02:31:02<madpro|m>Yet here we are
02:31:42<madpro|m>I'm conflicted as to whether or not I should rain on the parade by mentioning the "auto-delete" feature which can be instance-wide or turned on by individual users
02:32:02<madpro|m>https://scholar.social/@Em0nM4stodon@infosec.exchange/109367449990292160
02:32:29<joepie91|m>there's good reason for that to exist, and folks here should be aware that scraping is generally not welcomed on fedi in most places, and that the privacy expectations/dynamics are very different from twitter
02:32:49<madpro|m>> scraping is not welcomed on fedi
02:32:51<madpro|m>That too
02:33:36<joepie91|m>that doesn't mean there's never a reason to archive anything of course (think the usual "politician says something" kind of rationale) but don't be the guy trying to archive "the fediverse" basically :)
02:34:00<madpro|m>I'm just saying, we have a forum-watch. And as ineffective as that is
02:34:14<madpro|m>Mastodon as it continues to grow will demand a far more nuanced approach
02:34:32<madpro|m>There are already people furious at AT for the pettiest of reasons
02:34:49S55 joins
02:34:53<joepie91|m>I'm mentioning this mainly because a lot of news outlets are erroneously reporting on Mastodon as a "Twitter alternative" and there have already been entirely too many people assuming that the social norms are the same
02:35:14<madpro|m>And when you have this moral burden of "Twitter taught us that some things are more accountable from others"
02:35:16<joepie91|m>or that everything on fedi not behind a login is meant to be publicized to the world, for example
02:35:34<madpro|m>* meant to be more accountable
02:36:52<madpro|m>Volunteer Archiving will be further marginalized.
02:37:19<madpro|m>already, I find people asking "If these are volunteers, who do they volunteer for?"
02:39:13<madpro|m>I think that's all I have to say, if I go on I will digress into personal grievances.
02:44:52<h2ibot>Themadprogramer edited Mastodon (+250, Added summary on automatic post-deletion features): https://wiki.archiveteam.org/?diff=49164&oldid=45749
02:45:52<h2ibot>Themadprogramer edited Mastodon (-6, Added summary on automatic post-deletion features): https://wiki.archiveteam.org/?diff=49165&oldid=49164
02:45:59S55 quits [Remote host closed the connection]
03:16:45pabs quits [Ping timeout: 276 seconds]
03:17:25cascode quits [Remote host closed the connection]
03:17:25tomorrowRemoval quits [Remote host closed the connection]
03:26:47pabs (pabs) joins
03:27:20Lord_Nightmare2 (Lord_Nightmare) joins
03:27:42jacobk quits [Client Quit]
03:27:42Lord_Nightmare quits [Remote host closed the connection]
03:27:43Lord_Nightmare2 is now known as Lord_Nightmare
03:28:23jacobk joins
04:11:04AnotherIki joins
04:15:11Iki1 quits [Ping timeout: 268 seconds]
04:23:03wyatt8750 joins
04:24:21wyatt8740 quits [Ping timeout: 276 seconds]
04:33:37BlueMaxima_ quits [Client Quit]
04:51:38cascode joins
04:52:10Pichu0102 quits [Remote host closed the connection]
04:52:28Pichu0102 joins
04:53:08Pichu0102 quits [Remote host closed the connection]
04:53:40Pichu0102 joins
04:53:47atphoenix_ is now known as atphoenix
05:02:05<atphoenix>thought of the evening: seems to me that the Twitter shortener (t.co) could be at particular risk
05:03:26<andrew>t.co links are usually used from tweets though and the canonical URL is included in the API JSON
05:03:32<andrew>this could be an issue with archive sites though
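A rough sketch of andrew's point that the canonical URL travels with the tweet JSON, in Python. It assumes tweet objects shaped like the v1.1 API's, where `entities.urls` and `entities.media` entries carry `url` (the t.co link) and `expanded_url`; the function name `tco_mappings` is made up. It only helps for tweets whose JSON was saved, which is the gap discussed in the following lines.

```python
def tco_mappings(tweet):
    """Return {short_url: expanded_url} for one tweet object (a dict).

    Assumes v1.1-style entities; entries missing either field are skipped.
    """
    mapping = {}
    entities = tweet.get("entities") or {}
    for key in ("urls", "media"):
        for item in entities.get(key) or []:
            short = item.get("url")
            expanded = item.get("expanded_url")
            if short and expanded:
                mapping[short] = expanded
    return mapping
```

Run over a dump of saved tweets, the merged dictionaries give a t.co lookup table that does not depend on t.co staying online.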
05:07:46Iki1 joins
05:11:48AnotherIki quits [Ping timeout: 276 seconds]
05:14:17<atphoenix>the issue is with the unshortening. Yes #urlteam is a thing. But I doubt all the existing t.co URLs have been resolved into their real URLs. And the remaining ones depend on t.co remaining working. In short, the Terror of Tiny Town is real.
05:18:18pabs quits [Ping timeout: 276 seconds]
05:20:31pabs (pabs) joins
05:26:32AnotherIki joins
05:26:41sdffds joins
05:27:01sdffds quits [Remote host closed the connection]
05:29:37Iki1 quits [Ping timeout: 265 seconds]
05:41:29jacob joins
05:42:11jacob quits [Remote host closed the connection]
05:46:38AnotherIki quits [Remote host closed the connection]
05:46:48AnotherIki joins
05:57:01Iki1 joins
05:59:11AnotherIki quits [Remote host closed the connection]
05:59:11jacobk quits [Client Quit]
05:59:11cascode quits [Client Quit]
06:00:05jacobk joins
06:32:45qwertyasdfuiopghjkl joins
06:51:48Island quits [Read error: Connection reset by peer]
06:52:54wyatt8750 quits [Remote host closed the connection]
07:11:35<tech234a>Mastodon instance that is in the process of closing, not sure if it should be archived but I figure it's worth mentioning https://cybre.space/
07:15:41wyatt8740 joins
07:18:02wyatt8740 quits [Read error: Connection reset by peer]
07:19:38wyatt8740 joins
07:22:10<@JAA>Since it'll keep coming up now that there's far more attention on Mastodon, I'll repeat it again: we don't archive instances without explicit permission by the instance owner(s).
07:29:01<JTL>Reasonable take, but if someone else (not here) decided to go rogue and start scraping shit en masse, color me not surprised.
07:32:52wyatt8740 quits [Ping timeout: 265 seconds]
07:36:37wyatt8740 joins
07:41:38wyatt8750 joins
07:43:15wyatt8740 quits [Ping timeout: 276 seconds]
08:01:02<@OrIdow6^2>sweb on track Sanqui?
08:04:23<@OrIdow6^2>TheTechRobo: Is anything happening to place.gd?
08:27:17<@OrIdow6^2>Looking into webry
08:28:22wyatt8740 joins
08:29:56wyatt8750 quits [Client Quit]
08:29:56qwertyasdfuiopghjkl quits [Client Quit]
08:38:30tzt quits [Ping timeout: 268 seconds]
08:45:06qwertyasdfuiopghjkl joins
08:57:18<@Sanqui>OrIdow6^2: The ~120k domains I know about are scraped and I'm working on extracting some more. Contributions welcome
08:57:27<@Sanqui>by scraped I mean, done by ArchiveBot
08:57:58<@Sanqui>I want to process a few WARCs to extract links but I haven't had the chance yet
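One way the link-extraction step might look, sketched in Python with only the standard library. It takes an HTML body that has already been pulled out of a WARC response record (reading the WARC itself would need a separate library and is not shown) and returns the sweb.cz subdomains it links to. `LinkCollector` and `sweb_hosts` are hypothetical names, and only `<a href>` links are considered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative and protocol-relative links.
                self.links.append(urljoin(self.base_url, value))


def sweb_hosts(html, base_url, suffix=".sweb.cz"):
    """Return the sorted, de-duplicated hostnames under the given suffix."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    hosts = set()
    for link in parser.links:
        host = (urlsplit(link).hostname or "").lower()
        if host.endswith(suffix):
            hosts.add(host)
    return sorted(hosts)
```

The union of `sweb_hosts` results across all records, minus the domains already known, would be the list of new candidates.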
10:40:49qwertyasdfuiopghjkl quits [Remote host closed the connection]
10:52:34tech_exorcist (tech_exorcist) joins
10:56:21tech_exorcist quits [Remote host closed the connection]
10:57:12tech_exorcist (tech_exorcist) joins
11:01:38tech_exorcist quits [Read error: Connection reset by peer]
11:01:59tech_exorcist (tech_exorcist) joins
11:33:18tech_exorcist quits [Ping timeout: 255 seconds]
11:36:02tech_exorcist (tech_exorcist) joins
11:41:12tech_exorcist quits [Remote host closed the connection]
11:46:13qwertyasdfuiopghjkl joins
12:22:43tech_exorcist (tech_exorcist) joins
12:29:20tech_exorcist quits [Remote host closed the connection]
12:29:38tech_exorcist (tech_exorcist) joins
12:56:43qwertyasdfuiopghjkl quits [Remote host closed the connection]
13:15:28tech_exorcist quits [Remote host closed the connection]
13:16:22tech_exorcist (tech_exorcist) joins
13:21:31tech_exorcist quits [Remote host closed the connection]
13:22:09tech_exorcist (tech_exorcist) joins
13:23:12Arcorann quits [Ping timeout: 276 seconds]
13:30:32qwertyasdfuiopghjkl joins
13:31:29Larsenv quits [Quit: ZNC 1.8.2+deb2build5 - https://znc.in]
13:41:24pabs quits [Ping timeout: 276 seconds]
13:43:13mistersheeple joins
13:44:08pabs (pabs) joins
13:46:10<mistersheeple>is there anyone working on a solution for archiving twitter during these trying times?
13:51:40tech_exorcist quits [Remote host closed the connection]
13:52:14tech_exorcist (tech_exorcist) joins
14:31:08<TheTechRobo>OrIdow6^2: No but it's a new experiment like r/place
14:31:17<TheTechRobo>When it's done the finished level will be uploaded to the servers
14:31:25<TheTechRobo>Absolutely no idea if the entire history will be preserved
14:31:57<TheTechRobo>mistersheeple: You can request individual accounts or hashtags in #archivebot
14:36:07sonick (sonick) joins
14:48:43march_happy (march_happy) joins
14:53:26yawkat` quits [Ping timeout: 268 seconds]
15:15:15<IDK>TheTechRobo: tbh, that won't really be that scalable
15:15:31<IDK>for both current and future
15:15:37<TheTechRobo>No, but it's all we've currently got.
15:18:01<IDK>Hopefully there will be a twitter project in the near future, since this is the most requested project rn
15:19:07tech_exorcist quits [Client Quit]
15:19:36<@arkiver>tech234a: yeah there's another mastodon closing soon as well
15:19:51<@arkiver>mastodon.technology
15:29:57jacobk quits [Ping timeout: 276 seconds]
15:29:59yawkat (yawkat) joins
15:39:58tech_exorcist (tech_exorcist) joins
15:54:06<IDK>TheTechRobo: "Please try again in ~600 min. Crawling this host is paused because they notified us that they are overloaded right now.", my first time seeing this while doing twitter on SPN, is this a new limitation
15:54:31<IDK>if yes will this affect socialbot
15:59:53tech_exorcist quits [Client Quit]
16:01:01<h2ibot>JAABot edited CurrentWarriorProject (+4): https://wiki.archiveteam.org/?diff=49166&oldid=49163
16:09:10tech_exorcist (tech_exorcist) joins
16:12:16tech_exorcist quits [Remote host closed the connection]
16:13:14tech_exorcist (tech_exorcist) joins
16:25:33tech_exorcist quits [Client Quit]
16:26:21tech_exorcist (tech_exorcist) joins
16:45:07<@arkiver>for very important stuff, use archivebot and don't rely on SPN
16:48:10HP_Archivist (HP_Archivist) joins
17:00:18march_happy quits [Ping timeout: 265 seconds]
17:26:53holbrooke quits [Ping timeout: 265 seconds]
17:27:50user23436436 joins
17:28:19user23436436 quits [Remote host closed the connection]
17:38:08tech_exorcist quits [Client Quit]
17:38:15<TheTechRobo><IDK> TheTechRobo: "Please try again in ~600 min. Crawling this host is paused because they notified us that they are overloaded right now.", my first time seeing this while doing twitter on SPN, is this a new limitation
17:38:15<TheTechRobo>No
17:38:21<TheTechRobo>It's automatic based on 429s etc
17:38:40<TheTechRobo>And no, it should not affect socialbot unless socialbot uses SPN
17:38:46<TheTechRobo>This only affects SPN
17:39:39<IDK>I mean, I haven't seen 429s even with the 35-minute queue
17:40:04tech_exorcist (tech_exorcist) joins
17:40:40HP_Archivist quits [Client Quit]
17:44:29<@JAA>Also, WBM/SPN stuff can go to #internetarchive.
18:29:27tech_exorcist quits [Client Quit]
18:30:40tzt (tzt) joins
18:41:11mistersheeple quits [Client Quit]
18:46:57<betamax>I've been asked by someone who has a list of tweets (some with media) and they want to have a local copy
18:47:35<betamax>I tried wget (with "--convert-links --page-requisites") but it failed to download anything but the HTML, and that didn't display correctly
18:47:43<betamax>Is there an easy way to do this?
18:56:01Larsenv_ (Larsenv) joins
19:00:29<ivan>betamax: I will PM you because of a thing that is not entirely public
20:47:50<h2ibot>Tech234a edited Mastodon (+202, /* Dead and dying instances */…): https://wiki.archiveteam.org/?diff=49167&oldid=49165
20:57:07njha joins
20:59:14<njha>Hi! Berkeley is shutting down student use of *.berkeley.edu domains. Currently active groups will be moved to a different domain, but I think the plan is to just take down any inactive groups. There are definitely things of historical significance (see https://xcf.berkeley.edu/ for one example, although this page was built somewhat recently).
21:00:11<njha>I have the list of subdomains potentially being removed, if anyone is interested (1sec, let me upload this somewhere).
21:00:28<@arkiver>awesome thanks njha
21:00:32<@arkiver>we'll get them
21:03:29<@JAA>njha: You can upload that to https://transfer.archivete.am/
21:06:25<@arkiver>oh yes
21:07:21BlueMaxima joins
21:08:27<njha>oh that's convenient
21:08:35<njha>i saw that too late so I ended up just putting it here: http://quarantine.ocf.berkeley.edu/ocfhosted_berkeley.edu
21:09:09<@arkiver>that may be small enough for ArchiveBot
21:09:28<njha>yeah it's not a ton, only 700 sites or so
21:13:10<njha>I operate the webserver vhosting all these sites so I also have a snapshot of the source code (for php/django sites) and databases but I can't make those public for obvious reasons.
21:13:58<@arkiver>yeah, we'll just get the public data
21:40:43<@JAA>Sounds good, will queue that to ArchiveBot later.
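A small sketch of preparing such a subdomain list for queueing, in Python. It assumes the list is plain text with one hostname per line (the actual file format is not shown in the log) and turns it into de-duplicated URLs. `hosts_to_urls` and the choice of `http` as the default scheme are illustrative, not how ArchiveBot itself expects input.

```python
def hosts_to_urls(lines, scheme="http"):
    """Turn hostname lines into a de-duplicated list of root URLs.

    Blank lines and lines starting with '#' are skipped; order is kept.
    """
    seen = set()
    urls = []
    for raw in lines:
        host = raw.strip().lower().rstrip(".")
        if not host or host.startswith("#"):
            continue
        if host in seen:
            continue
        seen.add(host)
        urls.append(f"{scheme}://{host}/")
    return urls
```

For example, `hosts_to_urls(open("hosts.txt"))` with a hypothetical `hosts.txt` yields one root URL per unique hostname.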
21:43:08wyatt8750 joins
21:43:42wyatt8740 quits [Ping timeout: 276 seconds]
21:46:35wyatt8750 quits [Client Quit]
21:47:59wyatt8740 joins
21:50:11march_happy (march_happy) joins
22:24:28Arcorann (Arcorann) joins
22:26:42wyatt8740 quits [Client Quit]
22:26:57wyatt8740 joins
22:35:28<@arkiver>JAA: according to https://wiki.archiveteam.org/index.php/Sweb.cz we archived the 'subdomainfinder' sweb.cz URLs. i got 13352 subdomains for sweb.cz here
22:35:53<@arkiver>many may be dead
22:36:42<@arkiver>transfer.archivete.am/GdQK5/sweb.cz.txt
22:37:04<@arkiver>that is significantly more
22:37:10<@arkiver>is that too much for ArchiveBot to handle?
22:37:51katocala quits [Remote host closed the connection]
23:00:57Orange635252 joins
23:25:53Island joins