00:01:29 | | Ketchup901 quits [Remote host closed the connection] |
00:01:36 | | Ketchup901 (Ketchup901) joins |
00:32:30 | <@OrIdow6> | Do we know why Mastodon fails to play back in SPN etc? Looking at a live instance there doesn't seem to be POST, nondeterminism, etc |
00:59:24 | | datechnoman quits [Client Quit] |
01:05:06 | | datechnoman (datechnoman) joins |
01:15:52 | | Island quits [Read error: Connection reset by peer] |
01:23:02 | | Island joins |
02:10:21 | | jasons quits [Ping timeout: 272 seconds] |
02:23:01 | | f_ quits [Read error: Connection reset by peer] |
02:23:53 | | wyatt8750 joins |
02:24:45 | | f_ (funderscore) joins |
02:27:27 | | wyatt8740 quits [Ping timeout: 272 seconds] |
02:41:01 | <pabs> | what tools are people using for dumping URLs from AB meta-WARCs? in the past I just did ghetto grep/sed, but uhhh :) |
02:41:48 | <@JAA> | Ghetto grep :-) |
02:42:15 | | Naruyoko joins |
02:42:26 | <@JAA> | archivebot-log-extract-ignores and wpull2-log-extract-errors in little-things, for example |
02:50:55 | | Island quits [Read error: Connection reset by peer] |
03:01:41 | | Island joins |
03:13:15 | | jasons (jasons) joins |
03:19:08 | | eggdrop quits [] |
03:20:08 | | eggdrop (eggdrop) joins |
03:42:02 | | tech234a (tech234a) joins |
04:10:20 | | jasons quits [Ping timeout: 240 seconds] |
04:20:22 | | Naruyoko quits [Client Quit] |
04:24:53 | | Naruyoko joins |
04:40:25 | | Darken quits [Read error: Connection reset by peer] |
04:40:45 | | Darken (Darken) joins |
04:47:25 | | monoxane quits [Ping timeout: 272 seconds] |
05:05:41 | | Wohlstand quits [Client Quit] |
05:13:44 | | jasons (jasons) joins |
05:38:37 | | Island quits [Read error: Connection reset by peer] |
05:40:43 | | Wohlstand (Wohlstand) joins |
05:54:47 | <h2ibot> | FireonLive edited List of websites excluded from the Wayback Machine (+31, add showyourdick.org): https://wiki.archiveteam.org/?diff=51720&oldid=51699 |
06:00:48 | <h2ibot> | JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=51721&oldid=51720 |
06:13:20 | | jasons quits [Ping timeout: 240 seconds] |
06:35:13 | | BlueMaxima quits [Client Quit] |
06:43:13 | | sonick (sonick) joins |
07:16:49 | | jasons (jasons) joins |
07:42:02 | | Arcorann (Arcorann) joins |
08:17:50 | | jasons quits [Ping timeout: 240 seconds] |
08:22:27 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
08:24:36 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
08:50:50 | | Wohlstand quits [Ping timeout: 240 seconds] |
09:00:02 | | ThetaDev quits [Client Quit] |
09:00:47 | | ThetaDev joins |
09:21:23 | | jasons (jasons) joins |
09:24:14 | | emtee quits [Remote host closed the connection] |
10:00:02 | | Bleo18260 quits [Client Quit] |
10:01:21 | | Bleo18260 joins |
10:33:14 | | sonick quits [Client Quit] |
11:04:46 | | pedantic-darwin quits [Quit: The Lounge - https://thelounge.chat] |
11:26:30 | <thuban> | OrIdow6: presumably the webpack is too obfuscated for wbm's rewriter (wombat.js, i believe) to recognize and handle |
11:32:41 | <thuban> | (i don't think it _either_ follows the chunk loading _or_ makes any api calls--the api response is there on the backend, but the rewritten js in the wbm's frontend can't find it) |
11:33:25 | <thuban> | i think we can safely bet that gotm.io won't work either |
11:39:08 | <eightthree> | otherwise public content obtained through a logged in account is offlimits? i.e. code search results on github? |
11:43:36 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
11:51:53 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
11:54:08 | <@OrIdow6> | thuban: Ah, not really a JS person, that's a shame |
11:54:23 | <@OrIdow6> | At some point I'll have to see if it works in my hackish WBM-as-HTTP-proxy thing |
11:56:04 | <@OrIdow6> | eightthree: In general yes - Github code search is more in the category of "why would we archive this in the first place?" but for instance a forum that requires users to be logged in is definitely no-go |
11:57:03 | <@OrIdow6> | No hard and fast rule but for instance NSFW is usually an exception, basically since that's not a desire for privacy by the users/owner of the site, just them protecting themselves from legal stuff |
11:59:28 | <@OrIdow6> | And there's a big gray area cause sites use accounts for bot protection, to hide lists of posts that are themselves public, etc. |
11:59:58 | <thuban> | OrIdow6: yeah |
12:00:01 | <thuban> | ot1h, i see why the wbm works the way it does given its constraints (especially historically) |
12:00:11 | <thuban> | otoh, if you can do it conveniently, proxying is sure to be much more effective! |
12:00:16 | <eightthree> | OrIdow6: am i understanding correctly that websites that are or have nsfw content and require logging in to view them, in this case the archiving is ok for AT? this is the case for all major projects, reddit, twitter, or even some of the full-on porn sites that were archived? |
12:01:04 | <@OrIdow6> | thuban: Yeah, it plays back a lot better than the web version |
12:01:40 | <thuban> | eightthree: we've occasionally made that exception in the past for nsfw-focused sites (https://wiki.archiveteam.org/index.php/FurAffinity) |
12:02:03 | <thuban> | but increasingly rarely and reluctantly |
12:02:43 | <thuban> | it doesn't apply to any major or current projects |
12:03:10 | <@OrIdow6> | I've looked at trying to make something out of Webkit or similar that passes all HTTP traffic thru the WBM (something less hackish than my solution, which is a Python script + several minutes of configuring a Firefox profile) but IIRC most of them were dead-set on doing their own networking |
12:03:33 | <@OrIdow6> | *HTTP/S |
12:04:30 | <eightthree> | thuban: so nsfw or login-to-view content on youtube, x.com, reddit,telegram etc is not archived, right? |
12:05:54 | <thuban> | eightthree: it's a bit complicated, since 'nsfw' and 'login to view' are generally not the same or even similar |
12:06:19 | | benjinsm quits [Ping timeout: 272 seconds] |
12:09:21 | <thuban> | if it's nsfw but can be viewed freely or with a freely available mechanism (like reddit's 'i'm over 18' cookie), we'll get it |
12:09:55 | <thuban> | if it requires login to view, we won't, whether or not it's nsfw |
12:10:24 | <eightthree> | Im guessing Arkiver2 isn't a single human user but is a generic account where various people use it to contribute? |
12:11:05 | <eightthree> | they seem to have most of the commits to the projectname-grab and projectname-items repos |
12:11:24 | <thuban> | arkiver is very much a real person :) |
12:11:33 | <eightthree> | !!! |
12:11:45 | <Maakuth|m> | on this very channel too |
12:11:58 | <eightthree> | busfactor = 1.0x? |
12:12:37 | <eightthree> | will the project survive if they can no longer contribute? |
12:12:50 | | jasons quits [Ping timeout: 240 seconds] |
12:13:38 | <thuban> | not _quite_ that bad; there are other people who understand the relevant infra even if they presently contribute much less |
12:14:37 | <thuban> | but yeah it's a known issue |
12:16:47 | <thuban> | OrIdow6: implement a browser in your browser so you can make http requests while you make http requests! that's what webassembly was intended for, right? |
12:21:37 | <eightthree> | ah |
12:25:09 | | benjins joins |
12:37:59 | | Arcorann quits [Ping timeout: 272 seconds] |
12:38:41 | <@OrIdow6> | thuban: I actually considered that at one point, it seemed like too much work though |
12:48:23 | <h2ibot> | OrIdow6 edited Parler (+1215, I think enough time has passed to write a bit…): https://wiki.archiveteam.org/?diff=51722&oldid=49713 |
13:09:27 | <h2ibot> | OrIdow6 uploaded File:Archiveteam and archiveteam-bs unique users, 2012-2024.png (A Pyplot of the number of unique users talking…): https://wiki.archiveteam.org/?title=File%3AArchiveteam%20and%20archiveteam-bs%20unique%20users%2C%202012-2024.png |
13:13:28 | <h2ibot> | OrIdow6 edited Parler (+158, /* Grab */ Add image showing the people coming…): https://wiki.archiveteam.org/?diff=51724&oldid=51722 |
13:16:19 | | jasons (jasons) joins |
13:23:35 | <TheTechRobo> | Service workers would also work, I think |
14:00:08 | | etnguyen03 (etnguyen03) joins |
14:03:57 | | SootBector quits [Remote host closed the connection] |
14:04:17 | | SootBector (SootBector) joins |
14:08:22 | | SootBector quits [Remote host closed the connection] |
14:08:41 | | SootBector (SootBector) joins |
14:12:20 | | jasons quits [Ping timeout: 240 seconds] |
14:18:17 | | lucas joins |
14:26:04 | <lucas> | Hi! I was wondering if anybody thought about archiving Twitter threads that appear on threadreaderapp.com. It seems it's already a way to crowdsource whether some collections of tweets have been deemed of interest, and maybe it's even easier than scraping from twitter.com. Thanks! |
14:29:26 | | Darken quits [Read error: Connection reset by peer] |
14:34:23 | <@arkiver> | aninternettroll: what you want to archive would be mastodon? in general? |
14:34:51 | <@arkiver> | eightthree: i'm real yes :) |
14:53:05 | | lucas quits [Remote host closed the connection] |
14:59:27 | <aninternettroll> | arkiver: first thought went to speedrun.com actually, which has a big community of api users with which you could get a lot of URLs |
14:59:55 | <aninternettroll> | (also an excuse to do something in rust) |
15:00:25 | <aninternettroll> | anyway, those plans are very young and might go nowhere |
15:03:12 | | lucas joins |
15:03:51 | | lucas is now known as ketsapiwiq |
15:12:25 | <eightthree> | arkiver: awesome! how/where do I grep what to know which projects include/avoid nsfw or "login-needed-to-view" content |
15:14:10 | <eightthree> | I tried nsfw already but not much found, the second I have no idea how (or if) you write that in issues/commits personally. I dont have code search, so if thats the way, I will have to...download all the projects and grep them locally? |
15:15:37 | | jasons (jasons) joins |
15:47:11 | <aninternettroll> | tbh mastodon wouldn't be a bad idea either, though it would require some lua knowledge and some analysing first. Sounds fun though |
15:48:54 | <that_lurker> | You could maybe do mastodon through the rss feed by adding .rss after the username |
15:51:21 | <aninternettroll> | the more interesting goal would be to archive everything the web ui needs for a given link |
16:04:33 | <DigitalDragons> | if not doing a full page grab it would probably be better to save the activitypub objects instead of the rss feed |
16:11:17 | <thuban> | eightthree: the premise of your question seems kind of confused (if we're not saving some portion of a site, we... don't write code to save it). what exactly are you trying to do here? what is your interest in nsfw specifically? |
16:14:53 | | ketsapiwiq quits [Ping timeout: 265 seconds] |
16:16:20 | | jasons quits [Ping timeout: 240 seconds] |
16:16:43 | | pedantic-darwin joins |
16:20:28 | <eightthree> | thuban: nsfw isnt just lewd stuff...there's also relatively controversial content, proof of war crimes etc... |
16:21:06 | <thuban> | right, but why are you asking about it? |
16:22:15 | <eightthree> | ive also seen relatively tame text only content (incorrectly?) tagged with nsfw, like people would tag it as such just in case |
16:24:58 | <thuban> | right, but again, what of it? |
16:25:58 | <thuban> | i don't mean to be interrogatory, it just seems like there's a lingering misunderstanding here |
16:32:21 | <imer> | as far as I remember there wasn't anything "special" about nsfw content per se that would stop us archiving it, I believe that wasn't always the stance? |
16:32:21 | <imer> | nsfw usually means porn though.. and porn videos are *big* and there's many of them, which makes archiving those not feasible |
16:32:21 | <imer> | there's also the issue of things locked behind a login.. which is adjacent, but a different thing altogether |
16:32:49 | <eightthree> | im trying to get a feel for how things are included or excluded from the archive created vs the original i.e. mapping out the differences (for my personal understanding) |
16:33:19 | <thuban> | eightthree: if we're saving a site, and something's publicly posted on it, we try and save it |
16:33:27 | <thuban> | there's no additional level of moderation |
16:33:39 | <eightthree> | censorship is also very subjective |
16:34:26 | <imer> | given constraints of data size of course, and we do sometimes skip stuff or grab only the lower quality versions if time is running out until shutdown |
16:35:00 | <thuban> | yeah, hence "try" |
16:41:49 | | etnguyen03 quits [Ping timeout: 272 seconds] |
17:02:47 | | lucas joins |
17:04:29 | | lucas quits [Remote host closed the connection] |
17:20:15 | | jasons (jasons) joins |
18:06:24 | | qwertyasdfuiopghjkl quits [Ping timeout: 256 seconds] |
18:15:09 | | Wohlstand (Wohlstand) joins |
18:19:50 | | jasons quits [Ping timeout: 240 seconds] |
18:39:51 | | etnguyen03 (etnguyen03) joins |
19:04:57 | | Hackerpcs quits [Quit: Hackerpcs] |
19:07:02 | | Hackerpcs (Hackerpcs) joins |
19:09:22 | | Island joins |
19:25:20 | | etnguyen03 quits [Ping timeout: 240 seconds] |
19:28:33 | | jasons (jasons) joins |
19:51:02 | | icedice quits [Client Quit] |
20:20:50 | | jasons quits [Ping timeout: 240 seconds] |
21:23:47 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
21:24:27 | | jasons (jasons) joins |
21:33:08 | | threedeeitguy398 (threedeeitguy) joins |
21:34:25 | | threedeeitguy39 quits [Ping timeout: 272 seconds] |
21:34:25 | | threedeeitguy398 is now known as threedeeitguy39 |
21:51:31 | | BlueMaxima joins |
21:57:40 | | threedeeitguy397 (threedeeitguy) joins |
21:58:20 | | threedeeitguy39 quits [Ping timeout: 240 seconds] |
21:58:20 | | threedeeitguy397 is now known as threedeeitguy39 |
22:22:20 | | jasons quits [Ping timeout: 240 seconds] |
22:34:33 | | neggles quits [Quit: bye friends - ZNC - https://znc.in] |
22:34:46 | | neggles (neggles) joins |
23:15:20 | | jasons (jasons) joins |