00:00:06ducky_ (ducky) joins
00:01:45ducky quits [Ping timeout: 268 seconds]
00:01:45ducky_ is now known as ducky
00:03:35Shard6 (Shard) joins
00:04:55Shard quits [Ping timeout: 268 seconds]
00:04:56Shard6 is now known as Shard
00:13:29retrograde quits [Remote host closed the connection]
00:13:50retrograde (retrograde) joins
00:27:45Arcorann (Arcorann) joins
00:28:54etnguyen03 (etnguyen03) joins
00:29:28Webuser022567 quits [Client Quit]
00:48:17<pabs>nothing mentioned here https://wiki.archiveteam.org/index.php/Bluesky /cc Shard
00:56:20hyperreal (hyperreal) joins
01:35:29<h2ibot>Cooljeanius edited Archiwum Allegro (+16, /* Other archives */ use URL template): https://wiki.archiveteam.org/?diff=61091&oldid=60975
01:38:14polypept1 (polypeptide) joins
01:39:03etnguyen03 quits [Client Quit]
01:41:14polypeptide quits [Ping timeout: 260 seconds]
01:48:58cyanbox quits [Read error: Connection reset by peer]
02:10:11cyanbox joins
02:18:14<kline><klea> I wonder, is there any interest in archiving rsync:// urls?
02:18:48<kline>this would be good for source.org where i have a big filesystem copy, but it would need to support deltas as well as it's a constantly changing tree
02:19:37<@JAA>But are there large files where only small portions change, or do they just add/remove files?
02:19:45<@JAA>You'd only need delta for the former.
02:20:02<@JAA>Well, 'need'
02:21:10<kline>im not sure if there are changes, but there seem to be non-standard files (by which i mean symlinks etc) when I saw the logs scrolling by. I'm not sure if these symlinks change
02:21:44etnguyen03 (etnguyen03) joins
02:22:13<@JAA>I don't think that'd be an issue (although figuring out how to represent them in WARC would be interesting).
02:22:22<kline>i think it would be reasonable to expect that this is something that an rsync archive _could_ cover in any case, or else you'd have limitations on what rsync is specifically good at
02:23:17<kline>i wonder if it could be modeled like http patch/range requests
02:23:35<@JAA>Well, it's also about the storage though. You'd need to store the binary delta and then also provide tooling that can combine it with the previous record(s) to produce the correct result again. Which would be very messy.
02:24:27<@JAA>If the vast majority of things being archived this way is of the 'public files that rarely change after first being added' category, that delta mechanism is mostly just unnecessary complexity.
02:24:37<kline>fair enough
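[editor's note] JAA's point that delta records would need extra replay tooling can be illustrated with a toy copy/insert delta. This is not rsync's actual algorithm or any WARC record format, just a minimal sketch of why a stored delta is useless without the previous full record:

```python
# Toy delta format: a list of ('copy', offset, length) ops that reuse bytes
# from the previous (base) version, and ('insert', data) ops with new bytes.
# Reconstructing the current file requires BOTH the delta and the base record,
# which is exactly the replay-tooling burden described above.

def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild a file from a base version plus a list of delta operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, off, length = op
            out += base[off:off + length]   # bytes reused from the base record
        elif op[0] == "insert":
            out += op[1]                    # literal new bytes stored in the delta
        else:
            raise ValueError(f"unknown op {op[0]!r}")
    return bytes(out)

# A tiny edit stores only three ops instead of a second full copy:
base = b"The quick brown fox jumps over the lazy dog"
delta = [("copy", 0, 16), ("insert", b"cat"), ("copy", 19, 24)]
rebuilt = apply_delta(base, delta)  # b"The quick brown cat jumps over the lazy dog"
```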
02:35:44SootBector quits [Remote host closed the connection]
02:36:11rohvani quits [Quit: Ping timeout (120 seconds)]
02:36:26rohvani joins
02:36:36<h2ibot>PaulWise edited Obstacles (+54, add trap_bots poisoner): https://wiki.archiveteam.org/?diff=61092&oldid=61084
02:36:47<pabs>arkiver: another poisoner https://maurycyz.com/projects/trap_bots/ - known instances are under https://maurycyz.com/babble/ and https://v6.maurycyz.com/babble/
02:36:55SootBector (SootBector) joins
02:42:39<triplecamera|m>"<@JAA> We have a new wiki user who chose the username 'User'..." <- It might be a good idea to ask him to pick a new username on his talk page
02:43:25<triplecamera|m>Usernames like "User" are typically against the username policy on large wikis
02:45:47grill quits [Ping timeout: 268 seconds]
02:46:10grill (grill) joins
03:02:56SootBector quits [Remote host closed the connection]
03:07:27sepro quits [Ping timeout: 268 seconds]
03:09:21SootBector (SootBector) joins
03:15:55<hyperreal>Hi all. I'm kind of fuzzy about how all this works. I'd love to help out. What is an rsync target exactly, and how would I help the effort by setting one up on my infra? https://wiki.archiveteam.org/index.php/Dev/Staging
03:17:16<hyperreal>I know what rsync is, but this seems like it's part of some kind of pipeline. I'm not sure what the pipeline consists of.
03:24:01Mateon1 quits [Ping timeout: 268 seconds]
03:24:39Mateon1 joins
03:34:11<pabs>hyperreal: the pipeline is website -> warrior -> rsync target -> archive.org. usually targets are only run by core AT folks. anyone can contribute a warrior though
03:34:12<pabs>https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
03:34:14<TheTechRobo>rsync target is where downloaded data is uploaded for packing into big WARCs to the Internet Archive
03:34:22<pabs>also see https://wiki.archiveteam.org/index.php/Archiveteam:Acronyms
03:34:34<TheTechRobo>haha, pabs beat me to it :P
03:34:52<hyperreal>Ah okay, thanks! Yeah I'm already running a warrior
03:36:54<hyperreal>On the Main_Page of the wiki, under Warrior-based projects, it shows a Telegram "for Warriors that don't have the multiple default projects update". What is the multiple default projects update?
03:39:52<TheTechRobo>A few weeks ago, there was an update to the Warrior code that allows weighted randomness for the default project. Before the update, only one "Archiveteam's Choice" could exist; now multiple can exist, with different percentages of Warriors running each
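[editor's note] The behaviour TheTechRobo describes can be sketched as a weighted random pick (this is an illustration, not the actual seesaw-kit code; the project names and weights are hypothetical):

```python
# Each candidate default project carries a weight; each Warrior draws one
# project at random in proportion to those weights, so e.g. a 70/30 split
# ends up with roughly 70% of Warriors on the first project.
import random

def pick_default_project(projects: dict, rng: random.Random) -> str:
    """projects maps project name -> weight (share of Warriors)."""
    names = list(projects)
    weights = [projects[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)
projects = {"project_a": 70, "project_b": 30}  # hypothetical names/weights
counts = {"project_a": 0, "project_b": 0}
for _ in range(10_000):
    counts[pick_default_project(projects, rng)] += 1
# counts["project_a"] comes out near 7000
```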
03:40:18chrismeller3 quits [Quit: chrismeller3]
03:40:55chrismeller3 (chrismeller) joins
03:43:46Nekroschizofrenetyk joins
03:43:56gosc joins
03:44:41<gosc>got a list of 5200 Brightcove Player urls, which channel do I send it to
03:45:37<hyperreal>TheTechRobo: I see. Would that be this? https://github.com/ArchiveTeam/warrior4-vm
03:58:07<TheTechRobo>hyperreal: -ish, the VM is really a wrapper around https://github.com/ArchiveTeam/warrior-dockerfile/ which is really a wrapper around https://github.com/ArchiveTeam/seesaw-kit/
03:58:29<TheTechRobo>(relevant commit: https://github.com/ArchiveTeam/seesaw-kit/commit/700fadd7d331ec6cbc9266d40ddca8c3fe9701bf)
04:00:29BearFortress quits [Ping timeout: 268 seconds]
04:00:44<hyperreal>TheTechRobo: I see
04:04:48n9nes quits [Ping timeout: 268 seconds]
04:04:51n9nes joins
04:21:50<h2ibot>PaulWise edited Ukraine (+207, add archive.org.ua): https://wiki.archiveteam.org/?diff=61093&oldid=60747
04:22:50<h2ibot>PaulWise edited Ukraine (+0, formatting typo): https://wiki.archiveteam.org/?diff=61094&oldid=61093
04:23:19etnguyen03 quits [Remote host closed the connection]
04:25:20Island joins
04:40:26sepro (sepro) joins
05:00:45nexussfan quits [Quit: Konversation terminated!]
05:01:40<pokechu22>gosc: is there anything about those that wouldn't work in archivebot?
05:04:22michaelblob764 quits [Quit: yoop]
05:07:13michaelblob7641 joins
05:15:45<pabs>arkiver: https://lists.opensuse.org/ uses an unknown JS-based PoW challenge thing
05:22:12<gosc>pokechu22, idk, I have no idea if it could; it works off of m3u8 files
05:22:35<pokechu22>What videos are they?
05:25:21<gosc>all of them come from Aniplex, extracted from here https://www.aniplex.co.jp/js/player/aniplex.videoplayer.js
05:25:50<gosc>most of them are promotional trailers, there's some radio stuff (no visuals), and some misc. videos
05:29:58<h2ibot>Exorcism edited Mailman/2 (-16): https://wiki.archiveteam.org/?diff=61095&oldid=61090
05:36:47<pokechu22>Do the m3u8 URLs/the video segments in them seem to be signed? I've got tooling to parse them from WARCs
05:44:59<gosc>well the urls start with "?fastly_token="
05:45:07<gosc>sample https://players.brightcove.net/4929511769001/SkbLowH7g_default/index.html?videoId=5086049326001
05:45:55Nekroschizofrenetyk quits [Client Quit]
05:55:29<pokechu22>Yeah, those will probably expire, though it's not immediately obvious how long they last
05:56:04<h2ibot>Exorcism edited Mailman/2 (+11, /* Status */): https://wiki.archiveteam.org/?diff=61096&oldid=61095
05:57:03<pokechu22>I could still try running them in AB I guess, but they'll probably expire before I get to them
05:59:21<gosc>ah rip
06:00:03<gosc>I'll probably also try to download the videos using yt-dlp and upload to ia just in case too
06:00:17<gosc>oh and also pokechu22, not all of the links work
06:00:33<gosc>some of the videos aren't publicly viewable I think? they just give an error
06:00:48<gosc>so slightly less than 5,000 videos
06:02:26<pokechu22>oh, no, AB won't work at all since that requires https://edge.api.brightcove.com/playback/v1/accounts/4929511769001/videos/5086049326001 and that requires a policy-key-raw header
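[editor's note] On the expiry question above: Fastly's stock signed-URL scheme commonly encodes the expiry as a hex Unix timestamp at the front of the token ("<hexexpiry>_<signature>"), so if this site uses the standard scheme the lifetime can be read straight out of the fastly_token value. That the site uses the stock format is an assumption, not verified here:

```python
# Attempt to decode the expiry time from a fastly_token query value.
# Assumes the common Fastly format "<hex unix timestamp>_<hmac signature>".
from datetime import datetime, timezone

def fastly_token_expiry(token):
    head = token.split("_", 1)[0]
    try:
        ts = int(head, 16)          # hex-encoded Unix timestamp
    except ValueError:
        return None                 # token doesn't match the assumed format
    return datetime.fromtimestamp(ts, tz=timezone.utc)

# Example with a synthetic token; 0x5f5e1000 == 1600000000:
expiry = fastly_token_expiry("5f5e1000_deadbeef")
```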
06:03:05<h2ibot>Exorcism edited Mailman/2 (-65, /* Not yet archived */): https://wiki.archiveteam.org/?diff=61097&oldid=61096
06:40:16gosc quits [Client Quit]
06:42:17Island quits [Read error: Connection reset by peer]
07:13:35<@arkiver>pabs: thank you thank you
07:14:06<nicolas17>kline: is the source.org rsync data also available via HTTP(S)?
07:15:13<nicolas17>if so, we could use rsync to efficiently know when the tree changes, and then archive the changes alone via HTTP
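[editor's note] nicolas17's idea could be sketched like this, assuming the tree is also mirrored over HTTP at some base URL: run rsync as a dry run with `--itemize-changes`, parse the changed file paths out of its output, and fetch only those over HTTP. The base URL is hypothetical, and the parsing assumes the 11-character rsync 3.x itemize flag string:

```python
# Parse `rsync -rn --itemize-changes src/ dst/` output into changed paths,
# then map them to HTTP URLs for archiving.
from urllib.parse import quote

def changed_paths(itemize_output: str):
    paths = []
    for line in itemize_output.splitlines():
        # rsync 3.x itemize lines look like "YXcstpoguax path" (11 flag chars).
        if len(line) > 12 and line[11] == " ":
            flags, path = line[:11], line[12:]
            if flags[1] == "f":   # regular files only; skip dirs/symlinks
                paths.append(path)
    return paths

def to_urls(paths, base="https://example.org/mirror/"):  # hypothetical base URL
    return [base + quote(p) for p in paths]

sample = ">f+++++++++ docs/new.txt\n>f..t...... src/changed.c\ncd+++++++++ src/\n"
urls = to_urls(changed_paths(sample))
```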
07:15:18Nekroschizofrenetyk joins
07:46:09Webuser579225 joins
07:46:48Webuser579225 quits [Client Quit]
07:50:39<Nekroschizofrenetyk>how does AB deal with Yandex (disk.yandex.ru specifically) captcha challenge?
07:51:19<Nekroschizofrenetyk>I tried saving one url via SPN and it saved a 302 to captcha
07:51:36<nicolas17>AB doesn't deal with any captcha challenges
07:52:17<nicolas17>it can simulate certain user agents which may bypass challenges or blocks altogether vs identifying as "archivebot", but if a regular browser is getting a captcha, AB probably can't deal with it
07:52:17<Nekroschizofrenetyk>is it likely that at least some pipelines have not been marked as bot IPs and could be used?
07:52:43<nicolas17>ah yes, if there's IP blocks then trying another pipeline can work
07:53:05<Nekroschizofrenetyk>I'm not getting it, but a page I tried to save was redirected to a captcha challenge on IA
07:53:09<Nekroschizofrenetyk>and the challenge was saved
07:59:00<Nekroschizofrenetyk>why I'm asking: while browsing through a CDX for *.narod.ru urls, I came across urls like this one: http://narod.ru:80/disk/5776712000/shabchr.djvu.html which, when opened now, redirects to https://disk.yandex.ru/d/d4vSMS7GTyKVi
07:59:32<Nekroschizofrenetyk>the slug is not required, too - http://narod.ru/disk/5776712000/ does just fine
08:01:34<Nekroschizofrenetyk>*I have an idea - I'll check in fart viewer
08:03:09<Nekroschizofrenetyk>https://archive.fart.website/archivebot/viewer/domain/disk.yandex.ru - I don't think that's all that is available
08:04:03<Nekroschizofrenetyk>(which tbh isn't surprising, given the urls I checked at IA have not been saved, come to think of it, ... I'm slow)
08:05:03<Nekroschizofrenetyk>urls to the narod.ru/disk files appear sequential, don't they?
08:08:16<Nekroschizofrenetyk>maybe, then, it would be possible to: 1) archive the urls from the CDX that are still not behind a login wall 2) if some more rules of old narod.ru/disk url building are established, brute-force any still-available links which have not been archived already?
08:09:56<Nekroschizofrenetyk>ad 2) - normally it would be 10 billion possible urls but indeed, from the structure of the urls, it seems like we could narrow it down by a lot
08:11:38<Nekroschizofrenetyk>however, given that SPN returns a captcha challenge (but my normal browsing does not), how likely is it that AB will utterly fail? also - to save the files themselves (not just the sites), do we just need to feed AB the urls and it will save the redirects along with all the content on a page?
08:12:03<Nekroschizofrenetyk>or do we have to first find the redirects and then run them?
08:28:49danwellby quits [Read error: Connection reset by peer]
08:29:00danwellby-1 joins
08:29:22danwellby-1 is now known as danwellby
09:21:46<cruller>The previous AB jobs failed to save the files themselves.
09:22:30<cruller>While a file itself can be retrieved with a simple GET request, obtaining its URL requires a somewhat complex POST request. Therefore, saving the files themselves would likely require at least a custom script?
09:24:16<Nekroschizofrenetyk>oh
09:25:38dada joins
09:26:06BearFortress joins
09:26:20<cruller>Also, if the restriction is only IP blocking, for saving only pages, wouldn't #// be more appropriate?
09:27:34<cruller>Personally, I don't think that's very meaningful, though.
09:35:05<Nekroschizofrenetyk>seems right re: #//
09:36:09<Nekroschizofrenetyk>if it cannot grab the files themselves, then agree, not that terribly meaningful (though still good to have, if the urls are pointed at somewhere online)
09:50:04<klea>cruller: How does the post work, is it different per file or is it a post to the same location? (ie, if saved properly, would it be playable properly on wbm or not)
09:51:04<klea>(re doing it on #//) Depends on how many disk.yandex.ru links we have to archive at once, since #// is able to DDoS sites.
09:51:24<Nekroschizofrenetyk>oh, I wouldn't be afraid of DDoSing yandex
09:51:35<Nekroschizofrenetyk>I guess
09:54:35<Nekroschizofrenetyk>there are 11 digits but they seem to have a structure
09:55:32<Nekroschizofrenetyk>at least for a vast majority of the urls, the ninth and tenth ones are 0s
09:55:45<Nekroschizofrenetyk>11th is either 0 or 1
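[editor's note] The narrowing described above can be sketched as a filter plus a count of the remaining search space. The interpretation here is a guess and may be wrong: IDs treated as 11-digit strings where (1-indexed from the left) the 9th and 10th digits are '0' and the 11th is '0' or '1'. Note the observed sample 5776712000 is only 10 digits, so the real rule needs checking against more of the CDX data:

```python
# Filter candidate narod.ru/disk IDs by the digit pattern observed in the CDX.
def plausible_id(s: str) -> bool:
    if len(s) != 11 or not s.isdigit():
        return False
    # 9th and 10th digits must be '0'; 11th must be '0' or '1' (assumed rule)
    return s[8] == "0" and s[9] == "0" and s[10] in "01"

def candidate_count() -> int:
    # 8 free digits (10**8 combinations) times 2 choices for the last digit
    return 10**8 * 2  # 200 million, far fewer than brute-forcing all digits
```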
10:24:00<cruller>klea: The POST target is always https://disk.yandex.ru/public/api/download-url AFAIK. I have no idea about replayability.
10:24:01<cruller>Downloading a directory also sends a POST to https://disk.yandex.ru/public/api/get-dir-size , but this may not be necessary.
10:27:18<klea>Annoying, that'd mean it won't replay well.
10:28:09<cruller>I also found https://disk.yandex.ru/public/api/album-download-url . An example of an album is https://disk.yandex.ru/a/14XUtEbH3T7vh3/5aa00f2b0c1ee4b63fbaca78
10:28:17<cruller>Further investigation will likely be required.
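[editor's note] A minimal sketch of building (not sending) the download-URL request. The POST endpoint comes from cruller's observation above; the payload shape, including the "hash" field name, is purely hypothetical and would need to be captured from a real browser session:

```python
# Build the request that resolves a Yandex Disk public link to a direct
# download URL. Endpoint is from the chat; the payload is an assumption.
DOWNLOAD_URL_API = "https://disk.yandex.ru/public/api/download-url"

def build_download_request(public_key: str):
    # "hash" as the field name is a guess, not verified against real traffic
    payload = {"hash": public_key}
    return DOWNLOAD_URL_API, payload

endpoint, payload = build_download_request("d4vSMS7GTyKVi")
```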
10:44:09<cruller>https://web.archive.org/cdx/search/cdx?url=https://disk.yandex.ru/d/* There is clearly some kind of regularity here as well.
10:51:31lennier2_ joins
10:51:35Dango3607 (Dango360) joins
10:51:44cyanbox_ joins
10:51:47Starchives__ (Starchives) joins
10:51:50benjins3_ joins
10:52:03khaoohs_ joins
10:52:15ummmSokar joins
10:52:16scotrod2 quits [Quit: Ping timeout (120 seconds)]
10:52:16notSokar quits [Read error: Connection reset by peer]
10:52:16fuzzy80211 quits [Read error: Connection reset by peer]
10:52:29scotrod2 joins
10:52:34flotwig quits [Excess Flood]
10:52:36flotwig joins
10:54:53lennier2 quits [Ping timeout: 268 seconds]
10:54:54Dango360 quits [Ping timeout: 268 seconds]
10:54:54Starchives_ quits [Ping timeout: 268 seconds]
10:54:54Dango3607 is now known as Dango360
10:55:30cyanbox quits [Ping timeout: 268 seconds]
10:55:31bladem quits [Ping timeout: 268 seconds]
10:55:31benjins3 quits [Ping timeout: 268 seconds]
10:55:31khaoohs quits [Ping timeout: 268 seconds]
10:56:43bladem (bladem) joins
11:00:01Bleo1826007227196234552220110 quits [Quit: The Lounge - https://thelounge.chat]
11:02:45Bleo1826007227196234552220110 joins
11:25:56Nekroschizofrenetyk quits [Quit: Ooops, wrong browser tab.]
12:00:30etnguyen03 (etnguyen03) joins
12:07:57bigfren joins
12:08:20<bigfren>Good day!