00:00:06ducky_ (ducky) joins
00:01:45ducky quits [Ping timeout: 268 seconds]
00:01:45ducky_ is now known as ducky
00:03:35Shard6 (Shard) joins
00:04:55Shard quits [Ping timeout: 268 seconds]
00:04:56Shard6 is now known as Shard
00:13:29retrograde quits [Remote host closed the connection]
00:13:50retrograde (retrograde) joins
00:27:45Arcorann (Arcorann) joins
00:28:54etnguyen03 (etnguyen03) joins
00:29:28Webuser022567 quits [Client Quit]
00:48:17<pabs>nothing mentioned here https://wiki.archiveteam.org/index.php/Bluesky /cc Shard
00:56:20hyperreal (hyperreal) joins
01:35:29<h2ibot>Cooljeanius edited Archiwum Allegro (+16, /* Other archives */ use URL template): https://wiki.archiveteam.org/?diff=61091&oldid=60975
01:38:14polypept1 (polypeptide) joins
01:39:03etnguyen03 quits [Client Quit]
01:41:14polypeptide quits [Ping timeout: 260 seconds]
01:48:58cyanbox quits [Read error: Connection reset by peer]
02:10:11cyanbox joins
02:18:14<kline><klea> I wonder, is there any interest in archiving rsync:// urls?
02:18:48<kline>this would be good for source.org where i have a big filesystem copy, but it would need to support deltas as well as it's a constantly changing tree
02:19:37<@JAA>But are there large files where only small portions change, or do they just add/remove files?
02:19:45<@JAA>You'd only need delta for the former.
02:20:02<@JAA>Well, 'need'
02:21:10<kline>I'm not sure if there are changes, but there seem to be non-standard files (by which I mean symlinks etc.) when I saw the logs scrolling by. I'm not sure if these symlinks change
02:21:44etnguyen03 (etnguyen03) joins
02:22:13<@JAA>I don't think that'd be an issue (although figuring out how to represent them in WARC would be interesting).
02:22:22<kline>i think it would be reasonable to expect that this is something that an rsync archive _could_ cover in any case, or else you'd have limitations on what rsync is specifically good at
02:23:17<kline>i wonder if it could be modeled like http patch/range requests
02:23:35<@JAA>Well, it's also about the storage though. You'd need to store the binary delta and then also provide tooling that can combine it with the previous record(s) to produce the correct result again. Which would be very messy.
02:24:27<@JAA>If the vast majority of things being archived this way is of the 'public files that rarely change after first being added' category, that delta mechanism is mostly just unnecessary complexity.
02:24:37<kline>fair enough
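(A dry run could answer JAA's question about whether files change in place or are only added/removed. A minimal sketch, assuming rsync's standard `--itemize-changes` output format; the module URL is a placeholder, not a real source.org endpoint:)

```python
import subprocess

def classify_changes(itemized_lines):
    """Split `rsync --itemize-changes` output into newly added files vs.
    files changed in place. New files show '>f+++++++++'; changed ones
    show flags like '>f.st......' (size/time differ)."""
    new, changed = [], []
    for line in itemized_lines:
        if not line.startswith(">f"):
            continue  # ignore directories, symlinks, deletions, etc.
        flags, path = line.split(None, 1)
        (new if "+" in flags else changed).append(path)
    return new, changed

def dry_run(module_url, dest="."):
    """Itemized dry run against an rsync module; no data is transferred.
    'module_url' (e.g. 'rsync://example.org/pub/') is hypothetical."""
    out = subprocess.run(
        ["rsync", "-rtn", "--itemize-changes", module_url, dest],
        capture_output=True, text=True, check=True)
    return out.stdout.splitlines()
```

(If `changed` stays empty across runs while `new` grows, plain re-crawls cover the tree and no delta mechanism would be needed.)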
02:35:44SootBector quits [Remote host closed the connection]
02:36:11rohvani quits [Quit: Ping timeout (120 seconds)]
02:36:26rohvani joins
02:36:36<h2ibot>PaulWise edited Obstacles (+54, add trap_bots poisoner): https://wiki.archiveteam.org/?diff=61092&oldid=61084
02:36:47<pabs>arkiver: another poisoner https://maurycyz.com/projects/trap_bots/ - known instances are under https://maurycyz.com/babble/ and https://v6.maurycyz.com/babble/
02:36:55SootBector (SootBector) joins
02:42:39<triplecamera|m>"<@JAA> We have a new wiki user who chose the username 'User'..." <- It might be a good idea to ask him to pick a new username on his talk page
02:43:25<triplecamera|m>Usernames like "User" are typically against the username policy on large wikis
02:45:47grill quits [Ping timeout: 268 seconds]
02:46:10grill (grill) joins
03:02:56SootBector quits [Remote host closed the connection]
03:07:27sepro quits [Ping timeout: 268 seconds]
03:09:21SootBector (SootBector) joins
03:15:55<hyperreal>Hi all. I'm kind of fuzzy about how all this works. I'd love to help out. What is an rsync target exactly, and how would I help the effort by setting one up on my infra? https://wiki.archiveteam.org/index.php/Dev/Staging
03:17:16<hyperreal>I know what rsync is, but this seems like it's part of some kind of pipeline. I'm not sure what the pipeline consists of.
03:24:01Mateon1 quits [Ping timeout: 268 seconds]
03:24:39Mateon1 joins
03:34:11<pabs>hyperreal: the pipeline is website -> warrior -> rsync target -> archive.org. usually targets are only run by core AT folks. anyone can contribute a warrior though
03:34:12<pabs>https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
03:34:14<TheTechRobo>rsync target is where downloaded data is uploaded for packing into big WARCs to the Internet Archive
03:34:22<pabs>also see https://wiki.archiveteam.org/index.php/Archiveteam:Acronyms
03:34:34<TheTechRobo>haha, pabs beat me to it :P
03:34:52<hyperreal>Ah okay, thanks! Yeah I'm already running a warrior
03:36:54<hyperreal>On the Main_Page of the wiki, under Warrior-based projects, it shows a Telegram "for Warriors that don't have the multiple default projects update". What is the multiple default projects update?
03:39:52<TheTechRobo>A few weeks ago, there was an update to the Warrior code that allows a weighted randomness for the default project. Before the update, only one "Archiveteam's Choice" could exist; now multiple can exist with different percentages of Warriors running it
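(The weighted pick TheTechRobo describes can be sketched in a few lines; the names and percentage mapping here are illustrative, not the actual seesaw-kit code:)

```python
import random

def pick_default_project(percentages):
    """Choose a default project with probability proportional to its
    configured percentage, e.g. {'projA': 70, 'projB': 30}."""
    names = list(percentages)
    return random.choices(names, weights=[percentages[n] for n in names])[0]
```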
03:40:18chrismeller3 quits [Quit: chrismeller3]
03:40:55chrismeller3 (chrismeller) joins
03:43:46Nekroschizofrenetyk joins
03:43:56gosc joins
03:44:41<gosc>got a list of 5200 Brightcove Player urls, which channel do I send it to
03:45:37<hyperreal>TheTechRobo: I see. Would that be this? https://github.com/ArchiveTeam/warrior4-vm
03:58:07<TheTechRobo>hyperreal: -ish, the VM is really a wrapper around https://github.com/ArchiveTeam/warrior-dockerfile/ which is really a wrapper around https://github.com/ArchiveTeam/seesaw-kit/
03:58:29<TheTechRobo>(relevant commit: https://github.com/ArchiveTeam/seesaw-kit/commit/700fadd7d331ec6cbc9266d40ddca8c3fe9701bf)
04:00:29BearFortress quits [Ping timeout: 268 seconds]
04:00:44<hyperreal>TheTechRobo: I see
04:04:48n9nes quits [Ping timeout: 268 seconds]
04:04:51n9nes joins
04:21:50<h2ibot>PaulWise edited Ukraine (+207, add archive.org.ua): https://wiki.archiveteam.org/?diff=61093&oldid=60747
04:22:50<h2ibot>PaulWise edited Ukraine (+0, formatting typo): https://wiki.archiveteam.org/?diff=61094&oldid=61093
04:23:19etnguyen03 quits [Remote host closed the connection]
04:25:20Island joins
04:40:26sepro (sepro) joins
05:00:45nexussfan quits [Quit: Konversation terminated!]
05:01:40<pokechu22>gosc: is there anything about those that wouldn't work in archivebot?
05:04:22michaelblob764 quits [Quit: yoop]
05:07:13michaelblob7641 joins
05:15:45<pabs>arkiver: https://lists.opensuse.org/ uses an unknown JS-based PoW challenge thing
05:22:12<gosc>pokechu22, idk, I have no idea if it could; it works off of m3u8 files
05:22:35<pokechu22>What videos are they?
05:25:21<gosc>all of them come from Aniplex, extracted from here https://www.aniplex.co.jp/js/player/aniplex.videoplayer.js
05:25:50<gosc>most of them are promotional trailers, there's some radio stuff (no visuals), and some misc. videos
05:29:58<h2ibot>Exorcism edited Mailman/2 (-16): https://wiki.archiveteam.org/?diff=61095&oldid=61090
05:36:47<pokechu22>Do the m3u8 URLs/the video segments in them seem to be signed? I've got tooling to parse them from WARCs
05:44:59<gosc>well the urls start with "?fastly_token="
05:45:07<gosc>sample https://players.brightcove.net/4929511769001/SkbLowH7g_default/index.html?videoId=5086049326001
05:45:55Nekroschizofrenetyk quits [Client Quit]
05:55:29<pokechu22>Yeah, those will probably expire, though it's not immediately obvious how long they last
05:56:04<h2ibot>Exorcism edited Mailman/2 (+11, /* Status */): https://wiki.archiveteam.org/?diff=61096&oldid=61095
05:57:03<pokechu22>I could still try running them in AB I guess, but they'll probably expire before I get to them
05:59:21<gosc>ah rip
06:00:03<gosc>I'll probably also try to download the videos using yt-dlp and upload to ia just in case too
06:00:17<gosc>oh and also pokechu22, not all of the links work
06:00:33<gosc>some of the videos aren't publicly viewable I think? they just give an error
06:00:48<gosc>so slightly less than 5,000 videos
06:02:26<pokechu22>oh, no, AB won't work at all since that requires https://edge.api.brightcove.com/playback/v1/accounts/4929511769001/videos/5086049326001 and that requires a policy-key-raw header
06:03:05<h2ibot>Exorcism edited Mailman/2 (-65, /* Not yet archived */): https://wiki.archiveteam.org/?diff=61097&oldid=61096
06:40:16gosc quits [Client Quit]
06:42:17Island quits [Read error: Connection reset by peer]
07:13:35<@arkiver>pabs: thank you thank you
07:14:06<nicolas17>kline: is the source.org rsync data also available via HTTP(S)?
07:15:13<nicolas17>if so, we could use rsync to efficiently know when the tree changes, and then archive the changes alone via HTTP
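(The rsync-to-detect, HTTP-to-archive idea could look roughly like this; the HTTPS base URL and the path mapping are assumptions, since it isn't yet known how, or whether, source.org exposes the same tree over HTTP:)

```python
from urllib.parse import quote

def changed_paths_to_urls(paths, http_base="https://example.org/pub/"):
    """Map rsync-relative paths (e.g. from an itemized dry run) to the
    corresponding HTTPS URLs for archiving. 'http_base' is a placeholder."""
    return [http_base + quote(p) for p in paths]
```

(The resulting URL list could then go through the usual HTTP tooling, with no rsync-specific WARC records needed.)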
07:15:18Nekroschizofrenetyk joins
07:46:09Webuser579225 joins
07:46:48Webuser579225 quits [Client Quit]
07:50:39<Nekroschizofrenetyk>how does AB deal with Yandex (disk.yandex.ru specifically) captcha challenge?
07:51:19<Nekroschizofrenetyk>I tried saving one url via SPN and it saved a 302 to captcha
07:51:36<nicolas17>AB doesn't deal with any captcha challenges
07:52:17<nicolas17>it can present certain user agents, which may bypass challenges or blocks that identifying as "archivebot" would trigger, but if a regular browser is getting a captcha, AB probably can't deal with it
07:52:17<Nekroschizofrenetyk>is it likely that at least some pipelines have not been marked as bot IPs and could be used?
07:52:43<nicolas17>ah yes, if there's IP blocks then trying another pipeline can work
07:53:05<Nekroschizofrenetyk>I'm not getting it, but a page I tried to save was redirected to a captcha challenge on IA
07:53:09<Nekroschizofrenetyk>and the challenge was saved
07:59:00<Nekroschizofrenetyk>why I'm asking: while browsing through a CDX for *.narod.ru urls, I came across urls like this one: http://narod.ru:80/disk/5776712000/shabchr.djvu.html which, when opened now, redirects to https://disk.yandex.ru/d/d4vSMS7GTyKVi
07:59:32<Nekroschizofrenetyk>the slug is not required, too - http://narod.ru/disk/5776712000/ does just fine
08:01:34<Nekroschizofrenetyk>*I have an idea - I'll check in fart viewer
08:03:09<Nekroschizofrenetyk>https://archive.fart.website/archivebot/viewer/domain/disk.yandex.ru - I don't think that's all that is available
08:04:03<Nekroschizofrenetyk>(which tbh isn't surprising, given that the urls I checked at IA have not been saved... come to think of it, I'm slow)
08:05:03<Nekroschizofrenetyk>urls to the narod.ru/disk files appear sequential, don't they?
08:08:16<Nekroschizofrenetyk>maybe, then, it would be possible to: 1) archive the urls from the CDX that are still not behind a login wall 2) if some more rules of old narod.ru/disk url building are established, brute-force find any still-available link that has not been archived already?
08:09:56<Nekroschizofrenetyk>ad 2) - normally it would be 10 billion possible urls but indeed, from the structure of the urls, it seems like we could narrow it down by a lot
08:11:38<Nekroschizofrenetyk>however, given that SPN returns a captcha challenge (but my normal browsing does not), how likely is it that AB will utterly fail? also - to save the files themselves (not just the sites), do we just need to feed AB the urls and it will save the redirects along with all the content on a page?
08:12:03<Nekroschizofrenetyk>or do we have to first find the redirects and then run them?
08:28:49danwellby quits [Read error: Connection reset by peer]
08:29:00danwellby-1 joins
08:29:22danwellby-1 is now known as danwellby
09:21:46<cruller>The previous AB jobs failed to save the files themselves.
09:22:30<cruller>While a file itself can be retrieved with a simple GET request, obtaining its URL requires a somewhat complex POST request. Therefore, saving the files themselves would likely require at least a custom script?
09:24:16<Nekroschizofrenetyk>oh
09:25:38dada joins
09:26:06BearFortress joins
09:26:20<cruller>Also, if the restriction is only IP blocking, wouldn't #// be more appropriate for saving just the pages?
09:27:34<cruller>Personally, I don't think that's very meaningful, though.
09:35:05<Nekroschizofrenetyk>seems right re: #//
09:36:09<Nekroschizofrenetyk>if it cannot grab the files themselves, then agree, not that terribly meaningful (though still good to have, if the urls are pointed at somewhere online)
09:50:04<klea>cruller: How does the POST work: is it different per file, or is it a POST to the same location? (i.e., if saved properly, would it be playable properly on WBM or not)
09:51:04<klea>(re doing it on #//) Depends on how many disk.yandex.ru links we have to archive at once, since #// is able to DDoS sites.
09:51:24<Nekroschizofrenetyk>oh, I wouldn't be afraid of DDoSing yandex
09:51:35<Nekroschizofrenetyk>I guess
09:54:35<Nekroschizofrenetyk>there are 11 digits but they seem to have a structure
09:55:32<Nekroschizofrenetyk>at least for a vast majority of the urls, the ninth and tenth ones are 0s
09:55:45<Nekroschizofrenetyk>11th is either 0 or 1
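(Under that observed pattern — 11 digits with the 9th and 10th fixed to 0 and the 11th being 0 or 1 — the search space drops from 10^11 to 2×10^8 candidates. A sketch of the enumeration; the pattern itself is a hypothesis from a sample of urls, not a confirmed rule:)

```python
def candidate_disk_urls(prefix_digits=8):
    """Yield narod.ru/disk URLs for every ID matching the shape observed
    in the CDX sample: <8 free digits> + '00' + ('0' or '1').
    Hypothetical pattern, to be validated against more known-good IDs."""
    for head in range(10 ** prefix_digits):
        for tail in "01":
            yield f"http://narod.ru/disk/{head:0{prefix_digits}d}00{tail}/"
```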
10:24:00<cruller>klea: The POST target is always https://disk.yandex.ru/public/api/download-url AFAIK. I have no idea about replayability.
10:24:01<cruller>Downloading a directory also sends a POST to https://disk.yandex.ru/public/api/get-dir-size , but this may not be necessary.
10:27:18<klea>Annoying, that'd mean it won't replay well.
10:28:09<cruller>I also found https://disk.yandex.ru/public/api/album-download-url . An example of an album is https://disk.yandex.ru/a/14XUtEbH3T7vh3/5aa00f2b0c1ee4b63fbaca78
10:28:17<cruller>Further investigation will likely be required.
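(For reference, a sketch of what that POST might look like, built with urllib but not sent. The endpoint is the one cruller names above; the 'hash' field name and JSON content type are guesses that would need to be verified against real browser traffic:)

```python
import json
from urllib.request import Request

DOWNLOAD_URL_API = "https://disk.yandex.ru/public/api/download-url"

def build_download_url_request(public_key):
    """Construct (but do not send) the POST for a public file's download
    URL. The request body shape here is hypothetical."""
    body = json.dumps({"hash": public_key}).encode()
    return Request(DOWNLOAD_URL_API, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")
```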
10:44:09<cruller>https://web.archive.org/cdx/search/cdx?url=https://disk.yandex.ru/d/* There is clearly some kind of regularity here as well.
10:51:31lennier2_ joins
10:51:35Dango3607 (Dango360) joins
10:51:44cyanbox_ joins
10:51:47Starchives__ (Starchives) joins
10:51:50benjins3_ joins
10:52:03khaoohs_ joins
10:52:15ummmSokar joins
10:52:16scotrod2 quits [Quit: Ping timeout (120 seconds)]
10:52:16notSokar quits [Read error: Connection reset by peer]
10:52:16fuzzy80211 quits [Read error: Connection reset by peer]
10:52:29scotrod2 joins
10:52:34flotwig quits [Excess Flood]
10:52:36flotwig joins
10:54:53lennier2 quits [Ping timeout: 268 seconds]
10:54:54Dango360 quits [Ping timeout: 268 seconds]
10:54:54Starchives_ quits [Ping timeout: 268 seconds]
10:54:54Dango3607 is now known as Dango360
10:55:30cyanbox quits [Ping timeout: 268 seconds]
10:55:31bladem quits [Ping timeout: 268 seconds]
10:55:31benjins3 quits [Ping timeout: 268 seconds]
10:55:31khaoohs quits [Ping timeout: 268 seconds]
10:56:43bladem (bladem) joins
11:00:01Bleo1826007227196234552220110 quits [Quit: The Lounge - https://thelounge.chat]
11:02:45Bleo1826007227196234552220110 joins
11:25:56Nekroschizofrenetyk quits [Quit: Ooops, wrong browser tab.]
12:00:30etnguyen03 (etnguyen03) joins
12:07:57bigfren joins
12:08:20<bigfren>Good day!
12:14:19etnguyen03 quits [Client Quit]
12:29:27Wohlstand (Wohlstand) joins
12:58:50Nekroschizofrenetyk joins
13:04:03driib975 quits [Quit: The Lounge - https://thelounge.chat]
13:10:44driib975 (driib) joins
13:11:46Nekroschizofrenetyk quits [Client Quit]
13:11:57Nekroschizofrenetyk joins
13:44:38<katia>Good day!
13:45:41<Nekroschizofrenetyk>Hi!
13:52:27<katia>Hi!
13:52:35<klea>Hi!
14:02:07<bigfren>klea: please take a look at my new pull request, I really hope you like it
14:02:18<bigfren>https://github.com/ArchiveTeam/grab-site/pull/251 - Resume feature, better regex, fix for a buggy no-offsite-links, grabber.sh with resume & openvpn support
14:02:41nexussfan (nexussfan) joins
14:03:34<bigfren>most importantly, it has 3) a sophisticated-yet-easily-modifiable grabber.sh example script that supports: this wonderful resume feature for easily continuing existing incomplete crawls, auto-generating and using the above-mentioned regex when --no-offsite-links is used, and up to three OpenVPN config+auth pairs with auto-reconnect on any connection problems
14:04:27<klea>I think AT relies on re2, also, vpn????
14:07:04<klea>Also, bigfren if it's development related → #archiveteam-dev
14:09:23<bigfren>klea: I looked through the sources of grab-site and this seemed the only place that required a re2
14:10:10<bigfren>a VPN, used in a grabber script, lets you reliably get a website even if it blocks your IP or is blocked in your country
14:12:21<klea>Oh ignore that, I thought re2 was used in AB and you were contributing to wpull2.
14:16:43<bigfren>using this, I have restarted today's huge crawl and so far it seems to be going strong; 12 GB dumped already through the VPN
14:28:10Arcorann quits [Ping timeout: 268 seconds]
14:43:35TheEnbyperor_ quits [Ping timeout: 268 seconds]
14:43:40TheEnbyperor quits [Ping timeout: 268 seconds]
15:06:30TheEnbyperor joins
15:09:09atphoenix_ (atphoenix) joins
15:10:43atphoenix__ quits [Ping timeout: 268 seconds]
15:17:42TheEnbyperor_ (TheEnbyperor) joins
15:26:04bigfren quits [Quit: Ooops, wrong browser tab.]
15:37:13fuzzy80211 (fuzzy80211) joins
15:46:35redoste quits [Remote host closed the connection]
15:47:51redoste (redoste) joins
15:56:12<justauser>klea, pabs, Nekroschizofrenetyk: archive.org.ua was already fully AB'd: https://archive.fart.website/archivebot/viewer/domain/archive.org.ua
15:56:20<klea>Ack.
15:58:52etnguyen03 (etnguyen03) joins
16:01:55<Nekroschizofrenetyk>nice :)
16:10:17Nekroschizofrenetyk quits [Client Quit]
16:11:49Nekroschizofrenetyk joins
16:19:47pabs quits [Ping timeout: 268 seconds]
16:24:55pabs (pabs) joins
16:29:31dabs joins
16:30:27dabs quits [Remote host closed the connection]
16:30:39dabs joins
17:04:19etnguyen03 quits [Client Quit]
17:15:22Webuser291943 joins
17:25:14etnguyen03 (etnguyen03) joins
17:27:52Nekroschizofrenetyk quits [Client Quit]
17:47:00Cornelius705 quits [Quit: Cornelius705]
17:47:50Cornelius705 (Cornelius) joins
18:03:26Cornelius7052 (Cornelius) joins
18:04:00Cornelius705 quits [Ping timeout: 268 seconds]
18:04:00Cornelius7052 is now known as Cornelius705
18:08:56Dango360 quits [Ping timeout: 268 seconds]
18:38:09Dango360 (Dango360) joins
18:42:02<justauser>https://conf.tube/ 's CDN is back.
18:42:49<klea>Time to AB it?
18:44:05Dango360 quits [Ping timeout: 268 seconds]
18:44:28Dango360 (Dango360) joins
18:44:39<justauser>https://transfer.archivete.am/fZ6Xs/conf.tube_sb-conf-tube.b-cdn.net_videos.txt is the list I got when scraping metadata API, but it probably shouldn't be !ao'd directly.
18:44:40<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/fZ6Xs/conf.tube_sb-conf-tube.b-cdn.net_videos.txt
18:45:00<justauser>Because it has more than one file per video, and that would be wasteful.
18:45:03<klea>I've read b-cdn.net a few times when setting up ublock.
18:45:27<justauser>Bunny CDN - pretty good, but costs accordingly and thus unpopular.
18:46:01<justauser>Cloudflare does not do much worse for a fraction of the price, or even for free.
18:46:11<klea>Ah.
18:46:44<klea>I wonder which resolution we should try to archive.
18:56:56Yakov1 (Yakov) joins
18:58:53Yakov quits [Ping timeout: 268 seconds]
18:58:53Yakov1 is now known as Yakov
19:06:17owl quits [Ping timeout: 268 seconds]
19:06:47owl joins
19:07:22wotd joins
19:07:23wotd quits [Remote host closed the connection]
19:10:02etnguyen03 quits [Client Quit]
19:12:31<@JAA>Size estimate?
19:13:51<klea>399 1080p videos, or 568 720p.
19:16:15<pokechu22>How long? Are these hour-long talks or short summaries?
19:17:54<klea>Between ~5 minutes, 50 minutes, 4 hours, and 8 hours.
20:18:42Goofybally quits [Killed (NickServ (GHOST command used by Goofybally9!~Goofyball@167.100.239.73))]
20:18:47Goofybally joins
20:19:08Island joins
20:43:08BitByBit41 (BitByBit) joins
20:44:57BitByBit4 quits [Ping timeout: 268 seconds]
20:44:57BitByBit41 is now known as BitByBit4
20:53:29<@JAA>(Size = total data size)
20:55:11<klea>Wait a sec.
20:55:25<@JAA>Doesn't need to be super accurate.
20:55:31<@JAA>But a few hundred videos doesn't sound huge anyway.
20:56:27<klea>curl -s list | xargs -I: curl -Is : | grep content-length
21:05:19<klea>Gives https://transfer.archivete.am/RiIkx/t:fZ6Xs_cl_filesizes.txt.zst
21:05:55<klea>So ~398.236329921 GB
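(The same total can be reproduced from the header dump without re-fetching anything; a small sketch that parses 'content-length:' lines like the ones klea's curl pipeline printed:)

```python
def total_gb(header_lines):
    """Sum 'content-length: N' header lines, case-insensitively, and
    return the total in decimal gigabytes."""
    total = sum(int(line.split(":", 1)[1])
                for line in header_lines
                if line.lower().startswith("content-length:"))
    return total / 1e9
```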
21:06:27Nekroschizofrenetyk joins
21:11:39Nekroschizofrenetyk quits [Client Quit]
21:25:54Wohlstand quits [Quit: Wohlstand]
21:40:08Cupping1285 quits [Quit: bye]
21:52:01ducky_ (ducky) joins
21:53:24ducky quits [Ping timeout: 268 seconds]
21:53:24ducky_ is now known as ducky
22:11:38ericgallager quits [Quit: This computer has gone to sleep]
22:13:14iseaup (iseaup) joins
22:47:53Webuser560770 joins
22:48:00Webuser560770 quits [Client Quit]
23:05:17etnguyen03 (etnguyen03) joins
23:17:21igloo22225 quits [Ping timeout: 268 seconds]
23:27:25igloo22225 (igloo22225) joins
23:31:54Cupping1285 joins