| 00:00:06 | | ducky_ (ducky) joins |
| 00:01:45 | | ducky quits [Ping timeout: 268 seconds] |
| 00:01:45 | | ducky_ is now known as ducky |
| 00:03:35 | | Shard6 (Shard) joins |
| 00:04:55 | | Shard quits [Ping timeout: 268 seconds] |
| 00:04:56 | | Shard6 is now known as Shard |
| 00:13:29 | | retrograde quits [Remote host closed the connection] |
| 00:13:50 | | retrograde (retrograde) joins |
| 00:27:45 | | Arcorann (Arcorann) joins |
| 00:28:54 | | etnguyen03 (etnguyen03) joins |
| 00:29:28 | | Webuser022567 quits [Client Quit] |
| 00:48:17 | <pabs> | nothing mentioned here https://wiki.archiveteam.org/index.php/Bluesky /cc Shard |
| 00:56:20 | | hyperreal (hyperreal) joins |
| 01:35:29 | <h2ibot> | Cooljeanius edited Archiwum Allegro (+16, /* Other archives */ use URL template): https://wiki.archiveteam.org/?diff=61091&oldid=60975 |
| 01:38:14 | | polypept1 (polypeptide) joins |
| 01:39:03 | | etnguyen03 quits [Client Quit] |
| 01:41:14 | | polypeptide quits [Ping timeout: 260 seconds] |
| 01:48:58 | | cyanbox quits [Read error: Connection reset by peer] |
| 02:10:11 | | cyanbox joins |
| 02:18:14 | <kline> | <klea> I wonder, is there any interest in archiving rsync:// urls? |
| 02:18:48 | <kline> | this would be good for source.org where i have a big filesystem copy, but it would need to support deltas as well as it's a constantly changing tree |
| 02:19:37 | <@JAA> | But are there large files where only small portions change, or do they just add/remove files? |
| 02:19:45 | <@JAA> | You'd only need delta for the former. |
| 02:20:02 | <@JAA> | Well, 'need' |
| 02:21:10 | <kline> | im not sure if there are changes, but there seem to be non-standard files (by which i mean symlinks etc) when I saw the logs scrolling by. I'm not sure if these symlinks change |
| 02:21:44 | | etnguyen03 (etnguyen03) joins |
| 02:22:13 | <@JAA> | I don't think that'd be an issue (although figuring out how to represent them in WARC would be interesting). |
| 02:22:22 | <kline> | i think it would be reasonable to expect that this is something that an rsync archive _could_ cover in any case, or else you'd have limitations on what rsync is specifically good at |
| 02:23:17 | <kline> | i wonder if it could be modeled like http patch/range requests |
| 02:23:35 | <@JAA> | Well, it's also about the storage though. You'd need to store the binary delta and then also provide tooling that can combine it with the previous record(s) to produce the correct result again. Which would be very messy. |
| 02:24:27 | <@JAA> | If the vast majority of things being archived this way is of the 'public files that rarely change after first being added' category, that delta mechanism is mostly just unnecessary complexity. |
| 02:24:37 | <kline> | fair enough |
| 02:35:44 | | SootBector quits [Remote host closed the connection] |
| 02:36:11 | | rohvani quits [Quit: Ping timeout (120 seconds)] |
| 02:36:26 | | rohvani joins |
| 02:36:36 | <h2ibot> | PaulWise edited Obstacles (+54, add trap_bots poisoner): https://wiki.archiveteam.org/?diff=61092&oldid=61084 |
| 02:36:47 | <pabs> | arkiver: another poisoner https://maurycyz.com/projects/trap_bots/ - known instances are under https://maurycyz.com/babble/ and https://v6.maurycyz.com/babble/ |
| 02:36:55 | | SootBector (SootBector) joins |
| 02:42:39 | <triplecamera|m> | "<@JAA> We have a new wiki user who chose the username 'User'..." <- It might be a good idea to ask him to pick a new username on his talk page |
| 02:43:25 | <triplecamera|m> | Usernames like "User" are typically against the username policy on large wikis |
| 02:45:47 | | grill quits [Ping timeout: 268 seconds] |
| 02:46:10 | | grill (grill) joins |
| 03:02:56 | | SootBector quits [Remote host closed the connection] |
| 03:07:27 | | sepro quits [Ping timeout: 268 seconds] |
| 03:09:21 | | SootBector (SootBector) joins |
| 03:15:55 | <hyperreal> | Hi all. I'm kind of fuzzy about how all this works. I'd love to help out. What is an rsync target exactly, and how would I help the effort by setting one up on my infra? https://wiki.archiveteam.org/index.php/Dev/Staging |
| 03:17:16 | <hyperreal> | I know what rsync is, but this seems like it's part of some kind of pipeline. I'm not sure what the pipeline consists of. |
| 03:24:01 | | Mateon1 quits [Ping timeout: 268 seconds] |
| 03:24:39 | | Mateon1 joins |
| 03:34:11 | <pabs> | hyperreal: the pipeline is website -> warrior -> rsync target -> archive.org. usually targets are only run by core AT folks. anyone can contribute a warrior though |
| 03:34:12 | <pabs> | https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior |
| 03:34:14 | <TheTechRobo> | rsync target is where downloaded data is uploaded for packing into big WARCs to the Internet Archive |
| 03:34:22 | <pabs> | also see https://wiki.archiveteam.org/index.php/Archiveteam:Acronyms |
| 03:34:34 | <TheTechRobo> | haha, pabs beat me to it :P |
| 03:34:52 | <hyperreal> | Ah okay, thanks! Yeah I'm already running a warrior |
| 03:36:54 | <hyperreal> | On the Main_Page of the wiki, under Warrior-based projects, it shows a Telegram "for Warriors that don't have the multiple default projects update". What is the multiple default projects update? |
| 03:39:52 | <TheTechRobo> | A few weeks ago, there was an update to the Warrior code that allows a weighted randomness for the default project. Before the update, only one "Archiveteam's Choice" could exist; now multiple can exist with different percentages of Warriors running it |
| 03:40:18 | | chrismeller3 quits [Quit: chrismeller3] |
| 03:40:55 | | chrismeller3 (chrismeller) joins |
| 03:43:46 | | Nekroschizofrenetyk joins |
| 03:43:56 | | gosc joins |
| 03:44:41 | <gosc> | got a list of 5200 Brightcove Player urls, which channel do I send it to |
| 03:45:37 | <hyperreal> | TheTechRobo: I see. Would that be this? https://github.com/ArchiveTeam/warrior4-vm |
| 03:58:07 | <TheTechRobo> | hyperreal: -ish, the VM is really a wrapper around https://github.com/ArchiveTeam/warrior-dockerfile/ which is really a wrapper around https://github.com/ArchiveTeam/seesaw-kit/ |
| 03:58:29 | <TheTechRobo> | (relevant commit: https://github.com/ArchiveTeam/seesaw-kit/commit/700fadd7d331ec6cbc9266d40ddca8c3fe9701bf) |
| 04:00:29 | | BearFortress quits [Ping timeout: 268 seconds] |
| 04:00:44 | <hyperreal> | TheTechRobo: I see |
| 04:04:48 | | n9nes quits [Ping timeout: 268 seconds] |
| 04:04:51 | | n9nes joins |
| 04:21:50 | <h2ibot> | PaulWise edited Ukraine (+207, add archive.org.ua): https://wiki.archiveteam.org/?diff=61093&oldid=60747 |
| 04:22:50 | <h2ibot> | PaulWise edited Ukraine (+0, formatting typo): https://wiki.archiveteam.org/?diff=61094&oldid=61093 |
| 04:23:19 | | etnguyen03 quits [Remote host closed the connection] |
| 04:25:20 | | Island joins |
| 04:40:26 | | sepro (sepro) joins |
| 05:00:45 | | nexussfan quits [Quit: Konversation terminated!] |
| 05:01:40 | <pokechu22> | gosc: is there anything about those that wouldn't work in archivebot? |
| 05:04:22 | | michaelblob764 quits [Quit: yoop] |
| 05:07:13 | | michaelblob7641 joins |
| 05:15:45 | <pabs> | arkiver: https://lists.opensuse.org/ uses an unknown JS-based PoW challenge thing |
| 05:22:12 | <gosc> | pokechu22, idk, I have no idea if it could; it works off of m3u8 files |
| 05:22:35 | <pokechu22> | What videos are they? |
| 05:25:21 | <gosc> | all of them come from Aniplex, extracted from here https://www.aniplex.co.jp/js/player/aniplex.videoplayer.js |
| 05:25:50 | <gosc> | most of them are promotional trailers, there's some radio stuff (no visuals), and some misc. videos |
| 05:29:58 | <h2ibot> | Exorcism edited Mailman/2 (-16): https://wiki.archiveteam.org/?diff=61095&oldid=61090 |
| 05:36:47 | <pokechu22> | Do the m3u8 URLs/the video segments in them seem to be signed? I've got tooling to parse them from WARCs |
| 05:44:59 | <gosc> | well the urls start with "?fastly_token=" |
| 05:45:07 | <gosc> | sample https://players.brightcove.net/4929511769001/SkbLowH7g_default/index.html?videoId=5086049326001 |
| 05:45:55 | | Nekroschizofrenetyk quits [Client Quit] |
| 05:55:29 | <pokechu22> | Yeah, those will probably expire, though it's not immediately obvious how long they last |
| 05:56:04 | <h2ibot> | Exorcism edited Mailman/2 (+11, /* Status */): https://wiki.archiveteam.org/?diff=61096&oldid=61095 |
| 05:57:03 | <pokechu22> | I could still try running them in AB I guess, but they'll probably expire before I get to them |
| 05:59:21 | <gosc> | ah rip |
| 06:00:03 | <gosc> | I'll probably also try to download the videos using yt-dlp and upload to ia just in case too |
| 06:00:17 | <gosc> | oh and also pokechu22, not all of the links work |
| 06:00:33 | <gosc> | some of the videos aren't publicly viewable I think? they just give an error |
| 06:00:48 | <gosc> | so slightly less than 5,000 videos |
| 06:02:26 | <pokechu22> | oh, no, AB won't work at all since that requires https://edge.api.brightcove.com/playback/v1/accounts/4929511769001/videos/5086049326001 and that requires a policy-key-raw header |
| 06:03:05 | <h2ibot> | Exorcism edited Mailman/2 (-65, /* Not yet archived */): https://wiki.archiveteam.org/?diff=61097&oldid=61096 |
| 06:40:16 | | gosc quits [Client Quit] |
| 06:42:17 | | Island quits [Read error: Connection reset by peer] |
| 07:13:35 | <@arkiver> | pabs: thank you thank you |
| 07:14:06 | <nicolas17> | kline: is the source.org rsync data also available via HTTP(S)? |
| 07:15:13 | <nicolas17> | if so, we could use rsync to efficiently know when the tree changes, and then archive the changes alone via HTTP |
| 07:15:18 | | Nekroschizofrenetyk joins |
| 07:46:09 | | Webuser579225 joins |
| 07:46:48 | | Webuser579225 quits [Client Quit] |
| 07:50:39 | <Nekroschizofrenetyk> | how does AB deal with Yandex (disk.yandex.ru specifically) captcha challenge? |
| 07:51:19 | <Nekroschizofrenetyk> | I tried saving one url via SPN and it saved a 302 to captcha |
| 07:51:36 | <nicolas17> | AB doesn't deal with any captcha challenges |
| 07:52:17 | <nicolas17> | it can simulate certain user agents which may bypass challenges or blocks altogether vs identifying as "archivebot", but if a regular browser is getting a captcha, AB probably can't deal with it |
| 07:52:17 | <Nekroschizofrenetyk> | is it likely that at least some pipelines have not been marked as bot IPs and could be used? |
| 07:52:43 | <nicolas17> | ah yes, if there's IP blocks then trying another pipeline can work |
| 07:53:05 | <Nekroschizofrenetyk> | I'm not getting it but a page I tried to save was redirected to a captcha challenge on IA |
| 07:53:09 | <Nekroschizofrenetyk> | and the challenge was saved |
| 07:59:00 | <Nekroschizofrenetyk> | why I'm asking: while browsing through a CDX for *.narod.ru urls, I came across urls like this one: http://narod.ru:80/disk/5776712000/shabchr.djvu.html which, when opened now, redirects to https://disk.yandex.ru/d/d4vSMS7GTyKVi |
| 07:59:32 | <Nekroschizofrenetyk> | the slug is not required, too - http://narod.ru/disk/5776712000/ does just fine |
| 08:01:34 | <Nekroschizofrenetyk> | *I have an idea - I'll check in fart viewer |
| 08:03:09 | <Nekroschizofrenetyk> | https://archive.fart.website/archivebot/viewer/domain/disk.yandex.ru - I don't think that's all that is available |
| 08:04:03 | <Nekroschizofrenetyk> | (which tbh isn't surprising, given the urls I checked at IA have not been saved, come to think of it, ... I'm slow) |
| 08:05:03 | <Nekroschizofrenetyk> | urls to the narod.ru/disk files appear sequential, don't they? |
| 08:08:16 | <Nekroschizofrenetyk> | maybe, then, it would be possible to: 1) archive the urls from the CDX that are still not behind a login wall 2) if some more rules of old narod.ru/disk url building are established, to brute force find any still available link, which has not been archived already ? |
| 08:09:56 | <Nekroschizofrenetyk> | ad 2) - normally it would be 10 billion possible urls but indeed, from the structure of the urls, it seems like we could narrow it down by a lot |
| 08:11:38 | <Nekroschizofrenetyk> | however, given that SPN returns a captcha challenge (but my normal browsing does not), how likely is it that AB will utterly fail? also - to save the files themselves (not just the sites), do we just need to feed AB the urls and it will save the redirects along with all the content on a page? |
| 08:12:03 | <Nekroschizofrenetyk> | or do we have to first find the redirects and then run them? |
| 08:28:49 | | danwellby quits [Read error: Connection reset by peer] |
| 08:29:00 | | danwellby-1 joins |
| 08:29:22 | | danwellby-1 is now known as danwellby |
| 09:21:46 | <cruller> | The previous AB jobs failed to save the files themselves. |
| 09:22:30 | <cruller> | While a file itself can be retrieved with a simple GET request, obtaining its URL requires a somewhat complex POST request. Therefore, saving files themselves would likely require at least a custom script? |
| 09:24:16 | <Nekroschizofrenetyk> | oh |
| 09:25:38 | | dada joins |
| 09:26:06 | | BearFortress joins |
| 09:26:20 | <cruller> | Also, if the restriction is only IP blocking, for saving only pages, wouldn't #// be more appropriate? |
| 09:27:34 | <cruller> | Personally, I don't think that's very meaningful, though. |
| 09:35:05 | <Nekroschizofrenetyk> | seems right re: #// |
| 09:36:09 | <Nekroschizofrenetyk> | if it cannot grab the files themselves, then agree, not that terribly meaningful (though still good to have, if the urls are pointed at somewhere online) |
| 09:50:04 | <klea> | cruller: How does the post work, is it different per file or is it a post to the same location? (ie, if saved properly, would it be playable properly on wbm or not) |
| 09:51:04 | <klea> | (re doing it on #//) Depends on how many disk.yandex.ru links we have to archive at once, since #// is able to DDoS sites. |
| 09:51:24 | <Nekroschizofrenetyk> | oh, I wouldn't be afraid of DDoSing yandex |
| 09:51:35 | <Nekroschizofrenetyk> | I guess |
| 09:54:35 | <Nekroschizofrenetyk> | there are 11 digits but they seem to have a structure |
| 09:55:32 | <Nekroschizofrenetyk> | at least for a vast majority of the urls, the ninth and tenth ones are 0s |
| 09:55:45 | <Nekroschizofrenetyk> | 11th is either 0 or 1 |
| 10:24:00 | <cruller> | klea: The POST target is always https://disk.yandex.ru/public/api/download-url AFAIK. I have no idea about replayability. |
| 10:24:01 | <cruller> | Downloading a directory also sends a POST to https://disk.yandex.ru/public/api/get-dir-size , but this may not be necessary. |
| 10:27:18 | <klea> | Annoying, that'd mean it won't replay well. |
| 10:28:09 | <cruller> | I also found https://disk.yandex.ru/public/api/album-download-url . An example of an album is https://disk.yandex.ru/a/14XUtEbH3T7vh3/5aa00f2b0c1ee4b63fbaca78 |
| 10:28:17 | <cruller> | Further investigation will likely be required. |
| 10:44:09 | <cruller> | https://web.archive.org/cdx/search/cdx?url=https://disk.yandex.ru/d/* There is clearly some kind of regularity here as well. |
| 10:51:31 | | lennier2_ joins |
| 10:51:35 | | Dango3607 (Dango360) joins |
| 10:51:44 | | cyanbox_ joins |
| 10:51:47 | | Starchives__ (Starchives) joins |
| 10:51:50 | | benjins3_ joins |
| 10:52:03 | | khaoohs_ joins |
| 10:52:15 | | ummmSokar joins |
| 10:52:16 | | scotrod2 quits [Quit: Ping timeout (120 seconds)] |
| 10:52:16 | | notSokar quits [Read error: Connection reset by peer] |
| 10:52:16 | | fuzzy80211 quits [Read error: Connection reset by peer] |
| 10:52:29 | | scotrod2 joins |
| 10:52:34 | | flotwig quits [Excess Flood] |
| 10:52:36 | | flotwig joins |
| 10:54:53 | | lennier2 quits [Ping timeout: 268 seconds] |
| 10:54:54 | | Dango360 quits [Ping timeout: 268 seconds] |
| 10:54:54 | | Starchives_ quits [Ping timeout: 268 seconds] |
| 10:54:54 | | Dango3607 is now known as Dango360 |
| 10:55:30 | | cyanbox quits [Ping timeout: 268 seconds] |
| 10:55:31 | | bladem quits [Ping timeout: 268 seconds] |
| 10:55:31 | | benjins3 quits [Ping timeout: 268 seconds] |
| 10:55:31 | | khaoohs quits [Ping timeout: 268 seconds] |
| 10:56:43 | | bladem (bladem) joins |
| 11:00:01 | | Bleo1826007227196234552220110 quits [Quit: The Lounge - https://thelounge.chat] |
| 11:02:45 | | Bleo1826007227196234552220110 joins |
| 11:25:56 | | Nekroschizofrenetyk quits [Quit: Ooops, wrong browser tab.] |
| 12:00:30 | | etnguyen03 (etnguyen03) joins |
| 12:07:57 | | bigfren joins |
| 12:08:20 | <bigfren> | Good day! |