| 00:42:51 | <pabs> | download speeds from IA seem quite slow around the world, are their connections overloaded or something? |
| 00:53:48 | <flashfire42> | Yeah |
| 00:56:47 | <nicolas17> | sorry my fault, still downloading the yahoo videos stuff and my 4 connections are clearly overwhelming poor IA (?) |
| 01:04:57 | <pabs> | :) |
| 01:08:00 | <pabs> | hmm, rtorrent finds 0 peers for IA torrents like https://archive.org/download/fossy2023_Breaking_the_Chains_of_Trustin/fossy2023_Breaking_the_Chains_of_Trustin_archive.torrent |
| 01:08:49 | <pabs> | transmission-cli too |
| 01:09:09 | <pabs> | rtorrent also says DHT search unsuccessful |
| 01:10:43 | <nicolas17> | pabs: they use webseeding, which rtorrent doesn't support |
| 01:15:28 | <pabs> | ah :( |
| 01:16:02 | <pabs> | transmission-cli does download from webseeds much faster than a regular web download, interesting |
| 01:17:24 | <pabs> | what are these .____padding_file directories in the torrent? |
| 01:19:24 | | Arcorann (Arcorann) joins |
| 01:20:24 | <nicolas17> | the torrent points at both IA servers having the file as webseeds, so you download from both simultaneously, which mitigates one being slow |
| 01:22:11 | <nicolas17> | multi-file torrents work as if it was a single blob with all files concatenated, so a chunk hash can span multiple files, which causes some issues, those zero-filled padding files align the beginning of each file with chunk hash boundaries |
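
To make the padding-file explanation above concrete, here is a minimal sketch of the arithmetic involved. The function name `padding_len` and the 256 KiB piece size are illustrative assumptions; the piece size IA actually uses is not stated in this log.

```shell
#!/usr/bin/env bash
# Sketch: how many zero bytes must follow a file so that the next file
# in a multi-file torrent starts exactly on a piece boundary.
# padding_len is a hypothetical helper, not part of any torrent client.
padding_len() {
    local size=$1 piece=$2
    # Bytes left over in the last, partially filled piece.
    local remainder=$(( size % piece ))
    # If the file already ends on a boundary, no padding is needed.
    echo $(( (piece - remainder) % piece ))
}

# A 1,000,000-byte file with an assumed 262,144-byte (256 KiB) piece size:
# 1,000,000 + 48,576 = 1,048,576, which is exactly 4 pieces.
padding_len 1000000 262144   # prints 48576

# A file that is already a whole number of pieces needs no padding.
padding_len 524288 262144    # prints 0
```

With padding of that length inserted after each file, no piece hash covers bytes from two different real files, which is the alignment nicolas17 describes.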
| 01:25:47 | <pabs> | so the IA download problem isn't the uplink, but individual servers? |
| 01:38:32 | <pabs> | hmm, browsers need better download mechanisms |
| 02:33:12 | | threedeeitguy39 (threedeeitguy) joins |
| 02:33:25 | | threedeeitguy3 quits [Ping timeout: 265 seconds] |
| 02:33:25 | | threedeeitguy39 is now known as threedeeitguy3 |
| 02:47:35 | | threedeeitguy3 quits [Client Quit] |
| 02:54:13 | | threedeeitguy39 (threedeeitguy) joins |
| 04:00:26 | | BigBrain_ quits [Ping timeout: 245 seconds] |
| 04:02:12 | <pabs> | arkiver: in principle, do you think IA might be interested in sponsoring servers/storage/network/hosting/etc for live replicas of archive.softwareheritage.org (source code, all of GitHub/Debian/etc, will or does use Ceph) and/or snapshot.debian.org (~152TB all of Debian's software history, both source/binaries, bespoke content-addressed filesystem layout)? |
| 04:02:26 | | BigBrain_ (bigbrain) joins |
| 06:31:08 | | nicolas17 quits [Ping timeout: 252 seconds] |
| 06:31:25 | <Exorcism> | update: https://irc.digitaldragon.dev/uploads/6b75bb3f2cb4020c/image.png |
| 06:32:03 | <Exorcism> | and because of this it's impossible for me to publish on ia at the moment oof |
| 07:26:18 | | themadpro (themadpro) joins |
| 08:07:23 | | nulldata quits [Ping timeout: 252 seconds] |
| 08:11:02 | | nulldata (nulldata) joins |
| 09:00:44 | | nulldata quits [Ping timeout: 252 seconds] |
| 09:04:23 | | nulldata (nulldata) joins |
| 09:17:49 | | Exorcism is now known as Exorcism_ |
| 09:35:31 | | themadpro quits [Client Quit] |
| 09:46:48 | | Exorcism (exorcism) joins |
| 11:57:07 | | Exorcism uploaded an image: (99KiB) < https://matrix.hackint.org/_matrix/media/v3/download/matrix.fedibird.com/UQEukxNbokaThmsFmxsEzEVw/1000017637.jpg > |
| 11:57:14 | | Exorcism uploaded an image: (225KiB) < https://matrix.hackint.org/_matrix/media/v3/download/matrix.fedibird.com/skOoTTrwCoeaXEFZWbsZvoKM/1000017636.jpg > |
| 11:57:30 | <Exorcism> | hum |
| 11:57:33 | <Exorcism> | wrong channel 💀 |
| 13:26:56 | | Arcorann quits [Ping timeout: 252 seconds] |
| 14:07:34 | | driib quits [Quit: The Lounge - https://thelounge.chat] |
| 14:12:28 | | driib (driib) joins |
| 15:43:05 | <@arkiver> | pabs: interesting question, are you connected with software heritage? |
| 15:43:10 | <@arkiver> | i have no idea how big software heritage is |
| 15:43:18 | <@arkiver> | definitely not saying no to that |
| 15:44:28 | <pabs> | kind of. I know the Debian folks who started it, I've been submitting code there for about a year and am about to start a contract or two with them about expanding it |
| 15:44:49 | <pabs> | for Debian and snapshot.debian.org, I'm one of the Debian sysadmin team |
| 15:45:54 | <pabs> | I forget how big SWH is either, but the video/slides for this talk will contain that once published https://debconf23.debconf.org/talks/44-software-heritage-building-a-community-to-safeguard-the-software-commons/ |
| 15:46:53 | <pabs> | SWH are also working on mirroring it, they have two orgs in Europe doing that now |
| 15:47:31 | <@arkiver> | very nice! |
| 15:47:40 | <@arkiver> | pabs: do you have an estimate on the size of software heritage? |
| 15:47:51 | <@arkiver> | ~250 TB? |
| 15:48:31 | <pabs> | I think more, but I can't remember sorry. definitely in the slides, I can ask olasd to publish them |
| 15:48:38 | <@arkiver> | and what is the yearly growth of both debian and software heritage like? |
| 15:49:08 | <pabs> | snapshot.d.o at least 5TB, probably more these days |
| 15:49:18 | <@arkiver> | i can have the numbers a bit later - will likely not have an answer for you yet next week, and we have the current problems at IA (which are being fixed - various factors :/ ) |
| 15:49:35 | <pabs> | it was 5TB/year in 2014-06-01 |
| 15:49:45 | <@arkiver> | ah, so probably more like 20 TB/year nowadays |
| 15:50:15 | <pabs> | maybe, not sure. might be some data in our munin instance |
| 15:50:31 | <@arkiver> | would an assumption of yearly growth at 50 TB for both debian and software heritage be realistic? |
| 15:50:40 | <@arkiver> | (again, we can figure out these numbers in the coming weeks) |
| 15:50:52 | | @arkiver if afk for ~40 minutes |
| 15:50:54 | <@arkiver> | is* |
| 15:51:10 | <pabs> | anyways, this is just exploration, only briefly mentioned to SWH during an interview and to Debian sysadmins on IRC |
| 16:04:04 | | zhongfu quits [Ping timeout: 258 seconds] |
| 16:05:49 | | zhongfu (zhongfu) joins |
| 16:26:14 | | qw3rty quits [Ping timeout: 252 seconds] |
| 16:30:00 | <pabs> | arkiver: the slides https://annex.softwareheritage.org/public/talks/2023/2023-09-10-DebConf23.pdf |
| 16:31:09 | <pabs> | More than 1PB of source code files |
| 16:31:09 | <pabs> | (replicated 3 times by Software |
| 16:31:09 | <pabs> | Heritage) |
| 16:31:09 | <pabs> | More than 100 TB used for (resilient) |
| 16:31:09 | <pabs> | storage of the graph |
| 16:31:10 | <pabs> | Infrastructure support for mirroring: |
| 16:31:14 | <pabs> | 100 TB kafka deployment (~30TB of |
| 16:31:16 | <pabs> | data used) |
| 17:27:00 | <@arkiver> | pabs: thank you. a PB is quite a bit of data |
| 17:27:16 | <@arkiver> | it's not an impossible amount, but might have to check in with some people |
| 18:09:56 | | AlsoHP_Archivist joins |
| 18:10:26 | <HP_Archivist> | JAA: You helped with this script for downloading items from the CLI |
| 18:10:31 | <HP_Archivist> | ia search --itemlist 'uploader:harrypotterarchival@gmail.com -collection:game_replays -collection:videogame_videos -collection:speed_runs' | xargs -P 8 -n 1 ia download |
| 18:10:50 | <HP_Archivist> | How would I edit this to *not* download IA derived files? and only the files I upload? |
| 18:40:12 | | AlsoHP_Archivist quits [Client Quit] |
| 18:50:36 | <@JAA> | HP_Archivist: Support for that has only been added a few months ago: https://github.com/jjjake/internetarchive/issues/365 |
| 18:51:14 | <@JAA> | You'll need 3.4.0 or higher, then add --exclude-source=derivative at the end of the command, probably. |
| 19:04:31 | | driib quits [Client Quit] |
| 19:16:34 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 19:21:05 | | driib (driib) joins |
| 19:34:05 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 19:35:01 | | BigBrain_ quits [Ping timeout: 245 seconds] |
| 19:37:18 | | BigBrain_ (bigbrain) joins |
| 21:03:18 | <HP_Archivist> | JAA: Thanks. ia search --itemlist 'uploader:harrypotterarchival@gmail.com -collection:game_replays -collection:videogame_videos -collection:speed_runs' | xargs -P 8 -n 1 ia download --exclude-source=derivative ? |
| 21:03:29 | <HP_Archivist> | That doesn't look right, though |
| 21:12:54 | <imer> | HP_Archivist: Before | xargs most likely |
| 21:25:40 | <HP_Archivist> | imer: Tried that, it's actually: ia search --itemlist 'uploader:harrypotterarchival@gmail.com -collection:game_replays -collection:videogame_videos -collection:speed_runs' | xargs -P 8 -n 1 ia download --exclude-source=derivative |
| 21:25:49 | <HP_Archivist> | That's what seemed to work just now |
| 21:28:54 | <@JAA> | <futurama_squint.png> |
| 21:29:01 | <@JAA> | Those two commands look the same...? |
| 21:29:50 | <HP_Archivist> | JAA: They are. I didn't try it the first time (when I said it didn't look right) Tried it first with imer's suggestion, that didn't work. But my original assumption was correct |
| 21:31:13 | <@JAA> | Yeah, adding --exclude-source=derivative at the end is what I said. :-) |
| 21:31:32 | <HP_Archivist> | :) |
| 21:31:46 | <HP_Archivist> | I should know not to second guess you, heh |
| 21:31:49 | <@JAA> | Technically, `ia download -h` says that the options must come after the identifier, but that's actually not true. |
| 21:32:03 | <@JAA> | And it's annoying to do with xargs, so whatever. :-P |
| 21:32:39 | <HP_Archivist> | Hm. Is there any way to add to this command a way to speed up the downloads, a la aria2? |
| 21:32:43 | <@JAA> | And it's also contrary to the vast majority of CLIs out there. Optional --x/-x arguments normally always come before positional ones. |
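
The argument-order point above is easier to see by previewing what xargs actually runs. This is a dry-run sketch only: `preview_commands`, `item_one` and `item_two` are hypothetical names, and `echo` stands in for real execution so nothing is downloaded.

```shell
#!/usr/bin/env bash
# Sketch: show the command line xargs builds for each identifier.
# xargs appends each input item to the END of the command, so the
# identifier lands after --exclude-source=derivative.
preview_commands() {
    xargs -n 1 echo ia download --exclude-source=derivative
}

# Hypothetical identifiers standing in for `ia search --itemlist` output.
printf '%s\n' item_one item_two | preview_commands
# prints:
# ia download --exclude-source=derivative item_one
# ia download --exclude-source=derivative item_two
```

So the pipeline that worked in this log ends up passing the option before the identifier, which is the order JAA says `ia download` accepts in practice even though its help text says otherwise. Removing the `echo` and adding `-P 8` gives the command quoted in the chat.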
| 21:33:19 | <@JAA> | I'm not aware of one. You can make `ia download` print file URLs instead somehow, I think. |
| 21:33:28 | <@JAA> | Otherwise, just download more items in parallel by tweaking -P. |
| 21:33:40 | <@JAA> | But also, IA is struggling, so if it isn't urgent, maybe don't hammer them too hard. |
| 21:34:24 | <HP_Archivist> | JAA: Yeah, I thought about that. I'll just deal with the speed I'm getting rn. |
| 21:37:25 | | fireonlive quits [Excess Flood] |
| 21:39:01 | | fireonlive (fireonlive) joins |
| 21:56:03 | | nicolas17 joins |
| 22:52:52 | | BearFortress quits [Ping timeout: 265 seconds] |
| 23:47:58 | | systwi quits [Ping timeout: 265 seconds] |
| 23:48:25 | | systwi__ (systwi) joins |