Store Sources and Results in Cloud Object Storage

Jammi treats local disk, S3, GCS, and Azure Blob as interchangeable backends. Any place the engine accepts a local file path it also accepts a storage URL — file://, s3://, gs://, or azure:// — including registered file-shaped sources and the result-table Parquet that embedding and inference jobs write.

Build with the cloud features you need

The default build ships only file:// and the in-memory test driver. Cloud schemes are opt-in per provider so a deployment that only uses S3 does not pull in the GCS and Azure SDK chains:

| Feature | Schemes it enables |
| --- | --- |
| storage-s3 | s3:// (AWS S3 and S3-compatible: MinIO, R2, LocalStack) |
| storage-gcs | gs:// |
| storage-azure | azure://, abfss:// |
| storage-cloud | All three (umbrella) |

[dependencies]
jammi-db = { version = "0.5", features = ["storage-s3", "storage-gcs"] }
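To make the table concrete, here is a minimal sketch of the scheme-to-feature mapping it describes. The `required_feature` helper is hypothetical and is not part of jammi-db; it only encodes the rows above and assumes that a string without `://` is a plain local path.

```rust
/// Hypothetical helper (not a jammi-db API): returns the Cargo feature that
/// must be enabled for a URL's scheme, following the feature table.
/// `Ok(None)` means the default build already handles the URL.
fn required_feature(url: &str) -> Result<Option<&'static str>, String> {
    // Everything before "://" is the scheme; a bare path has none.
    let scheme = match url.split_once("://") {
        Some((scheme, _rest)) => scheme,
        // Assumption: no scheme means a plain local file path.
        None => return Ok(None),
    };
    match scheme {
        "file" => Ok(None),
        "s3" => Ok(Some("storage-s3")),
        "gs" => Ok(Some("storage-gcs")),
        "azure" | "abfss" => Ok(Some("storage-azure")),
        other => Err(format!("unsupported scheme: {other}")),
    }
}
```

A deployment script could run a check like this over its configured URLs to decide which features to list in Cargo.toml, or fall back to storage-cloud when all three providers appear.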

Live integration tests sit behind the matching live-s3-tests, live-gcs-tests, and live-azure-tests features, so the hermetic cargo test lane never reaches the network.

Register an S3-backed source

Rust

use jammi_ai::session::InferenceSession;
use jammi_db::source::{FileFormat, SourceConnection, SourceType};
use jammi_db::storage::{CloudConfig, S3Config, StorageUrl};

async fn register_papers(session: &InferenceSession) -> Result<(), Box<dyn std::error::Error>> {
    // Parse the URL first so a malformed value fails before registration.
    let url = StorageUrl::parse("s3://benchmarks/snapshots/2026/papers.parquet")?;

    let conn = SourceConnection {
        url: Some(url.to_string()),
        format: Some(FileFormat::Parquet),
        // Explicit per-source S3 settings; unset fields keep their defaults.
        cloud: Some(CloudConfig::S3(S3Config {
            region: Some("us-east-1".into()),
            ..Default::default()
        })),
        ..Default::default()
    };

    session.add_source("papers", SourceType::File, conn).await?;

    // Query the newly registered source.
    let _rows = session
        .sql("SELECT id, title FROM papers.public.papers LIMIT 10")
        .await?;
    Ok(())
}

If the cloud field is None and the URL uses a cloud scheme, the driver falls back to the SDK’s ambient credential chain: environment variables on any provider, plus instance profiles and IRSA on AWS, ADC on GCP, and Managed Identity on Azure.
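The fallback rule can be sketched as a small decision function. Everything here is illustrative: `CredentialSource` and `credential_source` are hypothetical names rather than jammi-db types, and treating non-cloud URLs as needing no credentials is an assumption.

```rust
/// Hypothetical model of the fallback rule; not a jammi-db type.
#[derive(Debug, PartialEq)]
enum CredentialSource {
    /// The source carries an explicit cloud config.
    Explicit,
    /// No config was given, so the SDK's ambient chain is used.
    Ambient,
    /// Assumption: non-cloud URLs need no cloud credentials.
    NotNeeded,
}

/// Decides where credentials come from for a source URL.
/// `has_cloud_config` stands in for `conn.cloud.is_some()`.
fn credential_source(url: &str, has_cloud_config: bool) -> CredentialSource {
    // Cloud schemes taken from the feature table earlier on this page.
    let is_cloud = ["s3://", "gs://", "azure://", "abfss://"]
        .iter()
        .any(|prefix| url.starts_with(prefix));
    match (is_cloud, has_cloud_config) {
        (false, _) => CredentialSource::NotNeeded,
        (true, true) => CredentialSource::Explicit,
        (true, false) => CredentialSource::Ambient,
    }
}
```

For example, `credential_source("s3://benchmarks/x.parquet", false)` yields `CredentialSource::Ambient`, which is the case the paragraph above describes.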

Python

from jammi_ai import Database

db = Database()
db.add_source("papers", url="s3://benchmarks/snapshots/2026/papers.parquet", format="parquet")
db.sql("SELECT id, title FROM papers.public.papers LIMIT 10")

The Python binding accepts the same URL forms as the Rust API; per-source cloud credentials are read from the process environment.

CLI

jammi sources add papers \
    --url s3://benchmarks/snapshots/2026/papers.parquet \
    --format parquet

GCS and Azure

The pattern is identical — only the URL prefix and the CloudConfig variant change:

// GCS: authenticate with a service-account key file.
use jammi_db::source::{FileFormat, SourceConnection};
use jammi_db::storage::{CloudConfig, GcsConfig};

fn gcs_connection() -> SourceConnection {
    SourceConnection {
        url: Some("gs://archives/2026/jan.parquet".into()),
        format: Some(FileFormat::Parquet),
        cloud: Some(CloudConfig::Gcs(GcsConfig {
            service_account_path: Some("/etc/jammi/sa.json".into()),
            ..Default::default()
        })),
        ..Default::default()
    }
}

// Azure: authenticate with a SAS token read from the environment.
use jammi_db::source::{FileFormat, SourceConnection};
use jammi_db::storage::{AzureConfig, CloudConfig};

fn azure_connection() -> Result<SourceConnection, Box<dyn std::error::Error>> {
    let conn = SourceConnection {
        url: Some("azure://snapshots/model_outputs.parquet".into()),
        format: Some(FileFormat::Parquet),
        cloud: Some(CloudConfig::Azure(AzureConfig {
            account_name: Some("mystorage".into()),
            // Fails with an error if AZURE_SAS_TOKEN is unset.
            sas_token: Some(std::env::var("AZURE_SAS_TOKEN")?),
            ..Default::default()
        })),
        ..Default::default()
    };
    Ok(conn)
}

Persist result tables to the cloud

ResultStore accepts a StorageUrl root, so embedding and inference outputs can land in the same bucket as the source data:

use std::sync::Arc;

use jammi_db::catalog::Catalog;
use jammi_db::storage::{StorageRegistry, StorageUrl};
use jammi_db::store::ResultStore;

fn open_result_store(catalog: Arc<Catalog>) -> jammi_db::error::Result<()> {
    // Root prefix under which result tables and their sidecars are written.
    let root = StorageUrl::parse("s3://benchmarks/jammi_db")?;
    let registry = StorageRegistry::new();
    let _result_store = Arc::new(ResultStore::with_root(root, registry, catalog)?);
    Ok(())
}

Every result table the session creates writes its Parquet and sidecar ANN index to that prefix; delete_table_files and the crash-recovery pass operate against the same backend.

How the layout maps onto buckets

For a result table named papers__text_embedding__bge-m3__20260520T120000Z_abc12345, the engine writes four sibling objects, the Parquet file plus three index sidecars:

s3://benchmarks/jammi_db/papers__text_embedding__bge-m3__….parquet
s3://benchmarks/jammi_db/papers__text_embedding__bge-m3__….idx.usearch
s3://benchmarks/jammi_db/papers__text_embedding__bge-m3__….idx.rowmap
s3://benchmarks/jammi_db/papers__text_embedding__bge-m3__….idx.manifest.json
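The naming scheme in the listing above can be sketched as a simple join of root, table name, and suffix. The `table_objects` helper is hypothetical; only the four suffixes come from the listing, and how the engine actually builds these keys is not shown on this page.

```rust
/// Suffixes of the sibling objects shown in the listing.
const SUFFIXES: [&str; 4] = [
    ".parquet",
    ".idx.usearch",
    ".idx.rowmap",
    ".idx.manifest.json",
];

/// Hypothetical helper (not a jammi-db API): builds the sibling object URLs
/// for one result table under a storage root.
fn table_objects(root: &str, table: &str) -> Vec<String> {
    // Tolerate a trailing slash on the root so the join never doubles it.
    let root = root.trim_end_matches('/');
    SUFFIXES
        .iter()
        .map(|suffix| format!("{root}/{table}{suffix}"))
        .collect()
}
```

Because the result is only string concatenation over the root, the same function would produce gs://, azure://, or file:// URLs unchanged, which matches the claim that the layout is backend-independent.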

The sidecar layout is the same on every backend; the only difference is the driver under the hood. USearch’s path-based FFI is bridged through a tempfile for cloud schemes so its save / load calls work unchanged.
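As a rough illustration of the tempfile bridge, the sketch below stages in-memory bytes (standing in for a downloaded index object) on local disk and hands the path to a loader that only accepts filesystem paths. `load_via_tempfile` and its closure parameter are hypothetical; the engine's actual bridge and the USearch calls it wraps are not shown here.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical bridge (not a jammi-db API): writes `bytes` to a temporary
/// file, runs a path-based loader on it, then removes the file.
fn load_via_tempfile<T>(
    bytes: &[u8],
    load_from_path: impl FnOnce(&Path) -> io::Result<T>,
) -> io::Result<T> {
    // Process id plus a nanosecond timestamp is unique enough for a sketch;
    // production code would more likely use a crate such as `tempfile`.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let path = std::env::temp_dir()
        .join(format!("idx-bridge-{}-{}.tmp", std::process::id(), nanos));

    fs::write(&path, bytes)?;
    let result = load_from_path(&path);
    // Best-effort cleanup, whether or not the loader succeeded.
    let _ = fs::remove_file(&path);
    result
}
```

The save direction would mirror this: let the path-based call write to a temporary file, read the bytes back, and upload them to the object store.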