Generate Embeddings

Generate vector embeddings by running a model over text columns from a registered source. Results are persisted to Parquet with sidecar ANN indexes for fast similarity search.

Basic usage

Rust

#![allow(unused)]
fn main() {
    extern crate jammi_db;
    extern crate jammi_ai;
    extern crate tokio;
    use jammi_ai::session::InferenceSession;

    async fn ex(session: &InferenceSession) -> jammi_db::error::Result<()> {
        let record = session.generate_text_embeddings(
            "patents",
            "sentence-transformers/all-MiniLM-L6-v2",
            &["abstract".to_string()],
            "id",
        ).await?;

        println!("Embedded {} rows, {} dimensions", record.row_count, record.dimensions.unwrap());
        Ok(())
    }
}

Python

db.generate_text_embeddings(
    source="patents",
    model="sentence-transformers/all-MiniLM-L6-v2",
    columns=["abstract"],
    key="id",
)

What gets created

Each call creates a timestamped Parquet file plus a sidecar ANN index bundle:

{artifact_dir}/jammi_db/
├── patents__embedding__all-MiniLM-L6-v2__20260325T120000.parquet
├── patents__embedding__all-MiniLM-L6-v2__20260325T120000.usearch
├── patents__embedding__all-MiniLM-L6-v2__20260325T120000.rowmap
└── patents__embedding__all-MiniLM-L6-v2__20260325T120000.manifest.json
  • Parquet file — source of truth. Contains _row_id, _source_id, _model_id, vector. Readable by external tools (DuckDB, Polars, pandas).
  • .usearch — USearch HNSW graph for ANN search.
  • .rowmap — maps internal USearch keys to _row_id strings.
  • .manifest.json — metadata (dimensions, count, metric, backend).
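
The file names in the listing above follow a visible pattern, which can be taken apart programmatically, for instance to group the files of one bundle. The sketch below infers the layout `{source}__embedding__{model}__{timestamp}` from the example names only; `parse_artifact_name` is a hypothetical helper, not part of Jammi, and it assumes that no component contains a double underscore.

```python
from datetime import datetime

# Suffixes taken from the listing above.
KNOWN_SUFFIXES = (".manifest.json", ".parquet", ".usearch", ".rowmap")


def parse_artifact_name(filename: str) -> dict:
    """Split an artifact file name into its parts (hypothetical helper)."""
    for suffix in KNOWN_SUFFIXES:
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            break
    else:
        raise ValueError(f"unrecognized artifact suffix: {filename}")

    # Assumed layout: {source}__embedding__{model}__{timestamp}
    source, kind, model, stamp = stem.split("__")
    return {
        "source": source,
        "kind": kind,
        "model": model,
        "created": datetime.strptime(stamp, "%Y%m%dT%H%M%S"),
        "suffix": suffix,
    }


info = parse_artifact_name(
    "patents__embedding__all-MiniLM-L6-v2__20260325T120000.manifest.json"
)
print(info["source"], info["model"], info["created"].isoformat())
```

Matching on a fixed suffix list rather than splitting at the first dot keeps model names such as `bge-small-en-v1.5` intact.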

The sidecar files are disposable: if they are deleted, search falls back to brute-force exact search. Only the Parquet file needs to be preserved.
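
To make the brute-force fallback concrete, here is a minimal pure-Python sketch of exact search over stored vectors. It is an illustration, not Jammi's implementation: it assumes a cosine metric (the actual metric is recorded in the manifest) and unit-length stored vectors, and the names `l2_normalize` and `brute_force_search` are made up for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (raises ZeroDivisionError on a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def brute_force_search(query, rows, k=3):
    """Exact top-k search: score every stored vector against the query.

    `rows` is a list of (_row_id, vector) pairs with unit-length vectors,
    so the dot product equals cosine similarity.
    """
    q = l2_normalize(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), row_id) for row_id, vec in rows]
    scored.sort(reverse=True)  # highest similarity first
    return [(row_id, score) for score, row_id in scored[:k]]


# Toy 2-dimensional data standing in for rows read from the Parquet file.
rows = [
    ("a", l2_normalize([1.0, 0.0])),
    ("b", l2_normalize([0.0, 1.0])),
    ("c", l2_normalize([1.0, 1.0])),
]
for row_id, score in brute_force_search([1.0, 0.1], rows, k=2):
    print(row_id, round(score, 3))
```

Every query touches every row, so the cost grows linearly with table size; that is the trade-off the ANN index avoids.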

Embedding table schema

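| Column     | Type                      | Description                     |
|------------|---------------------------|---------------------------------|
| _row_id    | Utf8                      | Key column value cast to string |
| _source_id | Utf8                      | Source identifier               |
| _model_id  | Utf8                      | Model identifier                |
| vector     | FixedSizeList(Float32, N) | L2-normalized embedding vector  |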

Failed rows (null or empty text) are excluded — only successfully embedded rows appear in the output.

Multiple text columns

Pass multiple column names to concatenate them (space-separated) before embedding:

Rust

#![allow(unused)]
fn main() {
    extern crate jammi_db;
    extern crate jammi_ai;
    extern crate tokio;
    use jammi_ai::session::InferenceSession;

    async fn ex(session: &InferenceSession) -> jammi_db::error::Result<()> {
        session.generate_text_embeddings(
            "papers",
            "sentence-transformers/all-MiniLM-L6-v2",
            &["title".to_string(), "abstract".to_string()],
            "doi",
        ).await?;
        Ok(())
    }
}

Python

db.generate_text_embeddings(
    source="papers",
    model="sentence-transformers/all-MiniLM-L6-v2",
    columns=["title", "abstract"],
    key="doi",
)
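
As a rough picture of what space-separated concatenation means for a single row, consider the sketch below. `build_input_text` is a hypothetical helper, and the handling of a null in one of several columns is an assumption: this page does not say what the library does in that case, so the sketch simply skips such values.

```python
def build_input_text(row, columns):
    """Join the selected column values with single spaces."""
    # Assumption: null or empty values are skipped, so no double spaces appear.
    parts = [row[col] for col in columns if row.get(col)]
    return " ".join(parts)


# Made-up row for illustration.
paper = {"doi": "example-doi", "title": "Vector search", "abstract": "A short survey."}
print(build_input_text(paper, ["title", "abstract"]))
```

Under this reading, a row whose selected columns are all null or empty yields an empty string, which matches the "null or empty text" condition for failed rows.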

Multiple embedding tables

Each call creates a new table. Multiple tables can coexist for the same source (different models, different columns):

#![allow(unused)]
fn main() {
    extern crate jammi_db;
    extern crate jammi_ai;
    extern crate tokio;
    use jammi_ai::session::InferenceSession;

    async fn ex(session: &InferenceSession) -> jammi_db::error::Result<()> {
        session.generate_text_embeddings("patents", "all-MiniLM-L6-v2", &["abstract".into()], "id").await?;
        session.generate_text_embeddings("patents", "bge-small-en-v1.5", &["title".into()], "id").await?;
        Ok(())
    }
}

When searching, the latest ready embedding table is used by default.
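
The default selection rule can be pictured as a filter followed by a maximum. The sketch below is illustrative only: the record fields (`source`, `status`, `created`) and the function `latest_ready` are hypothetical, and it assumes "latest" means the most recent creation timestamp.

```python
def latest_ready(tables, source):
    """Return the newest table in "ready" status for a source, or None."""
    candidates = [
        t for t in tables if t["source"] == source and t["status"] == "ready"
    ]
    if not candidates:
        return None
    # Timestamps in YYYYMMDDTHHMMSS form sort correctly as plain strings.
    return max(candidates, key=lambda t: t["created"])


# Made-up catalog records for illustration.
tables = [
    {"source": "patents", "model": "all-MiniLM-L6-v2", "status": "ready", "created": "20260325T120000"},
    {"source": "patents", "model": "bge-small-en-v1.5", "status": "ready", "created": "20260326T090000"},
    {"source": "patents", "model": "all-MiniLM-L6-v2", "status": "building", "created": "20260327T080000"},
]
print(latest_ready(tables, "patents")["model"])
```

Note that the newest table overall is skipped here because it is still building.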

Supported models

Any encoder model on Hugging Face Hub that ships safetensors weights and uses one of the supported architectures can be used. Supported architectures:

BERT family — BERT, RoBERTa, DistilBERT, CamemBERT, XLM-RoBERTa:

  • sentence-transformers/all-MiniLM-L6-v2 (384-dim, fast)
  • sentence-transformers/all-mpnet-base-v2 (768-dim, higher quality)
  • BAAI/bge-small-en-v1.5, BAAI/bge-base-en-v1.5

ModernBERT — modernized encoder with rotary embeddings, 8192-token context, GeGLU:

  • answerdotai/ModernBERT-base (768-dim)
  • answerdotai/ModernBERT-large (1024-dim)

A local directory containing config.json, model.safetensors, and tokenizer.json also works. The architecture is detected automatically from model_type in config.json.

Use a local model:

#![allow(unused)]
fn main() {
    extern crate jammi_ai;
    use jammi_ai::model::ModelSource;

    let model = ModelSource::local("/path/to/my-model");
}
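
Before pointing the library at a local directory, it can help to check that the three required files are present and to see which `model_type` the config declares. The sketch below is a standalone pre-flight check, not the library's detection code. The `FAMILIES` mapping uses the `model_type` strings that Hugging Face configs carry for the architectures listed above; whether Jammi uses exactly this table is an assumption.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FILES = ("config.json", "model.safetensors", "tokenizer.json")

# Assumed mapping from Hugging Face `model_type` values to the families above.
FAMILIES = {
    "bert": "BERT family",
    "roberta": "BERT family",
    "distilbert": "BERT family",
    "camembert": "BERT family",
    "xlm-roberta": "BERT family",
    "modernbert": "ModernBERT",
}


def check_local_model(model_dir):
    """Verify the required files exist and return the detected family."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing files: {', '.join(missing)}")
    model_type = json.loads((root / "config.json").read_text()).get("model_type")
    if model_type not in FAMILIES:
        raise ValueError(f"unsupported model_type: {model_type!r}")
    return FAMILIES[model_type]


# Demo against a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "config.json").write_text(json.dumps({"model_type": "modernbert"}))
    (root / "model.safetensors").write_bytes(b"")
    (root / "tokenizer.json").write_text("{}")
    print(check_local_model(root))
```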

Raw inference (no persistence)

To get embeddings as RecordBatches without writing anything to disk:

Rust

#![allow(unused)]
fn main() {
    extern crate jammi_db;
    extern crate jammi_ai;
    extern crate tokio;
    use jammi_ai::session::InferenceSession;

    async fn ex(session: &InferenceSession) -> jammi_db::error::Result<()> {
        use jammi_ai::model::{ModelSource, ModelTask};

        let model = ModelSource::hf("sentence-transformers/all-MiniLM-L6-v2");
        let results = session.infer("patents", &model, ModelTask::TextEmbedding, &["abstract".into()], "id").await?;
        Ok(())
    }
}

Python

results = db.infer(
    source="patents",
    model="sentence-transformers/all-MiniLM-L6-v2",
    columns=["abstract"],
    task="text_embedding",
    key="id",
)

Each RecordBatch has prefix columns (_row_id, _source, _model, _status, _error, _latency_ms) plus task-specific columns (e.g., vector for embeddings).

Error handling

Inference never panics on bad input. Errors are tracked per-row:

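| Condition  | _status | _error                     | vector               |
|------------|---------|----------------------------|----------------------|
| Valid text | "ok"    | null                       | 384-dim float vector |
| Null text  | "error" | "Empty or null text input" | null                 |
| Empty text | "error" | "Empty or null text input" | null                 |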

The batch continues processing even when individual rows fail.
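
Because failures are reported per row, a consumer of raw inference output typically separates good rows from failed ones before using the vectors. The sketch below models rows as plain dicts carrying the `_status`, `_error`, and `vector` columns described above; how the Python binding actually hands results back is not covered here, so treat the row shape and the helper `split_by_status` as assumptions.

```python
from collections import Counter


def split_by_status(rows):
    """Partition rows into successes and failures, and count failure reasons."""
    ok = [r for r in rows if r["_status"] == "ok"]
    failed = [r for r in rows if r["_status"] == "error"]
    reasons = Counter(r["_error"] for r in failed)
    return ok, failed, reasons


# Made-up rows mirroring the table above (vectors shortened for readability).
rows = [
    {"_row_id": "1", "_status": "ok", "_error": None, "vector": [0.1, 0.2]},
    {"_row_id": "2", "_status": "error", "_error": "Empty or null text input", "vector": None},
    {"_row_id": "3", "_status": "error", "_error": "Empty or null text input", "vector": None},
]
ok, failed, reasons = split_by_status(rows)
print(len(ok), "ok,", len(failed), "failed")
for reason, count in reasons.items():
    print(f"{count} x {reason}")
```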

Dynamic batch sizing

The runner starts with the configured inference.batch_size (default: 32). If an out-of-memory error occurs:

  1. Halve the batch size
  2. Retry (up to 3 times)
  3. If OOM persists at batch size 1, mark the row as error and continue

The reduced batch size is sticky for the remainder of the stream.
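
The halving strategy above can be sketched as a small retry loop. This is one possible reading of the steps, not the runner's actual code: `OutOfMemory`, `run_stream`, and `run_batch` are hypothetical names, and the sketch assumes the retry limit applies per chunk and that a chunk whose retries are exhausted is marked as failed as a whole.

```python
class OutOfMemory(Exception):
    """Stand-in for a device out-of-memory error (hypothetical)."""


def run_stream(rows, run_batch, batch_size=32, max_retries=3):
    """Process rows in chunks, halving the batch size on OOM.

    Returns (results, final_batch_size); each result is (status, row, output).
    """
    results = []
    i = 0
    while i < len(rows):
        retries = 0
        while True:
            chunk = rows[i : i + batch_size]
            try:
                outputs = run_batch(chunk)
                results.extend(("ok", r, out) for r, out in zip(chunk, outputs))
                break
            except OutOfMemory:
                if batch_size == 1 or retries == max_retries:
                    # Give up on this chunk, record the failure, and move on.
                    results.extend(("error", r, None) for r in chunk)
                    break
                batch_size = max(1, batch_size // 2)  # sticky: never grows back
                retries += 1
        i += len(chunk)
    return results, batch_size


def fake_model(chunk):
    """Toy model that only fits 4 rows in memory at once."""
    if len(chunk) > 4:
        raise OutOfMemory
    return [len(text) for text in chunk]


results, final_size = run_stream(["a", "bb", "ccc"] * 4, fake_model)
print(final_size, [status for status, _, _ in results])
```

Because `batch_size` is only ever reduced, later chunks start from the size that last succeeded, which is the "sticky" behaviour described above.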

Crash recovery

If the process dies mid-generation, the table is left in “building” status. On the next session start, recovery runs automatically:

  • Parquet missing — mark as failed
  • Parquet corrupt — delete file, mark as failed
  • Parquet valid but stuck in “building” — promote to “ready”, rebuild ANN index

No data is lost if the Parquet file was fully written.
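
The three recovery cases amount to a small decision function. In the sketch below, `recovery_action` and the action names are hypothetical, and the validity check is a deliberately cheap stand-in: it only verifies the `PAR1` magic bytes that the Parquet format places at the start and end of every file, whereas the library's real validation may be more thorough.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Cheap sanity check: magic bytes at both ends of the file."""
    if os.path.getsize(path) < 12:  # magic + footer length + magic
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def recovery_action(status, parquet_path):
    """Map a table's state to one of the recovery actions listed above."""
    if status != "building":
        return "none"
    if not os.path.exists(parquet_path):
        return "mark_failed"
    if not looks_like_parquet(parquet_path):
        return "delete_and_mark_failed"
    return "promote_to_ready_and_rebuild_index"


# Demo with placeholder files: one complete, one cut off before the footer.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.parquet")
    truncated = os.path.join(tmp, "truncated.parquet")
    missing = os.path.join(tmp, "missing.parquet")
    with open(good, "wb") as f:
        f.write(PARQUET_MAGIC + b"\x00" * 8 + PARQUET_MAGIC)
    with open(truncated, "wb") as f:
        f.write(PARQUET_MAGIC + b"\x00" * 8)
    for path in (missing, truncated, good):
        print(os.path.basename(path), "->", recovery_action("building", path))
```

A process that dies mid-write usually leaves the trailing magic bytes absent, which is why even this shallow check catches the truncated case.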

DataFusion integration

Result tables are automatically registered in DataFusion and queryable via SQL:

#![allow(unused)]
fn main() {
    extern crate jammi_db;
    extern crate jammi_ai;
    extern crate tokio;
    use jammi_ai::session::InferenceSession;
    use jammi_db::catalog::result_repo::ResultTableRecord;

    async fn ex(session: &InferenceSession, record: &ResultTableRecord) -> jammi_db::error::Result<()> {
        let results = session.sql(&format!(
            "SELECT _row_id, _source_id FROM \"jammi.{}\" LIMIT 10",
            record.table_name
        )).await?;
        Ok(())
    }
}