Extract Entities (NER)

Run a Named Entity Recognition model over text columns to extract person names, organizations, locations, and other entities. Results are returned as JSON arrays of entity spans with character positions and confidence scores.

Basic usage

Rust

#![allow(unused)]
fn main() {
extern crate jammi_db;
extern crate jammi_ai;
extern crate tokio;
use jammi_ai::session::InferenceSession;
use jammi_db::store::CachePolicy;
async fn ex(session: &InferenceSession) -> jammi_db::error::Result<()> {
use jammi_ai::model::{ModelSource, ModelTask};

let model = ModelSource::hf("dslim/bert-base-NER");
let (_results, _outcome) = session.infer(
    "patents",
    &model,
    ModelTask::Ner,
    &["abstract".to_string()],
    "id",
    CachePolicy::Bypass,
).await?;
Ok(()) }
}

Python

results = db.infer(
    source="patents",
    model="dslim/bert-base-NER",
    columns=["abstract"],
    task="ner",
    key="id",
)

Output schema

Column	Type	Description
`_row_id`	Utf8	Key column value
`_source`	Utf8	Source identifier
`_model`	Utf8	Model identifier
`_status`	Utf8	`"ok"` or `"error"`
`_error`	Utf8 (nullable)	Error message if failed
`_latency_ms`	Float32	Inference latency
`entities`	Utf8 (nullable)	JSON array of entity spans

Entity span format

Each entity in the JSON array has:

{
  "text": "Google",
  "label": "ORG",
  "start": 15,
  "end": 21,
  "confidence": 0.97
}

Field	Type	Description
`text`	string	The entity text extracted from the input
`label`	string	Entity type (PER, ORG, LOC, etc.) without B-/I- prefix
`start`	integer	Character start position (inclusive)
`end`	integer	Character end position (exclusive)
`confidence`	float	Average softmax confidence across entity tokens

Supported models

NER models must have id2label with BIO-tagged labels (e.g., B-PER, I-PER, O) in their config.json.

BERT family — loads classifier.weight + classifier.bias on top of the encoder:

dslim/bert-base-NER (English, 4 entity types)
dbmdz/bert-large-cased-finetuned-conll03-english

ModernBERT — same pattern, modern encoder architecture.

How it works

text → tokenize (with character offsets)
     → encoder forward → hidden states [batch, seq_len, hidden]
     → Linear(hidden, num_labels) per token → logits
     → softmax → argmax → BIO tag per token
     → merge consecutive B-/I- tags into entity spans
     → map character offsets back to original text

The BIO decoding handles:

B-TYPE: starts a new entity of that type
I-TYPE: continues the current entity (must match type)
O: outside any entity
Special tokens ([CLS], [SEP], padding) are automatically skipped

Keyboard shortcuts

Jammi AI Guide