Extract Entities (NER)

Run a Named Entity Recognition model over text columns to extract person names, organizations, locations, and other entities. Results are returned as JSON arrays of entity spans with character positions and confidence scores.

Basic usage

Rust

use jammi_ai::model::{ModelSource, ModelTask};
use jammi_ai::session::InferenceSession;

async fn ex(session: &InferenceSession) -> jammi_db::error::Result<()> {
    // Reference the NER model by its Hugging Face repository ID.
    let model = ModelSource::hf("dslim/bert-base-NER");

    // Run NER over the "abstract" column of the "patents" source, keyed by "id".
    let results = session.infer(
        "patents",
        &model,
        ModelTask::Ner,
        &["abstract".to_string()],
        "id",
    ).await?;
    Ok(())
}

Python

results = db.infer(
    source="patents",
    model="dslim/bert-base-NER",
    columns=["abstract"],
    task="ner",
    key="id",
)

Output schema

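| Column      | Type            | Description                       |
|-------------|-----------------|-----------------------------------|
| _row_id     | Utf8            | Key column value                  |
| _source     | Utf8            | Source identifier                 |
| _model      | Utf8            | Model identifier                  |
| _status     | Utf8            | "ok" or "error"                   |
| _error      | Utf8 (nullable) | Error message if failed           |
| _latency_ms | Float32         | Inference latency in milliseconds |
| entities    | Utf8 (nullable) | JSON array of entity spans        |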
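
Because `entities` is a nullable string column holding JSON, callers will usually decode it and guard against failed rows. The following is a minimal sketch of that step, assuming the results have already been converted to plain Python dicts; the row values, the error message, and the helper name `parse_entities` are hypothetical and not part of the library.

```python
import json

# Hypothetical result rows, shaped like the output schema above.
rows = [
    {
        "_row_id": "p1",
        "_status": "ok",
        "_error": None,
        "entities": '[{"text": "Google", "label": "ORG", "start": 15, "end": 21, "confidence": 0.97}]',
    },
    {
        "_row_id": "p2",
        "_status": "error",
        "_error": "example failure",  # made-up message for illustration
        "entities": None,
    },
]


def parse_entities(row):
    """Return the decoded entity list, or an empty list for failed or null rows."""
    if row["_status"] != "ok" or row["entities"] is None:
        return []
    return json.loads(row["entities"])


# Map each key value to its list of entity spans.
entities_by_row = {row["_row_id"]: parse_entities(row) for row in rows}
```

Checking `_status` before decoding keeps a single failed row from raising inside a batch loop.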

Entity span format

Each entity in the JSON array has:

{
  "text": "Google",
  "label": "ORG",
  "start": 15,
  "end": 21,
  "confidence": 0.97
}

| Field      | Type    | Description                                            |
|------------|---------|--------------------------------------------------------|
| text       | string  | The entity text extracted from the input               |
| label      | string  | Entity type (PER, ORG, LOC, etc.) without B-/I- prefix |
| start      | integer | Character start position (inclusive)                   |
| end        | integer | Character end position (exclusive)                     |
| confidence | float   | Average softmax confidence across entity tokens        |
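
Since `start` is inclusive and `end` is exclusive, the span maps directly onto a half-open slice. The sketch below illustrates this with a made-up input sentence chosen so that "Google" sits at positions 15 to 21. It assumes the offsets count Unicode characters rather than bytes, which the field descriptions suggest but do not state outright.

```python
# Hypothetical input text; "Google" begins at character index 15.
text = "He worked with Google"

entity = {"text": "Google", "label": "ORG", "start": 15, "end": 21, "confidence": 0.97}

# A half-open slice [start, end) recovers the entity text.
span = text[entity["start"]:entity["end"]]

# The span length is simply end - start.
length = entity["end"] - entity["start"]
```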

Supported models

NER models must have id2label with BIO-tagged labels (e.g., B-PER, I-PER, O) in their config.json.
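
One way to check this requirement before running inference is to inspect the `id2label` mapping from the model's config.json. The sketch below is only an illustration: the helper names `is_bio_label_map` and `entity_types` are hypothetical, and the sample mapping is a CoNLL-2003-style example written for this page, not copied from any particular model.

```python
def is_bio_label_map(id2label):
    """Return True if every label is "O" or B-TYPE / I-TYPE, with at least one B- label."""
    labels = list(id2label.values())

    def is_bio(label):
        if label == "O":
            return True
        prefix, sep, entity_type = label.partition("-")
        return prefix in ("B", "I") and sep == "-" and entity_type != ""

    return all(is_bio(label) for label in labels) and any(
        label.startswith("B-") for label in labels
    )


def entity_types(id2label):
    """Return the sorted entity types with the B-/I- prefix stripped."""
    return sorted({label[2:] for label in id2label.values() if label != "O"})


# Illustrative mapping in the CoNLL-2003 style (an assumption, not a real config).
id2label = {
    "0": "O",
    "1": "B-PER", "2": "I-PER",
    "3": "B-ORG", "4": "I-ORG",
    "5": "B-LOC", "6": "I-LOC",
    "7": "B-MISC", "8": "I-MISC",
}
```

A sentiment-style mapping such as `{"0": "NEGATIVE", "1": "POSITIVE"}` would fail this check, since its labels carry no B-/I- prefix.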

BERT family — loads classifier.weight + classifier.bias on top of the encoder:

  • dslim/bert-base-NER (English, 4 entity types)
  • dbmdz/bert-large-cased-finetuned-conll03-english

ModernBERT models follow the same pattern, with the classifier head loaded on top of the ModernBERT encoder.

How it works

text → tokenize (with character offsets)
     → encoder forward → hidden states [batch, seq_len, hidden]
     → Linear(hidden, num_labels) per token → logits
     → softmax → argmax → BIO tag per token
     → merge consecutive B-/I- tags into entity spans
     → map character offsets back to original text
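
The softmax and argmax steps of this pipeline can be sketched in plain Python. This is an illustration of the technique only, not the library's implementation: the logits, the three-label `id2label` mapping, and the function names are all made up for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tag_tokens(token_logits, id2label):
    """Return one (BIO tag, probability) pair per token via softmax then argmax."""
    tags = []
    for logits in token_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax
        tags.append((id2label[best], probs[best]))
    return tags


# Hypothetical label set and per-token logits of shape [seq_len, num_labels].
id2label = {0: "O", 1: "B-ORG", 2: "I-ORG"}
token_logits = [
    [4.0, 0.5, 0.1],  # highest logit at index 0
    [0.2, 5.0, 0.3],  # highest logit at index 1
]
tags = tag_tokens(token_logits, id2label)
```

The probability kept for each token is the softmax value of the winning label, which is what a per-entity average confidence would later be computed from.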

The BIO decoding handles:

  • B-TYPE: starts a new entity of that type
  • I-TYPE: continues the current entity (must match type)
  • O: outside any entity
  • Special tokens ([CLS], [SEP], padding) are automatically skipped
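
The decoding rules above can be sketched as a single pass over the tagged tokens. This is a minimal illustration under stated assumptions, not the library's code: special tokens are assumed to carry an empty offset such as (0, 0), an I- tag that does not match the current entity is assumed to close that entity and be dropped, and the function name `merge_bio` and the sample tokens are hypothetical.

```python
def merge_bio(text, tokens):
    """Merge (tag, (start, end), prob) tokens into entity span dicts."""
    entities = []
    current = None  # entity being built: label, start, end, probs

    def flush():
        nonlocal current
        if current is not None:
            probs = current["probs"]
            entities.append({
                "text": text[current["start"]:current["end"]],
                "label": current["label"],
                "start": current["start"],
                "end": current["end"],
                "confidence": sum(probs) / len(probs),  # average over tokens
            })
            current = None

    for tag, (start, end), prob in tokens:
        if start == end:
            # Assumption: special tokens and padding have an empty offset.
            continue
        if tag.startswith("B-"):
            flush()  # B-TYPE always starts a new entity
            current = {"label": tag[2:], "start": start, "end": end, "probs": [prob]}
        elif tag.startswith("I-") and current is not None and current["label"] == tag[2:]:
            current["end"] = end  # I-TYPE extends the matching entity
            current["probs"].append(prob)
        else:
            flush()  # "O", or an I- tag that does not continue the entity
    flush()
    return entities


# Hypothetical tagged tokens for a made-up sentence.
text = "He worked with Google"
tokens = [
    ("O", (0, 0), 0.99),       # [CLS]
    ("O", (0, 2), 0.99),       # He
    ("O", (3, 9), 0.99),       # worked
    ("O", (10, 14), 0.99),     # with
    ("B-ORG", (15, 21), 0.97), # Google
    ("O", (0, 0), 0.99),       # [SEP]
]
entities = merge_bio(text, tokens)
```

Taking the entity text as a slice of the original input, rather than joining token strings, means subword pieces and the whitespace between words are handled without extra logic.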