Related: Schema, CLI, Observability, Upsert Stream documents into a physical collection via JSONL bulk import with retries, stable memory, and notifications.Documentation Index
Fetch the complete documentation index at: https://nikita-shkoda.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
The Indexer focuses on importing documents. It does not create schemas, swap aliases, or enforce retention. Use
Schema.apply! (or the schema:apply task) for full blue/green lifecycle: create new physical → import → alias swap → retention cleanup.API
- into: physical collection name
- enum: enumerable yielding batches (Arrays of Hash documents)
- batch_size: soft guard for JSONL production; batches are not re-sliced unless handling 413
- action: defaults to
:upsert - max_parallel: maximum number of parallel threads for batch processing (default: 1, sequential). Set via mapper DSL with
max_parallelor override at runtime. See Parallel batch processing.
Data flow
JSONL format
- One JSON object per line
- Newline between documents; trailing newline optional
- Strings are escaped by the JSON library
Retries & backoff
- Transient errors (timeouts, connection, 429, 5xx) are retried with exponential backoff and jitter
- Non-transient errors (401/403/404/400/422) are not retried
- 413 Payload Too Large splits the batch recursively until it fits
Parallel batch processing
For large collections, you can speed up indexing by processing multiple batches simultaneously using parallel threads. This is especially useful when network latency is a bottleneck.Configuration
Enable parallel processing in your mapper DSL:max_parallel setting controls how many batches are processed simultaneously. Each thread gets its own Typesense client instance and buffer to avoid conflicts.
Usage examples
- Large collections with thousands of batches
- Network latency is the main bottleneck (not CPU or memory)
- Your Typesense server can handle concurrent requests
- You have sufficient memory for multiple buffers
- Small collections (< 100 batches)
- CPU-bound workloads (parallel processing adds overhead)
- Very memory-constrained environments
- When debugging (sequential processing has clearer error messages)
How it works
Parallel batch processing uses a producer-consumer pattern with a thread pool:- Producer thread: Fetches batches from your source (ActiveRecord, SQL, etc.) and adds them to a queue
- Worker threads: Multiple threads pull batches from the queue and process them concurrently
- Thread safety: Each worker gets its own Typesense client and buffer to avoid conflicts
- Statistics: All counters are synchronized using a mutex to ensure accurate reporting
max_parallel * 2 to keep workers busy while the producer fetches more batches.
Parallel imports run inside an InterruptiblePool wrapper that ensures clean shutdown on Ctrl+C. On normal completion, the pool drains gracefully (up to a configurable timeout). On Interrupt, worker threads are killed promptly (within 10 seconds) instead of waiting for the full graceful-shutdown window. This prevents stuck processes during interactive development.
When the graceful-shutdown timeout is exceeded (default: 3600 seconds), the pool kills remaining workers and raises SearchEngine::Errors::PartitionTimeout. The error reports which partitions were lost so operators can identify what needs re-running.
When a batch fails, the error is caught and recorded with full statistics (document count, failure details) just like in sequential processing. Failed batches are properly counted in the summary’s docs_total, failed_total, and batches array to ensure accurate reporting.
When indexation is run inside Schema.apply! (e.g., via .index_collection), a non-ok result raises SearchEngine::Errors::IndexationAborted, which prevents the alias swap. The partially-indexed new physical is automatically cleaned up to prevent orphan accumulation. See Schema → Failed indexation safety.
Live progress rendering
On TTY terminals, indexation displays real-time progress using theLiveRenderer. Each partition gets a dedicated line with a
braille spinner, a doc-based progress bar (when estimates are available),
and elapsed time. The renderer is viewport-aware — when the number of
partitions exceeds terminal height, it switches to a compact mode
showing active slots and a summary header.
Lifecycle steps (Presence, Schema Status,
Indexing, Retention) use
StepLine for in-place overwriting with animated spinners.
On non-TTY environments (CI, pipes, redirected output), the renderer
falls back to static one-line-per-partition completion output, preserving
pipe compatibility.
Memory notes
- Operates strictly batch-by-batch, reusing a single buffer per thread
- No accumulation of all records in memory; per-batch array may be materialized to support 413 splitting
- Parallel processing: each worker thread maintains its own buffer and client instance (memory usage scales with
max_parallel)
Instrumentation
- Emits
search_engine.indexer.batch_importper attempted batch - Payload includes:
collection,batch_index,docs_count,success_count,failure_count,attempts,duration_ms,http_status,bytes_sent,transient_retry,error_sample
Dry-run
SearchEngine::Indexer.dry_run!(…)builds JSONL for the first batch only and returns{ collection, action, bytes_estimate, docs_count, sample_line }
Data Sources
Adapters provide batched records for the Indexer in a memory-stable way. Each adapter implementseach_batch(partition:, cursor:) and yields arrays.
Examples:
partitionandcursorare opaque; adapters interpret them per-domain (e.g., id ranges, keyset predicates, external API tokens).- Instrumentation: emits
search_engine.source.batch_fetchedandsearch_engine.source.error.
Mapper
Backlinks: Models, Schema| Model field | Document field | Transform |
|---|---|---|
id | id | identity |
publisher_id | publisher_id | identity |
author_id | author_id | identity |
author.name | author_name | rename + safe navigation |
price_cents | price_cents | identity |
- Missing required fields: the mapper validates declared attributes;
idis injected and not required to be declared. - Unknown fields: warns by default; set
SearchEngine.config.mapper.strict_unknown_keys = trueto error. - Type checks: invalid types reported (e.g.,
Invalid type for field :price_cents (expected Integer, got String: “12.3”).). - Coercions: enable with
SearchEngine.config.mapper.coercions[:enabled] = true(safe integer/float/bool only).
Array empty filtering (hidden flags)
If an array attribute is declared withempty_filtering: true, the mapper auto-populates a hidden boolean <name>_empty per document:
-
mapper = SearchEngine::Mapper.for(SearchEngine::Book) -
docs, report = mapper.map_batch!(rows, batch_index: 1) - Emits
search_engine.mapper.batch_mappedper batch with:collection,batch_index,docs_count,duration_ms,missing_required_count,extra_keys_count,invalid_type_count,coerced_count.
Partitioning
Backlinks: Schema, CLI, ObservabilityModel-level shortcut
You can call the same operation directly on the collection model. This delegates to the Indexer and returns the sameSearchEngine::Indexer::Summary (or an Array when multiple partitions are provided).
Summary return value
SearchEngine::Indexer::Summary includes:
collection: logical collection namestatus::ok,:partial, or:failedbatches_total: total number of batches processeddocs_total: total documents processedsuccess_total: successfully indexed documentsfailed_total: failed documentsfailed_batches_total: count of batches with failuresduration_ms_total: total wall-clock duration in millisecondsbatches: array of per-batch stats
partitionsmust return an Enumerable of keys;partition_fetchmust return an Enumerable of batches (Arrays of records).- Hooks are optional; if provided, they must accept exactly one argument (the partition key).
- When
partition_fetchis missing, the source adapter is used with the partition passed through; for ActiveRecord sources, provide aHash/Rangepartition or definepartition_fetch.
Model indexation (.index_collection)
Backlinks: Schema, CLI, Observability, Models
High-level convenience API that orchestrates schema lifecycle and (partitioned) indexing from the model class.
- Full flow when
partitionis nil:- Presence check (
Schema.diff). - If missing → create new physical and apply schema, then import; else skip.
- If present → check drift; report
in_syncvsdrift. - If drift → apply schema (create new physical, import, alias swap, retention).
- If nothing was applied in steps 2–4 → index into the current alias (single or per partition).
- Retention cleanup: skipped when
Schema.apply!ran (already handled); otherwise best‑effort cleanup of old physicals beyondkeep_last.
- Presence check (
- Partial flow when
partition:is set:- Presence check; if missing → quit early with a message.
- Schema status; if drift → quit early with a message to run full indexation.
- Index only the provided partition(s) into the current alias.
Preflight dependency indexation (optional)
You can ask the engine to walk direct and transitivebelongs_to dependencies and ensure they are ready before indexing the current collection:
- Preflight walks only
belongs_toedges recursively, skipping unregistered collections. - Cycles are guarded with a visited set; already-visited collections are skipped.
- Default remains unchanged when
pre:is omitted.
- Emits concise console lines for each step and per‑partition result, for example:
Targeted bulk helpers (Bulk.index_collections / Bulk.reindex_collections!)
- Run blue/green or destructive indexation for a specific set of collections (not just “all”).
- Two-stage plan:
- Stage 1: inputs that are not referrers of other inputs (referenced-first order).
- Stage 2: unique referencers of any input, topologically sorted and processed once.
- Cascades are suppressed inside individual runs; the final cascade stage handles referencers exactly once.
Return value
Bulk.index_collections and Bulk.reindex_collections! return a hash with:
mode::indexor:reindexinputs: array of input collection namesstage_1: array of stage 1 collection namescascade: array of cascade stage collection namesinputs_count: number of input collectionsstage_1_count: number of stage 1 collectionscascade_count: number of cascade collectionsfailed_collections_total: count of unresolved collection targets
Bulk indexing (all collections)
Use Bulk helpers to index or reindex every declared SearchEngine collection discovered under your configuredsearch_engine_models directory. The engine ensures models are eagerly loaded, discovers collections, and orchestrates a two‑stage, reference‑aware run (inputs first, then a deduped cascade of referencers).
- index_all uses each model’s
.index_collection(presence/drift checks, apply+retention when needed; otherwise index into current alias). - reindex_all! uses each model’s
.reindex_collection!(drop active physical, then.index_collection). - Discovery leverages the engine’s dedicated loader and
CollectionResolver.models_map; ensure your collections live underapp/search_engine(default) or your configuredSearchEngine.config.search_engine_modelspath. - Emits
search_engine.bulk.runwith{ inputs, stage_1, cascade, counts, failed_collections_total }. See Observability for payload details. - Return value: same hash structure as
Bulk.index_collections/Bulk.reindex_collections!, includingfailed_collections_total.
Stale Deletes
Backlinks: CLI, Observability, Troubleshooting, ConfigurationDefine
stale rules once. The same compiled filter powers .stale,
.cleanup, Indexer.delete_stale!, and index:delete_stale.- Compiles all declared
stalerules into an OR‑mergedfilter_bystring and issuesDELETE /collections/:collection/documentswithfilter_by. - If no rules are defined or all resolve to blank for the given partition, deletion is skipped.
- Strict-mode guardrails block suspicious catch-alls; enable via
SearchEngine.config.stale_deletes.strict_mode = true. - Dry-run preview:
SearchEngine::Indexer.delete_stale!(…, dry_run: true)returns a summary without deleting.
| Pattern | Example |
|---|---|
| Archived flag | archived:=true |
| Partition + archived | store_id:=123 && archived:=true |
| Date threshold | updated_at:<“2025-01-01T00:00:00Z” or updated_at:<1704067200 |
search_engine.stale_deletes.started—{ collection, into, partition, filter_hash }search_engine.stale_deletes.skipped—{ reason, collection, into, partition }search_engine.stale_deletes.finished—{ collection, into, partition, duration_ms, deleted_count }search_engine.stale_deletes.error—{ collection, into, partition, error_class, message_truncated }
Inspect stale docs via a Relation
You can preview or chain on the stale set using a scope-like class method that returns a Relation:-
.stalecompiles onlystalerules and returns an empty relation when no rules produce a filter for the given partition (implemented as a safe, always‑false filter so Typesense never 400s). -
For destructive deletes, prefer
Model.cleanupor theindex:delete_staletask. -
SearchEngine.config.stale_deletes.enabled = true -
SearchEngine.config.stale_deletes.strict_mode = false -
SearchEngine.config.stale_deletes.timeout_ms = nil -
SearchEngine.config.stale_deletes.estimation_enabled = false
Stale cleanup DSL & model helper
Define stale cleanup rules alongside your index DSL. Eachstale call registers
one rule; cleanup runs them with OR semantics and uses the same compiled filter as
Indexer.delete_stale!.
SearchEngine::Base.cleanup delegates to SearchEngine::Deletion.delete_by, emitting the same search_engine.indexer.delete_stale instrumentation. When no stale configuration exists, it logs a skip and returns 0.
Pass clear_cache: true to clear the Typesense search cache after cleanup
finishes. This invokes SearchEngine::Cache.clear and emits
search_engine.cache.clear. Failures are logged and do not affect the cleanup
result.
Available stale inputs:
| Form | Result | ||
|---|---|---|---|
stale scope: :archived | Invokes a model scope; must return a Relation | ||
stale :flagged / stale attribute: :flagged, value: false | Attribute equality Hash | ||
stale({ flag: true, archived: true }) | Sanitized filter_by hash | ||
stale filter: ‘status:=archived’ | Raw Typesense filter string | ||
stale relation | Uses the relation’s compiled filter_by | ||
| `stale do | partition: | … end` | Custom block returning String/Hash/Relation |
- Multiple rules are OR‑ed together.
- Block arguments receive
partition:; return a String, Hash, or Relation.nil/blank results are ignored. - Scope helpers receive the partition as the first positional argument when their arity is ≥ 1, or as
partition:when they declare a keyword. - Attribute Hash inputs default values to
truewhen omitted (e.g.,stale :flagged).
Dispatcher
Backlinks: CLI, Observability, Configuration, Jobs- What it does: routes per-partition rebuilds either synchronously (inline) or via
ActiveJob. - Note: This setting controls
SearchEngine::Indexerlogic (used byrebuild_partition!,index_collectionflow). It does not affectActiveRecordSyncablecallbacks, which are always inline. - API:
SearchEngine::Dispatcher.dispatch!(SearchEngine::Book, partition: key, into: nil, mode: nil, queue: nil, metadata: {})- Returns an object describing the action: for ActiveJob
{ mode: :active_job, job_id, queue, collection, partition, into }; for inline{ mode: :inline, collection, partition, into, indexer_summary, duration_ms }.
- Returns an object describing the action: for ActiveJob
- Mode resolution:
mode || SearchEngine.config.indexer.dispatchwith a fallback to:inlineif ActiveJob is unavailable. - Queue name: taken from
queue || SearchEngine.config.indexer.queue_name.
SearchEngine::IndexPartitionJob.
- Args:
collection_class_name(String),partition(JSON-serializable), optionalinto(String), optionalmetadata(Hash).
When to choose model-level upserts
For light-touch updates or small repair batches, the model helpers described in Upsert are often faster than orchestrating a full Indexer run:- Single or sparse fixes — use
SearchEngine::Model.upsert(record: …)to map and stream one document without building custom JSONL buffers. - Small batches —
SearchEngine::Model.upsert_bulk(records: …)streams an enumerable of records through the mapper and client, reusing the same validation path as the Indexer. - Blue/green touch-ups — the helpers accept
into:/partition:to target a specific physical collection during staged rollout.
Troubleshooting
- Bulk import shape errors: Ensure each document is a flat Hash with required keys and valid types.
- Retry exhaustion: Inspect
search_engine.indexer.batch_importevents; increase backoff or fix upstream issues. - Stale deletes strict block: Verify your filter is not a catch-all; use partition guards.
- PartitionTimeout: The parallel pool exceeded its graceful-shutdown timeout. Check for slow partitions or increase
timeout_msin config. - IndexationAborted (alias not swapped): Indexation returned a non-ok status during
Schema.apply!. Inspect the new physical for import errors and retry. See Schema → Failed indexation safety.