Bulk import¶

Two paths for loading many vectors:

Client-side importers — Python reads the file, streams over gRPC. Good for most workloads up to a few million rows.
Server-side bulk import — the server reads directly from S3/MinIO. 3–5× faster for billion-scale loads.

Client-side importers¶

Install extras¶

pip install gvdb[import]       # Parquet, NumPy, Pandas, tqdm
pip install gvdb[import-all]   # plus AnnData + polars

Extra	Deps
`gvdb[parquet]`	pyarrow
`gvdb[numpy]`	numpy
`gvdb[pandas]`	pandas, pyarrow
`gvdb[h5ad]`	anndata, numpy
`gvdb[progress]`	tqdm

Shared parameters¶

Every import_* helper accepts these kwargs:

Arg	Default	Description
`batch_size`	`10_000`	Rows per insert RPC
`mode`	`"upsert"`	`"upsert"` (idempotent, safe to re-run) or `"stream_insert"` (faster)
`metric`	`"cosine"`	Distance metric when auto-creating the collection
`index_type`	`"auto"`	Index type when auto-creating the collection
`max_retries`	`3`	Retries per batch on transient errors
`show_progress`	`True`	tqdm progress bar (requires `gvdb[progress]`)

NumPy¶

import numpy as np
from gvdb import GVDBClient

client = GVDBClient("localhost:50051")

vectors = np.random.rand(100_000, 768).astype(np.float32)
result = client.import_numpy(
    vectors,
    "embeddings",
    ids=None,                       # None → sequential IDs starting from 0
    metadata=None,
    batch_size=10_000,
)
print(result)
# ImportResult(total=100000, batches=10, failed=0, elapsed=12.3s, ...)

Parquet¶

Expected schema: id (int), vector (list), plus any metadata columns.

result = client.import_parquet(
    "vectors.parquet",
    "embeddings",
    vector_column="vector",
    id_column="id",
)

Pandas / Polars DataFrame¶

import pandas as pd

df = pd.DataFrame({
    "id": range(1000),
    "vector": [[...] for _ in range(1000)],
    "category": [...],
    "price": [...],
})

result = client.import_dataframe(
    df,
    "products",
    vector_column="vector",
    id_column="id",
)

Non-vector, non-id columns become per-vector metadata (scalars only: int, float, str, bool).

CSV¶

Two vector encodings are auto-detected:

# 1. JSON array in a single column: "[0.1, 0.2, 0.3]"
result = client.import_csv("data.csv", "embeddings", vector_column="vector", id_column="id")

# 2. Dimension-prefixed columns: vector_0, vector_1, ..., vector_N
result = client.import_csv("wide.csv", "embeddings", vector_column="vector")

AnnData (`.h5ad`)¶

For single-cell workflows — import cell embeddings, obs columns as metadata:

result = client.import_h5ad(
    "adata.h5ad",
    "cells",
    embedding_key="X_pca",          # or "X_umap", "X_scvi", ...
    id_column=None,                 # defaults to row index
    metadata_columns=None,          # None = include all obs columns
)

`ImportResult`¶

ImportResult(
    total_count=100_000,
    batch_count=10,
    failed_count=0,
    elapsed_seconds=12.3,
    collection="embeddings",
    dimension=768,
    created_collection=True,
)

Server-side bulk import¶

For S3/MinIO-backed loads, skip gRPC entirely — the server downloads the file and writes segments directly:

import_id = client.bulk_import(
    "my_collection",
    source_uri="s3://my-bucket/embeddings.parquet",
    format="parquet",           # "parquet" or "numpy"
    vector_column="vector",
    id_column="id",
)

status = client.wait_for_import(import_id, poll_interval=2.0, timeout=3600.0)
# status == {"state": 2, "total_vectors": 1_000_000, "imported_vectors": 1_000_000,
#            "progress_percent": 100.0, "segments_created": 12, ...}

Polling and cancellation:

status = client.get_import_status(import_id)
client.cancel_import(import_id)      # -> True if accepted

state is an integer:

Value	Meaning
0	PENDING
1	RUNNING
2	COMPLETED
3	FAILED
4	CANCELLED

The collection must already exist when you call bulk_import — unlike client-side importers, the server-side path does not auto-create collections.

Alternatives for very large workloads¶

Spark for parallel loads from data lakes — see the Spark connector.
Flink for streaming ingestion — see the Flink connector.