Skip to content

Streaming ingestion

Continuously index embeddings as they arrive. GVDB ships first-party connectors for Apache Spark and Apache Flink.

When to choose which

Workload Choose
Batch ETL from data lakes (Parquet, Delta, Iceberg) Spark
Real-time streaming (Kafka, Kinesis) with exactly-once Flink
One-shot bulk load from a file Python SDK bulk import
gRPC client-streaming from your own producer Python/Java SDK streaming inserts

Spark (batch + streaming)

PySpark example writing a DataFrame to GVDB:

df.write.format("io.gvdb.spark") \
    .option("gvdb.target", "proxy.gvdb.svc:50051") \
    .option("gvdb.collection", "embeddings") \
    .option("gvdb.dimension", "768") \
    .option("gvdb.metric", "cosine") \
    .option("gvdb.index_type", "auto") \
    .option("gvdb.batch_size", "5000") \
    .mode("append") \
    .save()

Full walkthrough: Spark connector.

Java sink consuming an embedding stream:

var sink = GvdbSink.<Embedding>builder()
    .setTarget("proxy.gvdb.svc:50051")
    .setCollection("events")
    .setBatchSize(1000)
    .setMaxRetries(3)
    .setRecordMapper(e -> new GvdbVector(
        e.id(), e.vector(), Map.of("category", e.category())))
    .build();

stream.sinkTo(sink);

Full walkthrough: Flink connector.

Exactly-once semantics

  • Flink: the GVDB sink participates in Flink checkpoints. Combined with GVDB's upsert idempotency, you get exactly-once end-to-end.
  • Spark: batch jobs are idempotent via upsert mode; retried tasks produce the same result.

Back-pressure and retries

Both connectors:

  • Batch inserts (default 1,000–5,000 records)
  • Retry on transient errors with exponential backoff (default 3 attempts)
  • Fail the task on permanent errors, letting Spark/Flink re-schedule

Distributed clusters

Point the target at your Helm-deployed proxy. The proxy fan-outs inserts to the appropriate shard's primary data node; replication is handled server-side.

See also