Introducing OrangeDB: A Distributed Key-Doc Database

16 Aug 2025

A couple of months ago, I started building a small LSM tree-based storage engine, which I call Parrot.

If you’re unfamiliar with LSM trees—unlike B-trees, which require frequent reorganizations—LSM (Log-Structured Merge) trees write data as sequential logs. One of the main problems with B/B+ trees is write amplification. Inserting a single row might trigger multiple rearrangements across various index trees, leading to costly random disk writes, especially in persistent systems.

LSM trees avoid this by leveraging sequential writes, making write operations extremely fast. But there’s a tradeoff—read performance. Since records are spread across multiple levels, a read operation might need to scan through several files until it finds the desired record.

Read more about LSM here.

What Makes Parrot Different?

Parrot isn’t just another LSM tree engine. It includes several optimizations aimed at speeding up I/O and improving concurrency.

Chain of Memtables
What happens to writes while flushing the memtable? Many systems use a double-buffered approach—write to one while flushing the other. But if the second also fills up before flushing completes, things get messy.
Parrot uses a chain of memtables, where new memtables are continuously added to a queue. The flusher works independently, flushing from the other end of the queue, enabling uninterrupted writes.
Optimistic Concurrency Control
Parrot minimizes locking, using optimistic strategies for concurrent writes and reads.
Zero-Copy Reads
Parrot uses shared memory mapping (mmap) to read files. It avoids copying file data to user-space buffers; threads read directly from the kernel page cache, reducing latency and memory usage.

Introducing OrangeDB

Orange is a distributed key-document database built on top of Parrot. It provides:

SQL-like interface via OQL
Native JSON-like document encoding
Rich set of data types
Sharding and replication out of the box

Sharding, Consistency, and Replication

OrangeDB combines the architectures of MongoDB and Cassandra, striking a balance between high availability and consistency.

An Orange cluster consists of three main components:

Router
Uses consistent hashing to route queries to the appropriate shard. It also supports a concept we call “consistent ring within a consistent ring” to map keys precisely to replicas.
Shard
A group of replicas (or “siblings”) running OrangeDB instances. These replicas use a leaderless model—any replica that receives a write replicates it to others, similar to Cassandra’s gossip-style replication.
Operator
A custom Kubernetes Operator for deploying and managing OrangeDB clusters.

Read Consistency Levels

OrangeDB supports configurable read strategies:

all
Queries all siblings of a shard and requires unanimous agreement on the value.
If any replica is out of sync, the read fails.
This mode provides strong consistency, but may have higher latency.
quorum
Queries all siblings but only requires a majority to agree on the value.
Offers a balance between consistency and availability.
Can optionally trigger read repair to sync outdated replicas.
single
Queries only the primary replica responsible for the key.
This is the fastest mode but assumes the primary is alive and up-to-date.
Suitable when you need strong consistency with low latency, but primary node is a single point of failure.

Deployment Options

OrangeDB is available in multiple flavors.

Standalone Mode

Run a single OrangeDB instance with gRPC endpoints. Use the Go client to send OQL or gRPC-based queries.

app % orange server --port 8000
  ___                             
 / _ \ _ __ __ _ _ __   __ _  ___ 
| | | | '__/ _' | '_ \ / _' |/ _ \
| |_| | | | (_| | | | | (_| |  __/
 \___/|_|  \__,_|_| |_|\__, |\___|
                       |___/      
Running project: `orange`
Starting server at 127.0.0.1:8000

You can interact with it using the REPL:

app % orange repl --port 8000 --address localhost
# Welcome to orange
> SELECT * FROM users WHERE _ID = 10

Sharded Cluster on Kubernetes

Step 1: Deploy OrangeDB Operator

kubectl apply -f https://raw.githubusercontent.com/nagarajRPoojari/orangectl/main/dist/install.yaml

Step 2: Define Orange Cluster

# orange.yaml
apiVersion: ctl.orangectl.orange.db/v1alpha1
kind: OrangeCtl
metadata:
  name: orangectl-sample
  labels:
    app.kubernetes.io/name: orangectl
    app.kubernetes.io/managed-by: kustomize
spec:
  namespace: default

  router:
    name: router
    labels:
      app: orange-router
      tier: control
    image: np137270/gateway:latest
    port: 8000
    config:
      ROUTING_MODE: "hash"
      LOG_LEVEL: "info"

  shard:
    name: shard
    labels:
      app: orange-shard
      tier: data
    image: np137270/orange:latest
    count: 2                # Number of shards
    replicas: 2             # Replicas per shard
    port: 52001
    config:
      STORAGE_PATH: "/app/data"
      CACHE_ENABLED: "true"

kubectl apply -f orange.yaml
kubectl port-forward svc/router 8000:8000

Step 3: Try Some Queries

echo "Creating schema..."
curl -X POST http://localhost:8000/  \
     -H "Content-Type: application/json" \
     -d '{"query": "create document test {\"name\":\"STRING\"}"}'

echo "Inserting documents from ID 1 to 20..."
for i in $(seq 1 20); do
  curl -s -X POST http://localhost:8000/ \
       -H "Content-Type: application/json" \
       -d "{\"query\": \"insert value into test  {\\\"_ID\\\": $i, \\\"name\\\": \\\"hello-$i\\\"}\"}"
done

echo "Searching for document with _ID = 13..."
curl -X POST http://localhost:8000/  \
     -H "Content-Type: application/json" \
     -d '{"query": "select * from test where _ID = 13"}'

What’s Next?

OrangeDB is still evolving. We’re working on:

Transaction support
Better indexing strategies
Integration with stream processing engines

Feedback, contributions, and critiques are welcome.