SlothDB Documentation

SlothDB is a fast in-process SQL database for your local data files. It runs inside your application or as a standalone shell, reads seven file formats natively, and outperforms DuckDB on every query in its shipped benchmark suite.

This page is the short reference. For long-form guides, see the full documentation on GitHub. For the SQL cheat sheet, see the SQL reference.

Installation

SlothDB ships as a single binary with no runtime dependencies. Pick whichever entry point fits your workflow:

Python

pip install slothdb
python -c "import slothdb; slothdb.demo()"

The demo() call generates a 100 000-row synthetic CSV, runs three queries with timing, and (if duckdb is installed) prints a side-by-side comparison.

Linux / macOS (install script)

curl -fsSL https://raw.githubusercontent.com/SouravRoy-ETL/slothdb/main/install.sh | bash

Windows

Download the release archive from the latest release, unzip it, and run slothdb.exe.

Build from source

git clone https://github.com/SouravRoy-ETL/slothdb
cd slothdb
cmake -B build -DSLOTHDB_BUILD_SHELL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release

Your first query

$ slothdb
slothdb> SELECT 'Hello, world!' AS greeting;
slothdb> SELECT * FROM 'sales.csv' LIMIT 5;

No CREATE TABLE. No COPY FROM. SlothDB auto-detects the format and streams results.
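Auto-detection of this kind is typically driven by the file extension. A minimal Python sketch of extension-based dispatch (names are illustrative, not SlothDB's internals):

```python
from pathlib import Path

# Hypothetical mapping from file extension to reader; illustrates
# extension-based format dispatch, not SlothDB's actual implementation.
READERS = {
    ".csv": "read_csv",
    ".parquet": "read_parquet",
    ".json": "read_json",
    ".avro": "read_avro",
    ".arrow": "read_arrow",
}

def pick_reader(path: str) -> str:
    """Return the reader name for a path, based on its extension."""
    ext = Path(path).suffix.lower()
    try:
        return READERS[ext]
    except KeyError:
        raise ValueError(f"unsupported file format: {ext!r}")

print(pick_reader("sales.csv"))       # read_csv
print(pick_reader("events.parquet"))  # read_parquet
```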

Querying Parquet

-- Auto-detected by extension
SELECT * FROM 'events.parquet' WHERE year >= 2023;

-- Explicit function call for globs and options
SELECT * FROM read_parquet('logs/*.parquet');

Supports zstd, snappy, and gzip compression. Filter pushdown drops whole row groups when possible.
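Row-group pruning works because each Parquet row group carries min/max statistics per column, so a filter like year >= 2023 can discard a group without reading it. A simplified Python sketch of the idea (not the actual pruning code):

```python
# Each row group records min/max statistics for its columns.
# A filter "year >= 2023" can skip any group whose max_year is below 2023.
row_groups = [
    {"min_year": 2019, "max_year": 2020},
    {"min_year": 2021, "max_year": 2022},
    {"min_year": 2022, "max_year": 2024},
]

def survives(group, threshold):
    """Keep the group only if it could contain rows with year >= threshold."""
    return group["max_year"] >= threshold

kept = [g for g in row_groups if survives(g, 2023)]
print(len(kept))  # 1 -- the first two groups are dropped without being read
```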

Querying CSV

SELECT region, SUM(revenue)
FROM read_csv('sales/*.csv')
GROUP BY region;

RFC 4180 compliant. Handles quoted fields, embedded newlines, and custom delimiters. Headers auto-detected.
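RFC 4180 quoting is what lets a single field carry commas and newlines. Python's own csv module follows the same rules, so it illustrates what the parser has to handle:

```python
import csv
import io

# A quoted field may contain the delimiter and even an embedded newline;
# RFC 4180 says this is still one record.
raw = 'region,note\n"EMEA","line one\nline two"\nAPAC,plain\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])    # ['EMEA', 'line one\nline two'] -- one record, despite the newline
print(len(rows))  # 3: the header plus two data rows
```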

Querying JSON

-- JSON arrays, objects, and NDJSON all work
SELECT status, COUNT(*)
FROM 'api_logs.json'
GROUP BY status;
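The GROUP BY above amounts to counting statuses after parsing. NDJSON (one JSON object per line) decomposes naturally in plain Python:

```python
import json
from collections import Counter

# NDJSON: one JSON object per line, as a logger might write api_logs.json.
ndjson = """\
{"status": 200, "path": "/a"}
{"status": 404, "path": "/b"}
{"status": 200, "path": "/c"}
"""

# Equivalent of SELECT status, COUNT(*) ... GROUP BY status.
counts = Counter(json.loads(line)["status"] for line in ndjson.splitlines())
print(counts)  # Counter({200: 2, 404: 1})
```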

Querying Avro

SELECT * FROM read_avro('events.avro');

Avro is built in. No extension load needed. This is one of SlothDB's biggest wins over DuckDB (which requires INSTALL avro; LOAD avro;).

Querying Arrow IPC

SELECT * FROM read_arrow('file.arrow');

Querying SQLite

SELECT * FROM sqlite_scan('app.db', 'users');

Reads an existing SQLite file in place. No conversion. Columnar streaming on top of SQLite's row-oriented storage gives a 10–100× win over raw SQLite on analytical aggregates.
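The same table can be scanned with Python's built-in sqlite3 module; the columnar-streaming idea is to pull rows in chunks, transpose each chunk into per-column arrays, and aggregate column-at-a-time. A rough sketch of that batching pattern (not SlothDB's scanner):

```python
import sqlite3

# Build a small in-memory SQLite database with a users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, 20 + i % 50) for i in range(1000)])

# Row-oriented storage streamed into columnar batches: fetch a chunk of
# rows, transpose it into a column, then aggregate over the column.
cur = conn.execute("SELECT age FROM users")
total = 0
while True:
    batch = cur.fetchmany(256)          # one "morsel" of rows
    if not batch:
        break
    ages = [row[0] for row in batch]    # the transposed age column
    total += sum(ages)

print(total)  # 44500
```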

Python API

import slothdb

# Connect (empty string = in-memory)
db = slothdb.connect()
db = slothdb.connect("analytics.slothdb")   # persistent

# Query returns QueryResult
result = db.sql("SELECT * FROM 'data.csv'")
result.show()                              # pretty print
rows = result.fetchall()                    # list of tuples
df = result.fetchdf()                       # pandas DataFrame

# One-shot shortcut
slothdb.sql("SELECT 42").show()

Context manager: with slothdb.connect() as db: ... closes the database when the block exits.

CLI shell

$ slothdb                      # interactive, in-memory
$ slothdb analytics.slothdb    # interactive, persistent
$ slothdb -c "SELECT * FROM 'sales.csv' LIMIT 5"
$ slothdb -f queries.sql       # run a SQL script

# Inside the shell:
slothdb> .tables         # list tables
slothdb> .schema sales   # describe a table
slothdb> .help           # full command list
slothdb> .quit

C / C++ API

#include <slothdb/api/slothdb.h>

slothdb_database *db;
slothdb_connection *conn;
slothdb_result *result;

slothdb_open("analytics.slothdb", &db);
slothdb_connect(db, &conn);
slothdb_query(conn,
  "SELECT region, SUM(revenue) FROM 'sales.csv' GROUP BY region",
  &result);

for (uint64_t r = 0; r < slothdb_row_count(result); r++)
  printf("%s: %s\n",
    slothdb_value_varchar(result, r, 0),
    slothdb_value_varchar(result, r, 1));

slothdb_free_result(result);
slothdb_disconnect(conn);
slothdb_close(db);

Headers are in include/slothdb/. Link against the built static or shared library. CMake find_package is supported.
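Assuming the package exports a namespaced target (the exact target name is unverified; check the installed CMake config), a consuming CMakeLists.txt would look roughly like:

```cmake
cmake_minimum_required(VERSION 3.16)
project(my_app C)

# Package and target names are illustrative; confirm them against the
# slothdb CMake config files installed alongside the library.
find_package(slothdb REQUIRED)

add_executable(my_app main.c)
target_link_libraries(my_app PRIVATE slothdb::slothdb)
```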

SQL

SlothDB implements 130+ SQL features covering the standard OLAP workload:

Joins and CTEs

-- All join types supported
WITH top_regions AS (
  SELECT region, SUM(revenue) AS total
  FROM 'sales.parquet'
  GROUP BY region
  ORDER BY total DESC LIMIT 5
)
SELECT s.*, t.total
FROM 'sales.parquet' s
INNER JOIN top_regions t USING (region);
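In plain Python, the CTE-plus-join above amounts to: aggregate revenue per region, keep the top five, then retain only sales rows whose region survived, attaching the total. A sketch:

```python
from collections import defaultdict

sales = [
    {"region": "EMEA", "revenue": 120},
    {"region": "APAC", "revenue": 300},
    {"region": "EMEA", "revenue": 80},
    {"region": "LATAM", "revenue": 50},
]

# CTE: total revenue per region, top 5 by total.
totals = defaultdict(int)
for row in sales:
    totals[row["region"]] += row["revenue"]
top_regions = dict(sorted(totals.items(), key=lambda kv: -kv[1])[:5])

# INNER JOIN ... USING (region): keep rows in a top region, attach its total.
joined = [
    {**row, "total": top_regions[row["region"]]}
    for row in sales
    if row["region"] in top_regions
]
print(joined[0])  # {'region': 'EMEA', 'revenue': 120, 'total': 200}
```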

Window functions

-- Top earner per department
SELECT name, department, salary
FROM 'employees.parquet'
QUALIFY ROW_NUMBER() OVER
  (PARTITION BY department ORDER BY salary DESC) = 1;

-- Running total per region
SELECT region, order_date,
       SUM(revenue) OVER
         (PARTITION BY region ORDER BY order_date) AS running
FROM 'orders.csv';
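The running-total window above can be sketched with itertools.accumulate: sort by the partition key and order key, group by region, then accumulate within each group:

```python
from itertools import accumulate, groupby

orders = [
    ("EMEA", "2024-01-02", 10),
    ("EMEA", "2024-01-01", 5),
    ("APAC", "2024-01-01", 7),
]

# PARTITION BY region ORDER BY order_date: sort, group, then accumulate.
orders.sort(key=lambda o: (o[0], o[1]))
result = []
for region, rows in groupby(orders, key=lambda o: o[0]):
    rows = list(rows)
    for row, running in zip(rows, accumulate(r[2] for r in rows)):
        result.append((row[0], row[1], running))

print(result)
# [('APAC', '2024-01-01', 7), ('EMEA', '2024-01-01', 5), ('EMEA', '2024-01-02', 15)]
```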

Benchmarks

Same machine, same queries, 1M rows, 5-run median, warm cache. SlothDB is faster than DuckDB on every single query.

Format    Query             SlothDB   DuckDB   Speedup
CSV       COUNT(*)          33 ms     170 ms   5.08×
Parquet   COUNT(*)          12 ms     34 ms    2.83×
Parquet   SUM(revenue)      46 ms     48 ms    1.04×
JSON      SUM(revenue)      242 ms    314 ms   1.30×
Avro      SUM(revenue)      140 ms    760 ms   5.43×
Avro      GROUP BY region   170 ms    800 ms   4.71×
Excel     GROUP BY region   2.5 s     3.56 s   1.41×

Full 15-query breakdown in the README benchmarks section.

Architecture

SlothDB is a vectorized, morsel-parallel columnar execution engine.
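Morsel-driven parallelism means splitting the input into small fixed-size chunks ("morsels") that worker threads consume independently, with partial results combined at the end. A minimal Python sketch of the idea:

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))
MORSEL = 10_000  # rows per morsel

# Split the input into morsels; each worker aggregates one morsel,
# and the partial sums are combined at the end.
morsels = [data[i:i + MORSEL] for i in range(0, len(data), MORSEL)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum, morsels))

print(sum(partials))  # 499999500000, same as sum(data)
```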

Extensions

SlothDB has a stable C ABI for third-party extensions. Unlike extensions built against DuckDB's internal C++ API, extensions built against SlothDB v0.1 will keep working on v1.0, v2.0, and beyond.

// Register a custom function from an extension
slothdb_register_function(db, "my_func", my_callback);

Full extension API headers are in include/slothdb/extension/extension_api.h. Load at runtime with LOAD 'my_ext'.

Limitations

SlothDB is v0.1.4. What it does not do (yet):

Roadmap

FAQ

How is this different from DuckDB?

Same embedded model, different defaults. SlothDB ships Avro, Excel, and SQLite readers in core (DuckDB needs extensions), exposes a stable C ABI for extensions, uses numeric error codes, and is faster on every query in the shipped benchmark suite.

Can I use it with pandas?

Yes. db.sql("...").fetchdf() returns a pandas DataFrame.

Does it work on Apple Silicon?

Yes. SlothDB builds on x86_64 and arm64 on Linux, macOS, and Windows.

Is it production-ready?

It's v0.1.4 and about six months old. 359 tests pass on every CI run, but you'll find edge cases. Treat it as beta. Don't put it in front of paying customers yet without a load test.

How can I help?

File issues, open PRs, share benchmark results, or just tell people about it. See the community page.