SlothDB Documentation

SlothDB is an experimental in-process SQL database for local data files. It runs inside your application or as a standalone shell and reads seven file formats natively. See the Status section on the homepage for current architecture caveats before reading any performance numbers in this doc.

This page is the short reference. For long-form guides, see the full documentation on GitHub. For the SQL cheat sheet, see the SQL reference.

Installation

SlothDB ships as a single binary with no runtime dependencies. Pick whichever entry point fits your workflow:

Python

pip install slothdb
python -c "import slothdb; slothdb.demo()"

The demo() call generates a 100 000-row synthetic CSV, runs three queries with timing, and (if duckdb is installed) prints a side-by-side comparison.

Node.js / Browser (WebAssembly)

npm install @slothdb/wasm

import { SlothDB } from '@slothdb/wasm';

const db = await SlothDB.create();
const { columns, rows, ms } = db.query(
    "SELECT region, SUM(revenue) AS total " +
    "FROM '/data/sales.parquet' GROUP BY region"
);

1.3 MB wasm, works in Node ≥18 and every modern browser. Full engine, all seven file formats. Try it live without installing anything.

Linux / macOS (install script)

curl -fsSL https://raw.githubusercontent.com/SouravRoy-ETL/slothdb/main/install.sh | bash

Windows

Download slothdb.exe from the latest release. Unzip and run.

Build from source

git clone https://github.com/SouravRoy-ETL/slothdb
cd slothdb
cmake -B build -DSLOTHDB_BUILD_SHELL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release

Your first query

$ slothdb
slothdb> SELECT 'Hello, world!' AS greeting;
slothdb> SELECT * FROM 'sales.csv' LIMIT 5;

No CREATE TABLE. No COPY FROM. SlothDB auto-detects the format and streams results.

Querying Parquet

-- Auto-detected by extension
SELECT * FROM 'events.parquet' WHERE year >= 2023;

-- Explicit function call for globs and options
SELECT * FROM read_parquet('logs/*.parquet');

Supports zstd, snappy, and gzip compression. Filter pushdown drops whole row groups when possible.

Querying CSV

SELECT region, SUM(revenue)
FROM read_csv('sales/*.csv')
GROUP BY region;

RFC 4180 compliant. Handles quoted fields, embedded newlines, and custom delimiters. Headers auto-detected.

Querying JSON

-- JSON array, object, or NDJSON, all work
SELECT status, COUNT(*)
FROM 'api_logs.json'
GROUP BY status;

Querying Avro

SELECT * FROM read_avro('events.avro');

Avro is built in. No extension load needed. DuckDB reads Avro through a separately-installed extension (INSTALL avro; LOAD avro;); SlothDB reads it through a native typed decoder in the core binary, which on our bench is about 5.4× faster on SUM over a 1 M-row Avro file.

Querying Arrow IPC

SELECT * FROM read_arrow('file.arrow');

Querying SQLite

SELECT * FROM sqlite_scan('app.db', 'users');

Reads an existing SQLite file in place. No conversion. Column-selective scans over SQLite's row-oriented pages can be several× faster than raw SQLite for analytic aggregates - the exact speedup depends on row count and column width, so reproduce with your own data.

Python API

import slothdb

# Connect (empty string = in-memory)
db = slothdb.connect()
db = slothdb.connect("analytics.slothdb")   # persistent

# Query returns QueryResult
result = db.sql("SELECT * FROM 'data.csv'")
result.show()                              # pretty print
rows = result.fetchall()                    # list of tuples
df = result.fetchdf()                       # pandas DataFrame

# One-shot shortcut
slothdb.sql("SELECT 42").show()

Context manager: with slothdb.connect() as db: ... closes the database when the block exits.

CLI shell

$ slothdb                      # interactive, in-memory
$ slothdb analytics.slothdb    # interactive, persistent
$ slothdb -c "SELECT * FROM 'sales.csv' LIMIT 5"
$ slothdb -f queries.sql       # run a SQL script

# Inside the shell:
slothdb> .tables         # list tables
slothdb> .schema sales   # describe a table
slothdb> .help           # full command list
slothdb> .quit

Ask mode (.ask)

Type a question in plain English (or 28 other languages) inside the shell and SlothDB prints the SQL then runs it. No cloud round trip, no API keys, no telemetry.

$ slothdb
slothdb> .ask
ask> how many customers per country
-- SELECT "Country", COUNT(*) FROM customers GROUP BY "Country"
Norway 51 / Andorra 54 / Nepal 50 / ...  (243 rows)
ask> top 3 customers per country by subscription date
-- SELECT * FROM (SELECT "Country", "Customer Id",
--   ROW_NUMBER() OVER (PARTITION BY "Country" ORDER BY "Subscription Date" DESC) AS rn
-- FROM customers) WHERE rn <= 3
729 rows

How it works

Three tiers, in this order:

Rules-first: catalog-intent shapes (show tables, columns in X, count rows) answer in under 10 ms with no model touched.
Qwen 2.5 0.5B: simple aggregates and filters. First use downloads ~300 MB in the background.
Qwen 2.5 1.5B: window functions, "per"/"top N" shapes, anything the router flags as analytic. Downloads ~1 GB in parallel with the 0.5B.

Both models are lazy-downloaded on first .ask invocation and cached. The router picks based on question signals (analytic keywords, length, "per" patterns). A schema validator checks every generated column against the live catalog and refuses cleanly with a valid-column list if the model hallucinated one.

Confirmation prompts

SQL auto-runs by default after it is printed. If you want a [Y/n] prompt per statement:

$ SLOTHDB_ASK_CONFIRM=1 slothdb

Build with embedded models

The rules-first path ships in the default build. Local-model fallback is gated behind a CMake flag:

cmake -B build -DSLOTHDB_ASK_MODEL=ON -DCMAKE_BUILD_TYPE=Release

Full pipeline spec, router signals, and refusal policy: docs/ASK.md.

C / C++ API

#include <slothdb/api/slothdb.h>

slothdb_database *db;
slothdb_connection *conn;
slothdb_result *result;

slothdb_open("analytics.slothdb", &db);
slothdb_connect(db, &conn);
slothdb_query(conn,
  "SELECT region, SUM(revenue) FROM 'sales.csv' GROUP BY region",
  &result);

for (uint64_t r = 0; r < slothdb_row_count(result); r++)
  printf("%s: %s\n",
    slothdb_value_varchar(result, r, 0),
    slothdb_value_varchar(result, r, 1));

slothdb_free_result(result);
slothdb_disconnect(conn);
slothdb_close(db);

Headers are in include/slothdb/. Link against the built static or shared library. CMake find_package is supported.

SQL

SlothDB implements 130+ SQL features covering the standard OLAP workload:

Data types: INTEGER, BIGINT, DOUBLE, VARCHAR, DATE, TIMESTAMP, BOOLEAN, BLOB, DECIMAL, NULL
Predicates: comparisons, BETWEEN, IN, LIKE, IS NULL, EXISTS
Aggregates: COUNT, SUM, AVG, MIN, MAX, STRING_AGG, ARRAY_AGG, FIRST, LAST, MEDIAN, QUANTILE
Window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM() OVER, custom partitions
CTE: non-recursive and recursive WITH
MERGE: MERGE INTO t USING s ON ... WHEN MATCHED / NOT MATCHED
Set ops: UNION, UNION ALL, INTERSECT, EXCEPT
QUALIFY: filter on window results without a subquery (Snowflake-style)

Joins and CTEs

-- All join types supported
WITH top_regions AS (
  SELECT region, SUM(revenue) AS total
  FROM 'sales.parquet'
  GROUP BY region
  ORDER BY total DESC LIMIT 5
)
SELECT s.*, t.total
FROM 'sales.parquet' s
INNER JOIN top_regions t USING (region);

Window functions

-- Top earner per department
SELECT name, department, salary
FROM 'employees.parquet'
QUALIFY ROW_NUMBER() OVER
  (PARTITION BY department ORDER BY salary DESC) = 1;

-- Running total per region
SELECT region, order_date,
       SUM(revenue) OVER
         (PARTITION BY region ORDER BY order_date) AS running
FROM 'orders.csv';

Benchmarks

The previous version of this page included a 16-query head-to-head table against DuckDB with specific multipliers. Removed because SlothDB is early-stage and those numbers shouldn't be the headline for an experimental project. See the homepage Status section. bench/ in the repo still has the runner if you want to measure on your own hardware.

Architecture

SlothDB is a vectorized, morsel-parallel columnar execution engine:

Columnar: data flows as DataChunk batches of 2048 values per column.
Vectorized: loops operate on entire columns at once rather than row-at-a-time.
Morsel-driven parallelism: work is split into small morsels distributed across all CPU cores.
PhysicalXXXScan operators: each format (Parquet, CSV, Avro, etc.) has a dedicated scan that decodes directly into typed DataChunk vectors. No bulk-load-to-DataTable roundtrip.
Fused scan + aggregate: common shapes like SUM(col) FROM file GROUP BY col2 run in a single pass with no intermediate materialization.
Zero-copy VARCHAR: short strings (≤12 bytes) live inline in the vector; longer strings are pointer-shared into a buffer.

Extensions

SlothDB has a stable C ABI for third-party extensions. Unlike DuckDB's internal C++ API, extensions built against v0.1 will keep working on v1.0, v2.0, and beyond.

// Register a custom function from an extension
slothdb_register_function(db, "my_func", my_callback);

Full extension API headers are in include/slothdb/extension/extension_api.h. Load at runtime with LOAD 'my_ext'.

Live file views

CREATE LIVE VIEW caches a view's result and auto-refreshes when the source file changes. DuckDB's execution model is snapshot-based and has no equivalent - every SELECT re-reads the whole file unconditionally.

CREATE LIVE VIEW app AS SELECT * FROM 'app.log';

-- First SELECT parses the file, caches the result.
SELECT level, COUNT(*) FROM app GROUP BY level;

-- Later, once more rows have been appended to app.log externally:
SELECT level, COUNT(*) FROM app GROUP BY level;

For the common log-tail shape - CSV that only grows, never rewrites - SlothDB parses only the newly-appended bytes and appends them to the cache. Append cost scales with how much was appended, not the total file size.

Eligibility for the incremental path: the view is a pass-through (SELECT * FROM 'file.csv' with no WHERE / GROUP BY / ORDER BY / JOIN / DISTINCT / LIMIT) over a .csv or .tsv source. Other shapes still refresh correctly via full rescan on change. Rewriting the file in place (different header bytes) forces a full rescan even if the size increased.

Edge / WASM minimal build

The default WASM bundle is ~1.3 MB. For Cloudflare Workers (hard 1 MB script cap), Deno Deploy, and Vercel Edge, build with -DSLOTHDB_EDGE=ON - this strips Excel / Avro / Arrow IPC / SQLite readers to get the bundle under the cap.

cmake -B build -DSLOTHDB_EDGE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release

# Or through Emscripten:
emcmake cmake -B build-wasm-edge -DSLOTHDB_EDGE=ON
cmake --build build-wasm-edge

CSV, JSON, and Parquet are still built in. Calling an excluded reader (read_xlsx, read_avro, read_arrow, sqlite_scan) throws a clear error pointing at the full build. See EDGE_BUILD.md for audience, runtime distinctions, and expected sizes.

Limitations

SlothDB is v0.2.0. What it does not do (yet):

No distributed execution. Single-node embedded engine. Use ClickHouse if you outgrow one machine.
No MVCC or multi-writer transactions. Single-writer, crash-safe checkpoint.
Incomplete SQL dialect. Some corners will surprise you. Open an issue when they do.
Excel writes not supported yet. Reads work. COPY TO 'file.xlsx' is on the roadmap.

Roadmap

Generic execution path: rewrite the planner's dispatch surface and the generic ParallelScanFilter to vectorised columns
Extended window function coverage
HTTP and S3 file readers
R and Node.js bindings
Excel write path

FAQ

How is this different from DuckDB?

Same embedded model, different defaults. SlothDB ships Avro, Excel, and SQLite readers in the core binary (DuckDB ships them as extensions), exposes a stable C ABI for extensions with numeric error codes, and bakes CREATE LIVE VIEW (cached views with incremental CSV append on log-tail workloads) into the engine. DuckDB users can assemble equivalent behavior with application code; SlothDB makes it a first-class feature. Performance vs DuckDB is competitive on some shapes and worse on others; see the homepage Status section for the architecture context.

Can I use it with pandas?

Yes. db.sql("...").fetchdf() returns a pandas DataFrame.

Does it work on Apple Silicon?

Yes. SlothDB builds on x86_64 and arm64 on Linux, macOS, and Windows.

Is it production-ready?

It's v0.2.0 and about seven months old. 403 tests pass on every CI run, but you'll find edge cases. Treat it as beta. Don't put it in front of paying customers yet without a load test.

How can I help?

File issues, open PRs, share benchmark results, or just tell people about it. See the community page.