SlothDB Documentation
SlothDB is an experimental in-process SQL database for local data files. It runs inside your application or as a standalone shell and reads seven file formats natively. See the Status section on the homepage for current architecture caveats before reading any performance numbers in this doc.
This page is the short reference. For long-form guides, see the full documentation on GitHub. For the SQL cheat sheet, see the SQL reference.
Installation
SlothDB ships as a single binary with no runtime dependencies. Pick whichever entry point fits your workflow:
Python
pip install slothdb
python -c "import slothdb; slothdb.demo()"
The demo() call generates a 100 000-row synthetic CSV, runs three queries with timing, and (if duckdb is installed) prints a side-by-side comparison.
Node.js / Browser (WebAssembly)
npm install @slothdb/wasm
import { SlothDB } from '@slothdb/wasm'; const db = await SlothDB.create(); const { columns, rows, ms } = db.query( "SELECT region, SUM(revenue) AS total " + "FROM '/data/sales.parquet' GROUP BY region" );
1.3 MB wasm, works in Node ≥18 and every modern browser. Full engine, all seven file formats. Try it live without installing anything.
Linux / macOS (install script)
curl -fsSL https://raw.githubusercontent.com/SouravRoy-ETL/slothdb/main/install.sh | bash
Windows
Download slothdb.exe from the latest release. Unzip and run.
Build from source
git clone https://github.com/SouravRoy-ETL/slothdb cd slothdb cmake -B build -DSLOTHDB_BUILD_SHELL=ON -DCMAKE_BUILD_TYPE=Release cmake --build build --config Release
Your first query
$ slothdb slothdb> SELECT 'Hello, world!' AS greeting; slothdb> SELECT * FROM 'sales.csv' LIMIT 5;
No CREATE TABLE. No COPY FROM. SlothDB auto-detects the format and streams results.
Querying Parquet
-- Auto-detected by extension SELECT * FROM 'events.parquet' WHERE year >= 2023; -- Explicit function call for globs and options SELECT * FROM read_parquet('logs/*.parquet');
Supports zstd, snappy, and gzip compression. Filter pushdown drops whole row groups when possible.
Querying CSV
SELECT region, SUM(revenue) FROM read_csv('sales/*.csv') GROUP BY region;
RFC 4180 compliant. Handles quoted fields, embedded newlines, and custom delimiters. Headers auto-detected.
Querying JSON
-- JSON array, object, or NDJSON, all work SELECT status, COUNT(*) FROM 'api_logs.json' GROUP BY status;
Querying Avro
SELECT * FROM read_avro('events.avro');
Avro is built in. No extension load needed. DuckDB reads Avro through a separately-installed extension (INSTALL avro; LOAD avro;); SlothDB reads it through a native typed decoder in the core binary, which on our bench is about 5.4× faster on SUM over a 1 M-row Avro file.
Querying Arrow IPC
SELECT * FROM read_arrow('file.arrow');
Querying SQLite
SELECT * FROM sqlite_scan('app.db', 'users');
Reads an existing SQLite file in place. No conversion. Column-selective scans over SQLite's row-oriented pages can be several× faster than raw SQLite for analytic aggregates - the exact speedup depends on row count and column width, so reproduce with your own data.
Python API
import slothdb # Connect (empty string = in-memory) db = slothdb.connect() db = slothdb.connect("analytics.slothdb") # persistent # Query returns QueryResult result = db.sql("SELECT * FROM 'data.csv'") result.show() # pretty print rows = result.fetchall() # list of tuples df = result.fetchdf() # pandas DataFrame # One-shot shortcut slothdb.sql("SELECT 42").show()
with slothdb.connect() as db: ... closes the database when the block exits.
CLI shell
$ slothdb # interactive, in-memory $ slothdb analytics.slothdb # interactive, persistent $ slothdb -c "SELECT * FROM 'sales.csv' LIMIT 5" $ slothdb -f queries.sql # run a SQL script # Inside the shell: slothdb> .tables # list tables slothdb> .schema sales # describe a table slothdb> .help # full command list slothdb> .quit
Ask mode (.ask)
Type a question in plain English (or 28 other languages) inside the shell and SlothDB prints the SQL then runs it. No cloud round trip, no API keys, no telemetry.
$ slothdb slothdb> .ask ask> how many customers per country -- SELECT "Country", COUNT(*) FROM customers GROUP BY "Country" Norway 51 / Andorra 54 / Nepal 50 / ... (243 rows) ask> top 3 customers per country by subscription date -- SELECT * FROM (SELECT "Country", "Customer Id", -- ROW_NUMBER() OVER (PARTITION BY "Country" ORDER BY "Subscription Date" DESC) AS rn -- FROM customers) WHERE rn <= 3 729 rows
How it works
Three tiers, in this order:
- Rules-first: catalog-intent shapes (
show tables,columns in X,count rows) answer in under 10 ms with no model touched. - Qwen 2.5 0.5B: simple aggregates and filters. First use downloads ~300 MB in the background.
- Qwen 2.5 1.5B: window functions, "per"/"top N" shapes, anything the router flags as analytic. Downloads ~1 GB in parallel with the 0.5B.
Both models are lazy-downloaded on first .ask invocation and cached. The router picks based on question signals (analytic keywords, length, "per" patterns). A schema validator checks every generated column against the live catalog and refuses cleanly with a valid-column list if the model hallucinated one.
Confirmation prompts
SQL auto-runs by default after it is printed. If you want a [Y/n] prompt per statement:
$ SLOTHDB_ASK_CONFIRM=1 slothdb
Build with embedded models
The rules-first path ships in the default build. Local-model fallback is gated behind a CMake flag:
cmake -B build -DSLOTHDB_ASK_MODEL=ON -DCMAKE_BUILD_TYPE=Release
Full pipeline spec, router signals, and refusal policy: docs/ASK.md.
C / C++ API
#include <slothdb/api/slothdb.h> slothdb_database *db; slothdb_connection *conn; slothdb_result *result; slothdb_open("analytics.slothdb", &db); slothdb_connect(db, &conn); slothdb_query(conn, "SELECT region, SUM(revenue) FROM 'sales.csv' GROUP BY region", &result); for (uint64_t r = 0; r < slothdb_row_count(result); r++) printf("%s: %s\n", slothdb_value_varchar(result, r, 0), slothdb_value_varchar(result, r, 1)); slothdb_free_result(result); slothdb_disconnect(conn); slothdb_close(db);
Headers are in include/slothdb/. Link against the built static or shared library. CMake find_package is supported.
SQL
SlothDB implements 130+ SQL features covering the standard OLAP workload:
- Data types:
INTEGER,BIGINT,DOUBLE,VARCHAR,DATE,TIMESTAMP,BOOLEAN,BLOB,DECIMAL,NULL - Predicates: comparisons,
BETWEEN,IN,LIKE,IS NULL,EXISTS - Aggregates:
COUNT,SUM,AVG,MIN,MAX,STRING_AGG,ARRAY_AGG,FIRST,LAST,MEDIAN,QUANTILE - Window functions:
ROW_NUMBER,RANK,DENSE_RANK,LAG,LEAD,SUM() OVER, custom partitions - CTE: non-recursive and recursive
WITH - MERGE:
MERGE INTO t USING s ON ... WHEN MATCHED / NOT MATCHED - Set ops:
UNION,UNION ALL,INTERSECT,EXCEPT - QUALIFY: filter on window results without a subquery (Snowflake-style)
Joins and CTEs
-- All join types supported WITH top_regions AS ( SELECT region, SUM(revenue) AS total FROM 'sales.parquet' GROUP BY region ORDER BY total DESC LIMIT 5 ) SELECT s.*, t.total FROM 'sales.parquet' s INNER JOIN top_regions t USING (region);
Window functions
-- Top earner per department SELECT name, department, salary FROM 'employees.parquet' QUALIFY ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) = 1; -- Running total per region SELECT region, order_date, SUM(revenue) OVER (PARTITION BY region ORDER BY order_date) AS running FROM 'orders.csv';
Benchmarks
The previous version of this page included a 16-query head-to-head table against DuckDB with specific multipliers. Removed because SlothDB is early-stage and those numbers shouldn't be the headline for an experimental project. See the homepage Status section. bench/ in the repo still has the runner if you want to measure on your own hardware.
Architecture
SlothDB is a vectorized, morsel-parallel columnar execution engine:
- Columnar: data flows as
DataChunkbatches of 2048 values per column. - Vectorized: loops operate on entire columns at once rather than row-at-a-time.
- Morsel-driven parallelism: work is split into small morsels distributed across all CPU cores.
- PhysicalXXXScan operators: each format (Parquet, CSV, Avro, etc.) has a dedicated scan that decodes directly into typed DataChunk vectors. No bulk-load-to-DataTable roundtrip.
- Fused scan + aggregate: common shapes like
SUM(col) FROM file GROUP BY col2run in a single pass with no intermediate materialization. - Zero-copy VARCHAR: short strings (≤12 bytes) live inline in the vector; longer strings are pointer-shared into a buffer.
Extensions
SlothDB has a stable C ABI for third-party extensions. Unlike DuckDB's internal C++ API, extensions built against v0.1 will keep working on v1.0, v2.0, and beyond.
// Register a custom function from an extension slothdb_register_function(db, "my_func", my_callback);
Full extension API headers are in include/slothdb/extension/extension_api.h. Load at runtime with LOAD 'my_ext'.
Live file views
CREATE LIVE VIEW caches a view's result and auto-refreshes when the source file changes. DuckDB's execution model is snapshot-based and has no equivalent - every SELECT re-reads the whole file unconditionally.
CREATE LIVE VIEW app AS SELECT * FROM 'app.log'; -- First SELECT parses the file, caches the result. SELECT level, COUNT(*) FROM app GROUP BY level; -- Later, once more rows have been appended to app.log externally: SELECT level, COUNT(*) FROM app GROUP BY level;
For the common log-tail shape - CSV that only grows, never rewrites - SlothDB parses only the newly-appended bytes and appends them to the cache. Append cost scales with how much was appended, not the total file size.
Eligibility for the incremental path: the view is a pass-through (SELECT * FROM 'file.csv' with no WHERE / GROUP BY / ORDER BY / JOIN / DISTINCT / LIMIT) over a .csv or .tsv source. Other shapes still refresh correctly via full rescan on change. Rewriting the file in place (different header bytes) forces a full rescan even if the size increased.
Edge / WASM minimal build
The default WASM bundle is ~1.3 MB. For Cloudflare Workers (hard 1 MB script cap), Deno Deploy, and Vercel Edge, build with -DSLOTHDB_EDGE=ON - this strips Excel / Avro / Arrow IPC / SQLite readers to get the bundle under the cap.
cmake -B build -DSLOTHDB_EDGE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
# Or through Emscripten:
emcmake cmake -B build-wasm-edge -DSLOTHDB_EDGE=ON
cmake --build build-wasm-edge
CSV, JSON, and Parquet are still built in. Calling an excluded reader (read_xlsx, read_avro, read_arrow, sqlite_scan) throws a clear error pointing at the full build. See EDGE_BUILD.md for audience, runtime distinctions, and expected sizes.
Limitations
SlothDB is v0.2.0. What it does not do (yet):
- No distributed execution. Single-node embedded engine. Use ClickHouse if you outgrow one machine.
- No MVCC or multi-writer transactions. Single-writer, crash-safe checkpoint.
- Incomplete SQL dialect. Some corners will surprise you. Open an issue when they do.
- Excel writes not supported yet. Reads work.
COPY TO 'file.xlsx'is on the roadmap.
Roadmap
- Generic execution path: rewrite the planner's dispatch surface and the generic
ParallelScanFilterto vectorised columns - Extended window function coverage
- HTTP and S3 file readers
- R and Node.js bindings
- Excel write path
FAQ
How is this different from DuckDB?
Same embedded model, different defaults. SlothDB ships Avro, Excel, and SQLite readers in the core binary (DuckDB ships them as extensions), exposes a stable C ABI for extensions with numeric error codes, and bakes CREATE LIVE VIEW (cached views with incremental CSV append on log-tail workloads) into the engine. DuckDB users can assemble equivalent behavior with application code; SlothDB makes it a first-class feature. Performance vs DuckDB is competitive on some shapes and worse on others; see the homepage Status section for the architecture context.
Can I use it with pandas?
Yes. db.sql("...").fetchdf() returns a pandas DataFrame.
Does it work on Apple Silicon?
Yes. SlothDB builds on x86_64 and arm64 on Linux, macOS, and Windows.
Is it production-ready?
It's v0.2.0 and about seven months old. 403 tests pass on every CI run, but you'll find edge cases. Treat it as beta. Don't put it in front of paying customers yet without a load test.
How can I help?
File issues, open PRs, share benchmark results, or just tell people about it. See the community page.