SlothDB Documentation
SlothDB is a fast in-process SQL database for your local data files. It runs inside your application or as a standalone shell, reads seven file formats natively, and is 1.1×–8.6× faster than DuckDB on every query in its shipped benchmark suite.
This page is the short reference. For long-form guides, see the full documentation on GitHub. For the SQL cheat sheet, see the SQL reference.
Installation
SlothDB ships as a single binary with no runtime dependencies. Pick whichever entry point fits your workflow:
Python
```
pip install slothdb
python -c "import slothdb; slothdb.demo()"
```
The demo() call generates a 100,000-row synthetic CSV, runs three queries with timing, and (if duckdb is installed) prints a side-by-side comparison.
Linux / macOS (install script)
```
curl -fsSL https://raw.githubusercontent.com/SouravRoy-ETL/slothdb/main/install.sh | bash
```
Windows
Download the Windows build from the latest release, unzip the archive, and run slothdb.exe.
Build from source
```
git clone https://github.com/SouravRoy-ETL/slothdb
cd slothdb
cmake -B build -DSLOTHDB_BUILD_SHELL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
```
Your first query
```
$ slothdb
slothdb> SELECT 'Hello, world!' AS greeting;
slothdb> SELECT * FROM 'sales.csv' LIMIT 5;
```
No CREATE TABLE. No COPY FROM. SlothDB auto-detects the format and streams results.
Querying Parquet
```sql
-- Auto-detected by extension
SELECT * FROM 'events.parquet' WHERE year >= 2023;

-- Explicit function call for globs and options
SELECT * FROM read_parquet('logs/*.parquet');
```
Supports zstd, snappy, and gzip compression. Filter pushdown drops whole row groups when possible.
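The pruning idea behind filter pushdown can be sketched in plain Python: if a row group's min/max statistics cannot satisfy the predicate, the whole group is skipped without decoding. This is an illustrative sketch, not SlothDB's actual code; the RowGroup structure and its fields are hypothetical stand-ins for the per-column statistics a Parquet footer provides.

```python
# Illustrative sketch of filter pushdown over Parquet-style row groups.
# RowGroup and its min/max stats are hypothetical stand-ins for the
# metadata a Parquet footer stores per column chunk.
from dataclasses import dataclass

@dataclass
class RowGroup:
    min_year: int
    max_year: int
    rows: list  # decoded only when the group survives pruning

def scan_with_pushdown(groups, year_ge):
    """Yield rows matching `year >= year_ge`, skipping whole row
    groups whose max stat already fails the predicate."""
    for g in groups:
        if g.max_year < year_ge:
            continue  # prune: no row in this group can match
        for row in g.rows:
            if row["year"] >= year_ge:
                yield row

groups = [
    RowGroup(2019, 2021, [{"year": 2019}, {"year": 2021}]),
    RowGroup(2023, 2024, [{"year": 2023}, {"year": 2024}]),
]
matched = list(scan_with_pushdown(groups, 2023))
# Only the second group is decoded; the first is pruned by its stats.
```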
Querying CSV
```sql
SELECT region, SUM(revenue)
FROM read_csv('sales/*.csv')
GROUP BY region;
```
RFC 4180 compliant. Handles quoted fields, embedded newlines, and custom delimiters. Headers auto-detected.
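For a concrete picture of what RFC 4180 compliance covers, the sketch below feeds the same corner cases (a quoted field with an embedded newline, a quoted field containing the delimiter) through Python's stdlib csv module; the input data is made up for illustration.

```python
# RFC 4180 corner cases: quoted fields may contain the delimiter,
# doubled quotes, and embedded newlines. Parsed with Python's stdlib
# csv module purely to illustrate the expected behavior.
import csv
import io

raw = 'region,note\nEMEA,"line one\nline two"\nAPAC,"a,b"\n'
rows = list(csv.reader(io.StringIO(raw)))
# rows[1][1] keeps the embedded newline; rows[2][1] keeps the comma.
```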
Querying JSON
```sql
-- JSON array, object, or NDJSON: all work
SELECT status, COUNT(*) FROM 'api_logs.json' GROUP BY status;
```
Querying Avro
```sql
SELECT * FROM read_avro('events.avro');
```
Avro is built in. No extension load needed. This is one of SlothDB's biggest wins over DuckDB (which requires INSTALL avro; LOAD avro;).
Querying Arrow IPC
```sql
SELECT * FROM read_arrow('file.arrow');
```
Querying SQLite
```sql
SELECT * FROM sqlite_scan('app.db', 'users');
```
Reads an existing SQLite file in place. No conversion. Columnar streaming on top of row-oriented storage gives a 10-100× win over raw SQLite on analytical aggregates.
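The baseline being compared against can be reproduced with Python's stdlib sqlite3, which executes aggregates row-at-a-time through SQLite's own engine. The table and data below are made up for illustration; an in-memory database stands in for an on-disk file like app.db.

```python
# Baseline: a row-at-a-time aggregate through stdlib sqlite3.
# SlothDB's sqlite_scan reads the same file format but streams
# columns, which is where the claimed aggregate speedup comes from.
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for app.db
conn.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, 30), (2, 40), (3, 50)])
(avg_age,) = conn.execute("SELECT AVG(age) FROM users").fetchone()
# → 40.0
```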
Python API
```python
import slothdb

# Connect (empty string = in-memory)
db = slothdb.connect()
db = slothdb.connect("analytics.slothdb")  # persistent

# Query returns QueryResult
result = db.sql("SELECT * FROM 'data.csv'")
result.show()              # pretty print
rows = result.fetchall()   # list of tuples
df = result.fetchdf()      # pandas DataFrame

# One-shot shortcut
slothdb.sql("SELECT 42").show()
```
Using the connection as a context manager (with slothdb.connect() as db: ...) closes the database when the block exits.
CLI shell
```
$ slothdb                       # interactive, in-memory
$ slothdb analytics.slothdb     # interactive, persistent
$ slothdb -c "SELECT * FROM 'sales.csv' LIMIT 5"
$ slothdb -f queries.sql        # run a SQL script

# Inside the shell:
slothdb> .tables        # list tables
slothdb> .schema sales  # describe a table
slothdb> .help          # full command list
slothdb> .quit
```
C / C++ API
```c
#include <slothdb/api/slothdb.h>

slothdb_database *db;
slothdb_connection *conn;
slothdb_result *result;

slothdb_open("analytics.slothdb", &db);
slothdb_connect(db, &conn);
slothdb_query(conn,
              "SELECT region, SUM(revenue) FROM 'sales.csv' GROUP BY region",
              &result);

for (uint64_t r = 0; r < slothdb_row_count(result); r++)
    printf("%s: %s\n",
           slothdb_value_varchar(result, r, 0),
           slothdb_value_varchar(result, r, 1));

slothdb_free_result(result);
slothdb_disconnect(conn);
slothdb_close(db);
```
Headers are in include/slothdb/. Link against the built static or shared library. CMake find_package is supported.
SQL
SlothDB implements 130+ SQL features covering the standard OLAP workload:
- Data types: INTEGER, BIGINT, DOUBLE, VARCHAR, DATE, TIMESTAMP, BOOLEAN, BLOB, DECIMAL, NULL
- Predicates: comparisons, BETWEEN, IN, LIKE, IS NULL, EXISTS
- Aggregates: COUNT, SUM, AVG, MIN, MAX, STRING_AGG, ARRAY_AGG, FIRST, LAST, MEDIAN, QUANTILE
- Window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM() OVER, custom partitions
- CTEs: non-recursive and recursive WITH
- MERGE: MERGE INTO t USING s ON ... WHEN MATCHED / NOT MATCHED
- Set ops: UNION, UNION ALL, INTERSECT, EXCEPT
- QUALIFY: filter on window results without a subquery (Snowflake-style)
Joins and CTEs
```sql
-- All join types supported
WITH top_regions AS (
    SELECT region, SUM(revenue) AS total
    FROM 'sales.parquet'
    GROUP BY region
    ORDER BY total DESC
    LIMIT 5
)
SELECT s.*, t.total
FROM 'sales.parquet' s
INNER JOIN top_regions t USING (region);
```
Window functions
```sql
-- Top earner per department
SELECT name, department, salary
FROM 'employees.parquet'
QUALIFY ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) = 1;

-- Running total per region
SELECT region, order_date,
       SUM(revenue) OVER (PARTITION BY region ORDER BY order_date) AS running
FROM 'orders.csv';
```
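For readers new to QUALIFY, the first query's semantics can be spelled out in plain Python: rank rows within each partition, then keep only rank 1. A stdlib sketch with made-up employee data:

```python
# QUALIFY ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) = 1
# in plain Python: group by department, sort by salary descending,
# keep the first row of each group.
from itertools import groupby

employees = [
    {"name": "Ann", "department": "eng", "salary": 120},
    {"name": "Bob", "department": "eng", "salary": 100},
    {"name": "Cat", "department": "ops", "salary": 90},
]
employees.sort(key=lambda e: (e["department"], -e["salary"]))
top = [next(rows) for _, rows in groupby(employees, key=lambda e: e["department"])]
# top holds Ann (eng) and Cat (ops): the highest salary per department.
```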
Benchmarks
Same machine, same queries, 1M rows, 5-run median, warm cache. SlothDB is faster than DuckDB on every single query.
| Format | Query | SlothDB | DuckDB | Speedup |
|---|---|---|---|---|
| CSV | COUNT(*) | 33 ms | 170 ms | 5.08× |
| Parquet | COUNT(*) | 12 ms | 34 ms | 2.83× |
| Parquet | SUM(revenue) | 46 ms | 48 ms | 1.04× |
| JSON | SUM(revenue) | 242 ms | 314 ms | 1.30× |
| Avro | SUM(revenue) | 140 ms | 760 ms | 5.43× |
| Avro | GROUP BY region | 170 ms | 800 ms | 4.71× |
| Excel | GROUP BY region | 2.5 s | 3.56 s | 1.41× |
Full 15-query breakdown in the README benchmarks section.
Architecture
SlothDB is a vectorized, morsel-parallel columnar execution engine:
- Columnar: data flows as DataChunk batches of 2048 values per column.
- Vectorized: loops operate on entire columns at once rather than row-at-a-time.
- Morsel-driven parallelism: work is split into small morsels distributed across all CPU cores.
- PhysicalXXXScan operators: each format (Parquet, CSV, Avro, etc.) has a dedicated scan that decodes directly into typed DataChunk vectors. No bulk-load-to-DataTable roundtrip.
- Fused scan + aggregate: common shapes like SUM(col) FROM file GROUP BY col2 run in a single pass with no intermediate materialization.
- Zero-copy VARCHAR: short strings (≤12 bytes) live inline in the vector; longer strings are pointer-shared into a buffer.
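The fused scan-and-aggregate shape can be illustrated in plain Python: consume the input in fixed-size chunks (SlothDB uses 2048 values per DataChunk) and fold each chunk straight into the hash-aggregate state, never building an intermediate table. The data and loop below are an illustrative sketch; in the real engine the inner loop is vectorized and runs per-morsel across cores.

```python
# Fused scan + aggregate sketch: SUM(revenue) GROUP BY region,
# folding each 2048-row chunk directly into the aggregate state
# instead of materializing an intermediate table.
from collections import defaultdict

CHUNK_SIZE = 2048  # matches SlothDB's DataChunk size

def chunks(rows, size=CHUNK_SIZE):
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

rows = [("north", 10), ("south", 5), ("north", 7)] * 2000  # 6000 rows
totals = defaultdict(int)
for chunk in chunks(rows):
    for region, revenue in chunk:  # vectorized in the real engine
        totals[region] += revenue
# totals now holds the final GROUP BY result in one pass.
```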
Extensions
SlothDB has a stable C ABI for third-party extensions. Unlike DuckDB's internal C++ API, extensions built against v0.1 will keep working on v1.0, v2.0, and beyond.
```c
// Register a custom function from an extension
slothdb_register_function(db, "my_func", my_callback);
```
Full extension API headers are in include/slothdb/extension/extension_api.h. Load at runtime with LOAD 'my_ext'.
Limitations
SlothDB is v0.1.4. What it does not do (yet):
- No distributed execution. Single-node embedded engine. Use ClickHouse if you outgrow one machine.
- No MVCC or multi-writer transactions. Single-writer, crash-safe checkpoint.
- Incomplete SQL dialect. Some corners will surprise you. Open an issue when they do.
- Excel writes not supported yet. Reads work; COPY TO 'file.xlsx' is on the roadmap.
Roadmap
- ClickBench submission (external benchmark parity)
- Extended window function coverage
- HTTP and S3 file readers
- R and Node.js bindings
- Excel write path
FAQ
How is this different from DuckDB?
Same embedded model, different defaults. SlothDB ships Avro, Excel, and SQLite in core (DuckDB needs extensions), has a stable C ABI for extensions, uses numeric error codes, and is 1.1×–8.6× faster on the shipped benchmark suite.
Can I use it with pandas?
Yes. db.sql("...").fetchdf() returns a pandas DataFrame.
Does it work on Apple Silicon?
Yes. SlothDB builds on x86_64 and arm64 on Linux, macOS, and Windows.
Is it production-ready?
It's v0.1.4 and about six months old. 359 tests pass on every CI run, but you'll find edge cases. Treat it as beta. Don't put it in front of paying customers yet without a load test.
How can I help?
File issues, open PRs, share benchmark results, or just tell people about it. See the community page.