Why DuckDB is my first choice for data processing
We're moving towards a simpler world where most tabular data can be processed on a single large machine and the era of clusters is coming to an end for all but the largest datasets.
become very debatable. Depending on how you want to pivot/scale/augment your data, even datasets that seemingly "fit" on large boxes will quickly OOM you.
The author also has another article where they claim that:
"SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)
This does not map to my experience at all, outside of the realm of nicely parsed datasets that don't require too much complicated analysis or augmentation.
But the problem is the ecosystem hasn't standardised on any of them, and it's annoying to have to rewrite pipelines from one dataframe API to another.
I also agree you're gonna hit OOM if your data is massive, but my guess is the vast majority of tabular data people process is <10GB, and that'll generally process fine on a single large machine. Certainly in my experience it's common to see Spark being used on datasets that are nowhere near big enough to need it. DuckDB is gaining traction, but a lot of people still seem unaware how quickly you can process multiple GB of data on a laptop nowadays.
I guess my overall position is it's a good idea to think about using DuckDB first, because often it'll do the job quickly and easily. There are a whole host of scenarios where it's inappropriate, but it's a good place to start.
Most of the datasets I work with are indeed <10GB but the ones that are much larger follow the same ETL and analysis flows. It helps that I've built a lot of tooling to help with types and memory-efficient inserts. Having to rewrite pipelines because of "that one dataframe API" is exactly what solidified my thoughts around SQL over everything else. So much of my life has been lost trying to get dataframe and non-dataframe libraries to work together.
Thing about SQL is that it can be taken just about anywhere, so the time spent improving your SQL skills is almost always well worth it. R and pandas much less so.
The fact that I can use sqlite / local sql db for all kinds of development and reliably use the same code (with minor updates) in the cloud hosted solution is such a huge benefit that it undermines anything else that any other solution has to offer. I'm excited about the sql stuff I learned over 10 years ago being of great use to me in the coming months.
The last product I worked heavily on used a nosql database and it worked fine until you started to tweak it just a little bit - split entities, convert data types or update ids. Most of the data access layer logic dealt with conversion between data coming in from the database and the guardrails to keep the data integrity in check while interacting with the application models. To me this is something so obviously solved years ago with a few lines of constraints. Moving over to sql was totally impossible. Learned my lesson, advocated hard for sql. Hoping for better outcomes.
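To make the "few lines of constraints" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables and columns (`customers`, `orders`, `quantity`) are hypothetical and not taken from the product described above; the point is only that the database itself rejects bad rows, with no guardrail code in the application layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: the integrity rules live in the table definitions.
conn.executescript("""
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    quantity    INTEGER NOT NULL CHECK (quantity > 0)
);
""")

conn.execute("INSERT INTO customers (id, email) VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders (customer_id, quantity) VALUES (1, 3)")


def try_insert(sql):
    """Run one statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = [
    # Violates CHECK (quantity > 0)
    try_insert("INSERT INTO orders (customer_id, quantity) VALUES (1, 0)"),
    # Violates the foreign key: there is no customer 99
    try_insert("INSERT INTO orders (customer_id, quantity) VALUES (99, 1)"),
    # Violates UNIQUE on email
    try_insert("INSERT INTO customers (id, email) VALUES (2, 'a@example.com')"),
]
for result in results:
    print(result)
```

All three statements are refused by the database, so the only rows that exist are the two valid ones inserted at the start.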
I see your point, even though my experience has been somewhat the opposite. E.g. a pipeline that used to work fast enough/at all up until some point in time because the scale of the data or requirements allowed it. Then some subset of these conditions changes, the pipeline cannot meet them, and one has to reverse engineer obscure SQL views/stored procedures/plugins, and migrate the whole thing to python or some compiled language.
I work with high density signal data now, and my SQL knowledge occupies the "temporary solution" part of my brain for the most part.
"SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)
SQL has nothing to do with fast. Not sure what makes it any more testable than polars? Future-proof in what way? I guess they mean your SQL dialect won't have breaking changes?
My current habit is to suck down big datasets to parquet shards and then just query them with a wildcard in duckdb. I move to bigquery when doing true “big data” but a few GB of extract from BQ to a notebook VM disk and duckdb is super ergonomic and performant most of the time.
It’s the sql that I like. Being a veteran of when the world went mad for nosql it is just so nice to experience the revenge of sql.
It is faster to write and faster to read
At least on clickbench, polars and duckdb are roughly comparable (with polars edging out duckdb).
I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've gone through better and better SQL engines. I've seen this through the past five years of writing and maintaining a record linkage library. It generates SQL that can be executed against multiple backends. My library gets faster and faster year after year without me having to do anything, due to improvements in the SQL backends that handle things like vectorisation and parallelization for me. I imagine if I were to try and program the routines by hand, it would be significantly slower since so much work has gone into optimising SQL engines.
In terms of future proof - yes in the sense that the code will still be easy to run in 20 years time.
I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've gone through better and better SQL engines.
Yeah, but SQL isn't really portable between all query engines. You always have to be speaking the same dialect. Also, SQL isn't the only "declarative" dsl, polars's lazyframe api is similarly declarative. Technically Ibis's dataframe dsl also works as a multi-frontend declarative query language. Or even substrait.
Anyway, my point is that SQL is not inherently a faster paradigm than "dataframes", and that you're conflating declarative query planning with SQL.
For a tiny fraction of the cost you can get numerous nodes with 600gbps ethernet ports that can fill their memory in seconds.
And if that's still not enough, if you just need to crunch data a couple times a week, it's not unreasonable to get a massive massive cloud box with ridiculous amounts of ram or ram+SSD. I7i or i8g boxes. Alas, we have cheap older gen epycs & some amazing cheap motherboards but RAM prices to DIY are off the charts unbelievable, but so be it.
From my experience, the data modelling side is still overwhelmingly in SQL. The ingestion side is definitely mostly Python/Scala though.
If you are shuffling data around in pipelines, sure, go for SQL.
Readability is in the eye of the beholder. I much prefer dataframes for that, though a good chunk of the internet claims to throw up in their mouths upon seeing it...
-- Support for .parquet, .json, .csv (note: Spotify listening history comes in multiple .json files, something fun to play with).
-- Support for glob reading, like: select * from 'tsa20*.csv' - so you can read hundreds of files (any type of file!) as if they were one file.
-- if the files don't have the same schema, union_by_name is amazing.
-- The .csv parser is amazing. Auto assigns types well.
-- It's small! The Web Assembly version is 2mb! The CLI is 16mb.
-- Because it is small you can add duckdb directly to your product, like Malloy has done: https://www.malloydata.dev/ - I think of Malloy as a technical person's alternative to PowerBI and Tableau, but it uses a semantic model that helps AI write amazing queries on your data. Edit: Malloy makes SQL 10x easier to write because of its semantic nature. Malloy transpiles to SQL, like TypeScript transpiles to JavaScript.
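For readers who haven't seen the glob-reading and union_by_name features from the list above, here is a rough pure-Python sketch of what they mean semantically. It is not how DuckDB implements them (in DuckDB this is a one-liner along the lines of `read_csv('tsa20*.csv', union_by_name = true)`), and it skips type inference entirely. The file names and columns are made up for the illustration.

```python
import csv
import glob
import os
import tempfile


def read_union_by_name(pattern):
    """Read every CSV matching `pattern` and align rows by column name."""
    columns = []  # union of all column names, in first-seen order
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            rows.extend(reader)
    # A column that a file lacks becomes None (NULL in SQL terms).
    return columns, [{c: r.get(c) for c in columns} for r in rows]


with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical files whose schemas differ by one column.
    with open(os.path.join(tmp, "tsa2023.csv"), "w", newline="") as f:
        f.write("date,passengers\n2023-01-01,100\n")
    with open(os.path.join(tmp, "tsa2024.csv"), "w", newline="") as f:
        f.write("date,passengers,airport\n2024-01-01,120,YVR\n")
    columns, rows = read_union_by_name(os.path.join(tmp, "tsa20*.csv"))

print(columns)
for row in rows:
    print(row)
```

The 2023 row comes back with `airport` set to None, which is the behaviour that makes slightly mismatched files queryable as one table.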
The .csv parser is amazing
Their csv support coupled with lots of functions and fast & easy iterative data discovery has totally changed how I approach investigation problems. I used to focus a significant amount of time on understanding the underlying schema of the problem space first, and often there really wasn't one - but you didn't find out easily. Now I start with pulling in data, writing exploratory queries to validate my assumptions, then cleaning & transforming data and creating new tables from that state; rinse and repeat. Aside from getting much deeper much quicker, you also hit dead ends sooner, saving a lot of otherwise wasted time.
There's an interesting paper out there on how the CSV parser works, and some ideas for future enhancements. I couldn't seem to find it but maybe someone else can?
One of my favorite features is `SELECT ... FROM s3Cluster('<ch cluster>', 'https://...<s3 url>.../data//.json', ..., 'JSON')`[0] which lets you wildcard ingest from an S3 bucket and distributes the processing across nodes in your configured cluster. Also, I think it works with `schema_inference_mode` (mentioned below) though I haven't tried it. Very cool time for databases / DB tooling.
(I actually wasn't familiar with `union_by_name`, but it looks like ClickHouse has implemented that as well [1,2]. Neat feature in either case!)
[0] https://clickhouse.com/docs/sql-reference/table-functions/s3...
[1] https://clickhouse.com/docs/interfaces/schema-inference
[2] https://github.com/ClickHouse/ClickHouse/pull/55892
It is also difficult to customize compared to sqlite, so for example if you want to use your own parser for csv then it becomes hard.
But yes, it provides a lot of convenience out of the box, as you have already listed.
I work with scientists who research BC's coastal environment, from airborne observation of glaciers to autonomous drones in the deep sea. We've got heaps of data.
A while back I took a leap of faith with DuckDB as the data-processing engine for a new tool we're using to transform and validate biodiversity data. The goal is to take heaps of existing datasets and convert them to valid Darwin Core data. Keyword being valid.
DuckDB is such an incredible tool in this context. Essentially I dynamically build duckdb tables from schemas describing the data, then import it into the tables. If it fails, it explains why on a row-by-row basis (as far as it's able to, at least). Once the raw data is in, transformations can occur. This is accomplished entirely in DuckDB as well. Finally, validations are performed using application-layer logic if the transformation alone isn't assurance enough.
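The flow described above (build a table from a schema description, load rows, report failures per row) can be sketched roughly as follows. This is not the poster's code: it uses Python's built-in sqlite3 as a stand-in for DuckDB, and the schema format, table name, and column names are all hypothetical.

```python
import sqlite3

# Hypothetical schema description: column name -> (SQL type, constraint).
schema = {
    "occurrence_id": ("TEXT", "PRIMARY KEY"),
    "scientific_name": ("TEXT", "NOT NULL"),
    "individual_count": ("INTEGER", "CHECK (individual_count >= 0)"),
}


def create_table_sql(table, schema):
    """Build a CREATE TABLE statement from the schema description."""
    cols = ", ".join(
        f'"{name}" {sqltype} {extra}' for name, (sqltype, extra) in schema.items()
    )
    return f'CREATE TABLE "{table}" ({cols})'


def load_rows(conn, table, schema, rows):
    """Insert rows one at a time, collecting (row index, reason) on failure."""
    names = list(schema)
    column_list = ", ".join(f'"{n}"' for n in names)
    placeholders = ", ".join("?" for _ in names)
    sql = f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})'
    errors = []
    for i, row in enumerate(rows):
        try:
            conn.execute(sql, [row.get(n) for n in names])
        except sqlite3.IntegrityError as exc:
            errors.append((i, str(exc)))
    return errors


conn = sqlite3.connect(":memory:")
conn.execute(create_table_sql("observations", schema))

raw_rows = [
    {"occurrence_id": "occ-1", "scientific_name": "Orcinus orca", "individual_count": 2},
    {"occurrence_id": "occ-2", "scientific_name": None, "individual_count": 1},
    {"occurrence_id": "occ-1", "scientific_name": "Enhydra lutris", "individual_count": 1},
    {"occurrence_id": "occ-3", "scientific_name": "Enhydra lutris", "individual_count": -5},
]
errors = load_rows(conn, "observations", schema, raw_rows)
for index, reason in errors:
    print(f"row {index}: {reason}")
```

One caveat with the stand-in: SQLite is loosely typed by default, so this sketch only demonstrates constraint checking. DuckDB additionally enforces column types strictly, which is part of what the comment is relying on.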
I've managed to build an application that's way faster, way more capable, and much easier to build than I expected. And it's portable! I think I can get the entire core running in a browser. Field researchers could run this on an iPad in a browser, offline!
This is incredible to me. I've had so much fun learning to use DuckDB better. It's probably my favourite discovery in a couple of years.
And yeah, this totally could have been done any number of different ways. I had prototypes which took much different routes. But the cool part here is I can trust DuckDB to do a ton of heavy lifting. It comes with the cost of some things happening in SQL that I'd prefer it didn't sometimes, but I'm content with that tradeoff. In cases where I'm missing application-layer type safety, I use parsing and tests to ensure my DB abstractions are doing what I expect. It works really well!
edit: For anyone curious, the point of this project is to allow scientists to analyze biodiversity and genomic data more easily using common rather than bespoke tools, as well as publish it to public repositories. Publishing is a major pain point because people in the field typically work very far from the Darwin Core spec :) I'm very excited to polish it a bit and get it in the hands of other organizations.
To expand on this, Polars' `LazyFrame` implementation allows for simple addition of new backends like GPU, streaming, and now distributed computing (though it's currently locked to a vendor). The DuckDB codebase just doesn't have this flexibility, though there are ways to get it to run on GPU using external software.
Polars would let me have a lot of luxuries that are lost at the boundaries between my application and DuckDB, but those are weighed in the tradeoffs I was talking about. I do a lot of parsing at the boundaries to ensure data structures are sound, and otherwise DuckDB is enforcing strict schemas at runtime which provides as much safety as a dataset's schema requires. I do a lot of testing to ensure that I can trust how schemas are built and enforced as well.
Things like foreign keys, expressions that span multiple tables effortlessly, normalization, check constraints, unique constraints, and primary keys work perfectly right off the shelf. It's kind of perfect because the spec I'm supporting is fundamentally about normalized relational data.
Another consideration was that while Polars is a bit faster, we don't encounter datasets that require more speed. The largest dataset I've processed, including extensive transformations and complex validations (about as complex as they get in this spec), takes ~3 seconds for around 580k rows. That's on an M1 Max with 16GB of RAM, for what it's worth.
Our teams have written countless R scripts to do the same work with less assurance that the outputs are correct, having to relearn the spec each time, and with much worse performance (these people are not developers). So, we're very happy with DuckDB's performance even though Polars would probably let us do it faster.
Having said that, if someone built the same project and chose Polars I wouldn't think they were wrong to do so. It's a great choice too, which is why your question is a good one.
I still love that idea but SQL turns out to be so battle-proven, reliable, flexible, capable, and well-documented that it's really hard to beat. After giving it a shot for a couple of weeks it became clear that it would yield a way more flexible and capable application. I'm confident enough that I can overcome the rough edges with the right abstractions and some polish over time.
The first is simply that it's fast - for example, DuckDB has one of the best csv readers around, and it's parallelised.
Next, engines like DuckDB are optimised for aggregate analysis, where your single query processes a lot of rows (often a significant % of all rows). That means that a full scan is not necessarily as big a problem as it first appears. It's not like a transactional database where often you need to quickly locate and update a single row out of millions.
In addition, engines like DuckDB have predicate pushdown so if your data is stored in parquet format, then you do not need to scan every row because the parquet files themselves hold metadata about the values contained within the file.
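A toy illustration of the idea in the paragraph above: if each chunk of a file records the minimum and maximum of a column, a filter can skip whole chunks without reading their values. This is a simplified sketch in plain Python, not Parquet's actual on-disk layout or DuckDB's reader; `RowGroup`, the chunk size, and the data are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A chunk of one column plus the min/max statistics stored alongside it."""
    min_value: int
    max_value: int
    values: list


def make_row_groups(values, size):
    """Split a column into fixed-size chunks and record min/max for each."""
    groups = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups


def scan_greater_than(groups, threshold):
    """Return values > threshold, reading only the chunks that could match."""
    matches = []
    groups_read = 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # the statistics prove nothing in this chunk can match
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


# Sorted data makes the statistics very selective; shuffled data would not.
values = list(range(1000))
groups = make_row_groups(values, 100)
matches, groups_read = scan_greater_than(groups, 949)
print(f"read {groups_read} of {len(groups)} row groups, found {len(matches)} matches")
```

How much this helps in practice depends on how the data is ordered: the more the filtered column is clustered within the file, the more chunks the statistics can rule out.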
Finally, parquet is a columnar format, so the engine only needs to read the columns a query references, rather than processing whole rows when you may only be interested in one or two columns.
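The row-versus-column distinction can be shown with a very small sketch. The records and field names here are made up, and real columnar formats add compression and chunking on top, but the access pattern is the point.

```python
# Row-oriented: each record keeps all of its fields together.
records = [
    {"station": "A", "date": "2024-05-01", "num_services": 10},
    {"station": "B", "date": "2024-05-01", "num_services": 20},
    {"station": "A", "date": "2024-05-02", "num_services": 30},
]

# Column-oriented: one list per column, loosely modelled on how columnar
# formats lay out data.
column_store = {name: [r[name] for r in records] for name in records[0]}

# Summing one field from rows walks every record, even though two of the
# three fields in each record are never used.
total_from_rows = sum(r["num_services"] for r in records)

# The columnar sum reads a single list and never looks at the other columns.
total_from_columns = sum(column_store["num_services"])

print(total_from_rows, total_from_columns)
```

With three columns the difference is trivial; with the ~30 column tables mentioned elsewhere in the thread, an aggregate over one column touches a small fraction of the stored data.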
Depends on your definition of medium sized, but for tables of hundreds of thousands of rows and ~30 columns, these tools are fast enough to run queries instantly or near instantly even on laptop CPUs.
However, you wouldn’t want to use either for transaction processing, the lack of indexes would really hurt.
Related question, curious as to your experience with DuckLake if you've used it. I'm currently setting up s3 + Iceberg + duckDB for my company (startup) and was wondering what to pick between Iceberg and DuckLake.
Think this opens up a lot of interesting possibilities like more powerful analytics notebooks like marimo (https://marimo.io/) … and that’s just one example of many.
We recently created a survey website for the community survey results for Node-RED making it completely dynamic and segment-able. Creates lots of value and allows everyone to look at the data through their own lens. It uses DuckDB with WASM under the hood. Awesome technologies.
Doing that in postgres takes some time, and even simple count(*) takes a lot of time (with all columns indexed)
From what you describe I'd expect a list of column-value pairs under a WHERE to resolve pretty fast if it uses indices and doesn't fish out large amounts of data at once.
I was thinking of using Citus for this, but possibly using duckdb is a better way to do it. Citus comes with a lot more out of the box but duckdb could be a good stepping stone.
Please sell DuckDB to me. I don't know it very well but my (possibly wrong) intuition is that even given equal performance, it's going to drop me to the awkwardness of SQL for data processing.
But coincidentally today I was exploring memory usage and I believe I'm finding memory leaks. Anybody have similar experiences?
Still debugging more deeply but looking reasonably conclusive atm.
https://github.com/duckdb/duckdb/issues/20569
If someone can explain this or has a fix for it I'd love to hear it!
The web/WASM integration is also fabulous. Looking forward to more "small engines" getting into that space to provide some competition and keep pushing it forward.
- you don’t need to use Python (but Pythonic wrappers like Ibis exist; disclaimer I worked on Ibis, you can find my blogs on performance comparisons easily there); CLI, WASM, etc. w/o Python
- governance: DuckDB as OSS is set up in a more sustainable way (DuckDB Labs + DuckDB Foundation). while there is a VC-backed company (MotherDuck), it doesn’t employ the primary developers/control the project in the same way the Polars company does
- overall just simplicity and focus. tends to break less, solely focused on single-node, easy to extend, etc.; not trying to do a cloud product, distributed computing, or GPU execution
To get fast access to the query results we use the Apache Arrow interface and generate the code directly from DuckDB SQL queries using the SQG tool (https://sqg.dev/generators/java-duckdb-arrow/).
As such, it's not readily usable as a library, or set of libraries. I really prefer Apache's approach to analytics where it's possible to pick and choose the parts you need, and integrate them with standard package managers.
Need GB/s arrays over HTTP? Use Arrow Flight. Want to share self-describing structured arrays with files? Use Arrow IPC. Need to read Parquet? Add that package trait.
Another potential issue with DuckDB is the typing at the SQL interface.
Arrow allows direct access to primitive arrays, but DuckDB uses a slightly different type system at the SQL interface. Even small differences in type systems can lead to combinatoric type explosion. This is more a criticism of SQL interfaces than of DuckDB, however.
Additionally Arrow has native libraries in most mainstream languages.
If you want its power as a query engine but like to write python instead of SQL, I highly recommend using it as a backend for the Ibis dataframe library.
It lets you interchange pythonic dataframe syntax (like Pandas and Polars) with SQL, all of which 'compiles' down to SQL in the DuckDB dialect.
And you can use those queries interchangeably in postgres, sqlite, polars, spark, etc.
Writing SQL code
Language integration is paramount for med/lg projects. There's an experimental Java lang project, manifold-sql[1], that does the impossible: inline native DuckDB SQL + type-safety.
    """
    [.sql/] SELECT station_name, count(*) AS num_services
      FROM 'http://blobs.duckdb.org/train_services.parquet'
      WHERE monthname(date) = 'May'
      GROUP BY ALL
      ORDER BY num_services DESC
      LIMIT 3
    """
    .fetch()
    .forEach(row -> out.println(row.stationName + ": " + row.numServices));
1. https://github.com/manifold-systems/manifold/blob/master/doc...