Processing a 300M+ row dataset on a laptop in 30 seconds
This post is a summary of my quest to the edge of Polars' performance on my laptop. I generate a large dataset and run some predefined complex queries on it. Along the way, we will see how to write larger-than-RAM datasets, scan the files, run analytics and collect the results. The Polars package makes it easier than ever to work with larger-than-RAM datasets.
Out of core
Out of core refers to the ability of a computer program or system to process data that does not fit in a computer's main memory (RAM). Memory is a type of fast, temporary storage that the central processing unit (CPU) uses to access data and instructions quickly. However, main memory is usually limited in size and cannot hold all the data that a program or system needs to process.
Many systems will automatically swap memory, meaning they write data that doesn't fit in memory to the hard disk. This decreases performance significantly. It also turns out to impact the lifetime of SSDs, as many write cycles are required to move the data back and forth.
The out-of-core capability of Polars is very useful for handling big data problems, where the dataset is initially too large to be processed in memory on a single computer. It enables you to run your 20, 50, 100GB+ datasets on your laptop. No need to spin up a cluster on a cloud platform or learn a new syntax.
Generating the dataset
It is not that easy to find very large datasets on the internet, especially when we want to find the edge of what Polars is capable of processing on our laptop. Luckily, other people do a lot of benchmarking. That is how I found the TPC-H benchmark, which Polars uses to benchmark against other packages. You can find the repository here.
You can clone the repo and run the code. In the repository's Makefile you can change the scale of the generated dataset, or you can run the generation command directly in your terminal, replacing XX with the scale. In our case we will use scale factor 50.
This factor will create several files. The largest is lineitem.tbl, with a size of 39.53GB. With the prepare_files.py script we create parquet files, which are smaller. However, we will need to adjust this script, as the file is (probably) too large to read into memory. Therefore we replace read_csv() with scan_csv().
We also see one of the newest features in action: sink_parquet(). This function allows us to write larger-than-memory files to disk.
Scanning and analysing the dataset
I am processing and analysing the data on a 2020 MacBook Pro with 16GB RAM and an M1 chip. Performance may differ on your machine, depending on OS and specs.
Find the notebook with all the examples here: https://github.com/r-brink/polars-outofcore-tutorial/blob/master/polars-outofcore-tutorial-2023.ipynb
Required imports
Scanning the datasets
Due to the size of the datasets, we cannot simply read the parquet files into memory. We will have to use the Lazy API of Polars. Fortunately for us, it is not so different from the eager API.
The TPC-H benchmark creates several tables, but we will only scan a selection to keep this notebook concise.
Investigating the dataset
Our lineitem file spans 6 years of transactions and contains a little over 300 million rows. That is a pretty large dataset.
shape: (4, 10)
┌──────────────┬──────────────┬────────────┬────────────────┬─────┬───────────┬──────────────┬──────────┬─────────────┐
│ l_returnflag ┆ l_linestatus ┆ sum_qty    ┆ sum_base_price ┆ ... ┆ avg_qty   ┆ avg_price    ┆ avg_disc ┆ count_order │
│ ---          ┆ ---          ┆ ---        ┆ ---            ┆     ┆ ---       ┆ ---          ┆ ---      ┆ ---         │
│ str          ┆ str          ┆ i64        ┆ f64            ┆     ┆ f64       ┆ f64          ┆ f64      ┆ u32         │
╞══════════════╪══════════════╪════════════╪════════════════╪═════╪═══════════╪══════════════╪══════════╪═════════════╡
│ A            ┆ F            ┆ 1887655913 ┆ 2.8306e12      ┆ ... ┆ 25.499323 ┆ 38236.556664 ┆ 0.050001 ┆ 74027688    │
│ N            ┆ F            ┆ 49261643   ┆ 7.3892e10      ┆ ... ┆ 25.510566 ┆ 38265.510418 ┆ 0.049979 ┆ 1931029     │
│ N            ┆ O            ┆ 3717663006 ┆ 5.5745e12      ┆ ... ┆ 25.498582 ┆ 38234.266455 ┆ 0.049998 ┆ 145798813   │
│ R            ┆ F            ┆ 1887847853 ┆ 2.8307e12      ┆ ... ┆ 25.502359 ┆ 38239.130191 ┆ 0.049999 ┆ 74026400    │
└──────────────┴──────────────┴────────────┴────────────────┴─────┴───────────┴──────────────┴──────────┴─────────────┘
How many line items do we filter in the query above? A little over 295M.
A more complex example
In the TPC-H benchmarks of Polars at pola.rs/benchmarks, we can see that query 7 is the most demanding one for most DataFrame packages. These benchmarks were executed with scale factor 10. Let's see how it performs on our dataset with scale factor 50.
In this example I have removed the original filter, as I wanted to see the result for all years, not just 1995 and 1996.
Running the above query without streaming=True will lead to large parts of the data being swapped to the SSD of my laptop.
Conclusion
The out-of-core functionality of Polars is great. It opens up a whole new set of use cases that we can run on our laptops. Personally, I prefer this, as I don't have to create a cluster, spin it up and rewrite my queries in PySpark or another syntax. I can do most of the work on my laptop in the tools I prefer, and in case it is not fast enough, it is straightforward to get a larger machine with more cores to improve performance.
If you are curious, there are other queries in the TPC-H benchmark to experiment with, and it is easy to use the repository to generate even larger datasets.
Learn more about Polars here: https://github.com/pola-rs/polars