Processing a 300M+ row dataset on a laptop in 30 seconds

Robin van den Brink
5 min read · Dec 28, 2022


This post is a summary of my quest to find the edge of Polars' performance on my laptop. I generate a large dataset and run some predefined complex queries against it. Along the way, we will see how to write larger-than-RAM datasets to disk, scan the files, run analytics and collect the results.

The Polars package makes it easier than ever to work with larger-than-RAM datasets.

Out of core

Out of core refers to the ability of a program or system to process data that does not fit in a computer's main memory (RAM). Memory is fast, temporary storage that the central processing unit (CPU) uses to access data and instructions quickly. However, main memory is usually limited in size and cannot hold all the data that a program or system needs to process.

Many systems will automatically swap memory, meaning they write data that doesn't fit in RAM to the hard disk. This decreases performance significantly. It also turns out to impact the lifetime of SSDs, as many write cycles are required to move the data back and forth.

That is a lot of data being swapped.

The out-of-core capability of Polars is very useful for big data problems, where the dataset is too large to fit in the memory of a single computer. It enables you to run your 20, 50, 100GB+ datasets on your laptop. No need to spin up a cluster on a cloud platform or learn a new syntax.

Generating the dataset

It is not that easy to find very large datasets on the internet, especially as we want to find the edge of what Polars is capable of processing on a laptop. Luckily, other people do a lot of benchmarking. That is how I found the TPC-H benchmark, which Polars uses to benchmark against other packages. You can find the repository here.

You can clone the repo and run the code. In its Makefile you can change the scale of the generated dataset, or you can run the command directly in your terminal, replacing XX with the scale factor. In our case we will use scale factor 50.

This factor creates several files; the largest is lineitem.tbl with a size of 39.53GB. With the prepare_files.py script we convert them to parquet files, which are smaller. However, we need to adjust this script, as the file is (probably) too large to read into memory. Therefore we replace read_csv() with scan_csv().

We also see one of the newest features in action: sink_parquet(). This function allows us to write larger-than-memory files to disk.
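The original script is in the TPC-H repo; the snippet below is only a minimal sketch of the scan_csv() to sink_parquet() pattern. The file paths are assumptions, and the "|" separator follows what TPC-H's dbgen emits.

```python
import polars as pl

# Lazily scan the generated .tbl file instead of reading it into memory.
# (Older Polars versions call the `separator` parameter `sep`.)
lf = pl.scan_csv(
    "tables_scale_50/lineitem.tbl",  # assumed path
    separator="|",
    has_header=False,
)

# sink_parquet streams the result to disk, so the full table
# never has to fit in RAM at once.
lf.sink_parquet("tables_scale_50/lineitem.parquet")
```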

Scanning and analysing the dataset

I am processing and analysing the data on a 2020 MacBook Pro with 16GB RAM and an M1 chip. Performance may differ on your machine based on OS and specs.

Find the notebook with all the examples here: https://github.com/r-brink/polars-outofcore-tutorial/blob/master/polars-outofcore-tutorial-2023.ipynb

Required imports
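The import cell from the notebook is not reproduced here; presumably it boils down to something like the following (the datetime import is an assumption, used for the date filters later on):

```python
import polars as pl
from datetime import date  # for date filters in the queries below (assumption)
```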

Scanning the datasets

Due to the size of the datasets, we cannot simply read the parquet files into memory. We have to use the lazy API of Polars. Fortunately for us, it is not so different from the eager API.

The TPC-H generator creates several tables, but we will only scan a selection of them to keep this post concise.
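A sketch of the scans, with file paths as assumptions (adjust them to wherever prepare_files.py wrote the parquet files):

```python
# scan_parquet only reads metadata and builds a query plan;
# no data is loaded into memory yet.
lineitem = pl.scan_parquet("tables_scale_50/lineitem.parquet")
orders = pl.scan_parquet("tables_scale_50/orders.parquet")
customer = pl.scan_parquet("tables_scale_50/customer.parquet")
supplier = pl.scan_parquet("tables_scale_50/supplier.parquet")
nation = pl.scan_parquet("tables_scale_50/nation.parquet")
```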

Investigating the dataset

Our lineitem file spans 6 years of transactions and contains a little over 300M rows. That is a pretty large dataset.
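The table below is the output of a TPC-H query-1 style aggregation. The exact query lives in the notebook; the sketch below is my reconstruction, assuming the standard TPC-H column names and that l_shipdate was parsed as a Date during the parquet conversion (newer Polars uses group_by and pl.len; older versions spell these groupby and pl.count):

```python
q1 = (
    lineitem
    # ship date cutoff taken from the TPC-H query 1 definition
    .filter(pl.col("l_shipdate") <= date(1998, 9, 2))
    .group_by("l_returnflag", "l_linestatus")
    .agg(
        pl.sum("l_quantity").alias("sum_qty"),
        pl.sum("l_extendedprice").alias("sum_base_price"),
        (pl.col("l_extendedprice") * (1 - pl.col("l_discount"))).sum().alias("sum_disc_price"),
        (pl.col("l_extendedprice") * (1 - pl.col("l_discount")) * (1 + pl.col("l_tax"))).sum().alias("sum_charge"),
        pl.mean("l_quantity").alias("avg_qty"),
        pl.mean("l_extendedprice").alias("avg_price"),
        pl.mean("l_discount").alias("avg_disc"),
        pl.len().alias("count_order"),
    )
    .sort("l_returnflag", "l_linestatus")
)

# streaming=True executes the plan in batches,
# so the full 300M rows never have to be in RAM at once.
print(q1.collect(streaming=True))
```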

shape: (4, 10)
┌────────────┬────────────┬──────────┬────────────┬─────┬─────────┬─────────┬──────────┬───────────┐
│ l_returnfl ┆ l_linestat ┆ sum_qty ┆ sum_base_p ┆ ... ┆ avg_qty ┆ avg_pri ┆ avg_disc ┆ count_ord │
│ ag ┆ us ┆ --- ┆ rice ┆ ┆ --- ┆ ce ┆ --- ┆ er │
│ --- ┆ --- ┆ i64 ┆ --- ┆ ┆ f64 ┆ --- ┆ f64 ┆ --- │
│ str ┆ str ┆ ┆ f64 ┆ ┆ ┆ f64 ┆ ┆ u32 │
╞════════════╪════════════╪══════════╪════════════╪═════╪═════════╪═════════╪══════════╪═══════════╡
│ A ┆ F ┆ 18876559 ┆ 2.8306e12 ┆ ... ┆ 25.4993 ┆ 38236.5 ┆ 0.050001 ┆ 74027688 │
│ ┆ ┆ 13 ┆ ┆ ┆ 23 ┆ 56664 ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ N ┆ F ┆ 49261643 ┆ 7.3892e10 ┆ ... ┆ 25.5105 ┆ 38265.5 ┆ 0.049979 ┆ 1931029 │
│ ┆ ┆ ┆ ┆ ┆ 66 ┆ 10418 ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ N ┆ O ┆ 37176630 ┆ 5.5745e12 ┆ ... ┆ 25.4985 ┆ 38234.2 ┆ 0.049998 ┆ 145798813 │
│ ┆ ┆ 06 ┆ ┆ ┆ 82 ┆ 66455 ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ R ┆ F ┆ 18878478 ┆ 2.8307e12 ┆ ... ┆ 25.5023 ┆ 38239.1 ┆ 0.049999 ┆ 74026400 │
│ ┆ ┆ 53 ┆ ┆ ┆ 59 ┆ 30191 ┆ ┆ │
└────────────┴────────────┴──────────┴────────────┴─────┴─────────┴─────────┴──────────┴───────────┘

How many line items do we filter in the query above? A little over 295M.
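A sketch of that count, under the same assumptions as the query above:

```python
n_filtered = (
    lineitem
    .filter(pl.col("l_shipdate") <= date(1998, 9, 2))
    .select(pl.len())
    .collect(streaming=True)
    .item()
)
print(f"{n_filtered:,}")  # a little over 295M at scale factor 50
```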

A more complex example

In the TPC-H benchmarks of Polars at pola.rs/benchmarks, we can see that query 7 is the most demanding one for most DataFrame packages. Those benchmarks were executed with scale factor 10. Let’s see how it performs on our dataset with scale factor 50.

In this example I have removed the original filter, because I wanted to see the result for all the years, not just 1995 and 1996.
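The full query is in the notebook; what follows is my hedged reconstruction in the spirit of TPC-H query 7 (shipping volume between two nations), with the 1995-1996 ship-date filter removed as described. Table and column names follow the TPC-H schema; treat the details as illustrative rather than an exact copy.

```python
# Two "views" of the nation table: one for the supplier side, one for the customer side.
n1 = nation.select(pl.col("n_nationkey"), pl.col("n_name").alias("supp_nation"))
n2 = nation.select(pl.col("n_nationkey"), pl.col("n_name").alias("cust_nation"))

q7 = (
    lineitem
    .join(supplier, left_on="l_suppkey", right_on="s_suppkey")
    .join(orders, left_on="l_orderkey", right_on="o_orderkey")
    .join(customer, left_on="o_custkey", right_on="c_custkey")
    .join(n1, left_on="s_nationkey", right_on="n_nationkey")
    .join(n2, left_on="c_nationkey", right_on="n_nationkey")
    .filter(
        ((pl.col("supp_nation") == "FRANCE") & (pl.col("cust_nation") == "GERMANY"))
        | ((pl.col("supp_nation") == "GERMANY") & (pl.col("cust_nation") == "FRANCE"))
    )
    # The benchmark query also filters l_shipdate to 1995-1996; that filter is dropped here.
    .with_columns(
        pl.col("l_shipdate").dt.year().alias("l_year"),
        (pl.col("l_extendedprice") * (1 - pl.col("l_discount"))).alias("volume"),
    )
    .group_by("supp_nation", "cust_nation", "l_year")
    .agg(pl.sum("volume").alias("revenue"))
    .sort("supp_nation", "cust_nation", "l_year")
)

print(q7.collect(streaming=True))
```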

Running the above query without streaming=True leads to large parts of the data being swapped to the SSD on my laptop.

Conclusion

The out-of-core functionality of Polars is great. It opens up a whole new set of use cases we can run on our laptops. Personally, I prefer this, as I don't have to create a cluster, spin it up and rewrite my queries in PySpark or another syntax. I can do most of the work on my laptop with the tools I prefer, and in case it is not fast enough, it is straightforward to get a larger machine with more cores to improve performance.

If you are curious, there are other queries in the TPC-H benchmark to experiment with, and it is easy to use the repository to generate even larger datasets.

Learn more about Polars here: https://github.com/pola-rs/polars


Robin van den Brink

Building digital products as product manager and hobby developer. Focused on data and AI products. Polars enthusiast and contributor.