Processing a 10GB dataset on someone else’s computer in 170 seconds with Polars

Robin van den Brink
5 min read · Feb 5, 2023


No need for a computer with the cloud at your fingertips

Part three in my search for the limitations of Polars. Previously I wrote about processing a 300M-row dataset on a laptop and on a low-spec machine using Polars, showing that we can do more with our laptops in terms of data processing. The most logical next step to investigate and experiment with was to use Polars without a computer of our own, i.e., on someone else’s computer (the cloud).

The cloud promises the opportunity to send over the workload and let the provider optimise the resources for execution. This would make our lives even easier.

Disclaimer: In this article I will leave the account configuration out of scope. I found it confusing and demotivating to set up all the accounts and service permissions. We must assume it just works.

Fully serverless with Lambda functions

The first attempt was with DigitalOcean’s Functions. However, installing any package seemed to be too much for the Functions to handle. We therefore switched to AWS, famous for its serverless Lambda functions.

The setup is simple: a parquet file (60M rows, 1.9GB) stored in an S3 bucket. We download the input from the bucket, run our query, and print the output to the console.
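
For reference, here is a minimal sketch of what such a handler could look like. The bucket, key, and column names are placeholders of mine, and the aggregation is illustrative; the actual code lives in the repository linked further down.

```python
import io

import boto3
import polars as pl

s3 = boto3.client("s3")


def handler(event, context):
    # Hypothetical bucket and key; the real names are not in the article.
    obj = s3.get_object(Bucket="my-polars-bucket", Key="data_60m.parquet")

    # Read the downloaded bytes straight into a Polars DataFrame.
    df = pl.read_parquet(io.BytesIO(obj["Body"].read()))

    # Illustrative aggregation; the article does not show the actual query.
    out = df.group_by("id").agg(pl.col("value").sum())
    print(out)
```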

Early in the experimentation I bumped the RAM of the Lambda function to the maximum limit of 10240 MB (10GB); the Lambda functions seem to use a lot of memory. 10GB of memory should be plenty, considering my previous post on processing a larger dataset with 512MB.
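
If you prefer to raise the limit programmatically rather than in the console, a one-off boto3 call can do it. A sketch, with a placeholder function name:

```python
import boto3

lam = boto3.client("lambda")

# Raise the function's memory to Lambda's 10240 MB ceiling.
lam.update_function_configuration(
    FunctionName="polars-query",  # hypothetical function name
    MemorySize=10240,
)
```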

As we can see, the Lambda function takes only 34 seconds to print the output to our console.

In my S3 bucket I also have my larger parquet file stored. At 9.8GB it is close to the memory limit. Read more in my first post on how I created it.

We see that the Lambda function returns the expected output after 170 seconds, well within the memory limit.

It is interesting, and at the moment unexplained, that the query on the large dataset works on the Lambda function: the smaller 1.9GB dataset takes 7GB of memory, while the 9.8GB dataset takes (only) 10GB. We just happily accept it for now.

Scaling up with an improvised home-built solution

Fully serverless worked, but we can probably go faster. With fully serverless we do not have any control over the machine that runs our query. Although that is the selling point of the cloud, in our case we would benefit from having a lot of cores for Polars to utilise.

The new approach was to keep the S3 bucket but spin up and terminate a dedicated EC2 instance. The major benefit is the opportunity to scale the resources vertically, as we can select EC2 instances with many cores. As we only run it for several minutes, we could go for 100+ cores. To illustrate: the u-6tb1.112xlarge instance has 448 cores (the instance with the most cores in the eu-central-1 region) and costs €65 per hour (just over €1 per minute).

Overview of our setup

There are two scripts to get our proof of concept working:

Both code snippets, including the .env template, can be found in this GitHub repository: https://github.com/r-brink/serverless-polars.

  1. First, ensure that execute.py is available for the EC2 instance to download or clone. The script downloads the right data file from S3, runs the query, and uploads the outcome to the S3 bucket again (see the sketch after this list). For this experiment I wget the file from an external location when the instance starts.
  2. After that, run runinstance.py from your local machine. It starts the specified instance on AWS, and the instance runs the user-data part of the script when it boots (also sketched below). See the disclaimer below on the termination of the instance.
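
As a rough sketch, execute.py could look something like this. The bucket name, keys, and query are placeholders of mine, so check the repository above for the actual script:

```python
# execute.py: runs on the EC2 instance (sketch with assumed names).
import boto3
import polars as pl

BUCKET = "my-polars-bucket"  # hypothetical bucket name

s3 = boto3.client("s3")

# 1. Download the data file from S3.
s3.download_file(BUCKET, "data_300m.parquet", "/tmp/data.parquet")

# 2. Run the query lazily so Polars can optimise it and use every core.
out = (
    pl.scan_parquet("/tmp/data.parquet")
    .group_by("id")
    .agg(pl.col("value").sum())
    .collect()
)

# 3. Upload the outcome back to the bucket.
out.write_parquet("/tmp/result.parquet")
s3.upload_file("/tmp/result.parquet", BUCKET, "result.parquet")
```

And a sketch of runinstance.py, again with placeholder AMI id, script location, and instance type. Setting InstanceInitiatedShutdownBehavior to "terminate" means the shutdown at the end of the user data terminates (rather than just stops) the instance:

```python
# runinstance.py: starts the EC2 instance from your local machine (sketch).
import boto3

# Shell commands the instance runs on boot: fetch execute.py, run it,
# then shut down. See the disclaimer below about early termination.
USER_DATA = """#!/bin/bash
pip3 install polars boto3
wget -O /home/ec2-user/execute.py https://example.com/execute.py
python3 /home/ec2-user/execute.py
shutdown -h now
"""

ec2 = boto3.client("ec2", region_name="eu-central-1")
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="c3.4xlarge",
    MinCount=1,
    MaxCount=1,
    UserData=USER_DATA,
    InstanceInitiatedShutdownBehavior="terminate",
)
```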

Disclaimer: the part that terminates the instance needs improvement. Currently it often terminates the instance before the query has run successfully, but at least you won’t be surprised by the bill for a running instance. DO NOT FORGET to turn off the instance yourself if you remove the lines of code that terminate it.

Time to discuss the results

A quick recap of our approach: we create an EC2 instance on AWS, download the data from S3, execute a query, and write the output back to S3. In my testing I found the results promising.

For the first experiment, I used a c3.4xlarge instance (16 vCPUs, ~€1.7/hour) and the small dataset (60M rows, ~2GB). The instance was up and running in about 16.5 seconds. The output was returned and uploaded to S3 in ~50 seconds.

The second experiment used the larger parquet file (300M rows, 10GB). The output was returned and uploaded to S3 in 2 minutes and 10 seconds. Because I haven’t set up proper logging (I know), I am not sure how long each part of the execute.py script takes. Most of the time here is likely spent downloading the large data file.

Conclusion

These two small experiments leave a lot of room for improvement: adding any form of logging, pulling scripts from a git repository, terminating the instance when the workload is done, dynamically assigning instances, researching whether we can get the data onto the EC2 instance faster, and the list goes on.

The main takeaway is that with Polars we do not have to spin up a whole cluster to distribute the workload for our data processing.

You can run smaller ETL workloads with Lambda functions or spin up a compute-optimised EC2 instance for larger workloads. A next step would be to use an event trigger to start a Lambda function that creates the required EC2 instance and use another Lambda to terminate the EC2 instance when the query is done.

The potential workflow looks amazing: work locally on a sample and evaluate your query with the Lazy API of Polars. If all looks good, commit your script and let the cloud pick it up to handle the rest.
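
A minimal sketch of that local step, with placeholder file and column names:

```python
import polars as pl

# Build the query lazily against a local sample file.
lazy = (
    pl.scan_parquet("sample.parquet")  # placeholder sample file
    .filter(pl.col("value") > 0)
    .group_by("id")
    .agg(pl.col("value").sum())
)

print(lazy.explain())          # inspect the optimised query plan
print(lazy.head(5).collect())  # sanity-check the output on a small slice
```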

Learn more about Polars here: www.github.com/pola-rs/polars


Robin van den Brink

Building digital products as a product manager and hobby developer. Focused on data and AI products. Polars enthusiast and contributor.