Introduction to Polars

One of the fastest DataFrame libraries at the moment

Robin van den Brink
6 min read · Mar 23, 2021

THIS ARTICLE IS OUTDATED. YOU CAN FIND AN UPDATED VERSION HERE: [2023 update] Introduction to Polars

Introduction

In this article, we are going to take a closer look at Polars. Polars is a new DataFrame library implemented in Rust with convenient Python bindings. The H2O.ai benchmark shows that it is one of the fastest DataFrame libraries of the moment. From the Polars book: ‘The goal of Polars is being a fast DataFrame library that utilizes the available cores on your machine. Its ideal use case is data too big for pandas and too small for spark. Similar to spark Polars consists of a query planner that may (and probably does) optimize your query in order to do less work or reduce memory usage.’

Polars offers both an eager and a lazy API. The lazy API is said to be ‘somewhat similar to spark’ and allows the user to optimise the query before it runs, promising ‘blazingly fast’ performance.

In this article, we will give a first introduction in Python and work with some of the available functionality of this new DataFrame package to get an idea of what it has to offer. In the first part of the article we will use the eager API of Polars, and at the end we will use the lazy API to check the syntax and see the differences.

To explore the functionalities of Polars we are going to use the Wine Review dataset with 150k wine reviews with variety, location, winery, price, and descriptions.

You can download the dataset that we will use on Kaggle.

It is also possible to run the cells in this article by yourself and play around with the code along the way. You can find this article in a Jupyter notebook format on my Github page.

Installing Polars

We can easily install Polars from PyPI with the following command:

pip install polars==0.7.0

In this article, we will specifically use the 0.7.0 release of Polars, because it is the latest stable version at the time of writing. Polars is still in an early stage of development, so a lot may change until the first truly stable version, 1.0.

Note: as a best practice, don’t forget to create and activate your virtual environment before installing Polars

Import relevant packages

To work with Polars and start analysing the Wine Review dataset we are going to import two packages: Polars and Matplotlib.

Polars already offers many functions that will feel familiar if you have worked with Pandas before. We can find an overview, including examples for most of them, in the reference guide.

Let’s start by loading the dataset and beginning our analyses.

Now that the data is read into the DataFrame, let’s have a closer look at it.

Dataset inspection

The dataset has a lot to offer. With 11 variables and over 150k rows, there is a lot of data to analyse. We see a couple of variables that are interesting to look into, like price, country, points.

Removing nulls

Before we continue we want to have a closer look at whether there are any nulls in the dataset.

It seems that a little less than 10% of the price values are missing. We could have dropped the rows with missing values or filled them; we chose to fill with the mean as our filling strategy.

Some analyses

The next step is to dive in a little deeper and have a closer look at the dataset with some more complex functions.

The goal that we want to achieve in the following part is to have a closer look at the countries and how they compare in terms of price and points.

The minimum number of points shows that there is no such thing as bad wine.

There are two strange values in our dataset: an undefined country (“”) and a country called ‘US-France’.

There were only 6 such rows, so it was safe to drop them.

Time to look into the countries that produce the best wine according to the points and have the highest price for a bottle.

England is leading the list for the best wines. Wonder how they think about that on the other side of the Channel in France.

Plotting while using Polars

To get a better insight into the differences it always helps to have some nice plots. Whereas Pandas has plotting functionality built in, with Polars we have to rely on our Matplotlib skills. We focus on the top 15 countries.

Distribution of points of the top 15 countries

Time to go lazy

The lazy API offers a way to optimise your queries, similar to Spark. The major benefit over Spark is that we don’t have to set up an environment and can therefore keep working from our notebook.

More information can be found in the Polars-book.

Printing the type returns ‘polars.lazy.LazyFrame’, indicating that we are now working with a lazy query description rather than the materialised data.

Similar to the filters that we did with the eager API we are going to filter the unknown and ‘US-France’ values in the country variable.

As we can see, nothing happens right away. From the documentation: ‘This is due to the lazyness, nothing will happen until specifically requested. This allows Polars to see the whole context of a query and optimize just in time for execution.’

As we can see, the syntax of the lazy API is different from what we did in the beginning. Although it takes some getting used to, the syntax gives a nice overview of the different steps we want to take.

To actually see the results we can do two things: collect() and fetch(). The difference is that fetch runs the query on only the first 500 rows, whereas collect runs the query over all the data. Below we can see the differences for our case.

Output

We have got the output that we were looking for. Polars offers several ways to output our analyses, even to other formats useful for further analysis (e.g. a pandas DataFrame via to_pandas(), or NumPy arrays via to_numpy()).

Final word

Polars is a new package that is gaining a lot of attention. At the time of writing this article, it has gathered more than 1300 stars on GitHub, which is impressive considering it has been around for less than a year. It offers almost all the functions that we need to manipulate our DataFrames. Next to that, it offers a lazy API that helps us optimise our queries before we execute them. Although we didn’t touch on it in this article, the H2O benchmark shows that it is super efficient and fast. Especially with larger datasets it becomes worthwhile to look into the benefits that the lazy API has to offer.

I hope this article showed some of the potential Polars has to offer. There is a lot more to explore. The developer behind Polars is very responsive to issues. For (beginning) open source developers there are plenty of opportunities to contribute, both on the Python and Rust side. If you want to know more about the design decisions in Polars, I highly recommend this blog post from the developer behind the package.

Link to the Polars Github page


Robin van den Brink

Building digital products as product manager and hobby developer. Focused on data and AI products. Polars enthusiast and contributor.