Mastering Polars with contexts and more complex expressions

Robin van den Brink
4 min read · Sep 12, 2023

Querying data with Polars is a great experience. It is fast and the queries are easy to read. There are many tutorials to get you started (I particularly like the one I wrote). However, most only cover the basics. You will probably need a bit more guidance before you can use Polars in your daily work and replace your Pandas or PySpark code.

To grasp the topic at hand, let us examine the components of the query shown in the figure below. The query consists of a context and the actual expression. Thinking in these two levels has helped me structure my code and write better queries.

[Figure: a Polars query split into its context and its expression]

In the rest of the article we will look at the different contexts and see how we can write more complex expressions. Our aim is to acquire a deeper comprehension in order to fully utilise Polars.

"Expressions are the core strength of Polars. The expressions offer a versatile structure that both solves easy queries and is easily extended to complex ones." ~ Polars User Guide

NOTE: If you are familiar with the basics of Polars expressions, you can skip part 1.

Part 0: Dataset, development environment and inspection

As always, you can find the accompanying repository here (https://github.com/r-brink/polars-queries-contexts-expressions), which includes a notebook and the requirements.txt.

The dataset used in this article can be downloaded from https://osf.io/p6tyr. Credits to Gábor Békés and Gábor Kézdi for creating the dataset.

Some basic DataFrame inspection

Some quick expressions to inspect our DataFrame before we dive in.

Part 1: Defining the context

Let us start at the beginning. As we can see in the recently updated and improved Polars user guide, there are three main contexts:

  • select() & with_columns()
  • filter()
  • group_by().agg()

Select

Select allows us to pick a specific column from our DataFrame with an expression! It looks like this. Don’t forget to call collect or fetch if you use the Lazy API (as we do in this tutorial) to actually return the output.

You can extend the expression by selecting more columns, so that we can see the hotel id, year, price and offer categories.

As we can see, it is very easy to select exactly the columns we need from our DataFrame.

Filter

In the examples above, we have already noticed that we get a lot of information that we may not be interested in. What if we want more specific information, such as data for a particular year or price range?

Although we don’t see the year column, we can still see that the number of rows is greatly reduced. What if we have a more complex expression in our filter context?

Expressions in the filter context can be simple and specific, using comparison operators such as <, > or ==, and can be combined with logical operators such as & or |.

Below you can see a more complex expression where we are looking for specific hotels that have an offer with a price point between 2500 and 2600 and have either a ‘no offer’ or a ‘50–75% offer’ in any year that is not 2018.

With_columns

So far we have selected columns and created specific filters. There are many cases where you want to create a new column based on the data, or add a new column to your DataFrame. This is where the with_columns context comes in.

In the examples a breakfast column has been added. pl.lit() has been used to add a ‘literal’ as a value. If we want to add a more ‘dynamic’ column, we can do the following:

Group_by

The last, but perhaps most useful context, is the group_by. It allows us to aggregate data around a specific value. Below is an example to show the average price per hotel(_id).

There is also something new in this query, .alias(). Alias allows us to give a column a specific name. This is useful if you’re creating a new column, as Polars does not allow duplicate names, or if you want to create a new DataFrame from a particular subselection.

Part 2: Writing more complex queries

The User Guide promises that the expressions are versatile and provide a structure for easily extending your queries. We’ve seen the basic components, so let’s combine them to create more advanced and useful queries.

The syntax makes it easy to combine the contexts and write the required expressions we have already covered. So if we want to:

  1. add three new features to our DataFrame (including calculating the total price),
  2. filter for specific features,
  3. select hotels and add the features to our DataFrame,
  4. sort everything to show the most expensive option at the top.

The query will look like this:

Conclusion

I hope the examples above have given you some inspiration on how to write and structure your queries. Understanding the basics helped me to write better queries with Polars. It also made me not want to go back to Pandas. Expressions and contexts in Polars are easily extensible, easier to read and explain, and blazingly fast.

If you have any questions, let me know! Write in the comments and I’ll try to answer as soon as possible.


Robin van den Brink

Building digital products as product manager and hobby developer. Focused on data and AI products. Polars enthusiast and contributor.