Mastering Polars with contexts and more complex expressions
Querying data with Polars is a great experience. It is fast and the queries are easy to read. There are many tutorials to get you started (I particularly like the one I wrote). However, most only cover the basics. You will probably need a bit more guidance to start using Polars in your daily work, replacing your Pandas or PySpark work.
To grasp the topic at hand, let us examine the components of the query shown in the figure below. A query consists of a context and the expressions evaluated within that context. Thinking in these two levels has helped me structure my code and write better queries.
In the rest of the article we will look at the different contexts and see how we can write more complex expressions. Our aim is to acquire a deeper comprehension in order to fully utilise Polars.
"Expressions are the core strength of Polars. The expressions offer a versatile structure that both solves easy queries and is easily extended to complex ones" ~ Polars User guide (link)
NOTE: If you are familiar with the basics of Polars expressions, you can skip Part 1.
Part 0: Dataset, development environment and inspection
As always, you can find the accompanying repository here (https://github.com/r-brink/polars-queries-contexts-expressions), which includes a notebook and the requirements.txt.
The dataset used in this article can be downloaded from https://osf.io/p6tyr. Credits to Gábor Békés and Gabor Kezdi for creating the dataset.
Some basic Dataframe inspection
Some quick expressions to inspect our Dataframe before we dive in.
Part 1: Defining the context
Let us start at the beginning. As we can see in the recently updated and improved Polars user guide, there are three main contexts:
- select() & with_columns()
- filter()
- group_by().agg()
Select
Select allows us to pick specific columns from our Dataframe with an expression. It looks like this. Don’t forget to call collect() or fetch() if you use the Lazy API (as we do in this tutorial) to actually return the output.
You can extend the expression by selecting more columns, so that we can see the hotel id, year, price and offer categories.
As we can see, it is very easy to select the columns we need to add to our Dataframe.
Filter
In the examples above, we have already noticed that we get a lot of information that we may not be interested in. What if we want more specific information, such as data for a particular year or price range?
Although we don’t see the year column, we can still see that the number of rows is greatly reduced. What if we have a more complex expression in our filter context?
Expressions in the filter context can be simple and specific, using comparison operators such as <, > or ==, and can be combined using logical operators such as & or |.
Below you can see a more complex expression where we are looking for specific hotels that have an offer with a price point between 2500 and 2600 and have either a ‘no offer’ or a ‘50–75% offer’ in any year that is not 2018.
With_columns
So far we have selected columns and created specific filters. There are many cases where you want to create a new column based on the data, or add a new column to your Dataframe. This is where the with_columns context comes in.
In the examples a breakfast column has been added. pl.lit() has been used to add a ‘literal’ as a value. If we want to add a more ‘dynamic’ column, we can do the following:
Group_by
The last, but perhaps most useful, context is group_by. It allows us to aggregate data around a specific value. Below is an example to show the average price per hotel (hotel_id).
There is also something new in this query, .alias(). Alias allows us to give a column a specific name. This is useful if you’re creating a new column, as Polars does not allow duplicate names, or if you want to create a new Dataframe on a particular sub selection.
Part 2: Writing more complex queries
The User Guide promises that the expressions are versatile and provide a structure for easily extending your queries. We’ve seen the basic components, so let’s combine them to create more advanced and useful queries.
The syntax makes it easy to combine the contexts and write the expressions we have already covered. So if we want to:
- add three new features to our Dataframe (including calculating the total price),
- filter for specific features,
- select hotels and add the features to our Dataframe,
- sort everything to show the most expensive option at the top.
The query will look like this:
Conclusion
I hope the examples above have given you some inspiration for writing and structuring your own queries. Understanding the basics helped me to write better queries with Polars. It also made me not want to go back to Pandas. Expressions and contexts in Polars are easily extensible, easy to read and explain, and blazingly fast.
If you have any questions, let me know! Write in the comments and I’ll try to answer as soon as possible.