Data Transformation

Introduction

It’s very rare that data arrive in exactly the right form you need. Often, you’ll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with.

You’ll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the polars package and a new dataset on flights that departed New York City in 2013.

The goal of this chapter is to give you an overview of all the key tools for transforming a data frame, a special kind of object that holds tabular data.

We’ll come back these functions in more detail in later chapters, as we start to dig into specific types of data (e.g. numbers, strings, dates).

Prerequisites

In this chapter we’ll focus on the polars package, one of the most widely used tools for data science. You’ll need to ensure you have polars installed. To do this, you can run

import polars as pl

If this command fails, you don’t have polars installed. Open up the terminal in Visual Studio Code (Terminal -> New Terminal), cd to the folder you are working in, and type in uv add polars.

Furthermore, if you wish to check which version of polars you’re using, it’s

pl.__version__

You’ll also need the data. Most of the time, data will need to be loaded from a file or the internet. These data are no different, but one of the amazing things about polars is how many different types of data it can load, including from files on the internet.

The data is around 50MB in size so you will need a good internet connection or a little patience for it to download.

Let’s download the data:

import io

import requests

url = "https://raw.githubusercontent.com/byuidatascience/data4python4ds/master/data-raw/flights/flights.csv"
resp = requests.get(url, timeout=60)
resp.raise_for_status()

flights = pl.read_csv(
    io.BytesIO(resp.content),
    null_values=["NA"],
    truncate_ragged_lines=True,
    ignore_errors=True,
)

If the above code worked, then you’ve downloaded the data in CSV format and put it in a data frame. Let’s look at the first few rows using the .head() function that works on all polars data frames.

flights.head()

To get more general information on the columns, the data types (dtypes) of the columns, and the size of the dataset, use .glimpse().

flights.glimpse(max_items_per_column=5)

You might have noticed the short abbreviations that appear in the dtype column. These tell you the type of the values in their respective columns: i64 is short for integer (eg whole numbers) and f64 is short for double-precision floating point number (these are real numbers). polars has an object data type which allows storing arbitrary Python objects, but this makes you lose performance benefits, as polars is strictly typed. Although not found here, other data types include str for text and datetime for combinations of a date and time.

The table below gives some of the most common data types you are likely to encounter.

Name of data type	Type of data
Float64	real numbers
Categorical	categories
Datetime	date and time
Date	date
Int64	integers
Boolean	True or False
String	text

The different column data types are important because the operations you can perform on a column depend so much on its “type”; for example, you can remove all punctuation from strings while you can multiply ints and floats.

We would like to work with the "time_hour" variable in the form of a datetime; fortunately, polars makes it easy to perform that conversion on that specific column

flights.get_column("time_hour")

flights.with_columns(pl.col("time_hour").str.to_datetime())

polars basics

polars is a really comprehensive package, and this book will barely scratch the surface of what it can do. But it’s built around a few simple ideas that, once they’ve clicked, make life a lot easier.

Let’s start with the absolute basics. The most basic polars object is DataFrame. A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data, even lists) in columns. It is made up of rows and columns (with each row-column cell containing a value), plus contextual information: column names (which carry information about each column).

Note

Note: If you’re coming from pandas, be aware that polars does not use an index column and each row is indexed by its integer position in the table.

df = pl.DataFrame(
    data={
        "col0": [0, 0, 0, 0],
        "col1": [0, 0, 0, 0],
        "col2": [0, 0, 0, 0],
        "col3": ["a", "b", "b", "a"],
        "col4": ["alpha", "gamma", "gamma", "gamma"],
    },
)
df.head()

You can see there are 5 columns (named "col0" to "col4").

A second key point you should know is that the operations on a polars data frame can be chained together. We need not perform one assignment per line of code; we can actually do multiple assignments in a single command.

Let’s see an example of this. We’re going to string together four operations:

we will use filter() to find only the rows where the destination "dest" column has the value "IAH". We use the pl.col() expression to select the column for filtering condition. In effect, this step removes rows we’re not interested in.
we will use group_by() to group rows by the year, month, and day (we pass a list of columns to the group_by() function).
we will choose which columns to perform aggregation on after the group_by() operation by using the pl.col() expression inside agg(), to select the column. Here we just want one column, "arr_delay". In effect, this step removes columns we’re not interested in.
finally, we must specify what agg() operation we wish to apply; when aggregating the information in multiple rows down to one row, we need to say how that information should be aggregated. In this case, we’ll use the mean(). In effect, this step applies a statistic to the variable(s) we selected earlier, across the groups we created earlier.

flights.filter(pl.col("dest") == "IAH").group_by(["year", "month", "day"]).agg(
    pl.col("arr_delay").mean()
)

You can see here that we’ve created a new data frame. To do it, we used three key operations:

manipulating rows
manipulating columns
applying statistics

Most operations you could want to do to a single data frame are covered by these, but there are different options for each of them depending on what you need.

Let’s now dig a bit more into these operations.

Manipulating Rows in Data Frames

Let’s create some fake data to show how this works.

import numpy as np

df = pl.DataFrame(
    data=np.reshape(range(36), (6, 6)),
    schema=["col" + str(i) for i in range(6)],
)
df.insert_column(
    6, pl.Series("col6", ["apple", "orange", "pineapple", "mango", "kiwi", "lemon"])
)
df

Accessing Rows

To access a particular row directly, you can get that by index (location in the data) or predicate/expression using .row(), which returns as a tuple. Remember that Python indices begin from zero, so to retrieve the first row by index you would use .row(0):

For example,

# Gets the first row of the DataFrame
df.row(0)

# Gets the fifth row of the DataFrame
df.row(4)

We can also access particular rows based on a predicate using .row() with the by_predicate parameter.

df.row(by_predicate=pl.col("col6") == "mango")

To get the row as a dictionary instead of a tuple with a mapping of column names to row values, specify named=True

# Get the first row of the DataFrame as a dictionary
df.row(0, named=True)

# Get the row where col6 is "mango" as a dictionary
df.row(by_predicate=pl.col("col6") == "mango", named=True)

We can also access rows using the .slice() function. As the function name implies, we get a slice of the DataFrame. we can use this to get a single row or a number of rows. To use this, we give it an offset - a start index, negative indexing is supported to index from the bottom of the DataFrame - and a length of the slice. This returns a DataFrame

df.slice(-2, 2)

Filtering rows

As with the flights example, we can also filter rows based on a condition using filter():

df.filter((pl.col("col6") == "kiwi") | (pl.col("col6") == "pineapple"))

For numbers, you can also use the greater than and less than signs:

df.filter(pl.col("col0") > 6)

In fact, there are lots of options that work with filter(): as well as > (greater than), you can use >= (greater than or equal to), < (less than), <= (less than or equal to), == (equal to), and != (not equal to). You can also use operators & as well as | to combine multiple conditions. Here’s an example of & from the flights data frame:

# Flights that departed on January 1
flights.filter((pl.col("month") == 1) & (pl.col("day") <= 5))

Note that equality is tested by == and not by =, because the latter is used for assignment.

Re-arranging Rows

Again and again, you will want to re-order the rows of your data frame according to the values in a particular column. polars makes this very easy via the .sort() function. You can sort by single or multiple column names and also by expressions. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.

flights.sort("dep_time")

# Sort by multiple columns by passing a list of columns.
flights.sort(["year", "month", "day", "dep_time"])

# Or use positional arguments to sort by multiple columns in the same way.
flights.sort("year", "month", "day", "dep_time")

You can use the keyword argument descending=True to re-order by a column or columns in descending order. For example, this code shows the most delayed flights:

flights.sort("dep_delay", descending=True)

flights.sort(["dep_delay", "arr_delay"], descending=[True, False])

You can of course combine all of the above row manipulations to solve more complex problems. For example, we could look for the top three destinations of the flights that were most delayed on arrival that left on roughly on time:

(
    flights.filter((pl.col("dep_delay") <= 10) & (pl.col("dep_delay") >= -10))
    .sort("arr_delay", descending=True)
    .head(3)
)

Exercises

Find all flights that
1. Had an arrival delay of two or more hours
2. Flew to Houston ("IAH" or "HOU")
3. Were operated by United, American, or Delta
4. Departed in summer (July, August, and September)
5. Arrived more than two hours late, but didn’t leave late
6. Were delayed by at least an hour, but made up over 30 minutes in flight
Sort flights to find the flights with longest departure delays.
Sort flights to find the fastest flights
Which flights traveled the farthest?
Does it matter what order you used filter() and sort() in if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

Manipulating Columns

This section will show you how to apply various operations you may need to columns in your data frame.

Creating New Columns

Let’s now move on to creating new columns, either using new information or from existing columns. Given a data frame, df, creating a new column with the same value repeated is done by using .with_columns(), with an expression assigned to a column name. Here we use pl.lit() which returns an expression representing a literal value, 5 in our case.

df = df.with_columns(new_column0=pl.lit(5))
df

If we do the same operation again, but this time assigning a Series to the same column, it will overwrite what was already in that column. A Series repesents a single column in a Polars DataFrame. Let’s see this with an example where we put different values in each position by assigning a list to the new column.

df = df.with_columns(new_column0=pl.Series([0, 1, 2, 3, 4, 5]))
df

Exercise

What happens if you try to use assignment where the right-hand side values are longer or shorter than the length of the data frame?

We can actually use .with_columns to create more than one new column:

df = df.with_columns(new_column1=pl.lit(5), new_column2=pl.lit(6))
df

Very often, you will want to create a new column or modify a column that is the result of an operation on existing columns. There are a couple of ways to do this. The ‘stand-alone’ method works in a similar way to what we’ve just seen except that we refer to the data frame on the right-hand side of the assignment statement too:

df.with_columns(new_column2=pl.col("col0") - pl.col("new_column0"))

We can use .alias() with an expression to assign column names, when creating new columns. We can chain multiple expressions together with .with_columns(), which would create multiple new columns with the names assigned to .alias().

flights.with_columns(
    (pl.col("dep_delay") - pl.col("arr_delay")).alias("gain"),
    (pl.col("distance") / pl.col("air_time") * 60).alias("speed"),
)

Accessing Columns

Just as with selecting rows, there are many options and ways to select the columns to operate on. The one with the simplest syntax is the name of the data frame followed by square brackets and the column name (as a string)

df["col0"]

If you need to select multiple columns, you cannot just pass a string into df[...]; instead you need to pass an object that is iterable (and so have multiple items). The most straight forward way to select multiple columns is to pass a list. Remember, lists comes in square brackets so we’re going to see something with repeated square brackets: one for accessing the data frame’s innards and one for the list.

df[["col0", "new_column0", "col2"]]

We can also use .select() on the data frame to select columns, passing a single string to select a single column or an iterable, like a list, positional arguments or keyword arguments, to select multiple columns. Using keyword arguments renames the columns in the output

# selecting a single column
df.select("col0")

# Using positional arguments to select multiple columns
df.select("col0", "new_column0", "col2")

# Using keyword arguments to rename columns in the output
df.select(col1="col0", col2="new_column0", col3="col2")

Expressions are also accepted

df.select(pl.col("col0"), pl.col("new_column0") + 2, pl.col("col2") * 2)

If we want to access particular rows at the same time, we can chain .filter() or .slice() to the .select() function:

df.select("col0", "new_column0", "col2").slice(0, 2)

Sometimes, we’ll want to select columns based on the type of data that they hold. For this, we can call polars data types with an expression inside .select(). Let’s use this to select all columns with integers in the flights data.

flights.select(pl.col(pl.Int64))

polars also provides a selectors module that we can use to select columns based on both data types and criteria such as patterns in the name of the column.

import polars.selectors as S

# Select all integer columns
flights.select(S.integer())

# Exclude string columns
flights.select(S.exclude(S.string()))

# Select columns that contain "delay" in their name
flights.select(S.contains("delay"))

# Select columns that start with "arr"
flights.select(S.starts_with("arr"))

Other selectors commands include ends_with(), by_index(), first(), last(), duration(), numeric(), boolean() and more

Renaming Columns

We use the dedicated rename() function with a mapping, such as a dictionary or a lambda function. Dictionaries in Python consist of curly brackets with comma separated pairs of values where the first values maps into the second value. An example of a dictionary would be {'old_col1': 'new_col1', 'old_col2': 'new_col2'}. Let’s see this in practice (but note that we are not ‘saving’ the resulting data frame, just showing it—to save it, you’d need to add df = to the left-hand side of the code below).

df.rename({"col3": "letters", "col4": "names", "col6": "fruit"})

Using a lambda function, maps each column name as its argument, which you can then perform an operation on.

df.rename(lambda column_name: column_name.upper())

Tip

A lambda function is a small, anonymous function in Python that performs a single operation. It’s a shorthand way to create a function without using def.

Re-ordering Columns

By default, new columns are added to the right-hand side of the data frame. But you may have reasons to want the columns to appear in a particular order, or perhaps you’d just find it more convenient to have new columns on the left-hand side when there are many columns in a data frame (which happens a lot).

The simplest way to re-order (all) columns is to create a new list of their names with them in the order that you’d like them: but be careful you don’t forget any columns that you’d like to keep!

Let’s see an example with a fresh version of the fake data from earlier. We’ll put all of the odd-numbered columns first, in descending order, then the even similarly using .select().

df = pl.DataFrame(
    data=np.reshape(range(36), (6, 6)), schema=["col" + str(i) for i in range(6)]
)
df

df = df.select(["col5", "col3", "col1", "col4", "col2", "col0"])
df

Of course, this is quite tedious if you have lots of columns! There are methods that can help make this easier depending on your context. Perhaps you’d just liked to sort the columns in order? This can be achieved by combining sorted() and the .select() function, passing the DataFrame .columns.

# Alphabetical order
df.select(sorted(df.columns))

# Reverse alphabetical order
df.select(sorted(df.columns, reverse=True))

Column and Row Exercises

Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
What happens if you include the name of a row or column multiple times when trying to select them?

Grouping and applying summary statistics

So far you’ve learned about working with rows and columns. polars gets even more powerful when you add in the ability to work with groups. And because groups tend to imply an aggregation or pooling of data, they go hand-in-glove with the application of a summary statistic.

The diagram below gives a sense of how these operations can proceed together. Note that the ‘split’ operation is achieved through grouping, while apply produces summary statistics. At the end, you get a data frame with a new index (one entry per group) in what is shown as the ‘combine’ step.

Grouping and Aggregating

Let’s take a look at creating a group using the .group_by() function, then followed by the .agg() function for aggregation, selecting a column and applying a summary statistic via an aggregation.

flights.group_by("month").agg(pl.col("dep_delay").mean())

This now represents the mean departure delay by month. The mechanics happenning here is that the DataFrame is grouped by each unique item in the “month” column and then a mean summary statistic is derived on the “dep_delay” column from each group.

Other summary statistics can be derived from aggregations. Some common options are in the table below:

Aggregation	Description
`count()`	Number of non-null items
`len()`	Number of all items
`first()`, `last()`	First and last item
`mean()`, `median()`	Mean and median
`min()`, `max()`	Minimum and maximum
`std()`, `var()`	Standard deviation and variance
`mad()`	Mean absolute deviation
`product()`	Product of all items
`sum()`	Sum of all items
`value_counts()`	Counts of unique values

For doing multiple aggregations on multiple columns with new names for the output variables, the syntax becomes

# Multiple aggregations using polars' syntactic sugar (shorthand) for mean and count
flights.group_by("month").agg(
    mean_delay=pl.mean("dep_delay"),
    count_flights=pl.count("dep_delay"),
)

Means and counts can get you a surprisingly long way in data science!

Grouping by multiple variables

This is as simple as passing .group_by() a list or multiple strings representing columns instead of a single string representing a single column.

month_year_delay = flights.group_by("month", "year").agg(
    mean_delay=pl.mean("dep_delay"),
    count_flights=pl.count("dep_delay"),
)
month_year_delay

Groupby Exercises

Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights.group_by(["carrier", "dest"]))
Find the most delayed flight to each destination.
How do delays vary over the course of the day?