Categorical Data

Introduction

In this chapter, we’ll introduce how to work with categorical variables—that is, variables that have a fixed and known set of possible values. This chapter is enormously indebted to the pandas documentation.

Prerequisites

This chapter will use the pandas data analysis package.

The Category Datatype

Everything in Python has a type, even the data in pandas data frame columns. While you may be more familiar with numbers and even strings, there is also a special data type for categorical data called Categorical. There are some benefits to using categorical variables (where appropriate):

  • they keep track of every category even when no elements of that category are present, which can sometimes be as interesting as when they are (imagine you find that no one from a particular school goes to university)
  • they can use far less of your computer’s memory than storing the same information as strings, particularly when a small number of distinct values is repeated many times
  • they can be used efficiently with modelling packages, where they will be recognised as potential ‘dummy variables’, or with plotting packages, which will treat them as discrete values
  • you can order them (for example, “neutral”, “agree”, “strongly agree”)

All values of categorical data for a pandas column are either in the given categories or take the value np.nan.
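To make the memory point concrete, here is a minimal sketch that compares the same values stored as strings and as a categorical. The data and the names as_strings and as_category are illustrative assumptions, and the exact byte counts will depend on your pandas version and platform.

```python
import pandas as pd

# Illustrative data (an assumption): three distinct strings repeated many times
as_strings = pd.Series(["low", "medium", "high"] * 10_000)
as_category = as_strings.astype("category")

# deep=True asks pandas to count the memory used by the strings themselves
string_bytes = as_strings.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"strings:     {string_bytes:,} bytes")
print(f"categorical: {category_bytes:,} bytes")
```

The saving comes from storing each distinct value once and representing every row by a small integer code, so it shrinks as the number of distinct values approaches the number of rows.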

Creating Categorical Data

Let’s create a categorical column of data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

df["A"] = df["A"].astype("category")
df["A"]

Notice that we get some additional information at the bottom of the displayed series: we are told not only that this is a column of categorical type, but also that it has three categories, ‘a’, ‘b’, and ‘c’.

You can also use special functions, such as pd.cut(), to group data into discrete bins. Here’s an example where we specify the labels for the categories directly:

df = pd.DataFrame({"value": np.random.randint(0, 100, 20)})
labels = [f"{i} - {i+9}" for i in range(0, 100, 10)]
df["group"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
df.head()

In the example above, the group column is of categorical type.
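Because the example above uses random numbers, it can be hard to see exactly which bin a value lands in. The sketch below repeats the same pd.cut() call on a few fixed values; the names scores, bin_labels, and binned are illustrative assumptions rather than anything from the chapter's data.

```python
import pandas as pd

# Fixed, illustrative values so the binning is easy to check by eye
scores = pd.Series([5, 15, 99])
bin_labels = [f"{i} - {i+9}" for i in range(0, 100, 10)]

# right=False makes each bin include its left edge and exclude its right edge,
# so the bins are [0, 10), [10, 20), ..., [90, 100)
binned = pd.cut(scores, bins=list(range(0, 105, 10)), right=False, labels=bin_labels)

print(binned)
print(binned.dtype)
```

Note that there is one more bin edge than there are labels: eleven edges define ten bins.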

Another way to create a categorical variable is directly using the pd.Categorical() function:

raw_cat = pd.Categorical(
    ["a", "b", "c", "a", "d", "a", "c"], categories=["b", "c", "d"]
)
raw_cat

We can then enter this into a data frame:

df = pd.DataFrame(raw_cat, columns=["cat_type"])
df["cat_type"]

Note that NaNs appear for any value that isn’t in the categories we specified—you can find more on this in Missing Values.
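As a quick check of this behaviour, the following sketch counts how many entries were turned into NaN. The names letters, restricted, and n_missing are illustrative assumptions.

```python
import pandas as pd

letters = ["a", "b", "c", "a", "d", "a", "c"]

# "a" is not among the allowed categories, so every "a" becomes NaN
restricted = pd.Series(pd.Categorical(letters, categories=["b", "c", "d"]))

is_missing = restricted.isna()
n_missing = int(is_missing.sum())

print(restricted)
print(f"{n_missing} values fell outside the given categories")
```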

You can also create ordered categories:

ordered_cat = pd.Categorical(
    ["a", "b", "c", "a", "d", "a", "c"],
    categories=["a", "b", "c", "d"],
    ordered=True,
)
ordered_cat

Working with Categories

Categorical data has a categories and an ordered property; these list the possible values and whether the ordering matters, respectively. These properties are exposed as .cat.categories and .cat.ordered. If you don’t manually specify categories and ordering, they are inferred from the passed arguments.

Let’s see some examples:

df["cat_type"].cat.categories
df["cat_type"].cat.ordered

If categorical data is ordered (i.e. .cat.ordered == True), then the order of the categories has a meaning and certain operations are possible: you can sort values by that order (with .sort_values), and apply .min and .max.
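Here is a small sketch of those operations on an ordered categorical, using the survey-style scale mentioned earlier in the chapter. The data and the names scale and answers are illustrative assumptions.

```python
import pandas as pd

# The order of this list defines the order of the categories
scale = ["neutral", "agree", "strongly agree"]
answers = pd.Series(
    pd.Categorical(
        ["agree", "neutral", "strongly agree", "agree"],
        categories=scale,
        ordered=True,
    )
)

lowest = answers.min()
highest = answers.max()

# Sorting follows the category order, not alphabetical order
sorted_answers = answers.sort_values().tolist()

# Comparisons against a value that is one of the categories also work
above_neutral = (answers > "neutral").tolist()

print(lowest, highest)
print(sorted_answers)
print(above_neutral)
```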

Renaming Categories

Renaming categories is done via the rename_categories() method (which works with a list or a dictionary).

df["cat_type"] = df["cat_type"].cat.rename_categories(["alpha", "beta", "gamma"])
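The list form above renames every category by position. As a sketch of the dictionary form, the example below renames only some categories and leaves the rest untouched; the series letters_demo is an illustrative assumption, kept separate from the data frame used in this chapter.

```python
import pandas as pd

# Illustrative series, separate from the chapter's df
letters_demo = pd.Series(["b", "c", "d", "c"], dtype="category")

# Keys are existing category names, values are the new names;
# "c" is not mentioned, so it keeps its name
renamed = letters_demo.cat.rename_categories({"b": "beta", "d": "delta"})

print(renamed)
```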

Quite often, you’ll run into a situation where you want to add a category. You can do this with .add_categories():

df["cat_type"] = df["cat_type"].cat.add_categories(["delta"])
df["cat_type"]

Similarly, there is a .remove_categories() function and a .remove_unused_categories() function. .set_categories() adds and removes categories in one fell swoop. Any values whose category is removed become NaN. Remember that you need to use df["columnname"].cat before calling any cat(egory) functions though.
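The following sketch runs these three methods on a small series so you can compare their effects. The series demo and the other variable names are illustrative assumptions.

```python
import pandas as pd

# Illustrative series, separate from the chapter's df
demo = pd.Series(["b", "c", "d", "c"], dtype="category")

# Removing a category that is in use turns its values into NaN
without_d = demo.cat.remove_categories(["d"])

# Add an unused category "e", then drop any categories with no values
padded = demo.cat.add_categories(["e"])
trimmed = padded.cat.remove_unused_categories()

# Replace the whole set of categories at once: "b" is dropped, "e" is added
reset = demo.cat.set_categories(["c", "d", "e"])

print(without_d)
print(trimmed.cat.categories)
print(reset)
```

Each of these methods returns a new series rather than changing demo in place, which is why the chapter's examples assign the result back to the column.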

Operations on Categories

As noted, some operations (such as .min and .max) are only available for ordered categories. But there are others that work on any set of categories. Perhaps the most useful is value_counts():

df["cat_type"].value_counts()

Note that even though ‘delta’ doesn’t appear in the data at all, it still gets a count (of zero). This tracking of unused categories can be quite handy.

mode() is another one:

df["cat_type"].mode()

And if your categorical column happens to consist of elements that can undergo operations, those same operations will still work. For example,

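# Note: from pandas 2.2, "ME" is the preferred alias for month-end frequency;
# "M" is the older alias for the same thing
time_df = pd.DataFrame(
    pd.Series(pd.date_range("2015/05/01", periods=5, freq="M"), dtype="category"),
    columns=["datetime"],
)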
time_df
time_df["datetime"].dt.month

Finally, if you ever need to translate the values in your categorical column into integer codes, you can use .cat.codes. Each value is replaced by the position of its category in .cat.categories, and missing values get the code -1.

time_df["datetime"].cat.codes
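To show how codes relate to categories, here is a minimal sketch on a small ordered categorical that includes a missing value. The data and the names sizes, codes, and lookup are illustrative assumptions.

```python
import pandas as pd

sizes = pd.Series(
    pd.Categorical(
        ["low", "high", "medium", "low", None],
        categories=["low", "medium", "high"],
        ordered=True,
    )
)

# Each code is the position of the value's category in .cat.categories;
# missing values are given the code -1
codes = sizes.cat.codes

# A dictionary from code back to category name
lookup = dict(enumerate(sizes.cat.categories))

print(codes.tolist())
print(lookup)
```

The codes depend on the order of the categories, so reordering or renaming categories can change which code a given value maps to.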