How to do a dummy variable in Python?

Let me introduce an important data modeling concept, dummy variables, through the scenario below.

Consider a dataset that combines continuous and categorical data. As soon as we read the word ‘categorical’, what first comes to mind is categories, or groups, in the data.

The variables often represent many different types of categories. As the dataset grows, handling this large number of groups and feeding it to a model becomes tedious and complex, and ambiguity soon creeps in.

This is when the concept of dummy variables comes into picture.

A dummy variable is a numeric variable which represents the sub-categories or sub-groups of the categorical variables of the dataset.

In a nutshell, a dummy variable lets us differentiate between the sub-groups of the data, which in turn lets us use the data for regression analysis as well.

Have a look at the below example!

Consider a dataset that contains 10-15 variables, one of which is a categorical variable with the categories ‘Male’ and ‘Female’.

The task is to understand which gender usually chooses ‘pink’ as the color of their mobile cases. In this case, we can use a dummy variable and assign 0 to Male and 1 to Female. This in turn gives the model a clearer understanding of the data it is fed.
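As a minimal sketch of this scenario, the snippet below encodes gender as a 0/1 dummy and compares how often each group picks a pink case. The dataframe, its column names (`gender`, `case_color`), and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: column names and values are invented for this sketch.
cases = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Female"],
    "case_color": ["black", "pink", "pink", "pink", "blue"],
})

# Encode gender as a single dummy variable: 0 = Male, 1 = Female.
cases["gender_dummy"] = cases["gender"].map({"Male": 0, "Female": 1})

# Flag whether the chosen case colour is pink (1) or not (0).
cases["is_pink"] = (cases["case_color"] == "pink").astype(int)

# Share of pink cases within each gender group (index 0 = Male, 1 = Female).
pink_share = cases.groupby("gender_dummy")["is_pink"].mean()
print(pink_share)
```

Because both columns are now numeric, the group means can be read directly as proportions.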

Let us create a dummy variable in Python now!

Let us now begin with creating a dummy variable. We have used the Bike rental count prediction problem to analyse and create dummy variables.

So, let us begin!

1. Load the dataset

At first, we need to load the dataset into the working environment as shown below:

import pandas
BIKE = pandas.read_csv("Bike.csv")

The original dataset:

Dataset-Bike Prediction

2. Create a copy of the original dataset to work on.

In order to make sure that the original dataset remains unaltered, we create a copy of the original dataset to work on and perform the operation of creation of dummies.

We have used the pandas.DataFrame.copy() function for the same.

bike = BIKE.copy()
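To see why the copy matters, here is a small sketch using a made-up stand-in for the data in Bike.csv (the values are assumptions, not the real dataset). Changes to the working copy leave the original untouched.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe loaded from Bike.csv.
BIKE = pd.DataFrame({"season": [1, 2, 3], "cnt": [120, 340, 560]})

# copy() makes a deep copy by default, so the two frames do not share data.
bike = BIKE.copy()

# Modify the working copy only.
bike["cnt"] = bike["cnt"] * 2

print(BIKE["cnt"].tolist())  # original counts
print(bike["cnt"].tolist())  # doubled counts in the copy
```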

3. Store all the categorical variables in a list

Let us now save all the categorical variables from the dataset into a list to work on!

categorical_col_updated = ['season','yr','mnth','weathersit','holiday']
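The list above is written by hand. One possible way to build such a list automatically, sketched here on a made-up miniature of the bike data, is to treat columns with only a few distinct values as categorical. The threshold `max_levels` is an assumption you would tune for your own data.

```python
import pandas as pd

# Hypothetical miniature of the bike data; all values are invented.
bike = pd.DataFrame({
    "season": [1, 1, 2, 3, 4, 4],
    "holiday": [0, 0, 1, 0, 0, 1],
    "temp": [0.34, 0.36, 0.51, 0.72, 0.28, 0.25],
    "cnt": [985, 801, 1349, 1562, 1600, 1606],
})

# Assumption: a column with at most `max_levels` distinct values is categorical.
max_levels = 4
categorical_cols = [
    col for col in bike.columns if bike[col].nunique() <= max_levels
]
print(categorical_cols)
```

This heuristic can misclassify a numeric column that happens to have few distinct values, so it is worth reviewing the resulting list by eye.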

4. Use get_dummies() method to create dummy of the variables

The Pandas module provides us with the pandas.get_dummies() function to create dummies of the categorical data.

bike = pandas.get_dummies(bike, columns = categorical_col_updated)
print(bike.columns)

We have passed the dataset, and the categorical column values to the function to create dummies.

Output:

As seen below, a dummy or separate column is created for every sub-group under each category.

For example, the column ‘mnth’ has all 12 months as categories.

Thus, every single month is considered a sub-group, and the get_dummies() function has created a separate column for every month.

Index(['temp', 'hum', 'windspeed', 'cnt', 'season_1', 'season_2', 'season_3',
       'season_4', 'yr_0', 'yr_1', 'mnth_1', 'mnth_2', 'mnth_3', 'mnth_4',
       'mnth_5', 'mnth_6', 'mnth_7', 'mnth_8', 'mnth_9', 'mnth_10', 'mnth_11',
       'mnth_12', 'weathersit_1', 'weathersit_2', 'weathersit_3', 'holiday_0',
       'holiday_1'],
      dtype='object')

You can find the resultant dataset by the get_dummies() function here.
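The same behaviour can be reproduced without the CSV file. The sketch below uses a tiny made-up dataframe (not the real bike data) to show that the untouched columns come first, followed by one new column per category of each encoded variable.

```python
import pandas as pd

# Hypothetical miniature dataset; values are invented for illustration.
bike = pd.DataFrame({
    "temp": [0.34, 0.51, 0.72, 0.28],
    "season": [1, 2, 3, 4],
    "holiday": [0, 0, 1, 0],
})

# One dummy column is created per distinct value of each listed column.
bike_dummies = pd.get_dummies(bike, columns=["season", "holiday"])
print(bike_dummies.columns.tolist())

# Each row belongs to exactly one season, so the season dummies sum to 1.
season_cols = ["season_1", "season_2", "season_3", "season_4"]
season_row_sums = bike_dummies[season_cols].sum(axis=1)
print(season_row_sums.tolist())
```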

Conclusion

By this, we have come to the end of this topic. Feel free to comment below, in case you come across any question.

In the code chunk above, df is the Pandas dataframe, and we use the columns argument to specify which columns we want to dummy code (see the following examples in this post for more details).

Dummy Coding for Regression Analysis

One statistical analysis in which we may need to create dummy variables is regression analysis. In fact, regression analysis requires numerical variables. This means that when we, whether doing research or just analyzing data, wish to include a categorical variable in a regression model, supplementary steps are required to make the results interpretable.

Three dummy coded variables in Pandas dataframe

In these steps, categorical variables in the data set are recoded into a set of separate binary variables (dummy variables). Furthermore, this re-coding is called “dummy coding” and involves the creation of a table called contrast matrix. Dummy coding can be done automatically by statistical software, such as R, SPSS, or Python.
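To make the link to regression concrete, here is a minimal sketch with invented numbers. It fits an ordinary least squares model with an intercept and one dummy column using NumPy. The data, the variable names, and the choice of Female as the reference level are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four made-up salaries.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female"],
    "salary": [100.0, 110.0, 90.0, 95.0],
})

# One dummy for a two-level factor: 1 = Male, 0 = Female (reference level).
male = (toy["sex"] == "Male").astype(float)

# Design matrix: a column of ones for the intercept plus the dummy column.
X = np.column_stack([np.ones(len(toy)), male.to_numpy()])
y = toy["salary"].to_numpy()

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, male_effect = coef
print(intercept, male_effect)
```

With a single dummy predictor, the intercept equals the mean of the reference group and the dummy coefficient equals the difference between the two group means, which is what makes the dummy-coded result interpretable.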

What is Categorical Data?

In this section of the guide, we are going to answer the question of what categorical data is. In statistics, a categorical variable (also known as a factor or qualitative variable) is a variable that takes on one of a limited, and most commonly fixed, number of possible values. These variables typically assign each individual, or other unit of observation, to a particular group or nominal category. For example, gender is a categorical variable.
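Pandas has a dedicated dtype for this kind of variable. As a small sketch, the made-up Series below is stored as `category`, which exposes the fixed set of levels and an integer code for each observation. The level names are only examples.

```python
import pandas as pd

# Hypothetical values for a categorical variable with three levels.
rank = pd.Series(["Prof", "AsstProf", "Prof", "AssocProf"], dtype="category")

# The distinct levels, sorted alphabetically by default.
levels = rank.cat.categories.tolist()
print(levels)

# Integer position of each observation's level within `levels`.
codes = rank.cat.codes.tolist()
print(codes)
```

Note that these integer codes are just labels. They imply an ordering the categories may not have, which is one reason dummy variables are preferred as model inputs.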

What is a Dummy Variable?

Now, the next question we are going to answer before working with Pandas get_dummies, is “what is a dummy variable?”. Typically, a dummy variable (or column) is one which has a value of one (1) when a categorical event occurs (e.g., an individual is male) and zero (0) when it doesn’t occur (e.g., an individual is female).

How do you Convert Categorical Variables to Dummy Variables in Python?

To convert your categorical variables to dummy variables in Python you can use the Pandas get_dummies() method. For example, if you have the categorical variable “Gender” in your dataframe called “df” you can use the following code to make dummy variables:

pd.get_dummies(df, columns=['Gender'])

If you have multiple categorical variables you simply add every variable name as a string to the list!

Installing Pandas

Obviously, we need to have Pandas installed to use the get_dummies() method. Pandas can be installed using pip or conda, for instance. If we want to install Pandas using conda we type

conda install pandas

On the other hand, if we want to use pip, we type

pip install pandas

Note, it is typically suggested that Python packages are installed in virtual environments. Pipx can be used to install Python packages directly in virtual environments and if we want to install, update, and use Python packages we can, as in this post, use conda or pip.

Finally, if there is a message that there is a newer version of pip, make sure to check out the post about how to update pip.

Example Data to Dummy Code

In this Pandas get_dummies tutorial, we will use the Salaries dataset, which contains the 2008-09 nine-month academic salary for Assistant Professors, Associate Professors, and Professors in a college in the U.S.

Import Data in Python using Pandas

Now, before we start using Pandas get_dummies() method, we need to load pandas and import the data.

import pandas as pd

data_url = 'http://vincentarelbundock.github.io/Rdatasets/csv/carData/Salaries.csv'
df = pd.read_csv(data_url, index_col=0)

df.head()

Pandas Dataframe to dummy code

Of course, data can be stored in multiple different file types. For instance, we could have our data stored in .xlsx, SPSS, SAS, or STATA files. See the following tutorials to learn more about importing data from different file types:

Learn how to read Excel (.xlsx) files using Python and Pandas
Read SPSS files using Pandas in Python
Import (Read) SAS files using Pandas
Read STATA files in Python with Pandas

Now, if we only want to work with Excel files, reading xlsx files in Python, can be done with other libraries, as well.

Creating Dummy Variables in Python

In this section, we are going to use pandas get_dummies() to generate dummy variables in Python. First, we are going to work with the categorical variable “sex”. That is, we will start with dummy coding a categorical variable with two levels.

Second, we are going to generate dummy variables in Python with the variable “rank”. That is, in that dummy coding example we are going to work with a factor variable with three levels.

How to Make Dummy Variables in Python with Two Levels

In this section, we are going to create a dummy variable in Python using Pandas get_dummies method. Specifically, we will generate dummy variables for a categorical variable with two levels (i.e., male and female).

In this post, we are going to work with Pandas get_dummies(). With it, we can change the prefix of our dummy variables and specify which columns contain our categorical variables.

First Dummy Coding in Python Example:

In the first Python dummy coding example below, we are using Pandas get_dummies to make dummy variables. Note, we are using a series as data and, thus, get two new columns named Female and Male.

# Pandas get_dummies on one column:
pd.get_dummies(df['sex']).head()

Female and Male dummy coded columns

In the code, above, we also printed the first 5 rows (using Pandas head()). We will now continue and use the columns argument. Here we input a list with the column(s) we want to create dummy variables from. Furthermore, we will create the new Pandas dataframe containing our new two columns.

More Python Dummy Coding Examples:

# Creating dummy variables from one column:
df_dummies = pd.get_dummies(df, columns=['sex'])
df_dummies.head()

Resulting dataframe with dummy coded columns

In the output (using Pandas head()), we can see that Pandas get_dummies automatically added “sex” as prefix and underscore as prefix separator. If we, however, want to change the prefix as well as the prefix separator we can add these arguments to Pandas get_dummies():

# Changing the prefix for the dummy variables:
df_dummies = pd.get_dummies(df, prefix='Gender', prefix_sep='.', 
                            columns=['sex'])
df_dummies.head()

Gender, instead of sex, as the prefix for the dummy columns.

Remove Prefix and Separator from Dummy Columns

In the next Pandas dummies example code, we are going to make dummy variables with Python but we will set the prefix and the prefix_sep arguments to empty strings so that the column names will be the factor levels (categories):

# Remove the prefix and separator when dummy coding:
df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', 
                            columns=['sex'])
df_dummies.head()

How to Create Dummy Variables in Python with Three Levels

In this section, of the dummy coding in Python tutorial, we are going to work with the variable “rank”. That is, we will create dummy variables in Python from a categorical variable with three levels (or 3 factor levels). In the first dummy variable example below, we are working with Pandas get_dummies() the same way as we did in the first example.

# Python dummy variables with 3 factor levels (categorical data):
pd.get_dummies(df['rank']).head()

That is, we put in a Pandas Series (i.e., the column with the variable) as the only argument and then we only got a new dataframe with 3 columns (i.e., for the 3 levels).
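As a quick sanity check of the three-level case, the sketch below uses a small made-up Series (not the Salaries data) and confirms that every row has exactly one dummy set to 1. It also recovers the original labels from the dummy columns. Passing `dtype=int` is a choice made here so the output is 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical values standing in for the "rank" column.
rank = pd.Series(["Prof", "AsstProf", "Prof", "AssocProf"], name="rank")

# One column per level, encoded as integers 0/1.
dummies = pd.get_dummies(rank, dtype=int)
print(dummies.columns.tolist())

# Exactly one dummy is "on" in each row.
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())

# The column holding the 1 in each row is the original level.
recovered = dummies.idxmax(axis=1)
print(recovered.tolist())
```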

Create a Dataframe with Dummy Coded Variables

Of course, we want to have the dummy variables in a dataframe with the data. Again, we do this by using the columns argument and a list with the column that we want to use:

df_dummies = pd.get_dummies(df, columns=['rank'])
df_dummies.head()

In the image above, we can see that Pandas get_dummies() added “rank” as prefix and underscore as prefix separator. Next, we are going to change the prefix and the separator to “Rank” (uppercase) and “.” (dot).

df_dummies = pd.get_dummies(df, prefix='Rank', prefix_sep='.', 
                            columns=['rank'])
df_dummies.head()

Now, we may not need to have a prefix or a separator and, as in the previous Pandas create dummy variables in Python example, want to remove these. To accomplish this, we just add empty strings to the prefix and prefix_sep arguments:

df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', 
                            columns=['rank'])

Creating Dummy Variables in Python for Many Columns/Categorical Variables

In the final Pandas dummies example, we are going to dummy code two columns. Specifically, we are going to add a list with two categorical variables and get 5 new columns that are dummy coded. This is, in fact, very easy and we can follow the example code from above:

Creating Multiple Dummy Variables Example Code:

Here’s how to create dummy variables from multiple categorical variables in Python:

# Creating dummy variables from two columns:
df_dummies = pd.get_dummies(df, columns=['rank', 'sex'])
df_dummies.head()

Finally, if we want to add more columns, to create dummy variables from, we can add that to the list we add as a parameter to the columns argument. See this notebook for all code examples in this tutorial about creating dummy variables in Python. For more Python Pandas tutorials, check out this page.
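When several columns are encoded at once, each one can be given its own prefix by passing a dictionary to the prefix argument. The sketch below uses a made-up dataframe with invented salary values. The prefixes "Rank" and "Gender" are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical toy data; the salary figures are invented.
toy = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "AssocProf", "Prof"],
    "sex": ["Male", "Female", "Male", "Female"],
    "salary": [139750, 79750, 101000, 115000],
})

# A dict maps each encoded column to its own prefix.
df_dummies = pd.get_dummies(
    toy,
    columns=["rank", "sex"],
    prefix={"rank": "Rank", "sex": "Gender"},
    prefix_sep=".",
)

new_columns = [col for col in df_dummies.columns if col not in toy.columns]
print(new_columns)
```

Three levels of rank plus two levels of sex give the five new dummy coded columns.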

Conclusion: Dummy Coding in Python

In this post, we have learned how to do dummy coding in Python using Pandas get_dummies() method. More specifically, we have worked with categorical data with two levels, and categorical data with three levels. Furthermore, we have learned how to add and remove prefixes from the new columns created in the dataframe.

How do you create a dummy variable in Python?

We can create dummy variables in Python using the pandas get_dummies() method.

Syntax: pandas.get_dummies(data, prefix=None, prefix_sep='_', ...)

Return type: Dummy variables.

How do you create a dummy variable?

There are two steps to successfully set up dummy variables in a multiple regression: (1) create dummy variables that represent the categories of your categorical independent variable; and (2) enter values into these dummy variables – known as dummy coding – to represent the categories of the categorical independent ...

How to do dummy encoding in Python?

To perform dummy encoding with scikit-learn's OneHotEncoder, set the drop parameter to 'first', which drops the first category of each variable. sparse: set this to False to return the output as a NumPy array. The default is True, which returns a sparse matrix.
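With pandas, a comparable effect (dropping the first category of each variable) can be sketched with the drop_first argument of get_dummies(). The dataframe below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with one three-level categorical variable.
toy = pd.DataFrame({"rank": ["Prof", "AsstProf", "AssocProf", "Prof"]})

# Full encoding: one column per level.
full = pd.get_dummies(toy, columns=["rank"])
print(full.columns.tolist())

# Dummy encoding: the first level is dropped and acts as the reference.
reduced = pd.get_dummies(toy, columns=["rank"], drop_first=True)
print(reduced.columns.tolist())
```

A row with zeros in all remaining columns then belongs to the dropped reference level, so no information is lost.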

What is a dummy value in Python?

In a nutshell: a dummy variable is a numeric variable that represents categorical data. For example, if you want to calculate a linear regression, you need numerical predictors.