When dealing with linear regression models, we often have to handle categorical variables. When I began learning about linear regression, I found it a bit overwhelming: between the multiple assumptions the data needed to meet, understanding how to evaluate the models, and trying to find an effective process for arriving at the 'best' combination of features, amongst several other things, it felt like a lot to take in. In these situations, I usually like to create a simple example, with very few 'moving parts', to make sure I understand the fundamentals. I will be doing just that here, by:
1. Creating a simple data set with categorical variables.
2. Separating the features from the target, and applying one-hot encoding to the features with categorical values (pre-processing).
3. Creating a linear regression model with statsmodels.
4. Making a prediction based on the model.
5. Going over exactly how the results were derived, and how the coefficients (parameters) interact with their respective features' values.
Creating the Data Set
I will create a pandas dataframe made up of the characteristics of fictional houses. The dataframe will have eight samples (rows) and four columns.
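The original code isn't reproduced in this text, so the values below are a sketch: illustrative numbers chosen to be consistent with the structure described above and with the coefficients that appear later in the article.

import pandas as pd

# Eight fictional houses: three categorical features plus the target, 'Price'.
# The house values are illustrative (assumed), not the original post's data.
df = pd.DataFrame({
    'Bedrooms':  [2, 3, 4, 2, 2, 2, 2, 2],
    'Bathrooms': [1, 1, 1, 2, 3, 4, 1, 1],
    'Floors':    [1, 1, 1, 1, 1, 1, 2, 3],
    'Price':     [200000, 225000, 275000, 275000, 325000, 350000, 250000, 300000],
})
print(df)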
This results in the following dataframe:
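   Bedrooms  Bathrooms  Floors   Price
0         2          1       1  200000
1         3          1       1  225000
2         4          1       1  275000
3         2          2       1  275000
4         2          3       1  325000
5         2          4       1  350000
6         2          1       2  250000
7         2          1       3  300000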
Pre-processing the Data
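The pre-processing code isn't shown in the text either. A minimal sketch, assuming pandas' get_dummies is used for the one-hot encoding (drop_first=True produces exactly the 'lowest value dropped' behavior described below):

# Separate the features from the target.
X = df.drop('Price', axis=1)
y = df['Price']

# One-hot encode the categorical features, dropping each feature's lowest
# value to avoid the dummy-variable redundancy discussed below.
X = pd.get_dummies(X, columns=['Bedrooms', 'Bathrooms', 'Floors'],
                   drop_first=True, dtype=int)
print(X)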
This results in the following dataframe:
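   Bedrooms_3  Bedrooms_4  Bathrooms_2  Bathrooms_3  Bathrooms_4  Floors_2  Floors_3
0           0           0            0            0            0         0         0
1           1           0            0            0            0         0         0
2           0           1            0            0            0         0         0
3           0           0            1            0            0         0         0
4           0           0            0            1            0         0         0
5           0           0            0            0            1         0         0
6           0           0            0            0            0         1         0
7           0           0            0            0            0         0         1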
A major take-away from this dataframe is that each of the original feature columns ('Bedrooms', 'Bathrooms', 'Floors') was replaced by columns representing each of its unique values, with the exception of its lowest value. For example, the original dataframe includes houses with two, three, and four bedrooms, yet in the one-hot encoded dataframe there are only columns representing three and four bedrooms. This is done on purpose, to avoid multicollinearity. The thought process here is that if none of a feature's remaining values is chosen (all of its columns are 0), it is a foregone conclusion that the dropped value is the one that applies. In order to avoid this redundancy, one value is dropped from each feature.
Creating the Linear Regression Model
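The model is fit with statsmodels; since the exact call isn't shown in the text, here is a sketch using its ordinary least squares (OLS) API:

import statsmodels.api as sm

# statsmodels does not add an intercept on its own, so add a constant
# column to the features first, then fit the model.
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())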
This returns the model summary. Don't worry about the nan(s) in it; with eight samples and eight parameters (the constant plus seven coefficients) the model fits the data exactly, leaving no residual error from which to estimate standard errors. The major take-aways here are that:
1. The R-squared score is 1. The model is able to perfectly predict the prices of all the houses in the dataset.
2. Notice the values of each of the coefficients, as well as the constant.
Making a Prediction
Now that the model has been created, I will use it to make a prediction:
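A sketch of building the input row (the variable name new_house is mine): in the encoded representation, a house with two bedrooms, one bathroom, and one floor has every dummy column set to 0, with only the constant set to 1.

# One new house to price: two bedrooms, one bathroom, one floor.
new_house = pd.DataFrame([[1, 0, 0, 0, 0, 0, 0, 0]], columns=X.columns)
print(new_house)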
This returns the following dataframe:
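   const  Bedrooms_3  Bedrooms_4  Bathrooms_2  Bathrooms_3  Bathrooms_4  Floors_2  Floors_3
0      1           0           0            0            0            0         0         0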
When we use this dataframe along with the model to predict the price of a house:
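# predict() evaluates the fitted linear equation on the new row.
print(model.predict(new_house))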
We get the following result:
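0    200000.0
dtype: float64

(Up to floating-point rounding, the prediction can come out a hair off 200,000, hence 'approximately' below.)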
Analyzing the Result
Before we continue, I would like to share with you a handy statsmodels attribute, params, which returns all of the model's coefficients, aka parameters. These are the same values shown in the summary but, I feel, presented in a clearer fashion.
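# model.params holds the fitted intercept ('const') and the coefficients
# as a pandas Series, indexed by feature name.
print(model.params)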
This gives us the following result:
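const          200000.0
Bedrooms_3      25000.0
Bedrooms_4      75000.0
Bathrooms_2     75000.0
Bathrooms_3    125000.0
Bathrooms_4    150000.0
Floors_2        50000.0
Floors_3       100000.0
dtype: float64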
Now the question becomes: how was the price of approximately $200,000 derived?
Answer: Given that this is a linear equation, we multiply each coefficient by its corresponding feature's value and add the products together, as such:
(200000*1) + (25000*0) + (75000*0) + (75000*0) + (125000*0) + (150000*0) + (50000*0) + (100000*0) = 200000
Here we have effectively predicted the price of a house with two bedrooms, one bathroom, and one floor. If we compare it against the original dataframe, we will see that it has the same price of $200,000. As a learning tool, I found this to be very helpful. I suggest you run the code and play around with different configurations. I hope you have found this helpful!