Regression Analysis for King County Home Sales: Log Transformation
I analyzed a data set for King County, in Washington state, to learn about home sales in the area. To accomplish this, I performed a regression analysis.
Regression analysis is a reliable method of identifying which variables have an impact on a topic of interest.
This dataset is comprised of 21 variables.
I will discuss how I started this project.
After this, I added city data to reduce geographic noise. I grouped over 70 zip codes into roughly 15 cities or regions. I wanted complete geographic representation in my model, but I did not want to use more than 20 variables for it, so this seemed like a healthy compromise given the size and variety of King County.
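The grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the actual grouping used in the project: the `ZIP_TO_REGION` lookup, the `add_region` helper, the `zipcode` column name, and the `"Other"` catch-all bucket are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical zip-to-region lookup (illustrative only; the real project
# grouped over 70 zip codes into roughly 15 cities or regions).
ZIP_TO_REGION = {
    98004: "Bellevue",
    98005: "Bellevue",
    98052: "Redmond",
    98053: "Redmond",
    98101: "Seattle",
    98102: "Seattle",
}


def add_region(df: pd.DataFrame, zip_col: str = "zipcode") -> pd.DataFrame:
    """Return a copy of df with a 'region' column derived from the zip code."""
    out = df.copy()
    # Zip codes missing from the lookup fall into a catch-all bucket.
    out["region"] = out[zip_col].map(ZIP_TO_REGION).fillna("Other")
    return out


# The last zip code is deliberately absent from the toy lookup.
sample = pd.DataFrame({"zipcode": [98004, 98052, 98101, 98024]})
regions = add_region(sample)
print(regions["region"].tolist())
```

Collapsing zip codes this way keeps the number of dummy variables small while still giving every sale a geographic label.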
One of my first steps was to check whether price was normally distributed. It was not, so I performed a log transformation to bring it closer to a normal distribution. The graphs below show the distribution before and after the log transformation.
Before the log transformation, the plot was very skewed:
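import numpy as np

# Save the mean price in dollars before transforming; it is used later.
mean_house_price = df["price"].mean()

# Log transformation to normalize the price distribution.
df["price"] = np.log(df["price"])

# Standardize the log price to zero mean and unit variance (z-score).
m = df["price"].mean()
std = df["price"].std()
df["price"] = (df["price"] - m) / std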
After the log transformation, the plot was approximately normally distributed:
Here we can see why it is important to perform a log transformation.
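One way to back up the visual comparison with a number is to compare skewness before and after the transformation. The sketch below uses synthetic, lognormally distributed "prices" as a stand-in, not the King County data; the seed and distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (lognormal); NOT the King County data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000))

# Same transformation as in the post: natural log of price.
log_prices = np.log(prices)

# Sample skewness: near 0 for a symmetric distribution,
# clearly positive for a long right tail.
skew_before = prices.skew()
skew_after = log_prices.skew()

print(f"skew before log: {skew_before:.2f}")
print(f"skew after log:  {skew_after:.2f}")
```

A skewness close to zero after the transformation suggests the log scale is a better fit for a linear model than raw prices.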
From this I went on to build the model using the two data frames and an additional one made with dummy variables. The initial model was an OLS regression. The final model has a high R-squared, indicating that in theory the model explains over 85% of the variability in (log-transformed) property prices.
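To make the modeling step more concrete, here is a minimal sketch of an OLS fit with dummy variables. It is not the model from this project: the data is synthetic, the column names (`sqft_living`, `grade`, `region`, `log_price`) are assumptions, and the fit uses `numpy.linalg.lstsq` rather than whichever regression library was actually used. The R-squared it prints describes the toy data only.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for the real data set (assumption).
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "sqft_living": rng.uniform(800, 4000, n),
    "grade": rng.integers(5, 12, n),
    "region": rng.choice(["Seattle", "Bellevue", "Other"], n),
})
region_effect = toy["region"].map({"Seattle": 0.3, "Bellevue": 0.5, "Other": 0.0})
toy["log_price"] = (
    12.0
    + 0.0003 * toy["sqft_living"]
    + 0.1 * toy["grade"]
    + region_effect
    + rng.normal(0, 0.1, n)
)

# One-hot encode the region; drop_first avoids perfect collinearity
# between the dummy columns and the intercept.
X = pd.get_dummies(
    toy[["sqft_living", "grade", "region"]], columns=["region"], drop_first=True
)
X.insert(0, "intercept", 1.0)
X_mat = X.to_numpy(dtype=float)
y = toy["log_price"].to_numpy()

# Ordinary least squares via NumPy's least-squares solver.
coef, *_ = np.linalg.lstsq(X_mat, y, rcond=None)

# R-squared: share of the variance in y explained by the fitted values.
residuals = y - X_mat @ coef
r_squared = 1.0 - residuals.var() / y.var()

print(dict(zip(X.columns, coef.round(4))))
print(f"R^2 = {r_squared:.3f}")
```

Each region coefficient is read relative to the dropped baseline region, which is why one dummy column is left out.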
Conclusion
The three factors that affect price the most are location, as expected, along with grade and size, both of which show a high positive correlation with price. The year a house was built or renovated does not seem to impact price significantly for properties in the King County area.