data310


Project 1

Question 1

How did your model fare?

I chose the city of Denver for Project 1 and built a dataframe of 200 home listings to create my predictive models. The MSE of my model was 14.6. Based on this value and the plots shown below, I believe my model is fairly accurate given the small amount of data used to create it.
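For reference, the MSE reported above is the average of the squared differences between the actual and predicted prices. The snippet below is a minimal sketch of that calculation. The three prices are made-up values in arbitrary scaled units, not rows from the Denver dataframe, and the function name `mean_squared_error` is chosen here for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical scaled prices, for illustration only.
actual = [3.0, 5.0, 8.0]
predicted = [3.5, 4.0, 8.0]

mse = mean_squared_error(actual, predicted)
print(f"MSE: {mse:.4f}")
```

Because the errors are squared, the value depends on the units of the target, so an MSE of 14.6 is only meaningful relative to how the prices were scaled before training.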

Plot of model loss

[Figure: plot of model loss (screenshot, 2021-07-11)]

Actual Asking Price vs Price Prediction scatter plot

[Figure: scatter plot of actual asking price vs. price prediction (screenshot, 2021-07-11)]

Question 2

In your estimation is there a particular variable that may improve model performance?

I think there are a number of variables that could have been included to improve model performance. When I examined the original dataframe, which had 61 columns, two columns stood out: ‘address/community’ and ‘address/neighborhood’. Every value in these columns was NaN. However, if there were a way to include a variable representing each home's location (perhaps its ZIP code), I think the predictions could have been more accurate. A highly priced house is likely to be near other highly priced homes, so a model with a ZIP code variable could likely learn which neighborhoods are expensive and raise its price predictions for homes located there.
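One common way to feed a location variable like this into a model is one-hot encoding, where each ZIP code becomes its own 0/1 column. The following is a rough sketch of that idea, not something that was run for this project. It assumes a ZIP code is available for each listing, and the ZIP codes shown and the helper name `one_hot_zipcodes` are illustrative only.

```python
def one_hot_zipcodes(zipcodes):
    """Return the sorted distinct ZIP codes and one 0/1 row per listing."""
    categories = sorted(set(zipcodes))
    position = {zipcode: i for i, zipcode in enumerate(categories)}
    rows = []
    for zipcode in zipcodes:
        row = [0] * len(categories)
        row[position[zipcode]] = 1  # mark the column for this listing's ZIP code
        rows.append(row)
    return categories, rows


# Illustrative ZIP codes, not taken from the project's dataframe.
listing_zipcodes = ["80202", "80210", "80202", "80205"]
categories, encoded = one_hot_zipcodes(listing_zipcodes)

for zipcode, row in zip(listing_zipcodes, encoded):
    print(zipcode, row)
```

The encoded columns could then be stacked next to bedrooms, bathrooms, and living area. Treating the ZIP code as a category rather than a number matters here, since the numeric value of a ZIP code says nothing about price.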

Question 3

Which of the predictions were the most accurate? In which percentile do these most accurate predictions reside? Did your model trend towards over or under predicting home values?

The most accurate prediction my model produced was $776,892 for a home listed at $779,000. This is not surprising, because the scatter plot shows that the best predictions fell in the range of around $300,000 to just under $1,000,000. The model seems to trend towards less accurate predictions as home prices increase. The model also seemed to over-predict and under-predict the actual values fairly evenly.
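The three parts of this question can be checked directly from the actual and predicted prices. The sketch below shows one way to do that, assuming both are available as plain lists. Only the $779,000 / $776,892 pair comes from the results above. The other four pairs are invented for illustration, and `summarize_predictions` is a hypothetical helper name.

```python
def summarize_predictions(actual, predicted):
    """Find the closest prediction, its price percentile, and over/under counts."""
    errors = [p - a for a, p in zip(actual, predicted)]
    best = min(range(len(errors)), key=lambda i: abs(errors[i]))
    # Percentile rank: share of listings priced at or below the best-predicted home.
    rank = 100.0 * sum(1 for a in actual if a <= actual[best]) / len(actual)
    over = sum(1 for e in errors if e > 0)   # predictions above the asking price
    under = sum(1 for e in errors if e < 0)  # predictions below the asking price
    return best, rank, over, under


# Only the third pair is from the project; the rest are made-up examples.
actual = [310_000, 450_000, 779_000, 1_250_000, 2_400_000]
predicted = [335_000, 429_000, 776_892, 1_410_000, 1_980_000]

best, rank, over, under = summarize_predictions(actual, predicted)
print(f"Closest prediction: {predicted[best]:,} for a listing at {actual[best]:,}")
print(f"Percentile of that listing: {rank:.0f}")
print(f"Over-predictions: {over}, under-predictions: {under}")
```

Running the same summary on all 200 listings would give a percentile for the most accurate predictions and an exact over/under count to back up the impression from the scatter plot.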

Question 4

Which feature appears to be the most significant predictor?

To find the most significant predictor, I removed the variables bedrooms, bathrooms, and living area from the stacked array one at a time and re-ran the code each time to see the new MSE. None of the exclusions changed the results too drastically. Excluding bedrooms produced the largest change, raising the MSE from 14.6 to 15.91. For this reason, I am assuming that bedrooms was the most significant predictor for my model.
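The remove-one-variable procedure described above can be written as a short loop. The sketch below is an illustration under several assumptions: it substitutes ordinary least squares (via NumPy) for the project's actual model so that it runs quickly, and it uses randomly generated data rather than the Denver listings. The coefficients, units, and the helper name `fit_mse` are all hypothetical.

```python
import numpy as np


def fit_mse(features, target):
    """Fit ordinary least squares with an intercept and return the in-sample MSE."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return float(np.mean(residuals ** 2))


# Synthetic stand-in data: 200 homes with made-up feature values.
rng = np.random.default_rng(0)
n_homes = 200
columns = {
    "bedrooms": rng.integers(1, 6, n_homes).astype(float),
    "bathrooms": rng.integers(1, 4, n_homes).astype(float),
    "living_area": rng.uniform(0.6, 4.0, n_homes),  # thousands of square feet
}
# Hypothetical price relationship plus noise, in arbitrary scaled units.
price = (0.5 * columns["bedrooms"] + 0.3 * columns["bathrooms"]
         + 1.5 * columns["living_area"] + rng.normal(0.0, 0.5, n_homes))

names = list(columns)
baseline = fit_mse(np.column_stack([columns[n] for n in names]), price)

# Drop each feature in turn and record how much the MSE rises.
ablation = {}
for dropped in names:
    kept = [columns[n] for n in names if n != dropped]
    ablation[dropped] = fit_mse(np.column_stack(kept), price) - baseline

most_important = max(ablation, key=ablation.get)
print(f"Baseline MSE: {baseline:.3f}")
for name, increase in ablation.items():
    print(f"Without {name}: MSE rises by {increase:.3f}")
print(f"Largest increase when removing: {most_important}")
```

With a neural network, the MSE also varies from run to run because of random weight initialization, so a small difference such as 14.6 versus 15.91 is more convincing if it holds up across several re-runs of each configuration.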