I chose the city of Denver for Project 1 and created a dataframe with data on 200 homes to create my predictive models. The MSE of my model was 14.6; based on this value and the plots shown below, I believe my model is fairly accurate given the amount of data used to create it.
I think there are a number of variables that could have been included to improve model performance. When I examined the original dataframe, which had 61 columns, two columns stood out: ‘address/community’ and ‘address/neighborhood’. Every value in these columns was NaN; however, if there were a way to include a variable representing each home's location (perhaps its zip code), I think the predictions could have been more accurate. A highly priced house is likely to be near other highly priced homes, so a model with a zip code variable could learn which neighborhoods are expensive and raise its price predictions for homes located there.
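One way the zip code idea could be tried is one-hot encoding, sketched below with NumPy. This is only an illustration: the function name `one_hot_zipcodes` and the zip codes in the example are hypothetical and do not come from the project data.

```python
import numpy as np

def one_hot_zipcodes(zipcodes):
    """Turn a list of zip codes into a 0/1 indicator matrix, one column per distinct code."""
    codes, inverse = np.unique(np.asarray(zipcodes), return_inverse=True)
    inverse = inverse.ravel()  # make sure the index array is 1-D
    one_hot = np.zeros((len(inverse), len(codes)))
    # Put a 1 in the column that matches each home's zip code
    one_hot[np.arange(len(inverse)), inverse] = 1.0
    return codes, one_hot

# Hypothetical zip codes for four homes (not taken from the dataset)
zipcodes = ["80202", "80210", "80202", "80220"]
codes, one_hot = one_hot_zipcodes(zipcodes)
print(codes)
print(one_hot)
```

The resulting columns could then be stacked next to bedrooms, bathrooms, and living area with `np.column_stack`. If the model also fits an intercept, dropping one of the indicator columns avoids redundant information.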
The most accurate prediction my model produced was a predicted price of $776,892 for a home listed at $779,000. This is not surprising, because the scatterplot shows that the best predictions fell in the range of around $300,000 to just under $1,000,000. The model seems to trend toward less accurate predictions as home prices increase. It also seemed to underpredict and overpredict the actual values fairly evenly.
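The checks described above (closest prediction, and how often the model was over or under) could be computed rather than read off the plot. The sketch below is a hypothetical helper, `summarize_errors`; only the $779,000 / $776,892 pair comes from this report, and the other two homes are made-up placeholders.

```python
import numpy as np

def summarize_errors(actual, predicted):
    """Find the closest prediction and count over- and under-predictions."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = predicted - actual  # positive means the model predicted too high
    best = int(np.argmin(np.abs(errors)))
    return {
        "best_index": best,
        "best_abs_error": float(abs(errors[best])),
        "n_over": int(np.sum(errors > 0)),
        "n_under": int(np.sum(errors < 0)),
    }

# Second pair is from the report; first and third are illustrative placeholders
actual = [350_000, 779_000, 1_500_000]
predicted = [362_000, 776_892, 1_320_000]
summary = summarize_errors(actual, predicted)
print(summary)
```

Similar counts for `n_over` and `n_under` on the full set of homes would support the observation that the errors are fairly balanced.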
To find the most significant predictor, I removed the variables bedrooms, bathrooms, and living area from the stacked array one at a time and re-ran the code to see the new MSE. None of the exclusions changed the MSE drastically. Excluding bedrooms produced the largest change, raising the MSE from 14.6 to 15.91. For this reason, I am assuming that bedrooms was the most significant predictor in my model.
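The leave-one-variable-out procedure can be written as a loop instead of editing the stacked array by hand. The sketch below assumes an ordinary least-squares model fit with `np.linalg.lstsq` and reports MSE on the same data used for fitting; the model in the project may differ. The data are synthetic stand-ins, not the Denver listings, and the helper names `fit_mse` and `ablation_mse` are hypothetical.

```python
import numpy as np

def fit_mse(X, y):
    """Fit least squares with an intercept and return the MSE on the fitted data."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return float(np.mean(residuals ** 2))

def ablation_mse(features, y):
    """Return the baseline MSE and the MSE after dropping each feature in turn."""
    names = list(features)
    baseline = fit_mse(np.column_stack([features[n] for n in names]), y)
    dropped = {}
    for name in names:
        kept = [features[n] for n in names if n != name]
        dropped[name] = fit_mse(np.column_stack(kept), y)
    return baseline, dropped

# Synthetic stand-in data (assumption: NOT the real project data)
rng = np.random.default_rng(0)
n = 200
features = {
    "bedrooms": rng.integers(1, 6, n).astype(float),
    "bathrooms": rng.integers(1, 4, n).astype(float),
    "living_area": rng.uniform(600, 4000, n),
}
y = (50 * features["bedrooms"] + 20 * features["bathrooms"]
     + 0.1 * features["living_area"] + rng.normal(0, 5, n))

baseline, dropped = ablation_mse(features, y)
# The feature whose removal raises the MSE the most
most_significant = max(dropped, key=dropped.get)
print(baseline, dropped, most_significant)
```

The feature whose removal gives the largest MSE is the one the model relies on most under this measure, which matches the reasoning used above for bedrooms.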