Show Your Skills: Analyzing House Prices with Python
2022-02-02 02:47:04 · PI dada
Official account: Youer cottage
Author: Peter
Editor: Peter
Hello everyone, I'm Peter~
This is the second installment of the Kaggle column. The competition is House Prices - Advanced Regression Techniques. In this article you will learn:
- Univariate and multivariate analysis
- Correlation analysis
- Handling missing values and outliers
- Dummy variable conversion
Original notebook: www.kaggle.com/pmarcelino/…
Leaderboard
Looking at the leaderboard, first place truly crushes the other competitors. So today, let's see how great this first-place solution is.
Data introduction
The competition data comes in 4 files: the training set (train), the test set (test), the data set description (description), and a submission template (sample).
The training set has 81 features and 1460 rows; the test set has 80 features (no SalePrice) and 1459 rows. Let's look at a few of the attributes:
Data EDA
Import the modules and the data, then explore:
Import libraries
import pandas as pd
import numpy as np
# Drawing related
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("fivethirtyeight")
# Data modeling
from scipy.stats import norm
from scipy import stats
from sklearn.preprocessing import StandardScaler
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Import data
Data information
The training set as a whole is 1460 rows × 81 columns, and many fields have missing values.
Descriptive statistics:
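The loading step itself is not shown in the article. A minimal sketch of it is below; in the competition you would call `pd.read_csv("train.csv")` on the downloaded files, but here a tiny inline sample (with real column names) stands in so the snippet runs anywhere:

```python
import io
import pandas as pd

# Tiny inline stand-in for train.csv; the real file has 81 columns
csv_text = (
    "Id,GrLivArea,TotalBsmtSF,SalePrice\n"
    "1,1710,856,208500\n"
    "2,1262,1262,181500\n"
    "3,1786,920,223500\n"
)
train = pd.read_csv(io.StringIO(csv_text))

print(train.shape)       # rows x columns; the real train set is (1460, 81)
train.info()             # column dtypes and non-null counts
print(train.describe())  # descriptive statistics of the numeric fields
```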
SalePrice analysis
In the original notebook, the author shares many of his own observations about this field; we won't repeat them all. Below are the key parts:
Statistics
First, look at the descriptive statistics of this field:
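A small sketch of this step, using a synthetic right-skewed price series (a log-normal shape, roughly matching the real distribution) in place of `train["SalePrice"]`:

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed prices standing in for train["SalePrice"]
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(12, 0.4, size=1460)), name="SalePrice")

print(prices.describe())  # count, mean, std, min, quartiles, max
print(prices.skew())      # a positive value confirms the right skew
```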
The distribution histogram is shown below. We can clearly see:
- The price distribution deviates from the normal distribution
- There is obvious positive skewness
- There is an obvious peak
Skewness and kurtosis
Knowledge refresher: skewness and kurtosis
For a detailed explanation, see: zhuanlan.zhihu.com/p/53184516
- Skewness: measures the asymmetry of a random variable's probability distribution, i.e. the degree of asymmetry relative to the mean. From the skewness coefficient we can determine the degree and direction of the asymmetry of the data distribution.
- Kurtosis: a statistic describing how peaked or flat a distribution is. From the kurtosis coefficient we can determine whether the data is more peaked or flatter than the normal distribution. Kurtosis close to 0: roughly normal; kurtosis > 0: more peaked than normal; kurtosis < 0: flatter than normal.
The two cases for skewness:
- Left-skewed: skewness less than 0
- Right-skewed: skewness greater than 0
The two cases for kurtosis:
- Tall and thin (peaked): kurtosis greater than 0
- Short and fat (flat): kurtosis less than 0
# Print the skewness and kurtosis of the sale price
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
Skewness: 1.882876
Kurtosis: 6.536282
Both skewness and kurtosis are positive, which clearly shows that the data is right-skewed and sharply peaked.
SalePrice vs numeric fields
First, we examine the relationship between SalePrice and the above-ground living area GrLivArea:
data = train[["SalePrice","GrLivArea"]]  # the two fields to analyze
plt.figure(1,figsize=(12,6))
sns.scatterplot(x="GrLivArea",y="SalePrice",data=data)
plt.show()
# plotly version
px.scatter(data,x="GrLivArea",y="SalePrice",trendline="ols")
TotalBsmtSF vs SalePrice
# 2. TotalBsmtSF
data = train[["SalePrice","TotalBsmtSF"]]
plt.figure(1,figsize=(12,6))
sns.scatterplot(x="TotalBsmtSF",y="SalePrice",data=data)
plt.show()
Summary: both features show a clear linear relationship with the sale price.
Relationship between price and categorical fields
1. OverallQual vs SalePrice
# 1. OverallQual: overall house quality
# 10 categories in total
train["OverallQual"].value_counts()
5 397
6 374
7 319
8 168
4 116
9 43
3 20
10 18
2 3
1 2
Name: OverallQual, dtype: int64
data = train[["SalePrice","OverallQual"]]
# Relationship between overall house quality and price
f,ax = plt.subplots(1,figsize=(12,6))
fig = sns.boxplot(x="OverallQual",y="SalePrice",data=data)
# y-axis range
fig.axis(ymin=0,ymax=800000)
plt.show()
2. YearBuilt vs SalePrice
The relationship between the year of construction and the selling price:
data = train[["SalePrice","YearBuilt"]]
# Relationship between construction year and house price
f,ax = plt.subplots(1,figsize=(16,8))
fig = sns.boxplot(x="YearBuilt",y="SalePrice",data=data)
# y-axis range
fig.axis(ymin=0,ymax=800000)
plt.show()
Summary: the sale price is strongly related to the overall quality of the house, but only weakly related to the year of construction. In practice, though, home buyers still care about the year.
Summary
To summarize the analysis above:
- Above-ground living area (GrLivArea) and basement area (TotalBsmtSF) both show a positive linear correlation with SalePrice
- Overall house quality (OverallQual) and year built (YearBuilt) also seem linearly related to the sale price. This matches common sense: the better the overall quality, the more expensive the house
Correlation analysis
To explore the relationships among the many attributes, we carry out the following analyses:
- Correlation between every pair of attributes (heat map)
- Correlation between SalePrice and the other attributes (heat map)
- Relationships among the most correlated attributes (scatter plots)
Overall correlation
Compute the correlation between every pair of attributes and draw a heat map.
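The article does not show the code for this step. A minimal sketch is below; it uses a small synthetic DataFrame in place of `train` (the column names are real, the values are not) so it runs anywhere:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the numeric columns of `train`
rng = np.random.default_rng(1)
area = rng.normal(1500, 400, 200)
df = pd.DataFrame({
    "GrLivArea": area,
    "TotalBsmtSF": 0.6 * area + rng.normal(0, 150, 200),
    "SalePrice": 110 * area + rng.normal(0, 20000, 200),
})

corrmat = df.corr()                          # pairwise Pearson correlations
sns.heatmap(corrmat, vmax=0.8, square=True)  # overall heat map
plt.savefig("corr_heatmap.png")
print(corrmat["SalePrice"].sort_values(ascending=False))
```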
Two points in the figure above are noteworthy:
- TotalBsmtSF and 1stFlrSF
- GarageCars and GarageArea
Each pair of variables is strongly correlated, so the subsequent analysis keeps only one variable from each pair.
Zoomed correlation matrix (SalePrice)
From the heat map above, select the 10 features most correlated with SalePrice and draw a heat map:
corrmat = train.corr()  # correlation matrix of the numeric fields
k = 10                  # number of features to keep
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(
    cm,                       # the data to plot
    cbar=True,                # show the color bar legend (default True)
    annot=True,               # write the value in each cell
    square=True,              # make each cell square (default False)
    fmt='.2f',                # keep two decimal places
    annot_kws={'size':10},
    xticklabels=cols.values,  # x/y axis labels
    yticklabels=cols.values)
plt.show()
Summary 1
From the zoomed heat map above, we can draw the following conclusions:
- 'OverallQual', 'GrLivArea' and 'TotalBsmtSF' are indeed strongly correlated with 'SalePrice'
- 'GarageCars' and 'GarageArea' are strongly correlated with each other and always appear together, so the follow-up analysis keeps only GarageCars
- The year built 'YearBuilt' has a comparatively low correlation
Pair plot of the variables
Put SalePrice together with the most highly correlated features and draw a pair plot:
sns.set()
# Variables to analyze
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(train[cols], height=2.5)  # `size` was renamed to `height` in newer seaborn
plt.show()
Summary 2
The diagonal shows the histogram of each variable (the explanatory variables and the explained variable SalePrice); the other cells are scatter plots.
If a plot shows its points lined up in vertical or horizontal lines, the variable is discrete. For example, in the plot at row 1, column 4, the y axis is SalePrice and the x axis is YearBuilt; the vertical lines of points show that YearBuilt takes discrete values.
Handling missing values
For missing values, two questions matter most:
- How prevalent are the missing values?
- Are values missing at random, or do they follow some pattern?
Proportion of missing values
1. Count the missing values in each field
# Number of missing values per field, in descending order
total = train.isnull().sum().sort_values(ascending=False)
total.head()
PoolQC 1453
MiscFeature 1406
Alley 1369
Fence 1179
FireplaceQu 690
dtype: int64
2. Convert to a proportion
# Missing-value ratio for each field
percent = (train.isnull().sum() / train.isnull().count()).sort_values(ascending=False)
percent.head()
PoolQC 0.995205
MiscFeature 0.963014
Alley 0.937671
Fence 0.807534
FireplaceQu 0.472603
dtype: float64
3. Merge the two into a single table of overall missing values:
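The merge itself is not shown in the article; the original notebook builds a `missing_data` table (used in the deletion step below) with `pd.concat`. A runnable sketch on a tiny stand-in frame:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for `train`: PoolQC is mostly missing,
# Electrical has a single missing value
train = pd.DataFrame({
    "PoolQC":     [np.nan, np.nan, np.nan, "Gd"],
    "Electrical": ["SBrkr", np.nan, "SBrkr", "SBrkr"],
    "SalePrice":  [208500, 181500, 223500, 140000],
})

total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum() / train.isnull().count()).sort_values(ascending=False)
# Merge counts and ratios side by side, as the original notebook does
missing_data = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
print(missing_data)
```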
Deleting missing values
The original notebook discusses this at length; the final conclusion:
In summary, to handle missing data:
1. We'll delete all the variables with missing data, except the variable 'Electrical'.
2. In 'Electrical' we'll just delete the observations with missing data.
# Step 1: the fields to be deleted
missing_data[missing_data["Total"] > 1].index
Index(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'LotFrontage',
'GarageYrBlt', 'GarageCond', 'GarageType', 'GarageFinish', 'GarageQual',
'BsmtFinType2', 'BsmtExposure', 'BsmtQual', 'BsmtCond', 'BsmtFinType1',
'MasVnrArea', 'MasVnrType'],
dtype='object')
# Step 1: drop the columns that have more than one missing value
train = train.drop(missing_data[missing_data["Total"] > 1].index, axis=1)
# Step 2: drop the rows where Electrical is missing
train = train.drop(train.loc[train["Electrical"].isnull()].index)
Outliers
Finding outliers
# Standardize the data
# np.newaxis adds a dimension: the 1-D column becomes 2-D, as StandardScaler expects
saleprice_scaled = StandardScaler().fit_transform(train["SalePrice"][:,np.newaxis])
saleprice_scaled[:5]
array([[ 0.34704187],
[ 0.0071701 ],
[ 0.53585953],
[-0.5152254 ],
[ 0.86943738]])
# Look at the 10 smallest and the 10 largest values
# argsort returns the indices that would sort the array (ascending by default)
low_range = saleprice_scaled[saleprice_scaled[:,0].argsort()][:10]
high_range = saleprice_scaled[saleprice_scaled[:,0].argsort()][-10:]
print(low_range)
print('----------')
print(high_range)
Summary 3
- The low_range values are similar to each other and not far from 0
- The high_range values are far from 0, and the values above 7 are most likely outliers
Bivariate analysis 1
data = train[["SalePrice","GrLivArea"]]
data.plot.scatter(x="GrLivArea",y="SalePrice",ylim=(0,800000))
plt.show()
Clearly, the two variables (attributes) have a linear relationship.
Deleting outliers
Delete the rows where a field takes a specific value:
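The deletion code is not shown here; in the original notebook the two GrLivArea outliers (Ids 524 and 1299, large area but low price) are dropped by Id. A runnable sketch on a toy stand-in for `train`:

```python
import pandas as pd

# Toy stand-in for `train`. Ids 524 and 1299 mimic the two points the
# original notebook removes: very large GrLivArea but low SalePrice.
train = pd.DataFrame({
    "Id":        [1,      524,    1299,   4],
    "GrLivArea": [1710,   4676,   5642,   1262],
    "SalePrice": [208500, 184750, 160000, 181500],
})

# Sort by living area to spot the extreme points...
print(train.sort_values(by="GrLivArea", ascending=False).head(2))
# ...then drop them by their Id, as in the original notebook
train = train.drop(train[train["Id"] == 1299].index)
train = train.drop(train[train["Id"] == 524].index)
print(train.shape)  # two rows removed
```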
Bivariate analysis 2
data = train[["SalePrice","TotalBsmtSF"]] # Two variables to be analyzed
data.plot.scatter(x="TotalBsmtSF",y="SalePrice",ylim=(0,800000))
plt.show()
Understanding SalePrice in depth
We study the sale price from the following angles:
- Normality
- Homoscedasticity
- Linearity
- Absence of correlated errors
Normality (SalePrice)
sns.distplot(train["SalePrice"],fit=norm)
fig = plt.figure()
res = stats.probplot(train["SalePrice"], plot=plt)
We find that the sale price is not normally distributed: it is right-skewed, and in the probability plot it does not follow the diagonal line.
To fix this, apply a log transformation.
# Log transformation
train["SalePrice"] = np.log(train["SalePrice"])
sns.distplot(train["SalePrice"],fit=norm)
fig = plt.figure()
res = stats.probplot(train["SalePrice"], plot=plt)
After the log transformation, the distribution looks much better.
Normality (GrLivArea)
sns.distplot(train["GrLivArea"],fit=norm)
fig = plt.figure()
res = stats.probplot(train["GrLivArea"], plot=plt)
Before the log transformation:
After applying the same log transformation:
# Apply the same log transformation
train["GrLivArea"] = np.log(train["GrLivArea"])
sns.distplot(train["GrLivArea"],fit=norm)
fig = plt.figure()
res = stats.probplot(train["GrLivArea"], plot=plt)
Normality (TotalBsmtSF)
sns.distplot(train["TotalBsmtSF"],fit=norm)
fig = plt.figure()
res = stats.probplot(train["TotalBsmtSF"], plot=plt)
Before processing:
The special part here: TotalBsmtSF contains many zeros (houses without a basement), and log(0) is undefined. How do we handle that?
# Add an indicator column
train['HasBsmt'] = 0
# Set it to 1 when TotalBsmtSF > 0
train.loc[train['TotalBsmtSF']>0,'HasBsmt'] = 1
# Log-transform only the rows that have a basement (HasBsmt == 1)
train.loc[train['HasBsmt']==1,'TotalBsmtSF'] = np.log(train['TotalBsmtSF'])
# Plot
data = train[train['TotalBsmtSF']>0]['TotalBsmtSF']
sns.distplot(data,fit=norm)
fig = plt.figure()
res = stats.probplot(data, plot=plt)
Homoscedasticity
The best way to test homoscedasticity between two metric variables is graphically.
1. The relationship between 'SalePrice' and 'GrLivArea'
2. The relationship between 'SalePrice' and 'TotalBsmtSF'
As the original notebook puts it: "We can say that, in general, 'SalePrice' exhibits equal levels of variance across the range of 'TotalBsmtSF'. Cool!"
From the two plots above, we see that the sale price has a positive relationship with both variables, and its spread is roughly even across their ranges.
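A small sketch of this graphical check, on synthetic log-scale data standing in for the transformed columns (constant residual spread, i.e. homoscedastic by construction):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# Synthetic stand-ins for the log-transformed columns
rng = np.random.default_rng(2)
log_area = rng.normal(7.2, 0.3, 500)               # ~ log GrLivArea
log_price = log_area + rng.normal(0.0, 0.15, 500)  # even spread around the trend

plt.scatter(log_area, log_price, s=8)
plt.xlabel("log GrLivArea")
plt.ylabel("log SalePrice")
plt.savefig("homoscedasticity_check.png")

# Quick numeric check: the residual spread in the lower and upper
# halves of the x range should be about the same
resid = log_price - log_area
low = resid[log_area <= np.median(log_area)].std()
high = resid[log_area > np.median(log_area)].std()
print(low, high)
```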
Generating dummy variables
A dummy variable (also called an indicator or nominal variable) is an artificial variable used to represent a qualitative attribute quantitatively, usually taking the value 0 or 1.
Pandas' get_dummies function implements this:
train = pd.get_dummies(train) # Generate dummy variables
train
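A minimal illustration of what `get_dummies` does, using a hypothetical one-column frame (the column name is made up for the example): each category becomes its own 0/1 indicator column. `dtype=int` keeps the output as integers rather than booleans on recent pandas versions.

```python
import pandas as pd

# Hypothetical categorical column for illustration
df = pd.DataFrame({"HouseStyle": ["1Story", "2Story", "1Story"]})
dummies = pd.get_dummies(df, dtype=int)  # one 0/1 column per category
print(dummies)
```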
Summary
So far, we have completed the following:
- Correlation analysis among all the variables
- A focused analysis of the target variable SalePrice
- Handling of missing values and outliers
- Some statistical analysis, plus converting the categorical variables into dummy variables
Points I still need to study further:
- Multivariate statistical analysis
- Skewness and kurtosis
- Dummy variables in depth
- Standardization and normalization
To get the data set, reply "housing price" in the official account backstage. Having read the whole analysis, what do you think?
Copyright notice
Author: PI dada. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202020247017719.html