# Data mining: hands-on multi-factor analysis in Python

Official account: Youer cottage
Author: Peter
Editor: Peter

Hello everyone, I'm Peter~

I've read a lot of material on factor analysis and organized it into this article, which combines theory and practice. A later article will cover principal component analysis (PCA) and compare the two dimensionality reduction methods, principal component analysis and factor analysis.

## Factor analysis

As one of the dimensionality reduction methods in multivariate statistical analysis, factor analysis is used in many scenarios, such as research and data modeling.

### Origin

The origin of factor analysis is as follows: in 1904, a British psychologist found that students' scores in English, French and classical languages were highly correlated. He believed that a common driving factor lay behind the three courses, and this factor was eventually defined as "language ability".

Building on this idea, it was found that many groups of highly correlated variables are driven by common factors. The method for finding them was named **factor analysis**, and this is where factor analysis comes from.

### The basic idea

Let's use a more concrete example to understand the basic idea of factor analysis:

Suppose a student gets full marks in math, physics, chemistry and biology. We can then conclude that this student has strong rational thinking. Here, rational thinking is what we call a factor: under its influence, the science scores are all high.

So what exactly is factor analysis? It assumes that all the observed variables x arise from the action of a latent variable, and this latent variable is what we call a factor. Under the action of the factor, x can be observed.

Factor analysis condenses variables that are correlated with one another into a smaller number of factors, uses these factors to represent the original variables, and can also group the variables according to the factors.

Factor analysis is essentially a dimensionality reduction process, similar to the principal component analysis (PCA) algorithm.
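The idea that one hidden factor makes several observed scores move together can be checked with a small simulation. The following is only an illustrative sketch: the factor strengths, the sample size and all variable names are made-up assumptions, not values from any real data set.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students = 5000

# One hidden factor ("rational thinking") per student; purely synthetic
factor = rng.standard_normal(n_students)

subjects = ["math", "physics", "chemistry", "biology"]
# Hypothetical strength with which the factor drives each subject
strength = np.array([0.9, 0.8, 0.8, 0.7])

# Score = strength * factor + subject-specific noise, scaled to unit variance
noise = rng.standard_normal((n_students, len(subjects)))
scores = factor[:, None] * strength + noise * np.sqrt(1 - strength**2)

# The four subjects are never linked directly, yet they come out correlated
corr = np.corrcoef(scores, rowvar=False)
print(np.round(corr, 2))
```

The off-diagonal correlations are all clearly positive even though the subjects only share the hidden factor, which is exactly the pattern factor analysis tries to explain in reverse.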

### Two types of factor analysis

Factor analysis is divided into two types:

- Exploratory factor analysis: we do not know how many factors are at work behind the observed variables, and use this method to try to find them.
- Confirmatory factor analysis: we already assume a certain number of factors behind the observed variables, and use this method to test whether the assumption holds.

### Model derivation

Suppose there are $p$ original variables $x_i\ (i=1,2,\cdots,p)$, which may be independent or correlated. Standardizing $x_i$ gives new variables $z_i$, for which we can set up the following factor analysis model:

$$z_{i}=a_{i1} F_{1}+a_{i2} F_{2}+\cdots+a_{im} F_{m}+c_{i} U_{i}, \quad i=1,2,\cdots,p$$

Here we can define the following terms: common factor, special factor, loading and loading matrix.

1. $F_j\ (j=1,2,\cdots,m)$ appears in the equation of every variable, with $m<p$, and is called a common factor.

2. $U_i\ (i=1,2,\cdots,p)$ is related only to the variable $z_i$ and is called a special factor.

3. The coefficients $a_{ij}, c_i\ (i=1,2,\cdots,p;\ j=1,2,\cdots,m)$ are called loadings.

4. $A=(a_{ij})$ is called the loading matrix.

In matrix form, the model above can be written as $z=AF+CU$

where

$$A=\left(a_{ij}\right)_{p \times m}, \quad C=\operatorname{diag}\left(c_{1}, c_{2}, \cdots, c_{p}\right)$$

We usually make the following assumptions about this model:

1. The special factors are independent of one another and of the common factors, that is:

$$\left\{\begin{array}{l}\operatorname{Cov}(U)=\operatorname{diag}\left(\sigma_{1}^{2}, \sigma_{2}^{2}, \cdots, \sigma_{p}^{2}\right) \\ \operatorname{Cov}(F, U)=0\end{array}\right.$$

2. The common factors are independent normal random variables with mean 0 and variance 1, so their covariance matrix is the identity matrix $I_m$, i.e. $F \sim N(0, I_m)$.

3. The contribution of the $m$ common factors to the variance of the $i$-th variable is called the **communality** of the $i$-th variable, written $h_i^2$:

$$h_{i}^{2}=a_{i1}^{2}+a_{i2}^{2}+\cdots+a_{im}^{2}$$

4. The variance of a special factor is called the **special variance** or **special value** ($\sigma_{i}^{2},\ i=1,2,\cdots,p$).

5. The variance of the $i$-th variable decomposes as follows (taking the scaling $c_i$ as absorbed into $U_i$, so that $\sigma_i^2$ is the variance of the whole special term):

$$\operatorname{Var}(z_{i})=h_{i}^{2}+\sigma_{i}^{2}, \quad i=1,2,\cdots,p$$

Detailed model derivation: https://blog.csdn.net/qq_29831163/article/details/88901245

### Important properties of the factor loading matrix

Some important properties of the factor loading matrix:

1. The factor loading $a_{ij}$ is the correlation coefficient between the $i$-th variable and the $j$-th common factor. It reflects how important the $j$-th common factor is to the $i$-th variable. **The larger its absolute value, the closer the correlation.**
2. Statistical meaning of the communality

$$h_{i}^{2}=a_{i1}^{2}+a_{i2}^{2}+\cdots+a_{im}^{2}=\sum_{j=1}^{m} a_{ij}^{2}$$

That is, the communality of variable $z_i$ is the sum of squares of the elements in the $i$-th row of the factor loading matrix.

Taking the variance of both sides of the factor model gives:

$$\operatorname{Var}\left(z_{i}\right)=a_{i1}^{2} \operatorname{Var}\left(F_{1}\right)+\cdots+a_{im}^{2} \operatorname{Var}\left(F_{m}\right)+\operatorname{Var}\left(\varepsilon_{i}\right)$$

where $\varepsilon_i = c_i U_i$ is the special-factor term of $z_i$, with variance $\sigma_i^2$. Because $z_i$ is standardized and every $\operatorname{Var}(F_j)=1$, this means:

$$1=\sum_{j=1}^{m} a_{ij}^{2}+\sigma_{i}^{2}$$

So the contributions of the common factors and of the special factor to variable $z_i$ sum to 1. If $\sum_{j=1}^{m} a_{ij}^2$ is very close to 1, then $\sigma_i^2$ is very small and the factor analysis works well.

3. Statistical meaning of the variance contribution of the common factor $F_j$

The sum of squares of the elements in each column of the factor loading matrix, $S_j=\sum_{i=1}^{p} a_{ij}^2$, is the total variance contribution of $F_j$ to all the $z_i$. It measures the relative importance of $F_j$.

## Factor analysis steps

The main steps of applying factor analysis are:

- **Standardize** the given data samples
- Compute the **correlation matrix R** of the samples
- Find the **eigenvalues and eigenvectors** of the correlation matrix R
- Determine the number of principal factors from the **cumulative contribution rate** required by the system
- Compute the factor **loading matrix A**
- Finally, determine the factor model

## factor_analyzer library

The core library for factor analysis in Python is factor_analyzer:

```shell
pip install factor_analyzer
```

The library has two main modules to learn:

- factor_analyzer.analyze (the key one)
- factor_analyzer.factor_analyzer

Official documentation: https://factor-analyzer.readthedocs.io/en/latest/factor_analyzer.html

## Case study

The following case shows how to carry out a factor analysis.
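Before turning to the real data, the steps listed under "Factor analysis steps" can be traced by hand with NumPy. This is a minimal sketch, assuming the principal component method of extracting loadings (each loading column is an eigenvector scaled by the square root of its eigenvalue). The factor_analyzer library uses a different extraction method by default, so its numbers will not match these. The synthetic data, the 70% threshold and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 6 observed variables driven by 2 hidden factors (made-up loadings)
true_loadings = np.array([
    [0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
    [0.0, 0.9], [0.0, 0.8], [0.0, 0.7],
])
n = 2000
F = rng.standard_normal((n, 2))
noise_scale = np.sqrt(1 - (true_loadings**2).sum(axis=1))
X = F @ true_loadings.T + rng.standard_normal((n, 6)) * noise_scale

# Step 1: standardize the samples
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: correlation matrix R
R = np.corrcoef(Z, rowvar=False)

# Step 3: eigenvalues and eigenvectors of R, sorted in descending order
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: smallest number of factors whose cumulative contribution reaches the threshold
threshold = 0.70  # hypothetical requirement
cumulative = np.cumsum(eigvals) / eigvals.sum()
m = int(np.searchsorted(cumulative, threshold) + 1)

# Step 5: loading matrix A (principal component method)
A = eigvecs[:, :m] * np.sqrt(eigvals[:m])

# Step 6: communalities (row sums of squared loadings) and special variances
communalities = (A**2).sum(axis=1)
special_variances = 1 - communalities

print("number of factors:", m)
print("communalities:", np.round(communalities, 2))
```

Each communality plus its special variance equals 1, which is the variance decomposition of a standardized variable described in the theory section.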
### Import data

The data used in this article is a public data set. Its description and download address are:

- Data set description: [https://vincentarelbundock.github.io/Rdatasets/doc/psych/bfi.html](https://vincentarelbundock.github.io/Rdatasets/doc/psych/bfi.html)
- Data set download: https://vincentarelbundock.github.io/Rdatasets/datasets.html

The data set contains the answers of 2800 people to 25 questions about personality. These answers are related to 5 hidden traits:

> Big Five Model is widely used nowadays, the five factors include: **neuroticism, extraversion, openness to experience, agreeableness and conscientiousness.**

The correspondence between the items and the traits is:

- Agreeableness: `agree = c("-A1", "A2", "A3", "A4", "A5")`
- Conscientiousness (diligent, responsible): `conscientious = c("C1", "C2", "C3", "-C4", "-C5")`
- Extraversion: `extraversion = c("-E1", "-E2", "E3", "E4", "E5")`
- Neuroticism (instability): `neuroticism = c("N1", "N2", "N3", "N4", "N5")`
- Openness: `openness = c("O1", "-O2", "O3", "O4", "-O5")`

We pretend that we do not know this correspondence in advance, and want to use factor analysis to find the hidden variables behind the 25 variables.
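For comparing the factors found later with the documented traits, the scoring key above can be written as a plain Python dictionary. This is just a transcription of the list above; the names `bfi_keys`, `item_to_trait` and `reverse_keyed` are hypothetical helpers, not part of the data set or of any library.

```python
# Scoring key transcribed from the list above; a leading "-" marks a reverse-keyed item
bfi_keys = {
    "agreeableness": ["-A1", "A2", "A3", "A4", "A5"],
    "conscientiousness": ["C1", "C2", "C3", "-C4", "-C5"],
    "extraversion": ["-E1", "-E2", "E3", "E4", "E5"],
    "neuroticism": ["N1", "N2", "N3", "N4", "N5"],
    "openness": ["O1", "-O2", "O3", "O4", "-O5"],
}

# Map each questionnaire column to the trait it is documented to measure
item_to_trait = {
    item.lstrip("-"): trait
    for trait, items in bfi_keys.items()
    for item in items
}

# Columns whose key carries a minus sign
reverse_keyed = sorted(
    item.lstrip("-")
    for items in bfi_keys.values()
    for item in items
    if item.startswith("-")
)

print(len(item_to_trait), "items;", "reverse-keyed:", reverse_keyed)
```

With such a mapping, the variables that load most strongly on one factor can later be checked against the trait they are supposed to belong to.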
Let's start the hands-on factor analysis.

## Import libraries

Import the libraries needed for data processing and analysis:

```python
# Data processing
import pandas as pd
import numpy as np

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt

# Factor analysis
from factor_analyzer import FactorAnalyzer
```

## Data exploration

### Data overview

First, load the data: 2800 rows and 28 feature columns in total.

```python
df = pd.read_csv("bfi.csv", index_col=0).reset_index(drop=True)
df
```

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/764cef4318c647ecab0cc394d66606e9~tplv-k3u1fbpfcp-zoom-1.image)

Check the missing values of the data: **most fields have missing values**

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/6ac5564d90404759b20d506f3b264e46~tplv-k3u1fbpfcp-zoom-1.image)

### Data preprocessing

Preprocessing removes 3 fields that play no role in the analysis (age, gender, education) and drops the rows that contain null values:

```python
# Remove the fields not used in the analysis
df.drop(["gender", "education", "age"], axis=1, inplace=True)
# Remove rows with null values
df.dropna(inplace=True)
```

## Adequacy testing

Before factor analysis, an adequacy test must be carried out. It mainly examines the correlations between the variables in the correlation matrix and checks whether that matrix is an identity matrix, that is, whether the variables are independent of each other. Two methods are commonly used:

- Bartlett's test of sphericity
- KMO test

### Bartlett's test of sphericity

This test checks whether the correlation matrix of the population variables is an identity matrix (all diagonal elements of the correlation matrix are 1 and all off-diagonal elements are 0), that is, whether the variables are independent of each other.
If it is not an identity matrix, the original variables are correlated and factor analysis can be carried out. Conversely, if there is no correlation between the original variables, the data are not suitable for factor analysis.

```python
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity

chi_square_value, p_value = calculate_bartlett_sphericity(df)
chi_square_value, p_value

# Result: (18170.96635086924, 0.0)
```

The p-value of the statistic is 0, which indicates that the correlation matrix of the variables is not an identity matrix. In other words, the variables are correlated to some degree, and we can do factor analysis.

### KMO test

This test examines the correlations and partial correlations between variables and takes values between 0 and 1. The closer the KMO statistic is to 1, the stronger the correlations between variables, the weaker the partial correlations, and the better factor analysis works.

> **Kaiser-Meyer-Olkin (KMO) Test** measures the suitability of data for factor analysis. It determines the adequacy for each observed variable and for the complete model.
>
> KMO estimates the proportion of variance among all the observed variable. Lower proportion is more suitable for factor analysis. KMO values range between 0 and 1. Value of KMO less than 0.6 is considered inadequate.

Factor analysis is usually considered feasible from a value of 0.6 upward.

```python
from factor_analyzer.factor_analyzer import calculate_kmo

# kmo_all holds one value per variable, kmo_model the value for the whole model
kmo_all, kmo_model = calculate_kmo(df)
kmo_all
```

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/e4ddcf54fcf94c5fa71f8ba44bcb2381~tplv-k3u1fbpfcp-zoom-1.image)

The KMO is greater than 0.6, which also shows that the variables are correlated and that the analysis can proceed.

## Choosing the number of factors

From the data description we already know that these variables are related to 5 hidden factors. Often, however, we do not know this number and have to explore it ourselves.
Method: **compute the eigenvalues of the correlation matrix and sort them in descending order**

### Eigenvalues

```python
faa = FactorAnalyzer(25, rotation=None)
faa.fit(df)

# get_eigenvalues returns two arrays: ev, the eigenvalues of the original
# correlation matrix, and v, the eigenvalues of the common-factor solution
ev, v = faa.get_eigenvalues()
```

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/cc4e4d6d8616426193a95baae204e233~tplv-k3u1fbpfcp-zoom-1.image)

### Visualization

We plot the eigenvalues against the number of factors:

```python
# Draw a scatter plot and a line plot from the same data
plt.scatter(range(1, df.shape[1] + 1), ev)
plt.plot(range(1, df.shape[1] + 1), ev)

# Title and axis names of the figure
# English is safest; Chinese text may be garbled
plt.title("Scree Plot")
plt.xlabel("Factors")
plt.ylabel("Eigenvalue")

plt.grid()  # Show the grid
plt.show()  # Show the figure
```

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/cdb882f69cf3450b99afc683e536f9e0~tplv-k3u1fbpfcp-zoom-1.image)

From the figure above we can clearly see that choosing 5 factors is the most appropriate.

## Modeling

### Factor analysis: fit

We choose 5 factors to build the factor analysis model and set the rotation method of the matrix to **varimax (variance maximization)**:

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/2754e3135f58495c9130b1bfa696e874~tplv-k3u1fbpfcp-zoom-1.image)

Possible values of the rotation parameter:

- varimax (orthogonal rotation)
- promax (oblique rotation)
- oblimin (oblique rotation)
- oblimax (orthogonal rotation)
- quartimin (oblique rotation)
- quartimax (orthogonal rotation)
- equamax (orthogonal rotation)

### Viewing the communalities: get_communalities()

After fitting the model, we look at the communality of each variable, using get_communalities():

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/4c511e807206458da6e95f39ef81bb91~tplv-k3u1fbpfcp-zoom-1.image)

### Viewing the eigenvalues: get_eigenvalues

View the eigenvalues of the variables:
![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/bad000efb31c447f9e44368395ee57b5~tplv-k3u1fbpfcp-zoom-1.image)

### Viewing the loading matrix: loadings_

We now have 25 variables and 5 hidden variables (factors). Let's look at the loading matrix they form:

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/91f79c21b3ce4a0fb8111af70ecb6999~tplv-k3u1fbpfcp-zoom-1.image)

When converted to a DataFrame, the index is our 25 variables and the columns are the 5 specified factors. The data after conversion to DataFrame format:

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/935f9e2d71ae4a96b8241a867ca15f6f~tplv-k3u1fbpfcp-zoom-1.image)

### Viewing the factor contribution rates: get_factor_variance()

From the theory section we know that each factor contributes to the variables to some degree, and that this contribution has a value. Three contribution-related indicators can be checked:

- Total variance contribution: variance (numpy array) – The factor variances
- Variance contribution rate: proportional_variance (numpy array) – The proportional factor variances
- Cumulative variance contribution rate: cumulative_variances (numpy array) – The cumulative factor variances

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/d1dd98a87ca1400182d60193949403cc~tplv-k3u1fbpfcp-zoom-1.image)

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/3d0330bab2124894bf3759a649aa3852~tplv-k3u1fbpfcp-zoom-1.image)

## Hidden variable visualization

To see more intuitively which features each hidden variable is strongly related to, we visualize the loadings. For convenience, we take the absolute value of the correlation coefficients above:

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/7b1650a674c0408287d2e43d87e7ebf5~tplv-k3u1fbpfcp-zoom-1.image)

Then we draw the coefficient matrix as a heatmap:

```python
# Plotting; df1 is the DataFrame of absolute loadings built in the previous step
plt.figure(figsize=(14, 14))
ax = sns.heatmap(df1, annot=True, cmap="BuPu")

# Set the y-axis tick font size
ax.yaxis.set_tick_params(labelsize=15)
plt.title("Factor Analysis", fontsize="xx-large")

# Set the y-axis label
plt.ylabel("Variable", fontsize="xx-large")

# Show the figure
plt.show()

# Save the figure (call this before plt.show() if needed)
# plt.savefig("factorAnalysis", dpi=500)
```

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/045b011bcade49209053a4296445f357~tplv-k3u1fbpfcp-zoom-1.image)

## Converting to new variables: transform

We already know that 5 factors are appropriate, so the raw data can be converted into 5 new features. The conversion is done as follows:

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/28d9d619d1f04ae69af8198ecf55b3a1~tplv-k3u1fbpfcp-zoom-1.image)

The data is easier to read after conversion to a DataFrame: still 2436 rows, now with 5 (new) features.

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/55a1a0af84c34a79940cf4fc1d114098~tplv-k3u1fbpfcp-zoom-1.image)

At this point, we have completed the following work:

1. Correlation testing of the original data
2. Exploration of the number of factors
3. Modeling with factor analysis
4. Visualization of the hidden variables
5. Conversion of the data into 5 new variables

## Reference material

1. Factor Analysis: https://www.datasklr.com/principal-component-analysis-and-factor-analysis/factor-analysis
2. Multivariate analysis: https://mathpretty.com/10994.html
3. factor_analyzer package, official user manual: https://factor-analyzer.readthedocs.io/en/latest/factor_analyzer.html
4. On principal component analysis and factor analysis: https://zhuanlan.zhihu.com/p/37755749