
Data Mining: Hands-On Factor Analysis with Python

2022-01-31 04:52:32 PI dada

Official account: Youer cottage
Author: Peter
Editor: Peter

Hello everyone, I'm Peter~

I've read a lot of material on factor analysis and organized it into this theory-plus-practice article to share with you. A later article will cover PCA (principal component analysis) and compare these two dimensionality-reduction methods.

## Factor analysis

As one of the dimensionality-reduction methods in multivariate statistical analysis, factor analysis can be applied in many scenarios, such as survey research and data modeling.


The origin of factor analysis goes like this: in 1904, a British psychologist found that students' scores in English, French, and classical languages were highly correlated. He believed that a common driving factor lay behind the three courses, and he ultimately defined this factor as "language ability".

Based on this idea, it was found that many highly correlated variables are driven by a **common factor**, and the method built around this observation was named **factor analysis**. This is the origin of the term.

## The basic idea

Let's use a more concrete example to understand the basic idea of factor analysis:

Suppose a student gets full marks in math, physics, chemistry, and biology. We might conclude that this student has strong analytical thinking; here, analytical thinking is what we call a factor. Under the influence of this factor, all the science scores are high.

So what exactly is factor analysis? It assumes that all the observed variables $x$ arise from the action of some latent variable, which is what we call a factor. Under the action of this factor, $x$ can be observed.

Factor analysis distills variables with some degree of correlation into a smaller number of factors, uses these factors to represent the original variables, and can also group the variables according to the factors.

Factor analysis is essentially a dimensionality-reduction process, similar to principal component analysis (PCA).
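To illustrate this similarity, here is a minimal sketch using scikit-learn (a different library from `factor_analyzer`, which is used later in this article). The random data are purely illustrative; both methods map the same 6-dimensional data down to 2 dimensions, just under different model assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))  # toy data: 100 samples, 6 variables

# Both methods reduce the 6 variables to 2 dimensions
pca_scores = PCA(n_components=2).fit_transform(X)
fa_scores = FactorAnalysis(n_components=2).fit_transform(X)
print(pca_scores.shape, fa_scores.shape)  # (100, 2) (100, 2)
```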

## Two types of factor analysis

Factor analysis comes in two types:

  • Exploratory factor analysis: it is not known how many factors are at work behind the existing variables; this method is used to try to discover them.
  • Confirmatory factor analysis: it is already hypothesized that a certain number of factors lie behind the variables; this method tests whether that hypothesis holds.

## Model derivation

Suppose there are $p$ original variables $x_i\ (i=1,2,\dots,p)$, which may be independent or correlated. Standardizing $x_i$ yields new variables $z_i$, for which we can set up the following factor analysis model:

$$z_{i}=a_{i 1} F_{1}+a_{i 2} F_{2}+\cdots+a_{i m} F_{m}+c_{i} U_{i} \quad (i=1,2, \cdots, p)$$

In this model we define the following terms: common factors, special factors, factor loadings, and the loading matrix.

1. $F_j\ (j=1,2,\dots,m)$ appears in the expression of every variable, with $m<p$; these are called the **common factors**.

2. $U_i\ (i=1,2,\dots,p)$ is associated only with the variable $z_i$ and is called a **special factor**.

3. The coefficients $a_{ij}$ and $c_i\ (i=1,2,\dots,p;\ j=1,2,\dots,m)$ are called **factor loadings**.

4. $A=(a_{ij})$ is called the **loading matrix**.

The model above can be written in matrix form as $z=AF+CU$.

This satisfies:

$$A=\left(a_{i j}\right)_{p \times m}, \quad C=\operatorname{diag}\left(c_{1}, c_{2}, \cdots, c_{p}\right)$$

We usually make the following assumptions about the model:

1. Every pair of special factors, and every special factor and common factor, are mutually independent and satisfy:

$$\left\{\begin{array}{l}\operatorname{Cov}(U)=\operatorname{diag}\left(\sigma_{1}^{2}, \sigma_{2}^{2}, \cdots, \sigma_{p}^{2}\right) \\ \operatorname{Cov}(F, U)=0\end{array}\right.$$

2. All common factors are independent normal random variables with mean 0 and variance 1; their covariance matrix is the identity matrix $I_m$, i.e. $F \sim N(0, I_m)$.

3. The contribution of the $m$ common factors to the variance of the $i$-th variable is called the $i$-th **communality**, written $h_i^2$:

$$h_{i}^{2}=a_{i 1}^{2}+a_{i 2}^{2}+\cdots+a_{i m}^{2}$$

4. The variance of a special factor is called the **specific variance** or **uniqueness** ($\sigma_{i}^{2},\ i=1,2,\dots,p$).

5. The variance of the $i$-th variable decomposes as $\operatorname{Var} z_{i}=h_{i}^{2}+\sigma_{i}^{2},\ i=1,2, \cdots, p$.

### Important properties of the factor loading matrix

1. The factor loading $a_{ij}$ is the correlation coefficient between the $i$-th variable and the $j$-th common factor; it reflects the importance of the $j$-th common factor to the $i$-th variable. **The larger its absolute value, the stronger the correlation.**
2. Statistical meaning of the communality

$$h_{i}^{2}=a_{i 1}^{2}+a_{i 2}^{2}+\cdots+a_{i m}^{2}=\sum_{j=1}^{m}a_{ij}^2$$

That is, the communality of variable $X_i$ is the sum of squares of the elements in the $i$-th row of the factor loading matrix.

Taking the variance of both sides of the model equation:

$$\operatorname{Var}\left(X_{i}\right)=a_{i 1}^{2} \operatorname{Var}\left(F_{1}\right)+\cdots+a_{i m}^{2} \operatorname{Var}\left(F_{m}\right)+\operatorname{Var}\left(\varepsilon_{i}\right)$$

which gives $1=\sum_{j=1}^{m} a_{i j}^{2}+\sigma_{i}^{2}$.

We can see that the contributions of the common factors and the special factor to variable $X_i$ sum to 1. If $\sum_{j=1}^m a_{ij}^2$ is very close to 1, then $\sigma_i^2$ is very small, and the factor analysis fits the data well.

3. Statistical meaning of the variance contribution of the common factor $F_{j}$

The sum of squares of the elements in the $j$-th column of the factor loading matrix, $S_j=\sum_{i=1}^p a_{ij}^2$, is called the variance contribution of $F_j$ to all the variables $X_i$; it measures the relative importance of $F_j$.

## Factor analysis steps

The main steps of factor analysis are:

- **Standardize** the given data sample
- Compute the sample **correlation matrix R**
- Find the **eigenvalues and eigenvectors** of the correlation matrix R
- Determine the number of principal factors according to the required **cumulative contribution rate**
- Compute the factor **loading matrix A**
- Finally, determine the factor model

## The factor_analyzer library

The core Python library for factor analysis is `factor_analyzer`:

```shell
pip install factor_analyzer
```

This library has two main modules to learn:

- factor_analyzer.analyze (the key one)
- factor_analyzer.factor_analyzer

Official documentation:

## Case study

Below, a case study shows how to carry out factor analysis.
### Import the data

The data used in this article is a public dataset. Its introduction and download links are:

- Dataset introduction:
- Dataset download:

This dataset collects 2,800 people's answers to 25 personality questions. The data relate to 5 hidden traits:

> The Big Five Model is widely used nowadays; the five factors include: **neuroticism, extraversion, openness to experience, agreeableness and conscientiousness.**

The correspondence between the traits and the questions is:

- Agreeableness: agree=c("-A1","A2","A3","A4","A5")
- Conscientiousness: conscientious=c("C1","C2","C3","-C4","-C5")
- Extraversion: extraversion=c("-E1","-E2","E3","E4","E5")
- Neuroticism: neuroticism=c("N1","N2","N3","N4","N5")
- Openness: openness=c("O1","-O2","O3","O4","-O5")

We do not know the correspondence of these hidden variables in advance, so we want to use factor analysis to find the hidden variables behind the 25 observed variables.
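Before diving in, it helps to see how such data could arise. The purely synthetic sketch below (all loadings and noise levels are made up) generates 25 answers driven by 5 latent traits; this is exactly the structure factor analysis tries to recover:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n, latent = 2800, 5  # respondents, hidden traits

traits = rng.standard_normal((n, latent))            # the 5 hidden traits
loadings = rng.uniform(0.4, 0.9, size=(latent, 25))  # made-up loading pattern
answers = traits @ loadings + 0.5 * rng.standard_normal((n, 25))

# Column names shaped like the bfi questionnaire: A1..A5, C1..C5, ...
cols = [f"{g}{i}" for g in "ACENO" for i in range(1, 6)]
df_sim = pd.DataFrame(answers, columns=cols)
print(df_sim.shape)  # (2800, 25)
```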
Let's begin the hands-on factor analysis:

## Import libraries

Import the libraries needed for data processing and analysis:

```python
# Data processing
import pandas as pd
import numpy as np

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt

# Factor analysis
from factor_analyzer import FactorAnalyzer
```

## Data exploration

### Data overview

First, let's preview the data: there are 2,800 rows and 28 feature attributes in total.

```python
df = pd.read_csv("bfi.csv", index_col=0).reset_index(drop=True)
df
```

Checking the data for missing values shows that **most fields contain missing values**.

### Data preprocessing

Preprocessing consists of removing 3 fields that have no bearing on the analysis (age, gender, education) and dropping rows with null values:

```python
# Remove unneeded fields
df.drop(["gender", "education", "age"], axis=1, inplace=True)

# Drop rows with null values
df.dropna(inplace=True)
```

## Adequacy tests

Before factor analysis, adequacy tests are needed. They mainly check the correlations in the correlation matrix of the variables, i.e. whether it is an identity matrix, which amounts to testing whether the variables are mutually independent. There are usually two methods:

- Bartlett's sphericity test
- KMO test

### Bartlett's sphericity test

This test checks whether the population correlation matrix is an identity matrix (all diagonal elements of the correlation matrix are 1 and all off-diagonal elements are 0); that is, it tests whether the variables are independent.
If it is not an identity matrix, the original variables are correlated and factor analysis can be performed; conversely, if the original variables are uncorrelated, the data are not suitable for factor analysis.

```python
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity

chi_square_value, p_value = calculate_bartlett_sphericity(df)
chi_square_value, p_value

# result
(18170.96635086924, 0.0)
```

The p-value is 0, indicating that the correlation matrix of the variables is not an identity matrix; in other words, the variables are correlated, so we can run factor analysis.

### KMO test

The KMO test checks the correlation and partial correlation between variables; its value lies between 0 and 1. The closer the KMO statistic is to 1, the stronger the correlation between variables, the weaker the partial correlation, and the better factor analysis will work.

> The **Kaiser-Meyer-Olkin (KMO) Test** measures the suitability of data for factor analysis. It determines the adequacy for each observed variable and for the complete model.
>
> KMO estimates the proportion of variance among all the observed variables. A lower proportion is more suitable for factor analysis. KMO values range between 0 and 1. A KMO value less than 0.6 is considered inadequate.

Factor analysis is usually carried out when the value is at least 0.6.

```python
from factor_analyzer.factor_analyzer import calculate_kmo

kmo_all, kmo_model = calculate_kmo(df)
kmo_all
```

The KMO is greater than 0.6, which also shows that the variables are correlated and can be analyzed.

## Choosing the number of factors

From the data description we already know that the variables are related to 5 hidden factors. In many cases, however, we do not know the number in advance and need to explore it ourselves.
The method: **compute the eigenvalues of the correlation matrix and sort them in descending order**.

### Eigenvalues and eigenvectors

```python
faa = FactorAnalyzer(25, rotation=None)
faa.fit(df)  # fit before extracting eigenvalues

# Get the eigenvalues ev and eigenvectors v
ev, v = faa.get_eigenvalues()
```

### Visualization

We plot the eigenvalues against the number of factors:

```python
# Plot a scatter and a line chart with the same data
plt.scatter(range(1, df.shape[1] + 1), ev)
plt.plot(range(1, df.shape[1] + 1), ev)

# Set the title and the axis labels
# (English is safest here; Chinese labels may render as garbled characters)
plt.title("Scree Plot")
plt.xlabel("Factors")
plt.ylabel("Eigenvalue")

plt.grid()  # Show grid
plt.show()  # Show the figure
```

From the scree plot we can clearly see that choosing 5 factors is most appropriate.

## Modeling

### Fitting the factor model - fit

We choose 5 factors to build the factor analysis model, and specify the rotation method as **varimax** (variance maximization).

Other values of the rotation parameter:

- varimax (orthogonal rotation)
- promax (oblique rotation)
- oblimin (oblique rotation)
- oblimax (orthogonal rotation)
- quartimin (oblique rotation)
- quartimax (orthogonal rotation)
- equamax (orthogonal rotation)

### Checking the communalities - get_communalities()

After fitting the model as above, we can look at the variance each variable shares with the factors using get_communalities().

### Checking the eigenvalues - get_eigenvalues

We can view the eigenvalues of the variables with get_eigenvalues().

### Checking the loading matrix - loadings

There are now 25 variables and 5 hidden variables (factors); the loadings_ attribute gives the loading matrix they form. Converted to a DataFrame, the index is our 25 variables and the columns are the 5 specified factors.
### Checking the factor contributions - get_factor_variance()

From the theory section we know that each factor contributes a certain amount to the variance of the variables, and this contribution can be quantified. get_factor_variance() returns 3 contribution-related indicators:

- Total variance contribution: variance (numpy array) – the factor variances
- Variance contribution rate: proportional_variance (numpy array) – the proportional factor variances
- Cumulative variance contribution rate: cumulative_variances (numpy array) – the cumulative factor variances

## Visualizing the hidden variables

To observe more intuitively which features each hidden variable is most strongly related to, we visualize the loadings. For convenience, we first take the absolute values of the correlation coefficients above, then draw the coefficient matrix as a heatmap:

```python
# Plot
plt.figure(figsize=(14, 14))

# df1: DataFrame of absolute factor loadings, built above
ax = sns.heatmap(df1, annot=True, cmap="BuPu")

# Set the y-axis tick font size
ax.yaxis.set_tick_params(labelsize=15)
plt.title("Factor Analysis", fontsize="xx-large")

# Set the y-axis label
plt.ylabel("Variables", fontsize="xx-large")

# Show the figure
plt.show()

# Save the figure
# plt.savefig("factorAnalysis", dpi=500)
```

## Converting to new variables - transform

We established above that 5 factors are appropriate, so the raw data can be converted into 5 new features with transform. Converting the result to a DataFrame makes it easier to read: still 2436 rows, now with 5 (new) features.

So far, we have completed the following work:

1. Correlation testing of the raw data
2. Exploring the number of factors
3. The modeling process of factor analysis
4. Visualization of the hidden variables
5. Producing data based on the 5 new variables

## References

1. Factor Analysis:
2. Multivariate analysis:
3. `factor_analyzer package` official user manual:
4. On principal component analysis and factor analysis:

copyright notice
Author: [PI dada]. Please include the original link when reprinting, thank you.
