
Exploratory data analysis (EDA) in Python using SQL and Seaborn (SNS).

2022-01-30 11:54:07 Stone sword

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, usually with statistical graphics and other data visualization methods. A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond formal modeling or hypothesis testing.

Guess what? ... Always ...


Why should I do an EDA in the first place?

I believe a more appropriate question would be:

Under what circumstances should I not use EDA?

EDA is one of the key steps in data science. It gives us insight into, and statistical measures of, the data we are working with. This is crucial for many audiences, including business managers, stakeholders, and data scientists.

For data scientists, EDA helps define and refine the selection of the feature variables that will be used in machine learning models that have yet to be trained.

In this story, for demonstration purposes, we will use some Fitbit data.

Fitness tracker data is a hot research area for data scientists, statisticians, medical experts, physiologists, and psychologists, to name just a few academic fields. Detecting relationships in complex time series data, such as Fitbit fitness tracker data, can be a way to establish patterns of daily life and also a way to detect deviations from those patterns.

A good EDA can help find these ...


The Fitbit data were thoroughly analyzed. Key findings are highlighted and discussed. The analysis presented here is based on 940 data points collected from 33 different users.

As you read this story, I hope to convey the reasoning and logic that drive the code.

First, to understand the lifestyle of these users, we plot minutes against distance for each activity level.…

As expected, _very active_ people cover distance in less time (in other words, they move faster, which shows up as a steeper regression line). A somewhat unexpected result is that "lightly active minutes" come out faster than "moderately active minutes". It would be interesting to know how this classification is made, in order to really understand the difference between "light" and "moderate" activities.

Let's do a simple linear regression on _total steps_ and _calories_ ...…

Once again, as expected, the number of calories burned in a day increases with the number of steps a user takes. An interesting fact is that the intercept of the regression line represents the number of calories burned in a day with no steps at all, that is, the calories a user burns when completely sedentary. According to the Healthline website, this number corresponds to the basal metabolic rate (BMR).

If we knew the users' gender, weight, height, and age, this value could be calculated. For example, Healthline reports that a 35-year-old man who weighs 175 pounds and is 5 feet 11 inches tall has a BMR of 1,816 calories, and a 35-year-old woman who weighs 135 pounds and is 5 feet 5 inches tall has a BMR of 1,383 calories. To compare these estimates with our data, we can take the intercept of the linear regression. The predicted BMR is ~1665.74 (between the values predicted for 35-year-old women and men).
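As a minimal sketch of how the intercept can be read as a BMR estimate, the snippet below fits an ordinary least-squares line by hand. The numbers are made up for illustration (they are not the Fitbit data), and `fit_line` is a hypothetical helper, not the code used in the original notebook.

```python
# Toy data, made up for illustration: calories = 1600 + 0.1 * steps exactly.
steps = [0, 2000, 5000, 8000, 12000]
calories = [1600.0, 1800.0, 2100.0, 2400.0, 2800.0]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variance terms (unnormalized; the 1/n factors cancel).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(steps, calories)
print(f"calories per step: {slope:.3f}")
print(f"intercept (calories at zero steps): {intercept:.2f}")
```

On real data the intercept is only as trustworthy as the linear fit near zero steps, so it is worth comparing it with the zero-step rows directly.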

If we filter only the data points with zero steps and compute statistics of the calorie distribution, we can learn more about the users' BMR.…
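A sketch of this filter in SQL, run through Python's built-in `sqlite3` module, might look as follows. The table name `daily_activity`, its layout, and the rows are assumptions made for illustration; only the idea of selecting zero-step rows comes from the text.

```python
import sqlite3

# In-memory database with a hypothetical table and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_activity (Id INTEGER, TotalSteps INTEGER, Calories INTEGER)"
)
rows = [
    (1, 0, 1500),
    (1, 9000, 2400),
    (2, 0, 1700),
    (2, 4000, 2000),
    (3, 0, 1900),
]
conn.executemany("INSERT INTO daily_activity VALUES (?, ?, ?)", rows)

# Keep only the days with zero steps and summarize their calories.
n_rows, lowest, mean, highest = conn.execute(
    """
    SELECT COUNT(*), MIN(Calories), AVG(Calories), MAX(Calories)
    FROM daily_activity
    WHERE TotalSteps = 0
    """
).fetchone()

print(f"{n_rows} zero-step rows: min={lowest}, mean={mean:.1f}, max={highest}")
```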

Let's look at the distributions of _very active_, _fairly active_, and _lightly active_ minutes ...…

Here's a problem: it is not clear whether all users wore the fitness tracker throughout the analysis period. If a user logged the whole day, the sum _VeryActiveMinutes_ + _FairlyActiveMinutes_ + _LightlyActiveMinutes_ + _SedentaryMinutes_ should equal 1440 minutes (the total minutes in a day).…
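One way to run this check is sketched below, using SQL through the built-in `sqlite3` module. The four column names are the ones mentioned in the text; the table name `daily_activity` and the rows are assumptions made up for illustration.

```python
import sqlite3

MINUTES_PER_DAY = 24 * 60  # 1440

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE daily_activity (
        VeryActiveMinutes INTEGER,
        FairlyActiveMinutes INTEGER,
        LightlyActiveMinutes INTEGER,
        SedentaryMinutes INTEGER
    )
    """
)
rows = [
    (30, 20, 200, 1190),  # sums to 1440: whole day logged
    (0, 0, 100, 1340),    # sums to 1440: whole day logged
    (25, 10, 180, 700),   # sums to 915: part of the day
    (60, 30, 250, 600),   # sums to 940: part of the day
]
conn.executemany("INSERT INTO daily_activity VALUES (?, ?, ?, ?)", rows)

# In SQLite a comparison evaluates to 1 or 0, so SUM counts the matching rows.
full_days, all_rows = conn.execute(
    """
    SELECT
        SUM(VeryActiveMinutes + FairlyActiveMinutes
            + LightlyActiveMinutes + SedentaryMinutes = :day),
        COUNT(*)
    FROM daily_activity
    """,
    {"day": MINUTES_PER_DAY},
).fetchone()
partial_days = all_rows - full_days

print(f"There are {full_days} (out of {all_rows}) rows where users logged the whole day.")
print(f"There are {partial_days} rows where users logged parts of the day.")
```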

From the code snippet above, we deduce that:

There are 474 (out of 936) rows where users logged the whole day.
There are 462 rows where users logged parts of the day.

The distribution of _lightly active minutes_ is quite symmetric, with no peak at very low activity times. Users who log the whole day may end up registering a large number of _lightly active minutes_, while users who log only part of the day may record only their more demanding activities.…

Now, let's look at sleep habits ...…

Does the day of the week make a difference? Let's now look at our data and its distribution to see whether the day of the week makes a big difference in users' behavior.……

How does sedentary time change on weekends?…
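A possible way to compare weekdays and weekends in SQL is sketched below. It assumes a hypothetical `daily_activity` table whose `ActivityDate` column holds ISO dates (`YYYY-MM-DD`), which SQLite's `strftime` can parse; the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_activity (ActivityDate TEXT, SedentaryMinutes INTEGER)"
)
rows = [
    ("2016-04-15", 1000),  # Friday
    ("2016-04-16", 900),   # Saturday
    ("2016-04-17", 800),   # Sunday
    ("2016-04-18", 1100),  # Monday
]
conn.executemany("INSERT INTO daily_activity VALUES (?, ?)", rows)

# strftime('%w', ...) returns the day of the week as '0' (Sunday) to '6' (Saturday).
query = """
    SELECT
        CASE WHEN strftime('%w', ActivityDate) IN ('0', '6')
             THEN 'weekend' ELSE 'weekday' END AS day_type,
        AVG(SedentaryMinutes) AS avg_sedentary
    FROM daily_activity
    GROUP BY day_type
"""
averages = dict(conn.execute(query).fetchall())

for day_type, avg_sedentary in sorted(averages.items()):
    print(f"{day_type}: {avg_sedentary:.1f} sedentary minutes on average")
```

If the dates are stored in another format (for example month/day/year), they would need to be converted to ISO form before `strftime` can be used.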

How does this distribution depend on the weekend?…

We have now distinguished two groups of users based on the distribution of _sedentary_ time…

It seems we have found a trend here: there is a clear offset, which appears to sit near the boundary between the two groups. So let's verify that ...…

Here, we find a clear trend: users who sleep more tend to sit less. This indicates that the users who sleep the most are often more active during the day.
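One simple way to put a number on a trend like this is the Pearson correlation coefficient. The sketch below computes it by hand on made-up values; the lists and the `pearson_r` helper are illustrative assumptions, not the analysis from the original notebook.

```python
import math

# Made-up daily values for illustration: minutes asleep and sedentary minutes.
sleep_minutes = [300, 360, 420, 480, 540]
sedentary_minutes = [900, 850, 760, 700, 640]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(sleep_minutes, sedentary_minutes)
print(f"Pearson r between sleep and sedentary time: {r:.3f}")
```

A value close to -1 means that more sleep goes together with less sedentary time. It does not by itself say which one causes the other.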

Data insight

Using just the daily activity of our 33 users, we reached some interesting conclusions.

Here, I include some high-level insights from the EDA above.

  • There is no obvious difference in users' activity across the days of the week; the average number of steps per day is about 7670.
  • According to some CDC studies, "... higher daily step counts were associated with lower mortality risk from all causes". The CDC also tells us that "... compared with taking 4,000 steps per day (a number considered low for adults), taking 8,000 steps per day was associated with a 51% lower risk for all-cause mortality (or death from all causes). Taking 12,000 steps per day was associated with a 65% lower risk compared with taking 4,000 steps".
  • If the goal is to burn some calories, we found a linear relationship between the number of steps taken and the calories burned. Accordingly, we can fit a model to the user data to predict how many steps a user should take to reach a given calorie expenditure.
  • Regarding sleep habits, sedentary time decreases significantly as sleep time increases.
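As a rough sketch of the step-prediction idea in the list above, a fitted line calories = intercept + slope × steps can be inverted to get a step target. The intercept below is the ~1665.74 value reported earlier in this story; the slope of 0.1 calories per step is a hypothetical placeholder, and `steps_for_calories` is an illustrative helper rather than part of the original notebook.

```python
import math


def steps_for_calories(target_calories, slope, intercept):
    """Invert calories = intercept + slope * steps to get a step target."""
    if slope <= 0:
        raise ValueError("slope must be positive for the inversion to make sense")
    if target_calories <= intercept:
        # The model predicts this many calories even with zero steps.
        return 0
    return math.ceil((target_calories - intercept) / slope)


INTERCEPT = 1665.74  # intercept reported in this story
SLOPE = 0.1          # hypothetical calories per step, for illustration only

needed = steps_for_calories(2500, SLOPE, INTERCEPT)
print(f"Steps needed to burn 2500 calories in a day: {needed}")
```

The slope has to come from the actual fit, and the prediction is only meaningful inside the range of step counts seen in the data.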

What to do next ?

EDA is usually done to gain insight into the data, which can help us with machine learning tasks. In the next story, we will use the same data set and the derived insights to train several machine learning models to solve a regression problem.…

If you like my story and want to jump to a notebook with the code and the complete data set, I have published it in a repo on my personal git.

Give the repo a star :)

If you need any help with your data science and/or AI projects, please don't hesitate to contact me on LinkedIn.

Exploratory data analysis (EDA) in Python using SQL and Seaborn (SNS) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

copyright notice
Author: Stone sword. Please include the original link when reprinting, thank you.
