Exploratory data analysis (EDA) in Python using SQL and Seaborn (SNS).
2022-01-30 11:54:07 【Stone sword】
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
Guess what? ... Always ...
The picture is from unsplash.com
Why should I do an EDA first?
I believe a more appropriate question would be:
Under what circumstances should I not use EDA?
EDA is one of the key steps in data science. It gives us insights into, and statistical measures of, the data we are working with. This is essential for many kinds of users, including business managers, stakeholders and data scientists.
For data scientists, EDA helps define and refine the selection of the feature variables that will be used in machine learning models yet to be trained.
In this story, for demonstration purposes, we will use some FitBit data.
Fitness tracker data is a hot research area for data scientists, statisticians, medical experts, physiologists and psychologists, to name just a few academic fields. Detecting relationships in complex time series data, such as FitBit fitness tracker data, can be a way to establish patterns of daily life as well as to detect deviations from those patterns.
A good EDA can help find these ...
The FitBit data was analyzed thoroughly. Key findings are highlighted and discussed. The analysis presented here is based on 940 data points collected from 33 different users.
As you read this story, I hope to convey the reasoning and logic that drive the code.
First, to understand the lifestyle of these users, we plot minutes against distance for each activity level.
As expected, very active users cover distance in less time (that is, they move at a greater speed, shown by a steeper regression line). A somewhat unexpected result is that "lightly active minutes" correspond to a higher speed than "moderately active minutes". It would be interesting to know how this classification is made, in order to really understand the difference between "light" and "moderate" activity.
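A plot of this kind could be produced roughly as follows. This is only a sketch and not the author's original code: the rows are made up, and the distance column names (VeryActiveDistance, LightActiveDistance) are assumptions about how the data is laid out.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows for illustration only; the column names are assumptions.
df = pd.DataFrame({
    "VeryActiveMinutes": [10, 25, 40, 60],
    "VeryActiveDistance": [1.0, 2.4, 4.1, 5.9],
    "LightlyActiveMinutes": [120, 180, 240, 300],
    "LightActiveDistance": [2.1, 3.0, 4.2, 5.0],
})

fig, ax = plt.subplots()
# One scatter plus fitted line per activity level.
# A steeper line means more distance covered per minute.
for minutes_col, distance_col in [
    ("VeryActiveMinutes", "VeryActiveDistance"),
    ("LightlyActiveMinutes", "LightActiveDistance"),
]:
    sns.regplot(data=df, x=minutes_col, y=distance_col,
                ci=None, label=minutes_col, ax=ax)

ax.set_xlabel("Minutes")
ax.set_ylabel("Distance")
ax.legend()
```

Calling fig.savefig(...) afterwards would write the figure to disk. Passing ci=None skips the bootstrapped confidence band, which keeps the sketch fast on tiny samples.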
Let's run a simple linear regression between _total steps_ and _calories_ ...
Once again, as expected, the number of calories burned in a day increases with the number of steps a user takes. An interesting fact is that the intercept of the regression line represents the number of calories burned in a day without walking at all. This is the amount of calories users burn when they are completely sedentary. According to the Healthline website, this number corresponds to the basal metabolic rate (BMR).
If we knew the sex, weight, height and age of the users, this value could be calculated. For example, Healthline reports that a 35-year-old man who weighs 175 pounds and is 5 feet 11 inches tall has a BMR of 1,816 calories, and a 35-year-old woman who weighs 135 pounds and is 5 feet 5 inches tall has a BMR of 1,383 calories. To compare these estimates with our data, we can obtain the intercept from the linear regression. The predicted BMR is ~1665.74 (between the values predicted for the 35-year-old woman and man).
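To make the role of the intercept concrete, here is a minimal sketch of an ordinary least-squares fit written with the standard library only. The (steps, calories) pairs are made up and lie exactly on a line, so they do not reproduce the ~1665.74 figure above; fit_line is a hypothetical helper name.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up data lying exactly on calories = 0.1 * steps + 1600
steps = [0, 2000, 5000, 8000, 12000]
calories = [1600, 1800, 2100, 2400, 2800]

slope, intercept = fit_line(steps, calories)
# The intercept is the predicted calorie burn at zero steps,
# which the article interprets as an estimate of the BMR.
print(f"slope={slope:.3f} kcal/step, intercept={intercept:.1f} kcal")
```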
If we filter only the data points with zero steps and compute statistics of the calorie distribution, we can learn more about the users' BMR.
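The zero-step filter described above can be sketched in SQL through Python's built-in sqlite3 module. The table name daily_activity, the column names TotalSteps and Calories, and all rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_activity (TotalSteps INTEGER, Calories INTEGER)")
# Made-up rows: three days without any steps, two active days
conn.executemany(
    "INSERT INTO daily_activity VALUES (?, ?)",
    [(0, 1500), (0, 1700), (0, 1900), (8000, 2400), (12000, 2900)],
)

# Summary statistics of calories restricted to days with zero steps
n_rows, avg_cal, min_cal, max_cal = conn.execute(
    """
    SELECT COUNT(*), AVG(Calories), MIN(Calories), MAX(Calories)
    FROM daily_activity
    WHERE TotalSteps = 0
    """
).fetchone()
print(n_rows, avg_cal, min_cal, max_cal)
```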
Let's look at the distributions of _VeryActiveMinutes_, _FairlyActiveMinutes_ and _LightlyActiveMinutes_ ...
Here is a problem: it is not clear whether all users wore the fitness tracker throughout the whole analysis period. If a user logged the whole day, the sum _VeryActiveMinutes_ + _FairlyActiveMinutes_ + _LightlyActiveMinutes_ + _SedentaryMinutes_ should equal 1440 minutes (the total number of minutes in a day).
Applying this check to the data, we find:
There are 474 (out of 936) rows where users logged the whole day.
There are 462 rows where users logged parts of the day.
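The original snippet is not reproduced here. A check of this kind could look like the following sketch, which uses the four column names mentioned above; the table name daily_activity and the rows are made up, so the counts differ from the 474 and 462 reported for the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE daily_activity (
        Id INTEGER,
        VeryActiveMinutes INTEGER,
        FairlyActiveMinutes INTEGER,
        LightlyActiveMinutes INTEGER,
        SedentaryMinutes INTEGER
    )
    """
)
# Made-up rows for illustration
conn.executemany(
    "INSERT INTO daily_activity VALUES (?, ?, ?, ?, ?)",
    [
        (1, 30, 20, 200, 1190),  # sums to 1440: whole day logged
        (1, 10, 5, 150, 700),    # sums to 865: part of the day
        (2, 0, 0, 0, 1440),      # sums to 1440: whole day logged
    ],
)

# Count rows whose four minute columns add up to a full day
full_days, partial_days = conn.execute(
    """
    SELECT
        SUM(CASE WHEN total = 1440 THEN 1 ELSE 0 END),
        SUM(CASE WHEN total <> 1440 THEN 1 ELSE 0 END)
    FROM (
        SELECT VeryActiveMinutes + FairlyActiveMinutes
             + LightlyActiveMinutes + SedentaryMinutes AS total
        FROM daily_activity
    )
    """
).fetchone()
print(f"There are {full_days} (out of {full_days + partial_days}) rows "
      "where users logged the whole day.")
```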
The distribution of _LightlyActiveMinutes_ is quite symmetric, with no peak at very low activity times. Users who log the whole day may end up registering a large number of lightly active minutes, while users who log only part of the day may register only the more demanding activities.
Now, let's look at sleep habits ...
Does the day of the week make a difference? Now that we have looked at our data and its distributions, does the day of the week make a big difference to users' behavior?
How does sedentary time change on weekends?
How does this distribution depend on the weekend?
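One way to approach the weekend question in SQL is to derive a weekday/weekend flag from the date and aggregate on it. The sketch below assumes a table daily_activity with an ISO-formatted (YYYY-MM-DD) ActivityDate column; both names and all values are assumptions, and dates stored in another format would need converting first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_activity (ActivityDate TEXT, SedentaryMinutes INTEGER)"
)
# Made-up rows for illustration
conn.executemany(
    "INSERT INTO daily_activity VALUES (?, ?)",
    [
        ("2016-04-15", 1000),  # Friday
        ("2016-04-16", 900),   # Saturday
        ("2016-04-17", 800),   # Sunday
        ("2016-04-18", 1100),  # Monday
    ],
)

# strftime('%w', ...) returns '0' for Sunday through '6' for Saturday
rows = conn.execute(
    """
    SELECT
        CASE WHEN strftime('%w', ActivityDate) IN ('0', '6')
             THEN 'weekend' ELSE 'weekday' END AS day_type,
        AVG(SedentaryMinutes) AS avg_sedentary
    FROM daily_activity
    GROUP BY day_type
    ORDER BY day_type
    """
).fetchall()
avg_by_day_type = dict(rows)
print(avg_by_day_type)
```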
We have now distinguished two groups of users based on the distribution of sedentary time.
It seems we have found a trend here: there is a clear shift that appears near the boundary between the two groups. Let's verify that ...
Here we found a clear trend: users who sleep more tend to be less sedentary. This suggests that the users who sleep the most are often more active during the day.
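One simple way to quantify a trend like this is a Pearson correlation coefficient between sleep time and sedentary time. The sketch below is not necessarily the verification the author ran; the data are made up and pearson_r is a hypothetical helper.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up daily values in minutes, for illustration only
minutes_asleep = [300, 360, 420, 480, 540]
sedentary_minutes = [950, 900, 760, 720, 600]

r = pearson_r(minutes_asleep, sedentary_minutes)
# A value close to -1 means more sleep goes with less sedentary time
print(f"r = {r:.3f}")
```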
Using just the daily activity of our 33 users, we came to some interesting conclusions.
Here I include some high-level insights from the EDA above.
- There is no obvious difference in user activity across the days of the week; the average number of steps per day is about 7,670.
- According to CDC research, higher daily step counts are associated with a lower risk of death from all causes. The CDC also reports that, compared with taking 4,000 steps per day (a number considered low for adults), taking 8,000 steps per day was associated with a 51% lower risk of all-cause mortality (death from any cause), and taking 12,000 steps per day was associated with a 65% lower risk.
- If the goal is to burn calories, we found a linear relationship between the number of steps taken and the calories burned. We can therefore fit a model to the user data to predict how many steps a user should take to reach a given calorie expenditure.
- Regarding sleep habits, sedentary time decreases markedly as sleep time increases.
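The step-target idea from the list above amounts to inverting the fitted line calories = slope * steps + intercept. In the sketch below the intercept is the ~1665.74 value reported earlier, while the slope of 0.1 calories per step is a made-up placeholder, not a result from the article; steps_for_calories is a hypothetical helper.

```python
def steps_for_calories(target_calories, slope, intercept):
    """Invert calories = slope * steps + intercept to get the steps needed."""
    if slope <= 0:
        raise ValueError("slope must be positive for the inversion to make sense")
    # Targets below the intercept are already met without walking
    return max(0.0, (target_calories - intercept) / slope)

INTERCEPT = 1665.74  # intercept reported in the article
SLOPE = 0.1          # hypothetical calories per step, NOT from the article

steps_needed = steps_for_calories(2500, SLOPE, INTERCEPT)
print(f"About {steps_needed:.0f} steps to burn 2500 calories in a day")
```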
What's next?
EDA is usually done to gain insight into the data that can help with a machine learning task. In the next story, we will use the same data set and the insights derived here to train several machine learning models to solve a regression problem.
Here is the repo. Give it a star :)
Exploratory Data Analysis (EDA) in Python using SQL and Seaborn (SNS) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Author: Stone sword. Please include the original link when reprinting, thank you.