Four Fast and Easy Ways to Visualize Data in Python

2022-09-09 06:50:51 Programmer Lao Hua

Data visualization is a very important part of a data science or machine learning project. You usually need to do exploratory data analysis (EDA) early in a project to get a feel for the data, and creating visualizations makes that analysis clearer and easier to understand, especially for large, high-dimensional datasets. Near the end of the project, it is just as important to present the final results in a clear, concise, and eye-catching way that your audience (usually non-technical customers) can understand.


A heat map is a matrix representation of data in which the value of each matrix element is represented by a color. Different colors represent different values, and the row and column indices of the matrix pair up the two items or features being compared. Heat maps are great for showing relationships between multiple feature variables, because you can read the magnitude of a matrix element directly from its color. By looking at the other cells in the heat map, you can also see how each relationship compares with the others in the dataset. Color is so intuitive that it gives us a very simple way to interpret the data.

Now let's look at the implementation code. Compared with matplotlib, seaborn can be used to draw more advanced graphics, which usually require more components, such as multiple colors, graphs, or variables. Here matplotlib is used to display the figure, NumPy to generate the data, and pandas to handle it. The plotting itself is just one simple seaborn function call.

# Importing libs
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a random dataset
data = pd.DataFrame(np.random.random((10,6)), columns=["Iron Man","Captain America","Black Widow","Thor","Hulk", "Hawkeye"])


# Plot the heatmap and display it
heatmap_plot = sns.heatmap(data, center=0, cmap='gist_ncar')
plt.show()
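
The heat map above colors raw values. Because heat maps are often used to compare pairs of features, the following is a minimal sketch of a correlation heat map on the same kind of random data. The seed, the `coolwarm` colormap, and the annotation format are illustrative choices, not part of the original article.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random data with the same shape and column names as the example above.
# The seed is arbitrary and only makes the sketch repeatable.
rng = np.random.default_rng(0)
names = ["Iron Man", "Captain America", "Black Widow", "Thor", "Hulk", "Hawkeye"]
data = pd.DataFrame(rng.random((10, 6)), columns=names)

# Pairwise Pearson correlation between columns: a symmetric 6x6 matrix.
corr = data.corr()

# A diverging colormap centered at 0 separates positive from negative correlation.
ax = sns.heatmap(corr, vmin=-1, vmax=1, center=0, cmap="coolwarm",
                 annot=True, fmt=".2f", square=True)
ax.set_title("Pairwise correlation (random data)")
plt.tight_layout()
```

Call `plt.show()` afterwards to display the figure. With random data the off-diagonal correlations carry no meaning; the point is only the layout, where each cell compares the feature on its row with the feature on its column.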


A 2D density plot is an intuitive extension of the one-dimensional density plot. Its advantage over the 1D version is that it shows the probability distribution over two variables. For example, in the 2D density plot below, the color bar on the right shows the probability associated with each color. The place where our data is most likely to appear (that is, where the data points are most concentrated) seems to be around size=0.5, speed=1.4. As you can see, a 2D density plot is very useful for quickly finding where the data is most concentrated across two variables, rather than just one as in a 1D density plot. When you have two variables that both matter to the output and you want to understand how they jointly shape its distribution, a 2D density plot is a very effective way to view the data.

Once again, writing the code with seaborn is very convenient! This time, we will create a skewed distribution to make the visualization more interesting. You can adjust most of the optional parameters to make the plot look clearer.

# Importing libs
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skewnorm

# Create the data
speed = skewnorm.rvs(4, size=50) 
size = skewnorm.rvs(4, size=50)

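# Create and show the 2D density plot
# (keyword x=/y= and bw_method follow the seaborn 0.11+ API; older releases used positional data and bw=)
ax = sns.kdeplot(x=speed, y=size, cmap="Reds", fill=False, bw_method=.15, cbar=True)
ax.set(xlabel='speed', ylabel='size')
plt.show()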
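
Reading the densest region off the plot by eye can also be done numerically. The following is a rough sketch, assuming SciPy's `gaussian_kde` with its default bandwidth (not necessarily the bandwidth seaborn uses), a larger sample of 500 points, fixed seeds, and a 100x100 evaluation grid; all of these are illustrative choices.

```python
import numpy as np
from scipy.stats import gaussian_kde, skewnorm

# Skewed samples as in the example above; sample size and seeds are arbitrary.
speed = skewnorm.rvs(4, size=500, random_state=0)
size = skewnorm.rvs(4, size=500, random_state=1)

# Fit a 2D kernel density estimate; the dataset must have shape (2, n_points).
kde = gaussian_kde(np.vstack([speed, size]))

# Evaluate the density on a regular grid that covers the data.
xs = np.linspace(speed.min(), speed.max(), 100)
ys = np.linspace(size.min(), size.max(), 100)
xx, yy = np.meshgrid(xs, ys)
density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)

# The grid cell with the highest estimated density approximates the mode.
i, j = np.unravel_index(np.argmax(density), density.shape)
peak_speed, peak_size = xx[i, j], yy[i, j]
print(f"densest region near speed={peak_speed:.2f}, size={peak_size:.2f}")
```

The reported location depends on the random sample, the bandwidth, and the grid resolution, so treat it as an estimate of where the data concentrates rather than an exact value.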


A spider plot is one of the best ways to show a one-to-many relationship. In other words, you can plot and view the values of multiple variables related to a single variable or category. In a spider plot, the significance of one variable relative to another is clear and obvious, because the area covered and the distance from the center both grow in that variable's direction. If you want to see how different classes of objects are described by these variables, you can draw them side by side. In the chart below, it is easy to compare the different attributes of the Avengers and see where their respective strengths lie! (Note that these numbers were set arbitrarily; I have no prejudice against any member of the Avengers.)

Here we can use matplotlib directly, rather than seaborn, to create the visualization. We need each attribute to be equally spaced around the circumference. We will put a label at each corner, then plot each value as a point whose distance from the center depends on its value/magnitude. Finally, for clarity, we will fill the area enclosed by the lines connecting the attribute points with a semi-transparent color.

# Import libs
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

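# Get the data
# The source shows only the printed table, not the loading code, so the
# DataFrame is rebuilt inline from those values (the '#' id column is omitted).
df = pd.DataFrame({
    "Name": ["Iron Man", "Captain America", "Thor", "Hulk", "Black Widow", "Hawkeye"],
    "Attack": [83, 60, 80, 80, 52, 58],
    "Defense": [80, 62, 82, 100, 43, 64],
    "Speed": [75, 63, 83, 67, 60, 58],
    "Range": [70, 80, 100, 44, 50, 80],
    "Health": [70, 80, 100, 92, 65, 65],
})
print(df)

# Get the data for Iron Man (row 0)
labels = ["Attack", "Defense", "Speed", "Range", "Health"]
stats = df.loc[0, labels].to_numpy(dtype=float)

# Make some calculations for the plot: one evenly spaced angle per attribute
angles = np.linspace(0, 2*np.pi, len(labels), endpoint=False)

# Repeat the first point at the end so the outline closes
stats_closed = np.append(stats, stats[0])
angles_closed = np.append(angles, angles[0])

# Plot stuff
fig = plt.figure()
ax = fig.add_subplot(111, polar=True)
ax.plot(angles_closed, stats_closed, 'o-', linewidth=2)
ax.fill(angles_closed, stats_closed, alpha=0.25)
ax.set_thetagrids(angles * 180/np.pi, labels)
plt.show()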
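
The text mentions drawing several spider plots side by side to compare classes of objects. The following is a minimal sketch of that idea for two heroes, using values from the table above; the choice of heroes, the shared radial limit of 100, and the figure size are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

labels = ["Attack", "Defense", "Speed", "Range", "Health"]
# Attribute values copied from the table above for two of the heroes.
heroes = {
    "Iron Man": [83, 80, 75, 70, 70],
    "Hulk": [80, 100, 67, 44, 92],
}

# One evenly spaced angle per attribute; repeat the first to close the outline.
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
closed_angles = np.append(angles, angles[0])

# One polar subplot per hero, placed side by side.
fig, axes = plt.subplots(1, len(heroes), figsize=(9, 4),
                         subplot_kw={"projection": "polar"})
for ax, (name, values) in zip(axes, heroes.items()):
    closed_values = np.append(values, values[0])
    ax.plot(closed_angles, closed_values, "o-", linewidth=2)
    ax.fill(closed_angles, closed_values, alpha=0.25)
    ax.set_thetagrids(np.degrees(angles), labels)
    ax.set_ylim(0, 100)  # same radial scale so the shapes are comparable
    ax.set_title(name)
fig.tight_layout()
```

Keeping the radial limit identical across subplots matters: without it each plot scales to its own maximum and the filled areas are no longer comparable. Call `plt.show()` to display the figure.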


We have been using tree diagrams (dendrograms) since elementary school! Dendrograms are natural and intuitive, which makes them easy to interpret. Directly connected nodes are closely related, whereas nodes separated by many connections are less similar. In the visualization below, I drew a dendrogram for a small subset of a Pokemon game dataset, based on statistics from Kaggle (HP, Attack, Defense, Special Attack, Special Defense, Speed).

As a result, the Pokemon whose stats match most closely are linked closely together. For example, at the top of the figure, Arbok is directly connected to Fearow. If we look at the data, Arbok has a total score of 438 and Fearow has 442, which is very close! But if we look at Raticate, its total score is 413, which is quite different from Arbok and Fearow, so they are separated in the dendrogram! As we move up the tree, the Pokemon in the green group are more similar to each other than they are to any Pokemon in the red group, even where there is no direct green connection between them.

For dendrograms, we actually use SciPy to do the plotting! After reading in the dataset, we remove the string columns. This is done only to keep the visualization simple and easy to understand; in practice, converting these strings to categorical variables would give better results and comparisons. We also set the index of the DataFrame so that it can be used to label each node. Finally, computing the linkage and drawing the dendrogram in SciPy each take just one simple line of code.

# Import libs
import pandas as pd
from matplotlib import pyplot as plt
from scipy.cluster import hierarchy
import numpy as np

# Read in the dataset
# Drop any fields that are strings
# Only get the first 40 because this dataset is big
df = pd.read_csv('Pokemon.csv')
df = df.set_index('Name')
df = df.drop(["Type 1", "Type 2", "Legendary"], axis=1)
df = df.head(n=40)

# Calculate the distance between each sample
Z = hierarchy.linkage(df, 'ward')

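# Draw the tree with the root on the left, labeling each leaf by Pokemon name
# (a plain list of labels avoids issues some SciPy versions have with a pandas Index)
hierarchy.dendrogram(Z, orientation="left", labels=df.index.to_list())
plt.show()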
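
The colored groups in a dendrogram come from cutting the tree at some height. A similar grouping can be extracted as explicit labels with `scipy.cluster.hierarchy.fcluster`. The following is a minimal sketch on synthetic stand-in data, not on Pokemon.csv: the names `mon_0` to `mon_7`, the stat values, and the choice of two clusters are all hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

# Synthetic stand-in data: two well-separated groups of four "Pokemon" each.
rng = np.random.default_rng(0)
cols = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
weak = rng.normal(50, 5, size=(4, 6))
strong = rng.normal(100, 5, size=(4, 6))
df = pd.DataFrame(np.vstack([weak, strong]), columns=cols,
                  index=[f"mon_{i}" for i in range(8)])

# Same linkage step as in the dendrogram example above.
Z = hierarchy.linkage(df.to_numpy(), "ward")

# Cut the tree so that at most two flat clusters remain.
clusters = hierarchy.fcluster(Z, t=2, criterion="maxclust")
groups = pd.Series(clusters, index=df.index, name="cluster")
print(groups)
```

Note that the dendrogram's default coloring uses a distance threshold rather than a fixed number of clusters, so the two groupings agree only when the cut falls at a comparable height.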


Copyright notice
Author: Programmer Lao Hua. Please include the original link when reprinting. Thank you.
