# Understanding pandas: reading Excel, TXT, CSV and other file formats

2022-01-31 20:38:47

## The official pandas documentation on reading files

Before using pandas to read files, the essential reference is the official documentation:

pandas.pydata.org/pandas-docs…

File reading and writing belongs to the Input/Output (IO) part of pandas. The basic API is all on that page; this article walks through the most common commands.

To read a TXT file, first check whether it follows a consistent format, i.e. whether it uses `\t`, spaces, or some other special separator.

A typical TXT file looks like this. The following file is separated by spaces:

```
1 2019-03-22 00:06:24.4463094  Chinese test
2 2019-03-22 00:06:32.4565680  You need to edit encoding
3 2019-03-22 00:06:32.6835965 ashshsh
4 2017-03-22 00:06:32.8041945 eggg
```

```python
import pandas as pd

df = pd.read_csv("./test.txt")
print(df)
```

Note, however, that what was read here is a DataFrame of 3 rows and 1 column — not the 3 rows and 4 columns we wanted.

```python
import pandas as pd

df = pd.read_csv("./test.txt")
print(type(df))
print(df.shape)
```

```
<class 'pandas.core.frame.DataFrame'>
(3, 1)
```

By default, `read_csv` loads delimited data from a file, URL, or file-like object into a new DataFrame, and the default separator is a comma.

The TXT file above is not comma-separated, so the `sep` parameter has to be added when reading it:

```python
df = pd.read_csv("./test.txt", sep=' ')
```
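As a quick check of the sep behaviour, here is a minimal sketch using an in-memory buffer in place of test.txt (the two data lines are illustrative, not the file above):

```python
import io
import pandas as pd

# in-memory stand-in for a space-separated file
data = io.StringIO("1 2019-03-22 cat\n2 2019-03-22 dog\n")
df = pd.read_csv(data, sep=' ', header=None)
print(df.shape)  # one column per space-separated field: (2, 3)
```

With the separator specified, each space-separated field becomes its own column instead of the whole line collapsing into one column.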

Parameter reference (official source: github.com/pandas-dev/…), with translated descriptions and examples for the key parameters.

| Parameter | Meaning |
| --- | --- |
| filepath_or_buffer | Path or URL. Supported URL schemes include http, ftp, s3 and local files, e.g. `file://localhost/path/to/table.csv` |
| sep | str, default `','`. The separator. If nothing is specified, the comma default is used. A separator longer than one character (other than `'\s+'`) forces the Python parser and is treated as a regular expression, ignoring quoted commas in the data. Regex example: `'\r\t'` |
| delimiter | Alternative name for the separator (if this parameter is given, `sep` is ignored). Rarely used |
| delim_whitespace | True or False, default False. Use whitespace as the separator, equivalent to `sep='\s+'`. If this parameter is set, `delimiter` does not take effect |
| header | Row number to use as the column names (comment lines are ignored). Default `header=0`; if the file has no column-name row, pass `header=None`. `header` can also be a list of ints, e.g. `[0, 1, 3]`; unlisted rows in between are skipped (row 2 in this example) |
| names | Explicit list of column names. If the file contains no header row, you should also explicitly pass `header=None` |
| index_col (case 1) | Default None. Column to use as the row labels of the DataFrame; if a sequence is given, a MultiIndex is used. If the file has a delimiter at the end of each line, consider `index_col=False` so pandas does not use the first column as the row labels |
| usecols | Default None. A sequence of column positions or names, e.g. `[0, 1, 2]` or `['foo', 'bar', 'baz']`. Using this parameter speeds up loading and reduces memory consumption |
| squeeze | Default False. If True and the parsed data contains only one column, a Series is returned |
| prefix | Prefix for automatically numbered column names, e.g. `'X'` gives X0, X1, ... Only takes effect when `header=None` or header is not set |
| mangle_dupe_cols | Default True: duplicate columns are renamed `'X.0'…'X.N'` instead of `'X'…'X'`. Passing False causes data to be overwritten when duplicate column names exist |
| dtype | Data type per column, e.g. `{'a': np.float64, 'b': np.int32}`, where a and b are column names |
| engine | Parsing engine: `'c'` or `'python'`. The C engine is faster, but the Python engine has more features |
| converters (case 2) | Handler functions for specified columns; a column can be identified by position or by name |
| true_values / false_values | Values to treat as True / False. No practical use case found yet; to be filled in later |
| skipinitialspace | Skip spaces after the separator. Default False |
| skiprows | Default None. Number of lines to skip at the start of the file, or a list of line numbers to skip (counting from 0) |
| skipfooter | Number of lines to skip at the end of the file (not supported by the C engine) |
| na_values | Values to recognize as NaN. By default `'#N/A'`, `'#N/A N/A'`, `'#NA'`, `'-1.#IND'`, `'-1.#QNAN'`, `'-NaN'`, `'-nan'`, `'1.#IND'`, `'1.#QNAN'`, `'N/A'`, `'NA'`, `'NULL'`, `'NaN'`, `'n/a'`, `'nan'`, `'null'` are all read as NaN |
| keep_default_na | If `na_values` is given and `keep_default_na=False`, the default NaN values are replaced; otherwise the extra values are appended |
| na_filter | Whether to detect missing values (empty strings or NA values). For large files with no NA values, `na_filter=False` can speed up reading |
| verbose | Print extra parser output, e.g. "the number of missing values in non-numeric columns" |
| skip_blank_lines | If True, skip blank lines; otherwise record them as NaN |
| parse_dates | Several forms: 1. boolean: True → parse the index; 2. list of ints or names, e.g. `[1, 2, 3]` → parse columns 1, 2 and 3 each as a separate date column; 3. list of lists, e.g. `[[1, 3]]` → combine columns 1 and 3 and parse the result as a single date column; 4. dict, e.g. `{'foo': [1, 3]}` → parse columns 1 and 3 as a date and name the merged column "foo" |
| infer_datetime_format | If True and `parse_dates` is enabled, pandas tries to infer the datetime format and, if it can, switches to a faster parsing method. In some cases this is 5–10 times faster |
| keep_date_col | If date parsing joins multiple columns, keep the columns that took part in the join. Default False |
| date_parser | Function for parsing dates; defaults to `dateutil.parser.parser`. pandas tries three different ways, moving on to the next if one fails: 1. pass one or more arrays (as specified by `parse_dates`) as arguments; 2. concatenate the specified columns row-wise into a single string column and pass that; 3. call `date_parser` once per row with one or more strings (as specified by `parse_dates`) as arguments |
| dayfirst | Dates are in DD/MM format |
| chunksize | Number of rows per chunk |
| compression | Read a compressed file directly from disk. With `'infer'`, files whose names end in `'.gz'`, `'.bz2'`, `'.zip'` or `'.xz'` are decompressed with gzip, bz2, zip or xz respectively; otherwise no decompression. With `'zip'`, the ZIP archive must contain exactly one file. Use None for no decompression. (Version 0.18.1 added zip and xz support) |
| thousands | Thousands separator, e.g. `','` |
| decimal | Decimal point character, default `'.'` |
| lineterminator | Line separator; only used with the C parser |
| quotechar | Quote character, marking the start and end of a quoted item; separators inside quotes are ignored |
| quoting | Controls csv quoting: QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3) |
| doublequote | When a `quotechar` is defined and `quoting` is not QUOTE_NONE, indicates whether two consecutive quote characters inside a field should be read as a single quote character |
| escapechar | Character used to escape the separator when `quoting=QUOTE_NONE` |
| comment | Marks content that should not be parsed. If the character appears at the start of a line, the whole line is ignored. The parameter must be a single character. Like blank lines (when `skip_blank_lines=True`), comment lines are ignored by `header` and `skiprows`. For example, with `comment='#'`, parsing `'#empty\na,b,c\n1,2,3'` with `header=0` returns 'a,b,c' as the header |
| encoding | Character set of the file, usually specified as `'utf-8'` |
| dialect | If no dialect is given, a `sep` longer than one character is ignored. See the `csv.Dialect` documentation |
| error_bad_lines | By default, a row containing too many columns raises an error and no DataFrame is returned. Set to False to drop such "bad lines" instead (C parser only) |
| warn_bad_lines | If `error_bad_lines=False` and `warn_bad_lines=True`, every "bad line" is reported (C parser only) |
| low_memory | Process the file in chunks internally, which lowers memory use but can produce mixed-type inference. To guarantee consistent types, set it to False or specify the types with `dtype`. Note the whole file is still read into a single DataFrame regardless; use the `chunksize` or `iterator` parameter to actually read in chunks (C parser only) |
| delim_whitespace | New in version 0.18.1: supported by the Python parser |
| memory_map | If `filepath_or_buffer` is a file path, map the file object directly into memory and access the data from there. This can improve performance because it avoids further I/O overhead |
| float_precision | Which converter the C engine uses for floating-point values |

Part of the table above is based on the blog www.cnblogs.com/datablog/p/… — thanks to the blogger for the translation.

Case 1

Using index_col

First prepare a TXT file. The biggest problem with this file is the extra ',' at the end of every line. According to the documentation, a trailing separator on each line should cause problems, but in actual testing I found that the `names` parameter also has to be supplied for `index_col=False` to take effect.

```
goof,1,2,3,ddd,
u,1,3,4,asd,
as,df,12,33,
```


Write the following code

```python
df = pd.read_csv("./demo.txt", header=None, names=['a', 'b', 'c', 'd', 'e'])
print(df)
```

In practice I did not find this particularly meaningful; maybe the documentation just doesn't explain its purpose clearly. Let's look at a more common use of index_col instead.

When reading a file, if index_col is not set, a default integer index starting from 0 is used. If you then manipulate rows or columns of the table and save it back to file, you will find an extra unnamed column starting from 0. Setting the index_col parameter as the column index avoids this problem.
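That round trip can be sketched as follows; a minimal example, using an in-memory buffer instead of a real file (column names are invented for the example):

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# saving without index=False writes the 0-based index as an extra first column
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
back = pd.read_csv(buf)          # an extra "Unnamed: 0" column appears
print(back.columns.tolist())     # ['Unnamed: 0', 'a', 'b']

# telling read_csv that the first column is the index removes it
buf.seek(0)
back2 = pd.read_csv(buf, index_col=0)
print(back2.columns.tolist())    # ['a', 'b']
```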

Case 2

converters sets a handler function for a specified column; the column can be identified by position or by name.

```python
import pandas as pd

def fun(x):
    return str(x) + "-haha"

# the read call applies fun to column 0 of the test.txt used earlier
df = pd.read_csv("./test.txt", sep=' ', converters={0: fun})
print(type(df))
print(df.shape)
print(df)
```
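A self-contained sketch of converters on inline data (the column names here are made up for the example):

```python
import io
import pandas as pd

def fun(x):
    return str(x) + "-haha"

csv = io.StringIO("a,b\n1,2\n3,4\n")
# the key may be a column position (0) or a column name ('a')
df = pd.read_csv(csv, converters={'a': fun})
print(df['a'].tolist())  # ['1-haha', '3-haha']
```

Note that the converter receives the raw, unparsed string for each cell, which is why '1' and '3' arrive as strings.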

### Common problems with the read_csv function

1. In some IDEs, when using pandas' read_csv to import a data file, an error is raised if the file path or file name contains Chinese characters. Solution:

```python
import pandas as pd
# df = pd.read_csv('F:/test_folder/test_data.txt')  # may fail on a path with Chinese characters
f = open('F:/test_folder/test_data.txt')
df = pd.read_csv(f)  # pass an already-opened file object instead
```

2. Excluding certain rows: use the skiprows parameter. Note the difference: excluding the first 3 rows is skiprows=3, while excluding only the 3rd row is skiprows=[3].
3. For irregular separators, use a regular expression to read the file. When the separator is a single space, sep=" " reads the file correctly. But when the separator may be one space or several, sep=" " can produce very strange data, because the extra spaces are treated as data too. In that case use data = pd.read_csv("data.txt", sep="\s+").
4. If a Chinese encoding error occurs while reading the file, set the encoding parameter.
5. Adding row and column indexes: use the names parameter to add a column index and index_col to add a row index.
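Several of these points can be combined in a single call; a sketch on inline data (the file content and column names are invented for the example):

```python
import io
import pandas as pd

raw = io.StringIO(
    "junk line to skip\n"
    "1  2019-03-22   foo\n"
    "2  2019-03-22   bar\n"
)
# \s+ absorbs runs of spaces; skiprows drops the first line; names adds a column index
df = pd.read_csv(raw, sep=r'\s+', skiprows=1, header=None,
                 names=['id', 'date', 'text'])
print(df)
```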

The read_csv command has a large number of parameters. Most of them are unnecessary, because most files you download are in a standard format.

read_table is basically used the same way; the difference lies in the default separator.

read_csv reads comma-separated values and can only split data on ','; read_table defaults to '\t' (tab) to split the data.

Reading a file with fixed-width columns, such as this one:

```
id8141    360.242940   149.910199   11950.7
id1594    444.953632   166.985655   11788.4
id1849    364.136849   183.628767   11806.2
id1230    413.836124   184.375703   11916.8
id1948    502.953953   173.237159   12468.3
```

colspecs:

You need to supply a list of tuples; the intervals are half-open, [from, to). By default the column boundaries are inferred from the first 100 rows of data.

Example:

```python
import pandas as pd

colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
```

widths:

A simple list of field widths can be used instead of the colspecs parameter:

```python
widths = [6, 14, 13, 10]
```
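Both variants can be tried on the fixed-width sample above; a sketch using an in-memory buffer in place of the file:

```python
import io
import pandas as pd

fixed = io.StringIO(
    "id8141    360.242940   149.910199   11950.7\n"
    "id1594    444.953632   166.985655   11788.4\n"
)
# half-open [from, to) intervals per column
colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
df = pd.read_fwf(fixed, colspecs=colspecs, header=None)
print(df)

# widths is the shorthand: field widths instead of [from, to) pairs
fixed.seek(0)
df2 = pd.read_fwf(fixed, widths=[6, 14, 13, 10], header=None)
print(df2.shape)  # (2, 4)
```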

read_fwf is not used very often; you can refer to pandas.pydata.org/pandas-docs… to learn more.

pandas also supports a newer serializable data format: a lightweight, portable binary format similar to binary JSON. It uses space efficiently and gives good performance for both writing (serialization) and reading (deserialization).

read_clipboard reads data from the clipboard and can be seen as the clipboard version of read_table. It is useful for turning web pages into tables.

The following BUG can appear here:

module 'pandas' has no attribute 'compat'

Updating pandas made it work normally for me.

There is one more pitfall: when reading the clipboard, if you copied Chinese text, the data often cannot be read.

Solution:

1. Open site-packages\pandas\io\clipboard.py (you need to locate this file yourself)
2. After the line text = clipboard_get(), add: text = text.decode('UTF-8')
3. Save, and then it works

For read_excel, the official documentation again comes first: pandas.pydata.org/pandas-docs…

| Parameter | Meaning |
| --- | --- |
| io | File-like object, pandas ExcelFile, or xlrd workbook. The string may be a URL; supported schemes include http, ftp, s3 and local files, e.g. `file://localhost/path/to/workbook.xlsx` |
| sheet_name | Default 0, returning the first sheet. Use a list such as `sheet_name=[0, 1]` to return multiple sheets, or `sheet_name=None` to return all sheets. Note: an int/string returns a DataFrame, while None or a list returns a dict of DataFrames; sheet names are given as strings, sheet positions as integers |
| header | Row to use as the column names. Default 0 (the first row); the data sits below the header row. If the data contains no column names, set `header=None` |
| names | Explicit list of column names |
| index_col | Column to use as the index; a string name also works. If a list is passed, those columns are combined into a MultiIndex |
| squeeze | If the parsed data contains only one column, return a Series |
| dtype | Data type of the data or of specific columns; see read_csv |
| engine | If io is not a buffer or a path, this must be set to identify io. Acceptable values are None or 'xlrd' |
| converters | See read_csv |
| remaining parameters | Essentially the same as read_csv |

A common error when pandas reads an Excel file:

ImportError: No module named 'xlrd'

pandas needs the separate xlrd module to read Excel files; pip install xlrd fixes it.

| Parameter | Meaning |
| --- | --- |
| path_or_buf | A valid JSON file; default None. The string can be a URL, e.g. `file://localhost/path/to/table.json` |
| orient (case 1) | Expected JSON string format. Possible values: 1. 'split': dict like {index -> [index], columns -> [columns], data -> [values]}; 2. 'records': list like [{column -> value}, ... , {column -> value}]; 3. 'index': dict like {index -> {column -> value}}; 4. 'columns': dict like {column -> {index -> value}}; 5. 'values': just the values array |
| typ | Format to return (series or frame); default 'frame' |
| dtype | Data type of the data or of columns; see read_csv |
| convert_axes | boolean; try to convert the axes to the proper dtypes. Default True |
| convert_dates | List of columns to parse as dates. If True, try to parse date-like columns; default True. Reference column labels: ends with '_at', ends with '_time', begins with 'timestamp', is 'modified', is 'date' |
| keep_default_dates | boolean, default True. If parsing dates, also parse the default date-like columns |
| numpy | Decode directly to numpy arrays. Default False. Only numeric data is supported, though labels may be non-numeric. Also note: if numpy=True, the JSON MUST be sorted |
| precise_float | boolean, default False. Enables the higher-precision (strtod) function when decoding strings to doubles. The default (False) uses the fast but imprecise built-in |
| date_unit | string, timestamp unit used to detect and convert dates; default None. By default the timestamp precision is detected; to override, pass 's', 'ms', 'us' or 'ns' to force seconds, milliseconds, microseconds or nanoseconds respectively |
| encoding | JSON encoding |
| lines | Read one JSON object per line of the file |

If the JSON cannot be parsed, the parser raises one of ValueError/TypeError/AssertionError.

Case 1

1. orient='split'

```python
import pandas as pd

s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,5],[6,9]]}'
df = pd.read_json(s, orient='split')
```

2. orient='records', whose members are a list of dicts

```python
import pandas as pd

s = '[{"a":1,"b":2},{"a":3,"b":4}]'
df = pd.read_json(s, orient='records')
```

3. orient='index', keyed by index, with a dict of column fields as the value, e.g. s = '{"0":{"a":1,"b":2},"1":{"a":2,"b":4}}'
4. orient='columns' or 'values' can be inferred by analogy
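A runnable sketch of the first two orients (wrapping the strings in StringIO keeps newer pandas versions from complaining about literal JSON strings):

```python
import io
import pandas as pd

s_split = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,5],[6,9]]}'
df_split = pd.read_json(io.StringIO(s_split), orient='split')
print(df_split)

s_records = '[{"a":1,"b":2},{"a":3,"b":4}]'
df_records = pd.read_json(io.StringIO(s_records), orient='records')
print(df_records)
```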

A partial Chinese translation is available on github: github.com/apachecn/pa…

Reading a JSON file may raise ValueError: Trailing data — a JSON format problem.

The original format was:

```
{"a":1,"b":1},{"a":2,"b":2}
```

Adjust it to:

```
[{"a":1,"b":1},{"a":2,"b":2}]
```

Or use the lines parameter, with the JSON adjusted to one object per line:

```
{"a":1,"b":1}
{"a":2,"b":2}
```

If the JSON file contains Chinese, it is recommended to set the encoding parameter to 'utf-8', otherwise an error is raised.
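The lines=True variant, sketched on the two-line example above:

```python
import io
import pandas as pd

jl = io.StringIO('{"a":1,"b":1}\n{"a":2,"b":2}\n')
df = pd.read_json(jl, lines=True)  # one JSON object per line
print(df)
```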

| Parameter | Meaning |
| --- | --- |
| io | Accepts a URL, file, or string. https URLs are not accepted; try removing the s from https |
| match | Regular expression; returns the tables matching it |
| flavor | The parser; defaults to 'lxml' |
| header | Row where the column headers live; a list produces a MultiIndex |
| index_col | Column that supplies the row labels; a list produces a MultiIndex |
| skiprows | Skip rows: a list skips the listed rows, an integer skips the first n rows |
| attrs | Attributes, e.g. attrs = {'id': 'table'} |
| parse_dates | Parse dates |

Usage: right-click the page and inspect it; if what you found really is a <table> element, you can use read_html.

For example: data.stcn.com/2019/0304/1…

```html
<table class="..." id="...">
  <tr>
    <th>...</th>
  </tr>
  <tbody>
    <tr>
      <td>...</td>
    </tr>
    <tr>...</tr>
  </tbody>
</table>
```

```
<table> : defines the table
<tbody> : defines the table body
<tr>    : defines a table row
<th>    : defines a header cell
<td>    : defines a data cell
```

A common BUG here is a missing parser: installing html5lib fixes it, or you can pass the flavor parameter described above:

```python
import pandas as pd

df_list = pd.read_html(url, flavor='lxml')  # illustrative; url is your page
```

For more details, refer to the source code and pandas.pydata.org/pandas-docs…

An overview of the pandas read and write functions in action:

```python
import pandas as pd

# NB: the myCSV_02 / myCSV_03 / ch05_05 / ch05_06 file names below are
# assumptions, following the pandas_data_test naming used elsewhere in
# this post; only the calls whose text survives later are certain.
csvframe = pd.read_csv('pandas_data_test\myCSV_01.csv')
print(csvframe, "\n-----*-----")
csvframe1 = pd.read_table('pandas_data_test\myCSV_01.csv', sep=',')
print(csvframe1, "\n-----*-----")
csvframe2 = pd.read_csv('pandas_data_test\myCSV_02.csv', header=None)  # file has no header row
print(csvframe2, "\n-----*-----")
csvframe20 = pd.read_csv('pandas_data_test\myCSV_02.csv',
                         names=['white', 'red', 'blue', 'green', 'animal'])  # supply the header
print(csvframe20, "\n-----*-----")
csvframe30 = pd.read_csv('pandas_data_test\myCSV_03.csv')
print(csvframe30, "\n-----*-----")
csvframe31 = pd.read_csv('pandas_data_test\myCSV_03.csv',
                         index_col=['color', 'status'])  # hierarchical index
print(csvframe31, "\n-----*-----")
txtframe4 = pd.read_table('pandas_data_test\ch05_04.txt', sep='\s+')  # parse with a regex separator
print(txtframe4, "\n-----*-----")
txtframe5 = pd.read_table('pandas_data_test\ch05_05.txt', sep='\D+',
                          header=None, engine='python')
print(txtframe5, "\n-----*-----")
# With the skiprows option you can exclude redundant rows: put the row
# numbers of the rows to exclude into a list and assign it to this option.
txtframe6 = pd.read_table('pandas_data_test\ch05_06.txt', sep=',',
                          skiprows=[0, 1, 3, 6])
print(txtframe6)
```

```
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
5      4    4     2      1    mou
-----*-----
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
5      4    4     2      1    mou
-----*-----
   0  1  2  3      4
0  1  5  2  3    cat
1  2  7  8  5    dog
2  3  3  6  7  horse
3  2  2  8  3   duck
4  4  4  2  1  mouse
-----*-----
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
-----*-----
   color status  iteml  item2  item3
0  black     up      3      4      6
1  black   down      2      6      7
2  white     up      5      5      5
3  white   down      3      3      2
4  white   left      1      2      1
5    red     up      2      2      2
6    red   down      1      1      4
-----*-----
              iteml  item2  item3
color status
black up          3      4      6
      down        2      6      7
white up          5      5      5
      down        3      3      2
      left        1      2      1
red   up          2      2      2
      down        1      1      4
-----*-----
   white  red  blue  green
0      1    5     2      3
1      2    7     8      5
2      3    3     6      7
-----*-----
   0    1    2
0  0  123  122
1  1  124  321
2  2  125  333
-----*-----
   white  red  blue  green animal
0      1    5     2      3   cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3  duck
4      4    4     2      1  mouse
```

Reading part of the data from a TXT file:

```python
print(csvframe2, "\n-----*-----")
# nrows=2 limits how many rows are read; skiprows=[2] drops that row
csvfram20 = pd.read_csv('pandas_data_test\myCSV_02.csv', header=None,
                        nrows=2, skiprows=[2])  # read call reconstructed from the comment; file name assumed
print(csvfram20)
```

```
   0  1  2  3      4
0  1  5  2  3    cat
1  2  7  8  5    dog
2  3  3  6  7  horse
3  2  2  8  3   duck
4  4  4  2  1  mouse
-----*-----
   0  1  2  3    4
0  1  5  2  3  cat
1  2  7  8  5  dog
```

Another interesting and common operation is to split the text to be parsed into chunks, then iterate over the parts and perform a particular operation on each one.

For example, for a column of numbers, sum every few rows and append each partial sum to a Series object. This little example is easy to understand and has no practical value, but once you understand how it works, you can apply it in more complex situations.

```python
csvframe1 = pd.read_table('pandas_data_test\myCSV_01.csv', sep=',')
print(csvframe1, "\n-----*-----")
out = pd.Series()
# the chunksize parameter decides how many rows go into each chunk
pieces = pd.read_csv('pandas_data_test\myCSV_01.csv', chunksize=4)
i = 0
for piece in pieces:
    print(piece['white'])
    out.at[i] = piece['white'].sum()
    i += 1
print(out, "\n-----*-----")
```

```
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
5      4    4     2      1    mou
-----*-----
0    1
1    2
2    3
3    2
Name: white, dtype: int64
4    4
5    4
Name: white, dtype: int64
0    8
1    8
dtype: int64
```
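The same chunked sum can be written against inline data, so the sketch does not depend on the myCSV_01.csv file (column values chosen to match the 'white' column above):

```python
import io
import pandas as pd

csv = io.StringIO("white,red\n1,5\n2,7\n3,3\n2,2\n4,4\n4,4\n")
# each chunk is itself a DataFrame of up to 4 rows
sums = [chunk['white'].sum() for chunk in pd.read_csv(csv, chunksize=4)]
out = pd.Series(sums)
print(out.tolist())  # [8, 8]
```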

Go to CSV File write data

 1 print(csvframe1)
2 print(csvframe1.to_csv('pandas_data_test\ch05_07.csv'))
3 #  Use index and  header Options , Set their values to False, The default write can be cancelled index and header
5 print(csvframe30.to_csv('pandas_data_test\ch05_08.csv'))
6 #  It can be used to_csv() Functional na_rep Option to replace the empty field with the value you want . Common values are NULL、0 and NaN
7 print(csvframe30.to_csv('pandas_data_test\ch05_09.csv',na_rep=" empty "))
8 Out[4]: 9    white  red  blue  green animal
10 0      1    5     2      3    cat
11 1      2    7     8      5    dog
12 2      3    3     6      7  horse
13 3      2    2     8      3   duck
14 4      4    4     2      1  mouse
15 5      4    4     2      1    mou
16 None
17 None
18 None
19 None
Copy code 
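A buffer-based sketch of the index/na_rep options, so the effect can be seen without opening the written files (the DataFrame is invented for the example):

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, None], 'b': [3, 4]})

buf = io.StringIO()
df.to_csv(buf, index=False, na_rep='NULL')  # drop the index, fill the NaN hole
print(buf.getvalue().splitlines())  # ['a,b', '1.0,3', 'NULL,4']
```

Column 'a' becomes float because of the missing value, which is why 1 is written back as 1.0.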

Entering the folder, we can see the corresponding files. Writing to and reading back HTML:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(4).reshape(2, 2))
print(frame.to_html())
frame2 = pd.DataFrame(np.random.random((4, 4)),
                      index=['white', 'black', 'red', 'blue1'],
                      columns=['up', 'down', 'right', 'left'])
s = ['<HTML>']
s.append('<BODY>')
s.append(frame.to_html())
s.append('</BODY></HTML>')
html = ''.join(s)
html_file = open('pandas_data_test\myFrame.html', 'w')
html_file.write(html)
html_file.close()
web_frames = pd.read_html('pandas_data_test\myFrame.html')  # read back the file just written
print(web_frames[0])
ranking = pd.read_html('...')  # the ranking-page URL was elided in the post
print(ranking[0][1:10])  # output rows 1-9 of the ranking table
```

```
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>3</td>
    </tr>
  </tbody>
</table>
   Unnamed: 0  0  1
0           0  0  1
1           1  2  3
    #             Nome   Exp  Livelli
1   2            admin  9029      NaN
2   3      BrunoOrsini  2124      NaN
3   4        Berserker   700      NaN
4   5         Dnocioni   543      NaN
5   6  albertosallusti   409      NaN
6   7              Jon   233      NaN
7   8             Mr.Y   180      NaN
8   9  michele sisinni   157      NaN
9  10           Selina   136      NaN
```

None of pandas' I/O API functions are designed specifically for the XML (Extensible Markup Language) format. Even so, the format is actually very important, because a lot of structured data is stored in XML. The lack of a dedicated pandas function doesn't matter, because Python has many libraries (besides pandas) for reading and writing XML data. One of them, lxml, stands out from the crowd for its excellent performance on large files. This section describes how to use it to process XML files, how to integrate it with pandas, and finally how to get the data you need out of an XML document and convert it into a DataFrame object.

The XML source file is the books.xml used below.

```python
from lxml import objectify

xml = objectify.parse('pandas_data_test\books.xml')
root = xml.getroot()  # get the root node
print(root.Book.Author)
mes1 = root.Book.getchildren()
print("root.Book.getchildren() gets the child tag contents:\n", mes1)
mes2 = root.Book[1].getchildren()  # take the second Book tag
print([child.tag for child in mes2])   # get the child tags
print([child.text for child in mes2])  # get the child tag contents
```

```
272103_l_EnRoss, Mark
root.Book.getchildren() gets the child tag contents:
 [' 272103_l_EnRoss, Mark', 'XML Cookbook', 'Computer', 23.56, '2014-22-0l']
['Author', 'Title', 'Genre', 'Price', 'PublishDate']
[' 272l03_l_EnBracket, Barbara', 'XML for Dummies', 'Computer', '35.95', '20l4-l2-l6']
```


HDF5 format

So far we have learned how to read and write text formats. To analyze large amounts of data, a binary format is preferable. Python has several tools for handling binary data, and the HDF5 library has had some success in this area. HDF stands for hierarchical data format. The HDF5 library handles reading and writing of HDF5 files, whose data structure consists of nodes and which can store large data sets.

The library is developed entirely in C and provides interfaces for Python, MATLAB and Java. Its rapid spread is due both to wide adoption by developers and to its efficiency, especially when storing large amounts of data. Compared with formats that handle binary data more simply, HDF5 supports real-time compression, exploiting repeating patterns in the data structure to shrink files.

Currently there are two ways to work with HDF5 data in Python: PyTables and h5py. The two differ in several respects, and which to choose depends largely on your specific needs. h5py provides an interface to the high-level HDF5 API, while PyTables wraps many HDF5 details and provides more flexible data containers, indexed tables, query capabilities and other computation-related features. pandas in turn has a dict-like class called HDFStore, which uses PyTables to store pandas objects. So before using the HDF5 format, you must import the HDFStore class.

```python
import numpy as np
import pandas as pd
from pandas.io.pytables import HDFStore

# note that this needs the tables package; install it yourself if missing
frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['white', 'black1', 'red', 'blue'],
                     columns=['up', 'down', 'right', 'left'])
store = HDFStore('pandas_data_test\mydata.h5')
store['obj1'] = frame
frame1 = pd.DataFrame(np.random.rand(16).reshape(4, 4),
                      index=['white', 'black1', 'red', 'blue'],
                      columns=['up', 'down', 'right', 'left'])
store['obj2'] = frame1
print(store['obj1'])
print(store['obj2'])
```

```
        up  down  right  left
white    0     1      2     3
black1   4     5      6     7
red      8     9     10    11
blue    12    13     14    15
              up      down     right      left
white   0.251269  0.422823  0.619294  0.273534
black1  0.593960  0.353969  0.966026  0.104581
red     0.964577  0.625644  0.342923  0.638627
blue    0.246541  0.997952  0.414599  0.908750
Closing remaining open files:pandas_data_test\mydata.h5...done
```

Implementing object serialization

The pickle module implements a powerful algorithm for serializing (pickling) and deserializing data structures implemented in Python. Serialization is the process of transforming an object's hierarchy into a byte stream. It makes objects easy to transfer, store and reconstruct: the receiver can rebuild the object while keeping all of its original characteristics.

pandas makes object serialization (and deserialization) very convenient: all the tools are ready out of the box, with no need to import the cPickle module into the Python session; everything is done implicitly. The pandas serialization format is not entirely in ASCII.

```python
import pickle
import numpy as np
import pandas as pd

data = {'color': ['white', 'red'], 'value': [5, 7]}
pickled_data = pickle.dumps(data)
print(pickled_data)
nframe = pickle.loads(pickled_data)  # deserialize back into the original dict
print(nframe)

# Serialize with pandas
frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['up', 'down', 'left', 'right'])
frame.to_pickle('pandas_data_test/frame.pkl')  # works much like writing json data
print(pd.read_pickle('pandas_data_test/frame.pkl'))  # read it back
```

Output:

```
b'\x80\x03}q\x00(X\x05\x00\x00\x00colorq\x01]q\x02(X\x05\x00\x00\x00whiteq\x03X\x03\x00\x00\x00redq\x04eX\x05\x00\x00\x00valueq\x05]q\x06(K\x05K\x07eu.'
{'color': ['white', 'red'], 'value': [5, 7]}
        0   1   2   3
up      0   1   2   3
down    4   5   6   7
left    8   9  10  11
right  12  13  14  15
```
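To complete the round trip, `pd.read_pickle` restores the DataFrame from disk exactly as it was saved by `to_pickle`. A minimal sketch, writing to a temporary directory instead of the `pandas_data_test` folder used above:

```python
import os
import tempfile

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['up', 'down', 'left', 'right'])

path = os.path.join(tempfile.mkdtemp(), 'frame.pkl')
frame.to_pickle(path)             # serialize the frame to disk
restored = pd.read_pickle(path)   # deserialize it back
```

The restored object keeps the index, columns and dtypes of the original frame.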

Connecting to databases

In many applications, the data rarely comes from text files, because text files are not the most efficient way to store data. Data is often kept in SQL relational databases, and NoSQL databases have also become popular recently.

Loading data from a SQL database and converting it into a DataFrame is very simple; pandas provides several functions to simplify the process.

The pandas.io.sql module provides a database-independent, unified interface built on sqlalchemy. This interface simplifies connecting: whatever the database, there is only one set of commands. To connect you use the create_engine() function, which lets you configure the driver with the required user name, password, port and database instance. A database URL typically takes the form:

dialect+driver://username:password@host:port/database

The dialect is the identifying name of the database, for example sqlite, mysql, postgresql, oracle, or mssql. The drivername is the all-lowercase name of the DBAPI used to connect to the database. If it is not specified, a "default" DBAPI will be imported if available; this default is usually the most widely used driver for that backend.

```python
from sqlalchemy import create_engine

# PostgreSQL
# default
engine = create_engine('postgresql://scott:tiger@localhost/mydatabase')
# pg8000 driver
engine = create_engine('postgresql+pg8000://scott:tiger@localhost/mydatabase')
# psycopg2 driver
engine = create_engine('postgresql+psycopg2://scott:tiger@localhost/mydatabase')

# MySQL
# default
engine = create_engine('mysql://scott:tiger@localhost/foo')
# mysql-python; note the driver name after the plus sign
engine = create_engine('mysql+mysqldb://scott:tiger@localhost/foo')
# MySQL-connector-python
engine = create_engine('mysql+mysqlconnector://scott:tiger@localhost/foo')
# OurSQL
engine = create_engine('mysql+oursql://scott:tiger@localhost/foo')

# Oracle
engine = create_engine('oracle://scott:tiger@127.0.0.1:1521/sidname')
engine = create_engine('oracle+cx_oracle://scott:tiger@tnsname')

# Microsoft SQL Server
# pyodbc
engine = create_engine('mssql+pyodbc://scott:tiger@mydsn')
# pymssql
engine = create_engine('mssql+pymssql://scott:tiger@hostname:port/dbname')
```
Copy code 

SQLite:

Because SQLite connects to a local file, the URL format is slightly different. The "file" part of the URL is the file name of the database.

For a relative file path, three slashes are needed:

engine = create_engine('sqlite:///foo.db')

For an absolute file path, the absolute path follows the three slashes.

Unix/Mac (4 initial slashes in total):

engine = create_engine('sqlite:////absolute/path/to/foo.db')

Windows (backslashes escaped):

engine = create_engine('sqlite:///C:\\path\\to\\foo.db')

Windows alternative using a raw string:

engine = create_engine(r'sqlite:///C:\path\to\foo.db')
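These URL forms can be checked quickly with an in-memory database, which uses an empty path after the double slash. A small sketch, assuming sqlalchemy is installed; note that create_engine is lazy, so no file is created until a connection is actually opened:

```python
from sqlalchemy import create_engine, text

# In-memory SQLite database: nothing after sqlite://
engine = create_engine('sqlite://')

# Relative path (three slashes) and absolute Unix path (four slashes);
# neither file is created until the engine connects
rel_engine = create_engine('sqlite:///foo.db')
abs_engine = create_engine('sqlite:////tmp/foo.db')

# Open a connection and run a trivial query against the in-memory database
with engine.connect() as conn:
    value = conn.execute(text('SELECT 1')).scalar()
```

The in-memory form is convenient for tests, since the database disappears when the engine is disposed.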

Now let's learn to use Python's built-in SQLite database through sqlite3. SQLite3 implements a simple, lightweight SQL DBMS that can be embedded in any application written in Python. It is very practical: you can create an embedded database in a single file. If you want the features of a database without installing a real one, this tool is the best choice. It is also a good choice if you want to practice database operations before using a real database, or if you need to store data for a single program without worrying about a database interface.

```python
from sqlalchemy import create_engine
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(20).reshape(4, 5),
                     columns=['white', 'red', 'blue', 'black', 'green'])
# Connect to a SQLite3 database
engine = create_engine('sqlite:///pandas_data_test/foo.db')
# Write the DataFrame to a database table.
# to_sql(self, name, con, schema=None, if_exists='fail', index=True,
#        index_label=None, chunksize=None, dtype=None)
frame.to_sql('colors', engine)
```
Copy code 
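Reading the table back is symmetric: pd.read_sql runs a query and returns a DataFrame. For SQLite, pandas also accepts a plain DB-API connection, so the sketch below uses the standard-library sqlite3 module with an in-memory database instead of the foo.db file above:

```python
import sqlite3

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(20).reshape(4, 5),
                     columns=['white', 'red', 'blue', 'black', 'green'])

# For SQLite, a raw DB-API connection works; no sqlalchemy engine needed
con = sqlite3.connect(':memory:')
frame.to_sql('colors', con, index=False)  # write the frame as a table

result = pd.read_sql('SELECT * FROM colors', con)  # query it back
con.close()
```

With index=False the row index is not stored, so the query returns exactly the original four rows and five columns.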

Running results: