A deep dive into the pandas commands for reading Excel, TXT, CSV and other files
2022-01-31 20:38:47 【bosaidongmomo】
The official pandas documentation for reading files
Before using pandas to read files, the essential reading is the official documentation; refer to the official docs for details.
File operations belong to pandas Input/Output, i.e. IO operations. The basic API is all on the official site; the core of this article walks you through some of the most common commands.
Reading txt files with pandas
To read a txt file, first determine whether it follows a basic format, i.e. whether it is delimited by \t, commas, or some other special separator. A typical txt file looks something like this.
Example txt file; the following file is separated by spaces:
1 2019-03-22 00:06:24.4463094 Chinese test
2 2019-03-22 00:06:32.4565680 You need to edit encoding
3 2019-03-22 00:06:32.6835965 ashshsh
4 2017-03-22 00:06:32.8041945 eggg
Either read_csv or read_table can read it:
import pandas as pd
df = pd.read_table("./test.txt")
print(df)
import pandas as pd
df = pd.read_csv("./test.txt")
print(df)
However, be careful: what gets read here is a DataFrame with 3 rows and 1 column; we did not get the 3 rows and 4 columns we wanted.
import pandas as pd
df = pd.read_csv("./test.txt")
print(type(df))
print(df.shape)
<class 'pandas.core.frame.DataFrame'>
(3, 1)
read_csv function
By default, read_csv loads delimited data from a file, URL, or file-like object into a new DataFrame, using a comma as the default separator.
The txt file above is not comma-separated, so the sep separator parameter has to be added when reading it:
df = pd.read_csv("./test.txt",sep=' ')
Parameter description; official source: github.com/pandas-dev/… The descriptions below cover the parameters together with key examples.
Parameter | Meaning |
---|---|
filepath_or_buffer | Can be a URL; supported URL schemes include http, ftp, s3 and file. Example of reading a local file: file://localhost/path/to/table.csv |
sep | str, default ','. Specifies the separator; if nothing is passed, the comma default is tried. Separators longer than one character and different from '\s+' are interpreted as regular expressions and force the use of the Python parser; note that regex separators tend to ignore quoted data. Regex example: '\r\t' |
delimiter | Alternative separator (if this parameter is specified, sep is no longer used). Generally not needed |
delim_whitespace | True or False, default False. Use whitespace as the separator, equivalent to sep='\s+'; if this parameter is used, delimiter does not take effect |
header | Row number to use as the column names (commented lines are ignored). If no column names are passed the default is header=0; if column names are passed, use header=None |
names | List of column names to use. If the file does not contain a header row, explicitly pass header=None. header can also be a list of integers, e.g. [0, 1, 3]; intervening rows that are not specified are skipped (e.g. row 2 in this example) |
index_col (Case 1) | Default None. Column(s) to use as the row labels of the DataFrame; if a sequence is given, a MultiIndex is used. If the file has a delimiter at the end of every line, consider index_col=False so pandas does not use the first column as the row labels |
usecols | Default None. A sequence of column positions or column names, e.g. [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter speeds up loading and reduces memory consumption |
squeeze | Default False. If True and the parsed data contains only one column, a Series is returned |
prefix | Prefix added to automatically generated column numbers, e.g. 'X' gives X0, X1, ...; only effective when header=None or header is not set |
mangle_dupe_cols | Default True. Duplicate columns are renamed 'X.0'…'X.N' instead of 'X'…'X'. Passing False will cause data to be overwritten when duplicate column names exist |
dtype | Example: {'a': np.float64, 'b': np.int32}. Specifies the data type for each column; a and b are column names |
engine | Parser engine to use, either 'c' or 'python'. The C engine is faster, while the Python engine has more features |
converters (Case 2) | Handler functions for specific columns; the columns can be specified either by position or by name |
true_values / false_values | No practical use case found yet; noted here to be filled in later |
skipinitialspace | Skip spaces after the separator. Default False |
skiprows | Default None. Number of lines to skip at the start of the file, or a list of line numbers to skip (starting from 0) |
skipfooter | Number of lines to skip at the bottom of the file (not supported by the C engine) |
nrows | Number of rows to read from the file (counting from the header) |
na_values | Strings to recognize as missing values. By default '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null' are all treated as NaN |
keep_default_na | If na_values is specified and keep_default_na=False, the default NaN values are overridden; otherwise the new values are added to the defaults |
na_filter | Whether to detect missing values (empty strings or NA values). For large files that contain no N/A values, na_filter=False can improve read speed |
verbose | Whether to print parser output such as "number of NA values in non-numeric columns" |
skip_blank_lines | If True, skip blank lines; otherwise record them as NaN |
parse_dates | Accepts the following: 1. boolean: True -> parse the index; 2. list of ints or names, e.g. [1, 2, 3] -> parse columns 1, 2, 3 each as a separate date column; 3. list of lists, e.g. [[1, 3]] -> combine columns 1 and 3 and parse them as a single date column; 4. dict, e.g. {'foo': [1, 3]} -> combine columns 1 and 3 and name the resulting column 'foo' |
infer_datetime_format | If True and parse_dates is enabled, pandas tries to infer the datetime format and, if it can, switches to a faster parsing method. In some cases this is 5~10 times faster |
keep_date_col | If multiple columns are combined to parse dates, keep the original columns. Default False |
date_parser | Function used to parse dates; defaults to dateutil.parser.parser. pandas tries to call it in three different ways, moving on to the next if it fails: 1. with one or more arrays (as specified by parse_dates) as arguments; 2. with the string values of the columns specified by parse_dates concatenated into a single column as the argument; 3. once per row, with one or more strings (from the columns specified by parse_dates) as arguments |
dayfirst | DD/MM format dates |
iterator | Return a TextFileReader object so the file can be processed chunk by chunk |
chunksize | Size of each chunk |
compression | Decompress on-disk data on the fly. With 'infer', gzip, bz2, zip or xz is used for file names ending in '.gz', '.bz2', '.zip' or '.xz', otherwise no decompression. With zip, the ZIP archive must contain only one data file. Set to None for no decompression. (Version 0.18.1 added support for zip and xz) |
thousands | Thousands separator, default ',' |
decimal | Decimal point character, default '.' |
lineterminator | Line terminator, only used with the C parser |
quotechar | The character used to denote the start and end of a quoted item; separators inside the quotes are ignored |
quoting | Controls csv quoting behaviour. One of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3) |
doublequote | When quotechar is specified and quoting is not QUOTE_NONE, indicates whether two consecutive quotechar occurrences inside a field should be treated as a single quotechar |
escapechar | Character used to escape the delimiter when quoting is QUOTE_NONE |
comment | Indicates that the remainder of the line should not be parsed. If the character appears at the start of a line, the whole line is ignored. The parameter must be a single character. Like blank lines (when skip_blank_lines=True), fully commented lines are ignored by header and skiprows. For example, with comment='#', parsing '#empty\na,b,c\n1,2,3' with header=0 returns 'a,b,c' as the header |
encoding | Character set to use, usually specified as 'utf-8' |
dialect | CSV dialect; if provided, it overrides related parameters such as delimiter and quotechar. See the csv.Dialect documentation for details |
error_bad_lines | Rows with too many columns raise an error by default and no DataFrame is returned. If set to False, these "bad lines" are dropped (only valid with the C parser) |
warn_bad_lines | If error_bad_lines=False and warn_bad_lines=True, every "bad line" is reported (only valid with the C parser) |
low_memory | Internally process the file in chunks, which lowers memory use while parsing but may result in mixed type inference. To avoid mixed types either set it to False or specify the types with the dtype parameter. Note that the whole file is still read into a single DataFrame; use chunksize or iterator to get the data back in chunks (only valid with the C parser) |
delim_whitespace | New in version 0.18.1: supported by the Python parser |
memory_map | If a file path is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead |
float_precision | Specifies which converter the C engine should use for floating-point values |
Part of the table above references the blog www.cnblogs.com/datablog/p/…; thanks to that blogger for the translation.
Case 1
Using index_col
First prepare a txt file. The biggest problem with this file is that every line ends with an extra ','. According to the documentation, a trailing delimiter at the end of each line should cause problems, but in actual testing I found that the names parameter also has to be supplied for it to take effect.
goof,1,2,3,ddd,
u,1,3,4,asd,
as,df,12,33,
Write the following code
df = pd.read_csv("./demo.txt",header=None,names=['a','b','c','d','e'])
print(df)
df = pd.read_csv("./demo.txt",header=None,index_col=False,names=['a','b','c','d','e'])
print(df)
In practice this does not seem especially significant; perhaps the documentation simply doesn't explain what it does very clearly. Let's move on to a more common use of index_col.
When reading a file, if index_col is not set, an integer index starting from 0 is used by default. After manipulating some rows or columns of the table and saving it back to a file, you will find that an extra column counting from 0 has appeared. If you set index_col to a column index, this problem goes away.
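Here is a minimal sketch of that behaviour; out.csv is just a throwaway file name used for illustration, not a file from the article:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_csv('out.csv')                         # the integer index is written as the first column by default
back = pd.read_csv('out.csv')
print(back.columns.tolist())                 # ['Unnamed: 0', 'a', 'b'] : the extra column appears
back = pd.read_csv('out.csv', index_col=0)
print(back.columns.tolist())                 # ['a', 'b'] : the first column is used as the row index instead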
Case 2
converters sets a handler function for specific columns; the columns can be specified either by position ("serial number") or by name.
import pandas as pd
def fun(x):
return str(x)+"-haha"
df = pd.read_csv("./test.txt",sep=' ',header=None,index_col=0,converters={3:fun})
print(type(df))
print(df.shape)
print(df)
Common problems with read_csv
- In some IDEs, using pandas' read_csv to import a data file raises an error if the file path or file name contains Chinese characters.
Solution:
import pandas as pd
#df=pd.read_csv('F:/test folder/test data.txt')
f=open('F:/test folder/test data.txt')
df=pd.read_csv(f)
- Excluding rows: use the skiprows parameter. Note the difference: skipping the first 3 rows is skiprows=3, while skipping only the 3rd row is skiprows=[3].
- Irregular separators: use a regular expression to read the file (see the sketch after this list). If the separator in the file is a single space, sep=" " is enough. When the separator is not a single space (perhaps one space, perhaps several), sep=" " may produce very strange data, because the extra spaces are treated as data too. In that case use data = pd.read_csv("data.txt",sep="\s+").
- If an encoding error appears when reading a file containing Chinese, set the encoding parameter.
- Adding row and column indexes: use names to add a column index and index_col to add a row index.
read_csv has a large number of parameters, but most of them are unnecessary because most files you download are in a standard format.
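To tie a few of these options together, here is a small sketch kept self-contained with io.StringIO; the two junk lines and the whitespace-separated fields are made up for the example:
import io
import pandas as pd

raw = "junk line 1\njunk line 2\n1  2019-03-22   cat\n2 2019-03-23  dog\n"
df = pd.read_csv(io.StringIO(raw),
                 sep=r"\s+",                      # any run of whitespace acts as one separator
                 skiprows=2,                      # drop the two junk lines at the top
                 header=None,
                 names=['id', 'date', 'animal'])  # supply the column index ourselves
print(df)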
read_table function
The basic usage is the same as read_csv; the difference lies in the default separator.
csv means comma-separated values, so read_csv only splits data correctly on ','; read_table uses '\t' (i.e. tab) to split the data by default.
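A quick sketch of that equivalence, using an in-memory tab-separated string instead of a real file:
import io
import pandas as pd

tsv = "a\tb\tc\n1\t2\t3\n4\t5\t6\n"
df1 = pd.read_table(io.StringIO(tsv))           # read_table splits on '\t' by default
df2 = pd.read_csv(io.StringIO(tsv), sep='\t')   # read_csv needs sep='\t' to do the same
print(df1.equals(df2))                          # True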
read_fwf function
Reads a file whose columns have fixed widths, for example:
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3
read_fwf has two additional parameters that can be set:
colspecs:
A list of tuples giving the column extents; the intervals are half-open, [from, to). By default the extents are inferred from the first 100 rows of data.
Example :
import pandas as pd
colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
df = pd.read_fwf('demo.txt', colspecs=colspecs, header=None, index_col=0)
widths:
Simply pass a list of column widths instead of colspecs:
widths = [6, 14, 13, 10]
df = pd.read_fwf('demo.txt', widths=widths, header=None)
read_fwf is not used very often; to learn more, refer to pandas.pydata.org/pandas-docs…
read_msgpack function
A serializable data format supported by pandas. It is a lightweight, portable binary format, similar to binary JSON, that uses space efficiently and offers good performance both when writing (serializing) and when reading (deserializing).
read_clipboard function
Reads data from the clipboard; it can be seen as the clipboard version of read_table. It is handy for turning a table on a web page into a DataFrame.
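A minimal sketch of the round trip; it needs a desktop environment with clipboard access, so treat it as illustrative only:
import pandas as pd

df = pd.DataFrame({'name': ['cat', 'dog'], 'count': [3, 5]})
df.to_clipboard()            # put the table on the clipboard, tab-separated by default
copy = pd.read_clipboard()   # read it back, read_table style
print(copy)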
The following BUG appeared here:
module 'pandas' has no attribute 'compat'
Updating pandas made it work normally again.
There is one more pitfall: when reading the clipboard, if you copy Chinese text, the data often cannot be read.
Solution:
- Open site-packages\pandas\io\clipboard.py (you will need to locate the file yourself)
- After the line text = clipboard_get(), add: text = text.decode('UTF-8')
- Save the file, then it can be used
read_excel function
As always, the official documentation comes first: pandas.pydata.org/pandas-docs…
Parameter | Meaning |
---|---|
io | File-like object, pandas ExcelFile, or xlrd workbook. The string could be a URL; supported schemes include http, ftp, s3 and file. For example, a local file can be written as file://localhost/path/to/workbook.xlsx |
sheet_name | Default 0. To return multiple sheets use sheet_name=[0, 1]; sheet_name=None returns all sheets. Note: an int/string returns a DataFrame, while None and a list return a dict of DataFrames; sheet names are given as strings, sheet positions as integers |
header | Row to use for the column names, default 0 (the first row); the data sits below the header row. If the data contains no column names, set header=None |
names | Column names to use; pass in a list |
index_col | Column(s) to use as the index; a string such as u"strings" also works. If a list is passed, those columns are combined into a MultiIndex |
squeeze | If the parsed data contains only one column, return a Series |
dtype | Data type for the data or columns; see read_csv |
engine | If io is not a buffer or a path, this must be set to identify io. Acceptable values are None or 'xlrd' |
converters | See read_csv |
other parameters | Basically the same as read_csv |
A common error when pandas reads an Excel file:
ImportError: No module named 'xlrd'
pandas needs the separate xlrd module to read Excel files; pip install xlrd fixes it.
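A short sketch of the most common parameters; demo.xlsx and its sheets are placeholders, not files from the article:
import pandas as pd

# sheet_name selects a sheet by position or name; header and index_col behave like read_csv
df = pd.read_excel('demo.xlsx', sheet_name=0, header=0, index_col=0)
# sheet_name=None returns a dict of DataFrames, one entry per sheet
all_sheets = pd.read_excel('demo.xlsx', sheet_name=None)
print(df.head())
print(list(all_sheets))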
read_json function
Parameter | Meaning |
---|---|
path_or_buf | A valid JSON file; default None. The string can be a URL, e.g. file://localhost/path/to/table.json |
orient (Case 1) | Expected JSON string format. orient can take the following values: 1. 'split': dict like {index -> index, columns -> columns, data -> values}; 2. 'records': list like [{column -> value}, ..., {column -> value}]; 3. 'index': dict like {index -> {column -> value}}; 4. 'columns': dict like {column -> {index -> value}}; 5. 'values': just the values array |
typ | Type to return ('series' or 'frame'), default 'frame' |
dtype | Data type for the data or columns; see read_csv |
convert_axes | boolean; try to convert the axes to the proper dtypes, default True |
convert_dates | List of columns to parse as dates; if True, try to parse date-like columns. Default True. A column is considered date-like if it ends with '_at' or '_time', begins with 'timestamp', or is 'modified' or 'date' |
keep_default_dates | boolean, default True. If parsing dates, also parse the default date-like columns |
numpy | Decode directly to numpy arrays. Default False; only numeric data is supported, though labels may be non-numeric. Also note that if numpy=True, the JSON ordering MUST be consistent |
precise_float | boolean, default False. Enable use of the higher-precision (strtod) function when decoding strings to doubles. The default (False) uses the fast but less precise built-in functionality |
date_unit | string, the timestamp unit to detect when converting dates. Default None. By default the timestamp precision is detected; to force a specific precision pass one of 's', 'ms', 'us' or 'ns' for seconds, milliseconds, microseconds or nanoseconds respectively |
encoding | JSON encoding |
lines | Read the file as one JSON object per line |
If the JSON cannot be parsed, the parser raises one of ValueError, TypeError or AssertionError.
Case 1
- orient='split'
import pandas as pd
s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,5],[6,9]]}'
df = pd.read_json(s,orient='split')
- orient='records': a list whose members are dictionaries
import pandas as pd
s = '[{"a":1,"b":2},{"a":3,"b":4}]'
df = pd.read_json(s,orient='records')
- orient='index': keyed by the index, with a dictionary of column fields as the value, e.g. s = '{"0":{"a":1,"b":2},"1":{"a":2,"b":4}}'
- orient='columns' or 'values' can be worked out by analogy (see the sketch below)
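Following the same pattern as the examples above, a small sketch of the two remaining orientations:
import pandas as pd

s_index = '{"0":{"a":1,"b":2},"1":{"a":2,"b":4}}'
df_index = pd.read_json(s_index, orient='index')        # outer keys become the row index
s_columns = '{"a":{"0":1,"1":2},"b":{"0":2,"1":4}}'
df_columns = pd.read_json(s_columns, orient='columns')  # outer keys become the columns
print(df_index)
print(df_columns)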
A partial Chinese translation is available at github.com/apachecn/pa…
Common read_json() BUGs
Reading a JSON file raises ValueError: Trailing data
This is a JSON format problem. The original format was:
{"a":1,"b":1},{"a":2,"b":2}
Adjusted to:
[{"a":1,"b":1},{"a":2,"b":2}]
Or use the lines parameter and adjust the JSON to one record per line:
{"a":1,"b":1}
{"a":2,"b":2}
If the JSON file contains Chinese, it is advisable to set the encoding parameter to 'utf-8', otherwise an error is raised.
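A minimal sketch of the lines parameter, keeping the line-delimited JSON in memory; with a real file you would pass its path instead of the StringIO object:
import pandas as pd
from io import StringIO

ndjson = '{"a":1,"b":1}\n{"a":2,"b":2}\n'
df = pd.read_json(StringIO(ndjson), lines=True)  # one JSON object per line
print(df)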
read_html function
Parameter | Meaning |
---|---|
io | Accepts a URL, a file, or a string. https URLs are sometimes rejected; try removing the s and crawling the http version |
match | Regular expression; returns the tables that match it |
flavor | The parser to use; defaults to 'lxml' |
header | Row(s) where the column headers sit; a list for a multi-level index |
index_col | Column(s) corresponding to the row labels; a list for a multi-level index |
skiprows | Skip the listed rows (sequence) or skip the first n rows (integer) |
attrs | Attributes, e.g. attrs = {'id': 'table'} |
parse_dates | Parse dates |
How to use it: right-click the page and inspect it; if the data sits in a <table> element, read_html can be used.
For example: data.stcn.com/2019/0304/1…
<table class="..." id="...">
<thead>
<tr>
<th>...</th>
</tr>
</thead>
<tbody>
<tr>
<td>...</td>
</tr>
<tr>...</tr>
</tbody>
</table>
<table> : defines a table
<thead> : defines the table header section
<tbody> : defines the table body
<tr> : defines a table row
<th> : defines a header cell
<td> : defines a table cell
Common BUG
The following error occurs: ImportError: html5lib not found, please install it
Install html5lib, or use the flavor parameter:
import pandas as pd
df = pd.read_html("http://data.stcn.com/2019/0304/14899644.shtml",flavor ='lxml')
For more details, refer to the source code and pandas.pydata.org/pandas-docs…
Table of pandas read and write functions
Read function | Write function | Explanation |
---|---|---|
read_clipboard | to_clipboard | Read text from the clipboard and pass it to read_table |
read_csv | to_csv | Read a CSV (comma-separated) file into a DataFrame |
read_excel | to_excel | Excel spreadsheets |
read_sql | to_sql | |
read_pickle | to_pickle | |
read_json | to_json | |
read_msgpack | to_msgpack | |
read_stata | to_stata | |
read_gbq | to_gbq | Load data from Google BigQuery |
read_hdf | to_hdf | |
read_html | to_html | |
read_parquet | to_parquet | |
read_feather | to_feather |
import pandas as pd

csvframe = pd.read_csv('pandas_data_test\myCSV_01.csv')
print(csvframe, "\n-----*-----")
csvframe1 = pd.read_table('pandas_data_test\myCSV_01.csv',sep=',')
print(csvframe1, "\n-----*-----")
csvframe2 = pd.read_csv('pandas_data_test\myCSV_02.csv',header=None) # header=None: do not use the first data row as the header; a default header is added
print(csvframe2, "\n-----*-----")
csvframe20 = pd.read_csv('pandas_data_test\myCSV_02.csv',names=['white','red','blue','green','animal']) # specify the header explicitly
print(csvframe20, "\n-----*-----")

csvframe30 = pd.read_csv('pandas_data_test\myCSV_03.csv')
print(csvframe30, "\n-----*-----")
csvframe31 = pd.read_csv('pandas_data_test\myCSV_03.csv',index_col=['color','status']) # hierarchical index
print(csvframe31, "\n-----*-----")

txtframe4 = pd.read_table('pandas_data_test\ch05_04.txt',sep='\s+') # split on a regular expression
print(txtframe4, "\n-----*-----")
txtframe5 = pd.read_table('pandas_data_test\ch05_05.txt',sep=r'\D+',header=None,engine='python')
print(txtframe5, "\n-----*-----")
# With the skiprows option you can exclude unwanted lines: put the line numbers to exclude into a list and assign it to this option.
txtframe6 = pd.read_table('pandas_data_test\ch05_06.txt',sep=',',skiprows=[0,1,3,6])
print(txtframe6)
Out[1]:
white red blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse
5 4 4 2 1 mou
-----*-----
white red blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse
5 4 4 2 1 mou
-----*-----
0 1 2 3 4
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse
-----*-----
white red blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse
-----*-----
color status iteml item2 item3
0 black up 3 4 6
1 black down 2 6 7
2 white up 5 5 5
3 white down 3 3 2
4 white left 1 2 1
5 red up 2 2 2
6 red down 1 1 4
-----*-----
iteml item2 item3
color status
black up 3 4 6
down 2 6 7
white up 5 5 5
down 3 3 2
left 1 2 1
red up 2 2 2
down 1 1 4
-----*-----
white red blue green
0 1 5 2 3
1 2 7 8 5
2 3 3 6 7
-----*-----
0 1 2
0 0 123 122
1 1 124 321
2 2 125 333
-----*-----
white red blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse
Reading part of the data from a TXT file
print(csvframe2, "\n-----*-----")
# nrows=2 limits the number of rows to read; skiprows=[2] skips the corresponding line
csvfram20 = pd.read_csv('pandas_data_test\myCSV_02.csv',skiprows=[2],nrows=2,header=None)
print(csvfram20)
Out[2]:
0 1 2 3 4
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse
-----*-----
0 1 2 3 4
0 1 5 2 3 cat
1 2 7 8 5 dog
Another interesting and common operation is to split the text to be parsed into chunks and then loop over them, applying some operation to each chunk one at a time. For example, take a column of numbers, sum it chunk by chunk, and insert each partial sum into a Series object. This little example is easy to understand and has no practical value in itself, but once you understand how it works you can apply the same mechanism to more complex situations.
csvframe1 = pd.read_table('pandas_data_test\myCSV_01.csv',sep=',')
print(csvframe1, "\n-----*-----")
out = pd.Series(dtype='int64')
pieces = pd.read_csv('pandas_data_test\myCSV_01.csv',chunksize=4) # the chunksize parameter sets the number of rows in each chunk
i = 0
for piece in pieces:
    print(piece['white'])
    out.at[i] = piece['white'].sum()
    i += 1
print(out, "\n-----*-----")
Out[3]:
white red blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse
5 4 4 2 1 mou
-----*-----
0 1
1 2
2 3
3 2
Name: white, dtype: int64
4 4
5 4
Name: white, dtype: int64
0 8
1 8
dtype: int64
Writing data to a CSV file
print(csvframe1)
print(csvframe1.to_csv('pandas_data_test\ch05_07.csv'))
# With the index and header options set to False, the default writing of the index and header can be switched off
print(csvframe1.to_csv('pandas_data_test\ch05_07b.csv',index=False,header=False))
print(csvframe30.to_csv('pandas_data_test\ch05_08.csv'))
# The na_rep option of to_csv() replaces empty fields with the value you want; common values are NULL, 0 and NaN
print(csvframe30.to_csv('pandas_data_test\ch05_09.csv',na_rep="empty"))
Out[4]:
white red blue green animal
0 1 5 2 3 cat
1 2 7 8 5 dog
2 3 3 6 7 horse
3 2 2 8 3 duck
4 4 4 2 1 mouse
5 4 4 2 1 mou
None
None
None
None
Open the folder and you can see the corresponding files.
Reading and writing HTML files
import numpy as np  # numpy is needed for the arrays below

frame = pd.DataFrame(np.arange(4).reshape(2,2))
print(frame.to_html())
frame2 = pd.DataFrame(np.random.random((4,4)),index=['white','black','red','blue1'],columns=['up','down','right','left'])
s = ['<HTML>']
s.append('<HEAD><TITLE>My DataFrame</TITLE></HEAD>')
s.append('<BODY>')
s.append(frame.to_html())
s.append('</BODY></HTML>')
html = ''.join(s)
html_file = open('pandas_data_test\myFrame.html','w')
html_file.write(html)
html_file.close()
web_frames = pd.read_html('pandas_data_test\myFrame.html')
print(web_frames[0])
ranking = pd.read_html('http://www.meccanismocomplesso.org/en/meccanismo-complesso-sito-2/classifica-punteggio/')
print(ranking[0][1:10]) # print rows 1 to 9 of the first table on the page
Out[5]:
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>1</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>3</td>
</tr>
</tbody>
</table>
Unnamed: 0 0 1
0 0 0 1
1 1 2 3
# Nome Exp Livelli
1 2 admin 9029 NaN
2 3 BrunoOrsini 2124 NaN
3 4 Berserker 700 NaN
4 5 Dnocioni 543 NaN
5 6 albertosallusti 409 NaN
6 7 Jon 233 NaN
7 8 Mr.Y 180 NaN
8 9 michele sisinni 157 NaN
9 10 Selina 136 NaN
Reading data from XML
None of the I/O API functions in pandas are designed specifically for the XML (Extensible Markup Language) format. Even so, this format is actually very important, because a lot of structured data is stored as XML. The lack of a dedicated pandas function is not a problem, because Python has many libraries (besides pandas) for reading and writing XML data. One of them is lxml, which stands out from the crowd for its excellent performance on large files. This section describes how to use it to process XML files, how to integrate it with pandas, and finally how to get the data you need out of an XML document and convert it into a DataFrame object.
The XML source file (books.xml) is shown as an image in the original post.
from lxml import objectify

xml = objectify.parse(r'pandas_data_test\books.xml')
root = xml.getroot() # get the root node
print(root.Book.Author)
mes1 = root.Book.getchildren()
print("root.Book.getchildren() gets the child tag contents:\n", mes1)
mes2 = root.Book[1].getchildren() # take the second Book tag
print([child.tag for child in mes2]) # get the child tag names
print([child.text for child in mes2]) # get the child tag contents
Out[6]:
272103_l_EnRoss, Mark
root.Book.getchildren() gets the child tag contents:
[' 272103_l_EnRoss, Mark', 'XML Cookbook', 'Computer', 23.56, '2014-22-0l']
['Author', 'Title', 'Genre', 'Price', 'PublishDate']
[' 272l03_l_EnBracket, Barbara', 'XML for Dummies', 'Computer', '35.95', '20l4-l2-l6']
Reading and writing Microsoft Excel files
read_excel() and to_excel() can handle both .xls and .xlsx files.
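A small sketch of the write-then-read round trip; it assumes an Excel writer engine such as openpyxl is installed, and report.xlsx is just a placeholder name:
import pandas as pd

frame = pd.DataFrame({'white': [1, 2], 'red': [5, 7]})
frame.to_excel('report.xlsx', sheet_name='colors', index=False)  # write a single sheet
back = pd.read_excel('report.xlsx', sheet_name='colors')
print(back)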
Reading and writing JSON data
read_json() and to_json()
HDF5 Format
So far we have learned how to read and write text formats. To analyze large amounts of data, a binary format is preferable, and Python has several tools for handling binary data. The HDF5 library has had some success in this area. HDF stands for hierarchical data format. The HDF5 library handles reading and writing HDF5 files, whose data structure consists of nodes and which can store very large datasets. The library is developed entirely in C and provides interfaces for Python, MATLAB and Java. Its rapid spread is due both to its wide adoption by developers and to its efficiency, in particular when storing large amounts of data. Compared with formats that handle binary data more simply, HDF5 supports real-time compression, so it can exploit repeating patterns in the data structure to compress files. At present there are two ways to manipulate HDF5 data in Python: PyTables and h5py. There are several differences between them, and which one to choose depends largely on your specific needs. h5py provides an interface to the HDF5 high-level API, whereas PyTables wraps many HDF5 details and provides more flexible data containers, indexed tables, querying capabilities and other computation-related features. pandas also has a dict-like class called HDFStore that uses PyTables to store pandas objects. Before using the HDF5 format, you must import the HDFStore class.
from pandas.io.pytables import HDFStore
# note: this requires the tables package; install it if it is missing
frame = pd.DataFrame(np.arange(16).reshape(4,4),index=['white','black1','red','blue'],columns=['up','down','right','left'])
store = HDFStore('pandas_data_test\mydata.h5')
store['obj1'] = frame
frame1 = pd.DataFrame(np.random.rand(16).reshape(4,4),index=['white','black1','red','blue'],columns=['up','down','right','left'])
store['obj2'] = frame1
print(store['obj1'])
print(store['obj2'])
Out[7]:
up down right left
white 0 1 2 3
black1 4 5 6 7
red 8 9 10 11
blue 12 13 14 15
up down right left
white 0.251269 0.422823 0.619294 0.273534
black1 0.593960 0.353969 0.966026 0.104581
red 0.964577 0.625644 0.342923 0.638627
blue 0.246541 0.997952 0.414599 0.908750
Closing remaining open files:pandas_data_test\mydata.h5...done
Implementing object serialization
The pickle module implements a powerful algorithm that can serialize (pickling) and deserialize data structures implemented in Python. Serialization is the process of converting the hierarchy of an object into a byte stream; it makes objects easy to transfer, store and reconstruct, because the receiver can rebuild the object while keeping all of its original characteristics.
Doing object serialization (and deserialization) with pandas is very convenient: all the tools are ready out of the box, there is no need to import the cPickle module into the Python session, and everything happens implicitly. pandas' serialization format is not entirely ASCII.
import pickle
data = {'color': ['white','red'], 'value': [5, 7]}
pickled_data = pickle.dumps(data)
print(pickled_data)
nframe = pickle.loads(pickled_data)
print(nframe)

# serialize with pandas
frame = pd.DataFrame(np.arange(16).reshape(4,4), index = ['up','down','left','right'])
frame.to_pickle(r'pandas_data_test\frame.pkl') # similar in spirit to writing json data
print(pd.read_pickle(r'pandas_data_test\frame.pkl')) # deserialize, i.e. read the data back
Out[8]:
b'\x80\x03}q\x00(X\x05\x00\x00\x00colorq\x01]q\x02(X\x05\x00\x00\x00whiteq\x03X\x03\x00\x00\x00redq\x04eX\x05\x00\x00\x00valueq\x05]q\x06(K\x05K\x07eu.'
{'color': ['white', 'red'], 'value': [5, 7]}
0 1 2 3
up 0 1 2 3
down 4 5 6 7
left 8 9 10 11
right 12 13 14 15
Connecting to databases
In many applications, only a small part of the data comes from text files, because text files are not the most efficient way to store data. Data is often stored in SQL relational databases; as a complement, NoSQL databases have recently become popular as well.
Loading data from a SQL database and converting it into a DataFrame is very simple, and pandas provides several functions to simplify the process.
The pandas.io.sql module provides a database-independent unified interface built on sqlalchemy. This interface simplifies the connection mode: whatever the database, there is a single set of commands. To connect you use the create_engine() function, with which you can configure the driver with the required user name, password, port and database instance. A database URL typically takes the form:
dialect+driver://username:password@host:port/database
dialect is the identifying name of the database backend, e.g. sqlite, mysql, postgresql, oracle or mssql. drivername is the all-lowercase name of the DBAPI used to connect to the database. If it is not specified, the "default" DBAPI is used (if available); this default is usually the most widely used driver for that backend.
from sqlalchemy import create_engine

# PostgreSQL
# default
engine = create_engine('postgresql://scott:tiger@localhost/mydatabase')
# pg8000 driver
engine = create_engine('postgresql+pg8000://scott:tiger@localhost/mydatabase')
# psycopg2 driver
engine = create_engine('postgresql+psycopg2://scott:tiger@localhost/mydatabase')
# MySQL
# default
engine = create_engine('mysql://scott:tiger@localhost/foo')

# mysql-python (note how the driver name follows the dialect)
engine = create_engine('mysql+mysqldb://scott:tiger@localhost/foo')

# MySQL-connector-python
engine = create_engine('mysql+mysqlconnector://scott:tiger@localhost/foo')
# OurSQL
engine = create_engine('mysql+oursql://scott:tiger@localhost/foo')
# Oracle
engine = create_engine('oracle://scott:tiger@127.0.0.1:1521/sidname')

engine = create_engine('oracle+cx_oracle://scott:tiger@tnsname')
# Microsoft SQL
# pyodbc
engine = create_engine('mssql+pyodbc://scott:tiger@mydsn')

# pymssql
engine = create_engine('mssql+pymssql://scott:tiger@hostname:port/dbname')
SQLite:
Because SQLite connects to a local file, the URL format is slightly different: the "file" part of the URL is the file name of the database.
For a relative file path, three slashes are used: engine = create_engine('sqlite:///foo.db')
For an absolute file path, the absolute path follows the three slashes:
Unix/Mac - four initial slashes in total
engine = create_engine('sqlite:////absolute/path/to/foo.db')
Windows
engine = create_engine('sqlite:///C:\\path\\to\\foo.db')
Windows alternative using a raw string
engine = create_engine(r'sqlite:///C:\path\to\foo.db')
Reading and writing data with SQLite3
Let's learn to use Python's built-in SQLite database, sqlite3. SQLite3 implements a simple, lightweight SQL DBMS, so it can be built into any application implemented in Python. It is very practical: you can create an embedded database in a single file. It is the best choice if you want the functionality of a database without installing a real one, and a good choice if you want to practise database operations before moving to a real database, or to store data for a single program without worrying about interfaces.
from sqlalchemy import create_engine
frame = pd.DataFrame(np.arange(20).reshape(4,5),columns=['white','red','blue','black','green'])
# connect to the SQLite3 database
engine = create_engine('sqlite:///pandas_data_test/foo.db')
# convert the DataFrame into a database table
# to_sql(self, name, con, schema=None, if_exists='fail', index=True,
#        index_label=None, chunksize=None, dtype=None)
frame.to_sql('colors',engine)

# read it back
print(pd.read_sql('colors',engine))
The running results are shown as a screenshot in the original post.
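As a follow-up sketch, if_exists controls what to_sql does when the table already exists, and read_sql_query filters on the database side; this reuses the foo.db file and colors table from the example above, assuming the pandas_data_test folder exists:
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///pandas_data_test/foo.db')
frame = pd.DataFrame(np.arange(20).reshape(4, 5), columns=['white', 'red', 'blue', 'black', 'green'])
frame.to_sql('colors', engine, if_exists='replace')   # overwrite the table instead of raising an error
subset = pd.read_sql_query('SELECT white, red FROM colors WHERE white > 5', engine)
print(subset)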
Copyright notice
Author: bosaidongmomo. Please include the original link when reprinting; thank you.
https://en.pythonmana.com/2022/01/202201312038445924.html