
Python crawler - get fund change information

2022-01-31 21:35:38 first quarter of the moon

This is day 3 of my participation in the November writing challenge. For event details, see: 2021 Final Writing Challenge

Lose humanity, lose a lot; lose animal nature, lose everything.

1 Preface

Previous articles covered how to get the fund list and the basic information of a fund. Today we continue from there and get the fund's change information. This involves both calling an API interface and fetching and parsing the page.

2 Capture change information

Looking at the fund's basic information page, we can see that the fund change information consists of the following 4 parts:

1- Header information on the fund page

Next, the approach to capturing the data. The first figure contains the fund's basic information, change information, and stage increases. Since the stage increases are shown again in figure 2, from this figure we only need the real-time rise and fall and the fund's net value from the previous day.

2.1 Acquisition of fund change information
# Fund change information. Let's start with a single simple address; other funds are fetched the same way,
# just replace the fund code in the access address.

Getting the change information has two parts. The first is getting the fund's latest change information in real time. You will notice that the net value estimate changes over time. By monitoring the browser's request log, I caught an API call like the following, which flashes by in an instant.

{
	'fundcode': '005585',
	'name': 'Galaxy Entertainment mix',
	'jzrq': '2021-11-16',
	'dwjz': '1.6718',
	'gsz': '1.6732',
	'gszzl': '0.08',
	'gztime': '2021-11-17 15:00'
}

The fund code and fund name are obvious from the returned JSON. But what do jzrq, dwjz, gsz, gszzl, and gztime mean? After studying them for a while, comparing them with what the page displays, and knowing dfcf's habit of naming fields by the first letters of Chinese pinyin, my guess is that they mean net value date, unit net value, estimated value, estimated growth rate, and estimate time. I'm a little pleased with myself for cracking that mystery.
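The guessed field meanings can be written down as an explicit mapping. The following is a minimal sketch, not the author's code: the English key names and the helpers `parse_jsonp` and `to_record` are hypothetical, and the sample string simply wraps the response shown above in the `jsonpgz(...)` callback that the later code strips off.

```python
import json
import re

# Hypothetical English names for the pinyin-abbreviated fields
# (the meanings are the guesses described in the text, not documented ones)
FIELD_NAMES = {
    "fundcode": "fund_code",
    "name": "fund_name",
    "jzrq": "net_value_date",
    "dwjz": "unit_net_value",
    "gsz": "estimated_value",
    "gszzl": "estimated_growth_pct",
    "gztime": "estimate_time",
}
NUMERIC_FIELDS = {"unit_net_value", "estimated_value", "estimated_growth_pct"}


def parse_jsonp(text):
    """Extract the JSON object from a callback wrapper such as jsonpgz({...});"""
    match = re.search(r"\((\{.*\})\)", text, re.S)
    if match is None:
        raise ValueError("no JSON object found in JSONP response")
    return json.loads(match.group(1))


def to_record(raw):
    """Rename the known fields and convert the numeric strings to float."""
    record = {}
    for key, value in raw.items():
        name = FIELD_NAMES.get(key, key)
        record[name] = float(value) if name in NUMERIC_FIELDS else value
    return record


# Sample built from the response shown above
sample = ('jsonpgz({"fundcode":"005585","name":"Galaxy Entertainment mix",'
          '"jzrq":"2021-11-16","dwjz":"1.6718","gsz":"1.6732",'
          '"gszzl":"0.08","gztime":"2021-11-17 15:00"});')
record = to_record(parse_jsonp(sample))
print(record["net_value_date"], record["unit_net_value"], record["estimated_growth_pct"])
```

As a quick sanity check on the guess, (1.6732 - 1.6718) / 1.6718 × 100 is about 0.08, which agrees with the `gszzl` value in the sample. That is consistent with reading `gszzl` as the estimated percentage change relative to the last unit net value.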

The second part is getting the fund's unit net value. Analysis shows that the data sits inside a <dl class="dataItem02"> HTML element, so we use bs4 to parse the returned page and pull that element out of the DOM tree.

To sum up, we call the API interface to get the fund's real-time change information, and we parse the returned HTML and its DOM tree to get the fund's unit net value. The following is the code for grabbing the information in the first part.

import json

import requests
from bs4 import BeautifulSoup

# Fund code used throughout this article
code = "005585"

# Capture real-time fund change information
# (the host part of the request address is not shown here, only the "<code>.js" suffix)
resp = requests.get("{}.js".format(code))
# Strip the JSONP wrapper so the data can be parsed as JSON
data = resp.text.replace("jsonpgz(", "").replace(");", "")
body = json.loads(data)
# Output the result data
print("{} {} estimated value {} estimated rise/fall {} estimate time {}".format(body["fundcode"], body["name"], body["gsz"], body["gszzl"], body["gztime"]))

# Request the fund page (again, only the "<code>.html" suffix of the address is shown)
response = requests.get("{}.html".format(code))
# Print the encoding detected from the original response
# print(response.apparent_encoding)
# Set the encoding of the response content to avoid garbled text on the console
response.encoding = "UTF-8"
resp_body = response.text
# Parse the page
soup = BeautifulSoup(resp_body, 'lxml')
# There is only one such element, so find() is enough: the dl tag with class=dataItem02
dl_con = soup.find("dl", class_="dataItem02")

# Get the update time of the fund's net value
value_date = dl_con.find("p").get_text()
# Keep only the date ("单位净值" is the page's label for "unit net value")
value_date = value_date.replace("单位净值", "").replace("(", "").replace(")", "").strip()
# The net value and the rise/fall percentage are in the two span tags under the dd tag
value_con = dl_con.find("dd", class_="dataNums")
data_list = value_con.find_all("span")
val_data = data_list[0].get_text()
per_data = data_list[1].get_text()
print("Fund net value date {} net value {} rise/fall percentage {}".format(value_date, val_data, per_data))


Finally, with the steps above, you can get the fund's change information.
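The date cleanup above removes a fixed label and the parentheses with chained `replace` calls, which depends on the exact label text. As an alternative sketch (not the author's code), a regular expression can pull the date out regardless of the surrounding label, and the percentage string can be converted to a number. The helper names `extract_date` and `parse_percent` and the sample strings are assumptions for illustration.

```python
import re
from datetime import datetime


def extract_date(label_text):
    """Return the first YYYY-MM-DD date found in the text, or None."""
    match = re.search(r"\d{4}-\d{2}-\d{2}", label_text)
    if match is None:
        return None
    return datetime.strptime(match.group(0), "%Y-%m-%d").date()


def parse_percent(text):
    """Convert a string such as '-0.12%' to a float; return None if it is not numeric."""
    cleaned = text.strip().rstrip("%")
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical strings of the kind get_text() might return
print(extract_date("单位净值 (2021-11-16)"))
print(parse_percent("-0.12%"))
print(parse_percent("--"))
```

Returning `None` for text that is not a number keeps a placeholder cell from raising an error later, when the values are stored.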

2.2 Capture of fund stage information

The fund's stage information is also captured with bs4 by parsing the page data. There are three figures here: the first shows the stage rise and fall, and the second and third show the quarterly and annual rise and fall. Because the data will eventually be stored in a structured form, the first figure can be stored row-wise, one row per day, which shows how the values change from day to day. Figures two and three need column-wise storage and are queried as a kind of statistical data.

The parsing differs between the two cases. For the first figure, the table header already exists as fields in the database, so we don't need to capture it. For figures two and three, we need to capture the table header for storage, because the statistical periods are also part of the data we store.

In addition, we want not only the fund's own figures but also those of the CSI 300 (HS300), which can later serve as a strength benchmark during screening, so the CSI 300 data needs to be captured too. This part is not difficult. The focus is on analyzing the captured data and on the storage approach that follows.
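To make the two storage shapes concrete, here is a minimal sketch under stated assumptions. The column names reuse the `stage_*` header names from this article's code; the record layout, the function names `stage_row` and `period_records`, and all sample values are invented for illustration and are not the author's storage code.

```python
# Column names for the stage table (the same names as the stage header list in this article)
STAGE_COLUMNS = ["stage_week", "stage_month", "stage_month3", "stage_month6",
                 "stage_year", "stage_year1", "stage_year2", "stage_year3"]


def stage_row(fund_code, record_date, values):
    """Row-style record: one row per fund per day, one fixed column per stage."""
    if len(values) != len(STAGE_COLUMNS):
        raise ValueError("expected one value per stage column")
    row = {"fund_code": fund_code, "record_date": record_date}
    row.update(zip(STAGE_COLUMNS, values))
    return row


def period_records(fund_code, kind, periods, values):
    """Column-style records: one (period, value) entry per table column,
    because the period headers change over time and cannot be fixed columns."""
    return [
        {"fund_code": fund_code, "kind": kind, "period": period, "value": value}
        for period, value in zip(periods, values)
    ]


# Made-up numbers, for illustration only
row = stage_row("005585", "2021-11-16", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
quarters = period_records("005585", "quarter", ["21-3", "21-2"], [1.5, -0.5])
print(row["stage_month3"], len(quarters), quarters[0]["period"])
```

The stage table has a fixed set of columns, so one row per day works. The quarterly and annual tables gain a new period over time, so storing the header value next to each number avoids changing the schema.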

2- Stage increase of the fund

3- Quarterly increase of the fund

4- Annual increase of the fund

Here I simply grab all the table elements on the page, loop over them and print the results, and then work out which index holds the data to capture. I'll post the code directly and explain from there:

from prettytable import PrettyTable


# Print a table
def print_table(head, body):
    tb = PrettyTable()  # Create the table object
    tb.field_names = head  # Define the header
    tb.add_row(body)  # Add the single data row
    print(tb)


# Query quarterly / annual data
def query_year_quarter(data_list, num):
    stage_list = data_list.find_all("tr")[0].find_all("th")
    head_list = []
    for nd in stage_list:
        val = nd.get_text().strip()
        # Page labels are Chinese: "季度" = quarter, "年度" / "年" = year
        val = val.replace("季度", "").replace("年度", "").replace("年", "-")
        if val:
            # print(nd.get_text())
            head_list.append(val)

    body_list = []
    stage_list = data_list.find_all("tr")[num].find_all("td")
    for nd in stage_list:
        val = nd.get_text()
        # Skip the row label cells ("阶段涨幅" = stage increase, "沪深300" = CSI 300)
        if "阶段涨幅" in val or "沪深300" in val:
            continue
        body_list.append(val.replace("%", ""))

    # Print the table
    print_table(head_list, body_list)


# Get the fund's stage information. Only part of the code is pasted here;
# it relies on the `soup` object built in the net value part above
def query_fund_basic(code="005585", hsFlag=False):
    # All table elements on the page
    body_list = soup.find_all("table")
    # Stage increase header
    stage_head_list = ["stage_week", "stage_month", "stage_month3", "stage_month6", "stage_year", "stage_year1", "stage_year2", "stage_year3"]
    stage_list = body_list[11].find_all("tr")
    # The 2nd row (index 1) is the fund; the 4th row (index 3) is hs300
    num = 1
    if hsFlag:
        num = 3
    tmp_list = []
    for nd in stage_list[num].find_all("td"):
        val = nd.get_text()
        # Skip the row label cells
        if "阶段涨幅" in val or "沪深300" in val:
            continue
        tmp_list.append(val.replace("%", ""))

    # Print the stage rise and fall table
    print("\t------ Stage rise and fall ------")
    print_table(stage_head_list, tmp_list)

    print("\t------ Quarterly rise and fall ------")
    query_year_quarter(body_list[12], num)
    print("\t------ Annual rise and fall ------")
    query_year_quarter(body_list[13], num)
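The header cleanup in `query_year_quarter` strips the quarter and year suffixes from each header cell, and it can be tried in isolation. The sketch below assumes the page headers look like `21年3季度` and `2020年度`; these shapes are an assumption about the live page rather than something shown in this article, and the helper name `normalize_period` is hypothetical.

```python
def normalize_period(label):
    """Turn a header label into a compact period key.

    Assumed label shapes: '21年3季度' (quarter) and '2020年度' (year).
    """
    val = label.strip()
    # Order matters: remove the two-character suffixes before replacing the bare '年'
    val = val.replace("季度", "").replace("年度", "").replace("年", "-")
    return val


for label in ["21年3季度", "2020年度", "  21年1季度 "]:
    print(normalize_period(label))
```

Because `年度` is removed before the bare `年` is replaced, annual labels reduce to the plain year while quarterly labels become a `year-quarter` key.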

3 Final results

Due to limited space, the full code is not shown in this article. I will maintain it and make it available on GitHub later.

Fund net value information results

Final output for the fund

Final output for HS300

copyright notice
author [first quarter of the moon]. Please include the original link when reprinting, thank you.
