
[Python data collection] stock information collection

2022-01-30 09:49:55 liedmirror


Preface

This article shows how to capture network requests in order to crawl data from a website that is dynamically rendered with JavaScript, using stock information as a hands-on example.

Stock information collection

Requirement: use requests plus an information-extraction method of your choice to crawl stock-related information and store it in a database.

The approach:

1. Packet capture

This page is dynamically rendered: the data is loaded via Ajax, so we need to capture the underlying requests.

Open the developer tools (F12) and refresh the page to see the JS requests the browser sends.

The analysis goes as follows:

1. Copy the name of any stock;

2. Use the search box in the Network panel and paste the copied stock name;

3. The search turns up one request (or more: additional data-refresh requests are issued as time passes) whose response contains the code information of every stock on the page;

4. All of the data can be previewed in the Response tab on the right.

2. Parameter analysis

The URL of the interface obtained in the previous step is as follows:

http://67.push2.eastmoney.com/api/qt/clist/get?cb=jQuery112407972169804676412_1634974778877&pn=1&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1634974778878
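To confirm the capture, the request can be replayed outside the browser. A minimal sketch, using the URL above verbatim (the timeout is just a sensible default, not from the original):

    import requests

    # The interface URL captured above, copied verbatim from the Network panel
    url = ('http://67.push2.eastmoney.com/api/qt/clist/get'
           '?cb=jQuery112407972169804676412_1634974778877'
           '&pn=1&pz=20&po=1&np=1'
           '&ut=bd1d9ddb04089700cf9c27f6f7426281'
           '&fltt=2&invt=2&fid=f3'
           '&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23'
           '&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,'
           'f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152'
           '&_=1634974778878')

    resp = requests.get(url, timeout=10)
    # The body is JSONP: the JSON payload wrapped in the jQuery callback named by cb
    print(resp.text[:120])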

As you can see, the URL is followed by a long list of GET parameters, and modifying them can improve the crawler.

So I made some attempts (each request was verified to match the actual page):

First, the two parameters that stand out are pn and pz.

When writing back-end interfaces, page_num and page_size are the usual names for the page number and the page size; pn and pz are abbreviations of these two variables.
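As a quick test of this guess (reusing the url variable from the replay sketch above; page 2 and size 50 are arbitrary values):

    # Fetch page 2 with 50 records per page by rewriting pn and pz
    page2_url = url.replace('pn=1', 'pn=2').replace('pz=20', 'pz=50')
    resp = requests.get(page2_url, timeout=10)
    print(len(resp.text))  # a noticeably larger body suggests pz controls the page size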

Next comes the cb=jQuery... parameter.

jQuery is a JavaScript framework. Although I don't know much about the front end, in general a project with separated front and back ends follows some request conventions, and the interface we captured looks like a Restful API with a layer of jQuery-oriented wrapping on top.

A bold guess: this interface has a dedicated jQuery adaptation, and the parameter most likely to control it is cb (because its value contains jQuery).

After deleting it, the interface returns its data in plain JSON format (so the back-end interface is not strict about this):
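A minimal sketch to verify the cb removal, with the fields list shortened purely for readability:

    import requests

    # Same interface, without cb=jQuery...: the response is now plain JSON
    url = ('http://67.push2.eastmoney.com/api/qt/clist/get'
           '?pn=1&pz=20&po=1&np=1&fltt=2&invt=2&fid=f3'
           '&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23'
           '&fields=f2,f12,f14')

    data = requests.get(url, timeout=10).json()  # .json() would fail on JSONP
    print(list(data))  # top-level keys; the stock records live under data -> diff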

It is easy to see that diff inside data is list-type data, and the f*-prefixed keys of each element are the data we need.
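Continuing from the sketch above, each element of diff is a dict keyed by those field codes:

    diff = data['data']['diff']   # the list of per-stock records
    first = diff[0]
    print(first['f12'], first['f14'], first['f2'])  # stock code, name, latest price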

Next comes limiting the range of returned data. After manual comparison, the mapping between each field and its parameter name is as follows:

Stock code: f12
Name: f14
Latest price: f2
Change percent: f3
Change amount: f4
Volume: f5
Turnover: f6
Amplitude: f7
High: f15
Low: f16
Open (today): f17
Previous close: f18
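For reference, this mapping can be kept in code; the English glosses below are mine, chosen to match the database columns created later:

    # Field code -> meaning, per the manual comparison above
    FIELD_MEANINGS = {
        'f12': 'stock code',
        'f14': 'name',
        'f2':  'latest price',
        'f3':  'change percent',
        'f4':  'change amount',
        'f5':  'volume',
        'f6':  'turnover',
        'f7':  'amplitude',
        'f15': 'high',
        'f16': 'low',
        'f17': 'open (today)',
        'f18': 'previous close',
    }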

Finally, there are a few other parameters whose purpose is unclear; they may be back-end analytics (tracking) fields. Filtering them one by one and removing everything unnecessary leaves the following:

    # Required fields; no other fields are requested
    needParams = ['f12', 'f14', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f15', 'f16', 'f17', 'f18']
    fields = ",".join(needParams)   # merged into the fields parameter

    pageNum = 1     # page number (pn)
    pageSize = 20   # page size (pz)
    fid = 'f3'      # sort field (f3 is the change percent)

    # The paging parameters limit the number and range of records crawled;
    # fid selects the sort field and fields restricts the returned columns.
    # Compared with the captured interface, the cb=jQuery... parameter has been
    # removed, so plain JSON data comes back directly.
    url = f'http://60.push2.eastmoney.com/api/qt/clist/get' \
          f'?pn={pageNum}' \
          f'&pz={pageSize}' \
          f'&po=1&np=1&fltt=2&invt=2' \
          f'&fid={fid}&fs=m:1+s:2' \
          f'&fields={fields}'

(The effect of po=1&np=1&fltt=2&invt=2 is unknown, but removing them makes the response come back empty.)
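Putting the pieces together, here is a minimal fetch-and-parse sketch; fetch_stocks is a hypothetical helper name, not from the original post:

    import requests

    NEED_PARAMS = ['f12', 'f14', 'f2', 'f3', 'f4', 'f5',
                   'f6', 'f7', 'f15', 'f16', 'f17', 'f18']

    def fetch_stocks(page_num: int, page_size: int = 20, fid: str = 'f3') -> list:
        """Fetch one page of stocks; returns a list of {'f12': ..., ...} dicts."""
        url = (f'http://60.push2.eastmoney.com/api/qt/clist/get'
               f'?pn={page_num}&pz={page_size}'
               f'&po=1&np=1&fltt=2&invt=2'
               f'&fid={fid}&fs=m:1+s:2'
               f'&fields={",".join(NEED_PARAMS)}')
        return requests.get(url, timeout=10).json()['data']['diff']

    rows = fetch_stocks(page_num=1)
    print(rows[0]['f14'], rows[0]['f2'])  # name and latest price of the first row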

3. Database building

db = DB()  # DB: the MySQL helper class from the earlier assignments; db.driver is assumed to be its cursor
db.driver.execute('use spider_test')
db.driver.execute('drop table if exists money')
# Column comments map the pinyin abbreviations to the field table above
sql_create_table = """
CREATE TABLE `money` (
    `id`   int(11) NOT NULL AUTO_INCREMENT,
    `code` varchar(64) DEFAULT NULL,  -- f12, stock code
    `name` varchar(64) DEFAULT NULL,  -- f14, name
    `zxbj` varchar(64) DEFAULT NULL,  -- f2, latest price
    `zdf`  varchar(64) DEFAULT NULL,  -- f3, change percent
    `zde`  varchar(64) DEFAULT NULL,  -- f4, change amount
    `cjl`  varchar(64) DEFAULT NULL,  -- f5, volume
    `cje`  varchar(64) DEFAULT NULL,  -- f6, turnover
    `zf`   varchar(64) DEFAULT NULL,  -- f7, amplitude
    `high` varchar(64) DEFAULT NULL,  -- f15, high
    `low`  varchar(64) DEFAULT NULL,  -- f16, low
    `jk`   varchar(64) DEFAULT NULL,  -- f17, open (today)
    `zs`   varchar(64) DEFAULT NULL   -- f18, previous close
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
"""
db.driver.execute(sql_create_table)
In principle, database field names should not contain Chinese characters, which is why the column names look a bit arbitrary (they appear to abbreviate the pinyin of the Chinese field names).

The insert operation is similar to the one in assignment 1, so it is not shown in full here.
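As a stand-in for the omitted code, a hypothetical insertion sketch, assuming db.driver accepts parameterized queries like a standard DB-API cursor and rows comes from the fetch_stocks sketch above:

    # Hypothetical insertion sketch; column order matches the f-field order
    insert_sql = (
        "INSERT INTO money (code, name, zxbj, zdf, zde, cjl, cje, zf, high, low, jk, zs) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    )
    field_order = ['f12', 'f14', 'f2', 'f3', 'f4', 'f5',
                   'f6', 'f7', 'f15', 'f16', 'f17', 'f18']
    for row in rows:  # rows as returned by fetch_stocks above
        db.driver.execute(insert_sql, tuple(str(row[f]) for f in field_order))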

4. Result display

So far only a single data table has been created, as an experiment. For real use, the tables should be split by time; data from all time periods must not be mixed in one table (otherwise the data is meaningless).

In addition, only the raw source values are stored, which makes later analysis and visualization easier; display formatting (appending %, converting to units of ten thousand or a hundred million, and so on) should not be done in the database.
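For example, a conversion like the following belongs in the presentation layer rather than in the table (format_amount is a hypothetical helper):

    def format_amount(raw: float) -> str:
        """Render a raw value for display, e.g. 123456789 -> '1.23亿'."""
        if raw >= 1e8:
            return f'{raw / 1e8:.2f}亿'  # hundreds of millions (yi)
        if raw >= 1e4:
            return f'{raw / 1e4:.2f}万'  # tens of thousands (wan)
        return str(raw)

    print(format_amount(123456789))  # -> 1.23亿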

Copyright notice
Author: liedmirror. Please include the original link when reprinting; thank you.
https://en.pythonmana.com/2022/01/202201300949521244.html
