[Python data collection] university ranking data collection

2022-01-30 09:49:47 liedmirror

Preface

This article analyzes a website that uses an unusual dynamic rendering method and "doesn't play by the rules", and presents one way to collect its data.

University ranking data collection

Scrape the information for every university on the 2021 Chinese university ranking main list (www.shanghairanking.cn/rankings/bc…) and store it in a database. Also record the browser F12 debugging and analysis process as a GIF and include it in the blog post.

Output information:

Rank    School                Total score
1       Tsinghua University   969.2

Approach:

1. Capture the packets

This page implements pagination in an unusual way: it returns all the data at once, and JavaScript functions then handle the page turning dynamically.

As in the second exercise, use the search function in the developer tools to locate the data source.

Because Fuzhou University is not on the first page, we can search for ~~"Little Tsinghua of the South"~~ "Fuzhou University", and we quickly find that the data comes from a payload.js script:

2. Parse the page

import re

# html holds the text of the payload.js response
nameList = re.findall(r'univNameCn:"(.*?)"', html, re.S)
scoreList = re.findall(r'score:(.*?),', html, re.S)

Regular expressions extract the fields quickly, but I found that when several schools share the same score, the ranking value is replaced by a parameter name instead of a literal, for example:
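To make the problem concrete, here is a minimal sketch on a made-up fragment written in the style of payload.js. The school names, the `ranking` and `region` fields, and the parameter names `a` and `b` are assumptions for illustration only, not the site's real data.

```python
import re

# Made-up fragment; "a" and "b" stand for function parameters, not literal values.
sample = (
    '{univNameCn:"School A",ranking:a,score:b,region:"X"},'
    '{univNameCn:"School B",ranking:a,score:b,region:"Y"},'
    '{univNameCn:"School C",ranking:3,score:90.5,region:"Z"}'
)

names = re.findall(r'univNameCn:"(.*?)"', sample)
ranks = re.findall(r'ranking:(.*?),', sample)
scores = re.findall(r'score:(.*?),', sample)

# Tied schools come back as parameter names rather than numbers.
for name, rank, score in zip(names, ranks, scores):
    print(name, rank, score)
```

The first two rows print `a` and `b` where a rank and a score were expected, which is the symptom described above.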

Looking at the requested JS script, its overall structure is a function: the parameter names are listed in the function header at the start, and the parameter values are passed in at the end.

So we only need to build a lookup table from parameter names to parameter values and substitute the abnormal data to solve the problem above.

3. Parameter extraction

Again use regular expressions to extract the parameter names (keys) and parameter values (values), split them with split(","), then combine them into a dict:

# Parameter names come from the function header, values from the last parenthesized group
keys = re.findall(r'function\((.*?)\)', html)[0].split(',')
values = re.findall(r'\((.*?)\)', html)[-1]
params = dict(zip(keys, values.split(',')))

However, the elements do not line up correctly. Printing the lengths of keys and values after splitting shows that they differ by 1:

Checking the data shows that one of the strings contains a ",", which interferes with the split:
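The mismatch can be reproduced on a small made-up script. This is only a sketch: the sample text, its field names, and its values are assumptions chosen so that one string argument contains a comma.

```python
import re

# Made-up script in the same shape: parameter names in the header, values at the end.
sample = (
    '(function(a,b,c){return {univNameCn:"School A",ranking:a,score:b,region:c}}'
    '(1,969.2,"Beijing, China"))'
)

keys = re.findall(r'function\((.*?)\)', sample)[0].split(',')
raw_values = re.findall(r'\((.*?)\)', sample)[-1]
naive_values = raw_values.split(',')

print(len(keys), len(naive_values))
print(naive_values)
```

Three parameter names end up paired with four pieces, because the comma inside the quoted string is treated as a separator.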

 

There are many ways to handle this. I chose a very "Pythonic" one and parsed the string with the eval function:

1. First preprocess the string, replacing some literals with their Python equivalents (e.g. true -> True, null -> None);

2. Wrap the string in square brackets;

3. Call eval directly to turn it into a list.

# Convert JS literals to Python ones, wrap in [] and evaluate as a list
values = eval("[" +
              values.replace("true", "True").replace("false", "False").replace("null", "None")
              + "]")
params = dict(zip(keys, values))

The output shows that the parameters now map one to one:

(eval works well, but it is controversial; used carelessly, it easily turns into "bad code".)
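If the argument list contains only plain literals, one possibly safer variant is `ast.literal_eval` from the standard library, which evaluates literals but refuses arbitrary expressions. This is a sketch under that assumption, not the method used above; `parse_js_args` is a hypothetical helper name, and the naive replace calls would also rewrite the words true, false, or null if they appeared inside a string.

```python
import ast


def parse_js_args(raw: str) -> list:
    """Parse a comma-separated list of simple JS literals into Python values."""
    text = raw.replace("true", "True").replace("false", "False").replace("null", "None")
    # literal_eval only accepts literals (strings, numbers, lists, True/False/None).
    return ast.literal_eval("[" + text + "]")


values = parse_js_args('"",false,null,"Beijing, China",969.2')
print(values)
```

The comma inside the quoted string is preserved here because the whole text is parsed as a list literal rather than split on commas.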

4. Results

Write a short piece of code to perform the parameter substitution:
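The substitution code itself is not reproduced in the text, so the following is only a guess at its shape. The `params` table, the sample tokens, and the helper `resolve` are all hypothetical.

```python
# Hypothetical lookup table, as built from the parsed parameter names and values.
params = {"a": 1, "b": 969.2, "c": "Beijing"}

# Tokens as extracted by the regular expressions: parameter names or literal text.
ranks = ["a", "a", "3"]
scores = ["b", "b", "90.5"]


def resolve(token, params):
    """Return the mapped value if token is a parameter name, else the token itself."""
    return params.get(token, token)


fixed_ranks = [int(resolve(t, params)) for t in ranks]
fixed_scores = [float(resolve(t, params)) for t in scores]
print(list(zip(fixed_ranks, fixed_scores)))
```

Tokens that are already literals pass through unchanged, so the same loop can be applied to every row without checking first whether a value was replaced.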

 

Building the database is much the same as before, but note that the rank values can repeat, so rank cannot serve as the primary key; a separate id field is needed to tell the rows apart.
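The text does not say which database was used. As a minimal sketch, assuming SQLite through the standard `sqlite3` module and a hypothetical table named `university`, a separate auto-incrementing id lets tied ranks coexist:

```python
import sqlite3

# Made-up rows: (ranking, name, score); the first two share a rank.
rows = [(1, "School A", 969.2), (1, "School B", 969.2), (3, "School C", 90.5)]

conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
conn.execute(
    "CREATE TABLE university ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "  # surrogate key, unique per row
    "ranking INTEGER NOT NULL, "              # may repeat when scores are tied
    "name TEXT NOT NULL, "
    "score REAL)"
)
conn.executemany(
    "INSERT INTO university (ranking, name, score) VALUES (?, ?, ?)", rows
)
conn.commit()

stored = conn.execute(
    "SELECT id, ranking, name, score FROM university ORDER BY id"
).fetchall()
conn.close()
print(stored)
```

Using the ranking column as the primary key would make the second insert fail with a uniqueness error, which is why the id column is needed.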

 

In addition, schools with the same score are given the same rank; as you can see, the ranks and scores correspond:
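One way to check that ranks and scores agree is to recompute the ranks from the scores. The sketch below assumes the scores are already sorted in descending order and that ties follow the "1, 2, 2, 4" convention; whether the site numbers ties this way is an assumption, and `competition_ranks` is a hypothetical helper.

```python
def competition_ranks(scores):
    """Assign ranks to descending scores; equal scores share the same rank."""
    ranks = []
    for i, score in enumerate(scores):
        if i > 0 and score == scores[i - 1]:
            ranks.append(ranks[-1])  # tie: reuse the previous rank
        else:
            ranks.append(i + 1)      # otherwise rank is the 1-based position
    return ranks


example = competition_ranks([969.2, 855.3, 855.3, 768.7])
print(example)
```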

Packet capture GIF:

copyright notice
author[liedmirror],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/01/202201300949438621.html
