
Python web crawler - crawling NetEase Cloud Music comments (4)

2022-01-31 01:21:43 FizzH


Let's first review yesterday's article: 1. locate the target; 2. download the web page; 3. set the loading speed and find the target file. Now open any song on NetEase Cloud Music:

https://music.163.com/#/song?id=25723157

Every song's id is the string of digits at the end of the URL. So in principle we only need to collect the play-page URLs of the songs we want in order to get their id numbers. From there it is easy to write a loop that crawls the comments of multiple songs.
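As a rough sketch of the id-collection step, the helper below pulls the id out of a play-page URL. The name `extract_song_id` is made up for illustration; it assumes the id always appears as an `id=` query value, either in the normal query string or inside the `#/song?...` fragment used by the play page.

```python
from urllib.parse import urlparse, parse_qs


def extract_song_id(url):
    """Return the song id from a play-page URL (hypothetical helper)."""
    parts = urlparse(url)
    # The play page keeps its query inside the fragment ("#/song?id=..."),
    # so fall back to parsing the fragment when there is no normal query.
    query = parts.query or urlparse(parts.fragment).query
    ids = parse_qs(query).get("id")
    if not ids:
        raise ValueError("no song id found in URL: " + url)
    return ids[0]


# Looping over several play pages then only needs a list of URLs.
song_urls = ["https://music.163.com/#/song?id=25723157"]
song_ids = [extract_song_id(u) for u in song_urls]
print(song_ids)
```

Parsing the URL this way is a little more forgiving than splitting on `=`, since it still works if other query parameters are present.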

import requests


def get_comments(url):
    # POST the encrypted form data to the comment API of the song at `url`.
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36',
        'referer': 'http://music.163.com/'
        }

    params = "EuIF/+GM1OWmp2iaIwbVdYDaqODiubPSBToe5EdNp6LHTLf+aID/dWGU6bHWXS0jD9pPa/oY67TOwiicLygJ+BhMkOX/J1tZMhq45dcUIr6fLuoHOECYrOU6ySwH4CjxxdbW3lpVmksGEdlxbZevVPkTPkwvjNLDZHK238OuNCy0Csma04SXfoVM3iLhaFBT"
    encSecKey = "db26c32e0cd08a11930639deadefda2783c81034be6445ca8f4fbedd346e1f9567375083aeb1a85e6ad6d9ae4532a49752c2169db8bcc04d38a79f9bed7facea42ee23f1b33538c34f82741318d9b4b846663b53b0b808dd0499dccfbc6c61fbf180c6fb24b1c2dd3c2c450ce09917d74be9424dab836fd2e671988ffbc6ae1b"
    data = {
        "params": params,
        "encSecKey": encSecKey
        }

    # The song id is the value after '=' in the play-page URL.
    name_id = url.split('=')[1]
    target_url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_{}?csrf_token=".format(name_id)

    # A timeout keeps the crawler from hanging on a stalled connection.
    res = requests.post(target_url, headers=headers, data=data, timeout=10)

    return res

params and encSecKey are the data the server expects, and both are encrypted. To see how they are generated, we need the core.js file. Because this JS file is huge, we can search it for the keyword encSecKey to locate the key code segment. Also, the two parameters POSTed above can be reused for other songs.
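Since the article says the same two parameters work for other songs, one way to prepare requests for several songs is sketched below. `build_comment_requests` and the second song id are hypothetical and only for illustration; the URL template is the one used in `get_comments`.

```python
# URL template taken from the article's get_comments function.
COMMENT_URL = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_{}?csrf_token="


def build_comment_requests(song_ids, params, enc_sec_key):
    """Return (url, form_data) pairs, one per song, sharing one payload."""
    form_data = {"params": params, "encSecKey": enc_sec_key}
    # Copy the dict so each request owns its own form data.
    return [(COMMENT_URL.format(song_id), dict(form_data)) for song_id in song_ids]


# "PARAMS" and "KEY" stand in for the long encrypted strings shown above;
# "12345" is a made-up song id.
for url, data in build_comment_requests(["25723157", "12345"], "PARAMS", "KEY"):
    print(url, sorted(data))
```

Each pair can then be passed to `requests.post(url, headers=headers, data=data)` in a loop.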


Once the comments are found, we need to extract the key data. As we can see, the crawled data is in JSON format. JSON is a lightweight data-interchange format that is often used in network transmission.

The json.loads() function restores the string to a Python data structure:

comments_json = json.loads(res.text)
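To see what json.loads() returns without touching the network, here is a small sketch on a made-up string. The sample only mimics the keys this article reads (`hotComments`, `user`, `nickname`, `content`); a real response contains many more fields.

```python
import json

# Made-up sample shaped like the keys the article reads; not real API output.
sample_text = '{"hotComments": [{"user": {"nickname": "listener_1"}, "content": "Great song!"}]}'

comments_json = json.loads(sample_text)
first = comments_json["hotComments"][0]

# JSON objects become dicts and JSON arrays become lists.
print(type(comments_json).__name__, type(comments_json["hotComments"]).__name__)
print(first["user"]["nickname"], "->", first["content"])
```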

So in the resulting dictionary, the value under the "hotComments" key holds all the hot comments!

import json


def get_hot_comments(res):
    # Parse the response and write each hot comment to hot_comments.txt.
    comments_json = json.loads(res.text)
    hot_comments = comments_json['hotComments']
    with open('hot_comments.txt', 'w', encoding='utf-8') as file:
        for each in hot_comments:
            file.write(each['user']['nickname'] + ':\n\n')
            file.write(each['content'] + '\n')
            file.write("---------------------------------------\n")
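A possible variation, sketched here, is to separate the formatting from the file writing so the output can be checked on sample data first. `format_hot_comments` is a hypothetical name, and the sample dictionary is made up; the layout matches what `get_hot_comments` writes.

```python
def format_hot_comments(comments_json):
    """Return the hot comments as one text block, in the article's file layout."""
    separator = "---------------------------------------\n"
    chunks = []
    # .get() tolerates a response that has no "hotComments" key.
    for each in comments_json.get("hotComments", []):
        chunks.append(each["user"]["nickname"] + ":\n\n")
        chunks.append(each["content"] + "\n")
        chunks.append(separator)
    return "".join(chunks)


# Made-up sample data, not real API output.
sample = {"hotComments": [{"user": {"nickname": "listener_1"}, "content": "Great song!"}]}
print(format_hot_comments(sample))
```

The returned string can then be written to a file opened with `encoding='utf-8'`, as in the function above.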

With that, combining the previous parts gives us the result we want!


In addition, the NetEase Cloud Music API endpoint has the following form: music.163.com/api/v1/reso…

Note that if the crawler sends requests too quickly or too many times, the server will block your IP. A more complete crawler project therefore needs to set up proxies and an IP pool.
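A minimal sketch of both ideas, pausing between requests and rotating through a proxy pool, might look like the following. The proxy addresses and the helper names `proxy_rotation` and `polite_pause` are assumptions for illustration, not part of the original project.

```python
import itertools
import random
import time

# Hypothetical proxy addresses; replace with proxies you actually control.
PROXY_POOL = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]


def proxy_rotation(pool):
    """Yield requests-style proxies dicts, cycling through the pool forever."""
    for address in itertools.cycle(pool):
        yield {"http": address, "https": address}


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random interval and return the chosen delay."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


rotation = proxy_rotation(PROXY_POOL)
print(next(rotation))
```

Each dict from the generator can be passed as the `proxies` argument of `requests.post`, with `polite_pause()` called between requests. The random interval is a design choice to avoid a perfectly regular request rhythm.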

Also, readers who don't want to get bogged down in cracking the POST form parameters can try Python + Selenium + PhantomJS to simulate user actions: click to turn the page, then parse the page elements directly. This achieves "if you can see it, you can crawl it", though the efficiency will be somewhat lower.

Copyright notice
Author: FizzH. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201310121410107.html
