
Python Web Crawler: Scraping NetEase Cloud Music Comments (3)

2022-01-30 22:51:44 FizzH

「This is day 3 of my participation in the November Gengwen Challenge. Check out the event details: 2021 Last Gengwen Challenge」

After covering crawler basics over the past two days, let's try something more ambitious today: scraping comments from NetEase Cloud Music.

1. Locate the target

First, find my favorite song, "The Golden Age". There is no original version; NetEase Cloud really has no originals, just lots of covers!!


Inspecting the page, it looks like all the comments are wrapped inside the tag with id="auto-id-0flvTEG8zLVkFZST". Whatever that is, let's download the page and take a look.

2. Download the page

Download the page first, then extract the comments with BeautifulSoup.

import requests

def get_url(url):
    # Pretend to be a browser so the server does not reject the request
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1'}
    res = requests.get(url, headers=headers)
    res.encoding = 'utf-8'  # force UTF-8 so the Chinese text is not garbled
    return res

def main():
    url = input("Please enter the page URL: ")
    res = get_url(url)
    # Save the raw HTML so we can search it locally
    with open("res.txt", "w", encoding='utf-8') as file:
        file.write(res.text)

if __name__ == "__main__":
    main()
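The plan above, extracting the comments from the saved page with BeautifulSoup, would look roughly like the sketch below. The HTML snippet is a stand-in for res.txt, since the real page's id is auto-generated and changes between loads:

```python
from bs4 import BeautifulSoup

# Stand-in for the saved page in res.txt; the real wrapper id is
# auto-generated (e.g. "auto-id-0flvTEG8zLVkFZST") and the inner
# class name "comment" is a hypothetical example.
html = '''
<div id="auto-id-0flvTEG8zLVkFZST">
  <div class="comment">First comment</div>
  <div class="comment">Second comment</div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
wrapper = soup.find(id="auto-id-0flvTEG8zLVkFZST")
comments = [div.get_text(strip=True)
            for div in wrapper.find_all("div", class_="comment")]
print(comments)  # ['First comment', 'Second comment']
```

As we are about to see, this approach fails on the real page, because the comments never appear in the downloaded HTML in the first place.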

Run it and enter the relevant song's page URL yourself; the fetched page is saved to res.txt.

Now search the saved file for any of the comments: they can't be found! The comments are not in this document at all, which means they must be loaded by some other request!

3. Throttle the network and find the target file

The network is too fast: in the blink of an eye, the whole page has loaded.

Open the Network panel and refresh, and you will see many source files, each making up part of the whole page:


We need to find, among this pile of files, the one that carries the comments. We could obviously go through them one by one, but that is tedious. Instead, we can make the browser load the page slowly and pause the loading the moment the target appears.


Incidentally, the nesting of the tags gives the following selector path:

data = soup.select('#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li > a')
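A selector path like this, copied from DevTools, can be fed straight to BeautifulSoup's `select`. A self-contained sketch, using hypothetical HTML shaped to match that path (the real page differs):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML laid out to match the copied selector path;
# it only demonstrates how a DevTools selector plugs into soup.select.
html = '''
<div id="main"><div>
  <div class="mtop firstMod clearfix">
    <div class="centerBox">
      <ul class="newsList">
        <li><a href="/a">Item one</a></li>
        <li><a href="/b">Item two</a></li>
      </ul>
    </div>
  </div>
</div></div>
'''

soup = BeautifulSoup(html, "html.parser")
# Each ">" means "direct child", so the whole chain must match exactly
data = soup.select('#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li > a')
print([a.get_text() for a in data])  # ['Item one', 'Item two']
```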

It flopped: my browser can't throttle! I'll go back and update Chrome tonight!

Since the comments live in a document of their own, we can look directly at requests of type XHR and Doc. Meanwhile, once we find the target file, we will see it is fetched with a POST request. Remember what we said earlier about POST requests? We have to submit some specific data to the server to get what we want. I'll continue this tomorrow!
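The shape of that POST request can be sketched now, even before the details are worked out. Everything below is an assumption: the endpoint URL is a placeholder, and the form-field names ("params", "encSecKey") are a guess at what the NetEase comment API expects; the real values are encrypted and are the subject of the next post.

```python
import requests

# Placeholder endpoint and hypothetical form fields -- NOT the verified
# API. The real encrypted values would replace the "<encrypted>" stubs.
URL = "https://music.163.com/weapi/comment/..."  # placeholder
HEADERS = {"user-agent": "Mozilla/5.0"}

def fetch_comments(payload):
    # Unlike the GET we used to download the page, a POST carries the
    # payload in the request body as form data
    return requests.post(URL, headers=HEADERS, data=payload)

payload = {"params": "<encrypted>", "encSecKey": "<encrypted>"}
# res = fetch_comments(payload)  # would need the real encrypted values
# print(res.json())
```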

copyright notice
author[FizzH],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/01/202201302251415111.html
