
Introduction to Python (IV): dynamic web page analysis and scraping

2022-01-30 13:20:35 baiyuliang

What is a dynamic web page? A dynamic web page is one whose content is partly loaded asynchronously through AJAX. When you open such a page and right-click "View page source", you will find that some of the content displayed on the page does not appear in the source. That part was loaded asynchronously through AJAX, and this is what makes the page dynamic!

Take a CSDN blog post as an example: Python introduction (I): environment building

Open this article and you will see a comment below it.

Press F12 to inspect the elements.

Then select the comment content.

Here you can determine the location of the comment area: <div class="comment-list-box" >...</div>

This is what web page analysis means: by inspecting elements, you determine where the content you want to extract lives, and then you can extract it using the tag's id, name, class, or other attributes!

Looking further down, this div contains a list, and the comment sits inside it. Now right-click, choose "View page source", press Ctrl+F, and search for "comment-list-box" to find this part.

We find that this part of the source code is empty! Does it make sense now?
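The same check can be sketched in code. This is a minimal illustration using only the standard library: it parses an HTML string and collects whatever text sits inside the div with a given class. The two HTML strings are made-up stand-ins for "view source" and for the rendered page, not CSDN's real markup, and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class ContainerText(HTMLParser):
    """Collect the text found inside the first <div> carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0        # > 0 while we are inside the target <div>
        self.found = False    # True once the target <div> has been seen
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1   # a nested <div> inside the target
        elif not self.found:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.found = True
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def container_text(html, class_name):
    """Return (container_was_found, list_of_text_chunks_inside_it)."""
    parser = ContainerText(class_name)
    parser.feed(html)
    parser.close()
    return parser.found, parser.chunks


# Hypothetical samples: what "view source" shows vs. what the browser renders.
raw_source = '<div class="comment-list-box"></div>'
rendered = ('<div class="comment-list-box"><div class="comment-line-box">'
            '<span class="new-comment">Nice post</span></div></div>')

print(container_text(raw_source, "comment-list-box"))
print(container_text(rendered, "comment-list-box"))
```

If the container is found but holds no text in the raw source, while the browser clearly displays content there, the content was most likely filled in by JavaScript after the page loaded.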

If we want to extract this dynamic content, the method from the previous article alone will not work, unless we can work out the URL that loads the dynamic content. So how can we capture dynamic web content simply and efficiently? Here we need the go-to tool for scraping dynamic pages: Selenium

Selenium is actually a web automated testing tool. It can simulate scrolling, clicking, opening pages, verification, and a whole series of other page operations, just like a real user! Because the browser renders the page for us, crawling a dynamic web page becomes the same as crawling a static one!

Install Selenium: pip install selenium

After a successful installation, run a simple test:

from selenium import webdriver

# Open the web page with Selenium
driver = webdriver.Chrome()
driver.get("https://www.baidu.com")

This raises an error:

WebDriverException( selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

This means the Chrome browser driver, chromedriver, is missing. Download it, put it somewhere on disk and note the location, then modify the code and run it again:

driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe")
driver.get("https://www.baidu.com")

I am using the Firefox browser here, and the effect is the same. Of course, you then need to download the Firefox browser driver, geckodriver:

driver = webdriver.Firefox(executable_path=r"C:\geckodriver.exe")
driver.get("https://www.baidu.com")

Once the page opens successfully, the browser shows that it is being controlled by automation!
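One caveat worth sketching: the executable_path argument used above was deprecated in Selenium 4 and, as far as I know, dropped in later 4.x releases, where the driver location is passed through a Service object instead. The snippet below is a rough sketch under that assumption. The helper names locate_driver and open_in_chrome are made up for illustration, and the PATH lookup simply mirrors what the error message asks for.

```python
import shutil


def locate_driver(name, explicit_path=None):
    """Return the explicit path if given, else whatever `name` resolves to on PATH (or None)."""
    return explicit_path or shutil.which(name)


def open_in_chrome(url, driver_path=None):
    """Start Chrome through a Service object (Selenium 4 style) and open `url`."""
    # Imported lazily so locate_driver() can be used even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    path = locate_driver("chromedriver", driver_path)
    if path is None:
        raise FileNotFoundError(
            "chromedriver not found; pass driver_path or add it to PATH")
    driver = webdriver.Chrome(service=Service(executable_path=path))
    driver.get(url)
    return driver


# Example call (not executed here because it needs a real browser and driver):
# driver = open_in_chrome("https://www.baidu.com", r"C:\chromedriver.exe")
```

For Firefox the same pattern should apply with webdriver.Firefox and the Service class from selenium.webdriver.firefox.service, pointing at geckodriver.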

In PyCharm we can browse the methods that webdriver provides.

When the content to extract is nested inside a frame, we can switch into it with driver.switch_to.frame. In the ordinary case we can extract content directly with driver.find_element_by_css_selector, find_element_by_tag_name, and so on. Methods named with the plural "elements" return a list, while the singular ones return a single element. For detailed usage, see the official documentation!
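A small sketch of the frame and singular/plural points, written against the generic find_element(by, value) and find_elements(by, value) calls, which should work on both older and newer Selenium versions (the find_element_by_* helpers were deprecated in Selenium 4 and removed in later 4.x releases). The function names are hypothetical, and the functions take any driver object that is passed in.

```python
# Locator strategy string; it is the value of Selenium's By.CSS_SELECTOR.
CSS_SELECTOR = "css selector"


def texts_in_frame(driver, frame, css):
    """Switch into a frame, read the text of every match, then switch back out."""
    driver.switch_to.frame(frame)
    try:
        # find_elements always returns a list, which is empty when nothing matches.
        elements = driver.find_elements(CSS_SELECTOR, css)
        return [element.text for element in elements]
    finally:
        # Return to the top-level document so later lookups are not confined to the frame.
        driver.switch_to.default_content()


def first_text(driver, css):
    """find_element returns a single element (Selenium raises if nothing matches)."""
    return driver.find_element(CSS_SELECTOR, css).text
```

With a real driver you would normally import By from selenium.webdriver.common.by and write By.CSS_SELECTOR instead of the literal string; the literal is used here only to keep the sketch free of imports.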

Again taking the CSDN blog post Python introduction (I): environment building as the example, let's crawl the comments on this article. We already identified the comment area above: <div class="comment-list-box" >...</div>

Then we can use find_element_by_css_selector to get this div and the content below it:

from selenium import webdriver

driver = webdriver.Firefox(executable_path=r"C:\geckodriver.exe")
driver.get("https://baiyuliang.blog.csdn.net/article/details/120473414")

comment_list_box = driver.find_element_by_css_selector('div.comment-list-box')
comment_list = comment_list_box.find_element_by_class_name('comment-list')
comment_line_box = comment_list.find_elements_by_class_name('comment-line-box')
for comment in comment_line_box:
    span_text = comment.find_element_by_class_name('new-comment').text
    print(span_text)
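Because the comments are loaded asynchronously, they may not be present yet at the moment driver.get returns. One way to handle that is to poll until they show up. The sketch below is a hand-rolled polling loop, not the author's code; Selenium also ships its own WebDriverWait helper for this purpose. The function name, the timeout values, and the combined CSS selector are assumptions. Note also that it collects every .new-comment under the comment box, whereas the script above takes the first one per comment line.

```python
import time


def wait_for_comments(driver, timeout=10.0, poll=0.5):
    """Poll until at least one comment is rendered, then return the comment texts.

    Returns an empty list if nothing shows up before the timeout expires.
    """
    selector = "div.comment-list-box .comment-line-box .new-comment"
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the value of Selenium's By.CSS_SELECTOR.
        elements = driver.find_elements("css selector", selector)
        if elements:
            return [element.text for element in elements]
        if time.monotonic() >= deadline:
            return []
        time.sleep(poll)


# Example call with a real driver (not executed here):
# for text in wait_for_comments(driver):
#     print(text)
```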

Result: running the script prints the text of each comment.

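Note the difference in usage between find_element_by_css_selector and find_element_by_class_name: the former takes a CSS selector such as 'div.comment-list-box', while the latter takes a bare class name such as 'comment-list'!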
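To make that difference concrete, here is a tiny helper (hypothetical, for illustration only) that turns the bare class name expected by a class-name lookup into the equivalent CSS selector. A class-name lookup takes exactly one class with no dot and no spaces; the CSS form adds the leading dot and may be prefixed with a tag name.

```python
def class_to_css(class_name, tag=""):
    """Turn a single bare class name into the equivalent CSS selector."""
    if not class_name or "." in class_name or any(ch.isspace() for ch in class_name):
        raise ValueError("expected exactly one bare class name")
    return f"{tag}.{class_name}"


print(class_to_css("comment-list"))             # .comment-list
print(class_to_css("comment-list-box", "div"))  # div.comment-list-box
```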

copyright notice
author[baiyuliang],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/01/202201301320320680.html
