
Python Crawler in Practice: Scraping Southern Weekly News Articles

2022-01-31 21:39:45 Clever crane

This is day 5 of my participation in the November writing challenge. Check out the event details: 2021 Last Writing Challenge.

A few days ago, a follower asked me to crawl news articles from the Southern Weekly website.

The requirements are not complicated, and they are similar to the earlier People's Daily and Liberation Daily crawlers.

Without further ado, let's get started.

1. Analyzing the website

Southern Weekly's website address is: www.infzm.com/contents?te…


Looking at the homepage, we can see a channel list on the left and the news list in the middle.

Clicking to switch channels on the left, we can observe that the term_id value in the browser address bar changes in sync, which means the term_id parameter is the channel id.

Scrolling down the page, we can see that new news articles keep loading in while the URL in the address bar never changes, which means the news list uses infinite-scroll ("waterfall") loading and the data is loaded dynamically via Ajax.

After this quick analysis, we open the developer tools, switch to the Network tab, and start capturing requests.

1.1 News list analysis


As the page scrolls, new requests keep coming in.

The requested URLs look like: www.infzm.com/contents?te…

Inspecting the content of one of these requests shows that this is exactly the news list data interface we are looking for.

Looking at the interface URL: www.infzm.com/contents?te…

It has 3 parameters: term_id, page, and format.

We already determined that term_id is the channel id; the other two are self-explanatory: page is the page number and format is the data format.

The response is standard JSON. The article list is located at data -> contents, and each entry includes the article title, article id, author name, publish time, and other fields.
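A simplified sketch of the response structure, based only on the fields used by the parsing code later in this article (the values are placeholders, and real responses contain more fields):

{
    "data": {
        "contents": [
            {
                "id": 217973,
                "subject": "Article title",
                "publish_time": "2021-11-17 10:00:00"
            }
        ]
    }
}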

1.2 News detail page analysis

Open the detail page of any news article, for example: www.infzm.com/contents/21… .

We can see that the detail page link has the structure http://www.infzm.com/contents/ + article id.

Inspecting the page with the developer tools shows that the article body is rendered directly in the HTML source.


The news content is inside the <div class="nfzm-content__content"> element. The introduction sits in the <blockquote class="nfzm-bq"> tag, and the body text is in the <p> tags under <div class="nfzm-content__fulltext">.

The structure of the page is as follows:

<div class="nfzm-content__content">
    <blockquote class="nfzm-bq"> introduction </blockquote>
    <div class="nfzm-content__fulltext">
        <p> The first paragraph </p>
        <p> The second paragraph </p>
        <p> The third paragraph </p>
    </div>
</div>

1.3 Analyzing the anti-crawl mechanism

Let's write a quick piece of Python code to test the site's anti-crawl measures.

1.3.1 News list

Forge a simple set of headers and make the request.

import requests

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
url = "http://www.infzm.com/contents?term_id=1&page=2&format=json"
r = requests.get(url, headers=headers)
r.encoding = r.apparent_encoding
print(r.text)

We find that the data can be fetched normally.


1.3.2 News body

import requests

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
url = "http://www.infzm.com/contents/217973"
r = requests.get(url, headers=headers)
r.encoding = r.apparent_encoding
print(r.text)

The article body can also be fetched successfully.


However, not every article is freely accessible: some show only part of the content, and the full text can only be viewed after logging in to an account.


After registering an account and refreshing the page, I found that a paid membership is required to view the full text.


I won't be buying a membership here.

If you need those articles, you can purchase a membership yourself, fill your cookies into the headers in the code, and then crawl.

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Cookie': " Your own cookie"
}

You can find your cookie in the developer tools.

2. Coding

Next, let's start coding properly.

First, import the libraries the crawler needs:

import requests
import json
from bs4 import BeautifulSoup
import os

Then the network request function fetchUrl:

def fetchUrl(url):
    '''
    Function: request the page at url, fetch its content, and return it
    Parameter: the url of the target page
    Returns: the html content of the target page
    '''
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as e:
        print(e)

The news list parsing function parseNewsList:

def parseNewsList(html):
    '''
    Function: parse the news list page and return the extracted list data
    Parameter: the list data (json format)
    Returns: news id, title, publish time
    '''
    try:
        jsObj = json.loads(html)
        contents = jsObj["data"]["contents"]
        for cnt in contents:
            pid = cnt["id"]
            subject = cnt["subject"]
            publish_time = cnt["publish_time"]
            yield pid, subject, publish_time

    except Exception as e:
        print("parseNewsList error!")
        print(e)

The function to parse the article body, parseNewsContent:

def parseNewsContent(html):
    '''
    Function: parse the news detail page and return the article body
    Parameter: page source code (html format)
    Returns: the article body as a string
    '''
    try:
        bsObj = BeautifulSoup(html, "html.parser")
        cntDiv = bsObj.find("div", attrs={"class": "nfzm-content__content"})
        blockQuote = cntDiv.find("blockquote", attrs={"class": "nfzm-bq"})
        fulltextDiv = cntDiv.find("div", attrs={"class": "nfzm-content__fulltext"})
        pList = fulltextDiv.find_all("p")
        
        ret = blockQuote.text + "\n" if blockQuote else ""
        ret += "\n".join([p.text for p in pList if len(p.text) > 1])
        return ret
        
    except Exception as e:
        print("parseNewsContent error!")
        print(e)

The file-saving function saveFile:

def saveFile(path, filename, content):
    '''
    Function: save the article content to a local file
    Parameters: save path, file name, content to save
    '''
    # Create the folder automatically if it does not exist
    if not os.path.exists(path):
        os.makedirs(path)
    #  Save the file 
    with open(path + filename, 'w', encoding='utf-8') as f:
        f.write(content)

The crawler scheduler download_nfzm:

def download_nfzm(termId, page, savePath):
    '''
    Function: crawl all the news on page `page` of channel `termId` and save it under `savePath`
    Parameters:
        termId    channel id
        page      page number
        savePath  save path
    '''
    url = f"http://www.infzm.com/contents?term_id={termId}&page={page}&format=json"
    html = fetchUrl(url)
    try:
        for pid, title, publish_time in parseNewsList(html):
            print(pid, publish_time, title)
            pLink = f"http://www.infzm.com/contents/{pid}"
            content = parseNewsContent(fetchUrl(pLink))
            content = title + "\n\n" + publish_time + "\n\n" + content
            date = publish_time.split(" ")[0]
            filename = f"{date}-{pid}.txt"
            
            saveFile(savePath, filename, content)
            
    except Exception as e:
        print("download_nfzm Error")
        print(e)

Finally, the main function, which starts the crawler.

if __name__ == '__main__':
    ''' Main function: program entry point '''
    beginPage = 1
    endPage = 10
    term_id = 1

    for page in range(beginPage, endPage + 1):
        download_nfzm(term_id, page, 'infzm_News/')
    
    print(" Crawling is complete ")

3. Keyword filtering

Some readers may want to crawl only the news articles that match certain keywords, rather than everything.

So I tried the website's search function.


Capturing the requests on the search results page works the same way as before.


It turns out that the keyword search on the Southern Weekly site is built on the same data interface as before, with one additional parameter, k:

http://www.infzm.com/search?term_id=&page=2&k=%E7%BB%8F%E6%B5%8E&format=json

Here %E7%BB%8F%E6%B5%8E is the URL-encoded form of the keyword 经济 (economy).
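If you build such a URL yourself in Python, the keyword must be URL-encoded first. Here is a minimal sketch using the standard library (the term_id and page values are just examples; passing the keyword through the params argument of requests.get would also handle the encoding for you):

from urllib.parse import quote

kw = "经济"  # the keyword "economy"
print(quote(kw))  # %E7%BB%8F%E6%B5%8E

url = f"http://www.infzm.com/contents?term_id=1&page=1&k={quote(kw)}&format=json"
print(url)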

So, building on the previous code and slightly adjusting the main function and the download_nfzm function, we can turn the plain news crawler into one with keyword filtering.

def download_nfzm(termId, page, kw, savePath):
    '''
    Function: crawl the news on page `page` of channel `termId` that matches keyword `kw`, and save it under `savePath`
    Parameters:
        termId    channel id
        page      page number
        kw        search keyword
        savePath  save path
    '''
    url = f"http://www.infzm.com/contents?term_id={termId}&page={page}&k={kw}&format=json"
    html = fetchUrl(url)
    try:
        for pid, title, publish_time in parseNewsList(html):
            print(pid, publish_time, title)
            pLink = f"http://www.infzm.com/contents/{pid}"
            content = parseNewsContent(fetchUrl(pLink))
            content = title + "\n\n" + publish_time + "\n\n" + content
            date = publish_time.split(" ")[0]
            filename = f"{date}-{pid}.txt"
            
            saveFile(savePath, filename, content)
            
    except Exception as e:
        print("download_nfzm Error")
        print(e)
The main function, adjusted accordingly:
if __name__ == '__main__':
    ''' Main function: program entry point '''
    beginPage = 1
    endPage = 10
    term_id = 1
    kw = " economic "

    for page in range(beginPage, endPage + 1):
        download_nfzm(term_id, page, kw, 'infzm_News/')
    
    print(" Crawling is complete ")

4. Running results

Run the code and crawl the first 10 pages as a test.

Screenshots: the running output, the saved news article files, and the crawled article content.


If anything in this article is unclear or explained incorrectly, feel free to leave a comment, or scan the QR code below to add me on WeChat so we can learn and improve together.

