
Scraping the Kongfuzi used-book network — learn one case, apply it to three — Python crawler, 120 cases, case 21

2022-02-01 14:42:45 Dream eraser

「This is day 28 of my participation in the November writing challenge. Check out the event details: the last writing challenge of 2021」

E-commerce website crawlers are a perennial must-do in the crawler community. Today we'll practice on the Kongfuzi used-book network.

Analyzing the target data source

The target website for this crawl is https://book.kongfz.com/Cxiaoshuo/v6/. Opening the page reveals paginated data; pages are switched at the position shown in the figure below.

While switching page numbers, capture the pagination links and work out the pagination rule:

https://book.kongfz.com/Cxiaoshuo/v6w1/
https://book.kongfz.com/Cxiaoshuo/v6w2/
https://book.kongfz.com/Cxiaoshuo/v6w3/

Refining these gives the list-page address template: https://book.kongfz.com/C{category}/v6w{page}/.
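The template can be filled in with str.format. A quick sketch (the category slug and page count here are only illustrative values):

```python
# Build list-page URLs from the template discovered above.
url_format = 'https://book.kongfz.com/C{}/v6w{}/'

# Illustrative values: the "xiaoshuo" (novels) category, first 3 pages.
urls = [url_format.format('xiaoshuo', page) for page in range(1, 4)]
for u in urls:
    print(u)
```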

With the above sorted out, we can collect the list pages. This crawl breaks into three steps:

  1. Extract all book categories;
  2. Collect the list pages under each category (as test data, only 5 pages per category);
  3. Extract the target data, e.g. book title, author, publisher, publication date, shop name, and so on.

Next, follow the steps .

Extract all book categories

Using the developer tools, capture the HTML of the book-category area, as shown below:

This data can be obtained from any category page. The core code is below; for the self.get_headers() function, refer to the previous posts in this series, or download the code to see it.
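Since get_headers() lives in the earlier posts, here is a minimal hypothetical stand-in — it just picks a random desktop User-Agent; the agent strings are illustrative, not taken from the original code:

```python
import random

# Hypothetical stand-in for the get_headers() helper from earlier posts:
# returns request headers with a randomly chosen User-Agent.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15',
]

def get_headers():
    return {'User-Agent': random.choice(USER_AGENTS)}
```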

import requests
from lxml.html import etree
import random
import time


class SSS:
    def __init__(self):
        self.url_format = 'https://book.kongfz.com/C{}/v6w{}/'
        # Categories to capture; extend as needed
        self.types = ["wenxue", "xiaoshuo"]
        self.session = requests.Session()
        self.headers = self.get_headers()
        self.categorys = []

    def get_categorys(self):
        with self.session.get(url='https://book.kongfz.com/Cfalv/', headers=self.headers) as res:
            if res:
                html = etree.HTML(res.text)
                items = html.cssselect('.tushu div.link-item a')
                # Extract the category slug from each URL
                for item in items:
                    href = item.get("href")
                    type = href[href.find('C') + 1:-1]
                    self.categorys.append(type)

After this simple step, you get the following list — the categories of all books on the Kongfuzi used-book network.

xiaoshuo
wenxue
yuyan
lishi
dili
yishu
……

Traversing this list would fetch the list-page data for every category. In the learning phase, pick just a couple for analysis; here I choose the literature and novel categories: self.types = ["wenxue", "xiaoshuo"].
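The slug-slicing step above is worth a closer look: the category name sits between the leading 'C' and the trailing '/' in hrefs such as https://book.kongfz.com/Cxiaoshuo/, which is exactly what the expression href[href.find('C') + 1:-1] extracts:

```python
# Demonstrate the slicing used in get_categorys():
# everything between the first 'C' and the trailing slash.
def extract_type(href):
    return href[href.find('C') + 1:-1]

print(extract_type('https://book.kongfz.com/Cxiaoshuo/'))
print(extract_type('https://book.kongfz.com/Cwenxue/'))
```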

Collect the static page data of the category pages

For static page data, use the earlier approach of saving it locally. Add get_detail and run functions to the SSS class. Given the amount of data, the page count can go up to 200; set it to 5 to keep the crawl light. Before running the code below, make sure the ./confucius folder exists.

The code continues to use the session.get method to make the data requests.

    def get_detail(self, type, page):
        with self.session.get(url=self.url_format.format(type, page), headers=self.headers, timeout=5) as res:
            if res.text:
                with open(f"./confucius/{type}_{page}.html", "w+", encoding="utf-8") as f:
                    f.write(res.text)
            else:
                # No data came back; request the page again
                print(f"Page {page} request failed, retrying")
                self.get_detail(type, page)

    def run(self):
        pagesize = 5
        for type in self.types:
            for page in range(1, pagesize + 1):
                self.get_detail(type, page)
                time.sleep(2)
                print(f"Category: {type}, page {page} saved!")

Running the code collects the data below. During testing, no anti-crawling measures were encountered; even so, the request rate is deliberately throttled to be safe.

Extract the data

Finally, operate on the local HTML files to obtain the final target data.

When extracting, proficiency with CSS selectors once again plays a decisive role; handling abnormal data also needs attention.

# Data extraction class
class Analysis:
    def __init__(self):
        # Categories to capture; extend as needed
        self.types = ["wenxue", "xiaoshuo"]

    # Strip special characters
    def remove_character(self, origin_str):
        if origin_str is None:
            return None
        origin_str = origin_str.replace('\n', '')
        # Replace fullwidth commas with ASCII commas
        origin_str = origin_str.replace('，', ',')
        return origin_str

    def format(self, text):
        html = etree.HTML(text)
        # Get the div of every item area
        div_books = html.cssselect('div#listBox>div.item')
        for book in div_books:
            # Get the title attribute value
            title = book.cssselect('div.item-info>div.title')[0].get('title')
            # Author defaults to None
            author = None
            author_div = book.cssselect('div.item-info>div.zl-isbn-info>span:nth-child(1)')
            if len(author_div) > 0:
                author = author_div[0].text
            # Same treatment for the publisher
            publisher = None
            publisher_div = book.cssselect('div.item-info>div.zl-isbn-info>span:nth-child(2)')
            if len(publisher_div) > 0:
                # Extract and slice out the value
                publisher = publisher_div[0].text.split(' ')[1]
            print(publisher)

    def run(self):
        pagesize = 5
        for type in self.types:
            for page in range(1, pagesize + 1):
                with open(f"./confucius/{type}_{page}.html", "r", encoding="utf-8") as f:
                    text = f.read()
                    self.format(text)

Some abnormal data appeared during extraction; such data can be given special handling, as in the screenshot data. In the learning phase, no more fields are extracted — only the book title, author, and publisher.
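One concrete source of such abnormal data: the publisher extraction takes span.text.split(' ')[1], which raises IndexError when the span text contains no space-separated label. A defensive variant might look like this (a sketch — the label format in the sample input is illustrative, not taken from the live site):

```python
# Guarded version of the publisher extraction: fall back gracefully
# when the text is missing or has no space-separated parts.
def safe_publisher(text):
    if not text:
        return None
    parts = text.split(' ')
    # Take the value after the label if present, else the whole text
    return parts[1] if len(parts) > 1 else parts[0]

print(safe_publisher('publisher: Changjiang'))
print(safe_publisher(None))
```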

 A novel : The baboons of dharskong ( Hardcover ) [ Law ] Alphonse · Dude    Writing ; Li Jieren    translate   Sichuan literature and Art Press 
 The sword Dynasty .4  On the sword   innocent   Changjiang press 
 Only the moon can hear   Kang Lingling   Sichuan literature and Art Press 
 Yuanzun 1· The hidden dragon is in the abyss   Scarab potatoes    Writing   Changjiang press 
 Bestseller queen : Zhang Ailing's 33 A writing class   Duanmu Xiangyu   Tianjin People's publishing house 
 Blockchain changes the world   Yan Xingfang   China Textile Press 
 Will we see each other again   Miao Yonggang 、 Jia Yuping    translate   China Publishing Group , Modern press 
 Midsummer night love I  Little girl    Writing   Writers press 
 Long baduyana ( Hardcover ) [ Law ] Heller · Malang    Writing ; Li Jieren    translate   Sichuan literature and Art Press 

Time to bookmark

Code download address: codechina.csdn.net/hihell/pyth… — could you give it a Star?

== Since you've come this far, how about leaving a comment, a like, and a bookmark? ==

Today is day 201/365 of continuous writing. You can follow me, like, comment on, and bookmark my posts.

Copyright notice
Author [Dream eraser]. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202011442444559.html
