2022-02-01 13:06:54 Dream eraser

This case focuses on collecting basic data for sales use. The industry chosen is the beauty industry, so please keep that in mind.

The data is collected with a combination of lxml and cssselect, with the focus on cssselect selectors.

Target site analysis

The target site of this capture has multiple categories. When collecting, the categories can be stored in a list in advance to make later expansion easier. It later turned out that the top-level industry filter can be set to "no limit", in which case all categories are returned. Based on this, we first grab all the data locally and then filter out the franchise data for the beauty industry.

The amount of data and the number of pages to capture are shown in the figure below.

[Figure: Python crawler 120 examples, example 20: franchise data collection from 1637 (All the Way Business Opportunity Network)]

The data is grabbed the old-fashioned way: first save the HTML pages locally, then process them in a second pass.

Technical points used

Requests are sent with requests, and data extraction is implemented with lxml + cssselect. Before using cssselect, install the library with pip install cssselect.

Once it is installed, there are two ways to use it in code. The first is the CSSSelector class, as follows:

from lxml import etree
from lxml.cssselect import CSSSelector

# Much like compiling a regular expression, first construct a CSS selector object
sel = CSSSelector('#div_total>em', translator="html")
# Then pass it an Element object; res is the response returned by requests
element = sel(etree.HTML(res.text))
This usage suits building selectors in advance and is easier to extend. If you don't want to use it, you can call the cssselect method directly, as in the following code:

# Use a cssselect selector to select the em tag
div_total = element.cssselect('#div_total>em')
Whichever of the two methods you use, the #div_total>em inside the brackets is the part to study. It is written in CSS selector syntax. If you already know some front-end development, it is easy to master. If not, that is no problem: the basics covered next are enough to get started.

CSS Selectors

Suppose we have the following piece of HTML code:

<div class="totalnum" id="div_total">Total <em>57041</em> projects</div>
Here class and id are both attributes of the HTML tag. In general, the same class may appear on many elements in a page, whereas an id should appear only once.

To get the div tag with a CSS selector, either #div_total or .totalnum works. The point to note: when selecting by id, the prefix is #; when selecting by class, the prefix is a dot (.). Tags sometimes carry other attributes, and CSS selectors can match those as well. Modify the HTML code as follows.

<div class="totalnum" id="div_total" custom="abc">Total <em>57041</em> projects</div>
Write the following test code, paying attention to how the CSS selector is written in the CSSSelector call, namely div[custom="abc"] em.

sel = CSSSelector('div[custom="abc"] em', translator="html")
element = sel(etree.HTML('<div class="totalnum" id="div_total" custom="abc">Total <em>57041</em> projects</div>'))
The selector above also relies on another concept, the descendant selector. Compare it with #div_total>em: the > symbol between #div_total and em selects only the em elements that are direct children of the element with id=div_total. If you remove the > and write #div_total em instead, it selects the em elements among all descendants (children, grandchildren, and so on) of the element with id=div_total.

With a basic grasp of the above, you can write simple cssselect code yourself.

Code time

The capture approach in this case is to save the HTML pages locally first and then parse the local files, so the collection code is relatively simple: it only needs to obtain the total number of pages dynamically. In the code below, pay attention to the internal logic of the get_pagesize function.

import requests
from lxml import etree
import random
import time

class SSS:
    def __init__(self):
        self.start_url = ''
        self.url_format = '{}.html'
        self.session = requests.Session()
        self.headers = self.get_headers()

    def get_headers(self):
        # This function is available in earlier posts of this series
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +"
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua,
            "referer": ""
        }
        return headers

    def get_pagesize(self):
        with self.session.get(url=self.start_url, headers=self.headers, timeout=5) as res:
            if res.text:
                element = etree.HTML(res.text)
                # Use a cssselect selector to select the em tag
                div_total = element.cssselect('#div_total>em')
                # Read the text inside the em tag (div_total[0].text) and convert it to an integer
                total = int(div_total[0].text)
                # Compute the number of pages, at 10 items per page
                pagesize = int(total / 10) + 1
                # print(pagesize)
                # If the total is an exact multiple of 10, no extra page is needed
                if total % 10 == 0:
                    pagesize = int(total / 10)

                return pagesize
            else:
                return None

    def get_detail(self, page):
        with self.session.get(url=self.url_format.format(page), headers=self.headers, timeout=5) as res:
            if res.text:
                # Save the raw HTML locally; the target directory must already exist
                with open(f"./franchise/{page}.html", "w+", encoding="utf-8") as f:
                    f.write(res.text)
            else:
                # If there is no data, request the page again
                print(f"Page {page} request failed, retrying")
                self.get_detail(page)

    def run(self):
        pagesize = self.get_pagesize()
        # For testing, you can temporarily set pagesize = 20
        for page in range(1, pagesize):
            self.get_detail(page)
            print(f"Page {page} captured!")

if __name__ == '__main__':
    s = SSS()
    s.run()
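The page-count logic inside get_pagesize amounts to a ceiling division. As a minimal sketch, it can be written in a single expression; page_count and per_page are names introduced here for illustration, and 10 items per page is the value hard-coded in the code above.

```python
def page_count(total, per_page=10):
    """Return how many pages are needed to list `total` items."""
    # Integer-only ceiling division: -(-a // b) rounds a / b up.
    return -(-total // per_page)


print(page_count(57041))  # 5705
print(page_count(57040))  # 5704
```

Staying with integer arithmetic avoids the float conversion that int(total / 10) performs, which only matters for very large totals.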
Secondary extraction of data

Once all the static HTML pages have been saved locally, extracting the page data is easy, since there are no anti-scraping measures to deal with.

The core technique at this stage is to read each file and extract the fixed data values with cssselect.

Using the browser's developer tools, locate the tag node that holds the data (shown in the figure below). It is enough to extract the content of the elements with class='xminfo'.

[Figure: the tag node containing the data, as seen in the developer tools]

The code below shows the core of the data extraction. Pay particular attention to the format function. Because the data is stored as a CSV file, the remove_character function is needed to handle \n and English commas.

# Data extraction class
class Analysis:
    def __init__(self):
        pass

    # Remove special characters
    def remove_character(self, origin_str):
        if origin_str is None:
            return ""
        origin_str = origin_str.replace('\n', '')
        # Swap English commas for full-width ones so they do not split CSV fields
        origin_str = origin_str.replace(',', '，')
        return origin_str

    def format(self, text):
        html = etree.HTML(text)
        # Get every project block div
        div_xminfos = html.cssselect('div.xminfo')
        for xm in div_xminfos:
            adtexts = self.remove_character(xm.cssselect('a.adtxt')[0].text)  # Ad text
            url = xm.cssselect('a.adtxt')[0].attrib.get('href')  # Detail page URL

            brands = xm.cssselect(':nth-child(2)>:nth-child(2)')[1].text  # Brand
            categorys = xm.cssselect(':nth-child(2)>:nth-child(3)>a')[0].text  # Primary category, e.g. "Restaurant"
            types = ''
            try:
                # There may be no secondary category here
                types = xm.cssselect(':nth-child(2)>:nth-child(3)>a')[1].text  # Secondary category, e.g. "snack"
            except Exception as e:
                pass
            creation = xm.cssselect(':nth-child(2)>:nth-child(6)')[0].text  # Brand founding date
            franchise = xm.cssselect(':nth-child(2)>:nth-child(9)')[0].text  # Number of franchise stores
            company = xm.cssselect(':nth-child(3)>span>a')[0].text  # Company name

            introduce = self.remove_character(xm.cssselect(':nth-child(4)>span')[0].text)  # Brand introduction
            pros = self.remove_character(xm.cssselect(':nth-child(5)>:nth-child(2)')[0].text)  # Product introduction
            investment = xm.cssselect(':nth-child(5)>:nth-child(4)>em')[0].text  # Investment amount
            # Join the fields into one CSV line
            long_str = f"{adtexts},{categorys},{types},{brands},{creation},{franchise},{company},{introduce},{pros},{investment},{url}"
            with open("./franchise_data.csv", "a+", encoding="utf-8") as f:
                f.write(long_str + "\n")

    def run(self):
        for i in range(1, 5704):
            with open(f"./franchise/{i}.html", "r", encoding="utf-8") as f:
                text = f.read()
                self.format(text)

if __name__ == '__main__':
    # Collect the data; uncomment whichever part you want to run
    # s = SSS()
    # s.run()
    # Extract the data
    a = Analysis()
    a.run()
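The extraction code replaces commas and newlines by hand so that they cannot break the CSV layout. A different technique, not the one used in this post, is to let Python's standard csv module quote such fields. The sketch below is only an illustration under that assumption: the row values are invented, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

# Invented sample row; two fields contain characters that would break naive joining.
row = ["Ad text, with a comma", "Beauty", "", "Brand\nName", "2010"]

# The buffer stands in for open(path, "a+", newline="", encoding="utf-8").
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(row)  # fields with commas or newlines are quoted automatically

# Reading the line back recovers the original five fields unchanged.
buffer.seek(0)
parsed = next(csv.reader(buffer))
print(parsed == row)
```

With this approach the original punctuation is preserved in the file, at the cost of quoted fields that a plain split on commas would no longer parse.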
When extracting HTML tags, the format function above repeatedly uses :nth-child(2). This selector matches any element that is the N-th child of its parent, regardless of the element's type, so you only need to work out the exact position of the element.


