
The 20th of 120 Python Crawlers: franchise data collection from 1637.com (Yilu Business Opportunity Network)

2022-02-01 13:06:54 Dream eraser

「This is the 27th day of my participation in the November More-Text Challenge. Check out the event details: 2021 Last More-Text Challenge」

The following case focuses on collecting basic data for franchise recruitment; the beauty industry is chosen as the target industry.

This case collects data with a combination of lxml and cssselect, with a focus on cssselect selectors.

Target site analysis

The target of this crawl is http://www.1637.com/. The site has multiple categories, and storing the categories in a list in advance makes later expansion easier. It then turned out that the primary-industry filter offers a "no limit" option that returns every category, so we can simply grab all the data locally first and then filter out the beauty-industry franchise records afterwards.

The amount of data and the number of pages captured this time are shown in the figure below.

We grab the data using the old method: first save the HTML pages to local disk, then process them in a second pass.

Technical points used

Requests are made with requests, and data extraction is implemented with lxml + cssselect. Before using cssselect, install the corresponding library via pip install cssselect.

Once installation is complete, there are two ways to use it in code. The first adopts the CSSSelector class, as follows:

from lxml import etree
from lxml.cssselect import CSSSelector

# Somewhat like compiling a regular expression: build a CSS selector object first
sel = CSSSelector('#div_total>em', translator="html")
# Then apply the selector to an Element object (res is a requests.Response fetched earlier)
element = sel(etree.HTML(res.text))
print(element[0].text)

The above style suits building selectors in advance and is easier to extend. Alternatively, you can call the cssselect method directly on an element, as in the following code:

# Use a cssselect selector to pick the em tag
# (element here is an lxml Element, e.g. etree.HTML(res.text))
div_total = element.cssselect('#div_total>em')
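To see that the two styles are equivalent, here is a minimal self-contained sketch; the HTML string is made up for illustration:

from lxml import etree
from lxml.cssselect import CSSSelector

html = etree.HTML('<div id="div_total">A total of <em>57041</em> projects</div>')

# Style 1: a prebuilt CSSSelector object, applied to the parsed tree
sel = CSSSelector('#div_total>em', translator="html")
print(sel(html)[0].text)  # 57041

# Style 2: the cssselect method called directly on the element
print(html.cssselect('#div_total>em')[0].text)  # 57041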

Whichever of the two approaches you use, the string in parentheses, #div_total>em, is the focus of our study. It is written in CSS selector syntax. If you know some front-end development, it is easy to master; if not, no problem, just remember what follows.

CSS selectors

Suppose there is the following HTML code:

<div class="totalnum" id="div_total"> common <em>57041</em> A project </div>

Here class and id are both attribute values of the HTML tag. Generally a class value can appear on multiple elements in a page, while an id can appear only once.

To select the div tag with a CSS selector, either #div_total or .totalnum will do. The key point: when selecting by id, the leading symbol is #; when selecting by class, the leading symbol is a dot (.). Sometimes a tag carries other attributes, which CSS selectors can also express. Modify the HTML code as follows:

<div class="totalnum" id="div_total" custom="abc">  common <em>57041</em> A project  </div>

Write the following test code, paying attention to how the CSS selector inside CSSSelector is written, namely div[custom="abc"] em.

from lxml import etree
from lxml.cssselect import CSSSelector

sel = CSSSelector('div[custom="abc"] em', translator="html")
element = sel(etree.HTML('<div class="totalnum" id="div_total" custom="abc">A total of <em>57041</em> projects</div>'))
print(element[0].text)

The selector above also illustrates another concept: child and descendant selectors. In #div_total>em there is a > symbol between #div_total and em, which selects em elements that are direct children of the element with id=div_total. If you remove the > and write #div_total em instead, it selects every em element among all descendants (children, grandchildren, and so on) of id=div_total.
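The difference can be demonstrated with a minimal sketch (the nested snippet below is made up for illustration):

from lxml import etree

html = etree.HTML(
    '<div id="div_total"><em>child</em><span><em>grandchild</em></span></div>'
)
# '>' matches direct children only: one em element
print(len(html.cssselect('#div_total>em')))   # 1
# A space matches all descendants: both em elements
print(len(html.cssselect('#div_total em')))   # 2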

With a basic grasp of the above, you can write your own cssselect code.

Code time

The capture approach used in this case: first grab the HTML pages to local disk, then parse the local files. The collection code is therefore relatively simple; only the total page count needs to be fetched dynamically. The code below highlights the internal logic of the get_pagesize function.

import requests
from lxml import etree
import random
import time
import os


class SSS:
    def __init__(self):
        self.start_url = 'http://xiangmu.1637.com/p1.html'
        self.url_format = 'http://xiangmu.1637.com/p{}.html'
        self.session = requests.Session()
        self.headers = self.get_headers()
        # Directory for the saved HTML pages
        os.makedirs("./franchise", exist_ok=True)

    def get_headers(self):
        # This function can be found in earlier posts of this series
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua,
            "referer": "https://www.baidu.com"
        }
        return headers

    def get_pagesize(self):

        with self.session.get(url=self.start_url, headers=self.headers, timeout=5) as res:
            if res.text:
                element = etree.HTML(res.text)
                # Use a cssselect selector to pick the em tag
                div_total = element.cssselect('#div_total>em')
                # Read the text inside the em tag (div_total[0].text) and convert it to an integer
                total = int(div_total[0].text)
                # Compute the number of pages (10 items per page)
                pagesize = int(total / 10) + 1
                # print(pagesize)
                # If the total is an exact multiple of 10, no extra page is needed
                if total % 10 == 0:
                    pagesize = int(total / 10)

                return pagesize
            else:
                return None

    def get_detail(self, page):
        with self.session.get(url=self.url_format.format(page), headers=self.headers, timeout=5) as res:
            if res.text:
                with open(f"./franchise/{page}.html", "w+", encoding="utf-8") as f:
                    f.write(res.text)
            else:
                # No data returned: request the page again
                print(f"Page {page} request failed, retrying")
                self.get_detail(page)

    def run(self):
        pagesize = self.get_pagesize()
        # For testing, you can temporarily set pagesize = 20
        for page in range(1, pagesize + 1):
            self.get_detail(page)
            time.sleep(2)
            print(f"Page {page} captured!")


if __name__ == '__main__':
    s = SSS()
    s.run()
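A side note on the page-count arithmetic in get_pagesize: the add-one-then-correct logic can be expressed more compactly with math.ceil. A minimal equivalent sketch:

import math

def page_count(total, per_page=10):
    # One extra page only when there is a remainder
    return math.ceil(total / per_page)

print(page_count(57041))  # 5705
print(page_count(57040))  # 5704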

Testing shows that if you don't throttle requests, your IP gets blocked easily, i.e. no data comes back. Adding proxies solves this; a brief sketch follows. If you are only interested in the data, you can download the HTML archive directly from the download address; the decompression password is cajie.
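For reference, requests accepts a proxies dictionary on each request. A minimal sketch; the proxy address below is a placeholder, not a working endpoint:

import requests

# Placeholder address; substitute a proxy you actually have access to
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}
res = requests.get("http://xiangmu.1637.com/p1.html",
                   proxies=proxies, timeout=5)
print(res.status_code)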


Second-pass data extraction

Once all the static HTML has been crawled to local disk, extracting the page data is easy; after all, there is no anti-crawling to deal with.

The core techniques at this point are reading files and extracting fixed data values via cssselect.

Using the browser developer tools, locate the tag node that holds the data: each project block is a node with class='xminfo', and we just need to extract its contents.


The following code centers on the data-extraction method; the format function is the focus. Because the data is stored as a CSV file, the remove_character function is needed to strip \n characters and replace ASCII commas.

# Data extraction class (uses the imports from the collection code above)
class Analysis:
    def __init__(self):
        pass

    # Remove special characters
    def remove_character(self, origin_str):
        if origin_str is None:
            return
        origin_str = origin_str.replace('\n', '')
        # Replace ASCII commas with full-width commas so the CSV columns stay intact
        origin_str = origin_str.replace(',', '，')
        return origin_str

    def format(self, text):
        html = etree.HTML(text)
        # Get the div of every project block
        div_xminfos = html.cssselect('div.xminfo')
        for xm in div_xminfos:
            adtexts = self.remove_character(xm.cssselect('a.adtxt')[0].text)  # Ad tagline
            url = xm.cssselect('a.adtxt')[0].attrib.get('href')  # Detail-page address

            brands = xm.cssselect(':nth-child(2)>:nth-child(2)')[1].text  # Brand name
            categorys = xm.cssselect(':nth-child(2)>:nth-child(3)>a')[0].text  # Primary category, e.g. "Catering"
            types = ''
            try:
                # There may be no secondary category
                types = xm.cssselect(':nth-child(2)>:nth-child(3)>a')[1].text  # Secondary category, e.g. "Snacks"
            except Exception:
                pass
            creation = xm.cssselect(':nth-child(2)>:nth-child(6)')[0].text  # Brand founding year
            franchise = xm.cssselect(':nth-child(2)>:nth-child(9)')[0].text  # Number of franchise stores
            company = xm.cssselect(':nth-child(3)>span>a')[0].text  # Company name

            introduce = self.remove_character(xm.cssselect(':nth-child(4)>span')[0].text)  # Brand introduction
            pros = self.remove_character(xm.cssselect(':nth-child(5)>:nth-child(2)')[0].text)  # Products and business description
            investment = xm.cssselect(':nth-child(5)>:nth-child(4)>em')[0].text  # Investment amount
            # Concatenate the fields into one CSV row
            long_str = f"{adtexts},{categorys},{types},{brands},{creation},{franchise},{company},{introduce},{pros},{investment},{url}"
            with open("./franchise_data.csv", "a+", encoding="utf-8") as f:
                f.write(long_str + "\n")

    def run(self):
        # Adjust the upper bound to the number of pages actually saved
        for i in range(1, 5704):
            with open(f"./franchise/{i}.html", "r", encoding="utf-8") as f:
                text = f.read()
                self.format(text)


if __name__ == '__main__':
    # To collect the data, uncomment the corresponding part
    # s = SSS()
    # s.run()
    # Extract the data
    a = Analysis()
    a.run()
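A design note: the comma replacement in remove_character is only needed because rows are joined with an f-string. Python's standard csv module quotes fields automatically, which makes that step unnecessary; a minimal sketch of the alternative (with made-up field values):

import csv

row = ["Some brand", "Catering, snacks", "57041"]  # embedded commas are safe
with open("./franchise_data.csv", "a+", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    # csv.writer wraps fields containing commas in quotes
    writer.writerow(row)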

When extracting from the HTML, the code above repeatedly uses :nth-child(2). This selector matches the Nth child of its parent element regardless of the element's type, so you only need to know the element's exact position.
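A minimal sketch of that behavior (the HTML snippet is made up):

from lxml import etree

html = etree.HTML('<div><a>first</a><span>second</span><a>third</a></div>')
# :nth-child(2) matches the second child of its parent, whatever its tag
print(html.cssselect('div>:nth-child(2)')[0].text)  # second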

Collection time

Code download address: codechina.csdn.net/hihell/pyth… — could you give it a Star?

Now that you're here, how about leaving a comment, giving a like, and bookmarking before you go?

Today is day 200 / 200 of continuous writing. You can follow me, and like, comment on, and bookmark my posts.

Copyright notice
Author: Dream eraser. Please include a link to the original when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202011306530122.html
