
Python Crawlers from Getting Started to Diving In: Full Tutorial Series (Detailed Tutorials + Hands-On Practice)

2022-01-31 19:38:53 ruochen

「 This is day 17 of my participation in the November More-Text Challenge. Check out the event details: 2021 Last More-Text Challenge 」

Crawler preparation

  • Reference material
    • Python Web Data Collection (Turing series)
    • Mastering the Python Crawler Framework Scrapy (Posts and Telecommunications Press)
    • Python 3 web crawlers
    • The official Scrapy tutorial
  • Prerequisite knowledge
    • URLs
    • The HTTP protocol
    • Web front end: HTML, CSS, JS
    • AJAX
    • re, XPath (a short extraction sketch follows this list)
    • XML
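  • Supplementary sketch (not from the original tutorial): the two extraction tools listed above, re and XPath, applied to a tiny hard-coded HTML snippet. XPath support here comes from the third-party lxml package (pip install lxml), which is an assumption of this sketch, not something the tutorial sets up.
    '''
    Sketch: extracting data with re and XPath (lxml) from a hard-coded snippet
    '''
    import re
    from lxml import etree

    html = ('<html><body>'
            '<a href="/jobs/1">Python crawler engineer</a>'
            '<a href="/jobs/2">Data analyst</a>'
            '</body></html>')

    # Regular expression: pull out every href attribute
    print(re.findall(r'href="([^"]+)"', html))     # ['/jobs/1', '/jobs/2']

    # XPath: pull out the text of every <a> element
    tree = etree.HTML(html)
    print(tree.xpath('//a/text()'))                # ['Python crawler engineer', 'Data analyst']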

About crawlers

  • Definition: a web crawler (also known as a web spider or web robot; in the FOAF community it is more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Less common names include ant, auto-indexer, emulator, and worm.

  • Two characteristics
    • It can download the data or content the author needs
    • It can automatically move across the network by following links
  • Three steps (a minimal end-to-end sketch follows this list)
    • Download a web page
    • Extract the desired information
    • Following certain rules, automatically jump to another web page and repeat the previous two steps
  • Crawler classification
    • General-purpose crawlers
    • Dedicated crawlers (focused crawlers)
  • Overview of Python networking packages
    • Python 2.x: urllib, urllib2, urllib3, httplib, httplib2, requests
    • Python 3.x: urllib, urllib3, httplib2, requests
    • Python 2: use urllib together with urllib2, or use requests
    • Python 3: urllib or requests (a short requests sketch appears after case v06)
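  • Supplementary sketch (not part of the original case studies): a minimal crawler that makes the three steps above concrete, using only the standard library. The start URL and the page limit are arbitrary placeholders.
    '''
    Sketch: download a page, extract information, follow links (placeholder URL)
    '''
    import re
    from urllib import request, parse

    def crawl(start_url, max_pages=5):
        to_visit = [start_url]
        seen = set()
        while to_visit and len(seen) < max_pages:
            url = to_visit.pop()
            if url in seen:
                continue
            seen.add(url)

            # Step 1: download the web page
            try:
                html = request.urlopen(url, timeout=10).read().decode('utf-8', 'ignore')
            except Exception as e:
                print("failed:", url, e)
                continue

            # Step 2: extract the desired information (here, just the page title)
            m = re.search(r'<title>(.*?)</title>', html, re.S)
            print(url, '->', m.group(1).strip() if m else 'no title')

            # Step 3: follow links to other pages and repeat steps 1 and 2
            for href in re.findall(r'href="([^"]+)"', html):
                to_visit.append(parse.urljoin(url, href))

    if __name__ == '__main__':
        crawl("https://www.example.com/")   # placeholder start url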

urllib

  • Modules it contains (a short error-handling / robots.txt sketch follows case v01)
    • urllib.request: open and read URLs
    • urllib.error: the exceptions raised by urllib.request; catch them with try/except
    • urllib.parse: functions for parsing URLs
    • urllib.robotparser: parse robots.txt files
    • Case study v01
    '''
    Case study v01
    Use urllib.request to request a page and print its content
    '''
    from urllib import request

    if __name__ == '__main__':
        url = "https://www.zhaopin.com/taiyuan/"
        # Open the given url and return the response
        rsp = request.urlopen(url)

        # Read the returned result
        # The content read is of type bytes
        html = rsp.read()
        print(type(html))

        # To turn the bytes content into a string, it must be decoded
        print(html.decode())
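  • Supplement (not one of the original cases): the case studies below focus on urllib.request and urllib.parse, so here is a short sketch of the other two modules listed above, catching request errors with urllib.error and checking robots.txt with urllib.robotparser. The URLs are placeholders.
    '''
    Sketch: urllib.error and urllib.robotparser (placeholder URLs)
    '''
    from urllib import request, error, robotparser

    if __name__ == '__main__':
        # Catch the exceptions defined in urllib.error with try/except
        try:
            rsp = request.urlopen("https://www.example.com/does-not-exist")
            print(rsp.getcode())
        except error.HTTPError as e:   # the server answered with an error status code
            print("HTTP error:", e.code)
        except error.URLError as e:    # network problem, bad host name, and so on
            print("URL error:", e.reason)

        # Parse robots.txt and ask whether a given path may be crawled
        rp = robotparser.RobotFileParser()
        rp.set_url("https://www.example.com/robots.txt")
        rp.read()
        print(rp.can_fetch("*", "https://www.example.com/some/page"))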
  • Dealing with web page encodings (a header-based alternative is sketched after case v02)
    • chardet can automatically detect the encoding of a page file, but it may guess wrong
    • It needs to be installed first: conda install chardet (or pip install chardet)
    • Case study v02
    '''
    Case study v02
    Use request to download a page and detect its encoding automatically
    '''
    import urllib.request
    import chardet

    if __name__ == '__main__':
        url = "http://stock.eastmoney.com/news/1407,20170807763593890.html"

        rsp = urllib.request.urlopen(url)

        html = rsp.read()

        # Let chardet detect the encoding automatically
        cs = chardet.detect(html)
        print(type(cs))
        print(cs)

        # Use dict.get with a default so decoding still works if detection fails
        html = html.decode(cs.get("encoding", "utf-8"))
        print(html)
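  • Supplement (a sketch, not from the original): because chardet can guess wrong, a common alternative is to trust the charset declared in the HTTP response headers first, via rsp.headers.get_content_charset(), and only fall back to chardet and then to utf-8.
    '''
    Sketch: prefer the charset from the response headers, fall back to chardet / utf-8
    '''
    import urllib.request
    import chardet

    def read_text(url):
        rsp = urllib.request.urlopen(url)
        raw = rsp.read()
        # rsp.headers is an email.message.Message; get_content_charset() returns
        # the charset from the Content-Type header, or None if none was declared
        encoding = rsp.headers.get_content_charset()
        if not encoding:
            encoding = chardet.detect(raw).get("encoding") or "utf-8"
        return raw.decode(encoding, errors="replace")

    if __name__ == '__main__':
        url = "http://stock.eastmoney.com/news/1407,20170807763593890.html"
        print(read_text(url)[:200])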
  • The object returned by urlopen
    • Case study v03
    # Case study v03
    import urllib.request

    if __name__ == '__main__':
        url = "http://stock.eastmoney.com/news/1407,20170807763593890.html"

        rsp = urllib.request.urlopen(url)

        print(type(rsp))
        print(rsp)

        print("URL: {0}".format(rsp.geturl()))
        print("Info: {0}".format(rsp.info()))
        print("Code: {0}".format(rsp.getcode()))

        html = rsp.read()

        # Decode the bytes into a string
        html = html.decode()
    • geturl: returns the URL of the request
    • info: the meta information (headers) returned by the server
    • getcode: returns the HTTP status code
  • Using the data parameter of a request
    • Two ways of accessing the network
      • get:
        • Pass information to the server through URL parameters
        • Put the parameters in a dict, then encode them with the parse module (more parse helpers are sketched after case v04)
        • Case study v04
         # Case study v04
         from urllib import request, parse

         '''
         Learn the right way to encode url parameters
         The parse module is needed for this
         '''

         if __name__ == '__main__':
             url = "http://www.baidu.com/s?"
             wd = input("Input your keyword: ")

             # To use data, it must be put in a dictionary
             qs = {
                 "wd": wd
             }

             # Encode it for use in a url
             qs = parse.urlencode(qs)
             print(qs)

             fullurl = url + qs
             print(fullurl)

             # Using a readable url with raw (unencoded) parameters directly does not work
             # fullurl = "http://www.baidu.com/s?wd=panda"

             rsp = request.urlopen(fullurl)

             html = rsp.read()

             # Decode the bytes into a string
             html = html.decode()

             print(html)
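        • Supplement (a sketch, not one of the original cases): besides urlencode, the urllib.parse module mentioned earlier also contains helpers for taking URLs apart; the query string below is just the percent-encoded form of the Chinese word for panda.
         '''
         Sketch: a few more urllib.parse helpers
         '''
         from urllib import parse

         if __name__ == '__main__':
             full = "http://www.baidu.com/s?wd=%E7%86%8A%E7%8C%AB&pn=10"

             # Split a url into its components
             parts = parse.urlsplit(full)
             print(parts.scheme, parts.netloc, parts.path, parts.query)

             # Turn the query string back into a dict (values come back as lists)
             print(parse.parse_qs(parts.query))

             # Percent-encode / decode a single value
             print(parse.quote("hello world"))           # hello%20world
             print(parse.unquote("%E7%86%8A%E7%8C%AB"))  # the Chinese word for panda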
      • post
        • Generally used when passing parameters to the server
        • With post, the information is carried in the request body rather than shown in the URL (it is not encrypted, though)
        • To send information with post, use the data parameter
        • Using post means the http request may need to change:
          • Content-Type: application/x-www-form-urlencoded
          • Content-Length: the length of the data
          • In short, once the request method changes, make sure the other request header fields match it
        • urllib.parse.urlencode can automatically convert a dict into a string in the format above
        • Case study v05
         '''
         Case study v05
         Use the parse module to simulate a post request
         Analyze Baidu's translation service
         Analysis steps:
         1. Open F12 (the browser developer tools)
         2. Try typing the word girl; every letter typed triggers a request
         3. The request address is https://fanyi.baidu.com/sug
         4. Look at Network - All - Headers and find that the FormData value is kw:girl
         5. Check the format of the returned content; it is json ==> the json package is needed
         '''

         from urllib import request, parse
         # Module responsible for handling the json format
         import json

         '''
         The general flow is:
         1. Build the content with data, then open it with urlopen
         2. A result in json format is returned
         3. The result should be the meaning of girl
         '''

         baseurl = 'https://fanyi.baidu.com/sug'

         # The data used to simulate the form must be in dict format
         data = {
             # girl is the English word to translate; it should come from user input,
             # but it is hard-coded here
             'kw': 'girl'
         }

         # The parse module is needed to encode the data
         data = parse.urlencode(data).encode('utf-8')
         print(type(data))

         # A request header should be constructed; it should at least contain the
         # length of the data being sent
         # request requires the headers to be passed as a dict
         headers = {
             # Because a post request is used, it should at least contain the
             # content-length field
             'Content-Length': len(data)
         }

         # With headers, data and url we can try to make the request
         # (note: urlopen itself cannot take custom headers; they are used with
         #  request.Request in case v06)
         rsp = request.urlopen(baseurl, data=data)

         json_data = rsp.read().decode('utf-8')
         print(type(json_data))
         print(json_data)

         # Convert the json string into a dict
         json_data = json.loads(json_data)
         print(type(json_data))
         print(json_data)

         for item in json_data['data']:
             print(item['k'], "--", item['v'])
         Output (the dictionary definitions are returned in Chinese; they are shown here translated into English):
         <class 'bytes'>
         <class 'str'>
         {"errno":0,"data":[{"k":"girl","v":"n. \u5973\u5b69; \u59d1\u5a18; \u5973\u513f; \u5e74\u8f7b\u5973\u5b50; \u5973\u90ce;"},{"k":"girls","v":"n. \u5973\u5b69; \u59d1\u5a18; \u5973\u513f; \u5e74\u8f7b\u5973\u5b50; \u5973\u90ce;  girl\u7684\u590d\u6570;"},{"k":"girlfriend","v":"n. \u5973\u670b\u53cb; \u5973\u60c5\u4eba; (\u5973\u5b50\u7684)\u5973\u4f34\uff0c\u5973\u53cb;"},{"k":"girl friend","v":" \u672a\u5a5a\u59bb; \u5973\u6027\u670b\u53cb;"},{"k":"Girls' Generation","v":" \u5c11\u5973\u65f6\u4ee3\uff08\u97e9\u56fdSM\u5a31\u4e50\u6709\u9650\u516c\u53f8\u4e8e2007\u5e74\u63a8\u51fa\u7684\u4e5d\u540d\u5973\u5b50\u5c11\u5973\u7ec4\u5408\uff09;"}]}
         <class 'dict'>
         {'errno': 0, 'data': [{'k': 'girl', 'v': 'n. girl; young lady; daughter; young woman; young girl;'}, {'k': 'girls', 'v': 'n. girl; young lady; daughter; young woman; young girl; plural of girl;'}, {'k': 'girlfriend', 'v': 'n. girlfriend; female lover; (of a woman) female companion, female friend;'}, {'k': 'girl friend', 'v': 'fiancée; female friend;'}, {'k': "Girls' Generation", 'v': "Girls' Generation (a nine-member girl group launched by South Korea's SM Entertainment in 2007);"}]}
         girl -- n. girl; young lady; daughter; young woman; young girl;
         girls -- n. girl; young lady; daughter; young woman; young girl; plural of girl;
         girlfriend -- n. girlfriend; female lover; (of a woman) female companion, female friend;
         girl friend -- fiancée; female friend;
         Girls' Generation -- Girls' Generation (a nine-member girl group launched by South Korea's SM Entertainment in 2007);
        • To set more request information (for example, headers), the bare urlopen function is not convenient
        • The request.Request class is needed
        • Case study v06
         '''
         Case study v06
         The task and content are the same as v05
         This case only uses Request to implement what v05 does
         Use the parse module to simulate a post request
         Analyze Baidu's translation service
         Analysis steps:
         1. Open F12 (the browser developer tools)
         2. Try typing the word girl; every letter typed triggers a request
         3. The request address is https://fanyi.baidu.com/sug
         4. Look at Network - All - Headers and find that the FormData value is kw:girl
         5. Check the format of the returned content; it is json ==> the json package is needed
         '''

         from urllib import request, parse
         # Module responsible for handling the json format
         import json

         '''
         The general flow is:
         1. Build the content with data, then open it with urlopen
         2. A result in json format is returned
         3. The result should be the meaning of girl
         '''

         baseurl = 'https://fanyi.baidu.com/sug'

         # The data used to simulate the form must be in dict format
         data = {
             # girl is the English word to translate; it should come from user input,
             # but it is hard-coded here
             'kw': 'girl'
         }

         # The parse module is needed to encode the data
         data = parse.urlencode(data).encode('utf-8')

         # A request header should be constructed; it should at least contain the
         # length of the data being sent
         # request requires the headers to be passed as a dict
         headers = {
             # Because a post request is used, it should at least contain the
             # content-length field
             'Content-Length': len(data)
         }

         # Construct a Request instance
         req = request.Request(url=baseurl, data=data, headers=headers)

         # Because a Request instance has been constructed, all the request
         # information can be encapsulated in that instance
         rsp = request.urlopen(req)

         json_data = rsp.read().decode('utf-8')
         print(type(json_data))
         print(json_data)

         # Convert the json string into a dict
         json_data = json.loads(json_data)
         print(type(json_data))
         print(json_data)

         for item in json_data['data']:
             print(item['k'], "--", item['v'])
         Output (the dictionary definitions are returned in Chinese; they are shown here translated into English):
         <class 'str'>
         {"errno":0,"data":[{"k":"girl","v":"n. \u5973\u5b69; \u59d1\u5a18; \u5973\u513f; \u5e74\u8f7b\u5973\u5b50; \u5973\u90ce;"},{"k":"girls","v":"n. \u5973\u5b69; \u59d1\u5a18; \u5973\u513f; \u5e74\u8f7b\u5973\u5b50; \u5973\u90ce;  girl\u7684\u590d\u6570;"},{"k":"girlfriend","v":"n. \u5973\u670b\u53cb; \u5973\u60c5\u4eba; (\u5973\u5b50\u7684)\u5973\u4f34\uff0c\u5973\u53cb;"},{"k":"girl friend","v":" \u672a\u5a5a\u59bb; \u5973\u6027\u670b\u53cb;"},{"k":"Girls' Generation","v":" \u5c11\u5973\u65f6\u4ee3\uff08\u97e9\u56fdSM\u5a31\u4e50\u6709\u9650\u516c\u53f8\u4e8e2007\u5e74\u63a8\u51fa\u7684\u4e5d\u540d\u5973\u5b50\u5c11\u5973\u7ec4\u5408\uff09;"}]}
         <class 'dict'>
         {'errno': 0, 'data': [{'k': 'girl', 'v': 'n. girl; young lady; daughter; young woman; young girl;'}, {'k': 'girls', 'v': 'n. girl; young lady; daughter; young woman; young girl; plural of girl;'}, {'k': 'girlfriend', 'v': 'n. girlfriend; female lover; (of a woman) female companion, female friend;'}, {'k': 'girl friend', 'v': 'fiancée; female friend;'}, {'k': "Girls' Generation", 'v': "Girls' Generation (a nine-member girl group launched by South Korea's SM Entertainment in 2007);"}]}
         girl -- n. girl; young lady; daughter; young woman; young girl;
         girls -- n. girl; young lady; daughter; young woman; young girl; plural of girl;
         girlfriend -- n. girlfriend; female lover; (of a woman) female companion, female friend;
         girl friend -- fiancée; female friend;
         Girls' Generation -- Girls' Generation (a nine-member girl group launched by South Korea's SM Entertainment in 2007);
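  • Supplement (a sketch, not part of the original cases): the third-party requests package mentioned in the package overview makes the same post request much shorter. It must be installed first (pip install requests); that installation is an assumption of this sketch, not something the tutorial sets up.
    '''
    Sketch: the same Baidu translation post request with the requests package
    '''
    import requests

    if __name__ == '__main__':
        rsp = requests.post("https://fanyi.baidu.com/sug", data={"kw": "girl"})
        # requests decodes the body and parses the json for us
        json_data = rsp.json()
        for item in json_data["data"]:
            print(item["k"], "--", item["v"])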

Finally, you are welcome to follow my personal WeChat official account 『Little Ape Like Dust』 for more IT technology, practical knowledge and hot news.

Copyright notice
Author: ruochen. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201311938502866.html
