
"I just want to collect some plain photos in Python for machine learning," he said. "I believe you a ghost!"

2022-01-31 21:00:23 Dream eraser

「This is day 18 of my participation in the November More-Text Challenge. For event details, see: The Last More-Text Challenge of 2021」

Today he crawled thousands of no-makeup photos from a blind-date channel. His story? He browses the dating platform, collects plain-faced photos, and trains a machine learning model with them. Who would believe that? They are all wearing makeup anyway.

Here is what you will get from this article:

  1. Nearly ten thousand face photos;
  2. A first look at the lxml parsing library;
  3. A first look at XPath syntax;
  4. Cookie-based anti-scraping;
  5. A girlfriend (maybe, as a bonus).

Collecting blind-date girls' profile photos from 19lou with Python

Starting with this post, you enter the second mini-stage of the 120 Crawler Examples series: building crawlers with requests + lxml.

After the first 10 examples you should already be familiar with requests, so we build on it and add a new parsing library, lxml. The library is mainly used to parse XML and HTML, it is very efficient, and once you start using it you can say goodbye to the hassle of writing regular expressions.
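
Before diving in, here is a minimal sketch of what the requests + lxml combination looks like. It is only a warm-up example: the URL (httpbin.org/html) is a harmless demo page, not this article's target.

import requests
from lxml import etree

# Fetch any HTML page (httpbin.org/html serves a small demo document)
res = requests.get("https://httpbin.org/html")
# Parse the returned source into an Element object
html = etree.HTML(res.text)
# Extract every <h1> text node with an XPath expression instead of a regex
print(html.xpath("//h1/text()"))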

Target data source analysis

The target website

The target of this crawl is the girls' blind-date channel on 19lou's Beijing section; as of July 1 the channel was still being updated.

www.19lou.com/r/1/19lnsxq…

Here is a screenshot from the website; if it infringes on anyone's rights, please contact Dream Eraser promptly~

[Screenshot of the girls' blind-date channel on 19lou]

The targets of this crawl are the profile pictures shown above; each image file is saved under the post title as its name.
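
Because the post title becomes the file name, it may contain characters that file systems reject. A tiny, hypothetical helper (not part of the article's final code) can strip them before saving:

import re

def safe_filename(title: str) -> str:
    # Replace characters that Windows/Linux file names do not allow
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()

print(safe_filename('Looking for Mr. Right: 25/170'))  # Looking for Mr. Right_ 25_170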

Python modules used

  • requests module
  • lxml module
  • fake_useragent module

Key learning content

A first look at the lxml module.

List page analysis

This crawl works entirely from the list pages, whose URLs follow this pattern:

https://www.19lou.com/r/1/19lnsxq.html
https://www.19lou.com/r/1/19lnsxq_2.html
https://www.19lou.com/r/1/19lnsxq-233.html
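
As a sketch, such a list of page URLs can be generated in batch like this (assuming the hyphenated pattern; the exact page range is up to you, and the complete code below uses its own range):

# Build the list of list-page URLs to crawl
urls = ["https://www.19lou.com/r/1/19lnsxq.html"]
urls += [f"https://www.19lou.com/r/1/19lnsxq-{i}.html" for i in range(2, 234)]
print(len(urls), urls[:3])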

The image tag looks like the screenshot below. The extraction work is handed to the lxml module, though for extra practice you can still write a first version with the re module (a sketch follows below).

[Screenshot: the image tag in the page source]
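
As mentioned above, a first version of the extraction can still be written with the re module. A minimal sketch, assuming the tag structure used in the complete code below (a div with class pics wrapping an img that carries data-src and alt attributes):

import re

html = '<div class="pics"><img src="dot.gif" data-src="https://example.com/1.jpg" alt="a title"></div>'
# Capture the real image address (data-src) and the title (alt) in one pass
pattern = re.compile(r'<div class="pics"><img[^>]*?data-src="(.*?)"[^>]*?alt="(.*?)"', re.S)
for src, title in pattern.findall(html):
    print(src, title)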

lxml basics

Install the library ahead of time with pip install lxml.

Importing the library and its basic usage:

from lxml import etree

text = "<ul><li><a href='/girl/1'>post title</a></li></ul>"  # a bit of HTML source
# Parse the HTML source into an object that supports XPath queries
html = etree.HTML(text)
# Extract data with an XPath expression
print(html.xpath('//li/a'))

The comment in the code above mentions an XPath object. XPath is a language for locating information in XML/HTML documents: with its specific syntax you can pull data out of HTML. For the basics you can refer to www.w3school.com.cn/xpath/xpath…; the best way to learn is to look things up as you use them.
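
A few commonly used XPath expressions, demonstrated on a tiny made-up snippet (purely illustrative, not the target site's real markup):

from lxml import etree

html = etree.HTML('<div class="pics"><img data-src="/1.jpg" alt="t1"/></div><p><a href="/x">link</a></p>')
print(html.xpath("//div[@class='pics']"))  # nodes whose class attribute equals 'pics'
print(html.xpath("//img/@data-src"))       # attribute values -> ['/1.jpg']
print(html.xpath("//a/text()"))            # text content -> ['link']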

The requirements, in order, are:

  1. Generate the list-page URLs to crawl in batch;
  2. Request the target data with requests;
  3. Extract the target data with lxml;
  4. Save the pictures.

Code time

While coding, to avoid being flagged by anti-scraping checks straight away, add a wait between requests to limit the crawl speed (I later found the site does not rate-limit by IP, so you can remove it).

If the following error appears while coding, just update fake_useragent:

raise FakeUserAgentError('Maximum amount of retries reached')

The update command is as follows:

pip install -U fake-useragent

If it still fails, I suggest writing your own function that returns a random User-Agent; a sketch follows below.
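
A minimal sketch of such a fallback: a hypothetical helper that simply picks a string from a small hand-maintained pool (extend the list as you like):

import random

# A hand-maintained pool of common desktop User-Agent strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
]

def random_ua() -> str:
    # Return a random User-Agent instead of relying on fake_useragent
    return random.choice(USER_AGENTS)

print(random_ua())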

A bit of anti-scraping

When crawling the target data, requesting the target address directly with requests returns the code below, which is not the page that holds the target data; in other words, the site uses anti-scraping measures.

Request the target address directly and the response is the code shown in the figure below; note the area in the red box.

[Screenshot: the anti-scraping response, with the Cookie value highlighted in the red box]

Analyzing the data requests gets back, a Cookie appears in the returned code. After repeated testing, this value turns out to be fixed, so it can be set directly through the requests headers parameter.
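
A quick way to confirm this behaviour (a sketch; whether the site still serves the anti-scraping stub today may differ) is to request the page with and without the fixed Cookie and compare the responses:

import requests

url = "https://www.19lou.com/r/1/19lnsxq.html"
ua = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"}

# Without the Cookie the site answers with the short anti-scraping stub
stub = requests.get(url, headers=ua)
# With the fixed Cookie value found above, the real list page comes back
real = requests.get(url, headers={**ua, "Cookie": "_Z3nY0d4C_=37XgPK9h"})
print(len(stub.text), len(real.text))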

Once you have the source code of the target page, you can extract from it with lxml, which was briefly described above. The key learning points split into two parts:

  1. First, use the etree object from the lxml module to convert the HTML source into an Element object;
  2. Then parse the Element object. The syntax used is XPath; this article uses path-based expressions, and the full-code section below contains comments for each step.

Complete code

import requests
from lxml import etree
from fake_useragent import UserAgent
import time


def save(src, title):
    try:
        res = requests.get(src)
        with open(f"imgs/{title}.jpg", "wb+") as f:
            f.write(res.content)
    except Exception as e:
        print(e)


def run(url):
    # ua = UserAgent(cache=False)
    ua = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"
    headers = {
        "User-Agent": ua,
        "Host": "www.19lou.com",
        "Referer": "https://www.19lou.com/r/1/19lnsxq-233.html",
        "Cookie": "_Z3nY0d4C_=37XgPK9h"  # the fixed value taken from the anti-scraping response
    }
    try:
        res = requests.get(url=url, headers=headers)
        text = res.text
        # convert the HTML source into an Element object
        html = etree.HTML(text)
        # XPath path extraction; @class selects nodes by their class attribute
        divs = html.xpath("//div[@class='pics']")
        # print(len(divs))
        # iterate over the Element nodes
        for div in divs:
            # extract the image address; note the attribute is data-src, not src
            src = div.xpath("./img/@data-src")[0]
            # extract the title
            title = div.xpath("./img/@alt")[0]
            save(src, title)
    except Exception as e:
        print(e)


if __name__ == '__main__':
    urls = ["https://www.19lou.com/r/1/19lnsxq.html"]
    for i in range(114, 243):
        urls.append(f"https://www.19lou.com/r/1/19lnsxq-{i}.html")
    for url in urls:
        print(f"Crawling {url}")
        run(url)
        # time.sleep(5)

    print("All pages crawled")

To improve efficiency you can cancel the 5-second wait, or even use multithreading, but only run it for a few seconds and do not over-crawl; after all, we are only here to learn. A multithreading sketch follows below.
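
If you do want to try multithreading, here is a short sketch wrapping the run() function from the complete code in a ThreadPoolExecutor (keep max_workers small and crawl politely):

from concurrent.futures import ThreadPoolExecutor

def crawl_all(urls, workers=4):
    # Hand each list-page URL to the thread pool; run() is defined in the complete code above
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pool.map(run, urls)

# crawl_all(urls)  # call this instead of the plain for-loop in __main__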

There is another important point in the code above: in the source code returned by the server, the image's src attribute is dot.gif (a loading placeholder), while the data-src attribute holds the real image address.

The comparison is shown in the figure below: the upper picture is the page source as rendered in the browser, the lower one is the source returned directly by the server.

The crawling tip of this section: always parse and extract data from the source code the server actually returns, not from the DOM the browser renders.

[Screenshot: browser-rendered page source (top) vs. source returned directly by the server (bottom)]
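
One way to see exactly what the server returns (rather than what the browser renders) is simply to dump res.text to a file and inspect it; a tiny sketch:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36",
    "Cookie": "_Z3nY0d4C_=37XgPK9h",
}
res = requests.get("https://www.19lou.com/r/1/19lnsxq.html", headers=headers)
# Save the raw server response so it can be compared with the browser's "view source"
with open("server_source.html", "w", encoding="utf-8") as f:
    f.write(res.text)
print("data-src occurrences:", res.text.count("data-src"))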

Showing off the crawl results

That wraps up example 11 of the 120 Crawler Examples series. I hope this post brought you some new surprises and knowledge. The related materials can be obtained directly below.

[Screenshot: a sample of the crawled images]

Full code download address: codechina.csdn.net/hihell/pyth…, No. 11.

Various learning datasets were produced during the crawl; if you only need the data, you can grab it from the download channel~

The crawled resources are for learning purposes only; anything infringing will be deleted on request.

Copyright notice
Author: Dream eraser. Please include a link to the original when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201312100211138.html
