
[Python data collection] Downloading images with Scrapy

2022-01-31 06:45:44 liedmirror


Preface

Scrapy is a framework built specifically for writing crawlers. Its data-processing logic lives in the pipelines layer, so by customizing a pipeline you can save images or perform any other processing on the scraped data.

Basic settings

Before you start crawling, a few settings are needed, for example:

  1. Set the default request headers;
  2. Set the download path (you need to add the IMAGES_STORE field yourself; it is not in settings.py by default);
  3. Enable the pipeline (just uncomment the ITEM_PIPELINES block).

The code is as follows :

# settings.py
# Disable robots.txt checking
ROBOTSTXT_OBEY = False

# Default request headers
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

# Download path (add IMAGES_STORE yourself; it is not generated by default)
import os
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')

# Enable the pipeline
ITEM_PIPELINES = {
   'session_2.pipelines.Session2Pipeline': 300,
}

# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'session_2.middlewares.Session2DownloaderMiddleware': 543,
}
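The IMAGES_STORE expression above climbs two directory levels up from the settings file, so images land next to the project package rather than inside it. A small stdlib-only sketch of what that expression produces (the example path is an illustrative assumption):

```python
import os

def images_store_for(settings_path):
    # Two os.path.dirname() calls climb from <project>/<package>/settings.py
    # up to the project root, then append an "images" directory.
    return os.path.join(os.path.dirname(os.path.dirname(settings_path)), "images")

# e.g. for a project laid out as /home/user/session_2/session_2/settings.py
print(images_store_for("/home/user/session_2/session_2/settings.py"))
# -> /home/user/session_2/images
```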

Writing the pipeline

To download images, our pipeline needs to inherit from Scrapy's ImagesPipeline, which already encapsulates the image-download logic (downloads run concurrently, so speed is not a concern).

Because Scrapy's implementation is so complete, we only need to override the get_media_requests method and pass each image URL to the downloader with yield Request(item['url']):

# pipelines.py
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class Session2Pipeline(ImagesPipeline):
    # Inherit from ImagesPipeline to reuse its download machinery
    def get_media_requests(self, item, info):
        print(item['url'])
        yield Request(item['url'])

After the Scrapy crawler starts, images are saved under the IMAGES_STORE/full directory, based on the IMAGES_STORE path set in settings.py.
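Inside that full/ directory, ImagesPipeline names each file by the SHA1 hex digest of the image request URL and saves it with a .jpg extension. A stdlib-only sketch reproducing that naming scheme (an illustration of the default behavior, not Scrapy's actual code):

```python
import hashlib
import os

def default_image_path(url, images_store="images"):
    # Mirrors ImagesPipeline's default naming: SHA1 of the request URL,
    # stored as <IMAGES_STORE>/full/<sha1>.jpg
    guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(images_store, "full", guid + ".jpg")

print(default_image_path("https://example.com/cat.png"))
```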


Passing the URL

The item needs at least a url field (other fields are irrelevant to this tutorial and are omitted):

# items.py
import scrapy

class Session2Item(scrapy.Item):
    # define the fields for your item here like:
    url = scrapy.Field()

In the spider itself, use yield Session2Item(url=img) to pass the URL to the pipeline. For the parsing step you can use CSS selectors, XPath, or regular expressions, whichever suits your proficiency and preference; that part is not repeated here:

# main.py
yield Session2Item(url=img)


copyright notice
author[liedmirror]. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201310645423211.html
