
[Python data collection] Scrapy implements image download

2022-01-31 06:45:44 liedmirror

「This is day 8 of my participation in the November Update Challenge. For event details, see: 2021 Last Update Challenge.」


Scrapy is a framework built specifically for writing crawlers. Its data processing all happens in the Pipelines layer, so by modifying the Pipelines layer you can save images or perform other data processing operations.

Basic settings

Before you start crawling, you need to make a few settings first, for example:

  1. Set the default request headers;
  2. Set the download path (you need to add the IMAGES_STORE field yourself; it is not in the generated settings file);
  3. Enable the pipeline (just uncomment the existing lines).

The code is as follows :

# Turn off robots.txt validation
ROBOTSTXT_OBEY = False

# Set the default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Set the download path (add IMAGES_STORE yourself; it is not in the original file)
import os
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')

# Enable the pipeline
ITEM_PIPELINES = {
    'session_2.pipelines.Session2Pipeline': 300,
}

# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'session_2.middlewares.Session2DownloaderMiddleware': 543,
}

Writing the Pipeline

To download images, our pipeline needs to inherit from the Scrapy framework's ImagesPipeline. Scrapy encapsulates the image download operation in this pipeline (downloads run concurrently, so speed is not a concern).

Because Scrapy's encapsulation is so complete, we only need to override the get_media_requests method and hand each image URL to the download machinery via yield Request(item['url']):

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class Session2Pipeline(ImagesPipeline):
    # Inherits from ImagesPipeline; only get_media_requests is overridden
    def get_media_requests(self, item, info):
        yield Request(item['url'])

After the Scrapy crawler starts, images are saved under the IMAGES_STORE/full directory, according to the IMAGES_STORE setting configured above.
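Scrapy names each saved file after a hash of its URL: by default, ImagesPipeline stores an image as the SHA-1 hex digest of the request URL, with a .jpg extension, under the full/ subdirectory. The helper below is a minimal stdlib-only sketch of that naming scheme (the function name and the store_dir default are my own, for illustration):

```python
import hashlib
import os

def default_image_path(url: str, store_dir: str = 'images') -> str:
    # Mirror ImagesPipeline's default naming:
    # <IMAGES_STORE>/full/<sha1-of-url>.jpg
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return os.path.join(store_dir, 'full', image_guid + '.jpg')

print(default_image_path('https://example.com/cat.png'))
```

If you need different filenames, the real ImagesPipeline lets you override its file_path method in the same subclass.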


Passing the URL

The item needs at least a url field (other fields are not relevant to this tutorial and are omitted):

import scrapy

class Session2Item(scrapy.Item):
    # define the fields for your item here like:
    url = scrapy.Field()

In the main crawler function, use yield Session2Item(url=img) to pass the URL to the pipeline. For the parsing part, you can use CSS selectors, XPath, or regular expressions; choose according to your proficiency and preference. I won't go into detail here:

yield Session2Item(url=img)
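As one concrete option for the parsing step, here is a minimal sketch using the re approach mentioned above: a plain regex that pulls the src attribute out of img tags. The helper name and sample HTML are hypothetical; in the real spider you would loop over the result and yield Session2Item(url=img) for each match:

```python
import re

def extract_image_urls(html: str) -> list:
    # Grab the src attribute of every <img> tag (double-quoted attributes only).
    return re.findall(r'<img[^>]+src="([^"]+)"', html)

html = '<div><img src="https://example.com/a.jpg"><img src="https://example.com/b.png"></div>'
print(extract_image_urls(html))
```

Inside the spider's parse method this becomes: for img in extract_image_urls(response.text): yield Session2Item(url=img).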


Copyright notice
Author: liedmirror. Please include a link to the original when reprinting. Thank you.
