
[Python Data Collection] Scraping Book Data with Scrapy and Analyzing the Encoding

2022-01-31 17:27:50 liedmirror

This is day 14 of my participation in the November writing challenge. For event details, see: 2021 Last Writing Challenge


Use Scrapy and XPath to crawl book data and store it in a database.

URL analysis

Analyzing the URL shows that two parameters are needed:

Parameter name    Meaning
key               Search keyword
page_index        Page number
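
As a rough illustration of how the two parameters combine into a request URL, here is a minimal sketch. The base URL is a placeholder (the real search endpoint is not given in this text), and build_search_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real search endpoint is not given in the text.
BASE_URL = "https://example.com/search"


def build_search_url(key: str, page_index: int) -> str:
    """Return the search URL for a keyword and a 1-based page number."""
    query = urlencode({"key": key, "page_index": page_index})
    return f"{BASE_URL}?{query}"


# URLs for the first three result pages of the keyword "Python"
urls = [build_search_url("Python", page) for page in range(1, 4)]
```

Building the query string with urlencode keeps the keyword properly escaped if it contains spaces or other special characters.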


Page parsing: from the request captured with the browser's network tools, we can see that the site is rendered statically. However, the response we get back looks like a pile of garbled text:

In this case, check the headers. The response headers show that the site returns its content in GBK encoding, so the response has to be decoded as GBK. Otherwise you get the same garbled text as in the preview above.

This is done with response.body.decode('gbk').


Looking at the page structure, we can see that the book information is laid out as ul -> li.


  When parsing, first locate the ul node, then iterate over its children to get the li node for each book, and finally run sub-queries on each li to extract the individual fields:

import scrapy

# Assumes the standard Scrapy project layout, with Session1Item defined in the project's items.py
from ..items import Session1Item


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['']
    url = '{}&page_index={}'
    title = 'Python'  # search keyword
    index = 1  # current page number

    def start_requests(self):
        yield scrapy.Request(url=self.url.format(self.title, self.index), callback=self.parse)

    def parse(self, response):
        try:
            # The site returns GBK-encoded bytes, so decode them explicitly
            html = response.body.decode('gbk')
            selector = scrapy.Selector(text=html)
            # Each li under the 'bigimg' list holds one book
            books = selector.xpath("//*[@class='bigimg']/li")
            for book in books:
                item = Session1Item()
                item['bTitle'] = book.xpath(".//p[@class='name']/a/@title").extract_first()
                item['bAuthor'] = book.xpath(".//p[@class='search_book_author']//span[1]//a/text()").extract_first()
                item['bPublisher'] = book.xpath(".//p[@class='search_book_author']//span[3]//a/text()").extract_first()
                item['bDate'] = book.xpath(".//p[@class='search_book_author']//span[2]/text()").extract_first().replace('/', '')
                item['bPrice'] = book.xpath(".//p[@class='price']/span[@class='search_now_price']/text()").extract_first()
                item['bDetail'] = book.xpath(".//p[@class='detail']/text()").extract_first()
                yield item
        except Exception as e:
            # Log the error and move on to the next page
            self.logger.error(e)
        self.index += 1
        if self.index > 3:
            # Limit the crawl to the first three pages
            return
        yield scrapy.Request(url=self.url.format(self.title, self.index), callback=self.parse)

  Writing to the database is done by passing each item to pipelines. This has been implemented many times in previous experiments, so the earlier code is reused directly and is not shown here. The final result is the scraped book data saved in the database.
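
The pipeline code itself is not included in this post. For readers who have not seen the earlier experiments, here is a minimal sketch of what such a pipeline could look like. It assumes SQLite via the standard library; the database engine, the table name books, and the class name SQLiteBookPipeline are all assumptions and not the author's actual code. Only the field names come from the spider above.

```python
import sqlite3

# Field names taken from the spider; everything else here is an assumption
FIELDS = ("bTitle", "bAuthor", "bPublisher", "bDate", "bPrice", "bDetail")


class SQLiteBookPipeline:
    """Hypothetical pipeline that stores each scraped book in a SQLite table."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection, create the table
        self.conn = sqlite3.connect(self.db_path)
        columns = ", ".join(f"{name} TEXT" for name in FIELDS)
        self.conn.execute(f"CREATE TABLE IF NOT EXISTS books ({columns})")

    def process_item(self, item, spider):
        # Called for every item the spider yields; missing fields become NULL
        placeholders = ", ".join("?" for _ in FIELDS)
        values = [item.get(name) for name in FIELDS]
        self.conn.execute(f"INSERT INTO books VALUES ({placeholders})", values)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.conn.close()
```

In a real project the class would be enabled through the ITEM_PIPELINES setting in settings.py, and db_path would point to a file rather than the in-memory database used as the default here.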


Copyright notice
Author: liedmirror. Please include the original link when reprinting. Thank you.
