[Python data collection] Scraping book data with Scrapy and analyzing the page encoding
2022-01-31 17:27:50 【liedmirror】
Requirements
Use Scrapy and XPath to crawl book data and store it in a database.
URL analysis
Looking at the search URL, you can see that it takes two parameters:
| Parameter name | Meaning |
|---|---|
| key | Search keyword |
| page_index | Page number |
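As a minimal sketch of how these two parameters combine into a request URL, the helper below builds the search URL with the standard library. The function name `build_search_url` is hypothetical; the base URL is the one used by the spider in this post.

```python
from urllib.parse import urlencode

BASE_URL = "http://search.dangdang.com/"


def build_search_url(key, page_index=1):
    """Return the search URL for a keyword and a page number."""
    query = urlencode({"key": key, "page_index": page_index})
    return f"{BASE_URL}?{query}"


print(build_search_url("Python", 2))
```

Note that `urlencode` percent-encodes non-ASCII keywords as UTF-8 by default. Whether the site expects a different encoding for Chinese keywords is not covered in this post, so treat that part as untested.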
Encoding
When parsing the page, the request captured through packet capture shows that the site is rendered statically. However, the response preview shows garbled text instead of readable Chinese.
In this case, check the headers: the response headers show that the site returns its content encoded as GBK. The response body therefore has to be decoded as GBK; otherwise you get the same garbled text as in the preview.
This can be done with response.body.decode('gbk').
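To see why the codec matters, here is a small standalone sketch that does not touch the network. It encodes a sample string as GBK, the way the server is said to do, and then decodes the bytes once with the wrong codec and once with the right one. The sample text is made up for illustration.

```python
text = "图书"                 # sample Chinese text ("books")
raw = text.encode("gbk")      # bytes as a GBK-encoding server would send them

# Decoding with the wrong codec produces unreadable characters.
garbled = raw.decode("utf-8", errors="replace")
# Decoding with the codec named in the response headers restores the text.
restored = raw.decode("gbk")

print(repr(garbled))
print(restored == text)
```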
Parsing
Inspecting the page shows that the book information is laid out as ul -> li.
When parsing, first locate the ul node, then iterate over its child li nodes, each of which holds one book, and finally run relative sub-queries on each li to extract the individual fields:

    import scrapy

    # Assumes the project's items.py defines Session1Item with the fields used below.
    from ..items import Session1Item


    class BookSpider(scrapy.Spider):
        name = 'book'
        # allowed_domains takes bare domain names, not full URLs.
        allowed_domains = ['search.dangdang.com']
        url = 'http://search.dangdang.com/?key={}&page_index={}'
        keyword = 'Python'  # search keyword (the key parameter)
        index = 1           # current page (the page_index parameter)

        def start_requests(self):
            yield scrapy.Request(url=self.url.format(self.keyword, self.index), callback=self.parse)

        def parse(self, response):
            try:
                # The site returns GBK-encoded content, so decode the raw bytes explicitly.
                html = response.body.decode('gbk')
                selector = scrapy.Selector(text=html)
                # Each li under the element with class "bigimg" holds one book.
                books = selector.xpath("//*[@class='bigimg']/li")
                for book in books:
                    item = Session1Item()
                    item['bTitle'] = book.xpath(".//p[@class='name']/a/@title").extract_first()
                    item['bAuthor'] = book.xpath(".//p[@class='search_book_author']//span[1]//a/text()").extract_first()
                    item['bPublisher'] = book.xpath(".//p[@class='search_book_author']//span[3]//a/text()").extract_first()
                    # default='' avoids calling replace() on None when the date is missing.
                    item['bDate'] = book.xpath(".//p[@class='search_book_author']//span[2]/text()").extract_first(default='').replace('/', '')
                    item['bPrice'] = book.xpath(".//p[@class='price']/span[@class='search_now_price']/text()").extract_first()
                    item['bDetail'] = book.xpath(".//p[@class='detail']/text()").extract_first()
                    yield item
            except Exception as e:
                self.logger.error(e)

            self.index += 1
            if self.index > 3:
                # Only crawl the first three pages.
                return
            yield scrapy.Request(url=self.url.format(self.keyword, self.index), callback=self.parse)

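The locate-then-iterate pattern in parse can be tried offline on a tiny hand-written snippet. The sketch below uses the standard library's `xml.etree.ElementTree` instead of Scrapy's Selector, purely so that it runs without Scrapy installed. ElementTree supports only a limited XPath subset and needs well-formed markup, so it is not a replacement for the spider. The `SAMPLE` markup and the `parse_books` helper are made up for illustration; only the class names mirror the XPath expressions above.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the real page: one ul, one li per book.
SAMPLE = """
<div>
  <ul class="bigimg">
    <li>
      <p class="name"><a title="Book A">Book A</a></p>
      <p class="price"><span class="search_now_price">10.00</span></p>
    </li>
    <li>
      <p class="name"><a title="Book B">Book B</a></p>
      <p class="price"><span class="search_now_price">20.00</span></p>
    </li>
  </ul>
</div>
"""


def parse_books(markup):
    """Locate the ul, walk its li children, then query each li relatively."""
    root = ET.fromstring(markup.strip())
    books = []
    for li in root.findall(".//ul[@class='bigimg']/li"):
        link = li.find(".//p[@class='name']/a")
        price = li.find(".//p[@class='price']/span[@class='search_now_price']")
        books.append({
            "bTitle": link.get("title") if link is not None else None,
            "bPrice": price.text if price is not None else None,
        })
    return books


for book in parse_books(SAMPLE):
    print(book)
```

Each sub-query starts with `.` so that it is evaluated relative to the current li rather than the whole document, which is the same reason the spider's field expressions start with `.//`.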
Writing to the database is done by passing each item to pipelines. This has been implemented several times in previous experiments, so the earlier code is reused directly and is not shown here (the final result was shown as a screenshot in the original post).
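Since the pipeline code is not shown, here is a rough sketch of what such a pipeline could look like. It is not the author's code: the choice of SQLite, the class name `SQLiteBookPipeline`, the table name `books`, and the default file name are all assumptions. Only the field names come from the spider, and `open_spider`, `process_item`, and `close_spider` are the hook names Scrapy calls on a pipeline.

```python
import sqlite3


class SQLiteBookPipeline:
    """Hypothetical pipeline that stores each item as one row in SQLite."""

    # Field names taken from the spider's item.
    FIELDS = ("bTitle", "bAuthor", "bPublisher", "bDate", "bPrice", "bDetail")

    def __init__(self, db_path="books.db"):
        self.db_path = db_path  # assumed file name
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: connect and create the table.
        self.conn = sqlite3.connect(self.db_path)
        columns = ", ".join(f"{name} TEXT" for name in self.FIELDS)
        self.conn.execute(f"CREATE TABLE IF NOT EXISTS books ({columns})")

    def process_item(self, item, spider):
        # Called for every yielded item: insert it with bound parameters.
        values = [item.get(name) for name in self.FIELDS]
        placeholders = ", ".join("?" for _ in self.FIELDS)
        self.conn.execute(f"INSERT INTO books VALUES ({placeholders})", values)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.conn is not None:
            self.conn.close()
```

A pipeline like this only runs once it is listed in the project's `ITEM_PIPELINES` setting; the module path to use there depends on the project layout, which this post does not show.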
Copyright notice
Author: liedmirror. Please include the original link when reprinting. Thank you.
https://en.pythonmana.com/2022/01/202201311727476199.html