The 20th of 120 Python crawlers: collecting franchise data from the 1637.com business opportunity network
2022-02-01 13:06:54 【Dream eraser】
「This is the 27th day of my participation in the November More-Text Challenge. Check out the event details: 2021 Last More-Text Challenge」
The cases that follow focus on basic data collection for sales, with the beauty industry chosen as the example; please be aware of this. This case collects data with lxml combined with cssselect, and the focus is on cssselect selectors.
Target site analysis
The target of this capture is http://www.1637.com/. The site has multiple categories; when collecting, the categories are stored in a list ahead of time to make later extension easier. It then turned out that the primary industry filter offers a "No limit" option, which returns every category at once. Based on that, we first grab all the data to local files and then filter out the franchise records related to the beauty industry.
The amount of data and the number of pages captured this time are shown in the figure below.
The data is grabbed with the usual approach: save the HTML pages locally first, then process them in a second pass.
Technical points used
Requests are issued with requests, and data extraction is implemented with lxml + cssselect. Before using cssselect, install the library with pip install cssselect.
Once it is installed, there are two ways to use it in code. The first is the CSSSelector class, as follows:
from lxml.cssselect import CSSSelector
from lxml.html import etree  # also needed below; res is a requests response

# Much like compiling a regular expression, construct a CSS selector object first
sel = CSSSelector('#div_total>em', translator="html")
# Then apply it to an Element object
element = sel(etree.HTML(res.text))
print(element[0].text)
The approach above suits building selectors in advance and is easier to extend. If you don't need that, you can call the cssselect method directly instead, as in the following code:
# Select the em tag with a cssselect selector
div_total = element.cssselect('#div_total>em')
Whichever of the two approaches you use, the part inside the parentheses, #div_total>em, is what we need to study. It is written in CSS selector syntax; if you know some front-end development it is easy to master, and if not, that's no problem either, just remember the points below.
CSS selectors
Suppose we have the following snippet of HTML:
<div class="totalnum" id="div_total">Total <em>57041</em> projects</div>
Here class and id are both attribute values on the HTML tag. A class can generally appear on many elements in a page, while an id must be unique.
To select this div with a CSS selector, either #div_total or .totalnum will do. The key point: when selecting by id, the prefix symbol is #; when selecting by class, the prefix symbol is a dot (.).
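As a quick check of both forms, here is a minimal sketch (the HTML string is the snippet above; the variable names are my own):

from lxml.html import etree

doc = etree.HTML('<div class="totalnum" id="div_total">Total <em>57041</em> projects</div>')

# Select by id (# prefix) and by class (. prefix) - both hit the same div here
by_id = doc.cssselect('#div_total')
by_class = doc.cssselect('.totalnum')
print(by_id[0].get('class'), by_class[0].get('id'))  # totalnum div_total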
Elements sometimes carry other attributes as well; a CSS selector can target those too. Modify the HTML as follows:
<div class="totalnum" id="div_total" custom="abc">Total <em>57041</em> projects</div>
Write the following test code, and pay attention to the CSS selector passed to CSSSelector, namely div[custom="abc"] em.
from lxml.cssselect import CSSSelector
from lxml.html import etree

sel = CSSSelector('div[custom="abc"] em', translator="html")
element = sel(etree.HTML('<div class="totalnum" id="div_total" custom="abc">Total <em>57041</em> projects</div>'))
print(element[0].text)
The selector above also relies on another concept, the child/descendant relationship. Take #div_total>em: between #div_total and em there is a > symbol, which means select the em elements that are direct children of the element with id=div_total. If you remove the > and write #div_total em instead, it means select the em elements among all descendants (children, grandchildren, and so on) of id=div_total.
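To make the difference concrete, here is a small sketch (the nested HTML below is made up by me, not taken from the target site):

from lxml.html import etree

doc = etree.HTML('<div id="div_total"><em>direct child</em><span><em>grandchild</em></span></div>')

# > keeps only em elements that are direct children of #div_total
print([e.text for e in doc.cssselect('#div_total>em')])  # ['direct child']
# A space matches em elements at any depth below #div_total
print([e.text for e in doc.cssselect('#div_total em')])  # ['direct child', 'grandchild']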
With the above briefly under your belt, you can start writing your own cssselect code.
Code time
The capture strategy in this case is to grab the HTML pages to local files first and parse those local files afterwards, so the collection code is fairly simple; the only dynamic part is getting the total page count. The code below mainly shows the internal logic of the get_pagesize function.
import requests
from lxml.html import etree
import random
import time


class SSS:
    def __init__(self):
        self.start_url = 'http://xiangmu.1637.com/p1.html'
        self.url_format = 'http://xiangmu.1637.com/p{}.html'
        self.session = requests.Session()
        self.headers = self.get_headers()

    def get_headers(self):
        # A fuller UA list can be found in my earlier posts
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua,
            "referer": "https://www.baidu.com"
        }
        return headers

    def get_pagesize(self):
        with self.session.get(url=self.start_url, headers=self.headers, timeout=5) as res:
            if res.text:
                element = etree.HTML(res.text)
                # Select the em tag with a cssselect selector
                div_total = element.cssselect('#div_total>em')
                # Read the text inside the em tag (div_total[0].text) and convert it to an integer
                total = int(div_total[0].text)
                # Compute the number of pages (10 items per page)
                pagesize = int(total / 10) + 1
                # print(pagesize)
                # If the total is an exact multiple of 10, no extra page is needed
                if total % 10 == 0:
                    pagesize = int(total / 10)
                return pagesize
            else:
                return None

    def get_detail(self, page):
        with self.session.get(url=self.url_format.format(page), headers=self.headers, timeout=5) as res:
            if res.text:
                with open(f"./franchise_1/{page}.html", "w+", encoding="utf-8") as f:
                    f.write(res.text)
            else:
                # If no data came back, request the page again
                print(f"Page {page} request failed, retrying")
                self.get_detail(page)

    def run(self):
        pagesize = self.get_pagesize()
        # For testing you can temporarily set pagesize = 20
        # Pages run from p1 to p{pagesize}, so include the last one
        for page in range(1, pagesize + 1):
            self.get_detail(page)
            time.sleep(2)
            print(f"Page {page} captured!")


if __name__ == '__main__':
    s = SSS()
    s.run()
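As a quick sanity check before kicking off the full crawl, you can call get_pagesize on its own (a hypothetical usage of the class above):

s = SSS()
print(s.get_pagesize())  # e.g. 5705 when the site reports 57041 projects at 10 per page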
In testing, if you don't throttle the requests it is easy to get your IP restricted, i.e. no data comes back; adding proxies solves this. If you only care about the data, you can download the HTML archive directly from the download address; the decompression password is cajie.
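The proxy code itself is not part of this post; as a rough, hypothetical sketch, a proxy can be attached to the requests session like this (the address below is a placeholder, swap in one from your own proxy pool):

import requests

session = requests.Session()
# Placeholder proxy address - replace it with a real proxy from your pool
session.proxies.update({
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
})
# Every request made through this session is now routed via the proxy, e.g.:
# res = session.get("http://xiangmu.1637.com/p1.html", timeout=5)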
Secondary extraction of data
Once all the static HTML has been crawled, extracting the page data is easy, since there is no anti-crawling to deal with anymore. The core work at this point is reading the files and pulling the fixed data values out with cssselect.
Looking at the page with the developer tools, the tag node holding the data is as follows; we only need to extract the content of the elements with class='xminfo'.
The code below shows the core of the data extraction, with the format function as the focus. Because the data is stored as a CSV file, the remove_character function is needed to strip \n characters and English commas.
# Data extraction class (lives in the same file as SSS above, so it reuses
# the "from lxml.html import etree" import at the top)
class Analysis:
    def __init__(self):
        pass

    # Remove special characters
    def remove_character(self, origin_str):
        if origin_str is None:
            return
        origin_str = origin_str.replace('\n', '')
        # Replace English commas with full-width commas so they don't break the CSV columns
        origin_str = origin_str.replace(',', '，')
        return origin_str

    def format(self, text):
        html = etree.HTML(text)
        # Get the div of every project block
        div_xminfos = html.cssselect('div.xminfo')
        for xm in div_xminfos:
            adtexts = self.remove_character(xm.cssselect('a.adtxt')[0].text)  # Advertising copy
            url = xm.cssselect('a.adtxt')[0].attrib.get('href')  # Detail page address
            brands = xm.cssselect(':nth-child(2)>:nth-child(2)')[1].text  # Brand name
            categorys = xm.cssselect(':nth-child(2)>:nth-child(3)>a')[0].text  # Primary category, e.g. "Restaurant"
            types = ''
            try:
                # A secondary category may not exist
                types = xm.cssselect(':nth-child(2)>:nth-child(3)>a')[1].text  # Secondary category, e.g. "Snack"
            except Exception as e:
                pass
            creation = xm.cssselect(':nth-child(2)>:nth-child(6)')[0].text  # Brand founding time
            franchise = xm.cssselect(':nth-child(2)>:nth-child(9)')[0].text  # Number of franchise stores
            company = xm.cssselect(':nth-child(3)>span>a')[0].text  # Company name
            introduce = self.remove_character(xm.cssselect(':nth-child(4)>span')[0].text)  # Brand introduction
            pros = self.remove_character(xm.cssselect(':nth-child(5)>:nth-child(2)')[0].text)  # Product introduction
            investment = xm.cssselect(':nth-child(5)>:nth-child(4)>em')[0].text  # Investment amount
            # Concatenate one CSV row
            long_str = f"{adtexts},{categorys},{types},{brands},{creation},{franchise},{company},{introduce},{pros},{investment},{url}"
            with open("./franchise_data.csv", "a+", encoding="utf-8") as f:
                f.write(long_str + "\n")

    def run(self):
        # Adjust the folder name to wherever the HTML pages were saved
        for i in range(1, 5704):
            with open(f"./franchise/{i}.html", "r", encoding="utf-8") as f:
                text = f.read()
                self.format(text)


if __name__ == '__main__':
    # Run whichever part you need by removing the comments
    # Collect the data
    # s = SSS()
    # s.run()
    # Extract the data
    a = Analysis()
    a.run()
When extracting from the HTML, the code above repeatedly uses selectors like :nth-child(2). This selector matches the Nth child of its parent element regardless of the element's type, so all you need to do is find the element's exact position.
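If you want to convince yourself of how :nth-child behaves, here is a tiny sketch (the HTML fragment is my own, not from the target site):

from lxml.html import etree

row = etree.HTML('<div><a>first</a><span>second</span><em>third</em></div>')

# :nth-child(2) matches the second child of its parent, whatever its tag is
print(row.cssselect('div>:nth-child(2)')[0].text)  # second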
Collection time
Code download address: codechina.csdn.net/hihell/pyth… Could you give it a Star?
You've made it this far, so how about a comment, a like, or a bookmark?
Today is day 200 of 200 in my continuous writing streak. Feel free to follow me, like, comment on, and bookmark my posts.
Copyright notice
Author: Dream eraser. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202011306530122.html