Python crawler lesson 2-9: the Chinese monster database, and the "lust" monster category discovered while running it
2022-01-31 09:44:00 【Dream eraser】
「This post is part of my participation in the November writing challenge. Check out the event details: 2021 Last Writing Challenge」
I recently got hold of a good book, 《Chinese Monster Stories (Complete Edition)》, and suddenly thought that building a website collecting Chinese monsters would be very interesting. Hence this article.
Analysis before writing the crawler
When writing a crawler, much of the time you first find a target site, analyze it, and then work out how to get the data you want. But there is also the situation we have today: you stumble on an idea, decide it is a good one, grab some basic data for it, and then build a website around it with PHP, Java, or similar languages; with luck it might even pull in decent traffic.
The data to capture today is Chinese monsters. Besides compiling them yourself, the key is to find a data-source website, so I opened Baidu and searched. Sure enough, whatever the Eraser (dream.blog.csdn.net/) can think of, someone has already thought of first…
There aren't many monster-related sites, but there is one: Know the Demon. This website has really done the interesting work of cataloguing monsters, so a thumbs-up here to the folks who had the idea earlier than I did.
Now that we've found the target site, the rest of the work is relatively simple. Let's start the analysis.
First, let's see whether the amount of data is complete. A line I often write in my blog is: "As long as the human eye can see the data, a crawler can capture it." The site is maintained by an individual, so the data is fairly comprehensive, though not huge: roughly 130 pages in total. Whenever I see something like a "last page" link, I know the site is worth crawling.
Getting the pagination URL pattern
Click through pages 1 and 2 and you can work out the basic pagination rule:
https://www.cbaigui.com/page/4
https://www.cbaigui.com/page/3
https://www.cbaigui.com/page/130
As you can see from the addresses above, the page number is simply a number appended to the URL.
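Based on that rule, here is a minimal sketch for generating every list-page URL; the total of 130 pages is the rough count mentioned earlier, not a confirmed figure.
# Minimal sketch: build all list-page URLs from the observed pattern.
# 130 is the approximate page count noted above, not an exact number.
base_url = "https://www.cbaigui.com/page/{}"
page_urls = [base_url.format(n) for n in range(1, 131)]
print(page_urls[:3])  # print the first three URLs to sanity-check the pattern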
Writing the regular expressions
The goal this time is the monster data, and a certain amount of redundant data is acceptable while crawling, so let's analyze the page elements directly and see which data is worth taking.
The area shown in the red box above holds the core data of the list page. There are really only two values to grab here: the title, and the link behind the title. We grab the link so that we can fetch the detail page, which is the tag area shown in the red box below. Why grab the tags at all? That varies from person to person; I mainly want them so the monsters can be classified accordingly.
If you want the data to be more complete, you can also grab some information from the header area of the detail page, which includes the dynasty and the origin of each monster.
With the analysis complete, the hardest part of the work (from the Eraser's point of view) is done. What remains is writing the code and grabbing the data.
Writing the crawler
One thing to pay attention to here: the site appears to belong to an individual developer, so we should limit the crawl speed. Crawling too fast is bad for the website.
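One simple way to keep the crawl gentle (a sketch of my own, not necessarily the author's exact approach) is to pause a random one to three seconds between requests:
import random
import time

def polite_pause(low=1.0, high=3.0):
    # Sleep a random interval so requests don't hammer a personally maintained site.
    time.sleep(random.uniform(low, high))
Calling polite_pause() after each request does the same job as the fixed time.sleep(1) used in the code below.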
Now let's start coding.
First, you can draft the regular expressions with a regex testing tool. Once the regex part is written, a large share of the code is effectively done.
This page needs two regular expressions. The first one matches the post title together with its link:
<h2 class="post-title">[.\s]*<a href="(.*?)" rel="bookmark">(.*?)</a>
The second one extracts the tag names from a detail page:
<a href=".*?" rel="tag">(.*?)</a>
This column series does not demand strictly rigorous regular expressions; as long as they work and are easy to use, that is enough.
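To see what the two patterns actually capture, here is a small self-contained test against a hypothetical HTML snippet that mimics the structure described above (the snippet is invented for illustration, not copied from the site):
import re

# Hypothetical snippet shaped like the list page: an <h2 class="post-title"> block
# followed by a tag link. The text and the tag URL here are made up.
sample = '''<h2 class="post-title">
    <a href="https://www.cbaigui.com/post-18153.html" rel="bookmark">Example monster</a>
</h2>
<a href="#" rel="tag">Example tag</a>'''

title_pattern = re.compile(r'<h2 class="post-title">[.\s]*<a href="(.*?)" rel="bookmark">(.*?)</a>')
tag_pattern = re.compile(r'<a href=".*?" rel="tag">(.*?)</a>')

print(title_pattern.findall(sample))  # [('https://www.cbaigui.com/post-18153.html', 'Example monster')]
print(tag_pattern.findall(sample))    # ['Example tag']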
Here is part of the code. The core is already complete; the rest is up to you!
import requests
import re
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'}


def get_tags(url):
    # Fetch a detail page and print the tag names found on it
    res = requests.get(url, headers=headers)
    pattern = re.compile(
        r'<a href=".*?" rel="tag">(.*?)</a>')
    tags = pattern.findall(res.text)
    print(tags)


def get_list(page):
    # Fetch one list page and extract (link, title) pairs
    url_format = "https://www.cbaigui.com/page/{page}"
    url = url_format.format(page=page)
    res = requests.get(url, headers=headers)
    pattern = re.compile(
        r'<h2 class="post-title">[.\s]*<a href="(.*?)" rel="bookmark">(.*?)</a>')
    items = pattern.findall(res.text)
    for item in items:
        get_tags(item[0])
        time.sleep(1)  # pause between detail-page requests to stay polite


if __name__ == "__main__":
    total = int(input("Please enter the maximum page number: "))
    for i in range(1, total + 1):  # crawl pages 1..total
        get_list(i)
    # get_tags("https://www.cbaigui.com/post-18153.html")
After running it, one of the categories that came back was "Lust"? What on earth?
Curiosity got the better of me, so I found the link, clicked through, and had a good look at the related entries. Very rewarding.
Crawler chat after class
You can finish the code yourself; what remains is mainly data storage, for example writing the results into a CSV file. Once the crawler has collected the data, you'll find plenty of fun in it. This case, for instance, quietly added quite a bit of knowledge to my head.
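As a hint for the storage step, here is a minimal CSV-writing sketch of my own (not the author's finished code); it assumes you adapt get_tags/get_list above to return rows of (title, link, tags) instead of printing them:
import csv

# Example rows as the adapted crawler might produce them: (title, link, "tag1|tag2").
rows = [
    ("Example monster", "https://www.cbaigui.com/post-18153.html", "Example tag"),
]

# utf-8-sig keeps Chinese text readable when the CSV is opened in Excel.
with open("monsters.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "link", "tags"])  # header row
    writer.writerows(rows)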
Copyright notice
Author: [Dream eraser]. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201310943583144.html