Archiving 100 GB of pictures overnight with Python, just in case the website disappears
2022-01-30 06:48:36 【Dream eraser】
This article is part of the 「Digging Force Star Program」, competing for a creative gift bag and the creation incentive fund.
Twenty lines of code to become the tech circle's go-to cos-picture person
Scraping 100 GB of coser pictures with Python
The goal of this post
Crawl target
- Target data source: www.cosplay8.com/pic/chinaco… , yet another cos website that could easily vanish from the Internet; to preserve the data, we archive it now.
Python modules used
- requests, re, os
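The snippets below all rely on a shared setup that the post never shows. A minimal sketch follows, assuming a plain `headers` dict; the exact User-Agent string is an assumption, any common browser string works:

```python
import os
import re

import requests

# The post never shows its headers dict; a browser-style User-Agent
# (this exact string is an assumption) keeps the site from rejecting
# obviously scripted requests.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/96.0.4664.110 Safari/537.36"
    )
}
```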
Key learning content
- Today's focus is pagination inside the detail pages, a technique not covered in previous posts; watch for it as the code comes together.
List page and detail page analysis
The developer tools make it easy to locate the tags holding the target data.
Click any image to open its detail page; the target image is displayed one per page.
<a href="javascript:dPlayNext();" id="infoss">
<img src="/uploads/allimg/210601/112879-210601143204.jpg" id="bigimg" width="800" alt="" border="0" /></a>
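Before writing the full crawler, the bigimg extraction can be sanity-checked on a one-line stub. Note that the regex used later in get_detail expects single quotes around src, while developer tools often normalize markup to double quotes; the stub below assumes the raw page source uses single quotes:

```python
import re

# Stub of the detail-page image tag (illustrative, not copied from the site)
html = "<img src='/uploads/allimg/210601/112879-210601143204.jpg' id='bigimg' width='800'>"
match = re.search(r"<img src='(.*?)' id='bigimg'", html)
print(match.group(1))
# → /uploads/allimg/210601/112879-210601143204.jpg
```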
The URL generation rules for the list pages and detail pages are as follows:
List page: http://www.cosplay8.com/pic/chinacos/list_{category}_{page}.html
Detail page: the first page of a set has no serial number; pages 2 and up insert _{n} before the .html suffix
Note that the first detail page carries no serial number _1, so while the crawler is on that page it must both read the total page count and save the first page's picture.
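The numbering rule above can be sketched as a small helper. The sample URL is hypothetical, but the stem/suffix logic matches the one used later in get_detail:

```python
def detail_page_urls(first_url, page_size):
    """Build every page URL of one photo set.

    The first page carries no serial number, so it is kept as-is;
    pages 2..page_size insert "_{i}" before the ".html" suffix.
    """
    stem = first_url[: first_url.rindex(".")]  # everything before ".html"
    return [first_url] + [f"{stem}_{i}.html" for i in range(2, page_size + 1)]

# Hypothetical detail URL with a 3-page set:
urls = detail_page_urls("http://www.cosplay8.com/pic/chinacos/12345.html", 3)
print(urls[1], urls[2])
# → http://www.cosplay8.com/pic/chinacos/12345_2.html http://www.cosplay8.com/pic/chinacos/12345_3.html
```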
Code time
The target site groups its images into categories: domestic cos, overseas cos, Hanfu circle, and Lolita. The category can therefore be entered at run time, so you choose which source to crawl.
def run(category, start, end):
    # Generate the list pages to crawl
    wait_url = [
        f"http://www.cosplay8.com/pic/chinacos/list_{category}_{i}.html"
        for i in range(int(start), int(end) + 1)]
    print(wait_url)
    url_list = []
    for item in wait_url:
        # the get_list function is shown below
        ret = get_list(item)
        print(f"Captured {len(ret)} items")
        url_list.extend(ret)


if __name__ == "__main__":
    # e.g. http://www.cosplay8.com/pic/chinacos/list_22_2.html
    category = input("Enter the category number: ")
    start = input("Enter the start page: ")
    end = input("Enter the end page: ")
    run(category, start, end)
The code above first builds the target URLs from user input, then hands each one to the get_list function, whose code follows:
def get_list(url):
    """Collect the links to every detail page on one list page"""
    all_list = []
    res = requests.get(url, headers=headers)
    html = res.text
    pattern = re.compile('<li><a href="(.*?)">')
    all_list = pattern.findall(html)
    return all_list
The regular expression <li><a href="(.*?)"> matches every detail-page address on a list page, and the function returns them all at once.
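The pattern can be exercised on a minimal fragment shaped like the list page (the markup below is illustrative, not copied from the site). The captured paths come back relative, which is why run later prefixes http://www.cosplay8.com:

```python
import re

html = (
    '<li><a href="/pic/chinacos/12345.html">set one</a></li>'
    '<li><a href="/pic/chinacos/12346.html">set two</a></li>'
)
pattern = re.compile('<li><a href="(.*?)">')
print(pattern.findall(html))
# → ['/pic/chinacos/12345.html', '/pic/chinacos/12346.html']
```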
Now extend the run function to fetch each detail page's picture material and save the captured images.
def run(category, start, end):
    # List pages to crawl
    wait_url = [
        f"http://www.cosplay8.com/pic/chinacos/list_{category}_{i}.html"
        for i in range(int(start), int(end) + 1)]
    print(wait_url)
    url_list = []
    for item in wait_url:
        ret = get_list(item)
        print(f"Captured {len(ret)} items")
        url_list.extend(ret)
    print(url_list)
    # print(len(url_list))
    for url in url_list:
        get_detail(f"http://www.cosplay8.com{url}")
Because the matched detail-page addresses are relative, they are formatted into full addresses before being requested. The get_detail function is as follows:
def get_detail(url):
    # Request the detail page
    res = requests.get(url=url, headers=headers)
    # Set the encoding
    res.encoding = "utf-8"
    # Grab the page source
    html = res.text
    # Page count; the first picture is saved from this same page
    # (the label on the live site is the Chinese "共N页")
    size_pattern = re.compile(r'<span>共(\d+)页: </span>')
    # Title pattern; some posts differ, so the regex was widened
    # title_pattern = re.compile('<title>(.*?)-Cosplay中国</title>')
    title_pattern = re.compile('<title>(.*?)-Cosplay(中国|8)</title>')
    # Image pattern
    first_img_pattern = re.compile("<img src='(.*?)' id='bigimg'")
    try:
        # Try to match the page count
        page_size = size_pattern.search(html).group(1)
        # Try to match the title
        title = title_pattern.search(html).group(1)
        # Try to match the image address
        first_img = first_img_pattern.search(html).group(1)
        print(f"The URL holds {page_size} pages:", title, first_img)
        # Build the save path
        path = f'images/{title}'
        # Create it if it does not exist
        if not os.path.exists(path):
            os.makedirs(path)
        # Save the first picture
        save_img(path, title, first_img, 1)
        # Request the remaining pictures; enumerate starts at 2 so
        # their indexes do not collide with the first picture above
        urls = [f"{url[0:url.rindex('.')]}_{i}.html" for i in range(2, int(page_size) + 1)]
        for index, child_url in enumerate(urls, start=2):
            try:
                res = requests.get(url=child_url, headers=headers)
                html = res.text
                first_img = first_img_pattern.search(html).group(1)
                save_img(path, title, first_img, index)
            except Exception as e:
                print("Subpage grab failed:", e)
    except Exception as e:
        print(url, e)
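The three patterns in get_detail can be checked against a stub of a detail page's head. The live site labels the page count in Chinese (共N页) and ends titles with -Cosplay中国; the English translation of this post garbled those literals, so the Chinese forms below are a reconstruction, and the stub values are illustrative:

```python
import re

# Stub detail-page head (illustrative values)
html = (
    "<title>Hanfu set-Cosplay中国</title>"
    "<span>共5页: </span>"
    "<img src='/uploads/allimg/210601/112879-210601143204.jpg' id='bigimg'"
)
page_size = re.search(r'<span>共(\d+)页: </span>', html).group(1)
title = re.search(r'<title>(.*?)-Cosplay(中国|8)</title>', html).group(1)
first_img = re.search(r"<img src='(.*?)' id='bigimg'", html).group(1)
print(page_size, title, first_img)
# → 5 Hanfu set /uploads/allimg/210601/112879-210601143204.jpg
```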
The core logic of the code above is documented in its comments. One part worth a closer look is the title regex. The first attempt was:
<title>(.*?)-Cosplay中国</title>
It turned out not to match every post, so it was widened to:
<title>(.*?)-Cosplay(中国|8)</title>
The only missing piece is save_img, whose code is as follows:
def save_img(path, title, first_img, index):
    try:
        # Request the picture itself
        img_res = requests.get(f"http://www.cosplay8.com{first_img}", headers=headers)
        img_data = img_res.content
        with open(f"{path}/{title}_{index}.png", "wb+") as f:
            f.write(img_data)
    except Exception as e:
        print(e)
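One caveat: save_img always writes a .png extension even though the site serves .jpg files. A sketch of just the filename logic, deriving the suffix from the image URL instead (the network call is omitted for brevity, and save_img_keep_ext is a hypothetical name, not part of the original script):

```python
import os

def save_img_keep_ext(path, title, img_url, index):
    """Return a save path whose extension matches the image URL."""
    ext = os.path.splitext(img_url)[1] or ".jpg"  # fall back if the URL has no suffix
    return f"{path}/{title}_{index}{ext}"

print(save_img_keep_ext("images/demo", "demo",
                        "/uploads/allimg/210601/112879-210601143204.jpg", 1))
# → images/demo/demo_1.jpg
```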
Copyright notice
Author: Dream eraser. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201300648352733.html