Batch-crawling cat pictures with Python to realize thousand-image imaging
2022-01-30 22:39:28 【Lucifer thinks twice before acting】
Pet cats with code! This article is an entry in the 【Meow Star essay solicitation activity】.
Preface
Use Python to crawl pictures of cats, then assemble thousands of them into one big cat image!
Crawling cat pictures
This article uses Python 3.10.0, which can be downloaded directly from the official website: www.python.org.
Installing and configuring Python is not covered in detail here; a quick search online turns up plenty of tutorials!
1、Crawling an image material website
Site to crawl: cat pictures on huiyi8.com
First, install the required libraries:
pip install BeautifulSoup4
pip install requests
pip install urllib3
pip install lxml
The crawler code:
from bs4 import BeautifulSoup
import requests
import os

# URL of the first page of cat pictures
url = 'https://www.huiyi8.com/tupian/tag-%E7%8C%AB%E5%92%AA/1.html'
# Image save path; the r prefix means the string is not escaped
path = r"/Users/lpc/Downloads/cats/"
# Create the save directory if it does not exist yet
if not os.path.exists(path):
    os.mkdir(path)

# Build the URLs of all cat-picture listing pages
def allpage():
    all_url = []
    # Walk 19 pages
    for i in range(1, 20):
        # Swap in the page number; url[-6] is the sixth character from the end,
        # the '1' in '1.html' (this only works because '1' occurs nowhere else in the URL)
        each_url = url.replace(url[-6], str(i))
        # Collect every generated URL
        all_url.append(each_url)
    # Return all page URLs
    return all_url

# Main entry point
if __name__ == '__main__':
    # Get all listing-page URLs
    img_url = allpage()
    for url in img_url:
        # Fetch the page source
        requ = requests.get(url)
        req = requ.text.encode(requ.encoding).decode()
        html = BeautifulSoup(req, 'lxml')
        # List for the matching <img> tags
        img_urls = []
        # Inspect every <img> tag in the page
        for img in html.find_all('img'):
            # Keep only src values that start with http and end with jpg
            if img["src"].startswith('http') and img["src"].endswith("jpg"):
                # Add the matching tag to img_urls
                img_urls.append(img)
        # Loop over all collected tags
        for k in img_urls:
            # The picture URL
            img = k.get('src')
            # The picture name from the alt attribute, cast to str in case it is missing
            name = str(k.get('alt'))
            # Full file name for the picture
            file_name = path + name + '.jpg'
            # Download the cat picture from its URL
            with open(file_name, "wb") as f, requests.get(img) as res:
                f.write(res.content)
            # Print the crawled picture
            print(img, file_name)
Note: the code above cannot be run exactly as copied; the download path /Users/lpc/Downloads/cats must first be changed to a local path of your own!
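As a small hedged sketch (not part of the original code): the save directory can be created in one step with os.makedirs, and the page URLs can be built with str.format instead of the somewhat fragile url.replace(url[-6], ...) trick:

import os

# Assumed values; substitute your own save directory
path = "/Users/lpc/Downloads/cats/"
os.makedirs(path, exist_ok=True)  # no error if the directory already exists

# Build the 19 listing-page URLs explicitly instead of replacing url[-6]
base = "https://www.huiyi8.com/tupian/tag-%E7%8C%AB%E5%92%AA/{}.html"
all_url = [base.format(i) for i in range(1, 20)]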
The crawl succeeds:
346 cat pictures fetched!
2、Crawling the ZOL website
Site to crawl: ZOL's adorable cat wallpapers
The crawler code:
import requests
import os
from lxml import etree

# The listing page to request
url = 'https://desk.zol.com.cn/dongwu/mengmao/1.html'
# Image save path; the r prefix means the string is not escaped
path = r"/Users/lpc/Downloads/ZOL/"
# Create the save directory if it does not exist yet
if not os.path.exists(path):
    os.mkdir(path)

# Request headers
headers = {
    "Referer": "http://desk.zol.com.cn/dongman/1920x1080/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
}
headers2 = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36 SE 2.X MetaSr 1.0",
}

def allpage():  # Build all listing-page URLs
    all_url = []
    for i in range(1, 4):  # Number of pages to walk
        each_url = url.replace(url[-6], str(i))  # Swap in the page number
        all_url.append(each_url)
    return all_url  # Return the list of page URLs

# Fetch each HTML page and parse it
if __name__ == '__main__':
    img_url = allpage()  # Build the page list
    for url in img_url:
        # Send the request
        resq = requests.get(url, headers=headers)
        # Show whether the request succeeded
        print(resq)
        # Parse the returned page
        html = etree.HTML(resq.text)
        # Get the detail-page URL behind every <a class="pic"> tag
        hrefs = html.xpath('.//a[@class="pic"]/@href')
        # Follow each link to the full-resolution picture
        for i in range(len(hrefs)):  # (the original started at 1 and skipped the first link)
            # Request the detail page
            resqt = requests.get("https://desk.zol.com.cn" + hrefs[i], headers=headers)
            # Parse it
            htmlt = etree.HTML(resqt.text)
            srct = htmlt.xpath('.//img[@id="bigImg"]/@src')
            # Cut the picture name out of the URL
            imgname = srct[0].split('/')[-1]
            # Fetch the picture itself
            img = requests.get(srct[0], headers=headers2)
            # Write the picture to a file
            with open(path + imgname, "wb") as file:
                file.write(img.content)
            # Print the crawled picture
            print(img, imgname)
The crawl succeeds:
81 cat pictures fetched!
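All three crawlers in this article download pictures one at a time. One of the references at the end covers multi-threaded downloading; as a hedged sketch of the same idea (the save directory and URL list below are placeholders, not from this article), the per-picture download loop can be parallelized with concurrent.futures:

from concurrent.futures import ThreadPoolExecutor
import os
import requests

save_dir = "/tmp/cats"  # placeholder save directory
os.makedirs(save_dir, exist_ok=True)

def download(img_url):
    # Fetch one picture and save it under its original file name
    name = img_url.split("/")[-1]
    res = requests.get(img_url, timeout=10)
    with open(os.path.join(save_dir, name), "wb") as f:
        f.write(res.content)

urls = []  # fill with the picture URLs collected by any of the crawlers
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(download, urls)  # download up to 8 pictures concurrently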
3、Crawling Baidu Images
Site to crawl: Baidu's cat pictures
1、The first crawler:
import requests
import os

path = r"/Users/lpc/Downloads/baidu1/"
# Create the save directory if it does not exist yet
if not os.path.exists(path):
    os.mkdir(path)

page = input('How many pages should be crawled: ')
page = int(page) + 1
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
n = 0
pn = 1
# pn is the offset of the first result; Baidu Images returns 30 results per request by default
for m in range(1, page):
    url = 'https://image.baidu.com/search/acjson?'
    param = {
        'tn': 'resultjson_com',
        'logid': '7680290037940858296',
        'ipn': 'rj',
        'ct': '201326592',
        'is': '',
        'fp': 'result',
        'queryWord': '猫',  # the search keyword, "cat"
        'cl': '2',
        'lm': '-1',
        'ie': 'utf-8',
        'oe': 'utf-8',
        'adpicid': '',
        'st': '-1',
        'z': '',
        'ic': '0',
        'hd': '1',
        'latest': '',
        'copyright': '',
        'word': '猫',  # the search keyword, "cat"
        's': '',
        'se': '',
        'tab': '',
        'width': '',
        'height': '',
        'face': '0',
        'istype': '2',
        'qc': '',
        'nc': '1',
        'fr': '',
        'expermode': '',
        'nojc': '',
        'acjsonfr': 'click',
        'pn': pn,  # offset of the first result of this page
        'rn': '30',
        'gsm': '3c',
        '1635752428843=': '',
    }
    page_text = requests.get(url=url, headers=header, params=param)
    page_text.encoding = 'utf-8'
    page_text = page_text.json()
    print(page_text)
    # Take the list of dictionaries that hold all the picture links
    info_list = page_text['data']
    # The last dictionary extracted this way is empty, so drop it
    del info_list[-1]
    # Collect the picture addresses in a list
    img_path_list = []
    for i in info_list:
        img_path_list.append(i['thumbURL'])
    # Download every collected address; n serves as the file name
    for img_path in img_path_list:
        img_data = requests.get(url=img_path, headers=header).content
        img_path = path + str(n) + '.jpg'
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        n = n + 1
    pn += 30  # move to the next page of 30 results (the original added 29, overlapping one picture per page)
2、The second crawler:
# -*- coding:utf-8 -*-
import requests
import re
import time
import random
import urllib.parse
from PIL import Image  # image-processing module

imgDir = r"/Volumes/DBA/python/img/"
# Rotate several request headers to avoid being blocked
# chrome, firefox, Edge
headers = [
    {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Connection': 'keep-alive'
    },
    {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Connection': 'keep-alive'
    },
    {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19041',
        'Accept-Language': 'zh-CN',
        'Connection': 'keep-alive'
    }
]
picList = []  # empty list to hold the thumbnail URLs
keyword = input("Please enter the search keyword: ")
kw = urllib.parse.quote(keyword)  # URL-encode the keyword

# Fetch one page (30 thumbnails) of Baidu image-search results; up to ~1000 in total
def getPicList(kw, n):
    global picList
    weburl = r"https://image.baidu.com/search/acjson?tn=resultjson_com&logid=11601692320226504094&ipn=rj&ct=201326592&is=&fp=result&queryWord={kw}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&word={kw}&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=1&fr=&expermode=&force=&cg=girl&pn={n}&rn=30&gsm=1e&1611751343367=".format(
        kw=kw, n=n * 30)
    req = requests.get(url=weburl, headers=random.choice(headers))
    req.encoding = req.apparent_encoding  # guard against mis-decoded Chinese
    webJSON = req.text
    imgurlReg = '"thumbURL":"(.*?)"'  # regex for the thumbnail URLs
    picList = picList + re.findall(imgurlReg, webJSON, re.DOTALL | re.I)

for i in range(150):  # deliberately generous; if there are fewer pictures, picList simply stops growing
    getPicList(kw, i)

for item in picList:
    # File suffix and name
    hz = ".jpg"
    picName = str(int(time.time() * 1000))  # millisecond timestamp as the name
    # Request the picture
    imgReq = requests.get(url=item, headers=random.choice(headers))
    # Save the picture
    with open(imgDir + picName + hz, "wb") as f:
        f.write(imgReq.content)
    # Reopen the picture with the Image module
    im = Image.open(imgDir + picName + hz)
    bili = im.width / im.height  # aspect ratio, used to resize proportionally
    # Resize so the smaller side becomes 50 pixels
    if bili >= 1:
        newIm = im.resize((round(bili * 50), 50))
    else:
        newIm = im.resize((50, round(50 * im.height / im.width)))
    # Crop the top-left 50x50 region
    clip = newIm.crop((0, 0, 50, 50))
    clip.convert("RGB").save(imgDir + picName + hz)  # save the cropped picture
    print(picName + hz + " processed")
The crawl succeeds:
In total: 1,600 cat pictures crawled from the three websites!
Thousand-image imaging
After crawling a thousand-odd pictures, the next step is to splice them into one big cat picture, that is, thousand-image imaging (a photomosaic).
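How it works, in the Python implementation of section 2 below: every thumbnail is reduced to its average RGB color, and each pixel of the target picture is replaced by the thumbnail whose average color is nearest in Euclidean distance. A tiny worked example of that distance calculation:

# Euclidean color distance between a tile's average color and a target pixel
c1 = (200, 150, 100)  # example average color of a thumbnail
c2 = (190, 160, 90)   # example color of one target pixel
dis = sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5
print(dis)  # 17.32... i.e. (10**2 + 10**2 + 10**2) ** 0.5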
1、With the Foto-Mosaik-Edda software
First download the software: Foto-Mosaik-Edda Installer. If the download fails, just search for foto-mosaik-edda on Baidu!
Installing Foto-Mosaik-Edda on Windows is fairly simple!
Note: the .NET Framework 2 must be installed first, otherwise the error below appears and the installation cannot succeed!
How to enable the .NET Framework 2:
Confirm that it has been enabled successfully:
Then the installation can continue!
Once installed, it opens as follows:
Step 1, create a gallery:
Step 2, thousand-image imaging:
Here, select the gallery created in step 1:
A moment of magic:
And another lovely cat:
Done!
2、With Python
First, choose a picture:
Then run the following code:
# -*- coding:utf-8 -*-
from PIL import Image
import os
import numpy as np

imgDir = r"/Volumes/DBA/python/img/"
bgImg = r"/Users/lpc/Downloads/494.jpg"

# Get the average color value of an image
def compute_mean(imgPath):
    '''
    Get the average color of an image.
    :param imgPath: path of a thumbnail
    :return: (r, g, b) average over the whole thumbnail
    '''
    im = Image.open(imgPath)
    im = im.convert("RGB")  # convert to RGB mode
    # Turn the image into an array: one row per pixel row, each entry a pixel's color, e.g.
    # [[ 60  33  24] [ 58  34  24] ... [188 152 136] [ 99  96 113]]
    imArray = np.array(im)
    # mean() computes the average of the given data
    R = np.mean(imArray[:, :, 0])  # average of all R values
    G = np.mean(imArray[:, :, 1])
    B = np.mean(imArray[:, :, 2])
    return (R, G, B)

def getImgList():
    """
    Collect the thumbnails' paths and average colors.
    :return: list of dicts holding each image path and its average color
    """
    imgList = []
    for pic in os.listdir(imgDir):
        imgPath = imgDir + pic
        imgRGB = compute_mean(imgPath)
        imgList.append({
            "imgPath": imgPath,
            "imgRGB": imgRGB
        })
    return imgList

def computeDis(color1, color2):
    '''
    Color difference between two images, computed as a distance in color space:
    dis = (R**2 + G**2 + B**2) ** 0.5
    Parameters color1 and color2 are (r, g, b) tuples.
    '''
    dis = 0
    for i in range(len(color1)):
        dis += (color1[i] - color2[i]) ** 2
    dis = dis ** 0.5
    return dis

def create_image(bgImg, imgDir, N=2, M=50):
    '''
    Fill a new picture with thumbnails, guided by the background picture.
    bgImg: path of the background picture
    imgDir: thumbnail directory
    N: scale factor for the background picture
    M: size of each thumbnail (MxM)
    '''
    # Get the list of thumbnails
    imgList = getImgList()
    # Read the background picture
    bg = Image.open(bgImg)
    # bg = bg.resize((bg.size[0] // N, bg.size[1] // N))  # optional shrink; recommended, since a large source image makes the run very long
    bgArray = np.array(bg)
    width = bg.size[0] * M   # width of the generated picture: every pixel becomes M pixels
    height = bg.size[1] * M  # height of the generated picture
    # Create a new blank picture
    newImg = Image.new('RGB', (width, height))
    # Fill it in, pixel by pixel
    for x in range(bgArray.shape[0]):      # x walks the rows of the source
        for y in range(bgArray.shape[1]):  # y walks the columns
            # Find the thumbnail with the smallest color distance
            minDis = 10000
            index = 0
            for img in imgList:
                dis = computeDis(img['imgRGB'], bgArray[x][y])
                if dis < minDis:
                    index = img['imgPath']
                    minDis = dis
            # After the loop, index holds the path of the closest-colored image
            # and minDis holds its color difference
            # Paste it in
            tempImg = Image.open(index)  # open the image with the smallest color distance
            # Resize to MxM (little work here, since the pictures were already resized at download time)
            tempImg = tempImg.resize((M, M))
            # Paste the small picture onto the new one; mind x and y,
            # rows and columns must not be confused. Tiles are pasted M apart.
            newImg.paste(tempImg, (y * M, x * M))
            print('(%d, %d)' % (x, y))  # print progress as x, y
    # Save the result
    newImg.save('final.jpg')

create_image(bgImg, imgDir)
The result:
As you can see above, the mosaic's clarity is comparable to the original, and the small pictures are still clearly visible after zooming in!
Note: the Python implementation runs quite slowly!
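Most of that time goes to the innermost loop, which scans every thumbnail for every pixel of the background image. As a hedged sketch of a common speed-up (assuming scipy is available; this is not part of the original article), the average colors can be indexed in a k-d tree so each pixel needs only one nearest-neighbor query:

import numpy as np
from scipy.spatial import cKDTree

# imgList and bgArray as built inside create_image() above
colors = np.array([img["imgRGB"] for img in imgList])  # shape (n_tiles, 3)
paths = [img["imgPath"] for img in imgList]
tree = cKDTree(colors)

# Query all pixels at once instead of scanning tile-by-tile per pixel
pixels = bgArray[:, :, :3].reshape(-1, 3)
dis, idx = tree.query(pixels)    # nearest tile index for every pixel
best = [paths[i] for i in idx]   # tile path per pixel, in row-major order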
The end
Splendid! Now you can happily pet cats again~
References for this article:
- Batch-crawling cat pictures with Python
- Multi-threaded concurrent downloading of large files with Python
- Crawling ZOL desktop wallpaper HD pictures with Python
- Crawling Baidu images with Python
- Python: how to realize thousand-image imaging, beginner chapter
- Python study notes 17: playing with thousand-image imaging
Copyright notice
Author: [Lucifer thinks twice before acting]. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201302239267000.html