
Python crawls cat pictures in batches to build a thousand-image mosaic

2022-01-30 22:39:28 Lucifer thinks twice before acting

Cuddle cats with code! This article is an entry in the 【Meow Star Essay Solicitation Activity】.

Preface

Use Python to crawl pictures of cats, then assemble them into a thousand-image mosaic of a cat!

Crawling cat pictures

This article uses Python 3.10.0, which can be downloaded directly from the official website: www.python.org.

The Python installation and configuration process is not covered here; plenty of tutorials can be found online.
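
A quick way to confirm which interpreter is active (any version close to 3.10 will run the code in this article):

import sys
print(sys.version)  # expect something like 3.10.0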

1、Crawling the Huiyi8 material site

Target website: cat pictures

First, install the necessary libraries:

pip install BeautifulSoup4
pip install requests
pip install urllib3
pip install lxml
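
As a quick sanity check that everything installed correctly, this small import test can be run (the versions printed will be whatever pip resolved):

import bs4, requests, urllib3, lxml
print(bs4.__version__, requests.__version__, urllib3.__version__)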

Image-crawling code:

from bs4 import BeautifulSoup
import requests
import urllib.request
import os

# URL of the first page of cat pictures
url = 'https://www.huiyi8.com/tupian/tag-%E7%8C%AB%E5%92%AA/1.html'
# Image save path; the r prefix keeps the string unescaped
path = r"/Users/lpc/Downloads/cats/"
# Create the target directory if it does not already exist
if not os.path.exists(path):
    os.mkdir(path)


# Build the list of page URLs to crawl
def allpage():
    all_url = []
    # Loop over pages 1 to 19
    for i in range(1, 20):
        # Swap the page number in; url[-6] is the '1' of '1.html'
        # (this works only because '1' appears nowhere else in the URL)
        each_url = url.replace(url[-6], str(i))
        # Add each URL to the all_url list
        all_url.append(each_url)
    # Return all collected page URLs
    return all_url


# Main entry point
if __name__ == '__main__':
    # Call allpage() to get every page URL
    img_url = allpage()
    for url in img_url:
        # Fetch the page source
        requ = requests.get(url)
        req = requ.text.encode(requ.encoding).decode()
        html = BeautifulSoup(req, 'lxml')
        # Collect the matching image tags here
        img_urls = []
        # Walk every img tag in the page
        for img in html.find_all('img'):
            # Keep only src values that start with http and end with jpg
            if img["src"].startswith('http') and img["src"].endswith("jpg"):
                # Add each qualifying img tag to img_urls
                img_urls.append(img)
        # Loop over every matched tag
        for k in img_urls:
            # Image URL
            img = k.get('src')
            # Image name from the alt attribute, cast to str in case alt is missing
            name = str(k.get('alt'))
            # Full path for the saved file
            file_name = path + name + '.jpg'
            # Download the cat picture via its URL and save it under its name
            with open(file_name, "wb") as f, requests.get(img) as res:
                f.write(res.content)
            # Print the crawled image URL and file name
            print(img, file_name)
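
One fragile spot worth noting: allpage() swaps the page number in with str.replace, which only works because the character '1' happens to appear exactly once in this URL. A sturdier sketch, assuming the site keeps its /<page>.html pattern, builds each URL explicitly:

# Safer pagination: format the page number into a template URL
base = 'https://www.huiyi8.com/tupian/tag-%E7%8C%AB%E5%92%AA/{}.html'

def allpage():
    return [base.format(i) for i in range(1, 20)]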

Note: the code above cannot be run as-is. The download path /Users/lpc/Downloads/cats must be changed to a local path of your own!
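
When adapting the path, os.makedirs with exist_ok=True is a handy replacement for the existence check used above, since it also creates any missing parent directories:

import os

path = r"/Users/lpc/Downloads/cats/"  # replace with your own save path
os.makedirs(path, exist_ok=True)  # no error if the directory already exists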

The crawl succeeded:

346 cat pictures crawled!

2、Crawling the ZOL website

Target website: ZOL's adorable-cat wallpapers

Crawler code:

import requests
import os
from lxml import etree

# Starting page URL
url = 'https://desk.zol.com.cn/dongwu/mengmao/1.html'
# Image save path; the r prefix keeps the string unescaped
path = r"/Users/lpc/Downloads/ZOL/"
# Create the target directory if it does not already exist
if not os.path.exists(path):
    os.mkdir(path)
# Request headers
headers = {"Referer": "http://desk.zol.com.cn/dongman/1920x1080/",
           "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36", }

headers2 = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36 SE 2.X MetaSr 1.0", }


def allpage():  # build the list of page URLs
    all_url = []
    for i in range(1, 4):  # pages 1 to 3
        each_url = url.replace(url[-6], str(i))  # swap the page number into the URL
        all_url.append(each_url)
    return all_url  # return the list of page URLs


# Fetch each HTML page and parse it
if __name__ == '__main__':
    img_url = allpage()  # collect the page URLs
    for url in img_url:
        # Send the request
        resq = requests.get(url, headers=headers)
        # Show whether the request succeeded
        print(resq)
        # Parse the returned page
        html = etree.HTML(resq.text)
        # Each a.pic link leads to the page holding the high-definition image
        hrefs = html.xpath('.//a[@class="pic"]/@href')
        # Follow each link to reach the HD picture
        # (note: the loop starts at index 1, so the first thumbnail on each page is skipped)
        for i in range(1, len(hrefs)):
            # Request the detail page
            resqt = requests.get("https://desk.zol.com.cn" + hrefs[i], headers=headers)
            # Parse it
            htmlt = etree.HTML(resqt.text)
            srct = htmlt.xpath('.//img[@id="bigImg"]/@src')
            # Derive the file name from the image URL
            imgname = srct[0].split('/')[-1]
            # Fetch the image itself
            img = requests.get(srct[0], headers=headers2)
            # Write the image to a file
            with open(path + imgname, "wb") as file:
                file.write(img.content)
            # Print the crawled image
            print(img, imgname)

The crawl succeeded:

81 cat pictures crawled!
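
The ZOL script fires its requests back to back. To stay polite to the server and lower the odds of being blocked, a small randomized pause between downloads is a common addition; a minimal sketch (polite_get is a hypothetical helper, not part of the script above):

import time
import random
import requests

def polite_get(url, **kwargs):
    # Wait 0.5-1.5 seconds before each request to avoid hammering the server
    time.sleep(random.uniform(0.5, 1.5))
    return requests.get(url, **kwargs)

Replacing the requests.get calls in the loop with polite_get leaves the rest of the script unchanged.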

3、Crawling Baidu Images

Target website: Baidu image search for cat pictures

1、Image-crawling code:

import requests
import os
from lxml import etree
path = r"/Users/lpc/Downloads/baidu1/"
# Create the target directory if it does not already exist
if not os.path.exists(path):
    os.mkdir(path)

page = input('How many pages to crawl: ')
page = int(page) + 1
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
n = 0
pn = 1
# pn is the offset of the first result; Baidu returns 30 images per request by default
for m in range(1, page):
    url = 'https://image.baidu.com/search/acjson?'

    param = {
        'tn': 'resultjson_com',
        'logid': '7680290037940858296',
        'ipn': 'rj',
        'ct': '201326592',
        'is': '',
        'fp': 'result',
        'queryWord': '猫',  # the search keyword ("cat" in Chinese)
        'cl': '2',
        'lm': '-1',
        'ie': 'utf-8',
        'oe': 'utf-8',
        'adpicid': '',
        'st': '-1',
        'z': '',
        'ic': '0',
        'hd': '1',
        'latest': '',
        'copyright': '',
        'word': '猫',
        's': '',
        'se': '',
        'tab': '',
        'width': '',
        'height': '',
        'face': '0',
        'istype': '2',
        'qc': '',
        'nc': '1',
        'fr': '',
        'expermode': '',
        'nojc': '',
        'acjsonfr': 'click',
        'pn': pn,  # offset of the first result to return
        'rn': '30',
        'gsm': '3c',
        '1635752428843=': '',
    }
    page_text = requests.get(url=url, headers=header, params=param)
    page_text.encoding = 'utf-8'
    page_text = page_text.json()
    print(page_text)
    # Pull out the list of dicts that hold the image links
    info_list = page_text['data']
    # The last dict extracted this way is empty, so drop it
    del info_list[-1]
    # Collect the image addresses
    img_path_list = []
    for i in info_list:
        img_path_list.append(i['thumbURL'])
    # Download every image; the counter n doubles as the file name
    for img_path in img_path_list:
        img_data = requests.get(url=img_path, headers=header).content
        img_path = path + str(n) + '.jpg'
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        n = n + 1

    pn += 30  # advance the offset by one full page of results
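
The parameters that matter in the acjson request are queryWord/word (the search term), pn (the offset of the first result), and rn (results per request). Paging is then plain offset arithmetic; a small sketch, assuming rn stays at 30:

# Offsets of the first result for each of the first `pages` requests
rn = 30
pages = 5
offsets = [1 + rn * k for k in range(pages)]
print(offsets)  # [1, 31, 61, 91, 121]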

2、Alternative crawler code (normalizes the downloads to 50x50 thumbnails):

# -*- coding:utf-8 -*-
import requests
import re, time
import random
import urllib.parse
from PIL import Image  # image-processing module

imgDir = r"/Volumes/DBA/python/img/"
# Rotate among several request headers (Chrome, Firefox, Edge) to reduce
# the chance of being blocked by anti-crawler measures
headers = [
    {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Connection': 'keep-alive'
    },
    {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Connection': 'keep-alive'
    },
    {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19041',
        'Accept-Language': 'zh-CN',
        'Connection': 'keep-alive'
    }
]

picList = []  # list that will collect the thumbnail URLs

keyword = input("Enter a search keyword: ")
kw = urllib.parse.quote(keyword)  # URL-encode the keyword


# Fetch one page (30 results) of Baidu thumbnail URLs and append them to picList
def getPicList(kw, n):
    global picList
    weburl = r"https://image.baidu.com/search/acjson?tn=resultjson_com&logid=11601692320226504094&ipn=rj&ct=201326592&is=&fp=result&queryWord={kw}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&word={kw}&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=1&fr=&expermode=&force=&cg=girl&pn={n}&rn=30&gsm=1e&1611751343367=".format(
        kw=kw, n=n * 30)
    req = requests.get(url=weburl, headers=random.choice(headers))
    req.encoding = req.apparent_encoding  # guard against garbled Chinese
    webJSON = req.text
    imgurlReg = '"thumbURL":"(.*?)"'  # regex that captures the thumbnail URLs
    picList = picList + re.findall(imgurlReg, webJSON, re.DOTALL | re.I)


for i in range(150):  # a generous page count; once Baidu runs out of results, picList simply stops growing
    getPicList(kw, i)

for item in picList:
    hz = ".jpg"  # file extension
    picName = str(int(time.time() * 1000))  # millisecond timestamp as the file name
    # Request the image
    imgReq = requests.get(url=item, headers=random.choice(headers))
    # Save the image
    with open(imgDir + picName + hz, "wb") as f:
        f.write(imgReq.content)
    # Reopen the saved image with the Image module
    im = Image.open(imgDir + picName + hz)
    bili = im.width / im.height  # aspect ratio, used to resize proportionally
    newIm = None
    # Resize so the shorter side becomes 50 pixels
    if bili >= 1:
        newIm = im.resize((round(bili * 50), 50))
    else:
        newIm = im.resize((50, round(50 * im.height / im.width)))
    # Crop the top-left 50x50 region
    clip = newIm.crop((0, 0, 50, 50))
    clip.convert("RGB").save(imgDir + picName + hz)  # overwrite with the cropped image
    print(picName + hz + " processed")
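
Note that this script pulls the thumbnail links out of the raw response text with a regular expression instead of parsing the JSON, which tolerates the occasional malformed payload. A minimal demonstration on a made-up sample string:

import re

sample = '{"thumbURL":"https://example.com/cat1.jpg","width":480}'
print(re.findall('"thumbURL":"(.*?)"', sample))  # ['https://example.com/cat1.jpg']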

The crawl succeeded:

Summary: 1,600 cat pictures crawled across the three websites!

Thousand-image mosaic

With a thousand-plus pictures crawled, the next step is to tile them into a single cat portrait: this is the thousand-image mosaic (a photomosaic).
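
The principle is straightforward: scale the background picture down to a coarse grid, then replace every grid cell with the thumbnail whose average color is closest to that cell's pixel, where "closest" means the smallest Euclidean distance in RGB space:

d(c_1, c_2) = \sqrt{(R_1 - R_2)^2 + (G_1 - G_2)^2 + (B_1 - B_2)^2}

The Python implementation in section 2 below computes exactly this distance in its computeDis function; the software in section 1 wraps the same idea in a GUI.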

1、With the Foto-Mosaik-Edda software

Download the software first: Foto-Mosaik-Edda Installer. If the download fails, search Baidu for foto-mosaik-edda directly.

Installing Foto-Mosaik-Edda on Windows is fairly simple!

Note: the .NET Framework 2 must be installed beforehand; otherwise the error below is reported and the installation cannot succeed!

How to enable the .NET Framework 2:

Confirm that it has been enabled successfully:

Then the installation can continue!

After installation, the program opens as follows:

Step 1: create a gallery:

Step 2: generate the mosaic:

Select the gallery created in step 1:

A moment of wonder:

And another lovely cat:

Done!

2、With Python

First, choose a background picture:

Run the following code:

# -*- coding:utf-8 -*-
from PIL import Image
import os
import numpy as np

imgDir = r"/Volumes/DBA/python/img/"
bgImg = r"/Users/lpc/Downloads/494.jpg"


# Compute the average color of an image
def compute_mean(imgPath):
    """
    Compute the average color of an image.
    :param imgPath: thumbnail path
    :return: (r, g, b) average over the whole thumbnail
    """
    im = Image.open(imgPath)
    im = im.convert("RGB")  # convert to RGB mode
    # Turn the image data into an array: one row per pixel row, each entry
    # an [R G B] pixel, e.g. [[ 60 33 24] [ 58 34 24] ... [ 99 96 113]]
    imArray = np.array(im)
    # np.mean averages over the selected channel
    R = np.mean(imArray[:, :, 0])  # average of all R values
    G = np.mean(imArray[:, :, 1])
    B = np.mean(imArray[:, :, 2])
    return (R, G, B)


def getImgList():
    """
    Collect the path and average color of every thumbnail.
    :return: list of dicts, each holding an image path and its average color
    """
    imgList = []
    for pic in os.listdir(imgDir):
        imgPath = imgDir + pic
        imgRGB = compute_mean(imgPath)
        imgList.append({
            "imgPath": imgPath,
            "imgRGB": imgRGB
        })
    return imgList


def computeDis(color1, color2):
    """
    Compute the color difference between two images as a distance in color
    space: dis = (R**2 + G**2 + B**2) ** 0.5,
    where color1 and color2 are (r, g, b) tuples.
    """
    dis = 0
    for i in range(len(color1)):
        dis += (color1[i] - color2[i]) ** 2
    dis = dis ** 0.5
    return dis


def create_image(bgImg, imgDir, N=2, M=50):
    """
    Fill a new picture with thumbnails, guided by the background picture.
    bgImg: background picture path
    imgDir: thumbnail directory
    N: scale-down factor for the background image
    M: size of each thumbnail tile (M x M)
    """
    # Gather the thumbnail list
    imgList = getImgList()

    # Read the background picture
    bg = Image.open(bgImg)
    # bg = bg.resize((bg.size[0] // N, bg.size[1] // N))  # optional shrink; a large original makes the computation very slow
    bgArray = np.array(bg)
    width = bg.size[0] * M  # width of the generated picture: each source pixel becomes M pixels
    height = bg.size[1] * M  # height of the generated picture

    # Create a new blank image
    newImg = Image.new('RGB', (width, height))

    # Fill the image cell by cell
    for x in range(bgArray.shape[0]):  # x indexes the rows of the background
        for y in range(bgArray.shape[1]):  # y indexes the columns
            # Find the thumbnail with the smallest color distance
            minDis = 10000
            index = 0
            for img in imgList:
                dis = computeDis(img['imgRGB'], bgArray[x][y])
                if dis < minDis:
                    index = img['imgPath']
                    minDis = dis
            # After the loop, index holds the path of the closest-colored
            # thumbnail and minDis its color distance
            # Fill the cell
            tempImg = Image.open(index)  # open the closest-colored thumbnail
            # Resize it; the download step already produced 50x50 images, but resize to be safe
            tempImg = tempImg.resize((M, M))
            # Paste the tile onto the new picture on an M-pixel grid.
            # Careful with x (row) and y (column): do not swap them.
            newImg.paste(tempImg, (y * M, x * M))
            print('(%d, %d)' % (x, y))  # print progress as (x, y)

    # Save the result
    newImg.save('final.jpg')  # final output picture


create_image(bgImg, imgDir)

Running result:

As the picture above shows, the result is comparable in clarity to the original, and each small tile is still clearly visible when zoomed in!

Note: the Python version runs noticeably slower!
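
Much of that slowness is the innermost loop: for every background pixel, the script scans the whole thumbnail list in pure Python. A sketch of a vectorized lookup (these helper names are hypothetical, not part of the script above) that collapses the scan into a single numpy argmin:

import numpy as np

def build_color_matrix(imgList):
    # Stack every thumbnail's mean color into one (N, 3) array, computed once
    return np.array([img["imgRGB"] for img in imgList], dtype=np.float64)

def nearest_index(colors, pixel):
    # Squared Euclidean distance from `pixel` to every row of `colors`,
    # then argmin: one numpy call instead of a Python loop over imgList
    diff = colors - np.asarray(pixel, dtype=np.float64)
    return int(np.argmin((diff * diff).sum(axis=1)))

With colors = build_color_matrix(imgList) computed once before the x/y loops, the inner for img in imgList loop reduces to imgList[nearest_index(colors, bgArray[x][y])]['imgPath'].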

Wrapping up

Splendid! Now you can happily cuddle cats again ~


Copyright notice
Author: [Lucifer thinks twice before acting]. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201302239267000.html
