Python crawler in practice | Using multithreading to crawl LoL HD wallpapers

2022-01-30 03:29:58 Jacko's it journey

source : official account 【 Jay's IT The journey 】

author : Alaska

ID:Jake_Internet

1. Background

With the spread of mobile devices, mobile apps have become popular as well. I recently noticed that the mobile version of League of Legends has launched, and it feels decent. League of Legends on PC is a hugely popular game; I can't say how the mobile version will fare. Today we will use multithreading to crawl the HD hero wallpapers from the official LoL website.

2. Page analysis

Target site: lol.qq.com/data/info-h…

image.png

The official site looks like the figure above. Each thumbnail clearly represents one hero. Our goal is to crawl every skin image of every hero, download them all, and save them locally.

Secondary page

We call the page above the main page. The secondary page is the page for an individual hero. Taking the Dark Child (Annie) as an example, her secondary page looks like this:

image.png

There are many thumbnails, each corresponding to one skin. Inspecting the requests in the browser's Network panel reveals the skin data interface, as shown below:

image.png

The skin information is transmitted as a JSON string. So we only need to find each hero's id, fetch the corresponding JSON file, and extract the fields we need to get the HD skin wallpapers.

The JSON file for the Dark Child is at:

hero_one = 'https://game.gtimg.cn/images/lol/act/img/js/hero/1.js'

The pattern is simple. The skin data address for every hero has this form:

url = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js'.format(id)

So the question is: what pattern do the ids follow? The hero ids have to be looked up on the main page, as shown below:

image.png

We can see two lists, [0,99] and [100,156], i.e. 156 heroes, yet heroId runs up to 240 and beyond. So the ids do not simply increase one by one, and to crawl every hero's skin images we first need to collect all the heroId values.

3. Crawling approach

Why multithreading? We are crawling data such as pictures (the same applies to video) that has to be saved locally, so the program performs a large number of file reads and writes, that is, IO operations. Imagine doing this with synchronous requests:

The second request is only issued after the first one has completed and its file has been saved locally, which is very inefficient. If the requests run concurrently in multiple threads, throughput improves considerably, because one thread can work while another waits on the network or the disk.

We therefore use multithreading or multiprocessing and hand the queue of URLs to a thread pool or process pool for processing.

In Python, the multiprocessing Pool (a process pool) and multiprocessing.dummy are both very easy to use.

  • multiprocessing.dummy module: the dummy module is backed by threads;
  • multiprocessing module: multiprocessing is backed by processes;

The multiprocessing.dummy and multiprocessing modules expose the same API, so switching between the two requires little code change.
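To make the point about IO-bound work and the shared pool API concrete, here is a minimal sketch. `fake_download` is a hypothetical stand-in that sleeps instead of issuing a real request, and `run_serial` / `run_threaded` are illustrative names that do not appear in the original project.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed pool
# from multiprocessing import Pool as ProcessPool     # process-backed pool with the same map() API


def fake_download(n):
    """Hypothetical stand-in for an IO-bound task such as a download."""
    time.sleep(0.05)  # simulate waiting on the network or the disk
    return n * 2


def run_serial(items):
    # One task at a time: each call waits for the previous one to finish.
    return [fake_download(i) for i in items]


def run_threaded(items, workers=4):
    # The same work handed to a pool, so the waits overlap across threads.
    with ThreadPool(workers) as pool:
        return pool.map(fake_download, items)


if __name__ == '__main__':
    items = list(range(8))

    start = time.time()
    run_serial(items)
    print('serial  :', round(time.time() - start, 2), 's')

    start = time.time()
    run_threaded(items)
    print('threaded:', round(time.time() - start, 2), 's')
```

Switching to processes should only need the commented import, provided the worker function is defined at module level and the pool is created under the `if __name__ == '__main__':` guard.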

We start with a test file, demo.py, that fetches the hero ids. The code below produces a list of hero ids that can be used directly in the main file.

demo.py

import json
import requests

# Assumed minimal request headers; the original does not show how headers is defined
headers = {'User-Agent': 'Mozilla/5.0'}

url = 'https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js'
res = requests.get(url, headers=headers)
res = res.content.decode('utf-8')
res_dict = json.loads(res)
heros = res_dict["hero"]  # information for 156 heroes
idList = []
for hero in heros:
    hero_id = hero["heroId"]
    idList.append(hero_id)
print(idList)

The resulting idList looks like this:

idList = [1, 2, 3, …., 875, 876, 877]  # the hero ids in the middle are omitted here

URL construction: page = 'www.bizhi88.com/s/470/{}.ht…

Here i stands for the hero id, which is used to build each URL dynamically.

Next we define two functions: spider, which requests and parses a page, and download, which downloads the data. We open a thread pool, use a for loop to build the URLs of the JSON data and store them in a list that serves as the URL queue, then run the spider (crawler) function over it with pool.map():

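# Signature as quoted in the original article. Note: this form matches
# concurrent.futures.Executor.map; multiprocessing's Pool.map is map(func, iterable, chunksize=None).
def map(self, fn, *iterables, timeout=None, chunksize=1):
    """Returns an iterator equivalent to map(fn, iter)"""
# Here we use: pool.map(spider, page)  # spider: the crawler function; page: the URL queue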

What it does: takes each element of the list as an argument to the function and hands the calls to the workers in the pool (threads here, since the pool comes from multiprocessing.dummy).

Parameter 1: the function to execute;

Parameter 2: an iterable whose elements are passed to the function one at a time.
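The URL queue and the pool.map() call can be sketched as follows. This is only an illustration: `build_urls` and `fake_spider` are hypothetical helpers, and `fake_spider` makes no request at all; it just returns the id embedded in each URL so the mapping is visible.

```python
from multiprocessing.dummy import Pool as ThreadPool

URL_TEMPLATE = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js'


def build_urls(hero_ids):
    """Build one skin-data URL per hero id (the "URL queue")."""
    return [URL_TEMPLATE.format(hero_id) for hero_id in hero_ids]


def fake_spider(url):
    """Hypothetical stand-in for spider(): no request, just echoes the id."""
    return url.rsplit('/', 1)[-1].replace('.js', '')


if __name__ == '__main__':
    urls = build_urls([1, 2, 3])
    with ThreadPool(3) as pool:
        # map() calls fake_spider once per URL and returns results in input order
        print(pool.map(fake_spider, urls))  # prints ['1', '2', '3']
```

Because map() preserves the order of its input, the results line up with the URL list even though the calls run on different threads.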

JSON data analysis

image.png

Here the JSON file for the Dark Child's skins is shown for analysis. We need three fields: 1. name, 2. skin_name, 3. mainImg. Since heroName is the same for every skin of a hero, we use the hero's name as the name of that hero's skin folder, which makes the images easy to browse and store.

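for hero in skins:  # skins = res_dict["skins"]; each entry describes one skin
    item = {}
    item['name'] = hero["heroName"]    # hero name, used as the folder name
    item['skin_name'] = hero["name"]   # skin name, used in the file name
    if hero["mainImg"] == '':          # skip skins without a main image
        continue
    item['imgLink'] = hero["mainImg"]  # URL of the HD image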

One caveat:

Some entries have an empty mainImg field, so they must be skipped; requesting an empty link would raise an error.
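The field extraction and the empty-link check can be tried without touching the network. The sketch below runs on a made-up sample that mimics only the three fields used in this article; the sample values, the example.com links, and the `extract_items` name are all assumptions for illustration.

```python
import json

# Made-up sample data: only the fields this article relies on.
SAMPLE = '''
{"skins": [
  {"heroName": "Annie", "name": "Skin A", "mainImg": "https://example.com/a.jpg"},
  {"heroName": "Annie", "name": "Skin B", "mainImg": ""},
  {"heroName": "Annie", "name": "Skin C", "mainImg": "https://example.com/c.jpg"}
]}
'''


def extract_items(raw):
    """Return one dict per skin that has a non-empty mainImg."""
    items = []
    for skin in json.loads(raw)["skins"]:
        if not skin.get("mainImg"):  # empty or missing link: nothing to download
            continue
        items.append({
            'name': skin["heroName"],
            'skin_name': skin["name"],
            'imgLink': skin["mainImg"],
        })
    return items


if __name__ == '__main__':
    for item in extract_items(SAMPLE):
        print(item['skin_name'], item['imgLink'])
```

Using `skin.get("mainImg")` also covers a missing key, which is slightly more forgiving than comparing against the empty string.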

4. Data collection

Import the required libraries

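import requests  # HTTP requests
from multiprocessing.dummy import Pool as ThreadPool  # thread pool for concurrency
import time  # timing
import os  # file and folder operations
import json  # JSON parsing

# Assumed minimal request headers; the original does not show how headers is defined
headers = {'User-Agent': 'Mozilla/5.0'}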

Page data analysis

def spider(url):
    res = requests.get(url, headers=headers)
    result = res.content.decode('utf-8')
    res_dict = json.loads(result)

    skins = res_dict["skins"]  # skin entries for this hero (15 in the example)
    print(len(skins))

    # enumerate gives us an index that is used when naming the image files
    for index, hero in enumerate(skins):
        item = {}  # one dictionary per skin
        item['name'] = hero["heroName"]
        item['skin_name'] = hero["name"]

        if hero["mainImg"] == '':  # skip skins without a main image
            continue
        item['imgLink'] = hero["mainImg"]
        print(item)

        download(index + 1, item)

download: downloading the pictures

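def download(index, contdict):
    name = contdict['name']
    # One folder per hero under "skins" (assumed English folder name; the
    # string literal in the source was garbled in translation)
    path = os.path.join('skins', name)
    os.makedirs(path, exist_ok=True)  # safe even if the folder already exists
    # A skin name may contain '/', which would be read as a path separator
    skin_name = contdict['skin_name'].replace('/', '_')
    content = requests.get(contdict['imgLink'], headers=headers).content
    with open(os.path.join(path, skin_name + str(index) + '.jpg'), 'wb') as f:
        f.write(content)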

Here we use the os module to create the folder. As mentioned earlier, heroName is the same for every skin of a hero, so we create one folder per hero, name it after the hero, and save (group) that hero's skins in it. Take care when building the image file path: a missing path separator will cause an error.
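One way to avoid separator mistakes is to let os.path.join assemble the path. The sketch below is an illustration under assumptions: `skin_path` is a hypothetical helper, and the replacement of '/' and '\\' in skin names is a precaution added here, not something the original code does. It writes into a temporary directory so it can be run safely.

```python
import os
import tempfile


def skin_path(root, hero_name, skin_name, index):
    """Return the file path for one skin image, creating the hero folder."""
    # Assumption: a name may contain '/' or '\\', which would otherwise be
    # interpreted as a path separator.
    safe_name = skin_name.replace('/', '_').replace('\\', '_')
    folder = os.path.join(root, hero_name)
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder, '{}{}.jpg'.format(safe_name, index))


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as root:
        path = skin_path(root, 'Annie', 'Demo/Skin', 1)
        with open(path, 'wb') as f:
            f.write(b'not a real image')
        print(os.path.basename(path))  # prints Demo_Skin1.jpg
```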

The main() function

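def main():
    start = time.time()
    pool = ThreadPool(6)  # thread pool with six worker threads
    page = []
    for i in range(1, 21):  # hero ids 1..20 as a trial run
        newpage = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js'.format(i)
        print(newpage)
        page.append(newpage)
    pool.map(spider, page)  # run spider once per URL
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for the worker threads to finish
    end = time.time()
    print('Elapsed: {:.2f}s'.format(end - start))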

Notes:

  • In the main function we first create a thread pool with six worker threads;

  • A for loop builds 20 URLs dynamically. This is just a small trial run on 20 heroes' skins; to crawl everything, iterate over the idList obtained earlier and build the URLs from it;

  • pool.map() runs the parsing and storage steps for each URL in the list on the pool's threads;

  • Calling close() does not shut the pool down immediately; it only puts the pool into a state where no new tasks can be submitted. join() then waits for the workers to finish.
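The behaviour of close() and join() described in the last note can be observed with a small sketch. `double` and `demo_close_join` are hypothetical names; the example assumes a current Python 3 release, where submitting to a closed pool raises ValueError.

```python
from multiprocessing.dummy import Pool as ThreadPool


def double(n):
    return n * 2


def demo_close_join():
    pool = ThreadPool(2)
    pending = pool.map_async(double, [1, 2, 3])  # submitted before close()
    pool.close()  # from here on the pool accepts no new tasks
    try:
        pool.apply_async(double, (4,))
        rejected = False
    except ValueError:
        rejected = True  # the closed pool refused the new task
    pool.join()  # block until the already submitted work is done
    return pending.get(), rejected


if __name__ == '__main__':
    print(demo_close_join())
```

Tasks submitted before close() still run to completion, which is why main() can call close() right after map() and then join().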

5. Running the program

if __name__ == '__main__':
    main()

The result is as follows:

image.png

Only part of the result is captured in the screenshot. In total I crawled 200+ images, which is a reasonable result overall.

6. Summary

This time we used multithreading to crawl the HD hero skin wallpapers from the official League of Legends website. Because downloading images is IO-bound, running the requests concurrently greatly improves the program's efficiency.

Of course, crawling should be done in moderation. This was only a small trial that crawled the skin images of 20 heroes. Interested readers can crawl all the skins by changing the iterated list to the idList obtained earlier.

That's the end of this article.



copyright notice
author[Jacko's it journey],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/01/202201300329572468.html
