Python crawler in practice | Using multithreading to crawl LOL HD wallpapers
2022-01-30 03:29:58 【Jacko's it journey】
Source: official account 【Jay's IT Journey】
Author: Alaska
ID:Jake_Internet
1. Background
With mobile devices everywhere, there are more and more mobile apps, and application software has become popular as well. I recently saw that the League of Legends mobile game had launched, and it feels decent. League of Legends on PC is a hit game; I don't know how the mobile version will fare. Today we use multithreading to crawl the HD hero wallpapers from the official LOL website.
2. Page analysis
Target site: lol.qq.com/data/info-h…
The official website looks as shown in the figure. Each small picture represents one hero. Our goal is to crawl all the skin pictures of every hero, download them, and save them locally.
Secondary page
The page above is the main page. The secondary page is the page for an individual hero. Taking the Daughter of Darkness as an example, her secondary page looks like this:
We can see many small pictures, each corresponding to one skin. Using the browser's Network panel, we can find the skin data interface, as shown in the figure below:
The skin information is transmitted as a JSON string, so we only need to find each hero's id, locate the corresponding JSON file, and extract the required data to get the HD skin wallpapers.
The JSON file address for the Daughter of Darkness is:
hero_one = 'https://game.gtimg.cn/images/lol/act/img/js/hero/1.js'
The pattern here is simple. The address of every hero's skin data looks like this:
url = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js'.format(id)
So the question is: what pattern do the ids follow? The hero ids have to be looked up on the home page, as shown below:
We can see two lists, [0,99] and [100,156], i.e. 156 heroes, yet the heroId values run up to 240 and beyond. The ids therefore follow some pattern of their own and do not simply increase one by one, so to crawl all the hero skin pictures you first need to collect every heroId.
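Because the ids are not consecutive, the urls are better built from a list of real ids than from a numeric range. The following is a minimal sketch of that idea; the helper name `build_hero_urls` is hypothetical, and the sample ids are simply the ones shown in this article's idList.

```python
# URL template taken from the article; build_hero_urls is a hypothetical helper.
HERO_URL = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js'


def build_hero_urls(hero_ids):
    """Return one skin-data url per hero id, preserving the input order."""
    return [HERO_URL.format(hero_id) for hero_id in hero_ids]


# Illustrative ids only: iterate over the ids that exist instead of range().
sample_ids = [1, 2, 3, 875, 876, 877]
urls = build_hero_urls(sample_ids)
print(urls[0])
```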
3. Crawling approach
Why multithreading? When we crawl data such as pictures or videos, the data has to be saved locally, so there are a lot of file reads and writes, i.e. I/O operations. Imagine doing this with synchronous requests:
the second request is only made after the first request has completed and its file has been saved locally, which is very inefficient. If the work is done concurrently with multiple threads, efficiency improves greatly.
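The difference can be sketched without touching the network. In the example below, `fake_download` is a hypothetical stand-in that just sleeps for 0.2 seconds to imitate a request plus a file write; the timings it prints are for this simulation only, not for the real site.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def fake_download(task_id):
    """Hypothetical stand-in for one request plus file write: waits 0.2 s."""
    time.sleep(0.2)
    return task_id


tasks = list(range(6))

# Synchronous version: each task starts only after the previous one finished.
start = time.perf_counter()
sequential = [fake_download(t) for t in tasks]
sequential_time = time.perf_counter() - start

# Thread pool version: the six waits overlap.
start = time.perf_counter()
with ThreadPool(6) as pool:
    threaded = pool.map(fake_download, tasks)
threaded_time = time.perf_counter() - start

print('sequential: {:.2f}s, threaded: {:.2f}s'.format(sequential_time, threaded_time))
```

Threads help here because the work is dominated by waiting; for CPU-bound work the picture would be different.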
So we need multithreading or multiprocessing, and we hand the queue of tasks to a thread pool or process pool for processing.
In Python, the Pool from multiprocessing and the one from multiprocessing.dummy are both very easy to use.
The multiprocessing.dummy module is multithreaded;
the multiprocessing module is multi-process;
the two modules share the same API, so switching code between them is flexible.
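As a small illustration of the shared API, the sketch below uses the thread-backed pool and shows, in a comment, the one import line that would switch it to processes. The function `name_length` and the sample strings are made up for the example.

```python
from multiprocessing.dummy import Pool  # thread-backed pool
# from multiprocessing import Pool      # process-backed pool, same API


def name_length(name):
    """Toy task: return the length of a string."""
    return len(name)


names = ['alpha', 'beta', 'gamma']  # illustrative input only

pool = Pool(3)
lengths = pool.map(name_length, names)
pool.close()
pool.join()
print(lengths)
```

If the import is switched to the process-backed pool, the pool creation should sit under an `if __name__ == '__main__':` guard, and the function and its arguments must be picklable.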
We start with a test file, demo.py, that fetches the hero ids. The code below produces a list of hero ids that can be used directly in the main file:
demo.py
import json
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed value; the original does not show its headers
url = 'https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js'
res = requests.get(url, headers=headers)
res = res.content.decode('utf-8')
res_dict = json.loads(res)
heros = res_dict["hero"]  # information for 156 heroes
idList = []
for hero in heros:
    hero_id = hero["heroId"]
    idList.append(hero_id)
print(idList)
The resulting idList looks like this:
idList = [1, 2, 3, ..., 875, 876, 877]  # the hero ids in the middle are not shown here
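The extraction step can be tried offline on a hand-written payload. This is only a sketch: the sample JSON is made up to mirror the `hero` / `heroId` keys used in demo.py, the helper name `extract_hero_ids` is hypothetical, and storing ids as strings in the sample is an assumption (`int()` accepts both strings and numbers).

```python
import json

# Hand-written sample shaped like the hero list; the values are illustrative.
sample = (
    '{"hero": ['
    '{"heroId": "1", "name": "a"}, '
    '{"heroId": "2", "name": "b"}, '
    '{"heroId": "10", "name": "c"}]}'
)


def extract_hero_ids(payload):
    """Parse a JSON string and return the heroId values as integers."""
    data = json.loads(payload)
    return [int(hero["heroId"]) for hero in data.get("hero", [])]


hero_ids = extract_hero_ids(sample)
print(hero_ids)
```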
Build the url: page = 'www.bizhi88.com/s/470/{}.ht…
Here i stands for the id, and the url is built dynamically;
Then we define two functions, one for crawling and parsing the page (spider) and one for downloading the data (download). We open a thread pool, use a for loop to build the urls that hold the JSON data, store them in a list as the url queue, and use the pool.map() method to run the spider (crawler) function:
def map(self, func, iterable, chunksize=None):
    """Apply func to each element in iterable, collecting the results in a list that is returned."""
# Here we use: pool.map(spider, page)  # spider: the crawler function; page: the url queue
Effect: each element of the list is taken out and passed as the argument to the function, and the calls are handed to the workers in the pool (threads here, since we use multiprocessing.dummy);
Parameter 1: the function to execute;
Parameter 2: an iterable whose elements are passed to the function one by one as its argument;
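A short sketch of how pool.map() behaves: it blocks until every call has finished and returns the results as a list in the same order as the input, even when later inputs finish first. The function `slow_square` is a toy written for this example.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def slow_square(n):
    """Toy task: smaller inputs sleep longer, so they finish later."""
    time.sleep(0.05 * (5 - n))
    return n * n


pool = ThreadPool(4)
squares = pool.map(slow_square, [1, 2, 3, 4])  # blocks until all calls finish
pool.close()
pool.join()
print(squares)
```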
Parsing the JSON data
Here we take the Daughter of Darkness skin JSON file as the example for analysis. What we need are 1. name, 2. skin_name, 3. mainImg. Because heroName is the same for every skin of a hero, we use the hero's name as the name of that hero's skin folder, which makes viewing and saving easier:
item = {}
item['name'] = hero["heroName"]
item['skin_name'] = hero["name"]
if hero["mainImg"] == '':
    continue  # this fragment runs inside the loop over skins
item['imgLink'] = hero["mainImg"]
There is one caveat:
some entries have an empty mainImg, so we need to skip them; otherwise requesting an empty link raises an error.
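The filtering rule can be checked offline. Below is a minimal sketch using a made-up dictionary with the same keys as above (`skins`, `heroName`, `name`, `mainImg`); the helper name `parse_skins`, the hero and skin names, and the example.com links are all hypothetical.

```python
def parse_skins(res_dict):
    """Return one item per skin that has a non-empty mainImg."""
    items = []
    for index, skin in enumerate(res_dict.get("skins", []), start=1):
        img = skin.get("mainImg", "")
        if not img:
            continue  # skip skins without an image link
        items.append({
            "index": index,
            "name": skin["heroName"],
            "skin_name": skin["name"],
            "imgLink": img,
        })
    return items


# Made-up sample data; only the key names follow the article.
sample = {"skins": [
    {"heroName": "HeroA", "name": "Skin One", "mainImg": "https://example.com/1.jpg"},
    {"heroName": "HeroA", "name": "Skin Two", "mainImg": ""},
    {"heroName": "HeroA", "name": "Skin Three", "mainImg": "https://example.com/3.jpg"},
]}
items = parse_skins(sample)
print(len(items))
```

Note that the index keeps counting over skipped entries, which matches the enumerate-based numbering used in the spider function.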
4. Data collection
Import the required libraries
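import requests  # HTTP requests
from multiprocessing.dummy import Pool as ThreadPool  # concurrency (thread pool)
import time  # timing
import os  # file operations
import json  # JSON parsing

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed value; the original does not show its headers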
Parsing the page data
def spider(url):
    res = requests.get(url, headers=headers)
    result = res.content.decode('utf-8')
    res_dict = json.loads(result)
    skins = res_dict["skins"]  # skin entries for this hero (15 in the example)
    print(len(skins))
    for index, hero in enumerate(skins):  # enumerate gives the index used to name the image files
        item = {}  # a dictionary object
        item['name'] = hero["heroName"]
        item['skin_name'] = hero["name"]
        if hero["mainImg"] == '':
            continue  # skip skins without an image link
        item['imgLink'] = hero["mainImg"]
        print(item)
        download(index + 1, item)
download: downloading the pictures
def download(index, contdict):
    name = contdict['name']
    path = os.path.join('skins', name)  # one folder per hero
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    content = requests.get(contdict['imgLink'], headers=headers).content
    # a skin name may contain '/', which cannot appear in a file name
    filename = contdict['skin_name'].replace('/', '_') + str(index) + '.jpg'
    with open(os.path.join(path, filename), 'wb') as f:
        f.write(content)
Here we use the os module to create the folders. As mentioned earlier, heroName is the same for every skin of a hero, so we create one folder per hero and name it after the hero, which keeps the skins organized. Be careful when assembling the image file path: a missing path separator will cause an error.
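One way to avoid separator mistakes is to let os.path.join assemble the path. The sketch below is an illustration under assumptions: `build_image_path` is a hypothetical helper, the hero and skin names are made up, and a temporary directory stands in for the real output folder.

```python
import os
import tempfile


def build_image_path(root, hero_name, skin_name, index):
    """Create the hero folder if needed and return the full image path."""
    folder = os.path.join(root, hero_name)
    os.makedirs(folder, exist_ok=True)
    # Replace path separators so the skin name stays a single file name.
    safe_name = skin_name.replace('/', '_').replace('\\', '_')
    return os.path.join(folder, '{}{}.jpg'.format(safe_name, index))


root = tempfile.mkdtemp()  # throwaway folder for the demonstration
path = build_image_path(root, 'HeroA', 'K/DA Skin', 1)
with open(path, 'wb') as f:
    f.write(b'fake image bytes')
print(os.path.basename(path))
```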
main(): the main function
def main():
    start = time.time()
    pool = ThreadPool(6)  # a thread pool with six worker threads
    page = []
    for i in range(1, 21):
        newpage = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js'.format(i)
        print(newpage)
        page.append(newpage)
    result = pool.map(spider, page)
    pool.close()
    pool.join()
    end = time.time()
    print('Elapsed: {:.2f}s'.format(end - start))
Notes:
- In the main function we first create a thread pool with six threads;
- A for loop dynamically builds 20 urls as a small trial covering 20 heroes' skins. To crawl everything, traverse the idList obtained earlier and build the urls from it;
- The map() function runs the data parsing and storage operations on the urls handed to the thread pool;
- Calling close() on the thread pool does not shut it down; it only changes the state so that no more tasks can be submitted;
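The close() / join() behaviour can be seen in a small sketch: tasks that were already submitted keep running after close(), and join() waits for them to finish. The function `work` is a toy stand-in written for this example.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def work(n):
    """Toy task that takes a moment to finish."""
    time.sleep(0.1)
    return n * 2


pool = ThreadPool(2)
# Four tasks on two threads: some are still queued when close() is called.
pending = [pool.apply_async(work, (n,)) for n in range(4)]
pool.close()  # no new tasks accepted; submitted tasks keep running
pool.join()   # wait until every submitted task has finished
results = [p.get() for p in pending]
print(results)
```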
5. Running the program
if __name__ == '__main__':
    main()
The result is as follows:
Only part of the images is shown here. In total I crawled 200+ pictures, which is OK overall.
6. Summary
This time we used multithreading to crawl the HD hero skin wallpapers from the official League of Legends website. Because saving pictures involves I/O operations, running the work concurrently greatly improves the program's efficiency.
Crawling should be done in moderation. This small trial only crawled the skin pictures of 20 heroes. Interested readers can crawl all the skins by changing the traversed elements to the idList obtained earlier.
This is the end of the article.
Original writing is not easy. If you found this article useful, please like, comment on, or share it; that is my motivation for writing more quality articles. Thank you!
By the way, friends, remember to follow me so you can find me again next time.
See you next time!
copyright notice
author[Jacko's it journey],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/01/202201300329572468.html