
Python practical skills task segmentation

2022-01-30 00:47:15 cxapython


Today let's talk about task segmentation in Python. Take a web crawler as an example: we read the contents of a txt file that stores URLs, and we get a list of URLs. Let's call this URL list the big task.

List segmentation

Ignoring memory usage for now, let's split the big task above into smaller ones. For example, suppose the constraint is that we may fetch at most 5 URLs per second.

import os
import time

CURRENT_DIR = os.path.dirname(os.path.abspath(__file__))

def read_file():
    file_path = os.path.join(CURRENT_DIR, "url_list.txt")
    with open(file_path, "r", encoding="utf-8") as fs:
        result = [i.strip() for i in fs.readlines()]
    return result

def fetch(urls):
    # Placeholder for the real request logic; here we just print the batch.
    print(urls)

def run():
    max_count = 5  # at most 5 URLs per second
    url_list = read_file()
    for index in range(0, len(url_list), max_count):
        start = time.time()
        fetch(url_list[index:index + max_count])
        elapsed = time.time() - start
        if elapsed < 1:
            time.sleep(1 - elapsed)  # pad each batch out to a full second


if __name__ == '__main__':
    run()

The key code is the for loop. The third argument of range sets the iteration step to 5, so index grows in increments of 5: 0, 5, 10, and so on. We then slice url_list, taking five elements at a time; which five changes as index grows. If fewer than five elements remain at the end, slicing simply returns whatever is left, so there is no risk of an index-out-of-range error.
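The range-plus-slice chunking described above can be seen in isolation with a short, self-contained sketch (the helper name and the 12-element list are illustrative, not from the original article):

```python
def chunk_list(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# A 12-element list does not divide evenly by 5:
# the final chunk simply holds the 2 leftover elements.
print(chunk_list(list(range(12)), 5))
# [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```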

As the URL list grows, memory consumption grows with it. At that point the code needs changing. We know that generators are far more memory-friendly, so after the modification the code looks like this.

Generator segmentation

# -*- coding: utf-8 -*-
# @Time: 2019-11-23 23:47
# @Author: Chen Xiangan
# @File: g.py
# @Official account: Python learning and development
import os
import time
from itertools import islice

CURRENT_DIR = os.path.dirname(os.path.abspath(__file__))


def read_file():
    file_path = os.path.join(CURRENT_DIR, "url_list.txt")
    with open(file_path, "r", encoding="utf-8") as fs:
        for i in fs:
            yield i.strip()


def fetch(urls):
    # Placeholder for the real request logic; here we just print the batch.
    print(urls)


def run():
    max_count = 5
    url_gen = read_file()
    while True:
        # islice consumes up to max_count items from the generator each pass
        url_list = list(islice(url_gen, max_count))
        if not url_list:
            break
        start = time.time()
        fetch(url_list)
        elapsed = time.time() - start
        if elapsed < 1:
            time.sleep(1 - elapsed)


if __name__ == '__main__':
    run()

First, we changed how the file is read: instead of returning a list, read_file now yields lines from a generator. This saves a great deal of memory when the function is called.

Then the for loop above is reworked. Because every iteration consumes elements from the generator, a plain for loop with index-based slicing no longer fits. Instead we use itertools.islice, which is slicing for iterators: each call carves off an iterator over the next 5 elements of url_gen. Since a generator has no __len__, we convert each slice to a list and check whether that list is empty to know when the iteration should stop.
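The way repeated islice calls drain a generator in chunks can be shown with a minimal sketch (the numbers() generator here is a stand-in for read_file):

```python
from itertools import islice

def numbers():
    # Stand-in for a generator that yields lines from a file.
    yield from range(12)

gen = numbers()
chunks = []
while True:
    # Each call consumes up to 5 more items from the *same* generator.
    chunk = list(islice(gen, 5))
    if not chunk:
        break
    chunks.append(chunk)

print(chunks)
# [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```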

With this change, both performance and memory usage improve greatly; reading files with tens of millions of lines is no longer a problem. Beyond that, when writing asynchronous crawlers you may need to slice an asynchronous generator, so let's discuss asynchronous generator segmentation next.

Asynchronous generator segmentation

First, let's look at a simple generator. We know that calling the following function returns a generator object:

def foo():
    for i in range(20):
        yield i

If we add async in front of def, calling it returns an asynchronous generator instead. The complete example code is as follows:

import asyncio
async def foo():
    for i in range(20):
        yield i


async def run():
    async_gen = foo()
    async for i in async_gen:
        print(i)


if __name__ == '__main__':
    asyncio.run(run())

Slicing with async for is a bit more involved, so the aiostream module is recommended. With it, the code becomes the following:

import asyncio
from aiostream import stream

async def foo():
    for i in range(22):
        yield i


async def run():
    index = 0
    limit = 5

    while True:
        # A fresh stream over foo() is built on every pass; the slice then
        # skips the first `index` items and takes the next `limit`.
        xs = stream.iterate(foo())
        ys = xs[index:index + limit]
        t = await stream.list(ys)
        if not t:
            break
        print(t)
        index += limit


if __name__ == '__main__':
    asyncio.run(run())
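If you would rather avoid the extra dependency, the same chunking can be done by hand with async for and a small buffer. This is a sketch of an alternative, not part of the original article; the chunked helper name is mine. Unlike the aiostream version above, it consumes the async generator only once:

```python
import asyncio

async def foo():
    for i in range(22):
        yield i

async def chunked(agen, size):
    """Yield lists of at most `size` items from an async generator."""
    buffer = []
    async for item in agen:
        buffer.append(item)
        if len(buffer) == size:
            yield buffer
            buffer = []
    if buffer:  # flush the final partial chunk
        yield buffer

async def run():
    async for chunk in chunked(foo(), 5):
        print(chunk)

if __name__ == '__main__':
    asyncio.run(run())
```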

The original content comes from my Zhihu column :zhuanlan.zhihu.com/p/93413442

Copyright notice
Author: cxapython. Please include a link to the original when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201300047120989.html
