
Whoever has the fans, crawl them! Python multithreaded collection of 260,000+ fan records

2022-02-01 16:24:08 Dream eraser

「This is day 29 of my participation in the November More-Text Challenge. See the event details: 2021 Last More-Text Challenge」

Whose fans shall we crawl today? Whoever has the most fans, that's who we crawl. So who has fans? Silent King II does.

Today we continue our study of Python crawlers, starting with this post on a short (15-part) series about multithreaded crawlers.

The first target is the fan list of @Silent King II, who has 270,000+ followers; honestly, that's enviable.

Target data source analysis

The data source for this crawl is https://blog.csdn.net/qing_gee?type=sub&subType=fans, where the ID (qing_gee) can be swapped for any ID you want to collect, including your own.

The page automatically requests an API endpoint, namely https://blog.csdn.net/community/home-api/v1/get-fans-list?page=3&size=20&noMore=false&blogUsername=qing_gee, whose parameters are as follows:

  • page: page number, computed as the total fan count / 20, rounded up;
  • size: records per page, default 20;
  • noMore: unused;
  • blogUsername: the blogger's username.
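As a quick sanity check, the number of pages to request can be derived from the total fan count. A minimal sketch (the helper name `page_count` is just for illustration, not part of the original code):

```python
import math

def page_count(total_fans, page_size=20):
    """Pages needed to cover `total_fans` at `page_size` records per page."""
    return math.ceil(total_fans / page_size)

# roughly 270,000 fans at 20 records per page
print(page_count(270000))  # → 13500
```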

While testing the interface, I found it occasionally returns error responses. In practice, adding a delay between requests greatly improves the stability of the returned data:

{'code': 400, 'message': 'fail', 'data': None}
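One way to handle these occasional failures is to retry with a delay. Below is a minimal sketch of that idea; the `fetch` callable and the simulated responses are illustrative stand-ins for the real `requests.get(...).json()` call, not part of the original code:

```python
import random
import time

def get_with_retry(fetch, retries=3, base_delay=1.0):
    """Call `fetch` until it returns a non-400 payload, sleeping between tries."""
    for attempt in range(retries):
        payload = fetch()
        if payload.get("code") != 400:
            return payload
        time.sleep(base_delay + random.random())  # randomised delay between tries
    return payload  # still the error payload after all retries

# simulate an interface that fails once, then succeeds
responses = iter([
    {'code': 400, 'message': 'fail', 'data': None},
    {'code': 200, 'data': {'list': []}},
])
print(get_with_retry(lambda: next(responses), base_delay=0.01))
```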

A normal response from the interface looks like the figure below: [screenshot of the fan-list JSON response]

Notes on the techniques used

This time we use Python multithreading for the data collection, with the threading module handling thread control. This series starts from the simplest form of multithreading; in this case, 5 requests (configurable) are launched at a time.

The full code is shown below; see the comments and the explanation at the end for details.

import threading
from threading import Lock, Thread
import time
import os
import requests
import random


class MyThread(threading.Thread):
    def __init__(self, name):
        super(MyThread, self).__init__()
        self.name = name

    def run(self):
        global urls
        lock.acquire()
        if not urls:  # nothing left to crawl
            lock.release()
            return
        one_url = urls.pop()
        print("Crawling:", one_url)
        lock.release()
        # each thread waits a random time before requesting
        time.sleep(random.randint(1, 3))
        res = requests.get(one_url, headers=self.get_headers(), timeout=5)

        if res.json()["code"] != 400:
            data = res.json()["data"]["list"]
            for user in data:
                name = user['username']
                nickname = self.remove_character(user['nickname'])
                userAvatar = user['userAvatar']
                blogUrl = user['blogUrl']
                blogExpert = user['blogExpert']
                briefIntroduction = self.remove_character(
                    user['briefIntroduction'])

                with open('./qing_gee_data.csv', 'a+', encoding='utf-8') as f:
                    print(f'{name},{nickname},{userAvatar},{blogUrl},{blogExpert},{briefIntroduction}')
                    f.write(f"{name},{nickname},{userAvatar},{blogUrl},{blogExpert},{briefIntroduction}\n")
        else:
            print(res.json())
            print("Abnormal data:", one_url)
            with open('./error.txt', 'a+', encoding='utf-8') as f:
                f.write(one_url+"\n")
    # strip characters that would break the CSV layout
    def remove_character(self, origin_str):
        if origin_str is None:
            return ''
        origin_str = origin_str.replace('\n', '')
        origin_str = origin_str.replace(',', ',')
        return origin_str
    # build request headers with a random User-Agent
    def get_headers(self):
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
        ]
        ua = random.choice(uas)
        # Note: copy the cookie below from your browser's developer tools,
        # otherwise the captured data will be missing the nickname and profile fields
        headers = {
            "user-agent": ua,
            'cookie': 'UserName=your ID; UserInfo=your UserInfo; UserToken=your UserToken;',
            "referer": "https://blog.csdn.net/qing_gee?type=sub&subType=fans"
        }
        return headers


if __name__ == '__main__':
    lock = Lock()
    url_format = 'https://blog.csdn.net/community/home-api/v1/get-fans-list?page={}&size=20&noMore=false&blogUsername=qing_gee'
    urls = [url_format.format(i) for i in range(1, 13300)]
    while len(urls) > 0:
        print(len(urls))
        batch = []  # start 5 threads per batch, then wait for all of them
        for i in range(5):
            p = MyThread("t" + str(i))
            batch.append(p)
            p.start()
        for p in batch:
            p.join()

The running output looks like the figure below: [screenshot of console output]

The code above uses multithreading together with a thread lock. The core multithreaded pattern can be abstracted as follows.

A simple multithreaded example:

import threading
import time

def run(n):
    print('task', n)
    time.sleep(3)

if __name__ == '__main__':
    t1 = threading.Thread(target=run, args=('t1',))
    t2 = threading.Thread(target=run, args=('t2',))
    t1.start()
    t2.start()

The core is threading.Thread: the target parameter takes the function to run, and args the arguments to pass to it; note that args must be a tuple.
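To see the tuple requirement in action, here is a variant of the example above that collects its results, so the threads' work is visible after join:

```python
import threading

results = []

def run(n):
    results.append(f'task {n}')

# args must be a tuple -- note the trailing comma in ('t1',)
t1 = threading.Thread(target=run, args=('t1',))
t2 = threading.Thread(target=run, args=('t2',))
t1.start()
t2.start()
t1.join()
t2.join()
print(sorted(results))  # → ['task t1', 'task t2']
```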

The crawler code also relies on a shared global variable. In the simplified code below, the parts to focus on are lock = Lock(), and the lock.acquire() and lock.release() calls placed around every use of the global variable. It also uses the thread join method, which makes the main thread wait for the child threads to finish.

from threading import Lock, Thread

def work():
    global urls
    lock.acquire()
    # take one URL under the lock
    one_url = urls.pop()
    lock.release()

    print("Got URL:", one_url)


if __name__ == '__main__':
    lock = Lock()
    url_format = 'https://blog.csdn.net/community/home-api/v1/get-fans-list?page={}&size=20&noMore=false&blogUsername=qing_gee'
    # build the URL list, a shared global variable
    urls = [url_format.format(i) for i in range(1, 13300)]
    l = []
    # number of threads to start
    for i in range(3):
        p = Thread(target=work)
        l.append(p)
        p.start()
    for p in l:
        p.join()
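The same pop-under-lock pattern can also be written with a `with lock:` block, which releases the lock even if an exception occurs. A small self-contained sketch with made-up page names standing in for the real URL list:

```python
from threading import Lock, Thread

urls = [f'page-{i}' for i in range(100)]  # stand-in for the real URL list
lock = Lock()
seen = []

def work():
    while True:
        with lock:  # the lock guarantees no two threads pop the same URL
            if not urls:
                return
            one_url = urls.pop()
        seen.append(one_url)  # list.append is atomic in CPython

threads = [Thread(target=work) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(seen), len(set(seen)))  # → 100 100 (each URL handled exactly once)
```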

With the data in hand, you can build a targeted user portrait of an author; that part will get its own detailed introduction in a follow-up post.

There is still room for optimization in the data-cleaning part of the code. With 13,300 pages of data, the final grab comes to 260,000+ records; I searched through them, and sure enough, Dream Eraser is in there.
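For the cleaning step, one easy win is dropping duplicate records, since retried pages can write the same fan twice. A sketch over the CSV layout used above, where column 0 is the username (the sample rows below are invented):

```python
import csv
import io

def dedupe_rows(csv_text):
    """Keep only the first record per username (column 0)."""
    seen = set()
    unique = []
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0] not in seen:
            seen.add(row[0])
            unique.append(row)
    return unique

sample = ("alice,Alice,a.png,url1,True,hello\n"
          "alice,Alice,a.png,url1,True,hello\n"
          "bob,Bob,b.png,url2,False,hi\n")
print(len(dedupe_rows(sample)))  # → 2
```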


At least 83 of the followers are bloggers themselves. The personal profiles of the blog experts are noticeably well-written, and I even spotted jiangtao (CSDN's founder) among them.


Time to bookmark

Code download address: codechina.csdn.net/hihell/pyth… — would you give it a Star?

You've read this far; how about leaving a comment, a like, or a bookmark?

Copyright notice
Author: Dream eraser. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202011624070905.html
