
Python crawler in practice: the pytesseract module, and visualizing BOSS Zhipin & Lagou job-posting data

2022-02-01 07:37:32 Dai mubai

「This is day 21 of my participation in the November More-Text Challenge. See the event details: 2021 Last More-Text Challenge.」

Preface

Using Python to scrape and visualize data-analysis job postings from BOSS Zhipin and Lagou. Without further ado,

let's get started~

development tool

Python version: 3.6.4

Related modules:

requests module;

pyspider module;

pymysql module;

pytesseract module;

random module;

re module;

as well as some built-in Python modules.

Environment building

Install Python, add it to your environment variables, and pip install the required modules.

This time we scrape data-analysis postings from BOSS Zhipin and Lagou to understand the state of the data-analysis job market.

Page analysis


From the BOSS Zhipin index pages we collect the job title, salary, location, required experience, and required education, plus the company name, type, funding status, and size.

Initially I also wanted to scrape the detail pages, since they contain the job description and the required skills.

But that meant too many requests, so I gave up. The index has 10 pages with 30 postings each, and each detail page needs its own request, so 300 requests in total.

By page 2 (about 60 requests), I was already getting a "too frequent access" warning.

Fetching only the index pages takes just 10 requests, which is basically fine, and I didn't want to bother with proxy IPs, so I kept it simple.

When I get to the data-mining postings, I'll see whether slowing the requests down is enough.
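As a minimal sketch of that "slow down" idea (the helper name and delay range are my own, not part of the scripts below), spacing requests out with a random pause looks like this:

```python
import random
import time

def polite_fetch(fetch, urls, min_delay=2.0, max_delay=5.0):
    """Call fetch(url) for each URL, sleeping a random interval
    between requests to stay under the site's rate limit."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(random.uniform(min_delay, max_delay))
    return results
```

A random interval is slightly gentler than a fixed one, since evenly spaced requests are easier for a site to fingerprint as a bot.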

Getting the request URL

From the Lagou index pages we collect the job title, location, salary, required experience and education, the company name, type, funding status and size, plus job skills and benefits.

The page loads its data via Ajax requests; this time I wrote the code in PyCharm, for a change of pace.

Data acquisition

Scraping BOSS Zhipin data with pyspider

pyspider is easy to install: just run pip3 install pyspider on the command line.

Because I had not previously installed PhantomJS (which pyspider uses to handle JavaScript-rendered pages),

I had to download the exe from the official site and put it in the folder containing the Python executable.

Finally, enter pyspider all on the command line to start pyspider.

Open http://localhost:5000/ in your browser, create a project, give it a name, and enter the start URL; you'll see the screen below.

pyspider dashboard

Then write the code in pyspider's script editor, correcting it based on the feedback shown in the panel on the left.

Writing code in the editor

The full script is as follows:

from pyspider.libs.base_handler import *
import pymysql
import random
import time
import re

count = 0

class Handler(BaseHandler):
    # Add a request header, otherwise the site returns 403
    crawl_config = {'headers': {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}}

    def __init__(self):
        #  Connect to database 
        self.db = pymysql.connect(host='127.0.0.1', user='root', password='774110919', port=3306, db='boss_job', charset='utf8mb4')

    def add_Mysql(self, id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people):
        #  Write data to the database 
        try:
            cursor = self.db.cursor()
            sql = 'insert into job(id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people) values ("%d", "%s", "%s", "%s", "%s", "%s", "%s", "%s", "%s", "%s")' % (id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people);
            print(sql)
            cursor.execute(sql)
            print(cursor.lastrowid)
            self.db.commit()
        except Exception as e:
            print(e)
            self.db.rollback()

    @every(minutes=24 * 60)
    def on_start(self):
        # pyspider defaults to HTTP; for HTTPS (encrypted) requests add validate_cert=False, otherwise you get a 599/SSL error
        self.crawl('https://www.zhipin.com/job_detail/?query=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&scity=100010000&industry=&position=', callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        time.sleep(random.randint(2, 5))
        for i in response.doc('li > div').items():
            #  Set global variables 
            global count
            count += 1
            #  Job title 
            job_title = i('.job-title').text()
            print(job_title)
            #  Job salary 
            job_salary = i('.red').text()
            print(job_salary)
            #  Post location 
            city_result = re.search('(.*?)<em class=', i('.info-primary > p').html())
            job_city = city_result.group(1).split(' ')[0]
            print(job_city)
            #  Job experience 
            experience_result = re.search('<em class="vline"/>(.*?)<em class="vline"/>', i('.info-primary > p').html())
            job_experience = experience_result.group(1)
            print(job_experience)
            #  Job qualifications 
            job_education = i('.info-primary > p').text().replace(' ', '').replace(city_result.group(1).replace(' ', ''), '').replace(experience_result.group(1).replace(' ', ''),'')
            print(job_education)
            #  Corporate name 
            company_name = i('.info-company a').text()
            print(company_name)
            #  Company type 
            company_type_result = re.search('(.*?)<em class=', i('.info-company p').html())
            company_type = company_type_result.group(1)
            print(company_type)
            #  Company status 
            company_status_result = re.search('<em class="vline"/>(.*?)<em class="vline"/>', i('.info-company p').html())
            if company_status_result:
                company_status = company_status_result.group(1)
            else:
                company_status = ' No information '
            print(company_status)
            #  The company size 
            company_people = i('.info-company p').text().replace(company_type, '').replace(company_status,'')
            print(company_people + '\n')
            #  Write to the database 
            self.add_Mysql(count, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people)
        #  Get the information on the next page 
        next = response.doc('.next').attr.href
        if next != 'javascript:;':
            self.crawl(next, callback=self.index_page, validate_cert=False)
        else:
            print("The Work is Done")
        # Detail-page scraping; disabled because of the request limit
        #for each in response.doc('.name > a').items():
            #url = each.attr.href
            #self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False)

    @config(priority=2)
    def detail_page(self, response):
        # Detail-page scraping; unused because of the request limit
        message_job = response.doc('div > .info-primary > p').text()
        city_result = re.findall(' City :(.*?) Experience ', message_job)
        experience_result = re.findall(' Experience :(.*?) Education ', message_job)
        education_result = re.findall(' Education :(.*)', message_job)

        message_company = response.doc('.info-company > p').text().replace(response.doc('.info-company > p > a').text(),'')
        status_result = re.findall('(.*?)\d', message_company.split(' ')[0])
        people_result = message_company.split(' ')[0].replace(status_result[0], '')

        return {
            "job_title": response.doc('h1').text(),
            "job_salary": response.doc('.info-primary .badge').text(),
            "job_city": city_result[0],
            "job_experience": experience_result[0],
            "job_education": education_result[0],
            "job_skills": response.doc('.info-primary > .job-tags > span').text(),
            "job_detail": response.doc('div').filter('.text').eq(0).text().replace('\n', ''),
            "company_name": response.doc('.info-company > .name > a').text(),
            "company_status": status_result[0],
            "company_people": people_result,
            "company_type": response.doc('.info-company > p > a').text(),
        }
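The salary field scraped above is a string like "15k-25k"; before plotting it helps to turn that into a number. A small sketch of one way to do it (the helper is mine, not part of the original script):

```python
import re

def parse_salary(salary):
    # Turn a range like "15k-25k" into its midpoint (in thousands of RMB);
    # return None for strings that don't match the pattern.
    m = re.match(r'(\d+)[kK]\s*-\s*(\d+)[kK]', salary)
    if m:
        low, high = int(m.group(1)), int(m.group(2))
        return (low + high) / 2
    return None
```

For example, parse_salary('15k-25k') gives 20.0, while a non-matching string such as "negotiable" gives None so it can be dropped before charting.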

The BOSS Zhipin data-analysis postings obtained:

BOSS Zhipin data-analysis postings

Scraping Lagou data with PyCharm

import requests
import pymysql
import random
import time
import json

count = 0
#  Set the request URL and request header parameters 
url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Cookie': 'your Cookie value here',  # placeholder: paste your own Cookie
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Connection': 'keep-alive',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?labelWords=sug&fromSearch=true&suginput=shuju'
}

#  Connect to database 
db = pymysql.connect(host='127.0.0.1', user='root', password='774110919', port=3306, db='lagou_job', charset='utf8mb4')


def add_Mysql(id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people, job_tips, job_welfare):
    #  Write data to the database 
    try:
        cursor = db.cursor()
        sql = 'insert into job(id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people, job_tips, job_welfare) values ("%d", "%s", "%s", "%s", "%s", "%s", "%s", "%s", "%s", "%s", "%s", "%s")' % (id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people, job_tips, job_welfare);
        print(sql)
        cursor.execute(sql)
        print(cursor.lastrowid)
        db.commit()
    except Exception as e:
        print(e)
        db.rollback()


def get_message():
    for i in range(1, 31):
        print('Page ' + str(i))
        time.sleep(random.randint(10, 20))
        data = {
            'first': 'false',
            'pn': i,
            'kd': '数据分析'  # search keyword: "data analysis"
        }
        response = requests.post(url=url, data=data, headers=headers)
        result = json.loads(response.text)
        job_messages = result['content']['positionResult']['result']
        for job in job_messages:
            global count
            count += 1
            #  Job title 
            job_title = job['positionName']
            print(job_title)
            #  Job salary 
            job_salary = job['salary']
            print(job_salary)
            #  Post location 
            job_city = job['city']
            print(job_city)
            #  Job experience 
            job_experience = job['workYear']
            print(job_experience)
            #  Job qualifications 
            job_education = job['education']
            print(job_education)
            #  Corporate name 
            company_name = job['companyShortName']
            print(company_name)
            #  Company type 
            company_type = job['industryField']
            print(company_type)
            #  Company status 
            company_status = job['financeStage']
            print(company_status)
            #  The company size 
            company_people = job['companySize']
            print(company_people)
            #  Job skills 
            if len(job['positionLables']) > 0:
                job_tips = ','.join(job['positionLables'])
            else:
                job_tips = 'None'
            print(job_tips)
            #  Work benefits 
            job_welfare = job['positionAdvantage']
            print(job_welfare + '\n\n')
            #  Write to database 
            add_Mysql(count, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people, job_tips, job_welfare)


if __name__ == '__main__':
    get_message()
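Both scripts build their INSERT statements with Python's % operator, which breaks as soon as a value contains a quote character. pymysql's cursor.execute accepts parameters (with %s placeholders) that the driver escapes for you. A sketch of the same idea using the stdlib sqlite3 module (which uses ? placeholders) so it runs anywhere; the trimmed-down table is illustrative only:

```python
import sqlite3

def add_job(conn, row):
    # Let the driver do the quoting instead of formatting SQL by hand;
    # with pymysql the placeholder would be %s instead of ?.
    conn.execute(
        'insert into job(id, job_title, job_salary) values (?, ?, ?)',
        row,
    )
    conn.commit()

conn = sqlite3.connect(':memory:')
conn.execute('create table job(id integer, job_title text, job_salary text)')
add_job(conn, (1, 'Data Analyst', '15k-25k'))
```

This also sidesteps the odd "%d"-in-quotes formatting in the scripts above, since the driver handles type conversion.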

The Lagou data-analysis postings obtained:

Lagou data-analysis postings

Data visualization

City distribution map

BOSS Zhipin city distribution map

Lagou city distribution map

City distribution heat map

BOSS Zhipin city distribution heat map

Lagou city distribution heat map

Work-experience salary chart

BOSS Zhipin work-experience salary chart

Lagou work-experience salary chart

Looking at the quartiles and medians in the box plots, you can see that salary rises steadily with years of experience.

On BOSS Zhipin, among salaries labeled "under 1 year of experience" there is a maximum above 40,000, which can't be right.

So I went back to the database: that posting actually requires 3+ years of experience but is labeled as under 1 year.

So the accuracy of the data provided by the source is very important.
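A box plot flags such points mechanically: anything beyond 1.5×IQR from the quartiles is drawn as an outlier. A stdlib-only sketch of that rule (the helper function is mine, not from the scripts above):

```python
import statistics

def iqr_filter(values):
    # Drop values outside 1.5 * IQR of the quartiles -- the same
    # rule a box plot's whiskers use to mark points as outliers.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lo <= v <= hi]
```

Automatic filtering only hides the symptom, though; as the example above shows, the real fix is checking why the source mislabeled the record.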

Education salary chart

BOSS Zhipin education salary chart

Lagou education salary chart

Overall, 「master's」 > 「bachelor's」 > 「associate」, though of course there are also highly paid bachelor's holders.

After all, ability matters more in the long run; education is an important bonus.

Company funding-status salary chart

BOSS Zhipin company funding-status salary chart

Lagou company funding-status salary chart

Company-size salary chart

BOSS Zhipin company-size salary chart

Lagou company-size salary chart

Normally, the bigger the company, the higher the salary should be.

After all, everyone knows roughly what the big companies pay; beyond that it's hard to say.

Company type TOP10

BOSS Zhipin company type TOP10

Lagou company type TOP10

Data-analysis jobs are concentrated mainly in the Internet industry, but 「finance」, 「real estate」, 「education」, 「healthcare」, and 「gaming」 are also represented.
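Tallying that TOP10 from the scraped company_type column comes down to a frequency count, which collections.Counter handles directly (the sample values below are made up for illustration):

```python
from collections import Counter

def top_types(company_types, k=10):
    # Return the k most common company types with their counts,
    # the same numbers plotted in the TOP10 charts above.
    return Counter(company_types).most_common(k)
```

For instance, top_types(['Internet', 'Finance', 'Internet'], k=2) returns [('Internet', 2), ('Finance', 1)].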

Job-skills chart

Lagou job-skills chart

Job-benefits word cloud


From this you can see that most benefits center on 「five social insurances and one housing fund」, 「Forrest」, 「good team atmosphere」, 「plenty of room for promotion」, and 「industry leader」.

Copyright notice
Author: Dai mubai. Please include a link to the original when reprinting. Thank you.
https://en.pythonmana.com/2022/02/202202010737314025.html
