
Van * Python | Simple crawling of a site's courses

2022-01-30 13:15:56 coder_pig

This article has been entered in the 「Digging Force Star Program」 to win a creator gift pack and compete for the creation incentive fund.

0x0、Introduction

"Golden September, silver October" flew by in the blink of an eye; next week it will already be November and I haven't written a single article in October. How embarrassing. Luckily there is some material at hand, so let me hurry up and knock out a quick post ~

Solemn statement

This article only documents research into and study of crawler techniques. No crawling scripts will be provided, and the crawled data has been deleted and never distributed. Do not use any of this for illegal purposes; the author is not responsible for losses caused by any such misuse.

Friends interested in learning about crawlers can head over to my earlier series: 《Python crawlers: from getting started to getting arrested》 study notes


0x1、How it started

On my commute I browse a certain job-hunting APP every day. One day my finger slid over to the Courses tab: oh, there are quite a few courses, and the quality looks decent. Nice.

Then I suddenly remembered that a couple of years ago, through one of those N-person group-buy promotions, I scored a whole year of VIP for free. Plenty of friends probably did the same as me: added the courses to their favorites and let them sit there gathering dust ~

"One of these days I'll study them," right? Except this VIP annual pass is, damn it, about to expire!!!

Don't panic, no big deal. If it expires, it expires; worst case I renew. A dozen yuan, the price of a milk tea, I can still afford that ~

Then I glanced at the renewal fee:

Good grief, that price... My mouth said "it's fine" (acting tough), but my hands (honest as ever) started taking long screenshots ...

Because I'm poor, will even knowledge abandon me? No, you can't leave!!!

After a few long screenshots, I started to feel something was off:

Each one took several minutes. With this many courses and chapters, when would I ever finish? My power button would probably give out before I was done.

And while holding the phone to take screenshots I couldn't do anything else; one slip of the finger and I'd tap the wrong thing, switch on the flashlight, and embarrass myself in the crowded subway ...

As a developer who likes being lazy, I had to find a way to free my hands and make the dream come true. No sooner said than done!


0x2、Auto-tapping doesn't quite work

To hand these little operations over to a program, you first have to sketch out the screenshot flow:

Three steps repeated in a loop until every course has been captured. The flow looks simple. There are four common options for automating taps on a phone:

  • Accessibility services
  • Python script + adb commands
  • Automated testing tools: Appium, airtest
  • autojs

Open the Android phone's Developer options → Show layout bounds, and you can see that ① and ② are native controls, so it is easy to locate them for simulated clicks / reading their text.
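For reference, a rough sketch of what the "Python script + adb" option could look like for one tap-scroll-capture cycle. This is only an illustration, assuming a device connected over adb; the coordinates and delays are placeholders you would measure on your own phone.

import subprocess
import time

def tap(x, y):
    # Simulate a tap at the given screen coordinates
    subprocess.run(['adb', 'shell', 'input', 'tap', str(x), str(y)])

def swipe_up():
    # Swipe from the lower part of the screen towards the top to scroll one page
    subprocess.run(['adb', 'shell', 'input', 'swipe', '500', '1600', '500', '400', '300'])

def screenshot(save_path):
    # screencap -p writes a PNG to stdout; exec-out keeps the binary stream intact
    with open(save_path, 'wb') as f:
        subprocess.run(['adb', 'exec-out', 'screencap', '-p'], stdout=f)

if __name__ == '__main__':
    tap(540, 960)            # open a course (placeholder coordinates)
    time.sleep(2)
    for i in range(5):       # scroll and capture a few screens
        screenshot('page_{}.png'.format(i))
        swipe_up()
        time.sleep(1)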

Handling the clicks and the logic is fine; the real difficulty is the 「long screenshot」. As far as I know none of the tools above supports long screenshots, so I would have to implement them myself. The rough plan:

Slide and screenshot several times → stitch the screenshots into one long image

The processing here is very tedious: calculating the slide distance, getting the stitching accurate (overlapping or missing content), and so on. Hard, thankless work. Why bother? So, change of plan: let's capture the traffic instead ~


0x3、Packet capture doesn't look great either

First I captured the PC client: 23333, the request headers are encrypted. Dissuaded ~

Then the Android client: 23333, same encrypted headers. Dissuaded again ~

Seriously? Blood pressure rising! Decrypt them, then? A quick look shows the APK uses 360 hardening, so next up would be unpacking and reverse engineering?

23333, just kidding. The title says "simple", so there must be an easier way, and here it is: simulated clicking + packet capture


0x4、Clicking + packet capture will do

Switching from the mobile app to clicking through the PC web page instead. The conventional technical options:

  • Selenium and Pyppeteer. The latter is based on the Chromium kernel, needs no tedious environment setup, and is also more efficient than the former. The former is used here simply because I'm more familiar with it.

The easy way to play it:

Use the element-finding APIs to locate elements → simulate clicks → simulate input → read the text from specific tags → save to a local file

That seems a bit too easy and mindless? Then let's add a little technical content:

Work with a packet-capture tool → intercept the requests the page makes → filter out the data we need → save to a local file

browsermob-proxy is used here to do the interception. Next, let's walk through the crawling process ~


1、 Tool preparation

  • Selenium → install directly with pip install selenium; if it won't install, search for a fix yourself;
  • chromedriver.exe → check your Chrome browser version and download the matching chromedriver from the official site, then put it in the project directory;
  • browsermob-proxy-2.1.4 → download it straight from the Github repo and unzip it into the project directory as well;

Any other libraries can be installed directly with pip ~


2、 Initialize proxy server and browser

import os
import time

from browsermobproxy import Server
from selenium import webdriver


#  Initialize the proxy server
def init_proxy():
    server = Server(os.path.join(os.getcwd(), r'browsermob-proxy-2.1.4\bin\browsermob-proxy'))
    server.start()
    return server.create_proxy()


#  Initialize the browser, passing in the proxy object
def init_browser(proxy):
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')  #  Headless browser
    chrome_options.add_argument('--ignore-certificate-errors')  #  Ignore certificate errors
    chrome_options.add_argument('--start-maximized')  #  Start maximized
    #  Set a user data directory so we don't have to log in every time
    chrome_options.add_argument(r'--user-data-dir=D:\ChromeUserData')
    chrome_options.add_argument('--proxy-server={0}'.format(proxy.proxy))  #  Route requests through the proxy
    return webdriver.Chrome(options=chrome_options)


if __name__ == '__main__':
    server_proxy = init_proxy()
    browser = init_browser(server_proxy)
    server_proxy.new_har("test", options={
        'captureContent': True,
        'captureHeaders': True
    })
    browser.get("https://www.baidu.com")
    time.sleep(2)
    catch_result = server_proxy.har
    for entry in catch_result['log']['entries']:
        print(entry['response']['content'])

    #  Remember to close everything when you're done
    server_proxy.close()
    browser.close()

Run it and wait a moment; you can see the console print the captured log information:

Tips: you can set a breakpoint on catch_result and inspect it in the debugger to note the keys of the data you want, no need to search around ~
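If you would rather not use the debugger, here is a minimal sketch of walking the HAR structure directly; the request URL and the response content are the only fields used later in this article, and the 'text' key is only present when a body was actually captured.

#  Print just the parts of each HAR entry we care about
for entry in catch_result['log']['entries']:
    url = entry['request']['url']
    content = entry['response']['content']
    #  'text' only exists when captureContent is enabled and a body came back
    print(url, content.get('mimeType'), len(content.get('text', '')))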

With the proxy and the browser wired together, it's time to start clicking ~


3、 Simulated login

Open the home page in the browser and locate the login element:

Check whether this element exists. If it does, we are not logged in yet, so run the login logic: click the button and the following pop-up appears:

Switch to the account-and-password tab, locate the input nodes for the phone number and the password, type them in, then click Log in.

Sometimes, because of risk control or other factors, a captcha check will pop up, such as:

A simple way to deal with it: leave some waiting time after clicking Log in so the verification can be done manually.

Because a Chrome user data directory is configured, the session persists after logging in once, so later browser launches don't need to log in again. Of course, if the account gets logged out elsewhere or the Cookie expires, you may need to call the login method again manually.

A simple code example ~

from selenium.webdriver.common.by import By

def login():
    browser.get(base_url)
    time.sleep(2)
    #  find_elements returns an empty list instead of raising when nothing matches
    not_login_items = browser.find_elements_by_class_name("not-login-item")
    if not not_login_items:
        print(" Already logged in, nothing to do ")
    else:
        #  Click the login entry
        not_login_items[0].click()
        time.sleep(1)
        
        #  Switch to the account + password login tab
        browser.find_elements_by_class_name("account-text")[1].click()
        
        #  Enter the account and password
        input_items = browser.find_elements_by_class_name("input-item")
        input_items[2].find_element(By.TAG_NAME, 'input').send_keys("xxx")
        input_items[3].find_element(By.TAG_NAME, 'input').send_keys("xxx")
        
        #  Click the login button
        browser.find_element_by_class_name("login-btn").click()
        
        #  A captcha pop-up may appear; leave enough time to solve it manually
        time.sleep(20)
        
        #  Close the proxy and browser when done
        proxy.close()
        browser.close()

4、 Get all course IDs

In the column at the bottom of the home page, find the entry that lists all the courses:

Press F12 to open developer tools, switch to the Network tab, clear it, then refresh the page. Pick any course name and search for it below; the request is easy to locate:

There is no paging: all the data comes back in a single JSON response. So just grab it once, copy the whole JSON, save it locally, parse it, and extract all the course ids. A simple code example follows:

#  Parse the saved course-list JSON and extract all course ids
def load_all_course_list():
    with open(lg_course_item_json, 'r+', encoding='utf-8') as f:
        content_dict = json.load(f)
        c_list = []
        for course in content_dict['content']['contentCardList'][22]['courseList']:
            c_list.append(str(course['id']))
        cp_utils.write_list_data(c_list, course_id_list_file)

Part of the output looks like this:

Okay, only 101 courses, not too many. Next, fetch the chapters in each course.

5、 Get chapter IDs

Open any chapter, clear the Network tab, refresh the page, search for one of the titles, and the request is again easy to locate:

What are the course ids for? Iterate over the id list above and substitute each one into the courseId placeholder in the URL; that gives the URL of each course:

course_template_url = 'https://xxx/courseInfo.htm?courseId={}#/content'  # {} is filled with each course id via .format()

browser.get() loads the link above directly. The returned JSON is saved as-is here, since some of its fields may be useful later. A simple code example follows:

import json
import os
import random

import cp_utils

#  Iterate over the course ids, capture the getCourseLessons response for each course and save it as JSON
def load_course_list():
    course_id_list = cp_utils.load_list_from_file(course_id_list_file)
    for course_id in course_id_list:
        proxy.new_har("course_list", options={
            'captureContent': True,
            'captureHeaders': True
        })
        browser.get(course_template_url.format(course_id))
        #  Random delay so the requests look a bit less mechanical
        time.sleep(random.randint(3, 30))
        result = proxy.har
        for entry in result['log']['entries']:
            if entry['request']['url'].find('getCourseLessons?courseId=') != -1:
                content = entry['response']['content']
                #  Filter out the tiny unrelated responses
                if len(str(content)) > 200:
                    text = json.loads(content['text'])
                    course_name = text['content']['courseName']
                    json_save_path = os.path.join(course_json_save_dir, course_name + '.json')
                    with open(json_save_path, "w+", encoding='utf-8') as f:
                        f.write(json.dumps(text, ensure_ascii=False, indent=2))
                        print(json_save_path, "  file written ...")
    proxy.close()
    browser.close()

You can see the saved json file ~


6、 Get the article content

Same recipe as before: keyword search:

Chapter URL:

article_template_url = 'https://xxx/courseInfo.htm?courseId={}#/detail/pc?id={}'.format(course_id, theme_id)

Similarly, loop through the chapters, parse the textContent field out of the returned data, and save it as HTML. It's fairly simple, so I won't post the full code; a rough sketch of the idea is below.
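As an illustration only: this sketch assumes (hypothetically) that the article response carries the HTML under content.textContent, and it reuses the globals defined earlier (proxy, browser, article_template_url, article_save_dir); the real key names and which response to keep should be confirmed with a breakpoint, just like before.

import json
import os
import random
import time

def save_article_html(course_id, theme_id, title):
    proxy.new_har("article", options={'captureContent': True, 'captureHeaders': True})
    browser.get(article_template_url.format(course_id, theme_id))
    time.sleep(random.randint(3, 10))
    for entry in proxy.har['log']['entries']:
        text = entry['response']['content'].get('text', '')
        #  Crude filter: keep the response that actually carries the article body
        if 'textContent' in text:
            data = json.loads(text)
            html = data['content']['textContent']  # assumed path, verify with a breakpoint
            save_path = os.path.join(article_save_dir, title + '.html')
            with open(save_path, 'w+', encoding='utf-8') as f:
                f.write(html)
            print(save_path, "  file written ...")
            break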

The data volume isn't huge; you can basically crawl it all in half a day. But when I opened the saved HTML I found it was all garbled:

A small problem: specify the encoding, and just drop the page content into the template below where the comment is:

<html>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<head><title></title></head>
<body>
<!-- paste the content here -->
</body>
</html>
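A small helper to do that plumbing; just a sketch, the template string simply mirrors the block above.

HTML_TEMPLATE = """<html>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<head><title></title></head>
<body>
{}
</body>
</html>"""

def save_as_html(save_path, text_content):
    #  Wrap the captured textContent in the utf-8 template so the browser picks the right encoding
    with open(save_path, 'w+', encoding='utf-8') as f:
        f.write(HTML_TEMPLATE.format(text_content))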

Alright, the courses are crawled down. You might say: that's it? Way too simple. I feel the same way, so let's add a billion more details!!!


0x5、 Add a little detail

1、HTML to Markdown

Unstyled HTML is really ugly to open, and it won't get any better on its own, so let's convert it to Markdown ~

Life is short, I use Python. Don't panic when a need comes up: first check whether a wheel already exists before rolling your own. See, one search and there it is: html2text.

Install it straight from the command line with pip, then write a small demo to try the conversion:

import cp_utils
import html2text as ht

if __name__ == '__main__':
    text_marker = ht.HTML2Text()
    content = cp_utils.read_content_from_file('test.html')
    cp_utils.write_str_data(text_marker.handle(content), "after.md")

The conversion result looks fine, no problems so far. Next, traverse the files and batch-convert them ~

import cp_utils
import os
import html2text as ht

lg_save_dir = os.path.join(os.getcwd(), "lg_save")
course_json_save_dir = os.path.join(lg_save_dir, "course_json")
article_save_dir = os.path.join(lg_save_dir, "article")
md_save_dir = os.path.join(lg_save_dir, "md")

if __name__ == '__main__':
    text_marker = ht.HTML2Text()
    cp_utils.is_dir_existed(lg_save_dir)
    cp_utils.is_dir_existed(course_json_save_dir)
    cp_utils.is_dir_existed(article_save_dir)
    cp_utils.is_dir_existed(md_save_dir)
    course_dir_list = cp_utils.fetch_all_file(article_save_dir)
    for course_dir in course_dir_list:
        course_name = course_dir.split("\\")[-1]
        for article_path in cp_utils.fetch_all_file(course_dir):
            article_name = article_path.split("\\")[-1]
            for lesson_path in cp_utils.filter_file_type(article_path, ".html"):
                lesson_name = lesson_path.split("\\")[-1].split(".")[0]
                after_save_dir = os.path.join(md_save_dir,
                                              course_name + os.path.sep + article_name + os.path.sep)
                cp_utils.is_dir_existed(after_save_dir)
                md_file_path = os.path.join(after_save_dir, lesson_name + '.md')
                cp_utils.write_str_data(text_marker.handle(cp_utils.read_content_from_file(lesson_path)),
                                        md_file_path)

After a short wait, all the files are converted. Then run them through the hzwz-markdown-wx script I wrote before, which converts MD to the WeChat official-account HTML style:

The typesetting instantly looks much better, 2333. Of course, just kidding, I'm not going to push my luck that far; respect the author's work ~


2、 Image processing

The images in the md files all point to the site's own image host. Some readers may have the following needs:

Need one: view the documents offline

Easy: parse the md file, download the images locally, replace the original links, and while we're at it prepend an md title (the file name).

Careful: local links in Markdown should be relative paths, not absolute paths

A simple code example follows:

import os
import re

import cp_utils

#  Regex matching image URLs in Markdown: ![alt](url)
pic_pattern = re.compile(r'!\[.*?\]\((.*?)\)', re.S)    

def process_md(md_path):
    content = cp_utils.read_content_from_file(md_path)
    
    #  Add a level-1 title (the file name without its extension)
    title = os.path.splitext(os.path.basename(md_path))[0]
    new_content = "# {}\n{}".format(title, content)
    
    #  Find all image links
    pic_result = pic_pattern.findall(new_content)
    for pic in pic_result:
        #  Absolute path of the local copy
        pic_file_path = os.path.join(pic_save_dir, pic.split('/')[-1])
        
        #  Relative path used inside the md file
        pic_relative_path = "..{}pic{}{}".format(os.path.sep, os.path.sep, pic.split('/')[-1])
        
        #  Download the image
        cp_utils.download_pic(pic_file_path, pic)
       
        #  Replace the remote URL in the MD file with the local relative path
        new_content = new_content.replace(pic, pic_relative_path)
    
    #  Save the md file
    cp_utils.write_str_data(new_content, os.path.join(md_new_save_dir, title + '.md'))

After running it, an error popped up after only a few images:

What? Why does the image name contain a line break??? Looking at the offending spot in the md file:

I see, it's a bug in html2text's conversion; I'll go file an issue with the author later. For now, a way around it has to be found.

  • Treat the symptom: make the download not fail by replacing the \n in the url with nothing when downloading, but then the md file still has to be fixed separately;
  • Cure the root cause: locate these broken image links in the md content and strip out the \n.

The cure it is, then. Okay, so how do we locate and remove the stray line breaks? Enter the string-processing artifact → regular expressions. There are several approaches:

  • re.findall() + str.replace()
#  Regex matching the broken image links; re.M enables multi-line matching
error_pic_pattern = re.compile(r'http.*?\n.*?\..*?\)', re.M)

#  Find every broken image link, then replace each one in the original string with its \n-stripped version
error_pics = error_pic_pattern.findall(new_content)
for error_pic in error_pics:
    new_content = new_content.replace(error_pic, error_pic.replace("\n", ""))
print(new_content)
  • re.sub() + a replacement function

sub() accepts a function as the replacement, so approach ① can be simplified to this:

#  Replacement function that strips the line break from the match
def trim_enter(temp):
    return temp.group().replace("\n", "")
    
new_content = error_pic_pattern.sub(trim_enter, new_content)
  • re.sub() + backreferences

Backreferences: in the replacement you can refer back to the content matched in the original string.

re.sub produces the same groups as re.match, so you only need to reference the group results in the replacement expression. There are two ways to refer to a group:

  • \number → e.g. \1 refers to the first group of the match
  • \g<number> → same as above, but avoids ambiguity: \g<1>0 clearly means group 1 followed by a literal 0, whereas \10 would be read as group 10;

So the replacement can be done with the following code:

error_pic_pattern_n = re.compile(r'(http.*?)(\n)(.*?\.\w+\))', re.M)

#  The match is split into three groups; groups 1 and 3 concatenated form the replacement
new_content = error_pic_pattern_n.sub(r"\g<1>\g<3>", new_content)

After the fix, open the md files locally and verify the images display normally ~

Need two: I want to quote some of this in my own notes, and I'm worried the image host will add hotlink protection some day

Just upload the images to a third-party CDN and swap the image URLs over. To see it through to the end, here's a simple code example of uploading images to Qiniu Cloud ~

import time

from qiniu import Auth, put_data

#  Qiniu CDN configuration
qn_access_key = 'xxx'
qn_secret_key = 'xxx'

#  Upload an image to Qiniu Cloud and return its CDN URL
def upload_qn_cdn(file_path):
    with open(file_path, 'rb') as f:
        data = f.read()
        #  Build the auth object
        q = Auth(qn_access_key, qn_secret_key)
        #  Name of the target storage bucket
        bucket_name = 'your-bucket-name'
        key = 'lg/' + str(int(round(time.time() * 1000))) + '_' + f.name
        token = q.upload_token(bucket_name, key, 3600 * 24)
        ret, info = put_data(token, key, data)
        print(ret)
        print(info)
        if info.status_code == 200:
            full_url = 'http://qiniupic.coderpig.cn/' + ret["key"]
            return full_url

Need three: I want PDFs, they're more convenient to read

Getting more and more outrageous... Well, self-reliance brings plenty: search for "Python Markdown to PDF" and just pick a library ~
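As an illustration only (the article doesn't name which library it ended up using), one possible combination is the markdown package plus pdfkit, which wraps the wkhtmltopdf binary; the package names and the option below are assumptions you would adapt to whatever library you pick.

#  Sketch: Markdown -> HTML -> PDF. Assumes `pip install markdown pdfkit`
#  and that wkhtmltopdf is installed and on the PATH.
import markdown
import pdfkit

with open('after.md', 'r', encoding='utf-8') as f:
    md_text = f.read()
html = markdown.markdown(md_text, extensions=['tables', 'fenced_code'])
pdfkit.from_string(html, 'after.pdf', options={'encoding': 'utf-8'})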


0x6、Summary

That wraps up the crawling process for the text part of this site's courses. It's pretty simple, and you might ask: why no audio and video crawling?

Sorry, maybe I'm just not good enough: I poked at it for two hours without getting anywhere, and I don't have a strong urge to crawl them anyway, so forget it.

I did notice there's a Java library for it; if you're interested, you can study the encryption rules yourself: lagou-course-downloader

One more thing: don't assume Selenium-driven browsing is undetectable. A browser launched this way exposes dozens of traits that a website can sniff with JavaScript.
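As a minimal illustration (it covers only a single trait, and is not something this article itself did): a default Selenium-launched Chrome reports navigator.webdriver as true, which any page can read with one line of JS; a commonly suggested partial mitigation is the Chrome flag below, added when building the options in init_browser.

#  Quick check of the most obvious automation trait a site can sniff
print(browser.execute_script("return navigator.webdriver"))  # True under a default Selenium session

#  A commonly cited (partial) mitigation, added alongside the other options in init_browser:
chrome_options.add_argument('--disable-blink-features=AutomationControlled')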

OK, that's all, thanks ~

Copyright notice
Author: coder_pig. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201301315484385.html
