
After an old friend (R&D position) was laid off, he wanted to franchise a snack bar, so I collected a bit of data with Python. It's a small gesture, at least

2022-02-01 03:28:51 Dream eraser

「This is day 21 of my participation in the November Writing Challenge. Check out the event details: 2021 Last Writing Challenge」

My friend's company is in the doldrums and he was laid off. To comfort him, Brother Eraser collected some data on snack-bar franchise projects.

Read this post and you'll get:

  • A bit of progress in Python technique;
  • More familiarity with the requests and lxml libraries;
  • Data that might actually come in handy.

Scraping data from 3158.cn

Target data source analysis

The target data source is the franchise section of 3158.cn. Before writing this article, Eraser hadn't expected that a dedicated franchise promotion website really exists!

The target data source address is as follows:

https://www.3158.cn/xiangmu/canyin/


The knowledge this post covers:

  1. Scraping web data with requests;
  2. XPath practice, extracting from lxml-parsed documents;
  3. Storing to a CSV file in 3 lines of code (a sketch follows).
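As a preview of point 3: the CSV storage really does fit in three lines. A minimal sketch, mirroring the save() function that appears in the full extraction code later in this post:

# append one record to the CSV file; three lines, as promised
def save(long_str):
    with open("./jiameng.csv", "a+", encoding="utf-8") as f:
        f.write("\n" + long_str)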

Data source analysis

The paging format of the target data is shown below. The rules are simple, and the total number of pages is displayed directly on the page, so the logic for deciding when all the data has been crawled can be omitted.

https://www.3158.cn/xiangmu/canyin/?pt=all&page=1
https://www.3158.cn/xiangmu/canyin/?pt=all&page=2
https://www.3158.cn/xiangmu/canyin/?pt=all&page=3

https://www.3158.cn/xiangmu/canyin/?pt=all&page=n
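Since only the page parameter changes, the whole URL list can be generated in one line; a small sketch (129 pages, matching the range the crawl code below uses):

urls = [f"https://www.3158.cn/xiangmu/canyin/?pt=all&page={n}" for n in range(1, 130)]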

The pages are static, so scraping them directly with requests is enough. The focus of this case is still lxml extraction. To avoid losing progress if the crawl is interrupted, you can first save the raw pages locally in a batch.

The scraping code is as follows:

import random

import requests

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    # add other User-Agent strings of your own choice here
]


def run(url, index):
    try:
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        res = requests.get(url=url, headers=headers)
        res.encoding = "utf-8"
        html = res.text
        with open(f"./html/{index}.html", "w+", encoding="utf-8") as f:
            f.write(html)
    except Exception as e:
        print(e)


if __name__ == '__main__':

    for i in range(1, 130):
        print(f" Climbing to the top {i} Page data ")
        run(f"https://www.3158.cn/xiangmu/canyin/?pt=all&page={i}", i)

    print(" It's all crawled ")
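One optional tweak that is not in the original code: the loop fires 129 requests back to back. If you want to go easier on the server, a short randomized pause between pages is enough; a sketch reusing the run() function from above:

import random
import time

for i in range(1, 130):
    run(f"https://www.3158.cn/xiangmu/canyin/?pt=all&page={i}", i)
    # pause 1 to 3 seconds between pages, purely as a politeness measure
    time.sleep(random.uniform(1, 3))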

Before running the code above, you need to create an html folder in the code directory to store the static pages (or create it in code, as sketched below). When the run finishes, the folder looks like the figure below, which indicates the pages have been crawled.

[Screenshot: the html folder filled with the saved pages]
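If you'd rather not create the folder by hand, a one-line guard at the top of the script does it; a minimal sketch using the standard library:

import os

# create the html folder if it does not exist yet
os.makedirs("./html", exist_ok=True)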

Data extraction time

This extraction again uses the lxml library. You can learn more from the official tutorial at https://lxml.de/tutorial.html.

In the previous step the pages were saved locally, so now we just need to process the 129 HTML files.
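Before writing the full extraction logic, a quick sanity check helps: parse one saved file and count how many li items the list XPath (the same one the extraction code below uses) finds. A small sketch:

from lxml import etree

# parse one saved page and count the list items the XPath will target
with open("./html/1.html", "r", encoding="utf-8") as f:
    element = etree.HTML(f.read())

items = element.xpath("//ul[contains(@class,'xm-list')]/li")
print(f"page 1 contains {len(items)} li items")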

Open any of the HTML pages and compare the data: there are differences in page layout between listing cards.

Only after finishing the extraction did I discover that this difference can, in fact, be ignored.

Set the data rules based on the page layout. The target data contains:

  • Franchise name
  • Investment amount
  • Location
  • Industry
  • Labels (if present)
  • Detail-page URL

The detail-page URL is extracted as well, for use in subsequent crawling. The above is the final format of the target data.

Where the data source varies, inspect the HTML source directly to find out why the pages differ.

The difference lies in the li tags (some carry the class xm-list2), so they can be handled as a special case.

The extraction code is as follows; explanations have been added in the comments.

from lxml import etree


# join a list of strings into one comma-separated string
def list_str(my_list):
    return ",".join(my_list)


def get_data():
    for i in range(1, 130):
        with open(f"./html/{i}.html", "r", encoding="utf-8") as f:
            html = f.read()
            element = etree.HTML(html)
            # the contains() function matches elements whose attribute contains a substring; similar functions: starts-with(), ends-with(), last()
            origin_li = element.xpath("//ul[contains(@class,'xm-list')]/li")
            # loop over the li elements and extract the data inside each
            for item in origin_li:

                # extract the franchise name
                # title = item.xpath(".//div[@class='r']/h4/text()")[0]
                title = item.xpath("./div[@class='top']/a/@title")[0]
                # extract the detail-page hyperlink
                detail_link = "http://" + item.xpath("./div[@class='top']/a/@href")[0]

                # extract the li tag's class attribute
                special_tag = list_str(item.xpath("./@class"))
                # when it carries the special class xm-list2, use different extraction rules

                if special_tag != "xm-list2":
                    # extract the labels
                    tags = list_str(item.xpath(".//div[@class='bot']/span[@class='label']/text()"))
                    # extract the investment amount
                    price = list_str(item.xpath(".//div[@class='bot']/span[@class='money']/b/text()"))
                    # location and industry
                    city_industry = list_str(item.xpath("./div[@class='bot']/p/span/text()"))

                    long_str = f"{title},{detail_link}, {tags}, {price}, {city_industry}"
                    save(long_str)
                else:
                    # location and industry (the xm-list2 layout nests them differently)
                    city_industry = list_str(item.xpath(
                        "./div[@class='top']/a/div/p[2]/span/text()"))
                    long_str = f"{title},{detail_link}, {0}, {0}, {city_industry}"
                    save(long_str)


def save(long_str):
    try:
        with open(f"./jiameng.csv", "a+", encoding="utf-8") as f:
            f.write("\n"+long_str)
    except Exception as e:
        print(e)


if __name__ == '__main__':

    # for i in range(1, 130):
    #     print(f"Crawling page {i}")
    #     run(f"https://www.3158.cn/xiangmu/canyin/?pt=all&page={i}", i)

    get_data()

    print(" All extracted ")

The code above first extracts the li tags from the HTML via element.xpath("//ul[contains(@class,'xm-list')]/li"), then iterates over the extracted li elements and pulls the fields out of each one.

During extraction you'll notice that title and detail_link, i.e. the title and the detail-page link, use the same extraction code in both layouts; the other fields are chosen by checking whether the li tag's class is xm-list2. The full code is shown above.

When using lxml, the most common tool is the path expression: // selects nodes anywhere in the document, .// selects nodes anywhere beneath the current node, and ./ selects direct children of the current node.
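A tiny self-contained demonstration of the three prefixes (the HTML here is invented purely for illustration):

from lxml import etree

doc = etree.HTML("<div id='d'><p>a</p><section><p>b</p></section></div>")
d = doc.xpath("//div[@id='d']")[0]

print(d.xpath("//p/text()"))   # ['a', 'b'] -> // searches the whole document, even from a child node
print(d.xpath(".//p/text()"))  # ['a', 'b'] -> .// searches anywhere beneath the current node
print(d.xpath("./p/text()"))   # ['a']      -> ./ matches direct children only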

The code also uses the contains() function, which here checks whether an attribute value contains a given string. The extraction of the ul tag above is an application of it: to get at all the li tags, the ul must be matched first, and since its class attribute holds more than one value, contains() lets us match on just part of it.

 Old friend ( R & D post ) After being cut , Want to join a snack bar , I use Python Collected a little bit of data , More or less an idea In the second half of the code , There is one. XPath Matching rules city_industry = list_str(item.xpath("./div[@class='top']/a/div/p[2]/span/text()")) , Notice what happens in it p[2] , This code means to select the second p label .

Comment time

==Since you've made it all the way here, won't you leave a comment in the comments section?==

Today is day 184 of 200 days of continuous writing. You can follow me, like me, comment on me, and bookmark me.

Copyright notice
Author: Dream eraser. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202010328502955.html
