
Using Python to scrape national second-hand housing data and display it on a map

2021-08-23 19:36:58 · The second brother is not like a programmer

With various housing policies rolling out recently, second-hand house prices have been fluctuating quite a bit. In this article, the second brother will walk you through Lianjia's second-hand housing listings as an example and briefly analyze second-hand housing prices across the country.


1. Approach

To get nationwide second-hand housing information from Lianjia, let's first look at a second-hand housing listings page (taking Beijing as an example):

[Figure: Lianjia's second-hand housing listings page for Beijing]

The page shows Beijing's listings, but offers no way to switch to other provinces and cities. Going back to the home page, though, there is a city button in the upper-left corner; clicking it opens a province/city page:

[Figure: Lianjia's province/city selection page]

From the province/city page we can extract each city's URL, then visit each URL in turn and scrape its second-hand housing data.

The overall process is as follows:

[Figure: overall workflow]

2. Getting the city information

To get the city information, we just fetch the city page's HTML and parse it. Because the HTML structure differs for a few provinces, the parser below targets the structure shared by the majority of provinces.
The code for obtaining the city information is as follows:

import random
import time
import csv
import requests
from lxml import etree
import pandas as pd

# Fetch and parse the city list page (the page reached via the city button
# on the Lianjia home page); the original snippet left `et` undefined
html = requests.get('https://www.lianjia.com/city/',
                    headers={'User-Agent': 'Mozilla/5.0'}).text
et = etree.HTML(html)


# Extract the i-th province and its j-th city (name and URL)
def city(i, j):
    try:
        p1 = ("//li[@class='city_list_li city_list_li_selected'][{}]"
              "/div[@class='city_list']/div[@class='city_province']"
              "/div[@class='city_list_tit c_b']/text()").format(i)
        province = et.xpath(p1)[0]
        cn1 = ("//li[@class='city_list_li city_list_li_selected'][{}]"
               "/div[@class='city_list']/div[@class='city_province']"
               "/ul/li[{}]/a/text()").format(i, j)
        city_name = et.xpath(cn1)[0]
        cu1 = ("//li[@class='city_list_li city_list_li_selected'][{}]"
               "/div[@class='city_list']/div[@class='city_province']"
               "/ul/li[{}]/a/@href").format(i, j)
        city_url = et.xpath(cu1)[0]
    except IndexError:
        # No province i / city j at this position on the page
        return 0, 0, 0
    return province, city_name, city_url


# Build the province/city/URL dictionary
dic1 = {}
count = 1
for i in range(1, 15):
    for j in range(1, 6):
        province, city_name, city_url = city(i, j)
        if province != 0:
            dic1[count] = [province, city_name, city_url]
            count += 1

The result looks like this:

[Figure: dic1, mapping a running index to [province, city, city URL]]
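
Since pandas is already imported above, a quick optional sanity check (not part of the original flow; df_cities is just an illustrative name) is to view dic1 as a table:

# Optional check: view the province/city/URL mapping as a DataFrame
df_cities = pd.DataFrame.from_dict(dic1, orient='index',
                                   columns=['province', 'city', 'url'])
print(df_cities.head())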

3. Getting the second-hand housing data

With each city's home page URL in hand, we can scrape multiple cities by constructing the second-hand housing listing URLs: just append the suffix ershoufang/pg{}/ (with the page number in the braces) to a city's URL. For example, page 2 of Beijing's listings is https://bj.lianjia.com/ershoufang/pg2/. With the URLs ready, the data can be fetched in the usual way:

# Pool of User-Agent strings to rotate between requests (the original
# snippet assumed this list was defined elsewhere; these are ordinary
# browser UA examples)
User_Agent = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

f = open('national_ershoufang_data.csv', 'a', encoding='gb18030', newline='')
write = csv.writer(f)


def parser_html(pr_ci, page, User_Agent):
    # Pick a random User-Agent for this pass
    headers = {'User-Agent': random.choice(User_Agent)}

    for i in range(1, len(pr_ci) + 1):
        province = pr_ci.get(i)[0]
        city = pr_ci.get(i)[1]
        url = pr_ci.get(i)[2] + 'ershoufang/pg{}/'.format(page)
        print(url)
        html = requests.get(url=url, headers=headers).text
        eobj = etree.HTML(html)
        li_list = eobj.xpath("//li[@class='clear LOGVIEWDATA LOGCLICKDATA']")
        for li in li_list:
            title_list = li.xpath(".//div[@class='title']/a/text()")
            title = title_list[0] if title_list else None
            name_list = li.xpath(".//div[@class='positionInfo']/a[1]/text()")
            name = name_list[0] if name_list else None
            area_list = li.xpath(".//div[@class='positionInfo']/a[2]/text()")
            area = area_list[0] if area_list else None
            info_list = li.xpath(".//div[@class='houseInfo']/text()")
            info = info_list[0] if info_list else None

            # houseInfo looks like "2室1厅 | 89平米 | 南 北 | 精装 | 中楼层 | 2010年建 | 板楼":
            # split on "|" and classify each part by its keyword
            model = size = face = decorate = floor = year = type1 = None
            if info:
                for part in info.split('|'):
                    part = part.strip()
                    if '室' in part:                     # layout, e.g. 2室1厅
                        model = part
                    elif '平米' in part:                 # size in square meters
                        size = part
                    elif ('东' in part or '西' in part
                          or '南' in part or '北' in part):  # orientation
                        face = part
                    elif '装' in part or '毛' in part:   # decoration (精装/简装/毛坯)
                        decorate = part
                    elif '层' in part:                   # floor
                        floor = part
                    elif '年' in part:                   # year built
                        year = part
                    elif '板' in part or '塔' in part:   # building type (slab/tower)
                        type1 = part

            follow_list = li.xpath(".//div[@class='followInfo']/text()")
            # followInfo is "followers / time listed", separated by "/"
            follow = follow_list[0].split('/')[0].strip() if follow_list else None
            time1 = follow_list[0].split('/')[1].strip() if follow_list else None
            price_list = li.xpath(".//div[@class='totalPrice']/span/text()")
            price = price_list[0] + '万' if price_list else None  # total price in 万元
            unit_list = li.xpath(".//div[@class='unitPrice']/span/text()")
            # Strip the "单价" prefix and "元/平米" suffix from the unit price
            unit = unit_list[0][2:-4] if unit_list else None

            # One row per listing: province + city + listing details
            list1 = [
                province, city, url, title, name, area, model, size, face,
                decorate, floor, year, type1, follow, time1, price, unit
            ]
            write.writerow(list1)
        time.sleep(random.randint(2, 5))  # pause between cities


def serve_forever():
    write.writerow([
        'province', 'city', 'url', 'title', 'name', 'area', 'model', 'size',
        'face', 'decorate', 'floor', 'year', 'type', 'follow', 'time', 'price',
        'unit'
    ])
    try:
        for i in range(1, 3):  # scrape the first two pages for each city
            parser_html(dic1, i, User_Agent)
            time.sleep(random.randint(1, 3))
    except Exception:
        pass


serve_forever()
f.close()

The scraped data looks like this:

[Figure: sample rows of the scraped CSV]
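
Before plotting, it can be worth loading the CSV back into pandas for a quick look (an optional check; the column names come from the header row written in serve_forever above):

import pandas as pd

ljdata = pd.read_csv('national_ershoufang_data.csv', encoding='gb18030')
print(ljdata.shape)                              # rows and columns scraped
print(ljdata[['province', 'city', 'price', 'unit']].head())  # a few listings
print(ljdata['province'].value_counts().head())  # listing counts per province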

4. Mapping

Since we scraped nationwide data, the clearest way to present it is on a map. Here we use the number of listings per province as the example metric; you can swap in other dimensions of the data (a sketch of this follows the result below).
The implementation is as follows:

from pyecharts import options as opts
from pyecharts.charts import Geo
from pyecharts.globals import ChartType
import pandas as pd

ljdata = pd.read_csv('national_ershoufang_data.csv', encoding='gb18030')
# Number of listings scraped per province
pro_num = ljdata['province'].value_counts()
c = (
    Geo()
    .add_schema(maptype="china")
    .add(
        "Number of listings",
        # cast to plain int so the values serialize cleanly to JSON
        [[name, int(v)] for name, v in zip(pro_num.index, pro_num.values)],
        type_=ChartType.HEATMAP,
    )
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(
        visualmap_opts=opts.VisualMapOpts(),
        title_opts=opts.TitleOpts(title="Geo-HeatMap"),
    )
)
c.render_notebook()  # display inline in a Jupyter notebook
c.render()           # also write render.html to disk

After running, the result looks like this:

[Figure: heatmap of listing counts per province on a map of China]
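
As mentioned above, the metric is easy to swap. Here is a sketch (assuming the unit column parses as numbers after the prefix/suffix stripping done by the scraper; avg_price and price_map are illustrative names) that drives the same heatmap with the average unit price per province instead of the listing count:

from pyecharts import options as opts
from pyecharts.charts import Geo
from pyecharts.globals import ChartType
import pandas as pd

ljdata = pd.read_csv('national_ershoufang_data.csv', encoding='gb18030')
# Coerce unit prices (元/平米) to numbers; unparsable entries become NaN
ljdata['unit'] = pd.to_numeric(ljdata['unit'], errors='coerce')
avg_price = ljdata.groupby('province')['unit'].mean().dropna()

price_map = (
    Geo()
    .add_schema(maptype="china")
    .add(
        "Average unit price",
        [[name, int(v)] for name, v in avg_price.items()],
        type_=ChartType.HEATMAP,
    )
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(
        visualmap_opts=opts.VisualMapOpts(max_=int(avg_price.max())),
        title_opts=opts.TitleOpts(title="Average unit price by province"),
    )
)
price_map.render("avg_price.html")  # writes the chart to avg_price.html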
With that, our data scraping and visualization are complete.


Copyright notice
Author: The second brother is not like a programmer. Please include a link to the original when reprinting:
https://en.pythonmana.com/2021/08/20210823193648116i.html
