
Python practice: scrape 58.com rental listings and store them in a MySQL database

2022-01-30 13:20:38 baiyuliang

Operating a database from Python is much simpler than in many other languages!

I won't go over installing MySQL or the basics of creating databases and tables. Here I created a local database named py and a table tb_py_test:

create table tb_py_test
(
    id      int auto_increment
        primary key,
    url     text         null,
    content varchar(255) null,
    price   double       null
);

Next, install pymysql, Python's MySQL connection tool: pip install pymysql.

After it installs, write the connection code:

pymysql.connect(host="localhost", user="root", password="root", database="py")

This returns a database connection object. From it, get a cursor, then use the cursor to execute SQL statements and fetch the results:

import pymysql

db = pymysql.connect(host="localhost", user="root", password="root", database="py")
cursor = db.cursor()
try:
    sql = "select * from tb_py_test"
    cursor.execute(sql)
    results = cursor.fetchall()
    for result in results:
        print(result[0], result[1], result[2])
except Exception as e:
    print('fail:' + str(e))
    db.rollback()
db.close()

Now let's manually insert a row into the table:


Then run the script:


This confirms that the database connection and queries both work!

So, the goal of this post: crawl 58.com rental listings and collect the first 300 entries priced below 2000!

Open the 58 home page, zz.58.com, and analyze it:


We need to find the "Rent" button and click it automatically (of course, you could also skip this step and go straight to the rental listings link):

from selenium import webdriver

driver = webdriver.Firefox(executable_path=r'C:\geckodriver.exe')
driver.get("https://zz.58.com")
zf = driver.find_element_by_xpath("//a[@tongji_tag='pc_home_dh_zf']")
zf.click()

Note the find_element_by_xpath usage: "//a[@tongji_tag='pc_home_dh_zf']" means "find the a element whose tongji_tag attribute equals 'pc_home_dh_zf'". Clicking it automatically opens the rental listings page. Continuing the analysis: we need three fields from each listing, the title, the link, and the price.


Note that the listings form a list (ul > li), so what we get back is a collection: first find the ul, then get its li elements, then loop over the li elements and extract the fields from each item:

import time

from selenium import webdriver

driver = webdriver.Firefox(executable_path=r'C:\geckodriver.exe')
driver.get("https://zz.58.com")
zf = driver.find_element_by_xpath("//a[@tongji_tag='pc_home_dh_zf']")
zf.click()

time.sleep(2)

driver.switch_to.window(driver.window_handles[len(driver.window_handles) - 1])

ul = driver.find_element_by_css_selector('ul.house-list')
lis = ul.find_elements_by_tag_name('li')
for i in range(len(lis) - 1):
    price = lis[i].find_element_by_class_name("money").find_element_by_tag_name('b').text  #  Price 
    if int(price) < 2000:
        des = lis[i].find_element_by_class_name("des")
        a = des.find_element_by_tag_name('a')
        title = a.text  #  title 
        url = a.get_attribute('href')  #  link 
        print(title, "  The rent :" + price, "  link :" + url)
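One fragile spot: int(price) assumes the price text is a bare integer, and it raises ValueError if the listing ever shows a range or extra characters. A defensive parser could look like this (parse_price is a hypothetical helper, not part of the original script):

```python
import re

def parse_price(text):
    # Extract the first run of digits from the price text; tolerates
    # ranges like "1200-1500" or stray units, returns None if no digits
    m = re.search(r"\d+", text)
    return int(m.group()) if m else None
```

With this helper, the filter becomes `p = parse_price(price)` followed by `if p is not None and p < 2000:`.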

In fact, take a closer look at the li elements:


The last li is the pager, not listing data, so it has to be filtered out; subtracting 1 in the loop (range(len(lis) - 1)) takes care of it!

driver.switch_to.window(driver.window_handles[len(driver.window_handles) - 1])

This line grabs the handle of the newest window and switches to it; otherwise the driver would keep looking for elements in the old window!

Print the results:


Are we done? Of course not. We need 300 entries, and the code above only scrapes the first page, so after finishing one page we must automatically move on to the next, until we've collected 300!
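Stripped of Selenium, the control flow we want can be sketched as a pure function (collect_until is an illustrative name, not from the post):

```python
def collect_until(pages, limit=300):
    # Drain items page by page, stopping as soon as `limit` items are held
    out = []
    for page in pages:
        for item in page:
            out.append(item)
            if len(out) >= limit:
                return out
    return out
```

In the real script, `pages` corresponds to successive page loads and `item` to each li element that passes the price filter.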

Method one: analyze each page's URL:


You'll find that when switching to the next page, a parameter in the URL changes with the page number: pn2, pn3, and so on. So once the current page is scraped, you can simply modify the URL to load the next page.
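A tiny helper can build those URLs. Both the base path used below and the trailing pn{n}/ pattern are assumptions based on the observation above; check the actual listing URL in your own browser:

```python
def page_url(base, page):
    # Page 1 is the plain listing URL; later pages append pn{n}/
    # (assumed pattern; the base path here is hypothetical)
    return base if page == 1 else f"{base}pn{page}/"
```

The crawler would then call driver.get(page_url(base, n)) instead of clicking a button.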

Method two: simulate clicking the "Next page" button:


This is the approach used in this post:

nextBtn = driver.find_element_by_css_selector('div.pager').find_element_by_css_selector("a.next")
nextBtn.click()

While looping over the listings, insert each one into the database:

sql = "insert into tb_py_test (url, content, price) values (%s, %s, %s)"
cursor.execute(sql, (url, title, price))  # parameterized: pymysql fills in the %s placeholders safely
db.commit()
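Concatenating values into SQL strings breaks as soon as a title contains a quote, and it invites SQL injection; the execute(sql, params) pattern avoids both. Here is a self-contained demo using the standard-library sqlite3 as a stand-in for MySQL (sqlite3 uses ? placeholders where pymysql uses %s, but the pattern is identical):

```python
import sqlite3

# In-memory database standing in for the MySQL table
db = sqlite3.connect(":memory:")
cur = db.cursor()
cur.execute("create table tb_py_test (id integer primary key autoincrement,"
            " url text, content text, price real)")
# The apostrophe in the content would break naive string concatenation,
# but a parameterized insert handles it safely
cur.execute("insert into tb_py_test (url, content, price) values (?, ?, ?)",
            ("https://example.com/1", "Two rooms, it's near the metro", 1500))
db.commit()
cur.execute("select content, price from tb_py_test where url=?",
            ("https://example.com/1",))
row = cur.fetchone()
print(row)
```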

The complete Python code:

import time

from selenium import webdriver

import pymysql


class House:

    def __init__(self):
        self.title = ""
        self.url = ""
        self.price = 0.0


fp = webdriver.FirefoxProfile()
# Block CSS loading
fp.set_preference("permissions.default.stylesheet", 2)
# Block image loading
fp.set_preference("permissions.default.image", 2)
# Disable JavaScript
fp.set_preference("javascript.enabled", False)

driver = webdriver.Firefox(firefox_profile=fp, executable_path=r'C:\geckodriver.exe')

driver.get("https://zz.58.com")
zf = driver.find_element_by_xpath("//a[@tongji_tag='pc_home_dh_zf']")
zf.click()

time.sleep(2)

driver.switch_to.window(driver.window_handles[len(driver.window_handles) - 1])

houses = []


def start():
    while True:
        if len(houses) >= 300:
            return
        getHouses()


def getHouses():
    try:
        ul = driver.find_element_by_css_selector('ul.house-list')
        lis = ul.find_elements_by_tag_name('li')
        for i in range(len(lis) - 1):
            house = House()
            try:
                price = lis[i].find_element_by_class_name("money").find_element_by_tag_name('b').text  #  Price 
                if int(price) < 2000:
                    des = lis[i].find_element_by_class_name("des")
                    a = des.find_element_by_tag_name('a')
                    title = a.text  #  title 
                    url = a.get_attribute('href')  #  link 
                    # print(title, "  The rent :" + price, "  link :" + url)
                    house.title = title
                    house.url = url
                    house.price = price
                    addHouse(house)
                    if len(houses) >= 300:
                        return
            except Exception as e:
                print(str(e))
        nextBtn = driver.find_element_by_css_selector('div.pager').find_element_by_css_selector("a.next")
        nextBtn.click()
        time.sleep(3)
    except Exception as e:
        print(str(e))


def addHouse(house):
    houses.append(house)
    print(len(houses), house.price)
    try:
        # First check whether this listing is already in the database; skip it if so
        sql = "select url from tb_py_test where url=%s"
        cursor.execute(sql, (house.url,))
        if cursor.fetchone():
            return
        # Insert the row (parameterized, which avoids quoting bugs and SQL injection)
        sql = "insert into tb_py_test (url, content, price) values (%s, %s, %s)"
        cursor.execute(sql, (house.url, house.title, house.price))
        db.commit()
        print('insert success')
    except Exception as e:
        print('insert fail:' + str(e))
        db.rollback()


db = pymysql.connect(host="localhost", user="root", password="root", database="py")
cursor = db.cursor()
start()
db.close()



copyright notice
Author: baiyuliang. Please include a link to the original when reprinting. Thank you.
https://en.pythonmana.com/2022/01/202201301320348351.html
