current position:Home>Python reptile test ox knife (I)

Python reptile test ox knife (I)

2022-01-30 00:48:49 It man who is not making two mistakes

stay github A lightweight crawler framework was found on requests-html

One . website

Chinese document

English official website


  • Full support for parsing JavaScript!
  • CSS Selectors (jQuery style , thank PyQuery).
  • XPath Selectors , for the faint at heart.
  • Customize user-agent ( It's like a real web browser ).
  • Automatically track redirects .
  • Connection pool and cookie Persistence .
  • A delightful request experience , Magic parsing page .

Um. It feels powerful , Have a try , 315 The party reported 360 Medical related fake advertising , Think about climbing some medical related data… Lock the target first , Let's climb the data of the top 100 hospitals in the country to excel In practice

Two . F12 Analyze web page element interface

 screenshots 20210317  Afternoon 82936png

The entire page table The parcel , The name of the hospital is wrapped in a In the label

The simplest and crudest idea is to crawl all a The data in the tag , And then the loop extracts href Text in , Go straight to the code

# coding=UTF-8
from requests_html import HTMLSession
import xlwt

# Web link site 
session = HTMLSession()
r = session.get('')

# Initialize a Excel
xl = xlwt.Workbook(encoding='utf-8')
sheet = xl.add_sheet(' National Hospital ranking ')
sheet.write(0, 0, ' ranking ')
sheet.write(0, 1, ' Hospital name ')

# Initialize ranking 
i = 0

# Crawl data 
def findHospitalName():
    trs = r.html.find("a")
    for item in trs:
      # Get a Labeled href Text in attribute 
      text = item.find('a', first=True).attrs['href']

# Data cleaning 
def filterData(text):
    # Filtered text link parameters 
   if "#" in text:
      array = text.split("#", 1)
      # Filter out empty 
      if len(array[1]):
        global i
        i += 1
        writeData(i, array[1])

# Write data 
def writeData(sort, data):
  sheet.write(sort, 0, sort)
  sheet.write(sort, 1, data)'/Users/lsr/Documents/GJProject/py/' + " National Hospital ranking " + ".xls")

# Start 
 Copy code 

Don't look at the code , In fact, there are only two core codes

trs = r.html.find("a") # Get all a Tag data , return  Element object   Array 
text = item.find('a', first=True).attrs['href'] # obtain a Labeled herf attribute 
 Copy code 

 Insert picture description here

copyright notice
author[It man who is not making two mistakes],Please bring the original link to reprint, thank you.

Random recommended