Python crawler series: crawling global airport information
2022-01-30 11:25:56 【Internet of things_ Salted fish】
1. Preface
Recently, the company needed global airport information for some data analysis. I found this information on a website, but it carries no longitude or latitude for the airports. Still, with the airport information in hand, we can fill in the longitude and latitude ourselves.
2. Website element analysis
We found a website with this information, so the next step is to analyze the page elements to locate the data we want.
First, open the website and press "F12" to inspect all of the page's elements in the browser's developer tools.
When we hover the mouse over a div in the inspector, the page shades the block that div renders, so we can quickly pick out the information blocks we need.
Once we find the element block we need, we look up the address each link jumps to, which leads us to the secondary page.
The crawler then requests this secondary page and, from the airport list on it, finds the third-level page behind each airport's "details" link.
Finally, we enter the third-level page and locate the fields we need on each airport's details page.
With that, we can write our crawler program.
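Before writing the full crawler, a minimal sketch can confirm the element analysis. The snippet below (assuming requests, BeautifulSoup and lxml are installed) just fetches the country list page and prints the '/airports/' links we spotted in the developer tools; the shortened User-Agent is illustrative only.
import requests
from bs4 import BeautifulSoup

# Fetch the main airports page and list the country links identified
# in the element analysis above.
resp = requests.get('http://www.yicang.com/airports.html',
                    headers={'User-Agent': 'Mozilla/5.0'})
resp.encoding = 'utf-8'
soup = BeautifulSoup(resp.text, 'lxml')
for a in soup.find_all('a', href=True):  # href=True skips anchors without links
    if '/airports/' in a['href']:
        print(a['href'])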
3. Crawler source code
import requests
from bs4 import BeautifulSoup
import logging

# Create a logger
logger = logging.getLogger("mylog")
# Master switch for the log level
logger.setLevel(level=logging.DEBUG)
# File handler: second-level filter, INFO and above goes to the file
handler = logging.FileHandler("log.txt")
handler.setLevel(logging.INFO)
# Log format; 'name' resolves to the "mylog" set above
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
# Stream handler: second-level filter, WARNING and above goes to the console
console = logging.StreamHandler()
console.setLevel(logging.WARNING)
# Attach both handlers to the logger
logger.addHandler(handler)
logger.addHandler(console)

# Base address of the site we crawl
url = 'http://www.yicang.com'
# Main page listing all countries
urlmain = url + '/airports.html'
# The User-Agent must be passed as a request header (the original code
# passed it as the second positional argument, i.e. as query parameters)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

# Fetch the main page
req = requests.get(urlmain, headers=headers)
req.encoding = 'utf-8'
soup = BeautifulSoup(req.text, 'lxml')
# Find all 'a' tags that carry an href
xmlmain = soup.find_all('a', href=True)
# List holding the URL of every country's airport list
gj = []
# Walk the 'a' tags and keep the country URLs
for portlist in xmlmain:
    # A country link contains '/airports/'
    if '/airports/' in portlist['href']:
        gj.append(portlist['href'])

# List holding the detail-page URL of every airport in one country
porturllist = []

# Collect the airport detail URLs found on one list page
def getporturllist(porturl):
    reqgj = requests.get(porturl, headers=headers)
    reqgj.encoding = 'utf-8'
    soupgj = BeautifulSoup(reqgj.text, 'lxml')
    # Only the 'details' list items are of interest
    xmlli = soupgj.find_all('li', class_='listnr6')
    # Walk the details elements
    for portdetailurlxml in xmlli:
        # URL behind the 'details' link
        portdetailurl = portdetailurlxml.find('a')['href']
        # Keep it only if it matches the detail-page pattern
        if '/AirPortsinfo/' in portdetailurl:
            # Store the URL in porturllist
            porturllist.append(portdetailurl)

# Visit every country URL and gather all of its airport URLs
for portlist in gj:
    # Reset the per-country list of airport URLs
    porturllist = []
    # Build the URL of this country's airport list
    urlgj = url + portlist
    getporturllist(urlgj)
    # Fetch the list page to check whether it is paginated
    reqgj = requests.get(urlgj, headers=headers)
    reqgj.encoding = 'utf-8'
    soupgj = BeautifulSoup(reqgj.text, 'lxml')
    # div element holding the pagination controls
    xmlpage = soupgj.find_all('div', class_='ports_page')
    for pageinfo in xmlpage:
        # Element of the 'last page' link
        xmlpagelast = pageinfo.find_all('li', class_='last')
        for pagelast in xmlpagelast:
            # URL of the last page
            pagelasturl = pagelast.find('a')['href']
            # Check that the URL matches the pagination pattern
            if 'BaseAirports_page=' in pagelasturl:
                # Page number of the last page
                pagecountstr = pagelasturl.split('=')[-1]
                pagecount = int(pagecountstr) + 1
                # Loop over the page numbers and splice each page's URL
                for i in range(2, pagecount):
                    # Collect the airport detail URLs on this page
                    pageurl = urlgj + '?BaseAirports_page=' + str(i)
                    getporturllist(pageurl)
    # Visit every airport detail page of this country
    for portdetail in porturllist:
        try:
            porturl = url + portdetail
            reqport = requests.get(porturl, headers=headers)
            reqport.encoding = 'utf-8'
            soupport = BeautifulSoup(reqport.text, 'lxml')
            # Select the 'dl' elements whose class is 'shippings_showT'
            xmldiv = soupport.find_all('dl', class_='shippings_showT')
            # All 'dd' contents under the dl elements are stored here
            dlArry = []
            for divport in xmldiv:
                ddArry = []
                for dd in divport.find_all('dd'):
                    ddArry.append(dd.text)
                dlArry.append(ddArry)
            text = ''
            # Join the airport details, print them and write them to the log file
            for ddArry in dlArry:
                for item in ddArry:
                    text = text + item + ';'
            print(text)
            logger.info(text)
        except Exception as ex:
            print(str(ex))
4. Paging
Some countries have many airports, so their lists are paginated, and we need the URL of every page.
Here I read the page number behind the "Last" link, then loop from the second page onward and splice together each page's URL.
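Pulled out of the crawler above, a minimal standalone sketch of that splicing logic looks like this; the country URL and href in the example call are illustrative only, though the '?BaseAirports_page=N' format matches what the site's "Last" link used when I crawled it.
def build_page_urls(country_url, last_page_href):
    """Return the list-page URLs for pages 2..N of one country."""
    last_page = int(last_page_href.split('=')[-1])
    return [country_url + '?BaseAirports_page=' + str(i)
            for i in range(2, last_page + 1)]

print(build_page_urls('http://www.yicang.com/airports/China.html',
                      '/airports/China.html?BaseAirports_page=5'))
# prints the URLs for pages 2, 3, 4 and 5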
I stored the airport information in the log file; if you need to convert it to Excel, just split each line according to the delimiter rules. A sketch of that split follows.
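A hedged sketch, assuming the log format defined above: each line ends with the semicolon-joined airport fields, so we strip the logging prefix and write the fields as CSV rows. The output file name 'airports.csv' is my own choice, and the exact columns depend on the dd fields the site returns.
import csv

with open('log.txt', encoding='utf-8') as src, \
        open('airports.csv', 'w', newline='', encoding='utf-8-sig') as dst:
    writer = csv.writer(dst)
    for line in src:
        # Log lines look like: '2022-01-30 11:25:56,123 - mylog - INFO - f1;f2;...'
        payload = line.rstrip('\n').split(' - ', 3)[-1]
        writer.writerow([field for field in payload.split(';') if field])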
Copyright notice
Author: Internet of things_Salted fish. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201301125513239.html