
Python crawler series: crawling global airport information

2022-01-30 11:25:56 Internet of things_ Salted fish

1. Preface

Recently, the company needed global airport information for some data analysis. I found this information on a website, but it does not include the latitude and longitude of each airport. Still, once we have the airport information, we can fill in the coordinates ourselves.
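Since the site only provides the basic airport details, filling in the coordinates has to be a separate step. Below is a minimal sketch of one way to do it, assuming the geopy package is installed and using OpenStreetMap's Nominatim geocoder; the airport name here is only an illustration, in practice you would feed in the names collected by the crawler.

from geopy.geocoders import Nominatim

# Look up coordinates for an airport name (assumes geopy is installed;
# Nominatim is OpenStreetMap's free geocoding service and needs a user_agent string)
geolocator = Nominatim(user_agent="airport-info-demo")

# Illustrative airport name only
location = geolocator.geocode("Beijing Capital International Airport")
if location is not None:
    print(location.latitude, location.longitude)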

2. Website element analysis

We have found a website with this information. The next step is to analyze the page elements to locate exactly where the information we want lives.

First, open the website and press F12 to view all of the page's elements through the browser's developer tools.
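If you prefer to cross-check the markup from Python rather than the browser, a small sketch like the following dumps the same HTML that the developer tools show (the URL is the one used by the crawler later in this article):

import requests
from bs4 import BeautifulSoup

# Fetch the airport list page and print a readable version of its HTML,
# which makes it easy to compare against what the developer tools display
resp = requests.get('http://www.yicang.com/airports.html',
                    headers={'User-Agent': 'Mozilla/5.0'})
resp.encoding = 'utf-8'
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.prettify()[:2000])  # only the first part; the full page is long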


When we hover the mouse over one of these div elements, the corresponding block on the page is highlighted, so we can quickly locate the information blocks we need.


Once we have found the element block we need, we look up the address each link jumps to, which takes us to the second-level page.


The crawler then simulates entering the second-level page and, from the airport list there, finds the "details" link of each airport, which points to the third-level page.


Finally, we enter the third-level page and pick out the information we need from each airport's details view.


With that, we can write our crawler program.

3. Crawler source code

import requests
from bs4 import BeautifulSoup
import re
import logging

# Create a logger
logger = logging.getLogger("mylog")
# Master log level (first-layer filter)
logger.setLevel(level=logging.DEBUG)

# File handler with its own log level (second-layer filter)
handler = logging.FileHandler("log.txt")
handler.setLevel(logging.INFO)

# Create the log format and attach it to the file handler; "name" is the "mylog" set above
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

# Console (stream) handler with its own log level (second-layer filter)
console = logging.StreamHandler()
console.setLevel(logging.WARNING)

# Attach both handlers to the logger
logger.addHandler(handler)
logger.addHandler(console)
 
# The site we are going to crawl
url = 'http://www.yicang.com'

# Main page listing airports by country
urlmain = url + '/airports.html'
# Fetch the main page (note: the User-Agent must be passed as a header, not as query params)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
req = requests.get(urlmain, headers=headers)
req.encoding = 'utf-8'
soup = BeautifulSoup(req.text, 'lxml')

# Find all 'a' tags on the page
xmlmain = soup.find_all('a')

# List that stores the URL of every country page
gj = []

# Walk the 'a' tags and keep the country URLs
for portlist in xmlmain:
    # A country URL contains '/airports/'; use that rule to filter
    if portlist.get('href') and '/airports/' in portlist['href']:
        gj.append(portlist['href'])
 
# List that stores the detail-page URL of every airport in the current country
porturllist = []

# Helper that collects the airport detail URLs from one country page
def getporturllist(porturl):
    reqgj = requests.get(porturl, headers=headers)
    reqgj.encoding = 'utf-8'
    soupgj = BeautifulSoup(reqgj.text, 'lxml')
    # We only need the elements that hold the "details" links
    xmlli = soupgj.find_all('li', class_='listnr6')
    # Walk those elements
    for portdetailurlxml in xmlli:
        # URL behind each "details" link
        portdetailurl = portdetailurlxml.find('a')['href']
        # Keep it only if it matches our rule
        if '/AirPortsinfo/' in portdetailurl:
            # Store the URL in porturllist
            porturllist.append(portdetailurl)
 
# Walk every country URL and collect the URLs of all its airports
for portlist in gj:
    # Reset the per-country list of airport detail URLs
    porturllist = []
    # Build the URL of this country's airport list
    urlgj = url + portlist
    getporturllist(urlgj)

    # Fetch the country's airport list again to check whether it is paginated
    reqgj = requests.get(urlgj, headers=headers)
    reqgj.encoding = 'utf-8'
    soupgj = BeautifulSoup(reqgj.text, 'lxml')
    # The pagination controls live in a div with class 'ports_page'
    xmlpage = soupgj.find_all('div', class_='ports_page')
    for pageinfo in xmlpage:
        # Element that represents the last page
        xmlpagelast = pageinfo.find_all('li', class_='last')
        for pagelast in xmlpagelast:
            # URL of the last page
            pagelasturl = pagelast.find('a')['href']
            # Make sure the URL matches the expected pattern
            if 'BaseAirports_page=' in pagelasturl:
                # Number of the last page
                pagecountstr = pagelasturl.split('=')[-1]
                pagecount = int(pagecountstr) + 1
                # Loop over the remaining pages and build their URLs
                for i in range(2, pagecount):
                    # Collect the airport detail URLs on each paginated list
                    pageurl = urlgj + '?BaseAirports_page=' + str(i)
                    getporturllist(pageurl)
                   
    # Walk every airport detail URL of this country
    for portdetail in porturllist:
        try:
            porturl = url + portdetail
            reqport = requests.get(porturl, headers=headers)
            reqport.encoding = 'utf-8'
            soupport = BeautifulSoup(reqport.text, 'lxml')
            # Select the dl elements whose class is 'shippings_showT'
            xmldiv = soupport.find_all('dl', class_='shippings_showT')
            # The text of every dd tag under those dl elements goes into this list
            dlArry = []
            for divport in xmldiv:
                ddArry = []
                for dd in divport.find_all('dd'):
                    ddArry.append(dd.text)
                dlArry.append(ddArry)
            text = ''
            # Join the airport details, print them to the console and write them to the log file
            for ddArry in dlArry:
                for item in ddArry:
                    text = text + item + ';'
            print(text)
            logger.info(text)
        except Exception as ex:
            print(str(ex))

4. Paging

Many countries have a lot of airports, so their lists are paginated and we need to get the URL of every page.


Here I read the page number from the "last page" link, then loop from the second page onward to splice together each page's URL.


I store the airport information in the log file. If you need to convert it to Excel, just split each line according to the separator rules.
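As a minimal sketch of that conversion (assuming the log format configured above, where each record's message is a string of fields joined by ';'), the following reads log.txt and writes the fields to a CSV file that Excel can open; the exact column layout depends on what the detail pages contain:

import csv

# Split each logged airport record into columns and write them to a CSV file.
# Assumes the log format set up earlier: "<time> - mylog - INFO - field1;field2;..."
with open('log.txt', encoding='utf-8') as logfile, \
     open('airports.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    for line in logfile:
        if ' - INFO - ' not in line:
            continue
        # Everything after the log prefix is the semicolon-joined airport record
        message = line.split(' - INFO - ', 1)[1].strip()
        fields = [f for f in message.split(';') if f]
        if fields:
            writer.writerow(fields)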


Copyright notice
Author: Internet of things_ Salted fish. Please include a link to the original when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201301125513239.html
