
Python crawler series: crawling global port information

2022-01-30 11:25:57 Internet of things_ Salted fish

1. Background

The company previously had a requirement to monitor shipping containers: report whenever a container enters or leaves a port, link that with shipping-line data, and so dynamically monitor the whole journey (land transport, ocean shipping, land transport).

One small piece of this is to send a reminder whenever a container passes through a port, so the shipper can be notified in advance to prepare the container. For that we need the name and geographic location of every port in the world. Combining our container-positioning devices with GIS spatial analysis, we can then detect containers entering and leaving ports.

2. Website

gangkou.00cha.net

The site is organized by country. The first layer is the country list; selecting a country takes you to the second layer, a list of all the ports in that country; the details of each port sit on a third layer, the port's own page.

[Screenshots of the three page layers omitted.]

3. Implementation

Step 1: fetch the home page and collect the link to each country's port list.

import requests
from bs4 import BeautifulSoup

# URL of the site to crawl
url = 'http://gangkou.00cha.net/'

# Fetch the page source. The User-Agent must be passed via headers=;
# a bare second positional argument would be treated as query parameters.
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'})
req.encoding = 'gb2312'
# Parse the page with BeautifulSoup so we can search it
soup = BeautifulSoup(req.text, 'lxml')

# Collect every <a> tag on the page
links = soup.find_all('a')
gj = []
# Keep only the links that point to a country's port list ('gj_' pages)
for k in links:
    if 'gj_' in k.get('href', ''):
        gj.append(k['href'])

Next we store each country's name together with the URL of its port list.
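As a small, self-contained sketch of that storage step (the sample HTML below is invented for illustration; the real page's markup may differ):

```python
from bs4 import BeautifulSoup

# Invented sample of the country-list markup, for illustration only
html = '<a href="gj_1.htm">China</a><a href="about.htm">About</a>'
soup = BeautifulSoup(html, 'html.parser')

# Pair each country's name with the URL of its port list
countries = []
for a in soup.find_all('a'):
    if 'gj_' in a.get('href', ''):
        countries.append((a.text, a['href']))

print(countries)  # [('China', 'gj_1.htm')]
```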

Step 2: fetch each country's port-list page.

# Iterate over the country links collected in step 1
for l in gj:
    urlgj = 'http://gangkou.00cha.net/' + l
    # Fetch the country page, again passing the User-Agent via headers=
    reqgj = requests.get(urlgj, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'})
    reqgj.encoding = 'gb2312'
    # Parse the page with BeautifulSoup
    soupgj = BeautifulSoup(reqgj.text, 'lxml')
    # Collect every <a> tag on the country page
    xmlgj = soupgj.find_all('a')

Step 3: traverse the country's port list and fetch the detail page of each port.

    # Still inside the step-2 loop: find the port detail links ('gk_' pages).
    # Requires `import re` at the top of the script.
    for kgj in xmlgj:
        if 'gk_' in kgj.get('href', ''):
            urlgk = 'http://gangkou.00cha.net/' + kgj['href']
            reqgk = requests.get(urlgk, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'})
            reqgk.encoding = 'gb2312'
            soupgk = BeautifulSoup(reqgk.text, 'lxml')
            # Collect the table rows that hold the port's attributes
            trarry = []
            for tr in soupgk.find_all('tr'):
                tdarry = []
                for td in tr.find_all('td'):
                    # Strip ideographic spaces; '\xa0' is presumably the
                    # non-breaking space the original replaced with ' '
                    text = td.text.replace('\u3000', '').replace('\xa0', ' ')
                    tdarry.append(text)
                trarry.append(tdarry)
            # The coordinates sit in the page's JavaScript: take the text
            # between the 'LatLng' keyword and the next ');'
            keylonlat1 = 'LatLng'  # coordinate keyword 1
            keylonlat2 = ');'      # coordinate keyword 2
            plonlata = reqgk.text.find(keylonlat1)            # position of keyword 1
            plonlatt = reqgk.text.find(keylonlat2, plonlata)  # position of keyword 2, searched after keyword 1
            lonlat = reqgk.text[plonlata:plonlatt + 1]        # text between the two keywords
            lonlat = re.findall(r'[(](.*?)[)]', lonlat)       # keep only the part inside the parentheses
            # Grab the port-introduction block; the marker string must match
            # the page's own (Chinese) heading text
            introarry = []
            for introduce in soupgk.find_all('div', class_='bei lh'):
                if 'Port introduction' in introduce.text:
                    introarry.append([introduce.text.replace('\ufffd', '').replace('\xe6', '').replace('\xa0', ' ')])
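To see what the keyword trick above actually extracts, here is the same logic run on an invented snippet of map JavaScript (the exact script on the real site may differ):

```python
import re

# Invented sample of the page's map script, for illustration only
text = "map.setCenter(new LatLng(121.48, 31.23));"

start = text.find('LatLng')   # position of keyword 1
end = text.find(');', start)  # position of keyword 2, searched after keyword 1
snippet = text[start:end + 1]
coords = re.findall(r'[(](.*?)[)]', snippet)
print(coords)  # ['121.48, 31.23']
```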

Finally

Import the pymssql library and save the data to SQL Server.
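A minimal sketch of that last step. The table name, column names, and connection defaults below are assumptions, not from the original article; pymssql is imported inside the function so the rest of the script runs even without it installed.

```python
# Hypothetical target table: port_info(country, port, lonlat, introduction)
INSERT_SQL = (
    "INSERT INTO port_info (country, port, lonlat, introduction) "
    "VALUES (%s, %s, %s, %s)"
)

def save_ports(rows, server='localhost', user='sa', password='***', database='ports'):
    """Insert (country, port, lonlat, introduction) tuples in one batch."""
    import pymssql  # pip install pymssql
    conn = pymssql.connect(server, user, password, database)
    try:
        cur = conn.cursor()
        cur.executemany(INSERT_SQL, rows)
        conn.commit()
    finally:
        conn.close()
```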


Copyright notice
Author: Internet of things_ Salted fish. Please include the original link when reprinting:
https://en.pythonmana.com/2022/01/202201301125556961.html
