Learn these 10,000 jokes and become the office humorist: Python crawler lessons 8-9
2022-01-31 18:56:33 【Dream eraser】
「This is day 17 of my participation in the November Writing Challenge. Check out the event details: the 2021 Last Writing Challenge」
A modern professional should be warm, interesting, useful, and classy. To become a workplace humor master, you need material, that is, jokes: only by reading widely can you have plenty to say, and say high-level jokes at that.
Analysis before crawling
The target website for this crawl is www.wllxy.net/gxqmlist.as…
The crawl itself is not difficult, and the analysis step can mostly be skipped; after all, having followed the series this far, you have already mastered 70-80% of requests.
This article focuses on the proxy-related features of requests.
Crawler basics
What is a proxy
A proxy fetches network information on behalf of the user. In plain terms, it hides the user's own IP and other network-identifying information so that the target site cannot obtain them.
Types of proxies
High-anonymity proxy: forwards packets unchanged, so to the target website's server the request looks like an ordinary real user, and the IP it sees is the proxy server's address. It hides the user's original IP perfectly, which makes high-anonymity proxies the first choice for crawlers.
Ordinary anonymous proxy: makes some changes to the packet, adding fixed parameters to the HTTP headers. Because of those fixed parameters, the target server can trace the user's real IP, and sites with strong anti-crawling measures can easily tell that the visitor is a crawler.
Transparent proxy: little needs to be said here; using one is barely better than using no proxy at all, because the target server can detect it trivially.
Proxies are also sometimes classified by protocol, HTTP versus HTTPS. Most websites have now upgraded to HTTPS, but HTTP has not been abandoned and such sites are usually still crawlable. Note that HTTPS requires extra handshakes and is therefore comparatively slow, and it becomes slower still through a proxy. So when crawling later, prefer the HTTP protocol for sites that still serve it, including when a proxy is in use.
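To make the three proxy types concrete, here is a small illustration (no live requests involved) of the headers a target server might see in each case. The header names Via and X-Forwarded-For are real proxy conventions, but every IP value below is invented for the example.

```python
REAL_IP = "203.0.113.7"      # hypothetical client IP (documentation range)
PROXY_IP = "198.51.100.42"   # hypothetical proxy IP

# Transparent proxy: announces itself and leaks the real client IP.
transparent = {
    "Via": "1.1 proxy",
    "X-Forwarded-For": REAL_IP,
}

# Ordinary anonymous proxy: adds fixed headers; the real IP is hidden,
# but the target can still tell a proxy is in use.
anonymous = {
    "Via": "1.1 proxy",
    "X-Forwarded-For": PROXY_IP,
}

# High-anonymity (elite) proxy: forwards the request unchanged,
# so neither telltale header appears.
elite = {}

def leaks_real_ip(headers):
    """Return True if the target can read the client's real IP."""
    return headers.get("X-Forwarded-For") == REAL_IP

print(leaks_real_ip(transparent))  # True
print(leaks_real_ip(elite))        # False
```

This is why the article recommends high-anonymity proxies for crawling: only the elite case leaves the target nothing to trace.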
Using proxies with requests
requests supports several ways of configuring a proxy, and setup is simple: pass the proxies parameter to any request method to configure a single request, as in the code below. (Proxies are introduced here only as background; in practice this case's target data can be fetched easily without one.)
import requests
proxies = {
"http": "http://10.10.1.10:3128",
"https": "http://10.10.1.10:1080",
}
requests.get("http://example.org", proxies=proxies)
Note that proxies is a dictionary parameter and may contain entries for HTTP, HTTPS, or both.
Also note that requests supports SOCKS proxies. That topic is a bit more advanced, so it is not explained in depth here.
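For completeness, here is a sketch of what the SOCKS configuration looks like. It assumes the optional PySocks dependency (installed via `pip install requests[socks]`); the address 127.0.0.1:1080 is a placeholder, not a working proxy.

```python
# SOCKS support in requests needs the optional extra:
#   pip install requests[socks]
# The host/port below are placeholders, not a real proxy.
socks_proxies = {
    "http": "socks5://127.0.0.1:1080",
    "https": "socks5://127.0.0.1:1080",
}

# "socks5h://" (note the h) resolves DNS on the proxy side instead of
# locally, which keeps hostnames from leaking to your own resolver.
# r = requests.get("http://example.org", proxies=socks_proxies)
```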
Code time
With the proxy background covered, let's move on to the actual coding.
import requests
import re
import threading

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}

flag_page = 0
lock = threading.Lock()  # protects flag_page across threads

# Parse with regular expressions; the three result lists are merged
# at the end with the zip function.
def anay(html):
    # The page is matched three separate times. Finding a way to do it
    # more efficiently is left to you.
    # NOTE: the literal label text in the patterns below must match the
    # page exactly (the original site's labels are in Chinese).
    pattern = re.compile(
        r'<td class="diggtdright">[.\s]*<a href=".*?" target="_blank">\s*(.*?)</a>')
    titles = pattern.findall(html)
    times = re.findall(r' Release time :(\d+[-]\d+[-]\d+)', html)
    diggt = re.findall(r' Get the ticket :(\d+) Person time ', html)
    return zip(titles, times, diggt)

def save(data):
    with open("newdata.csv", "a+", encoding="utf-8-sig") as f:
        f.write(f"{data[0]},{data[1]},{data[2]}\n")

def get_page():
    global flag_page
    while True:
        # Claim the next page number under the lock so no two threads
        # fetch the same page or skip one.
        with lock:
            if flag_page >= 979:
                break
            flag_page += 1
            page = flag_page
        url = f"http://www.wllxy.net/gxqmlist.aspx?p={page}"
        print(f"Crawling {url}")
        r = requests.get(url=url, headers=headers)
        for data in anay(r.text):
            print(data)
            # Saving to disk is left for you to finish
            # save(data)

if __name__ == "__main__":
    for i in range(1, 6):
        t = threading.Thread(target=get_page)
        t.start()
Note that the zip function takes iterables as arguments, packs their corresponding elements into tuples, and yields those tuples. zip returns an object, not a list; to display it as a list you need an explicit list() conversion.
If the iterables differ in length, the result is only as long as the shortest one.
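Both behaviors can be seen in a quick demonstration, using lists shaped like the crawler's titles/times/votes (the values here are made up):

```python
titles = ["a", "b", "c"]
times = ["2022-01-01", "2022-01-02", "2022-01-03"]
votes = ["10", "20"]  # deliberately one element short

z = zip(titles, times, votes)
print(z)           # a zip object, not a list

rows = list(z)     # materialize it explicitly
print(rows)        # [('a', '2022-01-01', '10'), ('b', '2022-01-02', '20')]
```

The third title "c" is dropped because zip stops at the shortest input, which is why the three findall results must stay aligned.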
The remaining work is the data-saving part: the save function in the code above is yours to complete.
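One possible way to finish that step, sketched here as a suggestion rather than the article's own solution: use the stdlib csv module instead of manual f-string formatting, so titles containing commas or quotes are escaped correctly. The filename mirrors the stub in the article; the demo row is invented.

```python
import csv

def save(rows, filename="newdata.csv"):
    """Append an iterable of (title, time, votes) tuples to a CSV file."""
    # "utf-8-sig" writes a BOM so Excel detects the encoding correctly,
    # matching the encoding used in the article's save stub.
    with open(filename, "a+", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)

# A title containing a comma gets quoted properly, which the manual
# f-string approach would break.
save([("some title, with a comma", "2022-01-31", "12")], filename="demo.csv")
```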
Closing remarks
This crawler series mainly introduces the requests library. After finishing it, you should have a fairly complete picture of what requests can do.
Copyright notice
Author: Dream eraser. Please include the original link when reprinting. Thank you.
https://en.pythonmana.com/2022/01/202201311856294080.html