Python Crawler Basics
2022-01-31 05:51:32 【NicesLife】
This is day 7 of my participation in the November writing challenge. For event details, see: 2021 Last Writing Challenge.
I. What is a crawler
Crawling is the process of writing a program that simulates a browser accessing the web and then has it fetch data from the Internet.
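As a minimal sketch of that idea, the snippet below uses only the standard library to send a request with a browser-style `User-Agent` header and read the response body. The function name `fetch_page` and the `example-crawler/0.1` agent string are made up for illustration. To keep the demo runnable offline it fetches a `data:` URL; a real crawler would pass an `http://` or `https://` address instead.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def fetch_page(url, user_agent="example-crawler/0.1"):
    """Request a URL much like a browser would and return the body as text."""
    # The User-Agent value is a hypothetical placeholder, not a real browser string.
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline demo: a data: URL stands in for a real web page.
demo_url = "data:text/html," + quote("<html><body><h1>Hello</h1></body></html>")
html = fetch_page(demo_url)
print(html)
```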
II. The value of crawlers
1. Practical applications
2. Employment
III. The legality of crawlers
Are crawlers legal or illegal?
1. Crawlers are not prohibited by law.
2. They do carry a risk of breaking the law.
3. Crawlers can be divided into benign crawlers and malicious crawlers.
The risks of crawlers
The risks show up in the following two ways:
1. The crawler interferes with the normal operation of the website it visits.
2. The crawler grabs specific types of data or information that are protected by law.
How can you avoid getting into legal trouble when writing and using crawlers?
1. Optimize your program regularly so that it does not interfere with the normal operation of the website it visits.
2. When using or distributing crawled data, review what was captured. If it involves sensitive content such as user privacy or trade secrets, stop crawling or distributing it promptly.
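One common way to act on the first point is to pause between requests so the target site is not flooded. The sketch below is only an illustration: the `Throttle` class, the 0.2-second interval, and the page paths are all assumptions, and a suitable delay depends on the site being crawled.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between requests
        self._last = None                 # time of the previous request

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for page in ["/page1", "/page2", "/page3"]:
    throttle.wait()          # pauses if the previous request was too recent
    print("fetching", page)  # a real crawler would send the request here
elapsed = time.monotonic() - start
print(f"elapsed: {elapsed:.2f}s")
```

The first call returns immediately; each later call sleeps just long enough to keep requests at least `min_interval` seconds apart.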
IV. Types of crawlers by usage scenario
(1) General-purpose crawler:
A general-purpose crawler is an important part of a crawling system. It grabs the data of an entire page.
(2) Focused crawler:
A focused crawler builds on the general-purpose crawler. It grabs only specific parts of a page.
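The difference can be sketched with the standard library's `html.parser`. A general-purpose crawler would keep the whole of `SAMPLE_HTML`, while the focused version below keeps only the text of `<h2 class="title">` elements. The sample page, the `title` class name, and the `TitleExtractor` helper are invented for this illustration.

```python
from html.parser import HTMLParser

# A made-up page; a general-purpose crawler would store all of it.
SAMPLE_HTML = """
<html><body>
  <div class="nav">Home | About</div>
  <h2 class="title">First post</h2>
  <p>Body text we do not need.</p>
  <h2 class="title">Second post</h2>
</body></html>
"""


class TitleExtractor(HTMLParser):
    """Collect the text inside <h2 class="title"> elements only."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())


def extract_titles(html):
    parser = TitleExtractor()
    parser.feed(html)
    return parser.titles


titles = extract_titles(SAMPLE_HTML)
print(titles)  # ['First post', 'Second post']
```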
(3) Incremental crawler:
An incremental crawler detects data updates on a website and grabs only the most recently updated data.
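One possible way to detect updates, shown here purely as a sketch, is to store a hash of each page's content and process a page again only when the hash changes. The `IncrementalStore` class and the sample pages are hypothetical, and real incremental crawlers may instead rely on timestamps, IDs, or HTTP caching headers.

```python
import hashlib


class IncrementalStore:
    """Remember a content fingerprint per URL (hypothetical helper)."""

    def __init__(self):
        self._fingerprints = {}

    def is_new_or_changed(self, url, content):
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self._fingerprints.get(url) == digest:
            return False  # same content as last time, skip it
        self._fingerprints[url] = digest
        return True


store = IncrementalStore()

# Two made-up crawl passes over the same site.
first_pass = {"/news/1": "Hello", "/news/2": "World"}
second_pass = {"/news/1": "Hello", "/news/2": "World (edited)", "/news/3": "New item"}


def crawl(pages):
    """Return only the URLs whose content is new or has changed."""
    return [url for url, content in pages.items()
            if store.is_new_or_changed(url, content)]


first_result = crawl(first_pass)
second_result = crawl(second_pass)
print(first_result)   # ['/news/1', '/news/2']
print(second_result)  # ['/news/2', '/news/3']
```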
V. Anti-crawling mechanisms and anti-anti-crawling strategies
(1) Anti-crawling mechanism
A portal website can adopt strategies or technical measures to prevent crawlers from crawling its data.
(2) Anti-anti-crawling strategy
A crawler program can adopt strategies or technical measures to get around a portal website's anti-crawling mechanism and obtain the relevant information from the site.
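A simple and widely seen instance of both sides is the `User-Agent` header. The sketch below assumes a site that flags requests whose agent string is missing or looks like a scripting tool, and a crawler that sets a browser-like string in response. The prefix list, the `looks_like_crawler` function, and the agent string are illustrative assumptions, not the rules of any particular website.

```python
from urllib.request import Request

# Hypothetical server-side rule: these prefixes are assumptions for the demo.
BLOCKED_AGENT_PREFIXES = ("python-urllib", "curl", "scrapy")


def looks_like_crawler(user_agent):
    """Site side: flag a missing or tool-like User-Agent value."""
    if not user_agent:
        return True
    return user_agent.lower().startswith(BLOCKED_AGENT_PREFIXES)


# Crawler side: a request with no custom header set on the Request object.
# (When sent, urllib's opener would add its own "Python-urllib/x.y" agent.)
default_request = Request("https://example.com/")

# Crawler side: the same request with a browser-like header.
browser_like = Request(
    "https://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
)

# Request stores header names capitalized, hence "User-agent".
flagged_default = looks_like_crawler(default_request.get_header("User-agent"))
flagged_browser = looks_like_crawler(browser_like.get_header("User-agent"))
print(flagged_default)  # True
print(flagged_browser)  # False
```

No request is sent here; the snippet only shows how the header value decides the outcome of this assumed check.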
VI. The robots.txt protocol
The robots.txt protocol is a gentleman's agreement: it specifies which data on a website may be crawled and which may not.
Viewing a website's robots.txt
Take Taobao as an example: www.taobao.com/robots.txt
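A crawler can check these rules in code with the standard library's `urllib.robotparser`. The rules below are a made-up sample rather than Taobao's actual file, and `my-crawler` is a placeholder agent name; they are parsed from a string so the sketch runs offline.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; not the content of any real site's robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt rules from a string instead of downloading them."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = parser.can_fetch("my-crawler", "https://example.com/index.html")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data.html")
print(allowed)  # True
print(blocked)  # False
```

For a live site, `RobotFileParser` also offers `set_url()` and `read()` to download the file before calling `can_fetch()`.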
copyright notice
Author: NicesLife. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201310551300610.html