
Python crawler Basics

2022-01-31 05:51:32 NicesLife

This is day 7 of my participation in the November writing challenge. Check out the event details: "2021 Last Writing Challenge".

1. What Is a Crawler

A crawler is a program that simulates a browser surfing the Internet and grabs data from the web.
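"Simulating a browser" mostly means sending HTTP requests that carry browser-like headers. Below is a minimal sketch using only the standard library; the User-Agent string and URL are placeholder values, not anything prescribed by the article.

```python
# Minimal sketch of a crawler request that "pretends" to be a browser:
# it attaches a browser-like User-Agent header, then reads the response.
from urllib.request import Request, urlopen

BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36")

def build_request(url: str) -> Request:
    """Build a request that identifies itself like a desktop browser."""
    return Request(url, headers={"User-Agent": BROWSER_UA})

def fetch(url: str) -> str:
    """Fetch a page and decode it as text (simplified: assumes UTF-8)."""
    with urlopen(build_request(url)) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch("https://example.com/")` would return the page's HTML as a string; everything after that (parsing, storing) is what the rest of this article classifies.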

2. The Value of Crawlers

1. Practical applications

2. Employment opportunities

3. The Legality of Crawlers

Are crawlers legal or illegal?

1. Crawlers are not prohibited by law.

2. There is, however, a risk of breaking the law.

3. Crawlers can be divided into benign crawlers and malicious crawlers.

The risks of crawlers

The risks are mainly reflected in the following two aspects:

1. The crawler interferes with the normal operation of the visited website.

2. The crawler grabs specific types of data or information that are protected by law.

How can you avoid getting into legal trouble while using crawlers?

1. Optimize your program regularly to avoid interfering with the normal operation of the visited website.

2. When using or distributing crawled data, review the captured content; if it involves user privacy, trade secrets, or other sensitive information, stop crawling or distributing it immediately.
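Point 1 above usually comes down to rate limiting. Here is one simple way to enforce a minimum delay between consecutive requests; the class name and delay value are my own illustration, not part of the original article.

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive requests so the
    crawler does not overload the target site's servers."""
    def __init__(self, delay_seconds: float):
        self.delay = delay_seconds
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self) -> None:
        """Sleep just long enough to respect the configured delay."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last = time.monotonic()
```

A crawl loop would call `throttle.wait()` before each request, guaranteeing the site sees at most one request per `delay_seconds` from this client.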

4. Classification of Crawlers by Use Scenario

(1) General-purpose crawlers:

A general-purpose crawler is an important component of a search engine's capture system; it grabs entire pages of data.

(2) Focused crawlers:

A focused crawler is built on top of a general-purpose crawler; it grabs only a specific portion of a page's content.
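For example, a focused crawler might care only about the links on a page rather than the whole document. A minimal sketch using the standard library's `html.parser` (the extracted target, links, is just one possible choice):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag -- the 'specific portion'
    of the page a focused crawler cares about."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html: str) -> list:
    """Return all hyperlink targets found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

The same pattern generalizes: subclass the parser (or use a library such as Beautiful Soup) and keep only the fields you need.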

(3) Incremental crawlers:

An incremental crawler detects data updates on a website and grabs only the newly updated data.
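One common way to "detect updates" is to remember a fingerprint of each item already crawled and skip anything unchanged. This is a sketch of that idea, assuming a content-hash approach (the article does not prescribe a specific mechanism):

```python
import hashlib

class IncrementalStore:
    """Remember a fingerprint of every item already crawled; only
    new or changed items are reported as updates."""
    def __init__(self):
        self._seen = {}  # maps url -> sha256 hash of last-seen content

    def is_update(self, url: str, content: str) -> bool:
        """Return True if this url's content is new or has changed."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self._seen.get(url) == digest:
            return False  # unchanged since the last crawl pass: skip
        self._seen[url] = digest
        return True
```

In a real incremental crawler the `_seen` map would be persisted (e.g., in a database) so fingerprints survive between crawl runs.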

5. Anti-Crawling Mechanisms and Anti-Anti-Crawling Strategies

(1) Anti-crawling mechanisms

An anti-crawling mechanism is a set of strategies or technical measures that a portal website can adopt to prevent crawlers from scraping its data.

(2) Anti-anti-crawling strategies

An anti-anti-crawling strategy is a set of strategies or technical measures that a crawler program can adopt to defeat a portal website's anti-crawling mechanisms and obtain its data.
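A common first step in such a strategy is rotating the User-Agent header so traffic looks less like a single automated client. The pool below contains shortened, illustrative strings of my own choosing, not a recommended list:

```python
import random

# A small pool of browser-like User-Agent strings (illustrative values).
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/96.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/95.0",
]

def random_headers() -> dict:
    """Pick a different User-Agent per request, a basic way to vary
    the crawler's fingerprint across requests."""
    return {"User-Agent": random.choice(UA_POOL)}
```

Real sites may also check cookies, request timing, JavaScript execution, and IP reputation, so header rotation alone is rarely sufficient.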

6. The robots.txt Protocol

The robots.txt protocol is a "gentleman's agreement": it specifies which data on a website may be crawled and which may not.

Viewing a website's robots.txt

Take Taobao as an example: www.taobao.com/robots.txt
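Python's standard library can parse and honor these rules via `urllib.robotparser`. The rules below are a made-up example, not Taobao's actual file:

```python
from urllib import robotparser

# Parse robots.txt rules offline (an invented example file).
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved crawler checks can_fetch() before requesting a URL.
print(rp.can_fetch("*", "https://example.com/public/page"))   # allowed
print(rp.can_fetch("*", "https://example.com/private/page"))  # disallowed
```

Against a live site, you would instead call `rp.set_url("https://www.taobao.com/robots.txt")` followed by `rp.read()` to fetch the real rules.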

[Screenshot: Taobao's robots.txt file]

Copyright notice
Author: NicesLife. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201310551300610.html
