Robots协议是互联网爬虫的一项公认的道德规范,它的全称是“网络爬虫排除标准”(Robots exclusion protocol),这个协议用来告诉爬虫,哪些页面是可以抓取的,哪些不可以。
如何查看网站的robots协议呢,很简单,在网站的域名后加上/robots.txt就可以了。
如百度https://www.baidu.com/robots.txt
User-agent: Baiduspider # 百度爬虫 Disallow: /baidu #disallow禁止访问,allow允许访问 Disallow: /s? Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: Googlebot Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: MSNBot Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: Baiduspider-image Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: YoudaoBot Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: Sogou web spider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: Sogou inst spider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: Sogou spider2 Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: Sogou blog Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: Sogou News Spider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: Sogou Orion spider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: ChinasoSpider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: Sosospider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: yisouspider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: EasouSpider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? Disallow: /home/news/data/ Disallow: /bh User-agent: * Disallow: /
虽然Robots协议只是一个道德规范,和爬虫相关的法律也还在建立和完善之中,但我们在爬取网络中的信息时,也应该有意识地去遵守这个协议。
网站的服务器被爬虫爬得多了,也会受到较大的压力,因此,各大网站也会做一些反爬虫的措施。不过呢,有反爬虫,也就有相应的反反爬虫。
恶意消耗别人的服务器资源,是一件不道德的事,恶意爬取一些不被允许的数据,还可能会引起严重的法律后果。
工具在你手中,如何利用它是你的选择。当你在爬取网站数据的时候,别忘了先看看网站的Robots协议是否允许你去爬取。
同时,限制好爬虫的速度,对提供数据的服务器心存感谢,避免给它造成太大压力,维持良好的互联网秩序,也是我们该做的事。