常用User Agent
http://blog.csdn.net/kang_tju/article/details/52563374
http://blkstone.github.io/2016/03/02/crawler-anti-anti-cheat/
https://www.waitig.com/python-%E7%88%AC%E8%99%AB%E4%B8%80%E4%BA%9B%E5%B8%B8%E7%94%A8%E7%9A%84uauser-agent.html
https://www.zhihu.com/question/19553117
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent/Firefox
http://www.webapps-online.com/online-tools/user-agent-strings/dv/browser51854/firefox
https://developers.whatismybrowser.com/useragents/explore/software_name/firefox/9
https://developer.mozilla.org/zh-CN/docs/Web/HTTP/Browser_detection_using_the_user_agent
http://tools.jb51.net/table/useragent
https://zhuanlan.zhihu.com/p/21252983
http://www.gooseeker.com/doc/thread-1829-1-1.html
http://www.skyfox.org/useragent.html
http://blog.csdn.net/tao_627/article/details/42297443
http://blog.csdn.net/u012175089/article/details/61199238
https://www.bbsmax.com/A/GBJr7eZ950/
https://www.bbsmax.com/A/Vx5MjWWgdN/
http://www.361way.com/ieua/692.html
http://www.jianshu.com/p/da6a44d0791e
http://blog.csdn.net/qianqianstd/article/details/52185488
http://www.360doc.com/content/12/1012/21/7662927_241124973.shtml
http://blog.sina.com.cn/s/blog_9513526f0101pjvt.html
http://www.966266.com/jishu/32.html
http://www.cftea.com/c/2010/08/M3TS2JCPZ0IQOW60.asp
http://techseo.cn/python/31.html1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75USER_AGENTS = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
# 随机生成user-agent
class RandomUAMiddleware(object):
def process_request(self, request, spider):
request.headers["User-Agent"]=random.choice(USER_AGENTS)
USER_AGENTS = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
USER_AGENTS = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
if __name__=="__main__":
def process_request(self, request, spider):
request.headers["User-Agent"]=random.choice(USER_AGENTS)
python 有一个现成的随机生成UA(user-agent)的第三方库:fake-useragent 直接:pip install
fake-useragent 即可。 使用:
from fake_useragent import UserAgent
ua = UserAgnet()
ua.random
Urllib库详解
http://www.bkjia.com/Pythonjc/1043449.html
http://blog.csdn.net/drdairen/article/details/51149498
http://www.bijishequ.com/detail/441751?p=
简单爬虫
1 | import urllib.request |
爬虫入门
http://yshblog.com/blog/148
http://blog.csdn.net/column/details/why-bug.html
http://blog.csdn.net/Eastmount/article/category/5758691
http://zihaolucky.github.io/using-python-to-build-zhihu-cralwer/
https://www.zhihu.com/question/20899988
https://www.zhihu.com/question/21358581
https://segmentfault.com/a/1190000009111635
http://docs.python-requests.org/zh_CN/latest/user/quickstart.html
http://www.cnblogs.com/yizhenfeng168/p/7078480.html
Python爬虫技巧
http://blkstone.github.io/2016/03/02/crawler-anti-anti-cheat/
http://gohom.win/2016/01/21/proxy-py/
http://ningning.today/2015/10/04/python/Python%E7%88%AC%E8%99%AB%E7%9A%84%E4%B8%80%E4%BA%9B%E6%80%BB%E7%BB%93/
http://hamstersi.com/2017/08/Python%E7%88%AC%E8%99%AB%EF%BC%9A%E4%B8%80%E4%BA%9B%E5%B8%B8%E7%94%A8%E7%9A%84%E7%88%AC%E8%99%AB%E6%8A%80%E5%B7%A7%E6%80%BB%E7%BB%93/
http://java.ctolib.com/hellysmile-fake-useragent.html
https://github.com/jhao104/proxy_pool
http://zqdevres.qiniucdn.com/data/20130909104216/index.html
https://www.zhihu.com/topic/19577498
https://www.zhihu.com/question/35461941
https://github.com/lining0806/PythonSpiderNotes
爬虫与反爬虫
https://zhuanlan.zhihu.com/p/20520370
https://www.zhihu.com/question/28168585
https://www.zhihu.com/question/26221432
BeautifulSoup教程
http://beautifulsoup.readthedocs.io/zh_CN/latest/
https://www.gitbook.com/book/wizardforcel/bs4-doc/details
https://www.crifan.com/files/doc/docbook/python_topic_beautifulsoup/release/pdf/python_topic_beautifulsoup.pdf
http://www.jianshu.com/p/154211515a41
Scrapy教程
http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
爬虫实战项目
http://www.jianshu.com/p/78684e5a7916
https://zhuanlan.zhihu.com/p/25172216
http://python.jobbole.com/88325/
http://yz0515.com/2017/03/29/%E7%88%AC%E8%99%AB%E5%AE%9E%E6%88%98%E2%80%94%E2%80%94%E5%B0%86%E5%BB%96%E9%9B%AA%E5%B3%B0%E7%9A%84%E6%95%99%E7%A8%8B%E8%BD%AC%E6%8D%A2%E6%88%90PDF%E7%94%B5%E5%AD%90%E4%B9%A6/
搜索引擎蜘蛛爬虫User Agent一览
http://www.wilf.cn/post/search-engine-bots.html1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172今天分析研究了两个网站的 Apache 日志,分析日志虽然很无聊,但却是很有意义的事情,比如跟踪 SPAM 的 User Agent。顺便整理出一些搜索引擎爬虫的 User Agent,在这里分享一下,也欢迎补充。
微软
“msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)”
msnbot,大多数已经被bingbot替代了,现在偶尔还可以看到。
“Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)”
bing,必应
搜搜
“Sosospider+(+http://help.soso.com/webspider.htm)”
腾讯搜搜
“Sosoimagespider+(+http://help.soso.com/soso-image-spider.htm)”
搜搜图片
雅虎
“Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)”
雅虎英文
“Yahoo! Slurp China”
“Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)”
雅虎中国
搜狗
“http://pic.sogou.com” “Sogou Pic Spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)”
搜狗图片
“Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)”
搜狗,搜狗的蜘蛛程序做的很不好,总是进入死循环,已经分别在 robots.txt 和 设置中屏蔽掉
Google
“Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”
Google
“Googlebot-Image/1.0”
Google图片搜索
“Mediapartners-Google”
未知
“FeedBurner/1.0 (http://www.FeedBurner.com)”
feedburner
“AdsBot-Google-Mobile (+http://www.google.com/mobile/adsbot.html) Mozilla (iPhone; U; CPU iPhone OS 3 0 like Mac OS X) AppleWebKit (KHTML, like Gecko) Mobile Safari”
Adwords移动网络
百度
“Baiduspider-image+(+http://www.baidu.com/search/spider.htm)”
百度图片
“Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)”
亲爱的百度蜘蛛
“Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8;baidu Transcoder) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729)”
baidu+Transcoder 是用户用手机浏览网站留下的记录,Transcoder 是代码转换器,把网站转码成手机用户上网看到的网页留下的记录
360
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider
360搜索
其他搜索引擎
“Mozilla/5.0 (compatible; YoudaoBot/1.0; http://www.youdao.com/help/webmaster/spider/; )”
网易有道
“Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)”
来自瑞典的搜索引擎,网站看起来很不错,http://www.entireweb.com
“jikespider \”Mozilla/5.0”
即刻搜索,原人民搜索,搜索引擎国家队,已倒闭
“Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)”
俄罗斯yandex
Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)
宜搜,不认识,一直不停抓取,已屏蔽
其他已知bot
“HuaweiSymantecSpider/1.0+DSE-support@huaweisymantec.com+(compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR ; http://www.huaweisymantec.com/cn/IRL/spider)”
华为赛门铁克蜘蛛,是华为赛门铁克科技有限公司网页信誉分析系统的一个页面爬取程序,其作用是用于爬取互联网网页并进行信誉分析,从而检查该网站上的是否含有恶意代码。
http://baike.baidu.com/view/5994606.htm
qiniu-imgstg-spider-1.0
七牛镜像蜘蛛
“xFruits/1.0 (http://www.xfruits.com)”
xFruits,聚合rss用的
Feedly/1.0 (+http://www.feedly.com/fetcher.html; like FeedFetcher-Google)
Feedly,Google Reader 关闭后一直用这个
Mozilla/5.0 (compatible;YoudaoFeedFetcher/1.0;http://www.youdao.com/help/reader/faq/topic006/;1 subscribers;)
有道阅读
FeedDemon/4.5 (http://www.feeddemon.com/; Microsoft Windows)
一款离线RSS阅读器
“Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; JianKongBao Monitor 1.1)”
监控宝
DNSPod-Monitor/2.0
DNSPod监控
“Mozilla 5.0 (compatible; Feedsky crawler /1.0; http://www.feedsky.com)”
Feedsky
“Xianguo.com 1 Subscribers”
鲜果
360spider(http://webscan.360.cn)
360网站安全检测
“yrspider Mozilla/5.0 (compatible; YRSpider; +http://www.yunrang.com/yrspider.html)”
云壤公司,http://www.yunrang.com/yrspider.html
其他未知bot
“Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; EmbeddedWB 14.52 from: http://www.bsalsa.com/ EmbeddedWB 14.52; .NET CLR 2.0.50727)”
怀疑为发布SPAM用的,因为总是在获取注册页面和验证码
Mozilla/5.0 (compatible; LinkpadBot/1.06; +http://www.linkpad.ru)
LinkpadBot,看域名知道是来自俄罗斯的
Mozilla/5.0 (compatible; SISTRIX Crawler; http://crawler.sistrix.net/)
又一个国外的
“Mozilla/5.0 (compatible; MJ12bot/v1.4.0; http://www.majestic12.co.uk/bot.php?+)”
来自英国的未知bot
“Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)”
未知
“IS Alpha/Nutch-1.1”
未知
Nutch Spider/Nutch-2.2.1
貌似是上面那个进化来的
“BlogPulseLive (support@blogpulse.com)”
“findlinks/2.0.2 (+http://wortschatz.uni-leipzig.de/findlinks/)”
来自德国的未知bot
“Mozilla/4.0 (compatible; MSIE 6.0; AugustBot/augstbot@163.com)”
未知,貌似与网易有关
“InternetSeer.com”
未知
“Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)”
未知,已更新为下面的
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)
DotBot,不认识
“http://www.internet-zarabotok.net/” “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; Win64; AMD64)”
来自俄罗斯的未知bot
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.19; aggregator:Spinn3r (Spinn3r 3.1); http://spinn3r.com/robot) Gecko/2010040121 Firefox/3.0.19
Spinn3r,不认识
Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
Exabot,还是不认识
Mozilla/5.0 (compatible; Exabot/3.0 (BiggerBetter); +http://www.exabot.com/go/robot)
Exabot,不认识
psbot/0.1 (+http://www.picsearch.com/bot.html)
psbot,不认识
TurnitinBot/3.0 (http://www.turnitin.com/robot/crawlerinfo.html)
TurnitinBot,不认识
爬虫与正则表达式
1 | import re |