Over the past few days, while preparing for interviews, I read a lot of articles and code on Scrapy and its surrounding technologies. My notes are organized below:
- Approaches for crawling many sites with Scrapy:
    - Running a Scrapy spider programmatically (see the sketch after this list)
    - Building dynamically configurable crawlers with Scrapy
    - Deduplicating and storing Scrapy Items with Redis and SQLAlchemy (also sketched below)
    This series of articles will be extremely useful for future crawling work and is well worth drawing on.
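As a reminder to myself of what "running a spider programmatically" looks like, here is a minimal sketch using Scrapy's CrawlerProcess. The spider class, its name, and the URL are my own placeholders, not taken from the articles above:

# minimal sketch: run a Scrapy spider from a plain Python script
# (ExampleSpider and its URL are hypothetical placeholders)
import scrapy
from scrapy.crawler import CrawlerProcess

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        # yield one item per page, carrying the page title
        yield {'title': response.css('title::text').get()}

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
process.crawl(ExampleSpider)  # register the spider class
process.start()               # blocks until the crawl finishes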
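And for the Redis deduplication idea, a rough sketch of an item pipeline that fingerprints each item and checks a Redis set; the key name 'seen:items' and the 'url' field are assumptions of mine, and actual storage (e.g. via SQLAlchemy) would happen in a later pipeline stage:

# sketch of a dedup pipeline: a Redis set holds the seen fingerprints
# (key name 'seen:items' and item field 'url' are assumptions)
import hashlib
import redis
from scrapy.exceptions import DropItem

class RedisDedupPipeline(object):
    def open_spider(self, spider):
        self.r = redis.StrictRedis(host='localhost', port=6379, db=0)

    def process_item(self, item, spider):
        fp = hashlib.sha1(item['url'].encode('utf-8')).hexdigest()
        # sadd returns 0 if the member was already in the set
        if self.r.sadd('seen:items', fp) == 0:
            raise DropItem('duplicate item: %s' % item['url'])
        return item  # pass on to the storage pipeline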
- Customizing Scrapy
- Callbacks and asynchrony
    PS: the snippet below is excellent!!
from collections import defaultdict


def find_anagram(input_data):
    """Find all anagram groups in a word list."""
    def start(func):
        # decorator that primes a coroutine by advancing it to its first yield
        def _(*args, **kwargs):
            g = func(*args, **kwargs)
            g.send(None)
            return g
        return _

    @start
    def sign(target):
        # tag each word with its signature: its letters in sorted order
        while True:
            words = yield
            for w in words:
                target.send([''.join(sorted(w)), w])
            target.send(None)  # flag indicating that all data have been sent

    @start
    def sort(target):
        # collect the tagged words, then forward them sorted by signature
        sign_words = []
        while True:
            word = yield
            if word:
                sign_words.append(word)
            else:  # end-of-stream flag: all words have arrived
                target.send(sorted(sign_words))
                sign_words = []  # reset so the pipeline can be reused

    @start
    def squash():
        # group words by signature in the enclosing dictionary
        nonlocal dictionary  # Python 3 only: refer to the enclosing variable
        while True:
            word_list = yield
            for signature, word in word_list:
                dictionary[signature].add(word)

    dictionary = defaultdict(set)
    sign(sort(squash())).send(input_data)
    # keep only the signatures shared by more than one word
    return filter(lambda x: len(x[1]) > 1, dictionary.items())


if __name__ == "__main__":
    test_data = ['abc', 'acb', 'bca', 'iterl', 'liter', 'hello',
                 'subessential', 'suitableness', 'hello']
    result = find_anagram(test_data)
    for each in result:
        print(each[0], ':', each[1])
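This looks like the classic coroutine-pipeline pattern (the style popularized by David Beazley's "A Curious Course on Coroutines and Concurrency"): the @start decorator primes each generator so callers can .send() data straight into the chain sign -> sort -> squash, with None as an in-band end-of-stream marker.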
- The problem of crawling AJAX-loaded pages (a sketch follows this list)
- An expert's notes
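On the AJAX point, the usual trick is to skip the rendered HTML and request the JSON endpoint that the page's JavaScript calls (found via the browser's dev tools). A minimal sketch; the endpoint URL and response fields here are hypothetical:

# sketch: crawl an AJAX-backed page by hitting its JSON API directly
# (the endpoint URL and the 'results'/'title' fields are hypothetical)
import json
import scrapy

class AjaxSpider(scrapy.Spider):
    name = 'ajax_demo'
    # the endpoint the page's JavaScript would call
    start_urls = ['http://example.com/api/articles?page=1']

    def parse(self, response):
        data = json.loads(response.text)
        for entry in data.get('results', []):
            yield {'title': entry.get('title')}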
The above is the cream of what I have read over these two days; I still need a couple more days to digest it.