So this is the so-called breadth-first web crawling algorithm...
This post was last edited by SeriousCool on 2011-01-05 at 02:08
Start from one page and initialize a list. Every URL found on the page gets appended to the back of the list; then take an address off the front of the list, crawl it, append what it finds, take the next one, crawl, append, and so on. And here I thought it was something profound...
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
import re
import time

urlsOld = []              # URLs that have already been fetched
timeStart = time.time()


def getURL(url):
    # fetch one page and return every money.163.com article link found in it
    try:
        fp = urllib2.urlopen(url)
        s = fp.read()
        fp.close()
    except Exception:
        print 'get url exception'
        return []

    pattern = re.compile(r'http://money\.163\.com/[^>]+?\.html')
    return pattern.findall(s)


def spider(startURL, times):
    # breadth-first crawl: pop from the front of the queue, append new URLs to the back
    global urlsOld
    urls = [startURL]     # the BFS frontier (used as a FIFO queue)
    urlFind = 0
    urlFetched = 0

    # stop once the frontier is empty or 'times' pages have been fetched
    while urls and urlFetched < times:
        url = urls.pop(0)

        if urlsOld.count(url) == 0:
            urlsOld.append(url)
            print 'fetch url: ', url
            urlFetched += 1
        else:
            print 'already fetched!'
            continue

        # append every previously unseen URL on this page to the back of the queue
        for newURL in getURL(url):
            if urls.count(newURL) == 0 and urlsOld.count(newURL) == 0:
                urls.append(newURL)
                print 'find url: ', newURL
                urlFind += 1
            else:
                print 'url exists: ', newURL

    seconds = time.time() - timeStart
    print 'urls: ', len(urls), '; urlFind: ', urlFind, '; urlFetched: ', urlFetched, '; time spent: ', int(seconds), ' seconds'


spider('http://www.163.com', 1000)
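One thing worth noting: urls.count() and urlsOld.count() scan the whole list on every check, so the membership tests get slower as the crawl grows. Below is a minimal sketch of the same breadth-first loop using collections.deque as the queue and a set for the visited test; it is not part of the original post, it reuses the getURL() defined above, and spider_fast is just an illustrative name.

from collections import deque

def spider_fast(startURL, times):
    # same breadth-first discipline: pop from the left, append to the right
    frontier = deque([startURL])
    seen = set([startURL])        # O(1) membership test instead of list.count()
    fetched = 0

    while frontier and fetched < times:
        url = frontier.popleft()
        print 'fetch url: ', url
        fetched += 1

        for newURL in getURL(url):   # getURL() as defined in the post above
            if newURL not in seen:
                seen.add(newURL)
                frontier.append(newURL)

# spider_fast('http://www.163.com', 1000)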
Author: SeriousCool    Posted: 2011-01-05
Tomorrow I'll add multithreading, run it for a day, and see how many pages it can fetch.
Author: SeriousCool    Posted: 2011-01-05
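The multithreaded version mentioned above is not shown in the thread. One common way to structure it is to share a Queue as the frontier and let several worker threads pop from it, guarding the visited set with a lock. Below is a minimal sketch assuming Python 2's threading and Queue modules; worker(), threadedSpider() and NUM_WORKERS are illustrative names, not the author's code, and getURL() is reused from the first post.

import threading
import Queue

taskQueue = Queue.Queue()        # shared BFS frontier
seen = set()                     # URLs already queued or fetched
seenLock = threading.Lock()      # protects 'seen' across threads
NUM_WORKERS = 4

def worker():
    while True:
        url = taskQueue.get()
        print 'fetch url: ', url
        for newURL in getURL(url):          # getURL() from the first post
            with seenLock:
                if newURL in seen:
                    continue
                seen.add(newURL)
            taskQueue.put(newURL)
        taskQueue.task_done()

def threadedSpider(startURL):
    seen.add(startURL)
    taskQueue.put(startURL)
    for _ in range(NUM_WORKERS):
        t = threading.Thread(target=worker)
        t.setDaemon(True)                   # workers die with the main thread
        t.start()
    taskQueue.join()                        # wait until the frontier is drained

# threadedSpider('http://www.163.com')

Note that this sketch has no page limit; it keeps going until the reachable set of links is exhausted, so in practice you would want a counter like the 'times' argument above.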