Writing a simple web crawler with the Python 3 urllib library and regular expressions

This post is a small example of using the Python 3 standard library urllib together with regular expressions to write a simple web crawler that scrapes email addresses. Doing this with requests would be simpler, but urllib is used here so the script can also run on machines where third-party packages cannot be installed.

Permalink: http://blog.jiangjiaolong.com/python3-urllib-reptile.html

Define a function that fetches a page's source

# Import modules
import re
from urllib import request
import gzip
import time

# Define a function that fetches a page's source
def get_page_soucre(url):
    # Fake Chrome request headers ("br" is omitted because only gzip is handled below)
    heardstr = """
    Accept:*/*
    Accept-Encoding:gzip, deflate
    Accept-Language:zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7,zh-TW;q=0.6
    Connection:keep-alive
    User-Agent:Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
    X-Requested-With:XMLHttpRequest
    """
    heardstrs = heardstr.split('\n')
    heard = {}
    for i in heardstrs:
        i = i.strip()
        if i != '':
            # Split on the first colon only, so values containing ':' stay intact
            temp = i.split(':', 1)
            heard[temp[0]] = temp[1]
    request1 = request.Request(url, headers=heard)
    with request.urlopen(request1) as page:
        page_source = page.read()
        # Decompress the body if the server returned gzip-encoded content
        if page.getheader('Content-Encoding') == 'gzip':
            page_source = gzip.decompress(page_source)
        try:
            page_source = page_source.decode()
        except UnicodeDecodeError:
            page_source = page_source.decode('gbk')
    return page_source
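
To check that the function works, a minimal call might look like this (a sketch added for illustration; example.com is just a placeholder URL, not part of the original post):

# Fetch a page and preview the beginning of its source (hypothetical URL)
html = get_page_soucre('http://example.com')
print(html[:200])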

Define a function that looks for email addresses in the page source. If any are found, it prints the list of addresses; if the page has a "后页" (next page) link, it recursively follows that link and keeps searching.

def find_email(page_source):
    # Match strings that look like email addresses
    email_list = re.findall(r'([0-9a-zA-Z_.]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)?)', page_source)
    if email_list:
        elist = []
        for i in email_list:
            # findall returns tuples because of the groups; keep the full match and drop duplicates
            if i[0] not in elist:
                elist.append(i[0])
        print(elist)
    # If the page has a "后页" (next page) link, follow it and keep searching
    next_page = re.search(r'<a\s.+后页', page_source)
    if next_page:
        print('next page...')
        next_url = next_page.group().split('"')[1]
        with request.urlopen(next_url) as p:
            find_email(p.read().decode())
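
To see what the regex actually captures, here is a tiny hand-written test (the addresses are made up purely for illustration):

sample = 'Contact: foo.bar@example.com or test_01@mail.example.com'
find_email(sample)
# prints: ['foo.bar@example.com', 'test_01@mail.example.com']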

To collect enough email addresses, a Baidu search results page is used as the entry point. Define a function that extracts the links of the result pages containing the keyword from a Baidu results page, plus a helper that locates the next results page so the crawler can page through automatically.

def find_bd(bat_url):
    # Baidu result links are redirect URLs of the form https://www.baidu.com/link?url=...
    bat = re.compile(r'"\shref="https?://www\.baidu\.com/link\?url=[^"]+')
    page_source = get_page_soucre(bat_url)
    url = re.findall(bat, page_source)
    if url:
        url1 = []
        for i in url:
            # The link sits between the second and third double quote of the match
            temp = i.split('"')[2]
            if temp not in url1:
                url1.append(temp)
        return url1
    else:
        return None


def next_page_url(url):
    # Look for the "下一页" (next page) link on the Baidu results page
    source = get_page_soucre(url)
    next_page = re.search(r'[^"]+"\sclass="n">下一页', source)
    if next_page:
        print('Found the next page!')
        next_page = next_page.group().split('"')[0]
        return "https://www.baidu.com" + next_page
    else:
        return None
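
As a quick illustration of how find_bd extracts the redirect links, here is the same regex and split applied to a hand-written HTML fragment (the url= value is made up):

fragment = '<a target="_blank" href="https://www.baidu.com/link?url=AbC123">some result</a>'
m = re.findall(r'"\shref="https?://www\.baidu\.com/link\?url=[^"]+', fragment)
print(m[0].split('"')[2])
# prints: https://www.baidu.com/link?url=AbC123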

Now for the final step that turns this into a crawler: define the crawler function that ties together the functions defined above.

def reptile(url):
    # Collect the result links on the current Baidu results page
    url_list = find_bd(url)
    if url_list:
        for i in url_list:
            print("Entering new post:", i)
            time.sleep(2)  # be polite: pause between requests
            page_source = get_page_soucre(i)
            find_email(page_source)
    # If Baidu has another results page, recurse into it
    next_p = next_page_url(url)
    if next_p:
        print('Moving to the next Baidu results page!')
        reptile(next_p)

Search for the keyword on Baidu and copy the link of the results page to use as the entry URL.

(Screenshot: Baidu search for the keyword)

Then simply call the reptile function:

reptile('https://www.baidu.com/s?wd=%E7%95%99%E4%B8%8B%E9%82%AE%E7%AE%B1%20site%3Adouban.com&rsv_spt=1&rsv_iqid=0xe8b753c40002aa13&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=1&oq=%25E9%2582%25AE%25E4%25BB%25B6%25E5%259C%25B0%25E5%259D%2580%2520site%253Adouban.com&rsv_t=7d704deARXuzCJek8pgSo%2B%2FaEqT6cm%2Fb7QohX1%2Bq1mVvdD4mr51D3s2g9FsDEdlBMadt&inputT=3569&rsv_pq=c5fa75fe00036c2e&rsv_sug3=46&rsv_sug2=0&rsv_sug4=6442')
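
If you would rather not copy the long URL from the address bar, an equivalent entry link can also be built with urllib.parse (an optional sketch; the keyword below is the same one encoded in the copied URL above):

from urllib import parse

keyword = '留下邮箱 site:douban.com'  # same keyword as in the copied URL
entry_url = 'https://www.baidu.com/s?' + parse.urlencode({'wd': keyword})
reptile(entry_url)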

Full example code

Click here to view the full code example

Post history

  • March 17, 2018, 18:00: first published