A practice exercise.

This time the target is an online web novel.

Book link: click me

The crawler program is as follows:

import requests
import re

url = "http://book.zongheng.com/showchapter/911329.html"
response = requests.get(url)
response.encoding = 'utf-8'
html = response.text

# Book title
title = re.findall(r'<a href="http://book.zongheng.com/book/.*?">(.*?)</a>', html, re.S)[0]
title = title.replace('>', '')

# Output txt file
document = open('%s.txt' % title, 'w', encoding='utf-8')

# Chapter list block
ul = re.findall(r'<ul class="chapter-list clearfix">.*?</ul>', html, re.S)
ul = str(ul)

# Chapter links and names
chapter_list = re.findall(r'<a href="(.*?)".*?>(.*?)</a>', ul, re.S)

# Fetch each chapter in turn
for chapter_url, chapter_name in chapter_list:
    chapter_response = requests.get(chapter_url)
    chapter_response.encoding = "utf-8"
    chapter_html = chapter_response.text
    # Extract the chapter body
    chapter_text = re.findall(r'<div class="content" itemprop="acticleBody">(.*?)</div>', chapter_html, re.S)[0]
    # Clean the text
    chapter_text = chapter_text.replace('&nbsp;', '')
    chapter_text = chapter_text.replace('\n', '')
    chapter_text = chapter_text.replace('</p>', '')
    chapter_text = chapter_text.replace('<p>', '\n ')
    # Write to file
    document.write(chapter_name)
    document.write(" ")
    document.write(chapter_text)
    document.write('\n')
    # Progress message
    print(chapter_name + " written successfully!")

document.close()
print("Program complete")
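The extract-and-clean steps inside the loop can be pulled out into a standalone function that is easy to check without hitting the network. A minimal sketch, mirroring the regex cleaning above (the sample HTML snippet is made up for illustration; the `itemprop` value is whatever the site actually serves):

```python
import re

def extract_chapter_text(chapter_html):
    """Pull the chapter body out of a page and strip HTML artifacts,
    using the same regex-based cleaning as the scraper above."""
    body = re.findall(
        r'<div class="content" itemprop="acticleBody">(.*?)</div>',
        chapter_html, re.S)[0]
    body = body.replace('&nbsp;', '')   # drop HTML non-breaking spaces
    body = body.replace('\n', '')       # drop raw newlines
    body = body.replace('</p>', '')     # closing tags carry no text
    body = body.replace('<p>', '\n  ')  # each paragraph on its own line
    return body

# Hypothetical snippet for illustration only:
sample = ('<div class="content" itemprop="acticleBody">'
          '<p>First line.</p><p>Second line.</p></div>')
print(extract_chapter_text(sample))
```

Testing the cleaning on a small snippet like this catches regex mistakes before running the full scrape.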

In theory this works for every free book on the site: just replace `url` with another book's table-of-contents page.

Note, however, that too many unusual requests may get blocked; the page then comes back without the expected content, and the `findall(...)[0]` raises a "list index out of range" error.
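One way to soften that failure mode is to sleep between requests and retry a few times before giving up. A minimal sketch, not part of the original script; the retry counts, delay, and `fetch` callable are illustrative assumptions:

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times, sleeping between attempts.

    Returns the first successful result, or re-raises the last error so
    the caller knows the chapter failed rather than silently came back empty.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as err:  # e.g. IndexError from an empty findall
            last_error = err
            time.sleep(delay)
    raise last_error

# Usage with a fake fetch that fails twice, then succeeds:
calls = {'n': 0}
def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise IndexError('list index out of range')
    return 'chapter text'

print(fetch_with_retry(flaky_fetch, 'http://example.com', delay=0.01))
```

In the scraper, the per-chapter download and extraction could be wrapped in such a function so a single blocked request does not abort the whole book.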