小老弟网站 (All of Li Bai's Poems in One Sweep)

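Today I happened to come across a notes file of code from when I was first learning Python, and I rather admire how seriously I studied back then ^_^ The code scrapes several pages of Li Bai's poems and writes each poem to its own TXT file. It was written for Python 2.7; today I changed it slightly so that it runs correctly under Python 3. Many Python tutorials online still contain quite a few mistakes, so I am posting this simple scraper to get the discussion started. For beginners, the code has three key points: (1) assembling full links from href values; (2) finding and assembling the "next page" link; (3) following each href and extracting that page's poem text through the page's tags.

# coding: utf-8
# 'http://www.shicimingju.com'
# Scrape a multi-page poetry site and save the poems as TXT files.
# -- TODO: rethink the read/write-to-txt section.
import os
import re

import requests
from bs4 import BeautifulSoup as BP

base = 'http://www.shicimingju.com'
url = 'http://www.shicimingju.com/chaxun/zuozhe/1.html'
visithead = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) '
                           'Gecko/20100101 Firefox/58.0'}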

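def geturls(url):
    """Return the poem links on one listing page plus the next-page URL (None on the last page)."""
    print('-' * 80)
    r = requests.get(url, headers=visithead)
    html = r.content  # raw bytes; BeautifulSoup detects the encoding itself
    soup = BP(html, 'lxml')

    # Poem links: every <a href> inside the main container, prefixed with the site root.
    div = soup.find('div', attrs={'class': 'www-shadow-card www-main-container'})
    hrefs = [l.attrs['href'] for l in div.find_all('a') if l.has_attr('href')]
    hrefs = [base + i for i in hrefs]
    print(hrefs)

    # Next-page link: the <a> in the pagination bar whose text contains "下一页" (next page).
    n = soup.find('div', attrs={'class': 'pagination www-shadow-card'})
    n2 = n.find('a', string=re.compile(u'\u4e0b\u4e00\u9875')) if n else None
    nexturl = [base + i for i in re.findall(r'/.*\.html', str(n2))]
    print(u'\u4e0b\u4e00\u9875', '-' * 32)
    print(nexturl)

    ans = {}
    ans['hrefs'] = hrefs
    # The last page has no next link; return None so this page's hrefs are not lost.
    ans['nexturl'] = nexturl[0] if nexturl else None
    return ans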

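def txt(url, index):
    """Fetch one poem page and write it to ok/<index>-<title>.txt."""
    r = requests.get(url, headers=visithead)
    html = r.content  # raw bytes; BeautifulSoup detects the encoding itself
    soup = BP(html, 'lxml')

    x = {'class': 'shici-container www-shadow-card'}
    # Option 1: the traditional way, processing the whole div at once.
    # c0 = soup.find('div', attrs=x).text
    # c0 = re.sub(r'[ ]', '', c0)
    # c0 = re.sub(r'[\xa0]', '', c0)

    # Option 2: handle the title, author and poem body of the div separately.
    box = soup.find('div', attrs=x)
    c1 = box.h1.text  # title
    c2 = box.find('div', attrs={'class': 'shici-info'}).text  # author
    c3 = box.find('div', attrs={'class': 'shici-content'}).text  # body
    c3 = re.sub(r'[\xa0]', '', c3)  # strip non-breaking spaces
    c3 = re.sub(r'[ ]{4}', '', c3)  # strip runs of four spaces

    t = re.sub(r'[/]', ' ', c1).strip()  # a slash in the title would break the file name

    filedir = os.path.join(os.getcwd(), 'ok')
    os.makedirs(filedir, exist_ok=True)

    with open(os.path.join(filedir, '%d-%s.txt' % (index, t)), mode='w', encoding='utf-8') as f:
        c0 = c1 + u'\n' + c2 + c3  # newline between title and author
        f.write(c0)
        print(c0)

ans = geturls(url)
allhrefs = ans['hrefs']

# Follow the next-page links until none is left, or until a page fails to load or parse.
while ans['nexturl']:
    try:
        ans = geturls(ans['nexturl'])
    except (requests.RequestException, AttributeError, IndexError):
        break
    allhrefs = allhrefs + ans['hrefs']

print('This is the last page!\n')
print('Found', len(allhrefs), 'links in total.')
input('Press Enter to write the txt files!')

# Only the first 1% of the links are fetched here; use range(len(allhrefs)) to fetch them all.
for i in range(len(allhrefs) // 100):
    txt(allhrefs[i], i + 1)
    print(i + 1, '......done!')
    print('-' * 78)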
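
Key point 1 builds full links with `base + href`, which works as long as every href on the page is root-relative. As a hedged alternative (not what the script above does), the standard library's `urllib.parse.urljoin` resolves an href against the page it was found on, and it also copes with page-relative and already-absolute hrefs. The helper name `absolutize` and the hrefs in `sample_hrefs` are made up for illustration; only the listing-page URL comes from the script.

```python
from urllib.parse import urljoin

# The listing page used by the script above.
PAGE = 'http://www.shicimingju.com/chaxun/zuozhe/1.html'


def absolutize(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical hrefs, chosen to show the three shapes a link can take.
sample_hrefs = [
    '/chaxun/list/1.html',                     # root-relative
    '1_2.html',                                # relative to the current page
    'http://www.shicimingju.com/book/a.html',  # already absolute
]

links = absolutize(PAGE, sample_hrefs)
for link in links:
    print(link)
```

With plain `base + href`, the second and third samples would come out malformed, so `urljoin` may be the safer choice if the site's link format ever changes.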
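[Screenshots omitted: scraping in progress; scraping results; format of the saved text files.]
By the same approach, replacing the start URL with another author's listing page fetches all of that author's poems, or a chosen number of them.
[Screenshot omitted: the complete code.]
Note 1: the poetry site used in this article is for learning and testing only; no copyright infringement is intended.
Note 2: when reposting, please include the Toutiao link to this article.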
