[Python] Beginner: Scraping My Own Website

Although the language I study is C#, I signed up for a competition, so my teacher also went over Python with us...

I only started learning it last night. After covering how to collect data from the web, I tried scraping my own website.

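For various reasons the saved text files contain garbled characters. On top of that, some of my titles contain "/", so saving a file whose name includes the title failed. This could be solved by filtering out or replacing the characters that are not allowed in file names.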
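The file name problem mentioned above could be handled with a small helper like the one below. This is only a sketch and not part of the original script: the name `safe_filename`, the `"_"` replacement, and the `"untitled"` fallback are all assumptions made for illustration. The character set is the one Windows rejects in file names; "/" is also invalid on Linux and macOS.

```python
import re

# Characters that Windows does not allow in file names.
# "/" is also not allowed on Linux and macOS.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, replacement="_"):
    """Return the title with characters that cannot appear in a file name replaced.

    Hypothetical helper: the name, the replacement, and the fallback are assumptions.
    """
    cleaned = _INVALID_CHARS.sub(replacement, title).strip()
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or "untitled"


print(safe_filename("Python/C#: notes?"))  # Python_C#_ notes_
```

The cleaned title could then be placed between the page number and `-blog.txt` when building the output file name.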

#encoding:utf-8
'''
Created on March 6, 2017

@author: xx
'''

import urllib
import urllib2
import re
import time

#url = "https://www.cyzwb.com/"

for i in range(1,137):
    url = "https://www.cyzwb.com/%s"%i

    # Open the page with a custom User-Agent
    req = urllib2.Request(url)
    req.add_header('User-Agent',
                   'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 '
                   'Safari/537.36 '
                   'https://www.cyzwb.com/ Bot By Python-urllib/2.7')
    html = urllib2.urlopen(req).read().decode("utf-8")  
    # Regex that matches the title
    title = re.compile('<h1 class="entry-title">.*?</h1>', re.S)
    
    h1 = re.search(title, html).group(0)
    
    r2 = re.compile('<.*?>',re.S)
    
    h1 = re.sub(r2,"", h1)
    
    
    print h1
   
    content = re.compile('<div class="entry-content">.*?</div>', re.S)
    
    notices = re.findall(content,html)
    # Output file name: <date>-www.cyzwb.com-<page number>--blog.txt
    filename = time.strftime('%Y-%m-%d') + "-www.cyzwb.com-" + str(i) + "-" + '-blog.txt'
    # "with" closes the file even if writing fails
    with open(filename, 'w') as f:
        f.write(h1.encode("utf-8") + "\n")

        for n in notices:
            n = re.sub(r2, "", n)
            f.write(n.encode("utf-8") + "\n")
            print n
        f.write("-----" + url + "-----\n")
        f.write('-------------------www.cyzwb.com-------------------\n')
    print '-----------------------------------------'
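The title extraction step in the script can be tried offline on a sample string, which makes it easier to see what the two regular expressions do. The sketch below is an illustration only: `extract_title` and `sample_html` are made-up names, and the `None` check is an addition that the script above does not have. It is written so that it runs on both Python 2 and Python 3.

```python
import re

# Same patterns as in the script: the title element, and any HTML tag.
TITLE_RE = re.compile(r'<h1 class="entry-title">.*?</h1>', re.S)
TAG_RE = re.compile(r'<.*?>', re.S)


def extract_title(html):
    """Return the text of the first entry-title heading, or None if absent."""
    match = TITLE_RE.search(html)
    if match is None:
        # The page has no matching <h1>; the original script would raise here.
        return None
    # Strip the surrounding and nested tags, keeping only the text.
    return TAG_RE.sub("", match.group(0)).strip()


# Hypothetical page fragment, not taken from the real site.
sample_html = '<article><h1 class="entry-title"><a href="/1">Hello / World</a></h1></article>'
print(extract_title(sample_html))  # Hello / World
```

Returning `None` instead of raising lets the loop skip pages whose markup does not contain the expected heading.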

At first I had not called "add_header", so the recorded User-Agent was Python-urllib/2.7.
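The difference can be checked without sending any request. The following is a minimal sketch, assuming only the standard library; the `ExampleBot/0.1` User-Agent string is a made-up value, and the import fallback is there so the snippet runs on Python 3 as well as on the Python 2 used in this post.

```python
try:
    import urllib.request as urlreq  # Python 3
except ImportError:
    import urllib2 as urlreq  # Python 2, as used in this post

# Headers an opener adds by default when the request does not set them.
default_headers = dict(urlreq.build_opener().addheaders)
print(default_headers["User-agent"])  # Python-urllib/<version>

# Build a request with a custom User-Agent; nothing is sent over the network.
req = urlreq.Request("https://www.cyzwb.com/1")
req.add_header("User-Agent", "Mozilla/5.0 (compatible; ExampleBot/0.1)")  # hypothetical value

# Request stores header names capitalized, so the lookup key is "User-agent".
print(req.get_header("User-agent"))
```

A header set on the request takes the place of the opener's default, which is why the Python-urllib value no longer shows up once `add_header` is used.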

Just a beginner's clumsy attempt, please bear with me...

超越自我吧

March 7, 2017


Posted by

ChiuYut