Script for 《和美女同事的电梯一夜》 ("A Night in the Elevator with a Beautiful Colleague")

newbuding

UID: 26757
Posts: 1
Points: 2
Online time: 10 minutes

#1 newbuding, posted 2006-10-24 18:18

This script scrapes a novel that is fairly popular online from one of the blogs hosted on this site.

CODE:

#!/usr/bin/env python
# coding:latin-1
# Python 2 script: fetch every chapter linked from the index page and print its text.

import urllib,re

START_URL=r'http://blog.chinaunix.net/u/12612/showart_183257.html'
BASE_URL=r'http://blog.chinaunix.net/u/12612/showart.php?id='
re_str=r'showart\.php\?id=(\d+)'
ids=[]  # chapter ids in index-page order (renamed from `list` so the builtin is not shadowed)

base_html=urllib.urlopen(START_URL).read()
p=re.compile(re_str,re.M)
counter=0
for m in p.finditer(base_html):
    ids.append(m.groups()[0])
    #print m.groups()
    counter=counter+1
print '%d chapters in total'%counter

counter=0
try:
    for i in ids:
        one_url=BASE_URL+i
        one_html=urllib.urlopen(one_url).read()
        counter=counter+1
        #f=open('one.html','w')
        #f.write(one_html)
        #f.close()
        # NOTE: the HTML tags in this pattern appear to have been lost when the
        # post was rendered; as shown, it simply matches every line.
        re1_str=r'(.*)'
        p1=re.compile(re1_str,re.M)
        print 'Start>>>','='*20,counter,'='*20
        for m in p1.finditer(one_html):
            aa=m.groups()[0]
            #bb=unicode(aa,'gb18030') #aa.decode('gbk').encode('utf8')
            # Strip HTML tags, then HTML entities, from the captured text.
            cc=re.sub(r'<(.*)>|</.*>|<(.*)\/>','',aa)
            cc=re.sub(r'&\w*;','',cc)
            print cc
        print '='*20,counter,'='*20,'<<<END'
except IOError,msg:  # urllib.urlopen raises IOError; the original `Error` is not a defined name
    print msg
x=raw_input("Press Enter to exit")

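I have only just finished it. For now it just prints to the screen; on *nix you can simply redirect the output to a file.
There is still one small problem: for some reason the pages for chapters 83-89 cannot be extracted,
and no error is shown either. If anyone is interested, please take the script and help have a look.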
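A side note on the tag-stripping step in the script above: the pattern `<(.*)>` is greedy, so on a line containing several tags it removes everything from the first `<` to the last `>`, including the text in between. The sketch below is written for Python 3 (the thread's code is Python 2) and compares that pattern with an assumed alternative, `<[^>]+>`, which removes one tag at a time. The helper names `strip_greedy` and `strip_tags` and the sample line are made up for illustration, and decoding entities with `html.unescape` instead of deleting them is my own choice, not something from the thread.

```python
import html
import re

# Pattern copied from the script in the post above.
GREEDY = re.compile(r'<(.*)>|</.*>|<(.*)\/>')
# Assumed alternative (not from the thread): match a single tag at a time.
TAG = re.compile(r'<[^>]+>')


def strip_greedy(line):
    """Remove tags the way the posted script does."""
    return GREEDY.sub('', line)


def strip_tags(line):
    """Remove each tag individually, then decode entities such as &amp;."""
    return html.unescape(TAG.sub('', line))


line = 'He said <B>hello</B> &amp; left.'
print(repr(strip_greedy(line)))  # the word between the tags is lost
print(repr(strip_tags(line)))    # the word between the tags survives
```

A regular expression is still a rough tool for HTML; for anything beyond a quick script, a real parser such as the standard library's `html.parser` is usually the safer route.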

xushanjun

UID: 11361
Posts: 190
Points: 436
Online time: 2 days 1 hour

#2 xushanjun, posted 2006-10-25 18:16

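It is because of the HTML source of the page for your chapter 84: re1_str=r'(.*)' does not work on it. Take a look at the source and you will see.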
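This also explains why no error appears: `finditer` on a pattern that matches nothing just yields no results, so the loop body never runs. The sketch below is Python 3 and shows the effect. The real pattern was mangled when the post was rendered, so `<P>(.*)</P>` here is only a hypothetical stand-in, and both sample pages are invented.

```python
import re

# Hypothetical stand-in: the pattern actually used in the thread is not recoverable.
PARAGRAPH_ONLY = re.compile(r'<P>(.*)</P>', re.M)


def extract(pattern, page):
    """Return the first capture group of every match, in document order."""
    return [m.group(1) for m in pattern.finditer(page)]


page_with_p = '<P>chapter text</P>'
page_with_div = '<DIV style="x">chapter text</DIV>'

print(extract(PARAGRAPH_ONLY, page_with_p))    # text is found
print(extract(PARAGRAPH_ONLY, page_with_div))  # empty list, and no exception
```

Printing a warning when a page yields zero matches would make this kind of silent miss visible.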

#3 xushanjun, posted 2006-10-25 19:55

Changing re1_str=r'<>(.*)' to re1_str=r'<(DIV|P)(\s|\w|=|"|\-|:|;)*>(.*)</(DIV|P)>' should probably fix it.
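To see what the suggested pattern captures, here is a small Python 3 sketch run on invented sample markup. The pattern is the one from the post; the helper name `block_texts` and the sample strings are assumptions for illustration. The text sits in the third capture group, because the first two groups are the tag name and the last attribute character.

```python
import re

# Pattern suggested in the post above.
PATTERN = re.compile(r'<(DIV|P)(\s|\w|=|"|\-|:|;)*>(.*)</(DIV|P)>', re.M)


def block_texts(page):
    """Return the text between each opening and closing <DIV>/<P> pair."""
    return [m.group(3) for m in PATTERN.finditer(page)]


sample = '<P>first</P>\n<DIV style="color:red;">second</DIV>\n<SPAN>ignored</SPAN>'
print(block_texts(sample))

# Two caveats: `(.*)` is greedy, so two blocks on one line are captured as
# one string, and the tag names are matched case-sensitively.
print(block_texts('<P>a</P><P>b</P>'))
print(block_texts('<p>lower</p>'))
```

Passing `re.I` would also accept lowercase tags, and `(.*?)` would stop at the first closing tag; whether either change is needed depends on the actual page source, which is not shown in the thread.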

#4 xushanjun, posted 2006-10-25 20:57

CODE:

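#!/usr/bin/env python
# -*- coding:gb2312 -*-
# Python 2 script: same crawl as in post #1, but failed downloads are retried
# and the text pattern accepts <DIV> as well as <P> blocks.

import urllib,re

START_URL=r'http://blog.chinaunix.net/u/12612/showart_183257.html'
BASE_URL=r'http://blog.chinaunix.net/u/12612/showart.php?id='
re_str=r'showart\.php\?id=(\d+)'
ids=[]  # chapter ids in index-page order (renamed from `list` so the builtin is not shadowed)

base_html=urllib.urlopen(START_URL).read()
p=re.compile(re_str,re.M)
counter=0
for m in p.finditer(base_html):
    ids.append(m.groups()[0])
    #print m.groups()
    counter=counter+1
print '%d chapters in total'%counter

# Work through the queue; an id leaves it only after a successful download.
while len(ids):
    one_url=BASE_URL+ids[0]
    try:
        one_html=urllib.urlopen(one_url).read()
    except IOError:
        # Download failed: keep the id queued and try again.
        # Note that this loops forever if the URL never loads.
        pass
    else:
        ids.pop(0)

        # NOTE: the HTML tags in this pattern appear to have been lost when the
        # post was rendered; as shown, it captures the first line of the page.
        title = r'(.*)'
        title = re.compile(title).search(one_html).groups()[0]

        # The third group holds the text between the opening and closing tags.
        re1_str=r'<(DIV|P)(\s|\w|=|"|\-|:|;)*>(.*)</(DIV|P)>'
        p1=re.compile(re1_str,re.M)
        print 'Start>>>','='*20,title,'='*20
        for m in p1.finditer(one_html):
            aa=m.groups()[2]
            cc=re.sub(r'<(.*)>|</.*>|<(.*)\/>','',aa)
            cc=re.sub(r'&\w*;','',cc)
            print cc
        print '='*20,title,'='*20,'<<<END'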
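One thing to watch in the loop above: a chapter whose download keeps failing is never removed from the queue, so the script would retry it forever. A bounded retry is one possible way around that. The sketch below is Python 3 and is not from the thread: `fetch_with_retry`, its `attempts` and `delay` parameters, and the `make_flaky` stand-in downloader are all hypothetical names. The fetch function is passed in so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url) up to `attempts` times.

    Returns the result of the first successful call, or None if every
    attempt raises OSError (in Python 3, urllib's URLError is a subclass).
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt + 1 < attempts:
                time.sleep(delay)  # wait before the next attempt
    return None


def make_flaky(failures):
    """Hypothetical downloader: raises OSError `failures` times, then succeeds."""
    state = {"left": failures}

    def fetch(url):
        if state["left"] > 0:
            state["left"] -= 1
            raise OSError("temporary failure")
        return "page for " + url

    return fetch


print(fetch_with_retry(make_flaky(2), "chapter-84"))  # succeeds on the third try
print(fetch_with_retry(make_flaky(5), "chapter-85"))  # gives up after three tries
```

In a real run, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read()`, and chapters that come back as `None` could be collected and reported at the end.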