Has anyone seen urllib2.urlopen() hang on a timeout when fetching a page, or gotten an incomplete page when reading it with a raw socket?

huiqiang

UID: 16202
Posts: 2
Points: 4
Time online: 10 minutes

1# huiqiang posted on 2006-09-05 20:18

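I defined the function below to download web pages. Sometimes the main program hangs inside it. I checked the source, and urllib2 does not appear to call socket.settimeout() anywhere. Is there a way around that? If I fetch the page with a raw socket instead, the program no longer hangs on a timeout, but once the timeout fires the page I get back is incomplete. What is a good way to handle this?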

The two page-fetching functions are as follows:
def getHTML(self, link):
   # Assumes "import urllib2" and the poster's own errhandle module at the top of the file
   if not (link.startswith("http://")):
      link = "http://" + link
   try:
      connection = urllib2.Request(link)
      connection.add_header("Accept-Language","zh-cn")
      connection.add_header("Content-Type","text/html; charset=gb2312")
      connection.add_header("User-Agent","Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)")
      # No timeout is set anywhere, so urlopen() and read() can block indefinitely
      conn = urllib2.urlopen(connection)
      page = conn.read()
   except Exception, err:
      # On any failure, log the error and return an empty page
      print "error when opening " + link
      page = ""
      errhandler = errhandle.errhandle("log/log.log", err)
      errhandler.dealer("log/log.log", err)
      print err
   return page

This is the raw-socket version. It avoids hanging on a timeout, but the page it returns after a timeout is still incomplete. Does anyone have a good fix?

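def getHTML(self, link):
   # Assumes "import socket" at the top of the file
   # If the link contains a '*-' marker, keep only the part after '*-http://'
   if not (link.find('*-') == -1):
      link = link.split('*-http://')[1]
   # Split "host/path" once so the whole path is kept, not just its first segment
   parts = link.split('/', 1)
   link_host = parts[0]
   link_path = '/'
   if len(parts) > 1:
      link_path = '/' + parts[1]
   # "Connection: close" asks the server to close the socket after the response,
   # so the recv() loop below sees end-of-stream instead of waiting for a timeout
   request = 'GET ' + link_path + ' HTTP/1.1\r\n' + 'Host: ' + link_host + '\r\nUser-Agent: aPython2-urllib/3.0a1\r\nConnection: close\r\n\r\n'

   page = ""
   try:
      sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      sock.settimeout(5)
      sock.connect( (link_host, 80) )
      # send() returns the number of bytes written and raises on error (it never
      # returns -1); sendall() keeps writing until the whole request is sent
      sock.sendall(request)
   except Exception, err:
      print "error when connect to " + link_host
      return ""

   try:
      while 1:
         newdata = sock.recv(1024)
         if not newdata: break      # empty string: the server closed the connection
         page = page + newdata
   except socket.timeout, err:
      # page holds only the part received before the timeout
      print "timed out while reading from " + link_host
   sock.close()

   # Note: page contains the response headers followed by the body
   return page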
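
For the urllib2 hang described above, one commonly used workaround is the standard library call socket.setdefaulttimeout(), which sets a timeout on every socket created afterwards, including the ones urllib2 opens internally. The sketch below is only an illustration under that assumption: the function name fetch_with_timeout and the 10-second default are made up for this example, and the fallback import exists only so the sketch also runs on Python 3, where urllib2 became urllib.request.

```python
import socket

try:
    import urllib2                      # Python 2, as used in this thread
except ImportError:
    import urllib.request as urllib2    # Python 3 fallback so the sketch still runs


def fetch_with_timeout(link, seconds=10):
    """Return the page contents, or None if the request fails or times out.

    fetch_with_timeout is a hypothetical helper, not part of urllib2.
    """
    old = socket.getdefaulttimeout()
    # Applies to sockets created from now on, including urllib2's own
    socket.setdefaulttimeout(seconds)
    try:
        try:
            conn = urllib2.urlopen(link)
            return conn.read()
        except Exception:
            # Timeouts, refused connections and HTTP errors all end up here
            return None
    finally:
        # Restore the previous global setting so other code is unaffected
        socket.setdefaulttimeout(old)
```

Because the default timeout is process-wide, the sketch restores the old value before returning. A program that only ever downloads pages could instead call socket.setdefaulttimeout() once at startup.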

2# huiqiang posted on 2006-09-05 20:20

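This is probably directly related to network conditions. What kind of mechanism should I set up to make sure everything that was sent is actually received?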
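
One possible check, sketched here under assumptions, is to compare the number of body bytes received against the Content-Length header the server sent. The helper name body_is_complete is hypothetical, and the sketch only covers responses that carry Content-Length; a response sent with chunked transfer encoding has no such header, so the function reports "unknown" for it.

```python
def body_is_complete(raw):
    """Check a raw HTTP response (headers + body, as bytes read from a socket).

    Returns True or False when a Content-Length header is present,
    and None when the header is missing and completeness cannot be judged.
    """
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        # The blank line ending the headers never arrived
        return False
    for line in head.split(b"\r\n")[1:]:        # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            return len(body) >= int(value.strip())
    return None
```

If the check returns False, the caller could retry the download instead of keeping a truncated page.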

limodou

UID: 17491
Posts: 110
Points: 252
Time online: 16 hours

3# limodou posted on 2006-09-05 22:28

Put read() inside a while loop and keep reading until nothing more comes back:

CODE:

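buf = []
while True:
    # conn is the object returned by urllib2.urlopen(); read it in chunks
    text = conn.read(1024)
    if text:
        buf.append(text)
    else:
        break   # an empty string means end of stream
result = ''.join(buf)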

I haven't tried this myself, though.
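
A slightly extended sketch of the same loop, which also reports whether the read ended normally or was cut short by a timeout. It assumes a timeout has been set on the underlying socket, so that a stalled read raises socket.timeout. The name read_all and the 4096-byte chunk size are choices made for this example only.

```python
import socket


def read_all(conn, chunk_size=4096):
    """Read from a file-like object until it returns an empty chunk.

    Returns (data, finished). finished is False when a socket timeout
    interrupted the read, in which case data holds only what arrived so far.
    """
    buf = []
    finished = True
    while True:
        try:
            text = conn.read(chunk_size)
        except socket.timeout:
            finished = False
            break
        if not text:
            break               # empty chunk: end of stream
        buf.append(text)
    return b"".join(buf), finished
```

The caller can then tell a complete page from a partial one and retry only in the second case, for example with data, finished = read_all(conn).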