Fetching web page titles with LWP: a rough first attempt, advice wanted

King_Leo

UID: 24392
Posts: 142
Points: 326
Time online: 1 day 3 hours

#1 King_Leo posted on 2008-05-22 11:49

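The requirement is as follows:
I need to batch-process a set of URLs, probably a few hundred to a few thousand. There is no restriction on the URLs; they can be anything (my own browsing history, for example).
I want to use Perl to fetch the title of each page.
The code I have so far is below. It mostly works but still has some problems, so I would appreciate advice from anyone with experience.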

#!/usr/local/bin/perl

use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Request;

my $url = 'http://news.sina.com.cn/c/p/2008-05-22/020615590454.shtml';

my $ua = LWP::UserAgent->new;
$ua->timeout(5);       # give up on a request after 5 seconds
my $hr = HTTP::Request->new(GET => $url);

my $res = $ua->request($hr);

if ($res->is_success) {  # the request succeeded
  # Content-Type carries the media type (e.g. text/html) and possibly a charset
  my $content_type = $res->header('Content-Type') || '';
  if ($content_type =~ m{text/html|text/plain}i) { # only ordinary web pages
      # take the encoding from the charset parameter, defaulting to utf-8
      my $charset = 'utf-8';
      if ($content_type =~ /charset=["']?([^\s;"']+)/i) {
               $charset = $1;
      }
      my $title = $res->header('Title');
      if (defined $title) {
          # decode with the page charset, then encode as gbk for the console
          print "Title : ".encode('gbk',decode($charset,$title))."\n";
      }
      else {
          print "No title found\n";
      }
  }
  else {
      print "This URL is not an ordinary web page\n";
  }
}
else {
      print "Request failed: ".$res->status_line."\n";
}

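My questions:
1. I only want the title, for example of shopping-site pages or forum pages. Is restricting to text/html|text/plain enough to cover these?
2. Is there a ready-made method for getting a page's charset? I searched for a long time and only found the Content-Type approach.
3. How efficient is this code, and how can I make it run faster? There are a lot of URLs, after all.
4. Is there anything else about this requirement that I have not thought of? Pointers from experts are welcome.
Thanks, everyone.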
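
On question 2, here is a minimal sketch of pulling the charset parameter out of a Content-Type value using core Perl only. The helper name `charset_from_content_type` and the utf-8 fallback are assumptions made for illustration and are not part of LWP. Newer libwww-perl releases also offer `$res->content_charset` and `$res->decoded_content`, which may make hand parsing unnecessary.

```perl
use strict;
use warnings;

# Hypothetical helper: return the charset named in a Content-Type value,
# lowercased, or the given fallback when no charset parameter is present.
sub charset_from_content_type {
    my ($content_type, $fallback) = @_;
    $fallback = 'utf-8' unless defined $fallback;
    return $fallback unless defined $content_type;
    # Accept optional quotes and stop at whitespace, ';' or ','.
    if ($content_type =~ /charset\s*=\s*["']?([^\s;,"']+)/i) {
        return lc $1;
    }
    return $fallback;
}

print charset_from_content_type('text/html; charset=Big5'), "\n";        # big5
print charset_from_content_type('text/html; charset="GBK"; x=1'), "\n";  # gbk
print charset_from_content_type('text/html'), "\n";                      # utf-8
```

Many pages declare their encoding only in a meta tag inside the HTML, so a header-based lookup like this one is best treated as a first guess.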

#2 King_Leo posted on 2008-05-22 15:14

Bump. Is there really nobody here with experience in this?

#3 King_Leo posted on 2008-05-23 10:12

Bump.

hfahe

UID: 36097
Posts: 160
Points: 368
Time online: 1 day 11 hours

#4 hfahe posted on 2008-05-23 10:39

For the charset, there is a module called Lingua::Han::Utils.

Alternatively, use eval.

#5 hfahe posted on 2008-05-23 10:44

if ($res->is_success eq '1') => if($res->is_success)

Also, if you want it to run faster, you can use fork or Thread.

#6 King_Leo posted on 2008-05-23 10:51

Thanks a lot.
Could you go into a bit more detail, especially about the eval approach?
And how does using fork make things faster?

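#7 King_Leo posted on 2008-05-23 10:58

I have run into a problem with this URL:
http://tw.f6.page.bid.yahoo.com/ ... 5961?u=sky679472000
The page is encoded in big5.
When I run my program above on it, it fails with an error from Encode. Any pointers? The output is: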
1
200 OK
charset:big5
Wide character in subroutine entry at /usr/local/lib/perl5/5.8.8/x86_64-linux/Encode.pm line 166.

Thanks.
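
One plausible cause, offered as an assumption rather than a diagnosis of this particular page: `decode` expects a string of bytes, and Encode dies when the string it is given already contains characters above 0xFF (for instance because the title was decoded earlier, or an HTML entity in it was expanded). The sketch below reproduces that situation with core modules only and shows one defensive workaround. The helper name `title_to_characters` is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode only when the value still looks like raw bytes.
sub title_to_characters {
    my ($title, $charset) = @_;
    return '' unless defined $title;
    # A code point above 0xFF cannot be a byte, so the string is already text.
    return $title if $title =~ /[^\x00-\xFF]/;
    return decode($charset, $title);
}

my $already_text = "\x{4E2D}\x{6587}";   # U+4E2D U+6587 as Perl characters
my $big5_bytes   = "\xA4\xA4\xA4\xE5";   # the same two characters as Big5 bytes

# Passing wide characters to decode is expected to die, as in the error above.
my $ok = eval { decode('big5', $already_text); 1 };
print "decode on wide characters: ", ($ok ? "ok" : "died"), "\n";

# The guarded helper returns the same text for both kinds of input.
print "text input:  ", (title_to_characters($already_text, 'big5') eq $already_text ? "same" : "differ"), "\n";
print "bytes input: ", (title_to_characters($big5_bytes, 'big5') eq $already_text ? "same" : "differ"), "\n";
```

The guard is a heuristic: a string that mixes raw bytes with already-decoded characters cannot be repaired this way and needs to be fetched again as plain bytes.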

#8 hfahe posted on 2008-05-23 14:46

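# With FB_CROAK as the check value, decode dies on bytes that are invalid in
# the named encoding; eval traps the die and leaves the message in $@.
# decode may modify its input when a check value is given, hence the copy.
eval {my $str2 = $str; Encode::decode("gbk", $str2, Encode::FB_CROAK())};
print "not gbk: $@\n" if $@;

# "UTF-8" is the strict variant; the lax "utf8" accepts more than valid UTF-8
eval {my $str2 = $str; Encode::decode("UTF-8", $str2, Encode::FB_CROAK())};
print "not utf8: $@\n" if $@;

eval {my $str2 = $str; Encode::decode("big5", $str2, Encode::FB_CROAK())};
print "not big5: $@\n" if $@;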
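
The eval trick above can be wrapped into a small routine that tries candidate encodings in order. This is a sketch under assumptions: the helper name `first_clean_encoding` and the candidate order are illustrative, and the result is only a guess, because the same bytes can be valid in more than one encoding (gbk and big5 overlap heavily).

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the first candidate encoding under which the
# bytes decode without error, or nothing if no candidate fits.
sub first_clean_encoding {
    my ($bytes, @candidates) = @_;
    for my $name (@candidates) {
        my $copy = $bytes;   # decode with a check value may modify its input
        my $ok = eval { Encode::decode($name, $copy, Encode::FB_CROAK()); 1 };
        return $name if $ok;
    }
    return;
}

my $utf8_bytes = "\xE4\xB8\xAD\xE6\x96\x87";   # U+4E2D U+6587 encoded as UTF-8
my $gbk_bytes  = "\xD6\xD0\xCE\xC4";           # the same characters encoded as GBK

print first_clean_encoding($utf8_bytes, 'UTF-8', 'gbk', 'big5'), "\n";   # UTF-8
print first_clean_encoding($gbk_bytes,  'UTF-8', 'gbk', 'big5'), "\n";   # gbk
```

Putting strict UTF-8 first is a deliberate choice: it rejects most non-UTF-8 input, whereas the double-byte Chinese encodings accept a wide range of byte sequences.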

####################
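use strict;
use warnings;
use threads;    # supersedes the older Thread module
use Encode;
use LWP::UserAgent;
use HTTP::Request;
use Lingua::Han::Utils;

my @url = ('http://news.sina.com.cn/c/p/2008-05-22/020615590454.shtml',
'http://www.sohu.com','http://www.sina.com.cn','http://www.baidu.com');

sub get_title {
    my $url = shift;
    my $ua = LWP::UserAgent->new;
    $ua->timeout(5); # give up on a request after 5 seconds
    my $hr = HTTP::Request->new(GET => $url);

    my $res = $ua->request($hr);

    if ($res->is_success) {
        my $title = $res->header('Title');
        return unless defined $title;   # no title was found for this page
        # cdecode is expected to detect the encoding and return decoded text
        my $content = Lingua::Han::Utils::cdecode($title);
        # encode explicitly for output instead of clearing the UTF-8 flag
        print Encode::encode('UTF-8', $content), "\n";
    }
}

# one thread per URL, so the requests wait on the network concurrently
my @t;
for my $u (@url) {
    push @t, threads->create(\&get_title, $u);
}
$_->join for @t;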
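
On the earlier question of why fork helps: most of the time per URL is spent waiting for the remote server, so running several requests at once lets those waits overlap. The sketch below illustrates this with core modules only. `fake_fetch_title` is a stand-in that sleeps instead of touching the network, and all names and example.com URLs in it are assumptions made for illustration.

```perl
use strict;
use warnings;
use POSIX ();
use Time::HiRes qw(time sleep);

# Stand-in for a slow HTTP request (an assumption; no network is used here).
sub fake_fetch_title {
    my ($url) = @_;
    sleep 0.5;                        # pretend to wait on the remote server
    return "title of $url";
}

# Run fake_fetch_title for every URL in its own child process and collect
# the results through one pipe per child.
sub fetch_all_in_parallel {
    my @urls = @_;
    my @jobs;
    for my $url (@urls) {
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {              # child: do the work, report, exit
            close $reader;
            print {$writer} fake_fetch_title($url), "\n";
            close $writer;            # flushes the result into the pipe
            POSIX::_exit(0);          # leave without running parent cleanup
        }
        close $writer;                # parent keeps only the read end
        push @jobs, { url => $url, pid => $pid, reader => $reader };
    }
    my %title_for;
    for my $job (@jobs) {
        my $fh   = $job->{reader};
        my $line = <$fh>;
        close $fh;
        waitpid $job->{pid}, 0;       # reap the child
        chomp $line if defined $line;
        $title_for{ $job->{url} } = $line;
    }
    return \%title_for;
}

my @urls    = map { "http://example.com/page$_" } 1 .. 4;
my $start   = time;
my $titles  = fetch_all_in_parallel(@urls);
my $elapsed = time - $start;
printf "%d titles in %.1f seconds\n", scalar(keys %$titles), $elapsed;
```

Fetched one after another, the four simulated requests would take about two seconds; here the children sleep at the same time. With hundreds or thousands of URLs the number of simultaneous children should be capped, for example by working through the list in batches or with a CPAN module such as Parallel::ForkManager.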

#9 King_Leo posted on 2008-05-23 15:18

Many thanks.
I will study this carefully.
There are still some parts I do not understand.