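chomp only removes the trailing record separator ("\n" by default), but that is probably not the real problem here. To handle text in different encodings, Perl takes a simple and effective approach: internally it works with one unified, general and efficient encoding (UTF-8 was chosen), and PerlIO layers inserted at the I/O level convert between that and the external encoding, or perform other transformations. See the relevant documentation for details. The benefit is clear: whatever encoding the source or target text uses, and whatever the platform, the internal text format is always the same. This avoids repeated conversions, which helps efficiency, and it lets most APIs and regular expressions work independently of the text encoding.

As is well known, the line terminator differs from platform to platform for various reasons. The ':crlf' PerlIO layer exists to deal with this: on Windows it turns "\r\n" into "\n" on input and does the reverse on output. Unfortunately it is somewhat problematic in the Windows builds of Perl, mainly because of the order in which the PerlIO layers are stacked. The default order is ':unix:crlf' . ":encoding($enc)", where $enc is the encoding of the text, whereas the correct order is ':unix' . ":encoding($enc)" . ':crlf'. With ':crlf' below the encoding layer, the line-ending translation runs on the encoded bytes rather than on characters, which breaks multi-byte encodings such as UTF-16. Personally I prefer ':perlio:raw' . ":encoding($enc)" . ':crlf:utf8'. The code below is for reference: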
use strict;
use warnings;
use Encode;
use Encode::Guess;

my %opt = (
    encode => 'UTF-16',    # default encoding when none is given
    buf_sz => 1024,        # number of bytes read when guessing the file encoding
);

sub guess_file_encoding($);
sub open_file($$;$);

# main routine
Encode::Guess->add_suspects( qw/cp936/ );    # extra encodings to consider

# guess the file encoding first
my $src = shift or die "Usage: $0 <source-file>\n";
my $decoding = guess_file_encoding( $src );
my $FH = open_file( $src, '<', $decoding );
# do whatever you want with $FH here
close $FH;
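
# sub definitions
sub guess_file_encoding($) {
    my $file = shift or die "guess_file_encoding: no input file given\n";
    open my $FH, '<', $file or die "Cannot open '$file': $!\n";

    # read raw bytes: drop the default :crlf layer (if any), then switch to :raw
    my @default_layers = PerlIO::get_layers($FH);
    binmode($FH, ':pop') if $default_layers[-1] eq 'crlf';
    binmode($FH, ':raw');

    # sample at most $opt{buf_sz} bytes from the start of the file
    my $size   = -s $file || 0;
    my $buf_sz = $opt{buf_sz} < $size ? $opt{buf_sz} : $size;
    my $buf    = '';
    defined read( $FH, $buf, $buf_sz ) or die "Cannot read '$file': $!\n";
    close $FH;

    # guess() returns an encoding object on success, an error string otherwise
    my $decoder = Encode::Guess->guess( $buf );
    die "Cannot guess the encoding of '$file': $decoder\n" unless ref($decoder);
    return $decoder->name;
}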

sub open_file ($$;$) {
    my ( $file, $mode, $encoding ) = @_;
    $encoding ||= $opt{encode};    # default
    unless ( Encode::perlio_ok($encoding) ) {
        die "the target encoding '$encoding' is not supported by PerlIO\n";
    }
    open my $FH, $mode, $file or die "Cannot open '$file': $!\n";
    my @default_layers = PerlIO::get_layers($FH);

    # Caution: pop the default :crlf layer instead of only pushing :raw.
    # :raw merely disables :crlf; if we then push :crlf again to put it on
    # top of the stack, the earlier :crlf is re-enabled in place and the new
    # one is not pushed, which is not what we want.
    binmode($FH, ':pop') if $default_layers[-1] eq 'crlf';

    my $add_bom_flag = 0;
    if (   $^O eq 'MSWin32'
        && $mode eq '>'            # only play this trick when writing
        && $encoding eq 'UTF-16' )
    {
        # Encoding as plain 'UTF-16' writes a big-endian BOM and big-endian
        # data, which is inconvenient on Windows, so switch to UTF-16LE and
        # write the BOM ourselves.
        $encoding     = 'UTF-16LE';
        $add_bom_flag = 1;
    }

    my $new_layers = ':perlio:raw' . ":encoding($encoding)" . ':crlf:utf8';
    binmode($FH, $new_layers);
    # print "$_\t" foreach PerlIO::get_layers($FH);    # debug: show the layer stack
    print $FH "\x{feff}" if $add_bom_flag;    # BOM
    return $FH;
}
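
To see why the layer order matters, here is a minimal sketch that writes the same text to an in-memory filehandle through both orderings and compares the bytes. It is only an illustration under a few assumptions: it uses in-memory handles rather than real Windows files, a reasonably recent Perl, and a helper named bytes_written_via that is made up for this example and is not part of the code above.

```perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical helper: write $text to an in-memory "file" through the
# given PerlIO layers and return the resulting bytes as a hex string.
sub bytes_written_via {
    my ( $layers, $text ) = @_;
    my $buf = '';
    open my $fh, ">$layers", \$buf or die "open failed: $!";
    print {$fh} $text;
    close $fh or die "close failed: $!";    # flushes all layers into $buf
    return unpack 'H*', $buf;
}

# :crlf above :encoding: "\n" becomes "\r\n" first, then gets encoded.
my $enc_then_crlf = bytes_written_via( ':raw:encoding(UTF-16LE):crlf', "a\n" );

# :crlf below :encoding: the translation sees the encoded bytes and
# inserts a lone 0x0d byte in front of the 0x0a byte.
my $crlf_then_enc = bytes_written_via( ':raw:crlf:encoding(UTF-16LE)', "a\n" );

# Same trick as in open_file: UTF-16LE plus an explicit BOM character.
my $with_bom = bytes_written_via( ':raw:encoding(UTF-16LE):crlf', "\x{feff}a\n" );

print "encoding then crlf: $enc_then_crlf\n";    # 61000d000a00
print "crlf then encoding: $crlf_then_enc\n";    # 61000d0a00 (broken UTF-16)
print "with explicit BOM:  $with_bom\n";         # fffe61000d000a00

# The BOM lets Encode::Guess recognise the data as UTF-16 ...
my $bytes   = pack 'H*', $with_bom;
my $decoder = Encode::Guess->guess($bytes);
my $guessed = ref $decoder ? $decoder->name : "unknown ($decoder)";
print "guessed encoding:   $guessed\n";

# ... and reading back through the same ordering restores "\n".
open my $in, '<:raw:encoding(UTF-16):crlf', \$bytes or die "open failed: $!";
my $line = <$in>;
close $in;
printf "round trip ok:      %s\n", ( $line eq "a\n" ? 'yes' : 'no' );
```

The second result has an odd number of bytes, so it is no longer valid UTF-16; that is the kind of damage the default ordering can cause on Windows.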