A spelling corrector implemented in Perl, compared with the Python code

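A few days ago I saw a link in the Python forum: the author had written a simple spelling corrector in Python, and I couldn't resist rewriting it in Perl.
Pointers from the experts are welcome. I'd like to know whether it can be simplified any further.
The Python code is included in the comments.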

Original article:

http://norvig.com/spell-correct.html



return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

Use map + grep.

The forum said my post was over the length limit, so I'm posting again. This time grep solved the problem.

The Python version takes 21 lines and the Perl version comes in at 20, so I'm quite satisfied.

import re, collections
def words(text): return re.findall('[a-z]+', text.lower())
def train(features):   
    model = collections.defaultdict(lambda: 1)   
    for f in features:        
        model[f] += 1   
    return model
NWORDS = train(words(file('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):   
    n = len(word)   
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion               
           [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition               
           [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration               
           [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion
def known_edits2(word):   
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
def known(words): return set(w for w in words if w in NWORDS)
def correct(word):   
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]   
    return max(candidates, key=lambda w: NWORDS[w])
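
The listing above is Python 2: the `file(...)` builtin it uses to read big.txt no longer exists in Python 3. For anyone who wants to try the logic today, here is a minimal Python 3 sketch of the same functions. It assumes a tiny in-memory sample text (`SAMPLE_TEXT`, a hypothetical stand-in) instead of big.txt, so the corrections it finds are only as good as that sample.

```python
import collections
import re

# Hypothetical stand-in for big.txt so the sketch runs without the corpus file.
SAMPLE_TEXT = ("the spelling of a word can be corrected "
               "when the word is close to a known word")

def words(text):
    return re.findall(r"[a-z]+", text.lower())

def train(features):
    # Same smoothing as the original: unseen words default to a count of 1.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(SAMPLE_TEXT))
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    n = len(word)
    return set(
        [word[:i] + word[i + 1:] for i in range(n)]                                  # deletion
        + [word[:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(n - 1)]  # transposition
        + [word[:i] + c + word[i + 1:] for i in range(n) for c in ALPHABET]        # alteration
        + [word[:i] + c + word[i:] for i in range(n + 1) for c in ALPHABET]        # insertion
    )

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(candidates):
    return set(w for w in candidates if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=lambda w: NWORDS[w])

print(correct("speling"), correct("wrod"))
```

With the real corpus, `SAMPLE_TEXT` would be replaced by the contents of big.txt, read with `open('big.txt').read()`.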


use IO::File;
my $fh = IO::File->new('big.txt') or die;
my $words = join '', <$fh>;
sub words { return (lc shift) =~ /([a-z]+)/g;}
sub train {
    my %model;
    $model{$_} = ($model{$_} || 1)+1 for @_;
    return %model; }
my %nwords = train(words($words));
sub edits1 {
    my $word = shift;
    return ((map {(substr $word, 0, $_) . (substr $word, $_+1)} 0 .. (length($word)-1)),
            (map {(substr $word, 0, $_) . (substr $word, $_+1, 1) . (substr $word, $_, 1) . (substr $word, $_+2)} 0 .. (length($word)-2)),
            (map {my $c = $_; map {(substr $word, 0, $_) . $c . (substr $word, $_+1)} 0 .. (length($word)-1)} 'a'..'z'),
            (map {my $c = $_; map {(substr $word, 0, $_) . $c . (substr $word, $_)} 0 .. length($word)} 'a'..'z')); }
sub known_edits2 {return map {grep {exists $nwords{$_}} edits1($_)} edits1(shift)}
sub known {return grep {exists $nwords{$_}} @_}
sub correct {
    my @candidates = known(@_) ? known(@_) : known(edits1(@_)) ? known(edits1(@_)) : known_edits2(@_) ? known_edits2(@_) : @_;
    return (sort {$nwords{$b} <=> $nwords{$a}} @candidates)[0]; }


I also wrote a test, following the original article.

sub spelltest {
    my %test = @_;

    my $start    = time;
    my $n        = 0;
    my $bad        = 0;
    my $unknown = 0;

    for my $word (keys %test) {
        for my $wrong ((split ' ', $test{$word})) {
            $n += 1;
            my $w = correct($wrong);
            if ($w ne $word) {
                $bad += 1;
                $unknown += !(exists $nwords{$word});
            }
        }
    }
    my $secs = time - $start;
    my $pct = int(100 - 100 * $bad/$n);
    return "bad= $bad, unknown= $unknown, secs= $secs, pct= $pct, n= $n\n";
}


def spelltest(tests, bias=None, verbose=False):
    import time
    n, bad, unknown, start = 0, 0, 0, time.clock()
    if bias:
        for target in tests: NWORDS[target] += bias
    for target,wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            w = correct(wrong)
            if w!=target:
                bad += 1
                unknown += (target not in NWORDS)
                if verbose:
                    print 'correct(%r) => %r (%d); expected %r (%d)' % (
                        wrong, w, NWORDS[w], target, NWORDS[target])
    return dict(bad=bad, n=n, bias=bias, pct=int(100. - 100.*bad/n),
                unknown=unknown, secs=int(time.clock()-start) )


Results:

>perl -w spell.pl
bad= 68, unknown= 15, secs= 85, pct= 74, n= 270
bad= 130, unknown= 43, secs= 132, pct= 67, n= 400
>Exit code: 0

>pythonw -u "spell.py"
{'bad': 68, 'bias': None, 'unknown': 15, 'secs': 17, 'pct': 74, 'n': 270}
{'bad': 130, 'bias': None, 'unknown': 43, 'secs': 29, 'pct': 67, 'n': 400}
>Exit code: 0

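The Perl version is much slower (85 s vs. 17 s on the first test set, 132 s vs. 29 s on the second), and I don't know why.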

Python's word[0:2] slice syntax seems to be very efficient.
Does Perl have anything similar?

Also, map seems to be rather time-consuming.
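
One difference that is visible in the two listings, though not measured here: the Python edits1 returns a set, so duplicate candidates are dropped, while the Perl edits1 returns a plain list. The Perl correct also calls known(edits1(...)) and known_edits2(...) twice each in its ternary chain. Whether this accounts for the timing gap is an assumption. The sketch below only counts how many duplicates a list-based edits1 produces; `edits1_list` is a hypothetical helper name, not part of either program.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1_list(word):
    """Return every distance-1 edit as a list, duplicates included."""
    n = len(word)
    return (
        [word[:i] + word[i + 1:] for i in range(n)]                                  # deletion
        + [word[:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(n - 1)]  # transposition
        + [word[:i] + c + word[i + 1:] for i in range(n) for c in ALPHABET]        # alteration
        + [word[:i] + c + word[i:] for i in range(n + 1) for c in ALPHABET]        # insertion
    )

word = "something"
all_edits = edits1_list(word)
unique_edits = set(all_edits)

# A word of length n yields 54n + 25 edits before deduplication.
print(len(all_edits), len(unique_edits))  # 511 494
```

The saving per call is small, but known_edits2 runs edits1 again on every element of the first result, so any duplicates in the first pass are each expanded a second time.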
Bookmarked.
Actually... the Perl version could be a single line.


QUOTE:
Originally posted by redspider on 2008-6-2 17:55
Actually... the Perl version could be a single line.

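There's no need for us to pick on Python over this.

The real point is that the Python version is genuinely fast. Do you have a way to fix that?
I haven't read the code yet. I'll study it after work.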