How do I write this regular expression?

1# ustctapper posted on 2007-11-28 09:27

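I have some example sentences for learning English. Each sentence contains several words that share the same root, for example:
She swears to wear the pearls that appear to be pears.
The root is not necessarily the same from one sentence to the next.

I would like to mark every word that contains the root. How should I write this?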

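2# redicaps posted on 2007-11-28 10:48

Do you want to find all the words that contain "ear"?

my $str = "She swears to wear the pearls that appear to be pears.";
# Print every word that contains "ear", one per line.
while ($str =~ /\b(\w*ear\w*)\b/g) {
    print "$1\n";
}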

3# ustctapper posted on 2007-11-28 11:17

If it were that simple, I would not have asked. The root is different on every line.

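4# myhan posted on 2007-11-28 11:23

I don't think a regex alone can handle this, since the length of the root is not fixed either. Split the sentence into individual words first, then try substrings of varying length and see which one occurs in the most words.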
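
The approach described above could be sketched roughly as follows. This is only an illustration: the helper name guess_root, the minimum substring length of 3, and the tie-breaking rule (prefer the longer substring, then alphabetical order) are all assumptions, not something specified in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return the substring (at least $min characters long)
# that occurs in the most distinct words. Ties go to the longer substring,
# then to alphabetical order (an assumed rule).
sub guess_root {
    my ($sentence, $min) = @_;
    my %seen_word;
    my @words = grep { !$seen_word{$_}++ }
                map  { lc($_) } ($sentence =~ /[A-Za-z]+/g);

    my %words_with;    # substring => number of distinct words containing it
    for my $word (@words) {
        my %in_this_word;
        for my $len ($min .. length($word)) {
            for my $start (0 .. length($word) - $len) {
                $in_this_word{ substr($word, $start, $len) } = 1;
            }
        }
        $words_with{$_}++ for keys %in_this_word;
    }

    my ($best) = sort {
        $words_with{$b} <=> $words_with{$a}
            or length($b) <=> length($a)
            or $a cmp $b
    } keys %words_with;
    return $best;
}

print guess_root("She swears to wear the pearls that appear to be pears.", 3), "\n";
# prints: ear
```

Counting distinct words, rather than raw occurrences, keeps a repeated word from inflating a candidate's score.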

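5# ustctapper posted on 2007-11-28 11:45

Just thinking about it, it seems fairly complicated, but I keep feeling that something as powerful as Perl ought to be able to do it.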

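6# Lonki posted on 2007-11-28 12:49

Could you give a precise definition: what is the root of a sentence?

What confuses me is cases like the following.
Example 1:
a12 b12 c2

2 words end in 12
3 words end in 2

What is the root?

Example 2:
a11 b11 c12 d12

2 words end in 11
2 words end in 12

What is the root?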
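
To make the ambiguity concrete, here is a small sketch that tallies how many words end with each suffix in the two examples. The helper name suffix_counts is hypothetical, and it only looks at proper suffixes (it skips the whole word).

```perl
use strict;
use warnings;

# Hypothetical helper: count how many words end with each proper suffix.
sub suffix_counts {
    my @words = @_;
    my %count;
    for my $word (@words) {
        # substr($word, $_) is the suffix starting at offset $_.
        $count{ substr($word, $_) }++ for 1 .. length($word) - 1;
    }
    return %count;
}

my %ex1 = suffix_counts(qw(a12 b12 c2));
my %ex2 = suffix_counts(qw(a11 b11 c12 d12));
print "Example 1: 12 => $ex1{12}, 2 => $ex1{2}\n";
print "Example 2: 11 => $ex2{11}, 12 => $ex2{12}\n";
# prints:
# Example 1: 12 => 2, 2 => 3
# Example 2: 11 => 2, 12 => 2
```

In Example 1 the shorter suffix wins on count while the longer one is more specific, and in Example 2 the counts tie outright, so a frequency count alone does not settle the answer.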

7# ustctapper posted on 2007-11-28 14:26

The "root" I mean here is not a root in the proper sense; it is just a sequence of letters. For example:
9. The dust in the industrial zone frustrated the industrious man.
The root is dust or ust
10. The just budget judge just justifies the adjustment of justice.
The root is just
11. I used to abuse the unusual usage, but now I'm not used to doing so.
The root is use, with variant forms
12. The lace placed in the palace is replaced first, and displaced later.
The root is lace
13. I paced in the peaceful spacecraft.
The root is pace
14. Sir, your bird stirred my girlfriend's birthday party.
The root is ir

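8# Lonki posted on 2007-11-28 16:07

QUOTE:

Originally posted by ustctapper on 2007-11-28 14:26
The "root" I mean here is not a root in the proper sense; it is just a sequence of letters. For example:
9. The dust in the industrial zone frustrated the industrious man.
The root is dust or ust
10. The just budget judge just justifies the adjustmen ...

# The root you are referring to is presumably the root in the sense of English linguistics.
# From a program's point of view a root is a fuzzy notion, because it is not simply a common substring.
# Roots change form: use is the root of used, but can we say that use is the root of every word matching use[a-z]?
# I think you need to supply a list, or some rules that a valid root must follow, before this can be turned into a program.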
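
If such a list is available, marking the words becomes straightforward. The following is a minimal sketch under that assumption: the helper name mark_words and the choice of square brackets as the marker are hypothetical, and the roots are treated as literal letter sequences.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every word that contains one of the given
# letter sequences in square brackets.
sub mark_words {
    my ($sentence, @roots) = @_;
    return $sentence unless @roots;    # nothing to look for

    # Escape each root so it is matched literally, then join as alternatives.
    my $alternation = join '|', map { quotemeta($_) } @roots;
    (my $marked = $sentence) =~ s/\b(\w*(?:$alternation)\w*)\b/[$1]/gi;
    return $marked;
}

print mark_words(
    "The lace placed in the palace is replaced first, and displaced later.",
    "lace"
), "\n";
# prints: The [lace] [placed] in the [palace] is [replaced] first, and [displaced] later.
```

Variant forms can be covered by passing several sequences, one per variant.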

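9# zhasm posted on 2007-11-29 09:02

I wrote a piece of code that takes every contiguous substring of every word and computes its match weight across the whole line. The candidates the program turns up make it easier to spot the root. If several candidates have the same weight, though, you can only pick one by eye. The program is just an aid.

She swears to wear the pearls that appear to be pears.
In this line, taking the word swears as an example, the program checks swears, swear, wears, swea, wear, ears, swe, wea, ear and ars in turn.
Why is the minimum length 3? That is just my own setting. If you insist, setting it to 2 works too.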
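
As a small illustration of the enumeration described above, this sketch lists the substrings of the word swears, longest first, down to a minimum length of 3. The helper name substrings is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: list every contiguous substring of $word that is at
# least $min characters long, longest first and left to right.
sub substrings {
    my ($word, $min) = @_;
    my @subs;
    for (my $len = length($word); $len >= $min; $len--) {
        push @subs, substr($word, $_, $len) for 0 .. length($word) - $len;
    }
    return @subs;
}

print join(", ", substrings("swears", 3)), "\n";
# prints: swears, swear, wears, swea, wear, ears, swe, wea, ear, ars
```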

Here is what the program below prints on my machine:
---------------------------
the line is :She swears to wear the pearls that appear to be pears.

ears:2; wea:2; wear:2; pea:6; ear:20; ars:2; pear:6;

To run the program (the script is x.pl and the file of sentences to process is root.txt):
./x.pl root.txt

CODE:

#!/usr/bin/perl
use strict;
use warnings;

# Read one sentence per record: each record ends with a period and a newline.
$/ = ".\n";

while (my $line = <>) {
    print "---------------------------\n";
    print "the line is :$line\n";

    # Save all the words, in lower case, into an array.
    my @words = map { lc($_) } ($line =~ /\w+/g);
    my %count;

    foreach my $item (@words) {
        my $len = length($item);
        # Try every substring of $item, longest first, down to 3 characters.
        for (my $matchlen = $len; $matchlen >= 3; $matchlen--) {
            for (my $i = 0; $i <= $len - $matchlen; $i++) {
                my $matchstr = substr($item, $i, $matchlen);
                foreach my $other (@words) {
                    # A word cannot match against itself.
                    next if $item eq $other;
                    # If another word contains the substring, record it.
                    $count{$matchstr}++ if index($other, $matchstr) >= 0;
                }
            }
        }
    }

    # Print all the successful match records (hash order is unspecified).
    foreach my $key (keys %count) {
        print "$key:$count{$key};\t";
    }
    print "\n";
}

10# zhasm posted on 2007-11-29 09:25

If root.txt contains several lines, for example:
She swears to wear the pearls that appear to be pears.
i google and doogle sdffaskj bloookgle.
.............

then the program prints:
---------------------------
the line is :She swears to wear the pearls that appear to be pears.

ears:2; wea:2; wear:2; pea:6; ear:20; ars:2; pear:6;
---------------------------
the line is :i google and doogle sdffaskj bloookgle.

oogl:2; ogl:2; oog:2; ogle:2; oogle:2; gle:6;
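
Since the candidates come out in no particular order, one option is to sort them by weight before printing. The sketch below uses the weights from the first sentence's output above. The tie-breaking rule (longer candidate first, then alphabetical order) is an assumption, and equal weights still need a human decision.

```perl
use strict;
use warnings;

# Weights copied from the output for the first example sentence.
my %weight = (
    ears => 2, wea => 2, wear => 2, pea => 6,
    ear  => 20, ars => 2, pear => 6,
);

# Highest weight first; ties go to the longer candidate, then alphabetically
# (an assumed rule, not part of the original program).
my @ranked = sort {
    $weight{$b} <=> $weight{$a}
        or length($b) <=> length($a)
        or $a cmp $b
} keys %weight;

print join(" ", map { sprintf "%s:%d;", $_, $weight{$_} } @ranked), "\n";
# prints: ear:20; pear:6; pea:6; ears:2; wear:2; ars:2; wea:2;
```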