How to extract the text displayed on a web page from its HTML source

I want to extract everything that is visible when the page is viewed in IE, but without any markup such as "<></>....".

For example:          <table border="0" cellspacing="3" cellpadding="1">
                <tr><td>Mapped EST Accession:</td><td><b>BE399426</b> &nbsp&nbsp&nbsp&nbsp[<a href=http://www.graingenes.org/cgi-bin/WebAce/webace?db=graingenes&class=Probe&object=BE399426>GrainGenes</a> &nbsp&nbsp|&nbsp&nbsp<a href=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=search&db=nucest&term=BE399426>NCBI</a> &nbsp&nbsp|&nbsp&nbsp<a href=http://wheat.pw.usda.gov/cgi-bin/westsql/est_blast.cgi?q=BE399426&t=a>wEST-SQL</a>]&nbsp&nbsp&nbsp&nbsp <font color="red">Sequence Tagged Site in Relevant Diploid</font></td></tr>
                <tr><td>Orthologous Loci by Contig:</td><td><a href="/snpworld/Search?contigName=NSFT03P2_Contig14337&chromosome=2&genome=D">NSFT03P2_Contig14337</a></td></tr>           
                <tr><td>Bin:</td><td><a href="/snpworld/Search?bin=2DL3-0.49-0.76">2DL3-0.49-0.76</a></td></tr>
                <tr><td>Forward Primer Name:</td><td><a href="/snpworld/Search?primer=BE399426_cpF1&chromosome=2&genome=D">BE399426_cpF1</a></td></tr>
                <tr><td>Reverse Primer Name:</td><td><a href="/snpworld/Search?primer=BE399426_cpR1&chromosome=2&genome=D">BE399426_cpR1</a></td></tr>
                <tr><td>Chromosome/Genome:</td><td>2D</td></tr>
                <tr><td>Ref Plant:</td><td>Ae. tauschii, Armenia (At01, D) </td></tr>
                <tr><td valign="middle">Ref Sequence:</td><td><pre>        10        20        30        40        50
TTTGGAAATATCCTGTTACTGCTGCTGATGCATTCTTATTTTTTTTTCAT
GTATGATCTCCAGGCTGTTCGAGTTGGGGACTTAGAAGTGTTTAGAGCTG
TTGCAGAGAAATTTGGGAGCACTTTCAGTGCCGACAGGACATCCAATTTG
ATCGTGAGGCTGCGCCACAACGTCATCCGGACCGGACTACGCAACATTAG
CATTTCCTACTCACGTATCTCCCTTGCTGACATTGCCAAGAAACTGAGGC
TAGATACTAAGACCGCTGTTGCTGATGCTGAGAGCATTGTAGCCAAGGCC
ATCAGAGATGGGGCAATTGATGCCACCATTGATCATGCCAATGGCTGGGT
GGTGTCGAAAGAGACTGGCGACGTTTACTCAACAAACGAGCCACAGGCTG
CGTTTAACTCCAGGATTGCGTTCTGCCTGAACATGCACAACGAGGCAGTCAAGGCTCTGAGGTTCCCCCCGAATTCTCACAAGGAAAA [488 bases] </pre></td></tr>
                <tr><td>Exon Ranges:</td><td>64-488</td></tr>
                <tr><td>Intron Ranges:</td><td>1-63</td></tr>
                <tr><td>Lab:</td><td>UCD</td></tr>
               
            </table>

In other words, I want to pull out just the parts highlighted in red.
1) Extract the text yourself with regular expressions
2) Use a module such as HTML::Parser
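
For option 2, here is a minimal sketch of what the HTML::Parser route might look like. It assumes the module is installed from CPAN (it is not part of core Perl), and `html_to_text` is just a hypothetical helper name chosen for this example. The parser calls the text handler for every piece of text between tags, so no tag-matching regex is needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();   # CPAN module, assumed to be installed

# html_to_text is a hypothetical helper name, not part of HTML::Parser.
sub html_to_text {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the handler the text with entities already decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    # Skip the contents of elements that a browser never shows as text.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;      # signal end of input so buffered text is flushed
    return $text;
}

print html_to_text('<tr><td>Lab:</td><td>UCD</td></tr>'), "\n";   # prints "Lab:UCD"
```

Note that the text of adjacent cells runs together, because nothing is emitted for the tags themselves. If you need separators, you would have to add start or end handlers for `td` and `tr`.
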
I think this script will basically do the job; give it a try.

CODE:
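#!/usr/bin/perl
use strict;
use warnings;

# Paths are the ones from the original post; adjust them to your files.
open(my $in,  '<', 'd:\\c.txt') or die "Cannot open d:\\c.txt: $!";
open(my $out, '>', 'd:\\d.txt') or die "Cannot open d:\\d.txt: $!";

my $text = '';
while (my $line = <$in>) {
    $line =~ s/(\w)\n/$1/;       # join a line ending in a word character to the next line
    $text .= $line;
}
close($in);

$text =~ s/<[^>]+>//g;           # strip tags, including tags that span several lines
$text =~ s/&nbsp;?/ /g;          # &nbsp (with or without ';') becomes a plain space
$text =~ s/\t+/\t/g;             # collapse runs of tabs into a single tab

print {$out} $text;
close($out) or die "Cannot close d:\\d.txt: $!";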
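
One thing to watch with the script above: stripping the tags leaves nothing between `</td>` and the next `<td>`, so a label and its value end up glued together (for example "Lab:UCD"). The sketch below is one possible way around that, under the assumption that a tab between cells and a newline after each row is the layout you want. `table_to_text` is a hypothetical helper name, and the regexes are still naive: they will not cope with a `>` inside an attribute value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# table_to_text is a hypothetical helper; the tab/newline layout is an assumption.
sub table_to_text {
    my ($html) = @_;
    $html =~ s{</t[dh]\s*>}{\t}gi;   # end of a cell becomes a tab
    $html =~ s{</tr\s*>}{\n}gi;      # end of a row becomes a newline
    $html =~ s/<[^>]+>//g;           # strip all remaining tags
    $html =~ s/&nbsp;?/ /g;          # &nbsp, with or without ';', becomes a space
    $html =~ s/[ \t]+\n/\n/g;        # drop trailing blanks left at the end of each row
    return $html;
}

# One row taken from the sample table in the question.
print table_to_text(
    '<tr><td>Bin:</td><td><a href="/snpworld/Search?bin=2DL3-0.49-0.76">2DL3-0.49-0.76</a></td></tr>'
);
# prints "Bin:", a tab, "2DL3-0.49-0.76", then a newline
```

The cell and row substitutions have to run before the generic tag stripping, otherwise the closing tags are already gone by the time you try to turn them into separators.
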