[Help] Programming help: a fairly challenging problem

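If you are familiar with biology, read part A; otherwise read part B. Both describe the same problem.
A.
I have a large number of sequences in multi-sequence FASTA format and need to process them as follows:
Problem 1: Remove sequences shorter than a given length
How can I remove the sequences that are shorter than 50 bases?
Problem 2: Extract a motif together with a fixed length of upstream and downstream sequence (motif included)
    Suppose that, for every sequence, I want to extract each occurrence of the motif "AAAA" plus the 10 bases upstream and the 10 bases downstream of it, and also output the position information. How can this be programmed?

B.
Problem 1: Remove strings shorter than a given length
How can I remove the strings that are shorter than 50 characters?
Problem 2: Extract a character pattern together with a fixed length of text before and after it (pattern included)
    Suppose that, for every string, I want to extract each occurrence of the pattern "AAAA" plus the 10 characters before and the 10 characters after it, and also output the position information. How can this be programmed?
(Each record starts with ">", followed by the string's name and related attributes; the characters to be processed start on the next line.)

A sample of the sequence file to work with: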
>Y64G10A.7|Y64G10A.7|14302238|14319530|IV
ATATTTCAATTTCAAAATTAAATTTATTTTTATTTTCAACAATTCAATCTCTTGCCCAAA
TCTTTCTTCCCCTCATTTCTCATGTAAGAATCTCATTTTTTGACAGTTAGCCTAGTTATG
TTACAGTTGATTAATTTAAATACTTTTACTGTTTCAGTACTAAAATTTATAAAAATCCGA
GTGTCTTTTTCTTAATGAAATCAGTTTCAATTATTCCAAGCCATAATAACTTTCCATGTT
TTCACACAGGTCTATGTGTTTCATTAAATAAATTT
>Y53F4B.31|gst-28|15166113|15167310|II
ATCGAAAATTTTCTTGAAATTTTGTTCAATGAAGTAATATAATTACAAATTCGAATAAAT
>ZK792.8|atg-4.2|11668005|11671460|IV
TTATTGACAATTTTTGAAATTCTGTACACTGAACATTTATTGGTGCTTGATATGTATTGA
TGACGCAATGTGAACATTATTAAATAAAA
>Y69H2.8|nhr-241|18699416|18700957|V
AGTCATGTTGCCAGGCACAGGCAAATTGTTT
>Y71F9AL.13|rpl-1|2875396|2876313|I
GGTAAAATGTCGAAGGTTTCCCGCGAGTCTCTTAATGAGGCTATCGCTGAGGTCCTTAAG
GGATCCTCGGAAAAGCCACGCAAGTTCCGTGAGACTATCGAGCTCCAGATCGGTCTCAAG
AACTACGACCCACAGAAGGACAAGCGTTTCAGCGGATCCATCAGGTAGCTACGACCTGTG
AAAACAGGGAGTTGGGACTTTCAAAGCCTAACGGACCCCCACCCAGGGTCTTAAATTTTT
GGGGAGGCAATTATTGCCGCTTTGAAATAATGATATGAGCTGGTATAATACTGAAGCACA
TTCCACGCCCAAACATGAAGGTGTGCGTCTTCGGAGACCAGCACCATCTTGACGAGGCTG
CCGCCGGAGACATTCCATCGATGAGCGCCGATGACTTGAAGAAGCTTAACAAGCAGAAGA
AGCTCATCAAGAAGCTCGCCAAGAGCTACGATGCTTTCATCGCTTCCGAATCCTTGATCA
AGCAGATCCCACGTATCCTCGGTCCAGGACTCAACAAGGCCGGAAAATTCCCTTCCGTCG
TTACCCACGGAGAATCTCTCCAAAGCAAGAGTGACGAGATCCGCGCCACCGTCAAGTTCC
AGATGAAGAAG
>Y47G6A.4|Y47G6A.4|3547093|3552007|I
ACGGAAACCCCTCACACATTCATTTCATGGTATTAAATATAAGCATTCTAAAATCCTAAT
TGTACTAATAGTATTATACATGATATACTCTGTCTGTCTGCTTGGTACAAATTTACTTCA
TTCTCAATTTTCAAAACAATAAATCATATATTCTTT
>Y17G9B.4|Y17G9B.4|4752110|4753154|IV
AATAAATGTGTTTTCTTATTTAAAATATTATTCTGTTTCGAG
>Y65B4BR.5|Y65B4BR.5|535792|536676|I
CTTGTTTCCTGATGACCTTGCAGATACTCTTGT
>Y47D3B.11|Y47D3B.11|11456758|11462763|III
AAAATTAGCCAGAATACTCTATCAAAAAATATCATATCACACCTTTTCTTCTCCCAGTTT
TTCTCTTTTTGTCGTGCCTTGCTCGTCATCTCTCCAACCCCGATCTCTGTATTTTTCTCC
ACGTATGACTGCGCCTTGTAGGCACGCAGGTAGGCATTTTTGTGCCTACGTGGATTAATT
GCCTAAATTGTCTTAAAATGCTTAGTGTTTTTCAGGTGTCAAACTTTCCTCCCCATGTGT
ACCCAACGGGTCAATCACATTCTCTTCTTGCATATGAATACATTTTCATTTTTCTGTTGT
TTTTTAGTTTTTTTTTAGACTACCATTTTTTTTTTAACTTTCTAGAAGTTTCTAGTAATA
A

Any help would be much appreciated. Thanks!

1. perl -ne 'print if /^>/ || /.{50}/' urfile
2.

CODE:
#! /usr/bin/perl

use warnings;
use strict;

while(<DATA>)
{
        if(/^>/)
        {
                print;
                next;
        }
        chomp;
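        # /A(?=AAA)/ consumes one "A" and only looks ahead at the next three,
        # so overlapping occurrences of AAAA are all found.
        while(/A(?=AAA)/g)
        {
                # $+[0] is the offset just past the matched "A" (motif start + 1).
                if($+[0] < 11)
                {
                        # Fewer than 10 characters upstream: start at the beginning of the line.
                        print substr($_, 0, $+[0] + 13) . "\n";
                }
                else
                {
                        print substr($_, $+[0] - 11,  24) . "\n";
                }
        }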
}


__END__
>Y64G10A.7|Y64G10A.7|14302238|14319530|IV
ATATTTCAATTTCAAAATTAAATTTATTTTTATTTTCAACAATTCAATCTCTTGCCCAAATCTTTCTTCCCCTCATTTCTCATGTAAGAATCTCATTTTTTGACAGTTAGCCTAGTTATGTTACAGTTGATTAATTTAAATACTTTTACTGTTTCAGTACTAAAATTTATAAAAATCCGAGTGTCTTTTTCTTAATGAAATCAGTTTCAATTATTCCAAGCCATAATAACTTTCCATGTTTTCACACAGGTCTATGTGTTTCATTAAATAAATTT
>Y53F4B.31|gst-28|15166113|15167310|II
ATCGAAAATTTTCTTGAAATTTTGTTCAATGAAGTAATATAATTACAAATTCGAATAAAT
>ZK792.8|atg-4.2|11668005|11671460|IV
TTATTGACAATTTTTGAAATTCTGTACACTGAACATTTATTGGTGCTTGATATGTATTGATGACGCAATGTGAACATTATTAAATAAAA
>Y69H2.8|nhr-241|18699416|18700957|V
AGTCATGTTGCCAGGCACAGGCAAATTGTTT
>Y71F9AL.13|rpl-1|2875396|2876313|I
GGTAAAATGTCGAAGGTTTCCCGCGAGTCTCTTAATGAGGCTATCGCTGAGGTCCTTAAGGGATCCTCGGAAAAGCCACGCAAGTTCCGTGAGACTATCGAGCTCCAGATCGGTCTCAAGAACTACGACCCACAGAAGGACAAGCGTTTCAGCGGATCCATCAGGTAGCTACGACCTGTGAAAACAGGGAGTTGGGACTTTCAAAGCCTAACGGACCCCCACCCAGGGTCTTAAATTTTTGGGGAGGCAATTATTGCCGCTTTGAAATAATGATATGAGCTGGTATAATACTGAAGCACATTCCACGCCCAAACATGAAGGTGTGCGTCTTCGGAGACCAGCACCATCTTGACGAGGCTGCCGCCGGAGACATTCCATCGATGAGCGCCGATGACTTGAAGAAGCTTAACAAGCAGAAGAAGCTCATCAAGAAGCTCGCCAAGAGCTACGATGCTTTCATCGCTTCCGAATCCTTGATCAAGCAGATCCCACGTATCCTCGGTCCAGGACTCAACAAGGCCGGAAAATTCCCTTCCGTCGTTACCCACGGAGAATCTCTCCAAAGCAAGAGTGACGAGATCCGCGCCACCGTCAAGTTCCAGATGAAGAAG
>Y47G6A.4|Y47G6A.4|3547093|3552007|I
ACGGAAACCCCTCACACATTCATTTCATGGTATTAAATATAAGCATTCTAAAATCCTAATTGTACTAATAGTATTATACATGATATACTCTGTCTGTCTGCTTGGTACAAATTTACTTCATTCTCAATTTTCAAAACAATAAATCATATATTCTTT
>Y17G9B.4|Y17G9B.4|4752110|4753154|IV
AATAAATGTGTTTTCTTATTTAAAATATTATTCTGTTTCGAG
>Y65B4BR.5|Y65B4BR.5|535792|536676|I
CTTGTTTCCTGATGACCTTGCAGATACTCTTGT
>Y47D3B.11|Y47D3B.11|11456758|11462763|III
AAAATTAGCCAGAATACTCTATCAAAAAATATCATATCACACCTTTTCTTCTCCCAGTTTTTCTCTTTTTGTCGTGCCTTGCTCGTCATCTCTCCAACCCCGATCTCTGTATTTTTCTCCACGTATGACTGCGCCTTGTAGGCACGCAGGTAGGCATTTTTGTGCCTACGTGGATTAATTGCCTAAATTGTCTTAAAATGCTTAGTGTTTTTCAGGTGTCAAACTTTCCTCCCCATGTGTACCCAACGGGTCAATCACATTCTCTTCTTGCATATGAATACATTTTCATTTTTCTGTTGTTTTTTAGTTTTTTTTTAGACTACCATTTTTTTTTTAACTTTCTAGAAGTTTCTAGTAATAA



QUOTE:
Originally posted by ly5066113 on 2008-11-12 22:52
1. perl -ne 'print if /^>/ || /.{50}/' urfile

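Impressive, thanks ly5066113! The output is shown below, but how can I get rid of the parts highlighted in red? Also, could you explain how the program works? Thanks!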
>Y64G10A.7|Y64G10A.7|14302238|14319530|IV
ATATTTCAATTTCAAAATTAAATTTATTTTTATTTTCAACAATTCAATCTCTTGCCCAAA
TCTTTCTTCCCCTCATTTCTCATGTAAGAATCTCATTTTTTGACAGTTAGCCTAGTTATG
TTACAGTTGATTAATTTAAATACTTTTACTGTTTCAGTACTAAAATTTATAAAAATCCGA
GTGTCTTTTTCTTAATGAAATCAGTTTCAATTATTCCAAGCCATAATAACTTTCCATGTT
>Y53F4B.31|gst-28|15166113|15167310|II
ATCGAAAATTTTCTTGAAATTTTGTTCAATGAAGTAATATAATTACAAATTCGAATAAAT
>ZK792.8|atg-4.2|11668005|11671460|IV
TTATTGACAATTTTTGAAATTCTGTACACTGAACATTTATTGGTGCTTGATATGTATTGA
>Y69H2.8|nhr-241|18699416|18700957|V
>Y71F9AL.13|rpl-1|2875396|2876313|I
GGTAAAATGTCGAAGGTTTCCCGCGAGTCTCTTAATGAGGCTATCGCTGAGGTCCTTAAG
GGATCCTCGGAAAAGCCACGCAAGTTCCGTGAGACTATCGAGCTCCAGATCGGTCTCAAG
AACTACGACCCACAGAAGGACAAGCGTTTCAGCGGATCCATCAGGTAGCTACGACCTGTG
AAAACAGGGAGTTGGGACTTTCAAAGCCTAACGGACCCCCACCCAGGGTCTTAAATTTTT
GGGGAGGCAATTATTGCCGCTTTGAAATAATGATATGAGCTGGTATAATACTGAAGCACA
TTCCACGCCCAAACATGAAGGTGTGCGTCTTCGGAGACCAGCACCATCTTGACGAGGCTG
CCGCCGGAGACATTCCATCGATGAGCGCCGATGACTTGAAGAAGCTTAACAAGCAGAAGA
AGCTCATCAAGAAGCTCGCCAAGAGCTACGATGCTTTCATCGCTTCCGAATCCTTGATCA
AGCAGATCCCACGTATCCTCGGTCCAGGACTCAACAAGGCCGGAAAATTCCCTTCCGTCG
TTACCCACGGAGAATCTCTCCAAAGCAAGAGTGACGAGATCCGCGCCACCGTCAAGTTCC
>Y47G6A.4|Y47G6A.4|3547093|3552007|I
ACGGAAACCCCTCACACATTCATTTCATGGTATTAAATATAAGCATTCTAAAATCCTAAT
TGTACTAATAGTATTATACATGATATACTCTGTCTGTCTGCTTGGTACAAATTTACTTCA
>Y17G9B.4|Y17G9B.4|4752110|4753154|IV
>Y65B4BR.5|Y65B4BR.5|535792|536676|I
>Y47D3B.11|Y47D3B.11|11456758|11462763|III
AAAATTAGCCAGAATACTCTATCAAAAAATATCATATCACACCTTTTCTTCTCCCAGTTT
TTCTCTTTTTGTCGTGCCTTGCTCGTCATCTCTCCAACCCCGATCTCTGTATTTTTCTCC
ACGTATGACTGCGCCTTGTAGGCACGCAGGTAGGCATTTTTGTGCCTACGTGGATTAATT
GCCTAAATTGTCTTAAAATGCTTAGTGTTTTTCAGGTGTCAAACTTTCCTCCCCATGTGT
ACCCAACGGGTCAATCACATTCTCTTCTTGCATATGAATACATTTTCATTTTTCTGTTGT
TTTTTAGTTTTTTTTTAGACTACCATTTTTTTTTTAACTTTCTAGAAGTTTCTAGTAATA
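
The output above suggests why some headers are left without any sequence. With `-n`, Perl applies the test to each input line separately, so `/^>/ || /.{50}/` prints every header line plus every line containing at least 50 characters. The short last line of a wrapped sequence is dropped even when the whole sequence is long, and the header of a short sequence is still printed. A minimal sketch of a record-level filter follows. It is written in Python rather than Perl purely for illustration, and the names `read_fasta` and `filter_by_length` as well as the 60-column rewrap width are assumptions, not something from this thread.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(handle, min_len=50, width=60):
    """Return FASTA text holding only records with at least min_len characters."""
    out = []
    for header, seq in read_fasta(handle):
        if len(seq) < min_len:
            continue  # drop the header together with its short sequence
        out.append(header)
        # Rewrap the kept sequence at `width` columns (assumed layout).
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out)


# Two records from the sample file: one with 31 bases, one with 89 bases.
sample = io.StringIO(
    ">Y69H2.8|nhr-241|18699416|18700957|V\n"
    "AGTCATGTTGCCAGGCACAGGCAAATTGTTT\n"
    ">ZK792.8|atg-4.2|11668005|11671460|IV\n"
    "TTATTGACAATTTTTGAAATTCTGTACACTGAACATTTATTGGTGCTTGATATGTATTGA\n"
    "TGACGCAATGTGAACATTATTAAATAAAA\n"
)
print(filter_by_length(sample))
```

Here the length test is applied to the joined sequence of each record, so the Y69H2.8 record disappears entirely while both lines of ZK792.8 are kept.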


QUOTE:
Originally posted by ly5066113 on 2008-11-12 23:10
2.
#! /usr/bin/perl

use warnings;
use strict;

while(<DATA>)
{
        if(/^>/)
        {
                print;
                next;
        }
        chomp;
        while(/A(?=AAA)/g)
        {
                if($+[0] < 11)
                {
                        print substr($_, 0, $+[0] + 13) ...

Patterns located near the end of a line are not output as required. For example, the first record gives:
>Y64G10A.7|Y64G10A.7|14302238|14319530|IV
TTTCAATTTCAAAATTAAATTTAT
TTTCAGTACTAAAATTTATAAAAA
TAAAATTTATAAAAATCCGA
AAAATTTATAAAAATCCGA

The result should actually be:
   >Y64G10A.7|Y64G10A.7|14302238|14319530|IV
   4TTTCAATTTCAAAATTAAATTTAT27
   153TTTCAGTACTAAAATTTATAAAAA176
   162TAAAATTTATAAAAATCCGAGTGT185
   163AAAATTTATAAAAATCCGAGTGTC186
In addition, the program also needs to match patterns that span a line break, for example:
ATATTTCAATTTCAAAATTAAATTTATTTTTATTTTCAACAATTCAATCTCTTGCCCAAA
ACTTTCTTCCCCTCATTTCTCATGTAAGAATCTCATTTTTTGACAGTTAGCCTAGTTATG
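
One way to cover both requirements is to join the wrapped lines of a record into a single string before searching, and to report positions relative to that joined string. The sketch below is in Python rather than Perl purely for illustration. `motif_windows` and its parameters are hypothetical names, positions are 1-based and inclusive as in the first expected line above, and windows are simply clipped at the ends of the sequence, since the thread does not say what should happen there.

```python
import re


def motif_windows(seq, motif="AAAA", flank=10):
    """Return (start, fragment, end) for every occurrence of motif in seq.

    start and end are the 1-based, inclusive positions of the fragment.
    """
    hits = []
    # A zero-width lookahead consumes nothing, so overlapping occurrences
    # (for example the two hits inside "AAAAA") are all reported.
    for m in re.finditer("(?=" + re.escape(motif) + ")", seq):
        lo = max(m.start() - flank, 0)
        hi = min(m.start() + len(motif) + flank, len(seq))
        hits.append((lo + 1, seq[lo:hi], hi))
    return hits


# The two-line example above: "AAA" ends the first line and "A" starts the second.
lines = [
    "ATATTTCAATTTCAAAATTAAATTTATTTTTATTTTCAACAATTCAATCTCTTGCCCAAA",
    "ACTTTCTTCCCCTCATTTCTCATGTAAGAATCTCATTTTTTGACAGTTAGCCTAGTTATG",
]
seq = "".join(lines)  # join wrapped lines so a motif can span the break
for start, fragment, end in motif_windows(seq):
    print(f"{start}{fragment}{end}")
```

For this example the sketch reports the hit starting at position 14 and also the one formed across the line break. One thing worth double-checking: by my count, consistent 1-based numbering gives 152..175, 161..184 and 162..185 for the later windows of the first record, one lower than the hand-written figures above, while the first window (4..27) agrees.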