How to find containment relationships among sequences!
choose2005 posted on 2008-05-23 15:28
I have a file like this:
It contains a series of sequences, for example:
AAAAAAAAAGATTGAG SEQ.1893836.16.1
AAAAAAAAAGATTGAGACGAATT SEQ.1549482.23.1
AAAAAAAAAGATTGAGCCGAA SEQ.2553655.21.1
AAAAAAAAAGATTGAGCCGAAA SEQ.2706125.22.3
AAAAAAAAAGATTGAGCCGAAT SEQ.1372232.22.9
AAAAAAAAAGATTGAGCCGAATA SEQ.1365232.23.4
AAAAAAAAAGATTGAGCCGAGTTA SEQ.889399.24.1
AAAAAAAAAGATTGAGCCGCAAAA SEQ.1263859.24.1
AAAAAAAAAGATTGAGTCGAATA SEQ.970919.23.1
AAAAAAAAAGATTGTCACAACT SEQ.142122.22.1
AAAAAAAAAGATTTGCGAAAACT SEQ.785497.23.1
AAAAAAAAAGCAAAGAACAAAAAG SEQ.926879.24.1
AAAAAAAAAGCTCTTCAGGAGGG SEQ.50491.23.1
AAAAAAAAAGTGATACTTCTTGTT SEQ.1347612.24.1
AAAAAAAAATAAAAAACAAGA SEQ.1258876.21.1
AAAAAAAAATAAAGCAAGAA SEQ.2556803.20.1
AAAAAAAAATACGAACCGTTCTTT SEQ.1157920.24.1
AAAAAAAAATAGCGTGGGAATGAG SEQ.1457720.24.1
AAAAAAAAATAGGAAAAAA SEQ.1060112.19.1
AAAAAAAAATAGGCTTATCAACCA SEQ.1528036.24.1
AAAAAAAAATCATCATACAGGGT SEQ.556325.23.1
AAAAAAAAATCGACGGCTAGATA SEQ.1452090.23.1
AAAAAAAAATGCAGTCTTGTCAA SEQ.2525543.23.1
AAAAAAAAATGCATTGATCAT SEQ.1264660.21.1
AAAAAAAAATGTTCTCATGGTT SEQ.2274085.22.1
AAAAAAAAATTAAACGGGCCGTGC SEQ.880386.24.1
AAAAAAAAATTCGATATAGA SEQ.2571452.20.1
AAAAAAAACACCGAGCTATCGGCT SEQ.1882674.24.1
AAAAAAAACACTCAGGATTCAGGA SEQ.2492497.24.1
AAAAAAAACACTCTGCAACAGATG SEQ.1064.24.1
AAAAAAAACACTTTCGGCGCCAAC SEQ.1954907.24.1
AAAAAAAACATGTAGTCTT SEQ.2131152.19.1
AAAAAAAACCCGACTCGGGCACACA SEQ.1023067.25.1
AAAAAAAACCGGACAAATCA SEQ.2385057.20.1
AAAAAAAACCGGTCCGTTTGGATC SEQ.1419987.24.1
AAAAAAAACCGTTTAGATACATGT SEQ.1275097.24.1
AAAAAAAACCTACTTGAGCTACGC SEQ.2530317.24.1
AAAAAAAACGCCTTCAGCAGGGCC SEQ.1595165.24.1
AAAAAAAACGTCTAATACATGTAA SEQ.1756499.24.1
AAAAAAAACGTCTAGATACATAGA SEQ.155114.24.1
AAAAAAAACGTCTAGATACATGTA SEQ.316197.24.1
AAAAAAAACGTGGGATGCTA SEQ.1811897.20.1
AAAAAAAACGTTTAGATACATGTA SEQ.901852.24.1
AAAAAAAACGTTTGGCAATCGAAA SEQ.2696829.24.1
AAAAAAAACTAAAAATAAAAGA SEQ.2667333.22.1
AAAAAAAACTCGGTCGCAGGCCTG SEQ.1867874.24.1
AAAAAAAACTGCTCCAACGGCATA SEQ.1044149.24.2
AAAAAAAAGACACACGAGGTTGGT SEQ.1133483.24.1
AAAAAAAAGATACATTATCAT SEQ.1951462.21.1
AAAAAAAAGATCAGAGAT SEQ.2455401.18.1
AAAAAAAAGATTGAGCCGAA SEQ.626739.20.4
AAAAAAAAGATTGAGCCGAAACT SEQ.854926.23.1
AAAAAAAAGATTGAGCCGAAAG SEQ.1664333.22.1
AAAAAAAAGATTGAGCCGAAT SEQ.2647549.21.4
AAAAAAAAGATTGAGCCGAATA SEQ.822665.22.3
AAAAAAAAGCACAACTCGGCTGCT SEQ.73260.24.1
AAAAAAAAGCTGTGCCCTAGCTGG SEQ.375077.24.1
AAAAAAAAGGAACAGTGACATATT SEQ.226404.24.1
AAAAAAAAGGACAAATGCAGTGCA SEQ.1071214.24.1
AAAAAAAAGGTATGTAGTGTGCCT SEQ.2467414.24.1
AAAAAAAAGTAAAAAAAAAAAA SEQ.1895978.22.1
AAAAAAAAGTAAATTACAGATAAC SEQ.2427386.24.1
AAAAAAAAGTCAATTCGTTCGA SEQ.2234367.22.1
AAAAAAAAGTCACATCCGA SEQ.2424071.19.1
AAAAAAAAGTGTCTAGATACACGT SEQ.2033632.24.1
AAAAAAAAGTGTCTCTTGGGGACT SEQ.2382025.24.1
AAAAAAAAGTTTAGATACATGTAAT SEQ.1263057.25.1
AAAAAAAATATGGAACATAGGGAGT SEQ.1391063.25.1
AAAAAAAATATGTGGAAA SEQ.2506062.18.1
AAAAAAAATCCAGATGTTGATGA SEQ.360584.23.1
AAAAAAAATCCTTTAGGAACG SEQ.2306220.21.1
AAAAAAAATCGTCCATTGTACC SEQ.184068.22.1
AAAAAAAATCTAGATGTTGGTCAG SEQ.2464218.24.1
AAAAAAAATCTGGATTCGTGATTT SEQ.1703616.24.1
AAAAAAAATGACATGTGAGTCCAA SEQ.492275.24.1
AAAAAAAATGATCGGTATCGGTCG SEQ.1784908.24.1
AAAAAAAATGCATTGAAAAA SEQ.1196651.20.1
AAAAAAAATGCATTGATCA SEQ.226298.19.1
AAAAAAAATGCATTGATCATA SEQ.1992480.21.1
AAAAAAAATGGACAAGATTTAAAT SEQ.2627544.24.1
AAAAAAAATTAACCAAATCGCATG SEQ.1340145.24.1
AAAAAAAATTCTATCCATTCGGAT SEQ.564597.24.1
AAAAAAAATTCTGCGATGAGAAGT SEQ.515474.24.1
CAAGGTCTGAAATCCTTAGTA SEQ.1147435.21.1
CAAGGTCTTCATGCCGGCTTA SEQ.263499.21.1
CAAGGTGAACAGCCTCT SEQ.110162.17.6
CAAGGTGAACAGCCTCTG SEQ.842849.18.5
CAAGGTGAACAGCCTCTGCTGT SEQ.841302.22.1
CAAGGTGAACAGCCTCTGG SEQ.1427456.19.2
CAAGGTGAACAGCCTCTGGC SEQ.295890.20.30
CAAGGTGAACAGCCTCTGGCC SEQ.1155377.21.9
CAAGGTGAACAGCCTCTGGCCA SEQ.247839.22.14
CAAGGTGAACAGCCTCTGGCCAA SEQ.2085008.23.5
CAAGGTGAACAGCCTCTGGCCAAT SEQ.1241836.24.2
CAAGGTGAACAGCCTCTGGCCAATG SEQ.150863.25.2
CAAGGTGAACAGCCTCTGGCCACTC SEQ.2289679.25.1
CAAGGTGAACAGCCTCTGGCCCA SEQ.197175.23.1
CAAGGTGAACAGCCTCTGGCCTGT SEQ.1487567.24.1
CAAGGTGAACAGCCTGGCCA SEQ.642268.20.1
CAAGGTGAACAGCTCTGG SEQ.1372063.18.1
CAAGGTGAACAGTCTCTGGC SEQ.305520.20.1
CAAGGTGAACGGCCTTGGC SEQ.370621.19.1
CAAGGTGAAGAAATTCAAC SEQ.1298817.19.1
CAAGGTGACAACAGTAAC SEQ.12876.18.1
CAAGGTGACAACAGTAACCTG SEQ.895208.21.19
CAAGGTGACAACAGTAACCTGCC SEQ.1890248.23.1
CAAGGTGACACAAGTAACCTGCCG SEQ.79758.24.1
CAAGGTGACAGCAGTAACCTG SEQ.910509.21.1
CAAGGTGACCCTGCTTTTTCAGGGA SEQ.1499918.25.1
CAAGGTGAGCAGCCTCTGGCCAA SEQ.2571296.23.1
CAAGGTGAGGGCGCCCCTGCAGGC SEQ.228694.24.1
CAAGGTGAGGGCGGCCCGGCCGGT SEQ.1871447.24.1
CAAGGTGAGTTTACCCACCTCGGC SEQ.1872162.24.1
CAAGGTGATCGTCCCAATACCA SEQ.220011.22.1
CAAGGTGATGAAGGAAATTT SEQ.2523909.20.1
CAAGGTGCACTACTCACTGCACGA SEQ.1680725.24.4
CAAGGTGCGGGGCCTTCCTGTGGA SEQ.172187.24.1
CAAGGTGCTAAGACAGGTGCTTT SEQ.245869.23.1
CAAGGTGGACACAGGACACAG SEQ.2626465.21.1
CAAGGTGGACAGCCTCTGGC SEQ.1929138.20.1
CAAGGTGGACGCATTTTTGCTAA SEQ.635404.23.1
CAAGGTGGCCAGCGAGTGCT SEQ.170744.20.1
CAAGGTGGCTGCGTCCAAGCGGG SEQ.542762.23.1
CAAGGTGGGACTATTCCCAA SEQ.763696.20.1
CAAGGTGGTAACTACACCAGGGT SEQ.1202265.23.1
CAAGGTGGTACGCTGAACGCACCG SEQ.661635.24.1
.....

I want to find all groups like this one:
CGGGGCATGACGTCACGATGATG
CGGGGCATGACGTCACGAT
GGGCATGACGTCACG
ATGACGTCACGATGAT
ATGACGTCACGATGATG
that is, containment relationships. How should I think about this problem?

I sorted the file, and that can find
CGGGGCATGACGTCACGATGATG
CGGGGCATGACGTCACGAT
or
CGGGGCATGACGTCACGATGATG
ATGACGTCACGATGATG
but a pair like
CGGGGCATGACGTCACGATGATG
GCATGACGTCACG
cannot be found that way.

Thanks!
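
One possible way to look at the problem, offered only as a rough sketch: sorting can only bring together sequences that share a prefix, so a sequence sitting in the middle of a longer one is never adjacent to it. Since the reads shown here are short (roughly 16 to 25 bases), it may be cheap enough to put every sequence into a hash set and then, for each sequence, look up all of its shorter substrings in that set. The function names `parse_records` and `find_containment` below are made up for illustration, and the code assumes each line of the file is "SEQUENCE ID" separated by whitespace.

```python
from collections import defaultdict


def parse_records(lines):
    """Parse 'SEQUENCE ID' lines into a dict mapping sequence -> ID.

    Assumption: two whitespace-separated fields per line; other lines are skipped.
    """
    seq_to_id = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            seq_to_id[parts[0]] = parts[1]
    return seq_to_id


def find_containment(seqs):
    """Map each sequence to the sorted list of other known sequences it contains."""
    known = set(seqs)
    # Only substring lengths that actually occur in the data need to be tried.
    lengths = sorted({len(s) for s in known})
    contained_in = defaultdict(set)
    for s in known:
        for n in lengths:
            if n >= len(s):
                break  # a contained sequence must be strictly shorter
            for start in range(len(s) - n + 1):
                sub = s[start:start + n]
                if sub in known:
                    contained_in[s].add(sub)
    return {s: sorted(subs) for s, subs in contained_in.items()}


if __name__ == "__main__":
    example = [
        "CGGGGCATGACGTCACGATGATG",
        "CGGGGCATGACGTCACGAT",
        "GGGCATGACGTCACG",
        "ATGACGTCACGATGAT",
        "ATGACGTCACGATGATG",
        "GCATGACGTCACG",
    ]
    for container, parts in find_containment(example).items():
        print(container, "contains:", ", ".join(parts))
```

With reads this short, each sequence produces at most a few hundred candidate substrings, so the work grows roughly linearly with the number of sequences instead of comparing every pair. For much longer sequences this enumeration would get expensive, and a different index (for example a suffix-based structure) would probably be a better fit.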