Analysis of Repetition Structure
Genomic nucleic acid sequences contain a very large number of repeats, which are thought to be closely related to the structure and evolution of the sequences. Repetition structures in genome sequences fall mainly into two categories: consecutive repeats (tandem repeats) and interspersed repeats such as SINEs (short interspersed nuclear elements) and LINEs (long interspersed nuclear elements). We developed a compact representation of combinations of consecutive repeats, called RRS (repetition representation of a string), and proposed a method of repetition structure analysis based on the minimum RRS. We are also developing a frequent approximate pattern mining algorithm for finding interspersed repeats; its gap-constrained version is fast and memory-efficient enough to be applied to genome sequences.
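To illustrate what a compact representation of nested consecutive repeats can look like, here is a minimal Python sketch. It is not the RRS algorithm itself: the text above does not define RRS or its size measure, so the sketch assumes that the size of a representation is the number of letters written, and it uses a hypothetical notation "(X)^k" for k consecutive copies of X. The function name min_repeat_representation is likewise chosen only for this example. The recursion is exhaustive and is practical only for short strings.

```python
from functools import lru_cache
from typing import Tuple


def min_repeat_representation(s: str) -> Tuple[int, str]:
    """Return (cost, text) for a smallest nested tandem-repeat encoding of s.

    Assumption: cost is the number of letters written in the encoding.
    This is an illustrative size measure, not the RRS definition.
    """

    @lru_cache(maxsize=None)
    def best(t: str) -> Tuple[int, str]:
        # A string of length 0 or 1 cannot be shortened.
        if len(t) <= 1:
            return (len(t), t)

        # Baseline: write t out literally.
        cost, rep = len(t), t

        # Option 1: split t into two parts and encode each independently.
        for i in range(1, len(t)):
            left_cost, left_rep = best(t[:i])
            right_cost, right_rep = best(t[i:])
            if left_cost + right_cost < cost:
                cost, rep = left_cost + right_cost, left_rep + right_rep

        # Option 2: t as a whole is k >= 2 consecutive copies of a unit.
        for period in range(1, len(t) // 2 + 1):
            copies, remainder = divmod(len(t), period)
            if remainder == 0 and t[:period] * copies == t:
                unit_cost, unit_rep = best(t[:period])
                if unit_cost < cost:
                    cost, rep = unit_cost, "({})^{}".format(unit_rep, copies)

        return (cost, rep)

    return best(s)


if __name__ == "__main__":
    print(min_repeat_representation("ACACACGT"))    # (4, '(AC)^3GT')
    print(min_repeat_representation("ABABCABABC"))  # (3, '((AB)^2C)^2')
```

The second call shows nesting: the tandem repeat (AB)^2 sits inside a unit that is itself repeated twice, which is the kind of combination of consecutive repeats that a flat list of tandem repeats would not capture.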