String Algorithm

Analysis of Repetition Structure

Genomic nucleic acid sequences contain a very large number of repeats, which are thought to be closely related to the structure and evolution of the sequences. Repetition structures in genome sequences fall mainly into two categories: consecutive repeats (tandem repeats) and interspersed repeats such as SINEs (short interspersed nuclear elements) and LINEs (long interspersed nuclear elements). We developed RRS (repetition representation of a string), a compact representation of a combination of consecutive repeats, and proposed a method for analyzing repetition structure based on the minimum RRS. We are also developing a frequent approximate pattern mining algorithm for finding interspersed repeats; its gap-constrained version is fast and memory-efficient enough to be applied to genome sequences.

The compression rate of chromosomes achieved by the minimum RRS is species-specific: it depends not only on the density of primitive tandem repeats but also on qualitative differences in their component repeats.
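To make the idea of a minimum repetition representation concrete, the following is a minimal Python sketch, not the method developed here. It assumes that the size of a representation is the number of letters it contains and that the compression rate is that size divided by the length of the original string; both definitions, as well as the function names `min_repetition_representation` and `compression_rate`, are assumptions made for illustration. The search is a plain dynamic program over substrings and is only practical for short strings, not for whole chromosomes.

```python
from functools import lru_cache


def min_repetition_representation(s: str) -> tuple[int, str]:
    """Return (size, expression) for a smallest nested-repeat form of s.

    Assumption: size counts the letters appearing in the expression,
    so "(ab)^3c" has size 3.
    """
    if not s:
        return 0, ""

    @lru_cache(maxsize=None)
    def best(i: int, j: int) -> tuple[int, str]:
        length = j - i
        if length == 1:
            return 1, s[i]

        result = None
        # Option 1: concatenate the best forms of two adjacent parts.
        for k in range(i + 1, j):
            left_size, left_expr = best(i, k)
            right_size, right_expr = best(k, j)
            cand = (left_size + right_size, left_expr + right_expr)
            if result is None or cand[0] < result[0]:
                result = cand

        # Option 2: s[i:j] is a tandem repeat of a unit of length p,
        # so it costs only as much as the unit itself.
        for p in range(1, length // 2 + 1):
            if length % p == 0 and s[i:i + p] * (length // p) == s[i:j]:
                unit_size, unit_expr = best(i, i + p)
                if unit_size < result[0]:
                    result = (unit_size, f"({unit_expr})^{length // p}")
        return result

    return best(0, len(s))


def compression_rate(s: str) -> float:
    """Assumed definition: representation size / original length."""
    size, _ = min_repetition_representation(s)
    return size / len(s)


if __name__ == "__main__":
    print(min_repetition_representation("abababc"))
    print(compression_rate("abababc"))
```

Under these assumptions, "abababc" is represented as "(ab)^3c" with size 3, and a nested case such as "aabaabaab" collapses to a repeat whose unit itself contains a repeat. A lower rate indicates that more of the string is accounted for by consecutive repeats.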
