Sequence Data - TIS


Huiqing Liu & Limsoon Wong. "Data Mining Tools for Biological Sequences", Journal of Bioinformatics and Computational Biology, 1(1): 139-167, April 2003.

Raw Data:



This data set is derived from sequence data. The original data consist of a selected set of vertebrate genomic sequences extracted from GenBank. They are used to find the Translation Initiation Site (TIS), the position at which translation from mRNA to protein begins. Because only sequences with an annotated TIS are included in the data set, a classification model can be built to distinguish true (positive) TIS from false (negative) TIS. As the data set is processed DNA, the TIS is an ATG. In total, there are 3312 sequences (i.e. 3312 true ATGs).

There are various ways to extract sequences and build a feature space. Here, we provide one approach: for each ATG, a window centered on that ATG is generated, extending 100 bases upstream and 100 bases downstream. Each window therefore contains 203 bases (100 + 3 + 100), indicated by A, T, C and G. Where the sequence ends before the window does, the missing bases are denoted by "?". With this strategy, we obtained 3312 true ATGs and 10063 false ATGs.

To build the feature space for classification, we mapped every 3 nucleotides to 1 amino acid and counted the frequency of each amino acid. Each amino acid is treated as upstream or downstream according to whether it appears before or after the centered ATG. Besides single amino acids, we also considered the frequency of each pair of amino acids. Thus, the number of features based on amino acids is (21+21*21)*2=924. Furthermore, to our knowledge, a true ATG often has G at position 1 of its downstream side, A or G at position 3 of its upstream side, and no other ATG upstream (for mRNA). We added these 3 features to the feature space as well, giving a final feature space of 927 features.
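The window extraction and feature counting described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used to produce the data set: the function and feature names are hypothetical, the 21 symbols are assumed to be the 20 amino acids plus a stop symbol from the standard genetic code, codons are assumed to be read in frame with the centered ATG, pairs are assumed to be adjacent amino acids, codons containing "?" are skipped, and "no upstream ATG" is checked only within the 100-base upstream flank.

```python
from collections import Counter
from itertools import product

FLANK = 100  # bases kept on each side of the candidate ATG

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
SYMBOLS = sorted(set(AMINO))  # assumed alphabet: 20 amino acids + stop = 21


def atg_windows(seq, flank=FLANK):
    """Yield (position, window) for every ATG; short flanks are padded with '?'."""
    seq = seq.upper()
    pos = seq.find("ATG")
    while pos != -1:
        up = seq[max(0, pos - flank):pos].rjust(flank, "?")
        down = seq[pos + 3:pos + 3 + flank].ljust(flank, "?")
        yield pos, up + "ATG" + down
        pos = seq.find("ATG", pos + 1)


def translate(dna):
    """Translate consecutive codons; codons with '?' or other unknowns are skipped."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return [CODON_TABLE[c] for c in codons if c in CODON_TABLE]


def features(window, flank=FLANK):
    """Build the (21 + 21*21)*2 + 3 features for one window."""
    up, down = window[:flank], window[flank + 3:]
    feats = {}
    # Drop leading upstream bases so upstream codons stay in frame with the ATG.
    for side, dna in (("up", up[flank % 3:]), ("down", down)):
        aas = translate(dna)
        singles = Counter(aas)
        pairs = Counter(zip(aas, aas[1:]))  # adjacent pairs (an assumption)
        for a in SYMBOLS:
            feats[f"{side}_{a}"] = singles[a]
        for a, b in product(SYMBOLS, repeat=2):
            feats[f"{side}_{a}{b}"] = pairs[(a, b)]
    feats["down1_is_G"] = int(down[:1] == "G")
    feats["up3_is_A_or_G"] = int(up[-3:-2] in ("A", "G"))
    feats["no_upstream_ATG"] = int("ATG" not in up)
    return feats


# Toy sequence (made up for illustration, not taken from the data set).
for position, window in atg_windows("GCCACCATGGCG"):
    row = features(window)
    print(position, len(window), len(row))
```

Each dictionary returned by `features` corresponds to one row of the feature space; the class label (true or false ATG) would come from the TIS annotation, which this sketch does not model.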


Our transformed .data and .names format files are available here.
