Development of a methodology for predicting translation initiation sites

AUTHOR(S)
PUBLICATION DATE

2007

ABSTRACT

Correctly predicting the translation start site in mRNA sequences is an important task in genomic annotation, but it is not trivial. Translation frequently starts at the first AUG, but this is not a rule. The problem can therefore be modeled as a classification problem between positive patterns (coding sequences) and negative patterns (non-coding sequences). To approach it, the authors of this work propose the following methodology: (1) an alternative extraction of negative patterns; (2) use of a shorter sequence window; (3) a modified encoding of the nucleotides; (4) use of SMOTE, a class-balancing method, since the problem is highly unbalanced (about 1:29 on average) for the databases used in this work; (5) use of a transductive approach in addition to traditional inductive inference; and finally, (6) use of the Support Vector Machine (SVM) classifier with simple kernel functions. To test this methodology, sequences collected by Petersen and Nielsen and RefSeq (Reference Sequences) sequences from NCBI (National Center for Biotechnology Information) were used. The RefSeq sequences came from five organisms, Danio rerio, Drosophila melanogaster, Homo sapiens, Mus musculus and Rattus norvegicus, under six distinct inspection levels (reviewed, provisional, predicted, validated, model and inferred). As a result, accuracy, adjusted accuracy, precision, sensitivity and specificity above 95% were attained, on average, by using out-of-frame negative patterns during the training step, 24-nucleotide windows, triplet encoding, pattern balancing with SMOTE, the SVM classifier, and a scanning model in which validation is carried out only up to the translation initiation site (TIS).
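The pattern-extraction steps named in the abstract (out-of-frame negatives, 24-nucleotide windows, triplet encoding) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the placement of the window (12 nucleotides upstream of the ATG), the padding with N, and the one-hot representation of each triplet are all assumptions made for illustration, since the abstract does not specify them. Sequences are written in the DNA alphabet (ATG for AUG).

```python
from itertools import product

WINDOW = 24    # window length stated in the abstract
UPSTREAM = 12  # assumption: nucleotides kept upstream of the candidate ATG

# All 64 nucleotide triplets, used as one-hot slots (assumed encoding).
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_INDEX = {c: i for i, c in enumerate(CODONS)}


def candidate_sites(seq):
    """Return the 0-based index of every ATG in the sequence."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


def window_at(seq, pos, upstream=UPSTREAM, length=WINDOW):
    """Cut a fixed-length window around pos, padding with N at the ends."""
    start = pos - upstream
    end = start + length
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(seq))
    return left_pad + seq[max(0, start):min(len(seq), end)] + right_pad


def encode_triplets(window):
    """One-hot encode each consecutive triplet; triplets with N stay all-zero."""
    vec = []
    for i in range(0, len(window), 3):
        slot = [0] * len(CODONS)
        idx = CODON_INDEX.get(window[i:i + 3])
        if idx is not None:
            slot[idx] = 1
        vec.extend(slot)
    return vec


def training_patterns(seq, tis):
    """Positive = annotated TIS; negatives = ATGs out of frame with the TIS."""
    patterns = []
    for pos in candidate_sites(seq):
        if pos == tis:
            label = 1
        elif (pos - tis) % 3 != 0:
            label = 0
        else:
            continue  # in-frame, non-TIS ATGs are left out of training
        patterns.append((encode_triplets(window_at(seq, pos)), label))
    return patterns


if __name__ == "__main__":
    seq = "GGATGCCATGAAAATGACC"  # toy sequence; annotated TIS assumed at index 7
    for vec, label in training_patterns(seq, tis=7):
        print(label, len(vec))
```

Vectors produced this way could then be balanced and classified with off-the-shelf tools, for example `SMOTE` from imbalanced-learn and `SVC` from scikit-learn; that pairing is a suggestion, not something the abstract names.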

SUBJECT(S)

bioinformatics theses.
