Academic Paper

A Method for Evaluating Quality of Clustering DNA Fragments Encoded in Different Nucleotide Frequencies
Document Type
Conference
Source
2007 Frontiers in the Convergence of Bioscience and Information Technologies (FBIT 2007), pp. 60-63, Oct 2007
Subject
Computing and Processing
Bioengineering
DNA
Frequency
Sequences
Genomics
Bioinformatics
Testing
Assembly
Information technology
Control systems
Biodiversity
Abstract
The whole-genome shotgun sequencing technique has been successfully applied to environmental genomes. However, a considerable number of DNA sequences and small contigs generally remain unassembled after shotgun sequencing. Binning is the step of grouping these sequences based on biological and molecular features. The combination of oligonucleotide frequency and the Self-Organising Map (SOM) clustering algorithm shows high potential as a compositional binning tool. Because previous work did not provide methods for assessing results, we propose a systematic quantitative method to evaluate clustering results specifically for this type of application. We used this method to investigate the suitability of di-, tri-, tetra- and pentanucleotide frequencies as the training feature for this binning technique. The results show that dinucleotide frequency is unable to bin 10 kb DNA sequence fragments into well-clustered species groups. Furthermore, we observed that increasing the order of the oligonucleotide frequency may degrade the assignment of DNA sequences to classes in our test, which indicates the possible existence of an optimal species-specific oligonucleotide frequency. The results suggest that using trinucleotide frequency with SOM as a binning process gives sufficiently good clustering quality in this case.
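The oligonucleotide frequency features named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `kmer_frequencies`, the toy sequence, and the decision to skip windows containing ambiguous bases are all assumptions made for illustration, and the paper may apply further normalisation (for example, combining both strands) that the abstract does not describe.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Return relative frequencies of all 4**k oligonucleotides of length k.

    Hypothetical helper for illustration; the ordering of the vector is the
    lexicographic order of k-mers over the alphabet A, C, G, T.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Assumption: windows with ambiguous bases (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Toy fragment, not data from the paper.
fragment = "ATGCGATACGCTTGAGGCTA"
for k in (2, 3, 4, 5):  # di-, tri-, tetra-, pentanucleotide
    vector = kmer_frequencies(fragment, k)
    print(k, len(vector))
```

Each fragment becomes a vector of length 4^k (16, 64, 256, and 1024 for k = 2 to 5), which is the kind of input a SOM could be trained on. The rapid growth in dimensionality with k is one plausible reason, though not one stated in the abstract, why higher-order frequencies need not improve clustering of fixed-length fragments.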