
Rivero L. Encyclopedia of database technologies and applications. 2006


Bioinformatics Data Management and Data Mining

Figure 1. Illustration of the primary, secondary, tertiary, and quaternary structure of proteins. [Figure panels: primary structure, secondary structure, tertiary structure, and quaternary structure; the heme group of hemoglobin is labeled.]

As to innovative approaches to database design, the following biological problems are worth mentioning:

1. Evolution and phylogenetic analysis (Maier et al., 2003). The demands of biodiversity and ecosystem research can advance one’s understanding and use of information technologies;

2. Protein structure prediction;

3. Molecular sequence management and alignment;

4. Recognition of genes and regulatory elements;

5. Interpretation of large-scale gene expression data;

6. Whole genome comparative analysis and synthesis;

7. Modeling of biochemical pathways (complex networks of interactions between proteins; Pathogenomics); and

8. Drug design and combinatorial libraries (Ghose & Viswanadha, 2001).

We also outline a number of challenges in representing biological data:

The inherent complexity of biological data;

Domain knowledge barrier;

The evolution of domain knowledge; and

The lack of expert data modeling skills.

BACKGROUND

Data management for molecular and cell biology involves the traditional areas of data generation and acquisition, data modeling, data integration, and data analysis. In industry, the main focus of the past several years has been the development of methods and technologies supporting high-throughput data generation, especially for DNA sequence and gene expression data. New technology platforms for generating biological data present data management challenges arising from the need to (1) capture, (2) organize, (3) interpret, and (4) archive vast amounts of experimental data. Platforms keep evolving with new versions benefiting from technological improvements, such as higher density arrays and better probe selection for microarrays.

This technology evolution raises the additional problem of collecting potentially incompatible data generated using different versions of the same platform, a problem encountered both when these data need to be integrated and when they are analyzed. Further challenges include qualifying the data generated using inherently imprecise tools and techniques and the high complexity of integrating data residing in diverse and poorly correlated repositories.

The data management challenges mentioned above, as well as others, have been examined in the context of both traditional and scientific database applications. When considering the associated problems, it is important to determine whether they require new or additional research, or can be addressed by adapting and/or applying existing data management tools and methods to the biological domain. Experts believe that existing data management tools and methods, such as commercial database management systems, data warehousing tools, and statistical methods, can be adapted effectively to the biological domain.

For example, the development of Gene Logic’s gene expression data management system (GeneLogic) has involved modeling and analyzing microarray data in the context of gene annotations (including sequence data from a variety of sources), pathways, and sample annotations (e.g., morphology, demography, clinical), and has been carried out using or adapting existing tools. Dealing with data uncertainty or inconsistency in experimental data has required statistical, rather than data management, methods. (Adapting statistical methods to gene expression data analysis at various levels of granularity has been the subject of intense research and development in recent years.)

The most difficult problems have been encountered in the area of data semantics—properly qualifying data values (e.g., an expression estimated value) and their relationships, especially in the context of continuously changing platforms and evolving biological knowledge. While such problems are met across all data management areas, from data generation through data collection and integration to data analysis, the solutions require domain-specific knowledge and extensive work with data definition and structuring, with data management providing only the framework (e.g., controlled vocabularies, ontologies) to address these problems (The Gene Ontology Consortium, 2000; Gruber, 1993; Karp et al., 2000).

In an industry setting, solutions to data management challenges need to be considered in terms of complexity, cost, robustness, performance, and other user- and product-specific requirements. Devising effective solutions for biological data management problems requires thorough understanding of the biological application, the data management field, and the overall context in which the problems are considered (GeneLogic). Inadequate understanding of the biological application and of data management technology and practices seems to present more problems than the limitations of existing data management technology in supporting biological data-specific structures or queries.

DATA MINING


Data mining in bioinformatics is an important source of discoveries. It combines advanced algorithms for classification, clustering, pattern recognition, and prediction with interactive visualization tools (Bertone & Gerstein, 2001; Galitsky, Gelfand & Kister, 1998). Data mining (knowledge discovery in databases) is the methodology for extracting interesting (nontrivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases. The following methods are the basis for successful data mining applications:

Statistical algorithms: Statistical analysis systems such as SAS and SPSS have been used by analysts to detect unusual patterns and explain patterns using statistical models such as linear models. Such systems have their place and will continue to be used.

Neural networks: Artificial neural networks mimic the pattern-finding capacity of the human brain; hence, some researchers have suggested applying Neural Network algorithms to pattern-mapping. Neural networks have been applied successfully in a few applications such as secondary and tertiary structure prediction and microarray data analysis.

Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.

Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (sometimes called the k-nearest neighbor technique).

Rule induction: The extraction of useful if-then rules from data based on statistical significance.

Data visualization: The visual interpretation of complex relationships in multidimensional data. It has been applied by Galitsky (2003) to data mining of the sequences of the immunoglobulin-like fold.
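Of the methods listed above, the nearest neighbor method is the easiest to illustrate in a few lines. The following is a minimal sketch, assuming Euclidean distance and simple majority voting; the function name `knn_classify`, the toy records, and the class labels are invented for illustration and do not come from any of the cited systems.

```python
import math
from collections import Counter


def knn_classify(query, history, k=3):
    """Return the majority class among the k records in `history` closest to `query`.

    `history` is a list of (feature_vector, class_label) pairs.
    """
    # Rank the historical records by Euclidean distance to the query record.
    ranked = sorted(history, key=lambda rec: math.dist(query, rec[0]))
    # Let the k most similar records vote on the class.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy historical dataset (hypothetical values): two features per record.
history = [
    ((1.0, 1.1), "normal"),
    ((0.9, 1.0), "normal"),
    ((1.2, 0.8), "normal"),
    ((3.0, 3.2), "tumor"),
    ((3.1, 2.9), "tumor"),
]

print(knn_classify((2.9, 3.0), history, k=3))
```

The choice of distance measure and of k strongly affects the result; both are assumptions here rather than recommendations.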

The comprehensive suite of bioinformatics data mining tools available, for example, at NCBI NIH (Baxevanis & Ouellette 1998; NIH Tools) includes the following:


The Basic Local Alignment Search Tool (BLAST), for comparing gene and protein sequences against others in public databases.

Clusters of Orthologous Groups (COGs) currently covers 21 complete genomes from 17 major phylogenetic lineages (the descendants of one individual). A COG is a cluster of very similar proteins found in at least three species. The presence or absence of a protein in different genomes can tell us about the evolution of the organisms, as well as point to new drug targets.

Map Viewer shows integrated views of chromosome maps for 17 organisms. Used to view the NCBI assembly of complete genomes, including human, it is a valuable tool for the identification and localization of genes, particularly those that contribute to diseases.

LocusLink combines descriptive and sequence information on genetic loci through a single query interface.

A UniGene Cluster is a non-redundant (non-repetitive) set of sequences that represents a unique gene. Each cluster record also contains information such as the tissue types in which the gene has been expressed and map location.

Electronic PCR (polymerase chain reaction: a method used to make multiple copies of DNA) allows you to search your DNA sequence for sequence tagged sites, which have been used as landmarks in various types of genomic maps.

VAST Search is a structure–structure similarity search service. It compares tertiary structure (3D coordinates) of a newly determined protein structure to those in the PDB (Protein Data Bank) database. VAST Search computes a list of similar structures that can be browsed interactively, using molecular graphics to view superimpositions and alignments.

The Human–Mouse Homology Maps compare genes in homologous segments of DNA from human and mouse sources, sorted by position in each genome.

Spidey aligns one or more mRNA sequences to a single genomic sequence. Messenger RNA arises in the process of transcription from the DNA and includes information on the synthesis of a protein. Spidey will try to determine the exon/intron structure, returning one or more models of the genomic structure, including the genomic/mRNA alignments for each exon. An exon is a part of a gene that can encode amino acids in a protein; it is usually adjacent to a non-coding DNA segment called an intron.
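Several of the tools above rest on comparing one sequence against another. As a rough illustration of local sequence comparison, the sketch below computes a Smith-Waterman local alignment score. This is not how BLAST works internally (BLAST relies on faster heuristics); the exact dynamic program is swapped in here because it is short and easy to verify. The scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    # Only two rows of the dynamic-programming matrix are kept in memory.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GATTACA", "TTAC"))
```

A production search would also report the aligned region and a significance estimate, which this sketch omits.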


FUTURE TRENDS

Genome database mining (Biodatabases.com) is referred to as computational genome annotation. Computational genome annotation is the identification of the protein-encoding regions of a genome and the assignment of functions to these genes on the basis of sequence similarity to other genes of known function. Gene expression database mining is the identification of intrinsic patterns and relationships in transcriptional expression data generated by large-scale gene expression experiments. Proteome database mining is the identification of intrinsic patterns and relationships in translational expression data generated by large-scale proteomics experiments. As the determination of the DNA sequences comprising the human genome nears completion, the Human Genome Initiative is undergoing a paradigm shift from static structural genomics to dynamic functional genomics. Thus, gene expression and proteomics are emerging as the major intellectual challenges of database mining research in the postsequencing phase of the Human Genome Initiative.
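The first half of computational genome annotation, locating candidate protein-encoding regions, can be sketched as a scan for open reading frames (ORFs). The example below is a deliberately simplified illustration, assuming the standard start codon ATG, the three standard stop codons, and the forward strand only; the function name `find_orfs` and the `min_codons` threshold are hypothetical, and real annotation pipelines use considerably more evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of forward-strand ORFs, with end exclusive.

    An ORF here runs from an ATG to the first in-frame stop codon, inclusive.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```

Assigning a function to each candidate region would then require a similarity search against genes of known function, which is outside this sketch.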

Genome, gene expression, and proteome database mining are complementary emerging technologies with considerable scope for improvement in data analysis. Improvements in genome, gene expression, and proteome database mining algorithms will enable the prediction of protein function in the context of higher order processes such as the regulation of gene expression, metabolic pathways, and signaling cascades. The final objective of such higher-level functional analysis will be the elucidation of integrated mapping between genotype and phenotype (Bork et al., 1998).

CONCLUSION

Bioinformatics may be alternatively defined as the interface between life sciences and computational sciences. It is a new science that has been stimulated by recent work on gene sequences; it applies the latest database techniques and smart mathematical algorithms to gene and protein sequence information in the search for new medical drug leads. Bioinformatics combines the storage and retrieval of complex biological data, with analysis and annotation of biological information. IT tools automate many of the processes, some of which take large amounts of computing power. The newest area is knowledge-based modeling of specific cellular and molecular processes. Bioinformatics is thus the study of the information content and information flow in biological systems and processes.


REFERENCES

Adams, M.D., Fields, C., & Venter, J.C. (Eds.). (1994). Automated DNA sequencing and analysis. London: Academic Press.

Baxevanis, A., & Ouellette, F.B.F. (Eds.). (1998). Bioinformatics: A practical guide to the analysis of genes and proteins. New York: John Wiley & Sons.

Bertone, P., & Gerstein, M. (2001). Integrative data mining: The new direction in bioinformatics. IEEE Engineering Medical Biology Magazine, 20(4), 33-40.

Biodatabases.com. Database mining tools in the Human Genome Initiative. Retrieved February 2, 2005, from http://www.biodatabases.com/whitepaper03.html

Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M., & Yuan, Y. (1998). Predicting function: From genes to genomes and back. Journal of Molecular Biology, 283(4), 707-725.

Galitsky, B. (2003). Revealing the set of mutually correlated positions for the protein families of immunoglobulin fold. In Silico Biology 3, 0022, Bioinformation Systems e.V. 241-264.

Galitsky, B., Gelfand, I., & Kister, A. (1998). Predicting amino acid sequences of the antibody human VH chains from its first several residues. Proceedings of the National Academy of Science, 95 (pp. 5193-5198).

The Gene Ontology Consortium. (2000). Gene ontology: Tool for the unification of biology. Nature Genetics, 25, 25-29.

GeneLogic. Retrieved February 2, 2005, from http://www.genelogic.com/

Ghose, A.K., & Viswanadha, V.N. (2001). Combinatorial library design and evaluation: Principles, software tools, and applications in drug discovery. New York: Marcel Dekker.

Gruber, T.R. (1993). Towards principles for the design of ontologies used for knowledge sharing. Proceedings of the International Workshop on Formal Ontology.

Karp, P.D., Riley, M., Saier, M., Paulsen, I.T., Paley, S.M., & Pellegrini-Toole, A. (2000). The EcoCyc and MetaCyc databases. Nucleic Acids Research, 28, 56-59.

Maier, D., Landis, E., Cushing, J., Frondorf, A., Schnase, J.L., & Silberschatz, A. (2003). Information technology challenges of biodiversity and ecosystems informatics. Information Systems, 28, 241-242.

NIH Tools. Retrieved February 2, 2005, from http://www.ncbi.nlm.nih.gov/Tools/

Pathogenomics. Retrieved February 2, 2005, from http://www.cmdr.ubc.ca/pathogenomics/terminology.html

Pevsner, P. (2000). Computational molecular biology: An algorithmic approach. Cambridge, MA: MIT.

Protein Structure. Retrieved February 2, 2005, from http://sosnick.uchicago.edu/precpstru.html (1d-3d structure).

Waterman, M.S. (1995). Introduction to computational biology: Maps, sequences, and genomes. London: Chapman and Hall.

Wilkins, M.R., Williams, K.L., Appel, R.D., & Hochstrasser, D.H. (Eds.). (1997). Proteome research: New frontiers in functional genomics. Berlin: Springer-Verlag.

KEY TERMS

DNA: Deoxyribonucleic acid. DNA molecules carry the genetic information necessary for the organization and functioning of most living cells and control the inheritance of characteristics.

Gene: The unit of heredity. A gene contains hereditary information encoded in the form of DNA and is located at a specific position on a chromosome in a cell’s nucleus. Genes determine many aspects of anatomy and physiology by controlling the production of proteins. Each individual has a unique sequence of genes, or genetic code.

Genome: It includes all the genetic material in the chromosomes of a particular organism; its size is generally given as its total number of base pairs.

Messenger RNA: Pieces of ribonucleic acid that carry genetic information from DNA to ribosomes (molecular machines that manufacture proteins), leading to their synthesis.

Microarray: Tool for studying how large numbers of genes interact with each other and how a cell’s regulatory networks control vast batteries of genes simultaneously. Uses a robot to precisely apply tiny droplets containing functional DNA to glass slides. Researchers then attach fluorescent labels to DNA from the cell they are studying. The labeled probes are allowed to bind to cDNA strands on the slides. The slides are put into a scanning microscope to measure how much of a specific DNA fragment is present.


Phylogenetic Tree: A variety of dendrogram (diagram) in which organisms are shown arranged on branches that link them according to their relatedness and evolutionary descent.

Phylogenetics: The taxonomical classification of organisms based on their degree of evolutionary relatedness.

Phylogeny: Evolutionary relationships within and between taxonomic levels, particularly the patterns of lines of descent.


Protein Structure Prediction: The problem of determining the secondary (division of sequence into fragments), tertiary (3D space), and quaternary structure of proteins, given their amino acid sequence.

RNA: A single-stranded nucleic acid made up of nucleotides. RNA is involved in the transcription of genetic information; the information encoded in DNA is transcribed into messenger RNA (mRNA), which controls the synthesis of new proteins.


Biological Data Mining

 

 

 


George Tzanis

Aristotle University of Thessaloniki, Greece

Christos Berberidis

Aristotle University of Thessaloniki, Greece

Ioannis Vlahavas

Aristotle University of Thessaloniki, Greece

INTRODUCTION

At the end of the 1980s, a new discipline named data mining emerged. The introduction of new technologies such as computers, satellites, new mass storage media, and many others has led to an exponential growth of collected data. Traditional data analysis techniques often fail to process large amounts of often noisy data efficiently in an exploratory fashion. The aim of data mining is to extract knowledge from large amounts of data with the help of computers. It is an interdisciplinary area of research that has its roots in databases, machine learning, and statistics and has contributions from many other areas such as information retrieval, pattern recognition, visualization, and parallel and distributed computing. There are many applications of data mining in the real world. Customer relationship management, fraud detection, market and industry characterization, stock management, medicine, pharmacology, and biology are some examples (Two Crows Corporation, 1999).

Recently, the collection of biological data has been increasing at explosive rates due to improvements in existing technologies and the introduction of new ones such as microarrays. These technological advances have assisted the conduct of large-scale experiments and research programs. An important example is the Human Genome Project, which was launched in 1990 by the U.S. Department of Energy and the U.S. National Institutes of Health (NIH) and was completed in 2003 (U.S. Department of Energy Office of Science, 2004). A representative example of the rapid biological data accumulation is the exponential growth of GenBank (Figure 1), the U.S. NIH genetic sequence database (National Center for Biotechnology Information, 2004). The explosive growth in the amount of biological data demands the use of computers for the organization, maintenance, and analysis of these data.

Figure 1. Growth of GenBank (years 1982-2003). [Plot of the number of DNA sequences, in millions, by year.]

This led to the evolution of bioinformatics, an interdisciplinary field at the intersection of biology, computer science, and information technology. As Luscombe, Greenbaum, and Gerstein (2001) mention, the aims of bioinformatics are:

The organization of data in such a way that allows researchers to access existing information and to submit new entries as they are produced.

The development of tools that help in the analysis of data.

The use of these tools to analyze the individual systems in detail in order to gain new biological insights.

The field of bioinformatics has many applications in the modern day world, including molecular medicine, industry, agriculture, stock farming, and comparative studies (2can Bioinformatics, 2004).

BACKGROUND

One of the basic characteristics of life is its diversity. Everyone can notice this by just observing the great differences among living creatures. Despite this diversity, the molecular details underlying living organisms are almost universal. Every living organism depends on the activities of a complex family of molecules called proteins. Proteins are the main structural and functional units of an organism’s cells. Enzymes, which catalyze (accelerate) chemical reactions, are a typical example of proteins. There are four levels of protein structural arrangement (conformation) as listed in Table 1 (Brazma et al., 2001). The statement about unity among organisms is strengthened by the observation that similar protein sets, having similar functions, are found in very different organisms (Hunter, 2004). Another common characteristic of all organisms is the presence of a second family of molecules, the nucleic acids. Their role is to carry the information that “codes” life. The force that created both the unity and the diversity of living things is evolution (Hunter, 2004).

Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

Proteins and nucleic acids are both called biological macromolecules, due to their large size compared to other molecules. Important efforts towards understanding life are made by studying the structure and function of biological macromolecules. The branch of biology concerned with this study is called molecular biology.

Both proteins and nucleic acids are linear polymers of smaller molecules called monomers. The term sequence is used to refer to the order of monomers that constitute a macromolecule. A sequence can be represented as a string of different symbols, one for each monomer. There are 20 protein monomers called amino acids. There exist two nucleic acids, deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), each composed of four different monomers called nucleotides. DNA is the genetic material of almost every living organism. RNA has many functions inside a cell and plays an important role in protein synthesis (Table 2). Moreover, RNA is the genetic material for some viruses such as HIV, which causes AIDS.

Table 1. The four levels of protein conformation

Primary Structure. The sequence of amino acids, forming a chain called a polypeptide.
Secondary Structure. The structure that a polypeptide forms after folding.
Tertiary Structure. The stable 3D structure that a polypeptide forms.
Quaternary Structure. The final 3D structure of the protein formed by the conjugation of two or more polypeptides.

Table 2. Some of the basic types of RNA

Messenger RNA (mRNA). Carries information from DNA to the protein synthesis site.
Ribosomal RNA (rRNA). The main constituent of ribosomes, the cellular components where protein synthesis takes place.
Transfer RNA (tRNA). Transfers the amino acids to ribosomes.

The genetic material of an organism is organized in long double-stranded DNA molecules called chromosomes. An organism may contain one or more chromosomes. A gene is a DNA sequence that is located on a particular chromosome and encodes the information for the synthesis of a protein or RNA molecule. All the genetic material of a particular organism constitutes its genome.

The central dogma of molecular biology, as coined by Francis Crick (1958), describes the flow of genetic information (Figure 2). DNA is transcribed into RNA, and then RNA is translated into proteins. The circular arrow around DNA denotes its replication ability. However, today it is known that in retroviruses RNA is reverse transcribed into DNA. Moreover, in some viruses, RNA is able to replicate itself. The extended statement of the central dogma of molecular biology is depicted in Figure 3.

Figure 2. The central dogma of molecular biology (initial statement). [Diagram: DNA, with a circular replication arrow, is transcribed into RNA, which is translated into protein.]

Figure 3. The central dogma of molecular biology (extended statement). [Diagram: as in Figure 2, with an added replication arrow around RNA and a reverse transcription arrow from RNA to DNA.]

Houle et al. (2000) refer to a classification of three successive levels for the analysis of biological data that is identified on the basis of the central dogma of molecular biology:

1. Genomics is the study of an organism’s genome and deals with the systematic use of genome information to provide new biological knowledge.

2. Gene expression analysis is the use of quantitative mRNA-level measurements of gene expression (the process by which a gene’s coded information is converted into the structural and functional units of a cell) in order to characterize biological processes and elucidate the mechanisms of gene transcription (Houle et al., 2000).

3. Proteomics is the large-scale study of proteins, particularly their structures and functions (Wikipedia, 2004).

These application domains are examined in the following paragraphs.

As many genome projects (the endeavors to sequence and map genomes) like the Human Genome Project have been completed, there is a paradigm shift from static structural genomics to dynamic functional genomics (Houle et al., 2000). The term structural genomics refers to the DNA sequence determination and mapping activities, while functional genomics refers to the assignment of functional information to known sequences. There are particular DNA sequences that have a specific biological role. The identification of such sequences is a problem that concerns bioinformatics scientists. One such sequence is the transcription start site, which is the region of DNA where transcription (the process of mRNA production from DNA) starts. Another biologically meaningful sequence is the translation initiation site, which is the site where translation (protein production from mRNA) initiates.
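To make the string view of sequences concrete, the sketch below transcribes a DNA string into mRNA and lists every position where the start codon AUG occurs. These positions are only candidates: not every AUG is a true translation initiation site, which is why the recognition problem calls for the data mining techniques discussed later. The sketch assumes the input is the coding (sense) strand, and both function names are hypothetical.

```python
def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T is replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def candidate_initiation_sites(mrna):
    """Return the 0-based positions of every AUG occurrence in the mRNA."""
    positions = []
    start = mrna.find("AUG")
    while start != -1:
        positions.append(start)
        # Continue the search one symbol further to catch later occurrences.
        start = mrna.find("AUG", start + 1)
    return positions


mrna = transcribe("ccatggctatgaaa")
print(mrna, candidate_initiation_sites(mrna))
```

A classifier for true initiation sites would typically take the sequence context around each candidate position as its input features.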

Although every cell in an organism, with only a few exceptions, has the same set of chromosomes, two cells may have very different properties and functions. This is due to the differences in abundance of proteins. The abundance of a protein is partly determined by the levels of mRNA, which in turn are determined by the expression or non-expression of the corresponding gene. A tool for analyzing gene expression is the microarray. A microarray experiment measures the relative mRNA levels of typically thousands of genes, providing the ability to compare the expression levels of different biological samples. These samples may correlate with different time points taken during a biological process or with different tissue types such as normal cells and cancer cells (Aas, 2001). An example raw microarray image is illustrated in Figure 4.
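Comparing the expression levels of two samples is often done on a logarithmic scale, so that a doubling and a halving are symmetric around zero. The sketch below computes a per-gene log2 ratio; this convention is an assumption on our part (it is common practice, not something prescribed by the text), and the gene names and intensity values are invented.

```python
import math


def log2_ratios(sample, reference):
    """Return per-gene log2(sample / reference) for genes measured in both samples."""
    return {
        gene: math.log2(sample[gene] / reference[gene])
        for gene in sample
        # Skip genes missing from the reference or with non-positive intensities.
        if gene in reference and sample[gene] > 0 and reference[gene] > 0
    }


# Hypothetical intensities for three made-up genes.
tumor = {"geneA": 800.0, "geneB": 100.0, "geneC": 250.0}
normal = {"geneA": 200.0, "geneB": 400.0, "geneC": 250.0}

ratios = log2_ratios(tumor, normal)
print(ratios)
```

A positive value indicates higher measured mRNA levels in the first sample, a negative value lower levels, and zero no change.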

Serial Analysis of Gene Expression (SAGE) is a method that allows the quantitative profiling of a large number of transcripts (Velculescu et al., 1995). A transcript is a sequence of mRNA produced by transcription. However, this method is very expensive in contrast to microarrays; thus there is a limited amount of publicly available SAGE data.

One of the concerns of proteomics is the prediction of protein properties such as active sites, modification sites, localization, stability, globularity, shape, protein domains, secondary structure, and interactions (Whishart, 2002). Secondary structure prediction is one of the most important problems in proteomics. The interaction of proteins with other biomolecules is another important issue.

Figure 4. An illuminated microarray (enlarged). A typical dimension of such an array is about 1 inch or less; the spot diameter is of the order of 0.1 mm, and for some microarray types it can be even smaller (Brazma et al., 2001).


MINING BIOLOGICAL DATA

Data mining is the discovery of useful knowledge from databases. It is the main step in the process known as Knowledge Discovery in Databases (KDD) (Fayyad et al., 1996), although the two terms are often used interchangeably. Other steps of the KDD process are the collection, selection, and transformation of the data and the visualization and evaluation of the extracted knowledge. Data mining employs algorithms and techniques from statistics, machine learning, artificial intelligence, databases and data warehousing, and so forth. Some of the most popular tasks are classification, clustering, association and sequence analysis, and regression. Depending on the nature of the data as well as the desired knowledge, there is a large number of algorithms for each task. All of these algorithms try to fit a model to the data (Dunham, 2002). Such a model can be either predictive or descriptive. A predictive model makes a prediction about data using known examples, while a descriptive model identifies patterns or relationships in data. Table 3 presents the most common data mining tasks (Dunham, 2002).

Many general data mining systems such as SAS Enterprise Miner, SPSS, S-Plus, IBM Intelligent Miner, Microsoft SQL Server 2000, SGI MineSet, and Inxight VizServer can be used for biological data mining. In addition, dedicated biological data mining tools such as GeneSpring, Spot Fire, VectorNTI, COMPASS, Statistics for Microarray Analysis, and Affymetrix Data Mining Tool have been developed (Han, 2002). Also, a large number of biological data mining tools are provided by the National Center for Biotechnology Information and by the European Bioinformatics Institute.

Data Mining in Genomics

Many data mining techniques have been proposed to deal with the identification of specific DNA sequences. The most common include neural networks, Bayesian classifiers, decision trees, and Support Vector Machines (SVMs) (Hirsh & Noordewier, 1994; Ma & Wang, 1999; Zien et al., 2000). Sequence recognition algorithms exhibit perfor-

Table 3. Common data mining tasks

Biological Data Mining

mance tradeoffs between increasing sensitivity (ability to detect true positives) and decreasing selectivity (ability to exclude false positives) (Houle et al., 2000). However, as Li, Ng, and Wong (2003) state, traditional data mining techniques cannot be directly applied to these types of recognition problems. Thus, there is the need to adapt the existing techniques to these kinds of problems. Attempts to overcome this problem have been made using feature generation and feature selection (Li et al., 2003; Zeng, Yap & Wong, 2002). Another data mining application in genomic level is the use of clustering algorithms to group structurally related DNA sequences.

Gene Expression Data Mining

The main types of microarray data analysis include (Piatetsky-Shapiro & Tamayo, 2003) gene selection, clustering, and classification.

Piatetsky-Shapiro and Tamayo (2003) present one great challenge that data mining practitioners have to deal with. Microarray datasets—in contrast with other application domains—contain a small number of records (less than a hundred), while the number of fields (genes) is typically in the thousands. The same case is in SAGE data. This increases the likelihood of finding “false positives”.

An important issue in data analysis is feature selection. In gene expression analysis, the features are the genes. Gene selection is a process of finding the genes most strongly related to a particular class. One benefit provided by this process is the reduction of the foresaid dimensionality of dataset. Moreover, a large number of genes are irrelevant when classification is applied. The danger of overshadowing the contribution of relevant genes is reduced when gene selection is applied.

Predictive | Descriptive
Classification. Maps data into predefined classes. | Association Analysis. The production of rules that describe relationships among data.
Regression. Maps data into a real valued prediction variable. | Sequence Analysis. Same as association, but sequence of events is also considered.
 | Clustering. Groups similar input patterns together.

Clustering is by far the most widely used method in gene expression analysis. Tibshirani et al. (1999) and Aas (2001) classify clustering methods into two categories: one-way clustering and two-way clustering. Methods of the first category are used to group either genes with similar behavior or samples with similar gene expressions. Two-way clustering methods are used to cluster genes and samples simultaneously. Hierarchical clustering is currently the most frequently applied method in gene expression analysis. An important issue concerning the application of clustering methods to microarray data is the assessment of cluster quality. Many techniques, such as bootstrap, repeated measurements, mixture model-based approaches, sub-sampling, and others, have been proposed to deal with the assessment of cluster reliability (Kerr & Churchill, 2001; Ghosh & Chinnaiyan, 2002; Smolkin & Ghosh, 2003; Yeung, Medvedovic & Bumgarner, 2003).
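One-way hierarchical clustering of genes can be sketched as repeatedly merging the two closest clusters until the desired number remains. The example below assumes Euclidean distance and average linkage, which are common choices but are not specified by the source; the function names and the toy profiles are hypothetical.

```python
from math import dist

def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise Euclidean distance between the members of two clusters."""
    total = sum(dist(profiles[i], profiles[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def hierarchical_clustering(profiles, n_clusters):
    """Agglomerative clustering: start with singletons, merge the closest pair."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(cluster) for cluster in clusters]

# Toy expression profiles: one row per gene, one column per sample.
profiles = [
    (0.1, 0.2, 0.1),
    (0.0, 0.3, 0.2),
    (5.0, 5.1, 4.9),
    (5.2, 4.8, 5.0),
    (9.9, 0.1, 9.8),
]
print(hierarchical_clustering(profiles, n_clusters=3))
```

Transposing the input (one row per sample) would cluster samples instead of genes, which is the other form of one-way clustering described above.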

In microarray analysis, classification is applied to discriminate diseases or to predict outcomes based on gene expression patterns, and perhaps even to identify the best treatment for a given genetic signature (Piatetsky-Shapiro & Tamayo, 2003).
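As a minimal sketch of classification on expression profiles, the example below uses a k-nearest neighbors classifier with Euclidean distance and majority voting. The function name `knn_classify`, the class labels, and all expression values are invented for illustration and do not come from the cited work.

```python
from collections import Counter
from math import dist

def knn_classify(train_profiles, train_labels, query, k=3):
    """Predict a label for query by majority vote among its k nearest profiles."""
    ranked = sorted(
        zip(train_profiles, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set: three genes measured per sample.
train_profiles = [
    (8.1, 1.0, 3.2), (7.9, 1.3, 3.0), (8.4, 0.8, 3.5),  # labeled "tumor"
    (2.0, 6.1, 3.1), (2.3, 5.8, 2.9), (1.8, 6.4, 3.3),  # labeled "normal"
]
train_labels = ["tumor"] * 3 + ["normal"] * 3

print(knn_classify(train_profiles, train_labels, (7.5, 1.5, 3.1)))
```

With two classes, an odd k avoids tied votes. In practice the profiles would usually be restricted to a selected subset of genes before distances are computed.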

Table 4 lists the most commonly used methods in microarray data analysis. Detailed descriptions of these methods can be found in the literature (Aas, 2001; Dudoit, Fridly & Speed, 2002; Golub et al., 1999; Hastie et al., 2000; Lazzeroni & Owen, 2002; Tibshirani et al., 1999).

Most of the methods used to deal with microarray data analysis can be used for SAGE data analysis.

Finally, machine learning and data mining can be applied not only to analyze microarray experiments but also to design them (Molla et al., 2004).

Data Mining in Proteomics

Many modification sites can be detected by simply scanning a database that contains known modification sites. However, in some cases, a simple database scan is not effective. The use of neural networks provides better results in these cases. Similar approaches are used for the prediction of active sites. Neural network approaches and nearest neighbor classifiers have been used to deal with protein localization prediction (Whishart, 2002). Neural networks have also been used to predict protein properties such as stability, globularity, and shape. Whishart refers to the use of hierarchical clustering algorithms for predicting protein domains.
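A simple scan for known modification sites can be sketched as matching each protein sequence against a stored consensus pattern. The example below assumes the widely cited N-glycosylation consensus N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline) written as a regular expression. The function name `scan_sites` and the toy sequence are hypothetical, and a real scan would draw its patterns from a curated database.

```python
import re

# N-{P}-[ST]-{P} as a regular expression. The lookahead makes the match
# zero-width, so overlapping candidate sites are all reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")

def scan_sites(sequence, pattern=N_GLYCOSYLATION):
    """Return (0-based start position, matched residues) for every hit."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]

toy_sequence = "MNGTANPSQNKSA"
print(scan_sites(toy_sequence))
```

The asparagine at position 5 is not reported because it is followed by proline. A pattern match only marks a candidate site, which is one reason a plain scan can be ineffective and learned predictors are used instead.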

Data mining has been applied to protein secondary structure prediction. This problem has been studied for more than 30 years, and many techniques have been developed (Whishart, 2002). Initially, statistical approaches were adopted to deal with this problem. Later, more accurate techniques based on information theory, Bayes theory, nearest neighbors, and neural networks were developed. Combined methods, such as integrating multiple sequence alignments with neural network or nearest neighbor approaches, improve prediction accuracy.

Sander et al. (1998) present a density-based clustering algorithm (GDBSCAN) that can be used to deal with protein interactions. This algorithm is able to cluster point and spatial objects according to both their spatial and non-spatial attributes.
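The idea of density-based clustering over both spatial and non-spatial attributes can be sketched with a DBSCAN-style procedure whose neighborhood test is supplied by the caller. This is a simplified illustration written for this text, not the published GDBSCAN algorithm: the function names, the predicate, the `min_pts` threshold, and the toy objects are all assumptions.

```python
from math import dist

NOISE = -1

def density_cluster(objects, is_neighbor, min_pts):
    """Label each object with a cluster id, or NOISE if it is in no dense region."""
    n = len(objects)
    labels = [None] * n  # None means "not visited yet"

    def neighborhood(i):
        return [j for j in range(n) if is_neighbor(objects[i], objects[j])]

    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighborhood(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border object
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border object: joins but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reachable = neighborhood(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core object, keep expanding
        cluster_id += 1
    return labels

def close_and_same_kind(a, b, eps=1.5):
    """Neighbors must be spatially close AND share the non-spatial attribute."""
    return a[2] == b[2] and dist(a[:2], b[:2]) <= eps

# Toy objects: (x, y, kind). The last one lies inside the first group
# spatially but differs in its non-spatial attribute.
objects = [
    (0, 0, "type_a"), (0, 1, "type_a"), (1, 0, "type_a"), (1, 1, "type_a"),
    (10, 10, "type_b"), (10, 11, "type_b"), (11, 10, "type_b"), (11, 11, "type_b"),
    (0.5, 0.5, "type_b"),
]
labels = density_cluster(objects, close_and_same_kind, min_pts=3)
print(labels)
```

Because the neighborhood predicate checks the `kind` field as well as the distance, the last object is treated as noise even though it sits in the middle of the first group.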

FUTURE TRENDS

Because of the special characteristics of biological data, the variety of new problems, and the extremely high importance of bioinformatics research, a large number of critical issues are still open and demand active and collaborative research by academia as well as industry. Moreover, new technologies such as microarrays have led to a constantly increasing number of new questions about new data. Examples of hot problems in bioinformatics are the accurate prediction of protein structure and the analysis of gene behavior in microarrays. Bioinformatics both demands and provides opportunities for the development of novel and improved data mining methods. As Houle et al. (2000) mention, these improvements will enable the prediction of protein function in the context of higher order processes such as the regulation of gene expression, metabolic pathways (series of chemical reactions within a cell, catalyzed by enzymes), and signaling cascades (series of reactions which occur as a result of a single stimulus). The final objective of such analysis will be to illuminate the path leading from genotype to phenotype.

CONCLUSION

The recent technological advances have led to an exponential growth of biological data. New questions on these

Table 4. Popular microarray data mining methods

One-way Clustering | Two-way Clustering | Classification
Hierarchical Clustering | Block Clustering | SVMs
Self-organizing Maps (SOMs) | Gene Shaving | K-nearest Neighbors
K-means | Plaid Models | Classification/Decision Trees
Singular Value Decomposition (SVD) | | Voted Classification
 | | Weighted Gene Voting
 | | Bayesian Classification
