Bioinformatics Special Interest Group (BI SIG)

Projects

Development of hybrid feature selection techniques for analysis of high-throughput biological data

Research Team: Assoc. Prof. Dr. Veselka Boeva, Elena Kostadinova

Consultant: Dr. Elena Tsiporkova

(The research is supported by U904-17 Bulgarian National Science Fund grant, 2007-present)

 

The main objective of this research project is the development and implementation of novel hybrid feature selection techniques for the analysis of high-throughput biological data. High-throughput gene expression data are characterized by a large number of variables (thousands) in contrast to a relatively small number of experiments (tens, or at most hundreds). This imbalance poses a serious risk of model overfitting, in particular for classification tasks. It also makes it difficult to assess the relative importance of individual genes and to identify synergistic effects between them. Reducing the dimensionality of the feature space is therefore of key importance when performing classification, mining and modeling of biological data. At the same time, no single feature selection method performs best in all cases, since performance depends on the specific data set and on the classification or identification task under consideration. We have therefore identified as an important research direction the development of hybrid feature selection approaches, which could overcome the shortcomings of individual methods and lead to improved performance in the analysis of gene expression data. Our research plan involves: 1) the development and evaluation of hybrid feature ranking procedures, involving a combination of clustering, classification and scoring techniques; 2) the design of multi-criteria feature weighting schemes; 3) the development and implementation of suitable feature selection techniques for mining and analysis of time series expression data.
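To make the idea of a hybrid feature ranking concrete, the following minimal sketch combines two simple gene scoring criteria by averaging the ranks they assign. It is an illustration only and not the procedure developed in the project: the two scorers (`signal_to_noise`, `mean_difference`), the mean-rank aggregation and the toy data are all assumptions chosen for brevity.

```python
from statistics import mean, pstdev


def signal_to_noise(values_a, values_b):
    """Absolute difference of class means divided by the summed class spreads."""
    spread = pstdev(values_a) + pstdev(values_b)
    return abs(mean(values_a) - mean(values_b)) / spread if spread else 0.0


def mean_difference(values_a, values_b):
    """Absolute difference of the class means for one gene."""
    return abs(mean(values_a) - mean(values_b))


def ranks(scores):
    """Convert scores to ranks, where rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rank = [0] * len(scores)
    for position, gene in enumerate(order, start=1):
        rank[gene] = position
    return rank


def hybrid_ranking(class_a, class_b, scorers):
    """Order gene indices by their mean rank over all scoring criteria.

    class_a and class_b hold, per gene, the expression values measured
    in the samples of each class (hypothetical input layout).
    """
    n_genes = len(class_a)
    all_ranks = []
    for scorer in scorers:
        scores = [scorer(class_a[g], class_b[g]) for g in range(n_genes)]
        all_ranks.append(ranks(scores))
    mean_rank = [mean(r[g] for r in all_ranks) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: mean_rank[g])


# Toy data: three genes, three samples per class (invented values).
class_a = [[1.0, 1.1, 0.9], [5.0, 5.2, 4.8], [2.0, 2.1, 1.9]]
class_b = [[1.0, 0.9, 1.1], [1.0, 1.2, 0.8], [2.5, 2.4, 2.6]]
ordering = hybrid_ranking(class_a, class_b, [signal_to_noise, mean_difference])
print(ordering)
```

Averaging ranks rather than raw scores keeps criteria with different scales comparable; a weighted mean of ranks would be one simple way to move towards a multi-criteria weighting scheme.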

 

 

Development of cluster analysis methods for multi-experimental gene expression data

 

Research Team: Assoc. Prof. Dr. Veselka Boeva, Elena Kostadinova

(The research is supported by 111pd058-19 Technical University of Sofia grant, 2011)

In the framework of this research project, we have studied two microarray data integration techniques and have demonstrated how they can be applied and validated on a set of independent, but biologically related, microarray data sets in order to derive consistent and relevant clustering results. First, we have developed a cluster integration approach which combines the information contained in multiple data sets at the level of expression or similarity matrices, and then applies a clustering algorithm to the combined matrix for subsequent analysis. Second, we have proposed a technique for the integration of multiple partitioning results. In addition, we have introduced a combined similarity measure which extends traditional gene clustering analysis by grouping genes into a cluster if they have high similarity with respect to both their expression values and their relationships with the other studied genes. The performance of the proposed cluster integration algorithms and of the combined similarity measure has been evaluated on time series expression data using three clustering algorithms and three cluster validation measures.
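The first integration strategy, combining data sets at the level of similarity matrices before clustering, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's algorithm: Pearson correlation as the similarity, an element-wise mean as the combination rule, and a threshold-based grouping (connected components) standing in for a real clustering algorithm are all choices made here for brevity.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def similarity_matrix(dataset):
    """Gene-by-gene correlation matrix for one data set (one row per gene)."""
    n = len(dataset)
    return [[pearson(dataset[i], dataset[j]) for j in range(n)] for i in range(n)]


def combine(matrices):
    """Element-wise mean of several gene-by-gene similarity matrices."""
    n = len(matrices[0])
    return [[mean(m[i][j] for m in matrices) for j in range(n)] for i in range(n)]


def threshold_clusters(sim, threshold):
    """Label genes so that genes linked by similarity >= threshold share a label."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            gene = stack.pop()
            for other in range(n):
                if labels[other] == -1 and sim[gene][other] >= threshold:
                    labels[other] = current
                    stack.append(other)
        current += 1
    return labels


# Two invented data sets on the same four genes; the number of time
# points may differ because integration happens on the similarities.
dataset_1 = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [8, 6, 4, 2]]
dataset_2 = [[1, 3, 2, 5, 6], [2, 5, 4, 9, 11], [6, 5, 4, 2, 1], [5, 5, 3, 2, 0]]
combined = combine([similarity_matrix(dataset_1), similarity_matrix(dataset_2)])
labels = threshold_clusters(combined, threshold=0.8)
print(labels)
```

Working on similarity matrices means the data sets only need to share the same genes, not the same sampling grid, which is one reason this level of integration is convenient for independent experiments.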

 

A system for integration analysis of multi-experimental time series expression data

 

Research Team: Assoc. Prof. Dr. Veselka Boeva, Elena Kostadinova, Milka Dimitrova, Petia Ivanova

Consultant: Dr. Elena Tsiporkova

(The research is supported by 091ni065-17 Technical University of Sofia grant, 2009)

In the framework of this research project, a software system has been developed which implements a novel integration approach for combining multi-experiment time series expression data. Initially, a recursive hybrid aggregation algorithm is employed to extract a set of genes that are potentially of interest for the biological phenomenon under study. Next, a hierarchical merge procedure is applied to fuse the multiple-experiment expression profiles of the selected genes. This procedure employs Dynamic Time Warping alignment techniques in order to account adequately for potential phase shifts between the different experiments. Furthermore, the system has been extended with a hybrid integration method, which is especially suited to the analysis of time series expression data across different experiments when direct integration is ineffective or even impossible. The system has been evaluated and validated on gene expression time series data from two independent studies examining the global cell-cycle control of gene expression in the fission yeast Schizosaccharomyces pombe. It has been demonstrated that the fused expression profiles can be used as a kind of gene signature for a particular activity, e.g. cyclic behavior, stress response or noise. In addition, the system can be used as a normalization and smoothing procedure aimed at noise reduction and signal amplification.
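The role of Dynamic Time Warping in handling phase shifts can be illustrated with the textbook dynamic-programming formulation of the DTW distance. The sketch below is not the system's merge procedure; the function name `dtw_distance`, the absolute-difference local cost and the toy profiles are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two sequences, using |x - y| as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched to an already used b element
                cost[i][j - 1],      # b[j-1] is matched to an already used a element
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# Two invented expression profiles with the same peak, shifted by one time point.
profile_1 = [0, 1, 2, 1, 0, 0]
profile_2 = [0, 0, 1, 2, 1, 0]
pointwise = sum(abs(x - y) for x, y in zip(profile_1, profile_2))
print(pointwise, dtw_distance(profile_1, profile_2))
```

Because DTW may stretch or compress the time axis, the shifted profiles align at a much lower cost than a point-by-point comparison would suggest, which is the property that makes it useful when experiments are out of phase.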