|
Clust is a clustering algorithm designed to extract biologically meaningful clusters from gene expression data. Its key features follow from an argument made by its authors about what cluster analysis of gene expression data should be expected to produce. Given a gene expression dataset that covers a specific observation window (e.g. a set of time points, biological conditions, or developmental stages), not all of the genes in the dataset are expected to behave in a coordinated manner, so not all of them should be included in one of the generated clusters. The authors therefore argue that gene expression clustering is a cluster extraction problem rather than a data partitioning problem: the algorithm should extract good clusters from the given dataset instead of partitioning the entire dataset into a set of clusters. Clust was designed on this basis and validated against seven mainstream clustering methods over 100 real gene expression datasets<ref name=":0" />.

Key features

Amongst the key features provided by clust are:
* Automatic normalization of data: when a user does not specify which normalization techniques should be applied to their dataset(s), clust automatically detects the most suitable normalization and applies it.
* Automatic identification of the number of clusters.
* Automatic filtering of data.
* Cluster tightness can be tuned by users if needed.
* Ability to analyze multiple datasets simultaneously.
* Ability to analyze cross-species datasets simultaneously.
* Ability to analyze datasets from different technologies (e.g. microarrays and RNA-seq) simultaneously.

Availability

Clust is available as an open-source command-line package. A beta web interface is available for users to upload data, run clust, and download results without any command-line experience.
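The distinction between cluster extraction and data partitioning described above can be illustrated with a small sketch. This is not the clust algorithm itself: the function names, the fixed reference profiles, and the correlation threshold of 0.9 are all hypothetical and chosen only to show how an extraction-style method may leave some genes unassigned, whereas a partitioning-style method forces every gene into a cluster.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def best_cluster(profile, centroids):
    """Name of the reference profile most correlated with the given gene."""
    return max(centroids, key=lambda name: pearson(profile, centroids[name]))


def partition(genes, centroids):
    """Data partitioning: every gene is forced into its best-matching cluster."""
    return {gene: best_cluster(p, centroids) for gene, p in genes.items()}


def extract(genes, centroids, min_corr=0.9):
    """Cluster extraction: genes that match no profile well stay unassigned.

    min_corr is an illustrative threshold, not a clust parameter.
    """
    assigned = {}
    for gene, p in genes.items():
        name = best_cluster(p, centroids)
        if pearson(p, centroids[name]) >= min_corr:
            assigned[gene] = name
    return assigned


# Hypothetical toy data: two reference profiles over four time points.
centroids = {"up": [1, 2, 3, 4], "down": [4, 3, 2, 1]}
genes = {
    "geneA": [10, 21, 29, 41],  # clearly increasing
    "geneB": [8, 6, 4, 1],      # clearly decreasing
    "geneC": [5, 1, 6, 2],      # no consistent trend
}

partitioned = partition(genes, centroids)
extracted = extract(genes, centroids)
```

In this toy example, partitioning places `geneC` in the `down` cluster even though it correlates only weakly with that profile, while extraction leaves it out of both clusters.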
Simultaneous clustering of multiple datasets

Clust can apply cluster analysis to more than one dataset simultaneously. In this case, clust finds the groups (clusters) of genes that are consistently co-expressed in every one of the given datasets. Both the command-line clust package and the web interface allow users to submit multiple datasets to clust.

Cross-species clustering

If the datasets to be clustered simultaneously are from different species (e.g. human and mouse), the user must provide clust with a gene-ID map file that defines which genes in one species are orthologous to which genes in the other species. This orthology information can be downloaded from relevant repositories such as the NCBI HomoloGene database and the Phytozome portal [https://phytozome.jgi.doe.gov/pz/portal.html#], or it can be generated for any given set of species by tools such as the OrthoFinder algorithm, as long as the proteomes of those species are available. This capability can be used for cross-species comparative analysis.

Workflow
|
|
|