CLASSIFI (Cluster Assignment for Biological Inference) is a data-mining tool that can be used to identify significant co-clustering of genes with similar functional properties (e.g. cellular response to DNA damage). Briefly, CLASSIFI uses the Gene Ontology™ (GO) gene annotation scheme to define the functional properties of all genes/probes in a microarray data set, and then applies a cumulative hypergeometric distribution analysis to determine if any statistically significant gene ontology co-clustering has occurred. For a more detailed description, see Lee et al. BMC Bioinformatics 2006, 7:237.

This web site provides an interface for running the CLASSIFI algorithm against microarray data sets derived from either cDNA or oligonucleotide arrays by simply uploading a tab-delimited text file in an appropriate format. A detailed description of the acceptable file formats is given below.

Currently this site supports probe IDs from the RIKEN FANTOM collection, most Affymetrix chips, the Invitrogen Ultimate™ human ORF clone set and the UTSW custom mouse oligo array set. If you think CLASSIFI should support additional probes, please send us an email.

 


More About the CLASSIFI Algorithm:

  1. Where does CLASSIFI fit into microarray data analysis?
  2. Outline of the CLASSIFI algorithm
  3. How to use CLASSIFI
  4. File formats
  5. Output files
  6. Interpretations of results

Where does CLASSIFI fit into microarray data analysis?

It is important to realize that CLASSIFI is applied to the results of a microarray experiment after the raw data have been filtered, normalized and clustered to group together genes/probes with similar expression patterns under the experimental conditions tested. For example, one may first filter the data using PRIM to remove genes/probes whose fluorescent signals are too close to background, and then filter again using SAM to focus on genes/probes that show significantly different expression levels under the experimental conditions tested. Finally, the genes/probes that pass these filtering procedures can be grouped using a standard expression clustering method (e.g. k-means, hierarchical, SOM) into a finite number of Gene Clusters with similar expression patterns. CLASSIFI can then be used to help determine whether these gene clusters contain genes with common functional characteristics.

Outline of the CLASSIFI algorithm

How to use CLASSIFI

  • Step 1 – Filter, normalize and cluster the microarray data using your favorite approach.
  • Step 2 – Assign Gene Cluster membership for every probe in the data set.
  • Step 3 – Format CLASSIFI input file in one of two formats, as described below.
  • Step 4 – Upload the input file and submit the query.

File formats

    Files should be provided as tab-delimited text files, which can be generated from Excel spreadsheets.

  • Format 1 – consists of three columns:

    Column 1 contains the Probe ID (for cDNA or oligo arrays) or the Probe Set ID (for Affymetrix GeneChip® arrays)
    Column 2 contains the gene name or other useful descriptors
    Column 3 contains the Gene Cluster ID (determined by user)

  • Format 2 – the same as Format 1 except that it contains an additional column, placed before the Gene Cluster ID, with the primary GO annotation for each probe:

    Column 1 contains the Probe ID (for cDNA or oligo arrays) or the Probe Set ID (for Affymetrix GeneChip® arrays)
    Column 2 contains the gene name or other useful descriptors
    Column 3 contains the primary GO annotation for each probe (if the GO annotation is unknown, please use Format 1)
    Column 4 contains the Gene Cluster ID (determined by user)

    Each row corresponds to a different probe from the filtered and clustered data set.

    Examples of file formats can be found at the following links:
        Format 1 cDNA
        Format 1 Affymetrix
        Format 2 cDNA
        Format 2 Affymetrix
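
As a rough illustration of Format 1, the following Python sketch writes a three-column tab-delimited file without going through Excel. The probe IDs, descriptions and cluster IDs are made-up placeholders, and `write_format1` is a hypothetical helper name. This page does not say whether a header row is expected, so the sketch writes none.

```python
import csv
import io

# Hypothetical probe records: (probe ID, gene description, gene cluster ID).
rows = [
    ("probe_0001", "example gene A", "1"),
    ("probe_0002", "example gene B", "1"),
    ("probe_0003", "example gene C", "2"),
]


def write_format1(rows, handle):
    """Write one probe per line as three tab-separated columns."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for probe_id, description, cluster_id in rows:
        writer.writerow([probe_id, description, cluster_id])


# Write to an in-memory buffer here; pass an open text file to save to disk.
buffer = io.StringIO()
write_format1(rows, buffer)
format1_text = buffer.getvalue()
print(format1_text, end="")
```

For Format 2, the same approach applies with the primary GO annotation inserted as the third field and the Gene Cluster ID moved to the fourth.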

    Output files

    CLASSIFI generates three output files:

    Classifi_gofile – The GO file contains the results from the automated probe annotation, with all Gene Ontology IDs listed in the fourth column. These include the primary GO annotations along with the GO parentage associated with these primary annotations. This file is useful for identifying which probes/genes are responsible for gene cluster classifications.

    Classifi_outputfile – The output file contains all of the enumerated variables that were used in the hypergeometric distribution calculation and the probability results from the calculation. The data are divided by gene cluster, and within each gene cluster the GO IDs are ranked from lowest p value to highest p value. The column descriptions are as follows:

    GO id – the identification number given to the specific GO annotation.
    g – total number of probes (for Affymetrix arrays these are actually probe sets) in the entire data set.
    f – the number of occurrences of this particular ontology in the entire data set.
    c – the number of probes in the gene cluster.
    n – the number of occurrences of the particular ontology in the gene cluster.
    expt – the expected number of occurrences of the particular ontology based on “g”, “f” and “c” given a Poisson distribution.
    prob – the probability that the co-clustering of this ontology in this gene cluster would have occurred by chance (this is the important number).
    GO type – the gene ontology type, either Molecular Function, Biological Process or Cellular Component (cellular location).
    GO name – the ontology description linked with the GO ID in the first column.
    clusterid – gene cluster ID assigned by user.
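
To make the roles of g, f, c and n concrete, here is a minimal Python sketch of a cumulative hypergeometric calculation. It assumes that "prob" is the upper tail, i.e. the chance of seeing n or more occurrences in a cluster of c probes, and that "expt" is the mean c × f / g. Both are assumptions made for illustration; the function names are hypothetical and this is not the CLASSIFI source code.

```python
from math import comb


def expected_count(g, f, c):
    """Mean number of occurrences of an ontology in a cluster of c probes."""
    return c * f / g


def upper_tail_probability(g, f, c, n):
    """P(X >= n) when c probes are drawn without replacement from g probes,
    f of which carry the ontology (cumulative hypergeometric upper tail)."""
    total = comb(g, c)
    hits = sum(comb(f, i) * comb(g - f, c - i) for i in range(n, min(f, c) + 1))
    return hits / total


# Toy numbers: 10 probes in total, 4 carry the ontology, the cluster holds
# 5 probes and 3 of them carry the ontology.
expt = expected_count(g=10, f=4, c=5)
prob = upper_tail_probability(g=10, f=4, c=5, n=3)
print(expt, prob)
```

In this toy case 66 of the 252 possible clusters of size 5 contain three or more annotated probes, so the probability is 66/252.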

    Classifi_topfile – The Top file is simply a list of the GO IDs that give the lowest p value in each of the gene clusters.

    Interpretations of Results

    Recommended data set size - CLASSIFI does not work very well for small data sets. We have obtained good results using data sets that contain at least 1000 probes, with gene clusters containing an average of ~100 probes. However, the larger the data set, the longer the calculations will take. Data sets of 1000 – 3000 probes containing 10 – 30 gene clusters are recommended.

    Establishing p value cutoff - Low probabilities are highly informative. For this kind of data, selecting a p value cutoff of 0.05 is not appropriate, since this will include a large number of false positive results. To determine an appropriate p value cutoff, one needs to either do a permutation analysis, in which a series of randomized data sets with a structure similar to the original data set are generated and then analyzed using the same algorithm (time consuming), or apply a Bonferroni correction to determine the appropriate cutoff (easy and fast). While the first approach is more accurate, the Bonferroni correction gives a reasonable estimate for most data sets and is simple to calculate. To estimate the significance cutoff, select an α level (e.g. 0.05 or 0.01) and divide it by the number of data points used in the calculations (equivalent to the number of rows in the data set).

    This number provides an estimate for the p value cutoff that should be used for this data set.
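
The Bonferroni estimate described above is a single division. A minimal Python sketch, in which the data set size of 2000 rows is an arbitrary example and `bonferroni_cutoff` is a hypothetical helper name:

```python
def bonferroni_cutoff(alpha, n_rows):
    """Divide the chosen alpha level by the number of rows in the data set."""
    if n_rows <= 0:
        raise ValueError("n_rows must be a positive integer")
    return alpha / n_rows


# Example: alpha = 0.05 and a data set with 2000 rows (probes).
cutoff = bonferroni_cutoff(0.05, 2000)
print(cutoff)
```

With these example values the cutoff is 0.05 / 2000, or 0.000025, so only GO IDs with a p value below that would be treated as significant.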

    Limitations - High probabilities do not necessarily rule out a significant co-clustering. Since the probabilities are determined by all four parameters (g, f, c and n), high probabilities can occur for a variety of reasons. For example, high p values are always obtained if the number of occurrences of a particular ontology in the entire data set (f) is small.

    The bottom line is low p values are significant; high p values may or may not be significant.
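
The effect of a small f can be seen numerically. The Python sketch below computes the smallest p value that is achievable for a given f, i.e. the case where every occurrence of the ontology lands in one gene cluster. It assumes the p value is the upper tail of the hypergeometric distribution and compares it with a Bonferroni-style cutoff of 0.05 / g; the data set size, cluster size and function names are illustrative assumptions.

```python
from math import comb


def upper_tail_probability(g, f, c, n):
    """P(X >= n) for a cluster of c probes drawn from g probes, f annotated."""
    total = comb(g, c)
    hits = sum(comb(f, i) * comb(g - f, c - i) for i in range(n, min(f, c) + 1))
    return hits / total


def smallest_possible_p(g, f, c):
    """p value when as many occurrences as possible fall inside the cluster."""
    return upper_tail_probability(g, f, c, min(f, c))


g, c = 1000, 100      # assumed data set size and gene cluster size
cutoff = 0.05 / g     # assumed cutoff: alpha = 0.05, one row per probe

for f in (1, 2, 5):
    best = smallest_possible_p(g, f, c)
    print(f, best, best < cutoff)
```

With f = 1 the best possible p value is c / g = 0.1, so an ontology that occurs only once can never pass the cutoff no matter how it clusters. In this example the best case only drops below the cutoff once f reaches 5.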

    Related GO IDs in the Output file - Several GO IDs can come from the same gene, and these are often listed near each other in the table. For example, you may find that the first 5 GO IDs listed for a given gene cluster are functionally related and have exactly the same n values. This may be because all of these GO annotations are attached to the same set of genes. Remember that since both the primary GO annotations and the entire GO parentage are included, a group of genes will often have multiple GO IDs in common.
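
A small Python sketch shows why related terms end up with identical n values. The three-term hierarchy and the probe annotations below are made-up placeholders rather than real GO IDs, and `with_parentage` is a hypothetical helper; the point is only that expanding each primary annotation to include its parents makes every ancestor inherit the same count.

```python
from collections import Counter

# Hypothetical toy hierarchy (child term -> parent terms); not real GO IDs.
parents = {
    "TERM_C": ["TERM_B"],
    "TERM_B": ["TERM_A"],
    "TERM_A": [],
}


def with_parentage(term, parents):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


# Hypothetical gene cluster: three probes sharing one primary annotation.
primary = {"probe_1": "TERM_C", "probe_2": "TERM_C", "probe_3": "TERM_C"}

# Count occurrences (n) of every term after adding the parentage.
counts = Counter()
for probe, term in primary.items():
    counts.update(with_parentage(term, parents))

print(sorted(counts.items()))
```

All three terms receive n = 3 because they are carried by the same three probes, which mirrors the runs of related GO IDs with matching n values in the output file.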

    Contact us

    If you have any problems or questions, or you've obtained interesting results, please contact us at jcai@utsouthwestern.edu.

     



    Last Update: May, 2006
    WebMaster: jcai@utsouthwestern.edu
    © copyright 2003 UT Southwestern Medical Center