More About the CLASSIFI Algorithm:
- Where does CLASSIFI fit into microarray data analysis?
- Outline of the CLASSIFI algorithm
- How to use CLASSIFI
- File formats
- Output files
- Interpretations of results
Where does CLASSIFI fit into microarray data analysis?
It is important to realize that CLASSIFI is applied to the results of a microarray experiment after the raw data have been filtered, normalized and clustered to group together genes/probes with similar expression patterns under the experimental conditions tested. For example, one may first filter the data using PRIM to remove genes/probes that give fluorescent signals too close to background, then filter the data using SAM to focus on genes/probes that show significantly different expression levels under the experimental conditions tested. Finally, the genes/probes that pass these filtering procedures can be grouped together using a standard expression clustering method (e.g. k-means, hierarchical, SOM) into a finite number of Gene Clusters with similar expression patterns. CLASSIFI can then be used to help determine whether these gene clusters contain genes with common functional characteristics.
Outline of the CLASSIFI algorithm
How to use CLASSIFI
Step 1 – Filter, normalize and cluster the microarray data using your favorite approach.
Step 2 – Assign Gene Cluster membership for every probe in the data set.
Step 3 – Format the CLASSIFI input file in one of two formats, as described below.
Step 4 – Upload the input file and submit the query.
File formats
Files should be provided as tab-delimited text files, which can be generated from Excel spreadsheets.
Format 1 - Consists of three columns:
Column 1 contains the Probe ID (for cDNA or oligo arrays) or the Probe Set ID (for Affymetrix GeneChip® arrays)
Column 2 contains gene name or other useful descriptors
Column 3 contains Gene Cluster ID (determined by user)
Format 2 – the same as Format 1 except that it contains an additional column (Column 3) with the primary GO annotation for each probe, so the Gene Cluster ID moves to Column 4:
Column 1 contains the Probe ID (for cDNA or oligo arrays) or the Probe Set ID (for Affymetrix GeneChip® arrays)
Column 2 contains gene name or other useful descriptors
Column 3 contains primary GO annotation for each probe (if GO annotation is unknown, please use file format 1)
Column 4 contains Gene Cluster ID (determined by user)
Each row corresponds to a different probe from the filtered and clustered dataset.
Examples of file formats can be found at the following links:
    Format 1 cDNA
    Format 1 Affymetrix
    Format 2 cDNA
    Format 2 Affymetrix
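As a rough illustration of Format 1, the following Python sketch writes a three-column tab-delimited file. The probe IDs, gene names and cluster IDs are made-up placeholders, and the function name write_format1 is hypothetical. The sketch writes no header row, on the assumption that every row corresponds to a probe.

```python
import csv
import io

# Hypothetical rows: (Probe ID, gene name or descriptor, Gene Cluster ID).
example_rows = [
    ("probe_0001", "example gene A", "1"),
    ("probe_0002", "example gene B", "1"),
    ("probe_0003", "example gene C", "2"),
]


def write_format1(rows, handle):
    """Write (probe ID, description, cluster ID) rows as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for probe_id, description, cluster_id in rows:
        writer.writerow([probe_id, description, cluster_id])


# Write to an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
write_format1(example_rows, buffer)
print(buffer.getvalue(), end="")
```

To write to disk instead, pass a file opened with open(path, "w", newline="") in place of the buffer. A Format 2 file would add the primary GO annotation as the third field and move the cluster ID to the fourth.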
Output files
CLASSIFI generates three output files:
Classifi_gofile – The GO file contains the results from the automated probe annotation with all Gene Ontology IDs listed in the fourth column. These include the primary GO annotations along with the GO parentage associated with these primary annotations. This file is useful for identifying which probes/genes are responsible for gene cluster classifications.
Classifi_outputfile – The output file contains all of the enumerated variables that were used in the hypergeometric distribution calculation and the probability results from the calculation. The data are divided by gene cluster, and the GO IDs are ranked from lowest p value to highest p value for each gene cluster. The column descriptions are as follows:
GO id – the identification number of the specific GO annotation.
g – the total number of probes (for Affymetrix arrays, probe sets) in the entire data set.
f – the number of occurrences of this particular ontology in the entire data set.
c – the number of probes in the gene cluster.
n – the number of occurrences of this particular ontology in the gene cluster.
expt – the expected number of occurrences of this particular ontology based on “g”, “f” and “c” given a Poisson distribution.
prob – the probability that the co-clustering of this ontology in this gene cluster would have occurred by chance (this is the important number).
GO type – the gene ontology type: Molecular Function, Biological Process or Cellular Component (cellular location).
GO name – the ontology description linked with the GO ID in the first column.
clusterid – the gene cluster ID assigned by the user.
Classifi_topfile – The Top file is simply a list of the GO IDs that give the lowest p value in each of the gene clusters.
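To make the g, f, c and n columns more concrete, the sketch below computes an expected count and a hypergeometric probability from them. This is only an illustration under stated assumptions: this page says a hypergeometric distribution is used but does not give the formula, so the upper-tail form P(X >= n) and the expected value c*f/g used here are assumptions, and the function names are hypothetical.

```python
from math import comb


def expected_count(g, f, c):
    """Expected occurrences of an ontology in a cluster of c probes (assumed c*f/g)."""
    return c * f / g


def hypergeom_upper_tail(g, f, c, n):
    """P(X >= n) when c probes are drawn without replacement from g probes,
    f of which carry the ontology. Assumed form of the 'prob' column."""
    total = comb(g, c)
    favourable = sum(
        comb(f, k) * comb(g - f, c - k)  # exactly k annotated probes in the cluster
        for k in range(n, min(f, c) + 1)
    )
    return favourable / total


# Toy numbers, not from a real data set: 10 probes, 4 annotated,
# a cluster of 3 probes containing 2 annotated probes.
print(expected_count(10, 4, 3))
print(hypergeom_upper_tail(10, 4, 3, 2))
```

For real data sets a statistics library would normally be used instead of direct enumeration; the pure-Python version is shown only so that each term can be matched to a column.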
Interpretations of Results
Recommended data set size - CLASSIFI does not work well for small data sets. We have obtained good results using data sets that contain at least 1000 probes, with gene clusters containing an average of ~100 probes. However, the larger the data set, the longer the calculations will take. Data sets of 1000 – 3000 probes containing 10 – 30 gene clusters are recommended.
Establishing p value cutoff - Low probabilities are highly informative. For this kind of data, a p value cutoff of 0.05 is not appropriate, since it will include a large number of false positive results. To determine an appropriate p value cutoff, one needs either to do a permutation analysis, in which a series of randomized data sets with a structure similar to the original data set are generated and then analyzed using the same algorithm (time consuming), or to apply a Bonferroni correction (easy and fast). While the first approach is more accurate, the Bonferroni correction gives a reasonable estimate for most data sets and is simple to calculate. To estimate the significance cutoff, select an α level (e.g. 0.05 or 0.01) and divide it by the number of data points used in the calculations (equivalent to the number of rows in the data set). The result is an estimate of the p value cutoff that should be used for this data set.
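The Bonferroni estimate described above is a single division. The following minimal sketch uses an illustrative data set of 2000 rows; the function name bonferroni_cutoff is hypothetical.

```python
def bonferroni_cutoff(alpha, num_rows):
    """Estimate the p value cutoff as alpha divided by the number of rows."""
    if num_rows <= 0:
        raise ValueError("num_rows must be a positive integer")
    return alpha / num_rows


# Example: alpha = 0.05 and a data set with 2000 rows.
cutoff = bonferroni_cutoff(0.05, 2000)
print(f"{cutoff:.2e}")  # prints 2.50e-05
```

With these numbers, only GO IDs with a p value below roughly 0.000025 would be treated as significant.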
Limitations - High probabilities do not necessarily rule out a significant co-clustering. Since the probabilities are determined by all four parameters (g, f, c and n), high probabilities can occur for a variety of reasons. For example, high p values are always obtained if the number of occurrences of a particular ontology in the entire data set (f) is small. The bottom line is that low p values are significant; high p values may or may not be significant.
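The effect of a small f can be seen with a short calculation. The sketch below assumes the reported probability is a hypergeometric upper tail, P(X >= n), which this page does not state explicitly, and computes the smallest p value an ontology could possibly reach if every one of its f occurrences fell in the same cluster. The data set size (2000 probes) and cluster size (100 probes) are illustrative, and the function name is hypothetical.

```python
from math import comb


def smallest_possible_p(g, f, c):
    """Best-case p value: all min(f, c) occurrences land in one cluster.

    Assumes an upper-tail hypergeometric probability; with n at its
    maximum the tail reduces to a single term.
    """
    n = min(f, c)
    return comb(f, n) * comb(g - f, c - n) / comb(g, c)


g, c = 2000, 100           # illustrative data set and cluster sizes
bonferroni = 0.05 / g      # cutoff estimated as alpha / number of rows

for f in range(1, 6):
    p = smallest_possible_p(g, f, c)
    verdict = "can pass" if p < bonferroni else "cannot pass"
    print(f"f={f}: best-case p = {p:.3g} ({verdict} the cutoff {bonferroni:.3g})")
```

Under these assumptions an ontology that occurs only once in the data set can never score better than c/g, however perfectly it co-clusters, which is why a high p value for a rare ontology says little either way.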
Related GO IDs in the Output file - Several GO IDs often come from the same gene and are listed near each other in the table. For example, you may find that the first 5 GO IDs listed for a given gene cluster are functionally related and have exactly the same n values. This may be because all of these GO annotations are attached to the same set of genes. Remember that since both the primary GO annotations and the entire GO parentage are included, a group of genes will often have multiple GO IDs in common.
Contact us
If you have any problems or questions, or you have obtained interesting results, please contact us at jcai@utsouthwestern.edu.