The exercise as described below is nice and self-contained in that it can be done on any Windows computer with free/academic software. However, it did not work for us in the Library computer room last year because installing the Excel plug-ins requires complex Admin privileges.
This year, we are going to do the whole thing in ArrayAssist - a commercial software package for microarray analysis that I run for the NYU Med School.
Download the full set of software and data for this exercise:
MArrayEx.zip
There are 3 basic steps to the analysis of a microarray:
1) Normalization
2) Find differentially regulated genes
3) Discover biological functions of regulated genes
For Affymetrix arrays, the normalization step is mathematically very challenging, but an acceptable
method known as RMA (Robust Multichip Average) has been developed and implemented in rather simple
software (RMAExpress, by Bolstad):
Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003). A Comparison of Normalization
Methods for High Density Oligonucleotide Array Data Based on Bias and Variance.
Bioinformatics 19(2):185-193
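RMA combines background correction, quantile normalization, and a probe-set summarization step. To give a feel for just the quantile normalization part, here is a minimal Python sketch. It is not the RMAExpress code, the function name and the toy intensities are made up for illustration, and ties are handled in a simplified way.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length columns (one column per chip)."""
    n = len(columns[0])
    # Sort each chip's intensities independently.
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: the mean across chips at each rank.
    reference = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]
    normalized = []
    for col in columns:
        # Rank each probe within its chip (ties broken by position, a simplification).
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            # Replace the value by the reference value at the same rank.
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Toy intensities for three hypothetical chips with four probes each.
chips = [[5.0, 2.0, 3.0, 4.0],
         [4.0, 1.0, 4.0, 2.0],
         [3.0, 4.0, 6.0, 8.0]]
for col in quantile_normalize(chips):
    print(col)
```

After this step every chip has exactly the same distribution of values, and only the order of the probes differs from chip to chip.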
To set up RMAExpress on a Windows computer, go to the RMAExpress webpage:
Download the Windows executable:
Run the installer.
Read the fine manual (!)
To analyze a set of Affy chips, you will need the .CEL file from each chip. You will also need a .CDF file for that chip type. The .CDF file contains information about where each probe is located on the chip and which probes go together to form a probe set for a specific gene. Affymetrix is somewhat shy about releasing the .CDF files, but they can be obtained from the GCOS software.
This exercise involves 8 chips of the HG-U133A_2 chip type. Your task is to find the top 20 genes that are significantly differentially regulated between RPTEC (control) and UOK-145 (treated) samples. Then find biological functions that are shared by the regulated genes.
Now open the natural scale file in Excel.
You should see a table with one column of row labels (Affy probe IDs) and
8 columns of signal intensities, one column per chip. At this point, you could do
some filtering of the data in Excel (remove values less than 100, or remove genes where all values are below 100). This is not necessary for today's exercise. This file can be used as input to GeneSpring, TIGR MeV, Matlab, and many other microarray and statistics software packages. We are going to use
SAM (Significance Analysis of Microarrays).
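If you would rather do the optional filtering outside Excel, it can be sketched in a few lines of Python. This assumes the natural scale file is tab-delimited with one header row and the probe ID in the first column. The function name, probe IDs, and values below are made up for illustration.

```python
import csv
import io


def filter_low_genes(rows, threshold=100.0):
    """Keep the header row plus every gene with at least one value >= threshold."""
    header, data = rows[0], rows[1:]
    kept = [r for r in data if any(float(v) >= threshold for v in r[1:])]
    return [header] + kept


# Hypothetical tab-delimited text standing in for the natural scale file.
sample = "Probe\tchipA\tchipB\nprobe_1\t250.1\t310.7\nprobe_2\t12.0\t48.5\n"
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
filtered = filter_low_genes(rows)
print(len(filtered) - 1, "genes kept")
```

To run it on a real file, replace the `io.StringIO(sample)` part with an opened file handle.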
Now set up SAM for statistical analysis of the data.
Download the latest version of SAM (this may already be installed on some Library computers)
Read the SAM manual (RTFM !)
SAM is an Excel plugin, and it is covered by some license restrictions, so there is a fair bit of hassle involved with downloading and installing. I have requested an academic license, downloaded the software, and put a copy on our class webserver TEMPORARILY.
SAM should now be ready to use - there should be two new buttons in the Excel toolbar named "SAM" and "SAM Plot Control"
You must put the input data in the correct format for SAM.
Open the natural scale file produced by RMAExpress in Excel. Insert a new row under the existing header row. In this new row, code each data column "1" for RPTEC or "2" for UOK-145. Select this new row and all data rows below it (leave out the header row with file names and do not select any empty rows or columns).
Selected data should look something like this:
                  1        1        2        2        1        1        2        2
GENE1  101     7.64    -0.50    -1.95    10.12   -10.77    -4.47    -7.65     7.58
GENE2  102    38.10     4.86     7.87   -13.59    -9.79   -13.46    -8.91    -5.07
GENE3  103    21.15     5.96     3.20    -4.74    -3.70   -12.35   -10.17     0.63
GENE4  104   187.21   -23.81    16.76    14.10   -99.76   -89.11   -10.92     5.52
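The class-label row can also be built from the column headers instead of typing it by hand. Here is a minimal Python sketch. It assumes each .CEL file name contains either "RPTEC" or "UOK", which may not match the real file names, so check the header before relying on it. The function name, its parameters, and the file names below are hypothetical.

```python
def class_label_row(header, control_tag="RPTEC", treated_tag="UOK", n_id_cols=1):
    """Build the class-label row: blank cells under the ID columns, then 1 or 2 per chip."""
    labels = [""] * n_id_cols
    for name in header[n_id_cols:]:
        if control_tag in name:
            labels.append("1")  # control sample
        elif treated_tag in name:
            labels.append("2")  # treated sample
        else:
            raise ValueError(f"cannot assign a class to column {name!r}")
    return labels


# Hypothetical header row; the real file names will differ.
header = ["Probe", "RPTEC_a.CEL", "RPTEC_b.CEL", "UOK145_a.CEL", "UOK145_b.CEL"]
labels = class_label_row(header)
print("\t".join(labels))
```

Raising an error on an unrecognized column name is deliberate: a silently mislabeled chip would corrupt the whole two-class comparison.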
Click on the SAM button in the Excel toolbar.
It should read in your data and then bring up a control screen. Our data are "Two class, unpaired data", so select that option. Genes are designated by name. Click OK.
Play with the sliders on the SAM output controller until you get a set of genes that have an acceptably low FDR (False Discovery Rate).
Take the top 20 genes by p-value.
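SAM does the scoring and ranking for you, but it may help to see roughly what a two-class unpaired score looks like. The Python sketch below computes a SAM-style relative difference (difference of class means divided by a pooled standard error plus a small constant s0) and sorts genes by its absolute value. It leaves out the permutations SAM uses to estimate the FDR, and the s0 value here is an arbitrary placeholder because SAM chooses its own from the data. The function names, gene names, and values are made up for illustration.

```python
import math


def relative_difference(control, treated, s0=1.0):
    """Two-class unpaired score: difference of means over (pooled standard error + s0)."""
    n1, n2 = len(control), len(treated)
    m1, m2 = sum(control) / n1, sum(treated) / n2
    # Pooled sum of squared deviations from each class mean.
    ss = sum((x - m1) ** 2 for x in control) + sum((x - m2) ** 2 for x in treated)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    # s0 > 0 keeps the score finite when a gene has zero variance.
    return (m2 - m1) / (s + s0)


def top_genes(table, labels, n=20, s0=1.0):
    """table maps gene ID -> list of values; labels holds 1 (control) or 2 (treated) per column."""
    scores = {}
    for gene, values in table.items():
        control = [v for v, c in zip(values, labels) if c == 1]
        treated = [v for v, c in zip(values, labels) if c == 2]
        scores[gene] = relative_difference(control, treated, s0)
    # Largest absolute score first, so up- and down-regulated genes both rank high.
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]


# Toy data: two control and two treated columns for three hypothetical genes.
labels = [1, 1, 2, 2]
table = {"gene_a": [10.0, 12.0, 30.0, 34.0],
         "gene_b": [20.0, 22.0, 21.0, 23.0],
         "gene_c": [50.0, 54.0, 11.0, 13.0]}
for gene, d in top_genes(table, labels, n=2):
    print(gene, round(d, 2))
```

A positive score means higher expression in the treated class, and a negative score means lower.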
One source of information about the biological functions of these genes is the NetAffx database on the
Affymetrix website (free registration required)
http://www.affymetrix.com/analysis/index.affx
Another possibility is one of the many Gene Ontology tools. One of them, DAVID/EASE, is very easy to use online:
http://david.niaid.nih.gov/david/ease.htm