Basics of Microarray Data Analysis

Prepared by: Scott A. Ness, Ph.D. Tel. 505-272-9883

Introduction

Microarrays provide biological scientists with a powerful new tool for simultaneously analyzing the expression of many thousands of genes. However, the data sets generated by these experiments are large and complex. This document provides the basic information and background that biological scientists need to perform microarray experiments successfully and to understand the results. The goal is to help them design microarray experiments that yield the best possible data, and to describe the basics of analyzing microarray data in a way that fits the design of biological experiments.

Other sources of information

For Affymetrix users, the Affymetrix manual for Data Mining Tool (DMT) contains a tutorial that can be used either with sample data or with your own data. The DMT tutorial provides an excellent overview of the clustering methods, algorithms and statistics that can be applied to microarray data, and I highly recommend it. The Affymetrix web site also offers detailed information.

Silicon Genetics provides excellent support, including detailed manuals that can be downloaded from their web site. You must register with them to download the PDF manual. Detailed on-line help is also available from within GeneSpring.

Types of microarrays and microarray data

The two basic types of microarrays are Affymetrix GeneChips, which have short (25-mer) oligonucleotides synthesized directly on the glass, and spotted arrays, which are made by dipping pens into a concentrated DNA solution and physically depositing the DNA on a microscope slide. There are other variants on these approaches as well, such as specialized ink-jet printers that can synthesize oligos directly on a solid support, but they are not yet widespread.

Affymetrix GeneChips

The Affymetrix microarrays have a number of advantages. First, the features (DNA spots) are extremely uniform and very close together.
As a consequence, a single GeneChip can contain more than 400,000 different DNA spots, which allows these commercial chips to contain a large number of controls and multiple (16-20) spots for each gene. Although there have been some complaints of lot-to-lot variability, the GeneChips are, for the most part, highly reproducible.

The Affymetrix microarrays use short (25-mer) oligonucleotides. Up to 20 perfect match oligos are present for each gene, as well as an equal number of corresponding mismatch oligos that have a single nucleotide substitution. Thus, the Affymetrix system is very good at detecting single nucleotide differences. The disadvantage is that allelic differences could lead to changes in the data that do not reflect changes in expression. However, in general, the Affymetrix system works extremely well, is very user friendly and is highly recommended as a starting point for investigators beginning their first microarray experiments.

Spotted arrays

Custom, or spotted, arrays have two major advantages: cost and flexibility. Once the DNA samples have been prepared, spotted arrays can be produced for a very low cost, essentially just the price of the glass slides and the labor involved. In addition, spotted arrays can be designed to contain only the genes of interest. The disadvantage is that the DNA must be purchased (in the case of oligonucleotides) or prepared by PCR and purification (for cDNA clones). These can be costly and problematic procedures.

Using longer probes (cDNAs or longer oligonucleotides) provides more specificity in the hybridization, and limits the effects of single nucleotide polymorphisms or allelic differences. However, the DNA spots made by spotters are much larger than the features present on an Affymetrix array. As a consequence, there are many fewer spots per slide, and many fewer controls. For example, with spotted arrays it is rare to have more than 15,000 spots per microscope slide, or to have more than duplicate spots for each gene.
This makes spotted arrays more prone to reproducibility problems, especially chip-to-chip variation. In contrast, the Affymetrix arrays have 16 or more spots per gene, and use statistical comparisons of the results to determine the expression level.

Data normalization

In order to compare gene expression results, it is necessary to normalize the microarray data, and there are two basic ways this is done. Per-chip normalization is essentially a type of scaling that adjusts the total or average intensity of each array to be approximately the same. The second type is per-gene normalization, which compares the results for a single gene across all the samples.

Scaling or per-chip normalization

Per-chip normalization is extremely useful for eliminating minor differences in probe preparation, hybridization conditions, etc. Essentially, this is like turning the sensitivity of the scanner up or down, or adjusting the brightness of the monitor, so that each sample looks the same, on average. Usually, the adjustments are made to set the average fluorescence intensity to some standard value, so that all the intensities on the chip go up or down to a similar degree. This approach makes sense if the samples are all similar, e.g. all from the same types of cells or tissues.

However, this type of normalization will obscure some aspects of the data, such as whether the RNA samples or the probe preparation steps were equivalent for each sample. Without scaling, bad samples will be obviously dimmer across the board. With scaling, bad samples will be more difficult to detect. In addition, for samples of poor quality, where relatively few probes yield a detectable signal, those values that are detected will be amplified disproportionately so that they represent too great a fraction of the average total fluorescence.

With the Affymetrix system, our standard approach is to turn scaling on, and to set the average fluorescence for each GeneChip to 500.
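
The scaling step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the algorithm used by the Affymetrix software: it assumes a plain arithmetic mean, and the function name `scale_chip` and the sample intensities are made up for the example. Only the target value of 500 comes from the text.

```python
from statistics import mean

TARGET_INTENSITY = 500.0  # target average fluorescence quoted in the text


def scale_chip(intensities, target=TARGET_INTENSITY):
    """Return (scaled_values, scaling_factor) for one chip.

    Assumption: the chip average is a plain arithmetic mean; the real
    software may compute the average differently.
    """
    chip_mean = mean(intensities)
    if chip_mean <= 0:
        raise ValueError("chip has no positive average signal")
    factor = target / chip_mean
    return [value * factor for value in intensities], factor


# Hypothetical raw intensities for a bright chip and a dim chip
bright_chip = [200.0, 400.0, 600.0, 800.0]
dim_chip = [5.0, 10.0, 15.0, 20.0]

bright_scaled, bright_factor = scale_chip(bright_chip)
dim_scaled, dim_factor = scale_chip(dim_chip)

print("bright chip scaling factor:", bright_factor)
print("dim chip scaling factor:", dim_factor)
```

After scaling, both chips have the same average intensity, so the only remaining trace of the dim sample is its much larger scaling factor.
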
After analysis of the data it is important to check the scaling factors that the software used. The best samples will have scaling factors less than 20. Poor quality samples will have high scaling factors, often greater than 100. This is often the most reliable way of judging the quality of Affymetrix data when scaling has been used.

Per-gene normalization

The goal of microarray experiments is to identify the genes whose expression changes under different conditions. Per-gene normalization is necessary to compare the expression profiles of genes that may be expressed at very different levels. An example of this is shown in the figures below, where the samples appear along the X-axis, expression levels are on the Y-axis, and each line represents a different gene. Here, microarrays were used to follow changes in the expression of more than 18,000 human genes. The data have been filtered to show only the 9 genes that were induced the most in samples 5 and 6.

In the figure at left, the normalized data were plotted as fold change. All 9 genes show a similar pattern, and the lines representing the different genes cluster together nicely. In the figure at right, the raw data were plotted for the same 9 genes. Although the pattern is still discernible, the genes do not appear to be clustered or expressed in the same pattern.
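
As a minimal sketch of why per-gene normalization makes patterns comparable, the example below divides each gene's values by that gene's own median across samples. Dividing by the median is an assumption made for illustration; the text does not say which baseline was used to compute fold change. The gene names and values are hypothetical.

```python
from statistics import median


def normalize_per_gene(expression):
    """Divide each gene's values by that gene's median across samples.

    Assumption: the per-gene median is the baseline for fold change.
    """
    normalized = {}
    for gene, values in expression.items():
        gene_median = median(values)
        if gene_median <= 0:
            continue  # no usable baseline for this gene
        normalized[gene] = [value / gene_median for value in values]
    return normalized


# Hypothetical raw data: 8 samples, both genes induced in samples 5 and 6,
# but expressed at levels that differ 100-fold
raw = {
    "gene_low": [10.0, 10.0, 10.0, 10.0, 50.0, 50.0, 10.0, 10.0],
    "gene_high": [1000.0, 1000.0, 1000.0, 1000.0, 5000.0, 5000.0, 1000.0, 1000.0],
}

relative = normalize_per_gene(raw)
for gene, values in relative.items():
    print(gene, values)
```

In the raw data the two genes are far apart on the Y-axis, while after normalization their profiles are identical, which is what allows them to fall into the same cluster.
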
Thus, per-gene normalization is necessary to find genes that have the same expression pattern. Analysis of the raw data is useful for finding genes that are expressed at the same level (e.g. all the highly abundant genes).

Clustering and comparisons

Clustering algorithms work by finding genes that have similar patterns of gene expression. In the example shown above, the 9 genes would form a nice cluster, because all are induced more than 4-fold, relative to the other samples, in samples 5 and 6. Because clustering compares patterns rather than absolute levels, all clustering methods require normalized data. Therefore, the results are highly dependent on how the data were normalized, and on what data are included in the clustering analysis.

The forest vs. trees problem

The biggest problem with clustering is that it appears to offer a one-click solution to finding interesting genes. However, if too many genes are analyzed at once, the clusters become meaningless. This is a type of forest vs. trees problem: the presence of too many irrelevant genes obscures the changes in a few and makes them difficult to detect.

For example, consider the results of K-means clustering shown below. In each case, the default clustering algorithms in GeneSpring were used, set to 5 clusters and 100 iterations. At left is the clustering performed on all 18,000 genes. At right is the same clustering performed on 157 genes, after filtering for those that were more than 2.5-fold induced and expressed at a value of at least 1200 in two or more samples. The clusters on the left are very large (>2500 genes each) and do not distinguish different functional groups of genes. In this case, the forest of unchanging genes has obscured the fact that some genes are changing dramatically.
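
A filter of the kind described above can be sketched as follows. This is an illustration under stated assumptions, not the GeneSpring filter itself: fold induction is taken relative to each gene's median across samples, each of the two conditions must hold in at least two samples, and the function name and data are hypothetical. Only the thresholds (2.5-fold, 1200, two samples) come from the text.

```python
from statistics import median


def filter_genes(expression, min_fold=2.5, min_raw=1200.0, min_samples=2):
    """Keep genes that are induced and well expressed in enough samples.

    Assumption: fold induction is the raw value divided by the gene's
    median across all samples.
    """
    kept = []
    for gene, values in expression.items():
        baseline = median(values)
        if baseline <= 0:
            continue  # fold change is undefined without a positive baseline
        induced = sum(1 for v in values if v / baseline > min_fold)
        expressed = sum(1 for v in values if v >= min_raw)
        if induced >= min_samples and expressed >= min_samples:
            kept.append(gene)
    return kept


# Hypothetical raw data for three genes across 8 samples
raw = {
    "induced_high": [400.0, 400.0, 400.0, 400.0, 2000.0, 2000.0, 400.0, 400.0],
    "induced_low": [10.0, 10.0, 10.0, 10.0, 50.0, 50.0, 10.0, 10.0],
    "flat_high": [3000.0] * 8,
}

kept = filter_genes(raw)
print(kept)
```

Only the gene that is both strongly induced and expressed above the intensity threshold survives. The weakly expressed gene is dropped even though it is induced 5-fold, which illustrates the bias that any fixed threshold introduces.
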
At right, the genes are clearly grouped into small clusters (<100 genes) with similar expression patterns. In particular, the cluster at top left looks very similar to the 9 genes shown in the previous figure. The use of filtering has removed the irrelevant genes, and permitted the formation of useful clusters. This is an example of why filtering is so important.

Defining thresholds

Filtering the data is important for focusing on the most important changes. However, it also introduces some bias. In the example shown above, the filters eliminated all but 157 out of more than 18,000 genes. The thresholds that were chosen were completely arbitrary (>2.5-fold induced in two or more samples, and expressed above 1200 in two or more samples), so this approach could eliminate some genes that really are regulated in an interesting fashion, but that are expressed at very low levels or that are induced to a lesser extent.

The Importance of Replicates

The experiments shown above used four sample types or treatment groups, each analyzed in two independent, replicate experiments. As a result, the analysis can be set up to look specifically for genes whose expression changes in a similar manner in both replicates of a particular type, for example, genes that are induced in both samples 5 and 6 (top left cluster in right panel, above). Researchers planning microarray experiments should plan to analyze at least two independent replicates for each treatment group (i.e. the complete experiment is done twice, generating two independent RNA samples, and independent probes are hybridized to different microarrays).

Summary

Microarray data can be quite complicated, and the experiments can be very expensive. Researchers should think carefully about how the resulting data will be analyzed before beginning an experiment, in order to make the best use of their resources.