Keck-UNM Genomics Resource


Microarray Data Analysis Tutorial

This step-by-step guide is meant as a general description for researchers using the Affymetrix system to generate data, and then using GeneSpring for the analysis. This tutorial describes a basic analysis that should be useful for new users. The thresholds and limits provided here have been used successfully, but are meant only as a starting point. Each user should adjust these for their own purposes.

A PDF version of this tutorial is available for download.

Prepared by Scott A. Ness

            Associate Professor, Molecular Genetics & Microbiology

            Faculty Director, Keck-UNM Genomics Resource

            University of New Mexico HSC, Albuquerque, NM   87131

 


Jump to Specific Sections:

Experiment and Data

Downloading Data from the NCBI GEO Repository

Import Data into GeneSpring

Normalizing the Data

Define Parameters

Define the Experiment Interpretation

Analyze the GeneSpring Data

Use Present/Absent Flags to Identify the Expressed Genes

Advanced Filtering: Filter on Fold Change

Statistical Test: 1-Way Anova

Venn Diagram Tool

Inspecting Gene Lists

Generating Gene Trees (Heat Maps)

Using the GeneSpring Data

Copy an Annotated Gene List to Excel

Getting Gene Annotation Information from Affymetrix

Use VLOOKUP to Link the Data and the Annotations in Excel


Experiment and Data

For this tutorial, we will use data from the Ness laboratory that is publicly available from the NCBI GEO data repository.

 

The files come from an Affymetrix microarray analysis of gene expression changes induced by using a recombinant adenovirus to express the c-Myb transcription factor in human MCF-7 cells.

 

To download the data, go to the GEO web site: http://www.ncbi.nlm.nih.gov/projects/geo/

In the “Query” section in the middle of the page, type “GSE1318” (without the quotes) into the box labeled “GEO accession”, then push “GO”.

 

GSE1318 is a data series, a collection of 25 microarray data sets (GSM21610 – GSM21638). To simplify things and make them go faster, we will only use the first six for this tutorial (GSM21610 – GSM21615).



Downloading Data from the NCBI GEO Repository

The GSE1318 web page has links for the authors involved in generating the data, a PubMed citation (15105423) describing the data, the e-mail address of the submitter (ness@unm.edu) and other information.

At the bottom of the page is part of a table showing the data structure. This table lists normalized and raw values for each type of sample. However, the data in the table already has the replicate samples averaged together. We will start with the original data for three sets of replicates instead.

We need to download the data sets GSM21610 – GSM21615, and there are links for each set just above the partial table. Use the following steps to download each data set:

1.     Click on the link for the appropriate data set (e.g. GSM21610)

2.     From the new page, use the pull-down menu to set the “Scope” to “Samples”

3.     Set the “Format” to “SOFT” and the “Amount” to “Full”

4.     Push “GO” to download the data to your computer

5.     In the “GEO accession” box, change the accession number to the next one (e.g. GSM21611), push “GO” to download, then repeat for the rest of the samples

Put all the newly downloaded files in one folder.

The samples are the following:

GSM21610 and GSM21611:  Control cells, uninfected MCF-7

GSM21612 and GSM21613:  Control cells, infected with a control (GFP-only) adenovirus

GSM21614 and GSM21615:  Experimental cells, infected with an adenovirus expressing c-Myb and GFP
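
For readers who want to inspect a downloaded file outside GeneSpring, the following is a minimal Python sketch of reading the data table from a SOFT-format sample file. It assumes the table sits between the "!sample_table_begin" and "!sample_table_end" marker lines, which is the usual SOFT layout. The column names (ID_REF, VALUE, ABS_CALL), the accession and the probe names in the example are made up for illustration and may not match the columns in the GSE1318 files.

```python
import io


def parse_soft_sample(lines):
    """Return (column_names, rows) for the data table of a SOFT sample file.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    header = None
    rows = []
    in_table = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        marker = line.lower()
        if marker.startswith("!sample_table_begin"):
            in_table = True
            continue
        if marker.startswith("!sample_table_end"):
            break
        if not in_table or not line:
            continue  # skip the metadata lines that precede the table
        fields = line.split("\t")
        if header is None:
            header = fields  # first table line holds the column names
        else:
            rows.append(dict(zip(header, fields)))
    return header, rows


# Made-up miniature file, for illustration only.
example = io.StringIO(
    "^SAMPLE = GSM00000\n"
    "!Sample_title = made-up example\n"
    "#ID_REF = probe set identifier\n"
    "!sample_table_begin\n"
    "ID_REF\tVALUE\tABS_CALL\n"
    "probe_1_at\t152.3\tP\n"
    "probe_2_at\t12.9\tA\n"
    "!sample_table_end\n"
)
columns, table = parse_soft_sample(example)
print(columns)
print(table[0])
```

The metadata lines skipped here are the same lines that GeneSpring skips when it looks for the start of the data during import.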

 



Import Data into GeneSpring

Analyzing microarray data in GeneSpring is a multi-step process. First the data is imported, which may require creating a “genome” or gene collection. In most cases the “genomes” for commonly used microarrays like Affymetrix arrays are available from Agilent, but here we will create our own.

After the data is imported, it must be normalized, and then it can be analyzed to find interesting genes.

Generate Genome

Open GeneSpring. Choose File > Import Data

Navigate to the folder with your downloaded files and choose one of them.

The “Import Data: Define File Format and Genome” window will appear. The file format at the top should be Custom

Your copy of GeneSpring may or may not have a list of previously used genomes in the middle of the window, which you can ignore. If you have the correct genome installed you could choose it in the window and click on Next at the bottom.

At the bottom, choose “Create a New Genome” then enter a suitable name (e.g. GEOfiles) and click on Next


The next window should be the “Import Data: Column Editor”. GeneSpring will probably figure out where the data begins and should label the top of the data columns in red. You will still need to choose the pull down menus at the top of each column. Label the first column “Gene Identifier”, the next one “Signal” and the last one “Flags”.

If GeneSpring does not figure out where the data should start, use the up or down arrows near the middle bottom to change the number of skipped lines until you get the right result.

The Flag values, at the bottom center, should be P for Present, A for Absent and M for Marginal. GeneSpring will probably figure that out on its own.

When you are ready click on Next.

 

In the next window, make sure all your imported files are added to the right most window by selecting each one and clicking on the “Add” button. When the correct files are ready, click on Next.

 

GeneSpring will do some calculating, then a “Sample Attributes” window will appear. We will just skip this part. Simply click on Next and go on.

GeneSpring will do some more calculating, then it will inform you that 6 new samples have been created and ask whether it should create a new experiment. Click on Yes.

After some more calculating a window like the one shown below will appear. Give your experiment a name (e.g. MyExperiment) at the top.

Create a new folder by typing its name in the Folder box (e.g. BioMed516), then click on Save at the bottom.

 

Now the data has been imported, and you are ready for normalizing.




Normalizing the Data

The Experiment Checklist window will appear, showing the steps that remain. Click on Normalizations

 

The Experiment Normalizations window will appear with three types of default normalizations already entered. This window is like a protocol, with a list of options at the left and the steps that will be followed in the middle.

 

The first default step is Data Transformation, which will convert any negative numbers to 0.01. That is not really necessary for our data, but will not hurt anything, so just leave it alone.

The next default step is “Per Chip: Normalize to 50th percentile”. Our data was already scaled by the Affymetrix software, so this step is unnecessary. We do not want to scale or perform “per chip” normalization twice, so select that line and click on Delete.

The last default step is “Per Gene: Normalize to median”. This is the fall-back normalization method that one does if nothing better is available. However, our experiment has real controls, so we do not want that normalization step. Select that line and choose Delete.

GeneSpring will give you some warnings at the bottom, saying that no per chip or per gene normalizations have been applied, but you can ignore those.

Now, we want to normalize our data to the median of the control values, which are in the first four data sets, GSM21610 – GSM21613. From the menu at left, double-click on “Per Gene: Normalize to specific samples”

A new window should appear that looks like the one shown below:

On the left side, click the “Check All” button to select all the samples. This indicates that we want to normalize all the samples.

On the right side, click the little boxes next to the first four samples, GSM21610.txt – GSM21613.txt

When you are ready, your window should look like the one shown above. Click on OK. GeneSpring will calculate the median expression for each gene in the first four samples, then use that value to normalize all the samples. So our data will be represented as fold change, relative to the median of all four control samples.
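
The arithmetic behind this step can be sketched in a few lines of Python. This is only an illustration of "divide every sample by the median of the control samples", not GeneSpring's actual code. The signal values, the gene name and the 0.01 floor (borrowed from the Data Transformation step described above) are assumptions made for the example.

```python
from statistics import median

FLOOR = 0.01  # assumption: mirrors the Data Transformation step for non-positive values


def normalize_to_controls(signals, control_names):
    """Divide each gene's signal in every sample by the median of its control signals."""
    normalized = {}
    for gene, by_sample in signals.items():
        floored = {name: max(value, FLOOR) for name, value in by_sample.items()}
        control_median = median(floored[name] for name in control_names)
        normalized[gene] = {
            name: value / control_median for name, value in floored.items()
        }
    return normalized


# Made-up raw signals for one gene across the six samples.
signals = {
    "gene_a": {
        "Uninf1": 100.0, "Uninf2": 120.0,
        "AdCont1": 80.0, "AdCont2": 100.0,
        "c-Myb1": 400.0, "c-Myb2": 380.0,
    },
}
controls = ["Uninf1", "Uninf2", "AdCont1", "AdCont2"]
normalized = normalize_to_controls(signals, controls)
print(normalized["gene_a"])
```

With these made-up numbers the control median is 100, so the c-Myb1 value of 400 becomes a 4-fold change relative to the controls.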

Back in the Normalizations window, you should only have the original Data Transformation and the new Per Gene listings. When you are ready, choose OK.

GeneSpring will do some calculating, then the Checklist window will appear again. Now there will be a checkmark next to Normalizations.




Define Parameters

Now click on Parameters, and you should see a window like the one shown below:

Parameters are ways of labeling the data and of defining which ones are replicates for statistical analyses. First, give the samples some better names. Click on “New Parameter”. When the menu window appears, choose “Custom Parameter” and click OK.

In the top row, label this new column “type”. Label the first two samples Uninf1 and Uninf2, the second two AdCont1 and AdCont2 and the last two c-Myb1 and c-Myb2

The first two of these samples are control cells, not infected with anything, so they are labeled Uninf1 and Uninf2

The second two samples were infected by a control adenovirus expressing only GFP. They should be labeled AdCont1 and AdCont2

The last two samples were cells infected by an adenovirus expressing c-Myb. They should be labeled c-Myb1 and c-Myb2

Note that each sample has a different name, so GeneSpring does not know that any of them are replicates, yet.

Now create another new parameter, label this one groups.

Give the samples the same names as before, but leave off the numbers. Now each pair of samples has identical names. We will use these groups later to identify replicates and do some statistics.

Finally, we will order the samples to make the figures look better. Click in the box above the “type” label to select the whole column, then choose “Set Value Order” at bottom right.

In the new window, select the Uninf1 and Uninf2 labels and use the buttons at right to move them to the top. The c-Myb1 and c-Myb2 samples should be at the bottom. Click OK

 

Click Save in the Parameters window to get back to the Checklist.

 



Define the Experiment Interpretation

From the Checklist, select the Experiment Interpretation button

The Experiment Interpretation window does a lot. For now, just make three changes:

First, under “Analysis” near top center is a pull-down menu labeled “Mode” which by default says “Log of Ratio”. For our purposes, the best option is a linear scale, so change the pull-down menu to “Ratio (signal/control)”. This label does not make sense for our data, but the result will be what we want.

Second, near the top, change the Upper Bound to 30

Finally, in the “How to Display Parameter” section near the middle, there are several rows of buttons, one row for each parameter defined previously. Set the File Name parameter to “Do Not Display” and the “type” parameter to “Continuous”

Your window should look like the one shown below. When it does, choose Save at the bottom.

 

 

Back at the Checklist, just ignore the Error Model and choose Close at the bottom.

The main GeneSpring window should appear and it should look like the one shown below:

The colors in your copy of GeneSpring may be different than the ones shown here. You can either leave them as they are or change them in the Edit > Preferences menu.

The main window display is like a browser window. There are folders and files listed along the left side, data in the middle and a color bar on the right. At the moment, GeneSpring is in “Blocks” mode, showing one little box or line for each gene. The lines are colored depending on how they are expressed, relative to the first sample. At the bottom of the central window is a slider that you can move to change the coloring to be relative to some other sample.

For a line plot, go to the View menu and choose Graph

The window should now look like this:

 

Now GeneSpring and the data are ready for some analysis. We will not go through all the features of GeneSpring, which are many, but we will touch on some of the most useful and basic ones that should suffice to get most users started.

 




Analyze the GeneSpring Data

Use Present/Absent Flags to Identify the Expressed Genes

Before doing any real analyses, we will identify the genes that are expressed at a statistically significant level above background in at least two samples. The best way to do this is by using the Affymetrix Present/Absent flags. The Affymetrix system actually measures each gene with 12 independent perfectly matched probes and 12 corresponding mis-matched probes. It subtracts the values of the mis-matched probes from those of the matched probes, then does statistics to see whether, based on 12 independent measurements, the difference between the signals of the matched and mis-matched probes is statistically significant, using a t-test like measure. It uses these measures to label the genes P for present, M for marginal or A for absent. GeneSpring can find the genes that are labeled P in at least two samples.

From the Filtering menu choose Filter on Flags

In the filtering window, make sure GeneSpring is starting with all genes, and scanning all six samples (which should be the defaults).

Beneath the graph, set the pull-down menu to Present

Just beneath that, change it so the “Value must appear in at least” 2 out of 6 samples, then choose Save

Another window will appear so you can name the list you are creating. It should already be named “Flags are Present”. Choose Save then close the Filter on Flags window.

 

As shown in the figure above, GeneSpring found 4,448 genes that were marked P in at least two samples, out of 12,625 genes in the whole genome. So this simple filter eliminated almost 65% of the genes: those that were not expressed in the MCF-7 cells and that do not need to be considered in the rest of our analyses.
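
The logic of the flag filter, and the percentage quoted above, can be checked with a short Python sketch. The probe names and flag calls below are made up for illustration. Only the 4,448 and 12,625 counts come from this tutorial.

```python
def filter_on_flags(flags_by_gene, wanted="P", min_samples=2):
    """Keep genes whose flag equals `wanted` in at least `min_samples` samples."""
    kept = []
    for gene, calls in flags_by_gene.items():
        if sum(1 for call in calls if call == wanted) >= min_samples:
            kept.append(gene)
    return kept


# Made-up flag calls for four probe sets across the six samples.
flags_by_gene = {
    "probe_1_at": ["P", "P", "P", "P", "P", "P"],
    "probe_2_at": ["A", "A", "P", "A", "M", "P"],
    "probe_3_at": ["A", "P", "A", "A", "M", "A"],
    "probe_4_at": ["A", "A", "A", "A", "A", "A"],
}
present = filter_on_flags(flags_by_gene)
print(present)

# Share of the genome removed by the filter, using the counts from the tutorial.
percent_removed = 100 * (1 - 4448 / 12625)
print(round(percent_removed, 1))
```

Note that a Marginal call does not count as Present in this sketch, which matches choosing only "Present" in the pull-down menu.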




Advanced Filtering: Filter on Fold Change

Next, we will identify genes that are expressed up or down at least 2-fold when c-Myb is expressed in the MCF-7 cells.

From the Filtering menu, choose Advanced Filtering

The Advanced Filtering window lets us add sequential filters. The resulting genes must pass all the conditions we apply.

First, limit the search to the Present genes. From the left side of the window, double-click on “Filter on Gene List”. Find the list “Flags are Present” on the left side of the new window and click on “Choose Gene List”. The name “Flags are Present” should appear at top center of the window and the graph should show 4,448 genes that pass the filter. Then choose OK at the bottom.

Back in the Advanced Filtering window, our Flags are Present filter is at the top of the list. Only the genes in that list will be used for the subsequent filters.

 

Now, on the left side, double-click on “Filter on Fold Change”.

In the next window, if necessary, click on the little arrows next to the experiment name so that all the samples are visible (see image below). Click on the “type c-Myb1” sample at left, then the “Choose Condition 1” button at top center. We want to find all the genes up or down regulated in this sample, relative to the four controls.

At the left, click on the name “Default Interpretation” to select all the samples, then on the “Choose Condition 2” button in the center of the window. The filter is now set to compare the first sample against the other 5, so we must remove the c-Myb2 sample from this comparison.

Click on the Add/Remove button at right.

In the lower part of the window, click on the “type c-Myb2” line, then choose the “Remove” button in the middle, leaving only the four control samples listed in the bottom panel, and click OK.

 

 

The Filter on Fold Change window now shows that the type c-Myb1 sample will be compared to 4 selected conditions.

At the bottom of the window, make sure the Fold Difference is set to 2 (it should be).

Change the value in the “Difference must appear in at least” window to 4

The filter will now find genes that are up or down expressed at least 2-fold in the c-Myb1 sample compared to all four control samples.

Choose OK at the bottom to return to the Advanced Filtering window.

 

 

We created a filter that compares the c-Myb1 sample to all four controls. Now we need one that does the same for the c-Myb2 sample.

In the Advanced Filtering window, select the new Filter on Fold Change filter line by clicking on it once, then choose the “Duplicate” button at the right.

Double-click on the new line to edit it.

On the left side, highlight the c-Myb2 sample, then click the “Choose Condition 1” button.

At the bottom, make sure the Fold Difference is still set to 2 and Difference must appear in at least 4 of the 4 conditions, then choose OK.

Now, back in the Advanced Filtering window, choose Start at the bottom.

A New Gene List window will open, showing 25 genes. At the top, change the name to “c-Myb Filtering” then click on Save at the bottom.

Close the Advanced Filtering window.

When the main window appears, only the data for those 25 genes will be visible. If you ever want to, you can switch back to see all the genes by choosing the “all genes” list at top left. GeneSpring only displays the data for the list of genes that is currently selected in the left panel.
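
The combined effect of the two fold-change filters can be sketched in Python as follows. This is an illustration of the rule described in this section (each c-Myb replicate must differ at least 2-fold, up or down, from all four controls), not GeneSpring's implementation. The gene names and normalized values are made up, and all values are assumed to be positive.

```python
def passes_fold_change(values, test_name, control_names, fold=2.0, min_hits=None):
    """True if the test sample differs `fold`-fold (up or down) from enough controls."""
    if min_hits is None:
        min_hits = len(control_names)  # "4 of the 4 conditions"
    hits = 0
    for control in control_names:
        ratio = values[test_name] / values[control]
        if ratio >= fold or ratio <= 1.0 / fold:
            hits += 1
    return hits >= min_hits


controls = ["Uninf1", "Uninf2", "AdCont1", "AdCont2"]
# Made-up normalized values for three genes.
genes = {
    "up_gene": {"Uninf1": 1.0, "Uninf2": 1.1, "AdCont1": 0.9, "AdCont2": 1.0,
                "c-Myb1": 4.0, "c-Myb2": 3.6},
    "mixed_gene": {"Uninf1": 1.0, "Uninf2": 1.1, "AdCont1": 0.9, "AdCont2": 1.0,
                   "c-Myb1": 4.0, "c-Myb2": 1.5},
    "down_gene": {"Uninf1": 1.0, "Uninf2": 1.1, "AdCont1": 0.9, "AdCont2": 1.0,
                  "c-Myb1": 0.4, "c-Myb2": 0.42},
}

# A gene is kept only if BOTH replicate filters pass, as with sequential filters.
selected = [
    gene for gene, values in genes.items()
    if passes_fold_change(values, "c-Myb1", controls)
    and passes_fold_change(values, "c-Myb2", controls)
]
print(selected)
```

The "mixed_gene" example is dropped because only one of its two replicates changes 2-fold, which is the situation the duplicated filter is designed to catch.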

 




Statistical Test: 1-Way Anova

The 1-Way Anova test uses the replicates to find genes that are significantly different between groups. Earlier, we defined a parameter (groups) that defined the uninfected, control virus-infected and Myb virus-infected samples. Now we can use those groups with the 1-Way Anova test.

We will only analyze the genes that were expressed at statistically significant levels (Present).

In the main window, click on the “Flags are Present” list in the left panel, to restrict our search. Then choose “Statistical Analysis (ANOVA)” from the Tools menu.

The “Flags are Present” gene list should appear at the top of the Statistical Analysis (ANOVA) window. If not, select that list in the left panel and click on the “Choose Gene List” button at the top center.

In the middle of the window, select the pull-down menu labeled “Parameter to Test” and select the “groups” parameter that we created earlier. This is the parameter that links the replicate samples.

Next to “Multiple Testing Correction”, select the pull-down menu and change it to “None”.

Check to make sure that the P-value Cutoff is set to the default of 0.05

 

When everything is ready, click on Start at the bottom.

A new Gene List window should appear with about 69 genes that were selected. Click on Save at the bottom and then close the Statistical Analysis (ANOVA) window.

For this type of analysis, the value of the ANOVA test is limited, since there are very few replicates (only two). Next we will compare the gene lists identified by the filtering and ANOVA methods.
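
To make the test concrete, here is a minimal Python sketch of a textbook one-way ANOVA for a single gene measured in three groups of two replicates. It is not necessarily the variant that GeneSpring applies, and the expression values are made up. The p-value function uses a closed form that is valid only when there are exactly three groups (two degrees of freedom between groups), which is the case in this tutorial.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of measurements."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2 for group, mean in zip(groups, means)
    )
    ss_within = sum(
        (value - mean) ** 2 for group, mean in zip(groups, means) for value in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    # Note: fails with a division by zero if all replicates are identical.
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def p_value_three_groups(f_stat, df_within):
    """Upper-tail p-value of the F distribution, valid ONLY when df_between == 2."""
    return (1.0 + 2.0 * f_stat / df_within) ** (-df_within / 2.0)


# Made-up normalized values: uninfected, control virus, c-Myb virus.
changed = [[1.0, 1.2], [0.9, 1.1], [3.0, 3.4]]
unchanged = [[1.0, 1.2], [0.9, 1.1], [1.0, 1.3]]

f_changed, df_b, df_w = one_way_anova_f(changed)
p_changed = p_value_three_groups(f_changed, df_w)
f_unchanged, _, _ = one_way_anova_f(unchanged)
p_unchanged = p_value_three_groups(f_unchanged, df_w)
print(f_changed, p_changed)
print(f_unchanged, p_unchanged)
```

With two replicates per group there are only three within-group degrees of freedom, so the estimate of the replicate variance is very rough. That is one way to see why the ANOVA result is of limited value here.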

 




Venn Diagram Tool

GeneSpring has a Venn Diagram tool that offers a quick way to compare different gene lists.

From the main window, right-click on the gene list labeled “1-Way Anova” and choose “Venn Diagram > Left (Red)” from the pop-up menu.

Right-click on the “c-Myb Filtering” gene list and choose “Venn Diagram > Right (Green)” from the pop-up menu.

Finally, click on the “all genes” list at left, to populate the Venn Diagram with all the data.

The circles in the Venn Diagram show the overlap in the different gene lists. As shown in the figure, there are 8 genes shared by both the 1-Way Anova and c-Myb Filtering lists (in the yellow area in the middle).

Right-click on the yellow area containing the 8 shared genes and choose “Make list of genes in both lists” from the pop-up menu. Save the gene list in the next window.

Go back to the Venn Diagram. You may need to select “all genes” from the left panel again to re-populate the diagram, then right-click on the yellow area again but this time choose “Make list of genes in either list”. Save the new gene list as before.

In the main window, choose “Color By Expression” from the Colorbar menu to hide the Venn Diagram.

Now we have gene lists containing all the identified genes and also a list of genes found by both methods.
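
The two lists made from the Venn Diagram correspond to a set intersection and a set union. The following Python sketch uses made-up probe identifiers, arranged so that the list sizes match the counts in this tutorial (69 ANOVA genes, 25 filtered genes, 8 shared).

```python
# Made-up identifiers: probe_0 .. probe_68 (69 genes) and probe_61 .. probe_85 (25 genes).
anova_genes = {f"probe_{i}" for i in range(69)}
filtering_genes = {f"probe_{i}" for i in range(61, 86)}

in_both = anova_genes & filtering_genes     # "Make list of genes in both lists"
in_either = anova_genes | filtering_genes   # "Make list of genes in either list"

print(len(anova_genes), len(filtering_genes))
print(len(in_both), len(in_either))
```

The size of the combined list follows from the other three numbers: 69 + 25 - 8 = 86 genes.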

 




Inspecting Gene Lists

Detailed information about any gene is available by inspecting the gene lists. From the main window, double-click on one of the gene list names (e.g. c-Myb Filtering) to open the Gene List Inspector Window.

The upper left panel of this window shows general information about this gene list. A graphical representation of the data is at upper right, and a list of the genes in the list is at the bottom.

Double-click on the first gene in the list to open the Gene Inspector window.

If our genome were correctly annotated, the upper left panel of this window would have information like GenBank ID, gene function, etc. (but it isn’t, so it doesn’t). The upper right panel contains the actual data for this gene in a table and shows values for Normalized (fold change), Control (median value for the control samples used in the normalization), Raw (the actual Affymetrix expression score), t-test p-value (which is an internal GeneSpring calculation and doesn’t mean much here) and Flags (P for Present, M for Marginal and A for Absent).

Especially important is to look at the Raw numbers. A gene might be listed as highly induced, but nevertheless be expressed at levels that are near background.

Check out the Normalized and Raw values for a few genes to get an idea of what genes look like that are really induced or repressed.

When you are done, close all the Inspector windows and go back to the main window.




Generating Gene Trees (Heat Maps)

A gene tree or heat map is a graphical means of comparing many genes and samples at one time.

In the main window, select the gene list that was created using the Venn Diagram that is the combination of both the 1-Way ANOVA and the c-Myb Filtering lists (e.g. 1-Way ANOVA or c-Myb Filtering). This gene list should have about 86 genes.

From the Tools menu, choose “Clustering > Gene Tree”

In the Clustering window, make sure the correct gene list is selected. If not, find it in the left panel and click on “Choose Gene List”. Everything else should be set properly by default. Choose Start at the bottom. If the gene list at the top of the window is large (e.g. all genes), this may take a very long time. Otherwise, it should finish in a few seconds.

Click on Save in the next window to create the gene tree.

In the gene tree, each gene is represented by a row and each sample is a column. The colors indicate the fold change (Normalized expression). In this case, the darkest red is 6-fold up-regulated, the darkest blue is 10-fold down-regulated and white indicates unchanged (the color settings in your copy of GeneSpring may be different, see the color bar at right for the scale).

The gene tree shown here shows both replicates for each sample. The most significant genes are the ones that are regulated the same way in both replicates (e.g. red or blue in both of the c-Myb samples).

Double-click on any row to see the detailed information for that gene. Compare some genes that look like they are more or less significantly regulated.



Using the GeneSpring Data

Copy an Annotated Gene List to Excel

An annotated gene list contains the names of the genes and associated data.

If it is not already selected, choose the combined list (e.g. 1-Way ANOVA or c-Myb Filtering) from the left panel of the main window.

From the Edit menu, choose “Copy > Copy Annotated Gene List”

Expand the next window so you can see all the options.

Make sure the “Average” box in the Raw Data section is checked, then choose “Save to File”

Give your file a name (e.g. GeneList) and save it to your computer.

Now you can quit GeneSpring.

Open Microsoft Excel and then use the File > Open command to open your saved gene list file. In the Text Import Wizard, click Next twice and then Finish. Your data should appear in a spreadsheet.

Each sample should have two columns of data marked Normalized and Raw. These are the fold change and actual expression level data from GeneSpring. Save this file for use later.

 



Getting Gene Annotation Information from Affymetrix

In the Excel spreadsheet, select just the gene names from the left most column (not the column headers, just the names), then copy the gene names by choosing Copy from the Edit menu.

Open a new spreadsheet in Excel and paste in the copied values. Save this spreadsheet as “Text Tab-delimited” with a new name (e.g. genes.txt). Just click through the two warning boxes that Excel gives you.

Go to the Affymetrix web site (http://www.affymetrix.com)

Click “login” at the top. You can register to get your own free Affymetrix account or just enter sness in both login fields, then click on the login button.

Click on the Analysis or NetAffx link at left center of the page.

Under “Tools and Annotations” select the NetAffx Analysis Center link

Select “Batch Query” from the Expression menu near the top along the left side.

In section 1, set the GeneChip Array type to “Human Genome U95 Set” from the first menu.

Section 2 should say “Probe Set ID” by default.

In Section 3, click on the “Choose File” button, then navigate to your newly-saved “genes.txt” file, saved in tab-delimited text.

Section 4 should say Annotation List by default.

Click on “search” at the bottom.

When the results come back, click on the “Export” link near top center of the window.

On the next screen, make sure the File Format setting near the bottom is set to “TSV” then click on the Export button at the bottom.

If your browser asks you to save a file at this point, go ahead. Otherwise, it may open a window full of text. In that case, choose “Select All” from the Edit menu to highlight the whole window, then choose Copy from the Edit menu to copy the text.

Go to the Excel spreadsheet containing the data (from the annotated gene list). Go to a new worksheet (a new tab at the bottom) and paste in the annotations from Affymetrix. You should now have a single Excel workbook with two worksheets, one that has the GeneSpring data from the combined gene lists and the other that has the corresponding Affymetrix annotations.




Use VLOOKUP to Link the Data and the Annotations in Excel

In Excel, go to the worksheet with the annotations. Select all the cells of data (columns A – H, rows 2 through the bottom of the data)

From the Data menu, choose Sort, then click OK to sort all the data by the gene identifiers.

With the data still selected, choose Name > Define from the Insert menu.

In the little window that appears, change the name to AnnotationTable, then click Add.

Go to the worksheet with the GeneSpring data, and scroll across to the last column of data.

Delete any unused column headers, then label two new columns “Name” and “Description”

Now you need to enter a formula in the top row of each of these columns. In this example, the top row of data is row 4. If yours is different, change the formula accordingly.

In the first row of the Name column, enter the following formula (without quotes):

“=vlookup(A4,AnnotationTable,3)”

When you hit return, the name of the gene should appear.

The vlookup formula used here has three values separated by commas in the parentheses. The first value (in this case A4) is the location of the data that should be looked up (in this case the gene identifier). The second value is the name (defined above) of the table containing the data that needs to be looked up. The gene identifier must appear in the first column of the named data table. Because the optional fourth value is left out, Excel performs an approximate match, so the first column of the data table must be sorted for the lookup to work properly. The third value indicates the column number from which the data should be looked up (in this case column 3, which has the gene names).

In the first row of the Description column, enter this formula (without quotes):

“=vlookup(A4,AnnotationTable,2)”

This is the same formula as before, but the column number is changed to 2.

Now select the first two cells you just filled in, grab the lower right corner of the selection area and drag down to fill all the other rows in the table. All the annotation values should fill in.

Save your file. Now your data table should have the corresponding gene names and descriptions.
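
For readers who prefer a script, the same join can be sketched in Python with a dictionary keyed on the gene identifier. The identifiers, names and descriptions below are made up for illustration, and the column positions (description in column 2, name in column 3) simply follow the formulas used above. Unlike a three-value vlookup, which does an approximate match and can return a neighboring row when an identifier is missing from the table, this sketch does an exact match and leaves the cells empty instead.

```python
def annotate(data_rows, annotation_rows, name_col=2, desc_col=1):
    """Append Name and Description to each data row by exact match on column 0."""
    lookup = {row[0]: row for row in annotation_rows}
    annotated = []
    for row in data_rows:
        match = lookup.get(row[0])
        name = match[name_col] if match else ""
        description = match[desc_col] if match else ""
        annotated.append(row + [name, description])
    return annotated


# Made-up rows: [identifier, normalized, raw] and [identifier, description, name].
data_rows = [
    ["probe_1_at", "4.0", "400"],
    ["probe_9_at", "0.4", "35"],
]
annotation_rows = [
    ["probe_1_at", "example description one", "GENE1"],
    ["probe_2_at", "example description two", "GENE2"],
]
result = annotate(data_rows, annotation_rows)
for row in result:
    print(row)
```

The second data row has no matching annotation, so its Name and Description stay blank. In Excel it is worth spot-checking a few rows for the same reason.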

 

Summary

This completes the basic tutorial and should prepare you to analyze your own data, and to design and interpret microarray experiments.

Good luck.
