SigTerms: a bioinformatics tool for linking gene expression profiling results with gene class associations

 

Instructions for use

Chad Creighton, Ph.D.

Baylor College of Medicine

creighto@bcm.edu



 

Running SigTerms Excel macros

The SigTerms software tool consists of a set of Microsoft Excel macros for use in Excel:

  • FindSignificantTerms - Finds the significantly enriched term classes within a set of genes of interest.
  • CountTermToGene - Tallies up the term-to-gene associations from the Annotation worksheet, in order to generate the Counts worksheet (the Counts worksheet need only be generated once for a new Annotation worksheet).
  • DoSimulationTesting - Runs a number of random simulations in order to measure the true significance of a low enrichment p-value for a term-to-gene set association (helps correct for multiple testing).

 

In order to run the SigTerms macros, first open the SigTerms.xls workbook in Excel, then select Tools->Macro->Macros from the main menu (for 32-bit, pre-Vista Excel) or press Alt+F8.  The list of SigTerms macros will be displayed.  Select the name of the macro you wish to run, and then click the "Run" button.

 

Important note on running Excel macros:  Your copy of Excel must allow macros to be opened and run in order to use the SigTerms software.  To allow macros in 32-bit Excel (pre-Vista), open the Tools->Options dialog from the main menu, click the "Security" button, and set the security level to "Medium."  When opening the SigTerms.xls spreadsheet, select "Enable macros" (not "Disable macros"), if so prompted.

 

Other tips on running Excel macros:

  • Pressing Alt+F11 opens the Visual Basic editor, which includes all of the VBA source code for the macros.
  • To stop a macro while it is running, press Ctrl+Break.

 

Selecting the Annotation workbook

The SigTerms "FindSignificantTerms" macro takes two types of input:

  • A list of selected genes (e.g. genes significantly up-regulated in a particular expression profiling experiment)
  • A dataset of gene-to-class associations

 

The gene-to-class associations are contained within an Excel spreadsheet referred to as the "Annotation" worksheet.  The Annotation worksheet has the following format:

  • The worksheet has the name "Annotation."
  • Genes are listed starting from the second row, one gene per row.
  • The first and second columns may include any relevant information pertaining to the genes.
  • The third column lists the Entrez identifier for each gene.
  • The fourth column may include the gene symbol or title description (this information is included for each gene in the "Terms with Genes" output of the "FindSignificantTerms" macro).
  • Gene class associations for each gene are listed beginning from the fifth column. 
  • Starting from the fifth column, the top row lists the gene class type (e.g. "GO" or "TargetScan_pred").  Each column with at least one class listed for a particular gene should have a class type heading.

 

The Annotation workbook includes the Annotation worksheet and a worksheet named "Counts," which lists each gene class term, along with the total number of times the term occurred in the Annotation worksheet.
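
As a rough illustration of how a Counts-style tally relates to the Annotation layout described above, here is a minimal Python sketch.  It is not the VBA code of the "CountTermToGene" macro.  It assumes the worksheet has been exported as a list of rows of strings, that each term is counted at most once per gene row, and that terms are keyed by their class type heading; the function name count_terms and the sample rows are made up for the example.

```python
from collections import Counter

def count_terms(rows, first_class_col=4):
    """Tally, for each (class type, term) pair, the number of gene rows listing it.

    rows[0] is the header row; gene rows start at rows[1].
    first_class_col=4 is the fifth column (0-based index), where class terms begin.
    Assumption: a term repeated within one gene row is counted once for that gene.
    """
    header, *gene_rows = rows
    counts = Counter()
    for row in gene_rows:
        seen = set()
        for col in range(first_class_col, len(row)):
            term = row[col].strip()
            if term:  # skip empty cells
                class_type = header[col].strip() if col < len(header) else ""
                seen.add((class_type, term))
        counts.update(seen)
    return counts

# Made-up rows in the Annotation layout: two free columns, Entrez ID, symbol, then terms.
rows = [
    ["Info1", "Info2", "Entrez", "Symbol", "GO", "GO", "TargetScan_pred"],
    ["", "", "1001", "GENE_A", "term_a", "term_b", "mir_x"],
    ["", "", "1002", "GENE_B", "term_c", "term_b", ""],
    ["", "", "1003", "GENE_C", "term_b", "", "mir_y"],
]
counts = count_terms(rows)
for (class_type, term), n in sorted(counts.items()):
    print(class_type, term, n)
```

Keying by class type keeps identically named terms from different sources apart; whether SigTerms itself does this is not stated here.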

 

The main page provides links to download pre-compiled Annotation workbooks for several types of gene class associations of potential interest (e.g. Gene Ontology annotations, microRNA targeting predictions, oncogenic signatures, etc.).  Users also have the freedom to create their own Annotation worksheets, using the format specified above.  After creating a new Annotation sheet, users need to run the "CountTermToGene" macro in order to generate the Counts worksheet.

 

Linking gene class terms to gene sets

In order to find all gene class terms (with significance of enrichment) for a given gene set:

  1. Make sure the "SigTerms.xls" spreadsheet is open in Excel.
  2. In Excel, open the appropriate Annotation workbook for your array.
  3. In the Annotation workbook, create a new blank spreadsheet and paste your set of genes of interest in the first column of the new sheet, one gene per row (starting from the first row, no column heading).  For each gene, the Entrez gene identifier must be used.  It is okay to have duplicate gene identifiers in the list (e.g. a gene may be represented by multiple Affymetrix probe sets); where there are duplicates, the gene will be counted only once.
  4. With the worksheet containing the selected gene list active (i.e. in front of any other spreadsheets open in Excel), run the "FindSignificantTerms" macro.  On the displayed form, specify the total population of genes for the purposes of computing enrichment p-values (see "Computing enrichment p-values" for more detail).  You can also limit how many gene-to-term associations are written out by specifying a maximum p-value: the genes falling under a term are all listed only if the term's enrichment p-value is within this maximum (using the default of "0.99," all gene-to-term associations for all terms will be written out).
  5. After filling out the "FindSignificantTerms" form, click the OK button.  Allow time for the macro to complete (in most cases, less than a minute).

 

The "FindSignificantTerms" macro generates two new sheets in the Annotation workbook:

  • The "Enriched Terms" sheet includes, for each term, the number of occurrences of the term in the gene set of interest and the probability (by one-sided Fisher’s exact test) of finding the same number of occurrences or more of the term by chance.
  • The "Terms with Genes" sheet includes the same information as the "Enriched Terms" sheet and, in addition, lists each gene that fell under a given term alongside that term, with one row for each gene-term pair.
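
The matching step behind these two sheets can be pictured with the short Python sketch below.  It only illustrates the bookkeeping (duplicate gene identifiers counted once, one row per gene-term pair) and is not the macro's own code.  The dictionary-based annotation, the function name terms_for_gene_set, and the identifiers are assumptions made for the example.

```python
def terms_for_gene_set(gene_ids, annotation):
    """Return (hits per term, list of (term, gene) pairs) for a gene list.

    annotation: dict mapping an Entrez ID string to the set of terms for that gene.
    Duplicate IDs in gene_ids are counted once; IDs absent from the annotation
    contribute no terms.
    """
    # dict.fromkeys removes duplicates while keeping the original order
    unique_ids = list(dict.fromkeys(g.strip() for g in gene_ids if g.strip()))
    hits = {}
    pairs = []
    for gene in unique_ids:
        for term in sorted(annotation.get(gene, ())):
            hits[term] = hits.get(term, 0) + 1
            pairs.append((term, gene))
    return hits, pairs

# Made-up annotation and gene list ("101" appears twice, "999" is not annotated).
annotation = {
    "101": {"term_a", "term_b"},
    "102": {"term_a"},
    "103": {"term_c"},
}
hits, pairs = terms_for_gene_set(["101", "102", "101", "999"], annotation)
print(hits)
print(pairs)
```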

 

Computing enrichment p-values

For each gene class that is matched with a number of genes in the selected gene set, SigTerms tests whether the class appears a disproportionate number of times within the gene set, i.e. whether the class occurred more times in the user-specified gene set than would be expected in a randomly selected set of genes.  The classical one-sided Fisher’s exact test is used to assess significance of enrichment for each gene class term.
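
For readers who want to check a p-value by hand, the one-sided Fisher's exact test for enrichment is equivalent to the upper tail of the hypergeometric distribution.  The Python sketch below computes that tail directly.  It is an independent illustration, not the SigTerms VBA code, and the function and parameter names are chosen for this example only.

```python
from math import comb

def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact p-value for enrichment.

    k: genes in the selected set that carry the term
    n: number of genes in the selected set
    K: genes in the whole population that carry the term
    N: total number of genes in the population
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    upper = min(n, K)
    # Number of ways to draw n genes with i term-carrying genes, summed over i >= k
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Small worked case: population of 10 genes, 4 carry the term,
# 3 genes selected, 2 of them carry the term.
p = enrichment_p_value(2, 3, 4, 10)
print(round(p, 4))  # (36 + 4) / 120 = 0.3333
```

In the worked case, 36 of the 120 possible 3-gene sets contain exactly two term-carrying genes and 4 contain three, which gives 40/120.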

 

For computing the one-sided Fisher’s exact test, it is important to specify the total gene population from which the gene set was selected.  The total number of genes in the population is used as the denominator for the enrichment calculations.  There are a number of ways that one can choose the gene population, including the following:

  • In the case where the selected genes were from a set of profiling experiments, use the total number of genes represented on the array platform.  In this case, all genes listed in the Annotation worksheet should be represented on the array (any genes not represented should be removed).  Using the number of unique, named genes featured on the array as the input gene population is recommended (i.e. no duplicates, no genes or ESTs without an Entrez identifier, and no RNA probes with ambiguous gene mappings should be counted).
  • Use the total number of genes listed in the Annotation worksheet.  In this case, all genes in the selected list should be represented in the Annotation worksheet (any genes not represented should be removed).

 

The number of genes in the total population is specified in the input form of the "FindSignificantTerms" macro.  If the gene population is not correctly specified, the macro will still run, though the enrichment p-values will not be precise.

 

The main page provides links to download several pre-compiled Annotation workbooks for commonly used profiling arrays (e.g. Affymetrix or Illumina).  Users of a profiling platform represented on the main page may simply download the corresponding Annotation workbook (which includes a list of the unique named genes represented on the array).  With these array platform-specific Annotation workbooks, the worksheet labeled "Gene Pop" has the recommended number of genes to input as the population.  Users of a profiling platform not represented on the main page may download the "all" version of the Annotation worksheet, remove the genes not represented on their array (using the Excel "MATCH" function to identify them), and run the "CountTermToGene" macro to regenerate the Counts worksheet.
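
The filtering step for an unlisted platform can also be done outside Excel.  The following Python sketch shows the idea under stated assumptions: the Annotation rows are available as lists of strings, the Entrez identifier sits in the third column (0-based index 2), and the function name restrict_to_array and the identifiers are hypothetical.  After filtering in this way, the Counts worksheet would still need to be regenerated.

```python
def restrict_to_array(rows, array_gene_ids, entrez_col=2):
    """Keep the header plus only those gene rows whose Entrez ID is on the array."""
    header, *gene_rows = rows
    on_array = {g.strip() for g in array_gene_ids if g.strip()}
    kept = [row for row in gene_rows if row[entrez_col].strip() in on_array]
    return [header] + kept

# Made-up Annotation rows and array gene list ("1003" duplicated, "2000" not annotated).
rows = [
    ["Info1", "Info2", "Entrez", "Symbol", "GO"],
    ["", "", "1001", "GENE_A", "term_a"],
    ["", "", "1002", "GENE_B", "term_b"],
    ["", "", "1003", "GENE_C", "term_a"],
]
filtered = restrict_to_array(rows, ["1001", "1003", "1003", "2000"])

# Number of unique genes left, i.e. one candidate value for the gene population.
gene_population = len({row[2] for row in filtered[1:]})
print(gene_population)
```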

 

Correcting enrichment p-values for multiple testing

For each term, the one-sided Fisher’s exact p-value gives the probability for that term having occurred a given number of times or more within the selected set of genes by chance.  However, if many terms are simultaneously considered for enrichment, multiple testing needs to be taken into account when assessing the "global" significance of any particular term over the hundreds or thousands of terms that may be represented for the entire set of genes under study.  One multiple comparison procedure to address this is to run numerous Monte Carlo simulation tests.  In each test, a set of genes equal in number to the gene set submitted to the "FindSignificantTerms" macro is first randomly selected from a population of genes, and a set of term enrichment p-values for these genes is then calculated.  One can then examine the distribution of p-values generated from each test, in order to estimate the number of terms that may have received a low p-value by chance alone (e.g. how many terms in each test on average received a p-value less than 0.01).
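
The simulation procedure described above can be sketched in a few lines of Python.  This is a simplified illustration and not the "DoSimulationTesting" macro itself.  It assumes the population is given as a dictionary from gene identifier to a set of terms, and the function names, the seed parameter, and the toy population are all made up for the example.

```python
import random
from math import comb

def tail_p(k, n, K, N):
    """One-sided Fisher's exact p-value: P(X >= k), X ~ Hypergeometric(N, K, n)."""
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)

def simulate(annotation, set_size, n_sims, seed=0):
    """Run n_sims random draws of set_size genes; return one sorted p-value list per draw."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    genes = sorted(annotation)
    N = len(genes)
    totals = {}  # term -> number of population genes carrying it
    for terms in annotation.values():
        for t in terms:
            totals[t] = totals.get(t, 0) + 1
    results = []
    for _ in range(n_sims):
        hits = {}  # term -> number of sampled genes carrying it
        for g in rng.sample(genes, set_size):
            for t in annotation[g]:
                hits[t] = hits.get(t, 0) + 1
        results.append(sorted(tail_p(k, set_size, totals[t], N) for t, k in hits.items()))
    return results

# Toy population: 200 genes, each with two made-up terms.
population = {str(1000 + i): {f"class_a_{i % 20}", f"class_b_{i % 13}"} for i in range(200)}
sims = simulate(population, set_size=25, n_sims=100)

# Average number of terms per random gene set with p < 0.01 (expected by chance alone).
below = [sum(p < 0.01 for p in sim) for sim in sims]
print(sum(below) / len(below))
```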

 

To do simulation testing to determine global significance of term enrichment p-values:

  1. In the Annotation workbook, create a new blank spreadsheet and paste a set of genes (by Entrez identifier) that represents the population of genes under study, one gene per row.  (See "Computing enrichment p-values" for more on specifying the gene population.) 
  2. With the newly created sheet active (i.e. at the front of any other spreadsheets open in Excel), run the "DoSimulationTesting" macro.  In the form provided, enter the number of simulations you want to run (e.g. 100 or 1000) and the number of genes you want to sample from your population in each test, which should be the same number of genes that you had input into the "FindSignificantTerms" macro. 
  3. Click the OK button on the form.  Allow several minutes to hours for the macro to complete, depending on how many tests you wanted performed.

 

One or more new sheets will be generated in the current workbook.  Each of the columns in these new sheets will contain a set of p-values generated from a single simulation test (p-values greater than 0.05 for a simulation will not be listed).  You can use the simulation results in order to estimate the true significance of a p-value obtained from your set of genes of interest (e.g. count the number of terms in each test that had a p-value<0.01).
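
One way to summarize the simulation columns is to count, per simulation, the terms below a chosen p-value cutoff and then average those counts.  The Python sketch below does this for p-values copied out of the result sheets into lists.  The function name and the example p-values are made up for illustration and are not output from SigTerms.

```python
def mean_terms_below(sim_p_values, threshold=0.01):
    """Average, over simulations, of the number of terms with p-value below threshold."""
    counts = [sum(p < threshold for p in sim) for sim in sim_p_values]
    return sum(counts) / len(counts)

# Made-up p-values, one inner list per simulation column (an empty list means
# no term reached the listing cutoff in that simulation).
sim_columns = [
    [0.004, 0.03],
    [0.02],
    [0.001, 0.008, 0.04],
    [],
]
expected_by_chance = mean_terms_below(sim_columns, threshold=0.01)
print(expected_by_chance)  # (1 + 0 + 2 + 0) / 4 = 0.75
```

If the real gene set yields many more terms below the cutoff than this average, those terms are less likely to be explained by multiple testing alone.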

 

Notes for Macintosh users

The SigTerms macros have been tested mainly with Microsoft Excel (both 32-bit and 64-bit).  In principle, the software should work with Excel for Macintosh.  We have recently noticed one issue with running the "FindSignificantTerms" macro on Mac OS X version 10.  At the end of the program run, the following error message is generated: "Run-time error '1004': Method 'FreezePanes' of object window failed."  This appears to have to do with the 'FreezePanes' feature not working properly when spreadsheets are in the "Page Layout" view, which has been noted elsewhere on Mac user forums.

 

When running SigTerms on Macintosh, we recommend keeping all open worksheets in the "Normal" view instead of the "Page Layout" view.  If the error message occurs, click "End" on the error message prompt.  All gene-to-term associations will have been successfully generated at this point (the Freeze Panes action comes at the very end), and so the error can be ignored.