affygcrma

Perform GC Robust Multi-array Average (GCRMA) procedure on Affymetrix microarray probe-level data

Syntax

Expression = affygcrma(CELFiles, CDFFile, SeqFile)
Expression = affygcrma(ProbeStructure, Seq)
Expression = affygcrma(CELFiles, CDFFile, SeqFile, ...'CELPath', CELPathValue, ...)
Expression = affygcrma(CELFiles, CDFFile, SeqFile, ...'CDFPath', CDFPathValue, ...)
Expression = affygcrma(CELFiles, CDFFile, SeqFile, ...'SeqPath', SeqPathValue, ...)
Expression = affygcrma(..., 'ChipIndex', ChipIndexValue, ...)
Expression = affygcrma(..., 'OpticalCorr', OpticalCorrValue, ...)
Expression = affygcrma(..., 'CorrConst', CorrConstValue, ...)
Expression = affygcrma(..., 'Method', MethodValue, ...)
Expression = affygcrma(..., 'TuningParam', TuningParamValue, ...)
Expression = affygcrma(..., 'GSBCorr', GSBCorrValue, ...)
Expression = affygcrma(..., 'Median', MedianValue, ...)
Expression = affygcrma(..., 'Output', OutputValue, ...)
Expression = affygcrma(..., 'Showplot', ShowplotValue, ...)
Expression = affygcrma(..., 'Verbose', VerboseValue, ...)

Input Arguments

CELFiles

Any of the following:

  • Character vector or string specifying a single CEL file name.

  • '*', which reads all CEL files in the current folder.

  • ' ', which opens the Select CEL Files dialog box from which you select the CEL files. From this dialog box, you can press and hold Ctrl or Shift while clicking to select multiple CEL files.

  • Cell array of character vectors or string vector containing CEL file names.

CDFFile

Either of the following:

  • Character vector or string specifying a CDF file name.

  • ' ', which opens the Select CDF File dialog box from which you select the CDF file.

SeqFile

Either of the following:

  • Character vector or string specifying a file name of a sequence file (tab-separated or FASTA) that contains the following information for a specific type of Affymetrix® GeneChip® array:

    • Probe set IDs

    • Probe x-coordinates

    • Probe y-coordinates

    • Probe sequences in each probe set

    • Affymetrix GeneChip array type (FASTA file only)

    The sequence file (tab-separated or FASTA) must be on the MATLAB® search path or in the Current Folder (unless you use the SeqPath property). In a tab-separated file, each row represents a probe; in a FASTA file, each header represents a probe.

  • An N-by-25 matrix of sequence information, such as returned by affyprobeseqread.

Seq

An N-by-25 matrix of sequence information, such as returned by affyprobeseqread.

ProbeStructure

MATLAB structure containing information from the CEL files, including probe intensities, probe indices, and probe set IDs, returned by the celintensityread function.

CELPathValue

Character vector or string specifying the path and folder where the files specified in CELFiles are stored.

CDFPathValue

Character vector or string specifying the path and folder where the file specified in CDFFile is stored.

SeqPathValue

Character vector or string specifying the path and folder where the file specified in SeqFile is stored.

ChipIndexValue

Positive integer specifying a chip. The sequence information and mismatch (MM) probe intensity data from this chip are used to compute probe affinities. Default is 1.

OpticalCorrValue

Controls the use of optical background correction on the input probe intensity values. Choices are true (default) or false.

CorrConstValue

Value that specifies the correlation constant, rho, for log background intensity for each PM/MM probe pair. Choices are any value ≥ 0 and ≤ 1. Default is 0.7.

MethodValue

Character vector or string that specifies the method to estimate the signal. Choices are 'MLE', a faster, ad hoc Maximum Likelihood Estimate method, or 'EB', a slower, more formal, empirical Bayes method. Default is 'MLE'.

TuningParamValue

Value that specifies the tuning parameter used by the estimate method. This tuning parameter sets the lower bound of signal values with positive probability. It must be a positive value. Default is 5 (MLE) or 0.5 (EB).

Tip

For information on determining a setting for this parameter, see Wu et al., 2004.

GSBCorrValue

Specifies whether to perform gene-specific binding (GSB) correction using probe affinity data. Choices are true (default) or false. If there is no probe affinity information, this property is ignored.

MedianValue

Specifies the use of the median of the ranked values instead of the mean for normalization. Choices are true or false (default).

OutputValue

Specifies the scale of the returned gene expression values. Choices are:

  • 'log'

  • 'log2'

  • 'log10'

  • 'linear'

  • @functionname

In the last instance, the data is transformed as defined by the function functionname. Default is 'log2'.

ShowplotValue

Controls the display of a plot showing the log2 of mismatch (MM) probe intensity values from a specified chip (CEL file) versus that chip's MM probe affinities. The plot also shows the LOWESS fit used to compute the nonspecific binding (NSB) data of the specified chip. Choices are true, false, or I, an integer specifying a chip. If set to true, the first chip is plotted. Default is:

  • false — When return values are specified.

  • true — When return values are not specified.

VerboseValue

Controls the display of the status of the reading of files and GCRMA processing. Choices are true (default) or false.

Output Arguments

Expression

DataMatrix object containing the gene expression values (on the log2 scale by default; see the Output property) that have been background adjusted, normalized, and summarized using the GC Robust Multi-array Average (GCRMA) procedure.

Each row in Expression corresponds to a gene (probe set), and each column corresponds to an Affymetrix CEL file.

Description

Expression = affygcrma(CELFiles, CDFFile, SeqFile) reads the specified Affymetrix CEL files, the associated CDF library file (created from Affymetrix GeneChip arrays for expression or genotyping assays), and the associated sequence file or matrix. It processes the probe intensity values using GCRMA background adjustment, quantile normalization, and median-polish summarization, and returns Expression, a DataMatrix object containing the log2-based gene expression values in a matrix, with the probe set IDs as row names and the CEL file names as column names. Each row in Expression corresponds to a gene (probe set), and each column corresponds to an Affymetrix CEL file. Each CEL file is generated from a separate chip, and all chips should be of the same type.

CELFiles is a character vector, string, string vector, or cell array of character vectors containing CEL file names. CDFFile is a character vector or string specifying a CDF file name. If you set CELFiles to '*', affygcrma reads all CEL files in the current folder. If you set CELFiles or CDFFile to ' ', affygcrma opens the Select Files dialog box, from which you select the CEL files or CDF file. In this dialog box, you can press and hold Ctrl or Shift while clicking to select multiple CEL files. SeqFile is a file or matrix containing sequence information for probes on a specific type of Affymetrix GeneChip array.

Note

For details on the reading of files and GCRMA processing, see celintensityread, affyprobeseqread, affyprobeaffinities, gcrma, gcrmabackadj, quantilenorm, and rmasummary.
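The median-polish summarization step named in this syntax can be pictured with a small, self-contained sketch. It is not the toolbox implementation (see rmasummary for that). It assumes a made-up matrix of log2 intensities for a single probe set, a fixed number of sweeps, and MATLAB R2016b or later for implicit expansion. The variable names are illustrative only.

```matlab
% Toy log2 intensities for ONE probe set (made-up values):
% rows are probes, columns are chips.
Y = [8.0  9.0 10.0;
     7.0  8.0  9.5;
     9.0 10.0 11.0];

resid    = Y;                      % residuals, updated in place
probeEff = zeros(size(Y, 1), 1);   % row (probe) effects
chipEff  = zeros(1, size(Y, 2));   % column (chip) effects
overall  = 0;                      % overall level

for iter = 1:10                    % fixed sweep count, an assumption of this sketch
    % Sweep the row medians out of the residuals.
    rowMed   = median(resid, 2);
    resid    = resid - rowMed;     % implicit expansion (R2016b+)
    probeEff = probeEff + rowMed;
    delta    = median(chipEff);
    chipEff  = chipEff - delta;
    overall  = overall + delta;

    % Sweep the column medians out of the residuals.
    colMed   = median(resid, 1);
    resid    = resid - colMed;
    chipEff  = chipEff + colMed;
    delta    = median(probeEff);
    probeEff = probeEff - delta;
    overall  = overall + delta;
end

% One summarized value per chip: overall level plus chip effect.
summary = overall + chipEff;       % [8 9 10] for this toy matrix
```

Using medians rather than means keeps a single outlying probe, such as the 9.5 in the second row, from shifting the per-chip summary.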

Expression = affygcrma(ProbeStructure, Seq) uses GCRMA background adjustment, quantile normalization, and median-polish summarization procedures to process the probe intensity values in ProbeStructure. ProbeStructure is a MATLAB structure containing information from the CEL files, including probe intensities, probe indices, and probe set IDs, returned by the celintensityread function. Seq is a matrix containing sequence information for probes on a specific type of Affymetrix GeneChip array.

Expression = affygcrma(..., 'PropertyName', PropertyValue, ...) calls affygcrma with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Each PropertyName must be enclosed in single quotation marks and is case insensitive. These property name/property value pairs are as follows:

Expression = affygcrma(CELFiles, CDFFile, SeqFile, ...'CELPath', CELPathValue, ...) specifies a path and folder where the files specified by CELFiles are stored.

Expression = affygcrma(CELFiles, CDFFile, SeqFile, ...'CDFPath', CDFPathValue, ...) specifies a path and folder where the file specified by CDFFile is stored.

Expression = affygcrma(CELFiles, CDFFile, SeqFile, ...'SeqPath', SeqPathValue, ...) specifies a path and folder where the file specified by SeqFile is stored.

Expression = affygcrma(..., 'ChipIndex', ChipIndexValue, ...) computes probe affinities using sequence information and mismatch (MM) probe intensity values from the chip specified by ChipIndexValue. Default ChipIndexValue is 1.

Expression = affygcrma(..., 'OpticalCorr', OpticalCorrValue, ...) controls the use of optical background correction on the input probe intensity values. Choices are true (default) or false.

Expression = affygcrma(..., 'CorrConst', CorrConstValue, ...) specifies the correlation constant, rho, for log background intensity for each PM/MM probe pair. Choices are any value ≥ 0 and ≤ 1. Default is 0.7.

Expression = affygcrma(..., 'Method', MethodValue, ...) specifies the method to estimate the signal. Choices are 'MLE', a faster, ad hoc Maximum Likelihood Estimate method, or 'EB', a slower, more formal, empirical Bayes method. Default is 'MLE'.

Expression = affygcrma(..., 'TuningParam', TuningParamValue, ...) specifies the tuning parameter used by the estimate method. This tuning parameter sets the lower bound of signal values with positive probability. It must be a positive value. Default is 5 (MLE) or 0.5 (EB).

Tip

For information on determining a setting for this parameter, see Wu et al., 2004.

Expression = affygcrma(..., 'GSBCorr', GSBCorrValue, ...) specifies whether to perform gene-specific binding (GSB) correction using probe affinity data. Choices are true (default) or false. If there is no probe affinity information, this property is ignored.

Expression = affygcrma(..., 'Median', MedianValue, ...) specifies the use of the median of the ranked values instead of the mean for normalization. Choices are true or false (default).
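To make the difference between the mean and the median of the ranked values concrete, here is a minimal sketch of quantile normalization on a toy matrix. It is an illustration under simplifying assumptions (made-up values, no ties, no missing data), not the code that affygcrma or quantilenorm runs, and the variable names are hypothetical.

```matlab
% Toy intensity matrix (made-up values): rows are probes, columns are chips.
X = [5 4 3;
     2 1 4;
     3 5 6;
     4 2 8];

% Sort each column and remember where each sorted value came from.
[sortedX, order] = sort(X, 1);

% Reference distribution: one value per rank, computed across chips.
refMean   = mean(sortedX, 2);      % analogous to 'Median', false (default)
refMedian = median(sortedX, 2);    % analogous to 'Median', true

% Put the reference values back in each column's original probe order.
normMean   = zeros(size(X));
normMedian = zeros(size(X));
for j = 1:size(X, 2)
    normMean(order(:, j), j)   = refMean;
    normMedian(order(:, j), j) = refMedian;
end
```

After this step every column holds the same set of values, arranged in that chip's original rank order. The median reference is less sensitive to one chip with unusually high or low intensities at a given rank.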

Expression = affygcrma(..., 'Output', OutputValue, ...) specifies the scale of the returned gene expression values. OutputValue can be:

  • 'log'

  • 'log2'

  • 'log10'

  • 'linear'

  • @functionname

In the last instance, the data is transformed as defined by the function functionname. Default is 'log2'.
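The relationship between the output scales can be sketched as follows. This snippet does not call affygcrma. It is a hypothetical stand-in that assumes each choice is applied to linear-scale expression values, which the text above does not state explicitly. The variable names and toy values are made up.

```matlab
% Made-up expression values on the linear scale.
linearVals = [1 2 8 1024];

% Hypothetical stand-in for OutputValue. A function handle such as
% @(x) log2(x + 1) could be used here instead of a name.
outputValue = 'log2';

if isa(outputValue, 'function_handle')
    % Custom transformation supplied by the caller.
    result = outputValue(linearVals);
else
    switch lower(outputValue)
        case 'log'
            result = log(linearVals);     % natural logarithm
        case 'log2'
            result = log2(linearVals);
        case 'log10'
            result = log10(linearVals);
        case 'linear'
            result = linearVals;          % no transformation
        otherwise
            error('Unknown output scale: %s', outputValue);
    end
end
```

With 'log2', a one-unit difference between two values corresponds to a two-fold difference on the linear scale.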

Expression = affygcrma(..., 'Showplot', ShowplotValue, ...) controls the display of a plot showing the log2 of mismatch (MM) probe intensity values from a specified chip (CEL file) versus that chip's MM probe affinities. The plot also shows the LOWESS fit used to compute the nonspecific binding (NSB) data of the specified chip. Choices are true, false, or I, an integer specifying a chip. If set to true, the first chip is plotted. Default is:

  • false — When return values are specified.

  • true — When return values are not specified.

Expression = affygcrma(..., 'Verbose', VerboseValue, ...) controls the display of the status of the reading of files and GCRMA processing. Choices are true (default) or false.

Examples

The following example assumes that the HG_U95Av2.CDF library file is stored at D:\Affymetrix\LibFiles\HGGenome, and that your current folder contains the CEL files and a sequence file associated with this CDF library file. The affygcrma function reads all the CEL files and the sequence file from the current folder and the CDF file from the specified folder. It then performs GCRMA background adjustment, quantile normalization, and summarization on the PM probe intensity values, and returns a DataMatrix object containing the metadata and processed data.

Expression = affygcrma('*', 'HG_U95Av2.CDF', 'HG-U95Av2_probe_tab', ...
                       'CDFPath', 'D:\Affymetrix\LibFiles\HGGenome');
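The two-input syntax, affygcrma(ProbeStructure, Seq), can be used in the same setting by reading the data first. The sketch below reuses the file names and folder from the example above. Those names, the pre-flight check with dir, the 'PMOnly', false setting (assumed here because GCRMA uses MM probe intensities), and the choice of 'EB' are assumptions for illustration, so check the celintensityread and affyprobeseqread reference pages before relying on them.

```matlab
% Assumed locations and names; adjust them to your own data.
cdfFile = 'HG_U95Av2.CDF';
cdfPath = 'D:\Affymetrix\LibFiles\HGGenome';
seqFile = 'HG-U95Av2_probe_tab';

% Pre-flight check so the snippet does nothing if no CEL files are present.
celList = dir('*.CEL');
if isempty(celList)
    disp('No CEL files found in the current folder; nothing to process.');
else
    % Read PM and MM probe intensities from all CEL files.
    probeStruct = celintensityread({celList.name}, cdfFile, ...
                                   'CDFPath', cdfPath, 'PMOnly', false);

    % Read the probe sequences; SequenceMatrix is the N-by-25 matrix.
    seqStruct = affyprobeseqread(seqFile, cdfFile, 'CDFPath', cdfPath);

    % Run GCRMA with the empirical Bayes estimate and no diagnostic plot.
    Expression = affygcrma(probeStruct, seqStruct.SequenceMatrix, ...
                           'Method', 'EB', 'Showplot', false);
end
```

Reading the files once and keeping probeStruct and seqStruct in the workspace lets you rerun affygcrma with different property values without reading the CEL files again.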

References

[1] Naef, F., and Magnasco, M.O. (2003). Solving the Riddle of the Bright Mismatches: Labeling and Effective Binding in Oligonucleotide Arrays. Physical Review E 68, 011906.

[2] Wu, Z., Irizarry, R.A., Gentleman, R., Murillo, F.M., and Spencer, F. (2004). A Model Based Background Adjustment for Oligonucleotide Expression Arrays. Journal of the American Statistical Association 99(468), 909–917.

[3] Wu, Z., and Irizarry, R.A. (2005). Stochastic Models Inspired by Hybridization Theory for Short Oligonucleotide Arrays. Proceedings of RECOMB 2004. J Comput Biol. 12(6), 882–893.

[4] Wu, Z., and Irizarry, R.A. (2005). A Statistical Framework for the Analysis of Microarray Probe-Level Data. Johns Hopkins University, Biostatistics Working Papers 73.

[5] Wu, Z., and Irizarry, R.A. (2003). A Model Based Background Adjustment for Oligonucleotide Expression Arrays. RSS Workshop on Gene Expression, Wye, England, https://biosun01.biostat.jhsph.edu/%7Eririzarr/Talks/gctalk.pdf.

[6] Speed, T. (2006). Background models and GCRMA. Lecture 10, Statistics 246, University of California Berkeley.

[7] Abd Rabbo, N.A., and Barakat, H.M. (1979). Estimation Problems in Bivariate Lognormal Distribution. Indian J. Pure Appl. Math 10(7), 815–825.

[8] Best, C.J.M., Gillespie, J.W., Yi, Y., Chandramouli, G.V.R., Perlmutter, M.A., Gathright, Y., Erickson, H.S., Georgevich, L., Tangrea, M.A., Duray, P.H., Gonzalez, S., Velasco, A., Linehan, W.M., Matusik, R.J., Price, D.K., Figg, W.D., Emmert-Buck, M.R., and Chuaqui, R.F. (2005). Molecular alterations in primary prostate cancer after androgen ablation therapy. Clinical Cancer Research 11, 6823–6834.

[9] Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., and Speed, T.P. (2003). Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics 4, 249–264.

[10] Mosteller, F., and Tukey, J. (1977). Data Analysis and Regression (Reading, Massachusetts: Addison-Wesley Publishing Company), pp. 165–202.

Version History

Introduced in R2008b