Analysis of Gene Expression Data
Differential expression analysis for digital gene expression technologies
MD Robinson, GK Smyth Pub ref: 121
Digital gene expression (DGE) technologies are high-throughput sequencing platforms that are predicted to change the landscape of molecular expression data collection. The essential idea of DGE is to sequence a large number of short regions, or tags, of DNA or RNA simultaneously. These short sequence reads are mapped back to the genome to identify the genes or genomic regions from which they originate. In expression profiling, the number of occurrences of each tag is proportional to the abundance of the corresponding gene. DGE technologies are used for the same purposes as microarrays, but generate digital data consisting of counts instead of analogue data consisting of intensities.
DGE technologies are generating unprecedented amounts of data. As with microarrays in the early 21st century, research institutions and biotechnology companies are scrambling to adapt to the possibilities and demands of the new technologies. One of the fundamental questions to be addressed with such data is the identification of tags that are differentially expressed between experimental conditions or between affected and unaffected patients. Our interest is particularly in functional genomic experiments wherein disease models are studied or particular genes are perturbed in order to infer their function.
Any assessment of differential expression must take into account the degree of biological variability between individuals if statistical significance is to be judged realistically. Yet DGE profiles are expensive to obtain, so the number of biological replicates in any experiment is typically very small.
The secret to making progress is to borrow information between genes. We have developed the first statistical method that is able to share information between genes from DGE profiles, and to estimate biological variability for each gene, even when the number of biological replicates is minimal. Our method allows the magnitude of biological variability to differ for each gene, yet the estimated coefficient of variation is mathematically moderated between genes. Not only is our strategy applicable with the smallest number of samples, but it also proves to be more powerful than previous strategies when more samples are available.
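The idea of moderating a per-gene variability estimate towards a shared value can be illustrated with a short sketch. This is not the method described above: it swaps in a simple method-of-moments estimate under an assumed negative binomial model (variance = mean + dispersion × mean², so that the square root of the dispersion plays the role of the biological coefficient of variation), assumes equal library sizes, and uses a hypothetical fixed `weight` to control how strongly each gene is pulled towards the common value.

```python
import numpy as np

def moderated_dispersion(counts, weight=0.5):
    """Shrink per-gene dispersion estimates towards a common value.

    counts : genes x replicates array of tag counts from one condition
             (equal library sizes are assumed for simplicity).
    weight : hypothetical shrinkage weight in [0, 1]; 0 keeps the raw
             per-gene estimates, 1 replaces them with the common value.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)

    # Method of moments under var = mean + phi * mean^2.
    with np.errstate(divide="ignore", invalid="ignore"):
        raw = np.where(mean > 0, (var - mean) / mean**2, 0.0)
    raw = np.clip(raw, 0.0, None)  # negative estimates are set to zero

    # The common dispersion here is simply the average over all genes.
    common = raw.mean()

    # Moderated estimate: a weighted compromise between gene and common.
    moderated = (1.0 - weight) * raw + weight * common
    return moderated, raw, common


if __name__ == "__main__":
    example = [[10, 10], [5, 15], [0, 0]]
    moderated, raw, common = moderated_dispersion(example, weight=0.5)
    print("raw:", raw)
    print("common:", common)
    print("moderated:", moderated)
```

With only two replicates the raw estimates are very unstable, which is why some form of sharing between genes is needed; the fixed weight used here is a placeholder for a data-driven choice.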
Our work shows that the principle of information borrowing, which has proved so important and productive for microarray data analysis, can be applied also to digital data.

SAGE data from Zhang et al. (Science, 1997), comparing two colon tumour samples with normal colon tissue. The plot shows log-fold change versus log-expression for all tags. Significantly changing tags are highlighted.
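For readers unfamiliar with this kind of plot, the two axes can be computed from tag counts roughly as follows. This is a minimal sketch, not the code used to produce the figure: the function name, the prior count of 0.5 added to avoid taking the logarithm of zero, and the scaling to tags per million are all assumptions, and each condition is represented by a single pooled library.

```python
import numpy as np

def log_fold_change_and_expression(tumour, normal, prior=0.5):
    """Return (log-fold change, average log-expression) for each tag.

    tumour, normal : 1-D arrays of tag counts, one entry per tag.
    prior          : assumed small offset that keeps zero counts finite.
    """
    tumour = np.asarray(tumour, dtype=float)
    normal = np.asarray(normal, dtype=float)

    # Scale each library to tags per million so that libraries of
    # different sequencing depth are comparable.
    log_tumour = np.log2((tumour + prior) / tumour.sum() * 1e6)
    log_normal = np.log2((normal + prior) / normal.sum() * 1e6)

    log_fold_change = log_tumour - log_normal          # vertical axis
    log_expression = (log_tumour + log_normal) / 2.0   # horizontal axis
    return log_fold_change, log_expression


if __name__ == "__main__":
    m, a = log_fold_change_and_expression([10, 0, 30], [10, 20, 10])
    for fold, expr in zip(m, a):
        print(f"log-fold change {fold:+.2f} at log-expression {expr:.2f}")
```

Tags with a positive log-fold change are more abundant in the tumour library, and tags with a negative value are more abundant in normal tissue.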
Genome-wide inter-primate comparison of gene expression profiles in multiple tissues
A Oshlack, GK Smyth in collaboration with R Blekhman, AE Chabot, Y Gilad (Department of Human Genetics, University of Chicago, IL, USA)
We have designed and analysed a whole-genome multi-species microarray, which allows gene expression to be compared between human, chimpanzee and rhesus macaque. Comparisons of liver, kidney and heart samples from six individuals of each species revealed several pathways that have been remodelled in the human lineage. Specifically, metabolic pathways are over-represented among genes that are likely to be undergoing natural selection in humans. This makes intuitive sense given that large changes in diet, including the consumption of cooked food, have occurred in the human lineage. In addition, genes identified as being under stabilising selection were enriched for genes previously associated with disease.
Histone deacetylase inhibitors induce clinical responses with associated alterations in gene expression profiles
W Shi, TP Speed, GK Smyth in collaboration with J Bolden, L Ellis, R Johnstone (Peter MacCallum Cancer Centre)
Histone deacetylase inhibitors (HDACi) induce apoptosis of tumour cells at concentrations that leave normal cells relatively unharmed. A time-course microarray experiment, in which normal cells and tumourigenic cells are treated with HDACi, shows that genes in the apoptosis pathway, and the Bcl-2 family in particular, are significantly associated with the HDACi response in the tumourigenic cells relative to normal cells. Careful analysis of microarray data from Phase I trials, including array quality weighting and linear modelling, shows that the HDACi LBH589 induces alterations in gene expression profiles associated with clinical responses in cutaneous T-cell lymphoma.
Gene set testing and pathway identification in designed gene expression experiments
D Wu, K Satterley, TP Speed, GK Smyth in collaboration with C de Graaf, D Hilton (Molecular Medicine Division), J Visvader, G Lindeman (Molecular Genetics of Cancer Division)
A seminal advance in the last few years has been the idea of assessing sets of genes for differential expression instead of individual genes. The sets may correspond to known pathways or diseases, and set analysis can give a clearer picture of global gene activity. Gene set testing methods are being developed that have good power and resolution for the sort of complex designed experiments that are common in functional genomics. A database of mouse-orientated gene sets is being curated to facilitate systematic investigations into haematological malignancies and breast cancer.
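The basic idea of testing a set rather than a single gene can be sketched as follows. This is an illustration only and not one of the methods under development: it assumes that a differential expression statistic (for example a t-statistic) is already available for every gene, and it compares the mean absolute statistic of the set with that of randomly drawn sets of the same size. The function name, the choice of summary statistic and the default number of random sets are all assumptions.

```python
import numpy as np

def gene_set_pvalue(stats, set_index, n_perm=2000, seed=0):
    """Random-gene-set p-value for one gene set.

    stats     : 1-D array with one differential expression statistic per gene.
    set_index : integer positions of the genes that belong to the set.
    n_perm    : number of random gene sets used to build the null.
    """
    stats = np.asarray(stats, dtype=float)
    set_index = np.asarray(set_index)
    rng = np.random.default_rng(seed)

    # Summary for the set of interest: mean absolute statistic.
    observed = np.abs(stats[set_index]).mean()

    # Null distribution: the same summary for random sets of equal size.
    size = len(set_index)
    null = np.array([
        np.abs(rng.choice(stats, size=size, replace=False)).mean()
        for _ in range(n_perm)
    ])

    # Add one to numerator and denominator so the p-value is never zero.
    return (1 + np.sum(null >= observed)) / (n_perm + 1)


if __name__ == "__main__":
    stats = np.zeros(200)
    stats[:5] = 5.0  # five strongly changing genes
    print("changed set:", gene_set_pvalue(stats, np.arange(5)))
    print("unchanged set:", gene_set_pvalue(stats, np.arange(100, 105)))
```

Drawing random gene sets treats genes as independent, which is rarely true in practice; handling correlation between genes and complex experimental designs is part of what makes more careful methods necessary.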




