Abstract
This is an implementation of the R statistical software qvalue package [Dabney et al., 2014], designed for use with large datasets where memory or computation time is limiting. In addition to estimating p values adjusted for multiple testing, the software outputs a script which can be pasted into R to produce diagnostic plots and parameter estimates. This program runs almost 30 times faster and requests substantially less memory than the qvalue package when analysing 10 million p values on a high performance cluster. The software has been used to control for the multiple testing of 390 million tests when analysing a full cis scan of RNA-seq exon level gene expression from the Eurobats project [Brown et al., 2014]. The source code and links to executable files for linux and Mac OSX can be found here: https://github.com/abrown25/qvalue. Help for the package can be found by running ./largeQvalue --help.
1. Introduction
Recent developments have allowed us to investigate the action of DNA in a new, hypothesis free framework. The ability to collect information on DNA at millions of loci has led to a reappraisal of statistical methods to correct for multiple testing: it has been argued that in many contexts traditional measures of significance such as Bonferroni are too conservative, leading to an unacceptably high rate of falsely accepting null hypotheses. The false discovery rate is an alternative way of accounting for multiple testing, which tolerates a higher false positive rate in order not to discard potentially interesting findings [Benjamini and Hochberg, 1995]. It can simplistically be viewed as the proportion of falsely rejected null hypotheses, and the concept was further developed in Storey and Tibshirani [2003] and Storey et al. [2004], and implemented in the R package qvalue [Dabney et al., 2014].
Advances in sequencing have accelerated our ability to perform experiments of even greater scope. Whereas genotyping arrays ascertain DNA at around a million SNPs, the 1000 Genomes project [1000 Genomes Project Consortium, 2012] has uncovered around 100 million variants in the human genome. Application of sequencing to produce molecular phenotypes has also revolutionised how we can interrogate cellular function with few preconceived notions; for example sequencing of RNA allows us to construct hundreds of thousands of phenotypes, each capturing the expression of a given exon [Lappalainen et al., 2013]. This permits a full, hypothesis free investigation of the genetics of gene regulation. Together, these developments imply that a full genome-wide scan for all possible variants affecting exon expression would involve ∼ 1013 hypothesis tests.
The qvalue package [Dabney et al., 2014], implemented in R [R Core Team, 2013], allows estimation of false discovery rates, as well as the quantity π0, which is an estimate of the proportion of the tests for which the null hypothesis is true. However, R programs usually run slower and consume more memory than programs written in lower level languages, rendering the qvalue package impractical when the number of tests is of the order 1013. Here I present an implementation of qvalue in a compiled language, which runs faster, uses less memory, and is capable of scaling up to the experiments produced by these new sequencing technologies.
2. Methods
2.1. Brief description of qvalue algorithms
The qvalue package implements the algorithms described in Storey [2002], Storey and Tibshirani [2003] and Storey et al. [2004]. It takes as input a set of p values and maps these onto q values, which can be interpreted as follows: a decision rule which rejects all null hypotheses with corresponding q values less than α has an estimated false discovery rate of α. Taking an increasing set of p values p1..pn this mapping is defined as:
The quantity π0 is the proportion of null hypotheses amongst all tests, and is a central difference between the methods used to control the FDR in Benjamini and Hochberg [1995] and the qvalue approach. Its estimation is described in Sections 2.4, 2.5 and 2.6.
2.2. Implementation
largeQvalue is written mainly in the D programming language [Alexandrescu, 2010]. D is a compiled language which promises runtime speeds of the same order as C or C++ while guaranteeing memory safety, and also allows easy integration with code written in C. Leveraging this, parts of the program are written in C; in particular we use the GNU Software Library (http://www.gnu.org/software/gsl/) functions for fitting cubic splines, and sampling from the binomial distribution. The program is released as standalone binaries for use on the linux and Mac OSX systems, as well as source code which can be downloaded from a github repository (https://github.com/abrown25/qvalue) and be compiled on the Windows platform. This source code is released under the GNU General Public Licence v3 (http://www.gnu.org/copyleft/gpl.html). The program is called from the command line with options set using parameter flags. A description of all the available options can be found by running ./largeQvalue --help.
2.3. Input and output file formats and commands
The input format for this program is a whitespace delimited (i.e. either tab or space separated fields) text file where the p values to be adjusted for multiple testing lie within one column. Consecutive runs of whitespace are collapsed to one whitespace character. The file can either be specified using the --input flag (i.e. --input PValueFile) or if not present, then following UNIX convention the filename is taken as the last argument on the command line. The column containing the p values can be specified with the --col flag; by default it is assumed to be the first column. If the --header flag is not present, then it is assumed the file does not have a header line. Any data that cannot be converted to numerical form will be treated as missing (the corresponding q value estimate will be written as NA). If any numerical values lie outside the interval [0, 1], then the program will terminate with the message Some p values are outside [0, 1] interval.
The standard output file is the input file with an additional column on the end which specifies the q value [Storey, 2002] corresponding to the p value in that row. If a filename for the output is not specified using the --out flag, then output is written to stdout.
An optional parameter file is written, if specified using the --param flag. This file contains the estimate of π0 used to calculate q values as well as other intermediate parameters. This file can also be run in R, either by pasting it at the R prompt or by running the command source(“ParameterFileName”) from the R terminal. This script will produce diagnostic plots, as seen in Figure 1 and described in Sections 2.5 and 2.6. For this functionality, the graphing package ggplot2 [Wickham, 2009] is required.
2.4. Estimation of π0
One method to estimate the proportion of null hypotheses is to choose a value λ ∈ [0, 1) and assume that every p > λ is from the null hypothesis (obviously this assumption is more valid as λ tends to 1). Then, assuming uniform p values generated by the null hypothesis, this proportion can be reflected back across the interval. This corresponds to estimates of π0 as a function of λ:
Using the qvalue package in R, the user can specify a value of λ to be used to estimate π0 according to Equation 2. The equivalent command in the largeQvalue package uses the flag --lambda to specify this value. More sophisticated methods, also implemented in the R package, use a sequence of λ values to produce a range of estimates of π0. Information from all of these values is combined either by fitting a cubic spline to these estimates [Storey and Tibshirani, 2003] or by taking bootstrap samples of the p values [Storey et al., 2004]. The implementation of both of these options in largeQvalue is described in the following two sections.
2.5. Estimation of π0 with cubic splines
The methodology to estimate π0 using cubic splines was described in Storey and Tibshirani [2003], and is used when the specified values of λ form a sequence rather than a single number (this is true by default) and the --boot flag is not given.
This option takes a sequence of λ values, specified by --lambda START,STOP,STEP (default value is 0 to 0.9 in steps of 0.05). Estimates of are calculated at every value of λ and a cubic spline is fitted to these points to estimate the asymptote as λ → 1. The estimate of π0 used is the value of this cubic spline at the maximum value of λ.
The amount of smoothing supplied by the spline can be adjusted using the --coef flag (default 5, must be greater than 3). If the --log flag is specified then the spline is fitted to the values instead of .
Figure 1a shows the output of the parameter file processed by R when this methodology is used. The dots show the estimates of π0 at various values of λ, the black line is a cubic spline fitted to these points which is used to approximate the function π0(λ). The estimate used to estimate q values is shown in red line.
largeQvalue uses the spline fitting algorithm implemented in the gsl, which differs slightly to the implementation used by the qvalue package. Exact replication of qvalue results can be produced by running the program twice. First, run the code with the --param flag specified. Paste the parameter output into R. A line in the output Estimate of pi0 from qvalue package is: specifies the qvalue estimate of π0. The analysis can then be run again, specifying this value with the --pi0 flag.
2.6 Estimation of π0 from bootstrap samples
Storey et al. [2004] proposed a method of estimating π0 based on bootstrap samples of the p values. This is used when largeQvalue is run with the flag --boot. The seed for the generation of the samples can be set with the flag --seed (default 0).
Again, a sequence of λ values is given --lambda START,STOP,STEP which is used to calculate estimates . Then, we take 100 bootstrap samples Bj of the p values with replacement and calculate estimates of on these new datasets. The implementation of the bootstrap differs from that used in the qvalue package. To save memory and computation time, instead of resampling individual p values we sample counts of p values lying within intervals defined by the sequence of λ from a multinomial distribution. This is effectively the same procedure.
The estimate is used to estimate q values, where λ* is chosen to minimise Equation 3.
Figure 1b shows the diagnostic plot when this method is applied. Estimates are shown by the blue dots, with shown by the blue line. The bootstrap estimates are shown in the box plots and the mean square error by the dashed line. The chosen estimate of π0, which minimises this mean squared error, is shown by the red line.
2.7. Other parameter flags
The flags --help and --version will bring up the help page and print the version number of the software respectively. The flag --issorted informs the program that the p values are already in ascending order with no missing values: especially with large datasets this can speed up the program considerably.
The flag --param allows the user to specify an output file to write estimates of parameters. When only a single value of λ is specified, this file only contains the point estimate of π0. When λ is specified as a sequence, these parameters include the estimate of π0 used to calculate q values and the estimates for every value in this sequence. If the cubic spline method is employed, then smoothed cubic splines values are reported at every value of λ. For the bootstrap method the file contains all bootstrap estimates and estimates of the mean square error. Examples of these files can be found in Supplementary Materials S2 and S3. These files can be pasted into R to produce diagnostic plots, examples of which can be seen in Figures 1a and 1b.
The --pi0 flag allows the estimate of π0 to be directly specified. This can be useful to exactly recreate q values produced using the cubic spline routine from the q value package.
The --robust flag produces q values which are more conservative for very small values of p. Instead of q values defined by Equation 1, we use the formula:
2.8. An example command
./largeQvalue --header --col 4 --out OutputFile --param ParameterFile --lambda 0,0.9,0.05 --robust QTLresults.txt
The p values can be found in the 4th column of QTLresults.txt, write the results to OutputFile and the estimated parameter values to ParameterFile. Use values of λ from 0 to 0.9 in steps of 0.05 to estimate proportion of null hypotheses (standard settings in qvalue) and produce estimates of q values robust for small p values.
2.9. Comparison of speed and memory usage of largeQvalue and qvalue package
We compared the CPU time and memory used by largeQvalue with that of the qvalue implementation in R. Each program was run five times against four datasets; these consisted of text files with a single column of either 5 × 105, 106, 5 × 106, or 107 uniformly distributed numbers between 0 and 1. The following R code was used to call the qvalue algorithm with the cubic spline method: library(qvalue) Pvalues <-scan(file = “InputFile”) write(qvalue(Pvalues)$qvalues, file = “Qvalues”, ncolumns = 1) For the bootstrap method the following code was used: library(qvalue) Pvalues <-scan(file= “InputFile”) write(qvalue(Pvalues, pi0.method == “bootstrap”)$qvalues, file = “Qvalues", ncolumns = 1) The corresponding commands for the largeQvalue package were: ./largeQvalue --out Qvalues InputFile and ./largeQvalue --boot --out Qvalues InputFile
2.10 List Option
List of options:
Usage: largeQvalue [options]
Options:
--help: Print help and quit
--version: Print version and quit
--header: Input has header line (default = FALSE)
--boot: Apply bootstrap method to find π0 (default = FALSE)
--seed: Set seed for generating bootstrap samples (default = 0, equivalent to gsl default)
--coef: Set the number of coefficients used to fit cubic spline (default = 5, must be greater than 3)
--log: Smoothing spline applied to log π0 values (default = FALSE)
--robust: More robust values for small p values (default = FALSE)
--pi0: Use value of π0 given (useful for recreating qvalue package results)
--lambda: Either a fixed number or a sequence given 0,0.9,0.05 (default = 0,0.9,0.05)
--param: Print out parameter list to given file
--out: File to write results to (default = stdout)
--input: File to take results from (must be specified, if not explicitly, the last parameter after all options have been parsed is used)
--issorted: File has already been sorted with no missing values (default = FALSE)
--col: Column with p values (default = 1)
2.11. Binary executables
The binary for 64bit linux can be found here: ftp://ftp.sanger.ac.uk/pub/resources/software/largeqvalue/largeQvalue.tar.gz. The binary for Mac OSX can be found here: ftp://ftp.sanger.ac.uk/pub/resources/software/largeqvalue/largeQvalue_mac.tar.gz.
2.12. Building from source
To compile an executable binary from source (for example if you are using Windows), it is necessary to have both a D compiler (we have used the ldc compiler as it produces faster programs) and the gsl library installed on your system. Binaries for the ldc compiler can be downloaded from here: https://github.com/ldc-developers/ldc/releases and a link to the gsl software library found here: http://www.gnu.org/software/gsl/. Once both are installed you can clone the github account by running git clone https://github.com/abrown25/qvalue.git. Then change to the qvalue directory and run make. If you see an error stating /usr/lib/libgsl.a: No such file or directory this means the gsl is installed in a different folder on your system. Edit the first line of the makefile to reflect where the library can be found, then rerun the make command. Once this is done, if installation has worked then running ./largeQvalue should bring up the help information. If there are any issues with this procedure, please contact andrew.brown{at}unige.ch.
It is possible that you wish to compile with the reference D compiler DMD (this may especially be true if you are working on a Windows system). This can be downloaded from http://dlang.org/download.html. Then, the only difference with the previous installation is you must run make dmd instead of make.
3. Results
3.1. CPU time and memory usage of largeQvalue and the qvalue package
To compare performance of the two implementations of the qvalue algorithm, we analysed four different datasets five times with both of the programs on the High Performance Cluster at the Wellcome Trust Sanger Institute (further details in Section 2.9). Table S1 lists the CPU time and maximum memory reported by the cluster. The CPU time is shown in Figure 2. The largeQvalue package requires less time to analyse the data, and the difference increases as the number of tests increases. When there are 107 p values, the largeQvalue program is 27 times faster than the qvalue package when comparing median running times.
We do not believe that the cluster reports reliable measures of memory usage, especially as the largeQvalue package runs for a relatively brief time, and allocates memory for much of it. However, when comparing maximum reported memory use across all runs with 107 p values, this program uses less than a quarter of the memory of the qvalue package.
3.2. Application to find tissue specific eQTL
The program was used to look for evidence of cell type specific eQTL in the skin expression data from the Eurobats project [Brown et al., 2014]. We are currently preparing a manuscript based on this work. In brief, a measure of cell type composition was constructed, quantifying the proportions of dermis and epidermis in each individual sample. Then, for every gene and every SNP within 1Mbp of that gene, we tested for interactions between SNP and cell-type measure which affected the expression of that gene (as quantified by RNA-seq), a total of 390,537,631 tests. We ran largeQvalue on these p values. Although the high multiple testing burden meant that we found no significant hits, we estimated a value of π0 of 0.961, suggesting that 3.9% of cis-SNPs had different effects on dermis expression in comparison to expression within the epidermis layer. The program took 1 hour 25 mins 55 seconds of CPU time to run, and used 3.99GB of memory.
4. Conclusion
The qvalue package has been widely adopted to control for multiple testing; the two papers proposing its methodology [Storey and Tibshirani, 2003, Storey et al., 2004] have been cited over 5,600 times. This means that the procedure and its implementation have become widely accepted. The integration within the R framework makes it easy for results to be processed and plotted. The implementation is capable of analysing the results of genome-wide association studies or studies of differential expression of genes (one context where the methodology was originally proposed). However, it is doubtful that it scales up to cope with the results coming from modern cellular sequencing experiments, which can test hundreds of thousands of phenotypes for association with tens of thousands of SNPs (if concentrating on the cis window). For this we propose our implementation, which has already been used to analyse such data.
5. Acknowledgements
I would like to thank Leo Parts and Ana Viñuela for their comments and suggestions which have greatly changed this article for the better. Andrew Brown has been supported by a grant from the South-Eastern Norway Health Authority (No. 2011060).