Search for Interactions is Probably Approximately Complete
SIXPAC is an efficient, scalable search algorithm that finds synergy between pairs of physically unlinked SNPs (genome-wide) in large case-control datasets. It expects genotype and phenotype information as input, encoded inand map file formats as used by Plink software.
Each genotype entry (0/1/2 minor alleles) in the raw file is reduced into two binary variables: representing recessive carrier status (=2 minor alleles) and dominance status (1 or 2 minor alleles) respectively. For each pair of SNPs, this allows us to test 4 genetic models of interaction (dom-dom, dom-rec, rec-dom, rec-rec). The output list contains binarized SNP-pairs for which the test statistic clears the user-provided significance cutoff (default Bonferroni).
SIXPAC is fast: genome wide scans of large datasets (thousands of samples, typed at ~105-6 SNPs) can be processed within a day on a single desktop machine, or in a couple of hours on a small cluster.
This algorithm was conceived and designed at the Itsik Pe’er Lab of Computational Genetics at Columbia University. It employs a novel technique called Probably Approximately Complete Searching, which is inspired by (but different from) PAC Learning. SIXPAC was written in Java 1.6, requires JRE 1.6 or higher to run and is distributed under the GPL license. Please contact either Itsik Pe'er or Snehit Prabhu if you use this software and would like to cite it.
SIXPAC version 1.0.2
Thank you for downloading SIXPAC. Please direct all technical questions and bug-reports to Snehit Prabhu (snehitp [at] cs [dot] columbia [dot] edu). Here is a version update history.
June 26th, 2012: We are currently in the process of applying recently suggested corrections to some of the tests implemented, to reduce type I error. At present, we recommend that you subject the results of SIXPAC through a battery of follow-up screens (see section on Statistical Tests below).
July 5th, 2012: SIXPAC paper published!
Read about the method in our paper in Genome Research.
A typical SIXPAC process requires at least 1G of RAM reserved as heap space. To test whether things are working fine, type the following on the command line where you've downloaded the jar file:
java -Xmx1G -jar Sixpac.jar
SIXPAC accepts raw files containing cases and controls where each genotyped site is represented by the count (0/1/2) of the number of minor alleles at that site (To convert your ped or bed file to a raw file use the plink --recodeA option). By default, SIXPAC will search for allelic combinations that are enriched in cases. Instructions on how to change these default search parameters are provided below.
WARNING: Reference allele consistency
The plink raw file format that is used as input must represent each site by the number of minor alleles (0/1/2). SIXPAC results are extremely sensitive to this count: a single allele flip can result in several falsely reported significant combinations. Please ensure that cases and controls are both consistent with regards to their reference alleles.
Command Line Options
I. File I/O (mandatory)
--raw [raw file]
Provides the plink format raw file containing both cases and controls. Cases are coded as phenotype 2 and controls as phenotype 1.
--map [map file]
Provides the plink format map file containing the list of SNPs in the raw file.
WARNING: Please ensure that your map file contains genetic distance information (column 3). If you do not have centimorgan distance information, then use the --useblocks-bp option to specify a distal SNP definition (see below).
--out [output file prefix]
Provides the location and filename prefix where SIXPAC will output its results.
II. Modes of operation (optional)
Mode of search.
Complete search is an exhaustive, brute-force enumeration and test of every pair of SNPs genome-wide.
Approx is a 2-stage test : first an exhaustive, brute-force enumeration of every pair of SNPs genome-wide, tested first in cases only. Subsequently a case-versus-control tests if the pair passes the stage-1 cutoff.
Probably Approximately Complete (PAC) is the default search mode. This applies the ultrafast randomization step for stage-1 that helps SIXPAC identify SNP-pairs that clear the significance cutoff without enumerating and testing every pair genome-wide. See our paper for details.
Tells SIXPAC to draw random pairs of SNPs from the dataset. Useful for checking the validity of the null hypothesis (QQ). Outputs a list of randomly drawn SNP pairs, and the results of applying the test on each pair. Used in conjunction with the --draws flag, which tells the software how many draws to make. Default is 5000.
e.g. --mode random --draws 500000
Tells SIXPAC to test a specific pair (or pairs) of binarized SNPs, rather than search for all significant pairs. Also requires an additional input of a file containing a list of (mode \t SNP \t mode \t SNP) entries. The location of this flat file is provided following the --testfile flag.
e.g.1 --mode test --testfile list.txt
Each entry of the file list.txt should contain 4 columns as follows:
r rs1234 d rs4321
d rs6789 d rs1413
d rs6789 d rs1413
The results of the test on each entry will be provided in the output file.
e.g.2 --mode test --testfile list.txt --span 5
If list.txt contains a single entry (single row), then the output file will contain a square matrix spanning 5 SNPs upstream and downstream of each locus.
For example, the entry (r rs1234 d rs4321) will be processed by considering 5 SNPs upstream and downstream of rs1234in recessive mode, and 5 SNPs upstream and downstream of rs4321 in dominant mode. A matrix of (5+1+5) x (5+1+5) = 121 -log(pvalues) will be output (where p-values depend on the statistical test applied, see below)
III. Search Parameters (optional)
--power [number between 0 and 1]
Provides the minimum computational power (sensitivity) with which significant interactions must be recovered. Default power is 0.1 (i.e. 10%)
NOTE: High power searches on large datasets (>100,000 SNPs) can be time consuming (but should still take less than 1 day). We recommend distributing a search over multiple machines, where each machine scans the same dataset with low power. This cumulatively increases the likelihood of at least one machine finding the interaction (see example 6 below for design of experiments).
--significance [p-value cutoff]
The required significance cutoff. SNP-pairs with a p-value below this cutoff will be reported.
Expects input number between 0 and 1 in scientific format (e.g. 1e-10). The default value is Bonferroni : for example, if you provide a 100,000 SNP dataset, SIXPAC applies a pairwise Bonferroni threshold of 0.05/((100000 choose 2) x 4) = 2.5e-12, since it tests 4 genetic models per SNP-pair.
If you want to use SIXPAC as a first pass to identify candidate SNP-pairs which can be followed-up using other genetic models, then you may want to consider lowering this cutoff by 1, 2 or 3 orders of magnitude - but keep in mind that lowered cutoffs may result in computational slowdown.
--seed [number between 0 and 1000]
Provides a random seed to SIXPAC’s random number generator. Can be used to replicate a search result. Default is 1.
--threads [number of threads]
Number of threads to perform a parallelized search with. Default is as many computing cores as your computer has. (e.g. quad-core machines will run 4 threads).
WARNING: Multiple threads can quickly consume heap memory and cause the search to falter midway. Only recommended for machines with 8G RAM or more. If you find numberous out of memory errors ensuing, or find your computer thrashing for memory, then please consider enforcing fewer threads (e.g. --threads 1). Also, this parameter may be used in conjunction with the --maxheap parameter (see below)
--maxheap [memory per thread in Mb]
Provides a maximum heap memory allocated to each search thread in Mb.
NOTE: If your machine has more than 1G heap memory to offer, then adjust the java –Xmx flag first, and only then inform SIXPAC of increased memory availability through this flag. The --maxheap flag should always follow the --threads flag when they are used in conjunction.
--useblocks-bp [bp distance]
Tells SIXPAC not to consider pairs of SNPs that are less than this far apart (distance in base pairs). Recommended distance is at least 1Mbp in humans to account for LD. Default is 5Mbp.
--useblocks-cm [cm distance]
Tells SIXPAC not to consider pairs of SNPs that are less than these many centimorgans apart. cM distances are taken from the 3rd column of the map file provided as input.
IV. Statistical Tests (optional)
--testname [D] or [R] or [lod]
"D" contrasts the Linkage Disequilibrium between binarized SNPs in cases and controls.
"R" contrasts Pearson's correlation coefficient between binarized SNPs in cases and controls.
"lod" contrasts the traditional log of odds between cases and controls. This is the closest equivalent of a conventional logistic regression test for interaction between a pair of variables.
If SIXPAC does report a pair of binarized SNPs that demonstrate statistically significant LD-contrast (or R-contrast) in your dataset, then the recommended protocol for follow-up would be to (i) check QQ plot of randomly drawn pairs from your dataset, to confirm the validity of SIXPAC's null hypothesis and absence overdispersion (see --mode random flag) (ii) return to raw genotype data or intensity files, and check whether the SNPs and individuals in question were poorly called, or suffered from any experimental artifacts, (iii) Test the SNP pair using a variety of statistical epistasis models ( is user friendly and highly recommended).
The best validation of your long-range SNP-SNP finding is always replication. If you have an additional case-control dataset, you may want to inspect all pairs of SNPs within the physical neighborhood of the 2 loci for similar fluctuations in LD (see --mode test flag).
V. Other Parameters (optional)
--pheno [phenotype file]
Plink format phenotype file if you wish to override the phenotype status in the .raw file. Currently only accepts 3 columns (Fam ID, Indiv ID, Phenotype 1/0).
At the end of a run, SIXPAC outputs afile containing a list of allelic combinations that are significantly enriched in cases versus controls. The first few columns of the output represent interacting SNPs (depending upon the –-order input parameter passed through the command line)
Where, each column entry has the format:
[number, mode, chr, SNPid, cmDistance, bpDistance]
e.g. [10136,r,22,rs1234567,19.134,19134027] represents the recessive allele of rs1234567. Further, it is the 10136th SNP in the map file provided, and is located 19.134 cM from the start of chromosome 22, at bp position 19134027.
Col3 : Expected Cases
Cases expected to be carriers of 00/01/10/11 states for this pair of SNPs, if they were segregating independently. These are calculated from the empirical case-frequencies of each allele.
e.g. [...,r,22,rs1234567,...] [...,d,22,rs7654321,...] [256/128/64/43] ...
represents an interaction by the recessive allele of rs1234567 and the dominant allele of rs7654321.
Expected 256 cases to carry <2 minor alleles of rs1234567 and <1 minor allele of rs7654321.
Expected 128 cases to carry <2 minor alleles of rs1234567 and >=1 minor allele of rs7654321.
Expected 64 cases to carry 2 minor alleles of rs1234567 and <1 minor allele of rs7654321.
Expected 43 cases to carry 2 minor alleles of rs1234567 and >=1 minor allele of rs7654321.
Col4 : Observed Cases
Case carriers of the 00/01/10/11 states for the given alleles. Here 11 represents the carriers of both SNPs in their respective allelic states.
Col5: Case-only analysis p-value
Gives the stage 1 p-value that compares the observed versus expected Gametic Phase Disequilibrium (GPD) between minor alleles in cases.
Col6 : Expected Controls
Controls expected to be in the 00/01/10/11 states if these SNPs were segregating independently, as calculated from the empirical case-frequencies of each allele.
Col7 : Observed Controls
Control carriers of the 00/01/10/11 states for the given alleles. Here 11 represents the carriers of both SNPs in their respective allelic states.
Col8: Case vs. Control analysis p-value
Gives the final stage 2 p-value that contrasts the GPD in cases to the GPD in controls. Please refer to the SIXPAC paper for details about this test.
Here are a few examples to get you started. Suppose you have a raw file containing Crohn's cases and matched controls (Crohns.raw) and the corresponding map file (WG.map).
EXAMPLE 1: If you wish to scan the dataset for SNP-pairs that are significant at Bonferroni leves, with computational power (probability of finding these pairs) of at least 40%, use
java -Xmx4G -jar Sixpac.jar --raw Crohns.raw --map WG.map --out Crohns --power 0.4
Note that if you have a multi-core processor. the the multithreaded search operation may require larger amounts of ram (e.g. 4G to 8G). SIXPAC will attempt to keep it's operations within system resources, but sometimes a few search threads do run out of memory.
To force SIXPAC to use just one thread, append an additional flag --threads 1 to the above command.
EXAMPLE 2: If you wish to scan your large dataset (~10^6 SNPs) for significant interactions with power of at least 80% on your 8-core processor which also offers generous amounts of ram (say 16G), you can force SIXPAC to reserve more heap memory per thread than it would typically employ
java -Xmx12G -jar Sixpac.jar --raw Crohns.raw --map WG.map --out Crohns --power 0.8 --threads 8 --maxheap 1000
Note that reserving 1000MB per thread in heap space still does not amount to 12G. This is because the heap is only part of the memory consumption of the java process. SIXPAC does not attempt to control the other memory intensive aspects.
EXAMPLE 3: Suppose you wish to do a genome-wide scan using permuted phenotype status rather than the case-control labels stated in the .raw file (this is a good way to test for a false-positve rate at a certain significance level) , use
java -Xmx8G -jar Sixpac.jar --raw Crohns.raw --map WG.map --out Crohns --power 0.8 --threads 8 --pheno permuted.txt
EXAMPLE 4: If you want to apply a genetic distance filter (instead of basepair distance) to screen out allelic combinations between SNPs that are less than 5cM apart instead of the 5Mbp default, then use
java -Xmx8G -jar Sixpac.jar --raw Crohns.raw --map WG.map --out Crohns --power 0.2 --useblock-cm 5
EXAMPLE 5: Design of Experiments.
You want to scan a really large dataset (say 1M SNPs, 10000 samples) for interactions. Further, you want to be at least 90% certain that you have identified all significant interactions in the dataset, but quickly realize that a single machine is overwhelmed by the task because :
Hint: Try running in single threaded mode (default). This may increase the runtime considerably, but is more memory efficient. If that still doesn't work, see below.
b. Hint: Let SIXPAC process a dataset for at least an hour before taking its time-to-completion estimates seriously. If the stable estimates after this burn-in period are still unsatisfactory, see below. Also note that these estimates are contingent upon the machine not running out of memory (particularly if running in multithreaded mode). Playing with the --threads and --maxheap settings can affect the speed considerably.
Distributing your search
... accross computers, into manageably sized chunks is easy with SIXPAC!
What you really want to ensure after distributing the work is that if a statistically significant SNP-pair exists, then it is reported by at least one run of SIXPAC with 90% probability.
We apply some basic probability theory: Instead of a single SIXPAC search, try performing 2 searches with 70% power each. This ensures that at least one of them will find the interaction with 1 - (1-0.7)2 = 91% certainty. By extension, performing 10 searches with 20% power each will also ensure that at least one of them will find the interaction with probability 1 - (1-0.2)10 = 90% power. Both scenarios (and many more like these) satisfy your original requirement.
Breaking the search effort into multiple pieces as shown above, makes each individual SIXPAC run faster and less memory intensive. Find one that best suits your computational infrastructure. Note that the 2 (or 10) runs mentioned above can be performed sequentially one after another on the same machine, or simultaneously on 2 (to 10) different machines - all to the same effect.