This page provides information on HATS, the Java software that implements the HATS algorithm discussed in [Paper Submitted]. The following sections will guide you through downloading, building, and running HATS.
The program has been developed by Itsik Pe'er's Lab of Computational Genetics at Columbia University. It is built with Java 1.5 and has been tested on both Windows and Linux. The source code is distributed here in a jar package under the GPL license.
Download: SourceForge HATS Project Page
Dependencies
HATS depends on the following publicly available libraries. Please download the indicated versions of the jar files in order to compile and run HATS.
1) Colt Math Library (version 1.2.0): http://acs.lbl.gov/software/colt/
2) Commons Math Library (version 2.1): http://commons.apache.org/math/
3) JFreeChart (version 1.0.13): http://www.jfree.org/jfreechart/
4) JSAP (version 2.1): http://martiansoftware.com/jsap/
Installation
For users' convenience, a prebuilt version of HATS is available in this jar file (hats.jar), which should run on both Windows and Linux platforms. For the remainder of this page, we assume that hats.jar is saved in a user-specified directory $PROJECT_DIR. Users who still wish to rebuild HATS should refer to the next section; otherwise, skip to the Using HATS section.
Building HATS
HATS requires the Java Development Kit (JDK) 1.5 or higher in order to compile. These instructions assume that:
1. the source code is located in the directory $PROJECT_DIR/src (so that this directory contains subdirectories: dynamicArray, genomeEnums, nutils, hats, etc.).
2. the external libraries for the above dependencies (in the form of .jar files) are located in $PROJECT_DIR/lib
   a. The specific jar files needed are listed in the downloadable shell scripts just below.
3. the current directory of the user is $PROJECT_DIR, and the user has full write permissions to this current directory
To build on Linux: Run the following shell script file at the command line: buildHATS.Linux.sh
To build on Cygwin: Run the following shell script file at the command line: buildHATS.Cygwin.sh
To build on Windows: Run the following batch file at the command line: buildHATS.bat
Build results:
The class files will be placed in the $PROJECT_DIR/bin directory, and the resulting hats.jar file will be placed in $PROJECT_DIR.
Using HATS
Preparing the Training Data
The first step involves
preparing the training data files. The
training data consists of phased haplotype sequences for HapMap samples from the
1000 Genomes Project. The files (ending
in .hap, .sites, and .Samples extensions) can be downloaded for each of the
three HapMap populations (CEU, YRI, JPTCHB) at:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/2009_04/
These files must be pre-processed before they are used in HATS. Download and modify the following script (processHaplotypes.sh) in order to prepare the training data. The Java classpath within the script should be set to include $PROJECT_DIR (where hats.jar is saved) as well as $PROJECT_DIR/lib (where the JSAP jar file is located). The script also assumes gawk is installed, though the corresponding line in the script can be commented out if working on the JPTCHB or YRI populations. The script was written to run in Cygwin but can be modified easily (i.e. the classpath section) in order to run on Linux. Windows users familiar with shell scripting can modify the file to work on Windows as well (so long as gawk is installed).
On Cygwin, run the script as:
$ bash processHaplotypes.sh <sites filename> <data filename> <sample indices to filter>
The final argument is a comma-separated list of sample indices (starting from index 0) that are to be filtered out, surrounded by braces. The indices can be seen in the .Samples file. For example, to eliminate samples 10 and 54, use "{10,54}" for this argument (include the quotes, and do not put spaces!).
Thus, if the CEU files are to be processed, the command line would be:
$ bash processHaplotypes.sh CEU.sites CEU.hap "{10,54}"
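When the list of samples to filter is produced by another program, the brace-delimited argument can be assembled mechanically. The following is a minimal sketch only; the class SampleFilterArgument and its format method are hypothetical helpers and are not part of HATS.

```java
// Hypothetical helper (not part of HATS): formats sample indices as "{i,j,...}"
// with no spaces, which is the form described for the final script argument.
final class SampleFilterArgument {

    /** Builds the brace-delimited, comma-separated index list. */
    static String format(int... sampleIndices) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < sampleIndices.length; i++) {
            if (sampleIndices[i] < 0) {
                // Indices start from 0, so a negative value is a caller error.
                throw new IllegalArgumentException("Negative sample index: " + sampleIndices[i]);
            }
            if (i > 0) {
                sb.append(',');
            }
            sb.append(sampleIndices[i]);
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        System.out.println(format(10, 54)); // prints {10,54}
    }
}
```

The resulting string still has to be quoted when it is passed on the shell command line, so that the shell does not expand the braces.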
The output files will be written to the same directory as the phased input files, with one output file containing phased information per chromosome. The columns of the file will be:
1) chromosome number
2) position
3) reference allele
4) main variant allele
5) main variant allele frequency
6) phased haplotypes (two alleles per sample from left to right)
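To illustrate the column layout above, here is a minimal sketch of reading one line of a processed training file. It assumes the columns are whitespace-delimited and that everything after the fifth column is haplotype data; neither assumption is confirmed by this page. The class TrainingRecord and the sample line in main are hypothetical.

```java
// Hypothetical reader for one processed training line (not part of HATS).
// Assumption: columns are separated by whitespace (spaces or tabs).
final class TrainingRecord {
    final String chromosome;
    final long position;
    final char referenceAllele;
    final char variantAllele;
    final double variantAlleleFrequency;
    final String haplotypes; // everything after column 5, kept verbatim

    private TrainingRecord(String chromosome, long position, char referenceAllele,
                           char variantAllele, double frequency, String haplotypes) {
        this.chromosome = chromosome;
        this.position = position;
        this.referenceAllele = referenceAllele;
        this.variantAllele = variantAllele;
        this.variantAlleleFrequency = frequency;
        this.haplotypes = haplotypes;
    }

    static TrainingRecord parse(String line) {
        // A limit of 6 keeps the haplotype column(s) together in the last field.
        String[] f = line.trim().split("\\s+", 6);
        if (f.length < 6) {
            throw new IllegalArgumentException("Expected at least 6 columns, found " + f.length);
        }
        return new TrainingRecord(f[0], Long.parseLong(f[1]), f[2].charAt(0),
                f[3].charAt(0), Double.parseDouble(f[4]), f[5]);
    }

    public static void main(String[] args) {
        // Made-up example line, for illustration only.
        TrainingRecord r = parse("22 16050408 T C 0.125 TTTC");
        System.out.println(r.chromosome + ":" + r.position + " " + r.referenceAllele
                + ">" + r.variantAllele + " freq=" + r.variantAlleleFrequency);
    }
}
```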
Preparing the Test Data
The user can download a sample of input test data here. The test data consists of genotype and allele-specific read count information for each site within an amplified region across n samples (indexed by 1 ≤ j ≤ n).
The columns of the input file are as follows:
Chromosome Position Reference_Allele [Columns Sample 1] [Columns Sample 2] ... [Columns Sample n]
where each [Columns Sample j] expands to 14 columns:
Tumor_Genotype_Code Tumor_Genotype_Allele1 Tumor_Genotype_Allele2 Normal_Genotype_Code Normal_Genotype_Allele1 Normal_Genotype_Allele2 Tumor_IsSiteAmplified Normal_IsSiteAmplified Tumor_ReadCount_Total Tumor_Pileup_ReadString Tumor_Pileup_QualityString Normal_ReadCount_Total Normal_Pileup_ReadString Normal_Pileup_QualityString
Column values:
The {Tumor,Normal}_Genotype_Code should be one of the following values:
· 0 (homozygous for reference allele)
· 1 (heterozygous)
· 2 (homozygous for variant allele)
· 5 (homozygous deletion)
· -1 (missing data)
Note that hemizygous calls are not considered, as they would typically be called as homozygous by genotype calling algorithms.
The {Tumor,Normal}_Genotype_Allele{1,2} columns should only contain:
· one of {A, C, G, T}, if the {Tumor,Normal}_Genotype_Code is 0, 1, or 2
· N if the {Tumor,Normal}_Genotype_Code is -1 or 5
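The genotype code and allele rules above lend themselves to a quick sanity check before a file is handed to HATS. The sketch below encodes only the rules stated on this page; the class GenotypeColumns is a hypothetical helper, not part of HATS, and HATS itself may validate its input differently.

```java
// Hypothetical validator (not part of HATS) for the genotype code and allele columns.
final class GenotypeColumns {

    /** Valid codes listed on this page: 0, 1, 2, 5, and -1. */
    static boolean isValidCode(int code) {
        return code == 0 || code == 1 || code == 2 || code == 5 || code == -1;
    }

    /** Checks that both allele columns agree with the genotype code. */
    static boolean isConsistent(int code, String allele1, String allele2) {
        if (!isValidCode(code)) {
            return false;
        }
        boolean called = code == 0 || code == 1 || code == 2;
        if (called) {
            // Called genotypes must carry a real base in both allele columns.
            return isBase(allele1) && isBase(allele2);
        }
        // Missing data (-1) and homozygous deletion (5) use N in both columns.
        return "N".equals(allele1) && "N".equals(allele2);
    }

    private static boolean isBase(String allele) {
        return allele != null && allele.length() == 1 && "ACGT".indexOf(allele.charAt(0)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isConsistent(1, "A", "G"));  // prints true
        System.out.println(isConsistent(0, "N", "N"));  // prints false
    }
}
```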
The {Tumor,Normal}_IsSiteAmplified columns should only contain:
· 0, if the site is not amplified in the tissue (tumor/normal) for sample j
· 1, if the site is amplified in the tissue (tumor/normal) for sample j
The {Tumor,Normal}_ReadCount_Total is an integer that is:
· > 0, giving the total number of reads at that site within that tissue for sample j. This is obtained from a pileup that is generated for that sample (e.g. via samtools)
· <= 0, indicating that this site is missing or is a homozygous deletion within that tissue in sample j
The {Tumor,Normal}_Pileup_ReadString contains the string of reads from the pileup that is generated (e.g. via samtools) for that site within that tissue for sample j. HATS tallies allele-specific read counts for this site/tissue/sample by parsing this string.
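As an illustration of what tallying allele-specific counts from a pileup read string involves, the sketch below counts reference and variant bases under the standard samtools pileup conventions ('.' and ',' match the reference, '^' is followed by one mapping-quality character, '+' and '-' introduce indels of a stated length). It is a hypothetical stand-in, not HATS's own parser; the class PileupTally is an assumption, and any quality filtering HATS may apply is not modeled.

```java
// Hypothetical tally of a samtools-style pileup read string (not HATS's own parser).
final class PileupTally {

    /** Returns {referenceCount, variantCount} for the given variant allele. */
    static int[] tally(String reads, char variantAllele) {
        int ref = 0;
        int var = 0;
        char v = Character.toUpperCase(variantAllele);
        int i = 0;
        while (i < reads.length()) {
            char c = reads.charAt(i);
            if (c == '^') {
                i += 2; // skip read-start marker and its mapping-quality character
                continue;
            }
            if (c == '+' || c == '-') {
                // Indel: sign, decimal length, then that many inserted/deleted bases.
                int j = i + 1;
                int len = 0;
                while (j < reads.length() && Character.isDigit(reads.charAt(j))) {
                    len = len * 10 + (reads.charAt(j) - '0');
                    j++;
                }
                i = j + len;
                continue;
            }
            if (c == '.' || c == ',') {
                ref++; // match to the reference on the forward or reverse strand
            } else if (Character.toUpperCase(c) == v) {
                var++; // mismatch equal to the variant allele, on either strand
            }
            // Anything else ('$', '*', other bases) is ignored in this sketch.
            i++;
        }
        return new int[] {ref, var};
    }

    public static void main(String[] args) {
        int[] counts = tally(".,^F.A$a+2AG.,-1Ac", 'A');
        System.out.println(counts[0] + " " + counts[1]); // prints 5 2
    }
}
```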
The {Tumor,Normal}_Pileup_QualityString contains the mapping qualities for the string of reads from the pileup that is generated (e.g. via samtools) for that site within that tissue for sample j.
This format allows for flexibility: amplified stretches need not overlap perfectly across the n samples, and the user can indicate missing matched normal data for sample j.
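Since each row carries 3 fixed columns followed by 14 columns per sample, the number of samples n and the column block for sample j follow directly from the field count. The sketch below shows that arithmetic; the class TestDataRow is a hypothetical helper, and the caller is assumed to have already split the line into fields (the delimiter is not specified on this page).

```java
// Hypothetical helper (not part of HATS) for the 3 + 14 * n input column layout.
final class TestDataRow {
    static final int FIXED_COLUMNS = 3;       // Chromosome, Position, Reference_Allele
    static final int COLUMNS_PER_SAMPLE = 14; // per-sample block described above

    /** Derives n from the total number of fields in one row. */
    static int sampleCount(String[] fields) {
        int rest = fields.length - FIXED_COLUMNS;
        if (rest <= 0 || rest % COLUMNS_PER_SAMPLE != 0) {
            throw new IllegalArgumentException("Unexpected column count: " + fields.length);
        }
        return rest / COLUMNS_PER_SAMPLE;
    }

    /** Returns the 14 columns of sample j, where j is 1-based as in the text. */
    static String[] sampleColumns(String[] fields, int j) {
        int n = sampleCount(fields);
        if (j < 1 || j > n) {
            throw new IllegalArgumentException("Sample index out of range: " + j);
        }
        int start = FIXED_COLUMNS + (j - 1) * COLUMNS_PER_SAMPLE;
        String[] block = new String[COLUMNS_PER_SAMPLE];
        System.arraycopy(fields, start, block, 0, COLUMNS_PER_SAMPLE);
        return block;
    }

    public static void main(String[] args) {
        // Synthetic row with two samples; field values are placeholders.
        String[] fields = new String[FIXED_COLUMNS + 2 * COLUMNS_PER_SAMPLE];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = "c" + i;
        }
        System.out.println(sampleCount(fields));          // prints 2
        System.out.println(sampleColumns(fields, 2)[0]);  // prints c17
    }
}
```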
Running HATS
HATS will take as input the
training and test data files (prepared as indicated above) and will output the
amplified alleles within each amplified region for each sample j in the test data.
For demo purposes, use the test data sample (same as the one linked in the previous section) and a corresponding snippet of CEU training data here. Finally, download this script (tailored for Cygwin) and run it on the command line. The output file in the script is specified after the -O flag.
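For users who prefer to launch HATS from their own code rather than from the demo script, the sketch below assembles an equivalent command line. It is not the contents of the demo script. The class HatsCommand and the file names are hypothetical; the main class hats.HATS and the -O flag come from the usage text on this page. The lib/* classpath wildcard is an assumption that requires Java 6 or later; on Java 1.5 each jar would have to be listed explicitly.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical launcher sketch (not part of HATS): builds the argument list only.
final class HatsCommand {

    static List<String> build(String projectDir, String tumorRegionFile,
                              String trainingFile, String outFile) {
        // Classpath: hats.jar plus the dependency jars assumed to sit in lib/.
        String classpath = projectDir + "/hats.jar" + File.pathSeparator + projectDir + "/lib/*";
        List<String> cmd = new ArrayList<String>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add("hats.HATS");
        cmd.add(tumorRegionFile); // <tumorRegionFilename>
        cmd.add(trainingFile);    // <trainingFilename>
        cmd.add("-O");            // output file flag
        cmd.add(outFile);
        return cmd;
    }

    public static void main(String[] args) {
        // Placeholder paths; replace with real files before running anything.
        List<String> cmd = build("/path/to/project", "tumorRegion.txt", "training.txt", "hats.out");
        System.out.println(cmd);
        // To execute: new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Building the command as a list, rather than as one string, avoids shell quoting problems when file names contain spaces.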
The columns in the output file are:
Chromosome Position Reference_Allele [Columns Sample 1] [Columns Sample 2] ... [Columns Sample n]
where each [Columns Sample j] expands to 12 columns:
Normal_Genotype_Allele1 Normal_Genotype_Allele2 Tumor_Genotype_Allele1 Tumor_Genotype_Allele2 Normal_ReadCount_ReferenceAllele Normal_ReadCount_VariantAllele Tumor_ReadCount_ReferenceAllele Tumor_ReadCount_VariantAllele Amplified_Allele_CalledBy_HATS_Tumor NonAmplfied_Allele_CalledBy_HATS_Tumor Amplified_Allele_CalledBy_Naive_Tumor NonAmplfied_Allele_CalledBy_Naive_Tumor
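A common follow-up is to compare, per sample, the amplified allele in the HATS column with the one in the Naive column. The sketch below does this by column offset, following the 12-column order listed above. The class HatsOutputRow is a hypothetical helper; whitespace-delimited fields, the absence of a header line, and the example row are all assumptions.

```java
// Hypothetical reader (not part of HATS) for the 3 + 12 * n output column layout.
final class HatsOutputRow {
    static final int FIXED_COLUMNS = 3;
    static final int COLUMNS_PER_SAMPLE = 12;
    // Zero-based offsets inside one sample block, in the order listed above.
    static final int AMPLIFIED_BY_HATS = 8;
    static final int AMPLIFIED_BY_NAIVE = 10;

    /** Returns one column of sample j (1-based). */
    static String field(String[] fields, int j, int offset) {
        int index = FIXED_COLUMNS + (j - 1) * COLUMNS_PER_SAMPLE + offset;
        if (j < 1 || offset < 0 || offset >= COLUMNS_PER_SAMPLE || index >= fields.length) {
            throw new IllegalArgumentException("No such column for sample " + j);
        }
        return fields[index];
    }

    /** True when both amplified-allele columns hold the same allele for sample j. */
    static boolean callsAgree(String[] fields, int j) {
        return field(fields, j, AMPLIFIED_BY_HATS).equals(field(fields, j, AMPLIFIED_BY_NAIVE));
    }

    public static void main(String[] args) {
        // Made-up single-sample row, for illustration only.
        String[] fields = "1 1000 A A G A G 10 9 30 11 A G G A".split("\\s+");
        System.out.println(field(fields, 1, AMPLIFIED_BY_HATS));  // prints A
        System.out.println(callsAgree(fields, 1));                // prints false
    }
}
```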
Running HATS: Command-Line Usage
The command-line usage for HATS is:
Usage:
java hats.HATS
[--processTumorDataOneRegion]
<tumorRegionFilename> <trainingFilename> [(-O|--out)
<outFilename>] [--copyNumberTumors <copyNumberTumors>]
[--diploidCoverages <diploidCoverages>] [-b|--bias] [-G|--GEC]
[--ignoreTraining] [-v|--viterbi] [(-l|--log)[:<log_filename>]]
[(-h|--haplen) <length>] [(-w|--windowSize) value_1,value_2,...,value_N ]
[--processTumorDataOneRegion]
<tumorRegionFilename>
The filename representing the tumor amplified region, containing genotypes and pileup read count information for the tumor (and perhaps the matched normal)
<trainingFilename>
The filename for the training data that covers the tumor region of interest
[(-O|--out) <outFilename>]
The specified output file. If none specified, output is written to
standard out.
[--copyNumberTumors <copyNumberTumors>]
The user-specified copy number for each tumor region in each tumor sample. For example, if there are three tumors in the file (with the first tumor possessing one amplified region, the second two such regions, and the third one such region), this option would be: ({3};{2.7,2.9};{3.1}) (note no spaces). If not specified, it is automatically calculated from the region by comparing with the matched normal.
[--diploidCoverages <diploidCoverages>]
The user-specified diploid coverage for each tumor region in each tumor sample. For example, if there are three tumors in the file (with the first tumor possessing one amplified region, the second two such regions, and the third one such region), this option would be: ({35};{10.5,15};{22}) (note no spaces). If not specified, it is automatically calculated from the region by comparing with the matched normal.
[-b|--bias]
Activates the calculation and use of biases for all test samples
[-G|--GEC]
Activates the genotype error correction feature at a *steep* cost of execution time.
[--ignoreTraining]
Ignores the training data. Note that option -G/--GEC is rendered ineffective by this option.
[-v|--viterbi]
Executes the Viterbi algorithm instead of the Forward-Backward (default) algorithm for calling the amplified alleles.
[(-l|--log)[:<log_filename>]]
Logs debug information into the given filename at a *steep* cost of execution time.
[(-h|--haplen) <length>]
The length of the haplotype windows used to leverage LD information from the training data. The minimum size is 31 (optimal for execution time), while greater sizes translate to an increased execution time. (default: 31)
[(-w|--windowSize) value_1,value_2,...,value_N ]
Advanced Option - Analyzing a lengthy amplified region in the test sample can cause either: 1) numeric instability inside the program, or 2) infeasible memory demands (when GEC is turned on). To alleviate either problem, the test region is divided into partially overlapping sliding windows. The first option value (value_1) sets the window size