October 2006
Contents:
1. Introduction
2. Compiling
3. Input
4. Output
5. Generation of Haplotype Counts
1. Introduction
Volume_2x2.c and Volume_IxJ.c calculate a number of volume measures of
linkage disequilibrium introduced in "Volume Measures for Linkage
Disequilibrium." Volume_2x2.c computes exact values of Dvol, Mvol, and
Hvol for 2x2 tables by enumerating all possible 2x2 tables with fixed
marginal sums. Volume_IxJ.c estimates the values of Mvol and Hvol by the
sequential importance sampling algorithm.
2. Compiling
The algorithms are written in C language. We can use the command
gcc -o executive_file_name c_code_file_name -lm
to compile the code and then run the executive file. For example,
Volume_2x2.c can be compiled by using the command
gcc -o Volume_2x2.exe Volume_2x2.c -lm
3. Input
When implementing the executive file, the user will be prompted to provide
the names of the input file and the output file. For 2x2 tables, the input
file is a kx4 matrix of non-negative integers. The four numbers n_{11},
n_{12}, n_{21} and n_{22} in each row represent a 2x2 table
n_{11} n_{12}
n_{21} n_{22}
and the total number of rows k is the total number of 2x2 tables the user
wants to compute the volume measures. An example of the input file
Input_2x2.txt is given on this web page. In the input file, the four
numbers in each row can also be written in the table format as
n_{11} n_{12}
n_{21} n_{22}
For IxJ tables, each row of the input file looks like
I J IxJ n_{11} n_{12} ... n_{1J} n_{21} n_{22} ... n_{2J} ... n_{I1} n_{I2} ... n_{IJ}
where I is the number of rows, J is the number of columns, IxJ is the
total number of cells in the tables, and n_{11} to n_{IJ} are the entries
in the table. The user can attach many 0's at the end of each row to make
the input file a rectangular matrix. An example of the input file
Input_IxJ.txt is given on this web page. It is also fine to write the
input file in the following format:
I J IxJ
n_{11} n_{12} ... n_{1J}
n_{21} n_{22} ... n_{2J}
...
n_{I1} n_{I2} ... n_{IJ}
For IxJ tables, the user is asked to specify the number of Monte Carlo
samples for the sequential importance sampling algorithm. Large Monte
Carlo samples will give more accurate estimates of the volume measures,
but will also take longer time to run.
When preparing the input file, the user needs to make sure that none of
the rows and columns have zero sums. If all entries in a row or a column
are zeros, several volume measures are not defined, and when this occurs,
the code will stop and print an error message.
4. Output
In the output file, The results are given in matrix format. For example,
the following is the output file corresponding to the input file
Input_2x2.txt
Dvol Mvol Hvol
0.611111 0.718750 0.681818
0.909091 0.947368 0.933333
0.000000 0.030303 -0.000000
0.210526 0.181818 -0.750000
The following is the output file corresponding to the input file
Input_IxJ.txt
Mvol Hvol
0.167630 0.167630
0.000000 0.000000
0.457938 0.457938
0.005679 0.001360
The numbers in the i-th row are the volume measures for the 2x2 or IxJ
table in the i-th row of the input file. For this example, 100,000 Monte
Carlo samples were used in the sequential importance sampling algorithm
for IxJ tables.
Since the algorithm for IxJ tables is based on Monte Carlo simulations,
each run will not give exactly the same result. For example, the following
is another output file based on 100,000 Monte Carlo samples for the input
file Input_IxJ.txt:
Mvol Hvol
0.166420 0.166420
0.000000 0.000000
0.457693 0.457693
0.005943 0.001739
The difference between each run will be small if a large number of Monte
Carlo samples are used in the simulation. The precision of the final
estimate depends on the number of Monte Carlo samples and how well the
proposal distribution in the sequential importance sampling algorithm
approximates the target distribution for a given table. It is advisable to
conduct multiple trial runs with a small number of Monte Carlo samples
(say 1000) to see the variation of the estimates, and then select an
appropriate number of Monte Carlo samples that assures an acceptable
precision.
5. Generation of Haplotype Counts
If the haplotype counts for 2x2 tables are not available, there are two
ways to obtain the haplotype counts from genotype data.
(i) Use the ldmax routine in the software GOLD, which is available at
http://www.sph.umich.edu/csg/abecasis/GOLD/docs/ldmax.html
(ii) We also have C code to generate haplotype data from genotype data.
There are two files: Haplotype_count.c is used to construct the observed
numbers of haplotype counts, and EM.c implements the
expectation-maximization (EM) algorithm to estimate haplotype counts based
on the output of Haplotype_count.c. The two files are available at
http://www.stat.uiuc.edu/~yuguo/software/volume/Haplotype_count.c
http://www.stat.uiuc.edu/~yuguo/software/volume/EM.c
In order to implement Haplotype_count.c, the input genotype data need to
be re-coded in the following way:
Original New code
0/0 0
1/1 1
1/2 or 2/1 2
2/2 3
These two C codes need to use gsl library which can be downloaded from
http://www.gnu.org/software/gsl/
We need to run Haplotype_count.c first to get the observed numbers
haplotype counts, and then use the output from Haplotype_count.c to run
EM.c to obtain the haplotype counts. The output of EM.c can be used
directly as the input file for Volume_2x2.c.