Yuguo Chen - University of Illinois at Urbana-Champaign

October 2006

Contents:
1. Introduction
2. Compiling
3. Input
4. Output
5. Generation of Haplotype Counts

1. Introduction

Volume_2x2.c and Volume_IxJ.c calculate a number of volume measures of 
linkage disequilibrium introduced in "Volume Measures for Linkage 
Disequilibrium." Volume_2x2.c computes exact values of Dvol, Mvol, and 
Hvol for 2x2 tables by enumerating all possible 2x2 tables with fixed 
marginal sums. Volume_IxJ.c estimates the values of Mvol and Hvol by the 
sequential importance sampling algorithm.

2. Compiling

The algorithms are written in C language. We can use the command

gcc -o executive_file_name c_code_file_name -lm

to compile the code and then run the executive file. For example, 
Volume_2x2.c can be compiled by using the command

gcc -o Volume_2x2.exe Volume_2x2.c -lm

3. Input

When implementing the executive file, the user will be prompted to provide 
the names of the input file and the output file. For 2x2 tables, the input 
file is a kx4 matrix of non-negative integers. The four numbers n_{11}, 
n_{12}, n_{21} and n_{22} in each row represent a 2x2 table

n_{11}  n_{12}
n_{21}  n_{22} 

and the total number of rows k is the total number of 2x2 tables the user 
wants to compute the volume measures. An example of the input file 
Input_2x2.txt is given on this web page. In the input file, the four 
numbers in each row can also be written in the table format as

n_{11}  n_{12}
n_{21}  n_{22} 

For IxJ tables, each row of the input file looks like 

I J IxJ n_{11} n_{12} ... n_{1J} n_{21} n_{22} ... n_{2J} ... n_{I1} n_{I2} ... n_{IJ}

where I is the number of rows, J is the number of columns, IxJ is the 
total number of cells in the tables, and n_{11} to n_{IJ} are the entries 
in the table. The user can attach many 0's at the end of each row to make 
the input file a rectangular matrix. An example of the input file 
Input_IxJ.txt is given on this web page. It is also fine to write the 
input file in the following format:

I J IxJ 
n_{11} n_{12} ... n_{1J} 
n_{21} n_{22} ... n_{2J} 
... 
n_{I1} n_{I2} ... n_{IJ}

For IxJ tables, the user is asked to specify the number of Monte Carlo 
samples for the sequential importance sampling algorithm. Large Monte 
Carlo samples will give more accurate estimates of the volume measures, 
but will also take longer time to run.

When preparing the input file, the user needs to make sure that none of 
the rows and columns have zero sums. If all entries in a row or a column 
are zeros, several volume measures are not defined, and when this occurs, 
the code will stop and print an error message.

4. Output

In the output file, The results are given in matrix format. For example, 
the following is the output file corresponding to the input file 
Input_2x2.txt

Dvol     Mvol     Hvol
0.611111 0.718750 0.681818
0.909091 0.947368 0.933333
0.000000 0.030303 -0.000000
0.210526 0.181818 -0.750000

The following is the output file corresponding to the input file 
Input_IxJ.txt

Mvol     Hvol
0.167630  0.167630
0.000000  0.000000
0.457938  0.457938
0.005679  0.001360

The numbers in the i-th row are the volume measures for the 2x2 or IxJ 
table in the i-th row of the input file. For this example, 100,000 Monte 
Carlo samples were used in the sequential importance sampling algorithm 
for IxJ tables.

Since the algorithm for IxJ tables is based on Monte Carlo simulations, 
each run will not give exactly the same result. For example, the following 
is another output file based on 100,000 Monte Carlo samples for the input 
file Input_IxJ.txt:

Mvol     Hvol
0.166420  0.166420
0.000000  0.000000
0.457693  0.457693
0.005943  0.001739

The difference between each run will be small if a large number of Monte 
Carlo samples are used in the simulation. The precision of the final 
estimate depends on the number of Monte Carlo samples and how well the 
proposal distribution in the sequential importance sampling algorithm 
approximates the target distribution for a given table. It is advisable to 
conduct multiple trial runs with a small number of Monte Carlo samples 
(say 1000) to see the variation of the estimates, and then select an 
appropriate number of Monte Carlo samples that assures an acceptable 
precision.

5. Generation of Haplotype Counts

If the haplotype counts for 2x2 tables are not available, there are two 
ways to obtain the haplotype counts from genotype data.

(i) Use the ldmax routine in the software GOLD, which is available at

http://www.sph.umich.edu/csg/abecasis/GOLD/docs/ldmax.html

(ii) We also have C code to generate haplotype data from genotype data. 
There are two files: Haplotype_count.c is used to construct the observed 
numbers of haplotype counts, and EM.c implements the 
expectation-maximization (EM) algorithm to estimate haplotype counts based 
on the output of Haplotype_count.c. The two files are available at

http://www.stat.uiuc.edu/~yuguo/software/volume/Haplotype_count.c
http://www.stat.uiuc.edu/~yuguo/software/volume/EM.c

In order to implement Haplotype_count.c, the input genotype data need to 
be re-coded in the following way:

Original        New code
0/0             0
1/1             1
1/2 or 2/1      2
2/2             3

These two C codes need to use gsl library which can be downloaded from

http://www.gnu.org/software/gsl/

We need to run Haplotype_count.c first to get the observed numbers 
haplotype counts, and then use the output from Haplotype_count.c to run 
EM.c to obtain the haplotype counts. The output of EM.c can be used 
directly as the input file for Volume_2x2.c.