October 2006 Contents: 1. Introduction 2. Compiling 3. Input 4. Output 5. Generation of Haplotype Counts 1. Introduction Volume_2x2.c and Volume_IxJ.c calculate a number of volume measures of linkage disequilibrium introduced in "Volume Measures for Linkage Disequilibrium." Volume_2x2.c computes exact values of Dvol, Mvol, and Hvol for 2x2 tables by enumerating all possible 2x2 tables with fixed marginal sums. Volume_IxJ.c estimates the values of Mvol and Hvol by the sequential importance sampling algorithm. 2. Compiling The algorithms are written in C language. We can use the command gcc -o executive_file_name c_code_file_name -lm to compile the code and then run the executive file. For example, Volume_2x2.c can be compiled by using the command gcc -o Volume_2x2.exe Volume_2x2.c -lm 3. Input When implementing the executive file, the user will be prompted to provide the names of the input file and the output file. For 2x2 tables, the input file is a kx4 matrix of non-negative integers. The four numbers n_{11}, n_{12}, n_{21} and n_{22} in each row represent a 2x2 table n_{11} n_{12} n_{21} n_{22} and the total number of rows k is the total number of 2x2 tables the user wants to compute the volume measures. An example of the input file Input_2x2.txt is given on this web page. In the input file, the four numbers in each row can also be written in the table format as n_{11} n_{12} n_{21} n_{22} For IxJ tables, each row of the input file looks like I J IxJ n_{11} n_{12} ... n_{1J} n_{21} n_{22} ... n_{2J} ... n_{I1} n_{I2} ... n_{IJ} where I is the number of rows, J is the number of columns, IxJ is the total number of cells in the tables, and n_{11} to n_{IJ} are the entries in the table. The user can attach many 0's at the end of each row to make the input file a rectangular matrix. An example of the input file Input_IxJ.txt is given on this web page. It is also fine to write the input file in the following format: I J IxJ n_{11} n_{12} ... n_{1J} n_{21} n_{22} ... n_{2J} ... n_{I1} n_{I2} ... n_{IJ} For IxJ tables, the user is asked to specify the number of Monte Carlo samples for the sequential importance sampling algorithm. Large Monte Carlo samples will give more accurate estimates of the volume measures, but will also take longer time to run. When preparing the input file, the user needs to make sure that none of the rows and columns have zero sums. If all entries in a row or a column are zeros, several volume measures are not defined, and when this occurs, the code will stop and print an error message. 4. Output In the output file, The results are given in matrix format. For example, the following is the output file corresponding to the input file Input_2x2.txt Dvol Mvol Hvol 0.611111 0.718750 0.681818 0.909091 0.947368 0.933333 0.000000 0.030303 -0.000000 0.210526 0.181818 -0.750000 The following is the output file corresponding to the input file Input_IxJ.txt Mvol Hvol 0.167630 0.167630 0.000000 0.000000 0.457938 0.457938 0.005679 0.001360 The numbers in the i-th row are the volume measures for the 2x2 or IxJ table in the i-th row of the input file. For this example, 100,000 Monte Carlo samples were used in the sequential importance sampling algorithm for IxJ tables. Since the algorithm for IxJ tables is based on Monte Carlo simulations, each run will not give exactly the same result. For example, the following is another output file based on 100,000 Monte Carlo samples for the input file Input_IxJ.txt: Mvol Hvol 0.166420 0.166420 0.000000 0.000000 0.457693 0.457693 0.005943 0.001739 The difference between each run will be small if a large number of Monte Carlo samples are used in the simulation. The precision of the final estimate depends on the number of Monte Carlo samples and how well the proposal distribution in the sequential importance sampling algorithm approximates the target distribution for a given table. It is advisable to conduct multiple trial runs with a small number of Monte Carlo samples (say 1000) to see the variation of the estimates, and then select an appropriate number of Monte Carlo samples that assures an acceptable precision. 5. Generation of Haplotype Counts If the haplotype counts for 2x2 tables are not available, there are two ways to obtain the haplotype counts from genotype data. (i) Use the ldmax routine in the software GOLD, which is available at http://www.sph.umich.edu/csg/abecasis/GOLD/docs/ldmax.html (ii) We also have C code to generate haplotype data from genotype data. There are two files: Haplotype_count.c is used to construct the observed numbers of haplotype counts, and EM.c implements the expectation-maximization (EM) algorithm to estimate haplotype counts based on the output of Haplotype_count.c. The two files are available at http://www.stat.uiuc.edu/~yuguo/software/volume/Haplotype_count.c http://www.stat.uiuc.edu/~yuguo/software/volume/EM.c In order to implement Haplotype_count.c, the input genotype data need to be re-coded in the following way: Original New code 0/0 0 1/1 1 1/2 or 2/1 2 2/2 3 These two C codes need to use gsl library which can be downloaded from http://www.gnu.org/software/gsl/ We need to run Haplotype_count.c first to get the observed numbers haplotype counts, and then use the output from Haplotype_count.c to run EM.c to obtain the haplotype counts. The output of EM.c can be used directly as the input file for Volume_2x2.c.