MetaCRAM

Lossless Compression Tool for Metagenomic Reads

What is MetaCRAM?

MetaCRAM is a pipeline for taxonomy identification and lossless compression of FASTA-format metagenomic reads. It integrates algorithms for taxonomy identification, read alignment, and assembly with a reference-based compression method, and runs these steps in parallel.

Why use MetaCRAM?

We live in a “Big Data era” in which the amount of information overwhelms current storage and processing capacity. Metagenomic data is no exception: the volume of sequencing data is increasing rapidly, which calls for a compression tool suited to long-term archival storage. MetaCRAM addresses this problem. Its performance was evaluated on various metagenomic samples from the NCBI Sequence Read Archive, and the results suggest 2- to 4-fold improvements in compression ratio compared to gzip. On average, the compressed files were 2-13 percent of the size of the original raw metagenomic files. Compression ratios of this order provide substantial storage savings.
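To put the percentages above in context on your own data, you can measure a gzip baseline for a FASTA file. The sketch below is not part of MetaCRAM: the function name `size_percent` and the file name `reads.fasta` are placeholders chosen for illustration. It only reports the size of one file as a percentage of another.

```shell
# Print the size of a compressed file as a percentage of the original file.
# Usage: size_percent <original file> <compressed file>
size_percent() {
    local orig_bytes comp_bytes
    orig_bytes=$(wc -c < "$1") || return 1
    comp_bytes=$(wc -c < "$2") || return 1
    if [ "$orig_bytes" -eq 0 ]; then
        echo "original file is empty" >&2
        return 1
    fi
    # awk handles the floating-point division that plain shell arithmetic lacks.
    awk -v o="$orig_bytes" -v c="$comp_bytes" \
        'BEGIN { printf "%.1f\n", 100 * c / o }'
}

# gzip baseline (reads.fasta is a placeholder file name):
#   gzip -c reads.fasta > reads.fasta.gz
#   size_percent reads.fasta reads.fasta.gz
```

A result of 10.0, for example, means the compressed file is one tenth the size of the original, which corresponds to a 10-fold compression ratio.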

Figure. Block diagram of MetaCRAM.

How do I download MetaCRAM?

You can find MetaCRAM’s source code, installation guidelines, and README in the GitHub repository.

System Requirements

We tested MetaCRAM on a Linux machine with an Intel Core i5-3470 CPU at 3.2 GHz and 16 GB of RAM.

Usage

After following the installation guidelines in our GitHub repository, use the following commands to run MetaCRAM.

Compression

perl MetaCram.pl --compress --output <output directory> --paired <path to reads> --<exGolomb, huffman, golomb>

Example:

[shared3]$ perl MetaCram.pl --compress --output /shared3/MetaCRAM_SRR359032_Huffman --paired /shared3/SRR359032_1.fasta /shared3/SRR359032_2.fasta --huffman
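When compressing many samples, it can help to assemble the command from its parts and check the coding method before anything runs. The following is a minimal sketch, not part of MetaCRAM: the function name `build_compress_cmd` and its argument order are hypothetical, and it assumes the options are typed with plain ASCII double hyphens. It covers only the paired-read form shown in the example and prints the command rather than executing it.

```shell
# Hypothetical helper: print a MetaCram.pl compression command for paired reads.
# Usage: build_compress_cmd <method> <output directory> <reads_1> <reads_2>
build_compress_cmd() {
    if [ "$#" -ne 4 ]; then
        echo "usage: build_compress_cmd <method> <outdir> <reads_1> <reads_2>" >&2
        return 2
    fi
    local method="$1" outdir="$2" reads1="$3" reads2="$4"

    # Accept only the three coding methods listed in the usage line.
    case "$method" in
        exGolomb|huffman|golomb) ;;
        *) echo "unknown coding method: $method" >&2; return 1 ;;
    esac

    # Print the command only; paths containing spaces would need extra quoting.
    printf 'perl MetaCram.pl --compress --output %s --paired %s %s --%s\n' \
        "$outdir" "$reads1" "$reads2" "$method"
}

# Example: review the printed command, then run it yourself.
#   build_compress_cmd huffman /shared3/MetaCRAM_SRR359032_Huffman \
#       /shared3/SRR359032_1.fasta /shared3/SRR359032_2.fasta
```

Printing the command instead of running it keeps the sketch safe to try on a machine where MetaCRAM is not yet installed.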

Decompression

perl MetaDeCram.pl --input <path to folder containing the Round1 and Round2 folders>

Example:

[shared3]$ perl MetaDeCram.pl --input /shared3/MetaCRAM_processedSRR359032_Huffman/MetaCRAM

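(* --paired is optional.)

(* A comma-separated list inside <> indicates a choice: pass one of the listed coding methods, as in the compression example.)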

For more information on options, use:

$ perl MetaCram.pl --help

Reference

“MetaCRAM: An Integrated Pipeline for Metagenomic Data Processing and Compression.”
M. Kim, X. Zhang, J.G. Ligo, F. Farnoud, V.V. Veeravalli, O. Milenkovic.
BMC Bioinformatics 17(1) (2016).

Contact

Minji Kim (mkim158@illinois.edu) and Olgica Milenkovic (milenkov@illinois.edu)
