
diversiTools

Command-line tutorial


This tutorial explains how to run the various btctools from the command line. The core of btctools is written in Perl and uses the Bio-SamTools module written by Lincoln D. Stein, so the Perl scripts can be run from the command line to automate the processing of multiple BAM files.

There are two basic processing scripts: btcutils, which processes a single BAM file, and btcmerge, which combines multiple btc outputs.

btcutils

This script generates the files that contain the frequencies of various mutations found in the BAM file. In its most basic form, you may just be interested in the number of SNPs at each site of your alignment. From the directory where the Perl script is located, run:

$ perl btcutils.pl -bam path/to/input.bam -ref path/to/reference.fasta -stub out
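To process several BAM files in one go, the command above can be wrapped in a loop. The following is only a sketch under stated assumptions: the function name `run_btcutils_batch` and the `DRY_RUN` switch are hypothetical and not part of btctools, all BAM files are assumed to share one reference, and the index is assumed to be named `input.bam.bai` (the default name written by `samtools index`). BAM files without an index are skipped, since btcutils needs the index to run.

```shell
# Run btcutils.pl on every BAM file in a directory (hypothetical helper).
# Set DRY_RUN=1 to print the commands instead of running them.
run_btcutils_batch() {
    bam_dir=$1   # directory holding the *.bam files (assumed layout)
    ref=$2       # reference FASTA passed to -ref
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue        # the glob matched nothing
        # Assumption: the index sits next to the BAM as <name>.bam.bai
        if [ ! -e "$bam.bai" ]; then
            echo "skipping $bam: no .bai index found" >&2
            continue
        fi
        stub=$(basename "$bam" .bam)     # output prefix taken from the file name
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "perl btcutils.pl -bam $bam -ref $ref -stub $stub"
        else
            perl btcutils.pl -bam "$bam" -ref "$ref" -stub "$stub"
        fi
    done
}
```

Called from the directory containing btcutils.pl, `DRY_RUN=1 run_btcutils_batch bams path/to/reference.fasta` would list the commands for every indexed BAM file in a (hypothetical) `bams` directory; dropping `DRY_RUN=1` runs them, writing one set of output files per sample.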

Bear in mind that the BAM file needs its associated index file (.bai). All of the arguments above are required to run btcutils. Running btcutils.pl with these arguments produces 4 files:

out_entropy.txt file

out_entropy.txt contains the following columns:

out_log.txt file

out_log.txt contains summary statistics from the alignment:

out_read.txt file

out_read.txt contains information about mismatches per read position, which can be used to determine whether harsher trimming is required. This table assumes that every read has the same number of bases; if not, the results may be misleading, because coverage will differ between read positions.
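One way to check the equal-read-length assumption before relying on out_read.txt is to tabulate the read lengths in the BAM file. This is a sketch, not part of btctools: the function name `read_length_counts` is hypothetical, and it assumes the `samtools` command-line tool is installed. It reads SAM-formatted text, where column 10 holds the read sequence.

```shell
# Tabulate read lengths from SAM-formatted lines on standard input.
# Output: one line per distinct length, as "length<TAB>number of reads".
read_length_counts() {
    awk -F '\t' '
        # Skip header lines (start with "@") and records with no stored sequence
        $0 !~ /^@/ && $10 != "*" { n[length($10)]++ }
        END { for (len in n) print len "\t" n[len] }
    ' | sort -n
}
```

For example, `samtools view path/to/input.bam | read_length_counts` prints a single line if all reads have the same length. Several lines indicate mixed read lengths, in which case the per-position mismatch counts should be interpreted with care.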

btcutils with ORFs

btcutils can also report the number of non-synonymous and synonymous mutations at each amino acid site. To do this, you need to provide a text file (e.g. CodingRegions.txt) listing the protein names and open reading frames (ORFs) in the following tab-delimited format:


Protein	Beg	End	Reference
PB2	28	2307	PB2_Influenza
PB1	25	2301	PB1_Influenza
PB1-F2	119	391	PB1_Influenza
PA	25	2176	PA_Influenza
PA-X_A	25	596	PA_Influenza
PA-X_B	598	784	PA_Influenza

The first column is the protein name; the names in this column must be unique. The second column is the start of the ORF for that protein and the third is the end of the ORF. The fourth column is the identifier (name) of the sequence in your reference file from which the protein is transcribed. Run the command:

$ perl btcutils.pl -bam path/to/input.bam -ref path/to/reference.fasta -orfs CodingRegions.txt -stub out

This command produces all of the previous 4 files, plus an additional out_AA.txt file with information about non-synonymous mutations, synonymous mutations and stop codons.
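Because the ORF file is written by hand, it may be worth checking it against the rules described above before running btcutils. The following is a rough sketch of such a check and is not part of btctools: the function name `check_orfs` is hypothetical. It assumes the header line (Protein, Beg, End, Reference) is present as shown above, that Beg is smaller than End, and that a reference identifier is the first word after ">" in the FASTA file.

```shell
# Check a tab-delimited ORF table against a reference FASTA (hypothetical helper).
# Prints one message per problem found; prints nothing if the file looks consistent.
check_orfs() {
    orfs=$1   # e.g. CodingRegions.txt
    ref=$2    # reference FASTA
    awk -F '\t' '
        # Pass 1 (FASTA): remember each identifier, the first word after ">"
        NR == FNR {
            if (/^>/) { id = substr($0, 2); sub(/[ \t].*/, "", id); ids[id] = 1 }
            next
        }
        # Pass 2 (ORF table): skip the header line and any blank lines
        FNR == 1 && $1 == "Protein" { next }
        NF == 0 { next }
        NF != 4 { print FILENAME ":" FNR ": expected 4 tab-separated columns, found " NF; next }
        $1 in seen { print FILENAME ":" FNR ": duplicate protein name " $1 }
        { seen[$1] = 1 }
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { print FILENAME ":" FNR ": Beg and End must be whole numbers"; next }
        $2 + 0 >= $3 + 0 { print FILENAME ":" FNR ": Beg is not smaller than End" }
        !($4 in ids) { print FILENAME ":" FNR ": reference " $4 " not found in FASTA" }
    ' "$ref" "$orfs"
}
```

Running `check_orfs CodingRegions.txt path/to/reference.fasta` with no output suggests the table is at least structurally consistent; it does not check that the coordinates describe a valid coding region.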

out_AA.txt file

out_AA.txt contains information about the codons and amino acids at each site of the coding regions for each protein. The file contains the following columns:

btcmerge

Once you have produced multiple btc files, you may want to merge several of them: to compare the results from independent replicates, to compare the intra-host diversity of different individuals, or even to compare closely related viruses in the same or different individuals.