diversiTools

Here I will try to explain how you can run the various btctools from the command line. The core of btctools is written in Perl and makes use of the Bio-SamTools module written by Lincoln D. Stein. The Perl scripts can therefore be run from the command line to automate the processing of multiple BAM files.

There are two basic processing scripts: btcutils, which processes a single BAM file, and btcmerge, which combines multiple btc outputs.

btcutils

This script is used to generate the files that contain the frequencies of various mutations from the bam file. In the most basic form, you may just be interested in the number of SNPs at each site of your alignment. From the directory where the perl script is located:

$ perl btcutils.pl -bam path/to/input.bam -ref path/to/reference.fasta -stub out

Bear in mind that the BAM file needs its associated index file (.bai). All of the arguments above are required to run btcutils. The command above produces 4 files:

out_entropy.txt which contains the frequency of mutations at each individual site for each of the gene segments
out_log.txt which contains the distribution of mutations per read and the average entropy per gene segment
out_motif.fa a fasta file with 10 bases either side of each mutation
out_read.txt the per read position mutation count and average quality
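
To automate the processing of several BAM files, the command above can be wrapped in a small driver. The following is a minimal sketch in Python, not part of btctools. It uses only the -bam, -ref and -stub options shown above. The helper names (find_index, build_command, run_all), the use of the BAM file name as the stub, and the assumption that the index sits next to the BAM file as either input.bam.bai or input.bai are all illustrative choices.

```python
import subprocess
from pathlib import Path


def find_index(bam):
    """Return the .bai index next to a BAM file, or None if it is missing.

    Assumption: the index is named either input.bam.bai or input.bai.
    """
    bam = Path(bam)
    for candidate in (Path(str(bam) + ".bai"), bam.with_suffix(".bai")):
        if candidate.exists():
            return candidate
    return None


def build_command(bam, ref, script="btcutils.pl"):
    """Build the btcutils command line for one BAM file.

    The stub is derived from the BAM file name (a hypothetical convention).
    """
    stub = Path(bam).stem
    return ["perl", script, "-bam", str(bam), "-ref", str(ref), "-stub", stub]


def run_all(bam_dir, ref):
    """Run btcutils on every indexed BAM file found in a directory."""
    for bam in sorted(Path(bam_dir).glob("*.bam")):
        if find_index(bam) is None:
            print(f"skipping {bam}: no .bai index found")
            continue
        # check=True stops the batch as soon as one run fails
        subprocess.run(build_command(bam, ref), check=True)
```

Calling run_all("path/to/bams", "path/to/reference.fasta") from the directory that holds btcutils.pl would then process each BAM file in turn and skip any file without an index.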

out_entropy.txt file

out_entropy.txt contains the following columns:

Sample: the name of the bam file that has been run
Chr: the gene or chromosome identifier from the reference used for the mapping
Position: the site/position of alignment in the reference
RefBase: the base of the reference sequence at that position
Coverage: the coverage, i.e. number of reads mapping at that position
AvQual: the average base-calling error probability which is related to the Phred quality score
Acnt: the total number of A nucleotides at that position
Apval: average probability of sequencing error for that nucleotide based on quality scores
The values for Ccnt, Cpval, Tcnt, Tpval, Gcnt, Gpval are also listed
entropy(base e): a measure of uncertainty in the dataset used as a means to quantify sequence variability at that particular site
NonRefCnt: the number of reads that have bases different to the reference
CntTv: the total number of transversions at that site
CntTs: the total number of transitions at that site
OrderOfNucs: the order of the nucleotides based on each nucleotide count
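
To make some of these columns more concrete, here is a minimal Python sketch of how such values can be derived from the four nucleotide counts at a site. It assumes that the entropy column is the plain Shannon entropy (natural logarithm) of the nucleotide frequencies and that transitions are the A<->G and C<->T changes relative to RefBase; btcutils itself may differ in detail, for example by weighting with quality scores. The function names are hypothetical. The Phred relation P = 10^(-Q/10) is the standard definition.

```python
import math

# Purine <-> purine and pyrimidine <-> pyrimidine changes
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def phred_to_error_prob(q):
    """Standard Phred relation: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def shannon_entropy(counts):
    """Natural-log Shannon entropy of a dict of nucleotide counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts.values():
        if n > 0:  # zero counts contribute nothing
            p = n / total
            entropy -= p * math.log(p)
    return entropy


def count_ts_tv(ref_base, counts):
    """Split the non-reference counts into (transitions, transversions)."""
    ts = tv = 0
    for base, n in counts.items():
        if base == ref_base:
            continue
        if (ref_base, base) in TRANSITIONS:
            ts += n
        else:
            tv += n
    return ts, tv
```

Under this definition a site with no variation has an entropy of 0, a site split evenly between two nucleotides has ln 2 (about 0.693), and the maximum, reached when all four nucleotides are equally frequent, is ln 4 (about 1.386).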

out_log.txt file

out_log.txt contains summary statistics from the alignment:

Number of reads with inserts: reads with insertions or deletions are excluded from all the btc output files
Number of reads with Ns: these are also excluded from analyses
Number of sequence used: this provides the total number of reads used for generating the btc files
Frequency of mismatches per read (mismatches: number of reads): this provides a distribution of the number of mismatches found per read
Gene Average entropy: the average entropy for each of the genes in the dataset is provided

out_read.txt file

out_read.txt contains information about mismatches per read position, which can be used to determine whether harsher trimming is required. This table assumes that all reads have the same number of bases; if they do not, the results may be misleading because coverage will differ between read positions:

ReadPos: the position in the read (e.g., 1 to 150)
CntRef: number of bases matching the reference at this position
CntNonRef: number of bases not matching the reference at that position in the read
TotalCnt: total number of bases (ref + nonref)
Freq: frequency of mismatches at this read position
AvQual: the average base-calling error probability for that position in a read
AvQualRef: the average base-calling error probability for nucleotides matching the reference
AvQualNonRef: the average base-calling error probability for nucleotides different to the reference
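
As an illustration of how this table could guide trimming, the sketch below recomputes the mismatch frequency and lists read positions that stand out. It assumes Freq is CntNonRef divided by TotalCnt. The cutoff (a multiple of the median frequency) and the function names are hypothetical choices for this example, not something btcutils does.

```python
from statistics import median


def mismatch_freq(cnt_ref, cnt_nonref):
    """Fraction of bases at a read position that differ from the reference."""
    total = cnt_ref + cnt_nonref
    return cnt_nonref / total if total else 0.0


def positions_to_review(rows, factor=3.0):
    """Return read positions whose mismatch frequency exceeds factor x median.

    rows: iterable of (ReadPos, CntRef, CntNonRef) tuples.
    """
    freqs = {pos: mismatch_freq(ref, nonref) for pos, ref, nonref in rows}
    cutoff = factor * median(freqs.values())
    return [pos for pos in sorted(freqs) if freqs[pos] > cutoff]
```

Positions flagged at the start or end of the reads are candidates for harsher trimming; flagged positions elsewhere are more likely to need a closer look at the data.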

btcutils with ORFs

It is also possible to report the number of non-synonymous and synonymous mutations at each amino acid site. To do this, you need to provide a text file (e.g. CodingRegion.txt) listing the protein names and open reading frames in the following tab-delimited format:


Protein	Beg	End	Reference
PB2	28	2307	PB2_Influenza
PB1	25	2301	PB1_Influenza
PB1-F2	119	391	PB1_Influenza
PA	25	2176	PA_Influenza
PA-X_A	25	596	PA_Influenza
PA-X_B	598	784	PA_Influenza

The first column is the protein name; the names in this column must be unique. The second column is the start of the ORF for that protein and the third is its end. The fourth column is the identifier (name) of the sequence in your reference file from which the protein is transcribed. Run the command:

$ perl btcutils.pl -bam path/to/input.bam -ref path/to/reference.fasta -orfs CodingRegion.txt -stub out

This command produces the same 4 files as before, plus an additional out_AA.txt file with information about non-synonymous changes, synonymous changes and stop codons.
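
Before running btcutils with -orfs, it can be worth checking the coding-region file against the format described above. The following Python sketch is one way to do that. It assumes the file starts with the header row shown in the example and that the reference names have already been read from the FASTA file; the function names are hypothetical and the checks are limited to what is stated above (four tab-separated columns, unique protein names, a start no later than the end, a known reference name).

```python
def parse_orfs(text):
    """Parse the tab-delimited coding-region table into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    if header != ["Protein", "Beg", "End", "Reference"]:
        raise ValueError(f"unexpected header: {header}")
    orfs = []
    for ln in lines[1:]:
        fields = ln.split("\t")
        if len(fields) != 4:
            raise ValueError(f"expected 4 tab-separated fields: {ln!r}")
        name, beg, end, ref = fields
        orfs.append(
            {"Protein": name, "Beg": int(beg), "End": int(end), "Reference": ref}
        )
    return orfs


def check_orfs(orfs, reference_names):
    """Return a list of problems found (empty if the table looks fine)."""
    problems = []
    seen = set()
    for orf in orfs:
        name = orf["Protein"]
        if name in seen:
            problems.append(f"duplicate protein name: {name}")
        seen.add(name)
        if orf["Beg"] > orf["End"]:
            problems.append(f"{name}: Beg is after End")
        if orf["Reference"] not in reference_names:
            problems.append(f"{name}: reference {orf['Reference']} not in FASTA")
    return problems
```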

out_AA.txt file

out_AA.txt contains information about the codons and amino acids at each site of the coding regions for each protein. The file contains the following columns:

Sample: name of the sample bam file
Chr: name of the chromosome or gene based on the reference used for the alignment
Protein: name of the protein based on the first column of the coding region information file (e.g., CodingRegion.txt)
AAPosition: the amino acid position, counted from the start of the protein
RefAA: the reference amino acid
RefSite: the position in the reference of the first base of the codon
RefCodon: the reference codon
FstCodonPos: the number of mutations in the first codon position
SndCodonPos: the number of mutations in the second codon position
TrdCodonPos: the number of mutations in the third codon position
CntNonSyn: the number of non-synonymous changes at the amino acid site
CntSyn: the number of synonymous changes at the amino acid site
NbStop: the number of stop codons at this amino acid site
TopAA: the top (most frequently found) amino acid (usually the reference amino acid)
TopAAcnt: the number of times this amino acid was found
SndAA: the second most frequently found amino acid
SndAAcnt: the number of times this amino acid was found
TrdAA: the third most frequent amino acid (this could go on for ever ...)
TrdAAcnt: the number of times the third most frequent amino acid was found
AAcoverage: the amino acid coverage which may be a bit lower than the nucleotide coverage as it only takes into account full codons
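
The relationship between a few of these columns can be sketched in Python as follows. This is an illustration under stated assumptions, not the btcutils implementation: it assumes the standard genetic code, 1-based positions, that RefSite equals the ORF start plus three bases per preceding amino acid, and that a read codon is classified by comparing its translation with that of the reference codon. The function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def ref_site(orf_beg, aa_position):
    """Reference position of the first base of a codon (1-based)."""
    return orf_beg + 3 * (aa_position - 1)


def classify(ref_codon, read_codon):
    """Classify a read codon relative to the reference codon."""
    if read_codon == ref_codon:
        return "identical"
    ref_aa = CODON_TABLE[ref_codon]
    read_aa = CODON_TABLE[read_codon]
    if read_aa == ref_aa:
        return "synonymous"
    if read_aa == "*":
        return "stop"
    return "non-synonymous"
```

For example, with the PB2 entry from the coding-region table above (Beg 28), amino acid 1 starts at reference position 28 and amino acid 2 at position 31. A GAA to GAG change is synonymous (both encode E), whereas TGG to TGA introduces a stop codon.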

btcmerge

Once you have produced multiple btc files, you may want to merge several, either to compare the results from independent replicates, or to compare the intra-host diversity from different individuals, or even to compare closely related viruses in the same or different individuals.