DiversiTools

Here I will try to explain how you can run the various diversitools from the command-line. The core of diversitools is written in perl and makes use of the the Bio-SamTools module written by Lincoln D. Stein. So the various perl script can be run from the command-line to automate the processing on multiple bam files.

There are two basic processing scripts: diversiutils which is used for processing a BAM file and diversifilter which is used for filtering the output according to quality and strand bias criteria.

diversiutils

This script is used to generate the files that contain the frequencies of various mutations from the bam file. In the most basic form, you may just be interested in the number of SNPs at each site of your alignment. From the directory where the perl script is located:

$ perl bin/diversiutils.pl -bam path/to/input.bam -ref path/to/reference.fasta -stub out

This is an example using files in the test directory:

$ perl bin/diversiutils.pl -bam test/S18_refLPAI.bam -ref test/refLPAI.fa -stub S18

Bear in mind that the bam file needs its associated index file (.bai). All of the arguments above are necessary to run diversiutils. The output from the latter command will be 3 files:

out_entropy.txt which contains the frequency of mutations at each individual site for each of the gene segments
out_log.txt which contains the distribution of mutations per read and the average entropy per gene segment
out_read.txt the per read position mutation count and average quality

out_entropy.txt file

out_entropy.txt contains the following columns:

Sample: the name of the bam file that has been run
Chr: the gene or chromosome identifier from the reference used for the mapping
Position: the site/position of alignment in the reference
RefBase: the base of the reference sequence at that position
Coverage: the coverage, i.e. number of reads mapping at that position
AvQual: the average base-calling error probability which is related to the Phred quality score
Acnt: the total number of A nucleotides at that position
Apval: average probability of sequencing error for that nucleotide based on quality scores
The values for Ccnt, Cpval, Tcnt, Tpval, Gcnt, Gpval are also listed
entropy(base e): a measure of uncertainty in the dataset used as a means to quantify sequence variability at that particular site
NonRefCnt: the number of reads that have bases different to the reference
CntTv: the total number of transversions at that site
CntTs: the total number of transitions at that sites
OrderOfNucs: the order of the nucleotides based on each nucleotide count
Strandbias: the number of reads supporting each base in the forward and the reverse direction (e.g., A1:0;C83:2;T0:0;G1:0)
ins_cnt: the number of reads with inserts at that position
ins_mode: the sequence of the insert that occurs the most frequently
del_cnt: the number of reads with deletions at that site
del_mode: the sequence of the deletion that occurs the most frequently
Ncnt: the total number of N nucleotides

out_log.txt file

out_log.txt contains summary statistics from the alignment:

Frequency of mismatches per read (mismatches: number of reads): this provides a distribution of the number of mismatches found per read
Gene Average entropy: the average entropy for each of the genes in the dataset is provided

out_read.txt file

out_read.txt contains information about mismatches per read position which can be used to determine whether harsher trimming is required. This table assumes that you have the same number of bases in each read, if not the results may be irrelevant as there will be different coverage for each site of the reads:

ReadPos: the position in the read (e.g., 1 to 150)
CntRef: number of bases matching the reference at this position
CntNonRef: number of bases not matching the reference at that position in the read
TotalCnt: total number of bases (ref + nonref)
Freq: frequency of mismatches at this read position
AvQual: the average base-calling error probability for that position in a read
AvQualRef: the average base-calling probability for nucleotides matching the reference
AvQualNonRef: the average base-calling probability for frequency for nucleotides different to the reference

diversiutils with ORFs

It is also possible to provide information about the number of non-synonymous and synonymous mutations at each amino acid site, but to do this you need to provide a text file (e.g. CodingRegion.txt) with the information about the protein names and open reading frames, in the following tab delimited format:


Protein	Beg	End	Reference
PB2	28	2307	PB2_Influenza
PB1	25	2301	PB1_Influenza
PB1-F2	119	391	PB1_Influenza
PA	25	2176	PA_Influenza
PA-X_A	25	596	PA_Influenza
PA-X_B	598	784	PA_Influenza

The first column corresponds to the protein name, the names in this column must be unique, the second column corresponds to the start of the orf for that protein, the third is the end of the orf for the protein and the fourth corresponds to the identifier (name) of the reference in your reference file from which the protein is transcribed. Run the command:

$ perl bin/diversiutils.pl -bam path/to/input.bam -ref path/to/reference.fasta -orfs CodingRegions.txt -stub out

The latter command produces all of the previous 4 files with an additional out_AA.txt with the information about the non-synonymous, synonymous and stop-codons.

out_AA.txt file

out_AA.txt contains information about the codons and amino acids at each site of the coding regions for each protein. The file contains the following columns:

Sample: name of the sample bam file
Chr: name of the chromosome or gene based on the reference used for the alignment
Protein: name of the protein based on the first column of the coding region information file (e.g., CodingRegion.txt)
AAPosition: the amino acid position from start position of the protein
RefAA: the reference amino acid
RefSite: the reference position for the first codon position
RefCodon: the reference codon
FstCodonPos: the number of mutations in the first codon position
SndCodonPos: the number of mutations in the second codon position
TrdCodonPos: the number of mutations in the third codon position
CntNonSyn: the number of non synonymous changes at the amino acid site
CntSyn: the number of synonymous changes at the amino acid site
NbStop: the number of stop codons at this amino acid site
TopAA: the top (most frequently found) amino acid (usually the reference amino acid)
TopAAcnt: the number of times this amino acid was found
SndAA: the second most frequently found amino acid
SndAAcnt: the number of times this amino acid was found
TrdAA: the third most frequent amino acid (this could go on for ever ...)
TrdAAcnt: the number of times the third most frequent amino acid was found
AAcoverage: the amino acid coverage which may be a bit lower than the nucleotide coverage as it only takes into account full codons

Processing multiple samples

One of the advantages of being able to run scripts from the command line, is the option to be able to automate the process. For example, we can process all the bam files that have been aligned to the LPAI reference will the following script.

for file in test/*_refLPAI.bam
do 
perl bin/diversiutils.pl -bam $file -ref test/refLPAI.fa -orf test/Coding.regions.LPAI.txt -stub ${file%_refLPAI.bam}
done

diversifilter

Once you have produced processed multiple files, you may want to filter them based on the probabilities of the bases being above the significance level of the binomial test or to remove bases that have strand bias using the Fisher's exact test. For this you can use the diversifilter perl script.

This is an example of filtering the entropy file according to bases that are found below the quality threshold of 0.05 and below the strand bias significance of 0.05.

$ perl bin/diversifilter.pl -in S18 -pQ 0.05 -pS 0.05 -stub S18_pQ05pS05

The output is the same as the out_entropy.txt file except that base counts that have not passed the threshold will be replaced with 0 and the entropy will be recalculated using only the bases that have passed the quality threshold.