View on GitHub


Tool for analysing viral diversity from HTS

Download this project as a .zip file Download this project as a tar.gz file

Here I will try to explain how you can run the various diversitools from the command-line. The core of diversitools is written in perl and makes use of the the Bio-SamTools module written by Lincoln D. Stein. So the various perl script can be run from the command-line to automate the processing on multiple bam files.

There are two basic processing scripts: diversiutils which is used for processing a BAM file and diversifilter which is used for filtering the output according to quality and strand bias criteria.


This script is used to generate the files that contain the frequencies of various mutations from the bam file. In the most basic form, you may just be interested in the number of SNPs at each site of your alignment. From the directory where the perl script is located:

$ perl bin/ -bam path/to/input.bam -ref path/to/reference.fasta -stub out
This is an example using files in the test directory:
$ perl bin/ -bam test/S18_refLPAI.bam -ref test/refLPAI.fa -stub S18

Bear in mind that the bam file needs its associated index file (.bai). All of the arguments above are necessary to run diversiutils. The output from the latter command will be 3 files:

out_entropy.txt file

out_entropy.txt contains the following columns:

out_log.txt file

out_log.txt contains summary statistics from the alignment:

out_read.txt file

out_read.txt contains information about mismatches per read position which can be used to determine whether harsher trimming is required. This table assumes that you have the same number of bases in each read, if not the results may be irrelevant as there will be different coverage for each site of the reads:

diversiutils with ORFs

It is also possible to provide information about the number of non-synonymous and synonymous mutations at each amino acid site, but to do this you need to provide a text file (e.g. CodingRegion.txt) with the information about the protein names and open reading frames, in the following tab delimited format:

Protein	Beg	End	Reference
PB2	28	2307	PB2_Influenza
PB1	25	2301	PB1_Influenza
PB1-F2	119	391	PB1_Influenza
PA	25	2176	PA_Influenza
PA-X_A	25	596	PA_Influenza
PA-X_B	598	784	PA_Influenza

The first column corresponds to the protein name, the names in this column must be unique, the second column corresponds to the start of the orf for that protein, the third is the end of the orf for the protein and the fourth corresponds to the identifier (name) of the reference in your reference file from which the protein is transcribed. Run the command:

$ perl bin/ -bam path/to/input.bam -ref path/to/reference.fasta -orfs CodingRegions.txt -stub out

The latter command produces all of the previous 4 files with an additional out_AA.txt with the information about the non-synonymous, synonymous and stop-codons.

out_AA.txt file

out_AA.txt contains information about the codons and amino acids at each site of the coding regions for each protein. The file contains the following columns:

Processing multiple samples

One of the advantages of being able to run scripts from the command line, is the option to be able to automate the process. For example, we can process all the bam files that have been aligned to the LPAI reference will the following script.

for file in test/*_refLPAI.bam
perl bin/ -bam $file -ref test/refLPAI.fa -orf test/Coding.regions.LPAI.txt -stub ${file%_refLPAI.bam}


Once you have produced processed multiple files, you may want to filter them based on the probabilities of the bases being above the significance level of the binomial test or to remove bases that have strand bias using the Fisher's exact test. For this you can use the diversifilter perl script.

This is an example of filtering the entropy file according to bases that are found below the quality threshold of 0.05 and below the strand bias significance of 0.05.

$ perl bin/ -in S18 -pQ 0.05 -pS 0.05 -stub S18_pQ05pS05

The output is the same as the out_entropy.txt file except that base counts that have not passed the threshold will be replaced with 0 and the entropy will be recalculated using only the bases that have passed the quality threshold.