Here I will try to explain how you can run the various diversitools from the command-line. The core of diversitools is written in perl and makes use of the the Bio-SamTools module written by Lincoln D. Stein. So the various perl script can be run from the command-line to automate the processing on multiple bam files.
There are two basic processing scripts: diversiutils which is used for processing a BAM file and diversifilter which is used for filtering the output according to quality and strand bias criteria.
diversiutils
This script is used to generate the files that contain the frequencies of various mutations from the bam file. In the most basic form, you may just be interested in the number of SNPs at each site of your alignment. From the directory where the perl script is located:
$ perl bin/diversiutils.pl -bam path/to/input.bam -ref path/to/reference.fasta -stub out
This is an example using files in the test directory:
$ perl bin/diversiutils.pl -bam test/S18_refLPAI.bam -ref test/refLPAI.fa -stub S18
Bear in mind that the bam file needs its associated index file (.bai). All of the arguments above are necessary to run diversiutils. The output from the latter command will be 3 files:
- out_entropy.txt which contains the frequency of mutations at each individual site for each of the gene segments
- out_log.txt which contains the distribution of mutations per read and the average entropy per gene segment
- out_read.txt the per read position mutation count and average quality
out_entropy.txt file
out_entropy.txt contains the following columns:
- Sample: the name of the bam file that has been run
- Chr: the gene or chromosome identifier from the reference used for the mapping
- Position: the site/position of alignment in the reference
- RefBase: the base of the reference sequence at that position
- Coverage: the coverage, i.e. number of reads mapping at that position
- AvQual: the average base-calling error probability which is related to the Phred quality score
- Acnt: the total number of A nucleotides at that position
- Apval: average probability of sequencing error for that nucleotide based on quality scores
- The values for Ccnt, Cpval, Tcnt, Tpval, Gcnt, Gpval are also listed
- entropy(base e): a measure of uncertainty in the dataset used as a means to quantify sequence variability at that particular site
- NonRefCnt: the number of reads that have bases different to the reference
- CntTv: the total number of transversions at that site
- CntTs: the total number of transitions at that sites
- OrderOfNucs: the order of the nucleotides based on each nucleotide count
- Strandbias: the number of reads supporting each base in the forward and the reverse direction (e.g., A1:0;C83:2;T0:0;G1:0)
- ins_cnt: the number of reads with inserts at that position
- ins_mode: the sequence of the insert that occurs the most frequently
- del_cnt: the number of reads with deletions at that site
- del_mode: the sequence of the deletion that occurs the most frequently
- Ncnt: the total number of N nucleotides
out_log.txt file
out_log.txt contains summary statistics from the alignment:
- Frequency of mismatches per read (mismatches: number of reads): this provides a distribution of the number of mismatches found per read
- Gene Average entropy: the average entropy for each of the genes in the dataset is provided
out_read.txt file
out_read.txt contains information about mismatches per read position which can be used to determine whether harsher trimming is required. This table assumes that you have the same number of bases in each read, if not the results may be irrelevant as there will be different coverage for each site of the reads:
- ReadPos: the position in the read (e.g., 1 to 150)
- CntRef: number of bases matching the reference at this position
- CntNonRef: number of bases not matching the reference at that position in the read
- TotalCnt: total number of bases (ref + nonref)
- Freq: frequency of mismatches at this read position
- AvQual: the average base-calling error probability for that position in a read
- AvQualRef: the average base-calling probability for nucleotides matching the reference
- AvQualNonRef: the average base-calling probability for frequency for nucleotides different to the reference
diversiutils with ORFs
It is also possible to provide information about the number of non-synonymous and synonymous mutations at each amino acid site, but to do this you need to provide a text file (e.g. CodingRegion.txt) with the information about the protein names and open reading frames, in the following tab delimited format:
Protein Beg End Reference
PB2 28 2307 PB2_Influenza
PB1 25 2301 PB1_Influenza
PB1-F2 119 391 PB1_Influenza
PA 25 2176 PA_Influenza
PA-X_A 25 596 PA_Influenza
PA-X_B 598 784 PA_Influenza
The first column corresponds to the protein name, the names in this column must be unique, the second column corresponds to the start of the orf for that protein, the third is the end of the orf for the protein and the fourth corresponds to the identifier (name) of the reference in your reference file from which the protein is transcribed. Run the command:
$ perl bin/diversiutils.pl -bam path/to/input.bam -ref path/to/reference.fasta -orfs CodingRegions.txt -stub out
The latter command produces all of the previous 4 files with an additional out_AA.txt with the information about the non-synonymous, synonymous and stop-codons.
out_AA.txt file
out_AA.txt contains information about the codons and amino acids at each site of the coding regions for each protein. The file contains the following columns:
- Sample: name of the sample bam file
- Chr: name of the chromosome or gene based on the reference used for the alignment
- Protein: name of the protein based on the first column of the coding region information file (e.g., CodingRegion.txt)
- AAPosition: the amino acid position from start position of the protein
- RefAA: the reference amino acid
- RefSite: the reference position for the first codon position
- RefCodon: the reference codon
- FstCodonPos: the number of mutations in the first codon position
- SndCodonPos: the number of mutations in the second codon position
- TrdCodonPos: the number of mutations in the third codon position
- CntNonSyn: the number of non synonymous changes at the amino acid site
- CntSyn: the number of synonymous changes at the amino acid site
- NbStop: the number of stop codons at this amino acid site
- TopAA: the top (most frequently found) amino acid (usually the reference amino acid)
- TopAAcnt: the number of times this amino acid was found
- SndAA: the second most frequently found amino acid
- SndAAcnt: the number of times this amino acid was found
- TrdAA: the third most frequent amino acid (this could go on for ever ...)
- TrdAAcnt: the number of times the third most frequent amino acid was found
- AAcoverage: the amino acid coverage which may be a bit lower than the nucleotide coverage as it only takes into account full codons
Processing multiple samples
One of the advantages of being able to run scripts from the command line, is the option to be able to automate the process. For example, we can process all the bam files that have been aligned to the LPAI reference will the following script.
for file in test/*_refLPAI.bam
do
perl bin/diversiutils.pl -bam $file -ref test/refLPAI.fa -orf test/Coding.regions.LPAI.txt -stub ${file%_refLPAI.bam}
done
diversifilter
Once you have produced processed multiple files, you may want to filter them based on the probabilities of the bases being above the significance level of the binomial test or to remove bases that have strand bias using the Fisher's exact test. For this you can use the diversifilter perl script.
This is an example of filtering the entropy file according to bases that are found below the quality threshold of 0.05 and below the strand bias significance of 0.05.
$ perl bin/diversifilter.pl -in S18 -pQ 0.05 -pS 0.05 -stub S18_pQ05pS05
The output is the same as the out_entropy.txt file except that base counts that have not passed the threshold will be replaced with 0 and the entropy will be recalculated using only the bases that have passed the quality threshold.