BLAST¶
Using multiple threads¶
Benchmark the performance of ncbi-blast+ using multiple threads.
Data Sets¶
Query Sequences: 2000 cdna sequences of Aradopsis ( TAIR10_cdna_2000seq.fa) -db option: nt BLASTDB: export BLASTDB=/work/LAS4/BioDatabase/BLASTdb/NCBI/Archives/Current(downloaded on 20171115)
Command¶
time blastn -db nt -query TAIR10_cdna_2000seq.fa -out TAIR10_cdna_2000seq.fa.out -num_threads 4
Results¶
Case 1¶
Only one blast job is running while 16 cores are allocated. The input file has 2,000 sequences. core(s) -- Time(real/user/sys)
1 -- 16m53/15m48/0m55 2 -- 09m54/16mxx/0m54 4 -- 06m22/16m29/0m55 8 -- 04m37/17mxx/0m56 16 -- 03m43/18mxx/0m59
Case 2¶
Two blast jobs are running, 2,000 sequences/8 cores/each, all 16 cores are used for 4,000 sequences.
8 -- 05m26/17m09/0m54 8 -- 05m26/17m10/0m56
Case 3¶
Four blast jobs are running, 2,000 sequences/4 cores/each, all 16 cores are used for 8,000 sequences.
4 -- 10m35/17m41/0m54 4 -- 10m55/18m07/0m55 4 -- 11m03/18m27/0m55 4 -- 10m54/18m15/0m55
Case 4¶
Estimate the run time for blasting 80,000 sequences in various settings:
16 cores/job, 1 job/node --- 40*03m43=148.7m=2.5h 08 cores/job, 2 jobs/node --- 20*05m26=108.7m=1.8h 04 cores/job, 4 jobs/node ----10*10m52=108.7m=1.8h
Observation while blast running¶
-
Sequences are processed in a batch mode, and the output is done through one thread. No matter how many threads were used while blasting TAIR10_cdna_2000seq.fa in Case 1, the first batch of 162 sequences were processed, then the rest sequences ( 2000-162) are processed in the 2nd batch. As the query sequences are cDNA of Arabidopsis, there are huge amount of hits, without restricting the output, the time is significant for writing output the 2nd batch of blasting through one thread.
Suggestion: set evalue and max_target_seqs to restrict outputs. -
Use parallel to distribute blast jobs over a few nodes.
Based on the estimation on run time to blast 80,000 sequences, the difference in the run times is not significant among the three cases:
- Using 16 cores and not splitting the sequence file;
- Using 8 cores and split the sequence file into 2 files; or
- Using 4 cores/job and splitting the sequence file into 4 files.
Suggestion: for blasting ~ 1 million sequences on condo, allocate one node for 16 cores, and set -num_threads =16.
If you have a few million sequences, you may be better off if you split the large sequence file into a few small ones, files, allocate more than one node, using gnu parallel to distribute the jobs over the nodes, and each job will use 8 or 4 cores.
Using GNU Parallel¶
Demonstrate how to use gnu parallel to do blast for millions of query sequences.
The following section demo:
- Split the large sequence file into 8 small ones
- Allocate 2 nodes with 32 cores, distribute the 8 jobs over the two nodes using parallel, each job uses 4 cores.
Example: Split a huge sequence file into 8 small ones, and run parallel to distribute 8 blast jobs over 32 cores of 2 nodes.
Demo Scripts: 1node-parallel-blast_interactive.sh 1-or-2-node-parallel-blast.script
Interactively split the large file into 8 using pyfasta on a computational node Ref for pyfasta: https://pypi.python.org/pypi/pyfasta/
head-node>salloc -N -n 1 -t 1:0:0
compute-node> cd /to/your/sequence/dir
compute-node> module load python
pyfasta does not allow the duplicated sequences in the fasta file to remove the duplicated sequences
compute-node> /work/LAS4/rit/tutorial/BLAST+/remove_duplicate_fasta.py huge_DNA_fasta.fa huge_DNA_fasta_uniq.fa
compute-node> pyfasta split -n 8 huge_DNA_fasta_uniq.fa
the output files are huge_DNA_fasta_uniq.[0...7].fa.
The following is a slurm queue script to distribute the 8 blast jobs for 8 sequence files (huge_DNA_fasta_uniq.[0...7].fa) over 32 cores of 2 nodes.
#!/bin/bash
#SBATCH -N 2 # allocate 2 nodes
#SBATCH -n 16 # 16 cores/node
#SBATCH -t 2:0:0 # 2 hours
# cd to where the sbatch script was submitted.
cd $SLURM_SUBMIT_DIR
mkdir output
module load parallel
module load ncbi-blast
export BLASTDB=/work/LAS4/BioDatabase/BLASTdb/NCBI/Archives/Current
#the env needs to be exported to the remote node through the env PARALLEL from the head node
export PARALLEL="--env BLASTDB "
export PARALLEL="$PARALLEL --workdir . --env PATH --env LD_LIBRARY_PATH "
export PARALLEL="$PARALLEL --env LOADEDMODULES --env _LMFILES_ --env MODULE_VERSION --env MODULEPATH --env MODULEVERSION_STACK --env MODULESHOME "
export PARALLEL="$PARALLEL --env OMP_DYNAMICS --env OMP_MAX_ACTIVE_LEVELS --env OMP_NESTED --env OMP_NUM_THREADS --env OMP_SCHEDULE " export PARALLEL="$PARALLEL --env OMP_STACKSIZE --env OMP_THREAD_LIMIT --env OMP_WAIT_POLICY"
# run 8 blast jobs in one node, each job will use 2 threads.
# --noswap: if OS is busy doing memory swap in and out, not issue another job
ls huge_DNA_fasta_uniq.*.fa | parallel --sshloginfile $PBS_NODEFILE --workdir $SLURM_SUBMIT_DIR -j 8 --noswap 'blastn -db nt -query {} -out output-1node/{}.out -num_threads 2'
# run 8 blast jobs in two nodes, each job will use 4 threads.
ls huge_DNA_fasta_uniq.*.fa | parallel --sshloginfile $PBS_NODEFILE --workdir $SLURM_SUBMIT_DIR -j 8 --noswap 'blastn -db nt -query {} -out output-2node/{}.out -num_threads 4'