BLAST+

External Links:
BLAST website
NCBI BLAST QuickStart

Loading BLAST+

module load blast

Running BLAST+

Do not use BLAST+ on a login node for anything other than downloading data or basic file manipulation. To use BLAST+ interactively for test and debug, use an interactive session. To confirm that the session is not on a login node, type hostname.

BLAST+ should be run from the /share directory. The standard output, standard error, and by default all generated data from the job will be placed in the directory where the bsub command was issued. Running BLAST+ from the /home directory can quickly fill a users' quota.

Creating a database file

module load blast
makeblastdb -in [input database] -out [output database] -dbtype [database type]

Batch script for running BLAST+

Serial

For BLAST+ commands that are mostly limited to I/O, there may be little value from running in parallel; check the BLAST+ documentation for specific commands.

Here is a template batch script for running BLAST+ in serial.

#!/bin/bash
#BSUB -n 1
#BSUB -W 60
#BSUB -R span[hosts=1]
#BSUB -o out.%J
#BSUB -e err.%J
module load blast
blastn -query [query file] -db [database file] -out [output file]

Shared memory

To run BLAST with shared memory, i.e. multithreaded but limited to one node, use the command line argument -numthreads followed by the number of threads to be used. This number should match the number of cores requested by LSF. If -numthreads 4 is used, then #BSUB -n 4 must be used. To avoid errors when modifying a batch script, please use the LSF environment variable for the number of assigned cores $LSB_DJOB_NUMPROC as shown in the following LSF script.

#!/bin/bash
#BSUB -n 10
#BSUB -W 60
#BSUB -R span[hosts=1]
#BSUB -o out.%J
#BSUB -e err.%J
module load blast
blastn -query [query file] -db [database file] -num_threads $LSB_DJOB_NUMPROC -out [output file]

Distributed memory

Running with more processors does not guarantee a BLAST job will run faster, as this depends on the input and the command used. Please do a scaling test before requesting a large number of processors.

To run BLAST with distributed memory using MPI, i.e. processes may span multiple nodes, use the following template:

#!/bin/bash
#BSUB -n 16
#BSUB -W 60
#BSUB -o out.%J
#BSUB -e err.%J
module load openmpi-gcc
module load blast
mpirun blastn -query [query file] -db [database file] -out [output file]

BLAST tutorial

Here is a short tutorial explaining the basic usage of BLAST, thanks to Postdoc John Soghigian from the Entomology and Plant Pathology department.

The tutorial should take less than 10 minutes of computation.

Objective of tutorial

The goal of this exercise is to identify a contaminant fungal sequence in a bacteria genome assembly. We are using an E. coli K12 assembly from GenBank (GCF_000005845.2) with artificially added fungal sequences as an example assembly. We will use a yeast reference (sacCer3 Saccharomyces cerevisiae S288C, also from GenBank) as a reference. We will use BLAST+ to query our E. coli assembly against the yeast reference, sort the resulting BLAST hits, and identify the contigs in the bacterial assembly that have some similarity to yeast. BLAST has many options, and only a limited set are covered here - see the BLAST manual for much more information.

Get the files

Usually we will have downloaded the necessary files from GenBank and have transferred it to Henry2 using sftp, but for the tutorial, we will copy them from Henry2 to our shared folder.

Change to your scratch directory on /share, copy and untar the example directory:

cd /share/$GROUP/$USER
cp /usr/local/apps/examples/apps/blast/blast_tutorial.tar .
tar -xvf blast_tutorial.tar
cd blast_tutorial
ls

The directory should contain two files, Ecoli_K12_GCF_000005845_2_modified.fasta and Scerevisiae_reference_GCF_000146045_2.fasta.

Build the database

First we need to build our reference database. There are other options for makeblastdb. Here, we are using a yeast reference, in the file Scerevisiae_reference_GCF_000146045_2.fasta.

Creating a database is a simple file operation and does not use significant memory or computation, therefore it can be done on a login node; however, do not use a script to create multiple databases simultaneously.

We will go over this tutorial using an interactive compute node so that different commands, including help commands, can be tested. The proper way to run BLAST is with an LSF script, and one is given at the end of the tutorial.

Reserve an interactive compute node, load BLAST and create the database:

bsub -Is -n 4 -R "span[hosts=1]" -W 30 bash
module load blast
makeblastdb -in Scerevisiae_reference_GCF_000146045_2.fasta -parse_seqids -out Saccer_ref -dbtype nucl

The out parameter determines the database's name prefix. The dbtype determines the sequence type, e.g. nucleotide or protein. Use -help for more options.

makeblastdb -help

Running the BLAST command to create the database will output

Building a new DB, current time: 01/14/2020 08:23:48
New DB name:   /share/group/user/blast_tutorial/Saccer_ref
New DB title:  Scerevisiae_reference_GCF_000146045_2.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 17 sequences in 0.261676 seconds.

Run a query

With our reference database created, we can now BLAST a query against it and retrieve hit information. We will use the BLAST executable blastn, which is by default tuned for interspecific sequence similarity. In this example, we use the output format 6, which is a tabular format. Other formats for BLAST output are available (see list at https://www.ncbi.nlm.nih.gov/books/NBK279684 ). We specify multiple threads here (4). Note that if a reference database has a single sequence or is very small, the use of multiple threads does not increase the speed of BLAST.

blastn -query Ecoli_K12_GCF_000005845_2_modified.fasta -db Saccer_ref -outfmt 6 -num_threads 4 -out example_BLAST_results.txt

example_BLAST_results.txt

blastn

-outfmt 6

query_id   subject_id   pct_identity   aln_length   n_of_mismatches   gap_openings

q_start    q_end        s_start        s_end        e_value           bit_score

E-values provide a measure of confidence that the query matched the database hit by chance alone, based on the relative size of the database. Smaller e-values indicate a query hit to the database is less likely to be due to change alone. However, e-values are influenced by the database size; thus, they can be problematic for determining sequence homology alone. Bit score measures sequence similarity independent of query sequence length and database size and is normalized based on the pairwise alignment score. Higher scores are better for bitscores.

We can choose which columns to return by BLAST as well, modifying the BLAST command as such (see the BLAST manual for a list of command-line options):

blastn -query Ecoli_K12_GCF_000005845_2_modified.fasta -db Saccer_ref -outfmt '6 qseqid sseqid sacc pident evalue bitscore' -num_threads 4 -out example_BLAST_results.txt

Examine the results

Now that we have our hits, we may want to easily investigate the best matches to potential fungal contaminants in the bacterial assembly. The nice thing about the tabular format is it is easily sortable - for instance, if we wanted the largest scores at the top of the file, we could use:

sort -k 6,6gr example_BLAST_results.txt -o sorted_BLAST_results.txt

Batch script for tutorial

To run the tutorial using LSF, including creating the database, create the following LSF script, called submit.sh:

#!/bin/bash
#BSUB -n 4
#BSUB -W 60
#BSUB -R span[hosts=1] 
#BSUB -o out.%J
#BSUB -e err.%J
module load blast
makeblastdb -in Scerevisiae_reference_GCF_000146045_2.fasta -parse_seqids -out Saccer_ref -dbtype nucl
blastn -query Ecoli_K12_GCF_000005845_2_modified.fasta -db Saccer_ref -outfmt 6 -num_threads $LSB_DJOB_NUMPROC -out example_BLAST_results.txt
sort -k 6,6gr example_BLAST_results.txt -o sorted_BLAST_results.txt

bsub < submit.sh

NCBI BLAST databases on Henry2

Henry2 hosts a subset of BLAST databases, including:

nr
nt
refseq_protein
refseq_rna
swissprot

The databases are located here:

/gpfs_partners/databases/ncbi/blast

Bioinformatics Users Group

This group consists of NC State faculty, staff, and students who are using bioinformatics software. It will be used to organize and announce meetings and for users to exchange information and advice on using various bioinformatics applications. Please share this announcement with others who might be interested in joining this group.

Last modified: April 02 2024 13:40:31.