module load blast
Do not use BLAST+ on a login node for anything other than downloading data or basic file manipulation. To use BLAST+ interactively for test and debug, use an interactive session. To confirm that the session is not on a login node, type
BLAST+ should be run from the /share directory. The standard output, standard error, and by default all generated data from the job will be placed in the directory where the bsub command was issued. Running BLAST+ from the /home directory can quickly fill a users' quota.
module load blast makeblastdb -in [input database] -out [output database] -dbtype [database type]
For BLAST+ commands that are mostly limited to I/O, there may be little value from running in parallel; check the BLAST+ documentation for specific commands.
Here is a template batch script for running BLAST+ in serial.
#!/bin/tcsh #BSUB -n 1 #BSUB -W 60 #BSUB -R span[hosts=1] #BSUB -o out.%J #BSUB -e err.%J module load blast blastn -query [query file] -db [database file] -out [output file]
To run BLAST with shared memory, i.e. multithreaded but limited to one node, use the command line argument -numthreads followed by the number of threads to be used. This number should match the number of cores requested by LSF. If -numthreads 4 is used, then #BSUB -n 4 must be used. To avoid errors when modifying a batch script, please use the LSF environment variable for the number of assigned cores $LSB_DJOB_NUMPROC as shown in the following LSF script.
#!/bin/tcsh #BSUB -n 10 #BSUB -W 60 #BSUB -R span[hosts=1] #BSUB -o out.%J #BSUB -e err.%J module load blast blastn -query [query file] -db [database file] -num_threads $LSB_DJOB_NUMPROC -out [output file]
Running with more processors does not guarantee a BLAST job will run faster, as this depends on the input and the command used. Please do a scaling test before requesting a large number of processors.
To run BLAST with distributed memory using MPI, i.e. processes may span multiple nodes, use the following template:
#!/bin/tcsh #BSUB -n 16 #BSUB -W 60 #BSUB -o out.%J #BSUB -e err.%J module load openmpi-gcc module load blast mpirun blastn -query [query file] -db [database file] -out [output file]
Here is a short tutorial explaining the basic usage of BLAST, thanks to Postdoc John Soghigian from the Entomology and Plant Pathology department.
The tutorial should take less than 10 minutes of computation.
The goal of this exercise is to identify a contaminant fungal sequence in a bacteria genome assembly. We are using an E. coli K12 assembly from GenBank (GCF_000005845.2) with artificially added fungal sequences as an example assembly. We will use a yeast reference (sacCer3 Saccharomyces cerevisiae S288C, also from GenBank) as a reference. We will use BLAST+ to query our E. coli assembly against the yeast reference, sort the resulting BLAST hits, and identify the contigs in the bacterial assembly that have some similarity to yeast. BLAST has many options, and only a limited set are covered here - see the BLAST manual for much more information.
Usually we will have downloaded the necessary files from GenBank and have transferred it to Henry2 using sftp, but for the tutorial, we will copy them from Henry2 to our shared folder.
Change to your scratch directory on /share, copy and untar the example directory:
cd /share/$GROUP/$USER cp /usr/local/apps/examples/apps/blast/blast_tutorial.tar . tar -xvf blast_tutorial.tar cd blast_tutorial ls
The directory should contain two files, Ecoli_K12_GCF_000005845_2_modified.fasta and Scerevisiae_reference_GCF_000146045_2.fasta.
First we need to build our reference database. There are other options for makeblastdb. Here, we are using a yeast reference, in the file Scerevisiae_reference_GCF_000146045_2.fasta.
Creating a database is a simple file operation and does not use significant memory or computation, therefore it can be done on a login node; however, do not use a script to create multiple databases simultaneously.
We will go over this tutorial using an interactive compute node so that different commands, including help commands, can be tested. The proper way to run BLAST is with an LSF script, and one is given at the end of the tutorial.
Reserve an interactive compute node, load BLAST and create the database:
bsub -Is -n 4 -R "span[hosts=1]" -W 30 tcsh module load blast makeblastdb -in Scerevisiae_reference_GCF_000146045_2.fasta -parse_seqids -out Saccer_ref -dbtype nucl
The out parameter determines the database's name prefix. The dbtype determines the sequence type, e.g. nucleotide or protein. Use -help for more options.
Running the BLAST command to create the database will output
Building a new DB, current time: 01/14/2020 08:23:48 New DB name: /share/group/user/blast_tutorial/Saccer_ref New DB title: Scerevisiae_reference_GCF_000146045_2.fasta Sequence type: Nucleotide Keep MBits: T Maximum file size: 1000000000B Adding sequences from FASTA; added 17 sequences in 0.261676 seconds.and the following files will be created: Saccer_ref.nin, Saccer_ref.nsd, Saccer_ref.nsq, Saccer_ref.nhr, Saccer_ref.nog, Saccer_ref.nsi.
With our reference database created, we can now BLAST a query against it and retrieve hit information. We will use the BLAST executable blastn, which is by default tuned for interspecific sequence similarity. In this example, we use the output format 6, which is a tabular format. Other formats for BLAST output are available (see list at https://www.ncbi.nlm.nih.gov/books/NBK279684 ). We specify multiple threads here (4). Note that if a reference database has a single sequence or is very small, the use of multiple threads does not increase the speed of BLAST.
blastn -query Ecoli_K12_GCF_000005845_2_modified.fasta -db Saccer_ref -outfmt 6 -num_threads 4 -out example_BLAST_results.txtThis should generate a BLAST output file called example_BLAST_results.txt. By default, the tabular format has the following 12 columns, and in the above blastn command, we requested 6 values per line (-outfmt 6), so that the results have the format:
query_id subject_id pct_identity aln_length n_of_mismatches gap_openings
q_start q_end s_start s_end e_value bit_score
E-values provide a measure of confidence that the query matched the database hit by chance alone, based on the relative size of the database. Smaller e-values indicate a query hit to the database is less likely to be due to change alone. However, e-values are influenced by the database size; thus, they can be problematic for determining sequence homology alone. Bit score measures sequence similarity independent of query sequence length and database size and is normalized based on the pairwise alignment score. Higher scores are better for bitscores.
We can choose which columns to return by BLAST as well, modifying the BLAST command as such (see the BLAST manual for a list of command-line options):
blastn -query Ecoli_K12_GCF_000005845_2_modified.fasta -db Saccer_ref -outfmt '6 qseqid sseqid sacc pident evalue bitscore' -num_threads 4 -out example_BLAST_results.txtThis will report the query ID, the hit/subject ID, the accession of the hit, the percent of identical bases, the e-value and the bitscore of the hit.
Now that we have our hits, we may want to easily investigate the best matches to potential fungal contaminants in the bacterial assembly. The nice thing about the tabular format is it is easily sortable - for instance, if we wanted the largest scores at the top of the file, we could use:
sort -k 6,6gr example_BLAST_results.txt -o sorted_BLAST_results.txtNote this requires the score be in the 6th column, as it was in the previous example. Now the top hit is the largest score; it lists the name of the contig from our assembly that has the best match to a fungal DNA sequence (Introduced_Fungal_Sequence1).
To run the tutorial using LSF, including creating the database, create the following LSF script, called submit.csh:
#!/bin/tcsh #BSUB -n 4 #BSUB -W 60 #BSUB -R span[hosts=1] #BSUB -o out.%J #BSUB -e err.%J module load blast makeblastdb -in Scerevisiae_reference_GCF_000146045_2.fasta -parse_seqids -out Saccer_ref -dbtype nucl blastn -query Ecoli_K12_GCF_000005845_2_modified.fasta -db Saccer_ref -outfmt 6 -num_threads $LSB_DJOB_NUMPROC -out example_BLAST_results.txt sort -k 6,6gr example_BLAST_results.txt -o sorted_BLAST_results.txtand run it by doing
bsub < submit.csh
Henry2 hosts a subset of BLAST databases, including:
The databases are located here:
This group consists of NC State faculty, staff, and students who are using bioinformatics software. It will be used to organize and announce meetings and for users to exchange information and advice on using various bioinformatics applications. Please share this announcement with others who might be interested in joining this group.
Last modified: March 23 2022 13:21:00.