A Crash Course in BLAST Searching (2024)

Simple BLAST searching is pretty straightforward to many of us. Just plug in your sequence, select the species genome, and hit search! But have you ever wondered what it takes to run a BLAST query using these mammoth-sized (no pun intended!) sequence databases?

BLAST searching can produce dozens, hundreds, or even thousands of candidate alignments. The results of BLASTing your favorite gene or protein can differ substantially, depending on the exact query sequence used and the parameters of your search (such as database size and e-value cut-off, to name a few). On top of that is the long list of values in the results page. What are these? And what do they mean?

Most importantly, which of the dozens (or hundreds, or thousands) of results are the most accurate? If you’re already feeling lost by some of the jargon used above, don’t despair! Keep reading this article, and I will help you answer these questions, making you a BLAST-whiz in no time!

Why Do We BLAST?

For bioinformaticians, homology is the main clue for predicting gene and protein function. But how does one predict homology? The answer is to examine the similarity between two or more sequences. As a general rule of thumb, you should expect at least 25 % sequence similarity for protein homologues and 70 % sequence similarity for DNA homologues if your query includes more than 100 amino acids or nucleotides.
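
As a rough illustration of how such a percentage can be computed, here is a minimal Python sketch. It assumes the two sequences have already been aligned (gaps written as "-") and it counts identical columns only, which is stricter than the "similarity" that scoring matrices measure. The function name and the example sequences are made up for illustration.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned columns in which both sequences carry the same letter.

    Columns where both sequences have a gap are ignored; a column with a gap
    in only one sequence counts as a non-match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


# Hypothetical 10-column alignment: 8 identical columns, 1 gap, 1 mismatch
print(percent_identity("ACGT-ACGTA", "ACGTTACCTA"))
```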

The easiest way to assess sequence similarity between two or more sequences is to perform a sequence alignment. BLAST and FASTA are very common tools for doing just this.

FASTA is a sequence search algorithm that flourished in the mid-1980s, but it was, and still is, time-consuming. Since its debut in 1990, BLAST has fast become the most widely used sequence search program, functioning as a revolutionary tool to search against big sequence databases such as those at NCBI (for example, Nucleotide Collection (nr/nt), Expressed Sequence Tags (EST), Protein Data Bank (PDB)).

The Principle Behind BLAST Searching

BLAST first breaks the query sequence into a list of ‘words’ (i.e., short subsequences whose letters are the nucleotides or amino acids of the query). Candidate words are scored against the query words, and only those scoring above a threshold value (T) are kept. The database is then scanned for these words, which are easy to track down because they are so short.
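
The word list and database scan can be sketched in a few lines of Python. This is a simplified illustration, not BLAST's actual implementation: it uses exact word matches only, so the threshold T and the scoring of similar ‘neighbor’ words used for proteins are left out. The function names and sequences are hypothetical.

```python
from collections import defaultdict


def query_words(query, word_size):
    """Return every overlapping word of length word_size with its start position."""
    return [(i, query[i:i + word_size])
            for i in range(len(query) - word_size + 1)]


def find_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where a query word occurs in the subject."""
    index = defaultdict(list)
    for pos, word in query_words(query, word_size):
        index[word].append(pos)
    seeds = []
    for s_pos in range(len(subject) - word_size + 1):
        for q_pos in index.get(subject[s_pos:s_pos + word_size], []):
            seeds.append((q_pos, s_pos))
    return seeds


print(query_words("ACGTACGT", 4))
print(find_seeds("ACGTACGT", "TTACGTAA", 4))
```

Indexing the query words once and then sliding along the subject is what makes the short words cheap to find.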

The score for each word in the list is calculated using a scoring matrix (for example, BLOSUM62). Finding database matches to the words above the threshold is called seeding. Each match then acts as a seed from which the alignment between the query and the target sequence is widened. This extension can be gapped or un-gapped and runs in both directions, producing high-scoring sequence pairs, or HSPs (High-scoring Segment Pairs).

Extension in each direction stops once the running score drops too far below the best score found so far, and those HSPs that score above a cutoff score (S) are reported in the BLAST output. This is followed by a traceback procedure to work out the positions of insertions, deletions and matches, together with some additional computations.
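
To make the extension step concrete, here is a minimal Python sketch of an un-gapped extension from an exact seed. The match and mismatch scores and the drop-off limit (x_drop) are arbitrary values chosen for illustration, not BLAST defaults, and gapped extension and traceback are not shown.

```python
def _extend(q, s, match, mismatch, x_drop):
    """Walk along two strings in parallel; return (best_score, length_at_best)."""
    score = best = best_len = 0
    for i, (a, b) in enumerate(zip(q, s), start=1):
        score += match if a == b else mismatch
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # score has fallen too far below the best seen so far
    return best, best_len


def extend_seed(query, subject, q_pos, s_pos, word_size,
                match=1, mismatch=-3, x_drop=5):
    """Extend an exact seed in both directions without gaps.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    seed_score = word_size * match
    right_score, right_len = _extend(query[q_pos + word_size:],
                                     subject[s_pos + word_size:],
                                     match, mismatch, x_drop)
    # Extend leftwards by walking the reversed prefixes.
    left_score, left_len = _extend(query[:q_pos][::-1],
                                   subject[:s_pos][::-1],
                                   match, mismatch, x_drop)
    total = seed_score + left_score + right_score
    return total, q_pos - left_len, q_pos + word_size + right_len


query, subject = "GGACGTACGTTT", "CCACGTACGTAA"
score, start, end = extend_seed(query, subject, q_pos=4, s_pos=4, word_size=4)
print(score, query[start:end])
```

An HSP produced this way would then be reported only if its score reaches the cutoff S.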

What Is the ‘E-Value’?

The percentage similarity between two or more sequences alone is not enough to ensure trustworthy alignments. Bioinformatics buffs tend to regard the e-value as the most informative parameter when looking for true matches.

To keep it simple, the e-value is the number of hits with an equal or better score that one can expect to obtain by chance when searching a given database. It is not a probability, although for small values it is close to the probability of seeing at least one such chance hit. Generally, the lower the e-value obtained, the more significant the alignment. The e-value is a useful tool for rapidly comparing hits, but be aware that identical alignments may not receive the same e-value when searching different databases. This is because the e-value depends on the size of the database searched.
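
The dependence on database size can be seen from the standard relationship between e-value and bit score, E = m × n × 2^(−S'), where m is the query length, n is the total database length and S' is the bit score. The Python sketch below applies this formula directly. It assumes raw lengths, whereas BLAST itself uses slightly shorter ‘effective’ lengths, and the numbers are hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Same alignment (bit score 50, 250-letter query) against two database sizes
small_db = expect_value(50, 250, 1e9)
large_db = expect_value(50, 250, 1e10)
print(f"{small_db:.2e} vs {large_db:.2e}")
```

A ten-fold larger database gives a ten-fold larger e-value for exactly the same alignment.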

Bit score is an important measure that gives an indication about the statistical significance of an alignment. In simple terms, the higher the bit score, the more similar the two sequences are. Bit scores below 50 are generally assumed to be untrustworthy.
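
For readers curious where the bit score comes from, it is a rescaling of the raw alignment score S using two statistical parameters, λ and K, that depend on the scoring system: S' = (λS − ln K) / ln 2. The Python sketch below shows the conversion. The λ and K values are illustrative assumptions (similar to values commonly quoted for BLOSUM62 with gaps) and should not be taken as what any particular BLAST run used.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


LAMBDA = 0.267  # assumed, for illustration only
K = 0.041       # assumed, for illustration only

for raw in (50, 100, 200):
    print(raw, round(bit_score(raw, LAMBDA, K), 1))
```

Because the scoring system is folded into the bit score, bit scores from searches run with different matrices can be compared directly.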

Using Word Size and Low Complexity Region Filters

The default BLAST settings for word size are 3 and 11 for protein and nucleotide sequence searches, respectively. For short stretches of amino acids, the word size can be reduced to 2. For nucleotides, you can either increase the word size to 15 or reduce it to 7 to improve your BLAST output. In general, reducing the word size makes the search more sensitive but also more time-consuming.
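
The trade-off can be illustrated with a small Python sketch that counts how many exact word matches two sequences share at different word sizes. This is a toy model of the seeding step only, with made-up sequences that differ at a single position.

```python
def shared_word_count(a, b, word_size):
    """Count positions in b whose word of length word_size also occurs in a."""
    words_a = {a[i:i + word_size] for i in range(len(a) - word_size + 1)}
    return sum(1 for j in range(len(b) - word_size + 1)
               if b[j:j + word_size] in words_a)


seq_a = "AAAACCCCGGGG"
seq_b = "AAAATCCCGGGG"  # one mismatch relative to seq_a

for k in (4, 8):
    print(k, shared_word_count(seq_a, seq_b, k))
```

With the shorter word the mismatch still leaves several seeds to extend from, while the longer word finds none, which is faster but misses the similarity.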

Low complexity regions are stretches of amino acids or nucleotides with low information content, and alignments to them may have statistical, but not biological, significance. Examples include ATATATATATATAT, PPPPPPPPPPPPPPPP and repeats such as Alu elements. You can filter these out of your query by choosing the ‘low complexity region filter’ option, which masks them so that they are ignored during the search.
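
One simple way to picture ‘low information content’ is Shannon entropy over a sliding window. The Python sketch below masks windows whose entropy falls under a threshold. It is only an illustration of the idea: BLAST itself uses dedicated filters (DUST for nucleotides, SEG for proteins), and the window size and threshold here are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits per letter; 0 for a run of a single letter."""
    total = len(window)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=1.5, mask_char="N"):
    """Replace every letter that lies in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


print(shannon_entropy("ATATATATATAT"))
print(mask_low_complexity("ATATATATATATAT"))
```

Note that masking keeps the query the same length; the masked letters simply cannot seed an alignment.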

How to BLAST

Once you enter the BLAST page, select the desired BLAST tool (blastn or blastp). Then, you will need to enter the query sequence, choose the desired algorithm, and set search parameters.

  1. Choose Search Set: Here, you have the choice of genomic plus transcripts and other databases. You can also create a custom database.
  2. Program Selection: Here, you have the opportunity to select the intended BLAST algorithm.
  3. Algorithm Parameters: Lastly, you’ll need to set some parameters for your chosen algorithm. Here, you may consider the e-value, word size and the low-complexity regions filter.
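
The same three choices also exist outside the web page. As a sketch, assuming the standalone BLAST+ programs are installed and a database is available locally, the Python helper below assembles a command line with the e-value, word size and low-complexity filter options. The helper name, file names and database name are hypothetical, and the command is only built here, not run.

```python
def build_blast_command(program, query_fasta, database, evalue=10.0,
                        word_size=None, low_complexity_filter=True,
                        out_path="results.tsv"):
    """Assemble an argument list for blastn or blastp (BLAST+ command line)."""
    if program not in ("blastn", "blastp"):
        raise ValueError("this sketch only covers blastn and blastp")
    cmd = [program, "-query", query_fasta, "-db", database,
           "-evalue", str(evalue),
           "-outfmt", "6",  # tab-separated hit table
           "-out", out_path]
    if word_size is not None:
        cmd += ["-word_size", str(word_size)]
    # The low-complexity filter is DUST for nucleotides and SEG for proteins.
    filter_flag = "-dust" if program == "blastn" else "-seg"
    cmd += [filter_flag, "yes" if low_complexity_filter else "no"]
    return cmd


print(" ".join(build_blast_command("blastn", "query.fa", "nt",
                                   evalue=0.001, word_size=7)))
```

The resulting list could be passed to subprocess.run once the file and database names point at real data.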

BLAST Results Page

Let me take you on a detour to the BLAST results page (Figure 1). The output in ordered sequence includes:

  • A graphical display: This provides a brief summary of the alignments ranked according to alignment scores.
  • List of hits: The hit list tabulates the target sequences that produce significant alignments, along with their corresponding bit score, query coverage (percentage of query sequence aligned), e-value, identity % and assigned accession number.
  • Individual alignments (along with calculated parameters): Individual alignments provide details on the parameters, e-value, bit score and identity % calculated for each alignment. You will also find the gaps and clusters of identical sequences within individual alignment(s) in this panel.

Figure 1. Example of BLAST results output. (a) Graphical display: the color key signifies the alignments to the query (thickest individual horizontal line at top, scaling 1-2250) against target database sequences (thin lines). The red lines indicate the best matches, whereas pink and green lines indicate acceptable matches. (b) Hit list. (c) Individual alignments.
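
If you save the hit list as a tab-separated table instead of reading it on screen, the same columns can be filtered in code. The Python sketch below assumes the 12-column tabular layout produced by the BLAST+ option -outfmt 6. The two hit lines are invented for illustration, and the cut-offs are just the rules of thumb mentioned in this article.

```python
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(text):
    """Parse tab-separated BLAST hits into a list of dictionaries."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        hit = dict(zip(FIELDS, line.split("\t")))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        hit["length"] = int(hit["length"])
        hits.append(hit)
    return hits


def filter_hits(hits, max_evalue=0.01, min_bitscore=50.0):
    """Keep hits that pass both the e-value and the bit score cut-off."""
    return [h for h in hits
            if h["evalue"] <= max_evalue and h["bitscore"] >= min_bitscore]


# Made-up example rows, not real BLAST output
example = (
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t101\t300\t2e-50\t350\n"
    "query1\tsubjB\t75.00\t40\t10\t0\t5\t44\t12\t51\t3.2\t28.1\n"
)
for hit in filter_hits(parse_hits(example)):
    print(hit["sseqid"], hit["evalue"], hit["bitscore"])
```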

Parting Words of Advice

Bear in mind that you may not get the exact same results when you run the same BLAST query on two separate occasions. Updates to the contents of the database can change the results, so it’s worthwhile keeping an eye on when updates occur. For example, there could have been 100 sequences in the target database previously and there may now be 250. This change can bring additional hits to your BLAST query results.

More often than not, it is better to stick to the ‘default’ specifications when running sequence alignments. If this isn’t feasible, I recommend paying close attention to the search results at the bottom of the results page, and using what you’ve learned in this article to decide whether a hit is likely to be a good match.

Do bear in mind that there is no one magic cut-off to identify true matches for all your query sequence(s). And to be a savvy user, you must be able to confidently tweak the parameters discussed here for optimal results.

I do hope that this helps to demystify the inner workings of BLAST! Have a BLAST!

Further Reading and Resources

  1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). Basic local alignment search tool. J Mol Biol. Oct 5;215(3):403-10.
  2. NCBI BLAST help page.
  3. Lesk, A., 2013. Introduction to bioinformatics. Oxford University Press. ISBN-13: 978-0199208043.

FAQs

What are the steps of BLAST search?

Main steps of BLAST

  1. Given query sequence Q, compile the list of possible words that form high-scoring word pairs with words in Q.
  2. Scan the database for exact matches to the list of words compiled in step 1.
  3. Extend the hits from step 2.
  4. Evaluate the significance of the extended hits from step 3.

What happens during a BLAST search?

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between protein or nucleotide sequences. The program compares nucleotide or protein sequences to sequences in a database and calculates the statistical significance of the matches.

What is a good BLAST score?

BLAST results are sorted by E-value by default (best hit in the first line). The smaller the E-value, the better the match. BLAST hits with an E-value smaller than 1e-50 include database matches of very high quality. BLAST hits with an E-value smaller than 0.01 can still be considered good hits for homology matches.

How can I improve my BLAST search?

Improving BLAST hits
  1. Adjusting BLAST search parameters. Ideally, optimal BLAST parameters should of course be used in all BLAST searches. ...
  2. Reducing the database size. ...
  3. Increasing the quality of the alignment. ...
  4. Using additional positional information. ...
  5. Adding query sequences.

What is the first step of BLAST?

BLAST works by first making a look-up table of all the “words” (short subsequences, which by default are three letters long for proteins) and “neighboring words”, i.e., similar words, in the query sequence. The sequence database is then scanned for these “hot spots”.

What does E value mean in BLAST?

The Expect value (E) is a parameter that describes the number of hits one can “expect” to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise.

What is a hit in a BLAST search?

A sequence search will (hopefully) identify sequences that are similar (or even identical) to the queries. The identified sequences are often called the hit sequences (or just hits). Typically, there is much more known about the hits than the query. For instance, we may know that a specific hit is an enzyme.

What is the query sequence in BLAST?

The query sequence(s) to be used for a BLAST search should be pasted in the 'Search' text area. BLAST accepts a number of different types of input and automatically determines the format of the input.

What is the BLAST search strategy?

BLAST looks at bits of DNA, RNA, or proteins and calculates how similar they are to one another. BLAST also offers specialized searches that find and compare immunologically important genes and proteins, design PCR primers, screen for vector contamination, and identify conserved domains.

Why would you do a BLAST search?

The program compares nucleotide or protein sequences and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families. There are several types of BLAST searches.

How does BLAST work?

BLAST identifies homologous sequences using a heuristic method that initially finds short matches between two sequences; thus, the method does not take the entire sequence space into account. After finding these initial matches, BLAST attempts to build local alignments from them.

How to read BLAST search results?

BLAST results show all of the taxa that share sequence similarity with the query sequence based on the selected database. The results page includes a search summary, hit description table, graphic summary, and alignments that can help determine the quality or accuracy of a given hit.

What is a bad e-value?

  • 10e-10 < E-value < 1: Could be a true homologue, but it is a gray area.
  • E-value > 1: Proteins are most likely not related.
  • E-value > 10: Hits are most likely junk unless the query sequence is very short.

What do positives mean in BLAST?

Protein alignments include a field called ‘Positives’, which corresponds to the number of amino acids that are either identical between the query and the subject sequence or have similar chemical properties.

What are the different BLAST searches?

BLAST Types

BLAST Type   Query Sequence                        Alignment
blastn       nucleotide                            nucleotide
blastx       nucleotide (translated to protein)    protein
blastp       protein                               protein
tblastx      nucleotide (translated to protein)    protein

What are the different methods of BLAST?

There are five types (variants) of BLAST that are differentiated based on the type of sequence (DNA or protein) of the query and database sequences. BLASTN compares a nucleotide query sequence to a nucleotide sequence database. BLASTP compares a protein query sequence to a protein sequence database.

Article information

Author: Lidia Grady