BIOT 630: Matrix Problems Using the Needleman–Wunsch Algorithm

1.

Fill the following matrix using the Needleman/Wunsch “Global” Dynamic Programming Sequence Alignment Method. Use Match = 1, Mismatch = 0, Gap Penalty (w) = 0. See the Lecture 4 Slides and review the Supplement provided if you are having difficulty.

Answer =

0 A G C T T C G A

0 0 0 0 0 0 0 0

G 0

C 0

C 0

G 0

A 0

2.

Perform the traceback for the matrix filled in for Question 1 and provide the resulting alignment as your answer. Remember to start at the lower right. See the Lecture 4 Slides and review the Supplement provided if you are having difficulty.

Answer =

3.

What is the score for the alignment in Question 2? See the Lecture 4 Slides and review the Supplement provided if you are having difficulty.

Answer =

4.

Generate the Local DP “Smith-Waterman” pairwise sequence alignment between the AQP5 rat [Rattus norvegicus] RefSeq protein sequence (NP_036911.1) and the AQP5 cow [Bos taurus] RefSeq protein sequence (NP_001178089.1). Use the “EMBOSS” tool to generate the alignment. Then copy/paste the sequence alignment returned by the tool as your answer. Be sure to review the Lecture 4 Slides if you are having difficulty.

EMBOSS tool url: https://www.ebi.ac.uk/Tools/psa/emboss_water/

Answer =

5.

What is the Score for the Alignment in Question 4? Be sure to review Lecture 4 Slides if you are having difficulty.

Answer =

The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970.[1] The algorithm essentially divides a large problem (e.g. the full sequence) into a series of smaller problems, and it uses the solutions to the smaller problems to find an optimal solution to the larger problem.[2] It is also sometimes referred to as the optimal matching algorithm and the global alignment technique. The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance. The algorithm assigns a score to every possible alignment, and the purpose of the algorithm is to find all possible alignments having the highest score.

Choosing a scoring system

Next, decide how to score each individual pair of letters. Using the example above, one possible alignment candidate might be:

12345678

GCATG-CG

G-ATTACA

The letters may match, mismatch, or be matched to a gap (a deletion or insertion (indel)):

Match: The two letters at the current index are the same.

Mismatch: The two letters at the current index are different.

Indel (INsertion or DELetion): The best alignment involves one letter aligning to a gap in the other string.

Each of these scenarios is assigned a score and the sum of the scores of all the pairings is the score of the whole alignment candidate. Different systems exist for assigning scores; some have been outlined in the Scoring systems section below. For now, the system used by Needleman and Wunsch[1] will be used:

Match: +1

Mismatch or Indel: −1

For the example above, the score of the alignment would be 0:

GCATG-CG

G-ATTACA

+−++−−+− → 1*4 + (−1)*4 = 0
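The column-by-column sum above can be checked with a short Python sketch. The function name `alignment_score` and the use of "-" as the gap character are illustrative choices, not part of the original text; the default scores are the Match = +1, Mismatch or Indel = −1 system quoted above.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, indel=-1):
    """Sum the per-column scores of two gapped strings of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += indel      # a letter paired with a gap
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total


print(alignment_score("GCATG-CG", "G-ATTACA"))  # 4 matches, 4 penalties -> 0
```

Passing other values for `match`, `mismatch` and `indel` scores the same candidate alignment under a different system.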

Filling in the table

Start with a zero in the second row, second column. Move through the cells row by row, calculating the score for each cell. The score is calculated by comparing the scores of the cells neighboring to the left, top or top-left (diagonal) of the cell and adding the appropriate score for match, mismatch or indel. Calculate the candidate scores for each of the three possibilities:

The path from the top or left cell represents an indel pairing, so take the scores of the left and the top cell, and add the score for indel to each of them.

The diagonal path represents a match/mismatch, so take the score of the top-left diagonal cell and add the score for match if the corresponding bases (letters) in the row and column are matching or the score for mismatch if they do not.
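The fill procedure described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the name `nw_fill` is hypothetical, the sequences GATTACA (rows) and GCATGCG (columns) are taken from the alignment example earlier in the text, and each cell keeps the highest of the three candidate scores, which is the standard Needleman–Wunsch rule. The first row and column follow from the same rule, since those cells have only a left or a top neighbor.

```python
def nw_fill(rows_seq, cols_seq, match=1, mismatch=-1, indel=-1):
    """Return the filled score table as a list of rows (without letter labels)."""
    n, m = len(rows_seq), len(cols_seq)
    table = [[0] * (m + 1) for _ in range(n + 1)]

    # Border cells can only be reached through a run of indels.
    for i in range(1, n + 1):
        table[i][0] = table[i - 1][0] + indel
    for j in range(1, m + 1):
        table[0][j] = table[0][j - 1] + indel

    # Interior cells, row by row: best of diagonal, top and left candidates.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if rows_seq[i - 1] == cols_seq[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,   # diagonal: match or mismatch
                table[i - 1][j] + indel,      # from the top: indel
                table[i][j - 1] + indel,      # from the left: indel
            )
    return table


table = nw_fill("GATTACA", "GCATGCG")
for row in table:
    print(" ".join(f"{value:3d}" for value in row))
print("optimal score:", table[-1][-1])  # bottom-right cell, 0 for this pair
```

The bottom-right cell holds the score of the best global alignment. Recovering the alignment itself needs a traceback from that cell, which this sketch does not perform.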

Scoring systems

Basic scoring schemes

The simplest scoring schemes simply give a value for each match, mismatch and indel. The step-by-step guide above uses match = 1, mismatch = −1, indel = −1. Under this system, the lower the alignment score, the larger the edit distance, so one wants a high score. Another scoring system might be:

Match = 0

Indel = 1

Mismatch = 1

For this system the alignment score will represent the edit distance between the two strings, so a lower score is better. Different scoring systems can be devised for different situations. For example, if gaps are considered very bad for your alignment, you may use a scoring system that penalises gaps heavily, such as:

Match = 0

Mismatch = 1

Indel = 10
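With cost-style schemes such as the two above, the best alignment is the one with the lowest total, so the table is filled by taking the minimum instead of the maximum. The sketch below is illustrative only: the name `alignment_cost` is hypothetical, and it keeps just two rows of the table because only the final value is reported.

```python
def alignment_cost(a, b, match=0, mismatch=1, indel=1):
    """Lowest total cost of globally aligning a and b (lower is better)."""
    prev = [j * indel for j in range(len(b) + 1)]   # first row: indels only
    for i, letter_a in enumerate(a, start=1):
        curr = [i * indel]                          # first column: indels only
        for j, letter_b in enumerate(b, start=1):
            pair = match if letter_a == letter_b else mismatch
            curr.append(min(
                prev[j - 1] + pair,    # diagonal: match or mismatch
                prev[j] + indel,       # from the top: indel
                curr[j - 1] + indel,   # from the left: indel
            ))
        prev = curr
    return prev[-1]


# Match = 0, Mismatch = 1, Indel = 1 gives the edit distance.
print(alignment_cost("GCATGCG", "GATTACA"))             # 4
# Indel = 10 makes the ungapped alignment (four mismatches) the cheapest.
print(alignment_cost("GCATGCG", "GATTACA", indel=10))   # 4
```

For this pair both schemes give the same total, but they reach it through different alignments: with Indel = 10, any alignment containing gaps costs at least 20, so the ungapped one wins.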

Similarity matrix

More complicated scoring systems attribute values not only for the type of alteration, but also for the letters that are involved. For example, a match between A and A may be given 1, but a match between T and T may be given 4. Here (assuming the first scoring system) more importance is given to the Ts matching than the As, i.e. the Ts matching is assumed to be more significant to the alignment. This weighting based on letters also applies to mismatches.

In order to represent all the possible combinations of letters and their resulting scores a similarity matrix is used. The similarity matrix for the most basic system is represented as:

     A   G   C   T
A    1  −1  −1  −1
G   −1   1  −1  −1
C   −1  −1   1  −1
T   −1  −1  −1   1

Each score represents a switch from one of the letters the cell matches to the other. Hence this represents all possible matches and mismatches (for an alphabet of ACGT). Note that all the matches lie along the diagonal. Also, not all of the table needs to be filled in, only one triangle, because the scores are reciprocal (score for A → C = score for C → A). If implementing the T-T = 4 rule from above, the following similarity matrix is produced:

     A   G   C   T
A    1  −1  −1  −1
G   −1   1  −1  −1
C   −1  −1   1  −1
T   −1  −1  −1   4
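A similarity matrix can be represented in code as a lookup keyed by letter pairs. The following Python sketch is illustrative only: the helper names `make_similarity` and `score_with_matrix` are hypothetical, and the indel score of −1 is an assumption carried over from the basic scheme, since the matrix itself covers only letter-to-letter pairs.

```python
ALPHABET = "AGCT"


def make_similarity(match=1, mismatch=-1, overrides=None):
    """Build a symmetric similarity lookup keyed by (letter, letter)."""
    matrix = {
        (x, y): (match if x == y else mismatch)
        for x in ALPHABET
        for y in ALPHABET
    }
    for (x, y), value in (overrides or {}).items():
        matrix[(x, y)] = value
        matrix[(y, x)] = value   # scores are reciprocal
    return matrix


def score_with_matrix(top, bottom, matrix, indel=-1):
    """Score a gapped alignment; indel=-1 is an assumed gap score."""
    total = 0
    for a, b in zip(top, bottom):
        total += indel if "-" in (a, b) else matrix[(a, b)]
    return total


basic = make_similarity()
weighted = make_similarity(overrides={("T", "T"): 4})   # the T-T = 4 rule

print(score_with_matrix("GCATG-CG", "G-ATTACA", basic))     # 0
print(score_with_matrix("GCATG-CG", "G-ATTACA", weighted))  # 3
```

The example alignment contains one T-T column, so switching to the weighted matrix raises its score from 0 to 3.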

Different scoring matrices have been statistically constructed which give weight to different actions appropriate to a particular scenario. Having weighted scoring matrices is particularly important in protein sequence alignment due to the varying frequency of the different amino acids. There are two broad families of scoring matrices, each with further alterations for specific scenarios:

PAM

BLOSUM