Preface.

Acknowledgements.

Part I: SEQUENCE ANALYSIS.

1. Pairwise Global Alignment of Sequences.

1.1 Alignment and Evolution.

1.2 What is an Alignment?

1.3 A Scoring Scheme for the Model.

1.4 Finding Highest-Scoring Alignments with Dynamic Programming.

1.4.1 Determine H_{i,j}.

1.4.2 Use of matrices.

1.4.3 Finding the alignments that give the highest score.

1.4.4 Gaps.

1.5 Scoring Matrices.

1.6 Scoring Gaps: Gap Penalties.

1.7 Dynamic Programming for General Gap Penalty.

1.8 Dynamic Programming for Affine Gap Penalty.

1.9 Alignment Score and Sequence Distance.

1.10 Exercises.

1.11 Bibliographic notes.

2. Pairwise Local Alignment and Database Search.

2.1 The Basic Operation: Comparing Two Sequences.

2.2 Dot Matrices.

2.2.1 Filtering.

2.2.2 Repeating segments.

2.3 Dynamic Programming.

2.3.1 Initialization.

2.3.2 Finding the best local alignments.

2.3.3 Algorithms.

2.3.4 Scoring matrices and gap penalties.

2.4 Database Search: BLAST.

2.4.1 The procedure.

2.4.2 Preprocess the query: make the word list.

2.4.3 Scanning the database sequences.

2.4.4 Extending to HSP.

2.4.5 Introducing gaps.

2.4.6 Algorithm.

2.5 Exercises.

2.6 Bibliographic notes.

3. Statistical Analysis.

3.1 Hypothesis Testing for Sequence Homology.

3.1.1 Random generation of sequences.

3.1.2 Use of Z values for estimating the statistical significance.

3.2 Statistical Distributions.

3.2.1 Poisson probability distribution.

3.2.2 Extreme value distributions.

3.3 Theoretical Analysis of Statistical Significance.

3.3.1 The P value has an extreme value distribution.

3.3.2 Theoretical analysis for database search.

3.4 Probability Distributions for Gapped Alignments.

3.5 Assessing and Comparing Programs for Database Search.

3.5.1 Sensitivity and specificity.

3.5.2 Discrimination power.

3.5.3 Using more sequences as queries.

3.6 Exercises.

3.7 Bibliographic notes.

4. Multiple Global Alignment and Phylogenetic Trees.

4.1 Dynamic Programming.

4.1.1 SP score of multiple alignments.

4.1.2 A pruning algorithm for the DP solution.

4.2 Multiple Alignments and Phylogenetic Trees.

4.3 Phylogeny.

4.3.1 The number of different tree topologies.

4.3.2 Molecular clock theory.

4.3.3 Additive and ultrametric trees.

4.3.4 Different approaches for reconstructing phylogenetic trees.

4.3.5 Distance-based construction.

4.3.6 Rooting of trees.

4.3.7 Statistical test: bootstrapping.

4.4 Progressive Alignment.

4.4.1 Aligning two subset alignments.

4.4.2 Clustering.

4.4.3 Sequence weights.

4.4.4 CLUSTAL.

4.5 Other Approaches.

4.6 Exercises.

4.7 Bibliographic notes.

5. Scoring Matrices.

5.1 Scoring Matrices Based on Physico-Chemical Properties.

5.2 PAM Scoring Matrices.

5.2.1 The evolutionary model.

5.2.2 Calculate substitution matrix.

5.2.3 Matrices for general evolutionary time.

5.2.4 Measuring sequence similarity by use of M^τ.

5.2.5 Odds matrices.

5.2.6 Scoring matrices (log-odds matrices).

5.2.7 Estimating the evolutionary distance.

5.3 BLOSUM Scoring Matrices.

5.3.1 Log-odds matrix.

5.3.2 Developing scoring matrices for different evolutionary distances.

5.4 Comparing BLOSUM and PAM Matrices.

5.5 Optimal Scoring Matrices.

5.5.1 Analysis for one sequence.

5.6 Exercises.

5.7 Bibliographic notes.

6. Profiles.

6.1 Constructing a Profile.

6.1.1 Notation.

6.1.2 Removing rows and columns.

6.1.3 Position weights.

6.1.4 Sequence weights.

6.1.5 Treating gaps.

6.2 Searching Databases with Profiles.

6.3 Iterated BLAST: PSI-BLAST.

6.3.1 Making the multiple alignment.

6.3.2 Constructing the profile.

6.4 HMM Profile.

6.4.1 Definitions for an HMM.

6.4.2 Constructing a profile HMM for a protein family.

6.4.3 Comparing a sequence with an HMM.

6.4.4 Protein family databases.

6.5 Exercises.

6.6 Bibliographic notes.

7. Sequence Patterns.

7.1 The PROSITE Language.

7.2 Exact/Approximate Matching.

7.3 Defining Pattern Classes by Imposing Constraints.

7.4 Pattern Scoring: Information Theory.

7.4.1 Information theory.

7.4.2 Scoring patterns.

7.5 Generalization and Specialization.

7.6 Pattern Discovery: Introduction.

7.7 Comparison-Based Methods.

7.7.1 Pivot-based methods.

7.7.2 Tree progressive methods.

7.8 Pattern-Driven Methods: Pratt.

7.8.1 The main procedure.

7.8.2 Preprocessing.

7.8.3 The pattern space.

7.8.4 Searching.

7.8.5 Ambiguous components.

7.8.6 Specialization.

7.8.7 Pattern scoring.

7.9 Exercises.

7.10 Bibliographic notes.

Part II: STRUCTURE ANALYSIS.

8. Structures and Structure Descriptions.

8.1 Units of Structure Descriptions.

8.2 Coordinates.

8.3 Distance Matrices.

8.4 Torsion Angles.

8.5 Coarse Level Description.

8.5.1 Line segments (sticks).

8.5.2 Ellipsoid.

8.5.3 Helices.

8.5.4 Strands and sheets.

8.5.5 Topology of Protein Structure (TOPS).

8.6 Identifying the SSEs.

8.6.1 Use of distance matrices.

8.6.2 Define Secondary Structure of Proteins (DSSP).

8.7 Structure Comparison.

8.7.1 Structure descriptions for comparison.

8.7.2 Structure representation.

8.8 Framework for Pairwise Structure Comparison.

8.9 Exercises.

8.10 Bibliographic notes.

9. Superposition and Dynamic Programming.

9.1 Superposition.

9.1.1 Coordinate RMSD.

9.1.2 Distance RMSD.

9.1.3 Using RMSD as scoring of structure similarities.

9.2 Alternating Superposition and Alignment.

9.3 Double Dynamic Programming.

9.3.1 Low-level scoring matrices.

9.3.2 High-level scoring matrix.

9.3.3 Iterated double dynamic programming.

9.4 Similarity of the Methods.

9.5 Exercises.

9.6 Bibliographic notes.

10. Geometric Techniques.

10.1 Geometric Hashing.

10.1.1 Two-dimensional geometric hashing.

10.1.2 Geometric hashing for structure comparison.

10.1.3 Geometric hashing for SSE representation.

10.1.4 Clustering.

10.2 Distance Matrices.

10.2.1 Measuring the similarity of distance (sub)matrices.

10.3 Exercises.

10.4 Bibliographic notes.

11. Clustering: Combining Local Similarities.

11.1 Compatibility and Consistency.

11.2 Searching for Seed Matches.

11.3 Consistency.

11.3.1 Test for consistency.

11.3.2 Overlapping clusters.

11.4 Clustering Algorithms.

11.4.1 Linear clustering.

11.4.2 Hierarchical clustering.

11.5 Clustering by Use of Transformations.

11.5.1 Comparing transformations.

11.5.2 Calculating the new transformation.

11.5.3 Algorithm.

11.6 Clustering by Use of Relations.

11.6.1 How many relations to compare?

11.6.2 Geometric relation.

11.6.3 Distance relation.

11.6.4 Use of graph theory.

11.7 Refinement.

11.8 Exercises.

11.9 Bibliographic notes.

12. Significance and Assessment of Structure Comparisons.

12.1 Constructing Random Structure Models.

12.1.1 Use of distance geometry.

12.2 Use of Structure Databases.

12.2.1 Constructing nonredundant subsets.

12.2.2 Demarcation line for similarity.

12.3 Reversing the Protein Chain.

12.4 Randomized Alignment Models.

12.5 Assessing Comparison and Scoring Methods.

12.6 Is RMSD Suitable for Scoring?

12.7 Scoring and Biological Significance.

12.8 Exercises.

12.9 Bibliographic notes.

13. Multiple Structure Comparison.

13.1 Multiple Superposition.

13.2 Progressive Structure Alignment.

13.2.1 Scoring.

13.2.2 Construction of consensus.

13.3 Finding a Common Core from a Multiple Alignment.

13.4 Discovering Common Cores.

13.4.1 Finding the multiple seed matches.

13.4.2 Pairwise clustering.

13.4.3 Determining common cores.

13.4.4 Scoring clusters.

13.5 Local Structure Patterns.

13.5.1 Local packing patterns.

13.5.2 Discovering packing patterns.

13.5.3 The approach.

13.5.4 Scoring the packing motifs.

13.6 Exercises.

13.7 Bibliographic notes.

14. Protein Structure Classification.

14.1 Protein Domains.

14.2 An Ising Model for Domain Identification.

14.3 Domain Classes.

14.3.1 Mainly-α domains.

14.3.2 Mainly-β domains.

14.3.3 αβ domains.

14.4 Folds.

14.5 Automatic Approaches to Classification.

14.6 Databases for Structure Classification.

14.7 FSSP-Dali Domain Dictionary.

14.8 CATH.

14.8.1 Domains.

14.8.2 Class.

14.8.3 Architecture.

14.8.4 Topology (fold family).

14.8.5 Homologous superfamily.

14.8.6 Sequence families.

14.8.7 The CATH classification procedure.

14.9 Classification Based on Sticks.

14.10 Exercises.

14.11 Bibliographic notes.

Part III: SEQUENCE-STRUCTURE ANALYSIS.

15. Structure Prediction: Threading.

15.1 Protein Secondary Structure Prediction.

15.1.1 Artificial neural networks.

15.1.2 The PHD program.

15.1.3 Accuracy in secondary structure prediction.

15.2 Threading.

15.3 Methods Based on Sequence Alignment.

15.3.1 The 3D–1D matching method.

15.3.2 The FUGUE method.

15.4 Methods Using 3D Interactions.

15.4.1 Potentials of mean force.

15.4.2 Towards modelling methods.

15.5 Alignment Methods.

15.5.1 Frozen approximation.

15.5.2 Double dynamic programming.

15.6 Multiple Sequence/Structure Threading.

15.6.1 Simple multiple sequence threading.

15.7 Combined Sequence/Threading Methods.

15.8 Assessment of Threading Methods.

15.8.1 Fold recognition.

15.8.2 Alignment accuracy.

15.8.3 CASP and CAFASP.

15.9 Bibliographic notes.

Appendix A: Basics in Mathematics, Probability and Algorithms.

A.1 Mathematical Formulae and Notation.

A.2 Boolean Algebra.

A.3 Set Theory.

A.4 Probability.

A.4.1 Permutation and combination.

A.4.2 Probability distributions.

A.4.3 Expected value.

A.5 Tables, Vectors and Matrices.

A.6 Algorithmic Language.

A.6.1 Alternatives.

A.6.2 Loops.

A.7 Complexity.

Appendix B: Introduction to Molecular Biology.

B.1 The Cell and the Molecules of Life: DNA–RNA–Proteins.

B.2 Chromosomes and Genes.

B.3 The Central Dogma of Molecular Biology.

B.4 The Genetic Code.

B.5 Protein Function.

B.5.1 The gene ontology.

B.6 Protein Structure.

B.7 Evolution.

B.8 Insulin Example.

B.9 Bibliographic notes.

References.

Index.