Why is sequence important in hypothesis testing




















It is easy to see that the P-value is actually all we need; it can be interpreted as the smallest significance level at which we can reject the null hypothesis. To calculate the P-value for a sequence alignment, we need to rigorously define the null hypothesis by specifying a probabilistic model (the null model) in which the aligned sequences are independent, and estimate the probability that this model generates a score at least as extreme as the observed score.
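A minimal Monte Carlo sketch of such an estimate, assuming a toy alignment-free score (number of matching positions) and a permutation null model; the function names and the add-one correction are illustrative choices, not taken from the text:

```python
import random

def match_score(a, b):
    """Toy alignment-free score: number of positions where the letters agree."""
    return sum(x == y for x, y in zip(a, b))

def empirical_p_value(a, b, score=match_score, n_shuffles=10000, seed=0):
    """Estimate P(score >= observed) under a null model in which the
    letters of b are randomly permuted (letter composition preserved)."""
    rng = random.Random(seed)
    observed = score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if score(a, "".join(letters)) >= observed:
            hits += 1
    # add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_shuffles + 1)

p = empirical_p_value("ACGTACGT", "ACGTTGCA", n_shuffles=2000)
```

Such pure-simulation estimates are exactly the kind that, as discussed below, break down for very small P-values: the smallest value the estimator can report is 1/(n_shuffles + 1).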

Note that, since we are usually interested in high scores, the score-based test is a one-sided statistical test, and we should concentrate on the behavior of the right tail of the score distribution. In score-free analyses, such as studies of word occurrences in single or multiple sequences or string matching, the test statistic may be the count of a given word in a sequence, the distance between words, etc.

The P-value is again the probability of this statistic taking a value at least as large as actually observed under the assumption that the null hypothesis is true. As already mentioned, randomness can be implemented in the null model either by applying shuffling algorithms or by defining stochastic processes.

Yet another approach is to select biomolecular sequences at random from available databases [ 15 , 16 ]. The motivation for the latter approach is that real sequences may give a more biologically adequate null model, compared to mathematical abstractions or ad hoc algorithms.

Shuffling methods also use real sequences as raw material, and are oriented toward preserving certain properties of real sequences. For example, the permutation technique devised by Altschul and Erickson [ 4 ] can preserve dinucleotide and trinucleotide composition in nucleotide sequences.
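A minimal sketch in the spirit of this idea: the shuffle below builds the dinucleotide multigraph of the sequence and extracts a random Eulerian path with Hierholzer's algorithm. Unlike the actual Altschul-Erickson procedure, it does not sample Eulerian paths uniformly, but it does preserve the dinucleotide counts exactly:

```python
import random
from collections import defaultdict

def dinucleotide_shuffle(seq, seed=None):
    """Return a shuffled sequence with exactly the same dinucleotide counts
    as seq (hence also the same start letter, end letter and letter counts)."""
    rng = random.Random(seed)
    # multigraph with one edge per adjacent letter pair
    adj = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        adj[a].append(b)
    for targets in adj.values():
        rng.shuffle(targets)
    # Hierholzer's algorithm: follow unused edges from the start letter,
    # splicing in detours until every edge is used exactly once
    path, stack = [], [seq[0]]
    while stack:
        v = stack[-1]
        if adj[v]:
            stack.append(adj[v].pop())
        else:
            path.append(stack.pop())
    return "".join(reversed(path))
```

Because every Eulerian path of this multigraph spells a sequence with the original dinucleotide composition, any output of the function is a valid composition-preserving shuffle.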

Waterman and Vingron [ 16 ] found that, for protein sequences, the estimation results using sequences retrieved from databases are in good agreement with those using random sequence models with independent residues. In general, the use of database sequences for parameter estimation and benchmarking is a promising approach [ 16 , 18 ]. Although obtaining P-value estimates of the form (1) or building an empirical distribution is always an option, such estimates may fail to give the correct answer for small P-values [ 10 , 11 , 25 ].

Furthermore, the applicability of the P -values obtained by simulation is frequently restricted to the concrete parameter values for which they were derived [ 18 ].

In addition, obtaining good estimates by simulations may be very time consuming [ 11 , 18 , 26 ]. Thus, analytic expressions for the distributions are always desirable. They are especially useful when they give explicit dependence of the P -value on some of the parameters, so that this dependence does not have to be estimated by simulations.

While simulations are needed to estimate those distribution parameters that depend on the sequence composition and the scoring system, the formula gives the explicit dependence of the P-value on the lengths of the aligned sequences, which makes its use quite efficient in practice [ 16 , 18 , 27 ]. Analytic approximations can only be derived for precisely specified probabilistic models.

The question is what probabilistic models are adequate for DNA and protein sequences. Markov chain models are built under the assumption that the genome is homogeneous. If the nucleotide composition of a genome varies substantially along its length, it is possible to parse DNA [ 34 ] and build Markov models for separate homogeneous segments. It should also be noted that successful modern gene finding methods use hidden Markov or hidden semi-Markov models to represent DNA sequences [ 35—37 ].

These models may be best for describing inhomogeneous DNA sequences. It is apparent that, in general, we should expect some dependence of the P -values on the selected model.

However, it has been found that the expression for the mean length of the longest, possibly interrupted, match between two random sequences of i.i.d. letters remains approximately valid under more general sequence models. These findings have been regarded as a justification for the claim that independence models are likely to give accurate enough approximations for the distributions of test statistics in sequence comparisons [ 11 ].

For protein sequences, independence models (sequences of i.i.d. letters) are commonly used. Numerical experiments show that, for protein sequences, using Markov models instead of independence models gives a relatively small enhancement in accuracy [ 16 , 18 , 41 ]. The study of Goldstein and Waterman [ 42 ] also shows that independence models for protein sequences may produce distributions close to those for real sequences.

Yet Karlin and Altschul [ 43 ] note that independence models should be used only as a reference standard to prove that scores of certain alignments can occur by chance alone. Mott [ 14 ] points out that protein sequences consist of segments which differ in composition, and that accurate estimates of statistical significance should take this structure into account.

In the DNA sequence matching studies by Arratia et al., simple independence models were found to be adequate. Yet it is known that simple independence models should not be used for describing low complexity and GC-biased regions in DNA sequences [ 13 , 45 ]. In the studies of transcription factor binding sites, background models in the form of Markov chains are generally considered more advanced than independence models [ 46 ]. Thus, the degree of influence of the choice of a probabilistic sequence model on the quality of the statistical significance analysis appears to depend on the specific problem being solved.

We shall return to the subject of null model selection when describing specific sequence analysis problems. Many functional elements in the genome, e.g. DNA restriction sites [ 42 ] and transcription factor binding sites [ 46 ], may be represented as words, i.e. short strings of letters. The knowledge of probabilistic properties of words in random texts is necessary for the estimation of statistical significance of patterns observed in DNA sequences.

In particular, distributional properties of words are important in DNA sequencing by hybridization [ 28 , 47 ], in the analysis of functional DNA regions [ 48—50 ] and in alignment-free comparisons of biological sequences [ 51 ]. The abundance of analytic results for Markov chain models makes such models a convenient tool in the studies of statistical properties of words in DNA sequences.

In these works, approximate as well as exact results on the statistical properties of counts of self-overlapping and non-overlapping words have been discussed. These results allow one to assess the probabilities of events related to different word occurrences, thus providing P-value estimates for such events. Word count distributions, distributions of the length of the gaps between words, and occurrences of multiple patterns were analyzed.

In asymptotic cases, normal, Poisson and compound Poisson approximations have been derived for word counts; the type of approximation depends on the word length and on the method of counting word occurrences. In particular, if n is the length of the sequence, then, for large n , the distribution of counts of a word can be approximated by the normal distribution; this approximation is good when the length of the word is relatively small compared to the sequence length, and works for both self-overlapping and non-overlapping words.
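The normal approximation for word counts can be sketched in a few lines. The version below uses binomial mean and variance for the number of occurrence windows, which ignores the weak dependence between overlapping windows; this is only a rough sketch, reasonable for short, non-self-overlapping words in long i.i.d. sequences:

```python
import math

def word_count(seq, word):
    """Count (possibly overlapping) occurrences of word in seq."""
    k = len(word)
    return sum(seq[i:i + k] == word for i in range(len(seq) - k + 1))

def normal_approx_p_value(seq, word, letter_probs):
    """One-sided P-value for observing at least the given word count,
    using a normal approximation with binomial mean and variance.
    The dependence between overlapping windows is ignored."""
    n, k = len(seq), len(word)
    windows = n - k + 1
    p = math.prod(letter_probs[c] for c in word)  # i.i.d. word probability
    mean = windows * p
    sd = math.sqrt(windows * p * (1 - p))
    z = (word_count(seq, word) - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
```

For self-overlapping (periodic) words, or for words that are not rare, this binomial variance is inaccurate, which is exactly where the compound Poisson corrections discussed below become necessary.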

The normal approximation can also be extended to the case of the inhomogeneous three-matrix Markov model [ 28 ]. For periodic words (words consisting of repeated patterns), the Poisson approximation is not satisfactory because of possible overlaps, and a compound Poisson approximation should be used.

Not only limiting distributions, but also bounds on the speed of convergence have been provided for the approximations. Similar results are available for the distributions of joint counts and joint occurrences of multiple patterns. While the Gaussian and Poisson laws are used to approximate the probability that a given word occurs more than a certain number of times, the probability that a given word frequency deviates from its expected value can be approximated using large deviations techniques.

The distribution of the length of the sequence intervals between two occurrences of a word can, under some conditions, also be approximated by the Poisson law. If a sequence of letters represents the primary structure of a biological macromolecule, certain biologically relevant properties of the monomers (letters) can be described numerically by scores, the numbers associated with each letter.

Other examples of scores include scores derived from target frequencies and scores based on structural alphabets [ 43 , 54 , 55 ].

The aggregate score for a contiguous sequence segment is defined as the sum of the scores of its constituent letters. Of interest are high-scoring segments, which, with an appropriate choice of scoring system, correspond to structurally and functionally interesting regions of a biological molecule.
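A maximal scoring segment can be found with a single linear scan (Kadane's algorithm). The letter scores below are purely illustrative (a crude hydrophobic/polar scheme), not taken from the text:

```python
def max_scoring_segment(scores):
    """Return (best_score, (start, end)) of the contiguous segment with the
    highest aggregate score; end is exclusive. Kadane's algorithm, O(n)."""
    best, best_span = float("-inf"), (0, 0)
    run, run_start = 0.0, 0
    for i, s in enumerate(scores):
        if run <= 0:           # a non-positive prefix never helps; restart here
            run, run_start = 0.0, i
        run += s
        if run > best:
            best, best_span = run, (run_start, i + 1)
    return best, best_span

# illustrative letter scores: hydrophobic (H) vs polar (P) residues
letter_score = {"H": 2, "P": -1}
seq = "PPHHHPHHPPPPHP"
scores = [letter_score[c] for c in seq]
```

For the toy sequence above the maximal segment is HHHPHH, illustrating that a high-scoring segment may absorb a few negatively scored letters when its flanks are strong enough.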

Subtracting this from 1, we obtain (3). This Poisson approximation can be used to assess the statistical significance of several occurrences of high-scoring segments. However, while the utility of (3) and the Poisson approximation is obvious, no theoretical bounds on the accuracy of the approximation are available. Thus it is unknown how large n should be to guarantee a good approximation.

The theory described above concerns the i.i.d. case; it has also been extended to Markov-dependent sequences. In that work, the sequence was modeled by a Markov chain with a positive transition probability matrix (the results are extendable to the irreducible aperiodic case). For each consecutive pair of letters in the sequence, the scores were represented by random variables associated with the Markov chain transitions, so that the sequence of scores formed a hidden Markov model of a special type.

The analogues of conditions (i) and (ii) were introduced, and formulas similar to (3) and (4) were derived for the distribution of the MSS score. Two sequences can be aligned in many different ways. Optimal alignments, those with the best scores, are of great practical interest. Global optimal alignments are optimized along the whole length of the two sequences. The dynamic programming algorithm for constructing the optimal global pairwise alignment was proposed by Needleman and Wunsch [ 60 ].
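The Needleman-Wunsch recurrence can be sketched compactly; this is a minimal score-only version with a linear gap penalty, and the parameters (match +1, mismatch -1, gap -2) are illustrative, not from the text:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by Needleman-Wunsch dynamic programming
    (linear gap penalty; score only, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] = best score for aligning prefix a[:i] with prefix b[:j]
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap          # a[:i] aligned entirely against gaps
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]
```

Gotoh's refinement mentioned below handles affine gap costs (separate gap-open and gap-extend penalties) with the same overall time complexity.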

A more efficient version was later devised by Gotoh [ 61 ]. So far, the statistical significance of global alignment scores has been assessed via simulations; no theoretical results on the score distribution are known. Reich et al. studied the distributions of global alignment scores empirically. The work was centered around the case of zero gap penalties.

Using independence models for DNA sequences, the authors found that the score distribution for alignments of such sequences is quite close to that for alignments of DNA sequences randomly retrieved from a database.

The authors used the Z-score as the statistical measure for the extremity of the scores. For sequences of equal length and uniform four-letter distribution, aligned with zero gap penalties, Reich et al. obtained approximate formulas for the parameters of the score distribution.

Corrections to these formulas for the cases of non-zero gap penalties, non-uniform letter distributions and non-equal sequence lengths have also been discussed. An example of another possible approach to assessing P-values for global optimal alignments is the estimation of statistical significance for evolutionary distances between sequences derived from global alignments, carried out by Altschul and Erickson [ 4 ].

In that work, the authors devised a special shuffling procedure (mentioned above in the discussion of the use of P-values in sequence comparisons) and proposed to use it for P-value estimation. Since computations showed that the distribution is slightly non-Gaussian, the authors preferred to make no assumptions concerning the tail behavior, and used an expression similar to (1) for P-value estimation.

In many practical situations we are interested in the highest-scoring alignment between subsequences of two given sequences, the best local alignment. A local alignment is called optimal if (a) the aligned subsequences form a high-scoring segment pair (HSP) whose score cannot be increased by extending or shortening the aligned subsequences, and (b) all other HSPs have the same or smaller alignment score.

The dynamic programming algorithm for finding the optimal local alignment was proposed by Smith and Waterman [ 66 ]. Gotoh [ 61 ] developed a faster version for the case of affine gap scores. Approximations by parametric distributions of the form (9) do not work for all scoring systems, gap penalties and sequence compositions. It can be shown that, in the case of affine gap costs and symmetric scores, there are two types of regions for the scoring parameters: linear and logarithmic [ 16 , 18 , 74 ].
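The Smith-Waterman recurrence differs from the global one mainly by clamping every cell at zero, which lets an alignment start anywhere. A minimal score-only sketch with a linear gap penalty (the parameters are illustrative):

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal local alignment score by Smith-Waterman dynamic programming
    (linear gap penalty; score only). Uses two rows, O(len(b)) memory."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # the max with 0 allows the local alignment to restart anywhere
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Whether this score falls in the logarithmic or the linear region depends on the chosen penalties, as discussed next.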

In the logarithmic region, the growth of the expected local alignment score is proportional to the logarithm of the product of the sequence lengths; in the linear region, the expected local alignment score grows proportionally to the sequence lengths.

This result holds both for independence and Markov sequence models [ 74 ]. The precise definitions of the logarithmic and linear regions are as follows.

Let s_g(n) be the optimal global alignment score of two random sequences of length n, and E[s_g(n)] be its expected value. It can be shown that the limit of E[s_g(n)]/n as n grows exists. The requirement that the local alignments be in the logarithmic region is the extension of condition (ii) for ungapped alignments to alignments with gaps. In the linear region, the penalty for poorly aligned letters and indels is too low, so the limiting expected global alignment score per aligned letter pair is positive.

In this case, high-scoring local alignments may have regions of poorly aligned letters. Consequently, using local alignments with scores falling in the linear region is not productive in biological sequence analysis. The importance of finite-length corrections for a given pair of sequences depends on how good an approximation is provided by (9) for sequences of such lengths.

Mott [ 14 ] has shown that the quality of this approximation depends on the scoring system and on the sequence composition, for both gapped and ungapped alignments. Since many real protein sequences have skewed composition, and the average protein sequence is only a few hundred residues long, finite-length corrections are, as a rule, necessary. While this statistic explicitly shows the dependence of the score distribution on the lengths, the dependence on sequence composition is implicit.

It is therefore desirable to have a composition-free measure of statistical significance. Bacro and Comet [ 81 ] have shown that the Z -score of an optimal local alignment is such a measure. Under the assumptions used by Waterman and Vingron [ 16 ], Bacro and Comet demonstrated that the Z -score has approximately extreme value distribution with parameters independent of sequence lengths or compositions.

The Z -values can be obtained from simulations, using the shuffling procedure described by Comet et al. One of the frequently used types of database searches is the search for sequence similarity implemented as a series of local pairwise alignments of the query sequence to all the sequences in the database. Current databases are quite large, and the Smith—Waterman full dynamic programming algorithm is too slow to be practical in this context. This is why fast heuristic algorithms have been designed for database searches.

Though fast, the heuristic local alignment algorithms are not as sensitive as the Smith-Waterman algorithm. The exact dynamic programming algorithm for constructing a multiple alignment is known but is impractical for more than a few sequences; therefore, heuristic methods, such as progressive alignment, are usually used [ 68 , 85 ]. The question of how to estimate the statistical significance of such alignments is even more obscure than it is for pairwise alignments.

Of course, it is possible to use simulations and distribution curve fitting for P -value estimations, but methods based on pure simulation may fail in the case of small P -values [ 25 ]. Thus, analytic approaches should be developed. When estimating statistical significance, it is natural to take into account the specific features of a multiple alignment algorithm. For example, if a method uses a sequence of pairwise alignments, then it may be possible to utilize the estimates for the pairwise alignment score statistics to assess the P -value for the multiple alignment.

This approach is implemented in DIALIGN-T [ 90 ], which builds multiple alignments from ungapped pairwise local alignments, called fragments, involving pairs from the whole set of sequences. The score of a multiple alignment produced by DIALIGN-T is the sum of the scores of the constituting fragments, while a fragment score is defined as a negative logarithm of the P -value of global Needleman—Wunsch pairwise alignment for the fragment.

Thus, an optimal multiple alignment is a collection of fragments with minimal product of pairwise P -values, and the score of a multiple alignment is the negative logarithm of an estimate of its P -value. The probabilistic sequence models for both DNA and protein alignments are independence models with uniform letter distribution; the global pairwise score P -values are estimated via simulations combined with heuristic formulas.

Although such an approach seems reasonable, it has been argued that DIALIGN-T over-estimates the probabilities of random occurrences of alignments with high scores [ 90 ]. As in the case of global multiple alignments, the full dynamic programming solution to the problem of local multiple alignment is currently infeasible. Practically efficient approaches include heuristic block analysis [ 91 ], expectation-maximization [ 92 ], Gibbs sampling [ 93 ] and the Eulerian path approach [ 94 ].

As is the case with pairwise alignments, some analytic results are available which can increase the efficiency of the P -value estimation for local multiple alignments. The choice of a method for P -value estimation depends on the nature of the alignment algorithm and on scoring system. Formulas for the P -value estimation are available only for ungapped local multiple alignments.

The basic idea behind the P-value estimation is the conjecture that expression (9) is extendable to the case of multiple sequences [ 43 , 69 ]. Let the letter probabilities for each sequence S_j be given. We specify a length and choose one segment of this length from each of the sequences; the set of such segments is called a block.

The score of this block is defined as the sum of the scores of the block columns. If there is a block whose score cannot be increased by shifting any of its borders, then this block is called a high-scoring block. Suppose that: (i) some column score is positive with positive probability; (ii) the average column score is negative; (iii) column scores do not change upon permuting the block's rows. These assumptions are valid if the column scores are defined as the SP (sum-of-pairs) scores [ 91 ].

The SP-score of a column is the sum of all the pairwise scores for the letters it comprises, with each pair being counted only once.
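The SP-score definition translates directly into code. The pairwise scores below are toy values (match +2, mismatch -1), not a real PAM matrix:

```python
from itertools import combinations

def sp_score(column, pair_score):
    """Sum-of-pairs score of one alignment column: every unordered pair
    of letters in the column is scored exactly once."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))

def toy_score(x, y):
    """Illustrative symmetric substitution score (not a real PAM matrix)."""
    return 2 if x == y else -1

column_score = sp_score("AAT", toy_score)  # pairs AA, AT, AT
```

Note that the score is invariant under permutation of the column's letters, which is exactly assumption (iii) above.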

For example, if the letters denote amino acids, then the pairwise substitution scores can be the usual log-odds scores defined by a PAM matrix. SP-scores are frequently used in multiple sequence alignments [ 68 ]. In the two preceding sections we described two general methods applicable to statistical significance estimation for local ungapped multiple alignments.

The methods are quite different in nature. While the choice of the specific method depends on the nature of the sequences being aligned, there may exist situations when both methods are applicable. It would be interesting to compare their relative accuracy and performance. A position-specific scoring matrix (PSSM), also called a position-specific weight matrix or, in the more general case, a profile, is frequently used to model evolutionarily conserved regions situated within protein and nucleotide sequences [ 99— ].

PSSMs are usually built from multiple alignments. The general purpose of PSSMs is to summarize the information contained in a multiple alignment, describing the propensities of different letters (nucleotides or amino acids) to occur in different positions (columns) of the alignment. When this matrix is aligned to a sequence without gaps, the score for this alignment is calculated as follows. If a letter of type a_i in the sequence is aligned with the j-th column of the PSSM, then the score for this position is w_ij.

The total alignment score is the sum of the scores over all the aligned PSSM positions. There are different approaches to deriving amino acid or nucleotide PSSMs given a set of aligned protein or DNA sequences some of the approaches are discussed by Claverie and Audic [ 13 , 98 ]. It is also possible to consider gapped sequence-to-PSSM alignments [ ]; in this case, the PSSM would have to have an additional row specifying gap costs for each position.
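The ungapped scoring scheme just described can be sketched as a simple sliding scan. The PSSM below is a toy 3-column nucleotide matrix with made-up log-odds-like weights:

```python
def pssm_score(pssm, window):
    """Ungapped alignment score of `window` against the PSSM:
    the sum of w[letter][column] over the aligned positions."""
    return sum(pssm[c][j] for j, c in enumerate(window))

def best_hit(pssm, seq, width):
    """Slide the PSSM along seq; return (best_score, offset)."""
    return max(
        (pssm_score(pssm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )

# toy 3-column nucleotide PSSM with illustrative weights
pssm = {
    "A": [2, -1, -1],
    "C": [-1, 2, -1],
    "G": [-1, -1, 2],
    "T": [-1, -1, -1],
}
```

For the toy matrix, scanning "TTACGTT" finds the consensus word ACG as the top-scoring window.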

The two major types of sequence-to-PSSM alignments are as follows. The first is the ungapped alignment obtained by scanning the sequence with the PSSM; the second uses a local pairwise alignment algorithm to find the optimal local alignment with gaps [ 38 , , ]. Huang et al. studied the statistical significance of ungapped PSSM hits. This effort was undertaken in the context of algorithm development for predicting transcription factor binding sites in DNA sequences. One of the features of this project was the use of Markov chains as local background models.

After scanning the sequence with a PSSM, the top-scoring fraction of hits was selected. For each hit, a first-order Markov model was built from a rather short local sequence region around the hit. This model was used with the Markov generalization of (19) to assess the significance of the hit. We now consider the second definition of the sequence-to-PSSM alignment, the one that allows gaps. The statistical significance of local sequence-to-PSSM alignments with gaps was empirically investigated by Altschul et al.

For the special way of choosing w_ij that Altschul et al. adopted, the score distribution was expected to be well approximated by the extreme value form. A comparison with the simulated score distribution showed that such an approximation was indeed reasonably accurate. Because it would be too time-consuming to reestimate the gapped parameters for every new PSSM, Altschul et al. proposed to rescale new PSSMs so that parameters estimated for a standard scoring system could be reused. An effective approach to increase the quality of the P-value estimates is to take into account the actual amino acid composition of the target sequence, which may differ from the assumed average composition [ 40 ].

Such a compositional correction can be performed by rescaling the PSSM. To increase the search speed, the algorithm filtered out candidates for in-depth analysis and performed PSSM rescaling only if nearly-significant PSSM-sequence alignment was obtained in the first iteration.

Mott's approximation appears to work quite well, supporting the general claim that the score distribution of local sequence-to-PSSM alignments can be approximated by the Karlin—Altschul statistic. If a sequence family is characterized by several simultaneously occurring motifs, then the fact that a sequence belongs to the family can be established by scanning the sequence using the PSSM models of the motifs. Even if the individual PSSM hits have relatively large P -values, the hits occurring simultaneously may provide sufficient evidence that the sequence indeed belongs to the specified family.

To assess the importance of such combined evidence, it is necessary to estimate the statistical significance of simultaneous hits of several PSSMs. Computations show that using multiple-PSSM queries indeed gives better database search results [ , ]. Here we consider only ungapped sequence-to-PSSM alignments. There exist several approaches to statistical significance estimation for multiple PSSM hits.

The most straightforward is to estimate the distribution function of the sum of the best scores for each of the motifs [ ].

The alternative hypothesis is usually the hypothesis the researcher is interested in proving. It can be one-sided (specifying only one direction of deviation, e.g., greater than the null value) or two-sided. We often use two-sided tests even when our true hypothesis is one-sided, because they require more evidence against the null hypothesis to accept the alternative. The significance level (denoted by the Greek letter alpha, α) is generally set at 0.05.

The smaller the significance level, the greater the burden of proof needed to reject the null hypothesis, or in other words, to support the alternative hypothesis. In another section we present some basic test statistics to evaluate a hypothesis. Hypothesis testing generally uses a test statistic that compares groups or examines associations between variables. When describing a single sample without establishing relationships between variables, a confidence interval is commonly used.

The p-value describes the probability of obtaining a sample statistic as or more extreme by chance alone if your null hypothesis is true. This p-value is determined based on the result of your test statistic. Your conclusions about the hypothesis are based on your p-value and your significance level.
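As a sketch of this decision rule, assuming the test statistic is (approximately) standard normal under the null hypothesis, the two-sided p-value and the reject/fail-to-reject conclusion can be computed directly; the statistic value here is hypothetical:

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard-normal test statistic z:
    P(|Z| >= |z|) = erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2))

alpha = 0.05           # conventional significance level
z = 2.1                # hypothetical observed test statistic
p = two_sided_p_from_z(z)
reject_null = p < alpha
```

With z = 2.1 the two-sided p-value is about 0.036, so at the 0.05 level the null hypothesis would be rejected, while at the 0.01 level it would not.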

Cautions about p-values

Your sample size directly impacts your p-value. Large sample sizes can produce small p-values even when differences between groups are not meaningful. You should always verify the practical relevance of your results. On the other hand, a sample size that is too small can result in a failure to identify a difference when one truly exists.

Plan your sample size ahead of time so that you have enough information from your sample to show a meaningful relationship or difference if one exists. See calculating a sample size for more information. If you do a large number of tests to evaluate a hypothesis (called multiple testing), then you need to control for this in your designation of the significance level or calculation of the p-value. For example, if three outcomes measure the effectiveness of a drug or other intervention, you will have to adjust for these three analyses.

Hypothesis testing is not set up so that you can absolutely prove a null hypothesis.

The runs test is a test of significance, or hypothesis test. The procedure for this test is based upon a run, or a sequence, of data that shares a particular trait. To understand how the runs test works, we must first examine the concept of a run. We will begin by looking at an example of runs. Consider the following sequence of random digits:

One way to classify these digits is to split them into two categories, either even including the digits 0, 2, 4, 6 and 8 or odd including the digits 1, 3, 5, 7 and 9.

We will look at the sequence of random digits and denote the even numbers as E and the odd numbers as O: The runs are easier to see if we rewrite this with each maximal block of consecutive Es or Os set apart: We count these blocks of even or odd numbers and see that there are a total of ten runs in the data. Four runs have length one, five have length two and one has length five.
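The run structure can be recovered programmatically. The even/odd sequence below is hypothetical (the original digits did not survive extraction), chosen to be consistent with the run counts described above:

```python
from itertools import groupby

def runs(labels):
    """Collapse a label sequence into its runs, i.e. maximal blocks of
    equal consecutive labels; returns a list of (label, length) pairs."""
    return [(label, len(list(group))) for label, group in groupby(labels)]

# hypothetical E/O classification with ten runs:
# lengths 2,2,2,5,2,2,1,1,1,1 as in the example above
seq = "OOEEOOEEEEEOOEEOEOE"
r = runs(seq)
num_runs = len(r)
```

`groupby` does the block-splitting for us: each group is one run, and its length is the run length.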

With any test of significance , it is important to know what conditions are necessary to conduct the test. For the runs test, we will be able to classify each data value from the sample into one of two categories.

We will count the total number of runs relative to the number of data values that fall into each category. The test will be a two-sided test. The reason for this is that too few runs mean there is less alternation between the categories than a random process would produce. Too many runs result when a process alternates between the categories too frequently to be described by chance.

Every test of significance has a null and an alternative hypothesis. For the runs test, the null hypothesis is that the sequence is a random sequence.

The alternative hypothesis is that the sequence of sample data is not random. Statistical software can calculate the p-value that corresponds to a particular test statistic.
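A sketch of how such software computes the p-value, using the standard normal approximation for the runs test (the mean and variance formulas for the number of runs in a two-category sequence of n1 and n2 values):

```python
import math

def runs_test(labels):
    """Normal-approximation runs test for randomness of a two-category
    sequence; returns (observed_runs, z, two_sided_p)."""
    cats = sorted(set(labels))
    assert len(cats) == 2, "runs test needs exactly two categories"
    n1 = sum(1 for x in labels if x == cats[0])
    n2 = len(labels) - n1
    n = n1 + n2
    # a new run starts at every category change
    observed = 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    z = (observed - mean) / math.sqrt(var)
    return observed, z, math.erfc(abs(z) / math.sqrt(2))
```

Both extremes are flagged: a perfectly alternating sequence (too many runs) and a fully blocked one (too few runs) each yield a small two-sided p-value.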

There are also tables that give critical numbers at a certain level of significance for the total number of runs. We will work through the following example to see how the runs test works. Suppose that for an assignment a student is asked to flip a coin 16 times and note the order of heads and tails that showed up. If we end up with this data set:


