Single amino acid mutations may lead to changes in protein function. Because missense mutations are the most likely single-point genetic mutation to affect protein function, the nonrandom mutation clustering (NMC) algorithm is applied to missense mutations in individual genes in this work.
The NMC algorithm is derived under the following assumptions: 1. each amino acid residue in a protein sequence has equal mutation probability; 2. mutations between amino acid positions are independent; 3. mutations between samples are independent; and 4. the number of potentially available samples is larger than the number of mutations.
Denote N as the protein sequence length and n as the total number of mutations in the protein. Denote X_{i}, a random variable between 1 and N, to be the position of the ith nonsynonymous (missense) mutation. By assumption, the mutations follow a discrete uniform distribution, and the n mutations are equivalent to n independent sample draws with replacement from the discrete uniform distribution, where the probability Pr(X_{i} = j) = 1/N, for j = 1,..., N and i = 1,..., n.
By assumption, mutations are random and can occur at the same position more than once. The data are transformed into order statistics by ordering the X_{i} into X_{(1)} ≤ ... ≤ X_{(i)} ≤ ... ≤ X_{(n)}, where X_{(i)} is the ith smallest number in the sample, i = 1,..., n. To characterize clustering, the distance between order statistics R_{ki} = X_{(k)} − X_{(i)} is computed for any pair i, k, i < k, i, k = 1,..., n. We develop the distribution of R_{ki} and declare the clustering to be nonrandom when the probability that the distance between order statistics R_{ki} is at most r falls below a predefined significance level α: Pr(R_{ki} ≤ r) ≤ α. The probability Pr(R_{ki} ≤ r) is the cumulative distribution of R_{ki}: the chance that the order statistics X_{(i)} and X_{(k)} lie as close together as r or closer. Therefore, the probability Pr(R_{ki} ≤ r) serves as a p-value, where α is an arbitrary level such as 0.01, 0.05, or 0.1. The distance R_{ki} has the simple interpretation of the size of the mutation cluster.
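For illustration (a minimal Python sketch; the paper's own implementation is in R, and the function and variable names here are ours), the mutation positions are sorted into order statistics and all pairwise distances R_{ki} are computed:

```python
from itertools import combinations

def pairwise_distances(positions):
    """Sort mutation positions into order statistics X_(1) <= ... <= X_(n)
    and return R_ki = X_(k) - X_(i) for every pair i < k (1-based ranks)."""
    x = sorted(positions)  # order statistics
    return {(i + 1, k + 1): x[k] - x[i]
            for i, k in combinations(range(len(x)), 2)}

# Five missense mutations: the trio at 45/46/47 yields small distances,
# hinting at a cluster, while 120 and 300 sit far away.
d = pairwise_distances([120, 45, 47, 300, 46])
```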
1.1 Derivation of the distribution of statistical measure
While distributions of order statistics are usually derived for continuous distributions, they have also been derived for discrete distributions. Burr (1955) [40] derived the distribution of the range statistic using order statistics on a discrete uniform distribution. The range statistic is a special case of our statistical measure R_{ki}, with i = 1 and k = n. Evans et al. (2006) [41] developed the density function and cumulative distribution of the ith order statistic for an arbitrary discrete distribution, i = 1,..., n. We extend the approach of Evans et al. (2006) [41] to determine the distribution of the distance between order statistics, and generalize the approach of Burr (1955) [40] to derive the distribution of the statistic R_{ki}.
The distribution of R_{ki} is developed from the joint distribution of order statistics X_{(i)} and X_{(k)} for any pair i, k, i < k, i, k = 1,..., n. R_{ki}, the distance between order statistics X_{(i)} and X_{(k)}, can range from 0, which means both mutations are located at the same position, to N − 1, which means the mutations are on the first and last positions of the protein sequence. Intermediate values between 0 and N − 1 are also possible; for example, R_{ki} = 1 implies that the mutations are adjacent to each other, and so on. We develop the distribution of R_{ki} for each possible scenario.
R_{ki} = 0, for any pair i, k, i < k, i, k = 1,..., n, implies that mutations X_{(i)} and X_{(k)} are located at the same position. Taking the N possible positions into consideration, the probability that R_{ki} = 0 is written as
The distribution is derived using the properties of order statistics. For example, when y = X_{(i)} = X_{(k)} = 1, the first k order statistics are on the first position and the remaining n − k order statistics are on or above the first position. Among these n − k order statistics, v order statistics are located strictly above the first position, with the remaining n − k − v order statistics at the first position, where v can range from 0, meaning all n order statistics are on the first position, to n − k, indicating that all the remaining order statistics are strictly larger than the first position. A similar logic applies to y = X_{(i)} = X_{(k)} = N. For 1 < y < N, the distribution is derived as follows: there must be i − 1 order statistics at positions x ≤ y; among those i − 1 order statistics, u order statistics have x < y and i − 1 − u have x = y, where u can range from 0 to i − 1. There must be k − i + 1 order statistics at position x = y. Finally, there must be n − k order statistics at positions x ≥ y; among those n − k order statistics, v order statistics have x > y and n − k − v have x = y, where v can range from 0 to n − k. Putting all the terms together: there are u order statistics located before position y, each with probability (y − 1)/N, where u = 0,..., i − 1; there are (k − i + 1) + (i − 1 − u) + (n − k − v) = n − u − v order statistics at y, each with probability 1/N; and there are v order statistics after position y, each with probability 1 − y/N, where v = 0,..., n − k and y = 2,..., N − 1.
For R_{ki} = 1, for any pair i, k, i < k, i, k = 1,..., n, the order statistics X_{(i)} and X_{(k)} are adjacent to each other. The probability distribution can be written as:
For R_{ki} = r, for any pair i, k, i < k, i, k = 1,..., n, r = 2,..., N − 1, the distribution can be written as:
The distributions for R_{ki} = 1 and R_{ki} = r derived above, for any pair i, k, i < k, i, k = 1,..., n, r = 2,..., N − 1, are based on similar logic as R_{ki} = 0. The i − 1 order statistics must be located at or before position X_{(i)}, and the n − k order statistics must be located at or after position X_{(k)}. Among the remaining k − i − 1 order statistics, q order statistics are located at position X_{(i)}, t order statistics are strictly between X_{(i)} and X_{(k)}, and the remaining k − i − 1 − q − t statistics are at position X_{(k)}, where q = 0,..., k − i − 1 and t = 0,..., k − i − 1 − q. Grouping all the terms together yields the distribution equations for R_{ki} = 1 and R_{ki} = r, for any pair i, k, i < k, i, k = 1,..., n, r = 2,..., N − 1.
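Assembling the counts exactly as described (our reconstruction under the stated bookkeeping, to be checked against the original displayed equations: i − u + q statistics land at y and n − i − q − t − v at y + r), the joint probability takes the form

```latex
\Pr(R_{ki} = r) =
\sum_{y=1}^{N-r}\,\sum_{u=0}^{i-1}\,\sum_{q=0}^{k-i-1}\,\sum_{t=0}^{k-i-1-q}\,\sum_{v=0}^{n-k}
\frac{n!}{u!\,(i-u+q)!\,t!\,(n-i-q-t-v)!\,v!}
\left(\frac{y-1}{N}\right)^{u}
\left(\frac{1}{N}\right)^{i-u+q}
\left(\frac{r-1}{N}\right)^{t}
\left(\frac{1}{N}\right)^{n-i-q-t-v}
\left(\frac{N-y-r}{N}\right)^{v},
```

with the r = 1 case obtained by keeping only t = 0, since no positions lie strictly between adjacent sites.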
Finally, for the special case of i = 1 and k = n, the distribution of R_{ki} may be simplified as
Note that Pr(R_{n1} ≤ r) = 1 for r = N − 1. The result is the same as the range statistic reported in Burr (1955) [40].
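This special case reduces to the standard closed form for the range of n draws from a discrete uniform distribution on 1..N, which can be sketched as follows (our Python, using the textbook range formula rather than the displayed equation; the function name is ours):

```python
def prob_range_le(r, n, N):
    """Pr(R_n1 <= r): the range (max - min) of n independent draws,
    uniform on positions 1..N, is at most r (0 <= r <= N-1).
    Counts windows of width r+1: (N-r) full windows plus boundary terms."""
    return ((N - r) * ((r + 1) ** n - r ** n) + r ** n) / N ** n
```

At r = N − 1 the formula gives 1, matching the note above.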
1.2 Approximation of the distribution
The derivation in section 1.1 gives the exact distribution of the statistical measure for nonrandom mutation clustering under the discrete uniform distribution. Proteins typically contain hundreds or thousands of amino acids, and calculating the distribution of R_{ki} = r can be extremely slow when the protein sequence length N or the number of mutations n is large, because the iterated summations grow dramatically. It is therefore convenient to approximate the discrete uniform distribution with a continuous uniform distribution on (0, 1). For computational efficiency, we now develop the distribution of the test statistic in the continuous limit.
When the n order statistics are random samples from a uniform distribution on (0, 1), the joint probability distribution of order statistics X_{(i)} and X_{(k)}, for any pair i, k, i < k, i, k = 1,..., n, is:
where the distance is normalized to lie in the range (0, 1), so the distance R_{ki} = (X_{(k)} − X_{(i)})/N differs by the factor N from section 1.1, where R_{ki} = X_{(k)} − X_{(i)}. The cumulative distribution can be written as Pr(R_{ki} ≤ r)
which by iterated integration by parts gives:
Using the continuous uniform distribution, R_{ki} simply follows a Beta distribution with parameters k − i and i + n − k + 1, ensuring that Pr(R_{ki} ≤ 1) = 1. This result was reported in Johnson et al. (1995) [42] for the joint distribution of pairwise order statistics following a continuous uniform distribution on (0, 1).
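Because both Beta parameters are integers and sum to n + 1, the CDF reduces to a binomial tail sum, which can be evaluated without any special-function library (a sketch; the function name is ours):

```python
from math import comb

def beta_cdf_rki(r, i, k, n):
    """Pr(R_ki <= r) for the normalized distance r in (0, 1):
    the Beta(k-i, i+n-k+1) CDF, written as a binomial tail sum since
    both parameters are integers and (k-i) + (i+n-k+1) - 1 = n."""
    a = k - i
    return sum(comb(n, j) * r ** j * (1 - r) ** (n - j)
               for j in range(a, n + 1))

# Example: for i = 1, k = 2, n = 5 the distance follows Beta(1, 5),
# whose CDF is 1 - (1 - r)**5.
p = beta_cdf_rki(0.5, 1, 2, 5)
```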
1.3 Correction for multiple testing
For each pair of order statistics, the exact and continuous distributions can be calculated using the formulas in sections 1.1 and 1.2. Clusters are evaluated for every pair of order statistics, which can elevate the false positive rate due to multiple testing. A Bonferroni correction can be chosen to control the false positive rate because it does not require the hypotheses to be independent and it is conservative. The false discovery rate (FDR) developed by Benjamini and Hochberg (1995) [43] is popular and has been applied to multiple testing problems in many areas. Although it assumes independent test statistics, it is known to be powerful and robust under positively correlated test statistics (Benjamini and Yekutieli (2001) [44]). Because of its conservativeness, the Bonferroni correction is applied by default to adjust for multiple testing in the NMC algorithm; FDR can be applied as an alternative.
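Both corrections are standard and can be sketched as follows (the function names are ours; the Benjamini-Hochberg version returns step-up adjusted p-values):

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values (step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        j = order[rank - 1]
        running_min = min(running_min, pvals[j] * m / rank)
        adjusted[j] = running_min
    return adjusted
```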
1.4 NMC algorithm
The exact and approximate distributions of the distance between pairwise order statistics were derived in sections 1.1 and 1.2. The calculation is rapid for the special cases where R_{ki} is 0 or 1 and for the range statistic, and we use the exact distribution derived in section 1.1 to ensure accuracy in these cases. For further efficiency when calculating the distribution for R_{ki} = 1, the algorithm stops early once the iterated summation in the distribution exceeds the significance level, because the full summation is larger than the partial summation and the result can no longer be significant. The continuous distribution is used for computational efficiency when the distance R_{ki} is greater than 1. The nonrandom mutation clustering (NMC) algorithm is summarized in the following procedure:

Input: Number and location of missense mutations in a protein

Output: A table with columns for the nonrandom mutation cluster size, the starting location of the cluster, the ending location of the cluster, the number of mutations observed in the cluster, and the probability of the cluster, for clusters that remain significant after Bonferroni or FDR correction.

NMC algorithm:

Step 1: Reorder the mutation positions into order statistics and set the significance level α. By default, α = 0.05.

Step 2: For each pair of order statistics, calculate the probability Pr(R_{ki} ≤ r), for any pair i, k, i < k, i, k = 1,..., n. For r = 0 or 1, and for i = 1 and k = n, use the exact distribution in section 1.1. For r > 1, use the continuous distribution in section 1.2.

Step 3: Calculate the Bonferroni or FDR corrected probabilities.

Step 4: Report the multiple-testing corrected significant clusters in the output table, sorted from the lowest probability to the highest.
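The steps above can be sketched end to end (a simplified Python sketch, not the authors' R implementation: it applies the continuous Beta approximation to every pair, with an assumed (r + 1)/N normalization as a crude stand-in for the exact discrete distribution at small distances):

```python
from itertools import combinations
from math import comb

def nmc(positions, N, alpha=0.05):
    """Sketch of the NMC procedure: order the mutations, compute a
    Beta-approximation p-value for every pair of order statistics,
    Bonferroni-correct, and report significant clusters as rows of
    (cluster size, start, end, #mutations, corrected p)."""
    x = sorted(positions)             # Step 1: order statistics
    n = len(x)
    m = n * (n - 1) // 2              # number of pairs tested
    rows = []
    for i, k in combinations(range(1, n + 1), 2):
        size = x[k - 1] - x[i - 1]
        r = (size + 1) / N            # normalized distance (assumed continuity tweak)
        a = k - i                     # R_ki ~ Beta(k-i, i+n-k+1); CDF as binomial tail
        p = sum(comb(n, j) * r ** j * (1 - r) ** (n - j)
                for j in range(a, n + 1))
        p_corr = min(1.0, p * m)      # Step 3: Bonferroni correction
        if p_corr <= alpha:
            rows.append((size, x[i - 1], x[k - 1], k - i + 1, p_corr))
    return sorted(rows, key=lambda row: row[-1])  # Step 4: most significant first

# Five mutations piled on residue 100 of a 500-residue protein form a
# clear cluster; the isolated mutations at 300 and 400 do not.
clusters = nmc([100, 100, 100, 100, 100, 300, 400], N=500)
```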
The R source code is available in Additional file 1, and an analysis of the minimum number of mutations required for the NMC algorithm is available in Additional file 2.