#### Definition of the minimum distance

Consider the difference in log R ratio (*r*) between offspring (O) and father (F) at a single marker, calculated as *r*_{O} − *r*_{F}. We denote this paternal distance by *δ*_{F}. A comparable calculation for offspring and mother provides a measure of the maternal distance, denoted by *δ*_{M}. We define the minimum distance between parents and offspring as the signed parental distance with the smaller absolute value:

$$d \equiv \begin{cases} \delta_F & \text{if } |\delta_F| \le |\delta_M| \\ \delta_M & \text{otherwise.} \end{cases} \tag{1}$$

The calculation is easily vectorized in R, and its computation for ≈ 610,000 log R ratios obtained from Illumina’s 610 quad array for a single trio is nearly instantaneous. Denoting the minimum distance vector by d, consecutive negative or positive values in a genomic interval suggest DNA copy number loss or gain, respectively, relative to the most similar parental copy number. Although its calculation at a given marker is independent of the neighboring markers, the minimum distance can reduce technical variation from correlated probe effects as well as the peaks and troughs of genomic waves that vary smoothly over large regions of the genome (e.g., Figure 1c). Alternatives to d include the difference between the log R ratios of the offspring and those of the CNV-transmitting parent. However, such an alternative requires inference of the CNV-transmitting parent and incurs a trade-off in variance when technical factors such as wave and probe effects in the offspring are more correlated with the non-CNV-transmitting parent.
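As a rough illustration of the per-marker calculation, the sketch below computes the minimum distance for a toy trio. It assumes the minimum distance is the signed offspring-minus-parent difference with the smaller absolute value. The function name `minimum_distance`, the tie-breaking rule, and the log R ratio values are all made up for this example, and plain Python is used here rather than the vectorized R code of the package.

```python
def minimum_distance(r_offspring, r_father, r_mother):
    """Per marker, return the signed offspring-minus-parent difference
    that has the smaller absolute value."""
    d = []
    for r_o, r_f, r_m in zip(r_offspring, r_father, r_mother):
        delta_f = r_o - r_f  # paternal distance
        delta_m = r_o - r_m  # maternal distance
        # Assumption: ties in absolute value are resolved in favor of the father.
        d.append(delta_f if abs(delta_f) <= abs(delta_m) else delta_m)
    return d


# Toy log R ratios at four markers (invented values).
r_o = [-0.50, -0.45, 0.02, 0.40]
r_f = [0.01, -0.03, 0.00, 0.05]
r_m = [-0.05, 0.04, 0.03, 0.35]

d = minimum_distance(r_o, r_f, r_m)
print(d)
```

In this toy input the first two markers give clearly negative values, the pattern that would suggest a copy number loss in the offspring relative to the most similar parent.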

#### Segmentation of the minimum distance

Single-sample segmentation algorithms applied to the univariate d can be used to identify breakpoints of potentially de novo CNVs. We currently favor circular binary segmentation (CBS) [9, 12] for its maturity in the Bioconductor package DNAcopy and its use as a benchmark in comparison papers for CNV detection algorithms [43]. Nonstandard options for CBS implemented in MinimumDistance include special handling of large gaps in the array’s coverage of the genome (see Methods) and a pruning step to remove breakpoints that is a function of the number of markers on a segment (coverage) and the standardized difference in segment means (see Additional file 1).

The minimum distance can reduce artifacts that are shared by one or both parents and the offspring. In the motivating example (Figure 1), we argue that genomic waves contribute to false de novo and transmitted deletions when the trough of a genomic wave spans regions lacking heterozygous genotypes. Application of CBS to d calculated in the motivating example smooths the trough of the genomic wave (not shown), thereby avoiding local maxima in the likelihood identified by the joint HMM. The subsequent classification of the trio copy number (discussed next) for the minimum distance segment spanning the trough overwhelmingly favors a diploid trio copy number state due to the large number of heterozygous genotypes in the broader region.

As the minimum distance is a relative measure, regions with non-zero minimum distance do not necessarily indicate de novo CNVs. For example, a 300 kb region with positive d on chromosome 14 suggests a de novo duplication (bottom panel, Additional file 2: Figure S1). However, visual inspection of the B allele frequencies and log R ratios reveals deletions in both parents while the offspring is diploid (panels 1-3, Additional file 2: Figure S1). To avoid false positive de novo CNV calls for regions such as chromosome 14, estimation of the absolute copy numbers is needed. We use maximum a posteriori estimation to infer the absolute copy numbers, as described in the following section.
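The following toy calculation illustrates why a relative measure alone can mislead. It assumes the minimum distance is the signed offspring-minus-parent difference with the smaller absolute value, and it uses invented log R ratios (roughly −0.5 for a hemizygous deletion, roughly 0 for two copies). None of the numbers come from the chromosome 14 region discussed above.

```python
def minimum_distance(r_o, r_f, r_m):
    """Signed offspring-minus-parent difference with the smaller magnitude."""
    d = []
    for o, f, m in zip(r_o, r_f, r_m):
        delta_f, delta_m = o - f, o - m
        # Assumption: ties go to the paternal distance.
        d.append(delta_f if abs(delta_f) <= abs(delta_m) else delta_m)
    return d


# Invented values: both parents carry a deletion, the offspring is diploid.
offspring = [0.02, -0.01, 0.03]
father = [-0.48, -0.52, -0.50]
mother = [-0.55, -0.47, -0.51]

d = minimum_distance(offspring, father, mother)
mean_d = sum(d) / len(d)
print(d, mean_d)
```

Every element of `d` is positive even though the offspring has two copies, so a call based on the sign of d alone would report a gain. Classifying the absolute copy number of all three individuals resolves the ambiguity.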

#### Maximum a posteriori estimation

We classify the copy number states of the minimum distance segments using a fully probabilistic model based in part on the joint HMM. Our approach delineates de novo events by finding the mode of the distribution of P(states | data, …) over the set of possible trio states. More formally, the maximum a posteriori estimate for the trio copy number for candidate segment *l* is defined as

$$\hat{\mathbf{s}}_l \equiv \underset{\mathbf{s}_l \in S}{\arg\max}\; P(\mathbf{s}_l \mid \mathbf{B}_l, \mathbf{R}_l, \boldsymbol{\Theta}). \tag{2}$$

The vector s_{l} contains the copy number state symbols for the trio, denoted *xyz*, where *x* is the state symbol for the father, *y* is the state symbol for the mother, and *z* is the state symbol for the offspring. The copy number state symbols are 1 = homozygous deletion, 2 = hemizygous deletion, 3 = diploid copy number, 5 = single copy gain, and 6 = two copy gain. The triplet 332, for example, corresponds to a de novo hemizygous deletion in the offspring. These integer state symbols are used for consistency with PennCNV and are subject to change in the software implementation of MinimumDistance. The set of 121 biologically plausible trio copy number states is denoted by *S*; it excludes the 4 of the 5^{3} = 125 possible trio states in which both parents are homozygous null and the offspring has one or more copies. The parameter **Θ** denotes other parameters for our model, including the transition probabilities and initial state probabilities. The matrices of B allele frequencies (B_{l}) and log R ratios (R_{l}) are *n*_{l} × 3 matrices, where *n*_{l} is the number of markers spanned by segment *l* (hereafter referred to as coverage) and the columns are the individuals in the trio. We remark that the ratio of the posterior probability of the maximum a posteriori estimate to the probability of a trio of diploid copy numbers can be used to rank de novo CNVs.
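The count of 121 states can be checked by direct enumeration. The sketch below lists all triplets over the five state symbols and drops those in which both parents are homozygous null while the offspring has at least one copy. The names `STATE_SYMBOLS`, `HOMOZYGOUS_NULL`, and `trio_states` are chosen for this example and are not taken from the MinimumDistance package.

```python
from itertools import product

# State symbols as given in the text (the symbol 4 is not used).
STATE_SYMBOLS = (1, 2, 3, 5, 6)
HOMOZYGOUS_NULL = 1


def trio_states():
    """Enumerate father/mother/offspring triplets, skipping implausible ones."""
    states = []
    for father, mother, offspring in product(STATE_SYMBOLS, repeat=3):
        both_parents_null = father == HOMOZYGOUS_NULL and mother == HOMOZYGOUS_NULL
        if both_parents_null and offspring != HOMOZYGOUS_NULL:
            continue  # offspring cannot carry copies that neither parent has
        states.append(f"{father}{mother}{offspring}")
    return states


S = trio_states()
print(len(S))
```

The triplet `"332"` (de novo hemizygous deletion in the offspring) is retained, whereas triplets such as `"113"` are excluded.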

The conditional probability of the trio copy number in equation (2) can be re-expressed using Bayes’ rule as a product of the likelihood and the joint probability of the copy number states. (Hereafter, we refer to the conditional probability in equation (2) as a posterior probability.) Factoring the joint probability of the trio state as in Wang *et al.* [40], we write the posterior probability as

for the first segment and

for segments *l* > 1. This is a first-order Markov model incorporating terms P(*s*_{1,O} | …) and P(*s*_{l,O}, *s*_{l−1,O} | …) for Mendelian transmission of CNVs as implemented in the joint HMM [40], but assessed on previously determined DNA segments (see Methods). Assuming conditional independence of the log R ratios and B allele frequencies given the unobserved copy number states, the likelihood is

As copy number estimates from hybridization-based arrays are noisy, our goal is to estimate the likelihood robustly.

Our approach for robust-to-outlier estimation of a sample’s log R ratio likelihood is predicated on a mixture distribution for the emitted log R ratios. Specifically, for individual k of a trio and marker i, we assume a mixture distribution for the log R ratio given by

where the normal component captures within-sample variation for copy number state *s* and the uniform component captures outliers arising from technical artifacts that we assume to be independent of the latent copy number. The parameter *ε*_{r,k} is the probability of observing an outlier log R ratio in sample *k*. Similar mixture models have been proposed for aCGH [44] and are adapted here for genotyping platforms. The means, variances, and outlier mixture probabilities are estimated via the Baum-Welch algorithm as described in the Methods section [45].
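To make the role of the outlier component concrete, here is a minimal sketch of a normal-plus-uniform emission density for a single log R ratio. The functional form (weight 1 − ε on the normal, weight ε on the uniform), the uniform support of [−3, 3], and all parameter values are assumptions made for illustration. In the model described above these quantities are estimated from the data, and the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def lrr_emission(r, mean, sd, eps, lower=-3.0, upper=3.0):
    """Mixture density: (1 - eps) * normal + eps * uniform on [lower, upper].

    Assumption: the uniform support [lower, upper] is an illustrative choice.
    """
    uniform = 1.0 / (upper - lower) if lower <= r <= upper else 0.0
    return (1.0 - eps) * normal_pdf(r, mean, sd) + eps * uniform


# A diploid-like state (mean 0, sd 0.2) evaluated at a typical value and an outlier.
typical = lrr_emission(0.05, mean=0.0, sd=0.2, eps=0.01)
outlier = lrr_emission(2.5, mean=0.0, sd=0.2, eps=0.01)
print(math.log(typical), math.log(outlier))
```

Without the uniform term, the value 2.5 would sit more than 12 standard deviations from the mean and would dominate the segment's log-likelihood. With it, the contribution of a single outlier is bounded below by the uniform component.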

With the exception of the homozygous null state, robust-to-outlier estimation of the B allele frequency likelihood for a sample is also implemented via a mixture model. In particular, for positive copy number states we assume a theoretical mixture distribution given by

where the truncated-normal mixture captures within-sample heterogeneity of the B allele frequencies over the possible genotypes for state *s* (*GT*_{s}) and the uniform zero-one density captures technical variation that we assume to be independent of the genotype and copy number state. As B allele frequencies are thresholded to the [0,1] interval, the proportion of outlier log R ratios, *ε*_{r,k}, does not necessarily correspond to the proportion of outlier B allele frequencies, *ε*_{b,k}, motivating their separate parameterization. The mixture probability *p*_{i,g} is estimated from a binomial density parameterized by the number of copies of the A allele in genotype *g* (e.g., 2 for genotype *AA*) and the population frequency of the A allele. The means, variances, and outlier mixture probabilities for the B allele frequencies are estimated via the Baum-Welch algorithm as described in the Methods section. For the homozygous null state, we assume the B allele frequencies are emitted from a uniform zero-one distribution.
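The sketch below shows one way the B allele frequency emission could be assembled for a positive copy number: binomial genotype weights, a truncated normal on [0, 1] for each genotype, and a uniform zero-one outlier component. Several pieces are assumptions for illustration only: the genotype means are set to the expected fraction of B alleles, a single standard deviation is shared across genotypes, and the function names are hypothetical. In the model described above the means and variances are estimated rather than fixed.

```python
import math


def genotype_weights(copy_number, freq_a):
    """Binomial probability of each genotype, indexed by its number of A alleles."""
    return [
        math.comb(copy_number, n_a) * freq_a**n_a * (1.0 - freq_a) ** (copy_number - n_a)
        for n_a in range(copy_number + 1)
    ]


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncnorm_pdf(x, mean, sd):
    """Density of a normal distribution truncated to the interval [0, 1]."""
    if not 0.0 <= x <= 1.0:
        return 0.0
    z = (x - mean) / sd
    density = math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    mass = normal_cdf((1.0 - mean) / sd) - normal_cdf((0.0 - mean) / sd)
    return density / mass


def baf_emission(b, copy_number, freq_a, sd, eps):
    """(1 - eps) * genotype mixture + eps * Uniform(0, 1), for copy_number >= 1."""
    mixture = 0.0
    for n_a, weight in enumerate(genotype_weights(copy_number, freq_a)):
        # Assumption: the genotype mean is the expected share of B alleles.
        mean_b = (copy_number - n_a) / copy_number
        mixture += weight * truncnorm_pdf(b, mean_b, sd)
    return (1.0 - eps) * mixture + eps * 1.0


weights = genotype_weights(2, 0.5)  # weights for BB, AB, AA at two copies
at_het = baf_emission(0.50, copy_number=2, freq_a=0.5, sd=0.05, eps=0.01)
off_cluster = baf_emission(0.25, copy_number=2, freq_a=0.5, sd=0.05, eps=0.01)
print(weights, at_het, off_cluster)
```

For the diploid state, a B allele frequency of 0.5 falls on the heterozygous cluster and receives a high density, whereas 0.25 falls between clusters and is explained almost entirely by the uniform component.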

The likelihood in equations (3) and (4) is multiplied by terms involving the conditional probability of the offspring copy number, the initial state probability of the parental copy numbers (if *l* = 1), and transition probabilities for the parental copy numbers (if *l* > 1). We calculate the conditional probability for the offspring copy number by integrating out (averaging over) Mendelian and non-Mendelian models for CNV transmission. The derivation of the conditional probability is similar to the derivation in the joint HMM, but indexed over segments instead of markers. We leave the mathematical details to Additional file 1 (see also [40]) and the specification of the initial state and transition probabilities to the Methods section. Multiplying these terms by the likelihood provides an estimate of the posterior probability. Repeating the estimation procedure for each of the 121 possible trio states, we obtain the posterior distribution over trio states. The mode of this distribution is the maximum a posteriori estimate. Conditional on the maximum a posteriori estimate at segment *l*, we repeat the procedure for segment *l* + 1 until maximum a posteriori estimates are available for all segments.
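The segment-by-segment procedure can be sketched as a greedy loop that scores all 121 trio states and then conditions on the chosen state when moving to the next segment. The callables `log_likelihood` and `log_prior` are hypothetical placeholders for the likelihood and for the Mendelian, initial state, and transition terms. The toy scoring functions at the end are invented solely so that the example runs.

```python
from itertools import product

STATE_SYMBOLS = (1, 2, 3, 5, 6)


def trio_states():
    """The 121 (father, mother, offspring) triplets considered plausible."""
    return [
        s for s in product(STATE_SYMBOLS, repeat=3)
        if not (s[0] == 1 and s[1] == 1 and s[2] != 1)
    ]


def map_states(n_segments, log_likelihood, log_prior):
    """Pick the highest-scoring trio state for each segment in turn.

    log_likelihood(l, s) and log_prior(l, s, prev) are hypothetical callables;
    prev is None for the first segment and the previous estimate afterwards.
    """
    states = trio_states()
    estimates = []
    prev = None
    for l in range(n_segments):
        best = max(states, key=lambda s: log_likelihood(l, s) + log_prior(l, s, prev))
        estimates.append(best)
        prev = best  # condition the next segment on this estimate
    return estimates


# Toy scores: the data favor a diploid trio, then a de novo hemizygous deletion.
def toy_log_likelihood(l, s):
    favored = [(3, 3, 3), (3, 3, 2)][l]
    return 0.0 if s == favored else -5.0


def toy_log_prior(l, s, prev):
    # Mild penalty for changing state between consecutive segments.
    return 0.0 if prev is None or s == prev else -1.0


estimates = map_states(2, toy_log_likelihood, toy_log_prior)
print(estimates)
```

Working on the log scale avoids numerical underflow when per-marker densities are multiplied across a long segment.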

Segmentation and maximum a posteriori estimation are performed independently for each chromosomal arm and each trio, enabling an embarrassingly parallel implementation. Computational speed is derived from the parallel architecture and the implementation of the computationally intensive maximum a posteriori estimation (121 calculations) on a set of segments that is typically several orders of magnitude smaller than the number of markers on the array.
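Because each chromosomal arm of each trio is an independent unit of work, the units can be dispatched to a pool of workers. The sketch below uses a thread pool to keep the example short. The function `process_unit` is a hypothetical stand-in for segmentation followed by maximum a posteriori estimation, and the trio and arm labels are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def process_unit(unit):
    """Hypothetical stand-in: segment one arm of one trio and call its states."""
    trio_id, arm = unit
    return (trio_id, arm, "done")


trios = ["trio1", "trio2"]
arms = ["chr1p", "chr1q", "chr2p"]
units = list(product(trios, arms))  # one independent task per trio and arm

# Threads keep the sketch simple; CPU-bound work would more likely use processes.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_unit, units))

print(results)
```

No task reads another task's output, so the results can simply be collected in submission order once all workers finish.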