Fig. 4 | BMC Bioinformatics

Fig. 4

From: Detecting genomic deletions from high-throughput sequence data with unsupervised learning

Detecting true deletions with unsupervised learning. Twenty-five deletion candidates, \(Del_0\) to \(Del_{24}\), are identified, each described by multiple features. PCA is applied to all candidates, and the top two principal components are used to represent each candidate. Hierarchical clustering groups the candidates into four clusters: blue (6), yellow (6), red (6) and green (7). Based on \(\sum _{i=0}^{2} THAvg_{LN_i}\), the blue, yellow and red clusters are marked as good, while the green cluster is marked as bad. A statistical filter is then applied to find true deletions. In a good cluster, a deletion is discarded if \(LN_{i} < T_{low}\) and \(\sum _{j=0}^{2} LN_i(j \ne i) < T_{low}\). In a bad cluster, a deletion is kept if \(LN_{i}>T_{high}\) or \(\sum _{j=0}^{2} LN_i(j \ne i) > T_{high}\). After filtering, 4, 5, 6 and 2 deletions remain in the blue, yellow, red and green clusters, respectively. These deletions are reported as true deletions.
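
The keep/discard rule in the caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `Candidate`, `keep_candidate`, `ln_i` and `ln_sum` are hypothetical, the threshold values in the usage note are made up, and the summed term from the caption is passed in precomputed because the caption alone does not fully define it.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One deletion candidate (hypothetical container, not from the paper)."""
    name: str
    cluster_is_good: bool  # label assigned to the candidate's cluster
    ln_i: float            # the LN_i term from the caption
    ln_sum: float          # the summed LN term from the caption, precomputed


def keep_candidate(c: Candidate, t_low: float, t_high: float) -> bool:
    """Apply the caption's filter to a single candidate.

    Good cluster: discard only if both terms fall below t_low.
    Bad cluster: keep only if either term exceeds t_high.
    """
    if c.cluster_is_good:
        return not (c.ln_i < t_low and c.ln_sum < t_low)
    return c.ln_i > t_high or c.ln_sum > t_high


def filter_candidates(cands, t_low, t_high):
    """Return the candidates that survive the filter, in input order."""
    return [c for c in cands if keep_candidate(c, t_low, t_high)]
```

With assumed thresholds `t_low=2.0` and `t_high=5.0`, a candidate in a good cluster with both terms equal to 1.0 would be discarded, while a candidate in a bad cluster with `ln_i=6.0` would be kept. The rule is asymmetric by design: good clusters keep candidates by default, and bad clusters discard them by default.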
