The ability to visualize data is imperative in modern experimental plant genetics, where the volumes of data routinely produced far exceed the ability of humans to digest them and identify underlying phenomena. Until now, pedigree visualization has, with few exceptions [12, 13], focussed primarily on work carried out in the human genetics domain. Because plant breeding programmes involve phenomena not normally seen in human populations, such as routine inbreeding, there are additional visualization challenges that need to be overcome. Plant pedigrees also often involve large numbers of lines, many more than an average human pedigree, owing to factors such as generation time (time to sexual maturity), which is far shorter in most plant species than in their mammalian counterparts. This section looks at the various visualization techniques used to represent pedigree-based data and highlights the problems and strengths that these techniques exhibit.
Table-based approaches
Table-based visualization tools such as Flapjack [14] address some of the problems associated with visualizing large datasets and are optimized for efficient sorting and querying of genotypic and phenotypic data, but currently lack the ability to display data on a pedigree-based scaffold.
Other tools such as PedStats [15] offer statistical validation of users’ pedigree data without visualizing the actual pedigree structure. However, it is difficult, if not impossible, to conceptualize the pedigree structure of complex data sets without some visual representation.
Matrix-based visualizations represent pedigrees by using the intersection of a row and a column (the x and y axes) to define a relationship between two individuals. They have advantages over node-link or graph-centred layout approaches, including compact graph representations and the elimination of edge overlaps. However, tests generating matrix visualizations from our pedigree data showed that the data density is so low that the resulting representations are not particularly insightful. The ability to easily track flow and identify paths is also lost.
Tools such as GeneaQuilts [16] offer a new visualization technique suitable for use with thousands of individuals, but offer limited scope for the addition of complex genotypic and phenotypic data. Discussions with our users also showed that they found such representations difficult to interpret.
Finally, tools such as VIPER [17] offer novel pedigree visualization and genotypic error checking capabilities. VIPER is essentially a stack of nested table representations of generations, where rows represent sires, dams or children and columns represent individuals, which can span multiple columns where they are parents. VIPER’s primary use is the identification of genotyping problems in farmed animals, and it would be unsuitable for visualizing the complex crossing relationships that exist between crops, where selfing is not uncommon. VIPER requires separate male and female parents, which is the norm in applications handling animal or human data but is not always the case in plant breeding.
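The low data density of a pedigree matrix can be illustrated with a minimal Python sketch. The toy pedigree, the line names and the helper functions `adjacency_matrix` and `density` are assumptions made for illustration only; they are not taken from our data or from any of the tools discussed. Because each line has at most two parents, each column of the matrix holds at most two filled cells, so the fraction of filled cells cannot exceed 2/n for n lines.

```python
from typing import Dict, List, Tuple

# Hypothetical toy pedigree: child -> (parent_1, parent_2).
PEDIGREE: Dict[str, Tuple[str, str]] = {
    "C": ("A", "B"),
    "D": ("A", "B"),
    "E": ("C", "C"),  # selfing: the same line is used as both parents
    "F": ("D", "A"),  # cross back to an older line
}


def adjacency_matrix(pedigree: Dict[str, Tuple[str, str]]) -> Tuple[List[str], List[List[int]]]:
    """Return (labels, matrix) where matrix[i][j] == 1 if labels[i] is a parent of labels[j]."""
    labels = sorted(set(pedigree) | {p for parents in pedigree.values() for p in parents})
    index = {name: i for i, name in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for child, parents in pedigree.items():
        for parent in parents:
            matrix[index[parent]][index[child]] = 1
    return labels, matrix


def density(matrix: List[List[int]]) -> float:
    """Fraction of matrix cells that encode a parent-child relationship."""
    n = len(matrix)
    return sum(sum(row) for row in matrix) / (n * n)


if __name__ == "__main__":
    labels, matrix = adjacency_matrix(PEDIGREE)
    for name, row in zip(labels, matrix):
        print(name, row)
    print(f"density = {density(matrix):.3f}")
```

In this sketch a selfing event fills only one cell, so the matrix alone cannot distinguish a selfed line from a line with a single recorded parent.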
Graph-based approaches
Unlike trees, graphs allow for the precise modelling of the complexity of a plant breeding programme. Techniques such as node-link diagrams have long been used to represent graph-based data, and recent work has examined how effectively the node-link model represents graph data when compared to matrix-based visualizations [18]. Work carried out by Purchase [19, 20] and Bennett [21] also indicated that while graph layout played an important part in a user’s understanding, it was not the major factor; other aesthetics relating to node colour and shape were perhaps more important.
Most current tools have been developed for human pedigrees, where consanguineous mating events are negligible. This is not the case in plant and animal breeding, which cannot be properly modelled using tools that rely on node-link or tree hierarchies, such as Pedfiddler and Madeline [22].
Cranefoot [23] reports the use of mathematical graph structures to deal with between-relative mating but the approach is limited in its current form in the amount of information that can be attached to a node. Finally, HaploPainter [24] allows the drawing of genetic haplotypes, but suffers from being restricted in the number of individuals it is able to display.
A commonly used two-dimensional pedigree visualization tool is Peditree [13], which offers a tree-based view of the data in a pedigree. This is not suited to our requirements because plant pedigrees are not trees: inbreeding and the use of older lines in more modern crosses prevent us from treating them as such. Other tools such as the Pedigree Visualizer by Wong [25] offer new layout algorithms. Wong suggests introducing duplicate “alias” lines in representations with multiple matings from the same individuals, a phenomenon that is commonplace in plant data. PyPedal [26] offers rudimentary graph drawing tools, restricted to changing node shape to represent males and females, as well as error-checking algorithms that try to identify potential pedigree errors where appropriate genotypic data exist.
Visualization techniques such as sunbursts [27], which are space-filling versions of a node-link diagram, have the advantage that a node’s position in a hierarchy is maintained. Fan Charts [28] and H-trees [29] have also been described as a means of representing human genealogy; however, these techniques assume no inbreeding (they are trees and not graphs) and are therefore ruled out for use with plant pedigrees.
The main problem with these additional techniques is that they are not appropriate for observing a pedigree in its entirety (indeed, the complexity of the data may rule many of them out). They may nevertheless be useful when visualizing a sub-section of the data, such as a sub-pedigree for specific lines.
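Whether the ancestry of a given line satisfies the no-inbreeding assumption made by tree-based views can be checked with a short traversal. The following Python sketch is illustrative only: the pedigree dictionaries, line names and the function `ancestry_is_tree` are hypothetical. It reports an ancestry as a tree only if no ancestor is reached by more than one route, which fails for selfing and for the reuse of an older line in a later cross.

```python
from typing import Dict, Tuple

Pedigree = Dict[str, Tuple[str, ...]]

# Hypothetical outbred pedigree: every ancestor of "G" is reached by exactly one route.
OUTBRED: Pedigree = {
    "G": ("E", "F"),
    "E": ("A", "B"),
    "F": ("C", "D"),
}

# Hypothetical plant-style pedigree with selfing and reuse of an older line.
PLANT: Pedigree = {
    "C": ("A", "B"),
    "S": ("C", "C"),  # selfing: "C" is both parents of "S"
    "X": ("C", "A"),  # "A" is reached directly and again through "C"
}


def ancestry_is_tree(pedigree: Pedigree, line: str) -> bool:
    """Return True if no ancestor of `line` is reached by more than one route."""
    seen = {line}
    stack = [line]
    while stack:
        current = stack.pop()
        for parent in pedigree.get(current, ()):
            if parent in seen:
                # A second route to the same ancestor: the ancestry is a graph, not a tree.
                return False
            seen.add(parent)
            stack.append(parent)
    return True


if __name__ == "__main__":
    print(ancestry_is_tree(OUTBRED, "G"))
    print(ancestry_is_tree(PLANT, "S"))
    print(ancestry_is_tree(PLANT, "X"))
```

A sub-pedigree that passes this check could, in principle, be drawn with one of the tree-based techniques above; one that fails needs a graph-based representation or duplicated nodes.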
Layout algorithms
Plant pedigrees often form what we describe as a pedigree net, whereby there is structure to the graph but it is not as simple as the traditional top-down pedigree representation seen in humans and, to a lesser extent, in farmed animals (Figure 2). This abstract representation does include a time component in the form of generations but, because seed remains viable and some varieties and landraces may be many hundreds of years old, these older varieties can be used in modern crosses. This leads to nodes at the top of the graph having edges that connect to nodes at the bottom, which is not common in animals and would be extremely unlikely in humans. The existence of a time component means that a layout algorithm that preserves topology (top-down generations) is nonetheless important, as most (but not all) crosses will be between newer varieties. Force-directed algorithms (Figure 3B) do not allow us to arrange our pedigree based on time and are therefore not well suited to our requirements; the lack of a visually identifiable pedigree structure in such layouts is strikingly apparent.
The problem of very large pedigrees in humans has been identified, and solutions have been proposed in tools such as PViN [30], which provides windows onto large datasets but offers only pedigree drawing, with no scope for adding other information to the visualization. In addition, its traditional human family tree output is not the most efficient use of space for plant pedigrees, which form a denser net owing to reproductive behaviour that is not seen in humans or animals (Figure 2A).
Although there are problems associated with 2D node-link layouts, such as a lack of horizontal space and the crossing of edges [31], they are still well suited to displaying data of this type. 3D tools also have their problems, including visual occlusion and a tendency to visualize high-level features rather than specifics, so while some trends are easy to spot, the actual detail is hidden from the user. From this point of view they are of limited use for our purposes and offer no advantages over their 2D counterparts. Notable examples of such tools are Walrus [32] and Celestial3D [31], but their success lies in alternate problem domains.
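One way to preserve a top-down ordering of generations is to assign each line to a layer before drawing. The Python sketch below uses longest-path layering, a common first step in layered graph drawing; it is offered as an assumption for illustration and not as the specific algorithm behind any figure or tool mentioned here. The pedigree, line names and the functions `assign_layers` and `long_edges` are hypothetical, and the input is assumed to be acyclic.

```python
from typing import Dict, List, Tuple

# Hypothetical toy pedigree: child -> (parent_1, parent_2).
PEDIGREE: Dict[str, Tuple[str, str]] = {
    "C": ("A", "B"),
    "D": ("C", "C"),  # selfing
    "E": ("D", "A"),  # the old line "A" is reused in a modern cross
}


def assign_layers(pedigree: Dict[str, Tuple[str, str]]) -> Dict[str, int]:
    """Longest-path layering: founders sit on layer 0, a child one layer below its deepest parent.

    Assumes the pedigree is acyclic; very deep pedigrees would need an iterative version.
    """
    layers: Dict[str, int] = {}

    def layer_of(line: str) -> int:
        if line not in layers:
            parents = pedigree.get(line, ())
            layers[line] = 1 + max(layer_of(p) for p in parents) if parents else 0
        return layers[line]

    for child in pedigree:
        layer_of(child)
    return layers


def long_edges(pedigree: Dict[str, Tuple[str, str]], layers: Dict[str, int]) -> List[Tuple[str, str, int]]:
    """Return (parent, child, span) for every edge that skips at least one layer."""
    spans = {
        (parent, child, layers[child] - layers[parent])
        for child, parents in pedigree.items()
        for parent in parents
    }
    return sorted(edge for edge in spans if edge[2] > 1)


if __name__ == "__main__":
    layers = assign_layers(PEDIGREE)
    print(layers)
    print(long_edges(PEDIGREE, layers))
```

The edges returned by `long_edges` correspond to the situation described above, where a node near the top of the graph connects to a node near the bottom. A force-directed layout has no equivalent notion of layer, which is why the generation ordering is lost.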
Discussion
It is clear that these techniques and tools contain many useful features, but none meets the exact requirements (including data abstraction) of our problem, namely the ability to overlay genotypic and phenotypic data onto a complex pedigree structure.
There is a need for the development of tools that are tailored for the unique needs of plant breeding with the ability to explore pedigree structure, and paint additional genotypic and phenotypic data on top, to allow breeders to make informed decisions and visualize the way in which alleles for agriculturally important traits are transmitted through previous and subsequent generations. Such tools do not currently exist.
Through the examination of methodologies to display pedigree data, we suggest that the best method to visualize plant pedigree data is a layered layout (Sugiyama-style) approach (Figures 2A and 3A). This not only allows us to map accurately how breeding programmes run (including inbreeding) but also provides a well-established framework onto which a visualization can be built. The use of graphs as our data structure means that standard graph-traversal algorithms can be used to bring greater functionality to our pedigree structure, both in locating ancestors and descendants and as a logical framework for finding problems with the underlying datasets. The layered layout also brings a coherent structure to sparse relationships, and generations and topological layout are clearer than in matrix-style layouts. This is not the case with animal (Figure 2B) and human pedigrees, whose top-down, fan-type shape is not well suited to a layered layout because they quickly become very large, consuming large volumes of horizontal space [17].
Tools that allow exploration of data to bring a greater understanding of complex relationships between individuals should bring greater insight into how plant breeding programmes operate at the genetic level and how to obtain the maximum potential benefit from them. The ability to detect patterns and associations (or even anomalies) within these datasets will lead to increased depth of domain knowledge for plant breeders and geneticists. Examples include identifying problems with the inheritance of alleles, identifying lines from which additional information would allow inference of data on large parts of the pedigree, finding simple typos and errors, and looking for lines that are similar to unknown lines.
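Locating ancestors and descendants on a graph-based pedigree can be sketched with a standard breadth-first traversal. The Python example below is a minimal illustration under stated assumptions: the pedigree, the line names and the functions `ancestors` and `descendants` are hypothetical and do not describe an existing tool.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

# Hypothetical toy pedigree: child -> (parent_1, parent_2).
PEDIGREE: Dict[str, Tuple[str, str]] = {
    "C": ("A", "B"),
    "D": ("C", "C"),  # selfing
    "E": ("D", "A"),  # older line reused in a later cross
}


def _reachable(start: str, neighbours: Dict[str, Iterable[str]]) -> Set[str]:
    """Breadth-first search returning every node reachable from `start`, excluding `start`."""
    found: Set[str] = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in found:
                found.add(nxt)
                queue.append(nxt)
    found.discard(start)
    return found


def ancestors(pedigree: Dict[str, Tuple[str, str]], line: str) -> Set[str]:
    """All lines that contribute to `line`, found by following child -> parent edges."""
    return _reachable(line, pedigree)


def descendants(pedigree: Dict[str, Tuple[str, str]], line: str) -> Set[str]:
    """All lines derived from `line`, found by following parent -> child edges."""
    children = defaultdict(set)
    for child, parents in pedigree.items():
        for parent in parents:
            children[parent].add(child)
    return _reachable(line, children)


if __name__ == "__main__":
    print(sorted(ancestors(PEDIGREE, "E")))
    print(sorted(descendants(PEDIGREE, "A")))
```

Because visited lines are recorded, each line is expanded only once even when it is reached by several routes, as happens with selfing or the reuse of older lines.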