How to make a dendrogram using the D3 library (part 1)

In this series of articles we will learn how to make a dendrogram on web using the D3 graphic library.

Dendrograms are a very useful tool of data visualization, especially for the cluster analysis or clustering.  The dendrogram is essentially a tree graph.

Representing a dendrogram, we begin by considering as a starting point the root. This is the origin point from which all links branch off. Each link ends in a point, called node, from which, in turn, other links branch off until to reach a set of terminal nodes, called leaves.

tree-structure

Fig.1: tree structure

(Note: for the sake of clarity I used arrows, but actually the tree is an undirected graph)

These tree structures are used to represent hierarchical structures with different levels, where, starting from considering the root as the level 0, each generation of nodes will define the next level (level 1, livel 2, …) up to reach the leaves (level n-1, where n is the depth of the graph).

tree-structure2

Fig.2: levels of a tree structure

The dendrogram, in particular, is the result of a calculation derived from an algorithm or by a method of hierarchical representation, in which, in addition to the hierarchy, the ultrametric distance is also shown.

Each link of the dendrogram corresponds to a cluster, whereas each node, with its position, identifies the distance to which the different clusters merge.

Fig.3 Il dendrogramma

Fig.3 Il dendrogramma

libro

Many methods are available to calculate the distance between clusters, including the centroids method, which I discussed in the book Beginning Javascript Charts with jqPlot, D3 and Highcharts, (2013) Apress. Other methods are:

  • single-link method
  • complete-link method
  • average-link method
  • Ward’s method
  • centroids method (already mentioned)

Most likely, in the future, I’ll discuss these methods one by one. However, so far, you can easily find some of these algorithms already implemented in different programming languages.

As a starting point, I referred to the dendrogram of the M. Bostock’s example #4063570 (http://bl.ocks.org/mbostock/4063570).

dendrogr_bostock1

Fig.4: Bostock’s example

Actually, if you notice carefully, this example is not a dendrogram, but only a tree structure. Indeed, all nodes on the same level of hierarchy are vertically aligned. Thus the ultrametric distances are not considered at all. In the example code that we will implement in this sequence of articles, we will develop real dendrograms in which also the distances will be taken into account.

dendrogr_bostock2

Fig.5: all nodes at the same level are aligned

Thus the Bostock’s example shows only the hierarchical information and gives an image of a qualitative approach to the cluster analysis, without taking into account the distance at which cluster differ to each other (quantitative approach).

But that’s not all, in the last example, I will introduce a further distance. This distance quantifies in an alternative way (and supplementary) the distance between the various clustered elements. In this case, the leaves of the dendrogram will not be shown evenly spaced, and at the same time, the branching of the links from each node will no longer be symmetrical.

dendrogrammaXY

Fig.6: dendrogram with two distances

Considering this additional feature, it is possible to obtain also the case in which the links of two different clusters overlap. Consequently leaves of two different clusters may appear closer than those belonging to the same cluster. This happens because the clustering was not made according to the distribution of the leaves alogn the distance x (see Fig.6 and 7), since the clustering should be based on the similarity of the analyzed samples.

dendrogrammaXY2

Fig.7: dendrogram with overlapping links

But now it is time to start with the first example. First, we will implement the tree structure in all respects similar to the structure of the Bostock’s example. But for our purposes, we will work on a much simpler structure (3 levels and 8 nodes), in order to better focus on the basic aspects of a dendrogram development.

dendrogram_es01

Fig.8: struttura ad albero

As we can see the structure is distributed over two level (in addition to the root level) and I labeled each node with the corresponding role within the tree structure.

This tree structure should be written with a specific format, so that it can be read by the JavaScript code that we are going to implement. The JSON format is the best candidate.

Thus, using any text editor, let’s copy the following listing, which defines the tree structure in Figure 8, and save it as dendrogram01.json.

In the next article we conclude this example, developing within an HTML page, the code necessary to correctly read the data in the JSON file. Finally we will obtain a tree structure corresponding to the entered data.

Leave a Reply