Tree-penalized Path Length (tpPL) criterion for data seriation
From this page, you can see comparisons of the results of using different seriation criteria on synthetic datasets.
Three criteria are used:
- Path length, so the orderings are solutions of the Traveling Salesman Problem (TSP). This gives the shortest path lengths, but sometimes the orderings obscure the structure of the data.
- Tree-penalized path length (tpPL), which adds a penalty to the distance matrix to discourage paths from jumping between branches of a hierarchical clustering tree. This produces somewhat longer paths, but they generally finish one cluster before moving on to the next.
- Optimal leaf order (OLO), which finds the shortest path consistent with a tree coming from hierarchical clustering. This produces the longest paths, some of which are quite inefficient.
The 2022 article that describes the tpPL ordering method is available online.
There are 44 sythetic datasets, 40 of them two dimensional, and one RNA dataset.
The synthetic datasets and the code that generated them is available on GitHub.
The 3D structures forming the RNA dataset come from the RNA motif group HL_33239.14.
See the link to view the 3D coordinates of the instances.
Four hierarchical clustering linkages are used for tpPL and OLO; these are average, complete, single, and Ward.
tpPL orderings are shown for a variety of tree penalty strengths; as the penalty strength increases from 0, the paths transition from TSP to tpPL to OLO paths, and the path lengths get longer.
For each dataset, each clustering linkage, and each tree penalty strength, the best known ordering is shown, as well as a re-ordered heat map of all-against-all distances.
- Individual view shows one objective at a time plus the backtracking indicator matrix and values of all numerical measures
- Multiple view shows the best known TSP ordering (lower left), the best known tpPL ordering with a given penalty strength (upper right), and the OLO ordering (lower right) all on one page
Frontier graphs for all linkages are posted.
The sections below have links to instructive examples of the different ordering methods making imperfect orderings.
These are not meant to be exhaustive, but rather to make it easy to find the prime examples.
Examples are given for OLO average and OLO complete, and comments are made concerning OLO Ward when appropriate.
OLO single consistently gives long, inefficient paths, so no examples are chosen to illustrate that.
Datasets with elongated clusters generally have those clusters split by tpPL and OLO with average, complete, and Ward linkages; only one example of that (line with points off the line) will be shown below.
Examples where TSP backtracks
- Three parallel bars; TSP backtracks in all three bars, forming dark X patterns in the heat map; tpPL average and OLO average have much smaller backtracks
- Three rectangular bars; TSP backtracks in the bottom right bar; tpPL average and OLO average have much smaller backtracks
Examples where TSP splits clusters
- Central cluster with four spokes; TSP splits the central cluster, which appears as four separated dark squares in the heat map; tpPL average and OLO average do not split the central cluster; this dataset is very difficult to order linearly because of the four arms
- Six normal clusters 5; TSP splits the central cluster, which appears as a dark checkerboard in the heat map; tpPL average and OLO average do not split the central cluster
- Hypertriangle in 6 dimensions; TSP splits the second cluster; tpPL average and OLO average do not
- Five by two normal clusters; TSP backtracks a lot, splitting clusters; tpPL and OLO have much smaller backtracks, and their heat maps clearly identify 10 clusters.
Examples where OLO average makes it appear there is a sharp cluster boundary
- Inverted U shape; OLO average heat map shows an apparent cluster, but only because of an inefficient jump across the open space in the dataset; OLO complete, single, and Ward do the same; TSP and tpPL do not make this jump
- Pinwheel; OLO average incorrectly suggests a separate cluster in the lower right of the heat map because of a long jump across open space in the dataset; OLO complete and Ward also make long jumps; tpPL average does well, but tpPL with other linkages also make long jumps
- Letter I; OLO average shows a sharp break in the bottom right of the heat map due to a long jump in the lower left of the dataset; OLO Ward does a similar thing in a different place; tpPL does well with all linkages
- New normal clusters 5; OLO average heat map has a sharp break in the lower right due to an inefficient jump to the last two clusters; tpPL average makes efficient jumps between clusters; TSP, OLO complete, tpPL complete, and tpPL Ward split the central cluster
Examples where OLO average has a particularly inefficient path
Examples where OLO complete splits clusters
- New normal clusters 2; OLO complete path produces a checkerboard pattern from the middle cluster; tpPL complete is better, but tpPL average and tpPL Ward do better yet; note that TSP has a dark X surrounded by a circle in the heat map because of a spiral path through a cluster; this is an artifact of the ordering, not a feature of the data
- Six normal clusters 3; OLO complete path produces a checkerboard pattern from the middle cluster; tpPL complete does not
- Six normal clusters 4; OLO complete path jumps back and forth between the second and third clusters, creating dark bars off the diagonal in the heat map; tpPL complete does not
Examples where OLO ward has a particularly inefficient path
- New normal clusters 2; OLO Ward path produces a checkerboard pattern from the middle cluster, and has several long jumps; tpPL Ward is better; note that TSP has a dark X surrounded by a circle in the heat map because of a spiral path through a cluster; this is an artifact of the ordering, not a feature of the data
- Six normal clusters 2 plus 60 percent noise; OLO Ward path has several inefficient jumps and makes it appear there is a sharp cluster boundary; tpPL Ward path is substantially shorter;
Examples where tpPL average does poorly
- Line and points off the line; tpPL average splits the line cluster and traverses it in two different directions which makes it difficult to recognize the single long cluster in the heat map; OLO average traverses the cluster in one direction and its heat map clearly shows that the data contains an elongated cluster. Single linkage would be better in this case, and tpPL single and OLO single do well.
Examples where tpPL complete does poorly
- Line and points off the line; tpPL complete splits the line into three parts and covers them in different directions; OLO complete does this as well but to worse effect
- Pinwheel; tpPL complete uses a long jump which makes it look like there is a sharp cluster boundary; OLO complete does so as well
- Central cluster with four spokes; tpPL complete splits the central cluster; OLO complete does not
- New normal clusters 5; tpPL complete splits the central cluster; TSP and OLO complete also do so
Examples where tpPL single does poorly
- Three parallel bars; tpPL single backtracks on the 3 bars, but not as badly as TSP; OLO single has a very long path length
- Three rectangular bars; tpPL single backtracks on the last bar; TSP backtracks on one bar; OLO single path is long and crosses itself often
- New normal clusters 3; tpPL single has a dark circle artifact in the center of the heat map, due to a spiral path through one cluster; other methods do not do this
- Hypertriangle in 5 dimensions; tpPL single splits the last two clusters; OLO single does even worse, with many points not appearing to belong to any cluster at all
- Hypertriangle in 6 dimensions;tpPL single splits a cluster; TSP also splits a cluster; OLO single leaves many points not appearing to belong to any cluster at all
Examples where tpPL Ward does poorly
- Line and points off the line; tpPL Ward splits the line and traverses the two parts in different directions which makes it difficult to recognize the single long cluster in the heat map; OLO Ward simply makes excursions from the line
- Pinwheel; tpPL Ward makes it look like there is a cluster due to a long inefficient jump; OLO Ward does this as well with approximately the same severity
- New normal clusters 5; tpPL Ward splits the central cluster; TSP also splits the central cluster; OLO Ward does not
Copyright 2022 Denis A. Aliyev and Craig L. Zirbel.