Tree-penalized Path Length (tpPL) criterion for data seriation

From this page, you can see comparisons of the results of using different seriation criteria on synthetic datasets. Three criteria are used:

Path length, so the orderings are solutions of the Traveling Salesman Problem (TSP). This gives the shortest path lengths, but sometimes the orderings obscure the structure of the data.
Tree-penalized path length (tpPL), which adds a penalty to the distance matrix to discourage paths from jumping between branches of a hierarchical clustering tree. This produces somewhat longer paths, but they generally finish one cluster before moving on to the next.
Optimal leaf order (OLO), which finds the shortest path consistent with a tree coming from hierarchical clustering. This produces the longest paths, some of which are quite inefficient.

The 2022 article that describes the tpPL ordering method is available online.

There are 44 sythetic datasets, 40 of them two dimensional, and one RNA dataset. The synthetic datasets and the code that generated them is available on GitHub. The 3D structures forming the RNA dataset come from the RNA motif group HL_33239.14. See the link to view the 3D coordinates of the instances.

Four hierarchical clustering linkages are used for tpPL and OLO; these are average, complete, single, and Ward. tpPL orderings are shown for a variety of tree penalty strengths; as the penalty strength increases from 0, the paths transition from TSP to tpPL to OLO paths, and the path lengths get longer. For each dataset, each clustering linkage, and each tree penalty strength, the best known ordering is shown, as well as a re-ordered heat map of all-against-all distances.

Individual view shows one objective at a time plus the backtracking indicator matrix and values of all numerical measures
Multiple view shows the best known TSP ordering (lower left), the best known tpPL ordering with a given penalty strength (upper right), and the OLO ordering (lower right) all on one page

Frontier graphs for all linkages are posted.

The sections below have links to instructive examples of the different ordering methods making imperfect orderings. These are not meant to be exhaustive, but rather to make it easy to find the prime examples. Examples are given for OLO average and OLO complete, and comments are made concerning OLO Ward when appropriate. OLO single consistently gives long, inefficient paths, so no examples are chosen to illustrate that. Datasets with elongated clusters generally have those clusters split by tpPL and OLO with average, complete, and Ward linkages; only one example of that (line with points off the line) will be shown below.

Examples where TSP backtracks

Three parallel bars; TSP backtracks in all three bars, forming dark X patterns in the heat map; tpPL average and OLO average have much smaller backtracks
Three rectangular bars; TSP backtracks in the bottom right bar; tpPL average and OLO average have much smaller backtracks

Examples where TSP splits clusters

Central cluster with four spokes; TSP splits the central cluster, which appears as four separated dark squares in the heat map; tpPL average and OLO average do not split the central cluster; this dataset is very difficult to order linearly because of the four arms
Six normal clusters 5; TSP splits the central cluster, which appears as a dark checkerboard in the heat map; tpPL average and OLO average do not split the central cluster
Hypertriangle in 6 dimensions; TSP splits the second cluster; tpPL average and OLO average do not
Five by two normal clusters; TSP backtracks a lot, splitting clusters; tpPL and OLO have much smaller backtracks, and their heat maps clearly identify 10 clusters.

Examples where OLO average makes it appear there is a sharp cluster boundary

Inverted U shape; OLO average heat map shows an apparent cluster, but only because of an inefficient jump across the open space in the dataset; OLO complete, single, and Ward do the same; TSP and tpPL do not make this jump
Pinwheel; OLO average incorrectly suggests a separate cluster in the lower right of the heat map because of a long jump across open space in the dataset; OLO complete and Ward also make long jumps; tpPL average does well, but tpPL with other linkages also make long jumps
Letter I; OLO average shows a sharp break in the bottom right of the heat map due to a long jump in the lower left of the dataset; OLO Ward does a similar thing in a different place; tpPL does well with all linkages
New normal clusters 5; OLO average heat map has a sharp break in the lower right due to an inefficient jump to the last two clusters; tpPL average makes efficient jumps between clusters; TSP, OLO complete, tpPL complete, and tpPL Ward split the central cluster

Examples where OLO average has a particularly inefficient path

New normal clusters 4; OLO average path crosses itself many times and is substantially longer than tpPL average
Six normal clusters 6; OLO average path crosses itself many times and is substantially longer than tpPL average
Six normal clusters 3 plus 100 percent noise; OLO average path crosses itself several times, makes several long jumps, and is substantially longer than tpPL average

Examples where OLO complete splits clusters

New normal clusters 2; OLO complete path produces a checkerboard pattern from the middle cluster; tpPL complete is better, but tpPL average and tpPL Ward do better yet; note that TSP has a dark X surrounded by a circle in the heat map because of a spiral path through a cluster; this is an artifact of the ordering, not a feature of the data
Six normal clusters 3; OLO complete path produces a checkerboard pattern from the middle cluster; tpPL complete does not
Six normal clusters 4; OLO complete path jumps back and forth between the second and third clusters, creating dark bars off the diagonal in the heat map; tpPL complete does not

Examples where OLO ward has a particularly inefficient path

New normal clusters 2; OLO Ward path produces a checkerboard pattern from the middle cluster, and has several long jumps; tpPL Ward is better; note that TSP has a dark X surrounded by a circle in the heat map because of a spiral path through a cluster; this is an artifact of the ordering, not a feature of the data
Six normal clusters 2 plus 60 percent noise; OLO Ward path has several inefficient jumps and makes it appear there is a sharp cluster boundary; tpPL Ward path is substantially shorter;

Examples where tpPL average does poorly

Line and points off the line; tpPL average splits the line cluster and traverses it in two different directions which makes it difficult to recognize the single long cluster in the heat map; OLO average traverses the cluster in one direction and its heat map clearly shows that the data contains an elongated cluster. Single linkage would be better in this case, and tpPL single and OLO single do well.

Examples where tpPL complete does poorly

Line and points off the line; tpPL complete splits the line into three parts and covers them in different directions; OLO complete does this as well but to worse effect
Pinwheel; tpPL complete uses a long jump which makes it look like there is a sharp cluster boundary; OLO complete does so as well
Central cluster with four spokes; tpPL complete splits the central cluster; OLO complete does not
New normal clusters 5; tpPL complete splits the central cluster; TSP and OLO complete also do so

Examples where tpPL single does poorly

Three parallel bars; tpPL single backtracks on the 3 bars, but not as badly as TSP; OLO single has a very long path length
Three rectangular bars; tpPL single backtracks on the last bar; TSP backtracks on one bar; OLO single path is long and crosses itself often
New normal clusters 3; tpPL single has a dark circle artifact in the center of the heat map, due to a spiral path through one cluster; other methods do not do this
Hypertriangle in 5 dimensions; tpPL single splits the last two clusters; OLO single does even worse, with many points not appearing to belong to any cluster at all
Hypertriangle in 6 dimensions;tpPL single splits a cluster; TSP also splits a cluster; OLO single leaves many points not appearing to belong to any cluster at all

Examples where tpPL Ward does poorly

Line and points off the line; tpPL Ward splits the line and traverses the two parts in different directions which makes it difficult to recognize the single long cluster in the heat map; OLO Ward simply makes excursions from the line
Pinwheel; tpPL Ward makes it look like there is a cluster due to a long inefficient jump; OLO Ward does this as well with approximately the same severity
New normal clusters 5; tpPL Ward splits the central cluster; TSP also splits the central cluster; OLO Ward does not