Cut Price Clusterings - a.k.

Last month we saw how we could efficiently generate hierarchical clusterings, which are sequences of sets of clusters, which are themselves subsets of a set of data that each contain elements that are similar to each other, such that if a pair of data are in the same clustering at one step then they must be in the same clustering in the next which will always be the case if we move from one step to the next by merging the closest pairs of clusters. Specifically, we used our ak.minHeap implementation of the min-heap structure to cache the distances between clusters, saving us the expense of recalculating them for clusters that don't change from one step in the hierarchy to the next.
Recall that we used three different schemes for calculating the distance between a pair of clusters, the average distance between their members, known as average linkage, the distance between their closest members, known as single linkage, and the distance between their farthest members, known as complete linkage, and that I concluded by noting that our algorithm was about as efficient as possible in general but that there is a much more efficient scheme for single linkage clusterings; efficient enough that sorting the clusters in each clustering by size would be the most costly operation and so in this post we shall implement objects to represent clusterings that don't do that.

Full text...  

The Octogram Of Seth LaPod - baron m.

Salutations Sir R-----! I trust that this fine summer weather has you thirsting for a flagon. And perhaps a wager?

Splendid! Come join me at my table!

I propose a game played as a religious observance by the parishioners of the United Reformed Eighth-day Adventist Church of Cthulhu, the eldritch octopus god that lies dead but dreaming in the drowned city of Hampton-on-Sea.
Several years ago, the Empress directed me to pose as a peasant and infiltrate their temple of Fhtagn in the sleepy village of Saint Reatham on the Hill when it was discovered that Bishop Derleth Miskatonic had been directing his congregation to purchase vast tracts of land in the Ukraine and gift them to the church in return for the promise of being spared when Cthulhu finally wakes and devours mankind.

Full text...  

Racing Up The Hierarchy - a.k.

In the previous post we saw how to identify subsets of a set of data that are in some sense similar to each other, known as clusters, by constructing sequences of clusterings starting with each datum in its own cluster and ending with all of the data in the same cluster, subject to the constraint that if a pair of data are in the same cluster in one clustering then they must also be in the same cluster in the next, which are known as hierarchical clusterings.
We did this by selecting the closest pairs of clusters in one clustering and merging them to create the next, using one of three different measures of the distance between a pair of clusters; the average distance between their members, the distance between their nearest members and the distance between their farthest members, known as average linkage, single linkage and complete linkage respectively.
Unfortunately our implementation came in at a rather costly O(n3) operations and so in this post we shall look at how we can improve its performance.

Full text...  

On The Hydra Of Argos - student

When the Baron last met with Sir R-----, he proposed a wager which commenced with the placing of twenty black tokens and fifteen white tokens in a bag. At each turn Sir R----- was to draw a token from the bag and then put it and another of the same colour back inside until there were thirty tokens of the same colour in the bag, with the Baron winning a coin from Sir R----- if there were thirty black and Sir R----- winning ten coins from the Baron if there were thirty white.
Upon hearing these rules I recognised that they described the classic probability problem known as Pólya's Urn. I explained to the Baron that it admits a relatively simple expression that governs the likelihood that the bag contains given numbers of black and white tokens at each turn which could be used to figure the probability that he should have triumphed, but I fear that he didn't entirely grasp my point.

Full text...  

A Place In The Hierarchy - a.k.

Last time we implemented the clusterings type to store a set of clustering objects in order to represent hierarchical clusterings, which are sequences of clusterings having the property that if a pair of data are in the same cluster in one clustering then they will be in the same cluster in the next, where clusters are subsets of a set of data that are in some sense similar to each other.
We then went on to define the ak.clade type to represent hierarchical clusterings as trees, so named because that's what they're called in biology when they are used to show the relationships between species and their common ancestors.
Now that we have those structures in place we're ready to see how to create hierarchical clusterings and so in this post we shall start with a simple, general purpose, but admittedly rather inefficient, way to do so.

Full text...  

Further On An Ethereal Orrery - student

Last time we met we spoke of my fellow students' and my interest in constructing a model of the motion of heavenly bodies using mathematical formulae in the place of brass. In particular we have sought to do so from first principals using Sir N-----'s law of universal gravitation, which states that the force attracting two bodies is proportional to the product of their masses divided by the square of the distance between them, and his laws of motion, which state that a body will remain at rest or in constant motion unless a force acts upon it, that it will be accelerated in the direction of that force at a rate proportional to its magnitude divided the body's mass and that a force acting upon it will be met with an equal force in the opposite direction.
Whilst Sir N----- showed that a pair of bodies traversed conic sections under gravity, being those curves that arise from the intersection of planes with cones, the general case of several bodies has proved utterly resistant to mathematical reckoning. We must therefore approximate the equations of motion and I shall now report on our first attempt at doing so.

Full text...  

Making The Clade - a.k.

We have so far seen a couple of schemes for identifying clusters in sets of data which are subsets whose members are in some sense similar to each other. Specifically, we have looked at the k means and the shared near neighbours algorithms which implicitly define clusters by the closeness of each datum to the average of each cluster and by their closeness to each other respectively.
Note that both of these algorithms use a heuristic, or rule of thumb, to assign data to clusters, but there's another way to construct clusterings; define a heuristic to measure how close to each other a pair of clusters are and then, starting with each datum in a cluster of its own, progressively merge the closest pairs until we end up with a single cluster containing all of the data. This means that we'll end up with a sequence of clusterings and so before we can look at such algorithms we'll need a structure to represent them.

Full text...  

The Hydra Of Argos - baron m.

Ho there Sir R-----! Will you join me for a cold tankard of ale to refresh yourself on this warm spring evening?

And, might I hope, for a little sport?

I should not have doubted it for a moment sir!

This fine weather reminds me of the time I spent as the Empress's trade envoy to the market city of Argos, famed almost as much for the remarkable, if somewhat fragile, mechanical contraptions made by its artificers and the most reasonably priced jewellery sold by its goldsmiths as for its fashion for tiny writing implements.

Full text...  

Nearest And Dearest - a.k.

Last time we saw how we could use the list of the nearest neighbours of a datum in a set of clustered data to calculate its strengths of association with their clusters. Specifically, we used the k nearest neighbours algorithm to define those strengths as the proportions of its k nearest neighbours that were members of each cluster or with a generalisation of it that assigned weights to the neighbours according to their positions in the list.
This time we shall take a look at a clustering algorithm that uses nearest neighbours to identify clusters, contrasting it with the k means clustering algorithm that we covered about four years ago.

Full text...  

On Pennies From Heaven - student

Recall that the Baron and Sir R-----'s most recent wager first had the Baron place three coins upon the table, choosing either heads or tails for each in turn, after which Sir R----- would follow suit. They then set to tossing coins until a run of three matched the Baron's or Sir R-----'s coins from left to right, with the Baron having three coins from Sir R----- if his made a match and Sir R----- having two from the Baron if his did.

When the Baron described the manner of play to me I immediately pointed out to him that it was Penney-Ante, which I recognised because one of my fellow students had recently employed it to enjoy a night at the tavern entirely at the expense of the rest of us! He was able to do so because it's an example of an intransitive wager in which the second player can always contrive to make a choice that will best the first player's.

Full text...