Beyond the references in the algorithms document, I don't have a good reference for understanding the log-likelihood distance. Without knowing how familiar you are with likelihood functions, maybe the following will help?
The motivation for using the log-likelihood distance is to allow the inclusion of categorical variables. This distance measure is, appropriately enough, based on the logarithm of the likelihood function under the models in which:
1. the cases are in separate clusters (the first two terms in the distance formula)
2. the cases are in a single cluster (the third term in the distance formula)
The likelihood function is constructed under the assumptions that:
1. continuous variables have normal distributions (an independent distribution for each variable for each cluster)
2. categorical variables have multinomial distributions (an independent distribution for each variable for each cluster)
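For the continuous case, the step from these assumptions to the distance can be written out. This is a sketch in my own notation (N_v, \hat{\sigma}^2_v and \ell_v are not symbols from the algorithms document), for a single continuous variable, with the variance estimated by maximum likelihood within each cluster v and the extra \hat{\sigma}^2_k term left out:

```latex
% Maximized normal log-likelihood of cluster v with N_v cases
\ell_v = -\frac{N_v}{2}\left(\log\left(2\pi\hat{\sigma}^2_v\right) + 1\right)

% Separate clusters 1 and 2 versus the combined cluster 12.
% Because N_1 + N_2 = N_{12}, the constant terms cancel:
\ell_1 + \ell_2 - \ell_{12}
  = -\frac{N_1}{2}\log\hat{\sigma}^2_1
    - \frac{N_2}{2}\log\hat{\sigma}^2_2
    + \frac{N_{12}}{2}\log\hat{\sigma}^2_{12}
```

The first two terms come from the separate-cluster model and the third from the single-cluster model. The result is small when merging the clusters costs little in likelihood and large when it costs a lot.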
The log-likelihood distance formula should follow from this**. As a very simple example, consider a dataset with one continuous variable that takes values -4, -3, -2, -1, 1, 2, 3, 4. Say you create two clusters:
Cluster 1: -4, -1, 2, 3
Cluster 2: -3, -2, 1, 4
Each cluster has variance 7.5 (dividing by N rather than N-1), as does the cluster formed by combining these two clusters. Thus, using natural logs, the log-likelihood distance will be 0***:
-4*(0.5*log(7.5)) - 4*(0.5*log(7.5)) + 8*(0.5*log(7.5)) = 0
On the other hand, create two clusters:
Cluster 1: -4, -3, -2, -1
Cluster 2: 1, 2, 3, 4
Each cluster has variance 1.25, and the cluster formed by combining these two clusters has variance 7.5. Thus, the log-likelihood distance will be roughly 7.17***:
-4*(0.5*log(1.25)) - 4*(0.5*log(1.25)) + 8*(0.5*log(7.5)) = 7.17
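The two calculations above can be checked with a short Python sketch. The function names (`variance`, `xi`, `loglik_distance`) are made up for illustration and are not from the TWOSTEP CLUSTER documentation. The sketch assumes a single continuous variable, the population variance (dividing by N), natural logs, and no \hat{\sigma}^2_k term, exactly as in the hand calculation.

```python
import math


def variance(values):
    """Population variance (divides by N, not N-1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def xi(values):
    """Simplified log-likelihood term for one cluster: -N * 0.5 * log(variance).

    Assumption: one continuous variable, extra sigma-hat term ignored.
    """
    return -len(values) * 0.5 * math.log(variance(values))


def loglik_distance(cluster_a, cluster_b):
    """Separate-cluster terms minus the combined-cluster term."""
    return xi(cluster_a) + xi(cluster_b) - xi(cluster_a + cluster_b)


# First example: the two clusters overlap completely.
mixed = loglik_distance([-4, -1, 2, 3], [-3, -2, 1, 4])

# Second example: the two clusters are well separated.
separated = loglik_distance([-4, -3, -2, -1], [1, 2, 3, 4])

print(round(mixed, 2), round(separated, 2))  # prints 0.0 7.17
```

Note that a cluster whose values are all identical has variance 0, so `math.log` would fail in this simplified version.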
Alex
** with the \hat{\sigma}^2_k term in the TWOSTEP CLUSTER algorithms added later to solve a particular problem, but it can be ignored for basic understanding of the equation.
*** ignoring the \hat{\sigma}^2_k term for the moment for simplicity's sake
-----Original Message-----
From: kathrin [mailto:[hidden email]]
Sent: Thursday, February 08, 2007 2:22 PM
To: [hidden email]; Reutter, Alex
Cc: kathrin
Subject: Re: [BULK] log-likelihood distance
Importance: Low
Thanks Alex. I already had a look! It seems like quite a complicated formula! Do
you know of an example or a case study that brings it to life? I am looking for
a way to better understand how the formula works.
Thanks a lot!
Kathrin