Re: [BULK] log-likelihood distance

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: [BULK] log-likelihood distance

kathrin-2
Thanks Alex. I already had a look! Seems like some complicated formula! Do
you now some example or any case studie to bring live in them? I search
some way to better understand the way the formular goes.
Thanks a lot!

Kathrin
Reply | Threaded
Open this post in threaded view
|

Re: log-likelihood distance

Reutter, Alex
Beyond the references in the algorithms document, I don't have a good reference for understanding the log-likelihood distance.  Without knowing how familiar you are with likelihood functions, maybe the following will help?

The motivation for using the log-likelihood distance is to allow the inclusion of categorical variables.  This distance measure is, appropriately enough, based on the logarithm of the likelihood function under the models in which:
  1. the cases are in separate clusters  (the first two terms in the distance formula)
  2. the cases are in a single cluster  (the third term in the distance formula)

The likelihood function is constructed under the assumptions that:
  1. continuous variables have normal distributions (an independent distribution for each variable for each cluster)
  2. categorical variables have multinomial distributions (an independent distribution for each variable for cluster)

The log-likelihood distance formula should follow from this**.  As a very simple example, consider a dataset with one continuous variable that takes values -4, -3, -2, -1, 1, 2, 3, 4.  Say you create two clusters:

Cluster 1: -4, -1, 2, 3
Cluster 2: -3, -2, 1, 4

Each cluster has variance 7.5, as does the cluster formed by combining these two clusters.  Thus, the log-likelihood distance will be 0***:

-4*(0.5*log(7.5)) - 4*(0.5*log(7.5)) + 8*(0.5*log(7.5)) = 0

On the other hand, create two clusters:

Cluster 1: -4, -3, -2, -1
Cluster 2: 1, 2, 3, 4

Each cluster has variable 1.25, and the cluster formed by combining these two clusters has variance 7.5.  Thus, the log-likelihood distance will be roughly 7.17***:

-4*(0.5*log(1.25)) - 4*(0.5*log(1.25)) + 8*(0.5*log(7.5))) = 7.17

Alex

** with the \hat{\sigma}^2_k term in the TWOSTEP CLUSTER algorithms added later to solve a particular problem, but it can be ignored for basic understanding of the equation.

*** ignoring the \hat{\sigma}^2_k term for the moment for simplicity's sake


-----Original Message-----
From: kathrin [mailto:[hidden email]]
Sent: Thursday, February 08, 2007 2:22 PM
To: [hidden email]; Reutter, Alex
Cc: kathrin
Subject: Re: [BULK] log-likelihood distance
Importance: Low

Thanks Alex. I already had a look! Seems like some complicated formula! Do
you now some example or any case studie to bring live in them? I search
some way to better understand the way the formular goes.
Thanks a lot!

Kathrin