OPTIMAL BINNING not very consistent

Kirill Orlov
OPTIMAL BINNING with /CRITERIA METHOD= EQUALFREQ(BINS=n) bins a continuous (scale) variable into n categories of approximately equal frequency.
This is "unsupervised" optimal binning, and it is available through syntax only.

According to the CSR (Command Syntax Reference), if the number of requested bins n is greater than the observed number of distinct values k in the variable, the procedure leaves the k values as the k bins. I've observed that this is not always the case. In the following example, k=5 and the requested n=7. The procedure is expected to return 5 bins (i.e. not to bin anything at all). In fact, it produced 4 bins. Either this is a bug or a quirk of the algorithm. I wonder what the SPSS team would say.

data list list /v1 (f8).
begin data
1
1
1
1
1
1
2
2
2
2
2
3
3
3
3
3
3
3
4
4
4
4
4
4
4
4
4
4
5
5
5
5
5
5
5
5
5
5
5
5
5
end data.

optimal binning /variables bin= v1 save= yes(into= v1# )
 /criteria method= equalfreq(bins= 7) /missing scope= pairwise.

V1# has 4 values, not 5 values.
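
To see which of the original values end up sharing a bin, a cross-tabulation of the source variable against the saved one is a quick check. This is only a sketch; it assumes the OPTIMAL BINNING command above has already been run in the same session, so that v1# exists.

```spss
* Rows are the 5 original values, columns are the saved bins.
crosstabs /tables= v1 by v1#.
```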

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Re: OPTIMAL BINNING not very consistent

Jon Peck
The CSR says:
"If the number of distinct values in a binning input variable is greater than the BINS value, then the number of bins created is no more than the BINS value. Otherwise, BINS gives an upper bound on the number of bins created. Thus, for example, if BINS = 10 is specified but a binning input variable has at most 10 distinct values, then the number of bins created will equal the number of distinct values in the input variable."

It doesn't say how many bins you get when BINS is greater than the number of distinct values, just that there will be no more than that number of bins.

In your data, there are five distinct values. Your syntax, which specifies bins=7, gives 4 bins. If you specify bins=4 instead, it also gives 4 bins, but the boundary of the first bin is slightly different. That seems a little weird, but it might have initially created five bins and then packed them down to four.
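
The comparison can be reproduced along these lines. It is a sketch built on the syntax from the original post; the target names v1#7 and v1#4 are placeholders chosen for illustration, and nothing is assumed here about what the algorithm does internally.

```spss
* Bin v1 twice, once with BINS=7 and once with BINS=4.
optimal binning /variables bin= v1 save= yes(into= v1#7)
 /criteria method= equalfreq(bins= 7) /missing scope= pairwise.
optimal binning /variables bin= v1 save= yes(into= v1#4)
 /criteria method= equalfreq(bins= 4) /missing scope= pairwise.

* Compare where each original value lands under the two settings.
crosstabs /tables= v1 by v1#7 v1#4.
```

The two cross-tabulations show which original values share a bin in each run, which is where a difference in the first bin's boundary would become visible.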

On Sat, Sep 19, 2020 at 9:45 AM Kirill Orlov <[hidden email]> wrote:
[...]


--
Jon K Peck
[hidden email]


Re: OPTIMAL BINNING not very consistent

Kirill Orlov
If they consider such odd, unintuitive behaviour normal (which it may be), I think it ought to be stated explicitly and illustrated with an example in the CSR.


19.09.2020 19:32, Jon Peck wrote:
[...]

Re: OPTIMAL BINNING not very consistent

Kirill Orlov
In reply to this post by Jon Peck
Perhaps there should be a syntax option not to do any binning at all if the number of observed distinct values is not greater than the requested number of bins. In that case, the original variable could simply be copied as the "binned" one, so that some result is produced, along with a due warning.


Re: OPTIMAL BINNING not very consistent

Jon Peck
In reply to this post by Kirill Orlov
I looked at the algorithms doc, which has a very detailed description of the rules, but I didn’t wade through it to see if that explains what happens with your data.

The equal frequency algorithm refers to a user-specified number of cutpoints (n), but there is no such parameter documented in the syntax.

However, slight wobbles in binning with small datasets and small numbers of values would not be surprising in an algorithm like this.

In the absence of a guide variable, as here, the Visual Binner would give good results.


On Sat, Sep 19, 2020 at 10:42 AM Kirill Orlov <[hidden email]> wrote:
[...]

Re: OPTIMAL BINNING not very consistent

Jon Peck
In reply to this post by Kirill Orlov
But it isn't exactly a "requested number of bins". It's an upper bound. Thus, for example, if BINS = 10 is specified but a binning input variable has at most 10 distinct values, then the number of bins created will equal the number of distinct values in the input variable.

BTW, there is a dialog box for this procedure under Transform.

On Sat, Sep 19, 2020 at 10:53 AM Kirill Orlov <[hidden email]> wrote:
[...]

Re: OPTIMAL BINNING not very consistent

Kirill Orlov
The Optimal Binning dialog is for the supervised method only.
The unsupervised "equal frequency" algorithm presumably coincides with the "equal percentile" option under Make Cutpoints in Visual Binning. However, Visual Binning is not explicitly tied to any one syntax command.


Re: OPTIMAL BINNING not very consistent

Kirill Orlov
In reply to this post by Jon Peck
BTW, Rank Cases (RANK) with the NTILES function, an alternative to OPTIMAL BINNING for obtaining "equal percentile groups", may likewise sometimes yield fewer output bins than the number of unique values k in the data, in situations where k >= the requested number of bins.

That can be irritating too, even though it obviously is not a bug.
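
To try this on the example data from the start of the thread, something like the following could be used. It is only a sketch: the target name v1nt and the request for 5 groups (equal to k for these data) are assumptions made for illustration, and TIES=MEAN is spelled out only to make the tie handling explicit.

```spss
* Equal-percentile groups via RANK instead of OPTIMAL BINNING.
rank variables= v1 /ntiles(5) into v1nt /ties= mean /print= no.

* Count how many distinct groups the 5 original values fall into.
crosstabs /tables= v1 by v1nt.
```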
