SPSSX Discussion

OPTIMAL BINNING not very consistent


Kirill Orlov

OPTIMAL BINNING not very consistent

OPTIMAL BINNING /CRITERIA METHOD=EQUALFREQ(BINS=n) bins a continuous (scale) variable into n categories of approximately equal frequency.
This is "unsupervised" optimal binning; it is available through syntax only.

According to the CSR (Command Syntax Reference), if the number of requested bins n is greater than the observed number of distinct values k in the variable, the procedure leaves the k values as the k bins. I have observed that this is not always the case. In the following example, k=5 and the requested n=7. I expected the procedure to return 5 bins (i.e., not to bin anything at all). In fact, it produced 4 bins. Either this is a bug or a peculiarity of the algorithm. I wonder what the SPSS team would say.

data list list /v1 (f8).
begin data
1
1
1
1
1
1
2
2
2
2
2
3
3
3
3
3
3
3
4
4
4
4
4
4
4
4
4
4
5
5
5
5
5
5
5
5
5
5
5
5
5
end data.

optimal binning /variables bin= v1 save= yes(into= v1# )
/criteria method= equalfreq(bins= 7) /missing scope= pairwise.

V1# has 4 distinct values, not 5.
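As a quick check outside SPSS, the example data can be tabulated to confirm k and to see how uneven the value frequencies are relative to the per-bin target. This is a minimal Python sketch, not SPSS syntax; the variable names are illustrative only.

```python
from collections import Counter

# The v1 column from the DATA LIST block above, written as value * count.
v1 = [1] * 6 + [2] * 5 + [3] * 7 + [4] * 10 + [5] * 13

freq = Counter(v1)                  # cases per distinct value
k = len(freq)                       # number of distinct values
n_cases = len(v1)                   # total number of cases
requested_bins = 7                  # BINS=7 in the syntax above
target_per_bin = n_cases / requested_bins

print(dict(sorted(freq.items())))   # {1: 6, 2: 5, 3: 7, 4: 10, 5: 13}
print(k, n_cases, round(target_per_bin, 2))   # 5 41 5.86
```

Every distinct value except 2 already holds more cases than the nominal target of about 5.86 per bin, so any merging of values moves further away from equal frequencies.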

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Jon Peck

Re: OPTIMAL BINNING not very consistent

The CSR says:

"If the number of distinct values in a binning input variable is greater than the BINS value, then the number of bins created is no more than the BINS value. Otherwise, BINS gives an upper bound on the number of bins created. Thus, for example, if BINS = 10 is specified but a binning input variable has at most 10 distinct values, then the number of bins created will equal the number of distinct values in the input variable."

It doesn't say how many bins you get when BINS is greater than the number of values, just that there will be no more than that number of bins.

In your data, there are five distinct values. Your syntax, which specifies bins=7, gives 4 bins. If you specify bins=4, it also gives 4 bins, but the boundary of the first bin is slightly different. That seems a little weird, but it might have initially created five bins and then packed them down to four.


Kirill Orlov

Re: OPTIMAL BINNING not very consistent

In reply to this post by Jon Peck (Sep 19, 2020, 19:32)

If they consider such weird, unintuitive behaviour normal (which may well be the case), I think it ought to be explicitly stated and exemplified in the CSR.

Kirill Orlov

Re: OPTIMAL BINNING not very consistent

In reply to this post by Jon Peck

Perhaps the syntax should offer an option to skip binning altogether when the number of observed distinct values is not greater than the requested number of bins. In that case, the original variable could be copied as the "binned" one, so that some result is still produced, along with a due warning.
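The proposed behaviour amounts to a guard in front of the binning step. A minimal Python sketch of that idea follows; it is not an existing SPSS option, and `bin_or_copy` and the `binner` callable are hypothetical names introduced only for illustration.

```python
import warnings

def bin_or_copy(values, bins, binner):
    """Bin `values` only if there are more distinct values than `bins`.

    `binner` is any callable (values, bins) -> list of bin codes. It stands in
    for the real binning procedure and is purely hypothetical here.
    """
    n_distinct = len(set(values))
    if n_distinct <= bins:
        # Nothing to gain from binning: warn and return an unbinned copy.
        warnings.warn(
            f"{n_distinct} distinct values <= {bins} requested bins; "
            "copying the variable unbinned."
        )
        return list(values)
    return binner(values, bins)

# Example data from the first post (5 distinct values), with BINS=7.
v1 = [1] * 6 + [2] * 5 + [3] * 7 + [4] * 10 + [5] * 13
dummy_binner = lambda vals, b: [0] * len(vals)   # placeholder, never a real binner

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = bin_or_copy(v1, 7, dummy_binner)

print(result == v1, len(caught))   # True 1
```

With a request below the number of distinct values (for example 4), the guard passes the data on to the binner unchanged.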


Jon Peck

Re: OPTIMAL BINNING not very consistent

In reply to this post by Kirill Orlov

I looked at the algorithms doc, which has a very detailed description of the rules, but I didn’t wade through it to see if that explains what happens with your data.

The equal frequency algorithm refers to a user-specified number of cutpoints (n), but there is no such parameter documented in the syntax.

However, slight wobbles in binning with small datasets and small numbers of values would not be surprising in an algorithm like this.
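To illustrate how ties alone can produce such wobbles, here is a minimal Python sketch of a naive equal-frequency rule applied to the example data. The rule (a cut point at the value of every ceil(i*N/bins)-th sorted case) is an assumption chosen for illustration; it is not the rule documented for OPTIMAL BINNING, and the function names are hypothetical.

```python
def naive_equal_freq_cuts(values, bins):
    """Upper cut points taken at the ceil(i * N / bins)-th sorted case.

    A textbook-style rule for illustration only, NOT the SPSS algorithm.
    Duplicate cut points (caused by ties) collapse into one.
    """
    data = sorted(values)
    n = len(data)
    positions = [(i * n + bins - 1) // bins for i in range(1, bins)]  # integer ceil
    return sorted({data[p - 1] for p in positions})

def assign_bins(values, cuts):
    """Bin code = number of cut points strictly below the value (bins are (lo, hi])."""
    return [sum(c < v for c in cuts) for v in values]

# Example data from the first post: 5 distinct values, BINS=7 requested.
v1 = [1] * 6 + [2] * 5 + [3] * 7 + [4] * 10 + [5] * 13

cuts = naive_equal_freq_cuts(v1, 7)
codes = assign_bins(v1, cuts)

print(cuts)              # [1, 3, 4, 5]
print(len(set(codes)))   # 4
```

Under this naive rule, no cut point lands on the value 2, so 2 and 3 share a bin and five distinct values end up in four bins. That the count matches the SPSS output may be a coincidence and should not be read as an explanation of what the procedure actually does.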

In the absence of a guide variable, as here, the Visual Binner would give good results.


Jon Peck

Re: OPTIMAL BINNING not very consistent

In reply to this post by Kirill Orlov

But it isn't exactly a "requested number of bins". It's an upper bound. Thus, for example, if BINS = 10 is specified but a binning input variable has at most 10 distinct values, then the number of bins created will equal the number of distinct values in the input variable.

BTW, there is a dialog box for this procedure under Transform.


Kirill Orlov

Re: OPTIMAL BINNING not very consistent

The Optimal Binning dialog is for the supervised method only.
The unsupervised "equal frequency" algorithm presumably coincides with the "equal percentile" Make Cutpoints option in Visual Binning. However, Visual Binning is not explicitly tied to one specific syntax command.


Kirill Orlov

Re: OPTIMAL BINNING not very consistent

In reply to this post by Jon Peck

BTW, RANK CASES with the NTILES method, which is an alternative to Optimal Binning for obtaining "equal percentile groups", may also sometimes yield fewer output bins than the number of unique values k in the data, even when k >= the requested number of bins.

That, too, can be irritating, even though it is obviously not a bug.
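The NTILES effect can be reproduced on the example data from the first post. The sketch below is Python, not SPSS syntax, and rests on two assumptions: tied cases receive the mean of their ranks, and the percentile group is trunc(rank * k / (N + 1)) + 1 for unweighted data. Both are recalled from the RANK algorithm description and should be checked against the documentation.

```python
def mean_ranks(values):
    """Rank cases 1..N, giving tied cases the mean of their ranks."""
    first, last = {}, {}
    for pos, v in enumerate(sorted(values), start=1):
        first.setdefault(v, pos)   # first position of each value
        last[v] = pos              # last position of each value
    return [(first[v] + last[v]) / 2 for v in values]

def ntiles(values, k):
    """Assumed NTILES rule: group = trunc(rank * k / (N + 1)) + 1."""
    n = len(values)
    return [int(r * k / (n + 1)) + 1 for r in mean_ranks(values)]

# Example data: 5 distinct values, 41 cases.
v1 = [1] * 6 + [2] * 5 + [3] * 7 + [4] * 10 + [5] * 13

groups = ntiles(v1, 5)
print(sorted(set(groups)))   # [1, 2, 3, 5]
```

Under these assumptions, values 2 and 3 both fall into group 2 and group 4 stays empty, so requesting 5 groups from 5 distinct values yields only 4 occupied groups.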
