OPTIMAL BINNING /CRITERIA METHOD=EQUALFREQ(BINS=n) bins a
continuous (scale) variable into n categories of approximately
equal frequency.
This is "unsupervised" optimal binning, available through syntax only. According to the CSR (Command Syntax Reference), if the requested number of bins n is greater than the number of distinct values k observed in the variable, the procedure leaves the k values as the k bins. I've observed that this is not always the case. In the following example, k=5 and the requested n=7. One would expect the procedure to return 5 bins (i.e. not actually bin anything). In fact, it produced 4 bins. Either this is a bug or a particular contingency of the algorithm. I wonder what the SPSS team may say.

data list list /v1 (f8).
begin data
1
1
1
1
1
1
2
2
2
2
2
3
3
3
3
3
3
3
4
4
4
4
4
4
4
4
4
4
5
5
5
5
5
5
5
5
5
5
5
5
5
end data.

optimal binning
  /variables bin= v1 save= yes(into= v1#)
  /criteria method= equalfreq(bins= 7)
  /missing scope= pairwise.

V1# has 4 values, not 5.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
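One way to see how five distinct values can end up in four bins is a naive equal-frequency rule that places cutpoints at the i/n sample quantiles and then merges duplicate cutpoints. The Python sketch below is only an illustration under that assumption: `naive_equalfreq_bins` is a hypothetical helper, not the documented OPTIMAL BINNING algorithm, and it is not meant to reproduce SPSS's actual bin boundaries.

```python
from collections import Counter

# Data from the example above: value -> frequency (41 cases in total).
counts = {1: 6, 2: 5, 3: 7, 4: 10, 5: 13}
values = sorted(v for v, c in counts.items() for _ in range(c))


def naive_equalfreq_bins(sorted_values, n_bins):
    """Bin by quantile cutpoints (hypothetical rule, NOT SPSS's algorithm)."""
    n = len(sorted_values)
    # Cutpoint i is the value at rank ceil(i * n / n_bins), i = 1..n_bins-1.
    # Tied data can produce the same cutpoint several times; the set merges them.
    cuts = sorted({sorted_values[(i * n + n_bins - 1) // n_bins - 1]
                   for i in range(1, n_bins)})

    def bin_of(x):
        # A value goes into the first bin whose upper cutpoint is >= the value.
        for b, c in enumerate(cuts, start=1):
            if x <= c:
                return b
        return len(cuts) + 1

    return cuts, [bin_of(x) for x in sorted_values]


cuts, bins = naive_equalfreq_bins(values, 7)
print(cuts)                           # distinct cutpoints that survive
print(sorted(Counter(bins).items()))  # (bin, frequency) pairs
```

With this data and 7 requested bins, the six quantile cutpoints collapse to four distinct values (1, 3, 4, 5), so values 2 and 3 share a bin and only four bins are non-empty. Whether SPSS does anything similar internally is not established here.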
The CSR says:

If the number of distinct values in a binning input variable is greater than the BINS value, then the number of bins created is no more than the BINS value. Otherwise, BINS gives an upper bound on the number of bins created. Thus, for example, if BINS = 10 is specified but a binning input variable has at most 10 distinct values, then the number of bins created will equal the number of distinct values in the input variable.

It doesn't say how many bins you get when BINS is greater than the number of values, just that there will be no more than that number of bins. In your data, there are five distinct values. With your syntax, specifying bins=7 gives 4 bins. However, if you specify bins=4, it also gives 4 bins, but the boundary of the first bin is slightly different. That seems a little weird, but it might have initially created five bins and then packed them down to four.

On Sat, Sep 19, 2020 at 9:45 AM Kirill Orlov <[hidden email]> wrote:
If they consider such weird, unintuitive behaviour normal (which may
be the case), I think it ought to be explicitly stated and
exemplified in the CSR.
19.09.2020 19:32, Jon Peck wrote:
In reply to this post by Jon Peck
Perhaps there should be an option in syntax not to undertake any
binning at all if the number of observed distinct values is
not greater than the requested number of bins. In that case, the
original variable should perhaps be copied as the "binned" one, to
produce some result, along with a due warning.
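The suggested option could behave roughly like the following Python sketch. It only illustrates the proposal and is not an existing SPSS feature: `bin_or_copy` and its `binner` argument are invented names, and the `binner` passed in the usage lines is a placeholder.

```python
import warnings


def bin_or_copy(values, n_bins, binner):
    """Hypothetical guard: bin only if distinct values outnumber n_bins."""
    n_distinct = len(set(values))
    if n_distinct <= n_bins:
        warnings.warn(
            f"{n_distinct} distinct values <= {n_bins} requested bins; "
            "copying the variable unbinned"
        )
        return list(values)  # the copy plays the role of the "binned" variable
    return binner(values, n_bins)  # any binning routine supplied by the caller


# Data from the thread: 5 distinct values, 7 requested bins.
v1 = [1] * 6 + [2] * 5 + [3] * 7 + [4] * 10 + [5] * 13
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    v1_binned = bin_or_copy(v1, 7, binner=lambda vals, n: vals)
print(len(set(v1_binned)), len(caught))  # distinct values kept, warnings raised
```

Here the guard triggers, so all five original values are kept and one warning is recorded.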
In reply to this post by Kirill Orlov
I looked at the algorithms doc, which has a very detailed description of the rules, but I didn’t wade through it to see if that explains what happens with your data. The equal frequency algorithm refers to a user-specified number of cutpoints (n), but there is no such parameter documented in the syntax. However, slight wobbles in binning with small datasets and small numbers of values would not be surprising in an algorithm like this.

In the absence of a guide variable, as here, the Visual Binner would give good results.

On Sat, Sep 19, 2020 at 10:42 AM Kirill Orlov <[hidden email]> wrote:
In reply to this post by Kirill Orlov
But it isn't exactly a "requested number of bins". It's an upper bound. Thus, for example, if BINS = 10 is specified but a binning input variable has at most 10 distinct values, then the number of bins created will equal the number of distinct values in the input variable.

BTW, there is a dialog box for this procedure under Transform.

On Sat, Sep 19, 2020 at 10:53 AM Kirill Orlov <[hidden email]> wrote:
The Optimal Binning dialog is for the supervised method only.
The unsupervised "equal frequency algorithm" must coincide with the Visual Binning "equal percentile" Make cut-points option. However, Visual Binning is not explicitly tied to one specific syntax command.
In reply to this post by Jon Peck
BTW, RANK CASES with the NTILES method, which is an alternative to
Optimal Binning for obtaining "equal percentile groups", may sometimes also yield fewer output bins than the number of unique values k in the data when k >= the requested number of bins. That can be irritating too, although it is obviously not a bug.
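A small Python sketch can show how this happens with the data from the first post. It assumes that NTILES assigns group floor(R * k / (W + 1)) + 1, where R is the mean rank of a tied value, k is the number of requested groups and W is the number of cases. That formula is my reading of the RANK algorithm, not something stated in this thread, and `ntile_groups` is a hypothetical helper.

```python
import math

# Data from the first post: value -> frequency (W = 41 cases).
counts = {1: 6, 2: 5, 3: 7, 4: 10, 5: 13}


def ntile_groups(counts, k):
    """Map each distinct value to an ntile group (assumed NTILES formula)."""
    w = sum(counts.values())
    groups = {}
    seen = 0  # number of cases with smaller values
    for value in sorted(counts):
        c = counts[value]
        mean_rank = seen + (c + 1) / 2  # mean of ranks seen+1 .. seen+c
        groups[value] = math.floor(mean_rank * k / (w + 1)) + 1
        seen += c
    return groups


groups = ntile_groups(counts, 5)
print(groups)                      # value -> ntile group
print(len(set(groups.values())))   # number of non-empty groups
```

Under that assumption, requesting 5 groups for these 5 distinct values puts 2 and 3 in the same group and leaves group 4 empty, so only four groups appear in the output.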