Randomly Select a Specific Number of Cases by Group

ariel barak
 Dear List,

The background is that I have a list of open cases for probation officers,
and I want to randomly pick 10 of each officer's cases for review. For
officers with fewer than 10 open cases, I would like to select all of
their cases. There are 20 probation officers and 395 open cases; caseloads
range from 1 to 35 open cases.

I think I have modified the syntax in the links below correctly. However,
the correct number of cases is not always returned, regardless of whether
the officer has more or fewer than 10 cases. For example, for one officer
with 35 cases, running the syntax below sometimes returns 10 cases
(correct) and other times 8 or 9 (incorrect). I'm wondering whether I made
a mistake in adjusting the code, or whether there is some issue with it
that someone could pinpoint. The syntax that I'm adjusting was posted to
the list here:

http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0803&L=spssx-l&P=R13320&m=59416

http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0803&L=spssx-l&P=R13761&m=59416


The only modifications that I have made are changing the variable names and
the minimum number of records from 127 in the posted syntax to 10. What is
additionally worrisome to me is that when I run the code on the sample data
below, it works correctly. However, when I run the same code on my real
data, it doesn't seem to work properly. I would be happy to send a copy of
my 395 cases to anyone off-list who is interested in helping me figure this
issue out.

Any help would be greatly appreciated!

-Ari

*Sample Data.
DATA LIST LIST /Patient_Number (A9) Officer_ID (A7) Program (A7).
BEGIN DATA
041949 006415 PROB
045284 006415 PROB
046107 006415 PROB
047019 006415 PROB
048501 006415 PROB
049087 006415 PROB
052716 006415 PROB
056991 006415 PROB
057073 006415 PROB
060727 006415 PROB
061118 006415 PROB
061120 006415 PROB
061207 006415 PROB
064713 007991 PROB
051234 007991 PROB
061749 007991 PROB
048163 007991 PROB
044949 011512 PROB
045274 011512 PROB
048107 011512 PROB
042019 011512 PROB
048401 011512 PROB
049187 011512 PROB
058716 011512 PROB
096991 011512 PROB
037073 011512 PROB
063627 011512 PROB
068318 011512 PROB
061310 011512 PROB
066207 011512 PROB
048451 011512 PROB
044187 011512 PROB
020716 011512 PROB
076981 011512 PROB
017073 011512 PROB
052627 011512 PROB
061318 011512 PROB
031380 011512 PROB
026237 011512 PROB
END DATA.

FREQ Officer_ID.

DATASET NAME OriginalData.
DATASET COPY ListofSampleData.
******************************************************.
DATASET ACTIVATE ListofSampleData.

SORT CASES BY Officer_ID /* if necessary */.

*  Set random-number generator parameters, if desired   .
SET RNG = MT       /* 'Mersenne twister' random-no. generator */ .
SET MTINDEX = 7778 /* or other starting value - anything      */ .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
    /BREAK=Officer_ID
    /NRecords 'Number of open cases for Officer'=NU.

NUMERIC   #K #N (F3).

DO IF   $CASENUM EQ 1
      OR Officer_ID       NE LAG(Officer_ID).
.  COMPUTE #N = NRecords  /* Total open records,    per Officer */.
.  COMPUTE #K = MIN(NRecords, 10) /* Set sample size     */.
END IF.

COMPUTE #Take_It = RV.BERNOULLI(#K/#N).
COMPUTE #K = #K - #Take_It.
COMPUTE #N = #N - 1.
SELECT IF #Take_It.

FREQ Officer_ID.
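
For readers following along outside SPSS, the "k/n" logic that the syntax above is meant to implement can be sketched in a few lines of Python. This is only an illustration of the technique, not the code from the linked postings; the function name sample_k_of_n and the seed are made up for the example. Each case is kept with probability (cases still needed) / (cases still remaining), so exactly min(k, n) cases come out for every group.

```python
import random

def sample_k_of_n(cases, k, rng):
    """Sequential k/n selection over one group's cases."""
    need = min(k, len(cases))   # plays the role of #K = MIN(NRecords, 10)
    remaining = len(cases)      # plays the role of #N = NRecords
    chosen = []
    for case in cases:
        # Keep this case with probability need / remaining.
        take = rng.random() < need / remaining
        if take:
            chosen.append(case)
            need -= 1
        remaining -= 1
    return chosen

rng = random.Random(7778)  # arbitrary seed, for repeatability only
caseloads = {"006415": 13, "007991": 4, "011512": 22}
for officer, n in caseloads.items():
    picked = sample_k_of_n(list(range(n)), 10, rng)
    print(officer, len(picked))   # counts are 10, 4 and 10
```

When the logic runs as intended, the count never falls short: once need equals remaining the probability is 1, and once need reaches 0 it is 0.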

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Randomly Select a Specific Number of Cases by Group

Maguin, Eugene
Ariel,

I'm not saying your code can't be made to work, just that I'd work the
problem this way (see the bottom of this message). Note: the code is untried.


>>The background is that I have a list of open cases for probation officers
>>and I want to randomly pick 10 of their cases for review. For the probation
>>officers that have less than 10 cases open, I would like to select all of
>>their cases. There are 20 probation officers and 395 open cases. The
>>caseload varies from 1 to 35 open cases. I think I have modified the syntax
>>in the links below correctly, however, the correct number of cases is not
>>always returned regardless of whether the probation officer has more or less
>>than 10 cases. For example, there is a probation officer that has 35 cases
>>and after I run the syntax below, sometimes I have 10 cases (correct) and
>>other times I have 8 or 9 etc.(incorrect). I'm wondering if it's possible
>>that I either made a mistake in adjusting the code or perhaps there is some
>>issue with it that someone could pinpoint. The syntax that I'm
>>adjusting/using was posted to the list here:


Compute rv=uniform(1).
Sort cases by officer_id rv.

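* Number the cases within each officer, in random order (seq = 1, 2, ...).
Do if ($casenum eq 1 or officer_id ne lag(officer_id)).
+  compute seq=1.
Else.
+  compute seq=lag(seq)+1.
End if.
* Flag the first 10 cases per officer; pick is 1 for those and 0 otherwise.
* Counting with lag(pick) alone would restart after the 11th case.
Compute pick=(seq le 10).
Execute.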

Temporary.
Select if (pick gt 0).
.....
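
The random-sort idea above can also be sketched in Python for comparison. This is a rough illustration under stated assumptions, not a translation of the SPSS code: the function name sample_by_random_sort and the (officer, case index) row layout are invented for the example. Every row gets a random number, rows are ordered by that number within each officer, and the first k rows per officer are kept.

```python
import random
from collections import Counter, defaultdict

def sample_by_random_sort(rows, key, k, rng):
    """Attach a random number to each row, sort within group, keep the first k."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append((rng.random(), row))
    picked = []
    for group_id in sorted(groups):
        # Order this group's rows by their random number only.
        ranked = sorted(groups[group_id], key=lambda pair: pair[0])
        picked.extend(row for _, row in ranked[:k])
    return picked

# Hypothetical rows: (officer, case index), with the caseloads of the sample data.
rows = ([("006415", i) for i in range(13)]
        + [("007991", i) for i in range(4)]
        + [("011512", i) for i in range(22)])
picked = sample_by_random_sort(rows, key=lambda r: r[0], k=10, rng=random.Random(1))
print(Counter(officer for officer, _ in picked))
```

Because the cut is made by position after sorting, groups with fewer than k rows are simply kept whole.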



Gene Maguin

Re: Randomly Select a Specific Number of Cases by Group

Richard Ristow
In reply to this post by ariel barak
At 11:33 AM 10/31/2008, Ariel Barak wrote:

>I have a list of open cases for probation officers and I want to
>randomly pick 10 of their cases for review. For the probation
>officers that have less than 10 cases open, I would like to select
>all of their cases. I think I have modified the syntax [from
>previous postings(*)] correctly, however, the correct number of
>cases is not always returned regardless of whether the probation
>officer has more or less than 10 cases.

Gene's certainly got a point, in recommending random-sort logic; I
probably fall in love too much with "k/n". Random sorting can be less
efficient, but the difference will be marginal in these days of huge
memory and adaptive sorting algorithms.

Nonetheless, the code you sent runs for me after I replaced DATASET COPY
with ADD FILES. (Sometimes, DATASET COPY interacts poorly with elaborate
commands like MATCH FILES or AGGREGATE in the new file.) Is there any
chance you're hitting that problem? Note, though, that in the run below
officer 011512 still ends up with 9 cases rather than 10.

Below is a voluminous listing, with trace messages, followed by the code.

Officer_ID
|-----|------|---------|-------|-------------|---------------|
|     |      |Frequency|Percent|Valid Percent|Cumulative     |
|     |      |         |       |             |Percent        |
|-----|------|---------|-------|-------------|---------------|
|Valid|006415|13       |33.3   |33.3         |33.3           |
|     |------|---------|-------|-------------|---------------|
|     |007991|4        |10.3   |10.3         |43.6           |
|     |------|---------|-------|-------------|---------------|
|     |011512|22       |56.4   |56.4         |100.0          |
|     |------|---------|-------|-------------|---------------|
|     |Total |39       |100.0  |100.0        |               |
|-----|------|---------|-------|-------------|---------------|


DATASET NAME OriginalData.


*...  Replace the following:
*...  DATASET COPY ListofSampleData.
******************************************************.
*...  DATASET ACTIVATE ListofSampleData.
*...  by
ADD FILES / FILE=OriginalData.
DATASET NAME     ListofSampleData WINDOW=FRONT.


SORT CASES BY Officer_ID /* if necessary */.
*  Set random-number generator parameters, if desired   .
SET RNG = MT       /* 'Mersenne twister' random-no. generator */ .
SET MTINDEX = 7778 /* or other starting value - anything      */ .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
     /BREAK=Officer_ID
     /NRecords 'Number of open cases for Officer'=NU.

NUMERIC   #K #N    (F3)
           #Take_It (F2).

.  /**/  NUMERIC  #CaseCount (F3).

DO IF   $CASENUM EQ 1
       OR Officer_ID       NE LAG(Officer_ID).
.  /**/ COMPUTE #CaseCount = 0.
.  /**/ PRINT / 'Officer ' Officer_ID ': ' NRecords ' records.'/**/.

.  COMPUTE #N = NRecords  /* Total open records,    per Officer */.
.  COMPUTE #K = MIN(NRecords, 10) /* Set sample size     */.
END IF.

COMPUTE #Take_It = RV.BERNOULLI(#K/#N).

.  /**/ COMPUTE #CaseCount = #CaseCount + 1.
.  /**/ PRINT /                                               /**/
    /**/  #CaseCount ', patient ' Patient_Number ' '           /**/
    /**/ 'N:' #N '  K:'  #K '  Select:' #Take_it               /**/.

COMPUTE #K = #K - #Take_It.
COMPUTE #N = #N - 1.
SELECT IF #Take_It.

FREQ Officer_ID.
Officer 006415  :      13  records.
   1 , patient 041949     N: 13   K: 10   Select: 1
   2 , patient 045284     N: 12   K:  9   Select: 1
   3 , patient 046107     N: 11   K:  8   Select: 1
   4 , patient 047019     N: 10   K:  7   Select: 1
   5 , patient 048501     N:  9   K:  6   Select: 0
   6 , patient 049087     N:  8   K:  6   Select: 1
   7 , patient 052716     N:  7   K:  5   Select: 0
   8 , patient 056991     N:  6   K:  5   Select: 1
   9 , patient 057073     N:  5   K:  4   Select: 0
  10 , patient 060727     N:  4   K:  4   Select: 1
  11 , patient 061118     N:  3   K:  3   Select: 1
  12 , patient 061120     N:  2   K:  2   Select: 1
  13 , patient 061207     N:  1   K:  1   Select: 1
Officer 007991  :       4  records.
   1 , patient 064713     N:  4   K:  4   Select: 1
   2 , patient 051234     N:  3   K:  3   Select: 1
   3 , patient 061749     N:  2   K:  2   Select: 1
   4 , patient 048163     N:  1   K:  1   Select: 1
Officer 011512  :      22  records.
   1 , patient 044949     N: 22   K: 10   Select: 0
Officer 011512  :      22  records.
   1 , patient 045274     N: 22   K: 10   Select: 1
   2 , patient 048107     N: 21   K:  9   Select: 0
   3 , patient 042019     N: 20   K:  9   Select: 1
   4 , patient 048401     N: 19   K:  8   Select: 1
   5 , patient 049187     N: 18   K:  7   Select: 0
   6 , patient 058716     N: 17   K:  7   Select: 0
   7 , patient 096991     N: 16   K:  7   Select: 0
   8 , patient 037073     N: 15   K:  7   Select: 0
   9 , patient 063627     N: 14   K:  7   Select: 1
  10 , patient 068318     N: 13   K:  6   Select: 1
  11 , patient 061310     N: 12   K:  5   Select: 1
  12 , patient 066207     N: 11   K:  4   Select: 1
  13 , patient 048451     N: 10   K:  3   Select: 1
  14 , patient 044187     N:  9   K:  2   Select: 1
  15 , patient 020716     N:  8   K:  1   Select: 0
  16 , patient 076981     N:  7   K:  1   Select: 0
  17 , patient 017073     N:  6   K:  1   Select: 0
  18 , patient 052627     N:  5   K:  1   Select: 0
  19 , patient 061318     N:  4   K:  1   Select: 0
  20 , patient 031380     N:  3   K:  1   Select: 0
  21 , patient 026237     N:  2   K:  1   Select: 0


Frequencies
|-----------------------------|---------------------------|
|Output Created               |03-NOV-2008 11:44:06       |
|-----------------------------|---------------------------|
[OriginalData]
Statistics [suppressed]

Officer_ID
|-----|------|---------|-------|-------------|---------------|
|     |      |Frequency|Percent|Valid Percent|Cumulative     |
|     |      |         |       |             |Percent        |
|-----|------|---------|-------|-------------|---------------|
|Valid|006415|10       |43.5   |43.5         |43.5           |
|     |------|---------|-------|-------------|---------------|
|     |007991|4        |17.4   |17.4         |60.9           |
|     |------|---------|-------|-------------|---------------|
|     |011512|9        |39.1   |39.1         |100.0          |
|     |------|---------|-------|-------------|---------------|
|     |Total |23       |100.0  |100.0        |               |
|-----|------|---------|-------|-------------|---------------|


LIST.

List
|-----------------------------|---------------------------|
|Output Created               |03-NOV-2008 11:44:07       |
|-----------------------------|---------------------------|
[OriginalData]

Patient_Number Officer_ID Program NRecords

041949         006415     PROB          13
045284         006415     PROB          13
046107         006415     PROB          13
047019         006415     PROB          13
049087         006415     PROB          13
056991         006415     PROB          13
060727         006415     PROB          13
061118         006415     PROB          13
061120         006415     PROB          13
061207         006415     PROB          13
064713         007991     PROB           4
051234         007991     PROB           4
061749         007991     PROB           4
048163         007991     PROB           4
045274         011512     PROB          22
042019         011512     PROB          22
048401         011512     PROB          22
063627         011512     PROB          22
068318         011512     PROB          22
061310         011512     PROB          22
066207         011512     PROB          22
048451         011512     PROB          22
044187         011512     PROB          22

Number of cases read:  23    Number of cases listed:  23
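
A note on the trace above: the header for officer 011512 is printed twice, right after that officer's first case was not selected, and the frequency table shows 9 cases for that officer instead of 10. One possible reading, offered here as an assumption rather than something stated in the thread, is that LAG(Officer_ID) picks up the last case that survived SELECT IF rather than the last case read, so #N and #K are reset a second time. The Python sketch below only simulates that hypothesis; the function name k_of_n_with_lag, the flag lag_skips_dropped and the hand-picked draws are all invented for the illustration.

```python
from collections import Counter

def k_of_n_with_lag(cases, k, draws, lag_skips_dropped):
    """k/n selection where the group reset depends on a simulated LAG value.

    cases: (officer, patient) pairs sorted by officer.
    draws: one number in [0, 1) per case, standing in for the random draw.
    lag_skips_dropped: if True, the simulated LAG only sees retained cases
    (the hypothesized interaction); if False, it sees every case read.
    """
    totals = Counter(officer for officer, _ in cases)
    kept = []
    lag_officer = None
    need = remaining = 0
    for (officer, patient), u in zip(cases, draws):
        if officer != lag_officer:          # treated as first case of an officer
            remaining = totals[officer]
            need = min(k, remaining)
        take = u < need / remaining
        if take:
            kept.append((officer, patient))
            need -= 1
        remaining -= 1
        if take or not lag_skips_dropped:
            lag_officer = officer
    return Counter(officer for officer, _ in kept)

# Officer A has 4 cases, officer B has 22; B's first case is rejected.
cases = [("A", i) for i in range(4)] + [("B", i) for i in range(22)]
draws = [0.5] * 4 + [0.99, 0.0] + [0.99] * 20

print(sorted(k_of_n_with_lag(cases, 10, draws, False).items()))  # B gets 10
print(sorted(k_of_n_with_lag(cases, 10, draws, True).items()))   # B gets 9
```

If this reading is right, the shortfall appears only when the first case of a group happens to be rejected, which would explain why the count is sometimes 10 and sometimes 8 or 9. Keeping the selection flag in a permanent variable and running SELECT IF in a later pass would be one way to avoid the interaction.
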
=============================
APPENDIX: Test data, and code
=============================
*  C:\Documents and Settings\Richard\My Documents                    .
*    \Technical\spssx-l\Z-2008d                                      .
*    \2008-10-31 Barak - Randomly Select a Specific Number of Cases by Group.SPS.

*  In response to posting                                            .
*  Date:    Fri, 31 Oct 2008 10:33:46 -0500                          .
*  From:    Ariel Barak <[hidden email]>                     .
*  Subject: Randomly Select a Specific Number of Cases by Group      .
*  To:      [hidden email]                                 .

*  ................................................................. .
*  "I think I have modified the syntax correctly, however, the       .
*  correct number of cases is not always returned regardless of      .
*  whether the probation officer has more or less than 10 cases.     .
*  For example, there is a probation officer that has 35 cases and   .
*  after I run the syntax below, sometimes I have 10 cases           .
*  (correct) and other times I have 8 or 9 etc. (incorrect)."        .

*  The syntax he's modifying is from my postings                     .
*  Date:     Tue, 11 Mar 2008 12:06:14 -0400                         .
*  From:     Richard Ristow <[hidden email]>                         .
*  Subject:  Re: Random Cuts                                         .
*      with correction                                               .
*  Date:     Tue, 11 Mar 2008 14:10:03 -0400                         .
*  From:     Richard Ristow <[hidden email]>                         .
*  Subject:  Re: Random Cuts                                         .
*  ................................................................. .

*  ................................................................. .
*  ...............   Data and code, as posted   .................... .


*Sample Data.
DATA LIST LIST /Patient_Number (A9) Officer_ID (A7) Program (A7).
BEGIN DATA
041949 006415 PROB
045284 006415 PROB
046107 006415 PROB
047019 006415 PROB
048501 006415 PROB
049087 006415 PROB
052716 006415 PROB
056991 006415 PROB
057073 006415 PROB
060727 006415 PROB
061118 006415 PROB
061120 006415 PROB
061207 006415 PROB
064713 007991 PROB
051234 007991 PROB
061749 007991 PROB
048163 007991 PROB
044949 011512 PROB
045274 011512 PROB
048107 011512 PROB
042019 011512 PROB
048401 011512 PROB
049187 011512 PROB
058716 011512 PROB
096991 011512 PROB
037073 011512 PROB
063627 011512 PROB
068318 011512 PROB
061310 011512 PROB
066207 011512 PROB
048451 011512 PROB
044187 011512 PROB
020716 011512 PROB
076981 011512 PROB
017073 011512 PROB
052627 011512 PROB
061318 011512 PROB
031380 011512 PROB
026237 011512 PROB
END DATA.

FREQ Officer_ID.

DATASET NAME OriginalData.
*...  Replace the following:
*...  DATASET COPY ListofSampleData.
******************************************************.
*...  DATASET ACTIVATE ListofSampleData.
*...  by
ADD FILES / FILE=OriginalData.
DATASET NAME     ListofSampleData WINDOW=FRONT.


SORT CASES BY Officer_ID /* if necessary */.

*  Set random-number generator parameters, if desired   .
SET RNG = MT       /* 'Mersenne twister' random-no. generator */ .
SET MTINDEX = 7778 /* or other starting value - anything      */ .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
     /BREAK=Officer_ID
     /NRecords 'Number of open cases for Officer'=NU.

NUMERIC   #K #N    (F3)
           #Take_It (F2).

.  /**/  NUMERIC  #CaseCount (F3).

DO IF   $CASENUM EQ 1
       OR Officer_ID       NE LAG(Officer_ID).
.  /**/ COMPUTE #CaseCount = 0.
.  /**/ PRINT / 'Officer ' Officer_ID ': ' NRecords ' records.'/**/.

.  COMPUTE #N = NRecords  /* Total open records,    per Officer */.
.  COMPUTE #K = MIN(NRecords, 10) /* Set sample size     */.
END IF.

COMPUTE #Take_It = RV.BERNOULLI(#K/#N).

.  /**/ COMPUTE #CaseCount = #CaseCount + 1.
.  /**/ PRINT /                                               /**/
    /**/  #CaseCount ', patient ' Patient_Number ' '           /**/
    /**/ 'N:' #N '  K:'  #K '  Select:' #Take_it               /**/.

COMPUTE #K = #K - #Take_It.
COMPUTE #N = #N - 1.
SELECT IF #Take_It.

FREQ Officer_ID.

LIST.
============================
(*) Date: Tue, 11 Mar 2008 12:06:14 -0400
From:     Richard Ristow <[hidden email]>
Subject:  Re: Random Cuts
     with correction posted
From:     Richard Ristow <[hidden email]>
Subject:  Re: Random Cuts
