Compute variable if only one case listed

Gunnar
Hello everybody. I am trying to calculate a new variable with SPSS 25.

My data set contains 2 variables. The first is called "ID" and contains a
nominal identifier of the subject. The second is called "var1" and contains
ordinal data attributed to the ID. Of note, there may be one or more cases
for each ID, i.e. the variable ID may list a given identifier once or more
than once.

My aim is to calculate varnew so that its values are the same as in var1,
except when the value in var1 is 1 AND the respective ID is listed only once.
In that case, varnew should take the value 101 instead.

I thought it might be helpful to see an illustration. Please find attached.
Please note that varnew was not calculated, but edited just to illustrate
the aim. The original dataset is considerably bigger (~5000 cases) and IDs
are more complex.

Thanks to everybody who tries to find an answer.
<http://spssx-discussion.1045642.n5.nabble.com/file/t341778/Image_15.jpg>



--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Compute variable if only one case listed

Rich Ulrich
What you illustrate does not seem to be useful in any obvious way,
but it is not hard to do, with flexibility for variation.

Use AGGREGATE to add onto each record the count for ID as n_ID

Then, for your exact example,
IF  (n_ID eq 1) and (var1 eq 1)   newvar= 101.

If you want similar codes for other values of var1, you can compute
newvar = 100*n_ID + var1      instead of setting newvar to 101, 102, 103.
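
The steps above could be sketched as follows. This is only a sketch under assumptions: the variables are named ID and var1 as in the original post, and the formula 100*n_ID + var1 is one reading of the generalization (it yields 101, 102, 103 for IDs listed once), not something confirmed elsewhere in the thread.

```spss
* Add the number of cases per ID to every record as n_ID.
* NU is the unweighted number of cases in each break group.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=ID
  /n_ID=NU.

* Start from var1, then recode only the IDs that are listed once.
* Assumed reading of the generalization: 1 -> 101, 2 -> 102, 3 -> 103.
COMPUTE newvar = var1.
IF (n_ID EQ 1) newvar = 100*n_ID + var1.
EXECUTE.
```

Note that this recodes every ID that is listed once, whatever its var1 value. The exact request in the original post needs the extra condition (var1 eq 1), as in the IF statement above.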

--
Rich Ulrich



Re: Compute variable if only one case listed

Gunnar
Thank you very much for this information, Rich. I was not aware of this
solution.

FYI: For me it is useful, because the dataset I have to analyze is based on
a bug, where the user was able to select a specification without selecting
the main category and vice versa. Recoding the variable allows me to
identify each case where solely the main category was selected, enabling
further analysis. Everything else is nothing but plus and minus. So once
again, thank you!

Gunnar




Re: Compute variable if only one case listed

Bruce Weaver
Administrator
In reply to this post by Rich Ulrich
I was struggling to understand what the OP was trying to do, so I implemented
Rich's suggestion as follows:

NEW FILE.
DATASET CLOSE ALL.
DATA LIST LIST / ID (A1) var1(F1) varnew(F3.0).
BEGIN DATA
"A" 1 1
"A" 2 2
"B" 1 101
"C" 3 3
"C" 2 2
"D" 2 2
END DATA.

*  Use AGGREGATE to add onto each record the count for ID as n_ID.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK=ID
  /n_ID=NU.

* Then, for your exact example:  .
IF  (n_ID eq 1) and (var1 eq 1)   RUnewvar= 101.
FORMATS RUnewvar(F3.0).
LIST.

Output from LIST:

ID var1 varnew    n_ID RUnewvar
 
A    1      1        2      .
A    2      2        2      .
B    1    101        1    101
C    3      3        2      .
C    2      2        2      .
D    2      2        1      .


Re-reading the original post, the OP wanted RUnewvar to be equal to var1
where it is currently missing.  So if I change that IF to a DO-IF as
follows, I think I get the desired result:

* To get the desired result, replace that IF with this DO-IF.
DO IF (n_ID eq 1) and (var1 eq 1).
  COMPUTE RUnewvar = 101.
ELSE.
  COMPUTE RUnewvar = var1.
END IF.
LIST.

OUTPUT:

ID var1 varnew    n_ID RUnewvar
 
A    1      1        2      1
A    2      2        2      2
B    1    101        1    101
C    3      3        2      3
C    2      2        2      2
D    2      2        1      2



PS to the OP:  IMO, using DATA LIST to provide a small dataset (rather than
a screen capture image) makes it more likely that other members will have a
go at trying to help you solve your problem.  



--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Compute variable if only one case listed

Art Kendall
Cannibalizing Bruce's syntax.
NEW FILE.
DATASET CLOSE ALL.
DATA LIST LIST / ID (A1) var1(F1) Want (f1).
BEGIN DATA
"A" 1 1
"A" 2 2
"B" 1 0
"C" 3 1
"C" 2 2
"D" 2 0
END DATA.
* ===============  Pasted syntax from a few clicks .
* Identify Duplicate Cases.
SORT CASES BY ID(A).
MATCH FILES
  /FILE=*
  /BY ID
  /FIRST=PrimaryFirst
  /LAST=PrimaryLast.
DO IF (PrimaryFirst).
COMPUTE  MatchSequence=1-PrimaryLast.
ELSE.
COMPUTE  MatchSequence=MatchSequence+1.
END IF.
LEAVE  MatchSequence.
FORMATS  MatchSequence (f7).
MATCH FILES
  /FILE=*
  /DROP=PrimaryFirst PrimaryLast.
VARIABLE LABELS  MatchSequence 'Sequential count of matching cases'.
VARIABLE LEVEL  MatchSequence (SCALE).
FREQUENCIES VARIABLES=MatchSequence.
* ==================== end Pasted syntax.
VALUE LABELS MatchSequence
0    'Single occurrence of ID'.
LIST.




-----
Art Kendall
Social Research Consultants

Re: Compute variable if only one case listed

Rich Ulrich
In reply to this post by Bruce Weaver
Here is just a simple note on code practices. 
I don't use DO-IF when I can avoid it by an initialization to the ELSE value.

Consider Bruce's code -
* To get the desired result, replace that IF with this DO-IF.
DO IF (n_ID eq 1) and (var1 eq 1).
  COMPUTE RUnewvar = 101.
ELSE.
   COMPUTE RUnewvar = var1.
END IF.

I find it more readable in two lines -
COMPUTE RUnewvar = var1.
IF  (n_ID eq 1) and (var1 eq 1)  RUnewvar = 101.

Also: Dave Marso showed us how to initialize and get a result in one line,
which is easy to read once you get used to seeing it that way -

COMPUTE RUnewvar = var1 + ( (n_ID eq 1) and (var1 eq 1) )*100.
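
The one-liner works because a logical expression in SPSS evaluates to 1 when true and 0 when false, so the bracketed term adds 100 only where both conditions hold. A minimal sketch for checking that the two-line and one-line forms agree, assuming n_ID and var1 already exist (for example after the AGGREGATE step); v_twoline, v_oneline and differs are names made up for this illustration.

```spss
* Form 1: initialize to var1, then overwrite where both conditions hold.
COMPUTE v_twoline = var1.
IF (n_ID EQ 1) AND (var1 EQ 1) v_twoline = 101.

* Form 2: the logical expression contributes 1 (true) or 0 (false).
COMPUTE v_oneline = var1 + ((n_ID EQ 1) AND (var1 EQ 1))*100.

* differs is 0 where the forms agree and 1 where they disagree.
* It is system-missing where either result is missing.
COMPUTE differs = (v_twoline NE v_oneline).
FREQUENCIES VARIABLES=differs.
```

If the frequency table shows anything other than 0, the two forms disagree for those cases and deserve a closer look.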

--
Rich Ulrich

Re: Compute variable if only one case listed

Kirill Orlov
But note, Rich, that on the other hand the first variant takes 4 operations (3 boolean and 1 arithmetic), while the other two take 5 operations (3 boolean and 2 arithmetic).
I would expect that on millions of cases (or millions of loops) the first block would be just slightly faster.



Re: Compute variable if only one case listed

Art Kendall
The returned value of zero from <Identify Duplicate Cases> is not exactly the
literal value 101 the OP asked for.

Identifying case IDs that are NOT duplicated is related to finding case IDs
that ARE duplicated.
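
Getting from that zero to the value the OP asked for takes only a couple of extra statements. A minimal sketch, assuming MatchSequence was created by the pasted <Identify Duplicate Cases> syntax (0 = single occurrence of ID) and that var1 is the OP's variable; varnew is the name used in the original post.

```spss
* Start from var1, then overwrite where the ID occurs once and var1 is 1.
COMPUTE varnew = var1.
IF (MatchSequence EQ 0) AND (var1 EQ 1) varnew = 101.
FORMATS varnew (F3.0).
LIST.
```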

I do not believe that a GUI-only approach is productive in the long run. I
do believe that knowing what generalized operations are available in
drafting syntax is productive.

Ideally, the people implementing <Identify Duplicate Cases> try to maximize
both user-interface quality and machine efficiency.  

How many millions of cases would it take before execution time differences
between approaches would be measured in minutes?

P.S.  It seems that many questions on this list come from people who are
"newbies".







-----
Art Kendall
Social Research Consultants

Re: Compute variable if only one case listed

Jon Peck
In reply to this post by Kirill Orlov
When examining the performance of different computational formulas, there are a lot of factors other than the actual evaluation time for the arithmetic operations to consider even ignoring parsing time and other setup of the formulas.  Just the writing of the result to the active file probably swamps the computation time in most cases.  In addition, formulas might be rewritten into a computationally more efficient form by the interpreter sometimes, although there is limited scope for this in Statistics, since numbers are all double precision floating point values and rewriting could change the result a bit.

I did a little very casual benchmarking to see if there is a detectable difference. I generated a million case file of random numbers and compared execution times, excluding setup, of a formula with one multiplication to one with twenty.  I only did a dozen replications, but here is the result.
Mean for one:  1.3354  Mean for twenty:  1.5212.  Times are in seconds, so for a million cases the mean difference is about 0.2 seconds.  But standard deviations are .592 and .542, and the t test sig level is .43.

So, at least on this simple test, the variation from run to run swamps the difference between the formulas.  Obviously this could be explored much further, but the simple message is that clarity in the code will likely swamp any difference in run times.  I do prefer the clarity of the one-line version of the code to the other forms.
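
A comparison along these lines could be set up roughly as follows. This is only a sketch under assumptions: the variable names x, y1 and y20 and the particular formulas are invented for illustration, since the post does not show the formulas that were used. It also does not say how the times were measured, so timing is left out here.

```spss
* Generate one million cases of uniform random numbers.
INPUT PROGRAM.
LOOP #i = 1 TO 1000000.
  COMPUTE x = RV.UNIFORM(0, 1).
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.

* Formula with one multiplication.
COMPUTE y1 = x*x.
EXECUTE.

* Formula with twenty multiplications (ten per line).
COMPUTE y20 = x*x*x*x*x*x*x*x*x*x*x
               *x*x*x*x*x*x*x*x*x*x.
EXECUTE.
```

Each EXECUTE forces a separate data pass, so the two formulas can be timed independently with whatever timing mechanism is at hand.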








--
Jon K Peck
[hidden email]


Re: Compute variable if only one case listed

Art Kendall
Indenting, comments, variable labels, and value labels enhance clarity.
Variable names such as "YearlyIncome" rather than "YI", "NameBlank" rather
than "test1", "PointOfSale" rather than "PS" should add only trivial time
to execution.

These things are also a major help in redrafting syntax so that it comes
closer to what you intend.

YMMV but I try to write as if I were explaining to a beginner. That sometimes
means using more lines for a statement, and sometimes more statements.
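
Applied to the recode discussed in this thread, that style might look like the sketch below. It assumes n_ID and var1 already exist; the label texts are illustrative wording only and should be replaced with whatever describes the data.

```spss
* Keep var1 as it is, but flag the value 1 when the ID occurs only once.
COMPUTE varnew = var1.
IF (n_ID EQ 1) AND (var1 EQ 1) varnew = 101.
FORMATS varnew (F3.0).

* Document what the new variables mean.
VARIABLE LABELS n_ID 'Number of cases listed for this ID'.
VARIABLE LABELS varnew 'var1, recoded to 101 where ID is listed once and var1 = 1'.
VALUE LABELS varnew
  101 'ID listed once and var1 = 1'.
```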




-----
Art Kendall
Social Research Consultants

Re: Compute variable if only one case listed

Kirill Orlov
In reply to this post by Jon Peck
This is very instructive, Jon. Thank you.
I've tested the three equivalent code pieces on 1M cases * 500 loops on each case, and saw no consistent speed difference.



Re: Compute variable if only one case listed

Nkem Ntonghanwah
In reply to this post by Art Kendall
I wonder whether we should take the comment "P.S.  It seems that many questions on this list come from people who are 'newbies'" more seriously when answering some of these questions: determine what the actual objective is, and tailor the response to that objective.
E.g., why is this new variable being created, and what will be done with it?

For those who provided the solution, kindly explain why, when ID = A and var1 = 1, newvar = 101,
whereas when ID = C and var1 = 2, newvar = 2,
and when ID = D and var1 = 2, newvar = 2.

Thanks
Forcheh

