check adjacent rows with duplicate values

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

check adjacent rows with duplicate values

Maguin, Eugene
I'm checking for duplicated values between pairs of adjacent records and the following code doesn't seem to be working. The 'real' problem is that value(lag(x)) doesn't work when variables have user missing values because lag(x) returns sysmis when x is missing. It seems like lag ought to work differently but that's beside the point. One way around the problem is to take off user missing values. But, I'd like to keep them defined. I could well be missing or not understanding something but if I'm not, is there a syntax method for this problem that doesn't involve stripping off user missing values.


DO IF (EPISODEID NE LAG(EPISODEID)).
+     COMPUTE DUPROW=-1.
ELSE.
+     COMPUTE DUPROW=0.
+     DO REPEAT X=DischargeLevelofCare TO SpecEd8 DxA DxB1 TO DxB4 PsychMeds.
+        COMPUTE #V=LAG(X).
+        IF (VALUE(X) EQ VALUE(#V)) DUPROW=DUPROW+1.
+     END REPEAT.
END IF.
FREQUENCIES DUPROW.

Thanks, Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: check adjacent rows with duplicate values

David Marso
Administrator
Gene,
Please mock up and post a simple data set definition to illustrate the problem.
Are you attempting to identify adjacent records which are complete duplicates?
IF so:
MATCH FILES / FILE * / KEEP <VarlistForCaseComparisonDeletingAllOthers>.
SORT CASES BY ALL.
MATCH FILES /FILE * / BY ALL/ FIRST=@TOP@/LAST=@BOT@.
David

Maguin, Eugene wrote
I'm checking for duplicated values between pairs of adjacent records and the following code doesn't seem to be working. The 'real' problem is that value(lag(x)) doesn't work when variables have user missing values because lag(x) returns sysmis when x is missing. It seems like lag ought to work differently but that's beside the point. One way around the problem is to take off user missing values. But, I'd like to keep them defined. I could well be missing or not understanding something but if I'm not, is there a syntax method for this problem that doesn't involve stripping off user missing values.


DO IF (EPISODEID NE LAG(EPISODEID)).
+     COMPUTE DUPROW=-1.
ELSE.
+     COMPUTE DUPROW=0.
+     DO REPEAT X=DischargeLevelofCare TO SpecEd8 DxA DxB1 TO DxB4 PsychMeds.
+        COMPUTE #V=LAG(X).
+        IF (VALUE(X) EQ VALUE(#V)) DUPROW=DUPROW+1.
+     END REPEAT.
END IF.
FREQUENCIES DUPROW.

Thanks, Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: check adjacent rows with duplicate values

Maguin, Eugene
Ok.

Id x1 x2 x3 x4 x5 x6   Result
10  5   8   0   9   1    0              -1
11  1   3   9   4   8    2              -1
11  2   3   9   4   8    5               4
12  3   8   8   9   8    0              -1
13  2   5   8   9   9    9              -1
13  2   5   8   9   9    9               6

Missing values x1 to x6(8 9).

Re: your code. Depending on how user missing are handled in the match files that seems like it would work. Is, 'all' a valid value for the BY subcommand or do you mean all to reference the case comparison varlist? (I think no but it's always seemed like there are undocumented things in spss.)

Gene Maguin



-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Marso
Sent: Thursday, May 03, 2012 11:55 AM
To: [hidden email]
Subject: Re: check adjacent rows with duplicate values

Gene,
Please mock up and post a simple data set definition to illustrate the
problem.
Are you attempting to identify adjacent records which are complete
duplicates?
IF so:
MATCH FILES / FILE * / KEEP <VarlistForCaseComparisonDeletingAllOthers>.
SORT CASES BY ALL.
MATCH FILES /FILE * / BY ALL/ FIRST=@TOP@/LAST=@BOT@.
David


Maguin, Eugene wrote

>
> I'm checking for duplicated values between pairs of adjacent records and
> the following code doesn't seem to be working. The 'real' problem is that
> value(lag(x)) doesn't work when variables have user missing values because
> lag(x) returns sysmis when x is missing. It seems like lag ought to work
> differently but that's beside the point. One way around the problem is to
> take off user missing values. But, I'd like to keep them defined. I could
> well be missing or not understanding something but if I'm not, is there a
> syntax method for this problem that doesn't involve stripping off user
> missing values.
>
>
> DO IF (EPISODEID NE LAG(EPISODEID)).
> +     COMPUTE DUPROW=-1.
> ELSE.
> +     COMPUTE DUPROW=0.
> +     DO REPEAT X=DischargeLevelofCare TO SpecEd8 DxA DxB1 TO DxB4
> PsychMeds.
> +        COMPUTE #V=LAG(X).
> +        IF (VALUE(X) EQ VALUE(#V)) DUPROW=DUPROW+1.
> +     END REPEAT.
> END IF.
> FREQUENCIES DUPROW.
>
> Thanks, Gene Maguin
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> LISTSERV@.UGA (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>


--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/check-adjacent-rows-with-duplicate-values-tp5683555p5683588.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: check adjacent rows with duplicate values

David Marso
Administrator
Gene,
Simplest construction I could devise as follows:

--
DATA LIST FREE /Id x1 x2 x3 x4 x5 x6   Result.
BEGIN DATA
10  5   8   0   9   1    0              -1
11  1   3   9   4   8    2              -1
11  2   3   9   4   8    5               4
12  3   8   8   9   8    0              -1
13  2   5   8   9   9    9              -1
13  2   5   8   9   9    9               6
END DATA.
MISSING VALUES  x1 TO x6(8 9).

COMPUTE seq=$CASENUM.
SAVE OUTFILE 'tmp'.
MISSING VALUES X1 TO x6 ().
DO REPEAT v=x1 TO x6.
+  COMPUTE res=sum(res,(ID EQ LAG(ID))*(v EQ LAG(v))).
END REPEAT.
UPDATE / FILE * / FILE 'tmp' / BY SEQ.
RECODE res (0,SYSMIS=-1).
EXE.

I know, it is *FUGLY* but it gets the job done.
I had thought of TEMP but then LAG won't work (same with CREATE under TEMP).
I had my initial doubts about UPDATE wrt User  Missing but in fact it works.
FYI: ALL is valid on both SORT and MATCH (conditional on not exceeding the number of keys ).
OTOH, your desire was to enumerate the number of data matches so the code would only end up flagging on mismatches but not give you the required result.
*** it's always seemed like there are undocumented things in spss ***.
Yeah, users call them bugs.. When I used to work in tech support we called them 'features'.
How's this for undocumented
(one of my own special requests for implementation about 15 years ago):

PRESERVE.
SET ERRORS OFF RESULTS OFF.
*-----------*.
CORR VAR <huge_list_of_variables_sufficient_to_build_MONSTER_unwanted_pivot_table>
  / MATRIX OUT (*).
RESTORE.
MATRIX.
   party hardy on huge correlation matrix.....
END MATRIX.
---


Maguin, Eugene wrote
Ok.

Id x1 x2 x3 x4 x5 x6   Result
10  5   8   0   9   1    0              -1
11  1   3   9   4   8    2              -1
11  2   3   9   4   8    5               4
12  3   8   8   9   8    0              -1
13  2   5   8   9   9    9              -1
13  2   5   8   9   9    9               6

Missing values x1 to x6(8 9).

Re: your code. Depending on how user missing are handled in the match files that seems like it would work. Is, 'all' a valid value for the BY subcommand or do you mean all to reference the case comparison varlist? (I think no but it's always seemed like there are undocumented things in spss.)

Gene Maguin



-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Marso
Sent: Thursday, May 03, 2012 11:55 AM
To: [hidden email]
Subject: Re: check adjacent rows with duplicate values

Gene,
Please mock up and post a simple data set definition to illustrate the
problem.
Are you attempting to identify adjacent records which are complete
duplicates?
IF so:
MATCH FILES / FILE * / KEEP <VarlistForCaseComparisonDeletingAllOthers>.
SORT CASES BY ALL.
MATCH FILES /FILE * / BY ALL/ FIRST=@TOP@/LAST=@BOT@.
David


Maguin, Eugene wrote
>
> I'm checking for duplicated values between pairs of adjacent records and
> the following code doesn't seem to be working. The 'real' problem is that
> value(lag(x)) doesn't work when variables have user missing values because
> lag(x) returns sysmis when x is missing. It seems like lag ought to work
> differently but that's beside the point. One way around the problem is to
> take off user missing values. But, I'd like to keep them defined. I could
> well be missing or not understanding something but if I'm not, is there a
> syntax method for this problem that doesn't involve stripping off user
> missing values.
>
>
> DO IF (EPISODEID NE LAG(EPISODEID)).
> +     COMPUTE DUPROW=-1.
> ELSE.
> +     COMPUTE DUPROW=0.
> +     DO REPEAT X=DischargeLevelofCare TO SpecEd8 DxA DxB1 TO DxB4
> PsychMeds.
> +        COMPUTE #V=LAG(X).
> +        IF (VALUE(X) EQ VALUE(#V)) DUPROW=DUPROW+1.
> +     END REPEAT.
> END IF.
> FREQUENCIES DUPROW.
>
> Thanks, Gene Maguin
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> LISTSERV@.UGA (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>


--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/check-adjacent-rows-with-duplicate-values-tp5683555p5683588.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: check adjacent rows with duplicate values

Jon K Peck
Using SHIFT VALUES would get around the missing value problem.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621




From:        David Marso <[hidden email]>
To:        [hidden email]
Date:        05/03/2012 01:58 PM
Subject:        Re: [SPSSX-L] check adjacent rows with duplicate values
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Gene,
Simplest construction I could devise as follows:

--
DATA LIST FREE /Id x1 x2 x3 x4 x5 x6   Result.
BEGIN DATA
10  5   8   0   9   1    0              -1
11  1   3   9   4   8    2              -1
11  2   3   9   4   8    5               4
12  3   8   8   9   8    0              -1
13  2   5   8   9   9    9              -1
13  2   5   8   9   9    9               6
END DATA.
MISSING VALUES  x1 TO x6(8 9).

COMPUTE seq=$CASENUM.
SAVE OUTFILE 'tmp'.
MISSING VALUES X1 TO x6 ().
DO REPEAT v=x1 TO x6.
+  COMPUTE res=sum(res,(ID EQ LAG(ID))*(v EQ LAG(v))).
END REPEAT.
UPDATE / FILE * / FILE 'tmp' / BY SEQ.
RECODE res (0,SYSMIS=-1).
EXE.

I know, it is *FUGLY* but it gets the job done.
I had thought of TEMP but then LAG won't work (same with CREATE under TEMP).
I had my initial doubts about UPDATE wrt User  Missing but in fact it works.
FYI: ALL is valid on both SORT and MATCH (conditional on not exceeding the
number of keys ).
OTOH, your desire was to enumerate the number of data matches so the code
would only end up flagging on mismatches but not give you the required
result.
*** it's always seemed like there are undocumented things in spss ***.
Yeah, users call them bugs.. When I used to work in tech support we called
them 'features'.
How's this for undocumented
(one of my own special requests for implementation about 15 years ago):

PRESERVE.
SET ERRORS OFF RESULTS OFF.
*-----------*.
CORR VAR
<huge_list_of_variables_sufficient_to_build_MONSTER_unwanted_pivot_table>
 / MATRIX OUT (*).
RESTORE.
MATRIX.
  party hardy on huge correlation matrix.....
END MATRIX.
---



Maguin, Eugene wrote
>
> Ok.
>
> Id x1 x2 x3 x4 x5 x6   Result
> 10  5   8   0   9   1    0              -1
> 11  1   3   9   4   8    2              -1
> 11  2   3   9   4   8    5               4
> 12  3   8   8   9   8    0              -1
> 13  2   5   8   9   9    9              -1
> 13  2   5   8   9   9    9               6
>
> Missing values x1 to x6(8 9).
>
> Re: your code. Depending on how user missing are handled in the match
> files that seems like it would work. Is, 'all' a valid value for the BY
> subcommand or do you mean all to reference the case comparison varlist? (I
> think no but it's always seemed like there are undocumented things in
> spss.)
>
> Gene Maguin
>
>
>
> -----Original Message-----
> From: SPSSX(r) Discussion [
[hidden email]] On Behalf Of David Marso
> Sent: Thursday, May 03, 2012 11:55 AM
> To: SPSSX-L@.UGA
> Subject: Re: check adjacent rows with duplicate values
>
> Gene,
> Please mock up and post a simple data set definition to illustrate the
> problem.
> Are you attempting to identify adjacent records which are complete
> duplicates?
> IF so:
> MATCH FILES / FILE * / KEEP <VarlistForCaseComparisonDeletingAllOthers>.
> SORT CASES BY ALL.
> MATCH FILES /FILE * / BY ALL/ FIRST=@TOP@/LAST=@BOT@.
> David
>
>
> Maguin, Eugene wrote
>>
>> I'm checking for duplicated values between pairs of adjacent records and
>> the following code doesn't seem to be working. The 'real' problem is that
>> value(lag(x)) doesn't work when variables have user missing values
>> because
>> lag(x) returns sysmis when x is missing. It seems like lag ought to work
>> differently but that's beside the point. One way around the problem is to
>> take off user missing values. But, I'd like to keep them defined. I could
>> well be missing or not understanding something but if I'm not, is there a
>> syntax method for this problem that doesn't involve stripping off user
>> missing values.
>>
>>
>> DO IF (EPISODEID NE LAG(EPISODEID)).
>> +     COMPUTE DUPROW=-1.
>> ELSE.
>> +     COMPUTE DUPROW=0.
>> +     DO REPEAT X=DischargeLevelofCare TO SpecEd8 DxA DxB1 TO DxB4
>> PsychMeds.
>> +        COMPUTE #V=LAG(X).
>> +        IF (VALUE(X) EQ VALUE(#V)) DUPROW=DUPROW+1.
>> +     END REPEAT.
>> END IF.
>> FREQUENCIES DUPROW.
>>
>> Thanks, Gene Maguin
>>
>> =====================
>> To manage your subscription to SPSSX-L, send a message to
>> LISTSERV@.UGA (not to SPSSX-L), with no body text except the
>> command. To leave the list, send the command
>> SIGNOFF SPSSX-L
>> For a list of commands to manage subscriptions, send the command
>> INFO REFCARD
>>
>
>
> --
> View this message in context:
>
http://spssx-discussion.1045642.n5.nabble.com/check-adjacent-rows-with-duplicate-values-tp5683555p5683588.html
> Sent from the SPSSX Discussion mailing list archive at Nabble.com.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> LISTSERV@.UGA (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> LISTSERV@.UGA (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>


--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/check-adjacent-rows-with-duplicate-values-tp5683555p5684168.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


Reply | Threaded
Open this post in threaded view
|

Re: check adjacent rows with duplicate values

Andy W
In reply to this post by Maguin, Eugene

Late reply I know, but if all you are concerned is losing the missing values meta-data, you may want to look at the APPLY DICTIONARY command. So assuming your code works without missing values defined, and you have a previously saved file with the applicable meta data, it may only take about two more lines of code added to modify your original syntax and end with missing values still defined.

MISSING VALUES ALL ().
*commands here.
APPLY DICTIONARY FROM = 'Original_File.sav'.

I've always thought the behavior with user defined missing values was somewhat counter-intuitive, and the current agency I'm employed at avoids defining missing values for use cases like yours above (although always includes them in value labels). But the apply dictionary command seems like an easy way to get rid of the missing values and then re-apply them later on if that is what you need.

Andy

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/
Reply | Threaded
Open this post in threaded view
|

Re: check adjacent rows with duplicate values

David Marso
Administrator
Yes!  Apply Dictionary ;-)!!!
----
Andy W wrote
<p>Late reply I know, but if all you are concerned is losing the missing values meta-data, you may want to look at the <code>APPLY DICTIONARY</code> command. So assuming your code works without missing values defined, and you have a previously saved file with the applicable meta data, it may only take about two more lines of code added to modify your original syntax and end with missing values still defined.</p>

<p><code>
MISSING VALUES ALL ().<br />
*commands here.<br />
APPLY DICTIONARY FROM = 'Original_File.sav'.
</code></p>

<p>I've always thought the behavior with user defined missing values was somewhat counter-intuitive, and the current agency I'm employed at avoids defining missing values for use cases like yours above (although always includes them in value labels). But the apply dictionary command seems like an easy way to get rid of the missing values and then re-apply them later on if that is what you need. </p>

<p>Andy</p>
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"