|
<[hidden email]>
I am having problems finding duplicates in a dataset, and I think it's something to do with the data being in Unicode. I am running the following code:

SORT CASES BY email_address (A) first_name (A) last_name (A).
MATCH FILES /FILE=*
   /BY email_address first_name last_name
   /FIRST=PrimaryFirst
   /LAST=PrimaryLast.
DO IF (PrimaryFirst).
COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMAT MatchSequence (f7).
SELECT IF PrimaryLast.
MATCH FILES /FILE=* /DROP=PrimaryFirst PrimaryLast.

But I keep getting an error message that the file is not sorted properly, and that I should use SORT CASES (which, obviously, I already am!)...

When I read the data in as ANSI rather than Unicode it runs fine, but I lose data because some of the characters are not read in properly. The point where it thinks the file is out of order is a record where last_name contains an 'é', so I am wondering whether it has something to do with the special characters in Unicode. Any ideas on what to do about this?

I have already discussed this with tech support, and the suggestion was to use the LTRIM and RTRIM functions to remove blanks from the sort variables, but this has not made any difference.

Thanks for any help!

Meraud Bawden

====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
|
I'm not familiar with this application of MATCH FILES (and would not expect it to run).

MATCH FILES /FILE=*
   /BY email_address first_name last_name
   /FIRST=PrimaryFirst
   /LAST=PrimaryLast.

A simple alternative for flagging duplicates is:

COMPUTE Dup = 0 .
IF ($Casenum>1 and email_address = LAG(email_address)) Dup = 1 .
SELECT IF (Dup=0) .

Dennis Deck
Senior Researcher
RMC Researcher

-----Original Message-----
From: Meraud Bawden [mailto:[hidden email]]
Sent: Tuesday, September 16, 2008 9:32 AM
Subject: sorting data in unicode
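
The LAG approach above compares only email_address, whereas the original code groups on three keys. A minimal sketch of the same idea extended to all three variables might look like the following. It assumes the variable names from the original post and reuses the Dup flag from the suggestion above; it has not been run against the Unicode data in question.

```spss
* Sketch only, assuming the three key variables from the original post.
SORT CASES BY email_address (A) first_name (A) last_name (A).
COMPUTE Dup = 0.
* A case is flagged when all three keys equal those of the previous case.
IF ($CASENUM > 1
    AND email_address = LAG(email_address)
    AND first_name = LAG(first_name)
    AND last_name = LAG(last_name)) Dup = 1.
SELECT IF (Dup = 0).
EXECUTE.
```

Because LAG only tests adjacent cases for equality, this does not go through the sort-order check that MATCH FILES applies to its BY variables, so it may sidestep the error, provided SORT CASES places identical keys next to each other. Note that it keeps the first case of each group, whereas SELECT IF PrimaryLast in the original code keeps the last.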
|
At 01:54 AM 9/17/2008, Dennis Deck wrote:
>I'm not familiar with this application of MATCH FILES (and would not
>expect it to run).
>
>MATCH FILES
>   /FILE=*
>   /BY email_address first_name last_name
>   /FIRST=PrimaryFirst
>   /LAST=PrimaryLast.

Actually, one-file MATCH FILES statements (or ADD FILES) work fine, and are handy for effects like this, as well as reordering and dropping variables, etc. Of course, the file must be pre-sorted by any BY variables.

(I've no light on the Unicode sorting problem, though.)

|-----------------------------|---------------------------|
|Output Created               |17-SEP-2008 16:34:57       |
|-----------------------------|---------------------------|
First    Last     Key

Alice    Becker   A
Alice    Becker   B
Alice    Becker   C
Charles  Darwin   D
Edward   Fergus   E
Edward   Fergus   F

Number of cases read:  6    Number of cases listed:  6

SORT CASES BY First Last Key.
MATCH FILES /FILE=*
   /BY first last
   /FIRST=PrimaryFirst
   /LAST=PrimaryLast.
DO IF (PrimaryFirst).
.  COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
.  COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMAT PrimaryFirst PrimaryLast MatchSequence (f3).
LIST.

List
|-----------------------------|---------------------------|
|Output Created               |17-SEP-2008 16:34:57       |
|-----------------------------|---------------------------|
First    Last     Key PrimaryFirst PrimaryLast MatchSequence

Alice    Becker   A         1           0             1
Alice    Becker   B         0           0             2
Alice    Becker   C         0           1             3
Charles  Darwin   D         1           1             0
Edward   Fergus   E         1           0             1
Edward   Fergus   F         0           1             2

Number of cases read:  6    Number of cases listed:  6

=============================
APPENDIX: Test data, and code
=============================
DATA LIST LIST /First Last Key (2A8,A1).
BEGIN DATA
Alice Becker A
Alice Becker B
Alice Becker C
Charles Darwin D
Edward Fergus E
Edward Fergus F
END DATA.
LIST.

SORT CASES BY First Last Key.
MATCH FILES /FILE=*
   /BY first last
   /FIRST=PrimaryFirst
   /LAST=PrimaryLast.
DO IF (PrimaryFirst).
.  COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
.  COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMAT PrimaryFirst PrimaryLast MatchSequence (f3).
LIST.
