SPSSX Discussion

sorting data in unicode


Meraud Bawden

sorting data in unicode


I am having problems finding duplicates in a dataset, and I think it's
something to do with the data being in Unicode.

I am running the following code:
SORT CASES BY email_address (A) first_name(A) last_name(A).
MATCH FILES
/FILE=*
/BY email_address first_name last_name
/FIRST=PrimaryFirst
/LAST=PrimaryLast.
DO IF (PrimaryFirst).
COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMAT MatchSequence (f7).
SELECT IF PrimaryLast.
MATCH FILES
/FILE=*
/DROP=PrimaryFirst PrimaryLast.

But I keep getting an error message that the file is not sorted properly,
and that I should use SORT CASES (which, obviously, I already am!). When I
read the data in as ANSI rather than Unicode it runs fine, but I lose data
because some of the characters are not read in properly.
The point where it thinks the file is out of order is a record where the
last_name contains an 'é', so I am wondering if it is something to do with
the special characters in Unicode. Any ideas on what to do about this? I
have already discussed this with tech support, and the suggestion was to use
the LTRIM and RTRIM functions to remove blanks from the sort variables, but
this has not made any difference.

Thanks for any help!

Meraud Bawden
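
The suspicion in the message above is that the sort step and the order check disagree about where 'é' belongs. Nobody in this thread confirms that this is what SPSS does, so the following Python sketch is only an illustration of how such a mismatch could arise. The sample names and the helpers fold_accents and first_out_of_order are hypothetical, and the accent-folding key is a crude stand-in for a linguistic collation, not the collation any particular product uses.

```python
import unicodedata


def fold_accents(text):
    """Crude stand-in for a linguistic sort key: strip combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def first_out_of_order(values, key):
    """Index of the first value that sorts before its predecessor under `key`, else None."""
    for i in range(1, len(values)):
        if key(values[i]) < key(values[i - 1]):
            return i
    return None


last_names = ["Bezos", "Bézier", "Benoit"]  # hypothetical sample data

# Sorting by raw code points puts 'é' (U+00E9) after every unaccented ASCII letter.
by_code_point = sorted(last_names)

# Sorting by the folded key treats 'é' like 'e'.
by_folded_key = sorted(last_names, key=fold_accents)

print(by_code_point)   # ['Benoit', 'Bezos', 'Bézier']
print(by_folded_key)   # ['Benoit', 'Bézier', 'Bezos']

# A checker using the folded key rejects the code-point order at the accented name.
print(first_out_of_order(by_code_point, fold_accents))  # 2
```

If something like this is happening, the record flagged as out of order would be exactly the one containing the accented character, which matches the symptom described.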

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Dennis Deck

Re: sorting data in unicode

I'm not familiar with this application of MATCH FILES (and would not expect it to run).

MATCH FILES
/FILE=*
/BY email_address first_name last_name
/FIRST=PrimaryFirst
/LAST=PrimaryLast.

A simple alternative for flagging duplicates is:

COMPUTE Dup = 0 .
IF ($Casenum>1 and email_address = LAG(email_address)) Dup = 1 .

SELECT IF (Dup=0) .

Dennis Deck
Senior Researcher
RMC Researcher
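
For readers less used to LAG, the idea above is to compare each case with the one immediately before it in a file already sorted on the key. A minimal Python sketch of that comparison follows. The function name flag_duplicates and the sample addresses are hypothetical, and the sketch accepts any key function, so it could compare all three sort variables rather than email_address alone.

```python
def flag_duplicates(sorted_rows, key):
    """Return 1 for each row whose key equals the previous row's key, else 0.

    Assumes sorted_rows is already sorted on the same key.
    """
    flags = []
    previous_key = None
    for index, row in enumerate(sorted_rows):
        current_key = key(row)
        # The first row has no predecessor, mirroring the $Casenum>1 guard.
        flags.append(1 if index > 0 and current_key == previous_key else 0)
        previous_key = current_key
    return flags


# Hypothetical sample data, already sorted.
emails = ["a@example.org", "a@example.org", "b@example.org",
          "c@example.org", "c@example.org"]

dup = flag_duplicates(emails, key=lambda value: value)
kept = [email for email, flag in zip(emails, dup) if flag == 0]

print(dup)   # [0, 1, 0, 0, 1]
print(kept)  # ['a@example.org', 'b@example.org', 'c@example.org']
```

Unlike the /LAST approach in the original post, this keeps the first case of each group rather than the last.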

-----Original Message-----
From: Meraud Bawden [mailto:[hidden email]]
Sent: Tuesday, September 16, 2008 9:32 AM
Subject: sorting data in unicode


Richard Ristow

One-file MATCH (was, re: sorting data in unicode)

At 01:54 AM 9/17/2008, Dennis Deck wrote:

>I'm not familiar with this application of MATCH FILES (and would not
>expect it to run).
>
>MATCH FILES
> /FILE=*
> /BY email_address first_name last_name
> /FIRST=PrimaryFirst
> /LAST=PrimaryLast.

Actually, one-file MATCH FILES statements (or ADD FILES) work fine,
and are handy for effects like this, as well as reordering and
dropping variables, etc. Of course, the file must be pre-sorted by
any BY variables. (I've no light on the Unicode sorting problem, though.)

|-----------------------------|---------------------------|
|Output Created |17-SEP-2008 16:34:57 |
|-----------------------------|---------------------------|
First Last Key

Alice Becker A
Alice Becker B
Alice Becker C
Charles Darwin D
Edward Fergus E
Edward Fergus F

Number of cases read: 6 Number of cases listed: 6

SORT CASES BY First Last Key.
MATCH FILES
/FILE=*
/BY first last
/FIRST=PrimaryFirst
/LAST=PrimaryLast.

DO IF (PrimaryFirst).
. COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
. COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMAT PrimaryFirst PrimaryLast MatchSequence (f3).

LIST.

List
|-----------------------------|---------------------------|
|Output Created |17-SEP-2008 16:34:57 |
|-----------------------------|---------------------------|
First Last Key PrimaryFirst PrimaryLast MatchSequence

Alice Becker A 1 0 1
Alice Becker B 0 0 2
Alice Becker C 0 1 3
Charles Darwin D 1 1 0
Edward Fergus E 1 0 1
Edward Fergus F 0 1 2

Number of cases read: 6 Number of cases listed: 6
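
For comparison, the flags in the listing above can be reproduced outside SPSS. The Python sketch below is an illustration only, not a description of how MATCH FILES is implemented. The function name flag_groups is hypothetical, and the rows are the test data from this message.

```python
from itertools import groupby

# Test data from this message, already sorted by First, Last, Key.
rows = [
    ("Alice", "Becker", "A"),
    ("Alice", "Becker", "B"),
    ("Alice", "Becker", "C"),
    ("Charles", "Darwin", "D"),
    ("Edward", "Fergus", "E"),
    ("Edward", "Fergus", "F"),
]


def flag_groups(sorted_rows, key):
    """Append (first, last, sequence) to each row of a file sorted on `key`."""
    flagged = []
    for _, group in groupby(sorted_rows, key=key):
        members = list(group)
        sequence = 0
        for position, row in enumerate(members):
            first = int(position == 0)
            last = int(position == len(members) - 1)
            # Same rule as the DO IF block: a unique case gets 0,
            # cases in a duplicated group are numbered from 1.
            sequence = 1 - last if first else sequence + 1
            flagged.append(row + (first, last, sequence))
    return flagged


flagged = flag_groups(rows, key=lambda row: (row[0], row[1]))
for row in flagged:
    print(row)
```

The last three values of each printed tuple correspond to PrimaryFirst, PrimaryLast and MatchSequence in the listing.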
=============================
APPENDIX: Test data, and code
=============================
DATA LIST LIST
/First Last Key (2A8,A1).
BEGIN DATA
Alice Becker A
Alice Becker B
Alice Becker C
Charles Darwin D
Edward Fergus E
Edward Fergus F
END DATA.

LIST.

SORT CASES BY First Last Key.
MATCH FILES
/FILE=*
/BY first last
/FIRST=PrimaryFirst
/LAST=PrimaryLast.

DO IF (PrimaryFirst).
. COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
. COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMAT PrimaryFirst PrimaryLast MatchSequence (f3).

LIST.
