SPSSX Discussion

sorting in version 15 and 16


Antoon Smulders

sorting in version 15 and 16

Dear list,

I recently noticed that the sort order in SPSS 15 differs from that in SPSS 16
(at least in *my* installations of these versions).

The following syntax produces different output for values that contain the
non-alphabetic characters "(", "*" and "^".

DATA LIST FIXED /txt 1-6 (A).
BEGIN DATA.
* test
(test)
test
^ test
END DATA.
SORT CASES BY txt.
echo "version 16".
LIST.

In version 15 the result is:

(test)
* test
^ test
test

while version 16 gives:

^ test
(test)
* test
test

The difference seems to affect non-alphabetic characters.

The same is true for AUTORECODE, which assigns different numbers to the same
values in the two versions.

Is there a setting I overlooked?

Thanks in advance,

Antoon Smulders,

Advies- en Onderzoeksgroep Beke

026-4438619

www.beke.nl

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Peck, Jon

Re: sorting in version 15 and 16

SPSS 16 supports both Unicode and the traditional code page character sets (SET UNICODE ON/OFF). As part of that, SPSS 16 follows the Unicode collation algorithm. We felt it was important to give the same sort order in both modes as long as the characters can be represented in the code page.

As you can imagine, with around 100,000 characters defined, the Unicode standard required some serious research into what collation means, especially since the order is affected by the locale/language in use.

You can see the default collation tables for Unicode at
http://www.unicode.org/Public/UCA/latest/allkeys.txt

For simple characters, only the first weight following the "[" matters. Sorting words, however, gets considerably more complicated once you consider French and other languages where you can't simply proceed left to right, character by character, so there are additional weights that affect the collation algorithm. Accented characters are treated in different ways in different locales, and multi-script combinations (Japanese with Russian, e.g.) introduce other complexities.

The Unicode collation algorithm itself is explained at
http://www.unicode.org/unicode/reports/tr10/#AllKeys
which gives some insight into the complexities involved.

If you need to get an Autorecode result that is consistent between SPSS 15 and 16, use one or the other to produce a template (AUTORECODE ... /SAVE TEMPLATE=) and apply that template (AUTORECODE ... /APPLY TEMPLATE) in the other version.

Using a template with Autorecode when you need a stable recoding is always a good idea, because otherwise new values that occur (or values that disappear) will cause the recode for the other values to change.
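
As a minimal sketch of the template approach described above: the first command would be run in the version whose coding should be kept, the second in the other version. The variable txt comes from the example in this thread; the target variable txt_code and the file path C:\temp\txt_codes.sat are hypothetical names chosen for illustration only.

```spss
* Run in the version whose coding you want to keep (for example, SPSS 15).
* txt_code and the template path are hypothetical names for this sketch.
AUTORECODE VARIABLES=txt
  /INTO txt_code
  /SAVE TEMPLATE='C:\temp\txt_codes.sat'
  /PRINT.

* Run in the other version (for example, SPSS 16) on the same string variable.
AUTORECODE VARIABLES=txt
  /INTO txt_code
  /APPLY TEMPLATE='C:\temp\txt_codes.sat'
  /PRINT.
```

With the template applied, values already listed in it should keep the codes assigned in the first run instead of being renumbered according to the new sort order. Comparing the /PRINT output from both versions is an easy way to check that this holds for your data.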

HTH,
Jon Peck

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Antoon Smulders
Sent: Tuesday, March 04, 2008 3:04 AM
To: [hidden email]
Subject: [SPSSX-L] sorting in version 15 and 16


Antoon Smulders

Re: sorting in version 15 and 16

Hello Jon,

It makes no difference whether UNICODE is SET ON or OFF in version 16, at
least for the simple example I gave. The resulting order is the same in both
modes and thus still different from version 15.

Greetings
Antoon

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Peck, Jon
Sent: Tuesday, March 4, 2008 22:14
To: [hidden email]
Subject: Re: sorting in version 15 and 16


Peck, Jon

Re: sorting in version 15 and 16

That is what I said. We need to be consistent between Unicode and code page modes. It would be terrible if the Unicode mode affected the sorting. We always use the Unicode standard in SPSS 16 regardless of the mode.

HTH,
Jon

-----Original Message-----
From: Antoon Smulders [mailto:[hidden email]]
Sent: Wednesday, March 05, 2008 2:27 AM
To: Peck, Jon; [hidden email]
Subject: RE: sorting in version 15 and 16
