SPSSX Discussion

A SUPER Variable PART2

Eugenio Grant

A SUPER Variable PART2

Guys:

The syntax provided works fine, but I came across a problem, I have some of
the variables with a length of 1, and another of 2. For example, V1 = 1, V2
= 2, but V17 (for example) is V17 = 22. And the code stops working.

The code so far is working only for variables of fixed length.

Any ideas.

Hi:

I have a DataBase of 2500 variables, all are numbers.

V1 to V2500, and I need to create a 200-character variable that
concatenates their values, like this:

If V1 = 1 and V2 = 7 and V3 = 9 I need a new variable like this.

Vnew = 179 and so on.

Can I create a 200-character variable like this? Does SPSS have a limit on
the length of a variable?

Regards,

Richard Ristow

Re: A SUPER Variable PART2

(Continues thread begun under heading "How to create a SUPER
Variable".)

At 04:32 PM 1/19/2007, Eugenio Grant wrote:

>The syntax provided works fine, but I came across a problem, I have
>some of the variables with a length of 1, and another of 2. For
>example, V1=1, V2=2, but V17 (for example) is V17=22. And the code
>stops working.
>
>The code so far is working only for variables of fixed length.

ALL numeric variables (and don't forget this) are of the same fixed
length. Some span different ranges of values, which means they'll take
different numbers of digits to display properly, but internally they're
all stored the same way.

That's important here, because it means the transformation from numeric
to string is basically arbitrary: there's no way that's *a priori*
right.
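
To see how arbitrary the conversion is, here is a small sketch
(untested, with made-up data) that converts the same two values using
three formats. I'd expect F1 to be too narrow for 22, F2 to pad 7 with a
leading blank, and N2 to pad it with a leading zero. Which of those you
want is your decision, not SPSS's. The names AsF1, AsF2 and AsN2 are
mine, for illustration only.

```spss
* Sketch only: v1 holds the made-up values 7 and 22.
DATA LIST FREE / v1.
BEGIN DATA
7 22
END DATA.

* One string variable per candidate format.
STRING AsF1 (A1) AsF2 (A2) AsN2 (A2).
COMPUTE AsF1 = STRING(v1, F1).
COMPUTE AsF2 = STRING(v1, F2).
COMPUTE AsN2 = STRING(v1, N2).
LIST.
```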

>Any ideas.

Yes. To start with, rethink the project: how do you WANT the string
(the 'super' variable) to be defined? Do you want to allocate two
spaces for each variable, to accommodate two-digit values? Here's code
(untested) to do that; it uses leading zeroes for values less than 10.
The changes are: format N2 instead of F1 in the STRING function, and
somewhat awkward calculations for the SUBSTR arguments in the code that
uses SUBSTR. (Definitely an argument for CONCAT rather than SUBSTR
logic.)

do repeat #T = v1 to v3.
. compute SUPER = concat(rtrim(SUPER), rtrim(string(#T, N2))).
end repeat.

OR

do repeat VARIABLE = v1 to v3
         /POSITION = 1 to 3.
. COMPUTE #CharPos = POSITION*2 - 1.
. COMPUTE SUBSTR(SUPER2,#CharPos,2)
          = STRING(VARIABLE,N2).
end repeat.

Richard Ristow

Re: A SUPER Variable PART2

At 06:28 PM 1/19/2007, Eugenio Grant wrote, off-list:

>The idea is to take a big piece of information of every record of my
>database, and then be able to aggregate,

So this whole thing is to be a BREAK variable for AGGREGATE? You know
that BREAK accepts a variable list, and it can name a large number of
variables (I don't know how many). Try that before building something
like this combined variable.

>In order to make sure that there are no duplicates, records that hold
>big chunks of the same info might be suspicious to me. (if I aggregate
>by the Super Variable there should not be equal cases, hence n = 1 in
>all cases)

Yes, I see what you're getting at. I hope you set up your AGGREGATE so
it gives you *which* records, by respondent ID or whatever you have,
participate in a putative 'duplicate'.
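
For instance, something like this (untested) carries ID values along
with the count. RespID is my stand-in name for whatever ID variable you
have, and FirstID and LastID are names I made up. FIRST and LAST only
name two records per group, so a group of three or more still needs a
closer look.

```spss
* Sketch only: RespID is a hypothetical respondent-ID variable.
AGGREGATE OUTFILE=*
  /BREAK=v1 TO v3
  /N_INST  'Number of instances matching vbl grp' = N
  /FirstID 'ID of first record in group' = FIRST(RespID)
  /LastID  'ID of last record in group'  = LAST(RespID).

* Keep only the groups that have more than one record.
SELECT IF N_INST GT 1.
LIST.
```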

>Because a questionnaire in which 2 respondents answer identically in
>some parts might be pretty strange...

I can see that, too.

>My variables are all numeric but have different width, for example v10
>width is 2, while v77 has a width of 4.

Remember, they're *NOT* of different width; they're of a different
range of values. To do what you say you want to, all I can think of is
to allow enough character spaces per variable, to hold the maximum
number of digits needed.
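
If the widest value needs four digits, as you say v77 does, that means
four characters for every variable. Here is a sketch (untested) for
three variables, adapting the SUBSTR logic. SUPER4 is a name I made up,
and N4 assumes the values are non-negative integers no larger than 9999.

```spss
* Sketch only: 3 variables at 4 characters each needs A12.
STRING SUPER4 (A12).
DO REPEAT VARIABLE = v1 TO v3
         /POSITION = 1 TO 3.
* Starting positions are 1, 5 and 9.
. COMPUTE #CharPos = (POSITION - 1)*4 + 1.
. COMPUTE SUBSTR(SUPER4, #CharPos, 4) = STRING(VARIABLE, N4).
END REPEAT.
EXECUTE.
```

At four characters apiece, 2500 variables would need a 10,000-character
string, which may be more than your SPSS release allows.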

However, better not to fuss with the catenation at all. That is,
instead of

do repeat #T = v1 to v3.
. compute SUPER = concat(rtrim(SUPER), rtrim(string(#T, N2))).
end repeat.

AGGREGATE OUTFILE=*
   /BREAK=SUPER
   /N_INST 'Number of instances matching vbl grp' = N.

use

AGGREGATE OUTFILE=*
   /BREAK=v1 to v3
   /N_INST 'Number of instances matching vbl grp' = N.

>If I can take big pieces of my database for every record I might be
>able to find duplicate records.

If you really want to find exact duplicates, you could try the above,
naming all variables in your data. I don't know when AGGREGATE will hit
a limit of how many variables it can have in a BREAK, but try it.

If your file is big (hundreds of thousands of records, or more), you'll
have many, many break groups, only slightly fewer than one per record.
That can slow AGGREGATE badly. If so, SORT CASES to put the records in
break-variable order, and use PRESORTED on AGGREGATE.
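
Roughly like this (untested), again with v1 to v3 standing in for your
real variable list:

```spss
* Sort on the break variables first, then tell AGGREGATE
* that the file is already in that order.
SORT CASES BY v1 TO v3.
AGGREGATE OUTFILE=*
  /PRESORTED
  /BREAK=v1 TO v3
  /N_INST 'Number of instances matching vbl grp' = N.
```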

As an alternative to

AGGREGATE OUTFILE=*
   /BREAK=<varlist>
   /N_INST 'Number of instances matching vbl grp' = N.

consider the following (untested), which retains the original records
and flags the ones that are not unique on the variable list:

SORT CASES BY <varlist>.
ADD FILES
   /FILE=*
   /BY <varlist>
   /FIRST=ListFrst
   /LAST =ListLast.

NUMERIC List_Dup (F2).
VAR LABELS List_Dup 'Record is non-unique on list <describe>'.
VAL LABELS List_Dup 0 'Unique' 1 'Duplicated'.
COMPUTE List_Dup = 0.
IF ListFrst EQ 0 List_Dup = 1.
IF ListLast EQ 0 List_Dup = 1.
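
Once the flag is there, something like this (untested) counts the
flagged records and lists them. RespID is again a hypothetical ID
variable, and v1 to v3 stands in for <varlist>.

```spss
* Sketch only: RespID is a hypothetical ID variable.
* How many records are flagged as duplicated.
FREQUENCIES VARIABLES = List_Dup.

* List just the flagged records; TEMPORARY keeps the
* selection from outlasting the LIST.
TEMPORARY.
SELECT IF List_Dup EQ 1.
LIST VARIABLES = RespID v1 TO v3.
```

Because the file is still sorted by the variable list, the members of
each duplicate group come out next to each other in the listing.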