Transforming a poorly created string variable

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Transforming a poorly created string variable

D Bartlett
Hello,

I have inherited a dataset with ID codes that were poorly constructed.  I
would like to run some syntax to reorder the information in the string,
and also create id codes that are the same length.

In the "ID" field I have a string variable (width 10) which contains three
pieces of information

1.  the year - as two digits (for example 2001-2002 is "0102")
2.  the program (training is "T", and advanced training is "AT")
3.  the participant # (01, 02, thru about 44)

Right now the order is participant#, program, year.  This is really bad
for sorting.

Here is an example of two codes.
02T0102 (second participant in the basic training which took place 2001-
2002).

02AT0102 (second participant in the advanced training which took place
2001-2002).


The first problem is that the ID codes are different lengths because for
one of the programs "training" they used only 1 character "T" and for the
advanced training they used an "AT".


The second problem is the order of the information.  I would really like
them to read:
A010202 (program, year, participant)
T010202  (for basic training)

And THEN I want to be able to split the program and the years out into
separate variables.

Many thanks for any advice.

Diane
Reply | Threaded
Open this post in threaded view
|

Re: Transforming a poorly created string variable

Richard Ristow
At 12:59 PM 4/4/2007, Diane Bartlett wrote:

>I have inherited a dataset with ID codes that were poorly constructed.
>In the "ID" field I have a string variable (width 10) which contains
>three pieces of information
>
>1.  the year - as two digits (for example 2001-2002 is "0102")

That's four digits, and it does matter. Do you want your year codes
like this?

0001 -> 2000-2001
0102 -> 2001-2002
0203 -> 2002-2003

For what it's worth, there's a useful convention for fiscal years,
which also span calendar years: name them by the highest calendar year.
That would assign codes 2001, 2002, and 2003, for the years above, and
I think that's a good deal clearer. (I don't advocate two-digit year
codes like '01', '02'.)

>2.  the program (training is "T", and advanced training is "AT"). [A]
>problem is that the ID codes are different lengths because "training"
>they used only 1 character "T" and for advanced training they used
>"AT".

I take it, you want to use "T" and "A". Makes sense.

>3.  the participant # (01, 02, thru about 44)

If you're making a string, allow three digits for participant number.
'44' is within a factor of two of overflowing a two-digit field, and
that's dangerously close.

>Here is an example of two codes.
>02T0102 (second participant in the basic training which took place
>2001-
>2002).
>
>02AT0102 (second participant in the advanced training which took place
>2001-2002).
>
>I would really like them to read:
>A010202 (program, year, participant)
>T010202  (for basic training)

OK, except I'd strongly consider, for reasons I gave above,

A2002001
T2002001

>And THEN I want to be able to split the program and the years out into
>separate variables.

Would it be OK just to do this, or do you want the ID string as well?

I'll leave this, at least for now. There are lots of other good
programmers here.

-Best of wishes,
  Richard
Reply | Threaded
Open this post in threaded view
|

Re: Transforming a poorly created string variable

meljr
In reply to this post by D Bartlett
Hi Diane, as I thought about a syntax solution to this using Substr(a1,a2,a3) as seen in the syntax manual, it occured to me that with so few cases you could code this by hand faster that with syntax.  If this were my task, I would make 3 new variables, year, program, id. Then put your current id field next to the new field and quickly add in the values. I would also recommend the coding changes that Richard mentioned. However, this is a fun syntax problem, so I am sure you might yet get that solution also.
Good luck.
meljr

Diane Bartlett wrote
Hello,

I have inherited a dataset with ID codes that were poorly constructed.  I
would like to run some syntax to reorder the information in the string,
and also create id codes that are the same length.

In the "ID" field I have a string variable (width 10) which contains three
pieces of information

1.  the year - as two digits (for example 2001-2002 is "0102")
2.  the program (training is "T", and advanced training is "AT")
3.  the participant # (01, 02, thru about 44)

Right now the order is participant#, program, year.  This is really bad
for sorting.

Here is an example of two codes.
02T0102 (second participant in the basic training which took place 2001-
2002).

02AT0102 (second participant in the advanced training which took place
2001-2002).


The first problem is that the ID codes are different lengths because for
one of the programs "training" they used only 1 character "T" and for the
advanced training they used an "AT".


The second problem is the order of the information.  I would really like
them to read:
A010202 (program, year, participant)
T010202  (for basic training)

And THEN I want to be able to split the program and the years out into
separate variables.

Many thanks for any advice.

Diane
Reply | Threaded
Open this post in threaded view
|

Re: Transforming a poorly created string variable

Melissa Ives
In reply to this post by D Bartlett
Since the first 2 digits are always the participant number, you could
use a simple substring (as I'm sure you know) for that portion. (TESTED
SYNTAX).  In the following, you create the individual items, then string
them together in the order you desire.

string pids (a2).
compute pids=Substr(ID,1,2).

*(If you want numeric ids instead, use this, but note that these will
not be useable in creating newid.
*       string #pid (A2).
*       compute #pid=Substr(ID,1,2).
*       recode #pid (convert) into pid.
*       freq pid.)

**If all of your program ids are present as either A or T and as
appears, always begin in col 3--you can use.
**If any program ids are missing or something else, un-star the 2
commands before running.
string program (A1).
do if (substr(ID,3,1)="A").
  compute program='A'.
else if (substr(ID,3,1)="T").
  compute program='T'.
*else.
*  compute program='X'.
end if.
freq program.

*** then based on the value of program, select the years from the right
place in the ID string.
string years (a4).
do if (program="A").
  compute years=substr(ID,5,4).
else if (program="T").
  compute years=substr(ID,4,4).
end if.
freq years.

** Finally put them back in the order you want.
String newid (A8).
compute newid=concat(program,years,pids).
freq newid.


> Here is an example of two codes.
> 02T0102
> 02AT0102
>
> The second problem is the order of the information.  I would really
> like them to read:
> A010202 (program, year, participant)
> T010202  (for basic training)


Melissa
The bubbling brook would lose its song if you removed the rocks.


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
meljr
Sent: Wednesday, April 04, 2007 1:43 PM
To: [hidden email]
Subject: Re: [SPSSX-L] Transforming a poorly created string variable

Hi Diane, as I thought about a syntax solution to this using
Substr(a1,a2,a3) as seen in the syntax manual, it occured to me that
with so few cases you could code this by hand faster that with syntax.
If this were my task, I would make 3 new variables, year, program, id.
Then put your current id field next to the new field and quickly add in
the values. I would also recommend the coding changes that Richard
mentioned. However, this is a fun syntax problem, so I am sure you might
yet get that solution also.
Good luck.
meljr


Diane Bartlett wrote:
>
> Hello,
>
> I have inherited a dataset with ID codes that were poorly constructed.

> I would like to run some syntax to reorder the information in the
> string, and also create id codes that are the same length.
>
> In the "ID" field I have a string variable (width 10) which contains
> three pieces of information
>
> 1.  the year - as two digits (for example 2001-2002 is "0102") 2.  the

> program (training is "T", and advanced training is "AT") 3.  the
> participant # (01, 02, thru about 44)
>
> Right now the order is participant#, program, year.  This is really
> bad for sorting.
>
> Here is an example of two codes.
> 02T0102 (second participant in the basic training which took place
> 2001- 2002).
>
> 02AT0102 (second participant in the advanced training which took place

> 2001-2002).
>
>
> The first problem is that the ID codes are different lengths because
> for one of the programs "training" they used only 1 character "T" and
> for the advanced training they used an "AT".
>
>
> The second problem is the order of the information.  I would really
> like them to read:
> A010202 (program, year, participant)
> T010202  (for basic training)
>
> And THEN I want to be able to split the program and the years out into

> separate variables.
>
> Many thanks for any advice.
>
> Diane
>
>

--
View this message in context:
http://www.nabble.com/Transforming-a-poorly-created-string-variable-tf35
27049.html#a9843101
Sent from the SPSSX Discussion mailing list archive at Nabble.com.


PRIVILEGED AND CONFIDENTIAL INFORMATION
This transmittal and any attachments may contain PRIVILEGED AND
CONFIDENTIAL information and is intended only for the use of the
addressee. If you are not the designated recipient, or an employee
or agent authorized to deliver such transmittals to the designated
recipient, you are hereby notified that any dissemination,
copying or publication of this transmittal is strictly prohibited. If
you have received this transmittal in error, please notify us
immediately by replying to the sender and delete this copy from your
system. You may also call us at (309) 827-6026 for assistance.