SPSSX Discussion

finding unique data in a column

Classic

List

Threaded

5 messages Options

Gary Stevens-2

finding unique data in a column

I have a 1.5 million line file made up of two columns of data. The first
column has a common identifier,always "$**ID**$", before each facility's
data begins, then always two lines down is the facility's unique ID number
(always 9 digits ending with an H) and the next line (third line after
$**ID**$) is the facility's name. Further down the rows is a specific
item number I am looking for (always the same number) and its value in the
column next it. This repeats 40 times (40 facilities). The number of
rows of data in each facility varies. How can I select out the facility's
ID number, name, desired item number and its value? example below

$**ID**$
value
5902002H
Facility Name

(many intervening rows of items with values)
item number value

needed item no value

(many intervening rows of items with values, then repeats 40 times)

$**ID**$
value
5902002H
Facility Name

(many intervening rows of items with values)

item number value
item number value
item number value
item number value
needed item no value

(many intervening rows of items with values)

Maguin, Eugene

Re: finding unique data in a column

Gary,

I don't have time to work out an idea but information that others may need
to work out a solution are the following.

1) Are you reading a text file, or is the data already in a spreadsheet or a
database?
2) Is the data structured? That is, is the number of lines of data for each
facility exactly the same or does it vary from from facility to anther?
3) Is the location of the data to be extracted for a facility in exactly the
same position relative to the first line for data for all facilities?
4) Is the information to be read in exactly the same columns for every
record containing that information?

If the answer to 1) is text, I'd suggest that you start reading the
documentation for Input program-end input program.

Gene Maguin

Raynald Levesque-2

Re: finding unique data in a column

Hi

I believe the following does the job.

* This assumes you want the values for item number 35.

DATA LIST FIXED /str 1-20 (A) val1 30-35(F).

Data List will read 1 records from the command file

Variable Rec Start End Format

str 1 1 20 A20
val1 1 30 35 F6.0

BEGIN DATA.
$**ID**$ 1
1 2
5902002H 3
ABC home 4
25 17
35 22
45 45
$**ID**$ 1
1 2
5902002H 3
ABC home 4
22 18
23 55
24 14
25 17
26 01
35 28
45 45
$**ID**$ 1
1 2
5902088H 3
XYZ home 4
35 47
45 45
56 56
75 11
END DATA.

LIST.

List

str val1

$**ID**$ 1
1 2
5902002H 3
ABC home 4
25 17
35 22
45 45
$**ID**$ 1
1 2
5902002H 3
ABC home 4
22 18
23 55
24 14
25 17
26 1
35 28
45 45
$**ID**$ 1
1 2
5902088H 3
XYZ home 4
35 47
45 45
56 56
75 11

Number of cases read: 26 Number of cases listed: 26

COMPUTE start1= (str="$**ID**$").
IF start1=1 OR LAG(start1,2)=1 OR LAG(start1,3) OR str="35" KeepThis=1.
EXECUTE.
SELECT IF KeepThis=1.

DO IF $CASENUM=1.
- COMPUTE grp=1.
ELSE.
- COMPUTE grp= LAG(grp) + start1.
END IF.

DO IF start1=1.
- COMPUTE idx=1.
ELSE.
- COMPUTE idx= LAG(idx) + 1.
END IF.
VECTOR var(4A20).
COMPUTE var(idx)=str.
IF idx=4 myVal = val1.
AGG OUT=*
/BREAK=grp
/var1 TO var4 myVal =MAX(var1 TO var4 myVal).
LIST.

List

grp var1 var2 var3
var4 myVal

1.00 $**ID**$ 5902002H ABC home
35 22.00
2.00 $**ID**$ 5902002H ABC home
35 28.00
3.00 $**ID**$ 5902088H XYZ home
35 47.00

Number of cases read: 3 Number of cases listed: 3

--
Raynald Levesque
www.spsstools.net

On 8/23/07, Gene Maguin <[hidden email]> wrote:

>
> Gary,
>
> I don't have time to work out an idea but information that others may need
> to work out a solution are the following.
>
> 1) Are you reading a text file, or is the data already in a spreadsheet or
> a
> database?
> 2) Is the data structured? That is, is the number of lines of data for
> each
> facility exactly the same or does it vary from from facility to anther?
> 3) Is the location of the data to be extracted for a facility in exactly
> the
> same position relative to the first line for data for all facilities?
> 4) Is the information to be read in exactly the same columns for every
> record containing that information?
>
> If the answer to 1) is text, I'd suggest that you start reading the
> documentation for Input program-end input program.
>
> Gene Maguin
>

Richard Ristow

Re: finding unique data in a column

In reply to this post by Gary Stevens-2

Since I was about finished when Raynald posted, here's a solution in a
slightly different style, but also using AGGREGATE to go from many
records to one.

At 12:36 PM 8/23/2007, Gary Stevens wrote:

>I have a 1.5 million line file made up of two columns of data. The
>first
>column has a common identifier,always "$**ID**$", before each
>facility's data begins, then always two lines down is the facility's
>unique ID number (always 9 digits ending with an H) and the next line
>(third line after $**ID**$) is the facility's name. Further down the
>rows is a specific item number I am looking for (always the same
>number) and its value in the column next it.

Like this, except I didn't read carefully enough, so the ID is a
number, rather than a string of digits ending in 'H':

|-----------------------------|---------------------------|
|Output Created |23-AUG-2007 23:10:07 |
|-----------------------------|---------------------------|
Item VALUE

$**ID**$ .
(ignore) .
Alpha .
20050 .
17 94
18 9
29 198
42 121
49 145
60 155
66 8
$**ID**$ .
(ignore) .
Beta .
30076 .
9 108
17 44
30 31
64 89
79 24
92 114
$**ID**$ .
(ignore) .
Gamma .
20058 .
6 128
17 48
36 116
54 85
63 148
64 135

Number of cases read: 31 Number of cases listed: 31

>How can I select out the facility's ID number, name, desired item
>number and its value?

Raynald selected item 35; I'm selecting item 17. This is SPSS 15 draft
output:

STRING Facility (A8).
NUMERIC Fclty# (N5).
NUMERIC Item# (F3).
NUMERIC Value$ (F6).

LEAVE Facility.

DO IF Item EQ '$**ID**$'.
. COMPUTE #LineNo = 0.
. COMPUTE Facility = ' '.
END IF.

COMPUTE #LineNo = #LineNo + 1.

IF #LineNo EQ 3 Facility = Item.
IF #LineNo EQ 4 Fclty# = NUMBER(Item,F8).

DO IF #LineNo GT 4.
DO IF NUMBER(Item,F8) EQ 17.
. COMPUTE Item# = NUMBER(Item,F8).
. COMPUTE Value$ = Value.
. END IF.
END IF.

. /*-- PRINT /*-*/
/*-- / #LineNo (F3) ': ' Item Value /*-*/
/*-- / ' -> ' Facility Fclty# Item# Value$ /*-*/.

. /*-- EXECUTE /*-*/.

SELECT IF Facility NE ' '.

AGGREGATE OUTFILE=*
/BREAK = Facility
/Fclty# 'Facility ID number' = FIRST(Fclty#)
/Item# 'Data item number' = FIRST(Item#)
/Value 'Value of data item ' = FIRST(Value$).

VARIABLE LABELS
Facility 'Facility name'
Fclty# 'Facility ID number'.

LIST.

List
|-----------------------------|---------------------------|
|Output Created |23-AUG-2007 23:10:08 |
|-----------------------------|---------------------------|
Facility Fclty# Item# Value

Alpha 20050 17 94
Beta 30076 17 44
Gamma 20058 17 48

Number of cases read: 3 Number of cases listed: 3

===================
APPENDIX: Test data
===================
* ................................................................. .
* ................. Test data ..................... .
SET RNG = MT /* 'Mersenne twister' random number generator */ .
SET MTINDEX = 3782 /* Providence, RI telephone book */ .

NEW FILE.
INPUT PROGRAM.
. STRING Item (A9).
. NUMERIC VALUE (F3).
. LOOP #Fclty = 1 TO 3.

* ... Key to begin a new facility ................ .
. COMPUTE Item = '$**ID**$'.
. END CASE.

* ... Ignored line, following beginning of a facility .
. COMPUTE Item = '(ignore)'.
. END CASE.

* ... Facility name ................ .
. RECODE #Fclty
(1 = 'Alpha')
(2 = 'Beta' )
(3 = 'Gamma') INTO Item.
. END CASE.

* ... Facility number ................ .
. COMPUTE #Fclt# = 1E4*TRUNC(RV.UNIFORM(0,4)).
. COMPUTE #Fclt# = #Fclt#
+ TRUNC(RV.UNIFORM(1,200)).
. COMPUTE Item = STRING(#Fclt#,N5).
. END CASE.

* ... Data values ................ .
. LOOP #Data# = 1 TO 100.
. DO IF RV.BERNOULLI(.07)
OR #Data# = 17.
. COMPUTE Item = STRING(#Data#,F3).
. COMPUTE Value = TRUNC(RV.UNIFORM(1,200)).
. END CASE.
. END IF.
. END LOOP /* for one data value per pass */.

. END LOOP. /* for one facility per pass */.
END FILE.
END INPUT PROGRAM.

* ................. Post after this point ..................... .

Melissa Ives

Re: finding unique data in a column

In reply to this post by Gary Stevens-2

I think it may be simpler than any of the suggestions so far-- this
seems like a nested file type (See blurb below from SPSS help topics
"nested"), that can be read in with appropriate syntax--(they provide an
example in help that you could emulate.) I believe this assumes that
you have a data file that is not an SPSS file

NESTED Nested file type. NESTED defines a file in which the record
types are related to each other hierarchically. The record types are
grouped together by a case identification number that identifies the
highest level-the first record type-of the hierarchy. Usually, the last
record type specified-the lowest level of the hierarchy-defines a case.
For example, in a file containing household records and records for each
person living in the household, each person record defines a case.
Information from higher record types may be spread to each case. For
example, the value for a variable on the household record, such as CITY,
can be spread to the records for each person in the household.

Look into this option and see if it fits what you need.

Melissa

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Gary Stevens
Sent: Thursday, August 23, 2007 11:37 AM
To: [hidden email]
Subject: [SPSSX-L] finding unique data in a column

I have a 1.5 million line file made up of two columns of data. The
first column has a common identifier,always "$**ID**$", before each
facility's data begins, then always two lines down is the facility's
unique ID number (always 9 digits ending with an H) and the next line
(third line after
$**ID**$) is the facility's name. Further down the rows is a specific
item number I am looking for (always the same number) and its value in
the column next it. This repeats 40 times (40 facilities). The number
of rows of data in each facility varies. How can I select out the
facility's ID number, name, desired item number and its value? example
below

$**ID**$
value
5902002H
Facility Name

(many intervening rows of items with values) item number value

needed item no value

(many intervening rows of items with values, then repeats 40 times)

$**ID**$
value
5902002H
Facility Name

(many intervening rows of items with values)

item number value
item number value
item number value
item number value
needed item no value

(many intervening rows of items with values)

PRIVILEGED AND CONFIDENTIAL INFORMATION
This transmittal and any attachments may contain PRIVILEGED AND
CONFIDENTIAL information and is intended only for the use of the
addressee. If you are not the designated recipient, or an employee
or agent authorized to deliver such transmittals to the designated
recipient, you are hereby notified that any dissemination,
copying or publication of this transmittal is strictly prohibited. If
you have received this transmittal in error, please notify us
immediately by replying to the sender and delete this copy from your
system. You may also call us at (309) 827-6026 for assistance.