finding unique data in a column

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

finding unique data in a column

Gary Stevens-2
I have a 1.5 million line file made up of two columns of data.  The first
column has a common identifier,always "$**ID**$", before each facility's
data begins, then always two lines down is the facility's unique ID number
(always 9 digits ending with an H) and the next line (third line after
$**ID**$) is the facility's name.  Further down the rows is a specific
item number I am looking for (always the same number) and its value in the
column next it.  This repeats 40 times (40 facilities).  The number of
rows of data in each facility varies.  How can I select out the facility's
ID number, name, desired item number and its value? example below

$**ID**$
value
5902002H
Facility Name

(many intervening rows of items with values)
item number value

needed item no value

(many intervening rows of items with values, then repeats 40 times)

$**ID**$
value
5902002H
Facility Name

(many intervening rows of items with values)

item number value
item number value
item number value
item number value
needed item no value

(many intervening rows of items with values)
Reply | Threaded
Open this post in threaded view
|

Re: finding unique data in a column

Maguin, Eugene
Gary,

I don't have time to work out an idea but information that others may need
to work out a solution are the following.

1) Are you reading a text file, or is the data already in a spreadsheet or a
database?
2) Is the data structured? That is, is the number of lines of data for each
facility exactly the same or does it vary from from facility to anther?
3) Is the location of the data to be extracted for a facility in exactly the
same position relative to the first line for data for all facilities?
4) Is the information to be read in exactly the same columns for every
record containing that information?

If the answer to 1) is text, I'd suggest that you start reading the
documentation for Input program-end input program.

Gene Maguin
Reply | Threaded
Open this post in threaded view
|

Re: finding unique data in a column

Raynald Levesque-2
Hi

I believe the following does the job.

* This assumes you want the values for item number 35.

DATA LIST FIXED /str 1-20 (A) val1 30-35(F).

Data List will read 1 records from the command file

Variable          Rec   Start     End  Format

str                 1       1      20  A20
val1                1      30      35  F6.0

BEGIN DATA.
$**ID**$                     1
1                            2
5902002H                     3
ABC home                     4
25                           17
35                           22
45                           45
$**ID**$                     1
1                            2
5902002H                     3
ABC home                     4
22                           18
23                           55
24                           14
25                           17
26                           01
35                           28
45                           45
$**ID**$                     1
1                            2
5902088H                     3
XYZ home                     4
35                           47
45                           45
56                           56
75                           11
END DATA.

LIST.


List

str                    val1

$**ID**$                  1
1                         2
5902002H                  3
ABC home                  4
25                       17
35                       22
45                       45
$**ID**$                  1
1                         2
5902002H                  3
ABC home                  4
22                       18
23                       55
24                       14
25                       17
26                        1
35                       28
45                       45
$**ID**$                  1
1                         2
5902088H                  3
XYZ home                  4
35                       47
45                       45
56                       56
75                       11


Number of cases read:  26    Number of cases listed:  26


COMPUTE start1= (str="$**ID**$").
IF start1=1 OR LAG(start1,2)=1 OR LAG(start1,3) OR str="35" KeepThis=1.
EXECUTE.
SELECT IF KeepThis=1.

DO IF $CASENUM=1.
- COMPUTE grp=1.
ELSE.
- COMPUTE grp= LAG(grp) + start1.
END IF.

DO IF start1=1.
- COMPUTE idx=1.
ELSE.
- COMPUTE idx= LAG(idx) + 1.
END IF.
VECTOR var(4A20).
COMPUTE var(idx)=str.
IF idx=4 myVal = val1.
AGG OUT=*
  /BREAK=grp
  /var1 TO var4 myVal =MAX(var1 TO var4 myVal).
LIST.


List

     grp var1                 var2                 var3
var4                    myVal

    1.00 $**ID**$             5902002H             ABC home
35                      22.00
    2.00 $**ID**$             5902002H             ABC home
35                      28.00
    3.00 $**ID**$             5902088H             XYZ home
35                      47.00


Number of cases read:  3    Number of cases listed:  3

--
Raynald Levesque
www.spsstools.net


On 8/23/07, Gene Maguin <[hidden email]> wrote:

>
> Gary,
>
> I don't have time to work out an idea but information that others may need
> to work out a solution are the following.
>
> 1) Are you reading a text file, or is the data already in a spreadsheet or
> a
> database?
> 2) Is the data structured? That is, is the number of lines of data for
> each
> facility exactly the same or does it vary from from facility to anther?
> 3) Is the location of the data to be extracted for a facility in exactly
> the
> same position relative to the first line for data for all facilities?
> 4) Is the information to be read in exactly the same columns for every
> record containing that information?
>
> If the answer to 1) is text, I'd suggest that you start reading the
> documentation for Input program-end input program.
>
> Gene Maguin
>
Reply | Threaded
Open this post in threaded view
|

Re: finding unique data in a column

Richard Ristow
In reply to this post by Gary Stevens-2
Since I was about finished when Raynald posted, here's a solution in a
slightly different style, but also using AGGREGATE to go from many
records to one.

At 12:36 PM 8/23/2007, Gary Stevens wrote:

>I have a 1.5 million line file made up of two columns of data.  The
>first
>column has a common identifier,always "$**ID**$", before each
>facility's data begins, then always two lines down is the facility's
>unique ID number (always 9 digits ending with an H) and the next line
>(third line after $**ID**$) is the facility's name.  Further down the
>rows is a specific item number I am looking for (always the same
>number) and its value in the column next it.

Like this, except I didn't read carefully enough, so the ID is a
number, rather than a string of digits ending in 'H':

|-----------------------------|---------------------------|
|Output Created               |23-AUG-2007 23:10:07       |
|-----------------------------|---------------------------|
Item      VALUE

$**ID**$     .
(ignore)     .
Alpha        .
20050        .
  17         94
  18          9
  29        198
  42        121
  49        145
  60        155
  66          8
$**ID**$     .
(ignore)     .
Beta         .
30076        .
   9        108
  17         44
  30         31
  64         89
  79         24
  92        114
$**ID**$     .
(ignore)     .
Gamma        .
20058        .
   6        128
  17         48
  36        116
  54         85
  63        148
  64        135


Number of cases read:  31    Number of cases listed:  31

>How can I select out the facility's ID number, name, desired item
>number and its value?

Raynald selected item 35; I'm selecting item 17. This is SPSS 15 draft
output:

STRING  Facility (A8).
NUMERIC Fclty#   (N5).
NUMERIC Item#    (F3).
NUMERIC Value$   (F6).

LEAVE   Facility.

DO IF   Item EQ  '$**ID**$'.
.  COMPUTE #LineNo  =  0.
.  COMPUTE Facility = ' '.
END IF.

COMPUTE          #LineNo  = #LineNo + 1.

IF #LineNo EQ 3  Facility = Item.
IF #LineNo EQ 4  Fclty#   = NUMBER(Item,F8).

DO IF    #LineNo  GT 4.
    DO IF NUMBER(Item,F8) EQ 17.
.     COMPUTE    Item#    = NUMBER(Item,F8).
.     COMPUTE    Value$   = Value.
.  END IF.
END IF.

.  /*--  PRINT                                   /*-*/
    /*--  / #LineNo (F3) ': '       Item  Value   /*-*/
    /*--  / '  -> ' Facility Fclty# Item# Value$  /*-*/.

.  /*--  EXECUTE                                 /*-*/.



SELECT IF   Facility NE ' '.

AGGREGATE OUTFILE=*
    /BREAK  =  Facility
    /Fclty#   'Facility ID number'  = FIRST(Fclty#)
    /Item#    'Data item number'    = FIRST(Item#)
    /Value    'Value of data item ' = FIRST(Value$).

VARIABLE LABELS
     Facility 'Facility name'
     Fclty#   'Facility ID number'.


LIST.

List
|-----------------------------|---------------------------|
|Output Created               |23-AUG-2007 23:10:08       |
|-----------------------------|---------------------------|
Facility Fclty# Item#  Value

Alpha     20050   17      94
Beta      30076   17      44
Gamma     20058   17      48

Number of cases read:  3    Number of cases listed:  3

===================
APPENDIX: Test data
===================
*  ................................................................. .
*  .................   Test data               ..................... .
SET RNG = MT       /* 'Mersenne twister' random number generator  */ .
SET MTINDEX = 3782 /*  Providence, RI telephone book              */ .

NEW FILE.
INPUT PROGRAM.
.  STRING  Item  (A9).
.  NUMERIC VALUE (F3).
.  LOOP #Fclty = 1 TO 3.

* ... Key to begin a new facility    ................ .
.     COMPUTE  Item = '$**ID**$'.
.     END CASE.

* ... Ignored line, following beginning of a facility .
.     COMPUTE  Item = '(ignore)'.
.     END CASE.


* ... Facility name                  ................ .
.      RECODE   #Fclty
             (1 = 'Alpha')
             (2 = 'Beta' )
             (3 = 'Gamma') INTO Item.
.     END CASE.

* ... Facility number                ................ .
.     COMPUTE #Fclt# = 1E4*TRUNC(RV.UNIFORM(0,4)).
.     COMPUTE #Fclt# = #Fclt#
                      +     TRUNC(RV.UNIFORM(1,200)).
.     COMPUTE Item   = STRING(#Fclt#,N5).
.     END CASE.

* ... Data values                    ................ .
.     LOOP    #Data# = 1 TO 100.
.        DO IF    RV.BERNOULLI(.07)
                OR #Data# = 17.
.           COMPUTE Item   = STRING(#Data#,F3).
.           COMPUTE Value  = TRUNC(RV.UNIFORM(1,200)).
.           END CASE.
.        END IF.
.     END LOOP  /* for one data value per pass */.

.  END LOOP.    /* for one facility per pass   */.
END FILE.
END INPUT PROGRAM.

*  .................   Post after this point   ..................... .
Reply | Threaded
Open this post in threaded view
|

Re: finding unique data in a column

Melissa Ives
In reply to this post by Gary Stevens-2
I think it may be simpler than any of the suggestions so far-- this
seems like a nested file type (See blurb below from SPSS help topics
"nested"), that can be read in with appropriate syntax--(they provide an
example in help that you could emulate.)  I believe this assumes that
you have a data file that is not an SPSS file

NESTED  Nested file type. NESTED defines a file in which the record
types are related to each other hierarchically. The record types are
grouped together by a case identification number that identifies the
highest level-the first record type-of the hierarchy. Usually, the last
record type specified-the lowest level of the hierarchy-defines a case.
For example, in a file containing household records and records for each
person living in the household, each person record defines a case.
Information from higher record types may be spread to each case. For
example, the value for a variable on the household record, such as CITY,
can be spread to the records for each person in the household.

Look into this option and see if it fits what you need.

Melissa

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Gary Stevens
Sent: Thursday, August 23, 2007 11:37 AM
To: [hidden email]
Subject: [SPSSX-L] finding unique data in a column

I have a 1.5 million line file made up of two columns of data.  The
first column has a common identifier,always "$**ID**$", before each
facility's data begins, then always two lines down is the facility's
unique ID number (always 9 digits ending with an H) and the next line
(third line after
$**ID**$) is the facility's name.  Further down the rows is a specific
item number I am looking for (always the same number) and its value in
the column next it.  This repeats 40 times (40 facilities).  The number
of rows of data in each facility varies.  How can I select out the
facility's ID number, name, desired item number and its value? example
below

$**ID**$
value
5902002H
Facility Name

(many intervening rows of items with values) item number value

needed item no value

(many intervening rows of items with values, then repeats 40 times)

$**ID**$
value
5902002H
Facility Name

(many intervening rows of items with values)

item number value
item number value
item number value
item number value
needed item no value

(many intervening rows of items with values)


PRIVILEGED AND CONFIDENTIAL INFORMATION
This transmittal and any attachments may contain PRIVILEGED AND
CONFIDENTIAL information and is intended only for the use of the
addressee. If you are not the designated recipient, or an employee
or agent authorized to deliver such transmittals to the designated
recipient, you are hereby notified that any dissemination,
copying or publication of this transmittal is strictly prohibited. If
you have received this transmittal in error, please notify us
immediately by replying to the sender and delete this copy from your
system. You may also call us at (309) 827-6026 for assistance.