Identifying duplicates without running sort

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Identifying duplicates without running sort

devoidx
Hi guys, you have been very helpful in the past and I have a new problem.

I have a humongous database with 1 billion cases which is about 500gb. The data is comprised of patient visits to doctors office and each patient has its own ID value and each case is the information about a particular patient visit. Naturally the database contains for example 100 visits of the same patient so 100 cases with the same ID value which varies in other variables (treatment, diagnosis etc).

 I am trying to figure out how many unique patients does my database have (ie. how many unique ID values I have).

I tried to use the spss's native 'find duplicate cases' but that means it has to sort my 1 billion-case database and to do so its gonna take up another 500gb space which I don't have..

I should also add that my database is already sorted based on ID value. What are my options of figuring out how many unique ID values I have in the database?

Thanks!
Reply | Threaded
Open this post in threaded view
|

Re: Identifying duplicates without running sort

Richard Ristow
At 05:58 PM 9/20/2013, devoidx wrote:

>I have a humongous database with 1 billion cases. The data is
>comprised of patient visits to doctors office and each patient has
>its own ID value and each case is the information about a particular
>patient visit. I am trying to figure out how many unique patients
>does my database have (ie. how many unique ID values I have).
>
>My database is already sorted based on ID value.

This is ideal for AGGREGATE with PRESORTED.

GET FILE=Humongous.
DATASET   DECLARE Patients.
AGGREGATE OUTFILE=Patients
    /PRESORTED
    /BREAK=ID
    /NVisits 'Number of visits for this patient' = NU.

DATASET ACTIVATE  Patients WINDOW=FRONT.

*  Then, you just want the number of cases in the Patients file: .

SHOW N.

*  Or, if you want the value in an SPSS dataset (with one case   .
*  and one variable):                                            .

DATASET DECLARE HowManyPats.

COMPUTE NoBreak = 1/* This isn't necessary for most recent SPSS */.
AGGREGATE OUTFILE=HowManyPats
    /BREAK=NoBreak
    /NPats 'Total number of patients' = NU.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Identifying duplicates without running sort

devoidx
You are the man! thanks!
Reply | Threaded
Open this post in threaded view
|

Re: Identifying duplicates without running sort

drfg2008
This post was updated on .
why not simply:


IF ($casenum EQ 1) patient =1.
IF (lag(ID) NE ID) patient = 1.
FREQUENCIES patient.


if it is sorted
Dr. Frank Gaeth