SPSSX Discussion

Identifying duplicates without running sort

Classic

List

Threaded

4 messages Options

devoidx

Identifying duplicates without running sort

Hi guys, you have been very helpful in the past and I have a new problem.

I have a humongous database with 1 billion cases which is about 500gb. The data is comprised of patient visits to doctors office and each patient has its own ID value and each case is the information about a particular patient visit. Naturally the database contains for example 100 visits of the same patient so 100 cases with the same ID value which varies in other variables (treatment, diagnosis etc).

I am trying to figure out how many unique patients does my database have (ie. how many unique ID values I have).

I tried to use the spss's native 'find duplicate cases' but that means it has to sort my 1 billion-case database and to do so its gonna take up another 500gb space which I don't have..

I should also add that my database is already sorted based on ID value. What are my options of figuring out how many unique ID values I have in the database?

Thanks!

Richard Ristow

Re: Identifying duplicates without running sort

At 05:58 PM 9/20/2013, devoidx wrote:

>I have a humongous database with 1 billion cases. The data is
>comprised of patient visits to doctors office and each patient has
>its own ID value and each case is the information about a particular
>patient visit. I am trying to figure out how many unique patients
>does my database have (ie. how many unique ID values I have).
>
>My database is already sorted based on ID value.

This is ideal for AGGREGATE with PRESORTED.

GET FILE=Humongous.
DATASET DECLARE Patients.
AGGREGATE OUTFILE=Patients
/PRESORTED
/BREAK=ID
/NVisits 'Number of visits for this patient' = NU.

DATASET ACTIVATE Patients WINDOW=FRONT.

* Then, you just want the number of cases in the Patients file: .

SHOW N.

* Or, if you want the value in an SPSS dataset (with one case .
* and one variable): .

DATASET DECLARE HowManyPats.

COMPUTE NoBreak = 1/* This isn't necessary for most recent SPSS */.
AGGREGATE OUTFILE=HowManyPats
/BREAK=NoBreak
/NPats 'Total number of patients' = NU.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

devoidx

Re: Identifying duplicates without running sort

You are the man! thanks!

drfg2008

Re: Identifying duplicates without running sort

This post was updated on .

why not simply:

IF ($casenum EQ 1) patient =1.
IF (lag(ID) NE ID) patient = 1.
FREQUENCIES patient.

if it is sorted

Dr. Frank Gaeth