SPSSX Discussion

identifying duplicates with multiple variables

susieqtips

Hi there,

I am working with a data set that contains a list of people who have visited centres within different ridings.

For example, the Ajax riding includes:

Visits to Ajax community Centre
Visits to Mcleans community Centre
Visits to Ajax Library
etc.

Each person must register in the riding with their address etc., and is then given a user id.

I need to find the total number of visits to the centres; however, only one visit per riding per day per user counts. So if someone visits Ajax community Centre and Mcleans community Centre on the same day, only one of those visits counts for that day.

I would therefore need to identify duplicates based on the user id and visit date.

How can I use Identify Duplicate Cases to find duplicates based on both variables, user id and visit date?

Do I just put both variables, user id and visit date, in the identify section, or do I use the other box below it that says

sort with match cases by

What does this function do?

Thanks,

Susan

Maguin, Eugene

Re: identifying duplicates with multiple variables

Susan,

So far as I know, this command sequence (sort with match cases by) is not valid.

It'd be very helpful to see a sample of your dataset structure. I'm going to
assume this structure.

ridingid  personid  date       visitloc
Ajax      101       7/21/2011  Ajax community centre
Ajax      101       7/21/2011  Ajax Library
Hamilton  101       7/31/2011  Hamilton library

Basically, this is just an aggregate command problem. The important part
is counting only one visit per person per riding per day. I'd suggest randomly
selecting the visitloc to be counted. So (from memory, untested):

compute rannum=rv.uniform(0,1).
* Sort by rannum within the break group so that it, not visitloc, decides which row comes first.
sort cases by ridingid personid date rannum.

aggregate outfile=* /break=ridingid personid date /visitloc=first(visitloc).

But why not also count the total number of locations visited each day by each
person in each riding? I'd bet that you (or somebody) could use those counts to
reconstruct the distribution of the number of visits. Something else that
enters in here is multiple visits to the same location on the same day by one
person. I don't know whether your dataset records those. Even if I have
your dataset structure wrong, this will get the discussion started.
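
For anyone who wants to check the logic outside SPSS, here is a minimal Python sketch of the same idea: group the rows by riding, person and date, pick one location at random, and keep the number of rows in each group as well. The sample rows follow the structure assumed above; the function name one_visit_per_day and its seed argument are illustrative only and have nothing to do with SPSS itself.

```python
import random
from collections import defaultdict

# Hypothetical sample rows in the assumed layout:
# (ridingid, personid, date, visitloc)
rows = [
    ("Ajax", 101, "7/21/2011", "Ajax community centre"),
    ("Ajax", 101, "7/21/2011", "Ajax Library"),
    ("Hamilton", 101, "7/31/2011", "Hamilton library"),
]

def one_visit_per_day(rows, seed=None):
    """Return {(ridingid, personid, date): (chosen_visitloc, n_rows)}."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ridingid, personid, date, visitloc in rows:
        groups[(ridingid, personid, date)].append(visitloc)
    # One randomly chosen location per group, plus the group's row count.
    return {key: (rng.choice(locs), len(locs)) for key, locs in groups.items()}

counted = one_visit_per_day(rows, seed=1)
total_visits = len(counted)   # one counted visit per person per riding per day
print(total_visits)           # 2
```

The row count kept alongside each chosen location is what would let you look at the distribution of visits per day later on.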

Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Barry

Re: identifying duplicates with multiple variables

Perhaps I'm missing something, but the SPSS command to identify duplicate
cases would work here (Data > Identify Duplicate Cases). In this case, you'd
specify the personid, date and riding as the key fields; SPSS will flag all
cases for which there is more than one row with the same values of these
three variables. It gives you the option of just flagging rows, or going
ahead and deleting them. The limitation is that SPSS allows either the last
or first case to be considered the valid case; it doesn't have any other
options.
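
To make the first-or-last behaviour concrete, here is a rough Python sketch of flagging one primary row per key group. It only illustrates the idea and is not what SPSS runs internally; flag_primary, the keep argument and the sample rows are all made up for this example.

```python
def flag_primary(rows, key, keep="first"):
    """Return a list of 1/0 flags; 1 marks the primary row of each key group."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    if keep == "first":
        order = range(len(rows))
    else:
        order = range(len(rows) - 1, -1, -1)
    seen = set()
    flags = [0] * len(rows)
    for i in order:
        k = key(rows[i])
        if k not in seen:      # first time this key is met in the chosen order
            seen.add(k)
            flags[i] = 1
    return flags

# Hypothetical rows; the key fields are personid, date and riding.
rows = [
    {"personid": 101, "date": "7/21/2011", "riding": "Ajax", "visitloc": "Ajax community centre"},
    {"personid": 101, "date": "7/21/2011", "riding": "Ajax", "visitloc": "Ajax Library"},
    {"personid": 101, "date": "7/31/2011", "riding": "Hamilton", "visitloc": "Hamilton library"},
]

def visit_key(row):
    return (row["personid"], row["date"], row["riding"])

first_flags = flag_primary(rows, visit_key, keep="first")  # [1, 0, 1]
last_flags = flag_primary(rows, visit_key, keep="last")    # [0, 1, 1]
total_visits = sum(first_flags)                            # 2 either way
```

Whichever row is treated as primary, the number of flagged rows is the same, so for a plain count of visits the first/last limitation does not matter.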

Barry

Maguin, Eugene

Re: identifying duplicates with multiple variables

Hi Barry,

I'm not sure that you are missing anything. I see your point about the identify
duplicates command. Although I have never used it, I think it would be a valid
alternative to my use of aggregate.

Gene Maguin

Rich Ulrich

Re: identifying duplicates with multiple variables

If the first information you give them is complete enough, they won't
be coming back later to get more.

If it were my data, I would be interested (eventually) in knowing
how many people visited just one place in a day and how many
visited several. And which places were seen alone. And so on.

And that partly depends on not worrying about how it is going to be done,
but on defining what you would ideally want to produce.
If a person visits the exact same place twice, does it count twice?

I think I would start by using Aggregate to put into each record
the number of visits on that day. One result that *could* be
interesting is the distribution of those per-day counts.

Is there really enough multiplicity that you do want to look in
detail at it? Would crosstabs of places be interesting? - That would
call for indicator variables for each place.

... just pondering the possibilities.
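
As a rough illustration of those possibilities, here is a small Python sketch that computes the number of places per person-day, the distribution of those counts, which places were seen alone, and one 0/1 indicator per place. The sample data and all names are hypothetical. It also assumes that a repeat visit to the same place on the same day counts once, which is exactly the question raised above and may not be what you want.

```python
from collections import Counter, defaultdict

# Hypothetical visit records: (user_id, visit_date, place)
visits = [
    (101, "7/21/2011", "Ajax community centre"),
    (101, "7/21/2011", "Ajax Library"),
    (101, "7/31/2011", "Hamilton library"),
    (102, "7/21/2011", "Ajax Library"),
]

# Distinct places per person-day (a set, so repeat visits collapse to one).
places_by_day = defaultdict(set)
for user_id, visit_date, place in visits:
    places_by_day[(user_id, visit_date)].add(place)

# How many person-days involved 1 place, 2 places, and so on.
distribution = Counter(len(places) for places in places_by_day.values())

# Places that were the only one visited on some person-day.
seen_alone = Counter(
    next(iter(places)) for places in places_by_day.values() if len(places) == 1
)

# One 0/1 indicator per place for each person-day, ready for cross-tabulation.
all_places = sorted({place for _, _, place in visits})
indicators = {
    day: [int(place in places) for place in all_places]
    for day, places in places_by_day.items()
}
```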

--
Rich Ulrich


Richard Ristow

Re: identifying duplicates with multiple variables

At 06:59 AM 8/1/2011, susieqtips wrote:

>I am working with a data set that contains a list of people who have
>come to visit centres within different ridings. For example
>
>Ajax riding includes
>Visits to Ajax community Centre
>Visits to Mcleans community Centre
>Visits to Ajax Library
>etc
>
>each person must register in the riding with their address etc then
>they are given a user id.
>
>I need to find out the total number of visits to the centres,
>however only one visit per riding per day per user counts. So if
>they visit Ajax community Centre and Mcleans community Centre in one
>day only one of those visits counts for the day.

So you need to count the number of unique occurrences of the triplet

user_id, visit_date, riding

(If I understand you correctly, it is not simply 'based on the user
id and visit date.')

Your first problem may be to identify the riding where each visit
occurs. Possibly that's a RECODE, though there are other approaches:

STRING Riding (A12).
RECODE Centre
   ('Ajax community Centre'    = 'Ajax')
   ('Mcleans community Centre' = 'Ajax')
   ('Ajax Library'             = 'Ajax')
   INTO Riding.
* Add one (centre = riding) pair per remaining centre before INTO.

Now there are various approaches. Myself, I'm a syntax person, so I'd
write syntax using AGGREGATE to get one record per "visit" as you've
defined it. The following (untested) requires SPSS 14 or later:

DATASET NAME Original WINDOW=FRONT.
DATASET DECLARE Visits.
AGGREGATE OUTFILE=Visits
/BREAK=user_id visit_date riding
/NRECS 'Number of records for this visit' = NU.

DATASET ACTIVATE Visits WINDOW=FRONT.

Now, your active dataset has one record per visit as you define it --
at most one per user per riding per day, and you can count by riding,
by month, or however you please. (You probably can ignore variable 'NRECS'.)
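
The same counting logic can be sketched in a few lines of Python, in case it helps to see what the AGGREGATE step is doing. This is only an illustration: the lookup table RIDING_OF holds just the three Ajax centres named in the thread, and the records are invented.

```python
from collections import Counter

# Hypothetical centre-to-riding lookup; extend with the remaining centres.
RIDING_OF = {
    "Ajax community Centre": "Ajax",
    "Mcleans community Centre": "Ajax",
    "Ajax Library": "Ajax",
}

# Hypothetical input records: (user_id, visit_date, centre)
records = [
    (101, "7/21/2011", "Ajax community Centre"),
    (101, "7/21/2011", "Mcleans community Centre"),
    (101, "7/22/2011", "Ajax Library"),
    (102, "7/21/2011", "Ajax Library"),
]

# One entry per unique (user_id, visit_date, riding) triplet; the value
# plays the role of NRECS, the number of raw records behind that visit.
nrecs = Counter(
    (user_id, visit_date, RIDING_OF[centre])
    for user_id, visit_date, centre in records
)

# Counted visits per riding: each unique triplet counts once.
visits_by_riding = Counter(riding for _, _, riding in nrecs)
total_visits = len(nrecs)   # 3
```

A centre missing from the lookup raises a KeyError here, which is a convenient way to catch centres that have not yet been assigned to a riding.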

Is this getting any closer?

-Best of luck,
Richard
