Identifying duplicates with multiple variables


susieqtips
Hi there,

I am working with a data set that contains a list of people who have come to visit centres within different ridings.

For example, the Ajax riding includes:

Visits to Ajax community Centre
Visits to Mcleans community Centre
Visits to Ajax Library
etc.

Each person must register in the riding with their address etc., and is then given a user id.

I need to find the total number of visits to the centres; however, only one visit per riding per day per user counts. So if they visit Ajax community Centre and Mcleans community Centre in one day, only one of those visits counts for the day.

I would therefore need to identify duplicates based on the user id and visit date.

How can I use Identify Duplicate Cases to find duplicates based on both the user id and visit date variables?

Do I just put both variables, user id and visit date, in the identify section, or do I use the other box below that says

sort with match cases by

What does this function do?

thanks

Susan

Re: Identifying duplicates with multiple variables

Maguin, Eugene
Susan,

So far as I know, this command sequence (sort with match cases by) is not valid.

It'd be very helpful to see a sample of your dataset structure. I'm going to
assume this structure.

ridingid  personid  date       visitloc
Ajax      101       7/21/2011  Ajax community centre
Ajax      101       7/21/2011  Ajax Library
Hamilton  101       7/31/2011  Hamilton library

Basically, this is just an aggregate command problem. The important part
is counting only one visit per person per riding per day. I'd suggest randomly
selecting the visitloc to be counted. Sorting on rannum within each
riding/person/date group is what makes the first visitloc a random one. So:

* I think the compute line is correct but it is from memory.
compute rannum=uniform(1).
sort cases by ridingid personid date rannum.

aggregate outfile=*/break=ridingid personid date/visitloc=first(visitloc).

But why not also count the total number of locations visited each day by riding
by person? I'd bet that you (or somebody) could use that data to
reconstruct the distribution of numbers of visits. Something else that
enters in here is multiple visits to a location on the same day by a specific
person. I don't know whether these data are in your dataset. Even if I have
your dataset structure wrong, this will get the discussion started.
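
The count suggested in the last paragraph could be collected in the same pass. What follows is a minimal, untested sketch that would be run in place of the aggregate command above, not after it. It assumes the variable names from the assumed structure; nrecs is an illustrative name. Note that it counts records, so repeat visits to the same location on one day are included in the count.

```spss
* Sketch only. Variable names follow the assumed structure above.
* Keeps one row per riding, person and day, plus the number of records collapsed.
AGGREGATE OUTFILE=*
  /BREAK=ridingid personid date
  /visitloc=FIRST(visitloc)
  /nrecs=NU.
* Shows how many counted visits came from one record, two records, and so on.
FREQUENCIES VARIABLES=nrecs.
```

After this runs, the number of cases in the file is the number of counted visits, and nrecs shows how many raw records each one replaced.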

Gene Maguin






=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Identifying duplicates with multiple variables

Barry
Perhaps I'm missing something, but the SPSS command to identify duplicate cases
would work here (Data > Identify Duplicate Cases). In this case, you'd specify
personid, date and riding as the key fields; SPSS will flag all cases for which
there is more than one row with the same values of these three variables.
It gives you the option of just flagging rows, or going ahead and deleting
them. The limitation is that SPSS allows only the first or the last case to be
considered the valid case; it doesn't have any other options.
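
For anyone working in syntax rather than through the dialog, the same kind of flagging can be sketched as below. This is a minimal, untested sketch that assumes the variable names ridingid, personid and date from Gene's assumed structure; PrimaryFirst is only an illustrative name for the flag variable.

```spss
* Sketch only. Variable names are assumed and PrimaryFirst is illustrative.
* Sorts so that cases with the same riding, person and date are adjacent.
SORT CASES BY ridingid personid date.
* Sets PrimaryFirst to 1 on the first case of each group and 0 on the rest.
MATCH FILES /FILE=*
  /BY ridingid personid date
  /FIRST=PrimaryFirst.
* Tabulates counted visits (1) against duplicates (0).
FREQUENCIES VARIABLES=PrimaryFirst.
* Optional step that permanently drops the duplicates.
SELECT IF (PrimaryFirst = 1).
```

Adding further variables to the end of the SORT CASES list controls which case comes first within each group. That is probably the purpose of the second box in the dialog that the original question asks about.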

Barry




Re: Identifying duplicates with multiple variables

Maguin, Eugene
In reply to this post by susieqtips
Hi Barry,

I'm not sure that you are missing anything. I see your point about the identify
duplicates command. Although I have never used it, I think it would be a valid
alternative to my use of aggregate.

Gene Maguin


Re: Identifying duplicates with multiple variables

Rich Ulrich
If the first information you give them is complete enough, they won't
be coming back later to get more.

If it were my data, I would be interested (eventually) in knowing
how many people visited just one place in a day and how many
visited several.  And which places were seen alone.  And so on.

And that partly depends on not-worrying how it is going to be done,
but looking to define what you would ideally want to produce.
If a person visits the exact same place twice, does it count twice?

I think I would start by using Aggregate to put into each record
the number of visits on that day.  - One count that *could*  be
interesting is what counts result when you look at just those.
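
Putting the count into each record, as described above, might look like the following. This is a minimal, untested sketch: the variable names ridingid, personid and date are taken from the structure assumed earlier in the thread, nvisits is an illustrative name, and it assumes a version of SPSS whose AGGREGATE supports MODE=ADDVARIABLES.

```spss
* Sketch only. Variable names are assumed and nvisits is illustrative.
* Adds to every original record the number of records in its group.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=ridingid personid date
  /nvisits=NU.
* Tabulates records by group size, so a group of three contributes three records.
FREQUENCIES VARIABLES=nvisits.
```

Dropping ridingid from the BREAK list would give the count per person per day across all ridings instead.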

Is there really enough multiplicity that you do want to look in
detail at it?  Would crosstabs of places be interesting?  - That would
call for indicator variables for each place.

 ... just pondering the possibilities.

--
Rich Ulrich




Re: Identifying duplicates with multiple variables

Richard Ristow
In reply to this post by susieqtips
At 06:59 AM 8/1/2011, susieqtips wrote:

>I am working with a data set that contains a list of people who have
>come to visit centres within different ridings. For example
>
>Ajax riding includes
>Visits to Ajax community Centre
>Visits to Mcleans community Centre
>Visits to Ajax Library
>etc
>
>each person must register in the riding with their address etc then
>they are given a user id.
>
>I need to find out the total number of visits to the centres,
>however only one visit per riding per day per user counts. So if
>they visit Ajax community Centre and Mcleans community Centre in one
>day only one of those visits counts for the day.

So you need to count the number of unique occurrences of the triplet

    user_id, visit_date, riding

(If I understand you correctly, it is not simply 'based on the user
id and visit date.')

Your first problem may be to identify the riding where each visit
occurs. Possibly that's a RECODE, though there are other approaches:

* Add one (value = riding) pair for each remaining centre.
STRING Riding (A12).
RECODE Centre
    ('Ajax community Centre'     = 'Ajax')
    ('Mcleans community Centre'  = 'Ajax')
    ('Ajax Library'              = 'Ajax')
    INTO Riding.

Now there are various approaches. Myself, I'm a syntax person, so I'd
write syntax using AGGREGATE to get one record per "visit" as you've
defined it. The following (untested) requires SPSS 14 or later:

DATASET NAME      Original WINDOW=FRONT.
DATASET DECLARE   Visits.
AGGREGATE OUTFILE=Visits
    /BREAK=user_id  visit_date riding
    /NRECS 'Number of records for this visit' = NU.

DATASET ACTIVATE  Visits   WINDOW=FRONT.

Now, your active dataset has one record per visit as you define it --
at most one per user per riding per day, and you can count by riding,
by month, or however you please. (You probably can ignore variable 'NRECS'.)
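
Counting by riding or by month from that dataset could be sketched as follows. This is untested and rests on two assumptions that are not stated above: that Visits is the active dataset, and that visit_date is a numeric SPSS date variable. visit_month is an illustrative name.

```spss
* Sketch only, untested. Assumes Visits is the active dataset.
* Each record is one counted visit, so the frequencies are visit totals per riding.
FREQUENCIES VARIABLES=riding.
* Assumes visit_date is a numeric SPSS date variable.
COMPUTE visit_month = XDATE.MONTH(visit_date).
* Counted visits per riding, broken down by calendar month.
CROSSTABS /TABLES=riding BY visit_month.
```

If the data span more than one year, the year would need to be extracted as well (for example with XDATE.YEAR) so that the same month in different years is not pooled.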

Is this getting any closer?

-Best of luck,
  Richard
