SPSSX Discussion

Finding percentage of duplication variables

muh.hassan

Finding percentage of duplication variables

Hello,
I have recently encountered a problem regarding duplication in SPSS. I want to find the percentage of duplication in 100+ variables. For example, if only 80 variables are duplicated in 10 cases, then the percentage of duplication would be 80%. I could not find a way to do this in SPSS. If someone could help, I would really appreciate it.

Andy W

Re: Finding percentage of duplication variables

Where does the 80% come from? If you can specify beforehand the group of variables you want to check, you can use the menu dialog Data > Identify Duplicate Cases.

Identifying near-duplicates among a larger set of variables could be an interesting problem, though.
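
For readers following along outside SPSS, here is a minimal Python sketch of the idea behind checking a pre-specified group of variables: the first case with a given combination of key values is flagged as primary and later matches as duplicates. The function name, the toy data, and the choice of "first case is primary" are illustrative assumptions, not SPSS's implementation; which case in a group counts as primary does not change the totals.

```python
def flag_primary_cases(rows, key_vars):
    """Return one flag per row: 1 = primary (first row with its key), 0 = duplicate."""
    seen = set()
    flags = []
    for row in rows:
        key = tuple(row[v] for v in key_vars)  # values of the chosen variables
        flags.append(0 if key in seen else 1)
        seen.add(key)
    return flags


# Hypothetical toy data: four cases, two variables to check.
rows = [
    {"id": 1, "v1": 4, "v2": 9},
    {"id": 2, "v1": 4, "v2": 9},
    {"id": 3, "v1": 4, "v2": 2},
    {"id": 4, "v1": 4, "v2": 9},
]

flags = flag_primary_cases(rows, ["v1", "v2"])
pct_duplicate = 100.0 * flags.count(0) / len(flags)
print(flags, pct_duplicate)
```

Changing `key_vars` changes the answer, which is why the group of variables has to be fixed before counting.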

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

David Marso

Re: Finding percentage of duplication variables

Administrator

In reply to this post by muh.hassan

You are going to have to provide an example of what you are talking about.
Please post a simple dummy data set along with a precise definition of what you mean by duplication.

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Maguin, Eugene

Re: Finding percentage of duplication variables

In reply to this post by muh.hassan

I agree completely with both Andy and David. More information is needed to really understand what you need.

This probably isn't going to do what you want unless what you need to do is what this will do.
So, suppose the variables are id and v1 to v100. Across v1 to v100, some variables may have the same value for a given case. Some cases have no variables with the same value. There may be a case where all of v1 to v100 have the same value.
I suggest: VARSTOCASES, followed by AGGREGATE breaking on id and value and computing the NU function. (I am now assuming you are interested in the maximum number of variables having the same value for each case.) Then AGGREGATE again, breaking on id, and compute the MAX function.
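
The two-step aggregation described above can be sketched in Python for readers who want to check the logic on toy data. This is only an illustration under stated assumptions: the data and the function name are made up, missing values are represented as None and skipped, and the percentage at the end is an extra step not mentioned in the suggestion.

```python
from collections import Counter


def max_shared_value_count(row):
    """Largest number of variables in one case that hold the same value."""
    # Step 1 (like restructuring to long form, then counting per id and value):
    # count how many variables carry each value, skipping missing values.
    counts = Counter(v for v in row.values() if v is not None)
    # Step 2 (like aggregating again per id and taking the maximum).
    return max(counts.values(), default=0)


# Hypothetical toy data keyed by id, with four variables instead of 100.
cases = {
    1: {"v1": 3, "v2": 3, "v3": 3, "v4": 7},
    2: {"v1": 1, "v2": 2, "v3": 3, "v4": 4},
    3: {"v1": 5, "v2": 5, "v3": 5, "v4": 5},
}

max_per_case = {cid: max_shared_value_count(row) for cid, row in cases.items()}
# Assumed extra step: express the maximum as a share of the variables in the case.
pct_per_case = {cid: 100.0 * n / len(cases[cid]) for cid, n in max_per_case.items()}
print(max_per_case, pct_per_case)
```

A case with no repeated values gets a maximum of 1, so a result of 1 means "no duplication within the case" under this reading.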
Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of muh.hassan
Sent: Wednesday, April 20, 2016 8:01 AM
To: [hidden email]
Subject: Finding percentage of duplication variables

Hello,
I have recently encountered a problem regarding duplication in SPSS.
The problem is that I want to find percentage of duplication in 100+ variables. For example, if there are only 80 variables which are duplicate in 10 cases then the percentage of duplication would be 80%. I could not find any way to do this in SPSS. If someone could help I would really appreciate it.

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Finding-percentage-of-duplication-variables-tp5731966.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

David Marso

Re: Finding percentage of duplication variables

Administrator

"This probably isn't going to do what you want unless what you need to do is what this will do. "

I need to add that to my sig ;-)

Bruce Weaver

Re: Finding percentage of duplication variables

Administrator

Well spotted, David! But let's not be too hard on Gene. IIRC, he's at U of Buffalo, so he has probably just been listening to too many political speeches recently.* ;-)

* For the benefit of those who don't follow US politics, or who are reading this in the archives, the Democratic & Republican primaries for NY State happened yesterday.

RESULTS:
http://www.nytimes.com/elections/results/new-york
http://www.bbc.com/news/election-us-2016-36084957

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

muh.hassan

Re: Finding percentage of duplication variables

In reply to this post by Maguin, Eugene

Sorry, everyone, for being too hasty in typing my first post. Let me elaborate on my problem.
Suppose we have variables v1 to v50 and, for each case, every variable's value is the same except for 5 variables, let's say v26 to v30.

Case : 1
Duplication for just one variable v1 whose value is same for each case
Duplicate case = 49, Percent = 98
Primary case = 1, Percent = 2

Case : 2
Duplication for the five variables v26 to v30 whose values are unique for each case
Primary case = 50, Percent =100

Case : 3
Duplication for all the 50 variables v1 to v50 which gives
Primary case = 50 , Percent = 100

The problem lies in Case # 3. Duplication exists for 45 of the 50 variables, but the tool did not display that information. I need to get the information as percentages for all 50 variables, i.e., in this case:

Duplication case = 44, Percent = 88
Primary case = 5, Percent = 10
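
One possible reading of the three examples above can be sketched in Python. It assumes 50 cases, that "same value" means constant across cases, and that "primary" means the number of distinct value combinations; none of this is confirmed in the post. Under those assumptions the sketch reproduces the counts in Cases 1 to 3 and the "45 variables out of 50" figure, but not the 44 and 5 above, whose derivation is not explained.

```python
def distinct_count(rows, variables):
    """Number of distinct value combinations across rows ("primary" cases)."""
    return len({tuple(row[v] for v in variables) for row in rows})


# Hypothetical data: 50 cases, v1..v50 constant except v26..v30, which are unique per case.
n_cases = 50
names = [f"v{i}" for i in range(1, 51)]
varying = {f"v{i}" for i in range(26, 31)}
rows = [
    {name: (case if name in varying else 1) for name in names}
    for case in range(n_cases)
]

primary_v1 = distinct_count(rows, ["v1"])               # Case 1: v1 alone
primary_v26_30 = distinct_count(rows, sorted(varying))  # Case 2: v26 to v30
primary_all = distinct_count(rows, names)               # Case 3: all 50 variables

# Variables that show any duplication when checked one at a time.
dup_vars = [n for n in names if distinct_count(rows, [n]) < n_cases]
pct_dup_vars = 100.0 * len(dup_vars) / len(names)
print(primary_v1, primary_v26_30, primary_all, len(dup_vars), pct_dup_vars)
```

Checking all 50 variables jointly hides the 45 constant ones because a single varying variable is enough to make every case unique; checking each variable separately is one way to surface them.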

David Marso

Re: Finding percentage of duplication variables

Administrator

Please review the responses to your query and address each of them.
I for one requested a simple sample data example and the results from such.
What are we supposed to do with this second post? It is useless for answering your question.
Pretend we are NOT looking over your shoulder and reading your mind!
You need to 'elaborate' on what you mean by duplicate and provide an example of inputs and desired outputs.
It seems your second post has been far too hasty as well.

'but the tool did not display that information.'
What tool?

Bruce Weaver

Re: Finding percentage of duplication variables

Administrator

Whatsamatta U, David? Is your ESPss on the fritz again? ;-)

Art Kendall

Re: Finding percentage of duplication variables

In reply to this post by muh.hassan

Please elaborate on your question.
What is the context? What is a case?
Are you looking for:
-- duplicate cases?
-- pattern responding to a test/questionnaire?
etc.

Art Kendall
Social Research Consultants

David Marso

Re: Finding percentage of duplication variables

Administrator

In reply to this post by Bruce Weaver

More than likely. More likely is that I'm too lazy to help people who won't help themselves by providing clear examples of what they need to sort out ;-)
--

Rich Ulrich

Re: Finding percentage of duplication variables

In reply to this post by muh.hassan

I think you are going to have to overcome your shyness and tell us where
the data come from. Depending on how informative that is, you might
also have to be explicit and say: What do the 50 variables measure? (Are
they all the same thing?) What does a line of data represent? And, see below -

> Date: Thu, 21 Apr 2016 01:02:35 -0700
> From: [hidden email]
> Subject: Re: Finding percentage of duplication variables
> To: [hidden email]
>
> Sorry everyone for being too hasty in typing my first post. Let me just
> elaborate my problem.
> Suppose we have var v1 to v50, for each case every variable value is same
> except for 5 variables lets say v26 to v30.

Does "case", here, mean "line", or is it a reference to the "Case: 1", etc.,
listed below?

If there are 10 lines does that mean:
- there may well be 450 values that are all the same;
- there may well be 45 values on each line that are the same;
- there may well be duplication between lines, so that v1 shows one value,
v2 shows another, etc.

>
> Case : 1
Does this merely mean "example 1", as I expect? I will treat it as such.

> Duplication for just one variable v1 whose value is same for each case
> Duplicate case = 49, Percent = 98
> Primary case = 1, Percent = 2

Why is "Duplicate case = 49"? With "Percent = 98", this seems to be taken
from a total of 50. Does 50 also represent the number of lines? Should this
have been "duplicate cases for v1", followed similarly for "Primary cases"
(i.e., unique values) for v1?

>
> Case : 2
> Duplication for the five variables v26 to v30 whose values are unique for
> each case
> Primary case = 50, Percent =100

Well, if v26-v30 each have unique values for the whole dataset,
each line has a unique value and there would be 50 "Primary cases" for them.
I don't understand where the "duplication for the five variables" comes in,
unless you are saying that on each line, all the other values can be matched
to one of {v26 to v30}.

>
> Case : 3
> Duplication for all the 50 variables v1 to v50 which gives
> Primary case = 50 , Percent = 100

"Duplication for all the 50 variables" seems to contradict the overall
specification, that v26 to v30 are not "the same". And it seems to imply
that 50-lines-times-50-variables gives 2500 values that are the same, so
there would be only 1 (unique) primary case.

>
> The problem lies in Case # 3. The duplication exists for 45 variables out of
> 50 but the tool did not display that information. I need to get the
> information in percentage for all the 50 variables i.e in this case
>
> Duplication case = 44, Percent = 88
> Primary case = 5, Percent = 10
>
Well, I find this totally obscure. Where does 44 come from, and 5, and why
do they /not/ add up to 50 lines?

--
Rich Ulrich

David Marso

Re: Finding percentage of duplication variables

Administrator

I'll bet your brain really hurts now Rich ;-)
