SPSSX Discussion

Duplicate checking

Kunal

Duplicate checking

Hi, I have a dataset with data like this:

1 1 52 52 (row-wise). Could you please let me know whether there is any method that shows that there are only 2 unique records?

Maguin, Eugene

Re: Duplicate checking

You've left so much very relevant information undefined. Are there a) four variables per record or b) 40 variables, for example, per record? If (a) have you tried the Identify Duplicates from the Data dropdown? If (b), can the target values (1 1 52 52) occur in b1) ANY four of the 40 variables or b2) in exactly four of the 40 variables? If (b2), see (a) response above. If (b1), provide details, example data records.
Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Kunal
Sent: Monday, November 16, 2015 3:35 PM
To: [hidden email]
Subject: Duplicate checking

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Duplicate-checking-tp5730951.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Kunal

Re: Duplicate checking

Hi Maguin,

Many thanks for your response.

Here is a clearer explanation of my query.

I have variables q26_1 to q26_11.

respid  q26_1  q26_2  q26_3  q26_4  q26_5  q26_6  q26_7  q26_8  q26_9  q26_10  q26_11
    40     52      1      1     52      1      1      1      1      1       1       1
    58     44      2     44     44     22     44     44     44     44       6      44

For respid 40, I want it to show that we have only 2 unique records;
for respid 58, it needs to show only 4 records, i.e., 44, 2, 22, 6.

Regards,
Kunal Chamoli

Maguin, Eugene

Re: Duplicate checking

>>I'm really confused (and it just might be language issues).

respid q26_1 q26_2 q26_3 q26_4 q26_5 q26_6 q26_7 q26_8 q26_9 q26_10 q26_11
40 52 1 1 52 1 1 1 1 1 1 1
58 44 2 44 44 22 44 44 44 44 6 44

For Respid 40 I want it shows that we have only 2 unique records, Respid 58 it needs to show only 4 records i.e 44,2,22,6.

>>Ok. I see that for the set of variables q26_1 to q26_11, record 40 has two unique values and record 58 has four unique values.

>>I want to get a very concrete definition of the problem you need to solve.
>>Your example shows a wide format file, is that the true data structure or do you have a long format file?

>>In your example, you know that ID=40 has two unique values and ID=58 has four. What happens next?
>>Suppose ID=75 has the same value for every variable, i.e., only one unique value. What then?

Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Kunal
Sent: Monday, November 16, 2015 4:24 PM
To: [hidden email]
Subject: Re: Duplicate checking

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Duplicate-checking-tp5730951p5730953.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

Nirit Avnimelech

Re: Duplicate checking

In reply to this post by Kunal

Do you want to know how many unique values there are in each record, or what
these unique values are?
If you want to delete values that have already occurred, this macro will do
the trick:

DEFINE dupDelete (vars= !charend(';') / num=!charend(';') ).
vector x=!vars .
loop #first=!num to 1 by -1 .
loop #nest=1 to #first-1.
if (x(#first)=x(#nest)) x(#first)=$sysmis.
end loop.
end loop.
!ENDDEFINE.

Run this macro, and then call it with the relevant variables and the number
of variables in that list (num is the upper bound of the vector index, not a
data value). For example, for the 11 variables q26_1 to q26_11:
dupDelete vars=q26_1 to q26_11; num=11.

Remember: this macro will delete duplicate values, and will only leave the
first ones.
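
For anyone who wants to check the logic outside SPSS, here is a minimal Python sketch of the same idea: keep the first occurrence of each value in a row and blank every later repeat. The function name blank_later_duplicates is made up for this illustration, and None stands in for SPSS system-missing; it is not a translation of the macro itself.

```python
def blank_later_duplicates(values):
    """Return a copy of `values` where each repeat of an earlier value is None.

    None plays the role of SPSS system-missing in this sketch.
    """
    seen = set()
    result = []
    for v in values:
        if v is None or v in seen:
            # Already missing, or a repeat of an earlier value: blank it.
            result.append(None)
        else:
            seen.add(v)
            result.append(v)
    return result


# The two sample rows from the thread (q26_1 to q26_11).
row_40 = [52, 1, 1, 52, 1, 1, 1, 1, 1, 1, 1]
row_58 = [44, 2, 44, 44, 22, 44, 44, 44, 44, 6, 44]

print(blank_later_duplicates(row_40))
print(blank_later_duplicates(row_58))
```

Counting the entries that are not None afterwards gives the number of distinct values per row (2 and 4 for the sample data).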

Nirit

Bruce Weaver

Re: Duplicate checking

Administrator

In reply to this post by Kunal

One source of confusion, I think, is that Kunal is saying "records" where "values" is the more appropriate term.

I also think the solution to this problem is easier (and more transparent) if you restructure the data from WIDE to LONG. The following can probably be tidied up a bit, but I think it does what Kunal is asking for.

* Read in the original sample data to illustrate.
NEW FILE.
DATASET CLOSE all.
DATA LIST list /
respid q26_1 TO q26_11.
BEGIN DATA
40 52 1 1 52 1 1 1 1 1 1 1
58 44 2 44 44 22 44 44 44 44 6 44
END DATA.
FORMATS all (F5.0).
DATASET NAME Original.

* Make a copy of the original data.
DATASET COPY DataCopy.

* Restructure the copy from WIDE to LONG.
DATASET ACTIVATE DataCopy.
VARSTOCASES
/MAKE q26 FROM q26_1 TO q26_11
/INDEX=Index1(11)
/KEEP=respid
/NULL=KEEP.

* Sort cases by ID and response to Q26, then flag
* the first instance of each value within ID.

SORT CASES by respid q26.
MATCH FILES
FILE = * /
FIRST = FirstRec /
BY respid q26.
EXECUTE.
LIST.

* Use AGGREGATE to write a new dataset containing
* one row per ID with variable UniqueValues holding
* the number of unique scores for that ID.

DATASET DECLARE UniqueVals.
AGGREGATE
/OUTFILE='UniqueVals'
/BREAK=respid
/Q26UniqueValues=SUM(FirstRec).

* Merge the UniqueVals dataset with the original dataset.

MATCH FILES
FILE = 'Original' /
FILE = 'UniqueVals' /
BY respid.
EXECUTE.
FORMATS Q26UniqueValues(F2.0).
DATASET NAME Done.

DATASET ACTIVATE Done.
* Close all unneeded datasets.
DATASET CLOSE all.
LIST respid Q26UniqueValues.

Output from the final LIST command:

respid Q26UniqueValues

40 2
58 4

Number of cases read: 2 Number of cases listed: 2
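
The same restructure, sort, flag, and sum steps can be sketched in plain Python as a way to sanity-check the result. This is only an illustration under the assumption that the data fit in memory as a dict keyed by respid; the variable names (wide, long_rows, flagged, unique_counts) are invented here and do not correspond to SPSS objects.

```python
# Wide data: one entry per respid, holding q26_1 to q26_11.
wide = {
    40: [52, 1, 1, 52, 1, 1, 1, 1, 1, 1, 1],
    58: [44, 2, 44, 44, 22, 44, 44, 44, 44, 6, 44],
}

# WIDE to LONG: one (respid, value) pair per original cell.
long_rows = [(respid, value) for respid, values in wide.items() for value in values]

# Sort by respid and value so equal values within an ID are adjacent.
long_rows.sort()

# Flag the first instance of each (respid, value) pair with 1, repeats with 0.
flagged = []
previous = None
for row in long_rows:
    flagged.append((row[0], row[1], 1 if row != previous else 0))
    previous = row

# Sum the flags within each respid to get the number of unique values.
unique_counts = {}
for respid, _value, first in flagged:
    unique_counts[respid] = unique_counts.get(respid, 0) + first

print(unique_counts)  # {40: 2, 58: 4}
```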

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

David Marso

Re: Duplicate checking

Administrator

This is in the same spirit as Bruce's solution, but it is a bit more streamlined and preserves a bit more information.
NEW FILE.
DATASET CLOSE all.
DATA LIST list /
respid q26_1 TO q26_11.
BEGIN DATA
40 52 1 1 52 1 1 1 1 1 1 1
58 44 2 44 44 22 44 44 44 44 6 44
END DATA.
FORMATS all (F5.0).
DATASET NAME Original.

/* Make a copy of the original data */.
DATASET COPY DataCopy.
/* Restructure the copy from WIDE to LONG */.
DATASET ACTIVATE DataCopy.
VARSTOCASES
/MAKE q26 FROM q26_1 TO q26_11
/KEEP=respid .

/* Determine unique values and their count */.
AGGREGATE OUTFILE * / BREAK respid q26/ Count=N.

/* Determine number of Unique values */.
AGGREGATE OUTFILE * MODE ADDVARIABLES / BREAK respid / NumUnique=N.

/* Now put into one record per respid */.
CASESTOVARS / ID=respid.
LIST.

respid  NumUnique  q26.1  q26.2  q26.3  q26.4  Count.1  Count.2  Count.3  Count.4

    40          2      1     52      .      .        9        2        .        .
    58          4      2      6     22     44        1        1        1        8

Number of cases read: 2 Number of cases listed: 2
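
As a rough cross-check of the table above, the per-ID distinct values and their counts can be reproduced with Python's collections.Counter. This is a sketch, not a replacement for the AGGREGATE / CASESTOVARS route; the dict layout and the name summary are assumptions made for the example.

```python
from collections import Counter

# Wide data: one entry per respid, holding q26_1 to q26_11.
wide = {
    40: [52, 1, 1, 52, 1, 1, 1, 1, 1, 1, 1],
    58: [44, 2, 44, 44, 22, 44, 44, 44, 44, 6, 44],
}

summary = {}
for respid, values in wide.items():
    counts = Counter(values)  # value -> number of occurrences in this case
    summary[respid] = {
        "num_unique": len(counts),
        # Sort by value so the order matches the q26.1, q26.2, ... columns.
        "counts": dict(sorted(counts.items())),
    }

for respid, info in summary.items():
    print(respid, info["num_unique"], info["counts"])
```

For the sample data this gives 2 distinct values for respid 40 (1 occurring 9 times, 52 twice) and 4 for respid 58 (2, 6 and 22 once each, 44 eight times), which agrees with the LIST output.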

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Kirill Orlov

Re: Duplicate checking

In reply to this post by Kunal

If you want to know how many distinct values there are and how many duplicated values there are within each case, please check the macro !hcount (in the "Horizontal tools" collection) on my web page http://www.spsstools.net/en/KO-spssmacros.

It has further options (for example, showing only how many values occur k times), but it does not display the list of the values themselves. If you need the list of values, you have to transpose your data and run Aggregate or Frequencies.
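
To make the "values occurring k times" idea concrete, here is a small Python sketch. It is not a reimplementation of !hcount and makes no claim about that macro's interface; the function name count_values_occurring is hypothetical.

```python
from collections import Counter


def count_values_occurring(values, k):
    """Number of distinct values that occur exactly k times within one case."""
    return sum(1 for n in Counter(values).values() if n == k)


row_40 = [52, 1, 1, 52, 1, 1, 1, 1, 1, 1, 1]
row_58 = [44, 2, 44, 44, 22, 44, 44, 44, 44, 6, 44]

# Distinct values per case.
print(len(set(row_40)), len(set(row_58)))

# Values occurring exactly once in respid 58 (2, 22 and 6).
print(count_values_occurring(row_58, 1))

# Values occurring exactly twice in respid 40 (only 52).
print(count_values_occurring(row_40, 2))
```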

16.11.2015 23:34, Kunal wrote:

Hi I have a dataset which are having data as

1 1 52 52 Row wise could you please let me know is there any method which
shows that there are only 2 unique records. 




Andy W

Re: Duplicate checking

In reply to this post by David Marso

If you really only need the number of unique values, a one-line solution using SPSSINC TRANS and Python sets is given below. The native SPSS solutions let you calculate more information along the way, though.

**********************************************.
DATA LIST list /
respid q26_1 TO q26_11.
BEGIN DATA
40 52 1 1 52 1 1 1 1 1 1 1
58 44 2 44 44 22 44 44 44 44 6 44
END DATA.
FORMATS all (F5.0).
DATASET NAME Original.

SPSSINC TRANS RESULT=Unique TYPE=0
/VARIABLES q26_1 TO q26_11
/FORMULA "len(set([<>]))".
**********************************************.

As a description of the code, set(x) in Python returns the unique elements of a list (unordered). len(y) just returns the length of the object, which in this instance corresponds to the number of elements in the set.
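
The len(set(...)) idea can be tried in plain Python, outside SPSS, on the two sample rows. The helper n_unique and its ignore_missing switch are assumptions added for illustration; in particular, the sketch assumes missing cells arrive as None, in which case a bare len(set(...)) would count None as one more distinct value.

```python
def n_unique(values, ignore_missing=True):
    """Number of distinct values in one row.

    With ignore_missing=True, None entries (assumed here to represent
    missing cells) are dropped before counting.
    """
    if ignore_missing:
        values = [v for v in values if v is not None]
    return len(set(values))


row_40 = [52, 1, 1, 52, 1, 1, 1, 1, 1, 1, 1]
row_58 = [44, 2, 44, 44, 22, 44, 44, 44, 44, 6, 44]

print(n_unique(row_40))  # 2
print(n_unique(row_58))  # 4

# A row with a missing cell: None is counted only if ignore_missing is False.
print(n_unique([5, 5, None]))                        # 1
print(n_unique([5, 5, None], ignore_missing=False))  # 2
```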

The [<>] notation is specific to how SPSSINC TRANS lets you pass a set of multiple variables as arguments. I have a blog post with a fuller description, https://andrewpwheeler.wordpress.com/2015/05/07/passing-arguments-to-spssinc-trans-2/.

The base SPSS code in this instance is pretty straightforward, but Python sets are a good tool to keep in mind for things that are a little trickier in SPSS code, like the intersection or difference of sets, or testing whether one set is a subset of another.
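
A quick sketch of those set operations in plain Python, using the two sample rows plus a made-up comparison set (allowed is hypothetical and not part of the original data):

```python
row_40 = [52, 1, 1, 52, 1, 1, 1, 1, 1, 1, 1]
row_58 = [44, 2, 44, 44, 22, 44, 44, 44, 44, 6, 44]

a = set(row_40)  # distinct values for respid 40
b = set(row_58)  # distinct values for respid 58

# Hypothetical set of values to compare against.
allowed = {1, 44, 52}

# Intersection: values of respid 40 that are also in `allowed`.
print(sorted(a & allowed))  # [1, 52]

# Difference: values of respid 58 that are NOT in `allowed`.
print(sorted(b - allowed))  # [2, 6, 22]

# Subset tests: is every value of the row contained in `allowed`?
print(a <= allowed)  # True
print(b <= allowed)  # False
```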

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Art Kendall

Re: Duplicate checking

In reply to this post by Kunal

Are you asking how many distinct values there are across all of the variables in a case?

Art Kendall
Social Research Consultants

Art Kendall

Re: Duplicate checking

If so, and you also want to know what those values are, see MULT RESPONSE.

Art Kendall
Social Research Consultants