"Fingerprint" (MD5 or other) from set of SPSS variables


Mike Pritchard
Is there a way to compute a variable with SPSS (I'm using V17) that is a
checksum of multiple other variables?  This might not be the right approach
anyway, so let me give a little more background.

I'm running a multi-wave study. I want to check a single wave's data for
duplicates (including between panel sources) and also to find out if
respondents are duplicates between waves.  The panel companies are taking
steps to protect against duplicates, and we are improving the process as
data is gathered.  To make it a bit more complicated, I'm not absolutely
certain that I want to eliminate duplicates between waves - partly because
one of the sample cells is a small geography with limited sample available,
and also because I'd like to evaluate changes between waves.  The study is
not set up this way, but we've just changed from an annual study to a
six-monthly one, with half the sample in each wave.  All the complexity (and perhaps
my screwy thinking) aside, I would like to be able to score for the
likelihood of a duplicate.

The variables I can access include automatically generated information about
the browser, IP, geographic information generated from the IP, and all the
survey questions.

If I can create a unique number for each response based on the variables I
select, I can check between waves.   But the values might not be exactly the
same, even if there is a duplicate, so perhaps a scoring approach would be
better. It would be nice to have the ability within SPSS to generate a
number or a score.  Failing that, I think I can do something similar with
Excel, using some VB modules.

The more I think about this, the more complicated it seems.  Perhaps the
list members have other suggestions.

Thanks
Mike
_________________________________________________________________________
Mike Pritchard | [hidden email] | 5 Circles Research | 425-444-3410 (c)
| 425-968-3883 (o)
Research to help companies build products that people buy

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: "Fingerprint" (MD5 or other) from set of SPSS variables

Jon K Peck
You could make up your own hashing algorithm, but using Python you can compute CRC, MD5, or SHA checksums or hashes.  Using this via the SPSSINC TRANS extension command would be easy, but that requires at least Statistics 18.
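
As a minimal sketch of the checksum options mentioned above, the following plain Python uses only the standard library (hashlib and zlib). The helper name fingerprint and the separator character are assumptions made for illustration; they are not part of SPSS or of SPSSINC TRANS.

```python
import hashlib
import zlib


def fingerprint(values, algorithm="md5"):
    """Return a hex digest for a sequence of values (hypothetical helper)."""
    # Join with a separator unlikely to occur in the data, so that
    # ("ab", "c") and ("a", "bc") do not produce the same payload.
    payload = "\x1f".join(str(v) for v in values).encode("utf-8")
    if algorithm == "crc32":
        # zlib.crc32 returns an int; mask it so the result is unsigned.
        return format(zlib.crc32(payload) & 0xFFFFFFFF, "08x")
    # hashlib.new accepts names such as "md5", "sha1", "sha256".
    return hashlib.new(algorithm, payload).hexdigest()


row = (11.0, "ab")
print(fingerprint(row, "crc32"))   # 8 hex characters
print(fingerprint(row, "md5"))     # 32 hex characters
print(fingerprint(row, "sha256"))  # 64 hex characters
```

CRC32 is short and cheap but collides far more readily than MD5 or SHA, so for spotting duplicate respondents one of the longer digests is probably the safer choice.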

You could also use the COMPARE DATASETS command, but I think that was introduced after version 18.  It has the advantage that it can give you a count of the number of variables that disagree for each case.

Neither of these provides any fuzz factor, so they can only test for exact agreement on a variable.  If you merged two datasets, you could compute a sum of squared differences or another similarity measure for each case, although squared differences obviously don't work for strings.  String similarity measures are available, again in a Python module, extendedTransforms.py.  They include soundex, nysiis, Levenshtein, Jaro-Winkler, and Dice.
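
To illustrate the scoring idea, here is a rough sketch that averages a per-field similarity over two records. It does not use extendedTransforms.py; it swaps in difflib.SequenceMatcher from the Python standard library for the string comparison, and the function name similarity_score and the sample values are made up for the example.

```python
from difflib import SequenceMatcher


def similarity_score(case_a, case_b):
    """Average per-field similarity in [0, 1] for two records (hypothetical helper)."""
    if not case_a or len(case_a) != len(case_b):
        raise ValueError("records must be non-empty and have the same number of fields")
    scores = []
    for a, b in zip(case_a, case_b):
        if isinstance(a, str) and isinstance(b, str):
            # Strings: similarity ratio, ignoring case and surrounding blanks.
            a_norm, b_norm = a.strip().lower(), b.strip().lower()
            scores.append(SequenceMatcher(None, a_norm, b_norm).ratio())
        else:
            # Everything else: exact agreement only.
            scores.append(1.0 if a == b else 0.0)
    return sum(scores) / len(scores)


# Invented records: browser string, postal code, and two survey answers.
wave1 = ("Firefox 24.0", "98052", 3, 5)
wave2 = ("firefox 24.0 ", "98052", 3, 4)
print(similarity_score(wave1, wave2))  # 0.75: three of four fields agree
```

A threshold on this score (say, flagging pairs above some cut-off for manual review) would have to be tuned on real data; weighting the fields unequally is another obvious variation.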


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        Mike Pritchard <[hidden email]>
To:        [hidden email],
Date:        10/08/2013 02:37 PM
Subject:        [SPSSX-L] "Fingerprint" (MD5 or other) from set of SPSS variables
Sent by:        "SPSSX(r) Discussion" <[hidden email]>





Re: "Fingerprint" (MD5 or other) from set of SPSS variables

Albert-Jan Roskam
In reply to this post by Mike Pritchard
From: Mike Pritchard <[hidden email]>
>To: [hidden email]
>Sent: Tuesday, October 8, 2013 10:36 PM
>Subject: [SPSSX-L] "Fingerprint" (MD5 or other) from set of SPSS variables
>
>
>Is there a way to compute a variable with SPSS (I'm using V17) that is a
>checksum of multiple other variables?  This might not be the right approach
>anyway, so let me give a little more background

* sample data.
DATA LIST FREE /var1 (F) var2 (A2).
BEGIN DATA
11 ab
21 cd
END DATA.

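* get the MD5 checksum (32 hex characters, hence TYPE=32).
BEGIN PROGRAM.
import hashlib

def get_checksum(*values):
    # Hash the repr() of the value tuple; encode() so md5 receives bytes (required on Python 3).
    return hashlib.md5(repr(values).encode("utf-8")).hexdigest()
END PROGRAM.
SPSSINC TRANS RESULT=checksum TYPE=32 /FORMULA "get_checksum(var1, var2)".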
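
Once every case has a checksum, exact duplicates are simply the cases that share one. The following is a sketch in plain Python, outside SPSS, of how that grouping might look; find_duplicates, the case ids, and the pooled sample rows are hypothetical, and get_checksum is repeated here only to keep the example self-contained.

```python
import hashlib
from collections import defaultdict


def get_checksum(*values):
    # Same idea as the SPSS example: hash the repr() of the value tuple.
    return hashlib.md5(repr(values).encode("utf-8")).hexdigest()


def find_duplicates(cases):
    """Map each checksum seen more than once to the case ids sharing it."""
    groups = defaultdict(list)
    for case_id, values in cases:
        groups[get_checksum(*values)].append(case_id)
    return {digest: ids for digest, ids in groups.items() if len(ids) > 1}


# Hypothetical (case id, selected values) pairs pooled from two waves.
cases = [
    ("w1-001", (11.0, "ab")),
    ("w1-002", (21.0, "cd")),
    ("w2-001", (11.0, "ab")),
]
print(sorted(find_duplicates(cases).values()))  # [['w1-001', 'w2-001']]
```

Within SPSS itself, sorting or aggregating on the checksum variable would serve the same purpose. Note that this only catches exact matches on the selected variables, so any value that differs at all (trailing blanks, case, a changed answer) produces a different checksum.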

Re: "Fingerprint" (MD5 or other) from set of SPSS variables

Mike Pritchard
In reply to this post by Jon K Peck

Thanks Jon.  Looks like a couple of good reasons for an upgrade.

Mike

 

From: Jon K Peck [mailto:[hidden email]]
Sent: Tuesday, October 08, 2013 1:58 PM
To: Mike Pritchard
Cc: [hidden email]
Subject: Re: [SPSSX-L] "Fingerprint" (MD5 or other) from set of SPSS variables

 


Re: "Fingerprint" (MD5 or other) from set of SPSS variables

Mike Pritchard
In reply to this post by Albert-Jan Roskam
Thanks Albert.  I think I need to devote some time, and brain, to the whole
idea.

Regards
Mike

-----Original Message-----
From: Albert-Jan Roskam [mailto:[hidden email]]
Sent: Wednesday, October 09, 2013 3:40 AM
To: Mike Pritchard; [hidden email]
Subject: Re: [SPSSX-L] "Fingerprint" (MD5 or other) from set of SPSS
variables
