Is there a way to compute a variable with SPSS (I'm using V17) that is a checksum of multiple other variables? This might not be the right approach anyway, so let me give a little more background.

I'm running a multi-wave study. I want to check a single wave's data for duplicates (including between panel sources) and also to find out if respondents are duplicates between waves. The panel companies are taking steps to protect against duplicates, and we are improving the process as data is gathered.

To make it a bit more complicated, I'm not absolutely certain that I want to eliminate duplicates between waves, partly because one of the sample cells is a small geography with limited sample available, and also because I'd like to evaluate changes between waves. The study is not set up this way, but we've just changed from an annual study to a six-monthly one with half the sample in each wave.

All the complexity (and perhaps my screwy thinking) aside, I would like to be able to score for the likelihood of a duplicate. The variables I can access include automatically generated information about the browser, IP, geographic information generated from the IP, and all the survey questions. If I can create a unique number for each response based on the variables I select, I can check between waves. But the values might not be exactly the same, even if there is a duplicate, so perhaps a scoring approach would be better.

It would be nice to have the ability within SPSS to generate a number or a score. Failing that, I think I can do something similar with Excel, using some VB modules. The more I think about this, the more complicated it seems. Perhaps the list members have other suggestions.
Thanks
Mike

_________________________________________________________________________
Mike Pritchard | [hidden email] | 5 Circles Research | 425-444-3410 (c) | 425-968-3883 (o)
Research to help companies build products that people buy

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command.
To leave the list, send the command SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command INFO REFCARD
You could make up your own hashing algorithm,
but using Python you can compute CRC, MD5, or SHA checksums or hashes.
Using this via the SPSSINC TRANS extension command would be easy,
but that requires at least Statistics 18.
You could also use the COMPARE DATASETS command, but I think that was introduced after version 18. It has the advantage that it can give you a count of the number of variables that disagree for each case.

Neither of these provides any fuzz factor, so they can only test for exact agreement for a variable. If you merged two datasets, you could compute a sum of squared differences or other similarity measure for each case, although for strings squared differences obviously don't work. There are string similarity measures available, again, in a Python module - extendedTransforms.py. They include soundex, nysiis, Levenshtein, Jaro-Winkler, and Dice.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621
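A minimal sketch of the scoring idea described above, in plain Python outside SPSS. It does not use extendedTransforms.py; it substitutes the standard library's difflib.SequenceMatcher for the string measures, and maps a squared difference onto a 0..1 scale for numbers. The function names, the variable names (browser, city, age), the sample values, and the scale of 5.0 are all hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Return a 0..1 similarity ratio for two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def numeric_similarity(a, b, scale):
    """Map the scaled squared difference of two numbers onto 0..1 (1 means identical)."""
    return 1.0 / (1.0 + ((a - b) / scale) ** 2)


def duplicate_score(case_a, case_b, string_vars, numeric_vars):
    """Average the per-variable similarities of two cases held as dicts.

    string_vars is a list of variable names; numeric_vars maps a name to its scale.
    """
    scores = [string_similarity(case_a[v], case_b[v]) for v in string_vars]
    scores += [numeric_similarity(case_a[v], case_b[v], scale)
               for v, scale in numeric_vars.items()]
    return sum(scores) / len(scores)


# Hypothetical cases: wave2 is a likely repeat of wave1, other is unrelated.
wave1 = {"browser": "Firefox 24.0", "city": "Seattle", "age": 41}
wave2 = {"browser": "Firefox 25.0", "city": "seattle ", "age": 42}
other = {"browser": "Safari 6.0", "city": "Portland", "age": 23}

score_same = duplicate_score(wave1, wave2, ["browser", "city"], {"age": 5.0})
score_other = duplicate_score(wave1, other, ["browser", "city"], {"age": 5.0})
print(round(score_same, 3), round(score_other, 3))
```

A score near 1 flags a probable duplicate and a low score an unrelated case; where to put the cutoff, and how to weight each variable, would need to be tuned against known duplicates in the actual data.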
In reply to this post by Mike Pritchard
From: Mike Pritchard <[hidden email]>
>To: [hidden email]
>Sent: Tuesday, October 8, 2013 10:36 PM
>Subject: [SPSSX-L] "Fingerprint" (MD5 or other) from set of SPSS variables
>
>Is there a way to compute a variable with SPSS (I'm using V17) that is a
>checksum of multiple other variables? This might not be the right approach
>anyway, so let me give a little more background

* sample data.
DATA LIST FREE /var1 (F) var2 (A2).
BEGIN DATA
11 ab
21 cd
END DATA.

* get the MD5 checksum.
BEGIN PROGRAM.
import spss, hashlib
# encode so md5 receives bytes (required on Python 3, harmless on Python 2).
get_checksum = lambda *vars: hashlib.md5(repr(vars).encode("utf-8")).hexdigest()
END PROGRAM.

* TYPE=32 because an MD5 hex digest is 32 characters long.
SPSSINC TRANS RESULT=checksum TYPE=32
  /FORMULA "get_checksum(var1, var2)".
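The same fingerprint idea can be sketched in plain Python, outside SPSS, to show how checksums from one wave might be matched against another. This is only an illustration under stated assumptions: the fingerprint function, the normalization (trimming and lowercasing), the separator character, and the sample rows are hypothetical and not part of the syntax above.

```python
import hashlib


def fingerprint(*values):
    """Return an MD5 hex digest of the values after light normalization."""
    # Trim and lowercase so trivial differences do not change the checksum.
    normalized = [str(v).strip().lower() for v in values]
    # Join with a separator assumed absent from the data, so that
    # ("ab", "c") and ("a", "bc") do not produce the same input string.
    joined = "\x1f".join(normalized)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hypothetical rows (var1, var2) from two waves.
wave1 = [(11, "ab"), (21, "cd")]
wave2 = [(21, "CD "), (31, "ef")]

# Collect wave 1 fingerprints, then flag wave 2 rows that match one of them.
wave1_prints = {fingerprint(*row) for row in wave1}
repeats = [row for row in wave2 if fingerprint(*row) in wave1_prints]
print(repeats)
```

Any difference that survives normalization produces a completely different checksum, so this only catches exact matches on the selected variables.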
In reply to this post by Jon K Peck
Thanks Jon. Looks like a couple of good reasons for an upgrade.

Mike
In reply to this post by Albert-Jan Roskam
Thanks Albert. I think I need to devote some time, and brain, to the whole idea.

Regards
Mike

-----Original Message-----
From: Albert-Jan Roskam [mailto:[hidden email]]
Sent: Wednesday, October 09, 2013 3:40 AM
To: Mike Pritchard; [hidden email]
Subject: Re: [SPSSX-L] "Fingerprint" (MD5 or other) from set of SPSS variables