Hello all,
I received two files from an external source. The first one contains a selection of records of sexual crimes at a certain group of locations over a period of three years. There is a field/variable "location", and a record was selected if that field contained any of a certain list of values. This file has 507 records, divided over years 1 to 3 as 168, 168 and 169. (There was no restriction on the number of cases to be selected.) There were, however, some duplications; when these are removed, 498 cases remain.

The other (much bigger) file was based on words used in a free-text field: if any of a number of words was used in this field, the record was included. This resulted in 4998 cases, exactly 4500 more than the "locations" file.

I matched the two files on the registration number with something like

MATCH FILES /FILE location /FILE words /BY registrationnumber.

and constructed a variable indicating whether a registration number was present in the location file and/or in the words file. This resulted in exactly 200 cases that were only present in the location file and exactly 4700 records that were only present in the words file; 298 cases were present in both files.

Of course I will ask the external source to check his query procedure. In fact I already did so for the 168, 168, 169, but he states that it is pure coincidence. The problem might not be as big as it seems, because the data are only used to make a further selection of cases that are to be studied more in depth. If the selection error is not systematic, there is no big problem.

Do you think I can trust these data? Do you think it is pure coincidence? If not, do you have any idea what could have happened?
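A minimal, untested sketch of one way such duplicate registration numbers can be flagged in SPSS before matching (not necessarily how it was actually done here; it assumes the location file is the active dataset and the key variable is named registrationnumber):

SORT CASES BY registrationnumber.
* Flag the first case within each registration number; 0 marks a duplicate of an earlier record.
MATCH FILES /FILE=* /BY registrationnumber /FIRST=PrimaryCase.
FREQUENCIES PrimaryCase.
* Inspect the duplicates, then keep only the first record per registration number.
TEMPORARY.
SELECT IF PrimaryCase = 0.
LIST registrationnumber location.
SELECT IF PrimaryCase = 1.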
I don't understand the significance of the 4500, which I guess is what you are referring to as the "coincidence"; or maybe it is the 4700. It sounds like this is secondary data, data that you neither collected nor designed the collection of. To satisfy your concerns/curiosity, it seems to me that you need to have detailed discussions with the people who did that work. This is not an SPSS problem; you could be asking the same question using SAS or Stata.
Gene Maguin
In reply to this post by A. Smulders
Like Gene, I don't immediately see what is making you question whether this is a coincidence. But something else you said caught my eye. It was this:
"I matched the two files on the registration number with something like: match files /file location /file words /by registrationnumber. and construed a variable indicating wether a registrationnumber was present in the location file and /or in the words-file." You could simplify the creation of the variables that flag whether a registration number was present by using the /IN sub-command on your MATCH FILES command. Something like this (untested) would do it. Obviously, you need to replace my LocFile and WordFile with the appropriate dataset names. MATCH FILES FILE = LocFile / IN = InLocFile / FILE = WordFile / IN = InWordFile / BY registrationnumber. EXECUTE. COMPUTE InBothFiles = InLocFile and InWordFile. FORMATS InBothFiles (F1). FREQUENCIES InLocFile InWordFile InBothFiles. HTH.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
In reply to this post by A. Smulders
Thank you for reminding me of the IN subcommand; I had completely forgotten about it. (What I actually did was create a flag variable in each file, inWordfile and inLocfile, and then combine them into a variable InFile with values 1, 2 and 3 for the words file, the location file and both files respectively.)
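For the record, a minimal (untested) sketch of that combination using the IN flags from Bruce's suggestion; the dataset names LocFile and WordFile are placeholders, and both files must already be sorted by registrationnumber:

MATCH FILES
  /FILE = LocFile
  /IN = InLocFile
  /FILE = WordFile
  /IN = InWordFile
  /BY registrationnumber.
* 1 = words file only, 2 = location file only, 3 = both files.
COMPUTE InFile = InWordFile + 2*InLocFile.
VALUE LABELS InFile 1 'Words file only' 2 'Location file only' 3 'Both files'.
FORMATS InFile (F1).
FREQUENCIES InFile.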
To make my main question clearer: my InFile variable gave frequencies of exactly 4500 for InFile = 1 (words file), exactly 200 for InFile = 2 (location file) and 298 for InFile = 3 (both). The provider of the data maintains that he delivered all the relevant data for 3 consecutive years; there was no criterion limiting the data to samples of a certain size. The fact that 2 out of 3 values are exact multiples of 100 makes me suspicious. Also, as I mentioned, the figures for the location file (when the duplicates are not removed) are almost the same over the 3 years (168, 168 and 169). When I discovered that, I contacted the provider, but he insisted that this was coincidence. Of course that is a possibility. I repeat: no limitation on the size of the data files was intended.
In reply to this post by A. Smulders
Correction: there are 4700 cases where InFile = 1 (words file only), not 4500 as I said earlier.
In reply to this post by A. Smulders
Just because there are the same number of cases across years does not mean they represent the same 'location' each year. Have you checked whether the values are the same (and does it matter)? Also, 168 + 168 + 169 = 505, not 507; were some missing, or did you mean 505?
You note there were 'duplicates'. Does this mean duplicates of location and registration number, just location, or just registration number?

In the words file, does a single record represent a single crime registration number or a single word? Also, how is the words file related to the location in the original data? It sounds to me like your provider has a single large dataset (multiple tables in a relational dataset?) of sex crimes that includes location data and the free-text field. From this dataset two subsets were pulled: in the first, particular locations were selected regardless of words, and in the second, particular words were selected, but the words were pulled regardless of the location value.

Based on what you've said so far, it seems that of the 498 unduplicated location records, 200 did not include any of the selected words, 298 did, and 4700 were records from other locations.

Melissa Ives
DMHAS Research Division
410 Capitol Ave., MS#14RSD
Hartford, CT 06106
860-418-6729 (phone)
860-418-6692 (fax)
860-778-5445 (cell)
Thanks for your reply. See my remarks between your text.
Melissa: Just because there are the same number of cases across years does not mean they represent the same 'location' each year. Have you checked whether the values are the same (does it matter)? Also, 168 + 168 + 169 = 505, not 507; were some missing, or did you mean 505? You note there were 'duplicates'. Does this mean duplicates of location and registration number, just location, or just registration number?

AS: There were indeed 505 cases (there were 7 duplicates in the file; I guess that's where the 7 comes from). All fields in the duplicate cases were the same except for the date of registration and differences in the assignment of the type of crime (for instance sexual violation or rape). In one case the date of registration "crossed" the line of a year. There were no inconsistencies for the location field.

Melissa: In the words file, does a single record represent a single crime registration number or a single word?

AS: In the original file there were records for each word, but I created a string variable to concatenate these with the previous value (sorted by registration number, using the LAG function) whenever the registration number was the same as the previous one. Then with AGGREGATE I kept the last record of each sequence with the same registration number. (A sketch of this step follows at the end of this message.)

Melissa: Also, how is the words file related to the location in the original data? It sounds to me like your provider has a single large dataset (multiple tables in a relational dataset?) of sex crimes that includes location data and the free-text field. From this dataset two subsets were pulled: in the first, particular locations were selected regardless of words, and in the second, particular words were selected regardless of the location value.

AS: Your guess is right. There were two independent queries. The data structure is such that every case (registration) has only one location.

Melissa: Based on what you've said so far, it seems that of the 498 unduplicated locations, 200 did not include any of the selected words, 298 did, and 4700 were records from other locations.

AS: Again you are right. In total there are 5198 cases. Of course it is possible that those two multiples of 100 are a coincidence; the chance that they were multiples of 99 or 101 would also be small, but less easily noticed. But because people tend to think in hundreds (and making errors is also human), I think there might be something wrong. I wondered if anyone has a suggestion where to look for possible mistakes. Of course I contacted the provider of the data and they will check the procedure.
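A minimal, untested sketch of that concatenation step (the variable names word and allwords are placeholders, and registrationnumber is assumed to be the key):

SORT CASES BY registrationnumber.
STRING allwords (A255).
DO IF $CASENUM = 1 OR registrationnumber NE LAG(registrationnumber).
* First record of a new registration number: start the string.
  COMPUTE allwords = RTRIM(word).
ELSE.
* Same registration number as the previous record: append the word.
  COMPUTE allwords = CONCAT(RTRIM(LAG(allwords)), ' ', RTRIM(word)).
END IF.
EXECUTE.
* The last record per registration number now holds the full concatenation.
* Note: this AGGREGATE keeps only registrationnumber and the concatenated words.
AGGREGATE OUTFILE=* /BREAK=registrationnumber /words_all = LAST(allwords).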