Coïncidence?

Coïncidence?

A. Smulders
Hello all,

I got two files from an external source.
The first one has a selection of records of sexual crimes in a certain group of locations over a period of 3 years. There is a field/variable "location", and if it contains any of a certain list of values, the record was selected. This file has 507 records, divided over years 1 to 3 as 168, 168 and 169. (There was no restriction on the number of cases to be selected.)

There were, however, some duplicates. After these were removed, 498 cases remained.

The other (much bigger) file was based on words used in a free-text field. If any of a number of words was used in this field, the record was included. This resulted in 4998 cases, exactly 4500 more than the "locations file".

I matched the two files on the registration number with something like:
match files /file location /file words /by registrationnumber.
and constructed a variable indicating whether a registration number was present in the location file and/or in the words file.
This resulted in exactly 200 cases that were present only in the "location file" and exactly 4700 records that were present only in the "words file". 298 cases were present in both files.
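
As a quick sanity check on the figures above, the sketch below (Python, used here only for illustration; the variable names are hypothetical) verifies that the three match categories add up to the two file sizes. It also shows that the "exactly 4500" difference follows automatically from the other two counts, so it is not an independent coincidence.

```python
# Counts reported in the post (after removing duplicates).
only_location = 200   # registration numbers found only in the location file
only_words = 4700     # registration numbers found only in the words file
in_both = 298         # registration numbers found in both files

location_total = only_location + in_both   # size of the location file
words_total = only_words + in_both         # size of the words file
all_cases = only_location + only_words + in_both

print(location_total)                # 498
print(words_total)                   # 4998
print(all_cases)                     # 5198
# The difference between the file sizes does not depend on the overlap:
print(words_total - location_total)  # 4500, the same as 4700 - 200
```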

Of course I will ask the external source to check his query procedure. In fact, I already did so for the 168, 168, 169, but he says it is pure coincidence.

The problem might not be as big as it seems, because the data are only used to make a further selection of cases that are to be studied in more depth. If the selection error is not systematic, there is no big problem.

Do you think I can trust these data? Do you think it is pure coincidence? If not, do you have any idea what could have happened?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Coïncidence?

Maguin, Eugene
I guess I don't understand the significance of the 4500, which is what I guess you are referring to as "coincidence". Or maybe it is the 4700. It sounds like this is secondary data: data that you neither collected nor designed the collection of. To satisfy your concerns/curiosity, it seems to me that you need to have detailed discussions with the people who did that work. This is not an SPSS problem; you could be asking the same question using SAS or Stata.
Gene Maguin




Re: Coïncidence?

Bruce Weaver
Administrator
In reply to this post by A. Smulders
Like Gene, I don't immediately see what is making you question whether there is a coincidence. But something else you said caught my eye. It was this:

"I matched the two files on the registration number with something like:
match files /file location /file words /by registrationnumber.
and constructed a variable indicating whether a registration number was present in the location file and/or in the words file."

You could simplify the creation of the variables that flag whether a registration number was present by using the /IN subcommand on your MATCH FILES command. Something like this (untested) would do it. Obviously, you need to replace my LocFile and WordFile with the appropriate dataset names.

* Both files must already be sorted by registrationnumber.
MATCH FILES
  /FILE = LocFile  /IN = InLocFile
  /FILE = WordFile /IN = InWordFile
  /BY registrationnumber.
EXECUTE.
* Flag cases that occur in both files (1 = yes, 0 = no).
COMPUTE InBothFiles = InLocFile AND InWordFile.
FORMATS InBothFiles (F1).
FREQUENCIES InLocFile InWordFile InBothFiles.


HTH.




--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."


Re: Coïncidence?

A. Smulders
In reply to this post by A. Smulders
Thank you for reminding me of the IN subcommand. I had completely forgotten it. (What I actually did was create a variable in each file, inWordfile and inLocfile, and then combine them into a variable InFile with values 1, 2 and 3 for the word file, the loc file and both files, respectively.)
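
The three-valued InFile variable described above amounts to classifying each registration number by set membership. A minimal sketch in Python, using made-up registration numbers and a hypothetical helper `classify` (none of this comes from the actual data):

```python
from collections import Counter

def classify(word_ids, loc_ids):
    """Return {registration number: code}, coded 1 = words file only,
    2 = location file only, 3 = both files (the coding used in the post)."""
    word_ids, loc_ids = set(word_ids), set(loc_ids)
    codes = {}
    for reg in word_ids | loc_ids:
        if reg in word_ids and reg in loc_ids:
            codes[reg] = 3
        elif reg in loc_ids:
            codes[reg] = 2
        else:
            codes[reg] = 1
    return codes

# Made-up registration numbers, for illustration only.
words = [101, 102, 103, 104]
locations = [103, 104, 105]

in_file = classify(words, locations)
# Frequency table of the codes, comparable to FREQUENCIES InFile.
print(sorted(Counter(in_file.values()).items()))
```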

To make my main question clearer: my InFile variable gave frequencies of exactly 4500 for InFile = 1 (word file), exactly 200 for InFile = 2 (loc file) and 298 for InFile = 3 (both). The provider of the data maintains that he delivered all the relevant data for 3 consecutive years. There was no criterion limiting the data to samples of a certain size.

The fact that 2 out of 3 values are exact multiples of 100 makes me suspicious.
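
To put a rough number on that suspicion: if one assumes (and this is only an assumption) that the last two digits of each of the three counts are uniformly distributed and independent, the chance that any one count is a multiple of 100 is 1/100. The sketch below computes the probability that at least two of the three counts are multiples of 100 under that simple model.

```python
from math import comb

p = 1 / 100   # assumed chance that a single count ends in "00"
n = 3         # three counts: words only, location only, both

# Binomial probability of at least two "round" counts out of three.
p_at_least_two = sum(
    comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(2, n + 1)
)
print(f"{p_at_least_two:.6f}")
```

Under these assumptions the result is roughly 3 in 10,000. That is small, but it does not prove an error: many other patterns (multiples of 50, repeated digits, and so on) would look equally striking, so the chance of seeing some notable pattern is larger than this figure suggests.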

Also, as I mentioned, the figures for the loc file (when the duplicates are not removed) are almost the same over the 3 years (168, 168 and 169). When I discovered that, I contacted the provider, but he insisted that this was coincidence. Of course that is a possibility. I repeat: no limit on the size of the data files was intended.
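
One hedged way to gauge how unusual 168, 168 and 169 are is to assume the three years are independent Poisson counts with a common mean equal to their average (about 168.3). That model is an assumption made here for illustration, not something the data provider stated. The sketch computes the probability that three such counts all fall within one unit of each other.

```python
from math import exp, lgamma, log

def poisson_pmf(k, mean):
    """Poisson probability of observing exactly k events."""
    return exp(k * log(mean) - mean - lgamma(k + 1))

def prob_within_one(mean, upper=1000):
    """P(max - min <= 1) for three independent Poisson(mean) counts.

    Sums P(all three in {k, k+1}) over k, then subtracts P(all three == k)
    for k >= 1, because those outcomes are counted twice in the first sum.
    """
    pmf = [poisson_pmf(k, mean) for k in range(upper + 2)]
    in_pair = sum((pmf[k] + pmf[k + 1]) ** 3 for k in range(upper + 1))
    all_equal = sum(pmf[k] ** 3 for k in range(1, upper + 1))
    return in_pair - all_equal

mean = (168 + 168 + 169) / 3
print(f"{prob_within_one(mean):.4f}")
```

Under this model the probability comes out well below 1 percent, which supports asking the provider to re-check the query, although it cannot show where a mistake (if any) was made.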


Re: Coïncidence?

A. Smulders
In reply to this post by A. Smulders
Correction: there are 4700 cases where InFile = 1 (WordFile only); I said there were 4500.


Re: Coïncidence?

MLIves
In reply to this post by A. Smulders
Just because there are the same number of cases across years does not mean they represent the same 'location' each year. Have you checked whether the values are the same (does it matter)? Also, 168 + 168 + 169 = 505, not 507. Were some missing, or did you mean 505?

You note there were 'duplicates'. Does this mean duplicates of location and Registration number, just location, just Registration number?

In the words file, does a single record represent a single crime-registration number or a single word?

Also, how is the words file related to the location in the original data? It sounds to me like your provider has a single large dataset (multiple tables in a relational dataset?) of sex crimes that includes location data and the free-text field. From this dataset two subsets were pulled: in the first, particular locations were selected regardless of words, and in the second, particular words were selected regardless of the location value.

Based on what you've said so far, it seems that, of the 498 unduplicated location records, 200 did not include any of the selected words, 298 did, and 4700 further records came from other locations.


Melissa Ives
DMHAS Research Division
410 Capitol Ave., MS#14RSD
Hartford, CT 06106
860-418-6729 (phone)
860-418-6692 (fax)
860-778-5445 (cell)

Re: Coïncidence?

A. Smulders
Thanks for your reply. See my remarks interleaved with your text below.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Ives,
Melissa L
Sent: Monday, September 26, 2016 16:23
To: [hidden email]
Subject: Re: Coïncidence?

Just because there are the same number of cases across years does not mean
they represent the same 'location' each year. Have you checked whether the
values are the same (does it matter)? Also, 168 + 168 + 169 = 505, not 507.
Were some missing, or did you mean 505?

You note there were 'duplicates'. Does this mean duplicates of location and
Registration number, just location, just Registration number?
*********.
AS: There were indeed 505 cases. (There were 7 duplicates in the file; I
guess that's where the 7 comes from.) All fields in the duplicate cases were
the same except for the date of registration and the type of crime assigned
(for instance, sexual violation or rape). In one case the date of
registration "crossed" a year boundary. There were no inconsistencies in
the location field.
*********.
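
The duplicate handling described above can be sketched as follows, assuming (hypothetically) that each record is a dict with `reg`, `date` and `crime_type` keys and that the first record per registration number is kept. The records are invented for illustration.

```python
def drop_duplicates(records, key="reg"):
    """Keep the first record for each value of `key`, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique

# Invented records: registration 2 appears twice, with a different
# registration date and a different crime type.
records = [
    {"reg": 1, "date": "2014-03-01", "crime_type": "rape"},
    {"reg": 2, "date": "2014-12-30", "crime_type": "sexual violation"},
    {"reg": 2, "date": "2015-01-02", "crime_type": "rape"},
    {"reg": 3, "date": "2015-06-15", "crime_type": "rape"},
]

unique = drop_duplicates(records)
print(len(records) - len(unique), "duplicate(s) removed;", len(unique), "remain")
```

Which copy is kept matters when a duplicate's registration date crosses a year boundary, because that choice changes the per-year totals.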
In the words file, does a single record represent a single
crime-registration number or a single word?
*********.
AS: In the original file there were records for each word, but I created a
string variable to concatenate each word with the previous value (sorted by
registration number, using the LAG function) if the registration number was
the same as the previous one. Then, with AGGREGATE, I selected the last
record of each sequence with the same registration number.
*********,
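
The LAG-and-AGGREGATE step described above collapses one record per word into one record per registration number. A rough Python equivalent, with invented data and a hypothetical function name `collapse_words`, might look like this:

```python
from itertools import groupby
from operator import itemgetter

def collapse_words(rows):
    """Turn (registration number, word) rows into one entry per
    registration number, with its words joined into a single string."""
    rows = sorted(rows, key=itemgetter(0))  # stable sort by registration number
    return {
        reg: " ".join(word for _, word in group)
        for reg, group in groupby(rows, key=itemgetter(0))
    }

# Invented rows, for illustration only.
rows = [(7, "alpha"), (3, "beta"), (7, "gamma"), (3, "delta"), (9, "alpha")]
print(collapse_words(rows))
```

After this step each registration number occurs exactly once, which is what the match on registration number requires.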

Also, how is the words file related to the location in the original data?
It sounds to me like your provider has a single large dataset (multiple
tables in a relational dataset?) of sex crimes that includes location data
and the free-text field. From this dataset two subsets were pulled: in the
first, particular locations were selected regardless of words, and in the
second, particular words were selected regardless of the location value.

*******.
AS: Your guess is right. There were two independent queries. The data
structure is such that every case (registration) has only one location.
*******.

Based on what you've said so far, it seems that, of the 498 unduplicated
location records, 200 did not include any of the selected words, 298 did,
and 4700 further records came from other locations.
*******.
AS: Again you are right. In total there are 5198 cases.
Of course it is possible that those two multiples of 100 are a coincidence.
The chance that they were multiples of 99 or 101 would also be small, but
less easily noticed. But because people tend to think in hundreds (and
to err is human), I think there might be something wrong. I wondered
whether anyone has a suggestion about where to look for possible mistakes.
Of course I contacted the provider of the data, and they will check the
procedure.
*******.



