Jaccard's Coefficient- Data Preparation

ceje94
Hi there,

I have binary data of certain behaviours that have occurred in several series of criminal offences. I'm looking to use Jaccard's Coefficient to get a similarity measure on each of the series in my sample.

However, I'm not even sure how to prepare my data for this.
I have a number of variables in each series of cases, so do I need to run the analysis variable by variable?

For example, in a series of four offences the offender may have stolen in offence one, murdered in offence two and stolen again in offences three and four. How would I present this data for SPSS?

Thanks in advance

Re: Jaccard's Coefficient- Data Preparation

Andy W
The only Jaccard coefficient I am familiar with, https://en.wikipedia.org/wiki/Jaccard_index, takes two sets. (And is simply the intersection of the sets divided by the union of the sets.)

So I wouldn't know how to get the Jaccard coefficient for your simplified example - you need a second set.

It may also be easier to start with how you have the data now and your desired end result.

In general I imagine I would use Python and its set functionality to do this, but I would need more info on your data to sketch out a more explicit solution.
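
A minimal Python sketch of the set-based definition above (intersection divided by union). The behaviour names are hypothetical and only illustrate the idea; returning None when both sets are empty is an assumption about how to treat the undefined case.

```python
def jaccard(set_a, set_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = set_a | set_b
    if not union:
        return None  # undefined when both sets are empty
    return len(set_a & set_b) / len(union)


# Hypothetical behaviours recorded for two offences.
offence_1 = {"theft", "forced_entry", "weapon"}
offence_2 = {"theft", "weapon", "disguise"}

print(jaccard(offence_1, offence_2))  # 2 shared out of 4 distinct -> 0.5
```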
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Re: Jaccard's Coefficient- Data Preparation

Rich Ulrich

"Two sets" is how I see it.  For the example cited, I see two distinct types of offense.


If that is the starting point, then you would want to have a list of variables representing the possible offenses, and score them as Yes=1, No=0. For two lists, you would count the number of variables that are 1, either uniquely (for one subject only) or for both subjects. Does this define what you need?

To compare subjects, you probably have to FLIP the file. Then a matrix of correlations would give one index of similarity for each pair of subjects as their correlation, which, along with the counts, could be manipulated to get Jaccard's index. Instead of the correlation, the flipped file could probably be picked up by MATRIX, where you could do some simple counting.

Is that what you want?


--
Rich Ulrich


From: SPSSX(r) Discussion <[hidden email]> on behalf of Andy W <[hidden email]>
Sent: Wednesday, January 11, 2017 8:17 AM
To: [hidden email]
Subject: Re: Jaccard's Coefficient- Data Preparation
 
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: Jaccard's Coefficient- Data Preparation

ceje94
Thanks so much to you both for getting back to me so quickly.

To be a bit clearer, I have about twenty offence variables (I wish there were only two!). They are currently coded as binary as below:

Series  Variable 1  Variable 2  Variable 3  Variable 4

1       1           0           1           1
1       1           1           0           0
1       0           0           0           0
1       1           0           0           0
2       0           1           1           0
2       1           0           1           0
2       1           1           0           0

And so on (I have 70 series containing about 280 offences, plus a matching control group).

If I understand Jaccard's Coefficient correctly, I have to analyse variable by variable, so if I were looking at V1, Series 1 would be '1,1,0,1' and Series 2 '0,1,1'.
I have had a look at how to do the analysis by hand and I think I understand it (for Variable 1 the coefficients would be: Series 1 = 0.33, Series 2 = 0.50? Please tell me if I'm wrong, I'm not a natural mathematician!).
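
As a data-preparation sketch only (in Python rather than SPSS, and not a statement about which analysis is correct), the long layout above can be grouped by series so that one variable's values across a series can be pulled out. The helper name `variable_sequence` is made up for illustration.

```python
from collections import defaultdict

# Rows copied from the table above: (series, V1, V2, V3, V4).
rows = [
    (1, 1, 0, 1, 1),
    (1, 1, 1, 0, 0),
    (1, 0, 0, 0, 0),
    (1, 1, 0, 0, 0),
    (2, 0, 1, 1, 0),
    (2, 1, 0, 1, 0),
    (2, 1, 1, 0, 0),
]

# Collect each series' offences in file order.
offences_by_series = defaultdict(list)
for series, *behaviours in rows:
    offences_by_series[series].append(behaviours)


def variable_sequence(series, index):
    """Return one variable's 0/1 values across a series (index 0 = Variable 1)."""
    return [offence[index] for offence in offences_by_series[series]]


print(variable_sequence(1, 0))  # [1, 1, 0, 1]
print(variable_sequence(2, 0))  # [0, 1, 1]
```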

However, with such a large number of samples I can't really do it all by hand, variable by variable and series by series. I'm not sure the best way to arrange the data for SPSS, though I have tried several ways and been unable to make sense of the agglomeration schedule.

If anyone can help it would be much appreciated :)

Re: Jaccard's Coefficient- Data Preparation

Art Kendall
Please give a little more detail.

What questions are you using the data to explore?  Consistency of coding across coders? Arrests about the same offenses?  

What is a series? An individual arrest with offenses coded by several people?

What are the variables?  Are they 20 offenses that are charged or not charged?

Why does series 1 have 4 lines and series 2 have 3 lines?

Jaccard's coefficient is most commonly used to find groups/piles of series (arrests?) that contain the same pattern of offenses.

Please clarify what you mean by "control group". Sometimes people just mean a group for comparison, but technically it means that cases (entities) were randomly assigned to conditions, treatments, etc.
Art Kendall
Social Research Consultants

Re: Jaccard's Coefficient- Data Preparation

Rich Ulrich
In reply to this post by ceje94

I have no idea what you are computing as a "Jaccard's Coefficient".

It /appears/ that you are computing a value of 0.33 from the single series (1,1,0,1), which has to be wrong because the coefficient always computes the similarity of /two/ sets. So, where am I wrong? But don't jump to answer that.

I am still trying to construe what the data are. I have a picture, but it might be all wrong.

What I imagine is that your "series" could be called "ID". The (so far unlabeled) lines should be called "Arrest number". The four variables shown are only illustrative of the larger, actual set of 20 variables, each of which is Yes/No for one feature at one arrest. That's what I was giving recommendations for in my first reply. Is this what you are showing with your data?

Do answer that.

- If I have it wrong, then ignore everything except the need to specify what you actually have.

- If I have it right, then you /could/ compute a coefficient between each pair of lines in the file, showing the similarities between arrests, same person or not. (An "aggregated record" for one ID would show, Yes/No for 20 features, whether that person ever had that feature.)


--

Rich Ulrich


From: SPSSX(r) Discussion <[hidden email]> on behalf of ceje94 <[hidden email]>
Sent: Saturday, January 14, 2017 10:59:07 PM
To: [hidden email]
Subject: Re: Jaccard's Coefficient- Data Preparation
 





Re: Jaccard's Coefficient- Data Preparation

Kirill Orlov
In case the OP/audience might take interest:
Square matrices of association measures for binary data (those offered by the PROXIMITIES command, and others) are also easily computed with a simple function of mine, !bincnt (found in the "Matrix-End matrix" collection at http://www.spsstools.net/en/KO-spssmacros).


Re: Jaccard's Coefficient- Data Preparation

ceje94
In reply to this post by Rich Ulrich
Hi,

My research aim is to investigate the intra-series consistency of these offenders. I have groups of series, each made up of the crimes that the same offender has committed, with a range of variables for each offence. The series differ in length because some offenders have committed more offences in their series than others. Basically I only want the consistency measure for each series.

I am hoping to come up with a coefficient for the consistency of each variable across all of the series, so that I can say that (for example) this sample is consistent in their choice of approach type across a series of offences.

I was under the impression that the formula for Jaccard is Sj = a/(a+b+c), a being the number of 'joint occurrences' per series and b and c the number of 'single non-joint occurrences' per series.
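
For reference, a short Python sketch of that formula as it is usually applied, which is to two offences at a time rather than to one series: a counts behaviours present in both offences, b and c those present in only one, and joint absences are ignored. The function name and the choice to return None when a+b+c is zero are assumptions made for illustration.

```python
def jaccard_binary(x, y):
    """Jaccard's coefficient a / (a + b + c) for two equal-length 0/1 profiles."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # present in both
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # only in x
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # only in y
    if a + b + c == 0:
        return None  # no behaviour present in either offence
    return a / (a + b + c)


# First two offences of Series 1 from the table posted earlier in the thread.
print(jaccard_binary([1, 0, 1, 1], [1, 1, 0, 0]))  # a=1, b=2, c=1 -> 0.25
```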

You're right in saying that the example is only a snippet of the data, I have many more variables that I need to test and they are all in a yes/no format.

Thanks


Re: Jaccard's Coefficient- Data Preparation

ceje94
If it helps, I took the analysis from Harbers, Deslauriers-Varin, Beauregard & van der Kemp (2012); the extract is below:

Statistical analyses

Previous studies investigating consistency for crime linkage purposes have often used the Jaccard’s coefficient (Bennell and Canter, 2002; Bennell and Jones, 2005; Bennell et al., 2009; Tonkin et al., 2008, Woodhams and Toye, 2007; Woodhams et al., 2008). This similarity coefficient (Jaccard, 1908) is suitable because it does not include joint nonoccurrences (0/0) of a specific behaviour in its measurement. This means that a specific behaviour will not automatically be consistent because it does not occur in most of the offences. Jaccard’s coefficient is calculated by dividing the number of behaviours shared by two offences (1/1) by the sum of the numbers of behaviours shared (1/1), and the number of behaviours present in one crime but not in the other (1/0 and 0/1). A value of 1 would mean that there is a total similarity on this particular behaviour across the series, and a value of 0 would indicate no similarity at all across the series. An important advantage of using the Jaccard’s coefficient to measure consistency is that low frequencies of certain behaviours do not lead to high consistency scores. This is important in this type of research as the absence of a specific behaviour based on police records does not necessarily mean that the behaviour did not occur. However, one of the disadvantages is that the Jaccard’s coefficient is very sensitive to missing data (Bennell and Jones, 2005; Everitt et al., 2001; Woodhams and Toye, 2007).

Consistency of a variable was first measured for each offender within the offender’s series. The consistency score was measured by comparing the variable in each offence with the variable in the previous offence for the full length of the series. If both offences showed the behaviour (1/1), a value of 1 was awarded to the comparison. If the behaviour was present in one of the offences and absent in the other offence (0/1 or 1/0), a value of 0 was awarded to the comparison. If both offences did not show the behaviour (0/0), the comparison was left out of the measurement. To make sure that the results are not biased by the various lengths of the series, the total score was divided by the number of comparisons, the length of the series minus two. For example, in the case where the offender has been linked to five sexual assaults, the use of a disguise during the assaults was scored as follows: absent, absent, present, present, and absent (00110). The first comparison (0/0) is left out of the measurement. The second comparison is awarded with a value of 0, the third comparison with a value of 1, and the last comparison with a value of 0 again. The score (1) divided by the number of comparisons (3) is 0.33. The consistency score for using a disguise by this offender in his or her series is 0.33. Thus, the information of all crimes within the series is used, whereas the results are not biased by undue weighting because of the length of the series.


Harbers, E., Deslauriers-Varin, N., Beauregard, E., & van der Kemp, J. J. (2012). Testing the behavioural and environmental consistency of serial sex offenders: A signature approach. Journal of Investigative Psychology and Offender Profiling, 9(3), 259-273.
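
The sequential procedure in the extract can be sketched in Python as follows. This is one reading of the text, not the authors' code: it assumes the total is divided by the number of comparisons that are kept (0/0 comparisons dropped), which reproduces the 0.33 in the worked example (00110) and the 0.33 and 0.50 computed by hand earlier in the thread for Variable 1.

```python
def sequential_consistency(values):
    """Consistency of one 0/1 behaviour across consecutive offences.

    Each offence is compared with the previous one: 1/1 scores 1,
    1/0 or 0/1 scores 0, and 0/0 comparisons are dropped.
    Assumption: the total is divided by the number of comparisons kept.
    """
    scores = []
    for previous, current in zip(values, values[1:]):
        if previous == 0 and current == 0:
            continue  # joint absence is left out of the measurement
        scores.append(1 if previous == 1 and current == 1 else 0)
    if not scores:
        return None  # behaviour never occurred, nothing to compare
    return sum(scores) / len(scores)


print(round(sequential_consistency([0, 0, 1, 1, 0]), 2))  # 0.33 (paper's example)
print(round(sequential_consistency([1, 1, 0, 1]), 2))     # 0.33 (Series 1, Variable 1)
print(round(sequential_consistency([0, 1, 1]), 2))        # 0.5  (Series 2, Variable 1)
```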

Re: Jaccard's Coefficient- Data Preparation

Art Kendall
Am I correct in understanding
that "series" means an individual,
that there are several events (arrest occasions?) for each individual, and
that there are 20 behaviors measured for each event?

My eSPSS reads your post as saying that you want to know to what degree, within an individual, events have similar profiles across the 20 behaviors.

Paste and run the following into a syntax window.
Is this what you are looking for?
data list list/
Id (n2) event(a2) Name (a20) ArrestDate(adate10) behavior1 behavior2 behavior3 behavior4 (4f1).
begin data
1 a 'John Doe' 10/10/1999 1 0 1 1
1 b 'John Doe' 05/22/2001 0 1 0 0
1 c 'John Doe' 12/25/2004 1 1 1 1
1 d 'John Doe' 02/14/2005 0 0 0 0
2 a 'Mary Poe' 11/11/2010 0 1 1 0
2 b 'Mary Poe' 04/04/2011 1 1 0 0
2 c 'Mary Poe' 05/04/2011  0 0 1 1
end data.
list.
* Proximities wants a string variable as ID here "event".
split file by Id.
proximities behavior1 to behavior4 /measure=JACCARD /id=event.
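
For readers without SPSS at hand, here is a rough Python sketch of the same idea: Jaccard's coefficient for every pair of events within each Id, using the example data above. It is an illustration of the computation, not a reproduction of the PROXIMITIES output; the dictionary layout and the `pairwise` name are made up for this sketch.

```python
from itertools import combinations


def jaccard_binary(x, y):
    """Jaccard's coefficient for two 0/1 profiles; joint absences are ignored."""
    both = sum(1 for xi, yi in zip(x, y) if xi and yi)
    either = sum(1 for xi, yi in zip(x, y) if xi or yi)
    return both / either if either else None


# Example data from the syntax above: Id -> event -> behavior1..behavior4.
events = {
    1: {"a": [1, 0, 1, 1], "b": [0, 1, 0, 0], "c": [1, 1, 1, 1], "d": [0, 0, 0, 0]},
    2: {"a": [0, 1, 1, 0], "b": [1, 1, 0, 0], "c": [0, 0, 1, 1]},
}

# One coefficient per pair of events, computed separately within each Id.
pairwise = {}
for person_id, person_events in events.items():
    for (ev1, p1), (ev2, p2) in combinations(person_events.items(), 2):
        pairwise[(person_id, ev1, ev2)] = jaccard_binary(p1, p2)

for key, value in pairwise.items():
    print(key, value)
```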



There are many coefficients for binary input data in PROXIMITIES. Check HELP to see whether additional coefficients would be of help.

I cannot recall at this time whether the labels on the proximity matrix can be the arrest#.  

Please reply to the list whether this is what you are looking for.  Then if this is what you are looking for
perhaps there is a way to remove the case number in the complete file from the labels of the printed proximity matrix.


Art Kendall
Social Research Consultants

Re: Jaccard's Coefficient- Data Preparation

Rich Ulrich
In reply to this post by ceje94

My interpretation of what the lines are /seems/ to be confirmed.


Here, I am going to re-write paragraph one into the sort of jargon that I can understand, before I comment.

   "My research aim is to describe the consistency of the crime-profiles for each offender across his multiple arrests."

 - You can compute Jaccard's coefficient for each pair of lines for an offender.  Then average those if you want one number.

  "I am hoping to come up with a coefficient for the consistency of each variable across his arrests."
 - Okay, your next post does describe how you did that, following the cited article.  That result is "what they did first" and
it is NOT Jaccard's coefficient.  Anything done on one variable is not Jaccard's coefficient.  What it computes is of dubious
use, I think, unless you have sets of, I don't know, 10 or 20 arrests.  Plus, it imputes a meaningful order to the arrests,
since it can come out different if you take the arrests in a different order.  It shares with Jaccard the step of ignoring 0-0 pairs.

Here is a comment that I originally thought of, concerning this extra, not-Jaccard coefficient, but it might also apply to Jaccard:
- Instead of using a statistic that is designed to compensate for low rates by ignoring the No-No combination, I would look for
some measure that starts with the overall rate for the whole sample  (expressed as per-person, maybe, rather than per-offense --
but I would check both possibilities).
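
The order dependence mentioned above can be seen in a small Python sketch. The sequential score follows one reading of the cited extract (dividing by the comparisons kept); the all-pairs variant is a hypothetical alternative added here only to show a version that does not depend on order.

```python
from itertools import combinations


def sequential_consistency(values):
    """Score consecutive offences: 1/1 -> 1, 1/0 or 0/1 -> 0, 0/0 dropped."""
    scores = [
        1 if prev == 1 and cur == 1 else 0
        for prev, cur in zip(values, values[1:])
        if not (prev == 0 and cur == 0)
    ]
    return sum(scores) / len(scores) if scores else None


def all_pairs_consistency(values):
    """Same scoring rule applied to every pair of offences (hypothetical variant)."""
    scores = [
        1 if x == 1 and y == 1 else 0
        for x, y in combinations(values, 2)
        if not (x == 0 and y == 0)
    ]
    return sum(scores) / len(scores) if scores else None


original = [0, 0, 1, 1, 0]   # the 00110 example from the extract
reordered = [0, 1, 0, 1, 0]  # same offences, different order

print(sequential_consistency(original), sequential_consistency(reordered))
print(all_pairs_consistency(original), all_pairs_consistency(reordered))
```

With these inputs the sequential score changes when the order changes, while the all-pairs score does not.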

--
Rich Ulrich


From: SPSSX(r) Discussion <[hidden email]> on behalf of ceje94 <[hidden email]>
Sent: Sunday, January 15, 2017 3:24 PM
To: [hidden email]
Subject: Re: Jaccard's Coefficient- Data Preparation
 

Re: Jaccard's Coefficient- Data Preparation

Andy W
To make all pairwise comparisons among each offender's own offences, I have a macro already made that can give you that output. The macro was originally designed to create a network set of co-offender edge lists from crime data, https://dl.dropboxusercontent.com/u/3385251/Edgelist_Macro.sps (and see the blog post that explains the steps, https://andrewpwheeler.wordpress.com/2013/06/30/making-an-edge-list-in-spss/).

The macro is not needed if you only do the sequential comparisons as in the paper you cited. You could use LAG for that; see https://andrewpwheeler.wordpress.com/2013/02/18/using-sequential-case-processing-for-data-management-in-spss/ for some examples. But I would do all the pairwise comparisons. Arrests are somewhat haphazard, so you want all the samples you can get.

So using Art's example data,

******************************************************************************.
data list list/
PersonId (n2) event(a2) Name (a20) ArrestDate(adate10) behavior1 behavior2 behavior3 behavior4 (4f1).
begin data
1 a 'John Doe' 10/10/1999 1 0 1 1
1 b 'John Doe' 05/22/2001 0 1 0 0
1 c 'John Doe' 12/25/2004 1 1 1 1
1 d 'John Doe' 02/14/2005 0 0 0 0
2 a 'Mary Poe' 11/11/2010 0 1 1 0
2 b 'Mary Poe' 04/04/2011 1 1 0 0
2 c 'Mary Poe' 05/04/2011  0 0 1 1
end data.
DATASET NAME Examp.
*Change this to the location you have downloaded.
*https://dl.dropboxusercontent.com/u/3385251/Edgelist_Macro.sps.
FILE HANDLE syntax /NAME = "C:\Users\axw161530\Desktop\ExampleCrimeLink".
INSERT FILE = "syntax\Edgelist_Macro.sps".
*The ID within a person needs to be numeric for my macro.
SORT CASES BY PersonId ArrestDate.
DO IF ($casenum = 1) OR PersonId <> LAG(PersonId).
  COMPUTE WithId = 1.
ELSE.
  COMPUTE WithId = LAG(WithId) + 1.
END IF.
EXECUTE.
!EdgeList IncId = PersonId PerId = WithId StaticVars = [Name] PerVars = event ArrestDate behavior1 behavior2 behavior3 behavior4.
DATASET ACTIVATE EdgeData.
*Example computing difference in dates between events.
COMPUTE DiffArrDate = DATEDIFF(ArrestDate.2,ArrestDate.1,"DAYS").
EXECUTE.
*That gives you two sets of behavior variables to compute whatever distance you want between those categories.
******************************************************************************.
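
As a language-neutral illustration of what such an edge list contains, here is a Python sketch that builds every within-person pair of events from the example data and attaches the days between arrests and a Jaccard coefficient. It does not call or reproduce the macro; the field names (`days_apart`, `jaccard`) are made up for this sketch.

```python
from datetime import date
from itertools import combinations


def jaccard_binary(x, y):
    """Jaccard's coefficient for two 0/1 profiles; joint absences are ignored."""
    both = sum(1 for xi, yi in zip(x, y) if xi and yi)
    either = sum(1 for xi, yi in zip(x, y) if xi or yi)
    return both / either if either else None


# (PersonId, event, ArrestDate, behavior1..behavior4), sorted by person and date.
events = [
    (1, "a", date(1999, 10, 10), [1, 0, 1, 1]),
    (1, "b", date(2001, 5, 22), [0, 1, 0, 0]),
    (1, "c", date(2004, 12, 25), [1, 1, 1, 1]),
    (1, "d", date(2005, 2, 14), [0, 0, 0, 0]),
    (2, "a", date(2010, 11, 11), [0, 1, 1, 0]),
    (2, "b", date(2011, 4, 4), [1, 1, 0, 0]),
    (2, "c", date(2011, 5, 4), [0, 0, 1, 1]),
]

# One row per pair of events that belong to the same person.
edge_list = []
for (pid1, ev1, d1, p1), (pid2, ev2, d2, p2) in combinations(events, 2):
    if pid1 != pid2:
        continue
    edge_list.append({
        "person": pid1,
        "event_1": ev1,
        "event_2": ev2,
        "days_apart": (d2 - d1).days,
        "jaccard": jaccard_binary(p1, p2),
    })

for edge in edge_list:
    print(edge)
```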

I would suggest this (as opposed to Art's proximities example), as you may want to make a custom distance function. For example, crime linkage studies tend to find that offenses nearby in space and time are more likely to be linked; see http://www.citeulike.org/user/apwheele/article/13706987. But those continuous measures would not make sense in comparison to the categories for Jaccard, so you need to think of a reasonable way to combine those two pieces of information.

Probably after you do this you will want to make a set of control distances, to see what the distances look like for unlinked events.
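
A minimal Python sketch of that comparison, assuming the same toy example data: pairs of events from the same person are treated as linked, pairs from different people as unlinked controls, and the two sets of Jaccard coefficients are summarised separately. With only seven toy events the numbers mean nothing substantively; the point is the bookkeeping.

```python
from itertools import combinations
from statistics import mean


def jaccard_binary(x, y):
    """Jaccard's coefficient for two 0/1 profiles; joint absences are ignored."""
    both = sum(1 for xi, yi in zip(x, y) if xi and yi)
    either = sum(1 for xi, yi in zip(x, y) if xi or yi)
    return both / either if either else None


# (PersonId, behavior1..behavior4) for each event in the example data.
events = [
    (1, [1, 0, 1, 1]), (1, [0, 1, 0, 0]), (1, [1, 1, 1, 1]), (1, [0, 0, 0, 0]),
    (2, [0, 1, 1, 0]), (2, [1, 1, 0, 0]), (2, [0, 0, 1, 1]),
]

linked, unlinked = [], []
for (person_1, profile_1), (person_2, profile_2) in combinations(events, 2):
    score = jaccard_binary(profile_1, profile_2)
    if score is None:
        continue  # both events show no behaviours, coefficient undefined
    (linked if person_1 == person_2 else unlinked).append(score)

print("linked pairs:", len(linked), "mean Jaccard:", mean(linked))
print("unlinked pairs:", len(unlinked), "mean Jaccard:", mean(unlinked))
```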

I don't agree with Rich that you need over ten offenses. There are two things you might do with this type of analysis, either 1) link up a series of offenses with unknown offenders, or 2) try to link a particular offense to a currently known offender in your database.

For 1, this is just based on the observed distances, so even if you only have offenders with 2 arrests in the sample you would probably still be able to do this if you had many different arrested people (think of every pairwise comparison as its own sample). For 2, it all depends on how rare the different categories are. For example, breaking into a house via the back door is probably a pretty regular MO, but pushing in the A/C unit is a rarer occurrence. (This is related to Rich's idea of looking at the overall rate for the whole sample; Naive Bayes is one example of this.) For 2, having more offenses helps link to a particular offender, but I bet you could do OK with fewer than 10 in many circumstances.
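
To see how rare each category is, the overall rates can be tabulated both per offence and per person. A short Python sketch on the toy example data follows; the variable names are made up, and this only computes base rates, not a Naive Bayes model.

```python
# Toy data: person id -> list of offences, each a 0/1 profile over four behaviours.
offences_by_person = {
    1: [[1, 0, 1, 1], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]],
    2: [[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]],
}
n_behaviours = 4
all_offences = [o for offences in offences_by_person.values() for o in offences]

# Per-offence rate: share of all offences in which the behaviour occurs.
per_offence_rate = [
    sum(o[j] for o in all_offences) / len(all_offences)
    for j in range(n_behaviours)
]

# Per-person rate: share of offenders who show the behaviour at least once.
per_person_rate = [
    sum(any(o[j] for o in offences) for offences in offences_by_person.values())
    / len(offences_by_person)
    for j in range(n_behaviours)
]

print(per_offence_rate)
print(per_person_rate)
```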

You might want to check out Mike Porter's R package crimelinkage, https://cran.r-project.org/web/packages/crimelinkage/index.html. He has a bunch of functions to cluster the events, and the way he manages the data may make your life easier than our suggestions anyway.
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/