|
SSD (the sum of squared differences) is calculated between the participant
and the control, so it is a property of each participant-control pair and I can't separate the two. Participants and controls don't have any duplicates on their own. However, each participant is paired with every possible control before SSD is calculated, so the same control appears in many pairs.

Thank you for looking at this problem.

Regards,
San

On Wed, Sep 9, 2009 at 4:19 AM, Gene Maguin <[hidden email]> wrote:
> San,
>
> On one hand it seems like a solution to your question is pretty simple given
> a couple of assumptions. 1) Participants and controls can be separated into
> two datasets. 2) SSD is a property of controls only. Given this, sort the
> controls file by SSD (I assume that any duplicates have already been removed
> from the controls file.), select the first 4000 controls and match those
> back to the participants without using the BY subcommand.
>
> On the other hand, I'll bet this scheme isn't correct. Could that be true?
> If so, what's missing in the explanation of the problem?
>
> Gene Maguin
>
> > I'm wondering if anyone can help me with an algorithm (in SPSS) used
> > for matching participants to controls.
> >
> > I have about 4000 participants and 700,000 potential controls.
> >
> > Just to make it simple, say that we have a participant, potential
> > control and squared differences as:
> >
> > DATA LIST FREE / Participant (A10) pControl (A10) SSD.
> > BEGIN DATA
> > P1 C1 0.7
> > P1 C2 0.4
> > P1 C3 0.99
> > P1 C4 0.29
> > P2 C1 0.56
> > P2 C2 0.39
> > P2 C3 0.2
> > P2 C4 0.32
> > P3 C2 0.37
> > P3 C4 0.27
> > P3 C6 0.11
> > P3 C7 0.03
> > P4 C1 0.04
> > P4 C2 0.04
> > P4 C4 0.05
> > P4 C6 0.89
> > END DATA.
> >
> > Now, I need to find a good NON repeated control for each participant
> > where the total of SSD is minimised. Key points here are 'non
> > repeated' and 'minimised total SSD'.
> >
> > Currently I'm simply sorting by SSD, then picking from the top and
> > making sure I don't pick the same control again. This process doesn't
> > necessarily minimise the total sum of squared differences.
> >
> > I'm reasonably comfortable with LOOP, DO REPEAT and macros.
> >
> > Any help would be greatly appreciated.
> >
> > Regards,
> > San

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
|
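For readers following along outside SPSS, here is a minimal Python sketch of the greedy pick described in the original question (sort by SSD, take pairs from the top, skip any control already used). `PAIRS` is the sample from the DATA LIST; `TOY` and the function names are hypothetical and only there to illustrate why greedy need not minimise the total.

```python
# Sample pairs from the DATA LIST in the original question.
PAIRS = [
    ("P1", "C1", 0.70), ("P1", "C2", 0.40), ("P1", "C3", 0.99), ("P1", "C4", 0.29),
    ("P2", "C1", 0.56), ("P2", "C2", 0.39), ("P2", "C3", 0.20), ("P2", "C4", 0.32),
    ("P3", "C2", 0.37), ("P3", "C4", 0.27), ("P3", "C6", 0.11), ("P3", "C7", 0.03),
    ("P4", "C1", 0.04), ("P4", "C2", 0.04), ("P4", "C4", 0.05), ("P4", "C6", 0.89),
]

# Hypothetical 2x2 case (not from the thread) where greedy is not optimal.
TOY = [("P1", "C1", 0.10), ("P1", "C2", 0.20),
       ("P2", "C1", 0.15), ("P2", "C2", 0.90)]


def greedy_match(pairs):
    """Walk the pairs in ascending SSD order and keep a pair only if neither
    its participant nor its control has been used already."""
    matched = {}
    used_controls = set()
    for participant, control, ssd in sorted(pairs, key=lambda p: p[2]):
        if participant in matched or control in used_controls:
            continue
        matched[participant] = (control, ssd)
        used_controls.add(control)
    return matched


def total_ssd(matched):
    """Sum of SSD over the chosen participant-control pairs."""
    return sum(ssd for _, ssd in matched.values())


print(greedy_match(PAIRS), round(total_ssd(greedy_match(PAIRS)), 2))
print(greedy_match(TOY), round(total_ssd(greedy_match(TOY)), 2))
```

On the thread's sample, greedy happens to give every participant its own cheapest control, so nothing is lost. On the toy case greedy takes P1-C1 first and is then forced into P2-C2, for a total of 1.00, whereas P1-C2 plus P2-C1 totals 0.35.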
|
San,
Ok. This is much more difficult. So, let's get some facts out on the table. Tell me about the SSD computation. Not the formula, that's easy. But, how many variables are included in the SSD computation? May I assume that participants and controls have exactly the same variables that will be used in the SSD computation? I know that's a stupid question but stranger things have been done. And, are the variables all dichotomous or quasi-continuous (i.e., Likert) or continuous (e.g., height or weight)? Last, and you stated this earlier but I don't recall, how many cases and how many controls?

Gene Maguin
|
|
Thank you for looking at this problem.
You can assume the SSD is already calculated. SSD is calculated based on the previous 12 months' consumption. I had only SSD in my sample data list to make things simple. In reality I include correlation values and a few other variables depending on the task. You could treat SSD as a 'score' in this problem.

Basically,
1. Participants (4000) and potential controls (600,000) are in separate files.
2. Each participant is paired with each potential control after splitting into subgroups (about 50 subgroups). That is about 50 million records: (4000 x 600000)/50 = 48 million.
3. A scoring variable (e.g. SSD) is calculated. This is the sample data list I provided with just pairs and SSD.

Now a best match needs to be found for each participant under the following conditions:
1. A particular control cannot be used more than once.
2. Total SSD should be minimised.

Until now, I haven't worried about minimising the total SSD.

Minimisation of the total SSD is an integer programming problem, formulated below.
For n participants and k potential controls (where k > n), let a_ij = 1 if control j is assigned to participant i, else 0.

Then minimise
$$\sum_{i=1}^{n} \sum_{j=1}^{k} a_{ij}\,\mathrm{SSD}_{ij}$$
subject to each participant receiving exactly one control,
$$\sum_{j=1}^{k} a_{ij} = 1 \quad \text{for } i = 1, \dots, n,$$
each control being used at most once,
$$\sum_{i=1}^{n} a_{ij} \le 1 \quad \text{for } j = 1, \dots, k,$$
and
$$a_{ij} \in \{0, 1\} \quad \text{for all } i, j.$$

I have no idea how to implement this in SPSS.

Regards,
San
|
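The formulation above is the classic rectangular assignment problem, so outside SPSS it can be solved exactly without a general integer-programming solver. What follows is a minimal pure-Python sketch of the Hungarian (Kuhn-Munkres) algorithm, not anything proposed in the thread; the function names are hypothetical, and pairs that were never scored are assumed to be forbidden (infinite cost).

```python
INF = float("inf")

# Sample pairs from the DATA LIST earlier in the thread.
PAIRS = [
    ("P1", "C1", 0.70), ("P1", "C2", 0.40), ("P1", "C3", 0.99), ("P1", "C4", 0.29),
    ("P2", "C1", 0.56), ("P2", "C2", 0.39), ("P2", "C3", 0.20), ("P2", "C4", 0.32),
    ("P3", "C2", 0.37), ("P3", "C4", 0.27), ("P3", "C6", 0.11), ("P3", "C7", 0.03),
    ("P4", "C1", 0.04), ("P4", "C2", 0.04), ("P4", "C4", 0.05), ("P4", "C6", 0.89),
]


def build_cost_matrix(pairs):
    """Rows are participants, columns are controls; unscored pairs cost INF."""
    participants = sorted({p for p, _, _ in pairs})
    controls = sorted({c for _, c, _ in pairs})
    row = {p: i for i, p in enumerate(participants)}
    col = {c: j for j, c in enumerate(controls)}
    cost = [[INF] * len(controls) for _ in participants]
    for p, c, ssd in pairs:
        cost[row[p]][col[c]] = ssd
    return participants, controls, cost


def solve_assignment(cost):
    """Hungarian algorithm for n rows <= m columns.
    Returns a list giving the column assigned to each row."""
    n, m = len(cost), len(cost[0])
    if n > m:
        raise ValueError("need at least as many controls as participants")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based)
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (0 = free)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            if delta == INF:
                raise ValueError("no feasible assignment for this participant")
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:    # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    col_of_row = [None] * n
    for j in range(1, m + 1):
        if p[j]:
            col_of_row[p[j] - 1] = j - 1
    return col_of_row


def optimal_match(pairs):
    """Map each participant to (control, SSD) with minimum total SSD."""
    participants, controls, cost = build_cost_matrix(pairs)
    return {participants[i]: (controls[j], cost[i][j])
            for i, j in enumerate(solve_assignment(cost))}


print(optimal_match(PAIRS))
```

The running time grows roughly as n² × k, so solving one subgroup at a time, as in the splitting step above, keeps each problem small. If SciPy is available, `scipy.optimize.linear_sum_assignment` solves the same rectangular problem and would be the more practical choice.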
|
SPSS is not really set up to handle integer programming problems, but from a practical point of view, is this really necessary? You have 4000 participants to match and 600,000 possibilities. Have you tried to see whether you can get an exact match for each control using a simpler approach? With that many controls, choosing any control that exactly matches the case (without replacement) would have pretty good behavior, I'd expect, if the two samples have a reasonably similar distribution.
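As a rough illustration of the simpler approach suggested here, the following Python sketch pairs each participant with any unused control that has exactly the same matching key. The keys (a subgroup label and a consumption band) and all names are hypothetical; the thread does not say which variables would be matched exactly.

```python
def exact_match(participants, controls):
    """participants / controls: dict of id -> matching key (a tuple).
    Each control is handed out at most once (matching without replacement)."""
    pool = {}
    for control_id, key in controls.items():
        pool.setdefault(key, []).append(control_id)

    matches, unmatched = {}, []
    for participant_id, key in participants.items():
        candidates = pool.get(key)
        if candidates:
            matches[participant_id] = candidates.pop()  # remove so it is not reused
        else:
            unmatched.append(participant_id)
    return matches, unmatched


# Hypothetical keys: (subgroup, consumption band).
participants = {"P1": ("A", 1), "P2": ("A", 1), "P3": ("B", 2), "P4": ("C", 9)}
controls = {"C1": ("A", 1), "C2": ("B", 2), "C3": ("A", 1), "C4": ("B", 3)}

matches, unmatched = exact_match(participants, controls)
print(matches, unmatched)
```

Participants left in `unmatched` would need a fallback, such as a nearest-score match within the same subgroup.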
|
|
Johnny
Not sure about on-line teaching, but Malcolm
Williams at Plymouth University (UK) is interested in this area and has
published a book or two. He's one of my old students and we have a
standing joke about measurement: he sent me a copy of his first book and had
scribbled a note about it inside the cover. There's also an
e-journal that you may find of interest.
Check out:
http://www.plymouth.ac.uk/pages/dynamic.asp?page=staffdetails&id=mwilliams
http://www.plymouth.ac.uk/pages/view.asp?page=22825
John Hall
|
|
In reply to this post by Peck, Jon
You are right Jon. It is not really necessary in this case.
Currently I'm doing as you said, "choosing any control that exactly matches the case (without replacement)", and it does produce pretty good behavior. I'm just preparing for the next evaluation, where I may not have the luxury of so many participants and potential controls.

Regards,
San
|
