David is correct on the matching ability
with these tiny datasets. Using exact matching, FUZZY matched only
one case.
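
As a rough illustration of the difference between exact and tolerance-based matching, here is a minimal Python sketch. It is not FUZZY's actual algorithm: the function names, the greedy first-come assignment, and the small floating-point allowance are all assumptions made for this example only.

```python
from typing import Dict, Hashable, Optional, Sequence

# Allowance for floating-point noise (e.g. 0.8 - 0.7 is slightly more than 0.1).
# This constant is an assumption of the sketch, not a FUZZY setting.
EPS = 1e-9


def within_fuzz(demander: Sequence[float], supplier: Sequence[float],
                fuzz: Sequence[float]) -> bool:
    """True if every key differs by no more than its tolerance."""
    return all(abs(d - s) <= f + EPS
               for d, s, f in zip(demander, supplier, fuzz))


def greedy_match(demanders: Dict[Hashable, Sequence[float]],
                 suppliers: Dict[Hashable, Sequence[float]],
                 fuzz: Sequence[float]) -> Dict[Hashable, Optional[Hashable]]:
    """Give each demander the first unused supplier within tolerance.

    Suppliers are used at most once (matching without replacement).
    Unmatched demanders map to None.
    """
    used = set()
    result: Dict[Hashable, Optional[Hashable]] = {}
    for d_id, d_keys in demanders.items():
        result[d_id] = None
        for s_id, s_keys in suppliers.items():
            if s_id not in used and within_fuzz(d_keys, s_keys, fuzz):
                used.add(s_id)
                result[d_id] = s_id
                break
    return result


if __name__ == "__main__":
    # Hypothetical keys in the order (sex, age, abctot).
    demanders = {1: (1, 5, 0.3), 2: (2, 7, 0.8), 3: (1, 2, 0.5)}
    suppliers = {101: (1, 5, 0.3), 102: (2, 8, 0.7), 103: (1, 4, 0.9)}
    print("exact:", greedy_match(demanders, suppliers, (0, 0, 0)))
    print("fuzzy:", greedy_match(demanders, suppliers, (0, 1, 0.1)))
```

With a tolerance of 0 on every key, only identical key values pair up. Loosening the tolerance on age and abctot lets near neighbours match, but a demander with no supplier inside the tolerance window still stays unmatched.
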
If I add some tolerance, FUZZY can do a little better, but inspecting the data by hand shows, at least for my random draws, that you wouldn't get far.

FUZZY works with multiple datasets. I generated demander and supplier datasets, size 8 and 20, respectively. Here's the syntax with some fuzz in the matches. It just adds the id of the matched supplier case to its demander case. There are various other options for the output dataset. This syntax matches on sex, age, and abctot with fuzz of 0 for sex, 1 for age, and .1 for abctot.

get file="c:/temp/demander.sav".
dataset name demander.
get file="c:/temp/supplier.sav".
dataset name supplier.
FUZZY DEMANDERDS=demander SUPPLIERDS=supplier BY= sex age abctot
  FUZZ = 0 1 .1 SUPPLIERID=sid NEWDEMANDERIDVARS=supplierid.

Here is the result:

Case Control Matching Statistics

Match Type                          Count
Exact Matches                           1
Fuzzy Matches                           4
Unmatched Including Missing Keys        3
Unmatched with Valid Keys               3

A few weeks ago I did a test with 2000 demander cases and 60000 supplier cases using random data, mainly to see how much memory FUZZY's tables would require. Almost all cases were matched, and the memory usage was only 25 MB. It took a while to run, but it appears that it would work with quite large datasets.

Regards,

Jon Peck
Senior Software Engineer, IBM
[hidden email]
312-651-3435

From: David Marso <[hidden email]>
To: [hidden email]
Date: 03/31/2011 07:12 PM
Subject: Re: [SPSSX-L] Obtaining a matched control group (A final Nail)
Sent by: "SPSSX(r) Discussion" <[hidden email]>

I really wouldn't expect ANYTHING to work well with those sample sizes and distributions ;-)
My code should be pretty much usable for reasonably large samples.
How does Jon's Fuzzy do with this data?

On Thu, Mar 31, 2011 at 7:08 PM, hillel vardi <[hidden email]> wrote:
> Shalom
>
> After thinking over all the other answers I am quite sure that using Aggregate, Lag
> or Rank will not work.
> The reason is that the assumption that there will be controls in all
> the groups is not met in all situations.
> Here is an example using David Marso's program (I only reduced the number of
> cases to 8 and controls to 20).
>
> input program.
> loop sex= 1 to 2.
> loop #=1 to 4.
> compute age=trunc(uniform(10)).
> compute abctot = trunc(uniform(10))/10.
> compute mhprob=1.
> leave sex.
> end case.
> end loop.
> end loop.
> loop sex= 1 to 2.
> loop #=1 to 10.
> compute age=trunc(uniform(10)).
> compute abctot = trunc(uniform(10))/10.
> compute mhprob=0.
> leave sex.
> end case.
> end loop.
> end loop.
> end file.
> end input program.
> string datamark(a8).
> COMPUTE datamark=CONCAT("DATA",STRING($CASENUM,N4)).
> exe.
> COMPUTE ID=$CASENUM.
> COMPUTE SCRAMBL=UNIFORM(1).
> RANK SCRAMBL BY SEX AGE ABCTOT mhPROB.
> IF MHPROB=0 ID0=ID.
> IF MHPROB=1 ID1=ID.
> AGGREGATE OUTFILE=* /BREAK=sex age ABCTOT rscrambl /id0 id1=max(id0 id1).
> COMPUTE MATCH=NOT(MISSING(ID1)) AND NOT(MISSING(ID0)).
> FREQ MATCH.
>
> Hillel Vardi
> BGU
>
> On 31/03/2011 15:52, David Marso wrote:
>>
>> Hi Ivana,
>> You are very welcome!
>> I was thinking on this further after an interesting email from Gene regarding
>> sequences (similar to Hillel Vardi's post last night). I came up with the
>> following tidbit which is much easier than my previous post and has the
>> added feature of being almost completely intuitive. Another nice benefit
>> is that it does not require a SORT and in my tests is a KEEPER ;-).
>>
>> COMPUTE ID=$CASENUM.
>> COMPUTE SCRAMBL=UNIFORM(1).
>> RANK SCRAMBL BY SEX AGE ABCTOT mhPROB.
>> IF MHPROB=0 ID0=ID.
>> IF MHPROB=1 ID1=ID.
>> AGGREGATE OUTFILE=* /BREAK=sex age ABCTOT rscrambl /id0 id1=max(id0 id1).
>> COMPUTE MATCH=NOT(MISSING(ID1)) AND NOT(MISSING(ID0)).
>> FREQ MATCH.
>>
>> Comments:
>> RANK is able to construct 'counters' BY strata without the relevant cases
>> being contiguous. NICE.
>> After the AGGREGATE the file will have the strata variables (and the paired
>> IDs, ID0 and ID1) but not the MHPROB variable. No problem, since this
>> information is implied by the presence/absence of ID0 and ID1.
>>
>> Taking it further:
>> One could segregate the MATCH cases into a separate file, delete them from
>> the working file, and then rerun the code after doing a VARSTOCASES (i.e.
>> restoring ID from ID0 and ID1). In this case I would probably:
>>
>> COMPUTE a random variable and sort on it, then use a variant of the RANK
>> as:
>> RANK ABCTOT BY SEX AGE mhPROB (may need to specify TIES to deal with
>> duplicate values in ABCTOT?).
>> This would build RANKS of ABCTOT within the strata and a later AGGREGATE
>> would group them together as previously (fuzzy match within the ranked
>> values of ABCTOT).
>>
>> NOTE: In contrast to Gene's example I do not spread the data elements, I
>> just store the IDs. To map the data to the IDs will simply require a
>> VARSTOCASES to make the file long (that's all you need to carry),
>> SORT CASES BY ID
>> MATCH FILES into the SORTED detail level file.
>> Hope this helps,
>> David
>>
>>
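
For readers who want to see the pairing logic of the RANK/AGGREGATE approach outside SPSS, here is a minimal Python sketch of the same idea: within each stratum, cases and controls are put in random order separately, and the k-th control is paired with the k-th case. The function name, the dictionary-based row layout, and the seed parameter are assumptions made for illustration; this is not a translation of anyone's production code.

```python
import random
from collections import defaultdict
from itertools import zip_longest
from typing import Dict, List


def pair_within_strata(rows: List[Dict], seed: int = 0) -> List[Dict]:
    """Pair controls (mhprob=0) with cases (mhprob=1) inside each stratum.

    A stratum is one combination of (sex, age, abctot). Shuffling each
    group plays the role of ranking a random variable within the group,
    and zip_longest plays the role of aggregating on stratum plus rank.
    """
    rng = random.Random(seed)
    strata = defaultdict(lambda: {0: [], 1: []})
    for row in rows:
        key = (row["sex"], row["age"], row["abctot"])
        strata[key][row["mhprob"]].append(row["id"])

    pairs = []
    for key, groups in strata.items():
        rng.shuffle(groups[0])
        rng.shuffle(groups[1])
        # The shorter group is padded with None, so leftover cases or
        # controls show up as unmatched rows instead of being dropped.
        for id0, id1 in zip_longest(groups[0], groups[1]):
            pairs.append({
                "stratum": key,
                "id0": id0,  # control id, or None
                "id1": id1,  # case id, or None
                "match": id0 is not None and id1 is not None,
            })
    return pairs


if __name__ == "__main__":
    # Hypothetical data: one stratum with two cases and one control,
    # and one stratum that contains a control but no case.
    rows = [
        {"id": 1, "sex": 1, "age": 5, "abctot": 0.3, "mhprob": 1},
        {"id": 2, "sex": 1, "age": 5, "abctot": 0.3, "mhprob": 1},
        {"id": 3, "sex": 1, "age": 5, "abctot": 0.3, "mhprob": 0},
        {"id": 4, "sex": 2, "age": 7, "abctot": 0.8, "mhprob": 0},
    ]
    for pair in pair_within_strata(rows):
        print(pair)
```

The sketch also makes the limitation raised in this thread visible: a stratum with no control (or no case) can only produce unmatched rows, so with small samples spread over many strata most rows end up with match set to False.
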