A simple but pesky problem.
I want to draw a 0.1% random sample out of about 200 million cases. With "SAMPLE .01." it takes ages. Isn't there a faster way? The sample does not have to be perfectly random. I just need a smaller file to test the routines. Thanks
Dr. Frank Gaeth
|
Frank
I would do this in a two stage process - use sample to get a file with the first 2-3 million cases and sample that; not perfect but serves as a means of testing the process. I only have 120,000,000 cases to deal with and that is the process I use.
Best Wishes
John S. Lemon
IT Services - Student Liaison Officer
University of Aberdeen
|
Hi, thanks!
Unfortunately, the file is sorted, so taking only the first cases would mean that certain important data is missing (like year = 2013).
Dr. Frank Gaeth
|
Frank
Do the cases have a numeric ID? If so, you could try deriving something from the ID. If not, I just tried this on a much smaller file from the 2011 British Social Attitudes Survey (N=3311).
compute copyid = ($casenum).
format copyid (f10.0).
desc copyid /sta min max.
compute sampleid = trunc (copyid/100).
format sampleid (f10.0).
desc var = sampleid /sta min max.
select if sampleid eq 7.
desc var = sampleid /sta min max.
show n.
So it should work for you if you divide the copyid by 100000, which for 200 million cases yields a sampleid in the range 0 to 2000.
compute copyid = ($casenum).
format copyid (f10).
desc copyid /sta min max.
compute sampleid = trunc (copyid/100000).
desc var = sampleid /sta min max.
You’ll still have 200 million cases, but you can then select any single sampleid value and end up with a block of 100,000 consecutive cases. Anyway, worth a try. Don’t forget to save the working file with a new name!
John F Hall (Mr)
[Retired academic survey researcher]
Website: www.surveyresearch.weebly.com
SPSS start page: www.surveyresearch.weebly.com/spss-without-tears.html
|
In reply to this post by drfg2008
Several SQL methods are described here: http://www.petefreitag.com/item/466.cfm
We use SQL Server, so ordering all the data by a random variable and then taking the top n also takes ages :-(. Given that it's Friday today anyway, I would just draw a sample with SAMPLE.
|
Try one of these (untested).
temporary.
compute ranvar= rv.uniform(0,1).
select if ranvar le .01.
xsave . . .
or systematic selection with a random start. Use a table of random numbers, or the least significant digits on a currency bill, to get e.g. 77.
* count cases in a scratch variable; $casenum does not behave as expected inside select if.
compute #seq = #seq + 1.
select if mod(#seq,100) eq 77.
xsave . . .
Art Kendall
Social Research Consultants |
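For the 0.1% sample asked for in the original post, the two ideas above might be sketched as follows. This is a minimal, unbenchmarked sketch: the file names are hypothetical, SAVE after TEMPORARY is used in place of the elided XSAVE, and the cutoffs are .001 and 1000 rather than .01 and 100. The systematic variant counts cases in a scratch variable rather than testing $CASENUM directly in SELECT IF, because the SELECT IF documentation warns that $CASENUM does not give the expected result there.

```spss
* Assumption: bigfile.sav and both output names are hypothetical.
GET FILE='bigfile.sav'.

* Variant 1: keep each case with probability 0.001.
TEMPORARY.
SELECT IF (RV.UNIFORM(0,1) LT 0.001).
SAVE OUTFILE='sample_random.sav'.

* Variant 2: keep every 1000th case, starting at case 77.
* The scratch variable #seq starts at 0 and keeps its value across cases.
COMPUTE #seq = #seq + 1.
TEMPORARY.
SELECT IF (MOD(#seq, 1000) EQ 77).
SAVE OUTFILE='sample_systematic.sav'.
```

Either variant still reads all 200 million cases once, so it will not be instant, but each needs only a single data pass and the selected cases are spread over the whole sorted file.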
In reply to this post by drfg2008
Why not just use
select if rv.uniform(0,1) lt .001.
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
|
Exactly, that is what takes so much time:
select if rv.uniform(0,1) lt .001.
input program.
loop a =1 to 2*10**8 by 1.
end case.
end loop.
end file.
end input program.
EXECUTE.
select if rv.uniform(0,1) lt .001.
EXECUTE.
Dr. Frank Gaeth
|
In reply to this post by drfg2008
Frank,
Why not just build some simulation data if all you are doing is testing routines?
See
INPUT PROGRAM.
LOOP.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
Leaving you to flesh out any specific details!
|
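One way the INPUT PROGRAM skeleton might be fleshed out is sketched below. The variable names, the number of cases, the year range, and the distributions are all assumptions made for illustration; they would need to be changed to mirror the structure of the real file.

```spss
* Assumption: id, year and score are hypothetical variables.
* Assumption: 200000 cases and years 2000 to 2013 are placeholder values.
INPUT PROGRAM.
LOOP id = 1 TO 200000.
COMPUTE year = TRUNC(RV.UNIFORM(2000, 2014)).
COMPUTE score = RV.NORMAL(50, 10).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
```

Because no large file is read at all, the run time depends only on the number of simulated cases, which makes this attractive when the goal is just to check that the routines run.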
Or, if the file is in order by date, and you want to see tables, etc., that are generally realistic, you could use N OF CASES to pick up the first few thousand rows, and randomize the variable for Year so that it covers the whole range.
--
Rich Ulrich
|
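A rough sketch of that suggestion, assuming the year variable is called year and that the full file covers 2000 to 2013 (both are guesses, as are the file names and the 5000-case limit):

```spss
* Assumption: bigfile.sav, test_head.sav, year and the year range are hypothetical.
GET FILE='bigfile.sav'.
N OF CASES 5000.
* Overwrite year with a uniformly drawn integer from 2000 to 2013.
COMPUTE year = TRUNC(RV.UNIFORM(2000, 2014)).
SAVE OUTFILE='test_head.sav'.
```

The other variables keep their real values from the first rows of the file, so only the year distribution is artificial.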
I think I will simply generate random data, as David suggested.
It is funny: even when picking up only the first thousand rows with USE 1 thru 1000 /permanent, I ended up waiting for ages (although I have a 64-bit Windows 7 Professional machine with 16 GB of RAM).
Dr. Frank Gaeth
|
I do not know if it is still true, but either the method of generating a random number from 0 to 1 and taking cases where the number was le .001, or the method of systematic selection with a random start, should be very quick. At least it used to be that by default SPSS "looked ahead", and if there were no more procedures in the syntax it would not create the binary scratch files. In the olden days, this was a major advantage of SPSS over SAS, because SAS by default always did the "data step".
Art Kendall
Social Research Consultants |