sample


sample

drfg2008
A simple but pesky problem.

I want to draw a 0.1% random sample out of about 200 million cases. With "SAMPLE  .01." it takes ages. Isn't there a faster way? The sample does not have to be perfectly random; I just need a smaller file to test the routines.

Thanks
Dr. Frank Gaeth
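
A note on the fraction: a 0.1% sample corresponds to a sampling fraction of .001, whereas SAMPLE .01 keeps about 1% of the cases. A minimal sketch follows; the file names (bigfile.sav, test_sample.sav) and the seed value are assumptions for illustration and do not come from the thread.

```spss
* Hypothetical file names and an arbitrary seed, chosen for illustration only.
GET FILE='bigfile.sav'.
* Fixing the seed makes the draw repeatable.
SET SEED=20130927.
* A 0.1 percent sample is a fraction of .001 rather than .01.
SAMPLE .001.
SAVE OUTFILE='test_sample.sav'.
```

With about 200 million cases this should leave roughly 200,000 cases. It still needs one full pass through the file, so it is not necessarily faster.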

Reply | Threaded
Open this post in threaded view
|

Re: sample

Lemon, John S.
Frank

I would do this as a two-stage process: first get a file with the first 2-3 million cases, then use SAMPLE on that file. It is not perfect, but it serves as a means of testing the process. I only have 120,000,000 cases to deal with, and that is the process I use.

Best Wishes

John S. Lemon
IT Services - Student Liaison Officer
University of Aberdeen
Edward Wright Building
Tel:  +44 1224 273350
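
The two-stage idea might look like the sketch below. It assumes that N OF CASES is used to cut the file down to its first 3 million cases, which is only one possible reading of the suggestion, and all file names are hypothetical.

```spss
* Stage 1: keep only the first 3 million cases in a smaller working file.
GET FILE='bigfile.sav'.
N OF CASES 3000000.
SAVE OUTFILE='stage1.sav'.

* Stage 2: sample the smaller file, where .067 of 3 million is about 200,000 cases.
GET FILE='stage1.sav'.
SET SEED=20130927.
SAMPLE .067.
SAVE OUTFILE='test_sample.sav'.
```

The intermediate SAVE keeps the two stages separate. If the big file is sorted, stage 1 only covers the leading part of the sort order.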


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of drfg2008
Sent: 27 September 2013 07:50
To: [hidden email]
Subject: sample

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/sample-tp5722260.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD




The University of Aberdeen is a charity registered in Scotland, No SC013683.


Re: sample

drfg2008
Hi, thanks!

Unfortunately the file is sorted, so taking only the first cases would leave out certain important data (like year = 2013).
Dr. Frank Gaeth


Re: sample

John F Hall

Frank

Do the cases have a numeric ID?

If so, you could try deriving something from the ID. If not, here is something I just tried on a much smaller file from the 2011 British Social Attitudes Survey (N=3311).

compute copyid =  ($casenum).
format copyid (f10.0).
desc copyid /sta min max.

Descriptive Statistics

                     N     Minimum  Maximum
copyid               3311  1        3311
Valid N (listwise)   3311

compute sampleid = trunc (copyid/100).
format sampleid (f10.0).
desc var = sampleid /sta min max.

Descriptive Statistics

                     N     Minimum  Maximum
sampleid             3311  0        33
Valid N (listwise)   3311

select if sampleid eq 7.
desc var = sampleid /sta min max.
show n.

Descriptive Statistics

                     N    Minimum  Maximum
sampleid             100  7        7
Valid N (listwise)   100

System Settings

Keyword  Description                               Setting
N        Number of cases in the working data file  100

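So it should work for you if you divide copyid by 100000. With about 200 million cases that yields a sampleid in the range 0 to 2000, each value marking a block of about 100,000 consecutive cases.

compute copyid =  ($casenum).
format copyid (f10).
desc copyid /sta min max.
compute sampleid = trunc (copyid/100000).
desc var = sampleid /sta min max.

You’ll still have 200 million cases, but you can then select any one sampleid and end up with about 100,000 cases. Since your file is sorted, a single block will not cover all years, so you may want to select several blocks spread across the range. Anyway, worth a try.  Don’t forget to save the working file with a new name!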

 

 

John F Hall (Mr)

[Retired academic survey researcher]

 

Email:   [hidden email] 

Website: www.surveyresearch.weebly.com

SPSS start page:  www.surveyresearch.weebly.com/spss-without-tears.html

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of drfg2008
Sent: 27 September 2013 09:10
To: [hidden email]
Subject: Re: sample

 



Re: sample

Albert-Jan Roskam
In reply to this post by drfg2008
----- Original Message -----

> From: drfg2008 <[hidden email]>
> To: [hidden email]
> Cc:
> Sent: Friday, September 27, 2013 9:09 AM
> Subject: Re: [SPSSX-L] sample
>
> Hi, thanks!
>
> Unfortunately the file is sorted and this would mean that certain important
> data is missing (like year = 2013)

Several SQL methods are described here: http://www.petefreitag.com/item/466.cfm
We use SQL Server, so ordering all the data by a random variable and then taking the top n also takes ages :-(. Given that it's Friday anyway, I would just draw a sample with SAMPLE.


Re: sample

Art Kendall
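Try one of these (untested). Both keep about 1% of the cases; use .001 and mod 1000 for a 0.1% sample.

temporary.
compute ranvar = rv.uniform(0,1).
select if ranvar le .01.
xsave . . .

or
systematic selection with a random start.
Use a table of random numbers, or the least significant digits on a currency bill, to get a start, e.g. 77.

* $casenum does not give the expected result directly on select if, so count cases in a scratch variable.
compute #seq = #seq + 1.
select if mod(#seq,100) eq 77.
xsave . . .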
Art Kendall
Social Research Consultants
On 9/27/2013 6:34 AM, Albert-Jan Roskam [via SPSSX Discussion] wrote:
----- Original Message -----

> From: drfg2008 <[hidden email]>
> To: [hidden email]
> Cc:
> Sent: Friday, September 27, 2013 9:09 AM
> Subject: Re: [SPSSX-L] sample
>
> Hi, thanks!
>
> Unfortunately the file is sorted and this would mean that certain important
> data is missing (like year = 2013)

Here several SQL methods are described: http://www.petefreitag.com/item/466.cfm
We use SQL server so ordering all the data by a random variable and then taking the top n also takes ages :-(. Given that today it's Friday anyway, I would just draw a sample with SAMPLE.


Re: sample

Jon K Peck
In reply to this post by drfg2008
Why not just use
select if rv.uniform(0,1) lt .001.


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        drfg2008 <[hidden email]>
To:        [hidden email],
Date:        09/27/2013 12:50 AM
Subject:        [SPSSX-L] sample
Sent by:        "SPSSX(r) Discussion" <[hidden email]>





Re: sample

drfg2008
Exactly, but that is what takes so much time: select if rv.uniform(0,1) lt .001. Here is a test with 200 million generated cases:

input program.
loop a =1 to 2*10**8 by 1.
end case.
end loop.
end file.
end input program.
EXECUTE.

select if rv.uniform(0,1) lt .001.
EXECUTE.


Dr. Frank Gaeth


Re: sample

David Marso
Administrator
In reply to this post by drfg2008
Frank,
Why not just build some simulated data if all you are doing is testing routines?
See INPUT PROGRAM.
LOOP.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
Leaving you to flesh out any specific details!
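
One way the skeleton above might be fleshed out is sketched here. The variable names (id, year, score), the number of cases, the year range 2000 to 2013 and the distributions are all assumptions for illustration; only year = 2013 is mentioned in the thread.

```spss
* Hypothetical test data: 100,000 cases with an id, a year and a score.
SET SEED=12345.
INPUT PROGRAM.
LOOP id = 1 TO 100000.
* TRUNC of a uniform draw on 0 to 14 gives whole numbers 0 to 13.
COMPUTE year = 2000 + TRUNC(RV.UNIFORM(0, 14)).
COMPUTE score = RV.NORMAL(50, 10).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
FORMATS id (F8.0) year (F4.0).
* The procedure below triggers the data pass and shows the spread of years.
FREQUENCIES VARIABLES=year.
```

END CASE sits inside the loop so that each iteration writes one case, and END FILE follows END LOOP so that the file is closed only after the last case.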

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: sample

Rich Ulrich
Or, if the file is in order by date, and you want to see tables, etc.,
that are generally realistic, you could use N OF CASES to
pick up the first few thousand rows, and randomize the variable for
Year so that it covers the whole range.

--
Rich Ulrich
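
A rough sketch of this suggestion, assuming the year variable is a plain numeric variable named year and that the real data span 2000 to 2013. The variable name, the year range and the file names are hypothetical.

```spss
* Hypothetical file names, variable name and year range.
GET FILE='bigfile.sav'.
* Keep only the first 5000 cases.
N OF CASES 5000.
SET SEED=12345.
* Overwrite year with random whole years from 2000 to 2013.
COMPUTE year = 2000 + TRUNC(RV.UNIFORM(0, 14)).
SAVE OUTFILE='test_subset.sav'.
```

The randomized years are no longer related to the other variables, so the result is suitable for testing routines but not for substantive analysis.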

> Date: Fri, 27 Sep 2013 08:19:31 -0700

> From: [hidden email]
> Subject: Re: sample
> To: [hidden email]

Re: sample

drfg2008
I think I will simply generate simulated data, as David suggested.

It is funny: even picking only the first thousand rows with USE 1 thru 1000 /permanent left me waiting for ages (although I have a 64-bit Windows 7 Professional machine with 16 GB of RAM).
Dr. Frank Gaeth


Re: sample

Art Kendall
I do not know if it is still true, but either
 the method of generating a random number from 0 to 1 and taking cases where the number is le .001,
or the method of systematic selection with a random start, should be very quick.

At least it used to be that, by default, SPSS "looked ahead" and, if there were no more procedures in the syntax, it would not create the binary scratch files.  In the olden days this was a major advantage of SPSS over SAS, because SAS by default always did the "data step".
Art Kendall
Social Research Consultants
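
In practical terms, transformations such as SELECT IF stay pending until a command that reads the data is run, so one approach is to leave out EXECUTE and let a single SAVE trigger the only data pass. A sketch, with hypothetical file names and an arbitrary seed:

```spss
* Hypothetical file names and seed, chosen for illustration only.
GET FILE='bigfile.sav'.
SET SEED=20130927.
* The selection is pending until the SAVE below reads the data.
SELECT IF (RV.UNIFORM(0,1) LT .001).
* SAVE reads the file once and writes only the selected cases.
SAVE OUTFILE='test_sample.sav'.
```

Whether this is noticeably faster on a 200 million case file is not something the thread establishes. It only removes the extra passes that separate EXECUTE commands would cause.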
On 9/28/2013 2:26 AM, drfg2008 [via SPSSX Discussion] wrote:
I think, I generate simply a random sample as David suggested.

It is funny: By picking up only the first thousand rows with USE 1 thru 1000 /permanent I also ended up waiting for ages (although I have a 64 bit 16 GIG win7 prof. machine).
Dr. Frank Gaeth
FU-Berlin


