SPSSX Discussion

sample


drfg2008

sample

A simple but pesky problem.

I want to draw a 0.1% random sample out of about 200 million cases. With "SAMPLE .01." it takes ages. Isn't there a faster way? The sample does not have to be perfectly random. I just need a smaller file to test the routines.

Thanks

Dr. Frank Gaeth
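
For reference, SAMPLE takes a proportion, so 0.1% corresponds to .001, while .01 keeps roughly 1% (about 2 million of 200 million cases). A minimal sketch of both forms of the command, assuming the file is already open. The counts in the second form are taken from the figures in the question, and only one of the two commands would be used:

```spss
* Approximate 0.1% sample: each case is kept with probability .001.
sample .001.

* Alternative form: exactly 200,000 cases from the first 200,000,000.
* sample 200000 from 200000000.
```

Neither form is claimed to be faster here; the sketch only shows which value matches the stated 0.1%.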

Lemon, John S.

Re: sample

Frank

I would do this in a two-stage process: use SAMPLE to get a file with the first 2-3 million cases and sample that. It is not perfect, but it serves as a means of testing the process. I only have 120,000,000 cases to deal with, and that is the process I use.

Best Wishes

John S. Lemon
IT Services - Student Liaison Officer
University of Aberdeen
Edward Wright Building
Tel: +44 1224 273350


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of drfg2008
Sent: 27 September 2013 07:50
To: [hidden email]
Subject: sample

-----
Dr. Frank Gaeth
FU-Berlin

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/sample-tp5722260.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

The University of Aberdeen is a charity registered in Scotland, No SC013683.


drfg2008

Re: sample

Hi, thanks!

Unfortunately, the file is sorted, so this would mean that certain important data would be missing (like year = 2013).

Dr. Frank Gaeth

John F Hall

Re: sample

Frank

Do the cases have a numeric ID?

If so, you could try deriving something from the ID. If not, I just tried this on a much smaller file from the 2011 British Social Attitudes Survey (N=3311).

compute copyid = ($casenum).

format copyid (f10.0).

desc copyid /sta min max.

Descriptive Statistics
	N	Minimum	Maximum
copyid	3311	1	3311
Valid N (listwise)	3311

compute sampleid = trunc (copyid/100).

format sampleid (f10.0).

desc var = sampleid /sta min max.

Descriptive Statistics
	N	Minimum	Maximum
sampleid	3311	0	33
Valid N (listwise)	3311

select if sampleid eq 7.

desc var = sampleid /sta min max.

show n.

Descriptive Statistics
	N	Minimum	Maximum
sampleid	100	7	7
Valid N (listwise)	100

System Settings
Keyword	Description	Setting
N	Number of cases in the working data file	100

So it should work for you if you divide the copyid by 100000; with about 200 million cases that yields a sampleid in the range 0 to 2000.

compute copyid = ($casenum).

format copyid (f10).

desc copyid /sta min max.

compute sampleid = trunc (copyid/100000).

desc var = sampleid /sta min max.

You’ll still have 200 million cases, but you can then select any single sampleid value and end up with a block of 100,000 consecutive cases. Anyway, worth a try. Don’t forget to save the working file with a new name!

John F Hall (Mr)

[Retired academic survey researcher]

Email: [hidden email]

Website: www.surveyresearch.weebly.com

SPSS start page: www.surveyresearch.weebly.com/spss-without-tears.html


Albert-Jan Roskam

Re: sample

In reply to this post by drfg2008

Several SQL methods are described here: http://www.petefreitag.com/item/466.cfm
We use SQL Server, so ordering all the data by a random variable and then taking the top n also takes ages :-(. Given that it's Friday today anyway, I would just draw a sample with SAMPLE.


Art Kendall

Re: sample

Try one of these (untested).

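temporary.
compute ranvar = rv.uniform(0,1).
select if ranvar le .01.
xsave . . .

or
systematic selection with a random start.
Use a table of random numbers, or the least significant digits of the serial number on a currency bill, to get a random start, e.g., 77.

* A scratch counter is used because $casenum does not give the expected result inside select if.
compute #seq = #seq + 1.
select if mod(#seq,100) eq 77.
xsave . . .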

Art Kendall
Social Research Consultants
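
The systematic selection above keeps 1 case in 100. A minimal sketch of the same idea scaled to the 0.1% asked for in the original question, assuming the file is already open. The start value 77 is carried over from the example, and the scratch counter #seq is an illustrative name. It is used because the SELECT IF documentation warns that $CASENUM does not produce the expected result when used directly in SELECT IF:

```spss
* Systematic 0.1% selection: every 1000th case, starting at case 77.
* #seq is a scratch variable: it starts at 0 and keeps its value across cases.
compute #seq = #seq + 1.
select if mod(#seq,1000) eq 77.
execute.
```

Because the file is sorted, a fixed interval like this still spreads the selected cases over the whole file, including the most recent years.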


Jon K Peck

Re: sample

In reply to this post by drfg2008

Why not just use
select if rv.uniform(0,1) lt .001.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621


drfg2008

Re: sample

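Exactly, that is what takes so much time: select if rv.uniform(0,1) lt .001. You can reproduce it with a generated file of 200 million cases:

* Generate a test file of 200 million cases with a single variable a.
input program.
loop a =1 to 2*10**8 by 1.
end case.
end loop.
end file.
end input program.
EXECUTE.

* Keep each case with probability .001 (about 200,000 cases expected).
select if rv.uniform(0,1) lt .001.
EXECUTE.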

Dr. Frank Gaeth

David Marso

Re: sample

Administrator

In reply to this post by drfg2008

Frank,
Why not just build out some simulation data if all you are doing is testing routines?
See INPUT PROGRAM.
LOOP.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
Leaving you to flesh out any specific details!
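
One way the skeleton above might be fleshed out, as a minimal sketch only. The variable names (id, year, score), their ranges, the seed, and the case count are all assumptions for illustration and would need to be replaced with whatever the real file contains:

```spss
* Hypothetical test file: 100,000 cases with an id, a year and a score.
set seed = 12345.
input program.
loop id = 1 to 100000.
* trunc of a uniform draw on 2000 to 2014 gives whole years 2000 to 2013.
compute year = trunc(rv.uniform(2000,2014)).
compute score = rv.normal(50,10).
end case.
end loop.
end file.
end input program.
execute.
```

Setting the seed makes the generated file reproducible between test runs.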


Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Rich Ulrich

Re: sample

Or, if the file is in order by date, and you want to see tables, etc.,
that are generally realistic, you could use N OF CASES to
pick up the first few thousand rows, and randomize the variable for
Year so that it covers the whole range.

--
Rich Ulrich
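
A minimal sketch of this suggestion, assuming the file is already open and that the year variable is simply called year. The row count of 5000 and the range 2000 to 2013 are placeholders, not values from the real data:

```spss
* Keep only the first 5000 rows of the sorted file.
n of cases 5000.
* Overwrite year with random whole years from 2000 to 2013.
compute year = trunc(rv.uniform(2000,2014)).
execute.
```

The other variables keep their original values, so tables remain broadly realistic while every year is represented.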


drfg2008

Re: sample

I think I will simply generate random test data, as David suggested.

Funnily enough, even picking up only the first thousand rows with USE 1 thru 1000 /permanent left me waiting for ages (although I have a 64-bit Windows 7 Professional machine with 16 GB).

Dr. Frank Gaeth

Art Kendall

Re: sample

I do not know if it is still true, but either the method of generating a random number from 0 to 1 and taking cases where the number is le .001, or the method of systematic selection with a random start, should be very quick.

At least it used to be that by default SPSS "looked ahead" and if there were no more procedures in the syntax it would not create the binary scratch files. In the olden days, this was a major advantage of SPSS over SAS because SAS by default always did the "data step".

Art Kendall
Social Research Consultants
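
In the spirit of the point above, transformations stay pending until a procedure reads the data, so one option is to avoid any EXECUTE between the selection and the save and let a single data pass do both. A minimal sketch under that assumption; the file names are hypothetical placeholders, and nothing in this thread confirms how much time it saves on a given release:

```spss
* Read, select and save in one pass, with no intermediate execute.
get file = 'bigfile.sav'.
select if rv.uniform(0,1) lt .001.
save outfile = 'testsample.sav'.
```

The resulting smaller file can then be opened on its own for testing the routines.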
