SPSSX Discussion

Random sampling & matrix of histograms problem

David Marso

Re: Random sampling & matrix of histograms problem

Administrator

But: How often does one really want to bootstrap with huge samples?
I think of it mainly as a small sample technique.

VARSTOCASES is only one way to do the followup on the recent post.

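* Write 100,000 uniform random numbers to the active dataset.
MATRIX.
SAVE UNIFORM(100000,1) /OUTFILE=*.
END MATRIX.

* Number consecutive blocks of 1,000 cases as samples 1 to 100.
COMPUTE samplenumber=TRUNC(($CASENUM-1)/1000)+1.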
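
A minimal sketch of the block-numbering arithmetic, written in Python rather than SPSS syntax, and assuming the intent is 100 consecutive samples of 1,000 draws each. The function name `sample_number` and the seed are hypothetical, chosen only for illustration. It shows that the division has to be truncated to give a whole sample number:

```python
import random


def sample_number(casenum: int, sample_size: int = 1000) -> int:
    """Map a 1-based case number to a 1-based sample number."""
    # Integer division plays the role of truncation here.
    return (casenum - 1) // sample_size + 1


# 100,000 uniform draws, standing in for the numbers saved from MATRIX.
rng = random.Random(20130308)  # arbitrary seed, only for repeatability
draws = [rng.random() for _ in range(100_000)]

# Label every draw with the sample it belongs to.
labels = [sample_number(i) for i in range(1, len(draws) + 1)]
```

Cases 1 to 1,000 get label 1, cases 1,001 to 2,000 get label 2, and so on, which is the grouping a later split by sample number would rely on.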
--

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Art Kendall

Re: Random sampling & matrix of histograms problem

In reply to this post by Andy W

Somehow I missed your post. Your approach is very usable.

One of the virtues of this list is that we can see different ways of accomplishing the same thing. This discussion is an example of why I believe that when we teach stat, right from the beginning people should "reference" each other's syntax.

Just a question of vocabulary.
You are drawing 9 samples of size 100?

Yes it does require writing an output file.
Yes it does require flipping the data file.

Yes it does use more code, but I am not sure that it is "more complicated" in the sense of communicating what sampling without replacement is about.
As you probably know I have a soapbox about readability. I borrowed a couple of ideas from your wording to clarify my syntax. (That did make it less compact.)

Art Kendall
Social Research Consultants

On 3/8/2013 1:03 PM, Andy W wrote:

Yes Art I see!,

I walked through the code and this time it certainly is sampling with
replacement. I tried to amend it to work with data in long format (instead
of having the data being sampled in wide column format), but I was
unsuccessful.

In general though I don't see why I would prefer this to the approach I
posted at the onset of this series of emails (feel free to enlighten me). To
make your approach work you would need to flip the original data, which is an
expensive procedure. You also need to externally write a file with XSAVE.

While you are right that these things aren't a big deal for small datasets,
this is more code, making it intrinsically more complicated. So again, why
exactly would your approach be preferable?

David,

I liked your prior MATRIX bootstrap code better than the new snippet (and
the code I provided at the beginning of the post, which is almost an exact
duplicate of what you wrote in 1996, holy poopers!
<https://groups.google.com/group/sci.stat.consult/msg/710ea4ab83ddf24a?dmode=source&pli=1>).

Mainly I'm concerned about the VARSTOCASES when either the number of
original cases is larger or the number of samples needed is larger. I
wouldn't want to stack the dataset and then sample if the OP's request was
with a population of 40,000 cases and he wanted 1,000 samples (i.e. a
stacked dataset of 40 million cases). The problem grows with the size of
the original population even if the number or size of samples needed does
not. It does plug away like a charm, though, even with 40,000 cases and
1,000 samples!

Of course, whatever procedures individuals utilize will be dependent on the
nature of the task and size of the data. I believe your MATRIX procedure
could be modified to work in a lot of situations, either by calculating the
stats right within a MATRIX loop, or by piping out to a new dataset,
calculating the stats, and iterating for the number of repetitions one
wants.

I'm thinking here of problems that are too big to practically stack the data
and use SPLIT FILE. Otherwise, I'm personally pretty cool with the solution
you posted over 16 years ago!
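
The "calculate the stats within a loop" idea above can be sketched as follows. This is a Python illustration under stated assumptions, not the MATRIX code from the thread: the function name `bootstrap_means`, the toy population, and the choice of the mean as the statistic are all hypothetical. Only one resample is held in memory at a time, so nothing of size (cases x samples) is ever built:

```python
import random
import statistics


def bootstrap_means(data, n_reps, seed=None):
    """Return the mean of each of n_reps resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_reps):
        # One resample of the same size as the data, with replacement.
        resample = rng.choices(data, k=n)
        # Keep only the statistic; the resample itself is discarded.
        means.append(statistics.mean(resample))
    return means


# Toy population (hypothetical); the thread's example was 40,000 cases.
population = [float(i) for i in range(1, 401)]
means = bootstrap_means(population, n_reps=100, seed=1)

# Size of the stacked dataset the loop avoids in the thread's example.
stacked_rows = 40_000 * 1_000  # 40 million
```

The trade-off is that the statistic has to be computed inside the loop, whereas a stacked dataset lets any procedure run once with the sample number as the split variable.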

Andy

Andy W

Re: Random sampling & matrix of histograms problem

Art,

Yes the code I initially produced was 9 samples of size 100.

It is difficult to objectively evaluate your own code. I believe none of the examples given in the thread (and the new split-off thread) are intuitive to the uninitiated. IMO David's most recent MATRIX examples are pretty simple (MATRIX isn't inherently more difficult to understand). Both mine (or, I feel, I should call it David's from 16 years ago) and yours rely on what I would consider idiosyncratic aspects of SPSS code: INPUT PROGRAM for mine, to build an empty dataset, and XSAVE for yours. MATRIX code requires many loops and indices, but I don't believe it is any more difficult to follow along.

The concept isn't simple, and I'm not sure it can be made that simple. The fact that the first several rounds of your syntax did not produce random sampling with replacement is, I believe, evidence of that. All of the examples are pretty concise, so it is IMO a bit tit-for-tat to argue that one is obviously superior in terms of readability. I've already stated why I would prefer the approach I initially wrote over the one you produced, and already admitted the grievances were minor in many situations.
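
For readers following the with/without replacement distinction, here is a small Python sketch (not any of the SPSS approaches discussed) that draws 9 samples of size 100 both ways. The population of 100 case IDs and the seed are assumptions made only for illustration:

```python
import random

rng = random.Random(42)  # arbitrary seed, only for repeatability
population = list(range(1, 101))  # hypothetical case IDs 1..100

n_samples, sample_size = 9, 100

# With replacement: a case may be drawn more than once within a sample.
with_repl = [rng.choices(population, k=sample_size) for _ in range(n_samples)]

# Without replacement: every case appears at most once within a sample.
without_repl = [rng.sample(population, sample_size) for _ in range(n_samples)]

# Count the distinct cases in each sample to see the difference.
distinct_with = [len(set(s)) for s in with_repl]
distinct_without = [len(set(s)) for s in without_repl]
```

When the sample is as large as the population, sampling without replacement just reorders the cases, while sampling with replacement leaves some cases out and repeats others. That is a quick check of which kind of sampling a given piece of syntax is doing.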

Andy

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

David Marso

Re: Random sampling & matrix of histograms problem

Administrator

For the record: That code is actually much older than 16 years. Recall it was a link to a post from my esteemed former colleague David Nichols to code I had written some time before.
IIRC: My first postings in the SPSS forums were around 1991 (soon after I began a roughly half decade stint in SPSS teksport).
It is most unfortunate that the archives have been truncated (beginning now in 1996).
The SPSSX-List had a long history before I came along and all of it is now lost to posterity.

FWIW: Most of what I posted of any value transpired between 1991 and 1996 (at which point I swapped out my teksport hat for SPSS client consulting and, needless to say, other commitments precluded extensive involvement in the day-to-day list activities). It would be nice if the pre-1996 archives could somehow be restored. Some gems in that bit-bucket for sure! Of course I might look at some of these 'gems' today, face palm myself and mutter: What the heck was I thinking ;-)

Bruce Weaver

Re: Random sampling & matrix of histograms problem

Administrator

It looks like even the Wayback Machine (http://archive.org/web/web.php) fails to exhume anything from before 1996.

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).