Advice regarding very large dataset

Mark Vande Kamp-2
I have used SPSS for a long time but am just now trying to learn about
things like vectors, loops and macros, because I am starting a new project.

We have a huge dataset of information regarding website visitor movements
through a set of web pages. The tab-delimited data are structured as a
single long record for each visitor with up to 100 page views, and each page
view is represented by many variables. A simplified schematic might be:

UserID  X1 Y1 Z1 X2 Y2 Z2....X100 Y100 Z100

Note that there are many more than 3 variables per page view and up to
400,000 records, so computation time is a big issue.

Many of the initial analyses are repeated for all 100 page views to look for
things like "entry pages" for each user, so a loop to, for example, test the
appropriate variable in each of the 100 page views (and do other types of
data processing) would seem to be an appropriate approach. Initially, I
thought the data might be imported into vector variables to facilitate this
loop approach. However, I just read (I think) that vectors are ephemeral and
not really a variable "type".

So, I'm asking for advice regarding the ways to read the data into SPSS and
do the repeated processing of the repeated groups of variables that are
necessary. As I mentioned before, computationally thrifty approaches would
be best, given the size of the dataset (in both # of variables and # of cases).

Thanks,

Mark

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Advice regarding very large dataset

Art Kendall
It is hard to say without a great deal of familiarity with your data.
However, you might consider
1) a MATCH FILES to make the X's, Y's, and Z's contiguous
2) look up DO REPEAT
3) I do not recall whether set definitions are now retained between data
sets in version 18 which I intend to install soon.
4) See if you think changing to long layout and using AGGREGATE by ID
would help
5) try to keep the data on a local disk
6) the syntax to define a list of variables for do repeat or vectors can
be cannibalized via cut-and-paste or via INSERT.

Before SPSS supported multiple files open at once and before there were
PCs with cut-and-paste across applications, I would start a new set of
syntax by editing an earlier set so that I could reuse text that was
complicated like a list of all the X's, Y's, Z's.

Hope this helps
Art Kendall
Social Research Consultants



Re: Advice regarding very large dataset

Albert-Jan Roskam
Hi,

I'd like to add:

Use SET SEED and SAMPLE (or maybe N OF CASES, depending on how your data are sorted) while you're fine-tuning your syntax.

Btw, I think what Art meant by using MATCH FILES is using it in the form: MATCH FILES / FILE = * / KEEP = x1 x2 x3 ALL. So it's not really a file match but an spssian way of re-ordering the vars.


Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before you criticize someone, walk a mile in their shoes, that way
when you do criticize them, you're a mile away and you have their shoes!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



Re: Advice regarding very large dataset

Mark Vande Kamp
In reply to this post by Art Kendall
OK. I'll try to ask a more specific question. I think the main thing I
want to know is how to set up for and make a loop structure that
efficiently deals with more than one indexed variable. So, I'll provide
an example. Recall that I have web pageview data in which each record
(starting with a unique ID) has up to 100 pageviews of data (there are
variables for 100 pageviews in each record, but the later variables are
empty for people who saw fewer pages.)

Data structure

ID PageID1 LoadTime1 UnloadTime1 PageID2 LoadTime2
UnloadTime2....PageID100 LoadTime100 UnloadTime100

So, I want to do a series of analyses using the sets of pageview
variables. Many of these analyses use more than one variable at a time,

For example, I might want to know how long people look at a "HowTo"
page depending on whether the preceding page was a "welcome" or an
"info" page. I'll write a loop below that I know won't be complete (and
probably not correct) but it should demonstrate the type of things I
want to do.

for i = 2 to 100
if (PageID(i) = "HowTo" and PageID(i-1) = "welcome")
HowToAfterWelcomeDuration = UnloadTime(i) - LoadTime(i).
if (PageID(i) = "HowTo" and PageID(i-1) = "info") HowToAfterInfoDuration
= UnloadTime(i) - LoadTime(i).
end loop.

*I understand that if people see more than one "HowTo" page after
"welcome" or "info" that this syntax will return only the last such
duration in the record and I know how to do fancier code to deal with
that situation if necessary.
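
The logic of the pseudocode above can be checked outside SPSS with a minimal sketch in plain Python. It assumes the pageviews of one record have already been split into three parallel lists; the function name `howto_durations` and the sample values are hypothetical and not taken from the actual data.

```python
def howto_durations(page_ids, load_times, unload_times):
    """Return (after_welcome, after_info) durations for one visitor record.

    Mirrors the pseudocode: when a "HowTo" page follows a "welcome" or an
    "info" page, record UnloadTime - LoadTime for that "HowTo" page.
    As in the pseudocode, a later match overwrites an earlier one.
    """
    after_welcome = None
    after_info = None
    # Start at the second pageview (index 1) so that i - 1 is always valid.
    for i in range(1, len(page_ids)):
        if page_ids[i] != "HowTo":
            continue
        duration = unload_times[i] - load_times[i]
        if page_ids[i - 1] == "welcome":
            after_welcome = duration
        elif page_ids[i - 1] == "info":
            after_info = duration
    return after_welcome, after_info


# Hypothetical record: times are seconds since the start of the visit.
pages = ["welcome", "HowTo", "info", "HowTo"]
loads = [0, 10, 40, 55]
unloads = [10, 40, 55, 100]
result = howto_durations(pages, loads, unloads)
```

Empty trailing pageviews are harmless here because their page ID never equals "HowTo", so their (possibly missing) times are never read.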

My question is how to best get all the necessary variables into an
indexable form so we can do this kind of thing with loops. We are
currently doing a really kludgy method of creating 100 repeated blocks
of SPSS syntax using Word mail-merge to replace the indexing digits at
the end of the repeated variables, but that creates huge syntax files
and is extremely cumbersome.

My initial hope was that vectors were a sort of "variable type" and we
could just read our data into vector variables. However, I'm now under
the impression that vectors are a sort of ephemeral format that goes
away after transformations are executed. They still might be the best
way to address the situation I describe, but I'm not sure how they would
be applied.

I hope this explains our issues more understandably.

Thanks for any help and/or suggestions,

Mark



Re: Advice regarding very large dataset

Richard Ristow
In reply to this post by Mark Vande Kamp-2
At 10:50 PM 11/13/2009, Mark Vande Kamp wrote:

> We have a huge dataset of information... The tab-delimited data are structured as a single long record for each visitor with up to 100 page views, and each page view is represented by many variables. A simplified schematic might be:
>
> UserID  X1 Y1 Z1 X2 Y2 Z2....X100 Y100 Z100

One of Art Kendall's suggestions (07:47 AM 11/14/2009) was,

> 4) See if you think changing to long layout and using AGGREGATE by ID would help

i.e., to

UserID PageView X Y Z

I second that, emphatically. That matches the true structure of your data. And SPSS handles large number of cases far better than it handles large numbers of variables.

Now, how to get it that way? VARSTOCASES would work, but the command would be tedious to write. (It would have to name, individually, all of the variables X1 ... Z100.) I think it would be worth every bit of the effort, though.
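
If the reshaping is done before the data reach SPSS, the wide-to-long step might look like the following minimal sketch in plain Python. It assumes a tab-delimited file with no header row, the ID in the first field, and a fixed number of variables per page view; the function name `wide_to_long`, the `vars_per_view` parameter, and the sample data are hypothetical.

```python
import csv
import io


def wide_to_long(lines, vars_per_view=3):
    """Yield [user_id, page_view, v1, v2, ...] rows from wide records.

    `lines` is any iterable of tab-delimited text lines, for example an
    open file. Page views whose fields are all empty are skipped.
    """
    reader = csv.reader(lines, delimiter="\t")
    for record in reader:
        if not record:
            continue  # ignore blank lines
        user_id, values = record[0], record[1:]
        for start in range(0, len(values), vars_per_view):
            group = values[start:start + vars_per_view]
            # Later page views are empty for visitors who saw fewer pages.
            if not any(field.strip() for field in group):
                continue
            page_view = start // vars_per_view + 1
            yield [user_id, page_view] + group


# Hypothetical two-visitor sample; the second visitor saw only one page.
sample = "u1\ta\t1\t2\tb\t3\t4\nu2\tc\t5\t6\t\t\t\n"
long_rows = list(wide_to_long(io.StringIO(sample)))
```

Because the function is a generator, it handles one record at a time and does not need to hold all 400,000 records in memory.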

Or, since you're reading your data from outside SPSS, there may be an easier way. The command REPEATING DATA is for just this purpose.

Unfortunately (from the Command Syntax Reference article "REPEATING DATA"):

> DATA Subcommand
> DATA specifies a name, location within each repeating segment, and format for each variable to be read from the repeating groups.
> ... Any input format available on the DATA LIST command can be specified on the DATA subcommand. Both FORTRAN-like and the column-style specifications can be used.

Here and elsewhere, the article assumes the equivalent of DATA LIST FIXED -- apparently it hasn't been updated since FREE and LIST became available.

It probably could be made to work with FREE or LIST; has anybody done that? Otherwise, it would take experimentation to find out.

Re: Advice regarding very large dataset

Mark Vande Kamp

 

Richard Ristow wrote:

> One of Art Kendall's suggestions (07:47 AM 11/14/2009) was,
>
> 4) See if you think changing to long layout and using AGGREGATE by ID would help
>
> i.e., to
>
> UserID PageView X Y Z
>
> I second that, emphatically. That matches the true structure of your data. And SPSS handles large number of cases far better than it handles large numbers of variables.
>
> Now, how to get it that way?

If it’s better to work with it that way, getting it formatted that way isn’t a problem because we have a C+ utility pulling the data out of the original database and unpacking some variables anyway. Formatting as you suggest isn’t a problem, it’s a minor tweak to the C+ code.

 

On the other hand, I’m trying to re-imagine our analyses with the data structure you suggest, and I’ll need to ponder them for awhile to see how readily they translate to the new format. My first guess is that one of the first steps would be to mark the first and last pageview for each UserID.
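
Marking the first and last pageview per UserID in the long layout can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the rows are already sorted by user and then by pageview number, the user ID is the first field of each row, and the name `flag_first_last` and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def flag_first_last(rows):
    """Append (is_first, is_last) flags to each long-format row.

    `rows` must be sorted by user ID (field 0) and, within each user,
    by pageview order; groupby only merges adjacent rows.
    """
    flagged = []
    for _, group in groupby(rows, key=itemgetter(0)):
        views = list(group)
        last = len(views) - 1
        for position, row in enumerate(views):
            flagged.append(tuple(row) + (position == 0, position == last))
    return flagged


# Hypothetical long-format rows: (UserID, PageView, PageID).
rows = [("u1", 1, "welcome"), ("u1", 2, "HowTo"), ("u2", 1, "info")]
flags = flag_first_last(rows)
```

A visitor with a single pageview gets both flags set, which is usually the desired behaviour when looking for entry and exit pages.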

 

Just for clarification, by saying, “SPSS handles large number of cases far better than it handles large numbers of variables” do you mean the syntax is easier, or the processing time will be shorter, or both?

 

Thanks,

 

Mark

Re: Advice regarding very large dataset

Albert-Jan Roskam
In reply to this post by Mark Vande Kamp
It's far easier to use Python for that. You could use the Cursor class to read in each (part of a) record. You could use your pseudo code for that (although i-1 probably won't work the way you want for the first item of the list).

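BEGIN PROGRAM.
import spss
# Open a cursor on the active dataset (keyword is accessType).
cur=spss.Cursor(accessType='w')
for i in range(spss.GetCaseCount()):
    # Values of the current case, one item per variable.
    vars = cur.fetchone()
    for index, varx in enumerate(vars):
        if varx == ...
            # etc

cur.close()
END PROGRAM.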

See the free spss data management book for details.
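
On the remark about i-1: in Python, index -1 refers to the last item of a sequence, so a loop that starts at the first item and looks at i-1 would silently compare the first pageview with the last one. The sketch below avoids that by pairing each pageview with its predecessor. It deliberately does not call the spss module, so it can be tried without SPSS; the tuple `case` stands in for the values of one case, and the layout (ID, PageID1, LoadTime1, UnloadTime1, ...) and the names `split_case` and `consecutive_pairs` are assumptions for illustration.

```python
def split_case(case, vars_per_view=3):
    """Split one flat case into its ID and a list of per-pageview tuples."""
    case_id, values = case[0], case[1:]
    views = [
        tuple(values[start:start + vars_per_view])
        for start in range(0, len(values), vars_per_view)
    ]
    return case_id, views


def consecutive_pairs(views):
    """Return (previous, current) pairs; the first view has no predecessor."""
    return list(zip(views, views[1:]))


# Hypothetical case: ID followed by (PageID, LoadTime, UnloadTime) groups.
case = ("u1", "welcome", 0, 10, "HowTo", 10, 40)
case_id, views = split_case(case)
pairs = consecutive_pairs(views)
```

Each element of `pairs` holds the preceding and the current pageview, so a test such as "HowTo after welcome" never needs an explicit index.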

Cheers!!
Albert-Jan


Re: Advice regarding very large dataset

Jon K Peck

I'm not clear on why vectors don't meet the requirements for this problem.  You read in your data as usual and define a vector that, in effect, overlays the variable list.  Then you can use ordinary SPSS transformation looping commands such as LOOP, with the loop index as the vector subscript.  Although the vector definition exists only during transformation processing, that seems to be the time you need it.  You can also create new variables with VECTOR.

Vector elements must all have the same type - you can't mix numbers and strings.

If you do want to go the Python route, I suggest looking at the SPSSINC TRANS extension command.  Using that, you can just write a function that deals with the transformations themselves and leave the case looping and new variable creation to the extension command to take care of.

Regards.

Jon Peck
SPSS, an IBM Company
[hidden email]
312-651-3435



From: Albert-Jan Roskam <[hidden email]>
To: [hidden email]
Date: 11/15/2009 04:21 AM
Subject: Re: [SPSSX-L] Advice regarding very large dataset
Sent by: "SPSSX(r) Discussion" <[hidden email]>





It's far easier to use Python for that. You could use the Cursor class to read in each (part of a) record. You could use your pseudo code for that (although i-1 probably won't work the way you want for the first item of the list).

BEGIN PROGRAM.
import spss
cur=spss.Cursor(accessTyep='w')
for i in range(spss.GetCaseCount()):
  vars = cur.fetchone()
  for index, varx in enumerate(vars):
   if varx == ...
    # etc

cur.close()
END PROGRAM.

See the free spss data management book for details.

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before you criticize someone, walk a mile in their shoes, that way
when you do criticize them, you're a mile away and you have their shoes!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


--- On Sun, 11/15/09, Mark Vande Kamp <[hidden email]> wrote:

> From: Mark Vande Kamp <[hidden email]>
> Subject: Re: [SPSSX-L] Advice regarding very large dataset
> To: [hidden email]
> Date: Sunday, November 15, 2009, 12:00 AM
> OK. I'll try to ask a more specific
> question. I think the main thing I
> want to know is how to set up for and make a loop structure
> that
> efficiently deals with more than one indexed variable. So,
> I'll provide
> an example. Recall that I have web pageview data in which
> each record
> (starting with a unique ID) has up to 100 pageviews of data
> (there are
> variables for 100 pageviews in each record, but the later
> variables are
> empty for people who saw fewer pages.)
>
> Data structure
>
> ID PageID1 LoadTime1 UnloadTime1 PageID2 LoadTime2
> UnloadTime2....PageID100 LoadTime100 UnloadTime100
>
> So, I want to do a series of analyses using the sets of
> pageview
> variables. Many of these analyses use more than one
> variable at a time,
>
> For example, I might want to know how long people look at a
> "HowTo"
> page depending on whether the preceding page was a
> "welcome" or an
> "info" page. I'll write a loop below that I know won't be
> complete (and
> probably not correct) but it should demonstrate the type of
> things I
> want to do.
>
> for i = 2 to 100
> if (PageID(i) = "HowTo" and PageID(i-1) = "welcome")
> HowToAfterWelcomeDuration = UnloadTime(i) - LoadTime(i).
> if (PageID(i) = "HowTo" and PageID(i-1) = "info")
> HowToAfterInfoDuration
> = UnloadTime(i) - LoadTime(i).
> end loop.
>
> *I understand that if people see more than one "HowTo" page
> after
> "welcome" or "info" that this syntax will return only the
> last such
> duration in the record and I know how to do fancier code to
> deal with
> that situation if necessary.
>
> My question is how to best get all the necessary variables
> into an
> indexable form so we can do this kind of thing with loops.
> We are
> currently doing a really kludgy method of creating 100
> repeated blocks
> of SPSS syntax using Word mail-merge to replace the
> indexing digits at
> the end of the repeated variables, but that creates huge
> syntax files
> and is extremely cumbersome.
>
> My initial hope was that vectors were a sort of "variable
> type" and we
> could just read our data into vector variables. However,
> I'm now under
> the impression that vectors are a sort of ephemeral format
> that goes
> away after transformations are executed. They still might
> be the best
> way to address the situation I describe, but I'm not sure
> how they would
> be applied.
>
> I hope this explains our issues more understandably.
>
> Thanks for any help and/or suggestions,
>
> Mark
>
>
> On Sat, 2009-11-14 at 07:47 -0500, Art Kendall wrote:
> > It is hard to say without a great deal of familiarity with your data.
> > However, you might consider
> > 1) a match files to make X's, Y's, Z's, contiguous
> > 2) look up DO REPEAT
> > 3) I do not recall whether set definitions are now retained between data
> > sets in version 18 which I intend to install soon.
> > 4) See if you think changing to long layout and using AGGREGATE by ID
> > would help
> > 5) try to keep the data on a local disk
> > 6) the syntax to define a list of variables for do repeat or vectors can
> > be cannibalized via cut-and-paste or via INSERT.
> >
> > Before SPSS supported multiple files open at once and before there were
> > PCs with cut-and-paste across applications, I would start a new set of
> > syntax by editing an earlier set so that I could reuse text that was
> > complicated like a list of all the X's, Y's, Z's.
> >
> > Hope this helps
> > Art Kendall
> > Social Research Consultants


Re: Advice regarding very large dataset

Mark Vande Kamp
On Sun, 2009-11-15 at 08:23 -0700, Jon K Peck wrote:
>
> I'm not clear on why vectors don't meet the requirements for this
> problem.

I thought they would originally, and I think they still might be a good
way, but I ran into some problems that led me to ask the question here.

> Vector elements must all have the same type - you can't mix numbers
> and strings.

Does this mean that you CAN have number vectors, and you CAN have string
vectors, but you CANNOT define both types on one command?

Thanks,

Mark


Re: Advice regarding very large dataset

Richard Ristow
In reply to this post by Jon K Peck
At 10:23 AM 11/15/2009, Jon K Peck wrote:

I'm not clear on why vectors don't meet the requirements for this problem.  You read in your data as usual and define a vector that in effect overlays the variable list.  Then you can use ordinary SPSS transformation looping commands such as LOOP and use the vector indexes as subscripts.

But, here's the data structure:

[There is] a single record for each visitor with up to 100 page views, and each page view is represented by many variables. A simplified schematic might be:

UserID  X1 Y1 Z1 X2 Y2 Z2....X100 Y100 Z100

There are many more than 3 variables per page view

It would be great to define vectors X, Y, and Z with indices 1-100. But SPSS can't do that; it requires all elements of any vector to be contiguous. You could, if all variables are numeric, define

VECTOR AllData X1 TO Z100.

but that leads to terribly clumsy code to calculate the index values.
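To make the clumsiness concrete, here is a minimal Python sketch of the index arithmetic that a single interleaved vector would need. It assumes exactly three variables per page view, in the order X, Y, Z (the real file has many more), and the helper name alldata_index is made up for illustration.

```python
# Hypothetical helper: position of one variable inside a single vector
# defined over X1 Y1 Z1 X2 Y2 Z2 ... (1-based, like SPSS vector indexes).
VARS_PER_VIEW = 3                    # assumption: only X, Y, Z per page view
OFFSET = {"X": 0, "Y": 1, "Z": 2}    # position of each stem within a view

def alldata_index(stem, view):
    """Return the 1-based index of <stem><view> in the interleaved vector."""
    return (view - 1) * VARS_PER_VIEW + OFFSET[stem] + 1

print(alldata_index("X", 1))    # 1
print(alldata_index("Z", 100))  # 300
```

Every reference inside a LOOP would have to repeat an expression of this kind, and it changes whenever the number of variables per page view changes.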

DO REPEAT does work.  It's a lengthy statement, since you have to name every variable:

DO REPEAT X = X1   X2   X3   X4   X5   X6   X7   X8   X9   X10
              X11  X12  X13  X14  [continuing to]
              X91  X92  X93  X94  X95  X96  X97  X98  X99  X100
         /Y = Y1   Y2   Y3   Y4   Y5   Y6   Y7   Y8   Y9   Y10
              ...
              Y91  Y92  Y93  Y94  Y95  Y96  Y97  Y98  Y99  Y100
        and the same for Z.     
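The long name lists do not have to be typed by hand. As a rough sketch, plain Python can generate the DO REPEAT header as text to paste into a syntax file. The function name do_repeat_header and the stems X, Y, Z are illustrative only; the body of the block and the closing END REPEAT still have to be added.

```python
def do_repeat_header(stems, n_views):
    """Build a DO REPEAT header listing <stem>1 .. <stem>n for each stem."""
    clauses = []
    for stem in stems:
        names = " ".join(f"{stem}{i}" for i in range(1, n_views + 1))
        clauses.append(f"{stem} = {names}")
    # Stand-in lists are separated by "/" and the command ends with a period.
    return "DO REPEAT " + "\n  /".join(clauses) + "."

# Three page views here to keep the output short; use 100 for the real file.
print(do_repeat_header(["X", "Y", "Z"], 3))
```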

As everybody knows, I usually advise 'unrolling' such structures to one record per event:

UserID PageView X Y Z

But it would be nice to have SPSS handle the original records more gracefully; for example, with a construct like

VECTOR X,Y,Z =X1 TO Z100.




Re: Advice regarding very large dataset

Jon K Peck

See below.

Jon Peck
SPSS, an IBM Company
[hidden email]
312-651-3435



From: Richard Ristow <[hidden email]>
To: [hidden email]
Date: 11/15/2009 07:15 PM
Subject: Re: [SPSSX-L] Advice regarding very large dataset
Sent by: "SPSSX(r) Discussion" <[hidden email]>





At 10:23 AM 11/15/2009, Jon K Peck wrote:

I'm not clear on why vectors don't meet the requirements for this problem.  You read in your data as usual and define a vector that in effect overlays the variable list.  Then you can use ordinary SPSS transformation looping commands such as LOOP and use the vector indexes as subscripts.

But, here's the data structure:

[There is] a single record for each visitor with up to 100 page views, and each page view is represented by many variables. A simplified schematic might be:

UserID  X1 Y1 Z1 X2 Y2 Z2....X100 Y100 Z100

There are many more than 3 variables per page view


It would be great to define vectors X, Y, and Z with indices 1-100. But SPSS can't do that; it requires all elements of any vector to be contiguous. You could, if all variables are numeric, define

VECTOR AllData X1 TO Z100.

but that leads to terribly clumsy code to calculate the index values.


>>>You could reorder the variables easily with a little Python code (to avoid writing out the names), or do the transformations with a small Python program.

To reorder the variables (this requires the Python plugin from Developer Central):

data list free /UserID X1 Y1 Z1 X2 Y2 Z2 X3 Y3 Z3.
begin data
999 1 11 111 2 22 222 3 33 333
end data.
dataset name xyz.

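begin program.
import spss, spssaux
# Collect the variables whose names match each pattern.
xvars = spssaux.VariableDict(pattern="X")
yvars = spssaux.VariableDict(pattern="Y")
zvars = spssaux.VariableDict(pattern="Z")
# All the X's first, then the Y's, then the Z's (alphabetical within each group).
keepers = sorted(xvars.variables) + sorted(yvars.variables) + sorted(zvars.variables)
# MATCH FILES with /KEEP rewrites the active dataset in the listed variable order.
spss.Submit("match files file=* /keep = UserID " + " ".join(keepers))
end program.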

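Note:
- The names are sorted strictly alphabetically, so X10 comes before X2. A vector defined over the reordered variables would therefore not be in page-view order unless the names are zero-padded or sorted numerically.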
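One way around the alphabetical ordering is a natural sort key that compares the numeric suffix as a number. The sketch below is plain Python with a made-up helper, natural_key. It assumes names made of letters followed by digits, and it has not been run inside SPSS.

```python
import re

def natural_key(name):
    """Split 'X10' into ('x', 10) so that the suffix sorts numerically."""
    match = re.fullmatch(r"([A-Za-z_]+)(\d+)", name)
    if match is None:
        # Names without a numeric suffix sort by name alone.
        return (name.lower(), 0)
    return (match.group(1).lower(), int(match.group(2)))

names = ["X1", "X10", "X2", "X3"]
print(sorted(names))                   # ['X1', 'X10', 'X2', 'X3']
print(sorted(names, key=natural_key))  # ['X1', 'X2', 'X3', 'X10']
```

In the program above, the corresponding change would presumably be sorted(xvars.variables, key=natural_key), and likewise for the Y and Z lists.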

HTH,
Jon Peck




Re: Advice regarding very large dataset

Richard Ristow
In reply to this post by Mark Vande Kamp
I'd written (following Art Kendall),

See if you think changing to long layout and using AGGREGATE by ID would help

i.e., to

UserID PageView X Y Z

I second that, emphatically. That matches the true structure of your data. And SPSS handles large number of cases far better than it handles large numbers of variables.

At 08:16 PM 11/14/2009, Mark Vande Kamp wrote:

I’m trying to re-imagine our analyses with the data structure you suggest, and I’ll need to ponder them for a while to see how readily they translate to the new format. My first guess is that one of the first steps would be to mark the first and last pageview for each UserID.

Sounds right.

You'll need to sort so that the UserIDs are in ascending order, and the views for each user are in chronological order, if they aren't that way already.

Marking first and last views (not tested, but standard syntax):

ADD FILES
  /FILE=*
  /BY UserID
  /FIRST=FirstView
  /LAST=LastView.
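For readers more at home in Python, the same first/last logic can be sketched outside SPSS as follows. This only illustrates what the two flag variables contain; the function name and the tuple layout are invented for the example, and the rows are assumed to be sorted already.

```python
from itertools import groupby

def mark_first_last(rows):
    """rows: (user_id, page_view) tuples, sorted by user and then by view.

    Returns (user_id, page_view, first_flag, last_flag) tuples.
    """
    marked = []
    for _, group in groupby(rows, key=lambda r: r[0]):
        views = list(group)
        for pos, (user_id, page_view) in enumerate(views):
            first = 1 if pos == 0 else 0
            last = 1 if pos == len(views) - 1 else 0
            marked.append((user_id, page_view, first, last))
    return marked

print(mark_first_last([(1, 1), (1, 2), (1, 3), (2, 1)]))
# [(1, 1, 1, 0), (1, 2, 0, 0), (1, 3, 0, 1), (2, 1, 1, 1)]
```

A user with a single page view gets both flags set, as the last row shows.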

Just for clarification, by saying, “SPSS handles large number of cases far better than it handles large numbers of variables” do you mean the syntax is easier, or the processing time will be shorter, or both?

The syntax is much easier.

Processing time, it's harder to tell. Most times, SPSS is limited by disk transfer rate, and files in both organizations are about the same size. (The 'long' file will be somewhat larger, actually.)

It may be easier to write clean code with the 'long' file, and that may indirectly speed up performance.
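As an illustration of how the long layout simplifies the code, here is a hedged Python sketch of the "HowTo after welcome or info" question from earlier in the thread. The field names PageID, LoadTime and UnloadTime come from the thread; the function name, the dict layout and the sample times are assumptions made for the example.

```python
def howto_durations(views):
    """views: one user's page views in chronological order, each a dict
    with PageID, LoadTime and UnloadTime.

    Returns {previous_page: duration} for each "HowTo" view that follows a
    "welcome" or "info" page. As in the original pseudocode, a later match
    overwrites an earlier one.
    """
    result = {}
    # Pair each view with the one that follows it.
    for prev, cur in zip(views, views[1:]):
        if cur["PageID"] == "HowTo" and prev["PageID"] in ("welcome", "info"):
            result[prev["PageID"]] = cur["UnloadTime"] - cur["LoadTime"]
    return result

views = [
    {"PageID": "welcome", "LoadTime": 0, "UnloadTime": 5},
    {"PageID": "HowTo", "LoadTime": 5, "UnloadTime": 17},
    {"PageID": "info", "LoadTime": 17, "UnloadTime": 20},
    {"PageID": "HowTo", "LoadTime": 20, "UnloadTime": 24},
]
print(howto_durations(views))  # {'welcome': 12, 'info': 4}
```

With one record per page view, "the preceding page" is simply the previous record for the same user, so no index arithmetic over 100 sets of variables is needed.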

The 'wide' file will slow performance drastically if it is VERY VERY wide, so wide that the dictionary data plus data for one case doesn't fit in RAM. But that's rare, nowadays.

Re: Advice regarding very large dataset

Richard Ristow
In reply to this post by Mark Vande Kamp-2
A further thought: I'd written (following Art Kendall) advocating structuring data in 'long' form:

UserID PageView X Y Z

instead of 'wide' form:

UserID  X1 Y1 Z1 X2 Y2 Z2....X100 Y100 Z100

At 08:16 PM 11/14/2009, Mark Vande Kamp asked:

By saying, “SPSS handles large number of cases ['long' form] far better than it handles large numbers of variables ['wide' form]” do you mean the syntax is easier, or the processing time will be shorter, or both?

I replied, but had some later thoughts:

Most times, SPSS is limited by disk transfer rate, and files in both organizations are about the same size. (The 'long' file will be somewhat larger, actually.)

That will be true if all users have nearly the same number of page views. However, if a few users have far more than the average number of page views, the 'long' file can be much smaller. It need have records only for page views that took place; the 'wide' file needs variables for the maximum number of views for any user.

(Make sure that the file doesn't have records for null views; discard them in the C+ code or, later, in SPSS.)
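The unrolling step, including the discarding of null views, might look like the following in plain Python. This is a sketch under stated assumptions: three variables per page view in file order, missing values represented as None, and a made-up function name, unroll.

```python
def unroll(user_id, wide_values, stems=("X", "Y", "Z")):
    """Turn one wide record into a list of long records, one per page view.

    wide_values holds the values in file order: X1 Y1 Z1 X2 Y2 Z2 ...
    A page view whose values are all None is treated as a null view.
    """
    width = len(stems)
    long_rows = []
    for start in range(0, len(wide_values), width):
        values = wide_values[start:start + width]
        if all(v is None for v in values):
            continue  # skip page views that never took place
        page_view = start // width + 1
        long_rows.append((user_id, page_view, *values))
    return long_rows

rows = unroll(999, [1, 11, 111, 2, 22, 222, None, None, None])
print(rows)  # [(999, 1, 1, 11, 111), (999, 2, 2, 22, 222)]
```

Inside SPSS, the VARSTOCASES command performs this kind of wide-to-long restructuring.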

Other things being anywhere near equal, a smaller file will process faster, because it uses less traffic to and from disk. Also, you'll probably use AGGREGATE for much arithmetic, and I understand that AGGREGATE is very fast indeed - likely more so than LOOP code.

-Onward, with best wishes,
 Richard