1.3 million lines and 2000 variables should take 15 minutes to sort?


1.3 million lines and 2000 variables should take 15 minutes to sort?

Matthew Pirritano
Listers,
I could really use some advice on this. I tried last week but no one replied.
I've got a large dataset of 1.3 million lines and 2000 variables. Just to sort on one variable sometimes takes 15 to 20 minutes. Sometimes, when it is still counting up by 100s, I just stop the process. I'm working in SPSS 16 with a 2.3 GHz vPro processor and 2 GB of RAM. This just doesn't seem right.
Is there anyone who has worked with such large datasets and might have a feel for what type of machine would let me work without such long pauses for processing?
Thanks,
Mat
 Matthew Pirritano, Ph.D.
Email: [hidden email]



----- Original Message ----
From: Albert-jan Roskam <[hidden email]>
To: [hidden email]
Sent: Monday, June 23, 2008 7:04:09 AM
Subject: ERROR 550

Hi list,

The code below worked last Friday, but now it consistently yields an error 550. Any idea what causes this? It works without the DATASET commands, but why not with them? The size of the data set is about 160k records; 147 vars.

Thanks in advance!

Cheers!
Albert-Jan

get file = 'out_dir/matched_TOTAL.sav'.
dataset name mysource.
dataset copy mydoubles.
dataset activate mydoubles.
aggregate outfile = * mode = addvariables / break = case_orig / count = n.


>Error # 550
>An SPSS program error has occurred: A procedure has attempted to add more
>variables to the file than it provided for in its call to OBINIT.  The
>error was detected in a call to OBPINI.  Please note the circumstances
>under which this error occurred, attempting to replicate it if possible,
>and then notify SPSS Technical Support.
>This command not executed.



>Warning # 552
>Possibly due to another error, a procedure has defined more new variables
>than it has added to the file.  All those which have been defined but not
>added will be discarded and will be unavailable for further processing.
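
Albert-Jan mentions above that the aggregation runs when the DATASET commands are left out; a minimal sketch of that working variant, reusing his file and variable names, would be:

* Same aggregation without the DATASET commands.
get file = 'out_dir/matched_TOTAL.sav'.
aggregate outfile = * mode = addvariables
  / break = case_orig
  / count = n.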


Re: 1.3 million lines and 2000 variables should take 15 minutes to sort?

Zdaniuk, Bozena-2
I found that sorting cases by a variable takes more time than any other procedure in SPSS. When I sort 100 thousand cases on two variables, I click 'run' and go make tea.
Bozena



Re: 1.3 million lines and 2000 variables should take 15 minutes to sort?

Albert-Jan Roskam
It may be easier to use the KEEP subcommand, although one might miss out on a nice cup of tea:
get file = 'd:\file.sav' / keep = <<varlist in desired order here>>.

I'd just cut and paste the variable list into Excel, sort it there, and paste it into the KEEP statement.

Cheers!!
Albert-Jan
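
As a concrete version of that KEEP trick, with made-up variable names pasted in the desired order and the result saved back out:

get file = 'd:\file.sav'
  / keep = caseid age sex region income.
save outfile = 'd:\file_reordered.sav'.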


Re: 1.3 million lines and 2000 variables should take 15 minutes to sort?

ViAnn Beadle
In reply to this post by Matthew Pirritano
You probably didn't get many replies because your original question was too general. Two things to consider here:

1) Why are you sorting? What exactly is the goal of the sort?
2) Why are you sorting the whole file?

Also, if you're going to tack your question onto somebody else's email, it's best to delete theirs; embedded threads make the email more difficult to follow ;-)


Re: 1.3 million lines and 2000 variables should take 15 minutes to sort?

mpirritano
Sorry, I don't think I've ever forgotten to delete an irrelevant thread at the end of an email before. I'm a little under the weather.

Sorting: I'm sorting because I need to restructure this large dataset, Cases to Variables. It is insurance claims data, listed as one claim per line. I want to be able to look at frequencies of diagnoses, which I now realize I don't need to restructure to do. I realized that in the middle of the last sentence.

However, in the future I will very likely want to match this claims data up with other data that I have received. I guess I'll cross that bridge...

Maybe SPSS 16 will be less buggy by then.

Thanks,
Matt

Matthew Pirritano, Ph.D.
Research Analyst IV
County of Orange
Medical Services Initiative (MSI)
[hidden email]
(714) 834-4775
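
Since the data sit as one claim per line, the diagnosis frequencies Matt mentions can indeed be read straight off the long file, with no restructure; a minimal sketch, assuming a hypothetical diagnosis-code variable named diag_code:

frequencies variables = diag_code
  / format = dfreq.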



Re: 1.3 million lines and 2000 variables should take 15 minutes to sort?

ViAnn Beadle
When you need to restructure that data, do you really need to keep all 2000 variables for analysis? If you could get down to a smaller set of variables for your analysis, things would go much, much faster. I don't think there are any significant changes from 15 to 16 that would affect sorting performance.
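
A minimal sketch of that idea, with made-up file and variable names: trim to just the variables that actually need restructuring before sorting and running CASESTOVARS.

get file = 'claims.sav' / keep = member_id claim_seq diag_code.
sort cases by member_id claim_seq.
casestovars
  / id = member_id
  / index = claim_seq.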


Re: 1.3 million lines and 2000 variables should take 15 minutes to sort?

Fry, Jonathan B.
In reply to this post by Matthew Pirritano
SORT CASES performance depends heavily on the WORKSPACE setting. For a file that large, it will be worthwhile experimenting a bit with that.

On a machine with 2 GB of RAM, I'd suggest setting WORKSPACE to 500000. I'd also suggest SET MESSAGES ON so you can see actual file sizes and resources available.

Jonathan Fry
SPSS Inc.
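
Pulled together as syntax, that advice would look something like the following (claim_id is a hypothetical stand-in for the actual sort key; the WORKSPACE value is the one suggested above for a 2 GB machine):

* Raise the workspace and turn on resource messages before the big sort.
set workspace = 500000.
set messages on.
show workspace.
sort cases by claim_id (a).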


Re: 1.3 million lines and 2000 variables should take 15 minutes to sort?

Barnett, Adrian (DECS)
In reply to this post by Matthew Pirritano
Hi Matthew

For comparison, a 77 MB file of mine with 1,046,713 records and maybe 20 (mostly long string) variables sorted on 6 variables (2 of which were 40+ characters long) in 43.24 sec on SPSS 15 (elapsed and CPU time essentially identical).

By comparison, SPSS 16 did it in 49.9 sec (CPU) and 57.7 sec (elapsed), and Stata 10/MP did it in 5.99 sec.

Another file, with 2.4 million records and around 100 (again mostly long string) variables, sorted on the same 6 long string variables in 136 sec (CPU) and 270 sec (elapsed) in version 15.

My hardware is an Intel Core 2 Duo E6550 (2.3 GHz) with 1 GB RAM, Windows XP SP3.

As others have observed, you are carrying a lot of baggage with all 2000 variables in the file while you are sorting it. It may be quicker to do all your sorting and other heavy-duty manipulations on a subset of just those variables that need to be manipulated, and join the rest back on later, once all that work is done.


Regards

Adrian Barnett
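
A minimal sketch of that split-and-rejoin approach, with entirely made-up file and variable names (and assuming each record carries a unique claim_id key):

* Do the heavy work on a slim file holding only the key plus the variables involved.
get file = 'claims.sav' / keep = claim_id diag_code service_date.
sort cases by diag_code service_date.
* ... heavy-duty manipulation goes here; suppose it yields a derived flag.
compute flagged = 0.
* Join the result back onto the full file by the key.
* MATCH FILES expects both inputs sorted by the BY variable(s).
sort cases by claim_id.
save outfile = 'claims_slim.sav' / keep = claim_id flagged.
get file = 'claims.sav'.
sort cases by claim_id.
* If the full file is already in claim_id order, the sort above can be skipped.
match files file = * / file = 'claims_slim.sav' / by claim_id.
save outfile = 'claims_rejoined.sav'.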
