Processing Large Datasets

Processing Large Datasets

DKUKEC
Dear SPSSers.

I am wondering if my computer is having issues or whether the current
processing speed is within the range of expected performance for version 23
and the following hardware.

OS: Windows 7 Enterprise, 64-bit
CPU: Intel Core i7 @ 3.40 GHz
RAM: 8.0 GB

The PARTY dataset contains 6.8 million records and 84 variables (mostly
STRING), and the COUNSEL file contains 1.3 million records.  I have read over
the numerous posts concerning processing speed; however, I was unable to
find anything current on the topic.  I reduced the number of variables
to see if that would speed up processing, but I did not see any improvement.
I did see significant improvements from using the EXECUTE command
sparingly.  The following syntax took about 25-30 minutes to process; that
time does not include the initial query from a SQL server.

Please let me know if you think this is par for the course, or whether there
are other approaches to improve performance.

Thank you
Damir


DATASET ACTIVATE PARTY WINDOW=FRONT .
DATASET COPY COUNSEL .
DATASET ACTIVATE COUNSEL WINDOW=FRONT .

 * MATCH FILES FILE = * / KEEP = CASEID PartyTypeDescription .
 * EXECUTE .

COMPUTE SOQ = DATE.MDY (04,06,2018) .
COMPUTE SOQT = TIME.HMS (09,27,00) .
COMPUTE START_OQ = (SOQ + SOQT) .
FORMATS START_OQ (DATETIME22) .
*EXECUTE .

MATCH FILES FILE=* / DROP = SOQ SOQT  .
*EXECUTE .

RECODE PartyTypeDescription
("ATTORNEY"=1)
("ASSISTANT PUBLIC DEFENDER" = 2)
("PUBLIC DEFENDER"=2)
("ASSISTANT STATE ATTORNEY" =3)
("SPECIAL PROSECUTOR / ASA"=3) INTO COUNSEL_TYPE .
*EXECUTE .


SELECT IF NOT MISSING (COUNSEL_TYPE) .
*EXECUTE .

COMPUTE END_PROC = $TIME .
FORMATS END_PROC (DATETIME22) .
*EXECUTE .

* Date and Time Wizard: TIME2_PROC.
COMPUTE  TIME2_PROC=DATEDIF(END_PROC, START_OQ, "minutes").
VARIABLE LABELS  TIME2_PROC.
VARIABLE LEVEL  TIME2_PROC (SCALE).
FORMATS  TIME2_PROC (F5.0).
VARIABLE WIDTH  TIME2_PROC(5).
EXECUTE.

*DATASET ACTIVATE COUNSEL.
FREQUENCIES VARIABLES=PartyTypeDescription COUNSEL_TYPE
  /ORDER=ANALYSIS.




--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Re: Processing Large Datasets

David Marso
Administrator
Looks like the only thing of substance is the RECODE.  Everything else looks
like benchmark code.
25 minutes sounds excessive for a mere 6.8 M records.

-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: Processing Large Datasets

Rick Oliver
In reply to this post by DKUKEC
All of the EXECUTE statements can be removed, and copying the dataset is unnecessary. The original dataset won't be overwritten unless you explicitly save the altered version of the data.
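
A minimal sketch of what the trimmed job could look like, assuming the goal is only the recoded COUNSEL_TYPE and the frequency table. The timing variables from the original are left out, and TEMPORARY is an addition that is not in the original syntax: it limits SELECT IF to the next procedure, so the full PARTY data stays available in the session without a copy.

```spss
* Sketch only: the original job without DATASET COPY or any EXECUTE.
* The timing variables (START_OQ, END_PROC, TIME2_PROC) are omitted.
DATASET ACTIVATE PARTY.

RECODE PartyTypeDescription
  ("ATTORNEY"=1)
  ("ASSISTANT PUBLIC DEFENDER"=2)
  ("PUBLIC DEFENDER"=2)
  ("ASSISTANT STATE ATTORNEY"=3)
  ("SPECIAL PROSECUTOR / ASA"=3) INTO COUNSEL_TYPE.

* Assumption: TEMPORARY is added so the selection is not permanent.
TEMPORARY.
SELECT IF NOT MISSING(COUNSEL_TYPE).

* FREQUENCIES is a procedure, so the pending transformations run here.
FREQUENCIES VARIABLES=PartyTypeDescription COUNSEL_TYPE
  /ORDER=ANALYSIS.
```

If the selection is meant to be permanent for the rest of the session, drop the TEMPORARY line; the data on disk or on the SQL server is still untouched unless the result is saved.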

On Fri, Apr 6, 2018 at 9:27 AM, DKUKEC <[hidden email]> wrote:
> Please let me know if you think this is par for the course, or whether
> there are other approaches to improve performance.

Re: Processing Large Datasets

Jon Peck
As Rick said, ditch the EXECUTEs and the DATASET COPY.  Without them, I think the whole syntax would run in one data pass.  I suspect that the DATASET COPY may have eaten a lot of the time.  I agree that the time seems quite excessive.  It might be worth monitoring resource usage with the Task Manager to see what is using up resources.

Also, if the string variables are very wide, that adds overhead.  If they have a lot of empty space, ALTER TYPE ALL (A=AMIN) would reclaim that.
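
A short sketch of the ALTER TYPE suggestion, assuming trailing padding in the string variables is the source of the overhead. AMIN sets each string variable to the width of its longest observed value. As far as I know, ALTER TYPE has to read the data to find those widths, so it costs a pass of its own and pays off mainly when the narrower data is then saved or reused.

```spss
* Shrink every string variable to the width of its longest value.
ALTER TYPE ALL (A=AMIN).

* Alternative: limit the change to the one variable used in the RECODE.
* ALTER TYPE PartyTypeDescription (A=AMIN).
```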

On Sat, Apr 7, 2018 at 11:38 AM Rick Oliver <[hidden email]> wrote:
All of the EXECUTE statements can be removed, and copying the dataset is unnecessary. The original dataset won't be overwritten unless you explicitly save the altered version of the data.

--
Jon K Peck
[hidden email]
