selecting cases???????

selecting cases???????

Samuel Solomon
Hi all,

I have a dataset with the variables year, unit, and products. I would like to keep the cases that are consecutive and delete the cases that are not. For instance, the variable year contains only the values 2003, 2004 and 2005. I wish the result to be:

before  after

year    year
2003    2003
2004    2004
2005    2005
2004    2003
2005    2004
2003    2005
2004
2005

Thank you,
Samuel.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: selecting cases???????

Richard Ristow
At 11:22 AM 6/26/2008, Samuel Solomon wrote:

>I have a data with variables year ,unit ,products. I would like to
>select cases which consecutive and delete cases  which are not. for
>instance , the variable year is populated by cases ,2003,2004 and
>2005 only . I wish the result to be :
>before  after
>
>year    year
>2003    2003
>2004    2004
>2005    2005
>2004    2003
>2005    2004
>2003    2005
>2004
>2005

I think you've had no answer because your question is confusing.

It looks like the two columns are separate datasets, 'before' and
'after'. That probably confused a lot of people (it confused me),
because parallel columns are almost invariably different *variables* in
the *same* dataset.

Let's see if I now understand what you want. I've added a variable,
'ID', to identify the cases uniquely, as the date ('year') does not.
That's another respect in which your question is pretty confusing.

Starting data
ID   year
  A   2003
  B   2004
  C   2005
  D   2004
  E   2005
  F   2003
  G   2004
  H   2005

You say you "would like to select cases which consecutive and delete
cases  which are not." So, you want to keep A, B and C, not D or E,
and keep F, G, and H? Like this?

ID   year
  A   2003
  B   2004
  C   2005
  F   2003
  G   2004
  H   2005

If that's not right, please clarify; if it is, say so, and maybe
somebody can help you. Please respond to the list.


Re: selecting cases???????

Samuel Solomon
That is exactly what I was trying to say.
Sorry for not being articulate in my question. By 'before' I meant the starting data, and by 'after' the resulting data.
Is there a way?
Thank you,
Samuel.




Re: selecting cases???????

Richard Ristow
At 11:22 AM 6/26/2008, Samuel Solomon wrote:

>I would like to select cases which consecutive and delete
>cases  which are not.

I'd asked, does that mean that if you start with
>ID   year
>  A   2003
>  B   2004
>  C   2005
>  D   2004
>  E   2005
>  F   2003
>  G   2004
>  H   2005
   you want,
>ID   year
>  A   2003
>  B   2004
>  C   2005
>  F   2003
>  G   2004
>  H   2005

At 01:56 AM 6/30/2008, Samuel Solomon wrote:

>That is exactly what I was trying to say.

Good. Now, what is your rule for this selection? It *looks* like when
you have a record for 2003, then you keep it; and you keep the next
record if it's for 2004, and the next if it's for 2005, etc. But if a
record isn't for the consecutive year after its predecessor, it's
dropped unless it's for 2003; and then, all later records are dropped
until you get one for 2003.

Well?
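
The rule as worded above can be sketched in a few lines of Python (not SPSS syntax). This is only an illustration: the function name keep_by_rule and the (ID, year) tuple layout are assumptions made for the example, and the sample is the eight-record listing from this message.

```python
def keep_by_rule(records, first_year=2003):
    """Keep a first_year record, then each record one year after the last kept one.

    records is a list of (id, year) tuples in file order (hypothetical layout).
    """
    kept = []
    expected = None  # year that would continue the run; None means "wait for first_year"
    for rec_id, year in records:
        if year == first_year:
            kept.append((rec_id, year))   # a first_year record always (re)starts a run
            expected = year + 1
        elif expected is not None and year == expected:
            kept.append((rec_id, year))   # consecutive year after its predecessor
            expected = year + 1
        else:
            expected = None               # out of sequence: drop until the next first_year
    return kept


sample = [("A", 2003), ("B", 2004), ("C", 2005), ("D", 2004),
          ("E", 2005), ("F", 2003), ("G", 2004), ("H", 2005)]
kept_ids = [rec_id for rec_id, _ in keep_by_rule(sample)]
print(kept_ids)  # ['A', 'B', 'C', 'F', 'G', 'H']
```

Note that this literal reading keeps the start of an incomplete run: a 2003 followed directly by a 2005 would keep the 2003 and drop the 2005.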


Re: selecting cases???????

Bob Schacht-3
At 07:52 AM 7/1/2008, Richard Ristow wrote:

>At 11:22 AM 6/26/2008, Samuel Solomon wrote:
>
>>I would like to select cases which consecutive and delete
>>cases  which are not.
>
>I'd asked, does that mean that if you start with
>>ID   year
>>  A   2003
>>  B   2004
>>  C   2005
>>  D   2004
>>  E   2005
>>  F   2003
>>  G   2004
>>  H   2005
>   you want,
>>ID   year
>>  A   2003
>>  B   2004
>>  C   2005
>>  F   2003
>>  G   2004
>>  H   2005
>
>At 01:56 AM 6/30/2008, Samuel Solomon wrote:
>
>>That is exactly what I was trying to say.
>
>Good. Now, what is your rule for this selection? It *looks* like when
>you have a record for 2003, then you keep it; and you keep the next
>record if it's for 2004, and the next if it's for 2005, etc. But if a
>record isn't for the consecutive year after its predecessor, it's
>dropped unless it's for 2003; and then, all later records are dropped
>until you get one for 2003.


There's more than this going on. It looks like you have some implicit
variables that should be made explicit.
For example, record F does not follow *consecutively* after record C--
unless the consecutivity counter is re-set.

Here are some possible implicit rules-- which are true?
    * The range of years is 2003 to 2005.
    * "Consecutive" for the last year of the range (i.e., 2005) means that
      the next record must be the first year of the range (i.e., 2003). I.e., the
      year following 2005 must be 2003, and records for other years following
      2005 should be deleted until a record for 2003 is found, which re-sets the
      cycle.
    * Records may not be re-arranged (i.e., re-ordered), but only deleted.

There may be other implicit rules, too. I notice that Richard's example
just happens to have the first 3 records in proper order, forming a
paradigm, and that records D&E look like a defective triad (lacking the
record for 2003) but then lo and behold F, G, and H are consecutive and
complete.

I find myself wondering about the source of the original order. In other
words,

What if the original set of records was
>>ID   year
>>  A   2003
>>  B   2004
>>  C   2005
>>  D   2003
>>  E   2004
>>  F   2003
>>  G   2004
>>  H   2005

Would any records in this sequence need to be deleted?

I have the feeling that there is significant information not yet supplied.

Bob Schacht


Robert M. Schacht, Ph.D. <[hidden email]>
Pacific Basin Rehabilitation Research & Training Center
1268 Young Street, Suite #204
Research Center, University of Hawaii
Honolulu, HI 96814


Re: selecting cases???????

Samuel Solomon
In reply to this post by Richard Ristow
Now you have understood me perfectly. So what is the solution for that?  







Re: selecting cases???????

Samuel Solomon
In reply to this post by Bob Schacht-3
All I need is to delete cases 'D' and 'E', so that the sequence (2003, 2004, 2005, 2003, 2004, 2005, and so on) is kept intact. You were wondering whether the range is 2003 to 2005. That is right: there are only 2003, 2004 and 2005. I just need to keep the sequence and delete the cases in between that do not follow the order.

Thank you,
Samuel




data preparation

Christian Deindl
I'm currently working with a huge dataset. In order to do any analysis I
need to make a lot of adjustments.
My complete syntax file needs 5 (!) hours to run.

Is there a way to speed it up?

I'm currently using SPSS 15 and a dual-core PC.


Re: selecting cases???????

Marta Garcia-Granero
In reply to this post by Samuel Solomon
Hi Samuel
> All I need to delete is cases 'D' and 'E' and hence the sequence
> (2003,2004,2005,2003,2004,2005,..........it goes like that)  is kept
> intact. You were wondering if the range is between  2003 and 2005.
> well  that is right. there are only 2003,2004 and 2005. I just need to
> keep the sequence and delete the rest which are in between that does
> follow the order.
>

I have added some cases to your sample dataset just to be ready to deal
with other "out of sequence" data, like 2003 followed by 2005 (see ID
"O" & "P"), besides the case you presented (2004 without a 2003 before,
like in ID "D" & "E").

* Sample dataset *.
DATA LIST LIST/ID(A1) year(F8).
BEGIN DATA
 A   2003
 B   2004
 C   2005
 D   2004
 E   2005
 F   2003
 G   2004
 H   2005
 I   2003
 J   2004
 K   2005
 L   2003
 M   2004
 N   2005
 O   2003
 P   2005
END DATA.

NUMERIC Flag(F8).
* Flag=1 marks a case to keep; every 2003 case starts out flagged *.
COMPUTE Flag=(year=2003).
* Also keep a 2004 right after a 2003, and a 2005 two cases after a 2003 *.
DO IF Flag NE 1.
- IF (year=2004) AND (LAG(year,1)=2003) Flag=1.
- IF (year=2005) AND (LAG(year,2)=2003) Flag=1.
END IF.
* Unflag any 2003 not followed by 2004: sorting by ID descending makes LAG(year) *.
* return the following case (this assumes ID reflects the original case order) *.
SORT CASES BY ID(D).
IF (year=2003) AND (LAG(year) NE 2004) Flag=0.
* Now we get rid of every flag=0 data *.
EXE. /*don't eliminate it *.
SELECT IF Flag=1.
SORT CASES BY ID(A).
DELETE VARIABLES Flag.
LIST.

HTH,
Marta García-Granero
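
As a cross-check outside SPSS, the same selection can be sketched in Python as a search for complete 2003-2004-2005 runs. This is a sketch under assumptions, not a translation of the syntax above: the function name keep_complete_runs and the (ID, year) tuples are made up for the example, and the rule is stricter than the flag logic, which (as far as I can tell) would keep a 2003-2004 pair that has no 2005 after it.

```python
def keep_complete_runs(records, cycle=(2003, 2004, 2005)):
    """Keep only records that form a complete, adjacent run of the years in cycle.

    records is a list of (id, year) tuples in file order (hypothetical layout).
    """
    kept = []
    width = len(cycle)
    i = 0
    while i < len(records):
        window = records[i:i + width]
        if tuple(year for _, year in window) == tuple(cycle):
            kept.extend(window)   # a full run: keep all of it and jump past it
            i += width
        else:
            i += 1                # not the start of a full run: drop this record
    return kept


ids = "ABCDEFGHIJKLMNOP"
years = [2003, 2004, 2005, 2004, 2005, 2003, 2004, 2005,
         2003, 2004, 2005, 2003, 2004, 2005, 2003, 2005]
sample = list(zip(ids, years))
kept_ids = "".join(rec_id for rec_id, _ in keep_complete_runs(sample))
print(kept_ids)  # ABCFGHIJKLMN
```

On the 16-case sample dataset this drops D, E, O and P, which is the result the thread asks for.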


Re: selecting cases???????

Samuel Solomon
Thanks Marta and everybody.
That will do!



AW: data preparation

la volta statistics
In reply to this post by Christian Deindl
Hi Christian

Did you try making a production job using SPSS's Production Facility? I am
not sure whether this procedure is faster, but it at least allows you to
continue working in the meantime.
Hope this helps
Christian




Re: AW: data preparation

Christian Deindl
The problem is that I need a prepared dataset to continue.

More or less, what I'm doing at the moment is redoing the preparation to
get rid of smaller mistakes, typos, etc.

I'm doing all the analysis in Stata (GLLAMM), so I could run my models while
SPSS is busy; it's just that I don't have my dataset ready.
And every mistake I detect can result in a 5-hour wait.

Christian


Re: AW: data preparation

Spousta Jan
You should tell us more details about the data file and the syntax. There are cases which can easily be sped up, and there are cases where the only solution is a better computer plus more specialized software.

General advice:
* try to split the file into smaller pieces, use only those you need for a given task and handle the pieces separately (both separate variables and separate groups of cases are possible, depending on the task)
* use EXECUTE only in case of necessity
* save the raw data in SPSS sav files if possible
* run the time-consuming tasks overnight
* try to understand why the syntax is so slow - what is the most time-consuming part of it?
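
One way to find the slowest part, sketched in Python under the assumption that the job is already split into several syntax files and that something is available to submit syntax (for example the SPSS Python plug-in discussed elsewhere in this thread). The function name time_parts and the file names are hypothetical; submit is passed in as a parameter so the sketch itself does not depend on any SPSS module.

```python
import time


def time_parts(paths, submit):
    """Run each syntax file through submit() and return (path, seconds), slowest first."""
    timings = []
    for path in paths:
        start = time.time()
        submit("INSERT FILE='%s'." % path)   # submit is whatever runs SPSS syntax
        timings.append((path, time.time() - start))
    timings.sort(key=lambda item: item[1], reverse=True)
    return timings


# Dry run with a stand-in for the real submit function: it only records the commands.
submitted = []
report = time_parts(["part1.sps", "part2.sps"], submitted.append)
print(submitted)
```

Transformations stay pending until a procedure or EXECUTE forces a data pass, so a part's timing is only meaningful if that part ends with one.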

Jan



_____________

This message and any attached files are confidential and intended solely for the addressee(s). Any publication, transmission or other use of the information by a person or entity other than the intended addressee is prohibited. If you receive this in error please contact the sender and delete the message as well as all attached documents. The sender does not accept liability for any errors or omissions as a result of the transmission.

Are you sure that you really need a print version of this message and/or its attachments? Think about nature.


Re: AW: data preparation

Peck, Jon
Production Mode will not speed anything up. However, if you don't need the user interface and can run your syntax through a programmability job, it might make a large difference.  While I can't offer any guarantees, we have had reports from users of between 4x and 10x speed improvement when taking this approach.

Converting a large syntax job to run as a program can be very simple.

If you have a block of syntax - let's assume it is in a file called clean.sps in temp,  you can run this as a Python program (assuming the plugin is installed), by just doing, from a Python shell (notice the forward slashes),

import spss
spss.Submit("INSERT FILE='c:/temp/clean.sps'")

The one problem with this is that you would just get the output as text streaming back to the console.  So you could get this as html output with oms by wrapping your syntax with something like

oms /destination outfile='c:/temp/clean.htm' format = html.
<your syntax file>
omsend.

You could put that syntax in the Submit call above or add the OMS commands to your INSERT file.

You might also want to call
spss.SetOutput("off")
ahead of the Submit to suppress echoing of the text to the console.
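
Put together, the pieces above might look like the sketch below. The helper names build_job and run_job are made up for illustration, and the paths are the example paths from this message; only spss.Submit and spss.SetOutput come from the plug-in. The import is kept inside run_job so that the command list can be inspected on a machine without the plug-in.

```python
def build_job(syntax_path, html_path):
    """Return the commands that wrap an INSERT of syntax_path in OMS ... OMSEND."""
    return [
        "OMS /DESTINATION OUTFILE='%s' FORMAT=HTML." % html_path,
        "INSERT FILE='%s'." % syntax_path,
        "OMSEND.",
    ]


def run_job(syntax_path, html_path):
    """Submit the wrapped job; needs the SPSS Python plug-in to be installed."""
    import spss                    # only importable where the plug-in is present
    spss.SetOutput("off")          # suppress echoing of output text to the console
    spss.Submit(build_job(syntax_path, html_path))


commands = build_job("c:/temp/clean.sps", "c:/temp/clean.htm")
for line in commands:
    print(line)
```

Calling run_job("c:/temp/clean.sps", "c:/temp/clean.htm") from a Python shell would then run the whole file and write the output to the html file.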

If you have SPSS 16, you can generate a true Viewer document (spv format) from OMS instead of the html.

As always, there are lots of programmability references at SPSS Developer Central (www.spss.com/devcentral).

HTH,
Jon Peck


Re: AW: data preparation

Christian Deindl
Thanks a lot for your answers.

I will try Python for sure.

Since my syntax is really large, I split it into smaller parts and use
"include" to put it all together.
If I run the same syntax as one huge file without include, it runs about 2
to 3 times faster.

How is this possible?

Christian

Peck, Jon schrieb:

> Production Mode will not speed anything up. However, if you don't need the user interface and can run your syntax through a programmability job, it might make a large difference.  While I can't offer any guarantees, we have had reports from users of between 4x and 10x speed improvement when taking this approach.
>
> Converting a large syntax job to run as a program can be very simple.
>
> If you have a block of syntax - let's assume it is in a file called clean.sps in temp,  you can run this as a Python program (assuming the plugin is installed), by just doing, from a Python shell (notice the forward slashes),
>
> import spss
> spss.Submit("INSERT FILE='c:/temp/clean.sps'")
>
> The one problem with this is that you would just get the output as text streaming back to the console.  So you could get this as html output with oms by wrapping your syntax with something like
>
> oms /destination outfile='c:/temp/clean.htm' format = html.
> <your syntax file>
> omsend.
>
> You could put that syntax in the Submit call above or add the OMS commands to your INSERT file.
>
> You might also want to call
> spss.SetOutput("off")
> ahead of the Submit to suppress echoing of the text to the console.
>
> If you have SPSS 16, you can generate a true Viewer document (spv format) from OMS instead of the html.
>
> As always, there are lots of programmability references at SPSS Developer Central (www.spss.com/devcentral).
>
> HTH,
> Jon Peck
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Spousta Jan
> Sent: Wednesday, July 02, 2008 7:22 AM
> To: [hidden email]
> Subject: Re: [SPSSX-L] AW: data preparation
>
> You should tell us more details about the data file and the syntax. There are cases which can be easily sped up, and there are cases where the only solution is a better computer + more specific software.
>
> General advice:
> * try to split the file into smaller pieces, use only those you need for a given task and handle the pieces separately (both separate variables and separate groups of cases are possible, depending on the task)
> * use EXECUTE only when necessary
> * save the raw data in SPSS sav files if possible
> * run the time-consuming tasks overnight
> * try to understand why the syntax is so slow - what is the most time-consuming part of it?
>
> Jan
>
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Christian Deindl
> Sent: Wednesday, July 02, 2008 3:06 PM
> To: [hidden email]
> Subject: Re: AW: data preparation
>
> the problem is I need a prepared dataset to continue.
>
> more or less what I'm doing at the moment is to redo the preparation and get rid of smaller mistakes, typos, etc.
>
> I'm doing all analysis in stata (GLLAMM), so I could run my models while SPSS is busy; it's just that I don't have my dataset ready.
> and every mistake I detect can result in a 5-hour wait.
>
> christian
>
> la volta statistics schrieb:
>> Hi Christian
>>
>> Did you try making a production job using SPSS's Production Facility?
>> I am not sure if this procedure is faster, but it at least allows you
>> to continue working in the meantime.
>> Hope this helps
>> Christian
>>
>>
>>
>> -----Ursprüngliche Nachricht-----
>> Von: SPSSX(r) Discussion [mailto:[hidden email]]Im Auftrag
>> von Christian Deindl
>> Gesendet: Mittwoch, 2. Juli 2008 12:07
>> An: [hidden email]
>> Betreff: data preparation
>>
>>
>> I'm working currently with a huge dataset. in order to do any analysis
>> I need to make a lot of adjustments.
>> my complete syntax-file needs 5! hours to run.
>>
>> is there a way to speed it up?
>>
>> I'm currently using spss 15 and a dual-core pc

Re: data preparation

Richard Ristow
In reply to this post by Christian Deindl
At 06:07 AM 7/2/2008, Christian Deindl wrote:

>I'm working currently with a huge dataset. in order to do any
>analysis I need to make a lot of adjustments. my complete
>syntax-file needs 5! hours to run.

Jan Spousta's question is most apposite: You should tell us more
details about the data file and the syntax. Different problems, and
different coding styles, require very different techniques for optimization.

Jon Peck can probably refine this; but, mostly, SPSS takes time per
line of code (or kilo-line), per procedure run (NOT counting data
read and written), and per gigabyte of data read or written. It's
hard to guess your problem, without approximate knowledge of those quantities.

Christian wrote, later,

>since my syntax is really large I split it into smaller parts and
>use "include" to put it all together.
>If I run the same syntax as huge file without include, it gets about
>2 to 3 times faster. how is this possible?

I wouldn't have expected it myself, at all. But that brings us to,
how large is "really large"? And what procedures do you run, and what
output do you produce? It sounds like it's many thousands of lines,
but how many?

Then, it's important to see whether the last category, data read and
written, may be taking the time. As a quick test, run your program on
a small fraction of the file, and see whether it's dramatically faster.


Anyway, in estimating whether data reading and writing may be a
problem, multiply the file size by the number of times your code
reads or writes the full file (one "data pass"). For this purpose,
. A transformation program and a following procedure take one data pass
. An EXECUTE counts as a procedure, and takes a full data pass.
(That's why unnecessary EXECUTEs can slow processing so badly.)
. Count a SORT CASES as two data passes, though it's probably a
little more. Count an AGGREGATE with MODE=ADDVARIABLES as two passes;
any other AGGREGATE is one data pass, plus a data pass for the
(usually much smaller) output file
. VARSTOCASES is one pass each through the file as it is before, and
as it is after; for quick estimating, count it as two passes.
CASESTOVARS is two passes through the input, and one through the
output; think of it as three passes.
. Some procedures, notably nonparametric ones including calculations
of medians, may take much additional time on large files.
. Reading data from some external sources via GET DATA or GET
TRANSLATE can be much slower than a pass SPSS makes through its own data.

What, then, is your file size, and number of data passes? Sometimes,
code makes far too many passes, and posters to this list have
occasionally experienced order-of-magnitude decreases in running
time, by simple optimizations to save data passes.
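
The estimate described above can be put into a few lines of Python. This is only a rough sketch: the pass counts are the rules of thumb from this post, while the function name, the operation labels, and the example job are made up for illustration. It counts gigabytes moved, not seconds; turning that into time needs a throughput figure measured on your own machine.

```python
# Rules of thumb from the post: approximate full-file data passes per operation.
# The dictionary keys are illustrative labels, not SPSS keywords.
PASSES = {
    "procedure": 1,          # transformations plus the procedure that follows
    "execute": 1,            # an EXECUTE is a full data pass of its own
    "sort_cases": 2,         # "probably a little more"
    "aggregate_addvars": 2,  # AGGREGATE with MODE=ADDVARIABLES
    "aggregate": 1,          # plus a pass over the (usually small) output
    "varstocases": 2,
    "casestovars": 3,
}


def estimate_io_gb(file_gb, operations):
    """Return (total data passes, gigabytes read or written)."""
    passes = sum(PASSES[op] for op in operations)
    return passes, passes * file_gb


# Hypothetical job on a 2 GB file: 40 EXECUTEs, 3 sorts, 5 procedures.
job = ["execute"] * 40 + ["sort_cases"] * 3 + ["procedure"] * 5
print(estimate_io_gb(2.0, job))  # (51, 102.0)
```

In this invented job the EXECUTEs account for 40 of the 51 passes, which is the kind of imbalance the estimate is meant to expose.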

-Best of luck to you,
  Richard


Re: selecting cases???????

Richard Ristow
In reply to this post by Marta Garcia-Granero
At 06:35 AM 7/2/2008, Marta García-Granero wrote:

>Hi Samuel
>>All I need to delete is cases 'D' and 'E' and hence the sequence
>>(2003,2004,2005,2003,2004,2005,..........it goes like that)  is kept
>>intact. You were wondering if the range is between  2003 and 2005.
>>well  that is right. there are only 2003,2004 and 2005. I just need to
>>keep the sequence and delete the cases in between that do not
>>follow the order.
>
>I have added some cases to your sample dataset
>just to be ready to deal with other "out of
>sequence" data, like 2003 followed by 2005 (see
>ID "O" & "P"), besides the case you presented
>(2004 without a 2003 before, like in ID "D" & "E").

Here's an alternative implementation. It uses LAG
more extensively; doesn't assume a fixed maximum
length for the sequence; and uses AGGREGATE
rather than SORT, to back-fill 'rejection' for a
sequence that starts at 2003 but has a later gap.
It appears to produce the same results.
|-----------------------------|---------------------------|
|Output Created               |02-JUL-2008 22:19:36       |
|-----------------------------|---------------------------|
[Ristow]

ID     year

A      2003
B      2004
C      2005
D      2004
E      2005
F      2003
G      2004
H      2005
I      2003
J      2004
K      2005
L      2003
M      2004
N      2005
O      2003
P      2005

Number of cases read:  16    Number of cases listed:  16


NUMERIC Flag2   (F4)  /* Marker: take this record     */.
STRING  FirstID (A1)  /* ID of 1st record in sequence */.

*  Logic below assumes that the data is already      ....
*  sorted into the correct order                     ....
DO IF   Year EQ 2003.
*  Beginning of a sequence starting at 2003          ....
.  COMPUTE Flag2   = 1.
.  COMPUTE FirstID = ID.
ELSE IF MISSING(LAG(Year)).
*  First record in the file, and it's not 2003       ....
.  COMPUTE Flag2   = 0.
.  COMPUTE FirstID = ID.
ELSE IF Year EQ LAG(Year) + 1.
*  Record continues the sequence, with no gap        ....
.  COMPUTE FirstID = LAG(FirstID).
.  COMPUTE Flag2   = LAG(Flag2).
ELSE IF Year GT LAG(Year) + 1.
*  Record continues the sequence, with a  gap        ....
.  COMPUTE FirstID = LAG(FirstID).
.  COMPUTE Flag2   = 0.
ELSE IF Year LE LAG(Year).
*  Start a new sequence, but not at 2003             ....
.  COMPUTE FirstID = ID.
.  COMPUTE Flag2   = 0.
END IF.

*  AGGREGATE, to reject, post hoc, sequences that    ....
*  start at 2003 but then have a gap                 ....
AGGREGATE OUTFILE   = *
           MODE      = ADDVARIABLES
           OVERWRITE = YES
    /BREAK = FirstID
    /Flag2 = MIN(Flag2).

MATCH FILES
    /FILE=Marta
    /FILE=Ristow
    /BY  ID YEAR
    /KEEP=ID Year Flag Flag2 FirstID ALL.

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |02-JUL-2008 22:19:38       |
|-----------------------------|---------------------------|
ID     year Flag Flag2 FirstID

A      2003    1     1 A
B      2004    1     1 A
C      2005    1     1 A
D      2004    .     0 D
E      2005    .     0 D
F      2003    1     1 F
G      2004    1     1 F
H      2005    1     1 F
I      2003    1     1 I
J      2004    1     1 I
K      2005    1     1 I
L      2003    1     1 L
M      2004    1     1 L
N      2005    1     1 L
O      2003    .     0 O
P      2005    .     0 O

Number of cases read:  16    Number of cases listed:  16
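
For readers who want to check the logic outside SPSS, here is a minimal Python sketch of the same rules. It is not part of the SPSS job: the function name and the use of list positions in place of the ID variable are assumptions made for illustration. Like the SPSS code, it assumes the records are already in the correct order.

```python
def flag_sequences(years, start=2003):
    """Return a 0/1 'keep' flag for each record, given years in file order."""
    flags = []  # provisional flag per record (Flag2 before the AGGREGATE)
    first = []  # index of the first record of each record's run (FirstID)
    for i, year in enumerate(years):
        if year == start:                # run beginning at the start year
            flag, head = 1, i
        elif i == 0:                     # first record, but not the start year
            flag, head = 0, i
        elif year == years[i - 1] + 1:   # continues the run, no gap
            flag, head = flags[i - 1], first[i - 1]
        elif year > years[i - 1] + 1:    # continues the run, with a gap
            flag, head = 0, first[i - 1]
        else:                            # year <= previous: new run, wrong start
            flag, head = 0, i
        flags.append(flag)
        first.append(head)
    # Back-fill the minimum flag over each run, as AGGREGATE ... MIN(Flag2) does.
    run_min = {}
    for flag, head in zip(flags, first):
        run_min[head] = min(run_min.get(head, 1), flag)
    return [run_min[head] for head in first]


# The 16 test records A..P from the listing above.
years = [2003, 2004, 2005, 2004, 2005, 2003, 2004, 2005,
         2003, 2004, 2005, 2003, 2004, 2005, 2003, 2005]
print(flag_sequences(years))
# [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

The result agrees with the Flag2 column in the listing: records D, E, O and P are rejected.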
============================================
APPENDIX:  Test data, Marta's code, and mine
============================================
* Sample dataset *.
DATA LIST LIST/ID(A1) year(F8).
BEGIN DATA
A   2003
B   2004
C   2005
D   2004
E   2005
F   2003
G   2004
H   2005
I   2003
J   2004
K   2005
L   2003
M   2004
N   2005
O   2003
P   2005
END DATA.
DATASET NAME     TestData WINDOW=FRONT.

*  Marta Garcia-Granero's code:  ......  .
DATASET ACTIVATE TestData WINDOW=FRONT.
DATASET COPY     Marta    WINDOW=FRONT.
DATASET ACTIVATE Marta    WINDOW=FRONT.

NUMERIC Flag(F4).
COMPUTE Flag=(year=2003).
* This part flags sequences not starting with 2003 *.
DO IF Flag NE 1.
- IF (year=2004) AND (LAG(year,1)=2003) Flag=1.
- IF (year=2005) AND (LAG(year,2)=2003) Flag=1.
END IF.
* This part flags sequences starting with 2003 not followed by 2004 *.
SORT CASES BY ID(D).
IF (year=2003) AND (LAG(year) NE 2004) Flag=0.
* Now we get rid of every flag=0 data *.
*...EXE   /*don't eliminate it        */.
.   LIST  /* but this is a substitute */.
SORT CASES BY ID(A).
.  /**/  LIST   /*-*/.

SELECT IF Flag=1.



*  Richard Ristow's       code:  ......  .
DATASET ACTIVATE TestData WINDOW=FRONT.
DATASET COPY     Ristow   WINDOW=FRONT.
DATASET ACTIVATE Ristow   WINDOW=FRONT.

LIST.

NUMERIC Flag2   (F4)  /* Marker: take this record     */.
STRING  FirstID (A1)  /* ID of 1st record in sequence */.

*  Logic below assumes that the data is already      ....
*  sorted into the correct order                     ....
DO IF   Year EQ 2003.
*  Beginning of a sequence starting at 2003          ....
.  COMPUTE Flag2   = 1.
.  COMPUTE FirstID = ID.
ELSE IF MISSING(LAG(Year)).
*  First record in the file, and it's not 2003       ....
.  COMPUTE Flag2   = 0.
.  COMPUTE FirstID = ID.
ELSE IF Year EQ LAG(Year) + 1.
*  Record continues the sequence, with no gap        ....
.  COMPUTE FirstID = LAG(FirstID).
.  COMPUTE Flag2   = LAG(Flag2).
ELSE IF Year GT LAG(Year) + 1.
*  Record continues the sequence, with a  gap        ....
.  COMPUTE FirstID = LAG(FirstID).
.  COMPUTE Flag2   = 0.
ELSE IF Year LE LAG(Year).
*  Start a new sequence, but not at 2003             ....
.  COMPUTE FirstID = ID.
.  COMPUTE Flag2   = 0.
END IF.

*  AGGREGATE, to reject, post hoc, sequences that    ....
*  start at 2003 but then have a gap                 ....
AGGREGATE OUTFILE   = *
           MODE      = ADDVARIABLES
           OVERWRITE = YES
    /BREAK = FirstID
    /Flag2 = MIN(Flag2).

MATCH FILES
    /FILE=Marta
    /FILE=Ristow
    /BY  ID YEAR
    /KEEP=ID Year Flag Flag2 FirstID ALL.

LIST.


Re: selecting cases???????

Samuel Solomon
Dear Marta and Richard,
Many, many thanks to both of you!
I wonder how you manage to know SPSS inside out!

Thanks again,
Samuel



From: Richard Ristow
Sent: Thu 7/3/2008 5:25 AM
To: Samuel Solomon; [hidden email]
Cc: Marta García-Granero
Subject: Re: selecting cases???????



Re: data preparation

Eero Olli
In reply to this post by Christian Deindl
Hi Christian,

I have also struggled with complicated data preparation. You have
already gotten some good advice on how to improve your code. However,
it is very time-consuming to improve your code, and to be sure that
everything works as it is supposed to, if the syntax takes 5 hours to run.

My approach has been:

1) Divide large syntax files into small chunks of code and save them as
.sps files.

2) Create a datafile with a small sample of your original cases that you
use for testing your code. Use the full dataset only after your code
runs without errors.

* make a small sample for testing syntax.
FILTER OFF.
USE ALL.
SAMPLE 50 from 1000.
EXECUTE .

3) Control the sequence in which they are run through a series of
production jobs (.spp).  Create separate jobs for
        a) the preparation phase
        b) the real data preparation
        c) producing all tables for a publication
The preparation phase and the full data preparation production jobs would
differ only in regard to which datafiles they open and save.

EXAMPLE PSEUDOCODE for a Production job.

* preparation phase with small datafile.
INSERT FILE='open_datafile1_N50.sps'.
INSERT FILE='fix_errors1.sps'.
INSERT FILE='recode1.sps'.
INSERT FILE='choose_only_valid_cases1.sps'.
INSERT FILE='combine_sources1.sps'.
INSERT FILE='recode2.sps'.
INSERT FILE='choose_only_valid_cases2.sps'.
INSERT FILE='write_datafile1_N50.sps'.
INSERT FILE='open_datafile2_N50.sps'.
INSERT FILE='fix_errors1.sps'.
INSERT FILE='choose_only_valid_cases1.sps'.
INSERT FILE='combine_sources2.sps'.
INSERT FILE='recode2.sps'.
INSERT FILE='choose_only_valid_cases2.sps'.
INSERT FILE='Write_datafile_N50.sps'.

The same code can be used several times in one production job.  Even if
the whole job takes a long time it is possible to create small pieces of
jobs that can be controlled.
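
If the list of INSERT commands gets long, it can also be generated rather than typed. The following is only a sketch in Python, not something taken from my own jobs: the function name is invented, and the "_full" file names are hypothetical placeholders for whatever opens and saves the real datafile. It shows how the test job and the full job can share every chunk except the first and the last.

```python
def build_job(chunks, open_file, write_file):
    """Return SPSS syntax that runs open_file, each chunk, then write_file."""
    files = [open_file, *chunks, write_file]
    return "\n".join(f"INSERT FILE='{name}'." for name in files)


# Shared transformation chunks (file names as in the example above).
chunks = ["fix_errors1.sps", "recode1.sps", "choose_only_valid_cases1.sps"]

# Test run on the 50-case sample.
test_job = build_job(chunks, "open_datafile1_N50.sps", "write_datafile1_N50.sps")

# Full run; the "_full" file names are hypothetical.
full_job = build_job(chunks, "open_datafile1_full.sps", "write_datafile1_full.sps")

print(test_job)
```

The generated text can be saved as a syntax file and used in a production job in the usual way.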

I have added a few lines of code to all my syntax files to make it
easier to work with production jobs.

An example of first three lines in my .sps files is
OUTPUT CLOSE ChooseValidCases.
OUTPUT NEW NAME=ChooseValidCases.
* filename: choose_only_valid_cases2.sps.

It is useful in the testing phase to close old windows and create new
ones that have names that explain what is going on.  The most common
production job I run opens 14 windows, and I can very quickly see that
everything went well, or where the problems are.  After every run I
only have the most current output. If I need to keep an output (very
seldom), all I need to do is to rename it, and it will not disappear.

Similarly, I use DATASET NAME, DATASET CLOSE, and DATASET ACTIVATE quite
a lot, to make sure that I do not accidentally, by poking around with my
mouse, change which one of the datafiles has the focus.

* get rid of old data.
DATASET CLOSE ALL.
GET FILE='filename.sav'.
DATASET NAME mydata WINDOW=FRONT.


During data manipulation and transformations I use this a lot:
*make sure focus is correct.
DATASET ACTIVATE mydata.
RECODE...

But be careful with DATASET ACTIVATE if you want to reuse your syntax
in several production jobs.

4) Move from a small sample to the full datafile in several steps.
My experience with data production that takes a LONG time is that there
is often a problem with the data transformations (missing values for
CASESTOVARS, unexpected values that get into loops, or something else
that stalls SPSS). Thus, the move from pseudo or small-sample data to the
real datafile can create surprises.   If the time increases
disproportionately, I would
        a) divide my large datafile into smaller files (with, let's say,
1000 cases) that I would test one by one.  If you can run one of these
pieces, but not all, it is very likely that there is a data problem.
        b) run my code piece by piece and visually examine the
large datafile after each stage, and do a simple FREQ on every new
variable or transformation to find the unexpected values that create
trouble.
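
The idea in step 4 can be illustrated outside SPSS. The sketch below is plain Python with made-up data, and the function names are invented for the example. It cuts a variable into blocks of 1000 cases, reports which blocks contain values outside an expected set, and counts those values, roughly what a FREQ on the variable would show.

```python
from collections import Counter


def unexpected_values(values, allowed):
    """Frequency count of values that are not in the allowed set."""
    return Counter(v for v in values if v not in allowed)


def bad_blocks(values, allowed, block_size=1000):
    """Start index of every block that contains an unexpected value."""
    return [start for start in range(0, len(values), block_size)
            if unexpected_values(values[start:start + block_size], allowed)]


# Made-up data: 3000 year codes with two stray values.
allowed = {2003, 2004, 2005}
years = [2003, 2004, 2005] * 1000
years[1500] = 9999
years[2999] = -1

print(bad_blocks(years, allowed))  # [1000, 2000]
print(dict(unexpected_values(years, allowed)))
```

Here the first block is clean, so the trouble is confined to the second and third blocks, and the frequency count names the offending values.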


The bottom line is that if you split your code and split your data, it
is easier to find where the real problem is.

Best,
Eero

________________________________________
Eero Olli
Advisor
the Equality and Anti-discrimination Ombud
[hidden email]             +47 2405 5951
POB 8048 Dep,     N-0031 Oslo,      Norway


Re: data preparation

Christian Deindl
thanks a lot for the advice.

I'm down to 2 hours by now, which is not that bad, since my analysis can
take days (gllamm, logistic multilevel).

christian




On Thu, 3 Jul 2008 11:10:46 +0200
  Eero Olli <[hidden email]> wrote:
