how to do this efficiently


Rodrigo Briceño
Dear SPSS users. I have a problem with one of the databases that I’m
processing.

 

My DB has information about households and their expenditures on a set of
items (11 items that include food, transport, health, etc.). Each of these
items has a unique code; for example, food is 1, health is 5, etc. The
problem is that the people who prepared the database didn't enter a value
when the HH didn't report expenditures on one or more of the items.

What I want to do is locate the HHs that do not have code 1 (for food,
since this is the first item in the list for each HH), so that I have the
full list of HHs regardless of whether they reported food expenditures.

 

Ex.

 

HH  Item
 1    1
 1    2
 1    5
 1    4
 2    1
 2    2
 2    5
 2    4
 3    2
 3    3
 3    4
 3    5

 

I would like to introduce a new case with the info for HH 3 on item 1,
even if that item has only missing values. (I can share an extract of the
DB with people interested in helping me.) Thanks.

__________________________________________________________________

Rodrigo Briceño
Project Manager
Sanigest Internacional

+506  291 1200 ext. 113   Costa Rica office
+506  232 0830                   Fax
+506  886 1177                   Mobile
 <mailto:[hidden email]> [hidden email]
 <http://www.sanigest.com/> www.sanigest.com

MSN:  <mailto:[hidden email]> [hidden email]
SKYPE: rbriceno1087

_____________________

This communication contains legal information which is privileged and
confidential. It is for the exclusive use of the addressee; distribution,
dissemination, copying, or use by others is strictly prohibited. If you have
received this communication in error, please delete the original message and
e-mail us.



 

Re: how to do this efficiently

Maguin, Eugene
Rodrigo,

I'm not sure this is quite what you need, but it does address the specific
issue you asked about in your example, i.e., "I would like to introduce a
new case with the info for HH 3 on the item 1, no matter if the item has
just missing values." I'm going to assume that your file has a 'long'
structure, as you show in your example.

So, you want to locate cases that don't have a specific type of record. To
find those cases, you subtract the cases that have that record type from
the set of all cases (i.e., set subtraction). Here's the strategy I'd use
to start with.

Do a temporary select of cases with an Item=1 record and save them to a new
file, keeping only HH and Item.

Aggregate by HH and save to a new file. You have to create a new variable
when you aggregate, but this variable is not needed and can be discarded.
This file is the total: one record per household.

Match these two files together. Cases with a missing value for Item are
cases that are missing an Item=1 record. Select those cases and change the
value of Item to 1. This is the result of the subtraction operation.

Add this set of cases back into your original file.


In code.
*  let your file0 be your current database file. It is assumed to be open.

Sort cases by hh item.

Do if (Item eq 1).
+  xsave outfile='your file1'/keep hh item.
End if.
Execute.

Aggregate outfile='your file2'/break=hh/junk=max(item).
Execute.

Match files file='your file1'/file='your file2'/by hh/drop=junk.
Execute.

Select if (sysmis(item)).
Compute item=1.
Execute.

Add files file=*/file='your file0'/by hh item.
Execute.
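For readers outside SPSS, the set-subtraction idea above can be sketched in
plain Python. This is an illustration only, not Gene's syntax; the data are
the example values from the original post, with made-up expenditure amounts.

```python
# Set subtraction: households with an Item=1 record are subtracted from
# the set of all households; whoever is left gets a filler record with
# the expenditure value left as None (missing).
records = [
    (1, 1, 192), (1, 2, 64), (1, 5, 165), (1, 4, 118),
    (2, 1, 232), (2, 2, 175), (2, 5, 294), (2, 4, 110),
    (3, 2, 320), (3, 3, 39), (3, 4, 176), (3, 5, 127),
]

all_hh = {hh for hh, item, value in records}                # every household
hh_with_food = {hh for hh, item, value in records if item == 1}
missing_food = all_hh - hh_with_food                        # set subtraction

# Add an Item=1 filler record for each household that lacks one,
# then re-sort by household and item, as ADD FILES ... /BY would.
filled = records + [(hh, 1, None) for hh in sorted(missing_food)]
filled.sort(key=lambda r: (r[0], r[1]))

for hh, item, value in filled:
    print(hh, item, value)
```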

Gene Maguin

Re: how to do this efficiently

Richard Ristow
In reply to this post by Rodrigo Briceño
At 05:22 PM 9/6/2007, Rodrigo Briceno wrote:

>My DB has information about Households and their expenditures in a set
>of
>items (11 items that include food, transport, health, etc.). Each of
>this
>items has a unique code, for ex. Food is 1, health is 5, etc. The
>database [has no record] didn't put a value when the HH didn't report
>expenditures in one or some of the issues.

To paraphrase what you say later: you want a record for every code for
every household, filling in the ones omitted from your database with
missing data.

Here's a way: use XSAVE to produce a complete set of records for every
household, with no data; merge back with the original file. For
demonstration, it has 5 item codes instead of 11. SPSS 15 draft output:

|-----------------------------|---------------------------|
|Output Created               |07-SEP-2007 23:56:38       |
|-----------------------------|---------------------------|
HH Item VALUE

  1   1    192
  1   2     64
  1   5    165
  1   4    118
  2   1    232
  2   2    175
  2   5    294
  2   4    110
  3   2    320
  3   3     39
  3   4    176
  3   5    127

Number of cases read:  12    Number of cases listed:  12


SORT CASES BY HH Item.
*  Mark, if it's the first record in a household     ............... .
DO IF   $CASENUM EQ 1.
.  COMPUTE #New_HH = 1.
ELSE IF HH NE LAG(HH).
.  COMPUTE #New_HH = 1.
ELSE.
.  COMPUTE #New_HH = 0.
END IF.

*  For each household, when it's new, write a complete suite of      .
*  5 'item' records, with missing values, to a scratch file          .
DO IF   #New_HH.
.  COMPUTE #Hold_It = Item.
.  NUMERIC Out_Item (F2).
.  LOOP    Out_Item = 1 TO 5.
.     XSAVE OUTFILE=Fill_In
          /RENAME=(Out_Item Item = Item Discard)
          /KEEP  = HH  Item.
.  END LOOP.
END IF.
EXECUTE /* -- Force a data pass, so the XSAVE loop is executed  -- */.

MATCH FILES
   /FILE=*
   /FILE=Fill_In
   /BY   HH Item
   /DROP=Out_Item.

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |07-SEP-2007 23:56:39       |
|-----------------------------|---------------------------|
HH Item VALUE

  1   1    192
  1   2     64
  1   3      .
  1   4    118
  1   5    165
  2   1    232
  2   2    175
  2   3      .
  2   4    110
  2   5    294
  3   1      .
  3   2    320
  3   3     39
  3   4    176
  3   5    127

Number of cases read:  15    Number of cases listed:  15
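Richard's LOOP/XSAVE-plus-MATCH FILES pattern amounts to building the
complete HH-by-item grid and left-joining the data onto it. A Python sketch
of that idea (an illustration, not his syntax; values are those from the
listing above):

```python
# Build the complete grid of (HH, Item) keys, then look up each key in
# the original data; keys with no record come out as missing (None).
data = {
    (1, 1): 192, (1, 2): 64, (1, 5): 165, (1, 4): 118,
    (2, 1): 232, (2, 2): 175, (2, 5): 294, (2, 4): 110,
    (3, 2): 320, (3, 3): 39, (3, 4): 176, (3, 5): 127,
}

households = sorted({hh for hh, item in data})
items = range(1, 6)            # the 5 item codes in the demonstration

completed = [(hh, item, data.get((hh, item)))    # None where no record
             for hh in households for item in items]

for hh, item, value in completed:
    print(hh, item, '.' if value is None else value)
```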

===================================
APPENDIX: File handle and test data
===================================
*  ................................................................. .
*  .................   File for XSAVE output   ..................... .
FILE HANDLE Fill_In
  /NAME='C:\Documents and Settings\Richard\My Documents'              +
          '\Temporary\SPSS\'                                          +
        '2007-09-06 Briceno - how to do this efficiently'             +
        ' - '                                                         +
        'Fill_In.SAV'.

*  .................   Test data               ..................... .
SET RNG = MT       /* 'Mersenne twister' random number generator  */ .
SET MTINDEX = 1634 /*  Eastern Main Line, PA, telephone book      */ .


DATA LIST LIST SKIP=1/
     HH  Item.
BEGIN DATA
     HH Item
     1    1
     1    2
     1    5
     1    4
     2    1
     2    2
     2    5
     2    4
     3    2
     3    3
     3    4
     3    5
END DATA.
FORMATS HH Item (F2).

NUMERIC VALUE   (F4).
COMPUTE VALUE = TRUNC(RV.UNIFORM(1,351)).

*  .................   Post after this point   ..................... .
*  ................................................................. .
LIST.

Re: how to do this efficiently

Florio Arguillas
In reply to this post by Rodrigo Briceño
Hi Rodrigo,

Here's another approach.  This involves a double-transpose of your data.

1.  It requires that you create a subset of your data which contains ALL
records but only the following 3 variables: HH, Item, and a variable that
has two or more valid values, ideally with no system-missing values (let's
call this variable XX). I said "ideally" because it will be easy to spot
which records were added later on if XX has no system-missing values.
2.  You then restructure this file to wide form (CASESTOVARS) to create 11
item variables. The variables are indexed by the values of your Item
variable and prefixed by your XX variable.
3.  Then you restructure this wide form back to long form (VARSTOCASES).
The result is a rectangular data structure: each HH will have 11 records,
one for each item. You will know which records in the rectangular file were
NOT in your subset data because they will have missing values in variable XX.
4.  You then match this rectangular file to your original data set.

Best regards,

Florio

Here's some sample code.

*Step 1 - here I am creating sample data, but in your case you may just
have to submit a SAVE OUTFILE command with a DROP option to create your
subset, because you already have an existing SPSS data set.

DATA LIST FREE / HH Item xx.
BEGIN DATA.
1001008011      1       0
1001008011      2       0
1001008011      3       0
1001008011      4       0
1001008011      5       1
1001008011      6       1
1001008011      8       1
1001008011      10      1
1001008011      11      1
1001008021      1       0
1001008021      2       0
1001008021      3       0
1001008021      4       0
1001008021      7       1
1001008021      8       1
1001008021      9       0
1001008021      10      0
1001008021      11      0
END DATA.

SAVE OUTFILE='test'.

*Step 2.


SORT CASES BY HH Item .
CASESTOVARS
  /ID = HH
  /INDEX = Item
  /GROUPBY = VARIABLE .

*Step 3 - in this step I commented out the DELETE VARIABLES xx command so
you can see what happens when you restructure back to long form.
*Variable XX is not necessary after this step, so you can delete it if
you want.


VARSTOCASES  /MAKE xx FROM xx.1.00 xx.2.00 xx.3.00 xx.4.00 xx.5.00 xx.6.00
   xx.7.00 xx.8.00 xx.9.00 xx.10.00 xx.11.00
  /INDEX = Item(11)
  /KEEP =  HH
  /NULL = KEEP.

*DELETE VARIABLES xx.

SAVE OUTFILE 'testa'.

*step 4.

MATCH FILES FILE = 'testa'
                 /FILE = '<your original data set>'
                 /BY HH Item.

SAVE OUTFILE 'testab'.
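The double transpose can also be pictured in Python: pivot long-to-wide so
every household carries a slot for each item code, then melt wide-to-long,
forcing all 11 codes back out. A rough sketch of the idea, using Florio's
two sample households (an illustration, not CASESTOVARS/VARSTOCASES itself):

```python
# Long -> wide: one dict per household, keyed by item code (CASESTOVARS).
long_rows = [
    (1001008011, i, xx) for i, xx in
    [(1, 0), (2, 0), (3, 0), (4, 0), (5, 1), (6, 1), (8, 1), (10, 1), (11, 1)]
] + [
    (1001008021, i, xx) for i, xx in
    [(1, 0), (2, 0), (3, 0), (4, 0), (7, 1), (8, 1), (9, 0), (10, 0), (11, 0)]
]

wide = {}
for hh, item, xx in long_rows:
    wide.setdefault(hh, {})[item] = xx

# Wide -> long: force every item code 1..11 back out (VARSTOCASES with
# NULL=KEEP). Items absent from the original data reappear with xx = None.
rect = [(hh, item, wide[hh].get(item))
        for hh in sorted(wide) for item in range(1, 12)]

added = [(hh, item) for hh, item, xx in rect if xx is None]
print(added)   # the (HH, Item) pairs that were not in the subset
```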






Re: how to do this efficiently

Richard Ristow
At 02:44 PM 9/8/2007, Florio Arguillas wrote:

>Here's another approach.  This involves a double-transpose of your
>data.

Florio suggests a VARSTOCASES, instead of LOOP/XSAVE, to get full
suites of blank records for every household. Here's an implementation:

>2.  You then restructure this file to wide form (CASESTOVARS) to
>create 11 item variables .  The variables are indexed by the values of
>your Item variable and prefixed by your XX variable.

That's often a good idea. CASESTOVARS will give you a complete set of
item identifiers FOUND IN THE FILE; in a file of any size, that's
commonly a complete set of identifiers overall.

I don't do it, though. Instead, I rely on knowing the complete set of
allowed item identifiers, and generate those directly (by LOOP/COMPUTE)
in the wide-format file.

>3.   Then you restructure this wide form back to long form
>(VARSTOCASES).  The result will display a rectangular data structure
>-- each hh will have 11 records one for each item.

I do do this, as described.

>4.  You then match this rectangular file to your original data set.

And I do this, as well.

This implementation is tested - SPSS draft output:

[Original]
HH Item VALUE

  1   1    192
  1   2     64
  1   5    165
  1   4    118
  2   1    232
  2   2    175
  2   5    294
  2   4    110
  3   2    320
  3   3     39
  3   4    176
  3   5    127

Number of cases read:  12    Number of cases listed:  12


SORT CASES BY HH Item.


*  In a separate dataset, AGGREGATE to one record per household  ... .

DATASET DECLARE   Fill_In.

AGGREGATE OUTFILE=Fill_In
    /BREAK=HH
    /NRecrds 'Number of records for household' = NU.

DATASET ACTIVATE  Fill_In WINDOW=FRONT.


*  Add a complete suite of valid identifiers to every household record .

VECTOR Item.(5,F2).
LOOP       #Item        = 1 TO 5.
.  COMPUTE Item.(#Item) = #Item.
END LOOP.


*  VARSTOCASES, to give every household a *record* for each item ID   .

VARSTOCASES
   /MAKE Item FROM Item.1 TO Item.5
   /DROP=NRecrds.


Variables to Cases
|----------------------------|---------------------------|
|Output Created              |11-SEP-2007 01:15:12       |
|----------------------------|---------------------------|
[Fill_In]

Generated Variables
|----|------|
|Name|Label |
|----|------|
|Item|<none>|
|----|------|

Processing Statistics
|-------------|-|
|Variables In |7|
|Variables Out|2|
|-------------|-|


*  Join with the original file to give every household a complete    .
*  set of item numbers.                                              .

MATCH FILES
   /FILE=Original
   /FILE=Fill_In
   /BY   HH Item.

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |11-SEP-2007 01:15:13       |
|-----------------------------|---------------------------|
HH Item VALUE

  1   1    192
  1   2     64
  1   3      .
  1   4    118
  1   5    165
  2   1    232
  2   2    175
  2   3      .
  2   4    110
  2   5    294
  3   1      .
  3   2    320
  3   3     39
  3   4    176
  3   5    127

Number of cases read:  15    Number of cases listed:  15

===================
APPENDIX: Test data
===================
*  ................................................................. .
*  .................   Test data               ..................... .
SET RNG = MT       /* 'Mersenne twister' random number generator  */ .
SET MTINDEX = 1634 /*  Eastern Main Line, PA, telephone book      */ .


DATA LIST LIST SKIP=1/
     HH  Item.
BEGIN DATA
     HH Item
     1    1
     1    2
     1    5
     1    4
     2    1
     2    2
     2    5
     2    4
     3    2
     3    3
     3    4
     3    5
END DATA.
FORMATS HH Item (F2).

NUMERIC VALUE   (F4).
COMPUTE VALUE = TRUNC(RV.UNIFORM(1,351)).
DATASET NAME Original WINDOW=FRONT.

*  .................   Post after this point   ..................... .
*  ................................................................. .
LIST.