Specify criteria for replacing missing data

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Specify criteria for replacing missing data

Dawn Wiest

Dear listserv,

I have a dataset with 15, 000 cases (organizations) and 400 variables. The data were collected over a 10-year period. A single organization can appear in the dataset for 10 years or for fewer years, depending when it began to meet the criteria for inclusion. For a number of organizations (roughly 5% of total), I need to replace data on 150 variables that appear sequentially in the dataset. The variables are cities in which an organization has members or affiliates, named City1 to City150 in the dataset, where 1 = organization has members/affiliates in the city and 0= organization has no known members/affiliates in the city.

The organizations for which the data need to be replaced with data from another organization are given a particular code on a variable “Replacecode”, depending on the method of replacement required. Data are only replaced if another variable, “TotalCities” is equal to 0. If TotalCities=0 and Replacecode=1, the data for City1-City150 will be replaced with data from another occurrence of this organization for the most proximate year for which TotalCities is not equal to 0. If there is data for the same number of years prior to the missing year and the same number of years after the missing year, data should always be replaced with prior year data. In the table below, Organization A, year 2008, is an example of an organization that requires this type of replacement.

If Replacecode=2, it means that City data are missing for all occurrences of the organization in the dataset and that the data for City1 – City150 should be replaced with data from a similar organization. For example, in year 2010, organization B in the table below was 1 year old and based in Region 3. For all organizations like organization B (Age=1; Region=3) with missing data, data for City1-City150 should be replaced with data from another organization that is 1 year old and based in Region 3 (in this example, Organization H matches the criteria). Ideally, for all cases where Replacecode=2, the city from which data are copied to replace missing data will be randomly selected from all cities in the dataset that match the criteria we specify.

 

What is the most efficient way to replace these data according to the specifications outlined above?

 

To give you an idea what the data look like:

Organization

Year

Age

Region

Replacecode

City1

City2

City3

City4

City150

TotalCities

A

2003

5

1

0

1

1

0

0

0

2

A

2004

6

1

0

1

1

0

0

0

2

A

2005

7

1

0

1

1

0

0

0

2

A

2006

8

1

0

1

1

0

0

0

2

A

2007

9

1

0

1

1

0

0

0

2

A

2008

10

1

1

0

A

2009

11

1

0

0

0

1

1

0

2

B

2010

1

3

2

0

B

2011

2

3

2

0

C

2011

15

1

0

0

0

1

1

0

2

D

2008

33

5

0

1

0

0

0

0

1

D

2009

34

5

0

1

1

1

0

0

3

E

2002

26

2

0

0

1

1

0

0

2

E

2003

27

2

0

0

0

1

1

0

2

E

2004

28

2

0

0

0

1

1

0

2

E

2005

29

2

0

0

0

1

1

0

2

E

2006

30

2

0

1

1

1

1

1

5

E

2007

31

2

1

0

E

2008

32

2

0

1

1

1

1

1

5

F

2004

4

3

0

0

0

1

1

0

2

G

2005

15

5

0

1

1

1

1

1

5

G

2006

16

5

0

1

1

1

1

1

5

H

2011

1

3

0

1

0

0

0

0

1

 

Thank you for your help.

 

 

Dawn Wiest, Ph.D.
Senior Research Analyst
Research Center - 7th Floor
American College of Physicians
190 North Independence Mall West Philadelphia, PA 19106-1572

[hidden email]
(215) 351-2561
Reply | Threaded
Open this post in threaded view
|

Re: Specify criteria for replacing missing data

David Marso
Administrator
For part 1 consider using LAG within a DO REPEAT structure.
For part 2 here is some completely UNTESTED code which will most likely work like a charm (or NOT ;-).
*NOTE* I am not doing *ANYTHING* to address the other 250 odd variables.  You will want to segregate them and match them back later by Organization YEAR AGE etc so you don't cobber them with incoming data from the replacement cases.
---
COMPUTE SCRAMBLE=UNIFORM(1).
SORT CASES BY REPLACECODE AGE REGION SCRAMBLE.
COMPUTE YOKED=SUM(1, (REPLACECODE EQ LAG(REPLACECODE))
                   * (AGE EQ LAG(AGE))
                   * (REGION EQ LAG(REGION))
                   * LAG(YOKED) ).
DO IF REPLACECODE = 2.
DO REPEAT COPY=@ORG @YEAR @AGE @REGION @REPLACECODE
       / SOURCE=ORGANIZATION YEAR AGE REGION REPLACECODE .
COMPUTE COPY = SOURCE.
END REPEAT.
XSAVE OUTFILE "C:\TEMP\REPLACECODE2.SAV"
     / KEEP AGE REGION SCRAMBLE @ORG @YEAR @AGE @REGION @REPLACECODE.
END IF.
EXECUTE.
SELECT IF REPLACECODE NE 2 AND TOTALCITIES GT 0.
MATCH FILES
         / FILE *
        / DROP @ORG @YEAR @AGE @REGION @REPLACECODE.
SAVE OUTFILE "C:\TEMP\ALLBUTREPLACECODE2.SAV".
MATCH FILES
    / FILE *
    / FILE "C:\TEMP\REPLACECODE2.SAV"
    / BY AGE REGION SCRAMBLE .
SELECT IF NOT(MISSING(@REPLACECODE)).
COMPUTE DATAREPLACEDBY = ORGANIZATION.
COMPUTE ORGANIZATION=@ORG.
ADD FILES
  / FILE *
  / FILE="C:\TEMP\ALLBUTREPLACECODE2.SAV".


Dawn Wiest wrote
Dear listserv,
I have a dataset with 15, 000 cases (organizations) and 400 variables.
The data were collected over a 10-year period. A single organization can
appear in the dataset for 10 years or for fewer years, depending when it
began to meet the criteria for inclusion. For a number of organizations
(roughly 5% of total), I need to replace data on 150 variables that
appear sequentially in the dataset. The variables are cities in which an
organization has members or affiliates, named City1 to City150 in the
dataset, where 1 = organization has members/affiliates in the city and
0= organization has no known members/affiliates in the city.
The organizations for which the data need to be replaced with data from
another organization are given a particular code on a variable
“Replacecode”, depending on the method of replacement required. Data
are only replaced if another variable, “TotalCities” is equal to 0. If
TotalCities=0 and Replacecode=1, the data for City1-City150 will be
replaced with data from another occurrence of this organization for the
most proximate year for which TotalCities is not equal to 0. If there is
data for the same number of years prior to the missing year and the same
number of years after the missing year, data should always be replaced
with prior year data. In the table below, Organization A, year 2008, is
an example of an organization that requires this type of replacement.
If Replacecode=2, it means that City data are missing for all
occurrences of the organization in the dataset and that the data for
City1 – City150 should be replaced with data from a similar
organization. For example, in year 2010, organization B in the table
below was 1 year old and based in Region 3. For all organizations like
organization B (Age=1; Region=3) with missing data, data for
City1-City150 should be replaced with data from another organization
that is 1 year old and based in Region 3 (in this example, Organization
H matches the criteria). Ideally, for all cases where Replacecode=2, the
city from which data are copied to replace missing data will be randomly
selected from all cities in the dataset that match the criteria we
specify.

What is the most efficient way to replace these data according to the
specifications outlined above?

To give you an idea what the data look like:

Organization
Year
Age
Region
Replacecode
City1
City2
City3
City4
City150
TotalCities

A
2003
5
1
0
1
1
0
0
0
2

A
2004
6
1
0
1
1
0
0
0
2

A
2005
7
1
0
1
1
0
0
0
2

A
2006
8
1
0
1
1
0
0
0
2

A
2007
9
1
0
1
1
0
0
0
2

A
2008
10
1
1
0

A
2009
11
1
0
0
0
1
1
0
2

B
2010
1
3
2
0

B
2011
2
3
2
0

C
2011
15
1
0
0
0
1
1
0
2

D
2008
33
5
0
1
0
0
0
0
1

D
2009
34
5
0
1
1
1
0
0
3

E
2002
26
2
0
0
1
1
0
0
2

E
2003
27
2
0
0
0
1
1
0
2

E
2004
28
2
0
0
0
1
1
0
2

E
2005
29
2
0
0
0
1
1
0
2

E
2006
30
2
0
1
1
1
1
1
5

E
2007
31
2
1
0

E
2008
32
2
0
1
1
1
1
1
5

F
2004
4
3
0
0
0
1
1
0
2

G
2005
15
5
0
1
1
1
1
1
5

G
2006
16
5
0
1
1
1
1
1
5

H
2011
1
3
0
1
0
0
0
0
1


Thank you for your help.

 Dawn Wiest, Ph.D.
Senior Research Analyst
Research Center - 7th Floor
American College of Physicians
190 North Independence Mall West Philadelphia, PA 19106-1572

[hidden email]
(215) 351-2561
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"