The DHS Program User Forum
Discussions regarding The DHS Program data and results
Appropriate handling of missing values in analysis [message #25780] Thu, 08 December 2022 08:07
gebretsh@gmail.com
Messages: 17
Registered: June 2022
Member
Dear DHS data experts,
I hope you realize how important your help is in correctly analyzing DHS data.
I have personally benefited greatly from this help forum.
Now, I would like to ask a practical question about the handling of missing cases in DHS data.
I have been analyzing the EDHS data with the help of the Guide to DHS Statistics released in 2018. Following this very
important guide, I created variables so that missing values were included in the denominators, at least for some variables.
For example, for the variable mass media exposures, the guide says "Missing values are excluded from the numerators, but included in the denominator." For background characteristics of respondents, it says, "Missing values or "Don't know" responses are shown separately in the percent distributions."
Now, my question: is it appropriate to keep the missing cases in the denominator as a standalone category of that variable, or to recode the missing and/or DK responses into a less advantaged existing category (like no education or no occupation)? For example, for maternal and partner occupation, is it appropriate to recode missing and/or DK into the "not working" category, or to leave them in a separate "missing/DK" category?
My aim is to use the variables in a regression model, not just in descriptive statistics.
The problem I would face if I kept the missing/DK responses in a separate group is that the group would be too small to be used in a regression model.

Your expertise would help me resolve these dilemmas in my analysis.
Also note that this question applies to all of my variables, namely maternal age, maternal and partner education, maternal and partner occupation, mass media exposure, whether a child was wanted (planning status), and skilled antenatal and delivery care. I mention these variables because your advice may differ by variable.
Re: Appropriate handling of missing values in analysis [message #25781 is a reply to message #25780] Thu, 08 December 2022 11:04
Bridgette-DHS
Messages: 3032
Registered: February 2013
Senior Member
Following is a response from Senior DHS staff member, Tom Pullum:

Glad the forum is helpful to you, and thanks for saying that!

The term "missing values" is ambiguous. In the data files, a blank (or a dot in Stata) means NA, or "Not Applicable". It is used, for example, if there is a skip or filter in the questionnaire such that the respondent was not asked a specific question. As used in the Guide to DHS statistics, "missing" means the variable is not NA, but there was not a valid response. Sometimes those responses are included in the denominator, but sometimes they are not.

Here is an example from the KR file for the 2016 DHS survey in Ethiopia. h11, diarrhea in the past two weeks, has a code 8 for DK. Say we want to calculate the proportion of children who had diarrhea in the past two weeks, using Ethiopia 2016. The unweighted distribution of h11 is as follows:

. tab h11

       had diarrhea |
           recently |      Freq.     Percent        Cum.
--------------------+-----------------------------------
                 no |      8,826       88.21       88.21
yes, last two weeks |      1,090       10.89       99.10
         don't know |         90        0.90      100.00
--------------------+-----------------------------------
              Total |     10,006      100.00


"tab h11,m" will show that there were 635 NA cases. All of the NA cases are children born in the past 5 years who died before the survey. Clearly they should be omitted from both the numerator and the denominator. The denominator for the proportion would include all 10,006 cases, but the numerator would only include the "yes" responses. The "don't know" responses are in effect grouped with "no". The reason for grouping them with "no", as I think of it, is to avoid over-estimating the prevalence of the outcome. There could be other variables in which "don't know" would be grouped with "yes", but for the same reason, that we want to be conservative and avoid over-estimating the prevalence of an unfavorable outcome.

Actually, although this is what DHS usually does, someone else might want to drop the "don't know" responses from the denominator. I don't have strong feelings about that but I prefer to keep them in the denominator.
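
To make the two options concrete, here is a small sketch that uses only the unweighted 2016 counts from the table above. It is illustrative arithmetic, not a reproduction of any published figure; published DHS estimates are weighted and will differ.

```stata
* Unweighted counts taken from the 2016 "tab h11" output above
* Option 1 (usual DHS practice): "don't know" stays in the denominator
display %6.4f 1090/10006            // 0.1089
* Option 2: "don't know" dropped from the denominator
display %6.4f 1090/(8826 + 1090)    // 0.1099
```

With only 90 "don't know" responses out of 10,006, the two choices differ by about a tenth of a percentage point.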

The distribution of h11 in the 2011 survey looks like this:

. tab h11

       had diarrhea |
           recently |      Freq.     Percent        Cum.
--------------------+-----------------------------------
                 no |      9,068       83.90       83.90
yes, last two weeks |      1,620       14.99       98.89
         don't know |        105        0.97       99.86
                  9 |         15        0.14      100.00
--------------------+-----------------------------------
              Total |     10,808      100.00

There are not supposed to be any 9's; the label for h11 does not include 9. However, for some reason, there were 15 children for whom the interviewer could not get a valid response. To get the proportion in this situation, I would prefer to drop the values with code 9. "Don't know" comes from the mother; perhaps the child is temporarily staying with the grandmother, for example, but "9" is completely meaningless. The remaining 10,793 cases would be in the denominator and (as with the 2016 data) the "yes" cases would be in the numerator. However, I'd have to go to the CSPro code to see what DHS actually did with those 15 cases. The Guide to DHS Statistics suggests that they were retained in the denominator, and I'm just saying that's not the only option.
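
The two treatments of code 9 can be compared side by side. The following is a rough sketch, assuming the Ethiopia 2011 KR file is already in memory; the variable names diar_drop9 and diar_keep9 are made up for illustration, and the numeric codes for "yes" should be confirmed before relying on it.

```stata
* Check the numeric codes first; some DHS files store "yes" as 1, others as 2
tab h11, missing nolabel
* Drop code 9: only codes up to 8 (no, yes, don't know) enter the denominator
gen byte diar_drop9 = inlist(h11, 1, 2) if h11 <= 8
* Retain code 9: every non-NA case enters the denominator
gen byte diar_keep9 = inlist(h11, 1, 2) if !missing(h11)
* Unweighted means; from the table, Obs should be 10,793 and 10,808
summarize diar_drop9 diar_keep9
```

From the unweighted table, this works out to 1,620/10,793 (about 15.01%) when code 9 is dropped and 1,620/10,808 (about 14.99%) when it is retained.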

For some variables, there will be a code 9 (or 99, etc.) that IS in the variable label but it means "refused", "inconsistent", etc. Usually I would omit them from the denominator, but this is a judgment call and I'm not 100% sure what is done during data processing.

My personal practice is usually to do a recode rather than dropping cases from the file or over-writing standard variables. For example, I would use these lines:

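* Set to 0 for every non-NA code up to 8; NA (.) and code 9 stay missing,
* because Stata treats missing as larger than any number
gen diarrhea=0 if h11<=8
* Set to 1 for "yes"; "don't know" (8) remains 0
replace diarrhea=1 if h11==1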
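
Since the tables above are unweighted, a weighted estimate or a regression needs the survey design declared first. The following is a minimal sketch, not DHS's own processing code. It assumes the recoded variable diarrhea from the lines above, and it assumes the usual DHS names v005 (sample weight), v021 (primary sampling unit), v023 (strata) and v106 (mother's education). The strata variable in particular can differ between surveys, so check the survey documentation.

```stata
* Assumes a KR file is in memory and "diarrhea" has been created as above
gen wt = v005/1000000                  // DHS weights are stored multiplied by 1,000,000
svyset v021 [pweight=wt], strata(v023) singleunit(centered)
* Weighted proportion with design-based standard error
svy: mean diarrhea
* Example regression; cases where diarrhea is missing are excluded automatically
svy: logit diarrhea i.v106
```

Because the NA and code 9 cases were left missing in the recode rather than dropped from the file, the full sample design stays intact for the variance calculation.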


This is a long answer but I hope it's clear. I make a distinction between what I would prefer to do personally and what DHS does (or may do) during data processing.
