The DHS Program User Forum
Discussions regarding The DHS Program data and results
Too many not-available (.) observations [message #8743] Fri, 11 December 2015 12:29
nwegbus
Hello,
I'm a doctoral student working on my first academic paper, and I'm using the Nigeria DHS 2008 dataset and Stata/IC 13 for the analysis.

My population is 33,385 women (15-49 years), but I'm interested in the subpopulation of married women (23,954). However, when I weight the data, specify the subpopulation and run my logistic regression model, the output gives a subpopulation size of 15,449. I think this may be because my main predictor variable, v511 (age at first marriage), has 8,021 not-applicable values (coded ".").

I initially recoded the not-applicable values as missing, but then ran into trouble when my chief evaluator wanted to know why I wouldn't do multiple imputation, since the missingness was so pronounced.

QUESTIONS:
(1) How do you suggest that I handle the not-applicable values, since recoding them as missing does not seem to work?
(2) Would my sample size then be 15,449 (as reported in the regression output) or 23,954 (the number of married women in my specified subpopulation)?

Any tips or links would be highly appreciated. Thanks in advance.

SN
Re: Too many not-available (.) observations [message #8747 is a reply to message #8743] Fri, 11 December 2015 21:41
user-rhs
When you run a regression model in Stata, Stata handles missing values with listwise deletion. This means that if an observation is missing the outcome or even a single covariate in your model, that observation will be excluded from the analysis. The obvious problem when this happens is that your parameter estimates may be biased unless the data are missing completely at random (MCAR). Data are rarely, if ever, MCAR.
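To see listwise deletion at work, here is a minimal Stata sketch on invented toy data. The names v501 and v511 mirror the DHS recode, but `outcome` and every value are made up for illustration. It counts the complete cases first and then compares that count with e(N) from the model.

```stata
* Toy data standing in for a DHS women's file; every value is invented
clear
input v501 v511 outcome
1 15 1
1 17 0
1 18 1
1 19 1
1 20 0
1 21 1
1 22 0
1 24 0
1 25 1
1 28 0
0 .  0
0 .  1
0 .  0
end

* Count rows where every model variable is non-missing
egen nmiss = rowmiss(outcome v511)
quietly count if nmiss == 0
scalar n_complete = r(N)

* The model silently drops the rows with a missing v511
logit outcome v511
display "complete cases: " n_complete "   used by logit: " e(N)
```

The two numbers should agree, which is why the N in the regression output can be much smaller than the number of rows in the file.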

Your chief evaluator has advised "multiple imputation," but before you go off and read Little and Rubin's rather excellent Statistical Analysis with Missing Data alongside Stata's Multiple Imputation manual (recommended for the bold and adventurous types out there), there are things you should do to determine whether multiple imputation is even necessary in the first place. (By the way, these are also the things Little and Rubin recommend in the first few chapters of their book.) Many scientists tend to reach for the shiniest, fanciest new toy (and the fanciest statistical model, because we want to sound smart), but in many cases a simple solution is sufficient.


Before I start, here's something that I think most seasoned statisticians will agree on:

The key to fitting good models is understanding your data and the data generation process. Therefore, you should familiarize yourself with the data (read the questionnaires, DHS recode manual, any data documentation that came with your dataset, DHS report for the country, run tabulations/cross-tabulations, etc.) before attempting to do any further analysis.



So, if you have not done so already:

1.) Examine each variable in the dataset to determine its level of missingness. I like the user-written command -mdesc-, but this command will not give you the % missing if "missing" was coded as something other than (.) in the dataset. Running -tab, miss- on each variable will tell you the exact numbers and proportions of system missing values (.) and of special missing codes (e.g. 98, 99) in those variables.
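As a rough sketch of step 1, the commands below run on a small invented dataset. `resp` is a hypothetical variable that uses 98/99 as special codes; in a real DHS file you would check the recode manual for the codes each variable actually uses.

```stata
* Invented toy data; resp is a hypothetical item with 98/99 as special codes
clear
input v501 v511 resp
1 18 1
1 20 98
0 .  0
1 25 99
0 .  1
1 17 0
0 .  98
1 22 1
end

* Frequencies including system missing (.)
tab v511, missing

* Built-in overview; only variables that contain missing values are listed
misstable summarize v511 resp

* Special codes are ordinary numbers to Stata, so count them explicitly
count if inlist(resp, 98, 99)

* mdesc is user-written; install it once with: ssc install mdesc
* mdesc v511 resp
```

Note that `misstable` and -mdesc- only see (.) and the extended missing values, which is exactly why the explicit `count` on the special codes is worth adding.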

2.) When you find one or more variables with huge chunks of missing data, think about the process that generated the missingness. Does it make sense that the information is missing for that person, or should there be a response there? Are the data missing because the respondent refused to answer or didn't know the answer (e.g. 98, 99), or because the question was never asked of the respondent (for example, because of a skip pattern in the questionnaire)? Speaking of skip patterns, it is helpful to familiarize yourself with the questionnaire used to collect the data, because it will tell you why a person was not asked a question based on their responses to another question. If the person did not answer the question due to a skip pattern, it probably does NOT make sense to try to impute a response (it's missing for a reason--if you ask someone how many years they have lived with their current husband, and they have never been married, they will not be able to give you an answer). If the person was supposed to answer the question (e.g. 97, 98, 99 missing codes), and the data are missing in huge quantities based on those missing codes, then you probably should impute.
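One way to check whether missingness follows a skip pattern is to cross-tabulate it against the filter variable. The sketch below uses invented values and assumes that v501 == 0 means "never in union" and that v511 is only asked of women who have been in a union; please verify both against the recode manual and questionnaire for your survey.

```stata
* Invented toy data; assumes v501 == 0 means never in union
clear
input v501 v511
0 .
1 18
1 20
0 .
3 25
4 17
0 .
1 22
end

* Flag missing v511 and compare it with marital status
gen byte miss_v511 = missing(v511)
tab v501 miss_v511

* Under a pure skip pattern both of these counts should be zero
count if v501 != 0 & missing(v511)
count if v501 == 0 & !missing(v511)
```

If the first count is well above zero in the real data, some of the missingness is not explained by the skip pattern and deserves a closer look.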

3.) Determine how you're going to handle missing data. For most variables, there should be little to no missing data, but small amounts can add up, especially if you have many model covariates. You have several options (each has its limitations, but what can you do):


  • Do nothing and lose observations in listwise deletion--Some people may find this blasphemous, but if you lose 40 people out of a sample of 20,000, it's not a big deal
  • If the variable that contains huge proportions missing is binary, consider changing it to 1-"Has the characteristic" and 0-"Otherwise" instead of 0-"Does not have the characteristic". That way, people with 99 and (.) can stay in the model
  • If the variable that contains huge proportions missing is based on a skip pattern, consider recoding the missing values to their own category and adding a "flag" (dummy) that takes the value 1 if the variable that determined whether the person got to answer the question was "Yes, eligible" and 0 otherwise. For example, if you have "number of miscarriages and stillbirths" as a model covariate, but this question was only asked of women who have had at least one pregnancy (the value will be . for women who have never been pregnant), then you can create a dummy variable called "ever had pregnancy" (0/1), and recode "number of miscarriages and stillbirths" into a categorical variable such as "0 - Never had pregnancy; 1 - No miscarriages/stillbirths; 2 - 1 to 2 miscarriages/stillbirths; 3 - 3 to 4 miscarriages/stillbirths; 4 - 5 or more miscarriages/stillbirths". This way, you do not lose the women who were never pregnant from the model.

    Caveat: I had a prof. who handled all of her missingness in this way (creating a sort of "flag" variable for the missing generation process), but you have to be very careful when you do this because you make the assumption that ALL people who are missing share the same characteristic after controlling for all other model covariates, which may not be true.
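The miscarriage/stillbirth recode described above could be sketched as follows. The variable names `everpreg`, `nloss` and `loss_cat` are hypothetical and the data are invented; they are not DHS recode variables.

```stata
* Invented toy data; everpreg and nloss are hypothetical names
clear
input everpreg nloss
0 .
1 0
1 1
1 2
1 3
1 5
0 .
1 4
1 7
end

* Build the categorical version, with never-pregnant women as category 0
gen byte loss_cat = .
replace loss_cat = 0 if everpreg == 0
replace loss_cat = 1 if everpreg == 1 & nloss == 0
replace loss_cat = 2 if everpreg == 1 & inrange(nloss, 1, 2)
replace loss_cat = 3 if everpreg == 1 & inrange(nloss, 3, 4)
replace loss_cat = 4 if everpreg == 1 & nloss >= 5 & !missing(nloss)

label define loss_cat 0 "Never pregnant" 1 "None" 2 "1-2" 3 "3-4" 4 "5+"
label values loss_cat loss_cat

* Check the recode against the eligibility flag
tab loss_cat everpreg, missing
```

In a model this would enter as `i.loss_cat`. Since category 0 is the same group as `everpreg == 0`, adding `everpreg` alongside it would be redundant. The `!missing(nloss)` guard matters because Stata treats (.) as larger than any number, so `nloss >= 5` alone would sweep a pregnant woman with an unrecorded count into the top category.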


4.) It is always a good idea to add variables into your model one by one (or chunk by chunk, if you prefer) just to see how the model responds to the addition of other variables. It is also always a good idea to run bivariate analysis before you fit your multivariate model so you get an idea of how things are supposed to be related and how they change once you control for other factors.
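A minimal sketch of building a model block by block is shown below, using Stata's built-in auto dataset purely as a stand-in (it is not DHS data). The one design choice worth noting is that the estimation sample is fixed up front for every variable you eventually plan to use, so that changes in the coefficients reflect the added covariates and not a shrinking sample.

```stata
* Built-in example data, used only as a stand-in for a real analysis file
sysuse auto, clear

* Fix the sample using all variables planned for the final model
* (rep78 has missing values, so it restricts the sample)
gen byte insample = !missing(foreign, mpg, weight, rep78)

* Bivariate model first
logit foreign mpg if insample
estimates store m1

* Then add the next covariate on the same rows
logit foreign mpg weight if insample
estimates store m2

* Side-by-side comparison of coefficients, standard errors and N
estimates table m1 m2, b(%9.3f) se stats(N)
```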

Good luck!

RHS
Re: Too many not-available (.) observations [message #8767 is a reply to message #8747] Mon, 14 December 2015 14:59
nwegbus
Thank you so very much. I really, truly, appreciate your response. This problem has literally held up my paper for 3 weeks! I wish I'd tried this forum sooner.

Once again - thank you. I'll let you know how it goes.
Re: Too many not-available (.) observations [message #9748 is a reply to message #8767] Thu, 12 May 2016 12:31
mianrashid
Hello,
I am using the Pakistan DHS IR dataset file. I would like to know whether data screening (for some values) should be done before or after declaring the survey design for the dataset. If after, will the command be the same: svyset v001 [pw = v005], strata(v023)?



MianRashid
Re: Too many not-available (.) observations [message #10167 is a reply to message #9748] Fri, 01 July 2016 15:45
Trevor-DHS
I'm not sure that I understand your question, but it doesn't matter if you include data screening before or after the definition of the svyset command.
For the svyset command, I recommend using:
gen wt=v005/1000000
svyset v021 [pw=wt], strata(v023)
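
For anyone who wants to try this together with a subpopulation, as in the original question, here is a hedged sketch on simulated data. The design variables use the DHS names from the commands above, but all values are randomly generated, and `outcome` and `married` are hypothetical names introduced only for illustration.

```stata
* Simulated stand-in for an IR file: 4 strata, 5 clusters each, 20 women per cluster
clear
set seed 12345
set obs 400
gen v023 = ceil(_n/100)
gen v021 = ceil(_n/20)
gen v005 = 1000000 * (0.5 + runiform())
gen v501 = cond(runiform() < 0.7, 1, 0)
gen v511 = 15 + floor(10*runiform()) if v501 == 1
gen outcome = runiform() < 0.4

* Survey design as recommended above
gen wt = v005/1000000
svyset v021 [pw=wt], strata(v023)

* Indicator for the subpopulation (hypothetical name)
gen byte married = (v501 == 1)

* subpop() keeps the full design for variance estimation,
* whereas an -if- restriction would drop rows before estimation
svy, subpop(married): logit outcome v511

* Observations used, and how many of them are in the subpopulation
display "N = " e(N) "   subpopulation N = " e(N_sub)
```

Comparing e(N_sub) with a plain `count if married` is a quick way to see how many subpopulation members were lost to missing values in the model variables.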
Re: Too many not-available (.) observations [message #15180 is a reply to message #10167] Thu, 14 June 2018 00:34
Hassen
Dear all, thank you very much! I have learned a lot from your posts.
Cheers, Hassen


Hassen Ali (Chief Public Health Professional Specialist)