Using weights in regression analysis [message #72]
Wed, 20 February 2013 11:48
DHS user
Messages: 111 Registered: February 2013
Senior Member
I am planning to do a regression analysis. Your manual does not recommend using weights for such analyses. Why is that, and would you advise me to use weights? If so, which weight should I use, given that my unit of analysis is the couple?
[Updated on: Thu, 21 November 2013 18:45] by Moderator
Re: Using weights in regression analysis [message #73 is a reply to message #72]
Wed, 20 February 2013 11:50
Bridgette-DHS
Messages: 3210 Registered: February 2013
Senior Member
Here is a response from one of our DHS experts, Tom Pullum, that should answer your question.
Future versions of the Guide to DHS Statistics will modify that recommendation. Not using weights is a minority viewpoint here at DHS. Almost all of us now advocate the use of weights. How you use them will depend somewhat on your statistical package. Most of us here use Stata.
If you do not use weights, the coefficients will be biased toward the over-sampled sub-populations.
For the HR and PR files, use hv005; for the IR, KR, and BR files, use v005; for the MR file, use mv005. The CR file contains both v005 and mv005. It makes very little empirical difference which of the two you use, but we prefer mv005 because it is adjusted for male non-response, which is typically more serious than female non-response. Some people (e.g. Stan Becker) have proposed a composite couples weight, but, as I said, the effect of the alternatives is trivial.
If you use the AR file, the weight is hiv05, and if you form a couples file using the AR data, the weight is hiv05 for males.
For some purposes it is convenient to divide the weights by 1,000,000, but in Stata, for example, pweight is unaffected by that, and for regressions you use pweight.
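To see why dividing the weights by 1,000,000 cannot change a weighted regression, here is a rough sketch in Python rather than Stata. The data and the `weighted_ols` helper are made up for illustration; the helper is a plain weighted least-squares fit, not a reproduction of what Stata does internally.

```python
def weighted_ols(x, y, w):
    """Return (intercept, slope) of a weighted least-squares fit of y on x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical data: five cases with raw DHS-style weights (six implied decimals).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
v005 = [1500000, 500000, 1200000, 800000, 1000000]
scaled = [w / 1_000_000 for w in v005]

raw_fit = weighted_ols(x, y, v005)       # weights as stored in the file
scaled_fit = weighted_ols(x, y, scaled)  # weights divided by 1,000,000

print(raw_fit)
print(scaled_fit)
```

The constant factor cancels in every ratio, so the two fits agree up to floating-point error.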
You also need to adjust for the clusters (the primary sampling units) and the strata. In Stata that would be done with svyset and svy. These adjustments do not alter the coefficients, but they do alter the standard errors, usually in opposite directions: clustering tends to increase them and stratification tends to reduce them.
I hope this helps.
Bridgette-DHS
[Updated on: Mon, 18 March 2013 09:11]
Re: Using weights in regression analysis [message #174 is a reply to message #163]
Wed, 20 March 2013 20:17
Trevor-DHS
Messages: 805 Registered: January 2013
Senior Member
You can use the Complex Samples procedures in SPSS to achieve the same as using svy in Stata. You first need to set up a Complex Sampling Plan using the CSPLAN command (I recommend creating this using the dropdown menu under Analyze, Complex Samples, Prepare for Analysis, and then pasting it into your SPSS syntax). The parameters you typically need are:
Strata: V023 - or alternatively create your strata variable from a combination of V024 and V025.
Clusters: V021 - typically this is the same as V001, but for a few surveys the Primary Sampling Unit (PSU) is different from the final cluster, and the PSU should be used.
Analysis weight: V005 - don't divide by 1000000 as SPSS expects the weight used with Complex Samples to be an integer. Your "population" size will be a million times too big in your results, but just remember to divide it by 1000000 after your analysis. If you use the weight divided by 1000000, SPSS either rounds or truncates your weight to an integer and your analysis will be wrong.
Estimator type: WR (with replacement) - DHS doesn't use replacement sampling, but to match the DHS results this option is needed.
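To illustrate the point about the analysis weight: if a package really does round or truncate the weight to an integer, as described above, then dividing V005 by 1000000 first throws away almost all of the information in it. A small Python sketch with made-up weight values:

```python
# Hypothetical raw DHS-style weights (stored with six implied decimals).
v005 = [437215, 1862004, 951730, 1204418]

scaled = [w / 1_000_000 for w in v005]   # 0.437215, 1.862004, ...
rounded = [round(s) for s in scaled]     # what integer rounding would leave
truncated = [int(s) for s in scaled]     # what truncation would leave

print(rounded)    # [0, 2, 1, 1]
print(truncated)  # [0, 1, 0, 1]
```

Cases with a weight of 0 effectively drop out, which is why the unscaled integer weight is the safer choice under that assumption.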
Once you have created your Complex Samples Plan you can then use one of the Complex Samples Procedures for your analysis. I suggest using the CSDESCRIPTIVES first and reproducing the sampling errors shown in the DHS report for one indicator to ensure that you have the CSPLAN set up properly before you try using one of the other CS procedures such as CSLOGISTIC. [Note that DHS uses confidence intervals of +/-2 SEs, whereas SPSS will use +/-1.96 SE for the confidence intervals].
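The difference between the two confidence-interval conventions is small but visible when you compare against a report. A quick Python sketch, using a made-up estimate and standard error:

```python
def confidence_interval(estimate, se, multiplier):
    """Return (lower, upper) as estimate -/+ multiplier * se."""
    return estimate - multiplier * se, estimate + multiplier * se

# Hypothetical indicator: estimate 0.640 with a standard error of 0.012.
estimate, se = 0.640, 0.012

dhs_ci = confidence_interval(estimate, se, 2.0)    # convention in DHS reports
spss_ci = confidence_interval(estimate, se, 1.96)  # convention in SPSS output

print(dhs_ci)
print(spss_ci)
```

So when checking your CSPLAN against a DHS report, compare the standard errors themselves, or recompute the interval with a multiplier of 2.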
Re: Using weights in regression analysis [message #240 is a reply to message #229]
Sat, 30 March 2013 19:14
Reduced-For(u)m
Messages: 292 Registered: March 2013
Senior Member
From the DHS FAQs (under "using data files": http://www.measuredhs.com/faq.cfm):
***First, use the svyset command to tell Stata how your data is set up:
*generate weight
generate weight = v005/1000000
*make unique strata values by region/urban-rural (label option automatically labels the results)
egen strata = group(v024 v025), label
*check results
tab strata
*tell Stata the weight (using pweights for robust standard errors), cluster (psu), and strata:
svyset [pweight=weight], psu(v021) strata(strata)
****Now for a regression - if you prefix regress with "svy:" Stata will now know how to weight your data and compute the right standard errors
svy: reg Y X
***Quick note: computing standard errors in this way is probably not OK for a lot of regressions. Without getting off track or all statsy, a good way to think of this is that this standard error calculation is alright IF the error terms and covariates are independently and identically distributed across observations, other than as operating through the sampling procedure (the stratification and clustering prior to randomization that produces the particular sample you have). I tend to think of these standard errors as the smallest the "true" standard errors could possibly be, but I'm kind of on the conservative/stickler end of this debate, and others would surely disagree.
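For anyone unsure what the "egen strata = group(v024 v025)" step produces: it numbers each distinct region-by-residence combination 1, 2, 3, ... in sorted order. A rough Python equivalent follows; `group_ids` is a hypothetical helper and the data are made up (the coding 1 = urban, 2 = rural for v025 is assumed here).

```python
def group_ids(*columns):
    """Number each distinct combination of values 1..k, in sorted order."""
    rows = list(zip(*columns))
    codes = {combo: i for i, combo in enumerate(sorted(set(rows)), start=1)}
    return [codes[row] for row in rows]

# Hypothetical data for six respondents.
v024 = [1, 1, 2, 2, 3, 1]  # region
v025 = [1, 2, 1, 1, 2, 1]  # type of residence (assumed 1 = urban, 2 = rural)

strata = group_ids(v024, v025)
print(strata)  # [1, 2, 3, 3, 4, 1]
```

Only combinations that actually occur in the data get a number, so region 2 rural and region 3 urban are simply absent in this toy example.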
Re: Using weights in regression analysis [message #300 is a reply to message #248]
Thu, 11 April 2013 17:31
Bridgette-DHS
Messages: 3210 Registered: February 2013
Senior Member
Here is a response from one of our Stata experts, Tom Pullum, that should answer your question.
At DHS we mostly use Stata, so I will answer in terms of Stata. The weighting of the data is done as part of the estimation (e.g. regression) command. There is no other sense in which you would "weight the data". For example, instead of "regress y x" you would say "regress y x [pweight=v005]".
You will get the same result if you first say "gen pwt=v005/1000000" and then "regress y x [pweight=pwt]", which some users would prefer to do, but as I said it makes no difference, because Stata always automatically normalizes the weights. Without weights, the estimates are biased toward the oversampled subpopulations and away from the undersampled subpopulations.
The adjustments for clusters and strata will affect the standard errors but not the estimates. If you want to test the significance of the coefficients, you must make those adjustments. For the clusters you expand the above statement to "regress y x [pweight=pwt], cluster(v001)". If you use strata, you must use "svyset" and "svy: regress". The svyset can specify the pweights, clusters, and strata, and then apply them with "svy: regress y x". The svyset command differs slightly across different versions of Stata, e.g. between 11 and 12, so just enter "help svyset" to get the syntax for your version. The strata variable is usually either v022 or v023. However, it is not always labeled correctly. As a general rule, the strata are all combinations of urban/rural and region (the first subnational unit). If the variable labeled "strata" is not consistent with that rule, you should ask someone at DHS to check it.
Re: Using weights in regression analysis [message #371 is a reply to message #345]
Fri, 26 April 2013 10:48
Bridgette-DHS
Messages: 3210 Registered: February 2013
Senior Member
Here is a response from one of our Stata experts, Tom Pullum:
"Weighting does not inflate or deflate the number of cases in your analysis. All it does is re-balance them so that under-sampled sub populations are weighted up, and over-sampled sub populations are weighted down, producing estimates of proportions, means, or coefficients that are unbiased. In any kind of regression or test using pweights in Stata, at least, the weights are calculated so that the sum of the weighted cases is exactly the same as the sum of the unweighted cases. You don't have to do anything -- this is automatic.
Standard errors and confidence intervals and statistical tests, in Stata at least, are calculated in a "robust" way with formulas that have been carefully developed. Those things will be more sensitive to whether you take the clusters and strata into account, using svyset and svy, than to whether you use weights. DHS strongly recommends that you make those adjustments."
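The first point in the response above (weights re-balance the cases without changing how many there are) can be sketched numerically. This is an illustration in Python with made-up data, assuming the normalization is simply "rescale so the weights sum to the number of cases":

```python
# Hypothetical sample of 8 cases: the first six come from an over-sampled
# group (small weights), the last two from an under-sampled group (large weights).
outcome = [1, 0, 0, 0, 0, 0, 1, 1]
v005 = [400000] * 6 + [2000000] * 2

n = len(v005)
total = sum(v005)
normalized = [w * n / total for w in v005]  # rescaled so the weights sum to n

unweighted_mean = sum(outcome) / n
weighted_mean = sum(w * y for w, y in zip(normalized, outcome)) / sum(normalized)

print(sum(normalized))   # 8.0, the same as the number of cases
print(unweighted_mean)   # 0.375
print(weighted_mean)     # 0.6875
```

The case count is unchanged, but the estimate moves toward the under-sampled group, which is the re-balancing described in the quote.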
Re: Using weights in regression analysis [message #545 is a reply to message #240]
Fri, 14 June 2013 16:18
mnicolson
Messages: 1 Registered: May 2013
Member
Hi, I have a couple of questions about weighting which I was hoping someone might be able to help with?
Weighting, clustering and stratification for regression
An earlier poster has responded by saying:
"From the DHS FAQs (under "using data files": http://www.measuredhs.com/faq.cfm):
***First, use the svyset command to tell Stata how your data is set up:
*generate weight
generate weight = v005/1000000
*make unique strata values by region/urban-rural (label option automatically labels the results)
egen strata = group(v024 v025), label
*check results
tab strata
*tell Stata the weight (using pweights for robust standard errors), cluster (psu), and strata:
svyset [pweight=weight], psu(v021) strata(strata)
****Now for a regression - if you prefix regress with "svy:" Stata will now know how to weight your data and compute the right standard errors
svy: reg Y X
***Quick note: computing standard errors in this way is probably not OK for a lot of regressions. Without getting off track or all statsy, a good way to think of this is that this standard error calculation is alright IF the error terms and covariates are independently and identically distributed across observations, other than as operating through the sampling procedure (the stratification and clustering prior to randomization that produces the particular sample you have). I tend to think of these standard errors as the smallest the "true" standard errors could possibly be, but I'm kind of on the conservative/stickler end of this debate, and others would surely disagree."
I have followed this and all works fine - however, I have two questions.
(1) It seems that the command given above assumes that the data has been collected using one-stage design.
The Stata Manual defines one-stage design as follows:
"A commonly used single-stage survey design uses clustered sampling across several strata, where
the clusters are sampled without replacement."
However, when I read the DHS country manuals, it suggests that samples were selected in two or more stages depending on whether the respondent comes from a rural or urban area. The Stata Manual states that we then have to use a different command, one which accounts for the multiple stages of sampling.
It gives the example:
"We have (fictional) data on American high school seniors (12th graders), and the data were collected
according to the following multistage design. In the first stage, counties were independently selected
within each state. In the second stage, schools were selected within each chosen county. Within each
chosen school, a questionnaire was filled out by every attending high school senior."
The Stata command it suggests is:
svyset county [pw=sampwgt], strata(state) fpc(ncounties) || school, fpc(nschools)
Is the command -svyset [pweight=weight], psu(v021) strata(strata)- the correct way of dealing with DHS survey data? Or should I be using a command that takes into account the multiple stages used to collect DHS data?
I realise that there is another factor to consider, namely whether the clusters are sampled *with* or *without* replacement. Does DHS sample with replacement, thereby making it unnecessary to account for the second-stage clusters?
(2) The comment suggests that this way of calculating standard errors ('the way' - accounting for weighting and stratification) won't be appropriate for a lot of regressions. Why is this? If it's not appropriate, does that mean an alternative way is to simply run a regression without weighting or accounting for sampling?
****
Even though I am using the sample weight, my tabulations differ from those in the country tables
I am analysing the India DHS dataset. My unit of analysis is the individual and I have appended all three DHS India datasets into one large dataset.
I am using the following command in order to attempt to replicate the total contraceptive prevalence rate given on p. 170 of the DHS-3 India country report:
tab cpr [iweight=weight] if (v025==0 | v501==1 | year==2005 & 2006)
I created the weight variable using the following command (given above)
generate weight = v005/1000000
cpr is a dummy variable that I have created from v313 (cpr=1 if any method is used; cpr=0 when no method is used)
v025==1 (means that the household type = urban)
v501==1(means that marital status = married)
My tabulation states that the CPR=45.43%
The country table states that the CPR=64%
Could the difference between my figure and the figure in the country table be due to the fact that I have appended the three datasets into one?
Or do you calculate the contraceptive prevalence rate differently from me? If so, how do you do it?
***
My apologies for the length of this post - I hope it all makes sense and I look forward to any responses
Thanks.
Re: Using weights in regression analysis [message #546 is a reply to message #545]
Fri, 14 June 2013 17:24
Reduced-For(u)m
Messages: 292 Registered: March 2013
Senior Member
Hey there. These are all really good questions. I'll go through them best I can. But first off, just to be clear, I'm not a DHS employee, and have no special insight other than what I've gleaned in my working with the DHS data, discussions with other users, and my general econometrics training. So nothing I say should be taken as the voice of the DHS speaking, or even the advice of some super-expert, just another practitioner trying to figure these things out. With that out of the way...
Weighting, clustering and stratification for regression
(1)...just about everything you say here is new and interesting to me. I had never used the "svy" command before using the DHS - I always weighted and specified standard error calculations manually. What I know about using "svy" for the DHS mostly comes from the DHS FAQ and this paper: http://eprints.soton.ac.uk/8142/
Basically, I have no real insight on the proper use of "svy" to deal with complicated survey designs.
(2) This is one of those times I wish I had said less (it happens), just because leaving something fuzzy like that is probably not helpful. First things first, when it comes to standard errors/inference, we aren't really talking about weighting, we are talking about stratification and clustering. The weighting problem is really just a question of what population you want your results to be representative of (the survey population, or the national population, or the regional population or whatever). Weighting doesn't require the use of "svy".
As for statistical inference (standard errors), to me, one way of thinking of the DHS standard error assumptions is that IF the DHS had used a simple random sampling, then we could just use OLS standard errors (and weight manually with [pweight=weight]). However, in many applications, like difference-in-difference estimation or a cohort fixed-effects-type regression, even with simple random sampling, this is probably an un-conservative technique. Coming from the Labor econ world, I think two really good introductions to the problem are "How much should we trust difference-in-difference estimations" (Bertrand, Duflo, Mullainathan) and "Robust inference with clustered data" (Cameron and Miller). These papers focus on situations where there is likely to be auto-correlation and/or heteroskedasticity in error terms within "clusters" like states or counties (not to be confused with sampling clusters, but some larger grouping of people). As another example, and closer to home, when estimating cohort determinants of HAZ (like, say, effect of month of birth, or the effect of some shock in the birth cohort) I find that the "svy" technique leads to rejection rates on a placebo treatment over 25% (when it should be 5%) and up to like 70% in some cases.
One important caveat though is that all of the things I mention above are uses of the DHS for which it was probably not originally designed. I think that when it comes to things like the effect of maternal age on HAZ, then the DHS method will produce better sized standard errors. I haven't run any placebo tests on that to check implied rejection rates, but you could probably do it fairly easily.
Even though I am using the sample weight, my tabulations differ from those in the country tables
Hopefully some "-DHS" will respond to this, as I have only two (maybe, maybe not) helpful comments and one useless one.
1 - I use "pweight" instead of "iweight". My guess is that it will not make a difference, but these are probability weights (best as I can tell) and since using pweight automatically scales everything to sum to 1, it might make a difference.
2 - Still on weights...when appending multiple rounds, my understanding is that this induces a new weighting problem, as DHS weights within a survey sum to the sample size. So, by just using the given weights, you are not weighting each survey the same, you are implicitly weighting it by the sample size. I'm not sure if that is what you want or not. An alternative would be to re-scale each survey's total weight to sum to one manually, preserving probability of sampling within survey but making each survey have the same total weight (assuming population size is constant, each survey is actually "representing" the same number of women).
egen surveytotalweight = total(weight), by(survey)
gen new_weight = weight/surveytotalweight
I so far have done that AFTER I dropped all observations that wouldn't go in the regression I use or the statistics I'm tabulating.
3 - I know nothing about replicating DHS tables, so I'll go back to hoping someone "-DHS" responds.
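For readers who don't use Stata, the per-survey rescaling in point 2 amounts to the following. This is a rough Python sketch with made-up survey years and weights; `rescale_within_survey` is a hypothetical helper name.

```python
from collections import defaultdict

def rescale_within_survey(survey, weight):
    """Divide each weight by its survey's total, so every survey sums to 1."""
    totals = defaultdict(float)
    for s, w in zip(survey, weight):
        totals[s] += w
    return [w / totals[s] for s, w in zip(survey, weight)]

# Hypothetical appended file: three survey rounds of different sizes.
survey = [1992, 1992, 1992, 1998, 1998, 2005, 2005, 2005, 2005]
weight = [0.8, 1.2, 1.0, 0.5, 1.5, 1.1, 0.9, 1.3, 0.7]

new_weight = rescale_within_survey(survey, weight)
print(new_weight)
```

Relative weights within a round are preserved, while each round contributes the same total weight regardless of its sample size. As noted above, whether that is what you want depends on the question you are asking.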
*******
I hope this has been in some way helpful. I've been struggling with the weighting thing myself, and how to deal with multiple survey rounds. Truth is, I don't think there are really "perfect" answers out there, and a lot of us are trying to figure things out on our own and doing things in different ways depending on our backgrounds. So my perspective is one that comes from dealing with problems in the Labor Econ world, and Epidemiologists or Nutritionists would have different opinions and different modelling concerns. For example, my field has basically stopped using any random effects models and switched to using an "arbitrary" or "cluster-robust" variance/covariance matrix estimation - I haven't been able to confirm, but I think that the "svy" command uses some weird random-effects-type specification of the V/C matrix. So my biases and "insights" (such as they are) come from that world, and may not be totally appropriate here. These are just my thoughts. I'd love to learn more if someone thinks I'm missing something obvious or important or just fundamentally not understanding something.
Re: Using weights in regression analysis [message #850 is a reply to message #848]
Sun, 20 October 2013 19:09
Reduced-For(u)m
Messages: 292 Registered: March 2013
Senior Member
Here is some discussion of the problem, which continues in more (and helpful) detail if you follow the link.
http://www.stata.com/support/faqs/statistics/stratum-with-one-psu/
Having a stratum with a single PSU is a fairly common problem. When there is only one PSU within a stratum, there is insufficient information with which to compute an estimate of that stratum's variance. Therefore, it is impossible to compute the variance of an estimated parameter when the data are from a stratified clustered design. There are two solutions. The first solution is to simply delete the stratum with the singleton PSU from your sample. The second solution is to treat the data from that stratum as though it is from another stratum. In order to implement either solution, one must first identify which strata are affected and which observations in the dataset belong to those strata. The svydes command will identify the strata with singleton PSUs by placing an asterisk next to the stratum identifier. For example, in the output below, stratum 1 is identified as having only 1 PSU.
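The svydes output itself is on the linked page. As a stand-in, the check it performs (which strata contain only one distinct PSU?) can be sketched in a few lines of Python. The helper name and the data are made up for illustration:

```python
from collections import defaultdict

def singleton_strata(strata, psu):
    """Return the sorted list of strata that contain exactly one distinct PSU."""
    psus_by_stratum = defaultdict(set)
    for s, p in zip(strata, psu):
        psus_by_stratum[s].add(p)
    return sorted(s for s, members in psus_by_stratum.items() if len(members) == 1)

# Hypothetical analysis sample: stratum 1 has three cases but only one PSU.
strata = [1, 1, 1, 2, 2, 2, 3, 3]
psu = [101, 101, 101, 201, 202, 202, 301, 302]

lonely = singleton_strata(strata, psu)
print(lonely)  # [1]
```

Note that it is the number of distinct PSUs that matters, not the number of cases, so a stratum can become a singleton after you restrict the sample.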
The other possibility (I think) is to use the subpop() option of svy, which is discussed in another context here:
http://www.icpsr.umich.edu/icpsrweb/CPES/support/faqs/2011/04/how-should-i-detect-and-handle-single
I really wish I understood better what kind of estimator this particular "svy" command is using, but I've still not found good documentation describing it, so I can't explain exactly why this is a problem in a mathematical/statistical sense. One other thing people have worried about here is the weighting - since you are only using people who have tested positive for HIV, you are pretending like HIV + is orthogonal to sampling probability, and I'm pretty sure it wouldn't be (because HIV is not distributed randomly across geography and SES class). But I wouldn't think it makes that much difference.
One alternative strategy would be just to give up on the weights and cluster at some larger-than-PSU geographic level, say region if there are many regions (if there are few regions, the wild-t bootstrap would work, and I would think you would "cluster" those by strata, because I'm guessing that is something like region-by-urban status). Something like:
logistic unmetneed i.v106 if hivtest_result ==1, cluster(region)
Let me know if this helps.