The DHS Program User Forum
Discussions regarding The DHS Program data and results
How to Weight Data in R [message #179] Thu, 21 March 2013 16:17
Jewel
Messages: 1
Registered: March 2013
Location: United States
Member
I am trying to use the "survey" package in R to do a DHS analysis, but I want to be sure that I am setting up the weights properly.
My general code is as follows:

weight <- mydhsdata$v005/1000000
data <- svydesign(id = mydhsdata$caseid, strata = mydhsdata$v021,
                  weights = weight,
                  data = mydhsdata)


If anyone has any insights on how to set up the dataset in R, I would appreciate the help!
Re: How to Weight Data in R [message #198 is a reply to message #179] Mon, 25 March 2013 10:52
Bridgette-DHS
Messages: 1504
Registered: February 2013
Senior Member
Here is a response to your question, from one of our DHS experts, Tom Pullum.

We cannot offer much support for R.

Yes, V005 is always the weight variable.

The PSU or cluster variable is V001 or V021. These are generally exactly the same--that is, they are duplicates. If in any doubt, use V001. There will typically be several hundred clusters. Your code used v021 as the stratification variable, and that would be a mistake.

The stratification variable is not always clearly identified, but in virtually all surveys the strata are the combinations of region (the first subnational unit) and urban/rural (always v025). Region and strata are usually given by v022, v023, or v024. (v101 is a duplicate of region.) Take a quick look at those three variables. There will typically be about twice as many strata as regions--often one less than twice as many, because the capital region may be completely urban. The number of strata will typically be in the range of 20 to 40.
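As an editorial illustration of the "quick look" described above, here is a minimal sketch in base R. The data frame `toy` and all its values are hypothetical, and the 1 = urban, 2 = rural coding of v025 is an assumption to check against the survey's documentation. It counts the distinct values of each candidate variable and compares them with the number of region by urban/rural combinations.

```r
# Hypothetical toy data: three regions, with region 1 (the capital) entirely urban.
toy <- data.frame(
  v022 = c(0, 0, 0, 0, 0, 0, 0, 0),   # not set in this made-up survey
  v023 = c(1, 1, 2, 3, 3, 4, 5, 5),   # candidate stratum code
  v024 = c(1, 1, 2, 2, 2, 3, 3, 3),   # region
  v025 = c(1, 1, 1, 2, 2, 1, 2, 2)    # assumed coding: 1 = urban, 2 = rural
)

# Number of distinct values in each candidate variable.
counts <- sapply(toy[, c("v022", "v023", "v024")], function(x) length(unique(x)))
print(counts)

# Number of region x urban/rural combinations that actually occur.
n_combos <- nrow(unique(toy[, c("v024", "v025")]))
print(n_combos)

# An empty cell in this table is a combination that is not a stratum.
print(table(region = toy$v024, residence = toy$v025))
```

In this toy case v023 has as many distinct values as there are region by residence combinations (one less than twice the number of regions), which is the pattern to look for in real data.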

If you have difficulty identifying the stratification variable for a specific survey, please contact DHS.

I hope this helps.

Bridgette-DHS

Re: How to Weight Data in R [message #706 is a reply to message #179] Thu, 22 August 2013 16:13
onetwo
Messages: 1
Registered: August 2013
Location: United States
Member
Hi Jewel

Unfortunately I can't answer your question, but since you seem to have used DHS data in R, I wanted to ask how you managed to read the data. I have no conventional statistics package on my computer, so I have to rely entirely on R. When I run the following code to read the Stata data, I get warning messages:

> mydata <- read.dta("c:/Births/CDBR50DT/CDBR50FL.dta")
There were 50 or more warnings (use warnings() to see the first 50)
> warnings ()
Warning messages:
1: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, ... :
duplicated levels in factors are deprecated
...

Thanks for helping if possible!

And if anybody else can help, I'll be glad!
Re: How to Weight Data in R [message #770 is a reply to message #179] Sat, 14 September 2013 11:02
Trevor-DHS
Messages: 626
Registered: January 2013
Senior Member
To follow up on how to weight the data in R and use the sample design, I use the following:

DHSdesign <- svydesign(id = mydata$v021, strata=mydata$v022, weights = mydata$v005/1000000, data=mydata)

Note that the id above is the cluster id (v021), not the caseid. The strata are given by v022, but as Tom Pullum noted in his reply (posted by Bridgette), you need to check which stratification to use. Sometimes v022 gives the stratification to use, sometimes v023, and sometimes neither is set and you have to create it from v024 (region) and v025 (urban/rural). See the sampling design appendix in the DHS final report for each survey for information on the stratification used in that survey.
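For the case where the stratification has to be built from v024 and v025, one possible approach is sketched below in base R. The toy values and the new column name `strata` are hypothetical, and whether region by urban/rural is the right stratification for a given survey still has to be confirmed from the final report.

```r
# Hypothetical toy data standing in for a DHS recode file.
mydata <- data.frame(
  v024 = c(1, 1, 2, 2, 3, 3),   # region
  v025 = c(1, 1, 1, 2, 1, 2)    # assumed coding: 1 = urban, 2 = rural
)

# interaction() builds one factor level per region x residence combination;
# drop = TRUE discards combinations that never occur in the data.
mydata$strata <- interaction(mydata$v024, mydata$v025, drop = TRUE)

print(table(mydata$strata))
print(nlevels(mydata$strata))
```

The resulting `strata` column could then be passed to svydesign in place of v022.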

Once you have set up the design, you can use it as follows:
svymean(~v201, DHSdesign)
cv(svymean(~v201, DHSdesign))
confint(svymean(~v201, DHSdesign))
svymean(~factor(v025), DHSdesign)
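A self-contained version of the same steps can be sketched on made-up data, assuming the survey package is installed. Everything in `toy` is hypothetical, and the sketch uses the formula interface (`~v021`) rather than `mydata$v021`, which is a stylistic choice and not something the thread prescribes.

```r
library(survey)

# Hypothetical toy data: 4 strata, 2 clusters per stratum, 3 women per cluster.
toy <- data.frame(
  v021 = rep(1:8, each = 3),                                   # cluster (PSU) id
  v022 = rep(1:4, each = 6),                                   # stratum id
  v005 = rep(c(500000, 1500000, 1000000, 2000000), each = 6),  # weight x 1,000,000
  v201 = c(0, 2, 3, 1, 4, 2, 5, 3, 2, 6, 1, 0,
           2, 2, 3, 4, 1, 0, 3, 5, 2, 1, 0, 4)                 # made-up outcome values
)
toy$wt <- toy$v005 / 1000000

# Formula interface: variables are looked up by name in `data`.
toy_design <- svydesign(ids = ~v021, strata = ~v022, weights = ~wt, data = toy)

est <- svymean(~v201, toy_design)
print(est)

# The point estimate equals the plain weighted mean; the clusters and strata
# affect the standard error, not the estimate itself.
print(weighted.mean(toy$v201, toy$wt))
```

Comparing the svymean estimate with weighted.mean is a quick sanity check that the weights were picked up as intended.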

Regards. Trevor
Re: How to Weight Data in R [message #771 is a reply to message #706] Sat, 14 September 2013 11:44
Trevor-DHS
Messages: 626
Registered: January 2013
Senior Member
Hi Onetwo,

The warnings you are getting are because R converts categorical variables with labels into factors, but some variables have the same label for more than one category. R appears to convert these into a single level in the factor it creates. Some of these cases are because there are blank labels defined, and so all categories get converted into a single level in the factor.

You can avoid this by not converting categorical variables into factors. You can do this by using:

mydata <- read.dta("c:/Births/CDBR50DT/CDBR50FL.dta", convert.factors=FALSE)

and then convert any variables that you want to use into factors when you need them. This will avoid the warning messages that are produced in the conversion process.
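Converting a single variable afterwards might look like the sketch below. The data frame here is a hypothetical stand-in for a file read with convert.factors=FALSE, the new column name `v025_f` is made up, and the 1 = urban, 2 = rural coding is an assumption that should be checked against the survey's recode documentation.

```r
# Toy stand-in for data read with convert.factors = FALSE: codes stay numeric.
mydata <- data.frame(v025 = c(1, 2, 2, 1, 2))

# Attach labels explicitly, using one unique label per code so that no
# duplicated-level warning can arise.
mydata$v025_f <- factor(mydata$v025, levels = c(1, 2), labels = c("urban", "rural"))

print(table(mydata$v025_f))
```

Keeping the numeric column alongside the factor makes it easy to go back to the original codes.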

Cheers. Trevor
Re: How to Weight Data in R [message #12663 is a reply to message #770] Wed, 28 June 2017 17:54
dhswes
Messages: 6
Registered: June 2017
Member
Hello:

I am trying to calculate the percentage of urban households that use a flush toilet connected to a sewer. I am using the survey package in R. Here is some sample code using the 2015 DHS for Zimbabwe:

zimsurvey <- svydesign(id = zim$hv001, strata=zim$hv023, weights = zim$hv005/1000000, data=zim)

svyby(~hv205, ~hv025, zimsurvey, svymean)

This gives me (partial results):

       hv025  hv205 flush to piped sewer system
urban  urban  0.75060493
rural  rural  0.01380543

The urban proportion is much higher than what is reported in the DHS STATcompiler (35.6%). I must be using the sampling weights incorrectly. Any insight?

Thanks,

Michael
Re: How to Weight Data in R [message #12773 is a reply to message #12663] Tue, 11 July 2017 18:52
Trevor-DHS
Messages: 626
Registered: January 2013
Senior Member
Your approach is correct, but you are comparing it to the wrong number. In the DHS report and in the STATcompiler, flush toilets connected to a sewer system are broken down into those that are not shared (35.6% in urban areas) and those that are shared (39.5%). The sum is 75.1%, which matches your estimate.
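The arithmetic behind this comparison can be checked in a few lines of R. The figures are the ones quoted in this thread; the variable names are only illustrative.

```r
# Published urban percentages for flush toilets connected to a sewer system.
not_shared <- 35.6
shared     <- 39.5
total      <- not_shared + shared

# Urban estimate from svyby, converted to a percentage with one decimal.
estimate <- round(100 * 0.75060493, 1)

print(total)
print(estimate)
print(isTRUE(all.equal(total, estimate)))
```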
Re: How to Weight Data in R [message #13105 is a reply to message #12663] Wed, 20 September 2017 18:51
dhswes
Messages: 6
Registered: June 2017
Member
Thanks, Trevor. I am just seeing this now. I will go back into my data, and let you know if I have any additional questions using R with DHS data.

Michael