The DHS Program User Forum: Weighting data » Selecting sample within one standard deviation in R

Selecting sample within one standard deviation in R [message #24064]

Tue, 15 February 2022 15:00

berhardt93

Hi,

I'm looking at the Nigeria 2018 DHS. I created a variable "tot_encounters" that calculates the number of sexual encounters reported by an individual in the past 12 months by adding the values from their most recent, second most recent, and third most recent partners. I also created the weighting variable "weight".

I found the weighted mean of the variable:

weighted.mean(yesNUIS$tot_encounters, yesNUIS$weight)

Then I found the standard deviation:

library(Hmisc)  # provides wtd.var()
weighted_var <- wtd.var(yesNUIS$tot_encounters, yesNUIS$weight)
weighted_sd <- sqrt(weighted_var)

Weighted mean = 27.78
Standard deviation = 25.57

Now I want to select all observations that fall within one standard deviation of the weighted mean (2.21 to 53.35). When I tried this, the filtered sample was 80% of the original sample, not the 68% I expected to fall within one standard deviation of the mean:

library(magrittr)  # provides the %<>% assignment pipe
sdNUIS <- yesNUIS
sdNUIS %<>%
  dplyr::filter(tot_encounters > 2.2057 & tot_encounters < 53.3527)

How would I make sure that this filter only includes the 68% within one standard deviation of the weighted mean?

Thanks!

Re: Selecting sample within one standard deviation in R [message #24066 is a reply to message #24064]

Tue, 15 February 2022 16:52

Bridgette-DHS

Following is a response from DHS Research & Data Analysis Director, Tom Pullum:

You have an extremely skewed distribution. The "68%" rule works for normally distributed variables, and the normal approximation doesn't work for your variable. I can think of two options. One would be to take the log of the frequency, which will have a distribution that is more nearly normal, but there's the problem that you can't take the log of 0. Another option would be to calculate the percentiles of the distribution. If you identify the 25th and 75th percentiles, then you have the boundaries for the middle 50%. Or identify the 16th and 84th percentiles, which enclose the middle 68%.
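
As a rough sketch of the percentile option, the 16th and 84th weighted percentiles could be computed and used as the filter bounds. The helper wtd_quantile() below is a hypothetical base-R function written for this illustration, not part of any package (Hmisc::wtd.quantile() is an existing alternative), and the data frame is a simulated stand-in for yesNUIS with made-up counts and weights, not the Nigeria 2018 data.

```r
# Hypothetical helper: weighted quantiles read off the weighted empirical CDF.
wtd_quantile <- function(x, w, probs) {
  keep <- !is.na(x) & !is.na(w)
  x <- x[keep]
  w <- w[keep]
  ord <- order(x)
  x <- x[ord]
  cum_w <- cumsum(w[ord]) / sum(w)
  # Smallest x whose cumulative weight share reaches each probability
  sapply(probs, function(p) x[which(cum_w >= p)[1]])
}

# Simulated stand-in for yesNUIS: right-skewed counts and invented weights
set.seed(1)
yesNUIS <- data.frame(
  tot_encounters = rgeom(5000, prob = 0.035),
  weight = runif(5000, 0.5, 1.5)
)

# Bounds enclosing roughly the middle 68% of the weighted distribution
bounds <- wtd_quantile(yesNUIS$tot_encounters, yesNUIS$weight, c(0.16, 0.84))

# Keep observations between the two percentiles (inclusive)
inside <- yesNUIS$tot_encounters >= bounds[1] &
  yesNUIS$tot_encounters <= bounds[2]
sdNUIS <- yesNUIS[inside, ]

# Weighted share of the sample that was kept
share <- sum(yesNUIS$weight[inside]) / sum(yesNUIS$weight)
```

Because the variable is a count, many respondents share the same value at each boundary, so the kept share will usually be somewhat above 68% rather than exactly 68%. Comparing the sum of weights kept, rather than the number of rows, keeps the check consistent with the weighted percentiles.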
