I am using birth-record data from the appended Bangladesh DHS 1999/2000, 2004, 2007, 2011 and 2014 surveys to study the determinants of infant/neonatal mortality. I have read some of the discussion on the user forum about weighting data and setting up "svyset" commands, but I still have two questions regarding my specific context:

(1) How should I set new "strata" if stratification differs across years? The two most recent surveys, Bangladesh DHS 2011 and 2014, have a different "strata design" from previous years. Here are two screenshots from the BDHS reports. Can I still set new strata as "egen strata = group(survey v024 v025)", where survey indicates the different rounds? Also, "v022" in the early years is very different from "v022" in BDHS 2011 and 2014.

[Screenshot: BDHS 2011]

[Screenshot: BDHS 2007]

(2) How should I re-weight the appended data across survey rounds if I am only interested in birth records from the two years before the interview? Is the following approach appropriate for my study?

generate weight = v005/1000000

egen clusters=group(survey v021)

egen strata = group(survey v024 v025)

svyset clusters [pweight=weight], strata(strata)

Any suggestions or comments will be highly appreciated. Thank you!

regards,

Qiao

Even if you do not collapse the surveys, you still have to deal with the variation in sample sizes. Surveys with larger samples will tend to dominate. I prefer to revise v005, multiplying by a survey-specific factor. If, say, your combined data file with k surveys has N cases, you would revise the weights so that the weighted number of cases for each survey is the same, N/k. However, even that approach is vulnerable to criticism. Stata code to do this is posted.
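The rescaling described above can be sketched as follows. This is a minimal illustration in Python rather than the Stata code mentioned, and the function name and toy records are hypothetical. It multiplies each v005 by a survey-specific factor so that every survey ends up with the same weighted total, N/k.

```python
from collections import defaultdict

def equalize_survey_weights(records):
    """Rescale weights so each survey has the same weighted total, N/k.

    records: list of (survey_id, v005) pairs (illustrative layout,
    not a DHS file format).
    """
    n_cases = len(records)                    # N: cases in the combined file
    totals = defaultdict(float)
    for survey, v005 in records:
        totals[survey] += v005                # sum of v005 within each survey
    target = n_cases / len(totals)            # N/k
    # Survey-specific factor: target / (sum of v005 in that survey)
    return [v005 * target / totals[survey] for survey, v005 in records]

# Hypothetical toy data: survey "A" has 4 cases, survey "B" has 2.
records = [("A", 1200000), ("A", 800000), ("A", 1000000), ("A", 1000000),
           ("B", 500000), ("B", 1500000)]
new_weights = equalize_survey_weights(records)
```

With these toy records N = 6 and k = 2, so the rescaled weights sum to 3 within each survey even though survey "A" has twice as many cases.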

Here at DHS, we usually do not pool surveys. When we combine successive surveys from one country into a single data file, that's usually just to make it easier to describe trends. There have been a few times when we pooled surveys because that was the only way to get enough cases for some rare outcome. The main reason for not pooling is that the reference population, of which the data are supposed to be representative, is too difficult to define. But other users may have a different perspective.


I should have added that while the dependent variable is a calculated mean from the surveys, most (but not all) of the regressors are statistics (or variables created based on these) from other sources, provided at the country-year level.

The reason why I am including all the surveys, also multiple surveys for a country, is because I am also interested in exploring the time dimension.

You are totally right that by combining countries like India and Maldives (and weighting by sample size), then Maldives would have a negligible impact. So there would be no point in collapsing the data by survey. It would probably be more appropriate to simply estimate regressions directly from the data (without collapsing by country-year) even if I am using statistics from other sources at the country-year level.

Please let me know if you see anything wrong with this approach based on this new info.

Thanks so much.

There have been several related postings on the forum. I agree that this is an instance in which aweights are appropriate. The options are (a) weight by sample size, (b) weight by population size, (c) weight equally. Your choice for a regression should be the same as your choice for an overall mean. The first question is this: why would you want to calculate an overall mean or regression from all these surveys? I can't think of any good justification for pooling 185 data files. Is there a meaningful population parameter that you are trying to estimate?

Many countries have had multiple surveys. Does it make sense to include one country once and another country six times, say? If you reduce to one survey per country, which survey would you select? Would you combine a 1990 survey from one country with a 2010 survey from another country?

If you weight by population size, and combine India and Maldives, then the impact of Maldives would be negligible. What's the value of pooling them?
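To make the three weighting options concrete, here is a small sketch in Python. The two surveys and all of their numbers are hypothetical, chosen only to show how the overall mean moves with the choice of weight.

```python
def weighted_mean(values, weights):
    """Weighted mean of survey-level values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical survey-level summaries (illustrative numbers only).
surveys = [
    # (label, survey mean, sample size, population)
    ("large country", 0.40, 100_000, 1_000_000_000),
    ("small country", 0.10, 10_000, 500_000),
]
means = [s[1] for s in surveys]

by_sample = weighted_mean(means, [s[2] for s in surveys])      # option (a)
by_population = weighted_mean(means, [s[3] for s in surveys])  # option (b)
equal = weighted_mean(means, [1] * len(surveys))               # option (c)
```

Under population weighting the result is almost identical to the large country's own mean, which is the point made above about pooling a very large and a very small country.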

I hope other forum users will add their viewpoints.


I have a related question but I am not sure your answer below applies to my case.

In brief, I am using all DHSs and created a country-year dataset (1 observation for each survey, for a total of 185 observations). It seems to me that, when running regressions, one should use weights that reflect the size of each country. For example India should weigh more in the regression than smaller countries, say Maldives.

What is the correct way to do regression analysis in this case?

I am thinking of running:

reg y x1 x2 [aweight=n]

I would use analytic weights because I am using means calculated from each survey (all of these means were calculated using the DHS-provided sampling weights). Here n is the number of observations in the survey (or perhaps one should use population instead).

Your help is as always greatly appreciated!

Thanks

2) Weights in DHS are normalized to the unweighted number, so that the total weighted number of households, women or men is the same as the unweighted number of households, women or men when the appropriate weight is applied (hv005, v005, or mv005 respectively). DHS does not estimate population counts, but rather proportions of the population. If you want to calculate population counts, you can create a factor to inflate the sample: multiply the weights by the total population divided by the total sample of de jure household members.
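As a rough sketch of that inflation step (in Python, with hypothetical function names and made-up totals), the factor is simply the population total divided by the de jure sample total, applied to the normalized weight:

```python
def inflation_factor(total_population, de_jure_sample_members):
    """Factor that scales normalized weights up to population counts."""
    return total_population / de_jure_sample_members

def inflate(normalized_weights, factor):
    """Apply the factor to normalized weights (e.g. hv005/1000000)."""
    return [w * factor for w in normalized_weights]

# Hypothetical totals, for illustration only.
factor = inflation_factor(total_population=1_000_000,
                          de_jure_sample_members=5_000)

# Three normalized weights that sum to the unweighted count of 3.
normalized = [0.8, 1.2, 1.0]
population_weights = inflate(normalized, factor)
```

The normalized weights sum to the number of cases, while the inflated weights sum to the number of people those cases stand for. Proportions are the same under either set of weights.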

1. I am using merged data from the India 2015-2016 survey. I have successfully combined the HIV, men, women and household members tables. I thought of using mv005 and v005 as weights and simply combining them. Should I use hv005 as my weight instead? I wasn't sure which one to use.

2. I am using SAS, and after watching the video provided by DHS on how to set up the code, I attempted to weight the data. The problem was that my numbers were still about the same as the original numbers. The weights were not giving the whole population of India. The code I was using is below. I wasn't sure what I was missing.

proc surveyfreq data=women;

tables v025;

weight wgt; /* wgt = v005/1000000 */

cluster v021;

strata v022;

run;

Thanks so much.


In general, you need to check the sampling design appendix of the final report to identify the stratification variables and then construct the strata variable accordingly.

Regarding the 1988 EDHS survey, see page 174 in the final report: "All list of PSUs allocated according to governorate and residential sector (urban/rural)". This means that, as in the other surveys, governorates by urban/rural were used as design strata for the 1988 EDHS.
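Constructing such a strata variable amounts to numbering each distinct combination of the design variables, much as Stata's "egen ... = group(...)" does. A minimal sketch in Python, with hypothetical field names and toy rows:

```python
def group_ids(rows, keys):
    """Number each distinct combination of `keys` 1, 2, 3, ... in sorted order."""
    combos = sorted({tuple(row[k] for k in keys) for row in rows})
    lookup = {combo: i for i, combo in enumerate(combos, start=1)}
    return [lookup[tuple(row[k] for k in keys)] for row in rows]

# Hypothetical toy rows; "governorate" and "urban_rural" are illustrative names.
rows = [
    {"governorate": 1, "urban_rural": 1},
    {"governorate": 1, "urban_rural": 2},
    {"governorate": 2, "urban_rural": 1},
    {"governorate": 1, "urban_rural": 1},
]
strata = group_ids(rows, ["governorate", "urban_rural"])
```

When several surveys are appended, adding a survey identifier to the list of keys keeps strata from different rounds distinct.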


Yes, that's all there is to it.

Sex ratios are calculated using the BR file (births). The following variables are used in the calculation:

• B2 - Year of birth

• B4 - Sex

• B5 - Living status

After counting the number of girls and boys, the ratio is calculated by dividing the number of girls by the number of boys and multiplying by 1000.
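The calculation can be sketched as follows in Python. The function name and toy births are hypothetical, and the sketch assumes the usual recode coding of B4 (1 = male, 2 = female), which should be checked against the file's value labels. The description above does not say how B5 enters the count, so it is left out here, and the counts are unweighted.

```python
def sex_ratio(births, first_year, last_year):
    """Girls per 1,000 boys among births with first_year <= b2 <= last_year.

    Assumes b4 is coded 1 = male, 2 = female (check the value labels).
    """
    in_window = [b for b in births if first_year <= b["b2"] <= last_year]
    boys = sum(1 for b in in_window if b["b4"] == 1)
    girls = sum(1 for b in in_window if b["b4"] == 2)
    if boys == 0:
        raise ValueError("no male births in the selected years")
    return 1000 * girls / boys

# Hypothetical toy births: four boys and three girls in 2014, one girl in 2009.
births = (
    [{"b2": 2014, "b4": 1}] * 4
    + [{"b2": 2014, "b4": 2}] * 3
    + [{"b2": 2009, "b4": 2}]
)
ratio = sex_ratio(births, 2012, 2015)
```

The 2009 birth falls outside the selected years, so the toy ratio is based on three girls and four boys.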
