Setting up a bootstrap that matches the sample design would be complicated. It is easier to get the estimates from a model that uses svyset, which you are already using. I will paste below the lines to do this. Just as an illustration, I use the Mozambique 2011 data, with subpopulation hv024=1 (Niassa). The outcome y is 1 if the source of drinking water is an unprotected well (hv201=32), which is the largest category. The model has no covariates. The lines show how to extract the proportion of households with y=1 in Niassa, as well as the lower and upper bounds of a 95% CI for that proportion. I show how to do this with either the logit or the logistic command. You also get the standard error on the logit or odds scale, but I would not recommend using a se on the scale of a proportion (or on the odds scale). CI yes, se no. Hope this helps.

* Open HR file, cases are households

use "C:\Users\26216\ICF\Analysis - Shared Resources\Data\DHSdata\MZHR62FL.DTA" , clear

* Specify outcome and subpopulation

gen y=0

replace y=1 if hv201==32

gen Niassa=0

replace Niassa=1 if hv024==1

* Prepare svyset

svyset hv001 [pweight=hv005], strata(hv023) singleunit(centered)

* Logit model

svy, subpop(Niassa): logit y

matrix T=r(table)

matrix list T

* Extract P, L, and U as saved results

* P, L, and U are the point estimate and the lower and upper bounds

* of a 95% confidence interval for the proportion of households in

* Niassa whose main source of drinking water is an unprotected well.

scalar b=T[1,1]

scalar P=exp(b)/(1+exp(b))

scalar b=T[5,1]

scalar L=exp(b)/(1+exp(b))

scalar b=T[6,1]

scalar U=exp(b)/(1+exp(b))

scalar list P L U
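As a variation on the extraction above, the same back-transformation can be written with Stata's invlogit() function, looking up the rows of r(table) by name instead of by position. This is only a sketch: it assumes the svyset and the y and Niassa variables defined earlier, and it relies on the rows of r(table) being named b, ll, and ul, which "matrix list T" will confirm.

```stata
* Re-run the logit model so that r(table) is current
svy, subpop(Niassa): logit y
matrix T = r(table)

* invlogit(x) is exp(x)/(1+exp(x)); rownumb() finds a row by its name
scalar P = invlogit(T[rownumb(T,"b"), 1])
scalar L = invlogit(T[rownumb(T,"ll"), 1])
scalar U = invlogit(T[rownumb(T,"ul"), 1])

scalar list P L U
```

Looking rows up by name avoids relying on the point estimate and the CI bounds sitting in rows 1, 5, and 6.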

* Equivalent using logistic

svy, subpop(Niassa): logistic y

matrix T=r(table)

matrix list T

scalar odds=T[1,1]

scalar P=odds/(1+odds)

scalar odds=T[5,1]

scalar L=odds/(1+odds)

scalar odds=T[6,1]

scalar U=odds/(1+odds)

scalar list P L U]]>

The methodology for sampling clusters from the sampling frame is consistent across surveys, but successive surveys in the same country have different samples of clusters, households, and respondents. There is no continuity in the actual sample.

]]>

v113 is a variable name, not a column. The variable has a value label. I don't know which Malawi survey or which statistical package you are using, but in Stata you would enter "describe v113" to get the name of the value label. It is V113 in the 2015-16 survey. You can then list the value label with "label list V113". I will paste the label for the 2015-16 Malawi survey below. The first digit of the category code is a general type of source (for example, "1" for "piped" and "3" for "dug well") and the second digit is a sub-category. The categories vary from survey to survey, depending on the types that are prevalent for the date and country of the survey. The WHO classification into "improved" and "unimproved" has changed over time. There is a similar construction for v116, type of toilet facility. Hope this answers your question.
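In Stata, those steps might look like the sketch below. It assumes the IR file is already open. The name source_type is made up for the illustration, and the cutoff of 96 is my reading of the 2015-16 label, in which codes 96 and 97 do not follow the two-digit scheme.

```stata
* Show the storage type, variable label, and value label name of v113
describe v113

* Retrieve the value label name and list its categories
local lbl : value label v113
label list `lbl'

* First digit of the two-digit code = general type of source
* (codes 96 and above, and missing values, are left as missing)
generate source_type = floor(v113/10) if v113 < 96
tabulate source_type
```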

. label list V113

V113:

10 piped water

11 piped into dwelling

12 piped to yard/plot

13 piped to neighbor

14 public tap/standpipe

20 tube well water

21 tube well or borehole

30 dug well (open/protected)

31 protected well

32 unprotected well

40 surface from spring

41 protected spring

42 unprotected spring

43 river/dam/lake/ponds/stream/canal/irrigation channel

51 rainwater

61 tanker truck

62 cart with small tank

71 bottled water

96 other

97 not a dejure resident]]>

Project: I am combining DHS WaSH indicators from the 2011 Mozambique DHS with data from a cluster-randomised trial in Mozambique assessing the performance of various treatment strategies on schistosomiasis prevalence. I am attempting to model individual-level infection status after 5 years of mass drug administration to see whether the effect of the treatment strategy (villages were randomised to different treatment strategies) is modified by WaSH indicators at the district level, specifically use of an improved water or sanitation source.

I will be using multi-level logistic regression to capture the clustering of the data i.e., (1) individuals in (2) villages (the treatment-level) in (3) districts.

The cluster-randomised trial was conducted in one province in Mozambique, so I am only working with 8 districts and attempting to calculate a district-level indicator e.g., percentage of households in that district using an improved water source. I have used GPS data to locate the clusters in corresponding districts and have followed the suggested methodology (the complex sample design weighting) to generate estimates. However, as has been extensively discussed previously, the SEs are too large to be usable.

I propose the following methodology to resolve this and would appreciate some input:

- Use a bootstrap (I saw a link to a wild bootstrap mentioned in a previous post?) to calculate more precise standard errors - how would I go about using the sampling weights here?

- Use weights within the multi-level logistic regression model to account for the uncertainty around the district-level estimates.

I understand that using DHS data in this way to generate district-level indicators is not ideal. However, this project is more for hypothesis generation and identifying areas for future research.

Do you have any comments on what I have proposed, or is there anything else I should be thinking about in terms of using this data and conducting this analysis in the best way?

I appreciate any feedback!

Kind regards!

]]>

HV002: Household number is the number identifying the household within the cluster or sample point. In some cases, this variable may be the combination of dwelling number and household number within dwelling. In these cases, the dwelling number is included as country-specific variable.

HV004: Ultimate area unit is a number assigned to each sample point to identify the ultimate area units used in the collection of data. This variable is usually the same as the cluster number, but may be a sequentially numbered variable for samples with a more complicated structure.

mean? Do they refer to geographic regions?]]>

I'm currently looking at DHS surveys in Malawi. I've loaded the data but am having some trouble understanding the content. For example, column v113, according to the DHS Recode Manual, corresponds to the type of source of drinking water. The values in that column for each respondent are 21, 43, 14, 97, and so on. I'm trying to understand the meaning of these numbers. The DHS Recode Manual says they're country-specific, but doesn't say where I can find their meanings.

Thanks!]]>

In the HR file, which you have opened, the variables begin with "hv". In that file, source of water is hv201. The HR file has one record per household. The multiple copies of hv113 that you found refer to different members of the household (hv113 is survival status of the father for children age 0-17). The PR file has the same data but is organized with one record per household member.

The variable v113 begins with "v", not "hv", and is in the IR file. The IR file has one record per woman age 15-49. v113 is the same as hv201 for all women in a given household.

Some variables that appear in the household files and the individual files have the same number, and differ only in terms of the prefix v or hv. However, most do not.

DHS data are not easy to use with Excel. If you can possibly switch to a package such as Stata or even SPSS, your analysis may be easier.
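If you do switch to Stata, a first look at the source of drinking water could be sketched as follows. The file name is hypothetical (I am assuming the Stata version of the HR file you downloaded), and wt is just a name chosen for the illustration.

```stata
* Hypothetical file name: Stata version of the Kenya 2022 HR file
use "KEHR8BFL.DTA", clear

* One record per household; source of drinking water is hv201
describe hv201
label list `: value label hv201'

* hv005 is the household weight, stored multiplied by 1,000,000
generate wt = hv005/1000000

* Weighted distribution of households by source of drinking water
tabulate hv201 [iweight=wt]
```

The value label printed by "label list" gives the meaning of each code of hv201 for this survey.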

Thanks for using DHS data. Let us know if you have other questions.]]>

I hope this message finds you well.

I am working on a project in which I am going to create a comprehensive map of social vulnerability to climate change hazards in Kenya.

In pursuit of this goal, I have chosen to use the DHS-2022 dataset for Kenya. Having recently downloaded the Household data file (KEHR8BSD.zip) from the DHS website, I converted it to CSV format for easier integration into my research workflow. Unfortunately, I have had trouble identifying the variable associated with the 'major source of drinking water.' Although I have looked at the DHS guidelines, such as the Microdata Library for Kenya and the USAID Guide to DHS Statistics, I have had difficulty decoding the information contained in the Excel sheet. The Microdata Library indicates that the code 'V113' corresponds to the major source of drinking water. Yet, upon examining the Excel file, I discovered 24 columns labeled HV113_01, HV113_02, and so forth. The documentation does not explain the significance of these extension numbers, leaving me uncertain about the distinct sources represented by HV113_01 and HV113_02.

Moreover, I have encountered another code, 'V87 hv201 Source of drinking water', on page 18 of the Microdata Library document, and I am struggling to understand it. I need help to figure this out. I am not sure if I am using the right dataset, since I am working with DHS data for the very first time. My apologies for bothering you, but I genuinely appreciate any guidance or insights you can provide.

Attached is the Excel sheet containing a small part of the downloaded data for your convenience.

Thank you for your time and consideration.

Kind regards

]]>

You can actually find MANY examples in DHS reports of what you describe for the table on page 445. Such discrepancies result from a combination of two things. First, the frequencies in almost all tables are weighted and therefore have digits to the right of the decimal point. Second, to simplify the presentation, frequencies are rounded to the nearest integer AFTER all operations such as addition.

Say, for example, that A=10.4 and B=10.3, so A+B=20.7. These 3 numbers will round to 10, 10, and 21, respectively. The discrepancy could be described as rounding error, but I don't like that term, because there's no error, just a superficial inconsistency.
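The same arithmetic can be checked in Stata with the round() function. This is only an illustration using the made-up values A and B from the example above, not numbers from any DHS table.

```stata
* Weighted frequencies have decimals; each is rounded only for display
scalar A = 10.4
scalar B = 10.3

* Rounded parts versus the rounded total
display round(A) " + " round(B) " vs " round(A + B)
* displays: 10 + 10 vs 21
```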

]]>