What is the DHS position on the use of cluster-level data? [message #9431]
Sun, 27 March 2016 12:27
ld190
Messages: 5 Registered: March 2016 Location: Leicester, UK
Member
Dear All,
Apologies in advance for the long post, and thank you for reading it. I've tried to be thorough in order to get past old material and reach what I see as the key ambiguities remaining in the question of how, and whether, cluster-level data from the DHS should be analyzed.
I have come to a point in my research where I am starting to use DHS data in earnest. I have downloaded and conducted some preliminary analysis of datasets from Senegal, and some of the things I would like to do involve aggregating from individual to "cluster-level" characteristics: for example, calculating that X percent of women aged 15-49 sampled in cluster Y have attribute A and Z percent have attribute B. I am interested in clusters because my research is about residential communities - small areas where people are co-resident and have some chance of knowing or being influenced by each other. I am able to do this in SPSS by aggregating cases by cluster. This analysis produces some very interesting (and theoretically plausible) results. However, I am concerned about the warnings I've read against the disaggregation of DHS data.
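(For concreteness, what I am doing is roughly equivalent to the following sketch - written in Stata purely as an illustration, since I am actually using SPSS's aggregate facility; v001 and v005 are the cluster number and sample weight in the women's recode, while the filename and the attribute indicators are hypothetical placeholders.)

use "senegal_ir.dta", clear            // women's (IR) recode - placeholder filename
gen wt = v005/1000000                  // v005 is the sample weight stored * 1,000,000
gen has_A = (attribute_A == 1)         // hypothetical 0/1 indicator for attribute A
gen has_B = (attribute_B == 1)         // hypothetical 0/1 indicator for attribute B
bysort v001: gen n_women = _N          // unweighted number of women sampled in the cluster
collapse (mean) prop_A = has_A prop_B = has_B n_women [pweight=wt], by(v001)
* prop_A and prop_B are cluster-level proportions; multiply by 100 for percentages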
Despite attempting to work through the DHS's very helpful store of literature, and this forum, I remain unsure about the DHS's position on the direct use of cluster data. On the one hand, the official guides, and the very enjoyable YouTube tutorials, seem to me to emphasise that the surveys are designed to be representative at the regional and national levels only, meaning that further disaggregation is not possible. However, I'm not sure about the extent to which this applies to my research. I am not interested in estimating the prevalence of attribute A (which is very common) for any area except the cluster (the Enumeration Area) itself - not the surrounding administrative area or some other geographic area. I'm just interested in the cluster of households from which the chosen households were randomly sampled. At this level of analysis - around 20 households sampled at random from a pool of, on average, 110 households - is the data so unrepresentative as to be useless? Do the observed attributes of the randomly sampled households (20) tell us nothing reliable about the attributes of the overall population (110)? And what if we average across a large number of clusters to produce a distribution of values?
The guidance on this issue on the forum appears to me to provide a number of alternative possible answers and issues to consider.
One user on the forum seems to suggest that the use of cluster-level data is "noisy" (error-prone) but basically OK as long as this is taken into account, and that it is common to use this level of data for certain purposes:
http://userforum.dhsprogram.com/index.php?t=msg&goto=9054&S=41b1f8e9c6ffff1e5ed1b91414054772&srch=aggregating+clusters#msg_9054
However, DHS staff member Trevor, in another post, suggests that the use of cluster-level estimates is "impossible" because the sample sizes are too small - although he is referring to calculating child mortality rates, which concern a rare event and so might require an especially large sample size.
http://userforum.dhsprogram.com/index.php?t=msg&goto=8524&S=41b1f8e9c6ffff1e5ed1b91414054772&srch=aggregating+clusters#msg_8524
On another post, Trevor says this:
Quote:You can and should still use hv005 as the sample weight, but doing your analysis with smaller geographic units is potentially problematic. The sample is designed to be representative at the region level, but not at the level of smaller units. As you disaggregate the data to smaller units the sample is less and less likely to be representative. The sample is also designed to provide a certain level of accuracy at the region level, and again as you disaggregate to smaller units the accuracy of those estimates gets worse and worse and the confidence intervals around the estimates quickly become very large and unreliable.
I found this advice slightly confusing. Presumably, going from drawing inferences about the population at an officially representative level (the region) to an intermediate level (like a small administrative unit) might reduce the representativeness of the data. This is because the size of the sample (N households) is getting smaller relative to the size of the target population (a whole administrative district). However, presumably at some point this trend will reverse? If we only tried to draw inferences about the Enumeration Area from which the sample is drawn, for example, then surely this is more representative than trying to use the cluster sample to draw inferences about, say, a larger population within 5 km² of the Enumeration Area?
ClaraB, also a DHS staff member, offers this advice on the interpretation of cluster-samples:
http://userforum.dhsprogram.com/index.php?t=msg&goto=8315&S=41b1f8e9c6ffff1e5ed1b91414054772&srch=aggregating+clusters#msg_8315
Quote:[inference about the] district location of the sampled clusters using a GIS software and the GPS dataset these data would not be statistically representative.
However, I'm unclear how to interpret this advice. Is the warning given because the user is trying to draw inferences about the district level (larger than the EA) from a single sample cluster?
Finally, a forum user posted this advice about the use of cluster-level measures:
Quote:cluster-level measurements are based on too few observations to be meaningful in and of themselves - as you say, there are wildly under-powered. A couple of things you could do: a) by averaging over many clusters, you can still get good estimates of community level variables, but each individual cluster-level point-estimate would be very, very noisy. But they may still mostly "agree" in some sense; b) so if in your hierarchical model you allow each cluster an unconstrained cluster-specific effect (like treating each cluster as a mini-experiment), you could look at those individual point-estimates on a scatter plot (say Beta across some variable you think would affect Beta); c) and then you could start restricting those Betas to have some particular distribution (a random slope model) and see how that changes your overall point estimate as you make your priors on the distribution of Beta more/less informative. I think this makes sense as a kind of model-checking or informal/additional inference procedure. A leave-one-out cross-validation approach might make sense too, depending on how you end up thinking about each of these within-cluster estimates.
This user's scatter-plot suggestion is very close to what I have done in my own research.
In addition to searching the DHS forums, I've discovered that some published academic work has engaged with data at the cluster level. Storey and Kaggwa, from the Department of Population at Johns Hopkins University, have used cluster-level data from the 1995, 2000 and 2005 Egypt Demographic and Health Surveys (EDHS).
This is a quote from the abstract for their paper:
Quote:Norms are defined at the cluster level, which serves as our community-level unit of analysis
The official site for the article is here:
http://muse.jhu.edu/journals/population_review/v048/48.1.storey.html
There has also been some research that actually estimates the error introduced by using cluster-level measures with DHS data, conducted by Øystein Kravdal, Professor of Demography at the University of Oslo.
Here is a quote from the abstract for his paper:
Quote:For example, researchers may consider including in their models the average education within the sample (cluster) of approximately 25 women interviewed in each primary sampling unit (PSU). However, this is only a proxy for the theoretically more interesting average among all women in the PSU, and, in principle, the estimated effect of the sample mean may differ markedly from the effect of the latter variable. Fortunately, simulation experiments show that the bias actually is fairly small - less than 14% - when education effects on first birth timing are estimated from DHS surveys in sub-Saharan Africa. If other data are used, or if the focus is turned to other independent variables than education, the bias may, of course, be very different. In some situations, it may be even smaller; in others, it may be unacceptably large. That depends on the size of the clusters, and on how the independent variables are distributed within and across communities. Some general advice is provided.
The paper is published in a peer-reviewed, open-access journal and is available to read here:
http://www.demographic-research.org/volumes/vol15/1/
Both of these papers seem favorable to the use of cluster-level DHS data.
I wonder if the proof of the pudding is in the eating? The results from my analysis of community-level data are theoretically plausible: there is a clear pattern (agreement) in a scatter plot showing the relationship between two measures (the frequency of attributes A and B in each cluster) across all the clusters, and this pattern is consistent across the Senegalese DHS surveys of 2005, 2010 and 2014. Presumably, if the level of noise were so great that no meaningful information could be gained from cluster-level analysis, a clear pattern of results like this would be quite surprising?
Thank you again for reading through this long question. I am by no means certain about any of this - I am new to this area and this kind of analysis - but I wanted to provide a detailed description of the problem I am trying to grapple with.
If anyone can offer any further thoughts, clarification or advice on the use of cluster-level analysis with DHS data, I would be very grateful to hear it. Also, if there is some key DHS document (or other publication) that I have missed which elaborates on this issue, I would be grateful to receive a recommendation.
Many thanks in advance for your response.
Laurence.
P.S. Thanks to user-rhs for the help in improving the formatting of this post.
[Updated on: Sun, 27 March 2016 15:39]
Re: What is the DHS position on the use of cluster-level data? [message #9435 is a reply to message #9431]
Sun, 27 March 2016 16:47
user-rhs
Messages: 132 Registered: December 2013
Senior Member
The key issue is that some of the clusters may be too small for the analysis to be meaningful. For example, how would you interpret the average from 3 households? You weren't clear in your original question about how you were using the aggregates, but since you cited the Storey & Kaggwa and Kravdal papers and quoted Reduced-For(u)m's responses, it looks like you want to enter them as covariates in a regression model.
There are no hard and fast rules about what to do; the determination should be made on a case-by-case basis and depends on: 1.) the overall sample size, 2.) the number of clusters, 3.) the number of observations in each cluster. If you have many clusters with a sufficient number of observations, your results will be less biased than if you have many clusters with a small number of observations. For example, I would be fairly comfortable entering cluster-level averages from the Indonesia 2012 DHS into a regression model, because 1.) it's huge (>45,000 observations), 2.) it has a lot of clusters (1,832), and 3.) the clusters are sufficiently "large" (around 90% of the clusters have 20 or more people in them, and only about 80 people live in clusters with <10 observations each). But I would have less confidence doing it with a dataset of 5,000 observations and 1,200 clusters where the average cluster size is 10 (I worked with a dataset like that once, and I ended up aggregating up to the district level to get respectable sizes).
Second, cluster-level analysis can still be useful, depending on the level of inference. I think what the DHS team cautions against is making population-level inference based on the clusters, because the survey is not designed for that level of disaggregation. In the worst-case scenario you can make a case for valid inference to the sample, or at least minimize the population-level implications of your findings.
Third, even the experts are still in disagreement about this, which works to your advantage. You can take one school of thought and justify it with citations from the peer-reviewed literature. At the end of the day, science is about weighing different opinions and evidence and defending your choices.
A good first step is determining the number and size of your clusters. If they are sufficiently "large" and you can make a case for it that's theoretically/empirically/clinically plausible, then why not? Cluster aggregates are less than ideal, but if we had better measures than cluster-level aggregates for whatever construct we were trying to operationalize, surely we would have used them instead of these proxies derived from the data, right? I would suggest fitting the model first with just the individual/household-level variables, and then entering the cluster-level aggregates separately to see how things change. Reduced-For(u)m has some good advice, which you have quoted above.
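For example, a quick way to look at the number and size of your clusters in Stata might be something like this (v001 is the cluster number in the women's recode; the filename is just a placeholder):

use "senegal_ir.dta", clear         // your IR (women's) recode file - placeholder name
bysort v001: gen cluster_n = _N     // number of respondents per cluster
egen tag = tag(v001)                // flag one row per cluster
summarize cluster_n if tag, detail  // distribution of cluster sample sizes
count if tag & cluster_n < 20       // how many clusters have fewer than ~20 respondents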
NB: I'm not a DHS affiliate, so I can't offer the official DHS position.
[Updated on: Sun, 27 March 2016 16:50]
Re: What is the DHS position on the use of cluster-level data? [message #9483 is a reply to message #9435]
Fri, 01 April 2016 06:47
ld190
Messages: 5 Registered: March 2016 Location: Leicester, UK
Member
Dear user-rhs,
First, many thanks for your detailed and very helpful response. I really appreciate the time taken.
To address the ambiguity you noticed in my post: as it happens, I'm not using regression models. I am using an agent-based model, which is a kind of computer simulation. In this case the cluster-level measures I am interested in are for the purposes of model calibration and validation (rather than "fitting" as in regression models). Cluster-level measure X will be used to "calibrate" the simulation, meaning it will be used to set the value of a key parameter. Measure Y will then be used to validate (i.e. test) the simulation, by observing whether the simulated relationship between input variable X and the output of the simulation is the same as the relationship that exists in the real data between measures X and Y. Hence, scatter plots showing the relationship between simulation input X and the subsequent simulated output (which should correspond to Y) can be overlaid with the real cluster-level X and Y values, as an indication of the match between the simulation and the real clusters. I am interested in cluster-level measures because the model is a flexible model of social dynamics within a cluster (a village-sized residential community with a social network, etc.).
Having said that, the advice about regression is very useful for future reference - thank you.
Having thought about what you've written, and about the recommendations of the Kravdal paper, I am now cautiously optimistic: it seems the analysis is worth pursuing for the moment. Both you and Kravdal cite the importance of the absolute size of the cluster samples (as well as the ratio of cluster sample to cluster population). Having looked at the dataset from Senegal 2005, the cluster sizes are quite large (M = 38, SD = 11) and only 4.8% of clusters have fewer than 20 cases. However, based on Kravdal's advice it will be important to check the within- and between-cluster variance of both measures, and I'll also have to consider the average size of the cluster population itself. It may even be worth replicating something similar to Kravdal's simulation in order to explore the viability of using this particular dataset in this way.
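For reference, a minimal sketch of that variance check (shown in Stata, although I will do the equivalent in SPSS; measure_x is a placeholder for the individual-level version of measure X, and v001 is the cluster number):

* One-way ANOVA of measure X across clusters: reports between- and
* within-cluster variance components and the intraclass correlation.
loneway measure_x v001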
As and when I do a more in-depth investigation of the viability of using clusters for this purpose, I will post about it here for the interest of future users of these particular measures.
Also, thanks for the advice regarding scientific justification for such choices. I agree that if a good justification can be found for the use of these measures, and as long as one is open and honest about their limitations and drawbacks, there is no reason to avoid using them.
Best,
Laurence.
Re: What is the DHS position on the use of cluster-level data? [message #10391 is a reply to message #10389]
Sun, 24 July 2016 18:51
Reduced-For(u)m
Messages: 292 Registered: March 2013
Senior Member
Why didn't the "collapse" command work? Is it just because the "svy" prefix doesn't work with collapse? You may not need that, since all HH in the same cluster have the same weight, and since you aren't doing any inference calculations (p-values) you don't need to worry about the stratification either. If you just want cluster means, collapse should work just fine.
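For example, something along these lines should do it (household recode; the filename and the variable being averaged are just placeholders):

use "senegal_hr.dta", clear        // household (HR) recode - placeholder filename
gen wt = hv005/1000000             // hv005 is the household weight stored * 1,000,000
collapse (mean) mean_x = some_var [pweight=wt], by(hv001)   // hv001 = cluster number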