How do I account for clustering within families? [message #1951] 
Fri, 11 April 2014 00:01 
vega25

Hello,
The children's dataset in DHS files may contain records for multiple children within the same family, for instance if the dataset has data for all children born in the five years preceding the survey and some of the female respondents had more than one child during that time. Thus many children in the dataset would be siblings.
When this is the case, how does one account for the fact that some children in the data are siblings and so share the same set of background/household/parental variables? Is adding "cluster(caseid)" as an option at the end of the regression syntax in Stata a valid way of doing so? Or is it enough just to ask for robust standard errors? Alternatively, is a family fixed effects model the best way of doing this? My concern with family fixed effects is the loss of sample size.
Thank you for any advice that people may have.




Re: How do I account for clustering within families? [message #1969 is a reply to message #1968] 
Fri, 11 April 2014 14:25 
ReducedFor(u)m

I would say that for most analyses (anything I can think of) you would want to cluster at a more aggregated level than the household. The survey design itself requires accounting for cluster sampling, so if you cluster at the PSU (primary sampling unit) level, that will subsume the household and take care of both the study design and the within-household problem.
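A minimal Stata sketch of the PSU-level clustering described above, assuming a DHS children's recode file is already loaded. The names outcome, covariate1 and covariate2 are hypothetical placeholders; v001 (PSU) and v005 (sample weight) are the standard DHS variable names.

```stata
* Sketch only: outcome, covariate1, covariate2 are hypothetical placeholders.
* DHS sample weights are stored multiplied by 1,000,000.
generate double wt = v005 / 1000000

* Cluster-robust standard errors at the PSU level (v001).
* Households, and the siblings inside them, are nested within PSUs.
regress outcome covariate1 covariate2 [pweight = wt], vce(cluster v001)
```

The older cluster(v001) option is accepted as a synonym for vce(cluster v001).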




Re: How do I account for clustering within families? [message #2006 is a reply to message #2005] 
Sat, 12 April 2014 09:41 

userrhs

Dear Vega,
If you are so inclined, you can run a mixed model where household is nested within sampling cluster. Depending on your outcome, you can try xtmixed for multilevel mixed-effects linear regression or xtmelogit for multilevel mixed-effects logistic regression. For other outcomes, use gllamm, which I should warn is a beast* to run and will probably take at least half a day to complete (and sometimes, after 3 days of running, you still don't get convergence... the point is, do NOT run gllamm if you're pressed for time). As I do not have Stata 13, I can't tell you whether there is a less clunky procedure equivalent to gllamm built into Stata 13. (ReducedFor(u)m, if you have Stata 13, does such a procedure exist?)
Sample syntax
xtmixed outcome covariate1 covariate2 || cluster: || hhid:
If you suspect that having random slopes makes sense for some variables you can list the variables for which you want random slopes estimated after the colons, e.g.
xtmixed outcome covariate1 covariate2 || cluster: covariate3 || hhid:
****Note: random slopes are usually the exception rather than the norm, and controlling for clustering via the cluster and household random intercepts is sufficient for most purposes.
For more information, visit the documentation for xtmixed, xtmelogit, gllamm, and the Bristol University Centre for Multilevel Modelling:
 http://www.stata.com/help.cgi?xtmixed
 http://www.stata.com/help.cgi?xtmelogit
 http://www.gllamm.org/
 http://www.bristol.ac.uk/cmm/learning/onlinecourse/index.html
HTH,
RHS
*I have only the utmost respect for Prof. Sophia Rabe-Hesketh, who wrote gllamm (and is coauthor of the Stata Press books on multilevel modeling). GLLAMM is very flexible and powerful but takes a long time to run. The flexibility also means you have to read the documentation properly to make sure you won't get error messages for failing to specify required "options" for the specific model you are trying to fit!
[Updated on: Sat, 12 April 2014 09:44]



Re: How do I account for clustering within families? [message #2009 is a reply to message #2005] 
Sat, 12 April 2014 17:01 
ReducedFor(u)m

I think there is a confusion here that stems from different disciplines speaking differently, so let me clarify something.
By "clustering" I mean choosing a specification for the variance/covariance matrix of error terms that accounts for within-cluster heteroskedasticity and serial correlation. In Stata terms, I mean something like "reg Y X, cluster(clustervar)". This relates to getting your standard errors right, but will not in any way affect point estimates.
I think your question about "controlling" for within-household effects has to do with point estimates. In that case, you may (or may not) want to include household fixed effects (or dummy variables for each household). This would mean that your estimate is based on within-household differences. It would also limit your effective sample to households with multiple children. When I say "control for household characteristics", I'm usually referring to this, which is about getting the right identifying variation for your model.
But what I was talking about before was "accounting for within-household and within-cluster similarities" in your standard errors (your estimated precision). In that case, you want to cluster at the PSU level because you get all of the benefits of clustering at the HH level plus the benefits of accounting for the sampling design (though not stratification, which could technically shrink your SEs back down a bit).
What the multilevel models do is, depending on what you choose, something closer to what I call "Random Effects". That is, the V/C matrix on the error terms is parametric in some way, whereas it is "nonparametric" in the clustering case. This gets technical real fast. So let me just repeat the main point:
Household fixed effects would deal with things like selection (some parents are good, some are bad) or other omitted variable bias problems. Clustering will get the standard errors right. Two distinct problems.
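To make the distinction concrete, here is a hedged Stata sketch of the two approaches side by side. It assumes a DHS children's file is loaded; outcome, covariate1 and hh_id are hypothetical names, and areg is just one of several commands that can absorb household fixed effects.

```stata
* Sketch only: outcome and covariate1 are hypothetical placeholders.
* v002 (household number) repeats across PSUs, so build a unique household ID.
egen long hh_id = group(v001 v002)

* (a) Clustering only: point estimates match plain OLS; only the
*     standard errors change.
regress outcome covariate1, vce(cluster v001)

* (b) Household fixed effects: the coefficient is identified from
*     within-household variation, so households with a single child
*     contribute nothing to it. Standard errors stay clustered at the PSU.
areg outcome covariate1, absorb(hh_id) vce(cluster v001)
```

Comparing the coefficient on covariate1 across (a) and (b) shows how much the estimate depends on between-household variation; comparing (a) with an unclustered regress shows the effect of clustering on precision alone.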
That help? I can try again if it doesn't.



Re: How do I account for clustering within families? [message #2010 is a reply to message #2006] 
Sat, 12 April 2014 17:12 
ReducedFor(u)m

I haven't used the mixed models in Stata 13, but I do have it! Here is the documentation if anyone is interested. The basic command is "mixed", and there are "meglm", "melogit", etc. too.
Basic: http://www.stata.com/manuals13/meme.pdf#memeRemarksandexamples
More: http://www.stata.com/manuals13/me.pdf
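For reference, a sketch of how the earlier xtmixed example might look with the Stata 13 commands. The names outcome, binary_outcome, covariate1 and covariate2 are hypothetical placeholders; v001 and v002 are the DHS cluster and household numbers.

```stata
* Sketch only: requires Stata 13 or later; outcome names are placeholders.
* Random intercepts for PSU (v001) and for household (v002) nested in PSU.
mixed outcome covariate1 covariate2 || v001: || v002:

* Binary outcome: multilevel logistic regression with the same nesting.
melogit binary_outcome covariate1 covariate2 || v001: || v002:
```

Levels are listed from the highest to the lowest, so v002 is treated as nested within v001 even though household numbers repeat across clusters.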
There's also this new "gsem" command, and I see this regarding its relative speed:
Note: gllamm users will be especially interested in gsem. There is a lot of overlap in the models that gllamm and gsem can fit. Where there is overlap, gsem is faster. gsem is at least four times faster, usually it is 10 to 100 times faster, and there are examples where gsem is up to 1,000 times faster than gllamm.
http://www.timberlakeconsulting.com/Stata/?id=504
... I can't be much more helpful than that. I think back in the day I once used "xtmixed" to fit one of these to a dataset in the low thousands of obs, and it went really fast. But I'd bet it really depends on how much structure you are putting in, what kinds of prior distributions you may/not be fitting, and what particular estimation method you want.
If only Nick Cox were on the DHS Forum, we'd know what to do. Short of that, I'd say, if anyone ever has the need, drop a question on the Statalist, and then report back what Nick has to say about relative speed of the various Stata generalized linear model commands.



Re: How do I account for clustering within families? [message #2014 is a reply to message #1951] 
Sun, 13 April 2014 18:48 
LizDHS

Dear User,
Here is a response from one of our experts, Dr. Tom Pullum:
Yes, there is definitely clustering at the level of the family or household or mother for outcomes such as child survival, place of delivery, childhood illness and treatment for illness, etc. The standard practice at DHS has been just to include clustering at the level of the PSU, v001 (=v021). We do that either with the option cluster(v001) or (better) with svyset, followed by svy:. The standard errors will then be robust for that level of clustering. Up through Stata 12, as I understand it, only one level of clustering can be used, and for us that would be v001. We are about to upgrade to Stata 13, and as I understand it we will be able to add a second level of clustering, which will be the household (v002).
A problem which can arise with v001, and will be much more common with v002, is insufficient density at that level. It will be necessary to specify a default. We will be able to provide better guidance soon, after we start using Stata 13. I am sure other users already have some experience with household level clustering and perhaps they can volunteer a comment.
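A hedged sketch of the svyset approach described above. v001, v002 and v005 are standard DHS variables; using v023 as the stratum variable is an assumption that should be checked against the survey's final report, and outcome, covariate1 and covariate2 are hypothetical placeholders. Reading the "default" mentioned above as Stata's singleunit() option is also my assumption.

```stata
* Sketch only: check the stratum variable for your particular survey.
generate double wt = v005 / 1000000

* One-stage declaration: PSU = v001. singleunit(centered) is one way to
* handle strata that end up with a single PSU in the estimation sample.
svyset v001 [pweight = wt], strata(v023) singleunit(centered)
svy: regress outcome covariate1 covariate2

* Two-stage declaration with the household (v002) as the second-stage unit.
svyset v001 [pweight = wt], strata(v023) || v002
svy: regress outcome covariate1 covariate2
```

As far as I know, when no fpc() is given for the first stage, Stata treats that stage as sampled with replacement and reports that later stages are ignored for variance estimation, so the two declarations would be expected to give the same standard errors.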



Re: How do I account for clustering within families? [message #2015 is a reply to message #2014] 
Sun, 13 April 2014 19:09 
ReducedFor(u)m

Tom (via Liz):
Do you know what Stata does regarding estimating cluster-robust standard errors using the svy: command? If it is using the "cluster robust" sandwich estimator, clustering at PSU and household would be the same as clustering at PSU. But if it's using some random-effects-type correction (some Moulton-type parametric specification of the V/C matrix of error terms), then multilevel clustering would be different. I've never figured out what Stata is doing with svy:. Using "reg Y X, cluster(clustervar)" uses the sandwich estimator that would subsume household in PSU, but it sounds like that is not what svy: is doing.
Thanks.




Re: How do I account for clustering within families? [message #2054 is a reply to message #2041] 
Fri, 18 April 2014 13:51 
LizDHS

Dear Reduced Forum,
This is a question for Stata Corp, not for us. Tom
*****************************************************************
I'll try following up with Stata re:
Do you know what Stata does regarding estimating cluster-robust standard errors using the svy: command? If it is using the "cluster robust" sandwich estimator, clustering at PSU and household would be the same as clustering at PSU. But if it's using some random-effects-type correction (some Moulton-type parametric specification of the V/C matrix of error terms), then multilevel clustering would be different. I've never figured out what Stata is doing with svy:. Using "reg Y X, cluster(clustervar)" uses the sandwich estimator that would subsume household in PSU, but it sounds like that is not what svy: is doing.



Re: How do I account for clustering within families? [message #2074 is a reply to message #2054] 
Thu, 24 April 2014 13:28 
LizDHS

From Stata Tech Support:
If you take a look at the section "Linearized/robust variance estimation" in the manual entry "variance estimation - Variance estimation for survey data" ([SVY] manual), near the end of this section, it says
"\hat{V}\{\hat{G}(\beta)\}|_{\beta=\hat{\beta}} is computed using the design-based
variance estimator for a total."
Then in the section "Variance of the total", you could see how PSUs and multistage sampling units are handled in the formulas.
A quick way to open the PDF manual is to first type
help svy
and click the hyperlink "[SVY] svy" in blue at the top of this page. You could then navigate the Bookmarks on the left of the PDF screen and under the bookmark "[SVY] Survey Data", select "variance estimation".








Re: How do I account for clustering within families? [message #13123 is a reply to message #13113] 
Sun, 24 September 2017 15:20 
ReducedFor(u)m

One way to do this cleanly without having to sacrifice anything regarding survey design (clustering, stratification, weighting) would be to estimate each survey round individually, and then use Seemingly Unrelated Regressions commands (suest and sureg in Stata) to test the equality of coefficients across survey rounds. To do that you'd just do your standard DHS svyset/svy: stuff on an individual round, save the information (see help commands for suest/sureg), do the other rounds the same way, and then directly test the coefficients.
You can do it pooled by country too, but when you make those stratumid (which I presume you do before merging) you might end up with something weird where a value of that variable repeats across survey rounds. You could just append a "001" and "002" to the end of each stratumid where "001" would mean from survey round one. You'd also need to generate new PSU identifiers, since those repeat values from survey to survey but don't represent the same PSUs.
I assume by "i.year" you mean survey year and not cohort, right? Also remember, the beta/p-value on each of your year dummies is only relative to the omitted one, so if you have 3 rounds you'd need to test the survey round dummies against each other (not just against the 0 that represents the omitted group).
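A sketch of the pooled approach, assuming the rounds have already been appended into one file. The variable round (coded 1, 2, 3) and the names stratum_id, psu_id, outcome and covariate1 are hypothetical; v023 as the stratum variable is an assumption to check for each survey. Using egen group() is an alternative to appending "001"/"002" by hand and gives the same uniqueness.

```stata
* Sketch only: round, outcome and covariate1 are hypothetical names.
* Make stratum and PSU identifiers unique across survey rounds.
egen long stratum_id = group(round v023)
egen long psu_id     = group(round v001)
generate double wt   = v005 / 1000000

svyset psu_id [pweight = wt], strata(stratum_id)
svy: regress outcome covariate1 i.round

* Each round coefficient is relative to the omitted round (round 1 here),
* so the non-omitted rounds must be compared with each other explicitly.
test 2.round = 3.round
```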




Re: How do I account for clustering within families? [message #13128 is a reply to message #13125] 
Mon, 25 September 2017 14:26 
ReducedFor(u)m

Glad I could help. Re followup questions:
1. Yes, that is fine if you are doing the surveys separately, but if you are merging them you'd need the new identifiers that include survey year.
2. Yeah, you could just add some numbers on the end of the PSU for survey year or round or whatever... just anything that makes the values of the variable unique by survey round. But of course if you do each survey separately, you can just use the code you have (assuming that strata are defined that way in your particular surveys... sometimes strata definitions change, but it is usually done in region-by-urban groupings, which I think is what you have there).




Re: How do I account for clustering within families? [message #13149 is a reply to message #13129] 
Thu, 28 September 2017 16:44 
ReducedFor(u)m

I'm not sure exactly what you are asking. If you are clustering at the PSU level, that subsumes the household, and so you are allowing for interdependence of error terms within families anyway (since you are allowing it across families in the same PSU, and no household spans multiple PSUs). You could just set the weights using svyset and specify the clustering after the regression (with something like ", cluster(PSU)"), but I think what you have is fine. If you want to do an explicit multilevel model, I think you might want to switch to the new mixed-model commands in Stata under the "mixed" family (formerly called xtmixed). Questions on setting up the hierarchical structure code there would need to be sent to Statalist, not here, and I'm not an expert on that suite of commands.



