* do e:\DHS\programs\U5_mortality\child_mortality_24May2013_NG_Pullum_do.txt * CHECK PATHS!! You have to specify the locations of the data and the output. * The following is the path to this program (it is not a Stata do-file) and the log file; * you must change it! cd e:\DHS\programs\U5_mortality set logtype text log using child_mortality_24May2013_NG_Pullum_log.txt,replace set more off /* **************************************************************************************** PROGRAM TO PRODUCE UNDER 5 MORTALITY RATES FOR SPECIFIC WINDOWS OF TIME, WITH COVARIATES **************************************************************************************** Written by Thomas W. Pullum, tom.pullum@icfi.com, first version September 2011, many subsequent updates This program has been tested on several reports. For example, for the Ethiopia 2005 survey (ET51), it matches exactly the rates in the report on pp. 111 and 113 start_i and end_i are scalars referring to the boundaries of the age intervals. Add dob (b3) to them to get cmc's of time The generic chapter number for child mortality is 8, but in some reports it is another number, e.g. 9 chap8sr.APP is the generic CSPro program for tables See P:\DHS\PROJECTS\UgandaDHS2006\DataProcessing\tables\chapt8 U5mort_C.sps, U5mortDHS_D.sps, and U5mortDHS_E.sps are the generic SPSS programs The standard periods are 0-4, 5-9, etc. years before the survey, but there are exceptions. For example, the periods in the Uganda 2006 report are 1-5, 6-11, etc. years before the survey. This Stata program has similarities to the Stata fertility rates program. Children, rather than women, are the units of analysis for the calculations. The child MAY have AT MOST ONE event, i.e. death. Therefore, rather than poisson regression (glm with poisson error, log link), the statistical model is log probability (glm with binomial error, log link). Another crucial difference is that for births, we are given the month (cmc) when the birth occurred, but for child deaths, we are given the age of the child (in months) at death. This produces some ambiguity in the month of death. This is particularly true for age at death above 23 months. For example, "24" means "24-35" and the month of death (dod) could be anywhere from dob+24 to dob+36. Within a window of time, each child is exposed to intervals of age, in months: 1 0 2 1–2 3 3–5 4 6–11 5 12–23 6 24–35 7 36–47 8 48–59 The program agrees exactly with DHS procedures. It can use either the IR or the BR file. The main output files must be renamed or they will be over-written the next time the program is run. Here are the crucial variables, with DHS names caseid "Case Identification" for mother bidx Child id, nested in caseid v005 "Sample wt" v008 "Date of interview (CMC)" or doi b3 "Date of birth (CMC)" for the children b7 "Age at death (months-imputed)"; "." if child survived Rates (correctly called conditional probabilities) calculated here: Neonatal mortality: the probability of dying in the first month of life; Postneonatal mortality: the difference between infant and neonatal mortality (similar to the probability of dying during months 1 through 11, but different); Infant mortality: the probability of dying in the first year of life; Child mortality: the probability of dying between the first and fifth birthday; Under-five mortality: the probability of dying before the fifth birthday. DHS procedures, from http://www.measuredhs.com/help/Datasets/index.htm: A. Infant mortality rate: 1. Calculate the component survival probability by subtracting the component death probability from one. 2. Calculate the product of the component survival probabilities for 0, 1–2, 3–5, and 6–11 months of age. 3. Subtract the product from 1 and multiply by 1000 to get the infant mortality rate. Post neonatal mortality rate: Subtract the neonatal mortality rate from the infant mortality rate. B. Child mortality rate: 1. Calculate the component survival probability by subtracting the component death probability from 1. 2. Calculate the product of the component survival probabilities for 12–23, 24–35, 36–47, and 48–59 months of age. 3. Subtract the product from 1 and multiply by 1000 to get the child mortality rate. C. Under-five mortality rate: 1. Calculate the component survival probability by subtracting the component death probability from 1. 2. Calculate the product of the component survival probabilities for 0, 1–2, 3–5, and 6–11, 12–23, 24–35, 36–47, and 48–59 months of age. 3. Subtract the product from 1 and multiply by 1,000 to get the child mortality rate. This routine does not do any re-distribution of deaths to adjust for possible age heaping on a boundary such as 12 months. Dates are in cmc's and intervals are in months. The version of the program with a covariate works only for categorical covariates. The q's for each value of the covariate are written as lines in a separate output file. The filenames of the saved files will include v000 (e.g. results_TZ5.dta and results_with_ci_TZ5.dta). GO TO THE REPEATED LINES OF ASTERISKS FOR THE BEGINNING OF THE EXECUTABLE STATEMENTS */ ****************************************************************************** program define setup scalar run_number=0 * Specification of nageints, lengths of age intervals /* Standard specification but it can be changed: i start_i end_i length_i 1 0 1 1 2 1 3 2 3 3 6 3 4 6 12 6 5 12 24 12 6 24 36 12 7 36 48 12 8 48 60 12 */ scalar nageints=8 scalar length_1=1 scalar length_2=2 scalar length_3=3 scalar length_4=6 scalar length_5=12 scalar length_6=12 scalar length_7=12 scalar length_8=12 scalar start_1=0 local i=2 while `i'<=nageints+1 { local iminus1=`i'-1 scalar start_`i'=start_`iminus1'+length_`iminus1' local i=`i'+1 } local i=1 while `i'<=nageints { local iplus1=`i'+1 scalar end_`i'=start_`iplus1' local i=`i'+1 } use temp.dta, clear * The next seven lines are only used if the input file is IR rather than BR. * If the input file is BR, then insert /* before and */ after the next seven lines * If the input file is IR, then omit /* before and */ after the next seven lines /* drop bidx* renpfix b3_0 b3_ renpfix b7_0 b7_ reshape long b3_ b7_, i(caseid) j(bidx) drop if b3==. rename b3_ b3 rename b7_ b7 */ rename v008 doi summarize doi [fweight=v005] scalar doi_mean=r(mean) gen wt=v005/1000000. scalar v000_string=v000[1] save temp2.dta, replace end ****************************************************************************** program define end_date_start_date * This routine calculates the end date and start date for the current window of time /* There are two ways to specify the window of time. Method 1: as calendar year intervals, e.g. with scalar lw=1992 scalar uw=1996 for a window from January 1992 through December 1996, inclusive. Method 2: as an interval before the date of interview, e.g. with scalar lw=-2 scalar uw=0 for a window from 0 to 2 years before the interview, inclusive (that is, three years) The program knows you are using method 2 if the two numbers you enter are negative or zero. lw is the lower end of the window and uw is the upper end. (Remember that both are negative or 0.) start_date is the cmc for the earliest month in the window and end_date is the cmc for the latest month in the window */ * lw and uw will be NEGATIVE OR ZERO for "years ago" if lw<=0 { gen start_date=doi+12*lw-12 gen end_date=doi+12*uw-1 } if lw>0 { * lw and uw will be >0 for "calendar years" gen start_date=12*(lw-1900)+1 gen end_date=12*(uw-1900)+12 } replace end_date=min(end_date,doi-1) gen refcmc=(start_date+end_date)/2 summarize refcmc [fweight=v005] scalar refcmc_mean=r(mean) drop refcmc end ****************************************************************************** program prepare_child_file * This routine calculates dob, doi, dod_1, dod_2, and the start and end months * of each age interval for each child. * This routine calls end_date_start_date use temp2.dta, clear end_date_start_date rename b3 dob *drop if missing on birthdate drop if dob==. *drop if dob is later than the end of this time interval drop if dob>end_date * b7 is given in single months up to 23 and then given as 24, 36, 48, 60, ... * but for the Uganda 2006 survey there are 2 births at 25, 1 at 26, 1 at 35 rename b7 age_at_death replace age_at_death=. if age_at_death>end_date replace age_at_death=. if age_at_death>59 /* doi month of interview (CMC) dob "Date of birth (CMC)" age_at_death "Age at death (months-imputed)" dod_1 the lower bound of the range of possible months of death (CMC) dod_2 the upper bound of the range of possible months of death (CMC) construct the beginning and end months of each age interval, in months after dob dob+start_i is the cmc when the child entered age interval i dob+end_i is the cmc when the child left age interval i because births happen at the middle of the month, end_1=start_2, etc. drop if all risk preceded the window */ drop if dob+end_8=start_`i' & age_at_death<=end_`i'-1 replace dod_1=dob+start_`i' if age_int_at_death==`i' replace dod_2=dob+end_`i' if age_int_at_death==`i' local i=`i'+1 } * age_at_death and age_int_at_death will be "." if the child survived to doi-1 * drop the child if died before the beginning of this time interval drop if dod_2end_date & end_date==doi-1 save temp3.dta, replace end ****************************************************************************** program define make_risk_and_deaths /* CHILDREN'S DEATHS AND RISK IN AGE INTERVALS WITHIN THE WINDOW OF TIME Standard age intervals 1 through 8 refer to months 0, 1-2, 3-5, 6-11, 12-23, 24-35, 36-47, 48-59. In other DHS programs these intervals are numbered 0 through 7 rather than 1 through 8. Each child gets values for died1 through died 8. The initial values are 0. They remain at 0 if the child survived to the date of interview. The other possible values are 1 and 0.5: 1 if the child died in the time interval, 0.5 if the age interval at death is divided across two time intervals. DHS does not calculate months of exposure. To emphasize that, this program uses the term "risk" rather than "exposure". Each child gets values for risk1 through risk8. The initial values are 0. They remain at 0 for any ages for which the child was not observed (within the window of time), because of death or censoring. The other possible values are 1 and 0.5: 1 if the child was observed for that age and all such observation was within the time interval, and 0.5 if the observation for that age is divided across two time intervals. The risk is not altered for the age interval and time interval(s) in which the child died, but risk drops to 0 after that age interval. In the log probability analysis, .5 is replaced with 1 and the weight is multiplied by .5. */ prepare_child_file use temp3.dta, replace * Values of risk and died are calculated in the following loop. It could be streamlined * somewhat but you must be careful or you will lose the match with DHS. local i=1 while `i'<=nageints { gen died`i'=. * age interval is entirely in the time window replace died`i'=1 if age_int_at_death==`i' & dod_2<=end_date & dod_1>=start_date * age interval is partly in the time window and partly in the previous time window replace died`i'=.5 if age_int_at_death==`i' & dod_2>=start_date & dod_1< start_date * age interval is partly in the time window and partly in the next time window replace died`i'=.5 if age_int_at_death==`i' & dod_2> end_date & dod_1<=end_date gen risk`i'=. * age interval is entirely in the time window replace risk`i'=1 if (age_int_at_death>=`i' | age_at_death==.) & dob+end_`i'<=end_date & dob+start_`i'>=start_date * age interval is partly in the time window and partly in the previous time window replace risk`i'=.5 if (age_int_at_death>`i' | age_at_death==.) & dob+end_`i'>=start_date & dob+start_`i'< start_date * age interval is partly in the time window and partly in the next time window replace risk`i'=.5 if (age_int_at_death>`i' | age_at_death==.) & dob+end_`i'> end_date & dob+start_`i'<=end_date * child dies in this interval * age interval is partly in the time window and partly in the previous time window replace risk`i'=.5 if age_int_at_death==`i' & end_`i'>=start_date & dob+start_`i'< start_date * age interval is partly in the time window and partly in the next time window replace risk`i'=.5 if age_int_at_death==`i' & end_`i'> end_date & dob+start_`i'<=end_date * Next line will change risk from 0 to .5 for a handful of cases in which died=.5 and risk=0; * it should not be necessary and suggests an error above, but ensures * a perfect match with the DHS programs. replace risk`i'=.5 if died`i'==.5 * the preceding lines produce some values of died and risk that should be changed from missing to 0 replace died`i'=0 if died`i'==. & risk`i'>0 & risk`i'<=1 replace risk`i'=0 if risk`i'==. & died`i'>0 & died`i'<=1 * SEE NOTE BELOW!! *replace died`i'=.5 if risk`i'==.5 & died`i'==1 replace risk`i'=1 if died`i'==1 gen wt_died`i'=wt*died`i' gen wt_risk`i'=wt*risk`i' * To this point, everything was unweighted. Now carry along both wtd and unwtd values. * Weighted values can be used in calc_rates_division. * Unweighted values are used in the logprob model, which will include weights in the model. ren died`i' unwtd_died`i' ren risk`i' unwtd_risk`i' local i=`i'+1 } /* IN THE CSPRO AND SAS AND SPSS PROGRAMS IT CAN HAPPEN THAT RISK=.5 AND DIED=1. FROM THE PERSPECTIVE OF PROBABILITY THEORY, THIS IS AN ERROR. IN THOSE PROGRAMS IT ONLY HAPPENS IF THE CHILD DIED IN THE MOST RECENT TIME INTERVAL AND IN AN AGE GROUP THAT WOULD OTHERWISE HAVE BEEN CENSORED BY THE INTERVIEW. The above line "replace risk`i'=1 if died`i'==1" corrects this, possibly introducing a small difference from published estimates. */ gen v_run_number=_n gen v_lw=. gen v_uw=. gen v_refcmc=. gen str10 variable="All" gen value=. label variable value "All" * se`i' is the (estimated, of course) se of the log of rate i local i=1 while `i'<=nageints { gen v_rate`i'=. gen v_se`i'=. local i=`i'+1 } sort caseid bidx save risk_and_deaths.dta, replace end ****************************************************************************** program define calc_rates_division /* THIS ROUTINE IS NOT ACTUALLY CALLED BUT IS INCLUDED HERE FOR POSSIBLE USE It may have to be revised because it has not been used for a long time.... This routine produces one line for each combination of time and category of covariate. It calculates rates simply by aggregating deaths and risk, and dividing the sum of the deaths by the sum of the risk. Only the WEIGHTED deaths and risk are used here. By contrast, the the log probability approach uses unweighted deaths and risk and does the weighting with pweight. */ keep v_lw v_uw value variable wtd_died* wtd_risk* *collapse (sum) wtd_died* wtd_risk*, by(time value variable) collapse (sum) wtd_died* wtd_risk*, by(value variable) * calculate rates local i=1 while `i'<=nageints { gen rate`i'=wtd_died`i'/wtd_risk`i' local i=`i'+1 } format %10.2f wtd_died* wtd_risk* * Unweighted numerators and denominators can be useful to gauge statistical stability list lw uw variable value wtd_died1 wtd_died2 wtd_died3 wtd_died4, table clean list lw uw variable value wtd_died5 wtd_died6 wtd_died7 wtd_died8, table clean list lw uw variable value wtd_risk1 wtd_risk2 wtd_risk3 wtd_risk4, table clean list lw uw variable value wtd_risk5 wtd_risk6 wtd_risk7 wtd_risk8, table clean end ****************************************************************************** program define calc_5_from_8 * Routine to calculate the various q's from rate1-rate8 ren rate1 q_month_0 ren rate2 q_month_1to2 ren rate3 q_month_3to5 ren rate4 q_month_6to11 ren rate5 q_year_1 ren rate6 q_year_2 ren rate7 q_year_3 ren rate8 q_year_4 * calculate the compound q's gen neonatal=q_month_0 gen prob_1q0=1-(1-q_month_0)*(1-q_month_1to2)*(1-q_month_3to5)*(1-q_month_6to11) gen postneonatal=prob_1q0-neonatal gen prob_4q1=1-(1-q_year_1)*(1-q_year_2)*(1-q_year_3)*(1-q_year_4) gen prob_5q0=1-(1-prob_1q0)*(1-prob_4q1) end ****************************************************************************** program define save_results * This routine fills in the variables that were set up earlier, * using values calculated in calc_rates scalar run_number=run_number+1 replace v_lw=lw if v_run_number==run_number replace v_uw=uw if v_run_number==run_number replace v_refcmc=refcmc_mean if v_run_number==run_number replace value=code if v_run_number==run_number local i=1 while `i'<=nageints { replace v_rate`i'=rate`i' if v_run_number==run_number replace v_se`i'=se`i' if v_run_number==run_number local i=`i'+1 } end ****************************************************************************** program define logprob, byable(recall) marksample touse /* this routine produces one line for each combination of time and category of covariate each line gives results for all of the age intervals it calculates rates using log probability models (glm, log link, binomial error) the UNWEIGHTED deaths and risk are used, with adjustments for weights, clusters, and strata within the model */ local i=1 quietly while `i'<=nageints { quietly summarize unwtd_died`i' if `touse' scalar died`i'tot=r(sum) quietly summarize unwtd_risk`i' if `touse' scalar risk`i'tot=r(sum) quietly summarize value if `touse' scalar code=r(mean) scalar rate`i'=0 scalar se`i'=0 if died`i'tot>0 & risk`i'tot>0 & died`i'tot<. & risk`i'tot<. { *********************** quietly glm unwtd_died`i' if `touse' [pweight=v005], family(binomial unwtd_risk`i') link(log) cluster(v001) *********************** matrix bb=e(b) scalar rate`i'=exp(bb[1,1]) matrix sesq=e(V) scalar se`i'=sqrt(sesq[1,1]) } local i=`i'+1 } save_results end ****************************************************************************** program define calc_q_after_logprob * Routine that uses rate1-rate8 to get the various q's following the log prob model. * First part is exactly the same as calc_5_from_8. The additional part is for the * calculation of standard errors and confidence intervals. calc_5_from_8 /* Here is a complete list of the q's (plus postneonatal, which is not a q!) q_month_0 q_month_1to2 q_month_3to5 q_month_6to11 q_year_1 q_year_2 q_year_3 q_year_4 neonatal postneonatal prob_1q0 prob_4q1 prob_5q0 */ * calculate confidence intervals for the 8 basic q's local i=1 while `i'<=nageints { gen factor`i'=exp(1.96*se`i') local i=`i'+1 } gen L_q_month_0=q_month_0/factor1 gen L_q_month_1to2=q_month_1to2/factor2 gen L_q_month_3to5=q_month_3to5/factor3 gen L_q_month_6to11=q_month_6to11/factor4 gen L_q_year_1=q_year_1/factor5 gen L_q_year_2=q_year_2/factor6 gen L_q_year_3=q_year_3/factor7 gen L_q_year_4=q_year_4/factor8 gen H_q_month_0=q_month_0*factor1 gen H_q_month_1to2=q_month_1to2*factor2 gen H_q_month_3to5=q_month_3to5*factor3 gen H_q_month_6to11=q_month_6to11*factor4 gen H_q_year_1=q_year_1*factor5 gen H_q_year_2=q_year_2*factor6 gen H_q_year_3=q_year_3*factor7 gen H_q_year_4=q_year_4*factor8 drop factor* * a procedure to estimate the se and ci's for the compound rates using the se * of the 8 basic rates is in a separate do file, child_mortality_ci_do.txt end ********************************************************** program define partial_file_save drop if v_run_number==. keep v* if run_number>1 { append using partial_results.dta } save partial_results.dta, replace use risk_and_deaths.dta, clear end ********************************************************** program define final_file_save use partial_results.dta, clear * construct two Stata data files that save the results. * The first includes confidence intervals, the second does not renpfix v_ drop if lw==. calc_q_after_logprob sort lw uw variable value format q* se* L* H* p* neo* %6.4f format lw uw value %5.0f scalar results_with_ci_="results_with_ci_" scalar cid=v000[1] scalar dotdta=".dta" scalar sfn=results_with_ci_+cid+dotdta local lfn=sfn save `lfn', replace keep lw uw refcmc variable value neonatal postneonatal prob* gen str4 v000=v000_string * If all the interviews were in Jan. 1900, i.e. cmc 1, they would be placed * at the middle of January, i.e. one-half of one month into 1900. * Thus the reference doi doi must be adjusted downwards by one-half month. gen doi=1900+(doi_mean-.5)/12 * The reference date of interview would initially be calculated as starting month plus ending * month divided by 2 (which has already been calculated as mean_refcmc. Say the * starting month is 1 (January) and the ending month is 11 (December). The interval goes from * the start of the start month to the end of end month, so the mean is July 1, i.e. month 7.0. * But expressed as the fraction of the year, it is .5. In years, the correct ref date will be * 1900 + mean_refcmc/12. gen refdate=1900+refcmc/12 format doi refdate %8.2f list v000 doi refdate lw uw variable value neonatal postneonatal prob*, table clean compress scalar results_="results_" scalar scid=v000[1] scalar sfnend=".dta" scalar sfn=results_+scid+sfnend local lfn=sfn save `lfn', replace end ****************************************************************************** ****************************************************************************** ****************************************************************************** ****************************************************************************** ****************************************************************************** * EXECUTION BEGINS HERE /* Input file can be either IR or BR; check the end of the setup routine for seven extra lines, including reshape, that are to be included if the input file is IR. v002, v003 can be helpful for linking to other files, otherwise not needed v001 is the cluster variable v005 is the weight variable v023 is the stratum variable (usually but not always; can usually construct the strata aka domains as combinations of urban/rural and region) Include v025 to illustrate inclusion of a covariate */ * Nigeria 2003 *use c:\DHS\DHS_data\BR_files\NGBR4BFL.dta * Nigeria 2008 use c:\DHS\DHS_data\BR_files\NGBR52FL.dta keep caseid v000 v001 v002 v003 v005 v008 bidx* b3* b7* v024 v025 v106 v190 gen v106r=v106 replace v106r=2 if v106==3 save temp.dta, replace quietly setup * Generic lines for setting the time interval and running on the total, for * the first table in the mortality chapter of the main report scalar lw=-4 scalar uw=0 quietly make_risk_and_deaths quietly logprob partial_file_save replace value=v024 label variable value "v024" replace variable="Region" quietly bysort v024: logprob partial_file_save scalar lw=-9 scalar uw=-5 quietly make_risk_and_deaths quietly logprob partial_file_save replace value=v024 label variable value "v024" replace variable="Region" quietly bysort v024: logprob partial_file_save scalar lw=-9 scalar uw=0 quietly make_risk_and_deaths quietly logprob partial_file_save replace value=v024 label variable value "v024" replace variable="Region" quietly bysort v024: logprob partial_file_save * Generic lines for setting the time interval and running on the total and * covariates for the second table in the mortality chapter of the main report * the next line is essential at the end of the run final_file_save * the next lines are optional; "ci" for "confidence intervals"; must have * child_mortality_ci_date_do.txt in the path scalar results_with_ci_="results_with_ci_" scalar cid=v000[1] scalar dotdta=".dta" scalar sfn=results_with_ci_+cid+dotdta local lfn=sfn use `lfn', clear log close do child_mortality_ci_25March2013_do.txt ************************************************************** *END OF PROGRAM*********************************************** **************************************************************