* do e:\DHS\programs\U5_mortality\child_mortality_24May2013_NG_Pullum_do.txt

* CHECK PATHS!!  You have to specify the locations of the data and the output.

* The following is the path to this program (it is not a Stata do-file) and the log file;
*  you must change it!
cd e:\DHS\programs\U5_mortality

set logtype text
log using child_mortality_24May2013_NG_Pullum_log.txt,replace
set more off

/*
****************************************************************************************
PROGRAM TO PRODUCE UNDER 5 MORTALITY RATES FOR SPECIFIC WINDOWS OF TIME, WITH COVARIATES
****************************************************************************************

Written by Thomas W. Pullum, tom.pullum@icfi.com, first version September 2011, many subsequent updates

This program has been tested on several reports.  For example, for the Ethiopia 2005 survey (ET51),
  it matches exactly the rates in the report on pp. 111 and 113

start_i and end_i are scalars referring to the boundaries of the age intervals.

Add dob (b3) to them to get cmc's of time

The generic chapter number for child mortality is 8, 
but in some reports it is another number, e.g. 9

chap8sr.APP is the generic CSPro program for tables
See P:\DHS\PROJECTS\UgandaDHS2006\DataProcessing\tables\chapt8

U5mort_C.sps, U5mortDHS_D.sps, and U5mortDHS_E.sps are the generic SPSS programs

The standard periods are 0-4, 5-9, etc. years before the survey, but there are
 exceptions.  For example, the periods in the Uganda 2006 report are 1-5, 6-11,
 etc. years before the survey.

This Stata program has similarities to the Stata fertility rates program. 
  Children, rather than women, are the units of analysis for the calculations.  
  The child MAY have AT MOST ONE event, i.e. death. Therefore, rather than
  poisson regression (glm with poisson error, log link), the statistical 
  model is log probability (glm with binomial error, log link).
 
Another crucial difference is that for births, we are given the month (cmc) when 
  the birth occurred, but for child deaths, we are given the age of the child
  (in months) at death.  This produces some ambiguity in the month of death. 
  This is particularly true for age at death above 23 months.  For example, "24" 
  means "24-35" and the month of death (dod) could be anywhere from dob+24 to dob+36.

Within a window of time, each child is exposed to intervals of age, in months:

1 0 
2 1–2 
3 3–5 
4 6–11 
5 12–23 
6 24–35 
7 36–47
8 48–59


The program agrees exactly with DHS procedures.  
  It can use either the IR or the BR file.  

The main output files must be renamed or they
  will be over-written the next time the program is run.

Here are the crucial variables, with DHS names

caseid   "Case Identification" for mother
bidx     Child id, nested in caseid
v005     "Sample wt"
v008     "Date of interview (CMC)" or doi
b3       "Date of birth (CMC)" for the children
b7       "Age at death (months-imputed)"; "." if child survived


Rates (correctly called conditional probabilities) calculated here:

Neonatal mortality:     the probability of dying in the first month of life;
Postneonatal mortality: the difference between infant and neonatal mortality (similar to the
                            probability of dying during months 1 through 11, but different);
Infant mortality:       the probability of dying in the first year of life;
Child mortality:        the probability of dying between the first and fifth birthday;
Under-five mortality:   the probability of dying before the fifth birthday.


DHS procedures, from http://www.measuredhs.com/help/Datasets/index.htm: 

A. Infant mortality rate:

1. Calculate the component survival probability by subtracting the component death probability from one.
2. Calculate the product of the component survival probabilities for 0, 1–2, 3–5, and 6–11 months of age.
3. Subtract the product from 1 and multiply by 1000 to get the infant mortality rate. Post neonatal 
     mortality rate: Subtract the neonatal mortality rate from the infant mortality rate.

B. Child mortality rate:

1. Calculate the component survival probability by subtracting the component death probability from 1.
2. Calculate the product of the component survival probabilities for 12–23, 24–35, 36–47, and 48–59 months of age.
3. Subtract the product from 1 and multiply by 1000 to get the child mortality rate.

C. Under-five mortality rate:

1. Calculate the component survival probability by subtracting the component death probability from 1.
2. Calculate the product of the component survival probabilities for 0, 1–2, 3–5, and 6–11, 12–23, 24–35, 36–47,
      and 48–59 months of age.
3. Subtract the product from 1 and multiply by 1,000 to get the child mortality rate.


This routine does not do any re-distribution of deaths to adjust for possible 
  age heaping on a boundary such as 12 months.

Dates are in cmc's and intervals are in months.

The version of the program with a covariate works only for categorical covariates.  
The q's for each value of the covariate are written as lines in a separate output file.

The filenames of the saved files will include v000 (e.g. results_TZ5.dta and results_with_ci_TZ5.dta).

GO TO THE REPEATED LINES OF ASTERISKS FOR THE 
BEGINNING OF THE EXECUTABLE STATEMENTS

*/

******************************************************************************
program define setup

scalar run_number=0

* Specification of nageints, lengths of age intervals

/*
Standard specification but it can be changed:

i start_i  end_i  length_i
1    0        1      1  
2    1        3      2
3    3        6      3
4    6       12      6
5   12       24     12
6   24       36     12
7   36       48     12
8   48       60     12

*/

scalar nageints=8

scalar length_1=1
scalar length_2=2
scalar length_3=3
scalar length_4=6
scalar length_5=12
scalar length_6=12
scalar length_7=12
scalar length_8=12

scalar start_1=0
local i=2
while `i'<=nageints+1 {
local iminus1=`i'-1
scalar start_`i'=start_`iminus1'+length_`iminus1'
local i=`i'+1
}

local i=1
while `i'<=nageints {
local iplus1=`i'+1
scalar end_`i'=start_`iplus1'
local i=`i'+1
}

use temp.dta, clear 

* The next seven lines are only used if the input file is IR rather than BR.
* If the input file is BR, then insert /* before and */ after the next seven lines
* If the input file is IR, then omit   /* before and */ after the next seven lines

/*
drop bidx*
renpfix b3_0 b3_
renpfix b7_0 b7_
reshape long b3_ b7_, i(caseid) j(bidx)
drop if b3==.
rename b3_ b3
rename b7_ b7
*/

rename v008 doi
summarize doi [fweight=v005]
scalar doi_mean=r(mean)
gen wt=v005/1000000.
scalar v000_string=v000[1]
save temp2.dta, replace
end

******************************************************************************

program define end_date_start_date

* This routine calculates the end date and start date for the current window of time

/*
There are two ways to specify the window of time.

Method 1: as calendar year intervals, e.g. with

scalar lw=1992
scalar uw=1996

for a window from January 1992 through December 1996, inclusive.

Method 2: as an interval before the date of interview, e.g. with

scalar lw=-2
scalar uw=0  

for a window from 0 to 2 years before the interview, inclusive
  (that is, three years)

The program knows you are using method 2 if the two numbers
you enter are negative or zero.

lw is the lower end of the window and uw is the upper end.
(Remember that both are negative or 0.)

start_date is the cmc for the earliest month in the window and
end_date is the cmc for the latest month in the window

*/

* lw and uw will be NEGATIVE OR ZERO for "years ago"

if lw<=0 {
gen start_date=doi+12*lw-12
gen end_date=doi+12*uw-1
}

if lw>0 {
* lw and uw will be >0 for "calendar years"
gen start_date=12*(lw-1900)+1
gen end_date=12*(uw-1900)+12
}

replace end_date=min(end_date,doi-1)
gen refcmc=(start_date+end_date)/2
summarize refcmc [fweight=v005]
scalar refcmc_mean=r(mean)
drop refcmc

end

******************************************************************************

program prepare_child_file

* This routine calculates dob, doi, dod_1, dod_2, and the start and end months
*  of each age interval for each child.

* This routine calls end_date_start_date

use temp2.dta, clear

end_date_start_date

rename b3 dob

*drop if missing on birthdate
drop if dob==.  

*drop if dob is later than the end of this time interval
drop if dob>end_date

* b7 is given in single months up to 23 and then given as 24, 36, 48, 60, ...
*  but for the Uganda 2006 survey there are 2 births at 25, 1 at 26, 1 at 35

rename b7 age_at_death
replace age_at_death=. if age_at_death>end_date
replace age_at_death=. if age_at_death>59

/*
doi           month of interview (CMC)
dob           "Date of birth (CMC)"
age_at_death  "Age at death (months-imputed)"
dod_1         the lower bound of the range of possible months of death (CMC)
dod_2         the upper bound of the range of possible months of death (CMC)

construct the beginning and end months of each age interval, in months after dob
 dob+start_i is the cmc when the child entered age interval i
 dob+end_i   is the cmc when the child left age interval i

 because births happen at the middle of the month, end_1=start_2, etc.

drop if all risk preceded the window
*/

drop if dob+end_8<start_date-1

gen age_int_at_death=.
gen dod_1=.
gen dod_2=.

local i=1
while `i'<=nageints {
replace age_int_at_death=`i' if age_at_death>=start_`i' & age_at_death<=end_`i'-1
replace dod_1=dob+start_`i'  if age_int_at_death==`i'
replace dod_2=dob+end_`i'    if age_int_at_death==`i'
local i=`i'+1
}

* age_at_death and age_int_at_death will be "." if the child survived to doi-1

* drop the child if died before the beginning of this time interval
drop if dod_2<start_date

* Important adjustment for children who died near the end of interval 0. 
* They could not have died AFTER that date.
* This adjustment should NOT be made when time interval 0 is moved to be
* earlier than the date of interview (-1) but it WAS made for Uganda 2006.
replace dod_2=end_date if dod_1<= end_date & dod_2>end_date & end_date==doi-1

save temp3.dta, replace

end

******************************************************************************
program define make_risk_and_deaths

/*
CHILDREN'S DEATHS AND RISK IN AGE INTERVALS WITHIN THE WINDOW OF TIME

Standard age intervals 1 through 8 refer to 
  months 0, 1-2, 3-5, 6-11, 12-23, 24-35, 36-47, 48-59.

In other DHS programs these intervals are numbered 0 through 7 rather than 1 through 8.

Each child gets values for died1 through died 8.  The initial values are 0. They remain
at 0 if the child survived to the date of interview.

The other possible values are 1 and 0.5: 1 if the child died in the time interval, 0.5 
if the age interval at death is divided across two time intervals.

DHS does not calculate months of exposure.  To emphasize that, this program uses the term 
"risk" rather than "exposure".  Each child gets values for risk1 through risk8.  
The initial values are 0.  They remain at 0 for any ages for which the child was not observed 
(within the window of time), because of death or censoring.

The other possible values are 1 and 0.5: 1 if the child was observed for that age and all
such observation was within the time interval, and 0.5 if the observation for that age is
divided across two time intervals.

The risk is not altered for the age interval and time interval(s) in which the child died, 
but risk drops to 0 after that age interval.

In the log probability analysis, .5 is replaced with 1 and the weight is multiplied by .5.

*/

prepare_child_file

use temp3.dta, replace

* Values of risk and died are calculated in the following loop.  It could be streamlined
*  somewhat but you must be careful or you will lose the match with DHS.

local i=1
while `i'<=nageints {
gen died`i'=.

* age interval is entirely in the time window
replace died`i'=1  if age_int_at_death==`i' &  dod_2<=end_date   & dod_1>=start_date

* age interval is partly in the time window and partly in the previous time window
replace died`i'=.5 if age_int_at_death==`i' &  dod_2>=start_date & dod_1< start_date 

* age interval is partly in the time window and partly in the next time window
replace died`i'=.5 if age_int_at_death==`i' &  dod_2> end_date   & dod_1<=end_date

gen risk`i'=.

* age interval is entirely in the time window
replace risk`i'=1  if (age_int_at_death>=`i' | age_at_death==.) & dob+end_`i'<=end_date   & dob+start_`i'>=start_date

* age interval is partly in the time window and partly in the previous time window
replace risk`i'=.5 if (age_int_at_death>`i'  | age_at_death==.) & dob+end_`i'>=start_date & dob+start_`i'< start_date 

* age interval is partly in the time window and partly in the next time window
replace risk`i'=.5 if (age_int_at_death>`i'  | age_at_death==.) & dob+end_`i'> end_date   & dob+start_`i'<=end_date

* child dies in this interval
* age interval is partly in the time window and partly in the previous time window
replace risk`i'=.5 if age_int_at_death==`i' & end_`i'>=start_date & dob+start_`i'< start_date 

* age interval is partly in the time window and partly in the next time window
replace risk`i'=.5 if age_int_at_death==`i' & end_`i'> end_date   & dob+start_`i'<=end_date

* Next line will change risk from 0 to .5 for a handful of cases in which died=.5 and risk=0;
*   it should not be necessary and suggests an error above, but ensures
*   a perfect match with the DHS programs.

replace risk`i'=.5 if died`i'==.5 

* the preceding lines produce some values of died and risk that should be changed from missing to 0
replace died`i'=0 if died`i'==. & risk`i'>0 & risk`i'<=1
replace risk`i'=0 if risk`i'==. & died`i'>0 & died`i'<=1

* SEE NOTE BELOW!!
*replace died`i'=.5 if risk`i'==.5 & died`i'==1
replace risk`i'=1 if died`i'==1

gen wt_died`i'=wt*died`i'
gen wt_risk`i'=wt*risk`i'

* To this point, everything was unweighted. Now carry along both wtd and unwtd values. 
* Weighted values can be used in calc_rates_division.
* Unweighted values are used in the logprob model, which will include weights in the model.

ren died`i' unwtd_died`i'
ren risk`i' unwtd_risk`i'

local i=`i'+1
}

/*
IN THE CSPRO AND SAS AND SPSS PROGRAMS IT CAN HAPPEN THAT RISK=.5 AND DIED=1. 
FROM THE PERSPECTIVE OF PROBABILITY THEORY, THIS IS AN ERROR.
IN THOSE PROGRAMS IT ONLY HAPPENS IF THE CHILD DIED IN THE MOST RECENT TIME 
INTERVAL AND IN AN AGE GROUP THAT WOULD OTHERWISE HAVE BEEN CENSORED BY THE INTERVIEW.
The above line "replace risk`i'=1 if died`i'==1" corrects this, possibly introducing 
a small difference from published estimates.
*/

gen v_run_number=_n
gen v_lw=.
gen v_uw=.
gen v_refcmc=.

gen str10 variable="All"
gen value=.
label variable value "All"

* se`i' is the (estimated, of course) se of the log of rate i

local i=1 
while `i'<=nageints {
gen v_rate`i'=.
gen v_se`i'=.
local i=`i'+1
}

sort caseid bidx

save risk_and_deaths.dta, replace
end

******************************************************************************
program define calc_rates_division

/*

THIS ROUTINE IS NOT ACTUALLY CALLED BUT IS INCLUDED HERE FOR POSSIBLE USE

It may have to be revised because it has not been used for a long time....

This routine produces one line for each combination of time and category of covariate.

It calculates rates simply by aggregating deaths and risk, 
  and dividing the sum of the deaths by the sum of the risk. 

Only the WEIGHTED deaths and risk are used here.

By contrast, the the log probability approach uses unweighted deaths and risk
  and does the weighting with pweight.
*/

keep v_lw v_uw value variable wtd_died* wtd_risk*

*collapse (sum) wtd_died* wtd_risk*, by(time value variable)
collapse (sum) wtd_died* wtd_risk*, by(value variable)

* calculate rates

local i=1
while `i'<=nageints {
gen rate`i'=wtd_died`i'/wtd_risk`i'
local i=`i'+1
}

format %10.2f wtd_died* wtd_risk*

* Unweighted numerators and denominators can be useful to gauge statistical stability
list lw uw variable value wtd_died1 wtd_died2 wtd_died3 wtd_died4, table clean
list lw uw variable value wtd_died5 wtd_died6 wtd_died7 wtd_died8, table clean
list lw uw variable value wtd_risk1 wtd_risk2 wtd_risk3 wtd_risk4, table clean
list lw uw variable value wtd_risk5 wtd_risk6 wtd_risk7 wtd_risk8, table clean

end

******************************************************************************
program define calc_5_from_8

* Routine to calculate the various q's from rate1-rate8

ren rate1 q_month_0
ren rate2 q_month_1to2
ren rate3 q_month_3to5
ren rate4 q_month_6to11
ren rate5 q_year_1
ren rate6 q_year_2
ren rate7 q_year_3
ren rate8 q_year_4

* calculate the compound q's

gen neonatal=q_month_0
gen prob_1q0=1-(1-q_month_0)*(1-q_month_1to2)*(1-q_month_3to5)*(1-q_month_6to11)
gen postneonatal=prob_1q0-neonatal
gen prob_4q1=1-(1-q_year_1)*(1-q_year_2)*(1-q_year_3)*(1-q_year_4)
gen prob_5q0=1-(1-prob_1q0)*(1-prob_4q1)

end

******************************************************************************

program define save_results

* This routine fills in the variables that were set up earlier,
*  using values calculated in calc_rates

scalar run_number=run_number+1

replace v_lw=lw                               if v_run_number==run_number
replace v_uw=uw                               if v_run_number==run_number
replace v_refcmc=refcmc_mean                  if v_run_number==run_number

replace value=code                            if v_run_number==run_number

local i=1
while `i'<=nageints {
replace v_rate`i'=rate`i'                     if v_run_number==run_number
replace v_se`i'=se`i'                         if v_run_number==run_number
local i=`i'+1
}

end

******************************************************************************

program define logprob, byable(recall)
marksample touse

/* 
this routine produces one line for each combination of time and category of covariate

each line gives results for all of the age intervals

it calculates rates using log probability models (glm, log link, binomial error)

the UNWEIGHTED deaths and risk are used, with adjustments 
 for weights, clusters, and strata within the model
*/

local i=1
quietly while `i'<=nageints {
quietly summarize unwtd_died`i' if `touse'
scalar died`i'tot=r(sum)
quietly summarize unwtd_risk`i' if `touse'
scalar risk`i'tot=r(sum)

quietly summarize value if `touse'
scalar code=r(mean)

scalar rate`i'=0
scalar se`i'=0

if died`i'tot>0 & risk`i'tot>0 & died`i'tot<. & risk`i'tot<. {
***********************
quietly glm unwtd_died`i' if `touse' [pweight=v005], family(binomial unwtd_risk`i') link(log) cluster(v001)
***********************
matrix bb=e(b)
scalar rate`i'=exp(bb[1,1])
matrix sesq=e(V)
scalar se`i'=sqrt(sesq[1,1])
}

local i=`i'+1
}

save_results

end

******************************************************************************

program define calc_q_after_logprob

* Routine that uses rate1-rate8 to get the various q's following the log prob model.
* First part is exactly the same as calc_5_from_8.  The additional part is for the
*   calculation of standard errors and confidence intervals.

calc_5_from_8

/*
Here is a complete list of the q's (plus postneonatal, which is not a q!)
q_month_0
q_month_1to2
q_month_3to5
q_month_6to11
q_year_1
q_year_2
q_year_3
q_year_4

neonatal 
postneonatal 
prob_1q0 
prob_4q1 
prob_5q0
*/

* calculate confidence intervals for the 8 basic q's

local i=1
while `i'<=nageints {
gen factor`i'=exp(1.96*se`i')
local i=`i'+1
}

gen L_q_month_0=q_month_0/factor1
gen L_q_month_1to2=q_month_1to2/factor2
gen L_q_month_3to5=q_month_3to5/factor3
gen L_q_month_6to11=q_month_6to11/factor4
gen L_q_year_1=q_year_1/factor5
gen L_q_year_2=q_year_2/factor6
gen L_q_year_3=q_year_3/factor7
gen L_q_year_4=q_year_4/factor8

gen H_q_month_0=q_month_0*factor1
gen H_q_month_1to2=q_month_1to2*factor2
gen H_q_month_3to5=q_month_3to5*factor3
gen H_q_month_6to11=q_month_6to11*factor4
gen H_q_year_1=q_year_1*factor5
gen H_q_year_2=q_year_2*factor6
gen H_q_year_3=q_year_3*factor7
gen H_q_year_4=q_year_4*factor8

drop factor*

* a procedure to estimate the se and ci's for the compound rates using the se 
*  of the 8 basic rates is in a separate do file, child_mortality_ci_do.txt

end

**********************************************************

program define partial_file_save

drop if v_run_number==.
keep v*

if run_number>1 {
append using partial_results.dta
}

save partial_results.dta, replace
use risk_and_deaths.dta, clear
end

**********************************************************

program define final_file_save

use partial_results.dta, clear

* construct two Stata data files that save the results.
* The first includes confidence intervals, the second does not

renpfix v_
drop if lw==.

calc_q_after_logprob

sort lw uw variable value
format q* se* L* H* p* neo* %6.4f
format lw uw value %5.0f

scalar results_with_ci_="results_with_ci_"
scalar cid=v000[1]
scalar dotdta=".dta"
scalar sfn=results_with_ci_+cid+dotdta
local lfn=sfn
save `lfn', replace

keep lw uw refcmc variable value neonatal postneonatal prob*
gen str4 v000=v000_string
* If all the interviews were in Jan. 1900, i.e. cmc 1, they would be placed
*  at the middle of January, i.e. one-half of one month into 1900. 
* Thus the reference doi doi must be adjusted downwards by one-half month.

gen doi=1900+(doi_mean-.5)/12

* The reference date of interview would initially be calculated as starting month plus ending 
*  month divided by 2 (which has already been calculated as mean_refcmc.  Say the 
*  starting month is 1 (January) and the ending month is 11 (December).  The interval goes from 
*  the start of the start month to the end of end month, so the mean is July 1, i.e. month 7.0.  
*  But expressed as the fraction of the year, it is .5.  In years, the correct ref date will be
*  1900 + mean_refcmc/12.   

gen refdate=1900+refcmc/12

format doi refdate %8.2f

list v000 doi refdate lw uw variable value neonatal postneonatal prob*, table clean compress

scalar results_="results_"
scalar scid=v000[1]
scalar sfnend=".dta"
scalar sfn=results_+scid+sfnend
local lfn=sfn
save `lfn', replace

end

******************************************************************************
******************************************************************************
******************************************************************************
******************************************************************************
******************************************************************************
* EXECUTION BEGINS HERE

/*
Input file can be either IR or BR; check the end of the setup routine for seven 
  extra lines, including reshape, that are to be included if the input file is IR.

v002, v003 can be helpful for linking to other files, otherwise not needed
v001 is the cluster variable
v005 is the weight variable
v023 is the stratum variable (usually but not always; can usually construct the strata
 aka domains as combinations of urban/rural and region)

Include v025 to illustrate inclusion of a covariate
*/

* Nigeria 2003
*use c:\DHS\DHS_data\BR_files\NGBR4BFL.dta

* Nigeria 2008
use c:\DHS\DHS_data\BR_files\NGBR52FL.dta

keep caseid v000 v001 v002 v003 v005 v008 bidx* b3* b7* v024 v025 v106 v190

gen v106r=v106
replace v106r=2 if v106==3

save temp.dta, replace

quietly setup


* Generic lines for setting the time interval and running on the total, for 
*   the first table in the mortality chapter of the main report

scalar lw=-4
scalar uw=0
quietly make_risk_and_deaths
quietly logprob
partial_file_save

replace value=v024
label variable value "v024"
replace variable="Region"
quietly bysort v024: logprob
partial_file_save

scalar lw=-9
scalar uw=-5
quietly make_risk_and_deaths
quietly logprob
partial_file_save

replace value=v024
label variable value "v024"
replace variable="Region"
quietly bysort v024: logprob
partial_file_save

scalar lw=-9
scalar uw=0
quietly make_risk_and_deaths
quietly logprob
partial_file_save

replace value=v024
label variable value "v024"
replace variable="Region"
quietly bysort v024: logprob
partial_file_save


* Generic lines for setting the time interval and running on the total and
*   covariates for the second table in the mortality chapter of the main report


* the next line is essential at the end of the run
final_file_save


* the next lines are optional; "ci" for "confidence intervals"; must have
*   child_mortality_ci_date_do.txt in the path
scalar results_with_ci_="results_with_ci_"
scalar cid=v000[1]
scalar dotdta=".dta"
scalar sfn=results_with_ci_+cid+dotdta
local lfn=sfn
use `lfn', clear
log close
do child_mortality_ci_25March2013_do.txt


**************************************************************
*END OF PROGRAM***********************************************
**************************************************************