* do e:\DHS\programs\U5_mortality\child_mortality_ci_21March2013_do.txt

cd e:\DHS\programs\U5_mortality
set logtype text
log using child_mortality_ci_21March2013_log.txt, replace
set more off

/*

Tom Pullum, tpullum@icfi.com, Nov. 30, 2011, then updated

This program estimates 95% confidence intervals for the DHS child mortality rates produced by the 
log probability model in child_mortality_do_date.txt.  

In all of the following, "standard error" refers to an estimated standard error and 
  "confidence interval" refers to a 95% confidence interval

The input file for this program is an output file produced by the child_mortality program.  This program will
require some modification if there is a departure from the basic 8 age intervals normally used
by DHS, for months 0, 1-2, 3-5, 6-11, 12-23, 24-35, 36-47, 48-59.

The file "q_output_logprob...dta" produced by child_mortality_do_date.txt includes the following variables

Interval   Months   q                std. error    Low end of 95% ci High end of 95% ci 
1          0        q_month_0        sterr1        L_q_month_0       H_q_month_0
2          1-2      q_month_1to2     sterr2        L_q_month_1to2    H_q_month_1to2
3          3-5      q_month_3to5     sterr3        L_q_month_3to5    H_q_month_3to5
4          6-11     q_month_6to11    sterr4        L_q_month_6to11   H_q_month_6to11
5          12-23    q_year_1         sterr5        L_q_year_1        H_q_year_1
6          24-35    q_year_2         sterr6        L_q_year_2        H_q_year_2
7          36-47    q_year_3         sterr7        L_q_year_3        H_q_year_3
8          48-59    q_year_4         sterr8        L_q_year_4        H_q_year_4
 

In this list, sterr1 is the standard error of the natural log of q_month_0, and NOT of 
  q_month_0 itself, for example, and

  L_q_month_0 = L_q_month_0/exp(1.96*sterr1) and  
  H_q_month_0 = H_q_month_0*exp(1.96*sterr1)

These confidence intervals are adjusted for the weights, clustering, and stratification 
 (if possible), using the svyset command.  The q's are estimated with a log probability model 
 (GLM with binomial error and log link) and match the standard DHS estimates exactly.

These eight basic rates are converted to a set of five standard rates:

neonatal =     q_month_0
prob_1q0 =     1-(1-q_month_0)*(1-q_month_1to2)*(1-q_month_3to5)*(1-q_month_6to11)
postneonatal = prob_1q0-neonatal
prob_4q1 =     1-(1-q_year_1)*(1-q_year_2)*(1-q_year_3)*(1-q_year_4)
prob_5q1 =     1-(1-prob_1q0)*(1-prob_4q1)
         =     1-(1-q_month_0)*(1-q_month_1to2)*(1-q_month_3to5)*(1-q_month_6to11)*
                 (1-q_year_1)*(1-q_year_2)*(1-q_year_3)*(1-q_year_4)

The ci for the neonatal rate is simply (L_q_month_0 , H_q_month_0)

The ci for the postneonatal rate is complicated by the fact that the postneonatal rate is calculated
  as a difference between the two other rates, and those rates are not at all independent of each other.

I would recommend calculating 

pseudo-postnatal = 1-(1-q_month_1to2)*(1-q_month_3to5)*(1-q_month_6to11), 

then calculating a ci for pseudo-postneonatal using the methods described below, and
then re-centering that ci around the point estimate of the postneonatal rate.

To summarize, we need confidence intervals for the following compound rates:

prob_1q0 =     1-(1-q_month_0)*(1-q_month_1to2)*(1-q_month_3to5)*(1-q_month_6to11)

pseudo postneonatal = 1-(1-q_month_1to2)*(1-q_month_3to5)*(1-q_month_6to11)

prob_4q1 =     1-(1-q_year_1)*(1-q_year_2)*(1-q_year_3)*(1-q_year_4)

prob_5q1 =     1-(1-q_month_0)*(1-q_month_1to2)*(1-q_month_3to5)*(1-q_month_6to11)*
                 (1-q_year_1)*(1-q_year_2)*(1-q_year_3)*(1-q_year_4)


The following procedure is proposed for estimating these confidence intervals.  Alternative
  approaches are definitely possible.  My procedure is simply to "map" confidence intervals
  for basic q's into a confidence interval for a function of those q's.    

The logs of the 8 basic rates are the coefficients estimated by the log probability model.  
  They are labeled lnq1 through lnq8.  For example, lnq1=log(q_month_0), etc.

Refer to 1q0=IMR as just q, without a subscript.  Then 

q=1-(1-q1)*(1-q2)*(1-q3)*(1-q4)=1-p1*p2*p3*p4, where pi=1-qi (i=1,2,3,4) or

p=p1*p2*p3*p4, where p=1-q.  Taking logs, 

lnp=lnp1+lnp2+lnp3+lnp4, where lnpi=log(pi) (i=1,2,3,4) and lnp=log(p) 

It appears (evidence can be provided) that the q's (and p's) for the age intervals are
  close to being independent.  Their covariances are very small, and sometimes positive
  and sometimes negative, so if they were included, they would tend to cancel out.  I think
  it will be a good approximation to say that the variances are related simply as

var(lnp)=var(lnp1)+var(lnp2)+var(lnp3)+var(lnp4)

Therefore, for the IMR: begin with the a confidence interval for ln(qi), which is given in the 
  Stata output as the ci for the coefficient in the log probability model.

Step
 1--exponentiate this ci to get a ci for qi (i=1,2,3,4) (already calculated)
 2--subtract from 1 to get a confidence interval for pi=1-qi (the low and high ends will reverse)
 3--log this ci to get a ci for ln(pi)
 4--subtract L from H and divide by 2(1.96) to get an se for ln(pi)
 5--square to get variance of ln(pi)
 6--add up the variances of ln(p1), ln(p2), ln(p3), ln(p4)
 7--take the square root of the sum; this will be the est. se. of ln(1-IMR)
 8--use this se to calculate a ci for ln(1-IMR)
 9--exponentiate this ci to get a ci for 1-IMR
10--subtract from 1 to get a ci for IMR (the low and high ends will reverse)

*/

*********************************************************
program define make_ci_1q0

*list L_q_month*, table clean
*list H_q_month*, table clean

* Step 1

gen L_q1=L_q_month_0
gen L_q2=L_q_month_1to2
gen L_q3=L_q_month_3to5
gen L_q4=L_q_month_6to11

gen H_q1=H_q_month_0
gen H_q2=H_q_month_1to2
gen H_q3=H_q_month_3to5
gen H_q4=H_q_month_6to11

*list L_q1-L_q4, table clean
*list H_q1-H_q4, table clean

* Step 2

gen L_p1=1-H_q1
gen L_p2=1-H_q2
gen L_p3=1-H_q3
gen L_p4=1-H_q4

gen H_p1=1-L_q1
gen H_p2=1-L_q2
gen H_p3=1-L_q3
gen H_p4=1-L_q4

*list L_p*, table clean
*list H_p*, table clean

* Step 3

gen L_lnp1=log(L_p1)
gen L_lnp2=log(L_p2)
gen L_lnp3=log(L_p3)
gen L_lnp4=log(L_p4)

gen H_lnp1=log(H_p1)
gen H_lnp2=log(H_p2)
gen H_lnp3=log(H_p3)
gen H_lnp4=log(H_p4)

*list L_lnp*, table clean
*list H_lnp*, table clean

* Step 4

gen selnp1=(H_lnp1-L_lnp1)/(2*1.96)
gen selnp2=(H_lnp2-L_lnp2)/(2*1.96)
gen selnp3=(H_lnp3-L_lnp3)/(2*1.96)
gen selnp4=(H_lnp4-L_lnp4)/(2*1.96)

*list selnp*, table clean

* Step 5

gen varlnp1=selnp1^2
gen varlnp2=selnp2^2
gen varlnp3=selnp3^2
gen varlnp4=selnp4^2

*list varlnp*, table clean

* Step 6

gen varln1q0=varlnp1+varlnp2+varlnp3+varlnp4

*list varln1q0, table clean

* Step 7

gen se_ln_one_minus_1q0=sqrt(varln1q0)

*list se_ln_one_minus_1q0, table clean

* Step 8

gen ln_one_minus_1q0=log(1-prob_1q0)

gen L_ln_one_minus_1q0=ln_one_minus_1q0-1.96*se_ln_one_minus_1q0
gen H_ln_one_minus_1q0=ln_one_minus_1q0+1.96*se_ln_one_minus_1q0

*list L_ln_one_minus_1q0 H_ln_one_minus_1q0, table clean

* Step 9

gen L_one_minus_1q0=exp(L_ln_one_minus_1q0)
gen H_one_minus_1q0=exp(H_ln_one_minus_1q0)

*list L_one_minus_1q0 H_one_minus_1q0, table clean

* Step 10

gen L_1q0=1-H_one_minus_1q0
gen H_1q0=1-L_one_minus_1q0

list lw uw variable value L_1q0 prob_1q0 H_1q0, table clean
end
*********************************************************
program define make_ci_4q1
* same as make_ci_1q0 but with 4 age intervals of one year each

* Step 1

gen L_q5=L_q_year_1
gen L_q6=L_q_year_2
gen L_q7=L_q_year_3
gen L_q8=L_q_year_4

gen H_q5=H_q_year_1
gen H_q6=H_q_year_2
gen H_q7=H_q_year_3
gen H_q8=H_q_year_4

* Step 2

gen L_p5=1-H_q5
gen L_p6=1-H_q6
gen L_p7=1-H_q7
gen L_p8=1-H_q8

gen H_p5=1-L_q5
gen H_p6=1-L_q6
gen H_p7=1-L_q7
gen H_p8=1-L_q8

* Step 3

gen L_lnp5=log(L_p5)
gen L_lnp6=log(L_p6)
gen L_lnp7=log(L_p7)
gen L_lnp8=log(L_p8)

gen H_lnp5=log(H_p5)
gen H_lnp6=log(H_p6)
gen H_lnp7=log(H_p7)
gen H_lnp8=log(H_p8)

* Step 4

gen selnp5=(H_lnp5-L_lnp5)/(2*1.96)
gen selnp6=(H_lnp6-L_lnp6)/(2*1.96)
gen selnp7=(H_lnp7-L_lnp7)/(2*1.96)
gen selnp8=(H_lnp8-L_lnp8)/(2*1.96)

* Step 5

gen varlnp5=selnp5^2
gen varlnp6=selnp6^2
gen varlnp7=selnp7^2
gen varlnp8=selnp8^2

* Step 6

gen varln4q1=varlnp5+varlnp6+varlnp7+varlnp8

* Step 7

gen se_ln_one_minus_4q1=sqrt(varln4q1)

* Step 8

gen ln_one_minus_4q1=log(1-prob_4q1)

gen L_ln_one_minus_4q1=ln_one_minus_4q1-1.96*se_ln_one_minus_4q1
gen H_ln_one_minus_4q1=ln_one_minus_4q1+1.96*se_ln_one_minus_4q1

* Step 9

gen L_one_minus_4q1=exp(L_ln_one_minus_4q1)
gen H_one_minus_4q1=exp(H_ln_one_minus_4q1)

* Step 10

gen L_4q1=1-H_one_minus_4q1
gen H_4q1=1-L_one_minus_4q1

list lw uw variable value L_4q1 prob_4q1 H_4q1, table clean
end
*********************************************************
program define make_ci_5q0
* same as make_ci_1q0 but with 8 age intervals

* skip over steps 1-5

* Step 6

gen varln5q0=varlnp1+varlnp2+varlnp3+varlnp4+varlnp5+varlnp6+varlnp7+varlnp8

* Step 7

gen se_ln_one_minus_5q0=sqrt(varln5q0)

* Step 8

gen ln_one_minus_5q0=log(1-prob_5q0)

gen L_ln_one_minus_5q0=ln_one_minus_5q0-1.96*se_ln_one_minus_5q0
gen H_ln_one_minus_5q0=ln_one_minus_5q0+1.96*se_ln_one_minus_5q0

* Step 9

gen L_one_minus_5q0=exp(L_ln_one_minus_5q0)
gen H_one_minus_5q0=exp(H_ln_one_minus_5q0)

* Step 10

gen L_5q0=1-H_one_minus_5q0
gen H_5q0=1-L_one_minus_5q0

list lw uw variable value L_5q0 prob_5q0 H_5q0, table clean
end
*********************************************************
program define make_ci_postneonatal
* same as make_ci_1q0 but with age intervals 2-4 only

* strategy: construct an interval around the q for months 1-11,
*   then re-center on the postneonatal rate

* skip over steps 1-5

gen prob_pnn=1-(1-q_month_1to2)*(1-q_month_3to5)*(1-q_month_6to11)

* Step 6

gen varlnpnn=varlnp2+varlnp3+varlnp4

* Step 7

gen se_ln_one_minus_pnn=sqrt(varlnpnn)

* Step 8

gen ln_one_minus_pnn=log(1-prob_pnn)

gen L_ln_one_minus_pnn=ln_one_minus_pnn-1.96*se_ln_one_minus_pnn
gen H_ln_one_minus_pnn=ln_one_minus_pnn+1.96*se_ln_one_minus_pnn

* Step 9

gen L_one_minus_pnn=exp(L_ln_one_minus_pnn)
gen H_one_minus_pnn=exp(H_ln_one_minus_pnn)

* Step 10

gen L_pnn=1-H_one_minus_pnn
gen H_pnn=1-L_one_minus_pnn

list lw uw variable value L_pnn prob_pnn H_pnn, table clean

gen L_pnn_true=postneonatal-(prob_pnn-L_pnn)
gen H_pnn_true=postneonatal+(H_pnn-prob_pnn)

list lw uw variable value L_pnn prob_pnn H_pnn, table clean
list lw uw variable value L_pnn_true postneonatal H_pnn_true, table clean

end
*********************************************************
*********************************************************

*EXECUTION BEGINS HERE


quietly make_ci_1q0
quietly make_ci_4q1
quietly make_ci_5q0
quietly make_ci_postneonatal

list lw uw variable value L_q_month_0 q_month_0 H_q_month_0, table clean
list lw uw variable value L_pnn_true postneonatal H_pnn_true, table clean
list lw uw variable value L_1q0 prob_1q0 H_1q0, table clean
list lw uw variable value L_4q1 prob_4q1 H_4q1, table clean
list lw uw variable value L_5q0 prob_5q0 H_5q0, table clean

replace L_q_month_0=0 if L_q_month_0<0
replace L_pnn_true =0 if L_pnn_true<0
replace L_1q0      =0 if L_1q0<0
replace L_4q1      =0 if L_4q1<0
replace L_5q0      =0 if L_5q0<0

list lw uw variable value L_q_month_0 q_month_0 H_q_month_0, table clean
list lw uw variable value L_pnn_true postneonatal H_pnn_true, table clean
list lw uw variable value L_1q0 prob_1q0 H_1q0, table clean
list lw uw variable value L_4q1 prob_4q1 H_4q1, table clean
list lw uw variable value L_5q0 prob_5q0 H_5q0, table clean

save, replace

/*
Compare with naive estimate

The following approach makes no adjustments for weights, clustering,
stratification, segmentation of the time interval, etc.

In general, the estimated standard error of a proportion p is sqrt[p(1-p)/n], where
n is the denominator of the proportion.

Here we have R=d/E, where a so-called rate is acually an estimated probability, with d 
  deaths in the numerator and exposure or risk E in the denominator.  It can be shown
  that in this case the usual formula can be written as s.e. = R*sqrt[(1-R)/d] .

Go to the q_output_division....dta file and get d=died1+died2+died3+died4 and R=IMR=prob_1q0.  

Calculate the lower and upper ends of the 95% confidence interval and compare with the
  approach used above.


use results_with_ci_ET51.dta, clear
gen d=died1+died2+died3+died4
gen IMR=prob_1q0

gen se_IMR=IMR*sqrt((1-IMR)/d)
gen L_IMR=IMR-1.96*se_IMR
gen H_IMR=IMR+1.96*se_IMR

list lw uw variable value L_IMR IMR H_IMR, table clean
 
* Similarly for other rates

The naive estimates are much narrower than the robust confidence intervals, 
partly because they do not take the design effect into account, but also because
the DHS estimates are constructed from shorter segments with more uncertainty.
*/