Why I am getting different total observations when using iweight for tabulating a variable [message #26098]
Tue, 07 February 2023 12:48
sujata
Messages: 18 Registered: May 2019
Member
I am trying to tabulate the hv025 variable in the PR file for the Indian state of Punjab. After svyset, tabulating this variable gives only proportions, not absolute values. I understand that svy: ta hv025 and ta hv025 [iw=shweight/1000000] give the same results if we are not interested in standard errors. But the weighted total (67913.878) differs from 68549, the total number of observations. I am putting both results here. 67913.878 is reported as the population size. Why is the population size different from the number of observations (68549)? If I apply aweight, the Total equals the number of observations, 68549. But I should use iweight and not aweight.
gen weight_dis=shweight/1000000
ta hv025 [iw= weight_dis]
    type of |
   place of |
  residence |       Freq.     Percent        Cum.
------------+-----------------------------------
      urban | 26,122.0845       38.46       38.46
      rural |  41,791.793       61.54      100.00
------------+-----------------------------------
      Total |  67,913.878      100.00
svyset [pw= weight_dis], psu( hv021) strata( hv022)
svy:ta hv025
(running tabulate on estimation sample)

Number of strata =  88            Number of obs   =     68,549
Number of PSUs   = 915            Population size = 67,913.878
                                  Design df       =        827

----------------------
  type of |
 place of |
residence | proportion
----------+-----------
    urban |      .3846
    rural |      .6154
          |
    Total |          1
----------------------
Key: proportion = Cell proportion

Re: Why I am getting different total observations when using iweight for tabulating a variable [message #26116 is a reply to message #26098]
Thu, 09 February 2023 09:11
Bridgette-DHS
Messages: 3199 Registered: February 2013
Senior Member
Following is a response from Senior DHS staff member, Tom Pullum:
When Stata sees "pweight", which is the only type of weight you can use with svyset, it normalizes them to have a mean of 1. Stata does not automatically normalize iweights.
I opened the PR file and entered "tab hv024, summarize(shweight)". I see that the mean of shweight in Punjab (hv024=3) is 989497.01, which after division by 1000000 is .98949701. What's relevant is that this mean is NOT 1. Stata, with pweight, will re-scale to 1. With iweight it will NOT re-scale to 1.
So why does the mean of shweight differ from 1 (or 1000000) in each of the states? It's because DHS has normalized shweight in the HR file, not the PR file. I confirmed that by opening the HR file and entering "tab hv024, summarize(shweight)". Sure enough, the mean of shweight is 1000000 in the HR file.
Thus the discrepancy you observe arises because DHS normalized shweight with households as the units, whereas you are using the PR file, with individuals as the units, and Stata (with pweight) has re-normalized shweight. Hope this makes sense. Interesting question.
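The arithmetic behind the two totals can be checked on a small made-up dataset. What follows is a minimal Stata sketch, not DHS code: the variable names urban and w and all the values are hypothetical. It relies only on how tabulate treats the two weight types. iweights are summed as given, so the Total is the sum of the weights (N times their mean), while aweights are rescaled to sum to N, which matches the 68549 reported with aweight in the question.

```stata
* Toy data (hypothetical): 4 persons whose weights do not average 1
clear
input byte urban double w
1 0.8
1 0.9
0 1.1
0 1.0
end

* Sum of weights = 3.8, N = 4, mean weight = 0.95
summarize w
display "N x mean weight = " r(N)*r(mean)

* iweight: cell frequencies are raw sums of the weights,
* so the Total row is the sum of the weights, not N
tabulate urban [iw=w]

* aweight: weights are rescaled to sum to N,
* so the Total row equals the number of observations
tabulate urban [aw=w]
```

The percentages are identical in both tables; only the scale of the Freq. column changes.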

Re: Why I am getting different total observations when using iweight for tabulating a variable [message #26119 is a reply to message #26116]
Thu, 09 February 2023 11:45
sujata
Messages: 18 Registered: May 2019
Member
Dear Tom,
Thank you very much for your reply.
I still have one query regarding the number of observations and the population size. Even without svyset, if I tabulate hv025 using iweight=shweight/1000000, the Total is less than the number of observations.
gen weight_dis=shweight/1000000
ta hv025 [iw= weight_dis]
    type of |
   place of |
  residence |       Freq.     Percent        Cum.
------------+-----------------------------------
      urban | 26,122.0845       38.46       38.46
      rural |  41,791.793       61.54      100.00
------------+-----------------------------------
      Total |  67,913.878      100.00
Here the Total is 67913.878, which is less than the actual number of observations, 68,549. Applying aweight gives a total of 68549. Should I apply aweight, or am I doing something wrong? Do I need to merge the PR file with the HR file and use shweight from the HR file?
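For readers with the same question, two options can be sketched on made-up data. This is an illustration under stated assumptions, not a DHS recommendation: the variables stratum, psu, urban, w and w_n and all values are hypothetical stand-ins for hv022, hv021, hv025 and shweight/1000000. The first part uses the count and obs options of svy: tabulate to show weighted counts next to the number of observations. The second part divides the weight by its mean in the analysis sample, so that an iweight tabulation totals N, if that is what is wanted.

```stata
* Toy survey data (hypothetical): 2 strata, 2 PSUs each, 8 persons
clear
input byte stratum byte psu byte urban double w
1 1 1 0.8
1 1 0 0.9
1 2 1 1.1
1 2 0 1.0
2 3 1 0.7
2 3 0 1.2
2 4 1 0.9
2 4 0 1.0
end

* Weighted counts and unweighted observations instead of proportions only
svyset psu [pw=w], strata(stratum)
svy: tabulate urban, count obs format(%9.3f)

* Rescale the weight to mean 1 within the analysis sample
quietly summarize w
generate double w_n = w / r(mean)

* The iweight Total now equals the number of observations (8)
tabulate urban [iw=w_n]
```

Rescaling changes only the Freq. column; the percentages stay the same as with the original weight.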
[Updated on: Thu, 09 February 2023 11:58]

Re: Why I am getting different total observations when using iweight for tabulating a variable [message #26539 is a reply to message #26536]
Thu, 30 March 2023 11:04
Bridgette-DHS
Messages: 3199 Registered: February 2013
Senior Member
Following is a response from Senior DHS staff member, Tom Pullum:
I spent some time looking into your question but can't provide much help. Here are some thoughts.
First, wealth scores such as sv271 are household-specific and are constructed with the HR file. Then in the PR and other individual-level files they are exactly the same for everyone in the same household. When you calculate the fractional rank, using the PR file, you are basically dividing the household's rank by the number of people in the household. I don't know why you would do that. It would seem better to me to use the HR file and skip the calculation of the fractional rank.
Second, I don't know why you would expect the mean of the fractional rank to be 0.5. Is there a mathematical reason for this? Your formula for the fractional rank is not clear to me but I don't see a mathematical reason why the mean would be 0.5.
[Updated on: Fri, 31 March 2023 07:59]

Re: Why I am getting different total observations when using iweight for tabulating a variable [message #26564 is a reply to message #26556]
Sat, 01 April 2023 02:23
sujata
Messages: 18 Registered: May 2019
Member
Dear Tom,
Thank you very much for looking into this.
I understand that this is outside the forum's scope, and I really appreciate that you spared some time for this.
However, I wanted to clarify further my understanding of how to treat the weights in my analysis. I want to ensure that I use them correctly and get accurate results.
Firstly, as per your suggestion, I normalized the weights so that the mean of shweight_PR in the PR file is equal to 1000000.
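* constant, so that the total of unwtd is N*1000000
gen unwtd=1000000
* totals of unwtd and shweight, stored in e(b)
total unwtd shweight
matrix B=e(b)
matrix list B
* scale factor = 1000000 / mean(shweight)
scalar sfactor=B[1,1]/B[1,2]
scalar list sfactor
* rescaled weight with mean 1000000 (up to rounding)
gen shweight_PR=round(sfactor*shweight)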
After that, I generated wgt_shweight_PR= shweight_PR/1000000. The mean of wgt_shweight_PR is 1.
svyset [pw= wgt_shweight_PR ], psu( hv021) strata( hv022)
sum wgt_shweight_PR
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
wgt_shweig~R |     52,682           1    .6350303     .05442   4.638086
egen raw_rank_CE=rank(sv271s), unique
sort raw_rank_CE
qui sum shweight_PR
gen wi = shweight_PR /r(sum)
gen cusum = sum(wi)
gen wj= cusum[_n-1]
replace wj=0 if wj==.
gen rank_CE=wj+0.5*wi
sum rank_CE
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     rank_CE |     52,682    .4857322    .2892787   6.00e-06   .9999868
I am getting the same mean (0.4857) with wgt_shweight_PR as well.
Is it the right way to use weights?
Thank you.

Re: Why I am getting different total observations when using iweight for tabulating a variable [message #26571 is a reply to message #26564]
Mon, 03 April 2023 08:35
Bridgette-DHS
Messages: 3199 Registered: February 2013
Senior Member
Following is a response from Senior DHS staff member, Tom Pullum:
With pweight, separately or within svyset, Stata automatically normalizes the weights to have a mean of 1. You did not have to do that with your construction of wgt_shweight_PR.
I have nothing to add to what I said earlier. I don't know why you are using "egen rank" or why you are calculating fractional weights or why you think you should get .5 instead of .4857. I hope someone else can help.
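On the .5 versus .4857 question, one possible reading is offered here as a sketch, not as a diagnosis confirmed anywhere in the thread. A fractional rank built as "cumulative weight share below plus half of own share" has a weighted mean of exactly 0.5, because the sum over i of wi*(wj + 0.5*wi) equals 0.5*(sum of wi)^2 when the shares wi sum to 1. The plain "sum rank_CE" in the earlier post is an unweighted mean, which need not be 0.5 when weights vary. The toy data below are hypothetical, as are the variable names score, w and rank_f.

```stata
* Toy data (hypothetical): a wealth-type score and unequal weights
clear
input double score double w
-1.2 0.5
-0.4 2.0
 0.3 1.0
 0.9 0.7
 1.5 1.8
end

* Same construction as in the post: weight shares, running sum,
* then share below plus half of own share
sort score
quietly summarize w
generate double wi = w / r(sum)
generate double cusum = sum(wi)
generate double rank_f = cusum - 0.5*wi

* Unweighted mean: not 0.5 in general when weights are unequal
summarize rank_f

* Weighted mean: 0.5 by construction
summarize rank_f [aw=w]
```

If this reading applies, the comparison with 0.5 would be made with "sum rank_CE [aw=wgt_shweight_PR]" rather than the unweighted summarize.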