Home » Data » Dataset use in Stata » Strange Issues w/ Data Formatting from DHS (Data received from DHS seems to be formatted in a way that makes data extraction impossible (longer description below))
Strange Issues w/ Data Formatting from DHS [message #29033] |
Sat, 13 April 2024 08:17 |
tednoel
Messages: 7 Registered: April 2024
|
Member |
|
|
Hi all, I hope this message finds everyone in good health. I am currently a Master's student in my final semester. I am using DHS data for my thesis. In the interest of simplicity, I will break down the multifaceted problem I am having below. I hope someone might be able to find the time to help me with this strange issue.
Objective: I created an account with DHS and downloaded the data. The goal with the data downloaded is to disaggregate countries into survey "round" (year of survey) and a handful of variables from that round so that I can then merge each respective round with its' shape file (since these change across time for each country). This is important because I will need to merge data via geographic coordinates for my thesis, which is exploring the impact of environmental variables (ex. precipitation rate) on the propensity of marriage under the age of 18 across Sub-Saharan Africa.
Problem: I bulk downloaded survey and geographic data for every African countries where this was available. I decided to start working with one country only so that I could clear any issues with the code before replicating the process for the rest of the countries. To simplify the process, I grouped the bulk downloaded data into its' respective countries and tried to import batches to STATA to work with. The problem begins with the first country I attempted to work with, Tanzania. While I was able to unzip all the files in STATA, this was the furthest I was able to get because what ensued was a bizarre game of smoke and mirrors with the files. For efficiency, I have listed the most major problems below:
1. In the expanded and unzipped files, sometimes I would see a file that does not have a .dta listed, yet, when I would manually go into this file through my Finder just to double check, there would be a .dta file.
2. There are also situations where an expanded/unzipped file would list its' contents as including a dofile, and when I would go through my Finder to manually ensure that this was there, there would be nothing within the contents of the file.
3. Perhaps the largest issue is that it is impossible to run the do file importing the datasets of .dta files because every single path is different inside those files (not possible to write an extraction loop). I made a list of some of the different paths of the .dta files so anyone reading can better understand the issue. This means that I can't get variable lists into STATA.
Below is the code I have used in STATA:
cd "/Users/tbear/Desktop/M2 Thesis/DHSDATA/Tanzania"
capture log close
log using "D:\Niveen Wrking Files\Feps files\FEPS Teaching Files\Year 23-24\MDE\teddi\unzipfiles.log", replace
** [1] Unrar/Unzip all files under the main "DHSDATA" folder
* You need first to run this two lines to make STATA able to extract rar files
shell set path="C:\Program Files\WinRAR"; %path% & unrar e "*"
** some errors resulted while extracting the zip files:
* Zip files under which also contains another zip files - 7 files:
/*
"SNBR70FL"
"SNCR7IDT"
"SNCR7IFL"
"SNCR70DT"
"SNCR70FL"
"SNBR7IFL"
"SNBR70DT"
*/
* they can be extracted manually, then copy their contents zip files back into the main folder "DHSDATA"
* Now unzipping command will work
local path "/Users/tbear/Desktop/M2 Thesis/DHSDATA/Tanzania"
local filelist : dir "`path'" files "*.zip", respectcase
foreach file of local filelist {
unzipfile `file', replace
}
** [2] Extract all the "dta" files in each subfolder under "DHSDATA" folder
* make new folder in which all "dta" files will be saved
global usefile "/Users/tbear/Desktop/M2 Thesis/DHSDATA/Tanzania"
capture mkdir "/Users/tbear/Desktop/M2 Thesis/DHSDATA/Tanzania/Tanzania_dta"
clear
capture set maxvar 100000
local filelist : dir "$usefile" files "*.DTA", respectcase
foreach file of local filelist {
quietly use "`file'", clear
* save each "data" files into the new folder that we made in the first step
save "Tanzania_dta/`file'", replace
}
local filelist : dir "$usefile" files "*.dta", respectcase
foreach file of local filelist {
quietly use "`file'", clear
* save each "data" files into the new folder that we made in the first step
save "Tanzania_dta/`file'", replace
}
capture log close
clear
**********
Thank you so, so much to anyone that might be able to help!!
|
|
|
|
|
|
Re: Strange Issues w/ Data Formatting from DHS [message #29139 is a reply to message #29068] |
Mon, 29 April 2024 08:04 |
tednoel
Messages: 7 Registered: April 2024
|
Member |
|
|
Hi Trevor, THANK YOU SO MUCH. I have been mostly able to solve the problem thanks to the help you have given me. I have one remaining challenge in order to move forward and that is the merging of the different survey data sets for each survey round. So, to be clear, I am interested in controlling for wealth in my proportional hazard. This means that I will have to combine household data and individual data, as indicators for wealth do not exist in the individual recode for Tanzania 1999 (the year I'm starting this data work with). I have read from Tom Pullum in another part of this forum that "it would be virtually impossible to merge them [individual survey] with the HR file, which has households as units. You should use the PR file, which has individual household members as units, rather than the HR file." However, it's not clear to me that wealth is included in the PR file, either. To make matters a bit more confusing, I have seen that this type of merge is possible elsewhere on the internet, for example: https://www.researchgate.net/post/How_can_I_merge_Household_ database_to_Women_data_base_in_the_DHS_data_using_stata
Do you mind clarifying if it is possible to merge household and individual level data? This is the only way I will be able to control for wealth in my proportional hazards analysis. Thank you so much in advance for your time.
|
|
|
|
|
Re: Strange Issues w/ Data Formatting from DHS [message #29156 is a reply to message #29145] |
Wed, 01 May 2024 12:02 |
tednoel
Messages: 7 Registered: April 2024
|
Member |
|
|
Hi, thanks so much for the response. I've been a bit stumped by this merging process because there are three different datasets (IR, PR, and Wealth Index) that I have to combine and I'm a bit confused as to which one should be my base for merging. I was going to make the IR my base for merging because I've seen code that allows for the renaming of the cluster number, household number, and respondent's line number such as this:
use "/Users/tbear/Desktop/THESIS DATA/Tanzania_1999/Tanzania_1999_dta/TZIR41FL.dta", clear
* keep the variables you want
keep v0*
sort v001 v002 v003
save e:/Users/tbear/Desktop/THESISDATA/Tanzania_1999/Tanzania_199 9_dta/TZIR41FL.dta, replace
* Prepare PR file and merge
use "/Users/tbear/Desktop/THESIS DATA/Tanzania_1999/Tanzania_1999_dta/TZPR41FL.dta", clear
* reduce to women who are eligible for the IR file
keep if hv117==1
* keep the variables you want
keep hv0* sa33 sh*
rename hv001 v001
rename hv002 v002
rename hv003 v003
sort v001 v002 v003
merge v001 v002 v003 using ***Not entirely clear what I should be using here
tab _merge
******
BUT the problem is the wealth index only has the hhid variable that I can use to merge- and the IR file does not have this, only the PR file does. Should I be using the PR file as my base, merging the IR file, and then appending the wealth index?
Thank you so much for all of your help.
|
|
|
Re: Strange Issues w/ Data Formatting from DHS [message #29157 is a reply to message #29156] |
Wed, 01 May 2024 13:01 |
Trevor-DHS
Messages: 793 Registered: January 2013
|
Senior Member |
|
|
Hi
A few notes:
1) It looks like you are opening the IR file, then keeping just a few variables and sorting the file, and then overwriting the original file. This is generally not considered good practice as you are modifying the original file. Generally, you should start with your original file, but save to an interim file with a different name or in a different folder (or both).
2) The naming of your THESISDATA folder seems to vary - in two cases it has a space between THESIS and DATA and in one case it doesn't (this may be a display issue in the user forum as it occasionally puts extra blanks into the text).
3) In terms of the order of merging, I would start by merging the wealth index to the PR file and saving your output to an intermediate file. They both should have hhid so you should be able to merge those without problem. Then merge the info from the PR/wealth data onto the IR file. Below is a rough outline of the process (I haven't tested this, so there may be some bugs - this is just to give you the order of operations):
use TZWIxxxx.dta
sort hhid
save TZWIxxxx.dta, replace
use TZPRxxxx.dta, clear
* keep the variables you want from the PR file
keep hhid hv0* ...
sort hhid
merge m:1 hhid using TZWIxxxx.dta
clonevar v001 = hv001
clonevar v002 = hv002
clonevar v003 = hvidx
sort v001 v002 v003
save TZPRxxxx_temp.dta
use TZIRxxxx.dta, clear
sort v001 v002 v003
merge 1:1 v001 v002 v003 using TZPRxxxx_temp.dta
It is also possible to construct hhid from hv001 and hv002 or from v001 and v002, and vie versa, but I don't think you need to.
[Updated on: Wed, 01 May 2024 13:02] Report message to a moderator
|
|
|
|
|
|
Re: Strange Issues w/ Data Formatting from DHS [message #29167 is a reply to message #29166] |
Thu, 02 May 2024 10:03 |
Trevor-DHS
Messages: 793 Registered: January 2013
|
Senior Member |
|
|
Hi
The problem is that hhid does not exist in the WI file, but whhid does. You had the right idea when you created whhid in the PR data using the clonevar statement, but then you dropped it immediately by not including it in the keep statement. So you need to include whhid in the keep statement and then use it in the next few statements as follows:
use TZPR41FL.dta, clear
* keep the variables you want from the PR file
clonevar whhid = hhid
keep whhid hhid hv005 hv007 hv025 hv219
sort whhid
merge m:1 whhid using TZWI41FL.dta
|
|
|
Goto Forum:
Current Time: Thu May 2 14:22:21 Coordinated Universal Time 2024
|