The DHS Program User Forum
Discussions regarding The DHS Program data and results
Strange Issues w/ Data Formatting from DHS [message #29033] Sat, 13 April 2024 08:17
tednoel
Hi all, I hope this message finds everyone in good health. I am currently a Master's student in my final semester. I am using DHS data for my thesis. In the interest of simplicity, I will break down the multifaceted problem I am having below. I hope someone might be able to find the time to help me with this strange issue.

Objective: I created an account with DHS and downloaded the data. My goal is to split each country's data by survey "round" (year of survey), keep a handful of variables from each round, and then merge each round with its shapefile (since these change over time for each country). This is important because I will need to merge data via geographic coordinates for my thesis, which explores the impact of environmental variables (e.g., precipitation rate) on the propensity of marriage under the age of 18 across Sub-Saharan Africa.

Problem: I bulk downloaded survey and geographic data for every African country where this was available. I decided to start with a single country so that I could resolve any issues with the code before replicating the process for the rest. To simplify the process, I grouped the bulk-downloaded data by country and tried to import batches into Stata. The problem begins with the first country I attempted to work with, Tanzania. While I was able to unzip all the files in Stata, this was as far as I got, because what ensued was a bizarre game of smoke and mirrors with the files. For efficiency, I have listed the major problems below:
1. In the expanded and unzipped files, sometimes I would see a file that does not have a .dta listed, yet, when I would manually go into this file through my Finder just to double check, there would be a .dta file.

[Two screenshots attached]

2. There are also situations where an expanded/unzipped file would list its contents as including a do-file, and when I would go through my Finder to manually confirm that it was there, the file would be empty.

[Two screenshots attached]

3. Perhaps the largest issue is that it is impossible to run the do file importing the datasets of .dta files because every single path is different inside those files (not possible to write an extraction loop). I made a list of some of the different paths of the .dta files so anyone reading can better understand the issue. This means that I can't get variable lists into STATA.

Below is the code I have used in STATA:

cd "/Users/tbear/Desktop/M2 Thesis/DHSDATA/Tanzania"
capture log close
log using "D:\Niveen Wrking Files\Feps files\FEPS Teaching Files\Year 23-24\MDE\teddi\unzipfiles.log", replace


** [1] Unrar/Unzip all files under the main "DHSDATA" folder

* You need first to run this two lines to make STATA able to extract rar files
shell set path="C:\Program Files\WinRAR"; %path% & unrar e "*"

** Some errors occurred while extracting the zip files:
* Zip files that themselves contain other zip files (7 files):
/*
"SNBR70FL"
"SNCR7IDT"
"SNCR7IFL"
"SNCR70DT"
"SNCR70FL"
"SNBR7IFL"
"SNBR70DT"
*/
* they can be extracted manually, then copy their contents zip files back into the main folder "DHSDATA"
* Now unzipping command will work
local path "/Users/tbear/Desktop/M2 Thesis/DHSDATA/Tanzania"
local filelist : dir "`path'" files "*.zip", respectcase
foreach file of local filelist {
unzipfile `file', replace
}

** [2] Extract all the "dta" files in each subfolder under "DHSDATA" folder


* make new folder in which all "dta" files will be saved
global usefile "/Users/tbear/Desktop/M2 Thesis/DHSDATA/Tanzania"
capture mkdir "/Users/tbear/Desktop/M2 Thesis/DHSDATA/Tanzania/Tanzania_dta"
clear
capture set maxvar 100000

local filelist : dir "$usefile" files "*.DTA", respectcase
foreach file of local filelist {
quietly use "`file'", clear
* save each "data" files into the new folder that we made in the first step
save "Tanzania_dta/`file'", replace
}

local filelist : dir "$usefile" files "*.dta", respectcase
foreach file of local filelist {
quietly use "`file'", clear
* save each "data" files into the new folder that we made in the first step
save "Tanzania_dta/`file'", replace
}


capture log close


clear
**********

Thank you so, so much to anyone that might be able to help!!
Re: Strange Issues w/ Data Formatting from DHS [message #29038 is a reply to message #29033] Mon, 15 April 2024 11:02
Bridgette-DHS

Following is a response from Senior DHS staff member, Tom Pullum:

We believe your problem is with the unzipping procedure, and/or the use of a Mac, and not with the DHS files. My personal strategy in this situation would be to find or construct another dta file, zip it, and then try to unzip it, to learn more about the unzipping steps. Hope you can quickly figure this out and proceed with your research. Perhaps the IT staff at your university can help.
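
The round-trip test suggested above could be sketched in Stata roughly as follows. This is a minimal sketch, not a tested procedure: the scratch folder path and the file names test_auto.dta and test_auto.zip are made up for illustration, and it relies only on the built-in sysuse, zipfile, and unzipfile commands.

```stata
* Work in a scratch folder (hypothetical path) so nothing real is overwritten
cd "/path/to/scratch"

* Build a small .dta file from a dataset that ships with Stata
sysuse auto, clear
save test_auto.dta, replace

* Zip it, delete the original, then unzip and confirm that it came back
zipfile test_auto.dta, saving(test_auto.zip, replace)
erase test_auto.dta
unzipfile test_auto.zip, replace
confirm file test_auto.dta

* Open the restored file to check that it is readable
use test_auto.dta, clear
describe, short
```

If this works but the DHS zip files still appear to lose their .dta files, the difference is more likely to lie in where the files are being extracted to than in the zip files themselves.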

Re: Strange Issues w/ Data Formatting from DHS [message #29060 is a reply to message #29038] Thu, 18 April 2024 10:07
tednoel
Hi, thank you so much for responding. Multiple professors across several departments at my university are now trying to assist with this issue. The problem is not with unzipping the files; rather, the .dta files are not being extracted when the files are unzipped. We have attempted to write a loop in a do-file to extract the .dta files, but every .dta file has a different file path, which is why we can't run everything at once. It goes without saying that extracting every .dta file by hand for thousands of files would be nearly impossible in the month and a half I have left until my submission deadline. I'm truly beginning to panic because even professors here who have worked with DHS data have been puzzled by this challenge for several weeks now. Any resources or guidance you might be able to provide would be greatly appreciated. Thank you so much in advance.
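
When .dta files end up in differently named subfolders, one possible approach is to loop over the subfolders rather than over fixed paths. The sketch below is untested and makes several assumptions: the root path is a placeholder, the .dta files sit exactly one level below it, and Tanzania_dta is the collection folder named earlier in this thread.

```stata
* Hypothetical root folder holding one subfolder per unzipped archive
local root "/path/to/DHSDATA/Tanzania"
local out  "`root'/Tanzania_dta"
capture mkdir "`out'"

* List every subfolder directly under the root
local subdirs : dir "`root'" dirs "*"

foreach d of local subdirs {
    * Skip the output folder itself
    if lower("`d'") == "tanzania_dta" continue

    * Match both lower-case and upper-case extensions, without duplicates
    local lower : dir "`root'/`d'" files "*.dta"
    local upper : dir "`root'/`d'" files "*.DTA"
    local found : list lower | upper

    * Copy each .dta file into the single output folder
    foreach f of local found {
        copy "`root'/`d'/`f'" "`out'/`f'", replace
    }
}
```

Because the loop builds each path from the folder listing, it does not matter that every archive unpacks into a folder with a different name.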
Re: Strange Issues w/ Data Formatting from DHS [message #29068 is a reply to message #29060] Fri, 19 April 2024 10:56
Trevor-DHS
Hi, I would like to help you resolve this issue. A couple of first steps:
1) If you are working in Stata, you only need the DT files (e.g. TZxxvvDT.zip) and not the FL zip files (e.g. TZxxvvFL.zip), with the exception of the geospatial (GE) files, which are only available in one format. All of the DT files contain the .dta data files that you need. There is no need to use the .do and .dct files found in the FL zip files, as the same data are in the DT zip files.
2) Once you have just the DT zip files, you will find the .dta files inside of those zip files and should be able to unzip them automatically to the location of your choice.
Can you test this out and see if it works?

I'm a little confused about the images you have shared of the file contents. I think you have cut sections of the output into the images you shared, but I think you have cut them in the wrong place. For example in your image that starts "successfully extracted TZKR41DT.zip ...", you are mixing the output from extracting two different zip files. You are showing the end of the process for extracting TZKR41DT.zip, with the list of files being shown before "successfully extracted TZKR41DT.zip ..." (but not included in the image you sent) and then telling you that 4 files were processed. Then in the same image you are showing the first part of the extraction of TZKR63FL.zip and listing the 8 files extracted from that, but then not showing the success message.

Let us know if this helps clarify your issues.
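
The two steps above (keep only the DT zip files, then unzip them to a location of your choice) might look something like this in Stata. It is a rough, untested sketch: the source path is a placeholder, the output folder name is borrowed from earlier in the thread, and it assumes each DT zip file unpacks its .dta file directly rather than into a nested folder.

```stata
* Hypothetical folder holding the downloaded zip files
local src  "/path/to/DHSDATA/Tanzania"
local dest "`src'/Tanzania_dta"
capture mkdir "`dest'"

* List only the Stata-format (DT) zip files, ignoring the FL archives
local ziplist : dir "`src'" files "*DT.zip", respectcase

* unzipfile extracts into the current working directory
cd "`dest'"
foreach z of local ziplist {
    unzipfile "`src'/`z'", replace
}

* Collect the extracted data files, whatever the case of the extension
local lower : dir "`dest'" files "*.dta"
local upper : dir "`dest'" files "*.DTA"
local dtalist : list lower | upper

* Show a short description of each dataset without loading it fully
foreach f of local dtalist {
    describe using "`dest'/`f'", short
}
```

Changing directory before unzipping is a deliberate choice here, since it puts every .dta file in one known folder.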
Re: Strange Issues w/ Data Formatting from DHS [message #29139 is a reply to message #29068] Mon, 29 April 2024 08:04
tednoel
Hi Trevor, THANK YOU SO MUCH. I have been mostly able to solve the problem thanks to the help you have given me. I have one remaining challenge in order to move forward, and that is merging the different survey datasets for each survey round. To be clear, I am interested in controlling for wealth in my proportional hazards model. This means that I will have to combine household data and individual data, as indicators for wealth do not exist in the individual recode for Tanzania 1999 (the year I'm starting this data work with). I have read from Tom Pullum in another part of this forum that "it would be virtually impossible to merge them [individual survey] with the HR file, which has households as units. You should use the PR file, which has individual household members as units, rather than the HR file." However, it's not clear to me that wealth is included in the PR file, either. To make matters a bit more confusing, I have seen that this type of merge is possible elsewhere on the internet, for example: https://www.researchgate.net/post/How_can_I_merge_Household_database_to_Women_data_base_in_the_DHS_data_using_stata

Do you mind clarifying if it is possible to merge household and individual level data? This is the only way I will be able to control for wealth in my proportional hazards analysis. Thank you so much in advance for your time.
Re: Strange Issues w/ Data Formatting from DHS [message #29142 is a reply to message #29068] Mon, 29 April 2024 11:54
tednoel
It's actually come to my attention that the reason I was having trouble finding the wealth data in the Household Recode or Household Member Recode for the Tanzania 1999 round is that the wealth index is stored in an entirely separate file. For these earlier survey rounds, such as the one I am working on, is it possible to merge the wealth index with the Individual (IR) and Household Member (PR) survey datasets in Stata? Thank you so much in advance.
Re: Strange Issues w/ Data Formatting from DHS [message #29145 is a reply to message #29142] Mon, 29 April 2024 15:49
Trevor-DHS
Hi
You need the TZWI41DT.zip file for the wealth index for this survey. Yes, you can merge the wealth index data to the IR or the PR data. You can find information about merging datasets in the Guide to DHS Statistics, in Chapter 1) Introduction and Description of Datasets, Analyzing DHS Data, Matching and Merging Datasets. This doesn't provide specific information for merging the wealth index data, but it does provide several examples of how to merge data. The wealth index data is based on households, so you can link the data to either the IR (individuals who live in the household) or PR (persons in the household), using the household ID information.
Re: Strange Issues w/ Data Formatting from DHS [message #29156 is a reply to message #29145] Wed, 01 May 2024 12:02
tednoel
Hi, thanks so much for the response. I've been a bit stumped by this merging process because there are three different datasets (IR, PR, and Wealth Index) that I have to combine and I'm a bit confused as to which one should be my base for merging. I was going to make the IR my base for merging because I've seen code that allows for the renaming of the cluster number, household number, and respondent's line number such as this:

use "/Users/tbear/Desktop/THESIS DATA/Tanzania_1999/Tanzania_1999_dta/TZIR41FL.dta", clear

* keep the variables you want
keep v0*
sort v001 v002 v003
save e:/Users/tbear/Desktop/THESISDATA/Tanzania_1999/Tanzania_1999_dta/TZIR41FL.dta, replace

* Prepare PR file and merge
use "/Users/tbear/Desktop/THESIS DATA/Tanzania_1999/Tanzania_1999_dta/TZPR41FL.dta", clear
* reduce to women who are eligible for the IR file
keep if hv117==1

* keep the variables you want
keep hv0* sa33 sh*

rename hv001 v001
rename hv002 v002
rename hv003 v003

sort v001 v002 v003
merge v001 v002 v003 using ***Not entirely clear what I should be using here
tab _merge
******

BUT the problem is the wealth index only has the hhid variable that I can use to merge- and the IR file does not have this, only the PR file does. Should I be using the PR file as my base, merging the IR file, and then appending the wealth index?

Thank you so much for all of your help.
Re: Strange Issues w/ Data Formatting from DHS [message #29157 is a reply to message #29156] Wed, 01 May 2024 13:01
Trevor-DHS
Hi
A few notes:
1) It looks like you are opening the IR file, then keeping just a few variables and sorting the file, and then overwriting the original file. This is generally not considered good practice as you are modifying the original file. Generally, you should start with your original file, but save to an interim file with a different name or in a different folder (or both).
2) The naming of your THESISDATA folder seems to vary - in two cases it has a space between THESIS and DATA and in one case it doesn't (this may be a display issue in the user forum as it occasionally puts extra blanks into the text).
3) In terms of the order of merging, I would start by merging the wealth index to the PR file and saving your output to an intermediate file. They both should have hhid so you should be able to merge those without problem. Then merge the info from the PR/wealth data onto the IR file. Below is a rough outline of the process (I haven't tested this, so there may be some bugs - this is just to give you the order of operations):
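* Sort the wealth index file by household ID
use TZWIxxxx.dta, clear
sort hhid
save TZWIxxxx.dta, replace

* Merge the wealth index onto the PR file (many members per household)
use TZPRxxxx.dta, clear
* keep the variables you want from the PR file (hvidx is needed below)
keep hhid hvidx hv0* ...
sort hhid
merge m:1 hhid using TZWIxxxx.dta
* drop the merge indicator before the next merge
drop _merge
* create the identifiers used in the IR file
clonevar v001 = hv001
clonevar v002 = hv002
clonevar v003 = hvidx
sort v001 v002 v003
save TZPRxxxx_temp.dta, replace

* Merge the PR/wealth data onto the IR file
use TZIRxxxx.dta, clear
sort v001 v002 v003
merge 1:1 v001 v002 v003 using TZPRxxxx_temp.dta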
It is also possible to construct hhid from hv001 and hv002 or from v001 and v002, and vice versa, but I don't think you need to.
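
For readers who do need a household ID in the IR file, one route that is often used with DHS recodes is to take it from caseid instead of rebuilding it from the cluster and household numbers. The sketch below rests on an assumption that is not stated in this thread and should be checked for each survey: that caseid is the 12-character hhid followed by a 3-character line number. The file name is the one used elsewhere in this thread.

```stata
use TZIR41FL.dta, clear

* Assumption: caseid = 12-character hhid + 3-character respondent line number
gen str12 hhid = substr(caseid, 1, 12)

* Sanity check: the trailing characters should equal the line number v003
assert real(trim(substr(caseid, 13, 3))) == v003
```

If the assert fails, the layout of caseid differs for that survey and the assumption above does not hold.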

[Updated on: Wed, 01 May 2024 13:02]


Re: Strange Issues w/ Data Formatting from DHS [message #29159 is a reply to message #29157] Wed, 01 May 2024 14:40
tednoel
Hi Trevor, thank you SO much for the guidance :). The DHS data is amazing but definitely a little tricky to navigate at first. I've adapted the code to meet the needs of what I'm trying to do but I've been a bit stuck since your previous message because I keep receiving an error message on STATA telling me that "variable hhid does not uniquely identify observations in the using data"

I'm not sure if this is because I had to clone whhid and set it equal to hhid at first (because the case identifier in these older wealth indexes doesn't match exactly the case identifier in the PR file) but in any case I know for a fact that hhid uniquely identifies cases in the PR file.. Could this be a situation wherein I have to construct hhid from hv001 and hv002 or from v001 and v002 as you alluded in your previous message (really hope not lol)... Below is my code just in case you might be able to see any problems I haven't picked up on so far.

use TZWI41FL.dta
sort whhid
save TZWI41FL.dta, replace

use TZPR41FL.dta, clear
* keep the variables you want from the PR file
clonevar whhid = hhid
keep hhid hv005 hv007 hv025 hv219
sort hhid
merge m:1 hhid using TZPR41FL.dta
clonevar v001 = hv001
clonevar v002 = hv002
clonevar v003 = hvidx
sort v001 v002 v003
save TZPR41FL_temp.dta

use TZIR41FL.dta, clear
sort v001 v002 v003
merge 1:1 v001 v002 v003 using TZPR41FL_temp.dta


As always thank you, thank you for any guidance you might be able to provide!
Re: Strange Issues w/ Data Formatting from DHS [message #29161 is a reply to message #29159] Wed, 01 May 2024 15:50
Trevor-DHS
The following line:
merge m:1 hhid using TZPR41FL.dta
should refer to the WI file, not the PR file.
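
The "does not uniquely identify observations in the using data" message fits this diagnosis: in an m:1 merge the using file must have one row per key, and the PR file has one row per household member. A quick way to check which file is unique on which ID is sketched below. It is untested and uses the file and variable names quoted in this thread.

```stata
* PR file: one row per household member, so hhid repeats within a household
use TZPR41FL.dta, clear
duplicates report hhid
capture noisily isid hhid      // expected to report that hhid is not unique

* WI file: see which ID variable it contains and whether it is unique
use TZWI41FL.dta, clear
describe
capture noisily isid whhid
```

Running isid on the intended key of the using file before each merge shows whether m:1 or 1:1 is the appropriate merge type.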
Re: Strange Issues w/ Data Formatting from DHS [message #29166 is a reply to message #29033] Thu, 02 May 2024 08:46
tednoel
Hi Trevor, I fixed this line of code, thank you so much. Unfortunately, there is still an error cropping up with respect to the hhid case identifier. Below are the lines of code concerned:

use TZWI41FL.dta
sort whhid
save TZWI41FL.dta, replace

use TZPR41FL.dta, clear
* keep the variables you want from the PR file
clonevar whhid = hhid
keep hhid hv005 hv007 hv025 hv219
sort hhid
merge m:1 hhid using TZWI41FL.dta
clonevar v001 = hv001
clonevar v002 = hv002
clonevar v003 = hvidx
sort v001 v002 v003
save TZPR41FL_temp.dta

use TZIR41FL.dta, clear
sort v001 v002 v003
merge 1:1 v001 v002 v003 using TZPR41FL_temp.dta


Unfortunately, it seems like the wealth index cannot be merged using either "hhid" or "whhid," as I've received error messages for both iterations of this code (attached in photos). I've been trying to figure out what the issue is. Perhaps I shouldn't be using the command "clonevar" and should just rename the variable entirely? I want to make sure that this is correct.

As always, really grateful for your assistance!
Re: Strange Issues w/ Data Formatting from DHS [message #29167 is a reply to message #29166] Thu, 02 May 2024 10:03
Trevor-DHS
Hi
The problem is that hhid does not exist in the WI file, but whhid does. You had the right idea when you created whhid in the PR data using the clonevar statement, but then you dropped it immediately by not including it in the keep statement. So you need to include whhid in the keep statement and then use it in the next few statements as follows:
use TZPR41FL.dta, clear
* keep the variables you want from the PR file
clonevar whhid = hhid
keep whhid hhid hv005 hv007 hv025 hv219
sort whhid
merge m:1 whhid using TZWI41FL.dta
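
Putting the corrected lines together with the later steps discussed in this thread, the full sequence might look like the sketch below. It is untested and makes a few assumptions beyond what was posted: hv001, hv002, and hvidx are added to the keep list because the clonevar lines need them; unmatched records are dropped after the first merge, which is an analysis choice and not a requirement; and the whhid and hhid values are assumed to be formatted identically in the two files.

```stata
* Step 1: attach the household-level wealth index to each household member
use TZPR41FL.dta, clear
clonevar whhid = hhid
* hv001, hv002 and hvidx must be kept: they become the IR keys in step 2
keep whhid hhid hv001 hv002 hvidx hv005 hv007 hv025 hv219
merge m:1 whhid using TZWI41FL.dta
tab _merge
keep if _merge == 3
drop _merge

* Step 2: create the IR identifiers and save to a new file, not the original
clonevar v001 = hv001
clonevar v002 = hv002
clonevar v003 = hvidx
isid v001 v002 v003
save TZPR41FL_temp.dta, replace

* Step 3: bring the PR/wealth variables onto the women's (IR) file
use TZIR41FL.dta, clear
merge 1:1 v001 v002 v003 using TZPR41FL_temp.dta, keep(master match)
tab _merge
```

The tab _merge lines are worth reading after each merge, since a large number of unmatched records usually points to a mismatch in the ID variables.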