library(tidyverse)
library(dplyr)

Overview

This week, we will begin by performing a merge between the CAPI and SPI files. Then, we will spend time re-applying our knowledge pertaining to the functions we have learned so far to work with this data to generate some descriptive stats. We will also learn how to create a new variable, keeping in mind that R is essentially a calculator. As discussed earlier, our schedule is flexible, however, today’s lab does mark the end of dplyr/tidyverse and we will begin with data visualization starting next week. However, there is no way to keep going without continuing to build on everything we have learned thus far. We will simply switch gears to spend more time on plotting data and performing statistical testing in order to complete the final project.

First, read in the files.

spi_nyc <- read.csv("/Users/saanchishah/Desktop/Spring_files/NYC/spi_nyc.csv")

capi_nyc <- read.csv("/Users/saanchishah/Desktop/Spring_files/NYC/capi_nyc.csv")

Merging datasets

Now, let’s do something very cool (imagine attempting to do this in excel). We are going to joint the two datasets to work with one comprehensive df. What is a key? And what is the key in this situation?

merged_nyc <- inner_join(spi_nyc, capi_nyc, by = "SP_ID")

How many rows and columns does this new datatset have? Does that make sense?

Use the head function and observe the first 10 rows to look at your newly created dataset. Great! I am going to select a few variables I think will be useful for us to work with. Your task was to prepare a list of variables for your work so your variables may very well be different than mine (I hope they are).

merged_nyc %>% 
  select(DBQ090G, # How many times/week eat restaurant meals?
         DBQ102G, # How often eat dark green vegetables?
         DBQ107G, # How often eat fresh fruits?
         DIQ010,  # Ever told you have diabetes?
         HIQ012,  # have health insurance?
         WHQ010f, # Self-reported height in feet
         WHQ010i, # Self reported height in inches, # The ht in metres var seems to be iffy
         WHQ025K, # Self reported weight in kgs
         RHQ142,  # Are you pregnant now?
         RHQ160,  # How many times have you been pregnant?
         RHQ170,  # Number of live birth pregnancies
         RHQ210,  # Ever breastfed any of the children?
         PUQ017,  # Household pesticide use
         INQ640,  # Income total estimate
         SMQ020,  # Ever smoker - 100 cigs/lifetime
         SMQ040,  # Current smoker?
         riagendr,# Gender
         riaageyr,# Age at the time of screening
         race_eth,# Recoded race/ethnicity
         SP_ID,   # ID of the person
         ) -> merged_nyc_subset

Creating new variables

Let’s now move on to the creation of a new variable using the mutate() function. I am going to create a variable for bmi. BMI = weight in kgs/(ht in m)^2. Remember R is a calculator!

merged_nyc_subset = 
  merged_nyc_subset %>% 
  mutate(ht_inch = WHQ010f*12, # The syntax is mutate(newvar = perform whatever)
         newheight = WHQ010i + ht_inch,
         height_m = newheight*0.0254,
         bmi = WHQ025K/(height_m)^2) # converting height in inches, 1 foot = 12 inches 


# The problem is a lot of weight obs are missing, so let's filter those out

merged_nyc_subset %>% 
  filter(!is.na(WHQ025K) & !is.na(height_m)) 
##    DBQ090G DBQ102G DBQ107G DIQ010 HIQ012 WHQ010f WHQ010i WHQ025K RHQ142 RHQ160
## 1        1       1       1      2      2       5       2      57      2      2
## 2        1       1       1      2      1       5       2      75      2      5
## 3        1       1       1      2      2       5       7      75     NA     NA
## 4        1       1       1      2      1       5       7      65     NA     NA
## 5        1       1       1      2      1       5       4      64      2      2
## 6        3       1       1      1      2       5       3      65     NA     NA
## 7        2       2       1      2      2       5       9      78     NA     NA
## 8        1       1       1      2      2       5       6      56     NA     NA
## 9        1       1       1      2      1       5       6      65     NA     NA
## 10       1       1       1      2      2       5       9      63     NA     NA
## 11       1       1       1      2      2       5       9      75     NA     NA
## 12       2       1       1      2      1       5       5      40     NA     NA
## 13       2       1       1      2      2       5       6      70     NA     NA
## 14       2       1       1      2      2       5       8      74      2      1
## 15       3       1       1      2      2       5       7      82     NA     NA
## 16       1       1       1      2      1       5       3      55     NA     NA
## 17       1       1       1      2      1       5       4      50      2      5
## 18       1       1       1      2      2       5       7      80     NA     NA
## 19       3       1       1      2      2       5       5      65     NA     NA
## 20       3       1       1      2      1       5       9      85     NA     NA
## 21       1       1       1      2      2       5       7      73     NA     NA
## 22       1       1       1      2      1       5       9      63     NA     NA
##    RHQ170 RHQ210 PUQ017 INQ640 SMQ020 SMQ040 riagendr riaageyr race_eth  SP_ID
## 1       1      1      2      1      2     NA        2       26        4 103096
## 2       3      1      2      1      1      3        2       49        1 124768
## 3      NA     NA      2      1      2     NA        1       32        4 169929
## 4      NA     NA      1      1      2     NA        1       32        3 171698
## 5       2      1      2      1      2     NA        2       35        3 173773
## 6      NA     NA      1      2      1      1        1       68        3 249987
## 7      NA     NA      2      9      1      2        1       33        1 284790
## 8      NA     NA      2      1      2     NA        1       29        3 365208
## 9      NA     NA      2      1      2     NA        1       56        3 428974
## 10     NA     NA      2      7      2     NA        2       26        3 598510
## 11     NA     NA      2      2      1      1        1       43        1 757136
## 12     NA     NA      1      2      2     NA        1       78        5 770719
## 13     NA     NA      1      2      1      2        1       31        1 816567
## 14      1      1      1      2      1      2        2       28        4 841635
## 15     NA     NA      1      2      2     NA        1       37        2 877762
## 16     NA     NA      1      1      2     NA        2       21        3 906209
## 17      4      1      2      2      2     NA        2       26        1 917727
## 18     NA     NA      2      1      1      1        1       46        1 944368
## 19     NA     NA      1      2      1      1        1       53        3 946100
## 20     NA     NA      1      1      1      3        1       72        1 949787
## 21     NA     NA      1      2      2     NA        1       31        3 952778
## 22     NA     NA      2      2      2     NA        1       21        1 966631
##    ht_inch newheight height_m      bmi
## 1       60        62   1.5748 22.98392
## 2       60        62   1.5748 30.24200
## 3       60        67   1.7018 25.89669
## 4       60        67   1.7018 22.44380
## 5       60        64   1.6256 24.21880
## 6       60        63   1.6002 25.38428
## 7       60        69   1.7526 25.39388
## 8       60        66   1.6764 19.92658
## 9       60        66   1.6764 23.12906
## 10      60        69   1.7526 20.51044
## 11      60        69   1.7526 24.41719
## 12      60        65   1.6510 14.67459
## 13      60        66   1.6764 24.90822
## 14      60        68   1.7272 24.80541
## 15      60        67   1.7018 28.31371
## 16      60        63   1.6002 21.47900
## 17      60        64   1.6256 18.92094
## 18      60        67   1.7018 27.62313
## 19      60        65   1.6510 23.84620
## 20      60        69   1.7526 27.67281
## 21      60        67   1.7018 25.20611
## 22      60        69   1.7526 20.51044

Using if_else to recode variables

This is an important function which can be used along with mutate() - we may not be able to do this today, we will pick up here next week. For instance, we wanted to split BMI into underweight, healthy, overwwight, obese.

library(magrittr)

# check if _R_USE_PIPEBIND_ environment variable is set
if (!"TRUE" %in% Sys.getenv("_R_USE_PIPEBIND_", unset = NA)) {
  # set _R_USE_PIPEBIND_ environment variable
  Sys.setenv("_R_USE_PIPEBIND_" = "TRUE")
}


# Here's what would be easier but inaccurate:
merged_nyc_subset %>% 
  mutate(newvar = if_else(bmi > 30, 'obese', 'not-obese'))

#OR

merged_nyc_subset %>% 
  mutate(newvar = if_else(bmi <= 25, 1, 0))
       
       # The below is a little complicated so don't worry, let's just follow the logic
merged_nyc_subset <- merged_nyc_subset %>%
  mutate(bmirange = if_else(bmi < 18.5, 'underweight',
                            if_else(bmi >= 18.5 & bmi < 25, 'normal',
                                    if_else(bmi >= 25 & bmi <= 30, 'overweight',
                                            if_else(bmi > 30, 'obese', NA_character_)))))

Student Exercise

  1. If you were following so far, start working on the variables you wanted to select and modify the code accordingly
  2. Rename the variables so it is easier to understand what is going on
  3. You may also work with the variables I provided and perform descriptive stats for the reproductive health and diet variables.
  4. Next, start thinking about how you could potentially observe trends between variables. For example, I think it would be interesting to understand reproductive health by different racial/ethnic groups or observing insurance status by race/ethnicity could also be interesting. Likewise, what questions would you be interested in? Hint: In week one, we used the table function but for one variable only. What if you could simply include two variables of interest within the function? Try it! Again, you will need to know the codes for the responses.

To do

  1. Read up documentation on Recoding Data here.

  2. Read documentation on ggplot here. Also provides cheat sheets.