Week 3 lab

Overview

This week we will continue working with the spi_nyc file to rename some variables and learn how to summarize after grouping variables. We will also learn a little more about logical operators.

First, read in the file.

spi_nyc <- read.csv("/Users/saanchishah/Desktop/Spring_files/NYC/spi_nyc.csv")

Let’s load a new package this week in addition to dplyr and tidyverse.

library(mosaic)
library(tidyverse)
library(dplyr)

Tally function

Recall from last lab how we used different functions to summarize numeric variables. What if we want to achieve the same for character variables without recoding variables (more on recoding later)?

summary(spi_nyc$DMQ111e) # Is this helpful?

##    Length     Class      Mode 
##      1999 character character

tally(spi_nyc$DMQ111e)

## X
##                                            
##                                       1376 
##                                Afghanistan 
##                                          1 
##                                    Albania 
##                                          4 
##                                    Algeria 
##                                          1 
##                        Antigua and Barbuda 
##                                          2 
##                                  Argentina 
##                                         10 
##                                  Australia 
##                                          6 
##                                    Austria 
##                                          2 
##                                 Azerbaijan 
##                                          1 
##                                    Bahamas 
##                                          1 
##                                 Bangladesh 
##                                         15 
##                                   Barbados 
##                                          7 
##                                    Belarus 
##                                          5 
##                                    Belgium 
##                                          1 
##                                     Belize 
##                                          4 
##                                    Bermuda 
##                                          1 
##                                    Bolivia 
##                                          1 
##                     Bosnia and Herzegowina 
##                                          2 
##                                     Brazil 
##                                          2 
##                                   Bulgaria 
##                                          6 
##                                   Cameroon 
##                                          1 
##                                     Canada 
##                                          4 
##                                      Chile 
##                                          3 
##                                   Colombia 
##                                         39 
##                                 Costa Rica 
##                                          2 
##                         Country not listed 
##                                         11 
##             Croatia (local name: Hrvatska) 
##                                          1 
##                                       Cuba 
##                                          9 
##                             Czech Republic 
##                                          2 
##                                    Ecuador 
##                                         53 
##                                      Egypt 
##                                          2 
##                                El Salvador 
##                                         12 
##                                     France 
##                                          4 
##                              French Guiana 
##                                          1 
##                                     Gambia 
##                                          3 
##                                    Germany 
##                                          5 
##                                      Ghana 
##                                         15 
##                                     Greece 
##                                          3 
##                                    Grenada 
##                                          5 
##                                  Guatemala 
##                                          4 
##                                     Guinea 
##                                          2 
##                                     Guyana 
##                                         28 
##                                      Haiti 
##                                         14 
##                                   Honduras 
##                                         16 
##                                  Hong Kong 
##                                          5 
##                                    Hungary 
##                                          1 
##                                      India 
##                                         31 
##                 Iran (Islamic Republic of) 
##                                          2 
##                                    Ireland 
##                                          5 
##                                     Israel 
##                                          6 
##                                      Italy 
##                                         14 
##                                      Japan 
##                                          6 
##                                 Kazakhstan 
##                                          1 
##                                      Kenya 
##                                          1 
##     Korea, Democratic People`s Republic of 
##                                         11 
##                         Korea, Republic of 
##                                          7 
##                                 Kyrgyzstan 
##                                          2 
##                                    Liberia 
##                                          4 
##                                      Macau 
##                                          1 
## Macedonia, the former Yugoslav Republic of 
##                                          2 
##                                   Malaysia 
##                                          2 
##                                       Mali 
##                                          1 
##                       Moldova, Republic of 
##                                          1 
##                                    Morocco 
##                                          2 
##                                    Myanmar 
##                                          4 
##                                      Nepal 
##                                          6 
##                                Netherlands 
##                                          1 
##                                New Zealand 
##                                          1 
##                                  Nicaragua 
##                                          2 
##                                    Nigeria 
##                                          7 
##                                   Pakistan 
##                                         11 
##                                     Panama 
##                                          5 
##                                       Peru 
##                                         11 
##                                Philippines 
##                                         20 
##                                     Poland 
##                                          6 
##                                    Romania 
##                                          1 
##                      Saint Kitts and Nevis 
##                                          1 
##                                Saint Lucia 
##                                          4 
##           Saint Vincent and the Grenadines 
##                                          3 
##                                    Senegal 
##                                          9 
##                 Slovakia (Slovak Republic) 
##                                          1 
##                               South Africa 
##                                          2 
##                                      Spain 
##                                          4 
##                                  Sri Lanka 
##                                          1 
##                                  Swaziland 
##                                          1 
##                                     Sweden 
##                                          3 
##                                Switzerland 
##                                          1 
##                  Taiwan, Province of China 
##                                          9 
##                                 Tajikistan 
##                                          1 
##               Tanzania, United Republic of 
##                                          1 
##                                   Thailand 
##                                          4 
##                        Trinidad and Tobago 
##                                         36 
##                                     Turkey 
##                                          2 
##                                    Ukraine 
##                                         17 
##                             United Kingdom 
##                                          8 
##                                    Uruguay 
##                                          1 
##                                 Uzbekistan 
##                                          7 
##                                  Venezuela 
##                                          3 
##                                   Viet Nam 
##                                          2 
##                   Virgin Islands (British) 
##                                          1 
##                      Virgin Islands (U.S.) 
##                                          1 
##                                      Yemen 
##                                          3 
##                                 Yugoslavia 
##                                          6
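
As a point of comparison, here is a minimal sketch using base R's table() on a small made-up character vector. The values are invented for illustration and are not taken from spi_nyc, and table() is a base R alternative rather than the mosaic tally() used above. It counts character values in a similar way, and sorting the result makes the most common categories easier to spot in a long list like the one above.

```r
# Toy character vector; values are made up for illustration
country <- c("Ukraine", "", "Ecuador", "", "Ecuador", "Ghana")

# Count how many times each value appears
# (blank strings get their own count, just like the first row of the tally output)
counts <- table(country)
counts

# Sort so the most frequent values come first
sorted_counts <- sort(counts, decreasing = TRUE)
sorted_counts
```

Notice that blank strings are counted as their own category rather than as missing values, which is worth keeping in mind when reading the counts.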

Renaming variables

Next, let’s rename some variables. The dplyr function for this is ‘rename’. Before we rename anything, let’s take a look at the variables we have.

names(spi_nyc)

##  [1] "X"        "SP_ID"    "riagendr" "riaageyr" "capistat" "capicmt" 
##  [7] "capicmto" "DMQ140"   "SFQ180"   "DMQ105"   "DMQ111e"  "DMQ111f" 
## [13] "DMQ130e"  "DMQ130f"  "DMQ161M"  "DMQ161Y"  "DMQ251a"  "DMQ251c" 
## [19] "DMQ251d"  "DMQ251e"  "DMQ251f"  "DMQ251i"  "DMQ251j"  "DMQ251k" 
## [25] "DMQ251AE" "DMQ251AF" "DMQ251OL" "DMQ251OS" "race_eth" "acasstat"
## [31] "acasicmt" "acascmto" "cidistat" "cidicmt"  "cidicmto" "ageadj"  
## [37] "agewt"    "agegroup" "racewt"   "stratum"  "PSU"      "WTSF1CH" 
## [43] "WTSF1C"   "WTSF1F"   "US_time"

# View(spi_nyc)

head(spi_nyc, n = 5)

##   X  SP_ID riagendr riaageyr capistat capicmt capicmto DMQ140 SFQ180 DMQ105
## 1 1 100230        1       59        1      NA              13      1     10
## 2 2 100243        1       25        1      NA              12      1     12
## 3 3 100270        2       76        1      NA               5      2     12
## 4 4 100597        1       29        1      NA              18      6     66
## 5 5 101166        2       48        1      NA              11      5     10
##   DMQ111e DMQ111f  DMQ130e DMQ130f DMQ161M DMQ161Y DMQ251a DMQ251c DMQ251d
## 1              NA New York      36      NA      NA      NA      NA      NA
## 2              NA               NA       4    1998      NA      NA      NA
## 3              NA               NA      11    1962      NA      12      NA
## 4 Ukraine     804               NA       5    1992      NA      NA      NA
## 5              NA New York      36      NA      NA      NA      NA      NA
##   DMQ251e DMQ251f DMQ251i DMQ251j DMQ251k DMQ251AE DMQ251AF DMQ251OL DMQ251OS
## 1      NA      NA      NA      NA      NA       NA       NA                  
## 2      NA      NA      NA      NA      NA       NA       NA                  
## 3      NA      NA      NA      NA      NA       NA       NA                  
## 4      NA      NA      NA      NA      NA       NA       NA                  
## 5      NA      NA      NA      NA      NA       NA       NA                  
##   race_eth acasstat acasicmt acascmto cidistat cidicmt cidicmto ageadj agewt
## 1        1        1       NA                 1      NA               2     4
## 2        5        1       NA                 1      NA               1     1
## 3        4        3        4                 1      NA               3     6
## 4        1        1       NA                 1      NA               1     1
## 5        2       NA       NA                NA      NA               2     3
##   agegroup racewt stratum PSU  WTSF1CH   WTSF1C   WTSF1F US_time
## 1        4      4       1 101 4012.849 6214.335 9175.899      NA
## 2        1      4       1   3 1836.281 1911.012 2595.647       6
## 3        5      1       1   8 3934.508 4128.225 5731.487      42
## 4        1      4       1  53 2875.775 3242.916    0.000      12
## 5        3      2       1  10 1957.079    0.000    0.000      NA

How are the functions above (names, View, and head) similar or different?

Let’s look at the syntax for renaming variables now.

spi_nyc %>% 
  rename('Country' = "DMQ111e",
         'State' = "DMQ130e") %>% 
  head(n = 5)

##   X  SP_ID riagendr riaageyr capistat capicmt capicmto DMQ140 SFQ180 DMQ105
## 1 1 100230        1       59        1      NA              13      1     10
## 2 2 100243        1       25        1      NA              12      1     12
## 3 3 100270        2       76        1      NA               5      2     12
## 4 4 100597        1       29        1      NA              18      6     66
## 5 5 101166        2       48        1      NA              11      5     10
##   Country DMQ111f    State DMQ130f DMQ161M DMQ161Y DMQ251a DMQ251c DMQ251d
## 1              NA New York      36      NA      NA      NA      NA      NA
## 2              NA               NA       4    1998      NA      NA      NA
## 3              NA               NA      11    1962      NA      12      NA
## 4 Ukraine     804               NA       5    1992      NA      NA      NA
## 5              NA New York      36      NA      NA      NA      NA      NA
##   DMQ251e DMQ251f DMQ251i DMQ251j DMQ251k DMQ251AE DMQ251AF DMQ251OL DMQ251OS
## 1      NA      NA      NA      NA      NA       NA       NA                  
## 2      NA      NA      NA      NA      NA       NA       NA                  
## 3      NA      NA      NA      NA      NA       NA       NA                  
## 4      NA      NA      NA      NA      NA       NA       NA                  
## 5      NA      NA      NA      NA      NA       NA       NA                  
##   race_eth acasstat acasicmt acascmto cidistat cidicmt cidicmto ageadj agewt
## 1        1        1       NA                 1      NA               2     4
## 2        5        1       NA                 1      NA               1     1
## 3        4        3        4                 1      NA               3     6
## 4        1        1       NA                 1      NA               1     1
## 5        2       NA       NA                NA      NA               2     3
##   agegroup racewt stratum PSU  WTSF1CH   WTSF1C   WTSF1F US_time
## 1        4      4       1 101 4012.849 6214.335 9175.899      NA
## 2        1      4       1   3 1836.281 1911.012 2595.647       6
## 3        5      1       1   8 3934.508 4128.225 5731.487      42
## 4        1      4       1  53 2875.775 3242.916    0.000      12
## 5        3      2       1  10 1957.079    0.000    0.000      NA

# rename(new_var = old_var)

Did this actually change the variable names in the spi_nyc dataframe or do we need to do something such that they are permanently stored in the original df?
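
One way to check this yourself is with a small made-up data frame. The sketch below assumes dplyr is installed; toy and toy_renamed are hypothetical names, and the values are invented. It compares the column names before and after piping into rename(), with and without assigning the result.

```r
library(dplyr)

# Toy data frame; column names mimic the codebook style, values are made up
toy <- data.frame(DMQ111e = c("Ukraine", "", "Ecuador"),
                  riaageyr = c(29, 59, 41))

# Piping into rename() prints a renamed copy; toy itself is not modified
toy %>% rename(Country = DMQ111e)
names(toy)

# Assigning the result with <- stores the new names in an object
toy_renamed <- toy %>% rename(Country = DMQ111e, Age = riaageyr)
names(toy_renamed)
```

Compare the two names() results, then try the same check on spi_nyc.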

Logical operators

Logical operators are really important because they let us test conditions (is this value equal to, greater than, or not equal to something?) and combine those conditions. We have already been using some, and if I were more organized, I would have talked about logical operators earlier. Welp. Hopefully, DataCamp was helpful in going over logical operators and describing how and when to use summarizing and grouping functions.
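
As a quick refresher, here is a minimal sketch of the most common comparison and logical operators on a made-up data frame. It assumes dplyr is installed; toy and older_known are hypothetical names and the values are invented, not taken from spi_nyc.

```r
library(dplyr)

# Toy data frame; values are made up for illustration
toy <- data.frame(Country = c("Ukraine", "", "Ecuador", "Ghana"),
                  Age = c(29, 59, 41, NA))

# Comparison operators return TRUE/FALSE (or NA) for each element
toy$Age > 40
toy$Country == "Ecuador"

# & means AND, | means OR, ! means NOT
toy$Age > 40 & toy$Country != ""
toy$Age < 30 | toy$Country == "Ghana"
!is.na(toy$Age)

# filter() keeps only the rows where every condition is TRUE
# (rows where a condition evaluates to NA are dropped)
older_known <- toy %>% filter(Age > 40, Country != "")
older_known
```

Separating conditions with a comma inside filter() works like &, so both conditions must hold for a row to be kept.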

Group_by function

What if we want to understand the distribution of age by country of birth?

spi_nyc %>% 
  rename('Country' = "DMQ111e",
         'Age' = "riaageyr") %>% 
  group_by(Country) %>% 
  summarise(mean_age = mean(Age, na.rm = TRUE)) # Remember, R is case-sensitive

## # A tibble: 103 × 2
##    Country               mean_age
##    <chr>                    <dbl>
##  1 ""                        41.7
##  2 "Afghanistan"             27  
##  3 "Albania"                 42.2
##  4 "Algeria"                 37  
##  5 "Antigua and Barbuda"     25  
##  6 "Argentina"               41.8
##  7 "Australia"               47.2
##  8 "Austria"                 35  
##  9 "Azerbaijan"              70  
## 10 "Bahamas"                 44  
## # ℹ 93 more rows
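
The na.rm = TRUE argument in the chunk above matters whenever a group contains missing values. The sketch below illustrates this on a made-up data frame. It assumes dplyr is installed; toy, without_narm, and with_narm are hypothetical names and the ages are invented.

```r
library(dplyr)

# Toy data frame with one missing age; values are made up for illustration
toy <- data.frame(Country = c("Ecuador", "Ecuador", "Ghana", "Ghana"),
                  Age = c(41, NA, 30, 50))

# Without na.rm = TRUE, any group containing an NA gets NA as its mean
without_narm <- toy %>%
  group_by(Country) %>%
  summarise(mean_age = mean(Age))
without_narm

# With na.rm = TRUE, missing ages are dropped before averaging
with_narm <- toy %>%
  group_by(Country) %>%
  summarise(mean_age = mean(Age, na.rm = TRUE))
with_narm
```

Compare the Ecuador row in the two results to see what the argument changes.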

Student Exercise

What if we wanted to group by more than one variable? What would the syntax look like?
In a new code chunk, rename the variable DMQ130e as state. Then group by state and calculate the minimum and maximum ages (you may rename the riaageyr variable as well if you would like to).
How many Cuban Americans are in this dataset? You may want to rename the variable and then show the count.

To do

Go through ‘The dplyr Package’ on Dr. Stephanie Hicks’s website (JHU Professor) here. We will mostly focus on the ‘mutate()’ function next week, but this lecture will provide you with a nice overview of the functions we have learned so far.

I think that collaboration is key in science. It’s really great that such wonderful professors have opened the doors for non-JHU students to learn from them. Feel free to follow her. I wish that we’d been taught R in our MPH. Unfortunately, we primarily used SAS. Who even uses SPSS and Stata? They’re still using outdated lectures/homeworks. Even though I am aiming to graduate next year, I really wanted to ensure that students in this class can leave thinking ‘alright, I kind of see what the fuss is about.’ I had to learn to use R through random stats classes during my PhD, then I had a long break from coding due to my clinical track, and I had to methodically re-learn from Dr. Ozan Jaquette in Education. He is a phenomenal professor and his notes are also open to the public! The point of my spiel is that we must all learn to work together and make use of all the free resources to get better!

In preparation for next week you will read up on ‘joins’/ ‘merging’ data here. Also, read up on ‘inner joins’ here. Of course, I would be thrilled if you read through all the different types of joins mentioned on this page.
You will also go through the codebook and prepare a list of about 15 variables (at the very least) you would want to keep in your merged dataset (spi_nyc + capi_nyc). You should have already glanced at the codebook at least once so far.
Optional resource: DataCamp chapter 8, dplyr Review, can be accessed here.