This week we will continue working with the spi_nyc file to rename some variables and learn how to summarize after grouping variables. We will also learn a little more about logical operators.
First, read in the file.
spi_nyc <- read.csv("/Users/saanchishah/Desktop/Spring_files/NYC/spi_nyc.csv")
Let’s load a new package this week in addition to dplyr and tidyverse.
library(mosaic)
library(tidyverse)
library(dplyr)
Recall from last lab how we used different functions to summarize numeric variables. What if we want to achieve the same for character variables without recoding variables (more on recoding later)?
summary(spi_nyc$DMQ111e) # Is this helpful?
## Length Class Mode
## 1999 character character
tally(spi_nyc$DMQ111e)
## X
##
## 1376
## Afghanistan
## 1
## Albania
## 4
## Algeria
## 1
## Antigua and Barbuda
## 2
## Argentina
## 10
## Australia
## 6
## Austria
## 2
## Azerbaijan
## 1
## Bahamas
## 1
## Bangladesh
## 15
## Barbados
## 7
## Belarus
## 5
## Belgium
## 1
## Belize
## 4
## Bermuda
## 1
## Bolivia
## 1
## Bosnia and Herzegowina
## 2
## Brazil
## 2
## Bulgaria
## 6
## Cameroon
## 1
## Canada
## 4
## Chile
## 3
## Colombia
## 39
## Costa Rica
## 2
## Country not listed
## 11
## Croatia (local name: Hrvatska)
## 1
## Cuba
## 9
## Czech Republic
## 2
## Ecuador
## 53
## Egypt
## 2
## El Salvador
## 12
## France
## 4
## French Guiana
## 1
## Gambia
## 3
## Germany
## 5
## Ghana
## 15
## Greece
## 3
## Grenada
## 5
## Guatemala
## 4
## Guinea
## 2
## Guyana
## 28
## Haiti
## 14
## Honduras
## 16
## Hong Kong
## 5
## Hungary
## 1
## India
## 31
## Iran (Islamic Republic of)
## 2
## Ireland
## 5
## Israel
## 6
## Italy
## 14
## Japan
## 6
## Kazakhstan
## 1
## Kenya
## 1
## Korea, Democratic People`s Republic of
## 11
## Korea, Republic of
## 7
## Kyrgyzstan
## 2
## Liberia
## 4
## Macau
## 1
## Macedonia, the former Yugoslav Republic of
## 2
## Malaysia
## 2
## Mali
## 1
## Moldova, Republic of
## 1
## Morocco
## 2
## Myanmar
## 4
## Nepal
## 6
## Netherlands
## 1
## New Zealand
## 1
## Nicaragua
## 2
## Nigeria
## 7
## Pakistan
## 11
## Panama
## 5
## Peru
## 11
## Philippines
## 20
## Poland
## 6
## Romania
## 1
## Saint Kitts and Nevis
## 1
## Saint Lucia
## 4
## Saint Vincent and the Grenadines
## 3
## Senegal
## 9
## Slovakia (Slovak Republic)
## 1
## South Africa
## 2
## Spain
## 4
## Sri Lanka
## 1
## Swaziland
## 1
## Sweden
## 3
## Switzerland
## 1
## Taiwan, Province of China
## 9
## Tajikistan
## 1
## Tanzania, United Republic of
## 1
## Thailand
## 4
## Trinidad and Tobago
## 36
## Turkey
## 2
## Ukraine
## 17
## United Kingdom
## 8
## Uruguay
## 1
## Uzbekistan
## 7
## Venezuela
## 3
## Viet Nam
## 2
## Virgin Islands (British)
## 1
## Virgin Islands (U.S.)
## 1
## Yemen
## 3
## Yugoslavia
## 6
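As a hedged aside, the same kind of frequency table can be produced with dplyr and sorted so the most common values come first. The sketch below uses a small made-up data frame (`toy`) instead of spi_nyc, and it calls `dplyr::count()` explicitly because mosaic defines its own `count()` that masks the dplyr version once both packages are loaded.

```r
library(dplyr)

# Toy data standing in for spi_nyc; the values are made up for illustration
toy <- data.frame(
  DMQ111e = c("", "Ukraine", "", "Ecuador", "Ecuador", "Ghana"),
  stringsAsFactors = FALSE
)

# count() returns one row per distinct value, with the frequency in column n;
# sort = TRUE puts the most frequent values first
country_counts <- toy %>%
  dplyr::count(DMQ111e, sort = TRUE)

country_counts
```

The blank string is counted as its own category here, just as it is in the tally() output above.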
Next, let’s start renaming variables. The function for this is ‘rename’. Before we rename anything, let’s take a look at the variables we have.
names(spi_nyc)
## [1] "X" "SP_ID" "riagendr" "riaageyr" "capistat" "capicmt"
## [7] "capicmto" "DMQ140" "SFQ180" "DMQ105" "DMQ111e" "DMQ111f"
## [13] "DMQ130e" "DMQ130f" "DMQ161M" "DMQ161Y" "DMQ251a" "DMQ251c"
## [19] "DMQ251d" "DMQ251e" "DMQ251f" "DMQ251i" "DMQ251j" "DMQ251k"
## [25] "DMQ251AE" "DMQ251AF" "DMQ251OL" "DMQ251OS" "race_eth" "acasstat"
## [31] "acasicmt" "acascmto" "cidistat" "cidicmt" "cidicmto" "ageadj"
## [37] "agewt" "agegroup" "racewt" "stratum" "PSU" "WTSF1CH"
## [43] "WTSF1C" "WTSF1F" "US_time"
# View(spi_nyc)
head(spi_nyc, n = 5)
## X SP_ID riagendr riaageyr capistat capicmt capicmto DMQ140 SFQ180 DMQ105
## 1 1 100230 1 59 1 NA 13 1 10
## 2 2 100243 1 25 1 NA 12 1 12
## 3 3 100270 2 76 1 NA 5 2 12
## 4 4 100597 1 29 1 NA 18 6 66
## 5 5 101166 2 48 1 NA 11 5 10
## DMQ111e DMQ111f DMQ130e DMQ130f DMQ161M DMQ161Y DMQ251a DMQ251c DMQ251d
## 1 NA New York 36 NA NA NA NA NA
## 2 NA NA 4 1998 NA NA NA
## 3 NA NA 11 1962 NA 12 NA
## 4 Ukraine 804 NA 5 1992 NA NA NA
## 5 NA New York 36 NA NA NA NA NA
## DMQ251e DMQ251f DMQ251i DMQ251j DMQ251k DMQ251AE DMQ251AF DMQ251OL DMQ251OS
## 1 NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA
## race_eth acasstat acasicmt acascmto cidistat cidicmt cidicmto ageadj agewt
## 1 1 1 NA 1 NA 2 4
## 2 5 1 NA 1 NA 1 1
## 3 4 3 4 1 NA 3 6
## 4 1 1 NA 1 NA 1 1
## 5 2 NA NA NA NA 2 3
## agegroup racewt stratum PSU WTSF1CH WTSF1C WTSF1F US_time
## 1 4 4 1 101 4012.849 6214.335 9175.899 NA
## 2 1 4 1 3 1836.281 1911.012 2595.647 6
## 3 5 1 1 8 3934.508 4128.225 5731.487 42
## 4 1 4 1 53 2875.775 3242.916 0.000 12
## 5 3 2 1 10 1957.079 0.000 0.000 NA
How are the above functions similar or different?
Let’s look at the syntax for renaming variables now.
spi_nyc %>%
  rename('Country' = "DMQ111e",
         'State' = "DMQ130e") %>%
  head(n = 5)
## X SP_ID riagendr riaageyr capistat capicmt capicmto DMQ140 SFQ180 DMQ105
## 1 1 100230 1 59 1 NA 13 1 10
## 2 2 100243 1 25 1 NA 12 1 12
## 3 3 100270 2 76 1 NA 5 2 12
## 4 4 100597 1 29 1 NA 18 6 66
## 5 5 101166 2 48 1 NA 11 5 10
## Country DMQ111f State DMQ130f DMQ161M DMQ161Y DMQ251a DMQ251c DMQ251d
## 1 NA New York 36 NA NA NA NA NA
## 2 NA NA 4 1998 NA NA NA
## 3 NA NA 11 1962 NA 12 NA
## 4 Ukraine 804 NA 5 1992 NA NA NA
## 5 NA New York 36 NA NA NA NA NA
## DMQ251e DMQ251f DMQ251i DMQ251j DMQ251k DMQ251AE DMQ251AF DMQ251OL DMQ251OS
## 1 NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA
## race_eth acasstat acasicmt acascmto cidistat cidicmt cidicmto ageadj agewt
## 1 1 1 NA 1 NA 2 4
## 2 5 1 NA 1 NA 1 1
## 3 4 3 4 1 NA 3 6
## 4 1 1 NA 1 NA 1 1
## 5 2 NA NA NA NA 2 3
## agegroup racewt stratum PSU WTSF1CH WTSF1C WTSF1F US_time
## 1 4 4 1 101 4012.849 6214.335 9175.899 NA
## 2 1 4 1 3 1836.281 1911.012 2595.647 6
## 3 5 1 1 8 3934.508 4128.225 5731.487 42
## 4 1 4 1 53 2875.775 3242.916 0.000 12
## 5 3 2 1 10 1957.079 0.000 0.000 NA
# rename(new_var = old_var)
Did this actually change the variable names in the spi_nyc data frame, or do we need to do something more so that they are permanently stored in the original df?
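One way to check this for yourself is sketched below on a small made-up data frame (`toy` and `toy_renamed` are hypothetical names, not objects from the lab). It compares the column names before and after assigning the result of rename() to an object.

```r
library(dplyr)

# Toy data standing in for spi_nyc; the values are made up for illustration
toy <- data.frame(
  DMQ111e = c("Ukraine", ""),
  DMQ130e = c("", "New York"),
  stringsAsFactors = FALSE
)

# Without assignment, rename() only returns (and prints) a modified copy
toy %>%
  rename(Country = DMQ111e, State = DMQ130e)

names(toy)          # the original object still has the old names

# Assigning the result stores the renamed version
toy_renamed <- toy %>%
  rename(Country = DMQ111e, State = DMQ130e)

names(toy_renamed)  # the new object has the new names
```

Assigning back to the same name (for example `toy <- toy %>% rename(...)`) would overwrite the original instead of creating a second object.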
Logical operators are really important because they let us compare values and combine conditions, for example when deciding which rows to keep. We have already been using some, and if I were more organized, I would have talked about logical operators earlier. Welp. Hopefully, DataCamp was helpful in going over logical operators and describing how and when to use summarizing and grouping functions.
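Here is a minimal sketch of the most common operators, using a made-up data frame (`toy`) that borrows the riaageyr and riagendr values from the first five rows printed earlier. The object names on the left of each `<-` are hypothetical and only there for illustration.

```r
library(dplyr)

# Toy data: age and gender codes from the first five rows shown above
toy <- data.frame(
  riaageyr = c(59, 25, 76, 29, 48),
  riagendr = c(1, 1, 2, 1, 2)
)

# A comparison returns TRUE or FALSE for each element
toy$riaageyr >= 48

# & means "and": both conditions must be TRUE for a row to be kept
older_and_code2 <- toy %>%
  filter(riaageyr >= 48 & riagendr == 2)

# | means "or": at least one condition must be TRUE
younger_or_code2 <- toy %>%
  filter(riaageyr < 30 | riagendr == 2)

# ! negates a condition
not_code1 <- toy %>%
  filter(!(riagendr == 1))
```

Note that `==` tests for equality, while a single `=` assigns a value or names an argument.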
What if we want to understand the distribution of age by country of birth?
spi_nyc %>%
  rename('Country' = "DMQ111e",
         'Age' = "riaageyr") %>%
  group_by(Country) %>%
  summarise(mean_age = mean(Age, na.rm = TRUE)) # Remember, R is case-sensitive
## # A tibble: 103 × 2
## Country mean_age
## <chr> <dbl>
## 1 "" 41.7
## 2 "Afghanistan" 27
## 3 "Albania" 42.2
## 4 "Algeria" 37
## 5 "Antigua and Barbuda" 25
## 6 "Argentina" 41.8
## 7 "Australia" 47.2
## 8 "Austria" 35
## 9 "Azerbaijan" 70
## 10 "Bahamas" 44
## # ℹ 93 more rows
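The grouped summary above can be extended to report how many people fall in each group, which helps when a mean is based on only one or two respondents. The sketch below is one possible variation on a made-up data frame (`toy`). Dropping the blank country group is a judgment call made here for illustration, not something the lab prescribes.

```r
library(dplyr)

# Toy data standing in for the renamed spi_nyc columns; values are made up
toy <- data.frame(
  Country = c("", "", "Ukraine", "Ecuador", "Ecuador"),
  Age = c(59, 25, 29, 40, NA),
  stringsAsFactors = FALSE
)

age_by_country <- toy %>%
  filter(Country != "") %>%               # drop rows with a blank country
  group_by(Country) %>%
  summarise(
    n_people = n(),                       # number of rows in each group
    mean_age = mean(Age, na.rm = TRUE)    # mean age, ignoring missing values
  ) %>%
  arrange(desc(n_people))                 # largest groups first

age_by_country
```

Here n() counts every row in the group, including rows where Age is missing, so n_people can be larger than the number of ages that went into the mean.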
I think that collaboration is key in science. It’s really great that such wonderful professors have opened the doors for non-JHU students to learn from them. Feel free to follow her. I wish that we’d been taught R in our MPH. Unfortunately, we primarily used SAS. Who even uses SPSS and Stata? They’re still using outdated lectures/homeworks. Even though I am aiming to graduate next year, I really wanted to ensure that students in this class can leave thinking, “alright, I kind of see what the fuss is about.” I had to learn R through random stats classes during my PhD, then I had a long break from coding due to my clinical track, and I had to methodically re-learn it from Dr. Ozan Jaquette in Education. He is a phenomenal professor and his notes are also open to the public! The point of my spiel is that we must all learn to work together and make use of all the free resources to get better!
In preparation for next week, you will read up on ‘joins’/‘merging’ data here. Also, read up on ‘inner joins’ here. Of course, I would be thrilled if you read through all the different types of joins mentioned on this page.
You will also go through the codebook and prepare a list of about 15 variables (at the very least) that you would want to keep in your merged dataset (spi_nyc + capi_nyc). You should have already glanced at the codebook at least once by now.
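As a rough preview of what this preparation is for, the sketch below keeps a few columns with select() and then combines two made-up data frames with inner_join(). Everything in it is an assumption for illustration: `spi_toy` and `capi_toy` are toy stand-ins, `example_var` is a hypothetical column, and using SP_ID as the shared key is a guess that should be checked against the codebook.

```r
library(dplyr)

# Toy stand-ins for spi_nyc and capi_nyc; all values are for illustration only
spi_toy <- data.frame(
  SP_ID = c(100230, 100243, 100270),
  riaageyr = c(59, 25, 76),
  riagendr = c(1, 1, 2)
)
capi_toy <- data.frame(
  SP_ID = c(100243, 100270, 100597),
  example_var = c("a", "b", "c"),   # hypothetical column
  stringsAsFactors = FALSE
)

# Keep only the variables on your list
spi_small <- spi_toy %>%
  select(SP_ID, riaageyr)

# An inner join keeps only the rows whose key appears in both data frames
merged <- inner_join(spi_small, capi_toy, by = "SP_ID")

merged
```

Only the two IDs present in both toy data frames survive the join, which is the behavior to keep in mind while reading about the other join types.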
Optional resource: DataCamp chapter 8 Dplyr Review can be accessed here