R packages contain code, data and documentation. Different packages have different roles to play. We will mostly use the tidyverse, a collection of packages that includes dplyr for data wrangling and ggplot2 for visualisation. Let’s learn how to install and load packages. Once you install a package in R, you need not re-install it. However, each time you begin a new R session, you have to load the packages you need again with library().
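# Install once (uncomment and run a line if the package is missing)
# install.packages("tidyverse") # also installs dplyr, readr and ggplot2
# install.packages("dplyr")
# install.packages("readr")
# Load at the start of every session
library(dplyr)
library(tidyverse) # also attaches dplyr, readr and ggplot2
library(readr) # to read in and write out csv files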
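If you are unsure whether a package is already installed, you can check before loading it. The sketch below is one way to do this with base R's requireNamespace(); the object names (needed, installed, missing_pkgs) and the choice to print a message rather than install automatically are assumptions made for illustration.

```r
# Packages this lesson relies on (adjust the vector to your own needs)
needed <- c("dplyr", "readr")

# requireNamespace() returns TRUE if a package can be loaded and FALSE otherwise,
# without attaching it to the session
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need install.packages()
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Please install: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed; load them with library().")
}
```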
It is essential to know what your working directory is. Next, we may have to change our current working directory so it is easier to access files.
# To access the current working directory
getwd()
## [1] "/Users/saanchishah/Desktop/seminar23/github/sassysaanch.github.io/Week2"
# To set the working directory where we want to house the files
setwd("/Users/saanchishah/Desktop/Spring_files/Harry Potter")
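The paths above exist only on the author's computer, so here is a minimal sketch of how getwd() and setwd() interact that should run on any machine. It uses R's temporary directory (tempdir()) as a stand-in destination; the object names and the data/Characters.csv relative path are hypothetical.

```r
# Remember where we started so we can come back later
original_dir <- getwd()

# Move to a folder that exists on every machine (R's temporary directory)
setwd(tempdir())
getwd() # now prints the temporary directory

# file.path() joins folder and file names into a single path
relative_path <- file.path("data", "Characters.csv")
relative_path # "data/Characters.csv"

# Return to the original working directory
setwd(original_dir)
```

Once the working directory is set, files inside it can be referred to by a short relative path instead of the full path.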
a <- 5 + 2
b <- 18499*40
# Take a look at your environment
a + b
## [1] 739967
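To take a look at your environment from the console rather than by pointing and clicking, base R offers ls() and exists(). A short sketch that reuses the objects a and b from above:

```r
a <- 5 + 2
b <- 18499 * 40

# List the names of all objects in the current environment
ls()

# Check whether a particular object exists
exists("a")

# Typing an object's name prints its value
a # 7
b # 739960
```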
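Vectors, matrices and dataframes. Vectors can hold character or numeric data (among other types), but every element of a vector must be of the same type. Matrices are two-dimensional and, like vectors, hold a single type of data, most often numeric. Dataframes can hold a different type of data in each column.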
vec <- c(1, 2, 3, 4, 5) # c = combine (concatenate)
# holds the same values as...
vec1 <- 1:5 # the colon operator already builds the sequence, so c() is not needed
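To make the three structures concrete, here is a small sketch. All values and object names (mixed, m, toy_df) are made up for illustration and are not taken from the course data.

```r
# A vector holds one type: mixing numbers and text turns everything into text
mixed <- c(1, "two", 3)
class(mixed) # "character"

# A matrix is a vector arranged in rows and columns
m <- matrix(1:6, nrow = 2)
dim(m) # 2 rows, 3 columns

# A data frame can hold a different type in each column
toy_df <- data.frame(
  name = c("Harry", "Hermione"),  # character
  age = c(11, 12),                # numeric (made-up values)
  is_student = c(TRUE, TRUE)      # logical
)
sapply(toy_df, class)
```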
How can we create a character vector? Task: create a vector with 10 numbers of your choice and print the contents.
For this class, we will mostly work with csv files. However, files you receive from a stakeholder, client etc. may not be in this format. R is able to read in many types of files, such as Excel, SAS or Stata files. I have taken the liberty of converting the data into csv files so that it’s easier for you to work with the data.
hp_characters <- read.csv("/Users/saanchishah/Desktop/Spring_files/Harry Potter/Characters.csv", sep = ";")
# read.csv is a base R function
# Notice how there is no output: the data are stored in hp_characters, and we need to call that object to see them
# Now that we have stored the data, we can run some stats on them
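The file above separates its columns with semicolons rather than commas, which is why sep = ";" is needed. The sketch below shows the idea on a tiny made-up file written to a temporary location, so it runs without the course data; the file contents and the object names (tmp, toy, toy2) are illustrative assumptions.

```r
# Write a small semicolon-separated file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("Id;Name;House",
    "1;Harry James Potter;Gryffindor",
    "2;Luna Lovegood;Ravenclaw"),
  tmp
)

# Tell read.csv() that columns are separated by semicolons
toy <- read.csv(tmp, sep = ";")

# read.csv2() assumes semicolons by default (and a comma as the decimal mark)
toy2 <- read.csv2(tmp)

# Typing the object's name prints its contents
toy
dim(toy) # 2 rows, 3 columns
```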
How do you know this worked, besides not getting an error? You can look at the global environment on the right-hand side to confirm that the data are there. What if you would like to see what the first few rows look like without pointing and clicking?
# Deliberately left blank
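As a hint, here is how peeking at a dataset from the console might look, using the mtcars dataset that ships with R rather than the course data. The same functions work on any dataframe.

```r
# First six rows by default
head(mtcars)

# Ask for a specific number of rows
head(mtcars, 3)

# Last three rows
tail(mtcars, 3)

# Number of rows and columns
dim(mtcars) # 32 rows, 11 columns

# Column names
names(mtcars)
```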
You can press the Option key and the - key together to insert the assignment operator (<-). You can also highlight a line of code and press Command + Return to run just that line. (On Windows, the RStudio equivalents are Alt + - and Ctrl + Enter.)
It is always important to know what kind of variables you are working with before you apply any function to them. For example, you cannot do arithmetic on a string variable directly. However, it is possible to convert strings to numeric variables. But we’re jumping the gun here. We need to understand the structure of the dataframe first!
str(hp_characters)
## 'data.frame': 140 obs. of 15 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr "Harry James Potter" "Ronald Bilius Weasley" "Hermione Jean Granger" "Albus Percival Wulfric Brian Dumbledore" ...
## $ Gender : chr "Male" "Male" "Female" "Male" ...
## $ Job : chr "Student" "Student" "Student" "Headmaster" ...
## $ House : chr "Gryffindor" "Gryffindor" "Gryffindor" "Gryffindor" ...
## $ Wand : chr "11\" Holly phoenix feather" "12\" Ash unicorn tail hair " "10¾\" vine wood dragon heartstring" "15\" Elder Thestral tail hair core" ...
## $ Patronus : chr "Stag" "Jack Russell terrier" "Otter" "Phoenix" ...
## $ Species : chr "Human" "Human" "Human" "Human" ...
## $ Blood.status: chr "Half-blood" "Pure-blood" "Muggle-born" "Half-blood" ...
## $ Hair.colour : chr "Black" "Red" "Brown" "Silver| formerly auburn" ...
## $ Eye.colour : chr "Bright green" "Blue" "Brown" "Blue" ...
## $ Loyalty : chr "Albus Dumbledore | Dumbledore's Army | Order of the Phoenix | Hogwarts School of Witchcraft and Wizardry" "Dumbledore's Army | Order of the Phoenix | Hogwarts School of Witchcraft and Wizardry" "Dumbledore's Army | Order of the Phoenix | Hogwarts School of Witchcraft and Wizardry" "Dumbledore's Army | Order of the Phoenix | Hogwarts School of Witchcraft and Wizardry" ...
## $ Skills : chr "Parseltongue| Defence Against the Dark Arts | Seeker" "Wizard chess | Quidditch goalkeeping" "Almost everything" "Considered by many to be one of the most powerful wizards of his time" ...
## $ Birth : chr "31 July 1980" "1 March 1980" "19 September, 1979" "Late August 1881" ...
## $ Death : chr "" "" "" "30 June, 1997 " ...
# How many rows and variables does this dataset have?
# What type of variables do you observe?
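Converting strings to numeric variables, mentioned earlier, can be sketched as follows. The values are made up for illustration and are not a column from the dataset.

```r
# Numbers stored as text cannot be used in arithmetic directly
ages_chr <- c("11", "12", "unknown")
is.character(ages_chr) # TRUE

# as.numeric() converts what it can; anything that is not a number becomes NA
# (R also issues a warning, silenced here with suppressWarnings())
ages_num <- suppressWarnings(as.numeric(ages_chr))
ages_num # 11 12 NA

# Ignore the missing value when summarising
mean(ages_num, na.rm = TRUE) # 11.5
```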
I do not take credit for creating these datasets, however big a fan I am. I downloaded the files we will be working with in class from Kaggle. I will always provide links and references so that you can go to the original source should you want to.
What if we want to understand the type of a variable without having to look at the structure of the dataset? We would have to subset a variable first in order to do so.
# subsetting a variable
hp_characters$Name
## [1] "Harry James Potter"
## [2] "Ronald Bilius Weasley"
## [3] "Hermione Jean Granger"
## [4] "Albus Percival Wulfric Brian Dumbledore"
## [5] "Rubeus Hagrid"
## [6] "Neville Longbottom"
## [7] "Fred Weasley"
## [8] "George Weasley"
## [9] "Ginevra (Ginny) Molly Weasley"
## [10] "Dean Thomas"
## [11] "Seamus Finnigan"
## [12] "Lily J. Potter"
## [13] "James Potter"
## [14] "Sirius Black"
## [15] "Remus John Lupin"
## [16] "Peter Pettigrew"
## [17] "Percy Ignatius Weasley"
## [18] "(Bill) William Arthur Weasley"
## [19] "Charles Weasley"
## [20] "Lee Jordan "
## [21] "Oliver Wood"
## [22] "Angelina Johnson"
## [23] "Katie Bell"
## [24] "Alicia Spinnet"
## [25] "Lavender Brown"
## [26] "Parvati Patil"
## [27] "Romilda Vane"
## [28] "Colin Creevey"
## [29] "Cormac McLaggen"
## [30] "Minerva McGonagall"
## [31] "Molly Weasley"
## [32] "Arthur Weasley"
## [33] "Quirinus Quirrell"
## [34] "Cho Chang"
## [35] "Luna Lovegood"
## [36] "Gilderoy Lockhart"
## [37] "Filius Flitwick"
## [38] "Sybill Patricia Trelawney"
## [39] "Garrick Ollivander"
## [40] "Myrtle Elizabeth Warren (Moaning Myrtle)"
## [41] "Padma Patil"
## [42] "Michael Corner"
## [43] "Marietta Edgecombe"
## [44] "Terry Boot"
## [45] "Anthony Goldstein"
## [46] "Severus Snape"
## [47] "Draco Malfoy"
## [48] "Vincent Crabbe"
## [49] "Gregory Goyle"
## [50] "Bellatrix Lestrange"
## [51] "Dolores Jane Umbridge"
## [52] "Horace Eugene Flaccus Slughorn"
## [53] "Lucius Malfoy"
## [54] "Narcissa Malfoy"
## [55] "Regulus Arcturus Black"
## [56] "Pansy Parkinson"
## [57] "Blaise Zabini"
## [58] "Tom Marvolo Riddle"
## [59] "Theodore Nott"
## [60] "Rodolphus Lestrange"
## [61] "Millicent Bulstrode"
## [62] "Graham Montague"
## [63] "Bloody Baron"
## [64] "Marcus Flint"
## [65] "Penelope Clearwater"
## [66] "Roger Davies"
## [67] "Marcus Belby"
## [68] "Salazar Slytherin"
## [69] "Godric Gryffindor"
## [70] "Rowena Ravenclaw"
## [71] "Nicholas de Mimsy-Porpington"
## [72] "Cuthbert Binns"
## [73] "Barty Crouch Jr."
## [74] "Charity Burbage"
## [75] "Firenze"
## [76] "Alecto Carrow"
## [77] "Amycus Carrow"
## [78] "Helga Hufflepuff"
## [79] "Fat Friar"
## [80] "Helena Ravenclaw"
## [81] "Nymphadora Tonks"
## [82] "Pomona Sprout"
## [83] "Newton Scamander"
## [84] "Cedric Diggory"
## [85] "Justin Finch-Fletchley"
## [86] "Zacharias Smith"
## [87] "Hannah Abbott"
## [88] "Ernest Macmillan"
## [89] "Susan Bones"
## [90] "Walden Macnair"
## [91] "Augustus Rookwood"
## [92] "Antonin Dolohov"
## [93] "Corban Yaxley"
## [94] "Igor Karkaroff"
## [95] "Kingsley Shacklebolt"
## [96] "Alastor Moody"
## [97] "Alice Longbottom"
## [98] "Frank Longbottom"
## [99] "Rufus Scrimgeour"
## [100] "Cornelius Oswald Fudge"
## [101] "Barty Crouch Sr."
## [102] "Amos Diggory"
## [103] "Dedalus Diggle"
## [104] "Elphias Doge"
## [105] "Fleur Isabelle Delacour"
## [106] "Aberforth Dumbledore"
## [107] "Mundungus Fletcher"
## [108] "Sturgis Podmore"
## [109] "Hestia Jones"
## [110] "Marlene McKinnon"
## [111] "Fabian Prewett"
## [112] "Gideon Prewett"
## [113] "Emmeline Vance"
## [114] "Edgar Bones"
## [115] "Dorcas Meadowes"
## [116] "Benjy Fenwick"
## [117] "Madame Olympe Maxime"
## [118] "Gabrielle Delacour"
## [119] "Viktor Krum"
## [120] "Petunia Dursley"
## [121] "Vernon Dursley"
## [122] "Dudley Dursley"
## [123] "Marge Dursley"
## [124] "Dennis Creevey"
## [125] "Albus Severus Potter"
## [126] "Scorpius Hyperion Malfoy"
## [127] "Edward Remus Lupin"
## [128] "James Sirius Potter"
## [129] "Rose Granger-Weasley"
## [130] "Argus Filch"
## [131] "Poppy Pomfrey"
## [132] "Rolanda Hooch"
## [133] "Irma Pince"
## [134] "Aurora Sinistra"
## [135] "Septima Vector"
## [136] "Wilhelmina Grubbly-Plank"
## [137] "Fenrir Greyback"
## [138] "Gellert Grindelwald"
## [139] "Dobby"
## [140] "Kreacher"
typeof(hp_characters$Name) # or
## [1] "character"
class(hp_characters$Name)
## [1] "character"
# typeof() and class() are slightly different, but we need not go into the differences in this class
# You might expect "Name" to be character; we can test it
is.character(hp_characters$Name) # Returns a logical
## [1] TRUE
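For the curious, an optional aside: roughly speaking, typeof() reports how R stores the values, while class() reports how R treats the object. A small sketch with made-up values:

```r
x <- c(1.5, 2, 3)
typeof(x) # "double"
class(x)  # "numeric"

# Factors are stored as integer codes with text labels attached
houses <- factor(c("Gryffindor", "Slytherin", "Gryffindor"))
typeof(houses) # "integer"
class(houses)  # "factor"
```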
table(hp_characters$House)
##
##                              Beauxbatons Academy of Magic
##                           39                            3
##         Durmstrang Institute                   Gryffindor
##                            1                           38
##                   Hufflepuff                    Ravenclaw
##                           13                           18
##                    Slytherin
##                           28
table(hp_characters$Gender)
##
##        Female   Male
##      1     49     90
table(hp_characters$Blood.status)
##
##                                            Half-blood
##                         17                         23
##   Half-blood or pure-blood                Half-blood[
##                          1                          1
##                     Muggle                Muggle-born
##                          4                          7
## Muggle-born or half-blood[                Part-Goblin
##                          1                          1
##    Part-Human (Half-giant)                 Pure-blood
##                          2                         34
##   Pure-blood or half-blood   Pure-blood or half-blood
##                          2                         38
##   Pure-blood or Half-blood              Quarter-Veela
##                          5                          2
##                      Squib                    Unknown
##                          1                          1
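In each table above, the first count has no label because those rows contain an empty string, and several categories differ only in spelling or stray characters. The sketch below shows one possible way to tidy such values before tabulating; the vector is made up for illustration, and the cleaning steps are a suggestion rather than something the course prescribes.

```r
# A made-up vector with the same kinds of problems as the real columns
house <- c("Gryffindor", "", "Slytherin", "Gryffindor ", "")

# Empty strings and stray spaces show up as separate categories
table(house)

# Remove leading/trailing spaces, then mark empty strings as missing
clean <- trimws(house)
clean[clean == ""] <- NA

# useNA = "ifany" adds a column for missing values when there are any
counts <- table(clean, useNA = "ifany")
counts
```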
Let’s only look at witches/wizards who are in Gryffindor:
gryffindor <- hp_characters[hp_characters$House == "Gryffindor", ]
# Notice how we have to use == (not =) to compare values; the comparison returns a logical (TRUE/FALSE) for each row
# Only the rows where House == "Gryffindor" is TRUE are kept
# df[rows, columns] - this format specifies both rows and columns
# Leaving the column position empty keeps all columns, but we can choose specific columns like so
gryffindor_new <- hp_characters[hp_characters$House == "Gryffindor",
                                c(1:4, 7, 9, 13)]
# We have created a new dataframe
# We could also use the filter function, which is nicer - NEXT WEEK
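The df[rows, columns] pattern is easier to see on a tiny dataframe. The sketch below uses a made-up four-row table (the object names toy, in_gryffindor, by_name and by_position are illustrative) and shows that columns can be picked by name as well as by position.

```r
toy <- data.frame(
  Name  = c("Harry", "Draco", "Luna", "Hermione"),
  House = c("Gryffindor", "Slytherin", "Ravenclaw", "Gryffindor"),
  Job   = c("Student", "Student", "Student", "Student")
)

# The comparison gives one TRUE/FALSE per row
in_gryffindor <- toy$House == "Gryffindor"
in_gryffindor # TRUE FALSE FALSE TRUE

# Rows where the condition is TRUE, all columns
toy[in_gryffindor, ]

# Same rows, but only some columns: by name...
by_name <- toy[in_gryffindor, c("Name", "House")]

# ...or by position
by_position <- toy[in_gryffindor, c(1, 2)]
by_name
```

Selecting columns by name tends to be safer than by position, because the code keeps working if the column order changes.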
In preparation for the upcoming weeks, please start reading up on the NYC Health and Nutrition Examination Survey (NYC HANES) data we will be working with for most of the quarter (subject to change). You can access documentation on the data structure, codebook and such here. We will begin working with the SPI file next week. The week after, we will likely use the CAPI files. If time permits, we will also use the labs file. You will need about 30 minutes to understand how the data were collected, what each file represents and how the responses were coded. Without understanding the codebook, it is impossible to work with the data. Note: We will be using the NYC HANES 2004 data.
Where there is a wand, there is a way :D
Reference: (https://www.wizardingworld.com/features/what-does-your-wand-mean)