R-Lang

# R-Lang

## resources

- [Ten random useful things in R that you might not know about](https://towardsdatascience.com/ten-random-useful-things-in-r-that-you-might-not-know-about-54b2044a3868)
- [HexSticker Maker](https://connect.thinkr.fr/hexmake/)
- [Control HTML Code Folding](https://stackoverflow.com/questions/37755037/how-to-add-code-folding-to-output-chunks-in-rmarkdown-html-documents)
- [R Seek](https://rseek.org/)
- [R Graph Gallery](https://www.r-graph-gallery.com/)
- [knitr Options](https://yihui.org/knitr/options/)
- [Pimp my Rmd](https://holtzy.github.io/Pimp-my-rmd/)

## libraries

### dplyr

```r
knitr::opts_chunk$set(echo = TRUE, results = 'hide')
library(tidyverse)
library(nycflights13)
```

#### Initial Items

```r
nycflights13::flights
airlines #--------------------------------------------------# not just a data frame but a tibble
view(flights) #---------------------------------------------# opens the viewer on the tibble so we can observe data in tabular format
filter(flights, month == 1, day == 1) #---------------------# prints the result of the filter
jan1 <- filter(flights, month == 1, day == 1) #-------------# assigns the result of the filter to a data frame
(dec25 <- filter(flights, month == 12, day == 25)) #--------# wrapping in parentheses both assigns and prints the data frame
sqrt(2) ^ 2 == 2 #------------------------------------------# FALSE because floating-point math is approximate, not exact; use near() to get around this
1/49 * 49 == 1 #--------------------------------------------# also FALSE
near(sqrt(2) ^ 2, 2) #--------------------------------------# returns TRUE as it should
near(1/49 * 49, 1) #----------------------------------------# returns TRUE as it should
nov_dec <- filter(flights, month %in% c(11, 12)) #----------# x %in% y selects rows where x is one of the values in y
```

#### More Exercises

```r
# Exercise 1
transmute(flights,
  dep_time,
  dep_hours = dep_time %/% 100,
  dep_minutes = dep_time %% 100,
  sched_dep_time,
  sched_dep_hours = sched_dep_time %/% 100,
  sched_dep_minutes = sched_dep_time %% 100
)

# Exercise 2
newFlights <- transmute(flights,
  air_time,
  realTime = arr_time - dep_time,
  accuracy = air_time == realTime
)
accurateFlights <- filter(newFlights, accuracy == TRUE) #---# 0.36% Accuracy of AirTime

# Exercise 3
flightsDelay <- transmute(flights,
  dep_delay,
  realDelay = dep_time - sched_dep_time,
  accurate = dep_delay == realDelay
)
accurateDelays <- filter(flightsDelay, accurate == TRUE) #--# 67.9% Accuracy of delay times

# Exercise 4
min_rank(flightsDelay$realDelay)
subset <- sort(flightsDelay$realDelay, decreasing = TRUE)
subset[1:10] %>% mean()

# Exercise 5
subset[1:3] + subset[1:10] #--------------------------------# the shorter vector is recycled (with a warning, since 10 is not a multiple of 3)
```

#### arrange

```r
arrange(flights, desc(is.na(dep_time))) #-------------------# sorted on the dep_time column, using is.na() to put missing values at the top
arrange(flights, desc(dep_delay)) #-------------------------# most delayed flights
arrange(flights, dep_delay) #-------------------------------# flights that left the earliest; ascending is the default, so there is no counterpart to desc()
arrange(flights, air_time) #--------------------------------# fastest flights
arrange(flights, desc(distance), desc(air_time)) #----------# flights that traveled the longest
arrange(flights, distance) #--------------------------------# flights that traveled the shortest
```

#### filters

```r
filter(flights, arr_delay >= 120) #-------------------------# arrival delay of 2 hours or more (delays are in minutes)
filter(flights, dest %in% c("IAH", "HOU")) #----------------# flew to IAH or HOU
filter(flights, carrier %in% c("UA","AA","DL")) #-----------# operated by United, American, or Delta
filter(flights, between(month, left = 7, right = 9)) #------# departed in summer (July, August, September)
filter(flights, arr_delay > 120 & dep_delay <= 0) #---------# arrived more than 2 hours late but (AND) didn't leave late
filter(flights, dep_delay >= 60 & dep_delay - arr_delay >= 30) # delayed by at least an hour but made up at least 30 min in flight
filter(flights, between(dep_time, left = 0, right =
  600)) #---------------------------------------------------# departed between midnight and 6am (inclusive), using between()
filter(flights, is.na(dep_time)) #--------------------------# rows with a missing departure time (the tibble header shows how many)
```

#### modular arithmetic

```r
transmute(flights,
  dep_time,
  hour = dep_time %/% 100, #------------------------# %/% is integer division
  minute = dep_time %% 100 #------------------------# %% is the remainder (modulo)
)
```

#### mutate

```r
# flights_small is created in the select section
mutate(flights_small,
  gain = arr_delay - dep_delay, #----------------------# use "=" instead of "<-" because it names the new variable inside mutate()
  speed = distance / air_time * 60, #------------------# we're saying that this variable is "=" to this expression,
  hours = air_time / 60, #-----------------------------# not that we're assigning these values to a standalone vector
  gain_per_hour = gain / hours #-----------------------# can even use variables made earlier in the same mutate() to create new ones
) #----------------------------------------------------# if we had used "<-", the operator would show up in the variable name
```

#### select

```r
select(flights, year, month, day) #-------------------------# selecting specific variables from the data set (variables = columns)
select(flights, year:day) #---------------------------------# selecting a range of variables using a colon (this to this)
select(flights, -(year:day)) #------------------------------# selects all variables EXCEPT those inside the parens with the minus sign in front

# functions to use with select
# starts_with
# ends_with
# contains
# matches
# num_range

select(flights, origin, dest, everything()) #---------------# puts the selected columns at the front while keeping every variable in the data set
select(flights, tailnum, tailnum) #-------------------------# if the same variable is named twice, it is displayed only once
vars <- c("year","month","day","dep_delay","arr_delay") #---# give a character vector the variable names
select(flights, one_of(vars)) #-----------------------------# Use one_of function with
# character vector variable to grab the variables from the tibble that match
select(flights, contains("TIME")) #-------------------------# selects all tibble variables that contain the substring "time" (case-insensitive by default)
flights_small <- select(flights, #--------------------------# selecting desired variables for a leaner data set through various methods
  year:day,
  ends_with("delay"),
  distance,
  air_time
)
```

#### summarize

```r
avgDelay <- as.numeric(summarize(flights, delay = mean(dep_delay, na.rm = TRUE)))
cat("The average delay of all flights is:", avgDelay)
by_day <- group_by(flights, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
```

#### transmute

```r
transmute(flights_small, #----------------------------------# transmute lets you mutate the variables but keeps only
  gain = arr_delay - dep_delay, #---------------------------# the newly created variables in your tibble
  speed = distance / air_time * 60,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
```

### ggplot2

```r
#### engine size / highway MPG ####
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Play with changing color to size
# (mapping the discrete variable "class" to the ordered aesthetic "size" is not advised)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class))

# Transparency of the points
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

# Shape of the points
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))

# Add colors by class and add this to the aesthetic layer
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Change color of the points
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
    color = "blue"
  )

# Change size and color to continuous variables and shape to a categorical variable
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = year, size = cyl, shape = drv))

# Making the same variable cover multiple aesthetics
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = cyl, size = cyl, shape = drv))

# Testing the "stroke" argument
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = cyl, size = cyl, shape = drv, stroke = 5))

# Passing a condition into an aesthetic argument instead of the plain variable
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5, size = cyl, shape = drv, stroke = 5))

# Highway MPG / Engine Cylinders
ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = cyl))

# Type of car / What "Wheel Drive" the car is (F,B,4)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = class, y = drv))

#### Facets on discrete variables ####
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2) #------------------------# the tilde "~" creates a formula (an R data structure, not an equation)

# testing facet grid
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

# testing facet grid 2
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(.
    ~ cyl) #---------------------------------------------# puts the cyl facet into columns since the formula is rows ~ columns

# testing facet grid 3
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .) #----------------------------------# puts the drv facet into rows since the formula is rows ~ columns

#### Changing Geoms ####
# from point to smooth
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

# line type based on a variable
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

# line type based on a variable, with points and colors to show the separation
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, color = drv))

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv),
    show.legend = FALSE
  )

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, color = drv),
    show.legend = FALSE
  )

#### Bar Plots ####
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```

### pins

- <https://www.youtube.com/embed/dsfEsJCiH-E>

### shiny

- Shiny dashboards consist of 2 main elements: the `ui` and the `server` files.
- Shiny dashboards are not great for high user traffic but can still handle multiple user sessions. Hosting on RStudio is an option, but Shiny Server can also be hosted on your own system or in containers.
- A great practice is to store shiny dashboards as R packages to be distributed for interactive, reproducible analysis.
- shinydashboard and shinydashboardPlus are packages that build on the shiny package to allow richer aesthetic elements and more functionality in the web app.
- For unit testing shiny applications there is shinytest, and for load testing there is shinyloadtest.
- To make sure shiny apps load quickly, you can use `profvis` and other methods to tackle your bigger and slower processes.
- Another option is to not load csv data into shiny directly. If loading data like this, do your ETL first and save the data as feather files; feather is slower to write but faster to read than csv's.
- Another would be to implement plot caching if the plots are taking a while to load.
- Great presentations of your shiny apps, like a walk-along tutorial: [cicerone](https://github.com/JohnCoene/cicerone)

#### Documentation

<iframe width="560" height="315" src="https://www.youtube.com/embed/Wy3TY0gOmJw" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

#### Code

```r
########################################
# A basic example of a shiny dashboard #
########################################

# ui.R ----
library(shiny)

# Define UI for application
shinyUI(fluidPage(
  # Application title
  titlePanel("Old Faithful Geyser Data"),
  # Sidebar with input
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins",
        "Number of bins:",
        min = 1,
        max = 50,
        value = 30
      )#sliderInput
    ),#sidebarPanel
    # Show a plot of the generated distribution
    mainPanel(
      plotOutput("distPlot")
    )#mainPanel
  )#sidebarLayout
))#shinyUI(fluidPage(

# server.R ----
library(shiny)

# Define server logic required to draw a histogram
shinyServer(function(input, output) {
  output$distPlot <- renderPlot({
    # generate bins based on input$bins from ui.R
    x <- faithful[, 2]
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    # draw the histogram with the specified number of bins
    hist(x, breaks = bins, col = 'darkgray', border = 'white')
  })#renderPlot
})#shinyServer
```

## data types

### array

```r
# Array
# The R Objects which can store
# data in more than 2 dimensions
# Syntax: array(data, dim, dimnames)

#============#
#  EXAMPLES  #
#============#
arr = array(c(0:15), dim = c(4,4,2,2))
# this fills the array with the values 0-15
# the first two dimensions make the 0-15 display in a 4x4 grid
# the other two dimensions make it a 4-D array: the values are recycled into
# 4 copies of the 4x4 grid, arranged as a 2x2 grid of arrays
arr
arr2 = array(c(1:9), dim = c(3,3,4,2))
arr2
```

### data classes

```r
# Data Classes ----
12.6    #Numeric
3       #Numeric
100     #Numeric
"male"  #Character
TRUE    #Logical
FALSE   #Logical
T       #Logical
F       #Logical

# Data Structures ----
# Vector
# List
# Matrix
# Data Frame
```

### dataframe

```r
# PART 1 ----
# Data Frames
# A table or a 2-dimensional array-like structure in which each column contains
# values of one variable and each row contains one set of values from each column
# Syntax: data.frame(data)

#============#
#  EXAMPLES  #
#============#
RowCount = c(1:5)
PeopleNames = c("Bryan","Jude","kelly","janelle","Rosa")
Values = c(15,25,65,145,74)
df <- data.frame(RowCount, PeopleNames, Values)
df
# This alone will display the values of the data frame. Each vector in the frame
# represents a vertical column of values that each contribute a value to each row
data.frame(airquality) # Built-in sample data table, can also import Excel files

# PART 2 ----
myDataFrame <- read.csv("20190208 RC Registry.csv")
# `date` must already hold the cutoff date; its definition is not shown in these notes
myDataFrame <- myDataFrame[myDataFrame$Medical..Screening.Due.Date < date,
                           c("CDCR", "Medical..Screening.Due.Date")]
myDataFrame

# PART 3 ----
# setting up the data frame vectors
id <- 1:200
group <- c(rep("Vehicle",100), rep("Drug",100))
response <- c(rnorm(100, mean = 25, sd = 5), rnorm(100, mean = 23, sd = 5))
age <- round(rnorm(200,40,20))
# compiling the data frame
myData <- data.frame(Patient = id,
                     Patient.Age = age,
                     Treatment = group,
                     Response = response)
myData
head(myData,10)
tail(myData,10)
dim(myData)
str(myData)
summary(myData)

# subsetting data frames ----
myData[1,2]
myData[2,3]
myData[1:20,2:3] # first 20 rows with columns 2 & 3 present
myData[1:20,]
# returns 20 rows and all columns if left blank
myData[,1] # returns everything in the first column only
myData[,"Response"] # returns just the values of the column named "Response"
myData$Response # the dollar sign $ after the name of the data frame will return the entire column without quotes or brackets
myData[myData$Response>26,] # give me the rows and all columns for every row that meets the criteria
                            # of Response > 26
# perform a calculation and then add values to a new column of the data frame
myData$Positive <- myData$Response<26
write.csv(myData[myData$Response>26,], file = "testData.csv", row.names = F) # write a CSV file to the current working directory
# multiple filter criteria and then assigned
# to a new object for ease of exporting to CSV
CSVMyData <- myData[myData$Treatment == "Vehicle" & myData$Response>26 & myData$Patient.Age > 0,]
write.csv(CSVMyData, file = "testData.csv", row.names = F)
head(CSVMyData)
```

### list

```r
# List ----
# The R Objects which can contain elements of different types like
# Numbers, Strings, Vectors, and another List inside of it
# Syntax: list(data)

#============#
#  EXAMPLES  # ----
#============#
vtr1 = c('hello','world')
vtr2 = c(24.6345,3.6,345.5678)
vtr3 = c(45L,'hi')
ls = list(vtr1,vtr2,vtr3)
ls
# Each list item keeps its own data type; separate list items do not
# dictate the data type of the other items
# Coercion only happens inside a single item that itself contains differing
# data types (vtr3 becomes a character vector)

# Nested Lists ----
list(1,2,list("a","b",list(T,T,F)),"hello",T)
# you can have lists within lists and it displays nested lists effectively in the console
```

### matrix

```r
# Matrix
# R Objects in which the elements are arranged in a 2 dimensional rectangular layout
# Syntax: matrix(data, nrow, ncol, byrow, dimnames)
# Data: the input vector which becomes the data elements of the matrix
# NRow: number of rows to be created
# NCol: number of columns to be created
# ByRow: a logical
# value. If TRUE then the input vector elements are arranged by row
# DimName: The names assigned to the rows and columns

#============#
#  EXAMPLES  #
#============#
mtr = matrix(c(1:25),5,5)
mtr
# By contrast, matrix(c(5:30), 5, 5) gives:
# Warning message:
# In matrix(c(5:30), 5, 5) :
#   data length [26] is not a sub-multiple or multiple of the number of rows [5]
# If your total number of data points spills out of the matrix, a warning is returned
```

### null and na

```r
# empty value ----
NA   # NA is a logical constant of length 1 which contains a missing value indicator
NULL # NULL represents the null object in R:
     # it is a reserved word. NULL is often returned by expressions and
     # functions whose value is undefined.
?NA
```

### vectors

```r
# vectors ----
# 5 types
# Logical
# Integer
# Numeric (values with many digits may be printed in exponential format)
# Complex
# Character

#============#
#  EXAMPLES  # ----
#============#
# Logical
vtr1 = c(TRUE,FALSE)
# Numeric
vtr2 = c(15,85.674954,999999)
# Integer requires "L" after the number to treat it as an integer
vtr3 = c(35L,58L,146L)
# Integer with decimals is converted to numeric, with a warning
vtr4 = c(85.64L)
# Mixed data types passed to a vector
vtr5 = c(TRUE,35L,3.14)
# TRUE is coerced to 1 if put into a Numeric/Integer vector
# now include a character string
vtr6 = c(TRUE,35L,3.14,"hello")

#===========#
#  Results  # ----
#===========#
class(vtr1)
vtr1
class(vtr2)
vtr2
class(vtr3)
vtr3
class(vtr4)
vtr4
class(vtr5)
vtr5
class(vtr6)
vtr6

# Vector Math ----
numbers <- c(1:10)
numbers * 2

# Subsetting Vectors ----
days <- c("mon", "tue", "wed", "thurs", "fri")
days
# to return a specific value of a vector, put square brackets with the index
# of the value after the object holding the vector
days[1]
# square brackets are ALWAYS for subsetting; the parens "()" are for functions
days[c(1,3,5)] # vector within the subsetting brackets to pull out specific values
days[2:5] # if you wanted everything except monday
days[-5]
# basically saying "give me everything except what's in index 5", or in this case "Friday"
```

## data operators

```r
# Arithmetic
#   Addition (+)
#   Subtraction (-)
#   Multiplication (*)
#   Division (/)
#   Modulus (%%)
#   Exponent (^)
#   Floor Division (%/%)
# Relational
#   Equal To (==)   asking "is this equal to that", returns a logical
#   Not Equal To (!=)
#   Greater Than (>)
#   Less Than (<)
#   Greater Than Equal To (>=)
#   Less Than Equal To (<=)
# Assignment
#   Left
#     Equals (=)
#     Assign (<-)
#   Right
#     Assign (->)
# Logical
#   AND (&)   element-wise: in 2 parallel vectors, compare each aligned item with AND
#   AND (&&)  compares single (length-one) values; both sides need to be
#             TRUE or else FALSE
#   NOT (!)
#   OR (|)    element-wise: in 2 parallel vectors, compare each aligned item with OR
#   OR (||)   compares single (length-one) values; either side needs
#             to be TRUE or else FALSE
```

## functions

```r
# Functions ----
# Descriptive Statistics Functions ----
myValues <- c(1:100)
myValues
mean(myValues)
median(myValues)
mode(myValues) # returns the storage mode ("numeric"), not the statistical mode
min(myValues)
max(myValues)
sum(myValues)
sd(myValues) # standard deviation
class(myValues)
length(myValues)
log(myValues)
log10(myValues)
mySqrt <- sqrt(myValues)
mySqrt
?rnorm
# Adding a question mark before the name of a function opens a help pane with all the details
# on that function; this is a great way to learn what the functions require as arguments
?rgb
hist(rnorm(100, mean = 5))

# Data frame Functions ----
# setting up the data frame vectors
id <- 1:200
group <- c(rep("Vehicle",100), rep("Drug",100))
response <- c(rnorm(100, mean = 25, sd = 5), rnorm(100, mean = 23, sd = 5))
# compiling the data frame
myData <- data.frame(Patient = id,
                     Treatment = group,
                     Response = response)
myData
head(myData,10)
tail(myData,10)
dim(myData)
str(myData)
summary(myData)

# Change the data type of values; can work for entire columns ----
as.numeric(c("1","2","3"))
as.character(1:10)

# Remove Objects ----
# Object names cannot contain spaces
# But
# underscores and periods are allowed, but best to just stick with camel case
my_Object <- 3
my.Object <- 3
myObject <- 3
# all of these are valid
# to remove an object use the rm() command
rm(my_Object)
rm(my.Object)
```

## conditional statements

```r
# IF
# Syntax:
# if(expression)
# {
#   //statements
# }

# ELSE IF
# Syntax:
# if(expression 1)
# {
#   //statements
# } else if(expression 2)
# {
#   //statements
# }

vtr1 <- c(5)
if(vtr1==5)
{
  print("hello world")
}
```

## tips-tricks-and-hacks

### How to add indexed areas to your R script:

```r
# Steps:
# 1.) Add the hashtag in line to start the comment
# 2.) Type what you'd like then space
# 3.) then add 4 dashes "----"
# This will index your code area and make it collapsible as well
# the indexed option appears at the bottom of this window

# PART A ----
print("Hello World")
# PART B ----
```

## package development

- [Pkg Dev From Scratch](https://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/)
- [R Pkgs Book](https://r-pkgs.org/index.html)
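
As a rough sketch of the kind of file the guides above work with: a package function is usually an ordinary R function preceded by roxygen2 comments. The function name `near_equal` and its default tolerance are hypothetical, chosen only for illustration. The `devtools` calls at the bottom are left commented out because they assume devtools is installed and that the file lives in the `R/` folder of a package directory.

```r
#' Compare two numeric vectors with a tolerance
#'
#' @param x,y Numeric vectors of the same length.
#' @param tol Largest absolute difference still treated as equal.
#' @return A logical vector.
#' @export
near_equal <- function(x, y, tol = 1e-8) {
  # hypothetical helper: element-wise comparison within a tolerance
  abs(x - y) < tol
}

# plain R can run the function directly, no package needed yet
near_equal(sqrt(2)^2, 2)

# Typical next steps from inside the package directory (assumption: devtools installed):
# devtools::document()  # turn the roxygen comments into help files
# devtools::install()   # install the package locally
```

Keeping the documentation next to the function in roxygen comments means the help page and the code are edited together.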