# R-Lang
## resources
- [Ten random useful things in R that you might not know about](https://towardsdatascience.com/ten-random-useful-things-in-r-that-you-might-not-know-about-54b2044a3868)
- [HexSticker Maker](https://connect.thinkr.fr/hexmake/)
- [Control HTML Code Folding](https://stackoverflow.com/questions/37755037/how-to-add-code-folding-to-output-chunks-in-rmarkdown-html-documents)
- [R Seek](https://rseek.org/)
- [R Graph Gallery](https://www.r-graph-gallery.com/)
- [knitr Options](https://yihui.org/knitr/options/)
- [Pimp my Rmd](https://holtzy.github.io/Pimp-my-rmd/)
## libraries
### dplyr
```r
knitr::opts_chunk$set(echo = TRUE, results = 'hide')
library(tidyverse)
library(nycflights13)
```
#### Initial Items
```r
nycflights13::flights
airlines #--------------------------------------------------# not just a data frame but a tibble
view(flights) #---------------------------------------------# opens the viewer on the tibble so we can observe data in tabular format
filter(flights, month == 1, day == 1) #---------------------# prints the result of the filter
jan1 <- filter(flights, month == 1, day == 1) #-------------# assigns the result of the filter to a dataframe
(dec25 <- filter(flights, month == 12, day == 25)) #--------# wrapping in parentheses both assigns and prints the data frame
sqrt(2) ^ 2 == 2 #------------------------------------------# FALSE: floating-point results are approximate, not exact; use near() to get around this
1/49 * 49 == 1 #--------------------------------------------# also FALSE
near(sqrt(2) ^ 2,2) #---------------------------------------# returns TRUE as it should
near(1/49 * 49,1) #-----------------------------------------# returns TRUE as it should
nov_dec <- filter(flights, month %in% c(11, 12)) #----------# x %in% y selects rows where x (month) is one of the values in y
```
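The tolerance-based comparison that `near()` performs can be sketched in base R. This is a minimal illustration, not dplyr's implementation: the helper name `is_close` and its default tolerance are assumptions made for this example.

```r
# Floating-point results are approximations, so exact equality can fail
exact_equal <- sqrt(2)^2 == 2            # FALSE

# Base R comparison within a tolerance
base_equal <- isTRUE(all.equal(sqrt(2)^2, 2))   # TRUE

# Hypothetical helper: TRUE where two values differ by less than `tol`
is_close <- function(x, y, tol = sqrt(.Machine$double.eps)) {
  abs(x - y) < tol
}

is_close(1 / 49 * 49, 1)                 # TRUE
is_close(c(0.1 + 0.2, 1), c(0.3, 2))     # TRUE FALSE (vectorised)
```

Like `near()`, the helper is vectorised, so it can be used inside `filter()` conditions.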
#### More Exercises
```r
# Exercise 1
transmute(flights,
dep_time,
dep_hours = dep_time %/% 100,
dep_minutes = dep_time %% 100,
sched_dep_time,
sched_dep_hours = sched_dep_time %/% 100,
sched_dep_minutes = sched_dep_time %% 100
)
# Exercise 2
newFlights <- transmute(flights,
air_time,
realTime = arr_time - dep_time,
accuracy = air_time == realTime
)
accurateFlights <- filter(newFlights, accuracy == T) #------# 0.36% Accuracy of AirTime
# Exercise 3
flightsDelay <- transmute(flights,
dep_delay,
realDelay = dep_time - sched_dep_time,
accurate = dep_delay == realDelay
)
accurateDelays <- filter(flightsDelay, accurate == T) #-----# 67.9% Accuracy of delay times
# Exercise 4
min_rank(flightsDelay$realDelay)
subset <- sort(flightsDelay$realDelay,decreasing = T)
subset[1:10] %>% mean()
# Exercise 5
subset[1:3]+subset[1:10]
```
#### arrange
```r
arrange(flights, desc(is.na(dep_time))) #-------------------# Sorted on the dep_time column using is.na to put missing values at top
arrange(flights, desc(dep_delay)) #-------------------------# most delayed flights
arrange(flights, dep_delay) #-------------------------------# flights that left the earliest; ascending is the default order, so no helper function is needed for it
arrange(flights, air_time) #--------------------------------# Fastest flights
arrange(flights, desc(distance), desc(air_time)) #----------# flights that traveled the longest
arrange(flights, distance) #--------------------------------# flights that traveled the shortest
```
#### filters
```r
filter(flights, arr_delay >= 120) #-------------------------# arrival delay of 2 or more hours (arr_delay is in minutes)
filter(flights, dest %in% c("IAH", "HOU")) #----------------# flew to IAH or HOU
filter(flights, carrier %in% c("UA","AA","DL")) #-----------# Operated by United, American, or Delta
filter(flights, between(month, left = 7, right = 9)) #------# Departed in summer (July, August, September)
filter(flights, arr_delay > 120 & dep_delay <= 0)#----------# arrived more than 2 hours late but (AND) didn't leave late
filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30) #-# delayed by at least an hour but made up more than 30 min in flight
filter(flights, between(dep_time,left = 0, right = 600)) #--# departed between midnight and 6am (inclusive), using between(); a departure at exactly midnight is stored as 2400 and is not matched here
filter(flights, is.na(dep_time))#---------------------------# flights with a missing departure time; the tibble header shows the row count
```
#### modular arithmetic
```r
transmute(flights,
dep_time,
hour = dep_time %/% 100, #------------------------# %/% is integer division
minute = dep_time %% 100 #------------------------# %% is remainder division
)
```
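Because these time columns store clock times as `HHMM` numbers, subtracting two of them directly does not give elapsed minutes (for example, 1000 - 959 is 41, but the times are one minute apart). A small base R sketch of the conversion, using made-up sample values rather than the `flights` data:

```r
# Clock times stored as HHMM integers (sample values are hypothetical)
dep_time <- c(517L, 1545L, 2400L)

hours   <- dep_time %/% 100    # integer division: 5 15 24
minutes <- dep_time %% 100     # remainder:        17 45 0

# Minutes since midnight; the outer %% 1440 maps 2400 (midnight) back to 0
mins_since_midnight <- (hours * 60 + minutes) %% 1440
mins_since_midnight            # 317 945 0
```

Differences between two such converted values are in minutes, which makes them comparable to columns like `dep_delay`.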
#### mutate
```r
mutate(flights_small, #-------------------------------------# flights_small is created in the select section; run that code first
  gain = arr_delay - dep_delay, #----------------------# use "=" instead of "<-" inside mutate(): it names the new column
  speed = distance / air_time * 60, #------------------# we're saying that this column is "=" to this expression
  hours = air_time / 60, #-----------------------------# note that each expression yields a vector with one value per row
  gain_per_hour = gain / hours #-----------------------# new columns can use other columns created earlier in the same mutate()
) #--------------------------------------------------# with "<-" the whole expression, operator included, would become the column name
```
#### select
```r
select(flights, year, month, day) #-------------------------# Selecting specific variables from the data set (variables = columns)
select(flights, year:day) #---------------------------------# Selecting specific variables using colon (this to this)
select(flights, -(year:day)) #------------------------------# selects all variables EXCEPT those in the parens with the minus sign operating on it
# functions to use with select
# starts_with
# ends_with
# contains
# matches
# num_range
select(flights, origin, dest, everything()) #---------------# Puts selected columns at front of data set while keeping all variable in data set
select(flights, tailnum, tailnum) #-------------------------# If same variable named twice, it is displayed only once
vars <- c("year","month","day","dep_delay","arr_delay") #---# character vector holding the variable names
select(flights,one_of(vars)) #------------------------------# one_of() grabs the tibble variables named in the character vector (superseded by any_of()/all_of() in current dplyr)
select(flights, contains("TIME")) #-------------------------# selects all variables containing "time"; matching is case-insensitive by default
flights_small <- select(flights, #--------------------------# Selecting desired variables for a lean-er data set through various methods
year:day,
ends_with("delay"),
distance,
air_time
)
```
#### summarize
```r
avgDelay <- as.numeric(summarize(flights, delay = mean(dep_delay, na.rm = T)))
cat("The average delay of all flights is:", avgDelay)
by_day <- group_by(flights, year, month, day)
summarize(by_day, delay = mean(dep_delay,na.rm = T))
```
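The `na.rm = T` argument matters because a single missing value makes `mean()` return `NA`. A minimal base R sketch with made-up values (the `delays` and `carrier` vectors are hypothetical, not taken from `flights`), using `tapply()` as a stand-in for `group_by()` followed by `summarize()`:

```r
# Hypothetical delays in minutes, with two missing values
delays  <- c(5, NA, 20, -3, 10, NA)
carrier <- c("UA", "UA", "AA", "AA", "DL", "DL")

with_na <- mean(delays)                  # NA: missing values propagate
overall <- mean(delays, na.rm = TRUE)    # 8: NAs dropped before averaging

# Grouped mean, one value per carrier (groups come back in sorted order)
by_carrier <- tapply(delays, carrier, mean, na.rm = TRUE)
by_carrier                               # AA 8.5, DL 10, UA 5
```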
#### transmute
```r
transmute(flights_small, #----------------------------------# Using transmute will let you mutate the variables but keeps only
gain = arr_delay - dep_delay, # The Newly mutated variables in your tibble
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours
)
```
### ggplot2
```r
#### engine size / highway MPG ####
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ,
y = hwy)
)
ggplot(data = mpg) + #--------------------------------------------------# Play with changing color to size
geom_point(mapping = aes(x = displ, #-------------------------------# (discrete variable "class" to ordered aesthetic "Size" not advised)
y = hwy,
size = class)
)
ggplot(data = mpg) + #--------------------------------------------------# Transparency of the points
geom_point(mapping = aes(x = displ,
y = hwy,
alpha = class)
)
ggplot(data = mpg) + #--------------------------------------------------# Shape of the points
geom_point(mapping = aes(x = displ,
y = hwy,
shape = class)
)
ggplot(data = mpg) + #--------------------------------------------------# Add Colors by class and add this to the aesthetic layer
geom_point(mapping = aes(x = displ,
y = hwy,
color = class)
)
ggplot(data = mpg) + #--------------------------------------------------# change color of the points
geom_point(mapping = aes(x = displ,
y = hwy),
color = "blue"
)
ggplot(data = mpg) + #--------------------------------------------------# change size and color to continuous variables and shape to a categorical variable
geom_point(mapping = aes(x = displ,
y = hwy,
color = year,
size = cyl,
shape = drv)
)
ggplot(data = mpg) + #--------------------------------------------------# Making the same variable cover multiple aesthetics
geom_point(mapping = aes(x = displ,
y = hwy,
color = cyl,
size = cyl,
shape = drv)
)
ggplot(data = mpg) + #--------------------------------------------------# testing the "stroke" argument
geom_point(mapping = aes(x = displ,
y = hwy,
color = cyl,
size = cyl,
shape = drv,
stroke = 5)
)
ggplot(data = mpg) + #--------------------------------------------------# passing a condition into an aesthetic argument instead of the plain variable
geom_point(mapping = aes(x = displ,
y = hwy,
color = displ < 5,
size = cyl,
shape = drv,
stroke = 5)
)
ggplot(data = mpg) + #--------------------------------------------------# Highway MPG / Engine Cylinders
geom_point(mapping = aes(x = hwy,
y = cyl)
)
ggplot(data = mpg) + #--------------------------------------------------# Type of car / What "Wheel Drive" the car is (f = front, r = rear, 4 = 4wd)
geom_point(mapping = aes(x = class,
y = drv)
)
#### Facets on discrete variables ####
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ,
y = hwy)
) +
facet_wrap(~ class, nrow = 2) #-------------------------------------# the tilde "~" creates an R formula, which is not akin to an equation
ggplot(data = mpg) + #--------------------------------------------------# testing facet grid
geom_point(mapping = aes(x = displ,
y = hwy)
) +
facet_grid(drv ~ cyl)
ggplot(data = mpg) + #--------------------------------------------------# testing facet grid 2
geom_point(mapping = aes(x = displ,
y = hwy)
) +
facet_grid(. ~ cyl) #-----------------------------------------------# puts cyl facet into columns since argument is (r,c)
ggplot(data = mpg) + #--------------------------------------------------# testing facet grid 3
geom_point(mapping = aes(x = displ,
y = hwy)
) +
facet_grid(drv ~ .) #-----------------------------------------------# puts drv facet into rows since argument is (r,c)
#### Changing Geoms ####
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ,
y = hwy)
) #-----------------------------------------------------# from point to smooth
ggplot(data = mpg) + #--------------------------------------------------# line type based on a variable
geom_smooth(mapping = aes(x = displ,
y = hwy,
linetype = drv)
)
ggplot(data = mpg) + #--------------------------------------------------# line type based on a variable with points and colors to show the separation
geom_point(mapping = aes(x = displ,
y = hwy,
color = drv)
) +
geom_smooth(mapping = aes(x = displ,
y = hwy,
linetype = drv,
color = drv)
)
ggplot(data = mpg, mapping = aes(x = displ,
y = hwy)
) +
geom_point() +
geom_smooth()
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ,
y = hwy)
) +
geom_smooth(mapping = aes(x = displ,
y = hwy,
linetype = drv),
show.legend = F
)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ,
y = hwy,
color = drv)
) +
geom_smooth(mapping = aes(x = displ,
y = hwy,
linetype = drv,
color = drv),
show.legend = F
)
#### Bar Plots ####
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
### pins
- <https://www.youtube.com/embed/dsfEsJCiH-E>
### shiny
- Shiny dashboards consist of 2 main elements: the `ui` and the `server` files.
- Shiny dashboards are not great for high user traffic, but they can still handle multiple user sessions. Hosting on RStudio is an option, and Shiny Server can also be hosted on your own system or in containers.
- A great practice is to store Shiny dashboards as R packages so they can be distributed for interactive, reproducible analysis.
- `shinydashboard` and `shinydashboardPlus` are packages that build on the base shiny package to allow richer aesthetic elements and more functionality in the web app.
- For unit testing Shiny applications there is `shinytest`, and for load testing there is `shinyloadtest`.
- To make sure Shiny apps load quickly, use `profvis` and other methods to find and tackle your bigger, slower processes.
- Do not load CSV data into Shiny directly. If you load data like this, do your ETL first and save the data as feather files; feather is slower to write but faster to read than CSV.
- Implement plot caching if the plots are taking a while to load.
- For guided, walk-along presentations of your Shiny apps, see [cicerone](https://github.com/JohnCoene/cicerone).
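The "do your ETL first" advice can be sketched as follows. The notes recommend feather files; this sketch swaps in base R's RDS format purely so that it runs without extra packages, and the data, file paths and cleaning step are all made up for illustration.

```r
set.seed(1)
# Hypothetical raw extract that would normally arrive as a CSV file
raw <- data.frame(id = 1:1000, value = rnorm(1000))
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(raw, csv_path, row.names = FALSE)

# One-off ETL step, run outside the app: parse, clean, save in a binary format
clean <- read.csv(csv_path)
clean$value <- round(clean$value, 3)
saveRDS(clean, rds_path)

# Inside the app: load the prepared file instead of re-parsing the CSV
app_data <- readRDS(rds_path)
identical(app_data, clean)    # TRUE
```

The point of the design is that parsing and cleaning happen once, ahead of time, so each user session only pays for a fast binary read.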
#### Documentation
<iframe width="560" height="315" src="https://www.youtube.com/embed/Wy3TY0gOmJw" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
#### Code
```r
########################################
# A basic example of a shiny dashboard #
########################################
library(shiny)
# ui.R: define UI for application
shinyUI(fluidPage(
# Application title
titlePanel("Old Faithful Geyser Data"),
# Sidebar with input
sidebarLayout(
sidebarPanel(
sliderInput("bins",
"Number of bins:",
min = 1,
max = 50,
value = 30
)#sliderInput
),#sidebarPanel
# Show a plot of the generated distribution
mainPanel(
plotOutput("distPlot")
)#mainPanel
)#sidebarLayout
))#shinyUI(fluidPage(
# server.R: define server logic required to draw a histogram
shinyServer(function(input, output) {
output$distPlot <- renderPlot({
# generate bins based on input$bins from ui.R
x <- faithful[, 2]
bins <- seq(min(x), max(x), length.out = input$bins + 1)
# draw the histogram with the specified number of bins
hist(x, breaks = bins, col = 'darkgray', border = 'white')
})#renderPlot
})#shinyServer
```
## data types
### array
```r
# Array
# The R Objects which can store data in more than 2 dimensions
# Syntax: array(data, dim, dimnames)
#============#
# EXAMPLES #
#============#
arr = array(c(0:15),dim = c(4,4,2,2))
#the data is the 16 values 0-15
#the first two dimensions display the 0-15 values in a 4x4 grid
#the last two dimensions make it a 4d array: 4 copies of the 4x4 grid in a
#2x2 grid of arrays (the 16 values are recycled to fill all 64 cells)
arr
arr2 = array(c(1:9),dim = c(3,3,4,2))
arr2
```
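Indexing an array takes one position per dimension. A short sketch with a smaller 3-dimensional array (the name `arr3` is introduced here for illustration):

```r
# 2 rows x 3 columns x 4 slices, filled column by column with 1..24
arr3 <- array(1:24, dim = c(2, 3, 4))

dim(arr3)             # 2 3 4
arr3[2, 3, 1]         # 6: row 2, column 3, slice 1
last_slice <- arr3[, , 4]   # the fourth 2x3 slice, holding 19..24

# Data shorter than the array is recycled: 16 values fill 64 cells
recycled <- array(0:15, dim = c(4, 4, 2, 2))
length(recycled)      # 64
```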
### data classes
```r
# Data Classes ----
12.6 #Numeric
3 #Numeric
100 #Numeric
"male" #Character
TRUE #Logical
FALSE #Logical
T #Logical
F #Logical
# Data Structures ----
# Vector
# List
# Matrix
# Data Frame
```
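The class of each value can be checked with `class()`. A brief sketch (the variable name `kinds` is just for this example):

```r
# class() reports the data class of each value
kinds <- sapply(list(12.6, 3L, "male", TRUE), class)
kinds               # "numeric" "integer" "character" "logical"

typeof(12.6)        # "double": the storage type behind "numeric"
is.numeric(3L)      # TRUE: integers also count as numeric

# T and F are ordinary variables that default to TRUE and FALSE;
# unlike TRUE and FALSE they can be reassigned, so the long forms are safer
identical(T, TRUE)  # TRUE in a fresh session
```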
### dataframe
```r
# PART 1 ----
# Data Frames
# A Table or a 2-dimensional array-like structure in which each column contains
# values of one variable and each row contains one set of values from each column
# Syntax: data.frame(data)
#============#
# EXAMPLES #
#============#
RowCount = c(1:5)
PeopleNames = c("Bryan","Jude","kelly","janelle","Rosa")
Values = c(15,25,65,145,74)
df <- data.frame(RowCount,PeopleNames,Values)
df
# This alone will display the values of the data frame; each vector in the frame
# represents a vertical column of values, and each contributes one value to each row
data.frame(airquality)
# Built in sample data table, can also import Excel files
# PART 2 ----
myDataFrame <- read.csv("20190208 RC Registry.csv")
# `date` must hold a cutoff value defined beforehand (it is not defined in these notes)
myDataFrame <- myDataFrame[myDataFrame$Medical..Screening.Due.Date < date, c("CDCR", "Medical..Screening.Due.Date")]
myDataFrame
# PART 3 ----
# setting up the data frame vectors
id <- 1:200
group <- c(rep("Vehicle",100),
rep("Drug",100))
response <- c(rnorm(100,mean = 25, sd = 5),
rnorm(100,mean = 23, sd=5))
age <- round(rnorm(200,40,20))
#compiling the data frame
myData <- data.frame(Patient = id,
Patient.Age = age,
Treatment = group,
Response = response)
myData
head(myData,10)
tail(myData,10)
dim(myData)
str(myData)
summary(myData)
# subsetting Data.frames ----
myData[1,2]
myData[2,3]
myData[1:20,2:3] # first 20 rows with columns 2 & 3 present
myData[1:20,] # returns 20 rows and all columns if left blank
myData[,1] # returns everything in the first column only
myData[,"Response"] # returns just the values for the column named "Response"
myData$Response # the dollar sign $ after the name of the data frame returns the entire column without quotes or brackets
myData[myData$Response>26,] # returns all columns for every row that meets the criterion
# of Response > 26
#perform a calculation and then add values to a new column of the data frame
myData$Positive <- myData$Response<26
write.csv(myData[myData$Response>26,],file = "testData.csv", row.names = F) # write a CSV file to the current working directory
# multiple filter criteria and then assigned
# to a new object for ease of exporting to CSV
CSVMyData <- myData[myData$Treatment == "Vehicle"
& myData$Response>26 &
myData$Patient.Age > 0,]
write.csv(CSVMyData,file = "testData.csv", row.names = F)
head(CSVMyData)
```
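Because `rnorm()` draws new random values on every run, the rows that pass a filter such as `Response > 26` change each time. A minimal sketch of making the simulation repeatable with `set.seed()`; the seed value 123 and the names `first`, `second`, `demo` and `high` are arbitrary choices for this example:

```r
set.seed(123)                            # any fixed integer works
first  <- rnorm(5, mean = 25, sd = 5)
set.seed(123)                            # resetting the seed replays the same draws
second <- rnorm(5, mean = 25, sd = 5)
identical(first, second)                 # TRUE

# The same logical subsetting as above, now reproducible between runs
demo <- data.frame(Patient = 1:5, Response = first)
high <- demo[demo$Response > 26, ]
nrow(high) == sum(first > 26)            # TRUE
```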
### list
```r
# List ----
# The R Objects which can contain elements of different types like
# Numbers, Strings, Vectors, and another List inside of it
# Syntax: list(data)
#============#
# EXAMPLES # ----
#============#
vtr1 = c('hello','world')
vtr2 = c(24.6345,3.6,345.5678)
vtr3 = c(45L,'hi')
ls = list(vtr1,vtr2,vtr3)
ls
# Each list item keeps its own data type; separate list items do not
# dictate the data type of the other items
# Coercion only happens inside a single vector that mixes data types:
# vtr3 holds an integer and a string, so c() converts both to character
# Nested Lists ----
list(1,2,list("a","b",list(T,T,F)),"hello",T)
# you can have lists within lists and it displays nested lists effectively in the console
```
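Elements are pulled out of a list with `[[ ]]` or `$`, while single brackets return a smaller list. A short sketch (the `person` and `nested` objects are made up for illustration):

```r
person <- list(name = "Rosa", scores = c(15, 25, 65), active = TRUE)

person[["name"]]          # "Rosa": double brackets extract the element itself
person$scores[2]          # 25: $ extracts by name, then [ ] subsets the vector
class(person["name"])     # "list": single brackets keep the list wrapper
class(person[["name"]])   # "character"

# Nested lists are reached by chaining [[ ]]
nested <- list(1, 2, list("a", "b", list(TRUE, TRUE, FALSE)))
nested[[3]][[2]]          # "b"
nested[[3]][[3]][[3]]     # FALSE
```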
### matrix
```r
# Matrix
# R Objects in which the elements are arranged in a 2 dimensional rectangular layout
# Syntax: matrix(data, nrow, ncol, byrow, dimnames)
# Data: the input vector which becomes the data elements of the matrix
# NRow: number of rows to be created
# NCol: number of columns to be created
# ByRow: a logical clue. If TRUE then the input vector elements are arranged by row
# DimName: The names assigned to the rows and columns
#============#
# EXAMPLES #
#============#
mtr = matrix(c(1:25),5,5)
mtr
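# Passing data that does not fit the dimensions, e.g. matrix(c(5:30), 5, 5), gives:
# Warning message:
# In matrix(c(5:30), 5, 5) :
# data length [26] is not a sub-multiple or multiple of the number of rows [5]
# If your total number of data points spills out of the matrix, a warning (not an error) is returned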
```
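The `byrow` and `dimnames` arguments described above can be seen in a small sketch (the row and column labels are arbitrary):

```r
# Default fill is column by column
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_col[2, 2]              # 4

# byrow = TRUE fills across each row instead; dimnames labels rows and columns
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
                 dimnames = list(c("r1", "r2"), c("a", "b", "c")))
by_row["r2", "b"]         # 5
dim(t(by_row))            # 3 2: t() transposes rows and columns
```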
### null and na
```r
# empty value ----
NA #NA is a logical constant of length 1 which contains a missing value indicator
NULL #NULL represents the null object in R:
# it is a reserved word. NULL is often returned by expressions and
# functions whose value is undefined.
?NA
```
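The practical difference is that `NA` occupies a position in a vector while `NULL` does not. A short sketch:

```r
x <- c(1, NA, 3)
length(x)                  # 3: NA holds a place in the vector
is.na(x)                   # FALSE TRUE FALSE
NA == 1                    # NA: comparisons with a missing value are unknown
sum(x, na.rm = TRUE)       # 4: drop NAs before summing

y <- c(1, NULL, 3)
length(y)                  # 2: NULL disappears when combined
is.null(NULL)              # TRUE
length(NULL)               # 0
```

This is why missing values are tested with `is.na()` rather than with `==`.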
### vectors
```r
#vectors ----
#5 types
# Logical
# Integer
# Numeric (printed with 7 significant digits by default; very large or small values may switch to exponential format)
# Complex
# Character
#============#
# EXAMPLES #----
#============#
#Logical
vtr1 = c(TRUE,FALSE)
#Numeric
vtr2 = c(15,85.674954,999999)
#Integer requires "L" after number to treat as integer
vtr3 = c(35L,58L,146L)
#Integer literal with decimals is converted to numeric with a warning
vtr4 = c(85.64L)
#Mixed data types passed to a vector
vtr5 = c(TRUE,35L,3.14) #TRUE is coerced to 1 when combined with numeric/integer values
#now include a character string
vtr6 = c(TRUE,35L,3.14,"hello")
#===========#
# Results #----
#===========#
class(vtr1)
vtr1
class(vtr2)
vtr2
class(vtr3)
vtr3
class(vtr4)
vtr4
class(vtr5)
vtr5
class(vtr6)
vtr6
# Vector Math ----
numbers <- c(1:10)
numbers * 2
# Subsetting Vectors ----
days <- c("mon", "tue", "wed", "thurs", "fri")
days
# to return a specific value of a vector, put square brackets after the object
# holding the vector, with the index of the value(s) to pull out
days[1]
# square brackets are ALWAYS for subsetting; the parens "()" are for functions
days[c(1,3,5)] #vector within the subsetting brackets to pull out specific values
days[2:5] #if you wanted everything except monday
days[-5] #basically saying "give me everything except what's in index 5", in this case "fri"
```
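The coercion seen in `vtr5` and `vtr6` follows a fixed order: logical, then integer, then numeric, then character. A short sketch of that order, plus subsetting with a logical condition instead of index numbers (variable names here are illustrative):

```r
# Mixed vectors are coerced to the most flexible type present
class(c(TRUE, 35L))                   # "integer"
class(c(TRUE, 35L, 3.14))             # "numeric"
mixed <- c(TRUE, 35L, 3.14, "hello")
mixed                                 # "TRUE" "35" "3.14" "hello"

# Subsetting with a logical condition
numbers <- 1:10
evens <- numbers[numbers %% 2 == 0]   # 2 4 6 8 10
days <- c("mon", "tue", "wed", "thurs", "fri")
days[days != "mon"]                   # everything except "mon"
```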
## data operators
```r
# Arithmetic
# Addition (+)
# Subtraction (-)
# Multiplication (*)
# Division (/)
# Modulus (%%)
# Exponent (^)
# Floor Division (%/%)
# Relational
# Equal To (==) #asking "is this equal to that" returns Bool
# Not Equal To (!=)
# Greater Than (>)
# Less Than (<)
# Greater Than Equal To (>=)
# Less Than Equal To (<=)
# Assignment
# Left
# Equals (=)
# Assign (<-)
# Right
# Assign (->) note that "=" cannot assign to the right
# Logical
# AND (&) element-wise: in 2 vectors, compare each aligned pair of items with AND
# AND (&&) compares single (length-1) values only and short-circuits;
# use all() to test whether every element is TRUE
# NOT (!)
# OR (|) element-wise: in 2 vectors, compare each aligned pair of items with OR
# OR (||) compares single (length-1) values only and short-circuits;
# use any() to test whether at least one element is TRUE
```
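A short sketch of the operators listed above, using made-up values:

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

a & b              # TRUE FALSE FALSE: element-wise AND
a | b              # TRUE TRUE FALSE:  element-wise OR
!a                 # FALSE FALSE TRUE
all(a)             # FALSE: is every element TRUE?
any(a)             # TRUE:  is at least one element TRUE?

x <- 7
x > 5 && x < 10    # TRUE: && is meant for single values, e.g. in if()

17 %/% 5           # 3:  integer (floor) division
17 %% 5            # 2:  remainder
(-17) %/% 5        # -4: the result is floored, not truncated
(-17) %% 5         # 3

5 -> y             # rightward assignment
y                  # 5
```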
## functions
```r
# Functions ----
# Descriptive Statistics Functions ----
myValues <- c(1:100)
myValues
mean(myValues)
median(myValues)
mode(myValues) # note: mode() returns the storage mode ("numeric"), not the statistical mode
min(myValues)
max(myValues)
sum(myValues)
sd(myValues) #standard deviation
class(myValues)
length(myValues)
log(myValues)
log10(myValues)
mySqrt <- sqrt(myValues)
mySqrt
?rnorm # Adding a question mark before the name of a function opens a help pane with all the details
# on that function, this is a great way to learn about what the functions require as arguments
?rgb
hist(rnorm(100, mean = 5))
# Data frame Functions ----
# setting up the data frame vectors
id <- 1:200
group <- c(rep("Vehicle",100),
rep("Drug",100))
response <- c(rnorm(100,mean = 25, sd = 5),
rnorm(100,mean = 23, sd=5))
#compiling the data frame
myData <- data.frame(Patient = id,
Treatment = group,
Response = response)
myData
head(myData,10)
tail(myData,10)
dim(myData)
str(myData)
summary(myData)
# Change the data type of values; this also works for entire columns ----
as.numeric(c("1","2","3"))
as.character(1:10)
# Remove Objects ----
# Object names cannot contain spaces
# Underscores and periods are allowed, but it is best to just stick with camel case
my_Object <- 3
my.Object <- 3
myObject <- 3
# all of these are valid
# to remove an object use the rm() function
rm(my_Object)
rm(my.Object)
```
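Base R has no built-in function for the statistical mode, since `mode()` reports how an object is stored. A minimal sketch of one way to compute it; the helper name `stat_mode` is hypothetical and other implementations are equally valid:

```r
# Hypothetical helper: returns the most frequent value(s) in x
stat_mode <- function(x) {
  ux <- unique(x)                        # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))       # how often each distinct value occurs
  ux[counts == max(counts)]              # keep every value tied for the top count
}

stat_mode(c(2, 3, 3, 5, 5, 5))           # 5
stat_mode(c("a", "b", "a", "b"))         # "a" "b" (ties are all returned)
mode(c(2, 3, 3))                         # "numeric": storage mode, not a statistic
```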
## conditional statements
```r
# IF
# Syntax:
# if (expression)
# {
#   # statements
# }
# ELSE IF
# Syntax:
# if (expression 1)
# {
#   # statements
# } else if (expression 2)
# {
#   # statements
# }
vtr1 <- c(5)
if(vtr1==5)
{
print("hello world")
}
```
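A runnable sketch of the `if` / `else if` / `else` chain, wrapped in a function so it can be reused. The function name `classify` and the thresholds are made up for this example. In R, `else` needs to sit on the same line as the closing brace before it.

```r
# Hypothetical helper that labels a single response value
classify <- function(x) {
  if (x > 26) {
    "high"
  } else if (x > 20) {
    "medium"
  } else {
    "low"
  }
}

classify(30)    # "high"
classify(23)    # "medium"
classify(5)     # "low"

# if() tests one value at a time; ifelse() is the vectorised counterpart
labels <- ifelse(c(30, 23, 5) > 26, "high", "not high")
labels          # "high" "not high" "not high"
```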
## tips-tricks-and-hacks
### How to add indexed areas to your R script:
```r
# Steps:
# 1.) Start a comment line with the hash symbol "#"
# 2.) Type the section name you'd like, then a space
# 3.) Then add 4 dashes "----"
# This will index your code area and make it collapsible as well
# the indexed option appears at the bottom of the editor window
# PART A ----
print("Hello World")
# PART B ----
```
## package development
- [Pkg Dev From Scratch](https://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/)
- [R Pkgs Book](https://r-pkgs.org/index.html)