# Essential R
### What you need to get started
#### Eric Nord
##### Assistant Professor of Biology, Greenville College
##### Fulbright Scholar, Lilongwe University of Agriculture and Natural Resources, Malawi

## Chapter 2: Working with One Variable

### I. Introduction ----
## When working with different types of data, it is very important to know:
## A. What type of data you have (by this I mean what it represents).
## B. What R thinks the data is (or how it is encoded in R).
## C. How to ensure that the answers to A. and B. are the same!

### II. Working with Qualitative Data ----
## Qualitative data, or categorical data, is stored in R as a factor.
cols<-c("Blue","Blue","Red","Red","Blue","Yellow","Green")
summary(cols); cols[2]
cols<-factor(cols); cols[2]
## Note - cols was created as a character variable - its elements still have quotes.
## After conversion to factor there are no quotes, and the levels are reported.
str(cols); summary(cols)
levels(cols)
table(cols)
b<-table(cols) ## creates an R object
b[3]
b[3]*4
## Many functions in R return objects, as we'll see.
## barplot() requires the values it will plot (its height argument) in numeric form.
barplot(table(cols))
plot(cols)
barplot(b/length(cols), col=c("blue","green","red","yellow"), ylab="Proportion") ## b=table(cols)
## Note that plot(cols) gives a barplot -- the data is categorical.
## Note that we can easily add color to plots, and that we can carry out calculations within the call to barplot() (e.g. to calculate proportions). We can also specify the y-axis label (the argument is ylab).
## Note that the col argument to barplot() is *optional*.
## Create a logical vector from a factor using a logical test:
cols=="Red"
which(cols=="Red")
pie(table(cols))
pie(b, col=c("blue","green","red","yellow"))
## See ?pie for a discussion of the limitations of pie charts.
## Now a numerically coded factor:
a<-factor(scan(text="2 4 3 3 2 1 1 2 3 4 2 3 3 4 1 3 2 1 4 3 2 4")) # scan(text="some text string with spaces separating values")
table(a)
levels(a)<-c("<14","15-24","25-34",">35")
table(a)
barplot(table(a))
## Note scan() - when entering a longer list like this it may be easier than using c(), as commas don't need to be entered.
## Note that levels() can set levels as well as return them (a number of other R functions display this dual utility as well).

###### EXTRA: Factors in R ----
# factor() & as.factor() convert character or numeric data to factor; as.numeric() will (sort of) do the opposite.
a<-c(1,1,4,5)
str(a)
(a<-as.factor(a))
str(a)
levels(a)
as.numeric(a)
## Note that factors are stored as a numeric representation of the levels. So a, with values 1,1,4,5, has three levels: 1 4 5. We can see this with str(a) or as.numeric(a): the values are replaced by the level numbers.
## Note: be careful using factor variables in calculations!
## (a=factor(a)) is equivalent to a=factor(a); a
as.numeric(as.character(a))
(b<-c("-.1"," 2.7 ","B"))
str(b)
as.numeric(b)
## Note that when we convert the vector to numeric, the non-numeric value ("B") can't be coerced to a number, so it is replaced by NA - thus the warning.

### ----
#### Hypothesis Testing
# If we roll a die 50 times and get "6" 12 times, how likely is it that the die is fair?
# prop.test() compares an observed and a hypothesized frequency and calculates a p-value for the difference. Our observed frequency is 12 out of 50, the theoretical probability is 1/6, and the alternative hypothesis is that the probability is greater than 1/6.
prop.test(x=12,n=50,p=1/6,alt="greater")
## The p-value here is 0.115, so we don't have very strong evidence of an unfair die.
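## A small extra sketch (binom.test() is not covered in this text, but it comes with every R
## installation): prop.test() uses a normal approximation, while binom.test() computes an exact
## binomial p-value for the same question. The call below mirrors the prop.test() example above
## and should give a similar (though not identical) p-value.
binom.test(x=12,n=50,p=1/6,alternative="greater")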
##### EXTRA: Simulating a hypothesis test for a qualitative variable ----
## sample() will randomly choose values:
## sample(1:6,replace=TRUE) is like rolling a fair die;
## sample(1:6,size=50,replace=TRUE) is like rolling the die 50 times.
sample(x=1:6,size=50,replace=TRUE) # rolling a die 50 times
sum(sample(1:6,50,TRUE)==6) # how many times is it 6?
## Use the up arrow in the console to repeat this - the number of 6's varies.
## Repeat this 100 times to see how frequent a value of 12 or greater is.
## We use a loop (see Chapter 6).
die=rep(NA, 100) # vector to store results
for(i in 1:100){die[i]=sum(sample(1:6,50,TRUE)==6)} # loop 100 times
table(die); sum(die>=12)
## Much faster than rolling a die 5000 times and recording the results!

### ----
### III. Working with Quantitative Data ----
# Here is some (made-up) data that might represent leaf mid-rib length; we'll get some summary statistics:
mid.rib<-scan(text="3.2 4.2 4.1 3.6 3.9 4.0 4.1 3.6 3.5 4.1")
mean(mid.rib)
var(mid.rib)
sd(mid.rib)
median(mid.rib)
summary(mid.rib)
## fivenum() gives similar output to summary(), but the two can differ slightly: fivenum() returns the upper and lower "hinges" rather than the 1st and 3rd quartiles, and these can differ slightly depending on the number of data points.
## "upper hinge" = median of the data points above the median
quantile(x=mid.rib,prob=c(.25,.75))
quantile(mid.rib,c(.18,.36,.54,.72,.90))
## Note that quantile() can return any desired quantile.

#### Hypothesis Testing
## Is the mean of mid.rib 3? Use a t-test.
## Recall: t = difference between observed and hypothesized means in units of the standard error; standard error = standard deviation divided by the square root of n.
t=(mean(mid.rib)-3)/(sd(mid.rib)/length(mid.rib)^0.5)
pt(t,df=length(mid.rib)-1,lower.tail=FALSE)*2 # *2 for 2-sided test
## length(mid.rib)^0.5 = square root of n; we could also use sqrt(length(mid.rib)).
## t is positive, and we want the probability of a more positive value, so we need the upper tail.
t.test(mid.rib,mu=3) # null (hypothesized) value of the mean given in argument mu.

##### EXTRA: Resistant measures of center and spread ----
# The mean and sd are quite sensitive to outliers; resistant measures of center and spread resist the influence of outliers.
mid.rib[11]<-0.8 # add outlier
mean(mid.rib)
median(mid.rib)
mean(mid.rib,trim=0.1)
## The median is higher than the mean, but the trimmed mean is quite a bit nearer the median. Trimming more of the data will get still closer to the median.
## IQR = difference between the 3rd and 1st quartiles
IQR(mid.rib)
## MAD = "median absolute deviation" = median of the absolute differences from the median, scaled by a constant
median(abs(mid.rib-median(mid.rib)))*1.4826
mad(mid.rib)
## The default value of the constant is 1.4826, as this gives a value comparable to the standard deviation for normally distributed data.

### ----
### IV. Visualizing Quantitative Data ----
## Add some more data to mid.rib - note that c() can be used to combine vectors also.
mid.rib[11]<-0.8 # add an outlier
mid.rib<-c(mid.rib,scan(text="3.0 4.3 4.1 3.7 3.8 4.0 4.2 3.4 3.6 4.3"))
hist(mid.rib)
hist(mid.rib,prob=TRUE,col="grey")
## Two versions of the histogram here: in the 1st the y-axis is in frequency, so the scale changes for differing n; the 2nd is in units of probability, so distributions with differing n can be compared.
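## A quick extra sketch of why the probability scale matters (this particular comparison is not
## part of the original example): with prob=TRUE the bar areas sum to 1 whatever the sample size,
## so the first 10 values and the full set of 21 values can be compared on the same y-axis.
par(mfrow=c(1,2)) # split the plot window into 1 row x 2 columns (par(mfrow) is used again below)
hist(mid.rib[1:10],prob=TRUE,col="grey",xlim=c(0,5),main="First 10 values")
hist(mid.rib,prob=TRUE,col="grey",xlim=c(0,5),main="All 21 values")
par(mfrow=c(1,1)) # reset to a single plot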
## A kernel density estimate (KDE), or density estimate, approximates a probability density function.
hist(mid.rib)
lines(density(mid.rib),col="red")
hist(mid.rib,probability=TRUE,col="grey")
lines(density(mid.rib),col="red")
## The KDE approximates the histogram (it should have the same area), but for over-plotting on the histogram, the histogram must be in units of probability.
## Where our density function goes off scale, we can force the histogram to use a longer y-axis with the ylim (y limit) argument.
hist(mid.rib,probability=TRUE,col="grey",ylim=c(0,1))
lines(density(mid.rib),col="red")
## Use rnorm() to generate some random data from the normal distribution.
a<-rnorm(n=40,mean=7,sd=2)
hist(a,prob=T,col="grey")
hist(a,prob=T,col="grey",breaks=seq(from=0.5,to=14.0,by=1.5))
hist(a,prob=T,col="grey",breaks=c(0,4,5,5.5,6,6.5,7,7.5,8.5,14))
## Note the breaks argument, used to specify where the "bins" are in the histogram. Breaks can also be any arbitrary sequence.
## Note that here the argument probability=TRUE is abbreviated as prob=T.
## R is happy to accept unambiguous abbreviations for arguments.
## R is also happy to accept un-named arguments in the order they are entered - so seq(from=0.5,to=14.0,by=1.5) could be given as seq(0.5,14,1.5).

# Boxplots -- can also be "notched" to show a confidence interval about the median. Values beyond the whiskers are possible outliers.
par(mfrow=c(1,2))
boxplot(mid.rib)
boxplot(mid.rib, notch=TRUE,col="cornsilk")
## The value of 0.8 seems rather suspicious, doesn't it?

##### EXTRA: More tools for visualizing quantitative variables ----
## The "stripchart" and the "stem and leaf" plot.
## The "stripchart" is a sort of "one-dimensional scatterplot". The argument method tells R how to display values that would overplot.
par(mar=c(2.1,1,1,1))
par(mfrow=c(3,1))
stripchart(mid.rib)
stripchart(mid.rib,method="jitter")
stripchart(mid.rib,method="stack")
## The stem and leaf plot was originally developed to be quickly created with pencil and paper.
## While it looks simple and a bit crude, it has the advantage that it preserves the original data.
stem(mid.rib)
stem(mid.rib,scale=2)
## The lowest value is 0.8; it occurs once and may be an outlier, while the maximum value is 4.3, which occurs twice and is not likely to be an outlier. Stem and leaf plots are useful for exploratory data analysis (EDA), but I've rarely seen them published; histograms are much more commonly used.

### V. Converting Quantitative Data to Qualitative ----
## We can bin a quantitative variable into categories using cut().
m.r = cut(mid.rib,breaks=c(0,1,2,3,4,5)) # specify the breaks
## Note - breaks can be any grouping, just as with hist().
m.r
## Note the non-matching brackets here: (3,4] - this means that a value of 3.0 is not included, and a value of 4.0 is included, in the interval (3,4].
which(mid.rib==4.0)
m.r[which(mid.rib==4.0)] # values of 4.0 coded as (3,4]
table(m.r)
levels(m.r)
levels(m.r) <- c("tiny","small","medium","med-large","large")
table(m.r)
plot(m.r)
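## An extra note, using cut()'s labels argument (not shown above): the cut() and levels() steps
## can be combined into one call by naming the categories directly. m.r2 is just a new object name
## used here so that m.r is left unchanged.
m.r2<-cut(mid.rib,breaks=c(0,1,2,3,4,5),labels=c("tiny","small","medium","med-large","large"))
table(m.r2)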
### VI. Exercises. ----
## 1) The R function `rnorm(n,mean,sd)` generates random numbers from a normal distribution. Use `rnorm(100)` to generate 100 values and make a histogram. Repeat two or three times. Are the histograms the same?
## 2) Make a histogram from the following data, and add a density estimate line to it (use `scan()` to enter the data). Try changing the bandwidth parameter for the density estimate (see `?density`).
## 26 30 54 25 70 52 51 26 67 18 21 29 17 12 18 35 30 36 36 21 24 18 10 43 28 15 26 27
## 3) The function `rep()` makes repeated series - for example, try `rep(4,times=5)` and `rep(c(1,2),each=3)`. Use `rep()` to enter the sequence `1 1 1 1 2 2 2 2 3 3 3 3` repeated 3 times.
## Now convert it to a factor with the levels "Low", "Medium", and "High". Can you change the levels to "Never", "Rarely", "Sometimes"?
## 4) Convert the factor from Ex. 3 into a character vector. Can you convert the character vector to a numeric vector?

## Chapter 3: Getting Help ----
?mean # look for help on mean()
# search in the "Help" tab
# function hinting
# example code -- copy and paste the code into your console and run it
# links to other help files for related functions
# http://www.rseek.org
# create an account and post questions to the "R help mailing list" (see the "Mailing Lists" link at http://www.r-project.org/) -- be sure to follow the posting guidelines

### II. Exercises. ----
# 1) Practice running help examples with ?hist and ?plot: what can you learn about these functions?
# 2) Follow some links from ?hist and ?plot and run some examples - what did you learn from any of these linked files?
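##### EXTRA: A few more ways to search for help from the console ----
## (A brief sketch; these functions are not discussed above, but all come with a standard R installation.)
help.search("histogram") # search installed help files by keyword
??histogram              # shorthand for help.search()
apropos("mean")          # list objects whose names contain "mean"
example(hist)            # run the example code from a help file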