# Essential R
### What you need to get started
#### Eric Nord
##### Assistant Professor of Biology, Greenville College
##### Fulbright Scholar, Lilongwe University of Agriculture and Natural Resources, Malawi

## Chapter 2: Working with One Variable

### I. Introduction ----
## When working with different types of data, it is very important to know:
## A. What type of data you have (by this I mean what it represents).
## B. What R thinks the data is (or how it is encoded in R).
## C. How to ensure that the answers to A. and B. are the same!

### II. Working with Qualitative Data ----
## Qualitative data, or categorical data, is stored in R as a factor.
cols<-c("Blue","Blue","Red","Red","Blue","Yellow","Green")
summary(cols); cols[2]
cols<-factor(cols); cols[2]
## Note - cols was created as a character variable - its elements still have quotes.
## After conversion to factor there are no quotes, and the levels are reported.
str(cols); summary(cols)
levels(cols)
table(cols)
b<-table(cols) ## creates an R object
b[3]
b[3]*4
## Many functions in R return objects, as we'll see.
## barplot() requires the values it will plot (its height argument) in numeric form.
barplot(table(cols))
plot(cols)
barplot(b/length(cols), col=c("blue","green","red","yellow"), ylab="Proportion") ## b=table(cols)
## Note that plot(cols) gives a barplot -- the data is categorical.
## Note that we can easily add color to plots, and that we can carry out calculations within the call to barplot() (e.g. to calculate proportions). We can also specify the y-axis label (the argument is ylab).
## Note that the col argument to barplot() is *optional*.
## Create a logical vector from a factor using a logical test:
cols=="Red"
which(cols=="Red")
pie(table(cols))
pie(b, col=c("blue","green","red","yellow"))
## See ?pie for a discussion of the limitations of pie charts.
## Now a numerically coded factor:
a<-factor(scan(text="2 4 3 3 2 1 1 2 3 4 2 3 3 4 1 3 2 1 4 3 2 4")) # scan(text="some text string with spaces separating values")
table(a)
levels(a)<-c("<14","15-24","25-34",">35")
table(a)
barplot(table(a))
## Note scan() - when entering a longer list like this it may be easier than using c(), as commas don't need to be entered.
## Note that levels() can set levels as well as return them (a number of other R functions display this dual utility as well).

###### EXTRA: Factors in R ----
# factor() & as.factor() convert character or numeric data to factor; as.numeric() will (sort of) do the opposite.
a<-c(1,1,4,5)
str(a)
(a<-as.factor(a))
str(a)
levels(a)
as.numeric(a)
## Note that factors are stored as a numeric representation of the levels. So a, with values 1,1,4,5, has three levels: 1 4 5. We can see this with str(a) or as.numeric(a): the values are replaced by the level numbers.
## Note: be careful using factor variables in calculations!
## (a=factor(a)) is equivalent to a=factor(a); a
as.numeric(as.character(a))
(b<-c("-.1"," 2.7 ","B"))
str(b)
as.numeric(b)
## Note that when we convert the vector to numeric, the non-numeric value ("B") can't be coerced to a number, so it is replaced by NA - thus the warning.

### ----
#### Hypothesis Testing
# If we roll a die 50 times and get "6" 12 times, how likely is it that the die is fair?
# prop.test() compares an observed and a hypothesized frequency and calculates a p-value for the difference. Our observed frequency is 12 out of 50, the theoretical probability is 1/6, and the alternative hypothesis is that the probability is greater than 1/6.
prop.test(x=12,n=50,p=1/6,alt="greater")
## The p-value here is 0.115, so we don't have very strong evidence of an unfair die.
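## A small extra sketch (binom.test() is not covered in this text, but it comes with every R
## installation): prop.test() uses a normal approximation, while binom.test() computes an exact
## binomial p-value for the same question. The call below mirrors the prop.test() example above
## and should give a similar (though not identical) p-value.
binom.test(x=12,n=50,p=1/6,alternative="greater")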
##### EXTRA: Simulating a hypothesis test for a qualitative variable ----
## sample() will randomly choose values:
## sample(1:6,replace=TRUE) is like rolling a fair die;
## sample(1:6,size=50,replace=TRUE) is like rolling the die 50 times.
sample(x=1:6,size=50,replace=TRUE) # rolling a die 50 times
sum(sample(1:6,50,TRUE)==6) # how many times is it 6?
## Use the up arrow in the console to repeat this - the number of 6's varies.
## Repeat this 100 times to see how frequent a value of 12 or greater is.
## We use a loop (see Chapter 6).
die=rep(NA, 100) # vector to store results
for(i in 1:100){die[i]=sum(sample(1:6,50,TRUE)==6)} # loop 100 times
table(die); sum(die>=12)
## Much faster than rolling a die 5000 times and recording the results!

### ----
### III. Working with Quantitative Data ----
# Here is some (made-up) data that might represent leaf mid-rib length; we'll get some summary statistics:
mid.rib<-scan(text="3.2 4.2 4.1 3.6 3.9 4.0 4.1 3.6 3.5 4.1")
mean(mid.rib)
var(mid.rib)
sd(mid.rib)
median(mid.rib)
summary(mid.rib)
## fivenum() gives similar output to summary(), but the two can differ slightly: fivenum() returns the upper and lower "hinges" rather than the 1st and 3rd quartiles, and these can differ slightly depending on the number of data points.
## "upper hinge" = median of the data points above the median
quantile(x=mid.rib,prob=c(.25,.75))
quantile(mid.rib,c(.18,.36,.54,.72,.90))
## Note that quantile() can return any desired quantile.

#### Hypothesis Testing
## Is the mean of mid.rib 3? Use a t-test.
## Recall: t = difference between observed and hypothesized means in units of the standard error; standard error = standard deviation divided by the square root of n.
t=(mean(mid.rib)-3)/(sd(mid.rib)/length(mid.rib)^0.5)
pt(t,df=length(mid.rib)-1,lower.tail=FALSE)*2 # *2 for 2-sided test
## length(mid.rib)^0.5 = square root of n; we could also use sqrt(length(mid.rib)).
## t is positive, and we want the probability of a more positive value, so we need the upper tail.
t.test(mid.rib,mu=3) # null (hypothesized) value of the mean given in argument mu.

##### EXTRA: Resistant measures of center and spread ----
# The mean and sd are quite sensitive to outliers; resistant measures of center and spread resist the influence of outliers.
mid.rib[11]<-0.8 # add outlier
mean(mid.rib)
median(mid.rib)
mean(mid.rib,trim=0.1)
## The median is higher than the mean, but the trimmed mean is quite a bit nearer the median. Trimming more of the data will get still closer to the median.
## IQR = difference between the 3rd and 1st quartiles
IQR(mid.rib)
## MAD = "median absolute deviation" = median of the absolute differences from the median, scaled by a constant
median(abs(mid.rib-median(mid.rib)))*1.4826
mad(mid.rib)
## The default value of the constant is 1.4826, as this gives a value comparable to the standard deviation for normally distributed data.

### ----
### IV. Visualizing Quantitative Data ----
## Add some more data to mid.rib - note that c() can be used to combine vectors also.
mid.rib[11]<-0.8 # add an outlier
mid.rib<-c(mid.rib,scan(text="3.0 4.3 4.1 3.7 3.8 4.0 4.2 3.4 3.6 4.3"))
hist(mid.rib)
hist(mid.rib,prob=TRUE,col="grey")
## Two versions of the histogram here: in the 1st the y-axis is in frequency, so the scale changes for differing n; the 2nd is in units of probability, so distributions with differing n can be compared.
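## A quick extra sketch of why the probability scale matters (this particular comparison is not
## part of the original example): with prob=TRUE the bar areas sum to 1 whatever the sample size,
## so the first 10 values and the full set of 21 values can be compared on the same y-axis.
par(mfrow=c(1,2)) # split the plot window into 1 row x 2 columns (par(mfrow) is used again below)
hist(mid.rib[1:10],prob=TRUE,col="grey",xlim=c(0,5),main="First 10 values")
hist(mid.rib,prob=TRUE,col="grey",xlim=c(0,5),main="All 21 values")
par(mfrow=c(1,1)) # reset to a single plot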
## A kernel density estimate (KDE), or density estimate, approximates a probability density function.
hist(mid.rib)
lines(density(mid.rib),col="red")
hist(mid.rib,probability=TRUE,col="grey")
lines(density(mid.rib),col="red")
## The KDE approximates the histogram (it should have the same area), but for over-plotting on the histogram, the histogram must be in units of probability.
## Where our density function goes off scale, we can force the histogram to use a longer y-axis with the ylim (y limit) argument.
hist(mid.rib,probability=TRUE,col="grey",ylim=c(0,1))
lines(density(mid.rib),col="red")
## Use rnorm() to generate some random data from the normal distribution.
a<-rnorm(n=40,mean=7,sd=2)
hist(a,prob=T,col="grey")
hist(a,prob=T,col="grey",breaks=seq(from=0.5,to=14.0,by=1.5))
hist(a,prob=T,col="grey",breaks=c(0,4,5,5.5,6,6.5,7,7.5,8.5,14))
## Note the breaks argument, used to specify where the "bins" are in the histogram. Breaks can also be any arbitrary sequence.
## Note that here the argument probability=TRUE is abbreviated as prob=T.
## R is happy to accept unambiguous abbreviations for arguments.
## R is also happy to accept un-named arguments in the order they are entered - so seq(from=0.5,to=14.0,by=1.5) could be given as seq(0.5,14,1.5).

# Boxplots -- can also be "notched" to show a confidence interval about the median. Values beyond the whiskers are possible outliers.
par(mfrow=c(1,2))
boxplot(mid.rib)
boxplot(mid.rib, notch=TRUE,col="cornsilk")
## The value of 0.8 seems rather suspicious, doesn't it?

##### EXTRA: More tools for visualizing quantitative variables ----
## The "stripchart" and the "stem and leaf" plot.
## The "stripchart" is a sort of "one-dimensional scatterplot". The argument method tells R how to display values that would overplot.
par(mar=c(2.1,1,1,1))
par(mfrow=c(3,1))
stripchart(mid.rib)
stripchart(mid.rib,method="jitter")
stripchart(mid.rib,method="stack")
## The stem and leaf plot was originally developed to be quickly created with pencil and paper.
## While it looks simple and a bit crude, it has the advantage that it preserves the original data.
stem(mid.rib)
stem(mid.rib,scale=2)
## The lowest value is 0.8; it occurs once and may be an outlier, while the maximum value is 4.3, which occurs twice and is not likely to be an outlier. Stem and leaf plots are useful for exploratory data analysis (EDA), but I've rarely seen them published; histograms are much more commonly used.

### V. Converting Quantitative Data to Qualitative ----
## We can bin a quantitative variable into categories using cut().
m.r = cut(mid.rib,breaks=c(0,1,2,3,4,5)) # specify the breaks
## Note - breaks can be any grouping, just as with hist().
m.r
## Note the non-matching brackets here: (3,4] - this means that a value of 3.0 is not included, and a value of 4.0 is included, in the interval (3,4].
which(mid.rib==4.0)
m.r[which(mid.rib==4.0)] # values of 4.0 coded as (3,4]
table(m.r)
levels(m.r)
levels(m.r) <- c("tiny","small","medium","med-large","large")
table(m.r)
plot(m.r)
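## An extra note, using cut()'s labels argument (not shown above): the cut() and levels() steps
## can be combined into one call by naming the categories directly. m.r2 is just a new object name
## used here so that m.r is left unchanged.
m.r2<-cut(mid.rib,breaks=c(0,1,2,3,4,5),labels=c("tiny","small","medium","med-large","large"))
table(m.r2)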
### VI. Exercises. ----
## 1) The R function `rnorm(n,mean,sd)` generates random numbers from a normal distribution. Use `rnorm(100)` to generate 100 values and make a histogram. Repeat two or three times. Are the histograms the same?
## 2) Make a histogram from the following data, and add a density estimate line to it (use `scan()` to enter the data). Try changing the bandwidth parameter for the density estimate (see `?density`).
## 26 30 54 25 70 52 51 26 67 18 21 29 17 12 18 35 30 36 36 21 24 18 10 43 28 15 26 27
## 3) The function `rep()` makes repeated series - for example, try `rep(4,times=5)` and `rep(c(1,2),each=3)`. Use `rep()` to enter the sequence `1 1 1 1 2 2 2 2 3 3 3 3` repeated 3 times.
## Now convert it to a factor with the levels "Low", "Medium", and "High". Can you change the levels to "Never", "Rarely", "Sometimes"?
## 4) Convert the factor from Ex. 3 into a character vector. Can you convert the character vector to a numeric vector?

## Chapter 3: Getting Help ----
?mean # look for help on mean()
# search in the "Help" tab
# function hinting
# example code -- copy and paste the code into your console and run it
# links to other help files for related functions
# http://www.rseek.org
# create an account and post questions to the "R help mailing list" (see the "Mailing Lists" link at http://www.r-project.org/) -- be sure to follow the posting guidelines

### II. Exercises. ----
# 1) Practice running help examples with ?hist and ?plot: what can you learn about these functions?
# 2) Follow some links from ?hist and ?plot and run some examples - what did you learn from any of these linked files?
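##### EXTRA: A few more ways to search for help from the console ----
## (A brief sketch; these functions are not discussed above, but all come with a standard R installation.)
help.search("histogram") # search installed help files by keyword
??histogram              # shorthand for help.search()
apropos("mean")          # list objects whose names contain "mean"
example(hist)            # run the example code from a help file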