Press **→**, scroll, or swipe to advance

Made with **Keydown**

Department of Statistics, Iowa State University

Econ workshops on R

Nov 8, 2011

- What is R
- Computing
- grammar
- data manipulation
- simulation
- modelling

- Graphics
- base graphics
- ggplot2

- Other useful stuff
- debugging
- 3D plots, animations, interactive graphics

- I write homework solutions with R: https://github.com/yihui/stat579/downloads
- I write my own homework with R
- I play with distributions in R: http://yihui.name/en/2010/04/demonstrating-the-power-of-f-test-with-gwidgets/
- I grab datasets from webpages using R: http://yihui.name/en/2010/10/grabbing-tables-in-webpages-using-the-xml-package/

- I set fireworks with R: http://yihui.name/en/2011/01/happy-new-year-with-r-2011-fireworks/
- I play mine sweeper with R: http://yihui.name/en/2011/08/the-fun-package-use-r-for-fun/
- ...

- oriented to statistics (
`mean()`

,`var()`

,`lm()`

,`glm()`

,`rnorm()`

,`boxplot()`

, ...) - heavily vectorized (very important! can avoid loops in many, many cases)
- object-oriented, but in different forms (no
`obj.method()`

, but often`method(obj)`

) - unbeatable number of packages (~3400 now)
- free (in both senses of beer and freedom)

- Download: http://cran.r-project.org (you may choose the Iowa State mirror)
- three major platforms (Win, Linux, Mac)

- Editor? RStudio may be a good choice (alternatives: Notepad++/NppToR, Emacs/ESS, Eclipse/StatET, ...)
- anything but Notepad under Windows

`?`

to read documentatione.g. try `?lm`

for help on linear models

`help.start()`

for a view on the whole package## use install.packages(), e.g. install.packages('animation')

You will be asked to choose a mirror.

Too many packages?? Use the task view: http://cran.r-project.org/web/views/ (e.g. Econometrics)

`update.packages()`

- assignment by
`<-`

or`=`

(experts recommend the former but I do not believe them) `?Arithmetic`

(e.g.`x+y`

,`x %% y`

)- indexing by
`[]`

x <- 10 y <- 15:3 1 + 2 11 %% 2 11 %/% 2 3^4 log(10) y[4] y[-1] # what do negative indices mean?

`function_name(arguments)`

z = rnorm(100) fivenum # what are the arguments of this function? fivenum(z) fivenum(x = z, na.rm = TRUE) fivenum(z, TRUE)

## the if-else statement if (TRUE) { print(1) } else { print(2) } ## a for-loop (the other type is while-loop) s = 0; x = c(4, 2, 6) for (i in 1:3) { s = s + x[i] } s ## the above loop is the most stupid thing to do in R

- a series of functions like
`read.table()`

,`read.csv()`

, ... - can work with databases too (need add-on packages like RODBC, RMySQL, ...)

## the tips data tips = read.csv('http://dicook.public.iastate.edu/Army/tips.csv') str(tips) # structure of an object; one of the most useful function in R summary(tips) mean(tips$bill) # index by $name var(tips[, 'tip']) # index by character name table(tips[, 4]) # index by column number hist(tips$bill) boxplot(tip~sex, data=tips) plot(tip~bill, data=tips, col=as.integer(tips$sex))

- vector
- matrix, array
- list (very flexible; perhaps second most widely used)
- ...

A family of functions for distributions: `dfun()`

, `pfun()`

, `qfun()`

and `rfun()`

For example, `rnorm()`

library(animation) demo('fire', ask = FALSE) # an application of image()

Q: How many times do we need to flip the coin until we get a sequence of HTH and HTT respectively? (For example, for the sequence H*HTH*, the number for HTH to appear is 4, and in THT*HTT*, the number for HTT is 6.)

A: http://yihui.name/en/2009/08/counterintuitive-results-in-flipping-coins/

Take the tips data for example

fit1 = lm(tip ~ bill, data = tips) summary(fit1)

You get

Call: lm(formula = tip ~ bill, data = tips) Residuals: Min 1Q Median 3Q Max -3.1982 -0.5652 -0.0974 0.4863 3.7434 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.920270 0.159735 5.761 2.53e-08 *** bill 0.105025 0.007365 14.260 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.022 on 242 degrees of freedom Multiple R-squared: 0.4566, Adjusted R-squared: 0.4544 F-statistic: 203.4 on 1 and 242 DF, p-value: < 2.2e-16

The formula is an important component of many R functions for modelling.

fit2 = lm(tip ~ bill + sex, data = tips) # two variables fit3 = lm(tip ~ bill + 0, data = tips) # without intercept ## you try summary() on them

Before we get started, let's see if you can install the `gWidgetsRGtk2`

package:

install.packages('gWidgetsRGtk2') library(gWidgetsRGtk2) ## follow instructions in case of errors

- base graphics
- the
`graphics`

package - once drawn, no way to modify it again (have to redraw everything)
- functions to draw points, lines, polygons, ... like other languages
- many built-in types of plots (histogram, boxplot, bar chart, ...)

- the
- grid graphics
- the
`grid`

package - more object-oriented: graphical elements are objects
- can be modified without explicitly redraw the whole plot
- more like an infrastructure package (no built-in plot types)

- the

There are many add-on packages based on the two systems; see the Graphics task view on CRAN for an overview: http://cran.r-project.org/web/views/Graphics.html

`lattice`

: Trellis plots- sub-plots conditional on categorical variables
- shipped with R (no need to install; just
`library(lattice)`

)

`ggplot2`

: Grammar of Graphics- a truly masterpiece
- amazing abstraction
- http://had.co.nz/ggplot2
- have to install:
`install.packages('ggplot2')`

Last week we mentioned the tips data.

tips = read.csv('http://dicook.public.iastate.edu/Army/tips.csv') str(tips) ## scatter plot: positive correlation with a 'constraint' plot(tip ~ bill, data = tips) ## what is the problem with R's default choice of point shapes? ## plot() is a very 'tricky' function in R; details later hist(tips$tip, main = 'histogram of tips') ## you see nothing except a right-skewed distribution hist(tips$tip, breaks = 30) # more bins hist(tips$tip, breaks = 100) # what do you see now?

20-30 years ago, the research on choosing the histogram binwidth was extremely hot in statistics, but... who cares?

We can change the binwidth interactively in R via many tools; one possibility is to build a GUI

## I prefer gWidgetsRGtk2, but I do not know if you can install it ## gWidgetstcltk is easier to install if (!require('gWidgetstcltk')) install.packages('gWidgetstcltk') library(gWidgetstcltk) # or library(gWidgetsRGtk2) options(guiToolkit = 'tcltk') # or options(guiToolkit = 'RGtk2') x = tips$tip gslider(from = 1, to = 100, by = 1, container = gwindow('Change the number of bins'), handler = function(h, ...) { hist(x, breaks = seq(min(x), max(x), length = svalue(h$obj))) }) ## many many other GUI elements to use

There are many color models in R, like `rgb()`

, `hsv()`

, ... And there are built-in color names, e.g. `'red'`

, `'purple'`

. Here is an example of `rgb()`

rgb(1, 0, 0) # red (hexidecimal representation) rgb(1, 1, 0) # yellow ## an interactive example g = ggroup(horizontal = FALSE, container = gwindow('Color Mixer')) x = c(0,0,0) # red, green, blue for(i in 1:3) { gslider(from = 0, to = 1, by = 0.05, action = i, handler = function(h, ...) { x[h$action] <<- svalue(h$obj) par(bg = rgb(x[1], x[2], x[3]), mar = rep(0, 4)) plot.new() }, container = g) } colors() # all names you can use plot(rnorm(30), pch = 19, col = sample(colors(), 30), cex = 2)

Old-fashioned but many goodies...

- open
`help.start()`

and take a look at the`graphics`

package - all you need to learn about base graphics is there
- many types of plots of interest:
`contour()`

,`filled.contour()`

,`fourfoldplot()`

,`mosaicplot()`

,`pairs()`

,`smoothScatter()`

,`stripchart()`

,`sunflowerplot()`

,`symbols()`

Graphics are not as easy as you might have imagined

- avoid pie charts (why?)
- avoid 3D plots (what?!)
- unless they are interactive (e.g. the
`rgl`

package) - an alternative is the contour plot

- unless they are interactive (e.g. the
- consider color-blind people
- The Elements of Graphing Data (William S Cleveland)
- order of precision (length good; angle bad; ...)

Only the source code is real.

x = seq(.1, 10, .1) plot(x, 1/x, type = 'l', lwd = 2) lines(x, 1/x + 2, col = 'red', lwd = 2)

- you have to wrestle with gory details in base graphics
- yes, it is flexible
- but you have to take care of everything
- point symbols, colors, line types, line width, legend, ...

- common tasks in graphics
- color the points according to the
`sex`

variable - different point symbols denote the
`smoker`

variable - darker points denote larger parties (the
`size`

variable) - add a smoothing/regression line on a scatter plot
- ...

- color the points according to the

We still use the `tips`

data here.

library(ggplot2) ## different colors denote the sex variable qplot(bill, tip, data = tips, color = sex) ## point symbols qplot(bill, tip, data = tips, shape = smoker) ## you can manipulate ggplot2 objects p = qplot(bill, tip, data = tips, color = size) p ## do not like the color scheme? change it p + scale_colour_gradient2(low="white", high="blue") ## faceting qplot(tip, data = tips, facets = time ~ day) p + geom_smooth() # smoothing line

Unlike most of R packages, ggplot2 has its own website of documentation, which is a rich source of examples.

- boxplots: http://had.co.nz/ggplot2/geom_boxplot.html
- contours: http://had.co.nz/ggplot2/stat_contour.html
- hexagons: http://had.co.nz/ggplot2/stat_binhex.html
- Pac man chart: http://had.co.nz/ggplot2/coord_polar.html

As mentioned before, there are many other packages based on the two graphics systems.

`animation`

: a gallery of statistical animations and tools to export animations`rgl`

: interactive 3D plots`iplots`

: interactive statistical graphics`rggobi`

: connect R with GGobi (a standalone software package for interactive stat graphics)

## be prepared install.packages(c('animation', 'rgl', 'iplots'))

The idea is simple:

## rotate the word 'Animation' for (i in 1:360) { plot(1, ann = FALSE, type = "n", axes = FALSE) text(1, 1, "Animation", srt = i, col = rainbow(360)[i], cex = 7 * i/360) Sys.sleep(0.01) }

library(animation) ?brownian.motion ?quincunx ?grad.desc ## export to an HTML page ?saveHTML ## want more hilarious examples? try demo('busybees') demo('CLEvsLAL')

Play with statistical graphics.

## use your mouse (drag or wheel up/down) library(rgl) demo('rgl') ## an artificial dataset library(animation) demo('pollen') ## linked plots library(iplots) ibar(tips$sex) ihist(tips$tip)

- R can compute on its own code:
`parse()`

and`deparse()`

- the
**formatR**package uses the side effect of them to reformat R code - you give me
`1+1`

and I return you`1 + 1`

(what the heck is the difference?)

well, compare

for(k in 1:10){j=cos(sin(k)*k^2)+3;print(j-5)}

to

for (k in 1:10) { j = cos(sin(k) * k^2) + 3 print(j - 5) }

The function `debug`

can be used to debug a function.

f = function(x) { m = length(x) x[is.na(x)] = mean(x, na.rm = TRUE) # impute by mean s = sum(x^2) # sum of squares s } f(c(1, NA, 2)) ## begin to debug the function now debug(f) f(c(1, NA, 2)) undebug(f)