Modern Applied Statistics Using R

Motivation

The statistical package R offers a rich and flexible environment for data analysis and modeling. Due to its free availability and easy extensibility, it has become widely popular for implementing and distributing modern statistical procedures.

The course is intended for graduate students and researchers and has three goals:

To enable students to perform familiar statistical tasks easily and comfortably in the new environment.
To expose students to modern methods for data analysis and modeling in an applied and hands-on manner.
To enable students to extend their knowledge about and mastery of the R system independently and according to their needs and interests.

For this purpose, the course will combine familiar material (elementary statistics, linear and generalized linear models, classical plots) with newer methods (machine learning, multiple testing and visualization). The newer methods covered are necessarily only a small selection of what is available; rather than trying to cover everything, we will aim to identify suitable tools for a given problem.

Organisation

The course is intended for students with previous training and/or experience in statistics: familiarity with statistical testing, linear models and logistic regression is assumed.

The course is planned for 1.5 university credits (1 week of education/training), distributed as half-days over ten weeks. During the first eight weeks, each unit will consist of an introductory lecture, followed by a computer lab where students can immediately apply the new methods. During the ninth week, students will work in small groups (2-3 persons) on a project assignment, which they will present in the last week.

Attendance will be compulsory. Missed lectures or computer labs must be made up by completing and handing in extra reading and computing assignments. Grading will be based on performance during computer labs (50%) and the final project (50%). The passing grade will be 75%.

Literature

Either of the following books covers substantial parts of the course material and can easily be supplemented with handouts:

William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Fourth Edition. Springer, New York, 2002.
Brian Everitt and Torsten Hothorn. A Handbook of Statistical Analyses Using R. Chapman & Hall/CRC, Boca Raton, FL, 2006.

In addition, the following books offer in-depth treatment of individual subjects from the course:

Robert Gentleman, Vince Carey, Wolfgang Huber, Rafael Irizarry, and Sandrine Dudoit, editors. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Statistics for Biology and Health. Springer, 2005.
Frank E. Harrell. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer, 2001.
Sandrine Dudoit and Mark J. van der Laan. Multiple Testing Procedures and Applications to Genomics. Springer Series in Statistics. Springer, 2007.
Richard C. Deonier, Simon Tavaré, and Michael S. Waterman. Computational Genome Analysis: An Introduction. Springer, 2005.

Material

September 3: Handout Demonstration R code Assignment
EnExp.txt EnExp.xls
September 10: Handout Demonstration R code Assignment
Sleep.txt chol.txt
September 17: Handout Demonstration R code Assignment
cars.RData
September 24: Handout Demonstration R code Assignment
Demonstration script file
Correction to the assignment handed out at the lecture: the data set cuckoos is part of the package DAAG. Sorry!
October 8: Handout Demonstration R code Assignment
Golub.RData
October 15: Handout Demonstration R code Assignment
stockholmBC.zip, sample Sweave file, PDF file generated from Sweave

Projects

On their own page

21.10.2007 A. Ploner . Medical Epidemiology & Biostatistics . Karolinska Institutet