Introduction to Biostatistics

An introductory course on data handling and biostatistics for students studying towards a Bachelor of Health Sciences (BHSc) Honours in Physiology at the University of the Witwatersrand, South Africa. The course is based around the statistical programming language R.

The aims of the course are to introduce participants to the basics of data wrangling, plotting, and reproducible data analysis and reporting. These aims are explored using the statistical computing programme R in the RStudio integrated development environment (IDE), and git (with the GitHub web-based git repository hosting service) for version control. The reason for choosing these apps is that they are free (as in beer and as in speech), and have well-established and active user and developer communities. You need a basic working knowledge of the command line, R and git to complete the course. So if you are not familiar with these apps, I suggest that you complete some free online courses before starting (see examples below).

Course assessment

The year mark for the course will constitute 40% of the total course mark, and will be assessed by a series of 6 short assignments, each worth 10 marks. The biostatistics examination will constitute the remaining 60% of the total mark for the course. Assignments must be submitted by 23:59 on the due date. No extensions will be granted, and 10% will be deducted from the assignment mark for each day the assignment is late.

The table below provides a link to the assignments and indicates the due date for each assignment.

Assignment Link Submission deadline
1 access link 18 July 2016
2 access link 22 July 2016
3 access link 25 July 2016
4 access link 29 July 2016
5 access link 15 August 2016
6 access link 29 August 2016

Lectures

Lecture Content Slides
(may load slowly)
Introduction Course overview Slideshow
Lecture 1 Basic concepts and tools for reproducible research Slideshow
Lecture 2 Data munging Slideshow
Lecture 3 Things to know before you start data analysis Slideshow
Lecture 4 A (very brief) introduction to plotting in R Slideshow
Lecture 5 and 6 Cookbook of commonly used statistical tests Slideshow
Lecture 7 Confidence intervals Slideshow
Lecture 8 Correlation and regression Slideshow

Tutorials

These tutorials do not count for course credit, but give you a chance to get hands-on experience applying what you learn in the lectures. The tutorials will take place with the course instructor in the computer laboratory immediately after the relevant lecture has finished. You may work alone or in groups. You can also work through them tutorials in your own time.

Tutorial files
Tutorial 1 Complete the swirl course: ‘R Programming’
Tutorial 2 RMarkdown[download] the file, and follow the instructions
Tutorial 3 Complete the swirl course: ‘Getting and Cleaning Data’
Tutorial 4 Complete the swirl course: ‘Exploratory Data Analysis’
(sections: 5, 7, 8, 9, 10)
Tutorial 5 Complete the swirl course: ‘Statistical Inference’
Tutorial 6 Complete the swirl course: ‘Regression Models’

The majority of the tutorials are deployed through the R package swirl. The swirl package was developed by the Swirl Development Team, and includes a suite of step-by-step interactive training courses on R, which are aimed primarily at the novice and intermediate R user.

Follow the instructions below to access swirl courses:

# Script to install and use 'swirl'
# If you haven't already installed swirl
install.packages('swirl')

# Load the 'swirl' package
library(swirl)

# Launch a 'swirl' session and follow the prompts
swirl()

To install swirl courses:

# Load the 'swirl' package
library(swirl)

# Download a course from the 'swirl' github repository
install_from_swirl('Course Name Here')

# Luanch a 'swirl' session
swirl()

Resources

Getting started

Configuring git

Global configuration

You need to configure git after you first install it. If you are going to be the only one using the computer, then open Terminal (OSX and Linux) or Git Bash (Windows) and enter the following text (substituting your username and email address as required):

git config --global user.name "Your Name"
git config --global user.email [email protected]

You are then ready to initiate your first git repository:

git init <path/to/directory> 

# NOTE: There is no need to add the path if you already 
# are in the directory you want to initiate.

Local project configuration

If you configure your computer using the --global tag, you only have to enter this information once. Thereafter, git will assume that any new repository initiations/changes are being made by you. As you may expect then, configuring your user details with the --global tag is not a good idea if the computer you use has multiple users working, for example, through a ‘Guest Account’. In that situation, rather individually set the user configuration for each directory (project) you initiate as follows:

  • Open Terminal (OSX and Linux) or Git Bash (Windows) and navigate to the directory you want to initiate as a repository;

  • Enter the following text (substituting your username and email address as required):

git init
git config user.name "Your Name"
git config user.email [email protected]

Your user information will only be associated with the repository you initiated.

Pushing commits to a GitHub remote

The GIF below demonstrates the process of initiating folder as a git repository, configuring it, adding a file, and pushing the file to an empty GitHub repository. It also shows how to add files in GitHub and then pull them locally.

When pushing to the remote, you will be asked for a username and password. Input the username and password for your GitHub account. If you get this username/password request everytime you try push to the remote, and you are on your own computer, then instead of typing in your GitHub account password, get a personal access token and paste that in the password dialogue box. That should stop the username/password requests. Instructions on how to get a personal access token can be found here.

Proxy problems

If you are working behind a corporate proxy, you may run into problems with pushing git commits to your remote, and installing packages in R. The following options should help solve the problem.

For git

(NOTE: All settings can be set specifically for the local project by omitting the --global tag.)

Enter the follwing commands in the Terminal (OSX or Linux) or Git Bash (Windows) (substituting your server details as required):

git config --global http.proxy http://proxyserver.com:8080
git config --global https.proxy https://proxy.server.com:8080
  • Change proxy.server.com to the address of your proxy server

  • Change 8080 to the proxy port configured on your proxy server

Try the settings above, but if your proxy protests that it needs a username and password (‘407 Authentication Warning’), add the following (not a great option on a computer used by multiple users):

git config --global http.proxy http://proxyusername:[email protected]:8080

git config --global https.proxy https://proxyusername:proxypasswordproxy.server.com:8080
  • Change proxyusername to your proxy username

  • Change proxypassword to your proxy password

If you want to clear your proxy settings (e.g., if you are working on a laptop that you use at home and at work), enter the following commands in Terminal or Git Bash:

git config --global --unset http.proxy
git config --global --unset https.proxy

For R/RStudio

You need to add an .Renviron file to your R home directory. This file is read by R at startup, and needs to include proxy settings if you want to access the internet through R/RStudio.

On OSX and Linux machines the home directory is usually your ‘user.name’ directory (/User/user.name). On Windows machines the home directory is usually the the users ‘Documents’ library (C:\Users\user.name\Documents). If you cannot locate your home directory, enter the following in R:

path.expand("~")

The .Renviron file should contain the proxy settings specific to your location. The file must be in your R home directory before you start R/RStudio, but remember that it is a hidden file, so you will have to have your file view set to show hidden files if you want to see the file in situ. The following is an example of the text that should be in the .Renviron file:

http_proxy=http://proxyserver.com:8080
http_proxy_user=ask
https_proxy=https://proxyserver.com:80
https_proxy_user=ask

If you don’t know how to create a hidden ‘dot’ file, here are some instructions on how to do so from Terminal (OSX and Linux) or Git Bash (Windows):

  1. Open Terminal or Git Bash.

  2. Navigate to your R home directory.

  3. Copy and paste the following text at the command prompt (remember to change the proxyserver and port as required):

(echo http_proxy=http://proxyserver.com:8080
echo http_proxy_user=ask
echo https_proxy=https://proxyserver.com:8080
echo https_proxy_user=ask) > .Renviron

When you are outside of the work setting and can access the internet without a proxy, either rename the .Renviron file, move the file to another directory, or delete the file before opening R/RStudio.

In addition to adding the .Renviron file, you also need to change some settings in RStudio. The GIF below shows you the steps on a Windows machine, but it is the same on OSX and Linux, expect there is no ‘Use Internet Explorer library/proxy for HTTP’ checkbox.

Software downloads

  • R (available for: Windows, Mac, and Linux)
  • RStudio Desktop (available for: Windows, Mac, and Linux. Only install after you have installed R)
  • git (available for: Windows, Mac, and Linux)
  • Github desktop client (available for: Windows and Mac only)

Once you have downloaded and installed R and RStudio, I recommend that you install the following R packages (you may need others during the course, but these will get you going):

  • knitr, rmarkdown, tidyverse, and swirl

You can install the packages directly from the R console:

install.packages(c('knitr', 'rmarkdown', 'tidyverse', 'swirl'))


…or using RStudio:


Free courses

R online
git online
Other

Cheat-sheets

Remembering commands can be difficult, especially when you are a new user or you are out of practice. So here are links to some useful cheat-sheets:

R / RStudio
git
Miscellaneous