Introduction to biostatistics

About

An introductory course on data handling and biostatistics for students studying towards a Bachelor of Health Sciences (BHSc) Honours in Physiology at the University of the Witwatersrand, South Africa. The course is based around the statistical programming language R.

The aims of the course are to introduce participants to the basics of data wrangling, plotting, and reproducible data analysis and reporting. These aims are explored using the statistical computing programme R in the RStudio integrated development environment (IDE), and git (with the GitHub web-based git repository hosting service) for version control. These tools were chosen because they are free (as in beer and as in speech) and have well-established, active user and developer communities. You need a basic working knowledge of the command line, R, and git to complete the course, so if you are not familiar with these tools, I suggest that you complete some free online courses before starting (see the examples under 'Free courses' below).

Course assessment

The year mark will constitute 40% of the total course mark and will be based on six short assignments, each worth 10 marks. The biostatistics examination will constitute the remaining 60% of the total course mark. Assignments must be submitted by 23:59 on the due date. No extensions will be granted, and 10% will be deducted from the assignment mark for each day that the assignment is late.

The table below provides a link to each assignment and indicates its due date. Note: Dropbox (where the files are located) no longer renders HTML files, so until I find another free host that will render the files, clicking a link will download the associated file, which you can then open in your browser.

Assignment  Link         Submission deadline
1           access link  TBA
2           access link  TBA
3           access link  TBA
4           access link  TBA
5           access link  TBA
6           access link  TBA

Lectures

Lecture slides are best viewed on Safari.

Note: Dropbox (where the files are located) no longer renders HTML files, so until I find another free host that will render the files, clicking a link will download a zip file. Unzip the file and double-click the HTML file inside it to open the presentation in your browser.

Lecture          Content                                             Slides
Introduction     Course overview                                     Slideshow (1MB)
Lecture 1        Basic concepts and tools for reproducible research  Slideshow (2MB)
Lecture 2        Data munging                                        Slideshow (730kB)
Lecture 3        Things to know before you start data analysis       Slideshow (7.4MB)
Lecture 4        A (very brief) introduction to data presentation    Slideshow (4.2MB)
Lecture 5 and 6  Cookbook of commonly used statistical tests         Slideshow (58MB)
Lecture 7        Confidence intervals                                Slideshow (5.3MB)
Lecture 8        Correlation and regression                          Slideshow (2.8MB)

Tutorials

These tutorials do not count for course credit, but they give you a chance to get hands-on experience applying what you learn in the lectures. The tutorials will take place with the course instructor in the computer laboratory immediately after the relevant lecture has finished. You may work alone or in groups. You may also work through the tutorials in your own time.

Tutorial    Files
Tutorial 1  Complete the swirl course: ‘R Programming’
Tutorial 2  RMarkdown and knitr (download instructions)
Tutorial 3  Complete the swirl course: ‘Getting and Cleaning Data’
Tutorial 4  Complete the swirl course: ‘Exploratory Data Analysis’ (sections: 5, 7, 8, 9, 10)
Tutorial 5  Complete the swirl course: ‘Statistical Inference’
Tutorial 6  Complete the swirl course: ‘Regression Models’

The majority of the tutorials are deployed through the R package swirl. The swirl package was developed by the Swirl Development Team, and includes a suite of step-by-step interactive training courses on R, which are aimed primarily at the novice and intermediate R user.

Follow the instructions below to access swirl courses:

# Re-type or copy and paste the text below into the R console, 
# pressing 'Enter' after each step.

# If you haven't already installed swirl
install.packages('swirl')

# Load the 'swirl' package
library(swirl)

# Launch a 'swirl' session and follow the prompts
swirl()

To install swirl courses:

# Re-type or copy and paste the text below into the R console, 
# pressing 'Enter' after each step.

# Load the 'swirl' package
library(swirl)

# Download a course from the 'swirl' github repository
install_from_swirl('Course Name Here')

# Launch a 'swirl' session and follow the prompts
swirl()

Resources

Visualizing statistics

I strongly recommend that all students go play around with the interactive plots at Seeing Theory, a project designed and created by Daniel Kunin with support from Brown University’s Royce Fellowship Program and National Science Foundation group STATS4STEM. The goal of the project is to make statistics more accessible to a wider range of students through interactive visualizations.

Software downloads

  • R (available for: Windows, Mac, and Linux)

  • RStudio Desktop (available for: Windows, Mac, and Linux. Only install after you have installed R)

  • git (available for: Windows, Mac, and Linux)

  • GitHub desktop client (available for: Windows and Mac only)

Once you have downloaded and installed R and RStudio, I recommend that you install the following R packages (you may need others during the course, but the suggested packages will get you through all activities in the course):

  • car, coin, devtools, kableExtra, knitr, RColorBrewer, rmarkdown, swirl, svglite, tidyverse, vcd, vcdExtra [1]

Install the packages from the R console:

# Re-type or copy and paste the appropriate text into the R console 
# and press 'Enter'.

install.packages(c('car', 'coin', 'devtools', 'kableExtra', 
                   'knitr', 'RColorBrewer', 'rmarkdown', 
                   'swirl', 'svglite', 'tidyverse', 
                   'vcd', 'vcdExtra'))

Free courses

R online
git online
Other

Cheat-sheets

Remembering the specifics of every command is impossible, so there is no shame in looking up this information. Here are links to some useful cheat-sheets:

R / RStudio
git
Miscellaneous

Configuring git

Global configuration

You need to configure git after you install it. If you are going to be the only one using the computer, then open Terminal (macOS and Linux) or Git Bash (Windows) and enter the following text (substituting your name and email address as required):

git config --global user.name "Your Name"
git config --global user.email your@email.com

Local project configuration

If you configure your computer using the --global flag, you only have to enter this information once. Thereafter, git will assume that all commands are being entered by you. As you may expect, configuring your user details with the --global flag is not a good idea if the computer you use has multiple users working, for example, through a ‘Guest Account’. In that situation, set the user configuration individually for each directory (project) you initiate, as follows:

git init
git config user.name "Your Name"
git config user.email your@email.com

Your user information will only be associated with the repository you initiated.
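
To check that the details were stored at the project level, you can read them back from the repository's own configuration. The lines below are a minimal sketch, assuming git is installed. They work in a throwaway directory so that no real project is touched; the variable names (demo_repo, local_name, local_email) are made up for this illustration.

```shell
# Create a throwaway directory so the example does not touch a real project
demo_repo="$(mktemp -d)"
cd "$demo_repo"

# Initiate the repository and set the user details for this repository only
git init --quiet
git config user.name "Your Name"
git config user.email your@email.com

# Read the values back from the repository-level (local) configuration
local_name="$(git config --local --get user.name)"
local_email="$(git config --local --get user.email)"
echo "$local_name <$local_email>"
```

In a real project you would only need the two `git config --local --get` lines, run from inside the project directory.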

Proxy problems

If you are working behind a corporate proxy, you may run into problems with pushing git commits to your remote. The following options should help solve the problem.

(NOTE: All settings can be set specifically for the local project by omitting the --global flag.)

Enter the following commands in Terminal (macOS or Linux) or Git Bash (Windows):

git config --global http.proxy http://proxyserver.com:8080
git config --global https.proxy https://proxyserver.com:8080

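Remember to replace the placeholder proxy address and port number (proxyserver.com and 8080 above) with the details of your own proxy server.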

If you want to clear your proxy settings (e.g., if you are working on a laptop that you use at home and at work), enter the following commands in Terminal or Git Bash:

git config --global --unset http.proxy
git config --global --unset https.proxy
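
Before and after changing proxy settings, it can help to confirm what git currently has stored. The lines below are a minimal sketch, assuming git is installed. To avoid changing any global settings, the sketch sets the proxy for one throwaway repository only (the local scope described in the note above), reads it back, and then clears it. The proxy address is the same placeholder used above, and the variable names (proxy_repo, proxy_set, proxy_cleared) are made up for this illustration.

```shell
# Work in a throwaway repository so no global settings are changed
proxy_repo="$(mktemp -d)"
cd "$proxy_repo"
git init --quiet

# Set the proxy for this repository only (no --global flag)
git config http.proxy http://proxyserver.com:8080

# Show the value stored in the repository-level configuration
proxy_set="$(git config --local --get http.proxy)"
echo "Proxy while set: $proxy_set"

# Clear the setting, then confirm that nothing is stored any more
# ('|| true' because --get exits with an error when the key is missing)
git config --unset http.proxy
proxy_cleared="$(git config --local --get http.proxy || true)"
echo "Proxy after unset: ${proxy_cleared:-none}"
```

To inspect your global settings instead, use `git config --global --get http.proxy`; it prints nothing if no global proxy is set.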

  1. The tidyverse package bundles a series of essential packages for importing, munging and visualising data (e.g., dplyr, forcats, ggplot2, lubridate, purrr, readr, stringr, tidyr).