Introduction to biostatistics
About
An introductory course on data handling and biostatistics for students studying towards a Bachelor of Health Sciences (BHSc) Honours in Physiology at the University of the Witwatersrand, South Africa. The course is based around the statistical programming language R.
The aims of the course are to introduce participants to the basics of data wrangling, plotting, and reproducible data analysis and reporting. These aims are explored using the statistical computing programme R in the RStudio integrated development environment (IDE), and git (with the GitHub web-based git repository hosting service) for version control. The reason for choosing these apps is that they are free (as in beer and as in speech), and have well-established and active user and developer communities. You need a basic working knowledge of the command line, R and git to complete the course. So if you are not familiar with these apps, I suggest that you complete some free online courses before starting (see examples below).
Course assessment
The year mark for the course will constitute 40% of the total course mark, and will be assessed by a series of 6 short assignments, each worth 10 marks. The biostatistics examination will constitute the remaining 60% of the total mark for the course. Assignments must be submitted by 23:59 on the due date. No extensions will be granted, and 10% will be deducted from the assignment mark for each day the assignment is late.
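As a worked example of the weighting described above, the sketch below combines six assignment marks and an examination mark into a final course mark. The marks are hypothetical values chosen only for illustration, and the calculation assumes that no late penalties apply.

```r
# Hypothetical marks, for illustration only
assignment_marks <- c(7, 8, 6, 9, 10, 8)  # six assignments, each out of 10
exam_percent <- 65                        # examination mark, as a percentage

# Year mark: total assignment marks as a percentage of the 60 marks available
year_percent <- sum(assignment_marks) / (6 * 10) * 100

# Final mark: 40% from the year mark plus 60% from the examination
final_percent <- 0.4 * year_percent + 0.6 * exam_percent
final_percent
```

With these illustrative marks, the year mark is 80% and the final course mark is 71%.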
The table below provides links to the assignments and gives the due date for each. Note: Dropbox (where the files are located) no longer renders HTML files, so until I find another free host that will render the files, clicking a link will download the associated file, which you can then open in your browser.
Assignment | Link | Submission deadline |
---|---|---|
1 | access link | TBA |
2 | access link | TBA |
3 | access link | TBA |
4 | access link | TBA |
5 | access link | TBA |
6 | access link | TBA |
Lectures
Lecture slides are best viewed on Safari.
Note: Dropbox (where the files are located) no longer renders HTML files, so until I find another free host that will render the files, clicking a link will download a zip file. Unzip it, then double-click the HTML file to open the presentation in your browser.
Lecture | Content | Slides |
---|---|---|
Introduction | Course overview | Slideshow (1MB) |
Lecture 1 | Basic concepts and tools for reproducible research | Slideshow (2MB) |
Lecture 2 | Data munging | Slideshow (730kB) |
Lecture 3 | Things to know before you start data analysis | Slideshow (7.4MB) |
Lecture 4 | A (very brief) introduction to data presentation | Slideshow (4.2MB) |
Lecture 5 and 6 | Cookbook of commonly used statistical tests | Slideshow (58MB) |
Lecture 7 | Confidence intervals | Slideshow (5.3MB) |
Lecture 8 | Correlation and regression | Slideshow (2.8MB) |
Tutorials
These tutorials do not count for course credit, but give you a chance to get hands-on experience applying what you learn in the lectures. The tutorials will take place with the course instructor in the computer laboratory immediately after the relevant lecture has finished. You may work alone or in groups. You may also work through the tutorials in your own time.
Tutorial | Activity |
---|---|
Tutorial 1 | Complete the swirl course: ‘R Programming’ |
Tutorial 2 | RMarkdown and knitr (download instructions) |
Tutorial 3 | Complete the swirl course: ‘Getting and Cleaning Data’ |
Tutorial 4 | Complete the swirl course: ‘Exploratory Data Analysis’ (sections: 5, 7, 8, 9, 10) |
Tutorial 5 | Complete the swirl course: ‘Statistical Inference’ |
Tutorial 6 | Complete the swirl course: ‘Regression Models’ |
The majority of the tutorials are deployed through the R package swirl. The swirl package was developed by the Swirl Development Team, and includes a suite of step-by-step interactive training courses on R, which are aimed primarily at the novice and intermediate R user.
Follow the instructions below to access swirl courses:
# Re-type or copy and paste the text below into the R console,
# pressing 'Enter' after each step.
# If you haven't already installed swirl
install.packages('swirl')
# Load the 'swirl' package
library(swirl)
# Launch a 'swirl' session and follow the prompts
swirl()
To install swirl courses:
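A minimal sketch of the installation step, assuming you have an internet connection and that the courses are fetched from the public swirl course repository with swirl's `install_from_swirl()` function. The course names are the ones listed in the tutorial table above.

```r
# Load swirl so that its course-installation functions are available
library(swirl)

# Courses used in Tutorials 1, 3, 4, 5 and 6
courses <- c("R Programming",
             "Getting and Cleaning Data",
             "Exploratory Data Analysis",
             "Statistical Inference",
             "Regression Models")

# Download and install each course in turn (requires internet access)
for (course in courses) {
  install_from_swirl(course)
}
```

Once installed, the courses should appear in the course menu the next time you run swirl().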
Resources
Visualizing statistics
I strongly recommend that all students go play around with the interactive plots at Seeing Theory, a project designed and created by Daniel Kunin with support from Brown University’s Royce Fellowship Program and National Science Foundation group STATS4STEM. The goal of the project is to make statistics more accessible to a wider range of students through interactive visualizations.
Software downloads
- R (available for: Windows, Mac, and Linux)
- RStudio Desktop (available for: Windows, Mac, and Linux. Only install after you have installed R)
- git (available for: Windows, Mac, and Linux)
- GitHub Desktop client (available for: Windows and Mac only)
Once you have downloaded and installed R and RStudio, I recommend that you install the following R packages (you may need others during the course, but the suggested packages will get you through all activities in the course):
car, coin, devtools, kableExtra, knitr, RColorBrewer, rmarkdown, swirl, svglite, tidyverse, vcd, vcdExtra [1]
Install the packages from the R console:
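A minimal sketch of the installation, assuming you have an internet connection and can reach a CRAN mirror. The helper names `pkgs` and `missing_pkgs` are illustrative; the package names are the ones recommended above.

```r
# Recommended packages for the course
pkgs <- c("car", "coin", "devtools", "kableExtra", "knitr",
          "RColorBrewer", "rmarkdown", "swirl", "svglite",
          "tidyverse", "vcd", "vcdExtra")

# Work out which of these are not yet in your R library
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install only the missing packages (does nothing if all are present)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Checking for missing packages first avoids re-downloading packages you already have, which helps on slow connections.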
Offline installation of recommended packages and swirl courses
If you are working behind a corporate proxy you may experience problems installing packages from the CRAN servers. To help you get the packages required for this course, I have written a package that will install the packages and swirl tutorials from a local source.
The package is called biostatSetup and, to reduce the package size (it’s essentially a mini CRAN repository), I have created three versions, one for each of the major operating systems: biostatSetupSrc for Linux, biostatSetupMacOS for Mac, and biostatSetupWindows for Windows.
Please note that the package was developed for R v3.3. If you have an older version of R, please upgrade it before installing the package.
Installing and using biostatSetup
1. Download the relevant version using these URLs
- Linux: biostatSetupSrc
- macOS: biostatSetupMacOS
- Windows: biostatSetupWindows
2. Install the package
Install the package from the R console:
# Re-type or copy and paste the appropriate text into the R console
# and press 'Enter'.
# Remember to change 'path_to_file' to reflect where the downloaded
# file is located on your system.
# Linux
install.packages('path_to_file',
                 repos = NULL,
                 type = 'source')
# macOS
install.packages('path_to_file',
                 repos = NULL,
                 type = 'mac.binary.mavericks')
# Windows
install.packages('path_to_file',
                 repos = NULL,
                 type = 'win.binary')
3. Load the package
4. Install packages
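As a minimal sketch of step 3, assuming the installed package carries the same name as the file you downloaded in step 1, the package can be loaded and its exported functions listed so that you can locate the functions that carry out step 4. The variable `pkg` is illustrative, and the names of the installer functions are not assumed here.

```r
# Assumed package name: matches the version downloaded in step 1
# (use 'biostatSetupMacOS' or 'biostatSetupWindows' on those systems)
pkg <- "biostatSetupSrc"

if (requireNamespace(pkg, quietly = TRUE)) {
  # Load the package by name
  library(pkg, character.only = TRUE)
  # List the functions the package exports
  print(ls(paste0("package:", pkg)))
} else {
  message("Package '", pkg, "' is not installed yet; complete step 2 first.")
}
```

Running help(package = "biostatSetupSrc") (or the equivalent for your operating system) should also open the package's help index.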
Free courses
R online
- Introduction to R by datacamp.com
- R Programming by codeschool.com
- R for Data Science by Garrett Grolemund and Hadley Wickham
Other
- Learn the Command Line by codecademy.com
Cheat-sheets
Remembering the specifics of every command is impossible, so there is no shame in looking up this information. Here are links to some useful cheat-sheets:
R / RStudio
- Base R (source: Mhairi McNeill via rstudio.com)
- Importing data (source: rstudio.com)
- Data wrangling with dplyr and tidyr (source: rstudio.com)
- Data visualization with ggplot2 (source: rstudio.com)
- Regular expressions (source: Ian Kopacka via rstudio.com)
- RMarkdown cheatsheet (source: rstudio.com)
- RStudio IDE (source: rstudio.com)
git
- git (source: git-tower.com)
- git (source: github.com)
- git the simple guide (interactive)
- git workflow overview (source: git-tower.com)
Miscellaneous
- Command line (source: git-tower.com)
Configuring git
Global configuration
You need to configure git after you install it. If you are going to be the only one using the computer, then open Terminal (macOS and Linux) or Git Bash (Windows) and enter the following text (substituting your username and email address as required):
git config --global user.name "Your Name"
git config --global user.email your@email.com
Local project configuration
If you configure your computer using the --global flag, you only have to enter this information once. Thereafter, git will assume that all commands are being entered by you. As you may expect, configuring your user details with the --global flag is therefore not a good idea if the computer you use has multiple users working, for example, through a ‘Guest Account’. In that situation, set the user configuration individually for each directory (project) you initiate, as follows:
1. Open Terminal (macOS and Linux) or Git Bash (Windows) and navigate to the directory you want to initiate as a repository.
2. Enter the following text (substituting your username and email address as required):
git init
git config user.name "Your Name"
git config user.email your@email.com
Your user information will only be associated with the repository you initiated.
Proxy problems
If you are working behind a corporate proxy, you may run into problems with pushing git commits to your remote. The following options should help solve the problem.
(NOTE: All settings can be set specifically for the local project by omitting the --global flag.)
Enter the following commands in the Terminal (macOS or Linux) or Git Bash (Windows):
git config --global http.proxy http://proxy.server.com:8080
# git applies http.proxy to both http:// and https:// remotes;
# it does not read a separate https.proxy setting
Remember to:
Change proxy.server.com to the address of your proxy server
Change 8080 to the proxy port configured on your proxy server
If you want to clear your proxy settings (e.g., if you are working on a laptop that you use at home and at work), enter the following commands in Terminal or Git Bash:
git config --global --unset http.proxy
git config --global --unset https.proxy
[1] The tidyverse package bundles a series of essential packages for importing, munging and visualising data (e.g., dplyr, forcats, ggplot2, lubridate, purrr, readr, stringr, tidyr).