Introduction to Global Health Data Science

Fall 2021

Duke University

Rigorous introduction to health data science using current applications in biomedical research, epidemiology, and health policy. Use modern statistical software to conduct reproducible data exploration, visualization, and analysis. Interpret and translate results for interdisciplinary researchers. Critically evaluate data-based claims, decisions, and policies. Includes exploratory data analysis, visualization, basics of probability and inference, predictive modeling and classification. This course focuses on the R computing language. No statistical or computing background is necessary.

Check out our centralized resource page for intro data science courses for great tips and help throughout the course.


Code along labs on Mondays; classes on Wednesdays and Fridays; office hours Tuesdays, Wednesdays, and Thursdays; quizzes due Tuesdays; labs due Wednesdays; homework due Fridays.


Monday

Code along lab sessions
10:15-11:30am, Perkins Link 087
with Jackie and Chris
1:45-3:00pm, Perkins Link 079
with Phuc and Eli
or 5:15-6:30pm, Perkins Link 087
with Phuc and Eli

Tuesday

Prof. Herring hours - 1-2pm
Zoom (link on Sakai)
Eli hours - 4-5pm
Zoom (link on Sakai)
Jackie hours - 5-7pm
Edge Project Room 2
Chris hours - 8-9pm
Zoom (link on Sakai)

Wednesday

Class - 10:15-11:30am
Social Sciences 136
Prof. Herring hours - 11:30am-12:30pm
208 Old Chem

Thursday

Phuc hours - 4-5pm
Zoom (link on Sakai)
Jackie hours - 5-7pm
Edge Project Room 2

Friday

Class - 10:15-11:30am
Social Sciences 136
Alexandra hours - 11:30am-12:30pm
Edge Project Room 6

Not your average biostatistics class…and we mean for you to succeed!

Course Schedule


This is a tentative course schedule. The flow of topics might change slightly depending on how quickly / slowly it feels right to …

Week 2 - Probability

Learn the fundamentals of probability with a focus on its use in statistics

Week 3 - Probability in Action

More probability, discrete distributions

Week 4 - Data wrangling

Importing data, data types and classes, recoding.

Week 5 - Visualizing Spatial Data Effectively

Tips for effective visualization of spatial data

Week 6 - Normal Distribution and Confidence Intervals

Learn about the normal distribution and its central role in formal evaluation of hypotheses and construction of confidence intervals

Week 7 - Evaluating Hypotheses

Hypothesis testing using CLT and bootstrap

Week 8 - Modelling Data

Modelling Data

Week 10 - Categorical Data

Contingency tables, exact and chi-squared tests

Week 11 - Logistic regression

Logistic Regression modeling for binary outcomes.

Week 13 - Project Presentations

Global Health Data Science Festival!


Course components

Weekly structure

  • Monday: Code-along lab sessions
  • Tuesday and Wednesday: Professor office hours
  • Wednesday and Friday: Class
  • Tuesday and Thursday: TA office hours

Student hours

Prof. Herring will hold office hours on Tuesdays and Wednesdays, and TA office hours will be on Tuesdays and Thursdays. These will not be recorded, and Zoom options will be available. It’s a great time to get real-time answers to your questions or just say hi!

Code along sessions/labs

These will be held on Mondays, and they will not be recorded unless requested in advance. We expect you to attend your assigned workshop session every week. During these sessions you will work in teams on computing lab exercises; you will finish the exercises after the workshop and turn in your lab reports by Wednesday at 4pm. Labs will be submitted as GitHub repositories, and each student’s lowest lab score will be dropped.

A frequently asked question: “What happens if I can’t make it to class or a lab one week because I’m sick or have another obligation at that time?” Answer below:

  • First, if you have another obligation every week at the time of your lab, you should change into another lab. If you can’t make any of the lab times, you should drop this class.
  • Chances are you asked this question because you’re only missing one or two workshops throughout the semester:
    • If you’re missing a lab day due to short-term illness or some other reason, you should communicate this with your team and discuss accommodations. If you have made 0 commits towards a lab assignment, you will receive a 0 for that assignment, so you need to participate both for being a team player and also for your own individual score.
    • If you’re unable to contribute to a lab assignment because of an illness taking you away from school work for an extended period of time, you should let your team know that you won’t be able to contribute to that lab and either make this your dropped lab score or talk with your academic dean and Prof. Herring about options.

Overall, these policies are put in place to ensure communication between team members and respect for each other’s time, and also to give you a safety net in case illness or other reasons keep you from attending class once or twice.

Homework assignments

Beyond the in-class activities, you will be assigned regular homework assignments throughout the semester. These assignments will be completed individually and submitted to Gradescope. Each student’s lowest homework score will be dropped.


Regular quizzes on Sakai will be used to ensure students are keeping up with the reading assignments. These quizzes will generally be due Tuesday evenings by 11:59 pm and should be completed individually. They will be added together and count towards the semester grade as a single homework (the lowest 2 quiz grades will be dropped).

Final project

You will be responsible for the completion of an open-ended final project for this course, the goal of which is to tackle an “interesting” problem using the tools and techniques covered in this class. Additional details on the project will be provided as the course progresses. You must complete the final project and be in class (or on video) to present it to earn a project grade.


For all of the team-based assignments in this class you will be randomly assigned to teams of 3-4 students; these teams will change after each assignment. You will work in these teams during class and on the homework assignment. For team-based assignments, all team members are expected to contribute equally to the completion of each assignment, and you will be asked to evaluate your team members after each assignment is due. Failure to contribute adequately to an assignment will result in a penalty to your mark relative to the team’s overall mark.

Students are expected to make use of the provided GitHub repository as their central collaborative platform. Commits to this repository will be used as a metric (one of several) of each team member’s relative contribution for each homework.



Your overall course grade will be made up of the following components, with these weights:

  • Homework and Quizzes: 15%
  • Labs: 15%
  • Project: 20%
  • Exams: 50% (10% exam 1; 15% exam 2; 25% final exam)

A letter grade will be assigned as follows.

Letter grade   Score range
A              93 to 100
A-             90 to < 93
B+             87 to < 90
B              83 to < 87
B-             80 to < 83
C+             77 to < 80
C              73 to < 77
C-             70 to < 73
D+             67 to < 70
D              63 to < 67
D-             60 to < 63
F              0 to < 60
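
As a rough illustration of how the component weights and the posted cut points combine, here is a minimal sketch. It is written in Python purely for illustration (the course itself uses R), and the function names and the example student are hypothetical. It assumes every component is scored on a 0-100 scale and applies no rounding; it does not model dropped scores, exam substitution, or any "curve up".

```python
# Weights taken from the component list above (they sum to 100%).
WEIGHTS = {
    "homework_quizzes": 0.15,
    "labs": 0.15,
    "project": 0.20,
    "exam1": 0.10,
    "exam2": 0.15,
    "final_exam": 0.25,
}

# Guaranteed minimum cut points from the table above, highest first.
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"), (60, "D-"),
]


def overall_score(scores):
    """Weighted average of component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Map a 0-100 score to a letter using the posted minimum cut points."""
    for minimum, letter in CUTOFFS:
        if score >= minimum:
            return letter
    return "F"


# Hypothetical student, for illustration only.
example = {
    "homework_quizzes": 92, "labs": 95, "project": 88,
    "exam1": 80, "exam2": 85, "final_exam": 90,
}
total = overall_score(example)
print(round(total, 2), letter_grade(total))
```

Because the posted cut points are minimums, the letter returned here is a lower bound on the grade actually assigned.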

I never “curve down.” These posted cut points are guaranteed minimums. As well, this course is not graded to a pre-specified distribution; if every student earns a 95 in the course, then every student will receive an A. I reserve the right to “curve up” using more generous cut points depending on overall difficulty of assessments.

Regrade requests must be made within two days of when a report is returned. These will be honored if points were tallied incorrectly, or if you feel part of your report is correct, but it was marked wrong (these things do happen!). No regrade will be made to alter the number of points deducted for an issue. When a regrade request is evaluated, if new errors are identified, additional points may be deducted from the grade.


Class attendance and etiquette

Students are expected to attend class and labs in person, health permitting. Students with respiratory symptoms (or symptoms of other diseases communicable in a classroom setting) should stay home and watch class recordings.

If you need to miss class due to a religious holiday, illness, or varsity athletics, be sure to follow appropriate university policies (linked for convenience).

Zoom expectations

Some office hours will be on Zoom, and it may become necessary to hold some class or lab meetings on Zoom, depending on collective health and safety considerations.

  • In a full group session you should:
    • have your microphone muted by default
    • use the raise your hand feature or type in the chat for questions and comments
  • In the small team sessions or office hours you should:
    • have your camera turned on as much as possible
    • engage with your teammates via voice and text chat
    • take turns sharing your screen when necessary

Collaboration policy

Only work that is clearly assigned as team work should be completed collaboratively. Individual assignments must be completed individually; you may not directly share or discuss answers / code with anyone other than the instructors and tutors. You are welcome to discuss the problems in general and ask for advice.

Sharing / reusing code

I am well aware that a huge volume of code is available on the web to solve any number of problems. Unless I explicitly tell you not to use something, the course’s policy is that you may make use of any online resources (e.g. StackOverflow), but you must explicitly cite where you obtained any code you directly use (or use as inspiration). Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism. On individual assignments you may not directly share code with another student in this class, and on team assignments you may not directly share code with another team in this class. You are welcome to discuss the problems together and ask for advice, but you may not send or make use of code from another team.

Academic integrity

Academic honesty is of paramount importance in this class, and all work must be done in accordance with the Duke Community Standard, reproduced as follows:

To uphold the Duke Community Standard:

  • I will not lie, cheat, or steal in my academic endeavors;
  • I will conduct myself honorably in all my endeavors; and
  • I will act if the Standard is compromised.

By enrolling in this course, you have agreed to abide by and uphold the provisions of the Duke Community Standard as well as the policies specific to this course. Cheating or plagiarism on assignments, lying about an illness or absence, and other forms of academic dishonesty are a breach of trust with classmates and faculty, violate this Standard, and will not be tolerated. Any violation will automatically result in a grade of 0 on the assignment, will be reported to the Office of Student Conduct for further action, and may result in a failing (F) course grade depending on the magnitude of the offense.

Occasionally, data sets we are privileged to use in class may be confidential and cannot be distributed more broadly or without express permission from the data-granting sponsor. If so, we will let you know; any unauthorized dissemination or further use of such data sets beyond this class is a violation of the Duke Community Standard.

Reusing code: You are welcome to use online resources (e.g. StackOverflow). If you use code from an outside source, either directly or as inspiration, you must explicitly cite where you obtained the code. Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism and a violation of the Duke Community Standard.

On individual assignments, you may not directly share code or write-ups with other students. On team assignments, you may not directly share code or write-ups with another team. Unauthorized sharing of code or write-ups will be considered a violation for all students involved.

Late work, extensions, and special circumstances

All work is due on the stated due date. Due dates are there to help guide your pace through the course and they also allow us (the course staff) to return marks and feedback to you in a timely manner. However, sometimes life gets in the way and you might not be able to turn in your work on time. Note, first of all, that we drop the lowest score of lab and homework assignments. So if you miss one assignment, this can be your dropped score.

  • Late work policy: Some assignments cannot be turned in late and some assignments can be turned in past the deadline with a late penalty:
    • Labs: No late work accepted
    • Quizzes: No late work accepted (lowest 2 quiz grades automatically dropped)
    • Homework assignments: Late work accepted up to 3 days past the deadline (i.e. Monday after the deadline, 4pm), with 5% penalty for each day. Grading of late assignments may be delayed.
    • Project proposal: Late work accepted up to 4 days past the deadline, with 5% penalty for each day
    • Project re-proposal: No late work accepted (this is an optional assignment)
    • Project: No late work accepted
    • Exams: students with known conflicts on exam days should contact the instructor as soon as possible. Make-up exams will not be given. If the final exam grade is higher than the grade on either exam 1 or exam 2, then it will automatically be substituted to the student’s benefit, for a maximum of one exam substitution.
      • Example: If a student misses the second exam, the final exam grade will substitute for the exam 2 grade.
      • Example: If a student misses the first exam, the final exam grade will substitute for the exam 1 grade.
      • Example: If a student misses both exam 1 and exam 2, the final exam grade will substitute for exam 2 (worth more points), and the exam 1 grade will be zero.
      • Example: If a student completes all three exams, and the final exam grade is higher than the grade for either exam 1 or exam 2, then the final exam grade will substitute for the prior exam grade in a manner that maximizes the student’s score. Because the exams have unequal weight in the final grade calculation, it may be preferable to use the final exam to substitute for the exam 2 grade due to its heavier weight, even if the exam 2 score is slightly higher than the exam 1 score. If the final exam grade is lower than the grades on both exam 1 and exam 2, no substitutions will be made.
      • Example: If a student misses the final exam, the course grade will be X unless the student is failing (in which case the course grade is F). If no acceptable explanation is presented to the academic dean’s office within 48 hours of missing the exam, the course grade is F. If the exam absence is excused, the dean, Professor, and student make arrangements for a make-up examination as soon as possible.
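
The exam substitution rule above can be sketched as a small calculation. This is a hedged illustration written in Python (the course itself uses R); the function name `exam_points` is hypothetical, a missed midterm is assumed to count as 0, and the case of a missed final exam is not modeled.

```python
def exam_points(exam1, exam2, final):
    """Exam contribution to the course grade (out of 50 points).

    Scores are on a 0-100 scale. Pass None for a missed exam 1 or exam 2
    (assumed here to count as 0). At most one substitution is made,
    whichever gives the higher total.
    """
    e1 = 0.0 if exam1 is None else exam1
    e2 = 0.0 if exam2 is None else exam2

    # Candidate (exam 1, exam 2) pairs: no substitution, then each
    # single substitution that the final exam score would improve.
    candidates = [(e1, e2)]
    if final > e1:
        candidates.append((final, e2))
    if final > e2:
        candidates.append((e1, final))

    # Weights from the grading section: 10% exam 1, 15% exam 2, 25% final.
    best_midterms = max(0.10 * a + 0.15 * b for a, b in candidates)
    return best_midterms + 0.25 * final


# Hypothetical scores: exam 2 is slightly higher than exam 1, yet replacing
# exam 2 gains more (0.15 * 8 = 1.2) than replacing exam 1 (0.10 * 10 = 1.0).
print(round(exam_points(80, 82, 90), 2))

# Both midterms missed: the final replaces exam 2 and exam 1 stays at zero.
print(round(exam_points(None, None, 80), 2))
```

Comparing the candidate totals directly, rather than comparing raw scores, is what accounts for the unequal exam weights.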


Duke University is committed to providing equal access to students with documented disabilities. Students with disabilities may contact the Student Disability Access Office (SDAO) to ensure your access to this course and to the program. There you can engage in a confidential conversation about the process for requesting reasonable accommodations both in the classroom and in clinical settings. Students are encouraged to register with the SDAO as soon as they matriculate. Please note that accommodations are not provided retroactively.

Diversity & inclusion

It is my intent that students from all diverse backgrounds and perspectives be well-served by this course, that students' learning needs be addressed both in and out of class, and that the diversity that the students bring to this class be viewed as a resource, strength, and benefit. It is my intent to present materials and activities that are respectful of diversity: gender identity, sexuality, disability, age, socioeconomic status, ethnicity, race, nationality, religion, and culture. Your suggestions are encouraged and appreciated. Please let me know ways to improve the effectiveness of the course for you personally, or for other students or student groups.

Furthermore, I would like to create a learning environment for my students that supports a diversity of thoughts, perspectives and experiences, and honors your identities (including gender identity, sexuality, disability, age, socioeconomic status, ethnicity, race, nationality, religion, and culture). To help accomplish this:

  • If you have a name that differs from those that appear in your official Duke records, please let me know!
  • Please let me know your preferred pronouns.
  • If you feel like your performance in the class is being impacted by your experiences outside of class, please don’t hesitate to come and talk with me. I want to be a resource for you. If you prefer to speak with someone outside of the course, your academic dean is an excellent resource.
  • I (like many people) am still in the process of learning about diverse perspectives and identities. If something was said in class (by anyone) that made you feel uncomfortable, please talk to me about it.

Learning during a pandemic

I want to make sure that you learn everything you were hoping to learn from this class. If this requires flexibility, please don’t hesitate to ask.

  • Out of respect for others sharing your physical space, the class expectation will be that if you are sick, you will stay home and participate remotely.

  • You never owe me personal information about your health (mental or physical), but you’re always welcome to talk to me. If I can’t help, I likely know someone who can.

  • I want you to learn lots of things from this class, but I primarily want you to stay healthy, balanced, and grounded during the semester.


Most of you will need help at some point and we want to make sure you can identify when that is without getting too frustrated and feel comfortable seeking help.

  • Sakai Forums: The best way to get answers to any questions about course content, technology, logistics, or policies is to post your question on Sakai’s Forums. You are encouraged to answer each other’s questions here as well. When you post a question on Sakai, you can choose to appear anonymous to your classmates in some forums (not the ones for identifying study partners!). Note that the course staff can always see your name, and this is for a good reason! We want to be able to identify students who might be struggling so that we can extend help. Similarly, we want to know who you are if you’re providing great answers to others' questions!

  • Student hours: Course organisers will hold student office hours on Tuesdays (Prof Herring, 1-2pm, Eli 4-5pm, Jackie 5-6pm, Chris 8-9pm), Wednesdays (Prof Herring, 11:30am-12:30pm, Alexandra, 1:30-2:30pm), and Thursdays (Phuc 4-5pm, Jackie 5-6pm). Please feel free to call in with any questions, or just to say hi! I am also available to meet by appointment; please use the link below to request one.

  • Email: Please refrain from emailing any course content questions (those should go on Sakai Forums), and only use email for questions about personal matters that may not be appropriate for the public course forum (e.g. illness, major concerns).

  • For more general support and advice, please make use of the following resources:

Make good use of this support system, it is there for you! And if you’re not sure where to go for help, just ask any member of the course team.

  • Tuesdays 1-2, 4-5, 5-7, and 8-9; Wednesdays 11:30-12:30; Thursdays 4-5 and 5-7; Fridays 11:30-12:30
  • Email


Showcase your inner data scientist


Pick a global health data set …

…and do something informative with it. You will work in teams of 2-3 people of your own choosing. That is your final project in a nutshell. More details below.

May be too long, but please do read

The final project for this class will consist of analysis on a data set of your own choosing. The data set may already exist, or you may collect your own data using a survey or by conducting an experiment. You can choose the data based on your interests or based on work in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a novel data set in a meaningful way.

The goal is not to do an exhaustive data analysis (i.e., do not calculate every statistic and procedure you have learned for every variable), but rather to show me that you are proficient at asking meaningful questions and answering them with results of data analysis, that you are proficient in using R, and that you are proficient at interpreting and presenting the results. Focus on methods that help you begin to answer your research questions. You do not have to apply every statistical procedure we learned. Also, critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data, and the appropriateness of the statistical analysis, should be discussed here.

The project is very open-ended. You should create some kind of compelling visualization(s) of these data in R. There is no limit on what tools or packages you may use, but sticking to packages we learned in class (tidyverse) is recommended. You do not need to visualize all of the data at once. A single high-quality visualization will receive a much higher grade than a large number of poor-quality visualizations. Also pay attention to your presentation. Neatness, coherency, and clarity will count. All analyses must be done in RStudio, using R. References must be cited, including (known) prior analyses of the data.


In order for you to have the greatest chance of success with this project, it is important that you choose a manageable data set. This means that the data should be readily accessible and large enough that multiple relationships can be explored. As such, your data set must have at least 50 observations and, say, 10-20 variables (exceptions can be made, but you must speak with me first). The variables in the data should include variables of multiple types (e.g., categorical and continuous).

If you are using a data set that comes in a format that we haven’t encountered in class, make sure that you are able to load it into R, as this can be tricky depending on the source. If you are having trouble, ask for help before it is too late.

Note on reusing data sets from class: Do not reuse data sets used in examples, homework assignments, or labs in the class.

The data you use must be able to be shared publicly. Students will be allowed to post their projects after the end of the course on their own GitHub repos in order to be competitive for summer internships and other opportunities. You will not be able to use data for the project that cannot be posted online for public viewing.

You cannot make substantial changes to the proposed data set or questions of interest after the proposal revision deadline.

Below is a list of data repositories or related events that might be of interest to browse. You’re not limited to these resources, and in fact you’re encouraged to venture beyond them. But you might find something interesting there:


  1. Team membership - due Friday, October 8 in class
  2. Proposal - due Monday, October 11, at 11:59 pm
  3. Proposal revision - due Monday, October 18, at 4:00 pm (optional)
  4. Write-up - due Tuesday, November 16, at 11:59 pm
  5. Presentation - Wednesday, November 17 or Friday, November 19, in class (possibly some on Monday, November 22, in class)


The purposes of the proposal are (1) to help you get started early with thinking about the project, reading relevant literature, and formulating your scientific questions, and (2) to ensure the data you wish to analyze, methods you wish to use, and scope of your analysis are feasible and set you up for success with your project.

  • Section 1 - Introduction: The introduction should introduce your general research question and your data (where it came from, how it was collected, what are the cases, what are the variables, etc.). The motivation for your research question should be clear, with citations to relevant literature as appropriate.

  • Section 2 - Data: Place your data and codebook in the /data folder. Then print out the output of glimpse() or skim() of your data frame.

  • Section 3 - Data analysis plan:

    • The outcome (response, Y) and predictor (explanatory, X) variables you will use to answer your question.
    • The comparison groups you will use, if applicable.
    • Very preliminary exploratory data analysis, including some summary statistics and visualizations, along with some explanation on how they help you learn more about your data. (You can add to these later as you work on your project.)
    • The statistical method(s) that you believe will be useful in answering your question(s). (You can update these later as you work on your project.)
    • What results from these specific statistical methods are needed to support your hypothesized answer?

The project proposal can be no more than 3 pages. You can check a print preview to confirm length.


5 minutes maximum, and each team member should say something substantial. You can either present live or pre-record and submit your video to be played during the presentation day.

Prepare a slide deck using the template in your repo. This template uses a package called xaringan, and allows you to make presentation slides using R Markdown syntax. There isn’t a limit to how many slides you can use, just a time limit (5 minutes total). Each team member should get a chance to speak during the presentation. Your presentation should not just be an account of everything you tried (“then we did this, then we did this, etc.”); instead, it should convey what choices you made, and why, and what you found.

Before you finalize your presentation, make sure the code in your chunks is hidden with echo = FALSE.

Presentation schedule: Presentations will take place during the third week of November (we may have some the Monday of Thanksgiving week). You can choose to do your presentation live or pre-record it. During class you will watch presentations from other teams and provide feedback in the form of peer evaluations. The presentation line-up will be generated randomly.


Along with your presentation slides, we want you to provide a summary of your project in report form.

This write-up, which you can also think of as a summary of your project, should provide information on the dataset you’re using, your research question(s), your methodology, and your findings. Think of it as filling out your project proposal with all the interesting details. Additional information will be provided closer to the project deadline regarding formatting and other tips. The page limit of this write-up is 10 pages, including figures and references.

Repo organization

The following folders and files should be in your project repository:

  • presentation.Rmd + presentation.html: Your presentation slides
  • report.Rmd + report.html: Your write-up
  • /data/*: Your dataset in csv or RDS format, in the /data folder.
  • /proposal/: Your proposal from earlier in the semester

Style and format do count for this assignment, so please take the time to make sure everything looks good and your data and code are properly formatted.


  • You’re working in the same repo as your teammates now, so merge conflicts will happen, issues will arise, and that’s fine! Commit and push often, and ask questions when stuck.
  • Review the marking guidelines below and ask questions if any of the expectations are unclear.
  • Make sure each team member is contributing, both in terms of quality and quantity of contribution (we will be reviewing commits from different team members).
  • Set aside time to work together and apart (physically).
  • When you’re done, review the documents on GitHub to make sure you’re happy with the final state of your work. Then go get some rest!
  • Code: In your presentation your code should be hidden (echo = FALSE) so that your document is neat and easy to read. However, your document should include all your code such that if I re-knit your R Markdown file I should be able to obtain the results you presented. Exception: If you want to highlight something specific about a piece of code, you’re welcome to show that portion.
  • Teamwork: You are to complete the assignment as a team. All team members are expected to contribute equally to the completion of this assignment and team evaluations will be given at its completion - anyone judged to not have sufficiently contributed to the final product will have their grade penalized. While different team members may have different backgrounds and abilities, it is the responsibility of every team member to understand how and why all code and approaches in the assignment work.


Total 100 pts
Proposal 10 pts
Presentation 10 pts
Write-up 70 pts
Reproducibility and organization 10 pts


  • Content - What is the quality of research and/or policy question and relevancy of data to those questions?
  • Correctness - Are statistical procedures carried out and explained correctly?
  • Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?
  • Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?

Team peer evaluation

You will be asked to fill out a survey where you will report a contribution percentage for each team member. Filling out the survey is a prerequisite for getting credit on the team member evaluation. If you are suggesting that an individual did less than their fair share of the work, please provide some explanation. When peer scores indicate a team member did not do their fair share of work, proportional grading may be used (e.g., half the fair share yields half the points expected).

Late work policy

  • There is no late submission / make up for the presentation. You must be in class on the day of the presentation to get credit for it, or pre-record and submit your presentation by 9am on the morning of the presentations.

  • The late work policy for the write-up is 5% of the maximum obtainable mark per calendar day up to seven calendar days after the deadline. If you intend to submit work late for the project, you must notify the course organizer both before the original deadline and as soon as the completed work is submitted on GitHub.
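
To make the write-up penalty concrete, here is a minimal sketch of the arithmetic, written in Python for illustration only (the course itself uses R). The function name is hypothetical. It assumes the 70-point write-up maximum from the marking table, counts lateness in whole calendar days, and assumes that work submitted more than seven days late earns no credit, which the policy above does not state explicitly.

```python
def late_writeup_mark(raw_mark, days_late, max_mark=70.0, rate=0.05, max_days=7):
    """Write-up mark after the late penalty.

    The penalty is a fixed share (rate) of the maximum obtainable mark
    for each calendar day late, not a share of the mark earned.
    """
    if days_late <= 0:
        return raw_mark
    if days_late > max_days:
        # Assumption: work later than the seven-day window is not accepted.
        return 0.0
    penalty = rate * max_mark * days_late
    # The mark is never allowed to go below zero.
    return max(raw_mark - penalty, 0.0)


# Hypothetical example: a write-up worth 60 of 70 points, two days late,
# loses 2 * 5% * 70 = 7 points.
print(late_writeup_mark(60, 2))
```

Note that each late day costs 3.5 of the 70 write-up points regardless of the mark earned.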