Wednesday, April 22, 2020

Data sets for teaching biostatistics

There are data sets online for teaching biostatistics, and many data sets are built into R. Still, there are two problems I keep encountering, and it can take several days of searching to find an appropriate data set, if I find one at all. Here's the issue: most data sets built into R are ready to be crunched, but they are often irrelevant to the student. For example, two commonly used data sets in R are cars (just what it sounds like) and iris. The latter is about flowers, but it is decades old. I'm looking for something from after 2000. There are a few data repositories, such as Dryad, that are current, but the data are often too complicated for simple analyses like ANOVA, t-tests, etc. There are also data sources that are huge but not very useful (unless you are a political scientist), like health data from the UN and WHO that are already summarized and need to be merged with other data, such as economic data.

It would be great to have a searchable database of data sources where you can select the response variable type (e.g., continuous, binary, etc.), the predictor variable types (e.g., continuous, binary, random), and maybe the year. That would be amazing, and I think it would help the hundreds, if not thousands, of people teaching statistics.
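To make the idea concrete, here is a rough sketch of what such a search could look like. It is written in Python purely for illustration, and everything in it is an assumption on my part: the `DatasetEntry` fields, the `search` function, and the three catalog entries are made-up placeholders, not real data sets or an existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog record (hypothetical fields, for illustration only)."""
    name: str
    response_type: str          # e.g. "continuous", "binary"
    predictor_types: frozenset  # e.g. {"continuous", "binary", "random"}
    year: int                   # year the data were collected or published


def search(catalog, response_type=None, predictor_types=(), min_year=None):
    """Return entries matching every filter that was supplied.

    A filter left at its default is ignored. An entry matches on
    predictors when it offers all of the requested predictor types.
    """
    wanted = set(predictor_types)
    return [
        entry for entry in catalog
        if (response_type is None or entry.response_type == response_type)
        and wanted <= entry.predictor_types
        and (min_year is None or entry.year >= min_year)
    ]


# Placeholder entries; these are NOT real data sets.
CATALOG = [
    DatasetEntry("example_a", "continuous", frozenset({"binary"}), 2015),
    DatasetEntry("example_b", "binary", frozenset({"continuous", "binary"}), 2008),
    DatasetEntry("example_c", "continuous", frozenset({"continuous"}), 1936),
]

# Continuous response, anything from after 2000.
recent_continuous = search(CATALOG, response_type="continuous", min_year=2001)
print([entry.name for entry in recent_continuous])
```

The filters combine with "and", so adding a criterion can only narrow the results. A real version would also need a way to record which technique each data set suits.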

This is not that database, but maybe I can start to link data sets to the techniques they are used with:

Frequency analysis 
Two sample tests (t-tests and related)
Logistic Regression 
Negative binomial 
