Monday, April 27, 2020

Using R to quantify land use around a sampling point

Backstory: I was in Argentina in 2017 for a bird meeting at Iguazu Falls. The meeting was fantastic: I saw a bunch of old friends from my tropical days, met a bunch of new friends, and heard many great talks. The two students that went with me did a great job presenting our poster. There were two big downers. One, a student was robbed by a taxi driver in Iguazu. She had to hand over her phone and wallet and he dropped her off along the road in the middle of a national park. Very not cool. Two, in Buenos Aires we were robbed when our taxi driver took off with our backpacks as we were removing our luggage. Now I take pictures of taxi drivers and my photos are automatically uploaded.

We're all told to back up our important documents, and that I did. I had code on my hard drive and on a USB drive, but both were stolen in the backpack, so now I have everything on a Google drive and one other place. This will be my third place for this code.

The problem: So, let's suppose you're doing something ecological at a particular site and you have a bunch of those sites. If you are working in a heterogeneous system, you might be interested in the interaction between what is happening at a site and the landscape around the site. Since what constitutes the landscape is nebulous, we can examine the effect of spatial extent on site properties. For example, take the decision of a bird to use a bird house or not. The bird likes trees around the nesting site, but does it need to be a forest or just a woodlot? I often refer to larger spatial extents as the context (city, forest, suburban) and smaller spatial extents as local conditions (e.g., a woodlot in a city, a parking lot in the country, and, of course, a woodlot in the woods and a parking lot in the city). So maybe the bird's decision is based on context, or local conditions, or a combination (much, much tougher to sort out since everything in the landscape tends to be correlated).

The general approach: Your ecological points need to be geo-referenced in a spreadsheet. To that we will add land use information extracted from remote sensing. Then we can run statistics linking the ecological information to the land use. A focal analysis is used, where a sampling point serves as the center of a circular buffer. We will count the number of pixels of each land cover type inside the buffer and estimate the proportions of land cover from the relative number of pixels. We will use a 1000 m buffer and a 200 m buffer nested within the large buffer.
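To make the focal analysis concrete, here is a minimal base R sketch on a made-up grid. The 30 m cell size, the land cover codes (1 = forest, 2 = urban), the woodlot, and the point location are all assumptions for illustration, not my real data.

```r
# Toy land cover grid: 100 x 100 cells, each assumed to be 30 m on a side.
# Codes are invented for this example: 1 = forest, 2 = urban.
cell_size <- 30
n_cells <- 100
landcover <- matrix(2, nrow = n_cells, ncol = n_cells)  # urban everywhere...
landcover[46:55, 46:55] <- 1                            # ...except a 300 m woodlot

# Coordinates (in metres) of every cell centre
x <- (col(landcover) - 0.5) * cell_size
y <- (row(landcover) - 0.5) * cell_size

# Hypothetical sampling point in the middle of the woodlot
pt_x <- 1500
pt_y <- 1500
dist_to_pt <- sqrt((x - pt_x)^2 + (y - pt_y)^2)

# Proportion of each land cover class among the cells whose centre
# falls inside a circular buffer of the given radius
buffer_props <- function(radius) {
  inside <- dist_to_pt <= radius
  prop.table(table(landcover[inside]))
}

props_1000 <- buffer_props(1000)  # context
props_200  <- buffer_props(200)   # local conditions

props_1000
props_200
```

In this toy landscape the 200 m buffer is mostly forest while the 1000 m buffer is almost entirely urban, which is the "woodlot in a city" case.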

An example of a point (star) where ecological data were collected. The larger circle gives the context of the point (urban) and the smaller circle indicates the local conditions (wood lot).

The land use data set: I use the NLCD 2016 land cover data from the MRLC website. The map is a raster image (a matrix of pixels with numbers representing different land uses). Each pixel is 30 x 30 m, so a few km can represent a great deal of data. The image below is roughly the whole spatial extent in which I work. The red banana is the Scranton/Wilkes-Barre greater urban area (aka "The Valley"). To the east is NJ, and NY is to the northeast. The squiggly line leading to the Scranton/Wilkes-Barre area is the North Branch of the Susquehanna.
I like examining how ecology changes over a gradient and, unfortunately, the Scranton/Wilkes-Barre area does not really have much of a gradient. The red is high intensity urban and green is forest. You can see that The Valley is intensely urban, the area surrounding the valley is very green, and beyond that is a mix. That ring of green around The Valley is the steep slopes of the surrounding land, which are heavily forested (which is nice). To the west and north is farm land (yellow), and to the east and south is a mix of small towns, lakes, and game lands.
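A quick bit of arithmetic shows how many pixels end up in each buffer. This sketch assumes the 30 m cell size of the NLCD products and the 200 m and 1000 m buffer radii used in this post; the counts are approximate because pixels along the edge of the circle are only partly inside it.

```r
cell_size <- 30                       # metres per pixel side (assumed NLCD resolution)
cell_area <- cell_size^2              # 900 square metres per pixel

buffer_radii <- c(local = 200, context = 1000)   # metres
approx_pixels <- pi * buffer_radii^2 / cell_area # buffer area / pixel area

round(approx_pixels)
# local: about 140 pixels; context: about 3491 pixels
```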

The data are downloaded as a GeoTIFF.

The ecology: For this example, I am using predation rates on model clay caterpillars that we put on branches to examine predation by birds. When birds bite the clay they leave behind a mark, and we consider the caterpillar predated. We put out 20 at a site, and those sites are georeferenced with lat and long recorded in decimal degrees (e.g., 72.234). We go back a week later and count the bite marks to get a predation rate.
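As a small sketch of what the site spreadsheet and the predation rate might look like: the site names, coordinates, and bite counts below are invented for illustration, and the column names lat and long are chosen to match the code later in this post. Longitudes west of Greenwich are negative in decimal degrees.

```r
# Hypothetical site data: one row per site, made-up values
clay_example <- data.frame(
  site   = c("A", "B", "C"),
  lat    = c(41.41, 41.25, 41.60),
  long   = c(-75.66, -75.88, -75.40),
  bitten = c(3, 8, 0)              # caterpillars with bite marks after one week
)

n_deployed <- 20                   # clay caterpillars put out per site

# Predation rate = proportion of the 20 models with bite marks
clay_example$pred_rate <- clay_example$bitten / n_deployed
clay_example
```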

And, finally, the code

install.packages(c("raster", "tmap", "sf", "sp"))

library(raster)
library(sp)
library(sf)
library(tmap)

# the file is NLCD_2016.tiff
landcov <- raster(file.choose()) 
# check it out
plot(landcov)

#get the coordinate reference system to show the spatial info 
crs(landcov)


# get the ecological sampling/clay model data
clay <- read.csv(file.choose(), header=T)

#  turn the points into spatial points and reference them to the location
clay.pts <- st_as_sf(clay, coords=c("long","lat"), crs = 4326)

# look at the coordinate reference system (to make sure the above worked)
# st_crs() is the sf function for this; clay.pts is an sf object
st_crs(clay.pts)

# create a map with the points on top of the land cover raster
# (tmap is already loaded above)
tm_shape(landcov) + tm_raster() +
  tm_shape(clay.pts) + tm_dots(size = 0.5)

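# put the points in the raster's coordinate reference system so the
# buffer is measured in the raster's units (check crs(landcov); metres here)
clay.pts.proj <- st_transform(clay.pts, st_crs(landcov))

# extract every pixel value within 1000 m of each point
landextract2 <- extract(landcov,         # raster layer
                        clay.pts.proj,   # sf points at the centre of each buffer
                        buffer = 1000,   # buffer radius, units depend on CRS
                        fun = NULL,      # no summary function: keep every pixel value
                        df = TRUE)       # return a data frame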

# make the output a data frame so I can mess with it as I know how
land <- as.data.frame(landextract2)
# make R count thousands and thousands of pixels for me
a <- table(land$ID, land$NLCD_2016)
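# copy the counts to the clipboard to paste into a spreadsheet
# (the "clipboard" target works on Windows)
write.table(a, "clipboard", sep="\t")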
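The pixel counts can also be turned into proportions inside R instead of in a spreadsheet. This is a minimal sketch with a tiny stand-in for the extract() output; the land_example data frame and the output file name are made up. The codes are real NLCD classes (24 = developed, high intensity; 41 = deciduous forest; 82 = cultivated crops).

```r
# Stand-in for the data frame returned by extract(..., df = TRUE):
# one row per pixel, ID = sampling point, NLCD_2016 = land cover code
land_example <- data.frame(
  ID        = c(1, 1, 1, 1, 2, 2, 2, 2),
  NLCD_2016 = c(41, 41, 41, 24, 24, 24, 82, 41)
)

# Count pixels of each class per point, then convert each row to proportions
counts <- table(land_example$ID, land_example$NLCD_2016)
props  <- prop.table(counts, margin = 1)

# Wide data frame: one row per point, one column per land cover code
props_df <- as.data.frame.matrix(props)
props_df$ID <- rownames(props_df)

# Writing a CSV is a portable alternative to the clipboard
out_file <- file.path(tempdir(), "landcover_proportions.csv")
write.csv(props_df, out_file, row.names = FALSE)
```

Running the same steps on a second extract() call with buffer = 200 would give the local-conditions proportions to sit alongside the 1000 m context.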

From this we can run a bunch of models and create this graph
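I won't go through the models here, but as one possible sketch: with counts of bitten caterpillars out of 20, a binomial GLM against the proportion of forest in a buffer is a reasonable starting point. The numbers below are invented purely so the code runs, and the variable names (bitten, prop_forest) are assumptions.

```r
# Invented example data: one row per site
sites <- data.frame(
  bitten      = c(2, 3, 5, 4, 8, 7, 10, 9),    # caterpillars with bite marks
  prop_forest = c(0.05, 0.10, 0.25, 0.30,      # proportion of forest pixels
                  0.50, 0.60, 0.80, 0.90)      # in the buffer around the site
)
n_deployed <- 20                               # caterpillars put out per site

# Binomial GLM: successes = bitten, failures = not bitten
fit <- glm(cbind(bitten, n_deployed - bitten) ~ prop_forest,
           family = binomial, data = sites)
summary(fit)

# Predicted predation rate across the forest gradient, for plotting
new_sites <- data.frame(prop_forest = seq(0, 1, by = 0.1))
new_sites$pred_rate <- predict(fit, newdata = new_sites, type = "response")
```

Fitting the same model with the 200 m and the 1000 m proportions is one way to compare local conditions with context.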

Hope this helps someone!! 






Wednesday, April 22, 2020

Data sets for teaching biostatistics

There are data sets online for teaching biostatistics, and many data sets are built into R. Still, there are two problems I keep encountering, and it can take several days of searching to find an appropriate data set, if I find one at all. Here's the issue: most data sets built into R are ready to be crunched, but they are often irrelevant to the student. For example, two commonly used data sets in R are cars (just what it sounds like) and iris. The latter is about flowers but is decades old. I'm looking for something from after 2000. There are a few data repositories, such as Dryad, that are current, but the data are often too complicated for simple analyses like ANOVA, t-tests, etc. There are also data sources that are huge but not very useful (unless you are a political scientist), like health data from the UN and WHO that are already summarized (but need to be merged with other data, such as economic data).

It would be great to have a searchable database for data sources that you can select the response variable type (e.g., continuous, binary, etc) and the predictor variables (e.g., continuous, binary, random) and maybe the year. That would be amazing and I think it would help the hundreds... thousands of people teaching statistics. 

This is not that database, but maybe I can start to link data sets to the techniques used:

Frequency analysis 
Two sample tests (t-tests and related)
ANOVA
Regression 
Logistic Regression 
Poisson 
Negative binomial 

Monday, April 13, 2020

Biostatistics Course


Biostatistics and Experimental Design
  1. Basic Approaches to Science and Statistics
  2. Probability 
  3. Statistical Inference
  4. Nature of Data and Experiments
  5. Exploratory Data Analysis (nee Descriptive Statistics)
  6. Data Distributions 
  7. Analysis of Frequencies 
  8. Variance Measures of Continuous Data 
  9. Mean Differences between Two Groups
  10. Introduction to ANOVA
  11. Multiple comparisons
    • Lecture
    • How to in R (see end of ANOVA How-to)
  12. Two Factor ANOVA and Experimental Design
  13. Variable and Model Selection 
  14. Correlation and Regression 
  15. Fitting Models to Squiggly Lines
  16. ANCOVA: Analysis of Covariance (Predictors are Continuous and Categorical)
  17. Analysis of Count Data: Poisson and Negative Binomial 
  18. Logistic Regression 



Saturday, April 11, 2020

Conservation Biology Online


I've decided to put a large portion of my conservation biology lectures online. They will be slightly less detailed than what I would normally do in class, but I will make up for that with extra readings. I asked a few of our graduates who are now in graduate school what we were missing, and one theme was reading and critiquing the peer-reviewed literature. So that will be part of the course for our students but not the public. But why not have public lectures? Maybe I'll get some good feedback.

As of 4/11 - first two lectures, lecture on threats to tropical forests 

This will be a work in progress.


3. Species and Species Diversity
    
5. Extinctions 


7. Habitat Loss

8. Habitat Fragmentation
    Part 1 and Part 2
 


11. Invasive Species and Emerging Diseases 

12. Climate Change


14. Threats to Marine Diversity 

15. Global Conservation 

16. Designing Parks 

17. Conservation Law 

18. Habitat Management 

19. Museums, Zoos, Gardens and Freezers

20. Restoration 

21. Funding Conservation 

22. Sustainability