Online Trick: Invasive Species

Showing posts with label Invasive Species. Show all posts

Sunday, September 20, 2015

Comparing Two Distributions

Here I compare two distributions, flowering duration of indigenous and allochtonous plant species. The hypothesis is that alien compared to indigenous plant species exhibit longer flowering periods.

download data*

*Data is courtesy of BiolFlor:
Klotz, S., Kühn, I. & Durka, W. [Hrsg.] (2002): BIOLFLOR - Eine Datenbank zu biologisch-ökologischen Merkmalen der Gefäßpflanzen in Deutschland. - Schriftenreihe für Vegetationskunde 38. Bonn: Bundesamt für Naturschutz.

## comparing flowering time in indigenous and alien species by
## usage of a quantile-quantile plot (qqplot), the ks-test, by
## testing shifts in medians (wilcox test) and by testing
## difference of means (t-test, as we deal with integer and
## not a continous variable it is not the most appropiate choice)
## as well as by a chi-square test:

dat <- read.csv("E:\\R\\Data\\flowering_alien_vs_indigen.csv",
                sep = ";")

library(lattice)

histogram(~ Flowering|Status, data = dat, col = "gray60", layout = c(1, 2),
          xlab = list("Months of flowering"),
          ylab = list("Percentage of total"),
          scales = list(y = list(alternating = F)),
          strip = strip.custom(factor.levels = c("alien", "indigenous")))

qqplot(dat$Flowering[dat$Status == "indigen"],
       dat$Flowering[dat$Status == "Neophyt"])
abline(a = 0, b = 1, lty = 3)

ks.test(dat$Flowering[dat$Status == "indigen"],
        dat$Flowering[dat$Status == "Neophyt"])

wilcox.test(Flowering ~ Status, data = dat)

t.test(Flowering ~ Status, data = dat)

## Note that in the two-sample case the estimator for the
## difference in location parameters does not estimate
## the difference in medians (a common misconception) but
## rather the median of the difference between a sample
## from x and a sample from y.

## as we deal with a limited number of classes (1-12 months)
## and sample size is big enough my favourite would be a
## chi-square test:

m <- table(dat$Status, dat$Flowering)

(Xsq <- chisq.test(m))  # Prints test summary
Xsq$observed   # observed counts (same as M)
Xsq$expected   # expected counts under the null
Xsq$residuals  # Pearson residuals
Xsq$stdres     # standardized residuals

..an illustrative example for testing proportions and presenting the results.

the data: number of indigenous and alien plant species with and without vegetative reproduction (N = 3399, mid-european species, data-courtesy: BiolFlor) . Hypothesis: The proportion of species with vegetative reproduction is different between alien and indigenuos plant species.

result: the prop. of plants with veg. reproduction is sign. lower for alien compared to indigenous plant species. this is simply due to the large number of agricultural weeds and contaminants within alien species - these species almost always reproduce by seeds.

## data:
dat <- data.frame(list(structure(list(flstat = structure(c(2L, 1L, 2L, 1L),
.Label = c("allo", "auto"), class = "factor"),
reprod = structure(c(1L, 1L, 2L, 2L),
.Label = c("non-veg", "veg"), class = "factor"),
X = c(872L, 423L, 1872L, 232L)),
.Names = c("flstat", "reprod", "X"),
class = "data.frame", row.names = c(NA, -4L))))

## proportion of species with vegetative reproduction
p_allo <- dat$X[4] / (dat$X[2] + dat$X[4])
p_auto <- dat$X[3] / (dat$X[1] + dat$X[3])
p_allo
p_auto

## restructure data for glm:
dat1 <- dat[rep(1:4, dat$X), 1:2]
head(dat1)
dat1$inc <- ifelse(dat1$reprod == "non-veg", 0, 1)

## glm:
summary(gmod <- glm(inc ~ flstat, data = dat1, family = binomial))

## intercept = logit(p_allo):
print(est_p_allo <- plogis(gmod$coef[1]))

## intercept + b = logit(p_allo+p_auto):
print(est_p_auto <- plogis(gmod$coef[1] + gmod$coef[2]))

## alternatively test difference in two proportions with prop.test():
ptest_diff <- prop.test(x = c(dat$X[1], dat$X[2]),
                        n = c(dat$X[1] + dat$X[3], dat$X[2] + dat$X[4]))

## only for one proportion prop.test gives you the confidence
## intervals of p.
## (you could also extract the glm-standard errors and calculate
## the conf.int. for this purpose..):
ptest_auto <- prop.test(x = dat$X[3], n = dat$X[1] + dat$X[3])
ptest_allo <- prop.test(x = dat$X[4], n = dat$X[2] + dat$X[4])

## plot with confidence intervals from prop.test
## (see methods in ?prop.test):

## coordinates for plotting confidence interval bars:
y0_al <- ptest_allo$conf[1]
y1_al <- ptest_allo$conf[2]
y0_au <- ptest_auto$conf[1]
y1_au <- ptest_auto$conf[2]

library(grid)
library(lattice)

## panel function for suppressing tck at top and right side,
## drawing bar with confidence interval,
## plotting glm-estimates (the crosses)
mpanel = function(...) {grid.segments(x0 = c(0.2725, 1 - 0.2725),
                                      x1 = c(0.2725, 1 - 0.2725),
                                      y0 = c(y0_al, y0_au),
                                      y1 = c(y1_al, y1_au))
                        panel.points(x = c(0.8, 2.2),
                                    y = c(est_p_allo, est_p_auto), pch = 4)
                        panel.abline(h = c(p_allo, p_auto), lty = 15,
                                     col = "grey70")
                        panel.text(x = 1.5, y = 0.9, cex = 1.2,
                                   "Species With\nVegetative Reproduction");
                        panel.xyplot(...)}

xyplot(c(p_allo, p_auto) ~ as.factor(c("Alien", "Indigenous")), type = "b",
       ylab = "Prop. +/- CIs\nX = GLM-Estimates",
       xlab = "", ylim = c(0, 1),
       panel = mpanel, pch = 16,
       scales = list(alternating = 1, tck = c(1, 0)))

Which traits make an alien species invasive?
Due to what traits an alien species becomes established in a foreign flora?

This kind of questions could be analysed by the use of recursive partitioning and classification trees..
(the below example also includes some useful data manipulation techniques)...

data-download*

*data is courtesy of BiolFlor:
Klotz, S., Kühn, I. & Durka, W. [Hrsg.] (2002): BIOLFLOR - Eine Datenbank zu biologisch-ökologischen Merkmalen der Gefäßpflanzen in Deutschland. - Schriftenreihe für Vegetationskunde 38. Bonn: Bundesamt für Naturschutz.

library(coin)
library(zoo)
library(modeltools)
library(sandwich)
library(strucchange)
library(vcd)
library(colorspace)
library(Formula)
library(partykit)

## (!)note: package party should not be loaded at the same time as partykit

## read data, mind to change the path:
dat <- read.csv("E:\\R\\Data\\Alien_Plant_Traits.csv",
                sep = ";")

## delete unneeded variable:
dat$Repr_Literatur <- NULL
dat$Diasporen..bzw..Germinulentyp <- NULL
dat$Heteromorphietyp  <- NULL
dat$Ausprägung <- NULL
dat$Status <- NULL


## source data uses "," as decimal seperator, as R uses
## "." we will change this appropriately:
dat$Länge_MW_mm <- as.numeric(sub(",", ".", dat$Länge_MW_mm))

## explore dataset:
head(dat[,1:10])
str(dat)

# Name                 : species names
# Status               : floristic status; "Neophyt" = alien species, which is the whole set
# Einführung           : mode of introduction
# Einbürg_grad_Max     : level of naturalization
# Lebensform           : lifeform
# Ausprägung           : manifestation of lifeform, "immmer" = always, "manchmal" = sometimes
# Blattanatomie        : leaf anatomy
# Blühdauer_Mon        : flowering duration
# N_Blühphasen         : number of flowering phases
# Länge_MW_mm          : length of diaspore / germinule, "MW_mm" = mean value in milimeters
# Länge_Bezug          : "Germinule" = length given is that of germinule,
#                       "Diaspore" = that of diaspore
# Länge_Bezug_Einschr  : restricted knowledge, maybe erroneus data, "_FALSCH" = false, "_WAHR" = true
# Reproduktionstyp     : reproduction type, "Samen" = seed, "vegetative" = clonal proliferation,
#                       "Samen/Sporen" = both

## ad diaspores / germinules:
## there are 2 cases - (1) a diaspore (= unit of propagation)
## is at the same time the germinule (= unit of seedling germination).
## this is the case when single seeds are propagated.
## (2) the diaspores disintegrates into many germinules.
## this is the cases when fruits or parts of them
## which contain many seeds are propagated.

## length measurments ("Länge_MW_mm") were taken for either
## germinules or diaspores or both (see "Länge_Bezug") and sometimes
## data may be erronous as indicated by the variable "Länge_Bezug_Einschr"
## what i want is the surely correct length-values of each germinules and
## diaspores for each species. therefore i create new variables:

dat1 <-
reshape(direction = "wide", idvar = "Name",
        timevar = "Länge_Bezug_Einschr",
        v.names = "Länge_MW_mm",
        drop =  "Länge_Bezug",
        data = dat)


## there's a warning that tells me i aggregated several variables that
## actually were varying - i solve this by pasting those variables
## into one character string per each species
## (beforehand i have to convert them to character vectors)
dat$Einführung <- as.character(dat$Einführung)
dat$Lebensform <- as.character(dat$Lebensform)
dat$Blattanatomie <- as.character(dat$Blattanatomie)
str(dat)

## make function for pasting multiple character strings of each species
## i want only unique values to be pasted, seperated by a comma:
my.paste <- function(x) {paste(unique(x), collapse = ", ")}
test.vector <- c("a", "a", "b", "c")
my.paste(test.vector)

print(dat2 <-
aggregate(cbind(dat$Einführung, dat$Lebensform, dat$Blattanatomie) ~
          dat$Name, FUN = my.paste))

## reset column names:
colnames(dat2) <- c("Name", "Einführung", "Lebensform", "Blattanatomie")
head(dat2)
## re-bind dataframes, variables with varying entries per species
## falsely aggregated in dat1 will be discarded and instead the
## pasted character strings of dat2, containing all info will be used.
## also variables present in both sub-datasets will be discarded:
head(dat2)

dat3 <- data.frame(dat1[, - which(names(dat1)%in%
                            c("Einführung", "Lebensform", "Blattanatomie"))],
                   dat2)
head(dat3)
## check for consistency (should be 340 species):
## Name-variables of the two sub-datasets should match
head(dat3)
dim(dat3)
dat3$Name == dat3$Name.1


## now i will choose one of the length measurements in the
## dataset - actually i'm interested in propagation (shorter/lighter propagules
## will be distributed easierly than larger/heavy ones) - that said i want
## diaspore weights but i don't have the data for each species.
## as an escape i will use the largest value within the different variables.
## the logic behind this would be that if diaspore = germinule
## the lengths will be the same and a germinule as a part of a diaspore
## will never be longer.
## by choosing the largest value i thus only use the germinule's value
## if the diaspore's value does not exist. this is seldomly the case and the
## length class will very likely still be the same as it would be for the
## diaspore of that plant, if i had the data
## (subsequently i will classify lengths):

dat3$Length_Diasp_new <- rep(NA, nrow(dat3))
dat3[is.na(dat3)] <- 0
for(i in 1:nrow(dat3)) {dat3$Length_Diasp_new[i] <- max(dat3[i, c(6:9)])}

## classify by cut-function:
dat3$Length_Class <- as.ordered(cut(dat3$Length_Diasp_new, c(0, 2, 5, Inf), labels = FALSE))

## i have to convert some character strings to factors
str(dat3)
dat3$Einführung <- as.factor(dat3$Einführung)
dat3$Lebensform <- as.factor(dat3$Lebensform)
dat3$Blattanatomie <- as.factor(dat3$Blattanatomie)
str(dat3)

## number if factor levels is too high for some variables -
## good results will rather be yielded
## if there are not too many levels:

levels(dat3$Lebensform)
## (1) makrophanerophytes will be "Tree"
## (2) chamaephyts, hemiphanerophyts, nanophanerophyts &
##     pseudophanerophyts will be "Shrub"
## (3) therophytes will be "Terr_Herbs_ann" (annual herbs)
## (4) geophytes and hemikryptophytes will be "Terr_Herb_per" (perennial herbs)
## (5) hydrophyts will be "Hydr_Herb"

## sometimes the classification in the original data is unclear -
## i will assign distinct lifeforms following the subsequent rules:
## "Chamaephyt, Therophyt" --> "Terr_Herb_ann"
## "Hemikryptophyt, Therophyt" --> "Terr_Herb_ann"
## "Geophyt, Hemikryptophyt" --> "Terr_Herb_per"
## "Hydrophyt, Geophyt" --> "Hydr_Herb"
## "Hydrophyt, Hemikryptophyt" --> "Hydr_Herb"
## "Makrophanerophyt, Nanophanerophyt" --> "Tree"

dat3$Lifeform <- rep(NA, nrow(dat3))

dat3$Lifeform[dat3$Lebensform == "Makrophanerophyt"] <- "Tree"
dat3$Lifeform[dat3$Lebensform == "Pseudophanerophyt"|
              dat3$Lebensform == "Nanophanerophyt"|
              dat3$Lebensform == "Hemiphanerophyt"|
              dat3$Lebensform == "Chamaephyt"] <- "Scrub"
dat3$Lifeform[dat3$Lebensform == "Therophyt"] <- "Terr_Herb_ann"
dat3$Lifeform[dat3$Lebensform == "Geophyt"|
              dat3$Lebensform == "Hemikryptophyt"] <- "Terr_Herb_per"
dat3$Lifeform[dat3$Lebensform == "Hydrophyt"] <- "Hydr_Herb"

dat3$Lifeform[dat3$Lebensform == "Chamaephyt, Therophyt"] <- "Terr_Herb_ann"
dat3$Lifeform[dat3$Lebensform == "Hemikryptophyt, Therophyt"] <- "Terr_Herb_ann"
dat3$Lifeform[dat3$Lebensform == "Geophyt, Hemikryptophyt"] <- "Terr_Herb_per"
dat3$Lifeform[dat3$Lebensform == "Hydrophyt, Geophyt"] <- "Hydr_Herb"
dat3$Lifeform[dat3$Lebensform == "Hydrophyt, Hemikryptophyt"] <- "Hydr_Herb"
dat3$Lifeform[dat3$Lebensform == "Makrophanerophyt, Nanophanerophyt"] <- "Tree"

## convert to factor:
dat3$Lifeform <- as.factor(dat3$Lifeform)
str(dat3)

levels(dat3$Einführung)
## "Begleiter"; "Begleiter, Verwilderte Nutzpflanze";
## "Begleiter, Verwilderte Zierpflanze"; "Transportbegleiter";
## "Saatgutbegleiter"; "Saatgutbegleiter, Verwilderte Nutzpflanze";      
## "Saatgutbegleiter, Verwilderte Zierpflanze" <-- "accompanying"
## "spontan"; "unbekannt"; "spontan, Verwilderte Nutzpflanze" <-- "spontaneous"
## "Verwilderte Nutzpflanze"; "Verwilderte Nutzpflanze, Verwilderte Zierpflanze";
## "Verwilderte Zierpflanze" <-- "cultivated"

dat3$Introduction <- rep(NA, nrow(dat3))
dat3$Introduction[dat3$Einführung == "Begleiter"|
            dat3$Einführung == "Begleiter, Verwilderte Nutzpflanze"|
            dat3$Einführung == "Begleiter, Verwilderte Zierpflanze"|
            dat3$Einführung == "Transportbegleiter"|
            dat3$Einführung == "Saatgutbegleiter"|
            dat3$Einführung == "Saatgutbegleiter, Verwilderte Nutzpflanze"|
            dat3$Einführung == "Saatgutbegleiter, Verwilderte Zierpflanze"] <- "accompanying"        
dat3$Introduction[dat3$Einführung == "spontan"|
            dat3$Einführung == "unbekannt"|
            dat3$Einführung == "spontan, Verwilderte Nutzpflanze"] <- "spontaneous"
dat3$Introduction[dat3$Einführung == "Verwilderte Nutzpflanze, Verwilderte Zierpflanze"|
            dat3$Einführung == "Verwilderte Nutzpflanze"|
            dat3$Einführung == "Verwilderte Zierpflanze"] <- "agric./ornam."

## convert to factor and check:
dat3$Introduction <- as.factor(dat3$Introduction)
str(dat3)

levels(dat3$Blattanatomie)
## leaf-anatomy: i will leaf this one out in the analysis
## as it is partly correlated with lifeforms

levels(dat3$Reproduktionstyp)
## "meist Samen, selten vegetativ" = mostly seed propagation
## "meist vegetativ, selten Samen" = mostly vegetative propagation
## "Samen und vegetativ" = both to same parts
## "Samen/Sporen" = only seeds                
## "vegetativ" = only vegetative
## "meist Samen, selten vegetativ"; "Samen/Sporen" <-- "mostly seeds"
## "meist vegetativ, selten Samen"; "vegetativ"   <-- "mostly vegetative"
## "Samen und vegetativ" <-- "both equally"
dat3$Reproduction <- rep(NA, nrow(dat3))
dat3$Reproduction[dat3$Reproduktionstyp == "meist Samen, selten vegetativ"|
                  dat3$Reproduktionstyp == "Samen/Sporen"] <- "mostly seeds"
dat3$Reproduction[dat3$Reproduktionstyp == "meist vegetativ, selten Samen"|
                  dat3$Reproduktionstyp == "vegetativ"] <- "mostly vegetativ"
dat3$Reproduction[dat3$Reproduktionstyp == "Samen und vegetativ"] <- "both equally"
dat3$Reproduction <- as.factor(dat3$Reproduction)

levels(dat3$Einbürg_grad_Max)
## this will be the dependent (outcome) variable
## i'll also try to use an outcome variable with
## only two levels:
## "Agriophyt" --> "invasive"
## "Epökophyt" & "Ephemerophyt" --> "not invasive"
dat3$Invasivness <- rep(NA, nrow(dat3))
dat3$Invasivness[dat3$Einbürg_grad_Max == "Epökophyt"|
                 dat3$Einbürg_grad_Max == "Ephemerophyt"] <- "not invasive"
dat3$Invasivness[dat3$Einbürg_grad_Max == "Agriophyt"] <- "invasive"
dat3$Invasivness <- as.factor(dat3$Invasivness)
dat3$Invasivness <- relevel(dat3$Invasivness , ref = "not invasive")

## another approach to this would be to define
## alien taxa that are established (in the sense
## that its population persist independently) and
## not established taxa.

dat3$Establ <- rep(NA, nrow(dat3))
dat3$Establ[dat3$Einbürg_grad_Max == "Epökophyt"|
              dat3$Einbürg_grad_Max == "Agriophyt"] <- "establ."
dat3$Establ[dat3$Einbürg_grad_Max == "Ephemerophyt"] <- "not establ."
dat3$Establ <- as.factor(dat3$Establ)
dat3$Establ <- relevel(dat3$Establ, ref = "not establ.")
str(dat3)

print(ct <- ctree(Einbürg_grad_Max ~ Introduction + Lifeform + Length_Class +
                                     Blühdauer_Mon + Reproduction, data = dat3,
                   control = ctree_control(mincriterion = 0.8, minsplit = 10L,
                                           minbucket = 10L)))

plot(ct, gp = gpar(fontsize = 7))


## the first node divides annula herbs from the rest
## thus it might be more representive to recode lifeforms
## into a lifecycle variable:
dat3$Life_Cycle <- rep(NA, nrow(dat3))
dat3$Life_Cycle[grep("_ann", dat3$Lifeform)] <- "annual"
dat3$Life_Cycle[-grep("_ann", dat3$Lifeform)] <- "perennial"
dat3$Life_Cycle <- as.factor(dat3$Life_Cycle)

## re-run:
print(ct <- ctree(Einbürg_grad_Max ~ Introduction + Life_Cycle + Length_Class +
                                     Blühdauer_Mon + Reproduction, data = dat3,
                   control = ctree_control(mincriterion = 0.8, minsplit = 10L,
                                           minbucket = 10L)))

plot(ct, gp = gpar(fontsize = 7))

## check predictions:
data.frame(Species = dat3$Name,
           Naturalization = dat3$Einbürg_grad_Max, predict(ct),
           Correspondence = dat3$Einbürg_grad_Max == predict(ct))

table(dat3$Einbürg_grad_Max == predict(ct))

## interpreting the result:
##
## (1) there is a large probability for agriophytes (= naturalized/invasive)
##     to occure in the group of perennial plants which
##     exhibit both, vegetative and generative reproduction
##
## (2) ephemerophytes most likely are present if plants are annual
##     and plant were introduced intentionally (ornamental, agricultural
##     purpose. within annuals the group of plants that made there way
##     without intentional help of men are likely to have more potential
##     to get established (-> portion of agriophytes and epoekophytes (=
##     naturalized/not-invasive) is larger here).
##
## also the other dependent nominal variables, invasivness & establishment
## could be investigated similarly..

To cite package ‘partykit’ in publications use:

Torsten Hothorn and Achim Zeileis (2011). partykit: A Toolkit for Recursive
Partytioning. R package version 0.1-0/r179.
http://R-Forge.R-project.org/projects/partykit/

Online Trick

Sunday, September 20, 2015

Comparing Two Distributions

Test Difference between Two Proportions & Plot Confidence Intervals

Use of Classification Trees to Investigate Traits of Invasive Species