Showing posts with label grep(). Show all posts

Sunday, September 20, 2015

Dear Silvio!..

R-Code to produce this nice gif-animated greeting card can be viewed HERE.


Import dbf to R, Manipulate Strings with grep & sub Function

Here's a set of historical species presence records from a certain geographical region (data-link). I wanted to manipulate / simplify the strings (species names) and get an overview of the data.
...The tasks were to split genus and epithet, to exclude species whose names contain specific strings, and to strip unwanted text (author names). For a graphical presentation of the species record history I made a plot with segments spanning the first and last year each species was recorded:

# for dbf import you'll need:
require(foreign)

# change to your directory:
TLMFB_IL <- read.dbf("D:\\Downloads\\TLMFB_IL.dbf")

str(TLMFB_IL)

# get rid of hybrids:
dat_sub1 <- TLMFB_IL[-grep(" x ", TLMFB_IL$NAME),]

# check how many were dismissed:
length(TLMFB_IL$NAME) - length(dat_sub1$NAME)

# get genera and epithets:
Gen <- sub(" .*", "", dat_sub1$NAME)
Epi <- sub(" .*", "", substring(dat_sub1$NAME, nchar(Gen)+2))
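# quick illustration of the two patterns on a hypothetical name:
# sub(" .*", "", "Carex flava L.")                 # -> "Carex"
# sub(" .*", "", substring("Carex flava L.", 7))   # -> "flava"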

# get rid of species with unsure determination:
dat_sub2 <- dat_sub1[Epi != "spec." &
                     Epi != "sp." &
                     Epi != "cf." &
                     Epi != "" &
                     Epi != "?", ]

length(TLMFB_IL$NAME) - length(dat_sub2$NAME)

# check:
dat_sub2$NAME
TLMFB_IL$NAME

table(Epi != "spec." &
      Epi != "sp." &
      Epi != "cf." &
      Epi != "" &
      Epi != "?")

length(grep(" x ", TLMFB_IL$NAME))
length(dat_sub2$NAME)

# get genera and epithets:
Gen <- sub(" .*", "", dat_sub2$NAME)
Epi <- sub(" .*", "", substring(dat_sub2$NAME, nchar(Gen)+2))

# get rid of authors:
sp <- paste(Gen, Epi)
length(sp)

# check an arbitrary sample of 100 rows; the console window should be
# wide enough to show the columns next to each other:
id <- sample(seq_along(sp), 100)
data.frame(sp = sp, orig = dat_sub2$NAME)[id, ]

# add species names without authors to dataframe
dat_sub2$Sp <- sp

# there are some erroneous values that should be discarded:
dat_sub3 <- dat_sub2[dat_sub2$date_long < 2010,]
str(dat_sub3)

# get max and min year at which species were recorded
y_min <- aggregate(. ~ Sp, min, data = dat_sub3[,c("Sp","date_long")])
y_max <- aggregate(. ~ Sp, max, data = dat_sub3[,c("Sp","date_long")])
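# sanity check: both aggregates list the species in the same order,
# so their columns can be combined directly below
stopifnot(identical(y_min$Sp, y_max$Sp))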

# plot each species first and last record in line plot, data:
head(pldat <- data.frame(Sp = y_min[,1], y_min = y_min[,2],
                         y_max = y_max[,2], span = y_max[,2] - y_min[,2]))

# plot:
# example("segments")

# new ordering for plot:
pldat <- pldat[order(pldat$y_min, pldat$span),]
plot(x = c(min(pldat$y_min), max(pldat$y_max)), y = c(1, nrow(pldat)),
     type = 'n', xlab = "Year", ylab = "", axes = FALSE, frame.plot = FALSE)

segments(x0 = pldat$y_min, x1 = pldat$y_max, y0 = 1:nrow(pldat),
         y1 = 1:nrow(pldat), col = "gray60")

axis(1, pretty(1800:2000), cex.axis = 0.75)
mtext(paste("Species Records Innsbruck\n", " (Sp - N = ", nrow(pldat), ")",
sep = ""), side = 2, line = -2)


# especially between the 1960s and 1980s many species were
# re-recorded or newly added, by only a few authors:

sixt_eight <- dat_sub3[dat_sub3$date_long > 1960 & dat_sub3$date_long < 1980, c("AUTOR", "Sp")]

data.frame(table(as.character(sixt_eight$AUTOR)))
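# sorted overview of how many records each author contributed in that period:
sort(table(as.character(sixt_eight$AUTOR)), decreasing = TRUE)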

To cite package ‘foreign’ in publications use:
R-core members, Saikat DebRoy, Roger Bivand and others; see COPYRIGHTS file in the sources (2011).
foreign: Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, ....
R package version 0.8-44. http://CRAN.R-project.org/package=foreign

Webscraping Google Scholar & Show Result as Word Cloud Using R

NOTE: Please see the update HERE and HERE!

...When reading Scott Chamberlain's last post about web-scraping I felt it was time to pick up and complete an idea that I had been brooding over for some time:

When a scientist sets out on a new project, the first thing to do is to check whether other people have already answered the very questions he is about to work on. For instance, I was interested in whether any research had been done on amphibian diversity at regional/geographical scales correlated with environmental/landscape parameters. Usually I would go to Google Scholar and search for something like - intitle:amphibians AND intitle:richness OR intitle:diversity AND environment OR landscape - and then browse through the results. But this is often tedious, and a way to examine the results quickly and visually would be of great benefit.

The code I present solves this task. It may be awkward in places and there might be a more efficient way to yield the same result - but it may serve as a starter, and I would very much appreciate people more literate than me picking up the torch...

My example search shows that not very much has been going on regarding amphibian diversity correlated to environment and landscape...

See code HERE.
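Since the full script is only linked above, here is a minimal sketch of the idea - not the linked script itself - assuming the Google Scholar result page has already been saved locally as scholar.html (a hypothetical filename), that result titles sit inside <h3> tags, and that the wordcloud package is installed:

# minimal sketch, assumptions: a locally saved result page "scholar.html",
# titles inside <h3 ...> tags, and the 'wordcloud' package
require(wordcloud)

html <- readLines("scholar.html", warn = FALSE)

# lines containing result titles (assumed to sit in <h3> tags):
hits <- grep("<h3", html, value = TRUE)
titles <- gsub("<[^>]+>", "", hits)        # strip remaining html tags

# split into lower-case words and drop very short ones:
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[nchar(words) > 3]
freq <- sort(table(words), decreasing = TRUE)

wordcloud(names(freq), as.numeric(freq), min.freq = 2)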

PS: I'd be happy about collaboration / tips / editing - so feel free to contact me and I will add you to the list of editors - you then could edit / comment / add to the script on Google Docs.

...some drawbacks need to be considered:
  • Maximum no. of search results = 100
  • Only titles are considered. Additionally considering abstracts might yield more representative results... but abstracts are truncated in the search results and I don't know if it is possible to retrieve them in full.
  • Also, long titles may be truncated...
  • A more illustrative result would be achieved if one could get rid of all words other than nouns, verbs and adjectives - I don't know how to do this properly, but I am sure it is possible (a crude stopword-based approximation is sketched below).
  • more drawbacks? you tell..
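The following is not real part-of-speech filtering, just a crude approximation with a hand-made stopword vector (incomplete, chosen for illustration), applied to the words vector from the sketch above:

# crude stopword filter, a rough stand-in for part-of-speech tagging;
# the stopword vector is hand-made and surely incomplete
stopw <- c("the", "and", "with", "from", "for", "their",
           "between", "using", "based")
words <- words[!words %in% stopw]
freq  <- sort(table(words), decreasing = TRUE)
wordcloud(names(freq), as.numeric(freq), min.freq = 2)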


Saturday, September 19, 2015

A Little Webscraping-Exercise...

In R it's quite easy to pull anything out of a webpage, and I'll show a little exercise in doing so. Here I retrieve all blog addresses listed on R-bloggers with the function readLines() and some subsequent string processing.

# get the page's html-code
web_page <- readLines("http://www.r-bloggers.com")

# extract relevant part of web page:
# missing line added on oct. 24th:
ul_tags <- grep("ul>", web_page)

pos_1 <- grep("Contributing Blogs", web_page) + 2
pos_2 <- ul_tags[which(ul_tags > pos_1)[1]] - 2

blog_list_1 <- web_page[pos_1:pos_2]

# extract 2nd element of sublists produced by stringsplit:
blog_list_2 <- unlist(lapply(strsplit(blog_list_1, "\""), "[[", 2))
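# e.g. on a hypothetical list item:
# strsplit('<li><a href="http://example.com/">Example</a></li>', "\"")[[1]][2]
# returns "http://example.com/"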

# exclude elements without a proper address:
blog_list_3 <- blog_list_2[grep("http:", blog_list_2)]

# plot results:
len <- length(blog_list_3)
x <- rep(1:3, ceiling(len/3))[1:len]
y <- 1:len

par(mar = c(0, 5, 0, 5), xpd = TRUE)
plot(x, y, ylab = "", xlab = "", type = "n",
     bty = "n", axes = FALSE)
text(x, y, blog_list_3, cex = 0.5)