Online Trick: stringr

Showing posts with label stringr. Show all posts

Sunday, September 20, 2015

Webscraping Google Scholar & Show Result as Word Cloud Using R

NOTE: Please see the update HERE and HERE!

...When reading Scott Chemberlain's last post about web-scraping I felt it was time to pick up and complete an idea that I was brooding over for some time now:

When a scientist aims out for a new project the first thing to do is to evaluate if other people already have come along to answer the very questions he is about to work on. I.e., I was interested if there has been done any research regarding amphibian diversity at regional/geographical scales correlated to environmental/landscape parameters. Usually I would got to Google-Scholar and search something like - intitle:amphibians AND intitle:richness OR intitle:diversity AND environment OR landscape - and then browse thru the results. But, this is often tedious and a way for a quick visual examination would be of great benefit.

The code I present will solve this task. It may be awkward in places and there might be a more effective way to yield the same result - but it may serve as a starter and I would very much appreciate people more literate than me picking up the torch...

For my example-search it is shown that there has not been very much going on regarding amphibian diversity correlated to environment and landscape...

See code HERE.

PS: I'd be happy about collaboration / tips / editing - so feel free to contact me and I will add you to the list of editors - you then could edit / comment / add to the script on Google Docs.

...some drawbacks need to be considered:

Maximum no. of search results = 100
Only titles are considered. Additionally considering abstracts may yield more representative results.. but abstracts are truncated in the search result and I don't know if it is possible to retrieve the full abstracts.
Also, long titles may be truncated...
A more illustrative result would be achieved if one could get rid of all other words than nouns, verbs and adjectives - don't know how to do this, but I am sure this is possible.
more drawbacks? you tell..

I found that examples for the use of regex in R are rather rare. Thus, I will provide some examples from my own learning materials - mostly stolen from the help pages, with small but maybe illustrative adaptions. ps: I will extent this list of examples HERE occasionally..

library(stringr)

shopping_list <- c("bread & Apples §$%&/()=?4", "flouR", "sugar", "milk x2")
str_extract(shopping_list, "[A-Z].*[1-9]")
# this extracts partial strings starting with an upper-case letter
# and ending with a digit, for all elements of the input vector..
# "." period, any single case letter, "*" the preceding item will
# be matched zero or more times, ".*" regex for a string
# comprised of any item being repeated arbitrarily often.

# output:
[1] "Apples §$%&/()=?4" NA                  NA                  NA

str_extract(shopping_list, "[a-z]{1,4}")
# this extracts partial strings with lowercase repetitions of 4,
# for all elements of the input vector..

# output:
[1] "brea" "flou" "suga" "milk"

str_extract(shopping_list, "\\b[a-z]{1,4}\\b")
# this extracts whole words with lowercase repetitions of 4,
# for all elements of the input vector..

#output:
[1] NA     NA     NA     "milk"

str <- c("&George W. Bush", "Lyndon B. Johnson?")
gsub("[^[:alnum:][:space:].]", "", str)
# keep alphanumeric signs AND full-stop, remove anything else,
# that is, all other punctuation. what should not be matched is
# designated by the caret.

# output:
[1] "George W. Bush"    "Lyndon B. Johnson"

How to Extract Citation from a Body of Text

Say, you have a text and you want to retrieve the cited names and years of publication. You wouldn't want to this by hand, wouldn't you?

Try the following approach:
(the text sample comes from THIS freely available publication)

library(stringr)

(txt <- readLines("http://dl.dropbox.com/u/68286640/Test_Doc.txt"))
[1] "1  Introduction"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
[2] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
[3] "Climate projections of the Intergovernmental Panel on Climate Change (IPCC) forecast a general increase of seasonal temperatures in the present century across the temperate zone, aggravated by decreasing amounts of summer rainfall in certain regions at lower latitudes (Christensen et al. 2007). These changes imply serious ecological consequences, especially in biome transition zones (Fischlin et al. 2007). Due to their economic importance, as well as their major contribution to supporting, regulating and cultural ecosystem services, predicted changes and shifts in temperate forest ecosystems receive wide public attention. It’s no surprise that dominant forest tree species are frequently modelled in bioclimatic impact studies (e.g., Sykes et al. 1996; Iverson, Prasad 2001; Rehfeldt et al. 2003; Ohlemüller et al. 2006). However, most studies focus on continental-scale effects of climate change, using low resolution climatic and species distribution data. More detailed regional studies focussing on specific endangered regions are also needed (Benito Garzón et al. 2008). Such regional studies have already been prepared for several European regions, including the Swiss Alps (Bolliger et al. 2000), the British Isles (Berry et al. 2002) and the Iberian Peninsula (Benito Garzón et al. 2008)."                                                                                                                    
[4] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
[5] "In this study, we aim to (1) identify the limiting macroclimatic factors and to (2) predict the future boundaries of beech (Fagus sylvatica L.) and sessile oak (Quercus petraea (Mattuschka) Liebl.) forests in a region highly vulnerable to climatic extremes. Both tree species form extensive zonal forests throughout Central Europe and reach their low altitude/low latitude, xeric (Mátyás et al. 2009) distributional limits within the forest-steppe biome transition zone of Hungary. The rise of temperature, and especially summer rainfall deficits expected for the twenty-first century, may strongly affect both species. Nevertheless, regarding the potential future distribution of these important forest tree species along their xeric boundaries in Central Europe, there has been no detailed regional analysis before. Experimental studies and field survey data suggest a strong decline in beech regeneration (Czajkowski et al. 2005; Penuelas et al. 2007; Lenoir et al. 2009) and increased mortality rates following prolonged droughts (Berki et al. 2009). Mass mortality and range retraction are potential consequences, which have been already sporadically observed in field survey studies (Jump et al. 2009; Allen et al. 2010; Mátyás et al. 2009). With the study, we intend to assist in assessing overall risks, locating potentially affected regions and supporting the formulation of appropriate measures and strategies."
[6] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
[7] "Beech and sessile oak forests of Hungary are to a large extent “trailing edge” populations (Hampe and Petit 2005), which should be preferably modelled using specific modelling strategies (Thuiller et al. 2008). Most modelling studies do not differentiate between leading and trailing edges and rely on assumptions and techniques which are intrinsically more appropriate for “leading edge” situations. Being aware of these challenges, we compiled a statistical methodology customized to yield inference on influential variables and providing robust and reliable predictions for climate-dependent populations near their xeric limits. We laid special emphasis on three features in the course of the modelling process: (1) screening of the occurrence data in order to limit modelling to plausible zonal (i.e. macroclimatically determined) occurrences, (2) avoiding pitfalls of statistical pseudoreplication caused by spatial autocorrelation (a problem to which regional distribution modelling studies are particularly prone; Dormann 2007) and (3) simultaneous use of several initial and boundary conditions in an ensemble modelling framework (Araújo et al. 2005; Araújo and New 2007; Beaumont et al. 2007). "                                                                                                                                                                                                                         

# retrieve text inbetween parantheses:
extr1 <- unlist(str_extract_all(txt, pattern = "\\(.*?\\)"))

# keep only those elements which have four digit strings (years):
extr2 <- extr1[grep("[0-9]{4}", extr1)]

# extract partial strings starting with uppercase letter (name)
# and end in a four digit string (year):
(str_extract(extr2, "[A-Z].*[0-9]"))
 [1] "Christensen et al. 2007"                                                              
 [2] "Fischlin et al. 2007"                                                                 
 [3] "Sykes et al. 1996; Iverson, Prasad 2001; Rehfeldt et al. 2003; Ohlemüller et al. 2006"
 [4] "Benito Garzón et al. 2008"                                                            
 [5] "Bolliger et al. 2000"                                                                 
 [6] "Berry et al. 2002"                                                                    
 [7] "Benito Garzón et al. 2008"                                                            
 [8] "Mátyás et al. 2009"                                                                   
 [9] "Czajkowski et al. 2005; Penuelas et al. 2007; Lenoir et al. 2009"                     
[10] "Berki et al. 2009"                                                                    
[11] "Jump et al. 2009; Allen et al. 2010; Mátyás et al. 2009"                              
[12] "Hampe and Petit 2005"                                                                 
[13] "Thuiller et al. 2008"                                                                 
[14] "Dormann 2007"                                                                         
[15] "Araújo et al. 2005; Araújo and New 2007; Beaumont et al. 2007"                        

# as proposed by a commentator -
# do this if you want each citation seperately:
(unlist(str_extract_all(extr2, "[A-Z].*?[0-9]{4}")))
 [1] "Christensen et al. 2007"   "Fischlin et al. 2007"     
 [3] "Sykes et al. 1996"         "Iverson, Prasad 2001"     
 [5] "Rehfeldt et al. 2003"      "Ohlemüller et al. 2006"   
 [7] "Benito Garzón et al. 2008" "Bolliger et al. 2000"     
 [9] "Berry et al. 2002"         "Benito Garzón et al. 2008"
[11] "Mátyás et al. 2009"        "Czajkowski et al. 2005"   
[13] "Penuelas et al. 2007"      "Lenoir et al. 2009"       
[15] "Berki et al. 2009"         "Jump et al. 2009"         
[17] "Allen et al. 2010"         "Mátyás et al. 2009"       
[19] "Hampe and Petit 2005"      "Thuiller et al. 2008"     
[21] "Dormann 2007"              "Araújo et al. 2005"       
[23] "Araújo and New 2007"       "Beaumont et al. 2007"

Sunday, September 20, 2015

Webscraping Google Scholar & Show Result as Word Cloud Using R

Some Simple but Propably Useful Regex Examples with R-Package stringr...

How to Extract Citation from a Body of Text