Showing posts with label XML-Package.
Sunday, September 20, 2015
Web-Scraper for Google Scholar Updated!
I have updated the Google Scholar web-scraper function GScholarScraper_2 to GScholarScraper_3 (and GScholarScraper_3.1), as it had become outdated due to changes in the Google Scholar HTML code. The new script is leaner and faster. It returns a dataframe or, optionally, a CSV file with the titles, authors, publications and links. Feel free to report bugs, etc.
Update 11-07-2013: bug fixes due to Google Scholar code changes - https://github.com/gimoya/onlinetrickpdf-Archives/blob/master/R/Functions/GScholarScraper_3.2.R. Note that Google now blocks your IP at about the 1000th (cumulative) search result, so extensive bibliometrics are not much fun anymore..
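For illustration only (this is not the GScholarScraper code itself), here is a minimal sketch of the underlying idea: fetch one result page and pull the titles via XPath. The 'gs_rt' class name is an assumption and breaks whenever Google changes its HTML - which is exactly why the script needed updating:
library(RCurl)
library(XML)
query <- "allometric scaling"
url <- paste0("http://scholar.google.com/scholar?q=", gsub(" ", "+", query))
CAINFO <- paste(system.file(package = "RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
page <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
doc <- htmlParse(page)
# 'gs_rt' is the (assumed) class of the result-title headers
titles <- xpathSApply(doc, "//h3[contains(@class, 'gs_rt')]", xmlValue)
head(titles)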
Use Case for KML Parsing: Make a New KML File from a File Collection
In this use case I had collected several KML files from the internet but wanted to strip them down to only the relevant parts (the LineStrings inside the Placemark nodes) and put them all into one final file. In my script I create a new KML file and populate a Folder node inside it with the LineStrings from the collection of KML files, which all reside in the same source directory. For this, one needs to parse each file, grab the appropriate nodes and add them to the target KML file. In addition, I alter some of the original values, i.e. I use the file names of the individual KML files as Placemark names inside the new KML file.
Here is the final file as seen after opening in Google Earth:
library(XML)
# new kml file... needs to be well-formed
# (KML skeleton reconstructed here: a minimal Document with one empty Folder
#  named ROUTES, which the loop below fills with Placemarks)
z <- '<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document><name>ROUTES</name>
<Folder><name>ROUTES</name></Folder>
</Document></kml>'
new_xmlDoc <- xmlInternalTreeParse(z, useInternalNodes = TRUE)
# important: add all namespace definitions...
ns <- c(gx="http://www.google.com/kml/ext/2.2",
kml="http://www.opengis.net/kml/2.2",
atom="http://www.w3.org/2005/Atom")
ensureNamespace(new_xmlDoc, ns)
# get the root of the new file for later processing
new_root <- xmlRoot(new_xmlDoc)
# loop over files from folder
# and insert Placemark content of each file as children nodes into
# the new file
setwd("C:/Users/Kay/Google Drive/SKI-BIKE/Gastein")
files <- dir(pattern = "^bergfex")  # note: pattern is a regular expression, not a glob
for (f in files) {
# get placemark node of each file
doc <- xmlInternalTreeParse(f, useInternalNodes = TRUE)
root <- xmlRoot(doc)
plcm_node <- root[["Document"]][["Folder"]][["Folder"]][["Placemark"]]
# insert file name as Placemark name
xmlValue(plcm_node[["name"]]) <- sub('bergfextour_(.*)[.]kml', '\\1', f)
# add placemark node to new doc
addChildren(new_root[["Document"]][["Folder"]], plcm_node)
}
# save it...
saveXML(new_xmlDoc, "collapsed_ROUTES.kml")
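As an optional check (not part of the original script), one can re-parse the written file and count the Placemark nodes that ended up in it - assuming they sit in the standard KML namespace:
out <- xmlParse("collapsed_ROUTES.kml")
length(getNodeSet(out, "//k:Placemark",
                  namespaces = c(k = "http://www.opengis.net/kml/2.2")))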
Get No. of Google Search Hits with R and XML
UPDATE: Thanks to Max Ghenis for updating the R script I wrote a while back - the script below can now be used again to pull the number of hits from a Google search.
GoogleHits <- function(input)
{
require(XML)
require(RCurl)
url <- paste("https://www.google.com/search?q=\"",
input, "\"", sep = "")
CAINFO = paste(system.file(package="RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
doc <- htmlParse(script)
res <- xpathSApply(doc, '//*/div[@id="resultStats"]', xmlValue)
cat(paste("\nYour Search URL:\n", url, "\n", sep = ""))
cat("\nNo. of Hits:\n")
return(as.integer(gsub("[^0-9]", "", res)))
}
# Example:
GoogleHits("R%Statistical%Software")
p.s.: If you try to do this in a robot fashion, like:
lapply(list_of_search_terms, GoogleHits)
Google will block you after about the 300th recursion!
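If you loop over many terms anyway, it helps to at least pace the requests. A sketch with a made-up term list; note that pausing only postpones the blocking, it does not prevent it:
list_of_search_terms <- c("R Statistical Software", "XML package R")  # hypothetical
hits <- sapply(list_of_search_terms, function(term) {
  Sys.sleep(10)  # wait 10 seconds between queries
  GoogleHits(gsub(" ", "%", term))
})
hits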
A Little Web Scraping Exercise with XML-Package
Some months ago I posted an example of how to get the links of the contributing blogs on the R-Bloggers site. I used readLines() and did some string processing with regular expressions.
With the XML package this can be drastically shortened - see this:
# get blogger urls with XML:
library(RCurl)
library(XML)
library(stringr)  # needed for str_locate_all() below
script <- getURL("www.r-bloggers.com")
doc <- htmlParse(script)
li <- getNodeSet(doc, "//ul[@class='xoxo blogroll']//a")
urls <- sapply(li, xmlGetAttr, "href")
With only a few lines of code this gives the same result as in the original post! Here I will also process the urls for retrieving links to each blog's start page:
# get ids for those with only 2 slashes (no 3rd at the end):
id <- which(nchar(gsub("[^/]", "", urls )) == 2)
slash_2 <- urls[id]
# find position of 3rd slash occurrence in strings:
slash_stop <- unlist(lapply(str_locate_all(urls, "/"),"[[", 3))
slash_3 <- substring(urls, first = 1, last = slash_stop - 1)
# final result, replace the ones with 2 slashes,
# which are lacking in slash_3:
blogs <- slash_3; blogs[id] <- slash_2
p.s.: Thanks to Vincent Zoonekynd for helping out with the XML syntax.
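As a side note (not from the original post), the slash counting can also be collapsed into a single regular expression; a minimal sketch that should yield the same blogs vector:
library(RCurl)
library(XML)
script <- getURL("www.r-bloggers.com")
doc <- htmlParse(script)
urls <- sapply(getNodeSet(doc, "//ul[@class='xoxo blogroll']//a"), xmlGetAttr, "href")
# keep only "protocol://domain", i.e. everything before a 3rd slash (if any)
blogs <- sub("^([^/]*//[^/]*).*$", "\\1", urls)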
Saturday, September 19, 2015
FloraWeb Plant Species Report via R
For German-speaking users I added the function floraweb_scrape.R, which allows you to conveniently collect species data and print them to a PDF file (see this example output). The function accesses data provided by the website FloraWeb.de (BfN - Bundesamt für Naturschutz).
You can use it as an interactive version (R TclTk), which I have put in a GitHub repository HERE.
Preview: