Showing posts with label XML-Package.
Sunday, September 20, 2015
Web-Scraper for Google Scholar Updated!
I have updated the Google Scholar web-scraper function GScholarScraper_2 to GScholarScraper_3 (and GScholarScraper_3.1), as it had become outdated due to changes in the Google Scholar HTML code. The new script is leaner and faster. It returns a dataframe or, optionally, a CSV file with the titles, authors, publications and links. Feel free to report bugs, etc.
Update 11-07-2013: bug fixes due to Google Scholar code changes - https://github.com/gimoya/onlinetrickpdf-Archives/blob/master/R/Functions/GScholarScraper_3.2.R. Note that Google now blocks your IP at about the 1000th (cumulative) search result, so extensive bibliometrics are not much fun anymore..
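For illustration only (this is not the GScholarScraper code itself), here is a minimal sketch of the underlying idea: fetch one result page and pull the titles via XPath. The 'gs_rt' class name is an assumption and breaks whenever Google changes its HTML - which is exactly why the script needed updating:
library(RCurl)
library(XML)
query <- "allometric scaling"
url <- paste0("http://scholar.google.com/scholar?q=", gsub(" ", "+", query))
CAINFO <- paste(system.file(package = "RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
page <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
doc <- htmlParse(page)
# 'gs_rt' is the (assumed) class of the result-title headers
titles <- xpathSApply(doc, "//h3[contains(@class, 'gs_rt')]", xmlValue)
head(titles)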
Use Case for KML Parsing: Make a New KML File from a File Collection
In this use case I had collected several KML files from the internet but wanted to strip them down to only the relevant parts (the LineStrings inside the Placemark nodes) and put them all into one final file. In my script I create a new KML file and populate a Folder node inside it with the LineStrings from the collection of KML files, which all reside in the same source directory. For this, one needs to parse each file, grab the appropriate nodes and add them to the target KML file. In addition, I alter some of the original values, i.e. I use the file names of the individual KML files as Placemark names inside the new KML file.
Here is the final file as seen after opening in Google Earth:
library(XML)
# new kml file... needs to be well-formed
# (KML skeleton reconstructed here: a minimal Document with one empty Folder
#  named ROUTES, which the loop below fills with Placemarks)
z <- '<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document><name>ROUTES</name>
<Folder><name>ROUTES</name></Folder>
</Document></kml>'
new_xmlDoc <- xmlInternalTreeParse(z, useInternalNodes = TRUE)
# important: add all namespace definitions...
ns <- c(gx="http://www.google.com/kml/ext/2.2",
kml="http://www.opengis.net/kml/2.2",
atom="http://www.w3.org/2005/Atom")
ensureNamespace(new_xmlDoc, ns)
# get the root of the new file for later processing
new_root <- xmlRoot(new_xmlDoc)
# loop over files from folder
# and insert Placemark content of each file as children nodes into
# the new file
setwd("C:/Users/Kay/Google Drive/SKI-BIKE/Gastein")
files <- dir(pattern = "^bergfex")  # note: pattern is a regular expression, not a glob
for (f in files) {
# get placemark node of each file
doc <- xmlInternalTreeParse(f, useInternalNodes = TRUE)
root <- xmlRoot(doc)
plcm_node <- root[["Document"]][["Folder"]][["Folder"]][["Placemark"]]
# insert file name as Placemark name
xmlValue(plcm_node[["name"]]) <- sub('bergfextour_(.*)[.]kml', '\\1', f)
# add placemark node to new doc
addChildren(new_root[["Document"]][["Folder"]], plcm_node)
}
# save it...
saveXML(new_xmlDoc, "collapsed_ROUTES.kml")
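As an optional check (not part of the original script), one can re-parse the written file and count the Placemark nodes that ended up in it - assuming they sit in the standard KML namespace:
out <- xmlParse("collapsed_ROUTES.kml")
length(getNodeSet(out, "//k:Placemark",
                  namespaces = c(k = "http://www.opengis.net/kml/2.2")))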
Get No. of Google Search Hits with R and XML
UPDATE: Thanks to Max Ghenis for updating the R script I wrote a while back - the script below can now be used again to pull the number of hits from a Google search.
GoogleHits <- function(input)
{
require(XML)
require(RCurl)
url <- paste("https://www.google.com/search?q=\"",
input, "\"", sep = "")
CAINFO = paste(system.file(package="RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
doc <- htmlParse(script)
res <- xpathSApply(doc, '//*/div[@id="resultStats"]', xmlValue)
cat(paste("\nYour Search URL:\n", url, "\n", sep = ""))
cat("\nNo. of Hits:\n")
return(as.integer(gsub("[^0-9]", "", res)))
}
# Example:
GoogleHits("R%Statistical%Software")
p.s.: If you try to do this in a robot fashion, like:
lapply(list_of_search_terms, GoogleHits)
Google will block you after about the 300th recursion!
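If you loop over many terms anyway, it helps to at least pace the requests. A sketch with a made-up term list; note that pausing only postpones the blocking, it does not prevent it:
list_of_search_terms <- c("R Statistical Software", "XML package R")  # hypothetical
hits <- sapply(list_of_search_terms, function(term) {
  Sys.sleep(10)  # wait 10 seconds between queries
  GoogleHits(gsub(" ", "%", term))
})
hits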
A Little Web Scraping Exercise with XML-Package
Some months ago I posted an example of how to get the links of the contributing blogs on the R-Bloggers site. I used readLines() and did some string processing with regular expressions.
With the XML package this can be drastically shortened - see this:
# get blogger urls with XML:
library(RCurl)
library(XML)
library(stringr)  # needed for str_locate_all() below
script <- getURL("www.r-bloggers.com")
doc <- htmlParse(script)
li <- getNodeSet(doc, "//ul[@class='xoxo blogroll']//a")
urls <- sapply(li, xmlGetAttr, "href")
With only a few lines of code this gives the same result as in the original post! Here I will also process the urls for retrieving links to each blog's start page:
# get ids for those with only 2 slashes (no 3rd at the end):
id <- which(nchar(gsub("[^/]", "", urls )) == 2)
slash_2 <- urls[id]
# find position of 3rd slash occurrence in strings:
slash_stop <- unlist(lapply(str_locate_all(urls, "/"),"[[", 3))
slash_3 <- substring(urls, first = 1, last = slash_stop - 1)
# final result, replace the ones with 2 slashes,
# which are lacking in slash_3:
blogs <- slash_3; blogs[id] <- slash_2
p.s.: Thanks to Vincent Zoonekynd for helping out with the XML syntax.
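As a side note (not from the original post), the slash counting can also be collapsed into a single regular expression; a minimal sketch that should yield the same blogs vector:
library(RCurl)
library(XML)
script <- getURL("www.r-bloggers.com")
doc <- htmlParse(script)
urls <- sapply(getNodeSet(doc, "//ul[@class='xoxo blogroll']//a"), xmlGetAttr, "href")
# keep only "protocol://domain", i.e. everything before a 3rd slash (if any)
blogs <- sub("^([^/]*//[^/]*).*$", "\\1", urls)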
Saturday, September 19, 2015
FloraWeb Plant Species Report via R
For German-speaking users I added the function floraweb_scrape.R, which allows you to conveniently collect species data and print them to a PDF file (see this example output). The function accesses data provided by the website FloraWeb.de (BfN - Bundesamt für Naturschutz).
You can use it as an interactive version (R TclTk), which I have put in a GitHub repository HERE.
Preview: