Showing posts with label Web Scraping.
Sunday, September 20, 2015
Next Level Web Scraping
The outcome presented here will not be very useful to most of you - still, it is a good example of what can be done via web scraping in R.
Background: TIRIS is the federal geo-statistical service of North Tyrol, Austria. One of the many great things it provides is historical and recent aerial photographs. These photographs can be addressed via URLs, and that is the basis of the script: the URLs are retrieved, some parameters are adjusted, and with the customized addresses the images are downloaded and animated by saveHTML() from the animation package. The outcome (an HTML animation) lets you view and skip through aerial photographs of any location in North Tyrol, from 1940 to 2010, and see how the landscape, buildings, etc. have changed...
View the script HERE.
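For illustration only, the general pattern might look something like this (a minimal sketch; the URL template, years and file names below are placeholders, not the real TIRIS addresses, which are constructed in the linked script):

library(animation)
library(jpeg)

# Placeholder URL template - the real TIRIS addresses
# are put together in the linked script:
years    <- c(1940, 1970, 2010)
img_urls <- sprintf("https://example.org/tiris/aerial_%d.jpg", years)

saveHTML({
  for (i in seq_along(years)) {
    tmp <- tempfile(fileext = ".jpg")
    download.file(img_urls[i], tmp, mode = "wb")   # fetch the aerial photograph
    plot.new()
    rasterImage(readJPEG(tmp), 0, 0, 1, 1)         # draw it on the current device
    title(main = years[i])
  }
}, img.name = "aerial", htmlfile = "aerial_animation.html")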
R-Function GScholarScraper to Webscrape Google Scholar Search Result
NOTE: You'll find the update HERE and HERE.
NOTE: The script is currently not working because the code of the Google Scholar site has changed...
I'll look into this as soon as I find some spare time!
NOTE: If you try to access Google Scholar programmatically, consider these words of caution:
http://stackoverflow.com/questions/7523961/google-scholar-with-matlab/7587994#7587994
...
Based on my previous post on web scraping, I coded and uploaded the function "GScholarScraper" HERE for testing!
The function will pull all (!) results, processing pages in chunks of 100 results/titles, and return a file with all titles, links, etc. It will also produce a word cloud using the words in the publication titles.
Please try your own search strings and report errors, etc.!
Built and tested under:
R version 2.13.0 (2011-04-13) and R version R-2.13.2 (2011-09-30)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_0.5 tm_0.5-6 wordcloud_1.2 Rcpp_0.9.7
loaded via a namespace (and not attached):
[1] plyr_1.5.1 slam_0.1-23
P.S.: Errors reported lately (see comments) have been resolved and the source code has been updated.
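For illustration, the word-cloud step could be sketched roughly like this (a minimal sketch with made-up example titles and current tm/wordcloud syntax, not the actual GScholarScraper code):

library(tm)
library(wordcloud)

# 'titles' stands in for the scraped publication titles:
titles <- c("Statistical computing with R",
            "Web scraping and text mining in R",
            "Reproducible research with R and Sweave")

corp <- Corpus(VectorSource(titles))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("english"))

# term frequencies across all titles:
tdm  <- TermDocumentMatrix(corp)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, min.freq = 1)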
Get No. of Google Search Hits with R and XML
UPDATE: Thanks to Max Ghenis for updating the R script I wrote a while back - the script below can now be used again for pulling the number of hits from Google Search.
GoogleHits <- function(input)
{
  require(XML)
  require(RCurl)
  # build the search URL (the search term is put in quotes):
  url <- paste("https://www.google.com/search?q=\"",
               input, "\"", sep = "")
  # certificate bundle shipped with RCurl, needed for https:
  CAINFO <- paste(system.file(package = "RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
  script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
  doc <- htmlParse(script)
  # the hit count sits in the div with id "resultStats":
  res <- xpathSApply(doc, '//*/div[@id="resultStats"]', xmlValue)
  cat(paste("\nYour Search URL:\n", url, "\n", sep = ""))
  cat("\nNo. of Hits:\n")
  # strip everything but digits and return as integer:
  return(as.integer(gsub("[^0-9]", "", res)))
}

# Example:
GoogleHits("R%Statistical%Software")
p.s.: If you try to do this in a robot fashion, like lapply(list_of_search_terms, GoogleHits), Google will block you after about the 300th recursion!
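If you do loop over many terms, a simple precaution (my sketch, not from the original post) is to pause between queries with Sys.sleep():

# 'terms' is a made-up example vector of search terms:
terms <- c("R Statistical Software", "Web Scraping", "Text Mining")
hits <- sapply(terms, function(x) {
  Sys.sleep(5)                       # wait a few seconds between requests
  GoogleHits(gsub(" ", "%", x))      # same encoding as in the example above
})
hits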
A Little Web Scraping Exercise with XML-Package
Some months ago I posted an example of how to get the links of the contributing blogs on the R-bloggers site. I used readLines() and did some string processing with regular expressions.
With package XML this can be drastically shortened - see this:
With only a few lines of code this gives the same result as in the original post! Here I will also process the URLs to retrieve links to each blog's start page:

# get blogger urls with XML:
library(RCurl)
library(XML)
library(stringr)  # for str_locate_all() used below
script <- getURL("www.r-bloggers.com")
doc <- htmlParse(script)
li <- getNodeSet(doc, "//ul[@class='xoxo blogroll']//a")
urls <- sapply(li, xmlGetAttr, "href")
# get ids for those with only 2 slashes (no 3rd at the end):
id <- which(nchar(gsub("[^/]", "", urls )) == 2)
slash_2 <- urls[id]
# find position of 3rd slash occurrence in strings:
slash_stop <- unlist(lapply(str_locate_all(urls, "/"),"[[", 3))
slash_3 <- substring(urls, first = 1, last = slash_stop - 1)
# final result, replace the ones with 2 slashes,
# which are lacking in slash_3:
blogs <- slash_3; blogs[id] <- slash_2

p.s.: Thanks to Vincent Zoonekynd for helping out with the XML syntax.
Saturday, September 19, 2015
A Little Webscraping-Exercise...
In R it's quite easy to pull anything out of a webpage, and I'll show a little exercise in doing so. Here I retrieve all blog addresses from R-bloggers using readLines() and some subsequent data processing.
# get the page's html-code
web_page <- readLines("http://www.r-bloggers.com")
# extract relevant part of web page:
# missing line added on oct. 24th:
ul_tags <- grep("ul>", web_page)
pos_1 <- grep("Contributing Blogs", web_page) + 2
pos_2 <- ul_tags[which(ul_tags > pos_1)[1]] - 2
blog_list_1 <- web_page[pos_1:pos_2]
# extract 2nd element of sublists produced by strsplit:
blog_list_2 <- unlist(lapply(strsplit(blog_list_1, "\""), "[[", 2))
# exclude elements without a proper address:
blog_list_3 <- blog_list_2[grep("http:", blog_list_2)]
# plot results:
len <- length(blog_list_3)
x <- rep(1:3, ceiling(len/3))[1:len]
y <- 1:len
par(mar = c(0, 5, 0, 5), xpd = T)
plot(x, y, ylab = "", xlab = "", type = "n",
bty = "n", axes = F)
text(x, y, blog_list_3, cex = 0.5)