Sunday, September 20, 2015

Merging Dataframes by Partly Matching String

The latest posting by Tony Hirst sparked my attention because I was thinking about a very similar issue recently.

I was also fiddling around with agrep and adist until I realised that for this very issue matching of substrings is not as important as matching multiple words.. With this different approach I quite easily matched all but 3 countries.

See what I did:

## look up matches of one dataframe in another dataframe.
## the strings to be matched are comprised of 1 or more words
## and seperated by white space.
## method: match strings that have the highest fraction of words that match up

d1 <- read.csv("http://s.telegraph.co.uk/graphics/conrad/PercentageUsingTheNet.csv",
header = T, sep = ",", encoding = "UTF-8")
d2 <- read.csv("http://www.iso.org/iso/country_names_and_code_elements_txt",
header = T, sep = ";", encoding = "UTF-8")

## strings to be compared d1$ECONOMY and d2$Country.Name
mystr.1 <- as.character(d1$ECONOMY)
mystr.2 <- as.character(d2$Country.Name)
mystr.3 <- as.character(d2$ISO.3166.1.alpha.2.code)

## remove punctuation and multiple spaces
mystr.1 <- tolower(gsub("[^[:alnum:][:space:]]", "", mystr.1))
mystr.1 <- gsub("\\s+", " ", mystr.1)
mystr.2 <- tolower(gsub("[^[:alnum:][:space:]]", "", mystr.2))
mystr.2 <- gsub("\\s+", " ", mystr.2)

## function that finds matching words in string (words seperated by single space!)
n.wordshared <- function(x, y) {
sum(!is.na(match(unlist(strsplit(x, " ")),
unlist(strsplit(y, " ")))
)
)
}
## example
n.wordshared(x = "hello world", y = "good bye world")
## [1] 1

## function that calculates fraction of shared words
fr.wordshared <- function(x, y) {
n.wordshared(x, y) / (length(unique(unlist(strsplit(x, " "))))
+ length(unique(unlist(strsplit(y, " ")))))
}
## example
fr.wordshared(x = "hello world", y = "good bye world")
## [1] 0.2

mydf <- data.frame(str1 = mystr.1, mymatch = "", match.iso = "",
stringsAsFactors = F)

## now look up every element of string 1 in string 2
## and if there are matching words assign match to dataframe
for (i in 1:nrow(mydf)) {
xx <- sapply(mystr.2, fr.wordshared, y = mystr.1[i])
if (sum(xx) == 0) {
mydf$mymatch[i] <- NA
mydf$match.iso[i] <- NA
} else {
mydf$mymatch[i] <- paste(names(which(xx == max(xx))), collapse = "; ")
mydf$match.iso[i] <- paste(mystr.3[as.integer(which(xx == max(xx)))], collapse = "; ")
}
}

## see result
print(mydf)

## these are the multiple matches
(aa <- mydf[grep(";", mydf$mymatch), ])
##
## str1 mymatch match.iso
## 28 slovak republic czech republic; dominican republic CZ; DO


## these were not matched
(bb <- mydf[is.na(mydf$mymatch), ])
## str1 mymatch match.iso
##
## 61 russia NA NA
## 108 syria NA NA

Now, expanding on this concept by introduction of partial matching would most propably result in a 100% match...

No comments:

Post a Comment