7. Fuzzy match ("fuzzyjoin" package)
library(fuzzyjoin)
library(dplyr)
library(ggplot2)
data(diamonds)
# tibble() replaces the deprecated data_frame()
d <- tibble(approximate_name = c("Idea", "Premiums", "Premioom",
                                 "VeryGood", "VeryGood", "Faiir"),
            type = 1:6)
d
# no matches when they are inner-joined:
diamonds %>%
  inner_join(d, by = c(cut = "approximate_name"))
# but we can match when they're fuzzy joined
diamonds %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"))
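# the same join with the most useful parameters spelled out
diamonds %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"),
                        max_dist = 2, distance_col = "Dist", ignore_case = TRUE)
# the same join with an explicit algorithm (method = "jw", Jaro-Winkler distance).
# jw distances lie between 0 and 1, hence the small max_dist; with stringdist's
# default p = 0 this is the plain Jaro distance.
diamonds %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"),
                        max_dist = 0.05, distance_col = "Dist",
                        ignore_case = TRUE, method = "jw")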
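Note on left_join: as far as I can tell, stringdist_left_join keeps every pair within max_dist, not only the closest one, so a single record can match several rows. To keep only the best match per record, use stringdist_inner_join with distance_col, sort by that column and remove duplicates.
Example of syntax for removing duplicates (X is the sorted join result, X[, 3] whichever column identifies a record):
data <- subset(X, !duplicated(X[, 3]))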
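The sort-then-deduplicate idea can be sketched in base R as follows. This is only an illustration under assumptions: the toy vectors, the misspelling "Premiumr", the extra reference value "Premier" and the column names are all hypothetical, and utils::adist() (plain Levenshtein distance) stands in for the fuzzyjoin call so that the snippet runs without extra packages.

```r
# Toy inputs (hypothetical): misspelled names and clean reference values.
approx_names <- c("Idea", "Premiumr", "Faiir")
ref_values   <- c("Fair", "Good", "Very Good", "Premium", "Premier", "Ideal")

# All candidate pairs with their Levenshtein distance (base R, utils::adist).
pairs <- expand.grid(approximate_name = approx_names, cut = ref_values,
                     stringsAsFactors = FALSE)
pairs$Dist <- mapply(function(x, y) as.integer(adist(x, y)),
                     pairs$approximate_name, pairs$cut, USE.NAMES = FALSE)

# Keep only pairs within the threshold, roughly what an inner fuzzy join
# with max_dist = 2 would return.
pairs <- pairs[pairs$Dist <= 2, ]

# Sort by name and distance, then keep the first (closest) row per name.
pairs <- pairs[order(pairs$approximate_name, pairs$Dist), ]
best  <- pairs[!duplicated(pairs$approximate_name), ]
best
```

Here "Premiumr" is within 2 edits of both "Premium" and "Premier"; after sorting, !duplicated() keeps only the closer one. Ties are broken by row order, so if that matters it is worth adding a second sort key.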
On the various matching algorithms, see https://cran.r-project.org/web/packages/stringdist/stringdist.pdf (page 19).
See also agrep (base R).
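For reference, a minimal agrep() sketch that needs no extra packages. It reuses the misspelled names from the fuzzyjoin example; the max.distance = 2 threshold is just an assumption chosen to mirror max_dist = 2. Note that agrep() looks for the pattern approximately anywhere inside each string, so it is an approximate substring match rather than a whole-string distance.

```r
candidates <- c("Idea", "Premiums", "Premioom", "VeryGood", "Faiir")

# Indices of the candidates that approximately contain "Premium"
# (at most 2 edits in total).
hits <- agrep("Premium", candidates, max.distance = 2)
candidates[hits]

# value = TRUE returns the matching strings instead of their indices;
# ignore.case = TRUE makes the comparison case-insensitive.
agrep("premium", candidates, max.distance = 2,
      ignore.case = TRUE, value = TRUE)
```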
Fuzzy match (RecordLinkage package)
require(RecordLinkage)
jarowinkler('William Clinton', "Willam Clntn")
# 0.96
> soundex('Clenton') == soundex('Clinton')
[1] TRUE
> levenshteinDist('Clinton', 'Clenton')
[1] 1
From there one should write a loop that iterates through the second table and, for each record, finds the best candidate: the highest similarity for jarowinkler, or the lowest distance for levenshteinDist.
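A minimal sketch of such a loop, assuming two character vectors left and right (hypothetical names and data). To stay runnable without extra packages it uses base R's adist() (Levenshtein distance, so lower is better); with RecordLinkage loaded, levenshteinDist() could presumably be swapped in at the marked line, or jarowinkler() together with which.max(), since that function returns a similarity.

```r
# Hypothetical inputs: records to match and the table to match against.
left  <- c("Clenton", "Willam Clntn", "Gorge Bush")
right <- c("William Clinton", "Clinton", "George Bush")

best <- data.frame(left = left, match = NA_character_, dist = NA_integer_,
                   stringsAsFactors = FALSE)

for (i in seq_along(left)) {
  # Distance from this record to every candidate (swap the metric here).
  dists <- as.integer(adist(left[i], right))
  j <- which.min(dists)          # first candidate with the lowest distance
  best$match[i] <- right[j]
  best$dist[i]  <- dists[j]
}
best
```

which.min() always returns some candidate, even a poor one, so in practice a cutoff on best$dist is probably needed to reject records that have no reasonable match.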
Bib:
http://stackoverflow.com/questions/16837461/dealing-with-wrong-spelling-when-matching-text-strings-in-r
http://www.princeton.edu/~otorres/FuzzyMergeR101.pdf
https://www.r-bloggers.com/fuzzy-string-matching-a-survival-skill-to-tackle-unstructured-information/