7. Fuzzy match ("fuzzyjoin" package)

library(fuzzyjoin)
library(dplyr)
library(ggplot2)
data(diamonds)

d <- tibble(approximate_name = c("Idea", "Premiums", "Premioom",
                                 "VeryGood", "VeryGood", "Faiir"),
            type = 1:6)
d
# no matches when they are inner-joined:
diamonds %>%
  inner_join(d, by = c(cut = "approximate_name"))

# but we can match when they're fuzzy joined
diamonds %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"))

# the same join with the most useful parameters spelled out
diamonds %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"),
                        max_dist = 2, distance_col = "Dist", ignore_case = TRUE)

# the same join with an explicit algorithm (jw = Jaro-Winkler distance, which ranges from 0 to 1)
diamonds %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"),
                        max_dist = 0.05, distance_col = "Dist", ignore_case = TRUE, method = "jw")

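Note that the stringdist joins (including stringdist_left_join) return every pair whose distance is within max_dist, not only the closest one. To keep a single best match per row, use stringdist_inner_join with distance_col, sort by that column and remove the duplicates.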

Example syntax for removing duplicates:

# keep the first row for each distinct value in column 3
data <- subset(X, !duplicated(X[, 3]))
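
A minimal sketch of the sort-and-deduplicate idea above, using two small hypothetical tables (`left`, `right` and their columns are made up for illustration) instead of the full diamonds data. It assumes the default distance method of stringdist_inner_join.

```r
library(fuzzyjoin)
library(dplyr)

# Hypothetical tables, small enough to check by eye
left  <- tibble(id = 1:3, name = c("Ideal", "Premium", "Fair"))
right <- tibble(approximate_name = c("Idea", "Premiums", "Premioom", "Faiir"),
                type = 1:4)

best <- left %>%
  stringdist_inner_join(right, by = c(name = "approximate_name"),
                        max_dist = 2, distance_col = "Dist") %>%
  arrange(Dist) %>%                 # closest pairs first
  distinct(id, .keep_all = TRUE)    # keep only the first (closest) pair per id

best
```

Here "Premium" is within distance 2 of both "Premiums" (1 edit) and "Premioom" (2 edits), so the join returns both pairs; after sorting and deduplicating only the "Premiums" pair remains.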

On the various matching algorithms, see https://cran.r-project.org/web/packages/stringdist/stringdist.pdf, page 19.
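
To get a feel for how the algorithms differ, the same pair of strings can be scored with several methods. This is only a sketch: the choice of methods and of the example strings is arbitrary, and it calls the stringdist package directly (fuzzyjoin depends on it).

```r
library(stringdist)

# A few of the methods accepted by stringdist(); see the manual for the full list
methods <- c("osa", "lv", "dl", "jw", "soundex")

# One distance per method for the same pair of strings
dists <- sapply(methods, function(m) stringdist("Premium", "Premioom", method = m))
dists
```

The edit-based methods (osa, lv, dl) count whole edits, while jw returns a value between 0 and 1, which is why max_dist has to be chosen per method.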

See also agrep() in base R.
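
A short sketch of agrep, which needs no extra packages. Note that it searches for approximate matches to the pattern within each string, so it behaves more like a fuzzy grep than a whole-string distance. The candidate vector is made up for illustration.

```r
# Hypothetical candidate strings
candidates <- c("Idea", "Premiums", "Premioom", "Faiir")

# Return the elements that contain "Premium" within at most 2 edits
hits <- agrep("Premium", candidates, max.distance = 2, value = TRUE)
hits
```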

Fuzzy match (RecordLinkage package)

library(RecordLinkage)

jarowinkler("William Clinton", "Willam Clntn")
# 0.96

soundex("Clenton") == soundex("Clinton")
# TRUE

levenshteinDist("Clinton", "Clenton")
# 1

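From there one should write a loop that iterates through the second table and finds the best match for each record: the highest value for a similarity such as jarowinkler, or the lowest value for a distance such as levenshteinDist.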
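
A minimal sketch of such a loop, assuming two hypothetical character vectors (`dirty` and `reference`) in place of real tables, and using jarowinkler as the similarity. The helper `best_match` is made up for this example and is not part of RecordLinkage.

```r
library(RecordLinkage)

# Hypothetical data: names to clean and a reference list to match against
dirty     <- c("Willam Clntn", "Georg Bsh")
reference <- c("William Clinton", "George Bush", "Barack Obama")

# For one string, score every candidate and keep the most similar one
best_match <- function(x, candidates) {
  sims <- vapply(candidates, function(cand) jarowinkler(x, cand), numeric(1))
  i <- which.max(sims)
  data.frame(name = x, match = candidates[i], similarity = unname(sims[i]))
}

# Apply the helper to every record and stack the results
result <- do.call(rbind, lapply(dirty, best_match, candidates = reference))
result
```

For large tables this compares every pair, so it gets slow quickly; a similarity threshold on the result is also worth adding, since which.max always returns some match even when none is good.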

Bib:
http://stackoverflow.com/questions/16837461/dealing-with-wrong-spelling-when-matching-text-strings-in-r
http://www.princeton.edu/~otorres/FuzzyMergeR101.pdf
https://www.r-bloggers.com/fuzzy-string-matching-a-survival-skill-to-tackle-unstructured-information/
