The phonics package for R is designed to provide a variety of phonetic
indexing algorithms in common and not-so-common use today. The algorithms
generally reduce a string to a symbolic representation approximating the
sound made by pronouncing the string. They can be used to match names and
words, and as a proxy for assorted string distance algorithms.
All algorithms, except the Match Rating Approach, accept a single
character string or a character vector as input. Each element is
converted to its phonetic code using the relevant algorithm. For example,
consider the Soundex and Refined Soundex algorithms. Soundex is
implemented as the soundex function and Refined Soundex as the
refinedSoundex function, and we can observe them in the following
examples.
library("phonics")
x1 <- "Catherine"
x2 <- "Kathryn"
x3 <- "Katrina"
x4 <- "William"
x <- c(x1, x2, x3, x4)
soundex(x1)
## [1] "C365"
soundex(x2)
## [1] "K365"
soundex(x)
## [1] "C365" "K365" "K365" "W450"
refinedSoundex(x1)
## [1] "C30609080"
refinedSoundex(x2)
## [1] "K3060908"
Both functions accept a maxCodeLen argument that limits the length of
the returned code. Except where noted, every algorithm in the package
supports maxCodeLen, which sets the maximum or expected length of the
returned code, as appropriate.
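As a minimal sketch of the maxCodeLen argument described above, the call below asks for shorter Refined Soundex codes. The limit of 5 is an arbitrary value chosen for illustration, and the resulting codes are not shown because they were not part of the original examples.

```r
library("phonics")

x <- c("Catherine", "Kathryn", "Katrina", "William")

# Request Refined Soundex codes of at most 5 characters
# (5 is an arbitrary, illustrative limit).
short_codes <- refinedSoundex(x, maxCodeLen = 5)
short_codes

# Check that every returned code respects the requested limit.
all(nchar(short_codes) <= 5)
```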
Beyond Soundex and Refined Soundex, additional algorithms are available, as shown in the following table.
Algorithm | Function Name |
---|---|
Caverphone | caverphone |
Cologne Phonetic | cologne |
Lein Name Coding | lein |
Metaphone | metaphone |
New York State Identification and Intelligence System | nysiis |
Oxford Name Compression Algorithm | onca |
Phonex | phonex |
Roger Root Name Coding Procedure | rogerroot |
Statistics Canada Name Coding | statcan |
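Because the functions in the table share the same calling pattern, several of them can be applied to one input and compared side by side. The following is a sketch only: the choice of three encoders and the name `encoders` are illustrative, and each function is called with its default settings.

```r
library("phonics")

x <- c("Catherine", "Kathryn", "Katrina", "William")

# An illustrative selection of encoders from the table above.
encoders <- list(
  metaphone  = metaphone,
  nysiis     = nysiis,
  caverphone = caverphone
)

# Apply each encoder to the whole vector; the result is a character
# matrix with one row per name and one column per algorithm.
codes <- vapply(encoders, function(encode) encode(x), character(length(x)))
rownames(codes) <- x
codes
```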
Unlike the other algorithms described here, the Match Rating Approach (MRA) is a two-stage algorithm with separate encoding and comparison routines. With a single-stage algorithm such as Soundex, by contrast, the codes of two different strings can be compared directly to test for equality:
soundex(x1) == soundex(x2)
## [1] FALSE
soundex(x2) == soundex(x3)
## [1] TRUE
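Direct equality of codes also makes Soundex usable as an index key. As a small sketch, base R's split can group the example names by their Soundex code; the variable name `blocks` is illustrative.

```r
library("phonics")

x <- c("Catherine", "Kathryn", "Katrina", "William")

# Group the names by Soundex code: names that share a code
# fall into the same list element.
blocks <- split(x, soundex(x))
blocks

# Names filed under the code "K365".
blocks[["K365"]]
```

With the Soundex codes shown earlier, Kathryn and Katrina share the key "K365", while Catherine and William each form their own group.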
However, the MRA encoding algorithm may return different encodings
for similar strings that should match. The second stage therefore
compares two MRA-encoded strings. The encoding algorithm is provided
by mra_encode and the comparison algorithm is provided by mra_compare.
## [1] "KTHRN"
## [1] "CTHRN"
## [1] "KTRN"
## [1] TRUE
## [1] TRUE
## [1] TRUE
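The two stages can be chained in a small wrapper. This is a sketch under assumptions: `mra_match` is a hypothetical helper, not a function in the package, and it assumes mra_compare takes the two encoded strings as its first two arguments.

```r
library("phonics")

# Hypothetical helper: encode both strings with MRA, then compare
# the two codes with the second-stage comparison routine.
mra_match <- function(a, b) {
  mra_compare(mra_encode(a), mra_encode(b))
}

result <- mra_match("Catherine", "Kathryn")
result
```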
The threshold necessary to establish similarity gets smaller as the encoded strings get longer. This leads to some interesting results. For instance, Catherine and William match as names.
## [1] TRUE
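The shrinking threshold can be illustrated with the minimum-rating table from the commonly published description of the Match Rating Approach. The function below is a sketch in base R: `mra_min_rating` is a hypothetical name, and the table is the standard one, which is assumed here and may differ from the package's internal implementation.

```r
# Minimum similarity rating required for a match, by combined length
# of the two MRA codes (standard published MRA table; assumed here,
# not taken from the package source).
mra_min_rating <- function(code1, code2) {
  total <- nchar(code1) + nchar(code2)
  if (total <= 4) {
    5L
  } else if (total <= 7) {
    4L
  } else if (total <= 11) {
    3L
  } else {
    2L
  }
}

mra_min_rating("KTHRN", "CTHRN")  # combined length 10, so 3
mra_min_rating("KT", "CT")        # combined length 4, so 5
```

Longer codes therefore need a lower rating to count as a match, which is why loosely related names can still be reported as similar.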
This paper has outlined the phonics package for R. The package includes
several algorithms, suitable for English, German, and French, for
phonetically reducing names and strings. These can be used for
comparison and indexing, as well as later record linkage.