A former colleague got in touch recently with a question about how one might perform the following grouping operation…

Given a set of medical terms that may include typos and variations (as is typically the case for real-world medical data), group “similar” items together in the {dplyr} sense of “group” (the data is presumably a table of values that need to be grouped into medical categories).

The provided example was the following: “gastrointestinal disorders”, “gastrointestinal tract disorders”, “gastreinstestinal disorder” (sic).

I wasn’t aware of a straightforward way to do it (by all means, please let me know if there is one!) so I figured it was a nice challenge.

In order to deal with the typos, I figured I needed to “spellcheck” the entries. A regular spellchecker will complain to no end about medical terms which typically aren’t in its “standard” dictionary, so I needed a “ground-truth” wordlist to work with.

To be properly robust, a cross-check of medical terms should probably use an ontology (there are various options available with associated R packages), but I found this dataset, which works nicely enough. Reading it into the session is easy since readLines() accepts the URL of a file. I lowercased it so that I don’t have to deal with differences in casing.

terms <- tolower(readLines("https://raw.githubusercontent.com/socd06/medical-nlp/master/data/vocab.txt"))

To make a larger “data” example I added two other medical groups with similar typos

gi <- c("gastrointestinal disorders", "gastrointestinal tract disorders", "gastreinstestinal disorder")
hep <- c("hepatic encephalopathy", "hepatic encephalapathy", "hepatic encefalopathy")
co <- c("myocarditis", "myocardits", "myocardites")

To spellcheck a word, I first check whether it appears in the wordlist verbatim, in which case it doesn’t need correcting. Otherwise I take the wordlist entry with the smallest Levenshtein (edit) distance to the word, as the most likely “correct” spelling, computed via the built-in adist()

match_word <- function(word, wordlist) {
  word <- tolower(word)
  if (word %in% wordlist) return(word)
  wordlist[which.min(adist(word, wordlist)[1, ])]
}
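
To make the distance step concrete, here is a small sketch of what adist() returns for one misspelled word. The `candidates` vector is a made-up stand-in for the real wordlist, chosen only for illustration.

```r
# Hypothetical three-word wordlist (not the real vocab file)
candidates <- c("myocarditis", "myocardial", "pericarditis")

# adist() returns a matrix with one row per input word and
# one column per candidate
d <- adist("myocardits", candidates)

# "myocardits" is one insertion away from "myocarditis"
d[1, 1]
#> [1] 1

# The candidate with the smallest edit distance is the suggested correction
candidates[which.min(d[1, ])]
#> [1] "myocarditis"
```

Note that which.min() returns the first minimum, so if two wordlist entries are equally close, whichever comes first in the wordlist wins.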

This checks individual words, not entire phrases, so I apply this over each word in a given phrase

spellcheck_phrase <- function(phrase, wordlist) {
  # Split each phrase on spaces, correct every word, then rejoin
  fix_words <- \(words) paste(sapply(words, match_word, wordlist), collapse = " ")
  sapply(strsplit(phrase, " "), fix_words, USE.NAMES = FALSE)
}

As a test, the corrected spellings of the GI terms

spellcheck_phrase(gi, terms)
#> [1] "gastrointestinal disorders"       "gastrointestinal tract disorders"
#> [3] "gastrointestinal disorder"

These still aren’t “groupable” yet; they’re spelled correctly but they’re not all the same.

In order to properly reproduce the ‘grouping’ I created an example dataset with some “values”

meddata <- data.frame(term = c(gi, hep, co), value = LETTERS[1:9])

Then shuffled them

meddata <- meddata[match(meddata$value, strsplit("FIABDEHCG", "")[[1]]), ]
meddata
#>                               term value
#> 3       gastreinstestinal disorder     C
#> 4           hepatic encephalopathy     D
#> 8                       myocardits     H
#> 5           hepatic encephalapathy     E
#> 6            hepatic encefalopathy     F
#> 1       gastrointestinal disorders     A
#> 9                      myocardites     I
#> 7                      myocarditis     G
#> 2 gastrointestinal tract disorders     B

Next, I added the “corrected” spellings to this

meddata$corrected <- sapply(meddata$term, \(x) spellcheck_phrase(x, terms), USE.NAMES = FALSE)
meddata
#>                               term value                        corrected
#> 3       gastreinstestinal disorder     C        gastrointestinal disorder
#> 4           hepatic encephalopathy     D           hepatic encephalopathy
#> 8                       myocardits     H                      myocarditis
#> 5           hepatic encephalapathy     E           hepatic encephalopathy
#> 6            hepatic encefalopathy     F           hepatic encephalopathy
#> 1       gastrointestinal disorders     A       gastrointestinal disorders
#> 9                      myocardites     I                      myocarditis
#> 7                      myocarditis     G                      myocarditis
#> 2 gastrointestinal tract disorders     B gastrointestinal tract disorders
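
With a larger table the same words will show up again and again, and each lookup compares a word against the whole wordlist. One possible refinement, sketched here under assumptions, is to correct each distinct word once and reuse the result. `tiny_wordlist`, `closest()` and `lookup` are hypothetical names for this illustration, and the three-word wordlist stands in for the real one.

```r
# Hypothetical stand-in for the real wordlist
tiny_wordlist <- c("hepatic", "encephalopathy", "myocarditis")
phrases <- c("hepatic encephalapathy", "hepatic encefalopathy", "myocardits")

# Same logic as match_word(): exact hit first, else nearest by edit distance
closest <- function(word) {
  if (word %in% tiny_wordlist) return(word)
  tiny_wordlist[which.min(adist(word, tiny_wordlist)[1, ])]
}

# Correct every distinct word only once; the result is named by the
# original (possibly misspelled) word
split_phrases <- strsplit(tolower(phrases), " ", fixed = TRUE)
lookup <- vapply(unique(unlist(split_phrases)), closest, character(1))

# Rebuild each phrase from the lookup table
corrected <- vapply(
  split_phrases,
  \(words) paste(lookup[words], collapse = " "),
  character(1)
)
corrected
```

I haven’t benchmarked this, so treat it as an idea rather than a measured improvement.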

Now for the actual grouping: I used the {zoomerjoin} package to calculate the Jaccard similarity and perform the grouping

meddata$group <- zoomerjoin::jaccard_string_group(meddata$corrected, threshold = 0.1)
#> Loading required namespace: igraph
meddata
#>                               term value                        corrected
#> 3       gastreinstestinal disorder     C        gastrointestinal disorder
#> 4           hepatic encephalopathy     D           hepatic encephalopathy
#> 8                       myocardits     H                      myocarditis
#> 5           hepatic encephalapathy     E           hepatic encephalopathy
#> 6            hepatic encefalopathy     F           hepatic encephalopathy
#> 1       gastrointestinal disorders     A       gastrointestinal disorders
#> 9                      myocardites     I                      myocarditis
#> 7                      myocarditis     G                      myocarditis
#> 2 gastrointestinal tract disorders     B gastrointestinal tract disorders
#>                       group
#> 3 gastrointestinal disorder
#> 4    hepatic encephalopathy
#> 8               myocarditis
#> 5    hepatic encephalopathy
#> 6    hepatic encephalopathy
#> 1 gastrointestinal disorder
#> 9               myocarditis
#> 7               myocarditis
#> 2 gastrointestinal disorder

This sets a “canonical” value for the name of each group, and the threshold may need adjusting with more “real” data, but it appears to work!
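
To get a feel for what the threshold is compared against, here is a rough base R sketch of Jaccard similarity on character bigrams. It is an illustration only: `bigrams()` and `jaccard()` are hypothetical helpers, and I’m assuming two-character shingles. As I understand it, {zoomerjoin} estimates the similarity via hashing rather than computing it exactly, so its numbers need not match these.

```r
# All distinct two-character substrings of a string (assumes nchar(x) >= 2)
bigrams <- function(x) {
  n <- nchar(x)
  unique(substring(x, 1:(n - 1), 2:n))
}

# Jaccard similarity: size of the intersection over size of the union
jaccard <- function(a, b) {
  A <- bigrams(a)
  B <- bigrams(b)
  length(intersect(A, B)) / length(union(A, B))
}

# Singular vs plural share almost all bigrams, so this is close to 1
jaccard("gastrointestinal disorder", "gastrointestinal disorders")

# Unrelated terms share very few bigrams, so this is close to 0
jaccard("myocarditis", "hepatic encephalopathy")
```

Pairs like the first should land in the same group, and pairs like the second should not, which is the separation the threshold needs to preserve.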

Performing the actual grouping shows that the “values” of each group have been recovered

library(dplyr)

meddata |> 
  group_by(group) |> 
  summarise(res = toString(sort(value)))
#> # A tibble: 3 × 2
#>   group                     res    
#>   <chr>                     <chr>  
#> 1 gastrointestinal disorder A, B, C
#> 2 hepatic encephalopathy    D, E, F
#> 3 myocarditis               G, H, I

I’m curious to know if anyone has a better or different approach. Please let me know either here, on Mastodon, or any way you can find me. A gist of this code and markup is available here.