There is a little-known (and hence heavily under-utilized) function in Oracle 11g and up. This is the Jaro-Winkler algorithm (and the companion algorithm named Edit Distance). The Jaro-Winkler algorithm tells you what level of permutation would be necessary in order to transform ‘String A’ into ‘String B’.
You can find the official Oracle documentation here. I implemented it using the BUILT_IN Oracle function UTL_MATCH, which is used with SQL code similar to:
SELECT UTL_MATCH.JARO_WINKLER_SIMILARITY(‘shackleford’, ‘shackelford’) FROM DUAL;
A vitally important feature of the Text Similarity function is that it allows you to measure difference with both normalized (0-1) and scalar (0-100) measures. By close examination, you can see the levels of difference involved with different string permutations. I used it to match diagnoses from Medicare CMS data to our internal data, but the function is versatile and not confined to any specific application (any text will work).
Note: strings starting with ‘0’ cannot compare to strings not starting with ‘0’ within the Jaro-Winkler function, but can compare with Edit Distance. This was an intuitive find that I spotted, but that isn’t defined in the literature anywhere.
Example: DX: ‘0100’ compared to DX: ‘100’ will return about a 95 with Edit Distance, but a 0 with Jaro-Winkler.
Like this:
Like Loading...
Related
Jaro-Winkler in ORACLE and textual fuzzy matching
There is a little-known (and hence heavily under-utilized) function in Oracle 11g and up. This is the Jaro-Winkler algorithm (and the companion algorithm named Edit Distance). The Jaro-Winkler algorithm tells you what level of permutation would be necessary in order to transform ‘String A’ into ‘String B’.
You can find the official Oracle documentation here. I implemented it using the BUILT_IN Oracle function UTL_MATCH, which is used with SQL code similar to:
SELECT UTL_MATCH.JARO_WINKLER_SIMILARITY(‘shackleford’, ‘shackelford’) FROM DUAL;
A vitally important feature of the Text Similarity function is that it allows you to measure difference with both normalized (0-1) and scalar (0-100) measures. By close examination, you can see the levels of difference involved with different string permutations. I used it to match diagnoses from Medicare CMS data to our internal data, but the function is versatile and not confined to any specific application (any text will work).
Note: strings starting with ‘0’ cannot compare to strings not starting with ‘0’ within the Jaro-Winkler function, but can compare with Edit Distance. This was an intuitive find that I spotted, but that isn’t defined in the literature anywhere.
Example: DX: ‘0100’ compared to DX: ‘100’ will return about a 95 with Edit Distance, but a 0 with Jaro-Winkler.
Like this:
Related