Jaro-Winkler in ORACLE and textual fuzzy matching

There is a little-known (and hence heavily under-utilized) function in Oracle 11g and up. This is the Jaro-Winkler algorithm (and the companion algorithm named Edit Distance). The Jaro-Winkler algorithm tells you what level of permutation would be necessary in order to transform ‘String A’ into ‘String B’.

You can find the official Oracle documentation here. I implemented it using the BUILT_IN Oracle function UTL_MATCH, which is used with SQL code similar to:

SELECT UTL_MATCH.JARO_WINKLER_SIMILARITY(‘shackleford’, ‘shackelford’) FROM DUAL;

A vitally important feature of the Text Similarity function is that it allows you to measure difference with both normalized (0-1) and scalar (0-100) measures. By close examination, you can see the levels of difference involved with different string permutations. I used it to match diagnoses from Medicare CMS data to our internal data, but the function is versatile and not confined to any specific application (any text will work).

Note: strings starting with ‘0’ cannot compare to strings not starting with ‘0’ within the Jaro-Winkler function, but can compare with Edit Distance. This was an intuitive find that I spotted, but that isn’t defined in the literature anywhere.

Example: DX: ‘0100’ compared to  DX: ‘100’ will return about a 95 with Edit Distance, but a 0 with Jaro-Winkler.

Advertisements
This entry was posted in Data Science, Information Technology, The Cloud, Wordplay and Commentary. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s