LevenshteinDist - script and chart function
LevenshteinDist() returns the Levenshtein distance between two strings. It is defined as the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. The function is useful for fuzzy string comparisons.
Syntax:
LevenshteinDist(text1, text2)
Return data type: integer
Example | Result |
---|---|
LevenshteinDist('Kitten','Sitting') | Returns '3' |
Overview
Open the Data load editor and add the load script below to a new tab.
The load script contains:
-
-
One field in the data table called InputText.
Load script
Example:
Load * inline [
InputText
Sliver
SSiver
SSiveer
];
Results
Load the data and open a sheet. Create a new table and add this field as a dimension:
-
InputText
-
=LevenshteinDist('Silver', InputText), to calculate the minimum number of single character edits required to change the string values for InputText to the word 'Silver'.
InputText | LevenshteinDist(‘Silver’, InputText) |
---|---|
Sliver | 2 |
SSiveer | 3 |
SSiver | 2 |
The output of the LevenshteinDist function returns the number of changes required to change the InputText to the expected text, 'Silver'. For example, the first row requires two changes to modify the word 'Sliver' to 'Silver'. The second row requires 3 changes: 1) Delete the extra character 'S'. 2) Delete the extra character 'e'. 3) Insert a new character 'l'.
Overview
This example consolidates product names from different systems. The product names do not always use the same spelling due to typos, abbreviations, spacing, or other variations. Using the LevenshteinDist function, you can measure the similarity between two product names and identify which ones likely refer to the same product, even if the names are not identical.
Open the Data load editor and add the load script below to a new tab.
The load script contains:
-
-
-
ProductA
-
ProductB
-
Load script
Example:
Load * inline [
ProductA, ProductB
Coca Cola 330ml, CocaCola 330 ml
Pepsi 500 ml, Pepsi 500ml
Sprite Zero 600 ml, SpriteZero600ml
Red Bull 250ml, Redbull 250ml
];
Results
Load the data and open a sheet. Create a new table and add these fields as dimensions:
-
ProductA
-
ProductB
-
=LevenshteinDist(ProductA, ProductB), to calculate the minimum number of single character edits required to change the string values for ProductB to match ProductA.
ProductA | ProductB | LevenshteinDist(ProductA, ProductB) |
---|---|---|
Coca Cola 330ml | CocaCola 330 ml | 2 |
Pepsi 500 ml | Pepsi 500ml | 1 |
Red Bull 250ml | Redbull 250ml | 2 |
Sprite Zero 600 ml | SpriteZero600ml | 3 |
The Levenshtein distance is a type of fuzzy matching that is widely used as part of spell checkers, optical character recognition, and correction systems in areas such as customer data management, inventory systems, and document processing, where slight variations in text occur frequently.
Load script
T1:
Load *, recno() as ID;
Load 'Silver' as String_1,* inline [
String_2
Sliver
SSiver
SSiveer ];
T1:
Load *, recno()+3 as ID;
Load 'Gold' as String_1,* inline [
String_2
Bold
Bool
Bond ];
T1:
Load *, recno()+6 as ID;
Load 'Ove' as String_1,* inline [
String_2
Ove
Uve
Üve ];
T1:
Load *, recno()+9 as ID;
Load 'ABC' as String_1,* inline [
String_2
DEFG
abc
ビビビ ];
set nullinterpret = '<NULL>';
T1:
Load *, recno()+12 as ID;
Load 'X' as String_1,* inline [
String_2
''
<NULL>
1 ];
R1:
Load
ID,
String_1,
String_2,
LevenshteinDist(String_1, String_2) as LevenshteinDistance
resident T1;
Drop table T1;
Result
ID | String_1 | String_2 | LevenshteinDistance |
---|---|---|---|
1 | Silver | Sliver | 2 |
2 | Silver | SSiver | 2 |
3 | Silver | SSiveer | 3 |
4 | Gold | Bold | 1 |
5 | Gold | Bool | 3 |
6 | Gold | Bond | 2 |
7 | Ove | Ove | 0 |
8 | Ove | Uve | 1 |
9 | Ove | Üve | 1 |
10 | ABC | DEFG | 4 |
11 | ABC | abc | 3 |
12 | ABC | ビビビ | 3 |
13 | X | 1 | |
14 | X | - | 1 |
15 | X | 1 | 1 |