Group Similar Short Text and Assign ID Using DBSCAN in R

Amir Harjo
4 min readAug 7, 2022
Picture by Sebastian Herrmann from Unsplash

“We have more than 20 years of transactional data” I hear him said it. “It has millions of transaction record”

“Well, that’s a great news.” I was elated. It was my first few weeks on the task to develop data science project to combine all sort of data the customers digital behaviour and its transaction data.

“We will look into it and will update you once we finalised our data audit”.

Few days later, we come back with our finding. “We have check the data. However, it seems missing one fundamental column about the customers. It didn’t have customer ID, it only has customers name and the product they purchase”.

“Can you check if they have nationality ID?” they asked

“Nothing in the column that we can use to identify if a customers is the same person. No email, phone number or identity card,” we explain, “if we couldn’t identify a customers is the same person, we wouldn’t be able to model their behaviour such as frequency of purchase, when they will buy next and other transactional behaviour.”

“We understand that when people put name, there could be some typo or different spelling. I think what we could develop now is to measure the similarity between name and assign ID to each name that have certain degree of similarity”.

“Well. Let’s start then”.

Measure and Normalise Text Similarity

As data scientist, we will encounter all sort of data related problem. On above example, we get very dirty data where the business owner have legacy data without the customer ID. The problem is, name can be misspelled.

Luckily, there are algorithms to measure the similarity between two text. What we will use over here is the Levenshtein distance.

Levenshtein distance measure the minimum number of single-character edit required to change one word into the other. More on Levenshtein distance can be found via this link.

In R, we can use the package stringr and function adist. See the sample code below:

We see above the result if we want to compare text “South Korea” and “Germany”. We can conclude two observations.

First. The algorithm is case sensitive as proved in the distance between “south korea” and “South Korea”

Second. The result is depend on the number of the text. Since “south korea” and “germany” is very different, the distance is 11. The more the number of text, the higher the distance. There is no a way to normalise the result for example between 0–100 to make ease the users in making decision (obviously, it will be easier for us to say that the similarity is 80% rather than the similarity is 2).

We will add two treatment in the code to alleviate those two constrains. First by change the text to lower case and delete unnecessary character. Second by changing the distance into number between 0–100. We measure it by using this formula: (1 — distance/max possible distance) * 100

See the example below:

We now able to measure the distance between text and normalise the score. The fact is, when we received very bad data such the problem in the first story, the data will come in hundreds, thousands and even hundred thousands row.

What we need to do first is how to make the solution scale and ready to use. We need to transform the data into a table. The sequence of scaling the solution are such follows:

  1. Expand the text into table using expand.grid
  2. Lower case and delete unnecessary character
  3. Measure the distance and normalise the score

The code below is doing just that!

R function to measure and normalise distance

Result using the same input for above function

Result from R

Assign ID using DBSCAN

The final stage is assigning the ID for similar text. On the example above, we want “South Korea” and “south korea” have same ID, while “germany” will have its own ID. To do that, we will use the clustering algorithm called DBSCAN. More about DBSCAN can be found here.

Overall code can be found in the function below. The code take some default parameter. For example, the minimum score to be considered as a similar text. In this case, the minimum score is 80. Other default parameter is the distance 0.15 that will be used for DBSCAN algorithm. We also have an option to consider whether to include “space” character and an option if we want to ignore the case (lower or upper case) in the text similarity calculation.

The good news is, we can change those parameter value when we call the function.

R function to measure distance and assign D

Let see the result when we use the same input of “South Korea”,”south korea” and “germany”.

Result from R Studio

Voila. It works.

Conclusion

In this article, I have shown you how to measure the distance between text, normalise the score and then assign ID. This solution will be handy if we have dirty text data without ID, that come in numbers of rows and we want to decide which text belongs to the same group.

Luckily, I have created a package just for this. Please check my GitHub Package similaRText

Source

https://www.cuelogic.com/blog/the-levenshtein-algorithm

--

--