Comparing similarities in PHP can be accomplished in different ways. However, there is a question of efficiency about the result. PHP provides some functions such as similar_text, Levenshtein, and a few others. They operate using different algorithms.
This feature can also be used to implement “Did you mean?” features like Google. For example, when you searched for a string "thingsome"
instead of "something"
, Google suggests you with the possible correction. In the same way, we can implement this feature using PHP and different algorithms.
In this article we will implement three methods to check similarities between strings. The efficiency of the methods increases as we move to lower techniques. First of all, we will start with the PHP function named similar_text
.
Using similar_text()
This function calculates the similarity between two strings and return the number of matching characters in both strings. This function implements the algorithms by Oliver. It accepts three arguments where first two are required and third is optional. This function is case-sensitive.
Basic syntax:
similar_text( $string1, $string2, $percent)
Example:
<?php
$str1 = 'Hello World';
$str2 = 'Hello Jack';
echo similar_text($str1, $str2);
//output: 6
However, the complexity of this function is O(max(n,m)**3)
where where n
and m
are the lengths of the strings. We can also pass a third parameter as $percent
which will return the similarities in percentage.
<?php
$str1 = 'Hello World';
$str2 = 'Hello Jack';
similar_text($str2, $str1, $percent);
echo $percent;
//output: 57.142857142857
The limitation of this function is that it has perfomance issues for long strings. In addition, it sometimes returns different results when the order of string is changed. Even a bug was reported about this.
Using levenshtein function
We can compare the similarities between two strings using the Levenshtein function in PHP. It is the same as similar_text()
function but the complexity of this function is better than similar_text()
. This function uses O(m*n)
complexity which is more efficient than the previous one when there are thousands of strings to be checked.
The levenshtein()
function returns the Levenshtein distance between two strings. The Levenshtein distance is the number of characters you have to replace, insert or delete to transform string1 into string2. This function is not case sensitive as similar_text() function.
Basic Syntax:
levenshtein(string1, string2, insert, replace, delete)
Example:
<?php
$string1="Hello how are you doing" ;
$string2= " hi, how are you";
$distance = levenshtein($string2, $string1);
echo $distance;
//output: 11
Note: The levenshtein()
function is faster than the similar_text()
function. However, similar_text()
will give you a more accurate result with fewer modifications needed.
Using Smith–Waterman Algorithm
The algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981. Like the Needleman-Wunsch algorithm, of which it is a variation, Smith-Waterman is a dynamic programming algorithm. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme).
This algorithm return more accurate for larger strings compared to above two functions.