
Hack book of Hoa\Ustring

Strings can sometimes be complex, especially when they use the Unicode encoding format. The Hoa\Ustring library provides several operations on UTF-8 strings.

Table of contents

  1. Introduction
  2. Unicode strings
    1. String manipulation
    2. Comparison and search
    3. Characters
    4. Code-point
  3. Search algorithms
  4. Conclusion

Introduction

When we manipulate strings, the Unicode format establishes itself as the natural choice because of its compatibility with historical formats (like ASCII) and its capacity to represent a large range of characters and symbols for all cultures and all regions in the world. PHP provides several tools to manipulate such strings, like the following extensions: mbstring, iconv and the excellent intl, which is based on ICU, the reference implementation of Unicode. Unfortunately, we sometimes have to mix these extensions to achieve our aims, at the cost of a certain complexity and a regrettable verbosity.

The Hoa\Ustring library addresses these issues by providing a simple way to manipulate strings with performance and efficiency in mind. It also provides some advanced algorithms to perform search operations on strings.

Unicode strings

The Hoa\Ustring\Ustring class represents a UTF-8 Unicode string and allows us to manipulate it easily. This class implements the ArrayAccess, Countable and IteratorAggregate interfaces. We are going to use three examples in three different languages: French, Arabic and Japanese. Thus:

$french   = new Hoa\Ustring\Ustring('Je t\'aime');
$arabic   = new Hoa\Ustring\Ustring('أحبك');
$japanese = new Hoa\Ustring\Ustring('私はあなたを愛して');

Now, let's see what we can do on these three strings.

String manipulation

Let's start with elementary operations. If we would like to count the number of characters (not bytes), we will use the count function. Thus:

var_dump(
    count($french),
    count($arabic),
    count($japanese)
);

/**
 * Will output:
 *     int(9)
 *     int(4)
 *     int(9)
 */

When we speak about text position, it is not suitable to speak about the right or the left, but rather about a beginning or an end, based on the direction of writing. We can know this direction thanks to the Hoa\Ustring\Ustring::getDirection method. It returns the value of one of the following constants: Hoa\Ustring\Ustring::LTR, for left-to-right, or Hoa\Ustring\Ustring::RTL, for right-to-left.

Let's observe the result with our examples:

var_dump(
    $french->getDirection()   === Hoa\Ustring\Ustring::LTR, // is left-to-right?
    $arabic->getDirection()   === Hoa\Ustring\Ustring::RTL, // is right-to-left?
    $japanese->getDirection() === Hoa\Ustring\Ustring::LTR  // is left-to-right?
);

/**
 * Will output:
 *     bool(true)
 *     bool(true)
 *     bool(true)
 */

The result of this method is computed thanks to the Hoa\Ustring\Ustring::getCharDirection static method which computes the direction of only one character.
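
To give an idea of what deriving a direction from a single character involves, here is a minimal sketch in plain PHP. It is not the library's implementation: guessDirection is a hypothetical helper, and it assumes that testing the first letter against the Arabic and Hebrew scripts is enough, which ignores the other right-to-left scripts.

```php
<?php

// Hypothetical helper, not part of Hoa\Ustring: guesses the writing
// direction from the first letter found in the string.
function guessDirection(string $string): string
{
    // \p{L} matches any letter; the u modifier makes PCRE read UTF-8.
    if (1 !== preg_match('#\p{L}#u', $string, $match)) {
        return 'ltr'; // no letter at all: fall back to left-to-right
    }

    // Assumption: only the Arabic and Hebrew scripts are treated as RTL.
    return 1 === preg_match('#^[\p{Arabic}\p{Hebrew}]$#u', $match[0])
        ? 'rtl'
        : 'ltr';
}

var_dump(
    guessDirection('Je t\'aime'),
    guessDirection('أحبك'),
    guessDirection('私はあなたを愛して')
);

/**
 * Will output:
 *     string(3) "ltr"
 *     string(3) "rtl"
 *     string(3) "ltr"
 */
```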

If we would like to concatenate another string to the end or to the beginning, we will respectively use the Hoa\Ustring\Ustring::append and Hoa\Ustring\Ustring::prepend methods. These methods, like most of the ones which modify the string, return the object itself in order to chain the calls. For instance:

echo $french->append('… et toi, m\'aimes-tu ?')->prepend('Mam\'zelle ! ');

/**
 * Will output:
 *     Mam'zelle ! Je t'aime… et toi, m'aimes-tu ?
 */

We also have the Hoa\Ustring\Ustring::toLowerCase and Hoa\Ustring\Ustring::toUpperCase methods to, respectively, set the case of the string to lower or upper. For instance:

echo $french->toUpperCase();

/**
 * Will output:
 *     MAM'ZELLE ! JE T'AIME… ET TOI, M'AIMES-TU ?
 */

We can also add characters to the beginning or to the end of the string to reach a minimum length. This operation is frequently called padding (for historical reasons dating back to typewriters). That's why we have the Hoa\Ustring\Ustring::pad method which takes three arguments: the minimum length, the characters to add and a constant indicating whether we have to add them at the end or at the beginning of the string (respectively Hoa\Ustring\Ustring::END, by default, and Hoa\Ustring\Ustring::BEGINNING).

echo $arabic->pad(20, ' ');

/**
 * Will output:
 *                     أحبك
 */

A similar operation removes, by default, spaces at the beginning and at the end of the string thanks to the Hoa\Ustring\Ustring::trim method. For example, to retrieve our original Arabic string:

echo $arabic->trim();

/**
 * Will output:
 *     أحبك
 */

If we would like to remove other characters, we can use its first argument, which must be a regular expression. Finally, its second argument specifies the side from which to remove characters: the beginning, the end or both, still by using the Hoa\Ustring\Ustring::BEGINNING and Hoa\Ustring\Ustring::END constants. We can combine these constants to express “both sides”, which is the default value: Hoa\Ustring\Ustring::BEGINNING | Hoa\Ustring\Ustring::END. For example, to remove all the numbers and the spaces only at the end, we will write:

$arabic->trim('\s|\d', Hoa\Ustring\Ustring::END);
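
To make the role of the two arguments more concrete, here is a minimal sketch of a regular-expression-based trim in plain PHP. It only illustrates the idea and is not the library's code: trimWith, TRIM_BEGINNING and TRIM_END are hypothetical names, and the sketch assumes that the regular expression describes a single character to remove and does not contain the # delimiter.

```php
<?php

// Hypothetical flags, named after the constants described above.
const TRIM_BEGINNING = 1;
const TRIM_END       = 2;

// Removes every leading and/or trailing character matched by $regex.
function trimWith(
    string $string,
    string $regex = '\s',
    int $side = TRIM_BEGINNING | TRIM_END
): string {
    if (0 !== ($side & TRIM_BEGINNING)) {
        // ^ anchors the repetition at the very beginning of the string.
        $string = preg_replace('#^(?:' . $regex . ')+#u', '', $string) ?? $string;
    }

    if (0 !== ($side & TRIM_END)) {
        // \z anchors the repetition at the very end of the string.
        $string = preg_replace('#(?:' . $regex . ')+\z#u', '', $string) ?? $string;
    }

    return $string;
}

// Only the end is trimmed: the leading spaces and digits are kept.
echo '[', trimWith('  42 أحبك 123  ', '\s|\d', TRIM_END), ']', "\n";

/**
 * Will output:
 *     [  42 أحبك]
 */
```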

We can also reduce the string to a sub-string by specifying the position of the first character followed by the length of the sub-string to the Hoa\Ustring\Ustring::reduce method:

echo $french->reduce(3, 6)->reduce(2, 4);

/**
 * Will output:
 *     aime
 */

If we would like to get a specific character, we can rely on the ArrayAccess interface. For instance, to get the first character of each of our examples (from their original definitions):

var_dump(
    $french[0],
    $arabic[0],
    $japanese[0]
);

/**
 * Will output:
 *     string(1) "J"
 *     string(2) "أ"
 *     string(3) "私"
 */

If we would like the last character, we will use the -1 index. The index is not bounded by the length of the string: if the index exceeds this length, then a modulo will be applied.

We can also modify or remove a specific character with this method. For example:

$french->append(' ?');
$french[-1] = '!';
echo $french;

/**
 * Will output:
 *     Je t'aime !
 */
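
The index wrapping described above can be sketched as follows. This is an assumption about the arithmetic rather than the library's code: wrapIndex is a hypothetical helper which maps any integer onto a valid position with a mathematical modulo.

```php
<?php

// Hypothetical helper: maps any integer index onto 0 … $length - 1.
function wrapIndex(int $index, int $length): int
{
    if ($length <= 0) {
        throw new InvalidArgumentException('The string must not be empty.');
    }

    // PHP's % keeps the sign of the left operand, so negative results
    // are shifted back into the valid range.
    $wrapped = $index % $length;

    return $wrapped < 0 ? $wrapped + $length : $wrapped;
}

var_dump(
    wrapIndex(-1, 9),  // last character of a 9-character string
    wrapIndex(9, 9),   // one past the end wraps to the first character
    wrapIndex(-10, 9)  // wraps around once, then counts from the end
);

/**
 * Will output:
 *     int(8)
 *     int(0)
 *     int(8)
 */
```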

Another very useful method is the ASCII transformation. Be careful: depending on your settings, this is not always possible. For example:

$title = new Hoa\Ustring\Ustring('Un été brûlant sur la côte');
echo $title->toAscii();

/**
 * Will output:
 *     Un ete brulant sur la cote
 */

We can also transform from Arabic or Japanese to ASCII. Symbols, like mathematical symbols or emojis, are also transformed:

$emoji = new Hoa\Ustring\Ustring('I ❤ Unicode');
$maths = new Hoa\Ustring\Ustring('∀ i ∈ ℕ');

echo
    $arabic->toAscii(), "\n",
    $japanese->toAscii(), "\n",
    $emoji->toAscii(), "\n",
    $maths->toAscii(), "\n";

/**
 * Will output:
 *     ahbk
 *     sihaanatawo aishite
 *     I (heavy black heart)️ Unicode
 *     (for all) i (element of) N
 */

For this method to work correctly, the intl extension needs to be present, so that the Transliterator class is available. If it does not exist, the Normalizer class must exist. If this class does not exist either, the Hoa\Ustring\Ustring::toAscii method can still try a transformation, but it is less efficient. To activate this last solution, true must be passed as a single argument. This tour de force is not recommended in most cases.
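
The fallback order described above can be sketched in plain PHP as follows. This is only an illustration under stated assumptions, not the code of toAscii: asciiFold is a hypothetical helper, the 'Any-Latin; Latin-ASCII' identifier is our choice of ICU transform, and the Normalizer branch only strips accents, so non-Latin text stays untouched in that case.

```php
<?php

// Hypothetical helper following the order Transliterator, then Normalizer.
function asciiFold(string $string): ?string
{
    if (class_exists('Transliterator')) {
        // Assumption: this compound ICU identifier is a reasonable choice.
        $transliterator = Transliterator::create('Any-Latin; Latin-ASCII');

        if (null !== $transliterator) {
            $result = $transliterator->transliterate($string);

            if (false !== $result) {
                return $result;
            }
        }
    }

    if (class_exists('Normalizer')) {
        // Decompose accented letters, then drop the combining marks.
        $decomposed = Normalizer::normalize($string, Normalizer::FORM_D);

        if (false !== $decomposed) {
            return preg_replace('#\p{Mn}+#u', '', $decomposed);
        }
    }

    // Without the intl extension, this sketch gives up rather than guess.
    return null;
}

var_dump(asciiFold('Un été brûlant sur la côte'));
```

Returning null when neither class exists keeps the sketch honest: a hand-written replacement table would be the “less efficient” last resort mentioned above, and it is deliberately left out here.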

We also find the getTransliterator method which returns a Transliterator object, or null if this class does not exist. This method takes a transliteration identifier as argument. We suggest reading the documentation about the transliterator of ICU to understand this identifier. The transliterate method transliterates the current string based on an identifier, a beginning index and an end index. This method works the same way as the Transliterator::transliterate method.

More generally, to change the encoding format, we can use the Hoa\Ustring\Ustring::transcode static method, with a string as first argument, the original encoding format as second argument and the expected encoding format as third argument (UTF-8 by default). To get the list of encoding formats, we have to refer to the iconv extension or to use the following command line in a terminal:

$ iconv --list

To know if a string is encoded in UTF-8, we can use the Hoa\Ustring\Ustring::isUtf8 static method; for instance:

var_dump(
    Hoa\Ustring\Ustring::isUtf8('a'),
    Hoa\Ustring\Ustring::isUtf8(Hoa\Ustring\Ustring::transcode('a', 'UTF-8', 'UTF-16'))
);

/**
 * Will output:
 *     bool(true)
 *     bool(false)
 */
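
For comparison, a well-known way to test UTF-8 validity with plain PHP is to run an empty pattern with the u modifier over the bytes. The looksLikeUtf8 helper below is hypothetical and is not claimed to be how isUtf8 works internally; the second input is the letter a encoded in UTF-16 (little-endian, with its byte order mark), written out byte by byte so that the sketch does not depend on iconv.

```php
<?php

// Hypothetical helper: preg_match fails (returns false) when the subject
// is not valid UTF-8 and the u modifier is set.
function looksLikeUtf8(string $bytes): bool
{
    return 1 === preg_match('##u', $bytes);
}

var_dump(
    looksLikeUtf8('a'),
    looksLikeUtf8("\xFF\xFE\x61\x00") // "a" in UTF-16LE with a BOM
);

/**
 * Will output:
 *     bool(true)
 *     bool(false)
 */
```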

We can split the string into several sub-strings by using the Hoa\Ustring\Ustring::split method. Its first argument is a regular expression (PCRE), then comes an integer representing the maximum number of elements to return and finally a combination of constants. These constants are the same as the ones of preg_split.

By default, the second argument is set to -1, which means infinity, and the last argument is set to PREG_SPLIT_NO_EMPTY. Thus, if we would like to get all the words of a string, we will write:

print_r($title->split('#\b|\s#'));

/**
 * Will output:
 *     Array
 *     (
 *         [0] => Un
 *         [1] => ete
 *         [2] => brulant
 *         [3] => sur
 *         [4] => la
 *         [5] => cote
 *     )
 */

If we would like to iterate over all the characters, it is recommended to rely on the IteratorAggregate interface, that is, on the Hoa\Ustring\Ustring::getIterator method. Let's see with the Arabic example:

foreach ($arabic as $letter) {
    echo $letter, "\n";
}

/**
 * Will output:
 *     أ
 *     ح
 *     ب
 *     ك
 */

We notice that the iteration is based on the text direction: the first element of the iteration is the first letter of the string starting from the beginning.

Of course, if we would like to get an array of characters, we can use the iterator_to_array PHP function:

print_r(iterator_to_array($arabic));

/**
 * Will output:
 *     Array
 *     (
 *         [0] => أ
 *         [1] => ح
 *         [2] => ب
 *         [3] => ك
 *     )
 */
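
Without the library, a comparable array can be obtained with preg_split and an empty pattern in UTF-8 mode. This is a plain PHP sketch, not the library's iterator, and it splits on code-points, so a letter followed by combining marks would be returned as several entries.

```php
<?php

// The empty pattern matches between every pair of code-points; the u
// modifier prevents the split from cutting inside a multi-byte character.
$letters = preg_split('##u', 'أحبك', -1, PREG_SPLIT_NO_EMPTY);

print_r($letters);

/**
 * Will output:
 *     Array
 *     (
 *         [0] => أ
 *         [1] => ح
 *         [2] => ب
 *         [3] => ك
 *     )
 */
```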

Comparison and search

Strings can also be compared thanks to the Hoa\Ustring\Ustring::compare method:

$string = new Hoa\Ustring\Ustring('abc');
var_dump(
    $string->compare('wxyz')
);

/**
 * Will output:
 *     int(-1)
 */

This method returns -1 if the initial string comes before (in alphabetical order), 0 if it is identical and 1 if it comes after. If we would like to use all the power of the underlying mechanism, we can call the Hoa\Ustring\Ustring::getCollator static method (if the Collator class exists; otherwise Hoa\Ustring\Ustring::compare will use a simple byte-to-byte comparison without taking care of the other parameters). Thus, if we would like to sort an array of strings, we will write:

$strings = array('c', 'Σ', 'd', 'x', 'α', 'a');
Hoa\Ustring\Ustring::getCollator()->sort($strings);
print_r($strings);

/**
 * Could output:
 *     Array
 *     (
 *         [0] => a
 *         [1] => c
 *         [2] => d
 *         [3] => x
 *         [4] => α
 *         [5] => Σ
 *     )
 */

Comparison between two strings depends on the locale, that is, on the localization of the system: the language, the country, the region etc. We can use the Hoa\Locale library to modify these data, but it is not a dependency of Hoa\Ustring.
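
To see why a collator matters, we can compare with what a byte-to-byte comparison gives in plain PHP. The sketch below uses only the standard sort and strcmp functions: Σ is encoded as 0xCE 0xA3 and α as 0xCE 0xB1, so the capital sigma comes first, and both come after every ASCII letter, whatever the locale.

```php
<?php

$strings = array('c', 'Σ', 'd', 'x', 'α', 'a');

// SORT_STRING compares the raw bytes of each string.
sort($strings, SORT_STRING);
print_r($strings);

// The same byte-wise rule places "abc" before "wxyz".
var_dump(strcmp('abc', 'wxyz') < 0);

/**
 * Will output:
 *     Array
 *     (
 *         [0] => a
 *         [1] => c
 *         [2] => d
 *         [3] => x
 *         [4] => Σ
 *         [5] => α
 *     )
 *     bool(true)
 */
```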

We can also know if a string matches a certain pattern, still expressed with a regular expression. To achieve that, we will use the Hoa\Ustring\Ustring::match method. This method relies on the preg_match and preg_match_all PHP functions, but by modifying the pattern's options to ensure the Unicode support. We have the following parameters: the pattern, a variable passed by reference to collect the matches, flags, an offset and finally a boolean indicating whether the search is global or not (respectively if we have to use preg_match_all or preg_match). By default, the search is not global.

Thus, we will check that our French example contains aime with a direct object complement:

$french->match('#(?:(?<direct_object>\w)[\'\b])aime#', $matches);
var_dump($matches['direct_object']);

/**
 * Will output:
 *     string(1) "t"
 */

This method returns false if an error is raised (for example if the pattern is not correct), 0 if no match has been found, and the number of matches otherwise.
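
The effect of the Unicode support on a pattern can be observed with the PCRE functions alone. In this plain PHP sketch, the same pattern is applied twice to the same word: without the u modifier the dot matches one byte at a time, whereas with it the dot matches one whole character.

```php
<?php

$word = 'été';

var_dump(
    preg_match_all('#.#',  $word), // byte mode: é counts for 2 bytes
    preg_match_all('#.#u', $word)  // UTF-8 mode: é counts for 1 character
);

/**
 * Will output:
 *     int(5)
 *     int(3)
 */
```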

Similarly, we can search and replace sub-strings by other sub-strings based on a pattern, still expressed with a regular expression. To achieve that, we will use the Hoa\Ustring\Ustring::replace method. This method uses the preg_replace and preg_replace_callback PHP functions, but still by modifying the pattern's options to ensure the Unicode support. As first argument, we find one or more patterns, as second argument, one or more replacements and as last argument the limit of replacements to apply. If the replacement is a callable, then the preg_replace_callback function will be used.

Thus, we will modify our French example to be more polite:

$french->replace('#(?:\w[\'\b])(?<verb>aime)#', function ($matches) {
    return 'vous ' . $matches['verb'];
});

echo $french;

/**
 * Will output:
 *     Je vous aime
 */

The Hoa\Ustring\Ustring class provides constants, such as Hoa\Ustring\Ustring::WITH_OFFSET and Hoa\Ustring\Ustring::GROUP_BY_TUPLE, which are aliases of existing PHP constants and improve the readability of the code.

Because they are strict aliases, we can write:

$string = new Hoa\Ustring\Ustring('abc1 defg2 hikl3 xyz4');
$string->match(
    '#(\w+)(\d)#',
    $matches,
    Hoa\Ustring\Ustring::WITH_OFFSET
  | Hoa\Ustring\Ustring::GROUP_BY_TUPLE,
    0,
    true
);

Characters

The Hoa\Ustring\Ustring class offers static methods working on a single Unicode character. We have already mentioned the getCharDirection method which gives the direction of a character. We also have the getCharWidth method which counts the number of columns necessary to print a single character. Thus:

var_dump(
    Hoa\Ustring\Ustring::getCharWidth(Hoa\Ustring\Ustring::fromCode(0x7f)),
    Hoa\Ustring\Ustring::getCharWidth('a'),
    Hoa\Ustring\Ustring::getCharWidth('㽠')
);

/**
 * Will output:
 *     int(-1)
 *     int(1)
 *     int(2)
 */

This method returns -1 or 0 if the character is not printable (for instance, a control character, like 0x7f which corresponds to DELETE), and 1 or more if the character can be printed. In our example, 㽠 requires 2 columns to be printed.

For a more expressive check, we have the Hoa\Ustring\Ustring::isCharPrintable method which tells whether a character is printable or not.

If we would like to count the number of columns necessary for a whole string, we have to use the Hoa\Ustring\Ustring::getWidth method. Thus:

var_dump(
    $french->getWidth(),
    $arabic->getWidth(),
    $japanese->getWidth()
);

/**
 * Will output:
 *     int(9)
 *     int(4)
 *     int(18)
 */

Try this in your terminal with a monospaced font. You will observe that Japanese requires 18 columns to be printed. This measure is very useful if we would like to know the length of a string to position it efficiently.

The getCharWidth method differs from getWidth in that it takes control characters into account. This method is intended to be used, for example, with terminals (please, see the Hoa\Console library).

Finally, if this time we are interested not in Unicode characters but rather in machine characters (char, being 1 byte), we have an extra operation. The Hoa\Ustring\Ustring::getBytesLength method will count the length of the string in bytes:

var_dump(
    $arabic->getBytesLength(),
    $japanese->getBytesLength()
);

/**
 * Will output:
 *     int(8)
 *     int(27)
 */

If we compare these results with the ones of the Hoa\Ustring\Ustring::count method, we understand that the Arabic characters are encoded with 2 bytes whereas the Japanese characters are encoded with 3 bytes. We can also get a specific byte thanks to the Hoa\Ustring\Ustring::getByteAt method. Once again, the index is not bounded.
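
The relation between characters and bytes can be checked with the standard functions only. In this plain PHP sketch, describeBytes is a hypothetical helper: it counts the characters with a UTF-8 aware pattern, counts the bytes with strlen and lists the distinct byte lengths found among the characters.

```php
<?php

// Hypothetical helper comparing the character count and the byte count.
function describeBytes(string $string): array
{
    // One entry per character, thanks to the u modifier.
    preg_match_all('#.#su', $string, $matches);

    return array(
        'characters'        => count($matches[0]),
        'bytes'             => strlen($string),
        // strlen counts bytes, so this lists the size of each character.
        'bytesPerCharacter' => array_values(
            array_unique(array_map('strlen', $matches[0]))
        )
    );
}

$arabic   = describeBytes('أحبك');
$japanese = describeBytes('私はあなたを愛して');

echo $arabic['characters'], ' characters, ', $arabic['bytes'], ' bytes', "\n";
echo $japanese['characters'], ' characters, ', $japanese['bytes'], ' bytes', "\n";

/**
 * Will output:
 *     4 characters, 8 bytes
 *     9 characters, 27 bytes
 */
```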

Code-point

Each character is represented by an integer, called a code-point. To get the code-point of a character, we can use the Hoa\Ustring\Ustring::toCode static method, and to get a character based on its code-point, we can use the Hoa\Ustring\Ustring::fromCode static method. We also have the Hoa\Ustring\Ustring::toBinaryCode method which returns the binary representation of a character. Let's take an example:

var_dump(
    Hoa\Ustring\Ustring::toCode('Σ'),
    Hoa\Ustring\Ustring::toBinaryCode('Σ'),
    Hoa\Ustring\Ustring::fromCode(0x3a3)
);

/**
 * Will output:
 *     int(931)
 *     string(16) "1100111010100011"
 *     string(2) "Σ"
 */
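
To show where these values come from, here is a sketch that decodes and encodes one UTF-8 character by hand. The codePointOf, binaryOf and charOf helpers are hypothetical and are not the library's implementation; they only follow the UTF-8 bit layout, where the first byte announces the length and each following byte carries 6 bits.

```php
<?php

// Hypothetical helper: returns the bytes of a character as integers.
function bytesOf(string $char): array
{
    if ('' === $char) {
        throw new InvalidArgumentException('A character is expected.');
    }

    return array_values(unpack('C*', $char));
}

// Decodes the first UTF-8 character of $char into its code-point.
function codePointOf(string $char): int
{
    $bytes = bytesOf($char);
    $lead  = $bytes[0];

    if ($lead < 0x80) {
        return $lead;                        // 0xxxxxxx: plain ASCII
    } elseif (0xC0 === ($lead & 0xE0)) {
        $code = $lead & 0x1F; $extra = 1;    // 110xxxxx: 2 bytes
    } elseif (0xE0 === ($lead & 0xF0)) {
        $code = $lead & 0x0F; $extra = 2;    // 1110xxxx: 3 bytes
    } elseif (0xF0 === ($lead & 0xF8)) {
        $code = $lead & 0x07; $extra = 3;    // 11110xxx: 4 bytes
    } else {
        throw new InvalidArgumentException('Invalid UTF-8 lead byte.');
    }

    for ($i = 1; $i <= $extra; ++$i) {
        if (!isset($bytes[$i]) || 0x80 !== ($bytes[$i] & 0xC0)) {
            throw new InvalidArgumentException('Invalid UTF-8 sequence.');
        }

        // Each continuation byte (10xxxxxx) contributes 6 bits.
        $code = ($code << 6) | ($bytes[$i] & 0x3F);
    }

    return $code;
}

// Returns the bits of every byte of the character, 8 bits per byte.
function binaryOf(string $char): string
{
    return implode('', array_map(function (int $byte): string {
        return sprintf('%08b', $byte);
    }, bytesOf($char)));
}

// Encodes a code-point into a UTF-8 character.
function charOf(int $code): string
{
    if ($code < 0x80) {
        return chr($code);
    } elseif ($code < 0x800) {
        return chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
    } elseif ($code < 0x10000) {
        return chr(0xE0 | ($code >> 12)) . chr(0x80 | (($code >> 6) & 0x3F))
             . chr(0x80 | ($code & 0x3F));
    }

    return chr(0xF0 | ($code >> 18)) . chr(0x80 | (($code >> 12) & 0x3F))
         . chr(0x80 | (($code >> 6) & 0x3F)) . chr(0x80 | ($code & 0x3F));
}

var_dump(codePointOf('Σ'), binaryOf('Σ'), charOf(0x3A3));

/**
 * Will output:
 *     int(931)
 *     string(16) "1100111010100011"
 *     string(2) "Σ"
 */
```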

Search algorithms

The Hoa\Ustring library provides sophisticated search algorithms on strings through the Hoa\Ustring\Search class.

We will study the Hoa\Ustring\Search::approximated algorithm which searches a sub-string in a string up to k differences (a difference is an addition, a deletion or a modification). Let's take the classical example of a DNA representation: We will search all the sub-strings approximating GATAA with 1 difference (maximum) in CAGATAAGAGAA. So, we will write:

$x      = 'GATAA';
$y      = 'CAGATAAGAGAA';
$k      = 1;
$search = Hoa\Ustring\Search::approximated($y, $x, $k);
$n      = count($search);

echo 'Try to match ', $x, ' in ', $y, ' with at most ', $k, ' difference(s):', "\n";
echo $n, ' match(es) found:', "\n";

foreach ($search as $position) {
    echo '    • ', substr($y, $position['i'], $position['l']), "\n";
}

/**
 * Will output:
 *     Try to match GATAA in CAGATAAGAGAA with at most 1 difference(s):
 *     4 match(es) found:
 *         • AGATA
 *         • GATAA
 *         • ATAAG
 *         • GAGAA
 */

This method returns an array of arrays. Each sub-array represents a result and contains three indexes: i for the position of the first character (byte) of the result, j for the position of the last character and l for the length of the result (simply j - i). Thus, we can extract the results from our initial string (here $y) by using these indexes.

With our example, we have four results. The first is AGATA, being GATAA with one moved character, and AGATA exists in CAGATAAGAGAA. The second result is GATAA, our sub-string, which well and truly exists in CAGATAAGAGAA. The third result is ATAAG, being GATAA with one moved character, and ATAAG exists in CAGATAAGAGAA. Finally, the last result is GAGAA, being GATAA with one modified character, and GAGAA exists in CAGATAAGAGAA.
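
To get an intuition of how such a search can be computed, here is a minimal sketch of a classical dynamic-programming technique (often attributed to Sellers). It is not claimed to be the algorithm behind Hoa\Ustring\Search::approximated: approximateEnds is a hypothetical function, it works on bytes, and it only reports the offsets where an approximate occurrence ends. Reading back as many bytes as the searched sub-string from each offset happens to give the same four results as above.

```php
<?php

// Hypothetical function: returns the (exclusive) byte offsets of $text
// where an occurrence of $pattern ends with at most $k differences.
function approximateEnds(string $text, string $pattern, int $k): array
{
    $m = strlen($pattern);
    $n = strlen($text);

    // $column[$i] is the smallest number of differences between the first
    // $i bytes of $pattern and some sub-string of $text ending at the
    // current offset. Before reading any text, it is simply $i.
    $column = range(0, $m);
    $ends   = array();

    for ($j = 1; $j <= $n; ++$j) {
        $diagonal  = $column[0]; // value for ($i - 1, $j - 1)
        $column[0] = 0;          // an occurrence may start anywhere

        for ($i = 1; $i <= $m; ++$i) {
            $left = $column[$i]; // value for ($i, $j - 1)
            $cost = $pattern[$i - 1] === $text[$j - 1] ? 0 : 1;

            $column[$i] = min(
                $diagonal + $cost,   // same byte, or one modification
                $column[$i - 1] + 1, // one byte of $pattern is missing
                $left + 1            // one extra byte in $text
            );
            $diagonal = $left;
        }

        if ($column[$m] <= $k) {
            $ends[] = $j;
        }
    }

    return $ends;
}

$x = 'GATAA';
$y = 'CAGATAAGAGAA';

foreach (approximateEnds($y, $x, 1) as $end) {
    $start = max(0, $end - strlen($x));
    echo $end, ': ', substr($y, $start, $end - $start), "\n";
}

/**
 * Will output:
 *     6: AGATA
 *     7: GATAA
 *     8: ATAAG
 *     12: GAGAA
 */
```

Keeping a single column instead of the whole table is a common design choice here: each offset of the text only needs the values of the previous offset.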

Another example, more concrete this time. We will consider the --testIt --foobar --testThat --testAt string (which represents possible options of a command line), and we will search --testot, an option that should have been given by the user. This option does not exist as it is. We will then use our search algorithm with at most 1 difference. Let's see:

$x      = 'testot';
$y      = '--testIt --foobar --testThat --testAt';
$k      = 1;
$search = Hoa\Ustring\Search::approximated($y, $x, $k);
$n      = count($search);

// …

/**
 * Will output:
 *     Try to match testot in --testIt --foobar --testThat --testAt with at most 1 difference(s):
 *     2 match(es) found:
 *         • testIt
 *         • testAt
 */

The testIt and testAt results are true options, so we can suggest them to the user. This is a mechanism used by Hoa\Console to suggest corrections to the user in case of a mistyping.
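
When the options are already known as a list, a simpler technique gives the same suggestions: comparing the typed option with each known option using the standard levenshtein function. This sketch is a different approach from the approximated search, suggest is a hypothetical helper, and levenshtein works on bytes, so it is best suited to ASCII option names.

```php
<?php

// Hypothetical helper: keeps the options within $k differences of $typed.
function suggest(string $typed, array $options, int $k = 1): array
{
    return array_values(array_filter(
        $options,
        function (string $option) use ($typed, $k): bool {
            // levenshtein counts additions, deletions and modifications.
            return levenshtein($typed, $option) <= $k;
        }
    ));
}

print_r(suggest('testot', array('testIt', 'foobar', 'testThat', 'testAt')));

/**
 * Will output:
 *     Array
 *     (
 *         [0] => testIt
 *         [1] => testAt
 *     )
 */
```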

Conclusion

The Hoa\Ustring library provides facilities to manipulate strings encoded with the Unicode format, but also to perform sophisticated searches on strings.

