The method String.length() gives the length of the String, right? Or does it?
It does not. It returns the number of UTF-16 code units, which is the size of the char[] used internally. Since Java 9 the internal storage might actually be a byte[], but for this article I assume it’s a char[]. Either way, this is not the “length” of the String.
The API says this:
Returns the length of this string.
The length is equal to the number of Unicode code units in the string.
Returns: the length of the sequence of characters represented by this object.
This is often misinterpreted by those who do not understand how UTF-16 works. The method could have been given another name to prevent such misunderstandings. There is codePointCount, so why not charCount or codeUnitCount?
If you want to restrict the length of a String (e.g. a user name) it is very important that you have a clear definition of how you determine the length.
It always depends on what you want to count. length() simply gives you the number of chars needed to hold the String in memory. Twice that value gives you the number of bytes (octets) in UTF-16. Such a char (= two bytes) is the unit used for Strings, which is why chars are also called code units. The actual Unicode symbol is called a code point, and some code points need two UTF-16 code units.
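To make the three numbers concrete, here is a minimal sketch that reports code units, code points and UTF-16 bytes for the same String. The class and method names (CodeUnitDemo, codeUnits, codePoints, utf16Bytes) are my own and purely illustrative; the byte count assumes UTF-16 without a byte order mark.

```java
import java.nio.charset.StandardCharsets;

// Illustrative class; the names are assumptions, not part of any API.
final class CodeUnitDemo {

    // Number of UTF-16 code units, i.e. what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes when encoded as UTF-16 (big endian, no byte order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // MUSICAL SYMBOL G CLEF (U+1D11E)
        System.out.println(codeUnits(clef));  // 2
        System.out.println(codePoints(clef)); // 1
        System.out.println(utf16Bytes(clef)); // 4
    }
}
```

For a plain ASCII letter such as "e" all three views agree on one unit, one code point and two bytes, which is why the difference often goes unnoticed.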
So how to count characters? My example shows just one possible solution.
Step one: Normalize the string where necessary. For example, you may want to replace “a” + “¨” (the combining diaeresis) with the single letter “ä” (a-umlaut), so that it is counted just once. For that I use NFKC in my example.
Oracle has this official tutorial on the topic: Normalizing Text
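The choice of normalization form affects the count. The following sketch, under my own illustrative names (NormalizationForms, codePointsAfter), compares NFC and NFKC: both compose “a” + combining diaeresis into one code point, but NFKC additionally replaces compatibility characters such as the “ﬁ” ligature with their plain equivalents, which can make the count go up.

```java
import java.text.Normalizer;

// Illustrative class; the names are assumptions made for this sketch.
final class NormalizationForms {

    // Normalizes with the given form and counts the resulting code points.
    static int codePointsAfter(String s, Normalizer.Form form) {
        String n = Normalizer.normalize(s, form);
        return n.codePointCount(0, n.length());
    }

    public static void main(String[] args) {
        String combined = "a\u0308"; // "a" + COMBINING DIAERESIS
        String ligature = "\uFB01";  // LATIN SMALL LIGATURE FI

        System.out.println(codePointsAfter(combined, Normalizer.Form.NFC));  // 1
        System.out.println(codePointsAfter(combined, Normalizer.Form.NFKC)); // 1
        System.out.println(codePointsAfter(ligature, Normalizer.Form.NFC));  // 1
        System.out.println(codePointsAfter(ligature, Normalizer.Form.NFKC)); // 2
    }
}
```

Whether the ligature should count as one or two is exactly the kind of decision that belongs in your definition of “length”.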
You might also want to handle line breaks, as some systems use two characters and some use just one. I simply replace “\r\n” (Windows) with “\n” (Linux/Mac).
Step two: Count code points instead of code units. Code points are the actual characters defined by Unicode.
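import java.text.Normalizer;

public static int countChars(String str) {
    // Compose sequences such as "a" + combining diaeresis into one code point.
    String normalized = Normalizer.normalize(str, Normalizer.Form.NFKC);
    // Count a Windows line break as a single character.
    normalized = normalized.replace("\r\n", "\n");
    // Count code points, not UTF-16 code units.
    return normalized.codePointCount(0, normalized.length());
}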
A character with a diacritical mark (e.g. accents) can be stored as a single character or as a combination of the base character plus the mark. The Normalizer will normalize them to single characters where possible.
// LATIN SMALL LETTER E (U+0065):
String e = "e";
// LATIN SMALL LETTER E WITH ACUTE (U+00E9):
String e_acute1 = "\u00e9"; // = "é"
// COMBINING ACUTE ACCENT (U+0301):
String acute = "\u0301"; // = "´"
// Combination (U+0065 + U+0301):
String e_acute2 = e + acute; // = "é"
// G Clef (U+1D11E):
String clef = "\uD834\uDD1E"; // = "𝄞"
So here we have the letter e and the accent ´. Note that the combining accent is always combined with the preceding character. Then we have é as a single character and also as the combination.
I’ve also added a “clef”, which your browser might not be able to display properly. It is a musical symbol and it takes two UTF-16 code units, as you can see in the source code.
// Requires: import java.util.Arrays;
for (String str : Arrays.asList(e, acute, e_acute1, e_acute2, clef)) {
    System.out.print(str);
    System.out.print(" : length = ");
    System.out.print(str.length());
    System.out.print("; countChars = ");
    System.out.println(countChars(str));
}
This produces the following output:
e : length = 1; countChars = 1
´ : length = 1; countChars = 1
é : length = 1; countChars = 1
é : length = 2; countChars = 1
𝄞 : length = 2; countChars = 1
Note that countChars always counts just one character here, because the combination was replaced with a single character. The “clef” is counted as one because countChars counts code points, not code units.
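The replacement of the combination can also be observed directly. This is a small sketch with names of my own choosing (ComposedVsDecomposed, sameAfterNfc); it uses NFC, which is enough to show the composition step, and it shows that two Strings that look identical are not equal until both are normalized.

```java
import java.text.Normalizer;

// Illustrative class; the names are assumptions made for this sketch.
final class ComposedVsDecomposed {

    // Compares two strings after normalizing both to NFC.
    static boolean sameAfterNfc(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        String decomposed = "e\u0301"; // "e" + COMBINING ACUTE ACCENT

        // The raw strings differ: one code unit versus two.
        System.out.println(precomposed.equals(decomposed)); // false
        // The decomposed form is not in NFC yet.
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
        // After normalization both are the same single code point.
        System.out.println(sameAfterNfc(precomposed, decomposed)); // true
    }
}
```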
And that’s how you count characters instead of the length of a String.
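Coming back to the user name case, a length check built on this counting rule might look like the sketch below. The class name UserNameLimit, the method fits and the limit of 20 are hypothetical and only there for illustration; countChars is repeated so that the example is self-contained.

```java
import java.text.Normalizer;

// Hypothetical helper; class name, method names and the limit are assumptions.
final class UserNameLimit {

    static final int MAX_CODE_POINTS = 20; // assumed limit, pick your own

    // Same rule as above: NFKC, unify line breaks, then count code points.
    static int countChars(String str) {
        String normalized = Normalizer.normalize(str, Normalizer.Form.NFKC);
        normalized = normalized.replace("\r\n", "\n");
        return normalized.codePointCount(0, normalized.length());
    }

    // True if the name is within the limit under the rule above.
    static boolean fits(String userName) {
        return countChars(userName) <= MAX_CODE_POINTS;
    }

    public static void main(String[] args) {
        // Build a name of 20 G clefs: 40 code units, but 20 code points.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            sb.append("\uD834\uDD1E");
        }
        String clefs = sb.toString();

        System.out.println(clefs.length() <= MAX_CODE_POINTS); // false
        System.out.println(fits(clefs));                        // true
    }
}
```

A check based on length() would reject this name even though it contains exactly 20 characters by the definition used in this article.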
Here’s an alternative way of counting “letters”. I’m not sure if it’s correct for Telugu, but it could help some to find the correct way for certain languages.
https://pastebin.com/K2h4zSJT
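Another option from the standard library, which is not the code behind the link above, is to count the boundaries reported by java.text.BreakIterator. The sketch below uses hypothetical names (GraphemeCount, countGraphemes). It counts a base character plus its combining marks as one unit even without normalization, but results for emoji sequences and complex scripts can differ between JDK versions, so treat it as a starting point and test it with the languages you care about.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Illustrative class; the names are assumptions made for this sketch.
final class GraphemeCount {

    // Counts the character boundaries that BreakIterator finds in the string.
    static int countGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        // next() returns the following boundary, or DONE at the end of the text.
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countGraphemes("e\u0301"));      // 1
        System.out.println(countGraphemes("\uD834\uDD1E")); // 1
    }
}
```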