Introduction to Unicode

Unicode is an international standard that encodes characters so they can be seamlessly processed and represented regardless of the platform. Unicode represents human language (and other forms of communication, like emoji) on computers. Every character in the Unicode standard is assigned a unique number.

Swift’s String and Character types are built on top of Unicode, and they do the majority of the heavy lifting. Nonetheless, it is good to have an understanding of how these types work with Unicode. Having this knowledge will likely save you some time and frustration in the future.

UNICODE SCALARS

At their heart, strings in Swift are composed of Unicode scalars. A Unicode scalar is a 21-bit number that represents a specific character in the Unicode standard. The text U+1F60E shows the standard way of writing a Unicode scalar: U+ followed by the scalar’s number in hexadecimal. For example, U+0061 represents the Latin small letter a, and U+2603 represents a snowman.

Create a constant to see how to use specific Unicode scalars in Swift and the playground.

Listing 7.7  Using a Unicode scalar

...
for c: Character in mutablePlayground {
    print("'\(c)'")                                        (26 times)
}

let snowman = "\u{2603}"                                   "☃"

This time, you used a new syntax to create a string. The quotation marks are familiar, but what is inside them is not the kind of string literal you have seen before, and it does not match the result in the sidebar.

The \u{} syntax is an escape sequence that resolves to the Unicode scalar whose hexadecimal number appears between the braces. In this case, the value of snowman is the Unicode character of a snowman.
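To make the link between the escape sequence and the U+ notation concrete, here is a minimal sketch. It is not part of the chapter's playground, and the constant names are only illustrative. It stores a single scalar in a Unicode.Scalar constant and prints the scalar's number in decimal and in hexadecimal.

```swift
// Illustrative names; not part of the chapter's playground.
let snowmanScalar: Unicode.Scalar = "\u{2603}"

// value is the scalar's number as a UInt32.
let decimalValue = snowmanScalar.value
// The same number written in hexadecimal, as it appears after "U+".
let hexValue = String(decimalValue, radix: 16, uppercase: true)

print(decimalValue)     // 9731
print("U+\(hexValue)")  // U+2603
```

The number between the braces of \u{} and the number after U+ are the same hexadecimal value. Only the surrounding notation differs.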

How does this relate to more familiar strings? Swift strings are composed of Unicode scalars. So why do they look unfamiliar? To explain, we need to discuss a few more concepts.

Every character in Swift is an extended grapheme cluster. Extended grapheme clusters are sequences of one or more Unicode scalars that combine to produce a single human-readable character. One Unicode scalar generally maps onto one fundamental character in a given language, but there are also combining scalars. For example, U+0301 represents the Unicode scalar for the combining acute accent: ´. This scalar is placed on top of – that is, combined with – the character that precedes it.

In your playground, use this scalar with the Latin small letter a to create the character á:

Listing 7.8  Using a combining scalar

...
let snowman = "\u{2603}"                                   "☃"
let aAcute = "\u{0061}\u{0301}"                            "á"

Making characters extended grapheme clusters gives Swift flexibility in dealing with complex script characters.
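As a further illustration of that flexibility, the sketch below goes beyond the chapter's listings. The constant name is hypothetical. It builds a flag from two regional indicator scalars, which Swift treats as a single character.

```swift
// Two regional indicator scalars (for U and S) that together form one flag.
let flag = "\u{1F1FA}\u{1F1F8}"

print(flag.count)                 // 1
print(flag.unicodeScalars.count)  // 2
```

The string holds two scalars but only one character, because both scalars belong to the same extended grapheme cluster.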

Swift also provides a mechanism to see all the Unicode scalars in a string: the unicodeScalars property, which holds all the scalars that Swift uses to make the string. For example, you can use it to see the scalars behind the instance of String named playground that you created earlier. (Properties, which you will learn about in Chapter 16, are constants or variables that associate values with an instance of a type.)

Add the following code to your playground to see playground’s Unicode scalars.

Listing 7.9  Revealing the Unicode scalars behind a string

...
let snowman = "\u{2603}"                                   "☃"
let aAcute = "\u{0061}\u{0301}"                            "á"
for scalar in playground.unicodeScalars {
    print("\(scalar.value)")                               (17 times)
}

You should see the following output in the console: 72 101 108 108 111 44 32 112 108 97 121 103 114 111 117 110 100. What do all these numbers mean?

The unicodeScalars property holds on to data representing all the Unicode scalars used to create the string instance playground. Each number on the console corresponds to a Unicode scalar representing a single character in the string. But they are not the hexadecimal Unicode numbers. Instead, each is represented as an unsigned 32-bit integer. For example, the first, 72, corresponds to the Unicode scalar value of U+0048, or an uppercase H.
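If you would rather see the scalars in the familiar U+ notation, one possible approach is to convert each value to hexadecimal yourself. This is a sketch, not code from the chapter. The names greeting and notations are assumptions, and the padding to four digits follows the way scalars are written in this chapter.

```swift
let greeting = "Hello"
var notations: [String] = []

for scalar in greeting.unicodeScalars {
    // Convert the UInt32 value to uppercase hexadecimal digits.
    let hex = String(scalar.value, radix: 16, uppercase: true)
    // Pad with leading zeros so that at least four digits are shown.
    let padding = String(repeating: "0", count: max(0, 4 - hex.count))
    notations.append("U+" + padding + hex)
}

print(notations)  // ["U+0048", "U+0065", "U+006C", "U+006C", "U+006F"]
```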

CANONICAL EQUIVALENCE

While there is a role for combining scalars, Unicode also provides already combined forms for some common characters. For example, there is a specific scalar for á. You do not actually need to decompose it into its two parts, the letter and the accent. The scalar is U+00E1. Create a new constant string that uses this Unicode scalar.

Listing 7.10  Using a precomposed character

...
let aAcute = "\u{0061}\u{0301}"                            "á"
for scalar in playground.unicodeScalars {
    print("\(scalar.value) ")                              (17 times)
}

let aAcutePrecomposed = "\u{00E1}"                         "á"

As you can see, aAcutePrecomposed appears to have the same value as aAcute. Indeed, if you check whether these two characters are the same, you will find that Swift answers “yes.”

Listing 7.11  Checking equivalence

...
let aAcute = "\u{0061}\u{0301}"                            "á"
for scalar in playground.unicodeScalars {
    print("\(scalar.value) ")                              (17 times)
}

let aAcutePrecomposed = "\u{00E1}"                         "á"

let b = (aAcute == aAcutePrecomposed)                      true

aAcute was created using two Unicode scalars, and aAcutePrecomposed only used one. Why does Swift say that they are equivalent? The answer is canonical equivalence.

Canonical equivalence refers to whether two sequences of Unicode scalars are the same linguistically. Two characters, or two strings, are considered equal if they have the same linguistic meaning and appearance, regardless of whether they are built from the same Unicode scalars. aAcute and aAcutePrecomposed are equal strings because both represent the Latin small letter a with an acute accent. The fact that they were created with different Unicode scalars does not affect this.
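The difference between comparing strings and comparing their scalars can be sketched as follows. The constant names are illustrative and do not come from the chapter's playground.

```swift
let decomposed = "\u{0061}\u{0301}"  // a followed by a combining acute accent
let precomposed = "\u{00E1}"         // the single precomposed scalar

// Comparing the strings applies canonical equivalence.
let sameString = (decomposed == precomposed)
// Comparing the underlying scalars element by element does not.
let sameScalars = decomposed.unicodeScalars.elementsEqual(precomposed.unicodeScalars)

print(sameString)   // true
print(sameScalars)  // false
```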

Counting elements

Canonical equivalence has implications for counting the elements of a string. You might think that aAcute and aAcutePrecomposed would have different character counts. Write the following code to check.

Listing 7.12  Counting characters

...
let aAcutePrecomposed = "\u{00E1}"                         "á"

let b = (aAcute == aAcutePrecomposed)                      true

aAcute.count                                               1
aAcutePrecomposed.count                                    1

You use the count property on String to determine the character count of these two strings. count walks through a string’s Unicode scalars, grouping them into characters (extended grapheme clusters), and reports how many characters it finds. The results sidebar reveals that the character counts are the same: Both are one character long.

Canonical equivalence means that whether you use a combining scalar or a precomposed scalar, the result is treated as the same. aAcute uses two Unicode scalars; aAcutePrecomposed uses one. This difference does not matter since both result in the same character.
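The same holds inside a longer string. In this sketch the word and the constant name are examples chosen for illustration, not taken from the chapter. The character count and the scalar count differ by exactly the one combining scalar.

```swift
// The final e is followed by a combining acute accent.
let cafe = "cafe\u{0301}"

print(cafe.count)                 // 4
print(cafe.unicodeScalars.count)  // 5
print(cafe == "caf\u{00E9}")      // true
```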

Indices and ranges

Strings are ordered collections of characters. If you have worked with collections in other languages, you might think that you can access a specific character in a string like so:

    let playground = "Hello, playground"
    let index = playground[3] // 'l'?

The code playground[3] uses the subscript syntax. In general, the brackets ([]) after a variable name indicate that you are using a subscript in Swift. Subscripts allow you to retrieve a specific value within a collection.

The 3 in this example is an index that is used to find a particular element within a collection. The code above suggests that you are trying to select the fourth character from the collection of characters making up the playground string (fourth, not third, because the first index is 0). And for other Swift collection types, subscript syntax like this would work. (You will learn more about subscripts below and will also see them in action in Chapter 8 on arrays and Chapter 10 on dictionaries.)

However, if you tried to use a subscript like this on a String, you would get an error: "'subscript' is unavailable: cannot subscript String with an Int." The Swift compiler will not let you access a specific character on a string via a subscript index.

This limitation has to do with the way Swift strings and characters are stored. You cannot index a string with an integer, because characters can be made up of different numbers of Unicode scalars, so Swift cannot know where the character at a given position begins without stepping through every preceding character. This operation can be expensive. Therefore, Swift forces you to be more explicit.
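One way to see that cost spelled out is to copy the characters into an array first. This is only a sketch of an alternative, not the approach the chapter recommends. The names text, characters, and fourth are assumptions.

```swift
let text = "Hello, playground"

// Building the array steps through the whole string once.
let characters = Array(text)
// After that, integer subscripts work as they do for any array.
let fourth = characters[3]

print(fourth)  // l
```

The array makes the stepping-through explicit: you pay for it once, up front, and the copy uses additional memory.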

Swift uses a type called String.Index to keep track of indices in string instances. (The period in String.Index just means that Index is a type that is defined on String. You will learn more about nested types like this in Chapter 16.)

As you have seen in this chapter, an individual character may be made up of multiple Unicode scalars. (You may also see the term code point. Every Unicode scalar is a code point, but the surrogate code points reserved for UTF-16 are not scalars.) The indices that a String hands back fall on character boundaries, so subscripting with one yields a complete Character no matter how many scalars make it up.

Because Index is defined on String, you can ask the String to hand back indices that are meaningful. To find the character at a particular index, you begin with the String type’s startIndex property. This property yields the starting index of a string as a String.Index. You then use this starting point in conjunction with the index(_:offsetBy:) method to move forward until you arrive at the position of your choosing. (A method is like a function; you will learn more about them in Chapter 12.)

Say you want to know the fifth character of the playground string that you created earlier.

Listing 7.13  Finding the fifth character

let playground = "Hello, playground"                       "Hello, playground"
...
aAcute.count                                               1
aAcutePrecomposed.count                                    1

let start = playground.startIndex                          String.Index
let end = playground.index(start, offsetBy: 4)             String.Index
let fifthCharacter = playground[end]                       "o"

You use the startIndex property on the string to get the first index of the string. This property yields an instance of type String.Index. Next, you use the index(_:offsetBy:) method to advance from the starting point to your desired position. You tell the method to begin at the first index and then add 4 to advance to the fifth character.

The result of calling index(_:offsetBy:) is a String.Index that you assign to the constant end. Finally, you use end to subscript your playground string, which results in the character o being assigned to fifthCharacter.
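index(_:offsetBy:) causes a runtime error if the offset moves past the end of the string. A more defensive variant can be sketched with the related method index(_:offsetBy:limitedBy:), which returns an optional. The helper character(at:in:) and the constant names below are hypothetical and are not part of the chapter's playground.

```swift
// Hypothetical helper: returns nil instead of crashing on a bad offset.
func character(at offset: Int, in text: String) -> Character? {
    guard offset >= 0,
          let index = text.index(text.startIndex, offsetBy: offset, limitedBy: text.endIndex),
          index < text.endIndex  // endIndex itself is one past the last character
    else { return nil }
    return text[index]
}

let sample = "Hello, playground"
let fifth = character(at: 4, in: sample)      // "o"
let missing = character(at: 100, in: sample)  // nil
```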

Character ranges, like indices, depend upon the String.Index type. Suppose you wanted to grab the first five characters of playground. You can use the same start and end constants.

Listing 7.14  Pulling out a range

...
let start = playground.startIndex                          String.Index
let end = playground.index(start, offsetBy: 4)             String.Index
let fifthCharacter = playground[end]                       "o"
let range = start...end                                    {{_rawBits 1}, {_rawBit...
let firstFive = playground[range]                          "Hello"

The result of the expression start...end is assigned to a constant named range. It has the type ClosedRange<String.Index>. A closed range, as you saw in Chapter 4, includes a lower bound, an upper bound, and everything in between. <String.Index> indicates the type of the elements along the range – the type that strings use for their indices. (In Chapter 4, the ranges you used were of type Range<Int> and ClosedRange<Int>.)

Your range’s lower bound is start, which is a String.Index whose value you can think of as being 0. The upper bound is end, which is also a String.Index whose value you can think of as 4. (The actual values are more complicated, as the sidebar results hint at, and are outside the scope of this book.) Thus, range describes a series of indices within playground from its starting index up to and including the index offset by 4.

You used this new range as a subscript on the playground string. The subscript grabbed the first five characters from playground, making firstFive a constant equal to "Hello".
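For comparison, here is a sketch of the same subscript with a half-open range. The constant names are illustrative. It also shows one detail the listing does not: subscripting a string with a range yields a Substring, which you can convert with String(_:) when you need an independent string.

```swift
let phrase = "Hello, playground"
let from = phrase.startIndex
let to = phrase.index(from, offsetBy: 4)

let closed = phrase[from...to]    // includes the character at to
let halfOpen = phrase[from..<to]  // stops just before it

print(closed)    // Hello
print(halfOpen)  // Hell

// Both results are Substring values; this makes a standalone String.
let copy = String(closed)
```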

In addition to closed ranges and the half-open ranges you also saw in Chapter 4, there is a third type of range you can use in Swift: the one-sided range. Update your playground to use one:

Listing 7.15  Using a one-sided range

...
let start = playground.startIndex                          String.Index
let end = playground.index(start, offsetBy: 4)             String.Index
let fifthCharacter = playground[end]                       "o"
// let range = start...end
let range = ...end                                         PartialRangeThrough<Str...
let firstFive = playground[range]                          "Hello"

By removing the lower bound from your range, you tell the compiler that the range should begin with the lowest possible value; in this case, the beginning of the string. A one-sided range can omit the lower bound with either range operator (...end or ..<end), or it can omit the upper bound with the closed range operator (start...).
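The three forms of one-sided range can be sketched side by side. The constant names are assumptions made for this example, and the index chosen is that of the comma.

```swift
let sentence = "Hello, playground"
let commaIndex = sentence.index(sentence.startIndex, offsetBy: 5)

let through = sentence[...commaIndex]    // PartialRangeThrough: start through the comma
let upTo = sentence[..<commaIndex]       // PartialRangeUpTo: start up to the comma
let fromComma = sentence[commaIndex...]  // PartialRangeFrom: comma to the end

print(through)    // Hello,
print(upTo)       // Hello
print(fromComma)  // , playground
```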

Because strings are such a central part of communicating with your user, it is no surprise that they have so many features for examining and working with their contents.

