Given a string, determine if it has all unique characters.
/**
* Determines if str has unique characters
* @param {string} str
* @returns {boolean}
*/
function isUnique(str) {
// Fill in your solution here
}
Think it’s easy? Try this input, str = "π¨βπ¦π¨"
.
TL;DR:
- Use nested loops
- Sorting and comparing
- Use an object cache
- Using bitwise operators
Examples:
Long unique string
Input: NzEtem04n2vcokuyI5iYZjlfRC639aX-=#[}?/.!@%
Output: true
Mix of alphabets, numbers and special characters
Input: !-&*$abc456$
Output: false
Uppercase and lowercase alphabets
Input: aAbB
Output: true, as uppercase and lowercase alphabets are unique. i.e. case-sensitive
UTF-16 surrogate pairs. Each character is a pair of one or more 16-bit code units. 'π'.length
returns 2!
Input: γΊΏΝΏ
πππ¨βπ¦π¨ (ΝΏ
is the Greek letter Yot and not the Latin-script J)
Output: true
Strings in JavaScript
A string is a sequence of characters used to represent text. Strings are represented as sequences of UTF-16 code units. Each code unit is 16 bits long. This means there are a maximum of 216, or, 65536 possible characters representable as single UTF-16 code units.
However, the entire Unicode character set is much, much bigger than 65536. The extra characters are stored as surrogate pairs, which are pairs of 16-bit code units that represent a single character. Each Unicode character, comprised of one or two UTF-16 code units, is also called a Unicode code point.
"π".split(""); // ['\ud83d', '\ude04'];
Shows the two 16 bit sequences that make up the emoji.π
is a Unicode code point.
1. Use nested loops
This method compares each character with every other character to check for duplicates.
/**
* Determines if str has unique characters
* Compares each character with every other character
* @param {string} str
* @returns {boolean}
*/
function isUnique(str) {
for (let i = 0; i < str.length; i++) {
for (let j = i + 1; j < str.length; j++) { // j starts from i + 1
if (str[i] === str[j]) {
return false // Immediately return false
}
}
}
return true
}
Time Complexity | O(n2) |
Space Complexity | O(1) |
There’s more to this simple algorithm than meets the eye. It does not work for surrogate pairs. console.log(isUnique('ππ'))
returns false
!
Why does this happen?
Each emoji is a pair of 16-bit units. So in memory, str ='ππ'
is stored as,
str[0] | str[1] | str[2] | str[3] |
\uD83D | \uDE00 | \uD83D | \uDE11 |
π
= str[0] and str[1]. π
= str[2] and str[3]str[0] is equal to str[2], so the function returns false.
Fixing this bug is not easy. One way would be to convert the string to an array before iterating over it. In fact, Array.from('ππ')
does split them correctly into, ['π', 'π']
.
But this fails for other emojis that are compositions of multiple emojis,
// "Family: Man, Boy"
Array.from("π¨βπ¦"); // ['π¨', 'β', 'π¦']
"π¨βπ¦".length; // 5
// splits into the "Man" and "Boy" emoji, joined by a ZWJ, \u200d character
These sequences of Unicode characters, like emojis, that should be treated as one visual unit, are known as grapheme clusters. Iterating through strings containing grapheme clusters will require custom code. Loops like, for
, for-of
etc will fail for composed emojis.
Length of a tweet on Twitter/X
π
, each one takes up two characters because they are composed of two UTF-16 code units. Therefore, despite the 280 character limit, you can only fit in 140 smileys!2. Sorting and Comparing
By sorting the string, we can check for duplicates by comparing each character with its next neighbour.
/**
* Determines if str has unique characters
* Converts string to an array, sorts and then compares
* @param {string} str
* @returns {boolean}
*/
function isUnique(str) {
str = Array.from(str).sort() // Convert to array and sort
for (let i = 0; i < str.length - 1; i++) {
if (str[i] === str[i + 1]) {
return false
}
}
return true
}
Time Complexity | O(n log n) |
Space Complexity | O(n). The size of the array grows linearly with the input string length. |
sort vs toSorted Array methods
sort
sorts elements of an array in-place. i.e. modifies the input array.const a = [3, 2, 1]; a.sort(); console.log(a);
Logs [1, 2, 3]
toSorted
sorts elements and returns a new array without modifying the original.const a = [3, 2, 1]; a.toSorted(); console.log(a);
Logs [3, 2, 1]
3. Using an Object Cache
We traverse the string and use an object to keep track of characters we’ve seen.
/**
* Determines if str has unique characters
* Uses a cache to remember characters seen already
* @param {string} str
* @returns {boolean}
*/
function isUnique(str) {
const cache = {}
/**
* It's important to use a for-of loop here as
* it iterates over Unicode code points
*/
for (const c of str) {
if (cache[c]) {
return false
}
cache[c] = true // Remember that this character is processed
}
return true
}
Time Complexity | O(n) |
Space Complexity | O(1). The size of the cache is constant. In the worst case, it has one entry for each character. Its size does not increase beyond this limit and is independent of input size. |
Note that this solution fails for the string, “π¨βπ¦π¨
” as the for-of
loop iterates over Unicode code points. π¨βπ¦
is composed of 2 Unicode code points and a ZWJ character, so the cache will hold, { 'π¨': true, 'β': true, 'π¦': true }
after the first emoji is processed.
4. Using Bitwise Operators
Similar to the previous method, we remember characters that are seen while traversing the string. But this time, we allocate a single bit to each character.
Shift operators return a 32-bit number. So we have 32 bits of memory. Associate each character with a one of the 32 bits. The algorithm works like this,
- Checking for Seen Characters: When encountering a character, check if it has been seen before. This is done by bitwise AND operation with a bit mask corresponding to that character’s bit position.
Memory | Bit (Incoming char) | Memory & Bit | Duplicate character |
0100 | 0100 | 0100 | Yes |
0100 | 0010 | 0000 | No |
- Marking Seen Characters: If the character is new (not seen before), mark its corresponding bit in the memory. The left shift operator << comes in handy here. Say we want to mark the 10th bit in memory as 1, we first create a bit mask,
bit = 1 << 9;
and then store it in memory using bitwise OR,memory = memory | bit;
Memory | Bit | New Memory = Memory | Bit. |
1000 | 0010 | 1010 |
This method effectively manages memory using just 32 bits, suitable for scenarios with a limited character set like lowercase English alphabets [a-z].
/**
* Determines if str has unique characters
* Uses bitwise operators to remember characters seen by
* setting bits in a 32 bit number
* @param {string} str
* @returns {boolean}
*/
function isUnique(str) {
/**
* codePointAt returns a non-negative integer that is the Unicode code point
* value of the character
*/
const codePointOfa = 'a'.codePointAt(0)
let memory = 0
for (const c of str) {
// Subtract from "a" to get values between 0 and 25
const codePoint = c.codePointAt(0) - codePointOfa
const bit = 1 << codePoint
if ((memory & bit) > 0) {
return false
}
memory = memory | bit // Save in memory
}
return true
}
Time Complexity | O(n) |
Space Complexity | O(1). The size of the memory variable is fixed to 32 bits. |
This algorithm can be extended to more characters by using a different number for each bucket of 32 characters.
Conclusion
Determining if a string has all unique characters can vary in complexity depending on the character set and the handling of special characters like emojis and grapheme clusters. Understanding these nuances ensures accurate and efficient solutions.
Your time is limited, so donβt waste it living someone elseβs life.
Steve Jobs
Book recommendation is good.
Cover pic is perfect π
Waiting for next blog. Keep writing