Runes, Bytes and Graphemes in Go
In Go, strings are bytes, runes are Unicode code points, and graphemes are what users actually see. Pick the right one before slicing, counting, or reversing text.
I first hit the difference between runes, bytes and graphemes while handling Tamil names and emoji in a Go web app: a string that looked short wasn’t, and reversing it produced gibberish. The culprit wasn’t a flaw in Go; it was my own assumptions about what “a character” means.
Let’s map the territory precisely:
1. Bytes. The raw material Go calls a string
Go represents strings as immutable UTF-8 byte sequences.
What we see isn’t what Go handles under the hood.
s := "வணக்கம்"
fmt.Println(len(s)) // 21
The length is 21 bytes, not 21 visible symbols. Each Tamil code point in this string takes 3 bytes in UTF-8, and even simple-looking emoji stretch across multiple bytes.
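To make the byte view concrete, here is a minimal sketch using only the standard library. The helper name byteView is hypothetical and exists only for illustration; it assumes n is no larger than len(s). It shows that indexing and slicing operate on raw bytes, so a cut at an arbitrary offset can leave invalid UTF-8 behind.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteView (hypothetical helper) reports the byte length of s and whether
// its first n bytes still form valid UTF-8 on their own.
// Assumes 0 <= n <= len(s); a larger n would panic on the slice.
func byteView(s string, n int) (length int, prefixValid bool) {
	return len(s), utf8.ValidString(s[:n])
}

func main() {
	s := "வணக்கம்"

	fmt.Println(len(s))        // 21: bytes, not letters
	fmt.Printf("% x\n", s[:3]) // e0 ae b5: the three bytes that encode "வ"

	_, ok := byteView(s, 3)
	fmt.Println(ok) // true: the cut lands on a code point boundary

	_, ok = byteView(s, 4)
	fmt.Println(ok) // false: the cut lands inside the encoding of "ண"
}
```

Slicing never fails on its own; it simply hands back whatever bytes fall in the range, which is why a bad cut tends to show up later as a replacement character in the output.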
2. Runes. Unicode code points
Converting a string to []rune gives you code points, but that is still not what a human perceives.
rs := []rune(s)
fmt.Println(len(rs)) // 7
Here it’s 7 runes, but some Tamil graphemes (like “க்”) combine two runes: க + ்.
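You do not have to allocate a []rune to walk code points. As a small sketch (runeOffsets is a hypothetical helper name), ranging over a string decodes one rune per iteration and reports the byte offset where it starts, and utf8.RuneCountInString counts runes without a conversion.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets (hypothetical helper) returns the byte offset at which each
// code point of s starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s { // ranging over a string decodes UTF-8 one rune at a time
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "வணக்கம்"

	fmt.Println(utf8.RuneCountInString(s)) // 7, without building a []rune

	for i, r := range s {
		fmt.Printf("byte %2d: %U\n", i, r) // offset and code point of each rune
	}

	fmt.Println(runeOffsets(s)) // [0 3 6 9 12 15 18]
}
```

The offsets step by 3 because every code point in this particular string needs three bytes; mixed text would show uneven steps.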
3. Grapheme clusters. The units users actually see
Go’s standard library stops at runes. To work with visible characters, you need a grapheme-aware library such as github.com/rivo/uniseg.
for gr := uniseg.NewGraphemes(s); gr.Next(); {
	fmt.Printf("%q\n", gr.Str()) // one grapheme cluster per line
}
That prints what a human reads: “வ”, “ண”, “க்”, “க”, “ம்”. An emoji such as “❤️”, which is made of two code points, would likewise come out as a single unit.
Why this matters
If your app deals with names, chats, or any multilingual text, indexing by bytes will break things. Counting runes helps, but it can still split what you intend as one unit. Grapheme-aware operations align with what users actually expect.
Real bugs I’ve seen: Tamil names chopped mid-character, emoji reactions breaking because only one code point was taken.
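Both kinds of bug can be reproduced with a rune-based truncation. The sketch below uses a hypothetical helper, firstRunes, that keeps the first n code points. It never emits invalid UTF-8, yet it still cuts through units a reader sees as one character.

```go
package main

import "fmt"

// firstRunes (hypothetical helper) returns the first n code points of s,
// or all of s if it is shorter. The result is always valid UTF-8, but the
// cut may fall in the middle of a grapheme cluster.
func firstRunes(s string, n int) string {
	rs := []rune(s)
	if n < 0 {
		n = 0
	}
	if n > len(rs) {
		n = len(rs)
	}
	return string(rs[:n])
}

func main() {
	// The third rune is the base letter of "க்"; its virama (pulli) is the
	// fourth rune, so truncating to 3 runes silently changes the letter.
	fmt.Println(firstRunes("வணக்கம்", 3)) // வணக

	// Red heart: U+2764 followed by variation selector U+FE0F.
	heart := "\u2764\ufe0f"
	fmt.Printf("%+q\n", firstRunes(heart, 1)) // "\u2764": the selector is dropped
}
```

Nothing crashes in either case, which is what makes these bugs easy to ship: the output is well-formed text that just isn’t what the user typed.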
To put it simply
| Task | Approach |
| --- | --- |
| Count code points | utf8.RuneCountInString(s) |
| Count visible units | Grapheme iteration (uniseg) |
| Reverse text | Parse into graphemes, reverse slice, join |
| Slice safely | Only use s[i:j] on grapheme boundaries |
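The “parse, reverse, join” recipe can be sketched with the standard library alone, under one loud assumption: the hypothetical approxClusters helper below only glues combining marks (Unicode category M) onto the preceding code point. That is a deliberate simplification and not the full Unicode text segmentation rules (UAX #29); it does not handle ZWJ emoji sequences, flags, or Hangul jamo. For real input, a grapheme-aware library such as uniseg is the safer choice.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// approxClusters (hypothetical helper) groups each code point with the
// combining marks that follow it.
// ASSUMPTION: a simplification for illustration, NOT full UAX #29 grapheme
// segmentation. It is enough for the Tamil example and for U+FE0F.
func approxClusters(s string) []string {
	var clusters []string
	for _, r := range s {
		if unicode.Is(unicode.M, r) && len(clusters) > 0 {
			// Attach the mark to the cluster it modifies.
			clusters[len(clusters)-1] += string(r)
			continue
		}
		clusters = append(clusters, string(r))
	}
	return clusters
}

// reverseClusters reverses s cluster by cluster, so marks stay attached
// to their base letters.
func reverseClusters(s string) string {
	c := approxClusters(s)
	for i, j := 0, len(c)-1; i < j; i, j = i+1, j-1 {
		c[i], c[j] = c[j], c[i]
	}
	return strings.Join(c, "")
}

func main() {
	s := "வணக்கம்"
	fmt.Println(len(approxClusters(s))) // 5 visible units
	fmt.Println(reverseClusters(s))     // marks still follow their base letters
}
```

Reversing the same string rune by rune would instead start with a dangling virama, which is the gibberish described at the top of this post.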
Think about what you intend to manipulate: the raw bytes, the code points, or what a user actually reads on screen, and choose the right level.