Think of the Lexer — also called a Tokenizer — as your eyes
reading through a page. If you read the phrase "Redis is fast", you don't think of it as 13 characters and spaces. Instead, you
instantly group those characters into three distinct concepts: [Redis], [is], and [fast].
Let's see how it operates with our classic example command:
This animation represents the process of turning a raw string into a meaningful representation. In computer science,
this process is known as Lexical Analysis — or simply lexing.
A Lexer performs this exact task for our Commands: it takes
a raw Command string and transforms it into an organized []Token.
To navigate through the input efficiently, our Lexer tracks its internal state using three fields:
- pos: The index of the character we are currently examining.
- readPos: Our look-ahead pointer, always one step ahead of pos, crucial for detecting multi-byte sequences like \r\n (covered in the Redis Serialization Protocol chapter).
- ch: The current character being examined, that is, the byte at pos.
Putting it all together, here is how our Lexer is defined
in Go:
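type Lexer struct {
	input   string // Current command as raw text
	pos     int    // Index of the character currently held in ch
	readPos int    // Index of the next character to read
	ch      byte   // Character under examination
}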
Before lexing, we need to sanitize the input and initialize our pointers. Two functions handle this:
// trimInput relies on the standard "strings" package.
func trimInput(input string) string {
	return strings.TrimRightFunc(input, func(r rune) bool {
		return r == ' ' || r == '\t'
	})
}

func New(input string) *Lexer {
	l := &Lexer{
		input:   trimInput(input),
		pos:     0,
		readPos: 0,
	}
	l.next()
	return l
}

- trimInput(): Removes trailing spaces and tabs from the command, preventing the Lexer from processing ghost characters at the end of a line.
- New(): Initializes the Lexer with our sanitized input and zeroed pointers. It immediately calls next(), a function that advances the pointers and loads the current character into ch, so that by the time the caller receives the Lexer, it is already positioned and ready to scan.
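To see exactly what the sanitizing step does and does not remove, here is a small standalone sketch that copies trimInput() and runs it on a few sample strings. The sample inputs are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// trimInput mirrors the function from the text: it strips trailing
// spaces and tabs, and nothing else.
func trimInput(input string) string {
	return strings.TrimRightFunc(input, func(r rune) bool {
		return r == ' ' || r == '\t'
	})
}

func main() {
	fmt.Printf("%q\n", trimInput("SET name redis \t ")) // "SET name redis"
	fmt.Printf("%q\n", trimInput("  GET name"))         // "  GET name"
	fmt.Printf("%q\n", trimInput("GET name\r\n"))       // "GET name\r\n"
}
```

Leading whitespace and trailing line endings are left untouched, so they still reach the scanning logic.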
Now let's look at what next() actually does:
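func (l *Lexer) next() {
	if l.readPos >= len(l.input) {
		l.ch = byte(token.EOF) // Nothing left to read
	} else {
		l.ch = l.input[l.readPos]
	}
	// Advance in both cases so that l.pos ends up one past the last
	// character and slices such as l.input[start:l.pos] include it.
	l.pos = l.readPos
	l.readPos++
}

next() drives character-by-character advancement through the input. It handles two cases:
- End of File (EOF): If readPos has reached the length of the input, ch is set to token.EOF, signaling that the input has been fully consumed.
- Advancing: Otherwise, ch is updated to the character at readPos.
In both cases, pos then catches up to readPos, and readPos increments to scout the next character. Advancing even at EOF matters: helpers that slice the input up to l.pos would otherwise drop the final character of a word that ends the input.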
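To make the pointer movement concrete, the following sketch traces pos, readPos, and ch across four calls to next() on the input "SET". It is a standalone approximation: the constant eof is a stand-in for token.EOF (assumed to be 0), and next() is written so that both pointers advance even once the end of the input is reached.

```go
package main

import "fmt"

// eof is a stand-in for token.EOF, assumed here to have the value 0.
const eof byte = 0

type Lexer struct {
	input   string
	pos     int
	readPos int
	ch      byte
}

// next loads the character at readPos into ch (or eof when the input
// is exhausted) and then advances both pointers.
func (l *Lexer) next() {
	if l.readPos >= len(l.input) {
		l.ch = eof
	} else {
		l.ch = l.input[l.readPos]
	}
	l.pos = l.readPos
	l.readPos++
}

func main() {
	l := &Lexer{input: "SET"}
	for i := 0; i < 4; i++ {
		l.next()
		fmt.Printf("pos=%d readPos=%d ch=%q\n", l.pos, l.readPos, l.ch)
	}
	// pos=0 readPos=1 ch='S'
	// pos=1 readPos=2 ch='E'
	// pos=2 readPos=3 ch='T'
	// pos=3 readPos=4 ch='\x00'
}
```

The last line of the trace shows the EOF state: ch holds 0 and pos sits one past the final character.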
Now that we have implemented next(), we need to
implement NextToken() — whose purpose is to scan the input
and return the Token we defined in the
previous chapter. We
will build it incrementally, starting with the two most fundamental cases
any Lexer must handle before anything else: knowing
when to stop and knowing what to ignore.
- EOF: token.EOF has the value 0, the NUL character in ASCII. When l.ch equals token.EOF, the input has been fully consumed and we return a Token with Kind set to token.EOF.
- Whitespace: skipWhitespaces() advances the pointers past any spaces, tabs, or newlines so that l.ch always holds a meaningful character when we begin scanning a new Token.
func (l *Lexer) NextToken() token.Token {
	// Handling EOF
	if l.ch == byte(token.EOF) {
		return token.New(token.EOF, "")
	}
	// Handling whitespaces
	l.skipWhitespaces()
	// ...
}
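The body of skipWhitespaces() is not shown here, so the following is only one possible implementation, assuming it skips exactly the characters mentioned above (spaces, tabs, and newlines). The Lexer type and next() are repeated to keep the sketch self-contained, and eof stands in for token.EOF.

```go
package main

import "fmt"

// eof is a stand-in for token.EOF, assumed here to have the value 0.
const eof byte = 0

type Lexer struct {
	input   string
	pos     int
	readPos int
	ch      byte
}

// next loads the character at readPos into ch and advances both pointers.
func (l *Lexer) next() {
	if l.readPos >= len(l.input) {
		l.ch = eof
	} else {
		l.ch = l.input[l.readPos]
	}
	l.pos = l.readPos
	l.readPos++
}

// skipWhitespaces is a hypothetical implementation: it keeps advancing
// while the current character is a space, tab, or newline. The loop
// stops on its own at the end of input because eof is not whitespace.
func (l *Lexer) skipWhitespaces() {
	for l.ch == ' ' || l.ch == '\t' || l.ch == '\n' {
		l.next()
	}
}

func main() {
	l := &Lexer{input: " \t\nSET"}
	l.next()
	l.skipWhitespaces()
	fmt.Printf("pos=%d ch=%q\n", l.pos, l.ch) // pos=3 ch='S'
}
```

Whether carriage returns should also be skipped depends on how \r\n is treated elsewhere, so that case is deliberately left out of this sketch.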
With our two fundamental cases handled, our NextToken()
is now ready to tackle the core scanning logic. There are
two kinds of values our Lexer needs to be able to read:
- Letters: for commands like SET and identifiers like userName.
- Numbers: for numeric values like 100 or 20.
Let's start with letters.
func isLetter(ch byte) bool {
	return ('a' <= ch && ch <= 'z') || ('A' <= ch && ch <= 'Z') || ch == '_'
}

func (l *Lexer) readIdent() string {
	pos := l.pos // Remember where the word starts
	for isLetter(l.ch) {
		l.next()
	}
	return l.input[pos:l.pos]
}

Two utility functions handle letter scanning:
- isLetter(): defines what counts as a valid identifier character in our Lexer. Notice that it includes the underscore _, allowing identifiers like author_name to be recognized as a single token. Digits are not included, so scanning stops at the first digit.
- readIdent(): advances through the input for as long as isLetter() returns true, then extracts the scanned word using l.input[pos:l.pos].
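The sketch below runs readIdent() on a few made-up inputs to show where scanning stops. It is self-contained, so the Lexer type and helpers are repeated; next() is written so that pos advances at the end of input too, which readIdent() relies on to include the last character of a word that ends the input.

```go
package main

import "fmt"

type Lexer struct {
	input   string
	pos     int
	readPos int
	ch      byte
}

// next loads the character at readPos into ch (0 at end of input)
// and advances both pointers.
func (l *Lexer) next() {
	if l.readPos >= len(l.input) {
		l.ch = 0
	} else {
		l.ch = l.input[l.readPos]
	}
	l.pos = l.readPos
	l.readPos++
}

func isLetter(ch byte) bool {
	return ('a' <= ch && ch <= 'z') || ('A' <= ch && ch <= 'Z') || ch == '_'
}

// readIdent consumes letters and underscores starting at the current
// position and returns them as one string.
func (l *Lexer) readIdent() string {
	start := l.pos
	for isLetter(l.ch) {
		l.next()
	}
	return l.input[start:l.pos]
}

func main() {
	for _, in := range []string{"author_name", "GET key", "user1"} {
		l := &Lexer{input: in}
		l.next()
		fmt.Printf("%q -> %q\n", in, l.readIdent())
	}
	// "author_name" -> "author_name"
	// "GET key" -> "GET"
	// "user1" -> "user"
}
```

The last case shows a consequence of the isLetter() definition: a digit ends the identifier, so "user1" is split after "user".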
Now let's see how readIdent() plugs into NextToken():
func (l *Lexer) NextToken() token.Token {
	// ... (previous EOF and whitespace logic)
	var t token.Token
	if isLetter(l.ch) {
		t.Literal = l.readIdent()
		t.Kind = token.LookupIdent(t.Literal)
		return t
	}
	// ... (Handling numbers)
}
Once skipWhitespaces() has positioned us on a meaningful
character, we check if it is a letter. If so, two things happen:
- We use readIdent() to extract the full word as a Literal.
- token.LookupIdent() examines that Literal and determines its Kind: whether it is a known command like SET, or a user-defined identifier like userName.
The Token is then returned immediately, without the Lexer
advancing any further. The next call to NextToken() will
pick up exactly where we left off.
But we haven't implemented token.LookupIdent() yet — let's
do that now.
package token

var keywords = map[string]TokenKind{
	"GET":    GET,
	"GETSET": GETSET,
	"GETEX":  GETEX,
	"GETDEL": GETDEL,
	"SET":    SET,
	"DEL":    DEL,
	"INCR":   INCR,
	"INCRBY": INCRBY,
	"DECR":   DECR,
	"DECRBY": DECRBY,
	// ... (The rest of the Commands and Keywords)
}

// LookupIdent returns the TokenKind of a known command, or IDENT otherwise.
func LookupIdent(ident string) TokenKind {
	if kw, ok := keywords[ident]; ok {
		return kw
	}
	return IDENT
}

LookupIdent() consults a lookup table that associates every known Redis command with its corresponding TokenKind:
- If the word is a known command, its specific TokenKind is returned; for example, "SET" returns SET.
- If not, it falls back to IDENT.
With this design, adding support for a new Redis command is as simple as adding a single entry to the keywords map (alongside its TokenKind constant).
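As a rough illustration of that extension point, the sketch below adds a hypothetical APPEND command to a trimmed-down keywords map. TokenKind is assumed to be a string type purely for this example; the real definition from the previous chapter may differ.

```go
package main

import "fmt"

// TokenKind is assumed to be a string here for illustration only.
type TokenKind string

const (
	IDENT  TokenKind = "IDENT"
	GET    TokenKind = "GET"
	SET    TokenKind = "SET"
	APPEND TokenKind = "APPEND" // hypothetical new command
)

var keywords = map[string]TokenKind{
	"GET":    GET,
	"SET":    SET,
	"APPEND": APPEND, // the one map entry needed for the new command
}

// LookupIdent returns the TokenKind of a known command, or IDENT otherwise.
func LookupIdent(ident string) TokenKind {
	if kw, ok := keywords[ident]; ok {
		return kw
	}
	return IDENT
}

func main() {
	for _, word := range []string{"SET", "APPEND", "userName", "set"} {
		fmt.Printf("%s -> %s\n", word, LookupIdent(word))
	}
	// SET -> SET
	// APPEND -> APPEND
	// userName -> IDENT
	// set -> IDENT
}
```

The final case is worth noticing: because the map keys are uppercase strings, the lookup as written is case-sensitive, and a lowercase "set" falls back to IDENT.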
Putting It All Together
Now that our Lexer knows how to identify both commands and
identifiers, let's see it in action with a new command:
Notice how the Lexer correctly identifies INCRBY as a command token, articleRewrites as an identifier, and 1 as a number — exactly the token
kinds we have been building toward.
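For readers who want to reproduce this result outside the animation, here is a compact, self-contained sketch of the whole pipeline. Several parts are assumptions rather than the chapter's actual code: the Token and TokenKind definitions, the NUMBER and ILLEGAL kinds, the isDigit() and readWhile() helpers, and the Tokenize() wrapper are all hypothetical, and whitespace is skipped before the EOF check so that trailing newlines do not produce a stray token.

```go
package main

import "fmt"

type TokenKind string // assumed representation, for illustration only

const (
	EOF     TokenKind = "EOF"
	ILLEGAL TokenKind = "ILLEGAL" // hypothetical kind for unknown characters
	IDENT   TokenKind = "IDENT"
	NUMBER  TokenKind = "NUMBER" // hypothetical kind for numeric literals
	INCRBY  TokenKind = "INCRBY"
)

type Token struct {
	Kind    TokenKind
	Literal string
}

var keywords = map[string]TokenKind{"INCRBY": INCRBY}

type Lexer struct {
	input   string
	pos     int
	readPos int
	ch      byte
}

func New(input string) *Lexer {
	l := &Lexer{input: input}
	l.next()
	return l
}

// next loads the character at readPos into ch (0 at end of input)
// and advances both pointers.
func (l *Lexer) next() {
	if l.readPos >= len(l.input) {
		l.ch = 0
	} else {
		l.ch = l.input[l.readPos]
	}
	l.pos = l.readPos
	l.readPos++
}

func isLetter(ch byte) bool {
	return ('a' <= ch && ch <= 'z') || ('A' <= ch && ch <= 'Z') || ch == '_'
}

func isDigit(ch byte) bool { return '0' <= ch && ch <= '9' }

// readWhile returns the run of characters, starting at the current
// position, for which ok reports true.
func (l *Lexer) readWhile(ok func(byte) bool) string {
	start := l.pos
	for ok(l.ch) {
		l.next()
	}
	return l.input[start:l.pos]
}

func (l *Lexer) NextToken() Token {
	for l.ch == ' ' || l.ch == '\t' || l.ch == '\n' {
		l.next()
	}
	switch {
	case l.ch == 0:
		return Token{Kind: EOF}
	case isLetter(l.ch):
		lit := l.readWhile(isLetter)
		if kind, ok := keywords[lit]; ok {
			return Token{Kind: kind, Literal: lit}
		}
		return Token{Kind: IDENT, Literal: lit}
	case isDigit(l.ch):
		return Token{Kind: NUMBER, Literal: l.readWhile(isDigit)}
	}
	tok := Token{Kind: ILLEGAL, Literal: string(l.ch)}
	l.next() // step past the unknown character to avoid looping forever
	return tok
}

// Tokenize collects tokens until EOF is reached.
func Tokenize(input string) []Token {
	l := New(input)
	var toks []Token
	for tok := l.NextToken(); tok.Kind != EOF; tok = l.NextToken() {
		toks = append(toks, tok)
	}
	return toks
}

func main() {
	for _, tok := range Tokenize("INCRBY articleRewrites 1") {
		fmt.Printf("%s %q\n", tok.Kind, tok.Literal)
	}
}
```

Running it on "INCRBY articleRewrites 1" yields three tokens (an INCRBY command, an IDENT, and a NUMBER) before the loop stops at EOF.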