Understanding the Lexer Mechanics

Think of the Lexer — also called a Tokenizer — as your eyes reading through a page. If you read the phrase "Redis is fast", you don't think of it as 13 characters and spaces. Instead, you instantly group those characters into three distinct concepts: [Redis], [is], and [fast].

Let's see how it operates with our classic example command:

This animation represents the process of turning a raw string into a meaningful representation. In computer science, this process is known as Lexical Analysis — or simply lexing. A Lexer performs this exact task for our Commands: it takes a raw Command string and transforms it into an organized []Token.

[Animation: the command is split into the following tokens, shown as literal and kind]

  SET         SET
  username    IDENTIFIER
  Alejandro   IDENTIFIER

To navigate through the input efficiently, our Lexer tracks its internal state using three fields:

  1. pos: The index of the character we are currently examining.
  2. readPos: Our look-ahead pointer, always one step ahead of pos and crucial for detecting multi-byte sequences like \r\n (covered in the Redis Serialization Protocol chapter).
  3. ch: The character currently being examined, that is, the byte at pos.

Putting it all together, here is how our Lexer is defined in Go:

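Lexer Definition
type Lexer struct {
  input   string // Current command as raw text

  pos     int    // Index of the character held in ch
  readPos int    // Index of the next character to read (look-ahead)
  ch      byte   // Character currently being examined
}

Before lexing, we need to sanitize the input and initialize our pointers. Two functions handle this:

Lexer Initialization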
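// trimInput strips trailing spaces and tabs. It relies on the standard
// library "strings" package, so the file needs: import "strings"
func trimInput(input string) string {
  return strings.TrimRightFunc(input, func(r rune) bool {
    return r == ' ' || r == '\t'
  })
}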

func New(input string) *Lexer {
  l := &Lexer{
    input:   trimInput(input),
    pos:     0,
    readPos: 0,
  }

  l.next()

  return l
}

  1. trimInput(): Removes trailing spaces and tabs from the end of the command, preventing the Lexer from processing ghost characters at the end of a line.
  2. New(): Initializes the Lexer with our sanitized input and zeroed pointers. It immediately calls next(), a function that advances the pointers and loads the current character into ch, so that by the time the caller receives the Lexer, it is already positioned and ready to scan.
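
To make the effect of trimInput() concrete, the sketch below runs it on a few inputs. The function body is the one shown above; the sample strings are made up for illustration. As written, only trailing spaces and tabs are removed, so leading whitespace and a trailing \r\n are left in place.

```go
package main

import (
	"fmt"
	"strings"
)

// trimInput is the helper from this section: it strips trailing spaces and tabs.
func trimInput(input string) string {
	return strings.TrimRightFunc(input, func(r rune) bool {
		return r == ' ' || r == '\t'
	})
}

func main() {
	// Illustrative inputs only; they are not taken from the original text.
	samples := []string{"GET name   ", "GET name\t \t", "  GET name", "GET name\r\n"}
	for _, s := range samples {
		fmt.Printf("%q -> %q\n", s, trimInput(s))
	}
}
```

The last two samples come back unchanged, which is why whitespace still has to be skipped during scanning.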

Now let's look at what next() actually does:

next() definition
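func (l *Lexer) next() {
  if l.readPos >= len(l.input) {
    // Past the last character: mark the end of the input.
    l.ch = byte(token.EOF)
  } else {
    l.ch = l.input[l.readPos]
  }

  // Advance in both cases. Returning early at EOF would leave pos on the
  // last character, and a slice such as l.input[start:l.pos] would then
  // drop the final character of a word that ends the input.
  l.pos = l.readPos
  l.readPos++
}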

next() drives character-by-character advancement through the input. It handles two cases:

  1. End of File (EOF): If readPos has reached the end of the input (readPos >= len(l.input)), ch is set to token.EOF, signaling that the input has been fully consumed.
  2. Advancing: Otherwise, ch is updated to the character at readPos. Then pos catches up to readPos, and readPos increments to scout the next character.
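
The pointer movement is easier to follow as a trace. The sketch below is a minimal, self-contained approximation: eof stands in for token.EOF, trace() is a hypothetical helper added only for this illustration, and this variant of next() moves pos forward even when the end of the input is reached (an assumption of the sketch).

```go
package main

import "fmt"

// eof stands in for token.EOF, which the text says has the value 0.
const eof byte = 0

// Lexer mirrors the struct from this section.
type Lexer struct {
	input   string
	pos     int
	readPos int
	ch      byte
}

// next loads the byte at readPos into ch and moves both pointers forward.
// Assumption of this sketch: pos also advances once the end is reached.
func (l *Lexer) next() {
	if l.readPos >= len(l.input) {
		l.ch = eof
	} else {
		l.ch = l.input[l.readPos]
	}
	l.pos = l.readPos
	l.readPos++
}

// trace (hypothetical helper) calls next() the given number of times and
// records the Lexer state after each call.
func trace(input string, steps int) []string {
	l := &Lexer{input: input}
	var out []string
	for i := 0; i < steps; i++ {
		l.next()
		out = append(out, fmt.Sprintf("pos=%d readPos=%d ch=%q", l.pos, l.readPos, l.ch))
	}
	return out
}

func main() {
	// Four calls on a three-byte input: the last call hits the EOF branch.
	for _, line := range trace("GET", 4) {
		fmt.Println(line)
	}
}
```

After every call, readPos is exactly one ahead of pos, and ch holds the byte at pos until the input runs out.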

Now that we have implemented next(), we need to implement NextToken() — whose purpose is to scan the input and return the Token we defined in the previous chapter. We will build it incrementally, starting with the two most fundamental cases any Lexer must handle before anything else: knowing when to stop and knowing what to ignore.

  1. EOF: token.EOF has the value 0, the NUL character in ASCII. When l.ch equals token.EOF, the input has been fully consumed and we return a Token with Kind set to token.EOF.
  2. Whitespace: skipWhitespaces() advances the pointers past any spaces, tabs, or newlines so that ch always holds a meaningful character when we begin scanning a new Token.

NextToken() First Part Definition
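func (l *Lexer) NextToken() token.Token {
  // Handling whitespaces first, so that input ending in whitespace
  // (for example a trailing newline) still reaches the EOF check.
  l.skipWhitespaces()

  // Handling EOF
  if l.ch == byte(token.EOF) {
    return token.New(token.EOF, "")
  }

  // ... (core scanning logic)
}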
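
skipWhitespaces() is called above but its body is not listed in this section. A minimal sketch, assuming it treats spaces, tabs, carriage returns, and newlines as whitespace and relies on next() to advance. The body below, the eof constant, and the firstMeaningful() helper are assumptions made for illustration, not the original implementation.

```go
package main

import "fmt"

// eof stands in for token.EOF (value 0, per the text).
const eof byte = 0

type Lexer struct {
	input   string
	pos     int
	readPos int
	ch      byte
}

// next is the advancing function from this section, with pos moving at EOF too.
func (l *Lexer) next() {
	if l.readPos >= len(l.input) {
		l.ch = eof
	} else {
		l.ch = l.input[l.readPos]
	}
	l.pos = l.readPos
	l.readPos++
}

// skipWhitespaces (hypothetical body) advances until ch is not whitespace.
// The loop ends on its own at the end of input because eof is not whitespace.
func (l *Lexer) skipWhitespaces() {
	for l.ch == ' ' || l.ch == '\t' || l.ch == '\r' || l.ch == '\n' {
		l.next()
	}
}

// firstMeaningful (hypothetical helper) returns the first non-whitespace
// byte of input, or eof if there is none.
func firstMeaningful(input string) byte {
	l := &Lexer{input: input}
	l.next()
	l.skipWhitespaces()
	return l.ch
}

func main() {
	fmt.Printf("%q\n", firstMeaningful("  \t SET"))
	fmt.Printf("%q\n", firstMeaningful(" \r\n"))
}
```

The second call shows why the order of the two checks matters: skipping whitespace can itself land on the end of the input.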

With our two fundamental cases handled, our NextToken() is now ready to tackle the core scanning logic. There are two kinds of values our Lexer needs to be able to read:

  1. Letters — for commands like SET and identifiers like userName.
  2. Numbers — for numeric values like 100 or 20.

Let's start with letters.

Handling Letters
func isLetter(ch byte) bool {
  return ('a' <= ch && ch <= 'z') || ('A' <= ch && ch <= 'Z') || ch == '_'
}

func (l *Lexer) readIdent() string {
  pos := l.pos

  for isLetter(l.ch) {
    l.next()
  }

  return l.input[pos : l.pos]
}

Two utility functions handle letter scanning:

  1. isLetter(): defines which characters may appear in a command or identifier. Notice that it includes the underscore _, allowing identifiers like author_name to be recognized as a single token.
  2. readIdent(): advances through the input for as long as isLetter() returns true, then extracts the scanned word using l.input[pos:l.pos].
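
The two helpers can be exercised on their own. The sketch below is self-contained and makes two assumptions: firstIdent() is a hypothetical wrapper added for the demonstration, and next() advances pos at the end of the input so that a word ending the input is sliced in full.

```go
package main

import "fmt"

type Lexer struct {
	input   string
	pos     int
	readPos int
	ch      byte
}

// next advances one byte; ch becomes 0 once the input is exhausted.
// Assumption of this sketch: pos moves forward in that case as well.
func (l *Lexer) next() {
	if l.readPos >= len(l.input) {
		l.ch = 0
	} else {
		l.ch = l.input[l.readPos]
	}
	l.pos = l.readPos
	l.readPos++
}

func isLetter(ch byte) bool {
	return ('a' <= ch && ch <= 'z') || ('A' <= ch && ch <= 'Z') || ch == '_'
}

// readIdent consumes letters and underscores starting at the current position.
func (l *Lexer) readIdent() string {
	pos := l.pos
	for isLetter(l.ch) {
		l.next()
	}
	return l.input[pos:l.pos]
}

// firstIdent (hypothetical helper) returns the identifier at the start of input.
func firstIdent(input string) string {
	l := &Lexer{input: input}
	l.next()
	return l.readIdent()
}

func main() {
	for _, in := range []string{"author_name 42", "SET", "user1"} {
		fmt.Printf("%q -> %q\n", in, firstIdent(in))
	}
}
```

Because isLetter() does not accept digits, scanning "user1" stops before the 1 and yields "user".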

Now let's see how readIdent() plugs into NextToken():

NextToken() Second Part Definition
func (l *Lexer) NextToken() token.Token {
  // ... (previous EOF and whitespace logic)

  var t token.Token

  if isLetter(l.ch) {
    t.Literal = l.readIdent()
    t.Kind = token.LookupIdent(t.Literal)
    return t
  } 

  // ... (Handling numbers)
}

Once skipWhitespaces() has positioned us on a meaningful character, we check if it is a letter. If so, two things happen:

  1. We use readIdent() to extract the full word as a Literal.
  2. token.LookupIdent() examines that Literal and determines its Kind — whether it is a known command like SET, or a user-defined identifier like userName.

The Token is then returned immediately, without the Lexer advancing any further. The next call to NextToken() will pick up exactly where we left off.

But we haven't implemented token.LookupIdent() yet — let's do that now.

LookupIdent() Definition
package token

var keywords = map[string]TokenKind{
  "GET":    GET,
  "GETSET": GETSET,
  "GETEX":  GETEX,
  "GETDEL": GETDEL,
  "SET":    SET,
  "DEL":    DEL,
  "INCR":   INCR,
  "INCRBY": INCRBY,
  "DECR":   DECR,
  "DECRBY": DECRBY,
  // ... (The rest of the Commands and Keywords)
}

func LookupIdent(ident string) TokenKind {
  kw, ok := keywords[ident]

  if ok {
    return kw
  }

  return IDENT
}

LookupIdent() consults a lookup table that associates every known Redis command with its corresponding TokenKind:

  1. If the word is a known command — for example "SET" returns SET — its specific TokenKind is returned.
  2. If not, it falls back to IDENT.

With this design, adding support for a new Redis command is as simple as adding a single entry to the keywords map.
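
A short usage sketch shows the fallback in practice. It is self-contained, so TokenKind and its constants are local stand-ins for the token package from the previous chapter; a string type is assumed here only to keep the printed output readable, and the keywords map is cut down to three entries.

```go
package main

import "fmt"

// TokenKind and the constants below are stand-ins for the token package;
// their concrete type and values are assumptions of this sketch.
type TokenKind string

const (
	IDENT  TokenKind = "IDENT"
	GET    TokenKind = "GET"
	SET    TokenKind = "SET"
	INCRBY TokenKind = "INCRBY"
)

// keywords is a reduced version of the lookup table from this section.
var keywords = map[string]TokenKind{
	"GET":    GET,
	"SET":    SET,
	"INCRBY": INCRBY,
}

// LookupIdent returns the command kind for known words and IDENT otherwise.
func LookupIdent(ident string) TokenKind {
	if kw, ok := keywords[ident]; ok {
		return kw
	}
	return IDENT
}

func main() {
	for _, word := range []string{"SET", "INCRBY", "userName", "set"} {
		fmt.Printf("%-8s -> %s\n", word, LookupIdent(word))
	}
}
```

The map lookup is case-sensitive, so with the table as shown a lowercase "set" falls back to IDENT rather than SET.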

Putting it all Together

Now that our Lexer knows how to identify both commands and identifiers, let's see it in action with a new command:

Notice how the Lexer correctly identifies INCRBY as a command token, articleRewrites as an identifier, and 1 as a number — exactly the token kinds we have been building toward.
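
The same run can be approximated in code. The sketch below is compact and self-contained, so it collapses the token package into local types. Several parts are assumptions because they are not listed in this section: the number branch (isDigit and the NUMBER kind), the ILLEGAL fallback, the inline whitespace loop standing in for skipWhitespaces(), and the readWhile() and Lex() helpers.

```go
package main

import "fmt"

// TokenKind is a local stand-in; the real one lives in the token package.
type TokenKind string

const (
	EOF     TokenKind = "EOF"
	IDENT   TokenKind = "IDENT"
	NUMBER  TokenKind = "NUMBER"  // hypothetical kind for numeric literals
	ILLEGAL TokenKind = "ILLEGAL" // hypothetical kind for unknown characters
	SET     TokenKind = "SET"
	INCRBY  TokenKind = "INCRBY"
)

type Token struct {
	Kind    TokenKind
	Literal string
}

var keywords = map[string]TokenKind{"SET": SET, "INCRBY": INCRBY}

func LookupIdent(s string) TokenKind {
	if kw, ok := keywords[s]; ok {
		return kw
	}
	return IDENT
}

type Lexer struct {
	input        string
	pos, readPos int
	ch           byte
}

// next advances one byte; ch becomes 0 once the input is exhausted.
func (l *Lexer) next() {
	if l.readPos >= len(l.input) {
		l.ch = 0
	} else {
		l.ch = l.input[l.readPos]
	}
	l.pos = l.readPos
	l.readPos++
}

func isLetter(ch byte) bool {
	return ('a' <= ch && ch <= 'z') || ('A' <= ch && ch <= 'Z') || ch == '_'
}
func isDigit(ch byte) bool { return '0' <= ch && ch <= '9' } // assumption

// readWhile (hypothetical) consumes bytes while ok(ch) holds and returns them.
func (l *Lexer) readWhile(ok func(byte) bool) string {
	start := l.pos
	for ok(l.ch) {
		l.next()
	}
	return l.input[start:l.pos]
}

func (l *Lexer) NextToken() Token {
	for l.ch == ' ' || l.ch == '\t' || l.ch == '\r' || l.ch == '\n' {
		l.next() // inline stand-in for skipWhitespaces()
	}
	switch {
	case l.ch == 0:
		return Token{EOF, ""}
	case isLetter(l.ch):
		lit := l.readWhile(isLetter)
		return Token{LookupIdent(lit), lit}
	case isDigit(l.ch):
		return Token{NUMBER, l.readWhile(isDigit)}
	}
	lit := string(l.ch) // unknown byte: emit it and move on
	l.next()
	return Token{ILLEGAL, lit}
}

// Lex (hypothetical) collects every token up to and including EOF.
func Lex(input string) []Token {
	l := &Lexer{input: input}
	l.next()
	var toks []Token
	for {
		t := l.NextToken()
		toks = append(toks, t)
		if t.Kind == EOF {
			return toks
		}
	}
}

func main() {
	for _, t := range Lex("INCRBY articleRewrites 1") {
		fmt.Printf("%-8s %q\n", t.Kind, t.Literal)
	}
}
```

Each call to NextToken() resumes from where the previous one stopped, which is what lets a plain loop drain the whole command.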