Class: JsDuck::Lexer

Inherits:
Object
Defined in:
lib/jsduck/lexer.rb

Overview

Tokenizes JavaScript code into lexical tokens.

Each token has a type and value. Types and possible values for them are as follows:

  • :number – 25

  • :string – “Hello world”

  • :ident – “foo”

  • :regex – “/abc/i”

  • :operator – “+”

  • :doc_comment – “/** My comment */”

Plus a separate type for each keyword: :if, :while, :function, … For keywords the type and the value are the same.

Note that doc-comments are recognized as tokens, while normal comments are ignored, just like whitespace.
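To make the token shape concrete, here is a minimal sketch of a much-reduced tokenizer built on StringScanner. It is not the JsDuck implementation: `sketch_tokenize` and `SKETCH_KEYWORDS` are hypothetical names, and only a handful of token types are handled. It shows tokens as type/value hashes, keywords whose type equals their value, and ordinary comments being dropped while doc-comments are kept.

```ruby
require 'strscan'

# Hypothetical, much-reduced keyword table (the real one is KEYWORDS).
SKETCH_KEYWORDS = { "if" => :if, "var" => :var, "function" => :function }

# Hypothetical helper: returns an array of {:type, :value} hashes.
def sketch_tokenize(source)
  input = StringScanner.new(source)
  tokens = []
  until input.eos?
    if input.scan(/\s+/)
      # whitespace is skipped
    elsif input.check(/\/\*\*[^\/]/)
      # doc-comment: kept as a token
      tokens << { :type => :doc_comment, :value => input.scan_until(/\*\/|\z/) }
    elsif input.scan(/\/\*.*?(\*\/|\z)/m) || input.scan(/\/\/[^\n]*/)
      # ordinary block and line comments are skipped
    elsif (word = input.scan(/[a-zA-Z_$][$\w]*/))
      kw = SKETCH_KEYWORDS[word]
      tokens << { :type => kw || :ident, :value => kw || word }
    elsif (nr = input.scan(/[0-9]+(\.[0-9]*)?/))
      tokens << { :type => :number, :value => nr }
    else
      # anything else is treated as a one-character operator
      tokens << { :type => :operator, :value => input.scan(/./m) }
    end
  end
  tokens
end

tokens = sketch_tokenize("/** Doc */ var foo = 25; // note\n/* plain */")
tokens.each { |t| puts t.inspect }
```

In this sketch the two ordinary comments produce no tokens at all, and the number value is the scanned text ("25") rather than a numeric object.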

Constant Summary

META_REGEX =

A regex that matches a JavaScript regex literal.

%r{
  /               (?# beginning    )
  (
    [^/\[\\]      (?# any character except \ / [    )
    |
    \\.           (?# an escaping \ followed by any character    )
    |
    \[ ([^\]\\]|\\.)* \]    (?# [...] containing any characters including /    )
                            (?# except \ ] which have to be escaped    )
  )*
  (/[gim]*|\Z)   (?# ending + modifiers    )
}x
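The behaviour of this pattern is easiest to see on a few inputs. The sketch below copies the pattern without its inline comments into a hypothetical constant `SKETCH_META_REGEX` and wraps it in a hypothetical helper, `scan_regex_literal`, that scans from the start of a string the way a StringScanner would.

```ruby
require 'strscan'

# Copy of the pattern above with the inline (?# ...) comments removed.
SKETCH_META_REGEX = %r{
  /
  (
    [^/\[\\]
    |
    \\.
    |
    \[ ([^\]\\]|\\.)* \]
  )*
  (/[gim]*|\Z)
}x

# Hypothetical helper: scans one regex literal from the start of the string.
def scan_regex_literal(source)
  StringScanner.new(source).scan(SKETCH_META_REGEX)
end

puts scan_regex_literal('/abc/i.test(x)')  # /abc/i
puts scan_regex_literal('/[/]+/g, 1')      # /[/]+/g   (slash inside [...] is allowed)
puts scan_regex_literal('/a\/b/ + 1')      # /a\/b/    (escaped slash)
puts scan_regex_literal('/unterminated')   # /unterminated (runs to end of input)
```

The last case shows the effect of the `\Z` alternative: an unterminated literal is consumed up to the end of the input instead of failing to match.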
KEYWORDS =
{
  "break" => :break,
  "case" => :case,
  "catch" => :catch,
  "continue" => :continue,
  "default" => :default,
  "delete" => :delete,
  "do" => :do,
  "else" => :else,
  "finally" => :finally,
  "for" => :for,
  "function" => :function,
  "if" => :if,
  "in" => :in,
  "instanceof" => :instanceof,
  "new" => :new,
  "return" => :return,
  "switch" => :switch,
  "this" => :this,
  "throw" => :throw,
  "try" => :try,
  "typeof" => :typeof,
  "var" => :var,
  "void" => :void,
  "while" => :while,
  "with" => :with,
}

Instance Method Summary

Constructor Details

#initialize(input) ⇒ Lexer

Input can be either a String or a StringScanner.

In the latter case, only #next advances the scanpointer of the StringScanner, which allows context-switching in the middle of parsing a string. This is needed to parse JavaScript embedded in doc-comments.



# File 'lib/jsduck/lexer.rb', line 30

def initialize(input)
  @input = input.is_a?(StringScanner) ? input : StringScanner.new(input)
  @buffer = []
end
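The context-switching described above can be sketched as follows. `SketchLexer` and its `next_ident` method are hypothetical stand-ins, not part of JsDuck. The only piece taken from the constructor is the String-or-StringScanner normalization. The point is that a caller can scan part of a string itself, hand the same scanner over, and then continue from wherever the lexer stopped.

```ruby
require 'strscan'

# Hypothetical stand-in that reuses the constructor logic shown above.
class SketchLexer
  attr_reader :input

  def initialize(input)
    @input = input.is_a?(StringScanner) ? input : StringScanner.new(input)
  end

  # Hypothetical: consumes one identifier and advances the shared scanpointer.
  def next_ident
    @input.scan(/\s*/)
    @input.scan(/[$\w]+/)
  end
end

# The caller scans a tag by itself, then lets the lexer read what follows.
scanner = StringScanner.new("@cfg foo.bar rest")
scanner.scan(/@cfg/)
lexer = SketchLexer.new(scanner)
ident = lexer.next_ident

puts ident         # foo
puts scanner.rest  # .bar rest
```

Because the scanner object is shared rather than copied, the caller sees the updated position through its own `scanner` reference.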

Instance Method Details

#buffer_tokens(n) ⇒ Object

Ensures that the next n tokens are read into the buffer.

At the end of buffering, the scanpointer is restored to its initial position. Only the #next method advances the scanpointer in a way that’s visible outside this class.



# File 'lib/jsduck/lexer.rb', line 88

def buffer_tokens(n)
  prev_pos = @input.pos
  @input.pos = @buffer.last[:pos] if @buffer.last
  (n - @buffer.length).times do
    @previous_token = tok = next_token
    if tok
      # remember scanpointer position after each token
      tok[:pos] = @input.pos
      @buffer << tok
    end
  end
  @input.pos = prev_pos
end
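The save-and-restore idea behind this method can be tried in isolation. The following is a simplified sketch, not the method itself: `peek_words` is a hypothetical helper that reads whitespace-separated words instead of real tokens, records the scanpointer position after each one, and then puts the scanpointer back where it started.

```ruby
require 'strscan'

# Hypothetical helper: reads up to n words ahead, then restores the scanpointer.
def peek_words(input, n)
  prev_pos = input.pos
  words = []
  n.times do
    input.scan(/\s+/)
    word = input.scan(/\S+/)
    break unless word
    # remember where the scanpointer stood right after this word
    words << { :value => word, :pos => input.pos }
  end
  input.pos = prev_pos
  words
end

input = StringScanner.new("foo = bar")
ahead = peek_words(input, 2)

puts ahead.inspect
puts input.pos  # 0, lookahead left the scanpointer untouched

# Consuming the first word later only requires jumping to its stored position.
input.pos = ahead.first[:pos]
puts input.rest
```

Storing the position with each buffered item is what lets a later consumer advance the scanpointer by exactly one item, without rescanning.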

#empty? ⇒ Boolean

True when there are no more tokens.

Returns:

  • (Boolean)


# File 'lib/jsduck/lexer.rb', line 78

def empty?
  buffer_tokens(1)
  return !@buffer.first
end

#look(*tokens) ⇒ Object

Tests whether the given pattern matches the tokens that follow the current position.

Takes a list of strings and symbols. Symbols are compared to the token type, strings to the token value. For example:

look(:ident, "=", :regex)


# File 'lib/jsduck/lexer.rb', line 43

def look(*tokens)
  buffer_tokens(tokens.length)
  i = 0
  tokens.all? do |t|
    tok = @buffer[i]
    i += 1
    if !tok
      false
    elsif t.instance_of?(Symbol)
      tok[:type] == t
    else
      tok[:value] == t
    end
  end
end
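The comparison rule can be exercised without a lexer by applying it to a hand-written token buffer. In this sketch, `tokens_match?` is a hypothetical helper that follows the same rule as the method above (symbols against :type, everything else against :value), and the buffer contents are made up for illustration.

```ruby
# Hypothetical helper mirroring the comparison rule of #look.
def tokens_match?(buffer, *pattern)
  pattern.each_with_index.all? do |expected, i|
    tok = buffer[i]
    if tok.nil?
      false                      # pattern is longer than the buffered tokens
    elsif expected.is_a?(Symbol)
      tok[:type] == expected     # symbols are compared to the token type
    else
      tok[:value] == expected    # strings are compared to the token value
    end
  end
end

# Made-up buffer for the source text: foo = /abc/i
buffer = [
  { :type => :ident,    :value => "foo" },
  { :type => :operator, :value => "=" },
  { :type => :regex,    :value => "/abc/i" },
]

puts tokens_match?(buffer, :ident, "=", :regex)       # true
puts tokens_match?(buffer, :ident, "=", :string)      # false
puts tokens_match?(buffer, :ident, "=", :regex, ";")  # false
```

Since keyword tokens carry a symbol as both type and value, a keyword is matched with a symbol such as :function, not with the string "function".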

#next(full = false) ⇒ Object

Returns the value of the next token and advances the token cursor past it.

When full=true, returns the full token as a hash, like so:

{:type => :ident, :value => "foo"}

For doc-comments the full token also contains the field :linenr, pointing to the line where the doc-comment began.



# File 'lib/jsduck/lexer.rb', line 69

def next(full=false)
  buffer_tokens(1)
  tok = @buffer.shift
  # advance the scanpointer to the position after this token
  @input.pos = tok[:pos]
  full ? tok : tok[:value]
end
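The :linenr field mentioned above is derived from the scanpointer position by counting the newlines that precede it. A minimal sketch of that calculation follows. `current_linenr` is a hypothetical helper name, and the input is plain ASCII so that scanpointer positions and character indexes coincide.

```ruby
require 'strscan'

# Hypothetical helper: 1-based line number of the current scanpointer.
def current_linenr(input)
  input.string[0...input.pos].count("\n") + 1
end

input = StringScanner.new("var a;\n\n/** Doc */\nvar b;")
first_line = current_linenr(input)

# Move the scanpointer to the start of the doc-comment.
input.scan_until(/\n\n/)
doc_line = current_linenr(input)

puts first_line  # 1
puts doc_line    # 3
```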

#next_token ⇒ Object

Parses out next token from input stream.

For efficiency, tokens are checked in order of their frequency in JavaScript source code:

  • first the most common operators

  • then identifiers and keywords

  • then strings

  • then comments

The remaining token types are less frequent, so they are checked last.



# File 'lib/jsduck/lexer.rb', line 115

def next_token
  while !@input.eos? do
    skip_white
    if @input.check(/[.(),;={}:]/)
      return {
        :type => :operator,
        :value => @input.scan(/./)
      }
    elsif @input.check(/[a-zA-Z_$]/)
      value = @input.scan(/[$\w]+/)
      kw = KEYWORDS[value]
      return {
        :type => kw || :ident,
        :value => kw || value
      }
    elsif @input.check(/'/)
      return {
        :type => :string,
        :value => @input.scan(/'([^'\\]|\\.)*('|\Z)/m).gsub(/\A'|'\Z/m, "")
      }
    elsif @input.check(/"/)
      return {
        :type => :string,
        :value => @input.scan(/"([^"\\]|\\.)*("|\Z)/m).gsub(/\A"|"\Z/m, "")
      }
    elsif @input.check(/\//)
      # Several things begin with a slash:
      # - comments, regexes, division-operators
      if @input.check(/\/\*\*[^\/]/)
        return {
          :type => :doc_comment,
          # Calculate current line number, starting with 1
          :linenr => @input.string[0...@input.pos].count("\n") + 1,
          :value => @input.scan_until(/\*\/|\Z/)
        }
      elsif @input.check(/\/\*/)
        # skip multiline comment
        @input.scan_until(/\*\/|\Z/)
      elsif @input.check(/\/\//)
        # skip line comment
        @input.scan_until(/\n|\Z/)
      elsif regex?
        return {
          :type => :regex,
          :value => @input.scan(META_REGEX)
        }
      else
        return {
          :type => :operator,
          :value => @input.scan(/\//)
        }
      end
    elsif @input.check(/[0-9]+/)
      nr = @input.scan(/[0-9]+(\.[0-9]*)?/)
      return {
        :type => :number,
        :value => nr
      }
    elsif @input.check(/./)
      return {
        :type => :operator,
        :value => @input.scan(/./)
      }
    end
  end
end
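The string branches are the least obvious part of the method, so here is the single-quote case tried on its own. This is a sketch using the same two patterns as the code above. `scan_single_quoted` is a hypothetical helper, not part of the class.

```ruby
require 'strscan'

# Hypothetical helper: scans one single-quoted JavaScript string literal from
# the start of the source and returns its value without the surrounding quotes.
def scan_single_quoted(source)
  input = StringScanner.new(source)
  literal = input.scan(/'([^'\\]|\\.)*('|\Z)/m)
  literal && literal.gsub(/\A'|'\Z/m, "")
end

plain        = scan_single_quoted("'abc' + x")
escaped      = scan_single_quoted("'it\\'s' + x")  # JavaScript source: 'it\'s' + x
unterminated = scan_single_quoted("'abc")

puts plain         # abc
puts escaped       # it\'s
puts unterminated  # abc
```

Two details are visible here: escape sequences are kept verbatim in the token value rather than decoded, and an unterminated string is consumed up to the end of the input.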

#regex? ⇒ Boolean

A slash “/” is a division operator if it follows:

  • an identifier

  • the “this” keyword

  • a number

  • a closing parenthesis )

  • a closing square-bracket ]

Otherwise it is the beginning of a regex.

Returns:

  • (Boolean)


# File 'lib/jsduck/lexer.rb', line 189

def regex?
  if @previous_token
    type = @previous_token[:type]
    value = @previous_token[:value]
    if type == :ident || type == :number
      return false
    elsif type == :this
      return false
    elsif type == :operator && (value == ")" || value == "]")
      return false
    end
  end
  return true
end
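The decision table can be checked against a few previous-token values. In this sketch, `slash_starts_regex?` is a hypothetical free-standing function that takes the previous token as an argument instead of reading @previous_token. The rules are the ones listed above.

```ruby
# Hypothetical helper with the same decision table as #regex?.
def slash_starts_regex?(previous_token)
  return true unless previous_token                       # start of input
  type  = previous_token[:type]
  value = previous_token[:value]
  return false if [:ident, :number, :this].include?(type)
  return false if type == :operator && [")", "]"].include?(value)
  true
end

puts slash_starts_regex?(nil)                                     # true
puts slash_starts_regex?({ :type => :ident,    :value => "x" })   # false, as in x / 2
puts slash_starts_regex?({ :type => :operator, :value => ")" })   # false, as in (a) / 2
puts slash_starts_regex?({ :type => :operator, :value => "=" })   # true,  as in x = /re/
puts slash_starts_regex?({ :type => :return,   :value => :return }) # true, as in return /re/
```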

#skip_white ⇒ Object

Skips any whitespace at the current scanpointer position.



# File 'lib/jsduck/lexer.rb', line 204

def skip_white
  @input.scan(/\s+/)
end