Class: JsDuck::Lexer
- Inherits: Object
- Defined in: lib/jsduck/lexer.rb
Overview
Tokenizes JavaScript code into lexical tokens.
Each token has a type and value. Types and possible values for them are as follows:
- :number – 25
- :string – “Hello world”
- :ident – “foo”
- :regex – “/abc/i”
- :operator – “+”
- :doc_comment – “/** My comment */”
Plus a separate type for each keyword: :if, :while, :function, … For keywords the type and value are the same.
Notice that doc-comments are recognized as tokens, while normal comments are ignored just like whitespace.
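The keyword rule above can be sketched in a few lines. This is a minimal illustration, not the library's code: KEYWORDS_SKETCH is an abbreviated stand-in for the KEYWORDS constant, and word_token is a hypothetical helper name.

```ruby
# Abbreviated stand-in for KEYWORDS; the real table lists every keyword.
KEYWORDS_SKETCH = { "if" => :if, "while" => :while, "function" => :function }

# Hypothetical helper: turns a scanned word into a token hash.
# Keywords get their symbol as both type and value; anything else is an :ident.
def word_token(word)
  kw = KEYWORDS_SKETCH[word]
  { :type => kw || :ident, :value => kw || word }
end

keyword_token = word_token("if")   # type and value are both :if
ident_token   = word_token("foo")  # type is :ident, value is the string "foo"
```

Because a keyword token carries a symbol as its value while an identifier carries a string, callers can tell the two apart by value alone.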
Constant Summary
- META_REGEX =
A regex that matches a JavaScript regex literal.
    %r{
      /               (?# beginning )
      (
        [^/\[\\]      (?# any character except \ / [ )
        |
        \\.           (?# an escaping \ followed by any character )
        |
        \[ ([^\]\\]|\\.)* \]    (?# [...] containing any characters including / )
                                (?# except \ ] which have to be escaped )
      )*
      (/[gim]*|\Z)    (?# ending + modifiers )
    }x
- KEYWORDS =
    {
      "break" => :break,
      "case" => :case,
      "catch" => :catch,
      "continue" => :continue,
      "default" => :default,
      "delete" => :delete,
      "do" => :do,
      "else" => :else,
      "finally" => :finally,
      "for" => :for,
      "function" => :function,
      "if" => :if,
      "in" => :in,
      "instanceof" => :instanceof,
      "new" => :new,
      "return" => :return,
      "switch" => :switch,
      "this" => :this,
      "throw" => :throw,
      "try" => :try,
      "typeof" => :typeof,
      "var" => :var,
      "void" => :void,
      "while" => :while,
      "with" => :with,
    }
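To see what META_REGEX accepts, the sketch below applies a compact copy of it (inline comments removed) to a few inputs. The names META_REGEX_SKETCH and scan_regex_literal are hypothetical and exist only for this illustration.

```ruby
require 'strscan'

# Compact copy of META_REGEX with the inline comments removed.
META_REGEX_SKETCH = %r{
  /
  (
    [^/\[\\]
    |
    \\.
    |
    \[ ([^\]\\]|\\.)* \]
  )*
  (/[gim]*|\Z)
}x

# Hypothetical helper: scans one regex literal from the start of +source+.
# Returns the matched literal, or nil when +source+ does not start with "/".
def scan_regex_literal(source)
  StringScanner.new(source).scan(META_REGEX_SKETCH)
end

plain      = scan_regex_literal('/abc/i.test(s)')  # stops after the modifiers
with_class = scan_regex_literal('/[/]/g, x')       # "/" inside [...] does not end it
escaped    = scan_regex_literal('/a\/b/ + 1')      # an escaped "/" does not end it
```

The `\Z` alternative at the end means an unterminated literal is consumed up to the end of input instead of failing to match.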
Instance Method Summary
- #buffer_tokens(n) ⇒ Object
Ensures the next n tokens are read into the buffer.
- #empty? ⇒ Boolean
True when there are no more tokens.
- #initialize(input) ⇒ Lexer
constructor
Input can be either a String or a StringScanner.
- #look(*tokens) ⇒ Object
Tests whether the given pattern matches the tokens that follow the current position.
- #next(full = false) ⇒ Object
Returns the value of the next token and advances the token cursor past it.
- #next_token ⇒ Object
Parses out the next token from the input stream.
- #regex? ⇒ Boolean
A slash “/” is a division operator if it follows an identifier, the “this” keyword, a number, a closing bracket ), or a closing square-bracket ]. Otherwise it is the beginning of a regex.
- #skip_white ⇒ Object
Constructor Details
#initialize(input) ⇒ Lexer
Input can be either a String or a StringScanner.
In the latter case we ensure that only #next advances the scan pointer of the StringScanner. This allows context-switching while parsing a string. Specifically, we need this feature to parse JavaScript embedded inside doc-comments.
    # File 'lib/jsduck/lexer.rb', line 30
    def initialize(input)
      @input = input.is_a?(StringScanner) ? input : StringScanner.new(input)
      @buffer = []
    end
Instance Method Details
#buffer_tokens(n) ⇒ Object
Ensures the next n tokens are read into the buffer.
At the end of buffering, the scan pointer is restored to its initial position. Only the #next method advances the scan pointer in a way that is visible outside this class.
    # File 'lib/jsduck/lexer.rb', line 88
    def buffer_tokens(n)
      prev_pos = @input.pos
      @input.pos = @buffer.last[:pos] if @buffer.last
      (n - @buffer.length).times do
        @previous_token = tok = next_token
        if tok
          # remember scanpointer position after each token
          tok[:pos] = @input.pos
          @buffer << tok
        end
      end
      @input.pos = prev_pos
    end
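The save-and-restore pattern used by #buffer_tokens can be tried in isolation with a plain StringScanner. The following is a simplified sketch that reads whitespace-separated words instead of JavaScript tokens; peek_words is a hypothetical helper, not part of the class.

```ruby
require 'strscan'

# Hypothetical helper: reads up to +n+ words ahead, then restores the
# scan pointer so the caller sees no movement.
def peek_words(scanner, n)
  prev_pos = scanner.pos
  words = []
  n.times do
    scanner.scan(/\s+/)
    word = scanner.scan(/\w+/)
    break unless word
    # remember where each word ends, as buffer_tokens does with tok[:pos]
    words << { :value => word, :pos => scanner.pos }
  end
  scanner.pos = prev_pos
  words
end

scanner = StringScanner.new("foo bar baz")
peeked = peek_words(scanner, 2)
# scanner.pos is unchanged here; peeked holds "foo" and "bar" with their end positions
```

Storing the end position with each buffered item is what lets a later consumer jump the scan pointer forward by exactly one item, which is how #next uses tok[:pos].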
#empty? ⇒ Boolean
True when there are no more tokens.
    # File 'lib/jsduck/lexer.rb', line 78
    def empty?
      buffer_tokens(1)
      return !@buffer.first
    end
#look(*tokens) ⇒ Object
Tests whether the given pattern matches the tokens that follow the current position.
Takes a list of strings and symbols. Symbols are compared to the token type, strings to the token value. For example:
look(:ident, "=", :regex)
    # File 'lib/jsduck/lexer.rb', line 43
    def look(*tokens)
      buffer_tokens(tokens.length)
      i = 0
      tokens.all? do |t|
        tok = @buffer[i]
        i += 1
        if !tok
          false
        elsif t.instance_of?(Symbol)
          tok[:type] == t
        else
          tok[:value] == t
        end
      end
    end
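The comparison rule of #look can be exercised without a lexer by running it over a hand-built token list. This is a sketch under that assumption; pattern_matches? is a hypothetical stand-alone function and the token array is made up for the example.

```ruby
# Hypothetical stand-alone version of the comparison used by #look:
# symbols are compared to the token type, strings to the token value.
def pattern_matches?(buffer, *pattern)
  pattern.each_with_index.all? do |expected, i|
    tok = buffer[i]
    next false unless tok
    expected.is_a?(Symbol) ? tok[:type] == expected : tok[:value] == expected
  end
end

# Hand-built tokens, shaped like those produced for: re = /abc/i
tokens = [
  { :type => :ident,    :value => "re" },
  { :type => :operator, :value => "=" },
  { :type => :regex,    :value => "/abc/i" },
]

assignment_of_regex = pattern_matches?(tokens, :ident, "=", :regex)
addition            = pattern_matches?(tokens, :ident, "+")
```

A pattern longer than the available tokens never matches, because a missing token counts as a mismatch.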
#next(full = false) ⇒ Object
Returns the value of the next token and advances the token cursor past it.
When full=true, returns the full token as a hash, like so:
{:type => :ident, :value => "foo"}
For doc-comments the full token also contains the field :linenr, pointing to the line where the doc-comment began.
    # File 'lib/jsduck/lexer.rb', line 69
    def next(full=false)
      buffer_tokens(1)
      tok = @buffer.shift
      # advance the scanpointer to the position after this token
      @input.pos = tok[:pos]
      full ? tok : tok[:value]
    end
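The difference between the two return forms can be shown on a hand-built buffer. This sketch leaves out the scan pointer handling entirely; take_next is a hypothetical helper that mirrors only the return-value logic.

```ruby
# Hypothetical helper: shifts the first token off +buffer+ and returns
# either just its value or the whole token hash.
def take_next(buffer, full = false)
  tok = buffer.shift
  full ? tok : tok[:value]
end

buffer = [
  { :type => :ident,    :value => "foo" },
  { :type => :operator, :value => "(" },
]

value_only = take_next(buffer)        # just the value of the first token
full_token = take_next(buffer, true)  # the whole hash of the second token
```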
#next_token ⇒ Object
Parses out the next token from the input stream.
For efficiency we look for tokens in order of frequency in JavaScript source code:
- first check for the most common operators
- then for identifiers and keywords
- then strings
- then comments
The remaining token types are less frequent, so these are left to the end.
    # File 'lib/jsduck/lexer.rb', line 115
    def next_token
      while !@input.eos? do
        skip_white
        if @input.check(/[.(),;={}:]/)
          return {
            :type => :operator,
            :value => @input.scan(/./)
          }
        elsif @input.check(/[a-zA-Z_$]/)
          value = @input.scan(/[$\w]+/)
          kw = KEYWORDS[value]
          return {
            :type => kw || :ident,
            :value => kw || value
          }
        elsif @input.check(/'/)
          return {
            :type => :string,
            :value => @input.scan(/'([^'\\]|\\.)*('|\Z)/m).gsub(/\A'|'\Z/m, "")
          }
        elsif @input.check(/"/)
          return {
            :type => :string,
            :value => @input.scan(/"([^"\\]|\\.)*("|\Z)/m).gsub(/\A"|"\Z/m, "")
          }
        elsif @input.check(/\//)
          # Several things begin with slash:
          # - comments, regexes, division-operators
          if @input.check(/\/\*\*[^\/]/)
            return {
              :type => :doc_comment,
              # Calculate current line number, starting with 1
              :linenr => @input.string[0...@input.pos].count("\n") + 1,
              :value => @input.scan_until(/\*\/|\Z/)
            }
          elsif @input.check(/\/\*/)
            # skip multiline comment
            @input.scan_until(/\*\/|\Z/)
          elsif @input.check(/\/\//)
            # skip line comment
            @input.scan_until(/\n|\Z/)
          elsif regex?
            return {
              :type => :regex,
              :value => @input.scan(META_REGEX)
            }
          else
            return {
              :type => :operator,
              :value => @input.scan(/\//)
            }
          end
        elsif @input.check(/[0-9]+/)
          nr = @input.scan(/[0-9]+(\.[0-9]*)?/)
          return {
            :type => :number,
            :value => nr
          }
        elsif @input.check(/./)
          return {
            :type => :operator,
            :value => @input.scan(/./)
          }
        end
      end
    end
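The doc-comment branch is the only one that records a line number. The sketch below isolates that branch on a plain StringScanner; doc_comment_at is a hypothetical helper, and the scanner is positioned by hand instead of being driven by the lexer loop.

```ruby
require 'strscan'

# Hypothetical helper: if a doc-comment starts at the scan pointer, consume it
# and return a token hash with the 1-based line number where it began.
def doc_comment_at(input)
  return nil unless input.check(/\/\*\*[^\/]/)
  {
    :type => :doc_comment,
    # newlines before the current position, plus one; evaluated before the scan below
    :linenr => input.string[0...input.pos].count("\n") + 1,
    :value => input.scan_until(/\*\/|\Z/),
  }
end

input = StringScanner.new("var a;\n\n/** Docs */\nvar b;")
input.pos = input.string.index("/**")  # place the pointer at the comment by hand
token = doc_comment_at(input)
```

Note the `[^\/]` after `/**` in the check: an empty comment written as `/**/` is treated as a normal comment, not a doc-comment.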
#regex? ⇒ Boolean
A slash “/” is a division operator if it follows:
- identifier
- the “this” keyword
- number
- closing bracket )
- closing square-bracket ]
Otherwise it’s a beginning of regex.
    # File 'lib/jsduck/lexer.rb', line 189
    def regex?
      if @previous_token
        type = @previous_token[:type]
        value = @previous_token[:value]
        if type == :ident || type == :number
          return false
        elsif type == :this
          return false
        elsif type == :operator && (value == ")" || value == "]")
          return false
        end
      end
      return true
    end
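The decision depends only on the previous token, so it can be expressed as a pure function and checked against a few cases. This is a sketch of the same rules; slash_starts_regex? is a hypothetical name, and the token hashes are written by hand.

```ruby
# Hypothetical stand-alone version of the decision made by #regex?.
# +previous_token+ is a token hash, or nil at the start of input.
def slash_starts_regex?(previous_token)
  return true unless previous_token
  type  = previous_token[:type]
  value = previous_token[:value]
  # after a value-like token, "/" is division
  return false if type == :ident || type == :number || type == :this
  return false if type == :operator && (value == ")" || value == "]")
  true
end

after_ident  = slash_starts_regex?({ :type => :ident, :value => "a" })     # a / b
after_assign = slash_starts_regex?({ :type => :operator, :value => "=" })  # x = /re/
at_start     = slash_starts_regex?(nil)                                    # /re/ first
```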
#skip_white ⇒ Object
    # File 'lib/jsduck/lexer.rb', line 204
    def skip_white
      @input.scan(/\s+/)
    end