Class: Linguist::Tokenizer

Inherits: Object
Defined in: lib/linguist/tokenizer.rb

Overview

Generic programming language tokenizer.

Tokens are designed for use in the language Bayes classifier. The tokenizer strips string literals and comments and preserves significant language symbols.
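
For example (an illustrative call, following the scanning rules documented below):

  Linguist::Tokenizer.tokenize("# a comment\nprint('Hello')")
  # => ["print", "(", ")"]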

Constant Summary

BYTE_LIMIT =

Read up to 100KB

100_000
SINGLE_LINE_COMMENTS =

Start state on token; ignore everything until the next newline.

[
  '//', # C
  '#',  # Ruby
  '%',  # TeX
]
MULTI_LINE_COMMENTS =

Start state on opening token; ignore everything until the closing token is reached.

[
  ['/*', '*/'],    # C
  ['<!--', '-->'], # XML
  ['{-', '-}'],    # Haskell
  ['(*', '*)'],    # Coq
  ['"""', '"""']   # Python
]
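
Each entry pairs an opening token with its closing token; extract_tokens looks up the closing token with Array#assoc, e.g.:

  MULTI_LINE_COMMENTS.assoc('/*')     # => ["/*", "*/"]
  MULTI_LINE_COMMENTS.assoc('/*')[1]  # => "*/"
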
START_SINGLE_LINE_COMMENT =
Regexp.compile(SINGLE_LINE_COMMENTS.map { |c|
  "\s*#{Regexp.escape(c)} "
}.join("|"))
START_MULTI_LINE_COMMENT =
Regexp.compile(MULTI_LINE_COMMENTS.map { |c|
  Regexp.escape(c[0])
}.join("|"))
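
Note that "\s" inside a double-quoted Ruby string is the space character, so START_SINGLE_LINE_COMMENT matches literal leading spaces rather than the regex \s class. A quick sanity check of the compiled patterns (illustrative):

  "# a comment" =~ Linguist::Tokenizer::START_SINGLE_LINE_COMMENT  # => 0
  "/* block */" =~ Linguist::Tokenizer::START_MULTI_LINE_COMMENT   # => 0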

Class Method Summary

  .tokenize(data) ⇒ Object
    Extract tokens from data.

Instance Method Summary

  #extract_sgml_tokens(data) ⇒ Object
    Extract tokens from inside an SGML tag.
  #extract_shebang(data) ⇒ Object
    Extract normalized shebang command token.
  #extract_tokens(data) ⇒ Object
    Extract generic tokens from data.

Class Method Details

.tokenize(data) ⇒ Object

Public: Extract tokens from data

data - String to tokenize

Returns Array of token Strings.



# File 'lib/linguist/tokenizer.rb', line 15

def self.tokenize(data)
  new.extract_tokens(data)
end
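
For example, with the gem's lib directory on the load path:

  require 'linguist/tokenizer'

  Linguist::Tokenizer.tokenize("def foo; end")
  # => ["def", "foo", ";", "end"]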

Instance Method Details

#extract_sgml_tokens(data) ⇒ Object

Internal: Extract tokens from inside SGML tag.

data - SGML tag String.

Examples

extract_sgml_tokens("<a href='' class=foo>")
# => ["<a>", "href="]

Returns Array of token Strings.



# File 'lib/linguist/tokenizer.rb', line 159

def extract_sgml_tokens(data)
  s = StringScanner.new(data)

  tokens = []

  until s.eos?
    # Emit start token
    if token = s.scan(/<\/?[^\s>]+/)
      tokens << "#{token}>"

    # Emit attributes with trailing =
    elsif token = s.scan(/\w+=/)
      tokens << token

      # Then skip over attribute value
      if s.scan(/"/)
        s.skip_until(/[^\\]"/)
      elsif s.scan(/'/)
        s.skip_until(/[^\\]'/)
      else
        s.skip_until(/\w+/)
      end

    # Emit lone attributes
    elsif token = s.scan(/\w+/)
      tokens << token

    # Stop at the end of the tag
    elsif s.scan(/>/)
      s.terminate

    else
      s.getch
    end
  end

  tokens
end
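
Quoted attribute values are skipped while attribute names are kept, as in this illustrative call:

  Linguist::Tokenizer.new.extract_sgml_tokens("<div id='main' hidden>")
  # => ["<div>", "id=", "hidden"]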

#extract_shebang(data) ⇒ Object

Internal: Extract normalized shebang command token.

Examples

extract_shebang("#!/usr/bin/ruby")
# => "ruby"

extract_shebang("#!/usr/bin/env node")
# => "node"

Returns String token, or nil if it couldn't be parsed.



# File 'lib/linguist/tokenizer.rb', line 133

def extract_shebang(data)
  s = StringScanner.new(data)

  if path = s.scan(/^#!\s*\S+/)
    script = path.split('/').last
    if script == 'env'
      s.scan(/\s+/)
      script = s.scan(/\S+/)
    end
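    # Keep only the leading run of non-digits, e.g. "python2.7" -> "python"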
    script = script[/[^\d]+/, 0] if script
    return script
  end

  nil
end
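
Because the version suffix is stripped, versioned interpreters normalize to a single token:

  Linguist::Tokenizer.new.extract_shebang("#!/usr/bin/python2.7")
  # => "python"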

#extract_tokens(data) ⇒ Object

Internal: Extract generic tokens from data.

data - String to scan.

Examples

extract_tokens("printf('Hello')")
# => ['printf', '(', ')']

Returns Array of token Strings.



# File 'lib/linguist/tokenizer.rb', line 57

def extract_tokens(data)
  s = StringScanner.new(data)

  tokens = []
  until s.eos?
    break if s.pos >= BYTE_LIMIT

    if token = s.scan(/^#!.+$/)
      if name = extract_shebang(token)
        tokens << "SHEBANG#!#{name}"
      end

    # Single line comment
    elsif s.beginning_of_line? && token = s.scan(START_SINGLE_LINE_COMMENT)
      # tokens << token.strip
      s.skip_until(/\n|\Z/)

    # Multiline comments
    elsif token = s.scan(START_MULTI_LINE_COMMENT)
      # tokens << token
      close_token = MULTI_LINE_COMMENTS.assoc(token)[1]
      s.skip_until(Regexp.compile(Regexp.escape(close_token)))
      # tokens << close_token

    # Skip single or double quoted strings
    elsif s.scan(/"/)
      if s.peek(1) == "\""
        s.getch
      else
        s.skip_until(/[^\\]"/)
      end
    elsif s.scan(/'/)
      if s.peek(1) == "'"
        s.getch
      else
        s.skip_until(/[^\\]'/)
      end

    # Skip number literals
    elsif s.scan(/(0x)?\d(\d|\.)*/)

    # SGML style brackets
    elsif token = s.scan(/<[^\s<>][^<>]*>/)
      extract_sgml_tokens(token).each { |t| tokens << t }

    # Common programming punctuation
    elsif token = s.scan(/;|\{|\}|\(|\)|\[|\]/)
      tokens << token

    # Regular token
    elsif token = s.scan(/[\w\.@#\/\*]+/)
      tokens << token

    # Common operators
    elsif token = s.scan(/<<?|\+|\-|\*|\/|%|&&?|\|\|?/)
      tokens << token

    else
      s.getch
    end
  end

  tokens
end
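
String and number literals are skipped while names, punctuation, and operators are kept (note that '=' is in neither token set, so it is dropped):

  Linguist::Tokenizer.new.extract_tokens('x = "hi" + 42')
  # => ["x", "+"]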