Module: ACSV::Detect

Defined in:
lib/acsv/detect/encoding.rb,
lib/acsv/detect/separator.rb,
lib/acsv/detect/encoding_holmes.rb,
lib/acsv/detect/encoding_rchardet.rb,
lib/acsv/detect/encoding_uchardet.rb

Defined Under Namespace

Modules: EncodingHolmes, EncodingRChardet, EncodingUChardet

Constant Summary

CONFIDENCE = 0.6

Default confidence level for encoding detection to succeed

PREVIEW_BYTES = 8 * 4096

Number of bytes to test encoding on

SEPARATORS = [",", ";", "\t", "|", "#"]

Possible CSV separators to check

Class Method Details

.encoding(file_or_data, options = {}) ⇒ String

Tries to detect the file encoding.

Parameters:

  • file_or_data (File, String)

    CSV file or data to probe

  • options (Hash) (defaults to: {})

    a customizable set of options

Options Hash (options):

  • :confidence (Number)

    minimum confidence level (0-1); see CONFIDENCE for the default

  • :method (String)

    try only the given detection method; must be one of encoding_methods

Returns:

  • (String, nil)

    most probable encoding, or nil when no detector returns a result



# File 'lib/acsv/detect/encoding.rb', line 20

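def encoding(file_or_data, options={})
  if file_or_data.is_a? File
    # Read a preview from the current position, then restore the position
    position = file_or_data.tell
    data = file_or_data.read(PREVIEW_BYTES)
    file_or_data.seek(position)
  else
    # String data is passed to the detectors as-is
    data = file_or_data
  end

  # Return the first encoding that a detector reports
  detector_do(options) do |detector|
    if enc = detector.encoding(data, options)
      return enc
    end
  end
  nil
end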
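
The listing above combines two ideas: reading a preview without disturbing the caller's file position, and asking detectors in turn until one answers. The following is a minimal sketch of both, not the library's implementation. `FakeDetector`, `preview`, `first_encoding`, and `DETECTORS` are hypothetical names introduced for illustration, and the way `:method` and `:confidence` are applied here is an assumption about how `detector_do` and the detector modules might behave.

```ruby
require "tempfile"

# Hypothetical stand-in for a detector module: it always "guesses" the same
# encoding with a fixed confidence score.
FakeDetector = Struct.new(:require_name, :guess, :score) do
  def encoding(_data, options = {})
    # Report a result only when it meets the requested confidence level.
    score >= options.fetch(:confidence, 0.6) ? guess : nil
  end
end

# Read up to `limit` bytes from a File and restore its position afterwards.
# Strings are returned unchanged.
def preview(file_or_data, limit)
  return file_or_data unless file_or_data.is_a?(File)

  position = file_or_data.tell
  data = file_or_data.read(limit)
  file_or_data.seek(position)
  data
end

# Ask each detector in turn and return the first non-nil answer.
def first_encoding(detectors, data, options = {})
  if options[:method]
    detectors = detectors.select { |d| d.require_name == options[:method] }
  end
  detectors.each do |detector|
    enc = detector.encoding(data, options)
    return enc if enc
  end
  nil
end

DETECTORS = [
  FakeDetector.new("weak", "ISO-8859-1", 0.4),
  FakeDetector.new("strong", "UTF-8", 0.9)
].freeze

Tempfile.create("acsv-demo") do |file|
  file.write("a;b\n1;2\n")
  file.rewind
  data = preview(file, 3)
  puts first_encoding(DETECTORS, data).inspect
  puts file.tell # the position is unchanged by the preview
end
```

With the default threshold the "weak" detector is skipped and the "strong" one answers. Raising `:confidence` above every score makes `first_encoding` return nil, which mirrors the nil fallthrough in the listing.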

.encoding_methods ⇒ Array<String>

Returns the list of available encoding detection methods.

Returns:

  • (Array<String>)

    list of available encoding detection methods



# File 'lib/acsv/detect/encoding.rb', line 38

def encoding_methods
  ENCODING_DETECTORS_AVAIL.map(&:require_name)
end

.encoding_methods_all ⇒ Array<String>

Returns the list of all possible encoding detection methods, including those whose gem is missing.

Returns:

  • (Array<String>)

    list of all possible encoding detection methods, including those whose gem is missing



# File 'lib/acsv/detect/encoding.rb', line 43

def encoding_methods_all
  ENCODING_DETECTORS_ALL.map(&:require_name)
end
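
The difference between "available" and "possible" methods comes down to whether a detector's gem can be loaded. The sketch below shows one common way to split such a list, by attempting a `require` and rescuing `LoadError`. It is an illustration only: the `DetectSketch` module, the `Detector` struct, and the library name `acsv_hypothetical_missing` are assumptions, and how the library itself populates `ENCODING_DETECTORS_AVAIL` is not shown in this page.

```ruby
module DetectSketch
  # Hypothetical detector descriptor: `require_name` is the library to load.
  Detector = Struct.new(:require_name) do
    # True when the backing library can be loaded.
    def available?
      require require_name
      true
    rescue LoadError
      false
    end
  end

  ALL = [
    Detector.new("json"),                      # ships with Ruby, so it loads
    Detector.new("acsv_hypothetical_missing")  # assumed not to be installed
  ].freeze

  # Probe once at load time and keep only the loadable detectors.
  AVAILABLE = ALL.select(&:available?).freeze

  def self.encoding_methods
    AVAILABLE.map(&:require_name)
  end

  def self.encoding_methods_all
    ALL.map(&:require_name)
  end
end

puts DetectSketch.encoding_methods.inspect
puts DetectSketch.encoding_methods_all.inspect
```

Probing once and caching the result keeps later calls cheap, at the cost of not noticing gems installed after the module was loaded.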

.separator(file_or_data) ⇒ String

TODO:

Return whichever character yields the same number of columns over multiple lines.

Returns the most probable column separator character, based on the first line, or nil when none is found.

Parameters:

  • file_or_data (File, String)

    CSV file or data to probe

Returns:

  • (String, nil)

    most probable column separator character, based on the first line, or nil when none is found



# File 'lib/acsv/detect/separator.rb', line 10

def self.separator(file_or_data)
  if file_or_data.is_a? File
    # Read the first line, then restore the file position
    position = file_or_data.tell
    firstline = file_or_data.readline
    file_or_data.seek(position)
  else
    firstline = file_or_data.split("\n", 2)[0]
  end
  # Convert the candidates to the data's encoding so that String#count works
  separators = SEPARATORS.map{|s| s.encode(firstline.encoding)}
  # Pick the candidate that occurs most often in the first line
  sep = separators.map {|x| [firstline.count(x),x]}.sort_by {|x| x[0]}.last
  sep[0] == 0 ? nil : sep[1].encode('ascii')
end
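
The TODO for this method suggests preferring the character that gives the same number of columns over several lines. One possible reading of that idea is sketched below. It is not part of the library: `consistent_separator`, `SEPARATOR_CANDIDATES`, and the `max_lines` parameter are hypothetical, and the sketch ignores quoted fields, so a separator inside quotes would be miscounted.

```ruby
SEPARATOR_CANDIDATES = [",", ";", "\t", "|", "#"].freeze

# Hypothetical helper: return the candidate that occurs equally often (and at
# least once) on each of the first `max_lines` non-empty lines. When several
# candidates qualify, the one with the most occurrences per line wins.
def consistent_separator(data, max_lines = 5)
  lines = data.each_line.map(&:chomp).reject(&:empty?).first(max_lines)
  return nil if lines.empty?

  consistent = SEPARATOR_CANDIDATES.filter_map do |sep|
    counts = lines.map { |line| line.count(sep) }.uniq
    # Keep only separators with one identical, non-zero count on every line.
    [counts.first, sep] if counts.size == 1 && counts.first > 0
  end

  best = consistent.max_by(&:first)
  best && best.last
end

# The first line alone contains more commas than semicolons, but only the
# semicolon splits every line into the same number of columns.
sample = "a,b,c;d\n1;2\n3;4\n"
puts consistent_separator(sample).inspect
```

Compared with counting on the first line only, this approach is less easily misled by a header that happens to contain another candidate character, though it needs more than one line of data to be useful.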