Class: CSVDiff::Source

Inherits: Object

Defined in: lib/csv-diff/source.rb

Overview

Represents an input (i.e. the left/from or right/to input) to the diff process.

Direct Known Subclasses

CSVSource, XMLSource

Instance Attribute Summary

#case_sensitive ⇒ Boolean (also: #case_sensitive?) readonly

True if the source has been indexed with case-sensitive keys, or false if it has been indexed using upper-case key values.
#child_field_indexes ⇒ Array<Fixnum> readonly

The indexes of the child fields in the source file.
#child_fields ⇒ Array<String> readonly

The names of the field(s) that distinguish a child of a parent record.
#data ⇒ Array<Array> readonly

The data for this source.
#dup_count ⇒ Fixnum readonly

A count of the lines from this source that had the same key value as another line.
#field_names ⇒ Array<String> readonly

The names of the fields in the source file.
#index ⇒ Hash<String,Array<String>> readonly

A hash containing each parent key, and an Array of the child keys it is a parent of.
#key_field_indexes ⇒ Array<Fixnum> readonly

The indexes of the key fields in the source file.
#key_fields ⇒ Array<String> readonly

The names of the field(s) that uniquely identify each row.
#line_count ⇒ Fixnum readonly

A count of the lines processed from this source.
#lines ⇒ Hash<String,Hash> readonly

A hash containing each line of the source, keyed on the values of the key_fields.
#parent_field_indexes ⇒ Array<Fixnum> readonly

The indexes of the parent fields in the source file.
#parent_fields ⇒ Array<String> readonly

The names of the field(s) that identify a common parent of child records.
#path ⇒ String

The path to the source file.
#skip_count ⇒ Fixnum readonly

A count of the lines from this source that were skipped due to filter conditions.
#trim_whitespace ⇒ Boolean readonly

True if leading/trailing whitespace should be stripped from fields.
#warnings ⇒ Array<String> readonly

An array of any warnings encountered while processing the source.

Instance Method Summary

#[](key) ⇒ Hash

Returns the row in the CSV source corresponding to the supplied key.
#index_source ⇒ Object

Given an array of lines, where each line is an array of fields, indexes the array contents so that it can be looked up by key.
#initialize(options = {}) ⇒ Source constructor

Creates a new diff source.
#path? ⇒ Boolean

True if the source has a real path, or false if the path is still the placeholder 'NA'.
#save_csv(file_path, options = {}) ⇒ Object

Save the data in this Source as a CSV at file_path.
#to_hash ⇒ Object

Convert the data in this source to Array<Hash> using the field names as keys for the Hash in each row.

Constructor Details

#initialize(options = {}) ⇒ `Source`

Creates a new diff source.

A diff source must contain at least one field that will be used as the key to identify the same record in a different version of this file. If not specified via one of the options, the first field is assumed to be the unique key.

If multiple fields combine to form a unique key, the combined fields are considered as a single unique identifier. If your key represents data that can be represented as a tree, you can instead break your key fields into :parent_fields and :child_fields. By doing this, if a child key is deleted from one parent, and added to another, that will be reported as an update, with a change to the parent key part(s) of the record.

All key options can be specified either by field name, or by field index (0 based).

Parameters:

options (Hash) (defaults to: {}) —

An options hash.

Options Hash (options):

:field_names (Array<String>) —

The names of each of the fields in source.
:ignore_header (Boolean) —

If true, and :field_names has been specified, then the first row of the file is ignored.
:key_field (String) —

The name of the field that uniquely identifies each row.
:key_fields (Array<String>) —

The names of the fields that uniquely identify each row.
:parent_field (String) —

The name of the field(s) that identify a parent within which sibling order should be checked.
:child_field (String) —

The name of the field(s) that uniquely identify a child of a parent.
:case_sensitive (Boolean) —

If true (the default), keys are indexed as-is; if false, the index is built in upper-case for case-insensitive comparisons.
:include (Hash) —

A hash of field name(s) or index(es) to regular expression(s). Only source rows whose field values satisfy the regular expressions will be indexed and included in the diff process.
:exclude (Hash) —

A hash of field name(s) or index(es) to regular expression(s). Source rows with a field value that satisfies the regular expressions will be excluded from the diff process.
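
The :include and :exclude options can be pictured with a small stand-alone sketch. The helper name row_included? is hypothetical and not part of csv-diff, and it only handles field names mapped to regular expressions; the library's own filter handling may differ in detail (for example when the same field appears in both hashes).

```ruby
# Hypothetical helper (not part of csv-diff): decides whether a row, given as
# a Hash of field name => value, passes :include / :exclude style filters.
def row_included?(row, include_filter = nil, exclude_filter = nil)
  # Every :include pattern must match the value of its field
  passes_include = (include_filter || {}).all? do |field, pattern|
    pattern.match?(row[field].to_s)
  end
  # No :exclude pattern may match the value of its field
  hits_exclude = (exclude_filter || {}).any? do |field, pattern|
    pattern.match?(row[field].to_s)
  end
  passes_include && !hits_exclude
end

rows = [
  { 'Name' => 'Alice', 'Dept' => 'Sales' },
  { 'Name' => 'Bob',   'Dept' => 'Engineering' },
  { 'Name' => 'Carol', 'Dept' => 'Sales' }
]

# Keep Sales rows, but drop anyone whose name starts with "C"
kept = rows.select { |r| row_included?(r, { 'Dept' => /\ASales\z/ }, { 'Name' => /\AC/ }) }
puts kept.map { |r| r['Name'] }.inspect
```

Rows rejected this way correspond to the lines counted by #skip_count.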

# File 'lib/csv-diff/source.rb', line 102

def initialize(options = {})
    if (options.keys & [:parent_field, :parent_fields, :child_field, :child_fields]).empty? &&
       (kf = options.fetch(:key_field, options[:key_fields]))
        @key_fields = [kf].flatten
        @parent_fields = []
        @child_fields = @key_fields
    else
        @parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
        @child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
        @key_fields = @parent_fields + @child_fields
    end
    @field_names = options[:field_names]
    @case_sensitive = options.fetch(:case_sensitive, true)
    @trim_whitespace = options.fetch(:trim_whitespace, false)
    @ignore_header = options[:ignore_header]
    @include = options[:include]
    @exclude = options[:exclude]
    @path = options.fetch(:path, 'NA') unless @path
    @warnings = []
end
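
To see how the key options interact, the branch at the top of the constructor can be reproduced as a stand-alone function. This is only a sketch: resolve_key_fields is a hypothetical name, and it returns a Hash instead of setting instance variables.

```ruby
# Stand-alone sketch mirroring the key-resolution branch of Source#initialize.
# resolve_key_fields is a hypothetical helper, not part of csv-diff.
def resolve_key_fields(options = {})
  tree_opts = [:parent_field, :parent_fields, :child_field, :child_fields]
  kf = options.fetch(:key_field, options[:key_fields])
  if (options.keys & tree_opts).empty? && kf
    # Flat key: no parent part, and the whole key doubles as the child key
    key_fields = [kf].flatten
    { key: key_fields, parent: [], child: key_fields }
  else
    # Tree key: parent part(s) first, then child part(s); child defaults to field 0
    parent = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
    child  = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
    { key: parent + child, parent: parent, child: child }
  end
end

p resolve_key_fields(key_fields: ['Region', 'Code'])
p resolve_key_fields(parent_field: 'Dept', child_field: 'EmpId')
p resolve_key_fields # no key options at all: falls back to the first field (index 0)
```

Note that supplying any parent or child option causes :key_field and :key_fields to be ignored, since the first branch is only taken when none of the tree options are present.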

Instance Attribute Details

#case_sensitive ⇒ `Boolean` (readonly) Also known as: case_sensitive?

Returns True if the source has been indexed with case-sensitive keys, or false if it has been indexed using upper-case key values.

Returns:

(Boolean) —

True if the source has been indexed with case-sensitive keys, or false if it has been indexed using upper-case key values.



# File 'lib/csv-diff/source.rb', line 37

def case_sensitive
  @case_sensitive
end

#child_field_indexes ⇒ `Array<Fixnum>` (readonly)

Returns The indexes of the child fields in the source file.

Returns:

(Array<Fixnum>) —

The indexes of the child fields in the source file.



# File 'lib/csv-diff/source.rb', line 32

def child_field_indexes
  @child_field_indexes
end

#child_fields ⇒ `Array<String>` (readonly)

Returns The names of the field(s) that distinguish a child of a parent record.

Returns:

(Array<String>) —

The names of the field(s) that distinguish a child of a parent record.



# File 'lib/csv-diff/source.rb', line 22

def child_fields
  @child_fields
end

#data ⇒ `Array<Array>` (readonly)

Returns The data for this source.

Returns:

(Array<Array>) —

The data for this source



# File 'lib/csv-diff/source.rb', line 10

def data
  @data
end

#dup_count ⇒ `Fixnum` (readonly)

Returns A count of the lines from this source that had the same key value as another line.

Returns:

(Fixnum) —

A count of the lines from this source that had the same key value as another line.



# File 'lib/csv-diff/source.rb', line 59

def dup_count
  @dup_count
end

#field_names ⇒ `Array<String>` (readonly)

Returns The names of the fields in the source file.

Returns:

(Array<String>) —

The names of the fields in the source file



# File 'lib/csv-diff/source.rb', line 13

def field_names
  @field_names
end

#index ⇒ `Hash<String,Array<String>>` (readonly)

Returns A hash containing each parent key, and an Array of the child keys it is a parent of.

Returns:

(Hash<String,Array<String>>) —

A hash containing each parent key, and an Array of the child keys it is a parent of.



# File 'lib/csv-diff/source.rb', line 47

def index
  @index
end

#key_field_indexes ⇒ `Array<Fixnum>` (readonly)

Returns The indexes of the key fields in the source file.

Returns:

(Array<Fixnum>) —

The indexes of the key fields in the source file.



# File 'lib/csv-diff/source.rb', line 26

def key_field_indexes
  @key_field_indexes
end

#key_fields ⇒ `Array<String>` (readonly)

Returns The names of the field(s) that uniquely identify each row.

Returns:

(Array<String>) —

The names of the field(s) that uniquely identify each row.



# File 'lib/csv-diff/source.rb', line 16

def key_fields
  @key_fields
end

#line_count ⇒ `Fixnum` (readonly)

Returns A count of the lines processed from this source. Excludes any header and duplicate records identified during indexing.

Returns:

(Fixnum) —

A count of the lines processed from this source. Excludes any header and duplicate records identified during indexing.



# File 'lib/csv-diff/source.rb', line 53

def line_count
  @line_count
end

#lines ⇒ `Hash<String,Hash>` (readonly)

Returns A hash containing each line of the source, keyed on the values of the key_fields.

Returns:

(Hash<String,Hash>) —

A hash containing each line of the source, keyed on the values of the key_fields.



# File 'lib/csv-diff/source.rb', line 44

def lines
  @lines
end

#parent_field_indexes ⇒ `Array<Fixnum>` (readonly)

Returns The indexes of the parent fields in the source file.

Returns:

(Array<Fixnum>) —

The indexes of the parent fields in the source file.



# File 'lib/csv-diff/source.rb', line 29

def parent_field_indexes
  @parent_field_indexes
end

#parent_fields ⇒ `Array<String>` (readonly)

Returns The names of the field(s) that identify a common parent of child records.

Returns:

(Array<String>) —

The names of the field(s) that identify a common parent of child records.



# File 'lib/csv-diff/source.rb', line 19

def parent_fields
  @parent_fields
end

#path ⇒ `String`

Returns the path to the source file.

Returns:

(String) —

the path to the source file



# File 'lib/csv-diff/source.rb', line 8

def path
  @path
end

#skip_count ⇒ `Fixnum` (readonly)

Returns A count of the lines from this source that were skipped due to filter conditions.

Returns:

(Fixnum) —

A count of the lines from this source that were skipped due to filter conditions.



# File 'lib/csv-diff/source.rb', line 56

def skip_count
  @skip_count
end

#trim_whitespace ⇒ `Boolean` (readonly)

Returns True if leading/trailing whitespace should be stripped from fields.

Returns:

(Boolean) —

True if leading/trailing whitespace should be stripped from fields



# File 'lib/csv-diff/source.rb', line 41

def trim_whitespace
  @trim_whitespace
end
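
The effect of this flag on a single field value can be sketched as follows. The helper trim_field is hypothetical; it copies the one-line trimming step from the index_source listing. In that listing the key values appear to be taken from the raw row, not the trimmed values, so whitespace differences in key fields may still produce distinct keys.

```ruby
# Hypothetical helper mirroring the trimming step in index_source:
# non-nil values are converted to a String and stripped when the flag is set.
def trim_field(val, trim_whitespace)
  val = val.to_s.strip if val && trim_whitespace
  val
end

p trim_field('  Alice ', true)   # leading/trailing whitespace removed
p trim_field('  Alice ', false)  # value left untouched
p trim_field(nil, true)          # nil stays nil
```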

#warnings ⇒ `Array<String>` (readonly)

Returns An array of any warnings encountered while processing the source.

Returns:

(Array<String>) —

An array of any warnings encountered while processing the source.



# File 'lib/csv-diff/source.rb', line 50

def warnings
  @warnings
end

Instance Method Details

#[](key) ⇒ `Hash`

Returns the row in the CSV source corresponding to the supplied key.

Parameters:

key (String) —

The unique key to use to look up the row.

Returns:

(Hash) —

The fields for the line corresponding to key, or nil if the key is not recognised.



# File 'lib/csv-diff/source.rb', line 134

def [](key)
    @lines[key]
end
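
Because the lookup is a plain Hash access, the key passed in has to match the form produced during indexing: per the index_source listing, key field values are joined with '~' and upper-cased when the source is not case sensitive. A minimal sketch, using a hypothetical build_key helper and a hand-built Hash in place of a real Source:

```ruby
# Hypothetical helper: composes a lookup key in the same form index_source uses
def build_key(values, case_sensitive: true)
  values.map { |v| case_sensitive ? v.to_s : v.to_s.upcase }.join('~')
end

# Stand-in for the @lines hash of a case-insensitive source
lines = {
  build_key(['emea', 42], case_sensitive: false) => { 'Region' => 'emea', 'Id' => '42' }
}

key = build_key(['EMEA', '42'], case_sensitive: false)
puts key
puts lines[key].inspect        # the stored row
puts lines['missing'].inspect  # nil for an unrecognised key
```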

#index_source ⇒ `Object`

Given an array of lines, where each line is an array of fields, indexes the array contents so that it can be looked up by key.

# File 'lib/csv-diff/source.rb', line 141

def index_source
    @lines = {}
    @index = Hash.new{ |h, k| h[k] = [] }
    if @field_names
        index_fields
        include_filter = convert_filter(@include, @field_names)
        exclude_filter = convert_filter(@exclude, @field_names)
    end
    @line_count = 0
    @skip_count = 0
    @dup_count = 0
    line_num = 0
    @data.each do |row|
        line_num += 1
        next if line_num == 1 && @field_names && @ignore_header
        unless @field_names
            if row.class.name == 'CSV::Row'
                @field_names = row.headers.each_with_index.map{ |f, i| f || i.to_s }
            else
                @field_names = row.each_with_index.map{ |f, i| f || i.to_s }
            end
            index_fields
            include_filter = convert_filter(@include, @field_names)
            exclude_filter = convert_filter(@exclude, @field_names)
            next
        end
        field_vals = row
        line = {}
        filter = false
        @field_names.each_with_index do |field, i|
            val = field_vals[i]
            val = val.to_s.strip if val && @trim_whitespace
            line[field] = val
            if include_filter && f = include_filter[i]
                filter = !check_filter(f, line[field])
            end
            if exclude_filter && f = exclude_filter[i]
                filter = check_filter(f, line[field])
            end
            break if filter
        end
        if filter
            @skip_count += 1
            next
        end
        key_values = @key_field_indexes.map{ |kf| @case_sensitive ?
                                                  field_vals[kf].to_s :
                                                  field_vals[kf].to_s.upcase }
        key = key_values.join('~')
        parent_key = key_values[0...(@parent_fields.length)].join('~')
        if @lines[key]
            @warnings << "Duplicate key '#{key}' encountered at line #{line_num}"
            @dup_count += 1
            key += "[#{@dup_count}]"
        end
        @index[parent_key] << key
        @lines[key] = line
        @line_count += 1
    end
end
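
The duplicate-key and parent-index handling at the end of the loop is easier to follow in isolation. The sketch below is a simplified stand-alone version under stated assumptions: index_rows is a hypothetical function, rows are plain arrays, and header detection, filtering, trimming and case folding are all left out.

```ruby
# Simplified sketch of the indexing loop. key_idx lists the key columns;
# the first parent_len of them form the parent key.
def index_rows(rows, key_idx, parent_len)
  lines = {}
  index = Hash.new { |h, k| h[k] = [] }
  warnings = []
  dup_count = 0
  rows.each_with_index do |row, i|
    key_values = key_idx.map { |k| row[k].to_s }
    key = key_values.join('~')
    parent_key = key_values[0...parent_len].join('~')
    if lines[key]
      # A repeated key is kept, but under a suffixed key so nothing is overwritten
      warnings << "Duplicate key '#{key}' encountered at line #{i + 1}"
      dup_count += 1
      key += "[#{dup_count}]"
    end
    index[parent_key] << key
    lines[key] = row
  end
  [lines, index, warnings]
end

rows = [%w[Sales 1 Alice], %w[Sales 2 Bob], %w[Sales 1 Alicia], %w[Ops 1 Dan]]
lines, index, warnings = index_rows(rows, [0, 1], 1)
p lines.keys
p index
p warnings
```

With no parent fields (parent_len of 0), every key is filed under the empty-string parent key, which matches the flat-key case.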

#path? ⇒ `Boolean`

Returns true if the source has a real path, or false if the path is still the placeholder 'NA'.

Returns:

(Boolean)



# File 'lib/csv-diff/source.rb', line 124

def path?
    @path != 'NA'
end

#save_csv(file_path, options = {}) ⇒ `Object`

Save the data in this Source as a CSV at file_path.

Parameters:

options (Hash) (defaults to: {}) —

A set of options to pass to CSV.open to control how the CSV is generated.

# File 'lib/csv-diff/source.rb', line 208

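def save_csv(file_path, options = {})
    require 'csv'
    # Write a header row by default; caller-supplied options take precedence
    default_opts = {
        headers: @field_names, write_headers: true
    }
    # Double-splat so the options are passed as keywords (required on Ruby 3+)
    CSV.open(file_path, 'wb', **default_opts.merge(options)) do |csv|
        @data.each{ |rec| csv << rec }
    end
end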
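
The way default and caller-supplied CSV options combine can be tried without a Source object. The following is a sketch only: write_csv and demo_csv_text are hypothetical helpers, the data is assumed not to contain a header row, and the standard csv and tempfile libraries are assumed to be available.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical stand-alone equivalent of save_csv: a header row is written by
# default, and any caller-supplied CSV options override the defaults.
def write_csv(file_path, field_names, data, options = {})
  default_opts = { headers: field_names, write_headers: true }
  CSV.open(file_path, 'wb', **default_opts.merge(options)) do |csv|
    data.each { |rec| csv << rec }
  end
end

# Writes two sample rows to a temporary file and returns the file contents
def demo_csv_text(options = {})
  Tempfile.create(['source', '.csv']) do |f|
    f.close
    write_csv(f.path, %w[Id Name], [%w[1 Alice], %w[2 Bob]], options)
    File.read(f.path)
  end
end

puts demo_csv_text                 # comma-separated, with header row
puts demo_csv_text(col_sep: ';')   # same rows, semicolon-separated
```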

#to_hash ⇒ `Object`

Convert the data in this source to Array<Hash> using the field names as keys for the Hash in each row.

# File 'lib/csv-diff/source.rb', line 221

def to_hash
    @data.map do |row|
        hsh = {}
        @field_names.each_with_index.map{ |fld, i| hsh[fld] = row[i] }
        hsh
    end
end
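
For rows that are plain arrays, the same conversion can be written with Array#zip. This is a sketch with a hypothetical rows_to_hashes function taking the field names and data explicitly; as in the listing above, fields missing from a short row come out as nil.

```ruby
# Hypothetical stand-alone equivalent of to_hash for Array rows:
# pair each field name with the value at the same position.
def rows_to_hashes(field_names, data)
  data.map { |row| field_names.zip(row).to_h }
end

p rows_to_hashes(%w[Id Name], [%w[1 Alice], ['2']])
```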
