Class: CSVDiff::Source

Inherits:
Object
  • Object
Defined in:
lib/csv-diff/source.rb

Overview

Represents an input (i.e. the left/from or right/to input) to the diff process.

Direct Known Subclasses

CSVSource, XMLSource

Constructor Details

#initialize(options = {}) ⇒ Source

Creates a new diff source.

A diff source must contain at least one field that will be used as the key to identify the same record in a different version of this file. If not specified via one of the options, the first field is assumed to be the unique key.

If multiple fields combine to form a unique key, the combined fields are considered as a single unique identifier. If your key represents data that can be represented as a tree, you can instead break your key fields into :parent_fields and :child_fields. By doing this, if a child key is deleted from one parent, and added to another, that will be reported as an update, with a change to the parent key part(s) of the record.

All key options can be specified either by field name, or by field index (0 based).
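
As a rough illustration of how the key options combine, the sketch below mirrors the option handling in a standalone method. `resolve_key_fields` is a hypothetical helper, not part of CSVDiff, and the field names are made up; it only shows which parent, child and key fields would result from a given options hash.

```ruby
# Hypothetical helper (not part of CSVDiff): resolves the key options
# into parent, child and combined key fields.
def resolve_key_fields(options = {})
  tree_opts = [:parent_field, :parent_fields, :child_field, :child_fields]
  kf = options.fetch(:key_field, options[:key_fields])
  if (options.keys & tree_opts).empty? && kf
    # A flat key: every key field is treated as a child field
    key = [kf].flatten
    { parent: [], child: key, key: key }
  else
    # A tree-style key; with no options at all, the first field (index 0) is the key
    parent = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
    child = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
    { parent: parent, child: child, key: parent + child }
  end
end

p resolve_key_fields({})[:key]                                # no options: first field
p resolve_key_fields(key_fields: ['Region', 'Id'])[:key]      # composite key by name
p resolve_key_fields(parent_field: 'Dept', child_field: 1)[:key]  # name and index mixed
```

Because names and 0-based indexes are both accepted, the last call mixes the two forms.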

Parameters:

  • options (Hash) (defaults to: {})

    An options hash.

Options Hash (options):

  • :field_names (Array<String>)

    The names of each of the fields in source.

  • :ignore_header (Boolean)

    If true, and :field_names has been specified, then the first row of the file is ignored.

  • :key_field (String)

    The name of the field that uniquely identifies each row.

  • :key_fields (Array<String>)

    The names of the fields that uniquely identify each row.

  • :parent_field (String)

    The name of the field(s) that identify a parent within which sibling order should be checked.

  • :child_field (String)

    The name of the field(s) that uniquely identify a child of a parent.

  • :case_sensitive (Boolean)

    If true (the default), keys are indexed as-is; if false, the index is built in upper-case for case-insensitive comparisons.

  • :include (Hash)

    A hash of field name(s) or index(es) to regular expression(s). Only source rows whose field values satisfy the regular expressions will be indexed and included in the diff process.

  • :exclude (Hash)

    A hash of field name(s) or index(es) to regular expression(s). Source rows with a field value that satisfies the regular expressions will be excluded from the diff process.
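
The :include and :exclude options can be pictured with a simplified sketch. `skip_row?` is a hypothetical helper, not the library's own filter code; it assumes rows are hashes keyed by field name and that every filter value is a Regexp.

```ruby
# Hypothetical helper (not part of CSVDiff): returns true when a row
# should be left out of the diff.
def skip_row?(row, include_filter: nil, exclude_filter: nil)
  row.each do |field, value|
    # An include filter on this field must match, or the row is skipped
    if include_filter && (pattern = include_filter[field])
      return true unless pattern.match?(value.to_s)
    end
    # An exclude filter on this field must not match
    if exclude_filter && (pattern = exclude_filter[field])
      return true if pattern.match?(value.to_s)
    end
  end
  false
end

rows = [
  { 'Name' => 'Alice', 'Status' => 'active' },
  { 'Name' => 'Bob',   'Status' => 'retired' },
  { 'Name' => 'Carol', 'Status' => 'active' }
]
kept = rows.reject do |row|
  skip_row?(row,
            include_filter: { 'Status' => /\Aactive\z/ },
            exclude_filter: { 'Name' => /\AC/ })
end
p kept.map { |row| row['Name'] }
```

Here only rows with an active status pass the include filter, and any row whose name starts with "C" is then dropped by the exclude filter.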



# File 'lib/csv-diff/source.rb', line 102

def initialize(options = {})
    if (options.keys & [:parent_field, :parent_fields, :child_field, :child_fields]).empty? &&
       (kf = options.fetch(:key_field, options[:key_fields]))
        @key_fields = [kf].flatten
        @parent_fields = []
        @child_fields = @key_fields
    else
        @parent_fields = [options.fetch(:parent_field, options[:parent_fields]) || []].flatten
        @child_fields = [options.fetch(:child_field, options[:child_fields]) || [0]].flatten
        @key_fields = @parent_fields + @child_fields
    end
    @field_names = options[:field_names]
    @case_sensitive = options.fetch(:case_sensitive, true)
    @trim_whitespace = options.fetch(:trim_whitespace, false)
    @ignore_header = options[:ignore_header]
    @include = options[:include]
    @exclude = options[:exclude]
    @path = options.fetch(:path, 'NA') unless @path
    @warnings = []
end

Instance Attribute Details

#case_sensitive ⇒ Boolean (readonly) Also known as: case_sensitive?

Returns true if the source has been indexed with case-sensitive keys, or false if it has been indexed using upper-case key values.

Returns:

  • (Boolean)

    True if the source has been indexed with case-sensitive keys, or false if it has been indexed using upper-case key values.



# File 'lib/csv-diff/source.rb', line 37

def case_sensitive
  @case_sensitive
end

#child_field_indexes ⇒ Array<Fixnum> (readonly)

Returns The indexes of the child fields in the source file.

Returns:

  • (Array<Fixnum>)

    The indexes of the child fields in the source file.



# File 'lib/csv-diff/source.rb', line 32

def child_field_indexes
  @child_field_indexes
end

#child_fields ⇒ Array<String> (readonly)

Returns The names of the field(s) that distinguish a child of a parent record.

Returns:

  • (Array<String>)

    The names of the field(s) that distinguish a child of a parent record.



# File 'lib/csv-diff/source.rb', line 22

def child_fields
  @child_fields
end

#data ⇒ Array<Array> (readonly)

Returns The data for this source.

Returns:

  • (Array<Array>)

    The data for this source.



# File 'lib/csv-diff/source.rb', line 10

def data
  @data
end

#dup_count ⇒ Fixnum (readonly)

Returns A count of the lines from this source that had the same key value as another line.

Returns:

  • (Fixnum)

    A count of the lines from this source that had the same key value as another line.



# File 'lib/csv-diff/source.rb', line 59

def dup_count
  @dup_count
end

#field_names ⇒ Array<String> (readonly)

Returns The names of the fields in the source file.

Returns:

  • (Array<String>)

    The names of the fields in the source file



# File 'lib/csv-diff/source.rb', line 13

def field_names
  @field_names
end

#index ⇒ Hash<String,Array<String>> (readonly)

Returns A hash containing each parent key, and an Array of the child keys it is a parent of.

Returns:

  • (Hash<String,Array<String>>)

    A hash containing each parent key, and an Array of the child keys it is a parent of.



# File 'lib/csv-diff/source.rb', line 47

def index
  @index
end

#key_field_indexes ⇒ Array<Fixnum> (readonly)

Returns The indexes of the key fields in the source file.

Returns:

  • (Array<Fixnum>)

    The indexes of the key fields in the source file.



# File 'lib/csv-diff/source.rb', line 26

def key_field_indexes
  @key_field_indexes
end

#key_fields ⇒ Array<String> (readonly)

Returns The names of the field(s) that uniquely identify each row.

Returns:

  • (Array<String>)

    The names of the field(s) that uniquely identify each row.



# File 'lib/csv-diff/source.rb', line 16

def key_fields
  @key_fields
end

#line_count ⇒ Fixnum (readonly)

Returns A count of the lines processed from this source. Excludes any header and duplicate records identified during indexing.

Returns:

  • (Fixnum)

    A count of the lines processed from this source. Excludes any header and duplicate records identified during indexing.



# File 'lib/csv-diff/source.rb', line 53

def line_count
  @line_count
end

#lines ⇒ Hash<String,Hash> (readonly)

Returns A hash containing each line of the source, keyed on the values of the key_fields.

Returns:

  • (Hash<String,Hash>)

    A hash containing each line of the source, keyed on the values of the key_fields.



# File 'lib/csv-diff/source.rb', line 44

def lines
  @lines
end

#parent_field_indexes ⇒ Array<Fixnum> (readonly)

Returns The indexes of the parent fields in the source file.

Returns:

  • (Array<Fixnum>)

    The indexes of the parent fields in the source file.



# File 'lib/csv-diff/source.rb', line 29

def parent_field_indexes
  @parent_field_indexes
end

#parent_fields ⇒ Array<String> (readonly)

Returns The names of the field(s) that identify a common parent of child records.

Returns:

  • (Array<String>)

    The names of the field(s) that identify a common parent of child records.



# File 'lib/csv-diff/source.rb', line 19

def parent_fields
  @parent_fields
end

#path ⇒ String

Returns the path to the source file.

Returns:

  • (String)

    the path to the source file



# File 'lib/csv-diff/source.rb', line 8

def path
  @path
end

#skip_count ⇒ Fixnum (readonly)

Returns A count of the lines from this source that were skipped due to filter conditions.

Returns:

  • (Fixnum)

    A count of the lines from this source that were skipped due to filter conditions.



# File 'lib/csv-diff/source.rb', line 56

def skip_count
  @skip_count
end

#trim_whitespace ⇒ Boolean (readonly)

Returns True if leading/trailing whitespace should be stripped from fields.

Returns:

  • (Boolean)

    True if leading/trailing whitespace should be stripped from fields



# File 'lib/csv-diff/source.rb', line 41

def trim_whitespace
  @trim_whitespace
end

#warnings ⇒ Array<String> (readonly)

Returns An array of any warnings encountered while processing the source.

Returns:

  • (Array<String>)

    An array of any warnings encountered while processing the source.



# File 'lib/csv-diff/source.rb', line 50

def warnings
  @warnings
end

Instance Method Details

#[](key) ⇒ Hash

Returns the row in the CSV source corresponding to the supplied key.

Parameters:

  • key (String)

    The unique key to use to lookup the row.

Returns:

  • (Hash)

    The fields for the line corresponding to key, or nil if the key is not recognised.



# File 'lib/csv-diff/source.rb', line 134

def [](key)
    @lines[key]
end
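
To illustrate the kind of key this lookup expects, the sketch below builds a composite key the way the #index_source listing does: key values are converted to strings, upper-cased when the source is case-insensitive, and joined with '~'. `build_key` is a hypothetical helper and a plain Hash stands in for the source's indexed lines; neither is part of CSVDiff.

```ruby
# Hypothetical helper (not part of CSVDiff): builds a lookup key from
# one or more key values.
def build_key(values, case_sensitive: true)
  values.map { |v| case_sensitive ? v.to_s : v.to_s.upcase }.join('~')
end

# A plain Hash standing in for the indexed lines of a source
lines = {
  build_key(['emea', 42], case_sensitive: false) =>
    { 'Region' => 'emea', 'Id' => '42', 'Name' => 'Widget' }
}

p build_key(['emea', 42], case_sensitive: false)            # the stored key
p lines[build_key(['Emea', '42'], case_sensitive: false)]   # found despite case
p lines['missing']                                          # nil for unknown keys
```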

#index_source ⇒ Object

Given an array of lines, where each line is an array of fields, indexes the array contents so that it can be looked up by key.



# File 'lib/csv-diff/source.rb', line 141

def index_source
    @lines = {}
    @index = Hash.new{ |h, k| h[k] = [] }
    if @field_names
        index_fields
        include_filter = convert_filter(@include, @field_names)
        exclude_filter = convert_filter(@exclude, @field_names)
    end
    @line_count = 0
    @skip_count = 0
    @dup_count = 0
    line_num = 0
    @data.each do |row|
        line_num += 1
        next if line_num == 1 && @field_names && @ignore_header
        unless @field_names
            if row.class.name == 'CSV::Row'
                @field_names = row.headers.each_with_index.map{ |f, i| f || i.to_s }
            else
                @field_names = row.each_with_index.map{ |f, i| f || i.to_s }
            end
            index_fields
            include_filter = convert_filter(@include, @field_names)
            exclude_filter = convert_filter(@exclude, @field_names)
            next
        end
        field_vals = row
        line = {}
        filter = false
        @field_names.each_with_index do |field, i|
            val = field_vals[i]
            val = val.to_s.strip if val && @trim_whitespace
            line[field] = val
            if include_filter && f = include_filter[i]
                filter = !check_filter(f, line[field])
            end
            if exclude_filter && f = exclude_filter[i]
                filter = check_filter(f, line[field])
            end
            break if filter
        end
        if filter
            @skip_count += 1
            next
        end
        key_values = @key_field_indexes.map{ |kf| @case_sensitive ?
                                                  field_vals[kf].to_s :
                                                  field_vals[kf].to_s.upcase }
        key = key_values.join('~')
        parent_key = key_values[0...(@parent_fields.length)].join('~')
        if @lines[key]
            @warnings << "Duplicate key '#{key}' encountered at line #{line_num}"
            @dup_count += 1
            key += "[#{@dup_count}]"
        end
        @index[parent_key] << key
        @lines[key] = line
        @line_count += 1
    end
end
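
The indexing step can be condensed into a short standalone sketch that leaves out header handling, filtering, whitespace trimming and case folding. `build_index` and its arguments are hypothetical names chosen for illustration; the sketch only shows how the parent index is populated and how duplicate keys are renamed and counted.

```ruby
# Hypothetical helper (not part of CSVDiff): indexes rows by key.
# key_indexes lists the key columns, parents first; parent_count says
# how many of those columns form the parent key.
def build_index(rows, key_indexes, parent_count)
  lines = {}
  index = Hash.new { |h, k| h[k] = [] }
  warnings = []
  dup_count = 0
  rows.each_with_index do |row, i|
    key_values = key_indexes.map { |k| row[k].to_s }
    key = key_values.join('~')
    parent_key = key_values[0...parent_count].join('~')
    if lines.key?(key)
      # A repeated key is kept, but under a suffixed name
      warnings << "Duplicate key '#{key}' encountered at line #{i + 1}"
      dup_count += 1
      key += "[#{dup_count}]"
    end
    index[parent_key] << key
    lines[key] = row
  end
  { lines: lines, index: index, warnings: warnings, dup_count: dup_count }
end

rows = [%w[Sales Ann], %w[Sales Bob], %w[Ops Cy], %w[Sales Ann]]
result = build_index(rows, [0, 1], 1)
p result[:index]['Sales']
p result[:warnings]
```

The second "Sales, Ann" row is stored under a suffixed key, so it still appears in the parent's list of children rather than overwriting the first.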

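#path? ⇒ Boolean

Returns true if a path has been set for this source, i.e. the path is not the default value 'NA'.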

Returns:

  • (Boolean)


# File 'lib/csv-diff/source.rb', line 124

def path?
    @path != 'NA'
end

#save_csv(file_path, options = {}) ⇒ Object

Save the data in this Source as a CSV at file_path.

Parameters:

  • file_path (String)

    The path of the CSV file to write.

  • options (Hash) (defaults to: {})

    A set of options to pass to CSV.open to control how the CSV is generated.



# File 'lib/csv-diff/source.rb', line 208

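def save_csv(file_path, options = {})
    require 'csv'
    default_opts = {
        # Use the source's field names as the header row
        headers: @field_names, write_headers: true
    }
    # Double-splat so the options are passed as keyword arguments
    CSV.open(file_path, 'wb', **default_opts.merge(options)) do |csv|
        @data.each{ |rec| csv << rec }
    end
end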
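
A minimal standalone sketch of the same approach, using only Ruby's csv library. `write_rows` and the sample data are hypothetical; caller options are merged over the defaults, so something like `col_sep` can be changed without losing the header row.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper (not part of CSVDiff): writes rows to file_path,
# with field_names as the header row unless options override it.
def write_rows(file_path, field_names, rows, options = {})
  default_opts = { headers: field_names, write_headers: true }
  CSV.open(file_path, 'wb', **default_opts.merge(options)) do |csv|
    rows.each { |rec| csv << rec }
  end
end

Tempfile.create(['source', '.csv']) do |tmp|
  tmp.close
  write_rows(tmp.path, %w[Id Name], [[1, 'Ann'], [2, 'Bob']], col_sep: ';')
  puts File.read(tmp.path)
end
```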

#to_hash ⇒ Object

Convert the data in this source to Array<Hash> using the field names as keys for the Hash in each row.



# File 'lib/csv-diff/source.rb', line 221

def to_hash
    @data.map do |row|
        hsh = {}
        # Pair each field name with the value at the same position in the row
        @field_names.each_with_index{ |fld, i| hsh[fld] = row[i] }
        hsh
    end
end
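
The same conversion can be sketched outside the class with `Array#zip`. `rows_to_hashes` and the sample data are hypothetical and shown only to make the shape of the result concrete.

```ruby
# Hypothetical helper (not part of CSVDiff): pairs each field name with
# the value at the same position in every row.
def rows_to_hashes(field_names, rows)
  rows.map { |row| field_names.zip(row).to_h }
end

field_names = %w[Id Name]
rows = [[1, 'Ann'], [2, 'Bob'], [3]]
rows_to_hashes(field_names, rows).each { |hsh| p hsh }
```

As in the method above, a row with fewer values than there are field names gets nil for the missing fields, and values beyond the last field name are ignored.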