Module: ClusterKit::DataValidator

Defined in:
lib/clusterkit/data_validator.rb

Overview

Shared data validation methods for all algorithms

Class Method Summary

Class Method Details

.data_statistics(data) ⇒ Hash

Get data statistics for warnings/error context

Parameters:

  • data (Array)

    2D array

Returns:

  • (Hash)

    Statistics about the data: :n_samples, :n_features, and :data_range, plus :min_value and :max_value when data is non-empty



# File 'lib/clusterkit/data_validator.rb', line 102

def data_statistics(data)
  return { n_samples: 0, n_features: 0, data_range: 0.0 } if data.empty?

  n_samples = data.size
  n_features = data.first&.size || 0
  
  # Calculate data range for warnings
  min_val = Float::INFINITY
  max_val = -Float::INFINITY

  data.each do |row|
    row.each do |val|
      val_f = val.to_f
      min_val = val_f if val_f < min_val
      max_val = val_f if val_f > max_val
    end
  end

  data_range = max_val - min_val

  {
    n_samples: n_samples,
    n_features: n_features,
    data_range: data_range,
    min_value: min_val,
    max_value: max_val
  }
end
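
The early return for empty input yields a smaller hash than the normal path, so callers that read :min_value or :max_value may want to handle both shapes. A minimal sketch of the difference follows; `StatsSketch` is a hypothetical stand-in that copies the method body from the listing above so the snippet runs without the gem installed.

```ruby
# Hypothetical stand-in for ClusterKit::DataValidator.data_statistics,
# copied from the listing above (not the gem itself).
module StatsSketch
  def self.data_statistics(data)
    return { n_samples: 0, n_features: 0, data_range: 0.0 } if data.empty?

    min_val = Float::INFINITY
    max_val = -Float::INFINITY
    data.each do |row|
      row.each do |val|
        val_f = val.to_f
        min_val = val_f if val_f < min_val
        max_val = val_f if val_f > max_val
      end
    end

    {
      n_samples: data.size,
      n_features: data.first&.size || 0,
      data_range: max_val - min_val,
      min_value: min_val,
      max_value: max_val
    }
  end
end

stats = StatsSketch.data_statistics([[1, 2.5], [-3, 4]])
puts stats[:data_range]   # 7.0 (max 4.0 minus min -3.0)

empty_stats = StatsSketch.data_statistics([])
puts empty_stats.key?(:min_value)   # false: the early return omits it
```

Using `fetch` with a default (for example `stats.fetch(:min_value, nil)`) is one way to read the optional keys without assuming which path produced the hash.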

.validate_basic_structure(data) ⇒ Object

Validate basic data structure: the input must be a non-empty Array whose first element is an Array

Parameters:

  • data (Array)

    Data to validate

Raises:

  • (ArgumentError)

    If data structure is invalid



# File 'lib/clusterkit/data_validator.rb', line 10

def validate_basic_structure(data)
  raise ArgumentError, "Input must be an array" unless data.is_a?(Array)
  raise ArgumentError, "Input cannot be empty" if data.empty?

  first_row = data.first
  raise ArgumentError, "Input must be a 2D array (array of arrays)" unless first_row.is_a?(Array)
end
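
Note that only the first element is inspected here; later rows are left to validate_row_consistency. The sketch below illustrates this with a hypothetical stand-in module (`StructureSketch`) and a hypothetical helper (`error_for`) that returns the error message, or nil when the check passes.

```ruby
# Hypothetical stand-in copying validate_basic_structure from the listing above.
module StructureSketch
  def self.validate_basic_structure(data)
    raise ArgumentError, "Input must be an array" unless data.is_a?(Array)
    raise ArgumentError, "Input cannot be empty" if data.empty?

    first_row = data.first
    raise ArgumentError, "Input must be a 2D array (array of arrays)" unless first_row.is_a?(Array)
  end

  # Illustration-only helper: the error message, or nil if validation passes.
  def self.error_for(data)
    validate_basic_structure(data)
    nil
  rescue ArgumentError => e
    e.message
  end
end

flat_error  = StructureSketch.error_for([1, 2, 3])
mixed_error = StructureSketch.error_for([[1, 2], "oops"])

puts flat_error          # Input must be a 2D array (array of arrays)
puts mixed_error.inspect # nil: the second element is never examined
```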

.validate_clustering(data, check_finite: false) ⇒ Object

Validation for clustering algorithms (KMeans, HDBSCAN) with specific error messages

Parameters:

  • data (Array)

    2D array to validate

  • check_finite (Boolean) (defaults to: false)

    Whether to check for NaN/Infinite values

Raises:

  • (ArgumentError)

    If data is invalid



# File 'lib/clusterkit/data_validator.rb', line 77

def validate_clustering(data, check_finite: false)
  raise ArgumentError, "Data must be an array" unless data.is_a?(Array)
  raise ArgumentError, "Data cannot be empty" if data.empty?
  raise ArgumentError, "Data must be 2D array" unless data.first.is_a?(Array)

  validate_row_consistency(data)
  validate_numeric_types(data)
  validate_finite_values(data) if check_finite
end
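
Because check_finite defaults to false here, NaN values pass unless the caller opts in. A minimal sketch of that behavior follows; `ClusteringSketch` is a hypothetical stand-in, and it leaves out the row-consistency and numeric-type checks of the real method for brevity.

```ruby
# Hypothetical stand-in, reduced to the structure checks and the optional
# finite check. The real method also calls validate_row_consistency and
# validate_numeric_types.
module ClusteringSketch
  def self.validate_clustering(data, check_finite: false)
    raise ArgumentError, "Data must be an array" unless data.is_a?(Array)
    raise ArgumentError, "Data cannot be empty" if data.empty?
    raise ArgumentError, "Data must be 2D array" unless data.first.is_a?(Array)

    validate_finite_values(data) if check_finite
  end

  def self.validate_finite_values(data)
    data.each_with_index do |row, i|
      row.each_with_index do |val, j|
        if val.is_a?(Float) && (val.nan? || val.infinite?)
          raise ArgumentError, "Element at position [#{i}, #{j}] is NaN or Infinite"
        end
      end
    end
  end
end

points = [[1.0, 2.0], [Float::NAN, 4.0]]

# Default: the NaN is not examined, so no error is raised.
lenient_result = ClusteringSketch.validate_clustering(points)

# Opt-in: the first non-finite Float is reported with its position.
strict_message = nil
begin
  ClusteringSketch.validate_clustering(points, check_finite: true)
rescue ArgumentError => e
  strict_message = e.message
end

puts strict_message   # Element at position [1, 0] is NaN or Infinite
```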

.validate_finite_values(data) ⇒ Object

Validate finite values (no NaN or Infinite)

Parameters:

  • data (Array)

    2D array to validate

Raises:

  • (ArgumentError)

    If any float is NaN or Infinite



# File 'lib/clusterkit/data_validator.rb', line 51

def validate_finite_values(data)
  data.each_with_index do |row, i|
    row.each_with_index do |val, j|
      # Only check for NaN/Infinite on floats
      if val.is_a?(Float) && (val.nan? || val.infinite?)
        raise ArgumentError, "Element at position [#{i}, #{j}] is NaN or Infinite"
      end
    end
  end
end
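
The method raises at the first offending element. When debugging a dataset it can help to see every bad position at once; the sketch below uses the same predicate for that purpose. `non_finite_positions` is a hypothetical helper, not part of ClusterKit.

```ruby
# Hypothetical helper (not in ClusterKit): applies the same Float-only
# predicate as validate_finite_values, but collects every [row, column]
# pair instead of raising on the first one.
def non_finite_positions(data)
  positions = []
  data.each_with_index do |row, i|
    row.each_with_index do |val, j|
      positions << [i, j] if val.is_a?(Float) && (val.nan? || val.infinite?)
    end
  end
  positions
end

sample = [[1, 2.0], [Float::NAN, -Float::INFINITY], [3, 4]]
bad = non_finite_positions(sample)
puts bad.inspect   # [[1, 0], [1, 1]]
```

Integers are skipped by the `is_a?(Float)` guard, which matches the comment in the listing above.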

.validate_numeric_types(data) ⇒ Object

Validate that all elements are numeric

Parameters:

  • data (Array)

    2D array to validate

Raises:

  • (ArgumentError)

    If any element is not numeric



# File 'lib/clusterkit/data_validator.rb', line 38

def validate_numeric_types(data)
  data.each_with_index do |row, i|
    row.each_with_index do |val, j|
      unless val.is_a?(Numeric)
        raise ArgumentError, "Element at position [#{i}, #{j}] is not numeric"
      end
    end
  end
end
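
The `is_a?(Numeric)` test accepts any Numeric subclass but rejects numeric-looking strings, nil, and booleans. A short sketch of which plain Ruby values pass that predicate (the candidate list is made up for illustration):

```ruby
# Illustration only: split sample values by the predicate used in
# validate_numeric_types.
candidates = [1, 2.5, Rational(1, 3), Complex(1, 2), "3.0", nil, true]

accepted, rejected = candidates.partition { |value| value.is_a?(Numeric) }

puts accepted.size       # 4 (Integer, Float, Rational, Complex)
puts rejected.inspect    # ["3.0", nil, true]
```

Data read from CSV or JSON often arrives as strings, so converting values (for example with `Float(value)`) before validation is one way to avoid the "is not numeric" error.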

.validate_pca(data) ⇒ Object

Validation for PCA with specific error messages (same as clustering but without finite checks)

Parameters:

  • data (Array)

    2D array to validate

Raises:

  • (ArgumentError)

    If data is invalid



# File 'lib/clusterkit/data_validator.rb', line 90

def validate_pca(data)
  raise ArgumentError, "Data must be an array" unless data.is_a?(Array)
  raise ArgumentError, "Data cannot be empty" if data.empty?
  raise ArgumentError, "Data must be 2D array" unless data.first.is_a?(Array)

  validate_row_consistency(data)
  validate_numeric_types(data)
end

.validate_row_consistency(data) ⇒ Object

Validate row consistency (all rows have same length)

Parameters:

  • data (Array)

    2D array to validate

Raises:

  • (ArgumentError)

    If rows have different lengths



# File 'lib/clusterkit/data_validator.rb', line 21

def validate_row_consistency(data)
  row_length = data.first.length

  data.each_with_index do |row, i|
    unless row.is_a?(Array)
      raise ArgumentError, "Row #{i} is not an array"
    end

    if row.length != row_length
      raise ArgumentError, "All rows must have the same length (row #{i} has #{row.length} elements, expected #{row_length})"
    end
  end
end
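
The expected length is taken from the first row via `data.first.length`, so this method relies on the caller having already confirmed that the first element is an Array. The sketch below shows the ragged-row message and what happens when that precondition is skipped; `RowSketch` is a hypothetical stand-in copying the listing above.

```ruby
# Hypothetical stand-in copying validate_row_consistency from the listing above.
module RowSketch
  def self.validate_row_consistency(data)
    row_length = data.first.length

    data.each_with_index do |row, i|
      raise ArgumentError, "Row #{i} is not an array" unless row.is_a?(Array)

      if row.length != row_length
        raise ArgumentError, "All rows must have the same length (row #{i} has #{row.length} elements, expected #{row_length})"
      end
    end
  end
end

# A ragged dataset: the second row is one element short.
ragged_message = nil
begin
  RowSketch.validate_row_consistency([[1, 2, 3], [4, 5], [6, 7, 8]])
rescue ArgumentError => e
  ragged_message = e.message
end
puts ragged_message
# All rows must have the same length (row 1 has 2 elements, expected 3)

# Without a prior structure check, a non-Array first element that lacks
# #length fails with NoMethodError rather than ArgumentError.
unchecked_error = nil
begin
  RowSketch.validate_row_consistency([1, [2]])
rescue NoMethodError => e
  unchecked_error = e.class
end
puts unchecked_error   # NoMethodError
```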

.validate_standard(data, check_finite: true) ⇒ Object

Standard validation for most algorithms

Parameters:

  • data (Array)

    2D array to validate

  • check_finite (Boolean) (defaults to: true)

    Whether to check for NaN/Infinite values

Raises:

  • (ArgumentError)

    If data is invalid



# File 'lib/clusterkit/data_validator.rb', line 66

def validate_standard(data, check_finite: true)
  validate_basic_structure(data)
  validate_row_consistency(data)
  validate_numeric_types(data)
  validate_finite_values(data) if check_finite
end
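
The checks run in a fixed order (structure, row lengths, numeric types, then finite values), so data with several problems reports only the first one encountered. A minimal sketch of that ordering follows; `StandardSketch` is a hypothetical stand-in that condenses the four methods listed above, and `first_error` is an illustration-only helper.

```ruby
# Hypothetical stand-in condensing the four validators from the listings above.
module StandardSketch
  def self.validate_standard(data, check_finite: true)
    validate_basic_structure(data)
    validate_row_consistency(data)
    validate_numeric_types(data)
    validate_finite_values(data) if check_finite
  end

  def self.validate_basic_structure(data)
    raise ArgumentError, "Input must be an array" unless data.is_a?(Array)
    raise ArgumentError, "Input cannot be empty" if data.empty?
    raise ArgumentError, "Input must be a 2D array (array of arrays)" unless data.first.is_a?(Array)
  end

  def self.validate_row_consistency(data)
    row_length = data.first.length
    data.each_with_index do |row, i|
      raise ArgumentError, "Row #{i} is not an array" unless row.is_a?(Array)
      next if row.length == row_length

      raise ArgumentError, "All rows must have the same length (row #{i} has #{row.length} elements, expected #{row_length})"
    end
  end

  def self.validate_numeric_types(data)
    data.each_with_index do |row, i|
      row.each_with_index do |val, j|
        raise ArgumentError, "Element at position [#{i}, #{j}] is not numeric" unless val.is_a?(Numeric)
      end
    end
  end

  def self.validate_finite_values(data)
    data.each_with_index do |row, i|
      row.each_with_index do |val, j|
        next unless val.is_a?(Float) && (val.nan? || val.infinite?)

        raise ArgumentError, "Element at position [#{i}, #{j}] is NaN or Infinite"
      end
    end
  end

  # Illustration-only helper: the first error message, or nil if data is valid.
  def self.first_error(data, **options)
    validate_standard(data, **options)
    nil
  rescue ArgumentError => e
    e.message
  end
end

# Ragged AND non-numeric: the row-length check fires first.
both_problems = StandardSketch.first_error([[1, 2], [3, "x", 5]])
# Rows fixed: now the numeric-type check is reached.
type_problem  = StandardSketch.first_error([[1, 2], [3, "x"]])
# NaN is rejected by default and accepted when check_finite is false.
nan_default   = StandardSketch.first_error([[1, 2], [3, Float::NAN]])
nan_allowed   = StandardSketch.first_error([[1, 2], [3, Float::NAN]], check_finite: false)

puts both_problems
puts type_problem
puts nan_default
puts nan_allowed.inspect
```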