Module: ClusterKit::DataValidator
Defined in: lib/clusterkit/data_validator.rb
Overview
Shared data validation methods for all algorithms
Class Method Summary
- .data_statistics(data) ⇒ Hash
  Get data statistics for warnings/error context.
- .validate_basic_structure(data) ⇒ Object
  Validate basic data structure and types.
- .validate_clustering(data, check_finite: false) ⇒ Object
  Validation for clustering algorithms (KMeans, HDBSCAN) with specific error messages.
- .validate_finite_values(data) ⇒ Object
  Validate finite values (no NaN or Infinite).
- .validate_numeric_types(data) ⇒ Object
  Validate that all elements are numeric.
- .validate_pca(data) ⇒ Object
  Validation for PCA with specific error messages (same as clustering but without finite checks).
- .validate_row_consistency(data) ⇒ Object
  Validate row consistency (all rows have same length).
- .validate_standard(data, check_finite: true) ⇒ Object
  Standard validation for most algorithms.
Class Method Details
.data_statistics(data) ⇒ Hash
Get data statistics for warnings/error context
# File 'lib/clusterkit/data_validator.rb', line 102

def data_statistics(data)
  return { n_samples: 0, n_features: 0, data_range: 0.0 } if data.empty?

  n_samples = data.size
  n_features = data.first&.size || 0

  # Calculate data range for warnings
  min_val = Float::INFINITY
  max_val = -Float::INFINITY

  data.each do |row|
    row.each do |val|
      val_f = val.to_f
      min_val = val_f if val_f < min_val
      max_val = val_f if val_f > max_val
    end
  end

  data_range = max_val - min_val

  {
    n_samples: n_samples,
    n_features: n_features,
    data_range: data_range,
    min_value: min_val,
    max_value: max_val
  }
end
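A minimal usage sketch, assuming the method behaves exactly as in the listing above. So that the snippet runs without the gem installed, it copies the listing into a hypothetical stand-in module, `StatsSketch`; with ClusterKit loaded you would presumably call `ClusterKit::DataValidator.data_statistics` instead.

```ruby
# Hypothetical stand-in that mirrors the listing above so the example is
# self-contained; it is not the gem itself.
module StatsSketch
  def self.data_statistics(data)
    return { n_samples: 0, n_features: 0, data_range: 0.0 } if data.empty?

    n_samples = data.size
    n_features = data.first&.size || 0
    min_val = Float::INFINITY
    max_val = -Float::INFINITY

    # Scan every element once, converting to Float before comparing.
    data.each do |row|
      row.each do |val|
        val_f = val.to_f
        min_val = val_f if val_f < min_val
        max_val = val_f if val_f > max_val
      end
    end

    { n_samples: n_samples, n_features: n_features,
      data_range: max_val - min_val, min_value: min_val, max_value: max_val }
  end
end

stats = StatsSketch.data_statistics([[1, 2.5], [-3, 4]])
puts stats[:n_samples]   # 2
puts stats[:data_range]  # 7.0 (4.0 - -3.0)

# The empty-input hash has only three keys, so read the others defensively.
empty = StatsSketch.data_statistics([])
puts empty.fetch(:min_value, "n/a")  # n/a
```

Note that, per the listing, the hash returned for empty input omits `:min_value` and `:max_value`, so callers that read those keys may want `fetch` with a default.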
.validate_basic_structure(data) ⇒ Object
Validate basic data structure and types
# File 'lib/clusterkit/data_validator.rb', line 10

def validate_basic_structure(data)
  raise ArgumentError, "Input must be an array" unless data.is_a?(Array)
  raise ArgumentError, "Input cannot be empty" if data.empty?

  first_row = data.first
  raise ArgumentError, "Input must be a 2D array (array of arrays)" unless first_row.is_a?(Array)
end
.validate_clustering(data, check_finite: false) ⇒ Object
Validation for clustering algorithms (KMeans, HDBSCAN) with specific error messages
# File 'lib/clusterkit/data_validator.rb', line 77

def validate_clustering(data, check_finite: false)
  raise ArgumentError, "Data must be an array" unless data.is_a?(Array)
  raise ArgumentError, "Data cannot be empty" if data.empty?
  raise ArgumentError, "Data must be 2D array" unless data.first.is_a?(Array)

  validate_row_consistency(data)
  validate_numeric_types(data)
  validate_finite_values(data) if check_finite
end
.validate_finite_values(data) ⇒ Object
Validate finite values: raises ArgumentError if any Float element is NaN or Infinite. Non-Float elements are not checked.
# File 'lib/clusterkit/data_validator.rb', line 51

def validate_finite_values(data)
  data.each_with_index do |row, i|
    row.each_with_index do |val, j|
      # Only check for NaN/Infinite on floats
      if val.is_a?(Float) && (val.nan? || val.infinite?)
        raise ArgumentError, "Element at position [#{i}, #{j}] is NaN or Infinite"
      end
    end
  end
end
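As an illustration of the Float-only check, the sketch below copies the listing into a hypothetical stand-in module, `FiniteSketch`, so it can run without the gem. It assumes nothing beyond what the listing shows.

```ruby
# Hypothetical stand-in copied from the listing above; not the gem itself.
module FiniteSketch
  def self.validate_finite_values(data)
    data.each_with_index do |row, i|
      row.each_with_index do |val, j|
        # Only Float values are inspected; everything else is skipped.
        if val.is_a?(Float) && (val.nan? || val.infinite?)
          raise ArgumentError, "Element at position [#{i}, #{j}] is NaN or Infinite"
        end
      end
    end
  end
end

# Mixed Integers and finite Floats pass without raising.
FiniteSketch.validate_finite_values([[1, 2.0], [3, 4.5]])

begin
  FiniteSketch.validate_finite_values([[1.0, 2.0], [Float::NAN, 4.0]])
rescue ArgumentError => e
  puts e.message  # Element at position [1, 0] is NaN or Infinite
end

# Non-numeric values are not this method's concern and pass silently.
FiniteSketch.validate_finite_values([["not a number"]])
```

Because non-Float values are skipped, this check is only meaningful after the elements have been confirmed numeric, which is the order the composite validators on this page use.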
.validate_numeric_types(data) ⇒ Object
Validate that all elements are numeric
# File 'lib/clusterkit/data_validator.rb', line 38

def validate_numeric_types(data)
  data.each_with_index do |row, i|
    row.each_with_index do |val, j|
      unless val.is_a?(Numeric)
        raise ArgumentError, "Element at position [#{i}, #{j}] is not numeric"
      end
    end
  end
end
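To make the `is_a?(Numeric)` rule concrete, here is a small sketch using a hypothetical stand-in module, `NumericSketch`, copied from the listing. The `error_for` helper is an assumption added for illustration and is not part of ClusterKit.

```ruby
# Hypothetical stand-in copied from the listing above; not the gem itself.
module NumericSketch
  def self.validate_numeric_types(data)
    data.each_with_index do |row, i|
      row.each_with_index do |val, j|
        unless val.is_a?(Numeric)
          raise ArgumentError, "Element at position [#{i}, #{j}] is not numeric"
        end
      end
    end
  end

  # Hypothetical helper: returns the error message, or nil when validation passes.
  def self.error_for(data)
    validate_numeric_types(data)
    nil
  rescue ArgumentError => e
    e.message
  end
end

# Integer, Float, and Rational are all subclasses of Numeric.
p NumericSketch.error_for([[1, 2.5], [Rational(1, 3), 4]])  # nil
# A numeric-looking String is still a String and is rejected.
p NumericSketch.error_for([[1, "2.5"]])  # "Element at position [0, 1] is not numeric"
# nil is rejected as well.
p NumericSketch.error_for([[1, 2], [nil, 4]])  # "Element at position [1, 0] is not numeric"
```

The check stops at the first offending element, so the message reports only one position even when several elements are invalid.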
.validate_pca(data) ⇒ Object
Validation for PCA with specific error messages. Runs the same checks as validate_clustering, but never checks for NaN or Infinite values.
# File 'lib/clusterkit/data_validator.rb', line 90

def validate_pca(data)
  raise ArgumentError, "Data must be an array" unless data.is_a?(Array)
  raise ArgumentError, "Data cannot be empty" if data.empty?
  raise ArgumentError, "Data must be 2D array" unless data.first.is_a?(Array)

  validate_row_consistency(data)
  validate_numeric_types(data)
end
.validate_row_consistency(data) ⇒ Object
Validate row consistency (all rows have same length)
# File 'lib/clusterkit/data_validator.rb', line 21

def validate_row_consistency(data)
  row_length = data.first.length

  data.each_with_index do |row, i|
    unless row.is_a?(Array)
      raise ArgumentError, "Row #{i} is not an array"
    end

    if row.length != row_length
      raise ArgumentError, "All rows must have the same length (row #{i} has #{row.length} elements, expected #{row_length})"
    end
  end
end
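The sketch below shows the ragged-row error message and one precondition that follows from the listing: the method reads `data.first.length` without guarding against empty input. `RowSketch` is a hypothetical stand-in copied from the listing so the snippet runs without the gem.

```ruby
# Hypothetical stand-in copied from the listing above; not the gem itself.
module RowSketch
  def self.validate_row_consistency(data)
    row_length = data.first.length

    data.each_with_index do |row, i|
      unless row.is_a?(Array)
        raise ArgumentError, "Row #{i} is not an array"
      end

      if row.length != row_length
        raise ArgumentError, "All rows must have the same length (row #{i} has #{row.length} elements, expected #{row_length})"
      end
    end
  end
end

begin
  RowSketch.validate_row_consistency([[1, 2], [3]])
rescue ArgumentError => e
  puts e.message  # All rows must have the same length (row 1 has 1 elements, expected 2)
end

# An empty array has no first row: data.first is nil, and nil.length raises
# NoMethodError rather than ArgumentError.
begin
  RowSketch.validate_row_consistency([])
rescue NoMethodError => e
  puts "#{e.class}: run the structure checks first"
end
```

This is presumably why the composite validators on this page run their array, emptiness, and 2D checks before calling `validate_row_consistency`.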
.validate_standard(data, check_finite: true) ⇒ Object
Standard validation for most algorithms
# File 'lib/clusterkit/data_validator.rb', line 66

def validate_standard(data, check_finite: true)
  validate_basic_structure(data)
  validate_row_consistency(data)
  validate_numeric_types(data)
  validate_finite_values(data) if check_finite
end
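The composite validators differ in two visible ways: the default for `check_finite` (`true` for `validate_standard`, `false` for `validate_clustering`) and the wording of the structural error messages. The sketch below contrasts them using a hypothetical stand-in module, `ValidatorSketch`, condensed from the listings on this page; the `error_for` helper is an assumption added for illustration.

```ruby
# Hypothetical stand-in condensed from the listings on this page; not the gem itself.
module ValidatorSketch
  module_function

  def validate_basic_structure(data)
    raise ArgumentError, "Input must be an array" unless data.is_a?(Array)
    raise ArgumentError, "Input cannot be empty" if data.empty?
    raise ArgumentError, "Input must be a 2D array (array of arrays)" unless data.first.is_a?(Array)
  end

  def validate_row_consistency(data)
    row_length = data.first.length
    data.each_with_index do |row, i|
      raise ArgumentError, "Row #{i} is not an array" unless row.is_a?(Array)
      next if row.length == row_length

      raise ArgumentError, "All rows must have the same length (row #{i} has #{row.length} elements, expected #{row_length})"
    end
  end

  def validate_numeric_types(data)
    data.each_with_index do |row, i|
      row.each_with_index do |val, j|
        raise ArgumentError, "Element at position [#{i}, #{j}] is not numeric" unless val.is_a?(Numeric)
      end
    end
  end

  def validate_finite_values(data)
    data.each_with_index do |row, i|
      row.each_with_index do |val, j|
        next unless val.is_a?(Float) && (val.nan? || val.infinite?)

        raise ArgumentError, "Element at position [#{i}, #{j}] is NaN or Infinite"
      end
    end
  end

  def validate_standard(data, check_finite: true)
    validate_basic_structure(data)
    validate_row_consistency(data)
    validate_numeric_types(data)
    validate_finite_values(data) if check_finite
  end

  def validate_clustering(data, check_finite: false)
    raise ArgumentError, "Data must be an array" unless data.is_a?(Array)
    raise ArgumentError, "Data cannot be empty" if data.empty?
    raise ArgumentError, "Data must be 2D array" unless data.first.is_a?(Array)

    validate_row_consistency(data)
    validate_numeric_types(data)
    validate_finite_values(data) if check_finite
  end

  # Hypothetical helper: returns the error message, or nil when the block passes.
  def error_for
    yield
    nil
  rescue ArgumentError => e
    e.message
  end
end

data = [[1.0, Float::NAN]]
p ValidatorSketch.error_for { ValidatorSketch.validate_standard(data) }    # "Element at position [0, 1] is NaN or Infinite"
p ValidatorSketch.error_for { ValidatorSketch.validate_clustering(data) }  # nil (finite check is opt-in)
p ValidatorSketch.error_for { ValidatorSketch.validate_clustering(data, check_finite: true) }
p ValidatorSketch.error_for { ValidatorSketch.validate_standard("x") }     # "Input must be an array"
p ValidatorSketch.error_for { ValidatorSketch.validate_clustering("x") }   # "Data must be an array"
```

In other words, data containing NaN passes `validate_clustering` unless the caller opts in with `check_finite: true`, while `validate_standard` rejects it by default.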