Module: Statsample::Codification

Defined in:
lib/statsample/codification.rb

Overview

This module aids to code open questions

  • Select one or more vectors of a dataset, to create a yaml files, on which each vector is a hash, which keys and values are the vector's factors . If data have Statsample::SPLIT_TOKEN on a value, each value will be separated on two or more hash keys.

  • Edit the yaml and replace the values of hashes with your codes. If you need to create two or mores codes for an answer, use the separator (default Statsample::SPLIT_TOKEN)

  • Recode the vectors, loading the yaml file:

    • recode_dataset_simple!() : The new vectors have the same name of the original plus “_recoded”

    • recode_dataset_split!() : Create equal number of vectors as values. See Vector.add_vectors_by_split() for arguments

Usage:

recode_file="recodification.yaml"
phase=:first # flag
if phase==:first
  File.open(recode_file,"w") {|fp|
    Statsample::Codification.create_yaml(ds,%w{vector1 vector2}, ",",fp)
  }
# Edit the file recodification.yaml and verify changes
elsif phase==:second
  File.open(recode_file,"r") {|fp|
    Statsample::Codification.verify(fp,['vector1'])
  }
# Add new vectors to the dataset
elsif phase==:third
  File.open(recode_file,"r") {|fp|
    Statsample::Codification.recode_dataset_split!(ds,fp,"*")
  }
end

Class Method Summary collapse

Class Method Details

._recode_dataset(dataset, h, sep = Statsample::SPLIT_TOKEN, split = false) ⇒ Object


145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
# File 'lib/statsample/codification.rb', line 145

def _recode_dataset(dataset, h , sep=Statsample::SPLIT_TOKEN, split=false)
  v_names||=h.keys
  v_names.each do |v_name|
    raise Exception, "Vector #{v_name} doesn't exists on Dataset" if !dataset.vectors.include? v_name
    recoded = Daru::Vector.new(
      recode_vector(dataset[v_name], h[v_name],sep).collect do |c|
        if c.nil?
          nil
        else
          c.join(sep)
        end
      end
    )
    if split
      recoded.split_by_separator(sep).each {|k,v|
        dataset[(v_name.to_s + "_" + k).to_sym] = v
      }
    else
      dataset[(v_name.to_s + "_recoded").to_sym] = recoded
    end
  end
end

.create_excel(dataset, vectors, filename, sep = Statsample::SPLIT_TOKEN) ⇒ Object

Create a excel to create a dictionary, based on vectors. Raises an error if filename exists The rows will be:

  • field: name of vector

  • original: original name

  • recoded: new code


76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
# File 'lib/statsample/codification.rb', line 76

def create_excel(dataset, vectors, filename, sep=Statsample::SPLIT_TOKEN)
  require 'spreadsheet'
  if File.exist?(filename)
    raise "Exists a file named #{filename}. Delete ir before overwrite."
  end
  book  = Spreadsheet::Workbook.new
  sheet = book.create_worksheet
  sheet.row(0).concat(%w(field original recoded))
  i = 1
  create_hash(dataset, vectors, sep).sort.each do |field, inner_hash|
    inner_hash.sort.each do |k,v|
      sheet.row(i).concat([field.to_s,k.to_s,v.to_s])
      i += 1
    end
  end

  book.write(filename)
end

.create_hash(dataset, vectors, sep = Statsample::SPLIT_TOKEN) ⇒ Object

Create a hash, based on vectors, to create the dictionary. The keys will be vectors name on dataset and the values will be hashes, with keys = values, for recodification

Raises:

  • (ArgumentError)

35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
# File 'lib/statsample/codification.rb', line 35

def create_hash(dataset, vectors, sep=Statsample::SPLIT_TOKEN)
  raise ArgumentError,"Array should't be empty" if vectors.size==0
  pro_hash = vectors.inject({}) do |h,v_name|
    v_name = v_name.is_a?(Numeric) ? v_name : v_name.to_sym
    raise Exception, "Vector #{v_name} doesn't exists on Dataset" if 
      !dataset.vectors.include?(v_name)
    v = dataset[v_name]
    split_data = v.splitted(sep)
                  .flatten
                  .collect { |c| c.to_s  }
                  .find_all{ |c| !c.nil? }

    factors   = split_data.uniq
                          .compact
                          .sort
                          .inject({}) { |ac,val| ac[val] = val; ac }
    h[v_name] = factors
    h
  end

  pro_hash
end

.create_yaml(dataset, vectors, io = nil, sep = Statsample::SPLIT_TOKEN) ⇒ Object

Create a yaml to create a dictionary, based on vectors The keys will be vectors name on dataset and the values will be hashes, with keys = values, for recodification

v1 = Daru::Vector.new(%w{a,b b,c d})
ds = Daru::DataFrame.new({:v1 => v1})
Statsample::Codification.create_yaml(ds,[:v1])
=> "--- \nv1: \n  a: a\n  b: b\n  c: c\n  d: d\n"

65
66
67
68
# File 'lib/statsample/codification.rb', line 65

def create_yaml(dataset, vectors, io=nil, sep=Statsample::SPLIT_TOKEN)
  pro_hash=create_hash(dataset, vectors, sep)
  YAML.dump(pro_hash,io)
end

.dictionary(h, sep = Statsample::SPLIT_TOKEN) ⇒ Object


123
124
125
# File 'lib/statsample/codification.rb', line 123

def dictionary(h, sep=Statsample::SPLIT_TOKEN)
  h.inject({}) { |a,v| a[v[0]]=v[1].split(sep); a }
end

.excel_to_recoded_hash(filename) ⇒ Object

From a excel generates a dictionary hash to use on recode_dataset_simple!() or recode_dataset_split!().


97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
# File 'lib/statsample/codification.rb', line 97

def excel_to_recoded_hash(filename)
  require 'spreadsheet'
  h={}
  book = Spreadsheet.open filename
  sheet= book.worksheet 0
  row_i=0
  sheet.each do |row|
    row_i += 1
    next if row_i == 1 or row[0].nil? or row[1].nil? or row[2].nil?
    key = row[0].to_sym
    h[key] ||= {}
    h[key][row[1]] = row[2]
  end
  h
end

.inverse_hash(h, sep = Statsample::SPLIT_TOKEN) ⇒ Object


113
114
115
116
117
118
119
120
121
# File 'lib/statsample/codification.rb', line 113

def inverse_hash(h, sep=Statsample::SPLIT_TOKEN)
  h.inject({}) do |a,v|
    v[1].split(sep).each do |val|
      a[val]||=[]
      a[val].push(v[0])
    end
    a
  end
end

.recode_dataset_simple!(dataset, dictionary_hash, sep = Statsample::SPLIT_TOKEN) ⇒ Object


138
139
140
# File 'lib/statsample/codification.rb', line 138

def recode_dataset_simple!(dataset, dictionary_hash ,sep=Statsample::SPLIT_TOKEN)
  _recode_dataset(dataset,dictionary_hash ,sep,false)
end

.recode_dataset_split!(dataset, dictionary_hash, sep = Statsample::SPLIT_TOKEN) ⇒ Object


141
142
143
# File 'lib/statsample/codification.rb', line 141

def recode_dataset_split!(dataset, dictionary_hash, sep=Statsample::SPLIT_TOKEN)
  _recode_dataset(dataset, dictionary_hash, sep,true)
end

.recode_vector(v, h, sep = Statsample::SPLIT_TOKEN) ⇒ Object


127
128
129
130
131
132
133
134
135
136
137
# File 'lib/statsample/codification.rb', line 127

def recode_vector(v,h,sep=Statsample::SPLIT_TOKEN)
  dict     = dictionary(h,sep)
  new_data = v.splitted(sep)
  new_data.collect do |c|
    if c.nil?
      nil
    else
      c.collect{|value| dict[value] }.flatten.uniq
    end
  end
end

.verify(h, v_names = nil, sep = Statsample::SPLIT_TOKEN, io = $>) ⇒ Object


169
170
171
172
173
174
175
176
177
178
179
# File 'lib/statsample/codification.rb', line 169

def verify(h, v_names=nil,sep=Statsample::SPLIT_TOKEN,io=$>)
  require 'pp'
  v_names||=h.keys
  v_names.each{|v_name|
    inverse=inverse_hash(h[v_name],sep)
    io.puts "- Field: #{v_name}"
    inverse.sort{|a,b| -(a[1].count<=>b[1].count)}.each {|k,v|
      io.puts "  - \"#{k}\" (#{v.count}) :\n    -'"+v.join("\n    -'")+"'"
    }
  }
end