Module: Measurable

Extended by:
Measurable
Included in:
Measurable
Defined in:
lib/measurable.rb,
lib/measurable/cosine.rb,
lib/measurable/maxmin.rb,
lib/measurable/hamming.rb,
lib/measurable/jaccard.rb,
lib/measurable/version.rb,
lib/measurable/tanimoto.rb,
lib/measurable/chebyshev.rb,
lib/measurable/euclidean.rb,
lib/measurable/haversine.rb,
lib/measurable/minkowski.rb

Constant Summary

RAD_PER_DEG =

Radians per degree (PI / 180).

Math::PI / 180
VERSION =

"0.0.6"
EARTH_RADIUS_IN_MILES =

Earth radius in miles.

3956
EARTH_RADIUS_IN_KILOMETERS =

Earth radius in kilometers. Some algorithms use 6367.

6371
EARTH_RADIUS =

The great circle distance returned will be in whatever units the radius R is in. The hash below provides the radius in miles, kilometers, feet and meters.

{
  :miles => EARTH_RADIUS_IN_MILES,
  :km => EARTH_RADIUS_IN_KILOMETERS,
  :feet => EARTH_RADIUS_IN_MILES * 5280, # 5280 feet per mile
  :meters => EARTH_RADIUS_IN_KILOMETERS * 1000
}

Instance Method Details

#chebyshev(u, v) ⇒ Object

call-seq:

chebyshev(u, v) -> Float
  • Arguments :

    • u -> An array of Numeric objects.

    • v -> An array of Numeric objects.

  • Returns :

    • The Chebyshev (L-infinity) distance between u and v, i.e. the largest absolute difference between their coordinates.

  • Raises :

    • ArgumentError -> The sizes of u and v don’t match.

Raises:

  • (ArgumentError)


# File 'lib/measurable/chebyshev.rb', line 16

def chebyshev(u, v)
  # TODO: Change this to a more specific, custom-made exception.
  raise ArgumentError if u.size != v.size

  abs_differences = u.zip(v).map { |a| (a[0] - a[1]).abs }
  abs_differences.max
end
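
As an illustration of the computation above, here is a minimal standalone sketch. The name chebyshev_sketch is hypothetical and not part of Measurable; it only mirrors the logic of the listing so that it can run without the gem.

```ruby
# Hypothetical standalone helper mirroring the listing above.
def chebyshev_sketch(u, v)
  raise ArgumentError, "size mismatch" if u.size != v.size

  # Largest absolute coordinate-wise difference.
  u.zip(v).map { |a, b| (a - b).abs }.max
end

# Coordinate-wise absolute differences are 3, 3 and 6.
puts chebyshev_sketch([1, 5, 3], [4, 2, 9]) # prints 6
```

Only the single largest difference matters, so the other coordinates can change freely without affecting the result as long as they stay below it.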

#cosine(u, v) ⇒ Object

call-seq:

cosine(u, v) -> Float

Calculate the similarity between the orientation of two vectors.

See: en.wikipedia.org/wiki/Cosine_similarity

  • Arguments :

    • u -> An array of Numeric objects.

    • v -> An array of Numeric objects.

  • Returns :

    • The normalized dot product of u and v, that is, the cosine of the angle between them in the n-dimensional space.

  • Raises :

    • ArgumentError -> The sizes of u and v don’t match.

Raises:

  • (ArgumentError)


# File 'lib/measurable/cosine.rb', line 19

def cosine(u, v)
  # TODO: Change this to a more specific, custom-made exception.
  raise ArgumentError if u.size != v.size

  dot_product = u.zip(v).reduce(0.0) { |acc, ary| acc += ary[0] * ary[1] }

  dot_product / (euclidean(u) * euclidean(v))
end
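
The following sketch shows how the returned value relates to the orientation of the vectors. The names norm_sketch and cosine_sketch are hypothetical stand-ins for the gem's methods, written so the example runs on its own.

```ruby
# Hypothetical helpers; they re-implement the listing above without the gem.
def norm_sketch(u)
  Math.sqrt(u.sum { |x| x * x })
end

def cosine_sketch(u, v)
  raise ArgumentError, "size mismatch" if u.size != v.size

  dot = u.zip(v).sum(0.0) { |a, b| a * b }
  dot / (norm_sketch(u) * norm_sketch(v))
end

puts cosine_sketch([3, 4], [6, 8])   # same direction:     1.0
puts cosine_sketch([1, 0], [0, 1])   # orthogonal:         0.0
puts cosine_sketch([1, 0], [-1, 0])  # opposite direction: -1.0
```

The magnitude of the vectors does not affect the result, only their direction does. A zero vector has norm 0, so the division yields NaN in that case.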

#euclidean(u, v = nil) ⇒ Object

call-seq:

euclidean(u) -> Float
euclidean(u, v) -> Float

Calculate the ordinary distance between arrays u and v.

If v isn’t given, calculate the Euclidean norm of u.

See: en.wikipedia.org/wiki/Euclidean_distance#N_dimensions

  • Arguments :

    • u -> An array of Numeric objects.

    • v -> (Optional) An array of Numeric objects.

  • Returns :

    • The Euclidean norm of u or the Euclidean distance between u and v.

  • Raises :

    • ArgumentError -> The sizes of u and v don’t match.

Raises:

  • (ArgumentError)


# File 'lib/measurable/euclidean.rb', line 22

def euclidean(u, v = nil)
  # If the second argument is nil, the method should return the norm of
  # vector u. For this, we need the distance between u and the origin.
  if v.nil?
    v = Array.new(u.size, 0)
  end

  # TODO: Change this to a more specific, custom-made exception.
  raise ArgumentError if u.size != v.size

  sum = u.zip(v).reduce(0.0) do |acc, ary|
    acc += (ary[0] - ary[-1]) ** 2
  end

  Math.sqrt(sum)
end
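
A short sketch of both call forms follows. The name euclidean_sketch is hypothetical; it reproduces the behaviour of the listing above so the example does not depend on the gem being installed.

```ruby
# Hypothetical standalone version of the method above.
def euclidean_sketch(u, v = nil)
  # With one argument, measure the distance from u to the origin (the norm).
  v ||= Array.new(u.size, 0)
  raise ArgumentError, "size mismatch" if u.size != v.size

  Math.sqrt(u.zip(v).sum(0.0) { |a, b| (a - b)**2 })
end

puts euclidean_sketch([3, 4])          # norm of [3, 4]: 5.0
puts euclidean_sketch([1, 1], [4, 5])  # distance:       5.0
```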

#euclidean_squared(u, v = nil) ⇒ Object

call-seq:

euclidean_squared(u) -> Float
euclidean_squared(u, v) -> Float

Calculate the same value as euclidean(u, v), but don’t take the square root of it.

This isn’t a metric in the strict sense, i.e. it doesn’t respect the triangle inequality. However, the squared Euclidean distance is very useful whenever only the relative values of distances are important, for example in optimization problems.

See: en.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_distance

  • Arguments :

    • u -> An array of Numeric objects.

    • v -> (Optional) An array of Numeric objects.

  • Returns :

    • The squared value of the Euclidean norm of u or of the Euclidean distance between u and v.

  • Raises :

    • ArgumentError -> The sizes of u and v don’t match.

Raises:

  • (ArgumentError)


# File 'lib/measurable/euclidean.rb', line 62

def euclidean_squared(u, v = nil)
  # If the second argument is nil, the method should return the squared norm
  # of vector u. For this, we need the distance between u and the origin.
  if v.nil?
    v = Array.new(u.size, 0)
  end

  # TODO: Change this to a more specific, custom-made exception.
  raise ArgumentError if u.size != v.size

  u.zip(v).reduce(0.0) do |acc, ary|
    acc += (ary[0] - ary[-1]) ** 2
  end
end
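
To illustrate why the squared distance is enough when only relative values matter, the sketch below ranks candidate points by their distance to a query point. The helper names euclidean_squared_sketch and nearest_sketch are hypothetical and not part of Measurable.

```ruby
# Hypothetical helper mirroring the two-argument form of the method above.
def euclidean_squared_sketch(u, v)
  raise ArgumentError, "size mismatch" if u.size != v.size

  u.zip(v).sum(0.0) { |a, b| (a - b)**2 }
end

# Nearest neighbour search: the square root is monotonic, so skipping it
# does not change which point is closest.
def nearest_sketch(query, points)
  points.min_by { |point| euclidean_squared_sketch(query, point) }
end

points = [[0, 0], [3, 4], [1, 1]]
# Squared distances to [2, 2] are 8.0, 5.0 and 2.0 respectively.
p nearest_sketch([2, 2], points) # prints [1, 1]
```

Sorting by the squared distance gives the same order as sorting by the true distance, while avoiding one Math.sqrt call per comparison.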

#hamming(s1, s2) ⇒ Object

call-seq:

hamming(s1, s2) -> Integer

See: en.wikipedia.org/wiki/Hamming_distance

  • Arguments :

    • s1 -> A String.

    • s2 -> A String with the same size as s1.

  • Returns :

    • The number of characters in which s1 and s2 differ.

  • Raises :

    • ArgumentError -> The sizes of s1 and s2 don’t match.

Raises:

  • (ArgumentError)


# File 'lib/measurable/hamming.rb', line 18

def hamming(s1, s2)
  # TODO: Change this to a more specific, custom-made exception.
  raise ArgumentError if s1.size != s2.size

  s1.chars.zip(s2.chars).reduce(0) do |acc, c|
    acc += 1 if c[0] != c[1]
    acc
  end
end
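
A brief worked example may help here. The name hamming_sketch is hypothetical; it counts mismatching positions in the same way as the listing above, without requiring the gem.

```ruby
# Hypothetical standalone version of the method above.
def hamming_sketch(s1, s2)
  raise ArgumentError, "size mismatch" if s1.size != s2.size

  # Count the positions at which the two strings hold different characters.
  s1.chars.zip(s2.chars).count { |a, b| a != b }
end

puts hamming_sketch("karolin", "kathrin") # differs at r/t, o/h, l/r: 3
puts hamming_sketch("1011101", "1001001") # differs at two positions: 2
```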

#haversine(u, v, unit = :meters) ⇒ Object

call-seq:

haversine(u, v) -> Float
haversine(u, v, unit) -> Float

Compute accurate distances between two points given their latitudes and longitudes, even for short distances. This isn’t a distance measure in the same sense as the other methods in Measurable.

The distance returned is the great circle (or orthodromic) distance between u and v, which is the shortest distance between them on the surface of a sphere. Thus, this implementation considers the Earth to be a sphere.

Note that the input vectors are of the form [latitude, longitude] in decimal degrees, with south latitudes and west longitudes negative. For example, the coordinates [23° 32’ S, 46° 37’ W] (São Paulo) correspond to the vector [-23.53333, -46.61667].

Raises:

  • (ArgumentError)


# File 'lib/measurable/haversine.rb', line 49

def haversine(u, v, unit = :meters)
  # TODO: Create better exceptions.
  raise ArgumentError if u.size != 2 || v.size != 2
  raise ArgumentError if unit.class != Symbol

  dlat = u[0] - v[0]
  dlon = u[1] - v[1]

  dlon_rad = dlon * RAD_PER_DEG
  dlat_rad = dlat * RAD_PER_DEG

  lat1_rad = v[0] * RAD_PER_DEG
  lon1_rad = v[1] * RAD_PER_DEG

  lat2_rad = u[0] * RAD_PER_DEG
  lon2_rad = u[1] * RAD_PER_DEG

  a = (Math.sin(dlat_rad / 2)) ** 2 + Math.cos(lat1_rad) * Math.cos(lat2_rad) * (Math.sin(dlon_rad / 2)) ** 2
  c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))

  EARTH_RADIUS[unit] * c
end
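
The formula can be checked against a case with a known answer: two points on the equator 90 degrees of longitude apart are a quarter of a great circle away from each other, i.e. R * PI / 2. The sketch below assumes a spherical Earth with a radius of 6371 km, as in EARTH_RADIUS_IN_KILOMETERS; the name haversine_sketch and its radius parameter are hypothetical.

```ruby
# Hypothetical standalone version of the method above. The radius argument
# (in kilometers by default) is an assumption of this sketch, not the gem's API.
def haversine_sketch(u, v, radius = 6371.0)
  raise ArgumentError, "expected [latitude, longitude]" if u.size != 2 || v.size != 2

  rad_per_deg = Math::PI / 180
  lat1, lon1 = u.map { |deg| deg * rad_per_deg }
  lat2, lon2 = v.map { |deg| deg * rad_per_deg }

  a = Math.sin((lat2 - lat1) / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin((lon2 - lon1) / 2)**2
  c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))

  radius * c
end

quarter = haversine_sketch([0, 0], [0, 90])
puts quarter                    # roughly 10007.5 (km)
puts 6371.0 * Math::PI / 2      # the expected quarter circumference
```

Passing a radius in another unit returns the distance in that unit, which is the idea behind the EARTH_RADIUS hash.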

#jaccard(u, v) ⇒ Object Also known as: tanimoto_similarity

call-seq:

jaccard(u, v) -> Float

The Jaccard distance is a measure of dissimilarity between two sets. It is calculated as:

jaccard_distance = 1 - jaccard_index

This is a proper metric, i.e. the following conditions hold:

- Symmetry:              jaccard(u, v) == jaccard(v, u)
- Non-negative:          jaccard(u, v) >= 0
- Coincidence axiom:     jaccard(u, v) == 0 if and only if u == v
- Triangular inequality: jaccard(u, v) <= jaccard(u, w) + jaccard(w, v)
  • Arguments :

    • u -> Array of 1s and 0s.

    • v -> Array of 1s and 0s.

  • Returns :

    • Float value representing the dissimilarity between u and v.

  • Raises :

    • ArgumentError -> The size of the input arrays doesn’t match.



# File 'lib/measurable/jaccard.rb', line 66

def jaccard(u, v)
  1 - jaccard_index(u, v)
end
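
The metric properties listed above can be spot-checked on a few binary vectors. This is only a sketch on three sample vectors, not a proof; jaccard_index_sketch and jaccard_sketch are hypothetical names that mirror the gem's methods so the example is self-contained.

```ruby
# Hypothetical standalone versions of jaccard_index and jaccard.
def jaccard_index_sketch(u, v)
  raise ArgumentError, "size mismatch" if u.size != v.size

  pairs = u.zip(v)
  intersection = pairs.count { |a, b| a + b == 2 } # element present in both
  union = pairs.count { |a, b| a + b >= 1 }        # element present in either
  intersection.to_f / union
end

def jaccard_sketch(u, v)
  1 - jaccard_index_sketch(u, v)
end

u = [1, 0, 1]
v = [1, 1, 1]
w = [0, 1, 1]

puts jaccard_sketch(u, v) == jaccard_sketch(v, u)                         # symmetry
puts jaccard_sketch(u, v) >= 0                                            # non-negative
puts jaccard_sketch(u, u) == 0                                            # coincidence
puts jaccard_sketch(u, v) <= jaccard_sketch(u, w) + jaccard_sketch(w, v)  # triangle
```

Each line prints true for these vectors. When both inputs contain only zeros the union is empty, so the index is 0.0 / 0, which is NaN in Ruby.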

#jaccard_index(u, v) ⇒ Object

call-seq:

jaccard_index(u, v) -> Float

Give the similarity between two binary vectors u and v. Calculated as:

jaccard_index = |intersection| / |union|

In which intersection and union refer to u and v and |x| is the cardinality of set x.

For example:

jaccard_index([1, 0, 1], [1, 1, 1]) == 0.666...

Because |intersection| = |(1, 0, 1)| = 2 and |union| = |(1, 1, 1)| = 3.

See: en.wikipedia.org/wiki/Jaccard_coefficient

  • Arguments :

    • u -> Array of 1s and 0s.

    • v -> Array of 1s and 0s.

  • Returns :

    • Float value representing the Jaccard similarity coefficient between u and v.

  • Raises :

    • ArgumentError -> The size of the input arrays doesn’t match.

Raises:

  • (ArgumentError)


# File 'lib/measurable/jaccard.rb', line 28

def jaccard_index(u, v)
  # TODO: Change this to a more specific, custom-made exception.
  raise ArgumentError if u.size != v.size

  intersection = u.zip(v).reduce(0) do |acc, elem|
    # Both u and v must have this element.
    elem[0] + elem[1] == 2 ? (acc + 1) : acc
  end

  union = u.zip(v).reduce(0) do |acc, elem|
    # One of u and v must have this element.
    elem[0] + elem[1] >= 1 ? (acc + 1) : acc
  end

  intersection.to_f / union
end
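
The correspondence between binary vectors and sets can be made explicit: position i belongs to the set when the vector holds a 1 there. The sketch below, under that assumption, recomputes the index with Ruby's Set class; jaccard_index_sets_sketch is a hypothetical name and not part of Measurable.

```ruby
require "set"

# Hypothetical set-based formulation of the same coefficient.
def jaccard_index_sets_sketch(u, v)
  raise ArgumentError, "size mismatch" if u.size != v.size

  # Each vector becomes the set of indices at which it holds a 1.
  a = u.each_index.select { |i| u[i] == 1 }.to_set
  b = v.each_index.select { |i| v[i] == 1 }.to_set

  (a & b).size.to_f / (a | b).size
end

# u -> {0, 2}, v -> {0, 1, 2}: |intersection| = 2, |union| = 3.
puts jaccard_index_sets_sketch([1, 0, 1], [1, 1, 1]) # 2/3, about 0.667
```

This gives the same value as the counting approach in the listing, which avoids building intermediate sets.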

#maxmin(u, v) ⇒ Object

call-seq:

maxmin(u, v) -> Float

The “Max-min distance” is used to measure similarity between two vectors.

When used in k-means clustering, this similarity measure can give better results in some datasets, as pointed out in the paper “K-means clustering using Max-min distance measure” — Visalakshi, N. K.; Suguna, J.

See: ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05156398

  • Arguments :

    • u -> An array of Numeric objects.

    • v -> An array of Numeric objects.

  • Returns :

    • Similarity between u and v.

  • Raises :

    • ArgumentError -> The sizes of u and v don’t match.

Raises:

  • (ArgumentError)


# File 'lib/measurable/maxmin.rb', line 22

def maxmin(u, v)
  # TODO: Change this to a more specific, custom-made exception.
  raise ArgumentError if u.size != v.size

  sum_min, sum_max = u.zip(v).reduce([0.0, 0.0]) do |acc, attributes|
    acc[0] += attributes.min
    acc[1] += attributes.max
    acc
  end

  sum_min / sum_max
end
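
A small worked example of the ratio computed above. The name maxmin_sketch is hypothetical and simply restates the listing in a self-contained form.

```ruby
# Hypothetical standalone version of the method above.
def maxmin_sketch(u, v)
  raise ArgumentError, "size mismatch" if u.size != v.size

  pairs = u.zip(v)
  # Sum of coordinate-wise minima divided by sum of coordinate-wise maxima.
  pairs.sum(0.0, &:min) / pairs.sum(0.0, &:max)
end

# Minima: 1 + 2 + 3 = 6, maxima: 2 + 2 + 6 = 10.
puts maxmin_sketch([1, 2, 3], [2, 2, 6]) # prints 0.6
puts maxmin_sketch([1, 2, 3], [1, 2, 3]) # identical vectors: 1.0
```

For non-negative inputs the value lies between 0 and 1, with 1 meaning the vectors are identical.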

#minkowski(u, v) ⇒ Object Also known as: cityblock, manhattan

call-seq:

minkowski(u, v) -> Numeric

Calculate the sum of the absolute value of the differences between each coordinate of u and v.

  • Arguments :

    • u -> An array of Numeric objects.

    • v -> An array of Numeric objects.

  • Returns :

    • The Minkowski distance of order 1 (also called the L1, city block or Manhattan distance) between u and v.

  • Raises :

    • ArgumentError -> The sizes of u and v don’t match.

Raises:

  • (ArgumentError)


# File 'lib/measurable/minkowski.rb', line 17

def minkowski(u, v)
  # TODO: Change this to a more specific, custom-made exception.
  raise ArgumentError if u.size != v.size

  u.zip(v).reduce(0) do |acc, elem|
    acc += (elem[0] - elem[1]).abs
  end
end
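
As a quick illustration of the sum described above, consider the following sketch. The name manhattan_sketch is hypothetical and chosen to avoid clashing with the gem's manhattan alias.

```ruby
# Hypothetical standalone version of the method above.
def manhattan_sketch(u, v)
  raise ArgumentError, "size mismatch" if u.size != v.size

  # Sum of the absolute coordinate-wise differences.
  u.zip(v).sum { |a, b| (a - b).abs }
end

# |1 - 4| + |2 - 0| + |3 - 3| = 3 + 2 + 0
puts manhattan_sketch([1, 2, 3], [4, 0, 3]) # prints 5
```

With Integer inputs the result stays an Integer, which matches the Numeric return type in the call-seq.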

#tanimoto(u, v) ⇒ Object

call-seq:

tanimoto(u, v) -> Float

Tanimoto distance is a coefficient explicitly chosen so as to allow two dissimilar specimens to both be similar to a third one. This breaks the triangle inequality, so it isn’t a metric.

More information and references on this are needed. It’s left here mostly as a piece of curiosity.

See: en.wikipedia.org/wiki/Jaccard_index#Tanimoto.27s_Definitions_of_Similarity_and_Distance

  • Arguments :

    • u -> Array of 1s and 0s, since the value is computed from jaccard_index(u, v).

    • v -> Array of 1s and 0s.

  • Returns :

    • The Tanimoto distance between u and v, computed as -log2(jaccard_index(u, v)).

  • Raises :

    • ArgumentError -> The sizes of u and v don’t match.

Raises:

  • (ArgumentError)


# File 'lib/measurable/tanimoto.rb', line 26

def tanimoto(u, v)
  # TODO: Change this to a more specific, custom-made exception.
  raise ArgumentError if u.size != v.size

  -Math.log2(jaccard_index(u, v))
end
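
The broken triangle inequality can be seen on a tiny example: two vectors with nothing in common are infinitely far apart, yet each is at a finite distance from a third vector that overlaps both. The sketch below uses hypothetical names (jaccard_index_sketch, tanimoto_sketch) that mirror the listings in this page so it runs without the gem.

```ruby
# Hypothetical standalone versions of jaccard_index and tanimoto.
def jaccard_index_sketch(u, v)
  raise ArgumentError, "size mismatch" if u.size != v.size

  pairs = u.zip(v)
  pairs.count { |a, b| a + b == 2 }.to_f / pairs.count { |a, b| a + b >= 1 }
end

def tanimoto_sketch(u, v)
  -Math.log2(jaccard_index_sketch(u, v))
end

u = [1, 0]
v = [0, 1]
w = [1, 1]

puts tanimoto_sketch(u, w) # index 1/2 -> 1.0
puts tanimoto_sketch(w, v) # index 1/2 -> 1.0
puts tanimoto_sketch(u, v) # index 0   -> Infinity

# The direct distance exceeds the detour through w.
puts tanimoto_sketch(u, v) > tanimoto_sketch(u, w) + tanimoto_sketch(w, v) # true
```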