Method: Addressable::URI.normalize_component

Defined in:
lib/addressable/uri.rb

.normalize_component(component, character_class = CharacterClassesRegexps::RESERVED_AND_UNRESERVED, leave_encoded = '') ⇒ String

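Normalizes the percent encoding of a URI component. The component is first unencoded, then Unicode-normalized to NFC, and finally re-encoded against character_class.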

Examples:

Addressable::URI.normalize_component("simpl%65/%65xampl%65", "b-zB-Z")
# => "simple%2Fex%61mple"
Addressable::URI.normalize_component(
  "simpl%65/%65xampl%65", /[^b-zB-Z]/
)
# => "simple%2Fex%61mple"
Addressable::URI.normalize_component(
  "simpl%65/%65xampl%65",
  Addressable::URI::CharacterClasses::UNRESERVED
)
# => "simple%2Fexample"
Addressable::URI.normalize_component(
  "one%20two%2fthree%26four",
  "0-9a-zA-Z &/",
  "/"
)
# => "one two%2Fthree&four"
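
As a rough illustration of how the first example arrives at its result, the sketch below reimplements the unencode, NFC-normalize, re-encode sequence in plain Ruby. The helper sketch_normalize is hypothetical and is not part of Addressable. It assumes a String character class, ignores leave_encoded, and assumes the decoded bytes form valid UTF-8.

```ruby
# Hypothetical simplification for illustration; not Addressable's code.
def sketch_normalize(component, character_class)
  # Step 1: decode every percent escape into its raw byte.
  decoded = component.b.gsub(/%([0-9a-fA-F]{2})/) { $1.hex.chr }
  decoded.force_encoding(Encoding::UTF_8)
  # Step 2: compose characters into NFC (assumes valid UTF-8).
  decoded = decoded.unicode_normalize(:nfc)
  # Step 3: percent-encode every character outside the allowed class.
  decoded.gsub(/[^#{character_class}]/) do |char|
    char.bytes.map { |byte| format("%%%02X", byte) }.join
  end
end

puts sketch_normalize("simpl%65/%65xampl%65", "b-zB-Z") # simple%2Fex%61mple
puts sketch_normalize("caf%65%CC%81", "a-z")            # caf%C3%A9
```

In the second call, the letter "e" followed by a combining acute accent is composed into a single "é" before re-encoding, which is why two escapes in the input become a different pair in the output.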

Parameters:

  • component (String, #to_str)

    The URI component to normalize.

  • character_class (String, Regexp) (defaults to: CharacterClassesRegexps::RESERVED_AND_UNRESERVED)

    The characters that are not percent encoded. If a String is passed, it must be formatted as the body of a regular expression character class, without the surrounding square brackets. For example, "b-zB-Z0-9" causes everything except the letters ‘b’ through ‘z’ (in either case) and the digits ‘0’ through ‘9’ to be percent encoded. Passing the Regexp /[^b-zB-Z0-9]/ has the same effect. A set of useful String values may be found in the Addressable::URI::CharacterClasses module. The default value is the reserved plus unreserved character classes specified in RFC 3986 (www.ietf.org/rfc/rfc3986.txt).

  • leave_encoded (String) (defaults to: '')

    When character_class is a String, leave_encoded is a string of characters that should remain percent encoded while the component is normalized. If these characters appear percent encoded in the original component, their escapes are upcased (“%2f” is normalized to “%2F”) but otherwise left alone.
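
The interaction between a String character_class and leave_encoded can be sketched as the regular expression the method builds from them. The helper build_pattern below is hypothetical and only mirrors the pattern-building branch of the method body. It assumes that each byte of leave_encoded maps to its two-digit lowercase hex form, which is how SEQUENCE_ENCODING_TABLE appears to be used.

```ruby
# Hypothetical stand-in for the pattern-building branch; not library code.
def build_pattern(character_class, leave_encoded = "")
  return /[^#{character_class}]/ if leave_encoded.empty?

  # "%" joins the class so that the escapes we keep are not re-encoded.
  character_class = "#{character_class}%" unless character_class.include?("%")
  # Assumption: each byte maps to its two-digit lowercase hex form.
  hex = leave_encoded.bytes.map { |byte| format("%02x", byte) }.join("|")
  # Match characters outside the class, or a "%" that does not start
  # one of the protected escapes (in either letter case).
  /[^#{character_class}]|%(?!#{hex}|#{hex.upcase})/
end

pattern = build_pattern("0-9a-zA-Z &/", "/")
p "a%2fb%41c~".scan(pattern) # ["%", "~"]
p pattern.match?("%2F")      # false
```

The pattern selects what still needs encoding: the "%" in front of "41" and the "~" are matched, while "%2f" and "%2F" are skipped because "/" was listed in leave_encoded.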

Returns:

  • (String)

    The normalized component, or nil if component is nil.



# File 'lib/addressable/uri.rb', line 552

def self.normalize_component(component, character_class=
    CharacterClassesRegexps::RESERVED_AND_UNRESERVED,
    leave_encoded='')
  return nil if component.nil?

  begin
    component = component.to_str
  rescue NoMethodError, TypeError
    raise TypeError, "Can't convert #{component.class} into String."
  end if !component.is_a? String

  if ![String, Regexp].include?(character_class.class)
    raise TypeError,
      "Expected String or Regexp, got #{character_class.inspect}"
  end
  if character_class.kind_of?(String)
    leave_re = if leave_encoded.length > 0
      character_class = "#{character_class}%" unless character_class.include?('%')

      bytes = leave_encoded.bytes
      leave_encoded_pattern = bytes.map { |b| SEQUENCE_ENCODING_TABLE[b] }.join('|')
      "|%(?!#{leave_encoded_pattern}|#{leave_encoded_pattern.upcase})"
    end

    character_class = if leave_re
                        /[^#{character_class}]#{leave_re}/
                      else
                        /[^#{character_class}]/
                      end
  end
  # We can't perform regexps on invalid UTF sequences, but
  # here we need to, so switch to ASCII.
  component = component.dup
  component.force_encoding(Encoding::ASCII_8BIT)
  unencoded = self.unencode_component(component, String, leave_encoded)
  begin
    encoded = self.encode_component(
      unencoded.unicode_normalize(:nfc),
      character_class,
      leave_encoded
    )
  rescue ArgumentError
    encoded = self.encode_component(unencoded)
  end
  encoded.force_encoding(Encoding::UTF_8)
  return encoded
end
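
The comment about switching to ASCII reflects a general Ruby constraint: a Regexp substitution raises ArgumentError when the target string is tagged as UTF-8 but contains invalid byte sequences. The sketch below is independent of Addressable and uses a made-up input to show why retagging a copy as binary makes the substitution possible.

```ruby
# A lone Latin-1 "é" byte (0xE9) is not a valid UTF-8 sequence.
broken = "caf\xE9%20".dup.force_encoding(Encoding::UTF_8)

utf8_failed = false
begin
  broken.gsub(/%20/, " ")
rescue ArgumentError
  # Ruby refuses to run the Regexp over an invalid UTF-8 string.
  utf8_failed = true
end

# Retagging a copy as binary lets the same substitution work on raw bytes.
binary = broken.dup.force_encoding(Encoding::ASCII_8BIT)
replaced = binary.gsub(/%20/, " ")

puts utf8_failed            # true
puts replaced.bytes.inspect # [99, 97, 102, 233, 32]
```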