Module: Wukong::Hadoop::EnvMethods

Defined in:
lib/wukong-hadoop/hadoop_env_methods.rb

Overview

Hadoop streaming exposes several environment variables to the scripts it executes. This module contains methods that make these variables easy to access from within a processor.

Since these environment variables are ultimately set by Hadoop's streaming jar when executing inside Hadoop, you'll have to set them manually when testing locally.
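As a minimal sketch of local testing, the variables can be assigned in ENV before the processor runs. EnvMethodsSketch is a stand-in that copies two method bodies from this page so the example runs without the gem. LocalProcessor and all values are hypothetical.

```ruby
# Stand-in for Wukong::Hadoop::EnvMethods (assumption: used only so this
# sketch runs without the gem). The bodies mirror the methods documented here.
module EnvMethodsSketch
  def input_file
    ENV['map_input_file']
  end

  def attempt_id
    ENV['mapred_task_id']
  end
end

# Hypothetical processor class, for illustration only.
class LocalProcessor
  include EnvMethodsSketch
end

# Outside Hadoop nothing sets these variables, so assign them by hand.
# Both values are made up.
ENV['map_input_file'] = 'hdfs://namenode/logs/access-2012-01-01.log'
ENV['mapred_task_id'] = 'attempt_201201010000_0001_m_000000_0'

processor = LocalProcessor.new
puts processor.input_file
puts processor.attempt_id
```

The same variables could instead be exported in the shell that launches the script; assigning them in Ruby keeps the test setup in one place.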

Via @pskomoroch via @tlipcon:

"there is a little known Hadoop Streaming trick buried in this Python script. You will notice that the date is not actually in the raw log data itself, but is part of the filename. It turns out that Hadoop makes job parameters you would fetch in Java with something like job.get("mapred.input.file") available as environment variables for streaming jobs, with periods replaced with underscores:

filepath = os.environ["map_input_file"]
filename = os.path.split(filepath)[-1]
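A rough Ruby counterpart of the quoted Python snippet, assuming the variable is set. The path assigned here is made up so the sketch runs outside Hadoop; inside a streaming job Hadoop sets it.

```ruby
# Made-up path for illustration only.
ENV['map_input_file'] = 'hdfs://namenode/logs/2012-01-01.log'

filepath = ENV['map_input_file']
# File.basename returns the last path component, like os.path.split(...)[-1].
filename = File.basename(filepath)
puts filename
```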

Instance Method Details

#attempt_id ⇒ String

ID of the current map/reduce attempt.

Returns:

  • (String)


# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 65

def attempt_id
  ENV['mapred_task_id']
end

#curr_task_id ⇒ String

ID of the current map/reduce task.

Returns:

  • (String)


# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 72

def curr_task_id
  ENV['mapred_tip_id']
end

#hadoop_streaming_parameter(name) ⇒ String

Fetch a parameter set by Hadoop streaming in the environment of the currently executing process.

Parameters:

  • name (String)

    the '.' separated parameter name to fetch

Returns:

  • (String)

    the value from the process' environment



# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 30

def hadoop_streaming_parameter name
  ENV[name.gsub('.', '_')]
end
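A short usage sketch for the lookup above. StreamingParameterSketch is a stand-in that copies the method body so the example runs without the gem, and the parameter value is made up. Because the lookup is a plain ENV read, an unset parameter comes back as nil rather than a String.

```ruby
# Stand-in for the documented method (assumption: the body is copied from
# this page so the sketch is self-contained).
module StreamingParameterSketch
  def self.hadoop_streaming_parameter(name)
    # String#gsub with a String pattern replaces literal dots, not "any char".
    ENV[name.gsub('.', '_')]
  end
end

# Made-up value; Hadoop would export this inside a real streaming job.
ENV['mapred_input_dir'] = 'hdfs://namenode/logs'
ENV.delete('mapred_not_a_real_param')

found   = StreamingParameterSketch.hadoop_streaming_parameter('mapred.input.dir')
missing = StreamingParameterSketch.hadoop_streaming_parameter('mapred.not.a.real.param')

puts found
puts missing.inspect
```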

#input_dir ⇒ String

Directory of the (data) file currently being processed.

Returns:

  • (String)


# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 44

def input_dir
  ENV['mapred_input_dir']
end

#input_file ⇒ String

Path of the (data) file currently being processed.

Returns:

  • (String)


# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 37

def input_file
  ENV['map_input_file']
end

#map_input_length ⇒ String

Length of the chunk currently being processed within the current input file.

Returns:

  • (String)


# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 58

def map_input_length
  ENV['map_input_length']
end

#map_input_start_offset ⇒ String

Offset of the chunk currently being processed within the current input file.

Returns:

  • (String)


# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 51

def map_input_start_offset
  ENV['map_input_start']
end
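Both the offset and the length come back as Strings because they are read straight from ENV, so arithmetic on them needs an explicit conversion. A minimal sketch, assuming both variables are set; the values are made up and the local variable names are illustrative.

```ruby
# Made-up values; Hadoop sets these per input split in a real job.
ENV['map_input_start']  = '134217728'
ENV['map_input_length'] = '67108864'

# ENV.fetch raises KeyError if a variable is missing; Integer() raises
# ArgumentError on non-numeric text, so bad input fails loudly.
start_offset = Integer(ENV.fetch('map_input_start'))
length       = Integer(ENV.fetch('map_input_length'))
end_offset   = start_offset + length

puts "chunk covers bytes #{start_offset}...#{end_offset}"
```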