Module: Wukong::Hadoop::EnvMethods

Defined in: lib/wukong-hadoop/hadoop_env_methods.rb

Overview

Hadoop streaming exposes several environment variables to the scripts it executes. This module provides methods for reading those variables from within a processor.

Because these environment variables are set by Hadoop's streaming jar when a script runs inside Hadoop, you have to set them manually when testing locally. Each method reads directly from ENV, so it returns nil when the corresponding variable is unset.
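For local testing, one option is to set the variables only for the duration of a block and restore them afterwards. The sketch below is illustrative rather than part of wukong-hadoop: `EnvMethodsSketch` is a stand-in that copies two method bodies from the listings on this page so the example runs without the gem, and `with_hadoop_env`, `FakeProcessor`, and the sample path are hypothetical.

```ruby
# Stand-in for Wukong::Hadoop::EnvMethods so the sketch runs without the gem.
# The method bodies mirror the source listings on this page.
module EnvMethodsSketch
  def input_file
    ENV['map_input_file']
  end

  def map_input_start_offset
    ENV['map_input_start']
  end
end

# Hypothetical helper: set the given variables, run the block, then restore
# whatever values (or absence of values) were there before.
def with_hadoop_env(vars)
  saved = {}
  vars.each do |key, value|
    saved[key] = ENV[key]
    ENV[key] = value
  end
  yield
ensure
  # Assigning nil to an ENV key deletes it, which restores "unset".
  saved.each { |key, value| ENV[key] = value }
end

# Hypothetical processor-like class that mixes in the accessors.
class FakeProcessor
  include EnvMethodsSketch
end

# Sample values only; inside Hadoop the streaming jar supplies these.
result = with_hadoop_env('map_input_file' => 'hdfs:///logs/2012-01-01.log',
                         'map_input_start' => '0') do
  processor = FakeProcessor.new
  [processor.input_file, processor.map_input_start_offset]
end
```

Restoring the previous values keeps one test's fake environment from leaking into the next.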

Via @pskomoroch via @tlipcon:

"there is a little known Hadoop Streaming trick buried in this Python script. You will notice that the date is not actually in the raw log data itself, but is part of the filename. It turns out that Hadoop makes job parameters you would fetch in Java with something like job.get("mapred.input.file") available as environment variables for streaming jobs, with periods replaced with underscores:

filepath = os.environ["map_input_file"]
filename = os.path.split(filepath)[-1]
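A rough Ruby equivalent of the quoted Python lines might look like the following. The sample path and the date pattern are assumptions made for illustration; under Hadoop streaming the variable is set by the framework, and the filename layout depends on your data.

```ruby
# Sample value only, used when the variable is not already set
# (for example when running outside Hadoop).
ENV['map_input_file'] ||= 'hdfs://namenode/logs/access-2012-01-01.log'

filepath = ENV['map_input_file']

# File.basename keeps only the last path component.
filename = File.basename(filepath)

# Hypothetical follow-up: pull a YYYY-MM-DD date out of the filename.
# This is nil when the filename contains no such pattern.
log_date = filename[/\d{4}-\d{2}-\d{2}/]
```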

Instance Method Summary

#attempt_id ⇒ String
ID of the current map/reduce attempt.
#curr_task_id ⇒ String
ID of the current map/reduce task.
#hadoop_streaming_parameter(name) ⇒ String
Fetch a parameter set by Hadoop streaming in the environment of the currently executing process.
#input_dir ⇒ String
Directory of the (data) file currently being processed.
#input_file ⇒ String
Path of the (data) file currently being processed.
#map_input_length ⇒ String
Length of the chunk currently being processed within the current input file.
#map_input_start_offset ⇒ String
Offset of the chunk currently being processed within the current input file.

Instance Method Details

#attempt_id ⇒ `String`

ID of the current map/reduce attempt.

Returns:

(String)



# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 65

def attempt_id
  ENV['mapred_task_id']
end

#curr_task_id ⇒ `String`

ID of the current map/reduce task.

Returns:

(String)



# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 72

def curr_task_id
  ENV['mapred_tip_id']
end

#hadoop_streaming_parameter(name) ⇒ `String`

Fetch a parameter set by Hadoop streaming in the environment of the currently executing process.

Parameters:

name (String) —
the '.' separated parameter name to fetch

Returns:

(String) —
the value from the process' environment



# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 30

def hadoop_streaming_parameter name
  ENV[name.gsub('.', '_')]
end
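As a small illustration of the name translation, the sketch below copies the method body into a stand-in module (`StreamingParameterSketch`, a hypothetical name) so it runs without the gem. The job name value is a sample set by hand; inside a streaming job Hadoop would export it.

```ruby
# Stand-in carrying the same body as the method documented above.
module StreamingParameterSketch
  def hadoop_streaming_parameter(name)
    ENV[name.gsub('.', '_')]
  end
end

# Hypothetical object used only to call the method.
lookup = Object.new.extend(StreamingParameterSketch)

# Sample value; the dotted name 'mapred.job.name' maps to 'mapred_job_name'.
ENV['mapred_job_name'] = 'sample-job'
job_name = lookup.hadoop_streaming_parameter('mapred.job.name')

# A parameter that is absent from the environment comes back as nil.
ENV.delete('not_a_real_parameter')
missing = lookup.hadoop_streaming_parameter('not.a.real.parameter')
```

Because an absent parameter yields nil rather than raising, callers that need a value may want to check for nil explicitly.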

#input_dir ⇒ `String`

Directory of the (data) file currently being processed.

Returns:

(String)



# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 44

def input_dir
  ENV['mapred_input_dir']
end

#input_file ⇒ `String`

Path of the (data) file currently being processed.

Returns:

(String)



# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 37

def input_file
  ENV['map_input_file']
end

#map_input_length ⇒ `String`

Length of the chunk currently being processed within the current input file.

Returns:

(String)



# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 58

def map_input_length
  ENV['map_input_length']
end

#map_input_start_offset ⇒ `String`

Offset of the chunk currently being processed within the current input file.

Returns:

(String)



# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 51

def map_input_start_offset
  ENV['map_input_start']
end
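Since #map_input_start_offset and #map_input_length return strings, arithmetic on them needs an explicit conversion. The sketch below shows one way to turn the two values into a byte range; `split_byte_range` is a hypothetical helper, not part of the module, and it takes a hash-like argument so it can be exercised without touching the real environment.

```ruby
# Hypothetical helper: build the half-open byte range covered by the
# current chunk, or nil when either variable is missing.
def split_byte_range(env = ENV)
  start  = env['map_input_start']
  length = env['map_input_length']
  return nil if start.nil? || length.nil?

  # Integer(..., 10) raises ArgumentError on malformed input instead of
  # silently producing 0 the way String#to_i would.
  first = Integer(start, 10)
  first...(first + Integer(length, 10))
end

# Sample values standing in for what Hadoop would export.
range = split_byte_range('map_input_start' => '1024', 'map_input_length' => '512')
no_range = split_byte_range({})
```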
