BAsh Machinery = DFD

BAM is a tool that helps you be more productive while creating and maintaining projects.

TODO

  • Warn the user if they generate downloaders but have no incremental taps
  • Clean the docs so there is no reference to "dataset" where "sink" is meant
  • Make 1:1 tap:sink mapping easier

Installation

Make sure you have Ruby (1.8.7 and 1.9 are currently supported) and that you have RubyGems installed.

gem install gd_bam

Notes: Mac: On a Mac, Ruby is already on your machine, but you have to install the gem with root privileges. You can do that by running sudo gem install gd_bam. On top of that, some C libraries are going to be compiled on your machine during installation; for this to work you need to install XCode and its command line tools so you get a C compiler (for the command line tools in XCode look here: http://stackoverflow.com/questions/9329243/xcode-4-4-command-line-tools). Also install Git: http://git-scm.com/download. We are working to make all of these requirements go away.

Win TBD

Linux TBD

Done.

Sample project -- GoodSales

Prerequisites

You need a working Salesforce project with API access. You should also have your username, password, and security token ready to fill into params.json.

Warnings

The project is currently cloned out of an existing project, which means that you need to have access to it. If you do not (the project PID is nt935rwzls50zfqwy6dh62tabu8h0ocy), ask [email protected]. Eventually this will be covered by a template, so you will not need to do anything special. The template creation is tracked here: https://jira.intgdc.com/browse/GD-34641 .

Let's get to it

We will spin up a GoodSales project and load it with data. The prerequisite is a functioning Salesforce project, which you can grab at force.com.

bam scaffold project test --blueprint goodsales

Now you can go inside: cd test. You will notice several directories and files. We will get to flows, taps, and sinks later; for now focus just on params.json. If you open it you will see several parameters that you have to fill in. The common ones should be predefined and empty. For starters you will need the gd_login, gd_pass, sf_password, sf_token, and sf_login parameters filled in. You can check that the Salesforce connection is working by issuing bam sf_jack_in. If it is, a REPL opens up. If not, you get an error message.

One of the parameters is project_pid. To get that you need a project.

bam project --blueprint goodsales --token YOUR_TOKEN

This should spin for a while and eventually give you a project PID. Fill it into your params.json.
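At this point your params.json might look something like the following sketch. The key names are the ones named above; the values are placeholders you replace with your own credentials.

{
   "gd_login"    : "[email protected]"
  ,"gd_pass"     : "your-gooddata-password"
  ,"sf_login"    : "[email protected]"
  ,"sf_password" : "your-salesforce-password"
  ,"sf_token"    : "your-salesforce-security-token"
  ,"project_pid" : "PID_FROM_THE_STEP_ABOVE"
}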

Now we are going to generate the downloaders, but before we do so it is good practice to make sure you have everything you need. You can issue bam taps_validate, which will go to Salesforce, check every field you defined, and make sure it is available. If not, it will warn you. We tried to pick fields that should be in your Salesforce, but it is possible that they were deleted or that the user does not have access to them.

If everything is ok go ahead and generate the downloaders.

bam generate_downloaders

We will talk in detail later about why we split a project's ETL into downloaders and the rest of the ETL. For now, just trust us. By default the downloaders are generated into the downloader-project folder. You can go ahead and run it on the platform.

bam run downloader-project --email [email protected]

You can watch the progress in our CloudConnect console.

Now generate the ETL.

bam generate

This works the same as with the downloaders, but its default target is clover-project.

bam run clover-project --email [email protected]

After it is finished, log in to GoodData, go into your project, and celebrate. You have just built a project using BAM.

When things go wrong

We tried our best to make this experience a smooth one, but sometimes things go bad. Here are some typical problems that can occur.

Field is inaccessible in SF

In the log there should be something like

Worker task failed: Missing mandatory fields

This means that some of your fields are either not accessible or not in your SF project. You can check several things:

  • First make sure that you can actually connect to SF. For this, run bam sf_validate_connection.
  • If that is correct, use bam taps_validate to identify the fields that are inaccessible. Fields that are missing are marked in red; fields that are missing but marked as not mandatory in the taps are marked in orange.

Next steps

Ok, so by now you hopefully have your project up and running. Before we dive into modifications, you have to understand the key concepts that BAM builds on. Once you are comfortable with those, we will get back to making changes.

Taps

Taps are sources of data. The main tap right now is the Salesforce tap; there is also basic support for CSV sources (see below).

Common properties

Every tap has a source and an id. The source tells it where to go to grab data: a database, a CSV file, or Salesforce. Depending on the type, the definition might differ. The id is a field that holds a name by which you can reference the particular tap. Obviously it needs to be unique.
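Stripped of everything type-specific, a tap is therefore just these two common keys (real taps add more, as the examples below show; both values here are placeholders):

{
   "source" : "where-to-grab-the-data"
  ,"id"     : "unique_tap_id"
}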

Salesforce

This is a tap that connects to Salesforce using the credentials mentioned in params.json. object tells the downloader which SF object to grab.

{
   "source" : "salesforce"
  ,"object" : "User"
  ,"id"     : "user"
  ,"fields" : [
    {
      "name" : "Id"
    },
    {
      "name" : "FirstName"
    },
    {
      "name" : "LastName"
    },
    {
      "name" : "Region"
    },
    {
      "name" : "Department"
    }
  ]
}

Limit

Sometimes it is useful to limit the number of grabbed values, for example for testing purposes.

{
   "source" : "salesforce"
  ,"object" : "User"
  ,"id"     : "user"
  ,"fields" : [
    {
      "name" : "Id"
    }
    .
    .
    .
  ]
  ,"limit": 100
}

Acts as

Sometimes you need to use one field several times in a source of data, or you want to "call" a certain field differently because the ETL relies on a particular name. Both cases are handled using acts_as.

{
   "source" : "salesforce"
  ,"object" : "User"
  ,"id"     : "user"
  ,"fields" : [
    {
      "name" : "Id", "acts_as" : ["Id", "Name"]
    },
    {
      "name" : "Custom_Amount__c", "acts_as" : ["RenamedAmount"]
    }
  ]
}

Id will be routed to both Id and Name. Custom_Amount__c will be called RenamedAmount.

Caution: This is a double-edged sword, so be careful. The idea is that it should make your life easier, not harder. You should map a field to a different one in exactly two cases: you want the same field twice, or a (predefined) ETL requires a certain field under a certain name. If you are not careful, it is easy to feed data from two columns into a single one.

Condition

You can also specify a condition to be applied during download, as sketched below. We recommend using it only if it drastically lowers the amount of data that goes over the wire; otherwise implement the filtering elsewhere.
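This document does not spell out the exact syntax, so the following is only a sketch; it assumes the condition lives in a condition key on the tap, holding a filter expression over the object's fields:

{
   "source"    : "salesforce"
  ,"object"    : "User"
  ,"id"        : "user"
  ,"condition" : "IsActive = true"
  ,"fields"    : [
    {
      "name" : "Id"
    }
  ]
}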

Incremental

It is wasteful to download everything over and over again. If you specify incremental=true you are telling BAM to download only incrementally. This means several things, and even though BAM tries to hide away as much as it can, it might be confusing. An incremental tap means (see the sketch after this list):

  • the tap is treated differently. If you generate the ETL with bam generate, the taps will not reach into SF but into an intermediary store.
  • with generate_downloaders you can generate graphs that handle the incremental nature of the downloaders and all the bookkeeping, and make sure that nothing gets lost. They store the data in the intermediary store.
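A minimal incremental tap just adds the flag to an ordinary definition (the incremental key is the same one that appears in the data validation example later in this document):

{
   "source"      : "salesforce"
  ,"object"      : "User"
  ,"id"          : "user"
  ,"incremental" : true
  ,"fields"      : [
    {
      "name" : "Id"
    }
  ]
}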

Why split graphs?

The reason is simple. When you download only incrementally, you do not stress the wires that much, which means you can run the download pretty often. By running it often, even if something horrible happens once, it will probably run successfully the next time. And as we mentioned, this is cheap. On the other hand, running the main ETL is often very expensive, and recovering from failure is usually different, so splitting the two simplifies the development of each. Since they are independent, they can even be developed by different people, which is sometimes useful.

Taps validation

Fail early. There is nothing more frustrating than an ETL that fails during execution. When you develop the taps, you can ask BAM to connect to SF and validate that the fields are present. This is not bulletproof, since some fields can go away at any time, but it gives you a good idea whether you misspelled any fields.

Mandatory fields

Sometimes it is necessary to move fields around in SF. In such a case the tap will fail. If you know this upfront, you can tell BAM that a field is not mandatory, and it will silently go along, filling the missing field with ''. If a field is marked as mandatory, which all fields are by default, BAM will fail when it cannot access the field. A sketch follows.
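The exact flag is not shown in this document; assuming it is a boolean mandatory key on the field definition, a tap with one optional field might look like this:

{
   "source" : "salesforce"
  ,"object" : "User"
  ,"id"     : "user"
  ,"fields" : [
    {
      "name" : "Id"
    },
    {
      "name" : "Region", "mandatory" : false
    }
  ]
}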

CSV

{ "source" : "/some/path/to/file.csv" ,"id" : "user" ,"fields" : [ { "name" : "Id" }, { "name" : "FirstName" }, { "name" : "LastName" }, { "name" : "Region" }, { "name" : "Department" } ] }

CSV on GoodData WebDAV

{
   "source" : ""https://svarovsky%40gooddata.com:[email protected]/project-uploads/HERE_PUT_YOUR_PROJECT_ID/validated/account.csv""
  ,"id"     : "user"
  ,"fields" : [
    {
      "name" : "Id"
    },
    {
      "name" : "FirstName"
    },
    {
      "name" : "LastName"
    },
    {
      "name" : "Region"
    },
    {
      "name" : "Department"
    }
  ]
}

Note

If you wonder about the password in the URL: we do not like it either, and it should go away soon. It is a workaround for a bug.

Flows

A flow is an abstraction that connects tap(s) with a sink, creating a .. well, a flow.

It is probably better to show you a simple example. This flow will download users from SF (you have to provide credentials in params.json). It then runs a graph called "process_owner" (this one is part of the distribution, but it could be an arbitrary graph). This graph concatenates the first name and last name together. It then feeds the data to the sink.

GoodData::CloverGenerator::DSL::flow("user") do |f|
  tap(:id => "user")

  graph("process_owner")
  ("user") do |m|
    m.remove("FirstName")
    m.remove("LastName")
    m.add(:name => "Name")
  end

  sink(:id => "user")
end

Note a couple of things.

  • The flow is defined using a DSL in Ruby. If you like Ruby, great; if you do not, I recommend http://rubymonk.com/ to get you up to speed. This might change and we might introduce our own DSL.
  • A flow has its own id. The name of the file does not actually matter. Again, something that we are thinking about.
  • With tap you include a tap in the flow. You can specify the id of a tap with the id param. If you omit it, BAM tries to include the tap with the same id as the flow.
  • With graph you run a graph named process_owner. When BAM creates the graphs for you, there are two places it looks for them. First it looks into your project's local_graphs directory, then it tries the library that comes with BAM. Again, especially the second part is going to change.
  • There might be one or more metadata statements after the graph call. Each graph might expect numerous inputs, so the order of these metadata statements says which input goes where. Their second purpose is to say what changes in those metadata. Here we are saying: "The user tap goes in as input number one (there is no number two in this case). At the output, user will have one more field, Name; on top of that we are removing two fields, FirstName and LastName."
  • The last thing we specify is the sink. Again, as with the tap, you can specify an id to tell BAM which sink to look for. If you do not fill it in, it by default looks for the same id as your flow.

Creating your own graph - short tutorial

Rarely do you want to put into a sink exactly what comes from a tap. Often you want to do some mangling of the data. Let's have a look at how BAM tackles this. BAM currently does not try to do any of this by itself (this might change) and relies on CloudConnect graphs to do the heavy lifting. So let's redo the owner example from above ourselves.

Let's pretend we start completely from scratch so first we need a flow.

bam scaffold flow my_owner

A new flow is created for us (incidentally, it is actually the old owner flow, but no worries, we will change it soon). If you go ahead and try to generate it right away, it should throw an error.

bam generate --only my_owner

The error says "error: Tap "user" was not found" and it is because that our tap is named "owner". Go ahead and change it. The same should repeat for sink so change it to "owner" as well. Now the interesting part. When BAM hits the expression graph(xy) in the flow it goes and tries to find a grap that it incorporates in the generated output (it just copies it no magic). It currently looks at two places. First is the local_graphs directory of current graph. So what we need to do is create a graph put it into a local_graphs folder and than change the flow accordingly. BAM comes with couple of templates to make this easier so we are going to use one of those. What we wanna do is basically a reformat type of the job. There is going to be on input and one output. Luckily there is reformat template. Go ahead and run

bam scaffold graph_template reformat my_reformat.grf

Now you could go ahead and edit the graph by hand if you are an advanced user, but usually it is much easier to edit the graph in CloudConnect. I will show you how to do it currently (this will change, hopefully :-)).

First let's open the flow again and change the graph name from "process_owner" to "my_reformat". Then go ahead and generate this flow again.

bam generate --only my_owner

Import clover-project as a CC project (remember, this is a fully functional CC project; also, during import, uncheck the checkbox that says "Copy the project to workspace", which will allow us to regenerate at will without importing or copying files). Go ahead and in CloudConnect open the graph called "my_reformat.grf". It looks something like this.

my_reformat.grf graph

Notice a couple of things.

  • The input is named generically "1_in.csv".
  • The output is named "out.csv".
  • If you try double-clicking an edge, it does not open the metadata but throws an error. If you hover over the edge that goes into the Reformat, the metadata path will be something like "$PROJECT/metadata/$FLOW/$NAME/1_in.xml". Similarly, if you hover over the edge that goes out of the Reformat, it will say "$PROJECT/metadata/$FLOW/$NAME/1_out.xml".

Generally the graph you create cannot be arbitrary. The BAM framework expects something from you so it can give you something back. We will walk through that in more detail later.

If you know enough CloudConnect, you know that $SOMETHING stands for a parameter. The metadata path "$PROJECT/metadata/$FLOW/$NAME/1_out.xml" thus expects three parameters. $PROJECT is global: it is just the path to your current project and is provided automatically by the project and CloudConnect. More BAM-related are the other two. They are generated dynamically at runtime, so when a graph is run it reaches the correct metadata.

Short intermezzo: why do all this? The thinking behind this came from the GoodSales project. There is a default implementation of GoodSales, and the majority of customers share some of its pieces. If you share those pieces, you do not have to create your own; on top of that, if somebody finds and fixes a bug in a shared piece, you get the fix for free. So it is a way of sharing updates. If you have something specific, you can either change the graph altogether or prepend some other graph that will, for example, normalize the data so the standard one can be used, etc.

Since we want to debug things, we need to provide the Clover graph the same information it would have during runtime. This basically means filling the proper values into the FLOW and NAME parameters. There is a bam command to help you. Run

bam debug clover-project/ my_owner my_reformat

It says: for the project that resides inside 'clover-project', debug the my_owner flow and the my_reformat graph. Now go ahead and close and reopen the graph in CC (to let CC reload the parameter files). Now if you click on the metadata, they should open just fine. Take note that the in and out metadata are different, following the specification we made inside our flow. Go ahead and open the Reformat and open the Source tab. You will see something like this

function integer transform() {
    $out.0.* = $in.0.*;

    return ALL;
}

You can go ahead and dig deeper into this language, called CTL2, in the CloudConnect documentation, but what we need to worry about for now is the line

$out.0.* = $in.0.*;

It says: for each field that comes in on the input record, find a field with the same name on the output and copy the value. This is great, since our input and output metadata share a lot of fields. There are also a couple of fields that are on the output but not on the input, like Url. We need to say specifically what to do with those. For example like this

$out.0.Url = "This will be same for all fields"

Or maybe somewhat more useful

$out.0.Url = "This ia a peson named " + $out.0.Name

The result should look something like this

$out.0.* = $in.0.*;
$out.0.Url = "This is a person named " + $out.0.Name;

Again, let's note a couple of things.

  • The * notation is what makes BAM tick. It maps the fields by name automatically at runtime, so if we add, say, Region to the tap, this reformat would still work, letting Region through.
  • Some of the fields are used by name. In our case these are Url and Name. If these are not on the output/input, the graph will crash.
  • This is important to understand. Usually it is not a big deal: if you are doing a custom implementation, you will just write custom graphs that suit your situation. But sometimes you will (and always should strive to) implement projects that reuse certain parts, like we are doing with GoodSales. That is an ETL used by tens of customers. Then you have a problem when they do not have Id in their data, and you have to deal with that situation.

You are almost done. Just save your work, and let's think about what you have done. You set up a graph that is part of the generated project. If you regenerated the project now, the changes you made by hand would go away. You need to copy the changed graph somewhere BAM will find it. This place is the local_graphs directory that we talked about before. On Linux/Mac you can do it easily with

cp ./clover-project/graphs/my_reformat.grf ./local_graphs

Now go ahead and regenerate the project. If you reopen the graph in CloudConnect the changes you made should be there.

Creating graphs - some facts

Reading data

Data is moved around in CSV files, so if you want an input in your graph, create a CSV Reader. The framework will provide the inputs in files n_in.csv, where n is a number from 1 up, based on the order of the metadata statements after the graph call in your flow definition.

tap(:id => 'user')
tap(:id => 'account')

graph('my_graph')
metadata('user')
metadata('account')

This means that my_graph.grf should have two CSV Readers: one with source ${DATA_DIR}/1_in.csv, which will be fed the data from the user tap, and another with source ${DATA_DIR}/2_in.csv, which will be fed the data from the account tap.

Writing data

The graph is expected to produce one output, written as CSV to the file ${DATA_DIR}/out.csv.

Metadata

There are two metadata files for each tap: input and output. This is similar to the situation with files. The metadata files are numbered by "port number", which is again determined by the order of the metadata statements after graph in the flow definition.

Imagine you have this tap and the following flow.

{
   "source" : "salesforce"
  ,"object" : "User"
  ,"id"     : "user"
  ,"fields" : [
    {
      "name" : "Id"
    },
    {
      "name" : "Name"
    }
  ]
}

GoodData::CloverGenerator::DSL::flow("user") do |f|
  tap(:id => "user")
  tap(:id => "account")

  graph("my_graph")
  metadata("user") do |m|
    m.add(:name => "Url")
  end
  metadata("account")

  # other stuff
end

This means that in the graph, the first "port" will get data from the user tap in file 1_in.csv (we talked about this). The two metadata files for this tap are going to be created at clover-project/metadata/user/my_graph/1_in.xml and clover-project/metadata/user/my_graph/1_out.xml. The first one will contain

<?xml version="1.0" encoding="UTF-8"?>
<Record fieldDelimiter="," name="in_1" recordDelimiter="\n" type="delimited">
  <Field name="Id" type="string" nullable="true"/>
  <Field name="Name" type="string" nullable="true"/>
</Record>

the other

<?xml version="1.0" encoding="UTF-8"?>
<Record fieldDelimiter="," name="in_1" recordDelimiter="\n" type="delimited">
  <Field name="Id" type="string" nullable="true"/>
  <Field name="Name" type="string" nullable="true"/>
  <Field name="Url" type="string" nullable="true"/>
</Record>

Take note that this is in sync with what we have defined in the flow.

Sinks

A sink is a definition of where data goes. Currently there is only one sink type: a GoodData dataset.

GoodData

{
  "type" : "dataset",
  "id" : "account",
  "gd_name" : "account",
  "fields": [
    {
      "name" : "id",
      "type" : "connection_point",
      "meta" : "Id"
    },
    {
      "type" : "label",
      "for"  : "id",
      "name" : "name",
      "meta" : "Name"
    },
    {
      "type" : "label",
      "for"  : "id",
      "name" : "url",
      "meta" : "Url"
    },
    {
      "name" : "market",
      "type" : "attribute",
      "meta" : "Market__c"
    }
  ]
}

The GoodData sink currently just mimics the CL tool definition plus some shortcuts on top of that. If you are familiar with the CL tool, you should be right at home if I tell you that the only additional thing you have to provide is the meta key, which tells BAM which metadata field is pulled into a given dataset field.

Deploying and running your work

When you have finished work on your masterpiece, it is time to run it. As we have already seen, there is a run command in BAM. You can use it like this

bam run clover-project

It will deploy the directory (remember, this is a fully functional CloudConnect project) to the secure server, potentially create an email channel for your email, and run the project. When the run ends, it deletes both the channel and the deployed project. This is useful for one-off jobs when you want to take advantage of the remote stack. Please note that whatever is in the directory is what gets deployed. This might bite you if you previously ran the project locally and there are data or parameter files lying around. I recommend regenerating the project before running.

If you want something more permanent, use deploy. This is the same as the first step of run. Once deployed, you can schedule the project through the administration UI at https://secure.gooddata.com/admin/dataload/ . The same caveat applies here as well: if you ran the project locally, there are probably files or parameters around that might break your run. Regenerating is recommended.

Note: Both run and deploy will change in the future. run still does not have the emails completely done. For deploy we will probably prepare schedules and email channels as well.

Adding a field

Ok, let's say you have a basic GoodSales project.

Data validation

BAM tries to teach you the right way to do projects, and one of the pain points is the data you are getting from a customer. If you have ever gotten an invalid CSV file from a customer and had to write a custom script to find the error, read on. We believe this pain can be automated away. In a tap, you can annotate the fields with validation rules. These rules say, for example, "this field is a number", "this field is a URI", "this is a date in a given format", "this cannot be empty". From this description you can generate a validation graph that sits between the customer's data and your ETL. If a new upload passes validation, it is handed to the ETL. If not, it is ignored, so your ETL works only on data that passes certain checks. So how do you do this?

{
  "type" : "tap"
  ,"source" : "https://svarovsky%40gooddata.com:[email protected]/project-uploads/d2uopvzruqsc9mwuili714h0g6sl8h5y/validated/account*.csv"
  ,"validation_source" : "https://svarovsky%40gooddata.com:[email protected]/project-uploads/d2uopvzruqsc9mwuili714h0g6sl8h5y/account*.csv"
  ,"incremental" : true
  ,"id" : "account"
  ,"fields" : [
    {
      "name" : "Id",
      "validates_as" : {
        "type" : "integer"
      }
    },
    {
      "name" : "Name"
    },
    {
      "name" : "Date",
      "validates_as" : {
        "type" : "date",
        "format" : "yyyy/MM/dd"
      }
    },
    {
      "name" : "OtherField"
    }
  ]
  // ,"limit": "10"
}

Now you can generate the data validator. This is a standard graph, so you can deploy it to production so it sits in GoodData.

bam generate_validator
bam -vl deploy validator-project

This deploys it in verbose mode. You need to get the deploy process ID so you can use it later. Since you ran it in verbose mode, one of the last lines should look like

=>"/gdc/projects/d2uopvzruqsc9mwuili714h0g6sl8h5y/dataload/processes/663136fa-f996-4b35-828a-60dd154ff71a", "executions"=>"/gdc/projects/d2uopvzruqsc9mwuili714h0g6sl8h5y/dataload/processes/663136fa-f996-4b35-828a-60dd154ff71a/executions"}}}

The process ID in this case is 663136fa-f996-4b35-828a-60dd154ff71a.

Now you need to upload the data and tell the validator to check it. You can either do it yourself (once we document the conventions that must be followed) or use one of our agents. There is a Java one maturing, but at development time you can easily use the one in BAM (we do not recommend using it in production, though; there are lots of nifty features missing).

bam -vl run_validator --process 663136fa-f996-4b35-828a-60dd154ff71a account.csv

You can see that the run_validator command needs the process parameter passed in. It also consumes a list of files to upload. After uploading them it runs the validator. If everything goes OK, it just moves the files to another directory where the ETL or downloaders can pick them up, and then quits silently. If not, it tells you where to look for a human-readable report of what went wrong.

Runtime commands

Part of the distribution is the bam executable, which lets you do several neat things on the command line.

Run bam to get the list of commands. Run bam help command to get help about a particular command.

generate

Generates the ETL. The default target directory is clover-project (currently cannot be changed). --only flow_id generates only one flow, which is useful for debugging.

bam generate --only owner

generate_downloaders

Generates the downloaders into downloader-project (currently cannot be changed).

deploy directory

Deploys the directory to the server.

bam deploy clover-project

--process process_id: you can specify a process ID so you can redeploy to the same process. This just updates the deployed project; all the schedules are still in effect.

bam deploy clover-project --process 1231jkadjk123k

model_sync

This goes through the sinks and updates the model. It relies on the CL tool to do this, which also defines its limitations. It is very useful for adding additional fields, not for changing the model altogether.

run

Runs the project on the server. This is achieved by deploying it there and deleting it after the run finishes.

bam run clover-project

--email [email protected]: this creates a temporary email channel and hooks events on success and failure to it. The channel is torn down once the ETL is done.

scaffold

Creates file templates so you do not need to start from scratch.

bam scaffold project my_new_project

bam scaffold tap new_tap

bam scaffold flow new_flow

bam scaffold dataset new_dataset

To further ease typical ETL tasks, BAM comes with a couple of templates with prefilled ETL constructs.

bam scaffold graph_template reformat local_process_my_stuff
bam scaffold graph_template join local_process_my_other_stuff

taps_generate_docs

In your project there should be a README.md.erb file. Running this command transforms it into README.md and puts it into the project so it can be committed to git. Since it is an erb template, there are several expressions that you can use.

<%= taps %> - list of taps
<%= sinks %> - list of sinks

You can run arbitrary Ruby code inside, so you can write something like

Last generated at <%= Date.today %>

taps_validate

Currently works only for SF. Validates that the target SF instance has all the fields in the objects that are specified in the tap definitions.

sinks_validate

TBD

sf_jack_in

Note: Before we start: if you want to exit the interactive session, just type exit. If the output of a command is larger than the screen, the session enters a different "viewing" mode; you can exit it by pressing q.

This logs you into the Salesforce project and starts an interactive client. You can do several useful things with it, for example validate fields while talking to the customer. I will show you a couple of things.

You can list fields

fields('Opportunity')

and do a lot of interesting stuff with it, like searching

fields('Opportunity').grep /__c/

counting

fields('Opportunity').count

and basically anything you can do with Ruby, like writing those fields to a CSV file

CSV.open('list_of_opportunity_fields.csv', 'w') do |csv|
    fields('Opportunity').map {|f| f.upcase}.each do |f|
        csv << [f]
    end
end

You can make a query

query("SELECT SUM(Amount) FROM Opporunity")

or

query("SELECT Id, Name, StageName FROM Opportunity LIMIT 10")

Again, you can access the results in many ways, like summing the Amount on Closed Won opportunities.

query("SELECT Id, Amount, StageName FROM Opportunity LIMIT 10").find_all do |line|
    line[:StageName] == "Closed Won"
end.reduce(0) {|memo, line| memo += line[:Amount].to_i}

The why

For those who are interested in why we actually bothered developing this and what decisions we made: read on, and let us know whether you like them or not.

Metadata management

The key pain I had with CloudConnect is that I did not like the management of metadata. Every project I saw was just a pile of metadata definitions that had to be constantly changed and tweaked. This is caused by a couple of choices that the creators of the underlying Clover engine made in the beginning, and they probably will not be changed easily. While I am trying to make it better, I am still bound by these choices, and sometimes the wiring sticks out. Sorry for that.

Incremental metadata

BAM works with something called incremental metadata. Metadata is not defined at each step; you just say what you want to change. A picture is probably better than a thousand words.

Imagine a conceptual picture of a simple transformation. You have a tap that downloads FirstName and LastName from somewhere. Obviously you would like to join them together to form a name. Exactly this happens in the second box, the transformer. You would like to sink the only remaining field, and that is Name. So on the next edge, what you say is "I am adding Name and removing FirstName and LastName". So far so good. What is elegant about this approach is how it copes with change. Imagine that the tap grabs not only FirstName and LastName but also Age. Now what do you need to change? The old way, you would have to change the metadata on both edges, the tap, the transformer, and the sink. With incremental metadata you need to change the tap and the sink, nothing else. Since I claim that dealing with metadata was the biggest pain, this is a lot of work (and errors) that you just saved.

Types or not?

The Clover engine is built on Java, and it shows. It is statically typed, and CTL, the Clover transformation language, resembles Java a lot. While this helps speed, and many people claim it prevents errors, it also causes more work and fuels the metadata explosion. Sometimes you need to translate a field into another field because you need to do something specific or a component needs it. That is not a problem per se, but it is important to see the tradeoffs and push the functionality into components that work for you, not against you. It is also important to do certain tasks at certain phases. If you do this, you will find that certain parts are easier to automate, or that you can easily reuse work you did somewhere else.