BAsh Machinery = BAM

BAM is a tool that helps you be more productive while creating and maintaining projects.

Installation

Make sure you have Ruby (1.8.7 and 1.9 are currently supported) and RubyGems installed. On a Mac both are already on your machine.

gem install gd_bam

Done.

Sample project -- GoodSales

Prerequisites

You need a working project with API access. You should also have your username, password, and token ready to fill into params.json.

Warnings

The project is currently cloned out of an existing project. That means you need to have access to it. If you do not (the project PID is i49w4c73c2mh75iiehte3fv3fbos8h2k), ask [email protected]. Eventually this will be covered by a template, so you will not need to do anything special. Template creation is tracked here: https://jira.intgdc.com/browse/GD-34641 .

Let's get to it

We will spin up a GoodSales project and load it with data. A prerequisite is a functioning Salesforce project, which you can grab at force.com.

bam scaffold project test --blueprint goodsales

Now you can go inside with cd test. You will notice several directories and files. We will get to flows, taps, and sinks later; for now focus just on params.json. If you open it you will see several parameters that you have to fill in. Common ones should be predefined and empty. For starters you will need the gd_login and gd_pass parameters filled in.
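The layout looks roughly like this (a sketch; the flows, taps, and sinks directory names are assumptions based on the sections below, while local_graphs and params.json are mentioned explicitly elsewhere in this document):

test/
  flows/          # flow definitions (Ruby DSL)
  taps/           # tap definitions (JSON)
  sinks/          # sink definitions (JSON)
  local_graphs/   # project-specific graphs
  params.json     # credentials and parameters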

One of the parameters is project_pid. To get that you need a project.

bam project --blueprint goodsales --token YOUR_TOKEN

This should spin for a while and eventually give you a project ID. Fill it into your params.json.
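A minimal params.json might then look like this (a sketch; your scaffold will likely predefine more keys, including the Salesforce credentials):

{
   "gd_login"    : "[email protected]"
  ,"gd_pass"     : "my_secret_password"
  ,"project_pid" : "pid_returned_by_bam_project"
}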

Now go ahead and generate the downloaders.

bam generate_downloaders

We will talk in detail later about why we split a project's ETL into downloaders and the rest of the ETL. For now, just trust us. By default it is generated into the downloader_project folder. You can go ahead and run it on the platform.

bam run downloader-project --email [email protected]

You can watch the progress in the CloudConnect console.

Now generate the etl.

bam generate

This works the same as with the downloaders, but its default target is clover_project.

bam run clover-project --email [email protected]

After it is finished, log in to GoodData, go into your project, and celebrate. You just built a project using BAM.

Painful metadata management

The key pain I had with CloudConnect is that I hated the management of metadata. Every project I saw was just a pile of metadata definitions that had to be constantly changed and tweaked. This is caused by a couple of choices the creators of the underlying Clover engine made in the beginning that probably will not be changed easily. While I am trying to make it better, I am still bound by these choices, and sometimes the wiring sticks out - sorry for that.

Incremental metadata

BAM works with something called incremental metadata. Metadata is not defined at each step; you just say what you want to change. A picture is probably better than a thousand words.

You have a conceptual picture of a simple transformation. You have a tap that downloads FirstName and LastName from somewhere. Obviously you would like to join them together to form a name. Exactly this happens in the second box, the transformer. You would like to sink only one field, Name. So on the next edge what you say is "I am adding Name and removing FirstName and LastName". So far so good. What is elegant about this approach is how it copes with change. Imagine that the tap grabs not only FirstName and LastName but also Age. What do you need to change? If you did it the old way, you would have to change the metadata on both edges, the tap, the transformer, and the sink. With incremental metadata you need to change the tap and the sink, nothing else. Since I claim that dealing with metadata was the biggest pain, this is a lot of work (and errors) that you just saved.
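This delta is exactly what you write down in the flow DSL (shown in full in the Flows section below); the edge carries only the change:

metadata("user") do |m|
  m.remove("FirstName")   # FirstName stops here
  m.remove("LastName")    # LastName stops here
  m.add(:name => "Name")  # only Name continues downstream
end

If the tap later starts grabbing Age as well, this block stays untouched; Age simply flows through.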

Types or not?

The Clover engine is built on Java and it shows. It is statically typed, and CTL, the Clover transformation language, resembles Java a lot. While this helps speed, and many people claim it prevents errors, it also causes more work and fuels metadata explosion. Sometimes you need to translate a field into another field because you need to do something specific or a component needs it. That is not a problem per se, but it is important to see the tradeoffs and push the functionality into the components, so they work for you and not against you. It is also important to do certain tasks in certain phases. If you do, you will find that certain parts are easier to automate, or that you can easily reuse work you did somewhere else.

Taps

Taps are sources of data. Right now you can use only the Salesforce tap.

Common properties

Every tap has a source and an id. The source tells it where to go to grab the data: DB, CSV, Salesforce. Depending on the type, the definition might differ. The id holds a name by which you can reference the particular tap. Obviously it needs to be unique.

Salesforce

This tap connects to Salesforce using the credentials mentioned in params.json. object tells the downloader which SF object to grab.

{
   "source" : "salesforce"
  ,"object" : "User"
  ,"id"     : "user"
  ,"fields" : [
    {
      "name" : "Id"
    },
    {
      "name" : "FirstName"
    },
    {
      "name" : "LastName"
    },
    {
      "name" : "Region"
    },
    {
      "name" : "Department"
    }
  ]
}

Limit

Sometimes it is useful to limit the number of grabbed records, for example for testing purposes.

{
   "source" : "salesforce"
  ,"object" : "User"
  ,"id"     : "user"
  ,"fields" : [
    {
      "name" : "Id"
    }
    .
    .
    .
  ]
  ,"limit": 100
}

Acts as

Sometimes you need to use one field several times in a source of data, or you want to "call" a certain field differently because the ETL relies on a particular name. Both cases are handled using acts_as.

{
   "source" : "salesforce"
  ,"object" : "User"
  ,"id"     : "user"
  ,"fields" : [
    {
      "name" : "Id", "acts_as" : ["Id", "Name"]
    },
    {
      "name" : "Custom_Amount__c", "acts_as" : ["RenamedAmount"]
    }
  ]
}

Id will be routed to both Id and Name. Custom_Amount__c will be called RenamedAmount.

Condition

You can also specify a condition for the download. I recommend using it only if it drastically lowers the amount of data that goes over the wire. Otherwise implement it elsewhere.
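A sketch of what this might look like; the condition key name and the SOQL WHERE fragment syntax are assumptions, not confirmed by this document:

{
   "source"    : "salesforce"
  ,"object"    : "User"
  ,"id"        : "user"
  ,"fields"    : [
    {
      "name" : "Id"
    }
  ]
  ,"condition" : "IsActive = true"
}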

Incremental

It is wasteful to download everything over and over again. If you specify incremental=true you are telling BAM to download only incrementally. This means several things, and even though BAM tries to hide away as much as it can, it might be confusing. An incremental tap means (a sketch follows this list):

  • The tap is treated differently. If you generate the ETL with bam generate, the taps will not reach into SF but into an intermediary store.
  • With generate_downloaders you can generate graphs that handle the incremental nature of the downloaders and all the bookkeeping, and make sure that nothing gets lost. They store the data in the intermediary store.
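In the tap definition this is just one more key; a sketch based on the User tap above (the incremental key itself comes straight from this section):

{
   "source"      : "salesforce"
  ,"object"      : "User"
  ,"id"          : "user"
  ,"fields"      : [
    {
      "name" : "Id"
    }
  ]
  ,"incremental" : true
}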

Why split graphs?

The reason for this is simple. When you download only incrementally you do not stress the wires that much, which means you can run it pretty often. And by running it often, even if something horrible happens once, it will probably run successfully the next time. As we mentioned, this is cheap. On the other hand, running the main ETL is often very expensive, and recovering from its failures is usually different, so splitting them simplifies the development of each. Since they are independent, they can even be developed by different people, which is sometimes useful.

Taps validation

Fail early. There is nothing more frustrating than an ETL failing during execution. When you develop the taps you can ask BAM to connect to SF and validate that the fields are present (see sinks_validate under Runtime commands below). This is not bulletproof, since some fields can go away at any time, but it gives you a good idea whether you misspelled any fields.

Mandatory fields

Sometimes fields get moved around in SF. In such a case the tap will fail. If you know this upfront, you can tell BAM that a field is not mandatory, and it will silently go along, filling the missing field with ''.
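A sketch of a field entry marked as not mandatory; the exact key name (mandatory) is an assumption, not confirmed by this document:

{
  "name" : "Region", "mandatory" : false
}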

Flows

A flow is an abstraction that ties taps, graphs, and sinks together.

The flow below downloads users from SF (you have to provide credentials in params.json). It then runs a graph called process_owner (this is part of the distribution), which concatenates the first name and last name. Finally it feeds the data to the sink.

GoodData::CloverGenerator::DSL::flow("user") do |f|
  tap(:id => "user")

  graph("process_owner")
  ("user") do |m|
    m.remove("FirstName")
    m.remove("LastName")
    m.add(:name => "Name")
  end

  sink(:id => "user")
end

You also have to provide the definition of the user tap; it is exactly the kind of Salesforce tap definition shown earlier in the Taps section.

When I call an external graph, how does it work?

In the flow you can call external graph by using

graph('my_graph')

It looks in two places (this will change) to find the graph. The first place is the local_graphs directory in your project; the second is a central repository that currently lives inside the BAM library (this part will probably change).

Sinks

A sink is a definition of where the data goes. Currently there is only one sink type, and that is a GoodData dataset.

GoodData

{
  "type" : "dataset",
  "id" : "account",
  "gd_name" : "account",
  "fields": [
    {
      "name" : "id",
      "type" : "connection_point",
      "meta" : "Id"
    },
    {
      "type" : "label",
      "for"  : "id",
      "name" : "name",
      "meta" : "Name"
    },
    {
      "type" : "label",
      "for"  : "id",
      "name" : "url",
      "meta" : "Url"
    },
    {
      "name" : "market",
      "type" : "attribute",
      "meta" : "Market__c"
    }
  ]
}

The GoodData sink currently just mimics the CL tool definition, plus some shortcuts on top of that. If you are familiar with the CL tool you should be right at home; the only additional thing you have to provide is the meta key, which tells BAM which metadata field is pulled into a given dataset field.

Runtime commands

Part of the distribution is the bam executable, which lets you do several neat things on the command line.

Run bam to get the list of commands. Run bam help command to get help about a particular command.
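For example, to see the options of the generate command:

bam help generate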

deploy directory

Deploys the directory to the server. You can provide the PID of the process as a parameter.

generate

Generates the ETL. The default target directory is clover_project (currently this cannot be changed). You can provide the --only parameter with the name of a flow if you do not need to generate all flows. Currently you can specify only one flow.
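For example, to generate only the user flow from the Flows section above:

bam generate --only user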

generate_downloaders

If you have incremental downloaders in your project, it is good to deploy them as a separate process. This generates only the downloaders and is meant for exactly this purpose. If you are interested in why it is a good idea, take a look here (TBD). The target directory is downloaders_project (currently this cannot be changed).

generate_xmls

Investigates what has changed and performs the changes in the target project. Uses the CL tool behind the scenes. Needs more work.

model_sync

Syncs the model with the definitions in the sinks. Sometimes a new field can actually be a typo or something like that; this can be uncovered with validate_datasets.

run

Runs the project on the platform, for example bam run clover-project --email [email protected].

scaffold

Takes an argument and creates a scaffold for you. It can scaffold a project, flow, sink, or tap.

taps_generate_docs

In your project there should be a README.md.erb file. Running this command transforms it into README.md and puts it into the project so it can be committed to git. The interpolated params are taps and sinks.

sinks_validate

Currently works only for SF. Validates that the target SF instance has all the fields in the objects that are specified in the taps definitions.

validate_datasets

Validates the sinks (currently only GD) against the definitions in the project. It looks for fields that are defined inside sinks but are not in the project, missing references, etc. More description needed.

Roadmap

  • Allow different storage than ES (Vertica)
  • Contract checkers