Apr 182013

I am on a mission.  Several actually, but this one is about enabling the masses of PDI and Kettle users to keep their proverbial ducks in a row.

What I’m talking about is the bevy of errant connections that are meant to be straight, but are off ever so slightly.  But you and I know better, we see that errant line, the angle that is off by 3 degrees.  And it MUST BE FIXED!

Let’s look at an example.  Below we have a transformation that reads some sales data, does some data cleansing and data enrichment, and outputs the results into a fact table.  Now does that look terrible or what?  Every line is crooked.

If you’re like me, you’ll spend extra time making sure every line is truly straight.  Then, you’ll need to add or change something about your transformation and do the line-straightening thing all over again.  Let’s face it, that’s a waste of time.  Wouldn’t it be great if there was a way to have our ETL steps automatically lock into place, say, on a grid perhaps?  Well you’ve come to the right place.  Open up the Tools –> Options menu.

Switch to the “Look & Feel” tab, and change the “Grid Size” field to something larger.  I’ve found 30 to be a good value to set here, mostly because that’s what Lynn Yarbrough suggested during my PDI training class.  Thanks Lynn!

Now we can clean up our transformation and straighten everything out with the quickness!  Check it out:

See?  Look how nice that is.

So there you have it – the best PDI / Kettle tip ever for orderly people.  Don’t you agree?  Learn more goodies like this in our PDI training course!


Apr 112013

I have a little project going on right now.  I’m trying to build out a demonstration of our software that highlights how the platform could be leveraged for lending risk management.  One big problem with this is that I am not a bank, and therefore I do not have the necessary data to illustrate the use case (and Pentaho’s capabilities).  Therefore, I must create it.

So I found some public data regarding mortgage applications, which is great, but it’s still not exactly the right kind of data, though it’s a good foundation.  It’s missing key fields for the use case, like the application date, decision date, and closing date for the loan, not to mention payment status, home price, etc.  You know, information that a real bank would have at its disposal.

I set out to sample the data I found (a CSV file), and loaded it into a staging table via a simple transformation:

In this transformation, I’m loading the fields from the CSV file, setting certain field values to null if they contain “NA” instead of a blank entry, sampling 500,000 records (out of the ~15M in the CSV file), and loading it into a staging table.  Pretty straightforward.

Now I have a table with about 45 fields in it, but none of them with a specific application date field, which I want to add in.  I found a tool called Spawner that can generate fields of various types within certain ranges, including dates.  Here’s a quick screen shot of how I set up Spawner to generate an ApplicationDate field:

I do want the data to reflect seasonality through the months of a year to reflect the seasonal home buying patterns and the resulting mortgages, so I’m generating a months worth of data at a time.  On the output tab I configured Spawner to generate a CSV to hold the number of January dates that I want reflected in the data.

Now the big question is: how do I get this disconnected set of application dates included as a column with the other 45 fields from the public data I found?

I’m not going to lie, I tried a lot of different steps in PDI before it dawned on me.  If I’m allocating a certain percentage of the total number of records for each month I’m representing, there is no reason I can’t generate a sequence in both the public data set and the generated application dates, and leverage a stream lookup step in PDI to do the match.  After this revelation, I added a sequence as my “Id” field in Spawner for my January dates, and I quickly modified the staging transformation to include a sequence.  Now every record in the staging table has a unique identifier (a new field called “Id”):

Now in the transformation in which I load the fact table, I can use a stream lookup to load my randomly generated January dates:

The stream lookup is configured to match the Id field from the staging table with the Id field that I generated in the January CSV file, and retrieve the associated ApplicationDate for that specific record.  After doing this process for the remaining months in the year (which is admittedly a little bit tedious), the data will reflect the seasonality I want it to show.

So hopefully this will save people some time when they try to figure out how to add new columns to an existing set of data.  If you have suggestions on an even more expedient way to achieve the same result, please do leave a comment here or catch me on Twitter and let me know!  Also, if you want to give PDI a spin, grab the trial download here and have fun!