Apr 18, 2013

I am on a mission.  Several, actually, but this one is about enabling the masses of PDI and Kettle users to keep their proverbial ducks in a row.

What I’m talking about is the bevy of errant connections that are meant to be straight but are off ever so slightly.  But you and I know better; we see that errant line, the angle that is off by 3 degrees.  And it MUST BE FIXED!

Let’s look at an example.  Below we have a transformation that reads some sales data, does some data cleansing and data enrichment, and outputs the results into a fact table.  Now does that look terrible or what?  Every line is crooked.

If you’re like me, you’ll spend extra time making sure every line is truly straight.  Then you’ll need to add or change something about your transformation and do the line-straightening thing all over again.  Let’s face it: that’s a waste of time.  Wouldn’t it be great if there were a way to have our ETL steps automatically lock into place on a grid?  Well, you’ve come to the right place.  Open up the Tools –> Options menu.

Switch to the “Look & Feel” tab, and change the “Grid Size” field to something larger.  I’ve found 30 to be a good value to set here, mostly because that’s what Lynn Yarbrough suggested during my PDI training class.  Thanks Lynn!
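Under the hood, snapping is just rounding: each step’s canvas coordinates get pulled to the nearest multiple of the grid size.  A minimal sketch of the idea in Python (the `snap` helper is mine, not part of PDI):

```python
def snap(coord, grid_size=30):
    """Round a canvas coordinate to the nearest multiple of grid_size."""
    return round(coord / grid_size) * grid_size

# A step dropped at (47, 133) lands on the 30-pixel grid at (60, 120),
# so any two steps in the same grid row or column line up exactly.
print(snap(47), snap(133))
```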

Now we can clean up our transformation and straighten everything out with the quickness!  Check it out:

See?  Look how nice that is.

So there you have it – the best PDI / Kettle tip ever for orderly people.  Don’t you agree?  Learn more goodies like this in our PDI training course!


Apr 11, 2013

I have a little project going on right now.  I’m trying to build out a demonstration of our software that highlights how the platform could be leveraged for lending risk management.  One big problem with this is that I am not a bank, and therefore I do not have the necessary data to illustrate the use case (and Pentaho’s capabilities).  Therefore, I must create it.

So I found some public data regarding mortgage applications.  It’s a good foundation, but it’s still not exactly the right kind of data: it’s missing key fields for the use case, like the application date, decision date, and closing date for the loan, not to mention payment status, home price, etc.  You know, information that a real bank would have at its disposal.

I set out to sample the data I found (a CSV file), and loaded it into a staging table via a simple transformation:

In this transformation, I’m loading the fields from the CSV file, setting certain field values to null if they contain “NA” instead of a blank entry, sampling 500,000 records (out of the ~15M in the CSV file), and loading it into a staging table.  Pretty straightforward.
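For illustration, here’s roughly what that staging logic looks like outside of PDI.  This is just a Python sketch of the same steps, not the transformation itself; the field names and the tiny inline CSV stand in for the real ~15M-row public file:

```python
import csv
import io
import random

def clean(row):
    # Treat "NA" markers as missing values (null), mirroring the
    # null-if-"NA" step in the transformation.
    return {k: (None if v == "NA" else v) for k, v in row.items()}

def stage(csv_text, sample_size):
    """Read, clean, and randomly sample rows bound for the staging table."""
    rows = [clean(r) for r in csv.DictReader(io.StringIO(csv_text))]
    return random.sample(rows, min(sample_size, len(rows)))

# Tiny stand-in for the public CSV; the real run sampled 500,000 records.
demo = "loan_amount,rate_spread\n180000,NA\n95000,1.2\n"
staged = stage(demo, sample_size=2)
```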

Now I have a table with about 45 fields in it, but no application date field, which I want to add.  I found a tool called Spawner that can generate fields of various types within certain ranges, including dates.  Here’s a quick screen shot of how I set up Spawner to generate an ApplicationDate field:

I do want the data to reflect the seasonal home-buying patterns (and the resulting mortgages) through the months of a year, so I’m generating a month’s worth of data at a time.  On the output tab, I configured Spawner to generate a CSV to hold the number of January dates that I want reflected in the data.

Now the big question is: how do I get this disconnected set of application dates included as a column with the other 45 fields from the public data I found?

I’m not going to lie: I tried a lot of different steps in PDI before it dawned on me.  If I’m allocating a certain percentage of the total number of records to each month I’m representing, there is no reason I can’t generate a sequence in both the public data set and the generated application dates, and leverage a stream lookup step in PDI to do the match.  After this revelation, I added a sequence as my “Id” field in Spawner for my January dates, and I quickly modified the staging transformation to include a sequence.  Now every record in the staging table has a unique identifier (a new field called “Id”):

Now in the transformation in which I load the fact table, I can use a stream lookup to load my randomly generated January dates:

The stream lookup is configured to match the Id field from the staging table with the Id field that I generated in the January CSV file, and retrieve the associated ApplicationDate for that specific record.  After doing this process for the remaining months in the year (which is admittedly a little bit tedious), the data will reflect the seasonality I want it to show.
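The matching logic boils down to a keyed join between the two streams.  Here’s a small Python sketch of the idea; the field names and values are invented for illustration, and the dict stands in for the lookup stream built from the January CSV:

```python
import random
from datetime import date, timedelta

# Stand-in for the staging table: each record carries the "Id" sequence.
staging = [{"Id": i, "loan_amount": 100_000 + i} for i in range(1, 6)]

# Stand-in for the Spawner output: generated January dates keyed by the
# same "Id" sequence.
jan_dates = {
    i: date(2013, 1, 1) + timedelta(days=random.randrange(31))
    for i in range(1, 6)
}

# The stream lookup: match on Id, retrieve the associated ApplicationDate.
for row in staging:
    row["ApplicationDate"] = jan_dates[row["Id"]]
```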

So hopefully this will save people some time when they try to figure out how to add new columns to an existing set of data.  If you have suggestions on an even more expedient way to achieve the same result, please do leave a comment here or catch me on Twitter and let me know!  Also, if you want to give PDI a spin, grab the trial download here and have fun!

Apr 05, 2013

Last week I posted about how to define rolling averages in Pentaho Analyzer. Since then, my colleague Andy Grohe pointed out that this could also be done via a trend measure rather than having to use MDX.  I found this to be true with a minor caveat, but more on that a bit later.  Andy’s suggestion is a great one because this method is far more accessible than getting into MDX.

Average Trend Measures

For now, let’s look at how to set up a trend measure to calculate averages based on prior periods.  Let’s start with a simple example.  In this data set, we’re filtering on March and April of 2013, and showing the numbered week of the year and a count of requests opened during those weeks:

Right clicking on the “Requests” dimension reveals a bevy of useful options, one of which is “User Defined Measures”, under which a “Trend Measure” can be defined.  From here we can define our new “Average Requests” measure as follows:

Now we’ve got our new user defined measure included in our analysis, derived from the prior four time periods:

If we switch to a column-line combo chart, we get this view:

Right away you should notice something interesting: the average is based on the prior four time periods, so the line that is plotted is actually kind of a lagging indicator.
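To make that “lagging” behavior concrete, here’s a small Python sketch contrasting an average over the prior four periods (the trend-measure behavior) with one that includes the current period (the function names and sample values are mine):

```python
def avg_prior(values, i, n=4):
    """Average of the n periods before index i, excluding i itself."""
    window = values[max(0, i - n):i]
    return sum(window) / len(window) if window else None

def avg_inclusive(values, i, n=4):
    """Average of the current period plus the n-1 periods before it."""
    window = values[max(0, i - n + 1):i + 1]
    return sum(window) / len(window)

requests = [10, 14, 12, 16, 18, 20]
# At the last week, the prior-period average reflects weeks 2-5 only,
# so it trails the inclusive version:
print(avg_prior(requests, 5))      # (14+12+16+18)/4 = 15.0
print(avg_inclusive(requests, 5))  # (12+16+18+20)/4 = 16.5
```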

Rolling Time Periods

To correct this “lag” issue, I thought I would leverage relative date filters to force the average to be calculated on a rolling basis, including the weeks from the current and previous month, rather than explicitly filtering on March and April of 2013.  For a more detailed discussion of how to set that up, please refer to my prior post on the topic.

After implementing the necessary changes to my schema, I leveraged the new filtering option (note the different filter selection in the filter pane), but the result was the same (still driven from the prior time periods):

I also tried filtering on the current and previous four weeks (instead of months, as above), but the behavior was the same.

This is where I think MDX becomes necessary, since you can use it to specify a formula that explicitly includes the current member as part of the range to be averaged for a calculated measure.  For details on how to set that up, please refer to my prior post on the topic.

Using MDX you can achieve average calculations that are potentially as fine as the grain of your time hierarchy (which is very cool), but you may want to forecast expected future values as well.  Such capabilities could really enhance decision making regarding resourcing, investment, retention, quality, or cost reduction.  Pentaho does support such predictive analytics, but alas, that’s a post for another day.  In the meantime, feel free to take Pentaho for a test drive.


Apr 02, 2013

A week or two ago I had a request about how to create visualizations that leverage rolling dates: for example, show the event count for the last four weeks (rather than filtering on four specific weeks of the year).  My teammate Wayne Johnson referred me to some documentation on how to modify your Mondrian schema file to enable this capability in Pentaho Analyzer.  It took me a while to really figure out how to get it to work, so I thought I’d share an example that was successful for me.

First, the goal.  For this test data set, I wanted to be able to show something like the image below, and have it be correctly up to date each day to show only the prior four days.


So how did I get there?  Have a look at the Mondrian schema file:

<Cube name="Events" visible="true" cache="true" enabled="true">
    <Table name="fact_events"/>
    <DimensionUsage source="Date" name="Date" visible="true" foreignKey="date_key" highCardinality="false"/>
    <Measure name="Events" column="id" formatString="#,###;(#,###)" aggregator="count"/>
</Cube>

<Dimension type="TimeDimension" visible="true" highCardinality="false" name="Date">
    <Hierarchy name="Date" visible="true" hasAll="true" primaryKey="date_id">
        <Table name="dim_date"/>
        <Level name="Year" visible="true" column="year_key" type="Integer" uniqueMembers="false" levelType="TimeYears" hideMemberIf="Never">
            <Annotations>
                <Annotation name="AnalyzerDateFormat"><![CDATA[[yyyy]]]></Annotation>
            </Annotations>
        </Level>
        <Level name="Quarter" visible="true" table="dim_date" column="quarter_key" nameColumn="quarter_desc" type="Integer" uniqueMembers="false" levelType="TimeQuarters" hideMemberIf="Never">
            <Annotations>
                <Annotation name="AnalyzerDateFormat"><![CDATA[[yyyy].['Q'q yyyy]]]></Annotation>
            </Annotations>
        </Level>
        <Level name="Month" visible="true" column="month_key" nameColumn="month_desc" type="Integer" uniqueMembers="false" levelType="TimeMonths" hideMemberIf="Never">
            <Annotations>
                <Annotation name="AnalyzerDateFormat"><![CDATA[[yyyy].['Q'q yyyy].[MMM yyyy]]]></Annotation>
            </Annotations>
        </Level>
        <Level name="Date" visible="true" column="date_id" nameColumn="date_desc" type="Integer" uniqueMembers="false" levelType="TimeDays" hideMemberIf="Never">
            <Annotations>
                <Annotation name="AnalyzerDateFormat"><![CDATA[[yyyy].['Q'q yyyy].[MMM yyyy].[yyyy-MM-dd]]]></Annotation>
            </Annotations>
        </Level>
    </Hierarchy>
</Dimension>


The key to getting this to work is matching up the annotation format from each dimension level with the values from the date/time dimension table.  In this case, the values from my date dimension looked like this:

As such, I had to provide some string literals within the “AnalyzerDateFormat” annotation and use the year at every level.  For example, ‘[yyyy].['Q'q yyyy].[MMM yyyy]’ matches the year, quarter, and month format.  ‘[yyyy]’ is clearly the year format, but the ‘quarter_desc’ column in the dim_date table held values that were non-standard according to the documented formatting rules.  The ‘Q’ in ‘['Q'q yyyy]’ is a string literal that matches directly with the data, whereas the ‘q’ matches the numeric quarter (1 in the graphic above).
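As a sanity check on the format, here’s a small Python helper (my own, purely illustrative) that renders a date into the member path the Date-level annotation describes; it assumes quarter and month descriptions shaped like ‘Q1 2013’ and ‘Mar 2013’, matching my dim_date values (yours may differ):

```python
from datetime import date

def analyzer_member_path(d):
    """Render a date as the member path matching the annotation
    [yyyy].['Q'q yyyy].[MMM yyyy].[yyyy-MM-dd]."""
    quarter = (d.month - 1) // 3 + 1
    return "[{y}].[Q{q} {y}].[{m} {y}].[{iso}]".format(
        y=d.year, q=quarter, m=d.strftime("%b"), iso=d.isoformat()
    )

print(analyzer_member_path(date(2013, 3, 15)))
# → [2013].[Q1 2013].[Mar 2013].[2013-03-15]
```

If what this prints doesn’t match the member names in your date dimension exactly, the relative date filters won’t resolve.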

Once I figured out how to configure the format strings, all I had to do was make use of the newly available filtering option for dates.  After right-clicking on “Date” and choosing “Filter”, I can “choose a commonly used time period” and select a specific number of previous dates (in this case 25, because my data is a little bit old).

If I set my system clock forward a day and re-run the report, we see that it automatically updates (in this case showing only three days’ worth of data):

Hopefully this is helpful!

Apr 01, 2013

Over the holidays, Pentaho developed and released the 12 days of visualizations, demonstrating the platform’s flexibility and how easy it can be to add new ways to visualize data.  I of course installed them all, enabling them within Pentaho Analyzer in the Pentaho User Console, but I never did figure out how to get them to run in PDI or Instaview.  But today changed everything…

Thanks to my colleague Sal Scalisi, I discovered that those visualizations could be deployed for use in PDI and Instaview by unzipping them to this location:


After restarting PDI, the custom visualizations were all available in the Visualization perspective, as shown here:

I would like to re-articulate the disclaimer made along with the release of these visualizations though:

These visualizations are provided for demonstration purposes ONLY.  They are provided with no warranty or commitments regarding the quality of the sample and/or platform support.  Many of the underlying charting libraries used in these samples only work on a subset of browsers that are part of Pentaho’s official support matrix.

Note that you will be installing these at your own risk.  For support purposes, you may be asked to disable these plugins within PDI and Instaview in order to fully debug any issues you may be experiencing.

Also, please note that this is not an April Fool’s joke, despite my posting it on April 1st.  If you want one of those, see here.