Apr 162014

You know what’s interesting?  Sensor data and their associated analytics, especially as it pertains to personal health.  I came across this TED talk where Eric Dishman discusses how even the most basic devices like a telephone can be measured in many different ways and provide insight into many different things, such as early detection of Parkinson’s, Alzheimer’s, arthritis, and more.  Not to mention wearable devices such as Shimmer that provide data on gait, stride length, and tremor as a few examples.  There seems to be so much opportunity to do things better with respect to healthcare, particularly by applying sensor and big data technology.  Watch the talk!

Apr 182013

I am on a mission.  Several actually, but this one is about enabling the masses of PDI and Kettle users to keep their proverbial ducks in a row.

What I’m talking about is the bevy of errant connections that are meant to be straight, but are off ever so slightly.  But you and I know better, we see that errant line, the angle that is off by 3 degrees.  And it MUST BE FIXED!

Let’s look at an example.  Below we have a transformation that reads some sales data, does some data cleansing and data enrichment, and outputs the results into a fact table.  Now does that look terrible or what?  Every line is crooked.

If you’re like me, you’ll spend extra time making sure every line is truly straight.  Then, you’ll need to add or change something about your transformation and do the line-straightening thing all over again.  Let’s face it, that’s a waste of time.  Wouldn’t it be great if there was a way to have our ETL steps automatically lock into place, say, on a grid perhaps?  Well you’ve come to the right place.  Open up the Tools –> Options menu.

Switch to the “Look & Feel” tab, and change the “Grid Size” field to something larger.  I’ve found 30 to be a good value to set here, mostly because that’s what Lynn Yarbrough suggested during my PDI training class.  Thanks Lynn!

Now we can clean up our transformation and straighten everything out with the quickness!  Check it out:

See?  Look how nice that is.

So there you have it – the best PDI / Kettle tip ever for orderly people.  Don’t you agree?  Learn more goodies like this in our PDI training course!


Apr 112013

I have a little project going on right now.  I’m trying to build out a demonstration of our software that highlights how the platform could be leveraged for lending risk management.  One big problem with this is that I am not a bank, and therefore I do not have the necessary data to illustrate the use case (and Pentaho’s capabilities).  Therefore, I must create it.

So I found some public data regarding mortgage applications, which is great, but it’s still not exactly the right kind of data, though it’s a good foundation.  It’s missing key fields for the use case, like the application date, decision date, and closing date for the loan, not to mention payment status, home price, etc.  You know, information that a real bank would have at its disposal.

I set out to sample the data I found (a CSV file), and loaded it into a staging table via a simple transformation:

In this transformation, I’m loading the fields from the CSV file, setting certain field values to null if they contain “NA” instead of a blank entry, sampling 500,000 records (out of the ~15M in the CSV file), and loading it into a staging table.  Pretty straightforward.

Now I have a table with about 45 fields in it, but none of them with a specific application date field, which I want to add in.  I found a tool called Spawner that can generate fields of various types within certain ranges, including dates.  Here’s a quick screen shot of how I set up Spawner to generate an ApplicationDate field:

I do want the data to reflect seasonality through the months of a year to reflect the seasonal home buying patterns and the resulting mortgages, so I’m generating a months worth of data at a time.  On the output tab I configured Spawner to generate a CSV to hold the number of January dates that I want reflected in the data.

Now the big question is: how do I get this disconnected set of application dates included as a column with the other 45 fields from the public data I found?

I’m not going to lie, I tried a lot of different steps in PDI before it dawned on me.  If I’m allocating a certain percentage of the total number of records for each month I’m representing, there is no reason I can’t generate a sequence in both the public data set and the generated application dates, and leverage a stream lookup step in PDI to do the match.  After this revelation, I added a sequence as my “Id” field in Spawner for my January dates, and I quickly modified the staging transformation to include a sequence.  Now every record in the staging table has a unique identifier (a new field called “Id”):

Now in the transformation in which I load the fact table, I can use a stream lookup to load my randomly generated January dates:

The stream lookup is configured to match the Id field from the staging table with the Id field that I generated in the January CSV file, and retrieve the associated ApplicationDate for that specific record.  After doing this process for the remaining months in the year (which is admittedly a little bit tedious), the data will reflect the seasonality I want it to show.

So hopefully this will save people some time when they try to figure out how to add new columns to an existing set of data.  If you have suggestions on an even more expedient way to achieve the same result, please do leave a comment here or catch me on Twitter and let me know!  Also, if you want to give PDI a spin, grab the trial download here and have fun!

Apr 052013

Last week I posted about how to define rolling averages in Pentaho Analyzer. Since then, my colleague Andy Grohe pointed out that this could also be done via a trend measure rather than having to use MDX.  I found this to be true with a minor caveat, but more on that a bit later.  Andy’s suggestion is a great one because this method is far more accessible than getting into MDX.

Average Trend Measures

For now, let’s look at how to set up a trend measure to calculate averages based on prior periods.  Let’s start with a simple example.  In this data set, we’re filtering on March and April of 2013, and showing the numbered week of the year and a count of requests opened during those weeks:

Right clicking on the “Requests” dimension reveals a bevy of useful options, one of which is “User Defined Measures”, under which a “Trend Measure” can be defined.  From here we can define our new “Average Requests” measure as follows:

Now we’ve got our new user defined measure included in our analysis, derived from the prior four time periods:

If we switch to a column-line combo chart, we get this view:

Right away you should notice something interesting: the average is based on the prior four time periods, so the line that is plotted is actually kind of a lagging indicator.

Rolling Time Periods

In an attempt to correct this “lag” issue, I thought I would leverage relative date filters in an attempt to force the average to be calculated on a rolling basis, including the weeks from the current and previous month, rather than explicitly filtering on March and April of 2013.  For a more detailed discussion of how to set that up, please refer to my prior post on the topic.

After implementing the necessary changes to my schema, I leveraged the new filtering option (note the different filter selection in the filter pane), but the result was the same (still driven from the prior time periods):

I did also try filtering on current and previous 4 weeks (instead of months as above), but the behavior was the same.

This is where I think MDX becomes necessary, since you can use it to specify a formula the explicitly includes the current member as part of the range to be averaged for a calculated measure.  For details on how to set that up, please refer to my prior post on the topic.

Using MDX you can achieve average calculations that are potentially as fine as the grain of your time hierarchy (which is very cool), but you may want to forecast expected future values as well.  Such capabilities could really enhance decision making regarding resourcing, investment, retention, quality, or cost reduction.  Pentaho does support such predictive analytics, but alas, that’s a post for another day.  In the meantime, feel free to take Pentaho for a test drive.


Apr 022013

A week or two ago I had a request for how to create visualizations that leveraged rolling dates.  For example, show the event count for the last 4 weeks (rather than specifically filtering on the last four specific weeks of the year).  My teammate Wayne Johnson referred me to some documentation on how to modify your Mondrian schema file to enable this capability in Pentaho Analyzer, and it took me a while to really figure out how to get it to work, so I thought I’d share an example that I was successful with.

First, the goal.  For this test data set, I wanted to be able to show something like the image below, and have it be correctly up to date each day to show only the prior four days.


So how did I get there?  Have a look at the Mondrian schema file:

<Cube name=”Events” visible=”true” cache=”true” enabled=”true”>
    <Table name=”fact_events”>
    <DimensionUsage source=”Date” name=”Date” visible=”true” foreignKey=”date_key” highCardinality=”false”>
    <Measure name=”Events” column=”id” formatString=”#,###;(#,###)” aggregator=”count”>
<Dimension type=”TimeDimension” visible=”true” highCardinality=”false” name=”Date”>
    <Hierarchy name=”Date” visible=”true” hasAll=”true” primaryKey=”date_id”>
        <Table name=”dim_date”>
        <Level name=”Year” visible=”true” column=”year_key” type=”Integer” uniqueMembers=”false” levelType=”TimeYears” hideMemberIf=”Never”>
                <Annotation name=”AnalyzerDateFormat”>
        <Level name=”Quarter” visible=”true” table=”dim_date” column=”quarter_key” nameColumn=”quarter_desc” type=”Integer” uniqueMembers=”false” levelType=”TimeQuarters” hideMemberIf=”Never”>
                <Annotation name=”AnalyzerDateFormat”>
                    <![CDATA[[yyyy].['Q'q yyyy]]]>
        <Level name=”Month” visible=”true” column=”month_key” nameColumn=”month_desc” type=”Integer” uniqueMembers=”false” levelType=”TimeMonths” hideMemberIf=”Never”>
                <Annotation name=”AnalyzerDateFormat”>
                    <![CDATA[[yyyy].['Q'q yyyy].[MMM yyyy]]]>
        <Level name=”Date” visible=”true” column=”date_id” nameColumn=”date_desc” type=”Integer” uniqueMembers=”false” levelType=”TimeDays” hideMemberIf=”Never”>
                <Annotation name=”AnalyzerDateFormat”>
                    <![CDATA[[yyyy].['Q'q yyyy].[MMM yyyy].[yyyy-MM-dd]]]>


The key to getting this to work is matching up the annotation format from each dimension level with the values from the date/time dimension table.  In this case, the values from my date dimension looked like this:

As such, I had to provide some string literals within the “AnalyzerDateFormat” annotation and use the year for every level.  For example, ‘[yyyy].['Q'q yyyy].[MMM yyyy]‘ matches the year, quarter, and month format.  ’[yyyy]‘ is clearly the year format, but the ‘quarter_desc’ column in the dim_date table held values that were non-standard according to the documented formatting rules.  The ‘Q’ in ‘['Q'q yyyy]‘ matches directly with the data, whereas the ‘q’ matches with the numeric quarter (1 in the graphic above).

Once I figured out how to configure the format strings, all I had to do was make use of the newly available filtering option for dates.  After right clicking on “Date” and choosing “Filter”, now I can “choose a commonly used time period” and select a specific number of previous dates (in this case 25, because my data is a little bit old).

If I set my system clock forward a day and re-run the report, we see that the report automatically updates (in this case showing only three days worth of data):

Hopefully this is helpful!

Apr 012013

Over the holidays, Pentaho developed and released the 12 days of visualizations, demonstrating the platform’s flexibility and how easy it can be to add new ways to visualize data.  I of course installed them all, enabling them within Pentaho Analyzer in the Pentaho User Console, but I never did figure out how to get them to run in PDI or Instaview.  But today changed everything…

Thanks to my colleague Sal Scalisi, I discovered that those visualizations could be deployed for use in PDI and Instaview by unzipping them to this location:


After restarting PDI, the custom visualizations were all available in the Visualization perspective, as shown here:

I would like to re-articulate the disclaimer made along with the release of these visualizations though:

These visualizations are provided for demonstration purposes ONLY.  They are provided with no warranty or commitments regarding the quality of the sample and/or platform support.  Many of the underlying charting libraries used in these samples only work on a subset of browsers that are part of Pentaho’s official support matrix.

Note that you will be installing these at your own risk.  For support purposes, you may be asked to disable these plugins within PDI and Instaview in order to fully debug any issues you may be experiencing.

Also, please note that this is not an April Fool’s joke, despite my posting it on April 1st.  If you want one of those, see here.


Mar 292013

I’m fairly new to the business analytics space, so multidimensional expressions (MDX) are not even close to second nature to me.  This week however, I had a request for how to calculate a rolling average for the last five time periods for a particular measure.

Clearly, I was going to need the Avg() function, so I checked the reference for it:

Avg( Set_Expression [ , Numeric_Expression ] )

The numeric expression was the easy part – that would simply be the measure (in this case aggregated by count) I want the average of.  The set expression I was less sure about.  After a bit of trial and error (and some guidance from Bo Borland), I discovered that I needed to supply a range that identified the set to average based on the date, and that one great way to do that was to leverage “:” operator and the Lag() function.  In the end, my new calculated measure looked like this:

Avg({[Date].[Date].CurrentMember.Lag(5):[Date].[Date].CurrentMember}, [Measures].[Events])

In this case, the first occurrence of “Date” refers to the name of the dimension referenced by the cube, and the second instance of the word “Date” refers to the level that describes the actual date of the event.  Here are the relevant sections of the schema file I used:

<Cube name="Events" visible="true" cache="true" enabled="true">
 <Table name="fact_events">
 <DimensionUsage source="Date" name="Date" visible="true" foreignKey="date_key" highCardinality="false">
 <Measure name="Events" column="id" formatString="#,###;(#,###)" aggregator="count">
 <CalculatedMember name="Rolling Avg Events" formula="Avg({[Date].[Date].CurrentMember.Lag(5):[Date].[Date].CurrentMember}, [Measures].[Events])" dimension="Measures" visible="true">
<Dimension type="TimeDimension" visible="true" highCardinality="false" name="Date">
 <Hierarchy name="Date" visible="true" hasAll="true" primaryKey="date_id">
 <Table name="dim_date">
 <Level name="Year" visible="true" column="year_key" type="Integer" uniqueMembers="false" levelType="TimeYears" hideMemberIf="Never">
 <Level name="Quarter" visible="true" table="dim_date" column="quarter_key" nameColumn="quarter_desc" type="Integer" uniqueMembers="false" levelType="TimeQuarters" hideMemberIf="Never">
 <Level name="Month" visible="true" column="month_key" nameColumn="month_desc" type="Integer" uniqueMembers="false" levelType="TimeMonths" hideMemberIf="Never">
 <Level name="Date" visible="true" column="date_id" nameColumn="date_desc" type="Integer" uniqueMembers="false" levelType="TimeDays" hideMemberIf="Never">

As a result, I’m now able to show the rolling average number of events in a column line combo chart (or any number of our other visualizations) as well as a convenient tabular view, such as below.

One improvement I haven’t quite figured out yet is how to make the value supplied to the Lag() function dynamic instead of it always being 5.  But that’s a post for another day.  Hopefully this one is helpful for anyone looking to create a rolling average calculation for use in Analyzer!

Update, 4/9/2013 – there is an easier way, though it has slightly different results.  See my post about it here: http://robinsontechnology.com/blog/2013/04/05/average-trend-measures-in-pentaho-analyzer/.


Mar 182013

Let me just sum this up for you right up front: If you have multiple sheets in a spreadsheet with different fields in them, use multiple Excel input steps, not just one.  Read on for the long version.

This week I got some sample data in Excel format.  The file contained 17 sheets, each with a different set of fields in the header row (which was on the 2nd row by the way, not the first).  So I dragged an Excel input step into PDI, and went to work configuring it.  Based on the data model description (provided in a separate file), I got all 17 sheet names and then narrowed them down to the 3 that I needed for my purposes.  I set the start row to “1″ for each (apparently the header row is 0 based), and got all the fields.  There were over 800 of them.  I narrowed those down to just the 80 or so I needed, and continued to set up the dimension tables via the “Dimension Lookup / Update” step.  I was also sure to update the “Spread sheet type (engine)” to Excel 2007 XLSX (Apache POI) to match the file format I was provided.

Once I had it all set up, I ran the transformation, which completed successfully.  Unfortunately it only successfully pulled data out of one of the three worksheets.  Upon closer inspection, I found that the two sheets from which data was not retrieved had some formatting in them.  The cells within that sheet looked like this:


When I hovered over the warning sign, the message was “The number in this cell is formatted as text or preceded by an apostrophe.”  My first thought was that I needed to get a new data set that had no formatting, and I wasn’t sure where to go from here.  I asked around, and Sal Scalisi, a colleague of mine here at Pentaho produced a more simple example with the same formatting I had in which everything worked just fine.  The difference?  He only had one sheet in his Excel file.  He suggested that I use one Excel input for each sheet in the Excel workbook, and he was right on the money.  I rearranged the transformation, and everything works extremely well.  This is the end result:

Hopefully this tip is helpful to some of you out there dealing with Excel data!


Jan 022013

“If you want to succeed, you should strike out on new paths, rather than travel worn paths of accepted success.”  - John D. Rockefeller

Today I start a new chapter in my career, as a pre-sales solutions engineer for Pentaho, a leading provider of cost effective relational and big data analytics, and I’m very excited about it!

I have a lot to learn, and I look forward to understanding the problem space more completely, mastering the myriad of technologies involved, and communicating the value of the Pentaho platform to customers and prospects in coming years in order to bring valuable business insights into their organizations.

It’ll be a lot of work, but this is going to be fun!


Aug 142012

I am a solutions engineer for Alfresco Software covering the northeastern United States. As such, I believe it is important to clearly communicate the many ways in which our enterprise content management platform can be used in the financial services industry, which is very prevalent in the northeast. Therefore, I have done some research on current customer use cases as well as potential use cases discovered from a few sources.  Namely, the annual reports of some of the biggest banks in the world, and discussions with several colleagues internally and at some of our partner companies. One of those partners agreed to work with me to flesh out the bits and bytes to be able to demonstrate one of the use cases, and so of course I created a site in Alfresco’s cloud at http://my.alfresco.com and invited a few folks from the partner company to join and collaborate with me there.  I uploaded the document in which I was collecting use case research, and another that included details of the design for the first use case.

And therein lies the problem.

Both of those documents were already in their own folder within our U.S. Solutions Engineering Share site in our on-premise deployment of Alfresco, because I’ve been collaborating with some of my colleagues on them as well.  Now I’m collaborating with two different groups of people on the same documents, and I have to carefully manage updates in both directions.  This is why I need Alfresco cloud sync.

Fortunately, Alfresco’s product management, engineering, quality assurance, and release management teams are inching ever closer to releasing this capability to the world, upon which time I will gladly begin to take advantage the automatic nature of Alfresco cloud sync.  I’ll set up synchronization of the folder containing the two documents within our on-premise deployment of Alfresco with the document library of the site I created in the Alfresco cloud to collaborate with our partners.  And forever after, I can collaborate on this project seamlessly both internally and externally while maintaining changes and versions from both places.

Seriously, how cool is that?