Mar 292013

I’m fairly new to the business analytics space, so multidimensional expressions (MDX) are not even close to second nature to me.  This week however, I had a request for how to calculate a rolling average for the last five time periods for a particular measure.

Clearly, I was going to need the Avg() function, so I checked the reference for it:

Avg( Set_Expression [ , Numeric_Expression ] )

The numeric expression was the easy part – that would simply be the measure (in this case aggregated by count) I want the average of.  The set expression I was less sure about.  After a bit of trial and error (and some guidance from Bo Borland), I discovered that I needed to supply a range that identified the set to average based on the date, and that one great way to do that was to leverage “:” operator and the Lag() function.  In the end, my new calculated measure looked like this:

Avg({[Date].[Date].CurrentMember.Lag(5):[Date].[Date].CurrentMember}, [Measures].[Events])

In this case, the first occurrence of “Date” refers to the name of the dimension referenced by the cube, and the second instance of the word “Date” refers to the level that describes the actual date of the event.  Here are the relevant sections of the schema file I used:

<Cube name="Events" visible="true" cache="true" enabled="true">
 <Table name="fact_events">
 <DimensionUsage source="Date" name="Date" visible="true" foreignKey="date_key" highCardinality="false">
 <Measure name="Events" column="id" formatString="#,###;(#,###)" aggregator="count">
 <CalculatedMember name="Rolling Avg Events" formula="Avg({[Date].[Date].CurrentMember.Lag(5):[Date].[Date].CurrentMember}, [Measures].[Events])" dimension="Measures" visible="true">
<Dimension type="TimeDimension" visible="true" highCardinality="false" name="Date">
 <Hierarchy name="Date" visible="true" hasAll="true" primaryKey="date_id">
 <Table name="dim_date">
 <Level name="Year" visible="true" column="year_key" type="Integer" uniqueMembers="false" levelType="TimeYears" hideMemberIf="Never">
 <Level name="Quarter" visible="true" table="dim_date" column="quarter_key" nameColumn="quarter_desc" type="Integer" uniqueMembers="false" levelType="TimeQuarters" hideMemberIf="Never">
 <Level name="Month" visible="true" column="month_key" nameColumn="month_desc" type="Integer" uniqueMembers="false" levelType="TimeMonths" hideMemberIf="Never">
 <Level name="Date" visible="true" column="date_id" nameColumn="date_desc" type="Integer" uniqueMembers="false" levelType="TimeDays" hideMemberIf="Never">

As a result, I’m now able to show the rolling average number of events in a column line combo chart (or any number of our other visualizations) as well as a convenient tabular view, such as below.

One improvement I haven’t quite figured out yet is how to make the value supplied to the Lag() function dynamic instead of it always being 5.  But that’s a post for another day.  Hopefully this one is helpful for anyone looking to create a rolling average calculation for use in Analyzer!

Update, 4/9/2013 – there is an easier way, though it has slightly different results.  See my post about it here:


Mar 182013

Let me just sum this up for you right up front: If you have multiple sheets in a spreadsheet with different fields in them, use multiple Excel input steps, not just one.  Read on for the long version.

This week I got some sample data in Excel format.  The file contained 17 sheets, each with a different set of fields in the header row (which was on the 2nd row by the way, not the first).  So I dragged an Excel input step into PDI, and went to work configuring it.  Based on the data model description (provided in a separate file), I got all 17 sheet names and then narrowed them down to the 3 that I needed for my purposes.  I set the start row to “1″ for each (apparently the header row is 0 based), and got all the fields.  There were over 800 of them.  I narrowed those down to just the 80 or so I needed, and continued to set up the dimension tables via the “Dimension Lookup / Update” step.  I was also sure to update the “Spread sheet type (engine)” to Excel 2007 XLSX (Apache POI) to match the file format I was provided.

Once I had it all set up, I ran the transformation, which completed successfully.  Unfortunately it only successfully pulled data out of one of the three worksheets.  Upon closer inspection, I found that the two sheets from which data was not retrieved had some formatting in them.  The cells within that sheet looked like this:


When I hovered over the warning sign, the message was “The number in this cell is formatted as text or preceded by an apostrophe.”  My first thought was that I needed to get a new data set that had no formatting, and I wasn’t sure where to go from here.  I asked around, and Sal Scalisi, a colleague of mine here at Pentaho produced a more simple example with the same formatting I had in which everything worked just fine.  The difference?  He only had one sheet in his Excel file.  He suggested that I use one Excel input for each sheet in the Excel workbook, and he was right on the money.  I rearranged the transformation, and everything works extremely well.  This is the end result:

Hopefully this tip is helpful to some of you out there dealing with Excel data!