I have a little project going on right now. I’m trying to build out a demonstration of our software that highlights how the platform could be leveraged for lending risk management. One big problem with this is that I am not a bank, and therefore I do not have the necessary data to illustrate the use case (and Pentaho’s capabilities). Therefore, I must create it.
So I found some public data regarding mortgage applications, which is great, but it’s still not exactly the right kind of data, though it’s a good foundation. It’s missing key fields for the use case, like the application date, decision date, and closing date for the loan, not to mention payment status, home price, etc. You know, information that a real bank would have at its disposal.
I set out to sample the data I found (a CSV file), and loaded it into a staging table via a simple transformation:
In this transformation, I’m loading the fields from the CSV file, setting certain field values to null if they contain “NA” instead of a blank entry, sampling 500,000 records (out of the ~15M in the CSV file), and loading it into a staging table. Pretty straightforward.
Now I have a table with about 45 fields in it, but none of them with a specific application date field, which I want to add in. I found a tool called Spawner that can generate fields of various types within certain ranges, including dates. Here’s a quick screen shot of how I set up Spawner to generate an ApplicationDate field:
I do want the data to reflect seasonality through the months of a year to reflect the seasonal home buying patterns and the resulting mortgages, so I’m generating a months worth of data at a time. On the output tab I configured Spawner to generate a CSV to hold the number of January dates that I want reflected in the data.
Now the big question is: how do I get this disconnected set of application dates included as a column with the other 45 fields from the public data I found?
I’m not going to lie, I tried a lot of different steps in PDI before it dawned on me. If I’m allocating a certain percentage of the total number of records for each month I’m representing, there is no reason I can’t generate a sequence in both the public data set and the generated application dates, and leverage a stream lookup step in PDI to do the match. After this revelation, I added a sequence as my “Id” field in Spawner for my January dates, and I quickly modified the staging transformation to include a sequence. Now every record in the staging table has a unique identifier (a new field called “Id”):
Now in the transformation in which I load the fact table, I can use a stream lookup to load my randomly generated January dates:
The stream lookup is configured to match the Id field from the staging table with the Id field that I generated in the January CSV file, and retrieve the associated ApplicationDate for that specific record. After doing this process for the remaining months in the year (which is admittedly a little bit tedious), the data will reflect the seasonality I want it to show.
So hopefully this will save people some time when they try to figure out how to add new columns to an existing set of data. If you have suggestions on an even more expedient way to achieve the same result, please do leave a comment here or catch me on Twitter and let me know! Also, if you want to give PDI a spin, grab the trial download here and have fun!