How can we implement TDD (Test Driven Development) in
Business Intelligence or Data Warehousing projects that involve ETL
processes? Is there a way to implement
TDD when the project team is using an ETL tool, or when developers are writing
programs to extract, transform and load data into the data warehouse or a set
of target tables? Let us explore.
Whether you consider extract, transform and load as distinct
steps or all of them together as a single step, one thing to keep in mind is
the nature of TDD you can implement here.
It is not going to be the typical JUnit approach of using assertions, because all your data lives in schemas or tables and you are not going to embed input and output data in your test scripts. If you do embed input and output data in your assertions and write pure JUnit tests, you can test only the 'T' of ETL, that is, the transform. So, what can you do to include E and L, extract and load?
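To make the distinction concrete, here is a minimal sketch, written in Python as an analog of the JUnit approach, of a transform-only test with the input and expected output embedded in the test itself. The transform function and its rules are hypothetical and only illustrate why this style covers the 'T' but not the 'E' or 'L'.

```python
# A minimal sketch of testing only the 'T' in isolation, with input and
# expected output embedded in the test. The transform and its rules are
# hypothetical, used only to illustrate the idea.

def transform_customer(row):
    """Hypothetical transform: derive a full name and normalise the country code."""
    return {
        "full_name": f"{row['first_name']} {row['last_name']}".strip(),
        "country": row["country"].upper(),
    }

def test_transform_customer():
    source_row = {"first_name": "Jane", "last_name": "Doe", "country": "us"}
    expected = {"full_name": "Jane Doe", "country": "US"}
    assert transform_customer(source_row) == expected
```

A test like this runs without any database at all, which is exactly why it says nothing about whether the data was extracted or loaded correctly.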
Think of flight operations at an airport. Before every departure there are tests or checks to ensure that the aircraft satisfies a set of pre-conditions: verification of a long list of technical parameters, the number of passengers, and inventory such as food and other supplies. On arrival there is another set of checks to ensure that all passengers and crew members reached the destination, along with routine verification of the aircraft's technical parameters against a set of post-conditions.
Now, apply this analogy to implementing tests in an ETL context. The first step is to articulate the pre-conditions and write tests to assert or verify each of them. If you know all the pre-conditions and are checking them one by one by hand, the obvious next step is to automate all of them. That saves time and eliminates manual errors.
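A minimal sketch of such automated pre-condition checks follows. The table names, columns and SQLite connection are assumptions for illustration; in practice these would point at your source systems or staging area.

```python
# A minimal sketch of automated pre-condition checks against source tables.
# Table names, columns and the connection are assumptions for illustration.
import sqlite3  # stand-in for your warehouse or staging connection

def check_preconditions(conn):
    failures = []
    cur = conn.cursor()

    # Pre-condition 1: the source table must not be empty.
    if cur.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0] == 0:
        failures.append("src_orders is empty")

    # Pre-condition 2: business keys must not be null.
    if cur.execute(
        "SELECT COUNT(*) FROM src_orders WHERE order_id IS NULL"
    ).fetchone()[0] > 0:
        failures.append("src_orders has NULL order_id values")

    # Pre-condition 3: the load window must contain recent data.
    if cur.execute(
        "SELECT COUNT(*) FROM src_orders WHERE order_date >= date('now', '-1 day')"
    ).fetchone()[0] == 0:
        failures.append("no orders found for the current load window")

    return failures

# Run the checks before kicking off the ETL job; abort if any fail.
# failures = check_preconditions(sqlite3.connect("warehouse.db"))
```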
Let us assume that your source tables satisfy all pre-conditions and you have initiated the ETL process. When the ETL process completes, that is, when the flight has reached its destination, run data quality checks. This is a high-level health check: scripts that verify row counts, data integrity, aggregates (sum, average) and so on. You can call it a sanity test.
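Here is a minimal sketch of such a sanity test, reconciling row counts, referential integrity and a simple aggregate between source and target. Again, the table and column names (src_orders, fact_orders, dim_customer) are assumptions for illustration only.

```python
# A minimal sketch of a post-load sanity test: row counts, referential
# integrity and an aggregate reconciliation between source and target.

def sanity_checks(conn):
    cur = conn.cursor()
    failures = []

    # Row count: every source order should land in the target fact table.
    src = cur.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
    tgt = cur.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
    if src != tgt:
        failures.append(f"row count mismatch: source={src}, target={tgt}")

    # Integrity: every fact row should resolve to a customer dimension row.
    orphans = cur.execute("""
        SELECT COUNT(*) FROM fact_orders f
        LEFT JOIN dim_customer d ON f.customer_key = d.customer_key
        WHERE d.customer_key IS NULL
    """).fetchone()[0]
    if orphans:
        failures.append(f"{orphans} fact rows have no matching customer")

    # Aggregates: total amount should reconcile between source and target.
    src_sum = cur.execute("SELECT COALESCE(SUM(amount), 0) FROM src_orders").fetchone()[0]
    tgt_sum = cur.execute("SELECT COALESCE(SUM(amount), 0) FROM fact_orders").fetchone()[0]
    if abs(src_sum - tgt_sum) > 0.01:
        failures.append(f"amount mismatch: source={src_sum}, target={tgt_sum}")

    return failures
```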
The next step is verifying all post-conditions. These are the next level of tests, more detailed than the sanity checks. Many of them will depend on the pre-conditions and on the data values present in the source tables. One way to handle this is to have a meta-table or an intermediate table, along with scripts that compute and store the expected results corresponding to the post-conditions. Using this table, run another set of scripts to verify whether your ETL process populated the target tables with the right data. The meta-table or intermediate table holds your expected results; the target tables hold the actual results. Depending on your project's test strategy, design the meta-table or intermediate table to hold as many rows or pieces of data as you need. For complex projects, you may need multiple meta-tables or intermediate tables.
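One possible shape for such a meta-table is sketched below: each row names a check, a query to run against the target tables, and the expected value computed from the source data. This layout (check_name, sql_expression, expected_value) is just one design choice, not a prescribed one.

```python
# A minimal sketch of verifying post-conditions with an expected-results
# meta-table. The meta-table name and columns are assumptions for illustration.

def verify_postconditions(conn):
    cur = conn.cursor()
    results = []
    # Each meta-table row: a check name, a query against the target tables,
    # and the expected value computed earlier from the source data.
    rows = cur.execute(
        "SELECT check_name, sql_expression, expected_value FROM etl_expected_results"
    ).fetchall()
    for name, sql_expr, expected in rows:
        actual = conn.execute(sql_expr).fetchone()[0]
        results.append((name, expected, actual, str(actual) == str(expected)))
    return results
```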
If you are verifying post-conditions manually, convert those checks into scripts. Store the results of your scripts in a table or a log file, and write a script to analyze the errors. This can save you a lot of time and money.
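A short sketch of that last step might look like the following, recording each check's outcome in a log table and summarising the failures afterwards; the etl_test_log table and its columns are assumptions for illustration.

```python
# A minimal sketch of logging check results and summarising failures.

def record_results(conn, run_id, results):
    # results: list of (check_name, expected, actual, passed) tuples.
    conn.executemany(
        "INSERT INTO etl_test_log (run_id, check_name, expected, actual, passed) "
        "VALUES (?, ?, ?, ?, ?)",
        [(run_id, n, str(e), str(a), int(p)) for n, e, a, p in results],
    )
    conn.commit()

def summarise_failures(conn, run_id):
    # One query answers "what failed in this run?" instead of a manual trawl.
    return conn.execute(
        "SELECT check_name, expected, actual FROM etl_test_log "
        "WHERE run_id = ? AND passed = 0",
        (run_id,),
    ).fetchall()
```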
Doing all this in small steps, incrementally and iteratively, will help you adopt test driven development. If you decide to do it all as one big chunk somewhere in the project, mostly at the end, it becomes the classical phase-driven approach. Try TDD!
Are there other ways to implement TDD in ETL projects? Let us discuss.
Related Posts: Tips on Embracing Agility in BI Projects