One of the more controversial proposals for the new process methodology is a Data Exploration phase. It would sit just after Data Preparation and just before Modeling.
It’s controversial for two reasons. First, there are only a limited number of phases that we think are sensible (otherwise we end up with 34 steps that no one can remember). Second, although it intuitively makes sense to have this phase, there isn’t a precise definition of what it should actually be. To give you an idea of the controversy, our consortium meetings were pretty evenly split between people who wanted the phase and people who said “over my dead body”.
So here’s my take on why we need it…
A major part of data mining involves our interpretation of the data before we subject it to our data mining algorithms. We’ve already understood the raw data (Data Understanding) and massaged it into an acceptable shape (Data Preparation), but now we start to REALLY look at it. Of course we use statistical measures, we use graphical tools, we use our heads! Sometimes the project stops there – we discover the fault in the data that prevents us from doing anything further, or we spot the trend or linkage that means we don’t even need to build the predictive models. More often we will go back to the raw data looking for some support, or we will take the plunge and hit it with our model of choice. But whatever we do, we can’t ignore this intermediate exploration step…
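To make the "statistical measures" part a little more concrete, here is a minimal sketch of the kind of quick look I have in mind, in plain Python. Everything in it is an assumption for illustration: the column names `tenure_months` and `monthly_spend`, the made-up values, and the helper functions `summarize` and `pearson` are hypothetical and not part of any methodology or tool.

```python
import math
import statistics


def summarize(values):
    """Descriptive statistics for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),  # needs at least two present values
        "min": min(present),
        "max": max(present),
    }


def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present.

    Assumes neither column is constant (otherwise the denominator is zero).
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mean_x = statistics.mean(x for x, _ in pairs)
    mean_y = statistics.mean(y for _, y in pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x, _ in pairs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for _, y in pairs))
    return cov / (spread_x * spread_y)


# Hypothetical, already-prepared columns (illustrative values only).
tenure_months = [3, 12, None, 24, 36, 48]
monthly_spend = [20.0, 25.0, 22.0, 31.0, 38.0, 45.0]

print(summarize(tenure_months))   # a missing value shows up in the "missing" count
print(summarize(monthly_spend))
print(pearson(tenure_months, monthly_spend))  # a strong value hints at a linkage
```

A high missing count is the sort of fault that can stop a project at this stage, and a strong correlation is the sort of linkage that might make a predictive model unnecessary. Neither replaces actually plotting the data and thinking about it.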
I agree. The application of data exploration tools is a key step in gaining a better understanding of one’s data.
Comment by porticobi — September 10, 2007 @ 4:59 pm