I recently read an excellent article on the Adobe CMO.com site, "Don’t Disregard These 5 Big Details Before Integrating Big Data."
As I read, I couldn't help but think of the Kimball Subsystems of ETL. In fact, the parallels were strong enough that I was motivated to map the obstacles to subsystems (yes, I'm that geeky). This exercise yielded two interesting observations:
- In general, the obstacles and subsystems map well, so the subsystems might provide a useful framework for evaluating systems or platforms that aspire to address the obstacles.
- The obstacles did reveal a couple of missing subsystems (IMO).
Here's how I see the mapping…
Obstacle #1: Divergent Data Models
- Subsystem: Data Profiling (subsystem 1) - Understand the data elements so that meaningful conformance decisions can be made.
- Subsystem: Data Conformance (subsystem 8) - Transform and conform data elements to a common model.
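To make the profiling-then-conforming idea concrete, here is a minimal Python sketch. It is not taken from Kimball or the CMO.com article; the field names (`cust_email`, `ctry`) and the `CRM_TO_COMMON` mapping are hypothetical, and a real profiler would look at far more than nulls and distinct values.

```python
def profile(rows):
    """Return, per column, the null/empty count and the distinct value count."""
    stats = {}
    for row in rows:
        for col, value in row.items():
            s = stats.setdefault(col, {"nulls": 0, "values": set()})
            if value is None or value == "":
                s["nulls"] += 1
            else:
                s["values"].add(value)
    return {
        col: {"nulls": s["nulls"], "distinct": len(s["values"])}
        for col, s in stats.items()
    }


# Hypothetical mapping from one source's field names to the common model.
CRM_TO_COMMON = {"cust_email": "email", "ctry": "country"}


def conform(row, mapping):
    """Rename source fields to the common model; unmapped fields pass through."""
    return {mapping.get(name, name): value for name, value in row.items()}


rows = [
    {"cust_email": "ann@example.com", "ctry": "US"},
    {"cust_email": None, "ctry": "US"},
]
print(profile(rows))
print([conform(r, CRM_TO_COMMON) for r in rows])
```

The profile output is what would inform the mapping: a column that is mostly null or has a single distinct value is probably not worth conforming at all.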
Obstacle #2: Data Quality
- Subsystem: Data Profiling (subsystem 1) - Understand where quality problems exist.
- Subsystem: Data Cleansing System (subsystem 4) - Standardize, parse and cleanse data.
- Subsystem: De-duplication (subsystem 7) - Match data to identify duplicates, then "master" the duplicated data in a "best record" (aka "Survivorship").
- Subsystem: Audit Dimension Creation (subsystem 6) - Append metadata to the data to support auditing and lineage.
- Subsystem: Lineage and Dependency (subsystem 29) - Track lineage of data; critical for chasing down and solving data quality issues.
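The match-then-survive step in de-duplication can be sketched roughly as follows. This is one simple survivorship rule among many (newest non-empty value wins per field), and it assumes exact matching on a normalized email; the `email`, `phone` and `updated` fields are hypothetical, and real matching is usually fuzzier.

```python
def match_key(record):
    """Assumed matching rule: records with the same normalized email are duplicates."""
    return record["email"].strip().lower()


def survive(records):
    """Build a 'best record': for each field, the newest non-empty value wins."""
    newest_first = sorted(records, key=lambda r: r["updated"], reverse=True)
    best = {}
    for rec in newest_first:
        for field, value in rec.items():
            if field not in best and value not in (None, ""):
                best[field] = value
    return best


def deduplicate(records):
    """Group records by match key, then collapse each group to its survivor."""
    groups = {}
    for rec in records:
        groups.setdefault(match_key(rec), []).append(rec)
    return [survive(group) for group in groups.values()]


records = [
    {"email": "ann@example.com", "phone": "", "updated": "2014-01-01"},
    {"email": "Ann@Example.com", "phone": "555-0100", "updated": "2013-06-01"},
    {"email": "bob@example.com", "phone": None, "updated": "2014-02-01"},
]
for best in deduplicate(records):
    print(best)
```

Note that the survivor for Ann takes its phone number from the older record, because the newer one left that field empty.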
Obstacle #3: Data Relevance
- Subsystem: Data Profiling (subsystem 1) - Understand which data elements are relevant.
- Subsystem: Lineage and Dependency (subsystem 29) - Track those elements through to real consumption that provides real business value.
- Subsystem: Data Expiration (*new subsystem*) - Data becomes less valuable as it ages; expire it automatically to balance cost versus value.
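Since Data Expiration is a proposed subsystem rather than one of Kimball's, there is no reference design to point to. As a rough illustration only, an expiration pass might apply a per-dataset retention window; the dataset kinds and retention periods below are made-up assumptions.

```python
from datetime import date, timedelta

# Hypothetical retention policy: days to keep each kind of data.
RETENTION_DAYS = {"clickstream": 90, "orders": 365 * 7}


def is_expired(record, today):
    """A record expires once its age exceeds the retention window for its kind."""
    limit = timedelta(days=RETENTION_DAYS[record["kind"]])
    return today - record["created"] > limit


def expire(records, today):
    """Return the records to keep and the number of records dropped."""
    kept = [r for r in records if not is_expired(r, today)]
    return kept, len(records) - len(kept)


records = [
    {"kind": "clickstream", "created": date(2014, 1, 1)},
    {"kind": "clickstream", "created": date(2014, 5, 1)},
    {"kind": "orders", "created": date(2010, 1, 1)},
]
kept, dropped = expire(records, today=date(2014, 6, 1))
print(len(kept), "kept,", dropped, "expired")
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the policy easy to test and to replay for a past date.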
Obstacle #4: Integration Speed
- Subsystem: Data Profiling (subsystem 1) - Understand recency, frequency and relevance of data; there is no need to ingest a record in real time if it has already aged 2 months.
- Subsystem: Change Data Capture (subsystem 2) - To support low latency processing, you often have to process only the true changes.
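One common way to find "only the true changes" when the source offers no change log is to compare row hashes between runs. The sketch below assumes each row has an `id` key and that the hashes from the previous run have been stored somewhere; both are illustrative assumptions, and log-based or timestamp-based capture are equally valid alternatives.

```python
import hashlib


def row_hash(row):
    """Hash a row's contents in a stable field order."""
    payload = "|".join(f"{name}={row[name]}" for name in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def capture_changes(previous, current, key="id"):
    """Compare current rows against the previous run's {key: hash} state.

    Returns (inserts, updates, deleted_keys, new_state).
    """
    inserts, updates, new_state = [], [], {}
    for row in current:
        digest = row_hash(row)
        new_state[row[key]] = digest
        if row[key] not in previous:
            inserts.append(row)
        elif previous[row[key]] != digest:
            updates.append(row)
    deleted = [k for k in previous if k not in new_state]
    return inserts, updates, deleted, new_state


# First run: everything is an insert; keep the returned state for next time.
_, _, _, state = capture_changes({}, [
    {"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"},
])
# Second run: only the true changes come back.
inserts, updates, deleted, state = capture_changes(state, [
    {"id": 1, "name": "a"}, {"id": 2, "name": "B"}, {"id": 4, "name": "d"},
])
print(inserts, updates, deleted)
```

Unchanged rows (here, `id` 1) never reach the downstream steps, which is where the latency savings come from.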
Obstacle #5: Self-Service Capabilities
- Subsystem: Metadata Repository (subsystem 34) - Self-service by non-developers assumes a strong data model managed by an easy-to-use interface.
- Subsystem: Workflow Configurator (*new subsystem*) - An easy-to-use interface that services common patterns in order to make it possible for non-developers to create and manage workflows. Striking the balance between simplicity and flexibility is the biggest challenge here.
- Subsystem: Workflow Monitor (subsystem 27) - Defining workflows in a self-serve fashion is only the first step; you have to monitor them too.
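As a loose sketch of how a configurator and a monitor could fit together, a workflow might be declared as plain data, validated against a small catalog of supported step types, and then run with each step's status recorded. The step catalog, the workflow shape and the handler functions here are all hypothetical.

```python
# Hypothetical catalog of step types a non-developer can pick from.
ALLOWED_STEPS = {"extract", "cleanse", "deduplicate", "load"}


def validate(workflow):
    """Return a list of problems with a declarative workflow definition."""
    errors = []
    if not workflow.get("name"):
        errors.append("workflow needs a name")
    for i, step in enumerate(workflow.get("steps", [])):
        if step.get("type") not in ALLOWED_STEPS:
            errors.append(f"step {i}: unknown type {step.get('type')!r}")
    return errors


def run(workflow, handlers):
    """Run steps in order, recording a status per step for a monitor to read."""
    log = []
    for step in workflow["steps"]:
        try:
            handlers[step["type"]](step)
            log.append((step["type"], "ok"))
        except Exception as exc:
            log.append((step["type"], f"failed: {exc}"))
            break  # stop at the first failure
    return log


def failing_load(step):
    raise RuntimeError("target unavailable")


workflow = {"name": "nightly", "steps": [{"type": "extract"}, {"type": "load"}]}
print(validate(workflow))
print(run(workflow, {"extract": lambda step: None, "load": failing_load}))
```

Restricting users to a fixed catalog is the "simplicity" side of the trade-off; every step type added to it buys flexibility at the cost of a more complicated interface.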