I recently read an excellent article on the Adobe CMO.com site, "Don’t Disregard These 5 Big Details Before Integrating Big Data."

As I read, I couldn't help but think of the Kimball Subsystems of ETL. In fact, so much so that I was motivated to map the obstacles to subsystems (yes, I'm that geeky). This exercise yielded two interesting observations:

  1. In general, the obstacles and subsystems map well, so the subsystems might provide a useful framework for evaluating systems or platforms that aspire to address the obstacles.
  2. The obstacles did reveal a couple of missing subsystems (IMO).

Here's how I see the mapping…

Obstacle #1: Divergent Data Models

  • Subsystem: Data Profiling (subsystem 1) - Understand the data elements so that meaningful conformance decisions can be made.
  • Subsystem: Data Conformance (subsystem 8) - Transform and conform data elements to a common model.
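
To make the profiling-then-conforming idea concrete, here is a minimal sketch in Python. The sample records, the `profile_column` helper and the `CONFORM_MAP` lookup are all hypothetical, made up for illustration; real profiling tools report far more than this.

```python
from collections import Counter

def profile_column(rows, column):
    """Summarize one column: row count, null count, distinct count, top values."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "top": counts.most_common(3),
    }

# Hypothetical records from two sources that encode country differently.
records = [
    {"country": "US"},
    {"country": "USA"},
    {"country": "US"},
    {"country": None},
]
country_profile = profile_column(records, "country")

# The profile exposes the competing encodings; a conformance rule
# (assumed here, not taken from any real system) maps them to one value.
CONFORM_MAP = {"USA": "US"}
conformed = [CONFORM_MAP.get(r["country"], r["country"]) for r in records]
```

The point is the order of operations: the profile shows which variants exist, and only then can you decide what the conformed value should be.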

Obstacle #2: Data Quality

  • Subsystem: Data Profiling (subsystem 1) - Understand where quality problems exist
  • Subsystem: Data Cleansing System (subsystem 4) - Standardize, parse and cleanse data
  • Subsystem: De-duplication (subsystem 7) - Match data to identify duplicates, then "master" the duplicated data in a "best record" (aka "Survivorship")
  • Subsystem: Audit Dimension Creation (subsystem 6) - Append metadata to data which supports auditing and lineage
  • Subsystem: Lineage and Dependency (subsystem 29) - Track lineage of data; critical for chasing down and solving data quality issues
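
As a rough illustration of de-duplication with survivorship, the sketch below groups records by a match key and builds a "best record" per group. The matching rule (same email, ignoring case) and the survivorship rule (newest non-empty value wins per field) are assumptions chosen to keep the example small; real matching is usually fuzzier.

```python
from collections import defaultdict

def match_key(record):
    """Hypothetical matching rule: same email, ignoring case and whitespace."""
    return record["email"].strip().lower()

def best_record(duplicates):
    """Survivorship: for each field, keep the newest non-empty value."""
    # ISO-formatted date strings sort correctly as plain strings.
    ordered = sorted(duplicates, key=lambda r: r["updated"], reverse=True)
    fields = {field for record in duplicates for field in record}
    survivor = {}
    for field in fields:
        for record in ordered:
            if record.get(field) not in (None, ""):
                survivor[field] = record[field]
                break
    return survivor

def deduplicate(records):
    """Group records by match key and return one best record per group."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return [best_record(group) for group in groups.values()]

# Hypothetical customer rows; the first two describe the same person.
customers = [
    {"email": "Ann@Example.com", "phone": "555-0100", "updated": "2015-01-10"},
    {"email": "ann@example.com", "phone": None, "updated": "2015-03-02"},
    {"email": "bob@example.com", "phone": "555-0199", "updated": "2015-02-01"},
]
masters = deduplicate(customers)
```

Here the surviving record for Ann takes the email and date from the newer row but keeps the phone number from the older one, since the newer row has none.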

Obstacle #3: Data Relevance

  • Subsystem: Data Profiling (subsystem 1) - Understand which data elements are relevant.
  • Subsystem: Lineage and Dependency (subsystem 29) - Track those elements through to real consumption that provides real business value.
  • Subsystem: Data Expiration (*new subsystem*) - Data becomes less valuable as it ages; expire it automatically to balance cost versus value.
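
Since Data Expiration is a subsystem I'm proposing rather than one Kimball defines, here is one way it might look, sketched in Python. The data set names, retention windows and the `loaded_on` field are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical retention windows per data set (cost versus value trade-off).
RETENTION = {
    "clickstream": timedelta(days=90),
    "orders": timedelta(days=365 * 7),
}

def is_expired(record, today):
    """A record expires once its age exceeds its data set's retention window."""
    return today - record["loaded_on"] > RETENTION[record["dataset"]]

def expire(records, today):
    """Split records into those to keep and those to expire."""
    kept = [r for r in records if not is_expired(r, today)]
    expired = [r for r in records if is_expired(r, today)]
    return kept, expired

rows = [
    {"dataset": "clickstream", "loaded_on": date(2015, 1, 1)},
    {"dataset": "clickstream", "loaded_on": date(2015, 5, 1)},
    {"dataset": "orders", "loaded_on": date(2012, 1, 1)},
]
kept, expired = expire(rows, today=date(2015, 6, 1))
```

Keeping the windows in a per-data-set table is a deliberate choice in this sketch: cheap, fast-aging data such as clickstream can expire quickly while transactional data stays around much longer.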

Obstacle #4: Integration Speed

  • Subsystem: Data Profiling (subsystem 1) - Understand recency, frequency and relevance of data - no need to ingest a record in real-time that has already aged 2 months.
  • Subsystem: Change Data Capture (subsystem 2) - To support low latency processing, you often have to process only the true changes.
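
One common way to find "the true changes" is to compare row hashes between the previous and current extracts. The sketch below assumes full snapshots are available and that each row has an `id` key; both are assumptions for illustration, and log-based or timestamp-based capture are equally valid approaches.

```python
import hashlib

def row_hash(row):
    """Hash a row's contents in a stable field order."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def detect_changes(previous, current, key="id"):
    """Compare two snapshots and return the keys inserted, updated and deleted."""
    old = {r[key]: row_hash(r) for r in previous}
    new = {r[key]: row_hash(r) for r in current}
    return {
        "insert": sorted(k for k in new if k not in old),
        "update": sorted(k for k in new if k in old and new[k] != old[k]),
        "delete": sorted(k for k in old if k not in new),
    }

# Hypothetical snapshots of the same source table on consecutive days.
yesterday = [
    {"id": 1, "city": "Austin"},
    {"id": 2, "city": "Boston"},
    {"id": 3, "city": "Chicago"},
]
today = [
    {"id": 1, "city": "Austin"},
    {"id": 2, "city": "Denver"},
    {"id": 4, "city": "El Paso"},
]
changes = detect_changes(yesterday, today)
```

Row 1 is untouched and never reaches the downstream steps, which is exactly what keeps latency low.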

Obstacle #5: Self-Service Capabilities

  • Subsystem: Metadata Repository (subsystem 34) - Self-service by non-developers assumes a strong data model managed by an easy-to-use interface.
  • Subsystem: Workflow Configurator (*new subsystem*) - An easy-to-use interface that services common patterns in order to make it possible for non-developers to create and manage workflows. Striking the balance between simplicity and flexibility is the biggest challenge here.
  • Subsystem: Workflow Monitor (subsystem 27) - Defining workflows in a self-serve fashion is only the first step; you have to monitor them too.
