Product Release Overview
Our newest product enhancements are here! We are always improving and listening to your feedback, and these are some of the coolest updates to Quaero CDP’s staging environment. These performance, infrastructure, reliability and data file format enhancements are aimed at making it more efficient and effective to ingest and analyze huge volumes of offline and online data at a faster rate. Even better, some of our customers are already realizing cost and operational benefits from these new features.
The spark that was missing!
Quaero CDP staging moves from ‘Oozie’ to big-data-friendly ‘Spark’
Data engineers using Quaero CDP can now ingest files from HDFS, S3 (list all the formats) in various input formats into the Quaero staging layer in a fast and scalable way. Spark is widely used by the data engineering community for all the right reasons: it saves a tremendous amount of time. In our case, ingestion time dropped from 3 hours to under 3 minutes, a reduction of at least 98.33%.
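For readers who want to check the arithmetic, here is a minimal sketch of how the 98.33% figure follows from the two durations quoted above. The function name is purely illustrative and not part of Quaero CDP.

```python
def percent_reduction(before_minutes: float, after_minutes: float) -> float:
    """Return the percentage by which a duration shrank."""
    return (before_minutes - after_minutes) / before_minutes * 100


# 3 hours is 180 minutes; the new ingestion time is taken as 3 minutes.
reduction = percent_reduction(before_minutes=3 * 60, after_minutes=3)
print(f"{reduction:.2f}%")
```

Because the new ingestion time is "under 3 minutes", the actual reduction would be at least this value.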
New File Format Enhancement
Efficient data compression with Apache Parquet
The Quaero CDP staging environment now supports Parquet, popularly known for its clean structure and its efficiency in terms of storage and performance. Parquet is compatible with most of the data processing frameworks in the Hadoop ecosystem and can handle complex data in bulk.
Based on our recent implementations, some of the areas where Parquet shows clear gains over CSV/JSON/ORC are:
- Batch size – Allows splitting and re-combining of files at run time
- Subset data scanning – Scan only the columns that matter to you, resulting in limited table scans, faster scanning and cost benefits.
- Better computation and aggregation – Data in Parquet format complements Spark for complex logic: updated aggregations stored in an intermediate format can be written to a new bucket in Parquet format.
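To make the subset-scanning point concrete, here is a toy illustration of a column-oriented layout in plain Python. It is not Parquet itself (real Parquet adds encoding, compression and row groups), and the column names and values are made up for the example.

```python
# Hypothetical hit-level rows; the column names are illustrative only.
rows = [
    {"user_id": 1, "page": "/home", "revenue": 0.0},
    {"user_id": 2, "page": "/cart", "revenue": 19.5},
    {"user_id": 3, "page": "/home", "revenue": 5.0},
]


def to_columnar(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columnar(rows)

# Row layout: every field of every row is read just to sum one column.
row_values_touched = sum(len(record) for record in rows)

# Columnar layout: only the 'revenue' column is read.
revenue = columns["revenue"]
column_values_touched = len(revenue)

total_revenue = sum(revenue)
print(row_values_touched, column_values_touched, total_revenue)
```

The gap between the two counts grows with the number of columns, which is where the faster scans and cost benefits come from on wide tables.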
Empowering data users with full refresh/incremental refresh options
A daily full refresh of the staging environment was not only more time consuming, but also made poor use of storage space due to heavy duplication of data. With an incremental refresh option, the staging environment is more than just a data-dumping yard, and the user can now draw insights and intelligence from the ingested data well before the conformance level. Incremental refresh helps data engineering teams save significant cost and can optimize storage space by X%. With the option to choose full or incremental refresh, enterprises now have the flexibility to sync data as per their business requirements.
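The difference between the two modes can be sketched as follows. This is only one common way to implement incremental refresh, not a description of Quaero CDP's internals: it assumes each source row carries a unique `id` and an `updated_at` timestamp, both of which are hypothetical field names.

```python
from datetime import datetime


def full_refresh(source):
    """Reload every source row, regardless of what was staged before."""
    return {row["id"]: row for row in source}


def incremental_refresh(staged, source, last_sync):
    """Upsert only rows changed after last_sync.

    Returns the updated staging table and the number of rows loaded.
    """
    changed = [row for row in source if row["updated_at"] > last_sync]
    merged = dict(staged)
    for row in changed:
        merged[row["id"]] = row
    return merged, len(changed)


source = [
    {"id": 1, "updated_at": datetime(2020, 1, 1), "value": "a"},
    {"id": 2, "updated_at": datetime(2020, 1, 2), "value": "b"},
]
staged = full_refresh(source)

# One new row arrives; only that row is loaded on the next sync.
source.append({"id": 3, "updated_at": datetime(2020, 1, 3), "value": "c"})
staged, rows_loaded = incremental_refresh(
    staged, source, last_sync=datetime(2020, 1, 2)
)
print(rows_loaded, len(staged))
```

In this sketch the incremental run reads one row instead of three, while the staged table still ends up complete.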
<screenshot to be included>
Tightening the workflow pipeline and eliminating data duplication
The hit-level (user-level) data from Adobe Analytics is typically ingested in a format with no headers. Earlier, this meant first staging the data, cleaning it, and creating a lineage to rename and transform it only at the conformance level. In a sense, the ingestion level was more of a data-dumping ground, and the data only started making sense at the conformance layer. This meant duplicated effort and data, and it cost a huge amount of time, as there were multiple workflows involving manual data creation and mapping of columns.
The new staging feature update allows data renaming and type transformation right at the ingestion level, resulting in better data accuracy and doing away with multiple linkages. A clean view of data, powered by automation right at ingestion, allows better data modelling, saves time (~2 hours due to a shorter workflow pipeline), saves space and is far less prone to the errors that come with manual mapping and updates.
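As a rough sketch of what renaming and type transformation at ingestion can look like, the example below applies a positional schema to headerless, tab-delimited rows. The column names, types and delimiter are assumptions made for illustration, not the actual Adobe Analytics layout or the Quaero CDP configuration.

```python
import csv
import io

# Hypothetical schema: (column name, type converter), in file order.
SCHEMA = [
    ("visitor_id", str),
    ("hit_time", int),
    ("page_url", str),
    ("revenue", float),
]


def apply_schema(lines, schema=SCHEMA, delimiter="\t"):
    """Name and cast the fields of headerless delimited rows."""
    reader = csv.reader(lines, delimiter=delimiter)
    for line_number, fields in enumerate(reader, start=1):
        if len(fields) != len(schema):
            raise ValueError(
                f"line {line_number}: expected {len(schema)} fields, "
                f"got {len(fields)}"
            )
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


# Two headerless rows standing in for a raw hit-level file.
raw = io.StringIO("v123\t1580000000\t/home\t0\nv456\t1580000060\t/cart\t19.5\n")
records = list(apply_schema(raw))
print(records[1])
```

Rejecting rows with the wrong field count at this stage is one way such a step can catch mapping errors before they reach the conformance layer.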
We are currently working on advanced transformations that would allow data users to take specific IDs from any table anywhere and map them to their corresponding values automatically.
Automated control file check for 100% accuracy of ingested data
Earlier, ingestion of data was followed by manual QC validation, in which a data analyst matched the number of rows ingested against the source file. This was time consuming and was not a smart use of the bandwidth of expensive data science resources.
The control file check ensures that when an upstream system is generating an HDFS data dump in a multi-part format, Quaero Staging doesn’t start to read the first file out of N until the upstream system has finished writing all of them. Now, with the automated control file check, corrupt file detection is done right before the file is ingested, which frees up the analyst team to pursue more business-critical and impactful work.
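A minimal sketch of such a check is shown below, using a local directory as a stand-in for HDFS. The control file name (`_control.json`) and its layout (a `parts` mapping from part file name to expected row count) are assumptions for illustration; the actual Quaero control file format is not described here.

```python
import json
import tempfile
from pathlib import Path


def ready_to_ingest(directory, control_name="_control.json"):
    """Return (ok, reason); ok only if every listed part matches its row count."""
    directory = Path(directory)
    control_path = directory / control_name
    if not control_path.exists():
        # Upstream writes the control file last, so its absence means "not done".
        return False, "control file missing: upstream write not finished"
    expected = json.loads(control_path.read_text())["parts"]
    for part_name, expected_rows in expected.items():
        part_path = directory / part_name
        if not part_path.exists():
            return False, f"{part_name} missing"
        with part_path.open() as handle:
            actual_rows = sum(1 for _ in handle)
        if actual_rows != expected_rows:
            return False, f"{part_name}: expected {expected_rows} rows, found {actual_rows}"
    return True, "all parts present with expected row counts"


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "part-00000").write_text("a\nb\n")
    before = ready_to_ingest(tmp_path)  # control file not written yet
    (tmp_path / "_control.json").write_text(
        json.dumps({"parts": {"part-00000": 2}})
    )
    after = ready_to_ingest(tmp_path)  # control file present, counts match

print(before)
print(after)
```

The same comparison of expected and actual row counts is what replaces the manual QC step in this sketch.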
Reach out to the Quaero Product team to learn more about these features!