Canadian Waste Management Provider

Enabling BI, ML & AI through an AWS Data Moderization

Project Details

Background

Our client was looking to migrate their existing MongoDB instance and set of 30+ PowerBI reports to an AWS environment. Query times in their legacy environment were very long (over 5 minutes), and they were doing all calculations for reporting on their desktops. The goal of the migration was to enable under 30 seconds to run PowerBI queries, and leverage the tool in a more meaningful way to enable customer and site-level reports, insights, marketing, and decision making.

With a lean IT team, and lack of cloud knowledge internally, the client sought out a local partner with deep data and AWS experience to co-create with their team to design the new environment, perform the migration, and train their team to sustain and enhance the new AWS environment.

A new Redshift and Glue environment was desired, along with a high-level assessment of effort to migrate critical reporting to a Redshift back-end. The project included a Customer View report use case, as part of the larger Scan at Dropoff PowerBI dashboard which allow for insights into customer behaviours at Self-Service recycling depots.

The project also included an additional stream to assess the client’s PowerBI landscape and provide high-level estimates for migration to the new architecture (t-shirt sizing). Data Elephant relied on co-development with Return-It as part of the process of continuous knowledge transfer and rapid delivery to bring their development and engineering team up to speed on AWS technologies to sustain and enhance the solution in the future.

As an existing AWS customer, the client was familiar with Infrastructure as a Service (IaaS) offerings – primarily EC2. These EC2s were used for all facets of the Information Technology stack, including the database and application layer. In meeting with AWS and Data Elephant, the technical team described the challenges with their current implementation and workflow to present data to their PowerBI reporting – including the usage of an intermediary MSSQL instance to prepare and transform the data.

Project

The proposed solution centered around the usage of AWS Glue which became the central location for Extract, Transform, and Load (ETL) operations from Return-It's databases (MongoDB and MSSQL). Further expanding on this architecture, it was recommended that the client investigate the usage of Redshift – specifically Redshift Serverless – for their data storage requirements. The use of Redshift Serverless would provide the required compute and performance required for highly intensive calculations, while allowing the service to spin down during off-peak period of non-usage and providing the client with a cost-optimized data warehousing solution.

Extensive AWS Glue transformations were required to extract and correlate data to other data sources, especially for the MongoDB tables which varied in schema and dimensions. Several built-in and custom transformations were required to align data from nested arrays and complex structures into a columnar format for AWS Redshift consumption. Throughout testing was performed to ensure data consistency which led to a strategy of truncating and reloading data on a nightly basis. This was due to numerous updates, deletes, and changes to not only current but historical records throughout the entire data set – making a delta change strategy not efficient or effective.

Once data was loaded into Redshift, the client’s PowerBI infrastructure was connected seamlessly to the Redshift Serverless instance, providing immediate transition for current reports. Computation and calculated fields were also moved to Redshift to make use of the elastic compute power available versus the limited compute available within PowerBI on an end-user's computer.

There was a high demand from Sales, Development and Customer Insights for data products and dashboards and multiple overlapping business and technology priorities requiring a common implementation plan.

This created the need for AWS architecture best practices and a review of their existing environment to revise it for innovation and scalability, while leveraging existing technology investments, and a plan to demonstrate how data modernization could improve bottom line.

Outcomes

The transition from the intermediary MSSQL solution to the Redshift Serverless enabled by AWS Glue reduced average query/execution time from 90-300 seconds down to 10-30 seconds (90% decrease). In addition to the query time reduction, intensive calculations and computations were migrated from the PowerBI reports to the Redshift Serverless instance, which further reduced query times and overall end-user experience by reducing latency and polling time browsing the reports.

Why AWS? The client decided to build off of the existing AWS infrastructure to achieve their vision for a scalable and extensible data platform with analytics and ML, as well as straightforward integrations to Power BI, and other internal business applications.

AWS Services Used & Proposed AWS S3, Redshift, Glue, MongoDB

Why Data Elephant? Data Elephant provided a local presence, flexible engagement model, willingness to invest in pre-sales workshops, and proven experience with both strategy and execution of AWS data projects.