Let’s say you’ve made a commitment to data quality. You’ve implemented a strong data governance program to ensure consistency and accuracy. You have adopted standards and practices that add rigor and discipline to how data is entered, modified, and otherwise used. You’ve conducted a careful examination of legacy data, identified existing issues, and corrected them. You can state with 100% confidence that your data is clean, and you might be right; I applaud you for taking these important steps!
But even with pristine source data, there is some effort involved in analyzing it and gleaning insights, and I’m not just talking about the work involved in putting together dashboards and visualizations. The qualities that make for a good data model in an application – any system that collects some of your source data – are quite different from what’s required to perform analysis. In fact, many of these qualities are diametrically opposed.
Think about your laundry for a moment. Suppose you’ve just emptied your dryer and have a nice, clean basket full of t-shirts to put away. You are probably going to do one of two things:
- You fold each shirt and stack it neatly in a drawer, or
- You place each shirt on a hanger and hang it in a row in your closet.
There are pros and cons to each approach. When you stack shirts in a drawer, you can fit a ton of them in a small space. My t-shirt drawer has three stacks. Of course, when I open it, I only see the three shirts on top; I have to shuffle through the stacks if I want to grab a specific shirt. If I instead hang the shirts in my closet, I can quickly find what I’m looking for. The shirts hang sideways, and I can easily identify each one by the color of its sleeve. But my closet, relatively modest as far as closets go, is huge compared to my dresser drawer, which takes up just a few cubic feet.
This is the age-old tension between how data is modeled for applications versus the ideal structure for performing analysis. Application databases are like dresser drawers. They’re optimized for transactions, which in this context means constantly adding, editing, and deleting small amounts of data. Every time you open your customer relationship management tool and add a new contact, the system performs a transaction that writes a new record, like adding a new t-shirt to your pile. The goal is throughput: a high volume of small queries with fast response times. We want our software to write new data as quickly as possible and store it in the smallest possible amount of disk space.
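To make the "dresser drawer" pattern concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are invented for illustration; they don't come from any real CRM. The point is the shape of the workload: one small write, wrapped in one quick transaction.

```python
import sqlite3

# An in-memory database stands in for a CRM's application database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)

# Adding one contact is one small transaction: write a single row and commit.
with conn:  # commits on success, rolls back on error
    conn.execute(
        "INSERT INTO contacts (name, email) VALUES (?, ?)",
        ("Ada Lovelace", "ada@example.com"),
    )

row_count = conn.execute("SELECT COUNT(*) FROM contacts").fetchone()[0]
print(row_count)  # one row, written by one quick transaction
```

Applications run this pattern thousands of times a day: each user action touches only a handful of rows, and the database is tuned to make those tiny writes fast.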
Analysis prefers more of a closet-like approach. We want to see everything, all at once. Thus, data models for analytical processing are optimized for reading. The number of queries is smaller, but their size and complexity are orders of magnitude larger. Instruments like data warehouses are specifically structured to support a more holistic view of your data. Imagine if you wanted to create a bar chart that compares the number of shirts you own, grouped by type: t-shirt, polo, button-down, etc. What’s easier, opening your closet and taking a peek, or pulling out every drawer and dumping them out on your bed?
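The shirt bar chart from the paragraph above boils down to a single aggregate query. Here is a hypothetical sketch, again using sqlite3 with invented data: instead of touching one row at a time, the query scans every row and groups the results, which is exactly the read-heavy, see-everything-at-once workload analytical models are built for.

```python
import sqlite3

# An illustrative "closet" of shirts; the data is made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shirts (type TEXT)")
conn.executemany(
    "INSERT INTO shirts (type) VALUES (?)",
    [("t-shirt",), ("t-shirt",), ("t-shirt",), ("polo",), ("button-down",)],
)

# One analytical query reads the whole table and aggregates it,
# producing the counts a bar chart would display.
counts = dict(
    conn.execute("SELECT type, COUNT(*) FROM shirts GROUP BY type").fetchall()
)
print(counts)
```

One query like this may read millions of rows; a warehouse schema is laid out so that such scans are cheap, even though single-row writes become comparatively expensive.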
The process of taking data from its original source and turning it on its head is called “extract-transform-load,” or ETL. In some cases it’s more appropriate to do ELT, which achieves a similar result but reverses the last two steps: the raw data is loaded first and transformed afterward. Some people assume the “transform” step is mostly about remediating messy data. That’s often part of it, but even perfect source data needs some intervention before advanced analysis.
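The three steps can be sketched in plain Python. This is a toy illustration, not a real pipeline: the rows and field names are invented, and a list stands in for a warehouse table. Note that the transform step isn't cleaning anything; the source rows are already pristine, but analysis wants them in a different shape.

```python
from collections import Counter

# Extract: pull raw, transaction-shaped rows from the source application.
source_rows = [
    {"item": "Blue Tee", "category": "t-shirt"},
    {"item": "Faded Tee", "category": "t-shirt"},
    {"item": "White Polo", "category": "polo"},
    {"item": "Oxford", "category": "button-down"},
]

# Transform: nothing here is "dirty," but analysis wants a different
# shape -- pivot individual records into per-category totals.
category_counts = Counter(row["category"] for row in source_rows)

# Load: write the reshaped rows into an analytics-friendly store
# (a plain list stands in for a warehouse table).
warehouse_table = [
    {"category": cat, "shirt_count": n} for cat, n in category_counts.items()
]
print(warehouse_table)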
There are numerous ways to accomplish this, some more automated and easier to implement than others. Some source applications feature specific output formats that present an already-transformed view of the raw data you’ve entered. Business intelligence tools like Tableau, Power BI, and Qlik also have tools that support simple transformation. There are some tools whose specific purpose is to automate a lot of the ETL/ELT process; Stitch, Matillion, and Skyvia are a few examples. And of course, complex integrations may require custom programming, on its own or in conjunction with any of the elements I just mentioned.
When you understand the differences between how applications and analytics tools interact with data, it’s clear to see why we need to pay attention to the transformation step, even when we have a high level of confidence in data cleanliness and accuracy. The shirts in our drawer aren’t any cleaner once we move them to our closet, but they’re certainly easier to look at!
As your trusted business intelligence partner, Trilix can help your organization leverage data more effectively to unlock hidden opportunities. Click here to learn more!