In this second post in the series “Practices for Data Warehousing with Ascend” (see the first post here), we’ll explore the concept of automating the management of a data warehouse through projection of source data.
First, let’s talk about the concept of “projection” by using the metaphor of a video projector. A video projector takes source material encoded as data in the form of video, processes that data and projects that data onto a destination (i.e. a projection screen), where viewers see the projected video data. In this case, consider the source material to be your source data (upstream from your data warehouse), the projector as your data pipeline, and your data warehouse as the projection screen.
Figure 1: Data warehouse as a projection of source data
This metaphor is imperfect but helpful. “Projection” in the sense of a video projector is ephemeral/temporary. You turn on a projector and an image appears. If you were to turn off the projector, the image disappears, and when you turn it on again, the image would be reconstructed and displayed again. The “disappearing” act of a projection is not as relevant to our discussion—your data warehouse doesn’t disappear if you turn off your data pipeline! What is important from this metaphor, though, is the way it changes your thinking about what your data warehouse is—something that can be “re-projected” as needed—a pretty big idea.
When you are building a data warehouse, you use a combination of operations to apply transformations to data (for example, data integration or unified data engineering tools such as Ascend). Some database/data warehouse vendors themselves use the term “projection” to describe how data is physically stored in their engines (e.g. Vertica). While data may go through a series of many different transformations and operations while in a pipeline, typically, one can think about the final step as being a kind of projection (or rather being projected) onto the destination. If this sounds like materialized views at all… that is also related!
It is helpful to think about the data warehouse as a projection that may be more temporary, as opposed to the typical way of thinking about a data warehouse as a final resting place of all historical facts in an organization, that is written once in stone. It is common to see a data warehouse as the final authoritative store of all data from within a single organization. This monolithic approach, while popular, has not always been achieved/implemented for a variety of reasons (cost, flexibility, time to develop, etc.), nor even always deemed necessary. If you’ve ever used the “drop, recreate, full reload” pattern for your data warehouse, then you may know why!
In recent years, architectures such as data lake or data vault emerged that better incorporated the idea of storing an entire “raw” copy of all historical source data which originated from transactional systems of record. With these developments, the data warehouse itself became seen as more of an “aggregation” and “projection” of the raw historical data stored in a data lake or data vault. This became especially true in the era of big data, where processing and storing enormous volumes of data in traditional data warehouse platforms became cost-prohibitive and difficult to scale. Ascend even makes the data warehouse something that can be completely recreated at will from the data it stores!
With the theory out of the way, let us see how these concepts of “projection” can be automated using Ascend through two more practices/patterns that build on this core concept!
Practice #2: Build data transformations using multiple programming languages to produce data warehouse projections
It is common to string together long, complex graphs of transformation operations, in one or more data pipelines, to bring data together from many different sources. This is usually to normalize/unify disparate data, to produce the desired output “projections” and “shapes” to feed into the final data warehouse (whether those be dimensional, snowflake or other types of data models).
This practice is all about taking advantage of the fact that Ascend lets you build your ultimate data warehouse “projections” using one or multiple programming languages (SQL, Python or Scala, for example), within a single pipeline. You can think of this as being given a set of different, flexible and powerful tools to build your video projector and set up your video source material for projection, to use that metaphor.
Figure 2: Some of the programming language choices for your transformation logic
This practice is also about enabling data engineers and analysts across your organization to participate in a collaborative and dynamic process of data engineering to support data warehouse development (which, coincidentally, was an intention of the inventors of the concept of data warehousing all along!).
We know that there are a variety of users within different teams in organizations who prefer to use different tools to produce their data pipelines. Some of these team members you find within your IT department, and others yet you find embedded within your organization’s domains (ex. sales, operations, support, etc.). Some of these team members are software engineers or developers, some of them are data analysts or data scientists—you really do find a spectrum of experience.
The trend that emerged and became what we now call “data engineering” was actually spurred by a kind of return to a time before graphical ETL tools, when the development of data pipelines was done by developing custom programs in programming languages. Whether this occurred due to a rejection of the limitations of traditional/legacy ETL tools, or bottlenecks and limitations observed from enough years of experience building data warehouses/etc. is unclear. It may have occurred because of a generational shift that came with “digital transformation”, where more people entering business roles also had experience with programming.
Regardless of the reason, it turned out that the flexibility granted to the data engineer by a full programming language was needed and beneficial to handle the diversity and complexity of the data being processed. This data complexity also reminds us of the “V’s” of data, of which there are now as many as 10!
This trend also contributed to (and was supported by) the popularity of accessible tools such as scripting languages (Python, and things like Perl before that), even SQL itself, to build data pipelines. When faced with the overwhelming and time-sensitive demands that were placed on data analysts and engineers across the organization, we think it was natural to need the speed, flexibility and accessibility of a full programming language, with the dynamic qualities of scripting languages and SQL, to be able to move, adapt and produce results quickly.
As a user of Ascend, you can take advantage of these multiple programming languages, sitting on top of the baked-in platform code+data intelligence that we call DataAwareTM, with a simple visual flow paradigm for ergonomics. This enables several valuable outcomes that include:
- Increased accessibility of the data engineering platform to data engineers of all skill levels within your organization;
- Increased collaboration across teams, on top of a standardized data platform;
- A range of problem-solving approaches to ensure that you have all the tools at your disposal for any kind of data challenges, from simple SQL-based transformations to complex PySpark-based ones; and
- Increased learning, skill-sharing/up-skilling and “cross-pollination” across technical and non-technical teams in the organization, including software developers, business analysts, data engineers, and more – because everyone is enabled to work together using multiple programming languages and tools on a single platform.
Practice #3: Use Ascend to automate the management of a data warehouse (code/schema/data) as a projection
This next practice is not a requirement when using Ascend. It is, however, a very good and powerful one!
For those who have been building traditional, formal data warehouses (either dimensional or snowflake), whether you are following an ETL or ELT approach, the pattern shown below is familiar. Figure 3 shows that in the construction of a data warehouse (tabular in this case), source data that is brought into the warehouse may land in raw and/or staging/intermediate processing area(s), generally followed by a “core” data model for the data warehouse (and then optionally, sometimes into downstream dependent data marts).
Figure 3: Typical progression of data in a traditional data warehouse lifecycle
There are certainly other approaches, especially modern ones that combine data lakes (or the trade-offs you make using ETL vs ELT), but we chose this more traditional pattern to illustrate how you can benefit from the best practice of relying on the capabilities of Ascend to simplify management of transformation logic throughout the traditional data warehouse – from beginning to end. A key difference when you use Ascend in an end-to-end data warehouse lifecycle is that you treat your data warehouse as a projection.
If you apply this practice, you don’t have to write code to keep your data warehouse updated with data from your upstream transactional sources. You do have to model your data warehouse in code, but this is natural and where your focus should be. Code to populate your data warehouse model has been notoriously difficult to write, whether it be through MERGE/UPSERT concepts, proprietary tools/techniques, or having to use tedious insert/update/delete methods. This is especially challenging when trying to model time-oriented concepts, for example – using dimensional modeling techniques that represent slowly changing dimensions (SCD) that are type 2 through 6.
Other challenges arise in environments that have chosen an agile approach to data warehousing, or those that are resource/time-constrained in terms of the team that manages the data warehouse. These scenarios – when schema and/or data in upstream systems changes – are common, especially in the modern world where schemaless/NoSQL data stores are common in transactional systems. The flexibility granted to developers by these tools (in the application layer), ends up being a serious challenge for the downstream data warehouse architects, data models, and data engineers – whose job it is to maintain a consistent model of the enterprise’s data for a large number of organizational citizens and data consumers in business intelligence, analytics, the executive suite, etc.
Three typical problems you might have encountered in a dynamic, fluid and ever-changing data ecosystem include forgotten schema elements, schema drift and late-arriving data. All three scenarios are typically a result of unexpected or unanticipated situation(s) in the upstream systems.
The simplest of the three problems and the one that happens for the most obvious reasons (data warehouse teams are busy/overloaded or work off of incomplete requirements) is forgotten schema elements. Schema elements could be tables, collections, columns, attributes, metadata, etc. It literally occurs because data warehouse teams do not always have time nor access to everything in upstream systems to bring into the warehouse – therefore, things get excluded. While it is generally considered a best practice in data warehousing to ingest all data into the data warehouse when adding a new upstream source, for many practical reasons, we have seen that this does not always occur.
In a similar schema-related type challenge, schema drift is an issue that occurs when the schema of upstream source changes (adding, moving, or deleting columns/attributes) unexpectedly, often without notification to the data warehouse team. This produces similar problems to forgotten schema elements but also introduces the potential of broken data pipelines due to changing upstream system requirements and/or deleted schema objects.
Lastly, data engineers and warehouse teams are surely also familiar with late-arriving data (e.g. late arriving facts if using a dimensional data model). This also manifests itself as “changes to historical data” in upstream transactional systems – which generally should never really occur, but in special circumstances can indeed happen (especially in those where the warehouse team has little control or influence over the upstream systems).
Ascend helps address all three of these problems by taking a different approach to the relationship of your upstream sources and your downstream data warehouse. Ascend relieves a lot of the manual coding/thought you traditionally would put into solving these issues, whether that be through: Ascend’s intelligent source parsing that can handle changing or varied schemas in upstream systems; automatic schema change resolution in the downstream data warehouse (see Figure 4 below); or, automating the sync of upstream data with your downstream data warehouse copy.
Figure 4: Schema Mismatch handling on an Ascend Write Connector for a Data Warehouse Table
When used to manage the schema/data in a data warehouse, Ascend automates the mundane parts of the “last mile” of the data pipeline—getting schema/data into the data warehouse and keeping it up-to-date. You get to focus on modelling your data (e.g. dimensional modelling) by expressing the shape of data you want to appear in code, from the data which is flowing through Ascend. This means, modelling the final shape of the data for the data warehouse using SQL, Python, Scala, etc. as described Practice #2 earlier, and then allowing Ascend.io to manage the projection of the data into the final data warehouse tables.
Figure 5: Using Ascend to maintain your data warehouse as a projection of upstream data
Ascend relies on a combination of hashing (to track the state of changing code, schema and/or data) and partitioning (a common scheme for “chunking” your data, often based on time) to automate data warehouse management operations. Ascend identifies which code, schema and/or data changes cause updates that are needed in the destination warehouse tables, as it comes in through the data pipeline. Ascend automatically takes care of updating the final data warehouse tables, by altering tables to match schema changes, and swapping in and out data partitions and records that were impacted by data changes in the upstream data pipeline.
What this means for the problems mentioned above (forgotten schema elements, schema drift and late-arriving data) is that the automation and intelligence built into Ascend (e.g. DataAwareTM) largely alleviates the manual labor/coding you would previously have had to do to address these issues. By Ascend being aware of the relationship of your transformation code/logic, and the schema/data produced by that code, as well as the impact of changes to either, Ascend is able to automate the last mile. You can focus on your modeling and business logic, and let Ascend take care of the rest.
These features save development time, help avoid potential errors in manually developed code, speeds up time to value in the data warehouse, reduces the time it takes to iterate… it really is a totally different experience when you shift to the model of the data warehouse being a projection of your source data and transformations. This is what we think it means to be autonomous, and the difference between the “declarative” world of Ascend (“what I want the result to be and Ascend takes care of the rest”) vs. an “imperative” world (“how I make the result happen on my own”).
If you follow this pattern completely, one big difference is that you cease storing historical data only within your final data warehouse model. In this model, you project from a separate historical archive of data into the final data warehouse model. This means you can recreate your data warehouse data model from historical data (from scratch if needed), and this is very different from having your final data warehouse model be your sole record of history.
To go back to the digital projector metaphor, you can use Ascend to reproject the entire data warehouse (if you wanted to), or more likely, let Ascend’s intelligence reproject only the parts of the picture (i.e. data warehouse) that have been impacted by changes.
As you can see from our practices shared so far, Ascend offers powerful, effective and cost/time-saving features that simplify building a data warehouse, while leaving you in control of your code, modeling, and data logic. Ascend enables an agile approach to data warehousing that is reliable, flexible and adaptable to quickly changing circumstances and requirements – areas that have always been sticky/pain points in traditional data integration/engineering platforms and data warehouse projects.
Our customers have specifically told us that choosing to use Ascend to perform the complete end-to-end data lifecycle from source to data warehouse offers clear advantages – including simplicity, flexibility and confidence in both stable operations and the final results. Frankly, Ascend also makes these tasks easier and more efficient, and that is the point of automation, isn’t it?
We are still scratching the surface of the benefits of automation that Ascend brings to data warehousing (and beyond data warehousing). Stay tuned for more posts in this series, as we continue to explore the possibilities for rethinking your data engineering ecosystem with Ascend!