In my last article, I talked about why business analysts, data analysts, data scientists, and data engineers are writing code. As we discovered, in their mission to find and extract valuable insights that the business can act on quickly, this cohort spends most of its time building ad-hoc software to process data, most frequently using the “data pipelines” pattern.
Many enterprises are making major investments in developing a custom data pipelines platform, even though such a platform could be obtained commercially.
But is writing redundant software that massages data really the best way for these specialists to create value for the business? If you ask the working individuals in these professions, you will get diverse answers that sound something like this:
- NO. “We deeply know data, math, analysis, statistics, machine learning, AI, and are experts in using math-based software packages to crunch data into quantitative insights. Unfortunately, the professional development of enterprise-grade software at scale is a different world that demands different skillsets and different practices. In the meantime, we copy code from online forums and try our best.”
- YES. “We are enterprise software developers who know how to build scaled software systems and have core expertise in many of the cloud and open-source building blocks out there. In contrast, data is easy, and we have picked up enough about data science along the way to create the data assets our business needs. We’ve got this.”
- GOOD ENOUGH. “We know our business’ data well and are familiar with enough basic programming in Python and JavaScript. We also hack our enterprise BI tools to run some basic algorithms; they are inefficient but create datasets that meet our basic needs. We could use some help, though.”
While your data team likely includes people with all three of these points of view, what really matters is the position of the leaders, and the pace with which the team is adapting to the real needs of the business. So while we’re rolling the dice with the alphas, let’s take a moment to look at the two sources of value in this context: data and code.
Creating value with data
Most practitioners agree that the value to the business lies in the data, and that the work to be done is to extract actionable insights from it. Let’s look at three groups of key requirements for creating value with data.
Foremost, operate at speed:
- The speed with which new ideas and hypotheses can appear.
- The speed with which an idea can be turned into a working pipeline.
- The speed with which the data can be turned into insight and action.
Then, focus on good insights:
- Increase the rate at which good insights are detected and built out.
- Increase the speed at which good insights are launched as pipelines.
- Increase the payoff of good pipelines and the actions they lead to.
- Improve the robustness of good pipelines to provide continuity.
Finally, reduce the TCO of data:
- Reduce the cost of acquiring and maintaining new data sources.
- Reduce the cost of detecting and dismissing poor hypotheses.
- Reduce the cost of creating individual pipelines.
- Reduce the cost of holding data.
Creating value with code
Since the value of pipeline software lies solely in its ability to support value creation from data, the requirements for this software are really the same as for creating value with data, augmented with the following:
- Increase the efficiency of the pipelines’ use of storage and compute.
- Reduce the cost of developing pipelines, so you can invest in data.
- Reduce the size of the team it takes to build each pipeline.
- Reduce the complexity of the software the team has to maintain.
- Reduce the number of libraries and dependencies the team has to maintain.
- Reduce manual intervention with intelligence and automation.
Enterprise data teams get deeply stuck in the paradigm of software development when they conflate creating value with data and creating value with code.
The mistake of investing in code to create value with data
If insights are the golden eggs, then this pipeline software must be the special goose, right? Urged on by several leading analysts, many enterprises follow this logic and invest in in-house software for data pipelines. The effort often turns into a major investment in the development of a custom software platform to construct and operate data pipelines.
Too few executives question this conflation of creating value with code and creating value with data. The resulting progression often looks like this:
Year 1 – 2: The journey is usually started by well-intentioned data science and engineering teams who often fall into the “GOOD ENOUGH” camp. The initial data pipelines often demonstrate the concept well, but most of the value propositions remain out of reach. As the complexity of the key requirements sinks in, individuals on these teams often shift into the “NO” camp. The truth is that for properly trained data professionals, writing repetitive code, constantly fixing bugs and scripts, and manually coaxing brittle systems through their daily functions is a nightmare.
Year 3 – 4: To bring out more of the desired value, experienced software engineers from the “YES” camp usually join the data team to take pipelines to the next level. This becomes a major, multi-year investment by the enterprise, promising to eventually refocus the data engineers on the data itself and relieve the data scientists of writing code altogether. Eventually.
Data should have a distinctly faster velocity than code.
Keep code and data separate
The lesson we are learning is that enterprise data teams get deeply stuck in the paradigm of software development when they conflate creating value with data and creating value with code. While this outcome is in line with the drumbeat of custom apps and differentiated business software of the last decade, it misses new opportunities to bypass this bottleneck.
By following the trends of emergent platforms for machine learning, AI, and cloud computing in general, executives often recognize that data should have a distinctly faster velocity than code. Their challenge is to help data teams unblock their value creation capacity by reducing custom software development and switching to commercial platforms for their data pipelines, APIs, analytics, and machine learning / AI operations. Just as software teams in the cloud no longer wait for system administrators to patch operating systems, partition hard drives, or schedule tape backups, data teams can now create far more value by decoupling the construction and operation of data pipelines and related assets from in-house software teams. Software teams continue to drive enormous value creation elsewhere, especially downstream from the data pipelines, by turning the resulting insights into specific actions in the business.
In my next article, we’ll delve into the expertise and structure of data teams, hear some recommendations, and identify some common traps in the management of these teams.