Let’s look at how managers of data teams can set the stage for speed and business results by sorting out what needs to get accomplished, and by tagging four common mistakes as killer anti-patterns.

We’ll build on my August 19 article, where we explored the new conditions of the data landscape that are forcing data professionals to spend their time writing code. On September 4 we made the case for separating code from data by diving into the expectations that data teams work under. Let’s go from there.

Data professionals exhibit T-shaped skills and specialize deeply in specific complexities

Data expertise

Recent literature around data engineering, analytics, and DataOps in particular has begun to tease out the specializations that practitioners find themselves growing into. Not only do professionals increasingly fall into the three broad categories we described in September (math, software, or data), they also exhibit T-shaped skills and specialize deeply in the specific complexities of data-related tooling, open-source software, cloud infrastructure capabilities, domain-specific data sets, sophisticated analytic techniques, or the nuances of machine learning / AI.

To get a sense of what we’re dealing with, let’s go one level deeper and list 13 fairly common responsibilities that underlie a successful data program. We purposely avoid the ambiguity of titles, and sort from the business on top, through math and software, down to infrastructure:

  • Business requirements analysis
  • UI / report design and programming
  • Math and statistics model design
  • Machine learning and AI model design
  • Machine learning and AI programming
  • ETL / SQL / DW programming
  • NoSQL / big data programming
  • Advanced software development
  • Datastores / platforms administration
  • Data infrastructure operations
  • Distributed / containerized systems
  • Enterprise capabilities architecture
  • Site reliability / cloud engineering

There is nothing particularly magical about this list; augment it to fit your business, with the goal of being comprehensive and descriptive.

The engineering trap

Many organizations simply add new hires with a sprinkling of these data-related skills to their existing data warehouse or big-data groups, expecting them to self-organize into a well-functioning team. I described in the last article how such teams start with open source components to build the first few data flows, then encounter new requirements for scale or throughput, and finally get stuck in the complexities of constructing a proprietary data platform. This pattern is a trap that gets in the way of truly improving business performance over the old way of doing things.

Confirming this finding, Sean Knapp, founder and CEO of Ascend, says that “we’ve never seen a company scale their data strategy with open source technology alone. They inevitably encounter failures in their initial approach, requiring substantial amounts of code to productionize these raw technologies. Ultimately, they all try to tame the growing tech debt by architecting an internal platform to stitch these technologies together for broader use. Companies invest years on this approach, but we’ve yet to find one that is happy with the end results.”

The expertise for data team responsibilities falls into four distinct tiers

Balanced data teams

So how should we approach the data landscape differently, in order to meet the critical modern business requirements I described last time: operational speed, quality of insight, and TCO of data?

Four tiers of data expertise

The first pattern decomposes the problem by recognizing that the expertise for these responsibilities falls into four distinct tiers (one plausible grouping is sketched in code after the list):

  1. Business analysis and presentation
  2. Data science and mathematics
  3. Data engineering and operations
  4. Infrastructure and cloud operations
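To make the decomposition concrete, here is a minimal sketch, in Python and purely for illustration, of one plausible grouping of the thirteen responsibilities into these four tiers. The boundary assignments (machine learning / AI programming in particular) are my assumptions, and each organization will draw these lines differently:

    # One plausible grouping of the 13 responsibilities into the four tiers.
    # Boundary assignments (e.g., ML / AI programming) are judgment calls,
    # not a definitive mapping.
    TIERS = {
        "1. Business analysis and presentation": [
            "Business requirements analysis",
            "UI / report design and programming",
        ],
        "2. Data science and mathematics": [
            "Math and statistics model design",
            "Machine learning and AI model design",
        ],
        "3. Data engineering and operations": [
            "Machine learning and AI programming",
            "ETL / SQL / DW programming",
            "NoSQL / big data programming",
            "Advanced software development",
            "Datastores / platforms administration",
            "Data infrastructure operations",
        ],
        "4. Infrastructure and cloud operations": [
            "Distributed / containerized systems",
            "Enterprise capabilities architecture",
            "Site reliability / cloud engineering",
        ],
    }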

These tiers are different in many important ways:

  • Different tools, platforms, methods
  • Different communities, associations, professional hubs
  • Different education, experience, professions
  • Different recruiting, compensation, career paths
  • Different KPIs, success criteria, management styles

Let me caution that separating these tiers into distinct “centers of excellence,” or housing them in different parts of the organization, is killer anti-pattern #1. Instead, the goal is to crystallize the mission of the overall effort, then carefully mix professionals who bring dedicated expertise in each of these tiers into highly functioning collaborative teams.

Highly functioning collaborative data teams are a careful mix of professionals who each bring nuanced expertise

Tiers are interdependent

The second pattern recognizes that each tier depends on the tiers “below” it, and enables the tier “above” it. Let’s get a more nuanced understanding by stepping through each of these tiers.

Data engineering, data science, and data operations skills fall into one of four tiers
  1. Business analysis and visualization have long been the domain of business intelligence and data warehouses, evolving through several generations of technologies that provide human visibility into historical data and analytics. Traditionally the teams in this tier pulled data from transactional systems and cached it in data warehouse structures. With the shift away from historical reporting toward the use of prediction to drive the business, this tier is increasingly dependent on data designed by data science and delivered by data engineering.
  2. Data science is the newest expertise in the stack, creating dynamic datasets with dedicated machine learning and AI engines to directly fuel predictions and emerging business objectives. This expertise draws on academic backgrounds and cutting-edge mathematical techniques, using cloud-native technologies. The scale and sophistication of these operations make them utterly dependent on modern data engineering.
  3. Data engineering is currently undergoing an evolutionary step. This collection of skills builds on the experience of establishing data lakes to now write data pipelines using new specialized data engineering tools and platforms, as well as open source libraries and advanced, production-grade cloud-native code (a sketch of such a pipeline follows this list). This tier also tunes and optimizes its use of underlying infrastructure and cloud, with orchestration that allocates and ramps compute and storage services.
  4. Infrastructure and platforms are usually concerned with the throughput, security, privacy, scale, resilience, and cost of the data programs’ intensive use of compute and storage resources. This tier should coordinate with data engineering to understand its needs, administer data stores and distributed compute platforms, and maintain a consistent overall cloud architecture for the enterprise, including a coherent container strategy and the cloud capabilities for which the company is building up in-house competence.
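To make the data engineering tier concrete, here is a minimal sketch of the pipeline-plus-orchestration code that tier typically owns. It assumes Apache Airflow as the orchestrator; the DAG name and task functions are hypothetical placeholders, not a prescribed design:

    # A minimal, hypothetical pipeline owned by the data engineering tier,
    # orchestrated with Apache Airflow. All names are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator


    def extract_orders():
        """Pull raw records from a source system (stubbed for illustration)."""


    def transform_orders():
        """Clean and reshape the extracted records (stubbed for illustration)."""


    with DAG(
        dag_id="orders_daily",          # hypothetical pipeline name
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",     # the orchestrator runs this on a daily cadence
    ) as dag:
        extract = PythonOperator(task_id="extract", python_callable=extract_orders)
        transform = PythonOperator(task_id="transform", python_callable=transform_orders)
        extract >> transform            # transform runs only after extract succeeds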

Let me caution here against waterfall workflows between tiers: as killer anti-pattern #2, they undermine speed and agility. We’ll talk more about workflow in well-functioning data teams next time, but for now, the goal is to enhance specialization and productivity by aligning roles and responsibilities with the affinities and career paths of individual contributors. For example, data scientists should not be asked to write platform code, and business intelligence teams should not be pulled into building production-grade big data solutions, simply because they sit on the same team.

If your enterprise already has some key skills in house, fill the gaps with specific experts who bring new skills to the table

Staffing each tier appropriately

The third pattern recognizes that emergent data teams are evolutions of past practices and that each enterprise presents unique brownfield conditions. The existing enterprise teams, individuals, and career paths surrounding data flows can contribute to each of the four tiers as follows:

  1. Leverage existing business analysis and BI teams by staging the results of data science and data engineering into their familiar data warehouses. Existing data warehouse teams continue to assure data integrity and optimize SQL for analytic and business access. Over time, add visual application specialists to build more action-oriented applications and business automation based directly on data lakes and new technologies.
  2. Attract and retain rare data scientist talent by reducing their data management and software construction responsibilities. Their long-term mission is to infuse machine learning into every fiber of the business while partnering closely with data engineering experts who can obtain data and construct enterprise-grade pipelines.
  3. Recruit senior data engineering veterans who are already working on data lakes and data mining applications that reuse valuable data, as well as data architects knowledgeable about the data across the business. Focus new hires on specific technology gaps in this team, anticipating emerging requirements from data science teams and from fast-changing cloud-based infrastructure. Note that general data warehouse skills are less useful here, and are better suited for the business analysis tier.
  4. Augment existing enterprise architecture with dedicated cloud expertise, and develop a dedicated IT policy for the cloud. Almost all aspects of IT manifest differently in cloud environments, and traditional techniques and management models do not translate directly. Strong data engineering teams partner well with cloud architects; weaker teams will struggle with true self-service models and need dedicated support from the cloud infrastructure team to onboard.

Let me caution here that there are always differences between existing teams and what is needed in the future, and those differences will not close on their own. Given institutional effects like Conway’s Law, expecting new expertise and practices to emerge solely through self-organization is killer anti-pattern #3. In particular, don’t let the existing team hire more people just like themselves. Instead, the goal is to actively invest in the retention of institutional knowledge, enable distinct career paths defined by vital expertise, introduce new contributors to close specific gaps, and incentivize team behavior that produces targeted business results.

At this moment of evolution (2019), current roles and titles are poor indicators of the skills and expertise needed in the near future

How to use titles and roles

For our fourth and final pattern, here is a map from responsibilities to a few of the common data-related titles seen on business cards and LinkedIn profiles today.

Map of data engineering expertise to common job titles and roles

Notice that this is a loosely coupled map – individuals describing themselves with relevant-sounding titles may or may not have the specific skills the team needs. While titles offer useful orientation about which tier a role is likely to function in, relying solely on such headlines to compose teams is killer anti-pattern #4.
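As an illustration, such a map can even be kept as lightweight data that the team maintains. The titles and tier assignments below are my assumptions for the sake of a sketch, not the actual map above:

    # A hypothetical orientation map from common titles to the four tiers.
    # Titles are loose signals: individuals may not match the tier their
    # title suggests, which is exactly the caution above.
    LIKELY_TIER = {
        "Business Analyst": 1,
        "BI Developer": 1,
        "Data Scientist": 2,
        "Machine Learning Engineer": 2,
        "Data Engineer": 3,
        "Data Architect": 3,
        "Database Administrator": 3,
        "Cloud Architect": 4,
        "Site Reliability Engineer": 4,
    }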

In fact, consider using this map in interviews: have existing team members and new candidates self-select where their specific expertise sits and explain how they fit into the overall landscape. Such an approach will reveal much about each person, the nuances of their knowledge, and the non-technical aspects of how they will function in your team. This process blends a self-organizational approach with improved management awareness, and helps align expectations with business performance.

So if data projects are not simply agile software development, what should the workflows look like, and why? Let’s dig into that in the next article.