New Research Finds Data Automation Adoption to Climb from 3.5% to 88.5% Over the Next 12 Months
Adoption Rising as the Need for Data Products Grows Faster Than Team Size and a Staggering 95% of Data Teams Are Still at or Over Capacity
Ascend.io
data-eng@ascend.io
MENLO PARK, Calif. — April 13, 2022 — Ascend.io, the Data Automation Cloud, today announced results from its third annual research study, The DataAware Pulse Survey, about the work capacity and priorities of data teams. Findings from more than 500 U.S.-based data scientists, data engineers, data analysts, enterprise architects, and—new this year—chief data officers (CDOs) reveal that despite 81% of respondents indicating that their team's overall productivity has improved in the last 12 months, 95% of teams are still at or over capacity—just a 1% decrease from the 2021 study. The study also found that data automation is emerging as the most promising path to increase data team capacity and productivity, with a majority (85%) planning to implement automation technologies in the next year even though only 3.5% of the same respondents reported currently having automation technologies in place.
Data Initiatives Are Ballooning Beyond Team Capacity
Nearly all data teams (93%) anticipate that the number of data pipelines in their organization will increase between now and the end of the year, with 57% projecting an increase of 50% or greater. Amid the rising number of data pipelines across their organizations, nearly three in four respondents (72%) indicated that the need for data products is growing faster than their team size. This was especially true among data engineers, 82% of whom stated that the need for data products was increasing faster than their team size.
"Data team productivity remains the single biggest threat to the success of data projects and workloads," said Sean Knapp, CEO and founder of Ascend.io. "In fact, data team capacity has only marginally improved year over year, yet the demands on these teams continue to grow exponentially—far beyond what teams can feasibly keep up with."
Team Backlogs Have Emerged Across the Data Lifecycle
One major roadblock to data team productivity remains fast access to data. When asked how much time they spend trying to gain access to the data they need to do their job, respondents said they spend an astounding 18.9 hours on average per week. Data scientists spend the most time trying to gain access to data each week at 24.6 hours, followed by data engineers at 19.1 hours.
However, data access is not the only roadblock. Among the other top bottlenecks for team productivity, 66% cited team size or hiring constraints as their biggest productivity roadblock, followed by technology limitations (42%). When asked which activities or tasks in their organization's data ecosystem are the most backlogged, respondents were split: data scientists, data engineers, data analysts, and enterprise architects each proved more likely to identify their own function as the most backlogged or resource-demanding compared to their peers.
Data scientists are 3.3 times more likely to say data science is the most bottlenecked
Data engineers are 2 times more likely to indicate data engineering
Data analysts are 1.9 times more likely to say data analysis
Enterprise architects are 1.5 times more likely to indicate data architecture
Data Teams Look to Automation, Flex-Code, and Data Mesh to Increase Productivity
As data teams look for ways to overcome bandwidth limitations, many data professionals are turning to automation to improve data workload efficiency and productivity. In fact, while only 3.5% currently use them, 85% of respondents indicated that their team will likely implement data automation technologies in the next 12 months.
As data teams assess new solutions, many are considering low-code tools and data mesh frameworks to unlock greater team efficiency and business value. Respondents indicated a strong interest in low-code tools that provide greater flexibility (i.e., flex-code), with the majority (81%) saying they would be more inclined to use a no-code or low-code tool if it offered the ability to use their preferred programming languages, up from 73% in 2021. Respondents also cited a strong interest in data mesh frameworks, with 76% planning to implement a data mesh in the next 12 to 24 months. The majority (86%) of data teams believe a data mesh will enable their business to make the most of their existing data architectures and resources. A striking 90% of CDOs agree that a data mesh will enable the business to make the most of their data investments.
"The numbers don't lie—data teams must find a way to dramatically accelerate their productivity, and the overwhelming majority are looking to automation as the answer," said Knapp. "Data leaders are increasingly finding that leveraging automation in conjunction with flex-code and data mesh technologies significantly increases productivity and amplifies the impact of some of their most talented resources."
import json
import os
import pathlib
import subprocess
from typing import Any, Dict, List, Optional, Union


def fetch_commit_history(
    repos: Union[str, List[str], pathlib.Path],
    timeout_seconds: int = 120,
    since_date: Optional[str] = None,
    from_ref: Optional[str] = None,
    to_ref: Optional[str] = None,
) -> Dict[str, List[Dict[str, Any]]]:
    """
    Fetches commit history from one or multiple GitHub repositories using the GitHub CLI.
    Works with both public and private repositories, provided the authenticated user has access.
    """
    # Check GitHub CLI is installed
    subprocess.run(
        ["gh", "--version"],
        capture_output=True,
        check=True,
        timeout=timeout_seconds,
    )

    # Process the repos input to handle various formats:
    # a JSON file of repo names, a comma-separated string, or a list
    if isinstance(repos, pathlib.Path) or (
        isinstance(repos, str) and os.path.exists(repos) and repos.endswith(".json")
    ):
        with open(repos, "r") as f:
            repos = json.load(f)
    elif isinstance(repos, str):
        repos = [repo.strip() for repo in repos.split(",")]

    results = {}
    for repo in repos:
        # Get repository info and default branch
        default_branch_cmd = subprocess.run(
            ["gh", "api", f"/repos/{repo}"],
            capture_output=True,
            text=True,
            check=True,
            timeout=timeout_seconds,
        )
        repo_info = json.loads(default_branch_cmd.stdout)
        default_branch = repo_info.get("default_branch", "main")

        # Build API query with parameters
        api_path = f"/repos/{repo}/commits"
        query_params = ["per_page=100"]
        if since_date:
            query_params.append(f"since={since_date}T00:00:00Z")
        target_ref = to_ref or default_branch
        query_params.append(f"sha={target_ref}")
        api_url = f"{api_path}?{'&'.join(query_params)}"

        # Fetch commits using GitHub CLI
        result = subprocess.run(
            ["gh", "api", api_url],
            capture_output=True,
            text=True,
            check=True,
            timeout=timeout_seconds,
        )
        commits = json.loads(result.stdout)
        results[repo] = commits

    return results
Key implementation details:
GitHub CLI integration: Uses the `gh` command-line tool for authenticated API access to both public and private repositories
Flexible input handling: Accepts single repos, comma-separated lists, or JSON files containing repository lists
Robust error handling: Validates GitHub CLI installation and repository access before attempting to fetch commits
Configurable filtering: Supports both date-based and ref-based commit filtering
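For illustration, here is how the function above might be called; the repository names and date are placeholders, not real repositories:

# Hypothetical usage; the repository names and date are placeholders
commits_by_repo = fetch_commit_history(
    repos="example-org/docs, example-org/platform",  # comma-separated string input
    since_date="2025-01-01",                         # ISO date; T00:00:00Z is appended internally
    timeout_seconds=45,
)
for repo, commits in commits_by_repo.items():
    print(f"{repo}: {len(commits)} commits since 2025-01-01")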
AI-Powered Summarization
from datetime import datetime, timedelta

from openai import OpenAI


def summarize_text(content: str, api_key: Optional[str] = None) -> str:
    """
    Summarize provided text content (e.g., commit messages) using the OpenAI API.
    """
    if not content.strip():
        return "No commit data found to summarize"

    # Get API key from parameter or environment
    api_key = api_key or os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OpenAI API key not found. Set the OPENAI_API_KEY environment variable.")

    client = OpenAI(api_key=api_key)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # categorization prompt shown below
            {"role": "user", "content": content},
        ],
        temperature=0.1,
        max_tokens=1000,
    )
    return response.choices[0].message.content.strip()


def summarize_commits(content: str, add_date_header: bool = True) -> str:
    """
    Summarize commit content and optionally add a date header.
    """
    summary_body = summarize_text(content)
    if add_date_header:
        # Add header with the Monday of the current week
        now_iso = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
        monday = get_monday_of_week(now_iso)
        return f"## 🗓️ Week of {monday}\n\n{summary_body}"
    return summary_body
Our initial system prompt for consistent categorization:
You are a commit message organizer. Analyze the commit messages and organize them into a clear summary.
Group similar commits and format as bullet points under these categories:
- 🚀 Features
- ⚠️ Breaking changes
- 🌟 Improvements
- 🛠️ Bug fixes
- 📝 Additional changes
...
Within the Improvements section, do not simply say "Improved X" or "Fixed Y" or "Added Z" or "Removed W".
Instead, provide a more detailed and user-relevant description of the improvement or fix.
Convert technical commit messages to user-friendly descriptions and remove PR numbers and other technical IDs.
Focus on changes that would be relevant to users and skip internal technical changes.
Format specifications:
- Format entries as bullet points: "- [Feature description]"
- Use clear, user-friendly language while preserving technical terms
- For each item, convert technical commit messages to user-friendly descriptions:
- "add line" → "New line functionality has been added"
- "fix css overflow" → "CSS overflow issue has been fixed"
- Capitalize Ascend-specific terms in bullet points such as "Components"
Strictly exclude the following from your output:
- Any mentions of branches (main, master, develop, feature, etc.)
- Any mentions of AI rules such as "Added the ability to specify keywords for rules"
- Any references to branch integration or merges
- Any language about "added to branch" or "integrated into branch"
- Dependency upgrades and version bumps
…
Prompt engineering:
Structured categorization: Our prompt enforces specific emoji-categorized sections for consistent output formatting
User-focused translation: Explicitly instructs the AI to convert technical commits into user-friendly language
Content filtering: Automatically excludes dependency updates, test changes, and internal technical modifications
Low temperature setting: Uses 0.1 temperature for consistent, factual output rather than creative interpretation
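To show how these pieces could fit together, here is a hypothetical wiring of the prompt and the summarizer, reusing the commits_by_repo dictionary from the earlier fetch sketch. The SYSTEM_PROMPT value is abbreviated; the full text is the prompt above.

# Abbreviated here; the full categorization prompt is shown above
SYSTEM_PROMPT = (
    "You are a commit message organizer. Analyze the commit messages "
    "and organize them into a clear summary. ..."
)

# Hypothetical wiring: join the fetched commit messages and summarize them
all_messages = "\n".join(
    commit["commit"]["message"]   # message field per the GitHub commits API
    for commits in commits_by_repo.values()
    for commit in commits
)
release_notes = summarize_commits(all_messages, add_date_header=True)
print(release_notes)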
Content Integration and File Management
def get_monday_of_week(date_str: str) -> str:
    """
    Get the Monday of the week containing the given date, except for Sunday, which returns the next Monday.
    """
    date = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%SZ")
    if date.weekday() == 6:
        # For Sunday (weekday 6), get the following Monday
        days_ahead = 1
    else:
        # For all other days, get the Monday of the current week
        days_behind = date.weekday()
        days_ahead = -days_behind
    target_monday = date + timedelta(days=days_ahead)
    return target_monday.strftime("%Y-%m-%d")
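For example, a mid-week timestamp maps back to that week's Monday, while a Sunday rolls forward to the next one:

print(get_monday_of_week("2025-01-08T12:00:00Z"))  # Wednesday -> "2025-01-06"
print(get_monday_of_week("2025-01-12T12:00:00Z"))  # Sunday    -> "2025-01-13"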
File handling considerations:
Consistent date formatting: Automatically calculates the Monday of the current week for consistent release note headers
Encoding safety: Properly handles Unicode characters in commit messages from international contributors
Atomic file operations: Uses temporary files during processing to prevent corruption if the process is interrupted
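The atomic-write pattern mentioned above can be sketched as follows; this is one common way to do it in Python, not necessarily our exact implementation, and the helper name is illustrative:

import os
import tempfile

def write_file_atomically(path: str, content: str) -> None:
    """Write content to a temporary file, then atomically replace the target file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:  # UTF-8 keeps Unicode commit text intact
            tmp.write(content)
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except BaseException:
        os.remove(tmp_path)  # clean up the temp file if anything goes wrong
        raise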
GitHub Actions: Orchestrating the Automation
Our workflow ties everything together with robust automation that handles the complexities of CI/CD environments.
Workflow Triggers and Inputs
name: Weekly Release Notes Update

on:
  workflow_dispatch:
    inputs:
      year:
        description: 'Year (YYYY) of date to start collecting releases from'
        default: '2025'
      month:
        description: 'Month (MM) of date to start collecting releases from'
        default: '01'
      day:
        description: 'Day (DD) of date to start collecting releases from'
        default: '01'
      repo_filters:
        description: 'JSON string defining filters for specific repos'
        required: false
      timeout_seconds:
        description: 'Timeout in seconds for API calls'
        default: '45'
Flexible triggering options:
Manual dispatch with granular date control: Separate year, month, day inputs for precise date filtering
Repository-specific filtering: JSON configuration allows different filtering strategies per repository (see the illustrative example after this list)
Configurable timeouts: Adjustable API timeout settings for different network conditions
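As a purely hypothetical illustration of the repo_filters input, the keys below mirror the fetch_commit_history parameters; the real schema is defined by our generator script:

import json

# Hypothetical repo_filters payload; repo names and key names are illustrative only
repo_filters = {
    "example-org/docs": {"since_date": "2025-01-01"},
    "example-org/platform": {"to_ref": "release"},
}
print(json.dumps(repo_filters))  # paste the output into the workflow_dispatch input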
- name: Generate release notes
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    GITHUB_TOKEN: ${{ steps.app-token.outputs.token }}
  run: |
    CONFIG_JSON='${{ steps.repo_config.outputs.config_json }}'
    CONFIG_FILE=$(mktemp)
    echo "$CONFIG_JSON" > "$CONFIG_FILE"

    RAW_OUTPUT=$(python bin/release_notes/generate_release_notes.py \
      --repo-config-string "$(cat "$CONFIG_FILE")" \
      --timeout "${{ github.event.inputs.timeout_seconds }}")

    # Split summary and commits using delimiter
    SUMMARY=$(echo "$RAW_OUTPUT" | sed -n '1,/^### END SUMMARY ###$/p' | sed '$d')
    MONDAY_DATE=$(echo "$SUMMARY" | head -n 1 | grep -oE "[0-9]{4}-[0-9]{2}-[0-9]{2}")

    echo "monday_date=$MONDAY_DATE" >> $GITHUB_OUTPUT
    echo 'summary<<EOF' >> $GITHUB_OUTPUT
    echo "$SUMMARY" >> $GITHUB_OUTPUT
    echo 'EOF' >> $GITHUB_OUTPUT
Key implementation lessons:
Temporary file strategy: We learned the hard way that GitHub Actions environments can lose data between steps. Writing to temporary files solved reliability issues where data would appear blank in subsequent steps.
Complex JSON handling: Uses `jq` for safe JSON manipulation and temporary files to avoid shell quoting issues with complex JSON strings
Output parsing: Logic to split AI-generated summaries from raw commit data using delimiter markers (a sketch of this output contract follows the list)
Robust error handling: `set -euo pipefail` ensures the script fails fast on any error, preventing silent failures
Atomic file operations: Uses temporary files and atomic moves to prevent file corruption
Branch management: Creates date-based branches for organized PR tracking
Content preservation: Carefully prepends new content while preserving existing documentation structure
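The summary/commit split relies on a simple output contract from the generator script. A minimal sketch of what that script might print, assuming the delimiter parsed by the workflow above:

# Hypothetical output contract for generate_release_notes.py:
# the workflow splits on the "### END SUMMARY ###" delimiter shown above.
def emit_output(summary: str, raw_commits: str) -> None:
    print(summary)                 # first line carries the "Week of YYYY-MM-DD" header
    print("### END SUMMARY ###")   # delimiter the sed expression looks for
    print(raw_commits)             # raw commit listing for the PR body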
Lessons Learned and Best Practices
Building this pipeline taught us valuable lessons about documentation automation that go beyond the technical implementation.
Technical Insights
File persistence matters in CI/CD environments. GitHub Actions environments can be unpredictable—always write important data to files rather than relying on environment variables or memory. We learned this the hard way when release notes would mysteriously appear blank in PRs.
API reliability requires defensive programming. Build retry logic and fallbacks for external API calls (OpenAI, GitHub). Network issues and rate limits are inevitable, especially as your usage scales.
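A lightweight retry wrapper along these lines is one way to add that defensiveness; the attempt count and backoff values are arbitrary, and the wrapped call is just an example:

import time

def with_retries(fn, attempts: int = 3, backoff_seconds: float = 2.0):
    """Call fn(), retrying with linear backoff on any exception."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(backoff_seconds * attempt)

# Example: wrap the OpenAI summarization call from earlier
summary = with_retries(lambda: summarize_text(all_messages))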
Prompt engineering is crucial for consistent output. Spend time crafting prompts that consistently produce the format and tone you want. Small changes in wording can dramatically affect AI output quality and consistency.
Human review is essential, even with AI generation. Having team members review PRs catches edge cases, ensures quality, and builds confidence in the automated system. The goal isn't to eliminate human oversight—it's to make it more efficient and focused.
Historical tracking and product evolution insights. Automated generation creates a consistent record of product evolution that's valuable for retrospectives, planning, and onboarding new team members.
Results and Impact
The automation has fundamentally transformed our release process and team dynamics:
Quantifiable Improvements
Dramatic time savings: Reduced release note creation from 2-3 hours of writing time to 15 minutes of review time. That's a 90% reduction in effort while improving quality and consistency.
Perfect consistency: Every release now has properly formatted, comprehensive notes. No more missed releases or inconsistent formatting across different team members.
Increased frequency: We can now generate release notes weekly, providing users with more timely updates about product improvements.
Complete coverage: Captures changes across all repositories without manual coordination, eliminating the risk of missing important updates.
Next Steps and Future Enhancements
We're continuously improving the pipeline based on team feedback and evolving needs:
Immediate Roadmap
Slack integration: Building a Slackbot to automatically share release notes with our community channels, extending the reach beyond just documentation updates.
Repository tracing: Categorize the raw commits by repository and add links so it's easy to (literally) double-click into each PR for additional context.
Future Possibilities
Multi-language support: Generating release notes in different languages for global audiences as we expand internationally.
Ready to automate your own release notes? Start with the requirements above and build incrementally. Begin with a single repository, get the basic workflow running, then expand to multiple repos and add advanced features. Your future self (and your team) will thank you for eliminating this manual drudgery and creating a more consistent, professional release process.
The investment in automation pays dividends immediately—not just in time saved, but in the improved quality and consistency of your user communication. In a time when software moves fast, automated release notes ensure your documentation keeps pace.