A simple way of describing observability is how well you can understand a system and its internal state from the outputs it generates.
Expanded to IT, software, and cloud computing, observability is how engineers understand the current state of a system from the data it generates. To understand a system fully, you have to proactively collect the right data, visualise it, and apply intelligence on top of it.
Observability provides a proactive approach to troubleshooting and optimising software systems. It offers a real-time, interconnected perspective on all operational data within a software system, enabling on-the-fly inquiries about applications and infrastructure.
In the modern era of complex systems developed by distributed teams, observability is essential. It goes beyond traditional monitoring by allowing engineers to understand not only what is wrong but also why it broke.
The observability market is huge. To give you an idea: “For every $1 you spend on public cloud, you’re likely spending $0.25–$0.35 on observability.” The observability and monitoring market was valued at $41B at the end of 2022 and is poised to grow to $62B by the end of 2026.
Organisations are willing to pay for the right tools, as highlighted by Datadog’s posted annual revenue of roughly $2B in FY23. Coinbase was billed $65M in Datadog expenses alone in 2021, highlighting how these costs can be much larger than one might imagine.
In essence, this is a mature space, with players like Splunk, Datadog, Grafana, New Relic, and others taking up a significant share of software enterprises’ spend. These companies are innovating daily, and the table below helps us understand their strengths and focus areas:
Traditional ML algorithms, such as decision trees, random forests, and clustering algorithms, were and are used in AIOps for tasks such as anomaly detection, root cause analysis, and predictive analytics. These techniques provided the foundational algorithms and methodologies for automating IT operations tasks.

The advent of LLMs has advanced the capabilities of AIOps further by enabling more sophisticated natural language processing and understanding, which is particularly useful for analysing unstructured data such as logs and alerts. Foundation models are able to effectively “apply intelligence” by querying a system’s observability data (think MELT: metrics, events, logs, and traces) to reach the “why” faster. Given the readily available data and the clear market need, this is one of the first areas where generative AI is seeing immediate application. In addition to traditional machine learning approaches, which have their own salient points, LLMs create value in a number of new ways.
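To make the pattern concrete, here is a minimal sketch of the LLM-driven flow described above: a hypothetical excerpt of incident logs is sent to a chat model with a prompt asking for a probable root cause. It assumes the `openai` Python client and an API key in the environment; the log lines, model name, and prompt are illustrative, not any vendor’s actual implementation.

```python
# Hedged sketch: ask an LLM to hypothesise a root cause from raw log lines.
# Assumptions: the `openai` package is installed and OPENAI_API_KEY is set;
# the log excerpt and model name are illustrative, not from a real system.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical unstructured logs from an incident window (the "L" in MELT).
log_excerpt = """
2024-05-01T10:02:11Z payments-api ERROR upstream timeout calling auth-service (5000ms)
2024-05-01T10:02:12Z auth-service WARN connection pool exhausted (100/100 in use)
2024-05-01T10:02:14Z payments-api ERROR request failed: 504 Gateway Timeout
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any capable chat model would do here
    messages=[
        {"role": "system",
         "content": "You are an SRE assistant. Given raw logs, state the most "
                    "likely root cause and one remediation step, briefly."},
        {"role": "user", "content": log_excerpt},
    ],
)
print(response.choices[0].message.content)
```

The point is less the specific API and more the shape of the interaction: unstructured telemetry in, a natural-language hypothesis about the “why” out.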
Some of the key differences between traditional ML models and LLMs are as follows:
These capabilities are very useful for observability because, in essence, the field revolves around interpreting time-series system data to understand the current and possible future states of the system.
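For contrast, the traditional ML side of this problem is often straightforward statistical anomaly detection over a metric stream. The sketch below flags latency spikes with a rolling z-score; the window size, threshold, and synthetic data are illustrative assumptions, not a production recipe.

```python
# Hedged sketch of classic time-series anomaly detection on a latency metric,
# the kind of task traditional AIOps tooling has long handled.
import numpy as np

def rolling_zscore_anomalies(values, window=20, threshold=3.0):
    """Flag points deviating more than `threshold` standard deviations
    from the trailing window's mean. Window/threshold are illustrative."""
    values = np.asarray(values, dtype=float)
    anomalies = []
    for i in range(window, len(values)):
        prior = values[i - window:i]
        mu, sigma = prior.mean(), prior.std()
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Synthetic p99 latency series (ms) with an injected spike.
rng = np.random.default_rng(0)
latency = rng.normal(120, 5, 200)
latency[150] = 400  # simulated incident
print(rolling_zscore_anomalies(latency))  # expect the spike at index 150
```

A detector like this tells you *what* deviated and *when*; the LLM layer sketched earlier is what narrows in on *why*.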
The most surprising part of Datadog’s earnings call was this: “Management projects its core observability and monitoring market will grow at a 10.89% compound annual growth rate (CAGR) from $41 billion at the end of 2022 to $62 billion at the end of 2026. Thus far, Datadog has only captured a 4.36% share of this market, so the company has a long runway for growth.”
Despite a myriad of tools, the following challenges persist, as independently confirmed through our discussions with various users and through industry reports:
1. Tool Sprawl: Separate storage tools are typically used for each telemetry data type (logs, metrics, and traces), adding cost and complexity to manage.
2. Increasing TCO (Total Cost of Ownership): Observability tools are notorious for large bills, as highlighted in the graph below. This is driven primarily by tool sprawl and index storage costs.
Additionally, IT budgets are being rationalised:
3. Shortage of Talent: Resolving incidents requires a broad surface area of unique skills. These skills are usually developed only after years of trial and error, so solving complex production problems requires senior SREs, who are hard to hire.
Additionally, layoffs increase the MTTR (mean time to resolution) for incidents as well, compounding the SRE problem:
The biggest shift in recent times has been the advent of large language models, which can extract precise information faster and better. On the back of this, we have seen multiple players starting up here.
Despite the crowded landscape, there is more to be done given the paradigm shift underway with LLMs. We think the following factors should be considered guiding markers for starting up here:
In summary, this is a large addressable spend, and the flexibility of new tools can unlock massive value by simplifying the path to outcomes. We’re ready for some big changes in this space and excited to see what startups build.