Improving CI/CD Pipelines through Observability

Craig Risi
Jun 23, 2023

I wanted to take a short break from my series on testing tools and showcase some content which I had previously written for InfoQ. If you don't read that site, I would encourage you to do so, as they have fantastic content on a variety of software development topics and their editing team is excellent - making my writing come across a lot better than it normally does.

Hopefully this will help provide some of my readers here with some other useful information on CI/CD Pipelines and how they can use observability to improve their quality.

Now to the article:

Continuous Integration (CI) pipelines have become an important, even ubiquitous, part of software development, thanks to the value they bring to teams in being able to continuously test the code at various levels and automate much of the complex deployment process. Just having a CI pipeline in place, though, is not enough if you want to get the greatest value from it.

How do you best keep track of the effectiveness of your CI pipelines and processes? How do you ensure that your pipeline is actually delivering the right level of quality to your software rather than implicitly trusting in its success? How can you better make use of your pipelines to troubleshoot software issues and make your different applications operate more efficiently?

Our CI pipelines can answer all of these questions if we implement effective monitoring and observability in them. To do this, though, we first need to address the following questions:

What are the different aspects of observability that we need to be aware of when it comes to CI pipelines?
How do we configure this monitoring in our pipelines?
What metrics should we be monitoring to understand our CI pipelines and software applications better?
How do we best visualize some of these metrics?

In this article, we will look to address many of these questions to allow you to use observability to make better use of your CI pipelines. And while in this article we will provide several important traits that teams should strive for, it's important to acknowledge that every team and software application is different. You may need to make adjustments based on your team's specific needs. Similarly, most of the technical solutions that this article explores will involve tools like InfluxDB and Grafana and showcase how you can configure various dashboards through them. You may be using different tools in your team, but the principles should largely still apply. You may need to explore how best to achieve the same results given your specific toolset.

Understanding the Different Technical Aspects of Observability

There are several key components of observability in a CI pipeline, including monitoring, logging, and tracing.

Monitoring refers to the ongoing tracking of the pipeline operation, including the performance of the various stages, the status of builds and deployments, and the overall health of the pipeline. This can be done using a variety of tools, such as Prometheus and Grafana, which can provide real-time visibility into the pipeline and alert developers to any issues that may arise.

Logging refers to the collection and storage of log data from the pipeline, including information about builds, deployments, and pipeline performance. This data can be used for troubleshooting and root cause analysis and can be stored in a centralized log management system, such as ELK or Splunk, for easy access and analysis.
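As a rough illustration of log data that is easy to centralize, here is a minimal Python sketch that emits one JSON object per log line using only the standard library. The `pipeline` and `stage` fields and the logger name are assumptions made for the example, and shipping the lines to a system such as ELK or Splunk is left out.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # "pipeline" and "stage" are hypothetical fields passed via `extra`.
            "pipeline": getattr(record, "pipeline", None),
            "stage": getattr(record, "stage", None),
        }
        return json.dumps(entry)


def build_logger(stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("ci-pipeline")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = build_logger()
logger.info("tests finished", extra={"pipeline": "web-app", "stage": "test"})
```

Because every line is valid JSON, a log collector can index the fields without custom parsing rules.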

Tracing refers to the ability to follow the flow of a request or transaction through the pipeline, from development to production. This can be done using a tracing tool, such as Jaeger or Zipkin, which can provide detailed information about the various stages of the pipeline, including the time taken for each stage, the resources used, and any errors that may have occurred.
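To make the idea of per-stage timing concrete, the sketch below records a simple span for each pipeline stage, including any error that occurred. It is a local stand-in written for illustration, not the Jaeger or Zipkin API, and the class and field names are assumptions.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class StageSpan:
    name: str
    duration_seconds: float
    error: Optional[str] = None


@dataclass
class PipelineTrace:
    """Collects one span per pipeline stage for a single run."""

    spans: List[StageSpan] = field(default_factory=list)

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        error = None
        try:
            yield
        except Exception as exc:
            # Record the failure, then re-raise so the pipeline still fails.
            error = repr(exc)
            raise
        finally:
            elapsed = time.perf_counter() - start
            self.spans.append(StageSpan(name, elapsed, error))


trace = PipelineTrace()
with trace.stage("build"):
    time.sleep(0.01)  # stand-in for real build work
```

A real tracing tool would additionally propagate a trace ID between stages and export the spans to a backend.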

Overall, observability in a CI pipeline is essential for maintaining the reliability and efficiency of the pipeline and allows developers to quickly identify and resolve any issues that may arise. It can be achieved by using a combination of monitoring, logging, and tracing tools, which can provide real-time visibility into the pipeline and assist with troubleshooting and root cause analysis.

In addition to the above, you can also use observability tools such as Application Performance Management (APM) solutions like New Relic or Datadog. APMs provide end-to-end visibility of the entire application and infrastructure, which in turn gives the ability to identify bottlenecks, performance issues, and errors in the pipeline.

It is important to note that observability should be integrated throughout the pipeline, from development to production, to ensure that any issues can be identified and resolved quickly and effectively.

How Best to Configure Monitoring in Your CI Pipeline?

Perhaps the hardest part of this process is choosing the right tools. There are many different tools available, each with different pros and cons that are beyond the scope of this article. I would recommend investing considerable effort in researching the tools on the market and which might best match your existing tech stack, budget, and skillset, and then experimenting with the different options to see which might work for you.

Tools like Prometheus, Grafana, and the ELK stack (Elasticsearch, Logstash, Kibana) are popular choices for monitoring CI pipelines. However, the decision is not just about how best to visualize your monitoring and which tools provide the best reporting or alerting features, but also, perhaps more importantly, about how best the data can be collected.

The following steps are key to helping configure both your data collection and pipeline process so that they align effectively:

Collect data from multiple sources: This includes the build process, testing process, and deployment process. This will provide a complete picture of the pipeline's performance.
Store data in a central location: Leverage a data warehouse or a centralized logging system for easy access and analysis.
Use APIs to automate data gathering: Use APIs to collect data from the pipeline and other sources, such as code repositories and issue tracking systems. This allows for easy integration with other tools and systems, as well as simpler automation as the data can be pulled without the need for manual intervention.
Use logging and monitoring frameworks: Frameworks such as Logstash and Prometheus can be used to collect and analyze data. These frameworks provide built-in support for data collection, storage, and analysis.
Use data visualization tools: Once you've collected your data in a central place effectively, it's time to look into ways of visualizing it (which I will discuss later). Use data visualization tools, such as Grafana or Tableau, to present the data in an easy-to-understand format, identify trends more simply, and filter based on specific requirements and patterns in the data.
Set up alerting: Set up alerting mechanisms so that you can be notified when there are issues with the pipeline. This can include sending notifications to a team's chat tool, Slack channel, email, SMS, creating an incident in an incident management tool like PagerDuty, or even logging the incident into an issue management board like JIRA.
Keep track of data retention policies: We’re talking about gathering all this data, but it doesn’t help if you end up just continuously storing all the data. While having large volumes of data is helpful, it’s expensive, can slow down the performance of the system, and often just gets in the way. Keep track of data retention policies and ensure that data is kept for a sufficient amount of time for analysis and compliance.
Continuously monitor and optimize: It perhaps goes without saying that once you have your data gathering and visualizing in place, you will need to continuously monitor the pipeline and make adjustments as necessary. This includes adjusting the data gathering configuration, adding new data points, and optimizing the pipeline for better performance.
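As one small illustration of the retention step in the list above, the following sketch drops records that fall outside a retention window. Most data stores can enforce retention themselves, so treat this only as a picture of the rule being applied; the record shape (a dictionary with a `timestamp` key) is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def prune_expired(records: List[Dict], retention_days: int, now: datetime) -> List[Dict]:
    """Keep only records whose 'timestamp' falls inside the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [r for r in records if r["timestamp"] >= cutoff]


now = datetime(2023, 6, 23, tzinfo=timezone.utc)
records = [
    {"id": 1, "timestamp": now - timedelta(days=800)},  # older than the window
    {"id": 2, "timestamp": now - timedelta(days=10)},   # still inside the window
]
kept = prune_expired(records, retention_days=700, now=now)
```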

How to Push Data through a Pipeline

There are many ways to push data from a CI pipeline to a data source, and the specific method will depend on the data source and the CI tool you are using. Here are a few examples of how data can be pushed from a pipeline to a data source using code:

Using a REST API:

Many data sources provide a REST API that allows data to be pushed to the data source using HTTP requests. For example, you can use a library like requests in Python to make a POST request to a REST API endpoint to push data to the data source.

Python example:

import requests

data = {'key1': 'value1', 'key2': 'value2'}

response = requests.post('https://example.com/data', json=data)

Using an SDK:

Some data sources provide an SDK or client library that can be used to push data to the data source. For example, you can use the AWS SDK for Python (boto3) to push data to an Amazon S3 bucket.

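Python example:

import json
import boto3

data = {'key1': 'value1', 'key2': 'value2'}

s3 = boto3.client('s3')

# Body must be bytes or a string, so serialize the dictionary first.
s3.put_object(Bucket='my-bucket', Key='data.json', Body=json.dumps(data))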

Using a command line tool:

Some data sources provide a command line tool that can be used to push data to the data source. For example, you can use the curl command to push data to a REST API endpoint.

Example:

curl -X POST -H "Content-Type: application/json" -d '{"key1": "value1", "key2": "value2"}' https://example.com/data

Using a data pipeline tool:

Some data sources provide a data pipeline tool that can be used to push data to the data source. For example, you can use Apache NiFi to push data to a data lake.

These examples are very high-level and rudimentary but should help to provide a basis on which the team can start to extract this data from the CI pipeline to your required data source.
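The payloads in the examples above are placeholders. As a hedged sketch of what a more realistic payload might look like, the function below assembles one pipeline-run record from environment variables. The variable names `PIPELINE_ID`, `GIT_COMMIT`, and `GIT_BRANCH` are hypothetical; each CI tool exposes its own names, so check the documentation of the tool you use.

```python
import json
from typing import Dict, Mapping


def build_payload(env: Mapping[str, str], status: str, duration_seconds: float) -> Dict:
    """Assemble one pipeline-run record from CI-provided environment variables."""
    return {
        # PIPELINE_ID, GIT_COMMIT and GIT_BRANCH are hypothetical names.
        "pipeline_id": env.get("PIPELINE_ID", "unknown"),
        "commit": env.get("GIT_COMMIT", "unknown"),
        "branch": env.get("GIT_BRANCH", "unknown"),
        "status": status,
        "duration_seconds": duration_seconds,
    }


payload = build_payload({"PIPELINE_ID": "1234", "GIT_BRANCH": "main"}, "passed", 312.5)
body = json.dumps(payload)  # ready to send with requests.post or curl
```

In a real pipeline you would pass `os.environ` as the first argument; taking a mapping keeps the function easy to test.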

Below is a full example of some TypeScript code that sets up a connection in a CI pipeline and pushes the relevant results through to a data store. In this case, we used InfluxDB due to its configurability and low cost.

import { config as dotenv } from 'dotenv';
import * as influxDB from 'influx';

// Shared connection, created lazily on first use.
let connection: influxDB.InfluxDB | undefined;

// Writes points to a measurement, creating the connection and database on first use.
export async function streamMeasurement(
  measurement: string,
  points: influxDB.IPoint[]
): Promise<void> {
  if (connection == null) {
    // Load the INFLUXDB_METRICS_* settings from the config file into process.env.
    dotenv({ path: '.influxconfig' });

    const dbName = process.env.INFLUXDB_METRICS_DBNAME;
    if (dbName == null) {
      // No database configured, so skip metrics instead of failing the pipeline.
      return;
    }

    connection = createConnection(dbName);
    await createDatabase(connection, dbName);
  }

  await connection.writeMeasurement(measurement, points);
}

// Runs an InfluxQL query against the configured database.
export async function executeQuery<T>(
  influxQl: string
): Promise<influxDB.IResults<T>> {
  const dbName = process.env.INFLUXDB_METRICS_DBNAME;
  if (dbName == null) {
    throw new Error('INFLUXDB_METRICS_DBNAME is not set');
  }
  return createConnection(dbName).query<T>(influxQl);
}

// Builds a client from the host and port found in the environment.
function createConnection(dbName: string): influxDB.InfluxDB {
  const host = process.env.INFLUXDB_METRICS_HOST;
  const port = process.env.INFLUXDB_METRICS_PORT;

  return new influxDB.InfluxDB(`http://${host}:${port}/${dbName}`);
}

// Creates the database and its default retention policy if they do not exist yet.
async function createDatabase(
  db: influxDB.InfluxDB,
  dbName: string
): Promise<void> {
  const dbNames = await db.getDatabaseNames();

  if (dbNames.includes(dbName)) {
    return;
  }

  await db.createDatabase(dbName);
  await db.createRetentionPolicy(dbName, {
    duration: '700d',
    database: dbName,
    replication: 1,
    isDefault: true,
  });
}
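The client library above takes care of serializing the points it is given. If you are pushing from a tool without a client library, it can help to know roughly what is being sent. The Python sketch below formats a single point in InfluxDB's line protocol text format, covering only the common escaping rules. The measurement, tag, and field names are assumptions made for the example.

```python
from typing import Dict, Union

FieldValue = Union[bool, int, float, str]


def _escape_tag(text: str) -> str:
    # Commas, equals signs and spaces are escaped in tag keys, tag values and field keys.
    return text.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def _escape_measurement(text: str) -> str:
    # Measurement names only need commas and spaces escaped.
    return text.replace(",", r"\,").replace(" ", r"\ ")


def _format_field(value: FieldValue) -> str:
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # string fields are double-quoted


def to_line_protocol(
    measurement: str,
    tags: Dict[str, str],
    fields: Dict[str, FieldValue],
    timestamp_ns: int,
) -> str:
    """Format one point as: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("a point needs at least one field")
    tag_part = "".join(
        f",{_escape_tag(k)}={_escape_tag(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape_tag(k)}={_format_field(v)}" for k, v in fields.items()
    )
    return f"{_escape_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"


line = to_line_protocol(
    "pipeline_run",
    {"pipeline": "web app", "branch": "main"},
    {"duration_seconds": 312.5, "passed": True, "tests_failed": 0},
    1687500000000000000,
)
```

Tags are indexed and suit values you filter or group by, such as the pipeline name, while fields hold the measured values themselves.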

Metrics That Can Be Measured through Your CI Pipelines

There are many different types of metrics that we can capture through our CI pipelines. You may want to measure different things at different stages of the CI pipeline to give you the most relevant and reliable results.

The list of metrics can also be quite exhaustive, but you don’t want to fall into the trap of trying to measure everything. Doing so can lead to analysis paralysis where your teams have access to lots of information but can’t make sense of which metrics to focus on to understand, address, or rectify certain issues, often leading to no effective work being done.

Just a reminder that the specific metrics being showcased below relate purely to the CI process. Measuring things like application performance is still important and should be measured, just not as part of your CI process.

Below is a list of the most important metrics to keep track of:

Build time: This metric measures the time it takes for a build to complete, from the start of the build process to the completion of the tests. It can be used to identify slow build times and optimize the pipeline for faster builds.
Test pass rate: This metric measures the percentage of tests that pass during the build process. It can be used to identify flaky tests and improve the overall quality of the code.
Security scan results: Any pipeline should have some form of static analysis in place that checks the code for known vulnerabilities or unsupported packages. Tracking this may seem unnecessary, as you may already fail any pull requests that have significant vulnerabilities against them, but there is still a need to track the different security risks and ensure they are addressed.
Deployment frequency: This metric measures the frequency at which code is deployed to production. It can be used to identify bottlenecks in the pipeline and optimize the deployment process.
Failure rate: This metric measures the percentage of builds or deployments that fail. It can be used to identify issues in the pipeline and optimize the process for fewer failures.
Mean Time to Recovery (MTTR): This metric measures the time it takes to recover from a failure. It can be used to identify issues in the pipeline and optimize the process for faster recovery times.
Resource utilization: These metrics measure the usage of underlying system resources like CPU, memory, disk, or network bandwidth. They can be used to identify bottlenecks in the pipeline and optimize the process for better resource usage.
Code quality metrics: These metrics measure the quality of code, such as the number of bugs, code complexity, maintainability, and test coverage. They can be used to identify issues in the pipeline and improve the overall quality of the code.
User engagement metric: This metric measures how users are interacting with the system, such as the number of active users, response times, or error rates. It can be used to identify issues in the pipeline and optimize the process for better user engagement.
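To show how a couple of the metrics above could be derived from raw run records, here is a minimal Python sketch that computes the failure rate and the mean time to recovery. The `Run` record shape is an assumption, and recovery is taken here to mean the time from the first failing run in a streak to the next passing run. Your team may define it differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Run:
    finished_at: datetime
    duration_seconds: float
    passed: bool


def failure_rate(runs: List[Run]) -> float:
    """Percentage of runs that failed."""
    if not runs:
        return 0.0
    failed = sum(1 for r in runs if not r.passed)
    return 100.0 * failed / len(runs)


def mean_time_to_recovery(runs: List[Run]) -> Optional[timedelta]:
    """Average time from the first failure in a streak to the next passing run."""
    ordered = sorted(runs, key=lambda r: r.finished_at)
    recoveries = []
    broken_since = None
    for run in ordered:
        if not run.passed and broken_since is None:
            broken_since = run.finished_at  # a failure streak starts here
        elif run.passed and broken_since is not None:
            recoveries.append(run.finished_at - broken_since)
            broken_since = None
    if not recoveries:
        return None  # nothing failed, or nothing has recovered yet
    return sum(recoveries, timedelta()) / len(recoveries)
```

Both functions work on whatever window of runs you pass in, so the same code can report a daily, weekly, or per-branch figure.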

It's important to remember that not all metrics are equally important for all pipelines; it depends on the pipeline and the specific requirements of the organization. Pick the metrics that are most relevant to the pipeline and the organization's goals.
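The test pass rate entry in the list above mentions flaky tests. One simple heuristic, sketched below, is to flag any test that has both passed and failed against the same commit, since the code did not change between those runs. The tuple layout of the results is an assumption, and this is only one of several possible definitions of flakiness.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (test name, commit SHA, passed?) for each recorded test execution
ResultRow = Tuple[str, str, bool]


def find_flaky_tests(results: Iterable[ResultRow]) -> List[str]:
    """Return tests that both passed and failed on the same commit."""
    outcomes: Dict[Tuple[str, str], Set[bool]] = defaultdict(set)
    for name, commit, passed in results:
        outcomes[(name, commit)].add(passed)
    flaky = {name for (name, _), seen in outcomes.items() if len(seen) == 2}
    return sorted(flaky)


results = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),    # same commit, different outcome
    ("test_checkout", "abc123", False),
    ("test_checkout", "def456", True),  # different commits, so not flagged
]
flaky = find_flaky_tests(results)
```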

Data Visualization Tools

Before going into detail on ways to visualize the data, I want to briefly talk about some of the visualization tools that are often best for CI observability. These are not the only tools, but they tend to be the most widely used because of their ease of use with large volumes of data, the variety of tooling they offer for tracking CI pipelines, and their reconfigurability:

Grafana

Grafana is an open-source dashboard and visualization tool that can be used to display metrics from a variety of data sources, including Prometheus, InfluxDB, Graphite, Elasticsearch, and more. It allows you to create custom dashboards and alerts and has a wide variety of pre-built panels and plugins that can be used to display pipeline metrics.

Kibana

Kibana is an open-source data visualization and exploration tool that is part of the Elastic Stack. It can be used to display metrics from Elasticsearch and can be used to create custom visualizations and dashboards. It also allows you to search and explore your data and set up alerts.

Datadog

Datadog is a cloud-based monitoring and analytics platform that can be used to display metrics from a variety of data sources, including agents, integrations, and APIs. It allows you to create custom dashboards, set up alerts, and can be used to display pipeline metrics.

New Relic

New Relic is a cloud-based performance monitoring and analytics platform that can be used to display metrics from a variety of data sources, including agents, integrations, and APIs. It allows you to create custom dashboards and set up alerts, and can be used to display pipeline metrics.

Prometheus

Prometheus is an open-source monitoring and alerting system that can be used to collect and store metrics from a variety of data sources. It also provides a built-in visualization and exploration tool called Prometheus Web UI, which can be used to display pipeline metrics.

The benefit of many of these tools is that they can be structured using a form of HTML or JSON to pass information through, which means that you can easily distribute or scale your dashboarding to operate in different domains without needing to build everything from scratch.

A lot of these visualizations can be configured in code, and I have provided a link to a JSON file which details how this can be done. The actual file is quite large, so I won't reproduce it in the article, but rather allow those of you interested to view it for yourselves.
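To give a feel for what configuring a dashboard in code can mean, the sketch below generates a heavily simplified Grafana-style dashboard definition in Python. It only fills in a title and a grid of single-value panels. The queries (`targets`), data source settings, and most other fields of a real dashboard file are deliberately omitted, and the helper names are assumptions.

```python
import json
from typing import Dict, List


def stat_panel(panel_id: int, title: str, x: int, y: int) -> Dict:
    """One single-value panel; the query ('targets') is left out on purpose."""
    return {
        "id": panel_id,
        "type": "stat",
        "title": title,
        "gridPos": {"h": 4, "w": 6, "x": x, "y": y},
    }


def build_dashboard(title: str, metric_titles: List[str]) -> Dict:
    """Lay the panels out four per row on Grafana's 24-column grid."""
    panels = [
        stat_panel(i + 1, name, x=(i % 4) * 6, y=(i // 4) * 4)
        for i, name in enumerate(metric_titles)
    ]
    return {"title": title, "panels": panels}


dashboard = build_dashboard(
    "CI Overview",
    ["Pass rate", "Build time", "Failure rate", "MTTR", "Deployments"],
)
dashboard_json = json.dumps(dashboard, indent=2)
```

Generating the definition from a list like this makes it straightforward to stamp out the same dashboard for several teams or pipelines.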

How to Visualize These Metrics

This is another topic that can cover a multitude of different options, as there are many ways to display the different metrics. Some tools provide you with a lot of built-in metrics and dashboards to make these easier. Still, given the diversity of different software needs, it is often better for an organization to put its own dashboards together in a way that makes sense to them.

Some important tips are:

Keep it simple: Use simple and easy-to-understand visualizations, such as bar charts, line charts, and pie charts, rather than complex or hard-to-interpret visualizations.
Use color effectively: Use color to make the data stand out and to highlight important trends or patterns.
Use labels and annotations: Use labels and annotations to help explain the data and to make it easy for users to understand what the data represents.
Use real-time data: Use real-time data to show the most up-to-date information and allow users to see how the data changes over time.
Use a consistent design: Use a consistent design to make it easy for users to understand the data and to ensure that the visualizations are easy to read and interpret.
Make it accessible: Make sure that the visualizations are accessible to all users, including those with visual impairments or color vision deficiencies.

The most important thing is to remember the key metrics and alerts that you are trying to track. Many teams will put together visually attractive dashboards that look useful and provide lots of information, but the purpose of observability is to maintain and monitor pipeline effectiveness, not visual appeal.

For instance, it's easy to visualize the data across every pipeline job that runs in an attractive timeline graph, but if your pipelines are running multiple times a day across different builds and environments, the information will quickly become overwhelming and difficult to visualize effectively. Instead, you can showcase the pipeline pass rate and run times as metrics and then use your graphing to visualize only the problematic pipelines, so you can better explore what is happening there.
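One way to decide which pipelines deserve a graph is to filter them by threshold first. The sketch below does this in Python. The `PipelineSummary` shape and the default thresholds (a 95% pass rate and a 15-minute average duration) are assumptions that you would replace with your own benchmarks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineSummary:
    name: str
    pass_rate: float  # percentage of passing runs, 0-100
    avg_duration_seconds: float


def problematic(
    summaries: List[PipelineSummary],
    min_pass_rate: float = 95.0,
    max_duration_seconds: float = 900.0,
) -> List[PipelineSummary]:
    """Pipelines that miss either threshold, worst pass rate first."""
    flagged = [
        s for s in summaries
        if s.pass_rate < min_pass_rate or s.avg_duration_seconds > max_duration_seconds
    ]
    return sorted(flagged, key=lambda s: s.pass_rate)


summaries = [
    PipelineSummary("web-app", 99.2, 420.0),
    PipelineSummary("payments", 88.5, 610.0),    # low pass rate
    PipelineSummary("mobile-api", 97.0, 1500.0), # slow
]
to_graph = problematic(summaries)
```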

Visualization is also something that helps to identify things that stand out quickly but doesn’t necessarily provide you with all the information you may need to debug a situation. That is where the logging mentioned earlier in this article becomes important and provides more specific data should it be needed.

I’ve provided some examples of dashboards that could provide good visualization of your CI pipelines. The dashboards below were all created in Grafana, but these sorts of visualizations can be represented in other tools. The examples do, though, showcase the benefit of being able to configure the look of dashboards to better match your needs rather than relying on a generic dashboarding template which only provides a limited scope.

Some examples:

If you want to analyze trends, the above dashboard idea could prove quite useful. There are a few graphical dials that bring color, but the focus is really on analyzing dips and outliers that are often only visible when doing trend analysis. This can be important because if you were to base a metric on a simple pass rate or performance average, you might be happy with the overall values but miss the spikes that may not be frequent but could prove significant in the long run, especially at scale.
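The point about averages hiding spikes can be shown with a few numbers. In the Python sketch below, the build durations are invented for the example, and a run counts as a spike when it takes more than twice the median, which is an arbitrary rule chosen for illustration.

```python
import statistics
from typing import List


def find_spikes(durations: List[float], factor: float = 2.0) -> List[int]:
    """Indexes of runs that took more than `factor` times the median duration."""
    median = statistics.median(durations)
    return [i for i, d in enumerate(durations) if d > factor * median]


# Build durations in seconds for ten consecutive runs (made-up data).
durations = [300, 310, 295, 305, 900, 298, 302, 301, 299, 1200]
average = statistics.mean(durations)
spikes = find_spikes(durations)
```

Here the average works out to 451 seconds, which might pass a loose benchmark, while the fifth and tenth runs took three to four times as long as a typical run. Note that `statistics.median` raises an error on an empty list, so guard against that if a pipeline may have no runs in the window.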

This is an example of a dashboard that provides a good mix of visuals and information. Not everything needs to be displayed in a graph and sometimes just providing information in a text or numerical format provides you with all you really need to know, with a color grading to know what to pay attention to. Where specific graphs or trends are needed, they are available. It’s a simple way of ensuring things are healthy, giving visibility to the numbers you need - without overwhelming people with data. When numbers get worrying, alerts can still be set up so that triggers are put in place.
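The color grading described above usually comes down to a pair of thresholds per metric. A minimal Python sketch, assuming a metric where higher is better (such as pass rate) and thresholds that you would choose yourself:

```python
def grade(value: float, warn_below: float, alert_below: float) -> str:
    """Map a higher-is-better metric to a traffic-light color."""
    if value < alert_below:
        return "red"
    if value < warn_below:
        return "amber"
    return "green"


def needs_alert(value: float, warn_below: float, alert_below: float) -> bool:
    """Only red values trigger a notification; amber is just highlighted."""
    return grade(value, warn_below, alert_below) == "red"


status = grade(96.5, warn_below=98.0, alert_below=90.0)
```

For metrics where lower is better, such as build time, the comparisons would be reversed.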

Here is a dashboard idea that has a mixture of visualization techniques that includes easy-to-read numbers with color coding to highlight the health of these numbers against predetermined benchmarks, while also using some graphical visualizations to showcase issues a little more clearly. It’s a good mix of different techniques that can provide interesting information, though there is the concern that the pipeline duration graph at the bottom right is perhaps showing too much information and should probably only showcase the problematic pipelines rather than trying to show everything.

This is an example of a dashboard that perhaps does too much. The information here is tracking the performance of the servers running the pipeline jobs and while the information here is quite detailed and well-visualized, it’s difficult to get a sense of where specific issues might lie. Information like this could be useful for debugging performance concerns, but it's likely that teams are going to struggle to focus on finding the problems here as there is too much data and it is difficult to correlate what is going on.

All this information should give you the start you need to try and implement observability in your pipelines. There are lots of different approaches to doing this and the important thing is that you as a team and company work on identifying the information and strategies that work best for you - with a goal to refine and improve everything as you go along. If you’re willing to improve and refine, you will eventually land with not just the right monitoring for your CI pipelines, but also the information you need to improve their utilization too.
