Usability of KPIs and SLAs for CI/CD
If a company operates CI/CD as a central service, SLAs are usually specified for it - as for any other company service. These generally include targets for availability, performance, and/or response times to incidents. SLAs originated in the IT world and have been in use for quite some time, especially in hosting, where the conditions for providing a service are sufficiently well understood and definable.
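As an illustration, here is a minimal sketch of how such an SLA specification might be captured in code; the structure and all target values are purely illustrative assumptions, not taken from any real agreement.

    from dataclasses import dataclass

    @dataclass
    class ServiceSLA:
        """Illustrative SLA for a central CI/CD service.

        All targets below are made-up example values.
        """
        availability_target: float       # e.g. 0.995 = 99.5 % per month
        max_response_time_s: float       # performance: respond within n seconds
        incident_reaction_time_min: int  # first reaction to an incident

    cicd_sla = ServiceSLA(
        availability_target=0.995,
        max_response_time_s=2.0,
        incident_reaction_time_min=30,
    )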
This does not apply in the same way to KPIs. Formulating them for an application service is generally not that easy. What's more, in an agile or DevOps environment, KPIs require constant feedback, and that feedback keeps modifying the initially selected KPIs. Over time, as the CI/CD service matures, established metrics take a back seat and other - usually "superior", more complex - metrics gain importance. This is normal and no reason to panic.
KPIs are then often used to drive employee evaluations and/or bonuses. For a long time these targets came down from the upper hierarchical levels; increasingly, DevOps teams now develop KPIs together, subject to constant feedback, so that they can change during operation.
The most important mantra for DevOps is "Measure": a large number of metrics are recorded and evaluated with the help of suitable tools.
For CI/CD as a service, this means recording a host of metrics from a variety of systems (a minimal sketch of how such metrics can be recorded follows this list):
Upstream systems such as SCM, LDAP, mail, HTTP proxy, and ticketing
Infrastructure such as build servers, agents, and test machines
Performance data from the application in production environments
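As a minimal sketch of what recording such metrics can look like in practice, the following uses the Python prometheus_client library to expose two hypothetical build metrics; the metric names and the port are assumptions for illustration.

    from prometheus_client import Counter, Histogram, start_http_server

    # Hypothetical metric names for a CI build server.
    BUILDS_TOTAL = Counter(
        "ci_builds_total", "Number of CI builds", ["result"]
    )
    BUILD_DURATION = Histogram(
        "ci_build_duration_seconds", "Duration of CI builds in seconds"
    )

    def record_build(duration_s: float, success: bool) -> None:
        """Record one finished build for Prometheus to scrape."""
        BUILDS_TOTAL.labels(result="success" if success else "failure").inc()
        BUILD_DURATION.observe(duration_s)

    # Expose /metrics on port 8000 (a real exporter would keep the
    # process alive after this call).
    start_http_server(8000)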
Once all these measured values have been recorded, the work begins...
Defining indicators correctly
SLA definitions must be mapped to the measured data: does "available" mean that the system in question is reachable at all, or that it responds to a defined request within a defined maximum time, for example? This corresponds to formulating a "Definition of Done" (DoD) in the agile world.
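To make the mapping concrete, here is a minimal sketch of such an availability check; the health endpoint URL and the timeout are placeholder assumptions.

    import requests

    def is_available(url: str, timeout_s: float = 2.0) -> bool:
        """'Available' per our DoD: a defined request (GET on a health
        endpoint) must succeed within a defined maximum time."""
        try:
            response = requests.get(url, timeout=timeout_s)
            return response.status_code == 200
        except requests.RequestException:
            return False

    # e.g. is_available("https://ci.example.com/health")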
It is equally important to find an equivalent for each KPI in the collected data. An "indicator" is not an absolute measured value; it is a prompt to take a closer look. If there are deviations (usually along the time axis), always investigate the reason instead of simply accepting the value.
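A hedged sketch of this idea: flag a measured value only when it deviates notably from the recent trend, as a prompt to investigate. The window size and threshold below are arbitrary example choices.

    from statistics import mean, stdev

    def needs_a_closer_look(series: list[float], window: int = 10,
                            sigmas: float = 2.0) -> bool:
        """True if the latest value deviates from the recent trend.

        The result is a prompt to investigate the reason, not a
        judgement about the value itself.
        """
        if len(series) < window + 1:
            return False
        history, latest = series[-window - 1:-1], series[-1]
        mu, sd = mean(history), stdev(history)
        return sd > 0 and abs(latest - mu) > sigmas * sd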
Why KPIs are not quite so simple
In larger companies there is a tendency to derive assessments or variable salary components (bonuses) directly from KPIs. This is often too short-sighted, however. Many of the values in the overview below sound plausible at first, depending on your point of view, but on closer inspection, and taking human nature into account, they reveal some weaknesses.
Lines of code per developer per day - this one actually came from a highly paid consulting firm and was fortunately rejected, because it was obviously nonsense.
Cost allocation by usage - if you want to establish a service, you should not charge by utilisation but rather make non-use the thing that needs justifying: bill everyone a share of the service costs, so that those who don't use the service have to explain why.
Build duration - influenced by too many different factors, such as the number and thoroughness of tests, parallelisation within the build, availability of resources, etc.
Number of errors in a component per iteration - not a good indicator, because it depends too much on individuals and environmental conditions. It may, however, be useful for improving the process, e.g. committing/pushing only once all tests have been run locally.
Number of tests - the number of tests can easily increase without the quality actually increasing.
Test coverage - suitable as a sole criterion only under certain conditions. What matters more is that the value improves continuously. It is also important, however, to have a common definition of what is to be tested and how.
Ticket handling time - typically causes tickets to be closed mercilessly without actually fixing the problem in question. A combination of measured values that takes into account the steps within the workflow, including loops, as well as other factors, is better.
Errors found in production - an analysis of why errors are not found until the system has gone live would be more useful here.
Disabled tests / number of tests per release - if anomalies show up, this is a good moment to look at the causes: is the code currently being refactored, or are new third-party libraries in use that make some existing tests unusable without adaptation? A comparison with the previous release is worthwhile here.
Architectural index / maintainability index (e.g. from SonarQube) - a very good indicator of code quality, but not of other aspects of the application.
Number of known vulnerabilities per release and application, broken down/weighted by severity - realistically, you should measure only the improvement, not the absolute value (see the sketch after this list).
Infrastructure utilisation - depending on available resources, measuring utilisation generally makes sense. However, the interpretation depends on many details: a static bare-metal or VM infrastructure has to be evaluated differently than a Kubernetes cluster, for example.
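For the vulnerability metric above, here is a minimal sketch of measuring the improvement between releases rather than the absolute value; the severity weights are invented for illustration and would have to be agreed on by the team.

    # Purely illustrative severity weights.
    SEVERITY_WEIGHTS = {"critical": 10, "high": 5, "medium": 2, "low": 1}

    def vulnerability_score(findings: dict[str, int]) -> int:
        """Weighted sum of known vulnerabilities for one release."""
        return sum(SEVERITY_WEIGHTS.get(severity, 0) * count
                   for severity, count in findings.items())

    def improvement(previous: dict[str, int], current: dict[str, int]) -> int:
        """Positive = the release improved; this delta, not the
        absolute score, is what gets tracked."""
        return vulnerability_score(previous) - vulnerability_score(current)

    # e.g. improvement({"high": 3, "low": 7}, {"high": 1, "low": 8})  # -> 9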
Visualisation of KPIs - selected examples
The following figures show examples built with a combination of Prometheus and Grafana; use of the ELK stack (Elasticsearch, Logstash, Kibana) is also common in this context.
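As a sketch of where such dashboards get their numbers, the following queries the Prometheus HTTP API from Python; the server URL and the metric names (matching the histogram sketched earlier) are assumptions.

    import requests

    PROMETHEUS = "http://localhost:9090"  # assumed local instance

    def instant_query(promql: str) -> list:
        """Run an instant query against the Prometheus HTTP API."""
        response = requests.get(
            f"{PROMETHEUS}/api/v1/query", params={"query": promql}
        )
        response.raise_for_status()
        return response.json()["data"]["result"]

    # Average build duration over the last day, assuming the
    # ci_build_duration_seconds histogram from the earlier sketch:
    average_build_duration = instant_query(
        "rate(ci_build_duration_seconds_sum[1d])"
        " / rate(ci_build_duration_seconds_count[1d])"
    )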