How and why our team improved its software monitoring and support tools.

Applying ITIL and DevOps concepts together with Splunk.

Simon Mulquin
10 min read · May 20, 2020
Photo by Omar Flores on Unsplash

This is the story of how our development team decided to implement end-to-end monitoring to help us with our issue resolution problems and bring data back into our decisions.

The story follows a continuous improvement scheme: I will first describe the initial situation, then our improvement goals and the solutions we considered, before telling you how we actually implemented one and whether we achieved our goals.

Where did we start?

  • We already had some software running in production, and our dev team had to handle several issues each day.
  • The issue tickets were sometimes poorly worded and unfiltered, so they were nearly impossible to assign or correlate.
  • Most of our decisions were instinctive; we made decisions to solve problems we didn’t understand.
  • We tried to focus on our ongoing project but lost it all whenever an issue called “UsefullCoreBusinessSoftware is malfunctioning-143021”, tagged as CRITICAL, fell onto the board and everyone went crazy about it.
  • We struggled to get accurate data and context about what the users did, what they saw and in what environment.
  • In reaction, we spent hours in the database and source code trying to understand where the issue came from, only to close the ticket because “the issue isn’t reproducible”.

From the inside, it was a real nightmare, as it may be for your development and support team. Some may say that’s the job: software engineering is about fixing bugs and providing support, isn’t it?

Well, no, not only: software engineering is about creating value, improving, and reducing waste. Depending on a software’s designers spending a few hours a week fixing it again and again is waste, and there are plenty of tools and practices to help improve that.

Our problems were actually symptoms, so we needed to heal.

What situation did we aim for?

  • Easily sort our issue tickets by architectural component, so we can route them to the right team or assign them to the right person and avoid losing time passing the buck around the whole organisation.
  • Measure and visualize issue occurrences, especially between interconnected components, so we can build strong knowledge of our ecosystem’s performance and make wiser decisions.
  • Get a complete view of the issuer’s environment and locally loaded data, so we are able to debug faster, correlate with possible ongoing incidents and avoid being called in for incidents that aren’t real.
  • Have our system raise alerts on defined events or sequences of events. If something really critical happens, our team will be the first to know and might even prevent incidents.
  • Keep our dev team from being interrupted by new issue tickets. Critical tickets are assigned to the product manager so he/she can define a new requirement, a missing alert or a UX gap.
  • Filter the logged events by user, which gives us the ability to trace their interactions with the GUI, response and rendering times, and do/undo habits, and to leverage this information to improve UX.
  • These event sequences will also help us identify user journeys and define our A/B tests.

How did we plan to get there?

I was inclined to build a little monitoring service and tools myself, but didn’t get any support for that since it was not in our current project scope, so I started looking for existing solutions.

As we had first phrased it, we wanted to “centralize monitoring”, and I oriented my first research around those keywords.

A friend told me about Amplitude, but I was not convinced it was a good match for our team since it is more about empathizing with users than helping with day-to-day support, which was our major goal.

I also found articles about using the Elastic stack, which seems pretty cool, but knowing our local culture, I feared the resistance to change would be too heavy with such products.

I found other articles about Splunk; it seemed to match all our requirements and to be easy to implement and use… Even though I read a lot of good things about it, I was disappointed by their website, which was almost incomprehensible to me, so it left me with a mixed impression.

A common point between those three was that the costs were unclear, and I didn’t feel legitimate contacting sales as a newly arrived developer in a 5,000-employee company.

Of course they all offer free trials but, let’s be serious, we are only three developers and none of us had the time or the will to test every single solution on the market!

Now I know that using established terms such as end-to-end monitoring, service availability management and anomaly detection would have helped me find more solutions, whereas “centralized monitoring solutions” was anything but the phrase I should have used.

Unfortunately, we didn’t initially get approval for this improvement project, since it was seen more as an overkill, expensive software feature than as a long-term acquisition for the team.

How did we get there anyway? Implementing Splunk.

The hospital group I currently work for organises its IT department in three units and struggles to promote visibility throughout the organisation.

No shame here: we are around 70 IT people running more than two hundred software products, and our processes, where they exist, are mainly reactive and project-oriented. It is a hard task to build knowledge of our architecture, and we can’t just stop the continual run to clarify how things work.

What happened is that one day, I was rudely standing behind an infra guy who was trying to find out why my requests for Linux packages were being dropped by our firewall, when he opened a familiar-looking interface to help him figure it out… Splunk!

An actual solution to one of our problems was already in use inside the organisation, and neither we nor our manager had seen the opportunity in it or even noticed it was part of our portfolio.

In the meantime, we had noticed some issues occurring on smartphones and had reasons not to trust what the users said, but we were unable to take remote control of a phone to check the local data.

I immediately took these opportunities to ask again: there was almost no additional cost to implement Splunk, and we needed it to address an immediate problem, so the reactive mindset kicked in and I finally got the approval and resources.

We started to use the Splunk HTTP Event Collector to log new events such as user-triggered alerts contextualized with the care plans’ loaded data, web service requests and response times, accesses automatically granted and cancelled based on the company’s authentication and rights manager, …

Since the Splunk database served a support purpose, it was not meant to hold personal data but only monitoring and business intelligence data, and I had to make sure everyone in the team understood that, in our patients’ and users’ best interest.

We decided to keep the log payloads readable and non-technical, and to always add a message that matches the user’s own expression of what he/she did, while our metadata is more technical.

Note that we use jobs and proxies to survive a crash or network error without losing incoming logs, and to protect our access token.
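Here is a minimal sketch of what such a proxy could look like, assuming a small Flask service sitting in front of the Splunk HTTP Event Collector. The host, token, buffer path and endpoint name are made up for illustration, not our actual setup; the point is that clients only ever talk to the proxy, so the HEC token stays on the server, and events that cannot be forwarded right away are buffered for a retry job.

```python
# A hedged sketch of a logging proxy in front of Splunk HEC (all names and
# paths below are hypothetical). Clients POST their events to /log; the proxy
# holds the HEC token and buffers events to disk when Splunk is unreachable.
import json

import requests
from flask import Flask, request

SPLUNK_HEC_URL = "https://splunk.example.org:8088/services/collector/event"  # assumed host
SPLUNK_HEC_TOKEN = "00000000-0000-0000-0000-000000000000"  # kept server-side only
BUFFER_FILE = "unsent-events.jsonl"  # picked up later by a retry job

app = Flask(__name__)

@app.post("/log")
def forward_event():
    event = request.get_json(force=True)
    try:
        requests.post(
            SPLUNK_HEC_URL,
            json={"event": event, "sourcetype": "_json"},
            headers={"Authorization": f"Splunk {SPLUNK_HEC_TOKEN}"},
            timeout=5,
        ).raise_for_status()
    except requests.RequestException:
        # Splunk is unreachable or returned an error: keep the event on disk
        # so a scheduled job can resend it instead of losing the log.
        with open(BUFFER_FILE, "a", encoding="utf-8") as buffer:
            buffer.write(json.dumps(event) + "\n")
    return {"status": "accepted"}, 202

if __name__ == "__main__":
    app.run(port=5000)
```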

As an example (the structure of “event” is dynamic; it’s up to you to design it):
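Here is a hypothetical sketch of what one of these events could look like in Python; the field names, identifiers and endpoint are made up for illustration, and the client posts to the proxy sketched above rather than straight to Splunk.

```python
# A hypothetical event for the Irish coffee scenario discussed below (field
# names and values are illustrative only, not our real schema).
import time

import requests

PROXY_URL = "https://monitoring-proxy.example.org/log"  # the proxy sketched above

event = {
    "time": time.time(),
    # A message phrased the way the user would describe what happened.
    "message": "I tried to prepare an Irish coffee but the app crashed",
    # Readable, non-technical business payload.
    "payload": {
        "recipe": "Irish coffee",
        "ingredients": ["coffee", "brown sugar", "whipped cream"],
    },
    # More technical metadata, useful to correlate with environments and incidents.
    "meta": {
        "userId": "A123",
        "browser": "Chrome 67",
        "appVersion": "2.4.1",
        "responseTimeMs": 840,
    },
}

requests.post(PROXY_URL, json=event, timeout=5)
```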

If we get an issue ticket describing an incident while trying to prepare an Irish coffee, we can now look back with a simple search for “Irish coffee” or the user’s ID and see which ingredients the user tried to use, without wasting their time or ours trying to contact them or reading through the whole database.

Anyone with Splunk access and a minimum of knowledge could then tell the user he/she should have added some liquor (please don’t turn the comments into another debate on the best way to make Irish coffee :) ) and tell the dev team they had better add a warning than crash the app.

Another use of such event data could be to link an incident to a certain environment based on how often it occurs there; let’s say you couldn’t prepare an Irish coffee using Chrome 67, who would have guessed it?
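As a rough illustration, here is the kind of search that makes this possible, run here through Splunk’s REST search API; the index, field names and credentials are placeholders, and in practice such SPL queries can simply be typed into the Splunk search bar.

```python
# A hedged sketch: count "Irish coffee" incidents per browser through Splunk's
# REST search export endpoint (index, fields and credentials are placeholders).
import requests

SPLUNK_SEARCH_URL = "https://splunk.example.org:8089/services/search/jobs/export"

# If the count is heavily skewed towards one value (say Chrome 67), the
# incident is probably tied to that environment rather than to the recipe.
SPL_QUERY = (
    'search index=coffee_app "Irish coffee" '
    "| stats count by meta.browser "
    "| sort -count"
)

response = requests.post(
    SPLUNK_SEARCH_URL,
    data={"search": SPL_QUERY, "output_mode": "json"},
    auth=("support_user", "change-me"),  # placeholder credentials
    timeout=30,
)
print(response.text)
```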

As you can see, you could spare your developers a lot of deep-dive sessions just by letting your system run and accumulate useful data, so they can spend more time fixing than investigating.

End-to-end monitoring

Our previous support process relied on the dev team’s availability and on our understanding of badly designed data; we were often unable to meet our company’s service level agreement and resolve issues on time.

After I logged some events in Splunk and talked about its advantages with my coworkers, our manager mandated one of them to log more and more events in our first running software.

It didn’t take long before we used these logs: Manuel did a great job and logged clear and concise events that we were using to improve support only a week later!

In some situations, we are now able to understand incidents just by fetching our human-readable logs. It improves the process in many ways; here are some:

  • No more hours holding on to a ticket for issues that have to be fixed by other teams or partners, just to make sure we are really not responsible for it: logs don’t lie.
  • No more hours in databases or server logs looking for traces of an issue that may not even exist. Using SPL queries, we can now easily create views based on what happened rather than on an unreadable data model.
  • We are now able to understand mobile issues without fetching the affected device to use our dev tools on it.
  • We are beginning to better understand what we built, since we still have to figure out where and what to log to make good use of Splunk.

What next?

Although we still haven’t achieved all our goals and have just started with end-to-end monitoring, I am confident we will gain more experience with the amazing tool Splunk is and level up the maturity of our information system.

Now I am waiting for more opportunities to go further with the implementation, into what I consider the next steps: anomaly detection and service availability management.

Anomaly detection

The challenge here will be to set up quality standards based on our knowledge of the software in production, and to provide the automation adapted to them.

If, for example, we know that users tend to suffer when some socket service is overloaded and they can’t get real-time data (which is very important in the hospital business), we would like to measure our open socket connections or the server performance and estimate when the anomaly occurs.

Our first automation could simply be to send us an email when it happens too many times in a given interval, but going further we could also trigger the deployment of another instance to help the overloaded service, which we could do using some of our already implemented tools (Bamboo, Docker, Nginx, …).
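As a minimal sketch of that first automation, assuming a hypothetical endpoint exposing the number of open socket connections and an internal mail relay (every name and threshold below is made up), the logic could look like this; a Splunk alert on the logged events could achieve the same without custom code.

```python
# A hedged sketch of a threshold alert: poll an assumed metrics endpoint and
# send a mail when the socket service is overloaded too often within an hour.
import smtplib
import time
from email.message import EmailMessage

import requests

METRICS_URL = "https://realtime.example.org/metrics/open-sockets"  # assumed endpoint
MAX_OPEN_SOCKETS = 500       # assumed capacity before users start to suffer
MAX_BREACHES_PER_HOUR = 3    # how often the limit may be crossed before alerting

def send_alert(breaches: int) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"Socket service overloaded {breaches} times in the last hour"
    msg["From"] = "monitoring@example.org"
    msg["To"] = "dev-team@example.org"
    msg.set_content("Real-time data may be degraded; consider scaling the socket service.")
    with smtplib.SMTP("mail.example.org") as smtp:  # assumed internal relay
        smtp.send_message(msg)

breaches: list[float] = []
while True:
    open_sockets = int(requests.get(METRICS_URL, timeout=5).text)
    now = time.time()
    if open_sockets > MAX_OPEN_SOCKETS:
        breaches.append(now)
    breaches = [t for t in breaches if now - t < 3600]  # keep the last hour only
    if len(breaches) >= MAX_BREACHES_PER_HOUR:
        send_alert(len(breaches))
        breaches.clear()
    time.sleep(60)
```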

The whole point of this is for us to know when an incident happens before the organisation suffers from it, or to save time on resolution with self-handled scenarios, even starting to solve problems before we get any call.

It also has a security purpose, since you can use it to audit accesses, failed authentications, suspicious transactions, IT operations, …

Service availability management

We also aim to give the users and the support team visibility of our services’ states and of the known consequences of any services being down.

As an example, let’s say I am user A and I can’t log in to my application any more; I am fairly likely to ask for IT support. Now let’s say the same happens to users B, C, D, E, F, …

If each of them takes time to create a ticket (they work very far from one another), you are responsible not only for a loss of {ticket creation time * number of issuers} minutes to your company or clients, but also for wasted ticket processing time.

Service availability management is about letting the organisation know when a service will be down (e.g. for a major update or a server migration), so both your IT staff and your users benefit from this information.

You can implement such a thing with mails, web pages and all other kinds of alerts or visual indicators in the workspace for visibility. To measure it, you can rely on server monitoring, health checks through pings or HTTP requests, issue occurrence, …
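As a small sketch of the health check part, assuming each service exposes an HTTP health endpoint (the service names and URLs below are placeholders), the resulting states could feed a status page, a mail alert or a Splunk dashboard.

```python
# A hedged sketch of a basic availability check over HTTP (placeholder URLs).
import requests

SERVICES = {
    "care-plans-api": "https://care-plans.example.org/health",
    "authentication": "https://auth.example.org/health",
    "notifications": "https://notify.example.org/health",
}

def check_services() -> dict[str, str]:
    """Return a simple up/degraded/down state per service."""
    states = {}
    for name, url in SERVICES.items():
        try:
            response = requests.get(url, timeout=3)
            states[name] = "up" if response.ok else "degraded"
        except requests.RequestException:
            states[name] = "down"
    return states

if __name__ == "__main__":
    for name, state in check_services().items():
        print(f"{name}: {state}")
```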

It is important not just to provide this raw information but also to manage your dependencies and ensure the people concerned are aware of crashes or anticipated unavailability of your services.

Since we are not only talking about services being down but about the whole state of your services, it can be a plus if you also let people know when a service may be slower than usual, or down for some of your users but not others, …

Review

I hope this post inspired you; please share it and leave a comment if it did!

If you feel like something is wrong or inaccurate, or if you were hurt by my Netflix English, let me know so I can improve both the post and my English.

Feel free to connect with me on LinkedIn.

Here are some practical lessons I hope you take away from this post:

  • You should promote visibility throughout your organisation.
  • Think of opportunities, not only costs.
  • Splunk is one of many amazing tools that can help you reach a high-quality information system.
  • It is important to design not only the software but its monitoring and support.
  • End-to-end monitoring helps you make wiser, data-driven decisions and speeds up the issue resolution process.
  • Service availability management and anomaly detection help you relieve the pressure on your support team and better manage incidents.
