Pereiti prie pagrindinio turinio

The last resort of capturing database slow queries

In Visma, we have a strong engineering culture, and one of the main pillars is automation. In fact, we are fascinated to automate things! We try to create an environment where our engineers could be more focused on building solutions based on data-driven decisions. Let’s look at it in action.

Case: slow database query

In the Visma InSchool product, the database is an essential component for a successful business. To be clear on the technical side, we use MySQL (AWS RDS Aurora). Most of the access patterns are direct user’s input, which makes a user wait for a result. Hence, if database queries are slow – we have a challenge! The slowness could be incurred by many factors, such as ineffective access patterns, data models, poor connectivity, data size, or inefficient queries.

  Today, let’s discuss inefficient queries.

No alt text provided for this image

While there are proactive ways how to capture inefficient queries in an early development phase, for instance doing analysis by running explain plan or running performance tests, there is still a chance that something could be missed due to confirmation bias and queries that appeared to be fine in a sandbox, could induce the opposite results in a live system.

In the past, we only used AWS Performance Insights to tell us about slow database queries, but it was a sporadic activity that was more of a reaction to a given incident in a live system. Also, it wasn’t fluent – you have to know how to use it. So we knew that it was not enough, and we wanted something simple, yet effective. Moreover, we didn’t have a common understanding of „how we are doing“ on a daily basis. All of that we wanted to improve. We found out about slow query log that is supported by MySQL. We were fascinated to integrate it into our development routine, but how?

Note: same logs are supported in PostgreSQL and MS SQL.

Slow query log

It’s a table whose main idea is to capture all DML operations whose execution time is slower than a specified query time, so it gives us a clue about who (a caller – schema user) when (query execution date-time), what (DML plain-text with original content), and how much time a query spent on execution.

While it could sound like a plan, first listen to this limitation on MySQL – when a single leader replication model is used as row-based replication, replicas don’t capture queries in log. If that’s the case for you, consider switching to statement-based replication as a possible solution.

Plumbing system

So, let’s back to the use-case. How do we ship logs information to our engineers? What we knew is that we didn’t want to make it a manual chore that could be easily forgotten! So we had to automate it. Hence, the main idea we wanted to achieve is to automate reading from a slow log table (that we’ve already known could help us!), and somehow let a target group be aware of what’s up.

Figure 2, shows a big picture of how everything is integrated together:

No alt text provided for this image
  • We use Jenkins for a plumbing system
  • Our main report logic is python scripts that read from a database and generate HTML files
  • Python scripts are executed in a docker environment
  • Jenkins publishes a final HTML report using HTML publisher plugin
  • Dedicated slack channel posts a link for everyone to check

Hence, database queries that were captured equal or higher a threshold bar, were considered as ‘slow’ and required attention from development teams. From the early days, it gave us useful insights. Our development teams started tackling slow queries and soon we saw empty reports the other weeks. It enabled us to lower a threshold bar (increase performance expectations), but soon enough we hit a use-case where query couldn’t be improved – it was a batch-alike user-case and longer execution time was expected. We call these queries “false-positives”. Since there are only a handful of these queries, we decided to leave them in a report to stay on top of it – it constantly appears, everyone knows that it’s a “false-positive”, but also we don’t want that query to become worse, hence we decided to keep an eye on it.

Housekeeping rules

A report is not accessible for everyone – a special permission grant is needed on a dedicated AD Jenkins user. The main reason is that queries could contain sensitive data and it’s best to limit its exposure to the only target group.

Main housekeeping rules we apply for these reports:

  1. Create a JIRA ticket for every slow query in a backlog and treat it as technical debt.
  2. Remove or change sensitive parts before sharing SQL for a wider audience.
  3. Do not share whole reports in JIRA, Slack, etc.
Conclusions

While most of the time we strive to prevent „bad“ queries from going to a live system, it could be challenging to verify if a query is „bad“ with a limited view – sandbox, confirmation bias („how we perceive a scope – involved variables“), etc. Hence, we need to have an automated plumbing system that gives us a clue what queries are slow in a live system. It could be combined with other metrics (like throughput, number of users) and recently released features or improvements, to have a better understanding of the live system’s behavior – what variables impact what part of the system.

We have built a plumbing system that makes slow query logs’ data easily accessible. That data informed our decisions that helped us to improve the user experience. So far it has helped us correlate data and spot – deadlocks, missing database indexes that improved the average response time of our system by 40% from 100 (ms) to 60 (ms). We are still curious how far we can go with it! And we are actively looking at other metrics that could inform our future decisions!

Interested to hear more?

Let’s chat! We have open positions that you might be interested in. Currently, we are looking for QA, Java, and Frontend engineers to join one of our cross-functional product teams. QA will work with their assigned dev team and the wider QA team will build and maintain automated tests to cover their modules – including Functional Regression, API Acceptance, Scalability, Performance Benchmarks, Penetration, and Mobile. QA will participate in QA Code Review process and QACore team rotation where will assist in the development of test, deployment, and release frameworks. Java and Frontend engineers will contribute to critical areas by adding new features, solving complex problems from a backlog. Collaborate with team members in Scrum routines and beyond – code reviews, weekly knowledge sharing sessions, trysdays, software architecture discussions.

Populiariausi blog'ai