Monday, April 30, 2012

Successful application performance tuning

Non-technical but useful to techies
In this entry I am going to outline a strategy for successful application performance tuning. This is deliberately non-technical, but my target audience includes the technical people involved in fixing these problems. For the most part, performance tuning failures do not spring from a lack of technical know-how; they spring from the lack of a strategic approach.

The key steps are easily laid out, but there are pitfalls along the way that must be avoided.

The overview is:
  1. Assemble a team
  2. Define the problem
  3. Reproduce the problem
  4. Investigate the cause
  5. Propose a solution
  6. Test for success
Depending on the scale of the problem or the complexity of the application, some of the steps below may not be applicable, but the general outline will be helpful for a tuning exercise at any level.

Assemble a team
It is essential to avoid the pass-the-parcel approach that is so often used in tackling performance issues: the users complain about the issue, management asks the application support team to investigate, they in turn ask the DBA team, who talk to the Unix team, and so on.

Performance issues generally arise from a combination of different components within a system, and their resolution takes a combination of skills. Depending on your organization or the application being analyzed these people may have different titles, or one person may provide several of the skills needed, but it breaks down to this:
  1. Business Analyst, someone who knows what the system is supposed to do for the business and preferably has direct contact with the user community.
  2. Functional Application support, someone with knowledge and understanding of the functional aspects of the application.
  3. Technical Application support, someone with a technical understanding of the application infrastructure.
  4. Database support, someone with knowledge of the underlying database.
  5. OS support, someone who can provide insight into the operating system and hardware resources.
  6. Developer and/or vendor support. If the application is supplied by a vendor and they cannot be on site for the whole process, then a clear channel of communication, preferably with a named individual, should be set up.
  7. Network Support, although not normally required throughout the process, it is good to have involvement from this area from the start.

Defining the problem
A "What not to do" note. You do not yet know what the problem is. So do not start solving it. It is very easy in the initial meetings of the team to begin to discuss things such as CPU load or bad SQL, lack of memory etc etc. Do not do this.

The first step is to properly identify what the problem is. Usually the response to the question "What is slow?" is vague at best. Quite often "It's just generally slow" is the reply. Most of the time this is wrong.

It seems obvious that you must first identify what is slow before you can fix it, but a lack of clear definition at this stage is central to the failure of many a performance tuning effort. This is the point where you define the scope of any activities. It is also the starting point for communication with the user community. The other vital component that comes from this step is a set of criteria to judge success by.

In short, if the problem definition is "It's slow" and the success criterion is "It's faster", then the outcome will likely be failure.

If at all possible, direct communication with the users should be established. By this I mean that someone from the team should visit the users, sit down with them as they work, and actually see the problem in action. This has the added benefit of letting the users know that someone is working on their problems. Get out from behind your email client and go sit with your user community.

Your goal is a set of clear Problem Statements that define the problem or problems. Avoid generalized statements such as "The Order Entry screen is slow".  Also, we are not looking for Use Cases such as might be required for functional testing or performance load testing. An example of the output you want might be:
  1. In the Order Entry screen, when a client name is searched for, it takes over 30s to return results.
     A user enters the Order Entry screen and searches for the client name by entering the first two characters of the name in the search box. When the user presses the Search button it takes over 30s to return results.

It is also useful to add a screenshot of the problem page and the menu choices taken to arrive at it. Where applicable, the username used to log in and any group information, such as business units, should be noted.

The number and selection of your Problem Statements will depend on the nature of the application and the performance issues. If there are many areas where performance is a problem it is best to list a fixed number of targets in key areas of the application. Indeed, creating a Top 10 list with the users can be very helpful if you are struggling to properly define Problem Statements. Even though you may not be targeting all the possible issues that users are experiencing, you need a focused list to perform a proper analysis with.

Reproduce the problem
Another "What not to do" note. Do not try and reproduce the problem with all your analysis tooling switched on. This runs the risk of giving initial false results.

The first goal you have is to reproduce the problem on a test environment. While it is possible to investigate on a production system it is often very difficult to isolate the activity of the Problem Statements in a busy production environment.

A test environment must be as close as possible to the production one. This is the case not only for hardware infrastructure but also for data. Care should be taken to reproduce not only the size of the data set but also its skew. If you have ten business units in an application, you will often find that most of the orders come through only one or two of them. When querying order history it is likely that performance will be poorer for business units with large numbers of orders.

Ideally a copy of the production system would be used, but this opens up security issues in creating a test environment with production data. Tools such as Oracle's Data Masking Pack can duplicate a production database and at the same time mask the underlying data, giving you a realistic test environment without the danger of exposing production data.

Care should also be taken with parameter files and application options. Again, these options should preferably be a copy of a production set adjusted to match the test environment. Be wary of how these parameters are changed. Any such changes should be recorded and tracked with the team.

Once the test environment is set up, a backup should be taken and used as a reference point for all future tests and results reporting. This is important because as testing continues you will normally be changing the underlying data set. Each test cycle should start from the same base of data and parameters.

You should initially set out to reproduce each Problem Statement. In doing so, look to define the conditions under which the problem occurs. Is it always for a given client number, or a particular business unit? For a search, does it matter how many characters are searched on or the number of results returned? Remember also the effect of caching at various levels of the infrastructure. Try to level the playing field here by repeating tests with and without data cached. The goal of this activity is not only to reproduce the problem but also to begin to understand the workings of the application with the underlying data set and infrastructure.

In some cases the Problem Statement may include a time-of-day factor or the problem may be intermittent. If such cases cannot be immediately reproduced then notes should be taken on the tests made and the results. Work should then continue onto the next phase.

Investigate the cause
A good first step here is to set up automated tests using tools that can replay the test cases and properly time them. These automated tests should use the data gathered in the previous steps as input. Their output should become the initial baseline for reporting the impact of tuning activity in the team's final report.

This area is very technical and I will leave the details to another time, but some useful approaches can be outlined:

  1. Make very sure that you are working together and not against each other. In particular coordinate when changes and testing will occur and when the environment will be set back to its baseline start point.
  2. Make use of available profiling tools, tracing and other data gathering.
  3. Work closely with the developers and use any data gathering tools built into the application. If none are available discuss with them how they might be added.
  4. Reuse this work. As you begin to build up data points think also about how this data might be reused later in gathering and reporting production performance metrics.
  5. Find and examine changes in behavior closely. Work to explain why a particular case behaves well with one set of options or data and poorly with another. This will greatly aid your understanding of the problem.
  6. Extrapolate and test. If you have a working theory on the nature of a problem you should be able to predict under what circumstances it will worsen or improve. Create these circumstances and test and prove your theory.
Be aware that you cannot hope for perfection, so you need to focus your efforts on where the majority of the problems lie. Here your data gathering abilities come into play. Measure each part of the system per Problem Statement to determine where the elapsed time is being used up. In a modern Java environment, for example, you want to find out how much CPU time is recorded at the web layer, how much at the application tier and how much at the database. If, out of 30s elapsed time, 25s are spent in CPU on the database, then that will become your focus in resolving the problem.

At this stage you should have far more knowledge about where the problems lie. You can now revisit the Problem Statements in the previous stage that you could not reproduce to see if, armed with your new knowledge, you are better able to do so.

Try and build a picture across Problem Statements. Are there correlations in where time is most spent? Relating the data and information gathered across your different tests can help you pinpoint more structural issues in the application or its framework.

Propose a solution
One last "What not to do" note. You are not now armed with a set of weapons with which to attack a group or groups of people that you think "caused" the problem. A solution will only come about if you can get agreement from all on what the problem is and how to tackle it. Be sensitive about the vested interests of those involved and try and bring them with you.


Unless you are particularly lucky, there is probably no quick fix that can be easily implemented. If there were, it would normally have been found and put in place a while ago. You are faced, then, with what might be a considerable requirement for time and effort from one or more groups to implement a solution. It is therefore important how you present your findings.

The data gathered and the information about the solution are only a part of this. You also need to show what work was done to ensure that your findings are correct, and to demonstrate why the solution is the best way forward and, in particular, why smaller incremental changes will not work. There will be a strong tendency, driven by a need to reduce costs, to accept small changes in place of your proposed solution. Your report should demonstrate, with evidence, why this will not resolve the problem or will only create a short-term fix.

It is all too easy for those required to fix the problem to feel under attack. This is particularly the case for vendors or other external parties. You should discuss fully the contents of the report with the groups involved in the solution prior to publishing. Gain their trust and agreement on the data presented and the proposals being made. Ensure that everyone is of one voice in this.

Use the communication opened with your user community to understand how changes will impact them. It may be that changes in what information is presented or how work is done are a part of the solution. If that is the case you will need to demonstrate that the changes will not diminish the business fit of the application. Having user approval prior to the report being finalized will help in this area.

Also examine with the team what options exist for a staged implementation. Your investigation will have brought a better understanding of which problem areas are the most important. Can these be tackled first, or does the solution have to be a big bang?

The job does not finish when the report is published. Organize follow-up meetings to discuss it, involving the decision makers and preferably the project managers who will be required to implement the solution. The important output here is a clear set of goals and checkpoints for the project that will ensure the underlying issues are satisfactorily dealt with and not merely masked. In the case of external suppliers, this set of success criteria will have to be agreed with them.

Test for success
As all or part of the solution is implemented you should reuse the test environment built for the reproduction and investigation phases to test the results of the changes. It is important here to have a level playing field or to be able to account for differences. If there has been a length of time between the investigation and the solution you may be looking at a larger data set than had previously existed. As much as possible you want the new test results to be comparable to the previous ones.

The tests of the solution should not simply be about timings. They should also reuse the expertise gained in the initial investigations. You are looking to ensure that the same or similar mistakes that caused the original problems have been avoided.

Be wary of fixes that are implemented only to fit the test cases. Go back to your reproduction phase and test for conditions outside the test cases. Try to ensure that you have not broken something else in order to solve one or more of your Problem Statements.

Finally, you now have an infrastructure and knowledge base for ongoing micro testing of performance. Make sure this is now a documented part of your test and release process.


Friday, March 9, 2012

Database performance, measuring the effects of network latency

In Oracle 11gR2 there are 16 different wait events associated with SQL*Net traffic. The two most common are SQL*Net message to client and SQL*Net message from client. Without additional data neither of these events is useful in measuring the impact of network latency on application performance. The first measures only the time to pass the message to the operating system, not to actually send the message across the wire. The second is the time spent waiting for the client to send a message, so it does not tell us how long the message took to send.
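For reference, a small script along the lines below will list those waits and their accumulated times from v$system_event (the 'perf' TNS alias and the credentials are placeholders). It shows how much time the instance has spent on SQL*Net events, but as noted above it still tells us nothing about the time a message actually spends on the wire.

#!/usr/bin/perl
# Sketch only: list the SQL*Net wait events and their accumulated times
# from v$system_event. The 'perf' TNS alias and the credentials are
# placeholders; these are instance-wide totals, not per-session figures.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:perf', 'scott', 'tiger', { RaiseError => 1 });

my $sth = $dbh->prepare(q{
    SELECT event, total_waits,
           ROUND(time_waited_micro / 1000000, 2) AS seconds_waited
      FROM v$system_event
     WHERE event LIKE 'SQL*Net%'
     ORDER BY time_waited_micro DESC
});
$sth->execute;

while (my ($event, $waits, $secs) = $sth->fetchrow_array) {
    printf "%-40s %12d waits %12.2f s\n", $event, $waits, $secs;
}
$dbh->disconnect;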


Yet we do know that there is an impact. If it were suggested to place the application tier of a performance-critical application on one continent and the database on another, there would be immediate concern about latency. We can measure the latency of a network connection using network tools, but this alone does not give us real data about what effect that latency will have on user experience.


I decided to delve into this a little further, to see first how we might measure the issue and then later to use those measurements to get a better understanding of what impact latency has.

Gathering the data
The tests were executed on three separate databases, each with the same hardware configuration and init.ora parameters. The first database, Local, was on the same machine as the test scripts. The second, LAN, was on a high performance local LAN which is expected to have minimal amounts of latency. The third, WAN, was over a Wide Area Network which has an expected latency of 10-12ms.



The test scripts were written in Perl, using the DBD::Oracle package. The timings were taken using the Time::HiRes package, which delivers timings in microseconds. The time measured was from the start to the end of each iteration.
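To give an idea of the form these took, here is a minimal sketch along the same lines (not the actual scripts; the TEST_2000 table, the 'LOCAL' TNS alias and the credentials are placeholders). The same loop can be pointed at the Local, LAN and WAN connections, changing only the connection alias and the fetch size.

#!/usr/bin/perl
# Sketch of the timing harness: select all rows from a 2000-row test table
# and time each iteration with Time::HiRes. Table name, TNS alias and
# credentials are placeholders.
use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

my $dbh = DBI->connect('dbi:Oracle:LOCAL', 'perfuser', 'secret',
                       { RaiseError => 1 });

# Rows returned to the client per fetch round trip (4, 20 or 200 in the tests).
$dbh->{RowCacheSize} = 4;

for my $iteration (1 .. 10) {
    my $t0 = [gettimeofday];

    my $sth = $dbh->prepare('SELECT id, payload FROM test_2000');
    $sth->execute;
    my $rows = 0;
    $rows++ while $sth->fetchrow_arrayref;

    printf "iteration %2d: %5d rows in %.6f s\n",
           $iteration, $rows, tv_interval($t0);
}
$dbh->disconnect;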

The Output
SELECT 2000 rows, Fetch size 4 rows


SELECT 2000 rows, fetch size 4, timings (s)
In this test a single SELECT statement is executed against a table containing 2000 rows. The fetch size is set to four rows, meaning the rows are returned to the client four at a time. This results in 503 round trips from the application to the database. The total includes parse and execute round trips and an extra fetch returning no rows that signals the end of the cursor to the client. You can see from the results that there is a noticeable difference even between the Local and LAN times. The WAN timing is over 4.5s, not an acceptable response time for such a simple query.

SELECT 2000 rows, Fetch size 20 rows


SELECT 2000 rows, fetch size 20, timings (s)
Here the test is repeated, but this time fetching 20 rows at a time. There is an immediate benefit across the three test sets. Even in the Local test the response time has decreased. In the Local test the connection does not pass through any network, but there is still an overhead in process communication and in CPU time spent in re-reading blocks in memory to get the next rows. The WAN savings are of course the most significant. A simple parameter change has reduced the response time here to under 1s.

SELECT 2000 rows, Fetch size 200 rows


SELECT 2000 rows, fetch size 200, timings (s)
This time the test is executed fetching 200 rows at a time. Again there is benefit to be had, but at the Local and LAN level it is becoming less significant. It should be kept in mind that there is an overhead in memory usage the larger the fetch size. The client is required to pre-allocate enough memory to fit all columns at their largest byte size for all rows. Care should be taken not to create a huge memory footprint on the application tier just to shave off a few ms of elapsed time.
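For completeness, the fetch size in a DBD::Oracle script can be adjusted in a couple of ways. The fragment below assumes the $dbh handle from the sketch above; the ora_* attributes are DBD::Oracle specific, so treat this as a sketch and check that your driver version supports them.

# Fragment, reusing $dbh from the harness sketch above.

# Per connection: a DBI hint that DBD::Oracle maps onto OCI row prefetching.
$dbh->{RowCacheSize} = 20;

# Per statement, at prepare time. ora_prefetch_memory caps the prefetch
# buffer in bytes, so a large row count cannot blow out client memory.
my $sth = $dbh->prepare(
    'SELECT id, payload FROM test_2000',
    { ora_prefetch_rows => 200, ora_prefetch_memory => 2 * 1024 * 1024 },
);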

SELECT 1 row, multiple executions

In this test the script executes a single SELECT statement returning one row from a table containing only one row. Each execution includes the statement parse. Along with the execute and fetch steps, this is a total of three network To/From pairs per execution.
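A sketch of this test, with the same placeholder names as before, and with the prepare kept inside the loop so that each execution includes the parse, execute and fetch steps described above:

#!/usr/bin/perl
# Sketch of the single-row test: repeatedly prepare, execute and fetch a
# single row from a one-row table, printing the cumulative elapsed time.
use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

my $dbh = DBI->connect('dbi:Oracle:WAN', 'perfuser', 'secret', { RaiseError => 1 });

my $t0 = [gettimeofday];
for my $execution (1 .. 500) {
    # Prepare inside the loop so each execution includes the parse step.
    my $sth = $dbh->prepare('SELECT val FROM one_row_table');
    $sth->execute;
    my ($val) = $sth->fetchrow_array;
    $sth->finish;

    printf "%4d executions in %.6f s\n", $execution, tv_interval($t0)
        if $execution % 100 == 0;
}
$dbh->disconnect;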


This test cuts out, as far as possible, every factor other than network latency. For the WAN measurements only the first 500 executions, representing 1500 round trips, are shown. This was done purely to keep a reasonable scale for the chart.
As might be expected, there is a linear relationship between the number of network round trips and the elapsed response time. Even though the Local connection should show only negligible latency, the times become significant when there are very large numbers of interactions with the database. In the LAN test the response time climbs more steeply. The difference between the LAN and Local timings would probably not be significant for most applications, but where design faults lead to large amounts of network traffic it could become an issue. With the WAN configuration the degradation in response time becomes very significant very quickly.

But do the tests reasonably match the expected results? Although for Local and LAN there is no fixed expected latency, the WAN configuration has, as I stated at the start, an expected round-trip latency of 10-12ms. Below are the figures for the difference between the Local, LAN and WAN timings.


Delta times between configuration tests (s)
You can see that the delta between the WAN and Local timings is on average about 9ms per round trip. This is close to the expected 10-12ms; not spot on, but within a range that shows the tests are producing a reasonable estimate of the network latency. The figures are therefore sound enough to be used to estimate the effect of network latency on applications.

Finishing up

Network latency and network performance between the application tier and the database tier can be a significant factor in response times. The effect can be measured, and those measurements can feed into business cases for infrastructure improvements to reduce latency, or for re-factoring applications to reduce network round trips.


The second topic, that of application re-factoring, involves more than just network latency. It also includes other overheads that can have a far larger impact on overall performance. In a future blog I will return to the topic of how you might apply these measurements in real applications.




First entry, an introduction

Being new to the world of blogging I thought I would start with a quick introduction and an outline of what I aim to discuss on this blog.

I am an Oracle DBA with 20+ years of experience, covering a wide range of business areas and database functions. Over the years I have come to specialize in performance tuning and capacity monitoring. These are areas that are both interesting technically and challenging in how you communicate with other teams around the business.

Over the course of this blog I hope to share a few technical tips and tricks and investigate some aspects of performance that too often get neglected.

I also want to open up a discussion on how DBAs can be more effective in bringing about solutions to problems. In many applications the database is a central point that everything passes through. This gives the DBA a unique insight, but this insight can only be of benefit when we communicate effectively with the other teams involved.