I see a lot of questions from users or administrators who have decided that they have a performance problem but don’t know where to start or what information to provide when they ask for help. I have seen email from people who just say “my system is slow” and give no additional information at all. I have also seen 10-megabyte email messages with 20 attachments containing days of vmstat, sar, and iostat reports, but with no indication of what application the machine is supposed to be running. In this section, I’ll lead you through the initial questions that need to be answered. This may be enough to get you on the right track to solving the problem yourself, and it will make it easier to ask for help effectively.
What is the business function of the system?
What is the system used for? What is its primary application? It could be a file server, database server, end-user CAD workstation, internet server, or embedded control system.
Who and where are the users?
How many users are there, how do they use the system, what kind of work patterns do they have? They might be a classroom full of students, people browsing the Internet from home, data entry clerks, development engineers, real-time data feeds, batch jobs. Are the end users directly connected? From what kind of device?
Who says there is a performance problem, and what is slow?
Are the end users complaining, or do you have some objective business measure like batch jobs not completing quickly enough? If there are no complaints, then you should be measuring business-oriented throughput and response times, together with system utilization levels. Don’t waste time worrying about obscure kernel measurements. If you have established a baseline of utilization, business throughput, and response times, then it is obvious when there is a problem because the response time will have increased, and that is what drives user perceptions of performance. It is useful to have real measures of response times or a way to derive them. You may get only subjective measures—“it feels sluggish today”—or have to use a stopwatch to time things. See “Collecting Measurements” on page 48.
What is the system configuration?
How many machines are involved, what is the CPU, memory, network, and disk setup, what version of Solaris is running, what relevant patches are loaded? A good description of a system might be something like this: an Ultra2/2200, with 512 MB, one 100-Mbit switched duplex Ethernet, two internal 2-GB disks with six external 4-GB disks on their own controller, running Solaris 2.5.1 with the latest kernel, network device, and TCP patches.
What application software is in use?
If the system is just running Solaris services, which ones are most significant? If it is an NFS server, is it running NFS V2 or NFS V3? (This depends mostly upon the NFS clients.) If it is a web server, is it running Sun’s SWS, Netscape, or Apache (and which version)? If it is a database server, which database is it, and are the database tables running on raw disk or in filesystem tables? Has a database vendor specialist checked that the database is configured for good performance and indexed correctly?
What are the busy processes on the system doing?
A system becomes busy by running application processes; the most important thing to look at is which processes are busy, who started them, how much CPU they are using, how much memory they are using, and how long they have been running. If you have a lot of short-lived processes, the only way to catch their usage is to use system accounting; see “Using Accounting to Monitor the Workload” on page 48. For long-lived processes, you can use the ps command or a tool such as top, proctool, or symon; see “Sun Symon” on page 35. A simple and effective summary is to use the old Berkeley version of ps to get a top ten listing, as shown in Figure 1-1. On a large system, there may be a lot more than ten busy processes, so list every process that uses a significant amount of CPU until you have captured 90% or more of the CPU consumption by processes.
% /usr/ucb/ps uaxw | head
USER PID %CPU %MEM SZ RSS TT S START TIME COMMAND
adrianc 2431 17.9 22.63857628568 ? S Oct 13 7:38 maker
adrianc 666 3.0 14.913073618848 console R Oct 02 12:28 /usr/openwin/bin/X :0
root 6268 0.2 0.9 1120 1072 pts/4 O 17:00:29 0:00 /usr/ucb/ps uaxw
adrianc 2936 0.1 1.8 3672 2248 ?? S Oct 14 0:04 /usr/openwin/bin/cmdtool
root 3 0.1 0.0 0 0 ? S Oct 02 2:17 fsflush
root 0 0.0 0.0 0 0 ? T Oct 02 0:00 sched
root 1 0.0 0.1 1664 136 ? S Oct 02 0:00 /etc/init -
root 2 0.0 0.0 0 0 ? S Oct 02 0:00 pageout
root 93 0.0 0.2 1392 216 ? S Oct 02 0:00 /usr/sbin/in.routed -q
Unfortunately, some of the numbers above run together: on the first two process lines, the %MEM, SZ, and RSS values are printed with no spaces between them. The %MEM field shows the RSS as a percentage of total memory. SZ shows the size of the process virtual address space; for X servers this size includes a memory-mapped frame buffer, and in this case, for a Creator3D the frame buffer address space adds over 100 megabytes to the total. For normal processes, a large SZ indicates a large swap space usage. The RSS column shows the amount of RAM mapped to that process, including RAM shared with other processes. In this case, PID 2431 has an SZ of 38576 Kbytes and RSS of 28568 Kbytes, 22.6% of the available memory on this 128-Mbyte Ultra. The X server has an SZ of 130736 Kbytes and an RSS of 18848 Kbytes.
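The rule of listing processes until 90% or more of the process CPU consumption is captured can be sketched in Python. This is only an illustration under stated assumptions: the function name busiest, its coverage parameter, and the SAMPLE variable are hypothetical, and the sketch reads just the first three fields of each line (USER, PID, %CPU) because the later columns can run together in the ps output.

```python
# Sample lines copied from the ps listing above (first five processes only).
SAMPLE = """\
USER PID %CPU %MEM SZ RSS TT S START TIME COMMAND
adrianc 2431 17.9 22.63857628568 ? S Oct 13 7:38 maker
adrianc 666 3.0 14.913073618848 console R Oct 02 12:28 /usr/openwin/bin/X :0
root 6268 0.2 0.9 1120 1072 pts/4 O 17:00:29 0:00 /usr/ucb/ps uaxw
adrianc 2936 0.1 1.8 3672 2248 ?? S Oct 14 0:04 /usr/openwin/bin/cmdtool
root 3 0.1 0.0 0 0 ? S Oct 02 2:17 fsflush
"""


def busiest(ps_output, coverage=0.9):
    """Return (user, pid, cpu) rows, busiest first, until `coverage`
    of the summed %CPU has been accounted for (hypothetical helper)."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 3:
            continue
        # Only USER, PID and %CPU are used; later columns may be fused.
        rows.append((fields[0], int(fields[1]), float(fields[2])))

    rows.sort(key=lambda row: row[2], reverse=True)
    total = sum(row[2] for row in rows)

    picked, running = [], 0.0
    for row in rows:
        if total == 0 or running >= coverage * total:
            break
        picked.append(row)
        running += row[2]
    return picked


if __name__ == "__main__":
    for user, pid, cpu in busiest(SAMPLE):
        print(user, pid, cpu)
```

With the sample lines, the two busiest processes (PIDs 2431 and 666) already account for more than 90% of the summed %CPU, so only those two are returned.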
What are the CPU and disk utilization levels?
How busy is the CPU overall, what’s the proportion of user and system CPU time, how busy are the disks, which ones have the highest load? All this information can be seen with iostat -xc (iostat -xPnce in Solaris 2.6; think of “expense” to remember the new options). Don’t collect more than 100 samples, strip out all the idle disks, and set your recording interval to match the time span you need to instrument. For a 24-hour day, 15-minute intervals are fine. For a 10-minute period when the system is busy, 10-second intervals are fine. The shorter the time interval, the more “noisy” the data will be because the peaks are not smoothed out over time. Gathering both a long-term and a short-term peak view helps highlight the problem areas. One way to collect this data is to use the SE toolkit. A script I wrote, called virtual_adrian.se, writes out to a text-based log whenever it sees part of the system (a disk or whatever) that seems to be slow or overloaded. See “The SymbEL Language” on page 505 and “virtual_adrian.se and /etc/rc2.d/S90va_monitor” on page 498.
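The arithmetic behind the 100-sample limit can be sketched as follows. This is a minimal Python illustration, not part of the SE toolkit: the names ROUND_INTERVALS, sample_plan, and iostat_command are hypothetical, and the list of "round" intervals is an assumption chosen so that the two cases in the text (24 hours and 10 minutes) come out as described.

```python
import math

# Candidate recording intervals in seconds. This list is an assumption
# made for the sketch; adjust it to taste.
ROUND_INTERVALS = [1, 5, 10, 15, 30, 60, 300, 600, 900, 1800, 3600]


def sample_plan(span_seconds, max_samples=100):
    """Pick the shortest round interval that covers the span in at most
    `max_samples` samples. Returns (interval_seconds, sample_count)."""
    for interval in ROUND_INTERVALS:
        count = math.ceil(span_seconds / interval)
        if count <= max_samples:
            return interval, count
    # Span too long for the round intervals: fall back to an even split.
    interval = math.ceil(span_seconds / max_samples)
    return interval, math.ceil(span_seconds / interval)


def iostat_command(span_seconds):
    """Build an iostat command line using the interval and count arguments."""
    interval, count = sample_plan(span_seconds)
    return f"iostat -xc {interval} {count}"


if __name__ == "__main__":
    print(iostat_command(24 * 3600))  # a 24-hour day
    print(iostat_command(10 * 60))    # a 10-minute busy period
```

A 24-hour span yields 96 samples at 900-second (15-minute) intervals, and a 10-minute span yields 60 samples at 10-second intervals, both within the 100-sample limit.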
What is making the disks busy?
If the whole disk subsystem is idle, then you can skip this question. The per-process data does not tell you which disks the processes are accessing. Use the df command to list mounted file systems, and use showmount to show which ones are exported from an NFS server; then, figure out how the applications are installed to work out which disks are being hit and where raw database tables are located. The swap -l command lists swap file locations; watch these carefully in the iostat data because they all become very busy with paging activity when there is a memory shortage.
What is the network name service configuration?
If the machine is responding slowly but does not seem to be at all busy, it may be waiting for some other system to respond to a request. A surprising number of problems can be caused by badly configured name services. Check /etc/nsswitch.conf and /etc/resolv.conf to see if DNS, NIS, or NIS+ is in use. Make sure the name servers are all running and responding quickly. Also check that the system is properly routing over the network.
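As a rough aid for the check of /etc/nsswitch.conf, the sketch below lists the name services consulted for one database, in lookup order. It is a hedged Python illustration: the function name lookup_sources and the sample file contents are hypothetical, and the parser handles only the common "database: source [criteria] source" layout.

```python
import re

# Hypothetical file contents, for illustration only.
SAMPLE_NSSWITCH = """\
# example entries
passwd: files nis
hosts: nis [NOTFOUND=return] files dns
"""


def lookup_sources(nsswitch_text, database="hosts"):
    """Return the sources listed for `database`, in lookup order."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        name, _, rest = line.partition(":")
        if name.strip() == database:
            # Remove bracketed action criteria such as [NOTFOUND=return].
            rest = re.sub(r"\[[^\]]*\]", " ", rest)
            return rest.split()
    return []


if __name__ == "__main__":
    print(lookup_sources(SAMPLE_NSSWITCH))
```

Each source in the returned list is a service whose servers need to be running and responding quickly for host lookups to be fast.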
How much network activity is there?
You need to look at the packet rate on each interface, the NFS client and server operation rates, and the TCP connection rate, throughput, and retransmission rate. One way is to run this twice, separated by a defined time interval.
% netstat -i; nfsstat; netstat -s
Another way is to use the SE toolkit’s nx.se script that monitors the interfaces and TCP data along the lines of iostat -x.
% se nx.se 10
Current tcp RtoMin is 200, interval 10, start Thu Oct 16 16:52:33 1997
Name Ipkt/s Opkt/s Err/s Coll% NoCP/s Defr/s tcpIn tcpOut Conn/s %Retran
hme0 212.0 426.9 0.00 0.00 0.00 0.00 65 593435 0.00 0.00
hme0 176.1 352.6 0.00 0.00 0.00 0.00 53 490379 0.00 0.00
Is there enough memory?
When an application starts up or grows or reads files, it takes memory from the free list. When the free list gets down to a few megabytes, the kernel decides which files and processes to steal memory from, to replenish the free list. It decides by scanning pages, looking for ones that haven’t been used recently and paging out their contents so that the memory can be put on the free list. If there is no scanning, then you definitely have enough memory. If there is a lot of scanning and the swap disks are busy at the same time, you need more memory. If the swap disks are more than 50% busy, you should make swap files or partitions on other disks to spread the load and improve performance while waiting for more RAM to be delivered. You can use vmstat or sar -g to look at the paging system, or virtual_adrian.se will watch it for you, using the technique described in “RAM Rule” on page 456.
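The decision steps in this paragraph can be written down as a small Python sketch. It is not the RAM rule used by virtual_adrian.se: the function name memory_advice is hypothetical, and the two thresholds heavy_scan_rate and swap_busy_floor are assumptions that the caller must supply, because the text does not quantify "a lot of scanning" or "busy". Only the 50% figure comes from the text.

```python
def memory_advice(scan_rate, swap_busy_pct, heavy_scan_rate, swap_busy_floor):
    """Turn paging observations into the advice given in the text.

    scan_rate        page scan rate taken from vmstat or sar -g
    swap_busy_pct    busiest swap disk, percent busy, from iostat
    heavy_scan_rate  ASSUMED threshold for "a lot of scanning"
    swap_busy_floor  ASSUMED threshold for "swap disks are busy"
    """
    advice = []
    if scan_rate == 0:
        advice.append("enough memory")
    elif scan_rate >= heavy_scan_rate and swap_busy_pct >= swap_busy_floor:
        advice.append("need more memory")
    # The 50% figure is from the text; it is treated here as a separate check.
    if swap_busy_pct > 50:
        advice.append("spread swap across other disks")
    return advice


if __name__ == "__main__":
    # The threshold values below are made up purely to exercise the function.
    print(memory_advice(500, 60, heavy_scan_rate=200, swap_busy_floor=20))
```

Keeping the thresholds as required arguments makes it explicit that they are site-specific judgment calls rather than values taken from the book.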
What changed recently and what is on the way?
It is always useful to know what was changed. You might have added a lot more users, or some event might have caused higher user activity than usual. You might have upgraded an application to add features or installed a newer version. Other systems may have been added to the network. Configuration changes or hardware “upgrades” can sometimes impact performance if they are not configured properly. You might have added a hardware RAID controller but forgotten to enable its nonvolatile RAM for fast write capability. It is also useful to know what might happen in the future. How much extra capacity might be needed for the next bunch of additional users or new applications?