/*****************************************************************/
/* Short introductory note on how to manage the paging space    */
/* on unix systems.                                              */
/*                                                               */
/* By     : Albert van der Sel                                   */
/* Version: 1.5 - 20 may, 2009                                   */
/*****************************************************************/

Remark:
This simple note is created for people "who do not know too much" about
controlling paging on unix systems. For experienced sysadmins, I do not think
this note is of any value.

=============================================================
1.1 A few words on paging and virtual memory:
=============================================================

Controlling the "paging space usage" (or the "swap space usage") can sometimes be
quite troublesome. Related to that, handling the "paging rate" might be even more important.

We don't want to be "dramatic" here, because most unix systems run great. But there are
some classes of applications that can nearly drive you crazy. That's why you need to know
a bit about "what's going on" on your system.

For most of us, "paging" and "swapping" mean the same thing, but "formally" there is a
difference (strictly speaking, swapping moves a whole process out of memory, while paging
moves individual pages). In this note, we don't make that distinction. Just like almost
everybody else, we will speak about paging or swapping as if they are the same.
Please do not forget: this is truly a very simple note on the subject of swapping.

In a "lightweight discussion", you would probably say the following about swapping:

It is the process where real memory pages are paged to and from the "paging space"
(also called "pagefile", "swap space" or "swapfile"), which is located on disk.
Whenever free real memory gets low, the need for paging may arise: if memory needs to be
allocated for new objects, or whenever already "paged out" pages need to get back
into memory again.
Of course, all of this is related to the "virtual memory" implementation, which is based on
mapping the "real memory address space" to contiguous "virtual memory addresses", in order
to "trick" programs into thinking they are using large blocks of contiguous addresses.
The real memory is simply limited. The virtual address space is the range that
(for example) a 32-bit or 64-bit system can address: that's a much higher figure.
To make the trick work, we need an additional object (swap space on disk) that functions
as a sort of scratch pad, holding pages that are swapped out, and maybe later
swapped in to memory again.

So, the paging space is used to page out pages that are not needed at that time,
and to page in pages that are needed back into memory again.

Let's first take a look at a few commands by which we can check the size of the paging space,
and see how much space is used and how much is free.

Important: just looking at how much memory is still free (by a command or some tool)
does not tell you much on most unixes. The VMM (Virtual Memory Manager) on many unixes
will simply use most of the free memory as a filesystem cache.
So "how much memory is free, right now" is not a good indicator of whether or not
you are really low on memory. However, you can use values related to memory utilisation
in combination with the %usage of the swap space and the paging rate.
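
To make the page-out / page-in cycle described above a bit more concrete, here is a
minimal toy sketch in Python. It is not how any real VMM works. It assumes a pure
"least recently used" choice of the page to evict, it assumes that every evicted page
is written to the paging space, and the name simulate_paging is just made up for this note.

```python
from collections import OrderedDict


def simulate_paging(references, frames):
    """Toy model of paging (an illustration only, not a real VMM).

    references: sequence of page numbers touched by programs.
    frames:     number of real memory pages available.
    Returns (page_ins, page_outs), counting only traffic to and from
    the paging space.
    """
    resident = OrderedDict()   # pages in real memory, least recently used first
    paged_out = set()          # pages currently held in the paging space
    page_ins = 0
    page_outs = 0

    for page in references:
        if page in resident:
            # Page is in real memory: just mark it as recently used.
            resident.move_to_end(page)
            continue
        if len(resident) >= frames:
            # No free frame: evict the least recently used page (assumption).
            victim, _ = resident.popitem(last=False)
            paged_out.add(victim)
            page_outs += 1
        if page in paged_out:
            # The page was paged out earlier and is needed again.
            paged_out.discard(page)
            page_ins += 1
        resident[page] = True

    return page_ins, page_outs


if __name__ == "__main__":
    refs = [1, 2, 3, 1, 2, 3]
    print(simulate_paging(refs, frames=3))   # enough real memory
    print(simulate_paging(refs, frames=2))   # too little real memory
```

With three frames the whole working set fits and nothing goes to the paging space.
With only two frames, almost every reference causes a page-out and a page-in,
which is the situation you want to avoid on a real system.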

=============================================================
1.2 Show the total of swapspace, and %usage of swapspace:
=============================================================

Let's take a look at a few common commands to check the swapspace:

----------------------------------
AIX:
          lsps -a
          lsps -s
          pstat -s
----------------------------------
HP:
          swapinfo -a
          swapinfo -tam
----------------------------------
Solaris:
          swap -l
          prtswap -l
----------------------------------
Linux:
          swapon -s
          cat /proc/swaps
          cat /proc/meminfo
          free
----------------------------------

We might just as well list a few common commands that show you the total memory
of your system (some commands give quite a detailed overview, and some commands
might require root privilege):

----------------------------------
AIX:
          bootinfo -r
          lsattr -E -l mem0
          lsattr -E -l sys0 -a realmem
          svmon -G
          vmstat -v
          vmo -L                       (very detailed overview)
----------------------------------
Linux:
          cat /proc/meminfo
          dmesg | grep "Physical"
          free                         (the free command)
----------------------------------
HP:
          getmem
          grep MemTotal /proc/meminfo
          dmesg | grep -i phys
          wc -c /dev/mem
----------------------------------
Solaris:
          prtconf | grep Mem
          prtmem
          memps -m
----------------------------------

Of course there are many more commands that produce information about swap,
but in any case, with the above list, we have a few commands handy.

>> Example 1:

So for example, for Solaris you can use "swap -l", which could produce output like:

# swap -l
swapfile             dev    swaplo   blocks     free
/dev/dsk/c0t0d0s3    136,3      16   302384   302384
/data/swapfile         -        16   102384   102384

Here you can compare the total blocks to the free blocks, to get an impression of
the %usage of swap (in this output "free" equals "blocks", so no swap is in use at all).

Notice that in this example we have two areas of swap: one is a partition on some disk
(c0t0d0s3), and the second is a swapfile (/data/swapfile) located on some filesystem.
This can be arranged on almost all unixes.
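
The comparison of "blocks" to "free" in Example 1 can be sketched in a few lines of Python.
This is only a sketch: it assumes the five-column "swap -l" layout shown above, and the
function names swap_usage and total_used_pct are made up for this note.

```python
SAMPLE = """\
# swap -l
swapfile             dev    swaplo   blocks     free
/dev/dsk/c0t0d0s3    136,3      16   302384   302384
/data/swapfile         -        16   102384   102384
"""


def swap_usage(swap_l_output):
    """Return a list of (name, blocks, free, used_pct) per swap area.

    Assumes the column order: swapfile dev swaplo blocks free.
    Lines that do not match that layout (prompt, header) are skipped.
    """
    rows = []
    for line in swap_l_output.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0] == "swapfile":
            continue
        name = fields[0]
        try:
            blocks = int(fields[3])
            free = int(fields[4])
        except ValueError:
            continue
        used_pct = 100.0 * (blocks - free) / blocks if blocks else 0.0
        rows.append((name, blocks, free, used_pct))
    return rows


def total_used_pct(rows):
    """Overall %usage over all swap areas returned by swap_usage()."""
    blocks = sum(r[1] for r in rows)
    free = sum(r[2] for r in rows)
    return 100.0 * (blocks - free) / blocks if blocks else 0.0


if __name__ == "__main__":
    areas = swap_usage(SAMPLE)
    for name, blocks, free, pct in areas:
        print("%-20s %8d blocks, %5.1f%% used" % (name, blocks, pct))
    print("total: %.1f%% used" % total_used_pct(areas))
```

On a live Solaris system you would feed it the real output of "swap -l" instead of
the SAMPLE text. Other platforms print different columns, so the parsing would have
to be adapted.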
You can choose either a dedicated partition (on some unixes a "logical volume",
if a Logical Volume Manager is in use), or just a file on a filesystem.

Notes:

(1) General consensus is that it is more efficient to have dedicated partition(s)
    (or swap Logical Volumes) for swap, instead of swapfiles.
    But this might be questioned by many people.
(2) It's probably best to have swap on (fast) local disk(s), instead of having it on SAN
    or other forms of "remote" storage. But this too might be questioned by many people.
(3) Having multiple swapspaces on different disks might increase performance.
(4) When creating swapspaces, on some systems you can specify that swapping
    (to those swapspaces) should take place in a round robin fashion, or according to
    another swap policy.

>> Example 2:

Suppose we go to an AIX system, and try "lsps -s":

# lsps -s
Total Paging Space   Percent Used
      7168MB               31%

In this example, you see that the swap space is 31% used, so we still have quite some
swap space left. But is that really so? Do we have plenty free?
We try to answer that in section 1.3. But first a few words on another subject.

So, we have a number of commands that show us the total size of the swap space, and what
is already in use. Obviously, that's important, because you can monitor swap usage, and
see whether it grows or not.

It's true that %swap-usage is an important indicator. That figure should not be too high,
and it certainly should not grow too much in too short a time.

The main question you will try to answer is:

  Does my system really have adequate memory, at all loads that may be thrown at it?

To answer that, %swap-usage is an important indicator. But we also need other
information: the paging rate, which is even more important.

Tools are available for showing the paging rate, for example "vmstat"
(short for: virtual memory statistics). We will explore that in section 1.4.

=============================================================
1.3 How large should the paging space be?
=============================================================

No definitive answer is possible here (other than: it must be large enough).
It also depends on your system, most notably of course on how many resources it has
(in terms of amount of memory, cpu and disks), on what the average load and the peak loads
on that system are (or will be), and on which type of services it will provide.

>> One common recommendation that is heard often is this:                 <<
>> make the swap space at least the size of "2 x total memory size".      <<

The above "rule" is actually a bit "old", and does not work in all cases.

In general, for systems with a smaller amount of memory, it's often a reasonable
recommendation. These are systems with, say, up to 10GB of RAM.

But suppose you have a Server with 128GB of RAM. Should you then necessarily equip it
with a swap space of 256GB on disk? No, certainly not. Maybe that system would work
very well with a swapspace of only 30GB or so.

Example:

Suppose someone asks you how much swap space a unix system with 12GB memory should have.
What would you then say?

Actually, it's quite difficult to answer that precisely. For example: what will be the role
of this Server? Will it be a machine providing fileservices (and thus filecaching
is important), or will it be a Database Server (probably with a smaller cache, but a larger
shared memory area in use)? Also: what is the average load, and what (if applicable) will
be the peak loads?

Although this is just a hypothetical example, maybe this is a relatively good answer:

- Start out with, say, 8GB swap.

Then, once all applications are installed and are working in the "normal" way:

- Monitor swap usage (as well as other resource usage: cpu, memory, disks, network).
- Monitor the swapping behaviour (for example with vmstat, with special attention
  to pi, po and the scan rate sr).

Now if the paging rate stays too high for longer periods (of course, occasionally it may
be high), it's an indicator that you simply need more memory.
That will be explained in section 1.4.
If the swap usage grows, you might add swapspace, but that will not lower the paging rate.
In any case, you must never run out of swapspace, otherwise your system will stall.

Next to the monitoring of your system, you probably agree that, at some point, you should
(or should have done before):

- if possible: tune memory (do you need a large filecache, or can it be made smaller?)
- if possible: tune the application(s)
- maybe adjust some kernel parameters

Important: %usage of swap is only one indicator. The paging rate (pi=page in, po=page out
and sr=scanning rate, as can be seen in a monitoring tool like "vmstat") is even more
important.

The following statement is very important:

  Most systems will not "just" free allocated swap space, unless needed.

What does that mean?
Suppose you have a system that steadily shows 15% usage of swap space under normal
conditions. Well, that does not look bad at all.
Now suppose there was a short period of unusually high load, and high pressure on memory
usage. Then it's very likely that the %usage of paging space has grown as well.
Now, if the system load drops to normal values, the %usage of swap will most likely not
return to the former value.
So, in this case you might observe a higher %usage compared to what you see in "normal"
conditions. Is this a bad situation? Not necessarily.

It is also possible that at the time of high load more server processes were running, and
that those additional processes exit "when the job completes". In this case, it's likely
that the %usage of swap will drop.

(Note: this is a bit of a "problem" with unix systems: swap is not always deallocated
after usage was very high. Often only shutting down the apps (or processes) will lower
the %usage of swap.)

Thus, %usage of swap does not tell you all that there is to say.
But in general, you should aim at plenty of free swap.
Suppose you see a system where the swap space usage is about 30%, and it stays around
that value even if you monitor it for longer times. Then you can be confident that
the situation is under control.

Key point is thus: is swap usage not too high (lower than, say, 40%) and relatively stable?

As a further remark: many things are relative and questionable. Did you know that some
experts say the following:

"....You really don't want hundreds of megabytes of BloatyApp's untouched memory
floating about in the machine. Get it out on the disk, use the memory for something
useful....."

There is some truth in the above statement.

The only things that really should be clear are:

- If your swap grows, that is, %swap-usage grows steadily over time, that is generally
  not a good sign.
- If the paging rate shows consistently high values, that is generally not a good sign,
  and is an indicator that you need to add memory.

But, before you invest in memory, be sure that you have explored your tuning options,
in terms of memory, kernel parameters, and the application involved.

=============================================================
1.4 Monitoring the paging-rate with vmstat and other tools:
=============================================================

vmstat:
=======

%usage of swap is one thing. The paging rate, which means how many pages are swapped in
from and out to the paging space per second, is another.
Generally, it's a true indicator of whether or not you need more memory.

Obviously, when the rate is high, the system is working hard "to keep up" with the memory
requirements at that time. When this gets extreme, the system might enter a
"thrashing state", and solving this is very difficult.

But you should ONLY worry if you see high paging rates consistently, over longer periods.

How can you view those rates? If you know of another tool to observe the paging rate,
you can use that of course. But "vmstat" is common on all unix systems, and that's why
I use it here.
You can start vmstat in several ways. Here we will illustrate two important possibilities:

$ vmstat [Interval_parameter]
$ vmstat [Interval_parameter] [Count_parameter]

Like in:

$ vmstat 5 2
$ vmstat 5

Let's try this:

$ vmstat 5

System configuration: lcpu=8 mem=10240MB ent=2.25

kthr      memory                 page                   faults              cpu
----- --------------- ----------------------- ------------------ -----------------------
 r  b     avm    fre  re  pi  po  fr  sr  cy   in      sy    cs  us  sy  id  wa    pc    ec
 3  0 2343326  31286  15   0   0   0   0   0  115   41149  1334  23  11  66   0  0.84  37.5
 5  0 2349714  26797   0   0   0   0   0   0  205  125613  2315  52  28  20   0  1.93  85.6
 2  0 2343395  33031   0   7   0   0   0   0   52   77581  2310  32  28  40   0  1.53  67.9
 1  0 2343377  33030   0  11   0   0   0   0   16   76822  2177  26  28  46   0  1.39  61.8
 1  0 2343346  33017   0   0   0   0   0   0   28   27805  1159  11  10  79   0  0.55  24.6
..

Every 5 seconds a record is produced, showing all kinds of statistics.
In the above example, I have let it run for 25 seconds, so we see 5 records.
Without a count, records are generated continuously until you press Ctrl-C.

So the Interval_parameter says: produce a sampling record every "Interval_parameter"
seconds. If you add the Count_parameter, you specify how many records you want, and
after that, vmstat terminates.

There is much to learn from vmstat's output. For the purpose of this note, the following
statistics are important for determining the paging rate:

pi    Pages paged in from paging space.
po    Pages paged out to paging space.
fr    Pages freed (page replacement).
sr    Pages scanned by page-replacement algorithm.

If you consistently see very high values for pi, po and sr, it's an indicator that your
system is low on memory.
But remember, your system is supposed to work (and not to stand idle all the time), so if
you see high values at some moments, you can ignore that.
And do not worry about high cpu/us (cpu % due to user processes) values. Again, your machine
is supposed to work.

There are slight differences in the output that vmstat produces if you compare
Solaris to AIX, etc.
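
Because the column layout differs per platform, a script that watches the paging rate
had better look up the columns by their header name instead of by position.
Below is a minimal Python sketch of that idea, fed with the AIX sample from above.
The function names parse_vmstat and sustained_high, the threshold of 100 and the
"3 records in a row" rule are assumptions made for this note, not standard values.

```python
SAMPLE = """\
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------------------
r b avm fre re pi po fr sr cy in sy cs us sy id wa pc ec
3 0 2343326 31286 15 0 0 0 0 0 115 41149 1334 23 11 66 0 0.84 37.5
5 0 2349714 26797 0 0 0 0 0 0 205 125613 2315 52 28 20 0 1.93 85.6
2 0 2343395 33031 0 7 0 0 0 0 52 77581 2310 32 28 40 0 1.53 67.9
1 0 2343377 33030 0 11 0 0 0 0 16 76822 2177 26 28 46 0 1.39 61.8
1 0 2343346 33017 0 0 0 0 0 0 28 27805 1159 11 10 79 0 0.55 24.6
"""


def parse_vmstat(text, columns=("pi", "po", "sr")):
    """Return one dict per vmstat record, holding only the wanted columns.

    The header line is recognised as the first line that contains all
    wanted column names; values are then picked by header position.
    """
    header = None
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if header is None:
            if fields and all(c in fields for c in columns):
                header = fields
            continue
        if len(fields) != len(header):
            continue
        try:
            values = [float(f) for f in fields]
        except ValueError:
            continue   # repeated header or other non-numeric line
        samples.append({c: values[header.index(c)] for c in columns})
    return samples


def sustained_high(samples, threshold=100, min_consecutive=3):
    """True if any watched column is >= threshold in min_consecutive
    records in a row (threshold and count are assumed example values)."""
    run = 0
    for sample in samples:
        if any(value >= threshold for value in sample.values()):
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0
    return False


if __name__ == "__main__":
    records = parse_vmstat(SAMPLE)
    print([r["pi"] for r in records])
    print("sustained high paging:", sustained_high(records))
```

On Linux the paging columns are called "si" and "so", so there you would call
parse_vmstat(text, columns=("si", "so")). The point of the "in a row" rule is the same as
in the text: a single busy record is no reason to worry, a long series of them is.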
Also, if you work on a virtual machine (lpar, vpar etc.), additional columns might be
visible that are related to virtualisation.
But in general, you can see (more or less) the following columns:

procs/r   : Run queue length.
procs/b   : Processes blocked while waiting for I/O.
procs/w   : Idle processes which have been swapped.
page/re   : Pages reclaimed from the free list.
            (If a page on the free list still contains data needed for a new request,
            it can be remapped.)
page/mf   : Minor faults (page in memory, but not mapped).
            (If the page is still in memory, a minor fault remaps the page.
            It is comparable to the vflts value reported by sar -p.)
page/pi   : Paged in from swap (Kb/s).
            (When a page is brought back from the swap device, the process will stop
            execution and wait. This may affect performance.)
page/po   : Paged out to swap (Kb/s).
            (The page has been written and freed. This can be the result of activity by
            the pageout scanner, a file close, or fsflush.)
page/fr   : Freed or destroyed (Kb/s).
            (This column reports the activity of the page scanner.)
page/de   : Freed after writes (Kb/s).
            (These pages have been freed due to a pageout.)
page/sr   : Scan rate (pages).
            Note that this number is not reported as a "rate," but as a total number
            of pages scanned.
faults/in : Interrupts (per second).
faults/sy : System calls (per second).
faults/cs : Context switches (per second).
cpu/us    : User CPU time (%).
cpu/sy    : Kernel CPU time (%).
cpu/id    : Idle CPU time (%).
cpu/wa    : CPU waits, due to I/O (%).

We can further make the following remarks:

- procs/b or kthr/b: The number of processes blocked (in a queue) should be low, or 0.
- Paging statistics: pi, po and sr should be low.
  If you consistently see values in the hundreds, that's not good.
- cpu/us might show medium high values on average
  (your system should be at work, don't you agree?).
- cpu/sy should not be too high.
- cpu/wa should be low (these are waits due to I/O).

Other tools: topas, nmon
========================

In addition to using vmstat, many folks like to use a tool like "topas" or "nmon".
Those work perfectly well too.
If you then see consistently high values for "pgspin" and "pgspout", that is an indicator
that the "swapping rate" (or paging rate) is too high, which can be caused by having
too little memory for the load on your system.

sar:
====

If sar is implemented on your system (which is quite likely), then a number of sar command
options and reports can be used as well, for example:

# sar -g

Just try that command. Also, you can try the command "man sar" right now on your system.
If you do not know sar, spend a day (or so) reading a couple of articles about sar.
It's worth it.

=============================================================
1.5 Classes of applications:
=============================================================

Most shops use the well-known common commodity software, like Oracle, DB2, WebSphere,
CRM software etc. Normally, that software will behave correctly (in terms of paging)
if your machine was sized (and tuned) properly.

But there are specialized large applications, which only a few organizations use.
Obviously, that software has been tested in fewer situations, there are fewer bug reports
(limited customers), fewer programmers, etc. You know what I mean.
It's more likely that you will encounter paging problems with that class of software
than with the common commodity software (again, that's not an absolute truth of course).

=============================================================
1.6 A few notes on the swapping mechanics:
=============================================================

Virtual memory is implemented on all unix systems.
But there was an intrinsic problem to solve:
namely, there are N virtual pages, M physical pages, and N > M.
Actually, N is usually much, much larger than M.

And... the system is really multiprocessing (multiprogramming/multitasking/timesharing).
In that case, there are k processes, each with their own N virtual pages. Then, the problem
becomes mapping kN virtual pages onto M physical pages.

Obviously, if there are more virtual pages than physical ones, the OS can't simply keep
everything in main memory. Swapping lets the OS save memory pages to the swap space,
to free up physical memory. However, disks are much slower than main memory.
Therefore, if the OS is going to be reasonably responsive or efficient, it will have to do
a "good" job of picking pages for swapping out and predicting pages for swapping in.

The above describes "in a lightweight tone" why your system needs a swap space on disk.

If we want to go "deeper", it gets quite complex.

First, realize that it's not only the OS that plays a role. A hardware based MMU
(memory management unit) plays a role too. The CPU has an MMU slaved to it.
This MMU is responsible for translating virtual addresses into physical addresses.
With paging, the MMU performs the page table lookups to transform a virtual address
into a physical address. The MMU has additional responsibilities as well.

Secondly, the physical memory is "laid out" / "divided" into "pages" (by the MMU).
In most cases, a page is 4K or 8K in size.

Thirdly, pages are (if we look at it from the OS level) organized into various "regions",
like various buffers, caches and shared memory. It's the VMM (Virtual Memory Manager) that
does that work, and in most cases you can influence the sizes of those regions
by setting various kernel parameters. This may have a considerable effect on how your
system will swap.
It can also be (on some systems) that the VMM "breaks" the virtual memory into segments.
Each segment manages the mapping for its virtual address range and passes this mapping on
to the MMU.

As a fourth point, we can say that typically one or more system daemons "wake up"
periodically, or when a threshold is reached. One task of such a daemon is the role of
a "page scanner", which reads the state of the memory pages that are maintained by the MMU.
How often, and how "hard", the page scanner will go to work depends on an algorithm that
relates values such as "minfree", "lotsfree" and "cachefree" pages of memory.
When the need for swapping arises, pages that can be swapped will indeed be swapped
to swapspace.

Another mechanism for finding pages to be marked as free is the "page stealing"
mechanism. From the filecache, the pages that (normally) have been least referenced are
"stolen" and marked as free. Some systems use an LRU (least recently used) mechanism, and
others use FIFO (first in, first out).

=============================================================
1.7 A few tips on kernel parameters that affect swapping:
=============================================================

Below is just a simple listing of some main kernel parameters, for various platforms.
If you are a sysadmin, you probably know how to set kernel parameters.
Indeed, this section is no more than a mere listing. But still, you might find
a parameter interesting enough to do a bit of searching (system manuals, internet).

Needless to say, you should be extra careful when changing kernel parameters.
In any case, make sure you have a way to fall back to the former situation.

Important: Please remember that "what's good" for one application could "be bad" for
another type of application.

Some of the main parameters on various platforms are:

AIX:
====

A few configurable kernel parameters that can play a role for memory paging:

lru_file_repage
defps
maxclient%
maxperm%
minperm%

AIX is a little bit "special" in setting kernel parameters.
For example, most Solaris parameters are stored in the "/etc/system" file. Likewise,
for Linux, most parameters are stored in "/etc/sysctl.conf".
When dealing with AIX, don't be surprised if you find "vmo" or "no" commands (and other
system commands to change memory, (user)process properties, cpu, disk access, and network
behaviour) in several files. For example, you might find settings in
"/etc/tunables/nextboot", and possibly also in a startup file called "/etc/rc.local",
and possibly in some others as well.

So, it's not uncommon to see commands in such a file like:

# vmo -p -o maxclient%=70

The "vmo" command (among other commands) is very important for a sysadmin, in order
to change system values. The most common way to use it:

# vmo [-p] [-o]          # p: persistent   o: option

Note: The "tunchange" command updates a "tunable" file. Some folks use it to put parameters
with their values in some file (like the "/etc/tunables/nextboot" file).
But this small note is not an AIX manual, so if you are interested, please search
the internet or the system manuals.

-- About "lru_file_repage":

The "lru_file_repage" parameter can affect the paging behaviour in a considerable way.
By setting this parameter to 0, you force the system to only free file pages when you run
out of memory, and to not write working pages out to paging space.
This normally will decrease the paging rate and the %swap-usage.

In other words: with lru_file_repage = 0, AIX, when under memory pressure, will "try harder"
to steal memory from the file cache instead of paging process memory out to the
paging space.

So if you observe that the swap space is heavily used on your AIX machine, you might
consider setting this parameter to 0.
You can do that like this:

# vmo [-p] -o lru_file_repage=0

-- About maxclient%, maxperm%, minperm%

Very lightweight discussion on maxclient%, maxperm%, minperm%:
--------------------------------------------------------------

A way to classify pages in memory is to distinguish "computational" pages from
"non computational" pages.

Maybe you like to use the tool topas. This tool is often seen and used on AIX.
Look in topas for the values "comp mem" (which is related to processes) and "non comp mem"
(which is related to the filesystem cache) to see the distribution of the memory usage.

After a default install of AIX, and if no further memory tuning was done, AIX will use
most of the memory as a filesystem cache. For many applications, that's good.
For other apps, you may want to use another memory layout, where the cache is
significantly smaller (e.g., to make room for a large shared memory area to be used by
a database instance).

The upper and lower limits of the cache are determined by setting the right values of
maxclient%, maxperm% and minperm% (using the vmo command).

Suppose you want to lower the cache right after the OS starts (because you are using
Oracle, for example). You might then use commands like in the following example:

# /usr/sbin/vmo -o maxperm%=45       # (default 80)
# /usr/sbin/vmo -o maxclient%=45     # (default 80)
# /usr/sbin/vmo -o minperm%=15       # (default 20)

(You can use multiple "-o" options in one command.)

The max parameters are related to the upper limit of the cache, while the min parameter
is related to the lower limit.

This is how IBM defines maxclient%, maxperm%, minperm%:
-------------------------------------------------------

The ratio of page frames used for files versus those used for computational
(working or program text) segments is loosely controlled by the minperm and maxperm values:

If percentage of RAM occupied by file pages rises above maxperm, page-replacement
steals only file pages.
If percentage of RAM occupied by file pages falls below minperm, page-replacement
steals both file and computational pages.
If percentage of RAM occupied by file pages is between minperm and maxperm,
page-replacement steals only file pages unless the number of file repages is higher
than the number of computational repages.

What's best for you depends on what type of application you are using.
Actually, the default of AIX is quite good, unless you want to run a database, which
usually uses a smaller filesystem cache and instead uses a larger shared memory area
(which contains its own block buffer cache).
Again, there is no absolute truth here. In general, I would say that the default is quite
good for low paging rates, because the stealing mechanism will steal from the cache
when needed.

HP:
===

A few configurable kernel parameters that can play a role for memory paging:

-- About Total System Swap

Maximum swap space that can be allocated, system-wide.
Parameters: maxswapchunks and swchunk.

-- About Device Swap

Swap space allocated on hard disk devices.
Parameters: nswapdev.

-- About File System Swap

Swap space allocated on mounted file systems.
Parameters: allocate_fs_swapmap and nswapfs.

-- About Pseudo-Swap

Use of installed RAM as pseudo-swap, which allows virtual memory space to be allocated
beyond the limit of the swap space on disk devices.
Parameters: swapmem_on.

-- About Variable Page Sizes

The size of virtual memory pages might even be altered, to make swap operations perhaps
more efficient for certain applications.
Parameters: vps_ceiling, vps_chatr_ceiling, and vps_pagesize.

maxswapchunks and swchunk are very common parameters to set on HP-UX systems.

Linux:
======

A few configurable kernel parameters that can play a role for memory paging:

-- About vm.swappiness:

vm.swappiness is a tunable kernel parameter that controls how much the kernel favors
swap over RAM.
It's a sort of general parameter that determines how eager the system will be in stealing
pages. The higher the vm.swappiness value, the more the system will swap, and the other
way around.

You can show that parameter as in the following example:

suse10~# sysctl vm.swappiness
vm.swappiness = 60

The value is also recorded in /proc/sys/vm/swappiness.

-- About freepages:

freepages.min    When the number of free pages in the system reaches this number, only
                 the kernel can allocate more memory.
freepages.low    If the number of free pages gets below this point, the kernel starts
                 swapping aggressively.
freepages.high   The kernel tries to keep up to this amount of memory free; if memory
                 comes below this point, the kernel gently starts swapping in the hopes
                 that it never has to do real aggressive swapping.

-- About kswapd:

The kernel parameters are the following:

tries_base       The maximum number of pages that the kswapd daemon tries to free in one
                 round is calculated from this number. Usually this number will be divided
                 by 4 or 8 (see mm/vmscan.c), so it is not as big as it appears.
                 When you need to increase the bandwidth to or from swap, you will want
                 to increase this number.
tries_min        This is the minimum number of times that the kswapd daemon tries to free
                 a page each time it is called. Basically, it is just there to make sure
                 that the kswapd daemon frees some pages even when it is being called
                 with minimum priority.
swap_cluster     This is the number of pages that the kswapd daemon writes in one turn.
                 You want this large so that kswapd does its I/O in large chunks and
                 the disk does not have to seek often, but you do not want it to be too
                 large since that would flood the request queue.

-- Other important parameters:

page-cluster     The Linux virtual memory (VM) subsystem avoids excessive disk seeks by
                 reading multiple pages on a page fault. The number of pages it reads
                 is dependent on the amount of memory in your machine.
                 The number of pages the kernel reads in at once is equal to
                 2 ^ page-cluster. Values above 2 ^ 5 do not make much sense for swap,
                 because we only cluster swap data in 32-page groups.
pagecache        This file does exactly the same as the buffermem file, only this file
                 controls the page_cache structure, and thus controls the amount of
                 memory used for the page cache.

The page cache is used for three main purposes:

- Caching read() data from files.
- Caching mmap() data and executable files.
- Swapping cache.

When your system is both deep in swap and high on cache, it probably means that a lot of
the swapped data is being cached, making for more efficient swapping than was possible
with prior kernels.

=============================================================
1.8 Adding, removing, increasing, and decreasing swap:
=============================================================

Still to do. But there is no hurry here.

Note: On "antapex.org", the link "Some unix links" will lead you to several articles
about paging on various platforms.