Very simple note on disk device "paths" in some common architectures.

Version : 2.5
Date : 19/05/2012
By : Albert van der Sel



Main Contents:

1. A GENERIC (INCOMPLETE) OS MODEL.
2. THE DEVICE TREE & FIRMWARE pre-boot IMPLEMENTATIONS: Open firmware & UEFI.
3. MBR (BIOS), GPT (UEFI).
4. A VERY SHORT SECTION ON SOME SCSI TERMS.
5. WORLD WIDE NAME IDENTIFIERS.
6. BLOCK IO, FILE IO, AND PROTOCOLS.
7. JUST A FEW NOTES ON VMWARE.
8. JUST A FEW NOTES ON NETAPP.
9. JUST A FEW NOTES ABOUT STORAGE ON UNIX, LINUX AND WINDOWS.


In this simple note, we try to address some key concepts in "accessing storage". For example, what driver models
are generally in use in a common Operating System? Or, why do WWPNs exist, and what exactly are they?
And why, in some OSes, do we see device files like "/dev/rdsk/c0t0d0s0"? And what is a thing like EFI,
or a "device tree", and how does NVRAM or firmware play a role in finding storage?

Numerous other questions exist, of course. Now, this simple note might help us in understanding
some of the answers to those questions.

However, the material presented here really is super simple. Sometimes, to understand something completely,
you really have no other option than to "dive deep" into some protocol.
In contrast, this note keeps a very high-level view at all times.

Also, there is no special focus on any specific type of architecture.

Next I need to apologize for the very simplistic pictures in this note. There are many professional figures
on the internet, but it's not always clear which are "free" to use, so I created a few myself. Don't laugh ;-)




Chapter 1. A GENERIC (BUT INCOMPLETE) OS MODEL.


We must start "somewhere". So, maybe it's a good idea to see first which parts in the Operating System (OS)
have some relation to finding and accessing storage.

Suppose a user application wants to open a file, which exists "somewhere" on "some sort of storage".
So, we could have a user who uses an editor, and wants to open some "readme" file on some
filesystem, say, a directory somewhere in the "/home" filesystem.
Note that in this case, it is "known" to the OS what the type of the filesystem is (like JFS, ext3 etc..)

Typically, the application uses an "fopen()" call, which will be handled by the "system call interface"
(a sort of API), and the request will be passed on to the kernel.
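As a small illustration (assuming a Linux system with the "strace" utility installed; the file name is hypothetical),
you can actually watch a simple tool enter the kernel through the system call interface:

# strace -e trace=open,openat cat /home/albert/readme.txt
...
openat(AT_FDCWD, "/home/albert/readme.txt", O_RDONLY) = 3
...

Here, the library call (like fopen()) ends up in the open()/openat() system call, which is where the driver stack
of figure 1 takes over.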

Please take a look at figure 1.

The kernel has many essential tasks, like process management and the like, and it will
hand off the request to a specialized 'IO Manager' (which in fact is a very generalized concept).
Since there are so many types of "filesystems", this module really uses a sort of "plug and play" concept.
In our "generic" OS, "a close friend" of the IO Manager, namely the "Virtual Filesystem", will use (or, if necessary, load)
any specific module which can deal with a certain specific filesystem.

So, suppose the Virtual Filesystem has determined that the user application actually wants to access
an "ext3" filesystem; then it will put the corresponding "ext3" driver/module to work, and it won't use
the NTFS and JFS modules (since they are of no use here).
Note that a number of those modules might thus already be loaded, ready for use.

Fig 1. Simplified storage driver stack in a model OS.



Now, this specific filesystem driver knows all the "ins and outs" of that specific filesystem, like
locking behaviour and types of access etc., but it cannot retrieve the file all by itself, since it does not have
a clue about the true physical path.
That's why more specialized modules are set in action. Some are helper modules, but the last one in the stack
really "knows" how to access the Host Bus Adapter (HBA), if the file that the user application wants happens to reside somewhere
on a Fibre Channel (FC) SAN.
If another type of storage needs to be accessed, like iSCSI, another specialized set of drivers is used.

Our generic OS model is not so bad after all. Take a look at figure 2. Here we see a similar stack,
but this time of a real OS. Figure 2 shows you (very high-level !) how it is implemented in Windows.


Fig 2. Simplified storage driver stack in Windows.



Here too, we see a Virtual Filesystem (VFS), which can utilise specialised modules for a specific filesystem type.
In figure 2, you can see the "ntfs", "fat" and other specialized modules.

The Partition Manager, "partmgr.sys", among other tasks, keeps track of partitions and LUNs and makes sure
that their "identity" stays the same (like an F: drive stays the F: drive, also after a reboot).

Next, "storport.sys" is a general acces module for any type of remote storage.
It's the follow up of the former "scsiport" module. Storport is an interface between higher level modules,
and the lower level "miniport" drivers. It deals with many tasks like queueing, error messages etc..

If we get lower in the stack, we will end up with specialized modules which know how to handle,
for example, a network card (for iSCSI) or an HBA (for accessing an FC SAN).




Chapter 2. The Device Tree & firmware pre-boot implementations.


This chapter takes a look at possible pre-boot implementations (like a Device Tree & firmware) with respect
to booting a Host. Now, whatever Operating System follows after this "pre-boot" is actually of no concern here.
Of course, it is important for the full picture, but here we do not really distinguish between
a Host which, as the full boot ends, runs "just" an Operating System, and one whose OS is the basis for
supporting "Virtual Machines" (VMs).

So, for practical purposes, view this chapter as a description of "bare metal machine" boot.
In a later chapter, we will see more of a boot of a Hypervisor-like machine, which natively is meant to support VMs.


It's interesting to see what "entity" actually discovers buses, and devices on those buses.
But it severly depends on the architecture.

I think it would be quite reasonable if you had the idea that when your OS boots, it fully scans
the computer to find out "what's in there", and configures the system accordingly.

Yes and No. I know that "yes and no" does not sound good, but it is the truth.

Please be aware that I'm not saying that an OS does not have "polling/enumerator features". Certainly not.
It's only that for certain platforms, device trees can be built by "firmware-like" solutions, immediately after power-on
of such a system. Usually, such a firmware solution is accompanied by a "shell" with a relatively compact command set,
which enables the admin to view/walk the tree, map devices, probe for devices, and of course, boot an OS.

Of course, in general terms, if you get a kernel (and the rest of an OS), it's meant for a certain architecture.
We all understand that, say, some Linux distro for Intel x86 will not run unmodified on other platforms.

But there is a bit more to it. Even for a certain architecture, there might exist many mainboards, using
different buses and interconnects. Engineers and other IT folks never liked the idea of creating
a super-heavy kernel which is prepared for all sorts of mainboards and the myriad of possible devices
on all sorts of buses.
The problem is clear: how can we let the OS recognize all buses and devices, and bind drivers to those
devices? One solution might be: build a kernel with lots of tables or configuration databases.
In an open industry, this was never perceived as a viable solution.

That's why some "solutions" were presented already quite some time ago, and some newer followup protocols were constructed,
like "Open firmware", ACPI and UEFI.


2.1 Early firmware-like solutions.

Quite early already, some firmware-like solutions were implemented, like for example OpenBoot for Sun SPARC machines.
OpenBoot might be considered to be an "early" Open Firmware implementation.

It was partly a "PROM" and "NVRAM" solution, where at "power on", a socalled "device tree" was build first,
before the OS boots (like Solaris 8 in those days).
The associated code was written in Forth, and the NVRAM could store all sorts of variables like if
autoboot (to Solaris) was true, and so forth.
A short time after the power was put on, the Admin had the choiche to boot to the OS, or just stay
at the "Openboot" prompt, usually visible on the console as a "ok>" prompt.
On this prompt, if the Admin wanted to do so, he could traverse the Device Tree (showing what was in there),
or even "probe" for newly installed devices.

We should not dwell too long on this particular implementation, but I want you to know that already at
the "ok>" prompt, address paths to devices were visible. And a key point of the story is that the Device Tree,
a data structure in memory, was passed on to the kernel when it booted, and the kernel was then able
to build "friendly" names in "/dev", while the device tree itself was still visible (and mounted) in "/devices".
So, note that the kernel uses this data structure to know the devices, and to more easily bind drivers
and configure the system.

Here is an example of such a physical address path:

/pci@1f,0/pci@1/isptwo@4/sd@2,0

Such a path can actually be seen as starting at a "root node", under which a hierarchy of "sub/child nodes" resides,
ultimately ending in devices. In the example above, we see a path to a SCSI disk, behind a local SCSI controller
in a PCI slot.
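Just as an illustrative sketch (the exact output differs per machine; the device paths shown are taken from the example above),
a short session at the "ok>" prompt might look like:

ok> printenv auto-boot?
auto-boot? = true
ok> show-devs
/pci@1f,0/pci@1/isptwo@4
/pci@1f,0/pci@1/isptwo@4/sd@2,0
...
ok> probe-scsi-all
ok> boot disk

Commands like "show-devs" (list the device tree), "probe-scsi-all" (probe for SCSI devices) and "printenv"
(show NVRAM variables) are typical OpenBoot commands.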

Actually, a key point here is that devices, including storage devices, might be found during a firmware boot,
where they get associated with an address path (a "physical device name", if you like), and subsequently the kernel
uses that info further for configuration.


2.2 Open Firmware:

While Sun's former "Openboot" might be considered to be a predecessor, "Open firmware" as is it is now called, should be seen as a
non-proprietary boot firmware that might be implemented on various platforms or architectures.

⇒ For example, the CHRP Specification (Common Hardware Reference Platform) requires the use of Open Firmware.
Examples are the Power and PowerPC architectures of IBM.
No doubt "system p" and "system i" Admins know of the "ok>" prompt, which can be accessed through the SMS Boot Menu.
These are the modern AIX and AS/400 (old term) machines, capable of running AIX, IBM i (formerly AS/400), and Linux VMs.

⇒ Also, you will not be surprised that Sun machines use an Open Firmware implementation too (Sun has since been "eaten" by Oracle).

⇒ In contrast, as you probably know, Intel-based x86/x64 machines traditionally used PC BIOS implementations.
However, implementing ACPI is quite established, and on newer Intel/Windows platforms, EFI can be used as well.
More about that later on.

You might say that Open Firmware is very effective for PCI devices. Open Firmware can be seen as a boot manager, and when the
"mini kernel" (so to speak) from firmware has booted, it can offer a shell with a limited command set.
Additionally, PCI devices might be equipped with FCode, which enables them to report identification and resource characteristics,
which Open Firmware can use to build the device tree.
Usually, Open Firmware offers simple means to boot the system to an OS (possibly multiboot), which will use the device tree further for configuration
and for binding drivers to devices.


2.3 EFI or UEFI:

EFI, or the "Extensible Firmware Interface", was Intel's idea on replacing the traditional PC BIOS systems.
Formally, EFI "evolved" into the "Unified EFI Platform Initialization" specifications. So, from now on we will use the term UEFI
to decribe this "new" BIOS-like followup protocol.

I'm not saying that the BIOS stayed exactly the same over all those years. Far from it. The BIOSes from different vendors,
using more and more extensions, kept up with new hardware implementations. However, it was time for a fundamental change.
For example, demands for security, better decisions by low-level software on sleep modes, better boot managers, and more
control over the boot process (and maybe many more), probably all led to the UEFI implementation.
That is not to say that UEFI is "great". No, actually it's very complex, but at least it overcomes many limitations.

Note that the GUID Partition Table (GPT) was also introduced as part of UEFI, to overcome the limitations of the MBR
partitioning scheme. However, in principle, you can use GPT without EFI, if it isn't a bootdisk.

UEFI, as it is now, seems to be tied to Intel's newest architectures. That does not mean you should immediately think of Windows
Operating Systems per se. One nice example is HP systems based on Itanium. Of course, HP hopes that you will boot to HP-UX,
but EFI, seen as a boot manager, easily allows you to boot to another OS like Red Hat, SUSE, Windows, etc.

For security experts, the EFI pre boot environment could be seen as a large improvement too. In principle, various forms of authentication
are possible, as described in the "EFI pre-boot authentication protocol".
As a sidenote: It's also, for example, quite interesting to read articles on the Windows 8 "Early Launch Anti-Malware" implementation.

In one sentence: EFI is often defined as an interface between an operating system and platform firmware.
It's best to view EFI as a modular structure, consisting of "certain parts", like a firmware interface, a bootmanager,
an EFI systempartition, and support for a drivermodel using EFI Byte Code (EBC).

The hardware must be suitable for using EFI, so really old x86 systems are not usable, but x86 as such is not excluded.
However, if the hardware is supported, it is still possible (if you insist) to run legacy BIOS Operating Systems,
"thanks" to compatibility modules like the CSM. However, UEFI is targeted at BIOS-free systems.

After a system is powered on, very globally, the following happens:

First, some system-dependent routines are activated, like for example "IMM" initialization on an IBM System x,
where at the last stage, UEFI code is called. So, each architecture uses its own very specific initial routines.

Next, the Security (SEC) phase, the Pre-EFI Initialization (PEI) phase, and then the Driver Execution Environment (DXE) phase
are executed in sequence.
In reality, those phases are really pretty complex, but for our purposes it is not necessary to go into the details.

During the DXE phase, EFI Byte Code drivers might be loaded from firmware flash, or could come
from UEFI-compliant adapters. A device tree is built, for OSes that can use it.

Lastly, the "Boot Device Selection" (BDS) takes place, and the system:
  • might be configured for "autoboot"to some OS.
  • Or, the system might enter a "bootmenu"
  • Or possibly enter the EFI "Shell".
With respect to the autoboot: just like with Open boot or Open firmware, NVRAM can store several variables,
like whether "autoboot" to some selected OS should take place.

From this point on, it gets interesting for us. Take a look please, at figure 3.

Fig 3. Simplified EFI boot sequence (on HP Itanium).



UEFI is a set of specifications in technical documents, but how it will "look" per architecture
also depends a bit on the manufacturer, I am afraid.
However, something called the EFI System Partition (ESP) really is part of the EFI specs.
Note the existence of the EFI System Partition in figure 3.

Example 1: HP on Itanium using EFI:

Figure 3 tries to show the Itanium implementation, as it is often used by HP.
After the EFI launch, you might enter a boot menu, or a Shell where you can use a small command set
with commands like "cd", "map", etc.
This system partition is indeed just a partition (likely to be on disk0), and it is of a FAT filesystem type.

Do you notice the "\EFI" main directory, and the subdirectories like for example "\EFI\Redhat"?
Each such subdir contains a specific "OS loader" for that specific OS.
You can see such loaders by entering the "Shell" (the fs0:\> prompt) and navigating around a bit. Example loaders could be files like:

"\EFI\vms\vms_loader.efi"

or

"\EFI\redhat\elilo.efi"


The actual kernel of an OS will reside on another partition, possibly a partition on another disk system.

This explains why, here on Itanium, multiboot is possible. By the way, the already mentioned "boot menu"
is very easy to use, and here you can also specify to which OS the system should autoboot.
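For illustration only (the mapped filesystem name "fs0:" and the directory content are assumptions for this sketch),
navigating the EFI System Partition from the EFI Shell might look like:

Shell> map -r                  <- rescan devices and list the mappings
Shell> fs0:
fs0:\> cd EFI
fs0:\EFI> ls
fs0:\EFI> cd redhat
fs0:\EFI\redhat> elilo.efi     <- start this OS loader by hand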

Example 2: EFI and RedHat (RHELS 6):

When not using EFI, RedHat boots using the "grub" bootloader, which goes through various "stages", and it will also
use the "/boot/grub/grub.conf" file, which helps to display a boot menu, from which (usually) various
kernel versions can be chosen to boot from.

The EFI implementation is a bit like this:
The EFI System Partition is now mounted at "/boot/efi/" and is of type VFAT; in its subdirectories, the OS loaders of various
Operating Systems may reside. For booting to RedHat, the OS Loader directory is "/boot/efi/EFI/redhat/".
This directory contains "grub.efi", which is a special GRUB compiled for the EFI firmware architecture.
If "autoboot" specifies this loader, then the system will boot to RedHat.

This example is quite similar to example 1, but there are some minor differences per Manufacturer.

This section was indeed a very lightweight discussion of the UEFI implementation. But it will help us later on.
Note that I skipped the GPT implementation (the replacement of the MBR) here. I think it's better not to do
"everything in one bang", so I leave GPT for Chapter 3.


2.4 Other firmware-like solutions.

As we have seen, on many systems the Kernel is "helped" in configuring the system by a Device Tree.
This might be the case on Open Firmware and UEFI systems, and a few others.

However, traditionally, other methods are in use as well. There are so many architectures that it is
not really helpful to create all sorts of listings in this note. For example, on some architectures,
just a pre-stored file or blob is passed to the kernel when it boots.

But of course, Operating Systems themselves can scan buses and find devices. For example, "ioscan" and
"cfgmgr" (on some Unix systems) are a few examples of commands which the sysadmin can use to scan
for new devices, which usually are new local disks or new LUNs from a SAN.
Some notes about that can be found in chapter 6.
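Just as a few hedged examples (the adapter instances "host0" and "fcs0" are placeholders), such scans could look like:

HP-UX:  # ioscan -fnC disk                                    <- scan and list disk devices
AIX:    # cfgmgr -l fcs0                                      <- configure (new) devices behind adapter fcs0
Linux:  # echo "- - -" > /sys/class/scsi_host/host0/scan      <- rescan host0 (all channels, targets, LUNs)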




Chapter 3. MBR & GPT.


Sections 3.1 (MBR) and 3.2 (GPT), are about "disk boot structures" on Intel architectures,
that are used by typical OS'ses on that platform, like Linux, VMWare, Windows and others.

3.1 The MBR.

In Chapter 2, we discussed (lightly) the function of UEFI. But part of UEFI is a new partitioning structure,
called "GPT", as a follow-up of the traditional MBR.
We did not discuss GPT there, because that chapter more or less focused on firmware and Device Trees.
Now it is also time to discuss stuff like MBR and GPT.

For PC systems (workstations & Servers) using BIOS, we can discuss the MBR first, its role in booting the system,
as well as its limitations. Mind you: there are still countless Windows, Linux and other Servers/Workstations
out there using BIOS, instead of Open Firmware or UEFI.

So here are a few words on MBR...

In very ancient times, Cylinder-Head-Sector (CHS) addressing was used to address the sectors of a disk, but it was
rather quickly replaced by "Logical Block Addressing" (LBA), which uses a simple numbering scheme (0, 1, 2, etc.) of disk sectors.
This could indeed be implemented in those days, thanks to newer logic on the disk controller, and the support of
BIOS INT 13h extensions and Enhanced BIOS implementations.

In this scheme, we simply have one "linear address space", from LBA 0 to LBA N, and leave the details to the
onboard logic of the Controller.

Note: there is an interesting history in "geometry translation" methods, and various addressing limits of BIOS,
which explains why various partition size limits existed in the past, like the infamous 512M, 2G, 8G, 137G limits.
However, that's much too "lengthy" so I skip that here (it's also not very relevant).

Now, there are (or were) at least 2 problems:

(1). Disk manufacturers already have moved (or want to move) from a fundamental sector size of 512 bytes to 4096 bytes.

(2). The Traditional MBR (Master Boot Record) of a disk is 512 bytes in size. The MBR is located in Sector 0.

The boot sequence of an OS through the MBR is most easily described using Windows. Not that it's very different
from another often-used OS on Intel, like Linux, but a description of an MBR-based Linux boot via a stage 1 "grub"
installed in the MBR is, I think, not very useful at this point, so I will just discuss an MBR-based Windows boot.

The MBR starts with the initial bootcode, and some tiny error messages (like 'Missing Operating System'); this bootcode
has a length of 446 bytes. It's followed by the 64-byte "Partition Table", which holds 4 "partition entries" of 16 bytes each.

One partition could be marked "active", and this then was a bootable partition containing the Windows OS bootloader.
So, the booting sequence in the MBR scheme, was like this:
  • The initial bootcode of the MBR gets loaded, and reads the partition table.
  • The active partition was found, and execution was transferred to the OS loader in that partition (like NTLDR).
  • This OS loader then initiates the boot of Windows.
A very schematic "layout" of an MBR looks like this:

Fig.4: Schema of the MBR.

From byte 0 to byte 445 (incl), length 446 bytes:
  Purpose: initial bootcode (also for loading/reading the partition table),
  and some error messages.

From byte 446 to byte 509 (incl), length 64 bytes:
  Purpose: the Partition Table, holding 4 Partition Entries of 16 bytes each.

From byte 510 to byte 511 (incl), length 2 bytes:
  Purpose: the closing "Boot record signature", with values 55 AA.

One problem will become clear in a moment. A 16-byte Partition Entry has the following structure:

Fig.5: Schema of a Partition Entry in the MBR.

Length (bytes)   Content
1                Boot Indicator (80h = active)
3                Starting CHS
1                Partition Type Descriptor
3                Ending CHS
4                Starting Sector (LBA)
4                Partition size (in sectors)

The last 2 fields express the problem. For example, the "partition size" (in number of sectors) is 4 bytes (32 bits) long, so it
can have as a maximum value "FF FF FF FF" in hex, which is "4294967295" in decimal. So, when using a 512-byte sector size,
this amounts to 4294967295 x 512 = 2,199,023,255,040 bytes, or a maximum partition size of about 2.2 TB.
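If you want to look at a real MBR yourself, a hedged example on Linux (assuming "/dev/sda" is the boot disk) is to copy
sector 0 and dump the partition table bytes:

# dd if=/dev/sda of=/tmp/mbr.bin bs=512 count=1
# xxd -s 446 -l 66 /tmp/mbr.bin     <- the 64-byte partition table, plus the closing "55 AA" signature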

The fact that only 4 partitions are possible (not counting optional logical drives in an "extended" partition),
together with this partition size limit, is by today's standards considered too restrictive.

So, as you can read in a trillion other internet documents, the GUID Partition Table, or GPT, is the replacement
for the MBR.

As we have seen in Chapter 2, UEFI is a new firmware interface for newer Intel machines, as a replacement
for the traditional BIOS.

Also, the UEFI specifications suppose that the machine gets an "EFI System Partition". So... what is the relation
with GPT, which is also a UEFI spec?
It seems complicated, but it's really not!

As it turns out, if you have a UEFI-compliant machine, and you install a UEFI-compliant OS,
then you get GPT with an "EFI System Partition".

What makes it all a bit cloudy is that a GPT disk is actually sort of "self-describing", which has as a
consequence that you can use a GPT disk even with a BIOS-based system, although with certain restrictions.

So, UEFI is actually not required for using a GPT disk.

This will be explained better in the next section.

Indeed, popular OSes on Intel, like VMware, Linux distros, and Windows, all used MBR in the past,
but later versions have switched to GPT.

Later on, you will find a table listing those OS'ses, with respect to Version, 32/64 bit, UEFI/BIOS system,
and showing if they can use GPT.


3.2 The GUID Partition Table (GPT).

The GUID Partition Table is the follow up of the MBR. In section 3.1, we have seen how the MBR was structured,
how it was used in the bootsequence, and the limitations posed by the MBR and Partition Entries.

Contrary to the simple and small MBR (of 512 bytes long), the GPT is a completely different thing.

As said before, GPT is part of the UEFI specifications. In practice, the following statements are true:
  • A system with UEFI firmware will natively use GPT-based disks, and can boot from a GPT disk.
  • An (older) BIOS-based system can use GPT-based disks as data disks, but cannot boot from a GPT disk.
  • So, UEFI is not "per se" required for using GPT disks.
  • Most newer releases of popular (Intel-based) OSes are moving to UEFI and GPT disks (or are already UEFI-based).
  • A GPT-based disk uses as its first sector (LBA 0) an MBR-like structure, called "the protective MBR",
    which precedes the newer GPT structures. It looks exactly the same as the old-fashioned MBR, but it was added
    for several reasons, like protection against older tools such as "fdisk" or legacy programs and utilities.

A GPT is way larger than the old MBR. A GPT spans multiple LBAs. In fact, GPT reserves LBA 0 to LBA 33,
leaving LBA 34 as the first usable sector for a true Partition.
So, as from LBA 34, we can have a number of true partitions, that is, usable disk space like C:, D: etc.

But the "end" of the disk is special again! It's a copy of the GPT, which can be used for recovery purposes.

Just as we did in figure 4 for the MBR, let's take a look at a schematic representation of the GPT.
Since we can number sectors just by referring to LBA numbers, let's use that here as well. If we apply that
to the old MBR scheme, for example, we can say that the MBR was in "LBA 0".

Fig.6: Simplified Schema of the GUID Partition Table.

LBA 0                                  Protective MBR
LBA 1                                  Primary GPT Header
LBA 2                                  Start of the Partition Table: Partition Entries 1, 2, 3, 4
LBA 3 - LBA 33                         Partition Entries 5 - 128
LBA 34 - LBA M                         Possible first true Partition (like C:)
LBA M+1 - LBA N                        Possible second true Partition (like D:)
Other LBAs, except the last 33 LBAs    Possible other partitions, up to the last usable LBA: END_OF_DISK - 34
END_OF_DISK - 33                       Partition Entries 1, 2, 3, 4 (copy)
END_OF_DISK - 32                       Partition Entries 5 - 128 (copy)
END_OF_DISK - 1                        Secondary GPT Header (copy)

Please be aware that the LBA numbers (like the starting "LBA 34" for usable partitions) are not exactly
specified in the original specifications.
Only if you take the numbers as in the above table, like 128-byte Partition Entries, and "room"
for up to 128 Partition Entries, will you end up with LBA 34 as the starting point for usable data.

In GPT, in a Partition Entry, the "partition size" field (in number of sectors/LBAs) is now 64 bits wide.
This amounts to a maximum partition size of about 9.4 ZettaBytes (2^64 x 512 bytes, roughly 9.4 billion TeraBytes), which is quite large indeed.
You can easily do that math yourself (if needed, take a look at the math in section 3.1).

Note that the information in a GPT, or even an MBR, can be considered to form the "metadata" of a disk,
since both describe the structure of the disk (like "where" is "which" partition at "what" location, etc.).

Remember the EFI System Partition (ESP) as described in section 2.3?
On a "native" UEFI system, it will be automatically created as one partition on the first disk.
So, it will usually be the first partition in the "orange area" as depicted in figure 6.
Depending on the Manufacturer, its size may vary somewhat, but it's typically a few hundred MB or so,
since it only needs to store the UEFI metadata and OS boot loaders (see also figure 3).

Note:

For OSes like Linux and Windows, it should be stressed not to use older, GPT-unaware tools
like "fdisk" and the like.
For example, on recent versions of Linux, the traditional "fdisk" tool is replaced by utilities like "parted".
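For example, a GPT-aware listing with "parted" (the device name, sizes and partition names below are purely illustrative)
could look like:

# parted /dev/sdb print
Partition Table: gpt

Number  Start   End     Size    File system  Name                   Flags
 1      1049kB  211MB   210MB   fat16        EFI System Partition   boot
 2      211MB   500GB   500GB   ext4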


3.3 Some figures illustrating a BIOS/MBR boot, and an UEFI/GPT boot.

3.3.1. BIOS and MBR:

Fig 7. Simplified example of a BIOS/MBR initiated boot of an "older" Windows system like Win2K3.



In fig 7, we see an "old-fashioned style" boot of a once popular OS like XP, Win2K3 etc..
What we see here, is the stuff we have talked about before.

-The BIOS selects a bootable device. Maybe it tries the DVD first, before going to harddisk0.
-Then, it loads the "initial bootcode" from the MBR, which will access the "partition table".
-Then it determines the specs of the "active" or "bootable" partition (for example, where it starts).
This partition could be, for example, partition No 1.

-Control is then passed to the OS loader on that partition (in this example, "NTLDR").
-Next, ntldr reads "boot.ini", which is a small file containing so-called "ARC" paths, which "point" to the partitions
containing Operating Systems.
For example, such an ARC path could point to partition No 2 (or, for example, to partition No 1 on another disk).

-From then on, the boot sequence of that OS really starts.

Now you may say: "this is Windows !" So how about another OS that also initially starts from BIOS/MBR, like Linux RHEL 5 or so?
Ok, the following figure it not so terribly different from fig 7, but I like to show it here anyway.

Fig 8. Simplified example of a BIOS/MBR initiated boot of an "older" Linux system like RHEL5.



This note is not meant to discuss boot sequences in detail. However, the overall sequence of events is visible in figure 8.
Of course, the start of the Linux OS as such is depicted to happen as of Step 7 in figure 8.
At this point, those phases are not of much interest to us, which is why they are not detailed further.

3.3.2. UEFI using GPT:

Actually, we have already discussed the UEFI boot in section 2.3. However, here I try to produce a figure that illustrates
a bit more of the role of GPT, and of the UEFI System Partition.
Here, I will only show that picture. If you need more info on UEFI itself, you might check section 2.3.
So let's give it a try:

Fig 9. Simplified example of a UEFI/GPT initiated boot.



Ok, I myself would certainly not give an "A rating" to the figure above, but it should help "somewhat" in understanding UEFI boots.

The firmware boot is described in section 2.3. At a certain moment, the UEFI boot manager reads the GPT.
Since the GPT is "metadata", mainly about true partitions, partitions can be identified by their Globally Unique Identifier, the GUID.

So, from the GPT, UEFI tries to identify the "EFI System Partition" (ESP) by its unique (partition type) GUID,
which should be C12A7328-F81F-11D2-BA4B-00A0C93EC93B.

Once found, UEFI can locate the correct OS loader, in case autoboot is in effect.

Contary to the "native" MBR situation as shown in section 3.3.1, UEFI does NOT load "initial bootcode" from GPT.
So, in this phase, no sort of bootsecor with code, is loaded.
Only meta data is read. The preboot is fully containted in UEFI itself.

3.3.3. UEFI using MBR:

Now, the following might "feel" a bit strange. In the discussion above, we have seen that UEFI more or less
expects GUID Partition Table metadata (GPT). Indeed, GPT is part of the UEFI specifications.

However, manufacturers sometimes find smart ways (or maybe not-so-smart ways) to implement sorts of hybrid forms.
Since the "EFI System Partition" (ESP) actually is "just" a partition, like any other partition (only with special content),
it actually is possible to have a system with an ESP, using the "old MBR style" metadata.
In this case, in a Partition Entry in the MBR, the ESP can be identified by its partition type ID of value "0xEF".

There actually exist (some might say "weird") variants. It's possible to replace the original bootcode
in the standard MBR with a variant "which looks like EFI firmware". This even makes it possible for non-UEFI machines
to boot from GPT disks.

We already have seen the limitations of the MBR. GPT based disks are becoming more and more the standard on Intel systems.

Although deviating variants exists, I would say that it's important to remember that:
  • Traditional BIOS systems use MBR bootcode and MBR partitioning metadata.
  • Native UEFI machines (newer x64 and Itanium) use GPT as partitioning metadata. The only preboot code is from UEFI.
  • Traditional BIOS systems using MBR, can (easily) use GPT disks as data disks.
  • And... indeed, it's possible to use MBR and an EFI System Partition.




Chapter 4. A VERY SHORT SECTION ON SOME SCSI TERMS.


This very short section will just focus on a few concepts related to SCSI, namely addressing and paths.

Undoubtedly, with any OS using SCSI, you will encounter addressing paths like for example "[1:0:1:0]".
But what is it? It's really extremely simple.

It's all about on how to "reach" any "end device", that is: from which adapter, which bus, which target,
and lastly (and optionally) which LUN do we want to address?

Sure, the SCSI protocol (and bus) has its complexities, but for this note we can keep it really that simple.
Don't forget that it is just a set of rules and commands. In other words: it's a protocol.
So, definitions were described too on how to address devices (targets) and sub-devices (LUNs) on the bus. That's all.

Now, since a computer might have multiple SCSI cards, and since any card might have multiple ports (thus buses),
a "fully qualified" path to any subdevice goes like:

SCSI language: adapter#, channel#, scsi id, lun#

and for example implemented in Unix/Linux

Unix/Linux: Host#, bus#, target#, lun#

Fig. 10. SCSI controller with bus, targets.



The figure above illustrates an old-fashioned "SCSI card", also called a "SCSI controller".
However, with respect to addressing of storage, the most common "name" for such a card is "HBA" (Host Bus Adapter).

Now, with respect to device addressing, the way to address devices is not much different when you compare
a true physical SCSI bus to, for example, a fibre-based FC SAN, which connects your system to SCSI disks.

What does it look like?

=> Example 1: On Linux you might see stuff like:

-- list devices:

[root@starboss tmp]# cat /proc/scsi/scsi

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
..Vendor: HP...Model: HSV210...Rev: 6220
..Type:...RAID.................ANSI SCSI revision: 05
Host: scsi0 Channel: 00 Id: 00 Lun: 01
..Vendor: HP...Model: HSV210...Rev: 6220
..Type:...Direct-Access........ANSI SCSI revision: 05
Host: scsi0 Channel: 00 Id: 00 Lun: 02
..Vendor: HP...Model: HSV210...Rev: 6220
..Type:...Direct-Access........ANSI SCSI revision: 05
etc..

-- list adapters (hosts) like FC HBA cards:

# ls -al /sys/class/scsi_host

[root@starboss ~]# ls -al /sys/class/scsi_host
total 0
drwxr-xr-x 9 root root 0 Sep 21 14:59 .
drwxr-xr-x 39 root root 0 Sep 21 14:59 ..
drwxr-xr-x 2 root root 0 Sep 21 14:59 host0
drwxr-xr-x 2 root root 0 Sep 21 14:59 host1
drwxr-xr-x 2 root root 0 Sep 21 14:59 host2


=> Example 2: On VMWare, you might see stuff like:

# cd /vmfs/devices/disks
# ls vmh*

vmhba0:0:0:0
vmhba0:0:41:0
vmhba0:0:53:0


=> Example 3: On HPUX, you might see stuff like:

# ioscan -fnC ext_bus

ext_bus..15..0/2/0/0.2.1.0.1...fcparray.CLAIMED..INTERFACE...FCP Array Interface

Note the string in "bold" which is again a SCSI address, as a part of a full "hardware path".


If your system is connected to an FC or iSCSI SAN, you will see one or more device names, and probably also identifiers
like "1:0:2:0" as in example 1, or denoted slightly differently, like for example "adaptername0:C0:T0:L0" as in example 2.
Note: in many SANs the "T" might refer to the "zoning target/number", but the idea stays quite the same.

I hope it becomes clear that in most OSes, the address is always recognizable as:

"host#:bus#:target#:lun#" or "adapter#:channel#:scsiID#:lun#"

This then represents the address of a LUN (or "disk") from an FC or iSCSI SAN.

Now, you might sometimes observe several "0" (zeros) in such a identifier, but that comes from the fact that the first controller is "host0",
the first bus is "bus0" etc... Also, LUNs start numbering as of "0".

Now, a "target" is the device the SCSI bus will address first. Maybe to know the address the target, is sufficient, since
that might be the "end device" itself: it does not have any subdevices "inside".
Don't forget that there exists many sorts of devices which can be placed on a SCSI bus.

However, in Storage, the target often is a complex device which manages many subdevices.
In such a case, we do have subdevices that needs an subaddress too (the LUN numbers).
This is more or less how the situation often is, in addressing LUNs from a SAN.
You might have found one target#, with one or multiple LUNs "below" it.

Note that it does not really matter whether you have SCSI commands on an old-fashioned SCSI bus, or whether
those commands are simply encapsulated in network packets as in iSCSI: the principle stays quite the same.

On a SCSI bus, as shown in figure 10, a path like for example [0:0:1:1] uniquely identifies an object on that bus.
However, in a typical SAN situation, we have multiple Hosts, connected to switches, which themselves are connected
to Storage controllers. In this situation, we cannot uniquely identify, for example, an HBA on a Host.
If you have multiple Hosts, all with an HBA called "host0", you cannot uniquely identify the communication partners.
That's why extended addressing has been applied in SANs, using so-called "WWPN addresses".
This will be the subject of the next chapter.




Chapter 5. "World Wide" Identifiers.


Suppose you have a computer system with an HBA installed, which connects the system to a number
of local targets using a traditional SCSI bus (or channel), just like in figure 10.

In such a case, the addressing scheme of Chapter 4 would be fully sufficient, because in that example, any device can
be uniquely addressed (and identified).


Now, in contrast, imagine a number of Servers connected to switches, while those switches themselves are connected
to SANs.
In such a case, Server A has an HBA called "host0", but the same is true for any other Server, like Server B.
So, Server B might have an HBA called "host0" too.

You see?

In a larger environment we need an "extended addressing scheme", in order to be able to really distinguish
between the different HBA's on the different Servers, and all other devices like the switch ports involved, the SAN ports involved etc..

Note:
Before we go any further, other well-known examples are close at hand. For example, you probably know that any network card
is supposed to have a unique MAC address, to identify it uniquely on the network. Also, just look at the internet:
any Host is supposed to have a unique IP address, just to make sure that the "sender" and the "recipient" of data are uniquely defined.

Key point is: in any network, every device should be identifiable uniquely, which is not possible if, for example,
only names like "host0" are used. Questions like "which host0 (HBA) on which Server?" would arise immediately.

That's why globally unique "World Wide" ID's were introduced.


5.1 Addressing in Fiber Channel:

Just like in networking, where on a subnet the network card MAC addresses of all devices are essential for communication,
the same idea has been applied in FC SAN networks.

  • Every "device" like a HBA (on a Host), or SAN controller, has it's unique WWNN (Word Wide Node Number).
  • But, a device might have one, or even multiple "ports". That's why a associated to the device's WWNN, each port
    has it's derived WWPN (Word Wide Port Number).
  • A SAN controller/filer (managing diskarrays) has a WWNN, and it has one or more ports too, each with it's on WWPN.
  • Ultimately, traffic goes from WWPN to WWPN (Host to/from SAN) with optional switches in between.
The format of a WWPN:

A WWPN is sort of the "MAC equivalent" in a SAN network. It's a 8 byte identifier.
Here is an example WWPN: "2135-3900-f027-6769". This could also be denoted without the hyphens, like "21353900f0276769",
or with a ":" between each byte (two digits), like "21:35:39:00:f0:27:67:69".

Some organization has to keep an "eye" on the possible WWPNs, in order to warrant uniqueness. This is the IEEE.

When a Manufacturer wants to produce FC hardware, it has to register with the IEEE for a 3 byte "OUI" identifier, which, when granted,
will be part of the WWPNs in all their products. This OUI is then unique per Manufacturer.

There is a certain structure in any WWPN. A WWPN is derived from it's parent WWNN (of the device).

Let's see how to find WWPNs on a few example platforms:


5.2 A few examples of finding WWPNs on some example platforms:

Example 1: Linux:

On Linux, people often use the "QLogic" adapters (qla drivers) or "Emulex" adapters (lpfc driver).

You might try:

(1):
# ls -al /proc/scsi/adapter_type/n

Where adapter_type is the host adapter type and n is the host adapter number for your card.

(2):
If, for your adapter, the more modern "sysfs" registration has been implemented, browse around in "/sys/class/scsi_host/hostn" or its subdirs,
or in "/sys/class/fc_host". In the latter, you will probably find an entry called "port_name".

(3):
Or take a look in "/var/log/messages", since we should see the adapter modules being loaded at boot time:

# cat /var/log/messages
...
Dec 15 09:40:10 stargate kernel: (scsi): Found a QLA2200 @ bus 1, device 0x1, irq 20, iobase 0x2300
..
Dec 15 09:40:10 stargate kernel: scsi-qla1-adapter-node=200000e08b02e534;
Dec 15 09:40:10 stargate kernel: scsi-qla1-adapter-port=210000e08b02e534;
Dec 15 09:40:10 stargate kernel: scsi-qla1-target-0=5005076300c08b1f;
..
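For instance, option (2) often boils down to something as simple as this (the adapter instance "host1" is just an example;
the value shown is the port WWPN from the log fragment above):

# cat /sys/class/fc_host/host1/port_name
0x210000e08b02e534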


Example 2: AIX

It depends a bit on which driver stack you have loaded (like SDD), but here are a few examples, just for illustration purposes:

# datapath query wwpn

Adapter Name....PortWWN
fscsi0..........10000000C94F91CD
fscsi1..........10000000C94F9923

or

# lscfg -lv fcs0

fcs0...............U7879.001.DQDKCPR-P1-C2-T1..FC Adapter

...Part Number.................03N6441
...EC Level....................A
...Serial Number...............1D54508045
...Manufacturer................001D
...Feature Code................280B
...FRU Number.................. 03N6441
...Device Specific.(ZM)........3
...Network Address.............10000000C94F91CD
...ROS Level and ID............0288193D

A lot of other output omitted...


Example 3: Windows

Again, it depends a bit on which driver and FC Card you use. Maybe some utility was provided with your Card as well.
Here is just an example for illustration purposes:

C:\> fcquery

com.emulex-LP9002-1: PortWWN: 10:00:00:00:c8:22:d0:18 \\.\Scsi3:

or this PowerShell cmdlet might work on your system:

PS C:\> Get-HBAWin -ComputerName IPAddress | Format-Table -AutoSize

See also: "http://gallery.technet.microsoft.com/scriptcenter/Find-HBA-and-WWPN-53121140"


5.3 A conceptual representation of SAN connections:

Now that we know that, ultimately, an HBA port identified by its WWPN connects to a Storage controller port identified by its WWPN,
let's see if we can capture that in a simple figure:

Fig. 11. Host-SAN communication



While the above sketch focuses on the importance of WWPNs in communication, of course in most SANs,
multiple Hosts are connected (through one or more switches) to a SAN controller.
Maybe it's nice to show such a figure as well. In the figure below, the left side is a high-level
view of such a SAN.

Fig. 12. Sketch of Hosts to FC SAN connections, and NAS network communication.







Chapter 6. BLOCK IO, FILE IO, AND PROTOCOLS.



Local disks, or local (SCSI) disk arrays, work at the block I/O layer, below the filesystem layer.
The same is true for LUNs exposed by iSCSI or FC SANs. For the client OS, they "just look like" local disks
and are also accessed by block IO services.

In contrast, Unix-like NFS network mounts, or Microsoft SMB/CIFS shares, operate over the network at the file level.
File IO commands will see to it that the client OS gets access to the data on those shares.

1. Block IO with traditional FC and iSCSI SANs:

When your Host is connected to (what we now see as) a traditional FC SAN, your Server might have one or more
HBA Fibre cards, which connect to one or more switches, which are then further connected to the Storage arrays.

Typically, the elements associated with the transfer of data are "block address spaces" and "data blocks", and that's why
people talk about "block I/O services" when discussing (traditional) SANs.
So, here SCSI block-based protocols are in use, over Fibre Channel (FC).

The same is true in iSCSI, where transfer of block data to hosts occurs, using the SCSI protocol over TCP/IP.
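As a hedged illustration of the iSCSI case (using the open-iscsi tools on Linux; the portal address and target IQN
below are made up), discovering and logging in to a target looks like:

# iscsiadm -m discovery -t sendtargets -p 192.168.10.50:3260
# iscsiadm -m node -T iqn.1992-08.com.netapp:sn.12345678 -p 192.168.10.50:3260 --login

After the login, the LUNs behind that target simply appear as new SCSI disks on the Host.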

2. File IO with Shares, Exports, and NAS:

In a network, there might exists "fileshares" exposed by file/print servers (Microsoft), or NFS Servers (Unix/Linux).
Here, your redirector client software, "thinks" in terms of "files" that it want to retreive or store to/from the Server.
This network is just a normal network that we all know of. Ofcourse, a file will be transfered by the network protocol, meaning
that data segments are enveloped by more or less regular network packets, like in any other normal TCPIP network.
Since client and Server think in terms of whole files, people speak of File IO Services.

Two main Redirector/Server protocols are often used: CIFS (the well-known SMB from Microsoft) and NFS (Unix/Linux world).

3. Network Attached Storage (NAS):

A NAS device actually has all the features of a regular File Server. So, it's often used as a CIFS- and/or NFS-based Server.
It has some features of a SAN, since a true NAS device is also a device with a controller and disk array(s), thus resembling a SAN.
Obviously, a NAS device has network ports which connect it to network devices (like a switch).
However, NAS comes in a wide variety of forms and shapes.


Please take a look again at figure 12. Here, the left side tries to depict a traditional FC SAN, while the part
on the right side shows a NAS device that's placed in a "regular" network.

4. Modern SAN controllers/filers:

Many modern controllers (or filers) have options for FC SAN, and iSCSI SAN implementations.
Obviously FC uses primarily FC cards and FC connectors. However, iSCSI just uses network controllers.
And, modern filers have options to expose the device as a NAS (CIFS and/or NFS Server) on the network as well.


Some folks used to say: "if it is Block I/O, then it is SAN; if it is File I/O, then it is NAS".
Nowadays, this is still largely true, but as we saw above, modern SANs have options to expose themselves (partly) as a NAS too.

5. Some main types of SANs:

A number of implementations exists. The most prominent ones are:
  • FC-SAN or Fibre Channel Protocol (FCP), usually using a switched Fiber infrastructure, can be seen as a mapping of SCSI over Fibre Channel.
  • iSCSI SAN, using a network infrastructure, can be seen as a mapping of SCSI over a TCPIP network.
  • Fibre Channel over Ethernet (FCoE). This is the FCP protocol using network technology.
In Europe, especially FC SANs and iSCSI SANs are popular, while recently a renewed interest seems to exist in FCoE SANs.

Many other sorts of "storage access" implementations exists, especially remote storage access. Some of those have features that a regular SAN has too
like for example "FICON" in IBM mainframe storage technologies. However, it's not alway called a "SAN".

6. Note on LUNs:

"NAS" devices, and FileServers, primarily expose "shares" (like Microsoft SMB/CIFS, or NFS on Unix/Linux), where "File IO"
is implemented. So, the "unit" that ultimately gets transferred (using network packets), is a file.

A LUN exposed by an FC or iSCSI SAN (or FCoE SAN) is a bit different. For the Host (the client), the LUN acts just as if
it were a local disk (which it's not, of course).
So, once a LUN is discovered on a Host, you can format it and create a filesystem, like for example creating a G: drive in Windows,
or creating the "/data" filesystem on a Unix/Linux system.
Note that this is different from NAS. On a NAS, the Host (client) only accesses the storage, but it does not format it, and
it does not create some sort of preferred filesystem on that NAS-based storage.
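For example, on Linux, once a new LUN has appeared as (say) "/dev/sdc" (a hypothetical device name), turning it into
the "/data" filesystem might look like this sketch:

# parted /dev/sdc mklabel gpt
# parted /dev/sdc mkpart primary ext4 1MiB 100%
# mkfs.ext4 /dev/sdc1
# mkdir -p /data
# mount /dev/sdc1 /data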

How the LUN is physically organized on the SAN is often not known, except to Storage Admins. For example, it could be a "slice", striped over 6 physical disks
using some RAID implementation. The more spindles (disks) are used to support the LUN, the better the performance (IOPS)
will usually be.

So, although LUNs are most often used for filesystems (on which files and directories can be created in the usual way), sometimes
a LUN is used as a "raw" device. An example is Oracle "ASM", where the LUN cannot be directly accessed by the Host Operating System:
Oracle ASM IO services have formatted it to ASM's proprietary layout, and only ASM "knows" the details of how to access the data on that LUN.




Chapter 7. A FEW NOTES ON VMWARE.


A VMWare ESX / ESXi host, which is a "bare metal" or "physical" machine, is the "home" for a number of "Virtual Machines" (VMs).
There are quite some differences between ESX 3.x and ESXi 4/5, also with respect to storage implementations.

Across all versions, however, the following common thread can be discerned.

You know that an ESX(i) Host essentially runs Virtual Machines (most notably Windows Servers and Linux Servers).
These VMs have one or more "disks", just like their "bare metal" cousins.

However, most cleverly, such a "system disk" (like the C: of a Windows Server) actually is a ".vmdk" file on
some storage system. Of course, there are a few supporting files as well, but suppose we have a Windows VM called "goofy";
then somewhere (on some storage) the system disk of this machine is just contained in the vmdk file "goofy-flat.vmdk", which
typically could have a size of 20GB or so.

VMWare uses a "datastore" for storage of such files of the VM's. Such a datastore often is stored on
a "VMFS (VMware File System) datastore" which could be local disks on the ESXi Host, or it can be found
on a NFS filesystem. This NFS storage can be "exposed" by a NAS, or a SAN acting as NAS, or other NFS Server.

However, the datastore could also be found on LUNs from a FC SAN or iSCSI SAN.

So, a VM uses one or more "common disks", which are just .vmdk files, for example stored on NFS.
For a Windows VM, such a disk is the local systemdrive C:, and possible other drives like D:, all stored in their .vmdk files.

However, a VM might also use RDMs (Raw Device Mappings). These are LUNs, traditionally stored on an FC SAN,
and nowadays often on iSCSI SANs too.

These RDMs are not the common disks (like the system disk of a Windows VM), but rather the shared storage areas often found
in clustered solutions, where, for example, the SQL Server database files reside.
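For illustration only (the "naa." device identifier and the paths below are made up), mapping such a raw LUN to an
RDM proxy file on an ESXi host might look like:

# vmkfstools -z /vmfs/devices/disks/naa.60a98000572d4275636f345a6b6f4e71 /vmfs/volumes/datastore2/vm2/vm2-rdm.vmdk

Here "-z" creates a physical (passthrough) RDM; "-r" would create a virtual RDM instead.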

Common Scenario:

So, a common scenario in VMWare was that the "system drives" of the VMs were stored as .vmdk files in some datastore.

Those VMs that needed it, like VMs in an "MSCS/SQL Server cluster", also used RDMs on a SAN for the
(shared) storage of SQL Server database files and other shared resources needed for the cluster.

This is still a very common scenario, but in the latest versions, different scenarios are possible too.

Although a VMWare host can connect to SANs of various vendors, it's just a fact that "NetApp" SANs are very popular.

Fig. 13. Traditional VMWare hosts and NetApp storage solution.



In the "sketch" above, we see three VM's, each with their own "virtual machine disk" (vmdk). Such a vmdk disk file
can represent their local system disk (like C: on a Windows VM).
Only the second VM (VM2) uses LUNs from a SAN. In this example, it's a Netapp SAN.

Nowadays, for remote Hosts (like an ESXi host) to connect to a SAN, not only Fibre Channel (FC) is used,
but iSCSI and various forms of NFS/NAS implementations have gained popularity too.


Disks in VMWare:


⇒ VMDK files (system disk and optionally other disks of the VM):

As said before, VMs use ".vmdk" files for their system disk, and optionally for other disks.
Of course, such a VM might additionally also use a LUN from an FC SAN, an iSCSI SAN, or another LUN provider.

Let's first see how the VMs' system disks (their .vmdk file and other files) are organized.

Note: The sample commands below are just part of a full procedure, and cannot be used in isolation.
They are listed for illustration purposes only.

VMWare uses the concept of "datastore", which can be viewd as a "storage container" for files.
The datastore could be on a local Host hard drive, on NFS, or (nowadays) even on a FC or iSCSI SAN.
"Inside" the datastore, you will find the virtual machines .vmdk files and other files (like a .vmx configfile).
Here, our aim is to find out which files makes up a VM's systemdisk. But, let's first create a datastore.

You can use graphical tools to create a new datastore (like the vSphere client), or you can use a command line.
In VMWare, using graphical tools while connected to a central Management Server is the best way to go.

However, for illustration purposes, a typical command to create a datastore locally on an ESXi host,
while having a session to that host, looks similar to this:

# vmkfstools -C vmfs5 -b 8m -S Datastore2 /vmfs/devices/disks/naa.60811234567899155456789012345321:1

Now, let's suppose that we have already created several VMs. What do the VM files "look like"?
Here too, as a preliminary, it is important that the VMs are properly registered in the repository of "vCenter" (or "VirtualCenter").
Again, using the vSphere client, those actions are really easy, and it makes sure the VMs are properly registered.
Here, just "browse" the correct datastore, locate the correct .vmx file, and choose "register". Very easy indeed.

For illustrational purposes, if having a session to an ESXi Host, a commandline action might resemble this:

# vim-cmd -s register /vmfs/volumes/datastore2/vm1/VM1.vmx

You might say that the VM "exists", when it was shutdown, solely of the files located in the datastore.
The files can be found in the VM's "homedirectory".

Actually, there are a number of files, with different extensions, and different purposes.
Suppose that our VM is named "VM1", then here is a typical listing of the most prominent files:

Fig.12: Some files that make up a VM in VMWare.

vm1.nvram The "firmware" or "BIOS" as presented to the VM by the Host.
vm1.vmx Editable configuration file with specific setting for this VM
like amount of RAM, nic info, disk info etc..
vm1-flat.vmdk The full content of the VM's "harddisk".
vm1.vswp Swapfile associated with this VM.
-rdm.vmdk If the VM uses SAN LUNs, this is a proxy .vmdk for a RAW Lun.
.log files Various log files exists for VM activity records, usable for troubleshooting.
The current one is called vmware.log, and a number of former log files are retained.

This list is far from complete, but it's enough for getting an idea about how it's organized.




Chapter 8. A FEW NOTES ON NETAPP.


8.1 A quick overview.

NetApp is the name of a company delivering a range of small to large, popular SAN solutions.

It's not really possible to "capture" the solution in just a few pages. People go to trainings for a good reason:
the product is very broad, and technically complex. To implement an optimally configured SAN is a real challenge.
So, this chapter does not even scratch the surface, I am afraid. However, to get a high-level impression, it should be OK.

Essentially, a high-level description of "NetApp" goes like this:
  • A controller, called the "Filer" or "FAS" (NetApp Fabric-Attached Storage), functions as the managing device for the SAN.
  • The Filer runs the "Ontap" Operating System, a unix-like system, which has its roots in FreeBSD.
  • The Filer manages "disk arrays", which are also called "shelves".
  • It uses a "unified" architecture, that is, from small to large SANs, it's the same Ontap software, with the
    same command line, tools, and methodology.
  • Many features in NetApp/Ontap must be separately licensed, and the list of features is very impressive.
  • There is a range of SNAP* methodologies which allow for very fast backups, replication of Storage data to another controller and its shelves,
    and much more, not mentioned here. But we will discuss Snapshot backup Technology in section 8.4.
  • The storage itself uses the WAFL filesystem, which is more than just a "filesystem". It was probably inspired by "FFS/Episode/LFS",
    resulting in "a sort of" Filesystem with "very" extended LVM capabilities.

Fig. 14. SAN: Very simplified view on connection of the NetApp Filer (controller) to diskshelves.



In the "sketch" above, we see a simplified model of a NetApp SAN.
Here, the socalled "Filer", or the "controller" (or "FAS"), is connected to two disk shelves (disk arrays).
Most SANs, like NetApp, supports FCP disks, SAS disks, and (slower) SATA disks.
Since quite some time, NetApp favoures to put SAS disks in their shelves.

If the Storage Admin wants, he or she can configure the system to act as a SAN and/or as a NAS, so that it can provide storage using either
file-based or block-based protocols.

The picture above is extremely simple. Often, two Filers are arranged in a clustered solution, with multiple paths
to multiple disk shelves. This would then be a HA solution using a "Failover" methodology.
So, suppose "netapp1" and "netapp2" are two Filers, each controlling their own shelves. Then, if netapp1 would fail for some reason,
the ownership of its shelves would go to the netapp2 Filer.
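
As an illustration: on a 7-mode HA pair, the failover state can be inspected and controlled from the Ontap CL with the "cf"
(cluster failover) commands. The Filer names are just the hypothetical ones used above:

netapp1> cf status      (show the current failover state of the HA pair)
netapp1> cf takeover    (netapp1 takes over the shelves and services of its partner)
netapp1> cf giveback    (hand the resources back once the partner is healthy again)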


8.2 A conceptual view on NetApp Storage.

Note from figure 14 that, if a shelf is on port "0a", the Ontap software identifies individual disks by the port number and the disk's SCSI ID, like for example "0a.10", "0a.11", "0a.12" etc..


These sorts of identifiers are used in many Ontap prompt (CL) commands.
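
For example, on a 7-mode Filer, the following commands list the disks (by identifiers like "0a.10") and show how they
are used in Raid Groups and aggregates. The prompt "FAS1>" is just illustrative:

FAS1> disk show       (lists the disks and their ownership)
FAS1> sysconfig -r    (shows the RAID/aggregate layout, with the disks listed by their identifiers)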

But first, it's very important to get a notion of how NetApp organizes its storage. Here we will show a very high-level
conceptual model.

Fig. 15. NetApp's organization of Storage.



The most fundamental level is the "Raid Group" (RG). NetApp uses "RAID4", or "RAID-DP" (double parity, comparable to RAID6),
in which two disks per Raid Group hold parity; the latter is the most robust option of course. It's possible to have one or more Raid Groups.

An "Aggregate" is a logical entity, composed of one or more Raid Groups.
Once created, it fundamentally represents the storage unit.

If you want, you might say that an aggregate "sort of" virtualizes the real physical implementation of the RG's.

Ontap will create the RG's for you "behind the scenes" when you create an aggregate. It uses certain rules for this,
depending on the disk type, the disk capacities, and the number of disks chosen for the aggregate. So, you could end up with one or more RG's
when creating a certain aggregate.

As an example, for a certain default setup:

- if you would create a 16 disk aggregate, you would end up with one RG.
- if you would create a 32 disk aggregate, you would end up with two RG's.
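
As a sketch, using the Ontap CL (7-mode syntax; the aggregate name, raidsize and disk count are just examples),
creating such a 32 disk aggregate might look like this:

FAS1> aggr create aggr1 -t raid_dp -r 16 32    (raid type RAID-DP, raidsize 16, built from 32 disks: two RG's)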

It's quite an art to get the arithmetic right. How large do you make an aggregate initially? What happens if additional spindles
become available later? Can you then still expand the aggregate? What is the ratio of usable space compared to what gets reserved?

You see? When architecting these structures, you need a lot of detailed knowledge, and you must do a large amount of planning.

A FlexVol is the next level of storage, "carved out" from the aggregate. The FlexVol forms the basis for the "real" usable stuff, like
LUNs (for FC or iSCSI), or CIFS/NFS shares.

From a FlexVol, CIFS/NFS shares or LUNs are created.
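
As a sketch, with purely hypothetical names and sizes, carving a FlexVol out of an aggregate, and then a LUN out of
that FlexVol, might look like this on the Ontap CL (7-mode):

FAS1> vol create vol10 aggr1 500g                      (create the FlexVol "vol10" of 500GB in aggregate "aggr1")
FAS1> lun create -s 200g -t windows /vol/vol10/lun1    (create a 200GB LUN, to be presented to a Windows Host)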

A LUN is a logical representation of storage. As we have seen before, it "just looks" like a hard disk to the client.
From a NetApp perspective, it looks like a file inside a volume.
The true physical implementation of a LUN on the aggregate is that it is a "stripe" over N physical disks in RAID-DP.

Why would you choose CIFS/NFS, or (FC/iSCSI) LUNs? It depends on the application. If you need a large share, then the answer is obvious.
Also, some Hosts really need storage that acts like a local disk, and on which SCSI reservations can be placed (as in clustering).
In that case, you obviously need to create a LUN.

Since, using NetApp tools, LUNs are sometimes represented (or shown) as "files", the entity "qtree" gets meaning too.
It's analogous to a folder/subdirectory. So, it's possible to "associate" LUNs with a qtree.
Since a qtree also has the properties that a folder has, you can associate NTFS or Unix-like permissions with all
objects associated with that qtree.
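
A qtree is created inside a FlexVol, and a "security style" can be set on it. A small sketch (hypothetical names again):

FAS1> qtree create /vol/vol10/qtree1                          (create the qtree "qtree1" in volume "vol10")
FAS1> qtree security /vol/vol10/qtree1 ntfs                   (use NTFS style permissions on objects in this qtree)
FAS1> lun create -s 200g -t windows /vol/vol10/qtree1/lun2    (a LUN "associated" with the qtree)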


8.3 A note on tools.

There are a few very important GUI or Web-based tools for a Storage Admin, for configuring and monitoring their Filers and Storage.
Once, "FilerView" (deprecated as of Ontap 8) was great, and follow-up products like "OnCommand System Manager" are probably indispensable too.

These types of GUI tools allow for monitoring, and for creating/modifying all the entities discussed in section 8.2.

It's also possible to set up a "ssh" session over the network to the Filer, and the Filer also has a serial "console" port for direct communication.


There is a very strong "command line" (CL) available too, which has a respectable "learning curve".

Even if you have a very strong background in IT, nothing in handling a SAN of a specific Vendor is "easy".
Since, once a SAN is in full production, almost all vital data of your Organization is centered on the SAN, you cannot afford any mistakes.
Being careful and not taking any risks is a good quality.

There are hundreds of commands. Some are "pure" unix shell-like, like "df" and many others. But most are specific to Ontap like "aggr create"
and many others to create and modify the entities as discussed in section 8.2.
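
For example, the familiar "df" can be used on the Filer to view the space usage of FlexVols and aggregates
(7-mode usage; the names are just examples):

FAS1> df -h /vol/vol10    (space usage of the FlexVol "vol10", in "human readable" units)
FAS1> df -A               (space usage of the aggregates)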

If you want to be "impressed", here are some links to "Ontap CL" references:

Ontap 7.x mode CL Reference
Ontap 8.x mode CL Reference


8.4 A note on SNAPSHOT Backup Technology.

One attractive feature of NetApp storage is the range of SNAP technologies, like the usage of SNAPSHOT backups.
You can't talk about NetApp without dealing with this one.

From Raid Groups, an aggregate is created. From an aggregate, FlexVols are created. From a FlexVol, a NAS share might be created,
or LUNs might be created (accessible via FCP/iSCSI).

Now, we know that NetApp uses the WAFL "filesystem", and it has its own "overhead", which will diminish your total usable space.
This overhead is estimated to be about 10% per disk (not reclaimable). It's partly used for WAFL metadata.

Apart from "overhead", several additional "reservations"are in effect.

When an aggregate is created, per default a "reserved space" is defined to hold optional future "snapshot" copies.
The Storage Admin has a certain degree of freedom in the size of this reserved space, but in general it is advised
not to set it too low. As a guideline (and default), often a value of 5% is "postulated".
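
As an illustration (7-mode; the aggregate name is just an example), the aggregate level snapshot reserve can be viewed
or set from the Ontap CL like this:

FAS1> snap reserve -A aggr1      (show the current snapshot reserve of aggregate "aggr1")
FAS1> snap reserve -A aggr1 5    (set it to 5%)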

Next, it's possible to create a "snapshot reserve" for a FlexVol too.
Here, the Storage Admin has a certain degree of freedom as well. NetApp generally seems to indicate that a snapshot
reserve of 20% should be applied, although the numbers vary somewhat among the various recommendations.
Note also that there is a big difference between NAS Volumes and SAN (LUN based) Volumes.

Here is an example of manipulating the reserved space on the volume level, setting it to 15%, using the Ontap CL:

FAS1> snap reserve vol10 15


Snapshot Technologies:

There are a few different "Snapshot" technologies around.

One popular implementation uses the "Copy On Write" technology, which is fully block based or page based. NetApp does not use that.
In fact, NetApp uses "a new block write" on any change, and then sort of cleverly "remembers" the inode pointers.

To understand this, lets review "Copy On Write" first, and then return to NetApp Snapshots.

⇒ "Copy On Write" Snapshot:

Fig. 16. "Copy on Write" Snapshot (not used by NetApp).



Let's say we have a NAS volume, where a number of diskblocks are involved. "Copy on Write" is really easy to understand.
Just before any block gets modified, the original block gets copied to a reserved space area.
You see? Only the "deltas", as of a certain t=t0 (when the snapshot was activated), of a Volume (or file, or whatever)
get copied. This is great, but it involves multiple "writes": first, write the original block to a safe place, then write
the block with the new data.

In effect, you have a backup of the entity (the Volume, the file, the "whatever") as it was at t=t0.

If, later on, at t=t1, you need to restore, or go back to t=t0, you take the primary block space, and copy all reserved
(saved) blocks "over" the modified blocks.
Note that the reserved space does NOT contain a full backup. It's only a collection of blocks frozen at t=t0, just before they
were modified in the interval between t=t0 and t=t1.
Normally, the reserved space will contain far fewer blocks than the primary (usable, writable) space, which means a large saving
of diskspace compared to a traditional "full" copy of blocks.

⇒ "NetApp" Snapshot copy: general description (1)

You can schedule a Snapshot backup of a Volume, or you can make one interactively using an Ontap command or GUI tool.
So, a Netapp Snapshot backup is not an "ongoing process". You start it (or it is scheduled), then it runs until it is done.
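
For example, interactively creating and listing Snapshot copies of a volume from the Ontap CL (7-mode; the names are
just illustrative):

FAS1> snap create vol10 mysnap_1    (create a Snapshot copy of volume "vol10", named "mysnap_1")
FAS1> snap list vol10               (list all Snapshot copies of volume "vol10")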

The mechanics of a snapshot backup are pretty "unusual", but it sure is fast.

Fig. 17. NetApp Snapshot copy.



It's better to speak of a "Snapshot copy" than of a "Snapshot backup", but most of us do not care too much about that.
It's an exact state of the Volume as it was at t=t0, when it started.

With a snapshot running, WAFL takes a completely different approach than many of us are used to. If an existing "block" (one that already contained data)
is going to be modified while the backup runs, WAFL just takes a new free block, and puts the modified data there.
The original block stays the same, and the inode (pointer) to that block is part of the Snapshot!
So, there is only one write (the one to the new block), and the inode (a pointer) of the original block is simply kept in the Snapshot.

It explains why snapshots are so incredibly fast.
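
Going back to such a Snapshot copy is fast too, since essentially only pointers need to be manipulated.
Reverting a whole volume uses the (separately licensed) "SnapRestore" feature; as an illustration (7-mode, hypothetical names):

FAS1> snap restore -t vol -s mysnap_1 vol10    (revert volume "vol10" to the state of Snapshot copy "mysnap_1")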

⇒ "NetApp" Snapshot copy: the open file problem (2)

From Ontap's perspective, there is no problem at all. However, many programs run on Hosts (Servers), and not on the Filer of course.
So, applications like Oracle, SQL Server etc.. have a completely different perspective.

The Snapshot copy might thus be inconsistent. This is not caused by Netapp. Netapp only produced a state image of pointers at t=t0.
And that is actually a good backup.

The potential problem is this: NetApp created the snapshot at t=t0, while applications keep on running during the t=t0 to t=t1 interval.
In that interval, a database file might get "fractured", meaning that processes might have updated records in the database files.
Typical of databases is that their own checkpoint system process flushes dirty blocks to disk, and updates
the file headers accordingly with a new "sequence number". If all files are in sync, the database engine considers the database
as "consistent". If that's not done, the database is "inconsistent" (or so the database engine thinks).

By the way, it's not only databases that behave in that manner. All sorts of workflow, messaging, and queuing programs etc..
show similar behaviour.

Although the Snapshot copy is, from a filesystem view, perfectly consistent, Server programs might think differently.
That thus poses a problem.

NetApp fixed that by letting you install additional programs on any sort of Database Server.
These are "SnapDrive" and "SnapManager for xyz" (like SnapManager for SQL Server).

In effect, just before the Snapshot starts, the SnapManager asks the Database to checkpoint and to "shut up" for a short while (to freeze, as it were).
SnapDrive will do the same for any other open filesystem processes.
The result is good, consistent backups at all times.




Chapter 9. A FEW NOTES ABOUT STORAGE ON UNIX, LINUX, and WINDOWS.



After a few words on storage in general and SANs, let's take a look at how storage is viewed, used, and managed from some client Operating Systems.

Since this information really can be seen as an "independent" chapter, I have put it in a separate file.
It can be used as a fully independent document.

If you are interested, please see:

Some notes on Storage and LVM in Unix systems.


I hope you found this document a bit useful...