1. General Discussion
Traditionally, storage controllers (e.g., disk array controllers, tape library controllers) have supported the SCSI-3 protocol and have been attached to computers by SCSI parallel bus or Fibre Channel. The IP infrastructure offers compelling advantages for volume/block-oriented storage attachment. It offers the opportunity to take advantage of the performance/cost benefits provided by competition in the Internet marketplace. This could reduce the cost of storage network infrastructure by providing economies arising from the need to install and operate only a single type of network.
In addition, the IP protocol suite offers the opportunity for a rich array of management, security and QoS solutions. Organizations may
initially choose to operate storage networks based on iSCSI that are independent of (isolated from) their current data networks except for secure routing of storage management traffic. These organizations anticipate benefits from the high performance/cost of IP equipment and the opportunity for a unified management architecture. As security and QoS evolve, it becomes reasonable to build combined
networks with shared infrastructure; nevertheless, it is likely that sophisticated users will choose to keep their storage sub-networks isolated to afford the best control of security and QoS to ensure a high-performance environment tuned to storage traffic.
Mapping SCSI over IP also provides:
-- Extended distance ranges
-- Connectivity to "carrier class" services that support IP
The following applications for iSCSI are contemplated:
-- Local storage access, consolidation, clustering and pooling (as in the data center)
-- Network client access to remote storage (e.g., a "storage service provider")
-- Local and remote synchronous and asynchronous mirroring between storage controllers
-- Local and remote backup and recovery
iSCSI will support the following topologies:
-- Point-to-point direct connections
-- Dedicated storage LAN, consisting of one or more LAN segments
-- Shared LAN, carrying a mix of traditional LAN traffic plus storage traffic
-- LAN-to-WAN extension using IP routers or carrier-provided "IP Datatone"
-- Private networks and the public Internet

IP LAN-WAN routers may be used to extend the IP storage network to the wide area, permitting remote disk access (as for a storage utility), synchronous and asynchronous remote mirroring, and remote backup and restore (as for tape vaulting). In the WAN, using TCP end-to-end avoids the need for specialized equipment for protocol conversion, ensures data reliability, copes with network congestion, and provides retransmission strategies adapted to WAN delays.
The iSCSI technology deployment will involve the following elements:
(1) Conclusion of a complete protocol standard and supporting implementations;
(2) Development of Ethernet storage NICs and related driver and protocol software;
[NOTE: high-speed applications of iSCSI are expected to require significant portions of the iSCSI/TCP/IP implementation in hardware to achieve the necessary throughput.]
(3) Development of compatible storage controllers; and
(4) The likely development of translating gateways to provide connectivity between the Ethernet storage network and the Fibre Channel and/or parallel-bus SCSI domains.
(5) Development of specifications for iSCSI device management such as MIBs, LDAP or XML schemas, etc.
(6) Development of management and directory service applications to support a robust SAN infrastructure.

Products could initially be offered for Gigabit Ethernet attachment, with rapid migration to 10 GbE. For performance competitive with alternative SCSI transports, it will be necessary to implement the performance path of the full protocol stack in hardware. These new storage NICs might perform full-stack processing of a complete SCSI task, analogous to today's SCSI and Fibre Channel HBAs, and might also support all host protocols that use TCP (NFS, CIFS, HTTP, etc.).
The charter of the IETF IP Storage Working Group (IPSWG) describes the broad goal of mapping SCSI to IP using a transport that has
proven congestion avoidance behavior and broad implementation on a variety of platforms. Within that broad charter, several transport
alternatives may be considered. Initial IPS work focuses on TCP, and this requirements document is restricted to that domain of interest.
2. Performance/Cost
In general, iSCSI MUST allow implementations to equal or improve on the current state of the art for SCSI interconnects. This goal breaks down into several types of requirement:
Cost competitive with alternative storage network technologies:
In order to be adopted by vendors and the user community, the iSCSI protocol MUST enable cost competitive implementations when compared
to other SCSI transports (Fibre Channel).
Low delay communication:
Conventional storage access behaves like a stop-and-wait remote procedure call. Applications typically employ very little pipelining of their storage accesses, and so storage access delay directly impacts performance. The delay imposed by current storage interconnects, including protocol processing, is generally in the range of 100 microseconds. Because caching in storage controllers allows many storage accesses to complete almost instantly, the delay of the interconnect can account for most of the overall access time (for example, a cache hit serviced by the controller in a few tens of microseconds spends the majority of its latency in a 100-microsecond interconnect). The iSCSI protocol SHOULD minimize control overhead, which adds to delay.
Low host CPU utilization, equal to or better than current technology:
For competitive performance, the iSCSI protocol MUST allow three key implementation goals to be realized:
(1) iSCSI MUST make it possible to build I/O adapters that handle an entire SCSI task, as alternative SCSI transport implementations do.
(2) The protocol SHOULD permit direct data placement ("zero-copy" memory architectures), in which the I/O adapter reads or writes host memory exactly once per disk transaction.
(3) The protocol SHOULD NOT impose complex operations on the host software, which would increase host instruction path length relative to alternatives.
Direct data placement (zero-copy iSCSI):
Direct data placement refers to iSCSI data being placed directly "off the wire" into the allocated location in memory with no intermediate copies. Direct data placement significantly reduces the memory bus and I/O bus loading in the endpoint systems, allowing improved performance. It reduces the memory required for NICs, possibly reducing the cost of these solutions.
This is an important implementation goal. In an iSCSI system, each of the end nodes (for example, the host computer and the storage controller) will generally have ample memory, but the intervening nodes (NICs, switches) typically will not.
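As an illustration only (the structures and names below are hypothetical and are not taken from any iSCSI specification), the following Python sketch shows the placement decision an adapter makes under direct data placement: an arriving data PDU identifies the task it belongs to and the offset within that task's buffer, so the payload can be written to its final location exactly once, with no anonymous staging buffer in between.

   # Conceptual sketch of direct data placement (illustrative names only).
   # A real adapter would perform the transfer with DMA hardware.

   buffers = {}   # task_tag -> bytearray registered by the host for that task

   def register_task_buffer(task_tag, length):
       """The host registers the destination buffer for a SCSI task."""
       buffers[task_tag] = bytearray(length)

   def place_data_pdu(task_tag, buffer_offset, payload):
       """Write the PDU payload straight into the task's buffer at the
       indicated offset -- no intermediate copy, no staging memory."""
       dest = buffers[task_tag]
       dest[buffer_offset:buffer_offset + len(payload)] = payload

The point of the sketch is that the receiving adapter needs only the per-task buffer registrations, not enough memory to stage entire transfers.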
High bandwidth, bandwidth aggregation:
The bandwidth (transfer rate, MB/sec) supported by storage controllers is rapidly increasing, due to several factors:
1. Increase in disk spindle and controller performance;
2. Use of ever-larger caches, and improved caching algorithms;
3. Increased scale of storage controllers (number of supported spindles, speed of interconnects).
The iSCSI protocol MUST provide for full utilization of available link bandwidth. The protocol MUST also allow an implementation to
exploit parallelism (multiple connections) at the device interfaces and within the interconnect fabric.
*****
The next two sections further discuss the need for direct data placement and high bandwidth.
3. Framing
Framing refers to the addition of information, in a header or in the data stream itself, that allows implementations to locate the boundaries of an iSCSI protocol data unit (PDU) within the TCP byte stream. Two technical requirements drive framing: interfacing needs and accelerated processing needs.
A framing solution that addresses the "interfacing needs" of the iSCSI protocol will facilitate the implementation of a message-based upper layer protocol (iSCSI) on top of an underlying byte streaming protocol (TCP). Since TCP is a reliable transport, this can be
accomplished by including a length field in the iSCSI header. Locating protocol frames in this way assumes that the receiver parses the TCP data stream from its very beginning and never makes a mistake (never loses alignment on PDU headers).
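A minimal sketch of this "interfacing" style of framing, assuming only that each PDU begins with a fixed-size header containing a length field (the header layout below is illustrative and is not the actual iSCSI PDU format): the receiver repeatedly reads a header, extracts the length, and then reads exactly that many payload bytes from the TCP stream.

   import struct

   HEADER_LEN = 8   # illustrative: 4-byte opcode + 4-byte payload length

   def read_exact(sock, n):
       """Read exactly n bytes from a TCP socket; TCP is a byte stream,
       so a single recv() may return fewer bytes than requested."""
       buf = b""
       while len(buf) < n:
           chunk = sock.recv(n - len(buf))
           if not chunk:
               raise ConnectionError("peer closed connection")
           buf += chunk
       return buf

   def read_pdu(sock):
       """Delineate one PDU using the length field in its header."""
       opcode, length = struct.unpack("!II", read_exact(sock, HEADER_LEN))
       payload = read_exact(sock, length)
       return opcode, payload

As the text notes, this works only as long as the receiver parses the byte stream from the beginning and never loses alignment on PDU headers.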
The other technical requirement for framing, "accelerated processing", stems from the need to handle increasingly higher data rates in the physical media interface. Two needs arise from higher data rates:
(1) LAN environment - NIC vendors seek ways to provide "zero-copy" methods of moving data directly from the wire into application buffers.
(2) WAN environment - the emergence of high-bandwidth, high-latency, low-bit-error-rate physical media places huge buffer requirements on the physical interface solutions.
First, vendors are producing network processing hardware that offloads protocol processing from the host in order to achieve higher data rates. The concept of "zero-copy" is to store blocks of data in their proper (aligned) memory locations directly off the wire, even when data arrives out of order due to packet loss. This is necessary to drive actual data rates of 10 Gigabit/sec and beyond.
Second, for iSCSI to be successful in the WAN arena, it must be possible to operate efficiently in high-bandwidth, high-delay
networks. The emergence of multi-gigabit IP networks with latencies in the tens to hundreds of milliseconds presents a challenge. To fill such large pipes, it is necessary to have tens of megabytes of outstanding requests from the application. In addition, some protocols potentially require tens of megabytes at the transport layer to deal with buffering for reassembly of data when packets are received out-of-order.
In both cases, the issue is the desire to minimize the amount of memory and memory bandwidth required for iSCSI hardware solutions.
Consider that a network pipe at 10 Gbps x 200 msec holds 250 MB. [Assume land-based communication with a point halfway around the world at the equator. Ignore additional distance due to cable routing, and ignore repeater and switching delays; consider only a propagation delay of 5 microsec/km. The circumference of the globe at the equator is approximately 40,000 km, and the round-trip delay must be considered to keep the pipe full: 10 Gb/sec x (40,000 km x 5 microsec/km) / (8 bits/byte) = 250 MB.] In a conventional TCP implementation, loss of a TCP segment means that stream processing MUST stop until that segment is recovered, which takes at least one <network round trip> to accomplish. Following the example above, an implementation would be obliged to buffer 250 MB of data in anonymous buffers before resuming stream processing; later, this data would need to be moved to its proper location. Some proponents of iSCSI seek a means of putting data directly where it belongs, avoiding this extra data movement when a segment is dropped. This is a key concept in understanding the debate over framing methodologies.
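The arithmetic above generalizes to any link speed and round-trip time. The small sketch below is only a back-of-the-envelope check (it is not part of the protocol); it reproduces the 250 MB figure and shows how the buffering requirement scales.

   def pipe_capacity_bytes(bandwidth_bps, rtt_seconds):
       """Bandwidth-delay product: bytes in flight needed to keep the pipe full."""
       return bandwidth_bps * rtt_seconds / 8

   # 10 Gb/s with a 200 msec round trip (the worked example above):
   print(pipe_capacity_bytes(10e9, 0.200) / 1e6)   # 250.0 MB

   # 1 Gb/s with a 50 msec round trip:
   print(pipe_capacity_bytes(1e9, 0.050) / 1e6)    # 6.25 MB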
The framing of the iSCSI protocol affects both the "interfacing needs" and the "accelerated processing needs". However, while including a length in a header may suffice for the "interfacing needs", it will not serve the direct data placement needs. The framing mechanism developed should allow resynchronization of PDU boundaries even when a packet is temporarily missing from the incoming data stream.
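One approach discussed in the working group is to insert markers at fixed, negotiated intervals in the TCP byte stream, each carrying a pointer to an upcoming PDU boundary. The sketch below is purely conceptual (the interval and the meaning of the pointer are hypothetical and are not specified in this document); it shows how a receiver that has lost PDU alignment can resume at the next marker instead of waiting for the missing bytes to be retransmitted.

   MARKER_INTERVAL = 8192   # hypothetical: a marker every 8 KB of TCP stream

   def next_marker_offset(lost_at):
       """Markers appear at fixed, predictable stream offsets, so a receiver
       that lost PDU alignment at byte offset 'lost_at' can locate the next
       marker without parsing any of the intervening PDUs."""
       return ((lost_at // MARKER_INTERVAL) + 1) * MARKER_INTERVAL

   def resume_offset(marker_offset, pointer_in_marker):
       """The marker carries a pointer to the next PDU header (here taken to
       be a byte count from the marker itself); header parsing and direct
       data placement can resume at that stream offset."""
       return marker_offset + pointer_in_marker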
4. High bandwidth, bandwidth aggregation
At today's block storage transfer rates, the volume of storage traffic can saturate any single link. Scientific data
applications and data replication are examples of storage applications that push the limits of throughput.
Some applications, such as log updates, streaming tape, and replication, require ordering of updates and thus ordering of SCSI
commands. An initiator may maintain ordering by waiting for each update to complete before issuing the next (synchronous updates). However, the throughput of synchronous updates is inversely proportional to the network round-trip time, and so drops off sharply as network distance increases.
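To illustrate the effect with a back-of-the-envelope calculation (not a protocol requirement): with synchronous updates at most one update completes per round trip, so the achievable throughput is bounded by the update size divided by the round-trip time, independent of link speed.

   def synchronous_throughput_MBps(update_size_bytes, rtt_seconds):
       """One outstanding update at a time: at most one update per round trip."""
       return update_size_bytes / rtt_seconds / 1e6

   # 64 KB log updates over a 1 msec metro round trip vs. a 50 msec WAN round trip:
   print(synchronous_throughput_MBps(64 * 1024, 0.001))   # ~65.5 MB/s
   print(synchronous_throughput_MBps(64 * 1024, 0.050))   # ~1.3 MB/s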
For greater throughput, the SCSI task queuing mechanism allows an initiator to have multiple commands outstanding at the target simultaneously and to express ordering constraints on the execution of those commands. The task queuing mechanism is only effective if
the commands arrive at the target in the order they were presented to the initiator (FIFO order). The iSCSI standard must provide an
ordered transport of SCSI commands, even when commands are sent along different network paths (see Section 5.2 SCSI). This is referred to
as "command ordering".
The iSCSI protocol MUST operate over a single TCP connection to accommodate lower cost implementations. To enable higher performance
storage devices, the protocol should specify a means to allow operation over multiple connections while maintaining the behavior of
a single SCSI port. This would allow the initiator and target to use multiple network interfaces and multiple paths through the network for increased throughput. There are a few potential ways to satisfy the multiple path and ordering requirements.
A popular way to satisfy the multiple-path requirement is to have a driver above the SCSI layer instantiate multiple copies of the SCSI
transport, each communicating to the target along a different path. "Wedge" drivers use this technique today to attain high performance. Unfortunately, wedge drivers must wait for acknowledgement of completion of each request (stop-and-wait) to ensure ordered updates.
Another approach might be for the iSCSI protocol to use multiple instances of its underlying transport (e.g., TCP). The iSCSI layer would make these independent transport instances appear as one SCSI transport instance and maintain the ability to do ordered SCSI command queuing. This document will refer to this technique as "connection binding" for convenience.
The iSCSI protocol SHOULD support connection binding, and it MUST be optional to implement.
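The following Python sketch illustrates one possible way connection binding could be realized, assuming (hypothetically; this mechanism is not specified in this document) that every command carries a session-wide sequence number so the target can restore the initiator's FIFO order regardless of which bound connection delivered the command.

   import itertools
   import struct

   class BoundSession:
       """Initiator side: presents several transport connections as one
       SCSI transport instance while preserving command order (sketch)."""

       def __init__(self, connections):
           self.connections = connections               # e.g., open TCP sockets
           self.next_conn = itertools.cycle(connections)
           self.cmd_seq = itertools.count(1)            # session-wide ordering

       def send_command(self, command_pdu):
           # Stamp the command with its session-wide sequence number, then
           # send it on any of the bound connections (round-robin here).
           seq = next(self.cmd_seq)
           next(self.next_conn).sendall(struct.pack("!I", seq) + command_pdu)
           return seq

   class TargetReorderQueue:
       """Target side: deliver commands to the SCSI layer strictly in
       sequence-number order, even if connections race one another."""

       def __init__(self):
           self.expected = 1
           self.pending = {}

       def receive(self, seq, command_pdu):
           self.pending[seq] = command_pdu
           deliverable = []
           while self.expected in self.pending:
               deliverable.append(self.pending.pop(self.expected))
               self.expected += 1
           return deliverable

The target-side queue is what allows multiple network paths to coexist with ordered SCSI command queuing: a command that arrives early on one connection simply waits until its predecessors have arrived on the others.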
In the presence of connection binding, there are two ways to assign features to connections. In the symmetric approach, all the connections are identical from a feature standpoint. In the
asymmetric model, connections have different features. For example,
some connections may be used primarily for data transfers whereas others are used primarily for SCSI commands.
Since the iSCSI protocol must support the case where there is only one transport connection, the protocol must allow command, data, and status to travel over the same connection.
In the case of multiple connections, the iSCSI protocol must keep the command and its associated data and status on the same connection (connection allegiance). Sending data and status on the same connection is desirable because this guarantees that status is received after the data (TCP provides ordered delivery). In the case where each connection is managed by a separate processor, allegiance decreases the need for inter-processor communication. This symmetric
approach is a natural extension of the single connection approach.
An alternate approach that was extensively discussed involved sending all commands on a single connection and the associated data and
status on a different connection (asymmetric approach). In this scheme, the transport ensures the commands arrive in order. The protocol on the data and status connections is simpler, perhaps lending itself to a simpler realization in hardware. One disadvantage of this approach is that the recovery procedure is
different depending on whether a command connection or a data connection fails. Some argued that this approach would require greater inter-processor communication when connections are spread across processors.
The reader may reference the mail archives of the IPS mailing list between June and September of 2000 for extensive discussions on
symmetric vs asymmetric connection models.