Overview
These notes just scratch the surface of the very deep, interesting, and practical area of switch architecture. We emphasize some key concepts that come up often in the design of switches and that have wide applicability across multiple technologies. To start we take a look at how the marketing and sales of commercial switches can get a bit giddy.
Building a Switch with Marketing's Help
Our goal will be to create the highest capacity switch ever built for the lowest cost!
- For highest capacity we'll have it work with DWDM signals.
- Assume each fiber has 100 wavelengths, each carrying 100Gbps, which implies 10Tbps per link (Marketing says we can double or triple this based on the wavelength capacity of a fiber).
- Assume that we can get 100 fiber pairs onto the front panel with careful cable management.
- We now have a switch with 100 x 10Tbps = 1.0Pbps! The world's first petabit per second capacity switch.
But how will the switch actually work?
- We'll have "double" receptacles for each front panel plug.
- When a connection request comes in we'll page Bob our engineering technician and have him connect a "patch cord" between the fibers.
- Hmm, this seems a bit fishy and familiar. Isn't this just a fiber optic patch panel? Sounds like we reinvented an old fashioned phone switchboard. See Figure 1!
Do folks see any issues with this?
- Doesn't it seem like this approach would have really slow switching times? No offense to Bob. Hmm, we really need to automate this...
- Any other issues? What's the smallest amount of bandwidth our switch can switch? This is the concept of switching granularity.
Switch Performance Measures
As the previous section showed there is more than just one dimension to switch performance, e.g., raw switching capacity. In general we will be interested in:
- Throughput: the aggregate rates that are switched. Along these lines we have from fastest to slowest:
- Port switching, i.e., like an automatic patch panel. Since WDM fibers currently carry the most bits, switching between such fibers offers the highest "throughput" switching.
- Wavelength switching as done in wavelength switched optical networks. Each wavelength can carry 10Gbps, 40Gbps, 100Gbps or more.
- Time Division Multiplexed (TDM) switching. These channels generally run near the rates of individual wavelength channels.
- Packet switching. These generally run near the rates of individual TDM or wavelength channels.
- Granularity: The problem with our previously designed switch was its coarse granularity (10 terabits per second). Few users can directly deal with flows such as this. Hence finer granularity is generally desired and can result in less bandwidth being "wasted" or "stranded". Note that finer granularity switching is currently done electronically rather than optically. From finest to coarsest we have:
- Packet switching can switch just one packet between a source and destination; there is no minimum granularity imposed by the technology.
- TDM: Older TDM technology included 64kbps channels to accommodate voice. However most voice traffic is being moved over to packet networks. Some private line services still use 1.5Mbps (T1) or 2Mbps (E1) channels, but most TDM services are based on either 50Mbps or 2.5Gbps granularity.
- Wavelength switching: most wavelengths today are used for 10Gbps or higher signals, resulting in fairly coarse granularity.
- Port switching
- Cost & Power per bit (from least expensive to most expensive)
- Port switching (optical)
- Wavelength switching
- TDM switching
- Packet switching
- Time to Switch: Some devices, especially optical technologies, can take fairly long times to transition from one switching configuration to another. Faster changes are better. In general we have from fastest to slowest:
- Packet switching
- TDM switching
- Optical switching (wavelength or ports)
General Structure of Switches
Telecommunication and data communications switches come in a wide variety of shapes and sizes, and are implemented in a wide variety of ways. We'll highlight the general structure of a switch here. In general we can decompose a switch into the following parts as shown in Figure 2.
- Input processing
- Optical/Wavelength switching: may include optical amplification or automatic gain control (to keep signal levels within the range of the switch fabric)
- TDM: may include optical to electrical conversion, framing, link related OA&M, processing of overhead bytes, serial to parallel conversion.
- Packet switching: may include optical to electrical conversion, serial to parallel conversion, physical and link layer processing, input packet processing (we'll discuss this more later)
- Output processing
- Optical/Wavelength switching: may include wavelength combining and optical amplification.
- TDM: may include: parallel to serial conversion and framing, electrical to optical conversion.
- Packet switching: output packet processing (including queueing), link and physical layer processing; may include parallel to serial conversion, electrical to optical conversion. We'll be studying more on output processing when we look at Quality of Service (QoS) in packet networks.
- Switching Fabric
- Responsible for "transporting" signals, TDM frames, or packets (or parts of packets) between input processing and output processing sections. We'll be studying commonly used techniques to create switch fabrics in these lectures.
- Switch Control: Coordinates input processing, output processing, and the switch fabric to produce the desired switching operations. Interfaces with entities outside the switch via management and control interfaces and protocols.
Crossbar Switch Fabrics
The simplest concept of a switch fabric is that known as a crossbar and is shown schematically in Figure 3. Here we have N inputs that could be connected to any one of N outputs. For now we can simply think of simple port switching but the results and concepts apply to other forms of switching. In a crossbar fabric, making a "connection" between an input port and an output port requires the use of a "cross point" element (some type of mechanical, electrical, or optical device). When a particular "cross point" element has been enabled we say a "cross connect" has been made.
The name "crossbar" stems from the original mechanical electrical switches used in the telephone system where actual conducting "bars" were used to make connections semi-automatically. See Figure 4.
For automating our optical patch panel we need a mechanism to automate the "cross connects", i.e., to mechanize the cross points.
- How many cross points do we need for an N x N switch? An M x N switch?
- In the old mechanical switches and in current optical technology cross points can comprise the bulk of the switch cost and limit switch size.
- In electrical switches IC pin limits, fan out, delay, and signal degradation are important too.
- A big issue as the phone system got bigger was reducing switch cost, which implied trying to reduce the number of cross points in a switch of a given size.
Multi-Stage Switch Architectures
One approach that is both practical and can be effective is to build a large switch fabric out of a network of smaller sized switch fabrics. However arbitrary arrangements of elements may fail to yield any advantage as the following example shows.
Example: a 9 x 9 switch created with 3 x 3 elements
Consider my attempt shown in Figure 5 to create a 9x9 switch from 3x3 switching elements. As we can see I built a 9x9 switch out of 3x3 elements. So I managed to create a large switch from smaller packages. Did I reduce the number of cross points needed? A 9x9 switch needs 81 cross points, a 3x3 switch needs only 9 cross points, but I used nine 3x3 switches so ended up using 81 cross points again. Hence my design has no advantage with respect to cross points. Very disappointing so far, but it gets worse.
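For those who like to see the counting spelled out, here is a quick sketch of the cross point bookkeeping for this example (the numbers come straight from the paragraph above):

```python
# Cross point counts for a single 9x9 crossbar versus nine 3x3 crossbars.
def crossbar_crosspoints(inputs, outputs=None):
    """An N x M crossbar needs one cross point per input/output pair."""
    return inputs * (outputs if outputs is not None else inputs)

single_stage = crossbar_crosspoints(9)          # 81
multi_stage = 9 * crossbar_crosspoints(3)       # nine 3x3 elements: also 81
print(single_stage, multi_stage)
```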
Now let's try to route connections across this multi-stage switch fabric. We denote a particular connection between an input port and an output port by a capital letter such as A, B, C, etc. We will place this letter alongside the respective input and output port to be connected as shown in the figure below. To make use of a multi-stage fabric we need to find a path across the fabric whose links have not been used by another connection. Remember we are thinking in terms of port switching, also known as space switching, right now, and each connection needs its own path and cannot share a path with any other connection.
In Figure 6 we give unique colors to the paths taken by connections so we can easily tell them apart (this has nothing to do with wavelength switching right now!). The figure shows successful paths for the connections A through D, but for connection E no path can be found that allows us to connect input switch #3 with output switch #1. Hence we see that connection E is blocked. Can we remedy this situation? If you look at connection D you'll see that three different paths are feasible (through any of the middle switches). If connection D were re-routed through middle switch #2 rather than #3 then connection E could be accommodated through middle switch #3 to reach output switch #1.
As we have just seen, an arbitrary attempt to design a multi-stage switch fabric resulted in no cost benefit (in terms of number of cross points) and performance issues (internal switch blocking). Some of the problems faced when developing a multi-stage switching network include:
- The complexity of interconnection of the switching elements
- The complexity of determining a path across this "mini-network"
- Difficulty in figuring out if we can satisfy multiple connection requests simultaneously or incrementally.
Many different multi-stage switch fabric architectures have been proposed over the years; we will discuss one of the oldest and most successful. Variants of the design we will study have been and are used in TDM switches, optical switches, and in the design of some of the highest performance datacenter networks yet built.
Three Stage Clos Networks
General Multi-Stage Clos Network: An interconnection network of smaller sized switches with a very simple and specific structure, Figure 7:
- Each switch in each stage is the same size in terms of number of inputs and outputs
- Each switch connects to one and only one switch in the next stage
- At stage k we have $r_k$ crossbar switches, each with $m_k$ inputs and $n_k$ outputs
- For a 3 stage Clos network there are only five free parameters: $m_1$, $n_3$, $r_1$, $r_2$, and $r_3$ (a small sketch capturing these parameters follows this list)
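As a way of keeping the five free parameters and the switch sizes they imply straight, here is a small sketch (the class and attribute names are mine, not standard notation beyond the $m$, $n$, $r$ symbols used above):

```python
from dataclasses import dataclass

@dataclass
class ThreeStageClos:
    m1: int   # inputs per first-stage switch
    n3: int   # outputs per third-stage switch
    r1: int   # number of first-stage switches
    r2: int   # number of middle-stage switches
    r3: int   # number of third-stage switches

    def stage_sizes(self):
        # "Each switch connects to one and only one switch in the next stage"
        # fixes the remaining dimensions: first stage is m1 x r2, middle is
        # r1 x r3, and third stage is r2 x n3.
        return {"input": (self.m1, self.r2),
                "middle": (self.r1, self.r3),
                "output": (self.r2, self.n3)}

    def total_ports(self):
        return (self.m1 * self.r1, self.n3 * self.r3)   # (inputs, outputs)

    def crosspoints(self):
        return (self.r1 * self.m1 * self.r2 +
                self.r2 * self.r1 * self.r3 +
                self.r3 * self.r2 * self.n3)
```

For example, ThreeStageClos(m1=3, n3=2, r1=2, r2=4, r3=3) describes the small fabric used in the Paull's matrix bookkeeping example below.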
Strictly Non-blocking Clos Networks
Before we worry about cost optimization of our multi-stage fabric we first want to make sure that we can make connections across the fabric. This leads us to the following:
Definition (strict-sense non-blocking): A switching network is strict sense non-blocking if a path can be set up between any idle transmitter and any idle receiver without disturbing any paths previously set up.
Due to the relatively simple structure of a 3-stage Clos network we will be able to determine a simple criterion for strict sense non-blocking via some "creative accounting".
Bookkeeping Connection Routing with Paull's Matrix [Paull1962]
Consider the 3-stage Clos network shown in Figure 8 with $m_1 = 3$, $r_1 = 2$, $r_2 = 4$, $r_3 = 3$, and $n_3 = 2$. In general we will use numbers to denote the first (input) and third (output) stage switches and lowercase letters to denote the second (middle) stage switches.
Now consider the four connections set up across our example switch fabric denoted with capital letters: A, B, C, D (in that order). See Figure 9.
- What would a complete specification of the path taken by connections A-D look like?
- What is the simplest way we can denote the paths taken by A-D?
A connection in a 3-stage Clos network is uniquely identified by its input switch, output switch, and middle switch identifiers. This is due to there being only one link between each switch in a stage and a switch in the subsequent stage. One way to represent this in a form amenable to proving theorems or establishing algorithms is via Paull's matrix [Paull1962]. This "matrix" has $r_1$ rows (one for each input switch) and $r_3$ columns (one for each output switch). We then put a letter representing the middle switch used for each connection in the row/column corresponding to the input and output switches. Note that we can have more than one letter per matrix entry. For our 4 connections above this leads to the matrix shown in Figure 10.
We show a general diagram of Paull's matrix in Figure 11 and note the following important properties of this "matrix" (a small data-structure sketch follows the list):
- At most $m_1$ entries per row. Reason: A row represents a particular input switch and these switches have only $m_1$ input ports. Hence we can never get more than $m_1$ entries in a row.
- At most $n_3$ entries per column. Reason: A column represents a particular output switch and these switches have only $n_3$ output ports. Hence we can never get more than $n_3$ entries in a column.
- All entries in any row or column must be unique, with a maximum of $r_2$ total entries in any row or column. Reasoning: The uniqueness of entries in a row (the letters denoting the middle switches) is due to each input switch connecting to each middle switch by one and only one link. The uniqueness of entries in a column stems from each middle switch connecting to each output switch by one and only one link. The maximum number of distinct entries in a row or column is limited by the number of middle switches $r_2$.
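A minimal sketch of Paull's matrix as a data structure, with a check of the three properties just listed. The four example connections at the end are hypothetical: they are consistent with the Figure 8 parameters but are not taken from Figures 9 and 10.

```python
def check_paull(matrix, m1, n3, middles):
    """matrix maps (input_switch, output_switch) -> set of middle-switch symbols."""
    for r in {r for r, _ in matrix}:
        syms = [s for (ri, _), e in matrix.items() if ri == r for s in e]
        assert len(syms) <= m1, "more entries in a row than input ports"
        assert len(syms) == len(set(syms)), "middle switch repeated in a row"
    for c in {c for _, c in matrix}:
        syms = [s for (_, ci), e in matrix.items() if ci == c for s in e]
        assert len(syms) <= n3, "more entries in a column than output ports"
        assert len(syms) == len(set(syms)), "middle switch repeated in a column"
    assert all(e <= set(middles) for e in matrix.values()), "unknown middle switch"

# Hypothetical connections on a fabric with m1=3, r1=2, r2=4, r3=3, n3=2.
example = {(1, 1): {"a"}, (1, 2): {"b"}, (2, 1): {"b"}, (2, 3): {"c"}}
check_paull(example, m1=3, n3=2, middles="abcd")
```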
Now we are in a position to establish Clos's theorem on when a 3-stage Clos network is strict sense non-blocking.
Worst case situation:
- An input switch X has $m_1-1$ of its inputs used
- An output switch Y has $n_3-1$ of its outputs used
- We want to connect the last unused input $m_1$ of X to the last unused output $n_3$ of Y.
Paull's matrix bookkeeping implications:
- For the row corresponding to input switch X we will have used $m_1-1$ unique symbols. Remember symbols (lowercase letters) identify the middle switch used in the connection.
- For the column corresponding to output switch Y we will have used $n_3-1$ unique symbols.
- In the worst case all of the symbols used are distinct, so we have used a total of $m_1 + n_3 - 2$ symbols (middle switches), and to complete this last connection we need one more. Hence our strict sense non-blocking criterion is that the number of middle switches satisfies $r_2 \geq m_1+n_3-1$.
Theorem (Clos): A 3-stage Clos network is strict sense non-blocking if $r_2 \geq m_1+n_3-1$.
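The worst-case counting above translates directly into a one-line check (a sketch; the example numbers are those of Clos's 36-port design discussed next):

```python
def strict_sense_nonblocking(m1, n3, r2):
    # m1 - 1 middle switches may be tied up by the other inputs of switch X,
    # n3 - 1 more by the other outputs of switch Y, and the new connection
    # needs one middle switch that is free on both sides.
    return r2 >= m1 + n3 - 1

print(strict_sense_nonblocking(m1=6, n3=6, r2=11))   # True
```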
Better Cross Point Scaling with a Clos Network
Now that we've seen how to ensure that a 3-stage Clos network will be strict sense non-blocking, let's see if we can do better on the cross point count. In Figure 12 we show an example from Clos's original 1953 paper [Clos1953]. In this case the goal was to create a switch with $N=36$ input and output ports. The parameters $m_1 = 6$ and $n_3 = 6$ were chosen, which implied that $r_1 = 6$ and $r_3 = 6$ to satisfy the sizing and that $r_2 = 11$ to ensure strict sense non-blocking.
The input switches have size 6x11 and there are 6 of them, leading to 396 cross points. The middle switches have size 6x6 and there are 11 of them, leading to 396 additional cross points. Finally the output switches have size 11x6 and there are 6 of them, leading to another 396 cross points, giving us a total of 1188 cross points. For a 36x36 crossbar we would need 1296 cross points, so we have saved 108 cross points! Okay, that doesn't sound so impressive, but let's see how this scales up...
To scale his fabrics up Clos made the following assumptions:
- Symmetrical network (same number of inputs as outputs = N)
- Number of network inputs is a perfect square $N = n^2$
- The three-stage network parameters are:
- $m_1 = n$ and $r_1 = n$: the number of inputs on the first stage switches and the number of first stage switches.
- $n_3 = n$ and $r_3 = n$: the number of outputs on the third stage switches and the number of third stage switches.
- Clos's theorem for strict sense non-blocking then requires $r_2 = 2n-1$
Now let's do the bookkeeping on the number of crosspoints:
- Input switches ($n$ of them)
- Size = $n \times (2n-1)$
- Input switches total cross points = $2n^3 - n^2$
- Middle switches ($2n-1$ of them)
- Size = $n \times n$
- Middle switches total cross points = $2n^3 - n^2$
- Output switches ($n$ of them)
- Size = $(2n-1) \times n$
- Output switches total cross points = $2n^3 - n^2$
- Grand total number of cross points
- $6n^3 - 3n^2$, but since $N = n^2$ this gives $6N^{3/2} - 3N$.
Hence instead of $O(N^2)$ a 3-stage Clos network has $O(N^{1.5})$ growth! In Figure 13 we show the comparison table from Clos's original paper which illustrates the crosspoint savings as $N$ grows larger.
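A quick sketch that reproduces the numbers quoted above and the $6N^{3/2} - 3N$ scaling (assuming the symmetric construction with $N = n^2$ and $r_2 = 2n - 1$):

```python
def crossbar_crosspoints(N):
    return N * N

def clos_crosspoints(N):
    n = round(N ** 0.5)
    assert n * n == N, "construction assumes N is a perfect square"
    r2 = 2 * n - 1                     # strict sense non-blocking middle stage
    return n * (n * r2) + r2 * (n * n) + n * (r2 * n)   # stages 1, 2, 3

for N in (36, 10000):
    print(N, crossbar_crosspoints(N), clos_crosspoints(N), int(6 * N**1.5 - 3 * N))
# N = 36    -> 1296 crossbar vs 1188 Clos
# N = 10000 -> 100,000,000 crossbar vs 5,970,000 Clos
```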
Pin Count and Power Limitations
As we move forward from the 1950's to the 2000's and beyond, Clos networks are still useful and important, but other optimization criteria besides the number of cross points may take precedence. In Clos's table we saw that a crossbar based switch with 10,000 inputs and outputs would require 100 million cross points and a three-stage Clos less than 6 million. With electrical technology one can implement a cross point with a single CMOS transistor (ignoring fanout and fan-in issues). A relatively modern smart phone circa 2015 will contain one to two billion transistors. Hence today we have tons of cross points to spare. However, power and heat dissipation is still an issue, and by reducing the number of cross points we significantly save on power.
There is a more subtle issue that gives rise to an interest in Clos networks in large switch implementations, and that is pin count limitations associated with integrated circuit packaging. While the amount of logic that one can put on an integrated circuit has increased exponentially over the years, the increase in the number of input and output pins (electrical connections) one can attach to an integrated circuit has not kept pace. Two of the densest forms of packaging are ball grid arrays and flip chip ball grid arrays. These dense packaging technologies have pin count limits in the range of 3000 pins. If we wish to use a single integrated circuit as a switching element in a multi-stage fabric we must keep these limits in mind.
Rearrangeably Non-Blocking Clos Networks
In Clos's symmetric 3-stage fabric $m_1 = n$, $n_3 = n$, and $r_2 = 2n-1$ for strict sense non-blocking. This led to switches of sizes $n \times (2n-1)$, $n \times n$, and $(2n-1) \times n$ for stages 1 through 3 respectively. We see that the strict sense non-blocking criterion of $r_2 = 2n-1$ is pushing up the first and third stage switch sizes and hence their pin counts. Is there any less rigorous requirement than that of strict sense non-blocking? In particular, what if we are allowed to rearrange the paths taken by existing connections through the fabric to accommodate a new connection? It turns out there is a fundamental result along these lines.
Theorem (Slepian-Duguid Theorem) [Hui1990]
A three stage Clos network is rearrangeably non-blocking if and only if $r_2 \geq \max(m_1, n_3)$.
This result, if we can tolerate the rearrangements, allows for a pin count reduction of about $1/3$ for the first and third stage switches. Let's see an example of how this works. In Figure 14 below we show a 3-stage Clos network built from 4x4 switches ($m_1=4$, $n_3 = 4$); since $r_2 = 4$ the criterion for strict sense non-blocking is not met, and we see that there is no open path to make the connection J which wants to go from input switch 3 to output switch 3.
In Figure 15 we show that by rearranging the paths of connections C, D, and H, a path through the fabric can be found for connection J.
Figuring out which connections to rearrange and where to move them is non-trivial; one algorithm for doing this is given below. Note that these techniques (and more advanced ones) are actually used in real switches. The author of these notes led a software development team on a very large electro-optical switch that featured a three-stage rearrangeably non-blocking Clos fabric and that saw worldwide deployment in carrier optical backbone networks.
Algorithm to establish a connection between input switch SI and output switch SO, given Paull's matrix of existing connections for a Clos network satisfying the above criterion.
Step 1 (no rearrangement, connect). Check if there exists a middle switch "symbol" (e.g., we've been using lowercase letters) which is found in neither row SI nor column SO. If so then you can use the middle switch represented by this symbol to make the connection. Otherwise go to step 2.
Step 2 (discovering a rearrangement chain).
(a) Find a symbol x in row SI which is not found in column SO and find a symbol y in column SO which is not found in row SI. (Note that such symbols must exist.) Remember these x and y symbols and where you found them.
(b) Check the row where you found the y above to see if an x symbol is present; if not, proceed to step 3, otherwise remember this x and go to step (c).
(c) Check the column that the x from step (b) appeared in to see if a y appears; if not, go to step 3, otherwise remember this y and go to step (b).
Step 3 (rearrange and connect). Put a y in (SI, SO) of Paull's matrix and swap all x with y in the "chain" from step 2. (Note that the first x, the one found in row SI in step 2(a), is not part of the chain.) A sketch of these steps in code is given below.
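Below is a sketch of these steps in Python, using the same dictionary representation of Paull's matrix as before: (input switch, output switch) mapped to a set of middle-switch symbols. It follows steps 1 through 3 literally and assumes the fabric satisfies the Slepian-Duguid condition; a production implementation would of course need error handling and would drive the actual hardware.

```python
def _row(matrix, si): return {s for (r, _), e in matrix.items() if r == si for s in e}
def _col(matrix, so): return {s for (_, c), e in matrix.items() if c == so for s in e}

def _find_in_row(matrix, si, sym):
    return next(((r, c) for (r, c), e in matrix.items() if r == si and sym in e), None)

def _find_in_col(matrix, so, sym):
    return next(((r, c) for (r, c), e in matrix.items() if c == so and sym in e), None)

def connect(matrix, middles, si, so):
    """Route a connection from input switch si to output switch so,
    rearranging existing connections if needed."""
    used_row, used_col = _row(matrix, si), _col(matrix, so)
    free = set(middles) - used_row - used_col
    if free:                                        # Step 1: no rearrangement needed
        matrix.setdefault((si, so), set()).add(min(free))
        return
    x = min(used_row - used_col)                    # Step 2(a): busy on the input side only
    y = min(used_col - used_row)                    #            busy on the output side only
    chain = [_find_in_col(matrix, so, y) + (y,)]    # chain starts at the y in column so
    while True:
        cell = _find_in_row(matrix, chain[-1][0], x)    # Step 2(b): x in that row?
        if cell is None:
            break
        chain.append(cell + (x,))
        cell = _find_in_col(matrix, chain[-1][1], y)    # Step 2(c): y in that column?
        if cell is None:
            break
        chain.append(cell + (y,))
    matrix.setdefault((si, so), set()).add(y)       # Step 3: connect on y ...
    for r, c, sym in chain:                         # ... and swap x and y along the chain
        matrix[(r, c)].discard(sym)
        matrix[(r, c)].add(x if sym == y else y)
```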
Clos Networks and Large Data Centers
In 1985, with the goal of designing a better communications network within a highly parallel supercomputer, a network architecture called a fat tree was published [Leiserson1985]. A fundamental feature, shown in Figure 16, is that the network is arranged in a tree structure with the processors as the leaf nodes and, as one moves up the tree, more bandwidth is made available between the switches. This led to the name fat tree for such a structure.
In 2008, citing issues with traditional data center network topologies, as shown in Figure 17, for modern compute applications (Map/Reduce, etc.), Al-Fares et al. [Al-Fares2008] proposed a new data center topology based on a Clos topology with faster switch to switch links than switch to host links. Note that in the traditional data center topology shown below the host to switch links were 1Gbps and the switch to switch links were 10Gbps, hence the "fat tree" nature wasn't the new feature of their proposal.
What was new in the Al-Fares network, shown in Figure 18, was the use of a 3-stage Clos topology between their aggregation switches and their core switches. This may not initially look like the 3-stage Clos network that we studied, but this is only due to the links in the diagram being bidirectional as contrasted with our use of unidirectional links when we previously drew our Clos networks. Note also we are using the term switch in its generic sense; in the Al-Fares network Layer 3 switching (IP routing) was used.
Key points made by Al-Fares et al. [Al-Fares2008] include the need for Equal Cost Multi-Path (ECMP) routing to distribute IP traffic over multiple paths, and the fact that the Clos network built in this manner will be rearrangeably non-blocking.
As of 2014 we saw Clos topologies for data centers showing up in a Juniper white paper, see Figure 19. Some nicer terminology is coming into use, with the "edge" switches called leaf switches and the "core" switches called spine switches. At the end of 2014 one very big data center operator, Facebook, announced that it was using 5-stage Clos networks with commodity switch fabrics in Facebook's data center fabric. Note that this post includes a nice video with animations. For a very complete review of Clos for data centers featuring a control plane based on BGP, see BGP large DC.
Finer Granularity Switches
We've looked at how to create very large capacity switches via cross points, crossbars, and multi-stage switch fabrics. There is one problem with the switches we've designed so far: they switch the entire "content" from an input port to a selected output port. This is extremely coarse granularity switching, and while useful in the proper context we will need other techniques to achieve finer granularity. We also note that the coarse granularity "port to port" switching we have been discussing is also called space switching.
Now we will look at general techniques useful for finer granularity switching. All our techniques will be based on electronic rather than optical technology.
Slowing Things Down for Electronic Processing
In Figure 20 below we show standardized interfaces, OIF-SFI-S-01, between (right to left) an optical transmitter and receiver, a SerDes (serializer-deserializer), an FEC (forward error correction) processor, and a TDM (G.709, SDH, SONET) or link layer framer. In the specification the number of lanes n ranges from 4 to 20. This specification is aimed at optical interfaces running at between 80-160Gbps. Note that since the signals have yet to be processed by the framer no byte boundaries have been determined.
In Figure 21 we show an early 2000's system packet interface, OIF-SPI5-01, which featured 16-bit wide data interfaces for transmitting to and receiving from the somewhat misnamed PHY device (really a TDM framer). This was aimed at optical interfaces running in the 40Gbps range.
Such conversion from serial to parallel isn't just used in optical situations. In Figure 22 we show a layer diagram for Gigabit Ethernet from IEEE 802.3-2012 section 3, chapter 34. We see in this diagram that above the various physical sublayers (PMD, PMA, PCS) there is a gigabit media independent interface (GMII).
The GMII interface, Figure 23, features 8-bit wide transmit and receive interfaces. In Figure 24 an example use of the GMII interface is shown, from a Micrel Gigabit Ethernet transceiver datasheet.
Note that while wider interfaces permit the use of lower speed signals they also use more integrated circuit pins. Hence much of the "widening" nowadays occurs internal to a chip.
TDM switching via Time Slot Interchange
As our first example of a fine granularity switch let's build a switch that works with time division multiplexed signals and switches at the level of a DS0 (64kbps). In Figure 25 we show a block diagram of our switch that takes in 10 bi-directional T1 lines and can switch any (input port, time slot) to any (output port, time slot).
Basic idea: write all the incoming data into bytes of memory associated with the particular port and time slot. Read data back out to the proper output port and time slot. A "memory map" for this is shown in Figure 26. Note that we are essentially switching in time (via time slot interchange) and in space (across ports).
How much memory will we need? Let's assume that everything is synchronous (not really true, but framers have small FIFOs to take up the slack), and that we can't read and write the same memory location at the same time. In this case we'll use two memory locations per time slot and alternate reading and writing from those locations. So given 10 T1 lines, 24 byte time slots per T1, and 2 memory spaces per timeslot = 480 bytes of memory. That is an insignificant amount of memory even for a micro-controller.
How fast does the memory need to operate? Assume byte-wide memory and that read and write operations must be separate. For a T1 we have 24 bytes every 125us, 10 T1s => 240 bytes every 125us, read/write separate ==> 480 accesses every 125us ==> about 0.26us per byte access.
How does this compare with modern SRAM access times? Cheap small SRAMs have access times between 15-150ns, hence are roughly two to ten times as fast as this.
Or we can think of this as (10 T1s)*(1.5Mbps per T1)*(2 for separate read/write) = 30Mbps memory transfer rate.
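The arithmetic above, as a sketch:

```python
T1_LINES, SLOTS_PER_T1 = 10, 24    # DS0 byte time slots per T1 frame
FRAME_TIME = 125e-6                # seconds per frame (8 kHz framing)

buffer_bytes = T1_LINES * SLOTS_PER_T1 * 2          # double buffered: 480 bytes
accesses_per_frame = T1_LINES * SLOTS_PER_T1 * 2    # separate read and write per slot
time_per_access = FRAME_TIME / accesses_per_frame   # ~0.26 us per byte access
memory_rate_bps = accesses_per_frame * 8 / FRAME_TIME   # ~30 Mbps

print(buffer_bytes, time_per_access, memory_rate_bps / 1e6)
```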
Now let's design a packet switch via the shared memory approach. How about 8 ports of 1Gbit Ethernet? Assume that we might have to buffer up to 100 Ethernet MTUs per output port.
How much memory space would we need? (8 ports)*(1500 bytes/MTU)*(100 MTUs) = 1.2e6 bytes. Much bigger (more than three orders of magnitude) than the TDM case.
How fast would the memory need to operate? Raw: 16Gbits/sec for separate read and write. Where can we get this type of memory speed and size? Or should we give up? First let's check the speeds of common DRAM used in computers at Wikipedia (DRAM). Looking at this table we note that 133MHz DDR comes in at around 17Gbps. But wait a second, isn't SRAM supposed to be faster than DRAM? Where did this extra speed come from? DDR SDRAM uses a bus that is 64 bits wide. Would someone actually make a switch this way? Yes and no. In Figure 27 we show a block diagram for an Ethernet switching chip circa 1999 which features access to external DRAM via a "Rambus DRAM controller".
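And the corresponding sizing sketch for the 8-port Gigabit Ethernet shared memory switch (the 100 MTUs of buffering per port is the assumption stated above):

```python
PORTS, PORT_RATE = 8, 1e9            # eight 1 Gbps Ethernet ports
MTU_BYTES, BUFFER_MTUS = 1500, 100   # assumed buffering per output port

buffer_bytes = PORTS * MTU_BYTES * BUFFER_MTUS   # 1.2e6 bytes of packet buffer
memory_rate = PORTS * PORT_RATE * 2              # 16 Gbps for separate read and write

print(buffer_bytes, memory_rate / 1e9)
```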
With the ever increasing density of memory and "system on a chip" (SoC) technology that allows integration of relatively large amounts of memory, packet buffers are often kept on chip, as Figure 28 from a relatively recent Marvell product brief shows. This trend is seen with other manufacturers and discussed at length in a 2012 Broadcom white paper where they state: "Today, traditional fixed switch designs using distributed chipsets and external packet buffers have been largely replaced by highly integrated devices with on-chip buffering."
There are limits to the size of the memory that can be put on chips along with the switching functionality. Public information isn't readily available on how large these are on current chips; however an estimate can be gotten by looking at trends in L3 caches in high performance processors, where these can be 8MB or more (circa 2015). The other interesting trend noted in the L3 cache article is that in new multi-core processors the L3 cache is shared amongst multiple processor cores. The shared L3 cache scenario is very similar to a shared memory switch fabric, and the crux of the previously noted Broadcom whitepaper was on shared memory management.
Input Packet Processing
After we've succeeded in getting our packet into our switch, we now need to process it and send it out. What kind of processing might be required on a packet? Any of the actions that we learned about when we studied tunneling, VLANs, MPLS, and software defined networks (SDNs). Here's just a short sample of the packet processing that might be done on initial input:
- IP Processing
- Longest match address prefix lookup
- Decrement TTL counter
- Modify type of service bits and/or congestion control bits
- Encapsulation/de-encapsulation for tunneling protocols
- MPLS Processing
- FEC (Forwarding Equivalence Class) classification (edge Label Switch Routers)
- Label lookup
- Label push and pop operations
- Ethernet
- Ethernet address lookup
- VLAN tag addition or removal
- Change QoS bits
- SDN (OpenFlow data plane processing)
- Priority based generalized packet header matching
- Generalized packet header processing
How much time do we get to do these and potentially other operations? To avoid our packet processing becoming the bottleneck (rather than the rate of the ingress or egress port) we need to complete our header processing by the time the entire packet is completely received. Note that except for potentially some checksums that are done over the entire packet (and are readily computable at line rate in hardware), the input processing (and most output processing) is only concerned with processing the header. Hence a larger total packet size gives us more time to process the header. Thus we see that smaller packets are more challenging for input processing than larger ones.
Example
What is the worst case header processing rate we would need to meet for the eight 1Gbps port Ethernet switch we investigated in our shared memory packet switch design? Assume the smallest Ethernet packet is 64 bytes long. The worst case total switch input rate is 8Gbits per second, so we would get a worst case input packet rate of 8Gbps/(64 bytes/packet * 8 bits/byte) = 15.625 million packets/second. Or 64 nanoseconds per packet worst case.
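The same budget, sketched as arithmetic:

```python
PORTS, PORT_RATE = 8, 1e9          # eight 1 Gbps ports
MIN_FRAME_BITS = 64 * 8            # smallest Ethernet frame

packets_per_second = PORTS * PORT_RATE / MIN_FRAME_BITS   # 15.625 million pps
budget_per_packet = 1.0 / packets_per_second              # 64 ns with one processor
print(packets_per_second / 1e6, budget_per_packet * 1e9)
```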
How could we get more time to process our packets? One approach is partitioning the input processing amongst groups of links rather than processing all links with the same input processor. In the extreme we could dedicate an input processing unit to each link, and in the example above this would give us 512 nanoseconds to process the header. The downside of such an approach is that it requires us to duplicate the hardware required for input processing multiple times, thereby driving up the cost of the hardware. To help switch buyers tell if their switch manufacturer has cut corners on input processing capabilities or in other areas, switching benchmarks (test procedures) should be consulted. One useful LAN switching benchmark is RFC2889.
Lookups, Matches, and Such
What is the most computationally difficult operation that needs to be performed on packet ingress? In the bullet list of packet operations I purposely put the hardest first: IP: longest match prefix lookup; MPLS: FEC matching and label lookup; Ethernet: address lookup; SDN: prioritized general header matching. All of these are variants of a search problem, and hence as the number of entries in our table (IP prefixes, Ethernet addresses, MPLS labels, SDN matches) grows we may run out of time to complete our search. Or, as is more commonly encountered, to reach the desired lookup speed specialized hardware is used that places fairly restrictive limits on the table size. In the following we quickly look at two common hardware techniques used to perform the search operation needed for input packet processing.
TCAMs for General Matches
A key aspect of IP addresses is the hierarchical nature of their assignment to geographic regions, countries, ISPs and such. This allows IP addresses within a given domain to be summarized by just a portion of the total length of the address, known as a prefix. This greatly reduces the size of IP address tables in routers. Since different domains may be vastly different in size, the prefixes used in IP routing have variable size and this leads to IP's longest match rule. For software based routers concerned with performance, optimizing such matches led to very specialized search algorithms based on trees.
As networking speeds increased the longest match search was implemented in specialized hardware known as a content addressable memory (CAM). The name content addressable memory is a bit confusing, but it's really just a hardware lookup/search implementation. Such specialized memories are often used in the memory management units of almost all advanced processors and in that case are called translation lookaside buffers (TLBs). For longest match searches or for general matching such as in MPLS FEC classification or SDN we need the ability to indicate "don't care" or "wild card" fields, and the type of CAM that supports this is called a ternary CAM (TCAM).
Now, in addition to longest match IP address searches, TCAMs play an important role in MPLS classification and OpenFlow. TCAMs may appear as separate integrated circuits or be integrated within system on a chip switch designs. Critical parameters include the bit width of the key (portion of packet header to search against), the size of the memory, and the lookup speed.
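To make the TCAM idea concrete, here is a software sketch of what the hardware does in a single cycle: compare the key against every (value, mask) entry and return the highest priority hit. The table contents and port names are purely illustrative; longest prefix matching falls out if prefixes are stored longest first.

```python
def tcam_lookup(key, entries):
    """entries: list of (value, mask, result) in priority order; a bit position
    is compared only where the mask bit is 1 ('ternary' = 0, 1, don't care)."""
    for value, mask, result in entries:
        if (key & mask) == (value & mask):
            return result
    return None

def ipv4_prefix(prefix, length, result):
    mask = ((0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF) if length else 0
    return (prefix & mask, mask, result)

# Illustrative table, sorted so that longer (more specific) prefixes come first.
table = sorted([ipv4_prefix(0x0A000000, 8,  "port 1"),    # 10.0.0.0/8
                ipv4_prefix(0x0A010000, 16, "port 2"),    # 10.1.0.0/16
                ipv4_prefix(0x00000000, 0,  "default")],
               key=lambda e: bin(e[1]).count("1"), reverse=True)

print(tcam_lookup(0x0A010203, table))   # 10.1.2.3 matches the /16 -> "port 2"
```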
Hardware Hash Tables
While very general and fast, TCAMs can be costly for the amount of table space they provide. For more specific types of lookup other mechanisms have often been employed. If one considers destination based Ethernet forwarding, the lookups are based on the entire 6 byte Ethernet address with no wildcard fields involved. Such a restrictive lookup (search) is amenable to fast hardware implemented hash tables. For an overview of how these can be implemented see this post on hardware hash table implementation issues.
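A minimal sketch of the exact-match case in the style of a hardware table: a fixed number of buckets with a fixed number of entries per bucket. The table dimensions and the hash function are illustrative only; real designs typically use CRC-style hashes and handle bucket overflow more gracefully.

```python
BUCKETS, WAYS = 1024, 4                      # e.g. a 4-way set-associative table

def bucket_index(mac: bytes) -> int:
    h = 0
    for b in mac:                            # toy hash over the 6-byte address
        h = ((h * 31) ^ b) & 0xFFFF
    return h % BUCKETS

table = [[] for _ in range(BUCKETS)]         # each bucket holds (mac, port) pairs

def learn(mac: bytes, port: int) -> bool:
    bucket = table[bucket_index(mac)]
    if len(bucket) >= WAYS:
        return False                         # bucket full: address cannot be learned
    bucket.append((mac, port))
    return True

def lookup(mac: bytes):
    for entry_mac, port in table[bucket_index(mac)]:
        if entry_mac == mac:
            return port
    return None                              # unknown destination -> flood

learn(bytes.fromhex("001122334455"), 3)
print(lookup(bytes.fromhex("001122334455")))   # -> 3
```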
For more on hash tables and TCAMs and their strengths and weaknesses see the following posts:
- Blog post on Hash Tables versus CAMs. Has good definitions of hash tables, CAMs, and TCAMs.
- Blog post on ternary hashing, i.e., doing variable length matching with hash tables in hardware.
- Blog post on SDN TCAMs and hashes. Covers the difficulty of general SDN matches for either TCAMs or hashes.
Multiple Lookup Tables
It is the limits on TCAM size and width, and better alternatives for more specialized searches, that led to the multiple table model that we saw in OpenFlow 1.1 and later (Figure 29).
We also see this in system on a chip switch implementations as shown in Figure 30, which specifically mentions a TCAM based solution but shows space for separate L2 (Ethernet) and L3 (IP) tables.
References
Note that BSTJ articles are available free from http://www3.alcatel-lucent.com/bstj/
- [Paull1962] M. C. Paull, "Reswitching of Connection Networks", Bell System Technical Journal, vol. 41, no. 3, pp. 833–855, 1962.
- [Clos1953] C. Clos, "A Study of Non-Blocking Switching Networks", Bell System Technical Journal, vol. 32, no. 2, pp. 406–424, Mar. 1953.
- [Hui1990] J. Y. Hui, Switching and Traffic Theory for Integrated Broadband Networks. Norwell, MA, USA: Kluwer Academic Publishers, 1990.
- [Leiserson1985] C. E. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing", IEEE Trans. Comput., vol. 34, no. 10, pp. 892–901, 1985.
- [Al-Fares2008] M. Al-Fares, A. Loukissas, and A. Vahdat, "A Scalable, Commodity Data Center Network Architecture", in Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, 2008, pp. 63–74.
Source: https://www.grotto-networking.com/BBSwitchArch.html