Friday, December 30, 2011

physical memory architecture

Physical Memory Architecture

18-548 Memory System Architecture
Philip Koopman
August 31, 1998
Required Reading:
Hennessy & Patterson: skim 5.1, 6.1-6.2, pp. 496-497, 7.1, pp. 635-642
Cragon: 2.0
http://www.isdmag.com/Editorial/1996/CoverStory9610.html
Supplemental Reading:
Hennessy & Patterson: 2.0-2.3, 2.7

Assignments
u By next class read about cache organization and access:
• Hennessy & Patterson 5.1, 5.2, pp. 390-393
• Cragon 2.1, 2.1.2, 2.1.3, 2.2.2
• Supplemental Reading: Flynn 1.6, 5.1
u Homework 1 due Wednesday September 2
u Lab 1 due Friday September 4 at 3 PM.
Where Are We Now?
u Where we’ve been:
• Key concepts applied to memory hierarchies
– Latency
– Bandwidth
– Concurrency
– Balance
u Where we’re going today:
• Physical memory architecture -- a trip through the memory hierarchy
u Where we’re going next:
• Cache organization and access
• Virtual memory
Preview
u Principles of Locality
u Physical Memory Hierarchy:
• CPU registers
• Cache
• Bus
• Main Memory
• Mass storage
u Bandwidth “Plumbing” diagrams
Physical Memory Architecture
u Emphasis on providing low latency and high bandwidth
[Block diagram of the memory hierarchy: the CPU (I-unit, E-unit, register file) is backed by on-chip L1 and L2 caches, a TLB, and special-purpose caches; then off-chip L2/L3 and L3/L4 caches; then main memory and special-purpose memory; then, across an interconnection network, virtual memory, disk files & databases, CD-ROM/tape, and other computers & WWW. A cache bypass path skips the caches; locality increases toward the CPU end of the hierarchy.]

Locality
u Typically, one can predict future memory access addresses from past
ones
u Temporal Locality
• A future memory access is likely to be identical to a past one
u Spatial Locality
• A future memory access is likely to be near a past one
u Sequentiality
• A future memory access is likely to be in sequential order
(e.g., a program counter)
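As a concrete illustration (a minimal C sketch, not from the original notes; the array name and size are arbitrary), one short loop exhibits all three forms of locality:

    /* Illustrative sketch: one loop exhibiting all three kinds of locality. */
    #include <stdio.h>

    #define N 1024

    int main(void) {
        static int a[N];                 /* zero-initialized array               */
        int sum = 0;                     /* 'sum' reused every iteration: temporal locality   */
        for (int i = 0; i < N; i++) {    /* instruction fetch walks the loop body: sequentiality */
            sum += a[i];                 /* a[i] touches consecutive words: spatial locality   */
        }
        printf("%d\n", sum);
        return 0;
    }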
Temporal Locality
u Once a memory word has been accessed, it is likely to be accessed again
u Program structures
• Loops
• Frequently used subroutines / object methods (calls)
• Frequently used interrupt service code / TLB miss code
u Application data structures
• Frequently used variables that aren’t register-resident
• “Directory” information (e.g., linked list pointer chain)
u System-induced temporal locality
• Local variables on stack frame
• Generational garbage collection
[Diagram: repeated CALLs into the same SUBROUTINE body]
Temporal Locality of Instructions
u 80/20 rule of systems: 80% of cost or benefit attributable to 20% of
system (sometimes appears as the 90/10 rule)
(Hennessy & Patterson Figure 1.14)
Spatial Locality
u Once a word has been accessed, neighboring words are likely to be
accessed
u Program structures
• Short relative branches
• Related methods in same object (e.g., constructor followed by operator)
u Data structures
• Records (C language struct)
• Operations on 1-dimensional arrays of data
• Image processing (e.g., 2-D convolution)
u Coincidental spatial locality
• Putting important routines together for virtual memory locality
• Overlays (manual management of pages with small address spaces)
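To make the spatial-locality point concrete, here is a small C sketch (not from the notes; the array dimensions are arbitrary) comparing row-major and column-major traversal of the same 2-D array; in C, rows are contiguous, so the row-major walk touches neighboring words:

    #include <stdio.h>

    #define ROWS 512
    #define COLS 512

    static int img[ROWS][COLS];

    long sum_row_major(void) {
        long s = 0;
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                s += img[r][c];          /* consecutive addresses: good spatial locality */
        return s;
    }

    long sum_col_major(void) {
        long s = 0;
        for (int c = 0; c < COLS; c++)
            for (int r = 0; r < ROWS; r++)
                s += img[r][c];          /* stride of COLS ints between accesses */
        return s;
    }

    int main(void) {
        printf("%ld %ld\n", sum_row_major(), sum_col_major());
        return 0;
    }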
Instruction Spatial Locality
u Branch offset coverage (Flynn Table 3.12)
Offset % Branches
± 8 13%
± 16 33%
± 32 37%
± 64 46%
± 128 55%
± 256 61%
± 512 68%
± 1K 73%
± 2K 79%
± 4K 83%
± 8K 87%
± 16K 93%
± 32K 99%
Sequentiality
u Memory words tend to be accessed in sequential order (a special case of
spatial locality)
u Program structures
• Program counter
• Compiler output arranged to avoid branch delay penalties
u Data structures
• Strings, BCD, multi-precision numbers
• Vector/matrix operations with unit stride (or, sometimes, any fixed stride)
• Multimedia playback
[Diagram: a 7-byte / 12-digit BCD number laid out in consecutive bytes]
Basic Block Lengths: Sequentiality
u A few large blocks raise mean length
(Flynn Figure 3.4)
CPU-BASED PHYSICAL MEMORY ELEMENTS

CPU Resources Form Top of Memory Hierarchy
u Registers
• Register file
• Top-of-stack “cache” (actually more like a buffer)
u Buffers
• Instruction pre-fetch buffer
• Data write buffer
• Branch destination buffer (specialized instruction cache)
u Programs with good locality make effective use of limited resources
• Only a handful of active values in register set (grows with superscalar CPUs)
u Bandwidth -- limited by number of ports in storage
• For example, 3-ported 32-bit register file at 500 MHz is ~6 GB/sec
u Latency -- cycled once or twice per clock cycle
• Latency is effectively “zero” -- CPU pipeline takes it into account
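A quick back-of-envelope check of the ~6 GB/sec figure above (a sketch using the slide's example numbers of 3 ports, 32-bit accesses, and a 500 MHz clock):

    /* Sketch: peak register-file bandwidth = ports * bytes-per-access * clock rate. */
    #include <stdio.h>

    int main(void) {
        double ports = 3, bytes = 4, clock_hz = 500e6;
        double bw = ports * bytes * clock_hz;        /* 3 * 4 B * 500 MHz  */
        printf("%.1f GB/sec\n", bw / 1e9);           /* prints 6.0 GB/sec  */
        return 0;
    }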
Registers Are Software-Managed Cache
u Registers offer highest bandwidth and shortest latency in system
• Built into the CPU
• Multi-ported fast access register file
u Compiler/programmer manages registers
• Explicit load and store instructions
• No hardware interlocks for dependencies
u Very large register set can be effective in the right circumstances
• Vector computing where registers store intermediate vector results
• Embedded systems with multiple register sets for context switching speed
– But may not be so great for general purpose computing
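A small C sketch (not from the notes) of the kind of source-level help that keeps a hot value register-resident instead of re-reading it from memory:

    #include <stdio.h>

    /* Copying a value loaded through a pointer into a local before the loop
     * lets the compiler keep it in a register for the whole loop. */
    static void scale(float *v, int n, const float *gain) {
        float g = *gain;                 /* load once; 'g' can live in a register  */
        for (int i = 0; i < n; i++)
            v[i] *= g;                   /* no repeated memory access for the gain */
    }

    int main(void) {
        float data[4] = { 1, 2, 3, 4 }, gain = 2.0f;
        scale(data, 4, &gain);
        printf("%.1f %.1f %.1f %.1f\n", data[0], data[1], data[2], data[3]);
        return 0;
    }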
CACHE MEMORY

Cache Memory
u A small memory that provides the CPU with low latency and high
bandwidth access
• Typically hardware management of which memory locations reside in cache at
any point in time
• Can be entirely software managed, but this is not commonly done today (may become more common on multiprocessor systems for managing coherence)
u Multiple levels of cache memory possible
• Each level is typically bigger but slower; faster levels use SRAM, slower levels may be DRAM
u Bandwidth -- determined by interconnection technology
• On-chip limited by number of bits in memory cell array
• Off-chip limited by number of pins and speed at which they can be clocked
u Latency -- determined by interconnection technology & memory speed
• On-chip can be one or more clocks, depending on connect/cycle delays
• Off-chip is likely to be 3-10 clocks, depending on system design
Major Cache Design Decisions
u Cache Size -- in bytes
u Split/Unified -- instructions and data in same cache?
u Associativity -- how many sectors in each set?
u Sector/Block size -- how many bytes grouped together as a unit?
u Management policies
• Choosing victim for replacement on cache miss
• Handling writes (when is the write accomplished? is written data cached?)
• Ability to set or override policies in software
u How many levels of cache?
• Perhaps L1 cache on-chip; L2 cache on-module; L3 cache on motherboard
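These decisions boil down to a handful of parameters; a hypothetical configuration record (the struct and field names are illustrative, not from any real design) might look like:

    /* Hypothetical cache configuration record summarizing the decisions above. */
    struct cache_config {
        unsigned size_bytes;      /* total capacity, e.g. 16 * 1024             */
        unsigned block_bytes;     /* bytes moved as a unit                      */
        unsigned ways;            /* associativity; 1 = direct mapped           */
        int      split_i_d;       /* 1 = separate I- and D-caches, 0 = unified  */
        int      write_back;      /* 1 = write-back, 0 = write-through          */
        int      write_allocate;  /* cache written data on a write miss?        */
        int      levels;          /* e.g. L1 on-chip, L2 on-module, L3 on board */
    };

    /* Example: a 16 KB, 2-way, write-back, write-allocate unified cache with 32-byte blocks. */
    static const struct cache_config example = { 16 * 1024, 32, 2, 0, 1, 1, 1 };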
Cache Addressing
Example Cache Lookup Operation
u Assume cache “hit” at address 0x345678 (word access)
[Figure: the address 0x345678 is split into a 10-bit tag (0xD1), an 11-bit block number (0x2CF), and low-order bits that select the 4-byte word and byte within the 8-byte block. Each cache entry holds a valid bit (1 bit), a tag (10 bits), and data (8 bytes); the stored tag matching the address tag signals a cache hit.]
Block size
u Cache is organized into blocks of data that are moved as a whole
• Blocks can range from 4 bytes to 256 bytes or more in size
• Blocks are loaded and stored as a single unit to/from the cache
u # blocks = cache size / block size
u example: 8 KB cache with 64-byte blocks:
• # blocks = 8192 / 64 = 128 blocks
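A small sketch of the address decomposition used in the lookup example above (assuming a direct-mapped organization with power-of-two sizes; the 8-byte block and 11-bit block-number field come from that example, the code itself is illustrative):

    /* Sketch: splitting an address into byte offset, block index, and tag. */
    #include <stdio.h>

    #define BLOCK_BYTES 8u
    #define NUM_BLOCKS  2048u                 /* 11 index bits */

    int main(void) {
        unsigned addr   = 0x345678;
        unsigned offset = addr % BLOCK_BYTES;                /* byte within block  */
        unsigned index  = (addr / BLOCK_BYTES) % NUM_BLOCKS; /* which block: 0x2CF */
        unsigned tag    = (addr / BLOCK_BYTES) / NUM_BLOCKS; /* 0xD1               */
        printf("tag=0x%X index=0x%X offset=%u\n", tag, index, offset);
        return 0;
    }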
Cache Associativity
u When a cache block must be replaced with a new one, the cache logic
determines which block to replace from a set of candidate blocks
• Fully associative: the set consists of all blocks in the cache
• n-way set associative: the set consists of n blocks
– For example, 4-way set associative means that there are 4 candidates
• Direct mapped: there is only one block that can be replaced (1-way set
associative)
u # sets = # blocks / n
u example: 256 B cache with 16-byte blocks, 2-way set associative: 16 blocks
• # sets = (256 / 16) / 2 = 8 sets
[Figure: 2-way set associative cache with 8 sets (SET 0 - SET 7); each set holds two blocks, and each block has a tag, valid bit, dirty bit, and data words.]
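A sketch of the set arithmetic for the 256-byte, 2-way example (the example address is arbitrary):

    /* Sketch: computing set count and the set an address maps to. */
    #include <stdio.h>

    #define CACHE_BYTES 256u
    #define BLOCK_BYTES 16u
    #define WAYS        2u

    int main(void) {
        unsigned blocks = CACHE_BYTES / BLOCK_BYTES;   /* 16 blocks */
        unsigned sets   = blocks / WAYS;               /* 8 sets    */
        unsigned addr   = 0x345678;                    /* arbitrary example address      */
        unsigned set    = (addr / BLOCK_BYTES) % sets; /* candidate set for this address */
        printf("blocks=%u sets=%u addr 0x%X -> set %u\n", blocks, sets, addr, set);
        return 0;
    }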
BUS / INTERCONNECT

Bus
u Bus is a shared system interconnection
• CPU to cache
– Might be on system bus
– As CPUs get more pins they move to a dedicated cache bus or all within package
• Cache to main memory
• CPU or main memory to I/O
• Typically high-speed connection, and often carries processor/memory traffic
• Typically accessed transparently via a memory address
Bus Performance
u Bandwidth -- limited by cost and transmission line effects
• 64-bit or 128-bit data bus common (but, fewer bits on cost-sensitive systems)
– Why was the 8088 used instead of the 8086 in the original IBM PC?
• Bus speed often limited to 50 - 66 MHz due to transmission line effects
• Example: Pentium Pro -- up to 528 MB/sec for 64-bit bus at 66 MHz
u Latency -- limited by distance and need for drivers
• Multiple clock latency, but can pipeline and achieve 1 clock/datum throughput
• Be careful about “bus clocks” vs. “processor clocks”
– Many current processors clocked at a multiple of the bus frequency
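A quick check of the Pentium Pro number quoted above (peak bus bandwidth is just bus width times bus clock):

    /* Sketch: peak bus bandwidth = bus width in bytes * bus clock rate. */
    #include <stdio.h>

    int main(void) {
        double width_bytes = 8;                 /* 64-bit data bus         */
        double clock_hz    = 66e6;              /* 66 MHz bus clock        */
        double bw = width_bytes * clock_hz;     /* one datum per bus clock */
        printf("%.0f MB/sec\n", bw / 1e6);      /* prints 528 MB/sec       */
        return 0;
    }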
Interconnect
u Interconnect is a generalization of a bus
• Crossbar switch permits connecting, for example,
n CPUs to m memory banks for simultaneous
accesses
– Cost is O(n*m) switches
• Multi-stage interconnection networks are a heavily researched area (hypercubes, omega networks, mesh connections)
– Cost is O(n log n) switches
• Serial interconnections such as Ethernet are
widely used for between-box communications
– Cost is O(n) connections
Interconnect Performance
u Bandwidth -- usually limited by cost to fast serial connection
• Crossbar provides very high bandwidth (n simultaneous connections), but costs O(n^2) in terms of switching nodes
• Omega network provides potentially high bandwidth, but suffers from
blocking/congestion
• 10 Mbit/sec common for Ethernet; 100 Mbit/sec being introduced
– Also limited by cost of switch to avoid sharing of high-speed line
u Latency -- limited by routing
• Crossbar offers lowest latency, but at high cost
• Each “hop” on network requires passing through an embedded processor/switch
– Can be many seconds for the Internet
• Omega network provides high potential bandwidth, but at cost of latency of log n switching stages
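To get a feel for the cost difference, a small sketch (n = 64 ports is an arbitrary example; the crosspoint and switch counts follow the O(n^2) and O(n log n) figures above):

    /* Sketch: comparing an n x n crossbar (about n^2 crosspoints) with an
     * omega network (about (n/2) * log2(n) 2x2 switches). */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int n = 64;                                 /* arbitrary port count  */
        double crossbar = (double)n * n;            /* 4096 crosspoints      */
        double omega    = (n / 2.0) * log2(n);      /* 32 * 6 = 192 switches */
        printf("crossbar=%.0f  omega=%.0f\n", crossbar, omega);
        return 0;
    }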
MAIN MEMORY

Main Memory
u Main Memory is what programmers (think they) manipulate
• Program space
• Data space
• Commonly referred to as “physical memory” (as opposed to “virtual memory”)
u Typically constructed from DRAM chips
• Multiple clock cycles to access data, but may operate in a “burst” mode once
data access is started
• Optimized for capacity, not necessarily speed
u Latency -- determined by DRAM construction
• Shared pins for high & low half of address to save on packaging costs
• Typically 2 or 3 bus cycles to begin accessing data
• Once access initiated can return multiple data at rate of datum per bus clock
Main Memory Capacities
u Main memory capacity is determined by DRAM chip capacity
• At least 1 “bank” of DRAM chips is required for minimum memory size
• Multiple banks (or bigger chips) used to increase memory capacity
u Bandwidth -- determined by memory word width
• Memory words typically same width as bus
• Peak memory bandwidth is usually one word per bus cycle
• Sustained memory bandwidth varies with the complexity of the design
• Sometimes multiple banks can be activated concurrently, exploiting “interleaved” memory
• Representative main memory bandwidth is 500 MB/sec peak; 125 MB/sec sustained
Special-Purpose Memory
u Memory dedicated to special hardware use
• Graphics frame buffer
– Special construction to support high-speed transfers to video screen
– Dual-ported access for modify-while-display
• Video capture buffer
u Perhaps could consider memory embedded in other devices as special-purpose system memory
• Postscript printer memory
• Audio sample tables
DISK / BACKING STORE

“Backing Store”
u Used for data that doesn’t fit into main memory
• “Virtual memory” uses it to emulate a larger physical memory
• File systems use it for nonvolatile storage of information
u Hard disk technology is prevalent, but don’t forget:
• Flash memory (especially for mobile computing) based on IC technology
• Floppy disks (up to 100 MB)
• CD-ROM (and WORM, and magneto-optical, and ...)
• Novel storage (bar codes, optical paper tape...)
u Magnetic disks have killed off:
• Magnetic drums
• Punched cards
• Paper tape
• Bubble memory
• Magnetic tape (not quite dead yet)
Disk Construction
u One or more platters with one or more heads
• Details of sectors/tracks/cylinders discussed at a later lecture
[Photo: disk drive internals -- Seagate Technology]
u Latency: disk controller + physics of disk
• Seek latency for head to move into position
• Rotational latency for disk to spin to correct position under the head
• Minor additional processing latency
• Typical latency is a handful or so of milliseconds; but can vary considerably
from access to access
u Bandwidth: disk controller + physics of disk
• Raw bandwidth is determined by data density on disk and rotational rate
• Electronics have to be able to handle the data to achieve peak bandwidth
• Transfer rate may be 1-10 MB/sec
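A rough latency estimate along the lines above (a sketch; the seek time, rotation rate, and transfer term are assumed example values, not from the notes):

    /* Sketch: average access latency = seek + half a rotation + transfer. */
    #include <stdio.h>

    int main(void) {
        double seek_ms     = 8.0;                    /* assumed average seek     */
        double rpm         = 7200.0;                 /* assumed rotation rate    */
        double half_rot_ms = 0.5 * 60000.0 / rpm;    /* ~4.17 ms rotational wait */
        double transfer_ms = 0.5;                    /* controller + transfer    */
        printf("~%.1f ms per access\n", seek_ms + half_rot_ms + transfer_ms);
        return 0;
    }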
Physical Disk Layout Example
u Example: 4 platters, 2-sided
PLUMBING DIAGRAMS

Physical Memory Characteristics
u Approximate & representative numbers for 1998 technology
                Capacity       Latency                Bandwidth
Registers       128 Bytes+     1 CPU clock (2-5 ns)   3-10 GB/sec
Cache           8 KB - 1 MB    1-10 clocks            3-5 GB/sec
Bus             --             10+ clocks             0.5 - 1 GB/sec
Main Memory     8 MB - 1 GB    25 - 100 clocks        0.1 - 1 GB/sec
Interconnect    --             1-10 msec              1-20 MB/sec
Backing Store   8 GB+          5-10 msec              1-10 MB/sec
Example Baseline Plumbing Diagram
u Example for bandwidth computations
• Uses representative numbers; not necessarily corresponding to any particular real system
u Bandwidths may or may not be sufficient depending on attenuation factor at each stage
An Example That Is Memory-Bandwidth-Bound
u Assume register file is used at full bandwidth
u Assume 25% of necessary bandwidth gets through each level of memory filtering
• This is not really a miss rate in that it is by bandwidth, not simply number of accesses
u Result:
• For this example, not enough bandwidth beyond the bus
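A sketch of that attenuation argument (the 6 GB/sec register-file demand and the level names are assumed starting points; whether each stage keeps up depends on the capacities drawn in the plumbing diagram):

    /* Sketch: if each level passes 25% of the bandwidth demanded of it on to
     * the next, the demand seen at each stage falls off geometrically. */
    #include <stdio.h>

    int main(void) {
        const char *level[] = { "L1 cache", "L2 cache", "bus", "main memory" };
        double demand = 6000.0;                   /* MB/sec at the register file */
        for (int i = 0; i < 4; i++) {
            demand *= 0.25;                       /* 25% filters through         */
            printf("%-12s sees a demand of %7.1f MB/sec\n", level[i], demand);
        }
        return 0;
    }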
REVIEW

Physical Memory Architecture
u Emphasis on providing low latency and high bandwidth
[The memory hierarchy block diagram from the Physical Memory Architecture slide near the start of the lecture is repeated here.]
Key Concepts
u Latency
• Higher levels in hierarchy have lower latency because there are shorter and less
complicated interconnections
u Bandwidth
• Generally higher levels in hierarchy have greater bandwidth because it’s
cheaper to run wider data paths with short “distances” and low fanout
u Concurrency
• Replication of resources can improve bandwidth for a cost
– Split caches
– Concurrent interconnection paths
– Multiple memory banks
– Multiple mass storage devices
u Balance
• Plumbing diagrams can give a first-order approximation of system balance for
throughput
Review
u Principles of Locality
• Temporal, spatial, sequential
u Physical Memory Hierarchy:
• CPU registers -- smallest & fastest; measured in Bytes
• Cache -- almost as fast as CPU; measured in KB
• Bus/Interconnect -- bandwidth costs money
• Main Memory -- slow but large; measured in MB
• Mass storage -- slower and larger; measured in GB
u Bandwidth “Plumbing” diagrams
• Back-of-envelope calculations can demonstrate bottlenecks