Physical Memory Architecture
18-548 Memory System Architecture
Philip Koopman
August 31, 1998
Required Reading:
Hennessy & Patterson: skim 5.1, 6.1-6.2, pp. 496-497, 7.1, pp. 635-642
Cragon: 2.0
http://www.isdmag.com/Editorial/1996/CoverStory9610.html
Supplemental Reading:
Hennessy & Patterson: 2.0-2.3, 2.7
Assignments
u By next class read about cache organization and access:
• Hennessy & Patterson 5.1, 5.2, pp. 390-393
• Cragon 2.1, 2.1.2, 2.1.3, 2.2.2
• Supplemental Reading: Flynn 1.6, 5.1
u Homework 1 due Wednesday September 2
u Lab 1 due Friday September 4 at 3 PM.
Where Are We Now?
u Where we’ve been:
• Key concepts applied to memory hierarchies
– Latency
– Bandwidth
– Concurrency
– Balance
u Where we’re going today:
• Physical memory architecture -- a trip through the memory hierarchy
u Where we’re going next:
• Cache organization and access
• Virtual memory
Preview
u Principles of Locality
u Physical Memory Hierarchy:
• CPU registers
• Cache
• Bus
• Main Memory
• Mass storage
u Bandwidth “Plumbing” diagrams
Physical Memory Architecture
u Emphasis on providing low latency and high bandwidth
[Hierarchy diagram: the CPU (I-unit, E-unit, register file, and special-purpose caches) feeds an on-chip L1 cache, a TLB (?), and possibly an on-chip L2 cache; below these sit off-chip L2/L3 (?) and L3/L4 caches, main memory, and special-purpose memory; beyond that lie virtual memory, disk files & databases, CD-ROM/tape, etc., and other computers & the WWW reached through an interconnection network. A cache-bypass path skips the caches; locality increases toward the CPU.]
Locality
u Typically, one can predict future memory access addresses from past
ones
u Temporal Locality
• A future memory access is likely to be identical to a past one
u Spatial Locality
• A future memory access is likely to be near a past one
u Sequentiality
• A future memory access is likely to be in sequential order
(e.g., a program counter)
Temporal Locality
u Once a memory word has been accessed, it is likely to be accessed again
u Program structures
• Loops
• Frequently used subroutines / object methods (calls)
• Frequently used interrupt service code / TLB miss code
u Application data structures
• Frequently used variables that aren’t register-resident
• “Directory” information (e.g., linked list pointer chain)
u System-induced temporal locality
• Local variables on stack frame
• Generational garbage collection
[Diagram: several CALL sites all invoking the same SUBROUTINE]
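As a concrete sketch in C (the function and its names are ours, for illustration, not from the course materials), a histogram loop shows temporal locality at work: the loop's own instructions and the hot entries of the count array are touched again and again.

```c
#include <stddef.h>

/* Temporal locality: the loop body's instructions, the index 'i',
   and the frequently hit entries of count[] are accessed repeatedly,
   so they tend to stay cache-resident. */
void histogram(const unsigned char *data, size_t n, unsigned count[256])
{
    for (size_t i = 0; i < n; i++)
        count[data[i]]++;
}
```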
Temporal Locality of Instructions
u 80/20 rule of systems: 80% of cost or benefit attributable to 20% of
system (sometimes appears as the 90/10 rule)
(Hennessy & Patterson Figure 1.14)
Spatial Locality
u Once a word has been accessed, neighboring words are likely to be
accessed
u Program structures
• Short relative branches
• Related methods in same object (e.g., constructor followed by operator)
u Data structures
• Records (C language struct)
• Operations on 1-dimensional arrays of data
• Image processing (e.g., 2-D convolution)
u Coincidental spatial locality
• Putting important routines together for virtual memory locality
• Overlays (manual management of pages with small address spaces)
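A minimal C sketch of record-based spatial locality (the struct and function are ours, for illustration): the fields of one record sit in adjacent memory, so fetching one tends to pull its neighbors into the same cache block.

```c
/* Spatial locality: x, y, and z are adjacent fields of one record
   (a C struct), so an access to one usually brings the others into
   the cache "for free". */
struct point { double x, y, z; };

double norm_squared(const struct point *p)
{
    return p->x * p->x + p->y * p->y + p->z * p->z;
}
```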
Instruction Spatial Locality
u Branch offset coverage (Flynn Table 3.12)
Offset % Branches
± 8 13%
± 16 33%
± 32 37%
± 64 46%
± 128 55%
± 256 61%
± 512 68%
± 1K 73%
± 2K 79%
± 4K 83%
± 8K 87%
± 16K 93%
± 32K 99%
Sequentiality
u Memory words tend to be accessed in sequential order (a special case of
spatial locality)
u Program structures
• Program counter
• Compiler output arranged to avoid branch delay penalties
u Data structures
• Strings, BCD, multi-precision numbers
• Vector/matrix operations with unit stride (or, sometimes, any fixed stride)
• Multimedia playback
[Diagram: a 7-byte/12-digit packed BCD number, stored as a sign nibble followed by digit nibbles and read byte by byte in sequence]
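A short C sketch of sequentiality (names are ours): a unit-stride vector operation walks both arrays one word at a time, exactly the access pattern a program counter produces for instructions.

```c
#include <stddef.h>

/* Sequentiality: both arrays are traversed with unit stride, so
   successive accesses touch successive memory words. */
void scale(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```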
Basic Block Lengths: Sequentiality
u A few large blocks raise mean length
(Flynn Figure 3.4)
CPU-BASED PHYSICAL MEMORY ELEMENTS
CPU Resources Form Top of Memory Hierarchy
u Registers
• Register file
• Top-of-stack “cache” (actually more like a buffer)
u Buffers
• Instruction pre-fetch buffer
• Data write buffer
• Branch destination buffer (specialized instruction cache)
u Programs with good locality make effective use of limited resources
• Only a handful of active values in register set (grows with superscalar CPUs)
u Bandwidth -- limited by number of ports in storage
• For example, 3-ported 32-bit register file at 500 MHz is ~6 GB/sec
u Latency -- cycled once or twice per clock cycle
• Latency is effectively “zero” -- CPU pipeline takes it into account
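The bandwidth figure above is simple arithmetic; here is the slide's example as a checkable C one-liner (the program itself is ours).

```c
#include <stdio.h>

/* Back-of-envelope register file bandwidth: ports * word size * clock.
   Numbers from the slide: 3 ports, 32-bit (4-byte) words, 500 MHz. */
int main(void)
{
    double bw = 3 * 4 /* bytes */ * 500e6;      /* = 6e9 bytes/sec */
    printf("register file bandwidth ~ %.0f GB/sec\n", bw / 1e9);
    return 0;
}
```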
Registers Are Software-Managed Cache
u Registers offer highest bandwidth and shortest latency in system
• Built into the CPU
• Multi-ported fast access register file
u Compiler/programmer manages registers
• Explicit load and store instructions
• No hardware interlocks for dependencies
u Very large register set can be effective in the right circumstances
• Vector computing where registers store intermediate vector results
• Embedded systems with multiple register sets for context switching speed
– But may not be so great for general purpose computing
CACHE MEMORY
Cache Memory
u A small memory that provides the CPU with low latency and high
bandwidth access
• Typically hardware management of which memory locations reside in cache at
any point in time
• Can be entirely software managed, but this is not commonly done today (may
become more common on multiprocessor systems for managing coherence)
u Multiple levels of cache memory possible
• Each level is typically bigger but slower; faster levels SRAM, slower levels
may be DRAM
u Bandwidth -- determined by interconnection technology
• On-chip limited by number of bits in memory cell array
• Off-chip limited by number of pins and speed at which they can be clocked
u Latency -- determined by interconnection technology & memory speed
• On-chip can be one or more clocks, depending on connect/cycle delays
• Off-chip is likely to be 3-10 clocks, depending on system design
Major Cache Design Decisions
u Cache Size -- in bytes
u Split/Unified -- instructions and data in same cache?
u Associativity -- how many sectors in each set?
u Sector/Block size -- how many bytes grouped together as a unit?
u Management policies
• Choosing victim for replacement on cache miss
• Handling writes (when is the write performed? is written data cached?)
• Ability to set or override policies in software
u How many levels of cache?
• Perhaps L1 cache on-chip; L2 cache on-module; L3 cache on motherboard
Cache Addressing
Example Cache Lookup Operation
u Assume cache “hit” at address 0x345678 (word access)
[Diagram: the 24-bit address 0x345678 is split into a 10-bit tag, an 11-bit block number ("which block?"), 1 bit selecting the low or high 4-byte word, and 2 bits selecting the byte ("which byte?"). The cache entry for block number 0x2CF holds a valid bit (1 bit), a tag (10 bits), and data (8 bytes); the stored tag 0xD1 equals the tag field of the address, so the access is a cache hit.]
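The same address split, as a checkable C sketch (field widths are from the slide; the program itself is ours):

```c
#include <stdio.h>

/* Split a 24-bit address into the slide's fields: 10-bit tag,
   11-bit block number, 1 word-select bit, 2 byte-select bits. */
int main(void)
{
    unsigned addr  = 0x345678;
    unsigned byte  =  addr        & 0x3;    /* 2 bits  */
    unsigned word  = (addr >> 2)  & 0x1;    /* 1 bit: low/high 4-byte word */
    unsigned block = (addr >> 3)  & 0x7FF;  /* 11 bits */
    unsigned tag   = (addr >> 14) & 0x3FF;  /* 10 bits */
    printf("tag=0x%X block=0x%X word=%u byte=%u\n", tag, block, word, byte);
    /* prints: tag=0xD1 block=0x2CF word=0 byte=0 */
    return 0;
}
```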
Block size
u Cache is organized into blocks of data that are moved as a whole
• Blocks can range from 4 bytes to 256 bytes or more in size
• Blocks are loaded and stored as a single unit to/from the cache
u # blocks = cache size / block size
u example: 8 KB cache with 64-byte blocks:
• # blocks = 8192 / 64 = 128 blocks
Cache Associativity
u When a cache block must be replaced with a new one, the cache logic
determines which block to replace from a set of candidate blocks
• Fully associative: the set consists of all blocks in the cache
• n-way set associative: the set consists of n blocks
– For example, 4-way set associative means that there are 4 candidates
• Direct mapped: there is only one block that can be replaced (1-way set
associative)
u # sets = # blocks / n
u example: 256 B cache with 16-byte blocks, 2-way set associative (16 blocks)
• # sets = (256/16) / 2 = 8 sets
[Diagram: a 2-way set associative cache with 8 sets (SET 0 through SET 7); each set holds two blocks (BLOCK 0 and BLOCK 1), and each block carries a tag, valid (V) and dirty (D) bits, and two data words]
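The geometry arithmetic of the last two slides, as small C helpers (the function names are ours):

```c
/* Cache geometry: blocks from capacity, sets from associativity. */
unsigned num_blocks(unsigned cache_bytes, unsigned block_bytes)
{
    return cache_bytes / block_bytes;
}

unsigned num_sets(unsigned cache_bytes, unsigned block_bytes, unsigned ways)
{
    return num_blocks(cache_bytes, block_bytes) / ways;
}

/* num_blocks(8192, 64) == 128 blocks (the 8 KB example)
   num_sets(256, 16, 2)  ==   8 sets  (the 256 B example) */
```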
BUS / INTERCONNECT
Bus
u Bus is a shared system interconnection
• CPU to cache
– Might be on system bus
– As CPUs get more pins, the cache moves to a dedicated cache bus or entirely within the package
• Cache to main memory
• CPU or main memory to I/O
• Typically high-speed connection, and often carries processor/memory traffic
• Typically accessed transparently via a memory address
Bus Performance
u Bandwidth -- limited by cost and transmission line effects
• 64-bit or 128-bit data bus common (but, fewer bits on cost-sensitive systems)
– Why was the 8088 used instead of the 8086 in the original IBM PC?
• Bus speed often limited to 50 - 66 MHz due to transmission line effects
• Example: Pentium Pro -- up to 528 MB/sec for 64-bit bus at 66 MHz
u Latency -- limited by distance and need for drivers
• Multiple clock latency, but can pipeline and achieve 1 clock/datum throughput
• Be careful about “bus clocks” vs. “processor clocks”
– Many current processors clocked at a multiple of the bus frequency
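A C sketch of the two cautions above (the 66 MHz / 64-bit figures are the slide's Pentium Pro example; the 3x clock multiplier and 4-cycle bus latency are illustrative assumptions, not measured values):

```c
#include <stdio.h>

/* Peak bus bandwidth, and a bus latency re-expressed in CPU clocks
   when the processor runs at a multiple of the bus frequency. */
int main(void)
{
    double peak_bw     = 8 /* bytes */ * 66e6;  /* 528 MB/sec */
    int    cpu_per_bus = 3;                     /* assumed: 200 MHz CPU, 66 MHz bus */
    int    bus_clocks  = 4;                     /* assumed: bus clocks to first datum */
    printf("peak %.0f MB/sec; %d bus clocks = %d CPU clocks\n",
           peak_bw / 1e6, bus_clocks, bus_clocks * cpu_per_bus);
    return 0;
}
```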
Interconnect
u Interconnect is a generalization of a bus
• Crossbar switch permits connecting, for example,
n CPUs to m memory banks for simultaneous
accesses
– Cost is O(n*m) switches
• Multi-stage interconnection is a heavily researched area (hypercubes, omega
networks, mesh connections)
– Cost is O(n log n) switches
• Serial interconnections such as Ethernet are
widely used for between-box communications
– Cost is O(n) connections
Interconnect Performance
u Bandwidth -- usually limited by cost to fast serial connection
• Crossbar provides very high bandwidth (n simultaneous connections); but costs
O(n²) in terms of switching nodes
• Omega network provides potentially high bandwidth, but suffers from
blocking/congestion
• 10 Mbit/sec common for Ethernet; 100 Mbit/sec being introduced
– Also limited by cost of switch to avoid sharing of high-speed line
u Latency -- limited by routing
• Crossbar offers lowest latency, but at high cost
• Each “hop” on the network requires passing through an embedded processor/switch
– Can be many seconds for the Internet
– Omega network provides high potential bandwidth, but at the cost of log n
switching stages of latency
MAIN MEMORY
Main Memory
u Main Memory is what programmers (think they) manipulate
• Program space
• Data space
• Commonly referred to as “physical memory” (as opposed to “virtual memory”)
u Typically constructed from DRAM chips
• Multiple clock cycles to access data, but may operate in a “burst” mode once
data access is started
• Optimized for capacity, not necessarily speed
u Latency -- determined by DRAM construction
• Shared pins for high & low half of address to save on packaging costs
• Typically 2 or 3 bus cycles to begin accessing data
• Once an access is initiated, multiple data can return at a rate of one datum per bus clock
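A rough C model of that timing (the setup count and block/bus widths are illustrative assumptions, not DRAM datasheet values):

```c
/* DRAM burst timing sketch: a few bus cycles to start the access,
   then one bus-width datum per bus clock. */
unsigned block_fill_bus_clocks(unsigned setup_clocks,
                               unsigned block_bytes, unsigned bus_bytes)
{
    return setup_clocks + block_bytes / bus_bytes;
}

/* block_fill_bus_clocks(3, 64, 8) == 11 bus clocks for a 64-byte block
   over a 64-bit bus, assuming 3 setup cycles */
```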
Main Memory Capacities
u Main memory capacity is determined by the DRAM chips used
• At least 1 “bank” of DRAM chips is required for minimum memory size
• Multiple banks (or bigger chips) used to increase memory capacity
u Bandwidth -- determined by memory word width
• Memory words typically same width as bus
• Peak memory bandwidth is usually one word per bus cycle
• Sustained memory bandwidth varies with the complexity of the design
– Sometimes multiple banks can be activated concurrently, exploiting “interleaved”
memory
• Representative main memory bandwidth is 500 MB/sec peak; 125 MB/sec
sustained
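A minimal sketch of low-order interleaving in C (a simplification; real memory controllers differ): consecutive words land in consecutive banks, so several banks can work on a burst concurrently.

```c
#include <stdint.h>

/* Low-order interleaving: the word address modulo the bank count
   picks the bank, spreading sequential words across all banks. */
unsigned bank_of(uint32_t addr, unsigned word_bytes, unsigned n_banks)
{
    return (addr / word_bytes) % n_banks;
}
```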
Special-Purpose Memory
u Memory dedicated to special hardware use
• Graphics frame buffer
– Special construction to support high-speed transfers to video screen
– Dual-ported access for modify-while-display
• Video capture buffer
u Memory embedded in other devices can perhaps be considered special-purpose
system memory
• PostScript printer memory
• Audio sample tables
DISK / BACKING STORE
“Backing Store”
u Used for data that doesn’t fit into main memory
• “Virtual memory” uses it to emulate a larger physical memory
• File systems use it for nonvolatile storage of information
u Hard disk technology is prevalent, but don’t forget:
• Flash memory (especially for mobile computing) based on IC technology
• Floppy disks (up to 100 MB)
• CD-ROM (and WORM, and magneto-optical, and ...)
• Novel storage (bar codes, optical paper tape...)
u Magnetic disks have killed off:
• Magnetic drums
• Punched cards
• Paper tape
• Bubble memory
• Magnetic tape (not quite dead yet)
Disk Construction
u One or more platters with one or more heads
• Details of sectors/tracks/cylinders are discussed in a later lecture
[Photo: disk drive internals (Seagate Technology)]
u Latency: disk controller + physics of disk
• Seek latency for head to move into position
• Rotational latency for disk to spin to correct position under the head
• Minor additional processing latency
• Typical latency is a handful or so of milliseconds; but can vary considerably
from access to access
u Bandwidth: disk controller + physics of disk
• Raw bandwidth is determined by data density on disk and rotational rate
• Electronics have to be able to handle the data to achieve peak bandwidth
• Transfer rate may be 1-10 MB/sec
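Putting the latency components together as a C sketch (the 8 ms seek, 7200 RPM, 4 KB transfer, and 5 MB/sec rate are representative assumptions, not vendor data):

```c
/* Average disk access time: seek + half a rotation + transfer time. */
double disk_access_ms(double seek_ms, double rpm,
                      double kbytes, double mb_per_sec)
{
    double rotate_ms   = 0.5 * 60000.0 / rpm;                 /* half a revolution */
    double transfer_ms = (kbytes / 1024.0) / mb_per_sec * 1000.0;
    return seek_ms + rotate_ms + transfer_ms;
}

/* disk_access_ms(8.0, 7200.0, 4.0, 5.0) ~= 8 + 4.17 + 0.78 ~= 13 ms */
```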
Physical Disk Layout Example
u Example: 4 platters, 2-sided
PLUMBING DIAGRAMS
Physical Memory Characteristics
u Approximate & representative numbers for 1998 technology
                Capacity       Latency                 Bandwidth
Registers       128 Bytes+     1 CPU clock (2-5 ns)    3-10 GB/sec
Cache           8 KB - 1 MB    1-10 clocks             3-5 GB/sec
Bus             --             10+ clocks              0.5-1 GB/sec
Main Memory     8 MB - 1 GB    25-100 clocks           0.1-1 GB/sec
Interconnect    --             1-10 msec               1-20 MB/sec
Backing Store   8 GB+          5-10 msec               1-10 MB/sec
Example Baseline Plumbing Diagram
u Example for bandwidth computations
• Uses representative numbers; not necessarily corresponding to any particular real system
u Bandwidths may or may not be sufficient depending on the attenuation factor at each stage
An Example That Is Memory-Bandwidth-Bound
u Assume the register file is used at full bandwidth
u Assume 25% of the necessary bandwidth gets through each level of memory filtering
• This is not really a miss rate, in that it is by bandwidth, not simply number of accesses
u Result:
• For this example, not enough bandwidth beyond the bus
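The attenuation calculation, sketched in C (the diagram's own numbers are not recoverable from this text, so the ~6 GB/sec starting point reuses the earlier register file estimate; treat it as an assumption):

```c
/* Bandwidth demand that survives to each level of the hierarchy if
   25% of the demand "gets through" each stage of filtering. */
double demand_at_level(double register_bw, int level)
{
    double d = register_bw;
    for (int i = 0; i < level; i++)
        d *= 0.25;
    return d;
}

/* demand_at_level(6e9, 1) == 1.5 GB/sec needed from the L1 cache
   demand_at_level(6e9, 2) == 375 MB/sec needed from the bus
   demand_at_level(6e9, 3) == ~94 MB/sec needed from main memory
   Compare each against the characteristics table to find bottlenecks. */
```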
REVIEW
Physical Memory Architecture
u Emphasis on providing low latency and high bandwidth
[Hierarchy diagram repeated from the start of the lecture: CPU (I-unit, E-unit, register file, special-purpose caches) through on-chip L1/L2 caches and TLB, off-chip L2/L3 and L3/L4 caches, main memory, and special-purpose memory, out to virtual memory, disk files & databases, CD-ROM/tape, and other computers & the WWW via an interconnection network, with the cache-bypass path shown.]
Key Concepts
u Latency
• Higher levels in hierarchy have lower latency because there are shorter and less
complicated interconnections
u Bandwidth
• Generally higher levels in hierarchy have greater bandwidth because it’s
cheaper to run wider data paths with short “distances” and low fanout
u Concurrency
• Replication of resources can improve bandwidth for a cost
– Split caches
– Concurrent interconnection paths
– Multiple memory banks
– Multiple mass storage devices
u Balance
• Plumbing diagrams can give a first-order approximation of system balance for
throughput
Review
u Principles of Locality
• Temporal, spatial, sequential
u Physical Memory Hierarchy:
• CPU registers -- smallest & fastest; measured in Bytes
• Cache -- almost as fast as CPU; measured in KB
• Bus/Interconnect -- bandwidth costs money
• Main Memory -- slow but large; measured in MB
• Mass storage -- slower and larger; measured in GB
u Bandwidth “Plumbing” diagrams
• Back-of-envelope calculations can demonstrate bottlenecks