| Index: > A B C D E F G H I J K L M N O P Q R S T U V W X Y Z |
|
|||||
| First Prev [ 1 2 3 4 5 6 ] Next Last |
To illustrate both specialization and multi-level caching, here is the cache hierarchy of the AMD Athlon 64, whose core design is known as the K8 (details here):
The K8 has 4 specialized caches: an instruction cache, an instruction TLB, a data TLB, and a data cache. Each of these caches is specialized:
The K8 also has multiple-level caches. There are second-level instruction and data TLBs, which store only PTEs mapping 4KB. Both instruction and data caches, and the various TLBs, can fill from the large unified L2 cache. This cache is exclusive to both the L1 instruction and data caches, which means that any 8-byte line can only be in one of the L1 instruction cache, the L1 data cache, or the L2 cache. It is, however, possible for a line in the data cache to have a PTE which is also in one of the TLBs — the operating system is responsible for keeping the TLBs coherent by flushing portions of them when the page tables in memory are updated.
The K8 also caches information that is never stored in memory — prediction information. These caches are not shown in the above diagram. As is usual for this class of CPU, the K8 has fairly complex
branch prediction, with tables that help predict whether branchesare taken and other tables which predict the targets of branches and jumps. Some of this information is associated with the contents of the instruction cache, and is transferred to the secondary cache along with evicted instruction cache lines to improve the prediction accuracy later, when the code is filled back into the level 1 instruction cache. In the secondary cache, this information is stored in the same locations used for data ECC — instructions have only parity in the secondary cache as well.
(This section should be rewritten.)
Other processors have other kinds of predictors (e.g. the store-to-load bypass predictor in the DEC Alpha 21264), and various specialized predictors are likely to flourish in future processors.
These predictors are caches in the sense that they store information that is costly to compute. Some of the terminology used when discussing predictors is the same as that for caches (one speaks of a hit in a branch predictor), but predictors are not generally thought of as part of the cache hierarchy.
The K8 keeps the instruction and data caches coherent in hardware, which means that a store into an instruction closely following the store instruction will change that following instruction. Other processors, like those in the Alpha and MIPS family, have relied on software to keep the instruction cache coherent. Stores are not guaranteed to show up in the instruction stream until a program calls an operating system facility to ensure coherency. The idea is to save hardware complexity on the assumption that self-modifying code is rare.
The cache hierarchy gets still larger if we consider software as well as hardware. The register file in the core of the processor can be considered a very small, very fast cache whose hits, misses, and fills are scheduled by the compiler ahead of time. (See especially
loop nest optimization.) Register files sometimes also havehierarchy: The Cray-1 (circa 1976) had 8 scalar and 8 address registers that were generally usable, and 64 scalar and 64 address "B" registers. The "B" registers could be loaded faster than main memory, because the Cray-1 did not have a data cache.
Because cache reads are the most common operation that take more than a single cycle, the recurrence from a load instruction to an instruction dependent on that load tends to be the most critical path in well-designed processors, so that data on this path wastes the least amount of time waiting for the clock. As a result, the level-1 cache is the most latency sensitive block on the chip.
The simplest cache is a virtually indexed direct-mapped cache. The virtual address is calculated with an adder, the relevant portion of the address extracted and used to index an SRAM, which returns the loaded data. The data is byte aligned in a byte shifter, and from there is bypassed to the next operation. There is no need for any tag checking in the inner loop — in fact, the tags need not even be read. Later in the pipeline, but before the load instruction is retired, the tag for the loaded data must be read, and checked against the virtual address to make sure there was a cache hit. On a miss, the cache is updated with the requested cache line and the pipeline is restarted.
An associative cache is more complicated, because some form of tag must be read to determine which entry of the cache to select. An N-way set-associative level-1 cache usually reads all N possible tags and N data in parallel, and then chooses the data associated with the matching tag. Level-2 caches sometimes save power by reading the tags first, so that only one data element is read from the data SRAM.
The diagram to the right is intended to clarify the manner in which the various fields of the address are used. The address bits are labelled in little endian notation: bit 31 is most significant; bit 0 is least significant. The diagram shows the SRAMs, indexing, and multiplexing for a 4KB, 2-way set-associative, virtually indexed and virtually tagged cache with 64B lines, a 32b read width and 32b virtual address.
Because the cache is 4KB and has 64B lines, there are just 64 lines in the cache, and we read two at a time from a Tag SRAM which has 32 rows, each with a pair of 21 bit tags. Although any function of virtual address bits 31 through 6 could be used to index the tag and data SRAMs, it is simplest to use the least significant bits.
Similarly, because the cache is 4KB and has a 4B read path, and reads two ways for each access, the Data SRAM is 512 rows by 8 bytes wide.
A more modern cache might be 16KB, 4-way set-associative, virtually indexed, virtually hinted, and physically tagged, with 32B lines, 32b read width and 36b physical addresses. The read path recurrence for such a cache looks very similar to the path above. Instead of tags, vhints are read, and matched against a subset of the virtual address. Later on in the pipeline, the virtual address is translated into a physical address by the TLB, and the physical tag is read (just one, as the vhint supplies which way of the cache to read). Finally the physical address is compared to the physical tag to determine if a hit has occurred.
Some SPARC designs have improved the speed of their L1 caches by a few gate delays by collapsing the virtual address adder into the SRAM decoders. See Sum addressed decoder.