| Methods for generating code for an architecture encoding an extended register specification -> Monitor Keywords |
|
Methods for generating code for an architecture encoding an extended register specificationRelated Patent Categories: Data Processing: Software Development, Installation, And Management, Software Program Development Tool (e.g., Integrated Case Tool Or Stand-alone Development Tool), Translation Of CodeMethods for generating code for an architecture encoding an extended register specification description/claimsThe Patent Description & Claims data below is from USPTO Patent Application 20070038984, Methods for generating code for an architecture encoding an extended register specification. Brief Patent Description - Full Patent Description - Patent Application Claims CROSS-REFERENCE TO RELATED APPLICATIONS [0001] This is a non-provisional application claiming the benefit of U.S. provisional application Ser. No. 60/707,572, entitled "Methods for Generating Code for an Architecture Encoding an Extended Register Specification", filed on Aug. 12, 2005, which is incorporated by reference herein. Moreover, this application is related to a non-provisional application, Attorney Docket No. YOR920050390US2, entitled "Implementing Instruction Set Architectures with Non-Contiguous Register File Specifiers", filed concurrently herewith, and incorporated by reference herein. BACKGROUND [0002] 1. Technical Field [0003] The present invention relates generally to computers and, more particularly, to methods for generating code for an architecture encoding an extended register specification. [0004] 2. Description of the Related Art [0005] In modern microprocessors, increases in latencies have been an increasingly severe problem. These increases are occurring both for operations performed on the chip, and for memory access latencies. There are a number of reasons for this phenomenon. [0006] One reason is that the trend to achieve ever higher performance results in increased use of high clock frequencies. This leads to deeper pipelining (i.e., the division of a basic operation into multiple stages) and, hence, a larger number of total stages as an operation is divided into ever-smaller units of work to achieve these high frequencies. [0007] Another reason relates to the differences in chip and memory speeds. That is, while both chip and memory speeds have been increasing, memory speed has been increasing at a much smaller rate. Thus, in terms of processor cycles to access a location in memory, latency has increased significantly. The relatively faster increase in chip speed is due to both the above-mentioned deep pipelining, and to CMOS scaling used as a technique to increase chip speeds, as disclosed by R. H. Dennard et al., in "Design of Ion-Implanted MOSFETs with Very Small Physical Dimensions," IEEE Journal of Solid-State Circuits, SC-9, pp. 256-68, 1974, which is incorporated by reference herein. [0008] Moreover, another reason for the increasing latencies relates to differences in wire and logic speeds. That is, as CMOS scaling is applied ever more aggressively, wire speeds do not scale at the same rate as logic speeds, leading to a variety of latency increases, e.g., increasing the time required to complete operations by requiring longer time to write back their results. [0009] In addition to aggressive technology scaling and deep pipelining, computer architects have also turned to the use of more aggressive parallel execution by means of superscalar instruction issue, whereby multiple operations can be initiated in a single cycle. Recent microprocessors such as the state-of-the art Power5 or PowerPC 970 processor can dispatch 5 operations per cycle and initiate operations at the rate of 7 and 9 operations per cycle, respectively. [0010] To continue improving the performance of microprocessors, there are two challenges of significance: achieving high levels of parallelism; and tolerating increasing latency (in terms of processor cycles) of memory. Both achieving higher parallelism and tolerating longer latency requires programs to be compiled to use more independent strands of computation simultaneously. This, in turn, requires a large number of registers to be available to support the multiple independent strands of computation by storing all of their intermediate results. [0011] A result of the ability to execute more instructions in pipelines with increasing latency, and to initiate execution in multiple pipelines, requires ever-larger amounts of data to be maintained by a processor in order to serve as inputs or to be received as results of operations. To accomplish this, architects and programmers have two options, namely retrieve and store data in a memory hierarchy, or in on-chip register file storage. [0012] Of these choices, register file storage offers multiple advantages, including higher bandwidth and shorter latency, as well as lower energy dissipation per access. However, the number of registers specified in architectures has not increased since the introduction of RISC computing (when the size of register files was increased from the then customary 8 or 16 registers to 32 registers) until recently. Thus, as the demands for faster register storage grow, to buffer input operands and operation results from an increasing number of instructions simultaneously being executed, the number of architected registers has stayed constant while the performance of memory hierarchies has de facto decreased (in terms of processor cycles required to provide data to the processor core). [0013] To show how the effectiveness of register files has diminished in light of changes to processor architecture that have occurred in response to technology shifts, consider the following simple ratios. Approximately 15 years ago, circa 1990, a high-end processor would typically have one floating point pipeline, with about 3 computational pipeline stages, plus an additional cycle for register file access. When processing FMA operations, i.e., merged floating-point multiply-add high performance computation primitives, a pipeline would have 4 FMA operations in flight (one per pipeline stage), each requiring 3 input registers and one output register, for a total of 16 registers to support all computations in flight. Given the typical complement of 32 floating-point registers, this would leave an additional 16 registers to hold other data and/or constants. Considering the parallelism provided by state-of-the-art microprocessors, coupled with the latencies incurred by deep pipelining, the number of registers required to supply and store sufficient operands to exploit the peak execution rate provided by a modern microprocessor is well in excess of the 32 floating-point registers typically provided in the instruction set architecture. [0014] Similarly, in past machines, a second level cache could be accessed with a 3 cycle hit latency, which gives a ratio of about 10 registers available per cycle of latency to L2 cache (i.e., 32 registers divided across the 3 access cycles). This is a conservative measure; the actual number of registers required to completely cover an L2 cache access (and therefore to decouple memory access from computational use) in a more realistic scenario would depend on the actual number of operands consumed during such time, which scales up with issue width. Today, with a 10-12 cycle latency to L2 cache, to preserve a similar 10 registers per cycle ratio would require between 100 and 120 registers. [0015] Large numbers of registers are in fact built, e.g., both the Power4 and Power5 microprocessors implement many more than the 32 architected registers. Several recent microprocessors implement a technique called register renaming, whereby the limited number of architected registers is translated to use a larger pool of (physical) registers internally. However, to exploit these larger register files, complex (and area intensive) renaming logic and out-of-order issue capabilities are required. Even then, the inability to express the best schedule for the program using a compiler or a skillfully tuned Basic Linear Algebra Subprogram (BLAS) or other such library limits the overall performance potential. [0016] While register renaming does allow an increase in the number of registers, register renaming is a complex task that requires additional steps in the instruction processing of microprocessors. Thus, what is required to address the challenges in modern microprocessor design is an increased number of registers that are easy to access using an extended name space in the architecture, as opposed to techniques such as register renaming used in high-end microprocessors such as the IBM PowerPC 970 and Power5. [0017] Recently, the IA-64 architecture and the CELL SPU architecture have offered implementations with 128 architected registers. In reference to these implementations, the IA-64 offers an implementation using instruction bundles, a technique to build instruction words wider than a machine word. While this resolves the issue of instruction encoding space, it leads to inefficient encoding due to a reduction of code density because an instruction word disadvantageously occupies more than a single machine word, thereby reducing the number of instructions which can be stored in a given memory unit. [0018] Recent advances in the encoding of instruction sets, disclosed in the U.S. patent application to Altman et al., entitled "Method and Apparatus to Extend the Number of Instruction Bits in Processors with Fixed Length Instructions in a Manner Compatible with Existing Code", U.S. patent application Ser. No. 10/720,585, filed on Nov. 24, 2003, which is commonly assigned and incorporated by reference herein, advantageously allows wide instruction words to be used in conjunction with fixed size word instruction set architectures having an instruction format requiring only a single machine word for most instructions. While this offers a significant advantage over prior wide-word bundle-oriented instruction sets in terms of code density, decoding complexity is increased. [0019] In an advantageous implementation of fixed width 32 bit instruction words, the CELL SPU instruction set architecture supports the specification of 128 registers in a 32 bit instruction word, implementing a SIMD-ISA in accordance with the U.S. patent application to Gschwind et al., entitled "SIMD-RISC Microprocessor Architecture", U.S. patent application Ser. No. 11/065,707, filed on Feb. 24, 2005, and the U.S. Pat. No. 6,839,828 to Gschwind et al., entitled "SIMD Datapath Coupled to Scalar/Vector/Address/Conditional Data Register File With Selective Subpath Scalar Processing Mode", which are commonly assigned and incorporated by reference herein. [0020] While the SPU advantageously offers the use of 128 registers in a fixed instruction word using a new encoding that, in turn, uses fields of 7 adjacent bits in a newly specified instruction set, this is accomplished in an entirely new instruction set architecture without regard to existing (legacy) instructions or programs. Legacy architectures, on the other hand, are not without deficiency. For example, since many bit combinations have been assigned a meaning in legacy architectures, and certain bit fields have been set aside to signify specific architectural information (such as extended opcodes, register fields, and so forth) legacy architectures offer significant obstacles to encoding new information. Specifically, when allocating new instructions, the specification for these new instructions cannot arbitrarily allocate new fields without complicating the decoding of both the pre-existing and these new instructions. [0021] Additionally, the number of bits in instruction sets with fixed instruction word width limits the number of different instructions that can be encoded. For example, most RISC architectures use fixed length instruction sets with 32 bit instruction words. This encoding limitation is causing increasing problems as instruction sets are extended. For example, there is a need to add new instructions to efficiently execute modern applications. Primary examples are multimedia extensions such as INTEL's MMX, SSE and SSE2 and the PowerPC VMX/Altivec extensions. Moreover, the number of cycles required to access caches and memory is growing as the processor frequencies increase. One way to alleviate this problem is to add more registers to the processor to reduce the number of loads and stores (typically required to spill and restore register values when insufficient register resources or names are available). However, it is difficult or impossible to specify additional registers in the standard 32-bit RISC instruction encoding. [0022] The most common solution to this problem is an approach typically associated with CISC architectures, which allows multiple instruction lengths, rather than maintaining a single, fixed instruction size such as 32 bits. This variable length CISC approach has several drawbacks, and was one of the reasons RISC was developed in the 1980's. Among the problems associated with variable length CISC encoding is the additional complexity it requires in the instruction decode, resulting in additional decode pipeline stages in the machine or a reduced frequency. Moreover, another problem with variable length CISC encoding is that it allows instructions to span natural memory boundaries (e.g., cache line and page boundaries), complicating instruction fetch and virtual address translation. Another problem with variable length CISC encoding is that such a CISC approach cannot be compatibly retrofitted to a RISC architecture. For example, architectures having fixed length instructions today assume pervasively that all instructions are aligned on the boundary, that branch addresses are specified at a multiple of a fixed length instruction, and so forth. Further, no mechanisms are defined to address the issue of page-spanning instructions, and so forth. Continue reading about Methods for generating code for an architecture encoding an extended register specification... Full patent description for Methods for generating code for an architecture encoding an extended register specification Brief Patent Description - Full Patent Description - Patent Application Claims Click on the above for other options relating to this Methods for generating code for an architecture encoding an extended register specification patent application. ### 1. Sign up (takes 30 seconds). 2. Fill in the keywords to be monitored. 3. Each week you receive an email with patent applications related to your keywords. Start now! - Receive info on patent apps like Methods for generating code for an architecture encoding an extended register specification or other areas of interest. ### Previous Patent Application: Systems and methods for enterprise software management Next Patent Application: Syntactic program language translation Industry Class: Data processing: software development, installation, and management ### FreshPatents.com Support Thank you for viewing the Methods for generating code for an architecture encoding an extended register specification patent info. IP-related news and info Results in 0.14352 seconds Other interesting Feshpatents.com categories: Software: Finance , AI , Databases , Development , Document , Navigation , Error 174 |
* Protect your Inventions * US Patent Office filing
PATENT INFO |
|