updated 05/17/13


Dual mode floating point multiply accumulate unit   



Abstract: Disclosed are various embodiments of a stream processing unit for single instruction multiple data (SIMD) processing, wherein the stream processing unit executes a stage of a Multiply-Accumulate calculation. In one embodiment, the stream processing unit comprises a plurality of scalar arithmetic logic units (ALUs) configured to receive data having a plurality of data types. The number and type of scalar ALUs corresponds to an SIMD factor. In one embodiment, the scalar ALUs are executed sequentially with a delay being introduced in between execution of each of the scalar ALUs, wherein the delay corresponds to the SIMD factor. ...

Agent: Via Technologies, Inc. - Taipei, TW
Inventors: Boris Prokopenko, Timour Paltashev, Derek Gladding
USPTO Application #: 20110208946 - Class: 712 22 (USPTO) - 08/25/11 - Class 712
Related Terms: Arithmetic   Instruction   Logic   Mode   Multiple   Number   Processing   Simd   


The Patent Description & Claims data below is from USPTO Patent Application 20110208946, Dual mode floating point multiply accumulate unit.


CROSS REFERENCE

This application is a continuation of U.S. patent application Ser. No. 11/671,630, filed Feb. 6, 2007, which claimed the benefit of U.S. Provisional Application No. 60/765,571, filed on Feb. 6, 2006. The contents of these prior applications are incorporated herein by reference. This application is also related to U.S. Utility Patent Application entitled “Stream Processor with Variable Single Instruction Multiple Data (SIMD) Factor and Common Special Function” filed on the same day as the present application and accorded Ser. No. 11/671,610 (now abandoned), which is hereby incorporated by reference herein in its entirety.

The U.S. Patent Application entitled “SIMD Processor with Scalar Arithmetic Logic Units” filed on Jan. 29, 2003 and given Ser. No. 10/354,795, now U.S. Pat. No. 7,146,486, is also incorporated by reference in its entirety.

BACKGROUND

Since the year 2000, fixed function Graphics Processing Units (GPUs) have become increasingly programmable, giving a user direct and flexible control over the processing of primitive, vertex, texture, and pixel streams in graphics chips. Many current GPUs feature programmability in the form of at least one shader (primitive, vertex, etc.) but generally can process only a few types of data (say, 32-bit floating point for vertex and 32-bit integer). The programmable shaders in the graphics pipeline are generally arranged in a sequential manner for forwarding data to fixed function units and to each other, with a data format conversion if desired.

Also generally involved in the design of GPUs are parallel multiprocessor architecture principles. Application of parallel architecture principles generally utilizes a plurality of same type arithmetic logic units (ALUs) to process different types of stream data in non-uniform program threads. In many circumstances, the ALUs are desired to process different kinds of data for every clock cycle if non-uniform program threads are interleaved.

One important issue is the implementation of complex mathematical functions (special functions) in such multiprocessor structures. There are generally two ways to implement them: a special subroutine executed on a general ALU, or a special hardware unit attached to a general ALU that produces a result at the ALU's request. Software implementation of such functions causes significant performance degradation, which might be unacceptable in real-time graphics applications. Where multiple ALUs are combined in an SIMD structure, such a unit would have to be attached to every ALU, which may significantly increase hardware overhead. Such complex functions are not used very often in a shader program, so most of the time the special hardware units attached to each general ALU will be idle.

This situation can be partially resolved by sharing the special function unit (SFU) among a plurality of ALUs, but in the case of an SIMD structure, a thread will be stalled until all streams get their results from the shared SFU, which processes requests sequentially. Each use of a complex mathematical function in a shader program may therefore add several cycles of overhead. Special arrangements in the SIMD stream architecture should be made to minimize stall wait cycles and provide smooth stream processing with minimal overhead if non-uniform program threads are interleaved.

While the ALUs used in this multiprocessing manner generally sustain high throughput, the ALUs should be able to process more data streams in a short format while sharing the same hardware used for a longer format. Generally speaking, current ALUs for GPUs are configured to process only one floating point format (e.g., 32-bit IEEE format as standard) and generally experience low performance in processing lower accuracy pixel and texture data. Additionally, if another type of data format is supported, the ALU generally works with the same number of streams, with little to no throughput improvement or Single Instruction Multiple Data (SIMD) factor variability regardless of the data format. Further, current ALUs are generally not configured to arbitrarily interleave the flow of instructions (lack of support for non-uniform threads). Additionally, current dual format Multiply Accumulate (MACC) units can generally process only integer data.

Vector machines with a fixed data format and a fixed SIMD factor generally have less of a hardware load and generally process stream data relatively slowly where there are fewer elements in the vector stream than the width of a vector unit. Additionally, current graphics shader architecture generally has limited instruction set capabilities for processing data of different formats in the same instruction.

Thus, a heretofore unaddressed need exists in the industry to address the aforementioned deficiencies and inadequacies.

SUMMARY

Included are embodiments of a stream processing unit for single instruction multiple data (SIMD) processing that is configured to process a plurality of different data types. Embodiments of the stream processing unit include a plurality of scalar arithmetic logic units (ALUs) for executing a stage of a Multiply-Accumulate calculation. Additionally, embodiments of the stream processing unit include a short format component configured to facilitate processing of short format data, a long format component configured to facilitate processing of long format data, a mixed format component configured to facilitate processing of short format data and long format data, and a mantissa component configured to facilitate processing of a plurality of different formatted operands.

Also included are methods of processing a plurality of different data types. At least one embodiment of a method includes receiving data for processing, executing a stage of a Multiply-Accumulate calculation using scalar ALUs, executing each of the scalar ALUs, and directing the output using a control unit.

Other systems, methods, features, and advantages of this disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description and be within the scope of the present disclosure.

BRIEF DESCRIPTION

Many aspects of the disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views. While several embodiments are described in connection with these drawings, there is no intent to limit the disclosure to the embodiment or embodiments disclosed herein. On the contrary, the intent is to cover all alternatives, modifications, and equivalents.

FIG. 1A is a flowchart illustrating stream data processing steps that can be taken in an exemplary vector processing unit.

FIG. 1B is a flowchart illustrating stream data processing steps that can be taken in an exemplary scalar processing unit, similar to the steps illustrated in FIG. 1A.

FIG. 1C is an exemplary stream processing SIMD structure with software implementation of complex mathematical functions.

FIG. 1D is an exemplary stream processing SIMD structure with hardware implementation of complex mathematical functions using private special function unit (SFU) for each ALU.

FIG. 1E is an exemplary stream processing SIMD structure with hardware implementation of complex mathematical functions using a common SFU for all ALUs.

FIG. 1F is an exemplary stream processing SIMD structure with implementation of complex mathematical functions using a common SFU with interleaved access to common SFU.

FIG. 1G is an exemplary illustration of an SIMD factor reduction in the case of a common SIMD structure for both vertex and triangle processing.

FIG. 2A is a flowchart illustrating steps that can be taken in an exemplary scalar processing unit, similar to the flowchart from FIG. 1, with an SIMD factor 4.

FIG. 2B is a flowchart illustrating steps that can be taken in an exemplary scalar processing unit, similar to the flowchart from FIG. 1, with an SIMD factor 1.

FIG. 2C is a flowchart illustrating steps that can be taken in an exemplary scalar processing unit, similar to the flowchart from FIG. 1, with an SIMD factor 8 for short data format.

FIG. 2D is a flowchart illustrating steps that can be taken in an exemplary processing unit, similar to the flowchart from FIG. 1, with an SIMD factor 4 for short data format.

FIG. 3 is an exemplary logical structure of paired scalar ALUs with dual format processing capabilities, illustrating processing characteristics from FIGS. 1 and 2A-2D, illustrating stream ALU functionality.

FIG. 4 is an exemplary stream processing unit in long format processing mode with paired scalar ALUs, similar to the structure from FIG. 3, and showing an upper level of control and memory.

FIG. 5A is a table illustrating exemplary arithmetic functionality of paired scalar ALUs, and can be used as a base for numerical processing instruction set development such as the ALUs illustrated in FIGS. 3 and 4.

FIG. 5B is a GPU structure where an exemplary stream processor pool is used as a computational core, where the stream processor has a scalable architecture and may contain from 2 to 16 ALUs combined with a reduced number of special function units.

FIG. 6 is an exemplary flow diagram and logical structure of a stream processor with 4 scalar ALUs, and SFU interaction, similar to the ALUs from FIGS. 3 and 4.

FIG. 7A is a flowchart illustrating an exemplary normalized vector difference processing in a vector ALU.

FIG. 7B is a flowchart of an exemplary processing routine in a proposed stream scalar ALU combined with an SFU.

FIG. 7C is a continuation of FIG. 7B.

FIG. 8 is an exemplary ALU module, implementing functionality of the ALUs from FIG. 6.

FIG. 9 is an exemplary modular stream processor with a combination of 4 ALU modules, similar to the ALUs from FIGS. 3 and 4.

FIGS. 10A-10C are diagrams illustrating exemplary logical structure and data formats for Multiply Accumulate units, such as the Multiply Accumulate Unit from FIG. 8.

FIG. 11 is an exemplary structure of a MACC unit, similar to the MACC unit from FIG. 8.

FIG. 12 is an exemplary diagram of a short exponent calculation, similar to the short exponent calculation from FIG. 11.

FIG. 13 is an exemplary diagram of a short exponent calculation combined with a mixed exponent, similar to the short exponent calculation from FIG. 11.

FIG. 14 is an exemplary diagram of a short mantissa path for various channels, describing details of the mantissa path illustrated in FIG. 11.

FIG. 15 is an exemplary diagram of a long exponent calculation, describing details of the exponent calculation block from FIG. 11.

FIG. 16 is an exemplary diagram of a long exponent calculation, for a paired ALU, describing details of the long exponent calculation block from FIG. 11.

FIG. 17 is an exemplary diagram of a long mantissa data path, describing details of a data path illustrated in FIG. 11.

FIG. 18 is an exemplary diagram of a long mantissa data path for a paired ALU, similar to the data path illustrated in FIG. 11.

FIG. 19 is an exemplary diagram of a mixed exponent calculation, describing details of the mixed exponent calculation illustrated in FIG. 11.

FIG. 20 is an exemplary diagram of a mixed exponent calculation for a paired ALU, similar to a mixed exponent calculation illustrated in FIG. 19.

FIG. 21 is an exemplary diagram of a mixed mantissa data path, describing details of the data path illustrated in FIG. 11.

FIG. 22 is an exemplary diagram of a mixed mantissa data path for a paired ALU, similar to a data path illustrated in FIG. 21.

FIG. 23 is an exemplary diagram of a merged mantissa data path, which can process short and long data formats, describing details of a possible implementation of the data path illustrated in FIG. 11.

FIG. 24 is an exemplary diagram illustrating a merged mantissa data path, similar to a data path illustrated in FIG. 11.

FIG. 25A is an exemplary diagram illustrating merged shift and control logic, which can be applied in the MACC from FIGS. 23 and 24.

FIG. 25B is an exemplary diagram illustrating sign control logic, which can be applied in the MACC from FIGS. 23 and 24.

FIG. 26 is an exemplary table of complement shift input and output formats, which may be utilized in the MACC from FIG. 11.

FIG. 27A is an exemplary diagram of a mantissa addition path, which can be utilized in the MACC from FIGS. 23 and 24.

FIG. 27B is an exemplary diagram of processing formats that can be utilized in the MAD carry save adder tree units from FIGS. 23 and 24.

FIG. 27C is a continuation of the processing formats from FIG. 27B.

FIG. 28A is an exemplary diagram of a fence implementation in a CSA adder, which may be utilized in the MACC from FIGS. 23 and 24.

FIG. 28B is an exemplary diagram of a fence implementation in a CPA adder, which may be utilized in the MACC from FIGS. 23 and 24.

FIG. 29 is an exemplary diagram of a fence implementation in a complement shift unit, which may be utilized in the MACC from FIGS. 23 and 24.

FIG. 30A is an exemplary fence in a normalization shifter, which may be utilized in the MACC from FIGS. 23 and 24.

FIG. 30B is a more detailed view of the exemplary fence from FIG. 30A.

FIG. 31 is a flowchart illustrating an exemplary process that may be utilized for sending data to a functionally separated ALU.

DETAILED DESCRIPTION

FIG. 1A is a flowchart illustrating stream data processing steps that can be taken in an exemplary processing unit using a vector ALU combined with a special function unit. More specifically, the nonlimiting example of FIG. 1A illustrates a stream vector processing unit with a regular architecture 100. As illustrated, an input stream of 3-dimensional graphics data vectors are sent to an input buffer regular memory 102. The input buffer regular memory in this nonlimiting example communicates vector data to the vector arithmetic logic unit (ALU) 104. As illustrated with the sequential instruction cycles, each vector includes four components X, Y, Z, and W. As illustrated, as the vectors are being sent from the input buffer regular memory 102 to the vector ALU 104, the vectors are arranged with each vector being communicated together. The vector ALU 104 and Special Function Unit (SFU) 106 can perform the desired operation to produce outputs for each component of the current vector. An SFU can be configured to process various types of operations such as sine functions, cosine functions, square root functions, fractions, exponentials, etc.

FIG. 1B is a flowchart illustrating steps that can be taken in an exemplary scalar processing unit, similar to the steps illustrated in FIG. 1A. FIG. 1B illustrates a vector data processing using a stream processor with four scalar ALUs 124. More specifically, an input stream of 3-dimensional graphics data vectors is input into input data buffer 4-Bank orthogonal access memory 122. The memory illustrated in this nonlimiting example is configured to provide a vertical access pattern on the data read versus a horizontal access pattern on data write (memory input or output). Such type of memory has a special vector component multiplexor and address generators for one or more of the memory banks, as discussed in U.S. Patent application 20040172517, filed Sep. 19, 2003, which is hereby incorporated by reference in its entirety.

The input data buffer 4-bank orthogonal access memory 122 can then send the rearranged (vertical) vector data to scalar ALUs 124a-124d. More specifically, the input data buffer 4-bank orthogonal access memory sequentially sends the first vector data elements (W1, Z1, Y1, and X1) to scalar ALU 1 124a; sequentially sends second vector data elements (W2, Z2, Y2, and X2) to scalar ALU 2 124b; sequentially sends third vector data elements to scalar ALU 3 124c; and sequentially sends fourth vector data elements to scalar ALU 4 124d. The scalar ALUs 124a-124d and special function unit (SFU) 126 can process the vector data accordingly and send the processed data to buffers S1, S2, S3, and S4, respectively. The output buffers (S1-S4) then send the data to the output orthogonal converter 130, which can convert the received data into a horizontal vector format. More specifically, the orthogonal converter 130 can be configured to convert the processed data from a scalar sequential or vertical representation to a vector horizontal representation. The data can then be output as illustrated with Xout, Yout, Zout, and Wout.
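As a nonlimiting software sketch of the rearrangement described above, the horizontal-write and vertical-read access pattern can be modeled as a transpose of the buffered vectors, and the output orthogonal converter as the inverse transpose. The function names `read_vertical` and `to_horizontal`, the sample values, and the doubling operation are illustrative assumptions and do not come from the disclosure; ALU delays are ignored.

```python
def read_vertical(bank_rows):
    """Model a vertical read of vectors that were written horizontally.

    bank_rows[k] holds vector k as (X, Y, Z, W). The result has one entry
    per component; entry c holds component c of every vector, so position k
    of each entry is what scalar ALU k would receive in sequence.
    """
    return [tuple(column) for column in zip(*bank_rows)]


def to_horizontal(per_component):
    """Model the output converter: back to one (X, Y, Z, W) tuple per vector."""
    return [tuple(row) for row in zip(*per_component)]


# Four hypothetical input vectors (values are made up for illustration).
vectors = [
    (1.0, 2.0, 3.0, 4.0),
    (5.0, 6.0, 7.0, 8.0),
    (9.0, 10.0, 11.0, 12.0),
    (13.0, 14.0, 15.0, 16.0),
]

vertical = read_vertical(vectors)

# Every scalar ALU applies the same command (here, an assumed "times two").
processed = [tuple(2.0 * value for value in step) for step in vertical]

outputs = to_horizontal(processed)
```

In this model `vertical[0]` holds the X components of all four vectors, and `outputs` again holds one complete vector per entry.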

One should note that while the vector processing unit with regular architecture 100 processes vector data one vector at a time, the vector data processing using stream processor with four scalar ALUs 120 does not have this requirement. As illustrated, vector component data can be processed in any order and subsequently rearranged for output. Additionally, although both the vector data processing using stream processor with four scalar ALUs 120 and the vector processing unit with regular architecture 100 are shown receiving vector data as a data set, this is not a requirement. Vector components can be received as scalars in any order and processed in an SIMD manner.

As mentioned earlier, an SIMD stream processor can be configured to perform complex mathematical operations (special functions) such as square root, sine, cosine, and others to provide graphics data processing in a modern GPU. A vector ALU may have an attached (or otherwise accessible) SFU, and the SFU may be configured to work every time an appropriate command arrives at the ALU. This SFU may be considered a separate channel in this nonlimiting ALU.

FIG. 1C is an exemplary stream processing SIMD structure with software implementation of complex mathematical functions. With an SIMD scalar ALU, there are a few options for implementing special functions. FIG. 1C illustrates a stream processing SIMD structure with software implementation of complex mathematical functions. Each ALU has a special attached lookup table and a slightly modified data path to perform the special function calculation sequence described in a special routine (for example, the Newton-Raphson algorithm for square root). The latency of a special function calculation in this case equals the number of instructions in the special function routine multiplied by the SIMD scalar ALU instruction execution cycle time. One problem with such an implementation is that this latency can be quite significant, depending on the number of instructions to be executed in each ALU.
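The latency relationship above can be sketched in software. The following nonlimiting example runs a Newton-Raphson square root routine and multiplies an assumed instruction count by an assumed instruction cycle time. The crude initial guess (standing in for the lookup table), the figure of three instructions per iteration, and the four-cycle instruction time are all assumptions made for illustration; none of them is specified in the disclosure.

```python
def sqrt_newton(a, steps=6, ops_per_step=3):
    """Approximate sqrt(a) for a > 0 with Newton-Raphson iterations.

    Returns (approximation, assumed_instruction_count). ops_per_step is a
    hypothetical count (divide, add, multiply) used only for the latency model.
    """
    if a <= 0.0:
        raise ValueError("a must be positive")
    x = a if a >= 1.0 else 1.0  # crude guess; hardware would use a lookup table
    for _ in range(steps):
        x = 0.5 * (x + a / x)  # Newton-Raphson update for f(x) = x*x - a
    return x, steps * ops_per_step


def software_latency(instruction_count, cycles_per_instruction):
    """Latency = instructions in the routine * instruction execution cycle time."""
    return instruction_count * cycles_per_instruction


root, instructions = sqrt_newton(2.0)
latency = software_latency(instructions, cycles_per_instruction=4)
```

Under these assumptions the routine costs 18 instructions, so every special function call occupies the ALU for 72 cycles, which illustrates why the software approach may be too slow for real-time graphics.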

FIG. 1D is an exemplary stream processing SIMD structure with hardware implementation of complex mathematical functions using a private SFU for each ALU. As illustrated in FIG. 1D, another approach is to provide a private hardware special function unit for each scalar ALU. One problem with such an implementation is excessive hardware, which is generally rarely used. The latency of a special function calculation, however, is minimal and normally equal to the average instruction execution cycle.

FIG. 1E is an exemplary stream processing SIMD structure with hardware implementation of complex mathematical functions using a common SFU for all ALUs. As illustrated, one can reduce hardware overhead by using a common SFU hardware block that can process requests from multiple scalar ALUs. One problem with such an implementation is significant stall time for all scalar ALUs while the SFU sequentially processes requests from all ALUs and calculates values for all streams. One should note that in such an SIMD structure all requests to the SFU appear at the same time. Generally speaking, all the ALUs will wait until the last ALU receives a value from the SFU. The overall latency of such an operation is equal to the SFU processing cycle multiplied by the number of scalar ALUs combined with this SFU.

FIG. 1F is an exemplary stream processing SIMD structure with implementation of complex mathematical functions using a common SFU with interleaved access to the common SFU. The SFU latency for each stream can be reduced using interleaved access to the SFU from the scalar ALUs. More specifically, the nonlimiting example of FIG. 1F illustrates a proposed embodiment of a stream processing SIMD structure with a common SFU. In this configuration, requests from different scalar ALUs are separated in time using special delay registers, which reschedule execution of the same SIMD instruction in different ALUs. The latency for each stream will be equal to the latency of a private SFU; the remaining latency, compared with the previous structure, is compensated by the delay registers.
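The difference between the common SFU of FIG. 1E and the interleaved access of FIG. 1F can be sketched with a simple timing model. This is a nonlimiting illustration: the function names, the choice of four ALUs and a two-cycle SFU, and the assumption that the delay registers stagger each ALU by exactly one SFU processing cycle are hypothetical and are not taken from the disclosure.

```python
def shared_sfu_stall(num_alus, sfu_cycles):
    """FIG. 1E model: all requests arrive at once and are served one by one,
    so every ALU waits until the last request has been served."""
    return num_alus * sfu_cycles


def interleaved_schedule(num_alus, sfu_cycles):
    """FIG. 1F model: delay registers stagger the same instruction so that
    ALU k issues its request k * sfu_cycles after ALU 0 (assumed stagger).

    Returns a list of (issue_time, done_time) pairs, one per ALU.
    """
    schedule = []
    sfu_free_at = 0
    for k in range(num_alus):
        issue = k * sfu_cycles
        start = max(issue, sfu_free_at)  # wait only if the SFU is still busy
        done = start + sfu_cycles
        sfu_free_at = done
        schedule.append((issue, done))
    return schedule


stall = shared_sfu_stall(num_alus=4, sfu_cycles=2)
schedule = interleaved_schedule(num_alus=4, sfu_cycles=2)
per_stream_latency = [done - issue for issue, done in schedule]
```

In this model each stream of the interleaved structure sees only the latency of a single SFU pass, whereas every ALU of the non-interleaved structure stalls for the full sequential service time.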

Another problem which affects SIMD scalar stream processor efficiency is the SIMD factor when processing different types of input streams. These streams may contain vertex, triangle, and/or pixel data, and accumulation of the required input data in storage may create significant delays as well as increase the data life span in local memory.

FIG. 1G is an exemplary illustration of an SIMD factor reduction in the case of a common SIMD structure for both vertex and triangle processing. The nonlimiting example of FIG. 1G illustrates vertex and triangle stream processing on the same SIMD structure with factor 4, where four ALUs process the stream data. The vertex packet to be processed contains data for four vertices. The triangle packet to be processed contains data for 12 vertices, and the time overhead for accumulating a complete packet may significantly delay the start of triangle processing. This is why reducing the SIMD factor from 4 to 2 or 1 in the same structure with 4 ALUs for triangle processing tasks becomes an important issue in modern GPUs.

FIG. 2A is a flowchart illustrating steps that can be taken in an exemplary processing unit, similar to the flowchart from FIG. 1, with an SIMD factor 4. As indicated, FIG. 2A relates to vector stream data processing with scalar ALUs, having an SIMD factor of 4 and a long data format. Similar to the data flow of FIG. 1B, vector data is not constrained to flow as a data set. As each data component reaches the respective ALU (ALU0 204a, ALU1 204b, ALU2 204c, and ALU3 204d), that ALU can process the data according to an ALU command delivered synchronously with the delay of data delivery. Additionally, as illustrated, data is received at ALU0 204a prior to data being received at ALU1 204b. Similarly, ALU2 204c is delayed when compared to ALU1 204b. ALU3 204d is delayed when compared to ALU2 204c. After the data is processed, the processed data is sent to output buffers S1, S2, S3, and S4, with synchronization delay, respectively.

One should also note that the nonlimiting example illustrated in FIG. 2A is associated with an SIMD factor of 4 because there are four ALUs that perform substantially the same operation. Additionally, as the nonlimiting example of FIG. 2A illustrates, each ALU is configured to process long format 36 bit data.

FIG. 2B is a flowchart illustrating steps that can be taken in an exemplary processing unit, similar to the flowchart from FIG. 1, with an SIMD factor 1, which is a result of folding results of 4 ALUs to one ALU3. As indicated, FIG. 2B illustrates vector stream data processing with scalar ALUs and an SIMD factor of 1 in long format. While the configuration in FIG. 2A illustrates that vector data is sent to the ALUs in a manner that is not consistent with a vector elements data set, the configuration of FIG. 2B illustrates the vector data being communicated to the ALUs as a vector data set. More specifically, FIG. 2B illustrates that data X1 is sent to ALU0. ALU0 can process the data and send at least a portion of the result to ALU1, while also sending output data to component shuffle 226. ALU1, which is delayed from ALU0, receives data Y1 and data from ALU0. ALU1 then sends output data to component shuffle 226 and data to ALU2. ALU2 receives Z1 and data from ALU1. ALU2 then sends output data to component shuffle 226 and data to ALU3. ALU3 receives data W1 and data from ALU2. ALU3 sends output data to component shuffle 226. Component shuffle 226 can send data to one or more of the following outputs: Xout, Yout, Zout, and Wout. As a nonlimiting example, if such operation is a vector dot product, such mode may be desired to process data with a small number of streams, such as triangles versus vertex packets, in a fewer number of clock cycles.

One should note that the configuration of FIG. 2B is associated with an SIMD factor of 1 due to the fact that each of the ALUs are performing the same command with a different number of operands. More specifically, because each ALU receives data from the previous ALU, the ALUs are performing different operations depending on the position of the ALU. As a nonlimiting example, in the case of a dot product command, embodiments of the ALU will have the following functionality:

ALU0: D0=A0*B0+0, which implements X1*X2

ALU1: D1=A1*B1+D0, which implements Y1*Y2+X1*X2

ALU2: D2=A2*B2+D1, which implements Z1*Z2+Y1*Y2+X1*X2

ALU3: D3=A3*B3+D2, which implements W1*W2+Z1*Z2+Y1*Y2+X1*X2

The final result is available at the output of ALU3 and may be shuffled to any vector position for later use. Additionally, as in FIG. 2A, the configuration of FIG. 2B processes 36-bit (long format) data in each of the ALUs.
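The chained multiply-accumulate behavior listed for ALU0 through ALU3 can be sketched as follows. This is a nonlimiting software model: the function names `mac` and `dot_chained` and the sample operands are assumptions for illustration, and the sketch ignores timing delays, the component shuffle, and the 36-bit format.

```python
def mac(a, b, c):
    """One multiply-accumulate stage: D = A * B + C."""
    return a * b + c


def dot_chained(v1, v2):
    """Model the FIG. 2B folding: each ALU adds its product to the result
    forwarded by the previous ALU. Returns the partial results D0..D3."""
    partials = []
    accumulated = 0.0  # ALU0 adds zero, as in D0 = A0*B0 + 0
    for a, b in zip(v1, v2):  # ALU0, ALU1, ALU2, ALU3 in order
        accumulated = mac(a, b, accumulated)
        partials.append(accumulated)
    return partials


# Hypothetical (X, Y, Z, W) operands.
partials = dot_chained((1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0))
dot_product = partials[-1]  # the complete dot product appears at ALU3
```

Only the last partial result is the complete dot product, which is consistent with the four ALUs together producing a single result, hence an SIMD factor of 1.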

FIG. 2C is a flowchart illustrating steps that can be taken in an exemplary scalar processing unit, similar to the flowchart from FIG. 2A, with an SIMD factor 8. The scalar processing unit in this nonlimiting example includes the same number of ALUs as in FIG. 2A; however, in FIG. 2C, each ALU is split to process two streams of short format data (e.g., 18-bit components instead of 36-bit components). As indicated, FIG. 2C includes vector stream data processing with scalar ALUs that is associated with an SIMD factor of 8 in short format. This means that one can process 8 sets of input data and produce 8 results based on the same command sent to the ALUs with respective delays. More specifically, the vector data can take the form of 18-bit (short format) data as opposed to the 36-bit (long format) data discussed above. For example, the W1 vector component from previous nonlimiting examples now takes the form of two separate components W1.0 and W1.1, each of which is a short format component. Similarly, X, Y, and Z, as well as the other data sets 2, 3, and 4, are also represented in a short format. Additionally, as also illustrated in FIG. 2B, data input into the ALUs does not necessarily correlate to a vector element data set. More specifically, the ALUs are not constrained to process vector data sets, as the data input into each ALU need not be related.

Also included in this nonlimiting example are a plurality of divided or split ALUs that can be configured to process short data more efficiently. More specifically, data X1.0 is input into the left side of ALU0, which has been designated ALU0.0. The right side of ALU0, designated ALU0.1 receives data X1.1. The data sent to ALU0.0 and ALU0.1 is processed and sent to output buffers S1.0 and S1.1, respectively. Similarly, data X2.0 and X2.1 are sent to the left side of ALU1 (ALU1.0) and the right side of ALU1 (ALU1.1), respectively. As illustrated, there is a delay in the processing of data in ALU1.0 and ALU1.1, when compared with the processing of ALU0.0 and ALU0.1. Once the data is processed, the ALU1.0 and ALU1.1 send the output data to output buffers S2.0 and S2.1, respectively.

In similar fashion, ALU2.0 and ALU2.1 receive data X3.0 and X3.1, respectively. After processing the received data, ALU2.0 and ALU2.1 send the output data to output buffers S3.0 and S3.1, respectively. In addition, the processing of data in ALU2.0 and ALU2.1 is delayed from the processing of the previous ALUs discussed. As with the previous operations, ALU3.0 and ALU3.1 receive data X4.0 and X4.1, respectively. ALU3.0 and ALU3.1 process the received data (delayed from that of ALU2.0 and ALU2.1) and send the output data to output buffers S4.0 and S4.1, respectively.

Because all eight ALUs (which can physically take the form of four dual channel ALUs, each logically divided in half) are executing the same command, the SIMD factor of the nonlimiting example of FIG. 2C is 8. Additionally, the ALUs in FIG. 2C can be configured to receive and process 18-bit (short format) data, as well as 36-bit (long format) data.
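As a nonlimiting illustration of how one long format operand slot can carry two short format operands, the sketch below packs two 18-bit fields into a 36-bit word and splits them again. The bit layout is an assumption made only for this example: the disclosure states the 18-bit and 36-bit widths but this passage does not specify which half holds the ".0" component, and the names `pack_short_pair` and `split_long_word` are hypothetical.

```python
SHORT_BITS = 18                       # short format width from the text
LONG_BITS = 2 * SHORT_BITS            # long format width (36 bits)
SHORT_MASK = (1 << SHORT_BITS) - 1


def pack_short_pair(lane0, lane1):
    """Pack two 18-bit fields into one 36-bit word.
    Assumed layout: lane0 (the '.0' half) occupies the upper 18 bits."""
    return ((lane0 & SHORT_MASK) << SHORT_BITS) | (lane1 & SHORT_MASK)


def split_long_word(word36):
    """Split a 36-bit word into its two 18-bit lanes (same assumed layout)."""
    lane0 = (word36 >> SHORT_BITS) & SHORT_MASK
    lane1 = word36 & SHORT_MASK
    return lane0, lane1


# Four physical ALUs, each carrying two short lanes, give eight lanes in total.
words = [pack_short_pair(2 * k, 2 * k + 1) for k in range(4)]
lanes = [lane for word in words for lane in split_long_word(word)]
simd_factor = len(lanes)
```

Here the eight recovered lanes correspond to the eight logical ALUs (ALU0.0 through ALU3.1) executing the same command.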

FIG. 2D is a flowchart illustrating steps that can be taken in an exemplary processing unit, similar to the flowchart from FIG. 2A, with an SIMD factor 4. As indicated, FIG. 2D includes vector stream data processing with scalar ALUs that are associated with an SIMD factor 4 in short format. As illustrated, the data input into the ALUs is similar to that of FIG. 2C, which may or may not be organized according to a data set. Additionally, as in the previous nonlimiting example, data X1.0 is input into ALU0.0 and data X1.1 is input into ALU0.1. However, in this nonlimiting example, ALU0.1 is slightly delayed when compared with ALU0.0 and uses a result of ALU0.0. That is, ALU0.1 receives input data not only from X1.1, but also from the output of ALU0.0. Similarly, ALU1.0 receives data X2.0, processes the received data, and outputs the processed data to ALU1.1. ALU1.1 receives the output data from ALU1.0 and also receives data X2.1. ALU1.1 processes the received data and outputs the processed data to output buffer S2.1. ALU2.0 receives data X3.0, processes the received data, and outputs the result to ALU2.1. ALU2.1 receives the output data from ALU2.0 as well as the data X3.1. ALU2.1 processes the received data and outputs the result to output buffer S3.1. ALU3.0 receives input data X4.0. ALU3.0 processes the received data and outputs the processed data to ALU3.1. ALU3.1 receives the output from ALU3.0 as well as data X4.1. ALU3.1 processes the received data and sends the processed data to S4.1.

Embodiments of such ALUs are configured with the following functionality:

ALU0.0: d0.0=a0.0*b0.0+0

ALU0.1: d0.1=a0.1*b0.1+d0.0

ALU1.0: d1.0=a1.0*b1.0+0

ALU1.1: d1.1=a1.1*b1.1+d1.0

ALU2.0: d2.0=a2.0*b2.0+0

ALU2.1: d2.1=a2.1*b2.1+d2.0

ALU3.0: d3.0=a3.0*b3.0+0

ALU3.1: d3.1=a3.1*b3.1+d3.0

As there are eight ALUs processing data and only four are outputting a result, the logic of FIG. 2D is associated with a SIMD factor of four. Additionally, as ALU0.0 sends data to ALU0.1, ALU0.1 is associated with a slight delay in processing when compared with ALU0.0. ALU0.1 can wait for ALU0.0 to process the data X1.0 and then receive the output from ALU0.0. At this point, ALU0.1 can process the received output from ALU0.0 as well as data X1.1. A similar delay and process is also executed for the remaining ALUs.
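The folded arrangement described above can be sketched in Python. This is a minimal illustration and not the disclosed hardware: the function names `folded_mac_pair` and `folded_simd4`, the tuple layout of the operands, and the use of Python floats in place of short format data are all assumptions made for the example. Channel 0 of each dual channel ALU computes a product with a zero addend and forwards it to channel 1, so only channel 1 writes to an output buffer.

```python
from typing import List, Tuple

# Inputs of one dual channel ALU: ((a_n0, b_n0), (a_n1, b_n1)).
# Names and layout are hypothetical; floats stand in for short format data.
Operands = Tuple[Tuple[float, float], Tuple[float, float]]


def folded_mac_pair(ops: Operands) -> float:
    """Model one dual channel ALU in the folded mode of FIG. 2D."""
    (a0, b0), (a1, b1) = ops
    d0 = a0 * b0 + 0.0   # channel 0: dN.0 = aN.0 * bN.0 + 0
    d1 = a1 * b1 + d0    # channel 1: dN.1 = aN.1 * bN.1 + dN.0 (forwarded result)
    return d1            # only channel 1 reaches an output buffer


def folded_simd4(inputs: List[Operands]) -> List[float]:
    """Run four dual channel ALUs: eight multiplies produce four results."""
    return [folded_mac_pair(ops) for ops in inputs]


inputs = [
    ((1.0, 2.0), (3.0, 4.0)),   # ALU0: 1*2 + 3*4
    ((0.5, 2.0), (1.0, 1.0)),   # ALU1: 0.5*2 + 1*1
    ((2.0, 2.0), (2.0, 2.0)),   # ALU2: 2*2 + 2*2
    ((0.0, 9.0), (5.0, 1.0)),   # ALU3: 0*9 + 5*1
]
results = folded_simd4(inputs)
print(results)
```

The sketch ignores timing; in the described unit, channel 1 would wait for channel 0 before starting, which is the slight delay noted above.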

FIG. 3 is an exemplary logical structure of paired scalar ALUs with dual format processing capabilities, illustrating processing characteristics from FIGS. 1 and 2A-2D. More specifically, FIG. 3 includes embodiments of a stream processor configured to process data in any of a plurality of different formats. At least one embodiment includes a first scalar arithmetic logic unit (ALU) configured to process a first plurality of sets of short format floating point data in response to a received short format control signal from an instruction set, and to process a first set of long format floating point data in response to a received long format control signal from the instruction set. Additionally, some embodiments include a second ALU configured to process a second plurality of sets of short format floating point data in response to a received short format control signal from the instruction set, process a second set of long format floating point data in response to a received long format control signal from the instruction set, receive the processed data from the first ALU, and process the input data and the processed data from the first ALU according to a control signal from the instruction set. Some embodiments include a special function unit (SFU) configured to provide additional computational functionality to the first ALU and the second ALU. Further, in some embodiments, in response to receiving short format data, the stream processor is configured to functionally divide at least one pair of the ALUs to facilitate dual format processing with a variable Single Instruction Multiple Data (SIMD) factor for short formats and for long formats. In some embodiments, the instruction set includes at least one instruction to process in at least one of the following modes: a short format operand mode, a long format operand mode, and a mixed format operand mode. In some embodiments, the instruction set is configured to control a variable SIMD folding mode, in which the output data of the first ALU is sent as an operand to the second ALU in long format mode, and the output of one channel of the first ALU is sent as an operand to the second channel of the first ALU in short format mode.

More specifically, the two ALUs 310, 320 of FIG. 3 may be configured to operate in long and short data format with SIMD factor 2 and 4, respectively. The depicted structure illustrates data paths, which include sectional multipliers and adders combined with sectional Multiply Accumulate (MACC) registers capable of processing short and long data. In this nonlimiting example, data from an SFU is received at the accumulator registers of ALU0 and ALU1 (block 370). Coupled to the accumulator is a cache memory data in module 372, as well as an ALU port P0 376. The ALU port P0 can be configured to process 72 bits in four 18 bit segments. Coupled to the cache memory data in 372 is an ALU port P1 378. Similar to the ALU port P0 376, the ALU port P1 378 is also configured to process 72 bits of data in four 18 bit segments. Coupled to the ALU port P1 is an ALU port P2, configured to process 72 bits in four 18 bit segments.
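As a rough illustration of the port layout just described, a 72 bit port word can be viewed as four 18 bit segments, and two adjacent segments can be viewed as one 36 bit long format value. The sketch below is only a model: the function names, the numbering of segments from the least significant bits, and the pairing of segments into long values are assumptions, not details given in the source.

```python
from typing import List

SEGMENT_BITS = 18
SEGMENTS_PER_PORT = 4                      # 4 * 18 = 72 bits per ALU port
SEGMENT_MASK = (1 << SEGMENT_BITS) - 1
PORT_BITS = SEGMENT_BITS * SEGMENTS_PER_PORT


def split_port(word: int) -> List[int]:
    """Split a 72 bit port word into four 18 bit segments (segment 0 = lowest bits)."""
    if not 0 <= word < (1 << PORT_BITS):
        raise ValueError("port word must fit in 72 bits")
    return [(word >> (SEGMENT_BITS * i)) & SEGMENT_MASK
            for i in range(SEGMENTS_PER_PORT)]


def join_port(segments: List[int]) -> int:
    """Inverse of split_port: pack four 18 bit segments into one 72 bit word."""
    if len(segments) != SEGMENTS_PER_PORT:
        raise ValueError("expected four segments")
    word = 0
    for i, seg in enumerate(segments):
        if not 0 <= seg <= SEGMENT_MASK:
            raise ValueError("segment must fit in 18 bits")
        word |= seg << (SEGMENT_BITS * i)
    return word


def as_long_values(word: int) -> List[int]:
    """View the same port word as two 36 bit values (assumed pairing: 0+1, 2+3)."""
    s = split_port(word)
    return [s[0] | (s[1] << SEGMENT_BITS), s[2] | (s[3] << SEGMENT_BITS)]


word = join_port([0x3FFFF, 1, 2, 3])
print(split_port(word), as_long_values(word))
```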

Coupled to ALU port P0, ALU port P1, and ALU port P2 is ALU0 310, which includes an input multiplexor 382a and an input multiplexor 384a. The input multiplexor 382a includes output ports CH, A1H, B0L, A1L, and B1L, while the input multiplexor 384a includes output ports A0H, B0H, A0L, B1H and CL. The output CH is coupled to adder 396a while the outputs A1H and B0L are coupled to multiplier 386a. Multiplier 386a is also coupled to adder 396a. Outputs A1L and B1L are coupled to multiplier 388a, which is coupled to 13 bit shifter 371a, which is coupled to adder 396a.

From input multiplexor 384a, outputs A0H and B0H are coupled to multiplier 392a. Multiplier 392a is then coupled to adder 399a. Outputs A0L and B1H are coupled to multiplier 390a, which is coupled to 13 bit shifter 373a, which is then coupled to adder 399a. Output CL is coupled to adder 399a. Adders 396a and 399a are coupled together via 13-bit shifter and enable component 398a. Multiply accumulate units (MACC) 394a and 397a are also coupled to adders 396a and 399a, respectively. The outputs of adders 396a and 399a are coupled to low output DL and high output DH, respectively.

ALU port P0 376, ALU port P1 378 and ALU port P2 380 are also coupled to ALU1 320 via delay registers 383. Delay registers 383 are coupled to input multiplexors 382b and 384b. Input multiplexor 382b includes output CH, which is coupled to adder 396b. Outputs A1H and B0L are coupled to multiplier 386b, which is coupled to adder 396b. Outputs A1L and B1L are coupled to multiplier 388b, which is coupled to 13 bit shifter 371b, which is then coupled to adder 396b.

Outputs of input multiplexor 384b include A0H and B0H, which are coupled to multiplier 392b. Multiplier 392b is then coupled to adder 399b. Outputs A0L and B1H are coupled to multiplier 390b, which is coupled to 13 bit shifter 377b, which is then coupled to adder 399b. Output CL is coupled to adder 399b. Adders 396b and 399b are coupled via shifter and enable component 398b. Also coupled to adders 396b and 399b are MACC 394b and 397b. Adder 396b is coupled to low output DL, while adder 399b is coupled to high output DH. Also included in this nonlimiting example is a bypass component 395 outputting CL data component 393, which are coupled between ALU0 310 and ALU1 320, and facilitate a clock cycle delay in the operation of ALU1 320.

One should note that while the components of FIG. 3 are described, the nonlimiting example of FIG. 3 is intended to illustrate an exemplary logical structure of operations. More specifically, the structure depicted with respect to FIG. 3 illustrates principles of design of an ALU with a split data path and a variable SIMD factor.
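The four multipliers per ALU and the 13 bit shifters suggest that a wide product can be assembled from narrower partial products. The sketch below illustrates that general technique (schoolbook multiplication on two 13 bit sections per operand) and nothing more. The 13 bit section width is borrowed from the shifter width, but the 26 bit operand width, the function name, and the mapping of partial products to the multiplexor outputs of FIG. 3 are assumptions that the source does not confirm.

```python
SECTION_BITS = 13                          # assumed section width (from the shifter width)
SECTION_MASK = (1 << SECTION_BITS) - 1


def sectional_multiply(a: int, b: int) -> int:
    """Multiply two 26 bit unsigned values using four 13x13 partial products."""
    if not (0 <= a < (1 << 2 * SECTION_BITS) and 0 <= b < (1 << 2 * SECTION_BITS)):
        raise ValueError("operands must fit in 26 bits")
    a_hi, a_lo = a >> SECTION_BITS, a & SECTION_MASK
    b_hi, b_lo = b >> SECTION_BITS, b & SECTION_MASK

    # One partial product per (hypothetical) sectional multiplier.
    hh = a_hi * b_hi
    hl = a_hi * b_lo
    lh = a_lo * b_hi
    ll = a_lo * b_lo

    # Align the partial products with 13 bit shifts and sum them.
    return (hh << (2 * SECTION_BITS)) + ((hl + lh) << SECTION_BITS) + ll


a, b = 0x2ABCDEF, 0x1234567
print(sectional_multiply(a, b) == a * b)
```

When the sections are instead treated as independent short operands, the same multipliers could produce separate products without the combining shifts, which is one way to read the split data path described above.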

FIG. 4 is an exemplary stream processing unit with paired scalar ALUs, similar to the structure from FIG. 3. As illustrated, input data is communicated to cache memory unit 472, which includes L0, L1, S0, S1, S2, S3, etc. The cache memory unit 472 communicates stored data to memory out multiplexor 474, which is coupled to port P0 476, port P1 478 and port P2 480. Port P0 476, port P1 478, and port P2 480 are also coupled to input multiplexor and latch 482a, which are coupled to ALU0. ALU0, in this nonlimiting example, is configured to calculate D0 from A0*B0+C0, which is output to D0L.

Port P0 476, port P1 478, and port P2 480 are also coupled to delay register 483, which is coupled to input multiplexor 482b, which is associated with ALU1. ALU1, in this nonlimiting example, is configured to calculate D1 from A1*B1+C1+D0. The solution can be output to D1L. Also coupled to ALU1 is output port D0L from ALU0. As one of ordinary skill in the art will understand, this particular nonlimiting example includes a calculation in ALU1 of a value from ALU0. More specifically, ALU0 calculates a value of D0, which is then sent to delay register 386. From the delay register, D0 is sent to ALU1 for calculation of D1.
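The dependency between the two ALUs can be sketched as a two step computation with a one cycle delay register in between. This is a simplified model: the class `DelayRegister`, the function names, and the use of Python floats are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DelayRegister:
    """Holds a value written in one cycle so it can be read in the next (hypothetical model)."""
    value: Optional[float] = None

    def write(self, new_value: float) -> None:
        self.value = new_value

    def read(self) -> float:
        if self.value is None:
            raise RuntimeError("delay register read before any write")
        return self.value


def alu0(a0: float, b0: float, c0: float) -> float:
    """ALU0: D0 = A0*B0 + C0."""
    return a0 * b0 + c0


def alu1(a1: float, b1: float, c1: float, d0: float) -> float:
    """ALU1: D1 = A1*B1 + C1 + D0, where D0 arrives from ALU0."""
    return a1 * b1 + c1 + d0


def run_pair(a0: float, b0: float, c0: float,
             a1: float, b1: float, c1: float) -> Tuple[float, float]:
    d0_reg = DelayRegister()
    # Cycle n: ALU0 produces D0, which is captured by the delay register.
    d0 = alu0(a0, b0, c0)
    d0_reg.write(d0)
    # Cycle n+1: ALU1 reads the delayed D0 and produces D1.
    d1 = alu1(a1, b1, c1, d0_reg.read())
    return d0, d1


d0, d1 = run_pair(2.0, 3.0, 1.0, 4.0, 5.0, 0.5)
print(d0, d1)
```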

Also coupled to the outputs of both ALU0 and ALU1 is multiplexor 484, which is coupled to special function unit 470, shared between the two ALUs. The special function unit 470 is also coupled to the inputs of ALU0 and ALU1 via delay register 483. The outputs of ALU0 and ALU1 are also coupled to the input of the cache memory unit 472, as well as sent to other units.

Also included in the nonlimiting example of FIG. 4 is a SIMD microcoded controller 488, which can be configured to determine and communicate the desired operation control signal to the ALU0 and ALU1. Coupled to the SIMD microcoded controller 488 is a control and address for ALU component 490. Delay register 483 can be coupled between control and address for ALU component 490 and ALU1.

One should note that whereas FIG. 3 is directed to an embodiment where short format data is being processed, FIG. 4 is directed to an embodiment where long format data is being processed. More specifically, while embodiments of the present disclosure include the ability to process short data, long data, mixed data, etc., various nonlimiting examples described herein can include processing any permutation of data.

FIG. 5A is a table illustrating exemplary arithmetic functionality of paired scalar ALUs, such as the ALUs illustrated in FIGS. 3 and 4. This table describes all possible operations of a pair of ALUs (ALU0 and ALU1). Those operations can be executed with short 18-bit, long 36-bit, and mixed 18-36 bit floating point data. All operations are divided into three broad groups: regular, blend, and cross operations. In each group there are normal operations and quad/double type operations for 18/36 bit data. Quad/double type operations use data forwarding between sections of the same ALU or between ALU0 and ALU1. At the top of the table are columns that have exactly the same names as the inputs of ALU0 and ALU1 in FIG. 3, as well as the data path control signals on the same diagram.

Each operation is described by two rows. The first row shows the input data from ALU ports P0, P1, and P2 (particular elements P0.0, P0.1, etc.) to be sent to ALU inputs (a, b, c), along with the status of a few data path control signals; the second row contains the formula that describes the result sent to outputs dl and dh. The last column contains information about the SIMD factor in this particular operation for the pair of ALUs. This pair of ALUs may be replicated several times to increase the overall SIMD factor. The right side of the table contains comments with the abbreviated name of the operation, the arithmetic function of the ALU hardware using multiplication sign “*” and addition sign “+”, and the involvement of the MAC register in the particular operation. The detailed instruction set description below may illustrate the complete functionality of the proposed stream processor.

FIG. 5B includes a GPU where a SIMD stream processor is used as the computational core. This nonlimiting example contains 4 stream processors, and each of the processors contains 4 pairs of ALUs and 2 SFUs. Embodiments of the stream processor are configured to process different types of data (both geometry and pixel/texel), providing a variable SIMD factor for those types of data by using different commands from its instruction set.
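The relationship between the number of ALUs, the data format, and the resulting SIMD factor can be summarized in a small helper. This is a sketch under stated assumptions: the function name and parameters are hypothetical, and the model (two lanes per physical ALU in short format, one lane in long format, halved when results are folded) is inferred from the examples of FIG. 2C, FIG. 2D, and FIG. 3 rather than stated as a rule in the source.

```python
def simd_factor(num_physical_alus: int, fmt: str, folded: bool = False) -> int:
    """Estimate the SIMD factor for a group of dual channel ALUs (illustrative model only)."""
    lanes_per_alu = {"short": 2, "long": 1}   # assumed: a short format ALU splits into two channels
    if fmt not in lanes_per_alu:
        raise ValueError("fmt must be 'short' or 'long'")
    lanes = num_physical_alus * lanes_per_alu[fmt]
    # Folding forwards one result into another MAC, so only half the lanes produce output.
    return lanes // 2 if folded else lanes


print(simd_factor(4, "short"))               # four dual channel ALUs, as in FIG. 2C
print(simd_factor(4, "short", folded=True))  # same ALUs with forwarding, as in FIG. 2D
print(simd_factor(2, "long"))                # one ALU pair in long format, as in FIG. 3
print(simd_factor(2, "short"))               # one ALU pair in short format, as in FIG. 3
```

Under the same assumptions, a stream processor with 4 pairs of ALUs would correspond to `simd_factor(8, "short")` or `simd_factor(8, "long")`; the source does not state those totals, so they should be read as outputs of this model only.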

Stream processor instructions may have a length from 3 to 9 bytes, depending on instruction types and address modes. Instructions contain the following parts: (1) Main body (general instructions and flow control instructions); (2) Instruction prefixes, which may forward results of general instructions to the SFU or repeat execution of a general instruction; and (3) Instruction modifiers, which may scale operands, set flags, and control write back of the result. Instruction encoding principles are listed below:

TABLE 1

Format | 1st byte of instruction | 2nd byte of instruction | 3rd byte of instruction | Address bytes
General instruction format | Opcode | Operand address | Operand address | Operand addresses
Instruction prefix (special function unit) | Prefix opcode | None | None | None
Instruction prefix (instruction repeat control) | Repeat opcode | Immediate value | None | None
Instruction modifier prefixes

Industry Class:
Electrical computers and digital processing systems: processing architectures and instruction processing (e.g., processors)
