BACKGROUND OF THE INVENTION
1. Technical Field
This disclosure relates to a data processing system in which computations are efficiently offloaded from a system central processing unit (CPU) to a system graphics processing unit (GPU).
2. Related Art
Performance is a key challenge in building large-scale applications because predicting the behavior of such applications is inherently difficult. Weaving security solutions into the fabric of the architectures of these applications almost always worsens the performance of the resulting systems. The performance degradation can be more than 90% when all application data is protected, and may be even worse when other security mechanisms are applied.
To be effective, cryptographic algorithms are necessarily computationally intensive and must be integral parts of data protection protocols. The cost of using cryptographic algorithms is significant because their execution consumes many CPU cycles, which degrades application performance. For example, cryptographic operations in the Secure Sockets Layer (SSL) protocol slow file downloads from servers by a factor of about 10 to about 100. The SSL operations also penalize web server performance by a factor of about 3.4 to as much as nine. Generally, whenever a data message crosses a security boundary, the message is encrypted and later decrypted. These operations give rise to the performance penalty.
One prior attempt at alleviating the cost of using cryptographic protocols included adding separate specialized hardware to provide support for security. The extra dedicated hardware allowed applications to use more CPU cycles. However, dedicated hardware is expensive and using it requires extensive changes to the existing systems. In addition, using external hardware devices for cryptographic functions adds marshalling and unmarshalling overhead (caused by packaging and unpackaging data) as well as device latency.
Another prior attempt at alleviating the cost of using cryptographic protocols was to add CPUs to handle cryptographic operations. However, the additional CPUs are better utilized for the core computational logic of applications in order to improve their response times and availability. In addition, most computers have limitations on the number of CPUs that can be installed on their motherboards. Furthermore, CPUs tend to be expensive resources that are designed for general-purpose computations rather than specific application to cryptographic computations. This may result in underutilization of the CPUs and an unfavorable cost-benefit outcome.
Therefore, a need exists to address the problems noted above and others previously experienced.
A system for securing multithreaded server applications improves the availability of a CPU for executing core applications. The system improves the performance of multithreaded server applications by providing offloading, batching, and scheduling mechanisms for efficiently executing processing tasks needed by the applications on a GPU. As a result, the system helps to reduce the overhead associated with cooperative processing between the CPU and the GPU, so that the CPU may instead spend more cycles executing the application logic.
Other systems, methods, features and advantages will be, or will become, apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the following claims.
BRIEF DESCRIPTION OF THE DRAWINGS
The system may be better understood with reference to the following drawings and description. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. Moreover, in the figures, like referenced numerals designate corresponding parts throughout the different views.
FIG. 1 shows a system for supervisory control of encryption and decryption operations in a multithreaded application execution environment in which messages are batched for submission to a GPU.
FIG. 2 shows a system for supervisory control of encryption and decryption operations in a multithreaded application execution environment in which processed message components from a processed message received from a GPU are delivered to threads of an application.
FIG. 3 shows a flow diagram of the processing that encryption supervisory logic may implement to batch messages for submission to a GPU.
FIG. 4 shows a flow diagram of the processing that encryption supervisory logic may implement to return messages processed by a GPU to threads of an application.
FIG. 5 shows a flow diagram of the processing that encryption supervisory tuning logic may implement.
FIG. 6 shows experimental results of the batching mechanism implemented by the encryption supervisory logic in the system.
FIG. 7 shows an example of simulation results of mean waiting time against maximum composite message capacity.
FIG. 1 shows a system 100 for supervisory control of encryption and decryption operations in a multithreaded application execution environment. The system 100 includes a central processing unit (CPU) 102, a memory 104, and a graphics processing unit (GPU) 106. The GPU 106 may be a graphics processor available from NVIDIA of Santa Clara, Calif. or ATI Research, Inc. of Marlborough, Mass., as examples. The GPU 106 may communicate with the CPU 102 and memory 104 over a bus 108, such as the peripheral component interconnect (PCI) bus, the PCI Express bus, Accelerated Graphics Port (AGP) bus, Industry Standard Architecture (ISA) bus, or other bus. As will be described in more detail below, the CPU 102 executes applications from the system memory 104. The applications may be multi-threaded applications.
One distinction between the CPU 102 and the GPU 106 is that the CPU 102 typically follows a Single Instruction Single Data (SISD) model and the GPU 106 typically follows a Single Instruction Multiple Data (SIMD) stream model. Under the SISD model, the CPU 102 executes one (or at most a few) instructions at a time on one (or at most a few) data elements loaded into the memory prior to executing the instruction. In contrast, a SIMD processor includes many processing units (e.g., 16 to 32 pixel shaders) that simultaneously execute instructions from a single instruction stream on multiple data streams, one per processing unit. In other words, one distinguishing feature of the GPU 106 over the CPU 102 is that the GPU 106 implements a higher level of processing parallelism than the CPU 102. The GPU 106 also includes special memory sections, such as texture memory, frame buffers, and write-only texture memory, used in the processing of graphics operations.
The memory holds applications executed by the CPU 102, such as the invoicing application 110 and the account balance application 112. Each application may launch multiple threads of execution. As shown in FIG. 1, the invoicing application has launched threads 1 through ‘n’, labeled 114 through 116. Each thread may handle any desired piece of program logic for the invoicing application 110.
Each thread, such as the thread 114, is associated with a thread identifier (ID) 118. The thread ID may be assigned by the operating system when the thread is launched, by other supervisory mechanisms in place on the system 100, or in other manners. The thread ID may uniquely specify the thread so that it may be distinguished from other threads executing in the system 100.
The threads perform the processing for which they were designed. The processing may include application programming interface (API) calls 120 to support the processing. For example, the API calls 120 may implement encryption services (e.g., encryption or decryption) on a message passed to the API call by the thread. However, while the discussion below proceeds with reference to encryption services, the API calls may request any other processing logic (e.g., authentication or authorization, compression, transcoding, or other logic) and are not limited to encryption services. Similarly, the supervisory logic 154 may in general handle offloading, scheduling, and batching for any desired processing, and is not limited to encryption services.
The GPU 106 includes a read-only texture memory 136, multiple parallel pixel shaders 138, and a frame buffer 140. The texture memory 136 stores a composite message 142, described in more detail below. Multiple parallel pixel shaders 138 process the composite message 142 in response to execution calls (e.g., GPU draw calls) from the CPU 102. The multiple parallel pixel shaders 138 execute an encryption algorithm 144 that may provide encryption or decryption functionality applied to the composite message 142, as explained in more detail below. The GPU 106 also includes a write-only texture memory 146. The GPU 106 may write processing results to the write-only texture memory 146 for retrieval by the CPU 102. The CPU 102 returns results obtained by the GPU 106 to the individual threads that gave rise to components of the composite message 142. Other data exchange mechanisms may be employed to exchange data with the GPU rather than or in addition to the texture memory 136 and the write-only texture memory 146.
The programming functionality of the pixel shaders 138 may follow that expected by the API call 120. The pixel shaders 138 may highly parallelize the functionality. However, as noted above, the pixel shaders 138 are not limited to implementing encryption services.
Each thread, when it makes the API call 120, may provide a source message component upon which the API call is expected to act. FIG. 1 shows a source message component 148 provided by thread 114, and a source message component ‘n’ 150 provided by thread ‘n’ 116, where ‘n’ is an integer. For example, the source message component may be customer invoice data to be encrypted before being sent to another system. Thus, the system 100 may be used in connection with a defense-in-depth strategy through which, for example, messages are encrypted and decrypted at each communication boundary between programs and/or systems.
The system 100 intercepts the API calls 120 to provide more efficient processing of the potentially many API calls made by the potentially many threads of execution for an application. To that end, the system 100 may implement an API call wrapper 152 in the memory. The API call wrapper 152 receives the API call, and substitutes the encryption supervisory logic 154 for the usual API call logic. In other words, rather than the API call 120 resulting in a normal call to the API call logic, the system 100 is configured to intercept the API call 120 through the API call wrapper 152 and substitute different functionality.
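The interception described above can be illustrated with a minimal Python sketch. Every name in it (`library_encrypt`, `SupervisoryLogic`, `make_api_wrapper`) is hypothetical and does not come from this disclosure, and the XOR routine is only a stand-in for the usual API call logic, not a real cipher. The sketch shows only the substitution step: threads keep calling the same function name, but the call is routed to supervisory logic.

```python
def library_encrypt(message: bytes) -> bytes:
    """Stand-in for the usual API call logic (toy XOR, NOT real encryption)."""
    return bytes(b ^ 0x5A for b in message)


class SupervisoryLogic:
    """Hypothetical stand-in for the logic that the wrapper substitutes."""

    def __init__(self):
        self.received = []  # source message components seen so far

    def handle(self, message: bytes) -> bytes:
        # A fuller implementation would batch and offload the request.
        # This sketch only records it and falls back to the toy routine.
        self.received.append(message)
        return library_encrypt(message)


def make_api_wrapper(supervisor: SupervisoryLogic):
    """Return a callable with the same signature as the wrapped API call."""

    def wrapper(message: bytes) -> bytes:
        return supervisor.handle(message)

    return wrapper


supervisor = SupervisoryLogic()
# Threads call encrypt() exactly as before; the wrapper redirects the call.
encrypt = make_api_wrapper(supervisor)
```

Because the wrapper keeps the original signature, the calling threads would not need to change in this sketch; only the binding of the name `encrypt` differs.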
Continuing the example regarding encryption services, the API call wrapper 152 substitutes encryption supervisory logic 154 for the normal API call logic. The memory 104 may also store encryption supervisory parameters 156 that govern the operation of the encryption supervisory logic 154. Furthermore, as discussed below, the system 100 may also execute encryption supervisory tuning logic 158 to adjust or optimize the encryption supervisory parameters 156.
To support encryption and decryption of source message components that the threads provide, the encryption supervisory logic 154 may batch requests into a composite message 142. Thus, for example, the encryption supervisory logic 154 may maintain a composite message that collects source message components from threads requesting encryption, and a composite message that collects source message components from threads requesting decryption. Separate encryption supervisory parameters may govern the batching of source message components into any number of composite messages. After receiving each source message component, the encryption supervisory logic 154 may put each thread to sleep by calling an operating system function to sleep a thread according to a thread ID specified by the encryption supervisory logic 154. One benefit of sleeping each thread is that other active threads may use the CPU cycles freed because the CPU is no longer executing the thread that is put to sleep. Accordingly, the CPU stays busy executing application logic.
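As a rough illustration of batching requests while the requesting threads sleep, the following Python sketch uses the standard `threading` module. It is a sketch under stated assumptions, not the disclosed implementation: the class and method names are hypothetical, each thread blocks itself on a `threading.Event` rather than being put to sleep through an operating system call keyed by thread ID, and `process_batch` is a placeholder for whatever processes the batch (for example, a GPU submission). When to flush is left to the caller here.

```python
import threading


class BatchingSupervisor:
    """Collects source message components and blocks callers until flushed."""

    def __init__(self, process_batch):
        self._lock = threading.Lock()
        self._pending = []  # entries: (thread_id, component, slot)
        self._process_batch = process_batch  # placeholder batch processor

    def submit(self, component: bytes) -> bytes:
        """Called by an application thread; returns the processed component."""
        slot = {"done": threading.Event(), "result": None}
        with self._lock:
            self._pending.append((threading.get_ident(), component, slot))
        slot["done"].wait()  # the calling thread sleeps, freeing CPU cycles
        return slot["result"]

    def pending_count(self) -> int:
        with self._lock:
            return len(self._pending)

    def flush(self) -> int:
        """Process everything batched so far and wake the waiting threads."""
        with self._lock:
            batch, self._pending = self._pending, []
        results = self._process_batch([comp for _, comp, _ in batch])
        for (_, _, slot), result in zip(batch, results):
            slot["result"] = result
            slot["done"].set()  # wake the thread that owns this component
        return len(batch)
```

In this sketch a separate thread (or the caller that fills the batch) would invoke `flush()` once enough components have accumulated; separate instances could be kept for encryption and decryption requests.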
In the example shown in FIG. 1, the composite message 142 holds source message components from threads that have requested encryption of particular messages. More specifically, the encryption supervisory logic 154 obtains the source message components 148, 150 from the threads 114, 116 and creates a composite message section based on each source message component 148, 150. In one implementation, the encryption supervisory logic 154 creates the composite message section as a three-field frame that includes a thread ID, a message length for the source message component (or the composite message section that includes the source message component), and the source message component. The encryption supervisory logic 154 then batches each composite message section into the composite message 142 (within the limits noted below) by adding each composite message section to the composite message 142.
FIG. 1 shows that the composite message 142 includes ‘n’ composite message sections labeled 162, 164, 166. Each composite message section includes a thread ID, message length, and a source message component. For example, the composite message section 162 includes a thread ID 168 (which may correspond to the thread ID 118), message length 170, and a source message component 172 (which may correspond to the source message component 148).
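The three-field frame can be sketched in Python with the standard `struct` module. The field widths (an 8-byte thread ID and a 4-byte length, little-endian) and the function names are assumptions made for illustration; the disclosure does not specify them. Here the length field holds the length of the source message component, which is one of the two options mentioned above.

```python
import struct

# Assumed header layout: 8-byte unsigned thread ID, 4-byte unsigned length.
HEADER = struct.Struct("<QI")


def make_section(thread_id: int, component: bytes) -> bytes:
    """Build one composite message section: thread ID, length, component."""
    return HEADER.pack(thread_id, len(component)) + component


def build_composite(items) -> bytes:
    """Concatenate sections for (thread_id, component) pairs into one message."""
    return b"".join(make_section(tid, comp) for tid, comp in items)


def split_composite(composite: bytes):
    """Recover the (thread_id, component) pairs from a composite message."""
    sections, offset = [], 0
    while offset < len(composite):
        thread_id, length = HEADER.unpack_from(composite, offset)
        offset += HEADER.size
        sections.append((thread_id, composite[offset:offset + length]))
        offset += length
    return sections
```

Carrying the thread ID and length in each section is what would let the results be split apart again and routed back to the thread that supplied each component.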
The CPU 102 submits the composite message 142 to the GPU 106 for processing. In that regard, the CPU 102 may write the composite message 142 to the texture memory 136. The CPU 102 may also initiate GPU 106 processing of the composite message by issuing, for example, a draw call to the GPU 106.
The batching mechanism implemented by the system 100 may significantly improve processing performance. One reason is that the system 100 reduces the data transfer overhead of sending multiple small messages to the GPU 106 and retrieving multiple small processed results from the GPU 106. The system 100 helps improve efficiency by batching composite message components into the larger composite message 142 and reading back a larger processed message from the write-only texture memory 146. The result is more efficient data transfer to and from the GPU 106. Another reason for the improvement is that fewer draw calls are made to the GPU 106. The draw call time and resource overhead is therefore significantly reduced.
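One simple way to see why batching helps is an amortization model in which each GPU submission pays a fixed overhead (transfer setup plus draw call) and a per-byte processing cost. The Python sketch below is an illustrative assumption only: the model, the function names, and the two constants are hypothetical and are not measurements or parameters from this disclosure.

```python
def cost_per_byte(message_bytes: int, fixed_overhead_s: float,
                  per_byte_s: float) -> float:
    """Per-byte cost when one submission carries message_bytes bytes."""
    return fixed_overhead_s / message_bytes + per_byte_s


# Hypothetical constants chosen only to show the trend, not measured values.
FIXED_OVERHEAD_S = 1e-3  # assumed fixed cost per submission, in seconds
PER_BYTE_S = 1e-9        # assumed processing cost per byte, in seconds


def sweep(exponents):
    """Return (log2 size, cost per byte) pairs for message sizes 2**e bytes."""
    return [(e, cost_per_byte(2 ** e, FIXED_OVERHEAD_S, PER_BYTE_S))
            for e in exponents]
```

Under this model the fixed overhead is spread over more bytes as the composite message grows, so the per-byte cost falls toward the per-byte processing cost alone.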
Turning briefly to FIG. 6, experimental results 600 of the batching mechanism implemented by the encryption supervisory logic 154 are shown. The experimental results 600 show a marked decrease in the cost of processing per byte as the composite message size increases. Table 1 provides the experimental data points. For example, at a log base 2 message size of 16, a 57 times increase in efficiency is obtained over a log base 2 message size of 10.
Cost per byte in seconds of processing time