BACKGROUND OF THE INVENTION
1. Technical Field
This disclosure relates to a data processing system in which computations are efficiently offloaded from a system central processing unit (CPU) to a system graphics processing unit (GPU).
2. Related Art
Performance is a key challenge in building large-scale applications because predicting the behavior of such applications is inherently difficult. Weaving security solutions into the fabric of the architectures of these applications almost always worsens the performance of the resulting systems. The performance degradation can be more than 90% when all application data is protected, and may be even worse when other security mechanisms are applied.
In order to be effective, cryptographic algorithms are necessarily computationally intensive and must be integral parts of data protection protocols. The cost of using cryptographic algorithms is significant because their execution consumes many CPU cycles, which negatively affects the performance of applications. For example, cryptographic operations in the Secure Sockets Layer (SSL) protocol slow file downloads from servers by a factor of about 10 to about 100. The SSL operations also penalize performance for web servers anywhere from a factor of about 3.4 to as much as a factor of nine. Generally, whenever a data message crosses a security boundary, the message is encrypted and later decrypted. These operations give rise to the performance penalty.
One prior attempt at alleviating the cost of using cryptographic protocols included adding separate specialized hardware to provide support for security. The extra dedicated hardware allowed applications to use more CPU cycles. However, dedicated hardware is expensive and using it requires extensive changes to the existing systems. In addition, using external hardware devices for cryptographic functions adds marshalling and unmarshalling overhead (caused by packaging and unpackaging data) as well as device latency.
Another prior attempt at alleviating the cost of using cryptographic protocols was to add CPUs to handle cryptographic operations. However, the additional CPUs are better utilized for the core computational logic of applications in order to improve their response times and availability. In addition, most computers have limitations on the number of CPUs that can be installed on their motherboards. Furthermore, CPUs tend to be expensive resources that are designed for general-purpose computations rather than specific application to cryptographic computations. This may result in underutilization of the CPUs and an unfavorable cost-benefit outcome.
Therefore, a need exists to address the problems noted above and others previously experienced.
SUMMARY
A system for securing multithreaded server applications improves the availability of a CPU for executing core applications. The system improves the performance of multithreaded server applications by providing offloading, batching, and scheduling mechanisms for efficiently executing processing tasks needed by the applications on a GPU. The system thereby helps to reduce the overhead associated with cooperative processing between the CPU and the GPU, so that the CPU may instead spend more cycles executing the application logic.
Other systems, methods, features and advantages will be, or will become, apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the following claims.
BRIEF DESCRIPTION OF THE DRAWINGS
The system may be better understood with reference to the following drawings and description. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. Moreover, in the figures, like referenced numerals designate corresponding parts throughout the different views.
FIG. 1 shows a system for supervisory control of encryption and decryption operations in a multithreaded application execution environment in which messages are batched for submission to a GPU.
FIG. 2 shows a system for supervisory control of encryption and decryption operations in a multithreaded application execution environment in which processed message components from a processed message received from a GPU are delivered to threads of an application.
FIG. 3 shows a flow diagram of the processing that encryption supervisory logic may implement to batch messages for submission to a GPU.
FIG. 4 shows a flow diagram of the processing that encryption supervisory logic may implement to return messages processed by a GPU to threads of an application.
FIG. 5 shows a flow diagram of the processing that encryption supervisory tuning logic may implement.
FIG. 6 shows experimental results of the batching mechanism implemented by the encryption supervisory logic in the system.
FIG. 7 shows an example of simulation results of mean waiting time against maximum composite message capacity.
DETAILED DESCRIPTION
FIG. 1 shows a system 100 for supervisory control of encryption and decryption operations in a multithreaded application execution environment. The system 100 includes a central processing unit (CPU) 102, a memory 104, and a graphics processing unit (GPU) 106. The GPU 106 may be a graphics processor available from NVIDIA of Santa Clara, Calif. or ATI Research, Inc. of Marlborough, Mass., as examples. The GPU 106 may communicate with the CPU 102 and memory 104 over a bus 108, such as the peripheral component interconnect (PCI) bus, the PCI Express bus, Accelerated Graphics Port (AGP) bus, Industry Standard Architecture (ISA) bus, or other bus. As will be described in more detail below, the CPU 102 executes applications from the system memory 104. The applications may be multi-threaded applications.
One distinction between the CPU 102 and the GPU 106 is that the CPU 102 typically follows a Single Instruction Single Data (SISD) model and the GPU 106 typically follows a Single Instruction Multiple Data (SIMD) stream model. Under the SISD model, the CPU 102 executes one (or at most a few) instructions at a time on a single (or at most a few) data elements loaded into the memory prior to executing the instruction. In contrast, a SIMD processor includes many processing units (e.g., 16 to 32 pixel shaders) that simultaneously execute instructions from a single instruction stream on multiple data streams, one per processing unit. In other words, one distinguishing feature of the GPU 106 over the CPU 102 is that the GPU 106 implements a higher level of processing parallelism than the CPU. The GPU 106 also includes special memory sections, such as texture memory, frame buffers, and write-only texture memory used in the processing of graphics operations.
The memory 104 holds applications executed by the CPU 102, such as the invoicing application 110 and the account balance application 112. Each application may launch multiple threads of execution. As shown in FIG. 1, the invoicing application 110 has launched threads 1 through ‘n’, labeled 114 through 116. Each thread may handle any desired piece of program logic for the invoicing application 110.
Each thread, such as the thread 114, is associated with a thread identifier (ID) 118. The thread ID may be assigned by the operating system when the thread is launched, by other supervisory mechanisms in place on the system 100, or in other manners. The thread ID may uniquely specify the thread so that it may be distinguished from other threads executing in the system 100.
The threads perform the processing for which they were designed. The processing may include application programming interface (API) calls 120 to support the processing. For example, the API calls 120 may implement encryption services (e.g., encryption or decryption) on a message passed to the API call by the thread. However, while the discussion below proceeds with reference to encryption services, the API calls may request any other processing logic (e.g., authentication or authorization, compression, transcoding, or other logic) and are not limited to encryption services. Similarly, the supervisory logic 154 may in general handle offloading, scheduling, and batching for any desired processing, and is not limited to encryption services.
The GPU 106 includes a read-only texture memory 136, multiple parallel pixel shaders 138, and a frame buffer 140. The texture memory 136 stores a composite message 142, described in more detail below. Multiple parallel pixel shaders 138 process the composite message 142 in response to execution calls (e.g., GPU draw calls) from the CPU 102. The multiple parallel pixel shaders 138 execute an encryption algorithm 144 that may provide encryption or decryption functionality applied to the composite message 142, as explained in more detail below. The GPU 106 also includes a write-only texture memory 146. The GPU 106 may write processing results to the write-only texture memory 146 for retrieval by the CPU 102. The CPU 102 returns results obtained by the GPU 106 to the individual threads that gave rise to components of the composite message 142. Other data exchange mechanisms may be employed to exchange data with the GPU rather than or in addition to the texture memory 136 and the write-only texture memory 146.
The programming functionality of the pixel shaders 138 may follow that expected by the API call 120. The pixel shaders 138 may highly parallelize the functionality. However, as noted above, the pixel shaders 138 are not limited to implementing encryption services.
Each thread, when it makes the API call 120, may provide a source message component upon which the API call is expected to act. FIG. 1 shows a source message component 148 provided by thread 114, and a source message component ‘n’ 150 provided by thread ‘n’ 116, where ‘n’ is an integer. For example, the source message component may be customer invoice data to be encrypted before being sent to another system. Thus, the system 100 may be used in connection with a defense-in-depth strategy through which, for example, messages are encrypted and decrypted at each communication boundary between programs and/or systems.
The system 100 intercepts the API calls 120 to provide more efficient processing of the potentially many API calls made by the potentially many threads of execution for an application. To that end, the system 100 may implement an API call wrapper 152 in the memory. The API call wrapper 152 receives the API call, and substitutes the encryption supervisory logic 154 for the usual API call logic. In other words, rather than the API call 120 resulting in a normal call to the API call logic, the system 100 is configured to intercept the API call 120 through the API call wrapper 152 and substitute different functionality.
Continuing the example regarding encryption services, the API call wrapper 152 substitutes encryption supervisory logic 154 for the normal API call logic. The memory 104 may also store encryption supervisory parameters 156 that govern the operation of the encryption supervisory logic 154. Furthermore, as discussed below, the system 100 may also execute encryption supervisory tuning logic 158 to adjust or optimize the encryption supervisory parameters 156.
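The interception described above may be illustrated, as a minimal sketch only, by a wrapper that redirects a call to substitute logic. The names encrypt_api, supervisory_logic, api_call_wrapper, and intercepted are hypothetical and do not appear in the figures; the byte-wise XOR is a placeholder for the normal API call logic and is not a real cipher.

```python
import functools

intercepted = []  # hypothetical record of the messages the wrapper intercepted


def encrypt_api(message: bytes) -> bytes:
    """Placeholder for the normal API call logic (XOR is NOT real encryption)."""
    return bytes(b ^ 0x5A for b in message)


def supervisory_logic(message: bytes) -> bytes:
    """Placeholder for the substituted supervisory logic.

    A fuller implementation would batch the message for the GPU; this sketch
    only records the interception and falls back to the placeholder logic.
    """
    intercepted.append(message)
    return encrypt_api(message)


def api_call_wrapper(original):
    """Return a callable that routes calls meant for `original` elsewhere."""
    @functools.wraps(original)
    def wrapper(message: bytes) -> bytes:
        return supervisory_logic(message)
    return wrapper


# Threads keep calling `encrypt`; the wrapper decides what actually runs.
encrypt = api_call_wrapper(encrypt_api)
```

In this sketch the calling thread is unaware of the substitution, which mirrors the role of the API call wrapper 152.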
To support encryption and decryption of source message components that the threads provide, the encryption supervisory logic 154 may batch requests into a composite message 142. Thus, for example, the encryption supervisory logic 154 may maintain a composite message that collects source message components from threads requesting encryption, and a composite message that collects source message components from threads requesting decryption. Separate encryption supervisory parameters may govern the batching of source message components into any number of composite messages. After receiving each source message component, the encryption supervisory logic 154 may put each thread to sleep by calling an operating system function to sleep a thread according to a thread ID specified by the encryption supervisory logic 154. One benefit of sleeping each thread is that other active threads may use the CPU cycles freed because the CPU is no longer executing the thread that is put to sleep. Accordingly, the CPU stays busy executing application logic.
In the example shown in FIG. 1, the composite message 142 holds source message components from threads that have requested encryption of particular messages. More specifically, the encryption supervisory logic 154 obtains the source message components 148, 150 from the threads 114, 116 and creates a composite message section based on each source message component 148, 150. In one implementation, the encryption supervisory logic 154 creates the composite message section as a three field frame that includes a thread ID, a message length for the source message component (or the composite message section that includes the source message component), and the source message component. The encryption supervisory logic 154 then batches each composite message section into the composite message 142 (within the limits noted below) by adding each composite message section to the composite message 142.
FIG. 1 shows that the composite message 142 includes ‘n’ composite message sections labeled 162, 164, 166. Each composite message section includes a thread ID, message length, and a source message component. For example, the composite message section 162 includes a thread ID 168 (which may correspond to the thread ID 118), message length 170, and a source message component 172 (which may correspond to the source message component 148).
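As one possible illustration of the three field frame, the following sketch packs a thread ID, a message length, and a source message component into a composite message section, and concatenates the sections into a composite message. The field widths (two 32-bit little-endian unsigned integers) and the function names are assumptions made for the example; the description above does not fix a byte layout.

```python
import struct

# Assumed layout: thread ID and message length as 32-bit little-endian integers.
HEADER = struct.Struct("<II")


def make_section(thread_id: int, component: bytes) -> bytes:
    """Build one composite message section: thread ID, length, component."""
    return HEADER.pack(thread_id, len(component)) + component


def build_composite(components) -> bytes:
    """Concatenate sections for an iterable of (thread_id, component) pairs."""
    return b"".join(make_section(tid, comp) for tid, comp in components)


# Two source message components, tagged with illustrative thread IDs.
composite = build_composite([(114, b"invoice-1"), (116, b"invoice-n")])
```

Here the message length field carries the length of the source message component alone; an implementation could equally store the length of the whole section.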
The CPU 102 submits the composite message 142 to the GPU 106 for processing. In that regard, the CPU 102 may write the composite message 142 to the texture memory 136. The CPU 102 may also initiate GPU 106 processing of the composite message by issuing, for example, a draw call to the GPU 106.
The batching mechanism implemented by the system 100 may significantly improve processing performance. One reason is that the system 100 reduces the data transfer overhead of sending multiple small messages to the GPU 106 and retrieving multiple small processed results from the GPU 106. The system 100 helps improve efficiency by batching source message components into the larger composite message 142 and reading back a larger processed message from the write-only texture memory 146. Data transfer to and from the GPU 106 is therefore more efficient. Another reason for the improvement is that fewer draw calls are made to the GPU 106. The draw call time and resource overhead is therefore significantly reduced.
Turning briefly to FIG. 6, experimental results 600 of the batching mechanism implemented by the encryption supervisory logic 154 are shown. The experimental results 600 show a marked decrease in the cost of processing per byte as the composite message size increases. Table 1 provides the experimental data points. For example, at a log base 2 message size of 16, a 57-fold increase in efficiency is obtained over a log base 2 message size of 10.
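The amortization argument may be made concrete with a simple cost model in which every submission to the GPU pays a fixed overhead plus a cost proportional to the message size. The model and both constants below are hypothetical values chosen for illustration; they are not measurements from the system 100 and are not intended to reproduce the experimental results.

```python
def cost_per_byte(message_bytes: int, call_overhead_s: float, per_byte_s: float) -> float:
    """Cost per byte when one call carries `message_bytes` bytes.

    Total time is modeled as a fixed per-call overhead plus a linear term.
    """
    if message_bytes <= 0:
        raise ValueError("message_bytes must be positive")
    return (call_overhead_s + per_byte_s * message_bytes) / message_bytes


# Hypothetical constants, for illustration only.
OVERHEAD_S = 1e-3   # assumed fixed cost of a transfer plus a draw call
PER_BYTE_S = 1e-8   # assumed marginal processing cost per byte

small = cost_per_byte(2**10, OVERHEAD_S, PER_BYTE_S)  # one small message per call
large = cost_per_byte(2**16, OVERHEAD_S, PER_BYTE_S)  # one batched composite message
```

Under this model the fixed overhead is spread over more bytes as the composite message grows, so the cost per byte falls toward the marginal per-byte cost.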
TABLE 1: Cost per byte in seconds of processing time
FIG. 2 highlights how the encryption supervisory logic 154 handles a processed message 202 returned from the GPU 106. In one implementation, the GPU 106 completes the requested processing on the composite message 142 and writes a resulting processed message 202 into the write-only texture memory 146. The GPU 106 notifies the CPU 102 that processing is complete on the composite message 142. In response, the CPU 102 reads the processed message 202 from the write-only texture memory 146.
As shown in FIG. 2, the processed message 202 includes multiple processed message sections, labeled 204, 206, and 208. The processed message sections generally arise from processing of the composite message sections in the composite message 142. However, there need not be a one-to-one correspondence between what is sent for processing in the composite message 142 and what the GPU 106 returns in the processed message 202.
A processed message section may include multiple fields. For example, the processed message section 204 includes a thread ID 208, message length 210, and a processed message component 212. The message length 210 may represent the length of the processed message component (or the processed message section that includes the processed message component). The thread ID 208 may designate the thread to which the processed message component should be delivered.
The encryption supervisory logic 154 disassembles the processed message 202 into the processed message sections 204, 206, 208 including the processed message components. The encryption supervisory logic 154 also selectively communicates the processed message components to chosen threads among the multiple execution threads of an application, according to which of the threads originated source message components giving rise to the processed message components. In other words, a thread which submits a message for encryption receives in return an encrypted message. The GPU 106 produces the encrypted message and the CPU 102 returns the encrypted message to the thread according to the thread ID specified in the processed message section accompanying the encrypted processed message component. The thread ID 208 specified in the processed message section generally tracks the thread ID 168 specified in the composite message section that gives rise to the processed message section.
In the example shown in FIG. 2, the encryption supervisory logic 154 returns the processed message component 212 to thread 1 of the invoicing application 110. The encryption supervisory logic 154 also returns the other processed message components, including the processed message component 214 from processed message section ‘n’ 208 to the thread ‘n’ 116. Prior to returning each processed message component, the encryption supervisory logic 154 may wake each thread by calling an operating system function to wake a thread by thread ID.
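The disassembly step may be sketched as a walk over the processed message that reads each thread ID and message length and then slices out the processed message component. As in any such sketch, the byte layout (two 32-bit little-endian unsigned integers ahead of each component) and the function name disassemble are assumptions, not details taken from the description.

```python
import struct

# Assumed layout: thread ID and message length as 32-bit little-endian integers.
HEADER = struct.Struct("<II")


def disassemble(processed: bytes):
    """Split a processed message into (thread_id, component) pairs."""
    sections = []
    offset = 0
    while offset < len(processed):
        if offset + HEADER.size > len(processed):
            raise ValueError("truncated section header")
        thread_id, length = HEADER.unpack_from(processed, offset)
        offset += HEADER.size
        if offset + length > len(processed):
            raise ValueError("truncated processed message component")
        sections.append((thread_id, processed[offset:offset + length]))
        offset += length
    return sections


# A processed message with two sections, tagged with illustrative thread IDs.
processed = HEADER.pack(114, 3) + b"abc" + HEADER.pack(116, 2) + b"xy"
by_thread = dict(disassemble(processed))
```

The resulting mapping from thread ID to processed message component is what allows each result to be returned to the thread that originated the corresponding source message component.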
FIG. 3 shows a flow diagram of the processing that encryption supervisory logic 154 may implement to submit composite messages 142 to the GPU 106. The encryption supervisory logic 154 reads the encryption supervisory parameters 156, including batching parameters (302). The batching parameters may include the maximum or minimum length of a composite message 142, and the maximum or minimum wait time for new source message components (e.g., a batching timer) before sending the composite message 142. The batching parameters may also include the maximum or minimum number of composite message sections permitted in a composite message 142, the maximum or minimum number of different threads from which to accept source message components, or other parameters which influence the processing noted above.
The encryption supervisory logic 154 starts a batching timer based on the maximum wait time (if any) for new source message components (304). When a source message component arrives, the encryption supervisory logic 154 sleeps the thread that submitted the source message component (306). The encryption supervisory logic 154 then creates a composite message section to add to the current composite message 142. To that end, the encryption supervisory logic 154 may create a length field (308) and a thread ID field (310) which are added to the source message component to obtain a composite message section (312). The encryption supervisory logic 154 adds the composite message section to the composite message (314).
If the batching timer has not expired, the encryption supervisory logic 154 continues to obtain source message components as long as the composite message 142 has not reached its maximum size. However, if the batching timer has expired, or if the maximum composite message size is reached, the encryption supervisory logic 154 resets the batching timer (316) and writes the composite message to the GPU 106 (318). Another limit on the batch size in the composite message 142 may be set by the maximum processing capacity of the GPU. For example, if the GPU has a maximum capacity of K units (e.g., where K is the number of pixel shaders or other processing units or capacity on the GPU), then the system 100 may set the maximum composite message size to include no more than K composite message sections.
Accordingly, no thread is forced to wait more than a maximum amount of time specified by the batching timer until the source message component submitted by the thread is sent to the GPU 106 for processing. A suitable value for the batching timer may depend upon the particular system implementation, and may be chosen according to a statistical analysis described below, at random, according to pre-selected default values, or in many other ways. Once the composite message 142 is written to the GPU 106, the encryption supervisory logic 154 initiates execution of the GPU 106 algorithm on the composite message 142 (320). One mechanism for initiating execution is to issue a draw call to the GPU 106. The encryption supervisory logic 154 clears the composite message 142 in preparation for assembling and submitting the next composite message to the GPU 106.
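The interaction between the batching timer and the maximum composite message size may be sketched as follows. The class Batcher and its members are hypothetical names. Time is passed in explicitly so that the sketch is deterministic, the size limit is expressed as a number of composite message sections, and the timer is started when the first component of a batch arrives, which is one possible choice among several.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Section = Tuple[int, bytes]  # (thread ID, source message component)


@dataclass
class Batcher:
    max_sections: int                        # maximum sections per composite message
    max_wait_s: float                        # batching timer duration
    submit: Callable[[List[Section]], None]  # stand-in for the GPU write and draw call
    _pending: List[Section] = field(default_factory=list, init=False)
    _timer_start: Optional[float] = field(default=None, init=False)

    def add(self, thread_id: int, component: bytes, now: float) -> None:
        """Add a section; submit at once if the size limit is reached."""
        if self._timer_start is None:
            self._timer_start = now          # start the batching timer
        self._pending.append((thread_id, component))
        if len(self._pending) >= self.max_sections:
            self.flush()

    def tick(self, now: float) -> None:
        """Submit the pending batch if the batching timer has expired."""
        if self._pending and now - self._timer_start >= self.max_wait_s:
            self.flush()

    def flush(self) -> None:
        """Reset the timer, clear the composite message, and submit it."""
        batch, self._pending = self._pending, []
        self._timer_start = None
        if batch:
            self.submit(batch)


sent: List[List[Section]] = []
batcher = Batcher(max_sections=2, max_wait_s=0.05, submit=sent.append)
batcher.add(114, b"a", now=0.00)
batcher.tick(now=0.01)            # timer not yet expired, nothing is sent
batcher.add(116, b"b", now=0.02)  # size limit reached, first batch is sent
batcher.add(114, b"c", now=0.03)
batcher.tick(now=0.09)            # timer expired, second batch is sent
```

In the usage shown, the first batch is released by the size limit and the second by the timer, so neither component waits longer than the configured maximum.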
It is the responsibility of the algorithm implementation on the GPU 106 to respect the individual thread IDs, message lengths, and source message components that give structure to the composite message 142. Thus, for example, the encryption algorithm 144 is responsible for executing fragments on the processors in the GPU for separating the composite message sections, processing the source message components, and creating processed message component results that are tagged with the same thread identifier as originally provided with the composite message sections. In other words, the algorithm implementation recognizes that the composite message 142 is not necessarily one single message to be processed, but a composition of smaller composite message sections to be processed in parallel on the GPU, with the processed results written to the processed message 202.
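The requirement that each composite message section be processed independently and tagged with its original thread identifier may be illustrated on the CPU as follows. This sketch is not GPU code: a thread pool stands in for the parallel processors, and toy_transform is a byte-wise XOR used purely as a placeholder for the encryption algorithm 144. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_transform(component: bytes, key: int = 0x5A) -> bytes:
    """Placeholder transform (XOR with a constant); NOT real encryption."""
    return bytes(b ^ key for b in component)


def process_composite(sections):
    """Process (thread_id, component) sections independently.

    Each result keeps the thread ID of the section that produced it.
    """
    def work(section):
        thread_id, component = section
        return thread_id, toy_transform(component)

    # Executor.map returns results in input order, so section order is kept.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(work, sections))


results = process_composite([(114, b"invoice-1"), (116, b"invoice-n")])
```

Because no section depends on another, the work divides cleanly across processing units, which is the property that the algorithm implementation on the GPU 106 relies upon.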
FIG. 4 shows a flow diagram of the processing that encryption supervisory logic 154 may implement to return processed message components to application threads. The encryption supervisory logic 154 reads the processed message 202 (e.g., from the write-only texture memory 146 of the GPU 106) (402). The encryption supervisory logic 154 selects the next processed message section from the processed message 202 (404). As noted above, the encryption supervisory logic 154 wakes the thread identified by the thread ID in the processed message section (406). Once the thread is awake, the encryption supervisory logic 154 sends the processed message component in the processed message section to the thread (408). The thread then continues processing normally. The encryption supervisory logic 154 may disassemble the processed message 202 into as many processed message sections as exist in the processed message 202.
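The sleep and wake behavior may be sketched with standard threading primitives in place of operating system calls. The class ResultRouter and its methods are hypothetical names, and the integer thread IDs are illustrative labels rather than identifiers assigned by an operating system.

```python
import threading


class ResultRouter:
    """Block a thread until its processed message component is delivered."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = {}
        self._results = {}

    def sleep_until_result(self, thread_id: int, timeout: float = 5.0) -> bytes:
        """Called by an application thread after submitting its component."""
        with self._lock:
            event = self._events.setdefault(thread_id, threading.Event())
        if not event.wait(timeout):          # the thread sleeps here
            raise TimeoutError(f"no result for thread {thread_id}")
        with self._lock:
            del self._events[thread_id]
            return self._results.pop(thread_id)

    def deliver(self, thread_id: int, component: bytes) -> None:
        """Called by the supervisory side: store the result, wake the thread."""
        with self._lock:
            self._results[thread_id] = component
            event = self._events.setdefault(thread_id, threading.Event())
        event.set()


router = ResultRouter()
received = {}


def app_thread(thread_id: int) -> None:
    received[thread_id] = router.sleep_until_result(thread_id)


workers = [threading.Thread(target=app_thread, args=(tid,)) for tid in (114, 116)]
for worker in workers:
    worker.start()
for tid, component in [(114, b"enc-1"), (116, b"enc-n")]:
    router.deliver(tid, component)
for worker in workers:
    worker.join()
```

The event for a thread ID is created by whichever side reaches it first, so a result delivered before the thread begins waiting is not lost.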
FIG. 5 shows a flow diagram of the processing that encryption supervisory tuning logic 158 (“tuning logic 158”) may implement. The tuning logic 158 may simulate or monitor execution of applications running in the system 100 (502). As the applications execute, the tuning logic 158 gathers statistics on application execution, including message size, number of API processing calls, time distribution of processing calls, and any other desired execution statistics (504). The statistical analysis may proceed using tools for queue analysis and batch service to determine expected message arrival rates, message sizes, mean queue length, mean waiting time or long-term average number of waiting processes (e.g., using Little's Law, which states that the long-term average number of customers in a stable system, N, is equal to the long-term average arrival rate, λ, multiplied by the long-term average time a customer spends in the system, T) and other parameters (506).
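Little's Law, N = λT, may be applied directly to the gathered statistics, as in the following sketch. The function names and the example figures (an arrival rate of 2000 requests per second and a mean time in the system of 0.05 seconds) are hypothetical and are chosen only to show the arithmetic.

```python
def mean_number_in_system(arrival_rate_per_s: float, mean_time_in_system_s: float) -> float:
    """Little's Law: N = lambda * T for a stable system."""
    return arrival_rate_per_s * mean_time_in_system_s


def mean_time_in_system(mean_number: float, arrival_rate_per_s: float) -> float:
    """Little's Law rearranged: T = N / lambda."""
    if arrival_rate_per_s <= 0:
        raise ValueError("arrival rate must be positive")
    return mean_number / arrival_rate_per_s


# Hypothetical statistics: 2000 requests per second, 0.05 s mean time in system.
waiting_requests = mean_number_in_system(2000.0, 0.05)
```

With these hypothetical figures, about 100 requests are in the system on average, which gives a starting point for sizing the composite message.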
Given the expected arrival rate, message sizes, and other statistics for processing calls, the tuning logic 158 may set the batching timer, maximum composite message size, maximum composite message sections in a composite message, and other encryption supervisory parameters 156 to achieve any desired processing responsiveness by the system 100. In other words, the encryption supervisory parameters 156 may be tuned to ensure that an application does not wait longer, on average, than an expected time for a processed result.
FIG. 7 shows an example of simulation results 700 of mean waiting time against maximum composite message capacity. Using such statistical analysis results, the tuning logic 158 may set the maximum composite message length to minimize mean waiting time, or obtain a mean waiting time result that balances mean waiting time against other considerations, such as cost of processing per byte as shown in FIG. 6.
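One way in which such a tuning pass could be simulated is sketched below. The queue model is a simplifying assumption and is not the simulation behind FIG. 7: messages arrive as a Poisson process, a single server takes every waiting message up to the capacity whenever it is free, and each batch costs a fixed overhead plus a per-message term. No batching timer is modeled. All names and numeric values are hypothetical.

```python
import random


def simulate_mean_wait(capacity: int, arrival_rate: float, overhead_s: float,
                       per_item_s: float, n_messages: int = 5000, seed: int = 0) -> float:
    """Mean time from a message's arrival until its batch starts service."""
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    for _ in range(n_messages):
        t += rng.expovariate(arrival_rate)   # exponential inter-arrival times
        arrivals.append(t)

    server_free_at, total_wait, i = 0.0, 0.0, 0
    while i < n_messages:
        # Service starts when the server is free and a message is present.
        start = max(server_free_at, arrivals[i])
        j = i
        # Take every message that has arrived by `start`, up to the capacity.
        while j < n_messages and j - i < capacity and arrivals[j] <= start:
            j += 1
        total_wait += sum(start - arrivals[k] for k in range(i, j))
        server_free_at = start + overhead_s + per_item_s * (j - i)
        i = j
    return total_wait / n_messages


def pick_capacity(candidates, **model) -> int:
    """Return the candidate capacity with the lowest simulated mean wait."""
    return min(candidates, key=lambda c: simulate_mean_wait(c, **model))


# Hypothetical workload: 1000 messages/s, 2 ms per batch, 0.1 ms per message.
model = dict(arrival_rate=1000.0, overhead_s=0.002, per_item_s=0.0001)
best = pick_capacity([1, 2, 4, 8, 16, 64], **model)
```

Under these assumed values, capacities of 1 and 2 cannot keep up with the arrival rate and the queue grows without bound, so the selected capacity is one of the larger candidates. A fuller tuning pass might also weigh the cost of processing per byte against the mean waiting time.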
The system described above optimizes encryption for large-scale multithreaded applications, where each thread executes any desired processing logic. The system implements encryption supervisory logic that collects source message components from different threads that execute on the CPU and batches the source message components into a composite message in composite message sections. The system then sends the composite message to the GPU. The GPU locally executes any desired processing algorithm, such as an encryption algorithm that encrypts or decrypts the source message components in the composite message sections on the GPU.
The GPU returns a processed message to the CPU. The encryption supervisory logic then disassembles the processed message into processed message sections, and passes the processed message components within each processed message section back to the correct threads of execution (e.g., the threads that originated the source message components). The system thereby significantly reduces the overhead that would be associated with passing and processing many small messages between the CPU and the GPU. The system 100 is not only cost effective, but can also reduce the performance overhead of cryptographic algorithms to 12% or less with a response time less than 200 msec, which is significantly lower than the overhead incurred in prior attempts to provide encryption services.
The logic described above may be implemented in any combination of hardware and software. For example, programs provided in software libraries may provide the functionality that collects the source messages, batches the source messages into a composite message, sends the composite message to the GPU, receives the processed message, disassembles the processed message into processed message components, and that distributes the processed message components to their destination threads. Such software libraries may include dynamic link libraries (DLLs), or other application programming interfaces (APIs). The logic described above may be stored on a computer readable medium, such as a CDROM, hard drive, floppy disk, flash memory, or other computer readable medium. The logic may also be encoded in a signal that bears the logic as the signal propagates from a source to a destination.
Furthermore, it is noted that the system carries out electronic transformation of data that may represent underlying physical objects. For example, the collection and batching logic transforms, by selectively controlled aggregation, the discrete source messages into composite messages. The disassembly and distribution logic transforms the processed composite messages by selectively controlled separation of the processed composite messages. These messages may represent a wide variety of physical objects, including as examples only, images, video, financial statements (e.g., credit card, bank account, and mortgage statements), email messages, or any other physical object.
In addition, the system may be implemented as a particular machine. For example, the particular machine may include a CPU, GPU, and software library for carrying out the encryption (or other API call processing) supervisory logic noted above. Thus, the particular machine may include a CPU, a GPU, and a memory that stores the encryption supervisory logic described above. Adding the encryption supervisory logic may include building function calls into applications from a software library that handle the collection, batching, sending, reception, disassembly, and distribution logic noted above or providing an API call wrapper and program logic to handle the processing noted above. However, the applications or execution environment of the applications may be extended in other ways to cause the interaction with the encryption supervisory logic.
While various embodiments of the invention have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the invention. Accordingly, the invention is not to be restricted except in light of the attached claims and their equivalents.