In safety-critical systems, the ability to guarantee that safety-critical tasks have sufficient time to execute is key for system integrity and safe operation. A big challenge facing developers of safety-critical software applications is managing the shared resources of memory and cache to minimize worst case execution times (WCET). On-chip cache allows the processor to run at on-chip memory bus speeds and increases overall compute power when executing from the cache. However, task switching and competition among multiple cores can degrade cache performance and have dramatic impacts on WCET. Benchmarks demonstrate that WCET can be 3 times higher than average-case execution time (ACET) on single-core processors, up to an order of magnitude higher or more on multicore processors.
By utilizing real-time operating systems like Deos that support cache partitioning, programmers can isolate safety-critical tasks from detrimental cache effects, thereby reducing WCET and boosting overall CPU time available for time-critical tasks. The ability to bound and control cache effects also simplifies analysis of inter-application interference patterns, thereby streamlining safety certification. Results from benchmarks derived from actual flight software demonstrate these benefits, and are presented later in the article.
Cache maximizes processor performance by enabling it to access faster on-chip memory. That advantage, however, only exists when the data for the executing task resides in the cache. During a task switch in a memory-partitioned RTOS, the cache must be invalidated, and those portions of the cache that had been altered by the previous task (dirty cache) must be written back to main memory. Only after this is done and the cache is repopulated with data for the new task will the CPU again be able to access memory at chip speeds.
On a multicore processor, the problem is even worse because no longer is the major impact limited to a process or memory partition context switch. Now, multiple cores will compete for shared cache during normal process execution. A common dual-core processor configuration is shown in Figure 1. In this configuration, each core has its own CPU and small L1 cache, and both cores share the large L2 cache.
In this configuration, applications executing on Core 0 compete for the entire L2 cache with applications executing on Core 1 (note that applications on the same core also compete with one another for the caches). If application A on Core 0 during normal operation uses data that maps to the same cache line(s) as application B on Core 1, then a collision occurs and must be corrected by the hardware.
As an example, suppose A’s data resides in L2, and that B accesses data that happens to map to the same L2 cache line as A’s data. At that point, A’s data must be evicted from L2 (including a potential “write-back” to RAM), and B’s data must be brought into cache from RAM. The time required to handle this collision increases B’s execution time. Then, suppose A accesses its data again. Since that data is no longer in L2 (B’s data is in its place), B’s data must be evicted from L2 (including a potential “write-back” to RAM), and A’s data must be brought back into cache from RAM. The time required to handle this collision increases A’s execution time.
Most times, A and B will encounter such collisions infrequently. In those cases, their respective execution times are considered “average case” (i.e., ACETs). However, on occasion, their data accesses will collide at a high frequency. In these cases, the impacts of the worst possible flip-flopping of cache lines must be included as the “worst case” possible execution times (i.e., WCETs), as it can occasionally happen during normal execution, although infrequently.
When developing certifiable, safety-critical software, one must design and budget an application’s execution time for worst-case behavior, since such software must have adequate time budget to complete its intended function every time it executes, lest it cause an unsafe failure condition. With the potential for multiple applications on multiple cores to generate contention for L2, WCETs on MCPs often are considerably higher than ACETs. Because certifiable, safety-critical applications must have adequate time budgets to accommodate their WCETs, this situation leads to a great deal of budgeted but unused time, resulting in significantly degraded CPU availability for time-critical tasks.
Cache partitioning reduces WCET and increases CPU utilization by reducing L2 cache effects on single core systems, and cache competition effects on multicore systems, thereby making it easier to bound and control interference patterns. By setting aside dedicated partitions of the cache for critical tasks (or groups of tasks), developers can reduce interference and provide timely, deterministic access to cache. This reduces the amount of time that must be budgeted for critical tasks by decreasing WCET, thus maximizing the “guaranteed” execution time available for safety-critical tasks.
Figure 2 shows conceptually how (in single-core operation) the large L2 cache can be separated into multiple partitions, thereby allowing a system integrator to isolate critical tasks from the cache effects of other tasks.
Additionally, as shown in Figure 3, the RTOS may partition the L2 cache such that each core has its own segment, or segments of L2, meaning that data used by applications on Core 0 will only be cached in Core 0’s L2 partitions. Similarly, data used by applications on Core 1 will only be cached in Core 1’s L2 partitions. This partitioning eliminates the possibility of applications on different cores interfering with one another via L2 cache data collisions. By limiting the cache interference, the application WCETs are much closer to its ACETs than is the case without cache partitioning in the system. Hence, by bounding and controlling these interference patterns, application execution times are more deterministic and time budgets can be set far more tightly, thereby keeping processor utilization high for time-critical tasks.
Test Environment and Applications
For the testing, actual flight software was used as a test application and the resultant execution times were measured over 1600 frames. Tests were run using single-core operation to highlight the severity of cache effects and benefits of cache partitioning in the more constrained single core environment.
As mentioned earlier, this single-core operation limits the cache impacts on critical tasks to the cache invalidates across context switches, and not the constant cache contention that can occur on a multicore system. On a multicore system, WCET degradation has been measured as 10X over the application’s ACET.
Tests were run with and without a “cache trasher” application, which evicts test application data/code from L2 and “dirties” L2 with its own data/code. In effect, the cache trasher puts L2 into a worst-case state where different applications run concurrently and compete for the shared L2 cache.
Each test application was executed under multiple scenarios:
In scenario 1, the test application competes for the entire 512KB L2 cache along with the RTOS kernel and a variety of tools. There is no partitioning or cache trashing. This test establishes baseline average performance, each test executing with average to minimal L2 contention.
Scenario 2 also omits cache partitioning, but adds cache trashing. Here, the test application competes for the entire L2 cache along with the RTOS kernel, a variety of debug tools and the rogue cache trasher. This test establishes baseline worst-case performance, each test executing with worst-case L2 interference from other applications, primarily the cache trasher.
In Figure 4, scenario 1 sets the lower band limit, and scenario 2 sets the upper band limit. The area between scenario 1 and scenario 2 represents the normal execution band of a fully loaded system.
Scenarios 3 and 4 are the same as scenarios 1 and 2, respectively, except that the L2 cache is turned off. The red band in Figure 5 shows the performance impact. Scenario 3 sets the low end of the band, scenario 4 the high end of the band. The area between represents how a typically loaded system would be expected to perform if the L2 cache was turned off. Note that best-case scenario 3 performance declines with respect to scenario 1 due to the disabling of the L2 cache. Meanwhile, the worst-case response time for scenario 4 (WCET) improves with respect to scenario 2, as the performance penalty of the cache trasher is limited to the L1 cache.
Scenarios 5 and 6, depicted by the green band in figure 6, add cache partitioning. As the figure shows, partitioning the L2 cache among applications reduces the amount of cache available to each application, thereby reducing average performance. At the same time, however, partitioning the cache also reduces cache contention, which bounds worst-case execution. In this way, cache partitioning strikes a balance, enabling applications to benefit from the cache, but in a way that drastically reduces WCET.
Note that it is possible for applications executing in the same cache partition to interfere with each other. However, such interference is much easier to analyze and bound than the unpredictable interference patterns that may occur between all applications executing on a single core or different cores with a completely shared cache. If necessary, in those situations, applications could be mapped to completely isolated cache partitions.
The benchmark results clearly demonstrate that cache partitioning provides an effective means of bounding and controlling interference patterns in shared cache. Though the benchmarks are run on a single core processor, the results would only be amplified greatly if the benchmarks were run on a multicore CPU, where multiple cores would be contending for the same shared cache constantly during normal process execution. Ultimately, worst-case execution times are bounded and controlled much more tightly when the cache is partitioned. This enables application developers to set relatively tight, yet safe, execution time budgets for the tasks thereby maximizing processor utilization in both single- and multicore devices.
This article was written by Greg Rose, VP of Marketing and Product Management, DDC-I, Inc. (Phoenix, AZ). For more information, Click Here .