How is an LRU cache implemented in a CPU?












6















I'm studying up for an interview and want to refresh my memory on caching. If a CPU has a cache with an LRU replacement policy, how is that actually implemented on the chip? Would each cache line store a timestamp tick?



Also what happens in a dual core system where both CPUs write to the one address simultaneously?










share|improve this question

























  • You may want to take a look here.

    – Joachim Isaksson
    May 3 '14 at 18:56











  • Thank you that is useful. I also added some more to my qn.

    – fred basset
    May 3 '14 at 19:25











  • timestamp is not tick, but some short value (remember, that LRU works independent for each cache set. Dual core system has several caches levels, some levels are private to the core (only owner core may request cache to store new line).

    – osgx
    May 3 '14 at 19:43











  • What would the short value be set to in practice?

    – fred basset
    May 3 '14 at 20:46











  • Possibly of interest: What cache invalidation algorithms are used in actual CPU caches?

    – Paul A. Clayton
    May 3 '14 at 22:40
















6















I'm studying up for an interview and want to refresh my memory on caching. If a CPU has a cache with an LRU replacement policy, how is that actually implemented on the chip? Would each cache line store a timestamp tick?



Also what happens in a dual core system where both CPUs write to the one address simultaneously?










share|improve this question

























  • You may want to take a look here.

    – Joachim Isaksson
    May 3 '14 at 18:56











  • Thank you that is useful. I also added some more to my qn.

    – fred basset
    May 3 '14 at 19:25











  • timestamp is not tick, but some short value (remember, that LRU works independent for each cache set. Dual core system has several caches levels, some levels are private to the core (only owner core may request cache to store new line).

    – osgx
    May 3 '14 at 19:43











  • What would the short value be set to in practice?

    – fred basset
    May 3 '14 at 20:46











  • Possibly of interest: What cache invalidation algorithms are used in actual CPU caches?

    – Paul A. Clayton
    May 3 '14 at 22:40














6












6








6


3






I'm studying up for an interview and want to refresh my memory on caching. If a CPU has a cache with an LRU replacement policy, how is that actually implemented on the chip? Would each cache line store a timestamp tick?



Also what happens in a dual core system where both CPUs write to the one address simultaneously?










share|improve this question
















I'm studying up for an interview and want to refresh my memory on caching. If a CPU has a cache with an LRU replacement policy, how is that actually implemented on the chip? Would each cache line store a timestamp tick?



Also what happens in a dual core system where both CPUs write to the one address simultaneously?







caching cpu cpu-architecture cpu-cache lru






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 21 '18 at 5:37









Peter Cordes

129k18194330




129k18194330










asked May 3 '14 at 18:50









fred bassetfred basset

4,0752373124




4,0752373124













  • You may want to take a look here.

    – Joachim Isaksson
    May 3 '14 at 18:56











  • Thank you that is useful. I also added some more to my qn.

    – fred basset
    May 3 '14 at 19:25











  • timestamp is not tick, but some short value (remember, that LRU works independent for each cache set. Dual core system has several caches levels, some levels are private to the core (only owner core may request cache to store new line).

    – osgx
    May 3 '14 at 19:43











  • What would the short value be set to in practice?

    – fred basset
    May 3 '14 at 20:46











  • Possibly of interest: What cache invalidation algorithms are used in actual CPU caches?

    – Paul A. Clayton
    May 3 '14 at 22:40



















  • You may want to take a look here.

    – Joachim Isaksson
    May 3 '14 at 18:56











  • Thank you that is useful. I also added some more to my qn.

    – fred basset
    May 3 '14 at 19:25











  • timestamp is not tick, but some short value (remember, that LRU works independent for each cache set. Dual core system has several caches levels, some levels are private to the core (only owner core may request cache to store new line).

    – osgx
    May 3 '14 at 19:43











  • What would the short value be set to in practice?

    – fred basset
    May 3 '14 at 20:46











  • Possibly of interest: What cache invalidation algorithms are used in actual CPU caches?

    – Paul A. Clayton
    May 3 '14 at 22:40

















You may want to take a look here.

– Joachim Isaksson
May 3 '14 at 18:56





You may want to take a look here.

– Joachim Isaksson
May 3 '14 at 18:56













Thank you that is useful. I also added some more to my qn.

– fred basset
May 3 '14 at 19:25





Thank you that is useful. I also added some more to my qn.

– fred basset
May 3 '14 at 19:25













timestamp is not tick, but some short value (remember, that LRU works independent for each cache set. Dual core system has several caches levels, some levels are private to the core (only owner core may request cache to store new line).

– osgx
May 3 '14 at 19:43





timestamp is not tick, but some short value (remember, that LRU works independent for each cache set. Dual core system has several caches levels, some levels are private to the core (only owner core may request cache to store new line).

– osgx
May 3 '14 at 19:43













What would the short value be set to in practice?

– fred basset
May 3 '14 at 20:46





What would the short value be set to in practice?

– fred basset
May 3 '14 at 20:46













Possibly of interest: What cache invalidation algorithms are used in actual CPU caches?

– Paul A. Clayton
May 3 '14 at 22:40





Possibly of interest: What cache invalidation algorithms are used in actual CPU caches?

– Paul A. Clayton
May 3 '14 at 22:40












2 Answers
2






active

oldest

votes


















10














For a traditional cache with only two ways, a single bit per set can be used to track LRU. On any access to a set that hits, the bit can be set to the way that did not hit.



For larger associativity, the number of states increases dramatically: factorial of the number of ways. So a 4-way cache would have 24 states, requiring 5 bits per set and an 8-way cache would have 40,320 states, requiring 16 bits per set. In addition to the storage overhead, there is also greater overhead in updating the value.



For a 4-way cache, the following encoding of the state that would seem to work reasonably well: two bits for the most recently used way number, two bits for the next most recently used way number, and a bit indicating if the higher or lower numbered way was more recently used.




  • On a MRU hit, the state is unchanged.

  • On a next-MRU hit the two bit fields are swapped.

  • On other hits, the numbers of the two other ways are decoded, the number of the way that hits is placed in the first two-bit portion and the former MRU way number is placed in the second two-bit portion. The final bit is set based on whether the next-MRU way number is higher or lower than the less recently used way that did not hit.

  • On a miss, the state is updated as if an LRU hit had occurred.


Because LRU tracking has such overhead, simpler mechanisms like binary tree pseudo-LRU are often used. On a hit, such just updates each branching part of the tree with which half of the associated ways the hit was in. For a power of two number of ways W, a binary tree pLRU cache would have W-1 bits of state per set. A hit in way 6 of an 8-way cache (using a 3-level binary tree) would clear the bit at the base of the tree to indicate that the lower half of the ways (0,1,2,3) are less recently used, clear the higher bit at the next level to indicate that the lower half of those ways (4,5) are less recently used and set the higher bit in the final level to indicate that the upper half of those ways (7) is less recently used. Not having to read this state in order to update it can simplify hardware.



For skewed associativity, where different ways use different hashing functions, something like an abbreviated time stamp has been proposed (e.g., "Analysis and Replacement for Skew-Associative Caches", Mark Brehob et al., 1997). Using a miss counter is more appropriate than a cycle count, but the basic idea is the same.



With respect to what happens when two cores try to write to the same cache line at the same time, this is handled by only allowing one L1 cache to have the cache line in the exclusive state at a given time. Effectively there is a race and one core will get exclusive access. If only one of the writing core already has the cache line in a shared state, it will probably be more likely to win the race. With the cache line in shared state, the cache only needs to send an invalidation request to other potential holders of the cache line; with the cache line not present a write would typically need to request the cache line of data as well as asking for exclusive state.



Writes by different cores to the same cache line (whether to the same specific address or, in the case of false sharing, to another address within the line of data) can result in "cache line ping pong", where different cores invalidate the cache line in other caches to get exclusive access (to perform a write) so that the cache line bounces around the system like a ping pong ball.






share|improve this answer































    0














    There is a good slide-deck Page replacement algorithms that talks about various page replacement schemes. It also explains the LRU implementation using mxm matrix really well.






    share|improve this answer


























    • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.

      – emmanuel
      Sep 14 '14 at 17:56











    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f23448528%2fhow-is-an-lru-cache-implemented-in-a-cpu%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    10














    For a traditional cache with only two ways, a single bit per set can be used to track LRU. On any access to a set that hits, the bit can be set to the way that did not hit.



    For larger associativity, the number of states increases dramatically: factorial of the number of ways. So a 4-way cache would have 24 states, requiring 5 bits per set and an 8-way cache would have 40,320 states, requiring 16 bits per set. In addition to the storage overhead, there is also greater overhead in updating the value.



    For a 4-way cache, the following encoding of the state that would seem to work reasonably well: two bits for the most recently used way number, two bits for the next most recently used way number, and a bit indicating if the higher or lower numbered way was more recently used.




    • On a MRU hit, the state is unchanged.

    • On a next-MRU hit the two bit fields are swapped.

    • On other hits, the numbers of the two other ways are decoded, the number of the way that hits is placed in the first two-bit portion and the former MRU way number is placed in the second two-bit portion. The final bit is set based on whether the next-MRU way number is higher or lower than the less recently used way that did not hit.

    • On a miss, the state is updated as if an LRU hit had occurred.


    Because LRU tracking has such overhead, simpler mechanisms like binary tree pseudo-LRU are often used. On a hit, such just updates each branching part of the tree with which half of the associated ways the hit was in. For a power of two number of ways W, a binary tree pLRU cache would have W-1 bits of state per set. A hit in way 6 of an 8-way cache (using a 3-level binary tree) would clear the bit at the base of the tree to indicate that the lower half of the ways (0,1,2,3) are less recently used, clear the higher bit at the next level to indicate that the lower half of those ways (4,5) are less recently used and set the higher bit in the final level to indicate that the upper half of those ways (7) is less recently used. Not having to read this state in order to update it can simplify hardware.



    For skewed associativity, where different ways use different hashing functions, something like an abbreviated time stamp has been proposed (e.g., "Analysis and Replacement for Skew-Associative Caches", Mark Brehob et al., 1997). Using a miss counter is more appropriate than a cycle count, but the basic idea is the same.



    With respect to what happens when two cores try to write to the same cache line at the same time, this is handled by only allowing one L1 cache to have the cache line in the exclusive state at a given time. Effectively there is a race and one core will get exclusive access. If only one of the writing core already has the cache line in a shared state, it will probably be more likely to win the race. With the cache line in shared state, the cache only needs to send an invalidation request to other potential holders of the cache line; with the cache line not present a write would typically need to request the cache line of data as well as asking for exclusive state.



    Writes by different cores to the same cache line (whether to the same specific address or, in the case of false sharing, to another address within the line of data) can result in "cache line ping pong", where different cores invalidate the cache line in other caches to get exclusive access (to perform a write) so that the cache line bounces around the system like a ping pong ball.






    share|improve this answer




























      10














      For a traditional cache with only two ways, a single bit per set can be used to track LRU. On any access to a set that hits, the bit can be set to the way that did not hit.



      For larger associativity, the number of states increases dramatically: factorial of the number of ways. So a 4-way cache would have 24 states, requiring 5 bits per set and an 8-way cache would have 40,320 states, requiring 16 bits per set. In addition to the storage overhead, there is also greater overhead in updating the value.



      For a 4-way cache, the following encoding of the state that would seem to work reasonably well: two bits for the most recently used way number, two bits for the next most recently used way number, and a bit indicating if the higher or lower numbered way was more recently used.




      • On a MRU hit, the state is unchanged.

      • On a next-MRU hit the two bit fields are swapped.

      • On other hits, the numbers of the two other ways are decoded, the number of the way that hits is placed in the first two-bit portion and the former MRU way number is placed in the second two-bit portion. The final bit is set based on whether the next-MRU way number is higher or lower than the less recently used way that did not hit.

      • On a miss, the state is updated as if an LRU hit had occurred.


      Because LRU tracking has such overhead, simpler mechanisms like binary tree pseudo-LRU are often used. On a hit, such just updates each branching part of the tree with which half of the associated ways the hit was in. For a power of two number of ways W, a binary tree pLRU cache would have W-1 bits of state per set. A hit in way 6 of an 8-way cache (using a 3-level binary tree) would clear the bit at the base of the tree to indicate that the lower half of the ways (0,1,2,3) are less recently used, clear the higher bit at the next level to indicate that the lower half of those ways (4,5) are less recently used and set the higher bit in the final level to indicate that the upper half of those ways (7) is less recently used. Not having to read this state in order to update it can simplify hardware.



      For skewed associativity, where different ways use different hashing functions, something like an abbreviated time stamp has been proposed (e.g., "Analysis and Replacement for Skew-Associative Caches", Mark Brehob et al., 1997). Using a miss counter is more appropriate than a cycle count, but the basic idea is the same.



      With respect to what happens when two cores try to write to the same cache line at the same time, this is handled by only allowing one L1 cache to have the cache line in the exclusive state at a given time. Effectively there is a race and one core will get exclusive access. If only one of the writing core already has the cache line in a shared state, it will probably be more likely to win the race. With the cache line in shared state, the cache only needs to send an invalidation request to other potential holders of the cache line; with the cache line not present a write would typically need to request the cache line of data as well as asking for exclusive state.



      Writes by different cores to the same cache line (whether to the same specific address or, in the case of false sharing, to another address within the line of data) can result in "cache line ping pong", where different cores invalidate the cache line in other caches to get exclusive access (to perform a write) so that the cache line bounces around the system like a ping pong ball.






      share|improve this answer


























        10












        10








        10







        For a traditional cache with only two ways, a single bit per set can be used to track LRU. On any access to a set that hits, the bit can be set to the way that did not hit.



        For larger associativity, the number of states increases dramatically: factorial of the number of ways. So a 4-way cache would have 24 states, requiring 5 bits per set and an 8-way cache would have 40,320 states, requiring 16 bits per set. In addition to the storage overhead, there is also greater overhead in updating the value.



        For a 4-way cache, the following encoding of the state that would seem to work reasonably well: two bits for the most recently used way number, two bits for the next most recently used way number, and a bit indicating if the higher or lower numbered way was more recently used.




        • On a MRU hit, the state is unchanged.

        • On a next-MRU hit the two bit fields are swapped.

        • On other hits, the numbers of the two other ways are decoded, the number of the way that hits is placed in the first two-bit portion and the former MRU way number is placed in the second two-bit portion. The final bit is set based on whether the next-MRU way number is higher or lower than the less recently used way that did not hit.

        • On a miss, the state is updated as if an LRU hit had occurred.


        Because LRU tracking has such overhead, simpler mechanisms like binary tree pseudo-LRU are often used. On a hit, such just updates each branching part of the tree with which half of the associated ways the hit was in. For a power of two number of ways W, a binary tree pLRU cache would have W-1 bits of state per set. A hit in way 6 of an 8-way cache (using a 3-level binary tree) would clear the bit at the base of the tree to indicate that the lower half of the ways (0,1,2,3) are less recently used, clear the higher bit at the next level to indicate that the lower half of those ways (4,5) are less recently used and set the higher bit in the final level to indicate that the upper half of those ways (7) is less recently used. Not having to read this state in order to update it can simplify hardware.



        For skewed associativity, where different ways use different hashing functions, something like an abbreviated time stamp has been proposed (e.g., "Analysis and Replacement for Skew-Associative Caches", Mark Brehob et al., 1997). Using a miss counter is more appropriate than a cycle count, but the basic idea is the same.



        With respect to what happens when two cores try to write to the same cache line at the same time, this is handled by only allowing one L1 cache to have the cache line in the exclusive state at a given time. Effectively there is a race and one core will get exclusive access. If only one of the writing core already has the cache line in a shared state, it will probably be more likely to win the race. With the cache line in shared state, the cache only needs to send an invalidation request to other potential holders of the cache line; with the cache line not present a write would typically need to request the cache line of data as well as asking for exclusive state.



        Writes by different cores to the same cache line (whether to the same specific address or, in the case of false sharing, to another address within the line of data) can result in "cache line ping pong", where different cores invalidate the cache line in other caches to get exclusive access (to perform a write) so that the cache line bounces around the system like a ping pong ball.






        share|improve this answer













        For a traditional cache with only two ways, a single bit per set can be used to track LRU. On any access to a set that hits, the bit can be set to the way that did not hit.



        For larger associativity, the number of states increases dramatically: factorial of the number of ways. So a 4-way cache would have 24 states, requiring 5 bits per set and an 8-way cache would have 40,320 states, requiring 16 bits per set. In addition to the storage overhead, there is also greater overhead in updating the value.



        For a 4-way cache, the following encoding of the state that would seem to work reasonably well: two bits for the most recently used way number, two bits for the next most recently used way number, and a bit indicating if the higher or lower numbered way was more recently used.




        • On a MRU hit, the state is unchanged.

        • On a next-MRU hit the two bit fields are swapped.

        • On other hits, the numbers of the two other ways are decoded, the number of the way that hits is placed in the first two-bit portion and the former MRU way number is placed in the second two-bit portion. The final bit is set based on whether the next-MRU way number is higher or lower than the less recently used way that did not hit.

        • On a miss, the state is updated as if an LRU hit had occurred.


        Because LRU tracking has such overhead, simpler mechanisms like binary tree pseudo-LRU are often used. On a hit, such just updates each branching part of the tree with which half of the associated ways the hit was in. For a power of two number of ways W, a binary tree pLRU cache would have W-1 bits of state per set. A hit in way 6 of an 8-way cache (using a 3-level binary tree) would clear the bit at the base of the tree to indicate that the lower half of the ways (0,1,2,3) are less recently used, clear the higher bit at the next level to indicate that the lower half of those ways (4,5) are less recently used and set the higher bit in the final level to indicate that the upper half of those ways (7) is less recently used. Not having to read this state in order to update it can simplify hardware.



        For skewed associativity, where different ways use different hashing functions, something like an abbreviated time stamp has been proposed (e.g., "Analysis and Replacement for Skew-Associative Caches", Mark Brehob et al., 1997). Using a miss counter is more appropriate than a cycle count, but the basic idea is the same.



        With respect to what happens when two cores try to write to the same cache line at the same time, this is handled by only allowing one L1 cache to have the cache line in the exclusive state at a given time. Effectively there is a race and one core will get exclusive access. If only one of the writing core already has the cache line in a shared state, it will probably be more likely to win the race. With the cache line in shared state, the cache only needs to send an invalidation request to other potential holders of the cache line; with the cache line not present a write would typically need to request the cache line of data as well as asking for exclusive state.



        Writes by different cores to the same cache line (whether to the same specific address or, in the case of false sharing, to another address within the line of data) can result in "cache line ping pong", where different cores invalidate the cache line in other caches to get exclusive access (to perform a write) so that the cache line bounces around the system like a ping pong ball.







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered May 3 '14 at 22:39









        Paul A. ClaytonPaul A. Clayton

        3,32011022




        3,32011022

























            0














            There is a good slide-deck Page replacement algorithms that talks about various page replacement schemes. It also explains the LRU implementation using mxm matrix really well.






            share|improve this answer


























            • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.

              – emmanuel
              Sep 14 '14 at 17:56
















            0














            There is a good slide-deck Page replacement algorithms that talks about various page replacement schemes. It also explains the LRU implementation using mxm matrix really well.






            share|improve this answer


























            • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.

              – emmanuel
              Sep 14 '14 at 17:56














            0












            0








            0







            There is a good slide-deck Page replacement algorithms that talks about various page replacement schemes. It also explains the LRU implementation using mxm matrix really well.






            share|improve this answer















            There is a good slide-deck Page replacement algorithms that talks about various page replacement schemes. It also explains the LRU implementation using mxm matrix really well.







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Sep 14 '14 at 17:50









            emmanuel

            9,041101734




            9,041101734










            answered Sep 14 '14 at 17:02









            Ajit VermaAjit Verma

            26




            26













            • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.

              – emmanuel
              Sep 14 '14 at 17:56



















            • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.

              – emmanuel
              Sep 14 '14 at 17:56

















            While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.

            – emmanuel
            Sep 14 '14 at 17:56





            While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.

            – emmanuel
            Sep 14 '14 at 17:56


















            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f23448528%2fhow-is-an-lru-cache-implemented-in-a-cpu%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            How to change which sound is reproduced for terminal bell?

            Can I use Tabulator js library in my java Spring + Thymeleaf project?

            Title Spacing in Bjornstrup Chapter, Removing Chapter Number From Contents