The goals for this implementation of non-resident page bookkeeping are:
- minimal space overhead
- SMP scalability
- reasonable speed
Data structures
When storing or looking up a page, we use page->offset and page->mapping to determine which nr_bucket hash bucket the page belongs in, and hash page->offset, page->mapping and page->mapping->host->i_ino together to form the value stored in the nr_bucket.page array. We recycle nr_bucket.page entries in a circular fashion, with hand pointing to the next entry to be updated.
This means we do not need a lookup list for these pages; we simply look through all the objects in the cacheline, doing simple comparisons. Hopefully the cost of comparing up to 31 entries is outweighed by taking only one cache miss per lookup.
{{{
/* Number of non-resident pages per hash bucket */
#define NUM_NR ((L1_CACHE_BYTES - sizeof(atomic_t))/sizeof(unsigned long))

struct nr_bucket {
	atomic_t hand;
	unsigned long page[NUM_NR];
} ____cacheline_aligned;
}}}
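The bucket scheme can be sketched in userspace C. Everything below is illustrative, not the kernel implementation: the table size, the mix() function, and the names remember_page() and recently_evicted() are hypothetical stand-ins (in-kernel, hand would be an atomic_t and the hash a real kernel hash function), but the circular recycling via hand and the whole-bucket scan follow the description above.

```c
#include <assert.h>

/* Userspace sketch; kernel code would use atomic_t and hash_long(). */
#define CACHE_BYTES 64
#define NUM_NR ((CACHE_BYTES - sizeof(int)) / sizeof(unsigned long))
#define NUM_BUCKETS 256		/* hypothetical table size */

struct nr_bucket {
	int hand;		/* next slot to recycle */
	unsigned long page[NUM_NR];
};

static struct nr_bucket nr_table[NUM_BUCKETS];

/* Toy mixing function standing in for a real hash. */
static unsigned long mix(unsigned long a, unsigned long b, unsigned long c)
{
	return (a * 2654435761UL) ^ (b * 40503UL) ^ (c * 2246822519UL);
}

/* Select a bucket from mapping and offset. */
static struct nr_bucket *nr_hash(void *mapping, unsigned long offset)
{
	unsigned long h = mix((unsigned long)mapping, offset, 0);

	return &nr_table[h % NUM_BUCKETS];
}

/* Value stored in the bucket: also folds in the inode number. */
static unsigned long nr_cookie(void *mapping, unsigned long offset,
			       unsigned long ino)
{
	return mix((unsigned long)mapping, offset, ino) | 1; /* never 0 */
}

/* Remember an evicted page, recycling slots in circular order. */
static void remember_page(void *mapping, unsigned long offset,
			  unsigned long ino)
{
	struct nr_bucket *b = nr_hash(mapping, offset);

	b->page[b->hand] = nr_cookie(mapping, offset, ino);
	b->hand = (b->hand + 1) % NUM_NR;
}

/* Was this page evicted recently?  Scan the whole bucket. */
static int recently_evicted(void *mapping, unsigned long offset,
			    unsigned long ino)
{
	struct nr_bucket *b = nr_hash(mapping, offset);
	unsigned long cookie = nr_cookie(mapping, offset, ino);
	unsigned int i;

	for (i = 0; i < NUM_NR; i++)
		if (b->page[i] == cookie)
			return 1;
	return 0;
}
```

Note how the lookup touches exactly one cacheline-sized bucket: one hash selects the bucket, a second (different) hash produces the value compared against every slot.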
Uniqueness of mapping
Because the address_space structure underlying mapping can be freed by the kernel and then reallocated for another file, we need to take some measures to prevent the VM from thinking a brand-new page of a newly opened file was recently evicted. The obvious way to do this would be invalidating all entries of a particular mapping when the struct address_space is freed. However, we would need a much larger data structure if we wanted to support efficient invalidation.
The alternative is to simply hash the inode number into the non-resident page value, and hope that the next time a particular struct address_space is allocated to a different file, that file will have a different inode number.
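This reallocation argument can be illustrated with a toy hash (the multipliers and the name nr_value() are made up for this sketch, not the kernel's hash): as long as the new file has a different inode number, the stale entry's stored value no longer matches, even when the struct address_space address is reused byte-for-byte.

```c
#include <assert.h>

/* Toy stand-in for the hash that folds mapping, offset and inode
 * number into the stored non-resident value. */
static unsigned long nr_value(unsigned long mapping, unsigned long offset,
			      unsigned long ino)
{
	return (mapping * 2654435761UL) ^ (offset * 40503UL)
		^ (ino * 2246822519UL);
}
```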
The swap cache is a special case, since we can invalidate entries when a process exits, because we free up swap pages one by one. We can simply call the recently_evicted function from remove_exclusive_swap_page. This also covers swapoff(8) and a subsequent swapon(8), since the non-resident entries will be invalidated at swapoff time.
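A minimal sketch of that invalidation, under the same bucket layout as above (the name forget_page() and the zero-means-empty convention are assumptions of this sketch): when a swap entry is freed, any matching slot is simply cleared, so a later lookup misses.

```c
#include <assert.h>
#include <string.h>

#define NUM_NR 7	/* slots per bucket; illustrative value */

struct nr_bucket {
	unsigned long page[NUM_NR];
};

/* Hypothetical invalidation helper: clear every slot matching the
 * page's stored value, as would happen for a freed swap entry. */
static void forget_page(struct nr_bucket *b, unsigned long cookie)
{
	unsigned int i;

	for (i = 0; i < NUM_NR; i++)
		if (b->page[i] == cookie)
			b->page[i] = 0;	/* 0 means "empty slot" */
}
```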
Back to AdvancedPageReplacement