NetApp Aggregate Usage Does Not Match Sum of Volumes

Recently we had an issue where an aggregate on a NetApp grew by 10% over the weekend. A check on the volumes showed nothing had grown, which suggested something at the aggregate layer.

 

aggr status -v
df -Agr

We called NetApp Support, who confirmed it was a known bug.

Summary

Deduplication identifies and removes duplicate blocks in a volume, storing only unique blocks.

Deduplication requires a certain amount of metadata, including a ‘fingerprint’ summary to keep track of the data in the blocks. When the data in the blocks changes frequently, the fingerprints become stale.

When the sis start command is running, any stale fingerprint metadata is normally detected and removed. If the deletion of stale fingerprint metadata fails, the stale fingerprint data lingers, consumes space in the volume, and can significantly slow deduplication processing.

Issue Description

When the sis start command runs on a FlexVol volume, the deduplication subsystem of Data ONTAP operates in several phases:

  • Fingerprint gathering
  • Fingerprint sorting
  • Fingerprint compressing
  • Block sharing

Normally, if the fraction of stale fingerprints in the database increases to greater than 20 percent, an additional ‘fingerprint checking’ phase is also performed, which cleans up the data. However, there is an issue in some releases of Data ONTAP (Data ONTAP 8.1, 8.1.1 and 8.1.2 and P/D-patch derivatives) that might cause the percentage to be calculated incorrectly, such that the checking phase is never performed. For more information, see BUG ID: 657692.

Symptom

The stale fingerprints in the fingerprint database are not deleted; the excess data lingers and consumes space in the volume.

As more stale fingerprints accumulate, the increasing size of the fingerprint metadata increases the deduplication workload on the system, with the sorting and merging phases running for a long time. In aggravated cases, storage clients might experience a slow response.

This issue is more likely to be observed on a volume where there is a lot of file delete activity.

Diagnosis

To determine whether a FlexVol volume on a storage system is experiencing this issue, examine the output of two administrative commands for the numeric values used in the calculation below. The commands are:

  • sis check -c <vol>
  • sis status -l <vol>

Note: Run the sis check command in diag mode.

For example:
The output of sis check -c for a volume includes the following lines:
Checking fingerprint  ...  18115836411 records
Number of Segments: 3
Number of Records:  18003077302, 53607122, 59151987
Checking fingerprint.vvol  ...  56538330 records
Checking fingerprint.vvol.delta  ...  2665604040 records


The important value is in the first line, the total of checked records, 18115836411, which will be called ‘TOTALCHECKED’ here.

In the output of sis status -l for the same volume, the following line is included:
Logical Data:                    3509 GB/49 TB (7%)

The important value is the first one displayed: the logical-data size, 3509 (in GB).

Take the logical-data size (in gigabytes) and apply the following calculation, which yields the number of storage blocks occupied by the logical data.
LOGICALBLOCKS = (LOGICALSIZE * 1024 * 1024) / 4
In this case, (3509 * 1024 * 1024) / 4 = 919863296 is the LOGICALBLOCKS value.

To calculate the percentage of stale fingerprints, take the total of checked records from the sis check -c output and use it in the following equation:
PERCENTSTALE = ((TOTALCHECKED - LOGICALBLOCKS) * 100) / LOGICALBLOCKS

In this case, ((18115836411 - 919863296) * 100) / 919863296 gives a PERCENTSTALE result of 1869.

Because the result, 1869, is much larger than 20, the conclusion is that the triggering of sis check at 20 percent stale did not occur, and thus the volume and storage system are experiencing the issue.
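As a quick sanity check, the arithmetic above can be sketched in Python. The values are the example figures from the command output; the function name is ours, not a NetApp tool:

```python
# Sketch of the stale-fingerprint percentage check described above.
# TOTALCHECKED comes from 'sis check -c'; the logical size (GB) from 'sis status -l'.

def percent_stale(total_checked, logical_size_gb):
    """Estimate the percentage of stale fingerprint records on a volume."""
    # Logical size in GB -> number of 4 KB blocks (1 GB = 1024 * 1024 KB).
    logical_blocks = (logical_size_gb * 1024 * 1024) // 4
    return ((total_checked - logical_blocks) * 100) // logical_blocks

# Example values from the sis check -c / sis status -l output above.
print(percent_stale(18115836411, 3509))  # -> 1869, far above the 20% threshold
```

Any result above 20 indicates the ‘fingerprint checking’ phase should have triggered but did not.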

Workaround

A cleanup of the fingerprint database on a volume impacted by this issue is accomplished by running the following command:
sis start -s <vol>

This is resource intensive and very long-running, as it entirely deletes the old fingerprint database to reclaim volume space and then builds a brand-new copy of the fingerprint database.

If the workload imposed on the storage system by running sis start -s is extremely large, a NetApp Support Engineer can guide the user to use the following advanced-mode command on the impacted volume:
sis check -d <vol>

Note: Deduplication of any new data is not performed while ‘sis check -d’ is running; expect the volume to consume more space until the command finishes.

In addition, the ‘sis check -d’ command requires free space in the volume greater than or equal to twice the size of the fingerprint database files. You can estimate the size of the fingerprint database by running ‘sis check -c’, adding the record counts of the fingerprint files, and multiplying by 32 bytes, the size of each record:
Number of records in [fingerprint.vvol + fingerprint.vvol.delta + fingerprint.vvol.delta.old (if present)] * 32 bytes = database size (in bytes). The free space required is twice this value.

Ensure that there is sufficient free space prior to running ‘sis check -d’.
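The size estimate above can be sketched as follows, using the example record counts from the ‘sis check -c’ output earlier. The function name is ours for illustration:

```python
# Sketch: estimate the fingerprint database size and the free space needed
# before running 'sis check -d', using record counts from 'sis check -c'.
RECORD_SIZE = 32  # bytes per fingerprint record

def free_space_needed(*record_counts):
    """Return (db_size_bytes, required_free_bytes) for the given fingerprint files."""
    db_size = sum(record_counts) * RECORD_SIZE
    # 'sis check -d' needs free space >= 2x the database size.
    return db_size, 2 * db_size

# Example: fingerprint.vvol and fingerprint.vvol.delta counts from the output above.
db, needed = free_space_needed(56538330, 2665604040)
print(db, needed)  # database ~87 GB, ~174 GB free space required
```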

Note: ‘sis check -d’ is invalid on a SnapVault secondary volume.

Solution

Users should upgrade to Data ONTAP release 8.1.2P4 or later.

After upgrading to a release with the fix, running deduplication twice (sis start) on each volume will automatically remove these stale fingerprints.
Note: If no new data is added to the volume, deduplication will not go through all of its phases, including the phase responsible for cleaning up stale fingerprints (the verify phase). Deletes alone do not cause deduplication to initiate, yet data deletions do create stale fingerprint metadata in the volume.

The first deduplication job after the upgrade might take longer than expected; subsequent operations complete in normal times. Removing the stale fingerprints temporarily consumes additional space in both the deduplication-enabled FlexVol volumes and their containing aggregates.

Also, to confirm that a controller running a Data ONTAP version with the fix is not experiencing this issue, check the sis logs. They should contain the following two lines:
<timestamp> /vol/<volname> Begin (sis check)
<timestamp> /vol/<volname> Sis Verify Phase 1

 

You can apply the formula above to check whether this issue affects you, i.e. whether PERCENTSTALE is over 20.

Workaround

Run these on all of the volumes that come back with PERCENTSTALE over 20; run them one at a time to avoid high I/O on the SAN.

 

priv set advanced
sis check -c /vol/volume
sis status -l /vol/volume

Fix

Upgrade Data ONTAP to 8.1.2P4 or later.

 

 
