TL;DR: There are tools to accomplish this, but not as part of the official tool suite, and probably not in your distribution’s repositories. You will have to choose from a number of tools and probably build the one you pick yourself. See below for details.
The btrfs wiki has an article on deduplication, which also mentions some tools.
There are more tools out there – I looked at one, though it seems to have been unmaintained for 6 years as of this writing, so I decided to stick with what is on the btrfs wiki.
None of these are part of the official btrfs suite so far, and at least Ubuntu 20.04 does not offer packages for them – you will have to build them yourself.
dduper looked promising – it claims to do both file and block based deduplication (i.e. it deduplicates entire files, as well as blocks which are identical between two or more files). It is also said to be fast, as it relies on internal btrfs indices. Being written in Python, it does not need to be built before use (you do need the prettytable package for Python on your machine, though). However, it seems to skip any files below 4 KB, which I figure is counterproductive when you have lots of small, identical files.
I decided to go with duperemove, which only does block based deduplication. Apart from a C build environment and autotools, you will need the libsqlite3-dev package on your machine. Grab the sources and build them by running make from the source dir. duperemove can then be run directly from the source dir, for those who don’t want to make install random stuff on their system.
The docs mention two ways to run duperemove: directly, or by running fdupes and piping its output into duperemove. The first is only recommended for small data sets. The second one, however, turned out to be extremely resource-hungry for my data set of 2–3 TB and some 4 million files (after a day, progress was around half a percent, and memory usage along with constant swapping rendered the system almost unusable).
What seems to work for me is
sudo duperemove -drA /foo --hashfile=/path/to/duperemove.hashfile
This will:

- deduplicate (-d), as opposed to just collecting hashes and spitting out a list of dupes
- recurse into subdirectories (-r)
- process read-only snapshots (-A, needed because my snapshots are read-only)
- store hashes in a file (--hashfile), giving you two advantages:
  - duperemove can be interrupted and resumed at any point, as long as the hash file (the index database) remains in place
  - memory usage is much lower than in fdupes mode, although the hash file takes up disk space: 90 bytes per block and 270 bytes per file

duperemove runs in multiple phases: indexing, loading duplicate hashes, and deduplication.
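The per-block and per-file overheads quoted above (90 bytes per block, 270 bytes per file) make a back-of-envelope estimate of the hash file size easy. A minimal sketch, where the data set size and file count are assumptions for illustration (note that the actual file can come out larger, presumably due to database overhead):

```python
# Rough hash file size estimate from the quoted per-block and per-file
# record sizes (90 and 270 bytes). Illustrative only - the real file is
# a database and carries additional overhead.

def hashfile_size(data_bytes, file_count, block_size=128 * 1024):
    """Estimated hash file size in bytes: one record per block plus one per file."""
    blocks = -(-data_bytes // block_size)  # ceiling division
    return blocks * 90 + file_count * 270

# Example (assumed figures): ~3.5 TB spread over ~5.5 million files
# at the default 128K block size.
size = hashfile_size(3_500_000_000_000, 5_500_000)
print(f"~{size / 1e9:.1f} GB of hash records")
```

Shrinking the block size grows the first term linearly, which is where most of the estimate comes from on large data sets.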
I ran duperemove against a disk with 8 subvolumes, 5 of which are snapshotted regularly, with 24 snapshots being kept around (the last 12 monthly, 5 weekly and 7 daily ones). Snapshots included, the disk holds some 5–6 million files taking up 3.5 TB (pre-dedup; 1.8 TB expected post-dedup).
When duperemove starts running and displaying progress, the percentage only refers to the indexing phase, not the whole deduplication process. Loading duplicate hashes can take much less or much more time than indexing, depending on how many blocks were examined. Deduplication again takes roughly the same time as indexing. Also, progress calculation for the indexing phase seems to be based solely on the number of files, not blocks, making it an unreliable indicator of total time required if your large files all happen to sit at the beginning (or the end) of the set.
Resource usage during the indexing phase is low enough to keep my system responsive when using a hashfile, though loading duplicate data can eat into your free memory. If the index DB is larger than the amount of free memory on your system, this may cause excessive swapping and slow down your system.
Indexing everything (with the default block size of 128K) took some 28 days to complete and produced a 21 GB hash file. I ran out of memory on day 36, which left my system unresponsive, so I had to abort. Memory usage by the duperemove process had been oscillating around 12–14 GB for four days, though total memory usage kept increasing until the system became unusable.
For the next attempts, I decided to deduplicate the subvolumes one by one, with an additional run covering portions of two subvolumes between which I knew I had moved data. I started out with a 1024K block size, though this will miss duplicate blocks smaller than the block size, as well as entire files smaller than the block size, in exchange for better performance. This took around 24 hours and ended up freeing some 45 GB on my drive – satisfactory performance, but the space savings are not what I expected.
I aborted another attempt with a 4K block size on a 300G subvolume – indexing took roughly four times as long as with 1024K, but after 3 days, loading duplicate hashes still had not finished. Another attempt at 64K completed in under 4 hours. Note that deduplication in any pass after the first should finish faster, as only the small blocks are left to deduplicate.
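The block-size tradeoff above can be quantified roughly: halving the block size doubles the number of hashes to compute, store and compare, while smaller duplicated regions become detectable. A back-of-envelope sketch for the 300G subvolume, reusing the 90-bytes-per-block figure from earlier (the numbers are illustrative, not measurements):

```python
# How block size affects the amount of hashing work and index size.
# Uses the 90-bytes-per-block record size quoted earlier; the 300G
# subvolume size comes from the example above.

SUBVOLUME_BYTES = 300 * 1024**3  # 300 GiB

for block_size in (4 * 1024, 64 * 1024, 128 * 1024, 1024 * 1024):
    blocks = SUBVOLUME_BYTES // block_size
    index_mib = blocks * 90 / 1024**2
    print(f"{block_size // 1024:>5}K: {blocks:>11,} block hashes, "
          f"~{index_mib:,.0f} MiB of hash records")
```

This is consistent with the experience above: at 4K the subvolume produces hundreds of times more hashes to load and compare than at 1024K, which is why the "loading duplicate hashes" phase blew up.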
Hence my suggestions, based on practical experience:
- Start by making a full backup, so that if something goes wrong you haven't lost anything.
- Store the hash file in a persistent location: /tmp may not be a good idea as it may get wiped on reboot (even if you don't plan on rebooting, you might want to be safe in case of a system crash or power outage).
I believe you are looking for duperemove -d
"Duperemove is a simple tool for finding duplicated extents and submitting them for deduplication. When given a list of files it will hash their contents on a block by block basis and compare those hashes to each other, finding and categorizing extents that match each other. When given the -d option, duperemove will submit those extents for deduplication using the btrfs-extent-same ioctl.
Duperemove has two major modes of operation one of which is a subset of the other.
Readonly / Non-deduplicating Mode
When run without -d (the default) duperemove will print out one or more tables of matching extents it has determined would be ideal candidates for deduplication. As a result, readonly mode is useful for seeing what duperemove might do when run with '-d'. The output could also be used by some other software to submit the extents for deduplication at a later time.
It is important to note that this mode will not print out all instances of matching extents, just those it would consider for deduplication.
Generally, duperemove does not concern itself with the underlying representation of the extents it processes. Some of them could be compressed, undergoing I/O, or even have already been deduplicated. In dedupe mode, the kernel handles those details and therefore we try not to replicate that work.
Deduping Mode
This functions similarly to readonly mode with the exception that the duplicated extents found in our "read, hash, and compare" step will actually be submitted for deduplication. An estimate of the total data deduplicated will be printed after the operation is complete. This estimate is calculated by comparing the total amount of shared bytes in each file before and after the dedupe.
See the duperemove man page for further details about running duperemove."
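The quoted "read, hash, and compare" step can be sketched in a few lines. This is a toy illustration under my own assumptions, not duperemove's actual code: it only finds candidate duplicate blocks, akin to readonly mode, while real deduplication hands matching extents to the kernel via the btrfs-extent-same ioctl, which is omitted here.

```python
# Toy illustration of the "read, hash, and compare" step described in
# the quote above. NOT duperemove itself: it merely groups identical
# fixed-size blocks (like readonly mode); submitting them to the
# btrfs-extent-same ioctl for actual deduplication is omitted.
import hashlib
from collections import defaultdict

BLOCK_SIZE = 128 * 1024  # duperemove's default block size

def block_hashes(path):
    """Yield (offset, digest) for each fixed-size block of a file."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            yield offset, hashlib.sha256(block).digest()
            offset += len(block)

def find_duplicate_blocks(paths):
    """Group (path, offset) locations by block content; keep groups with dupes."""
    by_hash = defaultdict(list)
    for path in paths:
        for offset, digest in block_hashes(path):
            by_hash[digest].append((path, offset))
    # Only content seen at more than one location is a dedupe candidate.
    return {h: locs for h, locs in by_hash.items() if len(locs) > 1}
```

Run over two files sharing a block-aligned region, this reports the shared block's locations in both files; duperemove additionally persists the hashes (the --hashfile seen earlier) and lets the kernel verify and share the matching extents.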
This doesn't seem to appear in the btrfs-tools package, but there is a GitHub page for it here. Recent open and closed issues (aka Pulse) are available here.
Packages for all currently supported versions of Ubuntu can be found in this PPA.
I must reiterate that backing up first is highly recommended. See: https://github.com/markfasheh/duperemove/issues/50
Quoted Source: https://github.com/markfasheh/duperemove
man page: https://manpages.debian.org/testing/duperemove/duperemove.8.en.html