Software / ftcmp

  • By Jason Spencer
  • First published: 01 January 1970
  • Last update: 15 February 2016
Boost C++ comparison files

Jump to: [ Source and compiling ] [ Usage ] [ Details ] [ Future work ]

ftcmp, "file tree comparison", is a simple program to find duplicate files.. I know this has been done to death and there is a metric plethora of programs to do this, but this one is different - at least from the six or so existing programs I tried (first ones appearing on google or freshmeat - I gave up after seven) - they all fail miserably when there are a large number of files with exactly the same size. This can easily happen when you are writing data files such as matrices (non-sparse, uncompressed), or fixed-length time-series data, or tar files which are split at a specific size boundary - potentially large files, all with the exact same size.

Source and compiling

Anyway - have at it:

ftcmp.cc

Compiles fine with GNU C++ on POSIX (tested under 32-bit and 64-bit Linux - it should probably also work under Mac OS X, but that hasn't been tested) and with Visual C++ under 32-bit Windows (64-bit hasn't been tried, but should be fine). Boost is required - I used version 1.42 under Linux and 1.46 under Windows - specifically the filesystem and lambda libraries. Newer versions of Boost may require the -lboost_system switch to g++.

To build it under Linux/g++ :

g++ -Wextra -Wall ftcmp.cc -lboost_filesystem
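If your Boost version needs the extra library mentioned above, the link line becomes:

g++ -Wextra -Wall ftcmp.cc -lboost_filesystem -lboost_system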

Performance is disk bound, but the parts that aren't (std::map lookups and the like) show little difference when compiler optimisations are enabled.

And under Visual C++ 2010 :

cl.exe ftcmp.cc /D "WIN32" /D "_CONSOLE" /Gm /Zi /W3 /GS /MD /Gd /EHsc /Zc:forScope
	/I"c:\Program Files\boost\boost_1_46_1" /link /LIBPATH:"C:\Program Files\boost\boost_1_46_1\lib"

The /I and /LIBPATH paths should be adjusted to wherever the Boost headers and libraries are installed.

Usage

The command options present themselves thusly:

jsp@fatman:~/Desktop/XP share/ftcmp$ ./a.out
./a.out [switches] directory_or_filename [more directories or filenames..]
where switches can be one or more of:
	-reverse_size	- check for dupes in reverse order of file size
	-min_size smin	- check only files bigger than smin bytes
	-max_size smax	- check only files smaller than smax bytes
	-bufsize n	- buffer size in bytes for comparison (currently set to 8192)
	-follow_sym	- follow sym links (default is not to follow)
	-dir_stats	- print what percentage of files in a directory are dupes
	-v		- be verbose in the output
smin, smax and n can be suffixed by k, M and G
jsp@fatman:~/Desktop/XP share/ftcmp$ 
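For example, to check only files between one megabyte and two gigabytes in size, with verbose output (the directory path here is just a placeholder), an invocation might look like:

./a.out -min_size 1M -max_size 2G -v /some/directory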

Sample output:

jsp@fatman:~/Desktop/XP share/ftcmp$ ls dir?
dir1:
foo1  foo2  dir11  dir12  bar  maxi_file2  maxi_file3  non_dupe

dir2:
foo1  bar1  bar2  maxi_file  sym1  sym_dir12
jsp@fatman:~/Desktop/XP share/ftcmp$ ./a.out -dir_stats dir1 dir2 dir1/bar
>	 Found 15 files in the file tree
==>	File size: 5
	=> dir1/dir12/bar1
	=> dir1/dir12/bar2
	=> dir1/bar
	=> dir2/bar1
	=> dir2/bar2

==>	File size: 5
	=> dir1/dir11/foo1
	=> dir1/dir11/foo2
	=> dir1/foo1
	=> dir1/foo2
	=> dir2/foo1

==>	File size: 44
	=> dir1/maxi_file3
	=> dir1/maxi_file2

>	Compiling directory statistics..
*>	Found 5 duplicate files out of 6 files ( 83.3333 % ) in dir1
*>	Found 2 duplicate files out of 2 files ( 100 % ) in dir1/dir11
*>	Found 2 duplicate files out of 2 files ( 100 % ) in dir1/dir12
*>	Found 3 duplicate files out of 5 files ( 60 % ) in dir2
jsp@fatman:~/Desktop/XP share/ftcmp$ 

The tokens at the beginning of each line of useful output are there to make the output easier to parse in a script.
Note that in the example above dir1/bar is only reported as a duplicate once, although it appears twice in the program arguments - once explicitly and once as a child of the dir1 argument - ftcmp recognises that it is the same file and does not report a false positive. Symbolic links in the tree are not followed unless -follow_sym is specified.
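Since each group of identical files starts with a line beginning "==>", and the member paths sit on lines that begin with a tab followed by "=> ", a script can pick them out with ordinary text tools. For instance, counting the number of duplicate groups in the run above is just:

./a.out -dir_stats dir1 dir2 dir1/bar | grep -c '^==>'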

Details

When one or more directories and/or filenames are passed to ftcmp it first compiles a list of files by doing a depth-first search of the directories. Files outside the size limits given by the -min_size and -max_size options, as well as non-regular files (e.g. pipes, block devices), are excluded from the list. The list is then partitioned by file size, and wherever a size has more than one member the member files' contents are compared against each other; any files with identical contents (irrespective of filename) are reported to the user. The files are compared in pairs and only two are open at any one time, so even if there are 5e5 files with exactly the same size ftcmp should handle it.
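The size-partitioning step can be pictured with a sketch along these lines (not the actual ftcmp code - group_by_size is just an illustrative name, and the boost::filesystem calls are the obvious candidates rather than a statement of what the program really uses):

#include <map>
#include <vector>
#include <boost/cstdint.hpp>
#include <boost/filesystem.hpp>

namespace fs = boost::filesystem;

typedef std::map<boost::uintmax_t, std::vector<fs::path> > size_map;

// Bucket candidate files by size; only buckets with more than one
// member need their contents compared at all.
size_map group_by_size ( const std::vector<fs::path> & files ) {
	size_map by_size;
	for ( std::vector<fs::path>::const_iterator it = files.begin(); it != files.end(); ++it )
		by_size[ fs::file_size(*it) ].push_back(*it);
	return by_size;
}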
I've used this program to compare half a million files. It took some time (on a fanless Intel Atom CPU), but it worked.
The -bufsize option is the number of bytes per read requested of the I/O sub-system - depending on how well that buffers and whether the disk is an Advanced Format drive this may need some tweaking. The value should pretty much always be a power of two, or at least a multiple of 4096. In testing, 8192 worked best (on AF drives and otherwise) and anything larger did not significantly improve performance. Values of 16384, 32768 and 65536 also worked fine, but anything higher degraded performance significantly.
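The pairwise comparison itself boils down to reading both files in -bufsize chunks and stopping at the first difference, roughly like the sketch below (again illustrative, not the real code - identical_contents is a made-up name):

#include <fstream>
#include <vector>
#include <string>
#include <cstring>

// Compare two files chunk by chunk; returns false at the first
// differing block, so unequal files are usually rejected after
// only a few reads.
bool identical_contents ( const std::string & a, const std::string & b, std::size_t bufsize ) {
	std::ifstream fa(a.c_str(), std::ios::binary), fb(b.c_str(), std::ios::binary);
	std::vector<char> ba(bufsize), bb(bufsize);
	while ( fa && fb ) {
		fa.read(&ba[0], bufsize);
		fb.read(&bb[0], bufsize);
		if ( fa.gcount() != fb.gcount() )
			return false;
		if ( std::memcmp(&ba[0], &bb[0], static_cast<std::size_t>(fa.gcount())) != 0 )
			return false;
	}
	return fa.eof() && fb.eof();
}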

Future work

I am not really inclined to add any more features - it works, and it should be left alone. If I did, though, I have a few ideas:

  • Speeding up file content comparison - memory-mapped files may be an idea, but only under Linux.
  • Speeding up file content comparison - maybe some kind of crude disk benchmarking and an automatic setting of the buffer size?
  • Speeding up file content comparison - take the MD5 or SHA512 hashes of all identically sized files and compare those in the first pass; only if the hashes are identical actually compare the contents in a second pass. This idea isn't as great as it sounds: each file has to be read in full to compute its hash (only O(N) reads instead of O(N^2) pairwise comparisons, though the hashes are then compared O(N^2) times - but that happens in memory and is not bound by disk access time), and in most cases direct content comparison turns up a difference quickly anyway, without having to read the entire file. I've done testing with SHA512 from Crypto++ (it's 3-4 times faster than the sha512sum command line tool) and it's almost an order of magnitude slower than simple file content comparison. Hashing would still win if the differences in the data were at the end of the file.
  • More directory analysis - maybe compare file times and treat the oldest file as the original. Also, the "Found 5 duplicate files out of 6 files ( 83.3333 % ) in dir1" statistic should optionally include sub-directories in the analysis - not just child files - so if a directory has two sub-directories and their contents are all dupes, then the directory is considered a 100% duplicate. Maybe. Another analysis result would be to report that "all of directory A is a copy of all of directory B".
  • Add an option to read the list of directories and files to check from stdin, either one per line or null-delimited.

That's all. Thank you for visiting. Exit through the gift shop.

Version history:

Thu 01 January 1970 00:00 UTC - initial version
Mon 15 February 2016 19:22 UTC - minor error handling fix (cosmetic), updated so it builds on newer versions of Boost