ftcmp ("file tree comparison") is a simple program to find duplicate files. I know this has been done to death and there is a metric plethora of programs for it, but this one is different - at least compared to the six or so existing programs I tried (the first ones appearing on Google or Freshmeat - I gave up after seven): they all fail miserably when there is a large number of files with exactly the same size. This can easily happen when you are writing data files such as matrices (non-sparse, uncompressed), fixed-length time-series data, or tar files split at a specific size boundary - potentially large files, all with exactly the same size.
Anyway - have at it:
Compiles fine with GNU C++ on POSIX (tested on 32-bit and 64-bit Linux - it should probably also work on Mac OS X, but I haven't tested it) and with Visual C++ on 32-bit Windows (I haven't tried 64-bit, but it should be OK). Boost is required - specifically the filesystem and lambda libraries; I used version 1.42 under Linux and 1.46 under Windows. Newer versions of Boost may also require the -lboost_system switch to g++.
To build it under Linux/g++ :
g++ -Wextra -Wall ftcmp.cc -lboost_filesystem
Performance is disk-bound, but the parts that aren't (std::map lookups etc.) do not show much difference when compiler optimisations are enabled.
And under Visual C++ 2010 :
cl.exe ftcmp.cc /D "WIN32" /D "_CONSOLE" /Gm /Zi /W3 /GS /MD /Gd /EHsc /Zc:forScope /I"c:\Program Files\boost\boost_1_46_1" /link /LIBPATH:"C:\Program Files\boost\boost_1_46_1\lib"
With the paths set up for the location of the Boost libraries.
The command options present themselves thusly:
jsp@fatman:~/Desktop/XP share/ftcmp$ ./a.out
./a.out [switches] directory_or_filename [more directories or filenames..]
where switches can be one or more of:
-reverse_size - check for dupes in reverse order of file size
-min_size smin - check only files bigger than smin bytes
-max_size smax - check only files smaller than smax bytes
-bufsize n - buffer size in bytes for comparison (currently set to 8192)
-follow_sym - follow sym links (default is not to follow)
-dir_stats - print what percentage of files in a directory are dupes
-v - be verbose in the output
smin, smax and n can be suffixed by k, M and G
jsp@fatman:~/Desktop/XP share/ftcmp$
jsp@fatman:~/Desktop/XP share/ftcmp$ ls dir?
dir1:
foo1 foo2 dir11 dir12 bar maxi_file2 maxi_file3 non_dupe
dir2:
foo1 bar1 bar2 maxi_file sym1 sym_dir12
jsp@fatman:~/Desktop/XP share/ftcmp$ ./a.out -dir_stats dir1 dir2 dir1/bar
> Found 15 files in the file tree
==> File size: 5
=> dir1/dir12/bar1
=> dir1/dir12/bar2
=> dir1/bar
=> dir2/bar1
=> dir2/bar2
==> File size: 5
=> dir1/dir11/foo1
=> dir1/dir11/foo2
=> dir1/foo1
=> dir1/foo2
=> dir2/foo1
==> File size: 44
=> dir1/maxi_file3
=> dir1/maxi_file2
> Compiling directory statistics..
*> Found 5 duplicate files out of 6 files ( 83.3333 % ) in dir1
*> Found 2 duplicate files out of 2 files ( 100 % ) in dir1/dir11
*> Found 2 duplicate files out of 2 files ( 100 % ) in dir1/dir12
*> Found 3 duplicate files out of 5 files ( 60 % ) in dir2
jsp@fatman:~/Desktop/XP share/ftcmp$
The tokens at the beginning of each line with useful output are there to make it easier to parse the output in a script.
Note that in the example above dir1/bar is only reported as a duplicate once, although it appears twice in the program arguments - once explicitly, and once as a child of the dir1 argument - ftcmp recognises that this is the same file and does not return a false positive. Symbolic links appearing in the tree are not followed unless -follow_sym is specified.
When one or more directories and/or filenames are passed to ftcmp, it first compiles a list of files by doing a depth-first search of the directories. Files outside the size restrictions imposed by the -min_size and -max_size options, as well as non-regular files (e.g. pipes, block devices), are excluded from the list. The list is then partitioned by file size, and wherever a size bucket has more than one member, all the member files' contents are compared against each other; any files with identical contents (irrespective of filename) are reported to the user. The files are compared in pairs, with only two open at a time, so even if there are 5e5 files with exactly the same size, ftcmp should handle it.
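The size-partitioning step amounts to bucketing paths by file size and keeping only the buckets with two or more members. A sketch under those assumptions (not ftcmp's actual code - names and types are illustrative):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Map from file size to all paths that have that size.
typedef std::map<std::uintmax_t, std::vector<std::string> > size_map;

// Group candidate files, given as (path, size) pairs, by size.
// Buckets with more than one entry are the only ones whose contents
// need to be compared.
size_map partition_by_size(
    const std::vector<std::pair<std::string, std::uintmax_t> >& files)
{
    size_map buckets;
    for (std::size_t i = 0; i < files.size(); ++i)
        buckets[files[i].second].push_back(files[i].first);
    return buckets;
}
```

Iterating over the result and skipping buckets of size one yields exactly the candidate sets that go on to the pairwise content comparison.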
I've used this program to compare half a million files. It took some time (on a fanless Intel Atom CPU), but it worked.
The -bufsize option is the number of bytes per read requested of the I/O subsystem - depending on how well that buffers, and on whether the disk is an Advanced Format drive, this may need some tweaking. The value should pretty much always be a power of two - or at least a multiple of 4096. In my testing 8192 worked best (on AF drives and otherwise): 16384, 32768 and 65536 worked fine too but did not significantly improve performance, and anything larger than that degraded performance significantly.
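The pairwise comparison itself can be sketched as reading both files in fixed-size chunks and bailing out at the first mismatch, so only two files are ever open at once. This is an illustrative version, not ftcmp's actual code; the bufsize parameter plays the role of the -bufsize option:

```cpp
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Compare two files chunk by chunk. Returns true only when both files
// have identical length and content; stops at the first differing chunk.
bool contents_equal(const std::string& a, const std::string& b,
                    std::size_t bufsize = 8192)
{
    std::ifstream fa(a.c_str(), std::ios::binary);
    std::ifstream fb(b.c_str(), std::ios::binary);
    if (!fa || !fb)
        return false;                    // unreadable => report as different

    std::vector<char> ba(bufsize), bb(bufsize);
    for (;;) {
        fa.read(&ba[0], static_cast<std::streamsize>(bufsize));
        fb.read(&bb[0], static_cast<std::streamsize>(bufsize));
        const std::streamsize na = fa.gcount(), nb = fb.gcount();
        if (na != nb ||
            std::memcmp(&ba[0], &bb[0], static_cast<std::size_t>(na)) != 0)
            return false;                // length or content mismatch
        if (na == 0)
            return true;                 // both hit EOF with no mismatch
    }
}
```

In practice the length check is redundant for ftcmp's use, since only files of identical size reach this stage, but keeping it makes the function safe on its own.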
I am not really inclined to add any more features - it works, and it should be left alone. If I ever did, I do have a few ideas.
C'est tout. Thank you for visiting. Exit through the gift shop.
|Thu 01 January 1970 00:00 UTC||initial version|
|Mon 15 February 2016 19:22 UTC||minor error handling fix (cosmetic), updated so builds on newer versions of boost|