Find Duplicate Files Using R

This is a simple script to search a directory tree for all files with duplicate content. It is based upon the Python code presented by Raymond Hettinger in his PyCon AU 2011 keynote “What Makes Python Awesome”. The slides for the keynote are here. As an exercise, I decided to convert the “find duplicate files” Python code to R.

The Original Python Code

# A bit of awesomeness in five minutes
# Search directory tree for all duplicate files
import os, hashlib, pprint

hashmap = {}  # content signature -> list of filenames
for path, dirs, files in os.walk('/Users/user/test_photo'):
    for filename in files:
        fullname = os.path.join(path, filename)
        with open(fullname) as f:
            d = f.read()
            h = hashlib.md5(d).hexdigest()
            filelist = hashmap.setdefault(h, [])
            filelist.append(fullname)
pprint.pprint(hashmap)

which has the following expected output (given my test directory):

{'79123bbfa69a73b78cf9dfd8047f2bfd': 
['/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3480 copy.JPG',
 '/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3480.JPG'],
 '8428f6383f9591a01767c54057770989': 
 ['/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3482 copy.JPG',
  '/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3482.JPG',
  '/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3482 copy.JPG',
  '/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3482.JPG'],
 '8b25c2e6598c33aa1ca255fe1c14a775': 
 ['/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3481 copy.JPG',
  '/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3481.JPG',
  '/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3481.JPG']}

The R Code

Step 1: Load the digest library so we can calculate MD5 hash values. The MD5 hash is a common method of checking data integrity. We’ll be calculating the MD5 hash of each photo file to determine the uniqueness of the file contents (independent of file name and location).

library("digest")    

In the next code chunk, a list of photo files is generated recursively using R’s dir() function. Note the regex pattern “JPG|AVI” used to isolate the files of interest.

test_dir <- "/Users/user/test_photo"
filelist <- dir(test_dir, pattern = "JPG|AVI", recursive = TRUE, all.files = TRUE, full.names = TRUE)
head(filelist)

results in the following output:

[1] "/Users/user/test_photo/folder_a/IMG_3480 copy.JPG"    
[2] "/Users/user/test_photo/folder_a/IMG_3480.JPG"         
[3] "/Users/user/test_photo/folder_a/IMG_3481 copy.JPG"     
[4] "/Users/user/test_photo/folder_a/IMG_3481.JPG"          
[5] "/Users/user/test_photo/folder_a/IMG_3482 copy.JPG"     
[6] "/Users/user/test_photo/folder_a/IMG_3482.JPG"     

Now that we have the list of files, let’s apply the MD5 hash function to each file. In this case, I am limiting the MD5 calculation to the first 5000 bytes of each file to speed things up:

md5s <- sapply(filelist, digest, file = TRUE, algo = "md5", length = 5000)        
duplicate_files = split(filelist, md5s)    
head(duplicate_files)    

 ## $`56fd210390058f97ccba512db9b23b89`
 ## [1] "/Users/user/test_photo/folder_a/IMG_3480 copy.JPG"
 ## [2] "/Users/user/test_photo/folder_a/IMG_3480.JPG"     
 ## 
 ## $c142f7904e355be0c1f6d38211ed602f
 ## [1] "/Users/user/test_photo/folder_a/IMG_3482 copy.JPG"
 ## [2] "/Users/user/test_photo/folder_a/IMG_3482.JPG"     
 ## [3] "/Users/user/test_photo/folder_b/IMG_3482 copy.JPG"
 ## [4] "/Users/user/test_photo/folder_b/IMG_3482.JPG"     
 ## 
 ## $e6ecbcc84eca1c044fcf8669db1882fa
 ## [1] "/Users/user/test_photo/folder_a/IMG_3481 copy.JPG"
 ## [2] "/Users/user/test_photo/folder_a/IMG_3481.JPG"     
 ## [3] "/Users/user/test_photo/folder_b/IMG_3481.JPG"
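
One caveat: because only the first 5000 bytes are hashed, two different files that happen to share an identical 5000-byte prefix would end up in the same group. A quick verification pass (a sketch added here, not part of the original conversion) re-hashes the full contents of each candidate group to confirm true duplicates:

# re-hash each candidate group over the whole file (no length limit)
verify_group <- function(files) {
    full_md5s <- sapply(files, digest, file = TRUE, algo = "md5")
    split(files, full_md5s)
}
verified <- lapply(duplicate_files, verify_group)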

That completes the code conversion from Python to R. However, to make the results a little more useful, we can separate unique from duplicate files by the length of each list. An MD5 hash with more than one filename indicates duplicate files:

z <- duplicate_files
z2 <- sapply(z, function(x) {
    length(x) > 1
})
z3 <- split(z, z2)
head(z3$"TRUE")


## $`56fd210390058f97ccba512db9b23b89`
## [1] "/Users/user/test_photo/folder_a/IMG_3480 copy.JPG"
## [2] "/Users/user/test_photo/folder_a/IMG_3480.JPG"     
## 
## $c142f7904e355be0c1f6d38211ed602f
## [1] "/Users/user/test_photo/folder_a/IMG_3482 copy.JPG"
## [2] "/Users/user/test_photo/folder_a/IMG_3482.JPG"     
## [3] "/Users/user/test_photo/folder_b/IMG_3482 copy.JPG"
## [4] "/Users/user/test_photo/folder_b/IMG_3482.JPG"     
## 
## $e6ecbcc84eca1c044fcf8669db1882fa
## [1] "/Users/user/test_photo/folder_a/IMG_3481 copy.JPG"
## [2] "/Users/user/test_photo/folder_a/IMG_3481.JPG"     
## [3] "/Users/user/test_photo/folder_b/IMG_3481.JPG"
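
An equivalent, arguably more idiomatic way to keep only the groups containing more than one file name (a suggested alternative, not the code used above) is base R’s Filter():

# keep only the hash groups with more than one file name
dupes <- Filter(function(x) length(x) > 1, duplicate_files)
head(dupes)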

Notes on Vectorization

A previous attempt utilized a “for” loop to create the list of file digests. But as Jeffrey Breen said in his excellent presentation on [grouping and summarizing data in R](http://www.slideshare.net/jeffreybreen/grouping-summarizing-data-in-r):
“Rule of Thumb: If you are using a loop in R you’re probably doing something wrong.”

fl = list()  # create an empty list to hold md5s and filenames
for (itm in filelist) {
    file_digest = digest(itm, file = TRUE, algo = "md5", length = 1000)
    fl[[file_digest]] = c(fl[[file_digest]], itm)
}

… which also produces the desired output (albeit a little less elegantly):

head(fl)             
## $`5715b719723c5111b3a38a6ff8b7ca56`
## [1] "/Users/user/test_photo/folder_a/IMG_3480 copy.JPG"
## [2] "/Users/user/test_photo/folder_a/IMG_3480.JPG"     
## 
## $`24fd4d7d252ca66c8d7a88b539c55112`
## [1] "/Users/user/test_photo/folder_a/IMG_3481 copy.JPG"
## [2] "/Users/user/test_photo/folder_a/IMG_3481.JPG"     
## [3] "/Users/user/test_photo/folder_b/IMG_3481.JPG"     
## 
## $`2a1d668c874dc856b9df0fbf3f2e81ec`
## [1] "/Users/user/test_photo/folder_a/IMG_3482 copy.JPG"
## [2] "/Users/user/test_photo/folder_a/IMG_3482.JPG"     
## [3] "/Users/user/test_photo/folder_b/IMG_3482 copy.JPG"
## [4] "/Users/user/test_photo/folder_b/IMG_3482.JPG"

Credits

I welcome any suggestions you may have to improve the code / to make it more “idiomatic R”. The Stack Overflow user nograpes and others in the Stack Overflow community were very helpful with the elegant solution to the vectorization question I posted here.
The HTML output was generated using the knitr package from within RStudio version 0.97.173.

Source Code

The R Markdown (.Rmd) and R source files are available at my public GitHub repository:

 https://github.com/mspan/find-duplicate-files.git

7 Comments on “Find Duplicate Files Using R”

  1. Hasan Diwan
    December 30, 2012

    You need to import pprint in the python file, else it throws an exception.

    • admin
      December 30, 2012

      Thank you for pointing out my omission. I did have to add the import, as you mentioned, to generate the output shown in the post. I updated the post accordingly.
      Cheers,
      Mike

  2. Paolo
    December 31, 2012

    Nice post! You could wrap your code into a single function and use normalizePath for converting file paths to canonical form for the platform (?normalizePath):


    find.duplicates <- function(path = ".", pattern = "JPG|AVI", algo = "md5", length = 5000, fullpath = TRUE) {
        require("digest")
        if (fullpath) path <- normalizePath(path)
        filelist <- dir(path, pattern = pattern, recursive = TRUE, all.files = TRUE, full.names = TRUE)
        md5s <- sapply(filelist, digest, file = TRUE, algo = algo, length = length)
        duplicate_files <- split(filelist, md5s)
        z <- duplicate_files
        z2 <- sapply(z, function(x) { length(x) > 1 })
        z3 <- split(z, z2)
        head(z3$"TRUE")
    }
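
    For example (the path here is just illustrative), the function could then be called as:

    find.duplicates("/Users/user/test_photo", pattern = "JPG|AVI")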

    • admin
      December 31, 2012

      Thank you for the tip to use the normalizePath function. I will update the post with your suggestions – including wrapping the code in a function.
      Cheers,
      Mike

  3. Paolo
    January 1, 2013

    Thanks to you for sharing the code! Happy 2013!

    • admin
      January 1, 2013

      Thank you for your help. Happy New Year!

Information

This entry was posted on December 28, 2012 in rstats.
