See if two files are the same file

Introduction

The requirement: to know when two files are the same file. This is not the same as testing their contents, size or checksum for equality; rather it is seeing whether two paths which appear to be on different drives or in different directories in fact refer to the same file, obscured by drive mappings or hard links.

This was surprisingly difficult to work out: you'd have thought there would be a FilesAreIdentical API call of some sort, but searching turned nothing up. (This is where I get 25 emails from knowledgeable people telling me that there is one, after all.) The technique I'm going to use here, though, comes with a Microsoft seal of approval. It uses the GetFileInformationByHandle API call to return a volume serial number and a file index number, which together identify the file but are only valid while the file is open. Note that last point: if you open one file, get its index and close it, and then open a completely unrelated file to get its index, you might get the same index both times even though the files are not the same. So both files need to be open at the same time when you compare.
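
The file index comes back from the API as two 32-bit halves, and it only means anything within one volume. As a rough illustration of how the three values fit together, here is a minimal sketch which folds them into one comparable key. The helper name make_file_key and the sample numbers are made up for illustration; they are not part of pywin32 or of the Windows API.

```python
def make_file_key(volume_serial, index_hi, index_lo):
    """Fold a volume serial number and a split file index into one key.

    Hypothetical helper, for illustration only.
    """
    # The high and low values are the two 32-bit halves of a single
    # 64-bit file index. Mask each half to 32 bits in case either one
    # arrives as a negative (signed) number.
    file_index = ((index_hi & 0xFFFFFFFF) << 32) | (index_lo & 0xFFFFFFFF)
    # The index is only unique within a volume, so the volume serial
    # number has to be part of the key as well.
    return (volume_serial, file_index)


# Made-up values: same index on two different volumes is not a match.
on_c = make_file_key(0x1234ABCD, 0x00010000, 0x0002F3A1)
on_d = make_file_key(0x9999EEEE, 0x00010000, 0x0002F3A1)
print(on_c == on_d)
```

Comparing the three values as a plain tuple works just as well; the single key is only convenient if you want to use it as a dictionary key or log it.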

I've broken the code down slightly into three functions just because of the long-windedness of the two API calls used. It could all be done in one place, obviously. This code has been tested on SUBST drives, network-mapped drives, hard links and mounted volumes.

Update: Richard Philips points out that if you specify the FILE_FLAG_BACKUP_SEMANTICS option to CreateFile, you can use the same code to check identity for directories as well as for files.

import os, sys
import tempfile
import win32file

def get_read_handle (filename):
  if os.path.isdir(filename):
    dwFlagsAndAttributes = win32file.FILE_FLAG_BACKUP_SEMANTICS
  else:
    dwFlagsAndAttributes = 0
  return win32file.CreateFile (
    filename,
    win32file.GENERIC_READ,
    win32file.FILE_SHARE_READ,
    None,
    win32file.OPEN_EXISTING,
    dwFlagsAndAttributes,
    None
  )

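def get_unique_id (hFile):
  # The tuple mirrors the BY_HANDLE_FILE_INFORMATION structure; only
  # the volume serial number and the two halves of the file index
  # are needed to identify the file.
  (
    attributes,
    created_at, accessed_at, written_at,
    volume,
    file_hi, file_lo,    # file size, high and low halves
    n_links,
    index_hi, index_lo   # file index, high and low halves
  ) = win32file.GetFileInformationByHandle (hFile)
  return volume, index_hi, index_lo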

def files_are_equal (filename1, filename2):
  hFile1 = get_read_handle (filename1)
  try:
    hFile2 = get_read_handle (filename2)
    try:
      # Compare while both handles are still open: the index is
      # only valid while the file is open.
      return get_unique_id (hFile1) == get_unique_id (hFile2)
    finally:
      hFile2.Close ()
  finally:
    hFile1.Close ()

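#
# This bit of the example will only work on Win2k+; it
#  was the only way I could reasonably produce two different
#  filenames which were the same file, without knowing anything
#  about your drives, network etc. Hard links can't span
#  volumes, so it assumes your temp directory is on the same
#  volume as the Python executable.
#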
filename1 = sys.executable
filename2 = tempfile.mktemp (".exe")
win32file.CreateHardLink (filename2, filename1, None)
print (filename1, filename2, files_are_equal (filename1, filename2))
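
For comparison, and not part of the technique above: on Python 3.2 and later the standard library's os.path.samefile is documented as working on Windows too, and as far as I can tell it relies on the same kind of volume-and-index comparison under the covers. The sketch below needs no pywin32; it builds its own hard link in a temporary directory so it makes no assumptions about your drives. The function name demo_samefile and the file names are made up for the example.

```python
import os
import tempfile


def demo_samefile():
    """Create a file, a hard link to it and an unrelated copy; compare them."""
    with tempfile.TemporaryDirectory() as tmpdir:
        original = os.path.join(tmpdir, "original.txt")
        linked = os.path.join(tmpdir, "linked.txt")
        unrelated = os.path.join(tmpdir, "unrelated.txt")

        # Two separate files with identical contents.
        for path in (original, unrelated):
            with open(path, "w") as f:
                f.write("same contents")

        # A second name for the first file.
        os.link(original, linked)

        return (
            os.path.samefile(original, linked),     # same file, two names
            os.path.samefile(original, unrelated),  # equal contents only
        )


print(demo_samefile())
```

The second comparison is the point of the exercise: the two files have identical contents, but they are not the same file.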