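In everyday use, duplicate files are a common nuisance: the same file may sit in both directory A and directory B and, worse, the copies may not even share a name. With only a few files this is manageable; at worst you compare them by hand, one by one, although even then it is hard to be sure you have not missed anything. With a large number of files, manual comparison is effectively impossible. I have recently been reading "Python for Unix and Linux System Administration" (《Python UNIX和Linux系统管理指南》), which covers comparing data; the scripts below build on that material, adapted for practical use.
The script consists of four modules: diskwalk, checksum, find_dupes, and delete. diskwalk walks a directory tree: given a path, it returns every file under that path. checksum computes a file's MD5 digest. find_dupes imports diskwalk and checksum and uses the MD5 value (paired with the file size) to decide whether two files are identical. delete handles deletion. The details follow: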
1. diskwalk.py
import os
import sys


class diskwalk(object):
    """Collect the full path of every file under a directory tree."""

    def __init__(self, path):
        self.path = path

    def paths(self):
        path_collection = []
        for dirpath, dirnames, filenames in os.walk(self.path):
            for file in filenames:
                fullpath = os.path.join(dirpath, file)
                path_collection.append(fullpath)
        return path_collection


if __name__ == '__main__':
    # print(x) with a single argument behaves the same on Python 2 and 3
    for file in diskwalk(sys.argv[1]).paths():
        print(file)
2. checksum.py
import hashlib
import sys


def create_checksum(path):
    """Return the MD5 digest of the file at path, read in 8 KB chunks."""
    checksum = hashlib.md5()
    # Binary mode: hash the raw bytes, with no newline translation or decoding
    with open(path, 'rb') as fp:
        while True:
            buffer = fp.read(8192)
            if not buffer:
                break
            checksum.update(buffer)
    return checksum.digest()


if __name__ == '__main__':
    create_checksum(sys.argv[1])
3. find_dupes.py
import sys
from os.path import getsize

from checksum import create_checksum
from diskwalk import diskwalk


def findDupes(path):
    """Return a dict mapping each duplicate file to the first identical file seen."""
    record = {}  # (size, MD5 digest) -> first file seen with that key
    dup = {}     # duplicate file -> original file
    d = diskwalk(path)
    files = d.paths()
    for file in files:
        compound_key = (getsize(file), create_checksum(file))
        if compound_key in record:
            dup[file] = record[compound_key]
        else:
            record[compound_key] = file
    return dup


if __name__ == '__main__':
    for file in findDupes(sys.argv[1]).items():
        print("The duplicate file is %s" % file[0])
        print("The original file is %s\n" % file[1])
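The findDupes function returns the dictionary dup, whose keys are the duplicate files and whose values are the corresponding original files. Here "original" simply means the first file encountered with a given size and checksum. This answers a common question: how can you be sure that a reported file really is a duplicate? Because every duplicate is listed next to the file it duplicates, the pair can be checked before anything is deleted.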
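To make the shape of that dictionary concrete, here is a minimal, self-contained sketch written for Python 3. It is not the author's module: the helper names `md5_of`, `find_dupes_sorted`, and `demo` are hypothetical, the file names are made up, and the paths are sorted so that which copy counts as the "original" is predictable (plain `os.walk` order is not guaranteed).

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=8192):
    """Return the MD5 hex digest of a file, read in binary chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_dupes_sorted(root):
    """Map each duplicate path to the first path seen with the same content."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    record, dup = {}, {}
    for path in sorted(paths):  # sorting makes "first seen" deterministic
        key = (os.path.getsize(path), md5_of(path))
        if key in record:
            dup[path] = record[key]
        else:
            record[key] = path
    return dup


def demo():
    """Build a throwaway directory tree and return its duplicates as relative paths."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "sub"))
        contents = {
            "a.txt": b"same bytes",
            os.path.join("sub", "renamed.txt"): b"same bytes",  # copy under another name
            "c.txt": b"different!",  # same size as a.txt, different content
        }
        for rel, data in contents.items():
            with open(os.path.join(root, rel), "wb") as fp:
                fp.write(data)
        dup = find_dupes_sorted(root)
        return {os.path.relpath(k, root): os.path.relpath(v, root)
                for k, v in dup.items()}


if __name__ == "__main__":
    for duplicate, original in demo().items():
        print("duplicate: %s  original: %s" % (duplicate, original))
```

In this sketch the renamed copy in `sub` is reported as a duplicate of `a.txt`, while `c.txt` is left alone even though it has the same size, because its checksum differs.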
4. delete.py
import os
import sys


class deletefile(object):
    """Three ways to handle one file: delete it, dry-run, or ask first."""

    def __init__(self, file):
        self.file = file

    def delete(self):
        print("Deleting %s" % self.file)
        os.remove(self.file)

    def dryrun(self):
        print("Dry Run: %s [NOT DELETED]" % self.file)

    def interactive(self):
        try:
            prompt = raw_input  # Python 2
        except NameError:
            prompt = input      # Python 3
        answer = prompt("Do you really want to delete: %s [Y/N]" % self.file)
        if answer.upper() == 'Y':
            os.remove(self.file)
        else:
            print("Skipping: %s" % self.file)


if __name__ == '__main__':
    from find_dupes import findDupes
    dup = findDupes(sys.argv[1])
    for file in dup:  # the keys of dup are the duplicate files
        delete = deletefile(file)
        # delete.dryrun()
        delete.interactive()
        # delete.delete()
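The deletefile class defines three methods, all concerned with deleting files: delete removes the file immediately; dryrun is a trial run that only reports the file and deletes nothing; interactive asks the user to confirm before deleting. Together they cover the different levels of caution a user may want.
Summary: the four modules are self-contained, and each can be used on its own. Combined, they delete duplicate files in bulk, given nothing more than a path.
Finally, here is a complete single-file version that is compatible with both Python 2 and Python 3.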
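As a rough illustration of how the three modes relate, the sketch below (Python 3) selects one of them by name and applies it to a list of paths. It is an assumption-laden variant rather than the author's class: `DeleteFile`, `process`, the `ask` parameter, and the boolean return values are all hypothetical, added only so that the behaviour can be exercised without typing at a prompt.

```python
import os
import tempfile


class DeleteFile(object):
    """Hypothetical variant of deletefile; each method returns True if the file was removed."""

    def __init__(self, file_name, ask=input):
        self.file_name = file_name
        self.ask = ask  # injectable prompt function (assumption, for testability)

    def delete(self):
        print("Deleting %s" % self.file_name)
        os.remove(self.file_name)
        return True

    def dryrun(self):
        print("Dry Run: %s [NOT DELETED]" % self.file_name)
        return False

    def interactive(self):
        answer = self.ask("Do you really want to delete: %s [Y/N] " % self.file_name)
        if answer.strip().upper() == "Y":
            os.remove(self.file_name)
            return True
        print("Skipping: %s" % self.file_name)
        return False


def process(paths, mode="dryrun", ask=input):
    """Apply one of the three modes to every path; return the paths actually removed."""
    if mode not in ("delete", "dryrun", "interactive"):
        raise ValueError("unknown mode: %r" % mode)
    removed = []
    for path in paths:
        if getattr(DeleteFile(path, ask), mode)():
            removed.append(path)
    return removed


if __name__ == "__main__":
    # Throwaway file so the demo never touches real data
    scratch = os.path.join(tempfile.mkdtemp(), "scratch.txt")
    open(scratch, "w").close()
    process([scratch], mode="dryrun")   # reports only
    process([scratch], mode="delete")   # actually removes the file
```

Running a dry run first and switching to `interactive` or `delete` only after checking the report is the cautious way to use the three modes.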
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from __future__ import print_function

import hashlib
import os
import sys


class diskwalk(object):
    def __init__(self, path):
        self.path = path

    def paths(self):
        path = self.path
        files_in_path = []
        for dirpath, dirnames, filenames in os.walk(path):
            for each_file in filenames:
                fullpath = os.path.join(dirpath, each_file)
                files_in_path.append(fullpath)
        return files_in_path


def create_checksum(path):
    checksum = hashlib.md5()
    # Read in binary mode, 8 KB at a time, so large files are not loaded whole
    with open(path, 'rb') as fp:
        while True:
            buffer = fp.read(8192)
            if not buffer:
                break
            checksum.update(buffer)
    return checksum.digest()


def findDupes(path):
    record = {}  # (size, MD5 digest) -> first file seen with that key
    dup = {}     # duplicate file -> original file
    d = diskwalk(path)
    files = d.paths()
    for each_file in files:
        compound_key = (os.path.getsize(each_file), create_checksum(each_file))
        if compound_key in record:
            dup[each_file] = record[compound_key]
        else:
            record[compound_key] = each_file
    return dup


class deletefile(object):
    def __init__(self, file_name):
        self.file_name = file_name

    def delete(self):
        print("Deleting %s" % self.file_name)
        os.remove(self.file_name)

    def dryrun(self):
        print("Dry Run: %s [NOT DELETED]" % self.file_name)

    def interactive(self):
        try:
            # Python 2
            answer = raw_input("Do you really want to delete: %s [Y/N]" % self.file_name)
        except NameError:
            # Python 3, where raw_input was renamed to input
            answer = input("Do you really want to delete: %s [Y/N]" % self.file_name)
        if answer.upper() == 'Y':
            os.remove(self.file_name)
        else:
            print("Skipping: %s" % self.file_name)


def main():
    directory_to_check = sys.argv[1]
    duplicate_file = findDupes(directory_to_check)
    for each_file in duplicate_file:
        delete = deletefile(each_file)
        delete.interactive()


if __name__ == '__main__':
    main()
The first command-line argument is the directory to check.
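The script reads that directory straight from `sys.argv[1]`, so running it without an argument fails with an IndexError. One possible way to tighten this up is sketched below with the standard `argparse` module (Python 3). Note that the `--mode` option and the `parse_args` helper are hypothetical additions, not part of the original script, and defaulting to a dry run is an assumption about what is safest.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the command line: one required directory plus an optional mode."""
    parser = argparse.ArgumentParser(description="Find and remove duplicate files.")
    parser.add_argument("directory", help="directory to scan for duplicates")
    parser.add_argument(
        "--mode",
        choices=["dryrun", "interactive", "delete"],
        default="dryrun",  # assumption: the non-destructive mode is the default
        help="what to do with each duplicate (default: dryrun)",
    )
    args = parser.parse_args(argv)
    if not os.path.isdir(args.directory):
        # parser.error prints a usage message and exits with status 2
        parser.error("not a directory: %s" % args.directory)
    return args


if __name__ == "__main__":
    options = parse_args()
    print("Would scan %s in %s mode" % (options.directory, options.mode))
```

With something like this in place, a missing or invalid directory produces a usage message instead of a traceback, and the chosen mode could then be passed on to the deletion step.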
Original link: https://www.cnblogs.com/ivictor/p/4377609.html