ZimDMS custom tool
(for the latest version & more information see the project page at http://www.inrim.it/~magni/zimDMS.htm)
For a long time I have been looking for a system that lets me document all my digital data in a wiki-like way. The purpose is to help me find my way around a huge and ever-growing data structure, where I keep my whole digital life.
Data Management Systems (DMS) seem to provide this, but they are cumbersome and oriented towards networked, multi-user setups. I am interested in personal use instead, but on directory trees with thousands of subfolders.
The wiki concept, and Zim in particular, is in itself a perfect way to solve this problem, since links to subfolders ("node children") are easy to create for navigation.
Its only problem (in my opinion) is that Zim considers any *.txt file to be a Zim node, so it will write a Zim-related header into every text file it finds. The only workaround I found is this Python script I wrote, which crawls a given directory (D) and creates a Zim-compatible mirror structure (in ZD).
When launched, it crawls the D folder structure (skipping the paths explicitly excluded on the command line, for stuff you are not interested in documenting). When it reaches a folder that is not present as a node in ZD, it creates the corresponding node in ZD. The Zim node is simply a title and the list of its children. Symlinks are supported.
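The mapping from a folder in D to its node in ZD can be sketched as follows. This is a minimal illustration written for Python 3, not the script's own code: the helper names `zim_paths` and `node_text` are hypothetical, while the header lines and the `**CHILDREN NODES**` marker are taken from the script further down. It assumes spaces in directory names become underscores, as the script does.

```python
import os

def zim_paths(dirpath, rootname, zimdepo):
    """Map a directory under rootname to its mirror folder and node file in zimdepo."""
    # Zim-compatible names: spaces are replaced by underscores
    rel = os.path.relpath(dirpath, rootname).replace(' ', '_')
    zimpath = os.path.join(zimdepo, rel)
    return zimpath, zimpath + '.txt'

def node_text(dirpath, children, version="0.96"):
    """Build the text of a freshly generated node: header, title link, children list."""
    lines = [
        'Content-Type: text/x-zim-wiki',
        'Wiki-Format: zim 0.4',
        'X-Zimdms: ' + version,
        '',
        '[[file://%s|%s]]' % (dirpath, dirpath),  # title pointing at the real folder
        '**CHILDREN NODES**',
    ]
    for child in sorted(children):
        name = os.path.basename(child)
        lines.append('* [[+%s|%s]]' % (name, name))  # '+name' links resolve under this page
    return '\n'.join(lines) + '\n'
```

For example, `zim_paths('/home/u/My Docs/a', '/home/u', '/home/u/zim')` yields the folder `/home/u/zim/My_Docs/a` and the node file `/home/u/zim/My_Docs/a.txt`.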
Subsequent executions are much faster. Both new nodes and nodes whose subdirectories changed are rewritten with the new list of subdirectories. Any subfolder present in ZD but no longer in D is deleted from ZD //if and only if// it has never been manually edited. Otherwise it is marked as belonging to a deleted subtree and left alone. This makes it perfect to run periodically, e.g. nightly via crontab.
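The "delete only if never edited" rule can be illustrated with a small Python 3 sketch. The function names `node_body` and `action_for_stale_node` are hypothetical. The sketch assumes, as the script's own `retbody()` routine does, that the user-editable body sits between the title link (`[[file...`) and the `**PARENT NODES**` line, and that a body containing only whitespace counts as never edited.

```python
def node_body(text):
    """Return the user-editable lines between the title link and **PARENT NODES**."""
    lines = text.splitlines()
    # the body starts right after the first line containing the title link
    start = next((i + 1 for i, line in enumerate(lines) if '[[file' in line), None)
    if start is None:
        return []
    body = []
    for line in lines[start:]:
        if line.strip() == '**PARENT NODES**':
            break
        body.append(line)
    return body

def action_for_stale_node(text):
    """Decide what to do with a node whose original directory has disappeared."""
    edited = any(line.strip() for line in node_body(text))
    # edited nodes are kept and only tagged; untouched ones can be removed safely
    return 'mark-deleted' if edited else 'delete'
```

The design choice here is conservative: anything the user typed is treated as worth keeping, even when the directory it documented is gone.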
You are of course free to create new nodes in the resulting ZD structure: subsequent runs of the script will leave those nodes alone.
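Hand-made nodes can be told apart from generated ones by their header. The following Python 3 sketch mirrors the idea behind the script's `iszimDMSnode()` check. The name `is_zimdms_node` is hypothetical, and the sketch assumes the generated header always has the `X-Zimdms:` line in third position, as written by the script's `addheader()`.

```python
def is_zimdms_node(path):
    """True if the file starts with a Zim header that carries an X-Zimdms line."""
    try:
        with open(path, 'r') as f:
            # readline() returns '' past end of file, so head always has 3 items
            head = [f.readline() for _ in range(3)]
    except OSError:
        return False  # missing or unreadable file: not one of ours
    return (head[0] == 'Content-Type: text/x-zim-wiki\n'
            and head[2].startswith('X-Zimdms:'))
```

A page created by hand in Zim has the same first line but no `X-Zimdms:` entry, so it is reported as not generated and can be skipped.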
- -z (--zimdepo) : position of Zim repository (default ~/zim)
- -r (--rootname) : position of root directory to explore (default ~)
- -f (--force): force rewriting all nodes - preserves however any custom modification (default False)
- -b (--backup): backup in directory zimdepo_backup (default False)
- -d (--scandotdir): scan also inside directories starting with . (default False)
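Besides these options, the script accepts directories to exclude as positional arguments (d1 ... dn in its usage string). How exclusions and dot-directories might be filtered is sketched below in Python 3. The function `should_scan` is hypothetical and assumes absolute, normalised paths. Unlike the plain prefix test used in the script, it matches whole path components, so excluding `/home/u/zim` does not also exclude `/home/u/zimmer`.

```python
import os

def should_scan(path, noscan, scandotdir=False):
    """Return True if path should be crawled."""
    # skip anything inside a dot-directory unless explicitly enabled
    if not scandotdir and any(part.startswith('.') for part in path.split(os.sep)):
        return False
    for excluded in noscan:
        # match the excluded directory itself or anything below it
        if path == excluded or path.startswith(excluded + os.sep):
            return False
    return True
```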
Example execution times (HP Proliant ML350 quad-core Xeon CPU 1.86GHz):
- n. directories: 4600
- initial scan: ~14 sec
- initial notebook upgrade (only once): ~13 min
- zim folder total size: 37MB
- maintenance scans: ~13 sec
- 0.7 change zimfolder structure, make it hierarchical as a map of your rootname
- 0.7 opt arglist of directories not to scan
- 0.7 even if arglist empty, DO NOT attempt to scan zimdepo anyway
- 0.7 remove subtree: only if empty
- 0.94 remove node: only if body empty - else tag it as deleted in body and do not list it in the 'children' section
- 0.7 able to identify zimDMS/not zimDMS nodes - based on title
- 0.7 non-zimDMS nodes never deleted, never added CHILDREN section; remove only if never modified from default generation
- 0.7 write down children in node files
- 0.91 include symlinks
- 0.7 print warning not to mess up after CHILDREN NODES
- 0.7 add --f switch to force the update of all the nodes - in case of script update
- 0.7 change in all zimDMS nodes ' ' to '_', correct equality tests
- 0.8 set children links in Home.txt too
- 0.9 write down parent node
- 0.91 better formatting
- 0.92 title sends you to filemanager
- (to do) collect readme.txt and directory.jpg files when crawling ?
- OK check: moving wiki subtree, then moving directories works ?
- OK check: is the wiki portable on devices ?
- 0.92 backup wiki
- 0.93 better routine iszimDMSnode to check if a node is a zimDMS node
- 0.94 improved update() routine
- 0.95 don't write children if they're on the noscan list
- 0.95 added option to exclude .* directories
- 0.96 add info after run (% of edited nodes, n. of deleted-in-root nodes etc)
- 0.96 implement backup rotation scheme (see http://www.computer-repair.com/Backup.htm, possibly GFS scheme) > added as external routine
- (to do) alternate/add new way to exclude dirs: .zimDontScan files in root dir
UNSOLVED
- OK after some modifications (e.g. delete subtrees) seems mandatory do a zim --index zimdir -> solved, header problem
- OK set default node background/color (not possible here: change GTK theme) -> set env
#! /usr/bin/env python
"""
Copyright 2010 Alessandro Magni [email protected]
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
zimDMS
DIRECTORY CRAWLER
GENERATING A ZIM-COMPATIBLE WIKI STRUCTURE
crawler program scanning a given (rootname) root directory,
and creating (in zimdepo) a Zim-compatible mirror structure of rootname
example execution times (HP Proliant ML350 quad-core Xeon CPU 1.86GHz)
n.directories: ~4800.
initial scan: ~ 14sec.
initial notebook upgrade (only once): ~ 13min.
zim folder total dimension: 37MB.
maintenance scans: ~ 17sec.
force scans: ~ 25sec.
TODO ('-' == yet to do)
0.7 change zimfolder structure, make it hierarchical as a map of your rootname
0.7 opt arglist of directories not to scan
0.7 even if arglist empty, DO NOT attempt to scan zimdepo anyway
0.7 remove subtree: only if empty
0.94 remove node: only if body empty - else tag it as deleted in body
and do not list it in the 'children' section
0.7 able to identify zimDMS/not zimDMS nodes - based on title
0.7 non-zimDMS nodes never deleted, never added CHILDREN section
remove only if never modified from default generation
0.7 write down children in node files
0.91 include symlinks
0.7 print warning not to mess up after CHILDREN NODES
0.7 add --f switch to force the update of all the nodes - in case of script update
0.7 change in all zimDMS nodes ' ' to '_', correct equality tests
0.8 set children links in Home.txt too
0.9 write down parent node
0.91 better formatting
0.92 title sends you to filemanager
- collect readme.txt and directory.jpg files when crawling ?
OK check: moving wiki subtree, then moving directories works ?
OK check: is the wiki portable on devices ?
0.92 backup wiki
0.93 better routine iszimDMSnode to check if a node is a zimDMS node
0.94 improved update() routine
0.95 dont write children if they're on the noscan list
0.95 added option to exclude .* directories
0.96 add info after run (% of edited nodes, n. of deleted-in-root nodes etc)
0.96 implement backup rotation scheme (see http://www.computer-repair.com/Backup.htm, possibly GFS scheme)
> added as external routine
- alternate/add new way to exclude dirs: .zimDontScan files in root dir
UNSOLVED
OK after some modifications (e.g. delete subtrees) seems mandatory do a zim --index zimdir -> solved, header problem
OK set default node background/color (not possible here: change GTK theme) -> set env
"""
VERSION="0.96"
import os, sys, errno
from subprocess import call
import shutil
import re
import tarfile
import glob
import datetime
import time
import pickle
from optparse import OptionParser
defrootname=os.path.expanduser('~')
defzimdepo=os.path.join(os.path.expanduser('~'),"zim")
rootname=''
zimdepo=''
scandotdir=False
dotre=re.compile('.*/\.')
totn, newn, deln, edtn = 0,0,0,0
# key: current directory
# value: list
# 0:list of children directories
# 1:list of parent directories
# used to check if a tree under/above a given node is changed
mappa={}
# all the symlinks under rootname.
# key: symlink; value: target
symlinks={}
# list of directories not to scan
noscan=[]
# -------------------------------------------------------------------------------------------------------------------
# -------------------------------------------- FUNC DEFINITIONS --------------------------------------------------
# -------------------------------------------------------------------------------------------------------------------
# mkdir -p behaviour
def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc: # Python >2.5
        if exc.errno != errno.EEXIST:
            raise
def addheader(x,n):
    # add Zim-like header in empty x file for nodename n
    F=open(x,'w')
    F.write('Content-Type: text/x-zim-wiki\n')
    F.write('Wiki-Format: zim 0.4\n')
    F.write('X-Zimdms: '+VERSION+'\n\n')
    # t=datetime.datetime.now()
    # F.write('Creation-Date: '+str(t).replace(' ','T')+'\n') # format: Creation-Date: 2010-09-17T14:09:00.910925
    # title line: a link that opens the real directory in the file manager
    F.write('[[file://'+n+'|'+n+']]\n')
    F.close()
def bodyinsert(zt,m):
    # inserts the string m just after the X-Zimdms header - if not already present
    os.rename( zt, zt+"~" )
    destination= open( zt, "w" )
    source= open( zt+"~", "r" )
    for line in source:
        destination.write( line )
        if 'X-Zimdms:' in line:
            l=source.next()
            if not m in l:
                destination.write(m)
            destination.write(l)
    source.close()
    destination.close()
    os.remove(zt+"~")
def getchildren(node):
    # returns an array to be given as children to mappa dictionary:
    # dont report dotdirs if necessary, dont report children in noscan[]
    ch=[]
    ld=os.listdir(node)
    for child in ld:
        report=1
        s=os.path.join(node,child)
        if os.path.isdir(s):
            if not scandotdir and re.match(dotre, s):
                report=0
                continue
            for ns in noscan:
                if s.startswith(ns):
                    report=0
                    continue
            if report==1: ch.append(s)
    return ch
def getparents(node,osnode):
    # returns an array containing all the nodes linking to node
    # it is called also with the real OS name, to look for symlinks
    global symlinks
    par=[os.path.dirname(node)] # init it with the direct parent node
    for k,v in symlinks.items():
        if v==osnode:
            par.append(os.path.dirname(k)) # append the directory where the symlink is (key: symlink; value: target)
    return par
def lll(dirname):
    # find all symlinks under dirname
    sy={}
    for root,dirs,files in os.walk(dirname):
        if dirs:
            for node in dirs:
                full=os.path.join(root,node)
                if os.path.islink(full):
                    sy[full]=os.readlink(full)
    return sy
def iszimDMSnode(zn):
    # check by header if the file zn is a zimDMS file
    # go for something better in files header...
    if os.path.isfile(zn):
        F=open(zn,'r')
        line=F.readline()
        if line=="Content-Type: text/x-zim-wiki\n": # touch only zimfiles
            for i in range(2):
                line=F.readline()
            if 'X-Zimdms:' in line:
                return True
            else:
                # print'node ',zn,' doesnt seem a zimDMS node'
                return False
        F.close()
    else:
        return False
def recdelete(znode):
    # recursively deletes the zimDMSfiles and the relative directories under znode
    # starting from node and descending into the children
    # watchout for symlinks exiting from the subtree!
    global deln
    for dname in os.listdir(znode):
        d=os.path.join(znode,dname)
        if os.path.isdir(d):
            recdelete(d)
    a=os.listdir(znode)
    zt=znode+'.txt'
    if len(a)==0:
        body=retbody(zt)
        #if iszimDMSnode(zt):
        if body==[]:
            deln+=1
            os.rmdir(znode)
            os.remove(zt)
            orignode=str.replace(znode,zimdepo,rootname)
            if orignode in mappa:
                del mappa[orignode]
        else:
            bodyinsert(zt,'__ORIGINAL TREE NOT FOUND__\n')
    else:
        bodyinsert(zt,'__ORIGINAL TREE NOT FOUND__\n')
        print'warning - zim node ',znode,' not empty\nIt will remain in your wiki as long as you will not delete additional material manually'
def retbody(zf):
    # returns the body of a zimDMS file,
    # or None if it doesnt exist, or it isnt a zimDMS file,
    # or [] if body is empty
    if os.path.isfile(zf) and iszimDMSnode(zf):
        body=[]
        x=re.compile('.*\[\[file')
        F=open(zf,'r')
        for line in F:
            if re.match(x, line):
                break
        for line in F: # now we collect the body
            if line=='**PARENT NODES**\n':
                break
            else:
                body.append(line)
        F.close()
        for l in body:
            if not l.isspace():
                return body
        return []
    else:
        return None
def update(zf,n,c,p):
    # update zimDMSfile zf, node n, with list of children c, parents p
    global edtn
    body=retbody(zf)
    if body is not None:
        if body==[]:
            body='\n'*5
        else: edtn+=1
        addheader(zf,n)
        F=open(zf,'a')
        for a in body:
            F.write(a)
        # write down the parent&children links of node
        F.write('**PARENT NODES**\n')
        for x in sorted(p):
            if x==rootname:
                F.write('* [[Home|Home]]\n')
            elif x==p[0]: # the 1st is the real parent
                F.write('* [['+os.path.basename(x)+'|'+os.path.basename(x)+']]\n')
            else: # the rest are symlinks
                x=compatzn(x)
                x=str.replace(x,rootname+'/','')
                x=str.replace(x,'/',':')
                F.write('* [['+x+'|'+x+']](**symlink**)\n')
        F.write('**CHILDREN NODES**\n')
        for x in sorted(c):
            if os.path.islink(x):
                if x in symlinks:
                    x=symlinks[x]
                x=compatzn(x)
                x=str.replace(x,rootname+'/','')
                x=str.replace(x,'/',':')
                F.write('* [['+x+'|'+x+']](**symlink**)\n') # not using '+' it is absolute addressing
            else:
                F.write('* [[+'+os.path.basename(x)+'|'+os.path.basename(x)+']]\n') # links '+name' are resolved UNDER the current page
        F.close()
    else:
        pass # zf doesnt exist or isnt a zimDMS file
def compatzn(f):
    # converts a name to be zim compatible
    # i.e. at the moment a name where ' ' -> '_'
    f = f.replace(' ', '_')
    return f
def doBackup():
    # does a tar jcvf to the repository
    bck=zimdepo+'_backup'
    if not os.path.exists(bck):
        mkdir_p(bck)
    if os.path.exists(zimdepo):
        today = datetime.datetime.now()
        today=today.strftime("%Y-%m-%d")
        destination='repo'+today+'.bz2'
        destination = os.path.join(bck,destination)
        print'backup to dir ',destination
        if os.path.exists(destination):
            os.remove(destination)
        out = tarfile.TarFile.open(destination, 'w:bz2')
        out.add(zimdepo, arcname=os.path.basename(zimdepo))
        out.close()
# -------------------------------------------------------------------------------------------------------------------
# ------------------------------------------ MAIN ------------------------------------------------------------
# -------------------------------------------------------------------------------------------------------------------
def main():
    global symlinks, mappa
    global noscan
    global totn, newn, deln
    global rootname,zimdepo
    global scandotdir # read by getchildren(), so it must be set at module level
    parser = OptionParser(usage="Usage: %prog [options] [d1 ... dn]",
                          description="""[email protected] Alessandro Magni
Version """+VERSION+"""
crawler program scanning a given (rootname) root directory,
and creating (in zimdepo) a Zim-compatible mirror structure of rootname
d1..dn are directories excluded from crawling""")
    parser.add_option("-z", "--zimdepo",
                      dest="zimdepo",default=defzimdepo,
                      help="position of Zim repository (default ~/zim)")
    parser.add_option("-r", "--rootname",
                      dest="rootname",default=defrootname,
                      help="position of root directory to explore (default ~)")
    parser.add_option("-f", "--force",
                      action="store_true", dest="force",
                      help="force rewriting all nodes - preserves however any custom modification (default False)")
    parser.add_option("-b", "--backup",
                      action="store_true", dest="bck",
                      help="backup in directory zimdepo_backup (default False)")
    parser.add_option("-d", "--scandotdir",
                      action="store_true", dest="scandotdir",
                      help="scan also inside directories starting with . (default False)")
    (options, args) = parser.parse_args()
    r=parser.rargs
    zimdepo=options.zimdepo
    rootname=options.rootname
    force=options.force
    bck=options.bck
    scandotdir=options.scandotdir
    if force:
        print'updating all the nodes in the structure'
    if scandotdir:
        print'scanning of dot-directories enabled'
    # strip trailing slashes first, so that the backup lands in zimdepo_backup
    if zimdepo[-1]=='/': zimdepo=zimdepo[:-1]
    if rootname[-1]=='/': rootname=rootname[:-1]
    # Backup Wiki ---------------------------------------------------
    if bck:
        doBackup()
    dictdumpfile=os.path.join(zimdepo,'mappa.dump')
    if not os.path.exists(zimdepo):
        mkdir_p(zimdepo)
    for a in args:
        ba=os.path.abspath(a)
        noscan.append(ba)
    noscan.append(zimdepo)
    print'Directories not to be crawled:'
    for a in noscan:
        print '> '+a
    t0 = datetime.datetime.now()
    # load mappa dictionary if present -------------------------------------------
    if os.path.isfile(dictdumpfile) and os.path.getsize(dictdumpfile) > 0: # cannot pickle empty file
        F=open(dictdumpfile,"r")
        mappa=pickle.load(F)
        F.close()
        for k in list(mappa): # iterate over a copy of the keys: entries are deleted in the loop
            if not scandotdir and re.match(dotre, k): del mappa[k]
        print len(mappa),' entries on map file'
    else:
        print'map file not present'
    symlinks=lll(rootname)
    # recursive walk on zimdepo to possibly delete subtrees
    # given a node.txt and node dir: does it exists in rootname?
    #    yes
    #    no: is it empty?
    #        yes -> delete node.txt & directory
    #        no  -> do nothing, just warn.
    for zimroot,zimdirs,files in os.walk(zimdepo):
        if zimdirs:
            for znode in zimdirs:
                if znode!='.zim': # .zim is always present
                    znode=os.path.join(zimroot,znode)
                    if not scandotdir and re.match(dotre, znode):
                        os.rmdir(znode)
                        if os.path.exists(znode+'.txt'): os.remove(znode+'.txt')
                    realdir=str.replace(znode,zimdepo,rootname)
                    realdir=realdir.replace('[', '[[]') # 2 lines to compare independently
                    realdir=realdir.replace('_', '[ _]') # of spaces in dirnames
                    # if not os.path.exists(realdir):
                    if not glob.glob(realdir):
                        recdelete(znode)
    # --- Check the directory structure ---
    # 1st the root node
    x=os.path.join(zimdepo,'Home.txt')
    addheader(x,rootname)
    ch=getchildren(rootname)
    addheader(x,rootname)
    F=open(x,'a')
    F.write('\n\n {{../Home.jpg}}')
    F.write('\n\n\n\n\n\n**CHILDREN NODES**\n')
    for c in sorted(ch):
        F.write('* [['+os.path.basename(c)+'|'+os.path.basename(c)+']]\n') # links '+name' are resolved UNDER the current page
    F.close()
    # then recursive walk from the children downward
    for root,dirs,files in os.walk(rootname):
        if dirs:
            for node in dirs:
                proceed=1
                nodeupdate=0
                osnode=os.path.join(root,node) # osnode before conversion
                node=compatzn(osnode)
                if os.path.islink(node): proceed=0
                if not scandotdir and re.match(dotre, node): proceed=0
                for ns in noscan:
                    if node.startswith(ns):
                        proceed=0
                if proceed:
                    totn+=1
                    zimpath=str.replace(node,rootname,zimdepo)
                    zimname=zimpath+'.txt'
                    ch=getchildren(osnode)
                    pa=getparents(node,osnode)
                    if not os.path.exists(zimpath):
                        print'node ',node,' is new'
                        newn+=1
                        nodeupdate=1
                        mkdir_p(zimpath)
                        addheader(zimname,osnode)
                        #mappa[node]=[ch,pa]
                    if node in mappa:
                        d=mappa[node]
                        savedch,savedpa = d[0],d[1]
                        if sorted(ch) != sorted(savedch) or sorted(pa) != sorted(savedpa):
                            nodeupdate=1
                            print 'children/parents of node ',node,' changed'
                            #mappa[node]=[ch,pa]
                    else:
                        nodeupdate=1
                    mappa[node]=[ch,pa]
                    if nodeupdate==1 or force:
                        update(zimname,osnode,ch,pa)
    # save a clean dictionary
    mappaclean={}
    for root,dirs,files in os.walk(rootname):
        if dirs:
            for node in dirs:
                node=compatzn(os.path.join(root,node))
                if node in mappa:
                    mappaclean[node]=mappa[node]
    F=open(dictdumpfile,"w")
    pickle.dump(mappaclean,F)
    F.close()
    if len(mappaclean)!=totn:
        print len(mappaclean),' entries on map file, ',totn,' total nodes'
    print "Scanned a total of %d nodes, among which %d were new and %d resulted deleted" % (totn,newn,deln)
    if force==True:
        print"number of non-default (edited) nodes is %d (%.1f %%)" % (edtn,100.*edtn/totn)
    print "\n\nEdit whatever you want, but only between the title and the CHILDREN NODES line"
    delta_t = datetime.datetime.now() - t0
    print "Time needed ",delta_t," sec"
if __name__ == '__main__':
    main()