#!/usr/bin/perl
# This program "find-documents" searches through your notes for a pattern:
#   a grep-like tool with outline context for personal notes (see below for much more detail).
# To view colors after piping through "less", use raw mode, "less -r", or set "export LESS=-r".
#
# Here is an example search of my file "commands.readme" for "acroread",
# where the single searched file includes the contiguous lines,
#   13. To read Adobe (Acrobat) PDF (postscript based) files in X, use
#       xpdf     # did not always convert pdf to ps correctly.
#       pdftops  # command line conversion of pdf to ps (works as well as acroread)
#                # "pdftops -t3 -l17" generates postscript for pages 3-17.
#       gs
#       acroread # from Adobe itself (has pretty front-end)
# "find-documents acroread commands.readme" replies,
#   commands.readme: 13. To read Adobe (Acrobat) PDF (postscript based) files in X, use
#   commands.readme:     acroread # from Adobe itself (has pretty front-end)
# Here, the search word "acroread" gets highlighted red;
# and, on its first instance, the search filename "commands.readme" gets highlighted green.
#
# When working with my notes, I make backup copies like "fax.old3"
# or some variation of "fax.12-25-1999", so I want to see only current versions.
# For this, I separately created the simple one line command "lsnew",
#   /bin/ls -1 $* |egrep -v '.old|~$|^#|\.[0-9]+-[0-9][0-9]' |xargs ls -dCF --color=auto
#
# This program "find-documents" falls into the category of "Personal Information Manager", PIM.
# Alternatives besides this program "find-documents" include,
#   a. "sgrep" does a nice search, handling exclusions very cleanly,
#      though the authors state that their program does not handle regular expressions.
#   b. "Remembrance Agent" from MIT, which doesn't list contents of matches,
#      but pops up relevant filenames/mail with a rating to text you write.
#   c. thebrain.com
# Perhaps you have heard of similar software or methods to keep notes:
# give me its name or tell me about it.
# Why have a program to search notes?
#   By 1995, I had written so many notes on paper that I often searched through them for 45 minutes.
#   Thereafter, I put almost all new notes on computer.
#   By 2000, I had 55,000 lines of notes in 175 files, including subdirectories.
#   Once again, I sometimes spent 45 minutes searching for a note.
#
# HOW MIGHT I KEEP NOTES?
#   a. In perhaps 20 large files, whose filename categorizes data well.
#      This approach prevents the spread of notes to numerous other files and subdirectories.
#      However, some of my files approach 1000 lines, so scrolling 50 lines at a time,
#      I would need to scroll through 20 screens.
#   b. In numerous small files.
#      With perhaps only 100 lines per file, such files can be scrolled through quickly.
#      Here, the filename categorizes extremely well,
#      but I must consider which of a few hundred filenames contains my note.
#   c. In subdirectories (hierarchical filesystem approach).
#      Here, I might have 20 directories with 10 files in each.
#      This is like "numerous small files" approach in (b), but in directories.
#      This approach helps categorize to the level of the "20 large files" approach in (a).
#      Standard unix tools can work with subdirectories with an option or through
#        find . -type f |xargs grep some-pattern
#*  d. Outline within files.
#      This helps while you scroll through a file, but unix tools like grep will
#      not appropriately output this outline structure.
#      With this "find-documents" tool, I make heavy use of this outline approach.
#   e. Database.
#      The database approaches, with which I am familiar, produce too much information.
#      They would output the equivalent of a whole file of notes,
#      or all lines in a level of that file's outline.
#      For example, in this file, if I searched for "outline", I wouldn't want every line
#      in this paragraph, but only this paragraph's 1st line and the 3rd line.
#   f. "Note-taking" software.
#      I suspect that such software has the same limitations as the "Database" approach above.
#      If it doesn't provide too much information, it provides too little at one stage of use;
#      eg, the equivalent of filenames but not their contents.
#
# "HOW SHOULD I KEEP NOTES?"
#   In Unix fashion, I sought flexibility.
#   Keeping my notes in files rather than in some software-specific files allows me
#   to still use grep, less, and vim for corrections.
#
#   Here is how I chose to keep notes, reflecting the primary search method of this "find-documents".
#   a. In files and directories.
#      When a file gets over 1000 lines, I consider creating a couple smaller files,
#      or creating a directory with a couple smaller files.
#      Currently, I have only one file, "www.readme", with over 1000 lines.
#   b. Within each file, I embed a crude outline structure.
#      b1. My coarsest outline structure is easily seen through repeated characters; eg,
#            ~~~~~~~~~~~~PORTSENTRY~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#          Over the years, I have chosen different repeat characters, including ~#*-+=:
#          This program "find-documents" allows for spacing at line ends,
#          searching for a trailing run of 8 identical characters.
#          I call these parts "sections".
#          You can see this coarsest outline structure of a file by merely; eg,
#            egrep '~~~~~~~' filename.readme
#      b2. Within these coarse entries, I have numbered entries; eg,
#            15. configuration
#          This program "find-documents" allows for spacing at line beginnings,
#          searching for a number followed by period.
#          I call these entries "topics".
#   Many of my files have only one or none of this outline's structure.
#   You will want some sort of structure so you can display more than "grep" displays.
#   If you choose a different outline form, you can probably easily modify this program,
#   or send me a note.
#
#   This approach provides three, perhaps four levels of an outline,
#     1. directory name
#     2. file name
#     3. section within file: those repeated character lines #############
#     4. topic entries: those numerical entries "15."
#
# "HOW DO I SEARCH MY NOTES?"
#   I sought to provide a grep-like tool, outputting line matches
#   and ANY OUTLINE-CONTEXT (with highlighting as appropriate).
#   Because this tool "find-documents" prints out the outline-context,
#   if you only need two levels of context, ANY file organization works well,
#   including under "HOW MIGHT I KEEP NOTES?" (a), (b), and (c) above,
#   together with "(d) outline within files".
#   So, you could just as well put all your notes in a directory structure
#   as in a single file.
#
#   This program "find-documents" excludes probable backup files;
#   for example, files ending in ~, .10-15-1998, or .old3 .
#
# Licensing/copyright: GPL [free].
###POSSIBLE PROGRAM CHANGES:####################################################
# 1. 11/8/2002: Recently, I have noticed that some large directories (50MB),
#    like an uncompressed HOWTO directory, take 30 seconds while "egrep" takes but 2 seconds.
#    I might look for where this perl program is slow.
# 2. I have seen software,
#      ***** outman
#    to peruse a list with filenames, allowing clicking those filenames to enter them.
#    Convenience could be added by piping this find-documents output into a program like "outman",
#    but I haven't the proper gnome libraries to compile "outman",
#    so I await those libraries' inclusion in the next round of Debian Linux after "woody": "sarge".
# 3. 11/8/2002: There is currently no compress option to find strings in compressed files.
#    I currently have no plans to add such an option.
#######THE PROGRAM:#############################################################
# The following three lines are irrelevant to this program.
# $* = 1 ;    #Perl does multi-line searches; THIS IS DEPRECATED TO /.../.../m, SO USE "m" OR "s".
# $/ = "" ;   #Brings in a paragraph rather than a line; IF UNSET, perl READS TO END OF FILE.
# undef $/ ;  #If unset, perl reads to end of file, rather than to end of line; see "man perlfaq6".

# Options. For details, see "perldoc Getopt::Long".
use Getopt::Long ;
Getopt::Long::config("no_ignore_case") ; #Make options case sensitive
# In the following GetOptions, the first alternative determines the option name;
# eg, "help|HELP|h|H" controls $opt_help.
GetOptions("recursive", "casesensitive|I", "nohidden", "nosection", "help|HELP|h|H") ;
if ($opt_help == 1)
{
    print "find-documents [-r] [-I] [--nohidden] [--nosection] [--help] search-string2 [files3] \n" ;
    print "Default files3 is current directory's files.\n" ;
    print "-r [--recursive] recursively search.\n" ;
    print "-I [--casesensitive] don't ignore character case (ie, case sensitive).\n" ;
    print "--nohidden will not search hidden \".\" files.\n" ;
    print "--nosection will not search for section-headings like ~~SPlus~~~~~~~.\n" ;
    print "-h [-H] [--help] will print options this program allows.\n" ;
    print "For further documentation, see top half of this executable, \"find-documents\". \n" ;
    exit ;
}
if ($opt_casesensitive != 1)
{
    $pattern = "(?i)$ARGV[0]" ; #on command-line, pattern occurs before any filenames; ignore case through "(?i)".
}
else
{
    $pattern = $ARGV[0] ; #on command-line, pattern occurs before any filenames.
}
shift(@ARGV) ; #drops $pattern/$ARGV[0] from remaining @ARGV list.
#############FILE NAMES BEGIN#####################
#Now to get the list of files.
if ($opt_recursive == 1) #recursively get all files from @ARGV, else from current directory.
{
    if ($#ARGV < 0) #Note: "$#ARGV = 0" corresponds to $ARGV[0]; "-1" corresponds to no arguments: @ARGV then empty.
    {
        $ARGV[0] = '.' ; #current directory.
    }
    # I created the following code primarily with "find2perl -type f".
    # It originally used: require "find.pl"; &find(@ARGV);
    # "find.pl" is a Perl 4 library that recent Perl releases no longer ship,
    # so this uses its core replacement, File::Find, which sets $File::Find::name to the full path.
    use File::Find ;
    # find(\&wanted, "/home/jameson/unix") ;
    find(\&wanted, @ARGV) ;
    sub wanted
    {
        (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) && -f _ && push(@filelist, $File::Find::name) ;
    }
}
elsif ($#ARGV >= 0) #look only at files listed within @ARGV, including both files in @ARGV and files in any @ARGV directory.
{
    #The following contains all @ARGV names as both files within directories .../*
    #and as files, so half the arguments will be extraneous.
    #note the last "" in join(... , ""), so join also appends "/*";
    #glob() then expands each resulting "name/*" pattern into the names inside that directory.
    @file_and_directory = ( glob(join("/* ", @ARGV, "")), @ARGV ) ;
    @filelist = grep(-f, @file_and_directory) ; #***find files from among the @file_and_directory arguments.
}
else #look only at files in current directory.
{
    @filelist = grep(-f, <* .*>) ; #get files, including hidden (.??*) files.
}
#############FILE NAMES END#####################
$hold_line_number = -1 ; #so a match found before any topic line is stored in $hold_lines[0].
foreach $filename ( @filelist )
{
    # if ( $filename =~ /\.old|~$|^#|\.[0-9]+-[0-9][0-9]|\.swp$/ ) {next} #ignore backup files.
    # if ( $filename =~ /\.old|~$|^#|\.[0-9-]+-[0-9][0-9]|\.swp$/ ) {next} #ignore backup files.
    if ( $filename =~ /\.old|~$|^#|[\._][0-9-]+-[0-9][0-9]|\.swp$/ ) {next} #ignore backup files.
    if ( ! -T $filename ) {next} #ignore binary, *.dvi, ... files; we search "text" files only.
    if ( $opt_nohidden && $filename =~ /^\./ ) {next} #--nohidden option asks to ignore hidden files.
    ##################FIND PATTERNS AND CALL PRINT SUBROUTINE################
    open(READ_FILE, $filename) or next ; #skip any file that cannot be opened.
    while(defined($_ = <READ_FILE>))
    {
        # Color (ansi) of search word and filename is set according to:
        #   esc-character[41m #red; escape-character is \e
        #    0 regular
        #   30 black
        #   31 red
        #   32 green
        #   33 yellow
        #   34 blue
        #   35 magenta
        #   36 cyan/turquoise
        #   37 peach
        # ***highlighted, which I use:
        #   40 black          #worked badly on both light and dark backgrounds.
        #   41 red            #****very good: I use this to highlight pattern matches.
        #   42 green          #***a tad bright on dark background.
        #   43 yellow         #*somewhat bright on dark background.
        #   44 blue           #too dark on a light background, but very nice on a dark background, as Eterm often sets.
        #   45 magenta        #*somewhat dark on light backgrounds.
        #   46 cyan/turquoise #bright on dark backgrounds.
        #   47 peach          #too bright on dark backgrounds.
        s/(${pattern})/\e[41m${1}\e[0m/g ; #highlight search word in red;
                                           #ignore case through $pattern set earlier; ${1} retains case.
        #
        #
        ####if ( /.*^(\s*\d+\..*${pattern}[^\n]*)/sm ) { $tempit = $1 } #CODE HERE when pull in whole file as one record-- see ".*^ at beginning!!!
        if ( /^\s*\d+\.\s+/ ) #if beginning of a numerical item; eg, "14. ...."
                              #Others may prefer "14)" or "III." ...
        {
            # AN ITEM [TOPIC] ENTRY HAS BEEN FOUND LIKE "14."
            print_topic() ; #print subroutine I define at end.
            $hold_lines[0] = $_ ; #Retaining this line prevents this code from being more direct [printing only at end of file].
            $hold_line_number = 0 ;
            if ( /${pattern}/ ) { $found_pattern = "yes" } #ignore case in search through $pattern setting earlier,
                                                           #alterable thru a command line variable.
        }
        # elsif ( /[a-zA-Z].*(\S)\1\1\1\1\1\s*$/ ) #Way too slow, taking minutes for what took 5 seconds.
        elsif ( /(\S)\1\1\1\1\1\1\1\s*$/ && /[a-zA-Z]/ && $opt_nosection != 1 ) #Last 8 characters are identical (one plus 7 repeats),
                                                                                #ignoring appended spaces.
        {
            # A SECTION-HEADING HAS BEEN FOUND LIKE,
            # ----Math Software---------------------------------------------------------------
            # If the $pattern has already been found, print the previous section-heading material.
            # If the current heading matches $pattern, print it now.
            # Otherwise, should a pattern later be found, print it then and undefine $section_heading then,
            # since I want to print this section-heading but once for any appropriately numbered-topic entries.
            # Because of this less frequent printing of the section-heading than the item [topic] entries,
            # this code cannot put $_ into $hold_lines[*], so this code differs from that of item [topic] entries above.
            # print_topic() ; #Print any previous section-heading material.
            # undef $section_heading ; #undefine any $section_heading set earlier in the file.
            if ( /${pattern}/ )
            {
                print "\e[42m", ${filename}, "\e[0m: ", $_ ; #print colored filename and section_heading.
                $first_line_printed = "yes" ; #Prevents coloring same $filename on two different output lines.
            }
            else
            {
                $section_heading = $_ ; #For later printing if appropriate.
            }
        }
        elsif ( /${pattern}/ ) #ignore case through $pattern setting earlier; alterable through a command line variable.
        {
            $found_pattern = "yes" ;
            ++$hold_line_number ;
            $hold_lines[$hold_line_number] = $_ ; #a single element, so use $hold_lines[...] rather than the slice @hold_lines[...].
        }
    } #ends "while" on $filename.
    print_topic() ; #print any last matching topic for this $filename.
    undef $first_line_printed ; #So next file will have its first line printed in color.
    undef $section_heading ; #So no saved $section_heading will be carried over to the next read file.
    ##################END FIND PATTERNS AND CALL PRINT SUBROUTINE################
} #ends "foreach" on @filelist.

# The following subroutine print_topic() is meant to be entered, after reading a new line, on any of the following three conditions,
#   a. Found a new section-heading, which are indicated by, eg, "-----"
#      (this call is currently commented out in the section-heading code above).
#   b. Found a new topic, which are indicated by, eg, "18." .
#   c. Found end-of-file.
#
sub print_topic
{
    if ( $found_pattern eq "yes" ) #At end of a file, print_topic() is invoked, yet $found_pattern can be undefined.
    {
        if ( $first_line_printed ne "yes" ) #Will print first output line of a file with a colored filename.
        {
            # print @hold_lines ; #print the last set of held-lines with $pattern
            if ( defined($section_heading) )
            {
                print "\e[42m", ${filename}, "\e[0m: ", $section_heading ; #print colored filename and section_heading.
                print ${filename}, ': ', $hold_lines[0] ;
            }
            else
            {
                print "\e[42m", ${filename}, "\e[0m: ", $hold_lines[0] ; #print colored filename and first line of held topic lines.
            }
            $first_line_printed = "yes" ;
        }
        else
        {
            if ( defined( $section_heading) )
            {
                print ${filename}, ': ', $section_heading ;
            }
            print ${filename}, ': ', $hold_lines[0] ; #not colored since not this file's first printed line.
        }
        foreach $print_line ( @hold_lines[1..$#hold_lines] ) #print remaining lines in the topic that has $pattern.
        {
            print ${filename}, ': ', $print_line ;
        }
        undef $section_heading ; #print an appropriate section heading but once; this line must be here, not below.
        undef $found_pattern ; #to begin collecting a new topic's hold lines; this line can be either here or below.
    }
    undef @hold_lines ; #to begin collecting a new topic's hold lines.
    $hold_line_number = -1 ; #to begin collecting a new topic's hold lines.
}
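
# A minimal sketch, not called anywhere above: the hypothetical helper
# classify_outline_line() restates the two outline tests from the main loop
# (numbered "topic" lines and repeated-character "section" lines) so that they
# can be tried on sample lines in isolation. The helper's name and its return
# values "topic", "section", and "text" are assumptions for illustration only,
# and the --nosection option is not consulted here.
sub classify_outline_line
{
    my ($line) = @_ ;
    return "topic"   if $line =~ /^\s*\d+\.\s+/ ;                          #eg, "15. configuration"
    return "section" if $line =~ /(\S)\1{7}\s*$/ && $line =~ /[a-zA-Z]/ ;  #eg, "~~~~PORTSENTRY~~~~~~~~"
    return "text" ;                                                        #any other line.
}
# Example use, left commented out so the program's behavior is unchanged:
#   print classify_outline_line("15. configuration\n"), "\n" ;   #prints "topic"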