Organizing Large Volumes of Email?
Trixter asks: "Like most nerds, I receive a large volume of email that I archive in several files and directories in a filesystem. This is inefficient, especially when it comes to searching for an old or obscure bit of information. I can imagine several better ways to organize email for archival and lookup, but has anyone already done this? I want to try avoiding reinventing the wheel for the tenth time this year. By 'better ways', I'm talking about all solutions--from the Perl monger 'one 10-line script will do the trick' perl script to parse up a long mbox-format file into little bits for intelligent grepping, to maybe an elegant 'mbox-format file to SQL database' loader/translator script and a series of SQL statements to support searches. Please, help me organize my gigabytes-long, decade-long email archive!"
Not an answer you may enjoy. (Score:1)
For example, in Lotus Notes, you can create an Archive database of your email and then perform full text searches on it. The solution is very easy and extremely non-technical. I'm sure there are other email programs that allow archiving and similar features.
Simple 3-step program (Score:2)
A few ideas (Score:3)
so I have something like this:
Pending
Misc
Friends
Pre-1998\Pending
Pre-1998\Misc
Pre-1998\Friends
Then I use hypermail to create an html archive of everything nightly, and put it into a password protected directory on my webserver. Then I use a regular web based search engine for searching.
Right now I'm playing around with doing all my mail via a web interface (using aeromail->imap) so I can access it securely (SSL) anywhere. It's working pretty well; I just need to figure out a good way to notify me when I get a new message (I'm thinking of an ICQ bot that sends me a message or something...)
I hope that helps. I'd be happy to work with anyone who wants to create a better shrink-wrapped system for managing large amounts of old email. To me it's important that, however it's stored, it is very portable, since I've changed email clients a lot over the years.
procmail and mutt (Score:2)
I also use Mutt (http://www.mutt.org), and since it knows about the Maildir format, mail is pre-sorted before I even see it!
After a few years, however, even this approach runs out of steam... but it's still more automatic than deliberately saving mail into different folders as you read it, and there's virtually no scripting/coding required.
b.g.
I can see two ways to go here (Score:2)
One, create an intranet (yes, I hate that word too, but everyone knows what I'm talking about) webserver. Use something like MHonarc [mhonarc.org] to archive the mail, and then install a search engine on the webserver.
Two, and I like this way better, write a couple of tools to handle this (perhaps based on MHonarc, heh) which will stick your mail in a database (Interbase, Postgresql, Mysql, whatever) and then do yet another web (or otherwise) application which will let you read them out and search them, interactively.
I personally hold onto my mail by renaming my mailbox (For instance, on the 240sx.org mailing list) to 240sx-preYYYYMMDD, then creating a new 240sx mailbox. I use Mozilla on win32 for email (despite the amazing crashiness - It feels like I'm using Corel software or something) and it has a fairly decent search facility.
Don't sort it out if you don't have to (Score:3)
The other necessary item is the fastest file content search utility you can lay hands on.
Indices, tables of contents, folders, categories, catalogs, and directories will always have misfiles after a certain period of time, and besides, who wants to categorise all that mail anyway? Dump it into a few (time-based?) directories and simply search for things when you need them.
Lotus Notes?!? (Score:1)
Sorry, seeing "very easy and extremely non-technical" associated with Lotus Notes kind of set me off...
Multiple directories works for me... (Score:1)
Your library of e-mail sounds complicated enough to warrant dumping into a nice PostgreSQL database or something and writing a frontend to search the thing... You could use Perl... I'd probably play with PHP on my web server. That's about the best idea I can come up with.
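For anyone who wants to try the database route, here is a rough sketch of an mbox loader plus a keyword search. It is only an illustration under some assumptions: it uses Python's standard mailbox module and SQLite instead of PostgreSQL (so it runs without a server), the table layout and the names load_mbox, search and body_text are made up for this example, and only the first text/plain part of each message is stored.

```python
import mailbox
import sqlite3

# Hypothetical table layout, chosen for this sketch only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id      INTEGER PRIMARY KEY,
    sent    TEXT,
    sender  TEXT,
    subject TEXT,
    body    TEXT
)
"""


def body_text(msg):
    """Return the first text/plain part of a message as a string."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            try:
                return payload.decode(charset, errors="replace")
            except LookupError:
                # Unknown charset name in the mail: fall back to UTF-8.
                return payload.decode("utf-8", errors="replace")
    return ""


def load_mbox(conn, mbox_path):
    """Insert every message of an mbox file; return how many were loaded."""
    conn.execute(SCHEMA)
    box = mailbox.mbox(mbox_path, create=False)
    try:
        rows = [
            (
                str(m.get("Date", "")),
                str(m.get("From", "")),
                str(m.get("Subject", "")),
                body_text(m),
            )
            for m in box
        ]
    finally:
        box.close()
    with conn:  # commit all inserts as one transaction
        conn.executemany(
            "INSERT INTO messages (sent, sender, subject, body) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


def search(conn, word):
    """Return (sent, sender, subject) for messages mentioning the word."""
    # LIKE is case-insensitive for ASCII in SQLite; % and _ in the word
    # act as wildcards here.
    pattern = f"%{word}%"
    return conn.execute(
        "SELECT sent, sender, subject FROM messages "
        "WHERE subject LIKE ? OR body LIKE ? ORDER BY id",
        (pattern, pattern),
    ).fetchall()
```

Typical use would be conn = sqlite3.connect("mail.db"), then load_mbox(conn, "some-mbox-file") once per archive file and search(conn, "keyword") whenever you need something. A LIKE scan still reads every row, so for a gigabytes-long archive a real full-text index would probably be the next step.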
Re:Keep it Simple (Score:2)
Final thought: don't keep everything. Mailing lists are usually deleted on read. Any mailing list worth reading (and worth reading two years from now) is being archived on the web somewhere. I see no need to create a local mirror. The hardest part of being an archivist is knowing what to throw away.
Yeah Matt, but you also keep archives of old BBSes around as well. :) Where's Panther's Byte on the Web?
Actually, I'd second the ASCII method, but add that if you don't want to use mail(1) (because you want to set up and access filters || triggers || subfolders while browsing html messages or threaded subject lists), the mbox format has been around for long enough that I feel confident in leaving my email in that format.
There are plenty of nice GUI interfaces to mbox (I use KMail from KDE2). Just make sure when you choose your mail software that it won't permanently delete *anything* (as most do by default). If you trash something, it should go into a "I probably won't need to see this again" file.
I finally gave up about a year ago, and started trashing Spam. Until then, I had every single piece of email I'd gotten via the net. Unfortunately, much of my early to mid 80's stuff is on AppleDOS 3.3 disks. Post PC/MS-DOS, I have everything in one mega /pub directory which (after deleting pr0n), barely makes it onto one DSS3 DAT tape.
And one :~( 40 meg RLE drive that I can't access.
--
Evan
Xemacs RULES!!! (well, works, anyways) (Score:1)
Right now, it hides deleted messages and deletes them after 30 days (handy now and then); this also saves me the hassle of emptying a trash can.
If I used scoring, I could do a lot of neat things, such as deleting unread messages that I am not likely to find interesting after a certain amount of time, or having me go straight to likely interesting messages when I open a message area.
I use BBDB for an address book, (new homepage http://www.waider.ie/hacks/emacs/bbdb/ ) which can be synced with a palm pilot (just started to work) http://home.rochester.rr.com/tsdeweese/SyncBBDB.h
Over all, I'm happy, I just need to learn a bit of emacs lisp.
Re:A few ideas (Score:2)
Index & Search (Score:1)
You first could run your archive into something like a Hypermail-style archiver [freshmeat.net] if you prefer HTML (in addition to whatever indexes the archiver creates).
Keep it Simple (Score:5)
I've got everything I've ever written on a computer -- email and else -- since around 1981.
It ain't always pretty but I'll tell you the secret to my success. Plain text. ASCII. If you want to be able to read what you've written now ten years from now, keep it ASCII.
Thanks to a basic format, I was able to convert my TRS-80 Model I tapes to Model 4 disks and my Model 4 disks to the 20-meg drive in my first Tandy 1000. From there it has been easy. My new harddrive is always huge compared to the last one so my old data usually takes up a third of the new disk. No big deal.
I read all my email with 'mail' thus protecting myself from viruses and funky email formats (Eudora, Outlook, CCMail, etc.). At the end of the month, my mailbox is dated, rotated and gzipped. The header information (Date, From and Subject) is added to a master index file along with the filename where the message can be found.
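The master index step could look something like the sketch below. This is not the poster's actual script, just one way to do it with Python's standard mailbox module; the function names, the tab-separated layout and the idea of indexing the mbox before it gets gzipped are all assumptions made for the example.

```python
import mailbox


def index_lines(mbox_path, archive_name):
    """Yield one tab-separated index line per message in an mbox file.

    archive_name is whatever filename the month's mail will end up
    under (for example the dated .gz file), so a hit in the index
    points at the right archive.
    """
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            fields = [
                msg.get("Date", ""),
                msg.get("From", ""),
                msg.get("Subject", ""),
                archive_name,
            ]
            # Collapse whitespace so folded headers and stray tabs
            # cannot break the one-line-per-message layout.
            yield "\t".join(" ".join(str(f).split()) for f in fields)
    finally:
        box.close()


def append_to_index(mbox_path, archive_name, index_path):
    """Append the index lines to the master index; return the count."""
    count = 0
    with open(index_path, "a", encoding="utf-8") as out:
        for line in index_lines(mbox_path, archive_name):
            out.write(line + "\n")
            count += 1
    return count
```

Run append_to_index on the month's mailbox right before compressing it. The index stays plain text, so an ordinary grep over it is enough to find which archive file to unpack.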
I've got a few ugly scripts that will search by keyword so I can find old stuff.
Yes, I'm living in the stone age. Those of you able to read your email going back to the early 1980s feel free to throw stones.
I think putting the stuff in a database would be a bad idea. When you change platforms, there will be maintenance. When you change databases, there will be maintenance. With plain ASCII text, you know you'll always be able to read it and you never have to upgrade. (Okay, by using gzip, there may be some extra effort on my part. A few years ago, I moved everything from compress to gzip. That's not the same thing as going from Oracle to Sybase, however.)
Final thought: don't keep everything. Mailing lists are usually deleted on read. Any mailing list worth reading (and worth reading two years from now) is being archived on the web somewhere. I see no need to create a local mirror. The hardest part of being an archivist is knowing what to throw away.
InitZero
Re:Keep it Simple (Score:1)
But, (and there's always a big but) kevin42 [slashdot.org] had a good point [slashdot.org] about email access when away from 'the big box'. I just moved home from the dorms, and I'm using my father's home machine for net access, so instead of Eudora or Opera for my email, I'm using Hotmail. [shiver]
I would really like to be able to put my recent email in a web searchable format that I can access from anywhere. Maybe even include PGP/GPG options, if access is via SSL. Basically, I want to be able to get to my email from other machines; whether that means SSL, sniffable web access, or carrier pigeon, I don't really care. I just want to read and send email.
Louis Wu
"Where do you want to go ...
Re:Keep it Simple (Score:2)
Plus, I can keep an eye on the security of my mail, and be (reasonably) sure that no one is snooping around.
--
Re:Xemacs RULES!!! (well, works, anyways) (Score:1)
Re:Keep it Simple (Score:1)
Evolution (Score:1)
Re:Keep it Simple (Score:1)
Thanks for your suggestion, but this is what I'm already doing and it's not cutting the mustard any more. Searching, for example, takes ages:
find ~/Mail -type f -exec grep -li "$WORD" {} \;
There's not much you can do to speed this up, sadly, and the number of files/directories/folders/mailspools is becoming unmanageable.
As for saving *everything*, I don't. I just happen to receive 50+ emails every day that have information in them I need to file.
PS: Everything you've written back to 1981 including (I'm assuming) Super Scripsit files? Kick ass!
Re:Keep it Simple (Score:1)
I think putting the stuff in a database would be a bad idea.
Although once you have all your mail in a database it is a simple step to write a script to write it back to plain text again. Or even to any other format you want.
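To give an idea of how small that script can be, here is a sketch that dumps a table of stored messages back into a plain mbox file. It assumes a hypothetical SQLite table called messages with the complete raw message (headers and body) kept as bytes in a column named raw; your own schema would differ. Python's standard mailbox module takes care of the "From " separator lines.

```python
import mailbox
import sqlite3


def export_to_mbox(conn, mbox_path):
    """Write every stored message to an mbox file; return the count.

    Assumes a table messages(id, raw) where raw holds the complete
    original message as bytes (a hypothetical schema for this sketch).
    """
    box = mailbox.mbox(mbox_path, create=True)
    count = 0
    try:
        for (raw,) in conn.execute("SELECT raw FROM messages ORDER BY id"):
            # mailbox adds the "From " separator line and escapes body
            # lines that start with "From ".
            box.add(bytes(raw))
            count += 1
    finally:
        box.close()  # flushes pending writes to disk
    return count
```

This only works this cleanly if the database kept the raw message. If the loader split mail into columns and threw the original away, getting back to a faithful plain text copy is harder.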
Re:Keep it Simple (Score:1)
HEY OTG's Email Xtender IS WHAT U NEED (Score:1)
MH/Procmail/glimpse/ifile? (Score:1)
Then I use glimpse to index everything. I'll admit I'd like a better search engine, but it works well enough and exmh has a nice interface.
I had a script that would go though a folder & refile into new sub folders based on year and month too.
All I need to make it perfect is a text wrapper for MH that can navigate subfolders so I can have reasonable speed & usability over a remote link.
Re:Keep it Simple (Score:1)
Also, something like 'find . -type f | xargs grep foo' will usually run faster than trying to get find to do the dirty work.