
How Do You Organize Your Experimental Data?

timothy posted more than 3 years ago | from the can't-remember-where-I-put-my-memory dept.

Data Storage

digitalderbs writes "As a researcher in the physical sciences, I have generated thousands of experimental datasets that need to be sorted and organized — a problem which many of you have had to deal with as well, no doubt. I've sorted my data with an elaborate system of directories and symbolic links to directories that sort by sample, pH, experimental type, and other qualifiers, but I've found that through the years, I've needed to move, rename, and reorganize these directories and links, which have left me with thousands of dangling links and a heterogeneous naming scheme. What have you done to organize, tag and add metadata to your data, and how have you dealt with redirecting thousands of symbolic links at a time?"


Here (2, Funny)

Xamusk (702162) | more than 3 years ago | (#33257094)

I store them in first posts.

Re:Here (-1, Offtopic)

Anonymous Coward | more than 3 years ago | (#33257186)

That's not such a bad idea: make the data public, use Google to access it.

Use databases! (3, Insightful)

Cyberax (705495) | more than 3 years ago | (#33257106)

Subj.

If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.

Re:Use databases! (2, Insightful)

garcia (6573) | more than 3 years ago | (#33257204)

If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.

That really depends on what your intended use for them is. I mean I don't know this particular fellow's situation for data collection or what tools he uses for reporting and visualization but perhaps, for him, it's a much better idea to store them in flat files. Me? I have been using flat files for all my data collection about local crime (see here [lazylightning.org] , here [lazylightning.org] , here [lazylightning.org] , and here [lazylightning.org] ) for several reasons:

1. I script it all with awk/sed to scrape the data and then put it in a CSV for summary with MySQL.

2. Yes, I could use MySQL for it all but I like to easily see it in its raw format on another remote machine. I also like to use Excel to do ad-hoc pivots and this is the easiest way for me to do that.

3. I upload the data to Google Docs and use their gadgets to make charts for my dashboards and maps. If I were to store it solely in MySQL I would have to make the CSV, pipe it into MySQL, convert it back out to CSV and then upload it. An additional step for nothing.

Hey, no method is perfect for everyone, and every project is a little different. While it's hard for me, based on the information provided, to give this guy any help, automatically suggesting that he needs a relational database to do his data storage might be just a little shortsighted.

YMMV.
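For illustration, a minimal Python sketch of the scrape-to-CSV step described above (the records and file name here are invented):

import csv

# hypothetical scraped records: (date, category, count)
rows = [("2010-08-15", "burglary", 3), ("2010-08-15", "theft", 7)]

with open("crime_summary.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "category", "count"])  # header row for Excel/Google Docs
    writer.writerows(rows)

The same CSV then uploads to Google Docs or loads into MySQL unchanged, which is the whole point of keeping the flat format.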

Re:Use databases! (maybe, maybe not) (1)

mikehoskins (177074) | more than 3 years ago | (#33257664)

I agree. It depends.

Yes, relational databases store and retrieve well-defined data very, very well. Do you have referential integrity needs? If that's your situation, use SQLite (small data and very simple types but little referential integrity), MySQL (medium to large data), or PostgreSQL (medium to very large data or more complex data types) and don't look back. SQL queries, relationships, and referential integrity are very powerful.

If not, then I'd look at MongoDB with GridFS. I'd even go further and explore GridFS-FUSE (a mountable file system version of MongoDB/GridFS).

With GridFS-FUSE, you have a crazy powerful database/file system combo. Now, since MongoDB is a NoSQL database, you cannot do SQL queries against it. You can store and retrieve key-value pairs, NoSQL "documents," and actual files with MongoDB/GridFS/GridFS-FUSE.
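For the curious, a rough pymongo/GridFS sketch, assuming a local mongod is running (the file name and metadata fields are invented):

from pymongo import MongoClient
import gridfs

db = MongoClient()["experiments"]
fs = gridfs.GridFS(db)

# store a raw data file, attaching searchable metadata to the file document
with open("run_042.dat", "rb") as f:
    file_id = fs.put(f, filename="run_042.dat", sample="A", pH=7.4)

# retrieve every file recorded at pH 7.4
for g in fs.find({"pH": 7.4}):
    print(g.filename, g.length)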

Re:Use databases! (3, Interesting)

mobby_6kl (668092) | more than 3 years ago | (#33257916)

Certainly it depends, YMMV, and all that. Still, I think that some of the points that you bring up are not actually arguing against a relational database, perhaps just for a slight reorganization of your processes.

  1. I don't know where you get the data from, but anything awk/sed can do, so can Perl. And from Perl (or PHP, if you're lame) it's very easy to load the data into a database.
  2. It's easy to connect to an SQL server from the remote machine and either dump everything or just select what you need. You'll need more than notepad.exe to do this, but it's not rocket science. Pivots in Excel can be really useful, but Excel can easily connect to the same database and query the data directly from there and use it for your charts/tables.
  3. Since by this point you'll already have all the data in the db, exporting it to CSV would be trivial. Or you could even skip Google Docs entirely and generate your charts with tools which can automatically query your database.

I agree with your final point though, we really have no idea what would be best for the submitter within the limitations of skills, budget, time, etc. Perhaps flat files really are the best solution, or maybe stone tablets are.

Re:Use databases! (2, Insightful)

Idiomatick (976696) | more than 3 years ago | (#33257216)

I'm amused by all the /.ers suggesting the physics nerd set up a SQL database. It ain't super easy and I can't imagine anyone that isn't a programmer of some sort doing so.

The question he is really asking is probably more like: 'Is there anything like Windows live photo gallery that will allow me to tag and sort all of my data in a variety of ways like wlpg lets me sort my pictures?'

Re:Use databases! (1)

StormReaver (59959) | more than 3 years ago | (#33257298)

The question he is really asking is probably more like: 'Is there anything like Windows live photo gallery that will allow me to tag and sort all of my data in a variety of ways like wlpg lets me sort my pictures?'

That is what Nepomuk is supposed to do: allow you to build semantic meaning into your data. The problem I have with it as it currently stands is that the Nepomuk processes suck the life out of my computers, so I disable Nepomuk entirely.

Re:Use databases! (2, Funny)

grub (11606) | more than 3 years ago | (#33257300)

If the dude works at a research institute of minor size and up, they should have IT staff who can do that initial setup for him.

Re:Use databases! (2, Informative)

BrokenHalo (565198) | more than 3 years ago | (#33257856)

If the dude works at a research institute of minor size and up, they should have IT staff who can do that initial setup for him.

I'm not quite sure why your post is rated as funny; scientists are not necessarily the best people to be left in charge of setting up databases. I've seen all sorts of atrocities constructed in the name of science, from vast flat files to cross-linked ISAM/VSAM files, and I remember many late nights (with complaints from my wife) spent sorting them out when a subscript went out of range.

Re:Use databases! (1)

bkaul01 (619795) | more than 3 years ago | (#33257858)

Funny indeed. Having worked at both the university and National Lab levels, the IT departments I've encountered function more to push restrictive security policies and pieces of corporate spyware onto everyone's systems than to actually enable researchers to be more productive...

Really, this is exactly what WinFS would've been perfect for, if MS had ever gotten it working and released it. As it is, I use a hierarchical directory structure - top level is the research project, then date/experiment from there. Definitely subject to the kind of problems the original poster is encountering, though.

Re:Use databases! (1)

Planesdragon (210349) | more than 3 years ago | (#33257386)

I'm amused by all the /.ers suggesting the physics nerd set up a SQL database. It ain't super easy

One SQL database? As in, one gigantic SQL database? Yeah, that's a headache.

OTOH, one SQL database per "type" of data (i.e., "last summer's research.sql")? Hell yes. If it's not in a database, you need to spend the ten minutes to learn how to make one. MS Access exists for a reason, and this is it. It won't be a very pretty db, but it doesn't need to be -- it'll be indexed, searchable, and more or less protected.

The only reason NOT to put it into a relational database is if you have some system that's essentially a database already, and in that case you can just leave it as-is. (Microsoft's SharePoint, if the data's small and simple enough, is an example.)

Re:Use databases! (2, Insightful)

BlitzTech (1386589) | more than 3 years ago | (#33257688)

Apache: click the install button (use default options, or switch to non-service mode which it very clearly explains means it only runs when you run it instead of whenever you start your computer)
MySQL: click the install button (use default options, they're all fine)
phpMyAdmin: put in document root, configure ("click the install button")

And you're set. How was that hard...?

Some software is, in fact, difficult to set up and maintain. For a scientist with an unusually large sample collection, learning to use a database is probably a good idea. Many scientists are taught MATLAB, and setting up a WAMP stack is much, much easier than learning MATLAB.

I'm amused by all the /.ers suggesting the physics nerd set up a SQL database. It ain't super easy and I can't imagine anyone that isn't a programmer of some sort doing so.

Scientists are pretty smart. He should learn to use a database.

Re:Use databases! (1)

bitingduck (810730) | more than 3 years ago | (#33257870)

Any physics nerd should be able to set up a MySQL database pretty easily-- it's not quite as easy as falling out of a tree, but it's not anywhere near as difficult as a lot of other things in physics. A great deal of data acquisition and analysis for many (if not most) physical scientists involves a bunch of custom programming, and many of the theoretical sorts do a lot of computer modeling. MySQL is pretty easy to install on just about anything, and if you have a reasonable idea of what your data will look like it's pretty easy to decide how to set up the tables you need. The first iteration may not be more than a few simple tables and some straightforward queries, but it's way easier to maintain than a tangled nest of symlinks.

(I'm a physical scientist who plays around with SQL for cheap entertainment)

Databases are not as convenient as files (2, Interesting)

goombah99 (560566) | more than 3 years ago | (#33257264)

I agree that this is a candidate for a database. One problem with databases for researchers is that generally one does not know the right schema beforehand and one is dealing with ad hoc contingencies a lot. Another is portability to machines you don't control or that are not easily networked. A final problem is archival persistence. I can't think of a single database scheme that has lasted five, let alone ten, years and still functions without being maintained. Files can do that.

So if you want some bandaid approaches:

1) If you have a Mac, then use aliases rather than symbolic links. Aliases don't get messed up if you move the file.

2) Use hard links rather than symbolic links. The problem here is that these can get unlinked if you plan to modify the file. But if the file will never change, these are just as space efficient as a softlink but tolerate renaming. They can't span across different disks, however.

3) Poor man's database: give your files a unique numerical name, typically the date and time they were created. Then have a flat file that lists the files in some set for each category.

4) Low-tech database: if you decide to use a database, then choose one that is likely never to go out of style; for example, pick something like a perl-tie. Those are so close to the language that they probably won't get deprecated in the next 10 years.
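If Python is more your speed, the standard library's dbm module is roughly the same idea as a perl-tie -- a persistent dictionary that is unlikely to be deprecated any time soon. A minimal sketch (the key and file names are invented):

import dbm

# map a qualifier to a comma-separated list of date-named files
with dbm.open("category_index", "c") as db:
    db["pH7.4"] = "20100815T1030.dat,20100816T0915.dat"

with dbm.open("category_index", "r") as db:
    print(db["pH7.4"].decode().split(","))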

Re:Databases are not as convenient as files (1)

atamido (1020905) | more than 3 years ago | (#33257426)

I agree that this is a candidate for a database. One problem with databases for researchers is that generally one does not know the right schema beforehand and one is dealing with ad hoc contingencies a lot. Another is portability to machines you don't control or that are not easily networked. A final problem is archival persistence. I can't think of a single database scheme that has lasted five, let alone ten, years and still functions without being maintained. Files can do that.

The article didn't actually ask for a way to organize data, he asked for a way to organize files. He could easily create a database that points to whatever files he wants. Populate the relevant columns, and if he wants to add another type of data to search on, add a column. It's not like you need data in every cell, you simply need data in the cells that you want to find again.

Re:Databases are not as convenient as files (1)

bitingduck (810730) | more than 3 years ago | (#33257900)

The article didn't actually ask for a way to organize data, he asked for a way to organize files. He could easily create a database that points to whatever files he wants. Populate the relevant columns, and if he wants to add another type of data to search on, add a column. It's not like you need data in every cell, you simply need data in the cells that you want to find again.

And right now he's using a complicated mess of symlinks that amounts to a db schema that's probably a huge pain to maintain. Pick one straightforward way to organize files (e.g. date, with directories by month or something) and use the db for sifting through them to pick files by pH, lunar phase, and hair color (or whatever).
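A minimal sketch of that split using Python's built-in sqlite3 module (the schema and values are invented for illustration):

import sqlite3

conn = sqlite3.connect("experiments.db")
conn.execute("""CREATE TABLE IF NOT EXISTS datasets (
    path TEXT PRIMARY KEY,   -- files never move, so the path is a stable key
    sample TEXT, ph REAL, exp_type TEXT, run_date TEXT)""")
conn.execute("INSERT OR REPLACE INTO datasets VALUES (?, ?, ?, ?, ?)",
             ("raw/2010/08/run_042.dat", "A", 7.4, "titration", "2010-08-15"))
conn.commit()

# the query replaces a forest of symlinks
for (path,) in conn.execute("SELECT path FROM datasets WHERE ph BETWEEN 7 AND 8"):
    print(path)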

Re:Use databases! (4, Interesting)

rumith (983060) | more than 3 years ago | (#33257332)

Hello, I'm a space research guy.
I've recently made a comparison of MySQL 5.0, Oracle 10g and HDF5 file-based data storage for our space data. The results [google.com] are amusing (the linked page contains charts and explanations; pay attention to the concluding chart, it is the most informative one). In short (for those who don't want to look at the charts): using relational databases for pretty common scientific tasks sucks badly performance-wise.

Disclaimer: I'm positively not a guru DBA, and I admit that both of the databases tested could be configured and optimized better, but the thing is that I am not supposed to be. Neither is the OP. While we may do lots of programming work to accomplish our scientific tasks, being a qualified DBA is a completely separate challenge - an unwanted one, as well.

So far, PyTables/HDF5 FTW. Which brings us back to the OP's question about organizing these files...
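For anyone who hasn't seen PyTables, a tiny sketch of the idea (the column layout is invented):

import tables

class Run(tables.IsDescription):
    sample = tables.StringCol(16)   # fixed-width string column
    ph = tables.Float64Col()

h5 = tables.open_file("runs.h5", mode="w")
runs = h5.create_table("/", "runs", Run)
row = runs.row
row["sample"], row["ph"] = b"A", 7.4
row.append()
runs.flush()

# in-kernel query: no SQL, no server, just the file
print([r["sample"] for r in runs.where("ph > 7.0")])
h5.close()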

Re:Use databases! (2, Insightful)

Dynedain (141758) | more than 3 years ago | (#33257620)

Translation: I am not a DB guru, but I deal with massive amounts of complex data and need a DB guru, and I have no intention of hiring one.

Seriously, hire a DB wizard in the DB software of your choice for a couple of days. Have him set up the data and optimize it. You'll save yourself a lot of headaches, AND put yourself in a good position for future data maintenance. Imagine that your project gets a lot of attention in the future, and you suddenly get a lot of funding and the money to hire more people. Or imagine that you'd like to provide or incorporate data with some outside sources or other researchers. If you're using something "standard" like a relational DB, it will be much easier to hire a DB wizard than trying to find a programmer who can piece together a lot of mismatched files and convoluted organization schemes.

This is what databases are designed to do. Just because you're not an expert at setting them up, and there's a performance hit to setting them up wrong, doesn't mean that they aren't still the right tool.

Re:Use databases! (3, Interesting)

rockmuelle (575982) | more than 3 years ago | (#33257776)

I've built LIMS systems that manage petabytes of data for companies, and small-scale data management solutions for my own research. Regardless of the scale, the same basic pattern applies: databases + files + programming languages. Put your metadata and experimental conditions in the database. This makes it easy to manage and find your data. Keep the raw data in files. Keep the data in a simple directory structure (I like instrument_name/project_name/date type hierarchies, but it doesn't really matter what you do as long as you're consistent) and keep pointers to the files in the database. Use Python/Perl/Erlang/R/Haskell/C/whatever-floats-your-boat for analysis.

Databases are great tools when used properly. They're terrible tools when you try to shoehorn all your analysis into them. It's unfortunate that so few scientists (computer and other) understand how to use them. Also, for most scientific projects, SQLite is all you need for managing meta-data. Anything else and you'll be tempted to do your analysis in the database. Basic database design is not difficult to learn - it's about the same as learning to use a scripting language for basic analysis tasks.

The main points:

1) Use databases for what they're good at: managing meta-data and relations.
2) Use programming languages for what they're good at: analysis.
3) Use file systems for what they're good at: storing raw data.

-Chris

Re:Use databases! (1)

PrecambrianRabbit (1834412) | more than 3 years ago | (#33257828)

Depending on the size and stability of the GP's research budget, that may not be practical. I worked on a fairly large academic research team (by EE/CS standards) that had the budget to hire a few full-time staff members for certain things. After the main implementation push the project wound down a bit, and those staff moved on to other jobs, leaving the grad students to maintain the infrastructure. That was fine as it was, but could have been massively not-fine if the staff had used complex tools that required specialized knowledge that the students didn't have, and would have had to divert their energies from research to tool-learning.

Basically, if you're hiring a DBA, make sure that you can keep them on staff indefinitely.

Re:Use databases! (1)

oldhack (1037484) | more than 3 years ago | (#33257686)

"While we may do lots of programming work to accomplish our scientific tasks, being a qualified DBA is a completely separate challenge - an unwanted one, as well."

While I get what you are getting at, it's the same shit - you muck with it (hacking perl/python/dbms) because there is no prepackaged stuff for your needs.

Re:Use databases! (2, Interesting)

clive_p (547409) | more than 3 years ago | (#33257694)

As it happens I'm also in space research. My feeling is that what approach you take depends a lot on what sort of operations you need to carry out. Databases are good at sorting, searching, grouping, and selecting data, and joining one table with another. Getting your data into a database and extracting it is always a pain, and for practical purposes we found nothing to beat converting to CSV (comma-separated-value) format.

We ended up using Postgres as it had the best spatial (2-d) indexing, beating MySQL at the time. The expensive commercial DBMS like Oracle didn't have anything that the open-source ones did for modest-sized scientific datasets. I found Postgres was fine for our tables, which were no bigger than around 10 million rows long and 300 columns wide. You might well get better performance using something like HDF, but you'll probably spend a lot more time programming to do that, and it won't be as flexible. The only thing you can be sure of in scientific data handling is that the requirements will change often, so flexibility is important.

If your scientific data are smallish in volume and pretty consistent in format from one run to the next, you might consider storing the data in the database, in a BLOB (binary large object) if no other data type seems to suit. But a fairly good alternative is just to store the metadata in the database, e.g. filename, date of observation, size, shape, parameters, etc., and leave the scientific data in the files. You can then use the database to select the files you need according to the parameters of the observation or experiment.

Consistently (3, Insightful)

rwa2 (4391) | more than 3 years ago | (#33257854)

Word up. I'd say the first goal is to store your raw, bulk data consistently. Then you can have several sets of post processing scripts that all draw from the same raw data set.

You want this data format to be well-documented, but I wouldn't bother meticulously marking it up with XML tags and other metadata or whatever. You just want to be able to read it fast, and have other scripts be able to convert it into other formats that would be useful for analysis, be it matlab, octave, csv, or some tediously marked-up XML. You do want to be able to grep and filter the data pretty easily, so keep that in mind when you're designing the format. It will likely end up being pretty repetitive, but that's OK, since you'll likely store it compressed. That can improve performance when reading it, since the storage medium you're pulling the data from is often slower than the processor doing the decompression... and it also provides some data integrity / consistency checking. Oh, and of course, you can store more raw data if it's compressed.
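As a small illustration of grepping compressed raw data without ever expanding it to disk, in Python (the file name and record format are invented):

import gzip

with gzip.open("run_20100815.csv.gz", "rt") as f:
    for line in f:
        if line.startswith("7.4,"):   # e.g. filter records by a leading pH column
            print(line, end="")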

Re:Use databases! (1)

turbidostato (878842) | more than 3 years ago | (#33257338)

"If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files."

Ever tried to access from Access (pun intended) big blob fields?

The guy has a lot of data. Therefore he needs a database. He already has one with hierarchical storage (the filesystem) that probably conveys the way data is generated (hierarchically, if only by date).

Then he needs to access the data by different criteria. Are they relational? Then a relational database is probably what's needed. Access is still hierarchical, tree-based...?

My point is that (quite obviously) data storage and data mining/retrieval are different beasts, so they can (and probably should) be managed in different ways/by different tools.

One first thing to do is "freezing" the data storage deployment. He says he has problems because of dangling symlinks. That wouldn't happen if he didn't find the need to move the "original" data around. OK: don't do that and the problem will go away. Just let the "original" data go into a plain or almost plain structure: as long as there are no name collisions anything could do, from a single directory to a somewhat optimized tree structure (like a first-level directory list ordered by date or alphabetically or anything that fits, adding subdirs as needed so that any subdirectory only holds files in the hundreds or low thousands).

The only important thing to remember is that once a file gets stored it never moves from its place (at least logically: with time maybe newer data can go to faster filesystems while older/less used data can go to second tier storage, etc.).

On top of that you add searching/datamining tools. Since your data storage is fixed by convention, you can add as many and as unrelated searching tools as you see fit. It can be a forest of symlinks for different ordering criteria, it can be an RDBMS, it can be a search engine, it can be accessible only by means of command line tools, a web interface... whatever, and you can use all or part of them as need arises.

A practical example: Linux distributions' sources of packages like apt or yum. Packages are stored on an (almost) flat alphabetically ordered filesystem and then an in-parallel structure handles access (ordered by architecture, usage subset, etc.) by different means.

Re:Use databases! (1)

Shikaku (1129753) | more than 3 years ago | (#33257448)

http://sourceforge.net/projects/vym/ [sourceforge.net]

I think using a flat file and this would be helpful. Maybe. I don't know exactly what the data is, however, so this may not be very helpful.

Re:Use databases! (1)

RobertLTux (260313) | more than 3 years ago | (#33257820)

Nice idea

Nice service you have advertised in your sig
Even Nicer referral link

Totally disorganized (1, Funny)

countertrolling (1585477) | more than 3 years ago | (#33257116)

Whenever I need to find anything, I use "Command-F"

Re:Totally disorganized (0)

Anonymous Coward | more than 3 years ago | (#33257188)

Quiet now, adults are speaking.

He's a researcher in physical sciences. The computer he uses won't have a command button.

Re:Totally disorganized (1)

ElektronSpinRezonans (1397787) | more than 3 years ago | (#33257748)

I wish that were true. The number of command buttons is steadily increasing in all sciences. I genuinely fear for the future of science...

Use a revision control system or a database (0)

Anonymous Coward | more than 3 years ago | (#33257122)

Data isn't just data - it has, as you've learned, a history. Learn how revision control systems work and use one to store your data from now on.

Or, you could just store it in a proper SQL database, and be able to query it any way you like, without having to create all these link farms giving you different views on the underlying data.

Separate data from presentation (4, Informative)

mangu (126918) | more than 3 years ago | (#33257126)

In my experience, the best thing is to let the structure stand as it was the first time you stored the data.

Later, when you discover more and more relationships around that data, you may create, change, and destroy those symbolic links as you wish.

I usually refrain from moving the data itself. Raw data should stand untouched, or you may delete it by mistake. Organize the links, not the data.

Re:Separate data from presentation (1)

DamonHD (794830) | more than 3 years ago | (#33257422)

Absolutely agreed.

And indeed my biggest 'data' collection of over a decade old is all directly exposed to the Web, and other than taking care of significant errors in the original naming, I've followed the "Cool URIs Don't Change" mantra and had the presentation app tie itself in knots if need be to leave the data as-is and present it in any new ways required.

It's like having an ancient monument: you don't shuffle it around to suit the latest whim, else you'll most likely mess it up beyond repair.

Rgds

Damon

Databases. (1)

BisexualPuppy (914772) | more than 3 years ago | (#33257138)

Ever heard of relational databases? That's *exactly* what you are looking for.

Re:Databases. (1)

nicolas.kassis (875270) | more than 3 years ago | (#33257166)

I was going to say the same thing. You can also check to see if there is any software in your domain that might help you insert it into a database. If not, you can keep the data as flat files but have records in the database with the path to them in there. A little bit of programming, but not much, will get you a list of file paths that you can then retrieve with a simple bash script.

Organize it with style (1, Funny)

Anonymous Coward | more than 3 years ago | (#33257144)

Organize your data like I organize my bedroom: Everything on the floor.

Look, how big is your desk? 8 square feet? How big is your floor? Several hundred square feet? If you can see all of your stuff, then you can access it instantly. Organized Chaos.

Now, if you'll excuse me... I think something's moving around in my trash can.

Databases (1)

eexaa (1252378) | more than 3 years ago | (#33257164)

SQL comes in really handy. I can imagine several simple scripts + an SQLite indexing table. Or anything else.

Re:Databases (1)

obstacleman (634020) | more than 3 years ago | (#33257232)

I agree. As the amount of data and metadata increases, a good way to organize it all is via a database. Then access can be done through queries on the metadata and all relevant locations returned. In some sense, it will no longer matter where the data is stored on disk as long as the database knows the location (and moving it can be done easily, but requires the database be updated too). One of the simplest forms for the directory structure is along the lines of date ordering, e.g. year-dir, month-dir, day-dir, dataset-dir. One of the advantages of a database is it can allow you to replicate the data, say for instance on tape copies, and store the location on tape in the database too. In High Energy Physics there are petabytes of data stored this way.

extended attributes? (1)

otis wildflower (4889) | more than 3 years ago | (#33257168)

I wonder if there's an opensource project to create and manage extended attributes on supporting filesystems?

http://www.freedesktop.org/wiki/CommonExtendedAttributes [freedesktop.org]

But you're likely to get better results from having filenames be a field in a DB, and letting all the metadata live in other DB fields.

ps: here's a CPAN entry that manipulates extended attributes: http://search.cpan.org/dist/File-ExtAttr/lib/File/ExtAttr.pm [cpan.org]

Matlab Structures (4, Interesting)

Anonymous Coward | more than 3 years ago | (#33257170)

I have to organize and analyze 100 GB of data from a single day's experiment. Raw data goes on numbered HDs that are cloned and stored in separate locations. This data is then processed and extracted into about 1 GB of info. From there the relevant bits are pulled into hierarchical MATLAB structures. Abstract everything that you can into a MATLAB structure array. Have all your analysis methods embedded in the classes that compose the array. Use SVN to maintain your analysis code base.

Go for NoSQL! (3, Funny)

JamesP (688957) | more than 3 years ago | (#33257172)

OK, the subject is the short answer; here's the big answer.

Since experimental data usually doesn't have the same structure for all experiments, you may try something like this:

At the deepest, most basic level, organize it using JSON or XML (I don't know what kind of experiment you do, but you would put lists of data, etc.).

Then you store this in a NoSQL db (like CouchDB or Redis) and index it the way you like; still, if you don't index, you can always search it manually (slower, still...).

Don't bother with hierarchies (5, Interesting)

ccleve (1172415) | more than 3 years ago | (#33257174)

Instead of trying to organize your data into a directory structure, use tagging instead. There's a lot of theory on this -- originally from library science, and more recently from user interface studies. The basic idea is that you often want your data to be in more than one category. In the old days, you couldn't do this, because in a library a book had to be on one and only one shelf. In this digital world you can put a book on more than one "shelf" by assigning multiple tags to it.

Then, to find what you want, get a search engine that supports faceted navigation. [wikipedia.org]

Four "facets" of ten nodes each have the same discriminatory power as a single hierarchy of 10,000 nodes. It's simpler, cleaner, faster, and you don't have to reorganize anything. Just be careful about how you select the facets/tags. Use some kind of controlled vocabulary, which you may already have.

There are a bunch of companies that sell such search engines, including Dieselpoint, Endeca, Fast, etc.
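You don't even need a commercial engine to see why this works; a toy faceted index in Python is just set intersection (the tags and file names are invented):

# each (facet, value) pair maps to the set of datasets carrying that tag
index = {
    ("sample", "A"): {"run1.dat", "run2.dat"},
    ("ph", "7.4"): {"run2.dat", "run3.dat"},
}

def facet_search(*facets):
    """Intersect the dataset sets for every requested facet value."""
    sets = [index.get(f, set()) for f in facets]
    return set.intersection(*sets) if sets else set()

print(facet_search(("sample", "A"), ("ph", "7.4")))   # -> {'run2.dat'}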

Re:Don't bother with hierarchies (0)

Anonymous Coward | more than 3 years ago | (#33257464)

This is horribly off-topic, but a lot of the problems faced by librarians are similar, if not the same, as the problems faced by warehouse designers.

Re:Don't bother with hierarchies (1)

Yvanhoe (564877) | more than 3 years ago | (#33257646)

What ccleve said.

I would also add that if your datasets are like mine (enormous), it may be a good idea to md5 them. Have a log file where you enter information about each file: its creation date, its md5, nature of data, source, etc...

Do not use name or path hierarchies to keep track of metadata; it is doomed to fail. If you feel this is worth the effort you can set up a database for this info, but in my opinion, if you have hundreds to thousands of files, a simple flat file can be good enough.
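A minimal sketch of that log in Python (the field layout is invented; hashlib and a flat append-only file do all the work):

import hashlib, os, time

def log_dataset(path, description, logfile="datasets.log"):
    """Append one provenance line: date, md5, size, path, free-form notes."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MB chunks
            h.update(chunk)
    with open(logfile, "a") as log:
        log.write("%s  %s  %d  %s  %s\n" % (time.strftime("%Y-%m-%d"),
                  h.hexdigest(), os.path.getsize(path), path, description))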

Re:Don't bother with hierarchies (1)

aperdaens (1395859) | more than 3 years ago | (#33257872)

You could also use a sharing tool focused on faceted navigation like Knowledge Plaza [knowledgeplaza.net] . This way you have an interface where you can both store and categorize your results and be able to browse and search those results.

Use hard links (1)

swm (171547) | more than 3 years ago | (#33257176)

Instead of symlinking to directories, create directories of hard links to the files.

Then you can move files around whenever you like, and you never have any dangling links.

Careful... (1)

SanityInAnarchy (655584) | more than 3 years ago | (#33257256)

It depends how you update the files. Many systems, when updating a file, will write the entire new file to a temporary location, then atomically rename it on top of the old location, which would kill any hardlinks, but symlinks would still work.

I have to agree with the database suggestions, though something NoSQL-ish may work better.

Re:Use hard links (1)

vrmlguy (120854) | more than 3 years ago | (#33257754)

Instead of symlinking to directories, create directories of hard links to the files.

Then you can move files around whenever you like, and you never have any dangling links.

I second this. I have a big collection of photos that I've downloaded over the years, and I "tag" them via hard-links into directories. The same photo may be found under "party/jane/nyc", "party/nyc/jane", "nyc/party/jane", "nyc/jane/party", "jane/party/nyc" and "jane/nyc/party". If two people are in the photo, that's twenty-four links, but I have Perl scripts that take care of the grunt work; a picture with N tags will have "just" N! links. I don't link photos to intermediate directories, but all pictures from parties in New York can be found via either "find party/nyc -type f" or "find nyc/party -type f"; removing the dups is left as an exercise for the student.

BTW, this works with Windows as well as Unix. NTFS supports hard-links and while there isn't a native command to create them, Perl will do so.
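A rough Python equivalent of those Perl scripts (the paths are invented), using os.link and itertools.permutations to create the N! links:

import itertools, os

def tag_with_hardlinks(path, tags, root="tags"):
    """Hard-link one file under every ordering of its tags."""
    for perm in itertools.permutations(tags):
        d = os.path.join(root, *perm)
        os.makedirs(d, exist_ok=True)
        dest = os.path.join(d, os.path.basename(path))
        if not os.path.exists(dest):
            os.link(path, dest)   # hard link: survives renames of the original

tag_with_hardlinks("photos/img001.jpg", ["party", "jane", "nyc"])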

I used to be anal about organization... (2, Insightful)

taoboy (118003) | more than 3 years ago | (#33257178)

...but then Google came along and taught me that it's not about knowing where things are, but rather about being able to find them. My email, for instance, is "organized" by the year in which it arrives, and I use the search function of my email client to find things. No big folder structure, no moving messages around, and I haven't had problems finding any email I need. Oh yes, I keep them all... good fodder for "on x/x/xxx you said..." retorts.

For files, then, the key is to have descriptive file names that provide readily searched text. Including the date somewhere in the name (I tend to use this format because it sorts well: 20100815) makes it easier to sort through multiple versions.
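For instance, a one-liner along those lines in Python (the qualifiers are made up):

from datetime import date

# a date prefix keeps lexical sort order equal to chronological order
fname = "%s_sampleA_pH7.4_titration.csv" % date.today().strftime("%Y%m%d")
print(fname)   # e.g. 20100815_sampleA_pH7.4_titration.csv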

Then, you can spend quality time figuring out how to reliably back up all that stuff.... :)

Interns. (1)

Gordonjcp (186804) | more than 3 years ago | (#33257182)

Life is too short. Get someone else to do it, under the guise of valuable field experience.

Re:Interns. (3, Informative)

spacefight (577141) | more than 3 years ago | (#33257244)

Yeah right, let the interns do the job. Not. Interns use new tools no one understands, fail to finish the project during their term, then move on and leave a most probably buggy or unfinished project behind. Pity the person who has to clean up the mess. Better to do the job on your own, know the tools, or hire someone permanently for the whole department.

Linked Data, of course (2, Informative)

Rui Lopes (599077) | more than 3 years ago | (#33257202)

The present (and the future) of experimental data organisation, repurposing, re-analysing, etc. is being shifted towards Linked Data [linkeddata.org] and supporting graph [openrdf.org] data [franz.com] stores [openlinksw.com] . Give it a spin.

How can you not? (1)

Rivalz (1431453) | more than 3 years ago | (#33257222)

I never understood how you can have something organized or not.
I organize my stuff at the planetary level.
Universe > Solar System > Earth > Continent > United States > Florida > County > City > Street > House > Room > Desk > Computer > Hard Drive > Folder > File Type > Location
I think I'm pretty well organized even though I misplace stuff all the time.

Re:How can you not? (1)

TimSSG (1068536) | more than 3 years ago | (#33257472)

You forgot the Galaxy. Tim S.

Re:How can you not? (1)

Rivalz (1431453) | more than 3 years ago | (#33257538)

I skip the little things... Plus I do not recognize U.S. soccer teams as a tool for organization. I had to remove Galaxy from my organizational charts when they formed the L.A. Galaxy. That little naming convention set me back 10 years' worth of organization. I became confused about which galaxy I lived in and had to seek extensive therapy. I feel a relapse coming on.

But honestly thanks for the clarification, I actually forgot a little thing like galaxies.

Try using a scientific workflow system (3, Insightful)

moglito (1355533) | more than 3 years ago | (#33257226)

You may want to consider a scientific workflow system. These systems handle both data storage (including meta-data and provenance -- where the data came from), and design and execution of computational experiments. If you are concerned about the complexity of the meta-data (e.g., pH value) and would like to make sure to be able to sort things according to this, you want to give "Wings" a try. You can try out the sandbox to get an idea: http://wind.isi.edu/sandbox [isi.edu] .

Be like Google... (1)

Vornzog (409419) | more than 3 years ago | (#33257234)

"Search, don't sort".

The size and complexity of your data management should match the size and complexity of your data set. If you have thousands of datasets, give serious consideration to a relational database. Store all of your metadata (pH, date, etc) in the database so you can query it easily. If your raw data lives in a text-based format, put it in the database too, otherwise just store the path to your file in the database and keep your files in some sort of simple date-based archive or whatever.

Now, you can start to search though the data by thinking about which sets of data to compare. Much easier.

This is very general advice - if you have one experimenter and a couple of experiments, just use a lab notebook. If you have a handful of experimenters and ~100 experiments, try a spreadsheet or a well-organized structure on disk. If you have many people involved, or thousands of experiments, or both, you need something to help manage all of that in a way that lets you think in terms of sets rather than individual data files. Otherwise, you'll find yourself wearing your 'data steward' hat way too often, and not wearing your 'experimentalist' or 'analyst' hats much at all.

four directories (4, Funny)

arielCo (995647) | more than 3 years ago | (#33257236)

$PRJ_ROOT/data/theoretical
$PRJ_ROOT/data/fits
$PRJ_ROOT/data/doesnt_fit
$PRJ_ROOT/data/doesnt_fit/fixed
$PRJ_ROOT/data/made_up

Re:four directories (0)

Anonymous Coward | more than 3 years ago | (#33257360)

you say it's four because there's no "fits", right?

Re:four directories (0)

Anonymous Coward | more than 3 years ago | (#33257470)

$ ls -lrt $PRJ_ROOT/data
total 5
-rw-r--r-- 1 geek phd 4096 1998-04-15 15:16 theoretical
-rw-r--r-- 1 geek phd 4096 2000-10-01 17:20 doesnt_fit
-rw-r--r-- 1 geek phd 4096 2006-06-29 22:17 doesnt_fit/fixed
lrwxrwxrwx 1 geek phd 11 2007-03-12 23:03 fits -> theoretical
-rw-r--r-- 1 geek phd 65536 2009-02-17 23:33 made_up

Re:four directories (3, Funny)

morgan_greywolf (835522) | more than 3 years ago | (#33257488)

Oh, come on! Who let the climatologists in here?

Re:four directories (1)

jochem_m (1718280) | more than 3 years ago | (#33257786)

with $PRJ_ROOT/data/made_up being the biggest one? ;)

MindMaps is a perfect solution for you, I think (0)

Anonymous Coward | more than 3 years ago | (#33257274)

My research area is programming with MindMaps - MindMaps as source code - and I'm developing a programming language based on them.

I chose MindMaps because I can see the detail and the global picture in the same GUI, so I recommend FreeMind.

MindMap software can map your filesystem structure, even the symbolic link structure.

Good luck

The Obvious Solution (1)

fast turtle (1118037) | more than 3 years ago | (#33257296)

is to use CVS (comma/tab separated value) files to store the data. This makes it easy to import into a spreadsheet or database in the future as your needs grow.

Re:The Obvious Solution (1)

Improv (2467) | more than 3 years ago | (#33257528)

I think you mean CSV

Relational Databases won't do! (3, Informative)

gmueckl (950314) | more than 3 years ago | (#33257306)

To everybody here suggesting relational databases: you are on the wrong track here, I'm afraid to tell you. Relational databases handle large sets of completely homogeneous data well, if you can be bothered to write software for all the I/O around them. This is where it all falls apart:

1. Many lab experiments don't give you the exact same data every time. You often don't do the same experiment over and over. You vary it and the data handling tools have to be flexible enough to cope with that. Relational databases aren't the answer to that.

2. Storing and fetching data through database interfaces is vastly more difficult than just using standard input/output or plain text files. I've written hundreds of programs for processing experimental data and I can tell you: nothing beats plain old ASCII text files! The biggest boon is that they are compatible with almost any scientific software you can get your hands on. Your custom database likely is not. Or how would you load the contents of your database table into gnuplot, Xmgrace or Origin, just to name a few tools that come to my mind right now?

I wish I had a good answer to the problem. At times I wished for one myself, but I fear the best reply might still be "shut up and cope with it".

Re:Relational Databases won't do! (1)

bjourne (1034822) | more than 3 years ago | (#33257452)

Storing and fetching data through database interfaces is vastly more difficult than just using standard input/output or plain text files. I've written hundreds of programs for processing experimental data and I can tell you: nothing beats plain old ASCII text files!

Well, I believe utf-8 encoded text files beat them hands down. Especially for scientific research. //snarky

Re:Relational Databases won't do! (1, Insightful)

Anonymous Coward | more than 3 years ago | (#33257490)

You can just pipe the output from the SQL client to a text file (or export the results to a CSV file if you use a Query Browser).

Re:Relational Databases won't do! (0)

Anonymous Coward | more than 3 years ago | (#33257634)

2. Storing and fetching data through database interfaces is vastly more difficult than just using standard input/output or plain text files. I've written hundreds of programs for processing experimental data and I can tell you: nothing beats plain old ASCII text files! The biggest boon is that they are compatible with almost any scientific software you can get your hands on. Your custom database likely is not. Or how would you load the contents of your database table into gnuplot, Xmgrace or Origin, just to name a few tools that come to my mind right now?

I kinda think that the guy posted this question because flatfiles/directories/symlinks are not working for him.

Having done scientific computing for 20 years or so, I would recommend a relational database from what little information that the guy has given. His data seems pretty simple and relatively small, I'm guessing around 1GB, certainly much less than 1TB. Also, since the data is so small, the guy can leave the raw data as is and write scripts to insert and retrieve the data. The DB can be wiped, and reloaded with raw data during the transition.

I believe the original poster's use and issues with symlinks is a clear indication that a relational DB is required.

SparkLab (1)

guznik (1743356) | more than 3 years ago | (#33257336)

This is exactly what SparkLab aims to solve; take a look here: http://sparklix.com/demo-movie [sparklix.com] . It's free for academic and non-profit organizations. A personal free edition will be up later this year.

Contractors/Grad Students (0)

Anonymous Coward | more than 3 years ago | (#33257348)

Hire a local contractor (read local grad students) to program a simple system for you. This really needs to be in a database which is accessed through an interface you will be comfortable with and which makes it easy for you to manipulate your data.

Write down how your data is described, how you access and update the data, as well as what output is needed from the system, like how you need to view the data in order to use it in reports or calculations. It doesn't sound like it would be very hard to write something to organize your data. A good price for something like this where I live would be three to seven hundred dollars. Find someone with a decent track record and you should be much more organized in no time.

HDF5 database files / PyTables (0)

Anonymous Coward | more than 3 years ago | (#33257392)

I have recently started using PyTables to store my data. Very fast, great compression and in Python!
http://www.pytables.org/

Two approaches... (1)

meburke (736645) | more than 3 years ago | (#33257398)

A lot depends on the type of data. If it is truly experimental results, then results could be easily organized in tables, and tables can be logically accessed, arranged and manipulated using standard rules of set theory. Relational databases work this way, but there are other approaches.

If your data is derived or crunched, you may have a massive logic problem. See this: http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html [blogspot.com] , and take heed.

The previous suggestions about leaving your data intact and refining the access are good advice. I have used and developed some network DBMS systems for this type of data. The current trend seems to be toward object-oriented network DBMS systems, but I'm not sure that is the way to go; OONDBMS tends to be static and hard to maintain in a dynamic experimental environment.

The largest experimental environment that I've had the opportunity to work on, with an energy company here in Houston, was a statistical analysis of nuclear reactions. The data was constantly changing and we needed a self-referencing, dynamic data repository. This is the type of system where you download data sets and do your analysis AFTER you have acquired it locally. The DBMS was written in FORTRAN90 and was very fast, but you need a team for something like this unless you are expert enough to program it all yourself. It actually used very little code, but the record management and indexes (mostly ISAM/inverted ISAM) took massive amounts of computer power.

There are now some cute tools in FORTRAN 2000 that allow you to use a web browser as a front end, but I don't usually want to look at the data being gathered; I usually want to crunch the statistics and see the results. The browser front-ends I have seen tend to require too much tweaking in order to adapt to the changing data parameters. Remote terminals make more sense. Maybe you should be willing to change the method of accessing the data and not try to maintain dozens of links.

Re:Two approaches...OOPS! (1)

meburke (736645) | more than 3 years ago | (#33257436)

Sorry, I forgot to include the fact that the network dbms system does not require you to rename or re-link your directory scheme. It simply creates pointers to relevant links and then maintains the pointer logic.

tags and search (find/grep) (1)

nycguy (892403) | more than 3 years ago | (#33257400)

I would just put each set of experimental data in a separate subdirectory. Within each subdirectory I'd put a file with a specific name (e.g., "description.txt") in which you briefly write up exactly what the experimental data is, how it was generated (e.g., if generated by a program, give the arguments and/or pointers to input data), and some keywords to allow it to be indexed/searched. Then I'd use your standard OS search tools to find the description file(s) you're looking for, thereby allowing you to locate your data based on its description rather than some brittle directory hierarchy.

I have a pretty standard setup for generating experimental data in my work. Whenever I run an experiment (usually a simulation), I have a wrapper script that generates a random (meaningless) subdirectory name, copies my simulation binary and configuration to that directory (so I can reproduce the results later in case either my simulator code or its configuration changes), prompts me to enter a description of what it is I'm simulating, and asks me to provide some keyword tags. The only way I can find the data afterward is to search the description files from the last step, because the data is otherwise just in a randomly-named directory.
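A stripped-down Python version of such a wrapper, under the assumption that the binary and config file names look something like these:

import os, shutil, uuid

def new_run(binary="sim", config="sim.conf", root="runs"):
    """Create a random run directory, snapshot code+config, prompt for a description."""
    run_dir = os.path.join(root, uuid.uuid4().hex)   # meaningless, collision-free name
    os.makedirs(run_dir)
    shutil.copy(binary, run_dir)
    shutil.copy(config, run_dir)
    with open(os.path.join(run_dir, "description.txt"), "w") as f:
        f.write(input("Describe this run: ") + "\n")
        f.write("keywords: " + input("Keywords: ") + "\n")
    return run_dir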

Of course, this scheme depends on you doing a decent job of describing your data and providing keywords, but I don't think you can get around that with any technique. At some point you have to inject some human labeling/categorization. Directories and symlinks are just a pretty restrictive way of organizing things.

Self-describing data (0)

Anonymous Coward | more than 3 years ago | (#33257888)

Mark each file with its description and insert each transformation after the original description, as suggested in "More Programming Pearls" by Jon Bentley.
Put some keywords in the description and feel free to add more as you go. You can use a free format or go to a more rigid organization.

SQLite + Scripting language (2, Informative)

ericbg05 (808406) | more than 3 years ago | (#33257402)

Others have already mentioned SQLite. Let me briefly expound on the features that are likely the most important to you, assuming (if you'll permit me) that you don't have much experience with databases:
0. The basic idea here is that you are replacing this whole hierarchy of files and directories with a single file that will contain all your data from an experiment. You figure out ahead of time what data the database will hold and specify that to SQLite. Then you create, update, read, and destroy records as you see fit -- pretty much as many records as you want. (I personally have created billions of records in a single database, though I'm sure you could make more.) Once you have records in your database, you can with great flexibility define which result sets you want from the data. SQLite will compute the result sets for you.

1. SQLite is easy to learn and use properly. This is as opposed to other database management systems, which require you to do lots of computery things that are probably overkill for you.

2. Your entire data set sits in a single file. If you're not in the middle of using the file, you can back up the database by simply copying the file somewhere else.

3. Transactions. You can wrap a large set of updates into a single "transaction". These have some nice properties that you will want:

3.1. Atomic. A transaction either fully happens or (if e.g. there was some problem) fully does not happen.

3.2. Consistent. If you write some consistency rules into your database, then those consistency rules are always satisfied after a transaction (whether or not the transaction was successful).

3.3. Isolated. (Not likely to be important to you.) If you have two programs, one writing a transaction to the database file while the other reads it, then the reader will either see the WHOLE transaction or NONE of it, even if the writer and reader are operating concurrently.

3.4. Durable. Once SQLite tells you the transaction has happened, it never "un-happens".

These properties hold even if your computer loses power in the middle of the transaction.

4. Excellent scripting APIs. You are a physical sciences researcher -- in my experience this means you have at least a little knowledge of basic programming. Depending on what you're doing, this might greatly help you to get what you need out of your data set. You may have a scripting language that you prefer -- if so, it likely has a nice interface to SQLite. If you don't already know a language, I personally recommend Tcl -- it's an extremely easy language to get going with, and has tremendous support directly from the SQLite developers.
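To make point 3 concrete, here is how a transaction looks from Python's built-in sqlite3 module (the table and values are invented); the connection used as a context manager commits on success and rolls back if anything raises:

import sqlite3

conn = sqlite3.connect("experiment.db")
conn.execute("CREATE TABLE IF NOT EXISTS readings (run INTEGER, value REAL)")

with conn:   # one atomic transaction for the whole block
    for v in (1.0, 2.0, 3.0):
        conn.execute("INSERT INTO readings VALUES (?, ?)", (42, v))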

Good luck and enjoy!

What about a wiki? (3, Insightful)

gotfork (1395155) | more than 3 years ago | (#33257416)

In my previous lab group we used a mediawiki install to keep track of microelectronic devices that several people were working on at the same time. These devices were still under development, so most of the data was qualitative -- images, profilometry data, IV/CV curves were all stored on the wiki page for each sample, and each page included a recipe for exactly how it was made, which made it easy to troubleshoot later. It worked pretty well for what we used it for, but once we had a working device all the in-depth data for that sample was kept separately. This seemed like a half-decent way of cataloging samples, although one would need something a bit more robust for complex data sets that don't integrate well with a wiki.

Lab book (2, Informative)

PeterKraus (1244558) | more than 3 years ago | (#33257430)

After a couple of months of working in an actual chemical company, as opposed to uni, I've realised how good it is to keep a lab book. Obviously, if you handle thousands of data points daily, it doesn't help that much, but it helps to have a written trace of everything you do in the lab.

Also, keep the actual data intact and work on a copy. We use Excel at work, but something using SQL tailored to your needs might be even better.

Master of Chaos (0)

Anonymous Coward | more than 3 years ago | (#33257458)

Name them by date and save them in one directory. That's how you'll end up saving the files for your LaTeX paper anyway.

A database ... (1)

frogzilla (1229188) | more than 3 years ago | (#33257476)

You need to start using a database. You don't have to actually put the data in a database, but all of the metadata needs to go into one. Store your data files in one file system using whatever naming scheme you want and never move the files again. At the same time, record the file system location along with all other metadata that is relevant. Then some simple database queries, e.g. embedded in some web pages, can retrieve the location and even the data. You can of course also store the data in a database as well if you wish. I personally find it more practical to do it this way.

Use tags in Apple OS X (2, Insightful)

wealthychef (584778) | more than 3 years ago | (#33257478)

If you are using Mac OS X, you can tag the files using the Finder Get Info and putting "Spotlight comments" there. Then you can easily find them based on keywords and Spotlight in constant time. The good thing about keywords is that they give you a multidimensional database effect. The bad thing I've found is I tend to forget my keywords that I'm storing stuff with, so I don't really know what to search for. OS X Spotlight is promising and might work very well for you.

Re:Use tags in Apple OS X (0)

Anonymous Coward | more than 3 years ago | (#33257522)

If you are using Mac OS X, you can tag the files using the Finder Get Info and putting "Spotlight comments" there. Then you can easily find them based on keywords and Spotlight in constant time. The good thing about keywords is that they give you a multidimensional database effect. The bad thing I've found is I tend to forget my keywords that I'm storing stuff with, so I don't really know what to search for.
OS X Spotlight is promising and might work very well for you.

Extremely unlikely that a scientist would be using OSX, but luckily Linux and Windows can do the same thing with better performance.

How CMS sorts data (2, Informative)

toruonu (1696670) | more than 3 years ago | (#33257484)

Well, CMS is one of the large experiments at the LHC. The data produced should reach petabytes per year, and once you add in the simulated data we have a hell of a lot of data to store and address. What we use is a logical filename (LFN) format. We have a global "filesystem" where different storage elements hold files organized in a hierarchical subdirectory structure. As an example:

/store/mc/Summer10/Wenu/GEN-SIM-RECO/START37_V5_S09-v1/0136/0400DDE2-F681-DF11-BA13-00215E21DC1E.root

Breaking that down:

/store -> the beginning marker of the logical filename region, which different sites can map differently (some use NFS, some use HTTP, etc.)
/mc/ -> it's Monte Carlo data
/Summer10/ -> the data was produced during the summer of 2010
/Wenu/ -> it's a simulation of a W decaying to an electron and a neutrino
/GEN-SIM-RECO/ -> the data-generation steps that have been done
/START37_.../ -> the detector conditions that were used (the actual full description of the conditions is in a central database)
/0136/ -> the serial number (actually I'm not 100% sure, but it's related to the production workflow)
/0400DDE2-F681-DF11-BA13-00215E21DC1E.root -> the actual filename; the hash is there because the process has to make sure there are no filename conflicts

Another example: /store/data/Run2010A/MinimumBias/RECO/Jul16thReReco-v1/0000/0018523B-D490-DF11-BF5B-00E08178C111.root

This file is real data, taken during the first run of 2010 and filtered into the MinimumBias primary dataset (related to event trigger content). The data files in there contain RECO content and were produced during the re-reconstruction process on July 16th. Then there's again the serial number (block edges define new serial numbers), and then the filename.

You could use a similar structure to differentiate the data files that you actually use. The good thing is that you can map such filenames separately everywhere, as long as you change the prefix according to the protocol used (we use, for example, file:, http:, gridftp:, srm:, etc.). You can also easily share data with other collaborating sites; as long as everyone uses a similar structure it works quite well, with no need for special databases. If you need lookup functionality, one option is a simple find (assuming you have filesystem access), or you can build a database in parallel and use the LFN structure to index things.
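
The prefix-mapping trick is easy to sketch in a few lines of Python (the prefixes below are hypothetical examples, not CMS's actual configuration):

    # Map a logical filename (LFN) to a site-local access path.
    SITE_PREFIXES = {
        "file": "/mnt/storage",
        "http": "http://storage.example.edu/data",
    }

    def lfn_to_path(lfn, protocol="file"):
        if not lfn.startswith("/store/"):
            raise ValueError("not an LFN: " + lfn)
        return SITE_PREFIXES[protocol] + lfn

    print(lfn_to_path("/store/mc/Summer10/Wenu/GEN-SIM-RECO/"
                      "START37_V5_S09-v1/0136/"
                      "0400DDE2-F681-DF11-BA13-00215E21DC1E.root"))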

Another vote for NoSQL and some experience (2, Informative)

wolf87 (989346) | more than 3 years ago | (#33257496)

I have seen these kinds of situations a lot (I'm a statistician who works on computationally intensive physical science applications), and the best solution I have seen was a BerkeleyDB setup. One group I work with had a very, very large number of ASCII data files (on the order of 10-100 million) in a directory tree. One of their researchers consolidated them into a BerkeleyDB, which greatly improved data management and access. CouchDB or the like could also work, but I think the general idea of a key-value store that lets you keep your data in its original structure would work well.
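
To give the flavor of it, here's a sketch using Python's standard dbm module as a stand-in for BerkeleyDB (whose Python bindings are a separate install); the key scheme and payload are made-up examples:

    import dbm

    # The key encodes the metadata you'd otherwise bury in directory names.
    with dbm.open("datasets.db", "c") as db:
        db[b"sample=lysozyme/ph=7.4/type=NMR/run=01"] = b"raw data bytes here"

    # Reopen and walk the keys.
    with dbm.open("datasets.db", "r") as db:
        for key in db.keys():
            print(key.decode())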

Knowledge Management Tools (1)

Isao (153092) | more than 3 years ago | (#33257506)

I haven't had to store experimental results like that. My work produces prototypes, some data, demos, and support documentation. There are tons of KM tools out there to manage heterogeneous data in a recoverable way. We've used document repositories like Hummingbird (acceptable) and of course SharePoint. The key (literally) is including the right metadata and tags when you check in an element. When a data set goes dormant (static), you can tarball the CVS tree or whatever and drop it in the repo. Then there's Knowledge Discovery, something we've created tools for: it lets you trace how you arrived at an idea after three hours of web/repo surfing.

First devise a meaningful stable primary key (2, Informative)

RandCraw (1047302) | more than 3 years ago | (#33257514)

First I would lay out your data using meaningful labels, like a directory named for the acquisition date + machine + username. Never change this. It will always remain valid and allow you to later recover the data if other indexes are lost. Then back up this data.
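
A sketch of such a never-changing key in Python (the machine name and layout are made-up examples):

    import datetime, getpass, os

    def primary_key_dir(machine, when=None):
        # Acquisition date + machine + username: stable for the data's lifetime.
        when = when or datetime.date.today()
        return "%s_%s_%s" % (when.strftime("%Y%m%d"), machine, getpass.getuser())

    path = os.path.join("raw", primary_key_dir("nmr600"))
    os.makedirs(path, exist_ok=True)
    print(path)  # e.g. raw/20100815_nmr600_jdoe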

Next, build indexes atop the data that semantically couple the components in ways that are meaningful or accessible. This may manifest as indexed tables in a relational database, duplicate flat files linked by a compound naming convention, Unix directory soft links, etc.

If you're processing a lot of data, your choice of indexes may have to be optimized for your data access pattern rather than the data's underlying semantics. Optimize your data organization for whatever is your weakest link: analysis runtime, memory footprint, index complexity, frequent data additions or revisions, etc.

In a second repository, maintain a precise record of your indexing scheme and, ideally, the code that automatically regenerates it. This way you (or someone else) can rebuild lost databases/indexes without repeating all of your design decisions, data-cleansing steps, and domain expertise. This info is often stored in a lab notebook (or nowadays in an 'electronic lab notebook').

I'd emphasize that if you can't remember how your data is laid out or pre-conditioned, your analysis of it may be invalid or unrepeatable. Be systematic, simple, obvious, and keep records.

Well, I'd have to say (0)

Anonymous Coward | more than 3 years ago | (#33257516)

In the butt Bob.

SharePoint! (0)

Anonymous Coward | more than 3 years ago | (#33257610)

SharePoint list items are basically a database. You can sort and group by any parameter and attach files. No programming or special technical knowledge necessary.

PostgreSQL (1)

Alanonfire (1415379) | more than 3 years ago | (#33257626)

I did a little bioinformatics in the past, and we used PostgreSQL to manage our results. It was nice because you can create meaningful fields to query later. It took some time to develop the system, but it really helped in the long run. We had to account for errors in the readings of the results and incorporate a little bit of fuzzy logic into the tools we used to run comparisons against the database.

If you are at a university or near a university, the computer science dept may give a few students credit to build you a system that can handle it, so you don't have to.

test lists and RCS (1)

wrench turner (725017) | more than 3 years ago | (#33257670)

Instead of sorting datasets, use a test-list database (flat files). Each test contains/links/points to its dataset, and the test lists are selected at test run time. Each entry in a test list tells how to generate the specific test environment for that test: the test itself, the RCS tag/version of the test to be "gotten", the test seed, an array of exit codes that should be retried and how many retries are allowed, whether the test is gating, and an array of test dependencies. A test run can be considered to pass even though an individual, non-gating test fails. One test entry may extract and prepare the test data, and other dependent entries can then run against that dataset.

LIMS! This is a no-brainer! (1)

Wdi (142463) | more than 3 years ago | (#33257682)

It seems you have never heard of LIMS (Laboratory Information Management Systems), which is unfortunate.

This is a thriving software sector, and you are actually expected to be at least vaguely familiar with these kinds of systems should you ever transfer to industry and work in a data-generating or data-processing position.

Nobody in industry keeps experimental data as individual, handcrafted datasets. The risk of losing important data, or of not being able to make cross-references (patents!), is much too high if you let people run their own set-ups. Do yourself, and your research group, a favor: get some grant money and purchase a robust commercial set-up, at least for your group, or better yet for your department. Entry-level systems, with academic discounts, are affordable. There are no competitive open-source solutions.

Start your research here:

http://en.wikipedia.org/wiki/LIMS

(though the systems listed there are instrument-centric; if you are more into generic chemistry, there are other standard packages by companies such as Accelrys and CambridgeSoft).

Used to be two-word answer (1)

MarkusQ (450076) | more than 3 years ago | (#33257712)

I used to have a two-word answer for this question: Use BeOS

But now it's a six-word answer (*sigh*): Invent time machine, then use BeOS

--MarkusQ

Computation project organization (0)

Anonymous Coward | more than 3 years ago | (#33257760)

This was developed in the context of computational biology experiments, but should hold true for other types of computational projects:

http://www.ncbi.nlm.nih.gov/pubmed/19649301

Learn a Relational Database! (1)

theNAM666 (179776) | more than 3 years ago | (#33257774)

It's already been said, but it bears saying again. Directories and symlinks.... oh my!

Spotlight (0)

Anonymous Coward | more than 3 years ago | (#33257784)

Google Desktop on a PC and Spotlight on my Mac have helped me a great deal.

backups are for wimps (1)

Gothmolly (148874) | more than 3 years ago | (#33257788)

Real men upload their stuff to kernel.org and let the world mirror it.

By date, external file for metadata. (1)

goodmanj (234846) | more than 3 years ago | (#33257800)

Here's what I do:

Directory for each data set, labeled by date (20100815).
Short README file inside each directory with description of the run.
Big spreadsheet (or database, if you're fancy) with experimental parameters and core results, that can be sorted, reorganized, and graphed.
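
A sketch of that layout in script form, if you want it automated (the directory name, README text, and spreadsheet columns are made-up examples):

    import csv, datetime, os

    day = datetime.date.today().strftime("%Y%m%d")
    os.makedirs(day, exist_ok=True)

    # Short README describing the run.
    with open(os.path.join(day, "README"), "w") as f:
        f.write("Titration run, sample X, pH sweep 5.0-8.0\n")

    # One master CSV that stays sortable, reorganizable, and graphable.
    with open("experiments.csv", "a", newline="") as f:
        csv.writer(f).writerow([day, "sample X", 7.4, "peak at 280 nm"])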
