Distributed Databases?
yamla asks: "I am interested in learning about distributed, fault-tolerant databases. That is, a database (not necessarily SQL) where the data is spread out (not replicated) amongst a large number of computers and furthermore, any reasonable number of those computers could disconnect or reconnect at any time without making it impossible to retrieve the stored data. I think this is a far more interesting problem than peer-to-peer because, provided such a solution scales, it would seem to solve the decentralised peer-to-peer problems. It would also seem to open up all kinds of new applications which we have hardly begun to think about yet. So I'm interested in good places to go to read up on (potential) solutions."
LDAP (Score:1)
If your data can be organized into a hierarchy, is small in size, and is read much more than it is written, consider using an LDAP data store. There are excellent open source [openldap.org] and commercial [iplanet.com] options available for Linux.
Get started by reading this nice series of tutorials from LinuxWorld [linuxworld.com].
After that, help yourself to some of the free schemas here [hklc.com].
Novell NDS (Score:1)
NDS is a distributed database of NetWare accounts, passwords, server metadata, etc.
NDS is far from a general purpose database. Still, reading up on NDS will give you a sense of the issues involved, i.e. how they chose to implement replication, db integrity issues, time sync issues, etc.
Once upon a time there was some NDS-to-ODBC glue that could be used to submit SQL queries to NDS; I don't know if it still exists.
Novell hasn't opened up any source worth mentioning, but you can get free (beer) copies of NetWare from your local reseller or education center. Just call them up and tell them you need an eval copy of NetWare.
NDS now has a version running on top of Linux, but I'm not sure if there is a free eval version available.
NDS is complicated stuff, and you really don't see any payback until you've got more than 4-5 servers. But once you've got 20+ servers in 10 sites sharing the same account database, you'll really like it.
Slightly simpler problems (Score:3)
There are a couple of solutions, but all (iirc) ultimately reduce to an even simpler question: how do you manage distributed messaging? A classic example of this is "buy" and "sell" orders in a distributed stock exchange - there needs to be some way of ensuring that all parties can agree on the ordering of all messages. Disagreement on ordering can have major ramifications since it can affect the price paid, possibly even whether the stock was obtained at all. Likewise, the ordering of file system reads and writes can determine what gets written to disk and/or what gets fed to running applications.
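To make the ordering problem concrete, here is a minimal sketch of one well-known approach, Lamport logical timestamps with the node name as a tie-breaker. It is not something the post above prescribes; the names Stamp, LamportClock and total_order are made up for illustration. Every node that sees the same set of stamped messages sorts them into the same order, no matter what order they arrived in.

```python
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class Stamp:
    """Logical timestamp; compares by counter first, then node_id (tie-break)."""
    counter: int
    node_id: str


class LamportClock:
    """Hypothetical per-node logical clock (illustrative, not a real library)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Stamp a message generated on this node."""
        self.counter += 1
        return Stamp(self.counter, self.node_id)

    def observe(self, stamp):
        """Move the local counter past a stamp seen on an incoming message."""
        self.counter = max(self.counter, stamp.counter) + 1


def total_order(messages):
    """Sort (stamp, payload) pairs into the order every node agrees on."""
    return sorted(messages, key=lambda m: m[0])


if __name__ == "__main__":
    a, b = LamportClock("A"), LamportClock("B")
    buy = (a.tick(), "buy 100 @ 10")
    sell = (b.tick(), "sell 100 @ 10")
    b.observe(buy[0])               # B receives A's order
    later = (b.tick(), "sell 50 @ 11")
    for stamp, payload in total_order([later, sell, buy]):
        print(stamp, payload)
```

Note that this only gives everyone the same order once they hold the same messages. Knowing when it is safe to act on a message (i.e. that nothing earlier is still in flight) takes more machinery than this sketch shows.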
Once you have that, you can start looking at recovery issues in filesystems. IIRC, all come down to a question of how many systems you write data to, and how many systems you read data from. When you read data you'll often get multiple versions of the same information (because of update latency) and you need to know how to determine which is the most current.
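The two extremes are "everyone has everything" (total replication) and "only one server has each item" (multiple independent and disjoint servers). Depending on expected loads (esp. the ratio of reads to writes) you might see a policy of reading from a third of all systems and writing to 2/3 of them. The point of such a split is that the set of systems you read from overlaps the set you last wrote to, so at least one reply carries the current version (strictly, the read and write counts have to add up to more than the total number of systems). No system will have all data, but all will have most.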
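The read-some, write-some policy can be sketched as a toy quorum store. This is a single-client, in-memory illustration only: Replica and QuorumStore are hypothetical names, version numbers stand in for "which copy is most current", and concurrent writers and failed nodes are ignored. The one rule it enforces is that the read count r and write count w must add up to more than the number of replicas n, so any read set overlaps the latest write set.

```python
import random


class Replica:
    """One server holding key -> (version, value)."""

    def __init__(self):
        self.store = {}

    def write(self, key, version, value):
        current = self.store.get(key)
        # Keep only the newest version we have been told about.
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        return self.store.get(key)


class QuorumStore:
    """Toy client that reads from r replicas and writes to w of n replicas."""

    def __init__(self, n, r, w, seed=None):
        if r + w <= n:
            raise ValueError("need r + w > n so read and write sets overlap")
        self.replicas = [Replica() for _ in range(n)]
        self.r, self.w = r, w
        self.rng = random.Random(seed)

    def get_versioned(self, key):
        """Ask r random replicas and return the newest (version, value) seen."""
        picked = self.rng.sample(self.replicas, self.r)
        replies = [rep.read(key) for rep in picked]
        replies = [reply for reply in replies if reply is not None]
        return max(replies, key=lambda reply: reply[0]) if replies else None

    def get(self, key):
        latest = self.get_versioned(key)
        return latest[1] if latest else None

    def put(self, key, value):
        """Find the newest version via a read quorum, then write version + 1."""
        latest = self.get_versioned(key)
        version = (latest[0] if latest else 0) + 1
        for rep in self.rng.sample(self.replicas, self.w):
            rep.write(key, version, value)
        return version


if __name__ == "__main__":
    db = QuorumStore(n=6, r=3, w=4)
    db.put("price", 10)
    db.put("price", 11)
    print(db.get("price"))
```

With six replicas, reading from exactly two (a third) and writing to exactly four (two-thirds) is rejected by the constructor, because a read could then miss every replica that took the latest write. Bumping either count by one fixes that.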
All of this points to an unstated assumption in your question. "Distributed" means more than one thing. To someone who has studied algorithms, it usually refers to designs that maximize availability despite network partitioning (e.g., line cuts or court injunctions against some servers). These algorithms require substantial, if not complete, data replication.
To many people, "distributed" also means what we would call "partitioned" algorithms, where multiple sites each work on a small part of the problem and the results are combined later. Examples are factorization efforts and SETI@home. These algorithms don't require replication, but they are highly vulnerable to partitioning.
What problem are you trying to solve with this distributed database?
There are seminal research papers on this (Score:2)
IBM had a distributed database project going on back in the System-R days, and they never really got it working. I worked on the Mariposa project at U.C. Berkeley which attempted to solve some of this problem, and it didn't really get that far beyond a data warehousing context. The problem of ensuring that replication along with ownership and transactional semantics were preserved just became too difficult to solve in a purely generic way.
If you're just interested in high availability query processing, the Mariposa work is probably pretty relevant (a company called Cohera [cohera.com] tried to commercialize it). If you're interested in distributed transactions, you've walked into the realm of Tuxedo [bea.com] (by BEA Systems [bea.com]; caveat: a former employer). While specific instances of the problem CAN be solved, one general purpose system is going to have significant problems, so it's best to categorize what you're interested in solving.
I highly recommend that you dive into the big Stonebraker/Hellerstein book on database system implementation research papers [berkeley.edu] and start reading up. It's a VERY difficult problem. Hellerstein is part of a new project [berkeley.edu] which is also trying to solve some of the problems in a different way.
Ignore me (Score:2)
I'm feeling particularly dumb right now.
Re:There are seminal research papers on this (Score:1)
I just found one of Jim Gray's replication papers (he's now at Microsoft). Is this the one you were talking about?
The Dangers of Replication and a Solution (1996) [nec.com]
Re:ummm... (Score:1)
Samba Information HQ
dns (Score:2)
All your events [openschedule.org] are belong to us.
ummm... (Score:3)
There is a db project by Ericsson (Score:1)
It's called Mnesia [erlang.se].
Also visit Erlang [erlang.org].
It's not necessarily SQL based.
Couldn't have said it better myself. (Score:1)
Re:ummm... (Score:1)
Well I am not sure of that, but I'm pretty sure that all your base are belong to us.