Distributed Databases?

Posted by Cliff on Saturday February 17, 2001 @06:20AM from the spreading-it-out-all-over dept.

yamla asks: "I am interested in learning about distributed, fault-tolerant databases. That is, a database (not necessarily SQL) where the data is spread out (not replicated) amongst a large number of computers and furthermore, any reasonable number of those computers could disconnect or reconnect at any time without making it impossible to retrieve the stored data. I think this is a far more interesting problem than peer-to-peer because, provided such a solution scales, it would seem to solve the decentralised peer-to-peer problems. It would also seem to open up all kinds of new applications which we have hardly begun to think about yet. So I'm interested in good places to go to read up on (potential) solutions."

LDAP (Score:1)

by Pauly ( 382 ) writes:

If your data:
- Can be organized into a hierarchy
- Is small in size
- Is read much more than it is written
consider using an LDAP data store. There are excellent open source [openldap.org] and commercial [iplanet.com] options available for Linux.
Get started by reading this nice series of tutorials from LinuxWorld [linuxworld.com].
After that, help yourself to some of the free schemas here [hklc.com].
Novell NDS (Score:1)

by Clover_Kicker ( 20761 ) writes:

Take a look at Novell's NDS [novell.com].
NDS is a distributed database of NetWare accounts, passwords, server metadata, etc.

NDS is far from a general purpose database. Still, reading up on NDS will give you a sense of the issues involved, i.e. how they chose to implement replication, db integrity issues, time sync issues, etc.

Once upon a time there was some NDS-to-ODBC glue that could be used to submit SQL queries to NDS; I don't know if it still exists.

Novell hasn't opened up any source worth mentioning, but you can get free (beer) copies of NetWare from your local reseller or education center. Just call them up and tell them you need an eval copy of NetWare.

NDS now has a version running on top of Linux, though I'm not sure if there is a free eval version available.

NDS is complicated stuff, and you really don't see any payback until you've got more than 4-5 servers. But once you've got 20+ servers in 10 sites sharing the same account database, you'll really like it.
Slightly simpler problems (Score:3)

by coyote-san ( 38515 ) writes: on Saturday February 17, 2001 @10:35AM (#424825)

A slightly simpler problem can give you insights into possible solutions. How do you manage a distributed file system? That is, something that looks like a single file system but can keep operating through, and recover from, partitioning?

There are a couple of solutions, but all (iirc) ultimately reduce to an even simpler question: how do you manage distributed messaging? A classic example of this is "buy" and "sell" orders in a distributed stock exchange: there needs to be some way of ensuring that all parties can agree on the ordering of all messages. Disagreement on ordering can have major ramifications since it can affect the price paid, possibly even whether the stock was obtained at all. Likewise, the ordering of file system reads and writes can determine what gets written to disk and/or what gets fed to running applications.
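One common way to get every party to agree on a single message order is a Lamport logical clock with the sender name as a tie-break. The post does not name a technique, so this is a swapped-in illustration; the `Node` and `Message` classes and the order bodies are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(order=True, frozen=True)
class Message:
    # Ordering compares (timestamp, sender); the body is ignored.
    timestamp: int
    sender: str
    body: str = field(compare=False)


class Node:
    """A participant that stamps outgoing messages with a logical clock."""

    def __init__(self, name):
        self.name = name
        self.clock = 0

    def send(self, body):
        # Every local event advances the clock before stamping.
        self.clock += 1
        return Message(self.clock, self.name, body)

    def receive(self, msg):
        # Jump past the sender's timestamp so later sends sort after it.
        self.clock = max(self.clock, msg.timestamp) + 1


def agreed_order(messages):
    """Sort by (timestamp, sender): identical on every node, whatever the arrival order."""
    return sorted(messages)
```

Two nodes that receive the same set of messages in different arrival orders compute the same `agreed_order`. A real system also has to know when no earlier message can still be in flight before acting on the order, which this sketch leaves out.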

Once you have that, you can start looking at recovery issues in filesystems. IIRC, all come down to a question of how many systems you write data to, and how many systems you read data from. When you read data you'll often get multiple versions of the same information (because of update latency) and you need to know how to determine which is the most current.

The two extremes are "everyone has everything" (total replication) and "only one server has each item" (multiple independent and disjoint servers). Depending on expected loads (esp. the ratio of reads to writes) you might see a policy of reading from just over a third of all systems and writing to 2/3 of them, so that every read overlaps the most recent write. No system will have all data, but all will have most.
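The read-some, write-some policy can be sketched as a quorum store. The assumption here is the usual overlap rule: with N replicas, a write quorum W and a read quorum R must satisfy R + W > N, so every read touches at least one replica that holds the latest write. The class names and the version-number scheme are hypothetical, and the sketch is in-memory and single-threaded.

```python
class Replica:
    def __init__(self):
        self.store = {}      # key -> (version, value)
        self.online = True


class QuorumStore:
    def __init__(self, n, write_quorum, read_quorum):
        if write_quorum + read_quorum <= n:
            raise ValueError("need R + W > N so read and write sets overlap")
        self.replicas = [Replica() for _ in range(n)]
        self.w = write_quorum
        self.r = read_quorum

    def _online(self, needed):
        up = [rep for rep in self.replicas if rep.online]
        if len(up) < needed:
            raise RuntimeError("not enough replicas online for a quorum")
        return up

    def _read_quorum(self, key, up):
        # Deliberately read from the *last* R online replicas and write to
        # the *first* W, so correctness depends only on the overlap.
        return [rep.store[key] for rep in up[-self.r:] if key in rep.store]

    def write(self, key, value):
        up = self._online(max(self.w, self.r))
        seen = [version for version, _ in self._read_quorum(key, up)]
        version = max(seen, default=0) + 1
        for rep in up[:self.w]:
            rep.store[key] = (version, value)

    def read(self, key):
        answers = self._read_quorum(key, self._online(self.r))
        if not answers:
            raise KeyError(key)
        # Several versions may come back; the highest version is the most current.
        return max(answers, key=lambda answer: answer[0])[1]
```

With N=6, W=4, R=3, a replica that was offline during a write still holds the old version when it returns, but any read quorum also includes a replica with the newer one.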

All of this points to an unstated assumption in your question. "Distributed" means more than one thing. To someone who has studied algorithms, it usually refers to designs that maximize availability despite network partitioning (e.g., line cuts or court injunctions against some servers). These algorithms require substantial, if not complete, data replication.

To many people, "distributed" also means what we would call "partitioned" algorithms, where multiple sites each work on a small part of the problem and the results are combined later. Examples are factorization efforts and SETI@home. These algorithms don't require replication, but they are highly vulnerable to partitioning.
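A toy version of the "partitioned" style, assuming a trial-division factor search split into disjoint ranges. The function names are hypothetical, and the workers are simulated by plain function calls rather than separate machines.

```python
def split_range(start, stop, parts):
    """Cut [start, stop) into at most `parts` disjoint chunks (assumes stop > start)."""
    step = -(-(stop - start) // parts)   # ceiling division
    return [(lo, min(lo + step, stop)) for lo in range(start, stop, step)]


def find_factors(n, lo, hi):
    """The work one site does: test only the candidates in its own chunk."""
    return [d for d in range(lo, hi) if n % d == 0]


def combine(partials):
    """Merge the per-site results once they have all been returned."""
    return sorted(f for part in partials for f in part)
```

No chunk is held by more than one worker, so nothing is replicated. The flip side is that if the site holding the chunk that contains a factor drops off the network, the combined answer is simply missing that factor.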

What problem are you trying to solve with this distributed database?

There are seminal research papers on this (Score:2)

by MemRaven ( 39601 ) writes:

and they all come down to one thing: it can't be done very well, and we should all stop trying. It all got summed up by Jim Gray in a paper I can't find a link to right now.
IBM had a distributed database project going on back in the System-R days, and they never really got it working. I worked on the Mariposa project at U.C. Berkeley, which attempted to solve some of this problem, and it didn't really get that far beyond a data warehousing context. The problem of preserving replication, ownership, and transactional semantics all at once just became too difficult to solve in a purely generic way.
If you're just interested in high availability query processing, the Mariposa work is probably pretty relevant (a company called Cohera [cohera.com] tried to commercialize it). If you're interested in distributed transactions, you've walked into the realm of Tuxedo [bea.com] (by BEA Systems [bea.com]; caveat: a former employer). While specific instances of the problem CAN be solved, one general purpose system is going to have significant problems, so it's best to categorize what you're interested in solving.
I highly recommend that you dive into the big Stonebraker/Hellerstein book on database system implementation research papers [berkeley.edu] and start reading up. It's a VERY difficult problem. Hellerstein is part of a new project [berkeley.edu] which is also trying to solve some of the problems in a different way.
Ignore me (Score:2)

by MemRaven ( 39601 ) writes:

I was specifically referring to the issues with a transactional SQL database, and I failed to read one parenthetical bit on the original post.
I'm feeling particularly dumb right now.
Re:There are seminal research papers on this (Score:1)

by Jason Pollock ( 45537 ) writes:

I just found one of Jim Gray's replication papers (he's now at Microsoft). Is this the one you were talking about?
The Dangers of Replication and a Solution (1996) [nec.com]
Re:ummm... (Score:1)

by mbyte ( 65875 ) writes:

A raid-5 volume doesn't mirror the data either ... so ..
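The RAID-5 point can be sketched with XOR parity: each block is stored once, plus one parity block per stripe, and any single lost block is rebuilt from the survivors. This is a simplified illustration with hypothetical function names. It assumes equal-length blocks and keeps the parity in the last slot, whereas real RAID-5 rotates parity across the disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe, lost_index):
    """Rebuild the block at lost_index from every other block in the stripe."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)
```

Three data blocks cost four blocks of storage rather than the six a mirror would need, yet any one "disk" can vanish without losing data.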

Samba Information HQ
dns (Score:2)

by po_boy ( 69692 ) writes:

the DNS setup is a pretty good example of a distributed database.

All your events [openschedule.org] are belong to us.
ummm... (Score:3)

by nomadic ( 141991 ) writes: <`nomadicworld' `at' `gmail.com'> on Saturday February 17, 2001 @01:34AM (#424831)

Wait, if they're not replicated, how could you get data from a machine that's been brought offline?
--

There is a db project by Ericsson (Score:1)

by mikehoskins ( 177074 ) writes:

that does what you want....
It's called Mnesia: Mnesia [erlang.se]
Also Visit Erlang [erlang.org]
It's not necessarily SQL based.
Couldn't have said it better myself. (Score:1)

by Operandi ( 231803 ) writes:

I absolutely agree. I consider the lack of a truly horizontally scalable db solution the final enigmatic issue of highly available computing. Preemptive strike: No, I do not consider replication a solution, nor passive/active clusters either. I want a cluster of servers serving a database stored on shared storage, all live at the same time and all doing update/select/insert/deletes. So basically you could scale a db just by putting a load balancing switch in front of it, adding servers to the cluster, and connecting them to the shared storage. (EMC or what-have-you.) If anyone has any ideas for this let me know. (As far as I know Oracle 9i Real Applications doesn't do this. It has an active/passive arrangement where if the master goes down one of the slaves will come up within 17 seconds, iirc.)
Re:ummm... (Score:1)

by All Your Base Are Be ( 318215 ) writes:

Got any answers? Or does your base belong to us?

Well I am not sure of that, but I'm pretty sure that all your base are belong to us.
