
"Trends and Information on AI, Big Data, Data Science, New Data Management Technologies, and Innovation."


Feb 8 11

“Distributed joins are hard to scale”: Interview with Dwight Merriman.

by Roberto V. Zicari

On the topic of NoSQL databases, I asked a few questions to Dwight Merriman, CEO of 10gen.

You can read the interview below.

RVZ

Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?

Dwight Merriman:
The catalyst for acceptance of new non-relational tools has been the desire to scale – particularly the desire to scale operational databases horizontally on commodity hardware. Relational is a powerful tool that every developer already knows; using something else has to clear a pretty big hurdle. The push for scale and speed is helping people clear that hurdle and add other tools to the mix. However, once the new tools are part of the established set of things developers use, they often find there are other benefits – such as agility of development of data-backed applications. So we see two big benefits: speed and agility.

Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them? How do you position 10gen?

Dwight Merriman:
The products vary on a few dimensions:
– what’s the data model?
– what’s the scale-out / data distribution model?
– what’s the consistency model (strong consistency, eventual consistency, or both)?
– does the product make development more agile, or does it focus only on scaling?

MongoDB uses a document-oriented data model, auto-sharding (range-based partitioning) for data distribution, and sits toward the strong-consistency side of the spectrum (for example, atomic operations on single documents are possible). Agility is a big priority for the MongoDB project.
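To make the single-document atomicity concrete, here is a minimal sketch using the PyMongo driver; the database, collection and field names are invented for illustration, and a local mongod on the default port is assumed.

```python
# Minimal sketch of an atomic operation on a single MongoDB document,
# using the PyMongo driver. Names are illustrative only.
from pymongo import MongoClient

items = MongoClient("mongodb://localhost:27017")["shop"]["inventory"]

# A document is a self-contained, possibly nested record.
items.insert_one({"sku": "A-100", "qty": 5, "tags": ["new", "sale"]})

# $inc is applied atomically to this one document: no reader can observe
# a partially applied update to it.
items.update_one({"sku": "A-100"}, {"$inc": {"qty": -1}})
```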

Q3. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. Do you agree with this? How do these new data stores compare with object-oriented databases?

Dwight Merriman:
Some products have a simple philosophy, and others in the space do not. However, a little bit must be left out if one wants to scale well horizontally. The philosophy of the MongoDB project is to try to make the functionality rich, but not to add any features that make scaling hard or impossible. Thus there is quite a bit of functionality: ad hoc queries, secondary indexes, sorting, atomic operations, etc.
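A short sketch of the features just listed (ad hoc queries, secondary indexes, sorting), again with PyMongo and invented names, may help make the point concrete.

```python
# Sketch of ad hoc queries, a secondary index, and sorting with PyMongo.
# Collection and field names are invented; a local mongod is assumed.
from pymongo import ASCENDING, MongoClient

items = MongoClient("mongodb://localhost:27017")["shop"]["inventory"]

# Secondary index on a non-key field.
items.create_index([("qty", ASCENDING)])

# Ad hoc query plus sort, with no schema declared up front.
for doc in items.find({"qty": {"$lt": 10}}).sort("qty", ASCENDING):
    print(doc["sku"], doc["qty"])
```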

Q4. Recently you announced the completion of a round of funding adding Sequoia Capital as a new investment partner. Why is Sequoia Capital investing in 10gen, and how do you see the database market shaping up?

Dwight Merriman:
We think “NoSQL” will be an important new tool for companies of all sizes. We expect large companies in the future to have three types of data stores in house: a traditional RDBMS, a NoSQL database, and a data warehousing technology.

Q5. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?

Dwight Merriman:
We agree. It has nothing to do with SQL. It has to do with relational, though – distributed joins are hard to scale.

Q6. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes.
Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?

Dwight Merriman:
Without loss of generality, we believe no. With loss of generality, perhaps yes. For example, if you say, “I only want star schemas, I only want to bulk load data at night and query it all day, I only want to run a few really expensive queries, not millions of tiny ones” – then it works, and you are now in the realm of the relational business intelligence / data warehousing products, which do scale out quite well.

If you put some restrictions on – say, require the user to design their data models in certain ways, or force them to run a proprietary low-latency network, or only have 10 nodes instead of 1000 – you can do it. But we think there is a need for scalable solutions that work on commodity hardware, particularly because cloud computing is such an important trend for the future. And there are other benefits too, such as developer agility.

Jan 28 11

On Graph Databases: Interview with Daniel Kirstenpfad.

by Roberto V. Zicari

I wanted to know more about Sones, a relatively new graph database company based in Germany.
I therefore interviewed Daniel Kirstenpfad, their CTO.

RVZ

Q1. Sones was founded in 2007 in Germany, with the aim of developing an object oriented database with its own file system technology. What motivated you to start the company?

Daniel Kirstenpfad: Building a product and a new technology with a great team surely was the main motivation to start sones. What started as a private project took off in 2007 when a business angel believed in our idea, and it really started to fly with substantial investments by T-Venture, TGFS and KfW. A great team and great opportunities are the things that keep that motivation high for everyone at sones.

Q2. Your database, ‘GraphDB’, is a graph database. Why did you develop a graph database? How does it differ from an object oriented database? Is GraphDB a NoSQL database?

Daniel Kirstenpfad: Well, once upon a time sones got its name from the words “SOcial NEtwork Systems”. At that time we were mainly focused on tools and technologies for social networks. We started to develop those technologies and found out that a graph-theoretical approach is the most natural one for typical problems in social networks.

When we drilled deeper into graph theory we found out that there are a bazillion real problems people are having which can be solved faster and more elegantly with a graph-theory-based database management system – not to mention the new opportunities such a graph database brings.

Taking the knowledge from the tools and use cases we had built previously, we started work on the multi-purpose graph DBMS that is today the open source “sones GraphDB”.

The sones GraphDB is a graph database management system which includes many important features that object oriented databases typically have. Object oriented databases typically lack specific graph features like graph algorithms and easy-to-use property graph data structures (vertices, edges (directed/undirected), hyper edges, …). More importantly, we think that an easy-to-learn and easy-to-use query language, derived from things users already know like SQL, needs to be included in a database these days – in sones GraphDB we have those features and more.

We think of NoSQL as “not only SQL”. When we think of SQL as relational databases, we think that RDBMS technology is still very important for the use cases that RDBMSs are tailored to solve. It’s just that people have found a huge number of problems that cannot be solved in a scalable and performant way with the SQL approach.

When we think of SQL as a query language, we think that having easy-to-use functionality for ad-hoc queries on a large set of structured data is probably one of the things that made those RDBMSs so popular. It’s the reason we at sones believe that it’s very important to have an easy-to-use query language which you can use to do ad-hoc queries on large sets of semi-structured and unstructured data. It’s the reason why the query language you use to query the sones GraphDB is based on the syntax of SQL: it’s easy to use and easy to learn. A typical user can write and run rather complex queries within minutes.

Q3. One feature of a graph database is to allow linking data from various SQL databases. How does it work in practice? And who needs such a feature?

Daniel Kirstenpfad: Data in most cases does not stand alone like a list. In most cases data comes with links and relations between the data. SQL is highly limited when working with complex linked data sets. Even worse, with RDBMSs the query performance is significantly impacted when schemas get more complex. To overcome those performance and scalability limitations it can be helpful to establish a graph database layer which handles those complex linking needs. So in practice the user keeps his RDBMS databases and establishes what we call a “metadata repository” across his different RDBMS databases. This metadata repository is basically a large graph of all the edges that link data sets, giving the user the ability to do ad-hoc graph queries globally on a virtually unlimited number of different RDBMS databases (or data silos, as we call them).
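As a toy illustration of the idea (not sones’ actual implementation), one can picture the metadata repository as a graph whose vertices identify rows in separate relational silos and whose edges capture the cross-silo links; all silo, table and key names below are invented.

```python
# Toy illustration of a "metadata repository" spanning relational silos,
# modeled with networkx. Not sones' implementation; names are invented.
import networkx as nx

g = nx.DiGraph()

# Vertices are (silo, table, primary key) tuples.
g.add_edge(("crm", "customer", 42), ("billing", "account", 7), label="OWNS")
g.add_edge(("billing", "account", 7), ("support", "ticket", 993), label="OPENED")

# An ad hoc graph query across silos: which support tickets are reachable
# from CRM customer 42?
reachable = nx.descendants(g, ("crm", "customer", 42))
print([v for v in reachable if v[0] == "support"])
```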

That also answers the question of which users benefit the most from such a feature: everyone who has multiple relational databases, each of which stores a different aspect of the data, can benefit by establishing a globally available graph metadata repository.

Q4. How do you handle semistructured and unstructured data?

Daniel Kirstenpfad: Semistructured data is a compromise between structured and unstructured data without sacrificing important things. To handle semistructured data the sones GraphDB comes with a dynamic data scheme and consistency criteria. This data scheme is an extension of the OOP data model allowing inheritance, abstract types, data streams (binary – unstructured data), undefined attributes, and Editions and Revisions on Object Namespaces and Object Instances. Having a dynamic scheme means that the user can change the data scheme anytime without performance penalties. Unstructured data is handled either using undefined attributes on edges and vertices or using the binary data streams, which can be handled by the sones GraphDB without any mapping to external storage – basically, binary data is stored with the data objects.

Q5. How do you ensure scalability and performance?

Daniel Kirstenpfad: Ensuring scalability and performance was a major design goal through all development steps of the sones GraphDB. Because there are so many possible ways of enhancing the scalability and performance of a graph database, we have been busy implementing some of those ways and will be busy implementing more in the future. Currently master-slave replication is one key factor in scaling the query performance of a data set. By replicating a data set onto multiple machines and running queries on read slaves, the number of possible queries per second scales nearly linearly. In the future the sones GraphDB will have an easy-to-use and easy-to-extend framework to partition a graph onto multiple machines – coupled with the master-slave replication, this allows virtually unlimited sizes of data sets and a virtually unlimited number of simultaneous queries.
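A minimal sketch of why read slaves scale query throughput: writes go to the master while read-only queries are spread round-robin over the replicas. The classes below are placeholders, not the sones GraphDB API.

```python
# Minimal sketch of master-slave read scaling. Placeholder objects only;
# any object with an execute() method would do.
import itertools

class ReplicatedStore:
    def __init__(self, master, slaves):
        self.master = master
        self._readers = itertools.cycle(slaves)    # round-robin over read slaves

    def write(self, statement):
        return self.master.execute(statement)      # all writes hit the master

    def read(self, query):
        return next(self._readers).execute(query)  # reads fan out to the slaves
```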

Q6. How do you see the market for Cloud computing and open source?

Daniel Kirstenpfad: We see a very diverse cloud computing market with some bigger and some smaller cloud platform providers on the one hand and many customers who want to utilize these cloud platforms, either as a software vendor or a platform user, on the other hand.

Basically, the cloud for us is another opportunity to host the services the sones GraphDB provides. It’s important for us to support the major cloud platforms like Microsoft’s Windows Azure and Amazon’s EC2. There are some cool technologies and frameworks that can be utilized by a database to store and access data in a fast and very scalable way.

What most of those cloud platforms lack these days is an actually usable system to allow software vendors like sones to publish their applications and services. Currently many users need to take the not-so-scenic route until they get their instance of something hosted on a cloud platform. It would be great to have better software vendor integration and therefore allow the users to take the scenic route to the service.

Q7. sones offers its product using a dual licensing scheme: open source software under AGPLv3, and a full enterprise version. Why did you choose a dual licensing scheme and how do you handle the two license models?

Daniel Kirstenpfad: We started as a closed source company – mainly because we wanted to have something to share before going open source. Making the whole sones GraphDB open source was always an option – which we finally chose to take in mid-2010. Since then there has been an Open Source Edition of the sones GraphDB available which shares its code with the enterprise version.

What differentiates the Open Source Edition and the Enterprise Edition are feature plugins which extend the functionality in certain ways. There are plans to gradually publish the previously closed source plugins under open source licenses, however. We chose this dual licensing scheme because, first of all, it gives us and the software the most flexibility. We can combine closed source technology – from partners, for example – and open source technology using dual licensing. And we think that this will suit most of our users’ needs best.

Q8. Sones has recently concluded a round of financing. What is your strategy for the next months?

Daniel Kirstenpfad: First of all we want to broaden our presence. One key factor in enabling a great community is to start an open dialogue with the community. We want to bring our team to the next level – that means expanding the development, customer support, press activities and management teams. We are furthermore going to expand our cooperation with universities and partners. Beyond anything, partners are another key factor which will lead to the success of the sones GraphDB.


Jan 25 11

Agile data modeling and databases.

by Roberto V. Zicari

In a previous interview with our expert Dr. Michael Blaha, we talked about Patterns of Data Modeling. This is a follow up interview with Dr. Blaha on a related topic: Agile data modeling and databases.

RVZ

Q1. Many developers are familiar with the notion of agile programming. What is the main promise of agile programming? And how does it relate to conceptual data modeling?

Blaha: Agile programming focuses on writing code quickly and showing the evolving results to the customer. Agile programming is a reaction to broken software engineering practices where a lengthy and tedious process keeps software hidden until the very end.

However, software engineering need not be ponderous if you use agile modeling. Conceptual data models can be constructed live, in front of an audience — I know this for a fact, because I often build models this way. Business users understand a data model, if you explain it as you go. They are patient as long as there is clear progress and they can see that the data model reflects their business goals.

Q2. You have posted a series of videos on YouTube that demonstrate agile data modeling. What is the main message you want to convey?

Blaha: The videos demonstrate that it is possible to rapidly construct a conceptual data model. (Many developers do not realize that it is possible.) The videos use the example of developing library asset management software and are representative of how I build models.

Q3. What about object oriented databases?

Blaha: When I develop data-oriented applications I always start with a data model — this applies to relational databases, object-oriented databases, flat files, XML, and other formats.

For object-oriented databases, the UML class model specifies the structure of domain classes. Several books elaborate the mapping details, such as my book Object-Oriented Modeling and Design for Database Applications.

The trickiest issue for mapping a UML class model to OO databases is the handling of associations. An association relates objects of the participating classes and is inherently bidirectional. For example, the WorksFor association can be traversed to find the collection of employees for a company, as well as the employer for an employee.
By their very nature, associations transcend classes and cannot be private to a class. Associations are important because they break class encapsulation.

OO databases vary in their association support. ObjectStore uses dual pointers — a class A to class B association is implemented with an instance variable in A that refers to B and an instance variable in B that refers to A. The ObjectStore database engine automatically keeps the dual pointers mutually consistent across inserts, deletes, and updates.

Many other OO databases lack association support and require that you architect your own solution. You can have dual references that application software maintains. Or you can build a generic association mechanism apart from your application. Or you can just implement a single direction when that suffices.
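As a minimal illustration of the “dual references that application software maintains” option (not ObjectStore’s API, and not from the book), here is how the WorksFor association mentioned above might be kept consistent by hand; the class and function names are invented.

```python
# Sketch of hand-maintained dual references for a WorksFor association.
# Illustrative names only; no particular product's API is implied.
class Company:
    def __init__(self, name):
        self.name = name
        self.employees = []      # one direction of the association

class Employee:
    def __init__(self, name):
        self.name = name
        self.employer = None     # the other direction

def hire(company, employee):
    # Update both ends together so the association never becomes inconsistent.
    if employee.employer is not None:
        employee.employer.employees.remove(employee)
    company.employees.append(employee)
    employee.employer = company
```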

In any case, a data model documents associations so that they are clearly recognized and understood. Then you are sure not to overlook the associations in your application architecture.

You can watch the series of videos here:

Jan 18 11

Interview with Rick Cattell: There is no “one size fits all” solution.

by Roberto V. Zicari

I start in 2011 with an interview with Dr. Rick Cattell.
Rick is best known for his contributions to database systems and middleware — He was a founder of SQL Access (a predecessor to ODBC), the founder and chair of the Object Data Management Group (ODMG), and the co-creator of JDBC.

Rick has worked for over twenty years at Sun in management and senior technical roles, and for ten years in research at Xerox PARC and at Carnegie-Mellon University.

You can download the article by Rick Cattell: “Relational Databases, Object Databases, Key-Value Stores, Document Stores, and Extensible Record Stores: A Comparison.”

And look at recent posts on the same topic: “New and Old Data Stores”, and “NoSQL databases”.

RVZ

Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?

Rick Cattell: Basically, things changed with “Web 2.0” and with other applications where there were thousands or millions of users writing as well as reading a database. RDBMSs could not scale to this number of writers. Amazon (with Dynamo) and Google (with BigTable) were forced to develop their own scalable datastores. A host of others followed suit.

Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them?

Rick Cattell: That’s a good question. The proliferation and differences are confusing, and I have no one-paragraph answer to this question. The systems differ in data model, consistency model, and many other dimensions. I wrote a couple of papers and provide some references on my website; these may be helpful for more background. There I categorize several kinds of “NoSQL” data stores according to data model: key-value stores, document stores, and extensible record stores. I also discuss scalable SQL stores.

Q3. How do new data stores compare with relational databases?

Rick Cattell: In a nutshell, NoSQL datastores give up SQL and they give up ACID transactions, in exchange for scalability. Scalability is achieved by partitioning and/or replicating the data over many servers. There are some other advantages, as well: for example, the new data stores generally do not demand a fixed data schema, and provide a simpler programming interface, e.g. a RESTful interface.
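As an example of the kind of RESTful interface mentioned here, the sketch below uses CouchDB’s HTTP API; the database and document names are invented, and a local CouchDB instance without authentication is assumed.

```python
# Sketch of a RESTful document-store interface, using CouchDB's HTTP API
# as one example. Names are invented; a local, unauthenticated CouchDB
# on the default port is assumed.
import requests

base = "http://localhost:5984"

requests.put(f"{base}/users")                                         # create a database
requests.put(f"{base}/users/alice", json={"age": 31, "plan": "pro"})  # store a document
print(requests.get(f"{base}/users/alice").json()["plan"])             # read it back
```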

Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. How do these new data stores compare with object-oriented databases?

Rick Cattell: It is true, OODBs provide features that NoSQL systems do not, like integration with OOPLs, and ACID transactions. On the other hand, OODBs do not provide the horizontal scalability. There is no “one size fits all” solution, just as OODBs and RDBMSs are good for different applications.

Q5. With the emergence of cloud computing, new data management systems have surfaced. What is in your opinion of the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?

Rick Cattell: There are a number of data management issues with cloud computing, in addition to the scaling issue I already discussed. For example, if you don’t know which servers your software is going to run on, you cannot tune your hardware (RAM, flash, disk, CPU) to your software, or vice versa.

Q6. What are cloud stores omitting that enables them to scale so well?

Rick Cattell: You haven’t defined “cloud stores”. I’m going to assume that you mean something similar to what we discussed earlier: new data stores that provide horizontal scaling. In which case, I answered that question earlier: they give up SQL and ACID.

Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?

Rick Cattell: As I interpret this question, systems such as MongoDB already have this. Also, a SQL interpreter has been ported to BigTable, but the lower-level interface has proven to be more popular. The main scalability problem with declarative queries is when queries require operations like joins or transactions that span many servers: then you get killed by the node coordination and data movement.

Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management.
To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?

Rick Cattell: I agree with Stonebraker. There are actually two points here: one about performance (of each server) and one about scalability (of all the servers together). We already discussed the latter.
Stonebraker makes an important point about the former: with databases that fit mostly in RAM (on distributed servers), the DBMS architecture needs to change dramatically; otherwise 90% of your overhead goes into transaction coordination, locking, multi-threading latching, buffer management, and other operations that are “acceptable” in traditional DBMSs, where you spend your time waiting for disk. Stonebraker and I had an argument a year ago, and reached agreement on this as well as other issues on scalable DBMSs. We wrote a paper about our agreement, which will appear in CACM. It can be found on my website in the meantime.

Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes. Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?

Rick Cattell: Yes, I believe so. MySQL Cluster is already close to doing so, and I believe that VoltDB and Clustrix will do so. The key to scalability with RDBMSs is to avoid SQL and transactions that span nodes, as you say. VoltDB demands that transactions be encapsulated as stored procedures, and allows some control over how tables are sharded over nodes. This allows transactions to be pre-compiled and pre-analyzed to execute on a single node, in general.

Q10. There are also XML DBs, which go beyond relational. Hybridization with relational turned out to be very useful. For example, DB2 has a huge investment in XML, it is extensively published, and it has also succeeded commercially. MonetDB did substantial work in that area early on as well. How do they relate to “new data stores”?

Rick Cattell: With XML, we have yet another data model, like relational and object-oriented. XML data can be stored in a separate DBMS like MonetDB, or it can be transformed for storage in another DBMS, as with DB2. The focus of the new NoSQL data stores is generally not a new data model, but new scalability. In fact, they generally have quite simple data models. The “document stores” like MongoDB and CouchDB do allow nested objects, which might make them more amenable to storing XML. But in my experience, the new data stores are being used to store simpler data, like the key-value pairs required for user information on a web site.

Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?

Rick Cattell: This is an even harder question to answer than the ones contrasting the DBMSs themselves, because each application has characteristics that might make you lean one way or another. I made an attempt at answering this in the paper I mentioned, but I only scratched the surface… I concluded that there is no “cookbook” answer to tell you which way to go.

Dec 30 10

Don White on “New and old Data stores”.

by Roberto V. Zicari

“New and old Data stores” : This time I asked Don White a number of questions.
Don White is a senior development manager at Progress Software Inc., responsible for all feature development and engineering support for ObjectStore.

RVZ

Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?

Don White: Speaking from an OODB perspective, OODBs grew out of the recognition that the relational model is not the best fit for all application needs. OODBs continue to deliver their traditional value, which is transparency in handling and optimizing the movement of rich data models between storage and virtual memory. The emergence of common ORM solutions must have provided benefits for some RDB-based shops, where I presume they needed to use object oriented programming for data that already fits well into an RDB.

There is something important to understand: if you fail to leverage what your storage system is good at, then you are using the wrong tool or the wrong approach. The relational model wants to model with relations and perform joins, so an application’s data access pattern should query the database the way the model wants to work. ORM mapping for an RDB that tries to query and build one object at a time will have really poor performance. If you try to query in bulk to avoid the costs of model transitions, then you likely have to live with compromises in less-than-optimal locking and/or fetch patterns. A project with model complexity that pursues OOA/OOD for a solution will find implementation easier with an OO programming language and will find storage of that data easier and more effective with an OODB.

As for the newer data stores that are neither OODB nor RDB, they appear to be trying to fill a need with a storage solution that is less than a general database solution. Not trying to be everything to everybody allows different implementation tradeoffs to be made.

Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them?

Don White: This probably needs to be answered by the people involved with the products trying to distinguish themselves. I lump document stores into the NoSQL category. However, there do seem to be some common themes or subclass types within the new NoSQL stores: document stores and key-value stores. Each subclass seems to have a different way of declaring groups of related information and differences in exposing how to find and access stored information. In case it is not obvious, you can argue an OODB has some characteristics of the NoSQL stores, although any discussion will have to clearly define the scope of what is included in the NoSQL discussion.

Q3. How do new data stores compare with relational databases?

Don White: In general there seems to be recognition that relational-based technology has difficulty making tradeoffs in managing fluid schema requirements and in optimizing access to related information.

Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. How do these new data stores compare with object-oriented databases?

Don White: These new data stores are not object oriented. Some might provide language bindings to object oriented languages, but they are not preserving OOA/OOD as implemented in an OO program all the way through to the storage model. The new data systems are very data centric and are not trying to facilitate the melding of data and behavior. These new storage systems present specific model abstractions and provide their own specific storage structures. In some cases they offer schema flexibility, but it is basically used just to manage data and not for building sophisticated data structures with type-specific behavior. One way of keeping modeling abilities in perspective: you can use an OODB as a basis to build any other database system, NoSQL or even relational. The simple reason is an OODB can store any structure a developer needs and/or can even imagine. A document store, name/value pair store or RDB store all present particular abstractions for a data store, but under the hood there is an implementation to serve that abstraction. No matter what that implementation looks like for the store, it could be put into an OODB. Of course the key is determining if the data store abstraction presented works for your actual model and application space.

The problem with an OODB is that not everyone is looking to build a database of their own design; they prefer someone else to supply the storage abstraction and worry about the details that make the abstraction work. That is not to say the only way to interface with an OODB is a 3GL program, but the most effective way to use an OODB is when the storage model matches the needs of the in-memory model. That is a very leading statement because it really forces a particular question: why would you want to store data differently than how you intend to use it? I guess the simple answer is when you don’t know how you are going to use your data – but if you don’t know how you are going to use it, then why is any data store abstraction better than another? If you want to use an OO model and implementation, then you will find a non-OODB a poor way of handling that situation.

To generalize, it appears the newer stores make different compromises in the management of data to suit their intended audience. In other words, they are not developing a general purpose database solution, so they are willing to make tradeoffs that traditional database products would/should/could not make. The new data stores do not provide traditional database query language support or even strict ACID transaction support. They do provide abstractions for data storage and processing capabilities that leverage the idiosyncrasies of their chosen implementation data structures and/or relaxations in the strictness of the transaction model to try to make gains in processing.

Q5. With the emergence of cloud computing, new data management systems have surfaced. What is in your opinion of the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?

Don White: One challenge is just trying to understand what is really meant by cloud computing. In some form it is how to leverage computing resources or facilities available through a network. Those resources could be software or hardware; leveraging them requires nothing to be installed on the accessing device, you only need a network connection. The network is the virtual mainframe and any device used to access the network is the virtual terminal endpoint. You have the same concerns in trying to leverage the computing power of a virtual mainframe as of a real local machine: how to optimize computing resources, how to share them among many users and how to keep them running. You have an interesting upside with all the possible scalability, but with the power and flexibility come new levels of management complexity. You have to consider how algorithms for processing and handling data can be distributed and coordinated. When you involve more than one machine to do anything, then you have to consider what happens when any node or connecting piece fails along the way.

Q6. What are cloud stores omitting that enables them to scale so well?

Don White: Strict serialized transaction processing, for one. I think you will find the more complex a data model needs to be, the more need there is for strict serialized transactions. You can’t expect to navigate relationships cleanly if you don’t promise to keep all data strictly serialized.
The data and/or the storage abstractions used in the new models seem devoid of any sophisticated data processing and relationship modeling. What is being managed and distributed is simple data, where algorithms and the data needing to be managed can be easily partitioned/dispersed and the required processing is easily replicated with basic coordination requirements. It is easy to imagine how to process queries that can be replicated in bulk across simple data stored in structures that are amenable to being split apart.
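A small illustrative sketch of that pattern (no particular product implied): the same simple query runs over every partition of the data, and the partial results are merged afterwards.

```python
# Illustrative sketch of partitioned data with replicated processing:
# each "node" runs the same query over its slice, then results are merged.
from collections import Counter

partitions = [
    [{"country": "DE"}, {"country": "US"}],   # node 1's slice of the data
    [{"country": "US"}, {"country": "FR"}],   # node 2's slice
]

def count_by_country(rows):                   # replicated on every node
    return Counter(r["country"] for r in rows)

total = sum((count_by_country(p) for p in partitions), Counter())
print(total)   # Counter({'US': 2, 'DE': 1, 'FR': 1})
```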

Why are serialized transactions important? They make sure there is one view of history, which is necessary to maintain integrity among related data. Some systems try to pass off something less than serializable isolation as adequate for transaction processing; however, allowing updates to occur without the prerequisite read locks risks using data that is not correct. If you are using pointers rather than indirect references as part of your processing, the things you point to have to exist to run. Once you materialize a key/value based relationship as a pointer, then there has to be commitment not only to the existence of the relationship (the thing pointed to) but also to the state of the data involved in the relationship that allows the existence to be valid.

Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?

Don White: I can’t answer that. It will be a shame if these systems end up having to build many things that are available in other database systems that could have given them that for free.

Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?

Don White: I don’t have any argument with the overheads identified. However, I would say I don’t want to use SQL, a non-procedural way of getting to data, when I can solve my problem faster by navigating data structures specifically geared to solve my targeted problem. I have seen customers put SQL interfaces on top of specialized models stored in an OODB. They use SQL through ODBC as a standard endpoint to get at the data, but the implementation model under the hood is a model the customer implemented that performs queries faster than a traditional relational implementation could.

Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes.
Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?

Don White: Hmm, what is the barrier? What makes SQL hard to span nodes? I suppose one inherent problem is that an RDB is built around the relational model, which is based on joining relations. If processing is going to spread across many nodes, then where does joining take place? So either there are possibly single points of failure, or some layer of complicated partitioning has to be managed to figure out how to join data together.

Q10. There are also XML DBs, which go beyond relational. Hybridization with relational turned out to be very useful. For example, DB2 has a huge investment in XML, it is extensively published, and it has also succeeded commercially. MonetDB did substantial work in that area early on as well. How do they relate to “new data stores”?

Don White: I would think one thing that has to be addressed is how you store and process non-text information that is ultimately represented as text in XML. String-based models are a poor means to manage relationships and numeric information. There are also costs in trying to make sure the information is valid for the real data type you want it to be.
A product would have to decide how to handle types that are not meant to be textual. For example, you can’t expect to accurately compare/restrict floating point numbers that are represented as text, and storing numbers as text is certainly an inefficient storage model. Most likely you would want to leverage parsed XML for your processing, so if the data is not stored in a parsed format then you will have to pay for parsing when moving the data to and from the storage model. XML can be used to store trees of information, but not all data is easily represented with XML.
Common data modeling needs like graphs and non-containment relationships among data items would be a challenge. When evaluating any type of storage system, the evaluation should be based on the type of data model needed and how it will be used.

Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?

Don White: Make sure you choose a tool for the job at hand. I think the one thing we know is that the relational model has been used to solve lots of problems, but it is not the be-all and end-all of data storage solutions. Other data storage models can offer advantages for more than niche situations.

Dec 16 10

Watch the Video of the Keynote Panel “New and old Data stores”

by Roberto V. Zicari

A number of people asked me to make it easier to watch the video of the Keynote Panel “New and old Data stores”, held at ICOODB 2010 Frankfurt on September 29, 2010.
So rather than downloading the video, you can now watch it directly here!

The panel discussed the pros and cons of new data stores with respect to classical relational databases.

The panel of experts was composed of:
-Robert Greene, Chief Strategist Versant.
-Leon Guzenda, Chief Technology Officer Objectivity.
-Michael Keith, architect at Oracle.
-Patrick Linskey, Apache OpenJPA project.
-Peter Neubauer, COO NeoTechnology.
-Ulf Michael (Monty) Widenius, main author of the original version of the open source MySQL database.

Moderators were: Alan Dearle, University of St Andrews, and myself.

The panelists engaged in lively discussions addressing a variety of interesting issues, such as: why the recent proliferation of “new data stores”, such as “document stores” and “nosql databases”; their differences from classic relational databases; how object databases compare with NoSQL databases; and scalability and consistency for huge amounts of data… to name a few.

RVZ
Since the original video was rather large, I split it into two parts.

Keynote Panel “New and old Data stores” PART I:

Keynote Panel “New and old Data stores” PART II:

Dec 2 10

Robert Greene on “New and Old Data stores”

by Roberto V. Zicari

I am back covering the topic “New and Old Data stores”.
I asked several questions to Robert Greene, CTO and V.P. Open Source Operations at Versant.

Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?

Robert Greene: Well, it’s a question of innovation in the face of need. When relational databases were invented, applications and their models were simpler, data was smaller, and concurrent users were fewer. There was no internet, no wireless devices, no global information systems. In the mid-90s, even Larry Ellison stated that complexly related information, at the time largely in niche application areas like CAD, did not fit well with the relational model. Now, complexity is pervasive in nearly all applications.

Further, the relational model is based on a runtime relationship execution engine, re-calculating relations based on primary-key/foreign-key data associations even though the vast majority of data relationships remain fixed once established. When data continues to grow at enormous rates, the approach of re-calculating the relations becomes impractical. Today even normal applications start to see data at sizes which in the past were only seen in data warehousing solutions, the first data management space that embraced a non-relational approach to data management.

So, in a generation when millions of users are accessing applications linked to near real-time analytic algorithms, at times operating over terabytes of data, innovation must occur to deal with these new realities.

Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them?

Robert Greene: The answer to this could require a book, but let’s try to distill it down to the fundamentals.

I think the biggest difference is the programming model. There is some overlap, so you don’t see clear distinctions, but for each type – object database, distributed file system, key-value store, document store and graph store – the manner in which the user stores and retrieves data varies considerably. The OODB uses language integration, the distributed file systems use map-reduce, key-value stores use data keys, document stores use keys and queries based on an indexed metadata overlay, and graph stores use a navigational expression language. I think it is important to point out that “store” is probably a more appropriate label than “database” for many of these technologies, as most do not implement the classical ACID requirements defined for a database.

Beyond programming model, these technologies vary considerably in architecture, how they actually store data, retrieve it from disk, facilitate backup, recovery, reliability, replication, etc.

Q3. How do new data stores compare with relational databases?

Robert Greene: As described above, they have a very different programming model than the RDB. In some ways they are all subsets of the RDB, but their specialization allows them to do what they do (at times) better than the RDB.

Most of them are utilizing an underlying architecture which I call “the oldest scalability architecture of the relational database”: the key-value/blob architecture. The RDB has long suffered performance problems under scale, and historically many architects have gotten around those performance issues by removing the JOIN operation from the implementation. They manage identity from the application space and store information in either single tables and/or blobs of isolatable information. This comparison is obvious for key-value stores. However, you can also see this approach in the document store, which is storing its information as key-JSON objects. The keys to those documents (JSON blob objects) must be managed by user-implemented layers in the application space. Try to implement a basic collection reference and you will find yourself writing lots of custom code. Of course, JSON objects also have metadata which can be extracted and indexed, allowing document stores to provide better ways of finding data, but the underlying architecture is key-value.
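A toy sketch of that application-managed identity, with a plain Python dict standing in for the key-value store, shows where the custom code comes from: the “collection reference” from an order to its line items is just a list of keys the application must mint, store, and resolve itself. All names are invented.

```python
# Toy sketch of application-managed identity over a key-value store.
# A plain dict stands in for the store; names are illustrative only.
import json
import uuid

store = {}  # key -> JSON blob

def put(obj):
    key = str(uuid.uuid4())          # identity is minted by the application
    store[key] = json.dumps(obj)
    return key

item1 = put({"sku": "A-100", "qty": 2})
item2 = put({"sku": "B-200", "qty": 1})
order = put({"customer": "alice", "items": [item1, item2]})  # hand-rolled reference

# Traversing the collection reference means resolving the keys ourselves.
for item_key in json.loads(store[order])["items"]:
    print(json.loads(store[item_key]))
```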

Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. How do these new data stores compare with object-oriented databases?

Robert Greene: They compare similarly in that they achieve better scalability than the RDB by utilizing identity management in the application layer, similarly to the way it is done with the object database. However, the approach is significantly less opaque, because for those NoSQL stores the management of identity is not integrated into the language constructs and abstracted away from the user API as it is with the object database. Plus, there is a big difference in the delivery of the ACID properties of a database. The NoSQL databases are almost exclusively non-transactional unless you use them in only the narrowest of use cases.

Q5. With the emergence of cloud computing, new data management systems have surfaced. What is in your opinion of the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?

Robert Greene: Unquestionably, the world is moving to a platform-as-a-service computing model (PaaS). Databases will play a role in this transition in all forms. The challenges in delivering data management technology which is effective in these “cloud” computing architectures turn out to be very similar to effectively delivering technology for the new n-core chip architectures. They are challenges related to distributed data management, whether it is across machines or across cores: splitting the problem into pieces and managing the distributed execution in the face of concurrent updates. Then the often overlooked aspect in these discussions is the operational element: how to effectively develop, debug, manage and administer the production deployments of this technology within distributed computing environments.

Q6. What are cloud stores omitting that enables them to scale so well?

Robert Greene: I think architecture plays the biggest role in their ability to scale. It is the application-managed identity approach to data retrieval, data distribution, and semi-static data relations. These are things they actually have in common with object databases, which, incidentally, you also find in some of the world’s largest, most demanding application domains. I think that is the biggest scalability story for those technologies. If you look past architecture, then it comes down to some of the sacrifices made in the area of fully supporting the ACID requirements of a database. Taking the “eventually consistent” approach in some cases makes a tremendous amount of sense, if you can afford probabilistic results instead of determinism.

Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?

Robert Greene: I am sure you will see this, as literally all database technologies that remain relevant will live in the cloud.

Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?

Robert Greene: I agree with the theory. Reality, though, does introduce some practical limitations during implementation. Technology is doing a remarkable job of removing those bottlenecks. For example, you can now get non-volatile memory appliances which are 5 TB in size, effectively eliminating disk I/O as what was historically the #1 bottleneck in database systems. Still, architecture will continue to play the strongest role in performance and scalability in the future. Relational databases and other implementations which need to calculate relationships at runtime based on data values over growing volumes of data will remain performance challenged.

Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes.
Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?

Robert Greene: Yes, of course; they already do in vendors like Netezza, Greenplum and Aster Data. The question is whether they will perform well in the face of those scalability requirements. This distinction between performance and scalability is often overlooked.

However, I think this notion that you cannot scale well if your transactions span many nodes is nonsense. It is a question of implementation. Just because a database has 100 nodes does not mean that all transactions will operate on data within those 100 nodes. Transactions will naturally partition and span some percentage of nodes, especially with regard to relevant data. Access in a multi-node system can be parallelized in all aspects of a transaction. Further, at a commit boundary, in the overwhelming case the number of nodes involved where data is inserted, changed, deleted and/or logically dependent is some small fraction of all the physical nodes in the system. Therefore, advanced 2-phase commit protocols can do interesting things like rolling back non-active nodes, parallelizing protocol handshaking, and using asynchronous I/O and handshaking to finalize the commit. Is it complicated? Yes. But is it too complicated to work? Not by a long shot.
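A schematic sketch of the point being made: a two-phase commit only needs to coordinate the nodes a transaction actually touched, not every node in the cluster. The node objects below are placeholders, not a real protocol implementation.

```python
# Schematic two-phase commit restricted to the nodes a transaction touched.
# Placeholder node objects; prepare/commit/rollback are assumed methods.
def two_phase_commit(txn, all_nodes):
    touched = [n for n in all_nodes if n.participated_in(txn)]  # usually a small fraction

    # Phase 1: only the touched nodes are asked to prepare.
    if not all(n.prepare(txn) for n in touched):
        for n in touched:
            n.rollback(txn)
        return False

    # Phase 2: commit on the touched nodes; untouched nodes are never blocked.
    for n in touched:
        n.commit(txn)
    return True
```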

Q10. There are also XML DBs, which go beyond relational. Hybridization with relational turned out to be very useful. For example, DB2 has a huge investment in XML, it is extensively published, and it has also succeeded commercially. MonetDB did substantial work in that area early on as well. How do they relate to “new data stores”?

Robert Greene: I really look at XML databases as large index engines. I have seen implementations of these which look very much like document stores, the main difference being that they generally index everything, whereas the document stores appear to be much more selective about the metadata exposed for indexing and query. Still, I think the challenge for XML DBs is the mismatch with the programming paradigm. Developers think of XML as data interchange and transformation technology. It is not perceived as transactional data management and storage, and developers don’t program in XML, so it feels clunky for them to figure out how to wrap it into their logical transactions. I suspect it feels a little less clunky if what you are dealing with are documents. Perhaps XML databases should be considered the original document stores.

Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?

Robert Greene: I choose the right tool for the job. This is again one of those questions which deserves several books. There is no one best solution for all applications, and deciding factors can be complicated, but here is what I think about as the major influencing factors. I look at it from the perspective of whether the application is data driven or model driven.

If it is model driven, I lean towards ODB or RDB.
If it is data driven, I lean towards NoSQL or RDB.

If the project is model driven and has a complex known model, ODB is a good choice because it handles the complexity well. If the project is model driven and has a simple known model, RDB is a good choice, because you should not be performance penalized if the model is simple, there are lots of available choices and people who know how to use the technology.

If the project is data driven and the data is small, RDB is good for the prior reasons. If the project is data driven and the data is huge, then NoSQL is a good choice because it takes a better architectural approach to huge data allowing the use of things like map reduce for parallel processing and/or application managed identity for better data distribution.

Of course, even within these categorizations you have ranges of value in different products. For example, MySQL and Oracle are both RDB, so which one to choose? Similarly, db4o and Versant are both ODB’s, so which one should you choose? So, I also look at the selection process from the perspective of 2 additional requirements: Data Volume, Concurrency. Within a given category, these will help narrow in on a good choice. For example, if you look at the company Oracle, you naturally consider that MySQL is less data scalable and less concurrent than the Oracle database, yet they are both RDB. Similarly, if you look at the company Versant, you would consider db4o to be less data scalable and less concurrent than the Versant database, yet they are both ODB.

Finally, I say you should test: evaluate any selection within the context of your major requirements. Get the core use cases mocked up and put your top choices to the test; it is the only way to be sure.

Nov 22 10

Why Patterns of Data Modeling?

by Roberto V. Zicari

I published another chapter of the new book “Patterns of Data Modeling” by Dr. Michael Blaha. Altogether you can now download three chapters of the book: Tree Template, Models, and Universal Antipatterns.

At the same time, I asked Dr. Blaha a few questions.
At the end of the interview you’ll find some more opinions on this topic.

Q1. What are Patterns of Data Modeling?

Michael Blaha: Experienced data modelers don’t limit their thinking to primitive constructs. Rather they leverage what they have seen before. Patterns of data modeling are ways of cataloging past superstructures that are profound and likely to recur.
There are different aspects of data modeling patterns. There are models of common data structures (mathematical templates), models to be avoided (antipatterns), core concepts that transcend application domains (archetypes), and models of common services (canonical models). Modelers should avail themselves of the full pattern toolkit and not focus on one technique to the exclusion of others.
The literature covers abstract programming patterns that exist apart from application concepts. For example, the gang of four book — “Design Patterns: Elements of Reusable Object-Oriented Software” has excellent coverage of abstract programming patterns. There is no reason why databases should not have a comparable level of treatment. Until my recent book (“Patterns of Data Modeling“) the literature has lacked an abstract treatment of data modeling patterns.

Q2. Where and when are Patterns of Data Modeling useful?

Michael Blaha: All experienced modelers should use data modeling patterns. It is important to reuse ideas that have been tried and tested, rather than reinvent technology from scratch. I know that data modeling patterns are useful because this is the way that I think as I perform my work as an industrial consultant.
I use data modeling patterns for application data models, enterprise data models, data reverse engineering, and abstract conceptual thinking. Data modeling patterns are not a panacea to the troubles of development, but they are part of the solution. With patterns, developers can accelerate their thinking and reduce modeling errors.

Q3. Is there any difference in the applicability of Patterns of Data Modeling if the underlying Database System is a relational database as opposed to for example an Object Oriented or a NoSQL database?

Michael Blaha: No. That is the whole premise of software engineering — to quickly address the essential aspects of a problem and defer implementation details. A conceptual data model is focused on finding the important concepts for a problem, delineating scope, and determining the proper level of abstraction. All this deep, early thinking happens regardless of the eventual implementation target. Data modeling patterns mostly apply to the early stages of software development.
Bill Premerlani and I took this approach in our 1998 book (“Object-Oriented Modeling and Design for Database Applications”). We presented detailed mapping rules for how to implement conceptual models with relational databases, an object-oriented database (ObjectStore) and flat files. Our 1991 book (“Object-Oriented Modeling and Design”) and its 2005 sequel explained how to map OO models to several programming languages.
So patterns of data modeling (as well as programming patterns and other kinds of patterns) apply regardless of the eventual downstream implementation.

Q4. What’s the difference between a pattern and a seed model?

Michael Blaha: A seed model is specific to a problem domain. It is a tangible piece that you can extend to build an entire application. Several authors (such as Hay, Fowler, and Silverston) have published excellent books with seed models. In contrast, a pattern is abstract and stands apart from any particular application domain. Patterns are at the same level of abstraction as UML classes, associations, and generalizations. A pattern is a composite building block. Seed models and abstract patterns are both valuable techniques. They are complementary and are often used together.

Q5. What do you see as frontier areas of databases and data modeling?

Michael Blaha: I’m now working on a new topic — SOA and databases. SOA is an acronym for Service-Oriented Architecture, an approach for organizing business functionality into meaningful units of work. Instead of placing logic in application silos, SOA organizes functionality into services that transcend the various departments and fiefdoms of a business. A service is a meaningful unit of business processing. Services communicate by passing data back and forth. Such data is typically expressed in terms of XML. XML combines data with metadata that defines the data’s structure. A second language — XSD (XML Schema Definition) — is often used to specify valid XML data structure.
The promise of SOA is being held back by a lack of rigor with XSD files. Many developers focus on the design of individual services and pay little attention to how the services fit together and collectively evolve. Enterprise data modeling is the solution to this problem. A data model is essential for grasping the entirety of services and abstracting services properly. A data model also provides a guide for combining services in flexible ways.
I see evidence for a lack of data modeling in my consulting practice. I have studied several XSD standards and they all ignore data models. The literature in the area of SOA and data modeling is sparse. The current situation is untenable and SOA projects must pay more attention to data.
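As a rough illustration of this argument, here is a minimal, hypothetical Python sketch in which two services validate their payloads against one shared definition instead of each inventing its own. The service and field names are invented, and a real SOA stack would express the shared model in XSD or a schema registry rather than in application code.

# Hypothetical sketch: one shared data definition reused by every service,
# standing in for the enterprise data model argued for above.
# Field names and services are invented for illustration.
CUSTOMER_SCHEMA = {
    "customer_id": str,
    "name": str,
    "credit_limit": float,
}

def validate(payload: dict, schema: dict) -> None:
    # Reject payloads that omit fields or use the wrong types.
    for field_name, field_type in schema.items():
        if field_name not in payload:
            raise ValueError(f"missing field: {field_name}")
        if not isinstance(payload[field_name], field_type):
            raise TypeError(f"bad type for field: {field_name}")

def billing_service(payload: dict) -> str:
    validate(payload, CUSTOMER_SCHEMA)   # same shared definition ...
    return f"billing {payload['customer_id']}"

def shipping_service(payload: dict) -> str:
    validate(payload, CUSTOMER_SCHEMA)   # ... reused here
    return f"shipping to {payload['name']}"

order = {"customer_id": "C-42", "name": "Acme GmbH", "credit_limit": 10000.0}
print(billing_service(order))
print(shipping_service(order))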
Here are some more opinions on this topic:

"Patterns of data modeling are very important. They enable data modeling efforts to be both effective and efficient. Working without patterns is like wandering around in the data wilderness trying to find your way.
SOA and Data. This is another vital area that must be addressed. I am doing it in my practice. It brings together data, metadata, metacards, data registries, data catalogs — and services. Very important for scalability when the data network size grows (e.g., the government, nationwide health services, etc.).
" — James Odell.

"I am mostly an object modeller, but I always recommend that my clients start with existing data model patterns rather than with a blank sheet of paper.
The data modelling patterns I most turn to are those of David C. Hay (Data Model Patterns: Conventions of Thought, etc.)." — Jim Arlow.

“I agree with all that Dr. Blaha said advocating the use of patterns. This was very articulately worded, and I like to see those views spread around.
I also recognize that what he’s tried to do in this book is very different from what Len Silverston, Martin Fowler and I did.
It is true that we were focused on modeling the real world, the "domains" as he described them. He, on the other hand, has abstracted modeling to the point that he describes modeling itself: "tree" structures, undirected graphs, directed graphs, and so forth.
It is true that Dr. Blaha’s book is abstract in the extreme.
In fact, in my new book, Enterprise Model Patterns: Describing the World, I take on the issue of level of abstraction directly. In it, I present a semantic model that I claim describes the entire enterprise, but on multiple levels of abstraction.
The first (Level 1) is a generic model that any company or government agency can take on as a starting point. It is generic because most attributes are actually captured as data in CHARACTERISTIC entities. (This corresponds to Dr. Blaha's discussion of soft-coded values.) [A minimal sketch of this idea appears after this quote.] Thus, they become the problem of the user community, not the data modeler. The data modeler can address the true structures of the business. Yes, this model is organized in terms of five fundamental domains: people and organizations (who), geographic locations (where), physical assets (what), and activities and events (how). It also addresses time (when), but that's a different kind of model. (This model is based on some 20+ years of experience in the field, but I was inspired to write it from my experience over the last few years with the Federal Data Architecture Subcommittee. The committee hasn't been very effective at creating patterns to distribute to Federal agencies, but it did inspire me to try to capture my views on the subject.)
I then address Level 0, which is a template for the first four categories above. (This is an enhanced version of the THING/THING TYPE model). In addition, at this level are two “meta” models: Document management and accounting. Each of these subject areas itself refers to the entire rest of the model.
At Level 2, I deal with functional specializations. These are more detailed than the Level 1 models and make use of the entities in Level 1 combined in specific ways. These subject areas address such things as addresses (both physical addresses, or "facilities", and virtual addresses such as telephone numbers, e-mail addresses, etc.), human resources, contracts, and the like. While they are more specialized than Level 1, they are still generally applicable patterns. (And these areas address the "why" of the organization.)
At Level 3, I address specific industries. For "vertical" models, I take the position that the Level 1/2 models address 80-90% of any company's requirements. For each industry, however, there are a few special areas that need special attention. These are the things that make that industry unique. I took on five of these, trying to get a cross-section from completely different worlds: criminal justice, microbiology, banking, oil production, and highway maintenance. If you don't know anything about one of these industries, here is where you can learn something.
I agree that patterns are technology independent. I disagree that "object" models are technologically independent. That Dr. Blaha began with the Gang of Four book, "Design Patterns: Elements of Reusable Object-Oriented Software," says something about his orientation. As it happens, in my latest book, I did (as my colleagues would say) "move over to the dark side" and use UML as the notation, even though that notation is specifically oriented towards object-oriented design, not business modeling. I had to tweak some of the terms to break out of its object-oriented design history. These are conceptual, business-oriented models, not design models.
In doing this, I may have managed to offend both my data modeling colleagues ("You really have gone over to the dark side, haven't you?") and my UML colleagues ("What have you done to my UML?"). Or perhaps I have started building a bridge between the two groups? Only time will tell."
— Dave Hay.
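For readers who have not seen the THING/THING TYPE and CHARACTERISTIC constructs that Hay mentions, here is a minimal, hypothetical Python sketch of the underlying idea, with attributes stored as data rather than as columns. The names are invented for illustration and are not taken from Hay's book.

# Hypothetical sketch of the "attributes as data" idea behind
# THING / THING TYPE and CHARACTERISTIC entities: instead of adding
# a column per attribute, each attribute value is stored against a
# characteristic definition, so the user community can add new
# characteristics without changing the schema.
from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class ThingType:
    name: str                                   # e.g. "Pump"
    allowed: Set[str] = field(default_factory=set)   # permitted characteristics

@dataclass
class Thing:
    identifier: str
    thing_type: ThingType
    characteristics: Dict[str, str] = field(default_factory=dict)

    def set_characteristic(self, name: str, value: str) -> None:
        if name not in self.thing_type.allowed:
            raise ValueError(f"{name} is not defined for {self.thing_type.name}")
        self.characteristics[name] = value

pump_type = ThingType("Pump", allowed={"capacity", "manufacturer"})
pump = Thing("P-101", pump_type)
pump.set_characteristic("capacity", "250 l/min")
print(pump.characteristics)   # {'capacity': '250 l/min'}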

Oct 28 10

Video “New and old Data stores”.

by Roberto V. Zicari

You can now freely download the Video of the Keynote Panel “New and old Data stores”, held at ICOODB 2010 Frankfurt on September 29, 2010.

Here is the LINK to download the video.

Since the original file was rather large, I split it into two separate files; each one takes about 6 minutes to download…

The panel discussed the pros and cons of new data stores with respect to classical relational databases.

The panel of experts was composed of:
Ulf Michael (Monty) Widenius, main author of the original version of the open source MySQL database.
Michael Keith, architect at Oracle.
Patrick Linskey, Apache OpenJPA project.
Robert Greene, Chief Strategist, Versant.
Leon Guzenda, Chief Technology Officer, Objectivity.
Peter Neubauer, COO, NeoTechnology.

Moderators were: Alan Dearle, University of St Andrews, and Roberto V. Zicari, Goethe University Frankfurt.

The panelists engaged in lively discussions addressing a variety of interesting issues: why the recent proliferation of "new data stores" such as "document stores" and "nosql databases"; how they differ from classic relational databases; how object databases compare with NoSQL databases; and scalability and consistency for huge amounts of data…to name a few.

RVZ

Oct 26 10

Proceedings ICOODB 2010 Frankfurt.

by Roberto V. Zicari

The research papers (RESEARCH TRACK) of the ICOODB 2010 Frankfurt conference have been published by Springer in the Lecture Notes in Computer Science series. Here are the details:

Objects and Databases. Dearle, Alan; Zicari, Roberto V. (Eds.)
Proceedings Series: Lecture Notes in Computer Science, Vol. 6348. 1st Edition, 2010, XIV, 161 p., Softcover. ISBN: 978-3-642-16091-2.
Preface and Table of Contents | September 2010.

This book constitutes the thoroughly refereed conference proceedings of the Third International Conference on Object Databases, ICOODB 2010, held in Frankfurt/Main, Germany in September 2010.

Most presentations in the Industry Track, Keynotes and Tutorials are available for free download at ODBMS.ORG.

I will very soon upload the video of the very interesting keynote panel “NEW AND OLD DATA STORES” …stay tuned.

RVZ