
"Trends and Information on AI, Big Data, Data Science, New Data Management Technologies, and Innovation."


Sep 29 11

MariaDB: the new MySQL? Interview with Michael Monty Widenius.

by Roberto V. Zicari

“I want to ensure that the MySQL code base (under the name of MariaDB) will survive as open source, in spite of what Oracle may do.”– Michael “Monty” Widenius.

Michael “Monty” Widenius is the main author of the original version of the open-source MySQL database and a founding member of the MySQL AB company. Since 2009, Monty has been working on a branch of the MySQL code base called MariaDB.
I wanted to know what's new with MariaDB. You can read the interview with Monty below.

RVZ.

Q1. MariaDB is an open-source database server that offers drop-in replacement functionality for MySQL. Why did you decide to develop a new MySQL database?

Monty: Two reasons:
1) I want to ensure that the MySQL code base (under the name of MariaDB) will survive as open source, in spite of what Oracle may do. As Oracle is now moving away from Open Source to Open Core (see my blog), this was in hindsight the right thing to do!

2) I want to ensure that the MySQL developers have a good home where they can continue to develop MySQL in an open source manner. This is important because if the MySQL ecosystem were to lose the original core developers, there would be no way for the product to survive. This has also in hindsight proven to be important, as Oracle has lost almost all of the original core developers; fortunately most of them have joined the MariaDB project.

Q2. What is new in MariaDB 5.3.1 beta?

Monty: 5.1 and 5.2 were about fixing some outstanding issues and getting in patches for MySQL that had been available in the community for a long time but were never accepted into the MySQL code base for political reasons.

5.3 is where we have put most of our development efforts, especially in the optimizer area and replication. Replication is now an order of magnitude faster than before (when running with many concurrent updates), and many queries involving subqueries or big joins are now 2x – 100x faster. All the changes are listed in the MariaDB knowledge base.

Q3. How is MariaDB currently being used? Could you give examples of applications that use MariaDB?

Monty: We see a lot of old MySQL users switching to MariaDB. These are especially big sites with a lot of queries that need even more performance. We have some case studies in the knowledge base.

What we do expect with 5.3 is that people who need more flexible SQL will also start using MariaDB, thanks to the new ‘NoSQL’ features we have added, like HandlerSocket and dynamic columns.
Dynamic columns allow you to have a different set of columns for every row in a table.
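To illustrate the idea, here is a minimal sketch of using dynamic columns through a standard JDBC connection. The table, values and connection settings are made up for the example, and the COLUMN_CREATE/COLUMN_GET calls follow MariaDB's documented syntax for the 5.3 series, where dynamic columns are addressed by number (later versions also accept string names); details may differ from the beta discussed here.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DynamicColumnsSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection settings; an existing MySQL JDBC driver should
        // also work here, since MariaDB is a drop-in replacement for MySQL.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test", "user", "password");
             Statement st = con.createStatement()) {

            // Each row packs its own set of attributes into one blob column.
            st.execute("CREATE TABLE IF NOT EXISTS items (id INT PRIMARY KEY, attrs BLOB)");
            st.execute("INSERT INTO items VALUES (1, COLUMN_CREATE(1, 'red', 2, 42))");

            // Read one attribute back, casting it to the expected type.
            try (ResultSet rs = st.executeQuery(
                    "SELECT COLUMN_GET(attrs, 1 AS CHAR) AS color FROM items WHERE id = 1")) {
                while (rs.next()) {
                    System.out.println("color = " + rs.getString("color"));
                }
            }
        }
    }
}
```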

Q4. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?

Monty: A lot of the new features, like batched key access, hash joins, subquery caching, etc., are targeted at allowing the handling of larger data sets.
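As a rough illustration of the kind of query these optimizer features target, the sketch below runs an IN-subquery over JDBC and prints the plan. The shop schema (orders, customers) is invented for the example, and which strategy the optimizer actually picks will depend on the data and the server version.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SubqueryPlanSketch {
    public static void main(String[] args) throws Exception {
        // IN-subqueries like this are the classic case that subquery
        // optimizations (materialization, semi-joins, caching) speed up.
        String query =
            "SELECT COUNT(*) FROM orders " +
            "WHERE customer_id IN (SELECT id FROM customers WHERE country = 'DE')";

        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "password");
             Statement st = con.createStatement();
             // EXPLAIN shows how the server decided to execute the subquery.
             ResultSet rs = st.executeQuery("EXPLAIN " + query)) {
            while (rs.next()) {
                System.out.println(rs.getString("select_type") + " | " + rs.getString("table"));
            }
        }
    }
}
```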

We have also started to continuously run benchmarks on data in the terabyte range to be able to improve MariaDB even more in this area. We hope to have some very interesting announcements in this area shortly.

Q5. How do you handle structured and unstructured data?

Monty: Yes. The dynamic columns feature is especially designed to do that. We now have one developer working on using the dynamic columns feature to implement a storage engine for HBase.

Q6. Who else is using MariaDB and why?

Monty: MariaDB has several hundred thousand downloads and is part of many Linux distributions, but as we don’t track users we don’t know who they are. (This was exactly the problem we had with MySQL in the early days.) That is why we have started to collect success stories about MariaDB installations, to spread awareness of who is using it.
The problem is that big companies usually never want to tell what they are using, which makes this a bit difficult 🙁

The main reasons people are switching to MariaDB are:
1. Faster, more features and fewer bugs.
The fewer bugs come from the fact that we have fixed a lot of bugs that MySQL has and keeps introducing in each release.
2. It’s a drop-in replacement for MySQL, so it’s trivial to switch.
(All old connectors work unchanged and you don’t have to dump/reload your data.)
3. Lots of new critical features that people have wanted for years:
– Microsecond support.
– Virtual columns.
– Faster subqueries, big data queries & replication.
– Segmented key cache (speeds up MyISAM tables a LOT).
– Progress reporting for ALTER TABLE.
– SphinxSE storage engine.
– FederatedX storage engine (a better and supported federated engine).
See here for the full list.
4. The upgrade process to a new release is easier than in MySQL; you don’t have to dump & restore your data to upgrade to a new version.
5. It’s guaranteed to be open source now and forever (no commercial extensions from Monty Program Ab).
6. We actively work with the community and add patches and storage engines they create to the MariaDB code base.

Q7. What’s next in MariaDB?

Monty: We are just now working on the last cleanups so that we can release MariaDB 5.5.
This will include everything in MariaDB 5.3 and MySQL 5.5, and also most of the closed source features that are in MySQL 5.5 Enterprise.

In parallel we are working on 5.6. The plans for it can be found here.
The most important features in this are probably:
– True parallel replication (not only per database)
– Multi-source slaves.
– A better MEMORY engine that can handle BLOB and VARCHAR gracefully (this is already in development).

We are also looking at introducing a new clustered storage engine that is optimized for the cloud.

What will be in 5.6 is still a bit up in the air; we are working with a lot of different companies and the community to define the new features. The final feature set will be decided at our next MariaDB developer meeting in Greece in November.

Q8. How can the open source community contribute to the project?

Monty: Anyone can contribute patches to MariaDB and if you are active and have proven that you are not going to break the code you can get commit access to the MariaDB tree.
The following link tells you how you can get involved.

Another way is of course to contract Monty Program Ab to implement features in MariaDB. This is a good option for those who have more money than time and want to get something done quickly.

Q9. Cloud computing: Does it play a role for MariaDB? If yes, how?

Monty: Yes, cloud computing is very important for us. MySQL and MariaDB are already among the most popular databases in the cloud; Rackspace, Amazon and Microsoft are all providing MySQL instances.
The popularity of MySQL in the cloud is mostly thanks to the fact that MySQL is quite easy to configure in different setups, from using very little memory to using all resources on the box. This, combined with replication, has made MySQL/MariaDB the database of choice in the cloud.

The one thing MariaDB is missing to be an even better choice for the cloud is a good engine that allows one, at the flip of a switch, to add or drop nodes and dynamically change how many cloud entities one is using. We hope to have a solution for this very soon.

Resources

Posts tagged ‘MySQL’.

Michael (Monty) Widenius, MariaDB
State of MariaDB, Dynamic Column in MariaDB
Describes MariaDB 5.1, a branch of MySQL 5.1. It also introduces the new features in versions 5.2 (beta) and 5.3 (alpha). Copy of the presentation given at ICOODB Frankfurt 2010.
Presentation | Intermediate | English | DOWNLOAD (.PDF) | September 2010|

Databases Related Resources:
Blog Posts | Free Software | Articles and Presentations | Lecture Notes | Journals |

—————–

Sep 19 11

Benchmarking XML Databases: New TPoX Benchmark Results Available.

by Roberto V. Zicari

“A key value is to provide strong data points that demonstrate and quantify how XML database processing can be done with very high performance.” – Agustin Gonzalez, Intel Corporation.

“We wanted to show that DB2’s shared-nothing architecture scales horizontally for XML warehousing just as it does for traditional relational warehousing workloads.” – Dr. Matthias Nicola, IBM Corporation.

TPoX stands for “Transaction Processing over XML” and is an XML database benchmark that Intel and IBM developed several years ago and then released as open source.
A couple of months ago, the project published some new results.

To learn more about this, I have interviewed the main leaders of the TPoX project: Dr. Matthias Nicola, Senior Engineer for DB2 at IBM Corporation, and Agustin Gonzalez, Senior Staff Software Engineer at Intel Corporation.

RVZ

Q1. What is exactly TPoX?

Matthias: TPoX is an XML database benchmark that focuses on XML transaction processing. TPoX simulates a simple financial application that issues XQuery or SQL/XML transactions to stress the XML storage, XML indexing, XML Schema support, XML updates, logging, concurrency and other components of an XML database system. The TPoX package comes with an XML data generator, an extensible Workload Driver, three XML Schemas that define the XML structures, and a set of predefined transactions. TPoX is free, open source, and available at http://tpox.sourceforge.net/ where detailed information can be found. Although TPoX comes with a predefined workload, it’s very easy to change this workload to adjust the benchmark to whatever your goals might be. The TPoX Workload Driver is very flexible; it can even run plain old SQL against a relational database and simulate hundreds of concurrent database users. So, when you ask “What is TPoX”, the complete answer is that it is an XML database benchmark but also a very flexible and extensible framework for database performance testing in general.

Q2. When did you start with this project? What was the original motivation for TPoX? What is the motivation now?

Matthias: We started with this project approximately in 2003/2004. At that time we were working on the native XML support in DB2 that was later released in DB2 version 9.1 in 2006. We needed an XML workload – a benchmark – that was representative of an important class of real-world XML applications and that would stress all critical parts of a database system.
We needed a tool to put a heavy load on the new XML database functionality that we were developing. Some XML benchmarks had been proposed by the research community, such as XMark, MBench, XMach-1, XBench, X007, and a few others. They were all useful in their respective scope, such as evaluating XQuery processors, but we felt that none of them truly aimed at evaluating a database system in its entirety. We found that they did not represent all relevant characteristics of real-world XML applications.
For example, many of them only defined a read-only and single-user workload on a single XML document. However, real applications typically have many concurrent users, a mix of read and write operations, and millions or even billions of XML documents.
That’s what we wanted to capture in the TPoX benchmark.

Agustin: And the motivation today is the same as when TPoX became freely available as open source: database and hardware vendors, database researchers, and even database practitioners in the IT departments of large corporations need a tool to evaluate system performance, compare products, or compare different design and configuration options.
At Intel, the main motivation behind TPoX is to benchmark and improve our platforms for the increasingly relevant intersection of XML and databases. So far, the joint results with IBM have exceeded our expectations.

Q3. TPoX is an application-level benchmark. What does it mean? Why did you choose to develop an application-level benchmark?

Matthias: We typically distinguish between micro-benchmarks and application-level benchmarks, both of which are very useful but have different goals. A micro-benchmark typically defines a range of tests such that each test exercises a narrow and well-defined piece of functionality. For example, if your focus is an XQuery processor you can define tests to evaluate XPath with parent steps, other tests to evaluate XPath with descendant-or-self axis, other tests to evaluate XQuery “let” clauses, and so on.
This is very useful for micro-optimization of important features and functions. In contrast, an application-level benchmark tries to evaluate the end-to-end performance of a realistic application scenario and to exercise the performance of a complete system as a whole, instead of just parts of it.

Agustin: As an application-level benchmark, TPoX has proven much more useful and believable than “synthetic” micro-benchmarks. As a result, TPoX can even be used to predict how similar real-world applications will perform, or where they will encounter a bottleneck. You cannot make such predictions with a micro-benchmark. Another important feature is that TPoX is very scalable – you can run TPoX on a laptop but also scale it up and run on large enterprise-grade servers, such as multi-processor Intel Xeon platforms.

Q4. How do you exactly evaluate the performance of XML databases?

Agustin: Well, one way is to use TPoX on a given platform and then compare to existing results on different combinations of hardware and software. I know that this is a simplistic answer but we really learn a lot from this approach. Keeping a precise history of the test configurations and the results obtained is always critical.

Matthias: This is actually a very broad question! We use a wide range of approaches. We use micro-benchmarks, we use application-level benchmarks such as TPoX, we use real-world workloads that we get from some of our DB2 customers, and we continuously develop new performance tests. When we use TPoX, we often choose a certain database and hardware configuration that we want to test and then we gradually “turn up the heat”. For example, we perform repeated TPoX benchmark runs and increase the number of concurrent users until we hit a bottleneck, either in the hardware or the software. Then we analyze the bottleneck, try to fix it, and repeat the process. The goal is to always push the available hardware and software to the limit, in order to continuously improve both.

Q5. What is the difference of TPoX with respect to classical database benchmarks such as TPC-C and TPC-H?

Matthias: One of the obvious differences is that TPC-C and TPC-H focus on very traditional and mature relational database scenarios. In contrast, TPoX aims at the comparatively young field of XML in databases. Another difference is that the TPC benchmarks have been standardized and “approved” by the TPC committee, while TPoX was developed by Intel and IBM, and extended by various students and Universities as an open source project.

Agustin: But, TPoX also has some important commonalities with the TPC benchmarks. TPC-C, TPC-H, and TPoX are all application-level benchmarks. Also, TPC-C, TPC-H, and TPoX have each chosen to focus on a specific type of database workload. This is important because no benchmark can (or should try to) exercise all possible types of workloads. TPC-C is a relational transaction processing benchmark, TPC-H is a relational decision support benchmark, and TPoX is an XML transaction processing benchmark. Some people have called TPoX the “XML-equivalent of TPC-C”. Another similarity between TPC-C, TPC-E, and TPoX is that all three are throughput oriented “steady state benchmarks”, which makes it straightforward to communicate results and perform comparisons.

Q6. Do you evaluate both XML-enabled and Native XML databases? Which XML Databases did you evaluate?

Matthias: TPoX can be used to evaluate pretty much any database that offers XML support. The TPoX workload driver is architected such that only a thin layer (a single Java class) deals with the specific interaction with the database system under test. Personally, I have used TPoX only on DB2. I know that other companies as well as students at various universities have also run TPoX against other well-known database systems.
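The sketch below is not the actual TPoX class, just an illustration of what such a thin database layer can look like: one small JDBC wrapper that the rest of the driver calls, so porting to another database system mostly means replacing this one class.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

/**
 * Generic sketch of a "thin database layer" (illustrative names, not the real
 * TPoX interface): everything the benchmark driver needs from the system under
 * test goes through this class, so porting means swapping out only this class.
 */
public class DatabaseConnectionSketch {
    private final Connection connection;

    public DatabaseConnectionSketch(String jdbcUrl, String user, String password) throws Exception {
        this.connection = DriverManager.getConnection(jdbcUrl, user, password);
    }

    /** Execute one benchmark transaction with its parameter values bound in. */
    public void executeTransaction(String statementText, Object... params) throws Exception {
        try (PreparedStatement st = connection.prepareStatement(statementText)) {
            for (int i = 0; i < params.length; i++) {
                st.setObject(i + 1, params[i]);
            }
            st.execute();
        }
    }

    public void close() throws Exception {
        connection.close();
    }
}
```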

Q7. How did you define the TPoX Application Scenario? How did you ensure that the TPoX Application Scenario you defined is representative of a broader class of applications?

Matthias: Over the years we have been working with a broad range of companies that have XML applications and require XML database support. Many of them are in the financial sector. We have worked closely with them to understand their XML processing needs. We have examined their XML documents, their XML Schemas, their XML operations, their data volumes, their transaction rates, and so on. All of that experience has gone into the design of TPoX. One very basic but very critical observation is that there are practically no real-world XML applications that use only a single large XML document. Instead, the majority of XML applications use very large numbers of small documents.

Agustin: TPoX is also very realistic because it uses a real-world XML Schema called FIXML, which standardizes trade-related messages in the financial industry. It is a very complex schema that defines thousands of optional elements and attributes and allows for immense document variability. It is extremely hard to map the FIXML schema to a traditional normalized relational schema. In the past, many XML processing systems were not able to handle the FIXML schema. But since this type of XML is used in real-world applications, it is a great fit for a benchmark.

Q8. How did you define the workload?

Matthias: Again, by experience with real XML transaction processing applications.

Q9. In your documentation you write that TPoX uses a “stateless” workload. What does this mean in practice? Why did you make this choice?

Matthias: It means that every transaction is submitted to the database independently from any previous transactions. As a result, the TPoX workload driver doesn’t need to remember anything about previous transactions. This makes it easier to design and implement a benchmark that scales to billions of XML documents and hundreds of millions of transactions in a single benchmark run.
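A minimal sketch of what such a stateless driver loop looks like (the transaction name and parameter range are invented; the real TPoX driver reads them from its workload description): each simulated user draws fresh random parameters and submits an independent transaction, so no state from earlier transactions has to be kept.

```java
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative stateless workload loop; not the actual TPoX driver. */
public class StatelessWorkloadSketch {
    public static void main(String[] args) {
        int concurrentUsers = 8;
        int transactionsPerUser = 1000;
        ExecutorService pool = Executors.newFixedThreadPool(concurrentUsers);

        for (int u = 0; u < concurrentUsers; u++) {
            pool.submit(() -> {
                Random rnd = new Random();
                for (int i = 0; i < transactionsPerUser; i++) {
                    // Fresh random parameter every time; nothing carried over.
                    long accountId = 1 + rnd.nextInt(1_000_000);
                    submitTransaction("lookup_account", accountId);
                }
            });
        }
        pool.shutdown();
    }

    private static void submitTransaction(String name, long parameter) {
        // In a real driver this would send an XQuery/SQL statement to the database.
        System.out.println(name + "(" + parameter + ")");
    }
}
```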

Q10. Why not define a workload also for complex analytical queries?

Matthias: We did! And we ran it on a 10TB XML data warehouse with more than 5.5 Billion XML documents.
That was a very exciting project and you can find more details on my blog.
Although the initial wave of XML database adoption was more focused on transactional and operational systems, companies soon realized that they were accumulating very large volumes of XML documents that contained a goldmine of information. Hence, the need for XML warehousing and complex analytical XML queries was pressing. We wanted to show that DB2’s shared-nothing architecture scales horizontally for XML warehousing just as it does for traditional relational warehousing workloads.

Agustin: Admittedly, we have not yet formally included this workload of complex XML queries into the TPoX benchmark. Just like TPC-C and TPC-H are separate for transaction processing vs. decision support, we would also need to define two flavors of TPoX, even if the underlying XML data remains the same. A TPoX workload with complex queries is definitely very meaningful and desirable.

Q11. What are the main new results you obtained so far? What are the main values of the results obtained so far?

Agustin: We have produced many results using TPoX over the years, with ever larger numbers of transactions per second and continuous scalability of the benchmark on increasingly larger platforms. A key value is to provide strong data points that demonstrate and quantify how XML database processing can be done with very high performance. In particular, the first public 1TB XML benchmark that we did a few years ago has helped establish the notion that efficient XML transaction processing is a reality today. Such results give the hardware and the software a lot of credibility in the industry. And of course we learn a lot with every benchmark, which allows us to continuously improve our products.

Q12. You write in your Blog “For 5 years now Intel has a strong history of testing and showcasing many of their latest processors with the Transaction Processing over XML (TPoX) benchmark.” Why has Intel been using the TPoX benchmark? What results did they obtain?

Matthias: I let Agustin answer this one.

Agustin: Intel uses the TPoX benchmark because it helps us demonstrate the power of Intel platforms and generate insights on how to improve them. TPoX also enables us to work with IBM to improve the performance of DB2 on Intel platforms, which is good for both IBM and Intel. This collaboration of Intel and IBM around TPoX is an example of an extensive effort at Intel to make sure that enterprise software has excellent performance on Intel. You can see our most important results on the TPoX web page.

Q13. Can you use TPoX to evaluate other kinds of databases (e.g. Relational, NoSQL, Object Oriented, Cloud stores)? How does TPoX compare with the Yahoo! YCSB benchmark for Cloud Serving Systems?

Matthias: Yes, the TPoX workload driver can be used to run traditional SQL workloads against relational databases. Assuming you have a populated relational database, you can define a SQL workload and use the TPoX driver to parameterize, execute, and measure it. TPoX and YCSB have been designed for different systems under test. However, parts of the TPoX framework can be reused to quickly develop other types of benchmarks, especially since TPoX offers various extension points.

Agustin: Some open source relational databases have started to offer at least partial support for the SQL/XML functions and the XML data type. Given the level of parameterization and the extensible nature of the TPoX workload driver it would be very easy to develop custom workloads for the emerging support of the XML data type on open source databases. At the same time, the powerful XML document generator included in the kit can be used to generate the required data. Using TPoX to test the performance of XML in open source databases is an intriguing possibility.

Q14. Is it possible to extend TPoX? If yes, how?

Matthias: Yes, TPoX can be extended in several ways. First, you can change the TPoX workload in any way you want. You can modify, add, or remove transactions from the workload, you can change their relative weight, and you can change the random value distributions that are used for the transaction parameters. We have used the TPoX workload driver to run many different XML workloads, including on XML data other than the TPoX documents. We have also used the workload driver for relational SQL performance tests, just because it’s so easy to set up concurrent workloads.
Second, the database specific interface of the TPoX workload driver is encapsulated in a single Java class, so it is relatively easy to port the driver to another database system. And third, the new version TPoX 2.1 allows transactions to be coded not only in SQL, SQL/XML, and XQuery, but also in Java. TPoX 2.1 supports “Java-Plugin transactions” that allow you to implement whatever activities you want to run and measure in a concurrent manner. For example, you can run transactions that call a web service, send or receive data from a message queue, access a content management system, or perform any other operations – only limited by what you can code in Java!
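The fragment below sketches the idea of such a plugin transaction. The class and method shown are an assumption for illustration only; the actual TPoX 2.1 plugin interface has its own names and signatures. The point is simply that whatever the method does, for example an HTTP round trip, is what the driver executes concurrently and measures.

```java
/**
 * Sketch of the idea behind a "Java plugin transaction": the workload driver
 * measures whatever a user-supplied class does when it is executed.
 * The interface shown here is assumed for illustration; the real TPoX 2.1
 * plugin API differs in names and signatures.
 */
public class HttpCheckTransaction {

    /** Called concurrently by the driver; the elapsed time of this call is what gets measured. */
    public void execute() throws Exception {
        java.net.HttpURLConnection con = (java.net.HttpURLConnection)
                new java.net.URL("http://localhost:8080/service/ping").openConnection();
        con.setRequestMethod("GET");
        int status = con.getResponseCode();   // the measured "transaction" is this round trip
        con.disconnect();
        if (status != 200) {
            throw new Exception("Unexpected HTTP status: " + status);
        }
    }
}
```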

Agustin: At Intel we have been using TPoX internally for various other projects. Since the TPoX workload driver is open source, it is straightforward to modify it to support other types of workloads, not necessarily steady state, which makes it amenable to testing other aspects of computer systems such as power management, storage, and so on.

Q15 What are the current limitations of TPoX?

Matthias: Out of the box, the TPoX workload driver only works with databases that offer a JDBC interface. If a particular database system has specific requirements for its API or query syntax, then some modifications may be necessary. Some database systems might require their own JDBC driver to be compiled into the workload driver.

Q16. Who else is using TPoX?

Matthias: You can see some examples of other TPoX usage on the TPoX web site. We know that other database vendors are using TPoX internally, even if they haven’t decided to publish results yet. I also know a company in the data security space that uses TPoX to evaluate the performance of different data encryption algorithms. And TPoX also continues to be used at various universities in Europe, the US, and Asia for a variety of research and student projects. For example, the University of Kaiserslautern in Germany has used TPoX to evaluate the benefit of solid-state disks for XML databases. Other universities have used TPoX to evaluate and compare the performance of several XML-only databases.

Q17. TPoX is an open source project. How can the community contribute?

Matthias: A good starting point is to use TPoX. From there, contributing to the TPoX project is easy. For example, you can report problems and bugs, or you can submit new feature requests. Or even better, you can implement bug fixes and enhancements yourself and submit them to the SVN code repository on sourceforge.net.
If you design other workloads for the TPoX data set, you can upload new workloads to the TPoX project site and have your results posted on the TPoX web site.

Agustin: As is customary for an open source project on sourceforge, anybody can download all TPoX files and source code freely.
If you want to upload any changed or new files or modify the TPoX web page, you only need to become a member of the TPoX sourceforge project, which is quick and easy.
Everybody is welcome, without exceptions.

Resources

TPoX software:

XML Database Benchmark: “Transaction Processing over XML (TPoX)”:
TPoX is an application-level XML database benchmark based on a financial application scenario. It is used to evaluate the performance of XML database systems, focusing on XQuery, SQL/XML, XML Storage, XML Indexing, XML Schema support, XML updates, logging, concurrency and other database aspects.
Download TPoX (LINK), July 2009. | TPoX Results (LINK), April 2011.

Articles:

“Taming a Terabyte of XML Data”.
Agustin Gonzalez, Matthias Nicola, IBM Silicon Valley Lab.
Paper | Advanced | English | LINK DOWNLOAD (PDF)| 2009|

“An XML Transaction Processing Benchmark”.
Matthias Nicola, Irina Kogan, Berni Schiefer, IBM Silicon Valley Lab.
Paper | Advanced | English | LINK DOWNLOAD (PDF)| 2007|

“A Performance Comparison of DB2 9 pureXML with CLOB and Shredded XML Storage”.
Matthias Nicola et al., IBM Silicon Valley Lab.
Paper | Advanced | English | LINK DOWNLOAD (PDF)| 2006|

Related Posts

Measuring the scalability of SQL and NoSQL systems.

Benchmarking ORM tools and Object Databases.

Sep 7 11

Big Data and the European Commission: How to get involved.

by Roberto V. Zicari

I reported in a previous post that the European Commission has a budget for funding projects in the area of “Intelligent Information Management”.
The European Commission is looking to support projects that develop and test new technologies to manage and exploit extremely large volumes of data, with real time capabilities.

Here is some new and interesting information on how you can get involved.

RVZ

1. Call 8 of the Seventh Framework Programme (FP7) was published on 20 July 2011 (Call FP7-ICT-2011-8).

2. A new document called “Technical background notes for Framework Programme 7, Strategic Objective ICT-2011.4.4 ‘Intelligent Information Management’” is now available.
The document provides background information and technical commentary on Strategic Objective ICT-2011.4.4, as well as checklists. It aims to facilitate the preparation of proposals on this objective and the preparation of the Information and Networking Day.

3. An Information and Networking Day will take place on 26 September 2011 in Luxembourg, at the Jean Monnet Conference Centre.
Registration for the Information and Networking Day is open until 19 September 2011, 18h00, on the EU Event’s page.
You can also visit the home page of the EU Unit Technologies for Information Management for more information.

The deadline for submitting Presentations and Posters, and for the Pre-proposal service registration is 19 September 2011, 18h00 CET.

Useful information on the Information and Networking Day, 26 September 2011 in Luxembourg:

Matchmaking Sessions: These sessions are dedicated to “matchmaking/partner finding”, where you can present your organisation, your skills and ideas (morning), or state what sort of partners and skills you are looking for (afternoon sessions).
If you wish to take the podium for three minutes (maximum three slides), please tick the option in your online registration form and upload the presentation in the designated box in your participant’s profile.
Please choose the relevant session and indicate the headline of your presentation no later than 12 September 2011, and finalise your slides by 19 September 2011.

The slots will be allocated on a first-come-first-served basis, and the confirmation and exact timing will be communicated to you by e-mail. The presentations will be published after the event, if you have ticked the box “I do not object to the publication of my presentation on the meeting’s web site”.

Pre-proposal service: The Pre-proposal service will run in parallel to the Event’s programme. European Commission Project Officers will be available to discuss project ideas and offer preliminary feedback with respect to the Work programme and Call 8. If you wish to sign up for a bilateral meeting, please tick this option in the online registration form. The slots will be allocated on a first-come-first-served basis, and the exact schedule will be communicated to you at the registration desk.

Poster Session: The European Commission facilitates a poster session. It will give the participants the chance to showcase their work and to advertise their skills. No commercial advertising is allowed. Please send the posters electronically via your profile as registered participant and bring the paper version to the event (no print-out service is provided, maximum size A1 portrait). Please note that you will receive the confirmation by e-mail. The posters will be displayed in the Foyer of the Jean Monnet Conference Centre.

Registration and terms of participation

1. Participation is free of charge but subject to prior registration and confirmation (only registered participants can access the conference centre).

2. Travel and accommodation expenses must be borne by the delegates and will not be reimbursed by the European Commission.

Aug 29 11

The future of data management: “Disk-less” databases? Interview with Goetz Graefe

by Roberto V. Zicari

“With no disks and thus no seek delays, assembly of complex objects will have different performance tradeoffs. I think a lot of options in physical database design will change, from indexing to compression and clustering and replication.” — Goetz Graefe.

Are “disk-less” databases the future of data management? What about the issue of energy consumption and databases? Will we have “Green Databases”?
I discussed these issues with Dr. Goetz Graefe, HP Fellow (*), one of the most accomplished and influential technologists in the areas of database query optimization and query processing.

Hope you’ll enjoy this interview.

RVZ.

Q1 The world of data management is changing. Service platforms, scalable cloud platforms, analytical data platforms, NoSQL databases and new approaches to concurrency control are all becoming hot topics both in academia and industry. What is your take on this?

Goetz Graefe: I am wondering whether new and imminent hardware, in particular very large RAM memory as well as inexpensive non-volatile memory (NV-RAM, PCM, etc.) will be significant shifts that will affect software architecture, functionality, scalability, total cost of ownership, etc.
Hasso Plattner in his keynote at BTW (“SanssouciDB: An In-Memory Database for Processing Enterprise Workloads”) definitely seemed to think so. Whether or not one agrees with everything he said, he traced many of the changes back to not having disk drives in their new system (except for backup and restore).

For what it’s worth, I suspect that, based on the performance advantage of semiconductor storage, the vendors will ask for a price premium for semiconductor storage for a long time to come. That enables disk vendors to sell less expensive storage space, i.e., there’ll continue to be a role for traditional disk space for colder data such as long-ago history analyzed only once in a while.

I think there’ll also be a differentiation by RAID level. For example, warm data with updates might be on RAID-1 (mirroring) whereas cold historical data might be on RAID-5 or RAID-6 (dual redundancy, no data loss in dual failure). In the end, we might end up with more rather than fewer levels in the memory hierarchy.

Q2. What is expected impact of large RAM (volatile) memory on database technology?

Goetz Graefe: I believe that large RAM (volatile) memory has already made a significant difference. SAP’s HANA/Sanssouci/NewDB project is one example, C-Store/VoltDB is another. Other database management system vendors are sure to follow.

It might be that NoSQL databases and key-value stores will “win” over traditional databases simply because they adapt faster to the new hardware, even if purely because they currently are simpler than database management systems and contain less code that needs adapting.

Non-volatile memory such as phase-change memory and memristors will change a lot of requirements for concurrency control and recovery code. With storage in RAM, including non-volatile RAM, compression will increase in economic value, sophistication, and compression factors. Vertica, for example, already uses multiple compression techniques, some of them pretty clever.

Q3. Will we end up having “disk less” databases then?

Goetz Graefe: With no disks and thus no seek delays, assembly of complex objects will have different performance tradeoffs. I think a lot of options in physical database design will change, from indexing to compression and clustering and replication.
I suspect we’ll see disk-less databases where the database contains only application state, e.g., current account balances, currently active logins, current shopping carts, etc. Disks will continue to have a role and economic value where the database also contains history, including cold history such as transactions that affected the account balances, login & logout events, click streams eventually leading to shopping carts, etc.

Q4. Where will the data go if we have no disks? In the Cloud?

Goetz Graefe: Public clouds in some cases, private clouds in many cases. If “we” don’t have disks, someone else will, and many of us will use them whether we are aware of it or not.

Q5. As new developments in memory (also flash) occur, it will result in possibly less energy consumption when using a database. Are we going to see “Green Databases” in the near future?

Goetz Graefe: I think energy efficiency is a terrific question to pursue. I know of several efforts, e.g., by Jignesh Patel et al. and Stavros Harizopoulos et al. Your own students at DBIS Goethe Universität just did a very nice study, too.

It seems to me there are many avenues to pursue.

For example, some people just look at the most appropriate hardware, e.g., high-performance CPUs such as Xeon versus high-efficiency CPUs such as Centrino (back then). Similar thoughts apply to storage, e.g., (capacity-optimized) 7200 rpm SATA drives versus (performance-optimized) 15K rpm fiber channel drives.
Others look at storage placement, e.g., RAID-1 versus RAID-5/6, and at storage formats, e.g., columnar storage & compression.
Others look at workload management, e.g., deferred index maintenance (& view maintenance) during peak load (perhaps the database equivalent to load shedding in streams) or ‘pause and resume’ functionality in utilities such as index creation.
Yet others look at caching, e.g., memcached. Etc.

Q6. What about Workload management?

Goetz Graefe: Workload management really has two aspects: the policy engine including its monitoring component that provides input into policies, and the engine mechanisms that implement the policies. It seems that most people focus on the first aspect above. Typical mechanisms are then quite crude, e.g., admission control.

I have long been wondering about more sophisticated and graceful mechanisms. For example, should workload management control memory allocation among operations? Should memory-intensive operations such as sorting grow & shrink their memory allocation during their execution (i.e., not only when they start)? Should utilities such as index creation participate in resource management? Should index creation (etc.) support ‘pause and resume’ functionality?

It seems to me that I’d want to say ‘yes’ to all those questions. Some of us at Hewlett-Packard Laboratories have been looking into engine mechanisms in that direction.

Q7. What are the main research questions for data management and energy efficiency?

Goetz Graefe: I’ve recently attended a workshop by NSF on data management and energy efficiency.
The topic was split into data management for energy efficiency (e.g., sensors & history & analytics in smart buildings) and energy efficiency in data management (e.g., efficiency of flash storage versus traditional disk storage).
One big issue in the latter set of topics was the difference between traditional performance & scalability improvements versus improvements in energy efficiency, and we had a hard time coming up with good examples where the two goals (performance, efficiency) differed. I suspect that we’ll need cases with different resources, e.g., trading 1 second of CPU time (50 Joule) against 3 seconds of disk time (20 Joule).
NSF (the US National Science Foundation) seems to be very keen on supporting good research in these directions. I think that’s very timely and very laudable. I hope they’ll receive great proposals and can fund many of them.

————————–
Dr. Goetz Graefe is an HP Fellow, and a member of the Intelligent Information Management Lab within Hewlett-Packard Laboratories. His experience and expertise are focused on relational database management systems, gained in academic research, industrial consulting, and industrial product development.

His current research efforts focus on new hardware technologies in database management as well as robustness in database request processing in order to reduce total cost of ownership. Prior to joining Hewlett-Packard Laboratories in 2006, Goetz spent 12 years as software architect in product development at Microsoft, mostly in database management. Both query optimization and query execution of Microsoft’s re-implementation of SQL Server are based on his designs.

Goetz’s areas of expertise within database management systems include compile-time query optimization including extensible query optimization, run-time query execution including parallel query execution, indexing, and transactions. He has also worked on transactional memory, specifically techniques for software implementations of transactional memory.

Goetz studied Computer Science at TU Braunschweig from 1980 to 1983.

(*) HP Fellows are “pioneers in their fields, setting the standards for technical excellence and driving the direction of research in their respective disciplines”.

Resources:

NoSQL Data Stores (Free Downloads and Links).

Cloud Data Stores (Free Downloads and Links).

Graphs and Data Stores (Free Downloads and Links).

Databases in General (Free Downloads and Links).

##

Aug 17 11

On Versant’s technology. Interview with Vishal Bagga.

by Roberto V. Zicari

“We believe that data only becomes useful once it becomes structured.” — Vishal Bagga

There is a lot of discussion about NoSQL databases nowadays. But what about object databases?
I asked a few questions to Vishal Bagga, Senior Product Manager at Versant.

RVZ

Q1. How has Versant’s technology evolved over the past three years?

Vishal Bagga: Versant is a customer driven company. We work closely with our customers trying to understand how we can evolve our technology to meet their challenges – whether it’s regarding complexity, data size or demanding workloads.

In the last 3 years we have seen 2 very clear trends from our interactions with our new and existing customers – growing data sizes and increasingly parallel workloads. This is very much in line with what the general database market is seeing. In addition, there were requests for simplified database management and monitoring.

Our state-of-the-art Versant Object Database 8, released last year, was designed for exactly these scenarios. We have added increased scalability and performance on multi-core architectures, faster and better defragmentation tools, and Eclipse-based management and monitoring tools, to name a few. We are also re-architecting our database server technology to automatically scale when possible without manual DBA intervention and to allow online tuning (reconfiguring the database instance online without impacting applications).

Q2. On December 1, 2008 Versant acquired the assets of the database software business of Servo Software, Inc. (formerly db4objects, Inc.). What happened to db4objects since then? How does db4objects fit into Versant technology strategy?

Vishal Bagga: The db4o community is doing well and is an integral part of Versant. In fact, when we first acquired db4o at the end of 2008, there were just short of 50,000 registered members.
Today, the db4o community boasts nearly 110,000 members, having more than doubled in size in the last 2+ years.
In addition, db4o has had 2 major releases with some significant advances in enterprise type features allowing things like online defragmentation support. In our latest major release, we announced a new data replication capability between db4o and the large scale enterprise class Versant database.
Versant sees a great need in the mobile markets for technology like db4o which can play well in the lightweight handheld, mobile computing and machine-to-machine space while leveraging big data aggregation servers like Versant which can handle the huge number of events coming off of these intelligent edge devices.
In the coming year, even greater synergies are being developed and our communities are merging into one single group dedicated to next generation NoSQL 2.0 technology development.
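For readers who have not seen db4o, the sketch below shows why it fits the embedded and mobile space: the whole database is one local file, there is no server process, and plain application objects are stored without a schema definition. The Reading class is invented for the example, and the calls follow db4o's embedded API; check the db4o documentation for the exact version you use.

```java
import com.db4o.Db4oEmbedded;
import com.db4o.ObjectContainer;
import com.db4o.ObjectSet;

public class Db4oSketch {
    // A plain application class; db4o persists it without any schema definition.
    static class Reading {
        String sensorId;
        double value;
        Reading(String sensorId, double value) { this.sensorId = sensorId; this.value = value; }
    }

    public static void main(String[] args) {
        // The whole database is a single local file -- no server process,
        // which is what makes db4o attractive on handheld and embedded devices.
        ObjectContainer db = Db4oEmbedded.openFile(Db4oEmbedded.newConfiguration(), "readings.db4o");
        try {
            db.store(new Reading("sensor-7", 21.5));

            // Query by example: a template object with the fields you care about filled in.
            ObjectSet<Reading> result = db.queryByExample(new Reading("sensor-7", 0));
            while (result.hasNext()) {
                Reading r = result.next();
                System.out.println(r.sensorId + " = " + r.value);
            }
        } finally {
            db.close();
        }
    }
}
```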

Q3. Versant database and NoSQL databases: what are the similarities and what are the differences?

Vishal Bagga: The Not Only SQL databases are essentially systems that have evolved out of a certain business need – the need was essentially to have horizontally scalable systems running on commodity hardware with a simple “soft-schema” model, for example for social networking, offline data crunching, distributed logging systems, event processing systems, etc.

Relational databases were considered to be too slow, expensive, difficult to manage and administer, and difficult to adapt to quickly changing models.

If I look at similarities between Versant and NoSQL, I would say that:

Both systems are designed around the inefficiency of JOINs. This is the biggest problem with relational databases. If you think about it, in most operational systems relations don’t change, e.g. Blog:Article, Order:OrderItem, so why recalculate those relations each time they are accessed using a methodology which gets slower and slower as the amount of data gets larger? JOINs have a use case, but for some 20%, not 100%, of the use cases.

Both systems leverage an architectural shift to a “soft-schema” which allows scale-out capability – the ability to partition information across many physical nodes and treat those nodes as 1 ubiquitous database.

When it comes to differences:

The biggest in my opinion is the complexity of the data. Versant allows you to model very complicated data models seamlessly and with ease, whereas doing so with a NoSQL solution would take much more effort and you would need to write a lot of code in the application to represent the data model.
In this respect, Versant prefers to use the term “soft-schema” –vs- the term “schemaless”, terms which are often interchanged in discussion.
We believe that data only becomes useful once it becomes structured; in fact, that is the whole point of technologies like Hadoop – to churn unstructured data looking for a way to structure it into something useful.
NoSQL technologies that bill themselves as “schema-less” are in denial of the fact that they leave the application developer with the burden of defining the structure and mapping the data into that structure in the language of the application space. In many ways, it is the mapping problem all over again. Plus, that kind of data management is very hard to change over time, leading to a brittle solution that is difficult to optimize for more than one use case. The use of “soft-schema” lends itself to a more enterprise-manageable and extensible system where the database still retains important elements of structure, while still being able to store and manipulate unstructured types.

Another is the difference in the consistency model. Versant is ACID-centric and Versant’s customers depend on this for their mission-critical systems – it would be nearly impossible for these systems to use NoSQL given the relaxed constraints. Versant can operate in a CAP mode, but that is not our only mode of operation. You use it where it is really needed; you are not forced into using it unilaterally.

NoSQL systems make you store your data in a way that you can look it up efficiently by a key. But what if you want to look something up differently? It is likely to be terribly inefficient. This may be okay for the design, but a lot of people do not realize that this is a big change in mindset. Versant offers a more balanced approach where you can navigate between related objects using references; you can, for example, define a root object and then navigate your tree from that object. At the same time you can run ad-hoc queries whenever you want to.
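To make the contrast concrete, here is a plain-Java sketch of the access pattern Vishal describes. The Order and OrderItem classes are invented for the example; an ODBMS such as Versant persists an object graph like this directly, so related objects are reached by following references from a root rather than by repeated key lookups or recomputed joins.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of navigating an object graph from a root object. */
public class NavigationSketch {
    static class OrderItem {
        String product;
        int quantity;
        OrderItem(String product, int quantity) { this.product = product; this.quantity = quantity; }
    }

    static class Order {                      // the "root" object of this small graph
        String id;
        List<OrderItem> items = new ArrayList<>();
        Order(String id) { this.id = id; }
    }

    public static void main(String[] args) {
        Order order = new Order("2011-0042");
        order.items.add(new OrderItem("disk", 2));
        order.items.add(new OrderItem("memory", 4));

        // Navigation: start at the root and follow references to related objects.
        for (OrderItem item : order.items) {
            System.out.println(order.id + " -> " + item.product + " x" + item.quantity);
        }
        // A key-value store would instead require fetching each item by its key
        // (or storing the whole graph under one key and reshaping it in application code).
    }
}
```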

Q4. Big Data: Can Versant database be useful when dealing with petabytes of user data? How?

Vishal Bagga: I don’t see why not. Versant was designed to work on a network of databases from the very start. Dealing with a petabyte is really about designing a system with the right architecture. Versant has that architecture, just as much as anyone else in the database space claiming they can handle a petabyte. Make no mistake, no matter how you do it, it is a non-trivial task. Today, our largest customer databases are in the hundreds of terabytes range, so getting to a petabyte is really a matter of needing that much data.

Q5. Hadoop is designed to process large batches of data quickly. Do you plan to use Hadoop and leverage components of the Hadoop ecosystem like HBase, Pig, and Hive?

Vishal Bagga: Yes, and some of our customers already do that today. A question for you: “Why are those layers in existence?” I would say the answer is that most of these early NoSQL 1.0 technologies do not handle real-world complexity in information models. So these layers are built to try and compensate for that fact. That is the exact point where Versant’s NoSQL 2.0 technology fits into the picture: we help people deal with the complexity of information models, something that first-generation NoSQL has not managed to accomplish.

Q6. Do you think that projects such as JSON (JavaScript Object Notation) and MessagePack (a binary-based, efficient object serialization library) play a role in the ODBMS market?

Vishal Bagga: Absolutely. We believe in open standards. Fortunately, you can store any type in an ODBMS. These specific libraries are particularly important for the currently most popular client frameworks, like Ajax. Finding ways to deliver a soft-schema into a client-friendly format is essential to help ease the development burden.

Q7. Looking at three elements – Data, Platform, Analysis – where is Versant heading?

Vishal Bagga: It is a difficult question, as database and data management is increasingly a cross-cutting concern. It used to be perfectly fine to keep your analysis as part of your offline OLAP systems, but these days there is an increasing push to bring analytics to the real-time business.
So you play with Data, you play with Analytics, whether you do it directly or in concert with other technologies through partnerships. Certainly, as Versant embraces Platform as a Service, we will do so through ecosystem partners who are paving the way with new development and deployment methodologies.

Related Posts

Objects in Space: “Herschel” the largest telescope ever flown. (March 18, 2011)

Benchmarking ORM tools and Object Databases. (March 14, 2011)

Robert Greene on “New and Old Data stores” . (December 2, 2010)

Object Database Technologies and Data Management in the Cloud. (September 27, 2010)

##

Aug 9 11

Google Fusion Tables. Interview with Alon Y. Halevy

by Roberto V. Zicari

“The main challenge is that it’s hard for people who have data, but not database expertise, to manage their data.” – Alon Y. Halevy.

Google Fusion Tables was launched on June 9th, 2009. I wanted to know what happened since then. I have therefore interviewed Dr. Alon Y. Halevy, who heads the Structured Data Group at Google Research.

RVZ

Q1. On your web page you write that your job at Google is “to make data management tools collaborative and much easier to use, and to leverage the incredible collections of structured data on the Web.” What are the main problems and challenges you are currently facing?

Halevy: The main challenge is that it’s hard for people who have data, but not database expertise, to manage their data, share it and create visualizations. Data management requires too much up-front effort, and that forms a significant impediment to data sharing on a large scale.

Q2. Your group is responsible for Google Fusion Tables, “a service for managing data in the cloud that focuses on ease of use, collaboration and data integration.” What is exactly Google Fusion Tables? What challenges with respect to data management is solving and how?

Halevy: Fusion Tables enables you to easily upload data sets (e.g., spreadsheets and CSV files) to the cloud and manage them. Fusion Tables makes it easy to create insightful visualizations (e.g., maps, timelines and other charts) and to share these with collaborators or with the public at large.

In addition, Fusion Tables enables merging data sets that belong to different owners. The true power of data is realized when we can combine data from multiple sources and draw conclusions that were impossible earlier.

As an example, this visualization combines data about earthquakes and data about the location of nuclear power reactors, showing what areas are prone to disasters similar to the one experienced in Japan in March 2011.

Q3. Fusion Tables enables users to upload tabular data files. Is there a limitation to the size of such users data files? Which one?

Halevy: Right now we allow 100MB per table and 250MB per user, but that’s not a technical limitation, just a limitation of our free offering.

Q4. Google Fusion Tables was launched On June 9th, 2009. What happened since then?

Halevy: We’ve continually improved our service, based largely on needs expressed by our users. In particular, we’ve made our map visualizations much more powerful, and we’ve developed APIs for programmers.

Q5. What data sources do you consider and how do you integrate them together?

Halevy: We do not prescribe any data sources. All the sources you can obtain in Fusion Tables come from users who’ve explicitly marked their data sets as public. Data sources are combined with a merge operation that’s similar to a ‘join’ in SQL.

Q6. How do you ensure a good performance in the presence of millions of user tables? In fact, do you have any benchmark results for that?

Halevy: Given that we’re the first ones to pursue storing such a large number of tables in a single repository, there isn’t an established benchmark. The technical description of how we do it appears in two short papers that we published in SIGMOD 2010 [2] and SoCC 2010 [1] .

Q7. One of the main features of Fusion Tables is that it “allows multiple users to merge their tables into one, even if they do not belong to the same organization or were not aware of each other when they created the tables.
A table constructed by merging (by means of equi-joins) multiple base tables is a view” [1]. What about data consistency, duplicates and updates? Do you handle such cases?

Halevy: The data belongs to the users, not to us, so they have to ensure that it is up to date and does not contain duplicates. Of course, when you combine data from multiple sources you may get inconsistencies. Hopefully, the visualizations we provide will enable you to discover them quickly and resolve them.

Q8. How does Google Fusion Tables relates to Google Maps? What are the main challenges with respect to data management that you face when dealing with Big Data coming from Google Maps?

Halevy: Fusion Tables relies on a lot of the Google Maps infrastructure to display maps. The challenge when displaying maps from large data sets is that you need to do a lot of the computation on the server side so the client is not overwhelmed with points to render, while at the same time the user experience remains snappy and interactive.

Q9. Why and how do you manage data in the Cloud?

Halevy: Managing data in the cloud is easier for many data owners because they do not have to maintain their own database system (which requires hiring database experts). Putting data in the cloud is also a key facilitator in order to share data with others, including people outside your organization.

We manage the data using some of the Google infrastructure such as BigTable and some layers built over it.

Q10. When you started Google Fusion Tables you did not support complex SQL queries or high throughput transactions. Why? How is the situation now?

Halevy: We still don’t support all forms of SQL queries, and we’re not in the race to become the database system supporting the highest transaction throughput. There are plenty of products on the market that serve those needs. Our goal from the start was to help under-served users with data management tasks that typically do not require complex SQL queries or high-throughput transactions, but rather emphasize data sharing and visualization.

Q11. Fusion Tables is about “making it easier for people to create, manage and share structured data on the Web.”
What about handling unstructured data and the Web?

Halevy: There are plenty of other tools for that, including blogs, site creation tools and cloud-based word processors.

Q12. You write “…to facilitate collaboration, users can conduct fine-grained discussions on the data.” What does it mean?

Halevy: This means that users can attach comments to individual rows in a table, individual columns and even individual cells. If you are collaborating on a large data set then it is not enough to put all the comments in one big blob around the table. You really need to attach it to specific pieces of data.

Q13. Is Google Fusion Tables open source? Do you have plans to open up the API to developers?

Halevy: We have had an API for almost two years now. Fusion Tables is not open source; it’s a Google service built on top of Google’s infrastructure.

Q14. Fusion Tables API provides developers with a way of extending the functionality of the platform. What are the main extensions being implemented by the community so far?

Halevy: There have been tools developed for importing data from different formats into Fusion Tables (e.g., Shape to Fusion). There is also a tool for importing Fusion Tables into the R statistical package and outputting the results back into Fusion Tables.

The API is used mostly to tailor applications using Fusion Tables to specific needs.

Q15. How Fusion Tables differs from Amazon`s SimpleDB?

Halevy: Fusion Tables is designed primarily to be a user-facing tool, not so much for developers. I should emphasize that Fusion Tables is not part of Google Labs anymore – it “graduated” over a year ago.

Q16. In 2005 you co-authored a paper introducing the concept of “Dataspace Systems”, that is, systems which provide “pay-as-you-go data management based on best-effort services.” Is this still relevant? What exactly are these “Dataspace Systems”?

Halevy: Yes, it is. In fact, Fusion Tables is one example of a dataspace system. Fusion Tables does not require you to create a schema before entering the data, and it tries to infer the data types of the columns in order to offer relevant visualizations.
The collaborative aspects of Fusion Tables make it easier for a group of collaborators to improve the quality of the data and combine it with other data sets.

Dataspace systems are still in their infancy, and we have a long way to go to realize the full vision.

Q17. You have worked on the Deep Web, the Surface Web, and now Fusion Tables. How do these three areas relate to each other?
What is your next project?

Halevy: All of these projects have the same overall goal: to make structured data on the Web more discoverable, so users can enhance it, combine it with data from other sources, and create and publish interesting new data sets.

The Deep Web project had the goal of extracting data sets from behind HTML forms and making the data discoverable in search.
The Surface Web (WebTables) project’s goal was to identify interesting data sets that are on the Web but are not being treated optimally. Fusion Tables provides a tool for users to upload their own data and publish data sets that can be crawled by search engines.

These three projects — and filling in the gaps between — will keep me busy for a while to come!

———————
Dr. Alon Halevy heads the Structured Data Group at Google Research. Prior to that, he was a Professor of Computer Science at the University of Washington, where he founded the Database Research Group. From 1993 to 1997 he was a Principal Member of Technical Staff at AT&T Bell Laboratories (later AT&T Laboratories). He received his Ph.D. in Computer Science from Stanford University in 1993, and his Bachelor's degree in Computer Science and Mathematics from the Hebrew University in Jerusalem in 1988. Dr. Halevy was elected a Fellow of the Association for Computing Machinery in 2006.
————————-

Resources:

[1] Google Fusion Tables: Web-Centered Data Management and Collaboration (link, .pdf).
Hector Gonzalez, Alon Halevy, Christian S. Jensen, Anno Langen, Jayant Madhavan, Rebecca Shapley, Warren Shen, Google Inc. In SoCC '10, June 10-11, 2010, Indianapolis, Indiana, USA.

[2] Megastore: A Scalable Data System for User Facing Applications.
J. Furman, J. S. Karlsson, J.-M. Leon, A. Lloyd, S. Newman, and P. Zeyliger. In SIGMOD, 2008.

[3] Megastore: Providing Scalable, Highly Available Storage for Interactive Services (link, .pdf).
Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh, Google, Inc. In the 5th Biennial Conference on Innovative Data Systems Research (CIDR '11), January 9-12, 2011, Asilomar, California, USA.

Google Fusion Tables (link)

Fusion Tables API (link)

Amazon`s SimpleDB (link)

Deep-web crawl Project (Download .pdf).

WebTables Project (Download .pdf)

——————–

Jul 25 11

How good is UML for Database Design? Interview with Michael Blaha.

by Roberto V. Zicari

“The tools are not good at taking changes to a model and generating incremental design code to alter a populated database.” — Michael Blaha

The Unified Modeling Language™ – UML – is OMG’s most-used specification.
UML is a de facto standard for object modeling, and it is often used for database design as well. But how good is UML really for the task of database conceptual modeling?
I asked a few questions to Dr. Michael Blaha, one of the leading authorities on databases and data modeling.

RVZ

Q1. Why use UML for database design?

Blaha: Often the most difficult aspect of software development is abstracting a problem and thinking about it clearly — that is the purpose of conceptual data modeling.
A conceptual model lets developers think deeply about a system, understand its core essence, and then choose a proper representation. A sound data model is extensible, understandable, ready to implement, less prone to errors, and usually performs well without special tuning effort.

The UML is a good notation for conceptual data modeling. The representation stands apart from implementation choices, be it a relational database, an object-oriented database, files, or some other mechanism.
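As a small illustration of that separation, here is a sketch in Java (my illustration, not an example from Blaha): the model says only that a Customer places many Orders; whether that association later becomes a foreign key, an object reference, or a record in a flat file is a downstream implementation choice.

import java.util.ArrayList;
import java.util.List;

// A tiny conceptual model: one Customer places many Orders.
// The classes deliberately say nothing about storage. In a relational
// design the association would typically become a customer_id foreign
// key on the order table; an object database could persist the
// references directly; a flat-file design would serialize the records.
class Customer {
    String name;
    final List<Order> orders = new ArrayList<>();

    Customer(String name) {
        this.name = name;
    }
}

class Order {
    final Customer customer;   // the other end of the association
    double total;

    Order(Customer customer, double total) {
        this.customer = customer;
        this.total = total;
        customer.orders.add(this);  // keep both ends of the association consistent
    }
}

Using it is as simple as new Order(new Customer("ACME"), 99.0); nothing in the model commits to a storage mechanism.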

Q2. What are the main weaknesses of UML for database design? And how do you cope with them in practice?

Blaha: First consider object databases. The design of object database code is similar to the design of OO programming code. The UML class model specifies the static data structure. The most difficult implementation issue is the weak support in many object database engines for associations. The workaround depends on object database features and the application architecture.

Now consider relational databases. Relational database tools do not support the UML. There is no technical reason for this, but several cultural reasons. One is that there is a divide between the programming and database communities; each has its own jargon, style, and history, and pays little attention to the other.
Also, the UML creators focused on unifying programming notation but spent little time talking to the database community.
The bottom line is that the relational database tools do not support the UML and the UML tools do not support relational databases. In practice, I usually construct a conceptual model with a UML tool (so that I can think deeply and abstractly).
Then I rekey the model into a database tool (so that I can generate schema).

Q3. Even if you have a sound high level UML design, what else can get wrong?

Blaha: I do lots of database reverse engineering for my consulting clients, mostly for relational database applications because that’s what’s used most often in practice. I start with the database schema and work backwards to a conceptual model. I published a paper 10 years ago with statistics on what goes wrong.

In practice, I would say that about 25% of applications have a solid conceptual model, 50% have a mediocre conceptual model, and 25% are just downright awful. Given that a conceptual model is the foundation for an application, you can see why many applications go awry.

In practice, about 50% of applications have a professional database design and 50% are substantially flawed. It’s odd to see so many database design mistakes, given the wide availability of database design tools. It’s relatively easy to take a conceptual model and generate a database design. This illustrates that the importance of software engineering has not reached many developers.

Of course, there can always be flaws in programming logic and user interface code, but these kinds of flaws are easier to correct if there is a sound conceptual model underlying the application and if the model is implemented well with a database schema.

Q4. And specifically for object databases?

Blaha: An object database is nothing special when it comes to the benefits of a sound model and software engineering practice. A carefully considered conceptual model gives you a free hand to choose the most appropriate development platform.

One of my past books (Object-Oriented Modeling and Design for Database Applications) vividly illustrated this point by driving object-oriented data models into different implementation targets, specifically relational databases, object databases, and flat files.

Q5. What are most common pitfalls?

Blaha: It is difficult to construct a robust conceptual model. A skilled modeler must quickly learn the nuances of a problem domain and be able to meld problem content with data abstractions and data patterns.

Another pitfall is failing to perform agile development. Developers must work quickly, deliver often, obtain feedback, and build on prior results to evolve an application. I have seen too many developers not take the principles of agile development to heart and become bogged down by ponderous development of interminable scope.

Another pitfall is that some developers are sloppy with database design. Nowadays there really is no excuse for that, as tools can generate database code. Object-oriented CASE tools can generate programming stubs that can seed an object database.
For relational database projects, I first construct an object-oriented model, then re-enter the design into a relational database tool, and finally generate the database schema. (The UML data modeling notation is nearly isomorphic with the modeling language in most relational database design tools.)

Q6. In your experience, how do you handle the situation where a UML conceptual database design is done and a database is implemented from that design, but later on updates are made to the implementation without considering the original conceptual design? What do you do in such cases?

Blaha: The more common situation is that an application gradually evolves and the software engineering documentation (such as the conceptual model) is not kept up to date.
With a lack of clarity for its intellectual focus, an application gradually degrades. Eventually there has to be a major effort to revamp the application and clean it up, or replace the application with a new one.

The database design tools are good at taking a model and generating the initial database design.
The tools are not good at taking changes to a model and generating incremental design code to alter a populated database.
Thus much manual effort is needed to make changes as an application evolves and keep documentation up to date. However, the alternative of not doing so is an application that eventually becomes a mess and is unmaintainable.

————————————————–
Michael Blaha is a partner at Modelsoft Consulting Corporation.
Dr. Blaha is recognized as one of the world’s leading authorities on databases and data modeling. He has more than 25 years of experience as a consultant and trainer in conceiving, architecting, modeling, designing, and tuning databases for dozens of major organizations around the world. He has authored six U.S. patents, six books, and many papers. Dr. Blaha received his doctorate from Washington University in St. Louis and is an alumnus of GE Global Research in Schenectady, New York.

Related Resources

OMG UML Resource Page.

Object-Oriented Design of Database Stored Procedures, By Michael Blaha, Bill Huth, Peter Cheung

Models, By Michael Blaha

Universal Antipatterns, By Michael Blaha

Patterns of Data Modelling (Database Systems and Applications), Michael Blaha, CRC Press, May 2010, ISBN 1439819890

Jul 15 11

Big Data and the European Commission: Call for Project Proposals.

by Roberto V. Zicari

I thought that this information would be of interest to the data management community:

The European Commission has a budget for funding projects in the area of Intelligent Information Management.
It is currently seeking Project Submissions in this area.

It is not always easy to understand the official documents published by the European Commission… Therefore, I have tried to put together a set of easy-to-read questions and answers for those of you who might be interested in knowing more.

Hope it helps

RVZ

– What is it?

The European Commission has a budget for funding projects in the area of Intelligent Information Management.

– Which programme is it?

Formally, the programme is called “Work Programme 2011-2012 of the FP7 Specific Programme ‘Cooperation’, Theme 3, ICT – Information and Communication Technologies”, which was published on 19 July 2010.

Within Challenge 4, ‘Technologies for Digital Content and Languages’, the Work Programme contains the objective ICT-2011.4.4 – Intelligent Information Management.

The objective ICT-2011.4.4 is part of Call 8, which is expected to be published on July 26, 2011.

– What kind of projects is the European Commission looking to support?

Projects that develop and test new technologies to manage and exploit extremely large volumes of data, with real time capabilities whenever this is relevant. Projects must be the joint work of consortia of partners.
It is very important that at least one member of your consortium be willing and able to make available the large volumes of data needed to test your ideas.

– What kind of support is the EU giving to accepted project proposals?

It depends on the type of partner and the type of proposal. For the most common type of proposal, the funding covers up to 75% of direct costs for research institutions and small or medium enterprises.

– Why should I submit a proposal?

To join other talented people to solve a problem that you couldn’t solve alone and with your own resources.

– What are the benefits for me and/or for my organization to participate?

Funding is a clear benefit, but it should be thought of as a means to the real end, which is advancing the state of the art (for scientific or business objectives) while working with people from all over Europe who have very strong skills complementary to yours.

– What is required if the proposal is accepted?

Entering a grant agreement with the European Commission, committing to an agreed plan of work, opening your work to the evaluation of peers selected by the Commission, periodically reporting your costs using agreed standards, being open to audits throughout the duration of the project and for a few years afterwards.

– How do I qualify to participate?

The full rules for participation (which classify countries in various categories with different types of access to the programme) are available here.
Most participants are legal entities established in the EU.

– How can I participate?

You need to become part of an eligible consortium (see again rules for participation above) and submit a proposal that addresses the specific requirements of the call.

– When is the deadline?

Expected to be 17 January 2012 17:00 Brussels time. Please refer to the official text of the call when it is published.

– How do I submit a proposal?

You need to fill in the forms and submit your proposal using the submission system.

– Can I see other proposals submitted in the past?

This is not really possible because the Commission has an obligation of confidentiality to past submitters.

– Can I see some projects funded by the EU in the past in the same area?

Certainly. Please visit content-knowledge/projects_en.

– How do I get more info?

Further details on the scope of the call, individual research lines and indicative budget are provided in the Work Programme 2011-2012.

You can also write to: infso-e2@ec.europa.eu

Note: These calls are highly competitive. It is thus important that your idea be really innovative and that your plans for implementing and testing it be really concrete and credible. It is also important for you to find the right partners and to work with them as a team. This requires a joint vision based on shared objectives.
This means, in turn, that you will need to work as hard at figuring out how your skills can help others as at figuring out who could help you with what you want to do.

##

Jun 22 11

“Applying Graph Analysis and Manipulation to Data Stores.”

by Roberto V. Zicari

“This mind set is much different from the set theoretic notions of the relational database world. In the world of graphs, everything is seen as a walk—a traversal.” — Marko Rodriguez and Peter Neubauer.
__________________________________________________

Interview with Marko Rodriguez and Peter Neubauer.

The open source community is quite active in the area of Graph Analysis and Manipulation, and their applicability to new data stores. I wanted to know more about an open source initiative called TinkerPop.
I have interviewed Marko Rodriguez and Peter Neubauer, who are the leaders of the TinkerPop project.

RVZ

Q1. You recently started a project called “TinkerPop”. What is it?

Marko Rodriguez and Peter Neubauer:
TinkerPop is an open-source graph software group. Currently, we provide a stack of technologies (called the TinkerPop stack) and members contribute to those aspects of the stack that align with their expertise. The stack starts just above the database layer (just above the graph persistence layer) and connects to various graph database vendors — e.g. Neo4j, OrientDB, DEX, RDF Sail triple/quad stores, etc.

The graph database space is a relatively nascent space. At the time that TinkerPop started back in 2009, graph database vendors were primarily focused on graph persistence issues: storing and retrieving a graph structure to and from disk. Given the expertise of the original TinkerPop members (Marko, Peter, and Josh), we decided to take our research (from our respective institutions) and apply it to the creation of tools one step above the graph persistence layer. Out of that effort came Gremlin — the first TinkerPop project. In late 2009, Gremlin was pieced apart into multiple self-contained projects: Blueprints and Pipes.
From there, other TinkerPop products have emerged which we discuss later.

Q2. Who currently work on “Tinkerpop”?

Marko Rodriguez and Peter Neubauer:
The current members of TinkerPop are Marko A. Rodriguez (USA), Peter Neubauer (Sweden), Joshua Shinavier (USA), Stephen Mallette (USA), Pavel Yaskevich (Belarus), Derrick Wiebe (Canada), and Alex Averbuch (New Zealand).
However, recently, while not yet an “official member” (i.e. picture on website), Pierre DeWilde (Belgium) has contributed much to TinkerPop through code reviews and community relations. Finally, we have a warm, inviting community where users can help guide the development of the TinkerPop stack.

Q3. You say, that you intend to provide higher-level graph processing tools, APIs and constructs? Who needs them? and for what?

Marko Rodriguez and Peter Neubauer:
TinkerPop facilitates the application of graphs to various problems in engineering. These problems are generally defined as those that require expressivity and speed when traversing a joined structure. The joined structure is provided by a graph database. With a graph database, a user does not arbitrarily join two tables according to some predicate, as there is no notion of tables.
There only exists a single atomic structure known as the graph. However, in order to unite disparate data, a traversal is enacted that moves over the data in order to yield some computational side-effect — e.g. a search, a score, a rank, a pattern match, etc.
The benefit of the graph comes from being able to rapidly traverse structures to an arbitrary depth (e.g., tree structures, cyclic structures) and with an arbitrary path description (e.g. friends that work together, roads below a certain congestion threshold). Moreover, this space provides a unique way of thinking about data processing.
We call this data processing pattern the graph traversal pattern.
This mind set is much different from the set theoretic notions of the relational database world. In the world of graphs, everything is seen as a walk—a traversal.
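To make the traversal mind set concrete, here is a deliberately plain Java sketch (no TinkerPop API involved; the names and data are invented for illustration). It answers "which of Alice's friends work at the same company as Alice?" by walking 'knows' edges and filtering on a 'worksAt' property, rather than by joining tables.

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A toy graph kept in adjacency maps. The "query" is a walk:
// start -> out('knows') -> keep friends with the same employer.
public class FriendsWhoWorkTogether {
    public static void main(String[] args) {
        Map<String, List<String>> knows = new HashMap<>();
        Map<String, String> worksAt = new HashMap<>();

        knows.put("alice", Arrays.asList("bob", "carol", "dave"));
        worksAt.put("alice", "Acme");
        worksAt.put("bob", "Acme");
        worksAt.put("carol", "Globex");
        worksAt.put("dave", "Acme");

        String start = "alice";
        String employer = worksAt.get(start);

        for (String friend : knows.getOrDefault(start, Collections.emptyList())) {
            if (employer.equals(worksAt.get(friend))) {
                System.out.println(start + " works with friend " + friend);
            }
        }
    }
}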

Q4. Why use graphs and not objects and/or classical relations? What about the non-normalized data structures offered by NoSQL databases?

Marko Rodriguez and Peter Neubauer:
In a world where memory is expensive, hybrid memory/disk technology is a must (colloquially, a database).
A graph database is nothing more than a memory/disk technology that allows for the rapid creation of an in-memory object (sub)graph from a disk-based (full)graph. A traversal (the means by which data is queried/processed) is all about filling in memory those aspects of the persisted graph that are being touched as the traverser moves along the graph’s vertices and edges.
Graph databases simply cache what is on disk into memory which makes for a highly reusable in-memory cache.
In contrast, with a relational database, where any table can be joined with any table, many different data structures are constructed from the explicit tables persisted. Unlike a relational database, a graph database has one structure, itself.
Thus, components of itself are always reusable. Hence, a “highly reusable cache.” Given this description, if a persistence engine is sufficiently fast at creating an in-memory cache, then it meets the processing requirements of a graph database user.

Q5. Besides graph databases, who might need TinkerPop tools? Could they be useful for users of relational databases as well, or of other databases, like for example NoSQL or object databases? If yes, how?

Marko Rodriguez and Peter Neubauer:
In the end, the TinkerPop stack is based on the low-level Blueprints API.
By implementing the Blueprints API and making it sufficiently speedy, any database can, in theory, provide graph processing functionality. So yes, TinkerPop could be leveraged by other database technologies.

Q6. TinkerPop is composed of several subprojects: Gremlin, Pipes, Blueprints, and more. At first glance, it is difficult to grasp how they are related to each other. What are all these subprojects, and how do they relate to each other?

Marko Rodriguez and Peter Neubauer:
The TinkerPop stack is described from bottom-to-top:
Blueprints: A graph API with an operational semantics test suite that, when implemented, yields a Blueprints-enabled graph database which is accessible to all TinkerPop products.
Pipes: A data flow framework that allows for lazy graph traversing.
Gremlin: A graph traversal language that compiles down to Pipes.
Frames: An object-to-graph mapper that turns vertices and edges into objects and relations (and vice versa).
Rexster: A RESTful graph server that exposes the TinkerPop suite “over the wire.”
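To give a feel for the bottom of the stack, here is a minimal sketch written against the Blueprints property-graph API as I understand it; the package names and exact method signatures varied between Blueprints releases, so treat the imports below as assumptions rather than a definitive reference.

import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;

// Build a tiny property graph with the in-memory TinkerGraph reference
// implementation, then walk it. Swapping TinkerGraph for a Neo4j or
// OrientDB Blueprints implementation leaves the traversal code unchanged.
public class BlueprintsSketch {
    public static void main(String[] args) {
        Graph graph = new TinkerGraph();

        Vertex marko = graph.addVertex(null);
        marko.setProperty("name", "marko");
        Vertex peter = graph.addVertex(null);
        peter.setProperty("name", "peter");

        graph.addEdge(null, marko, peter, "knows");

        for (Edge e : marko.getEdges(Direction.OUT, "knows")) {
            System.out.println(marko.getProperty("name") + " knows "
                    + e.getVertex(Direction.IN).getProperty("name"));
        }

        graph.shutdown();
    }
}

The same traversal could then be expressed more concisely in Gremlin, which compiles such walks down to Pipes.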

Q7. Is there a unified API for TinkerPop? And if yes, what does it look like?

Marko Rodriguez and Peter Neubauer:
Blueprints is the foundation of TinkerPop.
You can think of Blueprints as the JDBC of the graph database community. Many graph vendors, while providing their own APIs, also provide a Blueprints implementation so the TinkerPop stack can be used with their database. Currently, Neo4j, OrientDB, DEX, RDF Sail, TinkerGraph, and Rexster are all TinkerPop promoted/supported implementations.
However, out there in the greater developer community, there exists an implementation for HBase (GraphBase) and Redis (Blueredis). Moreover, the graph database vendor InfiniteGraph plans to release a Blueprints implementation in the near future.

Q8. In your projects you speak of “dataflow-inspired traversal models”. What are they?

Marko Rodriguez and Peter Neubauer:
Data flow graph processing, in the Pipes/Gremlin sense, is a lazy iteration approach to graph traversal.
In this model, chains of pipes are connected. Each pipe is a computational step that is one of three types of operations: transform, filter, or side-effect.
A transformation pipe will take data of one type and emit data of another type. For example, given a vertex, a pipe will emit its outgoing edges. A filter pipe will take data and either emit it or not. For example, given an edge, emit it if its label equals “friend.” Finally, a side-effect pipe will take data and emit the same data; however, in the process it will yield some side-effect.
For example, increment a counter, update a ranking, print a value to standard out, etc.
Pipes is a library of general-purpose pipes that can be composed to effect a graph-traversal-based computation. Finally, Gremlin is a DSL (domain-specific language) that supports the concise specification of a pipeline. The Gremlin code base is actually quite small — all of the work is in Pipes.
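To show just the shape of such a pipeline without depending on the Pipes library itself, here is a rough analogy using plain Java streams (an analogy only, not the Pipes API): a transform step, a filter step, and a side-effect step chained together and pulled lazily by the terminal operation.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Analogy for the three pipe types using java.util.stream:
// transform (map), filter (filter), and side-effect (peek).
public class PipelineAnalogy {
    public static void main(String[] args) {
        List<String> edgeLabels = Arrays.asList("knows", "FRIEND", "created", "friend");
        AtomicInteger friendCount = new AtomicInteger();

        edgeLabels.stream()
                .map(String::toLowerCase)                      // transform
                .filter(label -> label.equals("friend"))       // filter
                .peek(label -> friendCount.incrementAndGet())  // side-effect
                .forEach(System.out::println);

        System.out.println("friend edges seen: " + friendCount.get());
    }
}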

Q9. How can other developers contribute to this project?

Marko Rodriguez and Peter Neubauer:
New members tend to be users. A user will get excited about a particular product or some tangent idea that is generally useful to the community. They provide thoughts, code, and ultimately, if they “click” with the group (coding style, approach, etc.), then they become members. For example, Stephen Mallette was very keen on advancing Rexster and as such, has and continues to work wonders on the server codebase.
Pavel Yaskevich was interested in the compiler aspects of Gremlin and contributed on that front through many versions. Pavel is also a contributing member to Cassandra’s recent query language known as CQL.
Derrick Wiebe has contributed a lot to Pipes and, in his day job, needed to advance particular aspects of Blueprints (and luckily, this benefits others). There are no hard rules to membership. Primarily it’s about excitement, dedication, and expert-level development.
In the end, the community requires that TinkerPop be a solid stack of technologies that is well thought out and consistent throughout. In TinkerPop, it’s less about features and lines of code than about a consistent story that resonates well with those succumbing to the graph mentality.

____________________________________________________________________________________

Marko A. Rodriguez:
Dr. Marko A. Rodriguez currently owns the graph consulting firm Aurelius LLC. Prior to this venture, he was a Director’s Fellow at the Center for Nonlinear Studies at the Los Alamos National Laboratory and a Graph Systems Architect at AT&T.
Marko’s work for the last 10 years has focused on the applied and theoretical aspects of graph analysis and manipulation.

Peter Neubauer:
Peter Neubauer has been deeply involved in programming for over a decade and is co-founder of a number of popular open source projects such as Neo4j, TinkerPop, OPS4J and Qi4j. Peter loves connecting things, writing novel prototypes and throwing together new ideas and projects around graphs and society-scale innovation.
Right now, Peter is the co-founder and VP of Product Development at Neo4j Technology, the company sponsoring the development of the Neo4j graph database.
If you want brainstorming, feed him a latte and you are in business.

______________________________________

For further reading

Graphs and Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations| Tutorials, Lecture Notes

Related Posts

“On Graph Databases: Interview with Daniel Kirstenpfad”.

“Marrying objects with graphs”: Interview with Darren Wood.

“Interview with Jonathan Ellis, project chair of Apache Cassandra”.

“The evolving market for NoSQL Databases”: Interview with James Phillips.

_________________________

Jun 13 11

Interview with Iran Hutchinson, Globals.

by Roberto V. Zicari

“The newly launched Globals initiative is not about creating a new database.
It is, however, about exposing the core multi-dimensional arrays directly to developers.” — Iran Hutchinson.

__________________________

InterSystems recently launched a new initiative: Globals.
I wanted to know more about Globals. I have therefore interviewed Iran Hutchinson, software/systems architect at InterSystems and one of the people behind the Globals project.

RVZ

Q1. InterSystems recently launched a new database product: Globals. Why a new database? What is Globals?

Iran Hutchinson: InterSystems has continually provided innovative database technology to its technology partners for over 30 years. Understanding customer needs to build rich, high-performance, and scalable applications resulted
in a database implementation with a proven track record. The core of the database technology is multi-dimensional arrays (aka globals).
The newly launched Globals initiative is not about creating a new database. It is, however, about exposing the core multi-dimensional arrays directly to developers. By closely integrating access into development technologies like Java and JavaScript, developers can take full advantage of high-performance access to our core database components.

We undertook this project to build much broader awareness of the technology that lies at the heart of all of our products. In doing so, we hope to build a thriving developer community conversant in the Globals technology, and aware of the benefits to this approach of building applications.

Q2. You classify Globals as a NoSQL database. Is this correct? What are the differences and similarities of Globals with respect to other NoSQL databases in the market?

Iran Hutchinson: While Globals can be classified as a NoSQL database, it goes beyond the definition of other NoSQL databases. As you know, there are many different offerings in NoSQL and no key comparison matrices or feature lists. Below we list some comparisons and differences, with the hope of later expanding the available information on the globalsdb.org website.

Globals differs from other NoSQL databases in a number of ways.

o It is not limited to one of the known paradigms in NoSQL (Column/Wide Column, Key-Value, Graph, Document, etc.). You can build your own paradigm on top of the core engine. This is an approach we took as we evolved Caché to support objects, XML, and relational access, to name a few.
o Globals still offers optional transactions and locking. Though efficient in implementation, we wanted to make sure that locking and transactions were at the discretion of the developer.
o MVCC is built into the database.
o Globals runs in-memory and writes data to disk.
o There is currently no sharding or replication available in Globals. We are discussing options for these features.
o Globals builds on the over 33 years of success of Caché. It is well proven. It is the exact same database technology. Globals will continue to evolve, and receive the innovations going into the core of Caché.
o Our goal with Globals is to be a very good steward of the project and technology. The Globals initiative will also start to drive contests and events to further promote adoption of the technology, as well as innovative approaches to building applications. We see this stewardship as a key differentiator, along with the underlying flexible core technology.

• Globals shares similar traits with other NoSQL databases in the market.

o It is free for development and deployment.
o The data model can optionally use a schema. We mitigate the impact of using schemas by using the same infrastructure we use to store the data. The schema information and the data are both stored in globals.
o Developers can index their data.
o The Globals Document Store (GDS) API enables a document paradigm, including a query language for data stored using the GDS API. GDS is also an example of how to build a storage paradigm on Globals. The Globals APIs are open source and available at the GitHub location.
o Globals is fast and efficient at storing data. We know performance is one of many hallmarks of NoSQL. Globals can store data at rates exceeding 100,000 objects/records per second.
o Different technology APIs are available for use with Globals. We’ve released two Java APIs, and the JavaScript API is imminent.

Q3. How do you position Globals with respect to Caché? Who should use Globals and who should use Caché?

Iran Hutchinson: Today, Globals offers multi-dimensional array storage, whereas Caché offers a much richer set of features. Caché (and the InterSystems technology it powers including Ensemble, DeepSee, HealthShare, and TrakCare) offers a core underlying object technology, native web services, distributed communication via ECP (Enterprise Cache Protocol), strategies for high availability, interactive development environment, industry standard data access (JDBC, ODBC, SQL, XML, etc.) and a host of other enterprise ready features.

Anyone can use Globals or Caché to tackle challenges with large data volumes (terabytes, petabytes, etc.), high transactions (100,000+ per second), and complex data (healthcare, financial, aerospace, etc.). However, Caché provides much of the needed out-of-box tooling and technology to get started rapidly building solutions in our core technology, as well as a variety of languages. Currently provided as Java APIs, Globals is a toolkit to build the infrastructure already provided by Caché. Use Caché if you want to get started today; use Globals if you have a keen interest in building the infrastructure of your data management system.

Q4. Globals offers multi-dimensional array storage. Can you please briefly explain this feature, and how this can be beneficial for developers?

Iran Hutchinson: It is beneficial to go here. I grabbed the following paragraphs directly from this page:

Summary Definition: A global is a persistent sparse multi-dimensional array, which consists of one or more storage elements or “nodes”. Each node is identified by a node reference (which is, essentially, its logical address). Each node consists of a name (the name of the global to which this node belongs) and zero or more subscripts.

Subscripts may be of any of the types String, int, long, or double. Subscripts of any of these types can be mixed among the nodes of the same global, at the same or different levels.

Benefits for developers: Globals does not limit developers to using objects, key-value, or any other type of storage paradigm. Developers are free to think of the optimal storage paradigm for what they are working on. With this flexibility, and the history of successful applications powered by globals, we think developers can begin building applications with confidence.
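To make the data model concrete without reproducing the Globals Java API itself (whose class and method names are not shown here and should be checked against the official documentation), the sketch below models a sparse global as a map keyed by a global name plus an ordered list of mixed-type subscripts.

import java.util.Arrays;
import java.util.TreeMap;

// Illustration of the global data model (NOT the Globals Java API):
// a node is addressed by a global name plus zero or more subscripts of
// mixed types, and only nodes that have been set exist -- the array is sparse.
public class SparseGlobalSketch {
    // Node references rendered as string keys, e.g. ^Person[id42, name]
    private final TreeMap<String, Object> nodes = new TreeMap<>();

    void set(Object value, String global, Object... subscripts) {
        nodes.put(key(global, subscripts), value);
    }

    Object get(String global, Object... subscripts) {
        return nodes.get(key(global, subscripts));
    }

    private static String key(String global, Object... subscripts) {
        return "^" + global + Arrays.asList(subscripts);
    }

    public static void main(String[] args) {
        SparseGlobalSketch db = new SparseGlobalSketch();
        // mixed String / int subscripts at different levels
        db.set("Smith", "Person", "id42", "name");
        db.set(3.14, "Person", "id42", "scores", 1);

        System.out.println(db.get("Person", "id42", "name"));      // Smith
        System.out.println(db.get("Person", "id42", "scores", 1)); // 3.14
    }
}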

Q5. Globals does not include Objects. Is it possible to use Globals if my data is made of Java objects? If yes, how?

Iran Hutchinson: Globals exposes a multi-dimensional sparse array directly to Java and other languages. While Globals itself does not include direct Java object storage technology like JPA or JDO, one can easily store and retrieve data in Java objects using the APIs documented here. Anyone can also extend Globals to support popular Java object storage and retrieval interfaces.

One of the core concepts in Globals is that it is not limited to a paradigm, like objects, but can be used in many paradigms. As an example, the new GDS (Globals Document Store) API enables developers to use the NoSQL document paradigm to store their objects in Globals. GDS is available here (more docs to come).

Q6. Is Globals open source?

Iran Hutchinson: Globals itself is not open source. However, the Globals APIs hosted at the GitHub location are open source.

Q7. Do you plan to create a Globals Community? And if yes, what will you offer to the community and what do you expect back from the community?

Iran Hutchinson: We created a community for Globals from the beginning. One of the main goals of the Globals initiative is to create a thriving community around the technology, and applications built on the technology.
We offer the community:
• Proven core data management technology
• An enthusiastic technology partner that will continue to evolve and support the project:
◦ Marketing the project globally
◦ Continual underlying technology evolution
◦ Involvement in the forums and open source technology development
◦ Participation in or hosting of events and contests around Globals
• A venue to not only express ideas, but take a direct role in bringing those ideas to life in technology
• For those who want to build a business around Globals, 30+ years of experience in supplying software developers with the technology to build successful breakthrough applications.

____________________________________

Iran Hutchinson serves as product manager and software/systems architect at InterSystems. He is one of the people behind the Globals project. He has held architecture and development positions at startups and Fortune 50 companies. He focuses on language platforms, data management technologies, distributed/cloud computing, and high-performance computing. When not on the trail talking with fellow geeks or behind the computer, you can find him eating (just look for the nearest steak house).
___________________________________

Resources

Globals.
Globals is a free database from InterSystems. Globals offers multi-dimensional storage. The first version is for Java. Software | Intermediate | English | LINK | May 2011

Globals APIs
The Globals APIs are open source and available at the GitHub location.

Related Posts

Interview with Jonathan Ellis, project chair of Apache Cassandra.

The evolving market for NoSQL Databases: Interview with James Phillips.

“Marrying objects with graphs”: Interview with Darren Wood.

“Distributed joins are hard to scale”: Interview with Dwight Merriman.

On Graph Databases: Interview with Daniel Kirstenpfad.

Interview with Rick Cattell: There is no “one size fits all” solution.
———-