
"Trends and Information on AI, Big Data, Data Science, New Data Management Technologies, and Innovation."


May 30 11

Measuring the scalability of SQL and NoSQL systems.

by Roberto V. Zicari

“Our experience from PNUTS also tells us that these systems are hard to build: performance, but also scale-out, elasticity, failure handling, replication.
You can’t afford to take any of these for granted when choosing a system.
We wanted to find a way to call these out.”

— Adam Silberstein and Raghu Ramakrishnan, Yahoo! Research.

___________________________________

A team of researchers composed of Adam Silberstein, Brian F. Cooper, Raghu Ramakrishnan, Russell Sears, and Erwin Tam, all at Yahoo! Research Silicon Valley, developed a new benchmark for Cloud Serving Systems, called YCSB.

They published their results in the paper “Benchmarking Cloud Serving Systems with YCSB” at the 1st ACM Symposium on Cloud Computing, ACM, Indianapolis, IN, USA (June 10-11, 2010).

They open-sourced the benchmark about a year ago.

The YCSB benchmark appears to be the best to date for measuring the scalability of SQL and NoSQL systems.

Together with my colleague and ODBMS.org expert Dr. Rick Cattell, I interviewed Adam Silberstein and Raghu Ramakrishnan.

Hope you’ll enjoy the interview.

Rick Cattell and Roberto V. Zicari
________________________________

Q1. What motivated you to write a new database benchmark?

Silberstein, Ramakrishnan: Over the last few years, we observed an explosive rise in the number of new large-scale distributed data management systems: BigTable, Dynamo, HBase, Cassandra, MongoDB, Voldemort, etc. And of course our own PNUTS system [2].
This field expanded so quickly that there is no agreement on what to call these systems: NoSQL systems, cloud databases, key-value stores, etc. (Some of these terms (e.g., cloud databases) are even applied to Map-Reduce systems such as Hadoop, which are qualitatively different, leading to terminological confusion.)
This trend created a lot of excitement throughout the community of web application developers and among data management developers and researchers. But it also created a lot of debate. Which systems are most stable and mature? Which have the best performance? Which is best for my use case? We saw these questions being asked in the community, but also within Yahoo!.

Our experience with PNUTS tells us there are many design decisions to make when building one of these systems, and those decisions have a huge impact on how the system performs for different workloads (e.g., read-heavy workloads vs. write-heavy workloads), how it scales, how it handles failures, ease of operation and tuning, etc. We wanted to build a benchmark that lets us expose the nuances of each of these decisions and their implementations.

Our initial experiences with systems in this space made it very clear that tuning the systems is a challenging problem requiring expert advice from the systems’ developers. Out-of-the-box, systems might be tuned for small-scale deployment, or running batch jobs rather than serving. By creating a serving benchmark, we could create a common “language” for describing the type of workloads web application developers care about, which in turn might bring about tuning best practices for these workloads.


Q2. In your paper [1] you write “The main motivation for developing new cloud serving systems is the difficulty in providing scale out and elasticity using traditional database systems”. Can you please explain this statement? What about databases such as object-oriented and XML-based systems?

Silberstein, Ramakrishnan: Database systems have addressed scale-out, and commercial systems (e.g., high-end systems from the major vendors) do scale out well. However, they typically do not scale to web-scale workloads using commodity servers, and are not designed to be operated in configurations of 1000s of servers, and to support high-availability and geo-replication in these settings.
For example, how does ACID carry over when you scale the traditional systems across data centers? Further, they do not support elastic expansion of a running system—when adding servers, how do you offload data to them? How do you replicate for fault tolerance? The new systems we see are making entirely new design tradeoffs. For example, they may sacrifice much of ACID (e.g., its strong consistency model), but make it easy and cost-effective to scale to a large number of servers with high availability and elastic expandability.
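To make the elastic-expansion point concrete: one widely used technique in Dynamo-style systems for limiting how much data must move when servers are added is consistent hashing. The sketch below is a generic, illustrative toy in Java, not the partitioning code of PNUTS, Cassandra, or any other system discussed here; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.TreeMap;

// Toy consistent-hash ring: servers and keys are hashed onto the same 64-bit
// ring, and a key belongs to the first server at or after its position.
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // Each physical server takes several positions ("virtual nodes") so load
    // spreads evenly and only about 1/N of the keys move when a server joins.
    public void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    public void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    public String serverFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers in the ring");
        }
        Long pos = ring.ceilingKey(hash(key));                  // first server clockwise...
        return ring.get(pos != null ? pos : ring.firstKey());   // ...wrapping around if needed
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xffL);                  // first 8 digest bytes as a long
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With a ring like this, adding a server reassigns only the keys that now hash closest to the new server’s positions, roughly 1/N of the data, rather than forcing a wholesale re-shard.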

These considerations are largely orthogonal to the underlying data model and programming paradigm (i.e., XML or object-oriented systems), though some of the newer systems have also innovated in the areas of flexible schema evolution and nested columns.

Q3. What is the difference between your YCSB benchmark and well known benchmarks such as TPC-C?

Silberstein, Ramakrishnan: At a high level, we have a lot in common with TPC-C and other OLTP benchmarks.
We care about query latency and overall system throughput. When we take a closer look, however, the queries are very different. TPC-C contains several diverse types of queries meant to mimic a company warehouse environment. Some queries execute transactions over multiple tables; some are more heavyweight than others.
In contrast, the web applications we are benchmarking tend to run a huge number of extremely simple queries. Consider a table where each record holds a user’s profile information. Every query touches only a single record, likely either reading it, or reading+writing it. We do include support for skewed workloads; some tables may have active sets accessed much more than others. But we have focused on simple queries, since that is what we see in practice.
Another important difference is that we have emphasized the ease of creating a new suite of benchmarks using the YCSB framework, rather than the particular workload that we have defined, because we feel it is more important to have a uniform approach to defining new benchmarks in this space, given its nascent stage.

Q4. In your paper [1] you focus your work on two properties of “Cloud” systems: “elasticity” and “performance”, or as you write “simplified application development and deployment”. Can you please explain what you mean by these two properties?

Silberstein, Ramakrishnan: Performance refers to the usual metrics of latency and throughput, with the ability to scale out by adding capacity. Elasticity refers to the ability to add capacity to a running deployment “on-demand”, without manual intervention (e.g., to re-shard existing data across new servers).

Q5. Cloud systems differ for example in the data models they support (e.g., column-group oriented, simple hashtable based, document based). How do you compare systems with such different data models?

Silberstein, Ramakrishnan: In our initial work we focused on the commonalities among the systems we benchmarked (PNUTS, Cassandra, HBase). We ran the systems as row stores holding ordered data.
The YCSB “core workload” accesses entire records at a time, and executes range queries over small portions of the table.
Though Cassandra and HBase both support column-groups, we did not exercise this feature.

It is of course possible to design workloads that test features like column groups, and we encourage users to do so—as we noted earlier, one of our goals was to make it easy to add benchmarks using the YCSB framework. One way to compare systems with different feature sets is to disqualify systems lacking the desired features.

Rather than disqualify a system because it lacks column groups, for example, it may make sense to compare the system as-is against others with column groups.
It is then possible to measure the cost of reading all columns, even when only one is needed. On the other hand, if a workload must execute range scans over a contiguous set of keys, there is no reasonable way to run that workload against a hash-based store.

Q6. You write that the focus of your work is on “serving” systems, rather than “batch” or “analytical”. Are there any particular reasons for doing that? What are the challenges in defining your benchmark?

Silberstein, Ramakrishnan: This is analogous to having TPC-C and TPC-H benchmarks in the conventional database setting.
Many prominent systems in cloud computing either target serving workloads or target analytical/batch workloads. Hadoop is the most well-known example of a batch system. Some systems, such as HBase, support both. We believe serving and batch are extremely important, but very different. Serving needs its own treatment, and we wanted to very clearly call that out in our benchmark. It is easy to confuse the two when looking at benchmark results. In both settings we talk a lot about throughput, e.g., the number of records read or written per second.
But in batch, where the workload reads or writes an entire table, those records are likely stored sequentially on disk. In serving, the workload does many random reads and writes, so those records are likely scattered on disk. Disk I/O patterns have a huge impact on throughput, and so we must understand whether a workload is batch or serving before we can appreciate the benchmark results.
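A rough back-of-the-envelope calculation shows how large that gap can be. The figures below are assumed ballpark numbers for a single commodity disk (on the order of 100 random I/Os per second versus roughly 100 MB/s of sequential transfer); they are illustrative and not taken from the YCSB paper.

```java
// Back-of-the-envelope comparison of batch (sequential) versus serving
// (random) throughput for a single disk. The figures are assumed ballpark
// numbers for a commodity drive, not measurements from the YCSB paper.
public class DiskThroughputSketch {
    public static void main(String[] args) {
        int recordBytes = 1000;              // ~1 KB records
        double seqBytesPerSec = 100e6;       // assumed ~100 MB/s sequential transfer
        double randomSeeksPerSec = 100;      // assumed ~100 random I/Os per second

        double batchRecordsPerSec = seqBytesPerSec / recordBytes;  // full-table scan
        double servingRecordsPerSec = randomSeeksPerSec;           // one seek per record

        System.out.printf("batch   (sequential): ~%.0f records/sec per disk%n", batchRecordsPerSec);
        System.out.printf("serving (random)    : ~%.0f records/sec per disk%n", servingRecordsPerSec);
    }
}
```

Under those assumptions the two access patterns differ by roughly three orders of magnitude per disk, which is why a raw operations-per-second figure means little until you know whether it describes a batch scan or random serving traffic.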


Q7. Your YCSB benchmark consists of two tiers: Tier-1 Performance and Tier-2 Scaling. Can you please explain what these two Tiers are and what you expect to measure with them?

Silberstein, Ramakrishnan: The performance tier is the bread-and-butter of our initial YCSB work; for a fixed system size, we want to see how much load we can put on each system while still getting low latency.

This is one of the key questions, if not the key question, an application developer asks when choosing a data system: how much of my expected workload does the system support per-server, and so how many servers am I going to have to buy? No matter what system we benchmark, the performance results have a similar feel. We plot throughput on the x-axis and (average or 95th percentile) latency on the y-axis. As we push throughput higher, latency grows gradually. Eventually, as the system reaches saturation, latency jumps dramatically, and then we can push throughput no higher. It is easy to compare these graphs across systems and get a feel for what kind of load each can support. It is also worthwhile to verify that while systems slow down at the saturation point, they nonetheless remain stable.
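For readers who want to see what measuring latency at a target throughput looks like operationally, here is a deliberately simplified sketch of one point on such a curve. It is not YCSB’s client code: the names are invented for illustration, and a real benchmark client uses many threads and a sleep-based throttle rather than a busy-wait.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified sketch of the measurement behind one point on a
// throughput-versus-latency curve: issue operations at a fixed target rate,
// record each operation's latency, then report the average and 95th percentile.
public class LatencyAtTargetThroughput {

    public static void measure(Runnable operation, int targetOpsPerSec, int totalOps) {
        long intervalNanos = 1_000_000_000L / targetOpsPerSec;
        List<Long> latenciesMicros = new ArrayList<>(totalOps);

        long next = System.nanoTime();
        for (int i = 0; i < totalOps; i++) {
            while (System.nanoTime() < next) {
                // busy-wait until the next scheduled start; a real client would sleep
            }
            long start = System.nanoTime();
            operation.run();                                   // one read or read+write
            latenciesMicros.add((System.nanoTime() - start) / 1_000);
            next += intervalNanos;
        }

        Collections.sort(latenciesMicros);
        long sum = 0;
        for (long l : latenciesMicros) sum += l;
        long avg = sum / totalOps;
        long p95 = latenciesMicros.get(Math.max(0, (int) Math.ceil(totalOps * 0.95) - 1));
        System.out.printf("target=%d ops/sec  avg=%d us  95th=%d us%n", targetOpsPerSec, avg, p95);
    }
}
```

Sweeping the target rate upward and plotting the reported latencies against it yields exactly the throughput-versus-latency curve described above.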

The big selling point of cloud serving systems is their ability to scale upward by adding more servers. If the system offers low latency at small scale, as we proportionally increase workload size and the number of servers, latency should remain constant. If this is not the case, this is a hint the system might have bottlenecks that surface at scale.

In our scaling tier we also make the point of distinguishing scalability and elasticity. While good scalability means the system runs well over large workloads if pre-configured with the appropriate number of servers, elasticity means the system can actually grow from small to large scale by adding servers while remaining online.

In our initial work we observed systems doing well on scalability, but having erratic behavior when we added capacity online and increased workloads (i.e., they were often weak on elasticity).

Q8. How are the workloads defined in your YCSB benchmark? How do they differ from traditional TPC-C workloads?

Silberstein, Ramakrishnan: Our workloads consist of much simpler queries than TPC-C. There are just a handful of knobs for the user to adjust. The user may specify size settings, such as existing database size and record size, and distribution settings, such as the relative proportions of insert, read, update, delete, and scan operations and the distribution of accesses over the key space. These simple settings let us characterize many important workloads we find at Yahoo!, and we provide a few simple ones as part of our Core Workload.

Of course, users can develop much more complex workloads that look more like TPC-C, and our code is designed to make that easy.
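As an illustration of how few knobs are involved, the sketch below builds an update-heavy workload definition programmatically. The property names follow those used by YCSB’s CoreWorkload in the open-source distribution (recordcount, readproportion, requestdistribution, and so on), but the exact names and values shown here should be treated as an approximation and checked against the YCSB documentation; in practice they normally live in a plain properties file passed to the YCSB client.

```java
import java.util.Properties;

// Sketch of an "update heavy" workload definition in the spirit of YCSB's
// CoreWorkload. Treat the exact property names and values as illustrative
// rather than authoritative.
public class WorkloadSketch {

    public static Properties updateHeavyWorkload() {
        Properties p = new Properties();
        p.setProperty("recordcount", "100000000");         // records preloaded into the store
        p.setProperty("operationcount", "10000000");       // operations to run in the test
        p.setProperty("fieldcount", "10");                 // fields per record
        p.setProperty("fieldlength", "100");               // bytes per field (~1 KB records)
        p.setProperty("readproportion", "0.5");            // 50% single-record reads
        p.setProperty("updateproportion", "0.5");          // 50% single-record updates
        p.setProperty("scanproportion", "0.0");            // no range scans
        p.setProperty("insertproportion", "0.0");          // no inserts during the run
        p.setProperty("requestdistribution", "zipfian");   // skewed access over the key space
        return p;
    }
}
```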

Q9. You used your benchmark to measure four different systems: Cassandra, HBase, your cloud system PNUTS and your implementation of a sharded MySQL. Why did you choose these four systems? What other systems would you have liked to include, or would you like to see someone run YCSB on?

Silberstein, Ramakrishnan: When we started our work there were certainly more than four systems to choose from, and there are even more now. We limited ourselves to just a handful to avoid spreading ourselves too thin. This was a wise decision, since figuring out how to run and tune each system for serving was a full-time job (providing even more motivation to produce the benchmark). Ultimately, we made our choice of systems based on interest at Yahoo! and our desire to compare a collection of systems that made fundamentally different design decisions.
We built PNUTS here, so that is an obvious choice, and many Yahoo! developers are curious about the features and performance of Cassandra and HBase. PNUTS uses a buffer page based architecture, while Cassandra and HBase (itself a clone of BigTable) are based on differential-file architectures.

We are happy with our initial choice of systems. We got a lot of interest in our work and have gained a lot of users and contributors. Those contributors have done a variety of work, including improvements to clients of the systems we initially benchmarked and adding new clients. As of this writing, we have clients for Voldemort, MongoDB, and JDBC.

Q10. In your paper you present the main results of comparing these four systems. Can you please summarize the main results you obtained and the lessons learned?

Silberstein, Ramakrishnan: Our impact comes not from the actual numbers we collected; this was just a snapshot in time for each system. Our key contribution was creating an apples-to-apples environment that made ongoing comparisons feasible. We encouraged some competition between the systems and produced a tool for the systems to compare themselves against in the future.

That said, we had some interesting results. We know that the systems we tested made different design decisions to optimize reads or writes, and we saw these decisions reflected in the results. For a 50% read 50% write workload, Cassandra achieved the highest throughput. For a 95% read workload, PNUTS tied Cassandra on throughput while having better latency.

All systems we tested advertised scalability and elasticity. The systems all performed well at scale, but we noticed hiccups when growing elastically.
Cassandra took an extremely long time to integrate a new server, and had erratic latencies during that time. HBase is extremely lazy about integrating new servers, requiring background compactions to move partitions to them.

This was over a year ago, and these systems may well have overcome these issues by now.

-Actual numbers not the important thing here…this was just a snapshot in time of these systems. The key thing is that we created an apples-to-apples setup that made comparisons possible, created a bit of competition, and created something for the systems to evaluate themselves against.

-Result #1. We knew the systems made fundamental decisions to optimize writes or optimize reads.
It was nice to see these decisions show up in the results.

Example: in a 50/50 workload, Cassandra was best on throughput. In a 95% read workload, PNUTS caught up and had the best latencies.

-Result #2. The systems may advertise scalability and elasticity, but this is clearly a place where the implementations needed more work. Ref. elasticity experiment. Ref. HBase with only 1-2 nodes.

-Lesson. We are in the early stages. The systems are moving fast enough that there is no clear guidance on how to tune each system for particular workloads.

So figuring out how to do justice to each system is tricky.

Q11. Please briefly explain the experiment set up.

Silberstein, Ramakrishnan: We simply performed a bake-off. We had a collection of server-class machines, and installed and ran each cloud serving system on them. We spent a huge amount of time configuring each system to perform at its best on our hardware. We are in the early days of cloud serving systems, and there are no clear best practices for how to tune these systems to our workloads. We also took care to allocate memory fairly across each system.
For example, HBase runs on top of HDFS, and both HBase and HDFS must be given memory, and the sum must equal what we grant Cassandra.
Finally, we made sure to load each system with enough data to avoid fitting all records in memory; all of the systems we benchmarked perform dramatically differently when running solely in memory vs. not.
We were more interested in performance in the latter case, so we ran in that mode.

Q12. You plan to extend the benchmark to include two additional properties: availability and replication. Can you please explain how you intend to do so?

Silberstein, Ramakrishnan: These are tricky, but important, tiers to build. We want to measure a variety of things. What is the added cost of replication? What happens to availability under different failure scenarios? Can the records be read and/or written, and is there a cost penalty when doing so? What is the record consistency model during failures, and can we quantify the differences between different failure models?

These tiers expose areas where system designers can make very different design decisions. They might ensure write durability by replicating writes on-disk or in-memory only. They might replicate intra-data center or cross-data center. During network partitions, they might prioritize consistency and make data read-only, or might prioritize availability, and use eventual consistency to make record replicas converge later.

Q13. Your benchmark appears to be the best to date for measuring the scalability of SQL and NoSQL systems; it would be great if others would run it on their systems. Do you have ideas to encourage that to happen?

Silberstein, Ramakrishnan: We think we have been fairly successful here already. We open-sourced the benchmark about a year ago. It is available here.

Before we released the benchmark, we shared drafts of our results with the developer/user lists of each benchmarked system several times, and each time got a lot of interest and questions: what did our machines look like, what tuning parameters did we use, and so on. This interest is a great sign. First, as you point out, we’re filling an unmet need.
Second, our audience recognizes our Yahoo! workloads are important and that we as Yahoo! researchers are doing a rigorous comparison.

By the time we released the benchmark we had many people waiting to use it.

Today, we have many users and contributors. We see comments on system mailing lists like “I am running YCSB Workload A and seeing XXX ops/sec and that seems too low. How should I tune my setup?”

Q14. A simpler benchmark might be easier for others to reproduce, getting more results. Is there a simple subset of your benchmark you’d suggest that captures most of the important elements?

Silberstein, Ramakrishnan: The Core Benchmark is already very simple. It uses synthetic data and provides just a few knobs for users to adjust. We and other users have done more complex testing, such as feeding actual production logs into YCSB, but that is not part of Core. Within Core, we should mention there are three very simple workloads that can execute on the simplest hash-based key-value store: Workloads A, B, and C execute only inserts, reads, and updates. They do not execute range queries.

Q15. Open source NoSQL projects may not have half a dozen dedicated servers available to reproduce your results on their systems. Do you have suggestions there? Is it possible to run an accurate benchmark on a leased cloud platform?

Silberstein, Ramakrishnan: Funny you should ask this, because we felt the half dozen servers we ran on were not enough. We went to great lengths to borrow 50 homogeneous servers for a short time to run one large-scale experiment that showed YCSB itself scales.
Running YCSB on a leased platform like Amazon EC2 is straightforward.
The big question is determining if the results are accurate and repeatable. Certainly if the machines are VMs on non-dedicated servers, that might not be the case. We have heard if you request the largest size VMs, EC2 might allocate dedicated servers. One of the benefits of working at a large Internet company is that we have not had to try this ourselves!

Q16. Do you think that flash memory may “tip the balance” in the best data platforms? It cannot simply be treated as a disk, nor as RAM, because it has fundamentally different read/write costs.

Silberstein, Ramakrishnan: As SSD hardware has matured, we have noticed two trends. First, it is quite easy to build a machine with enough SSDs to run into CPU, bus and controller bottlenecks. This has led to rewrites of most storage stacks, cloud based or not.

Second, SSDs use extremely sophisticated log-structured techniques to mask the cost of writes. Some of these techniques, such as data deduplication and compression, only help certain workloads. The big question in this space is how higher-level database indexes will interact with the lower-level log-structured system.

It could be that future hardware devices will mask the cost of random writes so well that higher-level log structured techniques will become redundant. On the other hand, higher-level log structured approaches have more computational resources at their disposal, and also have more information about the application. These advantages could mean that they will always be able to significantly improve upon hardware-based approaches.
______________________________________________________

Adam Silberstein, Yahoo! Research.
Research Area: Web Information Management.
My research interests are in the general area of large-scale data management, with a current focus on large distributed data systems. Specifically, this includes both online transaction processing and analytics, and bridging the gap between them,
as well as techniques for generating user feeds in social networks. I joined Yahoo! in August 2007, after finishing my Ph.D. at Duke University in February 2007.

Raghu Ramakrishnan, Yahoo! Research.
Raghu Ramakrishnan is Chief Scientist for Search and Cloud Platforms at Yahoo!, and is a Yahoo! Fellow, heading the Web Information Management research group. His work in database systems, with a focus on data mining, query optimization, and web-scale data management, has influenced query optimization in commercial database systems and the design of window functions in SQL:1999. His paper on the Birch clustering algorithm received the SIGMOD 10-Year Test-of-Time award, and he
has written the widely-used text “Database Management Systems” (with Johannes Gehrke).
His current research interests are in cloud computing, content optimization, and the development of a “web of concepts” that indexes all information on the web in semantically rich terms. Ramakrishnan has received several awards, including the ACM SIGKDD Innovations Award, the ACM SIGMOD Contributions Award, a Distinguished Alumnus Award from IIT Madras, a Packard Foundation Fellowship in Science and Engineering, and an NSF Presidential Young Investigator Award. He is a Fellow of the ACM and IEEE.

Ramakrishnan is on the Board of Directors of ACM SIGKDD, and is a past Chair of ACM SIGMOD and member of the Board of Trustees of the VLDB Endowment. He was Professor of Computer Sciences at the University of Wisconsin-Madison, and was founder and CTO of QUIQ, a company that pioneered crowd-sourcing, specifically question-answering communities, powering Ask Jeeves’ AnswerPoint as well as customer-support for companies such as Compaq.

____________________________________________
Dr. R. G. G. “Rick” Cattell is an independent consultant in database systems and engineering management. He previously worked as a Distinguished Engineer at Sun Microsystems, most recently on open source database systems and horizontal database scaling. Dr. Cattell served for 20+ years at Sun Microsystems in management and senior technical roles,
and for 10+ years in research at Xerox PARC and at Carnegie-Mellon University.
Dr. Cattell is best known for his contributions to middleware and database systems, including database scalability, enterprise Java, object/relational mapping, object-oriented databases, and database interfaces. He is the author of several dozen papers and six books. He instigated Java DB and Java 2 Enterprise Edition, and was a contributor to a number of the Enterprise Java APIs and products. He previously led development of the Cedar DBMS at Xerox PARC, the Sun Simplify database GUI, and SunSoft’s ORB-database integration. He was a founder of SQL Access (a predecessor to ODBC), the founder and chair of the Object Data Management Group (ODMG), the co-creator of JDBC, the author of the world’s first monograph on object/relational and object databases, and a recipient of the ACM Outstanding PhD Dissertation Award.

References:
[1] Benchmarking Cloud Serving Systems with YCSB. (.pdf)
Adam Silberstein, Brian F. Cooper, Raghu Ramakrishnan, Russell Sears, Erwin Tam, Yahoo! Research.
1st ACM Symposium on Cloud Computing, ACM, Indianapolis, IN, USA (June 10-11, 2010)

[2] PNUTS: Yahoo!’s Hosted Data Serving Platform (.pdf)
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni, Yahoo! Research.
The paper describes PNUTS/Sherpa, Yahoo’s record-oriented cloud database.

Open Source Code

Yahoo! Cloud Serving Benchmark (brianfrankcooper / YCSB) Downloads

Related Posts

Benchmarking ORM tools and Object Databases.

Interview with Jonathan Ellis, project chair of Apache Cassandra.

– Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera

The evolving market for NoSQL Databases: Interview with James Phillips.

For further reading

Scalable Datastores, by Rick Cattell

_____________________________________________________________________________________________

May 16 11

Interview with Jonathan Ellis, project chair of Apache Cassandra.

by Roberto V. Zicari

“You’re going to see these databases attempting to make things easy that today are possible but difficult.” –Jonathan Ellis.

This interview is part of my series of interviews on the evolving market for Data Management Platforms. This time, I had the pleasure to interview Jonathan Ellis, project chair of Apache Cassandra.

RVZ: In my understanding of how the market of Data Management Platforms is evolving, I have identified three phases:

Phase I– New Proprietary data platforms developed: Amazon (Dynamo), Google (BigTable). Both systems remained proprietary and are in use by Amazon and Google.

Phase II- The advent of Open Source Developments: Apache projects such as Cassandra, Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. Multitude of new data platforms emerged.

Phase III– Evolving Analytical Data Platforms. Hadoop for analytics. Companies such as Cloudera, but also IBM’s BigInsights, are in this space.

Q1. Why did Amazon and Google have to develop their own database systems? Why didn’t they use/“adjust” existing database systems?

Jonathan Ellis: Google and Amazon were really breaking new ground when they were working on Bigtable and Dynamo. The thing you have to remember is that the problem they were trying to solve was high-volume transactional systems: while companies like Teradata have been building large-scale databases for some time, these were analytical databases and not designed for high query volumes with low latency.

The state-of-the-art at the time for transactional systems was horizontal and vertical partitioning customized for a given application, built on a traditional database like Oracle. These systems were not application-transparent, meaning repartitioning was a major undertaking, nor were they reusable from one application to another.

Q2. I defined Phase II as the advent of Open Source Developments with Apache projects such as Cassandra, Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. Multitude of new data platforms emerged. Any comment on this?

Jonathan Ellis: Note that Hadoop is a different kind of animal here: Hadoop *is* an analytical system, not a real-time or transaction-oriented one.

Q3. How was it possible that Amazon and Google’s proprietary systems were used as input for Open Source Projects?

Jonathan Ellis: Google and Amazon both published whitepapers on Bigtable and Dynamo which were very influential for the open source systems that started appearing soon afterward. Since then, the open-source systems have of course continued to evolve; today Cassandra—which began as a fusion of Bigtable and Dynamo concepts—includes features described by neither of its ancestors, such as distributed counters and column indexes.

Q4. Why did Facebook and LinkedIn develop Open Source data platforms and not proprietary software?

Jonathan Ellis: Probably a combination of two reasons:
– They recognized that with the kind of head start that Google and Amazon had, it would be difficult to achieve technical parity without leveraging the efforts of a larger development community.

– They don’t see infrastructure in and of itself as the place where they gain their competitive advantage. You can see more evidence of this in Facebook’s recent announcement that they were opening up the plans for their newest data center. So it goes beyond just source code.

Q5. Not everyone has data and scalability requirements like those of Amazon and Google. Who currently needs such new data management platforms, and why?

Jonathan Ellis: Again, I think the piece that’s new here is the emphasis on high-volume transaction processing. Ten years ago you didn’t see this kind of urgency around transaction processing–some large web sites like eBay were concerned already, but it feels like there’s been a kind of Moore’s law of data growth that’s been catching more and more companies, both on the web and off. Today DataStax has customers like startups Backupify and Inkling, as well as companies you might expect to see like Netflix.

Q6. Is there a common denominator between the business models built around such open source projects?

Jonathan Ellis: At the most basic level, there are only so many options to choose from.
You have services and support, and you have proprietary products built on top of or around the open source core. Everyone is doing some combination of those.

Phase III– Evolving Analytical Data Platforms
Q7. Is Business Intelligence becoming more like Science for profit?

Jonathan Ellis: If you mean that a lot of teams are now trying to commercialize technologies that were originally developed without that kind of focus, then yes, we’ve definitely seen a lot of that the last couple years.

Q8. Who are the main actors in the Platform Ecosystem?

Jonathan Ellis: On the real-time side, Cassandra’s strongest competitors are probably Riak and HBase. Riak is backed by Basho, and I believe Cloudera supports HBase although it’s not their focus.

For analytics, everyone is standardizing on Hadoop, and there are a number of companies pushing that.

DataStax is unique here in that our just-released Brisk project gives you the best of both worlds: a Hadoop powered by Cassandra so you never have to do an ETL process before running an analytical query against your real-time data, while at the same time keeping those workloads separate so that they don’t interfere with each other.

Q9. What role will RDBMS play in the future? What about Object Databases, do they have a role to play?

Jonathan Ellis: Relational databases will continue to be the main choice when you need ACID semantics and you have a relatively small data or query volume that you care about. Many of our customers continue to use a relational database in conjunction with Cassandra for things like user registration.

To be honest, I don’t see object databases being able to ride the NoSQL wave out of their niche. The popularity of NoSQL options isn’t from their rejection of the SQL language per se, but because that was part of what they left behind when they added features that are starting to matter even more than query language, primarily scalability.

Q10. Looking at three elements: Data, Platform, Analysis, what are the main research challenges ahead? And what are the main business challenges ahead?

Jonathan Ellis: I see the technical side as more engineering than R&D. You’re going to see these databases attempting to make things easy that today are possible but difficult. Cassandra’s column indexes are an example of this–you could use Cassandra to look up rows by column values in the past, but you had to maintain those indexes manually, in application code. Today Cassandra can automate that for you.
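To illustrate the pattern Ellis describes, the sketch below reduces “maintain the index manually in application code” to plain in-memory maps. It deliberately uses no Cassandra API, and the names are invented for the example; the point is simply that every write must also touch the hand-built index, which is the bookkeeping that built-in column indexes now automate.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// The "maintain the index yourself" pattern, reduced to in-memory maps for
// illustration: every write to the primary rows must also update an inverted
// index from column value to row keys, and the application is responsible for
// keeping the two in step.
public class ManualSecondaryIndex {
    private final Map<String, Map<String, String>> rows = new HashMap<>();  // rowKey -> columns
    private final Map<String, Set<String>> cityIndex = new HashMap<>();     // city   -> rowKeys

    public void putUser(String rowKey, String name, String city) {
        Map<String, String> row = rows.computeIfAbsent(rowKey, k -> new HashMap<>());
        String oldCity = row.get("city");
        if (oldCity != null && cityIndex.containsKey(oldCity)) {
            cityIndex.get(oldCity).remove(rowKey);           // un-index the previous value
        }
        row.put("name", name);
        row.put("city", city);
        cityIndex.computeIfAbsent(city, c -> new HashSet<>()).add(rowKey);  // index the new value
    }

    public Set<String> usersInCity(String city) {
        // A lookup by column value goes through the hand-maintained index.
        return cityIndex.getOrDefault(city, Collections.emptySet());
    }
}
```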

This ties into the business side as well: the challenge for everyone is to move beyond the early adopter market and go mainstream. Ease of use will be a big part of that.

Q11. What are the main future developments? Anything you wish to add?

Jonathan Ellis: This is an exciting space to work in right now because the more we build, the more we can see that we’ve barely scratched the surface so far. The feature we’re working on right now that I’m personally most excited about is predicate push-down for Brisk: allowing Hive, a data warehouse system for Hadoop, to take advantage of Cassandra column indexes.

Readers curious about Brisk–which is fully open-source–can learn more here.

Thanks for the questions!

Jonathan Ellis

__________________________________
Jonathan Ellis.
Jonathan Ellis is CTO and co-founder of DataStax (formerly Riptano), the commercial leader in products and support for Apache Cassandra. Prior to DataStax, Jonathan built a multi-petabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy. Jonathan is project chair of Apache Cassandra.
__________________________________

Related Posts

1. Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

2. The evolving market for NoSQL Databases: Interview with James Phillips.

____________________

Apr 13 11

Objects in Space vs. Friends in Facebook.

by Roberto V. Zicari

“Data is everywhere, never be at a single location. Not scalable, not maintainable.”–Alex Szalay

I recently reported about the Gaia mission which is considered by the experts “the biggest data processing challenge to date in astronomy“.

Alex Szalay- who knows about data and astronomy, having worked from 1992 till 2008 with the Sloan Digital Sky Survey together with Jim Gray – wrote back in 2004:
“Astronomy is a good example of the data avalanche. It is becoming a data-rich science. The computational-Astronomers are riding the Moore’s Law curve, producing larger and larger datasets each year.” [Gray,Szalay 2004]

Gray and Szalay observed: “If you are reading this you are probably a “database person”, and have wondered why our “stuff” is widely used to manage information in commerce and government but seems to not be used by our colleagues in the sciences. In particular our physics, chemistry, biology, geology, and oceanography colleagues often tell us: “I tried to use databases in my project, but they were just to [slow | hard-to-use |expensive | complex ]. So, I use files.” Indeed, even our computer science colleagues typically manage their experimental data without using database tools. What’s wrong with our database tools? What are we doing wrong? “ [Gray,Szalay 2004].

Six years later, Szalay, in his presentation “Extreme Data-Intensive Computing”, presented what he calls “Jim Gray’s Law of Data Engineering”:
“1. Scientific computing is revolving around data.
2. Need scale-out solution for analysis.”
[Szalay2010].

He also says about Scientific Data Analysis, or as he calls it, DISC (Data-Intensive Scientific Computing): “Data is everywhere, never be at a single location. Not scalable, not maintainable.” [Szalay2010]

I would like to make three observations:

i. Great thinkers do anticipate the future. They “feel” it. Better said, they “see” more clearly how things really are.
Consider for example what the philosopher Friedrich Nietzsche wrote in his book “Thus Spoke Zarathustra”: “The middle is everywhere.” Confirmed 128 years later by “Data is everywhere”….

ii. “Astronomy is a good example of the data avalanche”: the Universe is beyond our comprehension, which means, I believe, that ultimately we will figure out that indeed “data is not scalable, and not maintainable.”

iii. I now dare to twist the quote: “If you are reading this you are probably a “database person”, and have wondered why our “stuff” is widely used to manage information in commerce and government but seems to not be used by our colleagues at Facebook or Google”….

I have asked Professor Alex Szalay for his opinion.

Alexander Szalay is a professor in the Department of Physics and Astronomy of the Johns Hopkins University. His research interests are theoretical astrophysics and galaxy formation.

RVZ

Alex Szalay: This is very flattering… and I agree. But to be fair, the Facebook guys are using databases, first MySQL, and now Oracle in the middle of their whole system.

I have recently heard a talk by Jeff Hammerbacher, who built the original infrastructure for Facebook. Now he quit, and formed Cloudera. He did explicitly say that in the middle there will always be SQL, but people use Hadoop/MR for the ETL layer… and R and other tools for analytics and reporting.

As far as I can see Google is also gently moving towards not quite a database yet, but Jeff Dean is building Bloom filters and other indexes into BigTable. So even if it is NoSQL, some of their stuff starts to resemble a database….
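For readers unfamiliar with the structure Szalay mentions, a Bloom filter is a compact, probabilistic membership test that can answer “definitely not present” without reading the underlying data. The toy sketch below is a generic illustration, not BigTable’s implementation.

```java
import java.util.BitSet;

// Toy Bloom filter: "no" answers are always correct; "yes" answers may
// occasionally be false positives.
public class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    public void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;                 // at least one bit unset: definitely absent
            }
        }
        return true;                          // present, or a false positive
    }

    // Derive the i-th bit position from the key; a real implementation would
    // use stronger, independent hash functions.
    private int position(String key, int i) {
        int h = key.hashCode() ^ (i * 0x9E3779B9);
        return Math.floorMod(h, numBits);
    }
}
```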

So I think there is a general agreement that indices are useful, but for large-scale data analytics we do not need full ACID; transactions are much more a burden than an advantage. And there is a lot of religion there, of course.

I would put it in such a way, that there is a phase transition coming, and there is an increasing diversification, where there were only three DB vendors 5 years ago, now there are many options and a broad spectrum of really interesting niche tools. In a healthy ecosystem everything is a 1/f power law, and we will see a much bigger diversity. And this is great for academic research. “In every crisis there is an opportunity” — we again have a chance to do something significant in academia.

RVZ: The National Science Foundation has awarded a $2M grant to you and your team of co-investigators from across many scientific disciplines, to build a 5.2 Petabyte Data-Scope, a new instrument targeted at analyzing the huge data sets emerging in almost all areas of science. The instrument will be a special data-supercomputer, the largest of its kind in the academic world.

What is the project about?

Alex Szalay: We feel that the Data-Scope is not a traditional multi-user computing cluster, but a new kind of instrument, that enables people to do science with datasets ranging between 100TB and 1000TB.
This is simply not possible today. The task is much more than just throwing the necessary storage together.
It requires a holistic approach: the data must first be brought to the instrument, then staged, and then moved to the computing nodes that have both enough compute power and enough storage bandwidth (450 GBps) to perform the typical analyses, and then the (complex) analyses must be performed.

RVZ: Could you please explain what are the main challenges that this project poses?

Alex Szalay: It would be quite difficult, if not outright impossible to develop a new instrument with so many cutting-edge features without adequately considering all aspects of the system, beyond the hardware. We need to write at least a barebones set of system management tools (beyond the basic operating system etc), and we need to provide help and support for the teams who are willing to be at the “bleeding-edge” to be able to solve their big data problems today, rather than wait another 5 years, when such instruments become more common.
This is why we feel that our proposal reflects a realistic mix of hardware and personnel, which leads to a high probability of success.

The instrument will be open for scientists beyond JHU. There was an unbelievable amount of interest just at JHU in such an instrument, since analyzing such data sets is beyond the capability of any group on campus. There were 20 groups with data sets totaling over 2.8PB just within JHU who would use the facility immediately if it was available. We expect to go on-line at the end of this summer.

References

[Szalay2010]
Extreme Data-Intensive Computing (.pdf)
Alex Szalay, The Johns Hopkins University, 2010.

[Gray,Szalay 2004]
Where the Rubber Meets the Sky: Bridging the Gap between Databases and Science.
Jim Gray,Microsoft Research and Alex Szalay,Johns Hopkins University.
IEEE Data Engineering Bulletin and Technical Report, MSR-TR-2004-110, Microsoft Research, 2004

Friedrich Nietzsche,
Thus Spoke Zarathustra: a Book for Everyone and No-one. (Also Sprach Zarathustra: Ein Buch für Alle und Keinen) – written between 1883 and 1885.

Related Posts

Objects in Space

Objects in Space: “Herschel” the largest telescope ever flown.

Objects in Space. –The biggest data processing challenge to date in astronomy: The Gaia mission.–

Big Data

Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

The evolving market for NoSQL Databases: Interview with James Phillips.


Apr 4 11

Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

by Roberto V. Zicari

“Data is the big one challenge ahead” –Michael Olson.

I was interested to learn more about Hadoop, why it is important, and how it is used for business.
I have therefore interviewed Michael Olson, Chief Executive Officer, Cloudera.

RVZ

RVZ: In my understanding of how the market of Data Management Platforms is evolving, I have identified three phases:

Phase I– New Proprietary data platforms developed: Amazon (Dynamo), Google (BigTable). Both systems remained proprietary and are in use by Amazon and Google.

Phase II- The advent of Open Source Developments: Apache projects such as Cassandra, Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. Multitude of new data platforms emerge.

Phase III– Evolving Analytical Data Platforms. Hadoop for analytics. Companies such as Cloudera, but also IBM’s BigInsights, are in this space.

Q1. Would you agree with this? Would you have anything to add/change?

Michael Olson: I think that’s generally accurate. The one qualification I’d offer is that the process isn’t a waterfall, where the Phase I players do some innovative work that flows down to Phase II where it’s implemented, and so on. The open source projects have come up with some novel and innovative ideas not described in the papers. The arrival of commercial players like Cloudera and IBM was also more interesting than the progression suggests. We’ve spent considerable time with customers, but also in the community, and we’ll continue to build reputation by contributing alongside the many great people working on Apache Hadoop and related projects around the world. IBM’s been working with Hadoop for a couple of years, and BigInsights really flows out of some early exploratory work they did with a small number of early customers.

Hadoop
Q2. What is Hadoop? Why is it becoming so popular?

Michael Olson: Hadoop is an open source project, sponsored by the Apache Software Foundation, aimed at storing and processing data in new ways. It’s able to store any kind of information — you needn’t define a schema and it handles any kind of data, including the tabular stuff that an RDBMS understands, but also complex data like video, text, geolocation data, imagery, scientific data and so on. It can run arbitrary user code over the data it stores, and it’s able to do large-scale parallel processing very easily, so you can answer petabyte-scale questions quickly. It’s genuinely a new thing in the data management world.

Apache Hadoop, the project, consists of three major components: the Hadoop Distributed File System, or HDFS, which handles storage; MapReduce, the distributed processing infrastructure, which handles the work of running analyses on data; and Common, which is a bunch of shared infrastructure that both HDFS and MapReduce need. When most people talk about “Hadoop,” they’re talking about this collection of software.
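The canonical first example of running “arbitrary user code” on this stack is a word count: the map function runs in parallel over splits of the input files in HDFS, and the reduce function sums the counts for each word. The sketch below uses the standard org.apache.hadoop.mapreduce API; the input and output paths come from the command line, and details such as the job name are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count job: map emits (word, 1) for every token in the input
// splits; reduce sums the counts per word.
public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);                  // emit (word, 1) for every token
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));     // total occurrences of this word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```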

Hadoop’s developed by a global community of programmers. It’s one hundred percent open source. No single company owns or controls it.

Q3. Who needs Hadoop, and for what kinds of applications?

Michael Olson: The software was developed first by big web properties, notably Yahoo! and Facebook, to handle jobs like large-scale document indexing and web log processing. It was applied to real business problems by those properties. For example, it can examine the behavior of large numbers of users on a web site, cluster individual users into groups based on the things that they do, and then predict the behavior of individuals based on the behavior of the group.

You should think of Hadoop in kind of the same way that you think of a relational database. All by itself, it’s a general-purpose platform for storing and operating on data. What makes the platform really valuable is the application that runs on top of it. Hadoop is hugely flexible. Lots of businesses want to understand users in the way that Yahoo! and Facebook do, above, but the platform supports other business workloads as well: portfolio analysis and valuation, intrusion detection and network security applications, sequence alignment in biotechnology and more.

Really, anyone who has a large amount of data — structured, complex, messy, whatever — and who wants to ask really hard questions about that data can use Hadoop to do that. These days, that’s just about every significant enterprise on the planet.

Q4. Can you use Hadoop standalone, or do you need other components? If so, which ones and why?

Michael Olson: Hadoop provides the storage and processing infrastructure you need, but that’s all. If you need to load data into the platform, then you either have to write some code or else go find a tool that does that, like Flume (for streaming data) or Sqoop (for relational data). There’s no query tool out of the box, so if you want to do interactive data explorations, you have to go find a tool like Apache Pig or Apache Hive to do that.

There’s actually a pretty big collection of those tools that we’ve found that customers need. It’s the main reason we created the open source package we call Cloudera’s Distribution including Apache Hadoop, or CDH. We assemble Apache Hadoop and tools like Pig, Hive, Flume, Sqoop and others — really, the full suite of open source tools that our users have found they require — and we make it available in a single package. It’s 100% open source, not proprietary to us. We’re big believers in the open source platform — customers love not being shackled to a vendor by proprietary code.

Analytical Data Platforms

Q5. It is said that more than 95% of enterprise data is unstructured, and that enterprise data volumes are growing rapidly. Is it true? What kind of applications generate such high volume of unstructured data? and what can be done with such data?

Michael Olson: You have to talk to a firm like IDC to get numbers. What I will say is that what you call “unstructured” data (I prefer “complex” because all data has structure) is big and getting bigger really, really fast.

It used to be that data was generated at human scale. You’d buy or sell something and a transaction record would happen. You’d hire or fire someone and you’d hit the “employee” table in your database.

These days, data comes from machines talking to machines. The servers, switches, routers and disks on your LAN are all furiously conversing. The content of their messages is interesting, and also the patterns and timing of the messages that they send to one another. (In fact, if you can capture all that data and do some pattern detection and machine learning, you have a pretty good tool for finding bad guys breaking into your network.) Same is true for programmed trading on Wall Street, mobile telephony and many other pieces of technology infrastructure we rely on.

Hadoop knows how to capture and store that data cheaply and reliably, even if you get to petabytes. More importantly, Hadoop knows how to process that data — it can run different algorithms and analytic tools, spread across its massively parallel infrastructure, to answer hard questions on enormous amounts of information very quickly.

Q6. Why build data analysis applications on Hadoop? Why not use already existing BI products?

Michael Olson: Lots of BI tools today talk to relational databases. If that’s the case, then you’re constrained to operating on data types that an RDBMS understands, and most of the data in the world — see above — doesn’t fit in an RDBMS. Also, there are some kinds of analyses — complex modeling of systems, user clustering and behavioral analysis, natural language processing — that BI tools were never designed to handle.

I want to be clear: RDBMS engines and the BI tools that run on them are excellent products, hugely successful and handling mission-critical problems for demanding users in production every day. They’re not going away. But for a new generation of problems that they weren’t designed to consider, a new platform is necessary, and we believe that that platform is Apache Hadoop, with a new suite of analytic tools, from existing or new vendors, that understand the data and can answer the questions that Hadoop was designed to handle.

Q7. Why Cloudera? What do you see as your main value proposition?

Michael Olson: We make Hadoop consumable in the way that enterprises require.
Cloudera Enterprise provides an open source platform for data storage and analysis, along with the management, monitoring and administrative applications that enterprise IT staff can use to keep the cluster running. We help our customers set and meet SLAs for work on the cluster, do capacity planning, provision new users, set and enforce policies and more. Of course Cloudera Enterprise comes with 24×7 support and a subscription to updates and fixes during the year.

When one of our customers deploys Hadoop, it’s to solve serious business problems. They can’t tolerate missed deadlines. They need their existing IT staff, who probably know how to run an Oracle database or VMware or other big data center infrastructure. That kind of person can absolutely run Hadoop, but needs the right applications and dashboards to do so. That’s what we provide.

Q8. In your opinion, what role will RDBMS and classical Data Warehouse systems play in the future in the market for Analytical Data Platforms? What about other data stores such as NoSQL and Object Databases? Will they play a role?

Michael Olson: I believe that RDBMS and classic EDWs are here to stay. They’re outstanding at what they do — they’ve evolved alongside the problems they solve for the last thirty years. You’d be nuts to take them on. We view Hadoop as strictly complementary, solving a new class of problems: Complex analyses, complex data, generally at scale.

As to NoSQL and ODBMS, I don’t have a strong view. The “NoSQL” moniker isn’t well-defined, in my opinion.
There are a bunch of different key-value stores out there that provide a bunch of different services and abstractions. Really, it’s knives and arrows and battleships — they’re all useful, but which one you want depends on what kind of fight you’re in.

Q9. Is Cloud technology important in this context? Why?

Michael Olson: “Cloud” is a deployment detail, not fundamental. Where you run your software and what software you run are two different decisions, and you need to make the right choice in both cases.

Q10. Looking at three elements: Data, Platform, and Analysis, what are the main business and technical challenges ahead?

Michael Olson: Data is the big one. Seriously: More. More complex, more variable, more useful if you can figure out what’s locked up in it. More than you can imagine, even if you take this statement into account.

We obviously need to improve the platforms we have, and I think the next decade will be an exciting time for that. That’s good news — I’ve been in the database industry since 1986, and it has frankly been pretty dull. Same is true for analyses, but our opportunities there will be constrained by both the platforms we have and the data on which we can operate.

Q11. Anything you wish to add?

Michael Olson: Thanks for the opportunity!

–mike
________________
Michael Olson, Chief Executive Officer, Cloudera.
Mike was formerly CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine. Mike spent two years at Oracle Corporation as Vice President for Embedded Technologies after Oracle’s acquisition of Sleepycat in 2006. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies and Informix Software. Mike has Bachelor’s and Master’s degrees in Computer Science from the University of California at Berkeley.
–––––––––––––––––

Related Post

The evolving market for NoSQL Databases: Interview with James Phillips.

–––––––––––––––––

Mar 26 11

The evolving market for NoSQL Databases: Interview with James Phillips.

by Roberto V. Zicari

“It is possible we will see standards begin to emerge, both in on-the-wire protocols and perhaps in query languages, allowing interoperability between NoSQL database technologies similar to the kind of interoperability we’ve seen with SQL and relational database technology.” — James Phillips.
_______________

In my understanding of how the market of Data Management Platforms is evolving, I have identified three phases:
Phase I – New Proprietary data platforms developed: Amazon (Dynamo), Google (BigTable). Both systems remained proprietary and are in use by Amazon and Google.

Phase II – The advent of Open Source Developments: Apache projects such as Cassandra, Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. A multitude of new data platforms emerged.

Phase III – Evolving Analytical Data Platforms. Hadoop for analytics. Companies such as Cloudera, but also IBM’s BigInsights, are in this space.

I wanted to learn more about the background of Phases I and II. I have interviewed James Phillips, who co-founded Membase and, since last month, is Co-Founder and Senior Vice President of Couchbase, the new company that originated from the merger of Membase and CouchOne.

RVZ

Phase I – New Proprietary data platforms developed: Amazon (Dynamo), Google (BigTable). Both systems remained proprietary and are in use by Amazon and Google.

Q1. Why did Amazon and Google have to develop their own database systems? Why didn’t they use/”adjust” existing database systems?

James Phillips: Existing relational database technologies were a poor match for the flexibility, performance, scaling and cost requirements of these organizations. It wasn’t enough to simply “adjust” relational database technology; a wholesale rethinking was required. See our (non-marketing 🙂) white paper for a detailed look at why that was the case.

Phase II – The advent of Open Source Developments: Apache projects such as Cassandra, Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. A multitude of new data platforms emerged.

Q2. How was it possible that Amazon and Google’s proprietary systems were used as input for Open Source Projects?

James Phillips: Amazon and Google published academic papers [e.g. Google BigTable, Amazon Dynamo], highlighting the design of many of their data management technologies. These papers inspired the creation of a number of open source software projects.

The Apache Cassandra open source project, initially developed at Facebook, has been described by Jeff Hammerbacher, who led the Facebook Data team at the time, as a BigTable data model running on an Amazon Dynamo-like infrastructure.

Key Apache Hadoop project technologies (initially developed by Doug Cutting while employed at Yahoo!) were heavily influenced by published papers describing the Google File System and MapReduce technologies.

Q3. Why did Facebook and Yahoo! develop Open Source data platforms and not proprietary software?

James Phillips: Obviously I can’t answer why another company made a business decision, because I wasn’t involved in those decisions. But one can make a reasonable guess as to why these companies would do this: Facebook is not in the database business – they are a social networking company.

They would probably have taken and used Dynamo and/or BigTable had they been open-sourced by Google and/or Amazon. But they weren’t, and Facebook was forced to write Cassandra. By open-sourcing the technology, Facebook could ostensibly benefit from the advancement of that technology by the open source community – new features, increased performance, bug fixes and other community-driven value.

Assuming rational behavior, one can reasonably infer that the potential value of these community-driven benefits was deemed greater than the perceived cost of possibly “arming” a potential social networking competitor with data management technology, of having to fully maintain the technology themselves, or of any sort of liability associated with open-sourcing the software.

There is also some recruiting value to be gained for companies like Facebook – by demonstrating that they are developing leading-edge technology, solving hard computer science problems, etc., they are able to attract the best and brightest minds to the company.

Q4. Not everyone has data and scalability requirements like those of Amazon and Google. Who currently needs such new data management platforms, and why?

James Phillips: Our white paper covers this in great detail. While not everyone has the scalability requirements, anyone building web applications (and who isn’t) needs the flexibility and cost advantages these solutions deliver.
Google and Amazon had the biggest pain initially, driving the innovation, but everyone benefits now. Velcro was invented to solve problems in space travel. Not everyone struggles with those problems. But use cases for Velcro are still being discovered.

Q5. Is there a common denominator between the business models built around such open source projects?

James Phillips: The only common denominator between business models related to open source software is open source software. There are support, product, services, training and countless other offerings that are derived from, based on or related to open source software projects.

Q6. Last month Membase and CouchOne merged to form Couchbase. What is the reasoning for this merge?

James Phillips: Prior to merging, Membase and CouchOne had focused on different layers of the NoSQL database technology stack:

Membase had focused on distributed caching, cluster management and high-performance memory-to-disk data flows.

CouchOne had concentrated on advanced indexing, document-oriented operations, real-time map-reduce, replication and support for mobile/web application synchronization.

There were many Membase customers asking for features of the CouchOne platform, and vice versa. The merger has allowed each of us to eliminate roughly 18 months of redundant R&D we would have done separately. We’ve effectively doubled the size of our engineering team and eliminated two net years of work, allowing us to get better products to market more quickly and to focus on innovating rather than duplicating functionality.

Q7. Technically speaking, do you plan to “merge” the two products into one? If you do not “merge” the two products, what will you do instead?

James Phillips: Yes. Elastic Couchbase is a new product we will introduce this summer; it will combine technologies from Membase and CouchOne.

Q8. What happens to existing customers who are using Membase and CouchOne respective products?

James Phillips: We will continue to support customers using existing Membase and CouchOne products, while providing a seamless upgrade path for customers who want to migrate to Elastic Couchbase. Couch customers who migrate get a higher-performance, elastic version of CouchDB. Membase customers who migrate gain the ability to index and query data stored in the cluster.

Q9. How do you see the NoSQL market evolving in the next 12 months?

James Phillips: It is possible we will see standards begin to emerge, both in on-the-wire protocols and perhaps in query languages, allowing interoperability between NoSQL database technologies similar to the kind of interoperability we’ve seen with SQL and relational database technology. It would not be surprising to see additional consolidation as well.
–––––––––––––––––––––––––––––
James Phillips, Co-Founder and Senior Vice President, Couchbase.

A twenty-five year veteran of the software industry, James Phillips started his career writing software for the Apple II and TRS-80 microcomputer platforms. In 1984, at age 17, he co-founded his first software company, Fifth Generation Systems, which was acquired by Symantec in 1993 forming the foundation of Symantec’s PC backup software business.
Most recently, James was co-founder and CEO of Akimbi Systems, a venture-backed software company acquired by VMware in 2006. Book-ended by these entrepreneurial successes, James has held executive leadership roles in software engineering, product management, marketing and corporate development at large public companies including Intel, Synopsys and Intuit and with venture-backed software startups including Central Point Software (acquired by Symantec), Ensim and Actional Corporation (acquired by Progress Software). Additionally, James spent two years as a technology investment banker with PaineWebber and Robertson Stephens and Co., delivering M&A advisory services to software companies.
James holds a BS in Mathematics and earned his MBA, with honors, from the University of Chicago.
He currently serves on the board of directors of Teneros and as an investor in and advisor to a number of privately-held software companies including Delphix, Replay Solutions and Virsto.

For further Reading

NoSQL Database Technology.
White Paper, Couchbase.
This white paper introduces the basic concepts of NoSQL Database technology and compares them with RDBMS.

Paper | Introductory | English | Link DOWNLOAD (PDF)| March 2011|

Related Posts
“Marrying objects with graphs”: Interview with Darren Wood.

“Distributed joins are hard to scale”: Interview with Dwight Merriman.

On Graph Databases: Interview with Daniel Kirstenpfad.

Interview with Rick Cattell: There is no “one size fits all” solution.

Mar 18 11

Objects in Space: “Herschel” the largest telescope ever flown.

by Roberto V. Zicari

Managing telemetry data and information on steering and calibrating scientific on-board instruments with an object database.

Interview with Dr. Johannes Riedinger, Herschel Mission Manager, European Space Agency (ESA), and Dr Jon Brumfitt, System Architect of the Herschel Science Ground Segment, European Space Agency (ESA).
———————————
More Objects in Space…
I became aware of another very interesting project at the European Space Agency (ESA). On May 14, 2009, the European Space Agency launched an Ariane 5 rocket carrying the largest telescope ever flown: the “Herschel” telescope, 3.5 meters in diameter. The satellite, whose orbit is some 1.6 million kilometers from the Earth, will operate for 48 months.

One interesting aspect of this project, for us at odbms.org, is that they use an object database to manage telemetry data and information on steering and calibrating scientific on-board instruments.

I had the pleasure to interview Dr. Johannes Riedinger, Herschel Mission Manager, and Dr. Jon Brumfitt, System Architect of Herschel Scientific Ground Segment, both at the European Space Agency.

Hope you’ll enjoy the interview!

RVZ

Q1. What is the mission of the “Herschel” telescope?

Johannes: The Herschel Space Observatory is the latest in a series of space telescopes that observe the sky at far-infrared wavelengths that cannot be observed from the ground. The most prominent predecessors were the Infrared Astronomical Satellite (IRAS), a joint US, UK, NL project launched in 1983, next in line was the Infrared Space Observatory (ISO), a telescope the European Space Agency launched in 1995, which was followed by the Spitzer Space Telescope, a NASA mission launched in 2003. The two features that distinguish Herschel from all its predecessors are its large primary mirror, which translates into a high sensitivity because of the large photon collecting area, and the fact that it is sensitive out to a wavelength of 670 μm. Extending the wavelength range of its predecessors by more than a factor of 3, Herschel is closing the last big gap that has remained in viewing celestial objects at wavelengths between 200 μm and the sub-millimetre regime. In the wavelength range in which Herschel overlaps to varying degrees with its predecessors, from 57 to 200 μm, Herschel’s big advantage is once more the size of its primary mirror. Because spatial resolution at a given wavelength improves linearly with telescope diameter, Herschel has a 4 to 6 times sharper vision than all earlier far-infrared telescopes.

Q2. In a report in 2010 [1] you wrote that “you collect every day an average of six to seven gigabit raw telemetry data and additional information for steering and calibrating three scientific on-board instruments”. Is this still the case? What are the main technical challenges with respect to data processing, manipulation and storage that this project poses?

Johannes: The average rate at which we have been receiving raw data from Herschel over the past 18 months is about 12 gigabits/day, which increases somewhat through the addition of related data by the time it is ingested into the database. The amount of data, and its distribution to the three national data centres that monitor the health and calibration of the instruments, is quite manageable. More challenging are the data volumes that need to be handled in the form of various levels of scientific data products, which are generated in the data processing pipeline. Another challenge, even for today’s computers, is the amount of memory some products require during processing. This goes significantly beyond the amount of memory a standard desktop or laptop computer comes with, and even beyond the amount of disk space such machines used to come with until a few years ago: only recently have we procured two computers with 256 GB of RAM for the grid on which we run the pipeline to efficiently process some of the data.

Q3. What are the requirements of the “Herschel” data processing software? And what are the main components and functionalities of the Herschel software?

Jon: The software is required to plan daily sequences of astronomical observations and associated spacecraft manoeuvres, based on proposals submitted by the astronomical community, and to turn these into telecommand sequences to uplink to the spacecraft. The data processing system is required to process the resulting telemetry and convert it into scientific products (images, spectra, etc.) for the astronomers. The “uplink” system is particularly critical, because it must have high availability to keep the spacecraft supplied with commands. Significant parts of the software were required to be working several years before launch, so that they could be used to test the scientific instruments in the laboratory. This approach, which we call “smooth transition”, ensures that certain critical parts of the software are very mature by the time we come to launch and that the instruments have been extensively tested with the real control system.

The uplink chain starts with a proposal submission system that allows astronomers to submit their observation requests. The astronomer downloads a Java client which communicates with a JBOSS application server to ingest proposals into the Versant database. Next, the mission planning system allows the mission planner to schedule sequences of observations. This is a complex task which is subject to many constraints on the allowed spacecraft orientation, the target visibility, stray light, ground station availability, etc. It is important to minimize the time wasted by slewing between targets. Each schedule is expanded into a sequence of several thousand telecommands to control the spacecraft and scientific instruments for a period of 24 hours. The spacecraft is in contact with the Earth only once a day for typically 3 hours, during which time a new schedule is uplinked and data from the previous day is downlinked and ingested into the Versant database. Between contact periods, the spacecraft executes the commands autonomously. The sequences of instrument commands have to be accurately synchronized with each little manoeuvre of the spacecraft when we are mapping areas of the sky. This is achieved by a system known as the “Common Uplink System” in which the detailed commanding is programmed in a special language that models the on-board execution timing.

The other major component of the software is the Data Processing system. This consists of a pipeline that processes the telemetry each day and places the resulting data products in an archive. Astronomers can download the products. We also provide an interactive analysis system which they can download to carry out further specialized analysis of their data.

Overall, it is quite a complex system with about 3 million lines of Java code (including test code) organized as 1100 Java packages and 300 CVS modules. There are 13,000 classes of which 300 are persistent.

Q4. You have chosen to have two separate database systems for Herschel: a relational database for storing and managing processed data products and an object database for storing and managing proposal data, mission planning data, telecommands and raw (unprocessed) telemetry. Is this correct? Why two databases? What is the size of the data expected for each database?

Jon: The processed data products, which account for the vast bulk of the data, are kept in a relational database, which forms part of a common infrastructure shared by many of our ESA science projects. This archive provides a uniform way of accessing data from different missions and performing multi-mission queries. Also, the archive will need to be maintained for years after the Herschel project finishes. The Herschel data in this archive is expected to grow to around 50 TB.

The proposal data, scheduling data, telecommands and raw (unprocessed) telemetry are kept in the Versant object database. This database will only grow to around 2 TB. In fact, back in 2000, we envisaged that all the data would be in the object database, but it soon became apparent that there were enormous benefits from adopting a common approach with other projects.

The object database is very good for the kind of data involved in the uplink chain. This typically requires complex linking of objects and navigational access. The data in the relational archive is stored in the form of FITS files (a format widely used by astronomers) and the appropriate files can be found by querying the meta-data that consist only of a few hundred keywords even for files that can be several gigabytes in size. We need to deliver the data as FITS files, so it makes sense to store it in that form, rather than to convert objects in a database into files in response to each astronomer query. Our interactive analysis system retrieves the files from the archive and converts them back into objects for processing.

Johannes: At present the Science Data Archive contains about 100 TB of data. This includes older versions of the same products that were generated by earlier versions of the pipeline software—we regenerate the products with the latest software and the best current knowledge of instrument calibration about every 6 months. Several years from now, when we will have generated the “best and final” products for the Herschel legacy archive and after we have discarded all the earlier versions of each product, we expect to end up, as Jon says, with an archive volume of about 50 to 100 TB. This is up by two orders of magnitude from the legacy archive of ISO which, given today’s technology, you can carry around in your pocket on a 512 GB external disk drive.

Q5. How do the two databases interact with each other? Are there any similarities with the Gaia data processing system?

Jon: The two database systems are largely separate as they perform rather different roles. The gap is bridged by the data processing pipeline which takes the telemetry from the Versant database, processes it and puts the resulting products in the archive.

We do, in fact, have the ability to store products in the Versant database and this is used by experts within the instrument teams, although for the normal astronomer everything is archive based. There are many kinds of products and in principle new kinds of products can be defined to meet the needs of data processing as the mission progresses. If it were necessary to define a new persistent class for each new product type, we would have a major schema evolution problem. Consequently, product classes are defined by “composition” of a fixed set of building blocks to build an appropriate hierarchical data-structure. These blocks are implemented as a fixed set of persistent classes, allowing new product types to be defined without a schema change.
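
To make the composition idea concrete, here is a minimal sketch (the class and method names are invented for illustration and are not the actual Core Class Model): every product, whatever its type, is assembled from the same generic, persistence-capable building block, so introducing a new product type requires no new persistent class and therefore no schema change.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only, not the real Herschel CCM classes.
class ProductNode {
    // One fixed, persistent building block: named metadata plus named children.
    private final Map<String, Object> metadata = new LinkedHashMap<String, Object>();
    private final Map<String, ProductNode> children = new LinkedHashMap<String, ProductNode>();

    void setMeta(String key, Object value) { metadata.put(key, value); }
    Object getMeta(String key) { return metadata.get(key); }
    void addChild(String name, ProductNode child) { children.put(name, child); }
    ProductNode getChild(String name) { return children.get(name); }
}

class SpectrumProducts {
    // A "new product type" is just a different composition of the same block.
    static ProductNode makeSpectrum(double[] wavelength, double[] flux) {
        ProductNode product = new ProductNode();
        product.setMeta("type", "Spectrum");
        ProductNode table = new ProductNode();
        table.setMeta("wavelength", wavelength);
        table.setMeta("flux", flux);
        product.addChild("spectrumTable", table);
        return product;
    }
}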

The Versant database requires that first-class objects in the database are “enhanced” to make them persistence capable. However, we need to use Plain Old Java Objects for the products so that our interactive analysis system does not require a Versant installation. We have solved this by using “product persistor” classes that act as adaptors to make the POJOs persistent.
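
A rough sketch of that adaptor idea (again with invented names, and with plain Java serialization standing in for the real Versant-enhanced persistence code): the product itself stays a plain Java object, and a separate persistence-capable class carries it into and out of the database, so the interactive analysis system never needs a Versant installation.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustrative only: the real "product persistor" classes are Versant-specific.
class ProductPersistor {
    private byte[] productState;            // what actually lives in the database

    void attach(Serializable product) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(product);           // POJO -> storable state
        out.close();
        productState = bytes.toByteArray();
    }

    Object detach() throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(productState));
        return in.readObject();             // storable state -> POJO
    }
}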

Johannes: Concerning Gaia, and here I need to make a point that has little to do with database technology per se but it has a lot to do with using databases as a tool for different purposes even in the same generic area of science—astronomy in this case—I want to emphasize the differences between Herschel and Gaia rather than the similarities.

Herschel is an observatory, i.e. the satellite collects data from any celestial object which is selected by a user to be observed in a particular observing mode with one or more of Herschel’s three on-board instruments. Each observing mode generates its own, specific set of products that is characteristic of the observing mode rather than characteristic of the celestial object that is observed: A set of products can e.g. consist of a set of 2-dimensional, monochromatic images at infrared wavelengths, from which various false-colour composites can be generated by superposition. In the lingo of this journal: Observing modes are the classes, the data products are the objects. If I observe two different celestial objects in the same observing mode, I can directly compare the results. E.g. the distribution of colours in a false-colour image immediately tells me something about the temperature distribution of the material that is present; the intensity of the colour tells me something about the amount of material of a given temperature. And if I also have a spectrum of the celestial object, this tells me something about the chemical composition of the material and its physical state, such as pressure, density, and temperature.

Gaia, on the other hand, is not an observatory and it does not offer different observing modes that can be requested. It measures the same set of a dozen or so parameters for each and every object time and time again, with some of these parameters being time, brightness in different colours, position relative to other objects that appear in the same snapshot, and radial velocity. But every time it measures these parameters for a particular object it measures them in a different context, i.e. in combination with a different set of other objects that appear in the same snapshot. So you end up with a billion objects, each appearing on a different ensemble of several dozen snapshots, and you have to find the “global best fit” that minimizes the residual errors of a dozen parameters fitted to a billion objects. Computationally, and compared to Herschel, this is a gargantuan task. But it is a single task—derive the mechanical state of an ensemble of a billion objects whose motion is controlled by relativistic celestial mechanics in the presence of a few possible disturbances, such as orbiting planets or unstable stellar states (pulsating or otherwise variable stars). On Herschel, the scientific challenge is somewhat different: We are not so intensely interested in the state of motion of individual objects, we are interested in the chemical composition of matter—much of which does not consist of stars but of clouds of gas and dust which Gaia cannot “see”—and its physical state.

Q6. You have chosen an object database, from Versant, for storing and managing raw (unprocessed) telemetry data. What is the rationale for this choice? How do you map raw data into database objects? What is a typical object database representation of these subsets of “Herschel” data stored in the Versant Object Database? What are the typical operations on such data?

Jon: Back in 2000, we looked at the available object databases. Versant and Objectivity appeared to have the scalability to cope with the amount of data we were envisaging. After a detailed comparison, we chose Versant although I think either would have done the job. At this stage, we still envisaged storing all the product data in the object database.

The uplink objects, such as proposals, observations, telecommands, schedules, etc, are stored directly as objects in the Versant database, as you might expect. The telemetry is a special case because we want to preserve the original data exactly as it arrives in binary packets, so that we can always get back to the raw data. So we have a TelemetryPacket class that encapsulates the binary packet and provides methods for accessing its contents. To support queries, we decode key meta-data from the binary packet when the object is constructed and store it as attributes of the object. This allows querying to retrieve, for example, packets for a given time range for a specified instrument or packets for a particular observation.
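
A rough sketch of what such a class might look like (the attribute names and the decoding stubs are invented placeholders, not the actual Herschel implementation): the binary packet is preserved untouched, and a few attributes decoded at construction time give the database something to index and query.

import java.util.Arrays;

// Illustrative only, not the actual Herschel class.
class TelemetryPacket {
    private final byte[] rawPacket;      // original binary data, preserved exactly
    private final long onboardTime;      // decoded once at construction, for queries
    private final int instrumentId;
    private final long observationId;

    TelemetryPacket(byte[] rawPacket) {
        this.rawPacket = Arrays.copyOf(rawPacket, rawPacket.length);
        this.onboardTime = decodeTime(rawPacket);         // placeholder decoding
        this.instrumentId = decodeInstrument(rawPacket);
        this.observationId = decodeObservation(rawPacket);
    }

    byte[] getRawPacket() { return Arrays.copyOf(rawPacket, rawPacket.length); }
    long getOnboardTime() { return onboardTime; }
    int getInstrumentId() { return instrumentId; }
    long getObservationId() { return observationId; }

    private static long decodeTime(byte[] p) { return 0L; }        // stubs for illustration
    private static int decodeInstrument(byte[] p) { return 0; }
    private static long decodeObservation(byte[] p) { return 0L; }
}

A query such as “all packets from instrument X in a given time range” then only touches the decoded attributes and never has to parse the binary payload.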

The persistent data model is known as the Core Class Model (CCM). This was developed by starting with a domain model describing the key entities in the problem domain, such as proposals and observations, and then adding methods by analysing the interactions implied by the use-cases. The model then evolved by the introduction of various design patterns and further classes related to the implementation domain.

Q7. How did you specifically use the Versant object database? Are there any specific components of the Versant object database that were (are) crucial for the data storage and management part of the project? If yes, which ones? Were there any technical challenges? If yes, which ones?

Jon: With a project that spans 20 years from the initial concept at the science ground segment level to generation and availability of the final archive of scientific products, it is important to allow for technology becoming obsolete. At the start of the science ground segment development part of the project, a decade ago, we looked at the standards that were then emerging (ODMG and JDO) to see if these could provide a vendor-independent database interface. However, the standards were not sufficiently mature, so we designed our own abstraction layer for the database and the persistent data. That meant that, in principle, we could change to a different database if needed. We tried to only include features that you would reasonably expect from any object database.

We organized all the persistent classes in the system into a single package. This was important to keep changes to the schema under tight control by ensuring that the schema version was defined by a single deliverable item. Consequently, the CCM classes are responsible for persistence and data integrity, but have little application-specific functionality. The persistent objects can be customized by the applications for example by using the decorator pattern. For various reasons, the persistent classes tend to contain Versant-specific code, so by keeping them all in one package it keeps the vendor-specific code together in one module behind the abstraction layer.
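
As a small illustration of that customization style (invented names, not the actual CCM classes): the persistent class stays minimal and schema-controlled, and an application wraps it with the behaviour it needs.

// Sketch of decorating a schema-controlled persistent class with
// application-specific behaviour; names are invented.
class PersistentObservation {            // lives in the single persistent-class package
    private long id;
    private String target;
    long getId() { return id; }
    String getTarget() { return target; }
}

class SchedulableObservation {           // application-level decorator, not persistent
    private final PersistentObservation observation;
    SchedulableObservation(PersistentObservation observation) { this.observation = observation; }
    String getTarget() { return observation.getTarget(); }
    boolean isVisibleAt(long time) {
        // application-specific scheduling logic would go here
        return true;
    }
}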

We needed to propagate copies of subsets of the data to databases at the three instrument control centres. At the time, there wasn’t an out-of-the-box solution that did what we wanted, so we developed our own data propagation. This was quite a substantial amount of work. One nice thing is that this is all hidden within our database abstraction layer, so that it is transparent to the applications. We continuously propagate the database to all three instrument teams and we also propagate it to our test system and backup system.

Another important factor is backup and recovery and it makes a big difference how you organize the data. We have an uplink node and a set of telemetry nodes. There is a current telemetry node into which new data is ingested. All the older telemetry nodes are read-only as the data never changes, which means you only have to back them up once. The abstraction layer is able to hide the mapping of objects onto databases.

Q8. Is scalability an important factor in your application?

Jon: In the early stages, when we were considering putting all the data into the object database, scalability was a very important issue. It became much less of an issue when we decided to store the products in the common archive.

Q9. Do you have specific time requirements for handling data objects?

Jon: We need to ingest the 1.5 GB of telemetry each day and propagate it to the instrument sites. In general the performance of the Versant database is more than adequate. It is also important that we can plan (and if necessary replan) the observations for an operational day within a few hours, although this does not pose a problem in terms of database performance.

Q10. By the end of the four-year life cycle of “Herschel” you are expecting a minimum of 50 terabytes of data, which will be available to the scientific community. How do you plan to store, maintain and process this amount of data?

Johannes: At the European Space Astronomy Centre (ESAC) in Madrid we keep the archives of all of ESA’s astrophysics missions, and a few years ago we started to also import and organise into archives the data of the planetary missions that deal with observations of the sun, planets, comets and asteroids. The user interface to all these archives is basically the same. This generates a kind of “corporate identity” of these archives and if you have used one you know how to use all of them.

For many years, the Science Archive Team at ESAC has played a leading role in Europe in the area of “Virtual Observatories”. If you are looking for some specific information on a particular astronomical object, chances are good, and they are getting better by the day, that some archive in the world already has this information. ESA’s archives, and amongst them the Herschel legacy archive once it has been built, are accessible through this virtual observatory from practically anywhere in the world.

Q11. You are basically halfway through the life cycle of “Herschel”. What are the main lessons learned so far from the project? What are the next steps planned for the project and the main technical challenges ahead?

Johannes: We are very lucky to be operating this world-class observatory with only minor problems in the satellite hardware, all of which are well under control; and without doubt the more than 1,300 individuals who are listed on a web page as having made major contributions to the success of this project have every reason to be proud.

We are approximately halfway through the in-orbit lifetime of the mission after launch. But this is not the same as saying that we have reaped half of the science results from the mission.

For the first 4 months of the mission, i.e. until mid-September 2009, the satellite and the ground segment underwent a checkout, a commissioning and a performance verification phase. Through a large number of “engineering” and “calibration” observations in these initial phases we ensured that the observing modes we had planned to provide and had advertised to our users worked in principle, that the telescope pointed in the right direction, that the temperatures did not drift beyond specification, etc. From mid-September to about the end of 2009, we performed an increasing number of scientific observations of astronomical targets in what we called the “Science Demonstration Phase”. These observations were “the proof of the pudding” and showed that, indeed, the astronomers could do the science they had intended to do with the data they were getting from Herschel: More than 250 scientific papers were published in mid-2010 that are based on observations made during this Science Demonstration Phase.

We have been in the “Routine Science Mission Phase” for about one year now, and we expect to remain in this phase for at least another year and a half, i.e. we have collected perhaps 40% of the scientific data Herschel will eventually have collected when the Liquid Helium coolant runs out. Already now we can see that Herschel will revolutionize some of the currently accepted theories and concepts of how and where stars are born, how they enrich the interstellar medium with elements heavier than Helium when they die, and in many other areas such as how the birth rate of stars has changed over the age of the universe, which is over 13 billion years old. But, most importantly, we need to realise and remember that sustained progress in science does not come about on short time scales. The initial results that have already been published are spectacular, but they are only the tip of the iceberg, they are results that stare you in the face. Many more results, which together will change our view of the cosmos as profoundly as the cream of the coffee results we already see now, will only be found after years of archival research, by astronomers who plough through vast amounts of archive data in the search of new, exciting similarities between seemingly different types of objects and stunning differences between objects that had been classified to be of the same type. And this kind of knowledge will accumulate from Herschel data for many years beyond the end of the Herschel in-orbit lifetime. ESA will support this slower but more sustained scientific progress through archival research by keeping together a team of about half a dozen software experts and several instrument specialists for up to 5 years after the end of the in-orbit mission. In addition, members of the astronomical community will, as they have done on previous missions, contribute to the archive their own, interactively improved products which extract features from the data that no automatic pipeline process can extract no matter how clever it has been set up to be.

I believe it is fair to say that no-one yet knows the challenges the software developers and the software users will face.
But I do expect that a lot more problems will arise from the scientific interpretation of the data than from the software technologies we use, which are up to the task that is at hand.
—————————

Dr. Johannes Riedinger, Herschel Mission Manager, European Space Agency.
Johannes Riedinger has a background in Mathematics and Astronomy and has worked on space projects since the start of his PhD thesis in 1980, when he joined a team at the Max-Planck-Institut für Astronomie in Heidelberg working on the development of an instrument for a space shuttle payload. Working on a successor project, the ISO Photo-Polarimeter ISOPHOT, he joined industry from 1985 to 1988 as the System Engineer for the phase B study of this payload. Johannes joined the European Space Agency in 1988 and, having contributed to the implementation of the Science Ground Segments of the Infrared Space Observatory (ISO, launched in 1995) and the X-ray Multi Mirror telescope (XMM-Newton, launched in 1999), he became the Herschel Science Centre Development Manager in 2000. Following successful commissioning of the satellite and ground segment, he became Herschel Mission Manager in 2009.

Dr. Jon Brumfitt, System Architect of Herschel Scientific Ground Segment, European Space Agency.
Jon Brumfitt has a background in Electronics with Physics and Mathematics and has worked on several of ESA’s astrophysics missions, including IUE, Hipparcos, ISO, XMM and currently Herschel. After completing his PhD and a post-doctoral fellowship in image processing, Jon worked on data reduction for the IUE satellite before joining Logica Space and Defence in 1980. In 1984 he moved to Logica’s research centre in Cambridge and then in 1993 to ESTEC in the Netherlands to work on the scientific ground segments for ISO and XMM. In January 2000, he joined the newly formed Herschel team as ground segment System Architect. As Herschel approached launch, he moved down to the European Space Astronomy Centre in Madrid to become part of the Herschel Science Operations Team.

For Additional Reading

“Herschel” telescope

[1] Data From Outer Space.
Versant.
This short paper describes a case study: the handling of telemetry data and information of the “Herschel” telescope from outer space with the Versant object database. The telescope was launched by the European Space Agency (ESA) with the Ariane 5 rocket on 14 May 2009.
Paper | Introductory | English | LINK to DOWNLOAD (PDF)| 2010|

[2] ESA Web site for “Herschel”

[3] Additional Links
SPECIALS/Herschel
HERSCHEL: Exploring the formation of galaxies and stars
ESA latest_news
ESA Press Releases

Related Post:

Objects in Space.
–The biggest data processing challenge to date in astronomy: The Gaia mission.–

Mar 14 11

Benchmarking ORM tools and Object Databases.

by Roberto V. Zicari

“I believe that one should benchmark before making any technology decisions.”
An interview with Pieter van Zyl creator of the OO7J benchmark.

In August last year, I published an interesting resource in ODBMS.ORG, the dissertation of Pieter van Zyl, from the University of Pretoria: “Performance investigation into selected object persistence stores”. The dissertation presented the OO7J benchmark.

OO7J is a Java version of the original OO7 benchmark (written in C++) from Mike Carey, David DeWitt and Jeff Naughton at the University of Wisconsin-Madison. The original benchmark tested Object Database (ODBMS) performance. This project also includes benchmarking Object Relational Mapping (ORM) tools. Currently there are implementations for Hibernate (on PostgreSQL and MySQL), db4o and Versant. The source code is available on SourceForge. It uses the GNU GPL license.

Together with Srini Penchikala of InfoQ, I have interviewed Pieter van Zyl, creator of the OO7J benchmark.

Hope you’ll enjoy the interview.
RVZ

Q.1 Please give us a summary of OO7J research project.

Pieter van Zyl: The study investigated and focused on the performance of object persistence and compared ORM tools to object databases. ORM tools provide an extra layer between the business logic layer and the data layer. This study began with the hypothesis that this extra layer, and the mapping that happens at that point, slows down the performance of object persistence.
The aim was to investigate the influence of this extra layer against the use of object databases, which remove the need for this extra mapping layer. The study also investigated the impact of certain optimisation techniques on performance.

A benchmark was used to compare ORM tools to object databases. The benchmark provided criteria that were used to compare them with each other. The particular benchmark chosen for this study was OO7, widely used to comprehensively test object persistence performance. Part of the study was to investigate the OO7 benchmark in greater detail, to get a clearer understanding of the OO7 benchmark code and its inner workings.

Because of its general popularity, reflected by the fact that most of the large persistence providers provide persistence for Java objects, it was decided to use Java objects and focus on Java persistence. A consequence of this decision is that the OO7 Benchmark, currently available in C++, has had to be re-implemented in Java as part of this study.

Included in this study was a comparison of the performance of an open source object database, db4o, against a proprietary object database, Versant. These representatives of object databases were compared against one another as well as against Hibernate, a popular open source representative of the ORM stable. It is important to note that these applications were initially used in their default modes (out of the box). Later, some optimisation techniques were incorporated into the study, based on feedback obtained from the application developers. My dissertation can be found here.

Q.2 Please give us a summary of the recommendations of the research project.

Pieter van Zyl: The study found that:
− The use of an index does help for queries. This was expected. Hibernate’s indexes seem to function better than db4o’s indexes during queries.
− Lazy and eager loading were also investigated, and their influence stood out in the study. Eager loading improves traversal times if the tree being traversed can be cached between iterations. Although the first run can be slow, by caching the whole tree, calls to the server in follow-up runs are reduced. Lazy loading improves query times, and unnecessary objects are not loaded into memory when they are not needed. (A minimal mapping sketch follows this list.)
− Caching was investigated. It was found that caching helps to improve most of the operations. By using eager loading and a cache, the speed of repeated traversals of the same objects is increased. Hibernate, with its first-level cache, in some cases performs better than db4o during traversals with repeated access to the same objects.
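
As promised above, a minimal illustration of the kind of setting involved (a generic JPA/Hibernate mapping using two OO7 class names; this is not OO7J’s own mapping, which may differ):

import java.util.List;
import javax.persistence.*;

// Illustrative JPA annotations only.
@Entity
class CompositePart {
    @Id
    private long id;

    // Eager: all atomic parts are loaded together with the composite part,
    // which favours repeated traversals of the same tree.
    @OneToMany(mappedBy = "parent", fetch = FetchType.EAGER)
    private List<AtomicPart> parts;
}

@Entity
class AtomicPart {
    @Id
    private long id;

    // Lazy: the parent is loaded only when it is actually navigated to,
    // which keeps queries from pulling in unneeded objects.
    @ManyToOne(fetch = FetchType.LAZY)
    private CompositePart parent;
}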

When creating and using benchmarks it is important to state clearly what settings and environment are being used. In this study it was found that:
− Running db4o in client-server mode or in embedded mode gives different performance results. It was shown that some queries are faster when db4o is run in embedded mode, and also that some traversals are faster when db4o is run in client-server mode.
− It is important to state when a cache is being used and when it is cleared. It was shown that clearing the Versant cache with every transaction commit influenced the hot traversal times. It is also important, for example, to state whether a first- or second-level cache is being used in Hibernate, as this could influence traversal times.
− Having a cache does not always improve random inserts, deletes and queries. It was shown that the cache assisted traversals more than inserts, deletes and queries. With random inserts, deletes and queries, not all objects accessed will be in the cache. Also, if query caches are used, this must be stated clearly, otherwise it could create false results.
− It is quite difficult to generalise. One mechanism is faster with updates to smaller numbers of objects, others with larger numbers; some perform better with changes to indexed fields, others with changes to non-indexed fields; some perform better if traversals are repeated over the same objects, others in first-time traversals, which might be what the application under development needs. The application needs to be profiled to see if there is repeated access to the same objects in a client session.
− In most cases the cache helps to improve access times. But if an application does not access the same objects repeatedly, or accesses scattered objects, then the cache will not help as much. In these cases it is very important to look at the OO7 cold traversal times, which access the disk-resident objects. For cold traversals (that is, first-time access to objects), Versant is in most cases the fastest of all the mechanisms tested by OO7.
− By not stating the benchmark and persistence-mechanism settings clearly, it is very easy to cheat and create false statements. For example, with Versant it is important to state what type of commit is being used: a normal commit or a checkpoint commit. The types of queries used can also impact benchmark results, and there is a slight performance difference between using a find() and running a query in Hibernate (see the sketch after this list).
− It is important to make sure that the performance techniques being used actually improve the speed of the application. It was shown that adding subselects to a MySQL database does not automatically improve the speed.
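
For instance, assuming find() here means lookup by identifier, the two access paths look roughly like this in JPA/Hibernate (illustrative code, not taken from OO7J, reusing the AtomicPart entity from the earlier sketch):

import javax.persistence.EntityManager;

// Lookup by identifier can be answered from the persistence context
// (first-level cache); the equivalent query normally causes a database round trip.
class AccessPaths {
    AtomicPart byFind(EntityManager em, long id) {
        return em.find(AtomicPart.class, id);
    }

    AtomicPart byQuery(EntityManager em, long id) {
        return em.createQuery("select p from AtomicPart p where p.id = :id", AtomicPart.class)
                 .setParameter("id", id)
                 .getSingleResult();
    }
}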

This work formed part of my MSc. While the findings are not always surprising or new, the work showed that the OO7 benchmark can still be used to test today’s persistence frameworks. It really brought out performance differences between ORM tools and object databases. This work is also the first OO7 implementation that tested ORM tools and compared open source against commercial object databases.

Q.3 What is the current state of the project?

Pieter van Zyl: The project has implementations for db4o, Hibernate with PostgreSQL and MySQL, and the Versant database. The project currently works with settings files and an Ant script to run different configurations. The project is a complete re-implementation of the original C++ OO7 benchmark. More implementations will be added in the future. I also believe that all results must be audited, so I will keep submitting benchmark results to the vendors.

Q.4 What are the best practices and lessons learned in the research project?

Pieter van Zyl: See the answer to Question 2. What is interesting today is that benchmarkers are still not allowed to publish benchmark results for commercial products; their licenses prohibit it. We felt that academics must be allowed to investigate and publish their results freely. In the end we did comply with the licenses and submitted the work to the vendors.

Q.5 Do you see a chance that your benchmark be used by the industry? Why?

Pieter van Zyl: Yes, but I suspect they are using benchmarks already. These benchmarks are probably home-grown. Also, there are no de facto benchmarks for object database and ORM tool vendors, whereas there is a TPC benchmark for relational database vendors. While some vendors did use the OO7 benchmark in the late 90s, they seem not to use it any more, or perhaps they have adapted it for in-house use.

OO7J could be used to test improvements from one version to the next. I have used it to benchmark differences between db4o releases: we tested the embedded version of db4o against the client-server version, and this gave us valuable information about the differences in performance.

Currently OO7J has its own interface to the persistence store being benchmarked. This means that it can be extended to test most persistence tools. We wanted to use the JPA or JDO interfaces, but not all vendors support these standards.
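
Purely to illustrate the shape of such an abstraction (the names below are invented; the real interface is in the OO7J sources on SourceForge):

// Invented sketch of a persistence-store abstraction of the kind OO7J describes.
interface PersistenceStore {
    void open(String configuration);        // connect using a product-specific config
    void begin();                           // start a transaction
    void commit();
    void makePersistent(Object benchmarkObject);
    <T> T lookup(Class<T> type, long id);   // retrieve a benchmark object by identifier
    void close();
}
// One implementation per product (db4o, Hibernate, Versant, ...) keeps the
// benchmark operations themselves product-independent.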

Q.6 What feedback have you received so far?

Pieter van Zyl: The dissertation was well received. I got a distinction for the work. I submitted the benchmark to the vendors to get their input on the benchmark and on how to optimize their products. The feedback was good and no bugs were found. It is important that a benchmark is accurate and used consistently for all vendor implementations. I don’t think there are any oddities or inconsistencies in the benchmark code.

Jeffrey C. Mogul states that it is important that benchmarks should be repeatable, relevant, use realistic metrics, be comparable and widely used. I think OO7 complies with those requirements and I stayed as close as possible to OO7 with OO7J.

OO7J has also been used by students at the ETH Zurich Department of Computer Science. Another object database vendor in America also contacted me about my work and wanted to use it for their benchmarking; I am not sure how far they progressed.

Q.7 What are the main related works? How does OO7J research project compare with other persistence benchmarking approaches and what are the limitations of the OO7J project?

Pieter van Zyl: There were related attempts by a few researchers to create a Java implementation of OO7 in the late 90s, and Sun also created a Java version. These versions are no longer available and were not open-sourced. See my dissertation for more details.

More recent work includes:
• The ETH Zurich CS department (Prof. Moira C. Norrie) created a Mavenized and GUI version of my work, but it includes changes to the database and they still need to sort out some small issues.
• Prof. William R. Cook’s students’ effort: a public implementation of the OO7 benchmark.
Other benchmarking work in the Java object space:
• A new benchmark that I discovered recently: the JPA Performance Benchmark.
• The PolePosition benchmark.

These benchmarks are not entirely vendor independent. But they are open source, so one can look at the code and challenge the implementation.

I think OO7 has one thing going for it that the others don’t: it is still more widely used, especially in the academic world. It has a lot of vendor independence behind it historically, and it has had more reviews and more documentation of how it works internally.

But I have seen some implementations of OO7 that are not complete: for example, they build half the model and then don’t disclose these changes when publishing the results, or they have only some of the queries or traversals working.

That is why I like to stay close to the original, well-known OO7, and I document any changes clearly.
If you run Query 8 of OO7, I expect it to function 100% like the original. If anyone modifies it, they should treat this as an extension and rename the operation.
I have also included asserts/checkpoints to make sure the correct number of objects is returned for every operation.
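
A trivial sketch of such a checkpoint (not the actual OO7J code):

// Verify that an operation touched exactly the number of objects the OO7
// specification expects for the chosen database size.
final class Checkpoints {
    static void check(String operation, long actual, long expected) {
        if (actual != expected) {
            throw new IllegalStateException(
                operation + ": expected " + expected + " objects but got " + actual);
        }
    }
}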

Limitations of OO7J:
• It needs to be upgraded to run in client/server mode. More concurrent clients must be created to run operations on the databases at the same time.
• Its configurations need to be updated to create terabyte- and petabyte-scale database models.

Q.8 To do list: What still needs to be done?

Pieter van Zyl:
• Currently there are three implementations, one for each of the products being tested. While they use the same model and operations, I found, for example, that parent objects must be saved before child objects in the Hibernate version. The original OO7 also had an implementation per product benchmarked. I want to create one code base that can switch between different products using configuration. The ETH students have attempted this already, but I am thinking of a different approach.
• Configurations for larger datasets.
• More concurrent clients.
• Investigate whether I could use OO7 in a MapReduce world.
• Investigate whether OO7 can be used to benchmark column-oriented databases.
• Include a Hibernate+Oracle combination.

Q.9 NoSQL/NRDBMS solutions are getting a lot of attention these days. Are there any plans to do a persistence performance comparison of NoSQL persistence frameworks in the future?

Pieter van Zyl: Yes, they will be incorporated. I still believe object databases are well suited to this environment. I am still not sure that people are using them in the correct situations. I sometimes suspect people jump onto a hot technology without really benchmarking or understanding their application needs.

Q.10 What is the future road map of your research?

Pieter van Zyl: Investigate clustering, caches, MapReduce, Column-oriented databases and investigate how to incorporate these into my benchmarking effort.

I would also love to get more implementation experience, either with a vendor or by building my own database.

Quoting Martin Fowler (Patterns of Enterprise Application Architecture): “Too often I’ve seen designs used or rejected because of performance considerations, which turn out to be bogus once somebody actually does some measurements on the real setup used for the application”

I believe that one should benchmark before making any technology decisions. People have a lot of opinions about what performs better, but there is usually not enough proof. There is a lot of noise in the market. Cut through it: benchmark and investigate for yourself.

For Additional Reading

Pieter van Zyl, University of Pretoria.
Performance investigation into selected object persistence stores.
Dissertation | Advanced | English | LINK DOWNLOAD (187 pages PDF) | February 2010 | ***

OO7J benchmark. Pieter van Zyl and Espresso Research Group.
Software | Advanced | English | LINK User Manual | The code is available on SourceForge | Access is through the SVN directory | February 2010.

Related Articles

Object-Oriented or Object-Relational? An Experience Report from a High-Complexity, Long-Term Case Study.
Peter Baumann, Jacobs University Bremen
Object-Oriented or Object-Relational? We unroll this question based on our experience with the design and implementation of the array DBMS rasdaman which offers storage and query language retrieval on large, multi-dimensional arrays such as 2-D remote sensing imagery and 4-D atmospheric simulation results. This information category is sufficiently far from both relational tuples and object-oriented pointer networks to achieve a “fair” comparison where no approach has an immediate advantage.
Paper | Intermediate | English | LINK DOWNLOAD (PDF)| July 2010|

More articles on ORM in the Object Databases section of ODBMS.ORG.

More articles on ORM in the OO Programming Section of ODBMS.ORG.

Mar 14 11

One minute of our time.

by Roberto V. Zicari

Many of us are very busy, engaged in activities, duties, hopes, worries, joy. This is our life.
But life can end within a few minutes or even less.

In dedication to those who suffer in Japan right now.

RVZ

Mar 5 11

“Marrying objects with graphs”: Interview with Darren Wood.

by Roberto V. Zicari

Is it possible to have both objects and graphs?
This is what Objectivity has done. They recently launched InfiniteGraph, a Distributed Graph Database for the Enterprise. Is InfiniteGraph a NoSQL database? How does it relate to their Objectivity/DB object database?
To know more about it, I have interviewed Darren Wood, Chief Architect, InfiniteGraph, Objectivity, Inc.

RVZ

Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?

Wood: I think at some level, there have always been requirements for data stores that don’t fit the traditional relational data model.
Objectivity (and many others) have built long term successful businesses meeting scale and complexity requirements that generally went beyond what was offered by the RDBMS market. In many cases, the other significant player in this market was not a generally available alternative technology at all, but “home grown” systems designed to meet a specific data challenge. This included everything from application managed file based systems to the more well known projects like Dynamo (Amazon) and BigTable (Google).

This trend is not in any way a failing of RDBMS products, it is simply a recognition that not all data fits squarely into rows and columns and not all data access patterns can be expressed precisely or efficiently in SQL.
Over time, this market has simply grown to a point where it makes sense to start grouping data storage, consistency and access model requirements together and create solutions that are designed specifically to meet them.

Q2. There has been recently a proliferation of New data stores, such as document stores, and NoSQL databases: What are the differences between them?

Wood: Although NoSQL has been broadly categorized as a collection of Graph, Document, Key-Value, and BigTable style data stores, it is really a collection of alternatives which are best defined by the use case for which they are most suited.

Dynamo-derived key-value stores, for example, mostly trade off data consistency for extreme horizontal scalability, with a high tolerance to host failure and network partitioning. This is obviously suited to systems serving large numbers of concurrent clients (like a web property) that can tolerate stale or inconsistent data to some extent.

Graph databases are another great example of this: they treat relationships (edges) as first-class citizens and organize data physically to accelerate traversal performance across the graph. There is also typically a very “graph-oriented” API to simplify applications that view their data as a graph. This is a perfect example of how providing a database with a specific data model and access pattern can dramatically simplify application development and significantly improve performance.
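
To make the contrast concrete, here is a generic sketch of the access pattern such systems optimize (plain Java, not the InfiniteGraph API): edges are first-class, directly navigable objects, so a neighbourhood traversal is a walk over adjacency rather than a series of join queries.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Generic illustration only; this is not the InfiniteGraph API.
class Vertex {
    final String id;
    final List<Edge> outgoing = new ArrayList<Edge>();
    Vertex(String id) { this.id = id; }
}

class Edge {
    final String label;
    final Vertex from;
    final Vertex to;
    Edge(String label, Vertex from, Vertex to) {
        this.label = label;
        this.from = from;
        this.to = to;
        from.outgoing.add(this);            // edge registered on its source vertex
    }
}

class Traversal {
    // Breadth-first walk following edges directly, up to the given number of hops.
    static Set<Vertex> neighbourhood(Vertex start, int hops) {
        Set<Vertex> visited = new LinkedHashSet<Vertex>();
        visited.add(start);
        Deque<Vertex> frontier = new ArrayDeque<Vertex>();
        frontier.add(start);
        for (int h = 0; h < hops && !frontier.isEmpty(); h++) {
            Deque<Vertex> next = new ArrayDeque<Vertex>();
            for (Vertex v : frontier) {
                for (Edge e : v.outgoing) {
                    if (visited.add(e.to)) {
                        next.add(e.to);
                    }
                }
            }
            frontier = next;
        }
        return visited;
    }
}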

Q3. Objectivity has recently launched InfiniteGraph, a Distributed Graph Database for the Enterprise. Is InfiniteGraph DB a NoSQL graph database? How does it relate to your Objectivity/DB object database?

Wood: InfiniteGraph had been an idea within our company for some time. Objectivity has a long and successful track record providing customers with a scalable, distributed data platform and in many cases the underlying data model was a graph. In various customer accounts our Systems Engineers would assist building custom applications to perform high speed data ingest and advanced relationship analytics.
For the most part, there was a common “graph analytics” theme emerging from these engagements and the differences were really only in the specifics of the customer’s domain or industry.

Eventually, an internal project began with a focus on management and analysis of graph data. It took the high performance distributed data engine from Objectivity/DB and married it to a graph management and analysis platform which makes development of complex graph analytic applications significantly easier. We were very happy with what we had achieved in the first iteration and eventually offered it as a public beta under the InfiniteGraph name. A year later we are now into our third public release and adding even more features focused around scaling the graph in a distributed environment.

Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed object cache over multiple machines. How do these new data stores compare with object-oriented databases?

Wood: I think that most of the technologies you mention generally have unique properties that appeal to some segment of the database market. In some cases it's the data model (like the flexibility of document-model databases), and in others it is the ease of use or deployment and a simple interface that will attract users (which is often said about CouchDB). Systems architects evaluating these solutions will invariably balance ease of use and flexibility against other requirements like performance, scalability, and the supported consistency model.

Object databases will be treated in much the same way: they handle complex object models really well and minimize the impedance mismatch between your application and the database, so in cases where this is an important requirement they will always be considered a good option.
Of course, not all OODBMS implementations are the same, so even within this genre there are significant differences that can make one a clearer choice for a particular use case.

Q5. With the emergence of cloud computing, new data management systems have surfaced.
What is in your opinion the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?

Wood: This is an area of great interest to us. Coming from a distributed data background, we are well positioned to take advantage of the trend for sure. I think there are a couple of major things that distinguish traditional “distributed systems” from the typical cloud environment.

Firstly, products that live in the cloud need to be specifically designed for tolerance to host failures and network partitions or inconsistencies. Cloud platforms are invariably built on low-cost commodity hardware and provide a virtualized environment that offers a lower class of reliability than that seen in dedicated "enterprise-class" hardware. This essentially requires that availability be built into the software to some extent, which translates to replication and redundancy in the data world.

Another important requirement in the cloud is ease of deployment. The elasticity of a cloud environment generally leads to "on the fly" provisioning of resources, so spinning up nodes needs to be simple from a deployment and configuration perspective. When you look at the technologies (like Dynamo-based KV stores) that have their roots in early cloud-based systems, there is a real focus on these areas.

Q6. Will cloud store projects end up with support for declarative queries and declarative secondary keys?

Wood: If you look at most of the KV-type stores out there, a lot of them are now turning some focus to what is being termed "search". Cross-population indexing and complex queries were not the primary design goal of these systems, but many users find that "some" capability in this area is necessary, especially if the store is being used exclusively as the persistence engine of the system.

An alternative to this is actually using multiple data stores side by side (so-called polyglot persistence) and directing each query at the system that has been best designed to handle it. We are seeing a lot of this in the graph database market.

Q7. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?

Wood: I agree totally that NoSQL has nothing to do with SQL; it's an unfortunate term which is often misunderstood. It is simply about choosing the data store with the mix of trade-offs and characteristics that is the closest match to your application's requirements. The problem was that, for the most part, people are familiar with RDBMS/SQL, so NoSQL became a "place" for other, lesser-known data stores and models to call home (hence the unfortunate name).

A good example is ACID, mentioned in the excerpt above. ACID in itself is one of these choices: some use cases require it and others don't. Arguing about the efficiency of its implementation is somewhat of a moot point if it isn't a requirement at all! In other cases, like graph databases, the physical data model can dramatically affect performance for certain types of navigational queries which the RDBMS data model and SQL query language are simply not designed for.

I think this post was a reaction to the idea that NoSQL somehow threatened RDBMS and SQL, when that isn't the case at all. There is still a large proportion of data problems out there that are very well suited to the RDBMS model and its plethora of implementations.

Q8. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes. Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?

Wood: Certainly partitioning and replication of an RDBMS can be used to great effect where the application suits the relational model well; however, this doesn't change its suitability for a specific task. Even a set of indexed partitions doesn't make sense if a KV store is all that is required. Using a distributed hashing algorithm doesn't require executing lookups on every node, so there is no reason to pay the overhead of a generalized query when it is not required. Of course this doesn't devalue the existence of a partitioned RDBMS at all, since there are many applications where this would be a perfect solution.
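To illustrate the distributed-hashing point, here is a minimal Java sketch (my own illustration, with invented names; not the routing code of any particular store). Consistent hashing maps each key to exactly one node on a hash ring, so a key lookup is routed to a single server rather than fanning out across the cluster the way a general distributed query might:

```java
import java.util.*;

// Minimal consistent-hashing sketch: a key maps to exactly one node on a hash
// ring, so a get() contacts a single server instead of broadcasting the lookup.
final class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node, int virtualNodes) {
        // Several positions ("virtual nodes") per server smooth out the distribution.
        for (int i = 0; i < virtualNodes; i++)
            ring.put((node + "#" + i).hashCode(), node);
    }

    String nodeFor(String key) {
        // Walk clockwise from the key's hash to the first node position.
        Map.Entry<Integer, String> e = ring.ceilingEntry(key.hashCode());
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        for (String n : List.of("nodeA", "nodeB", "nodeC")) ring.addNode(n, 64);
        // The client routes directly; no per-query work happens on the other nodes.
        System.out.println("user:42 lives on " + ring.nodeFor("user:42"));
    }
}
```

Adding or removing a node only remaps the keys adjacent to it on the ring, which is part of what makes this scheme attractive for elastic clusters.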

Recent Related Interviews/Videos:

“Distributed joins are hard to scale”: Interview with Dwight Merriman.

On Graph Databases: Interview with Daniel Kirstenpfad.

Interview with Rick Cattell: There is no “one size fits all” solution.

Don White on “New and old Data stores”.

Watch the Video of the Keynote Panel “New and old Data stores”

Robert Greene on "New and Old Data stores".

Feb 14 11

Objects in Space.

by Roberto V. Zicari

–The biggest data processing challenge to date in astronomy: The Gaia mission.–

I became aware of a very interesting project at the European Space Agency: the Gaia mission. It is considered by experts to be "the biggest data processing challenge to date in astronomy".

I wanted to know more. I have interviewed William O'Mullane, Science Operations Development Manager at the European Space Agency, and Vik Nagjee, Product Manager, Core Technologies, at InterSystems Corporation, both deeply involved in the Proof-of-Concept for the data management part of this project.

Hope you'll enjoy this interview.
RVZ

Q1. Space missions are long-term, generally 15 to 20 years in length. The European Space Agency plans to launch a satellite called Gaia in 2012. What is Gaia supposed to do?

O'Mullane:
Well, we have now learned that we launch in early 2013; delays are fairly commonplace in complex space missions. Gaia is ESA's ambitious space astrometry mission, the main objective of which is to astrometrically and spectro-photometrically map 1,000 million celestial objects (mostly in our galaxy) with unprecedented accuracy. The satellite will downlink close to 100 TB of raw telemetry data over 5 years.
To achieve its required astrometric accuracy of a few tens of microarcseconds, a highly involved processing of this data is required. The data processing is a pan-European effort undertaken by the Gaia Data Processing and Analysis Consortium. The result is a phase-space map of our Galaxy, helping to untangle its formation and evolution.

Q2. In your report, "Charting the Galaxy with the Gaia Satellite", you write that "the Gaia mission is considered the biggest data processing challenge to date in astronomy. Gaia is expected to observe around 1,000,000,000 celestial objects". What kind of information is Gaia expected to collect on celestial objects? And what kind of information does Gaia itself need in order to function properly?

O'Mullane:
Gaia has two telescopes and a Radial Velocity Spectrometer (RVS). Images are taken simultaneously from each telescope to calculate position on the sky and magnitude. Two special sets of CCDs at the end of the focal plane record images in red and blue bands to provide photometry for every object. Then, for a large number of objects, spectrographic images are recorded.
From the combination of this data we derive very accurate positions, distances, and motions for celestial objects. Additionally we get metallicities, temperatures, etc., which allow the objects to be classified into different star groups. We use the term celestial objects since not all objects that Gaia observes are stars.
Gaia will, for example, see many asteroids and improve their known orbits. Many planets will be detected (though not directly observed).

As for any satellite, Gaia has a barrage of on-board instrumentation such as gyroscopes, star trackers, and thermometers, which are read out and downlinked as 'housekeeping' telemetry. All of this information is needed to track Gaia's state and position. Perhaps more unusual for Gaia is the use of data taken through the scientific instruments for 'self' calibration. Gaia is all about beating noise with statistics.

Q3. All Gaia data processing software is written in Java. What are the main functionalities of the Gaia data processing software? What were the data requirements for this software?

O'Mullane:
The functionality is to process all of the observed data and reduce it to a set of star catalogue entries which represent the observed stars. The data requirements vary – in the science operations centre we downlink 50-80 GB per day, so there is a requirement on the daily processing software to process this volume of data in 8 hours. This is the strictest requirement, because if we are swamped by the incoming data stream we may never recover.

The astrometric solution involves a few TB of data extracted from the complete set. That process runs less frequently (every six months to one year), and the requirement on that software is to do its job in 4 weeks.
There are many systems like this – for photometry, spectroscopy, non-single stars, classification, variability analysis – and each has its own requirements on data volume and processing time.
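As a rough back-of-the-envelope reading of those requirements (my own arithmetic, not figures quoted in the interview; the up-to-40 iterations and the ~10 TB astrometric subset are mentioned later, in the answers to Q5 and Q7), the daily requirement implies a sustained input rate of only a few MB/s, while repeated passes over the astrometric subset in four weeks imply aggregate scan rates roughly two orders of magnitude higher:

$$
\frac{80\ \text{GB}}{8\ \text{h}} = 10\ \text{GB/h} \approx 2.8\ \text{MB/s},
\qquad
\frac{40 \times 10\ \text{TB}}{4\ \text{weeks}} \approx \frac{400\ \text{TB}}{2.4\times 10^{6}\ \text{s}} \approx 165\ \text{MB/s}.
$$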

Q4. What are the main technical challenges that this project poses with respect to data processing, manipulation, and storage?

Nagjee:
The sheer volume of data that is expected to be captured by the Gaia satellite poses a technical challenge. For example, 1 billion celestial objects will be surveyed, and roughly 1000 observations (100*10) will be captured for each object, totaling around 1000 billion observations; each observation is represented as a discrete Java object and contains many properties that express various characteristics of these celestial bodies.
The trick is not only to capture this information and ingest (insert) it into the database very quickly, but also to do so as discrete (non-BLOB) objects so that downstream processing is straightforward. Additionally, the system needs to be optimized both for very fast inserts (writes) and for very fast queries (reads). All of this needs to be done in a very economical and cost-effective fashion – reducing power, energy, and cooling requirements, and keeping costs as low as possible.
These are some of the challenges that this project poses.

Q5. A specific part of the Gaia data processing software is the so called Astrometric Global Iterative Solution (AGIS). This software is written in Java. What is the function of such module? And what specific data requirements and technical challenges does it have?

Nagjee:
AGIS is a solution or program that iteratively turns the raw data into meaningful information.

O'Mullane:
AGIS takes a subset of the data, so-called well-behaved or primary stars, and basically fits the observations to the astrometric model. This involves refining the satellite attitude (using the known basic angle and the multiple simultaneous observations in each telescope) and the calibration (given an attitude and known star positions, the stars should appear at definite points on the CCD at given times). It's a huge (too huge, in fact) matrix inversion, but we do a block-iterative estimation based on the conjugate gradient method. This requires looping (or iterating) over the observational data up to 40 times to converge the solution. So we need the I/O to be reasonable – we know the access pattern, though, so we can organize the data such that the reads are almost serial.
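Schematically (this is my paraphrase of the standard formulation, not the exact AGIS equations, which are set out in the AGIS paper listed under "For additional reading"), the fit O'Mullane describes is a very large linearized least-squares problem,

$$
\min_{x}\ \lVert A\,x - b\rVert^{2}, \qquad x = (s,\ a,\ c),
$$

where x stacks the source (astrometric) unknowns s, the attitude unknowns a and the calibration unknowns c; b collects the observed transit times; and A is the design matrix of the linearized observation model. Conjugate-gradient-style methods solve the normal equations (A^T A) x = A^T b without ever forming or inverting A^T A: each iteration needs only products of the form A^T (A p), i.e. one sweep over the observations. That is why AGIS iterates over the dataset up to ~40 times instead of performing a direct matrix inversion, and why nearly serial reads matter so much.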

Q6. In your high-level architecture you use two databases, a so-called Main Database and an AGIS Database. What is the rationale for this choice, and what functionality is expected from each of the two databases?

O'Mullane:
Well, the Main Database will hold all data from Gaia and the products of processing. This will grow from a few TB to a few hundred TB during the mission. It's a large repository of data. Now, we could consider lots of tasks reading and updating this, but the accounting would be a nightmare – and astronomers really like to know the provenance of their results. So we made the Main Database a little slower to update, declaring each version immutable. One version is then the input to all processing tasks, the outputs of which form the next version.
AGIS meanwhile (like other tasks) only requires a subset of this data. Often the tasks are iterative and need to scan over the data frequently or access it in some particular way. Again, it is easier for the designer of each task to also design and optimize his own data system than to try to make one system work for all.

Q7. You write that the AGIS database will contain data for roughly 500,000,000 sources (totaling 50,000,000,000 observations). This is roughly 100 Terabytes of Java data objects. Can you please tell us what the attributes of these Java objects are, and what you plan to do with these 100 Terabytes of data? Is scalability an important factor in your application? Do you have specific time requirements for handling the 100 Terabytes of Java data objects?

O'Mullane:
Last question first – 4 weeks is the target time for running AGIS; that's ~40 passes through the dataset. The 500 million sources plus observations are a subset of the Gaia data; we estimate it to be around 10 TB. Reduced to its quintessential element, each object observation is a time: the time when the celestial image crosses the fiducial line on the CCD. For one transit there are 10 such observations, with an average of 80 transits per source. Carried with that, over the 5-year mission lifetime, is the small cut-out image from the CCD. In addition there are various other pieces of information, such as which telescope was used and the residuals, which are carried around for each observation. For each source we calculate the astrometric parameters, which you may see as 6 numbers plus errors. These are: the position (alpha and delta), the distance or parallax (varPi), the transverse motion (muAlpha and muDelta), and the radial velocity (muRvs).
Then there is a magnitude estimation and various other parameters. We see no choice but to scale the application to achieve the run times we desire, and it has always been designed as a distributed application.
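As a purely illustrative Java sketch (invented class name and indicative comments; not the actual Gaia/AGIS data model), the per-source result described above could be pictured as a record of six parameters plus their uncertainties:

```java
// Hypothetical holder for the per-source astrometric result described above.
// Field names follow the interview; the descriptions are conventional
// astrometric interpretations, not taken from the Gaia data model.
final class AstrometricSolution {
    double alpha, delta;      // position on the sky
    double varPi;             // parallax, i.e. the distance measure
    double muAlpha, muDelta;  // transverse (proper) motion components
    double muRvs;             // radial component, labelled "radial velocity" above
    double[] errors = new double[6]; // one uncertainty per parameter, same order
}
```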

Nagjee:
The AGIS Data Model comprises several objects and is defined in terms of Java interfaces. Specifically, AGIS treats each observation as a discrete AstroElementary object. As described in the paper, the AstroElementary object contains various properties (mostly of the IEEE long data type) and is roughly 600 bytes on disk. In addition, the AGIS database contains several supporting indexes which are built during the ingestion phase. These indexes assist with queries during AGIS processing, and also provide fast ad-hoc reporting capabilities. Using InterSystems Caché, with its Caché eXTreme for Java capability, multiple AGIS Java programs will ingest the 100 Terabytes of data generated by Gaia as 50,000,000,000 discrete AstroElementary objects within 5 days (yielding roughly 115,000 object inserts per second, sustained over 5 days).
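For reference, the quoted sustained rate follows directly from the object count and the five-day ingest window (my arithmetic):

$$
\frac{50{,}000{,}000{,}000\ \text{objects}}{5\ \text{days} \times 86{,}400\ \text{s/day}} \approx 1.16 \times 10^{5}\ \text{objects/s},
$$

which is consistent with the roughly 115,000 inserts per second quoted above.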

Internally, we will spread the data across several database files within Caché using our Global and Subscript Mapping capabilities (you can read more about these capabilities here), while providing seamless access to the data across all ranges. The spreading of the data across multiple database files is mainly done for manageability.

Q8. You conducted a proof-of-concept with Caché for the AGIS database. What were the main technical challenges of such proof-of-concept and what are the main results you obtained? Why did you select Caché and not a relational database for the AGIS database?

O'Mullane:
We have worked with Oracle for years, and we can run AGIS on Derby. We have tested MySQL and PostgreSQL (though not with AGIS). To make the relational systems work fast enough we had to reduce our row count; this we did by effectively combining objects into BLOBs, with the result that the RDBMS became more like a file system. Tests with Caché have shown we can achieve the read and write speeds we require without having to group data into BLOBs, which is obviously more flexible. It may have been possible to do this with another product, but each time we had a problem InterSystems came (quickly) and showed us how to get around it or fixed it. For the recent test we requested the writing of a specific representative dataset within 24 hours on our hardware – this was achieved in 12 hours. Caché is also a more cost-effective solution.

Nagjee:
The main technical challenge of such a Proof-of-Concept is to be able to generate realistic data and load on the system, and to tune and configure the system to meet the strict insert requirements while still optimizing sufficiently for downstream querying of the data.
The white paper (.pdf) discusses the results, but in summary we were able to ingest 5,000,000,000 AstroElementary objects (roughly 1/10th of the eventual projected amount) in around 12 hours. Our target was to ingest this data within 24 hours, so we were successful in doing it in half the time.

Caché is an extremely high-performance database, and as the Proof-of-Concept outlined in the white paper proves, Caché is more than capable of handling the stringent time requirements imposed by the Gaia project, even when run on relatively modest hardware.

Q9. How do you handle data ingestion in the AGIS database, and how do you publish back updated objects into the main database?

Nagjee:
We use the Caché eXTreme for Java capability for the interaction between the Java AGIS application and the Caché database.

Q10. One main component of the proof-of-concept is the new Caché eXTreme for Java. Why is it important, and how did it get used in the proof-of-concept? How do you ensure low latency data storage and retrieval in the AGIS solution?

Nagjee:
Caché eXTreme for Java is a new capability of the InterSystems Caché database that exposes Caché's enterprise and high-performance features to Java via JNI (the Java Native Interface). It enables "in-process" communication between Java and Caché, thereby providing extremely low-latency data storage and retrieval.

Q11. What are the next steps planned for AGIS project and the main technical challenges ahead?

Nagjee:
The next series of tests will focus on ingesting even more data sources – up to 50% of the total projected objects. Next, we'll work on tuning the application and system for reads (queries), and will also continue to explore additional deployment options for the read/query phase (for example, ESA and InterSystems are looking at deploying hundreds of AGIS nodes in the Amazon EC2 cloud so as to reduce the amount of hardware that ESA has to purchase).

O'Mullane:
Well on the technical side we need to move AGIS definitively to Caché for production. There are always communication bottlenecks to be investigated which limit scalability. AGIS itself requires several further developments such as a more robust outlier scheme and a more complete set of calibration equations. AGIS is in a good state but needs more work to deal with the mission data.

———————-
William O'Mullane, Science Operations Development Manager, European Space Agency.
William O'Mullane has a background in Computer Science and has worked on space science projects since 1996, when he assisted with the production of the Hipparcos CD-ROMs. During this period he was also involved with the Planck and Integral science ground segments, as well as contemplating the Gaia data processing problem. From 2000 to 2005 Wil worked on developing the US National Virtual Observatory (NVO) and on the Sloan Digital Sky Survey (SDSS) in Baltimore, USA. In August 2005 he rejoined the European Space Agency as Gaia Science Operations Development Manager to lead the ESAC development effort for the Gaia Data Processing and Analysis Consortium.

Vik Nagjee, Product Manager, Core Technologies, InterSystems Corporation.
Vik Nagjee is the Product Manager for Core Technologies in the Systems Development group at InterSystems. He is responsible for the areas of Reliability, Availability, Scalability, and Performance for InterSystems’ core products – Caché and Ensemble. Prior to joining InterSystems in 2008, Nagjee held several positions, including Security Architect, Development Lead, and head of the performance & scalability group, at Epic Systems Corporation, a leading healthcare application vendor in the US.
————————-
For additional reading:

The Gaia Mission :
1. “Pinpointing the Milky Way” (download paper, .pdf).

2. “Gaia: Organisation and challenges for the data processing” (download paper, .pdf).

3. “Gaia Data Processing Architecture 2009” (download paper, .pdf).

4. “To Boldly Go Where No Man has Gone Before: Seeking Gaia’s Astrometric Solution with AGIS” (download paper, .pdf).

The AGIS Database

5. "Charting the Galaxy with the Gaia Satellite" (download white paper, .pdf).