
"Trends and Information on AI, Big Data, Data Science, New Data Management Technologies, and Innovation."


Mar 3 12

Data Modeling for Analytical Data Warehouses. Interview with Michael Blaha.

by Roberto V. Zicari

“Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access controlled as well as logged for audits” — Michael Blaha.

This is the third interview with our expert Dr. Michael Blaha on the topic of database modeling. This time we look at the issue of data design for analytical data warehouses.

In previous interviews we looked at how good UML is for database design, and how good use cases are for database modeling.

Hope you'll find this interview interesting. I encourage the community to post comments.

RVZ

Q1: What is the difference between data warehouses and day-to-day business applications?

Michael Blaha: Operational (day-to-day business) applications serve the routine needs of a business: handling orders, scheduling manufacturing runs, servicing patients, and generating financial statements.

Operational applications have many short transactions that must process quickly. The transactions both read and write.
Well-written applications pay attention to data quality, striving to ensure correct data and avoid errors.

In contrast, analytical (data warehouse) applications step back from the business routine and analyze data that accumulates over time. The idea is to gain insight into business patterns that are overlooked when responding to routine needs. Data warehouse queries can have a lengthy execution time as they process reams of data, searching for underlying patterns.

End users read from a data warehouse, but they don't write to it. Rather, writing occurs as the operational applications supply new data that is added to the data warehouse.

Q2: How do you approach data modeling for data warehouse problems?

Michael Blaha: For operational applications, I use the UML class model for conceptual data modeling. (I often use Enterprise Architect.) The notation is more succinct than conventional database notations and promotes abstract thinking.
In addition, the UML class model is understandable for business customers as it defers database design details. And, of course, the UML reaches out to the programming side of development.

In contrast, for analytical applications, I go straight to a database notation. (I often use ERwin.) Data warehouses revolve around facts and dimensions. The structure of a data warehouse model is so straightforward (unlike the model of an operational application) that a database notation alone suffices.

For a business user, the UML model and the conventional data model look much the same for a data warehouse.
The programmers of a data warehouse (the ETL developers) are accustomed to database notations (unlike the developers in day-to-day applications).

As an aside, I note that in a past book (A Manager's Guide to Database Technology) I used a UML class model for analytical modeling. In retrospect I now realize that was a forced fit. The class model does not deliver any benefits for data warehouses and it's an unfamiliar technology for data warehouse developers, so there's no point in using it there.
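
To make the fact-and-dimension structure mentioned above concrete, here is a minimal star-schema sketch. It is not from the interview; the table and column names are illustrative assumptions, and sqlite3 merely stands in for a real warehouse engine.

```python
import sqlite3

# Minimal, hypothetical star schema: one fact table surrounded by dimensions.
# Names (fact_sales, dim_date, dim_product, dim_customer) are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,   -- e.g. 20120301
    calendar_date TEXT,
    month         TEXT,
    year          INTEGER
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT,
    region        TEXT
);
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    quantity     INTEGER,
    amount       REAL
);
""")

# A typical analytical query: slice the fact table by two dimensions.
query = """
SELECT d.year, p.category, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.year, p.category;
"""
print(con.execute(query).fetchall())   # empty list until data is loaded
```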

Q3: Is there any synergy between non-relational databases (NoSQL, Object Databases) and data warehouses?

Michael Blaha: Not for conventional data warehouses that are set-oriented. Mass quantities of data must be processed in bulk. Set-oriented data processing is a strength of relational databases and the SQL language. Furthermore, tables are a good metaphor for facts and dimensions and the data is intrinsically strongly typed.

NoSQL (Hadoop) is being used for mining Web data. Web data is by its nature unstructured and much different from conventional data warehouses.

Q4: How do data warehouses achieve fast performance?

Michael Blaha: The primary technique is pre-computation, by anticipating the need for aggregate data and computing it in advance. Indexing is also important for data warehouses, but less important than with operational applications.
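
To illustrate the pre-computation idea, here is a hedged sketch continuing the star-schema example above (table names are assumptions): an aggregate table is populated ahead of time, typically by the ETL load, so that reports read the small summary instead of scanning the fact table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, amount REAL);
CREATE TABLE agg_sales_by_day (date_key INTEGER PRIMARY KEY, total_amount REAL);
""")

# Load some detail rows (in a real warehouse this is the ETL feed).
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(20120301, 1, 10.0), (20120301, 2, 5.0), (20120302, 1, 7.5)])

# Pre-compute the aggregate once, ahead of the queries that will need it.
con.execute("""
INSERT INTO agg_sales_by_day
SELECT date_key, SUM(amount) FROM fact_sales GROUP BY date_key
""")

# Reports now read the small summary table instead of scanning the fact table.
print(con.execute("SELECT * FROM agg_sales_by_day ORDER BY date_key").fetchall())
# -> [(20120301, 15.0), (20120302, 7.5)]
```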

Q5: What are some difficult issues with data warehouses?

Michael Blaha:
Abstraction. Abstraction is needed to devise the proper facts and dimensions. It is always difficult to perform abstraction.

Conformed dimensions. A large data warehouse schema must be flexible for mining. This can only be achieved if data is on the same basis; hence the need for conformed dimensions.
For example, there must be a single definition of Customer that is used throughout the warehouse (a minimal sketch follows this list).

Size. The sheer size of schema and data is a challenge.

Data cleansing. Many operational applications are old legacy code. Often their data is flawed and may need to be corrected for a data warehouse.

Data integration. Many data warehouses combine data from multiple applications. The application data overlaps and must be reconciled.

Security. Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access controlled as well as logged for audits.
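
As promised in the conformed-dimensions item above, a small illustrative sketch (all table and column names are assumptions): one shared customer dimension is referenced by fact tables from two subject areas, so analyses from both areas aggregate on the same definition of Customer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- A single, conformed Customer dimension shared by every subject area.
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);

-- Fact tables from different subject areas reference the SAME dimension,
-- so "customer" means exactly the same thing in sales and in support analyses.
CREATE TABLE fact_sales   (customer_key INTEGER REFERENCES dim_customer(customer_key), amount REAL);
CREATE TABLE fact_support (customer_key INTEGER REFERENCES dim_customer(customer_key), tickets INTEGER);
""")

con.execute("INSERT INTO dim_customer VALUES (1, 'Acme Corp', 'EMEA')")
con.execute("INSERT INTO fact_sales VALUES (1, 120.0)")
con.execute("INSERT INTO fact_support VALUES (1, 3)")

# Because the dimension is conformed, results from both areas line up by customer.
print(con.execute("""
SELECT c.name,
       (SELECT SUM(amount)  FROM fact_sales   s WHERE s.customer_key = c.customer_key) AS revenue,
       (SELECT SUM(tickets) FROM fact_support t WHERE t.customer_key = c.customer_key) AS tickets
FROM dim_customer c
""").fetchall())   # -> [('Acme Corp', 120.0, 3)]
```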

Q6: What kind of metadata is associated with a data warehouse and is there a role for Object Databases with this?

Michael Blaha: Maybe. Data warehouse metadata includes source-to-target mappings, definitions (of facts, dimensions, and attributes), as well as the organization of the data warehouse into subject areas. The metadata for a data warehouse is handled just like that of operational applications. The metadata has to be custom modeled and doesn't have a standard metaphor for structure like the facts and dimensions of a data warehouse. Relational databases, OO databases, and possibly other kinds of databases are all reasonable candidates.
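
A small, hypothetical sketch of what custom-modeled warehouse metadata can look like: source-to-target mappings and element definitions captured as plain data structures. The field names are assumptions for illustration, not a standard.

```python
from dataclasses import dataclass

@dataclass
class SourceToTargetMapping:
    source_system: str
    source_column: str
    target_table: str        # a fact or dimension table in the warehouse
    target_column: str
    transformation: str      # free-text or expression applied by the ETL job

@dataclass
class ElementDefinition:
    name: str
    kind: str                # "fact", "dimension" or "attribute"
    subject_area: str
    definition: str          # agreed business definition

metadata = {
    "mappings": [
        SourceToTargetMapping("crm", "cust_nm", "dim_customer", "customer_name", "trim + title case"),
    ],
    "definitions": [
        ElementDefinition("dim_customer", "dimension", "Sales",
                          "One row per legal customer entity, conformed across subject areas."),
    ],
}

for m in metadata["mappings"]:
    print(f"{m.source_system}.{m.source_column} -> {m.target_table}.{m.target_column}")
```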

Q7. In a recent interview, Florian Waas, EMC/Greenplum, said “in the Big Data era the old paradigm of shipping data to the application isn’t working any more. Rather, the application logic must come to the data or else things will break: this is counter to conventional wisdom and the established notion of strata within the database stack. Instead of stand-alone products for ETL, BI/reporting and analytics we have to think about seamless integration: in what ways can we open up a data processing platform to enable applications to get closer? What language interfaces, but also what resource management facilities can we offer? And so on.”
What is your view on this?

Michael Blaha: It’s well known that, to get good performance from relational database applications, stored procedures must be used. Stored procedures are logic that lives inside the database kernel. They circumvent much of the overhead that is incurred by shuttling back and forth between an application process and the database process. So the stored procedure experience is certainly consistent with this comment.

What I try to do in practice is think in terms of objects. Relational database tables and stored procedures are analogous to objects with methods. I put core functionality that is likely to be reusable and computation intensive into stored procedures. I put lightweight functionality and functionality that is peculiar to an application outside the database kernel.
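
A minimal sketch of that split, not taken from the interview: reusable, computation-intensive aggregation lives in a stored procedure inside the database, while the lightweight report formatting stays in the client. The connection settings, the orders table and the procedure name are illustrative assumptions; a reachable MySQL server and the mysql-connector-python package are assumed.

```python
import mysql.connector  # pip install mysql-connector-python; a running MySQL server is assumed

# Hypothetical connection settings; the "orders" table is assumed to exist.
con = mysql.connector.connect(host="localhost", user="app", password="secret", database="sales")
cur = con.cursor()

# Core, reusable, computation-intensive logic lives inside the database kernel:
# the aggregation runs next to the data, with no per-row round trips to the client.
cur.execute("DROP PROCEDURE IF EXISTS monthly_revenue")
cur.execute("""
CREATE PROCEDURE monthly_revenue(IN p_year INT)
BEGIN
    SELECT MONTH(order_date) AS month, SUM(amount) AS revenue
    FROM orders
    WHERE YEAR(order_date) = p_year
    GROUP BY MONTH(order_date);
END
""")

# Lightweight, application-specific logic stays outside the kernel:
# here the client only formats the result for a report.
cur.callproc("monthly_revenue", (2012,))
for result in cur.stored_results():
    for month, revenue in result.fetchall():
        print(f"2012-{month:02d}: {revenue:.2f}")
con.close()
```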

Q8. Hadoop is the system of choice for Big Data and Analytics. How do you approach data modeling in this case?

Michael Blaha: I have no experience with Hadoop. My projects have involved structured data. In contrast Hadoop is architected for unstructured data as is often found on the Web.

Q9. A lot of insights are contained in unstructured or semi-structured data from Big Data applications. Does it make any sense to do data modeling in this case?

Michael Blaha: I have no experience with unstructured data. I have some experience with semi-structured data (XML / XSD). I routinely practice data modeling for XSD files. I published a paper in 2010 lamenting the fact that so
many SOA projects concern storage and retrieval of data and completely lack a data model. I’ve been working on modeling approaches for XSD files, but have not yet devised a solution to my satisfaction.

————————————————–
Michael Blaha is a partner at Modelsoft Consulting Corporation.
Dr. Blaha is recognized as one of the world’s leading authorities on databases and data modeling. He has more than 25 years of experience as a consultant and trainer in conceiving, architecting, modeling, designing, and tuning databases for dozens of major organizations around the world. He has authored six U.S. patents, six books, and many papers. Dr. Blaha received his doctorate from Washington University in St. Louis and is an alumnus of GE Global Research in Schenectady, New York.

Related Posts

Use Cases and Database Modeling — An interview with Michael Blaha.

– How good is UML for Database Design? Interview with Michael Blaha.

Resources

ODBMS.org: Free Downloads and Links on various data management technologies:
Object Databases
NoSQL Data Stores
Graphs and Data Stores
Cloud Data Stores
Object-Oriented Programming
Entity Framework (EF) Resources
ORM Technology
Object-Relational Impedance Mismatch
Databases in general
Big Data and Analytical Data Platforms

##

Feb 20 12

A super-set of MySQL for Big Data. Interview with John Busch, Schooner.

by Roberto V. Zicari

“Legacy MySQL does not scale well on a single node, which forces granular sharding and explicit application code changes to make them sharding-aware and results in low utilization of servers” — Dr. John Busch, Schooner Information Technology

A super-set of MySQL suitable for Big Data? On this subject, I have interviewed Dr. John Busch, Founder, Chairman, and CTO of Schooner Information Technology.

RVZ

Q1. What are the limitations of MySQL when handling Big Data?

John Busch: Legacy MySQL does not scale well and uses single-threaded asynchronous replication. Its poor scaling forces granular sharding across many servers and explicit application code changes to make applications sharding-aware. Its single-threaded asynchronous replication results in slave lag and data inconsistency, and it requires complex manual fail-over with downtime and data loss. The net result is low utilization of servers, server sprawl, limited service availability, limited data integrity, and complex programming and administration. These are all serious problems when handling Big Data.
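
To illustrate what "sharding-aware application code" means in practice (the burden described above), here is a minimal hypothetical sketch: the application itself must hash the sharding key and pick the right server before it can run a query. The shard map and hostnames are invented for illustration.

```python
import hashlib

# Hypothetical shard map the application has to know about and keep in sync.
SHARDS = [
    {"host": "mysql-shard-0.example.com", "port": 3306},
    {"host": "mysql-shard-1.example.com", "port": 3306},
    {"host": "mysql-shard-2.example.com", "port": 3306},
]

def shard_for(customer_id: int) -> dict:
    """Application-side routing: hash the sharding key to choose a server."""
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def fetch_orders(customer_id: int) -> str:
    shard = shard_for(customer_id)
    # In real code a connection to shard["host"] would be opened here; every
    # query path in the application needs this extra routing step, and any
    # query that is NOT keyed by customer_id has to fan out to all shards.
    return f"SELECT * FROM orders WHERE customer_id = {customer_id} -- routed to {shard['host']}"

print(fetch_orders(42))
```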

Q2. How is SchoonerSQL different with respect to Oracle/MySQL (5.1, 5.5, and the future 5.6)?

John Busch: Schooner licensed the source for MySQL and InnoDB directly from Oracle, with the right to enhance it in a compatible manner. Schooner made fundamental and extensive architectural and resource management advances to MySQL/InnoDB in order to make it enterprise class. SchoonerSQL fully exploits today’s commodity multi-core servers, flash memory, and high-performance networking while dramatically improving performance, availability, scalability, and cost of ownership relative to MySQL 5.X. SchoonerSQL advances include:

– very high thread-level parallelism with granular concurrency control and highly parallel DRAM <-> FLASH memory hierarchy management, enabling linear vertical scaling as a function of processor cores;
– tightly integrated (DRAM to DRAM) synchronous replication, coupled with fully parallel asynchronous replication, with automated fail-over within and between data centers, enabling the highest levels of availability with no data loss; and
– transparent, workload-aware, relational sharding with DBShards, enabling unlimited high-performance horizontal scaling.

SchoonerSQL is a super-set of Oracle MySQL/InnoDB 5.1/5.5/5.6, providing 100% compatibility for applications and data, while delivering order of magnitude improvements in availability, scalability, performance and cost of ownership.

Q3. How can SchoonerSQL achieve high availability (HA) with performance, scalability and at a reasonable cost?

John Busch: In the past, major trade-offs were required between performance, availability and TCO (total cost of ownership). Today’s commodity multi-core server, flash memory, and high speed networking, coupled with new database architectures and resource management algorithms, enable concurrently achieving radical improvements in performance, scalability, availability, data integrity, and cost of ownership. SchoonerSQL innovations incorporate fundamental database architecture and resource management advances, including:

· Linear vertical scaling, which fully utilizes modern commodity multi-core servers, providing 10:1 consolidation and capital and operating expense reduction;

· Unlimited horizontal scaling, which allows support of very large databases with high-performance and high availability and low cost of ownership using commodity hardware and standard SQL; and

· High-performance synchronous and parallel asynchronous replication with automated fail-over, which provides 99.999% HA with full data integrity and no loss of performance.

Q4. What is your relationship with Oracle/MySQL?

John Busch: Schooner is an Oracle gold partner, and an OEM and go-to-market partner of Oracle. Schooner licensed the source for MySQL and InnoDB directly from Oracle and developed SchoonerSQL, which is completely compatible with Oracle’s MySQL. SchoonerSQL is an enterprise-class database, and is targeted at customers requiring a mission-critical database. SchoonerSQL provides an order of magnitude improvement in performance, availability, scalability, and cost of ownership relative to Oracle’s MySQL 5.X.

Q5. What is special about SchoonerSQL’s transparent sharding?

John Busch: Beyond SchoonerSQL’s linear vertical scaling and clustering, which enables high-performance and high availability support of multi-terabyte databases, SchoonerSQL offers optional transparent relational sharding with DBShards to enable horizontal scaling across nodes for unlimited-size databases and unlimited scaling. SchoonerSQL’s DBShards relational transparent sharding is application-aware, based on analysis and optimization for the query and data access behavior of the specific customer workload.
Based on observed workload behavior, it optimally partitions the data across nodes and transparently replicates supporting data structures to eliminate cross-node communication in query execution.

Q6. How can you obtain scalability and high-performance with Big Data and at the same time offer SQL joins?

John Busch: SchoonerSQL’s DBShards relational transparent sharding optimally replicates supporting data structures used in dynamic queries. This is done on a workload-specific basis, based on dynamic query and data access patterns. As a result, there is no cross-node communication required to execute queries, and in particular for SQL joins, the data is fully coalesced at the client with Schooner libraries transparently invoking the involved nodes.
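
A toy sketch of the idea described above, not Schooner's implementation: the small supporting table is replicated in full to every node so joins run locally, and the client merely coalesces the per-shard results. Three in-memory sqlite3 databases stand in for shard nodes; the schema is illustrative.

```python
import sqlite3

# Three in-memory databases stand in for three shard nodes.
shards = [sqlite3.connect(":memory:") for _ in range(3)]

for i, shard in enumerate(shards):
    shard.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    """)
    # The small "supporting" table is replicated in full to every shard ...
    shard.executemany("INSERT INTO customers VALUES (?, ?)",
                      [(1, "EMEA"), (2, "APAC"), (3, "AMER")])
    # ... while the big table is partitioned: each shard holds only its slice.
    shard.execute("INSERT INTO orders VALUES (?, ?, ?)", (100 + i, (i % 3) + 1, 10.0 * (i + 1)))

# The join runs entirely inside each shard (no cross-node communication);
# the client only coalesces the per-shard partial results.
partials = []
for shard in shards:
    partials += shard.execute("""
        SELECT c.region, SUM(o.amount)
        FROM orders o JOIN customers c ON o.customer_id = c.customer_id
        GROUP BY c.region
    """).fetchall()

totals = {}
for region, amount in partials:
    totals[region] = totals.get(region, 0.0) + amount
print(totals)   # e.g. {'EMEA': 10.0, 'APAC': 20.0, 'AMER': 30.0}
```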

Q7. What is your take on MariaDB?

John Busch: MariaDB is trying to offer an alternative to MySQL and to Oracle.
SchoonerSQL is focused on offering a superior MySQL for mission-critical applications and services, with 100% MySQL compatibility, vastly superior performance, availability, scalability and TCO, all in partnership with the Oracle corporation.

Q8. You also offer a Memcached-based product (Membrain). Why Membrain?

John Busch: Membrain is a very high-performance and very high availability scalable key-value store supporting the memcached protocol. Schooner’s experience in the market is that SchoonerSQL and Membrain are both required and very complementary.
Schooner Membrain provides high performance, scalability, and high availability with low TCO for unstructured data, based on fully exploiting flash memory and multi-core servers with synchronous replication and transparent fail-over. SchoonerSQL provides high performance, scalability, and high availability with low TCO for structured data.

Q9. Talking about scalability and performance what are the main differences if the database is stored on hard drives, SAN, flash memory (Flashcache)? What happens when data does not fit in DRAM?

John Busch: SchoonerSQL and Schooner Membrain are designed to fully exploit the high IOPS of flash memory and SANs and the cores of today’s commodity servers. With SchoonerSQL, the performance when executing out of flash or SAN is almost the same as if everything were executing from DRAM memory, enabling the full utilization of today’s commodity multi-core servers with very high vertical scaling and consolidation at low cost.
SchoonerSQL also provides significant performance improvements with disk storage and flash cache. Measurements based on standard benchmarks show that Schooner offers much higher performance, consolidation, and scalability than any other MySQL or NoSQL product, and SchoonerSQL does this with 99.999% availability and much lower cost of ownership relative to legacy MySQL 5.X or other NoSQL offerings.

Q10. How do you differentiate yourselves from other NoSQL vendors (Key/Value stores, document-based databases and similar NoSQL databases)?

John Busch: SchoonerSQL and Membrain are unique in the industry. Schooner has over 20 filed patents on its advances in database/data store architecture and resource management.
SchoonerSQL and Membrain deliver order of magnitude improvements in performance, scalability, availability, and cost of ownership relative to any other MySQL or NoSQL, while maintaining 100% SQL and memcached compatibility.

Q11. What is your take on a database such as VoltDB?

John Busch: VoltDB is a large DRAM-only database. DRAM is expensive and volatile.
Schooner effectively utilizes parallel flash memory in a tightly integrated architecture, effectively exploiting flash, DRAM, multi-core, and multi-node scalability and availability. This results in superior cost of ownership and availability relative to VoltDB while providing high performance and unlimited scalability, all with 100% SQL compatibility.

Proliferation of Analytics

Q12. A/B testing, sessionization, bot detection, and pathing analysis all require powerful analytics on many petabytes of semi-structured Web data. How do you handle big semi-structured data?

John Busch: As we discussed above, SchoonerSQL provides optimized vertical scaling and clustering coupled with DBShards transparent relational horizontal scaling. This enables queries to be performed on unlimited semi-structured datasets, with 99.999% high availability, full data integrity, and the minimal number of commodity servers.

Q13. How do you see converging data from multiple data sources, both structured and unstructured?

John Busch: Today, SchoonerSQL and Schooner Membrain are often used in conjunction to provide support for both unstructured and structured data. There are also emerging standards for interfacing heterogeneous structured and unstructured data stores. Schooner believes these are very important, and will contribute to and support these standards in our products.

Q14. Does it make sense to use Apache Hadoop, MapReduce and MySQL together?

John Busch: Hadoop/MapReduce provide a new distributed computational model which is very appropriate for certain application classes, and which is receiving industry acceptance and traction.
Schooner intends to enhance our products to provide exceptional interoperability with Hadoop so that customers can use these products in conjunction in delivering their services.

———–
Dr. John Busch is the Founder, Chairman, and CTO of Schooner Information Technology.
Prior to Schooner, John was director of computer system architecture at Sun Microsystems Laboratories from 1999 through 2006. In this role, John led research in multi-core processors, multi-tier scale-out architectures, and advanced high-performance computer systems. John received the President’s Award for Innovation at Sun.
Prior to Sun, John was VP of engineering and business partnerships with Diba, Inc., co-founder, CTO and VP of engineering of Clarity Software, and director of computer systems R&D at Hewlett Packard.
John holds a Ph.D. in Computer Science from UCLA, an M.A. in Mathematics from UCLA, an M.S. in Computer Science from Stanford University, and attended the Sloan Program at Stanford.

Related Posts

Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB.

vFabric SQLFire: Better then RDBMS and NoSQL?

MariaDB: the new MySQL? Interview with Michael Monty Widenius.

Related Resources

ODBMS.org: Free Downloads and Links on:
*Analytical Data Platforms.
*Cloud Data Stores,
*Graphs and Data Stores
*NoSQL Data Stores,

##

Feb 1 12

On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum.

by Roberto V. Zicari

“With terabytes, things are actually pretty simple — most conventional databases scale to terabytes these days. However, try to scale to petabytes and it’s a whole different ball game.” –Florian Waas.

On the subject of Big Data Analytics, I interviewed Florian Waas (flw). Florian is the Director of Software Engineering at EMC/Greenplum and heads up the Query Processing team.

RVZ

Q1. What are the main technical challenges for big data analytics?

Florian Waas: Put simply, in the Big Data era the old paradigm of shipping data to the application isn’t working any more. Rather, the application logic must “come” to the data or else things will break: this is counter to conventional wisdom and the established notion of strata within the database stack.
Instead of stand-alone products for ETL, BI/reporting and analytics we have to think about seamless integration: in what ways can we open up a data processing platform to enable applications to get closer?
What language interfaces, but also what resource management facilities can we offer? And so on.

At Greenplum, we’ve pioneered a couple of ways to make this integration reality: a few years ago with a Map-Reduce interface for the database and more recently with MADlib, an open source in-database analytics package. In fact, both rely on a powerful query processor under the covers that automates shipping application logic directly to the data.

Q2. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?

Florian Waas: With terabytes, things are actually pretty simple — most conventional databases scale to terabytes these days. However, try to scale to petabytes and it’s a whole different ball game.
Scale and performance requirements strain conventional databases. Almost always, the problems are a matter of the underlying architecture. If not built for scale from the ground up, a database will ultimately hit the wall — this is what makes it so difficult for the established vendors to play in this space, because you cannot simply retrofit a 20+ year-old architecture to become a distributed MPP database overnight.
Having said that, over the past few years a whole crop of new MPP database companies has demonstrated that multiple PBs don’t pose a terribly big challenge if you approach the problem with the right architecture in mind.

Q3. How do you handle structured and unstructured data?

Florian Waas: As a rule of thumb, we suggest that our customers use Greenplum Database for structured data and consider Greenplum HD—Greenplum’s enterprise Hadoop edition—for unstructured data. We’ve equipped both systems with high-performance connectors to import and export data to each other, which makes for a smooth transition when using one as pre-processing for the other, querying HD from Greenplum Database, or whatever combination the application scenario might call for.

Having said this, we have seen a growing number of customers loading highly unstructured data directly into Greenplum Database and converting it into structured data on the fly through in-database logic for data cleansing, etc.

Q4. Cloud computing and open source: Do you they play a role at Greenplum? If yes, how?

Florian Waas: Cloud computing is an important direction for our business and hardly any vendor is better positioned than EMC in this space. Suffice it to say, we’re working on some exciting projects.
So, stay tuned!

As you know, Greenplum has historically been very close to the open source movement. Besides our ties with the Postgres and Hadoop communities, we released our own open source distribution of MADlib for in-database analytics (see also madlib.net).

Q5. In your blog you write that classical database benchmarks “aren’t any good at assessing the query optimizer”. Can you please elaborate on this?

Florian Waas: Unlike customer workloads, standard benchmarks pose few challenges for a query optimizer – the emphasis in these benchmarks is on query execution and storage structures. Recently, several systems that have no query optimizer to speak of have scored top results in the TPC-H benchmark.
And, while impressive at these benchmarks, these systems usually do not perform well in customer accounts when faced with ad-hoc queries — that’s where a good optimizer makes all the difference.

Q6. Why do we need specialized benchmarks for a subcomponent of a database?

Florian Waas: On the one hand, an optimizer benchmark will be a great tool for consumers.
A significant portion of the total cost of ownership of a database system comes from the cost of query tuning and manual query rewriting, in other words, the shortcomings of the query optimizer. Without an optimizer benchmark it’s impossible for consumers to compare the maintenance cost. That’s like buying a car without knowing its fuel consumption!

On the other hand, an optimizer benchmark will be extremely useful for engineering teams in optimizer development. It’s somewhat ironic that vendors haven’t invested in a methodology to show off that part of the system where most of their engine development cost goes.

Q7. Are you aware of any work in this area (Benchmarking query optimizers)?

Florian Waas: Funny you should ask. Over the past months I’ve been working with coworkers and colleagues in the industry on some techniques – we’re still far away from a complete benchmark but we’ve made some inroads.

Q8. You had done some work with “dealing with plan regressions caused by changes to the query optimizer”. Could you please explain what the problem is and what kind of solutions did you develop?

Florian Waas: A plan regression is a regression of a query due to changes to the optimizer from one release to the next. For the customer this could mean that, after an upgrade or patch release, one or more of their truly critical queries might run slower, maybe even so slowly that it starts impacting their daily business operations.

With the current test technology, plan regressions are very hard to guard against, simply because the size of the input space makes it impossible to achieve perfect test coverage. This dilemma has made a number of vendors increasingly risk-averse and has turned into the biggest obstacle to innovation in this space. Some vendors came up with rather reactionary safety measures. To use another car analogy: many of these are akin to driving with defective brakes but wearing a helmet in the hope that this will help prevent the worst in a crash.

I firmly believe in fixing the defective brakes, so to speak, and developing better test and analysis tools. We’ve made some good progress on this front and are starting to see some payback already. This is an exciting and largely under-developed area of research!
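
A sketch of the kind of test tooling alluded to above, under simplifying assumptions: capture the plan of each workload query before a change, recapture it after, and flag any query whose plan moved. Here sqlite3's EXPLAIN QUERY PLAN stands in for a real optimizer, and the "change" is simulated by adding an index; this is illustrative only, not Greenplum's tooling.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")

WORKLOAD = [
    "SELECT SUM(amount) FROM orders WHERE customer_id = 42",
    "SELECT COUNT(*) FROM orders",
]

def capture_plans(con, queries):
    """Record the optimizer's plan for every query in the workload."""
    return {q: con.execute("EXPLAIN QUERY PLAN " + q).fetchall() for q in queries}

baseline = capture_plans(con, WORKLOAD)

# The "optimizer change" is simulated here by a physical change that alters plans.
con.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

current = capture_plans(con, WORKLOAD)

for q in WORKLOAD:
    if baseline[q] != current[q]:
        # In a real regression guard this would trigger a timing comparison
        # before the new plan is allowed into production.
        print("PLAN CHANGED:", q)
    else:
        print("plan unchanged:", q)
```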

Q9. Glenn Paulley of Sybase, in a keynote at SIGMOD 2011, asked the question ‘how much more complexity can database systems deal with?’ What is your take on this?

Florian Waas: Unnecessary complexity is bad. I think everybody will agree with that. Some complexity is inevitable though, and the question becomes: How are we dealing with it?

Database vendors have all too often fallen into the trap of implementing questionable features quickly without looking at the bigger picture. This has led to tons of internal complexity and special casing, not to mention the resulting spaghetti code.
When abstracted correctly and broken down into sound building blocks a lot of complexity can actually be handled quite well. Again, query optimization is a great example here: modern optimizers can be a joy to work with.
They are built and maintained by small surgical teams that innovate effectively, whereas older models require literally dozens of engineers just to maintain the code base and fix bugs.

In short, I view dealing with complexity primarily as an exciting architecture and design challenge and I’m proud we assembled a team here at Greenplum that’s equally excited to take on this challenge!

Q10. I published an interview with Marko Rodriguez and Peter Neubauer, leaders of the Tinkerpop2 project. What is your opinion on Graph Analysis and Manipulation for databases?

Florian Waas: Great stuff these guys are building – I’m interested to see how we can combine Big Data with graph analysis!

Q11. Anything else you wish to add?

Florian Waas: It’s been fun!

____________________
Florian Waas (flw) is Director of Software Engineering at EMC/Greenplum and heads up the Query Processing team. His day job is to bring theory and practice together in the form of scalable and robust database technology.
_______________________________
Related Posts
On Data Management: Interview with Kristof Kloeckner, GM IBM Rational Software.

On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica.

On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com

Related Resources

ODBMS.ORG: Resources on Analytical Data Platforms: Blog Posts | Free Software | Articles

##

Jan 9 12

Use Cases and Database Modeling — An interview with Michael Blaha.

by Roberto V. Zicari

“Use cases are rote work. The developer listens to business experts and slavishly writes what they hear. There is little interpretation and no abstraction. There is little reconciliation of conflicting use cases. For a database project, the conceptual data model is a much more important software engineering contribution than use cases.” — Dr. Michael Blaha.

First of all let me wish you a Happy, Healthy and Successful 2012!

I am coming back to discuss with our expert Dr. Michael Blaha the topic of database modeling. In a previous interview we looked at the issue of “How good is UML for Database Design?”

Now we look at use cases and discuss how good they are for database modeling. Hope you'll find the interview interesting. I encourage the community to post comments.

RVZ

Q1. How are requirements taken into account when performing database modeling in daily practice? What are the common problems and pitfalls?

Michael Blaha: Software development approaches vary widely. I’ve seen organizations use the following techniques for capturing requirements (listed in random order).

— Preparation of use cases.
— Preparation of requirements documents.
— Representation and explanation via a conceptual data model.
— Representation and explanation via prototyping.
— Haphazard approach. Just start writing code.

General issues include:
— the amount of time required to capture requirements,
— missing requirements (requirements that are never mentioned)
— forgotten requirements (requirements that are mentioned but then forgotten)
— bogus requirements (requirements that are not germane to the business needs or that needlessly reach into design)
— incomplete understanding (requirements that are contradictory or misunderstood)

Q2. What is a use case?

Michael Blaha: A use case is a piece of functionality that a system provides to its users. A use case describes how a system interacts with outside actors.

Q3. What are the advantages of use cases?

Michael Blaha:
— Use cases lead to written documentation of requirements.
— They are intuitive to business specialists.
— Use cases are easy for developers to understand.
— They enable aspects of system functionality to be enumerated and managed.
— They include error cases.
— They let consulting shops bill many hours for low-skilled personnel.
(This is a cynical view, but I believe this is a major reason for some of the current practice.)

Q4. What are the disadvantages of use cases?

Michael Blaha:
— They are very time-consuming. It takes much time to write them down. It takes much time to interview business experts (time that is often unavailable).

— Use cases are just one aspect of requirements. Other aspects should also be considered, such as existing documentation and
artifacts from related software. Many developers obsess on use cases and forget to look for other requirement sources.

— Use cases are rote work. The developer listens to business experts and slavishly writes what they hear. There is little interpretation and no abstraction. There is little reconciliation of conflicting use cases.

— I have yet to see benefit from use case diagramming. I have yet to see significant benefit from use case structuring.

— In my opinion, use cases have been overhyped by marketeers.

Q5. How are use cases typically used in practice for database projects?

Michael Blaha: To capture requirements. It is OK to capture detailed requirements with use cases, but they should be
subservient to the class model. The class model defines the domain of discourse that use cases can then reference.

For database applications it is far inferior to start with use cases and only afterwards construct a class model. Database applications, in particular, need a data approach and not a process approach.

It is ironic that use cases have arisen from the object-oriented community. Note that OO programming languages define a class structure to which logic is attached. So it is odd that use cases put process first and defer attention to data structure.

Q6. A possible alternative approach to data modeling is to write use cases first, then identifying the subsystems and components, and finally identifying the database schema. Do you agree with this?

Michael Blaha: This is a popular approach. No, I do not agree with it. I strongly disagree.
For a database project, the conceptual data model is a much more important software engineering contribution than use cases.

Only when the conceptual model is well understood can use cases be fully understood and reconciled. Only then can developers integrate use cases and abstract their content into a form suitable for building a quality software product.

Q7. Many requirements and the design to satisfy those requirements are normally done with programming, not just schema. Do you agree with this? How do you handle this with use cases?

Michael Blaha: Databases provide a powerful language, but most do not provide a complete language.

The SQL language of relational databases is far from complete and some other language must be used to express full functionality.

OO databases are better in this regard. Since OO databases integrate a programming language with a persistence mechanism they inherently offer a full language for expressing functionality.

Use cases target functionality and functionality alone. Use cases, by their nature, do not pay attention to data structure.

Q8. Do you need to use UML for use cases?

Michael Blaha: No. The idea of use cases is valuable if used properly (in conjunction with data and normally subservient to data).

In my opinion, UML use case diagrams are a waste of time. They don’t add clarity. They add bulk and consume time.

Q9. Are there any suitable tools around to help the process of creating use cases for database design? If yes, how good are they?

Michael Blaha: Well, it’s clear by now that I don’t think much of use case diagrams. I think a textual approach is OK and there are probably requirement tools to manage such text, but I am unfamiliar with the product space.

Q10. Use case methods of design are usually applied to object-oriented models. Do you use use cases when working with an object database?

Michael Blaha: I would argue not. Most object-oriented languages put data first. First develop the data structure and then attach methods to the structure. Use cases are the opposite of this. They put functionality first.

Q11. Can you use use cases as a design method for relational databases, NoSQL databases, graph databases as well? And if yes how?

Michael Blaha: Not reasonably. I guess developers can force fit any technique and try to claim success.

To be realistic, traditional database developers (relational databases) are already resistant (for cultural reasons) to object-oriented jargon/style and the UML. When I show them that the UML class model is really just an ER model and fits in nicely with database conceptualization, they acknowledge my point, but it is still a foreign culture.

I don’t see how use cases have much to offer for NoSQL and graph databases.

Q12. So if you don’t have use cases, how do you address functionality when building database applications?

Michael Blaha: I strongly favor the technique of interactive conceptual data modeling. I get the various business and
technical constituencies in the same room and construct a UML class model live in front of them as we explore their business needs and scope. Of course, the business people articulate their needs in terms of use cases. But the use cases are grounded by the evolving UML class model defining the domain of discourse. Normally I have my hands full with managing the meeting and constructing a class model in front of them. I don’t have time to explicitly capture the use cases (though I am appreciative if someone else volunteers for that task).

However, I fully consider the use cases by playing them against the evolving model. Of course as I consider use cases relative to the class model, I am reconciling the use cases. I am also considering abstraction as I construct the class model and consequently causing the business experts to do more abstraction in formulating their use case business requirements.

I have built class models this way many times before and it works great. Some developers are shocked at how well it can work.

————————————————–
Michael Blaha is a partner at Modelsoft Consulting Corporation.
Dr. Blaha is recognized as one of the world’s leading authorities on databases and data modeling. He has more than 25 years of experience as a consultant and trainer in conceiving, architecting, modeling, designing, and tuning databases for dozens of major organizations around the world. He has authored six U.S. patents, six books, and many papers. Dr. Blaha received his doctorate from Washington University in St. Louis and is an alumnus of GE Global Research in Schenectady, New York.

Related Posts

– How good is UML for Database Design? Interview with Michael Blaha.

– Agile data modeling and databases.

– Why Patterns of Data Modeling?

Related Resources

-ODBMS.org: Databases in General: Blog Posts | Free Software | Articles and Presentations| Lecture Notes | Journals |

##

Dec 14 11

Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB.

by Roberto V. Zicari

The landscape for data management products is rapidly evolving. NuoDB is a new entrant in the market. I asked a few questions to Barry Morris, Founder & Chief Executive Officer NuoDB.

RVZ

Q1. What is NuoDB?

Barry Morris: NuoDB combines the power of standard SQL with cloud elasticity. People want SQL and ACID transactions, but these have been expensive to scale up or down. NuoDB changes that by introducing a breakthrough “emergent” internal architecture.

NuoDB scales elastically as you add machines and is self-healing if you take machines away. It also allows you to share machines between databases, vastly increases business continuity guarantees, and runs in geo-distributed active/active configurations. It does all this with very little database administration.

Q2. When did you start the company?

Barry Morris: The technology has been developed over a period of several years but the company was funded in 2010 and has been operating out of Cambridge MA since then.

Q3. Big data, Web-scale concurrency and Cloud: how can NuoDB help here?

Barry Morris: These are some of the main themes of modern system software.
The 30-year-old database architecture that is universally used by traditional SQL database systems is great for traditional applications and in traditional datacenters, but the old architecture is a liability in web-facing applications running on private, public or hybrid clouds.

NuoDB is a general purpose SQL database designed specifically for these modern application requirements.
Massive concurrency with NuoDB is easy to understand: if the system is overloaded with users you can just add as many diskless “transaction managers” as you need. NuoDB handles big data by using redundant Key-value stores for its storage architecture.

Cloud support boils down to five key cloud requirements: elastic scalability, sharing machines between databases, extreme availability, geo-distribution, and “zero” DBA costs. You get these for free from a database with an emergent architecture, such as NuoDB.

Data

Q4. What kind of data (structured, non structured), and volumes of data is NuoDB able to handle?

Barry Morris: NuoDB can handle any kind of data, in a set-based relational model.

Naturally we support all standard SQL types. Additionally we have rich BLOB support that allows us to store anything in an opaque fashion. A forthcoming release of the product will also support user-defined types.

NuoDB extends the traditional SQL type model in some interesting ways. We store arbitrary-length strings and numbers because we store everything by value not by type. Schemas can change easily and dynamically because they are not tightly coupled to the data.

Additionally NuoDB supports table inheritance, a powerful feature for applications that traditionally have wide, sparse tables.

Q5. How do you ensure data availability?

Barry Morris: There is no single point of failure in NuoDB. In fact it is quite hard to stop a NuoDB database from running, or to lose data.

There are three tiers of processes in a NuoDB solution (Brokers, Transaction Managers, and Archive Managers) and each tier can be arbitrarily redundant. All NuoDB Brokers, Transaction Managers and Archive Managers run as true peers of others in their respective tiers. To stop a database you have to stop all processes on at least one tier.

Data is stored redundantly, in as many Archive Managers as you want to deploy. All Archive Managers are peers and if you lose one the system just keeps going with the remaining Archive Managers.

Additionally the system can run across multiple datacenters. In the case of losing a datacenter a NuoDB database will keep running in the other datacenters.

Transactions

Q6. How do you orchestrate and guarantee ACID transactions in a distributed environment?

Barry Morris: We do not use a network lock manager because we need to be asynchronous in order to scale elastically.

Instead concurrency and consistency are managed using an advanced form of MVCC (multi-version concurrency control). We never actually delete or update a record in the system directly. Instead we always create new versions of records pointing at the old versions. It is our job to do all the bookkeeping about who sees which versions of which records and at what time.

The durability model is based on redundant storage in Key-value stores we call Archive Managers.
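
A toy sketch of the record-versioning idea described above (illustrative only, not NuoDB's actual implementation): updates append new versions instead of overwriting, and each reader sees the newest version committed before its own snapshot.

```python
import itertools

class MVCCStore:
    """Toy multi-version store: updates append versions, readers pick by snapshot."""
    def __init__(self):
        self._clock = itertools.count(1)
        self._versions = {}          # key -> list of (commit_ts, value)

    def begin(self) -> int:
        """A transaction's snapshot is simply the clock value at start time."""
        return next(self._clock)

    def write(self, key, value):
        # Never update in place: append a new version stamped with a commit time.
        commit_ts = next(self._clock)
        self._versions.setdefault(key, []).append((commit_ts, value))

    def read(self, key, snapshot_ts):
        # A reader sees the newest version committed at or before its snapshot.
        visible = [v for ts, v in self._versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

store = MVCCStore()
store.write("balance", 100)                  # first committed version
reader = store.begin()                       # this reader's snapshot is taken now
store.write("balance", 250)                  # a later transaction commits a new version

print(store.read("balance", reader))         # -> 100, the version visible at snapshot time
print(store.read("balance", store.begin()))  # -> 250, a new reader sees the new version
```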

Q7. You say you use a distributed non-blocking atomic commit protocol. What is it? What is it useful for?

Barry Morris: Until changes to the state of a transactional database are committed they are not part of the state of the database. This is true for any ACID database system.

The Commit Protocol in NuoDB is complex because at any time we can have thousands of concurrent applications reading, writing, updating and deleting data in an asynchronous distributed system with multiple live storage nodes. A naïve design would “lock” the system in order to commit a change to the durable state, but that is a good way of ensuring the system does not perform or scale. Instead we have a distributed, asynchronous commit protocol that allows a transaction to commit without requiring network locks.

Architecture

Q8. What is special about NuoDB Architecture?

Barry Morris: NuoDB has an emergent architecture. It is like a flock of birds that can fly in an organized formation without having a central “brain”. Each bird follows some simple rules and the overall effect is to organize the group.

In NuoDB no one is in charge. There is no supervisor, no master, and no central authority on anything. Everything that would normally be centralized is distributed. Any Transaction Manager or Archive Manager with the right security credentials can dynamically opt in or opt out of a particular database. All properties of the system emerge from the interactions of peers participating on a discretionary basis rather than from a monolithic central coordinator.

SQL

Q9. Do you support SQL?

Barry Morris: Yes. NuoDB is a relational database, and SQL is the primary query language. We have very broad and very standard SQL support.

Performance

Q10. How do you reduce the overhead when storing data to a disk? Do you use in-memory cache?

Barry Morris: Our storage model is that we have multiple redundant Archive Nodes for any given database. Archive Nodes are Key-value stores that know how to create and retrieve blobs of data. Archive Nodes can be implemented on top of any Key-value store (we currently have a file-system implementation and an Amazon S3 implementation). Any database can use multiple types of Archive Nodes at the same time (e.g. SSD-based alongside disk-based).

Archive Nodes allow trade-offs to be made on disk write latency. For an ultra-conservative strategy you can run the system with fully journaled, flushed disk writes on multiple Archive Nodes in multiple datacenters; on the other end of the spectrum you can rely on redundant memory copies as your committed data, with disk writes happening as a background activity. And there are options in between. In all cases there are caching strategies in place that are internal to the Archive Nodes.
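
A small sketch of the pluggable, redundant key-value storage idea described above (illustrative only, not NuoDB's code): an archive is anything that can put and get opaque blobs by key; a file-system implementation is shown, and an S3-backed one would implement the same interface. Paths and keys are invented for illustration.

```python
import os
import tempfile

class FileSystemArchive:
    """Toy key-value archive: stores opaque blobs as files under a directory.
    An S3-backed archive would expose the same put/get interface."""
    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, key: str, blob: bytes) -> None:
        with open(os.path.join(self.root, key), "wb") as f:
            f.write(blob)

    def get(self, key: str) -> bytes:
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()

# Redundant durability: the same blob is written to every archive node.
archives = [FileSystemArchive(os.path.join(tempfile.gettempdir(), "archive-a")),
            FileSystemArchive(os.path.join(tempfile.gettempdir(), "archive-b"))]
for a in archives:
    a.put("record-0001", b"serialized row versions ...")

# If one archive is lost, any surviving peer can serve the data.
print(archives[1].get("record-0001"))
```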

Q11. How do you achieve that query processing scales with the number of available nodes?

Barry Morris: Queries running on multiple nodes get the obvious benefit of node parallelism at the query-processing level. They additionally get parallelism at the disk-read level because there are multiple archive nodes. In most applications the most significant scaling benefit is that the Transaction Nodes (which do the query processing) can load data from each other's memory if it is available there. This is orders of magnitude faster than loading data from disk.

Q12. What about storage? How does it scale?

Barry Morris: You can have arbitrarily large databases, bounded only by the storage capacity of your underlying Key-value store. The NuoDB system itself is agnostic about data set size.

Working set size is very important for performance. If you have a fairly stable working set that fits in the distributed memory of your Transaction Nodes then you effectively have a distributed in-memory database, and you will see extremely high performance numbers. With the low and decreasing cost of memory this is not an unusual circumstance.

Scalability and Elasticity

Q13. How do you obtain scalability and elasticity?

Barry Morris: These are simple consequences of the emergent architecture.
Transaction Managers are diskless nodes. When a Transaction Manager is added to a database it starts getting allocated work and will consequently increase the throughput of the database system. If you remove a transaction manager you might have some failed transactions, which the application is necessarily designed to handle, but you won’t lose any data and the system will continue regardless.

Q14. How complex is for a developer to write applications using NuoDB?

Barry Morris: It’s no different from traditional relational databases. NuoDB offers standard SQL with JDBC, ODBC, Hibernate, Ruby Active Records etc.

Q15. NuoDB is in a restricted Beta program. When will it be available in production? Is there a way to try the system now?

Barry Morris: NuoDB is in its Beta 4 release. We expect it to ship by the end of the year, or very early in 2012. You can try the system by going to our download page.

_____________________________

Related Posts

vFabric SQLFire: Better then RDBMS and NoSQL?

MariaDB: the new MySQL? Interview with Michael Monty Widenius.

On Versant's technology. Interview with Vishal Bagga.

Interview with Jonathan Ellis, project chair of Apache Cassandra.

The evolving market for NoSQL Databases: Interview with James Phillips.

Related Resources

ODBMS.org: Free Downloads and Links on various data management technologies:
*Analytical Data Platforms.
*Cloud Data Stores,
*Databases in general
*Entity Framework (EF) Resources,
*Graphs and Data Stores
*NoSQL Data Stores,
*Object Databases,
*Object-Oriented Programming,
*Object-Relational Impedance Mismatch,
*ORM Technology,
##

Nov 30 11

On Data Management: Interview with Kristof Kloeckner, GM IBM Rational Software.

by Roberto V. Zicari

“In the last years, our main development focus in the data management area has been on technologies that help customers turn insight into action.” — Kristof Kloeckner, IBM Corporation.

I wanted to learn about IBM's current strategy in data management. I have interviewed Kristof Kloeckner, General Manager of IBM Rational Software, IBM Corporation.

RVZ

Q1. In your opinion, what are the most important technologies being developed and deployed in the last years that are affecting data management?

Kloeckner: In the last years, our main development focus in the data management area has been on technologies that help customers turn insight into action. Three fundamental shifts are being experienced across industries where these technology advances have been critical: i) information is exploding, ii) business change is outpacing the collective ability to keep up, and iii) the performance gap between leaders and laggards/followers is widening. Successful organizations are taking a structured approach to turning insight into action through business analytics and optimization. Specifically, innovative technologies that have helped businesses optimize to address these shifts include:

1 – Competing on speed: enabling organizations to speed up decision making and processes to act in a time frame that matters to retain/attract new clients and grow revenues.

2 – Exceeding customer and employee expectations: consumer IT innovation has created high employee expectations of IT, including ease of use, collaboration, and mobile access.

3 – Exploiting information from everywhere: access to data wherever it resides, in any variety, at any velocity, and at any volume. Advancements in Big Data technology include new compression capabilities to transparently compress data on disk to reduce disk space and storage infrastructure requirements.

4 – Realizing new possibilities from analytics: developments within analytics processing that combine and analyze information through multiple analytics services, including content analytics and predictive analytics, are transforming healthcare.

5 – Information governance: addressing the pressure of regulatory compliance. Technology advancements with Hadoop allow multiple users to access unstructured data for governance purposes. Content-development workflow advancements for enterprise business glossaries enable a governed approach to building, publishing, and managing enterprise vocabularies.

6 – OLTP Scale Out and Availability.

Q2. In a recent keynote you introduced the concept of “Software-driven innovation”. What is it?

Kloeckner: Innovation is increasingly achieved and delivered via software. The need for innovation is key for companies trying to gain competitive advantage in today’s business environment.
In a recent IBM survey of more than 1,500 CEOs – nearly 50% of respondents called out the increased need for product and service innovation as a top concern.

Software connects people, products and information. The more interconnected, instrumented, and intelligent a product or service is, the greater its potential value is to its users. Challenges to achieving software-driven innovation derive from the complexity of interconnected systems, but also from the complexity of software delivery due to the increasingly global distribution of development teams, new models of software supply chains including outsourcing and crowdsourcing, and the reality of constantly shifting market requirements. Effective software-driven innovation requires application and systems lifecycle management with a focus on integration of processes, data and tools, collaboration across roles and organizations, and optimization of business outcomes through measurements and simplified governance.

Q3. You defined three elements for realizing “software-driven innovation”: Integration, Collaboration, and Optimization. Could you please elaborate on this? In particular, how is the software design and delivery lifecycle managed?

Kloeckner: Collaborative Lifecycle Management integrates tools and processes end-to-end, from requirements management to design and construction to test and verification, and ultimately, deployment.
It enables traceability of artifacts across the lifecycle, real-time planning, in-context collaboration, development analytics or ‘intelligence’ and continuous improvement. It eliminates manual hand-offs, and can lead to increased project velocity, adaptability to change and reduced project risk.

Collaborative design management brings cross-team collaboration on software and systems design and creates deep integrations across the lifecycle. Effective modeling, design and architecture are critical to developing smarter products and services.

Collaborative DevOps bridges the gap between development and deployment by sharing information between the teams, improving deployment planning and automating many deployment steps. This leads to faster delivery of solutions into production.

Collaborative Lifecycle Management, Design Management and DevOps can considerably increase business agility by accelerating software and systems delivery. They can be augmented by capabilities that further increase alignment of development with business goals, for instance through application portfolio management.

Q4. IBM recently announced a new application development collaboration software built on its Jazz development platform. What is special about Jazz?

Kloeckner: Jazz is IBM’s initiative for improving collaboration across the software and systems lifecycle. Inspired by the artists who transformed musical expression, Jazz is an initiative to transform software and systems delivery by making it more collaborative, productive and transparent, through integration of information and tasks across the phases of the lifecycle. The Jazz initiative consists of three elements: Platform, Products and Community.

Jazz is built on architectural principles that represent a key departure from approaches taken in the past.
Unlike monolithic, closed platforms of the past, Jazz has an innovative approach to integration based on open, flexible services and Internet architecture (linked data). It is complemented by Open Services for Lifecycle Collaboration (OSLC), an industry initiative of more than 30 companies to define the interfaces for tools integration based on the principles of linked data.

Jazz is supported by a community web site (Jazz.net), as well as a cloud environment supporting university projects (JazzHub).

Q5. What about other approaches such as social networks à la Facebook? Aren’t they also a sort of collaboration platform?

Kloeckner: Without a doubt, software delivery is a team sport, and the process of delivery becomes an instance of ‘social business’. Collaboration is vital for organizational cohesion on a global scale, but also for the sharing of best practices and artifacts. Increasingly, developers expect to use mechanisms for interacting and sharing similar to those they use in their personal lives.
At IBM we combine development tools such as Rational Team Concert and Rational Asset Manager with our social business platform, IBM Connections, to provide an infrastructure for code reuse supporting more than 30,000 members in more than 1,700 projects. It has a governance structure derived from Open Source which we call Community Source. The platform uses wikis, forums, blogs and other feedback and communication mechanisms.

Q6. What is IBM strategy in data management? DB2, Apache Hadoop, CloudBurst, SmartCloud Enterprise to name a few. How do they relate to each other (if any)?

Kloeckner: Our strategy is to enable the placement of data and the workloads that use it in cost-effective patterns that encompass public and private clouds as well as traditional deployment topologies.

IBM has enabled its data management portfolio on both public and private clouds, giving evolving businesses easy access to IBM’s rich enterprise data management offerings. Our focus is on deeper cloud capabilities such as the Platform as a Service and Database as a Service style features in the IBM Workload Deployer (the evolved WebSphere CloudBurst Appliance). The adoption of these technologies leads to the need to integrate with existing IT infrastructures as well as with the evolving world of NoSQL.

This deeper cloud capability also focuses on exposing analytical capability and insight into raw business data that is still difficult to model or understand, which leads to the Big Data topic. The run-time infrastructure for this deeper insight is built on Apache Hadoop and is the basis for IBM’s Big Data product, available either as a cloud offering or as a product offering. Big Data allows businesses to expand the range of information they can analyze. For this data to be useful, however, it often needs to tie into the more traditional Information Management world – the existing operational and warehouse systems that provide the context needed both to support the broader range of analytics that Big Data enables and to operationalize the results.

Q7. What are the main business and technical issues related to data management when using the Cloud?

Kloeckner: There are four main issues, each of which relates to the data itself: data movement, data security, data ownership/stewardship, and the analysis of enormous datasets. The challenge with data movement manifests itself primarily in two ways: availability and the obvious time that it takes to transfer data between sites. The relation to availability is not immediately obvious, but high-speed data movement is used for replication-based HA models as well as for recovery scenarios that need to pull data from a remote site. The higher the bandwidth and the lower the latency, the easier it is to create a highly available data management offering on the cloud.

Data security is where most of the concerns surface, since many enterprises have very sensitive data. Questions such as “Who is the group of people working behind the cloud facade, and can I trust them?” are probably the most common. Fortunately, IBM has been handling sensitive enterprise data for most of the last century. Having a reputable company to partner with to either host or protect your data in the cloud is absolutely key, and it is also why many companies are reaching out to IBM as a cloud provider. Dealing with sensitive data is business as usual for us.

Related to security is the issue of ownership and stewardship of data in the cloud. If data is believed to be out of the control of its stakeholders, and you cannot demonstrate otherwise, they may begin to lose confidence in the trustworthiness of that information. This leads to the need for governance of data in the cloud, closely tied to its security.

Lastly, what do you do with petabytes of structured and unstructured data? Cloud and Hadoop sometimes get used interchangeably in some circles, but IBM’s direction on Hadoop is illustrated above with BigInsights, which includes the BigData and Streams products. Cloud is just as critical but serves a different role. One area information management is focused on around clouds, for example, is making it simpler for business users to build sandboxes and smaller systems in the cloud and to integrate their data capture and governance with the technical IT groups that manage the business data. We need to make it simple for business users, but allow IT some control over what data goes into the cloud and how long it lives.

Q8. Private cloud vs. public cloud? What is the experience of IBM on this?

Kloeckner: Clients have choices for deploying enterprise applications, and the key factors they take into consideration are not just cost or convenience, but also security, reliability, scalability, control and tooling, and a trusted vendor relationship. We find that for enterprise-critical applications the majority of customers favor private (including managed private) clouds. We also find increasing adoption of IBM SmartCloud Enterprise.

IBM has a proven approach for building and managing cloud solutions, providing an integrated platform that uses the same standards and processes across the entire portfolio of products and services. IBM’s expertise and experience in designing, building and implementing cloud solutions—beginning with its own—offers clients the confidence of knowing that they are engaging not just a provider, but a trusted partner in the transformation of their IT service delivery. The IBM Cloud Computing reference architecture builds on IBM’s industry-leading experience and success in implementing SOA solutions. We have also invested in technologies that integrate cloud environments and allow clients to combine services deployed in their private cloud environments with services delivered through public clouds. We believe that such hybrid environments are becoming increasingly common.

Q9. In your opinion, will we ever have a “Trusted” Cloud?

Kloeckner: Concerns about security are the most prominent reasons that organizations cite for not adopting public cloud services. Therefore, creating more comprehensive security capabilities is a prerequisite for getting organizations to adopt public cloud-based services for more complex, business-sensitive, and demanding purposes. A recent Forrester Research report fully expects the emergence of highly secure and trusted cloud services over the next five years, during which time cloud security will grow into a $1.5 billion market and shift from being an inhibitor to an enabler of cloud services adoption. The advent of secure cloud services will be a disruptive force in the security solutions market, challenging traditional security solution providers to revamp their architectures, partner ecosystems, and service offerings, while creating an opportunity for emerging vendors, integrators, and consultants to establish themselves.

Q10. Where do you see the data management industry heading in the next years?

Kloeckner: Across industries, technology providers and technology consumers are moving to address the exponential growth in the quantity, complexity, variety, and velocity of data to enable better business insights and drive informed actions. Add to this the immediacy of the web and data in motion, the new control and influence of the individual consumer, and the fact that most companies have yet to fully exploit the raw data they have in house within the context of structured models, and you can easily discern the driving forces behind strategic moves, acquisitions and recent product announcements. The potential for business risk continues to increase, as do requirements for regulatory compliance. As a result, spending on information and analytics software is estimated to reach $74B by 2015.

The consumability of core technologies such as High Performance Computing and embarrassingly parallel workloads, real-time analysis of streaming data and process flows, ease of access to analytics as services via the cloud and embedded within business processes, as well as advances in natural language interfaces, will take analytics to the mainstream and to main street. IBM’s Watson is just a hint of what is to come.

Solution delivery trends in mobile, cloud, and software as a service are driving information management technology investments to meet these demands. Data access trends are evolving beyond traditional business applications and existing OLTP environments, extending their reach into social media and content, for example, and driving technology investments in processing Big Data as well as in content and text analytics.

As systems processing power, in-memory computing capabilities, extreme low-energy servers, and storage technologies such as solid state drives continue to progress, data management technologies advance in parallel, leveraging these advances to enable optimized workloads and consolidated platforms for cloud environments.

Both companies and consumers look for better, faster and easier access to a single best view of the truth, and for the ability to predict the next best action. Whether it is process optimization within a business, risk and fraud management, dramatically enhanced personalization of each customer experience, or improving the speed and accuracy of diagnosis based on the full compendium of the latest research, historical data and best practices, industries from health care, financial services and customer service to precision marketing and even auto repair will be impacted.

_________________

Kristof Kloeckner, Ph.D., is General Manager of IBM Rational Software, IBM Corporation.
Dr. Kloeckner is responsible for all facets of the Rational Software business including strategy, marketing, sales, operations, technology development and overall financial performance. He and his global team are focused on providing the premier development and delivery solutions for transforming business through software-driven innovation. Dr. Kloeckner is based in Somers, New York.

Resources

ODBMS.org: Free Downloads and Links on various data management technologies and analytics.

Related Posts

On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica.

On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com

Analytics at eBay. An interview with Tom Fastner.

##

Nov 16 11

On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica.

by Roberto V. Zicari

” It is expected there will be 6 billion mobile phones by the end of 2011, and there are currently over 300 Twitter accounts and 500K Facebook status updates created every minute. And, there is now a $2 billion a year market for virtual goods! “ — Shilpa Lawande

I wanted to know more about Vertica Analytics Platform for Big Data. I have interviewed Shilpa Lawande, VP of Engineering at Vertica. Vertica was acquired by HP early this year.

RVZ

Q1. What are the main technical challenges for big data analytics?

Shilpa Lawande: Big data problems have several characteristics that make them technically challenging. First is the volume of data, especially machine-generated data, and how fast that data is growing every year, with new sources of data that are emerging. It is expected there will be 6 billion mobile phones by the end of 2011, and there are currently over 300 Twitter accounts and 500K Facebook status updates created every minute. And, there is now a $2 billion a year market for virtual goods!

A lot of insights are contained in unstructured or semi-structured data from these types of applications, and the problem is analyzing this data at scale. Equally challenging is the problem of ‘how to analyze.’ It can take significant exploration to find the right model for analysis, and the ability to iterate very quickly and “fail fast” through many (possibly throwaway) models – at scale – is critical.

Second, as businesses get more value out of analytics, it creates a success problem – they want the data available faster, or in other words, want real-time analytics. And they want more people to have access to it, or in other words, high user volumes.

One of Vertica’s early customers is a Telco that started using Vertica as a ‘data mart’ because they couldn’t get resources from their enterprise data warehouse. Today, they have over a petabyte of data in Vertica, several orders of magnitude bigger than their enterprise data warehouse.

Techniques like social graph analysis, for instance leveraging the influencers in a social network to create a better user experience, are hard problems to solve at scale. All of these problems combined create a perfect storm of challenges and opportunities to create faster, cheaper and better solutions for big data analytics than traditional approaches can deliver.

Q2. How does Vertica help solve such challenges?

Shilpa Lawande: Vertica was designed from the ground up for analytics. We did not try to retrofit 30-year-old RDBMS technology to build the Vertica Analytics Platform. Instead, Vertica built a true columnar database engine including sorted columnar storage, a query optimizer and an execution engine.

With sorted columnar storage, there are two methods that drastically reduce the I/O bandwidth requirements for such big data analytics workloads. The first is that Vertica only reads the columns that queries need.
Second, Vertica compresses the data significantly better than anyone else.
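To make the I/O argument concrete, here is a small, purely illustrative Python sketch (not Vertica code) of the two ideas just described: each column is stored and compressed separately, so a query that needs only two columns never touches the others, and run-length encoding collapses sorted, repetitive values.

from itertools import groupby

# Illustrative toy column store: column pruning plus run-length encoding (RLE).
rows = [
    {"date": "2011-11-01", "region": "EU", "amount": 10.0},
    {"date": "2011-11-01", "region": "EU", "amount": 12.5},
    {"date": "2011-11-01", "region": "US", "amount": 7.0},
    {"date": "2011-11-02", "region": "US", "amount": 9.9},
]

# Column-oriented layout: each column is stored (and compressed) on its own.
columns = {name: [r[name] for r in rows] for name in rows[0]}

def rle_encode(values):
    # Compress a sorted/repetitive column into (value, run_length) pairs.
    return [(v, len(list(g))) for v, g in groupby(values)]

# A query such as SELECT region, SUM(amount) ... reads only these two columns;
# the 'date' column is never read at all.
region = rle_encode(columns["region"])      # [('EU', 2), ('US', 2)]
amount = columns["amount"]

totals, i = {}, 0
for value, run in region:
    totals[value] = sum(amount[i:i + run])  # operate run-by-run on compressed data
    i += run

print(totals)                               # {'EU': 22.5, 'US': 16.9}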

Vertica’s execution engine is optimized for modern multi-core processors and we ensure that data stays compressed as much as possible through the query execution, thereby reducing the CPU cycles to process the query. Additionally, we have a scale-out MPP architecture, which means you can add more nodes to Vertica.
All of these elements are extremely critical to handle the data volume challenge.

With Vertica, customers can load several terabytes of data per hour and query that data within minutes of it being loaded – that is real-time analytics on big data for you.
There is a myth that columnar databases are slow to load. This may have been true with older generation column stores, but in Vertica, we have a hybrid in-memory/disk load architecture that rapidly ingests incoming data into a write-optimized row store and then converts that to read-optimized sorted columnar storage in the background. This is entirely transparent to the user because queries can access data in both locations seamlessly. We have a very lightweight transaction implementation with snapshot isolation, so queries can always run without any locks.
And we have no auxiliary data structures, like indexes or materialized views, which need to be maintained post-load.
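The hybrid load path can be pictured with a rough sketch as well (assumptions mine, not Vertica's implementation): new rows land in a small write-optimized row buffer, a background step converts full buffers into sorted, column-oriented storage, and reads see both locations.

class HybridStore:
    # Illustrative write-optimized / read-optimized hybrid store.
    def __init__(self, flush_threshold=3):
        self.write_buffer = []     # row-oriented, unsorted: fast ingest
        self.read_store = {}       # column-oriented; each flushed batch sorted by 'key'
        self.flush_threshold = flush_threshold

    def insert(self, row):
        self.write_buffer.append(row)
        if len(self.write_buffer) >= self.flush_threshold:
            self._flush()          # in a real system this conversion runs in the background

    def _flush(self):
        batch = sorted(self.write_buffer, key=lambda r: r["key"])
        for name in batch[0]:
            self.read_store.setdefault(name, []).extend(r[name] for r in batch)
        self.write_buffer = []

    def scan(self, column):
        # Queries see both locations seamlessly.
        return self.read_store.get(column, []) + [r[column] for r in self.write_buffer]

store = HybridStore()
for k, v in [(3, "c"), (1, "a"), (2, "b"), (5, "e")]:
    store.insert({"key": k, "val": v})
print(store.scan("val"))   # ['a', 'b', 'c', 'e'] -- the last row is still in the write buffer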

Last, but not least, we designed the system for “always on,” with built-in high availability features.
Operations that translate into downtime in traditional databases are online in Vertica, including adding or upgrading nodes, adding or modifying database objects, etc.

With Vertica, we’ve removed many of the barriers to monetizing big data and hope to continue to do so.

Q3. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?

Shilpa Lawande: The short answer to the performance and scale question is that we make the most efficient use of all the resources, and we parallelize at every opportunity. When we compress and encode the data, we sometimes see 80:1 compression (depending on the data). Vertica’s fully peer-to-peer MPP architecture allows you to issue loads and queries from any node and run multiple load streams. Within each node, operations make full use of the multi-core processors to parallelize their work. A great measure of our raw performance is that we have several OEM customers who run Vertica on a single 1U node, embedded within applications like security and event log management, with very low data latency requirements.

Scalability has three aspects – data volume, hardware size, and concurrency. Vertica’s performance scales linearly (and often super linearly due to compression and other factors) when you double the data volume or run the same data volume on twice the number of nodes. We have customers who have grown their databases from scratch to over a petabyte, with clusters from tens to hundreds of nodes. As far as concurrency goes, running queries 50-200x faster ensures that we can get a lot more queries done in a unit of time. To efficiently handle a highly concurrent mix of short and long queries, we have built-in workload management that controls how resources are allocated to different classes of queries. Some of our customers run with thousands of concurrent users running sub-second queries.

Q4. What is new in Vertica 5.0?

Shilpa Lawande: In Vertica 5.0, we focused on extensibility and elasticity of the platform. Vertica 5.0 introduced a Software Development Kit (SDK) that allows users to write custom user-defined functions so they can do more with the Vertica platform, such as analyze Apache logs or build custom statistical extensions. We also added several more built-in analytic extensions, including event-series joins, event-series pattern matching, and statistical and geospatial functions.
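As a flavor of what an "analyze Apache logs" extension has to do, here is the kind of parsing logic such a user-defined function would wrap, written as plain Python for illustration rather than against the Vertica SDK itself: one raw log line goes in, typed columns come out.

import re
from datetime import datetime

# Regular expression for a common/combined Apache access-log line (illustrative).
APACHE_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_line(line):
    # Turn one raw access-log line into typed fields, or None if malformed.
    m = APACHE_LOG.match(line)
    if not m:
        return None
    return {
        "ip": m.group("ip"),
        "ts": datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z"),
        "method": m.group("method"),
        "path": m.group("path"),
        "status": int(m.group("status")),
        "bytes": 0 if m.group("bytes") == "-" else int(m.group("bytes")),
    }

line = '10.0.0.1 - - [16/Nov/2011:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120'
print(parse_line(line))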

With our new Elastic Cluster features, we allow very fast expansion and contraction of the cluster to handle changing workloads – in a recent POC we showed expansion from 8 nodes to 16 nodes with 11TB of data in an hour. Another feature we’ve added is called Import/Export, which automates fast export of data from one Vertica cluster to another, a very useful feature for creating sandboxes from a production system for exploratory analysis.

Besides these two major features, there are a number of improvements to the manageability of the database, most notably the Data Collector, which captures and maintains a history of system performance data, and the Workload Analyzer, which analyzes this data to point out suboptimal performance and how to fix it.

Q5. How is Vertica currently being used? Could you give examples of applications that use Vertica?

Shilpa Lawande: Vertica has customers in most major verticals, including Telco, Financial Services, Retail, Healthcare, Media and Advertising, and Online Gaming. We have eight out of the top 10 US Telcos using Vertica for Call-Detail-Record analysis, and in Financial Services, a common use-case is a tickstore, where Vertica is used to store and analyze many years of financial trades and quotes data to build models.
A growing segment of Vertica customers are Online Gaming and Web 2.0 companies that use Vertica to build models for in-game personalization.

Q6. Vertica vs. Apache Hadoop: what are the similarities and what are the differences?

Shilpa Lawande: Vertica and Hadoop are both systems that can store and analyze large amounts of data on commodity hardware. The main differences are how the data gets in and out, how fast the system can perform, and what transaction guarantees are provided. Also, from the standpoint of data access, Vertica’s interface is SQL and data must be designed and loaded into a SQL schema for analysis.

With Hadoop, data is loaded AS IS into a distributed file system and accessed programmatically by writing Map-Reduce programs. By not requiring a schema first, Hadoop provides a great tool for exploratory analysis of the data, as long as you have the software development expertise to write Map-Reduce programs. Hadoop assumes that the workload it runs will be long running, so it makes heavy use of checkpointing at intermediate stages.
This means parts of a job can fail, be restarted and eventually complete successfully. There are no transactional guarantees.
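To make the programming-model contrast concrete, here is a toy Map-Reduce word count in Python; the shuffle step is simulated locally so the sketch runs on its own, whereas on a real cluster the mapper and reducer would run as separate tasks (for example via Hadoop Streaming).

from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Emit (word, 1) for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(sorted_pairs):
    # Sum the counts for each word; input must be grouped (sorted) by key.
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

documents = ["big data is big", "data warehouses meet big data"]
shuffled = sorted(mapper(documents))   # what Hadoop's shuffle/sort phase does
print(dict(reducer(shuffled)))
# {'big': 3, 'data': 3, 'is': 1, 'meet': 1, 'warehouses': 1}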

Vertica, on the other hand, is optimized for performance by careful layout of data and pipelined operations that minimize saving intermediate state. Vertica gets queries to run sub-second and if a query fails, you just run it again. Vertica provides standard ACID transaction semantics on loads and queries.

We recently did a comparison between Hadoop, Pig, and Vertica for a graph problem (see post on our blog) and when it comes to performance, the choice is clearly in favor of Vertica. But we believe in using the right tool for the job and have over 30 customers using both systems together. Hadoop is a great tool for the early exploration phase, where you need to determine what value there is in the data, what the best schema is, or to transform the source data before loading into Vertica. Once the data models have been identified, use Vertica to get fast responses to queries over the data.

Other customers keep all their data in Vertica and leverage Hadoop’s scheduling capabilities to retrieve the data for different kinds of analysis. To facilitate this, we provide a Hadoop connector that allows efficient bi-directional transfer of data between Vertica and Hadoop. We plan to continue to enhance Vertica’s analytic platform as well as be a partner in the Hadoop ecosystem.

Q7. Cloud computing: Does it play a role at Vertica? If yes, how?

Shilpa Lawande: In 2007, when ‘Cloud’ first started to gather steam, we realized that Vertica had the perfect architecture for the cloud. It is no surprise because the very trends in commodity multi-core servers, storage and interconnects that resulted in our design choices also enable the cloud. We were the first analytic database to run on Amazon EC2 and today have several customers using EC2 to run their analytics.

We consider the cloud an important deployment configuration for Vertica and, as IaaS offerings improve and we gain real-world experience from our customers, you can expect Vertica’s product to be further optimized for cloud deployments. Cloud is also a big focus area at HP and we now have the unique opportunity to create the best cloud platform to run Vertica and to provide value-added solutions for big data analytics based on Vertica.

Q8. How does Vertica fit into HP’s data management strategy?

Shilpa Lawande: Vertica provides HP with a proven platform for big data analytics that can be deployed as software, an appliance, and on the cloud, all of which are key focus areas for HP’s business.
Big data is a growing market and the majority of the data is unstructured. The combination of Vertica and Autonomy gives HP a comprehensive “Information Platform” to manage, analyze, and derive value from the explosion in both structured and unstructured data.

________________________
Shilpa Lawande, VP of Engineering, Vertica.
Shilpa Lawande has been an integral part of the Vertica engineering team since its inception, bringing over 10 years of experience in databases, data warehousing and grid computing to Vertica. Prior to Vertica, she was a key member of the Oracle Server Technologies group where she worked directly on several data warehousing and self-managing features in the Oracle 9i and 10g databases.

Lawande is a co-inventor on several patents on query optimization, materialized views and automatic index tuning for databases. She has also co-authored two books on data warehousing using the Oracle database as well as a book on Enterprise Grid Computing.

Lawande has a Masters in Computer Science from the University of Wisconsin-Madison and a Bachelors in Computer Science and Engineering from the Indian Institute of Technology, Mumbai.
—————

ODBMS.org Resources on Analytical Data Platforms:
1. Blog Posts
2. Free Software
3. Articles

Related Posts

On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com

Analytics at eBay. An interview with Tom Fastner.

Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

##

Nov 2 11

On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com

by Roberto V. Zicari

“One of the core concepts of Big Data is being able to evolve analytics over time. In the new world of data analysis your questions are going to evolve and change over time and as such you need to be able to collect, store and analyze data without being constrained by resources. “ — Werner Vogels, Amazon.com

I wanted to know more on what is going on at Amazon.com in the area of Big Data and Analytics. For that, I have interviewed Dr. Werner Vogels, Chief Technology Officer and Vice President of Amazon.com

RVZ

Q1. In your keynote at the Strata Making Data Work Conference held this February in Santa Clara, California, you said that “Data and Storage should be unconstrained”. What did you mean by that?

Vogels: In the old world of data analysis you knew exactly which questions you wanted to ask, which drove a very predictable collection and storage model. In the new world of data analysis your questions are going to evolve and change over time and as such you need to be able to collect, store and analyze data without being constrained by resources.

Q2. You also claimed that “Big Data requires NO LIMIT”. However, Prof. Alex Szalay, talking about astronomy, warns us that “Data is everywhere, never be at a single location. Not scalable, not maintainable.” Will Big Data Analysis become a new Astronomy?

Vogels: Big Data is the hot topic for this year. With the rise of the internet, and the increasing number of consumers, researchers and businesses of all sizes getting online, the amount of data now available to collect, store, manage, analyze and share is growing. When companies come across such large amounts of data it can lead to data paralysis where they don’t have the resources to make effective use of the information.
To Alex’s point, it’s challenging to get the relevant data at the place where you want it to do your analysis. This is why we see many organizations putting their data in the cloud where it’s easily accessible for everyone.

Q3. You also quoted Jim Gray and mentioned the “Fourth Paradigm: Data Intensive Scientific Discovery”. What is it? Is Business Intelligence becoming more like Science for profit?

Vogels: The book I referenced was The Fourth Paradigm: Data-Intensive Scientific Discovery. This is a collection of essays that discusses the vision for data-intensive scientific discovery, which is the concept of shifting computational science to a data-intensive model where we analyze observations.
For Business Intelligence this means that analysis goes beyond finance and accountancy and helps companies continuously improve the service to their customers.

Q4. Michael Olson of Cloudera, talking about Analytical Data Platforms in a recent interview, said that “Cloud is a deployment detail, not fundamental. Where you run your software and what software you run are two different decisions, and you need to make the right choice in both cases.”
What is your opinion on this? What is in your opinion the relationships between Big Data Analysis and Cloud Computing?

Vogels: Big Data holds the promise of helping companies create a competitive advantage as, through data analysis, they learn how to better serve their customers. This is an approach that we have already applied for 15 years at Amazon.com and we have a solid understanding of all the challenges around managing and processing Big Data.

One of the core concepts of Big Data is being able to evolve analytics over time. For that, a company cannot be constrained by any resource. As such, Cloud Computing and Big Data are closely linked because for a company to be able to collect, store, organize, analyze and share data, they need access to infinite resources.
AWS customers are doing some really innovative things around dealing with Big Data. For example, the digital advertising and marketing firm Razorfish targets online adverts based on data from browsing sessions. A common issue Razorfish found was the need to process gigantic data sets. These large data sets are often the result of holiday shopping traffic on a retail website, or sudden dramatic growth on a media or social networking site.

Normally crunching these numbers would take them two days or more. By leveraging on-demand services such as Amazon Elastic MapReduce, Razorfish is able to drop their processing time to eight hours. There was no upfront investment in computing hardware, no procurement delay, and no additional operations staff hired. All this means Razorfish can offer multi-million dollar client service programs on a small business budget, helping them to increase their return on ad spend by 500%.

Q5. How has Amazon’s technology evolved over the past three years?

Vogels: Every day, Amazon Web Services adds enough new server capacity to support all of Amazon’s global infrastructure in the company’s fifth full year of operation, when it was a $2.76B annual revenue enterprise. Today we have hundreds of thousands of customers in over 190 countries—both startups and large companies. To give you an idea of the scale we’re talking about, Amazon S3 holds over 260 billion objects and regularly peaks at 200k requests per second.

Our pace of innovation has been rapid because of our relentless customer focus. Our process is to release a service into beta that is useful to a lot of people, get customer feedback and rapidly begin adding the bells and whistles based in large part on what customers want and need from the services. There’s really no substitute for the accelerated learning we’ve had from working with hundreds of thousands of customers with every imaginable use case.

We are also relentless about driving efficiencies and passing along the cost savings to our customers.
We’ve lowered our prices 12 times in the past 5 years with no competitive pressure to do so. We’re very comfortable with running high volume, low margin businesses which is very different than traditional IT vendors.

Q6. Amazon’s Dynamo is proprietary, however the publication of your seminal 2007 Dynamo paper was used as input for Open Source projects (e.g. Cassandra—which began as a fusion of Google’s Bigtable and Amazon’s Dynamo concepts). Why did Amazon allow this?

Vogels: Dynamo is internal technology developed at Amazon to address the need for an incrementally scalable, highly available key-value storage system. The technology is designed to give its users the ability to trade off cost, consistency, durability and performance, while maintaining high availability.
We found, though, that there had been some struggles with applying the concepts, so we published the paper as feedback to the academic community about what one needed to do to build realistic production systems.
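The paper's recipe for "incrementally scalable" is consistent hashing: keys and nodes are placed on the same hash ring, each key is owned by the next node on the ring plus a few successors for replication, and adding a node moves only the keys that fall into its new range. A minimal illustrative sketch of that idea (not Amazon's code):

import bisect
import hashlib

def h(value):
    # Map a string onto a large integer hash ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.ring = sorted((h(n), n) for n in nodes)   # (position, node) pairs

    def add_node(self, node):
        bisect.insort(self.ring, (h(node), node))      # only nearby keys need to move

    def nodes_for(self, key):
        # First node at or after the key's position, plus successors as replicas.
        positions = [p for p, _ in self.ring]
        start = bisect.bisect(positions, h(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replicas)]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.nodes_for("customer:42"))   # two nodes own the key (which two depends on the hashes)
ring.add_node("node-d")                # incremental scaling: other keys stay where they were
print(ring.nodes_for("customer:42"))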

Q7. What is the positioning of Amazon with respect to Open Source projects? Why didn’t you develop Open Source data platforms from the start, as for example Facebook and LinkedIn did?

Vogels: I believe that you need to pour your resources into areas where you can make important contributions, where you can provide the customer with the best possible experience. Our mission is to provide the one-man app developer or the 20,000 person enterprise with a platform of web services they can use to build sophisticated, scalable applications. I believe anything we can do to make AWS lower-cost and widely available will help the community tremendously.

Q8. Amazon Elastic MapReduce utilizes a hosted Hadoop framework running on Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Why choose Hadoop? Why not use already existing BI products?

Vogels: We chose Hadoop for several reasons. First, it is the only available framework that could scale to process 100s or even 1000s of terabytes of data and scale to installations of up to 4000 nodes. Second, Hadoop is open source and we can innovate on top of the framework and inside it to help our customers develop more performant applications more quickly. Third, we recognized that Hadoop was gaining substantial popularity in the industry, with multiple customers using Hadoop and many vendors innovating on top of Hadoop.

Three years later we believe we made the right choice. We also see that existing BI vendors such as Microstrategy are willing to work with us and integrate their solutions on top of Elastic MapReduce.

Q9. Looking at three elements: Data, Platform, Analysis, what are the main research challenges ahead? And what are the main business challenges ahead?

Vogels: I think that sharing is another important aspect to the mix. Collaborating during the whole process of collecting data, storing it, organizing it and analyzing it is essential. Whether it’s scientists in a research field or doctors at different hospitals collaborating on drug trials, they can use the cloud to easily share results and work on common datasets.

——–
Dr. Vogels is Vice President & Chief Technology Officer at Amazon.com where he is responsible for driving the company’s technology vision, which is to continuously enhance the innovation on behalf of Amazon’s customers at a global scale.

Prior to joining Amazon, he worked as a researcher at Cornell University where he was a principal investigator in several research projects that target the scalability and robustness of mission-critical enterprise computing systems. He has held positions of VP of Technology and CTO in companies that handled the transition of academic technology into industry.

Vogels holds a Ph.D. from the Vrije Universiteit in Amsterdam and has authored many articles for journals and conferences, most of them on distributed systems technologies for enterprise computing.

He was named the 2008 CTO of the Year by Information Week for his contributions to making Cloud Computing a reality. For his unique style in engaging customers, media and the general public, Dr. Vogels received the 2009 Media Momentum Personality of Year Award.

Related Posts

Analytics at eBay. An interview with Tom Fastner.

Google Fusion Tables. Interview with Alon Y. Halevy.

Interview with Jonathan Ellis, project chair of Apache Cassandra.

Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

The evolving market for NoSQL Databases: Interview with James Phillips.

Objects in Space vs. Friends in Facebook.

#

Oct 24 11

vFabric SQLFire: Better than RDBMS and NoSQL?

by Roberto V. Zicari

“ We believe the ability to do transactions will remain important to app developers.” — Blake Connell, VMware.

vFabric SQLFire beta was recently announced by VMware as part of their Cloud Application Platform, called vFabric 5. I wanted to know more about it. I have therefore interviewed Blake Connell, Sr. Product Marketing Manager, vFabric at VMware.

RVZ

Q1. What is vFabric SQLFire beta?

Blake Connell: SQLFire is a memory-optimized distributed database. The beta is available for anyone to download and try while we complete the initial release.

Q2. What are the main technical challenges that vFabric SQLFire is supposed to solve?

Blake Connell: SQLFire addresses speed, scalability and availability. SQLFire is memory-optimized for maximum speed, which matters because users are becoming more and more demanding about the responsiveness of web sites and applications.

SQLFire is designed for horizontal scalability: if you need more performance or need to store more data, you simply add more members to the SQLFire distributed system. SQLFire extends SQL with new keywords that describe how tables should be partitioned and replicated, so as you add nodes the system intelligently rebalances itself for high performance, and it’s all completely transparent to the application. This is much more compelling than having to shard your database manually and build shard-awareness into your application.

Finally, SQLFire uses a shared-nothing architecture to ensure availability. If a node in the distributed system dies, applications can simply talk to another node. In fact, the database drivers shipped with SQLFire will transparently reconnect to a working node without the application needing to do anything; the app doesn’t even have to retry the request. A key compelling characteristic of SQLFire is that it enables developers to work with the well-known SQL programming interface, making adoption much more seamless and far less disruptive than alternatives.

Q3. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?

Blake Connell: A big part of the way we address scalability is through horizontal scale-out and automatic partitioning/replication of data, as I mentioned. But there are some other very important design choices we’ve made to build a scalable database.
First, we believe the ability to do transactions will remain important to app developers. However, we believe that the vast majority of transactions are small, both in time required and in the amount of data affected. We’ve built a transaction system based on these assumptions that completely avoids the need for a centralized coordinator or lock manager. In fact, each node can act as its own transaction coordinator, even when the data in the transaction lives on multiple nodes. In this way you get transactions with linear scalability. If you take this along with our partitioning scheme, the end result is a relational database that is much more scalable than a traditional RDBMS.

Q4. What are the main results of your performance benchmark for SQLFire?

Blake Connell: In our preliminary performance testing of vFabric SQLFire, the results show the product achieves near-linear scalability as the cluster expands while CPU utilization remains steady, which indicates SQLFire can accommodate even more load without exhausting CPU resources.
In the same test, two thirds of all queries complete in under 1 ms, roughly 88% complete in under 2 ms, and 97% take under 5 ms, which is extremely fast.

Q5. How do you handle structured and unstructured data?

Blake Connell: vFabric SQLFire handles structured data. A related offering, vFabric GemFire handles unstructured data.

Q6. How do you use vFabric SQLFire from Java applications and/ or from ADO.NET?

Blake Connell: We provide JDBC and ADO.NET drivers for you to embed in your application. After that you just use mostly standard SQL. We have DDL extensions, so for example when you create a table you can specify how it should be partitioned, replicated and so forth, but the DML is just standard SQL.
The DDL extensions are all optional and don’t interfere with standard DDL.
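A sketch of what that might look like: the clauses below are meant to illustrate the kind of optional partitioning/replication DDL extension described here, not to quote exact SQLFire syntax, and the DML stays plain SQL that any application using the shipped JDBC or ADO.NET drivers would send unchanged.

# Illustrative only: the DDL-extension idea written out as SQL text. The
# PARTITION BY / REDUNDANCY clauses may not match the exact SQLFire keywords;
# with a driver in hand this is ordinary database code, e.g.
#   cursor.execute(CREATE_ORDERS); cursor.execute(QUERY, (42,))
CREATE_ORDERS = """
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT,
    amount      DECIMAL(10, 2)
)
PARTITION BY COLUMN (customer_id)  -- spread rows across the cluster by key
REDUNDANCY 1                       -- keep one extra copy of each partition for HA
"""

QUERY = "SELECT SUM(amount) FROM orders WHERE customer_id = ?"

print(CREATE_ORDERS)
print(QUERY)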

Q7. Could you give examples of applications that use SQLFire beta?

Blake Connell: We have a financial services trading application exploring vFabric SQLFire.
The application dramatically increases the speed at which it can take a customer order and send it to a stock exchange while maintaining reliability and security.

Q8. Why use vFabric SQLFire instead of one of the existing NoSQL or relational databases?

Blake Connell: vFabric SQLFire offers capabilities not found in traditional RDBMS or NoSQL solutions.
Comparing vFabric SQLFire to the traditional RDBMS:
– Data can be partitioned and/or replicated in memory across a cluster of machines to deliver scale-out benefits and high availability (HA)
– Not constrained by strict adherence to ACID properties. Developers uphold various degrees of ACID properties based on desired levels of data availability, consistency and performance.
– Optimized key-based access

Comparing vFabric SQLFire to NoSQL:
– vFabric SQLFire is focused on data consistency with scale-out and HA, whereas NoSQL offerings often provide eventually consistent data models.
– vFabric SQLFire supports queries and joins with standard SQL syntax.

Q9. How vFabric SQLFire fits into VMware products strategy?

Blake Connell: At VMware, our approach to data is really two-fold:
1. Leverage virtualization to automate database deployment and operations.
2. Provide new approaches to data beyond the traditional one-size-fits-all database model.

vFabric SQLFire and vFabric GemFire are key data offerings from VMware that provide alternative data approaches for modern applications. Innovation is occurring at all layers of IT, including the database layer. vFabric GemFire is a memory-oriented data fabric. Java developers using the Spring Framework or .NET developers can leverage an API to write applications to vFabric GemFire. vFabric SQLFire enables teams to leverage their skills and investment in SQL knowledge and gain the benefits of distributed, in-memory, shared-nothing architectures.

Our recently released vFabric Data Director leverages vSphere virtualization to provide Database-as-a-Service with centralized back-up, management and governance. This is a huge step forward for both developers looking for fast and simple access to the database resources and for IT teams who need a manageable approach to providing data services.

Q10. What is next in vFabric SQLFire?

Blake Connell: In the first half of 2012 we anticipate enhancing SQLFire to include asynchronous WAN replication. This will let you run a single SQLFire database in multiple datacenters with high performance and low latency.

Related Posts

MariaDB: the new MySQL? Interview with Michael Monty Widenius.

On Versant`s technology. Interview with Vishal Bagga.

Measuring the scalability of SQL and NoSQL systems.

The evolving market for NoSQL Databases: Interview with James Phillips.

#

Oct 6 11

Analytics at eBay. An interview with Tom Fastner.

by Roberto V. Zicari

“How much more complexity can human developers and organizations deal with?”— Tom Fastner, eBay.

Much has already been written about analytics at eBay. But what is the current status? Which data platforms and data management technologies do they currently use? I asked a few questions to Tom Fastner, a Senior Member of Technical Staff, and Architect with eBay.

RVZ

Q1. What are the main technical challenges for big data analytics at eBay?

Tom Fastner: The primary challenges are:

I/O bandwidth: limited due to configuration of the nodes.

Concurrency/workload management: Workload management tools usually manage the limited resource. For many years EDW systems bottlenecked on the CPU; big systems are now configured with ample CPU, making I/O the bottleneck. Vendors are starting to put mechanisms in place to manage I/O, but it will take some time to get to the same level of sophistication.

Data movement (loads, initial loads, backup/restores): As new platforms emerge, you need to make data available on more systems, challenging networks, movement tools and support to ensure scalable operations that maintain data consistency.

Q2. What are the current Metrics of eBay`s main data warehouses?

Tom Fastner: We have 3 different platforms for Analytics:

A) EDW: Dual systems for transactional (structured) data; Teradata 3.5PB and 2.5 PB spinning disk; 10+ years experience; very high concurrency; good accessibility; hundreds of applications.

B) Singularity: deep Teradata system for semi-structured data; 36 PB spinning disk; lower concurrency than EDW, but can store more data; biggest use case is User Behavior Analysis; largest table is 1.2 PB with ~1.9 trillion rows.

C) Hadoop: for unstructured/complex data; ~40 PB spinning disk; text analytics, machine learning; has the User Behavior data and selected EDW tables; lower concurrency and utilization.

Q3. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?

Tom Fastner: EDW: We model for the unknown (close to 3rd NF) to provide a solid physical data model suitable for many applications, which limits the number of physical copies needed to satisfy specific application requirements. A lot of scalability and performance is built into the database, but as with any shared resource it does require an excellent operations team to fully leverage the capabilities of the platform.

Singularity: The platform is identical to EDW; the only exceptions are limitations in the workload management due to configuration choices. But since we are leveraging the latest database release we are exploring ways to adopt new storage and processing patterns. Some new data sources are stored in a denormalized form, significantly simplifying data modeling and ETL. On top of that, we developed functions to support the analysis of the semi-structured data. This also enables more sophisticated algorithms that would be very hard, inefficient or impossible to implement with pure SQL. One example is the pathing of user sessions. However, the size of the data requires us to focus more on best practices (develop on small subsets, use a 1% sample, process by day).
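As an illustration of what "pathing of user sessions" involves (a hand-written sketch, not eBay's implementation): raw click events are grouped per user, split into sessions whenever the gap between clicks exceeds a timeout, and each session is reduced to the ordered path of pages visited.

from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60   # start a new session after a 30-minute gap

events = [  # (user_id, epoch_seconds, page)
    ("u1", 1000, "home"), ("u1", 1060, "search"), ("u1", 1120, "item"),
    ("u1", 9000, "home"), ("u1", 9050, "cart"),
    ("u2", 2000, "home"), ("u2", 2100, "item"),
]

def session_paths(events):
    by_user = defaultdict(list)
    for user, ts, page in events:
        by_user[user].append((ts, page))

    paths = []
    for user, clicks in by_user.items():
        clicks.sort()                       # order each user's clicks by time
        session, last_ts = [], None
        for ts, page in clicks:
            if last_ts is not None and ts - last_ts > SESSION_GAP_SECONDS:
                paths.append((user, " > ".join(session)))   # close the session
                session = []
            session.append(page)
            last_ts = ts
        paths.append((user, " > ".join(session)))
    return paths

print(session_paths(events))
# [('u1', 'home > search > item'), ('u1', 'home > cart'), ('u2', 'home > item')]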

Hadoop: The emphasis on Hadoop is on optimizing for access. The reusability of data structures (besides “raw” data) is very low.

Q4: How do you handle un-structured data?

Tom Fastner: Un-structured data is handled on Hadoop only. The data is copied from the source systems into HDFS for further processing. We do not store any of that on the Singularity (Teradata) system.

Q5. What kind of data management technologies do you use? What is your experience in using them?

Tom Fastner: ETL: AbInitio, home grown parallel Ingest system.
Scheduling: UC4.
Repositories: Teradata EDW; Teradata Deep system; Hadoop.
BI: Microstrategy, SAS, Tableau, Excel.
Data modeling: Power Designer.
Adhoc: Teradata SQL Assistant; Hadoop Pig and Hive.
Content Management: Joomla based.

In regard to tool capabilities, there is always something you might wish for. In some cases the tool of choice might even have an important capability, but we have not implemented it (due to resources, complexity, priorities, bugs). The way we try to address required features is through partnerships with some of those vendors.
The most mature partnership is with Teradata; other good examples are Tableau and Microstrategy, where we have regular meetings to discuss future enhancements. However, I do not feel comfortable rating all those tools and pointing out shortcomings.

Q6. Cloud computing and open source: Do you they play a role at eBay? If yes, how?

Tom Fastner: We do leverage internal cloud functions for Hadoop; no cloud for Teradata.
Open source: committers for Hadoop and Joomla; strong commitment to improve those technologies.

Q7. Glenn Paulley of Sybase in a recent keynote asked the question of how much more complexity can database systems deal with? What is your take on this?

Tom Fastner: I am not familiar with Glenn Paulley’s presentation, so I can’t respond in that context. But in my role at eBay I like to take into account the full picture: How much more complexity can human developers and organizations deal with?

As we learn to deal with more complex data structures and algorithms using specialized solutions, we will look for ways to make them repeatable and available for a wider audience as they mature and demand grows in the market. Specialized solutions do come at a high price (TCO), requiring separate hardware and/or software stack, data replication, data management issues, computing specialists.

Q8. Anything else you wish to add?

Tom Fastner: eBay is rapidly changing, and analytics is driving many key initiatives like buyer experience, search optimization, buyer protection or mobile commerce. We are investing heavily in new technologies and approaches to leverage new data sources to drive innovation.

———–
Tom Fastner is a Senior Member of Technical Staff, Architect with eBay. In this role Tom works on the architecture of the analytical platforms and related tools. Currently he spends most of his time driving innovation to process Big Data, the Singularity© system. Before Tom joined eBay he was with NCR/Teradata for 11 years. He was involved in the Dual Active program from the beginning in 2003 as the technical lead of the first global implementation of Dual Active. Prior to NCR/Teradata Tom worked for debis Systemhaus with several clients in the aerospace sector. He holds a Masters in Computer Science from the Technical University of Munich, Germany, and is a Teradata Certified Master.
————-

Related Posts

Measuring the scalability of SQL and NoSQL systems.

Interview with Jonathan Ellis, project chair of Apache Cassandra.

Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

The evolving market for NoSQL Databases: Interview with James Phillips.