
Jul 11 12

On Analyzing Unstructured Data. — Interview with Michael Brands.

by Roberto V. Zicari

“The real difference will be made by those companies that will be able to fully exploit and integrate their structured and unstructured data into so-called active analytics. With Active Analytics enterprises will be able to use both quantitative and qualitative data and drive action based on a full understanding of 100% of their data.” – Michael Brands.

It is reported that 80% of all data in an enterprise is unstructured information. How do we manage unstructured data? I have interviewed Michael Brands, an expert on analyzing unstructured data and currently a senior product manager for the i.Know technology at InterSystems.

RVZ

Q1. It is commonly said that more than 80% of all data in an enterprise is unstructured information. Examples are telephone conversations, voicemails, emails, electronic documents, paper documents, images, web pages, video and hundreds of other formats. Why is unstructured data important for an enterprise?

Michael Brands: Well, unstructured data is important to organizations in at least three ways.
First of all, 90% of what people do in a business day is unstructured, and the results of most of these activities can only be captured as unstructured data.
Second, it is generally acknowledged in the modern economy that knowledge is a company's biggest asset, and most of this knowledge, since it is developed by people, is recorded in unstructured formats.

The last and maybe most unexpected argument underpinning the importance of unstructured data is that large research organizations such as Gartner and IDC state that “80% of business is conducted on unstructured data.”
If we take these three elements together, it is surprising that most organizations invest heavily in business intelligence applications to improve their business, yet these applications only cover a very small portion of the data (20% in the most optimistic estimation) that is actually important for their business.
Looking at this from a different perspective, we think enterprises that really want to lead and make a difference will invest heavily in technologies that help them understand and exploit their unstructured data, because if we only look at the numbers (and that is the small portion of data most enterprises already understand very well), unstructured data is the area where the difference will be made over the next couple of years.
However, the real difference will be made by those companies that will be able to fully exploit and integrate their structured and unstructured data into so-called active analytics. With Active Analytics enterprises will be able to use both quantitative and qualitative data and drive action based on a full understanding of 100% of their data.
At InterSystems we have a unique technology offering that was designed specifically to help our customers and partners do exactly that, and we are proud that the partners who deploy the technology to fully exploit 100% of their data make a real difference in their market and grow much faster than their competitors.

Q2. What is the main difference between semi-structured and unstructured information?

Michael Brands: The very short and bold answer to this question would be that semi-structured is just a euphemism for unstructured.
However, a more in-depth answer is that semi-structured data is a combination of structured and unstructured data in the same data channel.
Typically, semi-structured data comes out of forms that provide specific free-text areas to describe specific parts of the required information. This way a “structured” (meta)data field describes, with a fair degree of abstraction, the contents of the associated text field.
A typical example will help to clarify this: in an electronic medical record system, the notes section in which a doctor can record observations about a specific patient in free text is typically semi-structured. This means the doctor does not have to write all observations in one text but can “categorize” them under different headers such as “Patient History”, “Family History”, “Clinical Findings”, “Diagnosis” and more.
Subdividing such text-entry environments into a series of different fields with a fixed header is a very common example of semi-structured data.
Another very popular example of semi-structured data is e-mail, mp3 or video data. These data types contain mainly unstructured data, but the unstructured data is always attached to more structured data such as author, subject or title, summary, etc.

Q3. The most common example of unstructured data is text. Several applications store portions of their data as unstructured text that is typically implemented as plain text, in rich text format (RTF), as XML, or as a BLOB (Binary Large Object). It is very hard to extract meaning from this content. How can iKnow help here?

Michael Brands: iKnow can help here in a very specific and unique way because it is able to structure these texts into chains of concepts and relations.
What this means is that iKnow will be able to tell you without prior knowledge what the most important concepts in these texts are and how they are related to each other.
This is why, when we talk about iKnow, we say the technology is proactive.
Any other technology that analyzes text needs a domain-specific model (statistical, ontological or syntactical) containing a lot of domain-specific knowledge in order to make sense of the texts it is supposed to analyze. iKnow, thanks to its unique way of splitting sentences into concepts and relations, doesn't need this.
It fully automatically performs the analysis and highlighting tasks students usually perform as a first step in understanding and memorizing a course textbook.

Q4. How do you exactly make conceptual meaning out of unstructured data? Which text analytical methods do you use for that?

Michael Brands: The process we use to extract meaning out of texts is unique for the following reason: we do not split sentences into individual words and then try to recombine these words by means of a syntactic parser, an ontology (which essentially is a dictionary combined with a hierarchical model that describes a specific domain), or a statistical model. What iKnow does instead is split sentences by identifying the relational words or word groups in a sentence.
This approach is based on a couple of long known facts about language and communication.

First of all, analytical semantics discovered years ago that every sentence is nothing more than a chain of conceptual word groups (often called noun phrases or prepositional phrases in formal linguistics) tied together by relations (often called verb phrases in formal linguistics). So semantically a sentence is always built as a chain: a concept followed by a relation, followed by another concept, followed by another relation and another concept, and so on.
This basic conception of a binary sentence structure consisting of noun-headed phrases (concepts) and verb-headed phrases (relations) is at the heart of almost all major approaches to automated syntactic sentence analysis. However, state-of-the-art analysis algorithms use this knowledge only to construct second-order syntactic dependency representations of a sentence, rather than to effectively analyze its meaning.

A second important discovery underpinning the iKnow approach comes from behavioral psychology and neuropsychiatry: humans understand and operate with only a very small set of different relations to express links between facts, events, or thoughts. Not only is the set of relations people use and understand very limited, it is also universal. In other words, people use only a limited number of different relations, and these relations are the same for everybody, regardless of language, education, or cultural background.
This discovery can teach us a lot about how basic mechanisms for learning, such as derivation and inference, work. More importantly for our purposes, we can derive from it that, in sharp contrast with the set of concepts, which is infinite and has different subsets for each specific domain, the set of relations is limited and universal.
The combination of these two elements, namely the basic binary concept-relation structure of language and the universality and limited size of the set of relations, led to the development of the iKnow approach after a thorough analysis of many state-of-the-art techniques.
Our conclusion from this analysis is that the main problem of all classical approaches to text analysis is that they focus essentially on the endless and domain-specific set of concepts, because they were mostly created to serve the specific needs of a specific domain.
Thanks to this domain-specific focus, the number of elements a system needs to know upfront can be controlled. Nevertheless, a “serious” application quickly integrates several million different concepts. This need for large collections of predefined concepts to describe the application domain, commonly called dictionaries, taxonomies or ontologies, leads to a couple of serious problems.
First of all, the time needed to set up and tune such applications is substantial and expensive because domain experts are needed to come up with the most appropriate concepts. Second, the footprint of these systems is rather big, and their maintenance is costly and time-consuming because specialists need to follow what is going on in the domain and adapt the knowledge of the application.
Third, it is very difficult to open up a domain-specific application to other domains, because in those domains concepts might have different meanings or even contradict each other, which can create serious problems at the level of the parsing logic.
Therefore iKnow was built to perform a completely different kind of analysis: by focusing on the relations, we can build systems with a very small footprint (an average language model contains only several tens of thousands of relations and a very small number of context-based disambiguation rules).
Moreover, the system is not domain-specific; it can work with data from very different domains at the same time and does not need expert input. Splitting sentences by means of relations, and resolving the ambiguous cases (those in which a word or word group can express both a concept and a relation, e.g. “walk” is a concept in “Brussels-Paris would be quite a walk” and a relation in “Pete and Mary walk to school”) with rules that use the function (concept or relation) of the surrounding words or word groups, is a computationally very efficient and fast process. It also ensures a system that learns as it analyzes more data, because it effectively “learns” the concepts from the texts by identifying them as the groups of words between the relations, before the first relation, and between the last relation and the end of the sentence.
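To make the relation-first splitting described above concrete, here is a deliberately tiny sketch in Python. The hard-coded relation set and the function name are inventions for this illustration, not the actual iKnow engine or API; a real language model contains tens of thousands of relations plus the context-based disambiguation rules (e.g. for words like “walk”) that this toy omits.

# Toy illustration of relation-first sentence splitting (not the actual iKnow engine).
# Assumption: a tiny hard-coded relation lexicon stands in for a real language model,
# and the context rules that disambiguate words like "walk" are omitted.
RELATIONS = {"is", "are", "walk", "walks", "to", "with", "of"}

def split_concepts_and_relations(sentence):
    """Split a sentence into an alternating chain of concept and relation word groups."""
    chain, current, current_kind = [], [], None
    for word in sentence.lower().rstrip(".").split():
        kind = "relation" if word in RELATIONS else "concept"
        if kind != current_kind and current:
            chain.append((current_kind, " ".join(current)))
            current = []
        current.append(word)
        current_kind = kind
    if current:
        chain.append((current_kind, " ".join(current)))
    return chain

print(split_concepts_and_relations("Pete and Mary walk to school"))
# [('concept', 'pete and mary'), ('relation', 'walk to'), ('concept', 'school')]

The point of the sketch is only that the engine never needs a dictionary of concepts: whatever sits between two relation groups is, by definition, a concept.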

Q5. How “precise” is the meaning you extract from unstructured data? Do you have a way to validate it?

Michael Brands: This is a very interesting question because it raises two very difficult topics in the area of semantic data analysis, namely: how do you define precision, and how do you evaluate results generated by semantic technologies?
If we use the classical definition of precision in this area, it describes what percentage of the documents returned by a system, in response to a query asking for documents about certain concepts, actually contains useful information about those concepts.
Based on this definition of precision, we can say iKnow scores very close to 100%, because it outperforms competing technologies in its efficiency at detecting which words in a sentence belong together and form meaningful groups, and how they relate to each other.
Even if we use other, more challenging definitions of precision, such as the syntactic or formal correctness of the word groups identified by iKnow, we score very high percentages, but evidently we depend on the quality of the input. If the input does not use punctuation marks accurately or contains a lot of non-letter characters, that will affect our precision. Moreover, how precision is perceived and defined varies a lot from one use case to another.
Evaluation is a very complex and subjective operation in this area, because what is considered good or bad heavily depends on what people want to do with the technology and what their background is. So far we let our customers and partners decide, after an evaluation period, whether the technology does what they expect from it, and we have not had any “no-gos” yet.

Q6. How do you process very large scale archives of data?

Michael Brands: The architecture of the system has been set up to be as flexible as possible and to make sure processes can be executed in parallel where possible and desirable. Moreover, the system provides different modes for loading data: a batch load, especially designed to pump large amounts of existing data, such as document archives, into a system as fast as possible; a single-source load, especially designed to add individual documents to a system at transactional speed; and a small-batch mode to add limited sets of documents to a system in one process.
On top of that, the loading architecture foresees different steps in the loading process: the data to be loaded needs to be listed or staged; the data can be converted (meaning the data to be indexed can be adapted to get better indexing and analysis results); and, of course, the data is loaded into the system.
These different steps can partially be done in parallel and in multiple processes to ensure the best possible performance and flexibility.
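To make the staged flow concrete, here is a hedged sketch of a list/stage, convert and load pipeline in Python, with the convert step parallelized. The function names and the file-based staging are illustrative assumptions for this example, not the actual iKnow or Caché loading API.

# Illustrative sketch of a staged, partially parallel loading pipeline
# (list/stage -> convert -> load), loosely mirroring the modes described above.
# All function names are hypothetical, not the actual iKnow/Caché API.
from concurrent.futures import ThreadPoolExecutor

def stage(paths):
    """List/stage the sources that should be loaded."""
    return [p for p in paths if p.endswith(".txt")]

def convert(path):
    """Optionally adapt a source so it indexes better (e.g. normalize line endings)."""
    with open(path, encoding="utf-8") as f:
        return path, f.read().replace("\r\n", "\n")

def load(doc):
    """Hand one converted document to the (hypothetical) indexing engine."""
    path, text = doc
    print(f"indexing {path}: {len(text)} characters")

def batch_load(paths, workers=4):
    staged = stage(paths)                      # step 1: staging
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for doc in pool.map(convert, staged):  # step 2: convert in parallel
            load(doc)                          # step 3: load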

Q7. On the one hand we have mining of text data, and on the other hand we have database transactions on structured data: how do you relate them to each other?

Michael Brands: Well, there are two different perspectives to this question.
On the one hand, it is important to underline that all textual data indexed with iKnow can be used as if it were structured data, because the API provides methods that allow you to query the textual data the same way you would query traditional row-column data. These methods come in three flavors: they can be called as native Caché ObjectScript methods, they can be called from within a SQL environment as stored procedures, and they are also available as web services.

On the other hand, all structured data that has a link with the indexed texts can be used as metadata within iKnow. Based on this structured metadata, filters can be created and used within the iKnow API to make sure the API returns exactly the results you need.
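Since the answer notes that these methods are also exposed as SQL stored procedures, a minimal sketch of what such a call could look like from Python over ODBC follows. The DSN, stored-procedure name, result columns and filter expression are hypothetical placeholders, not the documented iKnow API; they only illustrate the pattern of querying indexed text together with a metadata filter.

# Hedged illustration of calling text-analysis results through SQL, as described above.
# The DSN, procedure name, parameters and result columns are hypothetical placeholders;
# consult the actual iKnow documentation for real signatures.
import pyodbc

conn = pyodbc.connect("DSN=MY_CACHE_DSN")        # assumption: an ODBC DSN to the Caché instance
cursor = conn.cursor()
cursor.execute(
    "CALL MyDomain.GetTopConcepts(?, ?)",        # hypothetical stored procedure
    ("department = 'cardiology'", 20),           # metadata filter + number of concepts wanted
)
for concept, frequency in cursor.fetchall():     # assumed two-column result set
    print(concept, frequency)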

_______________________
Michael Brands founded i.Know NV, a company specialized in analyzing unstructured data. In 2010 InterSystems acquired i.Know, and he has since served as senior product manager for the i.Know technology at InterSystems.
i.Know’s technology is embedded in the InterSystems technology platform.

Related Posts

Managing Big Data. An interview with David Gorbet (July 2, 2012)

Big Data: Smart Meters — Interview with Markus Gerdes (June 18, 2012)

Big Data for Good (June 4, 2012)

On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum (February 1, 2012)

On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica (November 16, 2011)

On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com (November 2, 2011)

Analytics at eBay. An interview with Tom Fastner (October 6, 2011)


##

Jul 2 12

Managing Big Data. An interview with David Gorbet

by Roberto V. Zicari

“Executives and industry leaders are looking at the Big Data issue from a volume perspective, which is certainly an issue – but the increase in data complexity is the biggest challenge that every IT department and CIO must address, and address now.” — David Gorbet.

Managing unstructured Big Data is a challenge and an opportunity at the same time. I have interviewed David Gorbet, vice president of product strategy at MarkLogic.

RVZ

Q1. You have been quoted saying that “more than 80 percent of today’s information is unstructured and it’s typically too big to manage effectively.” What do you mean by that?

David Gorbet: It used to be the case that all the data an organization needed to run its operations effectively was structured data that was generated within the organization. Things like customer transaction data, ERP data, etc.
Today, companies are looking to leverage a lot more data from a wider variety of sources both inside and outside the organization. Things like documents, contracts, machine data, sensor data, social media, health records, emails, etc. The list is endless really. A lot of this data is unstructured, or has a complex structure that’s hard to represent in rows and columns.
And organizations want to be able to combine all this data and analyze it together in new ways. For example, we have more than one customer in different industries whose applications combine geospatial vessel location data with weather and news data to make real-time mission-critical decisions.

MarkLogic was early in recognizing the need for data management solutions that can handle a huge volume of complex data in real time. We started the company a decade ago to solve this problem, and we now have over 300 customers who have been able to build mission-critical real-time Big Data Applications to run their operations on this complex unstructured data.
This trend is accelerating as businesses all over the world are realizing that their old relational technology simply can’t handle this data effectively.

Q2. In your opinion, how is the Big Data movement affecting the market demand for data management software?

David Gorbet: Executives and industry leaders are looking at the Big Data issue from a volume perspective, which is certainly an issue – but the increase in data complexity is the biggest challenge that every IT department and CIO must address, and address now.
Businesses across industries have to not only store the data but also be able to leverage it quickly and effectively to derive business value.
We allow companies to do this better than traditional solutions, and that’s why our customer base doubled last year and continues to grow rapidly. Big Data is a major driver for the acquisition of new technology, and companies are taking action and choosing us.

Q3. Why do you see MarkLogic as a replacement of traditional database systems, and not simply as a complementary solution?

David Gorbet: First of all, we don’t advocate ripping out all your infrastructure and replacing it with something new. We recognize that there are many applications where traditional relational database technology works just fine. That said, when it came time to build applications to process large volumes of complex data, or a wide variety of data with different schemas, most of our customers had struggled with relational technology before coming to us for help.
Traditional relational database systems just don’t have the ability to handle complex unstructured data like we do. Relational databases are very good solutions for managing information that fits in rows and columns, however businesses are finding that getting value from unstructured information requires a totally new approach.
That approach is to use a database built from the ground up to store and manage unstructured information, and allow users to easily access the data, iterate on the data, and to build applications on top of it that utilize the data in new and exciting ways. As data evolves, the database must evolve with it, and MarkLogic plays a unique role as the only technology currently in the market that can fulfill that need.

Q4. How do you store and manage unstructured information in MarkLogic?

David Gorbet: MarkLogic uses documents as its native data type, which is a new way of storing information that better fits how information is already “shaped.”
To query, MarkLogic has developed an indexing system using techniques from search engines to perform database-style queries. These indexes are maintained as part of the insert or update transaction, so they’re available in real-time with no crawl delay.
For Big Data, search is an important component of the solution, and MarkLogic is the only technology that combines real-time search with database-style queries.
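As an illustration of the document model and combined search/query described above, here is a hedged sketch using the REST interface mentioned later in this interview. The host, port, credentials and document URI are placeholders, and the /v1/documents and /v1/search endpoints are assumptions to be checked against the MarkLogic documentation for your version.

# Hedged sketch: store a document in MarkLogic over REST, then run a combined
# full-text / database-style search. Host, port, credentials and URI are
# placeholders; verify endpoint details against the MarkLogic documentation.
import requests
from requests.auth import HTTPDigestAuth

auth = HTTPDigestAuth("rest-user", "password")   # placeholder credentials
base = "http://localhost:8000"                   # placeholder REST server

# Store a JSON document; indexes are updated as part of the insert transaction.
requests.put(
    f"{base}/v1/documents",
    params={"uri": "/contracts/1234.json"},
    json={"customer": "Acme", "status": "signed", "value": 125000},
    auth=auth,
).raise_for_status()

# Combine full-text search with a structured constraint in one real-time query.
hits = requests.get(
    f"{base}/v1/search",
    params={"q": "Acme AND signed", "format": "json"},
    auth=auth,
).json()
print(hits.get("total"))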

Q5. Would you define MarkLogic as an XML Database? A NoSQL database? Or other?

David Gorbet: MarkLogic is a Big Data database, optimized for large volumes of complex structured or unstructured data.
We’re non-relational, so in that sense we’re part of the NoSQL movement, however we built our database with all the traditional robust database functionality you’d expect and require for mission-critical applications, including failover for high availability, database replication for disaster recovery, journal archiving, and of course ACID transactions, which are critical to maintain data integrity.
If you think of what a next-generation database for today’s data should be, that’s MarkLogic.

Q6. MarkLogic has been working on techniques for storing and searching semantic information inside MarkLogic, and you have been running the Billion Triple Challenge, and the Lehigh University Benchmark. What were the main results of these tests?

David Gorbet: The testing showed that we could load 1 billion triples in less than 24 hours using approximately 750 gigabytes of disk and 150 gigabytes of RAM. Our LUBM query performance was extremely good, and in many cases superior, when compared to the performance from existing relational systems and dedicated triple stores.

Q7. Do you plan in the future to offer an open source API for your products?

David Gorbet: We have a thriving community of developers at community.marklogic.com, where we make many of the tools, libraries, connectors, etc. that sit on top of our core server available for free, and in some cases as open source projects living on the social coding site GitHub.
For example, we publish the source for XCC, our connector for Java or .NET applications, and we have an open-source REST API there as well.

Q8. James Phillips from Couchbase said in an interview last year: “It is possible we will see standards begin to emerge, both in on-the-wire protocols and perhaps in query languages, allowing interoperability between NoSQL database technologies similar to the kind of interoperability we’ve seen with SQL and relational database technology.” What is your opinion on that?

David Gorbet: MarkLogic certainly sees the value of standards, and for years we’ve worked with the World Wide Web Consortium (W3C) standards groups in developing the XQuery and XSLT languages, which are used by MarkLogic for query and transformation. Interoperability helps drive collaboration and new ideas, and supporting standards will allow us to continue to be at the forefront of innovation.

Q9. MarkLogic and Hortonworks last March announced a partnership to enhance Real-Time Big Data Applications with Apache Hadoop. Can you explain how technically the combination of MarkLogic and Hadoop will work?

David Gorbet: Hadoop is a key technology for Big Data, but doesn't provide the real-time capabilities that are vital for the mission-critical nature of so many organizations. MarkLogic brings that power to Hadoop, and is executing its Hadoop vision in stages.
Last November, MarkLogic introduced its Connector for Hadoop, and in March 2012, announced a partnership with leading Hadoop vendor Hortonworks. The partnership enables organizations in both the commercial and public sectors to seamlessly combine the power of MapReduce with MarkLogic’s real-time, interactive analysis and indexing on a single, unified platform.

With MarkLogic and Hortonworks, organizations have a fully supported big data application platform that enables real-time data access and full-text search together with batch processing and massive archival storage.
MarkLogic will certify its connector for Hadoop against the Hortonworks Data Platform, and the two companies will also develop reference-architectures for MarkLogic-Hadoop solutions.

Q10. How do you identify new insights and opportunities in Big Data without having to write more code and wait for the batch process to complete?

David Gorbet: The most impactful Big Data Applications will be industry- or even organization-specific, leveraging the data that the organization consumes and generates in the course of doing business. There is no single set formula for extracting value from this data; it will depend on the application.
That said, there are many applications where simply being able to comb through large volumes of complex data from multiple sources via interactive queries can give organizations new insights about their products, customers, services, etc.
Being able to combine these interactive data explorations with some analytics and visualization can produce new insights that would otherwise be hidden.
We call this Big Data Search.
For example, we recently demonstrated an application at MarkLogic World that shows, through real-time co-occurrence analysis, new insights about how products are being used. In our example, analysis of social media revealed that Gatorade is closely associated with flu and fever, and our ability to drill seamlessly from high-level aggregate data into the actual source social media posts shows that many people actually take Gatorade to treat flu symptoms. Geographic visualization shows that this phenomenon may be regional. Our ability to sift through all this data in real time, using fresh data gathered from multiple sources both internal and external to the organization, helps our customers identify new actionable insights.
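To show the kind of co-occurrence analysis being described, here is a minimal sketch over a toy list of social posts. It is purely illustrative: a plain counter over a handful of strings stands in for MarkLogic's real-time aggregate queries over live data.

# Minimal sketch of term co-occurrence analysis over social posts, as described above.
# Toy data and a plain counter stand in for the real-time aggregate queries mentioned.
from collections import Counter
from itertools import combinations

posts = [
    "gatorade helped my flu symptoms",
    "fever all week, living on gatorade",
    "gatorade after the marathon",
]

pairs = Counter()
for post in posts:
    terms = sorted(set(post.lower().replace(",", "").split()))
    pairs.update(combinations(terms, 2))

# The most frequent term pairs hint at associations worth drilling into.
for (a, b), n in pairs.most_common(5):
    print(a, b, n)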

———–
David Gorbet is the vice president of product strategy for MarkLogic.
Gorbet brings almost two decades of experience delivering some of the highest-volume applications and enterprise software in the world. Prior to MarkLogic, Gorbet helped pioneer Microsoft's business online services strategy by founding and leading the SharePoint Online team.
__________

Related Resources

Lecture Notes on “Data Management in the Cloud”.
by Michael Grossniklaus, and David Maier, Portland State University.
The topics covered in the course range from novel data processing
paradigms (MapReduce, Scope, DryadLINQ), to commercial cloud data
management platforms (Google BigTable, Microsoft Azure, Amazon S3
and Dynamo, Yahoo PNUTS) and open-source NoSQL databases
(Cassandra, MongoDB, Neo4J).

Lecture Notes|Intermediate|English| DOWNLOAD ~280 slides (PDF)| 2011-12|

NoSQL Data Stores Resources (free downloads) :
Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis

Related Posts

Big Data for Good (June 4, 2012)

Interview with Mike Stonebraker (May 2, 2012)

Integrating Enterprise Search with Analytics. Interview with Jonathan Ellis ( April 16, 2012)

A super-set of MySQL for Big Data. Interview with John Busch, Schooner (February 20, 2012)

Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB ( December 14, 2011)

vFabric SQLFire: Better than RDBMS and NoSQL? (October 24, 2011)

MariaDB: the new MySQL? Interview with Michael Monty Widenius (September 29, 2011)

On Versant's technology. Interview with Vishal Bagga (August 17, 2011)

##

Jun 25 12

Managing Internet Protocol Television Data. — An interview with Stefan Arbanowski.

by Roberto V. Zicari

“There is a variety of services possible via IPTV, ranging from live/linear TV and Video on Demand (VoD), through interactive broadcast-related apps such as shopping or advertising, to social TV apps where communities of users share a TV experience.” — Stefan Arbanowski.

The research center Fraunhofer FOKUS (Fraunhofer Institute for Open Communication Systems) in Berlin, has established a “SmartTV Lab” to build an independent development and testing environment for HybridTV technologies and solutions. They did some work on benchmarking databases for Internet Protocol Television Data. I have interviewed Stefan Arbanowski, who leads the Lab.

RVZ

Q1.What are the main research areas at the Fraunhofer Fokus research center?

Stefan Arbanowski: Be it on your mobile device, TV set or car – the FOKUS Competence Center for Future Media and Applications (FAME) develops future web technologies to offer intelligent services and applications. Our team of visionaries combines creativity and innovation with their technical expertise for the creation of interactive media. These technologies enable smart personalization and support future web functionalities on various platform from diverse domains.

The experts rigorously focus on web-based technologies and strategically use open standards. In the FOKUS Hybrid TV Lab our experts develop future IPTV technologies compliant to current standards with emphasis on advanced functionality, convergence and interoperability. The FOKUS Open IPTV Ecosystem offers one of the first solutions for standardized media services and core components of the various standards.

Q2. At Fraunhofer Fokus, you have experience in using a database for managing and controlling IPTV (Internet Protocol Television) content. What is IPTV? What kind of internet television services can be delivered using IPTV?

Stefan Arbanowski: There is a variety of services possible via IPTV, ranging from live/linear TV and Video on Demand (VoD), through interactive broadcast-related apps such as shopping or advertising, to social TV apps where communities of users share a TV experience.

Q3. What is IPTV data? Could you give a short description of the structure of a typical IPTV data?

Stefan Arbanowski: This is complex: start with page 14 of this doc. 😉

Q4. What are the main requirements for a database to manage such data?

Stefan Arbanowski: There are different challenges. One is the management of the different stream sessions used by viewers following a particular service, including for instance electronic program guide (EPG) data. Another is the pure usage data for billing purposes. The requirements are concurrent read/write operations on large (>=1 GB) databases while ensuring fast response times.

Q5. How did you evaluate the feasibility of a database technology for managing IPTV data?

Stefan Arbanowski: We compared the Versant Object Database (via its JDO interface) with MySQL Server 5.0 and with handling data in RAM. For this we did three implementations, trying to get the most out of each technology.

Q6. Your IPTV benchmark is based on use cases. Why? Could you briefly explain them?

Stefan Arbanowski: It has to be a real-world scenario to judge whether a particular technology really helps. We identified the bottlenecks in current IPTV systems and used them as the basis for our use cases.

The objective of the first test case was to handle a demanding number of simultaneous read/write operations and queries with small data objects, typically found in an IPTV Session Control environment.
V/OD performed more than 50% better compared to MySQL in a server side, 3-tier application server architecture. Our results for a traditional client/server architecture showed smaller advantages for the Versant Object Database, performing only approximately 25% better than MySQL, probably because of the network latency of the test environment.

The second test case managed larger Broadband Content Guide (BCG, i.e. Electronic Program Guide) data in one large transaction. V/OD was more than 8 times faster than MySQL. In our analysis, this test case demonstrated V/OD's significant advantages when managing complex data structures.

We wrote a white paper for more details.

Q7. What are the main lessons learned in running your benchmark?

Stefan Arbanowski: Comparing databases is never an easy task. Many specific requirements influence the decision making process, for example, the application specific data model and application specific data management tasks. Instead of using a standard database benchmark, such as TPC-C, we chose to develop a couple of benchmarks that are based on our existing IPTV Ecosystem data model and data management requirements, which allowed us to analyze results that are more relevant to the real world requirements found in such systems.

Q8. Anything else you wish to add?

Stefan Arbanowski: Considering these results, we would recommend a V/OD database implementation where performance is mandatory and in particular when the application must manage complex data structures.

_____________________
Dr. Stefan Arbanowski is head of the Competence Centre Future Applications and Media (FAME) at Fraunhofer Institute for Open Communication Systems FOKUS in Berlin, Germany.
Currently, he is coordinating Fraunhofer FOKUS’ IPTV activities, bundling expertise in the areas of interactive applications, media handling, mobile telecommunications, and next generation networks. FOKUS channels those activities towards networked media environments featuring live, on demand, context-aware, and personalized interactive media.
Besides telecommunications and distributed service platforms, he has published more than 70 papers in journals and conferences in the area of personalized service provisioning. He is a member of various program committees of international conferences.

Jun 18 12

Big Data: Smart Meters — Interview with Markus Gerdes.

by Roberto V. Zicari

“For a large to medium-sized German utility, which has about 240,000 conventional meters, quarter-hour meter readings would produce 960,000 sets of meter data to be processed and stored each hour once they are replaced by smart meters. And every hour another 960,000 sets of meter data have to be processed.” — Markus Gerdes.

80 percent of all households in Germany will have to be equipped with smart meters by 2020, according to an EU single market directive.
Why smart meters? A smart meter, as described by e.On, is “a digital device which can be read remotely and allows customers to check their own energy consumption at any time. This helps them to control their usage better and to identify concrete ways to save energy. Every customer can access their own consumption data online in graphic form displayed in quarter-hour intervals. There is also a great deal of additional information, such as energy saving tips. Similarly, measurements can be made using a digital display in the home in real time and the current usage viewed.” This means Big Data. How do we store, and use all these machine-generated data?
To better understand this, I have interviewed Dr. Markus Gerdes, Product Manager at BTC , a company specialized in the energy sector.

RVZ

Q1. What are the main business activities of BTC ?

Markus Gerdes: BTC provides various IT services: besides the basics of system management, e.g. hosting services, security services or the new field of mobile security services, BTC primarily delivers IT and process consulting and system integration services for different industries, especially for utilities.
This means BTC plans and rolls out IT architectures, integrates and customizes IT applications and migrates data for ERP, CRM and other applications. BTC also delivers its own IT applications if desired: in particular, BTC's Smart Metering solution, BTC Advanced Meter Management (BTC AMM), is increasingly known in the smart meter market and has drawn customers' interest at this stage of the market, not only in Germany but also, for example, in Turkey and other European countries.

Q2. According to an EU single market directive and the German Federal Government, 80 percent of all households in Germany will have to be equipped with smart meters by 2020. How many smart meters will have to be installed? What will the government do with all the data generated?

Markus Gerdes: Currently, 42 million electricity meters are installed in Germany. Thus, about 34 million meters need to be exchanged in Germany by 2020 according to the EU directive. In order to achieve this aim, in 2011 the German EnWG (the law on the energy industry) added some new requirements: smart meters have to be installed where a customer's electricity consumption is more than 6,000 kWh per year, at decentralized feed-in points with more than 7 kW, and in considerably refurbished or newly constructed buildings, if this is technically feasible.
In this context, technically feasible means that the installed smart meters are certified (as a precondition they have to use the protection profiles) and are commercially available in the meter market. An amendment to the EnWG is due in September 2012, and it is generally expected that this threshold of 6,000 kWh will be lowered. The government will actually not be in charge of the data collected by the smart meters. It is the metering companies who have to provide the data to the distribution grid operators and utility companies. The data is then used for billing, as input to customer feedback systems for example, and potentially for grid analyses under pseudonyms.

Q3. Smart Metering: Could you please give us some detail on what Smart Metering means in the Energy sector?

Markus Gerdes: Smart Metering means opportunities. The technology itself does no more or less than deliver data, foremost a timestamp plus a measured value, from a metering system via a communication network to an IT-system, where it is prepared and provided to other systems. If necessary this may even be done in real time. This data can be relevant to different market players in different resolutions and aggregations as a basis for other services.
Furthermore, smart meters offer new features like complex tariffs, load limitations etc. The data and the new features will lead to processes optimized with respect to quality, speed and cost. The type of processing will finally lead to new services, products and solutions – some of which we do not even know today. In combination with other technologies and information types, the smart metering infrastructure will be the backbone of smart home applications and the so-called smart grid.
For instance, BTC develops scenarios to combine the BTC AMM with the control of virtual power plants or even with the BTC grid management and control application BTC PRINS. This means: smart markets become reality.

Q4. BTC AG has developed an advanced meter management system for the energy industry. What is it?

Markus Gerdes: The BTC AMM is an innovative software system which allows meter service providers to manage, control and read out smart meters, and to provide these meter readings and other relevant information, e.g. status information or information on meter corruption and manipulation, to authorized market partners.
Also data and control signals for the smart meter can be provided by the system.
Because the BTC AMM was developed as a new solution, BTC has been able to focus particularly on mass data management and on workflows optimized for smart meter mass processes. In combination with a clear and easy-to-use frontend, we give our customers a high-performance solution for their most important requirements.
In addition, our modular concept and the use of open standards mean our vendor-independent solution not only fits easily into a utility's IT architecture but is also future-proof.

Q5. What kind of data management requirements do you have for this application? What kind of data is a smart meter producing and at what speed? How do you plan to store and process all the data generated by these smart meters?

Markus Gerdes: Let me address the issue of the data volume and the frequency of data sent first. The BTC AMM is designed to collect the data of several million smart meters. In a standard scenario each smart meter sends a load profile with a resolution of 15 minutes to the BTC AMM. This means that at least 96 data points have to be stored by the BTC AMM per day and meter. This implies both a huge amount of data to be stored and high-frequency data traffic.
Hence, the data management system needs to be highly performant in both dimensions. In order to process time series, BTC has developed a specific, highly efficient time series management which runs with different database providers. This enables the BTC AMM to cope even with data sent at a higher frequency. For certain smart grid use cases, the BTC AMM processes metering data sent from the meters on a scale of seconds.

Q6. The system you are developing is based on the InterSystems Caché® database system. How do you use Caché?

Markus Gerdes: BTC uses InterSystems Caché as the meter data management solution. This means the incoming data from the smart meters is saved into the database, and the information is provided, e.g. to backend systems via web services or to other interfaces, so that the data can be used for market partner communication or customer feedback systems. All this means the BTC AMM has to handle thousands of read and write operations per second.

Q7. You said that one of the critical challenges you are facing is to “master the mass data efficiency in communicating with smart meters and the storage and processing of measured time series”. Can you please elaborate on this? What is the volume of the data sets involved?

Markus Gerdes: For a large to medium-sized German utility, which has about 240,000 conventional meters, quarter-hour meter readings would produce 960,000 sets of meter data to be processed and stored each hour once they are replaced by smart meters. And every hour another 960,000 sets of meter data have to be processed.
In addition, calculations, aggregations and plausibility checks are necessary. Moreover, incoming tasks have to be processed and the relevant data has to be delivered to backend applications. This means that the underlying database, as well as the AMM processes, may have to process the incoming data every 15 minutes while reading thousands of time series per minute simultaneously.
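As a quick sanity check, a few lines of Python reproduce the volumes quoted above for 240,000 meters reporting quarter-hour load profiles.

# Quick check of the data volumes quoted above (240,000 meters, 15-minute readings).
meters = 240_000
readings_per_hour = 4                        # one reading every 15 minutes
per_hour = meters * readings_per_hour        # 960,000 meter readings per hour
per_meter_per_day = readings_per_hour * 24   # 96 data points per meter per day
per_day = meters * per_meter_per_day         # roughly 23 million readings per day
print(per_hour, per_meter_per_day, per_day)  # 960000 96 23040000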

Q8. How did you test the performance of the underlying database system when handling data streams? What results did you obtain so far?

Markus Gerdes: We designed a load profile generator and used it to simulate the meter readings of more than 1 million smart meters. The tests included the writing of quarter-hour meter readings. Actually, the problem with this test was the speed of the generator in providing the data, not the speed of the AMM. In fact we are able to write more than 12,000 time series per second. This is more than enough to cope even with full meter rollouts.
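The shape of such a test is easy to sketch: generate synthetic quarter-hour profiles for many meters and measure how fast a sink absorbs them. The sketch below is illustrative only; the sink is a stub, not the actual BTC AMM or Caché time-series interface used in the real benchmark.

# Hedged sketch of a load-profile generator of the kind described above: it emits
# synthetic quarter-hour readings per meter and measures write throughput against
# a stubbed sink (the real test wrote into the BTC AMM).
import random, time

def generate_readings(n_meters, points_per_meter=96):
    for meter_id in range(n_meters):
        profile = [round(random.uniform(0.0, 2.5), 3) for _ in range(points_per_meter)]
        yield meter_id, profile

def write_to_sink(meter_id, profile):
    pass  # stub: replace with the real time-series write call

start = time.perf_counter()
count = 0
for meter_id, profile in generate_readings(10_000):
    write_to_sink(meter_id, profile)
    count += 1
elapsed = time.perf_counter() - start
print(f"{count / elapsed:.0f} time series written per second (stub sink)")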

Q9. What is the current status of this project? What are the lessons learned so far? And the plans ahead? Are there any similar systems implemented in Europe?

Markus Gerdes: At the moment we think that our BTC AMM and database performance can handle the upcoming mass data over the next years, including a full smart meter rollout in Germany. Nevertheless, in terms of smart grid and smart home appliances and an increasing amount of real-time event processing, both read and write, it is necessary to get a clear view of future technologies to speed up the processing of mass data (e.g. in-memory).
In addition we still have to keep an eye on usability. Although we hope that smart metering will in the end lead to complete machine-to-machine communication, we always have to expect errors and disturbances from technology, communication or even the human factor. As event-driven processes are time-critical, we still have to work on solutions for fast and efficient handling, analysis and processing of mass errors.
_________________

Dr. Markus Gerdes, Product Manager BTC AMM / BTC Smarter Metering Suite, BTC Business Technology Consulting AG.
Since 2009, Mr. Gerdes has worked on several research, development and consulting projects in the area of smart metering. He has been involved in research and consulting for the utilities, industry and public sectors, regarding IT architectures, solutions and IT security.
He is experienced in the development of energy management solutions.

Jun 4 12

Big Data for Good.

by Roberto V. Zicari

A distinguished panel of experts discuss how Big Data can be used to create Social Capital.

Every day, 2.5 quintillion bytes of data are created. This data comes from digital pictures, videos, posts to social media sites, intelligent sensors, purchase transaction records, cell phone GPS signals to name a few. This is Big Data.

There is a great interest both in the commercial and in the research communities around Big Data. It has been predicted that “analyzing Big Data will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus”, according to research by MGI and McKinsey’s Business Technology Office.

But very few people seem to look at how Big Data can be used for solving social problems. Most of the work, in fact, is not in this direction. Why is this?
What can be done in the international research community to make sure that some of the most brilliant ideas do have an impact also for social issues?

I have invited a panel of distinguished, well-known researchers and professionals to discuss this issue. The list of panelists includes:

Roger Barga, Microsoft Research, group lead eXtreme Computing Group, USA

Laura Haas, IBM Fellow and Director Institute for Massive Data, Analytics and Modeling IBM Research, USA

Alon Halevy, Google Research, Head of the Structured Data Group, USA

Paul Miller, Consultant, Cloud of Data, UK

This Q&A panel focuses exactly at this question: is it possible to conduct research for a corporation and/or a research lab, and at the same time make sure that the potential output of our research has also a social impact?

We take Big Data as a key example. Big Data is clearly of interest to marketers and enterprises alike who wish to offer their customers better services and better quality products. Ultimately their goal is to sell their products/services.

This is good, but how about digging into Big Data to help people in need? Preventing or predicting natural catastrophes, or offering services “targeted” at people and structures in social need?

Hope you'll find this interview interesting, as well as eye-opening.

RVZ

Q1. In your opinion, would it be possible to exploit some of the current and future research and developments efforts on Big Data for achieving social capital?

Alon: Yes. Big data is not just the size of an individual data set, but rather the collection of data that is available to us online (e.g., government data, NGOs, local governments, journalists, etc.). By putting these data together we help tell stories about the data and make them of interest and of value to the wider public. As one simple example, a recent Danish Journalism Award was given to a nice visualization of data about which doctors are being sponsored by the medical industry. The ability to communicate this data with the public is certainly part of the Big Data agenda.

Laura: Absolutely. In fact, many of the efforts that we are engaged in today are exactly in this direction.
Much of our “Smarter Planet” related research is around utilizing more intelligently the large amounts of data coming from instrumenting, observing, and capturing the information about phenomena on planet earth, both natural and man-made.

Paul: First, it’s important to recognise that technological advances, new techniques, and new ways of working often deliver both tangible and intangible social benefit as a by-product of something else. Robert Owen and his peers in the late 18th and early 19th centuries might have had genuine motives for the social welfare and educational programmes they delivered for workers in their factories, but it was the commercial success of the factories themselves that paid for the philanthropy. And better educated children became better integrated factory workers, so it wasn’t completely altruistic.

That said, there is clearly scope for Big Data to deliver direct benefits in areas that aid society. Google Flu Trends is perhaps the best-known example – analysis of many millions of searches for flu-related terms (symptoms, medicines, etc) enabling Google’s non-profit Foundation to provide early visibility of illness in ways that could/should assist local healthcare systems. Google’s search engine isn’t about flu, and its indices aren’t for flu detection or prediction; this piece of societal value simply emerges from the ‘data exhaust’ of all those people searching a single site. Flu Trends isn’t alone; Harvard researchers found that Twitter data could be analysed to track the spread of Cholera on Haiti in a way that proved “substantially faster” than traditional techniques. According to Mathew Ingram’s write-up of the research, “What the Harvard and HealthMap study shows is that analyzing the data from large sets like the tweets around Haiti isn’t just good at tracking patterns or seeing connections after an event has occurred, but can actually be of use to researchers on the ground while those events are underway” (my emphasis).

Roger: Absolutely, we have already seen several such examples.
One such example in science is Jim Gray and Alex Szalay’s collaboration to build a virtual observatory for astronomy, which leveraged relational database technology.
The SDSS Sky Server has since supported hundreds of researchers and resulted in thousands of publications over the years.
Another, more recent example, is the language translation system researchers at Microsoft Research built for aid relief workers in Haiti after the 2010 earthquake.
They leveraged the same technology we use in our search operations to build a statistical machine translation engine for Haitian Creole to English from scratch in 4 days, 17 hours, and 30 minutes, and delivered it to aid workers in Haiti.

Q2. If yes, what are the areas where in your opinion Big Data could have a real impact on social capital?

Alon: Bringing data that is otherwise hidden from view to the eyes of the interested public. Data activists and journalists world-wide need to be able to easily discover data sets, merge them in a sensible fashion and tell stories about them that will grab people’s attention.
Helping people in crisis-response situations also has huge potential. For example, people have used Google Fusion Tables to create maps with critical information after the Japan earthquake in 2011 and before the hurricane in NYC later that year.

Laura: Healthcare is an obvious one, where leveraging the vast amounts of genomic information now being produced together with patient records, and the medical literature could help us provide the best known treatments to an individual patient — or discover new therapies that may be more effective than those currently in use.
We have worked already on leveraging big data and machine learning to predict the best therapeutic regimens for AIDS patients, for example. Or, when it comes to natural resources, we are leveraging big data to optimize the placement of turbines in a wind farm so that we get the most power for the least environmental impact. We can also look at man-made phenomena — for example, understanding traffic patterns and using the insight to do better planning or provide incentives that can reduce traffic during crunch hours. Many other examples can be given of how Big Data is being used to improve the planet!

Paul: The opportunities must – surely – be enormous? Any of the big issues affecting society, from environmental change, to population growth, to the need for clean water, food, and healthcare; all of these affect large groups of people and all of them have aspects of policy formulation or delivery that are (or should be, if anyone collected it) data-rich. The Volume, Velocity and Variety of data in many of these areas should offer challenging research opportunities for practitioners… and tangible benefits to society when they’re successful.

Roger: Top of mind is advancing scientific research, what has been referred to as eScience, which covers both the traditional hard sciences, from astronomy to oceanography, and the social sciences and economics.
Our ability to acquire and analyze unprecedented amounts of data has the potential to have a profound impact on science. It is a leap from the application of computing to support scientists to ‘do’ science (i.e. ‘computational science’) to the integration of computer science and ability to analyze volumes of data to extract insights into the very fabric of science. While on the face of it, this change may seem subtle, we believe it to be fundamental to science and the way science is practiced. Indeed, we believe this development represents the foundations of a new revolution in science.
We captured stories from many different scientific investigations in the book “The Fourth Paradigm: Data-Intensive Scientific Discovery.”

Q3. What are the main challenges in such areas?

Alon: Data discovery is a huge challenge (how to find high-quality data from the vast collections of data that are out there on the Web).
Determining the quality of data sets and their relevance to particular issues (i.e., is the data set making some underlying assumption that renders it biased or uninformative for a particular question) is another. Combining multiple data sets, by people who have little knowledge of database techniques, is a constant challenge.

Laura: With any big data project, many of the same issues exist. I’ll mention three major categories of issues: those related to the data, itself, those related to the process of deriving insight and benefit from the data, and finally, those related to management issues such as data privacy, security, and governance in general. In the data space, we talk about the 4 V’s of data — Volume (just dealing with the sheer size of it), Variety (handling the multiplicity of types and sources and formats), Velocity (reacting to the flood of information in the time required by the application), and, last and perhaps least understood, Veracity (how can we cope with uncertainty, imprecision, missing values, and yes, occasionally, mis-statements or untruths?).
The challenges with deriving insight include capturing data, aligning data from different sources (e.g., resolving when two objects are the same), transforming the data into a form suitable for analysis, modeling it, whether mathematically, or through some form of simulation, etc, and then understanding the output — visualizing and sharing the results, for example. And governance includes ensuring that data is used correctly (abiding by its intended uses and relevant laws), tracking how the data is used, transformed, derived, etc, and managing its lifecycle. There are research topics in ALL of these areas!

Paul: Data availability – is there data available, at all? Increasingly, there is. But coverage and comprehensiveness often remain patchy, and the rigour with which datasets are compiled may still raise concerns. A good process will, typically, make bad decisions if based upon bad data.
Data quality – how good is the data? How broad is the coverage? How fine is the sampling resolution? How timely are the readings? How well understood are the sampling biases? What are the implications in, for example, a Tsunami that affects several Pacific Rim countries? If data is of high quality in one country, and poorer in another, does the Aid response skew ‘unfairly’ toward the well-surveyed country or toward the educated guesses being made for the poorly surveyed one?
Data comprehensiveness – are there areas without coverage? What are the implications?
Personally Identifiable Information – much of this information is about people. Can we extract enough information to help people without extracting so much as to compromise their privacy? Partly, this calls for effective industrial practices. Partly, it calls for effective oversight by Government. Partly – perhaps mostly – it requires a realistic reconsideration of what privacy really means… and an informed grown up debate about the real trade-off between aspects of privacy ‘lost’ and benefits gained.
Rather than offering blanket privacy policies, perhaps customers, regulators and software companies should be moving closer to some form of explicit data agreement; if you give me access to X, Y, and Z about yourself, I will use it for purposes A, B, and C… and you will gain benefits/services D, E, and F. The first two parts are increasingly in place, albeit informally. The final part – the benefits – is far less well expressed.
Data dogmatism – analysis of big data can offer quite remarkable insights, but we must be wary of becoming too beholden to the numbers. Domain experts – and common sense – must continue to play a role. It would be worrying, indeed, if the healthcare sector only responded to flu outbreaks when Google Flu Trends told them to! See, for example, a recent blog post of mine.

Roger: The first important step is to embrace a data-centric view. The goal is not merely to store data for a specific community but to improve data quality and to deliver as a service accurate, consistent data to operational systems. It isn’t simply a matter of connecting the plumbing between many different data sources, there’s a quality function that has to be applied, to clean, and reconcile all of this information.
Researchers don’t simply need data, they need services-based information over this data to support their work.

Q4. What are the main difficulties, barriers hindering our community to work on social capital projects?

Alon: I don’t think there are particular barriers from a technical perspective. Perhaps the main barrier is a lack of ideas for how to actually take this technology and make a social impact. These ideas typically don’t come from the technical community, so we need more inspiration from activists.

Laura: Funding and availability of data are two big issues here. Much funding for social capital projects comes from governments — and, as we know, such funds are but a small fraction of the overall budget. Further, the market for new tools and so on that might be created in these spaces is relatively limited, so it is not always attractive to private companies to invest. While there is a lot of publicly available data today, often key pieces are missing, or privately held, or cannot be obtained for legal reasons, such as the privacy of individuals, or a country’s national interests. While this is clearly an issue for most medical investigations, it crops up as well even with such apparently innocent topics as disaster management (some data about, e.g., coastal structures, may be classified as part of the national defense).

Paul: Perceived lack of easy access to data that’s unencumbered by legal and privacy issues? The large-scale and long term nature of most of the problems?
It’s not as ‘cool’ as something else? A perception (whether real or otherwise) that academic funding opportunities push researchers in other directions?
Honestly, I’m not sure that there are significant insurmountable difficulties or barriers, if people want to do it enough.
As Tim O’Reilly said in 2009 (and many times since), developers should “Work on stuff that matters.” The same is true of researchers.

Roger: The greatest barrier may be social. Such projects require community awareness to bring people to take action and often a champion to frame the technical challenges in a way that is approachable by the community.
These projects will likely require close collaboration between the technical community and those familiar with the problem.

Q5. What could we do to help supporting initiatives for Big Data for Good?

Alon: Building a collection of high-quality data that is widely available and can serve as the backbone for many specific data projects. For example, data sets that include the boundaries of countries, counties, and other administrative regions, or data sets with up-to-date demographic data. It’s very common that when a particular data story arises, these data sets serve to enrich it.

Laura: Increasingly, we see consortiums of institutions banding together to work on some of these problems. These Centers may provide data and platforms for
data-intensive work, alleviating some of the challenges mentioned above by acquiring and managing data, setting up an environment and tools, bringing in expertise in a given topic, or in data, or in analytics, providing tools for governance, etc. My own group is creating just such a platform, with the goal of facilitating such collaborative ventures. Of course, lobbying our governments for support of such initiatives wouldn’t hurt!

Paul: Match domains with a need to researchers/companies with a skill/product. Activities such as the recent Big Data Week Hackathons might be one route to follow – encourage the organisers (and companies like Kaggle, which do this every day) to run Hackathons and competitions that are explicitly targeted at a ‘social’ problem of some sort. Continue to encourage the Open Data release of key public data sets. Talk to the agencies that are working in areas of interest, and understand the problems that they face. Find ways to help them do what they already want to do, and build trust and rapport that way.

Roger: Provide tools and resources to empower the long tail of research.
Today, only a fraction of scientists and engineers enjoy regular access to high-performance and data-intensive computing resources to process and analyze massive amounts of data and run models and simulations quickly. The reality for most of the scientific community is that speed to discovery is often hampered as they have to either queue up for access to limited resources or pare down the scope of research to accommodate available processing power. This problem is particularly acute at the smaller research institutes, which represent the long tail of the research community. Tier 1 and some Tier 2 universities have sufficient funding and infrastructure to secure and support computing resources, while the smaller research programs struggle. Our funding agencies and corporations must provide resources to support researchers, in particular those who do not have access to sufficient resources.

Q6. Are you aware of existing projects/initiatives for Big Data for Good?

Laura: Yes, many! See above for some examples. IBM Research alone has efforts in each of the areas mentioned — and many more. For example, we’ve been working with the city of Rio, in Brazil, to do detailed flood modeling, meter by meter; with the Toronto Children’s Hospital to monitor premature babies in the neonatal ward,
allowing detection of life-threatening infections up to 24 hours earlier; and with the Rizzoli Institute in Italy to find the best cancer treatments for particular groups of patients.

Roger: Yes, the United Nations Global Pulse initiative is one example. Earlier this year, at the 2012 Annual Meeting in Davos, the World Economic Forum published a white paper entitled “Big Data, Big Impact: New Possibilities for International Development”. The WEF paper lays out several of the ideas that fundamentally drive the Global Pulse initiative and presents in concrete terms the opportunity created by the explosion of data in our world today, and how researchers and policymakers are beginning to realize the potential for leveraging Big Data to extract insights that can be used for Good, in particular for the benefit of low-income populations. What I find intriguing about this project from a technical perspective is how to extract insight from ambient data coming from GPS devices, cell phones and medical devices, combined with crowd-sourced data from health and aid workers in the field, and then analyzed with machine learning and analytics to predict a potential social need or crisis in advance, while remediation is still viable.

Q7. Anything else you wish to add?

Alon: Google Fusion Tables has been used in many cases for social good, either through journalists, crisis responders or data activists making a compelling visualization that caught people’s attention. This has been one of the most gratifying aspects of working on Fusion Tables and has served as a main driver for prioritizing our work: make it easy for people with a passion for the data (rather than database expertise) to get their work done; make it easier for them to find relevant data and combine it with their own. We look very carefully at the workflow of these professionals and try to make it as efficient as possible.

Laura: I think our community has the ability to do a lot of good by leveraging the tools we are developing, and our expertise, to attack some of the critical problems facing our world. We may even create economic value (not a bad thing, either!) while doing so.
_____________________________

Dr. Roger Barga has been with the Microsoft Corporation since 1997, first working as a researcher in the database research group of Microsoft Research, then as architect of the Technical Computing Initiative, followed by architect and engineering group lead in the eXtreme Computing Group of Microsoft Research.
He currently leads a product group developing an advanced analytics service on Windows Azure. Roger holds a PhD in Computer Science (database systems), MS in Computer Science (machine learning), and a BS in Mathematics.

Dr. Alon Halevy heads the Structured Data Group at Google Research.
Prior to that, he was a Professor of Computer Science at the University of Washington, where he founded the Database Research Group. From 1993 to 1997 he was a Principal Member of Technical Staff at AT&T Bell Laboratories (later AT&T Laboratories). He received his Ph.D. in Computer Science from Stanford University in 1993, and his Bachelor’s degree in Computer Science and Mathematics from the Hebrew University in Jerusalem in 1988. Dr. Halevy was elected a Fellow of the Association for Computing Machinery in 2006.

Dr. Laura Haas is an IBM Fellow, and Director of IBM Research’s new Institute for Massive Data, Analytics and Modeling; she also serves as a “catalyst” for ambitious research across IBM’s worldwide research labs. She was the Director of Computer Science at IBM’s Almaden Research Center from 2005 to 2011.
From 2001-2005, she led the Information Integration Solutions architecture and development teams in IBM’s Software Group. Previously, Dr. Haas was a research staff member and manager at Almaden.
She is best known for her work on the Starburst query processor, from which DB2 LUW was developed, on Garlic, a system which allowed integration of heterogeneous data sources, and on Clio, the first semi-automatic tool for heterogeneous schema mapping.
She has received several IBM awards for Outstanding Innovation and Technical Achievement, an IBM Corporate Award for her work on information integration technology, and the Anita Borg Institute Technical Leadership Award. Dr. Haas was Vice President of the VLDB Endowment Board of Trustees from 2004-2009, and is a member of the National Academy of Engineering and the IBM Academy of Technology, an ACM Fellow, and Vice Chair of the board of the Computing Research Association.

Dr. Paul Miller is Founder of the Cloud of Data, a UK-based consultancy primarily concerned with Cloud Computing, Big Data, and Semantic Technologies.
He works with public and private sector clients in Europe and North America, and has a Ph.D in Archaeology (Geographic Information Systems) from the University of York.

________________________

Acknowledgement: I would like to thank Michael J. Carey with whom I have brainstormed about this project at EDBT in Berlin. RVZ

Note: you can DOWNLOAD THIS INTERVIEW AS .PDF

##

May 21 12

Do we still have an impedance mismatch problem? – Interview with José A. Blakeley and Rowan Miller.

by Roberto V. Zicari

“The impedance mismatch problem has been significantly reduced, but not entirely eliminated”— José A. Blakeley.

” Performance and overhead of ORMs has been and will continue to be a concern. However, in the last few years there have been significant performance improvements” –José A. Blakeley, Rowan Miller.

Do we still have an impedance mismatch problem in 2012?
Not an easy question to answer. To get a sense of where we are now, I have interviewed José A. Blakeley and Rowan Miller. José is a Partner Architect in the SQL Server Division at Microsoft, and Rowan is the Program Manager for the ADO.NET Entity Framework team at Microsoft.
The focus of the interview is on ORM (object-relational mapping) technology and the new release of Entity Framework (EF 5.0). Entity Framework is an object-relational mapper developed by Microsoft that enables .NET developers to work with relational data using domain-specific objects.

RVZ

Q1. Do we still have an impedance mismatch problem in 2012?

Blakeley: The impedance mismatch problem has been significantly reduced, but not entirely eliminated.

Q2. In the past there have been many attempts to remove the impedance mismatch. Is ORM (object-relational mapping) technology really the right solution for that in your opinion? Why? What other alternative solutions are feasible?

Blakeley: There have been several attempts to remove the impedance mismatch. In the late ’80s, early ’90s, object databases and persistent programming languages made significant progress in persisting data structures built in languages like C++, Smalltalk, and Lisp almost seamlessly. For instance, Persistent C++ languages could persist structures containing untyped pointers (e.g., void*). However, to succeed over relational databases, persistent languages needed to also support declarative, set-oriented queries and transactions.
Object database systems failed because they didn’t have strong support for queries, query optimization, and execution, and they didn’t have strong, well-engineered support for transactions. At the same time, relational databases grew by adding extended relational capabilities, which reduced the need for persistent languages, and so the world continued to gravitate around relational database systems. Object-relational mapping (ORM) systems, introduced in the last decade together with programming languages like C#, which added built-in query capabilities (i.e., Language Integrated Query – LINQ) to the language, are the latest attempt to eliminate the impedance mismatch.
Object-relational mapping technology, like the Entity Framework, aims at providing a general solution to the problem of mapping database tables to programming language data structures. ORM technology is the right layer for bridging the complex mapping problem between tables and programming constructs. For instance, the queries needed to map a set of tables to a class inheritance hierarchy can be quite complex. Similarly, propagating updates from the programming language structures to the tables in the database is a complex problem. Applications can build these mappings by hand, but the process is time-consuming and error prone. Automated ORMs can do this job correctly and faster.
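
To make the hand-mapping problem concrete, here is a minimal sketch, written in Java with plain JDBC rather than .NET, of the kind of boilerplate an ORM generates and maintains automatically; the Customer class and the customers table are hypothetical examples, not taken from the interview.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

// A plain domain class; the table and column names used below are made up.
class Customer {
    long id;
    String name;
    String email;
}

public class HandWrittenMapping {
    // Everything an ORM would generate must be written and maintained by hand.
    static List<Customer> findByName(Connection conn, String name) throws Exception {
        List<Customer> result = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT id, name, email FROM customers WHERE name = ?")) {
            ps.setString(1, name);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    Customer c = new Customer();        // column-by-column copy into the object
                    c.id = rs.getLong("id");
                    c.name = rs.getString("name");
                    c.email = rs.getString("email");
                    result.add(c);
                }
            }
        }
        return result;
    }
}

Every query, every inheritance hierarchy and every update path needs this kind of code; an ORM derives it from the mapping metadata instead.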

Q3. What are the current main issues with O/R mappings?

Blakeley, Miller: In the area of functionality, enabling a complete ORM covering all programming constructs is a challenge. For example, up until its latest release EF lacked support for enum types. It’s also hard for an ORM to support the full range of concepts supported by the database. For example, support for spatial data types has been available in SQL Server since 2008 but native support has only just been added to EF. This challenge only gets harder when you consider most ORMs, including EF, support multiple database engines, each with different capabilities.

Another challenge is performance. Anytime you add a layer of abstraction there is a performance overhead, and this is certainly true for ORMs. One critical area of performance is the time taken to translate a query into SQL that can be run against the database. In EF this involves taking the LINQ query that a user has written and translating it to SQL. In EF5 we made some significant improvements in this area by automatically caching and re-using these translations. The quality of the SQL that is generated is also key to performance: there are many different ways to write the same query, and the performance difference can be huge. Things like unnecessary casts can cause the database not to use an index. With every release of EF we improve the SQL that is generated.

Adding a layer of abstraction also introduces another challenge: ORMs make it easy to map a relational database schema and have queries constructed for you, but because this translation is handled internally by the ORM, it can be difficult to debug when things don’t behave or perform as expected. There are a number of great tools, such as LINQPad and Entity Framework Profiler, which can help debug such scenarios.

Q4. What is special about Microsoft`s ORM (object-relational mapping) technology?

Miller: Arguably the biggest differentiator of EF isn’t a single technical feature but how deeply it integrates with the other tools and technologies that developers use, such as Visual Studio, LINQ, MVC and many others. EF also provides powerful mapping capabilities that allow you to solve some big impedance differences between your database schema and the shape of the objects you want to write code against. EF also gives you the flexibility of working in a designer (Model & Database First) or purely in code (Code First). There is also the benefit of Microsoft’s agreement to support and service the software that it ships.

Q5. Back in 2008 LINQ was a brand-new development in programming languages. What is the current status of LINQ? What is LINQ used for in practice?

Miller: LINQ is a really solid feature and while there probably won’t be a lot of new advancements in LINQ itself, we should see new products continuing to take advantage of it. I think that is one of the great things about LINQ: it lends itself to so many different scenarios. For example, there are LINQ providers today that allow you to query in-memory objects, relational databases and XML files, just to name a few.

Q6. The original design of the Entity Framework dates back to 2006. Now, EF version 5.0 is currently available in Beta. What’s in EF 5.0?

Miller: Before we answer that question let’s take a minute to talk about EF versioning. The first two releases of EF were included as part of Visual Studio and the .NET Framework and were referred to using the version of the .NET Framework that they were included in. The first version (EF or EF3.5) was included in .NET 3.5 SP1 and the second version (EF4) was included in .NET 4. At that point we really wanted to release more often than Visual Studio and the .NET Framework, so we started to ship ‘out-of-band’ using NuGet. Once we started shipping out-of-band we adopted semantic versioning (as defined at http://semver.org). Since then we’ve released EF 4.1, 4.2 and 4.3, and EF 5.0 is currently available in Beta.

EF has come a long way since it was first released in Visual Studio 2008 and .NET 3.5. As with most v1 products there were a number of important scenarios that weren’t supported in the first release of EF.
EF4 was all about filling in these gaps and included features such as Model First development, support for POCO classes, customizable code generation, the ability to expose foreign key properties in your objects, improved support for unit testing applications built with EF and many other features.

In EF 4.1 we added the DbContext API and Code First development. The DbContext API was introduced as a cleaner and simpler API surface over EF that simplifies the code you write and allows you to be more productive. Code First gives you an alternative to the designer and allows you to define your model using just code. Code First can be used to map to an existing database or to generate a new database based on your code. EF 4.2 was mainly about bug fixes and adding some components to make it easier for tooling to interact with EF. The EF4.3 release introduced the new Code First Migrations feature that allows you to incrementally change your database schema as your Code First model evolves over time.

EF 5.0 is currently available in Beta and introduces some long awaited features including enum support, spatial data types, table valued function support and some significant performance improvements. In Visual Studio 11 we’ve also updated the EF designer to support these new features as well as multiple diagrams within a model and allowing you to apply coloring to your model.

Q7. What are the features that did not make it into EF 5.0., that you consider are important to be added in a next release?

Miller: There are a number of things that our customers are asking for that are at the top of our list for the upcoming versions of EF. These include asynchronous query support, improved support for SQL Azure (automatic connection retries and built-in federation support), the ability to use Code First to map to stored procedures and functions, pluggable conventions for Code First, and better performance for the designer and at runtime. If we get a significant number of those done in EF6 I think it will be a really great release. Keep in mind that because we now also ship in between Visual Studio releases, you’re not looking at years between EF releases any more.

Q8. If your data is made of Java objects, would Entity Framework be useful? And if yes, how?

Blakeley: Unfortunately not. The EF ORM is written in C# and runs on the .NET Common Language Runtime (CLR). To support Java objects, we would need to have a .NET implementation of Java like Sun’s Java.Net.

Q9. EF offers different Entity Data Model design approaches: Database First, Model First, Code First. Why do you need three different design approaches? When would you recommend using each of these approaches?

Miller: This is a great question and something that confuses a lot of people. Whichever approach you choose the decision only impacts the way in which you design and maintain the model, once you start coding against the model there is no difference. Which one to use boils down to two fundamental questions. Firstly, do you want to model using boxes and lines in a designer or would you rather just write code? Secondly, are you working with an existing database or are you creating a new database?

If you want to work with boxes and lines in a designer then you will be using the EF Designer that is included in Visual Studio. If you’re targeting an existing database then the Database First workflow allows you to reverse engineer a model from the database; you can then tweak that model using the designer. If you’re going to be creating a new database then the Model First workflow allows you to start with an empty model and build it up using the designer. You can then generate a database based on the model you have created. Whether you choose Model First or Database First, the classes that you will code against are generated for you. This generation is customizable, though, so if the generated code doesn’t suit your needs there is plenty of opportunity to customize it.

If you would rather forgo the designer and do all your modeling in code then Code First is the approach you want. If you are targeting an existing database you can either hand code the classes and mapping or use the EF Power Tools (available on Visual Studio Gallery) to reverse engineer some starting point code for you. If you are creating a new database then Code First can also generate the database for you and Code First Migrations allows you to control how that database is modified as your model changes over time. The idea of generating a database often scares people but Code First gives you a lot of control over the shape of your schema. Ultimately if there are things that you can’t control in the database using the Code First API then you have the opportunity to apply them using raw SQL in Code First Migrations.

Q10. There are concerns about the performance and the overhead generated by ORM technology. What is your opinion on that?

Blakeley: Performance and overhead of ORMs has been and will continue to be a concern. However, in the last few years there have been significant performance improvements: the code path of the mapping implementations has been reduced, relational query optimizers continue to get better at handling extremely complex queries, and processor technology keeps improving, with abundant RAM allowing for larger object caches that speed up the mapping.

————–
José Blakeley is a Partner Architect in the SQL Server Division at Microsoft where he works on server programmability, database engine extensibility, query processing, object-relational functionality, scale-out database management, and scientific database applications. José joined Microsoft in 1994. Some of his contributions include the design of the OLE DB data access interfaces, the integration of the .NET runtime inside the SQL Server 2005 products, the development of many extensibility features in SQL Server, and the development of the ADO.NET Entity Framework in Visual Studio 2008. Since 2009 José has been building the SQL Server Parallel Data Warehouse, a scale-out MPP SQL Server appliance. José has authored many conference papers, book chapters and journal articles on design aspects of relational and object database management systems, and data access. Before joining Microsoft, José was a member of the technical staff with Texas Instruments where he was a principal investigator in the development of the DARPA-funded Open-OODB object database management system. José became an ACM Fellow in 2009. He received a Ph.D. in computer science from the University of Waterloo, Canada, for work on materialized views, a feature implemented in all main commercial relational database products.

Rowan Miller works as a Program Manager for the ADO.NET Entity Framework team at Microsoft. He speaks at technical conferences and blogs. Rowan lives in Seattle, Washington with his wife Athalie. Prior to moving to the US he resided in the small state of Tasmania in Australia.
Outside of technology Rowan’s passions include snowboarding, mountain biking, horse riding, rock climbing and pretty much anything else that involves being active. The primary focus of his life, however, is to follow Jesus.

For further readings

Entity Framework (EF) Resources: Software and Tools | Articles and Presentations

ORM Technology: Blog Posts | Articles and Presentations

Object-Relational Impedance Mismatch: Blog Posts | Articles and Presentations

May 2 12

Interview with Mike Stonebraker.

by Roberto V. Zicari

“I believe that “one size does not fit all”. I.e. in every vertical market I can think of, there is a way to beat legacy relational DBMSs by 1-2 orders of magnitude.” — Mike Stonebraker.

I have interviewed Mike Stonebraker, serial entrepreneur and professor at MIT. In particular, I wanted to know more about his last endeavor, VoltDB.

RVZ

Q1. In your career you developed several data management systems, namely: the Ingres relational DBMS, the object-relational DBMS PostgreSQL, the Aurora Borealis stream processing engine (commercialized as StreamBase), the C-Store column-oriented DBMS (commercialized as Vertica), and the H-Store transaction processing engine (commercialized as VoltDB). In retrospect, what are, in a nutshell, the main differences and similarities between all these systems? What are their respective strengths and weaknesses?

Stonebraker: In addition, I am building SciDB, a DBMS oriented toward complex analytics.
I believe that “one size does not fit all”. I.e. in every vertical market I can think of, there is a way to beat legacy relational DBMSs by 1-2 orders of magnitude.
The techniques used vary from market to market. Hence, StreamBase, Vertica, VoltDB and SciDB are all specialized to different markets. At this point Postgres and Ingres are legacy code bases.

Q2. In 2009 you co-founded VoltDB, a commercial start up based on ideas from the H-Store project. H-Store is a distributed In Memory OLTP system. What is special of VoltDB? How does it compare with other In-memory databases, for example SAP HANA, or Oracle TimesTen?

Stonebraker: A bunch of us wrote a paper “Through the OLTP Looking Glass and What We Found There” (SIGMOD 2008). In it, we identified 4 sources of significant OLTP overhead (concurrency control, write-ahead logging, latching and buffer pool management).
Unless you make a big dent in ALL FOUR of these sources, you will not run dramatically faster than current disk-based RDBMSs. To the best of my knowledge, VoltDB is the only system that eliminates or drastically reduces all four of these overhead components. For example, TimesTen uses conventional record-level locking, an ARIES-style write-ahead log and conventional multi-threading, leading to a substantial need for latching. Hence, they eliminate only one of the four sources.

Q3. VoltDB is designed for what you call “high velocity” applications. What do you mean with that? What are the main technical challenges for such systems?

Stonebraker: Consider an application that maintains the “state” of a multi-player internet game. This state is subject to a collection of perhaps thousands of
streams of player actions. Hence, there is a collective “firehose” that the DBMS must keep up with.

In a variety of OLTP applications, the input is a high velocity stream of some sort. These include electronic trading, wireless telephony, digital advertising, and network monitoring.
In addition to drinking from the firehose, such applications require ACID transactions and light real-time analytics, exactly the requirements of traditional OLTP.

In effect, the definition of transaction processing has been expanded to include non-traditional applications.

Q4. Goetz Graefe (HP Fellow) said in an interview that “disk-less databases are appropriate where the database contains only application state, e.g., current account balances, currently active logins, current shopping carts, etc. Disks will continue to have a role and economic value where the database also contains history (e.g. cold history such as transactions that affected the account balances, login & logout events, click streams eventually leading to shopping carts, etc.)” What is your take on this?

Stonebraker: In my opinion the best way to organize data management is to run a specialized OLTP engine on current data. Then, send transaction history data,
perhaps including an ETL component, to a companion data warehouse. VoltDB is a factor of 50 or so faster than legacy RDBMSs on the transaction piece, while column stores, such as Vertica, are a similar amount faster on historical analytics. In other words, specialization allows each component to run dramatically faster than a “one size fits all” solution.

A “two system” solution also avoids resource management issues and lock contention, and is very widely used as a DBMS architecture.

Q5. Where will the (historical) data go if we have no disks? In the Cloud?

Stonebraker: Into a companion data warehouse. The major DW players are all disk-based.

Q6. How VoltDB ensures durability?

Stonebraker: VoltDB automatically replicates all tables. On a failure, it performs “Tandem-style” failover and eventual failback. Hence, it totally masks most errors. To protect against cluster-wide failures (such as power issues), it supports snapshotting of data and an innovative “command logging” capability. Command logging
has been shown to be wildly faster than data logging, and supports the same durability as data logging.

Q7. How does VoltDB support atomicity, consistency and isolation?

Stonebraker: All transactions are executed (logically) in timestamp order. Hence, the net outcome of a stream of transactions on a VoltDB database is equivalent to their serial execution in timestamp order.
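
For readers who want to see what this model looks like in practice, here is a minimal, illustrative sketch of a VoltDB transaction written as a Java stored procedure, assuming the standard org.voltdb API; the table, columns and procedure name are hypothetical and not taken from the interview.

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Each invocation of run() executes as a single ACID transaction, serialized by the engine.
public class AddPlayerScore extends VoltProcedure {

    // SQL statements are declared up front so they can be precompiled.
    public final SQLStmt updateScore = new SQLStmt(
        "UPDATE player SET score = score + ? WHERE player_id = ?;");
    public final SQLStmt readScore = new SQLStmt(
        "SELECT score FROM player WHERE player_id = ?;");

    public VoltTable[] run(long playerId, long delta) throws VoltAbortException {
        voltQueueSQL(updateScore, delta, playerId);   // queue the update
        voltQueueSQL(readScore, playerId);            // queue a read of the new value
        return voltExecuteSQL(true);                  // execute both and commit atomically
    }
}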

Q8. Would you call VoltDB a relational database system? Does it support standard SQL? How do you handle scalability problems for complex joins over large amounts of data?

Stonebraker: VoltDB supports standard SQL.
Complex joins should be run on a companion data warehouse. After all, the only way to interleave “big reads” with “small writes” in a legacy RDBMS is to use snapshot isolation or run with a reduced level of consistency.
You either get an out-of-date, but consistent answer or an up-to-date, but inconsistent answer. Directing big reads to a companion DW, gives you the same result as snapshot isolation. Hence, I don’t see any disadvantage to doing big reads on a companion system.

Concerning larger amounts of data, our experience is that OLTP problems with more than a few terabytes of data are quite rare. Hence, these can easily fit in main memory using a VoltDB architecture.

In addition, we are planning extensions of the VoltDB architecture to handle larger-than-main-memory data sets. Watch for product announcements in this area.

Q9. Does VoltDB handle disaster recovery? If yes, how?

Stonebraker: VoltDB just announced support for replication over a wide area network. This capability supports failover to a remote site if a disaster occurs. Check out the VoltDB web site for details.

Q10. VoltDB’s mission statement is “to deliver the fastest, most scalable in-memory database products on the planet”. What performance measurements do you have so far to sustain this claim?

Stonebraker: We have run TPC-C at about 50 X the performance of a popular legacy RDBMS. In addition, we have shown linear TPC-C scalability to 384 cores
(more than 3 million transactions per second). That was the biggest cluster we could get access to; there is no reason why VoltDB would not continue to scale.

Q11. Can In-Memory Data Management play a significant role also for Big Data Analytics (up to several PB of data)? If yes, how? What are the largest data sets that VoltDB can handle?

Stonebraker: VoltDB is not focused on analytics. We believe they should be run on a companion data warehouse.

Most of the warehouse customers I talk to want to keep increasingly large amounts of increasingly diverse history to run their analytics over. The major data warehouse players are routinely being asked to manage petabyte-sized data warehouses. It is not clear how important main memory will be in this vertical market.

Q12. You were very critical about Apache Hadoop, but VoltDB offers an integration with Hadoop. Why? How does it work technically?
What are the main business benefits from such an integration?

Stonebraker: Consider the “two system” solution mentioned above. VoltDB is intended for the OLTP portion, and some customers wish to run Hadoop as a data
warehouse platform. To facilitate this architecture, VoltDB offers a Hadoop connector.

Q13. How “green” is VoltDB? What are the tradeoff between total power consumption and performance: Do you have any benchmarking results for that?

Stonebraker: We have no official benchmarking numbers. However, on a large variety of applications VoltDB is a factor of 50 or more faster than traditional RDBMSs. Put differently, if legacy folks need 100 nodes, then we need 2!

In effect, if you can offer vastly superior performance (say times 50) on the same hardware, compared to another system, then you can offer the same performance on 1/50th of the hardware. By definition, you are 50 times “greener” than they are.

Q14. You are currently working on science-oriented DBMSs and search engines for accessing the deep web. Could you please give us some details. What kind of results did you obtain so far?

Stonebraker: We are building SciDB, which is oriented toward complex analytics (regression, clustering, machine learning, …). It is my belief that such analytics
will become much more important off into the future. Such analytics are invariably defined on arrays, not tables. Hence, SciDB is an array DBMS, supporting a dialect of SQL for array data. We expect it to be wildly faster than legacy RDBMSs on this kind of application. See SciDB.org for more information.

Q15. You are a co-founder of several venture capital backed start-ups. In which area?

Stonebraker: The recent ones are: StreamBase (stream processing), Vertica (data warehouse market), VoltDB (OLTP), Goby.com (data aggregation of web sources), and Paradigm4 (SciDB and complex analytics).

Check the company web sites for more details.

——————————–
Mike Stonebraker
Dr. Stonebraker has been a pioneer of data base research and technology for more than a quarter of a century. He was the main architect of the INGRES relational DBMS, and the object-relational DBMS, POSTGRES. These prototypes were developed at the University of California at Berkeley where Stonebraker was a Professor of Computer Science for twenty five years. More recently at M.I.T. he was a co-architect of the Aurora/Borealis stream processing engine, the C-Store column-oriented DBMS, and the H-Store transaction processing engine. Currently, he is working on science-oriented DBMSs, OLTP DBMSs, and search engines for accessing the deep web. He is the founder of five venture-capital backed startups, which commercialized his prototypes. Presently he serves as Chief Technology Officer of VoltDB, Paradigm4, Inc. and Goby.com.

Professor Stonebraker is the author of scores of research papers on data base technology, operating systems and the architecture of system software services. He was awarded the ACM System Software Award in 1992, for his work on INGRES. Additionally, he was awarded the first annual Innovation award by the ACM SIGMOD special interest group in 1994, and was elected to the National Academy of Engineering in 1997. He was awarded the IEEE John Von Neumann award in 2005, and is presently an Adjunct Professor of Computer Science at M.I.T.

Related Posts

In-memory database systems. Interview with Steve Graves, McObject.
(March 16, 2012)

On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum. (February 1, 2012)

A super-set of MySQL for Big Data. Interview with John Busch, Schooner. (February 20, 2012)

Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB. (December 14, 2011)

On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica. (November 16, 2011)

vFabric SQLFire: Better then RDBMS and NoSQL? (October 24, 2011)

The future of data management: “Disk-less” databases? Interview with Goetz Graefe. (August 29, 2011).

##

Apr 16 12

Integrating Enterprise Search with Analytics. Interview with Jonathan Ellis.

by Roberto V. Zicari

“Enterprise Search implies being able to search multiple types of data generated by an enterprise. DSE2 takes this to the next level by integrating this with a real time database AND powerful analytics.” — Jonathan Ellis.

I wanted to learn more about DataStax Enterprise 2.0, the new release of the commercial version of Cassandra. I interviewed Jonathan Ellis, CTO and co-founder of DataStax and project chair of Apache Cassandra.

RVZ

Q1. What are the new main features of DataStax Enterprise 2.0 (DSE 2.0)?

Jonathan Ellis: The one I’m most excited about is the integration of Solr support. This is not a side-by-side system like Solandra, where Solr has to maintain a separate copy of the data to be indexed, but full integration with Cassandra: you insert your data once, and access it via Cassandra, Solr, or Hadoop.
Search is an increasingly important ingredient for modern applications, and DSE2 is the first to offer fully integrated, scalable search to developers.

DSE2 also includes Apache Sqoop for easy migration of relational data into Cassandra, and plug-and-play log indexing for your application.
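
As a rough illustration of the “insert once, query anywhere” idea, the sketch below queries a DSE Solr core from Java using the SolrJ client; the core name, endpoint URL and field names are hypothetical assumptions for illustration, not details confirmed in the interview.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class DseSearchExample {
    public static void main(String[] args) throws Exception {
        // DSE exposes Solr-indexed Cassandra data over HTTP; the keyspace.table
        // core name used here is a made-up example.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/demo.providers");

        SolrQuery query = new SolrQuery("specialty:podiatrist AND language:german");
        query.setRows(10);

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            // The same rows remain accessible through the Cassandra and Hadoop APIs.
            System.out.println(doc.getFieldValue("name"));
        }
    }
}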

Q2. How does the integration of Cassandra with Apache Hadoop and Apache Solr work technically?

Jonathan Ellis: Basically, Cassandra offers a pluggable index API, and we created a Solr-backed index implementation with it.
We wrote about the technical details here.

Q3. What exactly is Enterprise Search, and why choose Apache Solr?

Jonathan Ellis: Enterprise Search implies being able to search multiple types of data generated by an enterprise. DSE2 takes this to the next level by integrating this with a real time database AND powerful analytics.

Solr is the gold standard for search, much the same way that Hadoop is the gold standard for big data map/reduce and analytics. There’s an ecosystem of tools that build on Solr, so offering true Solr support is much more powerful than implementing a proprietary full-text search engine.

Q4. What are the main business benefits of such integration?

Jonathan Ellis: First, developers and administrators have one database and vendor to concern themselves with instead of multiple databases and many software suppliers. Second, the built-in technical benefits of running both Solr and Hadoop on top of Cassandra yield continuous uptime for critical applications as well as future-proofing those same apps where growing data volumes and increased user traffic are concerned.

Finally, customers save anywhere from 80-90% over traditional RDBMS vendors by going with DSE. For example, Constant Contact estimated that a new project they had in the works would take $2.5 million and 9 months on traditional relational technology, but with Cassandra they delivered it in 3 months for $250,000. That’s one third the time and one tenth the cost; not bad!

Q5. It looks like you are attempting to compete with Google. Is this correct?

Jonathan Ellis: DSE2 is about providing search as a building block for applications, not necessarily delivering an off-the-shelf search appliance.
Compared to Google’s AppEngine product, it’s fair to say that DSE 2.0 provides a similar, scalable platform to build applications on. DSE 2.0 is actually ahead of the game there: Google has announced but not yet delivered full-text search for AppEngine.

Another useful comparison is to Amazon Web Services: DSE 2.0 gives you the equivalent of Amazon’s DynamoDB, S3, Elastic Map/Reduce, and CloudSearch in a single, integrated stack. So instead of having to insert documents once in S3 and again in CloudSearch, you just add them to DSE (with any of the Solr, Cassandra, or Hadoop APIs) without having to write code to keep multiple copies in sync when updates happen.

Q6. How do you manage to run real-time, analytics and search operations in the same database cluster, without performance or resource contention problems?

Jonathan Ellis: DSE offers elastic workload partitioning: your analytics jobs run against their own copies of the data, kept in sync by Cassandra replication, so they don’t interfere with your real time queries. When your workload changes, you can re-provision existing nodes from the real time side to the analytical, or vice versa.

Q7. You do not require ETL software to move data between systems. How does it work instead?

Jonathan Ellis: All DSE nodes are part of a single logical Cassandra cluster. DSE tells Cassandra how many copies to keep for which workload partitions, and Cassandra keeps them in sync with its battle-tested replication.
So your real time nodes will have access to new analytical output almost instantly, and you never have to write ETL code to move real time data into your analytical cluster.

Q8. Could you give us some examples of Big Data applications that are currently powered by DSE 2.0?

Jonathan Ellis:
A recent example is Healthx, which develops and manages online portals and applications for the healthcare market. They handle things such as enrollment, reporting, claims management, and business intelligence.
They have to manage countless health groups, individual members, doctors, diagnoses, and a lot more. Data comes in very fast, from all over, changes constantly, and is accessed all the time.

Healthx especially likes the new search capabilities in DSE 2.0. In addition to being able to handle real-time and analytic work, their users can now easily perform lightning-fast searches for things like, ‘find me a podiatrist who is female, speaks German, and has an office close to where I live.’

Q9. What about Big Data applications which also need to use Relational Data? Is it possible to integrate DSE 2.0 with a Relational Database? If yes, how? How do you handle query of data from various sources?

Jonathan Ellis: Most customers start by migrating their highest-volume, most-frequently-accessed data to DSE (e.g. with the Sqoop tool I mentioned), and leave the rest in a relational system. So RDBMS interoperability is very common at that level.

It’s also possible to perform analytical queries that mix data from DSE and relational sources, or even a legacy HDFS cluster.

Q10. How can developers use DSE 2.0 for storing, indexing and searching web logs?

Jonathan Ellis: We ship a log4j appender with DSE2, so if your log data is coming from Java, it’s trivial to start streaming and indexing that into DSE. For non-Java systems, we’re looking at supporting ingestion through tools like Flume.

Q11. How do you adjust performance and capacity for various workloads depending on the application needs?

Jonathan Ellis: Currently reprovisioning nodes for different workloads is a manual, operator-driven procedure, made easy with our OpsCenter management tool. We’re looking at delivering automatic adaptation to changing workloads in a future release.

Q12. How is DSE 2.0 influenced by the DataStax partnership with Pentaho Corporation (February 28, 2012) and their Pentaho Kettle?

Jonathan Ellis: A question we get frequently is, “I’m sold on Cassandra and DSE, but I need to not only move data from my existing RDBMSs to you, but transform the data so that it fits into my new Cassandra data model. How can I do that?” With Sqoop, we can extract and load, but nothing else. The free Pentaho solution provides very powerful transformation capabilities to massage the incoming data in nearly every way under the sun before it’s inserted into Cassandra. It does this very fast too, and with a visual user interface.

Q13. Anything else to add?

Jonathan Ellis: DSE 2.0 is available for download now and is free to use, without any restrictions, for development purposes. Once you move to production, we do require a subscription, but I think you’ll find that the cost associated with DSE is much less than any RDBMS vendor.

_____________

Jonathan Ellis.
Jonathan Ellis is CTO and co-founder of DataStax (formerly Riptano), the commercial leader in products and support for Apache Cassandra. Prior to DataStax, Jonathan built a multi-petabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy. Jonathan is project chair of Apache Cassandra.

Related Posts

– Interview with Jonathan Ellis, project chair of Apache Cassandra (May 16, 2011).

Analytics at eBay. An interview with Tom Fastner (October 6, 2011).

On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com (November 2, 2011)

Related Resources

– Big Data and Analytical Data Platforms – Articles.

– NoSQL Data Stores – Articles, Papers, Presentations.

##

Mar 28 12

Publishing Master and PhD Thesis on Big Data.

by Roberto V. Zicari

In order to help disseminate the work of young students and researchers in the area of databases, I have started publishing Master and PhD theses on ODBMS.ORG.

Published Master and PhD theses are available for free download (as .pdf) to all visitors of ODBMS.ORG (50,000+ visitors/month).

Copyright of the Master and PhD theses remains with the authors.

The process of submission is quite simple. Please send (any time) by email to: editor AT odbms.org

1) a .pdf of your work

2) the filled in template below:
___________________________
Title of the work:
Language (English preferable):
Author:
Affiliation:
Short Abstract (max 2-3 sentences of text):
Type of work (PhD, Master):
Area (see classification below):
No of Pages:
Year of completion:
Name of supervisor/affiliation:

________________________________

To qualify for publication in ODBMS.ORG, the thesis should have been completed and accepted by the respective University/Research Center in 2011 or later, and it should be addressing one or more of the following areas:

> Big Data: Analytics, Storage Platforms
> Cloud Data Stores
> Entity Framework (EF)
> Graphs and Data Stores
> In-Memory Databases
> Object Databases
> NewSQL Data Stores
> NoSQL Data Stores
> Object-Relational Technology
> Relational Databases: Benchmarking, Data Modeling

For any questions, please do not hesitate to contact me.

Hope this helps.

Best Regards

Roberto V. Zicari
Editor
ODBMS.ORG
ODBMS Industry Watch Blog

##

Mar 16 12

In-memory database systems. Interview with Steve Graves, McObject.

by Roberto V. Zicari

“Application types that benefit from an in-memory database system are those for which eliminating latency is a key design goal, and those that run on systems that simply have no persistent storage, like network routers and low-end set-top boxes” — Steve Graves.

On the topic of in-memory database systems, I interviewed one of our experts, Steve Graves, co-founder and CEO of McObject.

RVZ

Q1. What is an in-memory database system (IMDS)?

Steve Graves: An in-memory database system (IMDS) is a database management system (DBMS) that uses main memory as its primary storage medium.
A “pure” in-memory database system is one that requires no disk or file I/O, whatsoever.
In contrast, a conventional DBMS is designed around the assumption that records will ultimately be written to persistent storage (usually hard disk or flash memory).
Obviously, disk or flash I/O is expensive, in performance terms, and therefore retrieving data from RAM is faster than fetching it from disk or flash, so IMDSs are very fast.
An IMDS also offers a more streamlined design. Because it is not built around the assumption of storage on hard disk or flash memory, the IMDS can eliminate the various DBMS sub-systems required for persistent storage, including cache management, file management and others. For this reason, an in-memory database is also faster than a conventional database that is either fully-cached or stored on a RAM-disk.

In other areas (not related to persistent storage) an IMDS can offer the same features as a traditional DBMS. These include SQL and/or native language (C/C++, Java, C#, etc.) programming interfaces; formal data definition language (DDL) and database schemas; support for relational, object-oriented, network or combination data designs; transaction logging; database indexes; client/server or in-process system architectures; security features, etc. The list could go on and on. In-memory database systems are a sub-category of DBMSs, and should be able to do everything that entails.

Q2. What are the significant differences between an in-memory database and a database that happens to be in memory (e.g., deployed on a RAM-disk)?

Steve Graves: We use the comparison to illustrate IMDSs’ contribution to performance beyond the obvious elimination of disk I/O. If IMDSs’ sole benefit stemmed from getting rid of physical I/O, then we could get the same performance by deploying a traditional DBMS entirely in memory – for example, using a RAM-disk in place of a hard drive.

We tested an application performing the same tasks with three storage scenarios: using an on-disk DBMS with a hard drive; the same on-disk DBMS with a RAM-disk; and an IMDS (McObject’s eXtremeDB). Moving the on-disk database to a RAM drive resulted in nearly 4x improvement in database reads, and more than 3x improvement in writes. But the IMDS (using main memory for storage) outperformed the RAM-disk database by 4x for reads and 420x for writes.

Clearly, factors other than eliminating disk I/O contribute to the IMDS’s performance – otherwise, the DBMS-on-RAM-disk would have matched it. The explanation is that even when using a RAM-disk, the traditional DBMS is still performing many persistent storage-related tasks.
For example, it is still managing a database cache – even though the cache is now entirely redundant, because the data is already in RAM. And the DBMS on a RAM-disk is transferring data to and from various locations, such as a file system, the file system cache, the database cache and the client application, compared to an IMDS, which stores data in main memory and transfers it only to the application. These sources of processing overhead are hard-wired into on-disk DBMS design, and persist even when the DBMS uses a RAM-disk.

An in-memory database system also uses the storage space (memory) more efficiently.
A conventional DBMS can use extra storage space in a trade-off to minimize disk I/O (the assumption being that disk I/O is expensive, and storage space is abundant, so it’s a reasonable trade-off). Conversely, an IMDS needs to maximize storage efficiency because memory is not abundant in the way that disk space is. So a 10 gigabyte traditional database might only be 2 gigabytes when stored in an in-memory database.

Q3. What is in your opinion the current status of the in-memory database technology market?

Steve Graves: The best word for the IMDS market right now is “confusing.” “In-memory database” has become a hot buzzword, with seemingly every DBMS vendor now claiming to have one. Often these purported IMDSs are simply the providers’ existing disk-based DBMS products, which have been tweaked to keep all records in memory – and they more closely resemble a 100% cached database (or a DBMS that is using a RAM-disk for storage) than a true IMDS. The underlying design of these products has not changed, and they are still burdened with DBMS overhead such as caching, data transfer, etc. (McObject has published a white paper, Will the Real IMDS Please Stand Up?, about this proliferation of claims to IMDS status.)

Only a handful of vendors offer IMDSs that are built from scratch as in-memory databases. If you consider these to comprise the in-memory database technology market, then the status of the market is mature. The products are stable, have existed for a decade or more and are deployed in a variety of real-time software applications, ranging from embedded systems to real-time enterprise systems.

Q4. What are the application types that benefit the use of an in-memory database system?

Steve Graves: Application types that benefit from an IMDS are those for which eliminating latency is a key design goal, and those that run on systems that simply have no persistent storage, like network routers and low-end set-top boxes. Sometimes these types overlap, as in the case of a network router that needs to be fast, and has no persistent storage. Embedded systems often fall into the latter category, in fields such as telco and networking gear, avionics, industrial control, consumer electronics, and medical technology. What we call the real-time enterprise sector is represented in the first category, encompassing uses such as analytics, capital markets (algorithmic trading, order matching engines, etc.), real-time cache for e-commerce and other Web-based systems, and more.

Software that must run with minimal hardware resources (RAM and CPU) can also benefit.
As discussed above, IMDSs eliminate sub-systems that are part-and-parcel of on-disk DBMS processing. This streamlined design results in a smaller database system code size and reduced demand for CPU cycles. When it comes to hardware, IMDSs can “do more with less.” This means that the manufacturer of, say, a set-top box that requires a database system for its electronic programming guide, may be able to use a less powerful CPU and/or less memory in each box when it opts for an IMDS instead of an on-disk DBMS. These manufacturing cost savings are particularly desirable in embedded systems products targeting the mass market.

Q5. McObject offers an in-memory database system called eXtremeDB, and an open source embedded DBMS, called Perst. What is the difference between the two? Is there any synergy between the two products?

Steve Graves: Perst is an object-oriented embedded database system.
It is open source and available in Java (including Java ME) and C# (.NET) editions. The design goal for Perst is to provide persistence for Java and C# objects that is as nearly transparent as practically possible within the normal Java and .NET frameworks. In other words, no special tools, byte codes, or virtual machine are needed. Perst should provide persistence to Java and C# objects while changing the way a programmer uses those objects as little as possible.

eXtremeDB is not an object-oriented database system, though it does have attributes that give it an object-oriented “flavor.” The design goals of eXtremeDB were to provide a full-featured, in-memory DBMS that could be used right across the computing spectrum: from resource-constrained embedded systems to high-end servers used in systems that strive to squeeze out every possible microsecond of latency. McObject’s eXtremeDB in-memory database system product family has features including support for multiple APIs (SQL ODBC/JDBC & native C/C++, Java and C#), varied database indexes (hash, B-tree, R-tree, KD-tree, and Patricia Trie), ACID transactions, multi-user concurrency (via both locking and “optimistic” transaction managers), and more. The core technology is embodied in the eXtremeDB IMDS edition. The product family includes specialized editions, built on this core IMDS, with capabilities including clustering, high availability, transaction logging, hybrid (in-memory and on-disk) storage, 64-bit support, and even kernel mode deployment. eXtremeDB is not open source, although McObject does license the source code.

The two products do not overlap. There is no shared code, and there is no mechanism for them to share or exchange data. Perst for Java is written in Java, Perst for .NET is written in C#, and eXtremeDB is written in C, with optional APIs for Java and .NET. Perst is a candidate for Java and .NET developers that want an object-oriented embedded database system, have no need for the more advanced features of eXtremeDB, do not need to access their database from C/C++ or from multiple programming languages (a Perst database is compatible with Java or C#), and/or prefer the open source model. Perst has been popular for smartphone apps, thanks to its small footprint and smart engineering that enables Perst to run on mobile platforms such as Windows Phone 7 and Java ME.
eXtremeDB will be a candidate when eliminating latency is a key concern (Perst is quite fast, but not positioned for real-time applications), when the target system doesn’t have a JVM (or sufficient resources for one), when the system needs to support multiple programming languages, and/or when any of eXtremeDB’s advanced features are required.

Q6. What are the current main technological developments for in-memory database systems?

Steve Graves: At McObject, we’re excited about the potential of IMDS technology to scale horizontally, across multiple hardware nodes, to deliver greater scalability and fault-tolerance while enabling more cost-effective system expansion through the use of low-cost (i.e. “commodity”) servers. This enthusiasm is embodied in our new eXtremeDB Cluster edition, which manages data stores across distributed nodes. Among eXtremeDB Cluster’s advantages is that it removes the performance ceiling imposed by being CPU-bound on a single server.

Scaling across multiple hardware nodes is receiving a lot of attention these days with the emergence of NoSQL solutions. But database system clustering actually has much deeper roots. One of the application areas where it is used most widely is in telecommunications and networking infrastructure, where eXtremeDB has always been a strong player. And many emerging application categories – ranging from software-as-a-service (SaaS) platforms to e-commerce and social networking applications – can benefit from a technology that marries IMDSs’ performance and “real” DBMS features with a distributed system model.

Q7. What are the similarities and differences between the various current database clustering solutions? In particular, let's look at dimensions such as scalability, ACID vs. CAP, intended/applicable problem domains, structured vs. unstructured data, and complexity of implementation.

Steve Graves: ACID support vs. “eventual consistency” is a good place to start looking at the differences between database clustering solutions (including some cluster-like NoSQL products). ACID-compliant transactions are Atomic, Consistent, Isolated and Durable; consistency implies that a transaction brings the database from one valid state to another and that every process has a consistent view of the database. ACID compliance enables an on-line bookstore to ensure that a purchase transaction updates the Customers, Orders and Inventory tables of its DBMS as a single, all-or-nothing unit. All other things being equal, this is desirable: updating Customers and Orders while failing to change Inventory could result in other orders being taken for items that are no longer available.
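
To make the bookstore example concrete, here is a minimal JDBC sketch in which the order insert and the inventory update commit or roll back as one unit; the table and column names are hypothetical.

import java.sql.*;

public class OrderTransaction {
    // Commits the order insert and the inventory update as one ACID unit,
    // or rolls both back on any failure. Table and column names are illustrative.
    public static void placeOrder(Connection conn, long customerId, long itemId, int qty) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement order = conn.prepareStatement(
                 "INSERT INTO Orders (customer_id, item_id, quantity) VALUES (?, ?, ?)");
             PreparedStatement stock = conn.prepareStatement(
                 "UPDATE Inventory SET on_hand = on_hand - ? WHERE item_id = ? AND on_hand >= ?")) {
            order.setLong(1, customerId);
            order.setLong(2, itemId);
            order.setInt(3, qty);
            order.executeUpdate();

            stock.setInt(1, qty);
            stock.setLong(2, itemId);
            stock.setInt(3, qty);
            if (stock.executeUpdate() == 0) {
                throw new SQLException("Insufficient inventory");  // forces the rollback below
            }
            conn.commit();    // both changes become visible atomically
        } catch (SQLException e) {
            conn.rollback();  // neither change is applied
            throw e;
        }
    }
}

Under an ACID-compliant DBMS, no other session can ever observe the order without the matching inventory decrement.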

However, enforcing the ACID properties becomes more of a challenge with distributed solutions, such as database clusters, because the node initiating a transaction has to wait for acknowledgement from the other nodes that the transaction can be successfully committed (i.e. there are no conflicts with concurrent transactions on other nodes). To speed up transactions, some solutions have relaxed their enforcement of these rules in favor of an “eventual consistency” that allows portions of the database (typically on different nodes) to become temporarily out-of-synch (inconsistent).

Systems embracing eventual consistency will be able to scale horizontally better than ACID solutions – it boils down to their asynchronous rather than synchronous nature.

Eventual consistency is, obviously, a weaker consistency model, and implies some process for resolving consistency problems that will arise when multiple asynchronous transactions give rise to conflicts. Resolving such conflicts increases complexity.
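
The distinction described above can be sketched in a few lines of Java. Everything in the sketch is hypothetical (the Node and Transaction interfaces and the method names); it only shows why the synchronous path must wait on every peer while the eventually consistent path does not.

import java.util.List;

// Hypothetical interfaces for illustration only.
interface Transaction {}

interface Node {
    boolean prepare(Transaction tx);   // could this node commit tx without conflicts?
    void commit(Transaction tx);
    void applyAsync(Transaction tx);   // queue the change and return immediately
}

class ClusterCoordinator {
    private final List<Node> peers;

    ClusterCoordinator(List<Node> peers) {
        this.peers = peers;
    }

    // ACID-style commit: the initiating node waits for every peer to acknowledge
    // that the transaction can commit before any node applies it.
    boolean commitSynchronously(Transaction tx) {
        for (Node n : peers) {
            if (!n.prepare(tx)) {
                return false;          // a conflict on any node aborts the transaction
            }
        }
        for (Node n : peers) {
            n.commit(tx);
        }
        return true;
    }

    // Eventual consistency: hand the change to each peer asynchronously; replicas
    // may briefly disagree, and conflicting changes must be reconciled later.
    void commitWithEventualConsistency(Transaction tx) {
        for (Node n : peers) {
            n.applyAsync(tx);
        }
    }
}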

Another area where clustering solutions differ is along the lines of shared-nothing vs. shared-everything approaches. In a shared-nothing cluster, each node has its own set of data.
In a shared-everything cluster, each node works on a common copy of database tables and rows, usually stored in a fast storage area network (SAN). Shared-nothing architecture is naturally more complex: if the data in such a system is partitioned (each node has only a subset of the data) and a query requests data that “lives” on another node, there must be code to locate and fetch it. If the data is not partitioned (each node has its own copy) then there must be code to replicate changes to all nodes when any node commits a transaction that modifies data.
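
For the partitioned shared-nothing case, a minimal Java sketch of the routing step looks like the following; the node list and the modulo hashing scheme are illustrative assumptions, not any particular product’s partitioning algorithm.

import java.util.List;

// Minimal sketch of how a shared-nothing cluster might decide which node owns
// the partition for a given key, so a query can be forwarded to that node.
public class PartitionRouter {
    private final List<String> nodes;   // e.g. ["node-a", "node-b", "node-c"]

    public PartitionRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    public String ownerOf(String key) {
        int bucket = Math.floorMod(key.hashCode(), nodes.size());
        return nodes.get(bucket);       // requests for this key are routed here
    }
}

In the replicated (non-partitioned) variant, this routing step disappears, but every committed change must instead be propagated to all nodes.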

NoSQL solutions emerged in the past several years to address challenges that occur when scaling the traditional RDBMS. To achieve scale, these solutions generally embrace eventual consistency (thus validating the CAP Theorem, which holds that a system cannot simultaneously provide Consistency, Availability and Partition tolerance). And this choice defines the intended/applicable problem domains. Specifically, it eliminates systems that must have consistency. However, many systems don’t have this strict consistency requirement – an on-line retailer such as the bookstore mentioned above may accept the occasional order for a non-existent inventory item as a small price to pay for being able to meet its scalability goals. Conversely, transaction processing systems typically demand absolute consistency.

NoSQL is often described as a better choice for so-called unstructured data. Whereas RDBMSs have a data definition language that describes a database schema and is recorded in a database dictionary, NoSQL databases are often schema-less, storing opaque “documents” that are keyed by one or more attributes for subsequent retrieval. Proponents argue that schema-less solutions free us from the rigidity imposed by the relational model and make it easier to adapt to real-world changes. Opponents argue that schema-less systems cater to lazy programmers, create a maintenance nightmare, and offer no equivalent of relational calculus or the ANSI standard for SQL. But the entire structured-versus-unstructured discussion is tangential to database cluster solutions.
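
To make the schema-less model concrete, here is a toy Java sketch of a store that keeps each document as an opaque blob and retrieves it only by the attributes it was keyed on; the class and method names are illustrative and do not reflect any particular NoSQL product’s API.

import java.util.*;

// Toy sketch of a schema-less store: documents are opaque byte arrays, and the
// store only understands the attribute values each document was keyed by.
public class DocumentStore {
    private final Map<String, byte[]> docsById = new HashMap<>();
    // attribute name -> attribute value -> ids of documents keyed by that value
    private final Map<String, Map<String, Set<String>>> keys = new HashMap<>();

    public void put(String id, byte[] opaqueDocument, Map<String, String> keyAttributes) {
        docsById.put(id, opaqueDocument);   // document contents are never inspected
        for (Map.Entry<String, String> e : keyAttributes.entrySet()) {
            keys.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                .computeIfAbsent(e.getValue(), v -> new HashSet<>())
                .add(id);
        }
    }

    public List<byte[]> findBy(String attribute, String value) {
        Set<String> ids = keys.getOrDefault(attribute, Map.of()).getOrDefault(value, Set.of());
        List<byte[]> result = new ArrayList<>();
        for (String id : ids) {
            result.add(docsById.get(id));
        }
        return result;
    }
}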

Q8. Are in-memory database systems an alternative to classical disk-based relational database systems?

Steve Graves: In-memory database systems are an ideal alternative to disk-based DBMSs when performance and efficiency are priorities. However, this explanation is a bit fuzzy, because what programmer would not claim speed and efficiency as goals? To nail down the answer, it’s useful to ask, “When is an IMDS not an alternative to a disk-based database system?”

Volatility is pointed to as a weak point for IMDSs. If someone pulls the plug on a system, all the data in memory can be lost. In some cases, this is not a terrible outcome. For example, if a set-top box programming guide database goes down, it will be re-provisioned from the satellite transponder or cable head-end. In cases where volatility is more of a problem, IMDSs can mitigate the risk. For example, an IMDS can incorporate transaction logging to provide recoverability. In fact, transaction logging is unavoidable with some products, such as Oracle’s TimesTen (it is optional in eXtremeDB). Database clustering and other distributed approaches (such as master/slave replication) contribute to database durability, as does use of non-volatile RAM (NVRAM, or battery-backed RAM) as storage instead of standard DRAM. Hybrid IMDS technology enables the developer to specify persistent storage for selected record types (presumably those for which the “pain” of loss is highest) while all other records are managed in memory.

However, all of these strategies require some effort to plan and implement. The easiest way to reduce volatility is to use a database system that implements persistent storage for all records by default – and that’s a traditional DBMS. So, the IMDS use-case occurs when the need to eliminate latency outweighs the risk of data loss or the cost of the effort to mitigate volatility.

It is also the case that flash memory and, especially, spinning disks are much less expensive than DRAM, which puts an economic lid on very large in-memory databases for all but the richest users. And, riches notwithstanding, it is not yet possible to build a system with hundreds of terabytes, let alone petabytes or exabytes, of memory, whereas disk storage has no such limitation.

By continuing to use traditional databases for most applications, developers and end-users are signaling that DBMSs’ built-in persistence is worth its cost in latency. But the growing role of IMDSs in real-time technology ranging from financial trading to e-commerce, avionics, telecom/Netcom, analytics, industrial control and more shows that the need for speed and efficiency often outweighs the convenience of a traditional DBMS.

———–
Steve Graves is co-founder and CEO of McObject, a company specializing in embedded Database Management System (DBMS) software. Prior to McObject, Steve was president and chairman of Centura Solutions Corporation and vice president of worldwide consulting for Centura Software Corporation.

Related Posts

A super-set of MySQL for Big Data. Interview with John Busch, Schooner.

Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB.

On Data Management: Interview with Kristof Kloeckner, GM IBM Rational Software.

vFabric SQLFire: Better then RDBMS and NoSQL?

Related Resources

ODBMS.ORG: Free Downloads and Links:
Object Databases
NoSQL Data Stores
Graphs and Data Stores
Cloud Data Stores
Object-Oriented Programming
Entity Framework (EF) Resources
ORM Technology
Object-Relational Impedance Mismatch
Databases in general
Big Data and Analytical Data Platforms
