"Trends and Information on AI, Big Data, Data Science, New Data Management Technologies, and Innovation."

This is the Industry Watch blog. To see the complete ODBMS.org
website with useful articles, downloads and industry information, please click here.

Mar 13 17

On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah

by Roberto V. Zicari

“What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).”–Amr Awadallah

I have interviewed Amr Awadallah, Chief Technology Officer at Cloudera.  
Main topics of the interview are: the new developments in Apache Spark 2.0 Beta and the Hadoop 3.0.0-alpha1 release; the lessons learned from Amr’s experience of using Hadoop at Yahoo!; and the business problems that the world’s leading organisations face.

RVZ

Q1. Before Cloudera, you served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organisations to use Hadoop for data analysis and business intelligence. What are the main lessons you learned in that period?

Amr Awadallah: Couple of things. First, I learned that Hadoop is capable of solving all the business intelligence problems that I had at Yahoo.
Namely:
(1) our systems weren’t scaling fast enough (we needed to cut down transformation times from hours to minutes),
(2) our systems weren’t economical on a $/TB basis thus making it hard to retain valuable data for longer time periods, and (3) we needed new methods to be able to store and analyze semi-structured (e.g. logs) and unstructured data (e.g. social media).
By implementing Hadoop in our team we saw first hand how it can address all these problems. The second lesson that I learned was that Hadoop, back then, was very rough to deploy and program against (it took us many months to deploy it and reprogram our transformations to run on it). It was these lessons that made it clear that there is room for a startup to focus on Hadoop since (1) it was solving very real data problems that many organizations will face, and (2) it needed a lot of polish to make it work smoothly, securely, and reliably within the enterprise.

Q2. In 2008 you founded Cloudera together with Mike Olson (Oracle), Jeff Hammerbacher (Facebook) and Christophe Bisciglia (Google). What was your main motivation at that time?

Amr Awadallah: Pretty much to do what I describe above, we wanted to make the Hadoop technology easy to use for organizations. That included: (1) creating a distribution for Hadoop that bundles all the necessary open-source projects that make it work (we call that CDH, short for Cloudera Distribution for Apache Hadoop). (2) We also created a number of proprietary system management, security, and meta-data management tools around CDH to make it easier for organizations to deploy and operate Hadoop in production.

Q3. What are the typical challenging business problems that the world’s leading organisations have?

Amr Awadallah: The technology we provide is very powerful and can be used to solve many problems across many industries, but we see four common themes: The first is simply using Hadoop as a faster, bigger, cheaper system for business intelligence and data analytics. i.e. a lot of organizations just use us to do things they have been doing already, just doing these things in a more economically scalable way.
The second use case is around deeper understanding of customers, i.e. moving away from segmenting all customers into a number of predefined buckets, but rather creating a dynamic micro-segment addressing each customer in a more precise way (thus reducing false positives).
The third use case is about using data to build better products and services, and this use-case is catalyzed by the internet-of-things. Due to smart-sensors we are able to measure the real world better than ever before; so this use-case is about taking all that data and leveraging it to either enhance our current product/service offerings, or build entirely new ones.
The fourth use case is about reducing business risk, and it manifests itself in a number of different sub-cases depending on the industry. For example, cyber-security is one of the key ways to reduce risk, and we have an open source project co-developed with Intel, called Apache Spot, which organizations can use to collect all their network flow data then use Spark machine learning algorithms to detect the anomalies in that data. Anti-money laundering and fraud detection is another way that our banking customers employ our platform to reduce risk within their businesses. Similarly, our insurance industry customers use our system to detect fraudulent claims, etc.
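
As an illustration of the anomaly-detection idea mentioned above, the following Python sketch flags unusually large network flows with a robust outlier score (the modified z-score, based on the median absolute deviation). It is not Apache Spot code and does not use Spark; the function name `flag_anomalies`, the threshold of 3.5 and the sample flow sizes are assumptions made up for this example.

```python
from statistics import median


def flag_anomalies(bytes_per_flow, threshold=3.5):
    """Return the indexes of flows whose size is far from the median.

    Hypothetical helper: scores each value with the modified z-score,
    0.6745 * |x - median| / MAD, and flags scores above the threshold.
    """
    med = median(bytes_per_flow)
    mad = median(abs(x - med) for x in bytes_per_flow)
    if mad == 0:
        # All values are (almost) identical, so nothing stands out.
        return []
    return [
        i for i, x in enumerate(bytes_per_flow)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Made-up flow sizes in bytes; one flow is far larger than the rest.
flows = [510, 495, 502, 488, 505, 498, 50_000, 501]
suspicious = flag_anomalies(flows)
```

The median-based score is used here instead of a mean and standard deviation because a single extreme value inflates the standard deviation and can hide itself.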

Q4. Can they be solved by analysing data? Can you give us some examples of how the use of advanced analytics drives business decisions?

Amr Awadallah: Yes, all the problems mentioned above can be solved with data. I want to highlight though that this isn’t necessarily about business decisions, which is what the Business Intelligence movement was about (we just help make that cheaper and faster). What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).
One of my favorite examples is a solution that one of our customers built to give voice to premature babies in neonatal intensive care units. They analyze the signals coming from the baby (sounds, blood pressure, heart rate, temperature, a few brain signals), and based on that a message appears on the monitor above the infant showing the nurse whether the baby is hungry, distressed from too much noise or light, etc.
That is really what we mean by using data to create new products and services that weren’t possible before (and not just reports/dashboard).

Q4. Graphs are important. Is it possible to do scalable graph analytics? If yes, how?

Amr Awadallah: Graphs are indeed important, a lot of our customer use-cases trace back to that (not just for social media analytics, but for example anti-money laundering requires analyzing relationships between many financial accounts for detecting bad behaviors, similarly for cyber security applications). I think scalability depends a fair bit on what’s being analyzed and how scalable we mean by scalable. But for most practical purposes I would say Spark’s GraphX is good enough. For example, you can compute PageRank fairly efficiently and scalably on a cluster using GraphX.
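
As a rough illustration of the PageRank computation mentioned above, the following single-machine Python sketch runs PageRank by power iteration on a tiny graph. It is not GraphX code: the function name `pagerank`, the damping factor of 0.85, the iteration count and the example edges are all assumptions chosen for illustration. A cluster framework distributes the same kind of iterative update across many machines.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of (src, dst) edges.

    Hypothetical helper for illustration only; not part of Spark or GraphX.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts each round with the "teleport" share.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * ranks[node] / len(targets)
                for dst in targets:
                    new_ranks[dst] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * ranks[node] / n
                for other in nodes:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Made-up example: accounts A, B and C all send funds to account D.
example_edges = [("A", "D"), ("B", "D"), ("C", "D"), ("D", "A")]
example_ranks = pagerank(example_edges)
```

In this toy graph the account that receives transfers from everyone else ends up with the highest rank, which is the kind of relationship signal the anti-money laundering use case relies on.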

Q5. Data security is increasingly important. The risk is due to the growing number of device endpoints. What solutions exist to minimise such risk?

Amr Awadallah: A comprehensive enterprise data security strategy seeks to mitigate the risks presented by a growing number of potentially compromised endpoints connecting to corporate networks. Endpoint security will enable one or all of the following preventative controls:
The first is policy-based enforcement of endpoint security configuration prior to granting an endpoint access to network-based corporate assets. This ensures that any endpoint connected to corporate networks meets minimum requirements for endpoint security configuration.
The second measure is endpoint based anti-malware software (the existence of which may be a policy requirement to connect to the network per the first measure). Anti-malware prevents malicious code from infecting endpoints by monitoring for changes to system configuration and unusual activity or processes.
The third measure is endpoint encryption of corporate data on hard drives, folders and even removable media.
As mentioned above we also collaborate with Intel on Apache Spot, which tracks network flow patterns to detect anomalous communication behavior between different devices (including end point devices). Apache Spot just recently won InfoWorld 2017 Tech of the Year Award. Other advanced analytics security partners we closely work with are: CounterTack, Securonix, Niara, and Jask.

Q6. You recently announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform. How does it work? And how does it differ from the Hadoop-based data platform?

Amr Awadallah: First, at a meta-level, Hadoop (MapReduce specifically) was very good at achieving scalable computation by spreading jobs across many CPU cores and hard disk spindles. That said, MapReduce wasn’t very efficient in how it leveraged memory to optimize the performance of data processing pipelines that have many stages or iterations.
The main power of Spark, that made it take over from MapReduce, was how it truly leveraged memory to achieve better performance in deep or iterative data pipelines. That coupled with a simpler developer API made Spark take over very quickly from MapReduce.
Most of our new customer implementations for data processing or data science tend to be in Spark these days, versus MapReduce.
I should clarify however that this doesn’t mean that Hadoop is dead as some say. Apache Hadoop is comprised of three key subsystems: (1) MapReduce for computation, (2) YARN for resource scheduling, and (3) HDFS for storage. Spark only replaces MapReduce, we still rely heavily on both YARN and HDFS.

That said, the most notable features in Apache Spark 2.0 are:

1) Dataset API: It is a new API that represents the distributed collections of objects processed by Spark’s execution engine. It is an extension of Spark’s DataFrame API. It improves upon the DataFrame API by providing type-safe, object-oriented programming interfaces. Users can now write User-Defined Functions and Lambda functions that provide compile-time type safety. With the Dataset API, users benefit from optimized operations (like sort, join, hash, etc.) in the SparkSQL engine, while also getting compile-time type safety for user-defined functions.

2) Model & Pipeline Persistence in Spark’s ML library: Machine learning Pipelines built with Spark’s ML library can now be serialized to a file and read back in.
The ability to save and reload these pipelines makes it easy for users to perform version control on the pipelines and safely distribute the pipelines. This helps in operationalizing them in production systems.

3) Structured Streaming: New stream processing API and engine that provides SQL like abstractions for authoring operations on data streams, and also improves performance by using the SparkSQL engine for processing the data streams. However, this is still an experimental API and not ready for production usage yet.

Besides the above 3 notable enhancements, there are a bunch of performance and scalability improvements across the board.

Q7. Apache Impala vs. Amazon Redshift: How Does Redshift Compare to Impala?

Amr Awadallah: Apache Impala is an analytic database engine architecturally designed to perform high-performance highly-concurrent SQL analytics on scalable, open data platforms like Hadoop’s HDFS and Amazon S3.
Impala decouples data storage from compute and lets users query data without having to move/load data specifically into an Impala storage-engine (it doesn’t have one). This architectural difference uniquely enables Impala to deliver a more flexible Business Intelligence experience than traditional database architectures like Redshift (which requires pre-loading the data).

Some of the key benefits of the Impala approach include:

* On-demand resources that are immediately ready to query existing S3 data without loading to a different data silo
* Ability to elastically grow/shrink clusters as needed due to decoupled storage and compute
* More predictable, multi-tenant isolation due to the ability to have multiple Impala clusters sharing a common S3 data repository
* Ability to share common data not only amongst Impala clusters, but also any application that runs on cloud-native S3 storage (for example, you can have both Apache Impala and Apache Spark run against the same data asset in S3, while it isn’t possible to have Apache Spark easily access the data stored in Redshift, it has to go through SQL first).
* Greater flexibility to explore new use cases, analytics, and data by directly querying S3 without rigid traditional data models and ETL

Not only does Impala deliver this additional flexibility, it does so at greater cost-performance and scalability compared to Redshift. See the following benchmark for data on that.

That said, Redshift’s sweet spot is in a different target as a smaller datamart as most Redshift installations are in the dozen of nodes range where Redshift’s limitations in scalability, elasticity, flexibility, and requirement to maintain separate copies of data are less critical.

Q8. What is Apache Kudu, and why is it relevant for Impala Users?

Amr Awadallah: Historically we had two storage engines in our distribution: (1) HDFS which is optimized for high-throughput analytics, but doesn’t support updates/inserts and (2) HBase which is optimized for low-latency updates/inserts but isn’t good for doing high-throughput queries. To build a proper data warehouse or time-series analytics system, you typically still need to make updates/inserts and that was why we created Apache Kudu.

Kudu is a new storage system that combines the benefits of both HDFS and HBase into one: it allows for low-latency updates/inserts, but also supports high-throughput analytical queries (i.e. fast analytics on fast moving data).
Unlike HDFS, Kudu is not a file-system, it is a record-based system, so the unit of storage is a record as opposed to a file. This allows Kudu to unlock Impala for real-time streaming applications that were not possible with HDFS.
In HDFS the data would only be visible to Impala after we finish closing the file, which typically happens after a large number of records are accumulated (that adds latency between when records are written to when they become visible to the analytical engine). With Kudu as soon as a record is written it is immediately visible to the Impala analytical engine. Finally, just like HDFS and HBase, the Kudu storage engine is fully integrated with our entire stack, not just Impala.
For example, you can also use Apache Spark for machine-learning jobs directly against Kudu.

Q9. The Apache Hadoop project recently announced its 3.0.0-alpha1 release. What is it?

Amr Awadallah: HDFS Erasure Coding is really the main exciting new feature in Hadoop 3. Traditionally HDFS required three replicas, by default, for every data block to achieve durability, concurrent performance, and availability. Using erasure coding techniques, HDFS in Hadoop 3 allows us to significantly reduce the storage overhead from 3x (i.e. 200%) to just 20% extra bits for parity. This will allow us to achieve the same durability benefits of 3x replication, but comes at the cost of potentially lower concurrent performance (when more than one job is trying to access the same block at the same time) and lower availability resilience in the face of top-of-rack switch failures (less of an issue these days).
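
To make the storage arithmetic above concrete, the Python sketch below compares the overhead of replication with that of a parity-based layout, and shows the simplest possible erasure code (a single XOR parity block) rebuilding a lost block. This is not HDFS code: the function names and block contents are made up, and Hadoop 3 relies mainly on Reed-Solomon codes, which tolerate more than one lost block. The extra-storage ratio is simply parity blocks divided by data blocks, so the exact figure depends on the layout; the layout of 10 data blocks and 2 parity blocks used in the comment is an assumption that yields 20%.

```python
from functools import reduce


def replication_overhead(replicas):
    """Extra storage as a fraction of the raw data, e.g. 3 replicas -> 2.0."""
    return float(replicas - 1)


def parity_overhead(data_blocks, parity_blocks):
    """Extra storage for a layout with the given data and parity blocks.

    For example, an assumed layout of 10 data + 2 parity blocks gives 0.2.
    """
    return parity_blocks / data_blocks


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three equally sized data blocks plus one XOR parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second block, then rebuild it from the survivors.
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)
```

A single XOR parity block can only repair one missing block per group, which is why production systems use stronger codes; the trade-off between parity count and tolerated failures is the same in principle.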

Other cool additions are ATS v2 and classpath isolation which you can read more about here

Q10. What is the roadmap ahead for Cloudera Enterprise?

Amr Awadallah: We don’t discuss details of our product roadmap publicly, but there are three guiding themes for us in 2017: The first theme is fast-analytics on fast-moving data (which I covered above in regards to Kudu).
The second theme is cloud, which is making Cloudera Enterprise work better in cloud environments, and make it easier to move workloads (and skill sets) from on-premise clusters to transient cloud clusters in AWS, Azure, and/or Google Cloud.
The third theme is simplifying data-science and machine learning development, especially reducing the time from when a new algorithm is developed to how it can be deployed into production (stay tuned for more on that front).
——————————
Amr Awadallah, Ph.D. Chief Technology Officer, Cloudera
Before co-founding Cloudera in 2008, Amr (@awadallah) was an Entrepreneur-in-Residence at Accel Partners. Prior to joining Accel he served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr joined Yahoo after they acquired his first startup, VivaSmart, in July of 2000. Amr holds Bachelor’s and Master’s degrees in Electrical Engineering from Cairo University, Egypt, and a Doctorate in Electrical Engineering from Stanford University.

Resources

Download Page for Apache Spark™

Apache Impala supported by Cloudera Enterprise

DATA-X: Videobook- 8 short videos introduce query analytics for Apache Hadoop

A package that allows R developers to use Hadoop HBase

Book: Big Data Analytics with Spark

Related Posts

Streaming Analytics for Chain Monitoring. By Natalino Busa, Head of Data Science at Teradata — Thursday, ODBMS.org January 12, 2017

Five Challenges to IoT Analytics Success. By Dr. Srinath Perera. ODBMS.org SEPTEMBER 23, 2016

Next-Generation Genomics Analysis with Apache Spark. by Jason Bailey. ODBMS.org Thursday, June 30th, 2016

Supporting the Fast Data Paradigm with Apache Spark. By Stephen Dillon, Data Architect, Schneider Electric. ODBMS.org, 23 APR, 2016

– The new series of Q&A with Leading Data Scientists– ODBMS.org:
Part II
Part I

Follow us on Twitter: @odbmsorg

##

Feb 27 17

On Digital labor: Technology, Challenges and Opportunities. Interview with Michael Henry

by Roberto V. Zicari

“Digital labor is the name for a new class of tools that can automate routine cognitive tasks. The benefits of automation are similar to previous waves. Many years ago I helped automate a reconciliation function for a large asset manager. Humans took authorization reports from their investment control system and matched them against the confirmations coming from their counterparts. This was a terrible job, and luckily no one does this anymore.
Digital labor has the potential to improve the financial services sector by improving compliance, providing more analytics for risk and control functions, and improving efficiency.”–Michael Henry

I have interviewed Michael Henry, Principal at KPMG LLP. In the interview we covered the challenges faced by financial institutions due to existing regulatory standards, KPMG’s solution to automate the onboarding process for their clients, and the potential impact of digital labor on the financial services sector.

RVZ

Q1. The Organisation for Economic Co-operation and Development (OECD) proposed a Common Reporting Standard (CRS) for the Automatic Exchange of Information (AEOI) that implies a significant increase in the customer due diligence and reporting obligations of financial institutions across the world. What is the implication for your clients?

Michael Henry: The new reporting requirement will require financial institutions to collect and examine more information about their clients for the purposes of tax withholding and reporting. Banks and other regulated institutions will have to examine information from their clients to make sure they are reporting their true residence for tax purposes. This is similar to the US Internal Revenue Service’s FATCA requirements. And like FATCA, many banks will respond by asking for more documentation from their clients and adding staff to perform due diligence on that documentation.

Q2. Specifically, what is “client on boarding”? How is it normally implemented by large financial institutions?

Michael Henry: Client on boarding refers to the series of processes that a financial institution undergoes to determine whether or not it should move forward with conducting or renewing business with a given customer.
The term is inclusive of the underlying regulatory and compliance practices governed by anti-money laundering (AML) and know-your-customer (KYC) rules.
Many large financial institutions deploy thousands of staff, often in low cost offshore locations to perform this function. These staff are usually equipped with basic workflow and data management technology. At Tier 1 organizations this can cost hundreds of millions of dollars annually while pinning their reputations on the shoulders of junior resources making subjective compliance policy interpretations.
For this basic client identification and validation process, one of our clients employs thousands of people in an offshore location. Because this work is boring and repetitive, the client tells us that the attrition rate is more than 10% per month. This presents an enormous risk to the business, as banks entrust their client experience, business results, and reputations to cheap clerical labor that likely joined the bank only a few months ago.

Q3. What are the typical problems?

Michael Henry: The bank must collect information to identify the client and determine the risk that the client will engage in some kind of unlawful activity. To perform this function, the bank must process a large volume of data that enters the bank electronically or through documents. Reading and interpreting documents and trying to apply complex compliance rules using manual processes is time-consuming, error-prone, and expensive.
Technology – Workflow, case management, relational databases, and imaging technologies, while mature and effective, still require human beings to read, transcribe, and interpret data.
Inconsistency – Human operators interpret complex decision-trees of rules. The risk of subjectivity grows with the size of the operation.
Accuracy – The majority of today’s onboarding representatives execute what amount to “stare and compare” and “stare, copy and enter” processes. Over the course of a business day in which hundreds of pages or documents will be read and thousands of keystrokes completed, it is inevitable that operator errors will occur.

Q4. You have worked on a solution as a service to automate the onboarding process for your clients. Can you explain in a nutshell how you did it?

Michael Henry: The solution is comprised of multiple digital labor components to read documents and apply policy rules by machines instead of people.
Humans focus on exceptions, i.e., cases which really require human judgment. Because the exception rates are low, much of the activity becomes straight-through.
The technology uses a combination of robotics, big data, and natural language processing integrated for the solution of KYC, AML, Tax classification, and other compliance activities.
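
The division of work described above, where machines apply policy rules and people only see the exceptions, can be sketched in a few lines of Python. This is not KPMG's platform: the rule names, the record fields and the sample clients are hypothetical, and real compliance policies involve far more rules than the three shown.

```python
def check_record(record, rules):
    """Return the names of the rules that the record fails."""
    return [name for name, rule in rules.items() if not rule(record)]


def route(records, rules):
    """Split records into straight-through cases and human exceptions."""
    straight_through, exceptions = [], []
    for record in records:
        failed = check_record(record, rules)
        if failed:
            # Keep the failed rule names so a reviewer knows where to look.
            exceptions.append((record["client_id"], failed))
        else:
            straight_through.append(record["client_id"])
    return straight_through, exceptions


# Hypothetical policy rules, each a predicate over one client record.
RULES = {
    "has_tax_residence": lambda r: bool(r.get("tax_residence")),
    "id_document_present": lambda r: bool(r.get("id_document")),
    "residence_matches_address": lambda r: r.get("tax_residence") == r.get("address_country"),
}

# Made-up client records for illustration.
clients = [
    {"client_id": "C1", "tax_residence": "DE", "address_country": "DE", "id_document": "passport"},
    {"client_id": "C2", "tax_residence": "", "address_country": "FR", "id_document": "passport"},
    {"client_id": "C3", "tax_residence": "US", "address_country": "US", "id_document": "license"},
]

auto_cleared, needs_review = route(clients, RULES)
```

Because every record passes through the same rule set, the outcome does not depend on which operator happened to handle the case, which is the consistency benefit the interview points to.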

Q5. How difficult was it to integrate domain knowledge into advanced technology?

Michael Henry: Domain knowledge is critical. KPMG invested significant regulatory and compliance expertise to reinvent this process for ourselves and our clients. The technology only works because of this investment.
We use advanced technology, but it is all commercially available. Our ability to define specific ontologies and compliance rules on that technology is the differentiator.

Q6. How do you capture information from SEC filings, blog entries, social media, text messages and other sources of structured and unstructured data without manual intervention?

Michael Henry: We capture information from structured and unstructured sources through a combination of technologies. Optical character recognition (OCR) and natural language processing (NLP) software drive our content enrichment process. This allows our platform to ingest unstructured documents (with or without metadata), identify them, and then extract the relevant content according to our ontological models. Some exception processing occurs at this stage, especially if the quality of the documentation is poor.

Q7. How do you integrate, organize and mine customer data?

Michael Henry: Customer data are ingested to the platform through system extracts, tying in to document repositories and the establishment of secure FTP sites. These data then pass through our content enrichment engine and ultimately reside in our MarkLogic NoSQL database.

Q8. Why did you choose MarkLogic’s Enterprise NoSQL database?

Michael Henry: First, we are solving mission-critical problems for the world’s leading financial institutions. We needed to have an institutional-grade, enterprise-hardened database at the core of our platform.
Second, given the size of the data sets involved, we needed a highly scalable database that could handle petabytes of data while simultaneously staging and orchestrating multiple run-time sequences. Finally, we found MarkLogic very aligned to our vision and a good partner in bringing the solution to market.

Q9. How do you use semantics, text analytics and visualisation?

Michael Henry: Semantic analysis allows us to handle unstructured data in natural language formats. Extracting the list of beneficial owners from a 100-page trust document can take a human hours. The tools are so proficient now that, with the right ontological models, we can obtain dozens of data points from an unstructured document at high volumes with little human intervention. We have been able to ingest hundreds of individual loan documents and produce a data hierarchy by client, by loan, and by event.

Q10. What results did you obtain so far? What is the order of magnitude reduction in human efforts you obtained? As human involvement in the process declines, is the number of errors in reports also declining?

Michael Henry: Today, we serve more than 20 clients. In the tax compliance area, a human may spend more than an hour ingesting a W8 form and conducting due diligence. Most of this is reading KYC documents. Our platform has the ability to handle more than 10 of these per hour per human exception handler. If the task involves humans reading documents and applying validation or other policies, and the rate of actual exceptions is low, we can take 80-90% of the manual effort out. And the tools keep getting better.
More important than the productivity gain is the consistency and accuracy of the automation. No human operator can apply thousands of policy rules consistently. We continue to tune our models, and the machine never forgets.

Q11. In your opinion, what is the impact of the introduction of “Digital Labor” services for the job service market and for society at large?

Michael Henry: Digital labor is the name for a new class of tools that can automate routine cognitive tasks. The benefits of automation are similar to previous waves. Many years ago I helped automate a reconciliation function for a large asset manager. Humans took authorization reports from their investment control system and matched them against the confirmations coming from their counterparts. This was a terrible job, and luckily no one does this anymore.
Digital labor has the potential to improve the financial services sector by improving compliance, providing more analytics for risk and control functions, and improving efficiency.

************************************************************

Michael Henry, Principal, Financial Services, KPMG LLP
Michael is a Principal in KPMG’s Digital Labor practice with more than 25 years’ experience in financial services. Michael specializes in the application of sophisticated technologies (big data, natural language processing, artificial intelligence, machine learning, workflow and robotics) to automate compliance processes. Michael has worked with global and regional banks, and his experience includes living and working in Europe and Asia.

Resources

– FATCA Onboarding & Compliance Solution. KPMG, 2015 (LINK to .PDF)

Related Posts

High-performance Compliance Capture and Analytics Solution for Financial Institutions. Interview with Michael Hay and Oskar Mencer. ODBMS Industry Watch, Published on 2017-01-26

On fraud detection, Medicaid, and the insurance industry. Interview with Charles Kaminski Jr. ODBMS Industry Watch, Published on 2016-11-01

##

Feb 13 17

On in-memory, key-value data stores. Ofer Bengal and Yiftach Shoolman

by Roberto V. Zicari

“While modernizing legacy applications used to be a key reason for deploying in-memory, key-value data stores, we see that this is changing. New applications particularly those that are highly interactive need to bring a user experience that is very responsive under all conditions. For such new applications, an in-memory datastore, particularly one that can simplify run time analytics like counting, scoring, managing lists and sets, is becoming a key ingredient for low latency responses and high throughput.”  –Ofer Bengal.

I have interviewed Ofer Bengal, Co-Founder and CEO of Redis Labs, and Yiftach Shoolman, Co-Founder and CTO of Redis Labs.
Main topics of the interview are: how the database market is evolving, proprietary vs. open source software, in-memory/key-value data stores, and the new features of Redis.

RVZ

Q1. How do you see the database market evolving?

Ofer Bengal, Yiftach Shoolman: The main trends we identify today and believe will continue in upcoming years are:
1) Non-relational databases will continue to see growing adoption, because the schema framework is ineffective when it comes to unstructured data, change in data patterns, growing data volumes, more stringent performance requirements and the way modern apps are built.
2) Multiple database models as opposed to the absolute dominance of RDBMS in the past few decades, each model solving the requirements of certain use cases.
Moreover, certain modern databases can run several database models (document, graph, etc.)
3) Multiple databases (different types or the same type) serving the same app. Modern applications are based on micro service architecture, in which each micro service works with the best database for its use case.
This creates new challenges for modern databases: (a) Instant provisioning – sometimes hundreds or thousands of databases are provisioned within a second, and (b) Multi-tenancy, otherwise the cost associated with managing database infrastructure becomes extremely high.
4) Database-as-a-service is growing vs. self-deployed and operated databases. With enterprises gradually moving to the cloud and having to deal with multiple types of databases, it makes a lot of sense to outsource deployment and ongoing operations rather than building an in-house practice of DBAs and DevOps.
5) Hybrid transactional and analytical processing (HTAP). Driven by the need for application analytics to drive business decision making in real time, certain modern databases can handle those two different workloads simultaneously, eliminating the need for exporting transactional data to a separate dedicated analytical database.

Q2. Proprietary vs. open source software: what are the pros and cons?

Ofer Bengal, Yiftach Shoolman: From the community perspective, open source is great. A vibrant community drives innovation and problem solving, and helps resolve compatibility issues with different environments.
From the user’s perspective, open source is “open”, accessible, can be used by anyone, transparent, and free of charge.
It often comes with less of a danger of vendor lock-in. It is very suitable for independent developers and startups. However enterprises using open source products may have certain challenges:
1. The product is not always suitable for enterprise workloads, especially when it comes to databases. Capabilities like infinite seamless scaling, high-availability with instant failover and stable performance at scale are not always the open source developer’s top priority.
2. Commercial support must be obtained and this typically comes with a price tag which is not much different than acquiring a commercial database product.
3. Commercial support is typically provided by a single company (most probably founded by the open source creators), which creates “vendor lock-in” by itself.
4. In the case of databases, using database-as-a-service may turn out to be lower in cost compared to provisioning cloud instances and running zero-cost open source software on them, because a commercial service can be based on an efficient multi-tenant architecture.

Q3. What is the current market for in-memory, key-value data stores?

Ofer Bengal: In-memory key-value data stores (sometimes called in-memory data grids (IMDGs)) have been around for more than a decade and have proven capable of supporting digital business needs for responsive, always-on user experience; real-time, actionable insights; and dynamic scaling. They are widely employed when you want to scale/modernize legacy applications without spending additional money on extremely expensive RDBMS licenses and hardware. This is achieved by providing a scalable and reliable in-memory datastore that enables low-latency transactional and analytical processing.
While modernizing legacy applications used to be a key reason for deploying in-memory, key-value data stores, we see that this is changing. New applications, particularly those that are highly interactive, need to deliver a user experience that is very responsive under all conditions. For such new applications, an in-memory datastore, particularly one that can simplify run time analytics like counting, scoring, managing lists and sets, is becoming a key ingredient for low latency responses and high throughput.

From a Redis perspective, our innovation in data structures brings about the ability to simplify development to the extent that now most Redis users use it as a first responder and primary datastore for substantial pieces of their data. Furthermore with Redis’ data-structures, users can run operational and analytical use cases on the same database.
In addition, acceleration of other in-memory platforms like Spark is possible with Redis.

Gartner estimates that, in 2015, the stand-alone IMDG market was worth approximately $600 million, having grown by about 30% from the previous year. Gartner expects the market to continue to grow in the double-digit range through 2020 and to exceed $1 billion by 2018. Redis, one of the leaders in this space, grew in just a few years to be one of the most popular databases used by developers and enterprises.

Q4. Amazon ElastiCache supports two open-source in-memory engines: Redis and Memcached. What does it mean in practice?

Yiftach Shoolman: In practice, Amazon ElastiCache is a simple caching service that simplifies a developer experience by providing these two open source in-memory engines. Legacy applications that use simple cache can use ElastiCache seamlessly.
However, ElastiCache is single-tenant, limited to caching use cases and cannot be used as a database, lacking enterprise-grade functionalities such as infinite seamless scalability, instant failover and predictable performance.
The Redis Labs equivalent service, called Redis Cloud, provides all the benefits of an enterprise-class Redis.

Q5. What are the pros and cons of Memcached and Redis?

Yiftach Shoolman: Redis can be thought of as a modern database, while memcached is an older technology designed specifically for ephemeral caching.
The most important difference is in persistence and HA – memcached is neither persistent nor HA, while Redis can operate as a full-fledged in-memory database, highly available through both in-memory replication and data persistence. This reflects the fact that caches in older architectures were not required to be highly available, but in modern architectures, built for scale and volume, cache outages can significantly impact the business and user experience.
Redis, the newer and more versatile technology, allows individual data elements to be manipulated, while memcached often incurs serialization/deserialization overheads that make the entire application processing much slower. This is because Memcached can handle only simple key value use cases, whereas Redis offers many more data structures (hashes, sets, sorted sets, lists, HyperLogLog, etc.) that simplify complex data processing, analysis and operational use cases.
Even when used as a cache, Redis has more sophisticated eviction policies, which can be either active or passive, while memcached has only a simple LRU and lazy eviction.
Redis and Memcached are both very popular open source projects, but given its richer functionality, more advanced design, many potential uses, and greater cost efficiency at scale, Redis should be your first choice in nearly every case.
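
The serialization point above can be made concrete with a small, self-contained Python sketch. It does not talk to a real Memcached or Redis server: `BlobCache` and `HashStore` are hypothetical in-process stand-ins, and `bytes_moved` is an illustrative counter for data crossing a simulated client/server boundary. The `hincrby` method is only modelled on the Redis HINCRBY command, which increments one field of a hash on the server side.

```python
import json


class BlobCache:
    """Memcached-style stand-in: every value is an opaque serialized blob."""

    def __init__(self):
        self._data = {}
        self.bytes_moved = 0  # bytes crossing the simulated client/server boundary

    def get(self, key):
        blob = self._data[key]
        self.bytes_moved += len(blob)
        return blob

    def set(self, key, blob):
        self._data[key] = blob
        self.bytes_moved += len(blob)


class HashStore:
    """Redis-hash-style stand-in: the store understands fields inside a value."""

    def __init__(self):
        self._data = {}
        self.bytes_moved = 0

    def hset(self, key, field, value):
        self._data.setdefault(key, {})[field] = value
        self.bytes_moved += len(field) + len(str(value))

    def hincrby(self, key, field, amount):
        fields = self._data.setdefault(key, {})
        fields[field] = int(fields.get(field, 0)) + amount
        self.bytes_moved += len(field) + len(str(amount))
        return fields[field]


def bump_visits_blob(cache, key):
    """Read-modify-write: fetch the whole blob, decode, change, encode, store."""
    profile = json.loads(cache.get(key))
    profile["visits"] += 1
    cache.set(key, json.dumps(profile))
    return profile["visits"]


profile = {"name": "alice", "bio": "x" * 1000, "visits": 41}

blob_cache = BlobCache()
blob_cache.set("user:1", json.dumps(profile))
blob_cache.bytes_moved = 0  # measure only the update itself
blob_visits = bump_visits_blob(blob_cache, "user:1")

hash_store = HashStore()
for field, value in profile.items():
    hash_store.hset("user:1", field, value)
hash_store.bytes_moved = 0  # measure only the update itself
hash_visits = hash_store.hincrby("user:1", "visits", 1)
```

Both paths end with the same counter value, but the blob path moves the full profile twice while the field-level path moves only the field name and the increment.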

Q6. For very large data sets or analytics workloads, running everything in-memory might not be cost effective. What is your take on this?

Ofer Bengal, Yiftach Shoolman: For very large data sets or analytics workloads, it is advantageous to utilize alternative memory technologies (such as Flash memory, which is a tenth of the cost), as extensions of memory rather than impose a disk access penalty. We have extended enterprise Redis in this manner to take advantage of Flash memory, while using a tiered approach (keys and hot values are still in the fastest memory, while cold values are in “slower” Flash memory) to ensure that you still see sub-millisecond latencies with millions of ops/sec throughput.
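
The tiered idea described above (all keys and hot values in RAM, cold values in a slower tier) can be sketched in a few lines of Python. This is an illustration only: the interview does not say how Redis Labs decides which values are hot, so the least-recently-used policy, the `TieredStore` class and its `hot_capacity` parameter are all assumptions, and a plain dict stands in for Flash.

```python
from collections import OrderedDict


class TieredStore:
    """Keeps every key in RAM, hot values in RAM, cold values in a slower tier."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # key -> value, least recently used first
        self.cold = {}            # stand-in for a Flash-backed value store
        self.keys = set()         # the key index always stays in RAM

    def put(self, key, value):
        self.keys.add(key)
        self.cold.pop(key, None)  # a fresh write makes the value hot again
        self.hot[key] = value
        self.hot.move_to_end(key)
        self._demote_if_needed()

    def get(self, key):
        if key not in self.keys:
            raise KeyError(key)
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        # Cold hit: pay the slower read once, then promote the value.
        value = self.cold.pop(key)
        self.hot[key] = value
        self._demote_if_needed()
        return value

    def _demote_if_needed(self):
        # Push least recently used values down until the hot tier fits.
        while len(self.hot) > self.hot_capacity:
            old_key, old_value = self.hot.popitem(last=False)
            self.cold[old_key] = old_value


store = TieredStore(hot_capacity=2)
store.put("a", 1)
store.put("b", 2)
store.put("c", 3)  # "a" is least recently used and moves to the cold tier
```

Reading `"a"` afterwards still succeeds: the value is promoted back to the hot tier and the least recently used hot value is demoted in its place.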

Q7. Redis was created by Salvatore Sanfilippo in 2009. What is his role today?

Ofer Bengal: Salvatore is leading the development of open source Redis within Redis Labs. He works with a group of experienced developers on extending the capabilities of Redis. A good example of this collaborative work is the recent introduction of Redis Modules, which extend Redis to a variety of new modern use cases. Salvatore wrote the API, and the other team members created and tested a few modules in a very short time, such as Redisearch (a full-text search engine) and Redis-ML (enhancing the performance of Spark machine learning capabilities). Salvatore’s role is to continue the community innovation around the Redis core, together with his team of Redis Labs developers.

Q8. What are the differences between Redis Labs’ version of Redis and the original one developed in 2009?

Yiftach Shoolman: Redis Labs fully supports the open source Redis versions, but enhances them with a container-like layer that adds a proxy, cluster management and a shared nothing architecture. Taken together, Redis Labs provides a solid enterprise foundation to Redis, allowing it to scale seamlessly in memory across many hundreds of servers, with high availability through persistence, in-memory cross-rack/zone/region/datacenter replication and instant automatic failover. No retooling or re-architecting is required to move from open source Redis to enterprise Redis; the process is basically effortless and immediate. Redis Labs also offers various database modules, like RediSearch, multiple probabilistic modules like Bloom Filter, TopK and CMS, Redis-ML for Machine Learning, Redis-TS for Time Series processing, and JSON and Graph support.

Q9. What are the possible scenarios of using Redis for data analytics?

Ofer Bengal, Yiftach Shoolman: Redis data structures come with built-in simple analytic operations like counting, ranking, scoring, ranges and more. Over time, probabilistic data structures have added the ability to analytically estimate millions and trillions of events, without requiring memory to store all of the events.
Set operations have made it possible to simplify comparisons, intersections, unions of sets – analytics that are usually complicated with data stores. RQL (Redis SQL) and secondary indexing allow executing complex SQL queries on an existing Redis database. And finally, recent modules like RediSearch, Neural Redis and Redis-ML have added advanced search and machine learning capabilities – not naturally occurring in any other databases.
With all of these possibilities, and with the move to automated decision making, we see increasing usage of Redis for data analytics scenarios.
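
To show how a probabilistic structure can estimate the number of distinct events without storing them all, here is a minimal Python sketch. Note that it uses a K-minimum-values (KMV) estimator, chosen because it is short to write; it is not the HyperLogLog algorithm that Redis itself provides. The `KMVCounter` class and the choice of `k` are assumptions for illustration.

```python
import hashlib
import heapq


class KMVCounter:
    """Estimates distinct items by keeping only the k smallest hash values."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # max-heap (negated values) of the k smallest hashes
        self._members = set()  # the same hashes, for duplicate checks

    @staticmethod
    def _hash(item):
        # Map the item to a roughly uniform float on the unit interval.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return  # already tracked
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


counter = KMVCounter(k=1024)
for i in range(50_000):
    counter.add(f"user-{i % 10_000}")  # 10,000 distinct visitors, each seen 5 times
approx = counter.estimate()
```

Memory use is bounded by `k` hash values regardless of how many events arrive, and the estimate is typically within a few percent of the true count for `k=1024`.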

Q10. How safe is a Redis server?

Yiftach Shoolman: The Redis enterprise server comes with client-based SSL authentication, built-in cloud firewall support (when running on public clouds), password authentication and role-based authorization that enables customizing security levels.

Qx. Anything else you wish to add?

Ofer Bengal: Redis is a game-changer when it comes to databases, and its progression over the last seven years has demonstrated that the industry and market are demanding performance and increasing flexibility to deal with all types of data processing, storage and analytic scenarios. Redis’ core values have always included high performance, high throughput and very low latencies. With the visionary addition of modules, the community has turned it into an all purpose datastore – suitable for any scenario that needs a database.

____________________________________

Ofer Bengal, Co-Founder and CEO of Redis Labs
Ofer is a serial entrepreneur who has founded and led several companies in the areas of data communications, telecommunications, Internet, homeland security and medical devices. Ofer was founder & CEO of RIT Technologies (NASDAQ: RITT), a provider of sophisticated telecommunications and data communications systems to major world carriers. He began his career as an aerospace engineer in the Israeli Air Force and then built his own aerospace engineering consulting firm. As a hobby, he has also invented, developed and licensed toy concepts to companies such as Milton Bradley, Hasbro and Tomy. Ofer holds a Bachelor of Science (cum laude) in aerospace engineering from the Technion, Israel Institute of Technology.

Yiftach Shoolman, Co-Founder and CTO of Redis Labs
Yiftach is an experienced technologist, having held leadership engineering and product roles in diverse fields from application acceleration, cloud computing and software-as-a-service (SaaS), to broadband networks and metro networks. He was the founder, president and CTO of Crescendo Networks (acquired by F5, NASDAQ:FFIV), the vice president of software development at Native Networks (acquired by Alcatel, NASDAQ: ALU) and part of the founding team at ECI Telecom broadband division, where he served as vice president of software engineering. Yiftach holds a Bachelor of Science in Mathematics and Computer Science and has completed studies for Master of Science in Computer Science at Tel-Aviv University.

Resources
Redis Cloud Now Available with Integrated Billing through AWS Marketplace- News Release- January 10, 2017.

AWS SaaS Marketplace.

Redis Documentation

EBOOK – REDIS IN ACTION This book covers the use of Redis, an in-memory database/data structure server.

Related Posts

New Gartner Magic Quadrant for Operational Database Management Systems. Interview with Nick Heudecker, ODBMS Industry Watch, November 30, 2016

Follow us on Twitter: @odbmsorg

##

Jan 26 17

High-performance Compliance Capture and Analytics Solution for Financial Institutions. Interview with Michael Hay and Oskar Mencer.

by Roberto V. Zicari

“New regulations such as MIFID II indeed aim at increasing transparency, which in turn requires more precise reporting. These reports require a lot of data to be stored and data capture to be ultra accurate.”– Michael Hay and Oskar Mencer.

Hitachi Data Systems and Maxeler Technologies announced a cooperation around High-performance Compliance Capture and Analytics Solution for Financial Institutions. I have interviewed Michael Hay, VP & CHIEF ENGINEER – HITACHI DATA SYSTEMS, and Oskar Mencer, CEO, CTO, Maxeler Technologies Inc.

RVZ

Q1. What is Multi-scale Dataflow Computing?

O. Mencer: Generally, Multiscale Dataflow Computing is a computing paradigm aimed at optimizing operational efficiency of computing by computing data as it is moving through a system. We use Dataflow to minimize the sum of all distances that the data has to travel. We overlay Dataflow with a Multiscale approach of vertically optimizing the algorithm, the architecture and arithmetic.

Q2. There is an emerging EU Financial Services directive called MIFID II. This EU directive, and its associated regulation, was designed to help the regulators better handle High Frequency Trading (HFT) and so-called Dark Pools, in other words, to increase transparency in the markets. What are the technological demands posed by this new financial legislation and the associated compliance regulations?

M. Hay, O. Mencer: New regulations such as MIFID II indeed aim at increasing transparency, which in turn requires more precise reporting. These reports require a lot of data to be stored and data capture to be ultra accurate. It is an ideal environment for Hitachi data solutions to be combined with Maxeler’s low latency capability.

Q3. To address these challenges, Maxeler Technologies Inc. announced a collaboration with Hitachi Data Systems to offer a high-performance compliance capture and analytics solution. Can you please explain what this solution is about?

M. Hay, O. Mencer: We are combining programmable low latency compute with high capacity “Dataflow-like storage” and modern analytics software. This allows us to attack even the toughest customer challenges and provide competitive advantage within modest development time.

Q4. How can this solution help financial institutions achieve high-frequency, transaction-related record keeping mandated in European Union MiFID II and US Dodd-Frank regulations?

M. Hay: Hitachi’s Data Lake solutions can help to unify the wide range of regulatory data challenges faced by today’s financial institutions. With high end filtering and analytics capability added to the system, we can address regulation but also integration and security issues all within a single system.

Q5. In this cooperation, you have accomplished an operational prototype through the use of Maxeler’s DFE (Data Flow Engine) network cards, Dataflow based capture/decode capability executing on Dataflow hardware, a hardware accelerated NFS client, Hitachi’s CB500, Pentaho, and Hitachi Unified Storage (HUS). Can you explain how this architecture works?

M. Hay, O. Mencer: Our architecture accomplishes tight integration between realtime on-the-wire compute and storage. The realtime computing ability and reliability of the storage ensure that no data is lost and reports can be generated on time and on budget.

Q6. With your Multiscale Dataflow technology data is streamed from memory onto a chip where the data moves directly from one functional unit to another, without being written to off-chip memory until the entire process is complete. What is the advantage of this solution with respect to a classical ETL process?

O. Mencer: In a classical ETL process the database is in the critical loop. With the Multiscale Dataflow approach we remove the database from the critical loop and utilize an in-memory copy of the data for ultrafast access and in-memory analytics.
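
As a loose software analogy for the streaming idea discussed in this answer (and only an analogy: it is not Maxeler's hardware, API or programming model), the Python sketch below chains generator stages so that each record flows from one functional unit to the next without the full data set ever being materialized in between. The stage names, the record layout and the sample feed are all made up for illustration.

```python
def parse(lines):
    """Functional unit 1: decode each raw record as it arrives."""
    for line in lines:
        symbol, price, qty = line.split(",")
        yield symbol, float(price), int(qty)


def large_trades(trades, min_qty):
    """Functional unit 2: filter, passing records straight to the next unit."""
    for trade in trades:
        if trade[2] >= min_qty:
            yield trade


def notional_by_symbol(trades):
    """Functional unit 3: aggregate; only the small running totals are kept."""
    totals = {}
    for symbol, price, qty in trades:
        totals[symbol] = totals.get(symbol, 0.0) + price * qty
    return totals


# Hypothetical feed of "symbol,price,quantity" records.
raw_feed = ["ABC,10.0,100", "XYZ,20.0,5", "ABC,11.0,200", "XYZ,21.0,300"]
totals = notional_by_symbol(large_trades(parse(iter(raw_feed)), min_qty=100))
```

A staged ETL job would instead write the parsed and filtered records to an intermediate store between steps; here the only state that persists is the final aggregate.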

Q7. The overall system from packet capture to NFS write does not use a single server side CPU cycle. What does it mean in practice?

O. Mencer: We use a special substrate to create a dataflow computer by connecting vast numbers of arithmetic units, and implement networking state machines right down on the hardware level. This means that the packet flow through the system is in a tight hardware loop and only metadata travels through conventional CPUs. Additionally, on the storage side Hitachi’s Unified Storage also uses Dataflow-like structures to implement a full set of Network File Serving, a Filesystem and smart object caching for file system object I/O. In this way usage of general CPU cycles is further minimized.
The impact to customers is decreased space needed for the solution coupled to significant performance improvements.

Q8. You claim that dataflow computing can accelerate and run different applications orders of magnitude faster than conventional CPUs. Do you have any benchmarking results to share?

O. Mencer: Benchmarks are not applications and there is no claim that we can accelerate tiny benchmarks.
Our technology enables complete applications with a purpose in the real world to run orders of magnitude faster. For example, in 2011 a Tier 1 investment bank won the American Finance Technology Award for their installation of a machine from Maxeler, which reduced the time to calculate risk from 8 hours down to 2 minutes.

Q9. The Maxeler-Hitachi Data Systems solution leverages the new Amazon AWS F1 instance. Why? Can you please elaborate on this?

M. Hay, O. Mencer: Our joint hardware solution complements the F1 instance for on-premise activities in a hybrid cloud setting. It helps that the latest Maxeler generation (MAX5) is fully compatible with F1 and it is therefore easy to build a hybrid cloud solution with a single code base. If the reader would like to learn more we’re open and able to entertain discussions about finding relevant problems to engage on.

——————————————-

MICHAEL HAY | マイケル ヘイ
VP & CHIEF ENGINEER – HITACHI DATA SYSTEMS. GENERAL MGR, DIGITAL SOLUTIONS BUSINESS DEVELOPMENT – HITACHI, SPBD
As Vice President and Chief Engineer at Hitachi Data Systems and a General Manager of the Service Business Platform Division in Japan, Michael leads a global team that contemplates and enacts the future of Hitachi’s expanding ICT and Social Innovation portfolios. Michael engages a variety of R&D teams, using a clear understanding of market requirements, to guide direction and inspire innovation. Michael joined HDS in 2001 after serving as CEO and owner of a consultancy company focused on complex Enterprise and Systems management design and deployments. His professional background spans over 20 years and includes stints at IBM, IBM partners, and other IT start-up companies. These roles have helped Michael develop a capacity to define solutions for tomorrow’s problems. Michael holds a Masters in Industrial Engineering with a focus in Human Factors from San Jose State and a Bachelors degree in Electrical Engineering from the University of New Mexico, in Albuquerque, NM.

Oskar Mencer. Prior to founding Maxeler, Oskar was Member of Technical Staff at the Computing Sciences Center at Bell Labs in Murray Hill, leading the effort in “Stream Computing”. He joined Bell Labs after receiving a PhD from Stanford University. Besides driving Maximum Performance Computing (MPC) at Maxeler, Oskar was Consulting Professor in Geophysics at Stanford University and he is also affiliated with the Computing Department at Imperial College London, having received two Best Paper Awards, an Imperial College Research Excellence Award in 2007 and a Special Award from Com.sult in 2012 for “revolutionising the world of computers”.

————————–
Resources

Video: Maxeler Dataflow Engine attached to a Hitachi Data Systems HNAS. Dr. Itay Greenspon, Maxeler.

Maxeler Technologies Inc Collaborates with Hitachi Data Systems Around High-performance Compliance Capture and Analytics Solution for Financial Institutions. 23 Dec 2016

Maxeler is official AWS F1 Instance Partner. 05 Dec 2016

Video: What is OpenSPL? Professor Michael J Flynn, Stanford University
OpenSPL is an open standard for a novel Spatial Programming Language. It is based on the core concept that a program executes in space, rather than in time sequence. All operations are assumed to be parallel unless specified to be sequential. This is similar to a factory floor where all operations execute in parallel, but each operation executes a different part of the overall process. Temporal Programming is a recipe for the execution of actions, whereas Spatial Programming builds a factory to execute the recipe.

Multiscale Dataflow Computing AppGallery

HPC Matters to our Quality of Life and Prosperity. by Don Johnston, Lawrence Livermore National Laboratory

Related Posts

Hitachi Data Systems Works with Maxeler Technologies. Posted by Michael Hay, Jan 3, 2017

The Many Core Phenomena. Blog Post created by Michael Hay on Dec 13, 2016


##

Jan 9 17

On Data Analysis. Interview with Rob Winters

by Roberto V. Zicari

“I’ve managed several employees who have successfully transitioned from an operations role to an analytics role. In fact, some of them have become my best analysts because they have brought a deeper domain knowledge to their analyses than someone approaching from the outside may have done. “–Rob Winters

I have interviewed Rob Winters, Head of Business Intelligence at TravelBird. The interview covers Rob’s project experience with data analytics and HPE Vertica.

RVZ

Q1. What is the business of TravelBird?

Rob Winters: TravelBird builds and provides a daily selection of inspirational holiday offerings in twelve markets across Europe. Our goal is to create packages which excite the imagination and bring simplicity and joy to the act of travelling. These packages are then shared with our travellers via email, our website, and our iOS and Android applications.

Q2. What are the current data projects at TravelBird?

Rob Winters: TravelBird’s journey with being data driven is relatively short; we began our initial Business Intelligence buildout in mid-2015. Currently our BI team is engaged in a number of projects, both more traditional BI and advanced analytics, including:
– Building data sources and training an organization in self-service BI
– Replacing our generic daily selections with personalized content selection models
– Optimizing pricing of packages based on product price volatility and customer demand
– Adjusting email frequency and timing to improve customer engagement and lifetime value

Q3. What is your experience in using predictive analytics?

Rob Winters: I have been working in the predictive analytics field for six years now across a variety of problem areas – customer service, retail, gaming, and now travel. From a technology standpoint I originally worked heavily with commercial solutions (Teradata, SAS) but for the last four years have used almost exclusively open source software including Hadoop, Spark, R, and Python.

Q4. How do you evaluate whether the insights you discover are “good”?

Rob Winters: During the initial development of our algorithms we will typically follow a basic version of CRISP-DM to build an initial working model for our problem. To test models, we always use an A/B test and typically follow a two phase process: first the model is split-test against the current operational process/human selection, then when the model consistently outperforms the status quo, we will test future model iterations against the control.
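
The split-testing step described above can be illustrated with a standard two-proportion z-test in Python. The interview does not say which statistic TravelBird actually uses, so this test, the function name `two_proportion_z_test` and the sample conversion counts are assumptions for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: arm A is the human selection, arm B is the model.
z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
```

With these made-up counts the model arm converts at 2.6% against 2.0% for the control; a small p-value suggests the gap is unlikely to be noise, which is the kind of evidence needed before promoting the model to be the new control.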

Q5. Can you tell us a bit about the work you did in designing and implementing a fully automated, machine learning based content selection platform?

Rob Winters: To provide context, every day our planning team creates six unique product offerings for their target market of 50-500k customers to be shared via web, iOS/Android app, and email. Our goal was to replace that model with one that selects six unique products for each recipient based on past browsing and travel behavior. To do so, we designed an ensemble model consisting of several components:
– A customer preference model (user-item recommendation model)
– A product similarity model (item-item similarity)
– A “hotness” model to promote destinations which are trending/outperforming/expected to do well
– A portfolio model to select the right diversity for each recipient based on recommendation confidence, lifecycle state, and yield optimization of cannibalization vs product fit for a recipient

The data to feed these models is based on observing dozens of events per recipient per day, positive and negative feedback events of the recipient, all observable product features, and human expert input. The models are also able to improve themselves by continuously tuning the input parameters of each model based on recommendation split testing.
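
As a rough sketch of how component scores might be blended and then turned into a diverse selection, consider the Python example below. It is not TravelBird's implementation: the weighted sum, the greedy per-destination cap, the weights and the product names are all hypothetical, and only two of the four components described above (preference and hotness) are represented as score sources, with a simple stand-in for the portfolio step.

```python
def blend_scores(component_scores, weights):
    """Weighted sum of per-model scores for each product."""
    products = set()
    for scores in component_scores.values():
        products.update(scores)
    return {
        product: sum(
            weights[name] * scores.get(product, 0.0)
            for name, scores in component_scores.items()
        )
        for product in products
    }


def pick_portfolio(blended, destination_of, size, max_per_destination):
    """Greedy pick of top-scoring products with a cap per destination."""
    chosen, per_destination = [], {}
    # Highest score first; ties broken by product name for determinism.
    for product in sorted(blended, key=lambda p: (-blended[p], p)):
        dest = destination_of[product]
        if per_destination.get(dest, 0) < max_per_destination:
            chosen.append(product)
            per_destination[dest] = per_destination.get(dest, 0) + 1
        if len(chosen) == size:
            break
    return chosen


# Hypothetical scores for one recipient.
component_scores = {
    "preference": {"rome_3n": 0.9, "rome_5n": 0.8, "paris_2n": 0.4, "oslo_4n": 0.2},
    "hotness": {"rome_3n": 0.1, "rome_5n": 0.2, "paris_2n": 0.9, "oslo_4n": 0.5},
}
weights = {"preference": 0.75, "hotness": 0.25}
destination_of = {
    "rome_3n": "rome", "rome_5n": "rome", "paris_2n": "paris", "oslo_4n": "oslo",
}

blended = blend_scores(component_scores, weights)
selection = pick_portfolio(blended, destination_of, size=3, max_per_destination=1)
```

Here the second Rome package scores higher than Paris or Oslo but is skipped by the diversity cap, which is the trade-off between product fit and cannibalization that the portfolio component is meant to manage.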

Q6. What are the primary technologies you are using?

Rob Winters: Our technology stack consists of the following:
-BI: Tableau
-Data warehousing: HPE Vertica
-Operations DBs: MySQL (web services) + Postgres (internal services)
-Recommendations serving: Redis
-Modeling/Analysis: Python, Spark via PySpark

Q7. What is your experience in using HPE Vertica?

Rob Winters: I have been using Vertica for five years in a number of organizations and facilitated the first rollout in the Netherlands. During that time I have been primarily an end user/data analyst but have also been the DBA for my deployments for the last two years.

Q8: Can you give us some more technical details of what was this first rollout in the Netherlands? What challenges did you solve in using HPE Vertica? What business benefits did you obtain?

Rob Winters: The objective of our rollout was to implement a centralized company datawarehouse to unify several production databases plus external API data.
The existing platform was Postgres (row-based solution) and relatively limited in performance. Primary gains were significantly faster analytics, the ability to add in several terabytes of event data (which was not possible on the prior platform), and new insights into the email database regarding churn, conversion, and customer value.

Q9: What were the main criteria for you to choose HPE Vertica? Did you do any performance test for HPE Vertica?

Rob Winters: We considered a number of alternatives including Microsoft PDW, Greenplum, and Infobright.
The primary considerations were price/performance, scalability, and analytical functionality. We found Vertica to be the best option across those aspects. Regarding performance testing, we did compare Infobright and Vertica and found the latter to be both more performant and easier to work with.

Q10. What specific functionalities of HPE Vertica do you find particularly useful in your job?

Rob Winters: There are a number of aspects which I find extremely beneficial, including:
-Ease of administration
-Performance tunability is very good, much higher than (for example) Redshift
-Analytical function extensions enable extremely powerful analyses directly via SQL
-The ability to load JSON data allows very rapid data integration from new sources

Q11. Do you think it is possible to turn an employee into a data analyst?

Rob Winters: Absolutely, I’ve managed several employees who have successfully transitioned from an operations role to an analytics role. In fact, some of them have become my best analysts because they have brought a deeper domain knowledge to their analyses than someone approaching from the outside may have done. The biggest drivers for success in the transition have been:
– Attitude/eagerness to learn
– Close collaboration with a more experienced analyst, either their supervisor or a more senior peer
– Making their initial projects in areas where they are unable to fall back on domain knowledge

——
Rob Winters, Head of Business Intelligence at TravelBird.
Rob has been working with and leading analytics teams since 2006 across a number of industries including telco, gaming, retail, and travel. His primary focus since 2011 has been green-field implementations of technology and team creation for both traditional business intelligence and predictive analytics; full details are listed on his LinkedIn profile. He holds a bachelor’s in economics and an MBA with an IT concentration.

Resources

Data-X: Video lectures on very practical and applied Data Analytics. Data-X is a project to produce a collection of video lectures on very practical and applied data analytics.

HPE Vertica 8 “Frontloader” BY Jeff Healey. ODBMS.org SEPTEMBER 12, 2016

Benchmarking HPE Vertica and Amazon Redshift. (Webinar)

HPE Vertica Analytics Platform on Microsoft Azure. By Chris_Daly. ODBMS.org SEPTEMBER 12, 2016

Hewlett Packard Enterprise Introduces HPE Vertica 8. ODBMS.org SEPTEMBER 7, 2016

Related Posts

On Data Analytics and the Enterprise. Interview with Narendra Mulani. ODBMS Industry Watch, May 24, 2016

On data analytics for finance. Interview with Jason S.Cornez. ODBMS Industry Watch, May 17, 2016

A/B Testing is not art, it is science. By Ramkumar Ravichandran, Director, Analytics, Visa Inc. ODBMS.org, MAY 22, 2015


##

Dec 19 16

Big Data and The Great A.I. Awakening. Interview with Steve Lohr

by Roberto V. Zicari

“I think we’re just beginning to grapple with implications of data as an economic asset” –Steve Lohr.

My last interview for this year is with Steve Lohr. Steve Lohr has covered technology, business, and economics for the New York Times for more than twenty years. In 2013 he was part of the team awarded the Pulitzer Prize for Explanatory Reporting. We discussed Big Data and how it influences the new Artificial Intelligence awakening.

Wishing you all the best for the Holiday Season and a healthy and prosperous New Year!

RVZ

Q1. Why do you think Google (TensorFlow) and Microsoft (Computational Network Toolkit) are open-sourcing their AI software?

Steve Lohr: Both Google and Microsoft are contributing their tools to expand and enlarge the AI community, which is good for the world and good for their businesses. But I also think the move is a recognition that algorithms are not where their long-term advantage lies. Data is.

Q2. What are the implications of that for both business and policy?

Steve Lohr: The companies with big data pools can have great economic power. Today, that shortlist would include Google, Microsoft, Facebook, Amazon, Apple and Baidu.
I think we’re just beginning to grapple with implications of data as an economic asset. For example, you’re seeing that now with Microsoft’s plan to buy LinkedIn, with its personal profiles and professional connections for more than 400 million people. In the evolving data economy, is that an antitrust issue of concern?

Q3. In this competing world of AI, what is more important, vast data pools, sophisticated algorithms or deep pockets?

Steve Lohr: The best answer to that question, I think, came from a recent conversation with Andrew Ng, a Stanford professor who worked at GoogleX, is co-founder of Coursera and is now chief scientist at Baidu. I asked him why Baidu, and he replied there were only a few places to go to be a leader in A.I. Superior software algorithms, he explained, may give you an advantage for months, but probably no more. Instead, Ng said, you look for companies with two things — lots of capital and lots of data. “No one can replicate your data,” he said. “It’s the defensible barrier, not algorithms.”

Q4. What is the interplay and implications of big data and artificial intelligence?

Steve Lohr: The data revolution has made the recent AI advances possible. We’ve seen big improvements in the last few years, for example, in AI tasks like speech recognition and image recognition, using neural network and deep learning techniques. Those technologies have been around for decades, but they are getting a huge boost from the abundance of training data because of all the web image and voice data that can be tapped now.

Q5. Is data science really only a here-and-now version of AI?

Steve Lohr: No, certainly not only. But I do find that phrase a useful way to explain to most of my readers — intelligent people, but not computer scientists — the interplay between data science and AI. To convey that rudiments of data-driven AI are already all around us. It’s not — surely not yet — robot armies and self-driving cars as fixtures of everyday life. But it is internet search, product recommendations, targeted advertising and elements of personalized medicine, to cite a few examples.

Q6. Technology is moving beyond increasing the odds of making a sale, to being used in higher-stakes decisions like medical diagnosis, loan approvals, hiring and crime prevention. What are the societal implications of this?

Steve Lohr: The new, higher-stakes decisions that data science and AI tools are increasingly being used to make — or assist in making — are fundamentally different than marketing and advertising. In marketing and advertising, a decision that is better on average is plenty good enough. You’ve increased sales and made more money. You don’t really have to know why.
But the other decisions you mentioned are practically and ethically very different. These are crucial decisions about individual people’s lives. Better on average isn’t good enough. For these kinds of decisions, issues of accuracy, fairness and discrimination come into play.
That, I think, argues for two things. First, some sort of auditing tool; the technology has to be able to explain itself, to explain how a data-driven algorithm came to the decision or recommendation that it did.
Second, I think it argues for having a “human in the loop” for most of these kinds of decisions for the foreseeable future.

Q7. Will data analytics move into the mainstream of the economy (far beyond the well known, born-on-the-internet success stories like Google, Facebook and Amazon)?

Steve Lohr: Yes, and I think we’re seeing that now in nearly every field — health care, agriculture, transportation, energy and others. That said, it is still very early. It is a phenomenon that will play out for years, and decades.
Recently, I talked to Jeffrey Immelt, the chief executive of General Electric, America’s largest industrial company. GE is investing heavily to put data-generating sensors on its jet engines, power turbines, medical equipment and other machines — and to hire software engineers and data scientists.
Immelt said if you go back more than a century to the origins of the company, dating back to Thomas Edison‘s days, GE’s technical foundation has been materials science and physics. Data analytics, he said, will be the third fundamental technology for GE in the future.
I think that’s a pretty telling sign of where things are headed.

—————————–
Steve Lohr has covered technology, business, and economics for the New York Times for more than twenty years and writes for the Times’ Bits blog. In 2013 he was part of the team awarded the Pulitzer Prize for Explanatory Reporting.
He was a foreign correspondent for a decade and served as an editor, and has written for national publications such as the New York Times Magazine, the Atlantic, and the Washington Monthly. He is the author of Go To: The Story of the Math Majors, Bridge Players, Engineers, Chess Wizards, Maverick Scientists, Iconoclasts—the Programmers Who Created the Software Revolution and Data-ism The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else.
He lives in New York City.

————————–

Resources

Google (TensorFlow): TensorFlow™ is an open source software library for numerical computation using data flow graphs.

Microsoft (Computational Network Toolkit): A free, easy-to-use, open-source, commercial-grade toolkit that trains deep learning algorithms to learn like the human brain.

Data-ism: The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else. By Steve Lohr. 2016, HarperCollins Publishers

Related Posts

Don’t Fear the Robots. By STEVE LOHR. -OCT. 24, 2015-The New York Times, SundayReview | NEWS ANALYSIS

G.E., the 124-Year-Old Software Start-Up. By STEVE LOHR. -AUG. 27, 2016- The New York Times, TECHNOLOGY

Machines of Loving Grace. Interview with John Markoff. ODBMS Industry Watch, Published on 2016-08-11

Recruit Institute of Technology. Interview with Alon Halevy. ODBMS Industry Watch, Published on 2016-04-02

Civility in the Age of Artificial Intelligence, by STEVE LOHR, technology reporter for The New York Times, ODBMS.org

On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.

On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch.

Follow us on Twitter: @odbmsorg

##

Nov 30 16

New Gartner Magic Quadrant for Operational Database Management Systems. Interview with Nick Heudecker

by Roberto V. Zicari

“It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.”–Nick Heudecker.

I have interviewed Nick Heudecker, Research Director on Gartner’s Data & Analytics team.
The main topic of the interview is the new Magic Quadrant for Operational Database Management Systems.

RVZ

Q1. You have published the new Magic Quadrant for Operational Database Management Systems (*). How do you define the operational database management system market?

Nick Heudecker: We define a DBMS as a complete software system used to define, create, manage, update and query a database. DBMSs provide interfaces to independent programs and tools that both support and govern the performance of a variety of concurrent workload types. There is no presupposition that DBMSs must support the relational model or that they must support the full set of possible data types in use today. OPDBMSs must include functionality to support backup and recovery, and have some form of transaction durability — although the atomicity, consistency, isolation and durability model is not a requirement. OPDBMSs may support multiple delivery models, such as stand-alone DBMS software, certified configurations, cloud (public and private) images or versions, and database appliances.

Q2. Can you explain the methodology you used for this new Magic Quadrant?

Nick Heudecker: Several Gartner methodologies are public. The Magic Quadrant methodology can be found here.

We use a number of data sources when we’re creating the Magic Quadrant for Operational Database Management Systems.
We survey vendor reference customers and include data from our interactions with Gartner clients. We also consider earlier information and any news about vendors’ products, customers and finances that came to light during the time frame for our analysis.

Once we have the data, we score vendors across the various dimensions of Completeness of Vision and Ability to Execute.
One thing that’s important to note is Magic Quadrants are relative assessments of vendors in a market. We couldn’t have one vendor on an MQ because it would be right in the middle – there’s nothing to compare it to.

Q3. Why were there no Visionaries this year?

Nick Heudecker: We determined there was an overall lack of vision in the market. After a few years of rapid feature expansion, the focus has shifted to operational excellence and execution. Even Leaders shifted to the left on vision, but are still placed in the Leaders quadrant based on their vision for the development of hybrid database management, hardware optimization and integration, emerging deployment models such as containerization, as well as vertical features.

Q4. Were you surprised by the analysis and some of the results you obtained?

Nick Heudecker: The lack of overall vision in the market struck us the most. Other than in a few notable cases, we received largely the same story from most vendors. The explosion of features, and the vendors emerging to implement them, has slowed. The features that initiated the expansion, such as storing new data types, geographically distributed storage, cloud and flexible data consistency models, have become common. Today, nearly every established or emerging DBMS vendor supports these features to some degree. The OPDBMS market has shifted from a phase of rapid innovation to a phase of maturing products and capabilities.

Q5. Do you believe the “NoSQL” label will continue to distinguish DBMSs?

Nick Heudecker: If you look at the entire operational DBMS space, there’s already a great deal of convergence between NoSQL vendors, as well as between NoSQL and traditionally relational vendors. Nearly every vendor, nonrelational and relational, supports multiple data types, like JSON documents, graph or wide-column. NoSQL vendors are adding SQL: MongoDB’s BI Connector and Couchbase’s N1QL are good, if diverse, examples. They’re also adding things like schema management and data validation capabilities.
On the relational side, they’re adding horizontal scaling options and alternative consistency models, as well as modern APIs. And everyone either has or is adding in-memory and cloud capabilities.

It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.

Q6. What are the other “Vendors to Consider”?

Nick Heudecker: The other vendors to consider are vendors that did not meet the inclusion requirements for the Magic Quadrant. Usually this is because they missed our minimum revenue requirements, but that doesn’t mean they don’t have compelling products.

——————————-
Nick Heudecker is a Research Director on Gartner’s Data & Analytics team. His coverage includes data management technologies and practices.

——————————-

Resources
(*) Magic Quadrant for Operational Database Management Systems. Published: 05 October 2016. ID: G00293203. Analyst(s): Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald

– Complimentary Gartner Research: 100 Data and Analytics Predictions Through 2020. Get exclusive access to Gartner’s top 100 data and analytics predictions through 2020. Plus access other relevant Gartner research including Magic Quadrant reports for database and data warehouse solutions, and the market guide for in-memory computing (LINK to MemSQL web site – registration required).

Related Posts

MarkLogic Named a Next-Generation Database Challenger in 2016 Gartner Magic Quadrant. By GARY BLOOM, Chief Executive Officer and President MARKLOGIC

MarkLogic Recognized in New Gartner® Magic Quadrant. Gartner Magic Quadrant for Operational Database Management Systems positions MarkLogic® the highest for ability to execute in the Challengers Quadrant

– Accelerating Business Value with a Multi-Model, Multi-Workload Data Platform

– NuoDB Recognized by Gartner in Critical Capabilities for Operational Database Management Systems. Elastic SQL database achieves top five score in all four use cases.

– Clustrix Recognized in Gartner Magic Quadrant for Operational Database Management Systems

– Learn why EDB is named a “Challenger” in the 2016 Gartner ODBMS Magic Quadrant

– DataStax Receives Highest Scores in 2 Use Cases in Gartner’s Critical Capabilities for Operational Database Management Systems

– Gartner Scores Oracle Highest In 3 of 4 Use Cases: Gartner Critical Capabilities for Operational Database Management Systems Report

– Gartner Critical Capabilities For Operational Database Management Systems 2016 – Redis Labs Ranked Second Highest In 2/4 Categories (Link – Registration required)


Follow us on Twitter: @odbmsorg

##

Nov 1 16

On fraud detection, Medicaid, and the insurance industry. Interview with Charles Kaminski Jr.

by Roberto V. Zicari

“From my perspective, data quality is paramount to an evolving market. When the quality of data improves in a market, both insurance carriers and consumers can make better decisions.”–Charles Kaminski Jr.

I have interviewed Charles Kaminski Jr., Sr. Architect at LexisNexis Risk Solutions. Main topics of the interview are the technological challenges the insurance industry is currently facing, fraud detection, and how to effectively use predictive analytics.

RVZ

Q1. What is your role at LexisNexis Risk Solutions?

Charles Kaminski Jr.: I am a Sr. Architect at LexisNexis Risk Solutions. I’ve worked for LexisNexis Risk Solutions for about 7 years. My primary responsibility is international expansion for the Insurance vertical. I also work on enterprise initiatives, new technologies, new product development, patents & intellectual property, and acquisitions. From time to time I work with RELX sister companies when they need help. The RELX Group is our parent company.

Q2. How is the life insurance industry evolving?

Charles Kaminski Jr.: My view is somewhat specific to the international markets I serve. From my perspective, data quality is paramount to an evolving market. When the quality of data improves in a market, both insurance carriers and consumers can make better decisions. As that happens, the vast majority of consumers and other players in that market benefit. This isn’t limited to the life insurance industry, but I see it happening there as well.

Q3. What are, in your opinion, the main technological challenges the insurance industry is currently facing?

Charles Kaminski Jr.: Each market around the globe tends to have its own nuances that don’t apply to any other market. An entity in one market (such as a bank, an aggregator, or a software house) may play a different role or no role at all in another market. Regulations, government involvement, and industry support also vary greatly. I see this in auto, life, and health verticals. These factors create different challenges from one market to the next. But there are a few themes that seem to exist regardless of market.

Insurance carriers around the globe tend to utilize a healthy mixture of old and new technologies. The technology leaders in this industry are generally more risk averse when compared to other, less regulated, industries. Also, workflows on the carrier side can be very complex. The primary technological challenge to new product development is understanding customer and vendor technology roadmaps and the implied assumptions in those roadmaps. Understanding the entities in a market as well as their roadmaps is key to being successful.

Q4. Cross-industry fraud is defined by a fraud case where the perpetrator’s activity touches multiple industries and organizations, habitually exploiting system gaps. Is using data and analytics the solution to fraud detection?

Charles Kaminski Jr.: A product person might better answer if using data and analytics is “the” solution to fraud detection. I can tell you it is a very effective solution. Big data can cross boundaries and tell unique stories like no other tool. Companies that reign supreme in crossing those boundaries are the ones that have the technical capabilities to analyze big data with ease and the creative people to ask questions no one else is thinking to ask. One interesting story I can relay here is from work others at LexisNexis have done. It comes from someone I’ve shared a stage with a number of times, so I’m very familiar with the story.

LexisNexis Risk Solutions was asked to help a US state agency identify potential Medicaid fraud. Medicaid fraud is big business with lots of money changing hands. For any state agency with limited resources, it’s never a question of finding enough fraud to prosecute. It’s always a question of finding the big fish to fry.

This US state agency in question could only share the addresses of people using Medicaid and nothing more.
Just a list of addresses is not much to go on. But with the right tools, it’s a good start: Why is someone at one address registering a number of really expensive cars? Why is someone at another address registering a rather expensive boat?
Why is someone at yet another address, who owns a Medicaid processing business and is buying multiple multi-million-dollar condos, on Medicaid at all?

Some of these will no doubt be coincidence and I’m oversimplifying this by not mentioning some additional and rather complex analysis. I’m sure you get the idea though. Ultimately you have an interesting list of addresses scored and ordered in terms of where you might want to take a closer look. But that’s not where this story ends. That scored and ordered list is just where this story starts to get interesting.

With a big-data system geared towards analytics, we can take that list and overlay relationship data on top of it.
You can build relationship data from all kinds of sources — who’s married or ever been married to whom, previous neighbors who lived near each other, jointly-registered assets, various public records from business dealings, etc.
When we overlay who knows who, multiple circles start to form. People who don’t know each other are in these circles and at the center of many of these circles (connecting them together) are people who weren’t in the original address list.
Those folks in the center of those circles are the big fish to take a closer look at. Many of these people in the center are the generals recruiting lieutenants to commit the fraud for them. These generals do this so they can stay below the radar.
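The overlay step described above can be illustrated with a small sketch. It is not LexisNexis's method: the names, the relationship pairs and the `find_connectors` helper are all hypothetical, and the real analysis is described as considerably more complex. The sketch only shows the idea of surfacing people who are absent from the flagged list yet linked to several people on it.

```python
from collections import defaultdict

def find_connectors(flagged, relationships, min_links=2):
    """Return people outside the flagged list who are linked to several flagged people."""
    flagged = set(flagged)
    links = defaultdict(set)
    for a, b in relationships:
        # Treat each pair as an undirected "knows" link and keep it
        # only when exactly one end is on the flagged list.
        if a in flagged and b not in flagged:
            links[b].add(a)
        elif b in flagged and a not in flagged:
            links[a].add(b)
    ranked = [
        (person, sorted(known))
        for person, known in links.items()
        if len(known) >= min_links
    ]
    # Most-connected first; ties broken by name for stable output.
    ranked.sort(key=lambda item: (-len(item[1]), item[0]))
    return ranked

# Hypothetical data: four flagged people and who knows whom.
flagged_people = ["Ann", "Bob", "Cid", "Dee"]
relationship_edges = [
    ("Ann", "Gus"), ("Bob", "Gus"), ("Cid", "Gus"),
    ("Dee", "Hal"), ("Cid", "Hal"),
    ("Ann", "Ivy"),
]
connectors = find_connectors(flagged_people, relationship_edges)
print(connectors)
# [('Gus', ['Ann', 'Bob', 'Cid']), ('Hal', ['Cid', 'Dee'])]
```

In this toy data "Gus" never appeared in the original list but sits at the center of three flagged people, which is the pattern the story calls the big fish.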

That’s the interesting part of this story. It’s a story of how big data and analytics can take you from just a list of addresses to some big fish in the center of a fraud ring.

Q5. Drew Whitmore, Senior Director, Insurance Global Alliances, LexisNexis® Risk Solutions, said: “Insurance carriers need innovative core policy and claims management solutions integrated with industry-leading data and analytics to meet their business objectives and deliver on promises of exceptional customer experience,” Why do you believe that a single point of entry to these data and analytic solutions is the best option for insurers’ technology resources and workflow processes?

Charles Kaminski Jr.: Insurance workflows can be very complex. Products that support these workflows can have complex interfaces. To a technologist, success with a single-point-of-entry strategy is very clear.
Success is when we release a new product but 90% of the single-point-of-entry-interface doesn’t change.
Further still, success is when the technologist on the other side, the employee of the customer, knows exactly what is going on with the new product. Success is when a technologist on the other end of the interface says, “I get what LexisNexis is doing with this.” That technologist also benefits when he or she needs to discuss the new product with legal departments or internal auditors because those groups will already be familiar with the interface.

Q6. What is the LexisNexis Risk Solutions telematics data and analytics platform? And how is it used in the Insurance industry?

Charles Kaminski Jr.: The telematics platform is a horizontally scalable, high performance, big data and analytics platform. It and the associated data is used by carriers who want to understand driving behavior as well as a number of other attributes associated with a policy. Because the platform is format agnostic, carriers have quite a bit of flexibility to use our solutions or bring their own to the table.

I was part of the original team bringing telematics solutions to market. We considered a number of different problems to solve, prototypes, and solutions in those early days. We went through a number of iterations before settling on our first telematics solution. That initial product enabled telematics for carriers by using a consumer’s smart phone, an OBD2 dongle, and LN’s scalable data analytics systems to store and analyze the data. A dedicated telematics team continues to expand our telematics offerings. I’m no longer involved day-to-day.

Q7. According to a Gartner report* referencing its 2015 CIO Study, “eighty-seven percent of CIOs agree that there is a shift to predictive analytics from reporting in their organizations, and 79% believe that the greatest value and insight will come from active experimentation informed by data rather than the passive analysis of data.” What is your take on this?

Charles Kaminski Jr.: Big data and predictive analytics are powerful tools that have transformed a number of industries. For insurance, they are a must. But these tools are now being adopted by a number of other industries and they are sometimes misapplied. There are a number of cautionary case studies in business news where these capabilities were brought into an organization with high cost and high expectations but the investment provided negative returns. Wikibon is reporting that most enterprises expect a return of $3.50 per dollar spent on big data systems but that the actual return to date is more like $0.55 per dollar spent.

My take on this is twofold. First, if you are looking to bring big data and predictive analytics in house, then spend some time choosing the right first business case with a low cost and a low bar to success. This gives you greater flexibility to find scarce resources around big data and predictive modeling, prove out your technology, and fine tune your assumptions. Also, be sure the resources you engage with have experience getting positive returns using big data and analytics.
Second, if you are an executive looking to drive improvements with these tools and you do not currently have a predictive analytics engine, then consider broader trends first. Twenty years ago business goals were being managed through results.
Since then there has been a shift towards driving business and organizational improvements using lead measures and lead indicators. This doesn’t necessarily mean predictive analytics. These lead measures and lead indicators can be developed and iterated over quickly without big-data and complex analytics. They can then be used to drive improvements across an enterprise. This can be done before tools such as big data and predictive modeling are introduced.
There are people and firms that can help businesses get started immediately with comparatively low costs.

————————
Charles Kaminski is a Sr. Architect for LexisNexis Risk Solutions. Charles was part of the team that open-sourced the LexisNexis big data platform, HPCC Systems, which is the backbone of LexisNexis Risk Solutions. He now focuses on global markets and international expansion for the company’s Insurance business. Charles has worked for NASA in their Solar System Exploration Division, Accenture’s Financial Services vertical, and was an entrepreneur before joining LexisNexis Risk Solutions. Charles lives outside of Atlanta with his wife and children.

————————

Resources

*Gartner, ‘Market Trends: Targeting Global Life and P&C Insurers in 2015,’ 23 April 2015, Derry N. Finkeldey

– LexisNexis Risk Solutions Elevates Insurance Customer Experience with New Active Risk Management Solution. 3/1/2016

– LexisNexis Risk Solutions Expands Relationship with Duck Creek Technologies

– Big Data Revolution: What farmers, doctors and insurance agents teach us about discovering big data patterns. Authors: Rob Thomas, Patrick McSharry

– Introduction to HPCC (High-Performance Computing Cluster). Authors: Anthony M. Middleton, Ph.D., LexisNexis Risk Solutions, and Arjuna Chala, Sr. Director Operations, LexisNexis Risk Solutions. ODBMS.org, FEBRUARY 19, 2016

– 2016 HPCC Systems Engineering Summit – Community Day

Related Posts

– MarkLogic Case Study: Hannover Re

– Ethical Risk Assessment of Automated Decision Making Systems, By Steven Finlay, Head of Analytics at HML. ODBMS.org FEBRUARY 23, 2015

Follow us on Twitter: @odbmsorg

##

Oct 11 16

How the 11.5 million Panama Papers were analysed. Interview with Mar Cabra

by Roberto V. Zicari

“The best way to explore all The Panama Papers data was using graph database technology, because it’s all relationships, people connected to each other or people connected to companies.” –Mar Cabra.

I have interviewed Mar Cabra, head of the Data & Research Unit of the International Consortium of Investigative Journalists (ICIJ). Main subject of the interview is how the 11.5 million Panama Papers were analysed.

RVZ

Q1. What is the mission of the International Consortium of Investigative Journalists (ICIJ)?

Mar Cabra: Founded in 1997, the ICIJ is a global network of more than 190 independent journalists in more than 65 countries who collaborate on breaking big investigative stories of global social interest.

Q2. What is your role at ICIJ?

Mar Cabra: I am the Editor at the Data and Research Unit – the desk at the ICIJ that deals with data, analysis and processing, as well as supporting the technology we use for our projects.

Q3. The Panama Papers investigation was based on a 2.6 Terabyte trove of data obtained by Süddeutsche Zeitung and shared with ICIJ and a network of more than 100 media organisations. What was your role in this data investigation?

Mar Cabra: I co-ordinated the work of the team of developers and journalists that first got the leak from Süddeutsche Zeitung, then processed it to make it available online through secure platforms to more than 370 journalists.
I also supervised the data analysis that my team did to enhance and focus the stories. My team was also in charge of the interactive product that we produced for the publication stage of The Panama Papers, so we built an interactive visual application called the ‘Powerplayers’ where we detailed the main stories of the politicians with connections to the offshore world. We also released a game explaining how the offshore world works! Finally, in early May, we updated the offshore database with information about the Panama Papers companies, the 200,000-plus companies connected with Mossack Fonseca.

Q4. The leaked dataset consists of 11.5 million files from Panamanian law firm Mossack Fonseca. How was all this data analyzed?

Mar Cabra: We relied on Open Source technology and processes that we had worked on in previous projects to process the data. We used Apache Tika to process the documents and also to access them, and created a processing chain of 30 to 40 machines in Amazon Web Services which would process in parallel those documents, then index them onto a document search platform that could be used by 100s of journalists from anywhere in the world.
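As a rough illustration of the parallel step, the sketch below fans a batch of documents out to worker threads and collects the extracted text by document id. It is a single-machine stand-in under stated assumptions: `extract_text` is a hypothetical placeholder that only decodes bytes (the interview names Apache Tika for the real extraction), and ICIJ's actual chain ran across 30 to 40 machines rather than threads in one process.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_text(document):
    """Hypothetical stand-in for a real extractor; it only decodes bytes."""
    doc_id, raw = document
    return doc_id, raw.decode("utf-8", errors="replace")

def process_in_parallel(documents, workers=4):
    """Extract text from many documents concurrently, keyed by document id."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so the dict is deterministic.
        return dict(pool.map(extract_text, documents))

# Hypothetical sample input: (document id, raw bytes) pairs.
sample_documents = [
    ("doc-1", b"Certificate of incorporation"),
    ("doc-2", b"Email about a shareholder"),
]
extracted = process_in_parallel(sample_documents)
print(extracted["doc-1"])  # Certificate of incorporation
```

The resulting mapping is what would then be handed to an indexing step so that the text becomes searchable.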

Q5. Why did you decide to use a graph-based approach for that?

Mar Cabra: Inside the 11.5 million files in the original dataset given to us, there were more than 3 million that came from Mossack Fonseca’s internal database, which basically contained names of companies in offshore jurisdictions and the people behind them. In other words, that’s a graph! The best way to explore all The Panama Papers data was using graph database technology, because it’s all relationships, people connected to each other or people connected to companies.
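To make "that’s a graph" concrete, here is a minimal in-memory sketch of people connected to companies, using plain Python rather than a graph database. The records and the `build_graph` and `connected_people` helpers are hypothetical; they only show why person-to-company rows lend themselves to relationship queries.

```python
from collections import defaultdict

def build_graph(records):
    """Build an undirected graph from (person, company) records."""
    graph = defaultdict(set)
    for person, company in records:
        graph[person].add(company)
        graph[company].add(person)
    return graph

def connected_people(graph, person):
    """People who share at least one company with `person`."""
    found = set()
    for company in graph[person]:
        found |= graph[company]
    found.discard(person)
    return sorted(found)

# Hypothetical records of the kind an internal company database might hold.
records = [
    ("Person A", "Company X"),
    ("Person B", "Company X"),
    ("Person B", "Company Y"),
    ("Person C", "Company Y"),
]
graph = build_graph(records)
print(connected_people(graph, "Person B"))  # ['Person A', 'Person C']
```

A graph database answers the same kind of question without custom code and at a scale this sketch does not attempt.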

Q6. What were the main technical challenges you encountered in analysing such a large dataset?

Mar Cabra: We had already used all the tools that we were using in this investigation, in previous projects. The main issue here was dealing with many more files in many more formats. So the main challenge was how can we make readable all those files, which in many cases were images, in a fast way.
Our next problem was how could we make them understandable to journalists that are not tech savvy. Again, that’s where a graph database became very handy, because you don’t need to be a data scientist to work with a graph representation of a dataset, you just see dots on a screen, nodes, and then just click on them and find the connections – like that, very easily, and without having to hand-code or build queries. I should say you can build queries if you want using Cypher, but you don’t have to.

Q7. What are the similarities with the way you analysed data in the Swiss Leaks story (exposing the fraudulent activity of 100,000 HSBC private bank clients in Switzerland)?

Mar Cabra: We used the same tools for that – a document search platform and a graph database and we used them in combination to find stories. The baseline was the same but the complexity was 100 times more for the Panama Papers. So the technology is the same in principle, but because we were dealing with many more documents, much more complex data, in many more formats, we had to make a lot of improvements in the tools so they really worked for this project. For example, we had to improve the document search platform with a batch search feature, where journalists would upload a list of names and then get back a list of links to the documents in which any of those names had a hit.
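The batch search idea can be sketched in a few lines. This is an assumption-laden toy: the document links and texts are invented, `batch_search` is a hypothetical helper, and it uses simple case-insensitive substring matching, whereas the real platform was a full document search index.

```python
def batch_search(names, documents):
    """Map each name to the links of documents whose text mentions it."""
    hits = {}
    for name in names:
        needle = name.lower()
        hits[name] = [
            link for link, text in documents.items()
            if needle in text.lower()
        ]
    return hits

# Hypothetical indexed documents: link -> extracted text.
documents = {
    "/docs/1": "Letter to John Smith regarding a transfer",
    "/docs/2": "Invoice for Acme Holdings",
    "/docs/3": "Agreement between john smith and Acme Holdings",
}
names = ["John Smith", "Acme Holdings", "Jane Roe"]
results = batch_search(names, documents)
print(results["John Smith"])  # ['/docs/1', '/docs/3']
```

Names with no hits come back with an empty list, so a journalist can see at a glance which entries on the uploaded list produced nothing.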

Q8. Emil Eifrem, CEO, Neo Technology wrote: “If the Panama Papers leak had happened ten years ago, no story would have been written because no one else would have had the technology and skillset to make sense of such a massive dataset at this scale.” What is your take on this?

Mar Cabra: We would have done the Panama Papers differently, probably printing the documents – and that would have had a tremendous effect on the paper supplies of the world, because printing out all 11.5 million files would have been crazy! We would have published some stories and the public might have seen some names on the front page of a few newspapers, but the scale and the depth and the understanding of this complex world would not have been possible without access to the technology we have today. We would just have not been able to do such an in-depth investigation at a global scale without the technology we have access to now.

Q9. Whistleblowers take incredible risks to help you tell data stories. Why do they do it?

Mar Cabra: Occasionally, some whistleblowers have a grudge and are motivated in more personal terms. Many have been what we call in Spanish ‘widows of power’: people who have been in power and have lost it, and those who wish to expose the competition or have a grudge. Motivations of Whistleblowers vary, but I think there is always an intention to expose injustice. ‘John Doe’ is the source behind the Panama Papers, and a few weeks after we published, he explained his motivation; he wanted to expose an unjust system.

————————–
Mar Cabra is the head of ICIJ’s Data & Research Unit, which produces the organization’s key data work and also develops tools for better collaborative investigative journalism. She has been an ICIJ staff member since 2011, and is also a member of the network.

Mar fell in love with data while being a Fulbright scholar and fellow at the Stabile Center for Investigative Journalism at Columbia University in 2009/2010. Since then, she’s promoted data journalism in her native Spain, co-creating the first ever master’s degree on investigative reporting, data journalism and visualisation and the national data journalism conference, which gathers more than 500 people every year.

She previously worked in television (BBC, CNN+ and laSexta Noticias) and her work has been featured in the International Herald Tribune, The Huffington Post, PBS, El País, El Mundo and El Confidencial, among others.
In 2012 she received the Spanish Larra Award for the country’s most promising journalist under 30.

Resources

– Panama Papers Source Offers Documents To Governments, Hints At More To Come. International Consortium of Investigative Journalists. May 6, 2016

The Panama Papers. ICIJ

– The two journalists from Sueddeutsche Zeitung: Frederik Obermaier and Bastian Obermayer

– Offshore Leaks Database: Released in June 2013, the Offshore Leaks Database is a simple search box.

Open Source used for analysing the #PanamaPapers:

– Oxwall: We found an open source social network tool called Oxwall that we tweaked to our advantage. We basically created a private social network for our reporters.

– Apache Tika and Tesseract to do optical character recognition (OCR),

– We created a small program ourselves which we called Extract which is actually in our GitHub account that allowed us to do this parallel processing. Extract would get a file and try to see if it could recognize the content. If it couldn’t recognize the content, then we would do OCR and then send it to our document searching platform, which was Apache Solr.

– Based on Apache Solr, we created an index, and then we used Project Blacklight, another open source tool that was originally used for libraries, as our front-end tool. For example, Columbia University Library, where I studied, used this tool.

– Linkurious: Linkurious is software that allows you to visualize graphs very easily. You get a license, you put it in your server, and if you have a database in Neo4j you just plug it in and within hours you have the system set up. It also has this private system where our reporters can login or logout.

– Thanks to another open source tool – in this case Talend – and extractions from a load tool, we were able to easily transform our database into Neo4j, plug in Linkurious and get reporters to search.

Neo4j: Neo4j is a highly scalable, native graph database purpose-built to leverage not only data but also its relationships. Neo4j’s native graph storage and processing engine deliver constant, real-time performance, helping enterprises build intelligent applications to meet today’s evolving data challenges.

– The good thing about Linkurious is that the reporters or the developers at the other end of the spectrum can also make highly technical Cypher queries if they want to start looking more in depth at the data.

##

Sep 23 16

On Silos, Data Integration and Data Security. Interview with David Gorbet

by Roberto V. Zicari

“Data integration isn’t just about moving data from one place to another. It’s about building an actionable, operational view on data that comes from multiple sources so you can integrate the combined data into your operations rather than just looking at it later as you would in a typical warehouse project.” — David Gorbet.

I have interviewed David Gorbet, Senior Vice President, Engineering at MarkLogic. We cover several topics in the interview: Silos, Data integration, data quality, security and the new features of MarkLogic 9.

RVZ

Q1. Data integration is the number one challenge for many organisations. Why?

David Gorbet: There are three ways to look at that question. First, why do organizations have so many data silos? Second, what’s the motivation to integrate these silos, and third, why is this so hard?

Our Product EVP, Joe Pasqua, did an excellent presentation on the first question at this year’s MarkLogic World. The spoiler is that silos are a natural and inevitable result of an organization’s success. As companies become more successful, they start to grow. As they grow, they need to partition in order to scale. To function, these partitions need to run somewhat autonomously, which inevitably creates silos.
Another way silos enter the picture is what I call “application accretion” or less charitably, “crusty application buildup.” Companies merge, and now they have two HR systems. Divisions acquire special-purpose applications and now they have data that exists only in those applications. IT projects are successful and now need to add capabilities, but it’s easier to bolt them on and move data back and forth than to design them into an existing IT system.

Two years ago I proposed a data-centric view of the world versus an application-centric view. If you think about it, most organizations have a relatively small number of “things” that they care deeply about, but a very large number of “activities” they do with these “things.”
For example, most organizations have customers, but customer-related activities happen all across the organization.
Sales is selling to them. Marketing is messaging to them. Support is helping solve their problems. Finance is billing them. And so on… All these activities are designed to be independent because they take place in organizational silos, and the data silos just reflect that. But the data is all about customers, and each of these activities would benefit greatly from information generated by and maintained in the other silos. Imagine if Marketing could know what customers use the product for to tailor the message, or if Sales knew that the customer was having an issue with the product and was engaged with Support? Sometimes dealing with large organizations feels like dealing with a crazy person with multiple personalities. Organizations that can integrate this data can give their customers a much better, saner experience.

And it’s not just customers. Maybe it’s trades for a financial institution, or chemical compounds for a pharmaceutical company, or adverse events for a life sciences company, or “entities of interest” for an intelligence or police organization. Getting a true, 360-degree view of these things can make a huge difference for these organizations.
In some cases, like with one customer I spoke about in my most recent MarkLogic World keynote who looks at the environment of potentially at-risk children, it can literally mean the difference between life and death.

So why is this so hard? Because most technologies require you to create data models that can accommodate everything you need to know about all of your data in advance, before you can even start the data integration project. They also require you to know the types of queries you’re going to do on that data so you can design efficient schemas and indexing schemes.
This is true even of some NoSQL technologies that require you to figure out sharding and compound indexing schemes in advance of loading your data. As I demonstrated in that keynote I mentioned, even if you have a relatively small set of entities that are quite simple, this is incredibly hard to do.
Usually it’s so hard that instead organizations decide to do a subset of the integration to solve a specific need or answer a specific question. Sadly, this tends to create yet another silo.

Q2. Integrate data from silos: how is it possible?

David Gorbet: Data integration isn’t just about moving data from one place to another. It’s about building an actionable, operational view on data that comes from multiple sources so you can integrate the combined data into your operations rather than just looking at it later as you would in a typical warehouse project.

How do you do that? You build an operational data hub that can consume data from multiple sources and expose APIs on that data so that downstream consumers, either applications or other systems, can consume it in real time. To do this you need an infrastructure that can accommodate the variability across silos naturally, without a lot of up-front data modeling, and without each silo having a ripple effect on all the others.
For the engineers out there (like me), think of this as trying to turn an O(n²) problem into an O(n) problem.
As the number of silos increases, most projects get exponentially more complex, since you can only have one schema and every new silo impacts that schema, which is shared by all data across all existing silos. You want a technology where adding a new data silo does not require re-doing all the work you’ve already done. In addition, you need a flexible technology that allows a flexible data model that can adapt to change. Change in both what data is used and in how it’s used. A system that can evolve with the evolving needs of the business.
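One way to make the O(n²) versus O(n) remark concrete is to count integration work under a simplifying assumption: when every silo can affect every other silo, there is one unit of work per pair of silos, whereas with a hub each silo only needs one mapping into the hub. The function names below are illustrative, and the counting model is an assumption rather than something stated in the interview.

```python
def pairwise_mappings(n_silos: int) -> int:
    """One mapping per pair of silos: n * (n - 1) / 2, which grows as O(n^2)."""
    return n_silos * (n_silos - 1) // 2


def hub_mappings(n_silos: int) -> int:
    """One mapping from each silo into a shared hub, which grows as O(n)."""
    return n_silos


for n in (2, 5, 10, 50):
    print(f"{n:>3} silos: pairwise={pairwise_mappings(n):>5}  hub={hub_mappings(n):>3}")
```

Under this model, going from 10 to 50 silos raises the pairwise count from 45 to 1225, while the hub count only goes from 10 to 50.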

MarkLogic can do this because it can ingest data with multiple different schemas and index and query it together.
You don’t have to create one schema that can accommodate all your data. Our built-in application services allow our customers to build APIs that expose the data directly from their data hub, and with ACID transactions these APIs can be used to build real operational applications.

Q3. What is the problem with traditional solutions like relational databases and Extract, Transform and Load (ETL) tools?

David Gorbet: To use a metaphor, most technology used for this type of project is like concrete. Now concrete is incredibly versatile. You can make anything you want out of concrete: a bench, a statue, a building, a bridge… But once you’ve made it, you’d better like it because if you want to change it you have to get out the jackhammer.

Many projects that use these tools start out with lofty goals, and they spend a lot of time upfront modeling data and designing schemas. Very quickly they realize that they are not going to be able to make that magical data model that can accommodate everything and be efficiently queried. They start to cut corners to make their problem more tractable, or they design flexible but overly generic models like tall thin tables that are inefficient to query. Every corner they cut limits the types of applications they can then build on the resulting integrated data, and inevitably they end up needing some data they left behind, or needing to execute a query they hadn’t planned (and built an index) for.

Usually at some point they decide to change the model from a hub-and-spoke data integration model to a point-to-point model, because point-to-point integrations are much easier. That, or it evolves as new requirements emerge, and it becomes impossible to keep up by jackhammering the system and starting over. But this just pushes the complexity out of these now point-to-point flows and into the overall system architecture. It also causes huge governance problems, since data now flows in lots of directions and is transformed in many ways that are generally pretty opaque and hard to trace. The inability to capture and query metadata about these data flows causes master-data problems and governance problems, to the point where some organizations genuinely have no idea where potentially sensitive data is being used. The overall system complexity also makes it hard to scale and expensive to operate.

Q4. What are the typical challenges of handling both structured, and unstructured data?

David Gorbet: It’s hard enough to integrate structured data from multiple silos. Everything I’ve already talked about applies even if you have purely structured data. But when some of your data is unstructured, or has a complex, variable structure, it’s much harder. A lot of data has a mix of structured data and unstructured text. Medical records, journal articles, contracts, emails, tweets, specifications, product catalogs, etc. The traditional solution to textual data in a relational world is to put it in an opaque BLOB or CLOB, and then surface its content via a search technology that can crawl the data and build indexes on it. This approach suffers from several problems.

First, it involves stitching together multiple different technologies, each of which has its own operational and governance characteristics. They don’t scale the same way. They don’t have the same security model (unless they have no security model, which is actually pretty common). They don’t have the same availability characteristics or disaster recovery model.
They don’t back up consistently with each other. The indexes are separate, so they can’t be queried together, and keeping them in sync so that they’re consistent is difficult or impossible.

Second, more and more text is being mined for structure. There are technologies that can identify people, places, things, events, etc. in freeform text and structure it. Sentiment analysis is being done to add metadata to text. So it’s no longer accurate to think of text as islands of unstructured data inside a structured record. It’s more like text and structure are inter-mixed at all levels of granularity. The resulting structure is by its nature fluid, and therefore incompatible with the up-front modeling required by relational technology.

Third, search engines don’t index structure unless you tell them to, which essentially involves explaining the “schema” of the text to them so that they can build facets and provide structured search capabilities. So even in your “unstructured” technology, you’re often dealing with schema design.

Finally, as powerful as it is, search technology doesn’t know anything about the semantics of the data. Semantic search enables a much richer search and discovery experience. Look for example at the info box to the right of your Google results. This is provided by Google’s knowledge graph, a graph of data using Semantic Web technologies. If you want to provide this kind of experience, where the system can understand concepts and expand or narrow the context of the search accordingly, you need yet another technology to manage the knowledge graph.

Two years ago at my MarkLogic World keynote I said that search is the query language for unstructured data, so if you have a mix of structured and unstructured data, you need to be able to search and query together. MarkLogic lets you mix structured and unstructured search, as well as semantic search, all in one query, resolved in one technology.

Q5. An important aspect when analysing data is Data Quality. How do you evaluate if the data is of good or of bad quality?

David Gorbet: Data quality is tough, particularly when you’re bringing data together from multiple silos. Traditional technologies require you to transform the data from one schema into another in order to move it from place to place. Every transformation leaves some data behind, and every one has the potential to be a point of data loss or data corruption if the transformation isn’t perfect. In addition, the lineage of the data is often lost. Where did this attribute of this entity come from? When was it extracted? What was the transform that was run on it? What did it look like before?
All of this is lost in the ETL process. The best way to ensure data quality is to always bring along with each record the original, untransformed data, as well as metadata tracing its provenance, lineage and context.
MarkLogic lets you do this, because our flexible schema accommodates source data, canonicalized (transformed) data, and metadata all in the same record, and all of it is queryable together. So if you find a bug in your transform, it’s easy to query for all impacted records, and because you have the source data there, you can easily fix it as well.
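The idea of carrying source data, transformed data and provenance metadata in one record can be sketched with plain Python dictionaries. This is not MarkLogic code: the record layout (`source`, `canonical`, `meta`), the `transform_version` field and both transform functions are assumptions chosen for illustration.

```python
from datetime import datetime, timezone


def normalize_v1(source: dict) -> dict:
    # Buggy transform: forgets to strip surrounding whitespace.
    return {"name": source["full_name"].title()}


def normalize_v2(source: dict) -> dict:
    # Fixed transform.
    return {"name": source["full_name"].strip().title()}


def make_envelope(source: dict, transform, version: str, origin: str) -> dict:
    """Keep the untouched source, the transformed view, and provenance together."""
    return {
        "source": source,
        "canonical": transform(source),
        "meta": {
            "origin": origin,
            "transform_version": version,
            "loaded_at": datetime.now(timezone.utc).isoformat(),
        },
    }


def reprocess(records: list, bad_version: str, transform, new_version: str) -> int:
    """Find records produced by a faulty transform and rebuild them from source."""
    fixed = 0
    for rec in records:
        if rec["meta"]["transform_version"] == bad_version:
            rec["canonical"] = transform(rec["source"])
            rec["meta"]["transform_version"] = new_version
            fixed += 1
    return fixed


records = [
    make_envelope({"full_name": "  ada lovelace "}, normalize_v1, "v1", "crm"),
    make_envelope({"full_name": "alan turing"}, normalize_v2, "v2", "billing"),
]

# A bug was found in v1: query for the impacted records and repair them in place.
fixed = reprocess(records, "v1", normalize_v2, "v2")
```

Because the original `source` travels with every record, the repair needs no second trip to the upstream silo.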

In addition, our Bitemporal feature can trace changes to a record over time, and let you query your data as it is, as it was, or as you thought it was at any given point in time or over any historical (or in some cases future) time range. So you have traceability when your data changes, and you can understand how and why it has changed.
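The distinction between data "as it was" and "as you thought it was" comes from tracking two time axes: when a fact was true in the real world, and when the database recorded it. The sketch below is a simplified model of that idea, not MarkLogic's Bitemporal API; the `Version` class, the sample history and the rule that the most recently recorded matching version wins are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

FOREVER = date.max


@dataclass(frozen=True)
class Version:
    value: str
    valid_from: date   # when the fact became true in the real world
    valid_to: date     # when it stopped being true (exclusive)
    recorded_on: date  # when the database learned about it


history: List[Version] = [
    # Recorded in January 2016: the customer lives in Boston from 2015 onward.
    Version("Boston", date(2015, 1, 1), FOREVER, date(2016, 1, 10)),
    # Recorded in June 2016: a correction, the customer moved to Austin on 1 March 2016.
    Version("Boston", date(2015, 1, 1), date(2016, 3, 1), date(2016, 6, 1)),
    Version("Austin", date(2016, 3, 1), FOREVER, date(2016, 6, 1)),
]


def as_of(versions: List[Version], valid_on: date, known_on: date) -> Optional[str]:
    """Value that held on `valid_on`, according to what was recorded by `known_on`."""
    candidates = [
        v for v in versions
        if v.recorded_on <= known_on and v.valid_from <= valid_on < v.valid_to
    ]
    if not candidates:
        return None
    # Simplification: the most recently recorded matching version wins.
    return max(candidates, key=lambda v: v.recorded_on).value


# Where did the customer live on 1 April 2016?
believed_in_may = as_of(history, date(2016, 4, 1), known_on=date(2016, 5, 1))
known_in_july = as_of(history, date(2016, 4, 1), known_on=date(2016, 7, 1))
```

Here `believed_in_may` is "Boston" (what the system thought at the time) and `known_in_july` is "Austin" (the corrected view), and both answers remain queryable.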

Q6. Data leakage is another problem for many corporations that experienced high profile security incidents. What can be done to solve this problem?

David Gorbet: Security is another important aspect of data governance. And security isn’t just about locking all your data in a vault and only letting some people look at it. Security is more granular than that. There are some data that can be seen by just about anyone in your organization. Some that should only be seen by people who need it, and some that should be hidden from all but people with specific roles. In some cases, even users with a particular role should not see data unless they have a provable need in addition to the role required. This is called “compartment security,” meaning you have to be in a certain compartment to see data, regardless of your role or clearance overall.
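The difference between role-based and compartment-based checks can be sketched as a two-part test: the reader needs the required role, and must also hold every compartment attached to the data. The class and field names below are hypothetical and chosen only to illustrate the rule as described above.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class User:
    name: str
    roles: FrozenSet[str]
    compartments: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Document:
    title: str
    required_role: str
    compartments: FrozenSet[str] = frozenset()


def can_read(user: User, doc: Document) -> bool:
    """The role is necessary, and every compartment on the document must also be held."""
    has_role = doc.required_role in user.roles
    in_compartments = doc.compartments <= user.compartments  # subset test
    return has_role and in_compartments


analyst = User("analyst", frozenset({"analyst"}))
cleared = User("cleared", frozenset({"analyst"}), frozenset({"case-42"}))

report = Document("Quarterly report", "analyst")
case_file = Document("Case file", "analyst", frozenset({"case-42"}))
```

Both users can read `report`, but only `cleared` can read `case_file`, even though the two hold the same role.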

There is a principle in security called “defense in depth.” Basically it means pushing the security to the lowest layer possible in the stack. That’s why it’s critically important that your DBMS have strong and granular security features.
This is especially true if you’re integrating data from silos, each of which may have its own security rules.
You need your integrated data hub to be able to observe and enforce those rules, regardless of how complex they are.

Increasingly the concern is over the so-called “insider threat.” This is the employee, contractor, vendor, managed service provider, or cloud provider who has access to your infrastructure. Another good reason not to implement security in your application, because if you do, any DBA will be able to circumvent it. Today, with the move to cloud and other outsourced infrastructure, organizations are also concerned about what’s on the file system. Even if you secure your data at the DBMS layer, a system administrator with file system access can still get at it. To counter this, more organizations are requiring “at rest” encryption of data, which means that the data is encrypted on the file system. A good implementation will require a separate role to manage encryption keys, different from the DBA or SA roles, along with a separate key management technology. In our implementation, MarkLogic never even sees the database encryption keys, relying instead on a separate key management system (KMS) to unlock data for us. This separation of concerns is a lot more secure, because it would require insiders to collude across functions and organizations to steal data. You can even keep your data in the cloud and your keys on-premises, or with another managed service provider.

Q7. What is new in the MarkLogic® 9 database?

David Gorbet: There’s so much in MarkLogic 9 it’s hard to cover all of it. That presentation I referenced earlier from Joe does a pretty good job of summarizing the features. Many of the features in MarkLogic 9 are designed to make data integration even easier. MarkLogic 9 has new ways of modeling data that can keep it in its flexible document form, but project it into tabular form for more traditional analysis (aggregates, group-bys, joins, etc.) using either SQL or a NoSQL API we call the Optic API. This allows you to define the structured parts of your data and let MarkLogic index it in a way that makes it most efficient to query and aggregate.
You can also use this technique to extract RDF triples from your data, giving you easy access to the full power of Semantics technologies.
We’re doing more to make it easier to get data into MarkLogic via a new data movement SDK that you can hook directly up to your data pipeline. This SDK can help orchestrate transformations and parallel loads of data no matter where it comes from.
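The idea of keeping data in flexible document form while projecting its structured parts into rows for aggregates and group-bys can be illustrated in a few lines of Python. This is a conceptual sketch only, not the Optic API or MarkLogic's SQL support: the `VIEW` mapping, the dotted-path helper and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Documents with varying shapes; only some fields are shared.
orders: List[dict] = [
    {"id": 1, "customer": {"name": "Acme", "region": "EU"}, "total": 120.0, "notes": "rush"},
    {"id": 2, "customer": {"name": "Globex", "region": "US"}, "total": 80.0},
    {"id": 3, "customer": {"name": "Initech", "region": "EU"}, "total": 50.0, "tags": ["new"]},
]

# Hypothetical view definition: column name -> dotted path into the document.
VIEW: Dict[str, str] = {"order_id": "id", "region": "customer.region", "total": "total"}


def get_path(doc: Any, path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def project(docs: List[dict], view: Dict[str, str]) -> List[dict]:
    """Project each document into a flat row containing only the view's columns."""
    return [{col: get_path(doc, path) for col, path in view.items()} for doc in docs]


def sum_by(rows: List[dict], key: str, value: str) -> Dict[Any, float]:
    """A minimal group-by with a sum aggregate."""
    totals: Dict[Any, float] = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


rows = project(orders, VIEW)
totals_by_region = sum_by(rows, "region", "total")
```

The documents keep their extra fields (`notes`, `tags`); the projection simply ignores whatever the view does not name.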

We’re also doubling down on security. Earlier I mentioned encryption at rest. That’s a new feature for MarkLogic 9.
We’re also doing sub-record-level role- and compartment-based access control. This means that if you have a record (like a customer record) that you want to make broadly available, but there is some data in that record (like a SSN) that you want to restrict access to, you can easily do that. You can also obfuscate and transform data within a record to redact it for export or for use in a context that is less secure than MarkLogic.
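Sub-record access control and redaction, as described above, can be sketched as two small functions: one that hides protected fields from readers who lack the required role, and one that masks a value for export. The policy table, role name and masking format are hypothetical, and the sketch says nothing about how MarkLogic implements either feature.

```python
from typing import Dict, Set

# Hypothetical policy: field name -> role required to see it.
FIELD_ROLES: Dict[str, str] = {"ssn": "pii-reader"}


def view_for(record: dict, roles: Set[str]) -> dict:
    """Return a copy of the record without fields the caller's roles do not permit."""
    return {
        key: value
        for key, value in record.items()
        if key not in FIELD_ROLES or FIELD_ROLES[key] in roles
    }


def redact_for_export(record: dict) -> dict:
    """Return a copy with the SSN masked except for its last four digits."""
    out = dict(record)
    if "ssn" in out:
        out["ssn"] = "***-**-" + out["ssn"][-4:]
    return out


customer = {"name": "Pat Example", "email": "pat@example.com", "ssn": "123-45-6789"}

support_view = view_for(customer, {"support"})
privileged_view = view_for(customer, {"support", "pii-reader"})
export_copy = redact_for_export(customer)
```

The record itself stays broadly available; only the restricted field is withheld or masked, and the original `customer` dictionary is never modified.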

Security is a governance feature, and we’re improving other governance features as well, with policy-based tiering for lifecycle management, and improvements to our Bitemporal feature that make it a full-fledged compliance feature.
We’re introducing new tools to help monitor and manage multiple clusters at a time. And we’re making many other improvements in many other areas, like our new geospatial region index that makes region-region queries much faster, improvements to tools like Query Console and MLCP, and many, many more.

One exciting feature that is a bit hard to understand at first is our new Entity Services feature. You can think of this as a catalog of entities. You can put whatever you want in this catalog. Entity attributes, relationships, etc. but also policies, governance rules, and other entity class metadata. This is a queryable semantic model, so you can query your catalog at runtime in your application. We’ll also be providing tools that use this catalog to help build the right set of indexes, indexing templates, APIs, etc. for your specific data. Over time, Entity Services will become the foundation of our vision of the “smart database.” You’ll hear us start talking a lot more about that soon.

—————–

David Gorbet, Senior Vice President, Engineering, MarkLogic.

David Gorbet has the best job in the world. As SVP of Engineering, David manages the team that delivers the MarkLogic product and supports our customers as they use it to power their amazing applications. Working with all those smart, talented engineers as they pour their passion into our product is a humbling experience, and seeing the creativity and vision of our customers and how they’re using our product to change their industry is simply awesome.

Prior to MarkLogic, David helped pioneer Microsoft’s business online services strategy by founding and leading the SharePoint Online team. In addition to SharePoint Online, David has held a number of positions at Microsoft and elsewhere with a number of enterprise server products and applications, and numerous incubation products.

David holds a Bachelor of Applied Science Degree in Systems Design Engineering with an additional major in Psychology from the University of Waterloo, and an MBA from the University of Washington Foster School of Business.

Resources

Join the Early Access program for a MarkLogic 9 introduction by visiting: ea.marklogic.com

– The MarkLogic Developer License is free to all who sign up and join the MarkLogic developer community.

Related Posts

– On Data Governance. Interview with David Saul. ODBMS Industry Watch, 2016-07-23

– On Data Interoperability. Interview with Julie Lockner. ODBMS Industry Watch, 2016-06-07

– On Data Analytics and the Enterprise. Interview with Narendra Mulani. ODBMS Industry Watch, 2016-05-24

Follow us on Twitter: @odbmsorg

##