
"Trends and Information on AI, Big Data, Data Science, New Data Management Technologies, and Innovation."

This is the Industry Watch blog. To see the complete ODBMS.org
website with useful articles, downloads and industry information, please click here.

Jan 17 24

On The Future of Vector Databases. Interview with Charles Xie

by Roberto V. Zicari

“Open source is reshaping the technological landscape, and this holds particularly true for AI applications. As we progress into AI, we will witness the proliferation of open-source systems, from large language models to advanced AI algorithms and improved database systems.”

Q1. What is your definition of a Vector Database?

Charles Xie: A vector database is a cutting-edge data infrastructure designed to manage unstructured data. When we refer to unstructured data, we specifically mean content like images, videos, and natural language. Using deep learning algorithms, this data can be transformed into a novel form that encapsulates its semantic representation. These representations, commonly known as vector embeddings or vectors, signify the semantic essence of the data. Once these vector embeddings are generated, we store them within a vector database, empowering us to perform semantic queries on the data. This capability is potent because, unlike traditional keyword-based searches, it allows us to delve into the semantics of unstructured data, such as images, videos, and textual content, offering a more nuanced and contextually rich search experience.
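The pipeline described here (embed, store, query by meaning) can be sketched in a few lines of Python. This is a minimal illustration, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model; in production the embeddings would live in a vector database such as Milvus rather than in a NumPy array:

```python
# Minimal sketch of the embed / store / query loop described above.
# Assumes the sentence-transformers package; a production system would
# keep doc_vecs in a vector database such as Milvus, not a NumPy array.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# The "unstructured data": a few documents to index.
docs = [
    "A golden retriever catching a frisbee in the park.",
    "Quarterly revenue grew 12% on strong cloud demand.",
    "How to braise short ribs low and slow.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)  # vector embeddings

# A semantic query: no keyword overlap with the matching document.
query_vec = model.encode(["a dog playing outside"], normalize_embeddings=True)

# Cosine similarity (dot product of unit vectors), then take the best hit.
scores = doc_vecs @ query_vec.T
print(docs[int(np.argmax(scores))])  # -> the golden retriever sentence
```

Note that the query shares no keywords with the matching document; the match is purely semantic, which is the point Xie makes about moving beyond keyword search.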

Q2. Currently, there are a multitude of vector databases on the market. Why do they come in so many versions?

Charles Xie: When examining vector database systems, disparities emerge. Some, like Chroma, adopt an embedded-system approach akin to SQLite, offering simplicity but lacking essential functionality such as scalability. Conversely, systems like PG Vector and Pinecone pursue a scale-up approach, excelling in single-node instances but limited in how far they can scale.

As a seasoned database engineer with over two decades of experience, I stress the complexity inherent in database systems. A systematic approach is vital when assessing these systems, encompassing components like storage layers, storage formats, data orchestration layers, query optimizers, and execution engines. Considering the rise of heterogeneous architectures, the latter must be adaptable across diverse hardware, from modern CPUs to GPUs.

From its inception, Milvus has embraced heterogeneous computing, efficiently running on various modern processors, including Intel, AMD, and ARM CPUs and Nvidia GPUs. The integration extends to supporting AI processors for vector processing. The challenge lies in tailoring algorithms and execution engines to each processor's characteristics, ensuring optimal performance. Scalability, inevitable as data grows, is a crucial consideration addressed by Milvus, which supports both scale-up and scale-out scenarios.

As vector databases gain prominence, their appeal to vendors stems from their potential to reshape data management. For adopters, transitioning to a vector database necessitates evaluating how critical it is to business functions and anticipating data volume growth. Milvus stands out on both counts, offering consistent, optimal performance for mission-critical services and remarkable cost-effectiveness as data scales.

Q3. In your opinion when does it make sense to transition to a pure vector database? And when not?

Charles Xie: Now, let’s delve into the considerations for transitioning to a pure vector database. It’s crucial to clarify that a pure vector database isn’t merely a traditional database with a vector plugin; it’s a purposefully designed solution for handling vector embeddings.

There are two key factors to weigh. Firstly, assess whether vector computing and similarity search are critical to your business. For instance, if you’re constructing a RAG solution integral to millions of users daily and forming the core of your business, the performance of vector computing becomes paramount. In such a situation, opting for a pure vector database system is advisable. It ensures consistent, optimal performance that aligns with your SLA requirements, especially for mission-critical services where performance is non-negotiable. Choosing a vector database system guarantees a robust foundation, shielding you from unforeseen surprises in your regular database services.

The second crucial consideration is the inevitable increase in data volume over time. As your service runs for an extended period, the likelihood of accumulating larger datasets grows. With the continuous expansion of data, cost optimization becomes an inevitable concern. Most pure vector database systems on the market, including Milvus, deliver superior performance while requiring fewer resources, making them highly cost-effective.

As your data volume escalates, optimizing costs becomes a priority. It's common to observe that the bills for vector database services grow substantially with the expanding dataset. In this context, Milvus stands out, showcasing over 100 times more cost-effectiveness than alternatives such as PG Vector, OpenSearch, and other non-native vector database solutions. The cost-effectiveness of Milvus becomes increasingly advantageous as your data scales, making it a strategic choice for sustainable and efficient operations.

Q4. What is the initial feedback from users of Vector Databases?

Charles Xie: Reflecting on our beginnings six years ago, we focused primarily on catering to enterprise users. At the time, we engaged with numerous users involved in recommendation systems, e-commerce, and image recognition. Collaborations with traditional AI companies working on natural language processing, especially when dealing with substantial datasets, provided valuable insights.

The predominant feedback we received emphasized the enterprise sector's specific needs. These users, being enterprises, possessed extensive datasets and a cadre of proficient developers. They stressed deploying a highly available and performant vector database system in a production environment, a requirement often seen in large enterprises where AI was gaining traction.

It’s important to note that independent AI developers were not as prevalent during that period. AI, being predominantly in the hands of hyper-scalers and large enterprises, meant that the cost of developing AI algorithms and applications was considerably high. Around six years ago, hyper-scalers and large enterprises were the primary users of vector database systems, given their capacity to afford dedicated teams of AI developers and engineers. This context laid the foundation for our initial focus and direction.

In the last two years, we've witnessed a remarkable shift in the landscape of AI, marked by the breakthrough of modern AI, particularly the prominence of large language models. Notably, there has been a significant surge in independent AI developers, with the majority comprising teams of fewer than five individuals. This starkly contrasts with the scenario six years ago, when the AI development scene was dominated by large enterprises capable of assembling teams of tens of engineers, often including a cadre of computer science PhDs, to drive AI application development.

The transformation is striking—what was once the exclusive realm of well-funded enterprises can now be undertaken by small teams or even individual developers. This democratization of AI applications marks a fundamental shift in accessibility and opportunities within the AI space.

Q5. Will semantic search be performed in the future by ChatGPT instead of using vectors and a K-nearest neighbor search?

Charles Xie: Indeed, the foundation models we encounter, such as ChatGPT, and vector databases share a common theoretical underpinning: the embedding vector abstraction. Both ChatGPT and vector database systems leverage embedding vectors to encapsulate the semantic essence of the underlying unstructured data. This shared data abstraction allows them to make sense of the information and perform queries effectively. Across large language models, AI models, and vector database systems, a profound connection exists, rooted in the utilization of the same data abstraction: embedding vectors.

This connection extends further, as they employ identical metrics, primarily relying on distance metrics such as Euclidean or cosine distance. Whether within ChatGPT or other large language models, using consistent metrics facilitates the measurement of similarities among vector embeddings.
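For concreteness, here is how the two metrics he names are computed. The three-dimensional vectors are made up; real embeddings have hundreds or thousands of dimensions, but the formulas are identical:

```python
# The two distance metrics named above, computed directly with NumPy.
import numpy as np

a = np.array([0.1, 0.7, 0.2])
b = np.array([0.3, 0.6, 0.1])

euclidean = np.linalg.norm(a - b)                               # L2 distance
cosine = 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # 1 - cos(theta)

print(f"Euclidean distance: {euclidean:.4f}")
print(f"Cosine distance:    {cosine:.4f}")
```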

Theoretically, a profound connection exists between large language models like ChatGPT and various vector databases, stemming from their shared use of the embedding vector abstraction. The workload division between them becomes apparent: both excel at performing semantic and k-nearest neighbor searches. However, the noteworthy distinction lies in the cost efficiency of these operations.

While large language models and vector databases tackle the same tasks, the cost disparity is significant. Executing semantic search and k-nearest neighbor search in a vector database system proves to be approximately 100 times more cost-effective than carrying out these operations within a large language model. This substantial cost difference prompts many leading AI companies, including OpenAI, to advocate for using vector databases in AI applications for semantic search and k-nearest neighbor search due to their superior cost-effectiveness.

Q6. There seems to be a need from enterprises to have a unified data management system that can support different workloads and different applications. Is this doable in practice? If not, is there a risk of fragmentations of various database offerings?

Charles Xie: No, I don't think so. To illustrate my point, let's consider the automobile industry. Can you envision a world where a single vehicle serves as an SUV, sedan, truck, and school bus all at once? It has not happened in the first 100 years of the automobile industry, and if anything, the industry will be even more diversified in the next 100 years.

It all started with the Model T; from this, we witnessed the birth of a great variety of automobiles commercialized for different purposes. On the road, we see lots of differences between SUVs, trucks, sports cars, and sedans, to name a few. A closer look at all these automobiles reveals that they are specialized and designed for specific situations.

For instance, SUVs and sedans are designed for family use, but their chassis and suspension systems are entirely different. SUVs typically have a higher chassis and a more advanced suspension system, allowing them to navigate obstacles more easily. On the other hand, sedans, designed for urban areas and high-speed driving on highways, have a lower chassis for a more comfortable driving experience. Each design serves a specific goal.

Looking at all these database systems, we see that many design goals contradict each other. It’s challenging, if not impossible, to optimize a design to meet all these diverse requirements. Therefore, the future of database systems lies in developing more purpose-built and specialized ones.

This trend has already been evident over the past 20 years. Initially, we had traditional relational database systems, but over time we witnessed the emergence of big data solutions, the rise of NoSQL databases, the development of time series, graph, and document database systems, and now the ascent of vector database systems.

On the other hand, certain vendors might have an opportunity to provide a unified interface or SDK to access various underlying database systems—from vector databases to traditional relational database systems. There could be a possibility of having a unified interface.

At Milvus, we are actively working on this concept. In the next stage, we aim to develop an SQL-like interface tailored for vector similarity search, incorporating vector database functionality under the same interface as traditional SQL to provide a unified experience.

Q7. What does the future hold for Vector databases?

Charles Xie: Indeed, we are poised to witness an expansion in the functionalities offered by vector database systems. In the past few years, these systems primarily focused on providing a single functionality: approximate nearest neighbor search (ANN search). However, the landscape is evolving, and in the next two years, we will see a broader array of functionalities.

Traditionally, vector databases supported similarity-based search. Now, they are extending their capabilities to include exact search or matching. You can analyze your data through two lenses: a similarity search for a broader understanding and an exact search for detailed insights. By combining these two approaches, users can fine-tune the balance between obtaining a high-level overview and delving into specific details.

Obtaining a sketch of the data might be sufficient for certain situations, and a semantic-based search works well. On the other hand, in situations where minute differences matter, users can zoom in on the data and scrutinize each entry for subtle features.
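A toy sketch of combining the two lenses, with plain NumPy standing in for the database: an exact match on a metadata attribute narrows the candidates, and a similarity search then ranks them. The attribute and data are invented for illustration; a real vector database applies the filter and the ANN index together on the server side:

```python
# Exact filter narrows candidates, similarity search ranks them.
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))          # stored embeddings
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
years = rng.integers(2019, 2025, size=1000)    # exact-match attribute

query = rng.normal(size=64)
query /= np.linalg.norm(query)

candidates = np.flatnonzero(years == 2023)     # the exact lens
sims = vectors[candidates] @ query             # the similarity lens
top5 = candidates[np.argsort(-sims)[:5]]
print("best matches from 2023:", top5)
```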

Vector databases will likely support additional vector computing workloads, such as vector clustering and classification. These functionalities are particularly relevant in applications like fraud detection and anomaly detection, where unsupervised learning techniques can be applied to cluster or classify vector embeddings, identifying common patterns.
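As a rough sketch of that clustering workload, assuming scikit-learn and synthetic embeddings: k-means groups the vectors, and points far from every centroid are flagged as candidate anomalies. The cluster count and threshold are arbitrary choices for illustration:

```python
# Cluster embeddings with k-means, flag far-from-centroid outliers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(500, 32))        # synthetic vector embeddings

km = KMeans(n_clusters=5, n_init=10, random_state=1).fit(embeddings)
dist_to_nearest_centroid = km.transform(embeddings).min(axis=1)

threshold = np.percentile(dist_to_nearest_centroid, 99)
anomalies = np.flatnonzero(dist_to_nearest_centroid > threshold)
print(f"{anomalies.size} potential anomalies out of {len(embeddings)}")
```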

Q8. And how do you believe the market for open source Vector databases will evolve? 

Charles Xie: Open source is reshaping the technological landscape, and this holds particularly true for AI applications. As we progress into AI, we will witness the proliferation of open-source systems, from large language models to advanced AI algorithms and improved database systems. The significance of open source extends beyond mere technological innovation; it exerts a profound impact on our world’s social and economic fabric. In the era of modern AI, with the dominance of large language models, open-source models and open-source vector databases are positioned to emerge victorious, shaping the future of technology and its societal implications.

Q9. In conclusion, are Vector databases transforming the general landscape, not just AI?

Charles Xie: Indeed, vector databases represent a revolutionary technology poised to redefine how humanity perceives and processes data. They are the key to unlocking the vast troves of unstructured data that constitute over 80% of the world’s data. The promise of vector database technology lies in its ability to unleash the hidden value within unstructured data, paving the way for transformative advancements in our understanding and utilization of information.

………………………………………………..

Charles Xie is the founder and CEO of Zilliz, focusing on building next-generation databases and search technologies for AI and LLM applications. At Zilliz, he also invented Milvus, the world's most popular open-source vector database for production-ready AI. He is currently a board member of the LF AI & Data Foundation and served as the board's chairperson in 2020 and 2021. Charles previously worked at Oracle as a founding engineer of the Oracle 12c cloud database project. He holds a master's degree in computer science from the University of Wisconsin-Madison.

Related Posts

On Zilliz Cloud, a Fully Managed AI-Native Vector Database. Q&A with James Luan. ODBMS.org, June 15, 2023

On Vector Databases and Gen AI. Q&A with Frank Liu. ODBMS.org, December 8, 2023


Jan 5 24

On the Future of AI. Interview with Raj Verma

by Roberto V. Zicari

“Five years from now, today's AI systems will look archaic to us, in the same way that computers of the 60s look archaic to us today. What will happen with AI is that it will scale and therefore become simpler, and more intuitive. And if you think about it, scaling AI is the best way to make it more democratic, more accessible.”

Q1. What are the innovations that most surprised you in 2023?

Raj Verma: Generative AI is definitely the talk of the town right now. 2023 marked its breakthrough, and I think the hype around it is well founded. Few people knew what generative AI was before 2023. Now everyone's talking about it and using it. So I was quite impressed by the uptake of this new technology.

But if we go deeper, we have to acknowledge that the rise of AI would not have been possible without significant advancements in how large amounts of data are stored and handled. Data is the core of AI and what is used to train LLMs. Without data, AI is useless. To have powerful generative AI that gives you answers, predictions and content right at the moment you need it, you need real-time data, or data that is fresh, in motion and delivered in a matter of milliseconds. The interpretation and categorization of data are therefore crucial in powering LLMs and AI systems. 

In that sense, you will notice a lot of hype around Specialized Vector Databases (SVDBs), independent systems that you plug into your data architecture, designed to store, index and retrieve vectors, or multidimensional data points. These are popular because LLMs are increasingly relying on vector data. Think of vectors as an image or a text converted into a stored data point. When you prompt an AI system, it will look for similarities in those stored data points, or vectors, to give you an answer. So vectors are really important for AI systems, and businesses often believe that a database focused on just storing and processing vector data is essential for AI systems.

However, you don’t really need SVDBs to power your AI applications. In fact, loads of companies have come to regret their use because, as an independent system, they result in redundant data, excessive data movement, increasing labor and licensing costs and limited query power. 

The solution is to store all your data (structured data, semi-structured data based on JSON, time-series, full-text, spatial, key-value, and vector data) in one database, and within this system to have powerful vector database functionality that you can leverage to conduct vector similarity search.
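As a toy illustration of that single-database idea (not SingleStore's actual SQL; there, the similarity search runs natively inside the engine), the sketch below keeps JSON metadata and vector data side by side in one SQLite table, filters on the structured part in SQL, and ranks by similarity in Python:

```python
# JSON metadata and vector data in one table; hypothetical schema.
# Needs an SQLite build with the JSON1 functions (standard in modern
# Python builds). A multi-model database would do the ranking in SQL.
import json
import sqlite3
import numpy as np

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER, meta TEXT, vec BLOB)")

rng = np.random.default_rng(2)
for i in range(100):
    vec = rng.normal(size=8).astype(np.float32)
    meta = json.dumps({"category": "shoe" if i % 2 else "hat"})
    con.execute("INSERT INTO items VALUES (?, ?, ?)", (i, meta, vec.tobytes()))

# Structured filter in SQL, vector similarity ranking in Python.
query = rng.normal(size=8).astype(np.float32)
rows = con.execute(
    "SELECT id, vec FROM items "
    "WHERE json_extract(meta, '$.category') = 'shoe'"
).fetchall()
best = max(rows, key=lambda r: float(np.frombuffer(r[1], np.float32) @ query))
print("closest shoe:", best[0])
```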

All this to say that I've been impressed by the speed at which we are developing ways to power generative AI. We're experimenting based on its needs and quickly figuring out what works and what doesn't.

Q2. What is real-time data and why is it essential for AI? 

Raj Verma: Real time is about what we experience in the now. It is access to the information you need, at the moment you need it, delivered together with the exact context you need to make the best decision. To experience this now, you need real-time data: data that is fresh and in motion. And with AI, the need for real-time data (fast, updated, accurate data) is becoming more apparent. Because without data, AI is useless. And when AI models are trained on outdated or stale data, you get things like AI bias or hallucinations. So, in order to have AI that is powerful, and that can really help us make better choices, we need real-time data.

With the use of generative AI expanding beyond the tech industry, the need for real-time data is more urgent than ever. This is why it is important to have databases that can handle storage, access and contextualization of information. At SingleStore, our vision is that databases should support both transactional (OLTP) and analytical (OLAP) workloads, so that you can transact without moving data and put it in the right context — all of which can be delivered in millisecond response times. 

Q3. One of the biggest concerns around AI is bias, the idea that existing prejudices in the data used to train AI might creep into its decisions, content and predictions. What can we do to mitigate this risk? 

Raj Verma: I believe humans should always be involved in the training process. With AI, we must be both student and teacher, allowing it to learn from us, and in that way continuously give it input so that it can give us the insight we need. There are many laudable efforts to develop Hybrid Human AI models, which basically incorporate human insight with machine learning. Examples of hybrid AI include systems in which humans monitor AI processes through auditing or verification. Hybrid models can help businesses in several ways. For example, while AI can analyze consumer data and preferences, humans can jump in to guide how it uses that insight to create relevant and engaging content. 

As developers, we must also be very cognizant of where the data used to train LLMs comes from. And in this sense, being transparent about where it comes from helps, because the systems can be held accountable and challenged if biased data does creep into the training process. The important thing here is also to know that an AI system is only as good as the data it is trained on.

Q4. The popularity and accessibility of generative artificial intelligence (gen AI) has made it feel like the future we see in science fiction movies is finally at our doorstep. And those movies have sowed much worry about AI being dangerous. Is this science fiction vision of AI becoming true?

Raj Verma: Don't expect machines to take over the world, at least not any time soon. AI can process and analyze large amounts of data and generate content based on it, at a much faster pace than we humans can. But these systems are still very dependent on human input. The idea that human-like robots will come to rule the world makes for great fiction movies, but it's far from becoming a reality.

That doesn't mean that AI isn't dangerous, and we have a responsibility to discern which threats are real.

AI poses an unprecedented risk in fueling the spread of disinformation because it has the capacity to create authentic-looking content. Distinguishing between content generated by AI and that created by humans will become increasingly challenging. AI can also pose cybersecurity threats. You can trick ChatGPT into writing malicious code, or use other generative AI systems to enhance ransomware. And AI can worsen current malicious trends that have surfaced with social media. I personally worry that AI systems will exploit the attention economy and spur higher levels of social media addiction. This can have terrible consequences for teenagers' mental health. As a father of two, I am deeply concerned about this.

These are the threats that we should worry about. And we humans are capable of mitigating these risks. We should always be involved in AI’s development, audit it and pay special attention to the data that we use to train it. 

Q5. You are quoted as saying that "without data, AI wouldn't exist—but with bad or incorrect data, it can be dangerous." How dangerous can AI be?

Raj Verma: Generative AI is like a superhuman who reads an entire library of thousands of books to answer your question, all in a matter of seconds. If it doesn't have access to that library, and if that library doesn't have the latest books, magazines and newspapers, then it cannot give you the most relevant information you need to make the best decision possible. This is a very simple explanation of why, without data, AI is useless. Now imagine that library is full of outdated books that were written by white supremacists during the Civil War. The information you are going to get from this AI system is going to guide your decisions, and you are going to make some very bad decisions. You are going to make biased decisions, and you're going to perpetuate biases that already exist in society. That's how AI can be dangerous, and that is why we need AI systems to have access to the most updated, accurate data out there.

Q6. Should AI be Regulated? And if yes, what kind of regulation? 

Raj Verma: The issue is, it's hard to regulate something that is still developing. We just don't know what AI will look like, in its entirety, in the future. So we want to avoid regulation hampering the development of this technology. That doesn't mean that there aren't standards that can be applied globally. Data regulation is key, since data is the backbone of AI. Data regulation can be based on the principle of transparency, which is key to generating trust in AI and our ability to hold this technology and its developers accountable should something go wrong. To achieve transparency, you need to know where the data in the AI system is coming from. So, proper documentation of the data used to train LLMs is something we can regulate. You also must be able to explain the reasoning behind an AI system's solutions or decisions. These must be understandable by humans. And there's also transparency in how you present the AI system to users. Do users know that they are talking to an AI robot and not a human? We can regulate data transparency without imposing excessive measures that could hamper AI's development.

Q7. There is no global approach to AI regulation. Several countries are in various stages of evolving their approach to regulating AI. What are the practical consequences of this?

Raj Verma: A global-scale regulation of AI is incredibly challenging. Each country's social values will be reflected in the way it approaches regulating this new technology. The EU has a very strong approach to consumer protection and privacy, which is probably why it authored the first significant widespread attempt to regulate AI in the world. I don't believe we will see such wide-sweeping legislation in the US, a country that values innovation and market dynamics. In the US, we will see a decentralized approach to regulation, with perhaps some specific decrees that seek to regulate AI's use in specific industries, like healthcare or finance.

Many worry that the EU's new AI Act will become another poster child of the Brussels effect, where firms end up adopting the EU's regulation, in the absence of any other, because it saves costs. Yet the Brussels effect might not exactly happen with the AI Act, particularly because firms might want to use different algorithms in the first place. For example, marketing companies will want to use different algorithms for different geographic areas because consumers behave differently depending on where they live. It won't be hard, then, for firms to have their different algorithms comply with different rules in different regions.

All this to say that we should expect different AI regimes around the world. Companies should prepare for that. AI trade friction with Europe is likely to emerge, and private companies will advance their own “responsible AI” initiatives as they face a fragmented global AI regulatory landscape.

Q8. How can we improve the way we gather data to feed LLMs?

Raj Verma: We need to make sure LLMs are up to date. Open-source LLMs that are trained on large, publicly available data are prone to hallucinate because at least part of their data is outdated and probably biased. There are ways to fix this problem, including Retrieval Augmented Generation (RAG), a technique that uses a program to retrieve contextual information from outside the model and immediately feed it to the AI system. Think of it as an open-book test where the AI model, with the help of a program (the book), can look up information specific to the question it is being asked. This is a very cost-effective way of updating LLMs because you don't need to retrain them all the time and can use it in case-specific prompts.

RAG is central to how we at SingleStore are bringing LLMs up to date. To curate data in real time, it needs to be stored as vectors, which SingleStore allows users to do. That way you can join all kinds of data and deliver the specific data you need in a matter of milliseconds.
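A skeletal version of that open-book loop, with placeholder embed() and llm() functions standing in for a real embedding model and a real LLM API; the documents and names are hypothetical:

```python
# Skeletal RAG loop: index, retrieve, then prompt with fresh context.
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: a deterministic pseudo-random unit vector.
    A real embedding model is what makes the retrieval semantic."""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).normal(size=16)
    return v / np.linalg.norm(v)

def llm(prompt: str) -> str:
    """Placeholder for a call to a real large language model."""
    return "[model answers using only the supplied context]\n" + prompt

# 1. Index fresh, case-specific documents as vectors.
docs = ["Q3 outage postmortem ...", "2024 pricing sheet ...", "API changelog ..."]
index = [(doc, embed(doc)) for doc in docs]

# 2. Retrieve the context closest to the question (cosine similarity).
question = "What changed in our pricing this year?"
q = embed(question)
context = max(index, key=lambda pair: float(pair[1] @ q))[0]

# 3. Feed the retrieved context to the model instead of retraining it.
print(llm(f"Context: {context}\n\nQuestion: {question}"))
```

The key property is in step 3: the model receives retrieved context at query time, so the underlying LLM never needs retraining.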

Q9. What is the evolutionary path you think AI will go through? When we look back 5-10 years from now, how will we look at genAI systems like ChatGPT? 

Raj Verma:  Five years from now, today’s AI systems will look archaic to us. In the same way that computers of the 60s look archaic to us today. What will happen with AI is that it will scale and therefore become simpler, and more intuitive. And if you think about it, scaling AI is the best way to make it more democratic, more accessible. That is the challenge we have in front of us, scaling AI, so that it works seamlessly in giving us the exact insight we need to improve our choices. I believe this scaling process should revolve around information, context and choice, what I call the trinity of intelligence. These are the three tenets that differentiate AI from previous groundbreaking technologies. They are also what help us experience the now in a way that we are empowered to make the best choices. Because this is our vision at SingleStore, we focus on developing a multi-generational platform which you can use to transact and reason with data in millisecond response times. We believe this is the way to make AI more powerful because with more precise databases that can deliver information in real time, we can power the AI systems that will really help us make the best choices as humans. 

………………………………………..

Raj Verma is the CEO of SingleStore. 

He brings more than 25 years of global experience in enterprise software and operating at scale. Raj was instrumental in the growth of TIBCO software to over $1 billion in revenue, serving as CMO, EVP Global Sales, and COO. He was also formerly COO at Apttus Software and Hortonworks. Raj earned his bachelor’s degree in Computer Science from BMS College of Engineering in Bangalore, India.

Related Posts

Achieving Scale Through Simplicity + the Future of AI. Raj Verma. October 31, 2023

How will the GenAI/LLM database market evolve in 2024. Q&A with Madhukar Kumar. ODBMS.org, December 9, 2023

On Generative AI. Interview with Maharaj Mukherjee. ODBMS Industry Watch, December 10, 2023

On Generative AI and Databases. Interview with Adam Prout. ODBMS Industry Watch, October 9, 2023


Dec 10 23

On Generative AI. Interview with Maharaj Mukherjee 

by Roberto V. Zicari

“Managing change is one of the dimensions an organization must address to reap the full benefits of Generative AI. It needs to make the change, adaptation, and redeployment easy on its workforce so that people do not feel threatened by this new technology.”

Q1. Generative AI applications like ChatGPT, DALL-E, Stable Diffusion and others are said to be rapidly democratizing the technology in business and society. Is this really happening?

Mukherjee: Democratization in the AI/ML area has been happening for some time. It is a slow evolutionary process. It started with the advent of AutoML tools, whereby model building moved very quickly from the expertise of data scientists to the purview of anyone at the touch of a button. But except for some areas such as face recognition, it had been out of reach for most people. Now, with the coming of generative AI, large foundational models, and large language models, the doors have been opened for the public to experiment with AI in all different ways. But I do not think the ball will stop here, and we are yet to see all the many ways people can make use of these new sets of tools that have suddenly become available to them. These use cases will now drive the development of newer types of research and innovation in the AI field.

Q2. What industries do you believe will be most impacted by LLMs and Generative AI? Why?

Mukherjee: In my humble opinion, the arts and entertainment industry as well as the advertisement industry will be the first adopters of this technology, and that is already happening. Adoption will happen more slowly in areas that require specialized knowledge, such as science, technology, and engineering, and in more heavily regulated industries such as healthcare and pharmaceuticals.

Q3. What kind of infrastructure will be essential for deploying generative AI?

Mukherjee: The current barriers to early adoption for smaller organizations are twofold. One is the availability of specialized hardware such as GPUs; the other is the availability of skilled data scientists and programmers who can make the best-known algorithms work on the best available hardware. These two limitations are keeping these technologies out of reach for most companies except a few very large organizations. But I would think it is a matter of time: with improvements of scale, these technologies will be within the reach of almost every organization.

Q4. How will companies prevent the breach of third-party copyright in using pre-trained foundation models?

Mukherjee: That is a major concern for existing LLM and Gen-AI models. However, many organizations are already in the process of building models based on Retrieval Augmented Generation (RAG), which can take care of copyright violations and other ethical and legal issues. Another way to handle such violations is tagging the generation and retrieval to the original source by keeping a record of all intermediate steps using methodologies such as blockchain.

Q5. What about trust in generative AI? How can you ensure the accuracy of generative AI outputs and maintain user confidence?

Mukherjee: A model is always as good as the results it generates. The problem of errors is not new to AI and ML, and calling them by anthropomorphic terms such as "hallucinations" does not make them any different. Often, errors are introduced as a safety measure, as a bias in the model. As more and more people get used to these fundamental limitations of AI, they will adjust their expectations and find the best ways to use these models.

Q6. LLMs may generate algorithmic bias due to imperfect training data or decisions implicitly or explicitly made by the engineers developing the models. What is your take on this?

Mukherjee: Quality of data has always been an issue with any AI/ML model, and it is not very different in the age of generative AI and LLMs. It is always a case of "garbage in, garbage out." In traditional AI/ML model development, engineers have been culling and engineering data to suit their goals and needs. But often engineering bias seeps in through how the data is selected, and consequently some human biases are introduced into the model. In the realm of Generative AI, the philosophy is slightly different from traditional AI. Since it is built on any and all kinds of data, the initial data bias may not be an issue here. However, we still need to make sure that the output follows our societal norms and principles and is not harmful in general.

Q7. Will change management be critical to implementing generative AI?

Mukherjee: Managing change is one of the dimensions an organization must address to reap the full benefits of Generative AI. It needs to make the change, adaptation, and redeployment easy on its workforce so that people do not feel threatened by this new technology.

Q8. How is AI regulation going to have an impact when it comes to harnessing the opportunities of generative AI?

Mukherjee: As with any human technology, Generative AI needs to conform to our accepted societal principles, norms, and moralities. If the technologists and the industry cannot regulate themselves, there is a risk that regulations may be imposed upon them by outsiders who may not have as much knowledge and understanding of the technology. It is better, therefore, that developers of Gen AI step back, spend some time figuring out how to do that, and impose certain standard checks and balances upon themselves.

Q10. If it were up to you, would you use generative AI in mission-critical applications?

Mukherjee: I would repeat my thoughts from before: Gen AI is not fundamentally different from traditional AI. Any area where people have used traditional AI in the past may consider adopting Generative AI, or at least exploring it as an option.

……………………………………….

Maharaj Mukherjee, Senior Vice President and Senior Architect Lead, Bank of America

A well-recognized expert in cutting-edge technologies, including edge and massively distributed computing, artificial intelligence and machine learning, cognitive deep learning, blockchain, and the Internet of Things. He currently works at Bank of America as Senior Vice President and Senior Architect Lead in the Technology Infrastructure Organization, and previously as Senior AI/ML Architect and SVP in Employee Experience Technology within Bank of America. Before Bank of America, Maharaj Mukherjee worked for twenty years at IBM Research on various leading-edge technologies, including the Shape Processing Engine, Computational Lithography, Design for Manufacturing, Deep and Cognitive Machine Learning, and Watson Internet of Things. Maharaj Mukherjee is an IBM Master Inventor Emeritus and holds 162 US patents and 160 international patents. He was recognized as a top inventor at Bank of America in 2020 and 2021 and received the Platinum Award from the Bank for being one of its top three inventors in 2021. He was also recognized by IBM for "Twenty Patents from the Past Twenty Years" in 2015 and was inducted into IBM's Inventor Wall of Fame in 2011.

He holds a PhD from Rensselaer Polytechnic Institute, an MS from SUNY Stony Brook, and B-Tech (Hons.) from Indian Institute of Technology, Kharagpur.

He currently serves as a member of the Institute of Electrical and Electronics Engineers (IEEE) USA Awards Committee as well as a member of the IEEE Region 1 Awards Committee. He is also the current chair of Central Area of IEEE USA. 

Related Posts

On Generative AI and Databases. Interview with Adam Prout. ODBMS Industry Watch, October 9, 2023

On Generative AI. Interview with Philippe Kahn, ODBMS Industry Watch, June 19, 2023


Oct 9 23

On Generative AI and Databases. Interview with Adam Prout

by Roberto V. Zicari

“With GenAI also requiring massive amounts of training data, greater storage capacity is crucial. Databases are designed to scale as data volumes grow, ensuring generative AI projects can handle larger datasets as they become available. This means databases can help support the growing demand for AI capabilities across the business world.”

Q1. How is Generative AI transforming the way we store, structure, and query data?

Adam Prout: The focus of generative AI is to create new data, such as texts and images. At its core, GenAI is made of neural networks, a subset of machine learning that handles unstructured data like text, audio, images, and videos. These networks consist of connected layers that learn from training data and identify patterns to make new instances. But it's not creating copies of the existing instances in the data set. Instead, these networks develop unique data points based on the training data. Increased computational power and the massive amounts of data produced in recent years have paved the way for generative AI.

Due to advancements in GenAI, many organizations are exploring the ways that the technology can increase efficiencies in their operations. For example, generative AI can help data analysts find hidden patterns in data sets, deriving actionable insights faster than a human could. In other instances, data augmentation helps organizations generate more data to train neural networks. Models like generative adversarial networks (GANs) can learn the distribution of original data, augment it, and create synthetic data to diversify training datasets for machine learning models. Likewise, content creation is a significant use case for generative AI as organizations can create reports, summaries, and other deliverables using proprietary data at a rapid speed. 

As for querying data, we can ask questions of our data in natural language, which is more efficient than writing an SQL query or doing a full-text search. As a result of GenAI, more data is being stored as vector embeddings in databases and looked up via Approximate Nearest Neighbor (ANN) vector searches.

There are many more ways that generative AI helps organizations better leverage their existing data while generating original instances. We’ll continue to discover how generative AI can transform the way we store, structure, and query data for years to come. 

Q2. Generative AI relies on large amounts of data to generate human-like answers. Among the challenges faced by generative AI are Data Quality and Quantity. How can a database help here?

Adam Prout: Databases provide a structured framework for data storage, allowing organizations to implement routine data quality checks and validation rules to ensure models are only trained on high-quality information. Another advantage of using a database is the consistent maintenance of data through cleansing and enrichment tools. These processes remove inconsistencies, duplicates, and errors from the data, leading to better model training and improved generative AI outputs. 

With GenAI also requiring massive amounts of training data, greater storage capacity is crucial. Databases are designed to scale as data volumes grow, ensuring generative AI projects can handle larger datasets as they become available. This means databases can help support the growing demand for AI capabilities across the business world.

Q3. Unlike traditional AI workloads that require additional specialized skills, new Generative AI workloads are available to a larger segment of the developer community. What does it mean in practice?

Adam Prout: This is great news for the practice. More software developers are able to leverage generative AI tools to increase efficiency and solve simple, clearly defined problems. And with a growing number of advanced AI code-generation tools on the market, developers can experiment with these technologies to create artificial data and test their code. 

It’s no surprise that developers will play a key role in the GenAI revolution. Their expertise and skill sets are vital to improving the performance of AI and machine learning models. They’ll be able to successfully pivot to focusing on AI development as the need for AI/ML skills skyrockets. 

Q4. Generative AI: How to Choose the Optimal Database?

Adam Prout: When selecting the right database for AI and machine learning models, organizations need to take into account several considerations:

  • Speed of data processing: The ability to handle large volumes of data while processing information quickly can help organizations gain real-time insights to drive decision-making. This is especially true when working with streaming data or developing applications that require quick response times, such as fraud detection or recommendation systems. A database built on a distributed architecture and an in-memory data store can enable data processing at lightning-fast speed, helping organizations make fast and informed decisions.
  • Vector search: The way vector searches handle high-dimensional data and provide advanced search and similarity capabilities helps organizations simplify data management processes. A vector search categorizes data based on multiple features, allowing organizations to store and search high-dimensional vectors efficiently. This capability helps organizations build more accurate and effective machine learning models by narrowing comprehensive datasets down to the most relevant results.
  • Scalability and integration: As AI requires more computing power and training data, selecting a database becomes even more important to help organizations build out their capabilities. Massive AI projects need a database that can handle complex queries at scale while helping extract and transform data to train AI/ML platforms. A highly scalable database can help companies meet increasing demands for AI-powered workloads. General purpose databases are flexible enough to handle a wide swath of data.  
  • Real-time analytics capabilities: Databases with built-in analytics capabilities can help organizations quickly identify trends and patterns in their data to make more informed and instantaneous decisions. The ability to run analytical queries paired with transactional ones in the same database system, known as hybrid transactional/analytical processing (HTAP), can eliminate the need for separate systems to complete tasks — simplifying the data architecture and reducing costs. This also offers greater flexibility as organizations look to adopt more AI capabilities into their operations. 

Q5. Are NoSQL databases better suited for Generative AI than SQL databases?

Adam Prout: NoSQL and SQL databases each have their own strengths and weaknesses, and which one works best for Generative AI depends on what your project needs. NoSQL databases inherently come with more flexibility when it comes to handling unstructured or semi-structured data, which can be beneficial for certain types of data used in Generative AI – think text, images, and sensor data. As for SQL databases, they provide powerful query capabilities, enabling IT leaders to perform complex data retrieval and analysis. 

To put it simply, many GenAI projects use a combination of both types of databases, leveraging the strengths of each. When choosing which database to utilize, it’s critical to evaluate the needs and constraints of your project. 

Q6. Some SQL databases do have some features that make them compatible with Generative AI, such as supporting JSON data and functions. Are they suited for Generative AI?

Adam Prout: SQL databases that support features, like JSON, can be well-suited for certain aspects of Generative AI, largely when dealing with flexible or semi-structured data formats. Some benefits these features provide are JSON support, schema flexibility, data integration, complex querying, and scalability. 

However, depending on the nature of one's data (the volume and the complexity), a combination of SQL and NoSQL databases may also be a suitable solution. There isn't a "one-size-fits-all" approach, and to ensure you're best aligning with your project's needs and constraints, it's important to evaluate the end goal you want to achieve with this particular project.

Q7. Are databases with vector support the bridge between LLMs and enterprise gen AI apps? Why?

Adam Prout: Databases that include vector support can most definitely play a crucial role when it comes to bridging the gap between LLMs and enterprise Generative AI applications for many reasons: 

  1. Easier storage and retrieval of embeddings: LLMs, like ChatGPT, generate word embeddings, or vector representations of text data. A database with vector support is designed not only to store these embeddings efficiently but also to retrieve them, making them easier to manage and query.
  2. Quick and accurate similarity searches: Vector searches reign supreme when it comes to performing similarity searches, and in the context of Generative AI, this is very valuable, as it enables applications to find similar documents or content quickly.
  3. Scalability: Scalability is crucial for enterprise applications that need to process vast amounts of data, especially as LLMs continue to produce substantial volumes of vector data. Vector search systems are purpose-built to efficiently manage large-scale vector data, making them a vital component in handling such demands.
  4. Real-time applications: Various enterprise Generative AI applications, like chatbots, sentiment analysis, and content generation, require real-time processing. Vector search enables real-time retrieval and analysis of vector data, providing the responsiveness these applications need.

Q8. Will vector databases be the essential infrastructure in bringing about the societal and economic changes promised by AI?

Adam Prout: Firstly, I want to clarify my thoughts on the term “vector database.” To SingleStore, vector search is a capability of a database, not a new category of database. That being said, databases that support vector indexing are suited for storing and querying high-dimensional vectors, meaning that they are well-equipped for tasks related to machine learning, recommendation systems, natural language processing, and more. 

So, will vector searches be the essential infrastructure in bringing about the societal and economic changes promised by AI? They most definitely play a significant role; however, it's important to understand that they are just one piece of a very large puzzle that includes algorithms, hardware, ethical considerations, and much more. Whether or not they become "the essential infrastructure" depends on various factors, such as the specific applications and use cases of AI. In addition, good results from GenAI prompts often require more than a vector search; often there is a need for more traditional filters on other attributes of the data and the like.

Q9. Who is already using Generative AI in the enterprise world?

Adam Prout: A recent report explored how companies are utilizing generative AI: 46% for content generation, 43% for developing analytics insight summaries, 32% for analytics insight generation, 32% for code development, and 27% for process documentation. On top of this, most companies are curious about AI but don't use it as part of their everyday process, with a 53% majority saying they are "exploring" or "experimenting" with the tech.

All of this is to say that the use of generative AI in the enterprise landscape continues to evolve rapidly – whether that be organizations fully implementing the tech in their day-to-day operations, or employees utilizing it to complete specific tasks. 

Q10. SingleStoreDB has evolved over the past 10 years from its early days as MemSQL (in-memory OLTP) to become a more general purpose distributed SQL Database. How do you manage AI and Generative AI?

Adam Prout: When we first founded MemSQL, people were saying SQL couldn't scale; we knew that wasn't true. We knew we could build something scalable, but similar enough to a traditional, single-host database that customers wouldn't have to learn a whole new system.

That took us to real-time analytics and, from there, to a general purpose database. We expanded to a broad set of workloads and analytics, with performance similar to or even better than specialized systems. We're giving customers the flexibility that comes with general purpose databases, as well.

As for AI, SingleStore has supported basic exact-match vector search capabilities for many years, and we are adding improved vector indexes for ANN search over larger data sets. We believe vector search combined with general purpose SQL databases capable of filtering, full-text search, JSON and the like is crucial to unlocking the most value from GenAI.
……………………………………………..


Adam Prout, CTO and Co-Founder, SingleStore


Adam Prout is the CTO at SingleStore and oversees product architecture and development. He joined SingleStore in 2011 as a co-founding engineer. Previously, Adam led engineering efforts on kernel development for Microsoft SQL Server. He holds bachelor's degrees in Computer Science and Mathematics and a master's degree in Mathematics from the University of Waterloo.

Related Posts

On Generative AI. Interview with Philippe Kahn, ODBMS Industry Watch, June 19, 2023

On Generative AI. Q&A with Bill Franks, ODBMS.org, June 26, 2023


Jun 19 23

On Generative AI. Interview with Philippe Kahn

by Roberto V. Zicari

“AI will neither save nor doom the world. It's people that will.”

Q1. OpenAI CEO Sam Altman says AI will reshape society and acknowledges risks: 'A little bit scared of this' (*). What is your take on this?

Philippe: At Fullpower-AI, we build a domain-specific generative AI platform, so I can offer an insider perspective. AI platforms like the ones we build are among the world's most rapidly developing technologies. They aren't just generating text, images, videos, and sounds; they are creating a combination of excitement and anxiety among people and governments across the globe. It's important to stay ahead of the curve.

Here are some advantages of Generative AI:

  1. Assistance and Automation: ChatGPT, Bard, and similar models can provide valuable assistance and automation in various tasks. They can answer questions, provide recommendations, assist with research, and even automate certain processes, saving time and effort.
  2. One-on-one personalization for learning, researching, and automating. 
  3. Creativity: Generative AI can create new and original content, including text, images, and music. It can develop ideas and solutions that some humans might not have considered.
  4. Democratization of knowledge and content: Generative AI can make complex information and technologies more accessible to a broader audience. It can simplify and explain complex concepts in more understandable terms, allowing people to engage with information that might otherwise be challenging to understand.

Here are some disadvantages of generative AI: 

  1. Bias and Misinformation: Generative AI models like ChatGPT can inadvertently reflect and amplify biases present in the training data. The model’s outputs may also contain similar biases or misinformation if the training data contains biased or inaccurate information. And, of course, there is a potential for a compounding effect. 
  2. Potential nonsense generation: Although AI models can generate coherent responses, they often lack a true understanding of the content. They rely heavily on patterns, brute-force data manipulation, and statistical correlations in the data rather than true comprehension, which can lead to incorrect or nonsensical answers with a potentially compounding effect.
  3. Ethical Concerns: There are ethical concerns surrounding generative AI, particularly when it comes to deep fakes and the potential for malicious use. These models can be misused to create convincing fake content, significantly affecting opinion, attitude, policy, privacy, security, and trust. It's the old perverse adage: "The bigger the lie, the more will believe it."
  4. Overreliance and Dependency: As generative AI becomes more prevalent, there is a risk of overreliance and dependency on these systems. People might rely too heavily on AI-generated content without critically evaluating its accuracy or considering alternative perspectives, leading to intellectual laziness and bigotry. 
  5. Unintended Consequences:  It’s easy with these systems to produce “credible” spam, phishing, or propaganda with very negative societal impacts.

Recognizing and addressing these negatives is important to ensure the responsible development and deployment of generative AI technologies. We are fully aware of that at Fullpower-AI and think about it continuously. 

Q2. What do you think is the social impact of ChatGPT-4?

Philippe: As noted in your prior question, ChatGPT, Bard, and others already have both profoundly positive and negative impacts.

For example, more advanced generative AI could increase automation in various industries and job sectors. While this can improve efficiency and productivity, it may also result in job displacement or changes in the job market, requiring individuals to adapt to new roles and acquire new skills.

It’s important to remember that these are all speculative impacts based on the general trajectory of AI advancement. The specific social impact of the AI systems would depend on various factors, including design, deployment, and the actions taken by developers, policymakers, and society to shape its use.  

AI will neither save nor doom the world. It’s people that will. 

Q3. Do you use ChatGPT-4? What do you think about it?

Philippe: At Fullpower-AI, we build domain-specific generative AI systems targeting sleep management, breathing anomalies, skincare, industrial automation, and more. As general-purpose systems, ChatGPT and Bard have proven their usefulness. It's important to remember the safeguards discussed in question 1.


Q4. In the interview above, it is mentioned that "GPT-4 is just one step toward OpenAI's goal to eventually build Artificial General Intelligence, which is when AI crosses a powerful threshold which could be described as AI systems that are generally smarter than humans." Is this science fiction, or is it something that may happen?

Philippe: Building Artificial General Intelligence is challenging, and it remains controversial. This is way past the Turing test, because human behavior and intelligent behavior are not the same thing.

Regarding feasibility, opinions vary on the potential for achieving AGI. I believe we will achieve 90% of AGI in the next decade. The last 10% may take a very long time. 

It is important to approach AGI development cautiously, ensuring responsible and ethical practices are in place to address potential risks and consequences. Continued research, collaboration, and discussions are necessary to advance our understanding of AGI and its societal implications.

Qx Anything else you wish to add here?

Philippe: While there is no definitive answer, it is crucial to consider the potential risks and take precautions to ensure AGI’s safe development and deployment. To mitigate risks, we advocate for developing AGI with safety and ethics in mind. Part of the challenge is the geopolitical competition. We must ensure that nations on the fringe don’t exploit loopholes. Personally, I believe in progress and that AI technology can have a deep positive impact on our future. 

Resources:

(*) Source: ABC News. OpenAI CEO, CTO on risks and how AI will reshape society, March 16, 2023.

………………………………………

Philippe Kahn is a highly successful serial entrepreneur who founded a number of leading companies, including Fullpower-AI, LightSurf, Starfish Technologies, and Borland.

Feb 6 23

On Innovation. A Conversation with Philippe Kahn

by Roberto V. Zicari

“I always think about a graduate class called ‘Invent.’ Innovation has to be based more on spark than process.”

I asked Philippe Kahn ten questions on innovation back in February 2006. Here is a new revision…

RVZ

Q1. What is Innovation for you?

Philippe Kahn: Innovation is a key success ingredient for science, business, and personal growth. It is all about bringing something new: New ideas, new devices, and new methods. 

Q2. What pivotal role did your parents play in your personal development?

Philippe Kahn: Yes, I grew up with a single Mom, my hero. Here Wikipedia speaks for itself: Clair Monis.

Q3. Besides a master’s in mathematics, you also received a master’s in musicology composition and classical flute performance. Did music influence your career as an entrepreneur? How?

Philippe Kahn: Playing music is part of my daily practice. My Mom, a concert violinist, would make me practice 30 minutes before going to school. This has become a daily life discipline like meditation. I play both Jazz and classical music daily. 

Q4. You are credited with creating the first camera phone. You had a vision, but this did not materialize at that time. What is the main lesson you learned from this?

Philippe Kahn: Pioneering visions never materialize instantly. We created the first working prototype in 1997, launched it in Japan toward the end of 1999, and then in the US in 2002. In 2007 Steve Jobs and Apple launched the iPhone, and the market grew. Here is a helpful link.

Q5. You are also credited with being a pioneer in wearable technology. This developed into Fullpower-AI: AI-modeled biosensing algorithms and embedded AI Machine Learning solutions and generative AI for Synthetic trial augmentation. What obstacles did you have to overcome, to make this vision reality? Who helped you to make it a reality?

Philippe Kahn: Our team at Fullpower-AI created the first iteration of our IoT biosensing platform. We thought that the first application was for wearables. We built complete solutions for Nike and launched Nike Running solutions. We also licensed our technology to Jawbone in 2011. It is all internal development: Device, sensing, firmware, security, cloud, and actionable insights. Now we are focused on digital transformation with our IoT/AIoT biosensing platform. Our goal is to help transform sleep, cosmetics, wellness, and medicine by leveraging our platform. 

Q6. What do you consider are the current most promising innovations that will have an impact in the near future?

Philippe Kahn: We all know of the impact of generative AI on creating content such as text, music, and graphics. It’s helpful to many, but there may be a few hints of plagiarism. I think that generative AI could help people develop better writing, musical, and graphic skills. However, the most promising applications of AI are in wellness, health, and medicine. We may finally make significant progress in tackling challenges such as Alzheimer’s, cancer, etc. All this is possible because of the combination of IoT, AIoT, biosensing, and deep learning. 

Q7. In 2006 you mentioned that Vision, Leadership, and Perseverance were in your opinion the top 3 criteria for successful Innovation. Did you change your mind in the meanwhile?

Philippe Kahn: Yes, Vision, Leadership, and Perseverance are key. Let’s sprinkle in a bit of luck too. With Fullpower-AI and our IoT/AIoT platform, we were early in 2010; now we look like an “overnight success!”

Q8. What is a culture that supports and sustains Innovation?

Philippe Kahn: No matter the size of the organization, visionary leadership is key. It’s necessary but sometimes not sufficient. Augmenting teams with the best talent is also key, along with setting up non-invasive, disciplined processes. 

Q9. What should be taught in universities to help Innovation that is currently missing in your opinion?

Philippe Kahn: I always think about a graduate class called “Invent.” Innovation has to be based more on spark than process. 

Q10. You and your wife Sonia run the Lee-Kahn Foundation. Tell us a bit about it.

Philippe Kahn: Yes, we like to focus on the environment, in particular wildlife, animal welfare, and conservation. Our founding vision is utopian, yet something we can get behind. It reads like this: “May our children and our children’s children enjoy better health and be able to hear the howl of a Wolf Pack in the wild, experience the magic of Dolphins playing with the ocean waves, drink pure water from every stream…”

………………………………………….

Philippe Kahn is a highly successful serial entrepreneur who founded a number of leading companies, including Fullpower-AI, LightSurf, Starfish Technologies, and Borland.

Resources:

On Innovation.  Archive of interviews (2006-now)

Jan 13 23

On Cloud Database Management Systems. Interview with Rahul Pathak.

by Roberto V. Zicari

IT teams no longer want to be consumed by undifferentiated heavy lifting so that they can focus on strategic business goals and innovation. This is very liberating, and we believe that this is a major growth driver. 

Q1: In your opinion what is the status of the database market today and in the next years to come? 

Rahul: The broader database market trend is more of a question for analysts. Our unwavering focus is to continue innovating on behalf of customers to make advanced database features more approachable while reducing the costs and complexities of maintaining databases. IT teams no longer want to be consumed by undifferentiated heavy lifting so that they can focus on strategic business goals and innovation. This is very liberating, and we believe that this is a major growth driver. 

Q2: You just wrapped up re:Invent 2022. Is re:Invent the high point of the year in terms of your database announcements? 

Rahul: re:Invent is always an exciting and energizing event. That said, we actually release new innovations throughout the year, when they are ready. For example, we released some big innovations earlier in 2022, like Amazon Aurora Serverless v2, Amazon RDS Multi-AZ with two readable standbys, and a whole lot more. We also have announcements at re:Invent in addition to providing attendees a hands-on learning experience of our services. 

Q3: Can you share some details on these more notable launches prior to re:Invent?  

Rahul: Absolutely. We launched Amazon Aurora Serverless v2 (ASv2), which provides customers the ability to instantly scale up and down in fine-grained increments based on their application’s needs. ASv2 is particularly useful for spiky, intermittent, or unpredictable workloads. Manually managing database capacity can take up valuable time and can lead to inefficient use of database resources. With ASv2, customers only pay on a per-second basis for the database capacity they use when the database is active. ASv2 has become the fastest adopted feature in the history of Aurora. Customers, like Liberty Mutual, S&P Global, and AltPlus, have used ASv2 to reduce their costs while achieving improved database performance. 
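
To make the capacity model concrete, here is a minimal sketch in Python with boto3; the cluster and instance identifiers, credentials, and capacity bounds are hypothetical, and the real parameter set is larger (consult the Aurora documentation before running anything like this):

```python
import boto3

rds = boto3.client("rds")

# Create an Aurora MySQL cluster whose capacity floats between 0.5 and
# 16 Aurora Capacity Units (ACUs); billing follows actual per-second usage.
rds.create_db_cluster(
    DBClusterIdentifier="demo-aurora-cluster",      # hypothetical name
    Engine="aurora-mysql",
    MasterUsername="admin",
    MasterUserPassword="********",                  # placeholder
    ServerlessV2ScalingConfiguration={
        "MinCapacity": 0.5,
        "MaxCapacity": 16.0,
    },
)

# Instances in a Serverless v2 cluster use the special "db.serverless"
# instance class; Aurora scales them within the bounds set above.
rds.create_db_instance(
    DBInstanceIdentifier="demo-aurora-instance",    # hypothetical name
    DBClusterIdentifier="demo-aurora-cluster",
    DBInstanceClass="db.serverless",
    Engine="aurora-mysql",
)
```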

Another feature launch that has proven compelling to customers is the release of Amazon RDS Multi-AZ with two readable standbys in different AZs, improving both performance and availability. As you may know, we launched Multi-AZ deployments back in 2010, in which we automatically create a primary database (DB) instance and synchronously replicate the data to an instance in a different AZ. When it detects a failure, Amazon RDS automatically fails over to a standby instance without manual intervention. Now, the launch of Multi-AZ with two readable standbys adds another layer of protection and significant performance benefits. With this feature, failovers typically occur in under 35 seconds with zero data loss and no manual intervention. Customers can gain read scalability by distributing traffic across two readable standby instances, and up to 2x improved write latency compared to Multi-AZ with one standby.  

Q4: During re:Invent, it was mentioned that AWS also recently launched serverless and global database for your graph database, Amazon Neptune. Can you share some details on this? 

Rahul: Yes, Amazon Neptune is now our sixth database to be serverless and our fifth database with the ability to scale reads globally across regions. Both of these capabilities are important for modern applications with global performance requirements at scale. I should also mention that for our first ever serverless database, Amazon DynamoDB, we recently announced the capability to import data from S3. This further underscores our focus on increasing interoperability and integration across our services to minimize the effort required by customers to move their data to where they need it. 

Q5: On the heels of re:Invent, AWS became the new Leader of Leaders in the Gartner MQ for Cloud Database Management Systems 2022. That’s a remarkable achievement. How is AWS thinking about this recognition? What are the main strengths that Gartner found in your offering? Are there any weaknesses? 

Rahul: While AWS has been named as a Leader for the eighth consecutive year, we were elated and humbled to be positioned highest in execution and placed furthest in vision among the top 20 data and analytics companies in the world. We think listening to our customers and solving their most challenging problems is key. We engage closely with customers on product roadmaps and work diligently to deliver on our commitments as promised. Our own experience operating our e-commerce business has been, and continues to be, a wellspring of learnings about what it takes to build massive, modern, internet-scale applications serving customers globally. 

In their 2022 report, Gartner called out the breadth of our services as a major strength. Our best-fit philosophy, targeted to specific use cases as needed by various applications and microservices, is really paying off. No vendor ever gets a perfect score, and Gartner also noted that there is still upside from better integration between our services. Gartner gave us credit for progress toward an integration roadmap, and this continues to be a major roadmap theme for us. At re:Invent, we announced Amazon Aurora zero-ETL integration with Amazon Redshift, and we’re eager to continue delivering on our integration roadmap. You can read the report here.

Q6. What were the overarching themes around your announcements at re:Invent 2022? 

Rahul: Our database business tracks several themes that we deliver against. Three of them were at the center of our announcements: interoperability across services, advancing performance and scale, and operational excellence achieved by making security and advanced operational techniques more approachable. 

Q7: Why are these themes important? 

Rahul: Interoperability across our services is important because it improves productivity across development and operations teams. Integration between services is needed as part of building modern applications; it’s a question of where the integration occurs. Application developers often have to include this integration as part of their application code, or solution architects must take extra measures to include additional integration components, which increases complexity. If the integration is built in under the covers, then that’s one big area developers and architects don’t need to worry about. 

Performance and scale are important because of the deluge of data and types of data organizations are experiencing and will continue to experience. For almost every organization this deluge of data is a clear and present day-to-day reality. Customers need reassurance that they can scale-up and scale-out with real-time performance. 

Finally, the approachability of security and advanced operational techniques removes big hurdles that get in the way of organizations that don’t want to make massive investments in IT operations and specialized skills. It levels the playing field for the undifferentiated heavy lifting – things that are not core to the business but necessary for advancing the mission of the business. The definition of undifferentiated heavy lifting is expanding. Years ago, we started by removing the resources associated with hardware provisioning, database setup, patching, backups, and more. This is expanding to scaling up/down and scaling in/out based on an application’s needs, and removing the highly specialized skill sets and extensive resources otherwise required.  

Q8: What did AWS announce in support of interoperability across services? 

Rahul: We announced the preview of interoperability between Amazon Aurora and Amazon Redshift. Each of these services leads in its category – Amazon Aurora as an operational database and Amazon Redshift as an analytical database. 

The traditional approach to integration between operational and analytical databases is to use generalized ETL or ELT. This is beset with problems in many ways. It’s complex and heavy, often requiring manual coding of SQL to optimize query performance. It’s harder to set up, maintain, and use. Maintenance and the lifecycle management of this type of data integration is worsened by the inherent fragility of this approach – the integration breaks when there is a change to the source or target schema, which requires extensive testing after every change. What you get after taking on all these burdens is usually a low-performance, non-elastic solution that doesn’t adapt well to changing workloads. 

We announced the preview of a purpose-built, point-to-point, fully managed integration that doesn’t suffer from these issues. Our Amazon Aurora zero-ETL integration with Amazon Redshift can consolidate data from multiple Aurora databases to a single Redshift database, giving you the benefit of near-real-time analytics on unified data. This opens up an entire category of use cases for time sensitive analytics on fresh data. 

The integration is easy to set up: creating a Redshift integration target, whether it’s a new or existing endpoint, takes only a few steps. Furthermore, we designed this zero-ETL integration for easy maintenance, adapting to Aurora-side schema changes. Database or table additions and deletions are handled transparently. If a transient error is encountered, the integration automatically re-syncs after recovering from the error. 

Data is replicated in parallel, within seconds. So large data volumes are not a problem. On the Amazon Redshift side, you can transform data with materialized views for improving query performance. 
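
As a rough illustration of that last step, here is a hedged sketch in Python with psycopg2 that pre-aggregates replicated data on the Redshift side; the endpoint, credentials, and the orders table assumed to be replicated from Aurora are all hypothetical, and the zero-ETL integration itself is created separately through the AWS console or API:

```python
import psycopg2

# Connect to the Redshift cluster that receives the zero-ETL data
# (endpoint, credentials, and table names are hypothetical).
conn = psycopg2.connect(
    host="demo.xxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="awsuser", password="********",
)
conn.autocommit = True
cur = conn.cursor()

# Pre-aggregate the near-real-time replicated rows with a materialized
# view, so dashboards query a compact result instead of raw data.
cur.execute("""
    CREATE MATERIALIZED VIEW hourly_orders AS
    SELECT date_trunc('hour', order_ts) AS hour,
           count(*) AS orders,
           sum(total) AS revenue
    FROM aurora_shop.public.orders   -- hypothetical replicated table
    GROUP BY 1
""")
cur.execute("REFRESH MATERIALIZED VIEW hourly_orders")
```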

Q9: Now shifting to performance and scale, what are the highlights? 

Rahul: We announced three key new features, starting with Amazon DocumentDB Elastic Clusters, which horizontally scale writes with automated operations. As you may know, we can already horizontally scale reads across all our popular databases using read replicas. For Amazon DocumentDB, our customers needed the ability to horizontally scale writes beyond the limits of a single node. Amazon DocumentDB Elastic Clusters uses sharding, a form of partitioning data across multiple nodes in a cluster, so that each node can support both reads and writes in a multi-active approach. When data is written to a node it is immediately replicated to the other nodes. This has the added benefit of supporting massive volumes of data. What’s exciting is that Amazon DocumentDB can scale to handle millions of writes (and reads) per second with petabytes of storage capacity. 
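
Here is a minimal sketch, in Python with pymongo, of what sharding a collection on an Elastic Cluster can look like; the connection string, database, collection, and shard key are hypothetical:

```python
from pymongo import MongoClient

# Connect to a DocumentDB Elastic Cluster endpoint (URI is hypothetical).
client = MongoClient(
    "mongodb://user:********@demo.cluster-xxxx.us-east-1"
    ".docdb-elastic.amazonaws.com:27017/?tls=true"
)

# Shard the collection on a hashed key so writes spread across nodes;
# each node then serves both reads and writes.
client.admin.command(
    "shardCollection", "appdb.events",
    key={"user_id": "hashed"},
)

# Subsequent writes are routed to a shard by the hash of user_id.
client["appdb"]["events"].insert_one({"user_id": 42, "action": "login"})
```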

In addition to horizontal scaling, we also invested in optimizing the performance of a single database instance. Our announcements of Amazon RDS Optimized Writes and Amazon RDS Optimized Reads for MySQL are examples of this. Both enhancements refine our internal implementation to improve performance. 

Prior to RDS Optimized Writes, atomicity of writes was handled by writing pages twice: smaller chunks of a page were first written to a “doublewrite buffer” and then written to storage. This protects against data loss in case of failure, but two writes take longer and consume more I/O bandwidth, reducing database throughput and performance. For use cases with a high volume of concurrent transactions, customers also had to provision additional IOPS to meet their durability and performance requirements. Optimized Writes work by atomically writing more data to the database for each I/O operation, so pages are written to table storage durably in a single atomic step. With Optimized Writes, customers can now gain up to a 2x improvement in write transaction throughput at no additional cost and with zero data loss. 

With RDS Optimized Reads, read performance is improved by leveraging data proximity. A MySQL server creates internal temporary tables while processing complex or unoptimized queries, such as analytical queries that require grouping or sorting. When these temporary tables cannot fit into memory, the server defaults to disk storage. With Optimized Reads, RDS places these temporary tables on the instance’s local storage instead of an Amazon Elastic Block Store (EBS) volume, which is shared network storage. It’s the local availability of temporary data that makes queries up to 50% faster. 
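
The following sketch, in Python with pymysql against a hypothetical RDS for MySQL endpoint and schema, shows the kind of grouped, sorted query that forces internal temporary tables; in the EXPLAIN output, “Using temporary; Using filesort” marks the work that Optimized Reads accelerates when it spills to disk:

```python
import pymysql

# Connect to an RDS for MySQL instance (endpoint and schema hypothetical).
conn = pymysql.connect(
    host="demo-mysql.xxxxxx.us-east-1.rds.amazonaws.com",
    user="admin", password="********", database="shop",
)

with conn.cursor() as cur:
    # An analytical query with grouping and sorting: MySQL builds an
    # internal temporary table for it, and when that table cannot fit
    # in memory, Optimized Reads keeps the spill on local instance
    # storage instead of shared network storage.
    cur.execute("""
        EXPLAIN
        SELECT customer_id, COUNT(*) AS orders, SUM(total) AS spend
        FROM orders
        GROUP BY customer_id
        ORDER BY spend DESC
    """)
    for row in cur.fetchall():
        print(row)  # look for "Using temporary; Using filesort"
```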

Q10: How about security and operational excellence, what did AWS announce for this theme? 

Rahul: Security is of utmost importance and an area of sustained investment for us. We announced the preview of Amazon GuardDuty RDS Protection, which protects Amazon Aurora databases from suspicious login attempts that can lead to data exfiltration and ransomware attacks. It does this by identifying anomalies, sending intrusion alerts, managing stolen credentials, and more. Our goal with GuardDuty was to create a tool that’s easy to enable and produces timely, actionable results. We use machine learning to accurately detect highly suspicious activities like access attacks using evasion techniques. Security findings are enriched with contextual data so you can quickly answer questions such as what database was accessed, what was anomalous about the activity, has the user previously accessed the database, and more. Aurora is the starting point. We’ll also extend this capability to other RDS engines. 
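
Assuming the feature is exposed through the GuardDuty detector-features API (the detector ID below is hypothetical, and the exact feature flag may differ by region and API version), enabling the protection could look like this boto3 sketch:

```python
import boto3

gd = boto3.client("guardduty")

# Turn on RDS Protection for an existing detector; GuardDuty then
# analyzes Aurora login activity for anomalies such as access attacks
# that use evasion techniques.
gd.update_detector(
    DetectorId="12abc34d567e8fa901bc2d34e56789f0",  # hypothetical ID
    Features=[{"Name": "RDS_LOGIN_EVENTS", "Status": "ENABLED"}],
)
```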

We also announced Trusted Language Extensions for PostgreSQL, an open-source development kit and project, available for Amazon Aurora and Amazon RDS. This project is focused on increasing the security posture for extensions, starting with PostgreSQL.  

Developers love PostgreSQL for many reasons including the thousands of available extensions, but adding extensions can be risky. This makes certification of extensions very important. Our customers asked us for an easier way to use their extensions of choice and also build their own extensions. It’s impractical for AWS to certify the long tail of extensions, so we worked with the open-source community to come up with a more scalable model. 

Trusted Language Extensions for PostgreSQL is a framework that empowers developers and operators to more safely test and certify extensions. Now, as soon as a developer determines that an existing extension meets their needs or is ready to implement a custom extension, they can safely test and deploy it in production. Developers no longer need to wait for AWS to certify an extension to begin implementation, because Trusted Language Extensions are considered part of their application. The approach is safe because the impact of any defect in an extension’s code is limited to a single database connection. Trusted Language Extensions supports popular programming languages that developers love, including JavaScript, Perl, and PL/pgSQL. We do plan to support other programming languages, so stay tuned for announcements in 2023.  
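
As a rough sketch of the developer workflow, here is what registering and installing a trivial extension through the pg_tle extension framework might look like from Python with psycopg2; the connection details, extension name, and function body are hypothetical:

```python
import psycopg2

# Connect to a database where the pg_tle extension is already enabled
# (host, database, and credentials are hypothetical).
conn = psycopg2.connect(
    host="demo-pg.xxxxxx.us-east-1.rds.amazonaws.com",
    dbname="appdb", user="admin", password="********",
)

# The extension body is plain SQL; a defect in it is confined to the
# database connection that runs it.
ext_sql = """
CREATE FUNCTION hello_tle() RETURNS text AS $$
  SELECT 'hello from a trusted-language extension';
$$ LANGUAGE sql;
"""

with conn, conn.cursor() as cur:
    # Register the extension body with pg_tle, then install it like any
    # other PostgreSQL extension.
    cur.execute(
        "SELECT pgtle.install_extension(%s, %s, %s, %s)",
        ("hello_tle", "1.0", "demo trusted-language extension", ext_sql),
    )
    cur.execute("CREATE EXTENSION hello_tle")
    cur.execute("SELECT hello_tle()")
    print(cur.fetchone()[0])
```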

Q11: What else did AWS launch for making advanced operational techniques more approachable?   

Rahul: I am also excited about Amazon RDS Blue/Green Deployments, which automates an advanced DevOps technique – and this is available for MySQL in both Amazon RDS and Amazon Aurora. In the current atmosphere of 24/7 operations, downtime for updates (security patches, major version upgrades, schema changes, and more) or disruptions or data loss due to failed attempts at updates are not acceptable.  

In this DevOps technique, the production environment is the ‘blue’ environment and the staging environment is the ‘green’ environment. Organizations with advanced DevOps skills test new versions of software in a ‘green’ environment under a production load before actually putting them in production. But this requires advanced operational knowledge, careful planning, and time. With RDS Blue/Green Deployments, we provide a fully managed staging environment. When an upgrade is deemed ready, the database can be updated in less than a minute with zero data loss – a much simpler, safer, and faster approach to database updates. 
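
A minimal sketch of that workflow with boto3, assuming a hypothetical source database ARN and target engine version:

```python
import boto3

rds = boto3.client("rds")

# Clone production ("blue") into a managed staging ("green") environment,
# upgrading the engine version on the green side.
bg = rds.create_blue_green_deployment(
    BlueGreenDeploymentName="shop-upgrade",
    Source="arn:aws:rds:us-east-1:123456789012:db:shop-prod",  # hypothetical
    TargetEngineVersion="8.0.31",
)

# After validating the green environment under production load, promote
# it; the switchover typically completes in under a minute.
rds.switchover_blue_green_deployment(
    BlueGreenDeploymentIdentifier=(
        bg["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"]
    ),
    SwitchoverTimeout=300,  # seconds
)
```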

Another launch is AWS Database Migration Service (DMS) Schema Conversion, which makes heterogeneous migrations operationally easier. Previously, a separate schema conversion tool was needed to map the data at the source database to the target database. Now schema conversion is integrated with DMS, making schema assessments and conversions much simpler. Heterogeneous schema conversion can now be initiated with a few simple steps, reducing setup time from hours to minutes. 

Q12: Would you like to add anything else? 

Rahul: A good way to come up to speed with the latest from AWS and the art of the possible is to watch recordings from re:Invent. We showcased product announcements and a breadth of sessions that cover our product roadmap and best practices. You can also learn more from our database category page, and database blog. We’re energized and focused on innovating for our customers! Feedback is always welcome and I encourage all customers to reach out so we can help no matter where they may be on their journey to the cloud – simply complete our  Contact Us form.  

………………………………..

Rahul Pathak is Vice President, Relational Database Engines at AWS, where he leads Amazon Aurora, Amazon Redshift, and Amazon QLDB, AWS’ core relational database engine technologies. Prior to his current role, he was VP, Analytics at AWS where he led Amazon EMR, Amazon Redshift, AWS Lake Formation, AWS Glue, Amazon Athena, and Amazon OpenSearch Service. During his 11+ years at AWS, Rahul has focused on managed database and analytics services with previous roles leading Emerging Databases, Blockchain, RDS Commercial Databases, and more. Rahul has over twenty-five years of experience in technology and has co-founded two companies, one focused on digital media analytics and the other on IP-geolocation. He holds a degree in Computer Science from MIT and an Executive MBA from University of Washington. 

Resources

AWS positioned highest in execution and furthest in vision

Gartner has recognized Amazon Web Services (AWS) as a Leader and positioned it highest in execution and furthest in vision in the 2022 Magic Quadrant for Cloud Database Management Systems among 20 vendors evaluated. This Magic Quadrant report provides cloud data and analytics buyers with vendor insights based on Gartner research criteria. AWS has been a Leader in the report for eight consecutive years.  

Magic Quadrant for Cloud Database Management Systems

Published 13 December 2022 – ID G00763557 – 71 min read

Figure 1: Magic Quadrant for Cloud Database Management Systems (source: Gartner, December 2022)

Read the Gartner Report

Related Posts

EXPERT ARTICLES DECEMBER 16, 2022

Deep Dive Amazon DocumentDB Elastic Clusters. Q&A with Vin Yu
https://www.odbms.org/2022/12/deep-dive-amazon-documentdb-elastic-clusters-qa-with-vin-yu/

Follow us on Twitter: @odbmsorg

Mar 21 22

On Using in Memory Database. Interview with Jonah H. Harris

by Roberto V. Zicari

“Whether it’s adding features, fixing bugs, or improving performance, it all comes down to the quality of the code.” –Jonah H. Harris.

Q1. You are the director of Artificial Intelligence & Machine Learning at The Meet Group. What are your current responsibilities?

Jonah H. Harris: AI and ML research is rapidly growing. Staying on top of those advancements to identify key strategic opportunities and improvements that deliver novel and strategic solutions, which solidify our position as leaders in personal connection, is paramount. While setting direction is important, my primary goal is to shape, grow, and lead an exceptional team of Machine Learning Engineers to research, design, develop, and implement innovative solutions and advance our company’s capabilities across multiple business units. Our focus areas primarily include deep learning, natural language processing, computer vision, recommendation, ranking, and anomaly detection. It’s quite a bit to remain current on these days. 

Q2. What do you use Artificial Intelligence & Machine Learning for?

Jonah H. Harris: At The Meet Group, we provide multiple brands and platforms which enable members to identify potential partners for romantic, platonic, and entertainment purposes. While traditional recommendation systems match items (e.g., books, videos, etc.) with a user’s interests, we aim to match people who are mutually interested in and likely to communicate with each other. While recommendation is a critical component of our business, additional work is required to perform abuse prevention and improve monetization – all of which are enhanced using a combination of data science, machine learning, and artificial intelligence. Our team employs many different techniques and technologies to accomplish each area mentioned above as quickly and efficiently as possible.

Q3. You have been working previously as the VP of Architecture and Lead DBA, overseeing high performance data access. What were your most important projects?

Jonah H. Harris: Now paired with Parship, The Meet Group is a worldwide leader in personal connection with a globally distributed workforce. When I joined as the Lead DBA in 2008, however, it was a small social network named myYearbook based in New Hope, Pennsylvania. Through multiple acquisitions and stages of the company, from private to NASDAQ-listed and private once again, I’ve been fortunate enough to grow with the organization and hold various positions from individual technologist to Chief Technology Officer. I’ve always enjoyed challenging work and my current position, overseeing AI/ML, is no different.

When I think of all the projects I’ve architected or developed over the years, one of the most fun and architecturally challenging was the reciprocal matchmaking system designed for a game called BlindDate.

BlindDate was a questionnaire-based matchmaking system that allowed members to select questions about themselves, supply their own answers, and identify their desired partner’s answers. To be “matched,” other members would need to answer the same questions along with the desired answers bi-directionally. One important implementation caveat was that we did not want to precompute these matches – they had to be done in (soft) real-time. We found many members would submit hundreds or even thousands of questions. While we did our best to partition this problem into an optimal search space, performing this reciprocal match was a performance challenge.

For our MVP, we initially designed this to use a relational database. Early on, however, we found this began to take around eight hundred milliseconds per request. As the game scaled, this would never work as initially designed. This led us to look at eXtremeDB.

Coupled with its new (at the time) multi-version concurrency control (MVCC) transaction manager and ability to control the low-level data structure format, we were able to design a bitwise-optimized matching algorithm. As a result, the eXtremeDB-based implementation dropped the response time of a single request down to seventy-six microseconds on the exact same hardware; it also reduced memory usage by two-thirds. 
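
The exact eXtremeDB data layout isn’t described here, but the core idea of bitwise reciprocal matching can be sketched in a few lines of Python; the encoding below (one bit per answer choice, with each member storing an “answers” bitmap and an “accepts” bitmap) is an illustrative toy, not the production implementation:

```python
def reciprocal_match(a_answers, a_accepts, b_answers, b_accepts):
    """True when every answer A gave is acceptable to B, and vice versa.

    Each argument is an integer bitmap with one bit per answer choice:
    A's answers must be a subset of B's accepted answers, and B's
    answers a subset of A's accepted answers.
    """
    return (a_answers & ~b_accepts) == 0 and (b_answers & ~a_accepts) == 0

# Toy example: four yes/no questions packed into one machine word each.
alice_answers, alice_accepts = 0b1010, 0b1110
bob_answers,   bob_accepts   = 0b1010, 0b1011

print(reciprocal_match(alice_answers, alice_accepts,
                       bob_answers, bob_accepts))   # True: mutual match
```

A word-sized AND-and-compare per question block is what makes this style of matching amenable to microsecond-scale scans once the bitmaps are laid out contiguously in memory.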

Q4. What are the main challenges you have encountered to achieve high performance data access?

Jonah H. Harris: Largely, a primary challenge is defining the appropriate structure to store and query data. Relational databases are great for general-purpose data management. On the other hand, NoSQL-oriented systems are great for flexibility. Similarly, systems such as Redis provide a unique ability to perform tasks that can’t easily be done with great performance in a traditional database management system. When designing an application, you have to choose the best tool for the job and make trade-offs where necessary. In some cases, this requires utilizing multiple data management technologies or sacrificing performance on one task in favor of another. It’s hard to find a system that’s both as flexible as it is fast: eXtremeDB is really the only contender in that category I’ve found.

Q5. Can you tell us about some of the work you have done with eXtremeDB?

Jonah H. Harris: In addition to the BlindDate case mentioned above, we experimented with storing a graph database structure in eXtremeDB – it was highly performant and gave us the ability to store the graph in an optimal form while also making it queryable via SQL.

eXtremeDB is so good that I have personally licensed it to develop and test out my own ideas and implementations of various systems. I’ve built everything from a Redis-compatible service to real-time recommender systems based on eXtremeDB.

I’m actually in the process of writing a book for Apress, Realtime Recommendation Systems: Building Responsive Recommenders from the Ground Up, and testing out several of those algorithms with eXtremeDB as well. Compared to several well-known open-source recommenders, my eXtremeDB-based versions consistently demonstrate several hundred percent improvements in performance. This is due to eXtremeDB’s highly-optimized in-memory implementation, which doesn’t force me to sacrifice on-disk capabilities as other systems do. Additionally, I’ve always licensed the eXtremeDB source code, which is rare for a company to offer. With that, I’ve been able to gain a solid understanding of internals and compile-time optimizations, enabling me to make even better performance gains. The code is immaculate, and McObject is equally great about accepting patches for additional functionality.  


Q6. Why choosing eXtremeDB?

Jonah H. Harris: If my earlier answers haven’t already praised its modularity and flexibility enough, I’ll state it more clearly: with over twenty years of professional experience not only administering and developing against databases but also working on their internals, eXtremeDB is the only system I’ve found that gives developers the ability to build almost anything with very few constraints.
Likewise, McObject’s support is exceptional. You can ask as detailed a question as you can imagine and get a solid answer, in many cases from the engineers themselves. 

Q7. You have implemented a number of features for commercial and open-source databases. What are the main lessons you have learned?

Jonah H. Harris: Whether it’s adding features, fixing bugs, or improving performance, it all comes down to the quality of the code. Unfortunately, most open-source database code is abysmal. Postgres, InnoDB (proper), and Redis are exceptions. That said, you’d expect commercial implementations to be so much better – but they’re usually not. It’s sad, really.

While I didn’t know it initially, part of the team behind eXtremeDB was also behind the old Raima Database Manager (RDM). In the late nineties, I used RDM quite a bit and had a source code license for its code as well. Aside from the MASM-based NetBIOS lock manager implementation, which I believe they acquired from a third-party developer, it was an extremely well-written system with great documentation. So, when I found out eXtremeDB was a brand new, from the ground-up, in-memory-optimized system with very similar developer-friendly embedded database design goals, I was sold!
Sure, I’ve worked on the internals of many different database systems. But, I have no problem understanding the code to eXtremeDB at all. It’s all well-organized and straightforward, which is hard to do for a system that supports multiple transaction managers and is optimized for both in-memory and on-disk operations.  

Q8. You are an active open-source contributor. What are your current open source projects you contribute to? 

Jonah H. Harris: As of late, I haven’t had a great deal of time to do much open-source work. Database-wise, my latest contributions are to Redis, adding a few useful commands and performance optimizations. The rest are generally bug fixes or feature additions in libraries I frequently use.


Q9. What is your experience of using open source software for mission critical applications?

Jonah H. Harris: I’ve always been a big advocate of open-source. I remember first using FreeBSD and Linux in the mid-90s when I was in middle school. That said, I’m huge on choosing the best tool for the job at hand. Sometimes that’s open-source, and sometimes it’s not.

In the early 2000s, I was hired to lead the development of a Johnson & Johnson brand’s rewrite of their CFR Part 11 quality system ERP module from PowerBuilder to Apache+PHP. We used a good amount of open-source, but it still ran on top of HP-UX and Oracle. Did it need to? No. But that’s what they were comfortable with and, to be honest, those were a better choice stability-wise at the time.

These days, when I’m building a general back-end web-based API, I default to Node.js+NGINX, Postgres, and Redis. As most things are containerized on top of a Linux distribution these days, it’s hard to beat that stack. Language-wise, I like TypeScript, though I do see cases for Rust and Go in the future.

That said, when I’m building a performance-optimized system, I still prefer C with libuv for networking. For data management, I’ll use eXtremeDB when I need MVCC or dual in-memory/on-disk functionality. There’s no need to reinvent that, and nothing is nearly as fast. Otherwise, I’ll use klib data structures for simple single-threaded apps.

Open source is great, and it’s come a long way. But, there are still valid cases for using commercial systems.

Qx Anything else you wish to add?

Jonah H. Harris: For the most part, IMDB systems have always been considered a niche: you either know about them or you don’t. eXtremeDB is an IMDB-optimized system, but its functionality far surpasses its competitors in every aspect. It can be used locally or distributed, with and without SQL,  in-memory only or as an on-disk hybrid, in-process and as a server, with high availability, vector-optimized operations, real-time embeddability, source code, and many compile-time optimizations. More people really should know about it; it’s a genuinely fantastic system.

……………………………………..

Jonah H. Harris, Director of Artificial Intelligence & Machine Learning, The Meet Group.

Leader. Entrepreneur. Technologist. NEXTGRES Founder. Former CTO at The Meet Group. OakTable Member. Open Source Contributor. Founding Member of the Forbes Technology Council.

Resources

McObject and Siemens Embedded Announce Immediate Availability of eXtremeDB/rt for Nucleus RTOS

Related Posts

On eXtremeDB/rt. Q&A with Steven Graves. ODBMS.org OCTOBER 8, 2021

Follow us on Twitter: @odbmsorg

Mar 7 22

On Cloud Database Management Systems. Interview with Jordan Tigani

by Roberto V. Zicari

“If a company starts as an on-premise business and decides to become relevant in the cloud space it requires dedicated energy devoted to changing the culture.

...If there was a lesson to share it would be to not under-invest or think it will be easy. You also need to hire people with cloud experience. There are a number of areas where you can hire smart people and they will pick up what they need, but if you’re trying to make a transition there is no substitute for actual hands-on experience with what happens when you try to scale.” — Jordan Tigani.

Q1. You are Chief Product Officer at SingleStore since June 30th, 2020. What are the main projects you have been working at SingleStore?

Jordan Tigani: The number one thing that I’ve been focused on is helping SingleStore transform into a cloud company. This means more than having a product that runs in the cloud; you need to reimagine how you build software, how you monitor it, how you support it, and what features you need. We’ve got a great team that has taken these ideas and run with them, but to some extent this is a cultural change, and that takes a lot of time and directed energy. 

I’ve also been working on refining the mission and completing the technology so that it solves all use cases. For the last year, we’ve been focused on data-intensive applications, which are, broadly, applications that hit bottlenecks in data. This is a growing subset of the database market, as richer applications tend to want to do more interesting things with their data.

Q2. You co-created Google BigQuery as a founding engineer and went on to be its Director of Engineering and then of Product Management. How much has your work at Google influenced your current work at SingleStore?

Jordan Tigani: My two biggest learnings from Google were how to build a cloud product that scales (when I left BigQuery it was using about 3 million CPU cores) and a deep customer empathy for the cloud analytics market.

I also saw a lot of things that customers wanted to do, but we had a hard time making the technology work to solve their problems. One of our tech leads had a great saying: “It’s just code.” This meant that given enough time, you could make any feature work. However, if you didn’t have the right architecture you would hit limitations, and all the clever coding in the world wouldn’t be able to help you. 

Some of the things that BigQuery customers were pushing us to do—like being able to do rapid updates, or serve low latency queries—were things that were incredibly difficult to do with the architecture. Many of these same things were problems that SingleStore had already solved, and by virtue of their architecture, there was a technological moat that would be hard for competitors to cross.

Q3. The tag line of SingleStore is “The Single Database for All Data-Intensive Applications for operational analytics and cloud-native applications“. To demonstrate how fast SingleStore is on both transactional and analytical workloads you did comparative benchmarks against leading cloud vendors for both TPC-H (analytics) and TPC-C (transactions). What were the main results?

Jordan Tigani: The main takeaway was that SingleStore is as fast or faster on analytics benchmarks as cloud data warehouses, and is as fast or faster than cloud operational databases at transactional benchmarks. This means that in one database, with one storage type, you can get stellar performance on both transactions and analytics, which many people think is impossible. This brings us closer to having a “general purpose” database, where you don’t have to necessarily plan what you’re going to do with it before you start using it.

Q4. Why did you compare separately your performance at TPC-H with data warehouse vendors, and TPC-C with only one operational database vendor? What did you learn?

Jordan Tigani: On the analytics side, the main cloud data warehouse vendors have been engaging in public benchmark wars and focusing on performance. We didn’t want to escalate the amount of noise being thrown around, but we did want to call attention to the fact that we can put up some pretty stellar numbers ourselves.

We only measured one operational database vendor because TPC-C is harder to set up, and because of the way it is defined, it doesn’t provide rich information like TPC-H. We’re working with a third-party vendor to release a more detailed report, which will include additional operational database vendors.

Q5. How did you perform the test? 

Jordan Tigani: We had ignored benchmarks for a long time since they often do not correlate with real-world performance. But in recent months, data warehouse vendors have been poking each other about TPC results.  So, we put a couple of engineers on the problem and had them run some tests against our database and competitors. 

When you run competitive benchmarks yourself you often get accused of selective reporting or cheating (Databricks and Snowflake had a recent public spat about this). We’ve hired a third-party vendor to reproduce the results and the report should be out in another month or so. When they do publish their report, they’ll also reveal the companies they are comparing us against. 

Q6. You mentioned in the article that your benchmark runs used the schema, data, and queries as defined by the TPC. However, they do not meet all the criteria for official TPC benchmark runs and are not official TPC results. Isn’t this a limitation to the acceptance of such benchmarks?

Jordan Tigani: It is very expensive and difficult to do an official TPC submission, and at the end of the day, it doesn’t tell you much. For TPC-H, for example, we did a “power run,” which means running the queries sequentially. This shows off the ability to perform well in several different query shapes that are indicative of a data warehousing workload. It is a lot harder to run a full TPC-H benchmark as it involves multiple concurrent queries and changing data.
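
For illustration, a power run reduces to a simple sequential loop over the 22 TPC-H queries; the sketch below uses Python with a generic DB-API driver and assumes hypothetical connection details and pre-generated query files (queries/q1.sql through q22.sql):

```python
import time
import pymysql  # any DB-API driver works; connection details hypothetical

conn = pymysql.connect(host="bench-host", user="bench",
                       password="********", database="tpch")

# A "power run": execute the TPC-H queries one after another and
# record per-query wall-clock time.
for i in range(1, 23):
    sql = open(f"queries/q{i}.sql").read()
    start = time.perf_counter()
    with conn.cursor() as cur:
        cur.execute(sql)
        cur.fetchall()
    print(f"Q{i}: {time.perf_counter() - start:.2f}s")
```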

Q7. Not every workload needs transactions and analytics. What are the typical applications that need some flavor of both?

Jordan Tigani: There are two types of applications that need both analytics and transactions. The first is applications that are doing analytics. That is, they’re showing custom dashboards and slicing and dicing data. They tend to need up-to-the-moment data and low latency because they’re serving requests to end-users. They also need high concurrency because they are being used by analysts and are part of the end product being served. Data warehouses aren’t a great option in this use case, because they can’t scale to high concurrent user counts and are generally designed for throughput rather than low latency. SingleStore has a lot of customers in the financial services industry who back a lot of their portfolio analytics tools behind SingleStore databases.

The other type of application that needs analytics and transactions is one that wants to make use of data to enrich the experience. Maybe they want to do a product search and faceted drill-down. Maybe they want to show a leaderboard in a game. Traditional databases aren’t always great at these use cases once you get beyond a certain scale, and then performance can fall off a cliff. People end up stitching multiple databases together—maybe adding a cache on top because the database is slow—and then have to deal with the complexity of keeping a consistent model and all the data in sync. Have you ever seen an application that showed a notification or unread-message count, and then when you clicked on it there weren’t any notifications or unread messages? This is one of the ways this pattern shows up to the detriment of users; if they had used SingleStore, they could have kept those values in sync.

Q8. You are quoted saying that “Making the jump to being a cloud-native rather than just a company who runs their product on the cloud requires deep changes throughout the organization”. What are the key lessons learned you wish to share?

Jordan Tigani: If a company starts as an on-premise business and decides to become relevant in the cloud space it requires dedicated energy devoted to changing the culture. We drew up a 24-point scorecard last year and graded where SingleStore was on every axis of cloud readiness. The scorecard had everything from Elasticity to Auth to Scalability. We created a plan to get everything to “green” – it took a long time and a lot of sustained energy, but it was worthwhile.

It paid off considering we were one of the 20 databases recognized by Gartner in the 2021 Magic Quadrant for Cloud Database Management Systems. We believe that is something that could not have happened if we hadn’t dedicated significant energy to making sure we were thoroughly cloud.

If there was a lesson to share it would be to not under-invest or think it will be easy. You also need to hire people with cloud experience. There are a number of areas where you can hire smart people and they will pick up what they need, but if you’re trying to make a transition there is no substitute for actual hands-on experience with what happens when you try to scale.

Q9. How is the pandemic changing the market for enterprise infrastructures?

Jordan Tigani: The pandemic is changing the market for enterprise infrastructure in two ways. First, it is accelerating the transition to the cloud. If you’ve got a physical server somewhere you have to have staff that physically maintains those machines, which goes in the opposite direction of a workforce that is becoming more distributed and remote in the pandemic.

Secondly, the pandemic is accelerating the need for fast, accurate data. If you’re in the office, you can often tell how things are going by the “buzz.” But if your only connection to your team and your customers is through Zoom, a lot of key information is missing. The only way to get some semblance of that information back is through data and being able to mine what customers are doing, how sales are going, and how much attrition you’re seeing in the workforce. 

Big data analytics tools were, to some extent, developed to handle cases where you had so many customers that you couldn’t meet them all and could only get a pulse by looking at data. Google and Amazon are two companies that relied heavily on data because they had to. These techniques are being applied successfully when you may not have billions of customers, but have a difficult time reading their pulse.

Q10. Can you tell us a bit how did you help True Digital in Thailand to develop heat maps around geographies with large COVID-19 infection rates? What lessons did you learn?

Jordan Tigani: True Digital is a telecom provider in Thailand that was able to use cellular data to help track the spread of the pandemic. In the early days of Covid-19, there was a huge focus on getting answers quickly and they were able to build out and ship an application on top of SingleStore in a matter of weeks. One lesson we learned was that if you need to build something in a hurry that needs to scale quickly, making sure you have the right tools when you start is important. SingleStore was ideally suited for True Digital’s needs, and we helped them get something out faster than they would have otherwise. You can read more about our work with True Digital here.

Q11. You are quoted saying “I like the idea of using AI to augment and go beyond what you can do currently. There’s really intelligence, which is a step beyond analytics, which is driving real insight from the data and automatic insight from the data.” Can you please elaborate on this?

Jordan Tigani: There is a hierarchy of analytical needs, and at the base level is collecting data. If you don’t have the data, then you’re blind to what is happening. 

The next step is understanding the data sources, which requires a feedback loop with a human to understand what the data is telling you. Too often people try to skip this step and jump right to making decisions based on the data, and they end up making the wrong decisions because the data isn’t actually telling them what they thought it was. A great example I’ve seen of this is when people were looking at counts of customers, but every customer that wasn’t logged in got the same customer ID, so the averages got completely skewed. 

Once you have data that is cleaned and reputable, you can start understanding what the data shows. This is where BI and dashboards come in. Insight tends to come from questions that someone asks, like  “why were my sales down in the southern region?”

Where it starts to get interesting is when you take the next step; making data-driven decisions. You have data that you understand and rely on, and you have been able to drill down and ask questions. AI and machine learning can help you all along the way–from figuring out what data to capture, to the structure of your data, to answering questions. As the last step, you need absolute trust in the lower levels of the system, or else you risk making a lot of bad decisions that you can’t diagnose.

Q12. You are Board Member of Atlas Corps, whose mission is to address critical social issues.

Jordan Tigani: There are generally two types of organizations that address social issues: those that address the issues directly, and those that seek to address the roots of the problems. For example, an organization in the former category would help distribute food during a famine while the latter would help teach sustainable farming. 

As an engineer and someone who appreciates the building of the right systems and architectures, organizations that help improve systems are most interesting to me. Atlas Corps generally goes one step further than just trying to address the root of problems; they seek to help train people who are themselves addressing the problem. Who are we to come in and tell people how to farm, for example? Why not help boost the people in those locations who already have the context, and help teach them how to build stronger and scalable organizations?

Q13. What are the current projects?

Jordan Tigani: The pandemic has been hard on Atlas Corps since their model involved bringing social sector leaders to the United States for training and service in social change organizations. If you can’t bring people into the country, or those organizations are working remotely, it’s difficult to make those programs work. Atlas Corps has been working on building out their model to handle remote work, at least partly, which has made it work and scale better during the pandemic. Their tagline is “talent is universal, opportunity is not,” which is a lesson I try to apply everywhere.

………………………………………………

Jordan Tigani, Chief Product Officer, SingleStore.

Jordan is the Chief Product Officer at SingleStore, where he oversees the engineering, product, and design teams. He was one of the creators of Google BigQuery, wrote two books on the subject, and led first its engineering and then its product teams. He is a veteran of several star-crossed startups and spent several years at Microsoft working on bit-twiddling.

Resources

TPC Benchmarking Results. Genevieve LaLonde, Jack Chen, Szu-Po Wang, SingleStore, 2022

Related Posts

On AI, Cloud, and Data & Analytics. Interview with Sastry Durvasula and John Almasan. ODBMS Industry Watch, December 10, 2021

Follow us on Twitter: @odbmsorg

Feb 16 22

On IoT and InfluxDB. Interview with Paul Dix

by Roberto V. Zicari

“Time is a critical context for understanding how things function. It serves as the digital history for businesses. When you think about institutional knowledge, that’s not just bound up in people. Data is part of that knowledge base as well. So, when companies can capture, store and analyze that data in an effective way, it produces better results.” –Paul Dix.

Q1. InfluxData just announced accelerated IoT momentum with new customers and product features. Tell us what makes InfluxDB so well-suited to manage IoT data.

Paul Dix: We’re seeing time series data become vital for success in any industrial setting. The context of time is critical to understanding both historical and current performance. Being able to determine and anticipate trends over time helps companies drive improvements in mission-critical processes, making them more consistent, efficient and reliable. We built InfluxDB to facilitate every step of this process. We’ve been fortunate to work with several major players in the IIoT space already, so we’ve been able to really understand the workflows and processes that drive industrial operations and better develop solutions around them. 

Q2. How do the new edge features for InfluxDB that you just announced help developers working with time series data for IoT and industrial settings?

Paul Dix: The new features give developers more flexibility and nimbleness in terms of architecture so that they can build more effective solutions on the edge that account for the resources they have available there. For example, we understand that some companies have very limited resources on the edge, so we’ve made it easier to intelligently deploy configurable packages there. By breaking down the stack into smaller components, developers can reduce the amount of software they need to install and run on the edge. At the same time, we want developers to have the option to do more at the edge if they can. That’s why we’ve made it easier to run analytics on persistent data at the edge and to replicate data from an edge instance of InfluxDB to a cloud instance.

We’re also working to make it easier for IoT/IIoT developers to manage the many devices that they need to deal with. One of our new updates allows developers to distribute processed data with custom payloads to thousands of devices all at once from a single script. On the other side of the equation, we have another new feature that helps contextualize IoT data generated from multiple sources, using Telegraf, our open source collection agent, and MQTT topic parsing.

Q3. What makes time series data so important for IoT and IIoT? 

Paul Dix: Time is a critical context for understanding how things function. It serves as the digital history for businesses. When you think about institutional knowledge, that’s not just bound up in people. Data is part of that knowledge base as well. So, when companies can capture, store and analyze that data in an effective way, it produces better results. For example, manufacturers may want to know how long a valve has been in service, or how many parts their current configuration can produce per hour. Time is a constant measure that creates a baseline for comparative purposes, generates a current snapshot for systems and processes, and reveals a roadmap for identified patterns to persist and therefore become more predictable. 

Time series data is well-suited to IoT and IIoT because it ties the readings from critical sensors and devices to the context of time. It’s also easy to use persistent time series data for multiple, different purposes. We can think about temperature in this case. In a consumer IoT context, such as a home thermostat, users primarily want to know what the current temperature is. In an IIoT context, manufacturers want to know the current temperature, but also what the temperature was in the last batch, or the batch from the previous week. Using InfluxDB to collect and manage time series data makes these kinds of tasks easy. At InfluxData, we’re fortunate that InfluxDB is one of a select group of successful projects and products where IoT, data, and analytics deliver significant value to organizations and the customers they serve.
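
As a small illustration of that dual use, here is a hedged sketch with the InfluxDB Python client; the URL, token, org, bucket, and measurement names are hypothetical:

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Connect to an InfluxDB 2.x instance (connection details hypothetical).
client = InfluxDBClient(url="http://localhost:8086",
                        token="my-token", org="factory")
write_api = client.write_api(write_options=SYNCHRONOUS)

# Each reading keeps its time context, so the same data answers both
# "what is the temperature now?" and "what was it in last week's batch?"
point = (
    Point("furnace_temperature")
    .tag("line", "line-3")
    .tag("batch", "2022-02-14-A")
    .field("celsius", 412.7)
)
write_api.write(bucket="iiot", record=point)

# Query the last hour of readings with Flux.
tables = client.query_api().query(
    'from(bucket: "iiot") |> range(start: -1h) '
    '|> filter(fn: (r) => r._measurement == "furnace_temperature")'
)
for table in tables:
    for record in table.records:
        print(record.get_time(), record.get_value())
```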

Q4. Graphite Energy is featured in the announcement as a company that’s using InfluxDB to manage its time series data. Can you tell us more about the impact InfluxDB has had on its business?

Paul Dix: We’re really excited about our work with Graphite Energy – they’re an Australian company that makes thermal energy storage (TES) units. These devices get energy from renewable sources and store it until it’s required for industrial processes in the form of heat or steam. Their goal is to decarbonize industrial production. 

All of Graphite Energy’s operations are grounded in data. They collect time series data from their devices out in the field and use InfluxDB to store and analyze the millions of data points generated daily. Graphite Energy uses that data to optimize its products, to guide remote operation, engineering, and reporting, and to inform product development and research directions. InfluxDB has also been a key component in the development of their Digital Twin feature, which uses time series data to generate a real-time digital model of a TES unit that is accurate to within five percent of actual device performance. This allows them to roll backward and forward in time to track performance. The Digital Twin is a key component of the company’s predictive toolkit and ongoing product optimization efforts. The more efficient Graphite Energy’s TES units are, the better they’re able to facilitate decarbonization. That’s a win for everyone.

Q5. How are some of your other IoT customers using the InfluxDB platform? 

Paul Dix: Our customers are doing great things in the IoT space. I’ll highlight just a few here quickly. 

  • Rolls-Royce Power Systems is using InfluxDB to improve operational efficiency at its industrial engine manufacturing facility. By collecting sensor data from the engines of ships, trains, planes, and other industrial equipment, Rolls-Royce is able to monitor performance in real time, identify trends, and predict when maintenance will be needed.
  • Flexcity monitors and manages electrical devices for its customers. They also monitor supply-side energy output and use that information to dynamically shed or store excess electrical load in their monitored devices to help with grid balancing and demand response. They use InfluxDB as their managed time series platform, Flux to calculate complex real-time metrics, and InfluxDB tasks for alerting and notifications (see the sketch after this list).
  • Loft Orbital uses InfluxDB Cloud to collect and store IoT sensor data from its spacecraft. The company flies and operates customer payloads on satellite buses, and uses InfluxDB to gain observability into its infrastructure and collect IoT sensor data, including millions of highly critical spacecraft metrics; the business currently ingests 10 million measurements every 10 minutes.

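Here is the sketch referenced in the Flexcity item above: a hypothetical threshold check over a supply-output metric. Flexcity does this with Flux and native InfluxDB tasks; the same idea is expressed client-side in Python here, and the bucket, measurement, and threshold are all invented for illustration.

```python
from influxdb_client import InfluxDBClient

client = InfluxDBClient(url="http://localhost:8086", token="TOKEN", org="energy")

# Mean supply-side output over the last five minutes (hypothetical metric).
flux = '''
from(bucket: "grid")
  |> range(start: -5m)
  |> filter(fn: (r) => r._measurement == "supply_output_mw")
  |> mean()
'''
for table in client.query_api().query(flux):
    for record in table.records:
        if record.get_value() > 90.0:  # invented threshold
            print("ALERT: excess supply; shed or store load in managed devices")
```
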
Q6. InfluxData has partnered with some of the leading manufacturing providers including PTC and Siemens. How have these partnerships benefitted shared customers?

Paul Dix: A lot goes into these partnerships on both ends, and we work really hard to make and keep them mutually beneficial. One thing that’s a real benefit to customers is when we’re able to integrate InfluxDB with our partner’s platform. Take PTC, for example. InfluxDB is the preferred time series platform for ThingWorx, and there is a native integration within the PTC platform itself. That makes it a lot easier for customers to get up and running with InfluxDB, and because it’s already integrated with PTC, they know the two systems are going to play together nicely. A solution like that saves a lot of the time and stress that typically arise in the development process, especially when building out new solutions or retrofitting old ones.

Beyond PTC, additional industry-leading IIoT platforms, including Bosch ctrlX, Siemens WinCC OA, Akenza IoT, and Cogent DataHub, have also partnered with InfluxData to use InfluxDB as a supported persistence provider and data historian.

Q7. What’s on the horizon for InfluxData and InfluxDB this year? How do you plan to build on this momentum in IoT?

Paul Dix: IoT will continue to be a priority for our team this year. We’re also looking forward to bringing the benefits of InfluxDB IOx to InfluxDB users. InfluxDB IOx is a new time series storage engine that combines several cutting-edge open source technologies from the Apache Software Foundation. Written in Rust, IOx uses Parquet for on-disk storage, Arrow for in-memory storage and communication, and DataFusion for querying. IOx focuses on unlimited cardinality and high-performance querying.

IoT and IIoT users will benefit from IOx because they will be able to use InfluxDB and its related suite of developer tooling for emerging operational use cases that rely on events, tracing, and other high-cardinality data, along with metrics. We’re eager to integrate this project into our existing platform so our IoT users can monitor any number of assets without worrying about the volume or variety of their data.

The arrival of IOx to our cloud platform will enable IoT and IIoT users to store, query, and analyze higher precision data and raw events in addition to more traditional metric summaries. In addition to the real-time replication currently enabled from the edge with Telegraf and InfluxDB 2.0, IOx will enable bulk replication of Parquet files for settings where the edge may not have real-time connectivity. Users working with machine learning libraries in Python will find it easier to connect to and retrieve data at scale for training and predictions because of IOx’s support for Apache Arrow Flight.
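As a rough sketch of what retrieving data over Apache Arrow Flight could look like from Python with pyarrow: the endpoint and the ticket contents below are assumptions for illustration only, since the exact IOx query interface isn’t described here.

```python
import json
from pyarrow import flight

# Placeholder endpoint; a real deployment would use its own host and auth.
client = flight.FlightClient("grpc+tcp://influxdb.example.com:443")

# Hypothetical ticket payload naming a database and an SQL query.
ticket = flight.Ticket(json.dumps({
    "database": "factory",
    "sql_query": "SELECT * FROM machine_temperature",
}).encode())

reader = client.do_get(ticket)   # streams Arrow record batches
table = reader.read_all()        # a pyarrow.Table
df = table.to_pandas()           # ready for ML training or predictions
```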

Qx. Anything else you wish to add?

Paul Dix: The big takeaway is we’re really excited about the many applications for time series in IoT. Regardless of industry, time series is transforming our ability to understand the activities and output of people, processes and technologies impacting businesses. Nowhere is this more apparent than in IoT or industrial settings.

…………………………………………………….

Paul Dix is the creator of InfluxDB. He has helped build software for startups, large companies, and organizations like Microsoft, Google, McAfee, Thomson Reuters, and Air Force Space Command. He is the series editor for Addison-Wesley’s Data & Analytics book and video series. In 2010 Paul wrote the book Service-Oriented Design with Ruby and Rails for Addison-Wesley. In 2009 he started the NYC Machine Learning Meetup, which now has over 7,000 members. Paul holds a degree in computer science from Columbia University.

Resources

InfluxData Announces New Customers and Accelerated Momentum in Industrial Data and Internet of Things, February 15, 2022 

Related Posts

On IoT and Time Series Databases. Q&A with Brian Gilmore. ODBMS.org, October 18, 2021.

Follow us on Twitter: @odbmsorg

##