The 10 Coolest Big Data Tools Of 2024 (So Far)

Here’s a look at 10 new, expanded and improved big data tools, platforms, systems and services that solution and service providers should be aware of.

Big Data, Cool Tools

Data management, already of significant importance for operational and business intelligence purposes, has taken on a new level of priority for businesses and organizations as the wave of AI technology development and adoption pushes demands for data to new heights.

The global “datasphere” – the total amount of data created, captured, replicated and consumed – is growing at more than 20 percent a year and is forecast to reach approximately 291 zettabytes by 2027, according to market researcher IDC.

But wrangling all that data – collecting and managing it, and preparing it for analytical and AI tasks – is a formidable challenge. That’s driving demand for new big data tools and technologies – from both established IT vendors and startups – to help businesses access, collect, manage, move, transform, analyze, understand, measure, govern, maintain and secure all this data.

What follows is a look at 10 cool big data tools designed to help customers more effectively carry out all these big data chores. They include next-generation databases, data management tools and data analytics software. Some are entirely new products recently introduced by startups or established vendors, while others are products that have undergone significant upgrades or offer ground-breaking new capabilities.

Anomalo

Anomalo’s data quality monitoring platform has been gaining market traction as artificial intelligence applications and large language models drive demand for not just more data, but more high-quality data.

The Anomalo system makes use of machine learning technology to monitor data quality and automatically detect and understand the root cause of data issues, including erroneous data and missing or incomplete data.

On June 12 Anomalo said it had expanded its platform’s data quality monitoring capabilities to include unstructured text data – a critical addition given that generative AI applications often make use of large amounts of text.

The Palo Alto, Calif.-based company raised $33 million in Series B funding in January. The company is closely allied with data lakehouse giant Databricks, which is an investor and in June named Anomalo its Emerging Partner of the Year.

Databricks LakeFlow And AI/BI

Data lakehouse platform developer Databricks launched a number of new software products and services at its Data + AI Summit in early June, with Databricks LakeFlow and Databricks AI/BI topping the list.

Databricks LakeFlow is a unified, intelligent data engineering system that combines all aspects of data engineering, including data ingestion, transformation and orchestration. Data teams can use LakeFlow to ingest data at scale from databases such as MySQL, Postgres and Oracle, and from enterprise applications such as Salesforce, Workday, SharePoint, NetSuite, Microsoft Dynamics and Google Analytics.

LakeFlow automates the deployment, operation and monitoring of data pipelines in production with CI/CD support, advanced workflows, and built-in data quality checks and data health monitoring. Key components include LakeFlow Connect for data ingestion, LakeFlow Pipelines for automating real-time data pipelines, and LakeFlow Jobs for orchestrating workflows across the Databricks Data Intelligence Platform.
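Databricks has said LakeFlow Pipelines builds on its declarative Delta Live Tables technology, so a pipeline can be sketched with that Python API. The following is a minimal illustration – the storage path, table names and quality rule are hypothetical, not drawn from Databricks documentation:

```python
import dlt  # Delta Live Tables module, available inside a Databricks pipeline
from pyspark.sql.functions import col

# `spark` is provided automatically by the Databricks pipeline runtime.

@dlt.table(comment="Raw orders incrementally ingested from cloud storage")
def raw_orders():
    # Auto Loader ("cloudFiles") picks up new files as they arrive
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/sales/orders_raw")  # hypothetical source path
    )

@dlt.table(comment="Cleaned orders behind a built-in data quality gate")
@dlt.expect_or_drop("valid_amount", "amount > 0")  # drop rows failing the check
def clean_orders():
    return dlt.read_stream("raw_orders").select(
        col("order_id"), col("customer_id"), col("amount")
    )
```

Because the pipeline is declarative, LakeFlow can handle dependency ordering, retries and monitoring rather than leaving that orchestration code to the data team.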

Databricks AI/BI is a next-generation business intelligence product that aims to bring data analytics and insights to a wider audience of business users. It includes Dashboards, an AI-powered, low-code tool for building and distributing interactive dashboards; and Genie, a conversational interface for using natural language for ad-hoc and follow-up queries. Dashboards and Genie are both powered by a compound AI system that learns from ongoing usage.

EDB Postgres AI

In May database provider EDB unveiled EDB Postgres AI, an intelligent database platform that’s capable of performing transactional, analytical and AI data workloads.

Database systems have traditionally been developed and used for either transaction processing or data analysis, given the very different data demands and database designs of each. The rapidly growing wave of AI applications, meanwhile, is creating its own set of unique demands for how data is managed and used.

EDB, which has long sold database software and related services based on the Postgres open-source database, said the new EDB Postgres AI system supports transactional, analytical, AI and machine learning workloads and applications – either running in the cloud, on premises or on a physical appliance.

Key capabilities of EDB Postgres AI include rapid analytics for transactional data, intelligent observability for managing and operationalizing data insights, support for vectorized data, continuous high availability for mission-critical applications, and the ability to modernize legacy database systems using EDB’s Oracle Compatibility Mode.
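EDB hasn’t detailed its vector APIs here, but vector support in the Postgres ecosystem typically comes via the open-source pgvector extension. Below is a minimal sketch of what storing and searching embeddings alongside transactional rows looks like in plain Postgres – the connection string, table and toy three-dimensional vectors are hypothetical:

```python
import psycopg2

conn = psycopg2.connect("dbname=appdb user=app")  # hypothetical connection
cur = conn.cursor()

# Enable vector support and keep embeddings next to ordinary
# transactional columns in the same Postgres table.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS docs (
        id bigserial PRIMARY KEY,
        body text,
        embedding vector(3)  -- toy dimension for illustration
    )
""")
cur.execute("INSERT INTO docs (body, embedding) VALUES (%s, %s)",
            ("hello world", "[0.1, 0.2, 0.3]"))

# Nearest-neighbor search by cosine distance (pgvector's <=> operator)
cur.execute("SELECT body FROM docs ORDER BY embedding <=> %s LIMIT 5",
            ("[0.1, 0.2, 0.3]",))
print(cur.fetchall())
conn.commit()
```

The appeal of this pattern is that transactional, analytical and vector queries all run in one system rather than across separate specialized databases.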

MongoDB AI Applications Program

Pulling together all the technology components to assemble a complete IT and data stack for building and running generative AI applications is a challenge.

In May MongoDB debuted the MongoDB AI Applications Program (MAAP), which provides a complete technology stack, services and other resources to help businesses develop and deploy applications with advanced generative AI capabilities at scale.

More of a program and blueprint than a product, MAAP – with the MongoDB Atlas cloud database and development platform at its core – includes reference architectures and technology from a who’s who of the AI space, including the cloud platform giants, LLM (large language model) developers Cohere and Anthropic, and a number of AI development tool companies including Fireworks.ai, LlamaIndex and Credal.ai.

The program also includes strategic advisory and professional services from MongoDB and a number of the company’s systems integrator and consulting partners. MAAP is scheduled to be generally available in July.
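At the core of a MAAP-style stack is retrieval: embeddings stored in Atlas and queried through Atlas Vector Search. Here is a minimal sketch using pymongo’s $vectorSearch aggregation stage – the index name, field names and embedding values are illustrative, not part of MAAP itself:

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://...")  # Atlas connection string (elided)
coll = client["genai"]["documents"]

# Find the stored documents whose embeddings are closest to the query
# embedding, via an Atlas Vector Search index named "vector_index"
# (the index and the 1,536-dim vectors here are hypothetical).
query_embedding = [0.02] * 1536  # normally produced by an embedding model

results = coll.aggregate([
    {
        "$vectorSearch": {
            "index": "vector_index",
            "path": "embedding",
            "queryVector": query_embedding,
            "numCandidates": 100,
            "limit": 5,
        }
    },
    {"$project": {"text": 1, "score": {"$meta": "vectorSearchScore"}}},
])
for doc in results:
    print(doc["text"], doc["score"])
```

The retrieved documents would then be passed as context to whichever LLM the developer has chosen from the MAAP partner roster.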

MotherDuck

MotherDuck, a serverless data analytics platform based on the open-source DuckDB database, became generally available on June 11. Early editions of the much-anticipated software have been in various trial stages for about a year.

The company is pitching the new software as an easy-to-use data analytics tool that does not need complex supporting data infrastructure. Making use of advancements in computer hardware performance, the DuckDB-MotherDuck combo can process large amounts of data on a single machine and meet the needs of 99 percent of users who don’t require a complex petabyte-scale system.

MotherDuck assigns compute resources to each user, cutting costs and simplifying administration, and utilizes local compute resources through hybrid cloud-local “dual-execution” queries. Businesses and organizations can avoid spending huge sums for data infrastructure usually needed only for extremely high-performance data analysis tasks, according to the company.
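The “dual-execution” model is visible right in the SQL: one query against a MotherDuck connection can join a cloud-resident table with a local file. Below is a minimal sketch using the duckdb Python client – the database, table and file names are hypothetical:

```python
import duckdb

# An "md:" path attaches a MotherDuck cloud database; authentication
# comes from the MOTHERDUCK_TOKEN environment variable.
con = duckdb.connect("md:my_db")  # "my_db" is an illustrative database name

# One statement joins a cloud-resident table against a local Parquet file;
# MotherDuck's dual execution decides where each part of the plan runs.
con.sql("""
    SELECT c.name, count(*) AS orders
    FROM my_db.orders AS o
    JOIN 'local_customers.parquet' AS c ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY orders DESC
    LIMIT 10
""").show()
```

Everything else – provisioning, scaling, sharing – stays on the serverless side, which is the simplicity the company is pitching.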

Seattle-based MotherDuck was co-founded in 2022 by Google BigQuery founding engineer Jordan Tigani, now MotherDuck CEO. The company raised $52.5 million in Series B funding in September 2023.

Pinecone Serverless

The Pinecone Serverless vector database, unveiled in January and generally available in May, is designed to help businesses develop generative AI applications that are fast, accurate and scalable.

The large language models that power generative AI software require huge volumes of data to operate. Pinecone says that making more data or “knowledge” available to LLMs improves the quality of the answers generated by GenAI applications. But storing and searching through sufficient amounts of vector data on-demand is a challenge.

The new Pinecone Serverless database, which builds on the company’s core vector database that launched in 2021, “lets companies add practically unlimited knowledge to their GenAI applications,” Pinecone says. It eliminates the need for developers to provision or manage infrastructure and allows them to use any LLM of their choice.

The Pinecone Serverless architecture separates data reads, writes and storage and includes a multi-tenant compute layer for efficient on-demand data retrieval for thousands of users, according to Pinecone. The database uses new indexing and retrieval algorithms for fast and memory-efficient vector search from blob storage without sacrificing retrieval quality.
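With the serverless design, creating an index comes down to naming a cloud, a region and an embedding dimension. Here is a minimal sketch using the current Pinecone Python client – the index name, dimension and toy vectors are illustrative:

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder key

# Creating a serverless index: no pods or capacity to provision,
# just a cloud, a region and the embedding dimension.
pc.create_index(
    name="genai-knowledge",  # illustrative index name
    dimension=1536,          # must match the embedding model used
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("genai-knowledge")
index.upsert(vectors=[("doc-1", [0.01] * 1536, {"source": "faq"})])
matches = index.query(vector=[0.01] * 1536, top_k=3, include_metadata=True)
print(matches)
```

Storage and compute scale independently behind this API, which is what lets usage-based pricing replace capacity planning.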

Qlik Talend Cloud And Qlik Answers

In June Qlik launched Qlik Talend Cloud, a new data management platform incorporating technology stemming from Qlik’s 2023 acquisition of Talend. It is slated for general availability later this summer.

Qlik Talend Cloud is built on the Qlik Cloud infrastructure and provides a unified package of data integration and data curation capabilities for building and deploying AI-augmented ELT (extract, load and transform) pipelines that deliver trusted data assets throughout an organization.

The platform delivers AI-augmented data integration with “extensive” data quality and data governance capabilities, according to the company. Its data engineering tools provide a spectrum of data transformation capabilities – from no-code to pro-code options – for creating AI-ready data for complex AI projects.

Qlik Talend Cloud incorporates SaaS data connectivity functionality from Talend’s 2018 acquisition of startup Stitch, boosting the platform’s ability to work with diverse data sources. The platform also includes a curated data marketplace that simplifies data discovery and data sharing, and the Qlik Talend Trust Score for AI for assessing data health and quality for AI readiness.

Also in June Qlik launched Qlik Answers, a new generative AI assistant for utilizing unstructured data. Qlik Answers, which incorporates technology from the company’s Kyndi acquisition in January, is an out-of-the-box, generative AI-powered knowledge assistant for searching, accessing and utilizing unstructured data from a broad range of sources including PDF and Word documents, webpages and Microsoft SharePoint.

Scoop Analytics

Startup Scoop Analytics just emerged from stealth with its new business analytics tool that’s powered entirely by spreadsheets.

The software makes it possible for anyone with spreadsheet skills to collect data from any application, blend data from different sources and use it to create “visually compelling data stories” through slide presentations based on live data, according to the company.

The Scoop platform collects or “scoops” application reports from a wide range of operational applications such as Salesforce. At the heart of the Scoop system is an advanced time series analysis engine that automatically creates a time series database, and workflows that move the data into a full-featured spreadsheet that’s used to blend, augment and query data and create dashboards, charts and graphs.

Co-founder and CEO Brad Peters says Scoop’s mission is to “deliver data analytics in a form factor that doesn’t require a data team” and achieve the long-time goal of true self-service business intelligence.

San Francisco-based Scoop Analytics was founded by Peters and others who previously worked at business analytics software developer Birst. The company officially launched June 18 with $3.5 million in seed funding from Ridge Ventures, Industry Ventures and Engineering Capital.

Starburst Galaxy Icehouse

In April Starburst launched Galaxy Icehouse, a fully managed data analytics service based on its Galaxy cloud data lakehouse platform combined with the Apache Iceberg table format for managing huge analytic datasets.

Starburst offers the new lakehouse platform as a high-performance, more cost-effective alternative to competing cloud data lake and data warehouse services for data management, data analytics and AI tasks.

Galaxy Icehouse combines the open-source Trino SQL query engine – the core of the Starburst platform – with the Apache Iceberg data table format to create a fully managed data lake service that the company says provides “powerful scalability, cost-effectiveness and query performance.”

The service helps organizations avoid the burden and cost of building and maintaining a custom data lake system “without the risk of vendor lock-in.” The company also touted the service’s effectiveness in handling compute-intensive data transformation and preparation workloads.

Galaxy Icehouse supports near real-time, petabyte-scale data ingestion into Iceberg-managed tables. Data and development teams can use SQL to prepare and optimize data for production analytics, the company said, and can apply the auto-tuning capabilities of Starburst Warp Speed to improve query performance.
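Because Icehouse is essentially Trino plus Iceberg, working with it looks like ordinary SQL over a Trino connection. Below is a minimal sketch using the open-source trino Python client – the host, credentials, catalog and table names are hypothetical:

```python
import trino

# Connect to a Starburst Galaxy cluster with the standard Trino client
# (host, user and catalog/schema names here are illustrative).
conn = trino.dbapi.connect(
    host="example.galaxy.starburst.io",
    port=443,
    http_scheme="https",
    user="analyst@example.com",
    auth=trino.auth.BasicAuthentication("analyst@example.com", "password"),
    catalog="iceberg_lake",
    schema="sales",
)
cur = conn.cursor()

# Create an Iceberg table and query it with plain SQL; the managed
# service handles the underlying files, compaction and metadata.
cur.execute("""
    CREATE TABLE IF NOT EXISTS daily_orders
    WITH (format = 'PARQUET') AS
    SELECT order_date, count(*) AS orders
    FROM raw_orders
    GROUP BY order_date
""")
cur.execute("SELECT * FROM daily_orders ORDER BY orders DESC LIMIT 10")
print(cur.fetchall())
```

Since the tables live in the open Iceberg format, other Iceberg-compatible engines can read the same data, which underpins the “no vendor lock-in” claim.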

SurrealDB

SurrealDB is a multi-model database that allows developers to use multiple techniques to store and model data.

While SurrealDB can use tables similarly to relational databases, it is a document database at its core and offers the additional functionality and flexibility found in other NoSQL or “NewSQL” databases.

Developers are adopting multi-model databases to quickly adapt to different data requirements and reduce the need to operate multiple database systems, according to SurrealDB. The company says “a rapidly growing number” of businesses and organizations are consolidating multiple, disparate databases onto SurrealDB to simplify data management operations, maintain data consistency and cut expenses.
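The multi-model idea is easiest to see in code: the same database accepts document-style writes and graph-style relations. Here is a minimal sketch using SurrealDB’s async Python SDK as documented around the 1.x releases – connection details and records are illustrative:

```python
import asyncio
from surrealdb import Surreal

async def main():
    # Connection details, namespace and records are all illustrative.
    async with Surreal("ws://localhost:8000/rpc") as db:
        await db.signin({"user": "root", "pass": "root"})
        await db.use("test", "test")

        # Document-style writes...
        await db.create("company:surrealdb", {"name": "SurrealDB"})
        await db.create("person:tobie", {"name": "Tobie", "role": "CEO"})

        # ...and a graph-style relation plus traversal on the same data,
        # using SurrealQL's RELATE statement and arrow syntax.
        await db.query("RELATE person:tobie->works_at->company:surrealdb")
        result = await db.query(
            "SELECT ->works_at->company.name FROM person:tobie"
        )
        print(result)

asyncio.run(main())
```

One engine handling both shapes of data is the consolidation story SurrealDB is selling.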

After numerous beta releases, SurrealDB 1.0.0 debuted on September 13, 2023. Release 1.5.3, the current production edition, arrived on June 14, while SurrealDB 2.0.0-alpha.3 followed on June 25.

In addition to the database, SurrealDB offers Surrealist, a management application for working with schema and data. The company also just announced beta access to Surreal Cloud.

SurrealDB, based in London, said on June 18 it had raised $20 million in a new funding round.