The 10 Hottest Big Data Tools Of 2024

Here’s a look at the 10 hottest big data tools of 2024 including Databricks Apps, EDB Postgres AI, Qlik Talend Cloud and ThoughtSpot Spotter.


Data management, already of significant importance for operational and business intelligence purposes, has taken on a new level of priority for businesses and organizations as the wave of AI technology development and adoption pushes demands for data to new heights.

The global “datasphere” – the total amount of data created, captured, replicated and consumed – is growing at more than 20 percent a year and is forecast to reach approximately 291 zettabytes in 2027, according to market researcher IDC.

But wrangling all that data – collecting and managing it and preparing it for analytical and AI tasks – is a formidable challenge at this scale. That’s driving demand for new big data tools and technologies – from both established IT vendors and startups – to help businesses access, collect, manage, move, transform, analyze, understand, measure, govern, maintain and secure all this data.

What follows is a look at 10 cool big data tools, designed to help customers more effectively carry out all these big data chores, that caught our attention in 2024. They include next-generation databases, data management tools and data analytics software. Some are entirely new products recently introduced by startups or established vendors, while others are products that have undergone significant upgrades or offer new ground-breaking capabilities.

Apache DataFusion

The Apache Software Foundation describes the open-source DataFusion as “a fast, extensible query engine for building high quality, data-centric systems” such as databases, dataframe libraries, and machine learning and streaming applications.

DataFusion can be used as an embedded SQL engine or customized and used as a foundation for building new systems with a focus on high-throughput, low-latency analytical, streaming and transaction workloads.
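
To picture the embedded use case, here is a minimal sketch using DataFusion’s Python bindings (the datafusion package); the sales.csv file and its columns are hypothetical stand-ins for illustration, not anything shipped with DataFusion.

```python
from datafusion import SessionContext

# Entry point for running queries with an embedded DataFusion engine
ctx = SessionContext()

# Register a local CSV file as a queryable table
# (file name and columns are hypothetical)
ctx.register_csv("sales", "sales.csv")

# Standard SQL, executed over Apache Arrow columnar batches
ctx.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""").show()
```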

DataFusion leverages the technology capabilities of Apache Arrow, a language-agnostic framework for building data analytics applications that process columnar data, and the Rust programming language.

In June the Apache Software Foundation, which has been developing DataFusion since 2019 as part of the Apache Arrow project, said DataFusion is now designated as a Top-Level Project “to provide more focused governance capacity for continued growth.”

DataFusion is available for download from the Apache Software Foundation website, GitHub, and other sites under the Apache 2.0 License.

Databricks Apps

In October Databricks unveiled Databricks Apps, a new set of development capabilities that the company says provide a fast way to natively build and deploy internal data-intensive analytical and AI applications directly on the Databricks Data Intelligence Platform.

The development services are particularly geared toward developing custom software such as AI applications, analytical applications, data visualization dashboards, self-service analytics capabilities and data quality monitoring software.

“Our mission is to democratize data and AI. And as part of the data intelligence platform, we're building a platform that lets every customer get value from their data and from their investment,” said Shanku Niyogi, Databricks vice president of product management, in an interview with CRN.

Databricks Apps is initially focused on Python, the top programming language for data-intensive applications. Databricks Apps enables developers to build apps natively in Databricks using such tools as Visual Studio Code and PyCharm and popular Python frameworks, such as Dash, Shiny, Gradio, Streamlit and Flask.
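
As a rough illustration of that developer experience, the sketch below shows the shape of a small Streamlit data app of the kind Databricks Apps is meant to host; the environment variables and sample table are assumptions for illustration, and the query runs through the separately installed databricks-sql-connector package.

```python
import os
import pandas as pd
import streamlit as st
from databricks import sql  # databricks-sql-connector package

st.title("Trip Volume by Pickup ZIP")

# Connection details are assumed to arrive via environment variables;
# a deployed app would get its credentials from the platform
with sql.connect(
    server_hostname=os.environ["DATABRICKS_HOST"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as conn:
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT pickup_zip, COUNT(*) AS trips "
            "FROM samples.nyctaxi.trips GROUP BY pickup_zip"
        )
        rows = cursor.fetchall()

df = pd.DataFrame(rows, columns=["pickup_zip", "trips"])
st.bar_chart(df, x="pickup_zip", y="trips")
```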

Databricks Apps also makes it possible to incorporate AI components within applications, so developers can call specific AI models when they need more flexibility, according to the Databricks blog post announcing the offering.

DataPelago

Startup DataPelago exited stealth in October, unveiling what the company describes as the world’s first “universal data processing engine” that can handle the complexity and volume of today’s data for “accelerated computing” analytical and artificial intelligence workloads.

“Data is changing, the applications are changing and, most importantly, [IT] infrastructure is changing. When you have three different disruptive trends coming all together, it requires you to step back and see what the next world looks like and what should be the data processing platform,” said Rajan Goyal, DataPelago co-founder and CEO, in an interview with CRN.

DataPelago’s universal data processing engine, which is being used by some customers on a pilot/preview basis, is designed to overcome the performance, cost and scalability limitations of current-generation IT systems and meet the needs of what the company calls “the accelerated computing era.”

The startup’s universal engine was built from the ground up to support GenAI and data lakehouse analytics workloads by employing a hardware-software co-design approach, according to the company. The engine is designed to work with today’s data stacks including CPU-, GPU-, TPU- and FPGA-based hardware; data processing frameworks such as Spark, Trino and Apache Flink; multiple types of data stores; and data processing platforms such as Snowflake and data lakehouses like Databricks, Goyal said.

DataPelago says the unique architecture of its processing engine can process data one to two orders of magnitude faster than traditional query engines. According to the company, the engine is suited to resource-intensive use cases such as analyzing billions of transactions while ensuring data freshness, can support AI-driven models that detect threats at wire-line speeds across millions of consumer and data center endpoints, and provides a scalable platform for rapidly deploying training, fine-tuning and RAG inference pipelines.

EDB Postgres AI

EDB Postgres AI is EDB’s next-generation data platform that can tackle transactional, analytical and AI workloads.

Databases are traditionally developed to handle either transaction processing or data analytics tasks given that the two are very different processes with different technical requirements.

EDB Postgres AI, which debuted in May, is designed to provide unified capabilities that can handle transaction processing, data analytics, AI and machine learning applications on a single data platform.

EDB (previously “EnterpriseDB”) has long offered an Oracle-compatible database based on the open-source Postgres database. Earlier this year EDB CEO Kevin Dallas told CRN about the company’s plans to expand its database into a comprehensive data and AI platform.

In addition to its ability to handle different types of processing workloads, EDB Postgres AI is also extremely flexible in that it can be deployed in the cloud, as on-premises software and on physical appliances – all powered by the same Postgres engine. Key capabilities include rapid analytics for transactional data, intelligent observability for on-premises and cloud databases, support for vector databases and continuous high availability.
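
The vector-database support mentioned above can be pictured with ordinary Postgres SQL. The sketch below uses pgvector-style syntax through the psycopg driver; the connection string, table and embeddings are hypothetical, and the exact extension packaging in EDB Postgres AI may differ.

```python
import psycopg  # psycopg 3 driver

# Connection string is a hypothetical placeholder
with psycopg.connect("postgresql://user:pass@localhost/edb_demo") as conn:
    with conn.cursor() as cur:
        # pgvector-style extension with a 3-dimensional embedding column
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute(
            "CREATE TABLE IF NOT EXISTS docs "
            "(id serial PRIMARY KEY, embedding vector(3))"
        )
        cur.execute(
            "INSERT INTO docs (embedding) VALUES "
            "('[1,0,0]'), ('[0.9,0.1,0]'), ('[0,0,1]')"
        )
        # Nearest-neighbor search by Euclidean distance (the <-> operator)
        cur.execute(
            "SELECT id FROM docs ORDER BY embedding <-> '[1,0,0]' LIMIT 2"
        )
        print(cur.fetchall())
```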

Just this month the company updated EDB Postgres AI with a new hybrid control plane for centralized control with real-time observability, an AI accelerator for developing production-ready AI applications, and an analytics accelerator for high-performance analytics across data tiers.

MotherDuck

MotherDuck, a serverless data analytics platform based on the open-source DuckDB database, became generally available on June 11.

Early editions of the much-anticipated software had been in various trial stages for about a year.

The company is pitching the new software as an easy-to-use data analytics tool that does not need complex supporting data infrastructure. Making use of advancements in computer hardware performance, the DuckDB-MotherDuck combo can process large amounts of data on a single machine and meet the needs of 99 percent of users who don’t require a complex petabyte-scale system.

MotherDuck assigns compute resources to each user, cutting costs and simplifying administration, and utilizes local compute resources through hybrid cloud-local “dual-execution” queries. Businesses and organizations can avoid spending huge sums for data infrastructure usually needed only for extremely high-performance data analysis tasks, according to the company.
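
From Python, that workflow looks much like plain DuckDB. The sketch below uses the open-source duckdb package; the md: connection prefix is MotherDuck’s documented convention, while the database name and orders table are hypothetical.

```python
import duckdb

# "md:" attaches the session to MotherDuck instead of a local file;
# authentication comes from a motherduck_token environment variable
con = duckdb.connect("md:my_db")

# The same SQL can read cloud-resident tables or local files, and
# MotherDuck's dual execution decides where each part of the query runs
con.sql("""
    SELECT status, COUNT(*) AS n
    FROM orders
    GROUP BY status
    ORDER BY n DESC
""").show()
```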

Pinecone Vector Database

Vector databases have quickly risen in popularity because they speed up AI application development and improve the operation and accuracy of AI workloads by supplying large volumes of data on demand to the large language models (LLMs) that power GenAI systems.

Since startup Pinecone launched its namesake vector database in early 2021, originally with a focus on machine learning tasks, it has been widely adopted amidst the AI wave sweeping the IT industry.

This year Pinecone launched a serverless edition of its database (announced in January and generally available in May) which, according to Pinecone, “lets companies add practically unlimited knowledge to their GenAI applications.” The company said the serverless product provides up to a 50x cost reduction and eliminates IT infrastructure hassles.
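
A brief sketch of the serverless workflow with Pinecone’s Python SDK follows; the API key, index name, dimension and vectors are hypothetical placeholders.

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")  # hypothetical key

# A serverless index is declared by cloud and region rather than by
# provisioning capacity (name and dimension are hypothetical)
pc.create_index(
    name="docs-index",
    dimension=4,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("docs-index")
index.upsert(vectors=[
    ("doc-1", [0.1, 0.2, 0.3, 0.4]),
    ("doc-2", [0.4, 0.3, 0.2, 0.1]),
])

# Retrieve the nearest neighbors for a query embedding
print(index.query(vector=[0.1, 0.2, 0.3, 0.4], top_k=2))
```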

Just this month Pinecone debuted the Pinecone Knowledge Platform, which places the vector database at its core and adds integrated inference capabilities, including proprietary, fully managed embedding and reranking models, that further accelerate the development of “grounded” AI applications.

Qlik Talend Cloud

Data analytics and integration tech developer Qlik launched Qlik Talend Cloud, a data management platform based on technology stemming from Qlik’s 2023 acquisition of Talend.

Qlik Talend Cloud, built on the Qlik Cloud infrastructure, provides a unified package of data integration and data curation capabilities for building and deploying AI-augmented ELT (extract, load and transform) pipelines that deliver trusted data assets throughout an organization, according to Qlik’s descriptions of the new product.

“We’ve brought together all the capabilities that we acquired with Talend…and now fully integrated all the [Talend] tools and capabilities and the Qlik data integration capabilities into one platform,” Qlik CEO Mike Capone said when introducing the new system at the Qlik Connect event in June. “What that enables you to do is rapidly and easily build and deploy data pipelines, from raw data all the way through, with our capabilities of data integration, data lake management, data quality, data governance and data transformation.”

The platform delivers AI-augmented data integration with “extensive” data quality and data governance capabilities, according to the company. Its data engineering tools provide “a spectrum” of data transformation capabilities – from no-code to pro-code options – for creating AI-ready data for complex AI projects, the company said.

Qlik Talend Cloud also incorporates SaaS data connectivity functionality from Talend’s 2018 acquisition of startup Stitch, boosting the platform’s ability to work with diverse data sources. The platform also includes a curated data marketplace that simplifies data discovery and data sharing, and the Qlik Talend Trust Score for AI for assessing data health and quality for AI readiness.

Scoop Analytics

Startup Scoop Analytics emerged from stealth in June with its software for automating data reporting processes and developing AI-powered business intelligence presentations and reports.

The software makes it possible for anyone with spreadsheet skills to collect data from any application, blend data from different sources and use it to create “visually compelling data stories” through slide presentations based on live data, according to the company.

The Scoop platform collects or “scoops” application reports from a wide range of operational applications such as Salesforce. At the heart of the Scoop system is an advanced time series analysis engine that automatically creates a time series database, and workflows that move the data into a full-featured spreadsheet that’s used to blend, augment and query data and create dashboards, charts and graphs.

Co-founder and CEO Brad Peters says Scoop’s mission is to “deliver data analytics in a form factor that doesn’t require a data team” and achieve the long-time goal of true self-service business intelligence.

Starburst Galaxy Icehouse

Starburst launched Galaxy Icehouse, a fully managed data analytics service, in April. Built on the Starburst Galaxy cloud data lakehouse platform, Icehouse marked the company’s latest expansion of its data analytics offerings.

Galaxy Icehouse combines the open-source Trino distributed SQL query engine – the core of the Starburst platform – with the Apache Iceberg table format for analytic datasets to deliver a fully managed data lake service that the company says provides “powerful scalability, cost-effectiveness and query performance” without the burden and cost of building and maintaining a custom system and “without the risk of vendor lock-in.”

Starburst CEO and co-founder Justin Borgman emphasized the cost-performance benefits of Galaxy Icehouse compared to traditional cloud-based data warehouses such as Amazon Web Services’ Redshift. He also cited its effectiveness in handling compute-intensive data transformation and preparation workloads.

Galaxy Icehouse supports near real-time, petabyte-scale data ingestion into Iceberg managed tables, according to Starburst. Data and development teams can use SQL to prepare and optimize data for production analytics, the company said, along with the auto-tuning capabilities in Starburst Warp Speed to improve query performance.
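
From a client’s perspective, querying an Iceberg-managed table through the service looks like standard Trino SQL. Here is a minimal sketch using the open-source trino Python client; the hostname, credentials, catalog and table are hypothetical.

```python
import trino

# Connection details are hypothetical placeholders
conn = trino.dbapi.connect(
    host="example.galaxy.starburst.io",
    port=443,
    user="analyst@example.com",
    http_scheme="https",
    auth=trino.auth.BasicAuthentication("analyst@example.com", "<password>"),
    catalog="icehouse",
    schema="analytics",
)

cur = conn.cursor()
# Plain SQL over an Iceberg table; snapshots, file layout and
# metadata management are handled by the engine
cur.execute("""
    SELECT event_date, COUNT(*) AS events
    FROM web_events
    GROUP BY event_date
    ORDER BY event_date
""")
for row in cur.fetchall():
    print(row)
```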

ThoughtSpot Spotter

In November ThoughtSpot, a leading player in the AI and data analytics space, took the wraps off Spotter, an agentic AI analyst tool that the company said brings the analytical and reasoning skills of a data analyst to every business user.

Spotter enables users, irrespective of their technical capabilities or industry, to converse with it as they would with a human analyst and obtain actionable, self-service insights in natural, conversational language.

The new software can answer questions about enterprise structured data, no matter how big or complex the dataset or where the data resides, and carries context from one question to the next to create a conversational experience.

ThoughtSpot said Spotter is capable of learning the language and context of any industry and continuously improves its outputs based on human feedback.

The software integrates within users’ preferred platforms and can be embedded within existing applications – such as Salesforce and ServiceNow – and digital productivity tools. It’s compatible with any cloud platform and leading large language models (including OpenAI GPT and Google Gemini).