AI security is a data management problem

Jan. 23, 2025
Implementing a modern data architecture that prioritizes robust, continuous security is the most effective way to launch a data strategy.

Over the last few years, the volume and variety of data that organizations have access to have exploded, with 64% of organizations managing at least one petabyte (1 million gigabytes, or 1,000 terabytes) of data and 41% working with at least 500 quadrillion bytes (500 petabytes). This data is essential for decision-making, and enterprise AI makes its critical insights accessible. By feeding this plethora of data into AI solutions such as large language models (LLMs) or agentic workflows, enterprises unlock a whole new level of transformation.

However, despite AI's various benefits, its inherent threats cannot be overlooked. If not appropriately governed, AI can lead to non-compliance with privacy regulations, misuse of proprietary data, data leakage, or increased risk of cyberattacks – all of which could carry substantial legal, financial, and reputational repercussions. According to a recent IBM survey, data breaches at organizations with high levels of non-compliance carry an average cost of $5.05 million, 12.6 percent above the overall average.

Data Security in the Age of AI

AI models rely on large volumes and varieties of data to learn patterns, make predictions, and improve decision-making over time. However, once data has been fed into an AI model, it cannot simply be removed or "untrained." This is particularly dangerous with public models, where users remain unaware of how much private information they ultimately share with an ungoverned solution. In March 2024, 27.4% of the corporate data that employees put into AI tools was sensitive, up from 10.7% a year earlier. This trend presents serious risks, because most organizations do not yet govern those actions to ensure safe, secure, and compliant practices.

Organizations need to be extremely cautious that the right data is being put into these models, as using incomplete or unapproved data could result in biased or incorrect outputs and a slew of data privacy violations. Regulations such as GDPR and the EU AI Act safeguard personal data and require explicit consent before AI models can use it. Violations most often occur when enterprises use customer data to train models without the customers' knowledge, which can have severe legal and reputational repercussions. One company was fined upward of $50 million for not complying with consumer privacy laws in its use of AI for advertising targeting.

On top of the risks of using AI in the enterprise, businesses must also safeguard themselves against external factors, such as bad actors who now use AI to their benefit. Seventy-five percent of IT leaders surveyed by ISC2 are moderately to highly concerned that AI will be used for cyberattacks or other malicious activities. AI makes it easier for hackers to attack enterprises – helping them identify and exploit vulnerabilities in systems; create highly convincing phishing emails, fake websites, and even voice simulations; and build malware that adapts to bypass traditional signature-based detection methods.

With these looming threats, enterprises must take a data privacy and security-first mindset across their departments in 2025. This begins with looking at privacy as a core business function rather than a box to check off. Ingraining privacy within the business, rather than viewing it as an add-on, will help provide complete visibility into an organization's data stack, ensuring better controls and protection of enterprise data as it is used for AI.

Privacy and Security-First Mindset

Establishing data privacy as a business process doesn’t happen overnight. An organization must take several steps to transition to this mindset effectively.

The first step enterprises should take is defining a comprehensive data strategy that includes plans for managing data wherever it lives, for instance in a hybrid cloud environment. An organization's data strategy should outline exactly how it will adhere to a "privacy by design" approach, especially when leveraging AI.

For starters, "privacy by design" refers to a framework that proactively embeds privacy into the design specifications of information technologies, networked infrastructure, and business practices. To apply it effectively, organizations must adhere to seven foundational principles that reduce the risk of privacy infractions and data breaches:

  1. Proactive, not reactive: Privacy by design anticipates risks and prevents privacy invasion before it occurs. It comes “before the fact, not after.”
  2. Privacy as default: Privacy by design ensures that personal data is automatically protected in any IT system or business practice as the default. In other words, privacy is built into the system.
  3. Privacy embedded into design: Privacy measures must be essential, integral components of the core functionality, not bolted on as add-ons to the design and architecture of IT systems and business practices.
  4. Full functionality—positive-sum, not zero-sum: Privacy by design avoids “either-or” dichotomies, such as privacy vs. security, where unnecessary trade-offs occur. It demonstrates that it is possible to have both.
  5. End-to-end security—full lifecycle protection: Privacy by design extends security throughout the lifecycle of the data involved, from collection and use to destruction or removal, with security embedded into the system's design from the start.
  6. Visibility and transparency: In a “trust but verify” approach, privacy by design ensures the data subject is fully aware of personal data being collected and why. All component parts remain transparent to users and providers. Data collected and stored should have a valid purpose that benefits the customer.
  7. Respect for user privacy: The goal of user-centered privacy requires architects and operators to prioritize the individual's interests by offering strong privacy defaults, appropriate notice, and empowering, user-centric options.
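Principle 2, "privacy as default," can be made concrete in code. The following is a minimal Python sketch (all names are hypothetical, not from any particular product) in which a record's PII stays masked unless a caller explicitly opts in:

```python
from dataclasses import dataclass

def mask(value: str, visible: int = 2) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]

@dataclass
class CustomerRecord:
    customer_id: str
    email: str
    phone: str

    def view(self, unmask: bool = False) -> dict:
        """Return a dict view of the record; PII is masked by default."""
        if unmask:  # explicit opt-in is required to see raw values
            return {"id": self.customer_id, "email": self.email,
                    "phone": self.phone}
        return {"id": self.customer_id,
                "email": mask(self.email, visible=4),
                "phone": mask(self.phone, visible=2)}

record = CustomerRecord("C-1001", "jane@example.com", "5551234567")
print(record.view())             # PII masked by default
print(record.view(unmask=True))  # raw values only on explicit request
```

The design choice is the point: the safe behavior is the zero-argument path, so forgetting a flag leaks nothing.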

Once “privacy by design” is established, organizations should then work through an audit of their data, analyzing and articulating what data the organization stores, where it’s stored, how it is being used, when, by whom, what level of permission they have from the data subject for usage, and how to delete or mask that data upon request. These steps are critical in the age of AI; data needs to be labeled, and the correct permissions must be obtained to determine if it is safe to use within an AI model.
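The audit questions above map naturally onto a data inventory. A minimal sketch, with hypothetical dataset and purpose names, of an inventory entry that records where data lives, whether it contains PII, and what the data subject consented to:

```python
from dataclasses import dataclass

@dataclass
class InventoryEntry:
    dataset: str
    location: str        # where the data is stored, e.g. an object-store path
    contains_pii: bool   # labeled so it can be masked or deleted on request
    consent: set         # purposes the data subject has approved

    def approved_for(self, purpose: str) -> bool:
        """A dataset may feed an AI model only if consent covers that use."""
        return purpose in self.consent

entry = InventoryEntry("customers", "s3://warehouse/customers",
                       contains_pii=True, consent={"analytics"})
print(entry.approved_for("analytics"))       # True: consent was granted
print(entry.approved_for("model_training"))  # False: no consent for training
```

Even this toy version shows why labeling matters: without the `consent` field, there is no programmatic way to decide whether a dataset is safe to use in a model.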

To elevate this approach, many enterprises are modernizing their data architectures to support governance efforts. The most robust solutions are designed with multiple layers of security to protect against threats such as unauthorized access, data breaches, and cyberattacks. An organization without an efficient, up-to-date data architecture cannot deploy these enhanced protection protocols – data encryption, multi-factor authentication, data masking, audit logs, disaster recovery plans, and more – all of which are beneficial in the case of an AI-driven cyberattack or accidental misuse of data in an AI system. Furthermore, this type of architecture allows companies to bring AI to their data – rather than vice versa – by adding AI enhancements to their data platform, reducing the risk of data leakage or misuse by an AI model.
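One of those layers, the audit log, is easy to illustrate. A hedged sketch (function and user names are invented for illustration) of a decorator that records who touched which dataset, so every access remains traceable:

```python
import datetime
from functools import wraps

AUDIT_LOG = []  # in practice this would be durable, append-only storage

def audited(fn):
    """Wrap a data-access function so every call is logged before it runs."""
    @wraps(fn)
    def wrapper(user, dataset, *args, **kwargs):
        AUDIT_LOG.append({
            "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "who": user,
            "dataset": dataset,
            "action": fn.__name__,
        })
        return fn(user, dataset, *args, **kwargs)
    return wrapper

@audited
def read_dataset(user, dataset):
    return f"{user} read {dataset}"

read_dataset("analyst1", "claims_2024")
print(AUDIT_LOG[-1]["who"], AUDIT_LOG[-1]["action"])
```

Because the log entry is written before the access runs, even failed or malicious calls leave a trace for forensics after an AI-driven attack.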

This architecture also ensures that all of an enterprise’s data lives in one central location regardless of format or structure. Hybrid data lakehouses, in particular, provide centralized security management, allowing for consistent application of security policies across the entire dataset and ensuring that data access, encryption, and auditing can be uniformly enforced, reducing the chances of security gaps. They can also be integrated with robust data governance frameworks, enabling organizations to implement strict controls over data lineage, ownership, and usage. The benefits provided by this type of architecture enhance security by ensuring that only authorized data owners can modify or use data and that all actions are traceable.
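Centralized policy enforcement can be sketched as a single policy table consulted by every access path, rather than per-tool rules that can drift apart. Role and dataset names below are hypothetical:

```python
# One central policy table, as in a lakehouse's unified security layer:
# every query engine consults the same rules, so enforcement is uniform.
POLICIES = {
    ("data_owner", "sales"): {"read", "write"},
    ("analyst",    "sales"): {"read"},
}

def check_access(role: str, dataset: str, action: str) -> bool:
    """Deny by default: only actions explicitly granted to the role pass."""
    return action in POLICIES.get((role, dataset), set())

print(check_access("analyst", "sales", "write"))     # False
print(check_access("data_owner", "sales", "write"))  # True
```

The deny-by-default lookup is what closes the "security gaps" the text mentions: a role or dataset missing from the table gets no access at all, rather than whatever a forgotten per-tool configuration happened to allow.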

Looking forward

Enterprise AI can't, and shouldn't, be avoided. It provides incredible benefits such as improved efficiency, enhanced data-driven insights and decision-making, and a more personalized customer experience. However, the risks of this technology – particularly to enterprise data – cannot be ignored. If enterprises truly value their data and the benefits it can provide, they must protect it with a thoughtful, comprehensive data strategy – one that accounts for external threats like bad actors and AI-driven attacks, as well as internal risks like in-house AI development and the shifting compliance landscape. The most effective way to kickstart that strategy is to adopt a modern data architecture with robust, continuous security and monitoring measures that ensure data security and quality.

About the Author

Carolyn Duby | Field CTO and Cybersecurity Lead at Cloudera

Carolyn Duby is a cybersecurity and data expert who drives conversations on some of today's most important tech topics, including hybrid data cloud, big data, privacy, becoming a data-driven organization, and careers in tech. With more than 30 years of experience in the industry, she currently serves as Field CTO and Cybersecurity Lead at Cloudera, guiding companies in banking, insurance, healthcare and more through complex transformations that help them use data as a strategic asset to impact their bottom line.