AIML-04

Is your ML training data monitored and audited?

Explanation

This question is asking whether your organization has processes in place to monitor and audit the data used to train your machine learning (ML) models.

What it means: Monitoring and auditing ML training data involves tracking where the data comes from, who has access to it, how it is modified, and ensuring its integrity throughout the ML lifecycle. This includes maintaining logs of data access, changes, and usage in training processes.

Why it's being asked: Training data is the foundation of ML models, and compromised or biased training data can lead to serious security and ethical issues:

1. Data poisoning attacks can occur when adversaries manipulate training data to introduce backdoors or biases
2. Sensitive information in training data could lead to privacy breaches or model outputs that leak confidential information
3. Regulatory compliance (like GDPR or HIPAA) often requires tracking data lineage and usage
4. Biased or low-quality training data can result in discriminatory or unreliable AI systems

How to best answer: Describe your specific processes for:

- Data provenance tracking (where data comes from and its chain of custody)
- Access controls for training data
- Monitoring systems that detect unusual access or modifications
- Regular auditing procedures and their frequency
- Tools used for monitoring data quality and integrity
- Documentation practices for data changes

Be specific about technologies and processes rather than making general claims, and include information about who is responsible for these monitoring and auditing functions within your organization.
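To make the provenance-tracking and audit-logging ideas above concrete, here is a minimal Python sketch of the kind of record such a system might keep. Everything in it (the log path, the `record_dataset_event` helper, the `train.csv` file) is illustrative rather than a prescribed implementation:

```python
import hashlib
import json
import time
from pathlib import Path

# Illustrative log location; a real deployment would ship these events to
# centralized, append-only storage (e.g. a SIEM) rather than a local file.
AUDIT_LOG = Path("ml_data_audit.jsonl")


def sha256_of_file(path: Path) -> str:
    """Hash a dataset file in chunks so large files don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_dataset_event(path: Path, actor: str, action: str) -> None:
    """Append one provenance record: who touched which dataset, when, and its hash."""
    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "actor": actor,
        "action": action,  # e.g. "read", "modified", "used-in-training"
        "dataset": str(path),
        "sha256": sha256_of_file(path),
    }
    with AUDIT_LOG.open("a") as log:
        log.write(json.dumps(event) + "\n")


# Example: log that a (hypothetical) train.csv was consumed by a training run.
record_dataset_event(Path("train.csv"), actor="alice", action="used-in-training")
```

Recording the dataset's hash with every event is what later lets an auditor confirm that the bytes used in a training run match the bytes that were approved.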

Guidance

Looking for how you reduce the risk of your training data being compromised.

Example Responses

Example Response 1

Yes, our ML training data is comprehensively monitored and audited through multiple layers of controls. We maintain a data provenance system that tracks the origin, transformations, and usage of all training datasets. Access to training data is restricted through role-based access controls, and all access events are logged in our SIEM system. We employ automated data quality monitoring tools that continuously scan for anomalies, drift, or potential poisoning attempts. Our Data Governance team conducts quarterly audits of all training datasets, reviewing access logs, transformation history, and data quality metrics. Additionally, we use cryptographic hashing to verify data integrity throughout the ML pipeline and maintain immutable audit logs of all data modifications. These processes are documented in our ML Data Governance Policy, which is reviewed annually and following any security incidents.
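Response 1 mentions cryptographic hashing to verify data integrity through the pipeline. A minimal sketch of what such a check could look like is below; the `APPROVED_HASHES` registry and its placeholder digest are hypothetical, standing in for hashes recorded when each dataset was approved:

```python
import hashlib
from pathlib import Path

# Hypothetical registry of digests captured when each dataset was approved
# (the value here is a placeholder, not a real digest).
APPROVED_HASHES = {
    "train.csv": "0000000000000000000000000000000000000000000000000000000000000000",
}


def verify_dataset(path: Path) -> bool:
    """Return True only if the file's current hash matches its approved baseline."""
    expected = APPROVED_HASHES.get(path.name)
    if expected is None:
        raise KeyError(f"No approved baseline hash recorded for {path.name}")
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return actual == expected
```

Running a check like this just before training starts turns "verify data integrity throughout the ML pipeline" from a policy statement into an enforceable gate.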

Example Response 2

Yes, we have implemented a robust monitoring and auditing framework for our ML training data. Our approach includes: 1) A centralized data catalog that maintains metadata about all training datasets, including source, ownership, sensitivity classification, and usage history; 2) Automated data lineage tracking that records all transformations applied to datasets; 3) Continuous monitoring through our DataGuard platform, which alerts on unusual access patterns or unexpected modifications to training data; 4) Monthly automated data quality assessments that check for drift, outliers, and potential poisoning; and 5) Bi-annual formal audits conducted by our internal audit team in collaboration with ML engineers. All training data access requires multi-factor authentication, and privileged operations (like deletion or bulk modification) require approval through our change management system. We maintain these logs for a minimum of 18 months to support forensic analysis if needed.
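Response 2's monthly drift and outlier assessments can start from something as simple as comparing a feature's current distribution against a baseline. The sketch below uses synthetic data and a two-sample Kolmogorov-Smirnov test purely as an illustration; it is not the "DataGuard" platform the response refers to:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for one numeric feature: the approved baseline batch
# and a new candidate batch whose mean has shifted (so drift should be flagged).
baseline = rng.normal(loc=0.0, scale=1.0, size=5_000)
candidate = rng.normal(loc=0.3, scale=1.0, size=5_000)

# A small p-value means the candidate batch is unlikely to come from the
# same distribution as the baseline.
statistic, p_value = stats.ks_2samp(baseline, candidate)
if p_value < 0.01:
    print(f"Possible drift (KS statistic={statistic:.3f}, p={p_value:.2e}); hold for review")
else:
    print("No significant drift detected")
```

In practice a check like this would run per feature on a schedule, with flagged datasets quarantined until a human review, which is the workflow Response 2 describes.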

Example Response 3

We currently do not have a formal monitoring and auditing system specifically for ML training data. Our data scientists maintain their own datasets and are responsible for ensuring data quality. While we do have general system access logs that would capture who accessed data storage systems, we don't have specialized tools for tracking ML data lineage or monitoring for potential data poisoning attempts. Our development team is planning to implement a data governance framework in the next quarter that will include monitoring and auditing capabilities for ML training data, but this is still in the planning phase. In the interim, we mitigate risks through strict access controls to our data storage systems and by conducting manual reviews of datasets before they are used for training production models.

Context

Tab: AI
Category: AI Machine Learning

ResponseHub is the product I wish I had when I was a CTO

Previously, I was co-founder and CTO of Progression, a VC-backed HR-tech startup used by some of the biggest names in tech.

As our sales grew, security questionnaires quickly became one of my biggest pain points. They were confusing, hard to delegate, and arrived like London buses - three at a time!

I'm building ResponseHub so that other teams don't have to go through this. Leave the security questionnaires to us so you can get back to closing deals, shipping product and building your team.

Signature
Neil Cameron
Founder, ResponseHub