AIML-01

Do you separate ML training data from your ML solution data?

Explanation

This question is asking whether your organization maintains a separation between the data used to train your machine learning (ML) models and the data that your ML solution processes in production.

What this means: ML models are trained on datasets (training data) to learn patterns and make predictions. Once trained, these models are deployed in production environments where they process new data (solution data) to generate insights or predictions. This question asks if you keep these two types of data separate.

Why it matters for security:

1. Data isolation: Separating training data from production data prevents potential data leakage or contamination.
2. Access control: Different teams may need access to training data (data scientists) versus production data (operations teams).
3. Compliance: Many regulations require clear boundaries between data used for different purposes.
4. Integrity protection: Separating data helps ensure that production data cannot accidentally or maliciously modify training data.
5. Incident containment: If there's a security breach in one environment, separation helps contain the impact.

How to best answer: Provide specific details about how your organization maintains this separation. This could include:

- Physical separation (different storage systems)
- Logical separation (different databases or containers)
- Access control mechanisms
- Data flow processes that maintain separation
- Any tools or platforms that help enforce this separation

If you do maintain separation, explain your approach clearly. If you don't maintain separation, explain why not and what compensating controls you have in place.
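
To make the idea of logical separation and access control concrete, here is a minimal sketch on AWS S3 using boto3. The bucket name and role ARN are hypothetical placeholders, and your own controls may use entirely different services; this just illustrates the pattern of explicitly denying production workloads access to training data.

```python
import json
import boto3

# Hypothetical names for illustration -- substitute your own resources.
TRAINING_BUCKET = "acme-ml-training-data"
PRODUCTION_ROLE_ARN = "arn:aws:iam::123456789012:role/ml-inference-prod"

# Bucket policy that explicitly denies the production inference role any
# access to the training-data bucket, enforcing logical separation even if
# broader permissions are granted elsewhere (explicit Deny always wins).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyProductionAccessToTrainingData",
            "Effect": "Deny",
            "Principal": {"AWS": PRODUCTION_ROLE_ARN},
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{TRAINING_BUCKET}",
                f"arn:aws:s3:::{TRAINING_BUCKET}/*",
            ],
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=TRAINING_BUCKET, Policy=json.dumps(policy))
```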

Guidance

Reviewers are looking for evidence that training data is protected and kept isolated from production solution data.

Example Responses

Example Response 1

Yes, we maintain strict separation between ML training data and production solution data. Our training data is stored in isolated S3 buckets with separate access controls and is only accessible to our data science team. Once models are trained, they are deployed to a production environment where they interact with customer data only through well-defined APIs. The production environment has no access to the original training datasets. We maintain this separation through network segmentation, IAM policies, and regular access reviews. Additionally, we use versioned datasets for training to ensure reproducibility without compromising production data.
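
If, as in this example, reproducibility relies on versioned datasets in S3, the sketch below shows one way that could look with boto3. The bucket name and dataset key are hypothetical.

```python
import boto3

# Hypothetical bucket and object key for illustration.
TRAINING_BUCKET = "acme-ml-training-data"
DATASET_KEY = "datasets/train.csv"

s3 = boto3.client("s3")

# Enable object versioning so every revision of a training dataset is
# retained, letting a training run be pinned to an exact dataset version
# without any dependency on production data stores.
s3.put_bucket_versioning(
    Bucket=TRAINING_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# List the stored versions of a dataset object so a run can reference one.
response = s3.list_object_versions(Bucket=TRAINING_BUCKET, Prefix=DATASET_KEY)
for version in response.get("Versions", []):
    print(version["VersionId"], version["LastModified"])
```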

Example Response 2

Yes, our organization implements a comprehensive data separation strategy for our ML operations. Training data is housed in a dedicated data lake environment with read-only access for our ML engineers. This environment is completely isolated from our production systems, where the trained models operate. We use a model registry to version and track models as they move from development to production, ensuring that training data never leaves its designated environment. We also implement data lineage tracking to maintain visibility into how data flows between environments while preserving separation. Regular audits verify that this separation is maintained.
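
A model registry, as this example describes, can act as the only channel through which trained models reach production. The sketch below shows one possible version of that handoff using MLflow's model registry; the tracking URI, run ID, and model name are hypothetical.

```python
import mlflow

# Hypothetical values for illustration -- substitute your own server and run.
mlflow.set_tracking_uri("https://mlflow.internal.example.com")
RUN_ID = "abc123"  # ID of the training run that produced the model

# Register the trained model in the central registry. Production systems
# pull the versioned artifact from the registry, so the training environment
# and its data never need to be reachable from production.
result = mlflow.register_model(
    model_uri=f"runs:/{RUN_ID}/model",
    name="fraud-detector",
)
print(f"Registered {result.name} version {result.version}")
```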

Example Response 3

No, we currently do not maintain complete separation between our ML training data and solution data. Our startup has built an integrated platform where the same data storage is used for both training and production inference. While we recognize this isn't ideal from a security perspective, we've implemented compensating controls, including strict access logging for all data operations, versioning of datasets to track usage, and read-only access patterns for the ML inference pipeline. We're currently redesigning our architecture to implement proper separation between training and production environments, with completion expected in the next quarter.
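
For a compensating control like strict access logging on shared storage, the sketch below shows one possible approach: enabling AWS CloudTrail object-level data events via boto3 so every read and write on the shared bucket is auditable. The trail and bucket names are hypothetical.

```python
import boto3

# Hypothetical names for illustration -- substitute your own resources.
TRAIL_NAME = "ml-data-access-trail"
SHARED_BUCKET_ARN = "arn:aws:s3:::acme-ml-shared-data/"

cloudtrail = boto3.client("cloudtrail")

# Record every object-level read and write on the shared ML data bucket,
# so all training and inference data access leaves an audit trail.
cloudtrail.put_event_selectors(
    TrailName=TRAIL_NAME,
    EventSelectors=[
        {
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [
                {"Type": "AWS::S3::Object", "Values": [SHARED_BUCKET_ARN]}
            ],
        }
    ],
)
```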

Context

Tab: AI
Category: AI Machine Learning

ResponseHub is the product I wish I had when I was a CTO

Previously I was co-founder and CTO of Progression, a VC-backed HR-tech startup used by some of the biggest names in tech.

As our sales grew, security questionnaires quickly became one of my biggest pain points. They were confusing, hard to delegate and arrived like London buses - 3 at a time!

I'm building ResponseHub so that other teams don't have to go through this. Leave the security questionnaires to us so you can get back to closing deals, shipping product and building your team.

Neil Cameron
Founder, ResponseHub