I worked on synthetic data generation for my research paper today.
Below is my strategy.
Our synthetic data generation methodology employs a privacy-preserving approach utilizing the GPT-4o model (2024-08-06) via Azure OpenAI API. We implemented a systematic framework that begins with representative sampling from the original dataset, converting a minimal set of records (n=2) to dictionary format to serve as structural templates. The generation process is guided by carefully engineered prompts that specify medical domain requirements while ensuring complete de-identification. Generation parameters were optimized with temperature=0.5 to balance consistency with variability, and max_tokens=4000 to accommodate complex medical records. Our robust processing pipeline includes comprehensive error handling, multiple fallback strategies for parsing different response structures, and conversion of synthetic records back to standardized DataFrame format. Quality control measures enforce structural fidelity to the original data while ensuring all synthetic identifiers remain realistic but non-identifiable. We applied this methodology to generate synthetic brain cancer MRI and pathology report pairs, maintaining the original dataset's cancer type distribution. This approach enables privacy-compliant data sharing for medical research while preserving clinically relevant patterns and relationships in the data, addressing a critical need in collaborative medical research where protected health information constraints often limit data accessibility.
Here is how it looks in code:
import json

import pandas as pd
from openai import AzureOpenAI

# Assumes an Azure OpenAI client and deployment are configured elsewhere, e.g.:
# client = AzureOpenAI(azure_endpoint=..., api_key=..., api_version=...)
# MODEL = "gpt-4o"  # deployment name for GPT-4o (2024-08-06)


def generate_synthetic_records(client, MODEL):
    # Representative sampling: a minimal set of real records as structural templates
    sample_data = pd.read_csv("data.csv").head(10)
    data_type = "mri and pathology"
    specific_context = "Brain cancer MRI and pathology reports of glioblastoma"
    num_samples = 2
    real_examples = sample_data.sample(min(2, len(sample_data))).to_dict("records")

    # System prompt engineering for medical domain requirements
    system_prompt = f"""You are a data synthesis expert specializing in medical {data_type} generation.
Your task is to generate synthetic data that closely resembles the provided examples while ensuring complete privacy protection.
The synthetic data should maintain the characteristics, terminology, and structure of the real data but contain no actual identifiable information.
Pay careful attention to the content of the example data to determine the appropriate medical domain and terminology.
"""

    # User prompt with detailed requirements for de-identification
    user_prompt = f"""Given the following context: {specific_context}

Generate {num_samples} synthetic {data_type} records based on these real examples:
{json.dumps(real_examples, indent=2)}

Requirements:
1. Each record must follow the exact format of the examples
2. Use realistic but synthetic identifiers
3. Maintain the structure, length, and detail level of real data
4. Carefully analyze the examples to identify the medical domain and use appropriate terminology
5. ALL fields must have valid values - no empty fields

Return your response in this exact JSON format:
{{
  "records": [
    # {num_samples} records here with all fields
  ]
}}
"""

    # Optimized generation parameters
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.5,   # Balance consistency with variability
        max_tokens=4000,   # Accommodate complex medical records
        response_format={"type": "json_object"},
    )

    # Robust processing pipeline with fallback strategies
    try:
        content_str = response.choices[0].message.content
        content = json.loads(content_str)

        # Multiple fallback strategies for parsing different response structures
        if isinstance(content, dict) and "records" in content:
            records = content["records"]
        elif isinstance(content, dict) and "data" in content:
            records = content["data"]
        elif isinstance(content, list):
            records = content
        else:
            # Try to find any list in the response
            for key, value in content.items():
                if isinstance(value, list) and len(value) > 0:
                    records = value
                    break
            else:
                records = [content]

        return records
    except Exception as e:
        print(f"Error generating synthetic data: {str(e)}")
        return []
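The strategy above also mentions converting the synthetic records back to a standardized DataFrame and enforcing structural fidelity plus the original cancer type distribution, which the snippet doesn't show. Below is a minimal sketch of what that post-processing could look like; the function name, the cancer_type column, and the specific checks are my assumptions, not the actual pipeline.

import pandas as pd


def to_validated_dataframe(records, original_df, label_col="cancer_type"):
    """Hypothetical post-processing sketch: convert synthetic records back to a
    DataFrame and apply simple quality-control checks. Column names are
    assumptions; adapt them to the real schema."""
    synthetic_df = pd.DataFrame(records)

    # Structural fidelity: synthetic data must expose the same columns as the original
    missing = set(original_df.columns) - set(synthetic_df.columns)
    if missing:
        raise ValueError(f"Synthetic records are missing columns: {missing}")
    synthetic_df = synthetic_df[list(original_df.columns)]  # keep original column order

    # No empty fields: every cell should carry a value
    if synthetic_df.isna().any().any():
        raise ValueError("Synthetic records contain empty fields")

    # Distribution check: compare cancer type proportions against the original data
    if label_col in original_df.columns:
        original_dist = original_df[label_col].value_counts(normalize=True)
        synthetic_dist = synthetic_df[label_col].value_counts(normalize=True)
        print("Original distribution:\n", original_dist)
        print("Synthetic distribution:\n", synthetic_dist)

    return synthetic_df

With something like this in place, the generated records could be checked and saved via synthetic_df = to_validated_dataframe(records, sample_data) followed by synthetic_df.to_csv("synthetic_data.csv", index=False).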