I worked on synthetic data generation for my research paper today.
Below is my strategy.
Our synthetic data generation methodology employs a privacy-preserving approach utilizing the GPT-4o model (2024-08-06) via Azure OpenAI API. We implemented a systematic framework that begins with representative sampling from the original dataset, converting a minimal set of records (n=2) to dictionary format to serve as structural templates. The generation process is guided by carefully engineered prompts that specify medical domain requirements while ensuring complete de-identification. Generation parameters were optimized with temperature=0.5 to balance consistency with variability, and max_tokens=4000 to accommodate complex medical records. Our robust processing pipeline includes comprehensive error handling, multiple fallback strategies for parsing different response structures, and conversion of synthetic records back to standardized DataFrame format. Quality control measures enforce structural fidelity to the original data while ensuring all synthetic identifiers remain realistic but non-identifiable. We applied this methodology to generate synthetic brain cancer MRI and pathology report pairs, maintaining the original dataset's cancer type distribution. This approach enables privacy-compliant data sharing for medical research while preserving clinically relevant patterns and relationships in the data, addressing a critical need in collaborative medical research where protected health information constraints often limit data accessibility.
Here is how it looks in code:
import json

import pandas as pd
from openai import AzureOpenAI

# Assumes an Azure OpenAI client and deployment are configured elsewhere, e.g.:
# client = AzureOpenAI(azure_endpoint=..., api_key=..., api_version=...)
# MODEL = "gpt-4o"  # deployment name for GPT-4o (2024-08-06)


def generate_synthetic_records(client, MODEL):
    # Representative sampling: a minimal set of real records as structural templates
    sample_data = pd.read_csv("data.csv").head(10)
    data_type = "mri and pathology"
    specific_context = "Brain cancer MRI and pathology reports of glioblastoma"
    num_samples = 2
    real_examples = sample_data.sample(min(2, len(sample_data))).to_dict("records")

    # System prompt engineering for medical domain requirements
    system_prompt = f"""You are a data synthesis expert specializing in medical {data_type} generation.
Your task is to generate synthetic data that closely resembles the provided examples while ensuring complete privacy protection.
The synthetic data should maintain the characteristics, terminology, and structure of the real data but contain no actual identifiable information.
Pay careful attention to the content of the example data to determine the appropriate medical domain and terminology.
"""

    # User prompt with detailed requirements for de-identification
    user_prompt = f"""Given the following context: {specific_context}

Generate {num_samples} synthetic {data_type} records based on these real examples:
{json.dumps(real_examples, indent=2)}

Requirements:
1. Each record must follow the exact format of the examples
2. Use realistic but synthetic identifiers
3. Maintain the structure, length, and detail level of real data
4. Carefully analyze the examples to identify the medical domain and use appropriate terminology
5. ALL fields must have valid values - no empty fields

Return your response in this exact JSON format:
{{
  "records": [
    # {num_samples} records here with all fields
  ]
}}
"""

    # Optimized generation parameters
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.5,   # Balance consistency with variability
        max_tokens=4000,   # Accommodate complex medical records
        response_format={"type": "json_object"},
    )

    # Robust processing pipeline with fallback strategies
    try:
        content_str = response.choices[0].message.content
        content = json.loads(content_str)

        # Multiple fallback strategies for parsing different response structures
        if isinstance(content, dict) and "records" in content:
            records = content["records"]
        elif isinstance(content, dict) and "data" in content:
            records = content["data"]
        elif isinstance(content, list):
            records = content
        else:
            # Try to find any list in the response
            for key, value in content.items():
                if isinstance(value, list) and len(value) > 0:
                    records = value
                    break
            else:
                records = [content]

        return records
    except Exception as e:
        print(f"Error generating synthetic data: {str(e)}")
        return []
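The strategy above also mentions converting the synthetic records back to a standardized DataFrame and enforcing structural fidelity plus the original cancer type distribution, which the snippet doesn't show. Below is a minimal sketch of what that post-processing could look like; the function name, the cancer_type column, and the specific checks are my assumptions, not the actual pipeline.

import pandas as pd


def to_validated_dataframe(records, original_df, label_col="cancer_type"):
    """Hypothetical post-processing sketch: convert synthetic records back to a
    DataFrame and apply simple quality-control checks. Column names are
    assumptions; adapt them to the real schema."""
    synthetic_df = pd.DataFrame(records)

    # Structural fidelity: synthetic data must expose the same columns as the original
    missing = set(original_df.columns) - set(synthetic_df.columns)
    if missing:
        raise ValueError(f"Synthetic records are missing columns: {missing}")
    synthetic_df = synthetic_df[list(original_df.columns)]  # keep original column order

    # No empty fields: every cell should carry a value
    if synthetic_df.isna().any().any():
        raise ValueError("Synthetic records contain empty fields")

    # Distribution check: compare cancer type proportions against the original data
    if label_col in original_df.columns:
        original_dist = original_df[label_col].value_counts(normalize=True)
        synthetic_dist = synthetic_df[label_col].value_counts(normalize=True)
        print("Original distribution:\n", original_dist)
        print("Synthetic distribution:\n", synthetic_dist)

    return synthetic_df

With something like this in place, the generated records could be checked and saved via synthetic_df = to_validated_dataframe(records, sample_data) followed by synthetic_df.to_csv("synthetic_data.csv", index=False).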