Researchers suggest that DeepSeek, a Chinese AI lab, may have trained its latest R1-0528 reasoning model using outputs from Google's Gemini AI, igniting a controversy. The updated model, released in late May, showcases impressive performance gains, rivaling top-tier models like OpenAI’s o3 and Google’s Gemini 2.5 Pro.
However, questions about its training data have cast a shadow over its achievements, raising ethical and intellectual property concerns in the AI industry.
Evidence of Gemini Influence
The speculation began when developers noticed linguistic similarities between DeepSeek’s R1-0528 and Google’s Gemini 2.5 Pro. Melbourne-based developer Sam Paech, known for creating AI emotional intelligence evaluations, pointed out that R1-0528 frequently uses terminology like “context window,” “foundation model,” and “function calling,” which are prevalent in Gemini’s documentation.
Additionally, the model’s reasoning traces, the intermediate steps it generates while solving problems, mirror the structure and style of Gemini’s outputs. These patterns suggest that DeepSeek may have employed distillation, a technique in which a smaller model is trained on outputs from a larger, more advanced model.
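In rough terms, output-based distillation means fine-tuning a student model on text produced by a stronger teacher. The sketch below is only an illustration of the general technique, not DeepSeek’s actual pipeline; the sample texts are invented and GPT-2 stands in as a placeholder student.

```python
# Minimal sketch of output-based distillation: a "student" model is fine-tuned
# on text produced by a stronger "teacher" model. The sample texts and the
# GPT-2 student are placeholders, not DeepSeek's or Google's actual systems.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_outputs = [
    # In practice these would be responses collected from a stronger model.
    "To solve the equation, first isolate the variable on one side...",
    "A context window defines how many tokens the model can attend to at once...",
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder student
student = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)

student.train()
for epoch in range(3):
    for text in teacher_outputs:
        batch = tokenizer(text, return_tensors="pt")
        # Standard next-token loss on the teacher's text: the student learns
        # to imitate the teacher's phrasing and reasoning structure.
        loss = student(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Because the student absorbs the teacher’s characteristic vocabulary along with its reasoning style, traces of the source model can persist in the student’s outputs.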
This isn’t the first time DeepSeek has faced such accusations; earlier in 2025, OpenAI flagged evidence that DeepSeek used distillation on its ChatGPT outputs, a practice prohibited by OpenAI’s terms of service.
Detection Methods Uncover Linguistic Clues
Paech’s analysis leveraged advanced AI detection techniques to identify potential training data lineage. These methods include statistical analysis of word frequencies and syntactic structures, as well as neural network-based approaches like BERT and RoBERTa, which can detect machine-generated text with up to 97% accuracy in controlled settings.
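The frequency-based part of such analyses can be illustrated very simply: count how often characteristic terms appear in different models’ outputs and compare the profiles. The term list and sample texts below are invented for illustration; real analyses use far larger corpora and proper statistical tests.

```python
# Toy illustration of frequency-based "fingerprinting": tally how often
# characteristic terms appear in a set of model outputs. The term list and
# sample texts are invented for illustration only.
from collections import Counter
import re

MARKER_TERMS = {"context window", "foundation model", "function calling"}

def term_frequencies(texts: list[str]) -> Counter:
    """Count occurrences of each marker term across a list of outputs."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for term in MARKER_TERMS:
            counts[term] += len(re.findall(re.escape(term), lowered))
    return counts

model_a_outputs = ["The context window limits how much text the foundation model sees."]
model_b_outputs = ["Function calling lets the model invoke external tools."]

print(term_frequencies(model_a_outputs))
print(term_frequencies(model_b_outputs))
```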
Zero-shot detection, another technique, analyzes text probability distributions without additional training, achieving near 99% accuracy for certain models. Such tools are critical in AI forensics, as models trained on outputs from other systems often inherit distinctive vocabulary and phrasing, acting as digital fingerprints.
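Zero-shot detectors typically score how probable a candidate text is under a reference language model, for example via perplexity. The sketch below uses GPT-2 as a stand-in reference model and an uncalibrated score; real forensic tools rely on stronger models and carefully calibrated thresholds.

```python
# Rough sketch of zero-shot detection: score a text's perplexity under a
# reference language model. Unusually low perplexity can hint that the text
# resembles machine-generated output. GPT-2 is only a stand-in reference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Return the reference model's perplexity on the given text."""
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**batch, labels=batch["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("The context window defines how many tokens a model can process."))
```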
These findings have fueled speculation that DeepSeek relied on Gemini’s outputs to enhance R1-0528’s capabilities, though no definitive proof has been confirmed.
DeepSeek’s Performance Leap
Despite the controversy, R1-0528 demonstrates remarkable improvements over its predecessor. On the AIME 2025 mathematics test, it achieved 87.5% accuracy, up from 70%, and scored 91.4% on AIME 2024.
Its programming capabilities also advanced, with LiveCodeBench scores rising from 63.5% to 73.3% and SWE-bench Verified results improving from 49.2% to 57.6%. General reasoning saw gains on the GPQA-Diamond test, climbing from 71.5% to 81.0%, while performance on the complex “Humanity’s Last Exam” more than doubled, from 8.5% to 17.7%.
The model also produces fewer hallucinations and factually inaccurate responses, and it adds support for JSON output and enhanced function calling, making it more developer-friendly.
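For developers, the JSON-output feature can be exercised through an OpenAI-compatible chat completions call. The snippet below is a hedged sketch: the base URL, model identifier, and JSON-mode flag follow publicly documented conventions and should be treated as assumptions rather than verified details of the R1-0528 release.

```python
# Hedged sketch: requesting JSON-formatted output from an OpenAI-compatible
# chat completions endpoint. Base URL, model name, and the JSON-mode flag are
# assumptions based on documented conventions, not verified against R1-0528.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",                  # placeholder credential
    base_url="https://api.deepseek.com",     # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-reasoner",               # assumed model identifier
    messages=[
        {"role": "system", "content": "Reply only with valid JSON."},
        {"role": "user", "content": "List three prime numbers as a JSON array under the key 'primes'."},
    ],
    response_format={"type": "json_object"},  # request structured JSON output
)

print(response.choices[0].message.content)
```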
Did You Know?
DeepSeek trained its earlier V3 model for just $6 million, a fraction of the $100 million spent on OpenAI’s GPT-4, showcasing its ability to achieve high performance with limited resources.
Industry Implications and DeepSeek’s Silence
The allegations against DeepSeek highlight broader challenges in AI development, particularly around data provenance and intellectual property. The open web, increasingly filled with AI-generated content, complicates efforts to ensure clean training datasets, leading to unintended overlaps in model outputs.
Major AI firms like OpenAI and Google have responded by tightening security, with OpenAI requiring ID verification for advanced model access and Google limiting third-party use of Gemini outputs.
DeepSeek has not publicly addressed the Gemini allegations, consistent with its past silence on similar claims involving OpenAI data. Industry observers suggest that the company’s limited resources, and its restricted access to powerful GPUs under U.S. export rules, may push it toward synthetic data generated by leading models as a cheaper way to compete with larger rivals.