DeepSeek Data Leak: Nearly 12,000 Hardcoded API Keys and Passwords Exposed
A recent analysis of Common Crawl, a web archive used to train large language models such as DeepSeek, revealed 11,908 live API keys, passwords, and authentication tokens embedded in publicly scraped web pages.
AI Training Risks and Hardcoded Credentials
Cybersecurity firm Truffle Security warns that AI models trained on unfiltered internet data risk internalizing and replicating insecure coding patterns. This aligns with previous findings that LLMs often suggest hardcoding credentials, raising concerns about the influence of training data on security practices.
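As a concrete illustration (ours, not the report's) of the pattern in question, compare a hardcoded credential with the environment-variable habit that keeps it out of published source; the key below is AWS's documented fake example value:

```python
import os

# Insecure pattern that scanners keep finding in public code: a credential
# pasted straight into the source. Anything published this way ends up in
# web crawls, and from there in AI training data.
AWS_SECRET = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"  # fake example value from AWS docs

# Safer habit: resolve the secret from the environment at runtime, so the
# published source file never contains it.
aws_secret = os.environ["AWS_SECRET_ACCESS_KEY"]  # raises KeyError if unset
```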
DeepSeek Data Exposure
Truffle Security scanned 400 terabytes of Common Crawl’s December 2024 dataset, covering 2.67 billion web pages from 47.5 million hosts. Using their open-source tool TruffleHog, researchers identified:
- 11,908 verified live secrets, granting access to services like AWS, Slack, and Mailchimp.
- 2.76 million web pages containing exposed credentials, with 63% of keys reused across multiple domains.
- A single WalkScore API key appearing 57,029 times across 1,871 subdomains, highlighting extensive credential reuse.
Some of the most severe exposures included AWS root keys embedded in front-end HTML and 17 Slack webhooks hardcoded into a single webpage’s chat feature.
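To see why a scraped Slack webhook is immediately exploitable: the URL itself is the credential, so possession equals posting rights in the workspace. A minimal sketch with a fabricated URL:

```python
import requests

# An incoming webhook URL is the credential: anyone holding it can post
# into the channel. The URL below is a fabricated placeholder, not a real hook.
WEBHOOK = "https://hooks.slack.com/services/T0000000/B0000000/XXXXXXXXXXXXXXXXXXXXXXXX"

# Slack incoming webhooks accept a JSON payload with a "text" field.
requests.post(WEBHOOK, json={"text": "Attacker-controlled message in your channel"})
```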
Mailchimp API Keys and Security Risks
Mailchimp API keys were the most frequently exposed credential type, with over 1,500 instances, most embedded in client-side JavaScript, where attackers can harvest them for phishing campaigns and data exfiltration.
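The architectural fix is to keep the key server-side and expose only a narrow endpoint to the browser. A minimal sketch, assuming Flask and a hypothetical /subscribe route; LIST_ID is a placeholder, and none of this comes from the report:

```python
import os

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

# The key lives only in the server's environment; the browser never sees it.
MAILCHIMP_KEY = os.environ["MAILCHIMP_API_KEY"]
DC = MAILCHIMP_KEY.rsplit("-", 1)[-1]  # Mailchimp keys end in a datacenter suffix, e.g. "us21"

@app.post("/subscribe")
def subscribe():
    # Forward only what the call needs; LIST_ID is a placeholder.
    resp = requests.post(
        f"https://{DC}.api.mailchimp.com/3.0/lists/LIST_ID/members",
        auth=("anystring", MAILCHIMP_KEY),
        json={"email_address": request.json["email"], "status": "subscribed"},
        timeout=10,
    )
    return jsonify(ok=resp.ok), resp.status_code
```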
Common Crawl’s dataset, stored in 90,000 WARC files, preserves raw HTML, JavaScript, and server responses, making it a valuable yet risky resource for AI training.
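For a sense of what scanning one of those files involves, here is a hedged sketch using the open-source warcio library and the well-known AKIA-prefixed AWS access-key-ID shape. It illustrates the idea only and is not Truffle Security's pipeline:

```python
import re

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Shape of an AWS access key ID; real scanners match hundreds of such patterns.
AWS_KEY_ID = re.compile(rb"AKIA[0-9A-Z]{16}")

def scan_warc(path: str) -> set[bytes]:
    """Return candidate AWS key IDs found in one (gzipped) WARC file."""
    hits: set[bytes] = set()
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):  # warcio detects gzip itself
            if record.rec_type != "response":   # only crawled HTTP responses
                continue
            hits.update(AWS_KEY_ID.findall(record.content_stream().read()))
    return hits

# Hypothetical local file name, for illustration:
print(scan_warc("CC-MAIN-20241201-sample.warc.gz"))
```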
Processing Challenges and Infrastructure Optimization
Truffle Security used a 20-node AWS cluster to process the dataset, leveraging awk for file splitting and TruffleHog’s verification engine to distinguish live secrets from inert strings—an essential step, as LLMs cannot differentiate valid credentials during training.
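The verification idea can be approximated for AWS keys with a single read-only STS call: a live pair authenticates, an inert string fails. This mirrors the concept behind TruffleHog's verification engine, not its implementation:

```python
import boto3
from botocore.exceptions import ClientError

def aws_key_is_live(access_key_id: str, secret_key: str) -> bool:
    """True if the key pair authenticates; inert strings fail the call."""
    sts = boto3.client(
        "sts",
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_key,
        region_name="us-east-1",
    )
    try:
        sts.get_caller_identity()  # cheap, read-only: "who am I?"
        return True
    except ClientError:
        return False
```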
The team faced infrastructure challenges, including WARC’s streaming inefficiencies, which initially slowed processing. However, AWS optimizations eventually reduced download times by 5–6x.

Ethical Disclosure and AI Security Risks
Truffle Security prioritized responsible disclosure, working with vendors like Mailchimp to revoke thousands of compromised keys rather than attempting mass outreach to individual website owners.
The study highlights a growing concern: LLMs trained on publicly accessible data inherit its security flaws. While models like DeepSeek employ safeguards, fine-tuning, and prompt constraints, the widespread presence of hardcoded secrets in training datasets risks reinforcing insecure coding practices.
Another issue is the presence of non-functional credentials (e.g., placeholder tokens), which LLMs cannot contextually evaluate during code generation—further normalizing unsafe patterns.
Truffle Security also warns that developers who reuse API keys across multiple projects face increased risks. In one case, a shared Mailchimp key from a software firm exposed all client domains linked to its account, creating a high-value target for attackers.
Mitigation Strategies
To reduce AI-generated security risks, Truffle Security recommends:
- Integrating security guardrails into AI-powered coding tools like GitHub Copilot’s Custom Instructions, enforcing policies against hardcoding secrets (see the guardrail sketch after this list).
- Expanding secret-scanning programs to cover archived web data, preventing historical leaks from resurfacing in training datasets.
- Adopting Constitutional AI techniques to align LLMs with security best practices, minimizing inadvertent exposure of sensitive patterns.
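As a loose approximation of the first recommendation, a repository-side check can refuse to commit files containing key-shaped strings. A minimal pre-commit-style sketch; the three patterns are illustrative, whereas production scanners like TruffleHog match hundreds of credential types:

```python
import pathlib
import re
import sys

# Illustrative patterns only; real scanners cover far more credential types.
PATTERNS = {
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "Slack webhook": re.compile(r"https://hooks\.slack\.com/services/\S+"),
    "Mailchimp API key": re.compile(r"\b[0-9a-f]{32}-us\d{1,2}\b"),
}

def main(paths: list[str]) -> int:
    """Return a nonzero exit code if any staged file looks like it holds a secret."""
    bad = 0
    for path in paths:
        text = pathlib.Path(path).read_text(errors="ignore")
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                print(f"{path}: possible {name} found; remove it before committing")
                bad = 1
    return bad

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```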
With LLMs increasingly influencing software development, securing their training data isn’t just important—it’s critical to building a safer digital ecosystem.