DeepSeek Data Leak: Nearly 12,000 Hardcoded API Keys and Passwords Exposed
A recent analysis of Common Crawl, a web archive used to train large language models such as DeepSeek, revealed 11,908 live API keys, passwords, and authentication tokens embedded in publicly scraped web pages.
AI Training Risks and Hardcoded Credentials
Cybersecurity firm Truffle Security warns that AI models trained on unfiltered internet data risk internalizing and replicating insecure coding patterns. This aligns with previous findings that LLMs often suggest hardcoding credentials, raising concerns about the influence of training data on security practices.
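As a concrete illustration (ours, not the report's) of the pattern in question, compare a hardcoded credential with the environment-variable habit that keeps it out of published source; the key below is AWS's documented fake example value:

```python
import os

# Insecure pattern that scanners keep finding in public code: a credential
# pasted straight into the source. Anything published this way ends up in
# web crawls, and from there in AI training data.
AWS_SECRET = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"  # fake example value from AWS docs

# Safer habit: resolve the secret from the environment at runtime, so the
# published source file never contains it.
aws_secret = os.environ["AWS_SECRET_ACCESS_KEY"]  # raises KeyError if unset
```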
DeepSeek Data Exposure
Truffle Security scanned 400 terabytes of Common Crawl’s December 2024 dataset, covering 2.67 billion web pages from 47.5 million hosts. Using their open-source tool TruffleHog, researchers identified:
- 11,908 verified live secrets, granting access to services like AWS, Slack, and Mailchimp.
- 2.76 million web pages containing exposed credentials, with 63% of keys reused across multiple domains.
- A single WalkScore API key appearing 57,029 times across 1,871 subdomains, highlighting extensive credential reuse.
Some of the most severe exposures included AWS root keys embedded in front-end HTML and 17 Slack webhooks hardcoded into a single webpage’s chat feature.
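To see why a scraped Slack webhook is immediately exploitable: the URL itself is the credential, so possession equals posting rights in the workspace. A minimal sketch with a fabricated URL:

```python
import requests

# An incoming webhook URL is the credential: anyone holding it can post
# into the channel. The URL below is a fabricated placeholder, not a real hook.
WEBHOOK = "https://hooks.slack.com/services/T0000000/B0000000/XXXXXXXXXXXXXXXXXXXXXXXX"

# Slack incoming webhooks accept a JSON payload with a "text" field.
requests.post(WEBHOOK, json={"text": "Attacker-controlled message in your channel"})
```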
Mailchimp API Keys and Security Risks
Mailchimp API keys were the most frequently exposed credential type, with over 1,500 instances, most embedded in client-side JavaScript, where attackers can harvest them for phishing campaigns and data exfiltration.
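The architectural fix is to keep the key server-side and expose only a narrow endpoint to the browser. A minimal sketch, assuming Flask and a hypothetical /subscribe route; LIST_ID is a placeholder, and none of this comes from the report:

```python
import os

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

# The key lives only in the server's environment; the browser never sees it.
MAILCHIMP_KEY = os.environ["MAILCHIMP_API_KEY"]
DC = MAILCHIMP_KEY.rsplit("-", 1)[-1]  # Mailchimp keys end in a datacenter suffix, e.g. "us21"

@app.post("/subscribe")
def subscribe():
    # Forward only what the call needs; LIST_ID is a placeholder.
    resp = requests.post(
        f"https://{DC}.api.mailchimp.com/3.0/lists/LIST_ID/members",
        auth=("anystring", MAILCHIMP_KEY),
        json={"email_address": request.json["email"], "status": "subscribed"},
        timeout=10,
    )
    return jsonify(ok=resp.ok), resp.status_code
```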
Common Crawl’s dataset, stored in 90,000 WARC files, preserves raw HTML, JavaScript, and server responses, making it a valuable yet risky resource for AI training.
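For a sense of what scanning one of those files involves, here is a hedged sketch using the open-source warcio library and the well-known AKIA-prefixed AWS access-key-ID shape. It illustrates the idea only and is not Truffle Security's pipeline:

```python
import re

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Shape of an AWS access key ID; real scanners match hundreds of such patterns.
AWS_KEY_ID = re.compile(rb"AKIA[0-9A-Z]{16}")

def scan_warc(path: str) -> set[bytes]:
    """Return candidate AWS key IDs found in one (gzipped) WARC file."""
    hits: set[bytes] = set()
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):  # warcio detects gzip itself
            if record.rec_type != "response":   # only crawled HTTP responses
                continue
            hits.update(AWS_KEY_ID.findall(record.content_stream().read()))
    return hits

# Hypothetical local file name, for illustration:
print(scan_warc("CC-MAIN-20241201-sample.warc.gz"))
```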
Processing Challenges and Infrastructure Optimization
Truffle Security used a 20-node AWS cluster to process the dataset, leveraging awk for file splitting and TruffleHog’s verification engine to distinguish live secrets from inert strings—an essential step, as LLMs cannot differentiate valid credentials during training.
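The verification idea can be approximated for AWS keys with a single read-only STS call: a live pair authenticates, an inert string fails. This mirrors the concept behind TruffleHog's verification engine, not its implementation:

```python
import boto3
from botocore.exceptions import ClientError

def aws_key_is_live(access_key_id: str, secret_key: str) -> bool:
    """True if the key pair authenticates; inert strings fail the call."""
    sts = boto3.client(
        "sts",
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_key,
        region_name="us-east-1",
    )
    try:
        sts.get_caller_identity()  # cheap, read-only: "who am I?"
        return True
    except ClientError:
        return False
```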
The team faced infrastructure challenges, including WARC’s streaming inefficiencies, which initially slowed processing. However, AWS optimizations eventually reduced download times by 5–6x.

Ethical Disclosure and AI Security Risks
Truffle Security prioritized responsible disclosure, working with vendors like Mailchimp to revoke thousands of compromised keys rather than attempting mass outreach to individual website owners.
The study highlights a growing concern: LLMs trained on publicly accessible data inherit its security flaws. While models like DeepSeek employ safeguards, fine-tuning, and prompt constraints, the widespread presence of hardcoded secrets in training datasets risks reinforcing insecure coding practices.
Another issue is the presence of non-functional credentials (e.g., placeholder tokens), which LLMs cannot contextually evaluate during code generation—further normalizing unsafe patterns.
Truffle Security also warns that developers who reuse API keys across multiple projects face increased risks. In one case, a shared Mailchimp key from a software firm exposed all client domains linked to its account, creating a high-value target for attackers.
Mitigation Strategies
To reduce AI-generated security risks, Truffle Security recommends:
- Integrating security guardrails into AI-powered coding tools like GitHub Copilot’s Custom Instructions, enforcing policies against hardcoding secrets (see the guardrail sketch after this list).
- Expanding secret-scanning programs to cover archived web data, preventing historical leaks from resurfacing in training datasets.
- Adopting Constitutional AI techniques to align LLMs with security best practices, minimizing inadvertent exposure of sensitive patterns.
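As a loose approximation of the first recommendation, a repository-side check can refuse to commit files containing key-shaped strings. A minimal pre-commit-style sketch; the three patterns are illustrative, whereas production scanners like TruffleHog match hundreds of credential types:

```python
import pathlib
import re
import sys

# Illustrative patterns only; real scanners cover far more credential types.
PATTERNS = {
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "Slack webhook": re.compile(r"https://hooks\.slack\.com/services/\S+"),
    "Mailchimp API key": re.compile(r"\b[0-9a-f]{32}-us\d{1,2}\b"),
}

def main(paths: list[str]) -> int:
    """Return a nonzero exit code if any staged file looks like it holds a secret."""
    bad = 0
    for path in paths:
        text = pathlib.Path(path).read_text(errors="ignore")
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                print(f"{path}: possible {name} found; remove it before committing")
                bad = 1
    return bad

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```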
With LLMs increasingly influencing software development, securing their training data isn’t just important—it’s critical to building a safer digital ecosystem.