1. The "Black Box" Problem in Media

We live in an era of information opacity. Modern media has a sourcing problem. A headline claims "Study Shows AI is Biased," but the article never links to the actual study. A review claims "Users Hate This Tool," but cites no evidence or methodology. When you read a software review in many major tech publications, you have no idea whether the writer actually used the software or simply rewrote the marketing copy from the vendor's homepage.

WhichAIPick operates differently. We believe that Data Sovereignty extends to the reader. You have a right to know exactly where our conclusions come from. You have a right to audit our inputs.

This policy details the precise inputs that feed our algorithm. We classify our data into three categories: Primary (Vendor), Secondary (Synthetic), and Tertiary (User). By understanding this "Data Supply Chain," you can better evaluate the trustworthiness of our recommendations in our Software Directory.

2. Primary Data: Verified Vendor Intelligence

This is the "Hard Spec" data. In the world of physical hardware reviews, this would be the dimensions of a phone or the horsepower of a car. In AI software, it is the context window size, the parameter count, and the API rate limits.

We treat vendor-supplied data with extreme skepticism until it is verified. Marketing departments are incentivized to exaggerate. Engineering departments are incentivized to be accurate. Our job is to find the engineering truth hidden behind the marketing fluff.

Sources of Primary Data

Verification Protocols

We cross-reference marketing claims against technical docs. If marketing says "Unlimited Context" but the API docs say "128k context limit," we publish the 128k number and penalize the Trust score. If a vendor claims to be "GDPR Compliant" but their servers are hosted exclusively in Virginia, USA, without Standard Contractual Clauses (SCCs), we flag this as a Compliance Risk.
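In spirit, that cross-referencing step is a simple rule: when marketing and engineering disagree, publish the engineering number and dock trust. The function name, score scale, and penalty size below are illustrative assumptions, not our production code.

```python
# Illustrative only: the penalty size and score scale are assumptions,
# not our production verification logic.

def verify_claim(marketing_value, documented_value, trust_score, penalty=5):
    """Publish the engineering number; dock Trust when marketing inflates."""
    if marketing_value == documented_value:
        return documented_value, trust_score
    # Marketing and docs disagree: publish the documented figure
    # and apply a flat Trust penalty.
    return documented_value, trust_score - penalty
```

For example, `verify_claim("unlimited", 128_000, trust_score=90)` publishes `128000` and lowers Trust to `85`.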

3. Secondary Data: Synthetic Benchmarking

This is the most valuable data we own. It is proprietary data generated by our own testing labs, and it does not exist anywhere else on the internet. Because AI models are non-deterministic (the same prompt can produce different answers on different runs), relying on a single test is statistically invalid. Instead, we rely on **Synthetic Benchmarking**: running thousands of automated tests to generate a statistical average of performance.
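Averaging over many runs can be sketched in a few lines; the aggregation shown is illustrative, as our real pipeline tracks more statistics than a mean and a spread.

```python
import statistics

def benchmark_average(scores):
    """Collapse many non-deterministic runs into one stable estimate.

    Illustrative sketch: a single score is noise, but the mean over
    thousands of runs (with its spread) is a usable measurement.
    """
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {"mean": mean, "stdev": spread, "runs": len(scores)}
```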

The Test Harness

We have built a custom testing harness called "The Gauntlet." This software pipeline connects to the APIs of the tools we review and fires a randomized battery of prompts. You can read more about this in our Review Methodology.
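In spirit, The Gauntlet does something like the following, where `call_model` stands in for whatever client talks to a vendor's API and the prompt bank is a tiny placeholder for the real battery:

```python
import random

# Placeholder prompts; the real battery is far larger.
PROMPT_BANK = [
    "Summarize this paragraph: ...",
    "Translate this sentence to French: ...",
    "Extract every date from this text: ...",
]

def run_gauntlet(call_model, n_prompts=1000, seed=42):
    """Fire a randomized battery of prompts and collect the responses.

    `call_model` is an assumption standing in for the vendor API client.
    """
    rng = random.Random(seed)  # seeded so any run can be reproduced
    results = []
    for _ in range(n_prompts):
        prompt = rng.choice(PROMPT_BANK)
        results.append({"prompt": prompt, "response": call_model(prompt)})
    return results
```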

Synthetic Data Generation

To test these tools without compromising user privacy, we generate **Synthetic User Data**. We do not paste real confidential emails into an AI writer to test it. Instead, we use a library of "Fake PII" (Personally Identifiable Information)—fake names, fake addresses, fake credit card numbers—to test how the AI handles sensitive data. This allows us to test "Data Leakage" safety rails without risking real harm.
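Generating fake PII needs nothing exotic. A stdlib-only sketch, where the name pools and formats are illustrative and the card number is deliberately not a valid one:

```python
import random

# Illustrative pools; real synthetic datasets are far larger.
FIRST = ["Ava", "Liam", "Noah", "Mia"]
LAST = ["Okafor", "Ruiz", "Chen", "Novak"]

def fake_record(rng):
    """Generate one synthetic 'PII' record. Every value is fabricated,
    so a leak during testing exposes nothing real."""
    return {
        "name": f"{rng.choice(FIRST)} {rng.choice(LAST)}",
        "address": f"{rng.randint(1, 999)} Elm Street",
        # A made-up 16-digit string, not a valid card number.
        "card": " ".join(f"{rng.randint(0, 9999):04d}" for _ in range(4)),
    }

rng = random.Random(7)
record = fake_record(rng)
```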

4. Tertiary Data: User Telemetry & Aggregation

We aggregate anonymized user behavior to understand "Real World" usage patterns. This helps us weight our categories. If 80% of our users are searching for "Free Plan," we increase the weight of the "Free Tier Generosity" score in our algorithm.
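A hedged sketch of that reweighting, where the `sensitivity` knob and the category names are illustrative assumptions:

```python
def reweight(base_weights, search_share, sensitivity=0.5):
    """Nudge category weights toward what users actually search for,
    then renormalize so the weights still sum to 1. Illustrative only."""
    adjusted = {
        k: base_weights[k] * (1 + sensitivity * search_share.get(k, 0.0))
        for k in base_weights
    }
    total = sum(adjusted.values())
    return {k: v / total for k, v in adjusted.items()}
```

If 80% of searches mention the free plan, "Free Tier Generosity" gains weight at the expense of the other categories.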

The "Wisdom of the Crowd" vs. "The Mob"

User data is noisy. A user might give a tool a 1-star review because they forgot their password. To separate signal from noise, we use **Cohort Analysis**: we group reviewers by behavior (for example, how much they actually used the tool) and weight each cohort's feedback accordingly.
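One way to picture cohort filtering; the cohort names and weights here are illustrative, not our production values:

```python
from statistics import mean

def cohort_filtered_rating(reviews):
    """Average ratings per cohort, then weight each cohort by how much
    its members actually used the tool. Illustrative sketch only."""
    weights = {"power_user": 1.0, "casual": 0.6, "never_activated": 0.1}
    by_cohort = {}
    for r in reviews:
        by_cohort.setdefault(r["cohort"], []).append(r["stars"])
    num = sum(weights.get(c, 0.5) * mean(stars) for c, stars in by_cohort.items())
    den = sum(weights.get(c, 0.5) for c in by_cohort)
    return num / den
```

A 1-star rating from someone who never activated the tool barely moves the aggregate; the same rating from a power user moves it a lot.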

Privacy Commitment

We never use individual user data to influence a specific review. We look at aggregate cohorts only. We do not track individual browsing history across the web. For comprehensive details, see our Editorial Ethics.

5. The Data Lifecycle: From Collection to Deletion

We believe in **Data Minimization**. We collect only the data we need, and we delete it when we no longer need it. Here is the lifecycle of a data point at WhichAIPick:

Phase 1: Ingestion

Data enters our system via our automated scrapers (for pricing), our API harness (for benchmarks), or our user feedback forms. All incoming data is timestamped and source-tagged.

Phase 2: Normalization

Raw data is messy. One vendor lists pricing as "$20/mo", another as "$240/yr", another as "0.003 cents/token". Our normalization engine converts all these values into a standard **Monthly Cost of Ownership** metric to allow for apples-to-apples comparison. See Pricing Policy.
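A minimal sketch of such a normalizer, assuming a fixed monthly token budget for per-token plans; the budget and parsing rules are illustrative, not our production engine.

```python
import re

# Usage assumption for per-token plans; illustrative, not our real figure.
ASSUMED_MONTHLY_TOKENS = 1_000_000

def to_monthly_cost(price: str) -> float:
    """Normalize '$20/mo', '$240/yr', or '0.003 cents/token' to USD/month."""
    price = price.strip().lower()
    m = re.match(r"\$?([\d.]+)\s*(?:/|per\s*)(mo|month|yr|year)", price)
    if m:
        value, unit = float(m.group(1)), m.group(2)
        return value if unit.startswith("mo") else value / 12
    m = re.match(r"([\d.]+)\s*cents?\s*(?:/|per\s*)token", price)
    if m:
        return float(m.group(1)) / 100 * ASSUMED_MONTHLY_TOKENS
    raise ValueError(f"unrecognized price format: {price!r}")
```

Under these rules, "$240/yr" and "$20/mo" normalize to the same monthly cost, and "0.003 cents/token" becomes $30/month at the assumed usage level.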

Phase 3: Processing & Scoring

The normalized data is fed into our Algorithm (The "WhichAIPick Score"). This algorithm runs nightly. This means a change in a vendor's pricing today will be reflected in their score tomorrow morning.
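At its core, such a score is a weighted combination of normalized metrics. The metric names and weights below are placeholders, not our actual formula:

```python
def whichaipick_score(metrics, weights):
    """Combine normalized 0-100 metrics into one score.
    Illustrative: the real algorithm and weights are more involved."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[k] * metrics[k] for k in weights)
```

Because the inputs are refreshed nightly, a pricing change today shows up in this weighted sum tomorrow morning.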

Phase 4: Archival & Deletion

We retain historical pricing data for 24 months to power our "Price History" charts. After 24 months, granular data is aggregated into monthly averages and the raw data is purged. User telemetry data (clicks, dwell time) is anonymized after 90 days and deleted after 12 months.
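Those retention windows can be expressed as a small decision function; this is an illustration of the policy above, with months approximated as 30 days:

```python
from datetime import date, timedelta

def retention_action(kind, recorded_on, today):
    """Decide what happens to a data point under the retention windows
    in this policy. Sketch only; months approximated as 30 days."""
    age_days = (today - recorded_on).days
    if kind == "pricing":
        # Granular pricing kept 24 months, then aggregated and purged.
        return "aggregate_and_purge" if age_days > 24 * 30 else "retain"
    if kind == "telemetry":
        # Telemetry anonymized after 90 days, deleted after 12 months.
        if age_days > 12 * 30:
            return "delete"
        return "anonymize" if age_days > 90 else "retain"
    raise ValueError(f"unknown data kind: {kind!r}")
```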

6. Third-Party Subprocessors

We are a modern software company, which means we rely on cloud infrastructure. We are transparent about who handles our data, and we only work with subprocessors who meet our strict security standards.

| Subprocessor | Purpose | Location |
| --- | --- | --- |
| Amazon Web Services (AWS) | Hosting & Compute Infrastructure | USA (Virginia) |
| Cloudflare | CDN & DDoS Protection | Global (Edge) |
| PostgreSQL (Managed) | Database Storage | USA |
| Google Analytics 4 | Anonymized Web Traffic Analytics | Global |

We do NOT share data with data brokers, ad networks, or hedge funds.

7. Algorithmic Fairness & Bias Testing

Algorithms are opinions embedded in code. Our ranking algorithm is an opinion: "We prefer safe, cheap, fast tools." We are transparent about this bias. However, we strive to eliminate **Unintended Bias**.

The "Small Vendor" Bias

Review sites often favor big incumbents (Adobe, Microsoft) because they have higher search volume. To counter this, our algorithm includes a "Discovery Boost" for new tools that score highly on technical merit but have low brand awareness. This ensures that a brilliant tool built by a 2-person team in Estonia has a chance to outrank a mediocre tool built by Google. Discover these gems in our New Arrivals.
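One simple way to implement such a boost; the merit threshold and boost size are illustrative assumptions:

```python
def discovery_boost(merit_score, brand_awareness, threshold=80, max_boost=10):
    """Boost tools with high technical merit but low brand awareness.
    Illustrative parameters: merit on 0-100, awareness on 0.0-1.0."""
    if merit_score < threshold:
        return 0.0
    # The less known the tool, the larger the boost.
    return max_boost * (1.0 - brand_awareness)
```

A brilliant unknown tool (merit 92, awareness 0.05) gets nearly the full boost, while a mediocre incumbent gets none.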

The "English Language" Bias

Currently, our natural language processing tests are predominantly in English. We acknowledge this limitation. We are actively working on a multi-lingual benchmark suite to fairly evaluate tools that specialize in Spanish, French, and Mandarin generation.

8. Data Limitations and Known Unknowns

We pride ourselves on accuracy, but we are engineers, not magicians. Our data has limitations, and you should be aware of them.

9. How to Request Your Data

If you have created an account with WhichAIPick (for saving tools or creating alerts), you have the right to export or delete your data.

To Export: Email privacy@whichaipick.com with the subject "Data Export Request." We will provide a JSON file of your data within 14 days.

To Delete: Email privacy@whichaipick.com with the subject "Account Deletion." We will purge your data from our active database immediately and from our backups within 30 days.

10. Glossary of Data Terms

Understanding data privacy requires understanding the jargon.

PII (Personally Identifiable Information)
Any data that can identify you: Name, Email, IP Address, Phone Number. We strip this.
Pseudonymization
Replacing your name with a random ID (e.g., "User 8492"). The data is still there, but it's not linked to "John Doe."
Differential Privacy
A mathematical technique that adds "noise" to a dataset so that statistical trends can be analyzed without revealing any single individual's data.
Data Lake
A centralized repository where we store structured and unstructured data at scale. Ours is encrypted.
Clean Room
A secure environment where data is analyzed without leaving the secure server. No data ever "leaves" the clean room.
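To make the Differential Privacy entry concrete, here is the classic Laplace mechanism for a counting query. This is a textbook sketch, not necessarily the exact mechanism in our pipeline:

```python
import math
import random

def dp_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon: the
    standard Laplace mechanism for a counting query (sensitivity 1).
    Smaller epsilon means more noise and stronger privacy."""
    # Sample Laplace(0, 1/epsilon) by inverse-CDF from a uniform draw.
    u = rng.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise
```

The released value is close to the true count, but no single individual's presence or absence can be inferred from it.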

Version Notes:
v1.4 (2026-02-19): Final Lock. Added "Limitations of Data" and Compliance Contact.
v1.3 (2026-02-19): Added Glossary of Data Terms.
v1.2 (2026-02-19): Hyper-expanded "Data Lifecycle" and "Subprocessors" sections. Added "Algorithmic Fairness" and "Synthetic Benchmarking" details.
v1.1 (2026-02-19): Added "Primary vs Secondary" data classification spectrum.
v1.0 (2024-02-01): Initial data transparency statement.