Kanishk Pansari
← Writing·January 15, 2025
PythonPandasEDAData EngineeringData Quality

What I Learned Building a CSV Analyzer - And Why Data Quality Is a $3 Trillion Problem

Bad data costs the global economy trillions every year. Building a tool to fix it taught me more about real-world data engineering than any course ever could.

Every year, bad data costs the global economy an estimated $3.1 trillion. That number comes from IBM Research, and it doesn't mean data that's missing - it means data that exists, looks fine, and is silently wrong. Wrong column types. Hidden nulls. Dates stored as strings. Numeric IDs with leading zeros stripped by Excel. Data that passes every surface-level check and fails every real-world use case.

I built the CSV Analyzer because I kept running the same diagnostic commands every time I opened a new dataset. Twelve lines of Pandas. The same twelve lines, every time. I turned them into a tool. What I didn't expect was that building that tool would teach me more about the actual problems in data engineering than two years of coursework.


The Problem Nobody Talks About: Data Arrives Broken

In every data science course, you start with a clean dataset. Titanic. Iris. House prices. These datasets are clean by design - they exist to teach algorithms, not data reality. In the real world, the data that reaches you has already been through five hands: someone collected it in a spreadsheet, someone else exported it to CSV, a third person renamed the columns, a fourth uploaded it to a shared drive, and a fifth emailed it to you. By the time it reaches your pd.read_csv(), it carries the fingerprints of every person who touched it.

The CSV Analyzer was built to surface all of that the moment you upload a file.


What the Tool Actually Does

Upload any CSV and within seconds you get a full structural report:

  • Column-level type inference - not just what Pandas thinks the type is, but what it should be, based on the actual values in the column
  • Null detection - including non-standard nulls: empty strings, "N/A", "-", "none", "unknown", and 0 used as a placeholder
  • Distribution analysis - value frequency, cardinality, outlier candidates
  • Data quality flags - columns that are likely identifiers being treated as numerics, dates stored as objects, mixed-type columns

No code. No setup. Just upload and read.


Three Technical Problems That Changed How I Think About Data

1. Pandas Lies About Column Types

Pandas is confident but wrong. A column containing ["0001", "0002", "0003"] gets read as int64 and becomes [1, 2, 3]. The leading zeros disappear silently. A column containing ["2024-01-15", "2024-02-20"] gets read as object unless you explicitly tell Pandas to parse dates - and even then, mixed date formats break the parser.

The fix was building a secondary type inference layer that runs after Pandas loads the data. It samples each column, applies regex patterns for common types (dates, IDs, phone numbers, currency), and flags mismatches between the inferred type and the stored type. This single addition catches the class of bugs that costs analytics teams hours every week.

Why it matters: Every model you train on incorrectly-typed data learns patterns from noise. A customer ID read as an integer becomes a number your model thinks is meaningful. A date read as a string becomes a categorical variable. Type errors propagate silently through every downstream computation.

2. Null Handling Is Not Binary

The standard null check - df.isnull().sum() - only catches NaN. It misses everything that a human encoded as null but didn't mark as null. Business data is full of these: "-" used when a field is unknown, "N/A" in a standardized export, 0 used as a placeholder in a numeric column where 0 is also a valid value.

I built a configurable null-equivalence layer. You define what counts as null for your domain (or use the defaults), and the analyzer re-runs the null statistics against those definitions. The result is a null rate that reflects business reality, not just technical encoding.

Why it matters: A column that shows 2% nulls in a standard check might show 18% nulls once you account for "-" and "N/A". That difference changes whether the column is usable, whether imputation is appropriate, and whether the data collection process needs to be fixed at the source. Getting this number right at the start of an analysis saves hours at the end.

3. Scale Breaks Everything Simple

The first version of the analyzer loaded the entire file into memory. On a 200MB CSV, it crashed. The fix was chunked reading - pd.read_csv(chunksize=10000) - computing statistics incrementally across chunks and merging the results. This required rethinking every statistic: instead of computing a mean on the full column, you compute chunk-level means and use a weighted average. Instead of computing a histogram on all values, you maintain a running frequency dictionary and truncate at the end.

Why it matters: Data engineering at scale is fundamentally about computing the right answer without loading everything into memory at once. Chunked processing, incremental aggregation, streaming statistics - these are the patterns that make the difference between a script that works on a laptop and a pipeline that works in production. Building this for a personal tool taught me the underlying logic that database engines and distributed compute systems like Spark are built on.


The Broader Impact: Democratizing Data Literacy

The CSV Analyzer isn't a tool for data scientists. Data scientists already know how to run these checks. It's a tool for the marketing manager who received a CSV from her analytics team and doesn't know if the numbers are trustworthy. It's for the operations director who has a spreadsheet of supplier data and needs to know if it's clean before he makes a procurement decision. It's for the junior analyst on his first week who doesn't yet know what questions to ask of a new dataset.

This is where data quality tools have an outsized impact. When someone without technical training can look at a report and see "this column has 23% non-standard nulls and the date column is being read as text," they can ask better questions. They can push back on bad data before it becomes a bad decision.

The global stakes are real. The IBM $3.1 trillion figure isn't abstract. It's the pharmaceutical company that ran a trial on a dataset with corrupted patient identifiers. It's the logistics firm whose delivery routes were optimized on data where 15% of addresses were malformed. It's every business decision made downstream of a column type that was wrong from day one.


What I'd Build Next

The current version tells you what is wrong with your data. The next version tells you why - tracing quality issues back to likely collection or encoding errors and suggesting fixes, not just flagging problems. That requires understanding the semantics of the data, not just its structure. It's a harder problem, and it's where data quality tools are heading.

I'm also working on a data quality score: a single 0–100 number that summarizes the overall health of a dataset before you dive into the details. A score that a non-technical stakeholder can read in two seconds and understand. That number doesn't exist yet in most tooling. It should.


What Building This Taught Me About Data Engineering

The gap between "this works on my machine" and "this works on someone else's data" is where real data engineering lives. Anyone can run df.describe(). The question is what you do when df.describe() gives you the wrong answer - when the tool you trust is lying to you because the data underneath it is broken.

Building something that other people use, with data you've never seen, on schemas you never designed, forces you to think about every edge case instead of just the common ones. It forces you to build for reality, not for the clean example in the documentation.

That's the difference between studying data and shipping it.