03 · Field Report · Experimentation · 20M+ events

Growth & Retention

An end-to-end data pipeline and machine learning system that processes 20M+ e-commerce events to validate A/B tests and predict cart abandonment in real time.

Overview

E-Commerce Growth & Retention is an end-to-end, production-grade analytics and machine-learning project that answers a concrete business question: does a redesigned checkout flow actually move revenue, and can we predict which shoppers are about to abandon their cart? I built the entire system solo, from raw clickstream ingestion to a live, interactive web app — deliberately spanning data engineering, analytics engineering, statistical inference, machine learning, and MLOps so the project reads as one continuous data pipeline rather than a single notebook.

Architecturally, data flows through a classic medallion pipeline. Over 20 million raw cosmetics-shop events (~4.6 GB across five months) are ingested and tagged with a simulated A/B experiment group, landed in Google Cloud Storage and BigQuery, and transformed and tested with dbt. The experiment is then statistically validated with a Chi-Square test, cart abandonment is modeled with an XGBoost classifier explained via SHAP, and predictions are served through a FastAPI inference service behind a multi-tab Streamlit front end. Both services are containerized with Docker, and the model's inputs are monitored for data drift with Evidently AI.

Problem and Goal

Cart abandonment is the single largest source of leaked revenue in e-commerce: shoppers add items, reach the checkout, and disappear. Two questions follow directly from that. First, the causal question — if we ship a new checkout experience, does it genuinely lift conversion, or are we fooling ourselves with noise? Second, the predictive question — given a live session, can we estimate the probability that this user will abandon their cart in time to intervene?

I set out to answer both rigorously. The causal question is handled with a properly randomized A/B test and a significance test rather than eyeballed dashboards; the predictive question is handled with an interpretable gradient-boosted model whose every prediction can be explained feature-by-feature. The goal was never just a model file — it was a defensible, deployable system that a product or growth team could actually act on.

The Dataset & Experiment Simulation

The foundation is the Kaggle "eCommerce Events History in a Cosmetics Shop" dataset — real, anonymized clickstream logs covering October 2019 through February 2020. Each row is a single user action (view, cart, purchase, remove_from_cart) with a product, brand, price, user ID, and session ID. Across five monthly files this is roughly 20 million events totaling ~4.6 GB — large enough that naive in-memory processing fails, forcing real data-engineering decisions.

Since I didn't have a real experiment to analyze, I simulated one rigorously. A Python script hashes each user_id with MD5 and assigns the user to Control (old checkout) or Treatment (new checkout) based on the hash parity. Hashing makes the assignment deterministic and stable — the same user always lands in the same group across all five months — which is exactly how production experiment-bucketing systems behave. I then injected a realistic causal effect: 5% of Treatment-group cart events are probabilistically converted into purchase events, encoding a genuine conversion uplift for the new checkout that the downstream statistical test has to rediscover from the data.
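A minimal sketch of the bucketing and uplift logic described above, using only the standard library. The function names, the even/odd mapping to Control/Treatment, and the fixed seed are assumptions made for illustration; only the MD5 parity rule and the 5% rate come from the description.

```python
import hashlib
import random


def assign_group(user_id: str) -> str:
    """Deterministically bucket a user by the parity of the MD5 hash of their ID."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    # Assumption: even hash value -> Control, odd -> Treatment.
    return "Control" if int(digest, 16) % 2 == 0 else "Treatment"


def inject_uplift(event_type: str, group: str, rng: random.Random, rate: float = 0.05) -> str:
    """Turn a Treatment-group cart event into a purchase with probability `rate`."""
    if group == "Treatment" and event_type == "cart" and rng.random() < rate:
        return "purchase"
    return event_type


rng = random.Random(42)  # fixed seed (an assumption) so the simulation is reproducible
events = [("u1", "cart"), ("u2", "view"), ("u1", "cart")]  # hypothetical rows
tagged = [
    (uid, inject_uplift(evt, assign_group(uid), rng), assign_group(uid))
    for uid, evt in events
]
```

Because the group depends only on the user ID, re-running the script on any monthly file reproduces the same assignment without storing a lookup table.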

Phase 1 — Data Engineering & Cloud Ingestion

The raw CSVs are far too large to load whole, so ingestion is streamed in one-million-row chunks with pandas. Each chunk is cleaned (NaN user IDs are coerced), tagged with its experiment group, and written incrementally to a processed directory — keeping memory flat regardless of file size.
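The chunked pattern can be sketched with pandas as follows. This is an illustration under assumptions, not the project's script: the function name, the column names, and the rule of replacing missing user IDs with the literal "unknown" are all hypothetical.

```python
import hashlib

import pandas as pd


def assign_group(user_id: str) -> str:
    # Hash-parity bucketing; the even/odd mapping is an assumption.
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return "Control" if int(digest, 16) % 2 == 0 else "Treatment"


def process_file(src_path: str, dst_path: str, chunksize: int = 1_000_000) -> int:
    """Stream a CSV in fixed-size chunks, clean and tag each one, and append it to dst_path."""
    rows_written = 0
    # Reading IDs as text keeps a missing value from turning the column into floats.
    reader = pd.read_csv(src_path, chunksize=chunksize, dtype={"user_id": str})
    for i, chunk in enumerate(reader):
        # Assumed cleaning rule: replace missing user IDs with "unknown".
        chunk["user_id"] = chunk["user_id"].fillna("unknown")
        chunk["experiment_group"] = chunk["user_id"].map(assign_group)
        # The first chunk creates the file with a header; later chunks append rows only.
        chunk.to_csv(dst_path, mode="w" if i == 0 else "a", header=(i == 0), index=False)
        rows_written += len(chunk)
    return rows_written
```

Only one chunk is held in memory at a time, so peak memory is governed by `chunksize` rather than by the size of the input file.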

The processed, experiment-tagged data is uploaded to a Google Cloud Storage data lake as partitioned files, then loaded into a BigQuery Bronze (raw) dataset (cosmetics_bronze.events). This Bronze layer is the immutable, single-source-of-truth landing zone that every downstream transformation builds on — establishing a clean separation between raw ingestion and modeled, business-ready data.

Phase 2 — Modern Data Stack: dbt Transformation & Testing

Transformation lives in a documented dbt project running on BigQuery, organized as a medallion architecture (Bronze → Silver → Gold) so each layer has a single, clear responsibility.

  • Silver (staging): stg_events is a view that casts every column to a strict type, normalizes timestamps, and filters out malformed rows (null or 'unknown' users, null sessions) — turning messy raw logs into a trustworthy clean layer.
  • Data quality tests: dbt schema tests enforce contracts directly in the pipeline — not_null on keys and timestamps, and an accepted_values test that guarantees event_type only ever contains view, cart, purchase, or remove_from_cart. A bad upstream load fails the build instead of silently corrupting analysis.
  • Gold (marts): business-ready tables built with advanced SQL.

The marquee Gold model is fact_user_sessions, which reconstructs user sessions purely in SQL using window functions: LAG() finds the gap to each user's previous event, a 30-minute inactivity timeout flags session boundaries, and a cumulative SUM() over that flag generates a stable per-user session ID. Each session is then aggregated into total views, carts, purchases, revenue, and duration. A second model, dim_users, rolls sessions up to the user grain to compute lifetime value, total sessions, and days-since-last-session, carrying each user's experiment group along for the analysis layer.
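The sessionization logic translates directly into a few lines of Python, which can help when reasoning about the SQL. The sketch below mirrors the three steps (previous-event lookup, timeout flag, running sum) on plain tuples. It is not the dbt model itself, and treating a gap of exactly 30 minutes as the same session is an assumption.

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(events):
    """Assign a per-user session sequence number to (user_id, timestamp) pairs."""
    out = []
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    for user_id, user_events in groupby(ordered, key=lambda e: e[0]):
        prev_ts = None   # plays the role of LAG(event_time) within the user partition
        session_seq = 0  # plays the role of the cumulative SUM() over the boundary flag
        for _, ts in user_events:
            # Boundary flag: first event for the user, or a gap longer than the timeout.
            is_new_session = prev_ts is None or (ts - prev_ts) > SESSION_TIMEOUT
            session_seq += int(is_new_session)
            out.append((user_id, ts, session_seq))
            prev_ts = ts
    return out


# Hypothetical events: u1 has a 50-minute gap before the last event.
sample = [
    ("u1", datetime(2019, 10, 1, 10, 0)),
    ("u1", datetime(2019, 10, 1, 10, 10)),
    ("u1", datetime(2019, 10, 1, 11, 0)),
    ("u2", datetime(2019, 10, 1, 10, 5)),
]
sessions = sessionize(sample)
```

Combining `user_id` with the sequence number gives an identifier that stays stable as long as the event history for that user does not change.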

Phase 3a — Statistical Validation: The A/B Test

With clean Gold tables in place, the causal question gets a proper answer. I pull per-group converted-vs-total user counts from dim_users and run a Chi-Square test of independence (scipy.stats.chi2_contingency) on the Control-vs-Treatment conversion contingency table.

The test returns a Chi-Square statistic and a p-value that I evaluate against a 0.05 significance threshold — turning "the new checkout looks better" into a defensible statistical claim: the lift is either significant or it isn't. Because the uplift was injected at the data layer, this closes the loop end-to-end — the experiment is designed, simulated, modeled, and then statistically recovered from the warehouse.
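To make the arithmetic behind the test concrete, here is a dependency-free sketch of the Pearson Chi-Square statistic on a 2x2 conversion table. It is a hand-rolled stand-in, not the scipy call used in the project: scipy.stats.chi2_contingency applies Yates' continuity correction to 2x2 tables by default, so its numbers will differ slightly from this uncorrected version. The counts in the usage line are hypothetical.

```python
import math


def chi_square_2x2(conv_a: int, total_a: int, conv_b: int, total_b: int):
    """Pearson chi-square test of independence for two groups (no continuity correction)."""
    observed = [
        [conv_a, total_a - conv_a],  # group A: converted, not converted
        [conv_b, total_b - conv_b],  # group B: converted, not converted
    ]
    n = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [conv_a + conv_b, n - conv_a - conv_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence: row total * column total / grand total.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts, for illustration only.
stat, p_value = chi_square_2x2(conv_a=100, total_a=1000, conv_b=150, total_b=1000)
significant = p_value < 0.05
```

The function assumes every expected count is positive; a group with no users, or a table with no conversions at all, would need to be handled before calling it.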

Phase 3b — Predicting Cart Abandonment with XGBoost

The predictive model is trained directly off the warehouse: a BigQuery query pulls every session that contained at least one cart event and labels it abandoned_cart = 1 when the session produced zero purchases. The feature set is deliberately lean and interpretable — total_views, total_carts, session_duration_minutes, and experiment_group — so the model's behavior maps cleanly onto real product levers.

I train an XGBoost binary classifier (n_estimators=100, max_depth=4, learning_rate=0.1) optimized for ROC-AUC, with a stratified train/test split to preserve class balance. The model is evaluated on held-out data with accuracy, ROC-AUC, and a full classification report, then serialized to a compact 170 KB .pkl file alongside a 1,000-row training sample reserved for the explainability layer.
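The labeling rule and the idea behind a stratified split can be sketched without any ML libraries. This is a plain-Python illustration, not the project's training code, which per the tech stack relies on scikit-learn for the split. The `total_purchases` field name, the 20% test fraction, and the seed are assumptions.

```python
import random
from collections import defaultdict


def label_sessions(sessions):
    """Keep sessions with at least one cart event; label 1 when no purchase occurred."""
    labeled = []
    for s in sessions:
        if s["total_carts"] >= 1:
            labeled.append({**s, "abandoned_cart": int(s["total_purchases"] == 0)})
    return labeled


def stratified_split(rows, label_key, test_fraction=0.2, seed=42):
    """Split rows so each class contributes the same fraction to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(len(members) * test_fraction)
        test.extend(members[:cut])
        train.extend(members[cut:])
    return train, test
```

Splitting within each class keeps the abandoned-to-converted ratio the same in both sets, so held-out metrics are not distorted by a lucky or unlucky draw.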

Phase 3c — Model Explainability with SHAP

A black-box probability is useless to a growth team — they need to know why. I wrap the trained model in a SHAP TreeExplainer and generate two complementary views. A global summary plot ranks features by overall impact and shows the direction of their effect across all users, while a local waterfall plot decomposes a single high-risk prediction into the exact contribution of each feature — making every individual score auditable.

Figure 1: SHAP global summary plot — each feature's overall impact on the model's cart-abandonment prediction, with point color encoding the feature value and horizontal spread encoding the magnitude of its effect.

Zooming from the global view to a single decision, the local waterfall plot below traces one high-risk session: starting from the model's base rate, each feature pushes the abandonment probability up or down until it lands on the final score — so a growth analyst can read off exactly which signals flagged that specific user.

Figure 2: SHAP local waterfall plot — a feature-by-feature breakdown of one individual prediction, showing how each input moves the score from the base value to the final abandonment probability.
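The additivity that a waterfall plot visualizes can be checked by hand. The sketch below assumes the explainer works in the model's raw log-odds space, which I believe is the default for tree explainers on a binary XGBoost classifier. Every number in it is made up for illustration and is not taken from the trained model.

```python
import math


def sigmoid(x: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Illustrative values only (assumptions, not model output).
base_value = -0.40  # the explainer's expected log-odds over the reference data
contributions = {
    "total_views": 0.35,
    "total_carts": 0.60,
    "session_duration_minutes": -0.15,
    "experiment_group": -0.10,
}

# SHAP values are additive: base value plus all contributions gives the model's raw output.
margin = base_value + sum(contributions.values())
probability = sigmoid(margin)
```

Under that assumption, each bar in the waterfall is a shift in log-odds, and the sigmoid is applied once at the end to read the result as a probability.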

Phase 4a — Real-Time Inference API (FastAPI)

The model is served behind a FastAPI backend that loads the serialized XGBoost model and SHAP explainer once at startup. The core POST /predict endpoint accepts a session's parameters as a typed Pydantic request and returns the abandonment probability. Critically, it also renders the per-prediction SHAP waterfall plot server-side and ships it back as a Base64-encoded PNG, so the front end gets both the score and its explanation in a single round trip.
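The single-round-trip idea comes down to packing binary image bytes into a JSON-safe string. The sketch below shows only that encoding step with the standard library. The field names and the placeholder bytes are hypothetical, and the FastAPI routing, model call, and plot rendering are left out.

```python
import base64
import json


def build_prediction_payload(probability: float, png_bytes: bytes) -> str:
    """Server side: bundle the score and the rendered plot into one JSON body."""
    return json.dumps({
        "abandonment_probability": probability,  # hypothetical field name
        "shap_plot_png_base64": base64.b64encode(png_bytes).decode("ascii"),  # hypothetical field name
    })


def decode_plot(payload: str) -> bytes:
    """Client side: recover the raw PNG bytes from the JSON body."""
    return base64.b64decode(json.loads(payload)["shap_plot_png_base64"])


# Placeholder bytes standing in for a rendered plot (PNG signature plus filler).
placeholder_png = b"\x89PNG\r\n\x1a\n" + b"placeholder image bytes"
payload = build_prediction_payload(0.73, placeholder_png)  # 0.73 is an illustrative score
```

Base64 inflates the image by roughly a third, which is usually an acceptable price for avoiding a second request or a file store for per-prediction plots.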

A second endpoint, POST /bulk_predict, accepts a list of sessions and vectorizes scoring across the whole batch — the backbone for the app's CSV-upload batch-scoring feature.

Phase 4b — Interactive Web App (Streamlit)

The user-facing layer is a multi-tab Streamlit application that turns the whole pipeline into something a non-technical stakeholder can explore:

  • Real-Time Predictor: sliders for views, carts, session duration, and checkout experience drive a live call to /predict; the app renders the probability, a color-coded risk verdict, and the SHAP explanation inline.
  • Live A/B Test Dashboard: queries the BigQuery Gold tables on demand and visualizes the experiment with Plotly — headline revenue and conversion-rate metrics with deltas, plus a Control-vs-Treatment conversion funnel.
  • Bulk Prediction: upload a CSV of sessions, score them all through /bulk_predict, preview the results, and download a scored file — the batch counterpart to the real-time tab.

Phase 4c — Containerization & Deployment

Both services are containerized with their own Dockerfiles (slim Python 3.12 base images) and wired together with docker-compose: the API exposes port 8000 with a health check, and the Streamlit front end on port 8501 talks to it over the internal Docker network via an API_URL environment variable. This two-service design — stateless model API plus thin UI — is exactly the shape that deploys cleanly to a serverless target like Google Cloud Run.
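On the front-end side, reading the API location from the environment might look like the sketch below. Only the API_URL variable name comes from the description; the helper name, the localhost fallback, and the `api` host name are assumptions.

```python
import os


def resolve_endpoint(path: str, default_base: str = "http://localhost:8000") -> str:
    """Build a full endpoint URL from the API_URL environment variable."""
    # Inside docker-compose the variable would point at the API service;
    # outside a container the default lets the app run against a local server.
    base = os.environ.get("API_URL", default_base).rstrip("/")
    return f"{base}/{path.lstrip('/')}"


os.environ["API_URL"] = "http://api:8000"  # hypothetical service name
predict_url = resolve_endpoint("/predict")
```

Keeping the address out of the code is what lets the same image run unchanged on a laptop, in compose, or on a managed platform.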

Phase 4d — Model Observability (Evidently AI)

A model in production silently decays as the world shifts under it, so the project closes with monitoring. Using Evidently AI, I generate a Data Drift report that compares the original training distribution (the saved reference sample) against incoming production-like data, flagging when feature distributions such as cart counts or session duration have drifted far enough to warrant retraining. The output is a shareable HTML report — the early-warning system that keeps the deployed model honest.
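To illustrate what a drift check measures, here is a hand-rolled two-sample Kolmogorov-Smirnov statistic. It is a swapped-in technique for explanation only: it is not Evidently's API and not necessarily the test Evidently applies, and the 0.1 threshold and sample values are arbitrary assumptions.

```python
def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two non-empty samples."""
    ref, cur = sorted(reference), sorted(current)
    i = j = 0
    max_gap = 0.0
    while i < len(ref) and j < len(cur):
        x = min(ref[i], cur[j])
        # Advance both pointers past every value <= x, so ties are handled together.
        while i < len(ref) and ref[i] <= x:
            i += 1
        while j < len(cur) and cur[j] <= x:
            j += 1
        max_gap = max(max_gap, abs(i / len(ref) - j / len(cur)))
    return max_gap


DRIFT_THRESHOLD = 0.1  # arbitrary cut-off chosen for illustration

reference_carts = [1, 1, 2, 2, 3, 3, 4]    # hypothetical training sample
incoming_carts = [4, 5, 5, 6, 6, 7, 8]     # hypothetical production sample
drifted = ks_statistic(reference_carts, incoming_carts) > DRIFT_THRESHOLD
```

A statistic near 0 means the two samples are distributed alike, while a value near 1 means they barely overlap. A monitoring tool runs a comparison of this kind per feature and summarizes the results.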

Engineering Challenges Solved

  • Multi-gigabyte ingestion on a laptop: 4.6 GB of raw events won't fit in memory. Solved with chunked, streaming reads and incremental writes so processing stays memory-flat regardless of dataset size.
  • Deterministic experiment assignment: users must stay in the same A/B bucket across five months of data. Solved by hashing user_id rather than random assignment, mirroring real production bucketing systems.
  • Sessionization in pure SQL: raw logs have no session concept tied to inactivity. Reconstructed sessions entirely in the warehouse with LAG(), a 30-minute timeout flag, and a cumulative-sum session ID — no Python loop required.
  • Causal claims, not vibes: a conversion-rate lift could be random noise. Backed every comparison with a Chi-Square significance test so the result is defensible.
  • Explainability as a first-class output: a bare probability isn't actionable. Rendered SHAP waterfall plots server-side and shipped them with each prediction, so every score arrives with its own justification.
  • Decoupled, deployable services: kept the model API and the UI as separate containers communicating over the network, avoiding a monolith and keeping the system cloud-deployment-ready.

Tech Stack

  • Languages: Python 3.12, advanced SQL (BigQuery dialect).
  • Data Engineering: pandas (chunked processing), Google Cloud Storage (data lake), Google BigQuery (warehouse).
  • Analytics Engineering: dbt (medallion modeling, window functions, schema tests & documentation).
  • Statistics: SciPy (Chi-Square test of independence), hypothesis testing & p-value interpretation.
  • Machine Learning: XGBoost (gradient-boosted classifier), scikit-learn (split & metrics), SHAP (global + local explainability).
  • Backend & App: FastAPI + Uvicorn, Pydantic, Streamlit, Plotly, Matplotlib.
  • Infra & MLOps: Docker + docker-compose, Google Cloud Run (target), Evidently AI (data-drift monitoring).

Every stage emits an inspectable, swappable artifact — Bronze tables, tested Gold models, a serialized model, SHAP plots, and drift reports — so any single component can be upgraded without rebuilding the rest of the pipeline.

Outcomes & Impact

  • Delivered a complete, end-to-end data product spanning ingestion, warehousing, analytics engineering, statistics, ML, and deployment — designed and built solo.
  • Processed 20M+ raw events (~4.6 GB) through a chunked pipeline into a tested, documented dbt warehouse on BigQuery.
  • Reconstructed user sessions entirely in SQL with window functions and a 30-minute inactivity rule, then rolled them into session- and user-grain Gold tables.
  • Validated the checkout experiment with a Chi-Square significance test, turning a simulated uplift into a statistically defensible result.
  • Trained an XGBoost cart-abandonment classifier optimized for ROC-AUC and made every prediction auditable with SHAP global and local explanations.
  • Shipped a three-tab Streamlit app (real-time prediction, live BigQuery A/B dashboard, bulk CSV scoring) over a Dockerized FastAPI backend, with Evidently AI drift monitoring.

Conclusion

E-Commerce Growth & Retention is the project I point to when I want to show that I can own a data problem from raw bytes all the way to a deployed, explainable product. It pairs the analyst's instincts — framing a business question, designing an experiment, proving a result with statistics — with the engineer's discipline of building a tested data warehouse, an interpretable model, and a containerized service around it. It's the clearest demonstration of how I think about data: not as an isolated notebook, but as a system that moves a real business metric and can defend every number it produces.