Index

06Field ReportEvent study · NLP

Crypto Tweet Impact

A statistically rigorous data-science pipeline and event study that evaluates the market impact of ~20,000 LLM-classified tweets on Bitcoin and Dogecoin prices using multiple-testing corrections and an interactive Shiny dashboard.

Overview

This is a reproducible, statistically rigorous study of a deceptively simple question: do Elon Musk's cryptocurrency tweets actually move the coin? I built an end-to-end R data-science pipeline that pairs roughly 20,000 LLM-classified Musk tweets with daily price histories for Dogecoin (DOGE/USD) and Bitcoin (BTC/USD), then runs a full event-study framework over them — multi-horizon returns, a seeded random baseline control, formal hypothesis tests, effect sizes, multiple-testing correction, and a power analysis. It was completed for CSP 571 (Data Preparation & Analysis) at the Illinois Institute of Technology.

The point of difference is the statistical hygiene. Most "tweet vs. price" analyses stop at a correlation and a scatter plot; this one reports Cohen's d effect sizes and confidence intervals alongside every p-value, corrects across 30+ category tests with both Bonferroni and Benjamini–Hochberg FDR, and quantifies what the small sample can and cannot detect. The headline result is itself the lesson: most signals that look significant disappear once you correct for multiple comparisons. The deliverables are a self-contained Quarto HTML report (~2,300 lines), an interactive Shiny dashboard, and compiled LaTeX PDF reports.

Problem and Goal

High-profile social-media activity is widely believed to move crypto markets, and Musk's tweets — especially about the meme-origin coin Dogecoin — are the canonical example. But "everyone knows it moves the price" is an empirical claim, not a proven one, and the popular evidence for it tends to be anecdotal: a chart with a tweet annotated next to a green candle. My goal was to test the claim properly — to treat each tweet as an event, measure the realized forward returns, compare them against a fair baseline, and apply the kind of correction and effect-size reporting that separates a real finding from noise. The honest answer mattered more than a clickbait one, so the analysis is candid throughout about statistical power and the difference between statistical and practical significance.

Key Findings

Five research questions drove the study; the results are deliberately unglamorous, and that is the contribution:

  • Does sentiment predict next-day returns? No meaningful effect for either coin (BTC: r ≈ 0.12, p = 0.22; DOGE: r ≈ −0.12, p = 0.38).
  • Do negative-sentiment DOGE tweets precede declines? The opposite — a contrarian signal: mean +4.27% 7-day return (p = 0.0097, 95% CI [1.1%, 7.4%]).
  • Do meme vs. "relevant" tweets behave differently? Strongly: meme DOGE tweets averaged +6.9% over 7 days while contextually relevant ones averaged −6.2%.
  • Does tweet volume drive volatility? A tentative positive trend, but not statistically significant (BTC p ≈ 0.15).
  • What survives multiple-testing correction? Almost nothing — most raw-significant signals fail Bonferroni/FDR.

The most defensible takeaway is methodological: with only ~58 verified Doge tweets, statistical power is low, so rigorous correction is essential to avoid false discoveries — and the analysis says so explicitly rather than overselling the contrarian effect it found.

Analytical Architecture & Methodology

The pipeline runs end-to-end inside a single Quarto notebook (analysis/data_analysis.qmd), with two data sources feeding a join keyed on date: ~20k LLM-classified tweets on one side, and daily OHLCV price series on the other. Price data is fetched from Yahoo Finance via quantmod and auto-regenerates if the committed CSVs are ever missing, so the report renders from a clean checkout without manual data wrangling.

Crypto tweet-impact analysis pipeline diagram
Figure 1: End-to-end pipeline — LLM-classified tweets and Yahoo Finance prices joined by date, run through a multi-horizon event study against a random baseline, corrected with Bonferroni/FDR plus a power analysis, and rendered to a Quarto report, Shiny dashboard, and LaTeX PDFs.

The seven stages, in order:

  • Ingest: load tweets + prices; auto-fetch DOGE/BTC daily series from Yahoo Finance if absent.
  • Clean & filter: keep relevant, non-meme, valid-sentiment tweets per coin; de-duplicate; bucket sentiment into low/medium/high.
  • Feature-engineer: compute percentage returns at 1-, 3-, 7-, and 30-day horizons.
  • Event study: compute tweet-event returns and a seeded random baseline control group, so "before/after a tweet" is benchmarked against "a random day" rather than against nothing.
  • Test: Welch t-tests and Mann-Whitney U vs. baseline; one-sample t-tests for negative-sentiment events; correlation and multiple regression of returns on sentiment and volume.
  • Correct & quantify: Bonferroni + Benjamini–Hochberg FDR adjustment across all category tests; Cohen's d with confidence intervals; power and minimum-detectable-effect analysis with pwr.
  • Report: narrative interpretation, summary tables (kableExtra), and visualizations (ggplot2).

LLM-in-the-Loop Tweet Classification

Raw price history is easy; the hard input was turning ~20,000 free-text Musk posts into structured, analysis-ready signal. The labeling pipeline produces sentiment_classified.csv (~20,370 rows) through a multi-step process: a crypto keyword filter (bitcoin, btc, ethereum, doge, crypto, …) narrows the scrape, then the OpenAI API scores each tweet for sentiment (0–10) and classifies it on two further axes: contextually relevant vs. joke/meme, and a topic category (Bitcoin, Doge Coin, Alt coins, Ethereum/smart contracts, general crypto, exchanges/trading, and others).

Critically, the two primary categories — Bitcoin and Doge Coin — were manually verified by hand rather than trusted blindly from the model output. That manual-verification step is what makes the per-coin sentiment interpretation defensible, and it is exactly why the verified DOGE sample is small (~58 tweets): the analysis treats label quality as worth more than label quantity. The labeling is run as an offline pipeline and its outputs are committed to data/processed/, so the statistical analysis is fully reproducible without re-spending API calls.

Statistical Rigor

This is the core of the project and the part most directly aimed at a data-science audience. Rather than reporting a single test, the study layers several methods so each finding has to survive multiple, complementary checks:

  • Event-study design with a control group: tweet-event returns are compared against a seeded random baseline sampled from the same price series, so the comparison is "tweet day vs. typical day," not "tweet day vs. zero."
  • Parametric + non-parametric tests: Welch's t-test (unequal variances) is paired with the Mann-Whitney U rank test, so conclusions don't hinge on a normality assumption the return distributions may not satisfy.
  • Effect sizes, not just p-values: Cohen's d with confidence intervals accompanies every comparison, with an explicit discussion of statistical vs. practical significance.
  • Multiple-testing correction: both Bonferroni (family-wise error) and Benjamini–Hochberg FDR are applied across 30+ category tests — the single most important guard against false discovery in this design.
  • Power analysis: the pwr package quantifies what the ~58-tweet sample can detect, reporting minimum detectable effects so the reader knows when a null result means "no effect" vs. "not enough data."
Sentiment score plotted against subsequent returns
Figure 2: Tweet sentiment vs. subsequent returns. The weak, non-significant relationship is the result — visual evidence that sentiment is not a usable predictor of next-day price movement for either coin.

Tweet-Volume and Volatility

Beyond directional return, the Bitcoin analysis tests whether the volume of Musk's crypto tweets is associated with subsequent price volatility — a separate hypothesis from "does sentiment predict direction." The relationship shows a tentative positive trend but does not reach statistical significance (p ≈ 0.15), and the report presents it as such rather than rounding it up to a finding. This is representative of the project's discipline: a suggestive-but-not-significant result is reported as suggestive-but-not-significant.

Tweet volume plotted against subsequent volatility
Figure 3: Tweet volume vs. subsequent volatility for BTC — a positive trend that fails to clear the significance threshold once tested formally.

Interactive Shiny Dashboard

To make the event study explorable rather than static, I built a self-contained Shiny dashboard (dashboard/app.R) that lets you step through individual Bitcoin tweets and inspect the BTC price reaction. For each tweet it shows the tweet text, the 30-day forward closing-price path, and the realized 1-/7-/30-day returns. The app joins the categorized-tweet dataset to the price series in-process, and degrades gracefully when the large, git-ignored raw tweet-text file is absent — it simply hides the tweet text instead of crashing, so the dashboard runs from a clean clone.

Shiny dashboard showing a Bitcoin tweet and its 30-day BTC price path
Figure 4: The tweet-effects dashboard. A 2018 Musk tweet about AI is shown alongside the BTC closing-price path over the following 30 days, with a −3.06% next-day, +0.51% 7-day, and −33.19% 30-day return — illustrating how widely the realized outcome can swing from a single tweet.
Shiny dashboard showing a second Bitcoin tweet and its 30-day price reaction
Figure 5: A second tweet's 30-day reaction (+1.30% next-day, −2.51% 7-day, +2.07% 30-day). Browsing tweet-by-tweet makes the central finding tangible — the sign and size of the "reaction" is mostly noise.

Engineering & Reproducibility

The repository is structured as a portfolio-grade, reproducible project rather than a loose notebook dump. A Makefile exposes the full lifecycle — make deps, make report, make dashboard, make reports-pdf, make clean — and a one-shot scripts/install_dependencies.R installs every package the analysis and dashboard need. The data directory is split into processed/ (small, committed, analysis-ready) and raw/ (large source scrapes, git-ignored), with a documented data dictionary and provenance trail for every column.

  • Deterministic results: random seeds are fixed so the baseline control group — and therefore every reported figure — is reproducible across runs.
  • Self-contained output: the Quarto report renders with embed-resources: true to a single HTML file with all assets inlined, so it can be shared as one artifact.
  • Graceful data fallbacks: price CSVs auto-download from Yahoo Finance if missing; the dashboard runs without the git-ignored raw text file.
  • Path-robust code: the dashboard resolves its data directory whether launched from the repo root or from inside dashboard/.
  • Separation of sources and deliverables: editable LaTeX sources live in docs/, while the compiled PDF proposal and phase-2 report live in reports/.

Engineering Challenges Solved

  • ~20k free-text tweets into structured signal: built a keyword-filter → LLM-sentiment → LLM-classification pipeline, then hand-verified the two coins that the conclusions actually depend on — trading sample size for label trustworthiness.
  • Multiple-comparison false discovery: testing 30+ tweet categories invites spurious "significant" hits. Applied Bonferroni and FDR correction so the reported conclusions reflect what survives, not what looked good once.
  • Low statistical power, honestly handled: with ~58 verified DOGE tweets, a null result is ambiguous. Added a pwr power analysis with minimum-detectable-effect estimates so the reader can tell "no effect" from "not enough data."
  • Fair baseline for an event study: "returns after a tweet" means nothing without a control. Constructed a seeded random baseline from the same price series so every event return is measured against a typical-day distribution.
  • Reproducibility from a cold checkout: committed the processed datasets, fixed seeds, and made price data auto-fetch — so the report and dashboard run without re-spending OpenAI calls or manual data steps.

Tech Stack

  • Language: R 4.x.
  • Data wrangling: tidyverse (dplyr, tidyr, readr, stringr, purrr), lubridate.
  • Modeling & statistics: broom, tseries, base stats (Welch t-tests, Mann-Whitney U, one-sample t, lm, cor), pwr for power analysis.
  • Market data: quantmod (Yahoo Finance ingestion).
  • Reporting: Quarto, knitr, kableExtra; LaTeX for compiled PDF reports.
  • Interactive app: Shiny, ggplot2.
  • Sentiment & classification: OpenAI API (offline labeling pipeline; outputs committed to data/processed/).
  • Tooling: GNU Make, Git, RStudio project.

The pipeline is intentionally modular: each stage emits an inspectable intermediate artifact (classified tweets, per-horizon returns, baseline draws, test tables), so any single step can be re-run, audited, or swapped without rebuilding the whole study.

Limitations & Future Directions

The report is candid about its constraints — and treats that candor as part of the contribution rather than a disclaimer to bury:

  • Small verified sample (~58 Doge tweets) limits statistical power — quantified, not hand-waved, via the power analysis.
  • Observational design yields associations, not causation; many market confounders are unmodeled.
  • Temporal dependence in returns is not fully captured (no formal time-series model).
  • Random baseline does not perfectly match the market conditions present at tweet times.

Each limitation is paired with a concrete next step in the report: larger and refreshed samples, condition-matched baselines, explicit time-series methods, and causal-inference techniques.

Key Learnings

  • Correction changes the story. The single biggest lesson was watching a fistful of "significant" raw results evaporate under Bonferroni/FDR — a visceral demonstration of why multiple-testing discipline is non-negotiable.
  • Effect size and power belong next to every p-value. Reporting Cohen's d, confidence intervals, and minimum detectable effects turned ambiguous nulls into interpretable statements.
  • Label quality beats label quantity. Hand-verifying the two coins the conclusions depend on was worth more than a larger but noisier auto-labeled set.
  • Reproducibility is an engineering feature. Fixed seeds, committed processed data, auto-fetching prices, and a Make-driven lifecycle are what let this run end-to-end from a clean clone.

Conclusion

This project is the clearest demonstration of how I approach data science: take a popular, intuitive claim, design an honest test for it, and let the statistics — not the narrative — decide the verdict. It pairs a real LLM-in-the-loop data pipeline with a disciplined inferential framework (event study, non-parametric and parametric testing, effect sizes, multiple-testing correction, and power analysis), packages the result as a reproducible Quarto report and an interactive Shiny dashboard, and is candid about exactly what its sample can and cannot support. The conclusion that "most apparent tweet effects don't survive correction" is less exciting than a viral headline — and far more defensible, which is the entire point.