SoBack.ai — Data Pipeline Architecture

How we predict when AI founders are ready to build again. Internal reference document, Feb 2026.

1. The Core Thesis

Founders whose startups wind down and who take jobs at Big Tech return to founding at high, predictable rates. The academic research backs this up:

Gompers, Kovner, Lerner & Scharfstein — "Performance Persistence in Entrepreneurship" (Harvard Business School / Journal of Financial Economics, 2010)
Previously successful founders have a 30% chance of success on their next venture vs. 18% for first-timers. Key finding: entrepreneurs exhibit persistence in selecting the right industry and time to start new ventures. This is the academic foundation for why timing prediction is possible.
Daniel Kim — "Predictable Exodus: Startup Acquisitions and Employee Departures" (Wharton, 2024)
33% of acquired workers leave within the first year, vs. 12% of regular hires with similar skills. Leverages U.S. Census Bureau administrative datasets across all U.S. high-tech startups. This proves the departure pattern is statistically real and predictable.
Stanford GSB Founder Effect Study
One-third of Stanford GSB alumni founders are serial entrepreneurs (2+ ventures). 83% of serial founders kept their first firm open while starting their second. 52% of MBA-founded companies launch 3+ years after graduation. The "refill and return" pattern is the norm, not the exception.
Target academic partnerships: Paul Gompers (HBS), Daniel Kim (Wharton), and the Kauffman Foundation all have datasets that could validate and refine the SoBack model. A research partnership here could produce a published paper AND a proprietary dataset.

2. Data Sources — The Signal Stack

Organized by reliability and legal accessibility. We never scrape LinkedIn directly (ToS violation, active litigation). Everything below is public or API-accessible.

Primary Sources (Refresh: Daily)

- YC Startup Directory
- Crunchbase API
- GitHub Public Activity
- Conference Speaker Lists
- SEC EDGAR (M&A filings)
- AngelList / Wellfound

Secondary Sources (Refresh: Weekly)

- Twitter/X (Tech Alerts feed + individual accounts)
- Tech News APIs (TechCrunch, The Information, Bloomberg)
- Podcast Guest Lists (AI-focused shows)
- PitchBook (if licensed)
- Happenstance.ai
- Google Scholar (new publications)

Experimental Sources (Refresh: Ad-hoc)

- Reddit (r/ExperiencedDevs, r/startups, r/cscareerquestions)
- Blind (anonymous Big Tech posts)
- Meetup.com / Luma (AI event RSVPs)
- Wayback Machine (LinkedIn profile change detection)
- Patent filings (new individual patents outside employer)
- Domain registrations (new personal domains)

3. Pipeline Architecture

Phase 1: Founder Identification & Base Profile

WEEK 1-2

Build the initial database of 2,000+ AI-relevant YC founders.

Key insight: The YC directory already flags companies as "Inactive" vs "Active." We can retroactively scrape Wayback Machine snapshots of the YC directory to determine WHEN a company went inactive. This gives us a timeline we can't get anywhere else.
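Pulling the capture history for a YC company page is a single call to the Wayback Machine's CDX API (the endpoint, query parameters, and JSON row format below are the real CDX interface; the company slug in the usage note is a hypothetical placeholder). A minimal sketch:

```python
import json
import urllib.request

CDX_API = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(page_url, from_date="2015", to_date="2026"):
    """Build a CDX query that returns only capture timestamps for `page_url`."""
    return (f"{CDX_API}?url={page_url}&output=json&fl=timestamp"
            f"&filter=statuscode:200&from={from_date}&to={to_date}")

def parse_cdx_timestamps(rows):
    """CDX JSON output is a list of rows whose first row is the header;
    return the remaining rows' timestamps (YYYYMMDDhhmmss), oldest first."""
    return [row[0] for row in rows[1:]]

def fetch_snapshot_timestamps(page_url):
    """Fetch every archived capture timestamp for a page (network call)."""
    with urllib.request.urlopen(cdx_query_url(page_url)) as resp:
        return parse_cdx_timestamps(json.load(resp))
```

Usage (hypothetical slug): `fetch_snapshot_timestamps("www.ycombinator.com/companies/examplecorp")`. Each returned timestamp is a directory snapshot we can re-scrape to check the company's Active/Inactive flag at that point in time.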

Phase 2: Static Signal Layer

WEEK 2-3

Overlay the data that doesn't change frequently but is highly predictive.

| Signal | Source | What It Tells Us | Weight |
| --- | --- | --- | --- |
| Company exit type | Crunchbase, SEC | Acqui-hire vs. asset sale vs. shutdown. Acqui-hires have the most predictable departure timelines. | HIGH |
| Time since company ended | YC Directory + Wayback | The "refill clock." Most founders need 18-36 months to financially recover and mentally reset. | HIGH |
| Vesting timeline | Standard 4yr/1yr cliff model | Estimated vesting cliff based on hire date at Big Tech employer. Most departures cluster at the 2yr and 4yr marks. | HIGH |
| Founder type | YC Directory, Crunchbase | Technical vs. non-technical founder. Technical founders tend to return faster (they can build alone). | MED |
| Previous founding count | Crunchbase | Serial founders (2+ companies) return at ~2x the rate of first-time founders. | MED |
| Historical tenure patterns | Happenstance, Crunchbase | How long has this person stayed at previous jobs? A short-tenure history = higher departure likelihood. | MED |
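The vesting-timeline signal is mechanical once we have a hire date: under the standard 4yr/1yr-cliff assumption, the milestone dates follow directly. A minimal sketch (the 90-day lookahead window is an illustrative assumption, not decided policy):

```python
from datetime import date

def vesting_milestones(hire_date):
    """Estimated milestones under a standard 4yr/1yr-cliff grant: the 1yr
    cliff, plus the 2yr and 4yr marks where departures cluster."""
    def add_years(d, n):
        try:
            return d.replace(year=d.year + n)
        except ValueError:  # hire date of Feb 29 landing on a non-leap year
            return d.replace(year=d.year + n, day=28)
    return {"cliff_1yr": add_years(hire_date, 1),
            "mark_2yr": add_years(hire_date, 2),
            "mark_4yr": add_years(hire_date, 4)}

def near_vesting_milestone(hire_date, today, window_days=90):
    """True if `today` falls within `window_days` before a 2yr or 4yr mark."""
    m = vesting_milestones(hire_date)
    return any(0 <= (m[k] - today).days <= window_days
               for k in ("mark_2yr", "mark_4yr"))
```

This keeps the static layer cheap to recompute daily: only the comparison against `today` changes, not the milestone dates themselves.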

Phase 3: Dynamic Signal Layer

WEEK 3-5

These are the real-time signals that move the readiness score up or down. Refreshed daily to weekly.

| Signal | Source | What It Tells Us | Weight |
| --- | --- | --- | --- |
| Social posting frequency change | Twitter/X API | Sudden increase in startup-related posts after months of silence. The "I'm thinking about building" signal. | HIGH |
| Angel investing activity | AngelList, Crunchbase, SEC | Writing checks = mentally back in the startup ecosystem. Strong leading indicator (3-6mo before departure). | HIGH |
| Conference speaking (non-employer) | Conference sites, Luma | Speaking at startup/AI events not affiliated with current employer = building personal brand for what's next. | MED |
| GitHub public activity spike | GitHub API | New public repos, increased commit frequency on personal projects. Building something on the side. | MED |
| Employer reorg / product shutdown | Tech news APIs, Twitter (Tech Alerts) | Their team got dissolved or their product got killed. Massive departure accelerant. | HIGH |
| Employer layoffs in their division | Layoffs.fyi, TrueUp, news | Even if they weren't laid off, layoffs in their org signal instability and reduce commitment. | MED |
| New domain registration | WHOIS lookups | Registered a new .ai or .com domain? Probably has an idea brewing. | LOW |
| Patent filing (personal) | USPTO | Filed a patent outside their employer's IP? Side project with serious intent. | LOW |
| AI meetup / event RSVPs | Luma, Meetup.com | Attending niche AI events as an individual (not representing their employer). | LOW |
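As one concrete example, the GitHub spike signal reduces to anomaly detection over weekly public-event counts. This sketch flags a spike when recent activity clearly exceeds the trailing baseline; the thresholds are illustrative assumptions, and in practice the counts would be aggregated from the GitHub public events API:

```python
def is_activity_spike(weekly_counts, recent_weeks=4, min_ratio=3.0, min_events=5):
    """weekly_counts: public-event counts per week, oldest to newest.
    Flags a spike when the recent-window average is at least `min_ratio`
    times the earlier baseline and clears an absolute floor (`min_events`)."""
    if len(weekly_counts) <= recent_weeks:
        return False  # not enough history to form a baseline
    baseline = weekly_counts[:-recent_weeks]
    recent = weekly_counts[-recent_weeks:]
    base_avg = sum(baseline) / len(baseline)
    recent_avg = sum(recent) / len(recent)
    if recent_avg < min_events:
        return False  # too little activity to mean anything
    if base_avg == 0:
        return True   # real activity after total silence is itself the signal
    return recent_avg / base_avg >= min_ratio
```

The same shape (trailing baseline vs. recent window) applies to the social-posting-frequency signal with posts per week instead of events per week.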

Phase 4: The SoBack Score

WEEK 5-6

Combine static and dynamic signals into a single readiness score (0-100).

SoBack Score =
      Static_Base_Score(exit_type, time_since_exit, vesting_timeline,
                        founder_type, serial_count, tenure_pattern) × 0.4
    + Dynamic_Signal_Score(social_change, angel_activity, conf_speaking,
                           github_spike, employer_reorg, employer_layoffs,
                           domain_reg, patent_filing, event_rsvps) × 0.6

// Static base provides the foundation (who is structurally likely to leave)
// Dynamic signals provide the timing (who is actively showing signs NOW)
// The 60/40 weighting favors recency: a high static score with no dynamic signals = "likely someday"
// A high dynamic score on top of a high static score = "MOVE NOW"
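A literal reading of the formula in code, assuming each sub-scorer already returns a value on the 0-100 scale. Only the 40/60 blend comes from the formula; the HIGH/MED/LOW-to-numeric weight mapping below is an illustrative assumption mirroring the signal tables:

```python
def weighted_score(signals, weights):
    """Weighted average of 0-100 sub-scores; `weights` maps signal name -> weight."""
    total_w = sum(weights[name] for name in signals)
    return sum(signals[name] * weights[name] for name in signals) / total_w

def soback_score(static_score, dynamic_score):
    """Blend per the 40/60 weighting: static = structural likelihood of
    leaving, dynamic = timing. Inputs and output are on a 0-100 scale."""
    return 0.4 * static_score + 0.6 * dynamic_score

# Illustrative numeric weights (HIGH=3, MED=2, LOW=1, per the signal tables):
STATIC_WEIGHTS = {"exit_type": 3, "time_since_exit": 3, "vesting_timeline": 3,
                  "founder_type": 2, "serial_count": 2, "tenure_pattern": 2}
```

Note the asymmetry this encodes: a founder scoring 100 on statics but 0 on dynamics tops out at 40 ("likely someday"), while strong dynamics alone can reach 60.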

Score Interpretation

| Score Range | Label | What It Means | Recommended Action |
| --- | --- | --- | --- |
| 90-100 | MOVE NOW | Multiple dynamic signals firing on top of a strong static profile. This person is actively preparing to leave. | Reach out immediately. Suggest the outreach angle. |
| 75-89 | HOT | Strong static profile with emerging dynamic signals. Vesting cliff approaching or employer instability detected. | Add to watchlist. Prepare outreach. Window opens in 1-3 months. |
| 50-74 | WARMING | Good static fundamentals but limited dynamic signals yet. Could tip at any time with a trigger event. | Monitor weekly. Will move to HOT when trigger events occur. |
| 25-49 | RESTING | In the refill phase. Probably still committed to current role. Static signals suggest eventual departure. | Long-term pipeline. Check quarterly. |
| 0-24 | SETTLED | No departure signals detected. May have transitioned to a career-employee mindset. | Deprioritize. Reassess if employer events occur. |
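The band boundaries map directly to code; a minimal sketch:

```python
def score_label(score):
    """Map a 0-100 SoBack Score to its band label per the interpretation table."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in [0, 100]")
    if score >= 90:
        return "MOVE NOW"
    if score >= 75:
        return "HOT"
    if score >= 50:
        return "WARMING"
    if score >= 25:
        return "RESTING"
    return "SETTLED"
```

Keeping this as a single pure function makes the bands trivial to recalibrate once the Phase 5 backtest suggests different cutoffs.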

Phase 5: Validation & Calibration

WEEK 6-8

Before launch, backtest the model against known departures.
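One way to structure that backtest, assuming we can assemble historical pairs of (score as computed at some past date, whether the founder actually departed within the horizon). The flagging threshold and the choice of precision/recall as metrics are assumptions, not decided policy:

```python
def backtest(labeled, threshold=75):
    """labeled: list of (score, departed) pairs, where `score` was computed
    as of a past date and `departed` is whether the founder actually left
    within the evaluation horizon. Returns precision and recall for the
    policy of flagging everyone scoring >= threshold."""
    tp = sum(1 for s, d in labeled if s >= threshold and d)
    fp = sum(1 for s, d in labeled if s >= threshold and not d)
    fn = sum(1 for s, d in labeled if s < threshold and d)
    precision = tp / (tp + fp) if tp + fp else 0.0  # flagged who actually left
    recall = tp / (tp + fn) if tp + fn else 0.0     # leavers we caught
    return {"precision": precision, "recall": recall}
```

Sweeping `threshold` over this function against the known-departure set is how the band boundaries in the interpretation table get calibrated.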

Phase 6: YC Directory Time Machine

ONGOING

One of our most distinctive data advantages: retroactively reconstructing when YC companies went inactive.

This "time machine" approach is a genuine competitive moat. No one else is systematically reconstructing the temporal history of YC company status changes. It's public data, but nobody's assembled it.
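Given the per-snapshot status observations collected in Phase 1, pinpointing when a company flipped to Inactive is a scan for the first transition. A sketch, assuming chronologically sorted (timestamp, status) pairs as input:

```python
def inactive_window(observations):
    """observations: chronologically sorted (timestamp, status) pairs pulled
    from archived YC directory snapshots. Returns the (last_seen_active,
    first_seen_inactive) timestamps bracketing the flip, or None if the
    company was never observed Active and then Inactive."""
    last_active = None
    for ts, status in observations:
        if status == "Active":
            last_active = ts
        elif status == "Inactive" and last_active is not None:
            return (last_active, ts)
    return None
```

The result is a window, not an exact date: the flip happened somewhere between the two timestamps, and the window narrows as snapshot density increases.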

4. Connection Graph (3.5M Connections)

Beyond the 2,000 tracked founders, we map their extended network to:

5. Legal & Ethical Guardrails

6. Build Cost Estimate

| Component | Effort | Notes |
| --- | --- | --- |
| YC Directory scraper + Wayback Time Machine | 1 week | Python + Wayback Machine CDX API. One-time build, then daily refresh. |
| Crunchbase / SEC / AngelList integrations | 1 week | API integrations. Crunchbase Pro license required (~$500/mo). |
| Happenstance.ai integration | 2-3 days | Depends on their API. May need manual exports initially. |
| Twitter/X + GitHub + Conference scrapers | 1 week | Twitter API ($100/mo for Basic). GitHub API is free. Conference sites need custom scrapers. |
| SoBack Score model (v1) | 2 weeks | Weighted scoring model. No ML needed for v1: just weighted heuristics calibrated against historical data. |
| API layer | 1 week | REST + structured data for agent consumption. Auth, rate limiting, usage metering. |
| Database + infrastructure | 3 days | Postgres + Redis. Hosted on Railway or Render. ~$50-100/mo. |
| Total MVP | 6-8 weeks | ~$2K/mo in data costs. One full-stack engineer. |