SoBack.ai — Data Pipeline Architecture

How we predict when AI founders are ready to build again. Internal reference document, Feb 2026.

1. The Core Thesis

Founders whose startups wind down and who take jobs at Big Tech return to founding at high, predictable rates. The academic research backs this up:

Gompers, Kovner, Lerner & Scharfstein — "Performance Persistence in Entrepreneurship" (Harvard Business School / Journal of Financial Economics, 2010)
Previously successful founders have a 30% chance of success on their next venture vs. 18% for first-timers. Key finding: entrepreneurs exhibit persistence in selecting the right industry and time to start new ventures. This is the academic foundation for why timing prediction is possible.
Daniel Kim — "Predictable Exodus: Startup Acquisitions and Employee Departures" (Wharton, 2024)
33% of acquired workers leave within the first year, vs. 12% of regular hires with similar skills. Leverages U.S. Census Bureau administrative datasets across all U.S. high-tech startups. This proves the departure pattern is statistically real and predictable.
Stanford GSB Founder Effect Study
One-third of Stanford GSB alumni founders are serial entrepreneurs (2+ ventures). 83% of serial founders kept their first firm open while starting their second. 52% of MBA-founded companies launch 3+ years after graduation. The "refill and return" pattern is the norm, not the exception.
Target academic partnerships: Paul Gompers (HBS), Daniel Kim (Wharton), and the Kauffman Foundation all have datasets that could validate and refine the SoBack model. A research partnership here could produce a published paper AND a proprietary dataset.

2. Data Sources — The Signal Stack

Organized by reliability and legal accessibility. We never scrape LinkedIn directly (ToS violation, active litigation). Everything below is public or API-accessible.

Primary Sources (Refresh: Daily)

- YC Startup Directory
- Crunchbase API
- GitHub Public Activity
- Conference Speaker Lists
- SEC EDGAR (M&A filings)
- AngelList / Wellfound

Secondary Sources (Refresh: Weekly)

- Twitter/X (Tech Alerts feed + individual accounts)
- Tech News APIs (TechCrunch, The Information, Bloomberg)
- Podcast Guest Lists (AI-focused shows)
- PitchBook (if licensed)
- Happenstance.ai
- Google Scholar (new publications)

Experimental Sources (Refresh: Ad-hoc)

- Reddit (r/ExperiencedDevs, r/startups, r/cscareerquestions)
- Blind (anonymous Big Tech posts)
- Meetup.com / Luma (AI event RSVPs)
- Wayback Machine (LinkedIn profile change detection)
- Patent filings (new individual patents outside employer)
- Domain registrations (new personal domains)

3. Pipeline Architecture

Phase 1: Founder Identification & Base Profile

WEEK 1-2

Build the initial database of 2,000+ AI-relevant YC founders.

Key insight: The YC directory already flags companies as "Inactive" vs "Active." We can retroactively scrape Wayback Machine snapshots of the YC directory to determine WHEN a company went inactive. This gives us a timeline we can't get anywhere else.
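Pulling the capture history for a YC company page is a single call to the Wayback Machine's CDX API (the endpoint, query parameters, and JSON row format below are the real CDX interface; the company slug in the usage note is a hypothetical placeholder). A minimal sketch:

```python
import json
import urllib.request

CDX_API = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(page_url, from_date="2015", to_date="2026"):
    """Build a CDX query that returns only capture timestamps for `page_url`."""
    return (f"{CDX_API}?url={page_url}&output=json&fl=timestamp"
            f"&filter=statuscode:200&from={from_date}&to={to_date}")

def parse_cdx_timestamps(rows):
    """CDX JSON output is a list of rows whose first row is the header;
    return the remaining rows' timestamps (YYYYMMDDhhmmss), oldest first."""
    return [row[0] for row in rows[1:]]

def fetch_snapshot_timestamps(page_url):
    """Fetch every archived capture timestamp for a page (network call)."""
    with urllib.request.urlopen(cdx_query_url(page_url)) as resp:
        return parse_cdx_timestamps(json.load(resp))
```

Usage (hypothetical slug): `fetch_snapshot_timestamps("www.ycombinator.com/companies/examplecorp")`. Each returned timestamp is a directory snapshot we can re-scrape to check the company's Active/Inactive flag at that point in time.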

Phase 2: Static Signal Layer

WEEK 2-3

Overlay the data that doesn't change frequently but is highly predictive.

| Signal | Source | What It Tells Us | Weight |
| --- | --- | --- | --- |
| Company exit type | Crunchbase, SEC | Acqui-hire vs. asset sale vs. shutdown. Acqui-hires have the most predictable departure timelines. | HIGH |
| Time since company ended | YC Directory + Wayback | The "refill clock." Most founders need 18-36 months to financially recover and mentally reset. | HIGH |
| Vesting timeline | Standard 4yr/1yr cliff model | Estimated vesting cliff based on hire date at Big Tech employer. Most departures cluster at the 2yr and 4yr marks. | HIGH |
| Founder type | YC Directory, Crunchbase | Technical vs. non-technical founder. Technical founders tend to return faster (they can build alone). | MED |
| Previous founding count | Crunchbase | Serial founders (2+ companies) return at ~2x the rate of first-time founders. | MED |
| Historical tenure patterns | Happenstance, Crunchbase | How long has this person stayed at previous jobs? A short-tenure history = higher departure likelihood. | MED |
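The vesting-timeline signal is mechanical once we have a hire date: under the standard 4yr/1yr-cliff assumption, the milestone dates follow directly. A minimal sketch (the 90-day lookahead window is an illustrative assumption, not decided policy):

```python
from datetime import date

def vesting_milestones(hire_date):
    """Estimated milestones under a standard 4yr/1yr-cliff grant: the 1yr
    cliff, plus the 2yr and 4yr marks where departures cluster."""
    def add_years(d, n):
        try:
            return d.replace(year=d.year + n)
        except ValueError:  # hire date of Feb 29 landing on a non-leap year
            return d.replace(year=d.year + n, day=28)
    return {"cliff_1yr": add_years(hire_date, 1),
            "mark_2yr": add_years(hire_date, 2),
            "mark_4yr": add_years(hire_date, 4)}

def near_vesting_milestone(hire_date, today, window_days=90):
    """True if `today` falls within `window_days` before a 2yr or 4yr mark."""
    m = vesting_milestones(hire_date)
    return any(0 <= (m[k] - today).days <= window_days
               for k in ("mark_2yr", "mark_4yr"))
```

This keeps the static layer cheap to recompute daily: only the comparison against `today` changes, not the milestone dates themselves.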

Phase 3: Dynamic Signal Layer

WEEK 3-5

These are the real-time signals that move the readiness score up or down. Refreshed daily to weekly.

| Signal | Source | What It Tells Us | Weight |
| --- | --- | --- | --- |
| Social posting frequency change | Twitter/X API | Sudden increase in startup-related posts after months of silence. The "I'm thinking about building" signal. | HIGH |
| Angel investing activity | AngelList, Crunchbase, SEC | Writing checks = mentally back in the startup ecosystem. Strong leading indicator (3-6mo before departure). | HIGH |
| Conference speaking (non-employer) | Conference sites, Luma | Speaking at startup/AI events not affiliated with current employer = building personal brand for what's next. | MED |
| GitHub public activity spike | GitHub API | New public repos, increased commit frequency on personal projects. Building something on the side. | MED |
| Employer reorg / product shutdown | Tech news APIs, Twitter (Tech Alerts) | Their team got dissolved or their product got killed. Massive departure accelerant. | HIGH |
| Employer layoffs in their division | Layoffs.fyi, TrueUp, news | Even if they weren't laid off, layoffs in their org signal instability and reduce commitment. | MED |
| New domain registration | WHOIS lookups | Registered a new .ai or .com domain? Probably has an idea brewing. | LOW |
| Patent filing (personal) | USPTO | Filed a patent outside their employer's IP? Side project with serious intent. | LOW |
| AI meetup / event RSVPs | Luma, Meetup.com | Attending niche AI events as an individual (not representing their employer). | LOW |
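As one concrete example, the GitHub spike signal reduces to anomaly detection over weekly public-event counts. This sketch flags a spike when recent activity clearly exceeds the trailing baseline; the thresholds are illustrative assumptions, and in practice the counts would be aggregated from the GitHub public events API:

```python
def is_activity_spike(weekly_counts, recent_weeks=4, min_ratio=3.0, min_events=5):
    """weekly_counts: public-event counts per week, oldest to newest.
    Flags a spike when the recent-window average is at least `min_ratio`
    times the earlier baseline and clears an absolute floor (`min_events`)."""
    if len(weekly_counts) <= recent_weeks:
        return False  # not enough history to form a baseline
    baseline = weekly_counts[:-recent_weeks]
    recent = weekly_counts[-recent_weeks:]
    base_avg = sum(baseline) / len(baseline)
    recent_avg = sum(recent) / len(recent)
    if recent_avg < min_events:
        return False  # too little activity to mean anything
    if base_avg == 0:
        return True   # real activity after total silence is itself the signal
    return recent_avg / base_avg >= min_ratio
```

The same shape (trailing baseline vs. recent window) applies to the social-posting-frequency signal with posts per week instead of events per week.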

Phase 4: The SoBack Score

WEEK 5-6

Combine static and dynamic signals into a single readiness score (0-100).

SoBack Score =
      Static_Base_Score(exit_type, time_since_exit, vesting_timeline,
                        founder_type, serial_count, tenure_pattern) × 0.4
    + Dynamic_Signal_Score(social_change, angel_activity, conf_speaking,
                           github_spike, employer_reorg, employer_layoffs,
                           domain_reg, patent_filing, event_rsvps) × 0.6

// Static base provides the foundation (who is structurally likely to leave)
// Dynamic signals provide the timing (who is actively showing signs NOW)
// The 60/40 weighting favors recency: a high static score with no dynamic signals = "likely someday"
// A high dynamic score on top of a high static score = "MOVE NOW"
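A literal reading of the formula in code, assuming each sub-scorer already returns a value on the 0-100 scale. Only the 40/60 blend comes from the formula; the HIGH/MED/LOW-to-numeric weight mapping below is an illustrative assumption mirroring the signal tables:

```python
def weighted_score(signals, weights):
    """Weighted average of 0-100 sub-scores; `weights` maps signal name -> weight."""
    total_w = sum(weights[name] for name in signals)
    return sum(signals[name] * weights[name] for name in signals) / total_w

def soback_score(static_score, dynamic_score):
    """Blend per the 40/60 weighting: static = structural likelihood of
    leaving, dynamic = timing. Inputs and output are on a 0-100 scale."""
    return 0.4 * static_score + 0.6 * dynamic_score

# Illustrative numeric weights (HIGH=3, MED=2, LOW=1, per the signal tables):
STATIC_WEIGHTS = {"exit_type": 3, "time_since_exit": 3, "vesting_timeline": 3,
                  "founder_type": 2, "serial_count": 2, "tenure_pattern": 2}
```

Note the asymmetry this encodes: a founder scoring 100 on statics but 0 on dynamics tops out at 40 ("likely someday"), while strong dynamics alone can reach 60.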

Score Interpretation

| Score Range | Label | What It Means | Recommended Action |
| --- | --- | --- | --- |
| 90-100 | MOVE NOW | Multiple dynamic signals firing on top of a strong static profile. This person is actively preparing to leave. | Reach out immediately. Suggest the outreach angle. |
| 75-89 | HOT | Strong static profile with emerging dynamic signals. Vesting cliff approaching or employer instability detected. | Add to watchlist. Prepare outreach. Window opens in 1-3 months. |
| 50-74 | WARMING | Good static fundamentals but limited dynamic signals yet. Could tip at any time with a trigger event. | Monitor weekly. Will move to HOT when trigger events occur. |
| 25-49 | RESTING | In the refill phase. Probably still committed to current role. Static signals suggest eventual departure. | Long-term pipeline. Check quarterly. |
| 0-24 | SETTLED | No departure signals detected. May have transitioned to a career-employee mindset. | Deprioritize. Reassess if employer events occur. |
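The band boundaries map directly to code; a minimal sketch:

```python
def score_label(score):
    """Map a 0-100 SoBack Score to its band label per the interpretation table."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in [0, 100]")
    if score >= 90:
        return "MOVE NOW"
    if score >= 75:
        return "HOT"
    if score >= 50:
        return "WARMING"
    if score >= 25:
        return "RESTING"
    return "SETTLED"
```

Keeping this as a single pure function makes the bands trivial to recalibrate once the Phase 5 backtest suggests different cutoffs.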

Phase 5: Validation & Calibration

WEEK 6-8

Before launch, backtest the model against known departures.
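One way to structure that backtest, assuming we can assemble historical pairs of (score as computed at some past date, whether the founder actually departed within the horizon). The flagging threshold and the choice of precision/recall as metrics are assumptions, not decided policy:

```python
def backtest(labeled, threshold=75):
    """labeled: list of (score, departed) pairs, where `score` was computed
    as of a past date and `departed` is whether the founder actually left
    within the evaluation horizon. Returns precision and recall for the
    policy of flagging everyone scoring >= threshold."""
    tp = sum(1 for s, d in labeled if s >= threshold and d)
    fp = sum(1 for s, d in labeled if s >= threshold and not d)
    fn = sum(1 for s, d in labeled if s < threshold and d)
    precision = tp / (tp + fp) if tp + fp else 0.0  # flagged who actually left
    recall = tp / (tp + fn) if tp + fn else 0.0     # leavers we caught
    return {"precision": precision, "recall": recall}
```

Sweeping `threshold` over this function against the known-departure set is how the band boundaries in the interpretation table get calibrated.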

Phase 6: YC Directory Time Machine

ONGOING

One of our most distinctive data advantages: retroactively reconstructing when YC companies went inactive.

This "time machine" approach is a genuine competitive moat. No one else is systematically reconstructing the temporal history of YC company status changes. It's public data, but nobody's assembled it.
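Given the per-snapshot status observations collected in Phase 1, pinpointing when a company flipped to Inactive is a scan for the first transition. A sketch, assuming chronologically sorted (timestamp, status) pairs as input:

```python
def inactive_window(observations):
    """observations: chronologically sorted (timestamp, status) pairs pulled
    from archived YC directory snapshots. Returns the (last_seen_active,
    first_seen_inactive) timestamps bracketing the flip, or None if the
    company was never observed Active and then Inactive."""
    last_active = None
    for ts, status in observations:
        if status == "Active":
            last_active = ts
        elif status == "Inactive" and last_active is not None:
            return (last_active, ts)
    return None
```

The result is a window, not an exact date: the flip happened somewhere between the two timestamps, and the window narrows as snapshot density increases.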

4. Connection Graph (3.5M Connections)

Beyond the 2,000 tracked founders, we map their extended network to:

5. Legal & Ethical Guardrails

6. Build Cost Estimate

| Component | Effort | Notes |
| --- | --- | --- |
| YC Directory scraper + Wayback Time Machine | 1 week | Python + Wayback Machine CDX API. One-time build, then daily refresh. |
| Crunchbase / SEC / AngelList integrations | 1 week | API integrations. Crunchbase Pro license required (~$500/mo). |
| Happenstance.ai integration | 2-3 days | Depends on their API. May need manual exports initially. |
| Twitter/X + GitHub + Conference scrapers | 1 week | Twitter API ($100/mo for Basic). GitHub API is free. Conference sites need custom scrapers. |
| SoBack Score model (v1) | 2 weeks | Weighted scoring model. No ML needed for v1: just weighted heuristics calibrated against historical data. |
| API layer | 1 week | REST + structured data for agent consumption. Auth, rate limiting, usage metering. |
| Database + infrastructure | 3 days | Postgres + Redis. Hosted on Railway or Render. ~$50-100/mo. |
| Total MVP | 6-8 weeks | ~$2K/mo in data costs. One full-stack engineer. |