SoBack.ai — Data Pipeline Architecture
How we predict when AI founders are ready to build again. Internal reference document, Feb 2026.
1. The Core Thesis
Founders who leave startups and enter Big Tech frequently come back to founding. The academic research supports this:
Gompers, Kovner, Lerner & Scharfstein — "Performance Persistence in Entrepreneurship" (Harvard Business School / Journal of Financial Economics, 2010)
Previously successful founders have a 30% chance of success on their next venture vs. 18% for first-timers. Key finding: entrepreneurs exhibit persistence in selecting the right industry and time to start new ventures. This is the academic foundation for why timing prediction is possible.
Daniel Kim — "Predictable Exodus: Startup Acquisitions and Employee Departures" (Wharton, 2024)
33% of acquired workers leave within the first year, vs. 12% of regular hires with similar skills. The study uses U.S. Census Bureau administrative datasets covering all U.S. high-tech startups. This is strong evidence that the post-acquisition departure pattern is statistically real, which is what makes it predictable.
Stanford GSB Founder Effect Study
One-third of Stanford GSB alumni founders are serial entrepreneurs (2+ ventures). 83% of serial founders kept their first firm open while starting their second. 52% of MBA-founded companies launch 3+ years after graduation. The "refill and return" pattern is the norm, not the exception.
Target academic partnerships: Paul Gompers (HBS), Daniel Kim (Wharton), and the Kauffman Foundation all have datasets that could validate and refine the SoBack model. A research partnership here could produce a published paper AND a proprietary dataset.
2. Data Sources — The Signal Stack
Organized by reliability and legal accessibility. We never scrape LinkedIn directly (ToS violation, and LinkedIn has a record of litigating against scrapers). Everything below is public or API-accessible.
Primary Sources (Refresh: Daily)
YC Startup Directory
Crunchbase API
GitHub Public Activity
Conference Speaker Lists
SEC EDGAR (M&A filings)
AngelList / Wellfound
Secondary Sources (Refresh: Weekly)
Twitter/X (Tech Alerts feed + individual accounts)
Tech News APIs (TechCrunch, The Information, Bloomberg)
Podcast Guest Lists (AI-focused shows)
PitchBook (if licensed)
Happenstance.ai
Google Scholar (new publications)
Experimental Sources (Refresh: Ad-hoc)
Reddit (r/ExperiencedDevs, r/startups, r/cscareerquestions)
Blind (anonymous Big Tech posts)
Meetup.com / Luma (AI event RSVPs)
Wayback Machine (LinkedIn profile change detection)
Patent filings (new individual patents outside employer)
Domain registrations (new personal domains)
3. Pipeline Architecture
Build the initial database of 2,000+ AI-relevant YC founders.
- Step 1: Pull all companies from YC directory. Filter for: AI/ML tags, inactive/acquired/dead status, and any company whose description mentions AI, machine learning, NLP, computer vision, data infrastructure.
- Step 2: For each company, extract founder names and any linked social profiles from the YC directory itself (it includes founder bios and links).
- Step 3: Cross-reference with Crunchbase for acquisition data (who acquired them, when, deal size if public).
- Step 4: Use Happenstance.ai to map where these founders are now — which Big Tech company, what role, how long they've been there.
- Step 5: Build the base profile: Founder name, YC batch, company name, company status (active/inactive/acquired/dead), date of status change, current employer, current role, start date at current employer.
Key insight: The YC directory already flags companies as "Inactive" vs "Active." We can retroactively scrape Wayback Machine snapshots of the YC directory to determine WHEN a company went inactive. This gives us a timeline we can't get anywhere else.
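Steps 1 and 5 can be sketched as a filter plus a record type. This is a minimal sketch, not the production schema: the field names, the `company` dict shape (`status`, `tags`, `description`), and the reading of Step 1 as "target status AND (AI tag OR AI mention in the description)" are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Keywords from Step 1. Word boundaries keep "ai" from matching "maintain".
AI_PATTERN = re.compile(
    r"\b(ai|machine learning|nlp|computer vision|data infrastructure)\b",
    re.IGNORECASE,
)
AI_TAGS = {"ai", "ml", "machine learning", "artificial intelligence"}  # assumed tag names
TARGET_STATUSES = {"inactive", "acquired", "dead"}


@dataclass
class FounderProfile:
    """Base profile from Step 5. Unknown fields stay None until enriched."""
    founder_name: str
    yc_batch: str
    company_name: str
    company_status: str  # active / inactive / acquired / dead
    status_changed_on: Optional[date] = None
    current_employer: Optional[str] = None
    current_role: Optional[str] = None
    employer_start_date: Optional[date] = None


def is_ai_relevant(company: dict) -> bool:
    """True if the company carries an AI/ML tag or its description mentions AI."""
    tags = {t.lower() for t in company.get("tags", [])}
    return bool(tags & AI_TAGS) or bool(AI_PATTERN.search(company.get("description", "")))


def is_seed_candidate(company: dict) -> bool:
    """Step 1 filter: ended or acquired company that is also AI-relevant."""
    return company.get("status", "").lower() in TARGET_STATUSES and is_ai_relevant(company)
```

Keeping the enrichment fields optional lets Steps 3 and 4 fill them in later without blocking the initial import.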
Overlay the data that doesn't change frequently but is highly predictive.
| Signal | Source | What It Tells Us | Weight |
| --- | --- | --- | --- |
| Company exit type | Crunchbase, SEC | Acqui-hire vs. asset sale vs. shutdown. Acqui-hires have the most predictable departure timelines. | HIGH |
| Time since company ended | YC Directory + Wayback | The "refill clock." Most founders need 18-36 months to financially recover and mentally reset. | HIGH |
| Vesting timeline | Standard 4yr/1yr cliff model | Estimated vesting cliff based on hire date at Big Tech employer. Most departures cluster at 2yr and 4yr marks. | HIGH |
| Founder type | YC Directory, Crunchbase | Technical vs. non-technical founder. Technical founders tend to return faster (they can build alone). | MED |
| Previous founding count | Crunchbase | Serial founders (2+ companies) return at ~2x the rate of first-time founders. | MED |
| Historical tenure patterns | Happenstance, Crunchbase | How long has this person stayed at previous jobs? Short tenure history = higher departure likelihood. | MED |
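Two of the static signals in this table reduce to date arithmetic. The sketch below estimates vesting milestones from a hire date under the standard 4yr/1yr cliff model and checks whether a founder is inside the 18-36 month refill window. The function names and the day-clamping rule are illustrative assumptions, and real grants may vest on different schedules.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to 28 so the result is always valid."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def vesting_milestones(hire_date: date) -> dict:
    """Estimated milestones under the assumed standard 4yr/1yr cliff schedule."""
    return {
        "cliff_1yr": add_months(hire_date, 12),
        "mark_2yr": add_months(hire_date, 24),
        "full_vest_4yr": add_months(hire_date, 48),
    }


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end (day of month is ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def in_refill_window(company_ended: date, today: date, lo: int = 18, hi: int = 36) -> bool:
    """True when the 'refill clock' sits inside the 18-36 month band from the table."""
    return lo <= months_between(company_ended, today) <= hi
```

For example, `vesting_milestones(date(2023, 3, 15))["mark_2yr"]` gives `date(2025, 3, 15)`, which is one of the marks where the table expects departures to cluster.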
These are the real-time signals that move the readiness score up or down. Refreshed daily to weekly.
| Signal | Source | What It Tells Us | Weight |
| --- | --- | --- | --- |
| Social posting frequency change | Twitter/X API | Sudden increase in startup-related posts after months of silence. The "I'm thinking about building" signal. | HIGH |
| Angel investing activity | AngelList, Crunchbase, SEC | Writing checks = mentally back in the startup ecosystem. Strong leading indicator (3-6mo before departure). | HIGH |
| Conference speaking (non-employer) | Conference sites, Luma | Speaking at startup/AI events not affiliated with current employer = building personal brand for what's next. | MED |
| GitHub public activity spike | GitHub API | New public repos, increased commit frequency on personal projects. Building something on the side. | MED |
| Employer reorg / product shutdown | Tech news APIs, Twitter (Tech Alerts) | Their team got dissolved or their product got killed. Massive departure accelerant. | HIGH |
| Employer layoffs in their division | Layoffs.fyi, TrueUp, news | Even if they weren't laid off, layoffs in their org signal instability and reduce commitment. | MED |
| New domain registration | WHOIS lookups | Registered a new .ai or .com domain? Probably has an idea brewing. | LOW |
| Patent filing (personal) | USPTO | Filed a patent outside their employer's IP? Side project with serious intent. | LOW |
| AI meetup / event RSVPs | Luma, Meetup.com | Attending niche AI events as an individual (not representing their employer). | LOW |
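The "sudden increase after months of silence" signal in the first row can be approximated by comparing a recent window against the founder's own baseline. This is one possible heuristic, not a tuned detector: the 4-week window, the 3x ratio, and the minimum post count are placeholder assumptions, and `weekly_counts` is a hypothetical input of startup-related posts per week, oldest first.

```python
from statistics import mean


def posting_spike(
    weekly_counts: list[int],
    recent_weeks: int = 4,
    min_ratio: float = 3.0,
    min_recent_posts: int = 4,
) -> bool:
    """Flag a spike when the recent weekly rate is well above the earlier baseline."""
    if len(weekly_counts) <= recent_weeks:
        return False  # not enough history to form a baseline

    baseline = weekly_counts[:-recent_weeks]
    recent = weekly_counts[-recent_weeks:]

    # Ignore a single stray post after a long silence.
    if sum(recent) < min_recent_posts:
        return False

    # Floor the baseline so months of zero posts do not cause division by zero.
    baseline_rate = max(mean(baseline), 0.25)
    return mean(recent) / baseline_rate >= min_ratio
```

Comparing each person against their own history, rather than a global threshold, keeps habitual heavy posters from firing the signal constantly.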
Combine static and dynamic signals into a single readiness score (0-100).
SoBack Score =
    0.4 × Static_Base_Score(exit_type, time_since_exit, vesting_timeline, founder_type, serial_count, tenure_pattern)
  + 0.6 × Dynamic_Signal_Score(social_change, angel_activity, conf_speaking, github_spike, employer_reorg, employer_layoffs, domain_reg, patent_filing, event_rsvps)
// Static base provides the foundation (who is structurally likely to leave)
// Dynamic signals provide the timing (who is actively showing signs NOW)
// 60/40 weighting favors recency — a high static score with no dynamic signals = "likely someday"
// A high dynamic score on top of a high static score = "MOVE NOW"
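A minimal sketch of the formula in Python, under stated assumptions: each signal is already normalized to the range 0-1, and the HIGH/MED/LOW weights from the signal tables map to 3/2/1. That numeric mapping is a placeholder that the backtest would calibrate, not a value from this document.

```python
WEIGHT = {"HIGH": 3.0, "MED": 2.0, "LOW": 1.0}  # assumed numeric mapping

STATIC_WEIGHTS = {
    "exit_type": "HIGH", "time_since_exit": "HIGH", "vesting_timeline": "HIGH",
    "founder_type": "MED", "serial_count": "MED", "tenure_pattern": "MED",
}
DYNAMIC_WEIGHTS = {
    "social_change": "HIGH", "angel_activity": "HIGH", "employer_reorg": "HIGH",
    "conf_speaking": "MED", "github_spike": "MED", "employer_layoffs": "MED",
    "domain_reg": "LOW", "patent_filing": "LOW", "event_rsvps": "LOW",
}


def component_score(signals: dict, weights: dict) -> float:
    """Weighted average of 0-1 signal strengths, scaled to 0-100. Missing signals count as 0."""
    total = sum(WEIGHT[level] for level in weights.values())
    earned = sum(
        WEIGHT[level] * min(max(signals.get(name, 0.0), 0.0), 1.0)
        for name, level in weights.items()
    )
    return 100.0 * earned / total


def soback_score(static_signals: dict, dynamic_signals: dict) -> float:
    """40% static base plus 60% dynamic signals, as in the formula above."""
    return (
        0.4 * component_score(static_signals, STATIC_WEIGHTS)
        + 0.6 * component_score(dynamic_signals, DYNAMIC_WEIGHTS)
    )
```

With a perfect static profile and no dynamic signals this returns 40, which matches the "likely someday" case described in the comments above.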
Score Interpretation
| Score Range | Label | What It Means | Recommended Action |
| --- | --- | --- | --- |
| 90-100 | MOVE NOW | Multiple dynamic signals firing on top of a strong static profile. This person is actively preparing to leave. | Reach out immediately. Suggest the outreach angle. |
| 75-89 | HOT | Strong static profile with emerging dynamic signals. Vesting cliff approaching or employer instability detected. | Add to watchlist. Prepare outreach. Window opens in 1-3 months. |
| 50-74 | WARMING | Good static fundamentals but limited dynamic signals yet. Could tip at any time with a trigger event. | Monitor weekly. Will move to HOT when trigger events occur. |
| 25-49 | RESTING | In the refill phase. Probably still committed to current role. Static signals suggest eventual departure. | Long-term pipeline. Check quarterly. |
| 0-24 | SETTLED | No departure signals detected. May have transitioned to a career employee mindset. | Deprioritize. Reassess if employer events occur. |
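The bands in this table map to a simple lookup. In this sketch, fractional scores fall into the band whose floor they have reached (so 89.5 is HOT). That is an assumption, since the table only lists integer ranges.

```python
# (floor, label) pairs from the Score Interpretation table, highest band first.
BANDS = [
    (90, "MOVE NOW"),
    (75, "HOT"),
    (50, "WARMING"),
    (25, "RESTING"),
    (0, "SETTLED"),
]


def score_label(score: float) -> str:
    """Return the readiness label for a score in the range 0-100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for floor, label in BANDS:
        if score >= floor:
            return label
    raise AssertionError("unreachable: the lowest band has floor 0")
```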
Before launch, backtest the model against known departures.
- Historical validation: Take YC founders who DID leave Big Tech in 2023-2025 and start new companies. Run the model against their signal trail from 3-6 months before departure. Did the model predict it?
- False positive analysis: Take founders who showed some signals but DIDN'T leave. What was different? This calibrates the weights.
- Academic partnership: Engage Paul Gompers (HBS) or Daniel Kim (Wharton) to validate methodology. A co-published paper = instant credibility AND a proprietary dataset advantage.
- Ongoing calibration: Every confirmed departure or non-departure becomes training data. The model gets better with every data point.
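The historical validation and false positive analysis above come down to precision and recall at a chosen score threshold. The sketch below assumes each backtest record is a hypothetical `(score, actually_left)` pair, where the score is computed from the signal trail 3-6 months before the outcome. The default threshold of 75 (the HOT floor) is an illustrative choice.

```python
def backtest(records, threshold: float = 75.0) -> dict:
    """Compare score-based predictions with known outcomes.

    records: iterable of (score, actually_left) pairs.
    """
    tp = fp = fn = tn = 0
    for score, left in records:
        predicted = score >= threshold
        if predicted and left:
            tp += 1  # flagged and did leave
        elif predicted and not left:
            fp += 1  # flagged but stayed (false positive)
        elif left:
            fn += 1  # left without being flagged (missed)
        else:
            tn += 1  # correctly left alone

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}
```

Sweeping `threshold` across the score range shows the trade-off between wasted outreach (low precision) and missed founders (low recall), which is the input the weight calibration needs.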
One of our most distinctive data advantages: retroactively reconstructing when YC companies went inactive.
- Wayback Machine scraping: The Internet Archive has regular snapshots of ycombinator.com/companies going back years. By comparing snapshots, we can determine the approximate date a company's status changed from "Active" to "Inactive."
- News cross-reference: When a company shuts down or is acquired, there's usually a TechCrunch or other press mention. The Tech Alerts Twitter feed is also a good source. Cross-referencing news dates with directory changes gives precise timelines.
- Founder employment timeline reconstruction: Once we know when the company ended, we can estimate when the founder joined Big Tech (usually 1-3 months later). Combined with standard vesting schedules, this gives us a predicted departure window without needing any LinkedIn data.
This "time machine" approach is a potential competitive moat. As far as we know, no one else is systematically reconstructing the temporal history of YC company status changes. The data is public, but it has not been assembled.
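The snapshot comparison can be sketched in two parts: building a query for the Wayback Machine CDX API, and bracketing the date of a status change from parsed snapshots. Fetching the snapshots and parsing a company's status out of archived HTML are left out. The `(timestamp, status)` input shape and both function names are assumptions for illustration.

```python
from datetime import datetime
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(page_url: str, year_from: int, year_to: int) -> str:
    """Build a CDX query listing successful captures, at most one per day."""
    params = {
        "url": page_url,
        "output": "json",
        "from": str(year_from),
        "to": str(year_to),
        "filter": "statuscode:200",
        "collapse": "timestamp:8",  # first 8 timestamp digits = YYYYMMDD
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def status_change_window(snapshots):
    """Bracket the first Active -> non-Active transition.

    snapshots: iterable of (14-digit Wayback timestamp, status string) pairs.
    Returns (last_seen_active, first_seen_not_active) or None if no transition is visible.
    """
    last_active = None
    for ts, status in sorted(snapshots):
        when = datetime.strptime(ts, "%Y%m%d%H%M%S")
        if status.lower() == "active":
            last_active = when
        elif last_active is not None:
            return last_active, when
    return None
```

The result is a window rather than a date, since the change happened somewhere between two captures. The news cross-reference described above is what narrows it.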
4. Connection Graph (3.5M Connections)
Beyond the 2,000 tracked founders, we map their extended network to:
- Identify co-founder patterns: When founder A leaves, founder B often follows within 6 months. Map the co-founder graph to predict cascading departures.
- Surface hidden talent: The Ghost archetype — founders who've gone dark but are still connected to active people in the ecosystem.
- Warm intro routing: When a startup founder wants to reach a tracked founder, we can suggest the shortest path through mutual connections.
- Expand the pool: 2,000 founders with an average of 1,750 connections each = roughly 3.5M connections (edges) in the graph. The number of unique people is lower once shared contacts are deduplicated. Some of those connections are non-YC founders who show the same patterns.
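Warm intro routing is a shortest-path problem on the connection graph. A breadth-first search finds the path with the fewest hops. The sketch assumes the graph is an adjacency mapping from a person ID to their contacts, which is a hypothetical in-memory shape. A graph of this size would more likely live in a database.

```python
from collections import deque


def shortest_intro_path(graph: dict, source: str, target: str):
    """Return the shortest chain of people from source to target, or None if unreachable."""
    if source == target:
        return [source]

    parents = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for contact in graph.get(person, ()):
            if contact in parents:
                continue
            parents[contact] = person
            if contact == target:
                # Walk back through parents to rebuild the path.
                path = [contact]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(contact)
    return None
```

Because BFS explores the graph level by level, the first time the target is reached is guaranteed to be through a minimum number of introductions.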
5. Legal & Ethical Guardrails
- No LinkedIn scraping. LinkedIn actively litigates (hiQ Labs case). All data comes from public APIs, public directories, and licensed data sources.
- Opt-out mechanism. Any tracked individual can request removal from the database. This is both ethical and legally required under CCPA/GDPR.
- No employer notification. We never tell an employer that their employee is being tracked or scored. That would be a trust violation that kills the product.
- Composite messaging. Public-facing content (like the Reddit quotes on the landing page) uses composites, never real posts attributed to real people.
- Framing. We're helping founders find their next opportunity, not surveilling employees. The language and positioning reinforce this at every touchpoint.
6. Build Cost Estimate
| Component | Effort | Notes |
| --- | --- | --- |
| YC Directory scraper + Wayback Time Machine | 1 week | Python + Wayback Machine CDX API. One-time build, then daily refresh. |
| Crunchbase / SEC / AngelList integrations | 1 week | API integrations. Crunchbase Pro license required (~$500/mo). |
| Happenstance.ai integration | 2-3 days | Depends on their API. May need manual exports initially. |
| Twitter/X + GitHub + Conference scrapers | 1 week | Twitter/X API Basic tier ($100/mo at its original pricing; X has since raised it, so confirm the current rate). GitHub API is free. Conference sites need custom scrapers. |
| SoBack Score model (v1) | 2 weeks | Weighted scoring model. No ML needed for v1 — just weighted heuristics calibrated against historical data. |
| API layer | 1 week | REST + structured data for agent consumption. Auth, rate limiting, usage metering. |
| Database + infrastructure | 3 days | Postgres + Redis. Hosted on Railway or Render. ~$50-100/mo. |
| Total MVP | 6-8 weeks | ~$2K/mo in data costs. One full-stack engineer. |