Import profile data from LinkedIn data export #97

Open
opened 2026-06-18 02:28:43 +00:00 by james · 0 comments
Owner

Bootstrapping a Carol profile by hand is the obvious friction point — most users already have years of jobs / education / skills curated on LinkedIn. A one-shot import from a LinkedIn data export lets a new user get to a useful state in minutes instead of an afternoon of copy-paste.

Filing this brings LinkedIn import into scope; CLAUDE.md's "Out of scope (today)" list should be updated to reflect that when the first PR for this work lands.

Why the data-export path, not an API

LinkedIn's official API (Marketing / Talent / Sign In with LinkedIn) returns essentially nothing about a user's own profile data unless the calling app has gone through LinkedIn's partner approval — which is a non-starter for a self-hosted personal tool. Scraping is against LinkedIn's TOS and they actively block it.

The realistic path is LinkedIn's own "Get a copy of your data" export (https://www.linkedin.com/help/linkedin/answer/a1339364). Users request it from LinkedIn's privacy settings, wait anywhere from 10 minutes to 24 hours, and get a .zip of CSV files covering profile, positions, education, skills, certifications, languages, projects, recommendations, connections, messages, etc. The CSV schema is stable enough that a parser written today should keep working for a few years.

This makes the import:

  • User-driven, not automated — the user has to request and download the export themselves.
  • Periodic-refresh-by-redoing, not live-sync. Future re-imports are a re-upload; the importer needs to handle updates without duplicating.
  • Free of LinkedIn auth integration entirely. No OAuth, no scopes, no rate limits, no per-app review.

Scope (first pass)

Map what's in a typical LinkedIn data export onto Carol's existing entities:

| LinkedIn export file | Carol entity | Notes |
|---|---|---|
| Profile.csv | Profile | name, headline, summary, location, industry; single row |
| Positions.csv | Jobs | company, title, location, start/end, description |
| Education.csv | Education | school, degree, field of study, start/end |
| Skills.csv | Skills | flat list of skill names; no level / years-of-experience metadata in the export |
| Certifications.csv | Skills (or its own?) | issuer, name, issued date, expiration; design decision: new entity or coalesce into Skills |
| Languages.csv | Skills (typed) | language + proficiency; same design question |
| Projects.csv | Projects | name, description, dates, contributors |
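
As a rough illustration of the mapping above, here is a minimal Python sketch of the preview step. It assumes the importer is written in Python, which this ticket does not state. `FILE_TO_ENTITY` and `preview_export` are hypothetical names, and Certifications.csv and Languages.csv are left out because their target entity is still undecided. The archive is read entirely from memory, and a bad file is reported on its own instead of aborting the whole preview.

```python
import csv
import io
import zipfile

# Hypothetical mapping copied from the table above; the values are labels only.
FILE_TO_ENTITY = {
    "Profile.csv": "Profile",
    "Positions.csv": "Jobs",
    "Education.csv": "Education",
    "Skills.csv": "Skills",
    "Projects.csv": "Projects",
}


def preview_export(zip_bytes: bytes) -> tuple[dict[str, int], dict[str, str]]:
    """Return (row count per entity, message per problem file) without touching disk."""
    counts: dict[str, int] = {}
    errors: dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Index members by base name so a nested folder inside the zip does not matter.
        members = {name.rsplit("/", 1)[-1]: name for name in archive.namelist()}
        for filename, entity in FILE_TO_ENTITY.items():
            if filename not in members:
                errors[filename] = "file not found in export"
                continue
            try:
                # utf-8-sig strips a leading byte-order mark if one is present.
                text = archive.read(members[filename]).decode("utf-8-sig")
                rows = list(csv.DictReader(io.StringIO(text, newline="")))
            except (UnicodeDecodeError, csv.Error) as exc:
                errors[filename] = f"could not parse: {exc}"
                continue
            counts[entity] = len(rows)
    return counts, errors
```

Whether a missing file should count as an error or simply as "zero rows" is a judgment call; the sketch reports it so the preview can say why an entity is absent.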

Out of scope for the first pass (file as follow-ups when needed):

  • Connections.csv → People / Organizations. Connections imply a network, not just a contact list; merging into existing People records, deduplicating by name+company, and avoiding a flood of auto-created records all need design work that shouldn't block first-cut profile import.
  • Recommendations.csv (given + received). No matching entity today.
  • Messages.csv, Reactions.csv, Comments.csv, Likes.csv. Engagement data Carol doesn't model.
  • Email Addresses.csv of contacts. Privacy-sensitive bulk PII; needs an explicit second-pass design.

Acceptance criteria

  • A user can navigate to a "Settings → Import" page (or equivalent — placement is a UX decision, not part of this ticket's gate), upload a LinkedIn .zip data export, and see a preview of what Carol parsed: how many positions, education entries, skills, projects.
  • On confirm, the import populates the user's Profile, Jobs, Education, Skills, and Projects entities. All writes are scoped to the calling user_id per the per-user data isolation convention.
  • A second import of the same export (idempotency) does not duplicate rows. Entity match keys are pinned in the importer and documented (e.g. Jobs = (company, title, start_date); Skills = (normalized_name); etc.).
  • An updated export (a position added, a skill removed, dates revised) merges sensibly: new entities appear, existing entities update in place, removed-on-LinkedIn entities are not auto-deleted from Carol (LinkedIn is a source, not a master).
  • The importer surfaces parse errors per-file rather than failing the whole import on the first bad row.
  • No LinkedIn export file is persisted to the server after parsing — the .zip is processed in-memory (or in a per-request tempdir wiped on completion). The data inside it is the user's; the .zip itself doesn't need to be retained.
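
The idempotency and merge criteria above could be sketched roughly as follows. This is an illustrative Python sketch, not the importer's actual design: `normalize`, `MATCH_KEYS`, and `merge` are hypothetical names, records are plain dicts standing in for Carol's real models, and the field names (`company`, `title`, `start_date`, `name`) are taken from the example match keys in this ticket.

```python
from typing import Callable, Hashable

Record = dict[str, str]
KeyFn = Callable[[Record], Hashable]


def normalize(value: str) -> str:
    """Case- and whitespace-insensitive form, used only for matching."""
    return " ".join(value.split()).casefold()


# Match keys follow the examples in the criteria; the field names are assumptions.
MATCH_KEYS: dict[str, KeyFn] = {
    "Jobs": lambda r: (normalize(r["company"]), normalize(r["title"]), r["start_date"]),
    "Skills": lambda r: normalize(r["name"]),
}


def merge(entity: str, existing: list[Record], imported: list[Record]) -> list[Record]:
    """Upsert imported records into the existing ones; nothing is ever deleted."""
    key_of = MATCH_KEYS[entity]
    merged = {key_of(r): dict(r) for r in existing}
    for record in imported:
        key = key_of(record)
        if key in merged:
            merged[key].update(record)  # matched: update in place
        else:
            merged[key] = dict(record)  # unmatched: new entity
    # Records absent from the import stay as they are (LinkedIn is a source, not a master).
    return list(merged.values())
```

Two caveats with this sketch: because `start_date` is part of the Jobs key, a revised start date would show up as a new job, not as an update; and `update` overwrites manual edits on matched fields. Both depend on the conflict-resolution policy, which is still an open design question.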

Design questions to settle before implementation

  • Where does the upload land? A dedicated /settings/import page is the obvious answer; alternatives include a one-shot first-run wizard, or attaching to the Profile edit page.
  • Are LinkedIn Certifications and Languages first-class entities or just typed skills? Calls for a quick look at how the Profile model is shaping up.
  • What does "merge sensibly" mean for jobs that have the same company + title with overlapping date ranges? Probably "preserve the manual edits, append the import data as a new revision", but the conflict-resolution policy needs to be explicit.
  • Should there be a dry-run report (per-entity diff) before the import commits? Friction vs. safety trade-off. Defaulting to "yes, show a confirm screen with the per-entity counts" feels right; full per-row diff is overkill for v1.
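
For the confirm-screen option, a per-entity count report can be computed without writing anything. The sketch below is a hedged Python illustration under the same assumptions as before (plain dicts as records, a caller-supplied match key); `dry_run_counts` is a hypothetical name, not an existing Carol function.

```python
from typing import Callable, Hashable

Record = dict[str, str]


def dry_run_counts(
    existing: list[Record],
    imported: list[Record],
    key_of: Callable[[Record], Hashable],
) -> dict[str, int]:
    """Classify each imported record as new, updated, or unchanged; writes nothing."""
    current = {key_of(r): r for r in existing}
    report = {"new": 0, "updated": 0, "unchanged": 0}
    for record in imported:
        match = current.get(key_of(record))
        if match is None:
            report["new"] += 1
        elif any(match.get(field) != value for field, value in record.items()):
            report["updated"] += 1  # same key, at least one imported field differs
        else:
            report["unchanged"] += 1
    return report
```

Running this once per entity yields the counts for the confirm screen; a full per-row diff would need the differing fields themselves, which this sketch deliberately discards.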

Out of scope today

  • Connections.csv import. File a follow-up ticket once first-pass profile import lands.
  • Continuous sync with LinkedIn. The data-export path is fundamentally a periodic-batch flow.
  • Reverse: exporting Carol's data into a LinkedIn-import-compatible format.
  • Resume/CV parsing (PDF, DOCX). LinkedIn export is the bounded scope; resume-from-anywhere is a different problem.

Part of epic #2.

Reference
james/carol#97