Our company regularly receives LinkedIn profile data from a third-party data provider. This data arrives as CSV files containing profile information that needs to be processed, cleaned, and stored in our database for further analysis. Your task is to build a data pipeline that handles this workflow.
You have 2 hours to complete as much of this challenge as possible. We understand that full completion is challenging within this timeframe - focus on demonstrating your approach, coding style, and problem-solving skills rather than implementing every feature.
The provided requirements.txt file should include all the packages necessary to complete this task. Feel free to install any other packages you may need.
```bash
pip install -r requirements.txt
```

- A CSV file containing LinkedIn profile data will be provided to you
- The file includes the following fields:
  - Url (LinkedIn profile URL)
  - First Name
  - Last Name
  - Job Title
  - Headline
  - Company
  - Industry
  - Location
  - Work Email
  - Other Work Emails
  - Twitter
  - Github
  - Company LinkedinUrl
  - Company Domain
  - Profile Image Url
Data Extraction & Transformation
- Write code to read the provided CSV file
- Clean the data (handle missing values, standardize formats)
- Create a full name field from first and last name
- Enrich the data with at least one computed field:
  - seniority_level: determined from Job Title (Junior, Mid, Senior, Executive); see the sketch after this list
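A minimal sketch of one possible approach with pandas, using the column names from the sample CSV below. The seniority keyword lists are illustrative assumptions, not part of the spec:

```python
import pandas as pd

def classify_seniority(job_title: str) -> str:
    """Bucket a job title into Junior / Mid / Senior / Executive.

    The keyword lists here are illustrative assumptions.
    """
    title = (job_title or "").lower()
    if any(k in title for k in ("chief", "ceo", "cto", "vp", "director", "head of")):
        return "Executive"
    if any(k in title for k in ("senior", "sr.", "lead", "principal", "staff")):
        return "Senior"
    if any(k in title for k in ("junior", "jr.", "intern", "graduate")):
        return "Junior"
    return "Mid"

def transform(csv_path: str) -> pd.DataFrame:
    # Read everything as strings, then normalize missing values to ""
    df = pd.read_csv(csv_path, dtype=str).fillna("")
    df.columns = [c.strip() for c in df.columns]  # standardize header whitespace
    df["full_name"] = (
        df["First Name"].str.strip() + " " + df["Last Name"].str.strip()
    ).str.strip()
    df["seniority_level"] = df["Job Title"].apply(classify_seniority)
    return df
```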
Data Loading
- Design a PostgreSQL schema to store the processed data
- Implement code to load the data into PostgreSQL
- Include a unique identifier or primary key strategy (one option is sketched below)
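A possible schema and loader, sketched with psycopg2. The table name and column subset are assumptions; the profile Url is used as a natural unique key alongside a surrogate SERIAL id, so re-running the load is idempotent via an upsert:

```python
import pandas as pd
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS profiles (
    id              SERIAL PRIMARY KEY,
    url             TEXT UNIQUE NOT NULL,  -- natural key: LinkedIn profile URL
    full_name       TEXT,
    job_title       TEXT,
    company         TEXT,
    seniority_level TEXT,
    work_email      TEXT
);
"""

UPSERT = """
INSERT INTO profiles (url, full_name, job_title, company, seniority_level, work_email)
VALUES (%s, %s, %s, %s, %s, %s)
ON CONFLICT (url) DO UPDATE SET
    full_name       = EXCLUDED.full_name,
    job_title       = EXCLUDED.job_title,
    company         = EXCLUDED.company,
    seniority_level = EXCLUDED.seniority_level,
    work_email      = EXCLUDED.work_email;
"""

def load(df: pd.DataFrame, conn_kwargs: dict) -> None:
    rows = df[["Url", "full_name", "Job Title", "Company",
               "seniority_level", "Work Email"]].values.tolist()
    conn = psycopg2.connect(**conn_kwargs)
    try:
        with conn:  # commits on success, rolls back on error
            with conn.cursor() as cur:
                cur.execute(DDL)
                # executemany keeps the sketch simple; COPY would be faster at scale
                cur.executemany(UPSERT, rows)
    finally:
        conn.close()
```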
AWS S3 Integration
- Write functions to simulate downloading from and uploading to S3
- You don't need to test these functions, but show proper AWS SDK usage (a boto3 sketch follows)
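A sketch of what the simulated S3 helpers might look like with boto3; bucket and key names are caller-supplied placeholders:

```python
import logging

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def download_from_s3(bucket: str, key: str, local_path: str) -> bool:
    """Download an object from S3 to a local file; returns False on failure."""
    try:
        s3.download_file(bucket, key, local_path)
        return True
    except ClientError as exc:
        logging.error("S3 download failed for s3://%s/%s: %s", bucket, key, exc)
        return False

def upload_to_s3(local_path: str, bucket: str, key: str) -> bool:
    """Upload a local file to S3; returns False on failure."""
    try:
        s3.upload_file(local_path, bucket, key)
        return True
    except ClientError as exc:
        logging.error("S3 upload failed for s3://%s/%s: %s", bucket, key, exc)
        return False
```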
Basic Prefect Flow
- Implement a simple Prefect flow that connects your extraction, transformation, and loading steps
- Include basic error handling (see the sketch below)
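A minimal Prefect 2.x flow connecting the steps. Here `pipeline_steps` is a hypothetical module assumed to collect the `transform` and `load` sketches above; task retries provide the basic error handling, and any uncaught exception marks the flow run as failed:

```python
from prefect import flow, task

# Hypothetical module collecting the earlier sketches
from pipeline_steps import load, transform

@task(retries=2, retry_delay_seconds=10)
def extract_and_transform(csv_path: str):
    return transform(csv_path)

@task
def persist(df):
    load(df, {"dbname": "postgres", "user": "postgres",
              "password": "yourpassword", "host": "localhost", "port": 5432})

@flow(name="linkedin-profile-pipeline")
def pipeline(csv_path: str = "profiles.csv"):
    df = extract_and_transform(csv_path)
    persist(df)

if __name__ == "__main__":
    pipeline()
```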
Additional Data Enrichment
- Implement additional computed fields (sketched below):
  - tech_profile: Boolean flag indicating if the Github field is populated
  - email_domain: extract the domain from Work Email
  - has_multiple_emails: Boolean flag indicating if Other Work Emails is populated
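Continuing the pandas sketch above, these three fields might look like:

```python
import pandas as pd

def enrich(df: pd.DataFrame) -> pd.DataFrame:
    # A field counts as "populated" if it contains anything beyond whitespace
    df["tech_profile"] = df["Github"].str.strip().ne("")
    df["has_multiple_emails"] = df["Other Work Emails"].str.strip().ne("")
    # Everything after the "@"; empty string when no email is present
    df["email_domain"] = df["Work Email"].str.extract(r"@(.+)$", expand=False).fillna("")
    return df
```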
Advanced Error Handling
- Implement more comprehensive error handling and logging
- Add data validation checks (a lightweight approach is sketched below)
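One lightweight way to sketch validation with plain pandas and logging, without assuming a dedicated validation library:

```python
import logging

import pandas as pd

logger = logging.getLogger("pipeline")

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Log basic data-quality problems and drop rows that cannot be keyed."""
    # Work Email values that are present but not even superficially email-shaped
    bad_emails = df["Work Email"].ne("") & ~df["Work Email"].str.contains("@", regex=False)
    if bad_emails.any():
        logger.warning("validation: %d malformed Work Email values", int(bad_emails.sum()))
    missing_url = df["Url"].eq("")
    if missing_url.any():
        logger.warning("validation: dropping %d rows with no profile Url", int(missing_url.sum()))
    # The Url is the upsert key, so rows without one cannot be loaded
    return df[~missing_url]
```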
Testing
- Write unit tests for key components of your solution (a pytest example follows)
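For example, the seniority classifier from the first sketch is a natural unit-test target with pytest (`pipeline_steps` is the same hypothetical module as in the flow sketch):

```python
import pytest

from pipeline_steps import classify_seniority

@pytest.mark.parametrize(
    ("title", "expected"),
    [
        ("Senior Data Engineer", "Senior"),
        ("Junior Analyst", "Junior"),
        ("VP of Engineering", "Executive"),
        ("Software Engineer", "Mid"),
        ("", "Mid"),  # empty titles fall through to the default bucket
    ],
)
def test_classify_seniority(title, expected):
    assert classify_seniority(title) == expected
```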
Documentation
- Add comprehensive docstrings and comments
- Create a brief design document explaining your approach
Environment
- A development environment will be provided with:
- Python 3.12+
- Docker installation
- Postgres Docker container
- Required Python libraries pre-installed (pandas, boto3, prefect, etc.)
- Access to documentation
For this assessment, we recommend setting up a Docker container with a Postgres image:

```bash
docker run -d --name postgres-interview -e POSTGRES_PASSWORD=yourpassword -p 5432:5432 postgres:latest
```

You can then connect to it from Python:

```python
import psycopg2

conn = psycopg2.connect(
dbname="postgres",
user="postgres",
password="yourpassword",
host="localhost",
port="5432"
)
cursor = conn.cursor()
# Example query only; 'products' is a placeholder table. Use parameterized
# queries (as here) rather than string formatting to avoid SQL injection.
cursor.execute("SELECT * FROM products WHERE price > %s", (500,))
print(cursor.fetchall())
conn.close()
```

Here's a sample of what the input CSV will look like:

```csv
"Url","First Name","Last Name","Job Title","Headline","Company","Industry","Location","Work Email","Other Work Emails","Twitter","Github","Company LinkedinUrl","Company Domain","Profile Image Url"
"https://linkedin.com/in/johndoe","John","Doe","Senior Data Engineer","Senior Data Engineer at TechCorp","TechCorp","Information Technology","San Francisco, CA","john.doe@techcorp.com","jdoe@techcorp.com","@johndoecodes","johndoe","https://linkedin.com/company/techcorp","techcorp.com","https://media.linkedin.com/profile/johndoe.jpg"
"https://linkedin.com/in/janedoe","Jane","Doe","Product Manager","Product Manager at DataSoft","DataSoft","Software Development","Seattle, WA","jane.doe@datasoft.com","","@janedoe","","https://linkedin.com/company/datasoft","datasoft.com","https://media.linkedin.com/profile/janedoe.jpg"
"https://linkedin.com/in/alexjohnson","Alex","Johnson","ML Engineer","Machine Learning Engineer","AI Solutions Inc","Artificial Intelligence","Austin, TX","alex.johnson@aisolutions.com","aj@aisolutions.com","@alexj_ai","alexjohnson","https://linkedin.com/company/ai-solutions","aisolutions.com","https://media.linkedin.com/profile/alexjohnson.jpg"
"https://linkedin.com/in/sarahwilliams","Sarah","Williams","Senior Data Scientist","Data Scientist at BigAnalytics","BigAnalytics","Data Science","New York, NY","sarah.williams@biganalytics.com","","@datascisarah","sarahwilliams","https://linkedin.com/company/biganalytics","biganalytics.com","https://media.linkedin.com/profile/sarahwilliams.jpg"
"https://linkedin.com/in/michaelbrown","Michael","Brown","Software Engineer","Software Engineer at CodeCrafters","CodeCrafters","Software Engineering","Chicago, IL","michael.brown@codecrafters.io","mike@codecrafters.io","@mikebcode","mbrown","https://linkedin.com/company/codecrafters","codecrafters.io","https://media.linkedin.com/profile/michaelbrown.jpg"You will be evaluated on:
Code Quality & Organization
- Clean, readable, and well-structured code
- Modular design with clear separation of concerns
- Proper error handling for critical operations
Data Engineering Fundamentals
- Effective data cleaning and transformation
- Appropriate database schema design
- Efficient data loading methods
Technical Knowledge
- Proper use of AWS S3 APIs
- Effective implementation of Prefect for orchestration
- SQL knowledge for database operations
Problem-Solving Approach
- How you prioritize tasks given the time constraint
- Your approach to debugging and problem-solving
- Questions you ask and clarifications you seek
Communication
- Code comments and documentation
- Explanation of your approach during follow-up discussion
- Clarity about what you completed and what you would do with more time
Tips
- Start with a working end-to-end solution focusing on core functionality
- Add complexity incrementally as time permits
- It's better to have a simple working solution than a complex partial one
- If you get stuck, document your approach and move on
- We're interested in your thought process as much as your code
Deliverables
- All Python code files
- SQL scripts for creating database schema
- Brief notes on your approach and any design decisions
- If you don't finish everything, note what you would do with more time
After the coding session, be prepared for a brief discussion about your solution, challenges you faced, and your design decisions.