LinkedIn Profile Data Engineering Challenge (2-Hour In-Person Assessment)

Context

Our company regularly receives LinkedIn profile data from a third-party data provider. This data arrives as CSV files containing profile information that needs to be processed, cleaned, and stored in our database for further analysis. Your task is to build a data pipeline that handles this workflow.

Time Allocation

You have 2 hours to complete as much of this challenge as possible. We understand that full completion is challenging within this timeframe; focus on demonstrating your approach, coding style, and problem-solving skills rather than on implementing every feature.

Requirements

The provided requirements.txt file should include all the packages necessary to complete this task. Feel free to install and use any other packages you may need.

pip install -r requirements.txt

Data Source

  • A CSV file containing LinkedIn profile data will be provided to you
  • The file includes the following fields:
    • Url (LinkedIn profile URL)
    • First Name
    • Last Name
    • Job Title
    • Headline
    • Company
    • Industry
    • Location
    • Work Email
    • Other Work Emails
    • Twitter
    • Github
    • Company LinkedinUrl
    • Company Domain
    • Profile Image Url

Priority Tasks (Core Requirements)

  1. Data Extraction & Transformation

    • Write code to read the provided CSV file
    • Clean the data (handle missing values, standardize formats)
    • Create a full name field from first and last name
    • Enrich the data with at least one computed field:
      • seniority_level: Determine from Job Title (Junior, Mid, Senior, Executive); a transformation sketch follows this list
  2. Data Loading

    • Design a PostgreSQL schema to store the processed data
    • Implement code to load the data into PostgreSQL
    • Include a unique identifier or primary key strategy (a loading sketch follows this list)
  3. AWS S3 Integration

    • Write functions to simulate downloading from and uploading to S3
    • You don't need to test these functions, but show proper AWS SDK usage (an S3 sketch follows this list)
  4. Basic Prefect Flow

    • Implement a simple Prefect flow that connects your extraction, transformation, and loading steps
    • Include basic error handling (a flow sketch follows this list)
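
A minimal sketch of the extraction and transformation step, assuming pandas and the column names from the sample CSV below; the seniority keyword mapping and function names are illustrative assumptions, not required rules:

import pandas as pd

def derive_seniority(job_title: str) -> str:
    # Keyword-based mapping; the exact keywords are an assumption for illustration.
    title = (job_title or "").lower()
    if any(k in title for k in ("chief", "cto", "ceo", "vp", "director", "head of")):
        return "Executive"
    if any(k in title for k in ("senior", "lead", "principal", "staff")):
        return "Senior"
    if any(k in title for k in ("junior", "intern", "graduate")):
        return "Junior"
    return "Mid"

def extract_and_transform(csv_path: str) -> pd.DataFrame:
    # Read the provided CSV, keeping every field as a string.
    df = pd.read_csv(csv_path, dtype=str)

    # Basic cleaning: normalise missing values to empty strings and strip whitespace.
    df = df.fillna("").apply(lambda col: col.str.strip())

    # Full name from first and last name.
    df["full_name"] = (df["First Name"] + " " + df["Last Name"]).str.strip()

    # Computed field: seniority level derived from the job title.
    df["seniority_level"] = df["Job Title"].apply(derive_seniority)
    return df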
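
One possible schema and loading approach, using the profile URL as a natural unique key alongside a surrogate primary key; the table name, column subset, and upsert strategy are assumptions:

import psycopg2
from psycopg2.extras import execute_values

# Hypothetical schema: a SERIAL surrogate key plus a UNIQUE constraint on the profile URL.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS linkedin_profiles (
    id              SERIAL PRIMARY KEY,
    profile_url     TEXT UNIQUE NOT NULL,
    full_name       TEXT,
    job_title       TEXT,
    company         TEXT,
    work_email      TEXT,
    seniority_level TEXT,
    loaded_at       TIMESTAMPTZ DEFAULT now()
);
"""

def load_profiles(df, conn) -> None:
    # Select the columns to persist, in the order expected by the INSERT below.
    rows = df[["Url", "full_name", "Job Title", "Company",
               "Work Email", "seniority_level"]].values.tolist()
    with conn, conn.cursor() as cur:
        cur.execute(SCHEMA_SQL)
        # Upsert on the profile URL so re-running the pipeline stays idempotent.
        execute_values(cur, """
            INSERT INTO linkedin_profiles
                (profile_url, full_name, job_title, company, work_email, seniority_level)
            VALUES %s
            ON CONFLICT (profile_url) DO UPDATE SET
                full_name       = EXCLUDED.full_name,
                job_title       = EXCLUDED.job_title,
                company         = EXCLUDED.company,
                work_email      = EXCLUDED.work_email,
                seniority_level = EXCLUDED.seniority_level
        """, rows)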
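
Simulated S3 helpers using boto3; the bucket and key arguments are placeholders, and the functions are not expected to be run against a real bucket during the assessment:

import boto3

def download_from_s3(bucket: str, key: str, local_path: str) -> str:
    # Download the raw CSV from S3 to a local path (simulated, not exercised here).
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, local_path)
    return local_path

def upload_to_s3(local_path: str, bucket: str, key: str) -> None:
    # Upload the processed file (or a cleaned export) back to S3.
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)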
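
A minimal Prefect flow wiring the steps together, assuming Prefect 2.x decorators, the helper functions sketched above, and the connection details from the Database Connection section; retry settings and names are illustrative:

import psycopg2
from prefect import flow, task, get_run_logger

@task(retries=2, retry_delay_seconds=10)
def extract(csv_path: str):
    # Re-uses extract_and_transform from the transformation sketch above.
    return extract_and_transform(csv_path)

@task
def load(df):
    # Re-uses load_profiles from the loading sketch above.
    conn = psycopg2.connect(
        dbname="postgres", user="postgres", password="yourpassword",
        host="localhost", port="5432",
    )
    try:
        load_profiles(df, conn)
    finally:
        conn.close()

@flow(name="linkedin-profile-pipeline")
def pipeline(csv_path: str = "profiles.csv"):
    logger = get_run_logger()
    try:
        df = extract(csv_path)
        load(df)
        logger.info("Loaded %d profiles", len(df))
    except Exception:
        # Basic error handling: log and re-raise so the flow run is marked as failed.
        logger.exception("Pipeline failed")
        raise

if __name__ == "__main__":
    pipeline()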

Stretch Goals (If Time Permits)

  1. Additional Data Enrichment

    • Implement additional computed fields:
      • tech_profile: Boolean flag indicating if Github field is populated
      • email_domain: Extract domain from Work Email
      • has_multiple_emails: Boolean flag indicating if Other Work Emails is populated (an enrichment sketch follows this list)
  2. Advanced Error Handling

    • Implement more comprehensive error handling and logging
    • Add data validation checks
  3. Testing

    • Write unit tests for key components of your solution (a test sketch follows this list)
  4. Documentation

    • Add comprehensive docstrings and comments
    • Create a brief design document explaining your approach
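
A sketch of the stretch enrichment fields, assuming the cleaned DataFrame from the core transformation (missing values already normalised to empty strings); the derivations shown are one reasonable interpretation:

def add_stretch_fields(df):
    # tech_profile: True when the Github field is populated.
    df["tech_profile"] = df["Github"].str.len() > 0
    # email_domain: the part after "@" in Work Email, or empty when no email is present.
    df["email_domain"] = df["Work Email"].str.split("@").str[-1].where(
        df["Work Email"].str.contains("@"), ""
    )
    # has_multiple_emails: True when Other Work Emails is populated.
    df["has_multiple_emails"] = df["Other Work Emails"].str.len() > 0
    return df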
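
And a small pytest example for the testing stretch goal; the module name pipeline is a hypothetical placeholder for wherever the helpers above live:

import pandas as pd
from pipeline import derive_seniority, add_stretch_fields  # hypothetical module

def test_derive_seniority():
    assert derive_seniority("Senior Data Engineer") == "Senior"
    assert derive_seniority("Software Engineer") == "Mid"

def test_add_stretch_fields():
    df = pd.DataFrame({
        "Github": ["johndoe", ""],
        "Work Email": ["john.doe@techcorp.com", ""],
        "Other Work Emails": ["jdoe@techcorp.com", ""],
    })
    out = add_stretch_fields(df)
    assert out["tech_profile"].tolist() == [True, False]
    assert out["email_domain"].tolist() == ["techcorp.com", ""]
    assert out["has_multiple_emails"].tolist() == [True, False]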

Development Environment

  • A development environment will be provided with:
    • Python 3.12+
    • Docker installation
    • Postgres Docker container
    • Required Python libraries pre-installed (pandas, boto3, prefect, etc.)
    • Access to documentation

Database Connection

For this assessment, we recommend setting up a Docker container with a Postgres image:

docker run -d --name postgres-interview -e POSTGRES_PASSWORD=yourpassword -p 5432:5432 postgres:latest

You can then connect from Python with psycopg2:

import psycopg2

# Connect to the Postgres container started above.
conn = psycopg2.connect(
    dbname="postgres",
    user="postgres",
    password="yourpassword",
    host="localhost",
    port="5432"
)

# Illustrative parameterized query; the "products" table is only an example
# and is not part of the challenge data.
cursor = conn.cursor()
cursor.execute("SELECT * FROM products WHERE price > %s", (500,))
print(cursor.fetchall())

cursor.close()
conn.close()

Sample Data

Here's a sample of what the input CSV will look like:

"Url","First Name","Last Name","Job Title","Headline","Company","Industry","Location","Work Email","Other Work Emails","Twitter","Github","Company LinkedinUrl","Company Domain","Profile Image Url"
"https://linkedin.com/in/johndoe","John","Doe","Senior Data Engineer","Senior Data Engineer at TechCorp","TechCorp","Information Technology","San Francisco, CA","john.doe@techcorp.com","jdoe@techcorp.com","@johndoecodes","johndoe","https://linkedin.com/company/techcorp","techcorp.com","https://media.linkedin.com/profile/johndoe.jpg"
"https://linkedin.com/in/janedoe","Jane","Doe","Product Manager","Product Manager at DataSoft","DataSoft","Software Development","Seattle, WA","jane.doe@datasoft.com","","@janedoe","","https://linkedin.com/company/datasoft","datasoft.com","https://media.linkedin.com/profile/janedoe.jpg"
"https://linkedin.com/in/alexjohnson","Alex","Johnson","ML Engineer","Machine Learning Engineer","AI Solutions Inc","Artificial Intelligence","Austin, TX","alex.johnson@aisolutions.com","aj@aisolutions.com","@alexj_ai","alexjohnson","https://linkedin.com/company/ai-solutions","aisolutions.com","https://media.linkedin.com/profile/alexjohnson.jpg"
"https://linkedin.com/in/sarahwilliams","Sarah","Williams","Senior Data Scientist","Data Scientist at BigAnalytics","BigAnalytics","Data Science","New York, NY","sarah.williams@biganalytics.com","","@datascisarah","sarahwilliams","https://linkedin.com/company/biganalytics","biganalytics.com","https://media.linkedin.com/profile/sarahwilliams.jpg"
"https://linkedin.com/in/michaelbrown","Michael","Brown","Software Engineer","Software Engineer at CodeCrafters","CodeCrafters","Software Engineering","Chicago, IL","michael.brown@codecrafters.io","mike@codecrafters.io","@mikebcode","mbrown","https://linkedin.com/company/codecrafters","codecrafters.io","https://media.linkedin.com/profile/michaelbrown.jpg"

Evaluation Criteria

You will be evaluated on:

  1. Code Quality & Organization

    • Clean, readable, and well-structured code
    • Modular design with clear separation of concerns
    • Proper error handling for critical operations
  2. Data Engineering Fundamentals

    • Effective data cleaning and transformation
    • Appropriate database schema design
    • Efficient data loading methods
  3. Technical Knowledge

    • Proper use of AWS S3 APIs
    • Effective implementation of Prefect for orchestration
    • SQL knowledge for database operations
  4. Problem-Solving Approach

    • How you prioritize tasks given the time constraint
    • Your approach to debugging and problem-solving
    • Questions you ask and clarifications you seek
  5. Communication

    • Code comments and documentation
    • Explanation of your approach during follow-up discussion
    • Clarity about what you completed and what you would do with more time

Hints & Tips

  • Start with a working end-to-end solution focusing on core functionality
  • Add complexity incrementally as time permits
  • It's better to have a simple working solution than a complex partial one
  • If you get stuck, document your approach and move on
  • We're interested in your thought process as much as your code

What to Submit

  • All Python code files
  • SQL scripts for creating database schema
  • Brief notes on your approach and any design decisions
  • If you don't finish everything, note what you would do with more time

After the coding session, be prepared for a brief discussion about your solution, challenges you faced, and your design decisions.
