Our company regularly receives LinkedIn profile data from a third-party data provider. This data arrives as CSV files containing profile information that needs to be processed, cleaned, and stored in our database for further analysis. Your task is to build a data pipeline that handles this workflow.
You have 2 hours to complete as much of this challenge as possible. We understand that full completion is challenging within this timeframe - focus on demonstrating your approach, coding style, and problem-solving skills rather than implementing every feature.
The provided requirements.txt file should include all the packages necessary to complete this task. Feel free to install any other packages you may need.
```bash
pip install -r requirements.txt
```

- A CSV file containing LinkedIn profile data will be provided to you
- The file includes the following fields:
  - Url (LinkedIn profile URL)
  - First Name
  - Last Name
  - Job Title
  - Headline
  - Company
  - Industry
  - Location
  - Work Email
  - Other Work Emails
  - Twitter
  - Github
  - Company LinkedinUrl
  - Company Domain
  - Profile Image Url
Data Extraction & Transformation
- Write code to read the provided CSV file
- Clean the data (handle missing values, standardize formats)
- Create a full name field from first and last name
- Enrich the data with at least one computed field:
  - seniority_level: determined from Job Title (Junior, Mid, Senior, Executive); see the sketch after this list
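A minimal sketch of one possible approach with pandas, using the column names from the sample CSV below. The seniority keyword lists are illustrative assumptions, not part of the spec:

```python
import pandas as pd

def classify_seniority(job_title: str) -> str:
    """Bucket a job title into Junior / Mid / Senior / Executive.

    The keyword lists here are illustrative assumptions.
    """
    title = (job_title or "").lower()
    if any(k in title for k in ("chief", "ceo", "cto", "vp", "director", "head of")):
        return "Executive"
    if any(k in title for k in ("senior", "sr.", "lead", "principal", "staff")):
        return "Senior"
    if any(k in title for k in ("junior", "jr.", "intern", "graduate")):
        return "Junior"
    return "Mid"

def transform(csv_path: str) -> pd.DataFrame:
    # Read everything as strings, then normalize missing values to ""
    df = pd.read_csv(csv_path, dtype=str).fillna("")
    df.columns = [c.strip() for c in df.columns]  # standardize header whitespace
    df["full_name"] = (
        df["First Name"].str.strip() + " " + df["Last Name"].str.strip()
    ).str.strip()
    df["seniority_level"] = df["Job Title"].apply(classify_seniority)
    return df
```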
Data Loading
- Design a PostgreSQL schema to store the processed data
- Implement code to load the data into PostgreSQL
- Include a unique identifier or primary key strategy (one option is sketched below)
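A possible schema and loader, sketched with psycopg2. The table name and column subset are assumptions; the profile Url is used as a natural unique key alongside a surrogate SERIAL id, so re-running the load is idempotent via an upsert:

```python
import pandas as pd
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS profiles (
    id              SERIAL PRIMARY KEY,
    url             TEXT UNIQUE NOT NULL,  -- natural key: LinkedIn profile URL
    full_name       TEXT,
    job_title       TEXT,
    company         TEXT,
    seniority_level TEXT,
    work_email      TEXT
);
"""

UPSERT = """
INSERT INTO profiles (url, full_name, job_title, company, seniority_level, work_email)
VALUES (%s, %s, %s, %s, %s, %s)
ON CONFLICT (url) DO UPDATE SET
    full_name       = EXCLUDED.full_name,
    job_title       = EXCLUDED.job_title,
    company         = EXCLUDED.company,
    seniority_level = EXCLUDED.seniority_level,
    work_email      = EXCLUDED.work_email;
"""

def load(df: pd.DataFrame, conn_kwargs: dict) -> None:
    rows = df[["Url", "full_name", "Job Title", "Company",
               "seniority_level", "Work Email"]].values.tolist()
    conn = psycopg2.connect(**conn_kwargs)
    try:
        with conn:  # commits on success, rolls back on error
            with conn.cursor() as cur:
                cur.execute(DDL)
                # executemany keeps the sketch simple; COPY would be faster at scale
                cur.executemany(UPSERT, rows)
    finally:
        conn.close()
```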
AWS S3 Integration
- Write functions to simulate downloading from and uploading to S3
- You don't need to test these functions, but show proper AWS SDK usage (a boto3 sketch follows)
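A sketch of what the simulated S3 helpers might look like with boto3; bucket and key names are caller-supplied placeholders:

```python
import logging

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def download_from_s3(bucket: str, key: str, local_path: str) -> bool:
    """Download an object from S3 to a local file; returns False on failure."""
    try:
        s3.download_file(bucket, key, local_path)
        return True
    except ClientError as exc:
        logging.error("S3 download failed for s3://%s/%s: %s", bucket, key, exc)
        return False

def upload_to_s3(local_path: str, bucket: str, key: str) -> bool:
    """Upload a local file to S3; returns False on failure."""
    try:
        s3.upload_file(local_path, bucket, key)
        return True
    except ClientError as exc:
        logging.error("S3 upload failed for s3://%s/%s: %s", bucket, key, exc)
        return False
```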
Basic Prefect Flow
- Implement a simple Prefect flow that connects your extraction, transformation, and loading steps
- Include basic error handling (see the sketch below)
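A minimal Prefect 2.x flow connecting the steps. Here `pipeline_steps` is a hypothetical module assumed to collect the `transform` and `load` sketches above; task retries provide the basic error handling, and any uncaught exception marks the flow run as failed:

```python
from prefect import flow, task

# Hypothetical module collecting the earlier sketches
from pipeline_steps import load, transform

@task(retries=2, retry_delay_seconds=10)
def extract_and_transform(csv_path: str):
    return transform(csv_path)

@task
def persist(df):
    load(df, {"dbname": "postgres", "user": "postgres",
              "password": "yourpassword", "host": "localhost", "port": 5432})

@flow(name="linkedin-profile-pipeline")
def pipeline(csv_path: str = "profiles.csv"):
    df = extract_and_transform(csv_path)
    persist(df)

if __name__ == "__main__":
    pipeline()
```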
Additional Data Enrichment
- Implement additional computed fields (sketched below):
  - tech_profile: Boolean flag indicating if the Github field is populated
  - email_domain: extract the domain from Work Email
  - has_multiple_emails: Boolean flag indicating if Other Work Emails is populated
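Continuing the pandas sketch above, these three fields might look like:

```python
import pandas as pd

def enrich(df: pd.DataFrame) -> pd.DataFrame:
    # A field counts as "populated" if it contains anything beyond whitespace
    df["tech_profile"] = df["Github"].str.strip().ne("")
    df["has_multiple_emails"] = df["Other Work Emails"].str.strip().ne("")
    # Everything after the "@"; empty string when no email is present
    df["email_domain"] = df["Work Email"].str.extract(r"@(.+)$", expand=False).fillna("")
    return df
```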
Advanced Error Handling
- Implement more comprehensive error handling and logging
- Add data validation checks (a lightweight approach is sketched below)
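One lightweight way to sketch validation with plain pandas and logging, without assuming a dedicated validation library:

```python
import logging

import pandas as pd

logger = logging.getLogger("pipeline")

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Log basic data-quality problems and drop rows that cannot be keyed."""
    # Work Email values that are present but not even superficially email-shaped
    bad_emails = df["Work Email"].ne("") & ~df["Work Email"].str.contains("@", regex=False)
    if bad_emails.any():
        logger.warning("validation: %d malformed Work Email values", int(bad_emails.sum()))
    missing_url = df["Url"].eq("")
    if missing_url.any():
        logger.warning("validation: dropping %d rows with no profile Url", int(missing_url.sum()))
    # The Url is the upsert key, so rows without one cannot be loaded
    return df[~missing_url]
```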
Testing
- Write unit tests for key components of your solution (a pytest example follows)
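For example, the seniority classifier from the first sketch is a natural unit-test target with pytest (`pipeline_steps` is the same hypothetical module as in the flow sketch):

```python
import pytest

from pipeline_steps import classify_seniority

@pytest.mark.parametrize(
    ("title", "expected"),
    [
        ("Senior Data Engineer", "Senior"),
        ("Junior Analyst", "Junior"),
        ("VP of Engineering", "Executive"),
        ("Software Engineer", "Mid"),
        ("", "Mid"),  # empty titles fall through to the default bucket
    ],
)
def test_classify_seniority(title, expected):
    assert classify_seniority(title) == expected
```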
Documentation
- Add comprehensive docstrings and comments
- Create a brief design document explaining your approach
Environment
- A development environment will be provided with:
- Python 3.12+
- Docker installation
- Postgres Docker container
- Required Python libraries pre-installed (pandas, boto3, prefect, etc.)
- Access to documentation
For this assessment, we recommend setting up a Docker container with a Postgres image:

```bash
docker run -d --name postgres-interview -e POSTGRES_PASSWORD=yourpassword -p 5432:5432 postgres:latest
```

You can then connect to it from Python:

```python
import psycopg2

conn = psycopg2.connect(
dbname="postgres",
user="postgres",
password="yourpassword",
host="localhost",
port="5432"
)
cursor = conn.cursor()
# Example query only; 'products' is a placeholder table. Use parameterized
# queries (as here) rather than string formatting to avoid SQL injection.
cursor.execute("SELECT * FROM products WHERE price > %s", (500,))
print(cursor.fetchall())
conn.close()
```

Here's a sample of what the input CSV will look like:

```csv
"Url","First Name","Last Name","Job Title","Headline","Company","Industry","Location","Work Email","Other Work Emails","Twitter","Github","Company LinkedinUrl","Company Domain","Profile Image Url"
"https://linkedin.com/in/johndoe","John","Doe","Senior Data Engineer","Senior Data Engineer at TechCorp","TechCorp","Information Technology","San Francisco, CA","john.doe@techcorp.com","jdoe@techcorp.com","@johndoecodes","johndoe","https://linkedin.com/company/techcorp","techcorp.com","https://media.linkedin.com/profile/johndoe.jpg"
"https://linkedin.com/in/janedoe","Jane","Doe","Product Manager","Product Manager at DataSoft","DataSoft","Software Development","Seattle, WA","jane.doe@datasoft.com","","@janedoe","","https://linkedin.com/company/datasoft","datasoft.com","https://media.linkedin.com/profile/janedoe.jpg"
"https://linkedin.com/in/alexjohnson","Alex","Johnson","ML Engineer","Machine Learning Engineer","AI Solutions Inc","Artificial Intelligence","Austin, TX","alex.johnson@aisolutions.com","aj@aisolutions.com","@alexj_ai","alexjohnson","https://linkedin.com/company/ai-solutions","aisolutions.com","https://media.linkedin.com/profile/alexjohnson.jpg"
"https://linkedin.com/in/sarahwilliams","Sarah","Williams","Senior Data Scientist","Data Scientist at BigAnalytics","BigAnalytics","Data Science","New York, NY","sarah.williams@biganalytics.com","","@datascisarah","sarahwilliams","https://linkedin.com/company/biganalytics","biganalytics.com","https://media.linkedin.com/profile/sarahwilliams.jpg"
"https://linkedin.com/in/michaelbrown","Michael","Brown","Software Engineer","Software Engineer at CodeCrafters","CodeCrafters","Software Engineering","Chicago, IL","michael.brown@codecrafters.io","mike@codecrafters.io","@mikebcode","mbrown","https://linkedin.com/company/codecrafters","codecrafters.io","https://media.linkedin.com/profile/michaelbrown.jpg"You will be evaluated on:
Code Quality & Organization
- Clean, readable, and well-structured code
- Modular design with clear separation of concerns
- Proper error handling for critical operations
Data Engineering Fundamentals
- Effective data cleaning and transformation
- Appropriate database schema design
- Efficient data loading methods
Technical Knowledge
- Proper use of AWS S3 APIs
- Effective implementation of Prefect for orchestration
- SQL knowledge for database operations
Problem-Solving Approach
- How you prioritize tasks given the time constraint
- Your approach to debugging and problem-solving
- Questions you ask and clarifications you seek
Communication
- Code comments and documentation
- Explanation of your approach during follow-up discussion
- Clarity about what you completed and what you would do with more time
Tips
- Start with a working end-to-end solution focusing on core functionality
- Add complexity incrementally as time permits
- It's better to have a simple working solution than a complex partial one
- If you get stuck, document your approach and move on
- We're interested in your thought process as much as your code
Deliverables
- All Python code files
- SQL scripts for creating database schema
- Brief notes on your approach and any design decisions
- If you don't finish everything, note what you would do with more time
After the coding session, be prepared for a brief discussion about your solution, challenges you faced, and your design decisions.