-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathInterview Query SQL.txt
More file actions
830 lines (648 loc) · 18.5 KB
/
Interview Query SQL.txt
File metadata and controls
830 lines (648 loc) · 18.5 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
--------------------
Department Expenses
--------------------
As part of the financial management in a large corporation, the CFO wants to review the expenses in all departments for the previous financial year (2022).
Write a SQL query to calculate the total expenditure for each department in 2022. Additionally, for comparison purposes, return the average expense across all departments in 2022.
Note: The output should include the department name, the total expense, and the average expense (rounded to 2 decimal places). The data should be sorted in descending order by total expenditure.
Example:
Input:
departments table
Column Type
id INTEGER
name VARCHAR
expenses table
Column Type
id INTEGER
department_id INTEGER
amount FLOAT
date DATE
Output:
Column Type
department_name VARCHAR
total_expense FLOAT
average_expense FLOAT
Solution :
SELECT
name as department_name,
coalesce(sum(amount),0) as total_expense,
(select
coalesce(round(sum(amount)/count(distinct d.id),2),0)
FROM departments d
left join expenses e
on d.id = e.department_id and year(date) = 2022)
as average_expense
FROM departments d
left join expenses e
on d.id = e.department_id and year(date) = 2022
group by 1
order by 2 desc
-----------------------
Cumulative Distribution
-----------------------
Given the two tables, write a SQL query that creates a cumulative distribution of the number of comments per user.
Assume bin buckets class intervals of one.
Example:
Input:
users table
Columns Type
id INTEGER
name VARCHAR
created_at DATETIME
neighborhood_id INTEGER
sex VARCHAR
comments table
Columns Type
user_id INTEGER
body VARCHAR
created_at DATETIME
Output:
Columns Type
frequency INTEGER
cum_total FLOAT
Solution (old):
WITH hist AS (
SELECT users.id, COUNT(c.user_id) AS frequency
FROM users
LEFT JOIN comments as c
ON users.id = c.user_id
GROUP BY 1
),
freq AS (
SELECT frequency, COUNT(*) AS num_users
FROM hist
GROUP BY 1
)
SELECT f1.frequency, SUM(f2.num_users) AS cum_total
FROM freq AS f1
LEFT JOIN freq AS f2
ON f1.frequency >= f2.frequency
GROUP BY 1
Solution (new):
-- freq of comments, total no of users
-- data : join users (id) - comments (user_id)
-- cte1 : get total number of comments per user
-- query using cte1 : get the number of users per comment frequency
with cte1 as (
select
u.id, count(c.created_at) as comments_count
from users u
left join comments c
-- inner join doesnt work if you also have to display users with 0 comments
on u.id = c.user_id
group by 1
)
select
sum(count(id)) over (order by comments_count) as cum_total,
comments_count as frequency
from cte1
group by 2
------------------
Comments Histogram
------------------
Write a SQL query to create a histogram of the number of comments per user in the month of January 2020.
Note: Assume bin buckets class intervals of one.
Comments that were created outside of January 2020 should be counted in a “0” bucket
Example:
Input:
users table
Columns Type
id INTEGER
name VARCHAR
created_at DATETIME
neighborhood_id INTEGER
mail VARCHAR
comments table
Columns Type
user_id INTEGER
body VARCHAR
created_at DATETIME
Output:
Column Type
comment_count INTEGER
frequency INTEGER
Solution:
with cte1 as (
select
u.id,
coalesce(sum(case when c.created_at >= '2020-01-01' and c.created_at < '2020-02-01'
then 1 else 0 end), 0) as comment_count
from users u
left join comments c on u.id = c.user_id
group by 1
)
select
comment_count,
count(*) as frequency
from cte1
group by 1
order by 1
----------------
Annual Retention
----------------
You’re given a table called annual_payments for an annually billed B2B SAAS subscription product.
annual_payments table
Column Type
id INTEGER
status VARCHAR
user_id INTEGER
created_at DATETIME
amount FLOAT
amount_refunded FLOAT
product VARCHAR
last_updated DATETIME
Users pay for the three different products: 'PDF Editor', 'Cloud Storage', and 'Mobile CRM'.
How would you formulate a query to calculate the average annual retention, for each subsequent year, at the end of the year?
Retention rate is calculated as a percentage of active subscriptions at the end of the year T in the active subscriptions at the end of the previous year (T−1).
Example 1:
User 1 bought a subscript to PDF editor in 2019 for the first time and renewed their subscription in 2020. In 2021, they canceled their subscription. Retention rates should be calculated as:
2019 2020 2021
0.00 1.00 0.00
Example 2:
User 2 bought a subscription to 'PDF Editor' in 2019 for the first time, renewed their subscription in 2020, and canceled a few days later in 2020. Retention rates should be calculated as:
2019 2020 2021
0.00 0.00 0.00
Notes:
The status column may contain the values 'paid', 'refunded', or 'failed'.
A 'failed' payment equates to a user canceling their subscription on the date the next payment is due.
If a user refunds a successful payment, then the row gets updated, and the status becomes 'refunded'. The date of refund is recorded in last_updated.
When a refund occurs in year T, the subscription is canceled. Users may purchase the same product again in the subsequent years, and it is considered a new purchase.
Solution:
-- cte to determine active subscriptions
with active_subscriptions as (
select
user_id,
product,
extract(year from created_at) as year,
case
when status = 'paid' and amount_refunded = 0 then 1
when status in ('failed', 'refunded') then 0
else 0
end as is_active
from annual_payments
),
-- cte to join current year's data with previous year's data
-- using this instead of lag, although that could be used as well
-- retention_data as (
-- select
-- a.year,
-- a.product,
-- a.user_id,
-- a.is_active,
-- coalesce(b.is_active, 0) as prev_year_active
-- from active_subscriptions a
-- left join active_subscriptions b
-- on a.user_id = b.user_id
-- and a.product = b.product
-- and a.year = b.year + 1
-- ),
-- cte to get current year's data with previous year's data using lag
retention_data as (
select
year,
product,
user_id,
is_active,
coalesce(lag(is_active) over (partition by user_id, product order by year), 0) as prev_year_active
from active_subscriptions
),
-- cte to calculate retention rates for each year
retention_rates as (
select
year,
sum(case when is_active = 1 and prev_year_active = 1 then 1 else 0 end) as retained,
sum(prev_year_active) as prev_year_total,
-- case when for retained/prev_year_total
case
when sum(prev_year_active) > 0
then round(cast(sum(case when is_active = 1 and prev_year_active = 1 then 1 else 0 end) as float) / sum(prev_year_active), 2)
else 0
end as percentage_renewed
from retention_data
group by year
)
-- final query to display the results
select
percentage_renewed,
year
from retention_rates
-------------------------
Rolling Bank Transactions
-------------------------
We’re given a table of bank transactions with three columns, user_id, a deposit or withdrawal value (determined if the value is positive or negative), and created_at time for each transaction.
Write a query to get the total three-day rolling average for deposits by day.
Note: Please use the format '%Y-%m-%d' for the date in the outout
Example:
Input:
bank_transactions table
Column Type
user_id INTEGER
created_at DATETIME
transaction_value FLOAT
Output:
Column Type
dt VARCHAR
rolling_three_day FLOAT
Solution:
-- cte to get the sum of deposits from a datetime to date level
with daily_deposits as (
select
date(created_at) as dt,
sum(transaction_value) as daily_deposit_value
from bank_transactions
where transaction_value>0
group by 1
)
-- query to get the rolling avg transaction value for 3 days (current + prev 2 days) using window function
select
dt,
avg(daily_deposit_value) over (
order by dt
rows between 2 preceding and current row
) as rolling_three_day
from daily_deposits
order by 1
-----------------
Random SQL Sample
-----------------
Let’s say we have a table with an id and name fields. The table holds over 100 million rows and we want to sample a random row in the table without throttling the database.
Write a query to randomly sample a row from this table.
Input:
big_table table
Columns Type
id INTEGER
name VARCHAR
Solution 1:
select *
from big_table
order by rand()
limit 1
cons: This works fast if you only have around 1000 rows. It might take a few seconds to run at 10K.
And then at 100K it takes a long time!
------------------
Top Three Salaries
------------------
Given the employees and departments table, write a query to get the top 3 highest employee salaries by department. If the department contains less that 3 employees, the top 2 or the top 1 highest salaries should be listed (assume that each department has at least 1 employee).
Note: The output should include the full name of the employee in one column, the department name, and the salary. The output should be sorted by department name in ascending order and salary in descending order.
Example:
Input:
employees table
Column Type
id INTEGER
first_name VARCHAR
last_name VARCHAR
salary INTEGER
department_id INTEGER
departments table
Column Type
id INTEGER
name VARCHAR
Output:
Column Type
employee_name VARCHAR
department_name VARCHAR
salary INTEGER
Solution 1:
with cte1 as (
select employee_name, department_id, salary from (
select
concat(first_name, " ", last_name) as employee_name,
department_id,
salary,
rank() over (partition by department_id order by salary desc) as r
from employees
) t1
where r<=3
)
select d.name as department_name, employee_name, salary
from cte1 c
left join departments d
on c.department_id = d.id
order by 1, 3 desc
Solution 2:
with dept_salaries as (
select
concat(first_name, " ", last_name) as employee_name,
name as department_name,
salary,
dense_rank() over (partition by d.id order by salary desc) as r
from employees e
inner join departments d
on e.department_id = d.id
)
select
employee_name,
department_name,
salary
from dept_salaries
where r<=3
order by department_name, salary desc
-----------------
Payments Received
-----------------
You’re given two tables, payments and users. The payments table holds all payments between users with the payment_state column consisting of either "success" or "failed".
How many customers that signed up in January 2020 had a combined (successful) sending and receiving volume greater than $100 in their first 30 days?
Note: The sender_id and recipient_id both represent the user_id.
payments table
Column Type
payment_id INTEGER
sender_id INTEGER
recipient_id INTEGER
created_at DATETIME
payment_state VARCHAR
amount_cents INTEGER
users table
Column Type
id INTEGER
created_at DATETIME
Output:
Column Type
num_customers INTEGER
Solution:
-- this cte calculates the total volume in dollars for each user who signed up in January 2020
with user_volumes as (
select
u.id as user_id,
sum(amount_cents)/100.0 as volume_dollars
from users u
join payments p on (u.id = p.sender_id or u.id = p.recipient_id)
where u.created_at >= '2020-01-01' and u.created_at < '2020-02-01'
and p.created_at >= u.created_at
and p.created_at < date_add(u.created_at, interval 30 day)
and p.payment_state = 'success'
group by u.id
)
-- count the number of users with total volume exceeding $100
select
count(*) as num_customers
from user_volumes
where volume_dollars > 100
-----------------
Employee Salaries
-----------------
Given a employees and departments table, select the top 3 departments with at least ten employees and rank them according to the percentage of their employees making over 100K in salary.
Example:
Input:
employees table
Columns Type
id INTEGER
first_name VARCHAR
last_name VARCHAR
salary INTEGER
department_id INTEGER
departments table
Columns Type
id INTEGER
name VARCHAR
Output:
Column Type
percentage_over_100k FLOAT
department_name VARCHAR
number_of_employees INTEGER
Solution (old):
with cte1 as (
select department_id,
count(distinct id) as number_of_employees
from employees
group by 1
having count(distinct id) >=10
),
cte2 as (
select department_id,
sum(salary_flag) as salary_gt_100000 from(
select department_id,
id,
salary,
case when salary > 100000 then 1 else 0 end as salary_flag
from employees
group by 1,2,3
) t1
group by 1
)
select
d.name as department_name,
number_of_employees,
salary_gt_100000/number_of_employees as percentage_over_100k
from cte2
inner join cte1 on cte2.department_id = cte1.department_id
left join departments d on cte2.department_id = d.id
Solution (new):
-- cte1 - select departments with # employees >=10
-- get % of employees making over 100k
select
d.name as department_name,
count(distinct e.id) as number_of_employees,
sum(case when salary>100000 then 1 else 0 end)/count(distinct e.id) as percentage_over_100k
from employees e
inner join departments d
on e.department_id = d.id
group by 1
having count(distinct e.id)>=10
order by 3
--------------------
Subscription Overlap
--------------------
Given a table of product subscriptions with a subscription start date and end date for each user, write a query that returns true or false whether or not each user has a subscription date range that overlaps with any other completed subscription.
Completed subscriptions have end_date recorded.
Example:
Input:
subscriptions table
Column Type
user_id INTEGER
start_date DATETIME
end_date DATETIME
user_id start_date end_date
1 2019-01-01 2019-01-31
2 2019-01-15 2019-01-17
3 2019-01-29 2019-02-04
4 2019-02-05 2019-02-10
Output:
user_id overlap
1 1
2 1
3 1
4 0
Solution:
-- get the users with complete subscriptions
-- compare min and max start date of each column
select user_id,
case when overlap_count > 0 then 1 else 0 end as overlap
from (
select user_id,
(select count(*)
from subscriptions s2
where s1.user_id != s2.user_id
and s1.start_date <= s2.end_date
and s1.end_date >= s2.start_date
and s2.end_date is not null) as overlap_count
from subscriptions s1
where s1.end_date is not null
) t1
order by user_id
-----------------------------
Employee Salaries (ETL Error)
-----------------------------
https://www.interviewquery.com/questions/employee-salaries-etl-error
Let’s say we have a table representing a company payroll schema.
Due to an ETL error, the employees table, instead of updating the salaries every year when doing compensation adjustments, did an insert instead. The head of HR still needs the current salary of each employee.
Write a query to get the current salary for each employee.
Note: Assume no duplicate combination of first and last names (I.E. No two John Smiths). Assume the INSERT operation works with ID autoincrement.
Example:
Input:
employees table
Column Type
id VARCHAR
first_name VARCHAR
last_name VARCHAR
salary INTEGER
department_id INTEGER
Output:
Column Types
first_name VARCHAR
last_name VARCHAR
salary INTEGER
Solution:
select first_name, last_name, salary
from (
select *, rank() over (partition by concat(first_name,last_name) order by id desc) as r
from employees
) t1
where r = 1
---------------
Customer Orders
---------------
Write a query to identify customers who placed more than three transactions each in both 2019 and 2020.
Example:
Input:
transactions table
Column Type
id INTEGER
user_id INTEGER
created_at DATETIME
product_id INTEGER
quantity INTEGER
users table
Column Type
id INTEGER
name VARCHAR
Output:
Column Type
customer_name VARCHAR
Solution:
select
u.name as customer_name
from transactions t
inner join users u
on t.user_id = u.id
where year(created_at) in (2019,2020)
group by 1
having sum(case when year(created_at) = 2019 then 1 else 0 end) > 3 and
sum(case when year(created_at) = 2020 then 1 else 0 end) > 3
-------------------
Empty Neighborhoods
-------------------
We’re given two tables, a users table with demographic information and the neighborhood they live in and a neighborhoods table.
Write a query that returns all neighborhoods that have 0 users.
Example:
Input:
users table
Columns Type
id INTEGER
name VARCHAR
neighborhood_id INTEGER
created_at DATETIME
neighborhoods table
Columns Type
id INTEGER
name VARCHAR
city_id INTEGER
Output:
Columns Type
name VARCHAR
Solution:
select
n.name as name
from neighborhoods n
left join users u
on n.id = u.neighborhood_id
group by 1
having count(u.id) = 0
--------------
Download Facts
--------------
iven two tables: accounts, and downloads, find the average number of downloads for free vs paying accounts, broken down by day.
Note: You only need to consider accounts that have had at least one download before when calculating the average.
Note: round average_downloads to 2 decimal places.
Example:
Input:
accounts table
Column Type
account_id INTEGER
paying_customer BOOLEAN
downloads table
column type
account_id INTEGER
download_date DATETIME
downloads INTEGER
Output:
Column Type
download_date DATETIME
paying_customer BOOLEAN
average_downloads FLOAT
Solution:
select
download_date,
paying_customer,
avg(downloads) as average_downloads
from accounts a
inner join downloads d
on a.account_id = d.account_id
group by 1,2
------------------
Closest SAT Scores
------------------
Given a table of students and their SAT test scores, write a query to return the two students with the closest test scores with the score difference.
If there are multiple students with the same minimum score difference, select the student name combination that is higher in the alphabet.
Example:
Input:
scores table
Column Type
id INTEGER
student VARCHAR
score INTEGER
Output:
Column Type
one_student VARCHAR
other_student VARCHAR
score_diff INTEGER
Solution (old):
with score_diff as (
select
case when s1.student < s2.student then s1.student else s2.student end as one_student,
case when s1.student < s2.student then s2.student else s1.student end as other_student,
abs(s1.score - s2.score) as score_diff,
row_number() over (order by abs(s1.score - s2.score),
greatest(s1.student, s2.student),
least(s1.student, s2.student)) as rn
from
scores s1
join
scores s2 on s1.id < s2.id
)
select
one_student,
other_student,
score_diff
from
score_diff
where
rn = 1
Solution (new):
select
s1.student as one_student,
s2.student as other_student,
abs(s1.score-s2.score) as score_diff
from scores s1
inner join scores s2
on s1.id < s2.id
order by 3,1,2
limit 1