Top Big Tech Coding Questions in Python

21 minute read

Hello everyone, today I will discuss some interesting Easy Python exercises. You can solve them in less than 3 minutes for each of them.

I have collected some basic questions in Python for the largest Big Tech companies.

  • Microsoft
  • Apple
  • Amazon

Usually FAANG is an acronym used to describe some of the most prominent companies in the tech sector. FANG is a group of high performing technology stocks that includes Facebook, Amazon, Netflix, and Google (Alphabet). Investors then added Apple into the list to form the acronym FAANG.

Big Tech, also known as the Tech Giants, Big Four, or Big Five, is a name given to the presently four or five largest, most dominant, and most prestigious companies in the information technology industry of the United States.

Coding Questions

The questions contains an example of the first rows of the datasets to deal with.

Microsoft

Find the total number of records that belong to each variety in the dataset. Output the variety along with the corresponding number of records. Order records by the variety in ascending order.

iris

sepal_length sepal_width petal_length petal_width variety
5.1 3.5 1.4 0.2 Setosa
4.9 3 1.4 0.2 Setosa
4.7 3.2 1.3 0.2 Setosa
4.6 3.1 1.5 0.2 Setosa
5 3.6 1.4 0.2 Setosa
# Import your libraries
import pandas as pd

# Start writing code
iris.groupby(['variety']).size().to_frame('n_variety').reset_index().sort_values('variety')

Write a query that returns the number of unique users per client per month.

fact_events

id time_id user_id customer_id client_id event_type event_id
1 2020-02-28 00:00:00 3668-QPYBK Sendit desktop message sent 3
2 2020-02-28 00:00:00 7892-POOKP Connectix mobile file received 2
3 2020-04-03 00:00:00 9763-GRSKD Zoomit desktop video call received 7
4 2020-04-02 00:00:00 9763-GRSKD Connectix desktop video call received 7
5 2020-02-06 00:00:00 9237-HQITU Sendit desktop video call received 7
# Import your libraries
import pandas as pd

# Start writing code
df=fact_events
df['month'] = pd.to_datetime(df['time_id']).dt.month
df.groupby(['month','client_id'])['user_id'].nunique().reset_index()

We have a table with employees and their salaries, however, some of the records are old and contain outdated salary information. Find the current salary of each employee assuming that salaries increase each year. Output their id, first name, last name, department ID, and current salary. Order your list by employee ID in ascending order.

ms_employee_salary

id first_name last_name salary department_id
1 Todd Wilson 110000 1006
1 Todd Wilson 106119 1006
2 Justin Simon 128922 1005
2 Justin Simon 130000 1005
3 Kelly Rosario 42689 1002

Solution

# Import your libraries
import pandas as pd

# Start writing code
df = ms_employee_salary
df.groupby('id').max().reset_index()

Given a list of projects and employees mapped to each project, calculate by the amount of project budget allocated to each employee . The output should include the project title and the project budget per employee rounded to the closest integer. Order your list by projects with the highest budget per employee first.

Preview of firsts rows

ms_projects

id title budget
1 Project1 29498
2 Project2 32487
3 Project3 43909
4 Project4 15776
5 Project5 36268

ms_emp_projects

emp_id project_id
10592 1
10593 2
10594 3
10595 4
10596 5

Solution

# Import your libraries
import pandas as pd

# Start writing code
df_grp = ms_emp_projects.groupby('project_id')['emp_id'].count().reset_index()

df_merge = pd.merge(ms_projects , df_grp , left_on = 'id'
                    , right_on = 'project_id' 
                    , how = 'inner')
df_merge['budget_emp_ratio'] = df_merge['budget'] / df_merge['emp_id']
df_merge[['title','budget_emp_ratio']]

Given a list of projects and employees mapped to each project, calculate by the amount of project budget allocated to each employee . The output should include the project title and the project budget per employee rounded to the closest integer. Order your list by projects with the highest budget per employee first.

ms_projects

id title budget
1 Project1 29498
2 Project2 32487
3 Project3 43909
4 Project4 15776
5 Project5 36268

ms_emp_projects

emp_id project_id
10592 1
10593 2
10594 3
10595 4
10596 5

Solution

import pandas as pd
import numpy as np

df=pd.merge(ms_projects, ms_emp_projects, how = 'inner',left_on = ['id'], right_on=['project_id'])
df1=df.groupby(['title','budget'])['emp_id'].size().reset_index()
df1['budget_emp_ratio'] = (df1['budget']/df1['emp_id']).round(0)
df2=df1.sort_values(by='budget_emp_ratio',ascending=False)
result = df2[["title","budget_emp_ratio"]]

Write a query that returns the company (customer id column) with highest number of users that use desktop only.

fact_events

id time_id user_id customer_id client_id event_type event_id
1 2020-02-28 00:00:00 3668-QPYBK Sendit desktop message sent 3
2 2020-02-28 00:00:00 7892-POOKP Connectix mobile file received 2
3 2020-04-03 00:00:00 9763-GRSKD Zoomit desktop video call received 7
4 2020-04-02 00:00:00 9763-GRSKD Connectix desktop video call received 7
5 2020-02-06 00:00:00 9237-HQITU Sendit desktop video call received 7

Solution

import pandas as pd

one_client_users = fact_events.groupby('user_id')['client_id'].nunique().reset_index()
one_client_users = one_client_users[one_client_users['client_id']==1]
desktop_users = fact_events[fact_events['client_id']=='desktop']

result = fact_events[(fact_events['user_id'].isin(desktop_users['user_id'])) & (fact_events['user_id'].isin(one_client_users['user_id']))].groupby('customer_id')['user_id'].nunique().to_frame('num_users').reset_index()
result = result[result['num_users']==result['num_users'].max()][['customer_id']]

Write a query that returns the company (customer id column) with highest number of users that use desktop only.

id time_id user_id customer_id client_id event_type event_id
1 2020-02-28 00:00:00 3668-QPYBK Sendit desktop message sent 3
2 2020-02-28 00:00:00 7892-POOKP Connectix mobile file received 2
3 2020-04-03 00:00:00 9763-GRSKD Zoomit desktop video call received 7
4 2020-04-02 00:00:00 9763-GRSKD Connectix desktop video call received 7
5 2020-02-06 00:00:00 9237-HQITU Sendit      

Solution

import pandas as pd

one_client_users = fact_events.groupby('user_id')['client_id'].nunique().reset_index()
one_client_users = one_client_users[one_client_users['client_id']==1]
desktop_users = fact_events[fact_events['client_id']=='desktop']

result = fact_events[(fact_events['user_id'].isin(desktop_users['user_id'])) & (fact_events['user_id'].isin(one_client_users['user_id']))].groupby('customer_id')['user_id'].nunique().to_frame('num_users').reset_index()
result = result[result['num_users']==result['num_users'].max()][['customer_id']]

Write a query that returns a number of users who are exclusive to only one client. Output the client_id and number of exclusive users.

fact_events

id time_id user_id customer_id client_id event_type event_id
1 2020-02-28 00:00:00 3668-QPYBK Sendit desktop message sent 3
2 2020-02-28 00:00:00 7892-POOKP Connectix mobile file received 2
3 2020-04-03 00:00:00 9763-GRSKD Zoomit desktop video call received 7
4 2020-04-02 00:00:00 9763-GRSKD Connectix desktop video call received 7
5 2020-02-06 00:00:00 9237-HQITU Sendit desktop video call received 7

Solution

import pandas as pd

fact_events['n_clients'] = fact_events.groupby('user_id')['client_id'].transform('nunique')
result = fact_events[fact_events['n_clients']==1].groupby(['client_id'])['user_id'].nunique().reset_index()

Write a query that returns a list of the bottom 2 companies by mobile usage. Company is defined in the customer_id column. Mobile usage is defined as the number of events registered on a client_id == ‘mobile’. Order the result by the number of events ascending. In the case where there are multiple companies tied for the bottom ranks (rank 1 or 2), return all the companies. Output the customer_id and number of events.

fact_events

id time_id user_id customer_id client_id event_type event_id
1 2020-02-28 00:00:00 3668-QPYBK Sendit desktop message sent 3
2 2020-02-28 00:00:00 7892-POOKP Connectix mobile file received 2
3 2020-04-03 00:00:00 9763-GRSKD Zoomit desktop video call received 7
4 2020-04-02 00:00:00 9763-GRSKD Connectix desktop video call received 7
5 2020-02-06 00:00:00 9237-HQITU Sendit desktop video call received 7

Solution

# Import your libraries
import pandas as pd

# Start writing code
fact_events.head()

agg_df = fact_events\
    .query('client_id == "mobile"')\
    .groupby('customer_id')['event_id']\
    .count()\
    .reset_index()\
    .rename(columns={'event_id': 'n_events'})
    
agg_df[agg_df['n_events'].isin(agg_df['n_events'].drop_duplicates().nsmallest(2))]\
    .sort_values('n_events', ascending=True)

Find the number of employees who received the bonus and who didn’t. Output an indication of whether the bonus was received or not along with the corresponding number of employees. ex: if the bonus was received: 1, if not: 0.

employee

id first_name last_name age sex employee_title department salary target bonus email city address manager_id
5 Max George 26 M Sales Sales 1300 200 150 [email protected] California 2638 Richards Avenue 1
13 Katty Bond 56 F Manager Management 150000 0 300 [email protected] Arizona   1
11 Richerd Gear 57 M Manager Management 250000 0 300 [email protected] Alabama   1
10 Jennifer Dion 34 F Sales Sales 1000 200 150 [email protected] Alabama   13
19 George Joe 50 M Manager Management 100000 0 300 [email protected] Florida 1003 Wyatt Street 1

bonus

worker_ref_id bonus_amount bonus_date
1 5000 2020-02-16 00:00:00
2 3000 2011-06-16 00:00:00
3 4000 2020-02-16 00:00:00
1 4500 2020-02-16 00:00:00
2 3500 2011-06-16 00:00:00

Solution

import pandas as pd
import numpy as np

merged_df = pd.merge(employee,bonus,left_on='id',right_on='worker_ref_id',how='left')
merged_df['has_bonus'] = (merged_df.bonus_date.notnull()).astype(int)
result = merged_df.groupby(['has_bonus'])['id'].nunique()

Find the number of transactions that occurred for each product. Output the product name along with the corresponding number of transactions and order records by the product id in ascending order. You can ignore products without transactions.

excel_sql_inventory_data

product_id product_name product_type unit price_unit wholesale current_inventory
1 strawberry produce lb 3.28 1.77 13
2 apple_fuji produce lb 1.44 0.43 2
3 orange produce lb 1.02 0.37 2
4 clementines produce lb 1.19 0.44 44
5 blood_orange produce lb 3.86 1.66 19

excel_sql_transaction_data

transaction_id time product_id
1 2016-01-08 17:46:17 3
4 2016-01-06 17:57:42 4
8 2016-01-07 09:35:40 8
9 2016-01-03 09:06:20 12
9 2016-01-03 09:06:20 14

Solution

import pandas as pd
import numpy as np

transaction = excel_sql_transaction_data[excel_sql_transaction_data['product_id'].notnull()]

merged = pd.merge(excel_sql_inventory_data,transaction, on = 'product_id', how = 'inner')

merged['count'] = merged.groupby(['product_name','product_id'])['transaction_id'].transform('count')

merged = merged.drop_duplicates(subset = 'product_id')

result = merged.sort_values('product_id')[['product_name', 'count']]

Select the most popular client_id based on a count of the number of users who have at least 50% of their events from the following list: ‘video call received’, ‘video call sent’, ‘voice call received’, ‘voice call sent’.

fact_events

id time_id user_id customer_id client_id event_type event_id
1 2020-02-28 00:00:00 3668-QPYBK Sendit desktop message sent 3
2 2020-02-28 00:00:00 7892-POOKP Connectix mobile file received 2
3 2020-04-03 00:00:00 9763-GRSKD Zoomit desktop video call received 7
4 2020-04-02 00:00:00 9763-GRSKD Connectix desktop video call received 7
5 2020-02-06 00:00:00 9237-HQITU Sendit desktop video call received 7

Solution

import pandas as pd

events_list = ['video call received', 'video call sent', 'voice call received', 'voice call sent']
fact_events['event_check'] = fact_events['event_type'].apply(lambda x: 1 if x in events_list else 0)
fact_events['event_check_mean'] = fact_events.groupby('user_id')['event_check'].transform('mean')
result = fact_events[fact_events['event_check_mean']>=0.5].groupby('client_id')['id'].count().reset_index()
result['ranking'] = result['id'].rank(ascending=False)
result = result[result.ranking == 1][['client_id']]

Calculate the share of new and existing users for each month in the table. Output the month, share of new users, and share of existing users as a ratio. New users are defined as users who started using services in the current month (there is no usage history in previous months). Existing users are users who used services in current month, but they also used services in any previous month. Assume that the dates are all from the year 2020.

fact_events

id time_id user_id customer_id client_id event_type event_id
1 2020-02-28 00:00:00 3668-QPYBK Sendit desktop message sent 3
2 2020-02-28 00:00:00 7892-POOKP Connectix mobile file received 2
3 2020-04-03 00:00:00 9763-GRSKD Zoomit desktop video call received 7
4 2020-04-02 00:00:00 9763-GRSKD Connectix desktop video call received 7
5 2020-02-06 00:00:00 9237-HQITU Sendit desktop video call received 7

Solution

import pandas as pd
import datetime as dt 

result = fact_events.groupby(pd.to_datetime(fact_events['time_id']).dt.month)['user_id'].nunique().to_frame('n_users')
df1 = fact_events.groupby('user_id')['time_id'].min().to_frame('month_min').reset_index()
df1['month_min'] = pd.to_datetime(df1['month_min']).dt.month
df1 = df1.groupby('month_min')['user_id'].nunique().to_frame('n_new_users')
result = result.join(df1)
result['share_of_new_users'] = result['n_new_users'] /  result['n_users']
result['share_of_old_users'] = 1 - result['share_of_new_users']
result = result[['share_of_new_users', 'share_of_old_users']].reset_index()

Find the total number of downloads for paying and non-paying users by date. Include only records where non-paying customers have more downloads than paying customers. The output should be sorted by earliest date first and contain 3 columns date, non-paying downloads, paying downloads.

ms_user_dimension

user_id acc_id
1 716
2 749
3 713
4 744
5 726

ms_acc_dimension

acc_id paying_customer
700 no
701 no
702 no
703 no
704 no

ms_download_facts

View the output in a separate browser tab

date user_id downloads
2020-08-24 00:00:00 1 6
2020-08-22 00:00:00 2 6
2020-08-18 00:00:00 3 2
2020-08-24 00:00:00 4 4
2020-08-19 00:00:00 5 7

Solution

import pandas as pd
import numpy as np

df=pd.merge(ms_user_dimension, ms_acc_dimension, how = 'inner',left_on = ['acc_id'], right_on=['acc_id'])
df1= pd.merge(df, ms_download_facts, how = 'inner',left_on = ['user_id'], right_on=['user_id'])
x=df1.pivot_table(index=['date'],columns=['paying_customer'],values=['downloads'],aggfunc='sum')
df2=pd.DataFrame(x.to_records())
df2.columns = df2.columns.str.replace("[()]","").str.replace("[' ']","").str.replace("[,]","").str.replace("downloads","")
df3=df2.fillna(0)
df3['diff'] = df3['no']-df3['yes']
df4 = df3[df3["diff"] > 0]
result = df4[["date","no","yes"]].sort_values("date")

Interview Questions in Apple

Count the number of user events performed by MacBookPro users. Output the result along with the event name. Sort the result based on the event count in the descending order.

playbook_events

user_id occurred_at event_type event_name location device
6991 2014-06-09 18:26:54 engagement home_page United States iphone 5
18851 2014-08-29 13:18:38 signup_flow enter_info Russia asus chromebook
14998 2014-07-01 12:47:56 engagement login France hp pavilion desktop
8186 2014-05-23 10:44:16 engagement home_page Italy macbook pro
9626 2014-07-31 17:15:14 engagement login Russia nexus 7

Solution

# Import your libraries
import pandas as pd

# Start writing code
df = playbook_events
df.loc[df.device == 'macbook pro', 'event_name'].value_counts().reset_index()

Find customers who appear in the orders table more than three times.

orders

id cust_id order_date order_details total_order_cost
1 3 2019-03-04 00:00:00 Coat 100
2 3 2019-03-01 00:00:00 Shoes 80
3 3 2019-03-07 00:00:00 Skirt 30
4 7 2019-02-01 00:00:00 Coat 25
5 7 2019-03-10 00:00:00 Shoes 80

Solution

# Import your libraries
import pandas as pd

# Start writing code
grp = orders.groupby('cust_id').size().reset_index(name='ct')
grp[grp.ct > 3]['cust_id'].values

Find customers who have never made an order. Output the first name of the customer.

customers

id first_name last_name city address phone_number
8 John Joseph San Francisco   928-386-8164
7 Jill Michael Austin   813-297-0692
4 William Daniel Denver   813-368-1200
5 Henry Jackson Miami   808-601-7513
13 Emma Isaac Miami   808-690-5201

orders

id cust_id order_date order_details total_order_cost
1 3 2019-03-04 00:00:00 Coat 100
2 3 2019-03-01 00:00:00 Shoes 80
3 3 2019-03-07 00:00:00 Skirt 30
4 7 2019-02-01 00:00:00 Coat 25
5 7 2019-03-10 00:00:00 Shoes 80

Solution

import pandas as pd
import numpy as np

merge = pd.merge(customers,orders, left_on = 'id', right_on = 'cust_id', how = 'left')
result = merge[merge['cust_id'].isnull()][['first_name']]


For each platform (e.g. Windows, iPhone, iPad etc.), calculate the number of users. Consider unique users and not individual sessions. Output the name of the platform with the corresponding number of users.

user_sessions

session_id user_id session_starttime session_endtime platform
1 U1 2020-01-01 12:14:28 2020-01-01 12:16:08 Windows
2 U1 2020-01-01 18:23:50 2020-01-01 18:24:00 Windows
3 U1 2020-01-01 08:15:00 2020-01-01 08:20:00 IPhone
4 U2 2020-01-01 10:53:10 2020-01-01 10:53:30 IPhone
5 U2 2020-01-01 18:25:14 2020-01-01 18:27:53 IPhone

Solution

grouped_df = user_sessions.groupby("platform")
grouped_df = grouped_df['user_id'].nunique()
result = grouped_df.reset_index()

Find the number of male and female employees per department and also their corresponding total salaries. Output department names along with the corresponding number of female employees, the total salary of female employees, the number of male employees, and the total salary of male employees.

employee

id first_name last_name age sex employee_title department salary target bonus email city address manager_id
5 Max George 26 M Sales Sales 1300 200 150 [email protected] California 2638 Richards Avenue 1
13 Katty Bond 56 F Manager Management 150000 0 300 [email protected] Arizona   1
11 Richerd Gear 57 M Manager Management 250000 0 300 [email protected] Alabama   1
10 Jennifer Dion 34 F Sales Sales 1000 200 150 [email protected] Alabama   13
19 George Joe 50 M Manager Management 100000 0 300        

Solution

# Import your libraries
import pandas as pd

# Start writing code
df = employee
sex_df = df.groupby(['department', 'sex'])['id'].nunique().reset_index()
salary_df = df.groupby(['department', 'sex'])['salary'].sum().reset_index()
pd.merge(sex_df, salary_df, on = ['department', 'sex'])

Interview Questions in Amazon

Find the details of each customer regardless of whether the customer made an order. Output the customer’s first name, last name, and the city along with the order details. You may have duplicate rows in your results due to a customer ordering several of the same items. Sort records based on the customer’s first name and the order details in ascending order.

Sample of header:

customers

id first_name last_name city address phone_number
8 John Joseph San Francisco   928-386-8164
7 Jill Michael Austin   813-297-0692
4 William Daniel Denver   813-368-1200
5 Henry Jackson Miami   808-601-7513
13 Emma Isaac Miami   808-690-5201

orders

id cust_id order_date order_details total_order_cost
1 3 2019-03-04 00:00:00 Coat 100
2 3 2019-03-01 00:00:00 Shoes 80
3 3 2019-03-07 00:00:00 Skirt 30
4 7 2019-02-01 00:00:00 Coat 25
5 7 2019-03-10 00:00:00 Shoes 80

Solution

# Import your libraries
import pandas as pd

# Start writing code
merged = pd.merge(customers, orders, left_on = 'id', right_on = 'cust_id', how = 'left')

merged[['first_name', 'last_name', 'city', 'order_details']].sort_values(['first_name', 'order_details'])

Find the second highest salary without using ORDER BY.

worker

worker_id first_name last_name salary joining_date department
1 Monika Arora 100000 2014-02-20 09:00:00 HR
2 Niharika Verma 80000 2014-06-11 09:00:00 Admin
3 Vishal Singhal 300000 2014-02-20 09:00:00 HR
4 Amitah Singh 500000 2014-02-20 09:00:00 Admin
5 Vivek Bhati 500000    
import pandas as pd
import numpy as np

distinct_salary = worker[['salary']].drop_duplicates()
top2_highest = distinct_salary.nlargest(2,'salary')[['salary']]
result = top2_highest.nsmallest(1,'salary')[['salary']]

Find the 5 highest salaries. Order records based on salary in descending order.

worker

worker_id first_name last_name salary joining_date department
1 Monika Arora 100000 2014-02-20 09:00:00 HR
2 Niharika Verma 80000 2014-06-11 09:00:00 Admin
3 Vishal Singhal 300000 2014-02-20 09:00:00 HR
4 Amitah Singh 500000 2014-02-20 09:00:00 Admin
5 Vivek Bhati 500000 2014-06-11  

Solution

import pandas as pd
worker.salary.drop_duplicates().nlargest(5)

or

import pandas as pd
import numpy as np
result = worker.drop_duplicates(['salary'])[['salary']].sort_values(['salary'], ascending=False).head(5)

Find the five highest salaries.

worker

worker_id first_name last_name salary joining_date department
1 Monika Arora 100000 2014-02-20 09:00:00 HR
2 Niharika Verma 80000 2014-06-11 09:00:00 Admin
3 Vishal Singhal 300000 2014-02-20 09:00:00 HR
4 Amitah Singh 500000 2014-02-20 09:00:00 Admin
5 Vivek Bhati 500000    

Solution

# Import your libraries
import pandas as pd

# Start writing code
df=worker
df['salary'].drop_duplicates().nlargest(5)

Find the total cost of each customer’s orders. Output customer’s id, first name, and the total order cost. Order records by customer’s first name alphabetically.

customers

id first_name last_name city address phone_number
8 John Joseph San Francisco   928-386-8164
7 Jill Michael Austin   813-297-0692
4 William Daniel Denver   813-368-1200
5 Henry Jackson Miami   808-601-7513
13 Emma Isaac Miami    

orders

id cust_id order_date order_details total_order_cost
1 3 2019-03-04 00:00:00 Coat 100
2 3 2019-03-01 00:00:00 Shoes 80
3 3 2019-03-07 00:00:00 Skirt 30
4 7 2019-02-01 00:00:00 Coat 25
5 7 2019-03-10 00:00:00 Shoes 80

Solution

import pandas as pd
import numpy as np

merge = pd.merge(customers, orders, left_on="id", right_on="cust_id")
merge = (
    merge.groupby(["cust_id", "first_name"])["total_order_cost"].sum().reset_index()
)
result = merge.sort_values(by="first_name", ascending=True)

Find the top 5 highest paid workers. Output all information about workers whose salaries are among top 5 paid ones.

worker

worker_id first_name last_name salary joining_date department
1 Monika Arora 100000 2014-02-20 09:00:00 HR
2 Niharika Verma 80000 2014-06-11 09:00:00 Admin
3 Vishal Singhal 300000 2014-02-20 09:00:00 HR
4 Amitah Singh 500000 2014-02-20 09:00:00 Admin
5 Vivek Bhati 500000    

Solution

import pandas as pd
import numpy as np
result = worker[worker['salary'].rank(method='min', ascending=False) <= 5] 

Find all workers whose first name contains the letter ‘a’. Output all columns for such workers.

worker

worker_id first_name last_name salary joining_date department
1 Monika Arora 100000 2014-02-20 09:00:00 HR
2 Niharika Verma 80000 2014-06-11 09:00:00 Admin
3 Vishal Singhal 300000 2014-02-20 09:00:00 HR
4 Amitah Singh 500000 2014-02-20 09:00:00 Admin
5 Vivek Bhati 500000    

Solution

import pandas as pd
import numpy as np

result = worker[worker['first_name'].str.contains("a")]

Find all workers who joined on February 2014.

worker

worker_id first_name last_name salary joining_date department
1 Monika Arora 100000 2014-02-20 09:00:00 HR
2 Niharika Verma 80000 2014-06-11 09:00:00 Admin
3 Vishal Singhal 300000 2014-02-20 09:00:00 HR
4 Amitah Singh 500000 2014-02-20 09:00:00 Admin
5 Vivek Bhati 500000    

Solution

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

worker['joining_date'] = pd.to_datetime(worker['joining_date'], format="%Y-%m-%d")
result = worker[(worker['joining_date']>pd.Timestamp(2014,2,1)) & (worker['joining_date']<pd.Timestamp(2014,2,28))]

or

# Import your libraries
import pandas as pd

# Start writing code
worker.query("joining_date.dt.month == 2 & joining_date.dt.year == 2014")

Write a query that will calculate the number of shipments per month. The unique key for one shipment is a combination of shipment_id and sub_id. Output the year_month in format YYYY-MM and the number of shipments in that month.

amazon_shipment

shipment_id sub_id weight shipment_date add
101 1 10 2021-08-30 00:00:00  
101 2 20 2021-09-01 00:00:00  
101 3 10 2021-09-05 00:00:00  
102 1 50 2021-09-02 00:00:00  
103 1 25 2021-09-01 00:00:00  

Solution

# Import your libraries
import pandas as pd

# Start writing code
df = amazon_shipment
df['year_month'] = pd.to_datetime(df['shipment_date']).dt.to_period('M')
df.groupby('year_month').size().to_frame('num_of_shipment').reset_index()

Find all workers who are also managers. Output the first name along with the corresponding title.

worker

worker_id first_name last_name salary joining_date department
1 Monika Arora 100000 2014-02-20 09:00:00 HR
2 Niharika Verma 80000 2014-06-11 09:00:00 Admin
3 Vishal Singhal 300000 2014-02-20 09:00:00 HR
4 Amitah Singh 500000 2014-02-20 09:00:00 Admin
5 Vivek Bhati 500000 2014-06-11 09:00:00 Admin

title

worker_ref_id worker_title affected_from
1 Manager 2016-02-20 00:00:00
2 Executive 2016-06-11 00:00:00
8 Executive 2016-06-11 00:00:00
5 Manager 2016-06-11 00:00:00
4 Asst. Manager 2016-06-11 00:00:00

Solution

# Import your libraries
import pandas as pd

# Start writing code
worker.join(title[title.worker_title=='Manager'][['worker_ref_id', 'worker_title']].set_index('worker_ref_id'),
    how='inner', on='worker_id')[['first_name', 'worker_title']]

Find the rows where (worker title, affected from date) combination occurred more than once Output the worker title, affected_from date, and the number of times the occurrence happened.

title

worker_ref_id worker_title affected_from
1 Manager 2016-02-20 00:00:00
2 Executive 2016-06-11 00:00:00
8 Executive 2016-06-11 00:00:00
5 Manager 2016-06-11 00:00:00
4 Asst. Manager 2016-06-11 00:00:00

Solution

import pandas as pd
import numpy as np
import datetime

title['new_affected_from'] = title['affected_from']
title['new_affected_from'] =  pd.to_datetime(title['affected_from'], format='%Y-%m-%d' ).dt.strftime('%Y-%m-%d')

group = title.groupby(['worker_title','new_affected_from']).size().to_frame('n_affected').reset_index()
result = group[group['n_affected'] > 1]

Find the total salary of each department. Output the salary along with the corresponding department.

worker

worker_id first_name last_name salary joining_date department
1 Monika Arora 100000 2014-02-20 09:00:00 HR
2 Niharika Verma 80000 2014-06-11 09:00:00 Admin
3 Vishal Singhal 300000 2014-02-20 09:00:00 HR
4 Amitah Singh 500000 2014-02-20 09:00:00 Admin
5 Vivek Bhati 500000 2014-06-11 09:00:00 Admin

Solution

# Import your libraries
import pandas as pd

# Start writing code
worker.head()

worker.groupby("department").salary.sum().to_frame("total_salary").reset_index()

Find order details made by Jill and Eva. Consider the Jill and Eva as first names of customers. Output the order date, details and cost along with the first name. Order records based on the customer id in ascending order.

customers

id first_name last_name city address phone_number
8 John Joseph San Francisco   928-386-8164
7 Jill Michael Austin   813-297-0692
4 William Daniel Denver   813-368-1200
5 Henry Jackson Miami   808-601-7513
13 Emma Isaac Miami   808-690-5201

orders

id cust_id order_date order_details total_order_cost
1 3 2019-03-04 00:00:00 Coat 100
2 3 2019-03-01 00:00:00 Shoes 80
3 3 2019-03-07 00:00:00 Skirt 30
4 7 2019-02-01 00:00:00 Coat 25
5 7 2019-03-10 00:00:00 Shoes 80

Solution

# Import your libraries
import pandas as pd

# Start writing code
customers.head()

df = pd.merge(customers, orders, left_on="id", right_on="cust_id")
custs = ["Jill", "Eva"]
df1 = df[df["first_name"].isin(custs)].sort_values("cust_id")[["first_name", "order_date", "order_details", "total_order_cost"]]

Find all workers that work in the Admin department

worker

worker_id first_name last_name salary joining_date department
1 Monika Arora 100000 2014-02-20 09:00:00 HR
2 Niharika Verma 80000 2014-06-11 09:00:00 Admin
3 Vishal Singhal 300000 2014-02-20 09:00:00 HR
4 Amitah Singh 500000 2014-02-20 09:00:00 Admin
5 Vivek Bhati 500000 2014-06-11 0  

Solution

import pandas as pd
import numpy as np

result = worker[worker['department']=='Admin']

Find the three lowest salaries. Order records based on the salary in ascending order.

worker

worker_id first_name last_name salary joining_date department
1 Monika Arora 100000 2014-02-20 09:00:00 HR
2 Niharika Verma 80000 2014-06-11 09:00:00 Admin
3 Vishal Singhal 300000 2014-02-20 09:00:00 HR
4 Amitah Singh 500000 2014-02-20 09:00:00 Admin
5 Vivek Bhati 500000 2014-06-11 09:00:00 Admin

Solution

import pandas as pd
import numpy as np

salaries = worker[['salary']].drop_duplicates()
result = salaries.nsmallest(3,'salary')[['salary']]

For each video game player, find the latest date when they logged in.

players_logins

player_id login_date
101 2021-12-14 00:00:00
101 2021-12-18 00:00:00
101 2021-12-15 00:00:00
101 2021-12-19 00:00:00
102 2021-12-31 00:00:00
result = players_logins.groupby(['player_id'])['login_date'].max().reset_index()

Write a query that’ll identify returning active users. A returning active user is a user that has made a second purchase within 7 days of any other of their purchases. Output a list of user_ids of these returning active users.

amazon_transactions

id user_id item created_at revenue
1 109 milk 2020-03-03 00:00:00 123
2 139 biscuit 2020-03-18 00:00:00 421
3 120 milk 2020-03-18 00:00:00 176
4 108 banana 2020-03-18 00:00:00 862
5 130 milk 2020-03-28 00:00:00 333

Solution

# Import your libraries
import pandas as pd
import numpy as np
import datetime

# Start writing code

df = amazon_transactions.sort_values(['user_id','created_at'])
df['diff'] = df.groupby('user_id')['created_at'].diff()
df[df['diff'] <= pd.Timedelta(days=7)]['user_id'].unique()

or

import pandas as pd
import numpy as np
from datetime import datetime

amazon_transactions["created_at"] = pd.to_datetime(amazon_transactions["created_at"]).dt.strftime('%m-%d-%Y')
df = amazon_transactions.sort_values(by=['user_id', 'created_at'], ascending=[True, True])
df['prev_value'] = df.groupby('user_id')['created_at'].shift()
df['days'] = (pd.to_datetime(df['created_at']) - pd.to_datetime(df['prev_value'])).dt.days
result = df[df['days'] <= 7]['user_id'].unique()

Find the number of customers without an order.

orders

View the output in a separate browser tab

id cust_id order_date order_details total_order_cost
1 3 2019-03-04 00:00:00 Coat 100
2 3 2019-03-01 00:00:00 Shoes 80
3 3 2019-03-07 00:00:00 Skirt 30
4 7 2019-02-01 00:00:00 Coat 25
5 7 2019-03-10 00:00:00 Shoes  

customers

id first_name last_name city address phone_number
8 John Joseph San Francisco   928-386-8164
7 Jill Michael Austin   813-297-0692
4 William Daniel Denver   813-368-1200
5 Henry Jackson Miami   808-601-7513
13 Emma Isaac Miami    

Solution

import pandas as pd
import numpy as np

merged = pd.merge(orders,customers,left_on='cust_id',right_on='id',how='right')
null_cust = merged[merged['cust_id'].isnull()]
result = len(null_cust)

Find the most expensive products on Amazon for each product category. Output category, product name and the price (as a number)

innerwear_amazon_com

product_name mrp price pdp_url brand_name product_category retailer description rating review_count style_attributes total_sizes available_size color
Wacoal Women’s Full Figure Basic Beauty Underwire Bra $50.00 $50.00 https://www.amazon.com/-/dp/B005FR9XVK?th=1&psc=1                    

Solution

# Import your libraries
import pandas as pd

# Start writing code
innerwear_amazon_com.head() 

df=innerwear_amazon_com[['product_name','product_category','price']]
df.price = df.price.str.replace('[\$]','').astype('float')

dff = df.groupby(['product_category']).agg({'price':'max'}).reset_index()

result = pd.merge(df,dff,on=['product_category','price'])

Combine the first and last names of workers with a space inbetween.

worker

worker_id first_name last_name salary joining_date department
1 Monika Arora 100000 2014-02-20 09:00:00 HR
2 Niharika Verma 80000 2014-06-11 09:00:00 Admin
3 Vishal Singhal 300000 2014-02-20 09:00:00 HR
4 Amitah Singh 500000 2014-02-20 09:00:00 Admin
5 Vivek Bhati 500000 2014-06-11 09:00:00 Admin

Solution

import pandas as pd
import numpy as np

result = (worker['first_name'].map(str) + ' ' + worker['last_name'].map(str))

Find the percentage of shipable orders. Consider an order is shipable if the customer’s address is known.

orders

id cust_id order_date order_details total_order_cost
1 3 2019-03-04 00:00:00 Coat 100
2 3 2019-03-01 00:00:00 Shoes 80
3 3 2019-03-07 00:00:00 Skirt 30
4 7 2019-02-01 00:00:00 Coat 25
5 7 2019-03-10 00:00:00 Shoes 80

customers

id first_name last_name city address phone_number
8 John Joseph San Francisco   928-386-8164
7 Jill Michael Austin   813-297-0692
4 William Daniel Denver   813-368-1200
5 Henry Jackson Miami   808-601-7513
13 Emma Isaac Miami   808-690-5201
import pandas as pd
import numpy as np

merged_df = pd.merge(orders,customers,left_on='cust_id',right_on='id')
merged_df['is_shipable'] = (merged_df.address.notnull()).astype(int)
result = 100 * (merged_df['is_shipable'].sum()/len(merged_df))

Find all workers whose first name contains 6 letters and also ends with the letter ‘h’.

worker

worker_id first_name last_name salary joining_date department
1 Monika Arora 100000 2014-02-20 09:00:00 HR
2 Niharika Verma 80000 2014-06-11 09:00:00 Admin
3 Vishal Singhal 300000 2014-02-20 09:00:00 HR
4 Amitah Singh 500000 2014-02-20 09:00:00 Admin
5 Vivek Bhati 50000    

Solution

import pandas as pd
import numpy as np

six_char = worker[worker['first_name'].str.contains('[a-zA-Z]{6}$')]
result = six_char[six_char['first_name'].str.endswith("h")]

Find products which are exclusive to only Amazon and therefore not sold at Top Shop and Macy’s. Your output should include the product name, brand name, price, and rating.

Two products are considered equal if they have the same product name and same maximum retail price (mrp column).

innerwear_macys_com

product_name mrp price pdp_url brand_name product_category retailer description rating review_count style_attributes total_sizes available_size color
Cotton Tanga Mini Knickers $20.00 $20.00 http://www.topshop.com/en/tsus/product/clothing-70483/lingerie-2313852/cotton-frill-bra-and-tanga-knickers-set-662234                    

innerwear_macys_com and innerwear_amazon_com has the same structure.

Solution

# Import your libraries
import pandas as pd

# Start writing code
retailers = [innerwear_macys_com, innerwear_topshop_com, innerwear_amazon_com]

for r in retailers:
    r['code'] = r['mrp'].astype(str) + r['product_name']
    
df_com = pd.concat([innerwear_macys_com,innerwear_topshop_com])
products = list(df_com['code'].unique())

results = innerwear_amazon_com[~(innerwear_amazon_com['code'].isin(products))]


Find the best selling item for each month (no need to separate months by year) where the biggest total invoice was paid. The best selling item is calculated using the formula (unitprice * quantity). Output the description of the item along with the amount paid.

online_retail

invoiceno stockcode description quantity invoicedate unitprice customerid country
544586 21890 S/6 WOODEN SKITTLES IN COTTON BAG 3 2011-02-21 00:00:00 2.95 17338 United Kingdom
541104 84509G SET OF 4 FAIRY CAKE PLACEMATS 3 2011-01-13 00:00:00 3.29   United Kingdom
560772 22499 WOODEN UNION JACK BUNTING 3 2011-07-20 00:00:00 4.96   United Kingdom
555150 22488 NATURAL SLATE RECTANGLE CHALKBOARD 5 2011-05-31 00:00:00 3.29   United Kingdom
570521 21625 VINTAGE UNION JACK APRON 3 2011-10-11 00:00:00 6.95 12371 Switzerland

Solution

# month\description\unitprice * quantity
import pandas as pd

# Start writing code
df = online_retail[['invoicedate', 'description', 'quantity', 'unitprice']]
df['month'] =  df['invoicedate'].dt.month
df['total_invoice'] = df['quantity']*df['unitprice']
df = df.groupby(['month', 'description'])['total_invoice'].sum().reset_index()
df['rank'] = df.groupby(['month']).rank(ascending=False, method='dense')
df[df['rank'] == 1][['month', 'description', 'total_invoice']]

Congratulations! We have practiced some coding questions for high tech companies.

Posted:

Leave a comment