Interview Preparation
Practice Problems & Q&A for Senior Quantitative Analytics Associate
9. Sample SQL Interview Questions
Question 1: Customer Transaction Analysis
Problem: Find customers who made transactions in 3 consecutive months.
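A sketch using the gaps-and-islands pattern, assuming a transactions(customer_id, txn_date) table and a PostgreSQL-style dialect (all names hypothetical):

```sql
WITH monthly AS (
    -- One row per customer per active month
    SELECT DISTINCT
        customer_id,
        EXTRACT(YEAR FROM txn_date) * 12 + EXTRACT(MONTH FROM txn_date) AS month_num
    FROM transactions
),
islands AS (
    SELECT
        customer_id,
        -- Consecutive months share the same (month_num - row_number) value
        month_num - ROW_NUMBER() OVER (
            PARTITION BY customer_id ORDER BY month_num
        ) AS grp
    FROM monthly
)
SELECT DISTINCT customer_id
FROM islands
GROUP BY customer_id, grp
HAVING COUNT(*) >= 3;   -- a streak of 3+ consecutive months
```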
Question 2: Agent Performance Metrics
Problem: Calculate average handle time, first call resolution rate, and customer satisfaction by agent.
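A minimal sketch, assuming a calls table with one row per call (column names hypothetical):

```sql
SELECT
    agent_id,
    AVG(handle_time_sec) AS avg_handle_time_sec,
    AVG(CASE WHEN first_call_resolved = 1 THEN 1.0 ELSE 0 END) AS fcr_rate,
    AVG(csat_score) AS avg_csat
FROM calls
GROUP BY agent_id
ORDER BY avg_csat DESC;
```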
Question 3: Cohort Retention Analysis
Problem: Calculate monthly retention rates for customer cohorts.
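One possible shape, reusing the hypothetical transactions table: cohort = month of first transaction, retention = share of the cohort active in each later month.

```sql
WITH cohorts AS (
    SELECT customer_id,
           DATE_TRUNC('month', MIN(txn_date)) AS cohort_month
    FROM transactions
    GROUP BY customer_id
),
activity AS (
    SELECT DISTINCT customer_id,
           DATE_TRUNC('month', txn_date) AS active_month
    FROM transactions
),
sizes AS (
    SELECT cohort_month, COUNT(*) AS cohort_size
    FROM cohorts
    GROUP BY cohort_month
)
SELECT
    c.cohort_month,
    a.active_month,
    COUNT(DISTINCT a.customer_id) * 1.0 / s.cohort_size AS retention_rate
FROM cohorts c
JOIN activity a ON a.customer_id = c.customer_id
JOIN sizes s    ON s.cohort_month = c.cohort_month
GROUP BY c.cohort_month, a.active_month, s.cohort_size
ORDER BY c.cohort_month, a.active_month;
```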
10. Advanced SQL Interview Questions
Question 4: Running Totals and Moving Averages
Problem: Calculate running total of sales and 7-day moving average by product.
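A sketch assuming a daily_sales(product_id, sale_date, amount) table with one row per product per day:

```sql
SELECT
    product_id,
    sale_date,
    SUM(amount) OVER (
        PARTITION BY product_id ORDER BY sale_date
    ) AS running_total,
    AVG(amount) OVER (
        PARTITION BY product_id ORDER BY sale_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW   -- current day + 6 prior rows
    ) AS moving_avg_7d
FROM daily_sales;
```

Note: a ROWS frame assumes one row per calendar day; with gaps, a RANGE frame over a date interval (where the dialect supports it) is more accurate.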
Question 5: Self-Join for Finding Gaps
Problem: Find gaps in customer account numbers (missing IDs).
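A self-join sketch over a hypothetical accounts(account_id) table: every id whose successor is missing marks the start of a gap.

```sql
SELECT a.account_id + 1 AS gap_starts_at
FROM accounts a
LEFT JOIN accounts b
       ON b.account_id = a.account_id + 1
WHERE b.account_id IS NULL
  AND a.account_id < (SELECT MAX(account_id) FROM accounts);
-- Window alternative: LEAD(account_id) OVER (ORDER BY account_id),
-- then keep rows where next_id - account_id > 1.
```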
Question 6: Hierarchical Queries (Manager-Employee)
Problem: Find all employees in a manager's reporting chain with their level.
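A recursive-CTE sketch against an employees(employee_id, manager_id) table; the manager id 42 is a placeholder parameter:

```sql
WITH RECURSIVE chain AS (
    -- Anchor: the manager's direct reports (level 1)
    SELECT employee_id, manager_id, 1 AS level
    FROM employees
    WHERE manager_id = 42
    UNION ALL
    -- Step: reports of everyone already in the chain
    SELECT e.employee_id, e.manager_id, c.level + 1
    FROM employees e
    JOIN chain c ON e.manager_id = c.employee_id
)
SELECT employee_id, level
FROM chain
ORDER BY level, employee_id;
```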
Question 7: Complex Aggregation - Pivoting Data
Problem: Pivot transaction data to show monthly sales by product category.
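Conditional aggregation is the portable way to pivot (three months shown; some dialects also offer a PIVOT operator):

```sql
SELECT
    category,
    SUM(CASE WHEN EXTRACT(MONTH FROM txn_date) = 1 THEN amount ELSE 0 END) AS jan_sales,
    SUM(CASE WHEN EXTRACT(MONTH FROM txn_date) = 2 THEN amount ELSE 0 END) AS feb_sales,
    SUM(CASE WHEN EXTRACT(MONTH FROM txn_date) = 3 THEN amount ELSE 0 END) AS mar_sales
    -- ...repeat for the remaining months
FROM transactions
GROUP BY category;
```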
Question 8: Performance Optimization
Problem: Optimize a slow query that joins multiple large tables.
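An illustrative pair (tables, columns, and indexes hypothetical): the slow version selects every column, applies filters late, and wraps the date column in a function; the optimized version fetches only needed columns, pushes filters into the join, and uses a sargable date range.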
Before (Slow):
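```sql
SELECT *
FROM transactions t
JOIN customers c ON c.customer_id = t.customer_id
JOIN accounts  a ON a.customer_id = c.customer_id
WHERE EXTRACT(YEAR FROM t.txn_date) = 2024   -- function on the column blocks index use
  AND c.segment = 'retail';                  -- filter applied after rows have multiplied
```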
After (Optimized):
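```sql
SELECT t.txn_id, t.txn_date, t.amount, a.account_id   -- only the columns needed
FROM transactions t
JOIN customers c ON c.customer_id = t.customer_id
              AND c.segment = 'retail'                -- filter before rows multiply
JOIN accounts  a ON a.customer_id = c.customer_id
WHERE t.txn_date >= DATE '2024-01-01'                 -- sargable range predicate
  AND t.txn_date <  DATE '2025-01-01';
-- Supporting indexes: transactions(txn_date, customer_id), customers(segment)
```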
Question 9: Detecting Fraud Patterns
Problem: Identify suspicious transaction patterns (multiple transactions in short time from different locations).
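A window-function sketch assuming transactions(customer_id, txn_id, txn_time, location); the 30-minute threshold is illustrative:

```sql
SELECT customer_id, txn_id, txn_time, location
FROM (
    SELECT
        customer_id, txn_id, txn_time, location,
        LAG(txn_time) OVER (PARTITION BY customer_id ORDER BY txn_time) AS prev_time,
        LAG(location) OVER (PARTITION BY customer_id ORDER BY txn_time) AS prev_location
    FROM transactions
) t
WHERE prev_time IS NOT NULL
  AND txn_time - prev_time <= INTERVAL '30 minutes'
  AND location <> prev_location;
```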
11. Sample Python Interview Questions
Question 1: Data Cleaning Pipeline
Problem: Build a function to clean and prepare customer data for analysis.
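A minimal sketch of such a pipeline (column names hypothetical): fix types, impute, normalize text, and cap outliers.

```python
import pandas as pd

def clean_customer_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df = df.drop_duplicates(subset="customer_id")
    # Coerce types; bad values become NaN/NaT instead of raising
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    df["income"] = pd.to_numeric(df["income"], errors="coerce")
    # Impute and normalize
    df["income"] = df["income"].fillna(df["income"].median())
    df["segment"] = df["segment"].str.strip().str.lower()
    # Winsorize extreme outliers at the 1st/99th percentiles
    lo, hi = df["income"].quantile([0.01, 0.99])
    df["income"] = df["income"].clip(lo, hi)
    return df
```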
Question 2: A/B Test Analysis
Problem: Analyze results of an A/B test and determine statistical significance.
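A sketch using a two-proportion z-test (counts illustrative):

```python
from statsmodels.stats.proportion import proportions_ztest

def ab_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-proportion z-test for conversion rates."""
    z, p_value = proportions_ztest([conv_a, conv_b], [n_a, n_b])
    lift = conv_b / n_b - conv_a / n_a
    return {"z": z, "p_value": p_value, "abs_lift": lift,
            "significant": p_value < alpha}

# Control: 1,000/10,000 converted; treatment: 1,080/10,000
print(ab_test(1000, 10000, 1080, 10000))
```

In an interview, also mention up-front power analysis and practical significance, not just the p-value.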
Question 3: Feature Engineering for ML
Problem: Create features for a churn prediction model.
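A sketch of RFM-style churn features from a transactions frame (schema hypothetical):

```python
import pandas as pd

def churn_features(txns: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    txns = txns[txns["txn_date"] <= as_of]          # avoid future leakage
    g = txns.groupby("customer_id")
    recent = txns[txns["txn_date"] >= as_of - pd.Timedelta(days=90)]
    feats = pd.DataFrame({
        "recency_days":   (as_of - g["txn_date"].max()).dt.days,
        "frequency_90d":  recent.groupby("customer_id").size(),
        "monetary_total": g["amount"].sum(),
        "avg_txn_amount": g["amount"].mean(),
        "tenure_days":    (as_of - g["txn_date"].min()).dt.days,
    })
    return feats.fillna({"frequency_90d": 0})

txns = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "txn_date": pd.to_datetime(["2024-01-05", "2024-03-20", "2024-02-10"]),
    "amount": [120.0, 80.0, 300.0],
})
print(churn_features(txns, pd.Timestamp("2024-04-01")))
```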
12. Advanced Python Interview Questions
Question 4: Time Series Decomposition
Problem: Decompose time series into trend, seasonality, and residuals.
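A sketch using statsmodels' classical decomposition on synthetic daily data with a weekly cycle:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic series: trend + weekly seasonality + noise (illustrative data)
n = 180
idx = pd.date_range("2024-01-01", periods=n, freq="D")
rng = np.random.default_rng(0)
y = pd.Series(
    1000 + 2 * np.arange(n)                        # trend
    + 100 * np.sin(2 * np.pi * np.arange(n) / 7)   # weekly seasonality
    + rng.normal(0, 20, n),                        # noise
    index=idx,
)

result = seasonal_decompose(y, model="additive", period=7)
trend, seasonal, resid = result.trend, result.seasonal, result.resid
```

Use model="multiplicative" when seasonal swings grow with the level; STL (statsmodels.tsa.seasonal.STL) is a more robust alternative.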
Question 5: Customer Segmentation with K-Means
Problem: Segment customers using RFM analysis and K-means clustering.
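A sketch on synthetic RFM data; scaling matters because K-means is distance-based:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
rfm = pd.DataFrame({
    "recency":   rng.integers(1, 365, 500),   # days since last purchase
    "frequency": rng.poisson(10, 500),        # purchases in period
    "monetary":  rng.gamma(2.0, 200.0, 500),  # total spend
})

X = StandardScaler().fit_transform(rfm)
km = KMeans(n_clusters=4, n_init=10, random_state=42)
rfm["segment"] = km.fit_predict(X)

print(rfm.groupby("segment").mean())  # profile segments for business labels
```

Choose k via the elbow method (inertia) or silhouette score; in practice RFM values are often log-transformed first because they are heavily skewed.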
Question 6: ETL Pipeline with Error Handling
Problem: Build a robust ETL pipeline with logging and error handling.
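A minimal skeleton (the extract/transform/load callables are stand-ins):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl")

def run_etl(extract, transform, load, retries=3):
    """Run extract -> transform -> load with logging and retry on failure."""
    for attempt in range(1, retries + 1):
        try:
            raw = extract()
            logger.info("Extracted %d records", len(raw))
            clean = transform(raw)
            logger.info("Transformed %d records", len(clean))
            load(clean)
            logger.info("Load complete")
            return True
        except Exception:
            logger.exception("ETL attempt %d failed", attempt)
    return False

ok = run_etl(
    extract=lambda: [{"id": 1}],
    transform=lambda rows: rows,
    load=lambda rows: None,
)
```

A production pipeline would add data-quality validation between steps, idempotent loads, and alerting on final failure.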
13. Probability & Statistics Problems
Problem 1: Expected Value
Question: A credit card company offers cashback: 3% on gas, 2% on groceries, 1% on everything else. Average customer spends $200/month on gas, $400 on groceries, $600 on other. What's the expected monthly cashback?
Solution: E[cashback] = 0.03 × $200 + 0.02 × $400 + 0.01 × $600 = $6 + $8 + $6 = $20 per month.
Problem 2: Bayes' Theorem (Fraud Detection)
Question: 0.1% of transactions are fraudulent. Your fraud detection model has 95% sensitivity (true positive rate) and 98% specificity (true negative rate). If a transaction is flagged as fraud, what's the probability it's actually fraudulent?
Solution: Let F = fraud and + = flagged; the false positive rate is 1 - specificity = 2%. By Bayes' theorem:
P(F | +) = P(+ | F)P(F) / [P(+ | F)P(F) + P(+ | not F)P(not F)]
= (0.95 × 0.001) / (0.95 × 0.001 + 0.02 × 0.999)
= 0.00095 / 0.02093 ≈ 4.5%
Even with a strong model, most flagged transactions are legitimate because fraud is so rare - the classic base-rate effect.
Problem 3: Central Limit Theorem
Question: Average customer wait time is 5 minutes with standard deviation of 2 minutes. What's the probability that the average wait time for 100 customers exceeds 5.3 minutes?
Solution: By the CLT, the mean of 100 wait times is approximately normal with mean 5 and standard error 2/√100 = 0.2. Then z = (5.3 - 5)/0.2 = 1.5, and P(Z > 1.5) ≈ 0.0668, so there is about a 6.7% chance the average wait exceeds 5.3 minutes.
Problem 4: Statistical Power
Question: You want to detect a 5% improvement in conversion rate (from 10% to 10.5%) with 80% power and 95% confidence. How many samples do you need per group?
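Solution: Using the two-proportion z-test approximation, n per group ≈ 2·p̄(1 - p̄)·(z_α/2 + z_β)² / δ², with p̄ = 0.1025, δ = 0.005, z_α/2 = 1.96, z_β = 0.84 → roughly 57,000-58,000 per group. Detecting a 0.5-percentage-point absolute lift requires a very large sample.
A quick cross-check with statsmodels (a sketch):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

es = proportion_effectsize(0.105, 0.10)  # Cohen's h for the two rates
n = NormalIndPower().solve_power(effect_size=es, alpha=0.05, power=0.80,
                                 alternative="two-sided")
print(round(n))  # roughly 57,000-58,000 per group
```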
Problem 5: Confidence Intervals
Question: Sample of 500 customers shows average transaction value of $75 with standard deviation of $20. Calculate 95% confidence interval.
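Solution: CI = x̄ ± z_0.025 × s/√n = 75 ± 1.96 × 20/√500 = 75 ± 1.96 × 0.894 = 75 ± 1.75 → ($73.25, $76.75). With n = 500 the t and z critical values are essentially identical. Interpretation: 95% of intervals constructed this way would capture the true mean transaction value.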
14. Machine Learning Interview Questions
Q1: Explain bias-variance tradeoff
Answer: The bias-variance tradeoff is the balance between two sources of model error:
- Bias: Error from overly simplistic assumptions. High bias → underfitting (model too simple, misses patterns). Example: linear regression for non-linear data.
- Variance: Error from sensitivity to training data fluctuations. High variance → overfitting (model too complex, captures noise). Example: deep decision tree.
- Tradeoff: Total Error = Bias² + Variance + Irreducible Error
- Simple models: High bias, low variance
- Complex models: Low bias, high variance
- Goal: Find sweet spot that minimizes total error
- Mitigation: Cross-validation, regularization (L1/L2), ensemble methods, feature engineering
Q2: When would you use Random Forest vs Gradient Boosting?
Answer:
- Random Forest (Bagging):
- Trains trees in parallel, each on random sample of data/features
- Reduces variance through averaging
- More robust to outliers and noise
- Faster to train (parallelizable)
- Less prone to overfitting
- Use when: Need robustness, have noisy data, want faster training, need feature importance
- Gradient Boosting (Boosting):
- Trains trees sequentially, each correcting previous errors
- Reduces bias through iterative improvement
- Often higher accuracy
- Slower to train (sequential)
- More prone to overfitting, needs careful tuning
- Use when: Need maximum accuracy, have clean data, willing to tune hyperparameters
- Banking Example:
- Fraud detection → Random Forest (need robustness, speed for real-time)
- Credit scoring → Gradient Boosting (accuracy critical, offline training acceptable)
Q3: How do you handle imbalanced classes?
Answer: Imbalanced classes (e.g., 99% non-fraud, 1% fraud) require special handling:
- 1. Resampling:
- Oversampling minority class (SMOTE creates synthetic samples)
- Undersampling majority class (risks losing information)
- Hybrid approaches
- 2. Algorithm-level:
- Class weights (penalize errors on minority class more)
- Threshold adjustment (lower threshold for minority class)
- Anomaly detection algorithms (Isolation Forest, One-Class SVM)
- 3. Evaluation metrics:
- Don't use accuracy (misleading with imbalance)
- Use: Precision, Recall, F1-score, ROC-AUC, PR-AUC
- Focus on minority class performance
- 4. Ensemble methods:
- EasyEnsemble, BalancedRandomForest
- Train multiple models on different balanced subsets
- Banking Example: Fraud detection with 0.1% fraud rate
- Use class_weight='balanced' in sklearn
- Optimize for high recall (catch frauds) with acceptable precision
- Use PR-AUC as primary metric
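A minimal sklearn sketch of the class-weight approach above, on synthetic ~1%-positive data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score  # PR-AUC
from sklearn.model_selection import train_test_split

# Synthetic fraud-like dataset: ~1% positives
X, y = make_classification(n_samples=20000, weights=[0.99], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=42)
clf.fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]
print("PR-AUC:", average_precision_score(y_te, scores))
```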
Q4: Feature selection techniques
Answer:
- Filter Methods (Statistical tests, before modeling):
- Correlation coefficient (remove highly correlated features)
- Chi-square test (categorical features)
- Mutual information (captures non-linear relationships)
- Variance threshold (remove low-variance features)
- Wrapper Methods (Use model performance):
- Recursive Feature Elimination (RFE): Iteratively remove least important features
- Forward selection: Start empty, add best features one by one
- Backward elimination: Start with all, remove worst one by one
- Embedded Methods (Built into algorithm):
- L1 regularization (Lasso): Shrinks coefficients to zero
- Tree-based importance: Random Forest, XGBoost feature importance
- ElasticNet: Combines L1 and L2
- When to use:
- High dimensionality (curse of dimensionality)
- Reduce overfitting
- Improve interpretability
- Faster training and inference
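A minimal wrapper-method sketch with RFE (synthetic data, names arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=30, n_informative=8,
                           random_state=0)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the features RFE kept
print(rfe.ranking_)   # 1 = selected; higher = eliminated earlier
```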
Q5: Explain regularization
Answer: Regularization adds a penalty on coefficient size to the loss function to prevent overfitting:
- L2 (Ridge): adds λ·Σwⱼ², shrinking coefficients smoothly toward zero; handles correlated features well
- L1 (Lasso): adds λ·Σ|wⱼ|, driving some coefficients exactly to zero (built-in feature selection)
- ElasticNet: combines the L1 and L2 penalties
- λ controls strength: too high underfits, too low overfits; tune via cross-validation, and scale features first so the penalty treats them comparably
Practical example: Predicting customer churn with 50 features, many correlated (age/tenure, income/balance). Use ElasticNet to handle correlations while selecting most important features.
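A hedged sketch of that approach with scikit-learn's elastic-net logistic regression (synthetic stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=50, n_informative=10,
                           random_state=0)
# Scale first so the penalty treats all features comparably
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=1.0, max_iter=5000),
)
model.fit(X, y)
```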
15. Time Series Analysis
Core Concepts
- Components:
- Trend: Long-term increase or decrease
- Seasonality: Regular patterns (daily, weekly, yearly)
- Cyclical: Irregular fluctuations (business cycles)
- Residual/Noise: Random variation
- Stationarity: Statistical properties constant over time
- Tests: Augmented Dickey-Fuller (ADF), KPSS
- Make stationary: Differencing, detrending, log transform
- Autocorrelation: Correlation with lagged values
- ACF (Autocorrelation Function): Direct correlation
- PACF (Partial ACF): Correlation after removing effects of intermediate lags
Time Series Models
- ARIMA (AutoRegressive Integrated Moving Average):
- AR(p): Regression on past p values
- I(d): Differencing d times for stationarity
- MA(q): Regression on past q forecast errors
- Model selection: Use ACF/PACF plots, AIC/BIC criteria
- SARIMA: ARIMA with seasonal components
- Prophet (Facebook): Handles seasonality, holidays, trend changes
- LSTM/GRU: Deep learning for complex patterns
- Exponential Smoothing: Simple, interpretable forecasting
Banking Use Case: Call Volume Forecasting
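A minimal illustrative sketch (synthetic daily volumes with a weekly cycle; in practice the SARIMA orders would come from ACF/PACF plots and AIC):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic call volumes: level + weekly seasonality + noise
n = 365
idx = pd.date_range("2024-01-01", periods=n, freq="D")
rng = np.random.default_rng(1)
calls = pd.Series(5000 + 500 * np.sin(2 * np.pi * np.arange(n) / 7)
                  + rng.normal(0, 100, n), index=idx)

model = SARIMAX(calls, order=(1, 0, 1), seasonal_order=(1, 0, 1, 7)).fit(disp=False)
forecast = model.forecast(steps=14)   # two-week horizon for staffing decisions
```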
Evaluation Metrics
- MAE (Mean Absolute Error): Average absolute difference, same units as data
- RMSE (Root Mean Squared Error): Penalizes large errors more
- MAPE (Mean Absolute Percentage Error): Scale-independent, interpretable as %
- SMAPE (Symmetric MAPE): More robust than MAPE when actuals are at or near zero
- AIC/BIC: Model selection (lower is better)
16. Experimentation & Causal Inference
A/B Testing Best Practices
- Design Phase:
- Define hypothesis (null and alternative)
- Choose primary metric (guardrail metrics too)
- Calculate required sample size (power analysis)
- Determine test duration (avoid novelty effect)
- Randomization strategy (user-level vs session-level)
- Running Phase:
- Monitor for sample ratio mismatch (SRM)
- Check for data quality issues early
- Don't peek! (or use sequential testing methods)
- Watch for external factors (holidays, outages)
- Analysis Phase:
- Check assumptions (normality, equal variance)
- Statistical significance AND practical significance
- Segmentation analysis (did all segments benefit?)
- Long-term impact estimation
Common Pitfalls
- Peeking: Looking at results before test completes inflates Type I error
- Solution: Use sequential testing (e.g., always-valid inference)
- Multiple comparisons: Testing many metrics increases false positives
- Solution: Bonferroni correction, FDR control, pre-specify primary metric
- Network effects: Treatment spills over to control group
- Example: Referral program affects control users
- Solution: Cluster randomization, ego-network analysis
- Novelty effect: Users try new feature just because it's new
- Solution: Run test longer (2-4 weeks minimum)
- Survivorship bias: Only analyzing users who completed journey
- Solution: Intent-to-treat analysis
Advanced Techniques
- CUPED (Controlled-Experiment Using Pre-Experiment Data):
- Reduces variance using pre-experiment covariates
- Can reduce required sample size by 30-50%
- Adjusted metric = Y - θ(X_pre - E[X_pre]), where θ = Cov(X_pre, Y) / Var(X_pre) (see the sketch after this list)
- Multi-Armed Bandits:
- Adaptive allocation: shift traffic to better variants
- Reduce opportunity cost of experimentation
- Trade-off: Exploration vs Exploitation
- Difference-in-Differences:
- Quasi-experimental design when randomization impossible
- Compare changes over time between treatment and control
- Assumes parallel trends
- Synthetic Control:
- Create synthetic control group from weighted combination
- Useful for single-unit treatments (e.g., policy change in one region)
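A minimal sketch of the CUPED adjustment above, on synthetic data (names arbitrary):

```python
import numpy as np

def cuped_adjust(y, x_pre):
    """CUPED: subtract the part of y explained by a pre-experiment covariate."""
    theta = np.cov(x_pre, y, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

rng = np.random.default_rng(0)
x_pre = rng.normal(100, 20, 10_000)            # pre-experiment spend
y = 0.8 * x_pre + rng.normal(0, 10, 10_000)    # in-experiment metric, correlated
y_adj = cuped_adjust(y, x_pre)

print(np.var(y_adj) / np.var(y))  # well below 1: variance reduced
```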
Case Study: Testing New Chatbot for Customer Service
17. Algorithms & Data Structures Basics
Essential Data Structures
- Arrays/Lists: O(1) access, O(n) search, good for indexed access
- Hash Tables/Dictionaries: O(1) average insert/lookup, perfect for counting, grouping
- Sets: O(1) membership testing, removing duplicates
- Queues: FIFO, BFS, task scheduling
- Stacks: LIFO, DFS, undo operations, expression evaluation
- Heaps: O(log n) insert/delete min/max, priority queues, top-K problems
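For example, the heap-based top-K pattern mentioned above (a minimal sketch):

```python
import heapq

# k largest transaction amounts in O(n log k), without a full sort
amounts = [250, 40, 980, 120, 560, 75, 310]
print(heapq.nlargest(3, amounts))  # [980, 560, 310]
```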
Common Algorithm Patterns
Analytics-Specific Problems
Time Complexity Cheat Sheet
- O(1): Hash table lookup, array access by index
- O(log n): Binary search, balanced tree operations
- O(n): Linear scan, single loop
- O(n log n): Efficient sorting (merge sort, heap sort)
- O(n²): Nested loops (avoid when possible!)
- O(2ⁿ): Branching recursion without memoization (e.g., naive Fibonacci - avoid!)
SQL Query Optimization Mindset
- Indexes: B-tree lookup is O(log n) vs an O(n) full scan - seek directly instead of reading every row
- WHERE before JOIN: Filter early to reduce join size
- Avoid SELECT *: Only fetch needed columns
- Use EXISTS vs IN: EXISTS stops at the first match (see the example after this list)
- Partition pruning: WHERE on partitioned column
- Materialized views: Pre-compute expensive aggregations
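A small illustration of the EXISTS tip (tables hypothetical):

```sql
-- Customers with at least one 2024 transaction; EXISTS can short-circuit
SELECT c.customer_id
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM transactions t
    WHERE t.customer_id = c.customer_id
      AND t.txn_date >= DATE '2024-01-01'
);
```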
18. Business Case Study Examples
Case Study 1: Reducing Call Center Costs
Scenario: The call center handles 1M calls/month with an average cost of $5 per call. Leadership wants to reduce costs by 20% while maintaining customer satisfaction.
Approach:
- Analyze call drivers: Use text mining on call transcripts to identify top reasons for calls
- Identify automation opportunities: Simple inquiries (balance checks, payment status) → chatbot/IVR
- Calculate impact: 30% of calls are routine inquiries, automation saves $3.50 per call
- Estimate savings: 300K calls × $3.50 = $1.05M/month (21% reduction)
- Monitor metrics: CSAT, containment rate, escalation rate, agent handle time
- Iterate: Continuous improvement based on feedback and failure analysis
Case Study 2: Improving First Call Resolution
Scenario: Current FCR is 70%. Industry benchmark is 80%. Each repeat call costs $5.
Analysis Framework:
- Root cause analysis: Why do customers call back?
- Agent knowledge gaps → training opportunities
- System limitations → tool improvements
- Process issues → workflow optimization
- Segmentation: Which customer segments / issue types have lowest FCR?
- Agent analysis: Top performers vs bottom performers - what's different?
- Data-driven recommendations:
- Implement knowledge management system
- Create decision trees for common scenarios
- Real-time agent assist using ML
- Impact sizing: a 10-point FCR improvement (70% → 80%) ≈ 100K fewer repeat calls/month ≈ $500K savings
Case Study 3: ML Model for Fraud Detection
Scenario: Build a case for implementing ML-based fraud detection.
Business Case Structure:
- Current state:
- Rule-based system flags 5% of transactions
- Manual review costs $2 per transaction
- False positives: 90% of flagged transactions turn out to be legitimate (high customer friction)
- Fraud losses: $10M annually
- Proposed solution: ML model using ensemble methods (Random Forest + XGBoost)
- Expected improvements:
- Reduce false positives by 40% → better customer experience
- Increase fraud detection rate by 15% → $1.5M additional fraud caught
- Reduce manual review volume by 35% → $1.4M cost savings
- Implementation costs: $500K (data prep, model dev, infrastructure, testing)
- ROI: ($1.5M + $1.4M - $500K) / $500K = 480% first year
- Success metrics: Precision, Recall, F1, AUC-ROC, cost per fraud detected
19. Behavioral Interview Preparation
Common Behavioral Questions (STAR Method)
- "Tell me about a time you had to influence stakeholders without direct authority."
- Situation: Cross-functional project requiring buy-in from product, engineering, design
- Task: Convince teams to prioritize data infrastructure improvements
- Action: Built business case showing impact, created prototype, addressed concerns individually
- Result: Secured resources, delivered 30% reduction in data processing time
- "Describe a complex analytical problem you solved independently."
- Situation: Unexplained drop in conversion rates
- Task: Identify root cause with limited guidance
- Action: Segmented users, analyzed funnels, discovered browser-specific bug
- Result: Fixed issue, recovered $500K in monthly revenue
- "Give an example of when you had to make a data-driven recommendation to senior leadership."
- "Tell me about a time you failed and what you learned."
- "Describe how you handled conflicting priorities in an Agile environment."
- "Share an example of improving operational efficiency through analytics."
Questions to Ask Interviewers
- "What are the biggest challenges the MLIO team is currently facing?"
- "How does the team measure success for ML/AI implementations?"
- "What does the typical analytics project lifecycle look like here?"
- "How does the team balance exploratory analysis with productionizing solutions?"
- "What tools and technologies is the team investing in for the future?"
- "How does this role interact with product teams during roadmap planning?"
- "What opportunities exist for professional development and learning?"
- "Can you describe a recent success story from the team?"
Key Themes to Emphasize
- Business Impact Focus: Always tie technical work to business outcomes (cost savings, customer experience, revenue)
- Communication: Ability to translate technical insights for non-technical audiences
- Collaboration: Working effectively with engineering, product, design, and business teams
- Agile Mindset: Iterative approach, failing fast, continuous improvement
- Independence: Self-starter who can navigate ambiguity and find solutions
- Customer-Centric: Understanding how analytics improves customer and agent experience
- Learning Agility: Quickly picking up new tools, domains, and methodologies
20. Quick Reference Cheat Sheet
SQL Window Functions Syntax
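A compact reminder of the general shape (standard SQL):

```sql
SELECT
    function_name(expr) OVER (
        PARTITION BY group_col          -- restart the calculation per group
        ORDER BY order_col              -- row order within the partition
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW   -- optional frame
    ) AS windowed_value
FROM some_table;
-- Common functions: ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(n),
-- LAG()/LEAD(), SUM()/AVG()/COUNT() OVER, FIRST_VALUE()/LAST_VALUE()
```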
Python Pandas Quick Reference
Statistics Formulas Quick Reference
ML Model Selection Guide
Interview Question Frameworks
21. Essential Formulas & Concepts
Business Metrics Calculations
Statistical Tests Decision Tree
Time Complexity Examples
22. Real JPMorgan Interview Questions
Technical Round Questions (Glassdoor/Blind Compilation)
- SQL:
- "Write a query to find the second highest salary in each department"
- "Calculate running total of transactions by customer, ordered by date"
- "Find customers who made purchases in all 12 months of the year"
- "Identify gaps in sequential transaction IDs"
- "Calculate month-over-month growth rate for each product"
- Python/Pandas:
- "Given a DataFrame of transactions, find customers with declining spend over last 3 months"
- "Implement a function to detect outliers using IQR method"
- "Merge two datasets on approximate string matching (fuzzy matching)"
- "Calculate cohort retention rates from transaction data"
- Statistics:
- "How would you detect if a coin is biased?"
- "Explain p-value to a non-technical stakeholder"
- "You run an A/B test and get p=0.051. What do you do?"
- "How do you determine sample size for an experiment?"
- Machine Learning:
- "How would you build a model to predict credit card fraud?"
- "Explain the difference between precision and recall, and when each matters"
- "How do you handle imbalanced datasets?"
- "Walk me through your process for feature engineering"
- "How do you prevent overfitting?"
Case Study Examples
- "Credit card approval rates dropped 15% last month. Investigate."
- Framework: Segment by applicant type, product, channel, geography, time
- Check: Model changes, policy changes, data quality, external factors (economy, marketing)
- Action: Root cause analysis, quantify impact, recommend fixes
- "Design a dashboard for contact center operations leaders."
- Stakeholders: Ops managers, team leads, executives
- Metrics: Volume, AHT, FCR, CSAT, agent utilization, SLA
- Features: Real-time view, trends, alerts, drill-downs by channel/reason/agent
- "Should we invest in a chatbot for customer service?"
- Define success: Cost per interaction, CSAT, containment rate
- Estimate: Addressable call volume, cost savings, development cost
- ROI calculation: Savings vs investment, payback period
- Risks: CSAT decline, implementation challenges, adoption
- "Analyze mobile app ratings decline from 4.5 to 4.0."
- Data: App store reviews, version history, support tickets, usage analytics
- Analysis: Text mining of reviews, correlation with app versions, platform differences
- Findings: Recent update (v3.2) introduced bugs for iOS 14 users
- Recommendation: Hotfix release, proactive communication, improved QA
Behavioral Questions (JPM-Specific)
- "Tell me about a time you had to work with incomplete data"
- "Describe a situation where your analysis changed a business decision"
- "How do you prioritize when you have multiple urgent requests?"
- "Tell me about a time you had to explain a technical concept to a non-technical audience"
- "Describe a project where you had to collaborate with engineering and product teams"
- "What do you know about JPMorgan Chase's consumer banking business?"
- "Why do you want to work in the ML & Intelligent Operations team specifically?"
- "How would you approach learning about a new business domain?"
- "Tell me about a time you identified an opportunity that others missed"
- "Describe your experience with Agile/Scrum methodology"