Technical Interview (1st Round)
Questions:
- Introduction and Background: The interviewer started by asking me about my current role and projects. I gave a brief overview of the data engineering projects I’ve been working on, especially focusing on big data technologies like Apache Spark.
- Project Discussion: I was asked in detail about the data pipelines I had built, in particular how I managed large-scale data processing with Spark. This included questions about my experience with ETL processes and the challenges I faced while optimizing data pipelines.
- Apache Spark and Salting Question: The tricky part of the interview came when they asked about salting in Spark. The interviewer wanted to know how I would apply salting to handle skewed data in a specific scenario:
- Scenario: You have a dataset of products, users, and their interaction timestamps. Some products are very popular and receive a disproportionate amount of interactions compared to others. How would you apply salting to distribute the load evenly across Spark partitions to prevent data skew?
- My Response:
I explained the concept of salting as adding a “salt” (or random value) to the key to distribute the data more evenly across partitions. Here’s what I said:
- I would create a new column that appends a random salt value to the product key. The idea is to divide the popular products into multiple groups, which would spread the load across different partitions.
- For example, for each product_id, I would concatenate a random number (like salt = 0, 1, 2) to the product ID to generate new salted keys (e.g., product_id_0, product_id_1, product_id_2). This would help balance the workload when doing joins or aggregations, thus mitigating the skew.
- After the processing is complete, I would remove the salt to get back to the original product_id for further analysis (a rough sketch of this flow is shown below).
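To make the approach concrete, here is a minimal PySpark sketch of the salted-join flow I described. The table and column names (interactions, products, product_id) and the salt count are assumptions for illustration only; none of them came from the interview itself.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-sketch").getOrCreate()

NUM_SALTS = 3  # e.g. salts 0, 1, 2, as in the example above

# Assumed inputs: a large, skewed interactions table (user_id,
# product_id, timestamp) and a smaller products dimension table.
interactions = spark.table("interactions")  # hypothetical table name
products = spark.table("products")          # hypothetical table name

# 1. Salt the skewed side: append a random salt to product_id so a hot
#    key like product_42 becomes product_42_0 / product_42_1 / product_42_2.
salted_interactions = interactions.withColumn(
    "salted_key",
    F.concat_ws(
        "_",
        F.col("product_id").cast("string"),
        F.floor(F.rand() * NUM_SALTS).cast("string"),
    ),
)

# 2. Replicate the small side once per salt value so every salted key
#    has a matching row to join against.
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
salted_products = products.crossJoin(salts).withColumn(
    "salted_key",
    F.concat_ws("_", F.col("product_id").cast("string"),
                F.col("salt").cast("string")),
)

# 3. Join on the salted key: each hot product_id is now spread across
#    up to NUM_SALTS partitions instead of landing in a single one.
joined = salted_interactions.join(
    salted_products.drop("product_id", "salt"),
    on="salted_key",
)

# 4. Drop the salt to recover the original product_id for analysis.
result = joined.drop("salted_key")
```

The trade-off in step 2 is that the small side is replicated NUM_SALTS times, which is one reason the salt count should stay small relative to the actual skew.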
Follow-up Questions:
- How would you determine the number of salts needed?
- What would be the impact of salting on query performance, and how would you optimize it?
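For the first follow-up, one heuristic (my assumption here, not something that came up in the interview) is to size the salt count from the observed skew: compare the hottest key’s row count to the average per key, so that each salted group ends up roughly average-sized. Continuing with the hypothetical interactions DataFrame from the sketch above:

```python
from pyspark.sql import functions as F

# Measure skew: rows per product_id versus the average per key.
key_counts = interactions.groupBy("product_id").count()
stats = key_counts.agg(
    F.max("count").alias("max_count"),
    F.avg("count").alias("avg_count"),
).first()

# A key with ~30x the average load suggests ~30 salts. Cap the value,
# since every extra salt also replicates the small join side once more.
num_salts = min(int(stats["max_count"] / stats["avg_count"]) + 1, 64)
```

On the second follow-up, the cost of salting is extra rows on the replicated side and string-key churn; the usual mitigations I would mention are broadcasting the replicated small side when it fits in memory, and salting only the keys known to be hot rather than the whole dataset.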
Candidate's Approach
I explained the concept of salting and how it can be applied to manage skewed data in Spark. I detailed the process of creating new salted keys and the rationale behind it, emphasizing the importance of balancing the workload during data processing.
Interviewer's Feedback
No feedback provided.