Deep Learning Recommendation Models (DLRM)
Paper link: https://arxiv.org/abs/1906.00091
Model workflow:
Source from: https://medium.com/swlh/deep-learning-recommendation-models-dlrm-a-deep-dive-f38a95f47c2c
In the workflow diagram, all configurable parameters are outlined in blue and the operators used are shown in green.
Issues
- Memory Capacity Dominated (Input from Network)
- Memory Bandwidth Dominated (Processing of Features: Embedding Lookup and MLP)
- Communication Based (Interaction between Features)
- Compute Dominated (Compute/Run-Time Bottleneck)
1. Memory Capacity Dominated
- Embeddings contribute the vast majority of the model's parameters, with several tables each requiring multiple GBs of memory. This necessitates distributing the model across multiple devices.
- Data parallelism is preferred for the MLPs, since it enables concurrent processing of samples on different devices and only requires communication when accumulating updates (see the sketch below).
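A minimal sketch of this split in PyTorch (not the reference DLRM code; the table and MLP sizes are illustrative only, and production tables are far larger):

```python
import torch
import torch.nn as nn

# Fall back to CPU when no GPUs are visible so the sketch still runs.
devices = ([torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
           or [torch.device("cpu")])

# Model parallelism: each embedding table (multi-GB in production) is pinned to
# exactly one device instead of being replicated everywhere.
num_tables, rows_per_table, dim = 8, 100_000, 64   # illustrative sizes only
tables = [
    nn.EmbeddingBag(rows_per_table, dim, mode="sum").to(devices[t % len(devices)])
    for t in range(num_tables)
]

# Data parallelism: the dense MLP is small, so a full copy lives on every device;
# in a real multi-process run each copy would be wrapped in DistributedDataParallel
# so that only its gradients are all-reduced across workers.
mlps = [
    nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1)).to(d)
    for d in devices
]
```

The key point is that only the small dense parameters are duplicated; each large embedding table exists exactly once across the devices.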
2. Memory Bandwidth Dominated
- Embedding lookups consist of sparse, irregular reads over very large tables, which makes them memory-bandwidth bound.
- Quotient-Remainder Trick: using two complementary functions, the integer quotient and the remainder, we can produce two separate (much smaller) embedding tables and combine their lookups in a way that yields a unique embedding for each category (see the sketch below).
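A minimal sketch of the quotient-remainder trick, assuming element-wise multiplication as the combining operation (the paper also evaluates other combiners); `QREmbedding` and the sizes below are illustrative, not taken from the reference code:

```python
import torch
import torch.nn as nn


class QREmbedding(nn.Module):
    """Two small tables replace one table with `num_categories` rows."""

    def __init__(self, num_categories: int, dim: int, num_collisions: int):
        super().__init__()
        self.num_collisions = num_collisions
        # Quotient table: ceil(num_categories / num_collisions) rows.
        self.quotient = nn.Embedding(
            (num_categories + num_collisions - 1) // num_collisions, dim
        )
        # Remainder table: num_collisions rows.
        self.remainder = nn.Embedding(num_collisions, dim)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        q = idx // self.num_collisions
        r = idx % self.num_collisions
        # The (quotient, remainder) pair is unique per category, so combining the
        # two looked-up vectors (here by element-wise product) yields a distinct
        # embedding for every category.
        return self.quotient(q) * self.remainder(r)


emb = QREmbedding(num_categories=1_000_000, dim=16, num_collisions=1_000)
vecs = emb(torch.tensor([0, 42, 999_999]))   # shape: (3, 16)
```

In this example the two tables together hold roughly 1,000 + 1,000 rows instead of 1,000,000, while every category still maps to its own embedding vector.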
3. Communication Based
DLRM uses model parallelism to avoid replicating the whole set of embedding tables on every GPU device, and data parallelism to enable concurrent processing of samples in the fully-connected (MLP) layers.
MLP parameters are replicated across GPU devices; only the embedding outputs need to be shuffled between devices, not the MLP weights.
What is the problem?
Transferring embedding tables across nodes in a cluster is expensive and can become a bottleneck.
Solution
Since it is the interaction between pairs of learned embedding vectors that matters, not the absolute values of the embeddings themselves, we hypothesize that the embeddings can be learned independently on different nodes and still yield a good model.
This saves network bandwidth by synchronizing only the MLP parameters and learning the embedding tables independently on each of the server nodes (see the sketch below).
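A minimal sketch of this idea using `torch.distributed` with the gloo backend (a single-process group and tiny layer sizes are used purely for illustration; a real run would launch one process per node, e.g. with torchrun, and reuse the same sync function):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn


def sync_mlp_only(mlp: nn.Module) -> None:
    """All-reduce the dense MLP's gradients; embedding gradients stay local."""
    world_size = dist.get_world_size()
    for p in mlp.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size


if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    embedding = nn.EmbeddingBag(10_000, 16, mode="sum")   # never synchronized
    mlp = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))

    out = mlp(embedding(torch.randint(0, 10_000, (4, 5))))
    out.sum().backward()
    sync_mlp_only(mlp)   # network traffic is limited to the small MLP
    dist.destroy_process_group()
```

Only the MLP gradients cross the network; the embedding tables and their gradients never leave the node that owns them.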
4. Compute Dominated
- The MLPs themselves are compute-intensive and can become the run-time bottleneck.
- Co-locating production-scale recommendation models on shared hardware creates performance bottlenecks and leads to lower resource utilization.
Solution: FBGEMM, Facebook's low-precision, high-performance matrix-multiplication library for server-side inference (see the sketch below).
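One way FBGEMM is exposed through PyTorch is as the backend for quantized int8 GEMMs. A minimal sketch, assuming an x86 build of PyTorch with the FBGEMM engine available, that runs a (hypothetical, illustratively sized) dense MLP's inference through it via dynamic quantization:

```python
import torch
import torch.nn as nn

# Select FBGEMM as the backend for quantized kernels (available on x86 builds).
torch.backends.quantized.engine = "fbgemm"

mlp = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1))

# Dynamic quantization replaces the nn.Linear layers with int8 versions whose
# GEMMs are executed by FBGEMM kernels at inference time.
quantized_mlp = torch.quantization.quantize_dynamic(
    mlp, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(8, 64)
print(quantized_mlp(x).shape)   # torch.Size([8, 1]), same interface as before
```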