Please see the attached file for a fuller description.
In this project, you will need to train a simple logistic regression model. You can use any machine learning library that supports distributed training, such as TensorFlow or PyTorch.
You will start by writing the code for training the model on a single machine. Training should start from a random linear vector as your model. Then calculate the loss of your model on the dataset using: [formula missing from this post; see the attached file], where [symbols missing] are the training data features and label, respectively. The dataset for training your model is the MNIST handwritten digits database. Train to minimize the loss with an optimizer such as gradient descent.
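As a rough illustration of the single-machine step, here is a minimal NumPy sketch of multiclass logistic regression trained with full-batch gradient descent from random weights. Since the loss formula is not shown in this post, the sketch assumes the usual mean softmax cross-entropy; the function names, the learning rate, and the synthetic data standing in for MNIST are all hypothetical choices, not part of the assignment.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def loss_and_grad(W, X, Y):
    """Mean cross-entropy loss (an assumed loss) and its gradient w.r.t. W.

    X: (n, d) features, Y: (n, k) one-hot labels, W: (d, k) weights.
    """
    P = softmax(X @ W)
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
    grad = X.T @ (P - Y) / X.shape[0]
    return loss, grad


def train(X, Y, lr=0.1, steps=200, seed=0):
    """Full-batch gradient descent starting from small random weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(X.shape[1], Y.shape[1]))
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(W, X, Y)
        losses.append(loss)
        W -= lr * grad
    return W, losses


def make_toy_data(n=600, d=20, k=10, seed=1):
    """Synthetic stand-in for MNIST (hypothetical data, not real digits)."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(k, d))
    labels = rng.integers(0, k, size=n)
    X = centers[labels] + rng.normal(size=(n, d))
    Y = np.eye(k)[labels]
    return X, Y, labels


if __name__ == "__main__":
    X, Y, labels = make_toy_data()
    W, losses = train(X, Y)
    acc = np.mean(np.argmax(X @ W, axis=1) == labels)
    print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, train acc {acc:.3f}")
```

For the real task, the toy data would be replaced by the MNIST images flattened to 784 features with one-hot labels, and the same loop could be written with the TensorFlow or PyTorch optimizers instead.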
The next task is to modify the workload so that it can be launched in a distributed way. You will experiment with both synchronous and asynchronous SGD. In distributed mode, the dataset is usually spread among the VMs. On each iteration, each worker machine calculates gradients using its own shard of the data. In synchronous mode, the gradients from all workers are accumulated to update the model before training moves on to the next iteration. In asynchronous mode, there is no accumulation step and the worker nodes update the model independently.
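The difference between the two modes can be sketched in a single process before moving to real VMs. The following is only a simulation under stated assumptions: it uses binary logistic regression on synthetic data to stay short, the "workers" are just data shards, and asynchrony is modelled by letting a randomly chosen worker push a gradient computed on the (possibly stale) weights it last pulled. All names here are hypothetical; a real implementation would use the distributed APIs of TensorFlow or PyTorch.

```python
import numpy as np


def grad(w, X, y):
    """Gradient of the mean binary logistic loss on one shard."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)


def loss(w, X, y):
    """Mean binary logistic loss in a numerically stable form."""
    z = X @ w
    return np.mean(np.logaddexp(0.0, z) - y * z)


def run_sync(shards, w0, lr, steps):
    """Synchronous SGD: all workers use the same w; gradients are averaged."""
    w = w0.copy()
    for _ in range(steps):
        grads = [grad(w, Xs, ys) for Xs, ys in shards]
        w -= lr * np.mean(grads, axis=0)
    return w


def run_async(shards, w0, lr, events, seed=0):
    """Asynchronous SGD (simulated): workers push updates one at a time,
    each computed on the copy of w that worker pulled after its last push."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    local = [w0.copy() for _ in shards]  # each worker's last pulled weights
    for _ in range(events):
        i = rng.integers(len(shards))    # whichever worker finishes next
        Xs, ys = shards[i]
        w = w - lr * grad(local[i], Xs, ys)  # no waiting, no accumulation
        local[i] = w.copy()              # worker pulls the fresh model
    return w


def make_shards(n=400, d=10, workers=4, seed=0):
    """Synthetic binary data split evenly across hypothetical workers."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)
    shards = list(zip(np.array_split(X, workers), np.array_split(y, workers)))
    return X, y, shards


if __name__ == "__main__":
    X, y, shards = make_shards()
    w0 = np.zeros(X.shape[1])
    print("sync :", loss(run_sync(shards, w0, lr=0.2, steps=100), X, y))
    print("async:", loss(run_async(shards, w0, lr=0.2, events=400), X, y))
```

In this sketch one synchronous step costs one gradient per worker, so 100 synchronous steps and 400 asynchronous events with four workers do the same amount of gradient computation, which makes the two loss values roughly comparable.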
After finishing the implementation, plot the performance and test error for both modes and explain any similarities and differences. In addition, monitor the CPU, memory, and network usage of each VM during training. You can use tools such as dstat or sar, or any other monitoring tool you like. Show your observations and determine which resource is the bottleneck.
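Tools such as dstat or sar report these figures directly. If it is more convenient to log utilization from inside the training script, the following is a minimal sketch that computes CPU utilization from two snapshots of /proc/stat. It assumes a Linux VM; the function names are hypothetical, and memory and network counters would need similar parsing of /proc/meminfo and /proc/net/dev.

```python
import time


def parse_cpu_times(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Fields: user nice system idle iowait irq softirq steal.
            # Guest time is already counted inside user/nice, so it is skipped.
            values = [int(v) for v in fields[1:9]]
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            total = sum(values)
            return total - idle, total
    raise ValueError("no aggregate 'cpu' line found")


def cpu_utilization(before_text, after_text):
    """Fraction of CPU time spent busy between two /proc/stat snapshots."""
    busy0, total0 = parse_cpu_times(before_text)
    busy1, total1 = parse_cpu_times(after_text)
    elapsed = total1 - total0
    return (busy1 - busy0) / elapsed if elapsed > 0 else 0.0


def sample_cpu(interval=1.0, path="/proc/stat"):
    """Sample the local machine (Linux only) over `interval` seconds."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval)
    with open(path) as f:
        after = f.read()
    return cpu_utilization(before, after)
```

Calling `sample_cpu()` periodically on each VM and writing the result next to the training step number would make it easier to line up utilization with the loss curves.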
Try different batch sizes and compare the results.
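One way to set up such a comparison is sketched below, assuming mini-batch SGD on a binary logistic regression with a fixed learning rate and number of epochs. The batch sizes, learning rate, and synthetic data are hypothetical placeholders for the MNIST setup.

```python
import numpy as np


def minibatch_sgd(X, y, batch_size, lr=0.1, epochs=5, seed=0):
    """Mini-batch SGD for binary logistic regression.

    Returns the final mean loss and the number of parameter updates made.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])
    updates = 0
    for _ in range(epochs):
        order = rng.permutation(len(y))  # reshuffle every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-(X[idx] @ w)))
            w -= lr * X[idx].T @ (p - y[idx]) / len(idx)
            updates += 1
    z = X @ w
    final_loss = np.mean(np.logaddexp(0.0, z) - y * z)
    return final_loss, updates


def compare_batch_sizes(X, y, sizes=(16, 64, 256)):
    """Run the same training once per batch size and collect the results."""
    return {b: minibatch_sgd(X, y, b) for b in sizes}


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.normal(size=(512, 10))
    y = (X @ rng.normal(size=10) > 0).astype(float)
    for b, (final_loss, updates) in compare_batch_sizes(X, y).items():
        print(f"batch {b:4d}: {updates:4d} updates, final loss {final_loss:.3f}")
```

With the epochs held fixed, a smaller batch size performs more parameter updates per pass over the data, so it is worth recording wall-clock time and network traffic alongside the loss when comparing runs.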
Deliverables: Submit a report describing how you completed each step and reporting your observations as required above. Also submit the code used for training on a single machine and on multiple machines.