I. Answer true or false to the following questions. (30 points in total)
1. ( ) We can use pre-trained word embeddings, such as fastText or Word2Vec, to obtain sentence vectors from sentences.
2. ( ) After downloading the pre-trained AlexNet CNN, you must always retrain it with your images before using it.
3. ( ) If we set the batch size to 5 when training an LSTM on 10 sequences, the LSTM states will be reset twice after two epochs.
4. ( ) When training an LSTM with batch size = 1, there is the same number of backpropagation steps for each sequence if we pad all sequences to the same length.
5. ( ) After training the Skip-Gram approach with sentence input, we can extract a sentence vector for each sentence.
6. ( ) If we want to train a word-embedding layer to convert each word to a vector, we need to prepare a target label (i.e. Y) for each input sequence.
7. ( ) Users need to specify the number of negative samples when training Word2Vec using the CBOW approach.
8. ( ) The CBOW approach uses multiple target words in the training process.
9. ( ) When training a Skip-Gram model to obtain word embeddings for a vocabulary of millions of words, we should set the loss function to softmax.
10. ( ) The vanishing gradient (descent) problem can still occur when training an LSTM that has only one LSTM layer.
11. ( ) When classifying continuous activities on real-time sensor inputs, we can use the bi-directional LSTM.
12. ( ) A CNN with the following layers in sequence has a good design: (input, convolution, ReLU, Batch Normalization, convolution, ReLU, Batch Normalization, pooling, fully-connected, softmax).
13. ( ) Image augmentation may help a CNN to avoid the overfitting problem.
14. ( ) For predicting the sentiment of a sentence (i.e. a sequence), we need to build an LSTM to perform many-to-many classification.
15. ( ) An alphabet prediction system that predicts sequences like ABCDE -- FGH can be implemented using LSTM M:1 (i.e. many-to-one, or sequence-to-last) prediction.
II. Please type your answers to the following questions in this Word file.
1. (20 points) Shortly before you left an elevator, you overheard two data scientists arguing about where normalization (or standardization) steps are needed in order to properly train a CNN for a regression task.
Before the elevator door closes, can you tell the data scientists your suggestions? Please limit your answer to three sentences. Anything beyond three sentences cannot be heard by the data scientists and, hence, will **NOT** be graded.
2. (20 points) Mary is a data scientist. She received three (3) sequences listed in the following table. Mary was asked to build/train an LSTM to take the “Input Data” in each sequence to predict the “Target” of each corresponding sequence.
(2.1) If the requirement document indicates that the length of each sequence is 1, how many predictor(s) does each sequence have?
(2.2) If the requirement document indicates that the length of each sequence is 2, how many predictor(s) does each sequence have?
(2.3) If the requirement document indicates that the length of each sequence is 6, how many predictor(s) does each sequence have?
3. (15 points) Two data scientists were asked to build an LSTM network to classify 100,000 sentences into binary sentiment (i.e. positive vs. negative). Half of the sentences are labeled as positive, and the other half are labeled as negative. The following pseudo-code lists the neural network architecture they used for this analysis task. We also know that there is NO problem in any of the following neural network pseudo-code or in its actual implementation.
They have CORRECTLY performed ALL the possible text pre-processing (i.e. cleaning, removing stop words, tokenization, one-hot encoding, padding to different sentence lengths, sorting sentences), and they have also tested many LSTM hyper-parameters such as (1) the number of LSTM layers, (2) the number of LSTM units in each layer, (3) the number of fully-connected (FC) layers, (4) the number of neurons in each FC layer, (5) different activation functions at each layer, (6) different numbers of epochs, (7) different batch sizes, (8) different learning rates, (9) different learners/optimizers, and (10) different numbers of dropout layers with different dropout rates. However, their quality measurements (i.e. accuracy and F-measure) are always very bad. They do not know what other options to try next.
While there is no guarantee that the quality will improve, can you give them a few suggestions on what other options they can try? Please do not suggest the 10 options listed above or the text cleaning that the two data scientists have already tried. Please list and explain your options in three sentences to the two data scientists before the elevator door closes. Anything beyond three sentences cannot be heard by the two data scientists and, hence, will **NOT** be graded.
4. (15 points) Shortly before you left an elevator, you overheard two data scientists arguing about the purposes of getting a powerful pre-trained CNN (e.g. AlexNet) that contains NO classification layers (i.e. NO fully-connected dense layers and NO softmax layer).
Before the elevator door closes, can you quickly tell the data scientists at least two purposes of doing this (i.e. getting a powerful pre-trained CNN that contains NO classification layers) in three sentences? Anything beyond three sentences cannot be heard by the two data scientists and, hence, will **NOT** be graded.