diff --git a/answers/chapter7.md b/answers/chapter7.md
index e69de29b..79894b64 100644
--- a/answers/chapter7.md
+++ b/answers/chapter7.md
@@ -0,0 +1,313 @@
+### 7.1 Basics
+
+1. [E] Explain supervised, unsupervised, weakly supervised, semi-supervised, and active learning.
+ Supervised learning is when the training data is labelled and the model can correct its learning using them.
+   Unsupervised learning is when there are no labels and the model has to learn patterns from the inputs themselves. Note that unsupervised learning in some settings is similar to supervised learning, e.g. masked language models generate their own labels from the raw text.
+ Weakly supervised learning uses heuristic functions to label data for training and does not require any labels from the training data.
+ Semi-supervised learning requires some labelled data to train an initial model on. The model predicts labels on the unlabelled data and those with high raw probability get added to the training data. The process continues until the desired performance is reached.
+   Active learning selects a subset of data points to learn from. An active learner makes predictions on unlabelled data and sends the ones it is least confident about to annotators to label.
+2. Empirical risk minimization.
+ 1. [E] What’s the risk in empirical risk minimization?
+ The goal of ERM is to minimize the risk of the model. The risk is defined as the expected value of the loss function over the true underlying data distribution. However, this risk is estimated by taking the average of the loss function over the training data. This creates the risk of the data distribution in the real world/production being different from the training data and therefore the model risk being higher than estimated. Another issue is that this estimate will be much lower if the model is overfitting.
+ 2. [E] Why is it empirical?
+ Because it is estimated from the data at hand, i.e. the training data as opposed to being computed on the true underlying data distribution.
+ 3. [E] How do we minimize that risk?
+       We minimize the empirical risk with an optimization algorithm (e.g. gradient descent) over the training data, and we keep it close to the true risk by collecting more data to better represent the true distribution and by using regularization, early stopping, or dropout to avoid overfitting.
+3. [E] Occam's razor states that when the simple explanation and complex explanation both work equally well, the simple explanation is usually correct. How do we apply this principle in ML?
+ One way to apply Occam's razor in machine learning is by using simpler models, such as linear regression, instead of more complex models like deep neural networks. A simpler model will have fewer parameters to learn, and therefore less risk of overfitting the data.
+ Another way to apply Occam's razor in machine learning is through feature selection. When dealing with large datasets, it is often the case that not all features are relevant to the problem at hand. By selecting only the most relevant features, you can reduce the complexity of the model and improve its performance.
+ Additionally, techniques like regularization, such as L1, L2, and dropout, can also be used to reduce the complexity of the model and prevent overfitting.
+4. [E] What are the conditions that allowed deep learning to gain popularity in the last decade?
+    1. Cloud computing, more compute power: availability of virtual machines with a wide range of compute power and GPU resources, and easy access to all of these through cloud providers.
+    2. Open-source tools: access to powerful deep learning frameworks such as TensorFlow and PyTorch.
+    3. Open-source data: a lot more data is available, which makes it easier for practitioners to get past the long data collection phase.
+    4. Open-source models: this has enabled collaboration and improvement upon released models.
+5. [M] If we have a wide NN and a deep NN with the same number of parameters, which one is more expressive and why?
+ A deeper NN. Because it can model more complex functions and has more opportunities to learn hierarchical representations of the data which makes it better at generalization. On the other hand, a wider network with more neurons is likely to memorize the inputs and the corresponding outputs rather than learn the underlying representation of the inputs.
+6. [H] The Universal Approximation Theorem states that a neural network with 1 hidden layer can approximate any continuous function for inputs within a specific range. Then why can’t a simple neural network reach an arbitrarily small positive error?
+ A neural network, like any other machine learning model, is only able to approximate the underlying function based on the data it has been trained on. This means that even if the network has the capacity to approximate the function, it can still make errors if the data it is trained on is noisy or incomplete. With only a single hidden layer, the model will not be able to learn the hierarchical representation of the data and will underfit. Another reason can be local minima, i.e. when the optimization algorithm gets stuck in a sub-optimal solution, preventing the network from reaching the global minimum.
+7. [E] What are saddle points and local minima? Which are thought to cause more problems for training large NNs?
+   Saddle points are critical points that are neither minima nor maxima. This means that in some directions the objective function is increasing, while in other directions it is decreasing.
+ A local minimum is a point in the parameter space where the gradient of the objective function is zero. This means that the objective function has a minimum value in the neighborhood of that point and the gradient is zero in all directions.
+   At a saddle point the gradient is zero, but the point is neither a minimum nor a maximum: the loss curves upward in some directions and downward in others. Because the gradient vanishes on the plateau around it, gradient-based optimizers can slow to a crawl or stall there. In high-dimensional loss surfaces, critical points are far more likely to be saddle points than local minima, so saddle points are generally thought to cause more problems when training large NNs.
+8. Hyperparameters.
+ 4. [E] What are the differences between parameters and hyperparameters?
+ Parameters are the internal variables of a model that are learned from the data during the training process, while hyperparameters are the external variables of a model that are set before the training process starts. Hyperparameter tuning is an essential step in the machine learning pipeline and it's necessary to find the optimal values for the hyperparameters that make the model perform well on unseen data.
+ 5. [E] Why is hyperparameter tuning important?
+ Hyperparameter tuning is important because it allows you to optimize the performance of a model by adjusting the settings of the model that are not learned from the data. By systematically exploring the hyperparameter space, we can find the optimal values that make the model perform well on unseen data. This can improve the generalization of the model and prevent overfitting.
+ 6. [M] Explain algorithm for tuning hyperparameters.
+ 1. Grid Search: In grid search, all possible combinations of the hyperparameter values are trained and evaluated. This is a simple and straightforward method, but it can be computationally expensive for large hyperparameter spaces.
+ 2. Random Search: In random search, random combinations of the hyperparameter values are trained and evaluated. This is a more efficient method than grid search as it requires fewer evaluations, but it may not explore the hyperparameter space as thoroughly.
+ 3. Bayesian optimization: This algorithm is based on Bayesian statistics, it models the distribution of the hyperparameters and the objective function, and it uses this distribution to choose the next set of hyperparameters to evaluate. This algorithm is more computationally expensive than grid or random search, but it can converge faster to the optimal solution.
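+    As a rough illustration of the first two approaches (a minimal sketch assuming scikit-learn, with a toy SVC-on-iris setup that is not part of the original answer; Bayesian optimization would require an external library such as Optuna):
+    ```python
+    # Minimal sketch: grid search vs. random search with scikit-learn.
+    from sklearn.datasets import load_iris
+    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
+    from sklearn.svm import SVC
+    from scipy.stats import loguniform
+
+    X, y = load_iris(return_X_y=True)
+    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
+
+    grid = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)  # tries all 9 combinations
+    rand = RandomizedSearchCV(
+        SVC(),
+        {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
+        n_iter=10, cv=5, random_state=0,
+    ).fit(X, y)                                              # samples 10 random combinations
+
+    print(grid.best_params_, rand.best_params_)
+    ```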
+9. Classification vs. regression.
+ 7. [E] What makes a classification problem different from a regression problem?
+ The main difference between classification and regression is the type of output they predict: classification is used to predict discrete labels or classes, while regression is used to predict continuous output values.
+ 8. [E] Can a classification problem be turned into a regression problem and vice versa?
+ Yes, a classification problem can be turned into a regression problem by converting the categorical output variable into a continuous one. For example, instead of predicting a class label, the model could predict a probability of the input belonging to each class. Similarly, a regression problem can be turned into a classification problem by converting the continuous output variable into a categorical one. For example, by dividing the range of output values into bins and assigning a class label to each bin, the model could predict the class label of the input instead of the continuous output value.
+10. Parametric vs. non-parametric methods.
+ 9. [E] What’s the difference between parametric methods and non-parametric methods? Give an example of each method.
+ Parametric methods are methods that learn using a pre-defined mapping function to map the inputs to the output. These methods make assumptions about the probability distribution of the data, typically assuming a normal distribution. In addition, these methods have a fixed number of parameters. An example of a parametric method is linear regression, which assumes that the relationship between the independent and dependent variables is linear, and estimates the coefficients of the line of best fit.
+ Non-parametric methods, on the other hand, make fewer assumptions about the data distribution and do not have a fixed number of parameters or a pre-defined mapping function to learn the relationship between inputs and outputs. These methods are more flexible and can be applied to a wider range of data types. An example of a non-parametric method is the k-nearest neighbors (k-NN) algorithm, which classifies a data point based on the majority class of its k-nearest neighbors. Another example is decision trees.
+ 10. [H] When should we use one and when should we use the other?
+        Parametric methods are a good choice when the assumed functional form fits the data or when data is limited, since they are simpler and faster to train. If we are not sure about the underlying data distribution, or we have enough data for the model to learn the structure itself, non-parametric methods are a better choice.
+11. [M] Why does ensembling independently trained models generally improve performance?
+   Because it combines the learning of different models with different biases. This combination generally makes the predictions more accurate. Ensembles also have the advantage of reducing overfitting because different models make different errors on the training data, making the ensemble less susceptible to noise and better at generalization.
+12. [M] Why does L1 regularization tend to lead to sparsity while L2 regularization pushes weights closer to 0?
+ L1 regularization, also known as Lasso regularization, adds a penalty term to the loss function that is proportional to the absolute value of the weights. This penalty term encourages the weights to be small in magnitude, and it tends to lead to sparsity because it will drive some weights exactly to zero.
+ L2 regularization, also known as Ridge regularization, adds a penalty term to the loss function that is proportional to the square of the weights. This penalty term encourages the weights to be small, but it does not encourage them to be exactly zero. Instead, it pushes the weights closer to zero, but not necessarily exactly to zero.
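+    A minimal sketch of this effect (assuming scikit-learn and a hypothetical synthetic regression dataset):
+    ```python
+    # L1 (Lasso) drives many coefficients exactly to zero; L2 (Ridge) only shrinks them.
+    import numpy as np
+    from sklearn.datasets import make_regression
+    from sklearn.linear_model import Lasso, Ridge
+
+    X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
+                           noise=10.0, random_state=0)
+
+    lasso = Lasso(alpha=1.0).fit(X, y)
+    ridge = Ridge(alpha=1.0).fit(X, y)
+
+    print("zero coefficients (L1):", np.sum(lasso.coef_ == 0))   # typically many
+    print("zero coefficients (L2):", np.sum(ridge.coef_ == 0))   # typically none
+    ```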
+13. [E] Why does an ML model’s performance degrade in production?
+ There can be a variety of reasons:
+ 1. Data drift
+ 2. Concept drift
+    3. Different environment settings from the dev environment, e.g. package versions
+14. [M] What problems might we run into when deploying large machine learning models?
+ 1. Memory and computational resources: Large models require a lot of memory and computational resources to run, which can be a challenge for deployment on resource-constrained devices or in cloud environments with limited resources.
+ 2. Latency: Large models can have high latency, which can make them difficult to use in real-time or near real-time applications.
+ 3. Retraining and maintenance: Retraining and maintaining large models can be challenging, especially when deploying multiple versions of the same model or updating the model over time.
+15. Your model performs really well on the test set but poorly in production.
+ 11. [M] What are your hypotheses about the causes?
+ 1. It can be that the training data is not representative of the production data, i.e. the user inputs are much noisier than the training data.
+ 2. Alternatively, there can be data drift or concept drift.
+ 3. The model may be biased towards certain groups or input values, resulting in poor performance for certain subpopulations.
+ 4. Preprocessing or feature engineering steps might vary between the environments
+ 12. [H] How do you validate whether your hypotheses are correct?
+        Run invariance and slice-based tests on the training data and see if small perturbations of the training data affect the results, or if the results vary depending on the sub-group. If they do, it means the model is not robust enough. Ideally these tests should have been run prior to making the model live.
+ In the case of data drift we can compare the distribution of training and production data with statistical methods such as KL divergence.
+ 13. [M] Imagine your hypotheses about the causes are correct. What would you do to address them?
+ You will need to retrain your model and include perturbed data in the training set to make it less susceptible to noise.
+ In the case of data or concept drift you will need to retrain your model on new data or do online training to learn the new relationships.
+
+### 7.2 Sampling and creating training data
+
+1. [E] If you have 6 shirts and 4 pairs of pants, how many ways are there to choose 2 shirts and 1 pair of pants?
+ There are 15 ways to choose 2 shirts out of 6, and 4 ways to choose 1 pair of pants out of 4. To find the number of ways to choose both items, you would multiply these two values: 15 x 4 = 60.
+2. [M] What is the difference between sampling with vs. without replacement? Name an example of when you would use one rather than the other?
+   Sampling with replacement means that after a sample is selected, it is put back into the population so that it can be selected again. Sampling without replacement means that once a sample is selected, it is not put back into the population and cannot be selected again. An example where one would pick sampling with replacement over without is bootstrapping in bagging. On the other hand, for splitting the data into train and test sets you would want to sample without replacement. A minimal sketch of both is shown below.
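+    A minimal sketch of both sampling modes (assuming NumPy):
+    ```python
+    # Sampling with vs. without replacement using NumPy.
+    import numpy as np
+
+    rng = np.random.default_rng(0)
+    population = np.arange(10)
+
+    bootstrap_sample = rng.choice(population, size=10, replace=True)   # duplicates allowed (bagging)
+    test_indices = rng.choice(population, size=3, replace=False)       # unique indices (train/test split)
+
+    print(bootstrap_sample)
+    print(test_indices)
+    ```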
+3. [M] Explain Markov chain Monte Carlo sampling.
+ The basic idea behind MCMC is to construct a sequence of samples, called a Markov chain, that is designed to converge to the target distribution. The chain starts at some initial state and then iteratively generates new states that are probabilistically determined by the current state. After running the chain for a sufficient number of iterations, the samples generated by the chain will be distributed according to the target distribution.
+4. [M] If you need to sample from high-dimensional data, which sampling method would you choose?
+ MCMC based methods are suitable for sampling from high dimensional data. e.g. Gibbs sampling, Metropolis-Hastings.
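+    A minimal Metropolis-Hastings sketch (assuming a hypothetical 1-D standard-normal target and a Gaussian random-walk proposal):
+    ```python
+    # Sample from an unnormalized 1-D target density with Metropolis-Hastings.
+    import numpy as np
+
+    def target(x):
+        return np.exp(-0.5 * x ** 2)   # unnormalized N(0, 1) density
+
+    rng = np.random.default_rng(0)
+    samples, x = [], 0.0
+    for _ in range(10_000):
+        proposal = x + rng.normal(scale=1.0)           # symmetric random-walk proposal
+        accept_prob = min(1.0, target(proposal) / target(x))
+        if rng.random() < accept_prob:                 # accept, or keep the current state
+            x = proposal
+        samples.append(x)
+
+    print(np.mean(samples), np.std(samples))           # should be close to 0 and 1
+    ```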
+5. [H] Suppose we have a classification task with many classes. An example is when you have to predict the next word in a sentence -- the next word can be one of many, many possible words. If we have to calculate the probabilities for all classes, it’ll be prohibitively expensive. Instead, we can calculate the probabilities for a small set of candidate classes. This method is called candidate sampling. Name and explain some of the candidate sampling algorithms.
+    1. Sampled softmax: instead of calculating the softmax over all possible classes, it samples a subset of classes according to a sampling distribution Q and trains the model to maximize the probability of the target class over that sampled set.
+    2. Noise Contrastive Estimation (NCE): transforms the training into binary logistic regression where non-target classes are sampled as noise, and the goal is to predict whether an example belongs to the positive class (the target) or to the noise.
+ **Hint**: check out this great [article](https://www.tensorflow.org/extras/candidate_sampling.pdf) on candidate sampling by the TensorFlow team.
+6. Suppose you want to build a model to classify whether a Reddit comment violates the website’s rule. You have 10 million unlabeled comments from 10K users over the last 24 months and you want to label 100K of them.
+ 1. [M] How would you sample 100K comments to label?
+ 1. Time-based sampling: this method is useful when the data is collected over a specific period of time and you want to ensure that the labeled dataset represents the whole period. You could sample comments from different months or years to ensure that the labeled dataset is diverse and covers the whole period of 24 months.
+ 2. User-based sampling (cluster sampling): this method is useful when you want to ensure that the labeled dataset represents the whole user base. You could randomly select 100 users and sample all their comments, or sample a random number of comments for each user. This way the labeled dataset represents a diverse group of users.
+        3. High-uncertainty sampling: this method is useful when you want to ensure that the model is exposed to examples that are difficult to classify and have a low confidence level. You could sample comments that are hard to classify as violating or not violating the website's rule, for example comments written in a different language, containing sarcasm, or with a neutral sentiment. Alternatively, weakly supervised heuristics can label enough samples to train a preliminary model, which can then be used as an active learner to select the samples it finds hardest to predict.
+ 1. [M] Suppose you get back 100K labeled comments from 20 annotators and you want to look at some labels to estimate the quality of the labels. How many labels would you look at? How would you sample them?
+ A good starting point for the number of samples to inspect would be around 10% of the total number of labels. In this case, that would be around 10,000 labels. This sample size would be large enough to get a good estimate of the inter-annotator agreement while still being manageable to inspect.
+ However, it's worth noting that this is just a starting point and the actual number of samples to inspect may need to be adjusted based on the results of the agreement metrics and the inspection of the sample of labels.
+ For instance, if the inter-annotator agreement is found to be low, it might be necessary to inspect more labels to identify the sources of disagreement. On the other hand, if the agreement is high and the labels are found to be of high quality, it might be possible to inspect fewer labels to confirm the quality of the labels.
+        Other alternatives are random sampling, cluster sampling (group by user and then sample from each user), and systematic sampling (pick every 10th comment).
+ **Hint**: This [article](https://www.cloudresearch.com/resources/guides/sampling/pros-cons-of-different-sampling-methods/) on different sampling methods and their use cases might help.
+
+7. [M] Suppose you work for a news site that historically has translated only 1% of all its articles. Your coworker argues that we should translate more articles into Chinese because translations help with the readership. On average, your translated articles have twice as many views as your non-translated articles. What might be wrong with this argument?
+ **Hint**: think about selection bias.
+ 1. Selection bias: The sample of translated articles may not be representative of all articles on the site. The translated articles may be more popular or more likely to be viewed, which could be due to a number of factors, such as the topic, headline, or author.
+ 2. Causation vs correlation: The fact that translated articles have twice as many views as non-translated articles does not necessarily mean that translating more articles will lead to more views. There could be other factors that are causing the increased readership and translating more articles may not necessarily lead to more views.
+8. [M] How to determine whether two sets of samples (e.g. train and test splits) come from the same distribution?
+ There are different statistical tests for this:
+    1. Maximum Mean Discrepancy (MMD): a kernel-based method that computes the distance between the means of the two distributions after mapping them into a higher-dimensional feature space.
+ 2. Kullback Leibler divergence (KL divergence) : KL divergence is a measure of how different two probability distributions are, it can be used to compare two sets of samples and determine if they come from the same distribution.
+ 3. Chi-Squared: It compares the observed frequencies of the two samples against the expected frequencies of the two samples if they come from the same distribution.
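+    A rough sketch of the KL-divergence check from the list above (assuming SciPy/NumPy, with hypothetical 1-D samples and histogram-based density estimates):
+    ```python
+    # Estimate KL divergence between train and test samples for one feature
+    # by binning both into the same histogram.
+    import numpy as np
+    from scipy.stats import entropy
+
+    rng = np.random.default_rng(0)
+    train = rng.normal(loc=0.0, scale=1.0, size=5_000)   # hypothetical train feature
+    test = rng.normal(loc=0.2, scale=1.1, size=1_000)    # hypothetical test feature
+
+    bins = np.histogram_bin_edges(np.concatenate([train, test]), bins=30)
+    p, _ = np.histogram(train, bins=bins, density=True)
+    q, _ = np.histogram(test, bins=bins, density=True)
+    eps = 1e-10                                          # avoid zeros in empty bins
+
+    kl = entropy(p + eps, q + eps)                       # KL(P || Q); near 0 means similar distributions
+    print(kl)
+    ```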
+9. [H] How do you know you’ve collected enough samples to train your ML model?
+    1. For more traditional models, a simple rule of thumb is to have at least 10 times as many samples as features.
+ 2. Investigating similar case studies on similar model architectures.
+    3. Run cross-validation on the existing data; if the model overfits even though it is not that complex, that can indicate that more data is needed.
+ 4. Training an initial model and investigating its mistakes can be telling
+10. [M] How to determine outliers in your data samples? What to do with them?
+ 1. Visual inspection: One way to detect outliers is to visually inspect the data by creating plots such as histograms, scatter plots, or box plots. Outliers will be represented as points that are far from the main cluster of points.
+ 2. Z-score: Z-score, also known as standard score, is a measure of how many standard deviations an observation is from the mean. A z-score greater than 3 or less than -3 can be considered as an outlier.
+ 3. Interquartile range (IQR): Interquartile range (IQR) is a measure of the spread of the data. It is defined as the difference between the 75th percentile and the 25th percentile. Data points that are more than 1.5*IQR below the 25th percentile or above the 75th percentile can be considered as outliers.
+ 4. Mahalanobis Distance: Mahalanobis Distance is a method that takes into account the correlation among the variables. It calculates the distance of each point from the mean of the data, taking into account the covariance matrix.
+
+    What we do with outliers depends on the task at hand. Depending on how extreme the outliers are and whether we think they can occur in production, we can delete them or keep them. Transformations such as a log transform can be applied to bring the values more in line with the rest of the data. A sketch of the z-score and IQR rules is shown below.
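+    A minimal sketch of the z-score and IQR rules (assuming NumPy and a hypothetical 1-D sample):
+    ```python
+    # Flag outliers with the z-score and IQR rules.
+    import numpy as np
+
+    rng = np.random.default_rng(0)
+    x = np.concatenate([rng.normal(size=1_000), [8.0, -9.0, 12.0]])  # add a few obvious outliers
+
+    # Z-score rule: points more than 3 standard deviations from the mean.
+    z = (x - x.mean()) / x.std()
+    z_outliers = x[np.abs(z) > 3]
+
+    # IQR rule: points more than 1.5 * IQR outside the 25th/75th percentiles.
+    q1, q3 = np.percentile(x, [25, 75])
+    iqr = q3 - q1
+    iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
+
+    print(z_outliers, iqr_outliers)
+    ```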
+11. Sample duplication
+ 1. [M] When should you remove duplicate training samples? When shouldn’t you?
+ Cases where they should not be removed:
+ 1. Duplicates should not be removed if the duplicated samples are representative of the real world, e.g. duplicated objects in a scene can be a common scenario and therefore they should be kept
+ 2. If the duplicates were created to oversample a minority class they should be kept to avoid bias towards the majority class
+ In the case where duplicates are a result of a sampling error and do not represent the real world data distribution they should be removed as they can introduce unwanted bias towards certain data classes and increase the training time.
+        It's important to understand the underlying cause of duplicates before making a decision.
+ 1. [M] What happens if we accidentally duplicate every data point in your train set or in your test set?
+ Duplicating the train set can have some negative effects. For example, it can lead to memorization of the duplicates and overfitting which will not generalize to unseen data. It can also increase the training time and memory requirements.
+ For the test case, if there is leakage, the model may have an inflated performance by "predicting" samples that were in the training set and memorized correctly. Duplicating the test data will most likely result in increased inference time overall but since the metrics are in ratios, predicting a sample wrong/correctly many times should not affect the overall metrics.
+12. Missing data
+ 1. [H] In your dataset, two out of 20 variables have more than 30% missing values. What would you do?
+ Deletion: the easiest but not so effective way is to delete those variables. It's probably better to delete the variables over the rows since the latter results in removing over 30% of the data.
+        However, this results in less information to learn from. It is important to investigate why the values of each variable are missing. Are they missing completely at random, at random, or not at random? If they are not missing at random, the fact that they are missing can itself be informative and keeping them will be better. In this case, setting the missing values to a default value that lies outside the range of acceptable values can be a better solution.
+ 1. [M] How might techniques that handle missing data make selection bias worse? How do you handle this bias?
+ In the case where the data is missing at random, i.e. missing not because of the true missing value but because of some other value, we can introduce selection bias by deleting the samples. For example, if the participants from gender A do not disclose their age and we delete all the rows with missing age, we delete all the samples from gender A. This makes selection bias worse because now the model doesn't see examples of this population and will underperform on this sub-population in production.
+
+    It's best to handle the missing data differently so as not to make selection bias worse. In the case where there are still samples with the rarer feature values, e.g. gender A from the example above, it might be worth oversampling them to have more examples from that sub-group.
+13. [M] Why is randomization important when designing experiments (experimental design)?
+   It is important for selecting a population that is representative of the true distribution. This way, we don't introduce unwanted bias by selecting a disproportionate number of samples from a certain sub-population.
+14. Class imbalance.
+ 1. [E] How would class imbalance affect your model?
+        It can lead to the majority class dominating the predictions, which has a negative effect on correctly classifying the samples from the minority classes.
+ 1. [E] Why is it hard for ML models to perform well on data with class imbalance?
+        In an imbalanced dataset the model becomes biased towards the majority class as it mostly sees samples of that class and does not see enough of the minority classes. This doesn't give it enough information to learn the underlying patterns of the rare class, so it becomes less sensitive to it.
+    1. [M] Imagine you want to build a model to detect skin lesions from images. In your training dataset, only 1% of your images show signs of lesions. After training, your model seems to make a lot more false negatives than false positives. What are some of the techniques you'd use to improve your model?
+ 1. Data augmentation: More samples from the rare class can be generated by slightly modifying the existing data, e.g. flipping, cropping, rotating.
+        2. Depending on the number of majority samples, you can downsample them or look into oversampling methods for the lesion class.
+        3. The cost function can be modified to penalize the model more when it makes a FN. One way is to give each class a weight that is the inverse of its number of samples. Another useful cost function for class imbalance is focal loss, which gives a higher weight to hard examples (typically the minority class) by multiplying the log p term with (1 - p)^gamma. A sketch of both is shown after this list.
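+    A minimal sketch of a class-weighted binary cross-entropy and a focal loss (NumPy only; the weight and gamma values are hypothetical):
+    ```python
+    # Class-weighted BCE and focal loss for an imbalanced binary problem;
+    # y is the true label, p the predicted probability of the positive (rare) class.
+    import numpy as np
+
+    def weighted_bce(y, p, pos_weight=99.0, eps=1e-7):
+        # weight the positive-class term more, e.g. by inverse class frequency
+        p = np.clip(p, eps, 1 - eps)
+        return -np.mean(pos_weight * y * np.log(p) + (1 - y) * np.log(1 - p))
+
+    def focal_loss(y, p, gamma=2.0, eps=1e-7):
+        # (1 - p_t)^gamma down-weights easy, confident examples
+        p = np.clip(p, eps, 1 - eps)
+        p_t = np.where(y == 1, p, 1 - p)
+        return -np.mean((1 - p_t) ** gamma * np.log(p_t))
+
+    y = np.array([1, 0, 0, 0])
+    p = np.array([0.3, 0.1, 0.2, 0.05])
+    print(weighted_bce(y, p), focal_loss(y, p))
+    ```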
+15. Training data leakage.
+ 1. [M] Imagine you're working with a binary task where the positive class accounts for only 1% of your data. You decide to oversample the rare class then split your data into train and test splits. Your model performs well on the test split but poorly in production. What might have happened?
+        Oversampling can mean duplicating samples or tweaking some features by a very small amount. The oversampling should have happened after splitting the data. By doing it prior to splitting, the oversampled points from the train set have leaked into the test set and the model is overly optimistic about its performance on the test set. The production data is different from the training and testing data, which explains the discrepancy.
+ 1. [M] You want to build a model to classify whether a comment is spam or not spam. You have a dataset of a million comments over the period of 7 days. You decide to randomly split all your data into the train and test splits. Your co-worker points out that this can lead to data leakage. How?
+        Splitting the data randomly into train and test splits can lead to data leakage when working with a time-series dataset, such as comments over a period of 7 days, because the model may learn patterns specific to particular days or trends. This can lead to information from the future leaking into training and allowing the model to cheat during evaluation.
+
+    **Hint**: You might want to clarify what oversampling here means. Oversampling can be as simple as duplicating samples from the rare class.
+
+16. [M] How does data sparsity affect your models?
+ 1. Increased space and time complexity
+ 2. Bias towards dense features and underestimating the predictive power of sparse features
+ 3. Overfitting
+
+ **Hint**: Sparse data is different from missing data.
+
+17. Feature leakage
+ 26. [E] What are some causes of feature leakage?
+        Using the target to construct features is one cause: the features then contain information about the target that should not be accessible at prediction time. Another cause can be that the data is not representative of real-world examples and the model learns features that are highly correlated with the target in the dataset but do not generalize well, e.g. a neural network trained to classify huskies vs. wolves that only saw wolves with a snowy background and huskies with a green background.
+ 27. [E] Why does normalization help prevent feature leakage?
+ It can help reduce the correlation between leaked features and the target.
+ 28. [M] How do you detect feature leakage?
+ 1. Investigating the correlation between the target and features or combination of features.
+        2. Ablation studies to find features that significantly affect performance and identifying the cause.
+ 3. Keeping an eye on new features if they significantly improve the model performance.
+18. [M] Suppose you want to build a model to classify whether a tweet spreads misinformation. You have 100K labeled tweets over the last 24 months. You decide to randomly shuffle on your data and pick 80% to be the train split, 10% to be the valid split, and 10% to be the test split. What might be the problem with this way of partitioning?
+ With a random shuffle rather than splitting based on dates there will be temporal data leakage from test splits to train and validation and the model will essentially cheat in predicting the test data, but perform poorly on production data.
+19. [M] You’re building a neural network and you want to use both numerical and textual features. How would you process those different features?
+   The textual features need to be tokenized into smaller units, e.g. words, and preprocessed; they are then mapped to integer ids and passed to either a trainable embedding layer or a pretrained embedding, e.g. GloVe. After the textual data is embedded (and pooled into a fixed-size vector), we can concatenate the normalized numerical features to it and feed the result to the rest of the network. A minimal sketch is shown below.
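+    A minimal sketch of such a network (assuming PyTorch; the vocabulary size and feature dimensions are hypothetical):
+    ```python
+    # Embed the token ids, mean-pool them, concatenate the numerical features,
+    # then classify with a small MLP.
+    import torch
+    import torch.nn as nn
+
+    class TextAndNumericModel(nn.Module):
+        def __init__(self, vocab_size=10_000, embed_dim=64, num_numeric=8, hidden=32):
+            super().__init__()
+            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
+            self.mlp = nn.Sequential(
+                nn.Linear(embed_dim + num_numeric, hidden),
+                nn.ReLU(),
+                nn.Linear(hidden, 1),
+            )
+
+        def forward(self, token_ids, numeric):
+            text_vec = self.embedding(token_ids).mean(dim=1)   # (batch, embed_dim)
+            combined = torch.cat([text_vec, numeric], dim=1)   # concatenate modalities
+            return self.mlp(combined)
+
+    model = TextAndNumericModel()
+    logits = model(torch.randint(1, 10_000, (4, 20)), torch.randn(4, 8))
+    print(logits.shape)   # torch.Size([4, 1])
+    ```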
+20. [H] Your model has been performing fairly well using just a subset of features available in your data. Your boss decided that you should use all the features available instead. What might happen to the training error? What might happen to the test error?
+ 1. Training error: The training error may decrease as the model has more information to learn from. With more features, the model may be able to better fit the training data, leading to a lower training error.
+    2. Test error: The test error may increase or decrease depending on whether the additional features are relevant. If they provide useful information, the test error may decrease. If they are not relevant or contain noise, they can lead to overfitting and the test error will increase.
+
+ **Hint**: Think about the curse of dimensionality: as we use more dimensions to describe our data, the more sparse space becomes, and the further are data points from each other.
+
+### 7.3 Objective functions, metrics, and evaluation
+
+1. Convergence.
+ 1. [E] When we say an algorithm converges, what does convergence mean?
+        In the context of algorithms, convergence refers to the process of approaching a specific value or set of values as the number of iterations increases. For example, in optimization problems, an algorithm is said to converge when the solution it finds is within a specified tolerance of the true optimal solution, or when the difference between solutions in consecutive iterations is below a certain threshold. In machine learning, training is said to have converged when the loss (or another performance metric) on the training set stops improving, or even starts degrading, as training iterations continue. The specific definition of convergence can vary depending on the context and the algorithm in question.
+ 1. [E] How do we know when a model has converged?
+ When the loss does not change much from one iteration to the next.
+1. [E] Draw the loss curves for overfitting and underfitting.
+ https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html
+1. Bias-variance trade-off
+ 1. [E] What’s the bias-variance trade-off?
+ The bias-variance trade-off is a fundamental concept in machine learning that refers to the trade-off between a model's ability to fit the training data well (low bias) and its ability to generalize well to new, unseen data (low variance). A model with high bias is one that makes strong assumptions about the underlying relationship between the input and output variables, and as a result, may not fit the training data very well. On the other hand, a model with high variance is one that is highly sensitive to the specific details of the training data, and may not generalize well to new data. In general, a good machine learning model will strike a balance between these two extremes, and the goal of many machine learning techniques is to find the right balance between bias and variance.
+ 1. [M] How’s this tradeoff related to overfitting and underfitting?
+ A model with high bias is said to be underfitting, because it is not able to fit the training data well. A model with high variance is said to be overfitting, because it is fitting the noise in the training data, rather than the underlying pattern
+ 1. [M] How do you know that your model is high variance, low bias? What would you do in this case?
+ If a model has high variance and low bias, it means that it is highly sensitive to the specific training data it was trained on. It's likely to perform well on the training data but poorly on unseen data, overfitting the training data. This can be identified by observing that the model has high performance on the training set but poor performance on the validation or test sets.
+ To address this issue, one can use techniques such as regularization, which adds a penalty term to the loss function to discourage large weights, or ensemble methods, which combine the predictions of multiple models to reduce variance. Another strategy could be to collect more data to increase the size of the training set.
+ 1. [M] How do you know that your model is low variance, high bias? What would you do in this case?
+ A model that is low variance and high bias generally means that the model is underfitting the data. This can happen if the model is too simple, or if the model has not been trained for enough iterations. To address this issue, one could try to increase the complexity of the model
+1. Cross-validation.
+ 1. [E] Explain different methods for cross-validation.
+ 1. K-fold cross-validation: The data is divided into k subsets, and the model is trained and evaluated k times, each time using a different subset as the evaluation set and the remaining subsets as the training set.
+ 2. Leave-one-out cross-validation: This method is similar to k-fold cross-validation, but with k set to the number of samples in the data. For each iteration, one sample is used as the evaluation set, and the remaining samples are used as the training set.
+        3. Stratified cross-validation: This method is used when the data is imbalanced, meaning there is an unequal number of samples in each class. The data is divided into k subsets, ensuring that each subset has roughly the same class distribution as the original data. A minimal sketch of the first and third methods is shown below.
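+    A minimal sketch of plain k-fold vs. stratified k-fold (assuming scikit-learn and hypothetical imbalanced labels):
+    ```python
+    # Compare how the two splitters distribute a rare positive class across folds.
+    import numpy as np
+    from sklearn.model_selection import KFold, StratifiedKFold
+
+    X = np.arange(20).reshape(-1, 1)
+    y = np.array([0] * 16 + [1] * 4)              # imbalanced binary labels
+
+    for name, splitter in [("k-fold", KFold(n_splits=4)),
+                           ("stratified", StratifiedKFold(n_splits=4))]:
+        for train_idx, val_idx in splitter.split(X, y):
+            # stratified folds keep roughly the same positive rate in every split
+            print(name, "positives in val fold:", y[val_idx].sum())
+    ```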
+ 1. [M] Why don’t we see more cross-validation in deep learning?
+ The main reason why we don't see more cross-validation in deep learning is that deep learning models are computationally expensive to train, and performing cross-validation would require training multiple versions of the same model, which can be computationally infeasible. Additionally, in deep learning, we often use large amounts of data to train models, making cross-validation less necessary. A single split of the data into training and test sets is often sufficient for evaluating the performance of a deep learning model.
+1. Train, valid, test splits.
+ 1. [E] What’s wrong with training and testing a model on the same data?
+ Because it will overfit to the data and not generalize. The goal of testing the trained model is to see how it can generalize on unseen data and make sure that the learned parameters do well on data it has not used to adjust those parameters. Using the same data defeats this purpose.
+ 1. [E] Why do we need a validation set on top of a train set and a test set?
+ A validation set is used to tune the hyperparameters of a model during the training process. It allows us to evaluate the performance of the model on unseen data before it is tested on the test set. Without a validation set, we would only have the training set to tune the hyperparameters, and there's a risk of overfitting.
+ 1. [M] Your model’s loss curves on the train, valid, and test sets look like this. What might have been the cause of this? What would you do?
+        It seems that the validation loss is increasing, which is a sign of overfitting. Interestingly, the overfitted model generalizes well to the test data, which can be a sign of data leakage. Another explanation can be that the validation set has a different distribution from the train and test sets. Some solutions would be to use regularization and early stopping to avoid overfitting. In addition, it would be useful to investigate the distribution of the data in the different sets to see if the validation set differs from train and test; if so, the splits should be redone so that each split is representative.
+ Investigation on data leakage is also necessary and the leaked examples should be removed from the train set.
+
+1. [E] Your team is building a system to aid doctors in predicting whether a patient has cancer or not from their X-ray scan. Your colleague announces that the problem is solved now that they’ve built a system that can predict with 99.99% accuracy. How would you respond to that claim?
+ I would ask what the accuracy is in each class. It is very likely that the model is predicting no cancer most of the time and since the negative class is more prevalent, the accuracy is high. But we care more about detecting the cancer cases which are rarer.
+1. F1 score.
+ 1. [E] What’s the benefit of F1 over the accuracy?
+ In situations where the data is imbalanced, F1 score is a better measure compared to accuracy because it gives equal weight to the precision and recall. In such cases, accuracy can be misleading as the classifier might just predict the majority class and get a high accuracy even though it's not doing a good job at classifying the minority class.
+ 1. [M] Can we still use F1 for a problem with more than two classes. How?
+        Macro and micro F1 scores are used for multi-class problems. Macro-F1 is the unweighted average of the per-class F1 scores. Micro-F1 sums the TPs, FPs and FNs across all classes before computing F1, which for single-label multi-class problems equals accuracy. Micro-F1 gives every observation equal weight and is therefore dominated by the majority classes, whereas macro-F1 gives each class equal weight and is better suited for imbalanced classes.
+ Good resources: https://stackoverflow.com/questions/37358496/is-f1-micro-the-same-as-accuracy, https://stephenallwright.com/micro-vs-macro-f1-score/
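+    A minimal sketch of the two averages (assuming scikit-learn and hypothetical imbalanced multi-class predictions):
+    ```python
+    # Micro vs. macro F1 on a small imbalanced example.
+    from sklearn.metrics import f1_score
+
+    y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
+    y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 2]
+
+    print("micro:", f1_score(y_true, y_pred, average="micro"))   # equals accuracy here
+    print("macro:", f1_score(y_true, y_pred, average="macro"))   # pulled down by the rare classes
+    ```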
+1. Given a binary classifier that outputs the following confusion matrix.
+
+    |              | Predicted True | Predicted False |
+    |--------------|----------------|-----------------|
+    | Actual True  | 30             | 20              |
+    | Actual False | 5              | 40              |
+
+ 1. [E] Calculate the model’s precision, recall, and F1.
+ Precision (PPV) = 30 / (30 + 5) = 0.857
+ Recall (Sensitivity, TPR) = 30 / (30 + 20) = 0.6
+        F1 = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.857 * 0.6) / (0.857 + 0.6) ≈ 0.71
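+    A quick numeric check of the values above (minimal sketch):
+    ```python
+    # Precision, recall and F1 from the confusion matrix counts above.
+    tp, fn, fp, tn = 30, 20, 5, 40
+
+    precision = tp / (tp + fp)                                 # ~0.857
+    recall = tp / (tp + fn)                                    # 0.6
+    f1 = 2 * precision * recall / (precision + recall)         # ~0.706
+    print(precision, recall, f1)
+    ```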
+ 1. [M] What can we do to improve the model’s performance?
+ 1. Modifying the decision threshold will affect the number of FP and FN, e.g. reducing it will decrease the number of FN but likely increase the number of FPs. Depending on the data the overall F1 score can increase.
+ 2. Investigating the feature values of FN and FP cases and seeing if any modifications to the features can help
+ 3. Using other classifiers
+ 4. Changing the hyper-parameters
+1. Consider a classification where 99% of data belongs to class A and 1% of data belongs to class B.
+ 1. [M] If your model predicts A 100% of the time, what would the F1 score be? **Hint**: The F1 score when A is mapped to 0 and B to 1 is different from the F1 score when A is mapped to 1 and B to 0.
+        If A is mapped to 1, the precision will be 99% and the recall will be 100%, so the F1 is 99.5%. However, if A is mapped to 0 (so B is the positive class), the model never predicts the positive class: precision involves a division by zero and recall is 0, so the F1 score is undefined (usually reported as 0).
+ 1. [M] If we have a model that predicts A and B at a random (uniformly), what would the expected F1 be?
+        In this case the predictions are 50-50 and independent of the true label. When A is the positive class, the expected precision is 0.99 (99% of the samples predicted as A are actually A) and the expected recall is 0.5 (half of the actual A samples are predicted as A), which gives an F1 score of about 0.66. In the opposite case, the expected precision is 0.01 and the expected recall is 0.5, which gives an F1 score of about 0.02.
+1. [M] For logistic regression, why is log loss recommended over MSE (mean squared error)?
+   Because log loss is a probabilistic loss that measures the dissimilarity between the predicted probability distribution and the true distribution. It is a natural choice for logistic regression because the model's output is a probability, so log loss directly measures how well the predicted probabilities match the true labels.
+ Also, MSE is non-convex for logistic regression which makes finding the best fit to the data harder than log loss which is convex.
+1. [M] When should we use RMSE (Root Mean Squared Error) over MAE (Mean Absolute Error) and vice versa?
+   RMSE penalizes larger errors more heavily. It is also differentiable everywhere, whereas MAE is not differentiable at zero; MAE, on the other hand, is more interpretable and more robust to outliers.
+   RMSE is more suitable for cases where making large errors is particularly costly, e.g. fraud detection, but when the data has outliers and we don't want the results to be dominated by those points, MAE is a better choice.
+1. [M] Show that the negative log-likelihood and cross-entropy are the same for binary classification tasks.
+ Let's consider a binary classification task with two classes: class 0 and class 1, and let y be the true label (0 or 1) and p be the predicted probability of the positive class (class 1).
+
+ The negative log-likelihood loss function is defined as:
+
+ L(y, p) = -(y * log(p) + (1-y) * log(1-p))
+
+ The cross-entropy loss function is defined as:
+
+   H(y, p) = -SUM_c y_c * log(p_c) = -(y * log(p) + (1-y) * log(1-p)),
+   where the true distribution over the two classes is (y, 1-y) and the predicted distribution is (p, 1-p). This is exactly the negative log-likelihood above.
+1. [M] For classification tasks with more than two labels (e.g. MNIST with 10 labels), why is cross-entropy a better loss function than MSE?
+   The MSE loss compares the predicted output vector with the one-hot true output element-wise and calculates the mean squared error between them. It treats the outputs as arbitrary real values rather than as a probability distribution, and when combined with a softmax output it is non-convex in the logits and yields very small gradients when the model is confidently wrong, which slows learning.
+   On the other hand, the cross-entropy loss uses the predicted class probabilities directly: it is the negative log-likelihood of the true class under the predicted distribution. The closer the predicted probability of the true class is to 1, the lower the loss. It penalizes the model heavily for confidently wrong predictions and less for uncertain ones, and its gradients remain informative even when the model is badly wrong, which makes optimization much easier for multi-class classification.
+1. [E] Consider a language with an alphabet of 27 characters. What would be the maximal entropy of this language?
+   log(27), attained when all 27 characters are equally likely (with log base 2 this is log2(27) ≈ 4.75 bits per character).
+1. [E] A lot of machine learning models aim to approximate probability distributions. Let’s say P is the distribution of the data and Q is the distribution learned by our model. How do measure how close Q is to P?
+   Kullback-Leibler divergence (KL divergence): a measure of the difference between two probability distributions, often used when the true distribution P is only available through samples. A symmetric version of this is the Jensen-Shannon (JS) divergence. Another measure for discrete values is the Chi-squared distance.
+1. MPE (Most Probable Explanation) vs. MAP (Maximum A Posteriori)
+ 1. [E] How do MPE and MAP differ?
+        Given the evidence, MPE finds the most probable assignment to all non-evidence variables, whereas MAP finds the most probable assignment to only a subset of the non-evidence variables, marginalizing out the rest.
+ 1. [H] Give an example of when they would produce different results.
+ https://www.quora.com/What-are-the-cases-in-which-Most-Probable-Explanation-MPE-tasks-do-not-generalize-to-Maximum-A-Posteriori-MAP-task
+1. [E] Suppose you want to build a model to predict the price of a stock in the next 8 hours and that the predicted price should never be off more than 10% from the actual price. Which metric would you use?
+ Mean absolute percentage error is a metric that computes the average of the ratio between the absolute error and the actual value which is what we need in this case.
+
+ **Hint**: check out MAPE.
+
diff --git a/answers/chapter8.md b/answers/chapter8.md
index e69de29b..7c39f4cf 100644
--- a/answers/chapter8.md
+++ b/answers/chapter8.md
@@ -0,0 +1,540 @@
+#### 8.1.2 Questions
+
+1. [E] What are the basic assumptions to be made for linear regression?
+ 1. Linearity: there is a linear relationship between the inputs and outputs
+ 2. Independence: the observations are independent of each other and the input variables (features) are not correlated
+   3. Homoscedasticity: the spread of the residuals is the same across all levels of the independent variables, i.e. the variance of the error term is constant
+ 4. Normality: the residuals should be normally distributed
+2. [E] What happens if we don’t apply feature scaling to logistic regression?
+ Without feature scaling, the optimization algorithm will converge much slower as the scale of the features will have a large impact on the optimization process.
+ The reason behind this is that the optimization algorithm calculates the gradient of the cost function with respect to the parameters, and the step size of the update is determined by the learning rate. If the features have vastly different scales, the update step size will be much larger for the features with larger scales and much smaller for the features with smaller scales. This will make the optimization process very slow, as it will oscillate between large and small steps, and it will take a lot of iterations to converge.
+3. [E] What are the algorithms you’d use when developing the prototype of a fraud detection model?
+   To detect fraud (anomalies), some useful algorithms are decision trees, random forests, k-nearest neighbors (KNN), and auto-encoders.
+4. Feature selection.
+ 1. [E] Why do we use feature selection?
+ Some features are more informative than others and removing the less important features has a number of benefits:
+ 1. Improved performance
+        2. Easier to interpret the model with fewer dimensions
+ 3. Less prone to overfitting
+ 2. [M] What are some of the algorithms for feature selection? Pros and cons of each.
+ 1. Filter methods: These methods use a statistical test to evaluate the relevance of each feature with respect to the target variable. Features are then ranked based on their score and the top-ranking features are selected. Examples of filter methods include chi-squared, mutual information, and ANOVA. Pros: easy to implement and fast to run. Cons: can be sensitive to the choice of statistical test and may not take into account the relationships between features.
+ 2. Wrapper methods: These methods use a machine learning model to evaluate the performance of different subsets of features. Features are then selected based on their contribution to the performance of the model. Examples of wrapper methods include recursive feature elimination (RFE) and sequential feature selection (SFS). Pros: can take into account the relationships between features and can be more accurate than filter methods. Cons: computationally expensive and can be sensitive to the choice of machine learning model.
+ 3. Embedded methods: These methods use a machine learning model to select features during the training process. Features are selected based on their contribution to the performance of the model. Examples of embedded methods include Lasso and Ridge regression. Pros: can take into account the relationships between features and can be more accurate than filter methods. Cons: computationally expensive and can be sensitive to the choice of machine learning model.
+ 4. Hybrid methods: These methods combine the strengths of different feature selection methods to improve the performance of the model. Examples of hybrid methods include combining filter and wrapper methods. Pros: can take into account the relationships between features and can be more accurate than filter methods. Cons: computationally expensive and can be sensitive to the choice of machine learning model.
+5. k-means clustering.
+ 1. [E] How would you choose the value of k?
+ There are a few different methods that can be used to choose the value of k in k-means clustering. One popular method is the elbow method, which involves fitting the k-means model for different values of k and then plotting the sum of squared distances between data points and their closest cluster centroid (also called the within-cluster sum of squares) as a function of k. The value of k at which the within-cluster sum of squares begins to decrease at a slower rate is chosen as the optimal number of clusters.
+ Another method is the silhouette method, which involves measuring the similarity of each data point to its own cluster compared to other clusters. It ranges from -1 to 1. A higher value of silhouette score denotes that the point is well-matched to its own cluster.
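+    A minimal sketch of both methods (assuming scikit-learn and hypothetical blob data):
+    ```python
+    # Elbow (inertia) and silhouette scores for different values of k.
+    from sklearn.cluster import KMeans
+    from sklearn.datasets import make_blobs
+    from sklearn.metrics import silhouette_score
+
+    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
+
+    for k in range(2, 8):
+        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
+        print(k, km.inertia_, silhouette_score(X, km.labels_))
+    # Pick the k where inertia stops dropping sharply (the elbow) or where the
+    # silhouette score peaks.
+    ```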
+ 1. [E] If the labels are known, how would you evaluate the performance of your k-means clustering algorithm?
+ Some extrinsic metrics are:
+ 1. Normalized Mutual Information (NMI): Computes the mutual information between the true and predicted labels.
+ 2. Adjusted Rand Index (ARI): Computes the similarity between the true and predicted clusters using number of pair-wise correct predictions.
+ 3. Fowlkes-Mallows Index (FMI): Is the geometric mean of precision and recall between true labels and predicted labels.
+ 1. [M] How would you do it if the labels aren’t known?
+ Some intrinsic metrics are:
+ 1. Silhouette score: Measures the within and between cluster distances, ranges from -1 to 1. Higher is better.
+ 2. The Calinski-Harabasz index (CHI): Ratio of between cluster variance to within. The higher the better.
+ 3. Davies-Bouldin Index (DBI): This measures the average similarity between each cluster and its most similar cluster. A lower value indicates better clustering.
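+    A rough sketch of a few of the metrics from the two lists above (assuming scikit-learn and hypothetical blob data):
+    ```python
+    # Extrinsic metrics (true labels known) and intrinsic metrics (labels unknown).
+    from sklearn.cluster import KMeans
+    from sklearn.datasets import make_blobs
+    from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
+                                 silhouette_score, davies_bouldin_score)
+
+    X, y_true = make_blobs(n_samples=500, centers=3, random_state=0)
+    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
+
+    print("NMI:", normalized_mutual_info_score(y_true, labels))
+    print("ARI:", adjusted_rand_score(y_true, labels))
+    print("Silhouette:", silhouette_score(X, labels))
+    print("Davies-Bouldin:", davies_bouldin_score(X, labels))
+    ```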
+ 1. [H] Given the following dataset, can you predict how K-means clustering works on it? Explain.
+        Given K = 2, the algorithm will start with 2 random data points as the centers of the two clusters. It then assigns every point to the cluster with the nearest centroid, recomputes each centroid as the mean of its assigned points, and repeats until the assignments stop changing. The final clusters depend on the initial choice of centroids, so we might want to run the algorithm multiple times and keep the best result.
+        Let's assume that one of the centroids ends up in the ring cluster and the other in the middle cluster. Given the shape of the ring, points on the opposite side of the ring are closer to the middle centroid and will likely get assigned to the middle cluster, while some of the sparser points from the middle cluster may get assigned to the outer cluster. So the final result might look like this.
+        Algorithms such as HDBSCAN that handle clusters of varying density and shape will probably be a better option for this data.
+
+ 
+
+6. k-nearest neighbor classification.
+ 1. [E] How would you choose the value of k?
+ 1. Empirical testing: One approach is to try out different values of k and evaluate the performance of the classifier using a validation set or cross-validation. The value of k that results in the best performance is chosen.
+ 2. Rule of thumb: A commonly used rule of thumb is to choose k to be the square root of the number of samples in the training set. This value is chosen as it balances the trade-off between overfitting and underfitting.
+ 3. Cross-validation: Another approach is to use cross-validation techniques such as GridSearchCV or RandomizedSearchCV to find the optimal value of k.
+        4. Elbow method: plot the validation error (or another performance metric) as a function of k and pick the point where the curve flattens out. (Note that the WCSS-based elbow plot applies to k-means clustering, not to k-NN classification.)
+ 1. [E] What happens when you increase or decrease the value of k?
+        A small k means the model relies on fewer data points when making a decision. This results in noisier, more jagged decision boundaries.
+        A large k means the model averages over more data points, which smooths the decision boundaries.
+ 1. [M] How does the value of k impact the bias and variance?
+        The noisy and jagged boundaries from a small k mean that the model is overfitting and has high variance (and low bias). The smooth boundaries from a larger k mean the model has higher bias and lower variance.
+7. k-means and GMM are both powerful clustering algorithms.
+ 1. [M] Compare the two.
+        1. K-means makes hard assignments of each point to a single cluster, while a GMM fitted with expectation-maximization (EM) makes soft, probabilistic assignments. Both are sensitive to the initial choice of parameters and can get stuck in local optima.
+        2. K-means tries to minimize the Euclidean distance between points and their centroids and therefore struggles when clusters have different shapes and densities, whereas GMMs model clusters as Gaussian distributions (with their own covariances) and handle varying cluster shapes better.
+ 3. K-means is simpler and faster than GMMs.
+ 1. [M] When would you choose one over another?
+ If the clusters are of different shapes and sizes, GMM is a better choice. If the cluster shapes are all spherical and of roughly the same size, K-means is a faster and simpler choice.
+8. Bagging and boosting are two popular ensembling methods. Random forest is a bagging example while XGBoost is a boosting example.
+ 1. [M] What are some of the fundamental differences between bagging and boosting algorithms?
+ 1. Bagging (bootstrap aggregating) creates multiple samples with replacement from the dataset and trains models on each set independently. The final prediction is made by taking a vote from all predictors for classification and average of all for regression.
+ 2. Boosting trains multiple weak learners on the same dataset but weights the samples based on the performance of each learner at any given stage. These learners are not trained independently. The sample weights are updated after each learner is fit to the data. This causes the incorrect predictions to get a higher weight and force the next learner to focus more on correcting those mistakes.
+ 3. In boosting, unlike bagging where all predictors carry equal weight, each weak learner is itself weighted: the total error it makes determines its weight, and the final prediction is a weighted combination of the learners' predictions.
+ 1. [M] How are they used in deep learning?
+ Bagging can be used to decrease the variance of neural nets, i.e. training various NNs on subsets of data with replacement. Boosting is typically used to improve the performance of weaker models, such as decision trees. In deep learning, neural networks are already powerful models and typically don't need boosting to improve performance. However, it is possible to use boosting algorithms to ensemble multiple neural networks together to further improve performance. One example could be training several different architectures of CNNs, such as VGG and ResNet, and then using a boosting algorithm to ensemble their predictions together to make a final prediction. This can help to reduce overfitting and improve the generalization of the model.
+9. Given this directed graph.
+
+ 
+
+ 1. [E] Construct its adjacency matrix.
+ [[0, 1, 0, 1, 1], [0, 0, 1, 1, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
+ 1. [E] How would this matrix change if the graph is now undirected?
+ If a directed graph becomes undirected, the adjacency matrix will change by making the matrix symmetric. In an undirected graph, if there is an edge between vertex i and vertex j, then there is also an edge between vertex j and vertex i. Therefore, the element in the ith row and jth column and jth row and ith column of the adjacency matrix will be the same.
+ 1. [M] What can you say about the adjacency matrices of two isomorphic graphs?
+ The adjacency matrices of two isomorphic graphs are identical up to a permutation of rows and columns. This means that if two graphs are isomorphic, their adjacency matrices will be the same when one of them is relabelled to match the other one. This is because the adjacency matrix of a graph represents the connectivity structure of the graph, which is preserved under isomorphism.
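+
+    More formally, if $$G_1$$ and $$G_2$$ are isomorphic with adjacency matrices $$A_1$$ and $$A_2$$, there is a permutation matrix $$P$$ (the relabelling of the vertices) such that
+
+    $$A_2 = P A_1 P^\top$$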
+10. Imagine we build a user-item collaborative filtering system to recommend to each user items similar to the items they’ve bought before.
+ 1. [M] You can build either a user-item matrix or an item-item matrix. What are the pros and cons of each approach?
+ A user-item matrix has one row per user and one column per item; the entries record interactions such as purchases or ratings. It makes user preferences and item popularity easy to read off, but it is typically huge and sparse, and user-user similarities are expensive to compute and keep up to date when the number of users is large.
+ An item-item matrix has one row and one column per item; the entries record the similarity between pairs of items based on the users who interacted with them. Because there are usually far fewer items than users and item-item similarities change more slowly, this approach is more computationally efficient and the similarities can be precomputed offline. On the other hand, it is less directly interpretable in terms of individual user preferences and still needs a strategy for new items and new users (cold start).
+ 1. [E] How would you handle a new user who hasn’t made any purchases in the past?
+ Some approaches to user cold start problem:
+ 1. Use of demographic information: If demographic information is available for the new user, the system might recommend items that are popular among users with similar demographic characteristics.
+ 2. Popularity-based recommendations: In the absence of any information, the system can recommend the most popular items. This can be useful for a new user, but it may not be personalized to their preferences.
+ 3. Ask the user to rate or review a few items during onboarding; this way the model can learn the user's preferences and make personalized recommendations.
+11. [E] Is feature scaling necessary for kernel methods?
+ It depends on the kernel. Distance-based kernels such as RBF are sensitive to feature scales, so scaling is necessary; kernels that are scale-invariant or normalise their inputs internally need it less. In practice, standardizing features before applying kernel methods is the safe default.
+12. Naive Bayes classifier.
+ 19. [E] How is Naive Bayes classifier naive?
+ It assumes that all the features in the data are mutually independent, meaning that the presence or absence of one feature has no effect on the presence or absence of any other feature.
+ 20. [M] Let’s try to construct a Naive Bayes classifier to classify whether a tweet has a positive or negative sentiment. We have four training samples:
+
+    | Tweet | Label |
+    | --- | --- |
+    | This makes me so upset | Negative |
+    | This puppy makes me happy | Positive |
+    | Look at this happy hamster | Positive |
+    | No hamsters allowed in my house | Negative |
+
+ According to your classifier, what's sentiment of the sentence `The hamster is upset with the puppy`?
+ The priors are equal (two positive and two negative tweets), so the decision comes down to the likelihoods of the query words that appear in the training data: "hamster", "upset", and "puppy". With add-one (Laplace) smoothing and treating "hamsters" as "hamster", "upset" is unseen in the positive class, "puppy" is unseen in the negative class, and "hamster" appears once in each; because the positive class contains fewer total words, its smoothed likelihoods come out slightly larger, so the classifier predicts Positive.
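+
+    A quick way to sanity-check this with scikit-learn, assuming add-one (Laplace) smoothing and simple lowercased tokenization (no lemmatization, so "hamsters" stays distinct from "hamster"); it also predicts Positive.
+
+    ```python
+    # Sketch: reproduce the Naive Bayes example with add-one smoothing.
+    from sklearn.feature_extraction.text import CountVectorizer
+    from sklearn.naive_bayes import MultinomialNB
+
+    tweets = [
+        "This makes me so upset",
+        "This puppy makes me happy",
+        "Look at this happy hamster",
+        "No hamsters allowed in my house",
+    ]
+    labels = ["Negative", "Positive", "Positive", "Negative"]
+
+    vectorizer = CountVectorizer()        # lowercases by default
+    X = vectorizer.fit_transform(tweets)
+
+    clf = MultinomialNB(alpha=1.0)        # alpha=1.0 is Laplace smoothing
+    clf.fit(X, labels)
+
+    query = vectorizer.transform(["The hamster is upset with the puppy"])
+    print(clf.predict(query), clf.predict_proba(query))
+    ```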
+
+13. Two popular algorithms for winning Kaggle solutions are Light GBM and XGBoost. They are both gradient boosting algorithms.
+ 1. [E] What is gradient boosting?
+ Gradient boosting builds an ensemble of weak learners (typically shallow decision trees) sequentially: at each stage a new learner is fit to the negative gradient of a differentiable loss function (e.g. squared error for regression, cross-entropy for classification) with respect to the current ensemble's predictions, and is added to the ensemble with a small learning rate. In effect, the ensemble performs gradient descent in function space (the update is sketched below).
+ 1. [M] What problems is gradient boosting good for?
+ It's very versatile and can be used for regression, classification, and ranking problems.
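+
+    As a sketch of the update mentioned above: at stage $$m$$, a weak learner $$h_m$$ is fit to the pseudo-residuals (the negative gradient of the loss with respect to the current model's predictions) and added with a learning rate $$\nu$$:
+
+    $$r_{im} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}}, \qquad F_m(x) = F_{m-1}(x) + \nu\, h_m(x)$$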
+14. SVM.
+ 1. [E] What’s linear separation? Why is it desirable when we use SVM?
+ Linear separation is the ability of a classifier to separate the data points of different classes using a linear boundary. It is desirable when using Support Vector Machines (SVMs) because it allows the classifier to be represented by a simple hyperplane, which makes the optimization problem for finding the best hyperplane computationally efficient.
+ 1. [M] How well would vanilla SVM work on this dataset?
+ It will work well: it will draw the maximum-margin line between the two classes.
+
+
+ 
+
+
+ 1. [M] How well would vanilla SVM work on this dataset?
+ It will still find a separating line, but the boundary is squeezed between the diamond and the circle that lie closest to the opposite class, leaving very little margin between the classes; a soft margin that tolerates a few misclassifications may generalize better.
+
+
+ 
+
+
+ 1. [M] How well would vanilla SVM work on this dataset?
+ Vanilla (linear) SVM will not be able to separate the two classes; a non-linear kernel (e.g. RBF) or a feature transformation would be needed.
+
+ 
+
+
+#### 8.2.1 Natural language processing
+1. RNNs
+ 1. [E] What’s the motivation for RNN?
+ The motivation behind Recurrent Neural Networks is to capture the dependencies between observations in a sequence, e.g. time series or natural language. For instance, in natural language the words in a sentence are not independent of one another, and knowing the first three words helps you guess the fourth. RNNs aim to capture this dependency by passing information from previous inputs along when processing the current input.
+ 1. [E] What’s the motivation for LSTM?
+ RNNs struggle with long-term dependencies because of vanishing gradients. LSTMs (and GRUs) were introduced to overcome this by adding a cell state that lets information flow from early time steps to later ones relatively unchanged. The gates in the LSTM cell (input, forget, output) let the network selectively remember or forget information from one time step to the next.
+ 1. [M] How would you do dropouts in an RNN?
+ Dropout can be applied in a RNN in different ways:
+ 1. It can be applied to the hidden state that feeds the output, not to the hidden state passed to the next time step. Different samples in a mini-batch should get different dropout masks, but the same sample should keep the same mask across time steps.
+ 2. It can be applied to the inputs x_t
+ 3. It can be applied to the recurrent weights/connections between hidden states (as in variational dropout). The same dropout mask should be used across all time steps of a given sequence.
+2. [E] What’s density estimation? Why do we say a language model is a density estimator?
+ Density estimation means estimating the probability density function (PDF) of a random variable from a set of observations. The PDF of a variable describes the probability of the variable taking on different values.
+
+ Language models are trained on sequences of words to learn the probability of words occurring. In other words, they are estimating the PDF of word sequences and can therefore be interpreted as density estimators.
+3. [M] Language models are often referred to as unsupervised learning, but some say its mechanism isn’t that different from supervised learning. What are your thoughts?
+ Language models are trained on vast amounts of text without any explicit human-provided labels; in that sense they are unsupervised. But to learn the structure of the language they use a self-supervised objective: either autoregressive (predict the next word given the previous words) or masked (predict masked words from their context). The words being predicted act as labels that come from the data itself, so the training mechanism looks very much like supervised learning.
+4. Word embeddings.
+
+ 1. [M] Why do we need word embeddings?
+
+ Word embeddings are a way to map words to vector representations that can be used in matrix multiplication in neural networks. These representations preserve the semantics and are lower in dimension than one-hot encoded vectors.
+
+ 2. [M] What’s the difference between count-based and prediction-based word embeddings?
+
+ Count-based embeddings are learned from the co-occurrence statistics of words across a large corpus; GloVe is a count-based method. Prediction-based embeddings (e.g. word2vec's skip-gram and CBOW) are learned by predicting a word from its surrounding words (or vice versa) and minimising the prediction loss.
+
+ 3. [H] Most word embedding algorithms are based on the assumption that words that appear in similar contexts have similar meanings. What are some of the problems with context-based word embeddings?
+
+ Context-based embeddings absorb and can reinforce gender, racial, and other biases present in the training data. For example, the embedding of "smart" or "beautiful" should not carry any gender preference, but if you ask a model to describe someone smart, or to translate a sentence about someone smart from a gender-neutral language into English, it will tend to prefer "he" for smart and "she" for beautiful, because in the training data "smart" co-occurs more with male contexts and "beautiful" with female ones.
+
+ Another issue is polysemy: a word with several distinct senses (e.g. "bank") gets a single static embedding that blends all of its senses, which makes the vector less useful when the word appears without disambiguating context.
+5. Given 5 documents:
+ D1: The duck loves to eat the worm
+ D2: The worm doesn’t like the early bird
+ D3: The bird loves to get up early to get the worm
+ D4: The bird gets the worm from the early duck
+ D5: The duck and the birds are so different from each other but one thing they have in common is that they both get the worm
+ 1. [M] Given a query Q: “The early bird gets the worm”, find the two top-ranked documents according to the TF/IDF rank using the cosine similarity measure and the term set {bird, duck, worm, early, get, love}. Are the top-ranked documents relevant to the query?
+ Each document and the query first go through tokenization and normalization (e.g. lowercasing, lemmatization/stemming). Then the TF and IDF of each term are calculated and multiplied together, and the cosine similarity between the query's TF/IDF vector and each document's vector determines the ranking. The top two are selected:
+ The IDF for the documents and set above is:
+ {bird: 0.22, duck: 0.51, worm: 0, early: 0.51, get: 0.51, love: 0.92}
+ This is because 4 out of 5 documents contain the word "bird" so the natural log of 5 documents over 4 is 0.22 and so on. Below are the term frequencies for the query and documents:
+ D1: {bird: 0, duck: 1, worm: 1, early: 0, get: 0, love: 1}
+ D2: {bird: 1, duck: 0, worm: 1, early: 1, get: 0, love: 0}
+ D3: {bird: 1, duck: 0, worm: 1, early: 1, get: 2, love: 1}
+ D4: {bird: 1, duck: 1, worm: 1, early: 1, get: 1, love: 0}
+ D5: {bird: 1, duck: 1, worm: 1, early: 0, get: 1, love: 0}
+ Q: {bird: 1, duck: 0, worm: 1, early: 1, get: 1, love: 0}
+ With this the TF/IDF of each document and query becomes:
+ D1: [0, .51, 0, 0, 0, .92]
+ D2: [.22, 0, 0, .51, 0, 0]
+ D3: [.22, 0, 0, .51, 1.02, .92]
+ D4: [.22, .51, 0, .51, .51, 0]
+ D5: [.22, .51, 0, 0, .51, 0]
+ Q: [.22, 0, 0, .51, .51, 0]
+ Now the cosine similarity between the query and each document is:
+ cos(Q, D1) = 0
+ cos(Q, D2) = .737
+ cos(Q, D3) = .742
+ cos(Q, D4) = .828
+ cos(Q, D5) = .543
+ So the top two documents are D4 and D3. They are relevant in the sense that they share common words; D3 seems closer in meaning to the query than D4, which mentions the "early duck". (A short NumPy check of these numbers appears at the end of this question.)
+ 1. [M] Assume that document D5 goes on to tell more about the duck and the bird and mentions “bird” three times, instead of just once. What happens to the rank of D5? Is this change in the ranking of D5 a desirable property of TF/IDF? Why?
+ This changes the TF of D5 to {bird: 3, duck: 1, worm: 1, early: 0, get: 1, love: 0}, which gives a TF/IDF vector of [.66, .51, 0, 0, .51, 0]. The cosine similarity between the query and D5 increases to about .55, which does not change the overall ranking.
+ This is not a desirable property of TF/IDF, because a document could simply repeat a word many times and increase its score without adding any relevant information.
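+
+    A minimal NumPy check of the cosine similarities above; the TF/IDF vectors are hard-coded from the hand computation (term order: bird, duck, worm, early, get, love).
+
+    ```python
+    # Sketch: cosine similarities between the query and the document TF/IDF vectors.
+    import numpy as np
+
+    docs = {
+        "D1": [0.00, 0.51, 0, 0.00, 0.00, 0.92],
+        "D2": [0.22, 0.00, 0, 0.51, 0.00, 0.00],
+        "D3": [0.22, 0.00, 0, 0.51, 1.02, 0.92],
+        "D4": [0.22, 0.51, 0, 0.51, 0.51, 0.00],
+        "D5": [0.22, 0.51, 0, 0.00, 0.51, 0.00],
+    }
+    q = np.array([0.22, 0.00, 0, 0.51, 0.51, 0.00])
+
+    def cosine(a, b):
+        # Guard against zero denominators (D1's dot product with the query is 0 anyway).
+        denom = np.linalg.norm(a) * np.linalg.norm(b)
+        return float(a @ b / denom) if denom else 0.0
+
+    for name, vec in docs.items():
+        print(name, round(cosine(q, np.array(vec)), 3))  # D1 0.0, D2 .737, D3 .742, D4 .828, D5 .543
+    ```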
+
+6. [E] Your client wants you to train a language model on their dataset but their dataset is very small with only about 10,000 tokens. Would you use an n-gram or a neural language model?
+ It depends on the task and constraints, but with only about 10,000 tokens a neural language model would almost certainly overfit. An n-gram model with smoothing is simpler, cheaper, and likely to perform at least as well on such a small corpus. A neural LM becomes the better choice when much more data is available, or when a large pretrained model can be fine-tuned on the small dataset.
+7. [E] For n-gram language models, does increasing the context length (n) improve the model’s performance? Why or why not?
+ To some extent: increasing n from 1 to 3, for example, helps capture longer-range context. Beyond a point, however, increasing n has a negative effect on generalization and efficiency: the number of possible n-grams grows exponentially with n, so counts become sparse and many n-grams encountered at test time were never seen during training.
+8. [M] What problems might we encounter when using softmax as the last layer for word-level language models? How do we fix it?
+ The issue is that with large vocabularies the softmax computation is very expensive: the output projection alone has d (model dimension) × V (vocab size) parameters, and computing and normalizing the logits for a batch costs on the order of B (batch size) × d × V operations at every step.
+ There are a number of alternatives to using a standard softmax layer:
+ 1. Hierarchical softmax: Words are leaves of a tree and instead of predicting the probability of each word, the probability of nodes are predicted
+ 1. Differentiated softmax: Is based on the intuition that not all words require the same number of parameters: Many occurrences of frequent words allow us to fit many parameters to them, while extremely rare words might only allow to fit a few
+ 1. Sampling softmax: By using different sampling techniques, e.g. negative sampling, this alternative approximates the normalization in the denominator of the softmax with some other loss that is cheap to compute. However, sampling-based approaches are only useful at training time -- during inference, the full softmax still needs to be computed to obtain a normalised probability.
+ Related articles: https://towardsdatascience.com/how-to-overcome-the-large-vocabulary-bottleneck-using-an-adaptive-softmax-layer-e965a534493d, https://ruder.io/word-embeddings-softmax/index.html#hierarchicalsoftmax
+9. [E] What's the Levenshtein distance of the two words “doctor” and “bottle”?
+ The distance is 4: Replace "d", "c", "o" and "r"
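+
+    A minimal dynamic-programming (Wagner-Fischer) sketch to double-check edit distances like this one.
+
+    ```python
+    # Sketch: standard dynamic programming for Levenshtein distance.
+    def levenshtein(a: str, b: str) -> int:
+        # prev[j] holds the distance between a[:i-1] and b[:j] from the previous row.
+        prev = list(range(len(b) + 1))
+        for i, ca in enumerate(a, start=1):
+            curr = [i]
+            for j, cb in enumerate(b, start=1):
+                cost = 0 if ca == cb else 1
+                curr.append(min(prev[j] + 1,         # deletion
+                                curr[j - 1] + 1,     # insertion
+                                prev[j - 1] + cost)) # substitution
+            prev = curr
+        return prev[-1]
+
+    print(levenshtein("doctor", "bottle"))  # 4
+    ```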
+10. [M] BLEU is a popular metric for machine translation. What are the pros and cons of BLEU?
+ Pros:
+ 1. It is widely used so different models can be compared with one another
+ 2. It is easy to implement. It only needs the target and prediction to calculate the precision for different n-grams
+ Cons:
+ 1. It does not consider semantics and relies only on exact n-gram overlap. This has two problems: it penalises translations that convey the same meaning with different words, and it fails to penalise translations that are semantically wrong but share many n-grams with the reference.
+11. [H] On the same test set, LM model A has a character-level entropy of 2 while LM model B has a word-level entropy of 6. Which model would you choose to deploy?
+ The two numbers are not directly comparable: one is measured per character and the other per word, and entropy alone only measures how uncertain the model is on the test set, not how useful it is. The models should be compared on the same unit (e.g. bits per character) or, better, on metrics relevant to the task they will serve in production.
+12. [M] Imagine you have to train a NER model on the text corpus A. Would you make A case-sensitive or case-insensitive?
+ It depends on the target entities and the data. If we want the model to distinguish between, say, Apple the company and apple the fruit, case should be preserved (case-sensitive). However, if the corpus contains the same entities in inconsistent casing (e.g. user-generated text), enforcing case sensitivity can confuse the model, and a case-insensitive setup may work better.
+13. [M] Why does removing stop words sometimes hurt a sentiment analysis model?
+ Because the removal of some stopwords such as negating words (no, not, etc.), change the semantics. For example, a negative review that says: "Do not buy this product! It is no good" will turn into "Do buy this product! It is good" after removing stopwords which has the exact opposite meaning of the original review.
+14. [M] Many models use relative position embedding instead of absolute position embedding. Why is that?
+ Relative position embeddings encode the distance between tokens rather than their absolute index, so they generalize better to sequence lengths not seen during training, whereas absolute position embeddings are tied to a fixed maximum length.
+15. [H] Some NLP models use the same weights for both the embedding layer and the layer just before softmax. What’s the purpose of this?
+ From: https://paperswithcode.com/method/weight-tying#:~:text=Weight%20Tying%20improves%20the%20performance,that%20it%20is%20applied%20to.
+ Weight Tying improves the performance of language models by tying (sharing) the weights of the embedding and softmax layers. This method also massively reduces the total number of parameters in the language models that it is applied to.
+ Language models are typically comprised of an embedding layer, followed by a number of Transformer or LSTM layers, which are finally followed by a softmax layer. Embedding layers learn word representations, such that similar words (in meaning) are represented by vectors that are near each other (in cosine distance). [Press & Wolf, 2016] showed that the softmax matrix, in which every word also has a vector representation, also exhibits this property. This leads them to propose to share the softmax and embedding matrices, which is done today in nearly all language models.
+#### 8.2.2 Computer vision
+1. [M] For neural networks that work with images like VGG-19, InceptionNet, you often see a visualization of what type of features each filter captures. How are these visualizations created?
+ One common approach for creating these visualizations is to use an algorithm called "feature maximization". This algorithm starts with a random image and then repeatedly applies a small change to the image that maximizes the activation of a particular filter. Through this process, an image is generated that is specifically tailored to activate that filter.
+
+ Another approach is to use "saliency maps". In this approach, an image is fed into the neural network, and the network produces activations for each filter. Next, the gradient of the output of a particular filter with respect to the input image is calculated. This gradient can then be used to create a heatmap where the brightness represents the importance of each pixel in the input image for that filter.
+1. Filter size.
+ 1. [M] How are your model’s accuracy and computational efficiency affected when you decrease or increase its filter size?
+ For some tasks such as object detection, bigger filter sizes are better as they capture more contextual information and the relationship between different parts of the image. However, for segmentation tasks the local features are more important and therefore a smaller filter size may result in higher accuracy.
+ In regards to computational efficiency, the bigger the filter size, the more parameters the model needs to learn and therefore the more computation and memory it requires.
+ 1. [E] How do you choose the ideal filter size?
+ It is common to experiment with different filter sizes and evaluate the model's performance with appropriate metrics; this gives an idea of which filter size works best for the specific task and dataset. If the task is object detection, a bigger kernel size may be better since the context and the relationship between different parts of the image matter. However, if the task is segmentation, smaller sizes are better for preserving spatial information. Another thing to keep in mind is the computational cost: larger kernels introduce more parameters. Generally, it is common to use smaller kernels in the initial layers, where local features are extracted, and larger effective receptive fields in deeper layers to capture more abstract features.
+1. [M] Convolutional layers are also known as “locally connected.” Explain what it means.
+ The term "locally connected" refers to the fact that the neurons in a convolutional layer are connected only to a small region of the input image, rather than to the entire image. Each neuron in a convolutional layer is connected to a small subset of the input image, and these subsets are called "receptive fields". These receptive fields are of the same size and arranged in the same way as the kernel of the convolutional layer and slide over the input image in a process called convolution.
+ A key feature of locally connected layers is that they are able to extract spatial features in the input data that are translation-invariant, meaning they can identify objects and patterns regardless of their location in the image.
+ For example, consider an image of a face, in which the face can appear at different positions in the image. If we use a fully connected layer, the weight of the neurons would have to be adjusted for all possible positions of the face. But by using locally connected layers, the model only needs to learn the features of the face, regardless of its location, making it less computationally expensive and more robust to changes in position.
+1. [M] When we use CNNs for text data, what would the number of channels be for the first conv layer?
+ If the sentence's embedding matrix is treated like a single-channel (grayscale) image, the first conv layer has one input channel. With 1D convolutions over the token sequence, the embedding dimension plays the role of the input channels instead.
+1. [E] What is the role of zero padding?
+ Zero padding is the process of adding zeros around the edges of the input. One reason for it is to control the spatial size of the output, e.g. keep it the same as the input ("same" padding). Another benefit is that edge pixels participate in more convolution windows, so less information at the borders is lost.
+1. [E] Why do we need upsampling? How to do it?
+ Upsampling is needed to restore the desired resolution after downsampling. There are different techniques, some of which are independent of the input data, e.g. nearest neighbors, interpolation, or bed of nails; these copy some of the input values or fill in zeros at certain positions. Another technique, transposed convolution, strides a kernel over the downsampled feature map: each input element is multiplied by the kernel and the overlapping results are summed. The kernel is learned during training, so unlike the other techniques it is dependent on the data.
+ Here's an article with illustrations of these techniques: https://towardsdatascience.com/transposed-convolution-demystified-84ca81b4baba
+1. [M] What does a 1x1 convolutional layer do?
+ A 1x1 convolution mixes information across channels at each spatial location. It is commonly used to reduce (or expand) the number of feature maps, e.g. as a cheap dimensionality reduction before expensive 3x3 or 5x5 convolutions in Inception-style blocks, and it adds non-linearity when followed by an activation.
+1. Pooling.
+ 1. [E] What happens when you use max-pooling instead of average pooling?
+ In average pooling, all the values in the pooling window contribute to the output passed to the next layer. This results in a smoother feature map compared to max pooling, which keeps only the sharpest, highest-activation values.
+ 1. [E] When should we use one instead of the other?
+ It depends on the task and objective. Average pooling will include all the features in the feature map whereas max pooling has data loss and only considers the highest values and misses out on the other details related to the rest of the image. If the task is to detect edges for example, max pooling is a better choice.
+ 1. [E] What happens when pooling is removed completely?
+ Increased computation complexity: Without pooling, the model would have to compute activations for all neurons in the feature maps, leading to an increase in the number of computations and time required to process an input.
+
+ Increased memory usage: The model would need to store activations for all neurons in the feature maps, leading to an increase in memory usage.
+
+ Loss of spatial invariance: Pooling is used to reduce the spatial resolution of feature maps, which helps to make the model invariant to small translations and rotations in the input. Without pooling, the model would be sensitive to small variations in the input.
+ 1. [M] What happens if we replace a 2 x 2 max pool layer with a conv layer of stride 2?
+ Replacing a 2x2 max pool layer with a convolutional layer with a stride of 2 would result in the same spatial downsampling of the feature maps. However, the main difference is that a convolutional layer also learns to extract features from the input data, while a max pooling layer only performs spatial downsampling. Also, the conv layer adds to the number of learnable parameters while max pooling doesn't.
+1. [M] When we replace a normal convolutional layer with a depthwise separable convolutional layer, the number of parameters can go down. How does this happen? Give an example to illustrate this.
+ A good article with example and illutstrations can be found here: https://www.geeksforgeeks.org/depth-wise-separable-convolutional-neural-networks/
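+
+    As a concrete illustrative example: a standard 3x3 convolution mapping 16 input channels to 32 output channels has 3 × 3 × 16 × 32 = 4,608 weights. The depthwise separable version uses a 3x3 depthwise convolution (3 × 3 × 16 = 144 weights) followed by a 1x1 pointwise convolution (1 × 1 × 16 × 32 = 512 weights), for a total of 656 weights, roughly a 7x reduction (ignoring biases). The saving comes from factoring the spatial filtering and the cross-channel mixing into two cheaper steps.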
+1. [M] Can you use a base model trained on ImageNet (image size 256 x 256) for an object classification task on images of size 320 x 360? How?
+ Yes. The different image sizes may change the effective scale and context, which can affect the model's performance, and the model has not seen that aspect ratio before. That said, you can resize or crop the inputs to the expected 256 x 256, or, since most modern backbones are fully convolutional up to the final pooling layer (or use global/adaptive pooling), fine-tune the pretrained model directly on 320 x 360 images.
+1. [H] How can a fully-connected layer be converted to a convolutional layer?
+ The neurons need to be reshaped to have spatial dimensions, and the weights rearranged to match the desired kernel size and number of filters. A good explanation with illustrations can be found here: https://sebastianraschka.com/faq/docs/fc-to-conv.html#:~:text=There%20are%20two%20ways%20to,1x1%20convolutions%20with%20multiple%20channels.
+1. [H] Pros and cons of FFT-based convolution and Winograd-based convolution.
+ FFT-based convolution:
+ Pros:
+ 1. Speeds up convolution, especially for large kernel sizes
+ 1. It can be used to implement convolution operations of any kernel size
+ Cons:
+ 1. It requires a large amount of memory to store the transformed data, which can be a problem for large-scale neural networks.
+ 1. It can be sensitive to numerical errors and rounding issues, which could affect the accuracy of the results.
+ Winograd-based convolution:
+ Pros:
+ 1. Speeds up convolution for small kernel sizes (e.g. 3x3), using fewer multiplications than direct or FFT-based convolution
+ Cons:
+ 1. It is not suitable for large kernel sizes and it's not as general as the FFT-based convolution.
+ 1. It requires a large amount of memory to store the transformed data, which can be a problem for large-scale neural networks.
+#### 8.2.3 Reinforcement learning
+
+> 🌳 **Tip** 🌳
+To refresh your knowledge on deep RL, checkout [Spinning Up in Deep RL](https://spinningup.openai.com/en/latest/) (OpenAI)
+
+
+28. [E] Explain the explore vs exploit tradeoff with examples.
+29. [E] How would a finite or infinite horizon affect our algorithms?
+30. [E] Why do we need the discount term for objective functions?
+31. [E] Fill in the empty circles using the minimax algorithm.
+
+
+
+
+
+32. [M] Fill in the alpha and beta values as you traverse the minimax tree from left to right.
+
+
+
+
+
+33. [E] Given a policy, derive the reward function.
+34. [M] Pros and cons of on-policy vs. off-policy.
+35. [M] What’s the difference between model-based and model-free? Which one is more data-efficient?
+
+#### 8.2.4 Other
+
+36. [M] An autoencoder is a neural network that learns to copy its input to its output. When would this be useful?
+ The main goal of an autoencoder is to learn a latent representation of the input, such that the input can be reconstructed from this compact, low-dimensional latent space.
+ There are many use cases for this:
+ 1. Compression and storage: autoencoders can be used to reduce the size of the input by learning a compact representation and reconstructing it back when needed.
+ 2. Denoising: the encoder portion of an autoencoder can be used to remove the noise from the input while preserving the underlying structure and outputting the denoised version from the decoder.
+ 3. Anomaly detection: the learned representations can be used to identify outliers at inference time, e.g. inputs with an unusually high reconstruction error.
+ 4. Generative models: The learned latent space of the encoder can be sampled and used by the generator to generate new data similar to the inputs.
+ 5. Transfer learning and feature extraction: An encoder from the pretrained autoencoder can be used as an embedder to project the inputs to an embedding space which can be used as features to a classification model for example.
+37. Self-attention.
+ 15. [E] What’s the motivation for self-attention?
+ The motivation behind self-attention is for the model to attend to the parts of the input that are most relevant to the task at hand. It does this by computing a weight for each part of the input: the more relevant parts get higher weights, and these weights determine how much each input component contributes to the representation used for the prediction. (A minimal sketch of the computation appears after this list.)
+ 16. [E] Why would you choose a self-attention architecture over RNNs or CNNs?
+ 1. One limitation of RNNs and CNNs that self-attention resolves is assigning different weights to the inputs based on their relevance.
+ 2. Attention-based models handle long-range dependencies better.
+ 3. Attention-based models process all positions in parallel (no sequential recurrence), so they are more computationally efficient to train than RNNs.
+ 17. [M] Why would you need multi-headed attention instead of just one head for attention?
+ According to the Attention Is All You Need paper (https://arxiv.org/pdf/1706.03762.pdf), multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. In other words, using multiple heads allows the model to learn different types of relationships between elements in the input sequence and to attend at different granularities, which improves performance.
+ 18. [M] How would changing the number of heads in multi-headed attention affect the model’s performance?
+ Depending on the amount of data and the task complexity, increasing the number of heads may improve the model's performance as more heads allows for different types of relationships between the input elements to be learned. Increasing it too much may not be useful as the model may not learn any new representation subspaces. The number is a hyper-parameter that needs to be tuned like any other hyper-parameter.
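+
+    A minimal NumPy sketch of (single-head) scaled dot-product self-attention; the dimensions and random projections are toy values chosen for illustration.
+
+    ```python
+    # Sketch: scaled dot-product self-attention for one head.
+    import numpy as np
+
+    rng = np.random.default_rng(0)
+    seq_len, d_model, d_k = 4, 8, 8
+
+    x = rng.normal(size=(seq_len, d_model))          # token representations
+    W_q = rng.normal(size=(d_model, d_k))
+    W_k = rng.normal(size=(d_model, d_k))
+    W_v = rng.normal(size=(d_model, d_k))
+
+    def softmax(z, axis=-1):
+        z = z - z.max(axis=axis, keepdims=True)      # numerical stability
+        e = np.exp(z)
+        return e / e.sum(axis=axis, keepdims=True)
+
+    Q, K, V = x @ W_q, x @ W_k, x @ W_v
+    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len): relevance of every token to every other
+    weights = softmax(scores, axis=-1)  # attention weights sum to 1 over the keys
+    output = weights @ V                # weighted combination of value vectors
+
+    print(weights.shape, output.shape)  # (4, 4) (4, 8)
+    ```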
+38. Transfer learning
+ 19. [E] You want to build a classifier to predict sentiment in tweets but you have very little labeled data (say 1000). What do you do?
+ There are a number of ways:
+ 1. Create more data through data augmentation techniques such as back-translation, synonym replacement, random insertion/deletion, etc.
+ 2. Use transfer learning: fine-tune a model pre-trained on a large corpus (e.g. a pretrained language model, or an existing sentiment model) on the 1,000 labeled tweets.
+ 20. [M] What’s gradual unfreezing? How might it help with transfer learning?
+ Gradual unfreezing is a way to fine-tune a pretrained model: training starts with most of the pretrained layers frozen (typically only the last layer or the task head is trainable), and earlier layers are progressively unfrozen as training proceeds. This helps transfer learning because the model adapts to the new task gradually while preserving the general-purpose features learned during pretraining, reducing the risk of catastrophic forgetting.
+39. Bayesian methods.
+ 21. [M] How do Bayesian methods differ from the mainstream deep learning approach?
+ The main difference is that Bayesian methods model a probability distribution of the parameters, and therefore model the uncertainty of predictions whereas mainstream deep learning approaches optimize for reducing the training/validation error and do not model uncertainty.
+ 22. [M] How are the pros and cons of Bayesian neural networks compared to the mainstream neural networks?
+ 1. BNNs give uncertainty of predictions
+ 2. BNNs are more interpretable than mainstream NNs
+ 3. BNNs are less prone to overfitting
+ However,
+ 4. BNNs are computationally expensive at inference time, since multiple forward passes (samples of the weights) are needed
+ 5. BNNs require more data than mainstream NNs to model the posteriors
+ 23. [M] Why do we say that Bayesian neural networks are natural ensembles?
+ The model predictions are made by averaging over multiple models, each corresponding to a different set of parameter values drawn from the distributions. This process can be seen as a form of model averaging, where the final prediction is the average of the predictions made by multiple models. This can be seen as "natural ensembles", where the different models correspond to different sets of parameter values. The averaging process allows the model to make probabilistic predictions, and to quantify the uncertainty of its predictions.
+40. GANs.
+ 24. [E] What do GANs converge to?
+ GANs converge to a Nash equilibrium, a stable state in which neither the generator nor the discriminator can improve by unilaterally changing its strategy. At that point the generator produces plausible samples and the discriminator cannot tell whether a sample is real or generated. The original minimax formulation is a two-player zero-sum game.
+ 25. [M] Why are GANs so hard to train?
+ The generator and discriminator can get stuck chasing each other instead of converging. One common failure mode is that the generator learns to produce only a narrow set of samples that fool the discriminator rather than covering the full data distribution, and the discriminator then adapts to exactly those samples; this is known as "mode collapse".
+ There is also the chance of instability where the generator does not generate realistic images and the discriminator is able to identify them as fake.
+
+### 8.3 Training neural networks
+41. [E] When building a neural network, should you overfit or underfit it first?
+ It is better to start simple and first verify that the model can overfit a small subset of the data (drive the training loss close to zero); this confirms the pipeline is wired correctly and the model has enough capacity. Only then add regularization and complexity to control the overfitting.
+42. [E] Write the vanilla gradient update.
+ θ = θ - α * ∇θL(θ)
+ Where:
+ 1. θ is the set of parameters of the network.
+ 2. α is the learning rate, a scalar value that controls the step size of the update.
+ 3. L(θ) is the loss function, which measures the difference between the predicted output and the true output.
+ 4. ∇θL(θ) is the gradient of the loss function with respect to the parameters, which represents the direction of the steepest descent in the parameter space.
+43. Neural network in simple Numpy.
+ 26. [E] Write in plain NumPy the forward and backward pass for a two-layer feed-forward neural network with a ReLU layer in between.
+ https://github.com/MaryFllh/ml_algorithms/tree/main/neural_net (a minimal sketch is also included after this list)
+ 27. [M] Implement vanilla dropout for the forward and backward pass in NumPy.
+ https://github.com/MaryFllh/ml_algorithms/tree/main/neural_net
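+
+    For reference, a minimal NumPy sketch of the forward and backward pass; the toy shapes and the squared-error loss are illustrative assumptions, and the repo linked above has the full implementations (including dropout).
+
+    ```python
+    # Sketch: two-layer MLP (Linear -> ReLU -> Linear) trained with a squared-error loss.
+    import numpy as np
+
+    rng = np.random.default_rng(0)
+    n, d_in, d_hidden, d_out = 16, 4, 8, 1
+
+    X = rng.normal(size=(n, d_in))
+    y = rng.normal(size=(n, d_out))
+
+    W1 = rng.normal(scale=0.1, size=(d_in, d_hidden)); b1 = np.zeros(d_hidden)
+    W2 = rng.normal(scale=0.1, size=(d_hidden, d_out)); b2 = np.zeros(d_out)
+    lr = 1e-2
+
+    for step in range(100):
+        # Forward pass
+        z1 = X @ W1 + b1              # (n, d_hidden)
+        h = np.maximum(z1, 0)         # ReLU
+        y_hat = h @ W2 + b2           # (n, d_out)
+        loss = 0.5 * np.mean((y_hat - y) ** 2)
+
+        # Backward pass (chain rule, averaged over the batch)
+        dy = (y_hat - y) / n          # dL/dy_hat
+        dW2 = h.T @ dy
+        db2 = dy.sum(axis=0)
+        dh = dy @ W2.T
+        dz1 = dh * (z1 > 0)           # ReLU gradient
+        dW1 = X.T @ dz1
+        db1 = dz1.sum(axis=0)
+
+        # Vanilla gradient update
+        W1 -= lr * dW1; b1 -= lr * db1
+        W2 -= lr * dW2; b2 -= lr * db2
+    ```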
+
+44. Activation functions.
+ 28. [E] Draw the graphs for sigmoid, tanh, ReLU, and leaky ReLU.
+ 29. [E] Pros and cons of each activation function.
+ Sigmoid:
+ Pros: Easy to compute, differentiable
+ Cons: Susceptible to vanishing gradients. Gradient values are only significant between -3 and 3 and anything outside of that range is close to zero.
+ tanh:
+ Pros: Easy to compute, differentiable, and symmetric around zero with zero-centered outputs
+ Cons: Same issue with Sigmoid
+ ReLU:
+ Pros: Easy to compute and more computationally efficient than sigmoid or tanh. Doesn't have the saturating property of tanh and sigmoid and converges faster
+ Cons: Dead neurons
+ Leaky ReLU:
+ Pros: Solves the dead neurons problem, easy to compute
+ Cons: The gradient for negative inputs is small (the leakage factor α), which can still slow learning for those units. Also, choosing an appropriate value for α can be tricky and is often done by trial and error.
+ 30. [E] Is ReLU differentiable? What to do when it’s not differentiable?
+ It is not differentiable at x = 0. However, it is safe to consider the derivative at this point 0 because:
+ 1. The exact point at which the function is not differentiable is seldom reached in an algorithm.
+ 2. At the point of non-differentiability, you can assign the derivative of the function at the point “right next” to the singularity and the algorithm will work fine. For example, in ReLU we can give the derivative of the function at zero as 0. It would not make any difference in the backpropagation algorithm because the distance between the point zero and the “next” one is zero.
+ 31. [M] Derive derivatives for sigmoid function $$\sigma(x)$$ when $$x$$ is a vector.
+ $$\sigma'(x) = \sigma(x) \odot (1 - \sigma(x))$$, applied element-wise. Since the sigmoid acts on each component independently, the full Jacobian is a diagonal matrix with these values on the diagonal (a short derivation follows below).
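+
+    A short derivation for a single component (the off-diagonal Jacobian entries are zero because $$\sigma$$ acts element-wise):
+
+    $$\frac{d}{dx}\,\sigma(x) = \frac{d}{dx}\,\frac{1}{1+e^{-x}} = \frac{e^{-x}}{(1+e^{-x})^2} = \sigma(x)\big(1-\sigma(x)\big), \qquad J_{ij} = \frac{\partial\,\sigma(x)_i}{\partial x_j} = \sigma(x_i)\big(1-\sigma(x_i)\big)\,\delta_{ij}$$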
+45. [E] What’s the motivation for skip connections in neural networks?
+ The motivation is to address the problem of vanishing gradients. Skip connections help to address this problem by allowing the gradients to bypass one or more layers in the network and flow directly to the earlier layers. This helps to ensure that the gradients are larger and can more easily flow back through the network, allowing the network to learn more effectively.
+46. Vanishing and exploding gradients.
+ 32. [E] How do we know that gradients are exploding? How do we prevent it?
+ One way to know if the gradients are exploding is to monitor the gradients during training and check if the norm of the gradients (i.e. the magnitude of the gradients) is becoming very large. This can be done by printing the norm of the gradients or by using a tool such as TensorBoard to visualize the gradients.
+ Another indication of exploding gradients is that the loss may become NaN (Not a Number) or inf (infinity), this happens when the gradients grow to large numbers that can't be handled by the computer.
+ There are various ways to prevent exploding gradients:
+ Gradient Clipping: This method clips (rescales) the gradients to a maximum norm or value so they cannot become arbitrarily large (a PyTorch-style sketch is included at the end of this question).
+
+ Weight Initialization: Choosing appropriate weight initialization methods can also help to prevent gradients from exploding. For example, using techniques such as Glorot initialization or He initialization can help to ensure that the weights of the network are initialized to appropriate values.
+
+ Normalization: Use of techniques such as batch normalization can also help to prevent gradients from exploding by normalizing the inputs to the activation functions, which makes the training process more stable.
+
+ Regularization: Using regularization techniques such as L1 or L2 can also prevent gradients from exploding by adding a penalty term to the loss function that discourages large weights.
+ 33. [E] Why are RNNs especially susceptible to vanishing and exploding gradients?
+ RNNs are particularly susceptible to the vanishing and exploding gradients problem because the gradients can flow through multiple time steps, and the weights are multiplied many times over the time steps. This can cause the gradients to either become very small (vanishing gradients) or very large (exploding gradients) as they flow through the network.
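+
+    A minimal PyTorch-style sketch of the gradient clipping mentioned above; the toy model, data, and `max_norm` value are illustrative assumptions.
+
+    ```python
+    # Sketch: clip the global gradient norm before each optimizer step.
+    import torch
+    import torch.nn as nn
+
+    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
+    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
+    loss_fn = nn.MSELoss()
+
+    x, y = torch.randn(64, 10), torch.randn(64, 1)
+
+    for step in range(100):
+        optimizer.zero_grad()
+        loss = loss_fn(model(x), y)
+        loss.backward()
+        # Rescale gradients so their global L2 norm is at most 1.0 before the update.
+        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
+        optimizer.step()
+    ```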
+47. [M] Weight normalization separates a weight vector’s norm from its gradient. How would it help with training?
+ The basic idea behind weight normalization is to normalize the weights of a network so that they have a fixed norm (magnitude) during training. This is done by dividing the weights by their norm so that the magnitude of the weights is fixed, regardless of the values of the gradients.
+ By normalizing the weights in this way, the gradients only need to update the direction of the weights, rather than both the direction and the scale. This can help to make the training process more stable because the gradients only need to adjust the direction of the weights, rather than both the direction and the scale.
+ Additionally, weight normalization can help to improve the generalization of the model, as it makes the model less sensitive to the scale of the weights.
+48. [M] When training a large neural network, say a language model with a billion parameters, you evaluate your model on a validation set at the end of every epoch. You realize that your validation loss is often lower than your train loss. What might be happening?
+ One reason could be that the train loss is accumulated over the batches within the epoch, so it includes the early batches when the weights were still far from a good solution, whereas the validation loss is computed only at the end of the epoch, after many updates, with the improved weights. Another common reason is that regularization such as dropout is active when computing the training loss but disabled at evaluation time, which systematically inflates the train loss relative to the validation loss.
+49. [E] What criteria would you use for early stopping?
+ Validation loss or accuracy: when the validation loss stops decreasing (or the accuracy stops increasing) for a set number of epochs (the "patience"), or starts getting worse, it is an indication that the model is beginning to overfit, and training should be stopped.
+50. [E] Gradient descent vs SGD vs mini-batch SGD.
+ Gradient Descent: The classic (batch) gradient descent algorithm computes the gradients using the entire training dataset before each weight update. The updates are accurate and stable, and for convex losses it converges to the global minimum, but each update requires a full pass over the data, which is very slow for large datasets.
+ Stochastic Gradient Descent (SGD): SGD estimates the gradient from a single randomly chosen training example at each iteration and updates the weights immediately. Each update is very cheap, but the gradient estimates are noisy, so the loss fluctuates; the noise can help escape shallow local minima, but it also makes convergence less stable and more sensitive to the learning rate and initialization.
+ Mini-batch Stochastic Gradient Descent (mini-batch SGD): Mini-batch stochastic gradient descent is a variant of stochastic gradient descent that uses a small, fixed-size subset of the training data, called a mini-batch, to calculate the gradients. It is a trade-off between the computational efficiency of SGD and the accuracy of gradient descent. It has been shown to converge faster and be more stable than pure SGD, and it is the most commonly used optimization algorithm for training neural networks.
+51. [H] It’s a common practice to train deep learning models using epochs: we sample batches from data **without** replacement. Why would we use epochs instead of just sampling data **with** replacement?
+ One reason is that the convergence rate of sampling without replacement is faster (https://arxiv.org/pdf/1202.4184v1.pdf; a short explanation can be found here: https://stats.stackexchange.com/questions/235844/should-training-samples-randomly-drawn-for-mini-batch-training-neural-nets-be-dr).
+ In addition, since we are training only one model (and not multiple like decision trees in a random forest), allowing the model to see as many examples as possible through sampling without replacement reduces bias and makes the model better at generalization.
+52. [M] Your model’ weights fluctuate a lot during training. How does that affect your model’s performance? What to do about it?
+ The fluctuation during training can be a sign the model struggles with convergence and that it has high variance. This can affect the model's accuracy and reliability. There can be a number of reasons why this happens:
+ 1. High learning rate: The weight updates take large steps in the direction of the gradient and creates fluctuation. Reducing the learning rate can help.
+ 2. Small batch size: the smaller the batch, the noisier the gradients, which can cause the weight updates to fluctuate. Increasing the batch size or using gradient accumulation can help: with gradient accumulation, the weights are not updated after every batch but only after a preset number of batches, so the accumulated gradient is less noisy (a short sketch is included below).
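+
+    A minimal sketch of gradient accumulation; the toy model, data, and `accum_steps` value are illustrative assumptions.
+
+    ```python
+    # Sketch: update the weights only every `accum_steps` mini-batches,
+    # so the effective batch size (and gradient stability) is larger.
+    import torch
+    import torch.nn as nn
+
+    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
+    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
+    loss_fn = nn.MSELoss()
+    accum_steps = 4
+
+    optimizer.zero_grad()
+    for step in range(100):
+        x, y = torch.randn(8, 10), torch.randn(8, 1)   # toy mini-batch
+        loss = loss_fn(model(x), y) / accum_steps      # scale so the gradient averages over the window
+        loss.backward()                                # gradients accumulate in .grad
+
+        if (step + 1) % accum_steps == 0:
+            optimizer.step()
+            optimizer.zero_grad()
+    ```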
+53. Learning rate.
+ 34. [E] Draw a graph number of training epochs vs training error for when the learning rate is:
+ 1. too high
+ 2. too low
+ 3. acceptable.
+ 35. [E] What’s learning rate warmup? Why do we need it?
+ Learning rate warmup is a technique used to gradually increase the learning rate during the initial stages of training. The idea is to start with a small learning rate and gradually increase it over a certain number of training steps or epochs.
+ There are several reasons why learning rate warmup can be useful:
+
+ High learning rate instability: When starting with a high learning rate, the model's weights can fluctuate a lot, leading to instability and poor performance. Learning rate warmup allows the model to converge to a stable solution before increasing the learning rate.
+
+ Avoiding poor local minima: Starting with a high learning rate can cause the model to converge to a poor local minimum, rather than a global minimum. Learning rate warmup allows the model to explore the parameter space before settling into a suboptimal solution.
+
+ Gradient sparsity: When the gradients are sparse, it can be hard for the optimizer to make progress with a high learning rate. A warmup period allows the optimizer to converge to a good initial point before increasing the learning rate.
+54. [E] Compare batch norm and layer norm.
+ Batch norm normalizes the output of a layer using the mean and variance computed across the samples in the batch: it computes these statistics for each feature over the whole batch and transforms each sample's feature values accordingly. This means the batch size (and, for sequences, the sequence length) affects batch normalization, and because the statistics depend on all samples in the batch, using batch norm in distributed/parallel settings requires synchronization.
+ Layer norm, on the other hand, is independent of the batch: it computes the mean and variance over the features of each sample separately, so it is better suited to variable-length sequences and to settings where batch statistics are unreliable (e.g. very small batches).
+ Layer norm is therefore more common in NLP/Transformer models where sequence lengths vary, whereas batch norm is more common in computer vision models.
+55. [M] Why is squared L2 norm sometimes preferred to L2 norm for regularizing neural networks?
+ Squared L2 norm has a smooth gradient everywhere, as opposed to L2 norm which has a kink at the origin. This helps with stable updates and faster convergence.
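+
+    Concretely, for a weight vector $$w$$ the two penalties have gradients
+
+    $$\nabla_w \lVert w \rVert_2^2 = 2w, \qquad \nabla_w \lVert w \rVert_2 = \frac{w}{\lVert w \rVert_2}$$
+
+    The former is smooth and linear in $$w$$, while the latter is undefined at $$w = 0$$ and has constant magnitude, which makes the updates less well-behaved.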
+56. [E] Some models use weight decay: after each gradient update, the weights are multiplied by a factor slightly less than 1. What is this useful for?
+ Weight decay is a regularization technique that encourages the model to have smaller weights by applying a small multiplicative factor slightly less than 1 to the weights after each gradient update. This can help to prevent overfitting by preventing the model from becoming too confident in any particular weight.
+57. It’s a common practice for the learning rate to be reduced throughout the training.
+ 36. [E] What’s the motivation?
+ We start with a large learning rate at the beginning of training, when the gradients are large and the model is far from the optimal solution. As training progresses, the gradients become smaller, and a smaller learning rate is needed to make small adjustments to the weights.
+ 37. [M] What might be the exceptions?
+ 1. Fine-tuning a pre-trained model: the model is already close to a good solution, so we typically start with a small learning rate (sometimes with a short warmup) rather than a large one that is then decayed.
+ 2. Stochastic Gradient Descent with momentum or adaptive learning rate optimization methods like Adam, Adagrad or Adadelta: These methods can adapt the learning rate during training, which can help to find the optimal learning rate without the need for explicit learning rate scheduling.
+58. Batch size.
+ 38. [E] What happens to your model training when you decrease the batch size to 1?
+ A batch size of 1 (pure SGD) means the parameters are updated after every single sample. This avoids having to fit a large batch in memory, but it has several drawbacks:
+ 1. High variance: the gradient computed from a single example is very noisy, causing the weights to fluctuate a lot during training.
+ 2. Slower convergence: due to the noisy gradients, convergence is slow and sensitive to the learning rate.
+ 3. Computational inefficiency: updating after every sample cannot exploit vectorized/parallel hardware, so each epoch takes much longer.
+ 39. [E] What happens when you use the entire training data in a batch?
+ Using the entire dataset as a batch (batch gradient descent) gives stable, low-noise gradient estimates, since every update is based on all the data points. However, there are some downsides:
+ 1. Large memory requirements: loading the entire dataset at once may be infeasible depending on the size of the data.
+ 2. Slow updates: since all the data points are used to compute each gradient, every single update is expensive, and the model makes far fewer updates per epoch than mini-batch training.
+ 40. [M] How should we adjust the learning rate as we increase or decrease the batch size?
+ In general, the learning rate should be scaled with the batch size. A larger batch gives less noisy gradient estimates, so a larger learning rate can be used safely (e.g. the linear scaling rule: scale the learning rate proportionally to the batch size, often combined with warmup). A smaller batch produces noisier gradients, so a smaller learning rate is usually needed to keep training stable.
+59. [M] Why is Adagrad sometimes favored in problems with sparse gradients?
+ Adagrad adapts the learning rate of each parameter by dividing a fixed global learning rate by the square root of the cumulative sum of that parameter's squared gradients. Parameters that receive infrequent (sparse) non-zero gradients accumulate a small sum, so their effective learning rate stays relatively large and they still get meaningful updates when a gradient does arrive, while frequently updated parameters take progressively smaller steps.
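+
+    As a sketch of the update (element-wise, with $$g_t$$ the gradient at step $$t$$, $$\eta$$ the global learning rate, and $$\epsilon$$ a small constant for numerical stability):
+
+    $$G_t = \sum_{\tau=1}^{t} g_\tau \odot g_\tau, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t$$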
+60. Adam vs. SGD.
+ 41. [M] What can you say about the ability to converge and generalize of Adam vs. SGD?
+ Adam adapts a per-parameter learning rate using exponentially moving averages of the first and second moments of the gradients, which typically makes it converge faster and makes it less sensitive to the choice of the initial learning rate. SGD (especially with momentum) often generalizes as well as or slightly better than Adam on some tasks, notably in vision, at the cost of more learning-rate tuning.
+ 42. [M] What else can you say about the difference between these two optimizers?
+ Adam typically requires less fine-tuning of the learning rate as compared to SGD, which is especially useful when the dataset is large, or the number of parameters is large.
+61. [M] With model parallelism, you might update your model weights using the gradients from each machine asynchronously or synchronously. What are the pros and cons of asynchronous SGD vs. synchronous SGD?
+ ASGD updates the parameters faster than SSGD because each machine has its own version of the model parameters and updates them as soon as it computes the gradients. However, because it does not use the updates from other machines, the gradients it uses might be stale and overall will require more steps to converge.
+62. [M] Why shouldn’t we have two consecutive linear layers in a neural network?
+ The composition of two linear layers is itself a linear function: stacking them without a non-linearity in between is equivalent to a single linear layer whose weight matrix is the product of the two (W = W2·W1). The second layer therefore adds parameters and computation without adding any expressive power. More generally, without non-linearities between layers the network can only represent linear functions of its input, which limits its ability to learn complex, non-linear relationships.
+63. [M] Can a neural network with only RELU (non-linearity) act as a linear classifier?
+ Yes. ReLU is non-linear, but it is piecewise linear, so a ReLU-only network computes a piecewise-linear function, and with suitable weights it reduces to an exactly linear one: for example, relu(z) − relu(−z) = z, so a pair of ReLU units can pass a linear pre-activation through unchanged, and if the pre-activations stay positive for all inputs of interest, ReLU acts as the identity. A network with only ReLU non-linearities can therefore act as a linear classifier; it just isn't restricted to being one.
+64. [M] Design the smallest neural network that can function as an XOR gate.
+ The smallest such network is a two-layer network (one hidden layer): two inputs, two hidden units, and one output unit. The hidden units use a non-linear activation (e.g. a step or sigmoid function) so the network can represent the non-linearly-separable XOR pattern, and the output unit thresholds a combination of the hidden units to produce the final binary output. A worked set of weights is sketched below.
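+
+    One concrete construction, as a sketch with hand-picked weights and threshold activations (these weights are one possible solution, not the result of training):
+
+    ```python
+    # Sketch: a 2-2-1 network with step activations that computes XOR.
+    def step(z):
+        return 1 if z > 0 else 0
+
+    def xor_net(x1, x2):
+        h1 = step(x1 + x2 - 0.5)    # hidden unit 1 acts as an OR gate
+        h2 = step(x1 + x2 - 1.5)    # hidden unit 2 acts as an AND gate
+        return step(h1 - h2 - 0.5)  # OR and not AND = XOR
+
+    print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
+    ```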
+65. [E] Why don’t we just initialize all weights in a neural network to zero?
+ Because it creates a symmetry problem. If all neurons have the same weight values at initialization, during backpropagation, the gradients will be the same for all neurons and during each iteration the weights will update in the same way for all the neurons, this will not allow the network to learn different features and will not be able to generalize well.
+66. Stochasticity.
+ 43. [M] What are some sources of randomness in a neural network?
+ Weight initialization, dropout, data splitting in batches.
+ 44. [M] Sometimes stochasticity is desirable when training neural networks. Why is that?
+ It can help the models avoid getting stuck in a local minima, and generalize better.
+67. Dead neuron.
+ 45. [E] What’s a dead neuron?
+ Dead neurons are neurons that have become ineffective during training: their output is (almost) always zero and the gradient with respect to their weights is (almost) always zero, so they stop learning. With ReLU this happens when a neuron's pre-activation is negative for every input; since the ReLU gradient is zero there, the neuron can never recover. Dead neurons waste capacity and effectively shrink the network.
+ 46. [E] How do we detect them in our neural network?
+ 1. Monitoring the output of individual neurons: One way to detect dead neurons is to monitor the output of individual neurons during training. If the output of a neuron is always close to zero or has very small gradient, then it could be considered a dead neuron.
+ 2. Visualizing the weights of the network: Another way to detect dead neurons is to visualize the weights of the network. If the weights of a neuron are not updating during training, or if they are consistently close to zero, then it could be considered a dead neuron.
+ 3. Analyzing the gradients: Analyzing the gradients of the weights of the network can also reveal dead neurons. If the gradients for a particular neuron are consistently close to zero, it could be considered a dead neuron.
+ 47. [M] How to prevent them?
+ There are techniques to prevent them, such as using a smaller learning rate, using activation functions with a non-zero gradient everywhere (e.g. leaky ReLU or ELU), and using weight initialization schemes (e.g. He initialization) that keep pre-activations in a reasonable range.
+68. Pruning.
+ 48. [M] Pruning is a popular technique where certain weights of a neural network are set to 0. Why is it desirable?
+ It can be used for model compression: reducing the size of a trained model by removing its least informative components. It can also reduce latency and compute at inference time, provided the hardware or runtime can exploit the resulting sparsity.
+ 49. [M] How do you choose what to prune from a neural network?
+ A threshold can be set on the weight magnitudes: weights whose absolute value falls below the threshold are set to zero (magnitude pruning). Another approach is to monitor activations to identify dead or rarely active neurons/filters and remove them entirely (structured pruning).
+69. [H] Under what conditions would it be possible to recover training data from the weight checkpoints?
+ Weight checkpoints are normally used to resume training after an interruption, and the data itself is not stored in them. However, recovering training data can become possible when the model has memorized parts of the training set, which happens most readily when the model is heavily overparameterized relative to the data or was trained for many epochs on a small dataset; large language models, for example, have been shown to reproduce verbatim training sequences. Model-inversion and membership-inference style attacks exploit exactly this kind of memorization.
+70. [H] Why do we try to reduce the size of a big trained model through techniques such as knowledge distillation instead of just training a small model from the beginning?
+ Larger models can learn richer feature representations than smaller ones, and a small model trained directly from scratch often cannot match them. With techniques such as knowledge distillation, the small model is trained to mimic the large model's outputs (soft targets), so we get much of the large model's representational quality together with the computational efficiency of the small model.
\ No newline at end of file