# Background Information

## What is the difference between supervised and unsupervised learning?

Put simply, supervised learning trains on labelled inputs, while unsupervised learning trains on unlabelled inputs. Supervised learning is therefore typically used for prediction, and unsupervised learning for exploratory analysis of the data.

## What is selection bias?

Selection bias is an error introduced by the way the subjects or samples to be studied are chosen, so the training results depend on the method of collecting samples. When the selection is not random, the sample no longer represents the population and the statistical analysis is distorted and unreliable. A model trained on such data may not work in the real world.

There are four kinds of selection bias:

1. Sampling bias: a systematic error due to a non-random sample, in which some members of the population are more likely to be chosen than others.
2. Time interval: measurements taken at one point in time cannot represent the whole sample set, especially when some subjects are no longer present. The result is likely to contain extreme values.
3. Data: the data is chosen to support a particular conclusion rather than according to generally agreed rules.
4. Attrition: bias caused by attrition (loss of participants), that is, discounting trial subjects or tests that did not run to completion.

## What is bias? What is variance? What is the bias-variance trade-off?

Bias is the error that comes from overly simple assumptions in the model: it measures how far the model's average prediction is from the real value. Variance measures how sensitive the model is to the particular training set: it shows how far the predictions would spread out from their mean value if the model were trained on different samples.

When the model under-fits, the bias is high and the variance is low. When the model over-fits, the bias is low and the variance is high. The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance, but reducing one usually increases the other. This is the reason why we discuss the bias-variance trade-off.

## What are the types of biases that can occur during sampling?

- Selection bias
- Undercoverage bias
- Survivorship bias

## What is survivorship bias?

It is the logical error of focusing on the things that survived some selection process while overlooking those that did not, because the latter are less visible. It can lead to wrong conclusions in many different ways.

## How does a ROC (Receiver Operating Characteristic) curve work?

A ROC curve plots the true positive rate against the false positive rate as the classification threshold varies.

## How does a P-R curve work?

A P-R (precision-recall) curve plots precision (the positive predictive value) against recall (the true positive rate) as the classification threshold varies.

## What is the confusion matrix?

For a binary classifier, the confusion matrix is a 2×2 table that compares the predictions with the true labels. It has four entries: true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN). The confusion matrix is calculated on the testing data.

Therefore, we can get:

1. Error Rate = (FP+FN)/(P+N)
2. Accuracy = (TP+TN)/(P+N)
3. Sensitivity (Recall or True positive rate) = TP/P, where P = TP + FN
4. Specificity (True negative rate) = TN/N, where N = TN + FP
5. Precision (Positive predictive value) = TP/(TP+FP)
6. F-Score:

\[F_{1}=\left(\frac{2}{\operatorname{recall}^{-1}+\operatorname{precision}^{-1}}\right)=2 \cdot \frac{\text { precision } \cdot \text { recall }}{\text { precision }+\text { recall }}\]
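The formulas above can be checked on a small example. The following is a minimal sketch in plain Python; the function names and the toy labels are made up for illustration, and the code assumes that 1 marks the positive class and that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


def classification_metrics(tp, fn, fp, tn):
    """Apply the formulas from the list above (assumes non-zero denominators)."""
    p = tp + fn  # all actual positives
    n = tn + fp  # all actual negatives
    precision = tp / (tp + fp)
    recall = tp / p
    return {
        "error_rate": (fp + fn) / (p + n),
        "accuracy": (tp + tn) / (p + n),
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical test labels: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(*confusion_counts(y_true, y_pred))
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the error rate and the accuracy always sum to 1, since every sample is either classified correctly or not.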

# Statistics

## What is the normal distribution?

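It is also called the Gaussian distribution. Its properties are:

1. It has one mode.
2. The left and right halves have the same shape (it is symmetric).
3. The maximum point is at the mean.
4. The mean is the same as the mode and the median.
5. It is asymptotic [æsɪmp'tɒtɪk]: the tails approach the horizontal axis but never touch it.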

## What are the correlation and covariance?

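Correlation: measures how strongly two variables are linearly related. Its value ranges from -1 to 1 and it has no unit.

Covariance: measures how a pair of variables vary together. Its value ranges from -inf to inf and it carries a unit (the product of the units of the two variables), so its magnitude depends on the scale of the data.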
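The difference in units can be seen by rescaling one variable. This is a small sketch using the sample (n-1) covariance; the function names and the height and weight values are invented for illustration.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equally long sequences (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data: weight rises linearly with height.
heights_cm = [150.0, 160.0, 170.0, 180.0]
weights_kg = [50.0, 58.0, 66.0, 74.0]
heights_m = [h / 100 for h in heights_cm]

# Changing the unit of height changes the covariance by the same factor...
print(covariance(heights_cm, weights_kg), covariance(heights_m, weights_kg))
# ...but leaves the correlation unchanged.
print(correlation(heights_cm, weights_kg), correlation(heights_m, weights_kg))
```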

## What are the point estimates and confidence intervals?

A point estimate gives a single value as the estimate of a population parameter.

A confidence interval is a range of values that is likely to contain the parameter, at a stated confidence level (for example 95%).
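As an illustration, the sample mean is a point estimate of the population mean, and an interval can be built around it. The sketch below assumes a normal approximation (for small samples a t-distribution would usually be more appropriate); the function name and the measurements are hypothetical.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Return (point_estimate, (lower, upper)) for the population mean.

    Uses a normal approximation, which is an assumption of this sketch.
    """
    n = len(sample)
    mean = statistics.fmean(sample)                    # point estimate
    sem = statistics.stdev(sample) / math.sqrt(n)      # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, (mean - z * sem, mean + z * sem)


# Hypothetical measurements.
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
estimate, (low, high) = mean_confidence_interval(sample)
print(f"point estimate: {estimate:.3f}, 95% interval: ({low:.3f}, {high:.3f})")
```

Asking for a higher confidence level widens the interval, because a wider range is needed to be more certain of covering the parameter.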

## What is the goal of A/B testing?

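The goal of A/B testing is to determine whether a change (variant B) produces a statistically significant difference in the results compared with the current version (variant A).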

## What is the p-value?

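The p-value shows the strength of the results: it is the probability of observing results at least as extreme as those measured, assuming the null hypothesis is true. A small p-value (commonly below 0.05) indicates strong evidence against the null hypothesis.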

## What is the statistical power of sensitivity (TP/P) and how do you calculate it?

Sensitivity is equal to TP/(TP+FN).

Sensitivity is used to validate the accuracy of a classifier (Logistic, SVM, Random Forest): it is the fraction of actual positives that the classifier identifies correctly.

https://www.edureka.co/blog/interview-questions/data-science-interview-questions/

## What is the difference between over-fitting and under-fitting?

Over-fitting: the model describes random error or noise instead of the underlying relationship. This typically happens when the model is too complex and has too many parameters.

Under-fitting: the model cannot capture the underlying relationship.

## What is regularisation? Why is it useful?

Regularisation is the process of adding a penalty term, scaled by a tuning parameter, to a model's loss in order to induce smoothness and prevent overfitting. This is most often done by adding a constant multiple of a norm of the weight vector to the loss. The norm is often L1 (Lasso) or L2 (ridge).
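A minimal sketch of the two penalties follows. The names `l1_penalty`, `l2_penalty`, `regularised_loss` and the tuning parameter `lam` are chosen here for illustration and do not come from a specific library.

```python
def l1_penalty(weights, lam):
    """Lasso-style penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge-style penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def regularised_loss(data_loss, weights, lam, kind="l2"):
    """Add the chosen penalty to a loss value computed elsewhere."""
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return data_loss + penalty


weights = [1.0, -2.0, 3.0]
print(regularised_loss(0.5, weights, lam=0.1, kind="l1"))
print(regularised_loss(0.5, weights, lam=0.1, kind="l2"))
```

With `lam = 0` the penalty vanishes and the loss is unchanged; larger values of `lam` push the optimiser towards smaller weights.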

# Normalization

## What is data normalization and why do we need it?

Data normalization is a very important preprocessing step for training. It rescales values to fit in a specific range so that gradient-based training with backpropagation converges better. (The samples are usually assumed to be independent and identically distributed.)

Normally, there are two ways to normalize data:

1. Min-max scaling: the original data is mapped to the interval from 0 to 1. The minimum value is subtracted from each data point, and the result is divided by the difference between the maximum value and the minimum value.
2. Z-score normalization: the mean of the feature is subtracted from each data point, and the result is divided by the standard deviation.
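The two scalings can be sketched for a single feature as follows. The function names are illustrative, and the code assumes the feature is not constant (otherwise the denominators would be zero) and uses the population standard deviation.

```python
import statistics


def min_max_scale(values):
    """Map values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


feature = [10.0, 20.0, 30.0, 40.0, 50.0]
print(min_max_scale(feature))
print(z_score(feature))
```

In practice the minimum, maximum, mean and standard deviation are usually computed on the training data only and then reused for the validation and test data.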

Without normalization, the high-magnitude features are weighted more in the cost function, while the effect of the low-magnitude features becomes insignificant. After normalization, gradient descent usually finds a good solution faster.

Of course, normalization cannot address every optimization problem. It is mainly useful for methods based on gradients, such as linear regression, logistic regression, support vector machines and neural networks. For tree-based models such as decision trees, normalization brings little benefit, because the splits depend only on the ordering of the feature values.

# Network Training

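## Why do we like to use softmax?

The input is a vector of real numbers and the output is a probability distribution: every element is non-negative and the components sum to 1. This is the first benefit. Second, it is easy to differentiate during backpropagation: combined with the cross-entropy loss, the gradient with respect to the inputs is simply the predicted probabilities minus the one-hot target.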
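A minimal softmax in plain Python is shown below. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not mentioned above; it does not change the result.

```python
import math


def softmax(logits):
    """Turn a list of real numbers into a probability distribution."""
    m = max(logits)                              # shift to avoid overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```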

## Can you explain the difference between a Validation Set and a Test Set?

A validation set can be considered part of the training data, as it is used for hyperparameter selection and to avoid overfitting while the model is being built. A test set, on the other hand, is held out and used only for evaluating the performance of the trained machine learning model.

## What is cross-validation?

Cross-validation is a model validation technique for evaluating how the outcomes of statistical analysis will generalize to an independent dataset.
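One common scheme is k-fold cross-validation, where the data is split into k parts and each part serves once as the validation set. The text above does not name a specific scheme, so k-fold is an assumption of this sketch, as is the helper name `k_fold_indices`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_indices(10, 3):
    print("train:", train, "validation:", validation)
```

The model would be trained on each `train` split and scored on the matching `validation` split, and the k scores are then averaged. Shuffling the indices first is usually advisable when the data is ordered.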

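## What is ‘Naive’ in Naive Bayes?

It assumes that the features are conditionally independent of each other given the class, which rarely holds exactly in real data.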

# SVM

## Explain SVM

If you have n features in your training set, SVM plots each sample as a point in n-dimensional space. It then tries to find a *hyperplane* that separates the different classes, based on the provided kernel function.

## What are the support vectors in SVM?

The support vectors are the data points closest to the separating hyperplane. The two boundaries drawn through them run parallel to the hyperplane, and the distance between these two boundaries is called the margin.

## What are the different kernels in SVM?

- Linear kernel
- Polynomial kernel
- Radial basis kernel
- Sigmoid kernel

# Decision Tree

## What is the decision tree?

It splits a large dataset into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

## What are entropy and information gain?

Entropy: a decision tree is built by partitioning the data into increasingly homogeneous subsets, and entropy measures the impurity of a subset. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided between two classes, the entropy is one.

Information Gain: the information gain is the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attributes that return the highest information gain.
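Both quantities can be computed directly from class labels. This is a small sketch using base-2 logarithms; the function names and the toy labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        p = count / total
        result -= p * math.log2(p)
    return result


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print(entropy(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that separates the classes completely recovers all of the parent's entropy, while a split that leaves each child as mixed as the parent gains nothing.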

## What is pruning in Decision Tree?

Pruning is removing sections of the tree that provide little power to classify instances.

# Logistic Regression

## What is logistic regression?

Logistic regression is a logit model that predicts a binary outcome from a linear combination of predictor variables.

## What is Linear Regression?

Linear regression predicts the value of a variable Y from a second variable X by fitting a linear relationship between them.

## What are the drawbacks of the linear model?

- It assumes linearity of the errors.
- It cannot be used directly for categorical outcomes, whether binary or multi-class.
- It cannot address over-fitting on its own.

## How to choose the number of clusters in k-means?

Use the *elbow curve*, which plots the *within-cluster sum of squares* against the number of clusters. The bending point of the curve (the elbow) gives K.

## What is Ensemble Learning?

Ensemble learning combines a diverse set of learners (individual models) to improve the stability and predictive power of the overall model. The common methods are bagging and boosting.

## How Are Weights Initialized in a Network?

Initializing all weights to 0: this makes your model behave like a linear model. All the neurons in every layer perform the same operation and give the same output, which makes the deep net useless.

Initializing all weights randomly: Here, the weights are assigned randomly by initializing them very close to 0. It gives better accuracy to the model since every neuron performs different computations. This is the most commonly used method.

## What Is the Cost Function?

The cost function measures how good your model’s performance is. It is used to compute the error of the output layer during *backpropagation*. We push that error backwards through the neural network and use it to update the weights.

## What Are Hyperparameters?

A hyperparameter is a parameter whose value is set before the learning process begins. It determines how a network is trained and the structure of the network.

## What Will Happen If the Learning Rate Is Set Inaccurately?

When your learning rate is too low, training of the model will progress very slowly as we are making minimal updates to the weights. It will take many updates before reaching the minimum point.

If the learning rate is set too high, the drastic updates to the weights cause undesirable divergent behaviour in the loss function. Training may fail to converge (the model never settles on a good output) or even diverge (the loss grows without bound).
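Both failure modes can be reproduced on a toy problem. The sketch below minimises f(x) = x², whose gradient is 2x; the function, the starting point and the three learning rates are chosen purely for illustration.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimise f(x) = x**2 (gradient 2*x) and return the final x."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x


too_low = gradient_descent(0.001)   # barely moves away from the start
reasonable = gradient_descent(0.1)  # gets close to the minimum at 0
too_high = gradient_descent(1.1)    # overshoots further on every step

print(too_low, reasonable, too_high)
```

For this function each step multiplies x by (1 - 2 × learning rate), so any learning rate above 1 makes the magnitude of x grow on every step.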

## What Are the Different Layers in a CNN?

- Convolutional Layer
- ReLU Layer
- Pooling Layer
- Fully Connected Layer

## What Is Pooling in a CNN, and How Does It Work?

It performs down-sampling operations to reduce the dimensionality and creates a pooled feature map by sliding a filter matrix over the input matrix.
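As an example, max pooling keeps the largest value in each window. The text above does not say which pooling variant is meant, so max pooling, the 2×2 window and the stride of 2 are assumptions of this sketch.

```python
def max_pool_2d(matrix, size=2, stride=2):
    """Slide a size x size window over the matrix and keep each window's maximum."""
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [matrix[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

print(max_pool_2d(feature_map))  # [[6, 4], [8, 9]]
```

The 4×4 input becomes a 2×2 pooled feature map, so each spatial dimension is halved.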

## What are exploding gradients? What are vanishing gradients?

If the error gradients grow exponentially (become very large) as they accumulate and result in very large updates to the neural network weights during training, they are known as exploding gradients. In the extreme case the weights overflow and become NaN.

While training an RNN, the slope (gradient) can also become too small, which makes training difficult. This problem is known as the vanishing gradient. It leads to long training times, poor performance, and low accuracy.

## Explain Back Propagation

- Forward-propagate the training data through the network.
- Compute the error (loss) from the output and the target.
- Back-propagate to compute the derivative of the error with respect to the output activations.
- Use the previously calculated derivatives to compute the derivatives for the earlier layers (chain rule).
- Update the weights.

## What is the role of the Activation Function?

The activation function acts like a switch in the neural network. It introduces non-linearity into the network, which helps it learn more complex functions.

## What is an Auto-Encoder?

Auto-encoders are simple learning networks that aim to transform inputs into outputs with the minimum possible error. This means that we want the output to be as close to input as possible. We add a couple of layers between the input and the output, and the sizes of these layers are smaller than the input layer. The auto-encoder receives unlabelled input which is then encoded to reconstruct the input.

## What Is Dropout and Batch Normalization?

Dropout is a technique of dropping out hidden and visible units of a network randomly to prevent overfitting of data.

Batch normalization is the technique to improve the performance and stability of neural networks by normalizing the inputs in every layer.

# Activation

## What are some examples of activation functions?

Sigmoid:

\[f(z)=\frac{1}{1+\exp (-z)}\]

Derivative of Sigmoid:

\[f^{\prime}(z)=f(z)(1-f(z))\]

Tanh:

\[f(z)=\tanh (z)=\frac{\mathrm{e}^{z}-\mathrm{e}^{-z}}{\mathrm{e}^{z}+\mathrm{e}^{-z}}\]

Derivative of Tanh:

\[f^{\prime}(z)=1-(f(z))^{2}\]

ReLU:

\[f(z)=\max (0, z)\]

Derivative of ReLU:

\[f^{\prime}(z)=\left\{\begin{array}{l} {1, z>0} \\ {0, z \leqslant 0} \end{array}\right.\]
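The derivative formulas can be checked numerically. The sketch below implements each function and its derivative and compares them with a central finite difference; the helper `numerical_grad` and the step size `eps` are illustrative choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2


def relu(z):
    return max(0.0, z)


def relu_grad(z):
    # The derivative at z = 0 is undefined; 0 is used by convention, as above.
    return 1.0 if z > 0 else 0.0


def numerical_grad(f, z, eps=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


z = 0.5
print(sigmoid_grad(z), numerical_grad(sigmoid, z))
print(tanh_grad(z), numerical_grad(math.tanh, z))
print(relu_grad(z), numerical_grad(relu, z))
```

The sigmoid derivative peaks at 0.25 (at z = 0) and is close to zero for inputs of large magnitude, whereas the ReLU derivative stays at 1 for every positive input.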

## What is Gradient Descent?

A gradient measures how much the output of a function changes if you change the inputs a little bit.

Gradient descent is an iterative optimization algorithm that repeatedly moves the parameters a small step in the direction opposite to the gradient of the loss, in order to minimise the loss.

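## Why does the gradient vanish when the Sigmoid (Tanh) activation function is used?

Tanh (output from -1 to 1) is a scaled and shifted version of Sigmoid (output from 0 to 1). For the sigmoid activation, when the input is very negative or very positive the output saturates close to 0 or 1, and the gradient f(z)(1-f(z)) is almost equal to 0. Multiplying many such small gradients across layers makes the overall gradient vanish.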

## What are the benefits and drawbacks of using ReLU?

Benefit:

- Sigmoid and Tanh need to evaluate exponential functions, which requires a lot of computation, whereas ReLU only needs a threshold comparison.

Drawbacks:

## Which loss functions are widely applied?

For a binary classifier, the loss functions are:

1. Hinge loss: \(L_{\text {hinge }}(f, y)=\max \{0,1-f y\}\). However, it is not differentiable at \(fy = 1\).
2. Logistic loss: \(L_{\text {logistic }}(f, y)=\log _{2}(1+\exp (-f y))\). It is sensitive to outliers.
3. Cross entropy: \(L_{\text {cross entropy }}(f, y)=-\log _{2}\left(\frac{1+f y}{2}\right)\).
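The three binary losses translate directly into code. This sketch assumes labels y in {-1, +1}, which is what the product f·y in the formulas suggests, and for the cross entropy it assumes f·y > -1 so that the logarithm is defined.

```python
import math


def hinge_loss(f, y):
    """max{0, 1 - f*y}: zero once the margin f*y reaches 1."""
    return max(0.0, 1.0 - f * y)


def logistic_loss(f, y):
    """log2(1 + exp(-f*y)): smooth, never exactly zero."""
    return math.log2(1.0 + math.exp(-f * y))


def cross_entropy_loss(f, y):
    """-log2((1 + f*y) / 2): assumes f*y > -1."""
    return -math.log2((1.0 + f * y) / 2.0)


for f in (-0.5, 0.0, 0.5, 1.0):
    print(f, hinge_loss(f, 1), logistic_loss(f, 1), cross_entropy_loss(f, 1))
```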

For regression problems, the loss functions are:

1. Square loss: \(L_{\text {square }}(f, y)=(f-y)^{2}\). It is sensitive to outliers.
2. Absolute loss: \(L_{\text {absolute }}(f, y)=|f-y|\). It is not differentiable at \(f=y\).
3. Huber loss: \(L_{\text {Huber }}(f, y)=\left\{\begin{array}{ll} {(f-y)^{2},} & {|f-y| \leqslant \delta} \\ {2 \delta|f-y|-\delta^{2},} & {|f-y|>\delta} \end{array}\right.\)
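A short sketch of the three regression losses follows, using the Huber form given above. The default `delta=1.0` is an arbitrary choice for illustration.

```python
def square_loss(f, y):
    return (f - y) ** 2


def absolute_loss(f, y):
    return abs(f - y)


def huber_loss(f, y, delta=1.0):
    """Quadratic for small residuals, linear for residuals larger than delta."""
    residual = abs(f - y)
    if residual <= delta:
        return residual ** 2
    return 2 * delta * residual - delta ** 2


# A small residual and an outlier-sized residual.
for prediction in (0.5, 10.0):
    print(prediction, square_loss(prediction, 0.0),
          absolute_loss(prediction, 0.0), huber_loss(prediction, 0.0))
```

For the large residual the Huber loss grows linearly rather than quadratically, which is why it is less sensitive to outliers than the square loss, while it remains differentiable at f = y, unlike the absolute loss.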

For classification problem,