Machine Learning in Genomics: MCQs Practice Set

Q.1 Which machine learning technique is commonly used for predicting whether a genetic variant is pathogenic or benign based on its sequence context?

Convolutional Neural Networks

Linear Regression

K-Nearest Neighbors

Decision Trees

Explanation - CNNs can automatically learn spatial patterns in DNA sequences, making them suitable for variant effect prediction, whereas the other methods are less effective at capturing sequence motifs.

Correct answer is: Convolutional Neural Networks

Q.2 What is the primary purpose of the Variant Effect Predictor (VEP) tool in genomic pipelines?

Align sequencing reads to a reference genome

Annotate genetic variants with functional information

Perform differential gene expression analysis

Predict protein tertiary structure

Explanation - VEP is used to annotate variants with predicted effects on genes and proteins; it does not align reads, analyze expression, or predict protein structure.

Correct answer is: Annotate genetic variants with functional information

Q.3 In a genome-wide association study (GWAS), which statistical model is most appropriate for handling high-dimensional SNP data with many correlated features?

Elastic Net Regression

Random Forest

Principal Component Analysis

Linear Mixed Model

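Explanation - Elastic Net combines L1 and L2 penalties to perform variable selection and handle multicollinearity among SNPs in linkage disequilibrium. Linear mixed models are the standard choice for single-SNP association testing with correction for relatedness, but they do not select among correlated SNPs.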

Correct answer is: Elastic Net Regression

Q.4 Which deep learning architecture is particularly effective for modeling long-range dependencies in genomic sequences?

Recurrent Neural Networks (RNNs)

Feedforward Neural Networks

Transformer Models

Support Vector Machines

Explanation - Transformers use self-attention to relate any two positions directly, making them powerful for long genomic sequences, whereas RNNs pass information step by step and can suffer from vanishing gradients over long distances.

Correct answer is: Transformer Models

Q.5 What does the area under the ROC curve (AUC) indicate in a binary classification model for disease risk prediction?

Overall accuracy of the model

Ability to rank positive cases above negative cases across all thresholds

Probability of a positive outcome

Number of misclassifications

Explanation - AUC measures the model's ability to rank positives above negatives across all thresholds; it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. It is not overall accuracy, which depends on a single decision threshold.

Correct answer is: Ability to rank positive cases above negative cases across all thresholds
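
The ranking view of AUC in the explanation above can be sketched in a few lines. This is a minimal illustration, assuming binary labels (1 = disease, 0 = healthy) and made-up risk scores; the function name auc_by_ranking is hypothetical, and the pairwise loop is quadratic, so it is only meant for small examples.

```python
def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative (made-up) labels and risk scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc_by_ranking(labels, scores))
```

In this toy example, 8 of the 9 positive/negative pairs are ordered correctly, so no single threshold is involved in the result.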

Q.6 Which of the following is a common preprocessing step before training a neural network on raw DNA sequence data?

One-hot encoding of nucleotides

Standard scaling of expression values

Principal component analysis of genotype data

Imputing missing phenotypes

Explanation - One-hot encoding transforms nucleotides (A, C, G, T) into binary vectors suitable for neural network input; the other options apply to different data types.

Correct answer is: One-hot encoding of nucleotides
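
As a rough sketch of the encoding described above, the snippet below maps each base to a 4-element binary row. The A, C, G, T column order and the all-zero row for ambiguous bases such as N are assumptions made for this example, not a fixed standard.

```python
NUCLEOTIDES = "ACGT"  # column order assumed for this sketch


def one_hot_encode(sequence):
    """Return one 4-element row per base, in A, C, G, T column order."""
    encoded = []
    for base in sequence.upper():
        # Bases outside A/C/G/T (for example N) become an all-zero row.
        encoded.append([1 if base == nt else 0 for nt in NUCLEOTIDES])
    return encoded


for row in one_hot_encode("ACGTN"):
    print(row)
```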

Q.7 Which metric is most appropriate for evaluating clustering of cell types in single-cell RNA-seq data?

Silhouette Score

Mean Squared Error

Log Loss

Confusion Matrix

Explanation - Silhouette Score measures how similar an object is to its own cluster compared to other clusters, making it ideal for assessing single-cell clustering.

Correct answer is: Silhouette Score

Q.8 What advantage do GPU-accelerated models offer in training deep learning models for whole-genome variant calling?

Reduced memory usage

Faster training times

Higher accuracy inherently

Lower power consumption

Explanation - GPUs parallelize computations, dramatically speeding up training of large genomic models; they do not automatically improve accuracy or reduce memory.

Correct answer is: Faster training times

Q.9 In the context of epigenomics, what is the main goal of using a convolutional neural network to predict DNA methylation patterns?

To identify regulatory motifs influencing methylation

To cluster cell types based on methylation profiles

To impute missing methylation data points

To generate 3D chromatin contact maps

Explanation - CNNs can detect sequence motifs that correlate with methylation status; while they can help impute missing data, the primary aim is motif discovery.

Correct answer is: To identify regulatory motifs influencing methylation

Q.10 Which type of neural network is most suitable for modeling time-series gene expression data across developmental stages?

Convolutional Neural Network

Long Short-Term Memory (LSTM) network

Generative Adversarial Network

Autoencoder

Explanation - LSTMs are designed to capture temporal dependencies in sequential data, making them ideal for developmental gene expression trajectories.

Correct answer is: Long Short-Term Memory (LSTM) network

Q.11 Which loss function is commonly used for multi-label classification in gene function prediction tasks?

Cross-Entropy Loss

Mean Squared Error

Hinge Loss

Binary Cross-Entropy Loss

Explanation - Binary cross-entropy applies independently to each label, suitable for multi-label problems where each gene can have multiple functions.

Correct answer is: Binary Cross-Entropy Loss
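
To make the "independently per label" point concrete, here is a minimal sketch of binary cross-entropy averaged over the labels of one gene. The logits and the three hypothetical function labels are made up for illustration; real frameworks provide numerically safer built-in versions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over labels, each label scored independently."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for z, y in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


# One gene, three hypothetical function labels; the gene has labels 0 and 2.
targets = [1, 0, 1]
print(multilabel_bce([4.0, -4.0, 4.0], targets))   # confident and correct
print(multilabel_bce([0.0, 0.0, 0.0], targets))    # uninformative logits
```

Because each label has its own sigmoid, several labels can be predicted as positive at once, which a single softmax over classes would not allow.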

Q.12 What does the term 'transfer learning' refer to in the context of genomic deep learning?

Transferring data from one species to another

Using a pre-trained model on one dataset to initialize training on a new dataset

Moving the entire pipeline to a cloud environment

Converting raw sequencing data into processed features

Explanation - Transfer learning leverages knowledge from a pre-trained model to improve learning on a related genomic task, reducing data and training time requirements.

Correct answer is: Using a pre-trained model on one dataset to initialize training on a new dataset

Q.13 Which of the following best describes the purpose of data augmentation in genomic sequence modeling?

Increase the number of samples by shuffling sequences randomly

Add synthetic noise to sequencing reads to mimic sequencing errors

Generate reverse complements of DNA sequences

All of the above

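Explanation - Data augmentation for DNA includes reverse complements and simulated sequencing errors to expand training data and improve model robustness; random shuffling is also used, typically to create background or negative examples, since it destroys sequence motifs.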

Correct answer is: All of the above
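
Reverse-complement augmentation is the simplest of these to show in code. The sketch below assumes plain A/C/G/T strings and that the label of interest is strand-independent; the helper names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]


def augment_with_reverse_complement(sequences):
    """Return each sequence followed by its reverse complement."""
    augmented = []
    for seq in sequences:
        augmented.append(seq)
        augmented.append(reverse_complement(seq))
    return augmented


print(augment_with_reverse_complement(["ATGC", "AACG"]))
```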

Q.14 In a variant effect prediction pipeline, which step ensures that variants are accurately mapped to their genomic coordinates?

Read alignment

Variant calling

Genotype refinement

Coordinate conversion using liftover

Explanation - Liftover converts variant coordinates from one genome assembly to another (for example, GRCh37 to GRCh38), so that variants and annotations refer to the same positions; alignment and calling are earlier steps.

Correct answer is: Coordinate conversion using liftover

Q.15 Which machine learning metric is best for evaluating imbalanced classification of rare pathogenic variants?

Accuracy

F1-score

Matthews Correlation Coefficient (MCC)

Mean Absolute Error

Explanation - MCC accounts for all four confusion matrix categories and is robust to class imbalance, unlike accuracy or MAE.

Correct answer is: Matthews Correlation Coefficient (MCC)
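
A small worked case shows why accuracy misleads here. The sketch assumes a made-up set of 100 variants with 5 pathogenic (label 1) and a degenerate model that calls everything benign; returning 0.0 when the MCC denominator is zero is a common convention adopted here as an assumption.

```python
import math


def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: MCC is 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


y_true = [1] * 5 + [0] * 95      # 5 pathogenic, 95 benign (made-up data)
y_pred = [0] * 100               # model that always predicts benign
print(accuracy(y_true, y_pred))  # high, despite finding no pathogenic variant
print(mcc(y_true, y_pred))       # no better than chance
```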

Q.16 Which of the following is NOT a typical input feature for machine learning models predicting CRISPR-Cas9 off-target activity?

Sequence context around the target site

GC content of the guide RNA

Chromatin accessibility scores

Protein folding energy of the Cas9 enzyme

Explanation - Off-target predictions rely on DNA sequence and epigenomic context, not on the folding energy of Cas9.

Correct answer is: Protein folding energy of the Cas9 enzyme

Q.17 In deep learning applied to genomics, what is the main purpose of using a batch normalization layer?

Reduce the number of parameters

Normalize input data distribution across layers

Add regularization to prevent overfitting

Increase model interpretability

Explanation - Batch normalization stabilizes and accelerates training by normalizing layer activations over each mini-batch; it does not directly reduce the number of parameters or improve interpretability.

Correct answer is: Normalize input data distribution across layers

Q.18 Which of the following best explains why convolutional layers are effective for detecting transcription factor binding sites in DNA?

They capture spatial relationships in sequential data

They reduce the dimensionality of the input

They enforce sparsity in the weight matrices

They inherently model evolutionary conservation

Explanation - Convolutions slide across the sequence, identifying motifs (sub-sequences) that indicate binding sites.

Correct answer is: They capture spatial relationships in sequential data

Q.19 What is a primary advantage of using graph neural networks (GNNs) in protein-protein interaction prediction?

They can handle variable-sized input graphs efficiently

They are inherently interpretable

They avoid the need for feature engineering

They require fewer computational resources than CNNs

Explanation - GNNs process graphs of varying sizes and learn node/edge representations, which is advantageous for PPI networks.

Correct answer is: They can handle variable-sized input graphs efficiently

Q.20 Which of the following best describes the role of the 'attention mechanism' in transformer-based models for genomics?

It reduces the number of parameters needed

It selects relevant positions in the input sequence for each output

It ensures model convergence

It normalizes the output probabilities

Explanation - Attention weights highlight which parts of the input sequence influence the prediction, enabling modeling of long-range dependencies.

Correct answer is: It selects relevant positions in the input sequence for each output
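
The idea of weighting input positions per output can be illustrated with scaled dot-product attention on tiny hand-written vectors. This is a simplified sketch: it omits the learned query/key/value projections and the multiple heads used in real transformer layers, and all numbers are made up.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights); weights[i][j] is how much query i attends to position j."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Three sequence positions with 2-dimensional keys and values (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
queries = [[4.0, -4.0]]  # most similar to the key at position 0

outputs, weights = scaled_dot_product_attention(queries, keys, values)
print(weights[0])
print(outputs[0])
```

The weight row sums to one and is largest at position 0, regardless of how far apart the positions are in the sequence.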

Q.21 Which evaluation metric is most suitable for ranking genetic variants by predicted pathogenicity?

Precision

Recall

Average Precision (AP)

Area Under Precision-Recall Curve (AUPRC)

Explanation - AUPRC captures performance across different ranking thresholds, especially valuable when the positive class is rare.

Correct answer is: Area Under Precision-Recall Curve (AUPRC)

Q.22 What does the term 'feature importance' refer to in the context of random forest models applied to genomic data?

The computational time required to calculate each feature

The frequency with which a feature is used to split nodes across all trees

The statistical significance of each feature

The number of missing values in each feature

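Explanation - Feature importance reflects how much a feature contributes to the splits across all trees, indicating its influence on model predictions. Common implementations weight each split by the impurity decrease it achieves rather than counting splits alone.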

Correct answer is: The frequency with which a feature is used to split nodes across all trees

Q.23 Which of the following is a commonly used regularization technique to prevent overfitting in deep genomic models?

Dropout

Batch Size Reduction

Increasing the number of layers

Data concatenation

Explanation - Dropout randomly deactivates neurons during training, forcing the network to learn robust features and reducing overfitting.

Correct answer is: Dropout

Q.24 In a multi-task learning setup for predicting both variant pathogenicity and gene expression impact, which loss function strategy is most appropriate?

Weighted sum of task-specific losses

A single combined loss with joint labels

A hierarchical loss where one task guides the other

Separate models with no shared parameters

Explanation - Multi-task learning typically balances losses for each task via weights to train a shared representation.

Correct answer is: Weighted sum of task-specific losses

Q.25 Which of the following is a key challenge in applying deep learning to single-cell RNA-seq data?

High levels of dropout leading to sparse gene expression matrices

Limited availability of labeled data for supervised training

Inability to capture continuous developmental trajectories

All of the above

Explanation - Single-cell data is sparse, often lacks labels, and requires methods that capture continuous transitions.

Correct answer is: All of the above

Q.26 Which computational technique is frequently employed to speed up convolution operations on DNA sequences in GPU implementations?

Fast Fourier Transform (FFT) based convolution

Direct matrix multiplication

Recursive convolution

Sparse convolution

Explanation - FFT turns convolution into element-wise multiplication in the frequency domain, which can improve speed, especially for long sequences and large kernels on GPUs.

Correct answer is: Fast Fourier Transform (FFT) based convolution
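
The convolution theorem behind this answer can be checked numerically. The sketch below uses a naive O(n^2) discrete Fourier transform written out by hand purely for illustration; a real implementation would call an FFT routine, and the signal and kernel values are made up. Note that it computes a true (flipped-kernel) convolution, whereas deep learning layers usually compute cross-correlation.

```python
import cmath


def dft(values, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT in this sketch)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = 0j
        for t, v in enumerate(values):
            total += v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(total / n if inverse else total)
    return out


def direct_convolution(signal, kernel):
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, w in enumerate(kernel):
            out[i + j] += s * w
    return out


def fourier_convolution(signal, kernel):
    # Zero-pad both inputs to the full output length, multiply spectra, invert.
    n = len(signal) + len(kernel) - 1
    a = list(signal) + [0.0] * (n - len(signal))
    b = list(kernel) + [0.0] * (n - len(kernel))
    product = [x * y for x, y in zip(dft(a), dft(b))]
    return [value.real for value in dft(product, inverse=True)]


signal = [1.0, 0.0, 0.0, 1.0, 1.0]   # e.g. one channel of a one-hot sequence
kernel = [0.5, -1.0, 0.25]           # made-up filter weights
print(direct_convolution(signal, kernel))
print([round(x, 6) for x in fourier_convolution(signal, kernel)])
```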

Q.27 In variant calling pipelines, what is the main benefit of using a probabilistic model like HaplotypeCaller over simple pileup methods?

It requires less computational resources

It models read alignment uncertainty and haplotype structure

It provides deterministic outputs

It eliminates the need for base quality scores

Explanation - Probabilistic haplotype callers incorporate alignment ambiguity and infer haplotypes, increasing accuracy.

Correct answer is: It models read alignment uncertainty and haplotype structure

Q.28 Which of the following is NOT typically an input feature for a model predicting enhancer activity?

DNA sequence motif presence

Chromatin accessibility (ATAC-seq peaks)

Gene expression levels of neighboring genes

CpG methylation status

Explanation - Enhancer activity models primarily use sequence and epigenomic signals; gene expression is usually an output or downstream analysis.

Correct answer is: Gene expression levels of neighboring genes

Q.29 What is the primary role of a 'softmax' layer in a deep learning model for multi-class classification of genetic variants?

To ensure output probabilities sum to one

To reduce dimensionality of the input

To regularize the network

To provide non-linear activation

Explanation - Softmax converts raw logits into a probability distribution over classes, useful for multi-class variant classification.

Correct answer is: To ensure output probabilities sum to one

Q.30 Which algorithm is best suited for unsupervised clustering of high-dimensional gene expression profiles?

k-means clustering

Hierarchical clustering

t-SNE followed by DBSCAN

Linear Regression

Explanation - t-SNE reduces dimensionality while preserving local structure, and DBSCAN can identify clusters without specifying k.

Correct answer is: t-SNE followed by DBSCAN

Q.31 In the context of DNA sequencing, what does 'base calling' refer to?

Assigning a nucleotide identity to each sequencing signal

Aligning reads to a reference genome

Calling genetic variants from aligned reads

Estimating read quality scores

Explanation - Base calling converts raw sensor data into nucleotide sequences (A, C, G, T).

Correct answer is: Assigning a nucleotide identity to each sequencing signal

Q.32 Which of the following best describes 'imputation' in genomic datasets?

Adding synthetic data points to increase dataset size

Predicting missing genotype values

Correcting sequencing errors

Aligning sequences to a reference genome

Explanation - Imputation estimates missing genotype data, improving downstream analyses.

Correct answer is: Predicting missing genotype values

Q.33 Which neural network architecture is particularly adept at capturing hierarchical representations in genomic data?

Recurrent Neural Networks

Convolutional Neural Networks with multiple layers

Feedforward Networks

Support Vector Machines

Explanation - Deep CNNs can learn hierarchical motifs from raw DNA sequences.

Correct answer is: Convolutional Neural Networks with multiple layers

Q.34 What is the main advantage of using a 'multi-head attention' mechanism in transformer models for genomic sequence analysis?

It allows the model to focus on multiple positions in the sequence simultaneously

It reduces the number of parameters required

It enforces sparsity in the attention matrix

It ensures faster convergence

Explanation - Multi-head attention provides multiple parallel attention distributions, capturing diverse sequence features.

Correct answer is: It allows the model to focus on multiple positions in the sequence simultaneously

Q.35 Which type of machine learning model is best suited for predicting continuous phenotypes from genomic data?

Regression models

Classification models

Clustering models

Dimensionality reduction models

Explanation - Regression models output continuous values, making them suitable for quantitative trait prediction.

Correct answer is: Regression models

Q.36 In a convolutional neural network designed to predict splice sites, why are zero-padding and stride parameters critical?

They control the output feature map size and preserve spatial resolution

They regularize the network during training

They enforce sparsity in the weights

They are not important; any values work

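Explanation - Padding preserves sequence length at the edges, while stride sets how far the filter moves at each step; together with the kernel size, they determine the output length and the positional resolution of the feature map.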

Correct answer is: They control the output feature map size and preserve spatial resolution
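
The effect of padding and stride on feature map size follows the usual 1D convolution length formula, floor((L + 2p - k) / s) + 1 for dilation 1. The helper below is a small sketch of that arithmetic; the 400 bp window and kernel size 11 are arbitrary example values.

```python
def conv1d_output_length(seq_len, kernel_size, stride=1, padding=0):
    """Output length of a 1D convolution with dilation 1."""
    if seq_len + 2 * padding < kernel_size:
        raise ValueError("kernel is longer than the padded sequence")
    return (seq_len + 2 * padding - kernel_size) // stride + 1


# A 400 bp window around a candidate splice site, kernel size 11.
print(conv1d_output_length(400, 11, stride=1, padding=0))  # edges are lost
print(conv1d_output_length(400, 11, stride=1, padding=5))  # length preserved
print(conv1d_output_length(400, 11, stride=2, padding=5))  # resolution halved
```

With padding 5 and stride 1 every input position keeps its own output, which matters when predictions are made per nucleotide.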

Q.37 Which of the following is a key advantage of using deep learning over traditional rule-based methods for predicting protein-DNA binding affinities?

Requires no domain knowledge

Can automatically learn complex sequence dependencies

Always provides higher accuracy

Is computationally cheaper

Explanation - Deep learning discovers intricate patterns without hand-crafted rules, though computational cost may be higher.

Correct answer is: Can automatically learn complex sequence dependencies

Q.38 What does the 'L1 regularization' term in a loss function promote in the context of feature selection for genomic models?

Sparsity of the weight vector

Large weight values

Uniform distribution of weights

High model complexity

Explanation - L1 penalty drives many weights to zero, effectively selecting a subset of features.

Correct answer is: Sparsity of the weight vector
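
One way to see where the zeros come from is the soft-thresholding step used by proximal solvers for the L1 penalty, shown here next to the corresponding shrinkage step for a squared (L2) penalty. This is a sketch of those two update rules only, not a full training loop, and the weights are made-up values.

```python
def soft_threshold(weight, lam):
    """Proximal step for the penalty lam * |w|: small weights become exactly zero."""
    if weight > lam:
        return weight - lam
    if weight < -lam:
        return weight + lam
    return 0.0


def ridge_shrink(weight, lam):
    """Proximal step for the penalty (lam / 2) * w**2: weights shrink but stay nonzero."""
    return weight / (1.0 + lam)


weights = [2.5, -0.3, 0.1, -1.2, 0.05]  # hypothetical per-SNP weights
lam = 0.5
print([soft_threshold(w, lam) for w in weights])
print([ridge_shrink(w, lam) for w in weights])
```

Three of the five weights are set exactly to zero by the L1 step, while the L2 step only scales them down.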

Q.39 Which of the following best describes the purpose of the 'early stopping' strategy during training of a genomic neural network?

Prevent overfitting by halting training when validation performance degrades

Accelerate training by stopping after a fixed number of epochs

Ensure the model achieves perfect training accuracy

Guarantee convergence to a global optimum

Explanation - Early stopping monitors a validation metric and stops training when performance stops improving.

Correct answer is: Prevent overfitting by halting training when validation performance degrades
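
The stopping rule can be sketched as a loop over per-epoch validation losses. The patience and min_delta parameters and the loss values below are illustrative assumptions; deep learning frameworks provide their own callbacks with similar options.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch) as 0-based epoch indices.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improvement until epoch 2, then degradation.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
print(early_stopping_epoch(val_losses, patience=2))
```

In practice the weights saved at the best epoch, not the stop epoch, are the ones restored.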

Q.40 In the context of genomics, what does the term 'k-mer' refer to?

A specific type of DNA sequencing technology

A subsequence of length k nucleotides

A gene located on chromosome k

A statistical test for variant significance

Explanation - k-mers are contiguous sequences of k nucleotides used for sequence analysis and feature extraction.

Correct answer is: A subsequence of length k nucleotides
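
Extracting k-mers amounts to sliding a window of width k along the sequence one base at a time. A minimal sketch, with a hypothetical helper name:

```python
def kmers(sequence, k):
    """Return all overlapping substrings of length k, left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("ATGCGA", 3))
```

A sequence of length L yields L - k + 1 k-mers, so "ATGCGA" with k = 3 gives four of them.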

Q.41 Which of the following is a major bottleneck when training deep learning models on whole-genome sequencing data?

Limited GPU memory for large input matrices

Scarcity of labeled variants

Difficulty in interpreting model predictions

High cost of sequencing equipment

Explanation - The sheer size of whole-genome data strains GPU memory limits, so the data typically have to be partitioned into windows or streamed in batches.

Correct answer is: Limited GPU memory for large input matrices

Q.42 Which algorithm is commonly used for deconvolving bulk RNA-seq data to estimate cell type proportions?

CIBERSORT

Random Forest

Linear Regression

K-Means

Explanation - CIBERSORT uses nu-support vector regression with a reference signature matrix to estimate cell-type proportions from bulk expression profiles.

Correct answer is: CIBERSORT

Q.43 What is the primary objective of using a 'generative adversarial network' (GAN) in genomics?

To classify variants as pathogenic or benign

To generate realistic synthetic genomic sequences

To predict protein tertiary structure

To perform dimensionality reduction

Explanation - GANs can produce synthetic data that mimics real genomic distributions for training or data augmentation.

Correct answer is: To generate realistic synthetic genomic sequences

Q.44 Which of the following is a key benefit of using the 'Adam' optimizer over plain stochastic gradient descent (SGD) in genomic deep learning?

It requires fewer epochs to converge

It eliminates the need for learning rate scheduling

It uses adaptive learning rates for each parameter

It guarantees finding the global minimum

Explanation - Adam adapts step sizes based on gradient history, often speeding up convergence compared to vanilla SGD.

Correct answer is: It uses adaptive learning rates for each parameter

Q.45 Why is it important to perform 'data normalization' before feeding gene expression data into a machine learning model?

To ensure all features have the same scale

To increase the number of samples

To reduce the dimensionality

To prevent data leakage

Explanation - Normalization removes scale differences, making training stable and improving convergence.

Correct answer is: To ensure all features have the same scale

Q.46 Which of the following best explains the concept of 'transfer learning' in the context of genomic image data (e.g., Hi-C contact maps)?

Using a model trained on protein structures to analyze Hi-C maps

Fine-tuning a pre-trained convolutional network on Hi-C data

Transferring Hi-C data to a different species

Transferring the entire pipeline to cloud storage

Explanation - Transfer learning leverages knowledge from models trained on related image-like data to improve performance on Hi-C images.

Correct answer is: Fine-tuning a pre-trained convolutional network on Hi-C data

Q.47 Which statistical test is typically used to assess the significance of differential expression between two groups in RNA-seq studies?

Student's t-test

Wilcoxon signed-rank test

DESeq2's negative binomial test

Chi-squared test

Explanation - DESeq2 models count data with a negative binomial distribution, suitable for RNA-seq differential expression.

Correct answer is: DESeq2's negative binomial test

Q.48 What is the primary purpose of the 'attention mechanism' in a transformer model trained on genomic sequences?

To compute the loss function

To focus on relevant parts of the sequence for each prediction

To reduce the size of the training dataset

To enforce sequence conservation

Explanation - Attention weights highlight which sequence positions contribute most to the prediction.

Correct answer is: To focus on relevant parts of the sequence for each prediction

Q.49 Which of the following is an example of a semi-supervised learning approach in genomics?

Training a model only on labeled data

Using a mixture of labeled and unlabeled data with pseudo-labels

Clustering all data points

Using only unsupervised dimensionality reduction

Explanation - Semi-supervised learning leverages unlabeled data by generating pseudo-labels to improve model performance.

Correct answer is: Using a mixture of labeled and unlabeled data with pseudo-labels

Q.50 In a machine learning pipeline for predicting enhancer-promoter interactions, which feature is least likely to be useful?

Chromatin interaction frequency from Hi-C data

DNA methylation status near the enhancer

Protein-coding gene length

Transcription factor binding motif presence

Explanation - Enhancer-promoter interaction depends on epigenetic and motif features, not on gene length.

Correct answer is: Protein-coding gene length

Q.51 Which of the following best describes 'dropout' regularization?

Randomly removing a subset of input features during training

Randomly deactivating neurons in hidden layers during training

Adding Gaussian noise to the input data

Increasing the learning rate during training

Explanation - Dropout prevents co-adaptation of neurons by randomly setting activations to zero during each update.

Correct answer is: Randomly deactivating neurons in hidden layers during training

Q.52 Which evaluation metric would you use to assess the performance of a regression model predicting gene expression levels from genomic features?

Accuracy

Root Mean Squared Error (RMSE)

Area Under the ROC Curve (AUC)

Precision

Explanation - RMSE is the square root of the mean squared difference between predicted and actual expression values, so large errors are penalized more heavily; the other options are classification metrics.

Correct answer is: Root Mean Squared Error (RMSE)
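
For reference, RMSE can be computed directly from its definition. The sketch below also computes mean absolute error for comparison; the expression values are made up, and the helper names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error, shown only for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


measured = [1.0, 2.0, 3.0]    # e.g. log-scale expression values (made up)
predicted = [1.0, 2.0, 5.0]   # one large miss on the third gene
print(rmse(measured, predicted))
print(mae(measured, predicted))
```

The single large miss raises RMSE more than MAE, which is the sense in which RMSE emphasizes large errors.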

Q.53 What does the 'softmax temperature' parameter control in a classification neural network?

The learning rate of the softmax layer

The spread of the predicted probability distribution

The number of output classes

The regularization strength

Explanation - A higher temperature yields a softer probability distribution; a lower temperature sharpens predictions.

Correct answer is: The spread of the predicted probability distribution
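
The effect of temperature is easy to see by dividing the logits by T before applying softmax. A minimal sketch with made-up logits for three hypothetical variant classes:

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three classes
for t in (0.5, 1.0, 5.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])
```

The probabilities always sum to one; lowering T concentrates mass on the top class, while raising T moves the distribution toward uniform.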

Q.54 Which type of model is typically used for imputation of missing genotypes in large-scale genome-wide association studies?

Naïve Bayes

Hidden Markov Models (HMMs)

K-Nearest Neighbors

Linear Regression

Explanation - HMMs model linkage disequilibrium to infer missing genotypes across SNPs.

Correct answer is: Hidden Markov Models (HMMs)

Q.55 Which of the following best describes a 'feature importance plot' generated from a random forest model?

A scatter plot of model predictions

A histogram of feature values

A bar chart showing the relative influence of each feature

A line plot of loss over epochs

Explanation - Feature importance plots visualize which input variables most influence the model's decisions.

Correct answer is: A bar chart showing the relative influence of each feature

Q.56 What is the main goal of using a 'convolutional autoencoder' on genomic sequences?

To classify variants into pathogenic groups

To compress sequences into a lower-dimensional representation

To generate synthetic DNA sequences

To identify splice sites directly

Explanation - Autoencoders learn latent representations that capture essential sequence information for downstream tasks.

Correct answer is: To compress sequences into a lower-dimensional representation

Q.57 Which of the following best describes the concept of 'cross-validation' in the context of machine learning on genomic data?

Using the entire dataset for training and testing simultaneously

Splitting data into training and test sets multiple times to assess model generalizability

Applying the model only on a subset of the data

Removing features that cross over between training and test sets

Explanation - Cross-validation provides a robust estimate of performance by training on various data splits.

Correct answer is: Splitting data into training and test sets multiple times to assess model generalizability
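
The repeated splitting can be sketched as index bookkeeping for k-fold cross-validation. This version is deliberately simple: it does not shuffle, and it ignores grouping, so for genomic data one would typically also keep related samples or whole chromosomes within a single fold to limit leakage. The function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds


for train_idx, test_idx in k_fold_indices(10, 3):
    print("train:", train_idx, "test:", test_idx)
```

Each sample appears in exactly one test fold, so every sample is used for evaluation once and for training k - 1 times.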

Q.58 In a deep learning model for predicting CRISPR-Cas9 off-target effects, which input representation is commonly used?

One-hot encoded gRNA sequence

3D protein structure of Cas9

Gene expression levels in target cells

Chromatin accessibility heatmap

Explanation - The model primarily uses the guide RNA sequence, optionally combined with chromatin accessibility data.

Correct answer is: One-hot encoded gRNA sequence

Q.59 Which of the following is a key challenge when applying deep learning to long genomic sequences (~1 Mb) on standard GPUs?

Excessive computational speed

Limited memory for storing intermediate activations

Insufficient training data

Difficulty in interpreting results

Explanation - Long sequences produce large activation tensors at every layer, and these must be kept in memory for backpropagation, which can exceed GPU memory limits.

Correct answer is: Limited memory for storing intermediate activations

Q.60 In the context of genome assembly, what is the purpose of a 'k-mer frequency table'?

To measure read quality scores

To identify common subsequences for graph construction

To count the number of genes on a chromosome

To calculate GC content across the genome

Explanation - k-mer tables are used to build de Bruijn graphs for assembly by counting subsequence occurrences.

Correct answer is: To identify common subsequences for graph construction
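
The link between a k-mer table and graph construction can be sketched as follows: count k-mers across reads, then treat each sufficiently frequent k-mer as an edge from its (k-1)-base prefix to its (k-1)-base suffix. The reads, the min_count filter, and the function names are illustrative assumptions; real assemblers add error correction and graph simplification on top of this.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_edges(counts, min_count=1):
    """Each retained k-mer becomes an edge: (k-1)-prefix -> (k-1)-suffix."""
    return sorted((kmer[:-1], kmer[1:])
                  for kmer, count in counts.items() if count >= min_count)


reads = ["ATGCG", "TGCGA"]  # two made-up overlapping reads
counts = kmer_counts(reads, 3)
print(dict(counts))
print(de_bruijn_edges(counts, min_count=2))
```

Filtering on count is one simple way low-frequency k-mers, which often come from sequencing errors, are kept out of the graph.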

Q.61 Which of the following is NOT a typical step in a variant effect prediction pipeline?

Read alignment

Variant calling

Variant annotation

Gene expression quantification

Explanation - Variant effect prediction focuses on calling and annotating variants; expression quantification is a separate analysis.

Correct answer is: Gene expression quantification

Q.62 Which machine learning approach would you use to predict whether a DNA region is a promoter based on its sequence?

Support Vector Machine with a string kernel

Linear regression on GC content

Decision tree on gene length

Clustering on variant counts

Explanation - String kernels enable SVMs to capture sequence patterns relevant to promoter prediction.

Correct answer is: Support Vector Machine with a string kernel

Q.63 In the evaluation of a variant pathogenicity model, why is the precision-recall curve often preferred over the ROC curve?

Precision-recall curves are easier to compute

They provide more insight when the positive class is rare

They require less data

They are standardized for genomic studies

Explanation - PR curves focus on positive predictions, which is critical for imbalanced datasets like pathogenic variants.

Correct answer is: They provide more insight when the positive class is rare

Q.64 What does 'L2 regularization' (ridge) do in the context of a regression model for predicting gene expression?

Encourages sparsity in the coefficients

Adds a penalty on the sum of squared coefficients to prevent overfitting

Normalizes the input features

Increases the learning rate

Explanation - L2 regularization shrinks coefficients, reducing variance and improving generalization.

Correct answer is: Adds a penalty on the sum of squared coefficients to prevent overfitting

Q.65 Which of the following best describes 'gene set enrichment analysis' (GSEA) in computational biology?

A method to identify overrepresented biological pathways in a list of genes

A technique to predict gene function from sequence motifs

A clustering method for gene expression data

A method to measure read alignment quality

Explanation - GSEA tests whether predefined gene sets show statistically significant differences between phenotypes.

Correct answer is: A method to identify overrepresented biological pathways in a list of genes

Q.66 Which of the following is an advantage of using a 'graph convolutional network' (GCN) for protein-protein interaction prediction?

It can directly process 3D structural data

It can incorporate the connectivity structure of interaction networks

It does not require any training data

It reduces the need for hyperparameter tuning

Explanation - GCNs propagate information along edges in a graph, capturing network topology relevant for PPI.

Correct answer is: It can incorporate the connectivity structure of interaction networks

Q.67 Which technique is used to transform categorical variables into numeric form for machine learning models?

Standardization

One-Hot Encoding

Normalization

Log Transformation

Explanation - One-hot encoding turns categories into binary vectors suitable for models that require numeric input.

Correct answer is: One-Hot Encoding

Q.68 What does the term 'variant calling' refer to in genomics?

Identifying genetic variants from sequencing data

Predicting gene function

Aligning sequences to a reference

Annotating regulatory regions

Explanation - Variant calling detects differences (SNPs, indels) between sample and reference genomes.

Correct answer is: Identifying genetic variants from sequencing data

Q.69 Which machine learning method is commonly used for predicting whether a gene is expressed or not?

Decision Tree

Linear Regression

Principal Component Analysis

K-Means Clustering

Explanation - Decision trees can handle classification tasks, such as predicting binary expression status.

Correct answer is: Decision Tree

Q.70 What is a 'k-mer' in DNA sequence analysis?

A specific enzyme used in sequencing

A sequence of k nucleotides

A type of genetic mutation

A statistical test for variant significance

Explanation - k-mers are substrings of length k extracted from DNA, useful for counting and pattern detection.

Correct answer is: A sequence of k nucleotides
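
A minimal sketch of k-mer extraction and counting, assuming overlapping k-mers taken with a sliding window of step 1 (the function name `count_kmers` is chosen for this example):

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every overlapping substring of length k in seq."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = seq.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

kmer_counts = count_kmers("ATGATG", 3)
print(kmer_counts)  # ATG appears twice; TGA and GAT once each
```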

Q.71 Which of the following is a common preprocessing step for DNA sequences before feeding them into a neural network?

One-hot encoding

Fourier transform

DNA methylation measurement

Protein structure prediction

Explanation - One-hot encoding converts each nucleotide into a binary vector for neural network input.

Correct answer is: One-hot encoding
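
The encoding can be sketched as follows. The column order A, C, G, T and the choice to encode unknown bases such as N as an all-zero row are assumptions for this example; other pipelines use different conventions.

```python
def one_hot_encode(seq, alphabet="ACGT"):
    """Return one row per base, with a 1 in the column of that base."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        if base in index:  # bases outside the alphabet (e.g. N) stay all-zero
            row[index[base]] = 1
        encoded.append(row)
    return encoded

encoded = one_hot_encode("ACN")
# encoded == [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
```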

Q.72 What does 'GC content' refer to in a DNA sequence?

The proportion of adenine and thymine bases

The proportion of guanine and cytosine bases

The total length of the sequence

The number of genes in the sequence

Explanation - GC content is calculated as (G+C)/(A+T+G+C).

Correct answer is: The proportion of guanine and cytosine bases
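
The formula in the explanation translates directly into code. This sketch assumes that ambiguous bases (such as N) are excluded from the denominator, which matches the (A+T+G+C) form above:

```python
def gc_content(seq):
    """Fraction of G and C among the A, C, G, T bases of seq."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        return 0.0  # no countable bases; returning 0.0 is an assumed convention
    return (seq.count("G") + seq.count("C")) / total

print(gc_content("ATGC"))  # 0.5
```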

Q.73 Which type of neural network is particularly good at modeling sequential data like DNA?

Convolutional Neural Network

Recurrent Neural Network

Feedforward Neural Network

Autoencoder

Explanation - RNNs can handle sequences by maintaining a hidden state across positions.

Correct answer is: Recurrent Neural Network

Q.74 In a simple classification model for disease prediction, what does an 'accuracy of 80%' mean?

The model predicts 80% of samples correctly

The model has an 80% chance of being correct on any prediction

The model is 80% faster than other models

The model uses 80% of the available features

Explanation - Accuracy is the fraction of correct predictions over total predictions.

Correct answer is: The model predicts 80% of samples correctly

Q.75 Which of the following is NOT a typical input feature for predicting whether a DNA region is a promoter?

GC content

Motif presence

Gene length

Chromatin accessibility

Explanation - Promoter prediction relies on sequence and epigenomic features; gene length is not directly relevant.

Correct answer is: Gene length

Q.76 What is the main purpose of 'cross-validation' in machine learning?

To validate the data source

To test the model on the training set only

To evaluate model performance on unseen data

To reduce the dataset size

Explanation - Cross-validation splits data into training and validation sets multiple times to estimate generalization.

Correct answer is: To evaluate model performance on unseen data
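
A minimal sketch of how k-fold cross-validation partitions sample indices. The helper `k_fold_indices` is hypothetical and does not shuffle; in practice the data are usually shuffled (or stratified by class) before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print(len(train), len(validation))
```

Every sample lands in exactly one validation fold, so each one is used once for evaluation and k - 1 times for training.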

Q.77 Which algorithm can be used to cluster genes based on their expression patterns?

k-means clustering

Decision tree

Linear regression

Principal component analysis

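Explanation - k-means groups genes with similar expression profiles into clusters; decision trees and linear regression are supervised methods, and PCA reduces dimensionality rather than assigning clusters.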

Correct answer is: k-means clustering

Q.78 What does a 'confusion matrix' display in a classification task?

The distribution of prediction scores

True positives, false positives, true negatives, and false negatives

The training loss over epochs

The feature importance ranking

Explanation - The confusion matrix summarizes model predictions compared to ground truth.

Correct answer is: True positives, false positives, true negatives, and false negatives
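
The four cells can be tallied in a few lines. This sketch assumes binary labels with 1 as the positive class; the labels below are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, FP, TN and FN for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_counts(y_true, y_pred)
# Accuracy is the share of predictions on the diagonal (TP + TN).
accuracy = (cm["TP"] + cm["TN"]) / len(y_true)
```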

Q.79 Which of the following is a key advantage of using a 'deep neural network' over a 'linear model' for variant effect prediction?

It is simpler to interpret

It requires less training data

It can capture non-linear relationships

It always achieves higher accuracy

Explanation - Deep networks model complex patterns, whereas linear models only capture linear trends.

Correct answer is: It can capture non-linear relationships

Q.80 In a machine learning model, what is 'overfitting'?

When the model generalizes well to new data

When the model performs poorly on training data

When the model memorizes training data and fails on new data

When the model uses too few features

Explanation - Overfitting occurs when a model captures noise in training data, reducing performance on unseen samples.

Correct answer is: When the model memorizes training data and fails on new data

Q.81 Which of the following metrics is best for evaluating a model that predicts continuous gene expression values?

Precision

Recall

Mean Squared Error (MSE)

Accuracy

Explanation - MSE measures the average squared difference between predicted and actual continuous values.

Correct answer is: Mean Squared Error (MSE)
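
As a small worked example of the definition, assuming plain Python lists of predicted and measured expression values (the numbers are illustrative only):

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Only the third prediction is off (by 2), so MSE = 2**2 / 3.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```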

Q.82 Which of the following is a common use of a 'convolutional neural network' (CNN) in genomics?

Predicting protein structures from amino acid sequences

Detecting DNA motifs in genomic sequences

Estimating DNA replication timing

Aligning sequencing reads to a reference

Explanation - CNNs scan sequences with filters that capture motif patterns.

Correct answer is: Detecting DNA motifs in genomic sequences

Q.83 Which of the following best describes the concept of 'attention' in transformer models applied to DNA sequences?

A method to reduce model size

A mechanism to focus on relevant positions in the sequence

A way to generate synthetic sequences

A technique for data augmentation

Explanation - Attention assigns weights to sequence positions, highlighting important information for prediction.

Correct answer is: A mechanism to focus on relevant positions in the sequence

Q.84 In a genomic deep learning pipeline, why is it important to use a 'validation set' during training?

To test the model on the same data it was trained on

To tune hyperparameters and monitor overfitting

To reduce the size of the dataset

To store the final model weights

Explanation - A validation set is held out from training, so it can guide hyperparameter selection and reveal overfitting; a separate test set is still needed for an unbiased final estimate of performance.

Correct answer is: To tune hyperparameters and monitor overfitting

Q.85 Which of the following is a typical output of a variant effect predictor tool?

The read depth at a genomic locus

The predicted impact of a variant on protein function

The GC content of a chromosome

The 3D structure of a DNA molecule

Explanation - Variant effect predictors estimate functional consequences, such as whether a variant is likely damaging or benign.

Correct answer is: The predicted impact of a variant on protein function

Q.86 What is the primary advantage of using a 'graph neural network' for modeling protein-protein interactions?

It can directly use 3D structures of proteins

It can capture the network topology of interactions

It requires no training data

It reduces the need for GPU acceleration

Explanation - Graph neural networks propagate information along edges, capturing interaction patterns.

Correct answer is: It can capture the network topology of interactions

Q.87 Which of the following metrics measures the trade-off between false positives and false negatives in a binary classification model?

Accuracy

Precision

Recall

Area Under the ROC Curve (AUC)

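Explanation - The ROC curve plots true positive rate against false positive rate across all decision thresholds; AUC summarizes this trade-off in one number, whereas accuracy, precision, and recall each describe a single threshold.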

Correct answer is: Area Under the ROC Curve (AUC)

Q.88 Which type of deep learning architecture is best suited for modeling long-range dependencies in DNA sequences?

Convolutional Neural Network

Recurrent Neural Network

Transformer

Feedforward Neural Network

Explanation - Transformers use self-attention to capture relationships across long sequences.

Correct answer is: Transformer

Q.89 In a supervised learning model for gene expression prediction, which of the following is a hyperparameter that can be tuned?

Learning rate

Read length

Base quality score

Sequencing platform

Explanation - Learning rate controls how quickly the model updates its weights during training.

Correct answer is: Learning rate

Q.90 Which of the following best describes 'batch normalization' in neural networks?

A technique to regularize model weights

A method to accelerate training by normalizing layer inputs

A way to reduce the number of layers

A technique to encode DNA sequences

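Explanation - Batch normalization standardizes layer inputs over each mini-batch, which stabilizes gradients and typically speeds convergence; it was originally motivated as reducing internal covariate shift.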

Correct answer is: A method to accelerate training by normalizing layer inputs

Q.91 Why is 'dropout' used during training of deep learning models?

To speed up training

To reduce the number of training samples

To prevent overfitting by randomly disabling neurons

To increase the number of parameters

Explanation - Dropout forces the network to learn redundant representations, improving generalization.

Correct answer is: To prevent overfitting by randomly disabling neurons

Q.92 Which of the following is a common data augmentation technique for DNA sequences?

Adding random noise to expression values

Generating reverse complement sequences

Shuffling gene labels

Duplicating the entire dataset

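Explanation - Because DNA is double-stranded, the reverse complement describes the same locus read from the opposite strand, so the label is usually preserved while the training data are expanded.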

Correct answer is: Generating reverse complement sequences
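
A minimal sketch of the augmentation, assuming sequences over A, C, G, T with N for unknown bases:

```python
# Translation table mapping each base to its complement (case preserved).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def augment_with_reverse_complements(sequences):
    """Return the original sequences followed by their reverse complements."""
    return list(sequences) + [reverse_complement(s) for s in sequences]

augmented = augment_with_reverse_complements(["ATGC", "AACG"])
# augmented == ["ATGC", "AACG", "GCAT", "CGTT"]
```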

Q.93 What does 'cross-entropy loss' measure in classification tasks?

The difference between predicted and actual class probabilities

The absolute error of predictions

The proportion of correct predictions

The distance between two distributions

Explanation - Cross-entropy quantifies how well the predicted probability distribution matches the true distribution.

Correct answer is: The difference between predicted and actual class probabilities
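
For a one-hot true label, cross-entropy reduces to the negative log of the probability assigned to the correct class. The sketch below illustrates this; the small `eps` floor that guards against log(0) is an assumption of this example.

```python
import math

def cross_entropy(true_dist, pred_dist, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i), with q_i floored at eps."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(true_dist, pred_dist))

true_label = [0, 1, 0]  # the sample belongs to class 1
confident_right = cross_entropy(true_label, [0.1, 0.8, 0.1])
confident_wrong = cross_entropy(true_label, [0.8, 0.1, 0.1])
# The loss is low when the true class gets high probability, high otherwise.
```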

Q.94 Which of the following is a commonly used feature for predicting transcription factor binding sites?

GC content only

DNA shape features

Gene expression levels

Protein tertiary structure

Explanation - DNA shape captures physical properties influencing transcription factor binding.

Correct answer is: DNA shape features

Q.95 In the context of genomic data, what is the purpose of 'normalization' of read counts?

To convert counts into percentages

To correct for differences in sequencing depth

To adjust for GC bias only

To reduce the number of reads

Explanation - Normalization ensures comparability across samples by adjusting for varying library sizes.

Correct answer is: To correct for differences in sequencing depth
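
Counts per million (CPM) is one simple depth normalization and serves here as an illustrative sketch; it is not the only option, and it does not correct for gene length or composition effects. The two toy samples below have the same relative expression but a tenfold difference in depth.

```python
def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c * 1_000_000 / total for c in counts]

shallow = [10, 30, 60]     # 100 reads in total
deep = [100, 300, 600]     # 1,000 reads in total
cpm_shallow = counts_per_million(shallow)
cpm_deep = counts_per_million(deep)
# After normalization the two samples are directly comparable.
```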

Q.96 Which of the following best describes 'transfer learning' in deep learning?

Using a pre-trained model as a starting point for a new task

Transferring data between species

Moving the entire pipeline to a different computer

Transferring the learning rate

Explanation - Transfer learning fine-tunes a model pre-trained on a related task to accelerate learning.

Correct answer is: Using a pre-trained model as a starting point for a new task

Q.97 Which algorithm is most appropriate for clustering single-cell RNA-seq data into distinct cell types?

k-means

Hierarchical clustering

t-SNE followed by DBSCAN

Linear regression

Explanation - t-SNE reduces dimensionality, and DBSCAN then identifies clusters without requiring the number of clusters to be specified in advance.

Correct answer is: t-SNE followed by DBSCAN

Q.98 In variant calling, what does the term 'genotype likelihood' refer to?

The probability of a variant being pathogenic

The probability of the sequencing reads given a genotype

The likelihood of the reference genome

The cost of genotyping

Explanation - Genotype likelihoods quantify how well each candidate genotype explains the observed reads, i.e. P(reads | genotype); combining them with a prior yields the posterior probability of each genotype.

Correct answer is: The probability of the sequencing reads given a genotype

Q.99 Which of the following is a key challenge when applying deep learning to large-scale genomic data?

Small dataset size

Lack of GPU hardware

Limited memory for long sequences

High interpretability of models

Explanation - Long genomic inputs produce large activations (and, for self-attention, memory that grows quadratically with sequence length), which can exceed typical GPU memory limits.

Correct answer is: Limited memory for long sequences

Q.100 Which type of model is often used for predicting whether a DNA region is a promoter?

Support Vector Machine with a string kernel

Linear regression on gene length

Decision tree on variant counts

Clustering on gene expression

Explanation - String kernels allow SVMs to capture sequence patterns characteristic of promoters.

Correct answer is: Support Vector Machine with a string kernel

Q.101 What is the main purpose of 'feature selection' in genomic machine learning pipelines?

To increase the number of features

To reduce overfitting by selecting informative features

To convert categorical features to numerical

To compute read depths

Explanation - Feature selection removes irrelevant or redundant features, which lowers the risk of overfitting when there are far more features than samples, as is common in genomic data.

Correct answer is: To reduce overfitting by selecting informative features

Q.102 Which of the following is NOT a typical input for a model predicting enhancer activity?

DNA sequence motifs

Chromatin accessibility data

Gene expression levels of neighboring genes

CpG methylation status

Explanation - Enhancer activity predictions rely on sequence and epigenomic features, not neighboring gene expression.

Correct answer is: Gene expression levels of neighboring genes

Q.103 What is the primary role of the 'softmax' activation function in a multi-class classification neural network?

To enforce sparsity in weights

To produce a probability distribution over classes

To regularize the model

To accelerate training

Explanation - Softmax maps raw logits to a vector summing to one, representing class probabilities.

Correct answer is: To produce a probability distribution over classes
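
A minimal sketch of softmax. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Map raw scores to non-negative values that sum to one."""
    shift = max(logits)  # avoids overflow for large logits
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# The largest logit receives the largest probability.
```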

Q.104 Which of the following is a common way to evaluate a model that predicts disease risk from genomic data?

Mean Absolute Error

ROC AUC

Precision-Recall curve

All of the above

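Explanation - Each listed metric can be relevant: Mean Absolute Error when risk is predicted as a continuous score, ROC AUC for overall discrimination, and precision-recall curves when cases are rare.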

Correct answer is: All of the above

Q.105 In the context of CRISPR-Cas9 editing, what is the 'off-target effect'?

The intended editing of the target gene

Unintended edits at genomic sites other than the target

The efficiency of on-target editing

The rate of off-target delivery

Explanation - Off-target effects occur when the Cas9 complex cuts unintended sites, causing unwanted mutations.

Correct answer is: Unintended edits at genomic sites other than the target

Q.106 Which of the following best describes the 'learning rate' hyperparameter?

The speed at which the model's weights are updated

The number of epochs to run

The ratio of training to validation data

The number of hidden units in a layer

Explanation - Learning rate controls the step size in gradient descent optimization.

Correct answer is: The speed at which the model's weights are updated

Q.107 What does the 'Adam' optimizer do differently compared to standard stochastic gradient descent (SGD)?

It uses a fixed learning rate

It adapts learning rates for each parameter using momentum and RMSprop

It does not update weights

It requires fewer hyperparameters

Explanation - Adam combines momentum and adaptive learning rates to accelerate convergence.

Correct answer is: It adapts learning rates for each parameter using momentum and RMSprop
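
The update can be sketched for a single scalar parameter as follows. The default values for `lr`, `beta1`, `beta2` and `eps` are the commonly quoted ones and are used here as assumptions; real frameworks apply the same idea to whole tensors.

```python
import math

def adam_step(param, grad, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad        # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSprop-like: running mean of squares
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# First step from param = 1.0 with a gradient of 4.0.
param, m, v = adam_step(1.0, 4.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's magnitude; plain SGD would have moved it by `lr * 4.0`.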

Q.108 Which of the following is a typical output of a genome annotation tool like Ensembl?

Protein tertiary structures

Gene models with exon-intron boundaries

Chromosome images

Variant frequencies only

Explanation - Annotation tools provide gene predictions, including exon-intron structure.

Correct answer is: Gene models with exon-intron boundaries

Q.109 In a variant classification task, why might you prefer the F1-score over accuracy?

Accuracy is more sensitive to class imbalance

F1-score focuses on the minority class

F1-score is easier to compute

Accuracy is only for regression tasks

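Explanation - F1 is the harmonic mean of precision and recall on the positive class (usually the minority class, e.g. pathogenic variants), so it is not inflated by many correctly predicted negatives the way accuracy can be.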

Correct answer is: F1-score focuses on the minority class
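
A small illustration with made-up labels: 5 pathogenic variants among 100, and a degenerate model that calls everything benign. Treating an undefined precision or recall as 0.0 is an assumed convention in this sketch.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1] * 5 + [0] * 95   # 1 = pathogenic, 0 = benign
y_pred = [0] * 100            # model predicts benign for every variant
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, f1)  # 0.95 0.0
```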

Q.110 What is the primary goal of the k-mer counting step in a de Bruijn graph assembly pipeline?

To identify sequencing errors

To create nodes in the graph representing k-length sequences

To compute GC content

To estimate the number of genes

Explanation - k-mer counts form the basis of de Bruijn graphs used for genome assembly.

Correct answer is: To create nodes in the graph representing k-length sequences
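
A minimal sketch of graph construction, following the convention used in this question: nodes are k-mers, and an edge links two k-mers that are adjacent in a read (overlapping by k - 1 bases). Other formulations use (k - 1)-mers as nodes and k-mers as edges. The function name and toy read are assumptions, and real assemblers also handle errors, reverse complements and coverage.

```python
def build_de_bruijn(reads, k):
    """Map each k-mer node to the set of k-mers that follow it in some read."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())   # every k-mer becomes a node
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)          # consecutive k-mers share k - 1 bases
    return graph

graph = build_de_bruijn(["ATGCA"], 3)
# graph == {"ATG": {"TGC"}, "TGC": {"GCA"}, "GCA": set()}
```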

Q.111 Which of the following best describes a 'negative binomial' distribution in the context of RNA-seq differential expression analysis?

A distribution that models continuous data

A distribution that captures overdispersion in count data

A distribution that assumes equal variance and mean

A distribution for binary outcomes

Explanation - RNA-seq counts often show variance greater than the mean, modeled by a negative binomial.

Correct answer is: A distribution that captures overdispersion in count data

Q.112 In a machine learning model, what does 'regularization' aim to achieve?

Increase model complexity

Prevent overfitting by adding a penalty

Speed up training by removing layers

Reduce the dataset size

Explanation - Regularization discourages large weights, helping models generalize better.

Correct answer is: Prevent overfitting by adding a penalty

Q.113 Which of the following is a commonly used metric to evaluate a regression model's performance?

Area Under the ROC Curve

Mean Absolute Error

Precision

Recall

Explanation - MAE measures the average absolute difference between predictions and ground truth.

Correct answer is: Mean Absolute Error

Q.114 Which of the following is a typical input feature for predicting DNA methylation levels?

GC content

DNA shape features

Protein folding energy

Gene expression in other species

Explanation - DNA shape captures structural properties that influence methylation patterns.

Correct answer is: DNA shape features

Q.115 What is the function of a 'mask' in transformer models for DNA sequence analysis?

To hide future positions during training for autoregressive prediction

To reduce the dimensionality of the input

To prevent overfitting

To speed up computation

Explanation - Masking prevents the model from accessing future tokens when predicting the next token.

Correct answer is: To hide future positions during training for autoregressive prediction

Q.116 Which of the following is a common method for detecting differential methylation between two conditions?

DESeq2

limma

edgeR

DSS

Explanation - DSS is specifically designed for differential methylation analysis.

Correct answer is: DSS

Q.117 In the context of variant calling, what does a 'phred quality score' indicate?

The probability that a variant is false

The confidence in a base call, calculated as -10*log10(error probability)

The coverage depth at a genomic position

The GC content of a region

Explanation - Phred scores translate error probabilities into a standardized scale.

Correct answer is: The confidence in a base call, calculated as -10*log10(error probability)
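
The formula can be checked numerically with the sketch below. The decoding helper assumes the Phred+33 ASCII offset used in standard FASTQ files; older formats used a different offset.

```python
import math

def phred_to_error_prob(q):
    """Invert Q = -10 * log10(P): P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_phred33(quality_string):
    """Convert a FASTQ quality string to integer scores (assumes Phred+33)."""
    return [ord(char) - 33 for char in quality_string]

scores = decode_phred33("I!")  # [40, 0]
p_q30 = phred_to_error_prob(30)  # about 1 error in 1,000 base calls
```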

Q.118 Which of the following best describes the 'AUC-PR' curve?

A curve plotting Accuracy vs. Precision

A curve plotting Recall vs. Precision for varying thresholds

A curve plotting ROC points only for high sensitivity

A curve plotting True Positive vs. False Negative rates

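Explanation - The precision-recall curve plots precision against recall as the decision threshold varies; AUC-PR is the area under that curve and is especially informative for imbalanced data.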

Correct answer is: A curve plotting Recall vs. Precision for varying thresholds

Q.119 What is the primary benefit of using a 'convolutional autoencoder' for genomic sequence compression?

It generates synthetic variants

It learns a compact representation of sequences

It speeds up alignment

It reduces sequencing errors

Explanation - Autoencoders compress data into a latent space and reconstruct it, providing efficient encodings.

Correct answer is: It learns a compact representation of sequences

Q.120 Which of the following is a key advantage of using a 'graph attention network' (GAT) for protein interaction prediction?

It captures the importance of neighboring nodes with attention weights

It requires no graph structure

It reduces the number of training epochs

It is always more interpretable than other models

Explanation - GATs weight contributions of neighbors differently, improving prediction of interactions.

Correct answer is: It captures the importance of neighboring nodes with attention weights

Q.121 Which of the following is a typical preprocessing step when preparing gene expression data for machine learning?

Log transformation of counts

Removing all genes with low expression

Standard scaling

All of the above

Explanation - Gene expression preprocessing often includes log transform, filtering, and scaling.

Correct answer is: All of the above

Q.122 Which metric is most appropriate for evaluating the performance of a multi-class classifier when classes are imbalanced?

Overall accuracy

Macro-averaged F1-score

Micro-averaged F1-score

AUC-ROC for each class

Explanation - Macro-averaged F1 gives equal weight to each class, highlighting performance on minority classes.

Correct answer is: Macro-averaged F1-score
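
The difference between the two averages shows up clearly on a toy imbalanced example (labels invented for illustration). In this sketch a class with no true positives gets an F1 of 0.0, which is an assumed convention.

```python
def class_counts(y_true, y_pred, label):
    """Return (TP, FP, FN) treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = [f1_from_counts(*class_counts(y_true, y_pred, l)) for l in labels]
    return sum(scores) / len(labels)

def micro_f1(y_true, y_pred):
    """F1 from counts pooled over classes: dominated by frequent classes."""
    labels = sorted(set(y_true) | set(y_pred))
    totals = [class_counts(y_true, y_pred, l) for l in labels]
    tp, fp, fn = (sum(column) for column in zip(*totals))
    return f1_from_counts(tp, fp, fn)

y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10           # the minority class B is never predicted
macro = macro_f1(y_true, y_pred)
micro = micro_f1(y_true, y_pred)
```

Here micro-F1 is 0.8 (equal to accuracy in this single-label setting), while macro-F1 drops to about 0.44 because class B contributes an F1 of zero.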

Q.123 What is the main purpose of using a 'pseudo-label' in semi-supervised learning?

To label unlabeled data based on model predictions

To remove noise from labeled data

To reduce the training dataset size

To increase the number of output classes

Explanation - Pseudo-labels provide temporary labels for unlabeled data, enabling semi-supervised training.

Correct answer is: To label unlabeled data based on model predictions

Q.124 Which of the following is NOT a common evaluation metric for a disease risk prediction model?

Accuracy

Precision

Recall

Coefficient of Determination (R^2)

Explanation - R^2 is for regression, while disease risk models are usually classification.

Correct answer is: Coefficient of Determination (R^2)

Q.125 Which of the following best describes the purpose of a 'liftover' tool in genomics?

To convert sequencing reads to a different format

To map coordinates from one reference assembly to another

To align reads to a reference genome

To call variants from aligned reads

Explanation - Liftover translates genomic positions between assemblies like hg19 to hg38.

Correct answer is: To map coordinates from one reference assembly to another

Q.126 Which of the following is a common type of genetic variant?

Single Nucleotide Polymorphism (SNP)

Gene expression level

Protein folding energy

Chromatin accessibility

Explanation - SNPs are single-base changes in the genome.

Correct answer is: Single Nucleotide Polymorphism (SNP)

Q.127 Which of the following is a key benefit of using a 'deep neural network' over a 'linear regression' for predicting gene expression from genomic features?

It can capture non-linear relationships

It requires less data

It is always more interpretable

It uses fewer parameters

Explanation - Deep networks model complex, non-linear interactions among genomic features.

Correct answer is: It can capture non-linear relationships

Q.128 What is the main goal of a 'variant effect prediction' model?

To predict whether a variant is likely to affect protein function

To count the number of variants in a genome

To align reads to a reference genome

To measure GC content across the genome

Explanation - Variant effect predictors estimate functional impact on genes or proteins.

Correct answer is: To predict whether a variant is likely to affect protein function

Q.129 Which of the following techniques is used to reduce overfitting in neural networks?

Increasing the number of epochs

Adding more layers

Dropout

Using a larger batch size

Explanation - Dropout randomly disables neurons during training, preventing memorization.

Correct answer is: Dropout

Q.130 Which of the following best describes 'cross-validation' in machine learning?

Using the entire dataset for training only

Splitting data into training and validation sets multiple times to estimate performance

Using a single training and test split

Performing hyperparameter tuning on a fixed subset

Explanation - Cross-validation provides a robust estimate of generalization by cycling through data splits.

Correct answer is: Splitting data into training and validation sets multiple times to estimate performance

Q.131 Which of the following is a common output of a genome assembly pipeline?

A list of annotated genes

A set of contiguous sequences (contigs)

A heatmap of methylation levels

A predicted protein structure

Explanation - Assembly yields contigs or scaffolds representing the genome.

Correct answer is: A set of contiguous sequences (contigs)

Q.132 Which of the following best describes the 'ReLU' activation function?

It outputs the exponential of its input

It returns 0 for negative inputs and the input for positive inputs

It normalizes its input

It reduces overfitting

Explanation - ReLU (Rectified Linear Unit) is a simple non-linear activation that mitigates vanishing gradients.

Correct answer is: It returns 0 for negative inputs and the input for positive inputs
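
In code, the definition is a one-liner (a scalar sketch; frameworks apply it element-wise to tensors):

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)

activations = [relu(x) for x in [-2.0, 0.0, 3.5]]
# activations == [0.0, 0.0, 3.5]
```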

Q.133 What does 'GC content' measure in a DNA sequence?

The proportion of adenine and thymine

The proportion of guanine and cytosine

The total length of the sequence

The number of genes present

Explanation - GC content is calculated as (G+C) divided by total bases.

Correct answer is: The proportion of guanine and cytosine

Q.134 In a supervised learning setting, what is the role of the 'loss function'?

To determine the learning rate

To quantify the difference between predicted and true values

To encode input data

To generate synthetic data

Explanation - The loss function measures prediction error, guiding weight updates during training.

Correct answer is: To quantify the difference between predicted and true values

Q.135 Which of the following is a common preprocessing step for RNA-seq data before machine learning?

Normalization of read counts (e.g., TPM, RPKM)

Trimming of adapter sequences

Imputing missing data

All of the above

Explanation - RNA-seq preprocessing includes trimming, normalization, and imputation as needed.

Correct answer is: All of the above

Q.136 In deep learning for genomics, which of the following is a reason to use 'transfer learning'?

To avoid overfitting by using a simpler model

To reduce the training time by reusing a pre-trained model

To increase the number of features

To change the learning rate

Explanation - Transfer learning leverages learned representations from a related task, speeding training.

Correct answer is: To reduce the training time by reusing a pre-trained model