Machine Learning in Genomics: MCQ Practice Set

Q.1 Which machine learning technique is commonly used for predicting whether a genetic variant is pathogenic or benign based on its sequence context?

Convolutional Neural Networks
Linear Regression
K-Nearest Neighbors
Decision Trees
Explanation - CNNs can automatically learn spatial patterns in DNA sequences, making them suitable for variant effect prediction, whereas the other methods are less effective at capturing sequence motifs.
Correct answer is: Convolutional Neural Networks

Q.2 What is the primary purpose of the Variant Effect Predictor (VEP) tool in genomic pipelines?

Align sequencing reads to a reference genome
Annotate genetic variants with functional information
Perform differential gene expression analysis
Predict protein tertiary structure
Explanation - VEP is used to annotate variants with predicted effects on genes and proteins; it does not align reads, analyze expression, or predict protein structure.
Correct answer is: Annotate genetic variants with functional information

Q.3 In a genome-wide association study (GWAS), which statistical model is most appropriate for handling high-dimensional SNP data with many correlated features?

Elastic Net Regression
Random Forest
Principal Component Analysis
Linear Mixed Model
Explanation - Elastic Net combines L1 and L2 penalties to perform variable selection and handle multicollinearity among SNPs; other methods are less tailored for high-dimensional SNP selection.
Correct answer is: Elastic Net Regression

Q.4 Which deep learning architecture is particularly effective for modeling long-range dependencies in genomic sequences?

Recurrent Neural Networks (RNNs)
Feedforward Neural Networks
Transformer Models
Support Vector Machines
Explanation - Transformers use self-attention to relate any two positions directly, which makes them powerful for long genomic sequences, whereas RNNs process positions step by step and are prone to vanishing gradients over long distances.
Correct answer is: Transformer Models

Q.5 What does the area under the ROC curve (AUC) indicate in a binary classification model for disease risk prediction?

Overall accuracy of the model
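Trade-off between true positive rate and false positive rate across all thresholds
Probability of a positive outcome
Number of misclassifications
Explanation - AUC measures the model's ability to rank positives above negatives: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. It summarizes the trade-off between true positive and false positive rates across all thresholds rather than accuracy at any single threshold.
Correct answer is: Trade-off between true positive rate and false positive rate across all thresholds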
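
The ranking view of AUC described in the explanation can be sketched in a few lines. This is a minimal illustration in plain Python; the function name `roc_auc` and the toy labels and scores are assumptions made for the example, not part of the question set.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 6 positive/negative pairs are ordered correctly.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
print(roc_auc(labels, scores))  # 0.5
```

The pairwise loop is quadratic in the number of samples; it is meant to show the definition, not to replace a library routine.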

Q.6 Which of the following is a common preprocessing step before training a neural network on raw DNA sequence data?

One-hot encoding of nucleotides
Standard scaling of expression values
Principal component analysis of genotype data
Imputing missing phenotypes
Explanation - One-hot encoding transforms nucleotides (A, C, G, T) into binary vectors suitable for neural network input; the other options apply to different data types.
Correct answer is: One-hot encoding of nucleotides
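
As a rough sketch of the encoding step, the snippet below maps each base to a four-element binary row. The column order A, C, G, T and the choice to encode unknown bases such as N as all zeros are assumptions for this example; other conventions exist.

```python
ALPHABET = "ACGT"  # assumed column order for this sketch


def one_hot(seq):
    """Return one row per base; bases outside ACGT (e.g. N) become all zeros."""
    rows = []
    for base in seq.upper():
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows


for base, row in zip("ACGTN", one_hot("ACGTN")):
    print(base, row)
```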

Q.7 Which metric is most appropriate for evaluating clustering of cell types in single-cell RNA-seq data?

Silhouette Score
Mean Squared Error
Log Loss
Confusion Matrix
Explanation - Silhouette Score measures how similar an object is to its own cluster compared to other clusters, making it ideal for assessing single-cell clustering.
Correct answer is: Silhouette Score

Q.8 What advantage do GPU-accelerated models offer in training deep learning models for whole-genome variant calling?

Reduced memory usage
Faster training times
Higher accuracy inherently
Lower power consumption
Explanation - GPUs parallelize computations, dramatically speeding up training of large genomic models; they do not automatically improve accuracy or reduce memory.
Correct answer is: Faster training times

Q.9 In the context of epigenomics, what is the main goal of using a convolutional neural network to predict DNA methylation patterns?

To identify regulatory motifs influencing methylation
To cluster cell types based on methylation profiles
To impute missing methylation data points
To generate 3D chromatin contact maps
Explanation - CNNs can detect sequence motifs that correlate with methylation status; while they can help impute missing data, the primary aim is motif discovery.
Correct answer is: To identify regulatory motifs influencing methylation

Q.10 Which type of neural network is most suitable for modeling time-series gene expression data across developmental stages?

Convolutional Neural Network
Long Short-Term Memory (LSTM) network
Generative Adversarial Network
Autoencoder
Explanation - LSTMs are designed to capture temporal dependencies in sequential data, making them ideal for developmental gene expression trajectories.
Correct answer is: Long Short-Term Memory (LSTM) network

Q.11 Which loss function is commonly used for multi-label classification in gene function prediction tasks?

Cross-Entropy Loss
Mean Squared Error
Hinge Loss
Binary Cross-Entropy Loss
Explanation - Binary cross-entropy applies independently to each label, suitable for multi-label problems where each gene can have multiple functions.
Correct answer is: Binary Cross-Entropy Loss
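
To make "applies independently to each label" concrete, here is a small sketch that averages the per-label binary cross-entropy for one gene. The function name `multilabel_bce` and the clipping constant `eps` are illustrative assumptions.

```python
import math


def multilabel_bce(targets, probs, eps=1e-12):
    """Mean binary cross-entropy over labels; each label is scored on its own."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)


# One gene, three candidate functions; the gene has functions 1 and 3.
print(multilabel_bce([1, 0, 1], [0.9, 0.2, 0.7]))
```

Because each label has its own sigmoid-style probability, the predicted probabilities do not need to sum to one, unlike a softmax output.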

Q.12 What does the term 'transfer learning' refer to in the context of genomic deep learning?

Transferring data from one species to another
Using a pre-trained model on one dataset to initialize training on a new dataset
Moving the entire pipeline to a cloud environment
Converting raw sequencing data into processed features
Explanation - Transfer learning leverages knowledge from a pre-trained model to improve learning on a related genomic task, reducing data and training time requirements.
Correct answer is: Using a pre-trained model on one dataset to initialize training on a new dataset

Q.13 Which of the following best describes the purpose of data augmentation in genomic sequence modeling?

Increase the number of samples by shuffling sequences randomly
Add synthetic noise to sequencing reads to mimic sequencing errors
Generate reverse complements of DNA sequences
All of the above
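Explanation - Data augmentation for DNA includes reverse complements and simulated sequencing errors, which preserve the label while adding variation. Random shuffling is also used to expand training data, although it disrupts motifs and is therefore most often applied to build negative or background examples.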
Correct answer is: All of the above
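
Reverse complementation is the simplest of these augmentations to show in code. The sketch below uses only the standard library; the helper name `reverse_complement` is an assumption for the example, and characters outside A, C, G, T are left unchanged.

```python
# Translation table for the four bases, in upper and lower case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the strand direction."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))    # GCAT
print(reverse_complement("GAATTC"))  # GAATTC (an EcoRI site is its own reverse complement)
```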

Q.14 In a variant effect prediction pipeline, which step ensures that variants are accurately mapped to their genomic coordinates?

Read alignment
Variant calling
Genotype refinement
Coordinate conversion using liftover
Explanation - Liftover converts variant coordinates between genome assemblies (for example, GRCh37 to GRCh38), so that variants are placed correctly on the assembly used downstream; read alignment and variant calling are earlier steps.
Correct answer is: Coordinate conversion using liftover

Q.15 Which machine learning metric is best for evaluating imbalanced classification of rare pathogenic variants?

Accuracy
F1-score
Matthews Correlation Coefficient (MCC)
Mean Absolute Error
Explanation - MCC accounts for all four confusion matrix categories and is robust to class imbalance, unlike accuracy or MAE.
Correct answer is: Matthews Correlation Coefficient (MCC)
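
A short sketch shows why MCC is harder to fool than accuracy on rare classes. The function name `mcc` and the confusion-matrix counts are made up for illustration; returning 0.0 when the denominator is zero is a common convention assumed here.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denom


# A classifier that calls every variant benign: 10 pathogenic, 990 benign.
tp, fp, tn, fn = 0, 0, 990, 10
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(accuracy)             # 0.99
print(mcc(tp, fp, tn, fn))  # 0.0
```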

Q.16 Which of the following is NOT a typical input feature for machine learning models predicting CRISPR-Cas9 off-target activity?

Sequence context around the target site
GC content of the guide RNA
Chromatin accessibility scores
Protein folding energy of the Cas9 enzyme
Explanation - Off-target predictions rely on DNA sequence and epigenomic context, not on the folding energy of Cas9.
Correct answer is: Protein folding energy of the Cas9 enzyme

Q.17 In deep learning applied to genomics, what is the main purpose of using a batch normalization layer?

Reduce the number of parameters
Normalize the inputs to each layer over every mini-batch
Add regularization to prevent overfitting
Increase model interpretability
Explanation - Batch normalization stabilizes and accelerates training by normalizing each layer's activations with mini-batch statistics; it does not reduce the parameter count or improve interpretability, and any regularizing effect is a side benefit.
Correct answer is: Normalize the inputs to each layer over every mini-batch

Q.18 Which of the following best explains why convolutional layers are effective for detecting transcription factor binding sites in DNA?

They capture spatial relationships in sequential data
They reduce the dimensionality of the input
They enforce sparsity in the weight matrices
They inherently model evolutionary conservation
Explanation - Convolutions slide across the sequence, identifying motifs (sub-sequences) that indicate binding sites.
Correct answer is: They capture spatial relationships in sequential data

Q.19 What is a primary advantage of using graph neural networks (GNNs) in protein-protein interaction prediction?

They can handle variable-sized input graphs efficiently
They are inherently interpretable
They avoid the need for feature engineering
They require fewer computational resources than CNNs
Explanation - GNNs process graphs of varying sizes and learn node/edge representations, which is advantageous for PPI networks.
Correct answer is: They can handle variable-sized input graphs efficiently

Q.20 Which of the following best describes the role of the 'attention mechanism' in transformer-based models for genomics?

It reduces the number of parameters needed
It selects relevant positions in the input sequence for each output
It ensures model convergence
It normalizes the output probabilities
Explanation - Attention weights highlight which parts of the input sequence influence the prediction, enabling modeling of long-range dependencies.
Correct answer is: It selects relevant positions in the input sequence for each output
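
The idea of weighting input positions can be sketched with scaled dot-product attention on tiny hand-written vectors. This is a plain-Python illustration, not a transformer implementation: there are no learned projections or multiple heads, and the names `softmax` and `attention` are assumptions for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, weights)."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [2.0], [3.0]]
out, w = attention([[4.0, 0.0]], keys, values)
print(w[0])  # positions 1 and 3 get most of the weight
```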

Q.21 Which evaluation metric is most suitable for ranking genetic variants by predicted pathogenicity?

Precision
Recall
Average Precision (AP)
Area Under Precision-Recall Curve (AUPRC)
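Explanation - AUPRC summarizes precision over all recall levels, and therefore across every ranking threshold, which is especially valuable when the positive class is rare. Average Precision is a common estimator of the same area, so the two options are closely related.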
Correct answer is: Area Under Precision-Recall Curve (AUPRC)

Q.22 What does the term 'feature importance' refer to in the context of random forest models applied to genomic data?

The computational time required to calculate each feature
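The total impurity reduction contributed by splits on a feature, averaged across all trees
The statistical significance of each feature
The number of missing values in each feature
Explanation - In standard random forest implementations, feature importance is the impurity decrease (for example, Gini) achieved by splits on a feature, averaged over the trees; permutation importance is a common alternative. Raw split counts alone do not reflect how much each split improves the model.
Correct answer is: The total impurity reduction contributed by splits on a feature, averaged across all trees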

Q.23 Which of the following is a commonly used regularization technique to prevent overfitting in deep genomic models?

Dropout
Batch Size Reduction
Increasing the number of layers
Data concatenation
Explanation - Dropout randomly deactivates neurons during training, forcing the network to learn robust features and reducing overfitting.
Correct answer is: Dropout

Q.24 In a multi-task learning setup for predicting both variant pathogenicity and gene expression impact, which loss function strategy is most appropriate?

Weighted sum of task-specific losses
A single combined loss with joint labels
A hierarchical loss where one task guides the other
Separate models with no shared parameters
Explanation - Multi-task learning typically balances losses for each task via weights to train a shared representation.
Correct answer is: Weighted sum of task-specific losses

Q.25 Which of the following is a key challenge in applying deep learning to single-cell RNA-seq data?

High levels of dropout leading to sparse gene expression matrices
Limited availability of labeled data for supervised training
Inability to capture continuous developmental trajectories
All of the above
Explanation - Single-cell data is sparse, often lacks labels, and requires methods that capture continuous transitions.
Correct answer is: All of the above

Q.26 Which computational technique is frequently employed to speed up convolution operations on DNA sequences in GPU implementations?

Fast Fourier Transform (FFT) based convolution
Direct matrix multiplication
Recursive convolution
Sparse convolution
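Explanation - FFT-based convolution turns convolution into element-wise multiplication in the frequency domain, which can speed up computation on GPUs, mainly when filters are large; for short filters, direct or matrix-multiplication-based algorithms are often as fast or faster.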
Correct answer is: Fast Fourier Transform (FFT) based convolution

Q.27 In variant calling pipelines, what is the main benefit of using a probabilistic model like HaplotypeCaller over simple pileup methods?

It requires less computational resources
It models read alignment uncertainty and haplotype structure
It provides deterministic outputs
It eliminates the need for base quality scores
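Explanation - Haplotype-based callers such as GATK HaplotypeCaller locally reassemble reads into candidate haplotypes and score them probabilistically, which accounts for alignment ambiguity and improves accuracy, particularly around indels.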
Correct answer is: It models read alignment uncertainty and haplotype structure

Q.28 Which of the following is NOT typically an input feature for a model predicting enhancer activity?

DNA sequence motif presence
Chromatin accessibility (ATAC-seq peaks)
Gene expression levels of neighboring genes
CpG methylation status
Explanation - Enhancer activity models primarily use sequence and epigenomic signals; gene expression is usually an output or downstream analysis.
Correct answer is: Gene expression levels of neighboring genes

Q.29 What is the primary role of a 'softmax' layer in a deep learning model for multi-class classification of genetic variants?

To ensure output probabilities sum to one
To reduce dimensionality of the input
To regularize the network
To provide non-linear activation
Explanation - Softmax converts raw logits into a probability distribution over classes, useful for multi-class variant classification.
Correct answer is: To ensure output probabilities sum to one

Q.30 Which algorithm is best suited for unsupervised clustering of high-dimensional gene expression profiles?

k-means clustering
Hierarchical clustering
t-SNE followed by DBSCAN
Linear Regression
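Explanation - t-SNE reduces dimensionality while preserving local structure, and DBSCAN can identify clusters without specifying k. Because t-SNE distorts global distances and densities, the resulting clusters should be checked against the original expression space.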
Correct answer is: t-SNE followed by DBSCAN

Q.31 In the context of DNA sequencing, what does 'base calling' refer to?

Assigning a nucleotide identity to each sequencing signal
Aligning reads to a reference genome
Calling genetic variants from aligned reads
Estimating read quality scores
Explanation - Base calling converts raw sensor data into nucleotide sequences (A, C, G, T).
Correct answer is: Assigning a nucleotide identity to each sequencing signal

Q.32 Which of the following best describes 'imputation' in genomic datasets?

Adding synthetic data points to increase dataset size
Predicting missing genotype values
Correcting sequencing errors
Aligning sequences to a reference genome
Explanation - Imputation estimates missing genotype data, improving downstream analyses.
Correct answer is: Predicting missing genotype values

Q.33 Which neural network architecture is particularly adept at capturing hierarchical representations in genomic data?

Recurrent Neural Networks
Convolutional Neural Networks with multiple layers
Feedforward Networks
Support Vector Machines
Explanation - Deep CNNs can learn hierarchical motifs from raw DNA sequences.
Correct answer is: Convolutional Neural Networks with multiple layers

Q.34 What is the main advantage of using a 'multi-head attention' mechanism in transformer models for genomic sequence analysis?

It allows the model to focus on multiple positions in the sequence simultaneously
It reduces the number of parameters required
It enforces sparsity in the attention matrix
It ensures faster convergence
Explanation - Multi-head attention provides multiple parallel attention distributions, capturing diverse sequence features.
Correct answer is: It allows the model to focus on multiple positions in the sequence simultaneously

Q.35 Which type of machine learning model is best suited for predicting continuous phenotypes from genomic data?

Regression models
Classification models
Clustering models
Dimensionality reduction models
Explanation - Regression models output continuous values, making them suitable for quantitative trait prediction.
Correct answer is: Regression models

Q.36 In a convolutional neural network designed to predict splice sites, why are zero-padding and stride parameters critical?

They control the output feature map size and preserve spatial resolution
They regularize the network during training
They enforce sparsity in the weights
They are not important; any values work
Explanation - Padding can preserve the sequence length, while the stride sets how far the filter moves at each step and therefore how much the output is downsampled.
Correct answer is: They control the output feature map size and preserve spatial resolution
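
The effect of padding and stride on output size follows the usual 1D convolution length formula (without dilation). The sketch below applies it; the function name and the 400 bp window with a 9 bp filter are assumed values for illustration only.

```python
def conv1d_output_length(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution: floor((L + 2p - k) / s) + 1."""
    return (length + 2 * padding - kernel) // stride + 1


# Hypothetical 400 bp window around a candidate splice site, 9 bp filters.
print(conv1d_output_length(400, 9))                       # 392: no padding shrinks the output
print(conv1d_output_length(400, 9, padding=4))            # 400: "same" padding keeps per-base resolution
print(conv1d_output_length(400, 9, stride=2, padding=4))  # 200: stride 2 halves the resolution
```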

Q.37 Which of the following is a key advantage of using deep learning over traditional rule-based methods for predicting protein-DNA binding affinities?

Requires no domain knowledge
Can automatically learn complex sequence dependencies
Always provides higher accuracy
Is computationally cheaper
Explanation - Deep learning discovers intricate patterns without hand-crafted rules, though computational cost may be higher.
Correct answer is: Can automatically learn complex sequence dependencies

Q.38 What does the 'L1 regularization' term in a loss function promote in the context of feature selection for genomic models?

Sparsity of the weight vector
Large weight values
Uniform distribution of weights
High model complexity
Explanation - L1 penalty drives many weights to zero, effectively selecting a subset of features.
Correct answer is: Sparsity of the weight vector
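
One way to see why an L1 penalty produces exact zeros is the soft-thresholding step used by coordinate-descent and proximal-gradient lasso solvers. The sketch below shows that step alone; the weights and the threshold `lam` are made-up values, and this is not a full solver.

```python
def soft_threshold(w, lam):
    """Shrink w toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


# Hypothetical SNP weights before the L1 shrinkage step.
weights = [2.0, 0.3, -0.1, -1.5]
print([soft_threshold(w, 0.5) for w in weights])  # [1.5, 0.0, 0.0, -1.0]
```

Small weights are removed entirely, while large ones are only shrunk, which is how L1 acts as a feature selector.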

Q.39 Which of the following best describes the purpose of the 'early stopping' strategy during training of a genomic neural network?

Prevent overfitting by halting training when validation performance degrades
Accelerate training by stopping after a fixed number of epochs
Ensure the model achieves perfect training accuracy
Guarantee convergence to a global optimum
Explanation - Early stopping monitors a validation metric and stops training when performance stops improving.
Correct answer is: Prevent overfitting by halting training when validation performance degrades
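
A patience-based rule is one common way to implement early stopping. The sketch below works on a pre-recorded list of validation losses rather than a live training loop; the function name, the `patience` value, and the loss values are assumptions for the example.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a patience-based stopping rule."""
    best = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch  # stop; restore weights from best_epoch
    return len(val_losses) - 1, best_epoch


losses = [0.90, 0.70, 0.60, 0.65, 0.70, 0.50]
print(early_stopping_epoch(losses, patience=2))  # (4, 2)
```

In this toy run training stops at epoch 4 and never sees the later improvement at epoch 5, which illustrates the trade-off in choosing the patience.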

Q.40 In the context of genomics, what does the term 'k-mer' refer to?

A specific type of DNA sequencing technology
A subsequence of length k nucleotides
A gene located on chromosome k
A statistical test for variant significance
Explanation - k-mers are contiguous sequences of k nucleotides used for sequence analysis and feature extraction.
Correct answer is: A subsequence of length k nucleotides
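
Extracting and counting k-mers takes only a sliding window. This is a minimal sketch using the standard library; the function name `kmer_counts` and the toy sequence are assumptions for the example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (a sequence of length L has L - k + 1)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


print(kmer_counts("ATATA", 2))
```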

Q.41 Which of the following is a major bottleneck when training deep learning models on whole-genome sequencing data?

Limited GPU memory for large input matrices
Scarcity of labeled variants
Difficulty in interpreting model predictions
High cost of sequencing equipment
Explanation - The sheer size of genomic data challenges memory limits; data partitioning and streaming are required.
Correct answer is: Limited GPU memory for large input matrices

Q.42 Which algorithm is commonly used for deconvolving bulk RNA-seq data to estimate cell type proportions?

CIBERSORT
Random Forest
Linear Regression
K-Means
Explanation - CIBERSORT applies nu-support vector regression with a reference signature matrix to estimate cell-type proportions from bulk expression profiles.
Correct answer is: CIBERSORT

Q.43 What is the primary objective of using a 'generative adversarial network' (GAN) in genomics?

To classify variants as pathogenic or benign
To generate realistic synthetic genomic sequences
To predict protein tertiary structure
To perform dimensionality reduction
Explanation - GANs can produce synthetic data that mimics real genomic distributions for training or data augmentation.
Correct answer is: To generate realistic synthetic genomic sequences

Q.44 Which of the following is a key benefit of using the 'Adam' optimizer over plain stochastic gradient descent (SGD) in genomic deep learning?

It requires fewer epochs to converge
It eliminates the need for learning rate scheduling
It uses adaptive learning rates for each parameter
It guarantees finding the global minimum
Explanation - Adam adapts step sizes based on gradient history, often speeding up convergence compared to vanilla SGD.
Correct answer is: It uses adaptive learning rates for each parameter
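
The per-parameter adaptation can be seen in a single Adam update written out by hand. This sketch handles one scalar parameter with the commonly used default hyperparameters; the function name `adam_step` is an assumption, and real frameworks apply the same arithmetic to whole tensors.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Two parameters whose gradients differ by a factor of 100.
p_big, _, _ = adam_step(1.0, 10.0, 0.0, 0.0, t=1)
p_small, _, _ = adam_step(1.0, 0.1, 0.0, 0.0, t=1)
print(1.0 - p_big, 1.0 - p_small)  # both steps are close to lr = 0.001
```

Plain SGD with the same learning rate would move the first parameter 100 times further than the second; Adam rescales each step by that parameter's own gradient history.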

Q.45 Why is it important to perform 'data normalization' before feeding gene expression data into a machine learning model?

To ensure all features have the same scale
To increase the number of samples
To reduce the dimensionality
To prevent data leakage
Explanation - Normalization removes scale differences, making training stable and improving convergence.
Correct answer is: To ensure all features have the same scale

Q.46 Which of the following best explains the concept of 'transfer learning' in the context of genomic image data (e.g., Hi-C contact maps)?

Using a model trained on protein structures to analyze Hi-C maps
Fine-tuning a pre-trained convolutional network on Hi-C data
Transferring Hi-C data to a different species
Transferring the entire pipeline to cloud storage
Explanation - Transfer learning leverages knowledge from models trained on related image-like data to improve performance on Hi-C images.
Correct answer is: Fine-tuning a pre-trained convolutional network on Hi-C data

Q.47 Which statistical test is typically used to assess the significance of differential expression between two groups in RNA-seq studies?

Student's t-test
Wilcoxon signed-rank test
DESeq2's negative binomial test
Chi-squared test
Explanation - DESeq2 fits a negative binomial generalized linear model to read counts and, by default, tests coefficients with a Wald test, which suits the overdispersed count data produced by RNA-seq.
Correct answer is: DESeq2's negative binomial test

Q.48 What is the primary purpose of the 'attention mechanism' in a transformer model trained on genomic sequences?

To compute the loss function
To focus on relevant parts of the sequence for each prediction
To reduce the size of the training dataset
To enforce sequence conservation
Explanation - Attention weights highlight which sequence positions contribute most to the prediction.
Correct answer is: To focus on relevant parts of the sequence for each prediction

Q.49 Which of the following is an example of a semi-supervised learning approach in genomics?

Training a model only on labeled data
Using a mixture of labeled and unlabeled data with pseudo-labels
Clustering all data points
Using only unsupervised dimensionality reduction
Explanation - Semi-supervised learning leverages unlabeled data by generating pseudo-labels to improve model performance.
Correct answer is: Using a mixture of labeled and unlabeled data with pseudo-labels

Q.50 In a machine learning pipeline for predicting enhancer-promoter interactions, which feature is least likely to be useful?

Chromatin interaction frequency from Hi-C data
DNA methylation status near the enhancer
Protein-coding gene length
Transcription factor binding motif presence
Explanation - Enhancer-promoter interaction depends on epigenetic and motif features, not on gene length.
Correct answer is: Protein-coding gene length

Q.51 Which of the following best describes 'dropout' regularization?

Randomly removing a subset of input features during training
Randomly deactivating neurons in hidden layers during training
Adding Gaussian noise to the input data
Increasing the learning rate during training
Explanation - Dropout prevents co-adaptation of neurons by randomly setting activations to zero during each update.
Correct answer is: Randomly deactivating neurons in hidden layers during training
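
The mechanism can be sketched as "inverted" dropout, the variant that rescales surviving activations during training so nothing changes at inference time. The function name, the `rng` parameter, and the example values are assumptions for this illustration.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout: zero each activation with probability `rate`,
    and scale the survivors by 1 / (1 - rate) so the expected value is unchanged."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # dropout is disabled at inference time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


hidden = [0.5, 1.0, 1.5, 2.0]
print(dropout(hidden, rate=0.5, rng=random.Random(0)))  # some entries zeroed, the rest doubled
print(dropout(hidden, rate=0.5, training=False))        # unchanged
```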

Q.52 Which evaluation metric would you use to assess the performance of a regression model predicting gene expression levels from genomic features?

Accuracy
Root Mean Squared Error (RMSE)
Area Under the ROC Curve (AUC)
Precision
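Explanation - RMSE is the square root of the mean squared difference between predicted and actual expression values, so it is expressed in the same units as the target and penalizes large errors more heavily than small ones.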
Correct answer is: Root Mean Squared Error (RMSE)
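
For reference, RMSE can be computed directly from its definition. The function name `rmse` and the expression values below are made up for the example.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical log-expression values for four genes.
observed = [2.0, 5.0, 3.5, 0.0]
predicted = [2.5, 4.0, 3.5, 1.0]
print(rmse(observed, predicted))  # 0.75
```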

Q.53 What does the 'softmax temperature' parameter control in a classification neural network?

The learning rate of the softmax layer
The spread of the predicted probability distribution
The number of output classes
The regularization strength
Explanation - A higher temperature yields a softer probability distribution; a lower temperature sharpens predictions.
Correct answer is: The spread of the predicted probability distribution
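
The effect of temperature is easy to check numerically: logits are divided by the temperature before the softmax. The sketch below is plain Python; the function name and the three-class logits are assumptions for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # e.g. pathogenic, uncertain, benign
for temp in (0.5, 1.0, 5.0):
    print(temp, [round(p, 3) for p in softmax(logits, temp)])
```

As the temperature grows the three probabilities move toward one third each; as it shrinks, nearly all the mass goes to the largest logit.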

Q.54 Which type of model is typically used for imputation of missing genotypes in large-scale genome-wide association studies?

Naïve Bayes
Hidden Markov Models (HMMs)
K-Nearest Neighbors
Linear Regression
Explanation - HMMs model linkage disequilibrium to infer missing genotypes across SNPs.
Correct answer is: Hidden Markov Models (HMMs)

Q.55 Which of the following best describes a 'feature importance plot' generated from a random forest model?

A scatter plot of model predictions
A histogram of feature values
A bar chart showing the relative influence of each feature
A line plot of loss over epochs
Explanation - Feature importance plots visualize which input variables most influence the model's decisions.
Correct answer is: A bar chart showing the relative influence of each feature

Q.56 What is the main goal of using a 'convolutional autoencoder' on genomic sequences?

To classify variants into pathogenic groups
To compress sequences into a lower-dimensional representation
To generate synthetic DNA sequences
To identify splice sites directly
Explanation - Autoencoders learn latent representations that capture essential sequence information for downstream tasks.
Correct answer is: To compress sequences into a lower-dimensional representation

Q.57 Which of the following best describes the concept of 'cross-validation' in the context of machine learning on genomic data?

Using the entire dataset for training and testing simultaneously
Splitting data into training and test sets multiple times to assess model generalizability
Applying the model only on a subset of the data
Removing features that cross over between training and test sets
Explanation - Cross-validation provides a robust estimate of performance by training on various data splits.
Correct answer is: Splitting data into training and test sets multiple times to assess model generalizability
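
The repeated splitting can be sketched as a small k-fold index generator. The function name `k_fold_indices` and the interleaved fold assignment are assumptions for this example; it does no shuffling or stratification.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # interleaved, near-equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


for train, test in k_fold_indices(10, 5):
    print("test:", test, "train size:", len(train))
```

For genomic data, folds are often defined by chromosome or by related individuals rather than by single samples, so that near-duplicate examples do not end up on both sides of a split; the sketch above does not handle that grouping.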

Q.58 In a deep learning model for predicting CRISPR-Cas9 off-target effects, which input representation is commonly used?

One-hot encoded gRNA sequence
3D protein structure of Cas9
Gene expression levels in target cells
Chromatin accessibility heatmap
Explanation - The model primarily uses the guide RNA sequence, optionally combined with chromatin accessibility data.
Correct answer is: One-hot encoded gRNA sequence

Q.59 Which of the following is a key challenge when applying deep learning to long genomic sequences (~1 Mb) on standard GPUs?

Excessive computational speed
Limited memory for storing intermediate activations
Insufficient training data
Difficulty in interpreting results
Explanation - Long input sequences produce large activation tensors at every layer, and these must be kept for backpropagation, so memory use can exceed what a standard GPU provides.
Correct answer is: Limited memory for storing intermediate activations

Q.60 In the context of genome assembly, what is the purpose of a 'k-mer frequency table'?

To measure read quality scores
To identify common subsequences for graph construction
To count the number of genes on a chromosome
To calculate GC content across the genome
Explanation - k-mer tables are used to build de Bruijn graphs for assembly by counting subsequence occurrences.
Correct answer is: To identify common subsequences for graph construction

Q.61 Which of the following is NOT a typical step in a variant effect prediction pipeline?

Read alignment
Variant calling
Variant annotation
Gene expression quantification
Explanation - Variant effect prediction focuses on genotype annotation; expression quantification is a separate analysis.
Correct answer is: Gene expression quantification

Q.62 Which machine learning approach would you use to predict whether a DNA region is a promoter based on its sequence?

Support Vector Machine with a string kernel
Linear regression on GC content
Decision tree on gene length
Clustering on variant counts
Explanation - String kernels enable SVMs to capture sequence patterns relevant to promoter prediction.
Correct answer is: Support Vector Machine with a string kernel

Q.63 In the evaluation of a variant pathogenicity model, why is the precision-recall curve often preferred over the ROC curve?

Precision-recall curves are easier to compute
They provide more insight when the positive class is rare
They require less data
They are standardized for genomic studies
Explanation - PR curves focus on positive predictions, which is critical for imbalanced datasets like pathogenic variants.
Correct answer is: They provide more insight when the positive class is rare

Q.64 What does 'L2 regularization' (ridge) do in the context of a regression model for predicting gene expression?

Encourages sparsity in the coefficients
Adds a penalty on the sum of squared coefficients to prevent overfitting
Normalizes the input features
Increases the learning rate
Explanation - L2 regularization shrinks coefficients, reducing variance and improving generalization.
Correct answer is: Adds a penalty on the sum of squared coefficients to prevent overfitting

Q.65 Which of the following best describes 'gene set enrichment analysis' (GSEA) in computational biology?

A method to identify overrepresented biological pathways in a list of genes
A technique to predict gene function from sequence motifs
A clustering method for gene expression data
A method to measure read alignment quality
Explanation - GSEA tests whether the members of a predefined gene set are concentrated at the top or bottom of a gene list ranked by association with a phenotype, indicating coordinated pathway-level changes.
Correct answer is: A method to identify overrepresented biological pathways in a list of genes

Q.66 Which of the following is an advantage of using a 'graph convolutional network' (GCN) for protein-protein interaction prediction?

It can directly process 3D structural data
It can incorporate the connectivity structure of interaction networks
It does not require any training data
It reduces the need for hyperparameter tuning
Explanation - GCNs propagate information along edges in a graph, capturing network topology relevant for PPI.
Correct answer is: It can incorporate the connectivity structure of interaction networks

Q.67 Which technique is used to transform categorical variables into numeric form for machine learning models?

Standardization
One-Hot Encoding
Normalization
Log Transformation
Explanation - One-hot encoding turns categories into binary vectors suitable for models that require numeric input.
Correct answer is: One-Hot Encoding

Q.68 What does the term 'variant calling' refer to in genomics?

Identifying genetic variants from sequencing data
Predicting gene function
Aligning sequences to a reference
Annotating regulatory regions
Explanation - Variant calling detects differences (SNPs, indels) between sample and reference genomes.
Correct answer is: Identifying genetic variants from sequencing data

Q.69 Which machine learning method is commonly used for predicting whether a gene is expressed or not?

Decision Tree
Linear Regression
Principal Component Analysis
K-Means Clustering
Explanation - Decision trees can handle classification tasks, such as predicting binary expression status.
Correct answer is: Decision Tree

Q.70 What is a 'k-mer' in DNA sequence analysis?

A specific enzyme used in sequencing
A sequence of k nucleotides
A type of genetic mutation
A statistical test for variant significance
Explanation - k-mers are substrings of length k extracted from DNA, useful for counting and pattern detection.
Correct answer is: A sequence of k nucleotides

Q.71 Which of the following is a common preprocessing step for DNA sequences before feeding them into a neural network?

One-hot encoding
Fourier transform
DNA methylation measurement
Protein structure prediction
Explanation - One-hot encoding converts each nucleotide into a binary vector for neural network input.
Correct answer is: One-hot encoding
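
A minimal sketch of one-hot encoding for DNA, assuming the column order A, C, G, T and an all-zero row for ambiguous bases such as N (one common convention, not the only one). The name `one_hot_encode` is illustrative.

```python
BASES = "ACGT"  # assumed column order for the encoding

def one_hot_encode(seq):
    """Return one 4-element binary row per nucleotide in seq."""
    encoded = []
    for base in seq.upper():
        # Bases outside A/C/G/T (for example N) become an all-zero row.
        encoded.append([1 if base == b else 0 for b in BASES])
    return encoded

for row in one_hot_encode("ACGN"):
    print(row)
```

The resulting length-by-4 matrix is the usual input shape for a 1D convolutional layer over sequence positions.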

Q.72 What does 'GC content' refer to in a DNA sequence?

The proportion of adenine and thymine bases
The proportion of guanine and cytosine bases
The total length of the sequence
The number of genes in the sequence
Explanation - GC content is calculated as (G+C)/(A+T+G+C).
Correct answer is: The proportion of guanine and cytosine bases
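
The formula in the explanation can be sketched in a few lines of Python. The helper name `gc_content` is hypothetical, and ignoring non-ACGT characters in the denominator is an assumption made here for simplicity.

```python
def gc_content(seq):
    """Return (G + C) / (A + T + G + C) for a DNA sequence."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0  # no countable bases; avoid division by zero
    return (seq.count("G") + seq.count("C")) / counted

print(gc_content("ATGC"))  # two of the four bases are G or C
```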

Q.73 Which type of neural network is particularly good at modeling sequential data like DNA?

Convolutional Neural Network
Recurrent Neural Network
Feedforward Neural Network
Autoencoder
Explanation - RNNs can handle sequences by maintaining a hidden state across positions.
Correct answer is: Recurrent Neural Network

Q.74 In a simple classification model for disease prediction, what does an 'accuracy of 80%' mean?

The model predicts 80% of samples correctly
The model has an 80% chance of being correct on any prediction
The model is 80% faster than other models
The model uses 80% of the available features
Explanation - Accuracy is the fraction of correct predictions over total predictions.
Correct answer is: The model predicts 80% of samples correctly

Q.75 Which of the following is NOT a typical input feature for predicting whether a DNA region is a promoter?

GC content
Motif presence
Gene length
Chromatin accessibility
Explanation - Promoter prediction relies on sequence and epigenomic features; gene length is not directly relevant.
Correct answer is: Gene length

Q.76 What is the main purpose of 'cross-validation' in machine learning?

To validate the data source
To test the model on the training set only
To evaluate model performance on unseen data
To reduce the dataset size
Explanation - Cross-validation splits data into training and validation sets multiple times to estimate generalization.
Correct answer is: To evaluate model performance on unseen data

Q.77 Which algorithm can be used to cluster genes based on their expression patterns?

k-means clustering
Decision tree
Linear regression
Principal component analysis
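Explanation - k-means groups genes into clusters based on the similarity of their expression profiles across samples; the other options are supervised or dimensionality-reduction methods rather than clustering algorithms.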
Correct answer is: k-means clustering

Q.78 What does a 'confusion matrix' display in a classification task?

The distribution of prediction scores
True positives, false positives, true negatives, and false negatives
The training loss over epochs
The feature importance ranking
Explanation - The confusion matrix summarizes model predictions compared to ground truth.
Correct answer is: True positives, false positives, true negatives, and false negatives
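
To make the four cells concrete, here is a small sketch that tallies them for a binary task and derives accuracy from them. The function name `confusion_counts` and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, FP, TN and FN for a binary classification task."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts

# Toy labels, invented for this example only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
cm = confusion_counts(y_true, y_pred)
accuracy = (cm["TP"] + cm["TN"]) / len(y_true)
print(cm, accuracy)
```

Accuracy is the sum of the two "correct" cells divided by the total, which is why it can hide poor performance on a rare class.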

Q.79 Which of the following is a key advantage of using a 'deep neural network' over a 'linear model' for variant effect prediction?

It is simpler to interpret
It requires less training data
It can capture non-linear relationships
It always achieves higher accuracy
Explanation - Deep networks model complex patterns, whereas linear models only capture linear trends.
Correct answer is: It can capture non-linear relationships

Q.80 In a machine learning model, what is 'overfitting'?

When the model generalizes well to new data
When the model performs poorly on training data
When the model memorizes training data and fails on new data
When the model uses too few features
Explanation - Overfitting occurs when a model captures noise in training data, reducing performance on unseen samples.
Correct answer is: When the model memorizes training data and fails on new data

Q.81 Which of the following metrics is best for evaluating a model that predicts continuous gene expression values?

Precision
Recall
Mean Squared Error (MSE)
Accuracy
Explanation - MSE measures the average squared difference between predicted and actual continuous values.
Correct answer is: Mean Squared Error (MSE)
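
A brief sketch of the MSE calculation, using plain Python lists; the name `mean_squared_error` is reused here as a local helper and the example values are invented.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented expression values: only the last prediction is off, by 2.
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Because errors are squared, a single large miss contributes far more than several small ones.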

Q.82 Which of the following is a common use of a 'convolutional neural network' (CNN) in genomics?

Predicting protein structures from amino acid sequences
Detecting DNA motifs in genomic sequences
Estimating DNA replication timing
Aligning sequencing reads to a reference
Explanation - CNNs scan sequences with filters that capture motif patterns.
Correct answer is: Detecting DNA motifs in genomic sequences

Q.83 Which of the following best describes the concept of 'attention' in transformer models applied to DNA sequences?

A method to reduce model size
A mechanism to focus on relevant positions in the sequence
A way to generate synthetic sequences
A technique for data augmentation
Explanation - Attention assigns weights to sequence positions, highlighting important information for prediction.
Correct answer is: A mechanism to focus on relevant positions in the sequence

Q.84 In a genomic deep learning pipeline, why is it important to use a 'validation set' during training?

To test the model on the same data it was trained on
To tune hyperparameters and monitor overfitting
To reduce the size of the dataset
To store the final model weights
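Explanation - A validation set is held out from training and used to compare hyperparameter settings and to detect overfitting; because it guides these choices, a separate test set is still needed for the final unbiased performance estimate.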
Correct answer is: To tune hyperparameters and monitor overfitting

Q.85 Which of the following is a typical output of a variant effect predictor tool?

The read depth at a genomic locus
The predicted impact of a variant on a protein function
The GC content of a chromosome
The 3D structure of a DNA molecule
Explanation - Variant effect predictors estimate functional consequences, such as damaging or benign.
Correct answer is: The predicted impact of a variant on a protein function

Q.86 What is the primary advantage of using a 'graph neural network' for modeling protein-protein interactions?

It can directly use 3D structures of proteins
It can capture the network topology of interactions
It requires no training data
It reduces the need for GPU acceleration
Explanation - Graph neural networks propagate information along edges, capturing interaction patterns.
Correct answer is: It can capture the network topology of interactions

Q.87 Which of the following metrics measures the trade-off between false positives and false negatives in a binary classification model?

Accuracy
Precision
Recall
Area Under the ROC Curve (AUC)
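Explanation - The ROC curve plots true-positive rate against false-positive rate at every decision threshold, and AUC summarizes that trade-off in a single number; accuracy, precision, and recall are each computed at one fixed threshold.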
Correct answer is: Area Under the ROC Curve (AUC)

Q.88 Which type of deep learning architecture is best suited for modeling long-range dependencies in DNA sequences?

Convolutional Neural Network
Recurrent Neural Network
Transformer
Feedforward Neural Network
Explanation - Transformers use self-attention to capture relationships across long sequences.
Correct answer is: Transformer

Q.89 In a supervised learning model for gene expression prediction, which of the following is a hyperparameter that can be tuned?

Learning rate
Read length
Base quality score
Sequencing platform
Explanation - Learning rate controls how quickly the model updates its weights during training.
Correct answer is: Learning rate

Q.90 Which of the following best describes 'batch normalization' in neural networks?

A technique to regularize model weights
A method to accelerate training by normalizing layer inputs
A way to reduce the number of layers
A technique to encode DNA sequences
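Explanation - Batch normalization normalizes each layer's inputs over a mini-batch, which was originally motivated as reducing internal covariate shift and in practice stabilizes and speeds up training.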
Correct answer is: A method to accelerate training by normalizing layer inputs

Q.91 Why is 'dropout' used during training of deep learning models?

To speed up training
To reduce the number of training samples
To prevent overfitting by randomly disabling neurons
To increase the number of parameters
Explanation - Dropout forces the network to learn redundant representations, improving generalization.
Correct answer is: To prevent overfitting by randomly disabling neurons

Q.92 Which of the following is a common data augmentation technique for DNA sequences?

Adding random noise to expression values
Generating reverse complement sequences
Shuffling gene labels
Duplicating the entire dataset
Explanation - Reverse complements preserve sequence meaning and expand training data.
Correct answer is: Generating reverse complement sequences
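
A minimal sketch of reverse-complement augmentation, assuming sequences contain only A, C, G and T (other characters pass through unchanged here). The names `reverse_complement` and `augment` are illustrative.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the strand direction."""
    return seq.translate(COMPLEMENT)[::-1]

def augment(sequences):
    """Return the original sequences followed by their reverse complements."""
    return list(sequences) + [reverse_complement(s) for s in sequences]

print(augment(["ATGC", "AAGT"]))
```

Each augmented sequence keeps the label of its original, since both strands describe the same double-stranded locus.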

Q.93 What does 'cross-entropy loss' measure in classification tasks?

The difference between predicted and actual class probabilities
The absolute error of predictions
The proportion of correct predictions
The distance between two distributions
Explanation - Cross-entropy quantifies how well the predicted probability distribution matches the true distribution.
Correct answer is: The difference between predicted and actual class probabilities
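
As a hedged illustration, cross-entropy for a single sample can be computed as below. The helper `cross_entropy` and the small `eps` clamp (used to avoid taking the log of zero) are choices made for this sketch.

```python
import math

def cross_entropy(true_probs, pred_probs, eps=1e-12):
    """Return -sum(t * log(p)) over classes for one sample."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(true_probs, pred_probs))

one_hot_label = [0, 1, 0]  # true class is index 1
print(cross_entropy(one_hot_label, [0.1, 0.8, 0.1]))  # confident and correct
print(cross_entropy(one_hot_label, [0.8, 0.1, 0.1]))  # confident and wrong
```

With a one-hot label the sum reduces to the negative log of the probability assigned to the true class, so confident wrong predictions are penalized heavily.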

Q.94 Which of the following is a commonly used feature for predicting transcription factor binding sites?

GC content only
DNA shape features
Gene expression levels
Protein tertiary structure
Explanation - DNA shape captures physical properties influencing transcription factor binding.
Correct answer is: DNA shape features

Q.95 In the context of genomic data, what is the purpose of 'normalization' of read counts?

To convert counts into percentages
To correct for differences in sequencing depth
To adjust for GC bias only
To reduce the number of reads
Explanation - Normalization ensures comparability across samples by adjusting for varying library sizes.
Correct answer is: To correct for differences in sequencing depth
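
One simple depth correction is counts per million (CPM), sketched below. This is only one of several normalization schemes, and the gene names and counts are invented for the example.

```python
def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}

# Invented samples: same composition, but sample_b was sequenced 10x deeper.
sample_a = {"geneA": 10, "geneB": 30}
sample_b = {"geneA": 100, "geneB": 300}
print(counts_per_million(sample_a))
print(counts_per_million(sample_b))
```

After scaling, both samples report the same values, showing that the difference in raw counts came from library size rather than biology.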

Q.96 Which of the following best describes 'transfer learning' in deep learning?

Using a pre-trained model as a starting point for a new task
Transferring data between species
Moving the entire pipeline to a different computer
Transferring the learning rate
Explanation - Transfer learning fine-tunes a model pre-trained on a related task to accelerate learning.
Correct answer is: Using a pre-trained model as a starting point for a new task

Q.97 Which algorithm is most appropriate for clustering single-cell RNA-seq data into distinct cell types?

k-means
Hierarchical clustering
t-SNE followed by DBSCAN
Linear regression
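Explanation - t-SNE reduces the high-dimensional expression matrix to a few dimensions, and DBSCAN then identifies density-based clusters without requiring the number of clusters to be specified in advance.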
Correct answer is: t-SNE followed by DBSCAN

Q.98 In variant calling, what does the term 'genotype likelihood' refer to?

The probability of a variant being pathogenic
The probability of the observed sequencing reads given a genotype
The likelihood of the reference genome
The cost of genotyping
Explanation - Genotype likelihoods quantify how well each candidate genotype explains the observed reads, i.e. P(reads | genotype); combining them with a prior yields the posterior genotype probability.
Correct answer is: The probability of the observed sequencing reads given a genotype

Q.99 Which of the following is a key challenge when applying deep learning to large-scale genomic data?

Small dataset size
Lack of GPU hardware
Limited memory for long sequences
High interpretability of models
Explanation - Long input sequences produce large intermediate activations (growing quadratically with length for self-attention), which can exceed the memory available on typical GPUs.
Correct answer is: Limited memory for long sequences

Q.100 Which type of model is often used for predicting whether a DNA region is a promoter?

Support Vector Machine with a string kernel
Linear regression on gene length
Decision tree on variant counts
Clustering on gene expression
Explanation - String kernels allow SVMs to capture sequence patterns characteristic of promoters.
Correct answer is: Support Vector Machine with a string kernel

Q.101 What is the main purpose of 'feature selection' in genomic machine learning pipelines?

To increase the number of features
To reduce overfitting by selecting informative features
To convert categorical features to numerical
To compute read depths
Explanation - Feature selection removes irrelevant or redundant features, improving model performance.
Correct answer is: To reduce overfitting by selecting informative features

Q.102 Which of the following is NOT a typical input for a model predicting enhancer activity?

DNA sequence motifs
Chromatin accessibility data
Gene expression levels of neighboring genes
CpG methylation status
Explanation - Enhancer activity predictions rely on sequence and epigenomic features, not neighboring gene expression.
Correct answer is: Gene expression levels of neighboring genes

Q.103 What is the primary role of the 'softmax' activation function in a multi-class classification neural network?

To enforce sparsity in weights
To produce a probability distribution over classes
To regularize the model
To accelerate training
Explanation - Softmax maps raw logits to a vector summing to one, representing class probabilities.
Correct answer is: To produce a probability distribution over classes
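
A small sketch of softmax, written with the common max-subtraction trick for numerical stability; the function name and example logits are illustrative.

```python
import math

def softmax(logits):
    """Map raw scores to non-negative values that sum to one."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest logit always receives the largest probability, so the predicted class is unchanged by the transformation.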

Q.104 Which of the following is a common way to evaluate a model that predicts disease risk from genomic data?

Mean Absolute Error
ROC AUC
Precision-Recall curve
All of the above
Explanation - All listed metrics can be relevant depending on the task and data balance.
Correct answer is: All of the above

Q.105 In the context of CRISPR-Cas9 editing, what is the 'off-target effect'?

The intended editing of the target gene
Unintended edits at genomic sites other than the target
The efficiency of on-target editing
The rate of off-target delivery
Explanation - Off-target effects occur when the Cas9 complex cuts unintended sites, causing unwanted mutations.
Correct answer is: Unintended edits at genomic sites other than the target

Q.106 Which of the following best describes the 'learning rate' hyperparameter?

The speed at which the model's weights are updated
The number of epochs to run
The ratio of training to validation data
The number of hidden units in a layer
Explanation - Learning rate controls the step size in gradient descent optimization.
Correct answer is: The speed at which the model's weights are updated

Q.107 What does the 'Adam' optimizer do differently compared to standard stochastic gradient descent (SGD)?

It uses a fixed learning rate
It adapts learning rates for each parameter using momentum and RMSprop
It does not update weights
It requires fewer hyperparameters
Explanation - Adam combines momentum and adaptive learning rates to accelerate convergence.
Correct answer is: It adapts learning rates for each parameter using momentum and RMSprop

Q.108 Which of the following is a typical output of a genome annotation tool like Ensembl?

Protein tertiary structures
Gene models with exon-intron boundaries
Chromosome images
Variant frequencies only
Explanation - Annotation tools provide gene predictions, including exon-intron structure.
Correct answer is: Gene models with exon-intron boundaries

Q.109 In a variant classification task, why might you prefer the F1-score over accuracy?

Accuracy is more sensitive to class imbalance
F1-score focuses on the minority class
F1-score is easier to compute
Accuracy is only for regression tasks
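Explanation - F1 is the harmonic mean of precision and recall on the positive (usually minority) class, so unlike accuracy it is not inflated by a large number of correctly predicted majority-class samples.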
Correct answer is: F1-score focuses on the minority class
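
The contrast can be sketched with a toy imbalanced dataset, invented here: 5 pathogenic variants among 100, and a model that predicts "benign" for everything. The helper name `accuracy_and_f1` is hypothetical.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 for the positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
    return accuracy, f1

y_true = [1] * 5 + [0] * 95   # 5 positives, 95 negatives (toy data)
y_pred = [0] * 100            # model always predicts the majority class
print(accuracy_and_f1(y_true, y_pred))
```

Accuracy looks high because the majority class dominates, while F1 drops to zero because no positive case is ever recovered.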

Q.110 What is the primary goal of the 'k-mer counting' step in a de Bruijn graph assembly pipeline?

To identify sequencing errors
To create nodes in the graph representing k-length sequences
To compute GC content
To estimate the number of genes
Explanation - k-mer counts form the basis of de Bruijn graphs used for genome assembly.
Correct answer is: To create nodes in the graph representing k-length sequences
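
A simplified sketch of the idea, following the convention used in this question where nodes are k-mers and an edge joins two k-mers that appear consecutively (overlapping by k-1 bases). Many assemblers instead use (k-1)-mers as nodes and k-mers as edges; the function name `de_bruijn_graph` is illustrative.

```python
from collections import defaultdict

def de_bruijn_graph(seq, k):
    """Build an adjacency map: k-mer -> set of k-mers that follow it."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    graph = defaultdict(set)
    for left, right in zip(kmers, kmers[1:]):
        graph[left].add(right)  # consecutive k-mers share k-1 bases
    for kmer in kmers:
        graph.setdefault(kmer, set())  # keep nodes with no outgoing edge
    return dict(graph)

print(de_bruijn_graph("ATGGC", 3))
```

Real pipelines build the graph from many reads and use k-mer counts to discard low-frequency k-mers that are likely sequencing errors.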

Q.111 Which of the following best describes a 'negative binomial' distribution in the context of RNA-seq differential expression analysis?

A distribution that models continuous data
A distribution that captures overdispersion in count data
A distribution that assumes equal variance and mean
A distribution for binary outcomes
Explanation - RNA-seq counts often show variance greater than the mean, modeled by a negative binomial.
Correct answer is: A distribution that captures overdispersion in count data

Q.112 In a machine learning model, what does 'regularization' aim to achieve?

Increase model complexity
Prevent overfitting by adding a penalty
Speed up training by removing layers
Reduce the dataset size
Explanation - Regularization discourages large weights, helping models generalize better.
Correct answer is: Prevent overfitting by adding a penalty

Q.113 Which of the following is a commonly used metric to evaluate a regression model's performance?

Area Under the ROC Curve
Mean Absolute Error
Precision
Recall
Explanation - MAE measures the average absolute difference between predictions and ground truth.
Correct answer is: Mean Absolute Error

Q.114 Which of the following is a typical input feature for predicting DNA methylation levels?

GC content
DNA shape features
Protein folding energy
Gene expression in other species
Explanation - DNA shape captures structural properties that influence methylation patterns.
Correct answer is: DNA shape features

Q.115 What is the function of a 'mask' in transformer models for DNA sequence analysis?

To hide future positions during training for autoregressive prediction
To reduce the dimensionality of the input
To prevent overfitting
To speed up computation
Explanation - Masking prevents the model from accessing future tokens when predicting the next token.
Correct answer is: To hide future positions during training for autoregressive prediction
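
A minimal sketch of a causal (look-ahead) mask as a nested list, where 1 means "may attend" and 0 means "hidden". The 1/0 convention and the name `causal_mask` are assumptions for this example; frameworks differ in how masks are represented.

```python
def causal_mask(n):
    """Row i marks which positions j (j <= i) position i may attend to."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(row)
```

The lower-triangular pattern ensures that the prediction for position i depends only on positions up to and including i.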

Q.116 Which of the following is a common method for detecting differential methylation between two conditions?

DESeq2
limma
edgeR
DSS
Explanation - DSS is specifically designed for differential methylation analysis.
Correct answer is: DSS

Q.117 In the context of variant calling, what does a 'phred quality score' indicate?

The probability that a variant is false
The confidence in a base call, calculated as -10*log10(error probability)
The coverage depth at a genomic position
The GC content of a region
Explanation - Phred scores translate error probabilities into a standardized scale.
Correct answer is: The confidence in a base call, calculated as -10*log10(error probability)
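
The Phred formula in the answer can be sketched in both directions; the helper names below are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Invert Q = -10 * log10(p): return p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Return Q = -10 * log10(p) for an error probability 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)

for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

Each increase of 10 in the score corresponds to a tenfold lower error probability, so Q30 means roughly one error in 1000 base calls.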

Q.118 Which of the following best describes the 'AUC-PR' curve?

A curve plotting Accuracy vs. Precision
A curve plotting Recall vs. Precision for varying thresholds
A curve plotting ROC points only for high sensitivity
A curve plotting True Positive vs. False Negative rates
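Explanation - The precision-recall curve plots precision against recall as the decision threshold varies; the area under it (AUC-PR) summarizes performance across thresholds and is especially useful for imbalanced data.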
Correct answer is: A curve plotting Recall vs. Precision for varying thresholds

Q.119 What is the primary benefit of using a 'convolutional autoencoder' for genomic sequence compression?

It generates synthetic variants
It learns a compact representation of sequences
It speeds up alignment
It reduces sequencing errors
Explanation - Autoencoders compress data into a latent space and reconstruct it, providing efficient encodings.
Correct answer is: It learns a compact representation of sequences

Q.120 Which of the following is a key advantage of using a 'graph attention network' (GAT) for protein interaction prediction?

It captures the importance of neighboring nodes with attention weights
It requires no graph structure
It reduces the number of training epochs
It is always more interpretable than other models
Explanation - GATs weight contributions of neighbors differently, improving prediction of interactions.
Correct answer is: It captures the importance of neighboring nodes with attention weights

Q.121 Which of the following is a typical preprocessing step when preparing gene expression data for machine learning?

Log transformation of counts
Removing all genes with low expression
Standard scaling
All of the above
Explanation - Gene expression preprocessing often includes log transform, filtering, and scaling.
Correct answer is: All of the above

Q.122 Which metric is most appropriate for evaluating the performance of a multi-class classifier when classes are imbalanced?

Overall accuracy
Macro-averaged F1-score
Micro-averaged F1-score
AUC-ROC for each class
Explanation - Macro-averaged F1 gives equal weight to each class, highlighting performance on minority classes.
Correct answer is: Macro-averaged F1-score

Q.123 What is the main purpose of using a 'pseudo-label' in semi-supervised learning?

To label unlabeled data based on model predictions
To remove noise from labeled data
To reduce the training dataset size
To increase the number of output classes
Explanation - Pseudo-labels provide temporary labels for unlabeled data, enabling semi-supervised training.
Correct answer is: To label unlabeled data based on model predictions

Q.124 Which of the following is NOT a common evaluation metric for a disease risk prediction model?

Accuracy
Precision
Recall
Coefficient of Determination (R^2)
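Explanation - R^2 is a regression metric that measures explained variance, whereas disease risk models are usually framed as classification and evaluated with accuracy, precision, and recall.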
Correct answer is: Coefficient of Determination (R^2)

Q.125 Which of the following best describes the purpose of a 'liftover' tool in genomics?

To convert sequencing reads to a different format
To map coordinates from one reference assembly to another
To align reads to a reference genome
To call variants from aligned reads
Explanation - Liftover translates genomic positions between assemblies like hg19 to hg38.
Correct answer is: To map coordinates from one reference assembly to another

Q.126 Which of the following is a common type of genetic variant?

Single Nucleotide Polymorphism (SNP)
Gene expression level
Protein folding energy
Chromatin accessibility
Explanation - SNPs are single-base changes in the genome.
Correct answer is: Single Nucleotide Polymorphism (SNP)

Q.127 Which of the following is a key benefit of using a 'deep neural network' over a 'linear regression' for predicting gene expression from genomic features?

It can capture non-linear relationships
It requires less data
It is always more interpretable
It uses fewer parameters
Explanation - Deep networks model complex, non-linear interactions among genomic features.
Correct answer is: It can capture non-linear relationships

Q.128 What is the main goal of a 'variant effect prediction' model?

To predict whether a variant is likely to affect protein function
To count the number of variants in a genome
To align reads to a reference genome
To measure GC content across the genome
Explanation - Variant effect predictors estimate functional impact on genes or proteins.
Correct answer is: To predict whether a variant is likely to affect protein function

Q.129 Which of the following techniques is used to reduce overfitting in neural networks?

Increasing the number of epochs
Adding more layers
Dropout
Using a larger batch size
Explanation - Dropout randomly disables neurons during training, preventing memorization.
Correct answer is: Dropout

Q.130 Which of the following best describes 'cross-validation' in machine learning?

Using the entire dataset for training only
Splitting data into training and validation sets multiple times to estimate performance
Using a single training and test split
Performing hyperparameter tuning on a fixed subset
Explanation - Cross-validation provides a robust estimate of generalization by cycling through data splits.
Correct answer is: Splitting data into training and validation sets multiple times to estimate performance
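
The repeated splitting can be sketched as a generator of index sets. This is a bare-bones version without shuffling or stratification, and the name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # deal indices round-robin
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

for train, validation in k_fold_indices(6, 3):
    print("train:", train, "validation:", validation)
```

Every sample is used for validation exactly once, and the k validation scores are typically averaged to estimate generalization.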

Q.131 Which of the following is a common output of a genome assembly pipeline?

A list of annotated genes
A set of contiguous sequences (contigs)
A heatmap of methylation levels
A predicted protein structure
Explanation - Assembly yields contigs or scaffolds representing the genome.
Correct answer is: A set of contiguous sequences (contigs)

Q.132 Which of the following best describes the 'ReLU' activation function?

It outputs the exponential of its input
It returns 0 for negative inputs and the input for positive inputs
It normalizes its input
It reduces overfitting
Explanation - ReLU (Rectified Linear Unit) is a simple non-linear activation that mitigates vanishing gradients.
Correct answer is: It returns 0 for negative inputs and the input for positive inputs
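
For reference, ReLU on a single value can be written in one line; the sketch below applies it element-wise to a list of invented inputs.

```python
def relu(x):
    """Return x for positive inputs and 0.0 otherwise."""
    return x if x > 0 else 0.0

print([relu(v) for v in [-2.0, -0.5, 0.0, 1.5, 3.0]])
```

For positive inputs the gradient is exactly 1, which is the property behind its reputation for mitigating vanishing gradients.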

Q.133 What does 'GC content' measure in a DNA sequence?

The proportion of adenine and thymine
The proportion of guanine and cytosine
The total length of the sequence
The number of genes present
Explanation - GC content is calculated as (G+C) divided by total bases.
Correct answer is: The proportion of guanine and cytosine

Q.134 In a supervised learning setting, what is the role of the 'loss function'?

To determine the learning rate
To quantify the difference between predicted and true values
To encode input data
To generate synthetic data
Explanation - The loss function measures prediction error, guiding weight updates during training.
Correct answer is: To quantify the difference between predicted and true values

Q.135 Which of the following is a common preprocessing step for RNA-seq data before machine learning?

Normalization of read counts (e.g., TPM, RPKM)
Trimming of adapter sequences
Imputing missing data
All of the above
Explanation - RNA-seq preprocessing includes trimming, normalization, and imputation as needed.
Correct answer is: All of the above

Q.136 In deep learning for genomics, which of the following is a reason to use 'transfer learning'?

To avoid overfitting by using a simpler model
To reduce the training time by reusing a pre-trained model
To increase the number of features
To change the learning rate
Explanation - Transfer learning leverages learned representations from a related task, speeding training.
Correct answer is: To reduce the training time by reusing a pre-trained model