Computational Biology: MCQs Practice Set

Q.1 In a BLAST search, which parameter most directly affects the stringency of matches?

Word size
Gap penalty
E-value threshold
Database size
Explanation - The E-value represents the number of matches expected by chance; lowering it increases stringency.
Correct answer is: E-value threshold

Q.2 What type of data is represented by a FASTQ file?

Protein structure coordinates
RNA expression levels
DNA sequences with quality scores
Mass spectrometry spectra
Explanation - FASTQ files store sequence reads and a corresponding quality score for each base.
Correct answer is: DNA sequences with quality scores
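
Worked example - A minimal sketch of reading FASTQ records, assuming the strict four-line layout (header, sequence, '+' separator, quality string) with no wrapped lines. The function name `read_fastq` and the two sample reads are made up for illustration.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"Expected '@' at start of header, got: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("Sequence and quality strings differ in length")
        yield header[1:], sequence, quality


# Two hypothetical reads, each with one quality character per base.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n"
records = list(read_fastq(StringIO(example)))
```

Each base has exactly one quality character, which is why the sketch rejects records where the two strings differ in length.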

Q.3 Which of the following best describes a gene ontology (GO) term?

A unique DNA sequence identifier
A functional annotation describing biological processes, cellular components, or molecular functions
A file format for storing protein sequences
An algorithm used for sequence alignment
Explanation - GO terms provide standardized descriptions of gene products' roles in biology.
Correct answer is: A functional annotation describing biological processes, cellular components, or molecular functions

Q.4 Which programming language is most commonly used for statistical analysis in bioinformatics?

Java
Python
R
C++
Explanation - R offers extensive packages for statistical computing and bioinformatics analyses.
Correct answer is: R

Q.5 In a phylogenetic tree constructed using the neighbor‑joining method, what does a branch length typically represent?

Number of species
Genetic distance
Time of divergence in years
Mutation rate per site
Explanation - Branch lengths in NJ trees reflect the amount of sequence change accumulated between nodes.
Correct answer is: Genetic distance

Q.6 Which of the following is NOT a typical use of a GPU in bioinformatics?

Accelerating sequence alignment
Real‑time protein folding simulations
Storing large genomic databases
Running deep learning models
Explanation - GPUs are used for parallel computation, not for primary storage of data.
Correct answer is: Storing large genomic databases

Q.7 Which algorithm is commonly used to find the most probable path of hidden states in a Hidden Markov Model for gene prediction?

Viterbi algorithm
Dijkstra algorithm
K‑means clustering
Principal component analysis
Explanation - The Viterbi algorithm calculates the most likely sequence of hidden states given observed data.
Correct answer is: Viterbi algorithm
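
Worked example - A small sketch of the Viterbi algorithm in log space. The two-state "exon"/"intron" model and all of its probabilities are hypothetical toy values chosen for illustration, not parameters from any real gene finder; the sketch also assumes every probability is non-zero.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observations."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: exons are GC-rich, introns are AT-rich.
states = ["exon", "intron"]
start_p = {"exon": 0.5, "intron": 0.5}
trans_p = {"exon": {"exon": 0.9, "intron": 0.1},
           "intron": {"exon": 0.1, "intron": 0.9}}
emit_p = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

path = viterbi("GGCCAATT", states, start_p, trans_p, emit_p)
```

With these toy values the decoded path switches from "exon" to "intron" where the sequence changes from GC-rich to AT-rich.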

Q.8 What is the primary purpose of a Markov chain Monte Carlo (MCMC) method in phylogenetics?

To find the shortest path between two sequences
To estimate posterior probabilities of phylogenetic trees
To align multiple protein sequences
To filter out low‑quality reads
Explanation - MCMC samples tree space according to its probability under a statistical model.
Correct answer is: To estimate posterior probabilities of phylogenetic trees

Q.9 Which of the following best describes the term 'read depth' in next‑generation sequencing?

Number of reads that overlap a particular genomic position
Average length of sequencing reads
Total number of samples processed
Number of unique DNA molecules sequenced
Explanation - Read depth indicates coverage; higher depth increases confidence in base calls.
Correct answer is: Number of reads that overlap a particular genomic position

Q.10 Which deep learning architecture is most suitable for predicting protein secondary structure from amino acid sequences?

Convolutional neural network (CNN)
Recurrent neural network (RNN)
Generative adversarial network (GAN)
Autoencoder
Explanation - RNNs capture sequential dependencies in amino acid chains, which is essential for secondary structure prediction.
Correct answer is: Recurrent neural network (RNN)

Q.11 What does the 'E' in the FASTQ header line represent?

Enzyme name
Sequence identifier
Quality score string
Experiment number
Explanation - The header starts with '@' followed by the read ID; 'E' is part of the identifier string.
Correct answer is: Sequence identifier

Q.12 Which file format is used to represent three‑dimensional protein structures?

FASTA
WIG
PDB
GTF
Explanation - PDB (Protein Data Bank) files store atomic coordinates of protein structures.
Correct answer is: PDB

Q.13 In the context of genome assembly, what is a 'contig'?

A single DNA fragment sequenced in one read
An overlapping set of reads that has been assembled into a continuous sequence
The reference genome used for alignment
A type of sequencing error
Explanation - Contigs are contiguous sequences resulting from overlapping reads.
Correct answer is: An overlapping set of reads that has been assembled into a continuous sequence

Q.14 Which of these techniques is primarily used to identify differentially expressed genes between two conditions?

RNA‑Seq differential expression analysis (e.g., DESeq2)
Chromatin immunoprecipitation (ChIP‑Seq)
Mass spectrometry proteomics
Microscopy imaging
Explanation - DESeq2 and similar tools analyze count data from RNA‑Seq to find genes with significant expression changes.
Correct answer is: RNA‑Seq differential expression analysis (e.g., DESeq2)

Q.15 What is the primary function of the BLAST algorithm’s word size parameter?

To determine the length of the query sequence
To control the initial seed length for matches
To set the maximum number of mismatches allowed
To define the output format
Explanation - Word size defines the length of exact matches that initiate further alignment extension.
Correct answer is: To control the initial seed length for matches

Q.16 In the context of bioinformatics pipelines, what is a 'workflow manager'?

A software tool that automates execution of tasks and tracks dependencies
A type of database for storing sequencing data
A hardware device that accelerates alignment algorithms
A programming language for writing bioinformatics scripts
Explanation - Workflow managers (e.g., Snakemake, Nextflow) orchestrate complex analysis pipelines.
Correct answer is: A software tool that automates execution of tasks and tracks dependencies

Q.17 Which of the following best describes an SNP (single nucleotide polymorphism)?

A large insertion or deletion
A variation at a single base position among individuals
A type of repetitive DNA element
A gene duplication event
Explanation - SNPs are common genetic variants involving a single nucleotide change.
Correct answer is: A variation at a single base position among individuals

Q.18 What is the purpose of a k‑mer index in sequence alignment?

To reduce the search space by mapping short substrings to positions
To compress the entire genome into a single number
To annotate functional elements in the genome
To perform quality control of sequencing reads
Explanation - K‑mer indices allow quick lookup of potential alignment anchors.
Correct answer is: To reduce the search space by mapping short substrings to positions
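
Worked example - A sketch of a k‑mer index as a plain dictionary from each k‑mer to its start positions, and of how exact k‑mer hits suggest candidate alignment positions for a read. The function names and the short reference string are illustrative assumptions; real aligners use far more compact index structures.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of its 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_positions(read, index, k):
    """Return reference start positions implied by exact k-mer hits of the read."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            # Shift the hit back by the k-mer's offset within the read.
            candidates.add(pos - offset)
    return sorted(candidates)


reference = "ACGTACGTGG"
index = build_kmer_index(reference, 4)
hits = candidate_positions("CGTG", index, 4)
```

Only the candidate positions need to be checked by a full alignment, which is how the index reduces the search space.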

Q.19 Which algorithm is commonly used to compute the edit distance between two sequences?

Hamming distance
Levenshtein distance
Jaccard index
Euclidean distance
Explanation - Levenshtein distance counts the minimum number of insertions, deletions, and substitutions needed.
Correct answer is: Levenshtein distance
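
Worked example - A sketch of the Levenshtein distance using the standard dynamic-programming recurrence with unit costs for insertion, deletion, and substitution. Only two rows of the table are kept in memory.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


distance = levenshtein("kitten", "sitting")
```

Hamming distance, by contrast, counts substitutions only and is defined just for sequences of equal length.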

Q.20 In machine learning for bioinformatics, what does 'cross‑validation' help prevent?

Overfitting
Data loss
Under‑sampling
Missing values
Explanation - Cross‑validation assesses model generalizability to unseen data, reducing overfitting.

Correct answer is: Overfitting

Q.21 Which of these is an example of a structural bioinformatics task?

Predicting a protein's tertiary structure from its amino‑acid sequence
Identifying regulatory motifs in promoter regions
Quantifying gene expression levels
Designing a new microprocessor
Explanation - Structural bioinformatics focuses on the 3‑D arrangement of biomolecules.
Correct answer is: Predicting a protein's tertiary structure from its amino‑acid sequence

Q.22 What does 'coverage' refer to in sequencing data?

The proportion of the genome successfully sequenced
The number of times a base is sequenced on average
The number of samples in a study
The depth of the sequencing machine
Explanation - Coverage (depth) indicates how many reads overlap a given base position.
Correct answer is: The number of times a base is sequenced on average

Q.23 Which of the following best describes the use of a 'seed file' in BLAST searches?

To store a pre‑computed alignment matrix
To provide a set of reference sequences for comparison
To specify the output format
To define the word size
Explanation - The seed file is the database against which query sequences are compared.
Correct answer is: To provide a set of reference sequences for comparison

Q.24 Which data format is used to represent gene models and annotations in genome browsers?

BED
SAM
VCF
WIG
Explanation - BED files store genomic intervals and annotation information for browsers.
Correct answer is: BED

Q.25 What is the primary advantage of using an FPGA for bioinformatics computations?

High storage capacity
Parallel processing with low power consumption
Fast floating‑point calculations
Ease of programming
Explanation - FPGAs allow custom hardware pipelines that can be highly parallel and energy‑efficient.
Correct answer is: Parallel processing with low power consumption

Q.26 Which of these is a key metric for evaluating a clustering algorithm in omics data?

Silhouette score
E-value
Read depth
Gap penalty
Explanation - The silhouette score measures how similar an object is to its own cluster compared to others.
Correct answer is: Silhouette score

Q.27 What does the 'GC content' of a DNA sequence represent?

The proportion of guanine and cytosine bases among all nucleotides in the sequence
The number of genes in a chromosome
The frequency of methylated cytosines
The total length of the sequence
Explanation - GC content is calculated as the proportion of G and C nucleotides in the sequence, usually reported as a percentage.
Correct answer is: The proportion of guanine and cytosine bases among all nucleotides in the sequence
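
Worked example - A sketch of computing GC content as the proportion of G and C bases, as the explanation describes. The function name `gc_content` is an illustrative choice, and returning 0.0 for an empty sequence is an assumption made to avoid division by zero.

```python
def gc_content(sequence):
    """Return the fraction of bases that are G or C (case-insensitive)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0  # assumption: treat an empty sequence as 0 rather than raising
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


fraction = gc_content("ATGC")
percent = fraction * 100
```

Multiplying the fraction by 100 gives the percentage form that most tools report.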

Q.28 Which of the following is NOT a typical step in a de novo genome assembly pipeline?

Read error correction
Contig graph construction
Reference genome alignment
Scaffolding
Explanation - De novo assembly does not rely on a reference genome; it constructs the genome from scratch.
Correct answer is: Reference genome alignment

Q.29 What is the role of a 'masker' tool in genomic analysis?

To identify low‑complexity and repeat regions in sequences
To compress genomic data
To predict protein function
To assemble transcriptomes
Explanation - Masker flags or hides repetitive elements to prevent spurious alignments.
Correct answer is: To identify low‑complexity and repeat regions in sequences

Q.30 In the context of RNA‑Seq, what does a 'TPM' value indicate?

Total read pairs mapped to a transcript
Transcripts per million, a normalized expression metric
The number of transcripts in a pathway
The total number of mapped reads
Explanation - TPM accounts for transcript length and sequencing depth, enabling comparison across samples.
Correct answer is: Transcripts per million, a normalized expression metric
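
Worked example - A sketch of the TPM calculation: divide each count by the transcript length in kilobases, then rescale so the values sum to one million. The counts and lengths below are made-up numbers, and the sketch assumes at least one non-zero count.

```python
def tpm(counts, lengths_bp):
    """Convert read counts to transcripts per million, given transcript lengths in bp."""
    # Step 1: length normalization (reads per kilobase).
    rates = [count / (length / 1000) for count, length in zip(counts, lengths_bp)]
    # Step 2: depth normalization so the values sum to 1e6.
    total = sum(rates)
    return [rate / total * 1_000_000 for rate in rates]


# Hypothetical sample: counts scale with length, so all three TPM values are equal.
values = tpm([10, 20, 30], [1000, 2000, 3000])
```

Because every sample sums to the same total, TPM values for a transcript can be compared across samples as shares of the expressed pool.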

Q.31 Which of the following best describes a 'variant call file' (VCF)?

A file listing all possible mutations in a genome
A format for storing identified genetic variants and their annotations
A list of sequencing reads
A database of protein structures
Explanation - VCF contains variant positions, genotypes, and optional annotations.
Correct answer is: A format for storing identified genetic variants and their annotations

Q.32 What is the main benefit of using a cloud‑based HPC cluster for bioinformatics?

Unlimited on‑premises storage
On‑demand scalability and parallel computing resources
Lower initial hardware cost
Always-on local backups
Explanation - Cloud HPC allows dynamic scaling to meet the high computational demands of large datasets.
Correct answer is: On‑demand scalability and parallel computing resources

Q.33 Which algorithm is commonly used for multiple sequence alignment in bioinformatics?

Smith‑Waterman
ClustalW
BFS
A*
Explanation - ClustalW performs progressive multiple sequence alignment with heuristic methods.
Correct answer is: ClustalW

Q.34 In phylogenetics, what does the term 'bootstrap value' signify?

The confidence level of a particular branch in a phylogenetic tree
The number of bootstrap replicates performed
The number of base substitutions
The length of the alignment
Explanation - Bootstrap values estimate how often a branch appears in resampled datasets.
Correct answer is: The confidence level of a particular branch in a phylogenetic tree

Q.35 Which of the following tools is used for aligning RNA‑Seq reads to a reference genome?

Bowtie
STAR
BWA
MAFFT
Explanation - STAR is designed for rapid alignment of RNA‑Seq data, handling spliced reads effectively.
Correct answer is: STAR

Q.36 What does the name 'FASTA' stand for?

Fast Alignment Sequence Text
Fast All Sequencing Analysis
FAST-All
Fast Assembly
Explanation - The name comes from the FASTA ("FAST-All") alignment program, a successor to FASTP (protein) and FASTN (nucleotide) that works with both alphabets. The text‑based format for nucleotide or peptide sequences takes its name from that software.
Correct answer is: FAST-All

Q.37 Which of these is a common source of bias in high‑throughput sequencing?

GC‑rich sequence representation
Temperature fluctuations in the lab
Human error in pipetting
Use of outdated software
Explanation - GC‑rich regions often have lower sequencing coverage due to PCR bias.
Correct answer is: GC‑rich sequence representation

Q.38 What is the purpose of the 'Gap penalty' in sequence alignment?

To penalize insertions or deletions to reduce over‑extension of gaps
To set the maximum number of mismatches allowed
To adjust the quality score of reads
To define the length of the alignment
Explanation - Gap penalties discourage excessive or unrealistic gaps in alignments.
Correct answer is: To penalize insertions or deletions to reduce over‑extension of gaps
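
Worked example - A sketch of how a gap penalty enters a global alignment score, using the Needleman‑Wunsch recurrence with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -2) are arbitrary illustrative defaults, and only the score is computed, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch optimal global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]  # aligning "" against b[:j]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against ""
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal,           # align a[i-1] with b[j-1]
                            prev[j] + gap,      # gap in b
                            curr[j - 1] + gap)) # gap in a
        prev = curr
    return prev[-1]


mild = global_alignment_score("ACGT", "AGT", gap=-2)
harsh = global_alignment_score("ACGT", "AGT", gap=-5)
```

The same pair of sequences scores lower under the harsher penalty, which is how the parameter discourages alignments that rely on gaps.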

Q.39 In a gene‑prediction pipeline, what is a 'spliced alignment'?

Alignment that takes into account exon‑intron boundaries
Alignment of two protein sequences
Alignment that includes post‑translational modifications
Alignment used only for mitochondrial genomes
Explanation - Spliced alignment aligns cDNA or RNA‑Seq data to a reference genome, skipping introns.
Correct answer is: Alignment that takes into account exon‑intron boundaries

Q.40 Which of the following best describes a 'phred score' in sequencing?

Quality score representing the probability of an incorrect base call
Score indicating read length
Metric for GC content
Number of mismatches in an alignment
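Explanation - Phred scores are logarithmic (Q = -10 × log10 of the error probability), so higher values indicate higher confidence; Q20 corresponds to a 1 in 100 chance of an incorrect call.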
Correct answer is: Quality score representing the probability of an incorrect base call
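
Worked example - A sketch of decoding FASTQ quality characters into Phred scores and error probabilities. It assumes the Phred+33 ASCII encoding; files using another offset would need a different `offset` argument. The function names are illustrative.

```python
def decode_quality(quality_string, offset=33):
    """Convert FASTQ quality characters to integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def phred_to_error_prob(q):
    """Probability that the base call is wrong: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


scores = decode_quality("!5I")
probabilities = [phred_to_error_prob(q) for q in scores]
```

A ten-point increase in Phred score corresponds to a tenfold drop in the probability of an incorrect call.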

Q.41 What is the main advantage of using a 'k‑mer based' assembler?

It does not require a reference genome
It can assemble extremely long reads easily
It automatically identifies gene functions
It reduces computational cost by using fixed‑length substrings
Explanation - K‑mers enable efficient graph construction and memory usage for assembly.
Correct answer is: It reduces computational cost by using fixed‑length substrings

Q.42 Which of these is a common feature of a 'variant annotation tool' like VEP?

Predicts protein‑structure folds
Annotates the functional impact of variants on genes and transcripts
Aligns sequences to a reference
Compresses sequencing data
Explanation - VEP (Variant Effect Predictor) links variants to potential biological effects.
Correct answer is: Annotates the functional impact of variants on genes and transcripts

Q.43 Which of these metrics is commonly used to assess the quality of a genome assembly?

N50
E-value
GC content
Read depth
Explanation - N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least 50% of the total assembly length.
Correct answer is: N50
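
Worked example - A sketch of computing N50 from a list of contig lengths: sort from longest to shortest and accumulate until half of the total length is reached. The contig lengths are made-up values.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches half the assembly size."""
    if not contig_lengths:
        raise ValueError("N50 is undefined for an empty assembly")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:  # integer-safe form of running >= total / 2
            return length


# Hypothetical assembly of 32 bases in eight contigs.
result = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

Here the two longest contigs (8 + 8 = 16) already cover half of the 32 bases, so the value is 8.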

Q.44 In a phylogenetic tree, what is a 'clade'?

A branch of the tree
A group of organisms that share a common ancestor
A leaf node
A method for measuring genetic distance
Explanation - A clade includes an ancestor and all its descendants.
Correct answer is: A group of organisms that share a common ancestor

Q.45 Which algorithm is best suited for de‑novo transcriptome assembly?

Canu
Trinity
BLAST
SPAdes
Explanation - Trinity is specifically designed for assembling RNA‑Seq data into transcripts.
Correct answer is: Trinity

Q.46 What does the term 'saturation coverage' refer to in sequencing experiments?

The coverage level beyond which additional sequencing does not significantly increase variant discovery
The maximum read length achievable
The number of unique reads that cover the entire genome
The point at which the sequencer stops producing reads
Explanation - Saturation coverage indicates diminishing returns for deeper sequencing.
Correct answer is: The coverage level beyond which additional sequencing does not significantly increase variant discovery

Q.47 Which of these is NOT an advantage of using a GPU for bioinformatics computations?

Massive parallelism
Energy efficiency for certain tasks
Easy to program with high‑level languages
High memory bandwidth
Explanation - GPUs typically require specialized programming (CUDA, OpenCL), though higher‑level wrappers exist.
Correct answer is: Easy to program with high‑level languages

Q.48 Which of the following best describes a 'paired‑end' sequencing read?

Two reads originating from opposite ends of the same DNA fragment
A single read that contains both DNA and RNA sequences
A read that is paired with a reference genome
A read that has been paired with a control sample
Explanation - Paired‑end reads provide distance information between fragments, aiding assembly and mapping.
Correct answer is: Two reads originating from opposite ends of the same DNA fragment

Q.49 Which of these is a commonly used scoring matrix for protein‑protein alignment?

BLOSUM62
Smith‑Waterman
Levenshtein
Hamming
Explanation - BLOSUM62 is a substitution matrix for evaluating protein sequence similarity.
Correct answer is: BLOSUM62

Q.50 What does a 'heat map' represent in transcriptomics?

The frequency of mutations across a genome
Expression levels of genes across samples, displayed with color intensity
The GC content distribution
The physical distance between genes
Explanation - Heat maps visualize high‑dimensional expression data using a color scale.
Correct answer is: Expression levels of genes across samples, displayed with color intensity

Q.51 Which of these tools is used for visualizing large phylogenetic trees interactively?

FigTree
IGV
JBrowse
Circos
Explanation - FigTree is a graphical viewer that displays phylogenetic trees and lets the user adjust and annotate them interactively.
Correct answer is: FigTree

Q.52 Which of the following best describes the concept of 'digital PCR' in bioinformatics?

A computational simulation of PCR
An algorithm for predicting PCR primer efficiency
A high‑throughput method for absolute quantification of nucleic acids using partitioned reactions
A technique for sequencing PCR products
Explanation - Digital PCR partitions the sample, enabling precise copy number estimation.
Correct answer is: A high‑throughput method for absolute quantification of nucleic acids using partitioned reactions

Q.53 In machine learning, what is the purpose of 'regularization'?

To reduce overfitting by penalizing model complexity
To increase model complexity
To perform feature scaling
To optimize hyperparameters
Explanation - Regularization techniques (L1/L2) constrain model parameters to improve generalization.
Correct answer is: To reduce overfitting by penalizing model complexity

Q.54 Which file format stores the mapping of sequencing reads to reference coordinates?

VCF
BAM
BED
GFF
Explanation - BAM is the binary version of SAM, containing read‑alignment information.
Correct answer is: BAM

Q.55 Which of the following best explains the purpose of a 'blastn' command in BLAST?

Align protein sequences
Align nucleotide sequences
Identify gene ontology terms
Visualize phylogenetic trees
Explanation - blastn performs nucleotide‑nucleotide local alignments.
Correct answer is: Align nucleotide sequences

Q.56 In the context of gene regulatory networks, what does a 'regulon' refer to?

A set of genes regulated by a common transcription factor
A cluster of promoter sequences
The DNA sequence of a regulatory gene
A type of sequencing technology
Explanation - Regulons group genes sharing the same regulatory control.
Correct answer is: A set of genes regulated by a common transcription factor

Q.57 Which of these is an example of a 'supercomputer' used for genomics research?

NVIDIA Tesla GPU
Amazon EC2 Cloud Instances
HPC cluster at a university
Apple MacBook Pro
Explanation - High‑performance computing clusters provide large parallel compute resources.
Correct answer is: HPC cluster at a university

Q.58 Which algorithm is used to predict the secondary structure of RNA from its sequence?

RNAfold
BWA
BLASTP
HMMER
Explanation - RNAfold uses thermodynamic models to predict RNA secondary structure.
Correct answer is: RNAfold

Q.59 In a read alignment, what is a 'soft clip' operation?

Removing low‑quality bases from the end of a read
Marking bases that do not align to the reference but are kept in the read record
Trimming adapter sequences from reads
Discarding the entire read if it doesn't map
Explanation - Soft clipping indicates unmapped portions without deleting them from the alignment.
Correct answer is: Marking bases that do not align to the reference but are kept in the read record
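
Worked example - A sketch of reading soft clips from a SAM CIGAR string, where the `S` operation marks read bases that are present in the record but not aligned. The function names and the example CIGAR strings are illustrative.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar):
    """Total number of soft-clipped (S) bases in a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_ELEMENT.findall(cigar) if op == "S")


def read_length_from_cigar(cigar):
    """Length of the stored read: operations M, I, S, = and X consume read bases."""
    return sum(int(n) for n, op in CIGAR_ELEMENT.findall(cigar) if op in "MIS=X")


clipped = soft_clipped_bases("5S90M5S")
stored_length = read_length_from_cigar("5S90M5S")
```

The soft-clipped bases still count towards the stored read length, unlike hard-clipped (`H`) bases, which are removed from the record.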

Q.60 Which of these is a measure of reproducibility in an experiment?

Bias
Variance
Precision
Recall
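Explanation - In the measurement sense, precision describes how closely repeated measurements agree with one another, which is what reproducibility reflects. It should not be confused with the classification metric of the same name (true positives among predicted positives).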
Correct answer is: Precision

Q.61 What does 'allele frequency' represent in population genetics?

The number of alleles in a gene pool
The proportion of a specific allele among all alleles at a locus in a population
The mutation rate of a locus
The total number of individuals in a population
Explanation - Allele frequency quantifies how common an allele is within a population.
Correct answer is: The proportion of a specific allele among all alleles at a locus in a population
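
Worked example - A sketch of computing an allele frequency from diploid genotypes by counting allele copies. Representing each genotype as a pair of allele labels is an assumption for illustration, as are the four sample genotypes.

```python
def allele_frequency(genotypes, allele):
    """Proportion of the given allele among all allele copies at one locus."""
    copies = [a for genotype in genotypes for a in genotype]
    if not copies:
        raise ValueError("No genotypes supplied")
    return copies.count(allele) / len(copies)


# Four hypothetical diploid individuals: 8 allele copies in total.
population = [("A", "A"), ("A", "a"), ("a", "a"), ("A", "a")]
freq_A = allele_frequency(population, "A")
freq_a = allele_frequency(population, "a")
```

With two alleles at the locus, the two frequencies sum to 1.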

Q.62 Which of the following is a key feature of a 'variant calling' pipeline?

Identifying single‑nucleotide variations and indels from aligned reads
Predicting protein 3‑D structures
Visualizing gene expression heat maps
Annotating gene ontology terms
Explanation - Variant callers analyze aligned reads to detect genomic differences.
Correct answer is: Identifying single‑nucleotide variations and indels from aligned reads

Q.63 Which of these describes the 'A‑matrix' in protein sequence analysis?

A substitution matrix used to score alignments
A matrix of amino‑acid frequencies in a database
An algorithm for aligning sequences
A measure of alignment accuracy
Explanation - A‑matrices (e.g., PAM) provide log‑odds scores for aligning amino acids.
Correct answer is: A substitution matrix used to score alignments

Q.64 In a gene expression study, what does a 'fold change' value of 2 indicate?

Expression is halved in the test condition
Expression is twice as high in the test condition compared to control
Expression is unchanged
Expression is four times higher
Explanation - A fold change of 2 means the test level is double the control level.
Correct answer is: Expression is twice as high in the test condition compared to control

Q.65 What is a 'pseudogene'?

A gene that is functional in humans but not in mice
A gene that has lost its protein‑coding ability due to mutations
A gene that codes for RNA only
A gene that is overexpressed in cancer cells
Explanation - Pseudogenes arise from duplication or mutation, rendering them nonfunctional.
Correct answer is: A gene that has lost its protein‑coding ability due to mutations

Q.66 Which of the following best describes an 'ORF' in genomics?

Open Reading Frame, a continuous stretch of codons without stop codons
Optimized Read Filter used in alignment
A type of sequencing error
Oligonucleotide Retrieval File
Explanation - ORFs are candidate protein‑coding regions.
Correct answer is: Open Reading Frame, a continuous stretch of codons without stop codons
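
Worked example - A simplified sketch of an ORF scan over the three forward reading frames. It assumes ORFs begin at ATG and end at the first in-frame stop codon (TAA, TAG, TGA), and it ignores the reverse strand; real gene finders are considerably more elaborate.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=1):
    """Return (start, end) pairs, 0-based half-open, for forward-strand ORFs.

    min_codons counts the codons before the stop codon, including the ATG.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # include the stop codon
                start = None
    return orfs


dna = "CCATGAAATAGCC"
orfs = find_orfs(dna)
```

The single ORF found here spans `ATGAAATAG`: a start codon, one lysine codon, and a stop codon.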

Q.67 Which of these tools is commonly used for genome annotation?

GeneMark
BLASTP
Bowtie
Cytoscape
Explanation - GeneMark predicts coding regions in DNA sequences.
Correct answer is: GeneMark

Q.68 In phylogenetics, what does a 'polytomy' indicate?

An unresolved branching where more than two lineages diverge from a single node
A branch that has zero length
A branch that has been removed from the tree
A node with exactly two descendant branches
Explanation - Polytomy reflects uncertainty or simultaneous divergence.
Correct answer is: An unresolved branching where more than two lineages diverge from a single node

Q.69 Which file format is typically used to store variant annotation information?

GFF
BED
VCF
FASTA
Explanation - VCF files contain variant positions and annotations.
Correct answer is: VCF

Q.70 What is the purpose of the 'reverse complement' operation in DNA sequence analysis?

To generate the RNA counterpart of a DNA strand
To obtain the complementary sequence on the opposite strand
To reverse the direction of a protein sequence
To translate DNA into protein
Explanation - Reverse complement reflects the opposite DNA strand orientation.
Correct answer is: To obtain the complementary sequence on the opposite strand
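
Worked example - A sketch of the reverse complement for an uppercase or lowercase DNA string containing only A, C, G, and T. Ambiguity codes such as N are passed through unchanged by this translation table, which is an assumption of the sketch rather than a general rule.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return sequence.translate(COMPLEMENT)[::-1]


opposite_strand = reverse_complement("AACG")
```

Applying the operation twice returns the original sequence.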

Q.71 Which of these is NOT a step in preparing data for a machine‑learning model in genomics?

Data cleaning
Feature selection
Model deployment
Sequencing the DNA
Explanation - Sequencing is an upstream experimental step; machine‑learning begins after data acquisition.
Correct answer is: Sequencing the DNA

Q.72 Which of the following tools is used to identify conserved motifs in protein families?

HMMER
MAFFT
SAMtools
BEDTools
Explanation - HMMER builds hidden Markov models to find conserved sequence motifs.
Correct answer is: HMMER

Q.73 In an RNA‑Seq differential expression analysis, what does a 'p‑value < 0.05' indicate?

The result is likely due to chance
The result is statistically significant
The genes are not differentially expressed
The sequencing depth is insufficient
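Explanation - A p‑value below 0.05 means a difference at least this large would be unlikely if there were no true change in expression. Because thousands of genes are tested at once, RNA‑Seq tools also report p‑values adjusted for multiple testing.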
Correct answer is: The result is statistically significant

Q.74 What is a 'phasing' problem in genomics?

Determining which alleles are inherited together on the same chromosome
Aligning short reads to a reference
Predicting protein structure
Mapping regulatory elements
Explanation - Phasing resolves haplotype structures from genotype data.
Correct answer is: Determining which alleles are inherited together on the same chromosome

Q.75 Which of the following best describes the 'Eukaryotic Linear Motif (ELM)' database?

A repository of 3‑D protein structures
A collection of short functional motifs in eukaryotic proteins
A tool for genome assembly
A sequence alignment program
Explanation - ELM catalogs linear motifs involved in signaling, localization, and regulation.
Correct answer is: A collection of short functional motifs in eukaryotic proteins

Q.76 Which of these is a common approach to reduce noise in a single‑cell RNA‑Seq dataset?

Removing low‑quality cells with many zero counts
Increasing sequencing depth for every cell
Using a higher gap penalty in alignments
Discarding all genes with high GC content
Explanation - Filtering out low‑coverage cells improves downstream analyses.
Correct answer is: Removing low‑quality cells with many zero counts

Q.77 In a genome assembly graph, what does a 'bubble' represent?

A repetitive region leading to multiple possible paths
An error in sequencing data
A missing gene
A high‑confidence contig
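Explanation - Bubbles occur where the graph splits into alternative paths that later rejoin, so the assembler cannot choose a unique path. Near‑identical repeat copies can produce them, as can sequencing errors and heterozygous variants.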
Correct answer is: A repetitive region leading to multiple possible paths

Q.78 Which of these metrics is used to evaluate the quality of a phylogenetic tree?

Bootstrap support values
Read depth
GC content
E-value
Explanation - Bootstrap values assess the reliability of tree branches.
Correct answer is: Bootstrap support values

Q.79 Which of these is an example of a 'feature importance' metric in machine learning?

Mean decrease impurity
Phred score
Read depth
E‑value
Explanation - It measures the contribution of each feature to model splits.
Correct answer is: Mean decrease impurity

Q.80 What does the term 'coverage breadth' refer to in sequencing?

The number of times a base is sequenced
The proportion of the genome that is covered by at least one read
The number of reads overlapping a region
The depth of sequencing in a specific region
Explanation - Breadth reflects how much of the genome is represented in the data.
Correct answer is: The proportion of the genome that is covered by at least one read
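
Worked example - A sketch contrasting coverage breadth with mean depth, computed from read intervals on a toy genome. The intervals are 0-based and half-open by assumption, and the per-base array is only practical for small examples.

```python
def coverage_stats(read_intervals, genome_length):
    """Return (breadth, mean_depth) for reads given as (start, end) intervals."""
    depth = [0] * genome_length
    for start, end in read_intervals:
        # Clamp each read to the genome boundaries before counting.
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    breadth = sum(1 for d in depth if d > 0) / genome_length
    mean_depth = sum(depth) / genome_length
    return breadth, mean_depth


# Two hypothetical reads on a 10-base genome; positions 8 and 9 are uncovered.
breadth, mean_depth = coverage_stats([(0, 5), (3, 8)], 10)
```

The two overlapping reads give a mean depth of 1.0 even though only 80% of the positions are covered, which is why the two measures are reported separately.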

Q.81 Which of the following best describes the 'k‑mer spectrum' of a sequencing dataset?

The distribution of k‑mer frequencies across the dataset
The number of unique reads
The GC content distribution
The number of contigs assembled
Explanation - The k‑mer spectrum informs error rates and genome complexity.
Correct answer is: The distribution of k‑mer frequencies across the dataset
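
Worked example - A sketch of a k‑mer spectrum: count every k‑mer in the reads, then count how many distinct k‑mers occur once, twice, and so on. The three short reads are made up, with the last one carrying a deliberate single-base difference.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    kmer_counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(kmer_counts.values())


# The third read differs from the first by one base (A -> T at position 4).
reads = ["ACGTAC", "CGTACG", "ACGTTC"]
spectrum = kmer_spectrum(reads, 3)
```

The k‑mers that appear only once (here `GTT` and `TTC`) come from the altered read, which illustrates why a peak of low-multiplicity k‑mers is often read as a sign of sequencing errors.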

Q.82 Which of these is NOT a typical use case for a 'workflow engine' in bioinformatics?

Automating reproducible analyses
Managing compute resources on a cluster
Directly visualizing genome annotations
Tracking dependencies between computational steps
Explanation - Workflow engines orchestrate pipelines; visualization is a separate task.
Correct answer is: Directly visualizing genome annotations

Q.83 What is the primary benefit of using 'paired‑end' reads for genome assembly?

They provide longer read lengths
They help resolve repetitive regions by giving distance information
They reduce the cost of sequencing
They eliminate sequencing errors entirely
Explanation - Paired‑end distances aid scaffold construction across repeats.
Correct answer is: They help resolve repetitive regions by giving distance information

Q.84 In a machine learning pipeline for predicting disease risk from genomic data, which step ensures that the model does not overfit to the training data?

Hyperparameter tuning
Cross‑validation
Increasing training data size
Using a simpler model architecture
Explanation - Cross‑validation evaluates model generalization on unseen data.
Correct answer is: Cross‑validation

Q.85 Which of these tools is used to visualize genomic variant data in a genome browser?

IGV
Cytoscape
BLAST
MAFFT
Explanation - IGV (Integrative Genomics Viewer) displays BAM, VCF, and other genomic tracks.
Correct answer is: IGV

Q.86 What does the acronym 'SAM' stand for in bioinformatics?

Sequence Alignment/Map
Sequence Annotation Module
Signal Amplification Machine
Statistical Analysis Method
Explanation - SAM is a text format for storing alignment information.
Correct answer is: Sequence Alignment/Map

Q.87 Which of the following best describes a 'variant allele frequency' (VAF) in a tumor sample?

The proportion of reads supporting a variant allele at a particular locus
The number of variants in the genome
The depth of coverage at a locus
The frequency of a variant across different tumors
Explanation - VAF indicates how prevalent a mutation is within a sample.
Correct answer is: The proportion of reads supporting a variant allele at a particular locus

Q.88 Which of these is a major challenge when working with single‑cell RNA‑Seq data?

High sequencing cost per cell
Uniform gene expression across all cells
Absence of dropouts (zero counts)
Very low number of genes per cell
Explanation - Single‑cell sequencing requires many cells, increasing cost.
Correct answer is: High sequencing cost per cell

Q.89 Which of these file formats is used to represent 2‑D genomic interaction maps?

BEDPE
HiC
GFF3
FASTA
Explanation - BEDPE stores pairs of genomic intervals, so each line can describe one interaction between two loci, as in Hi‑C contact data.
Correct answer is: BEDPE

Q.90 What is the main function of 'BLASTP' in protein‑sequence comparison?

Align protein sequences against a protein database
Align nucleotide sequences against a nucleotide database
Predict protein secondary structure
Translate nucleotide sequences into proteins
Explanation - BLASTP performs local alignments of protein queries to protein databases.
Correct answer is: Align protein sequences against a protein database

Q.91 Which of the following is a commonly used method for multiple sequence alignment of proteins?

Clustal Omega
Bowtie2
TopHat
SAMtools
Explanation - Clustal Omega is a fast, accurate MSA tool for protein sequences.
Correct answer is: Clustal Omega

Q.92 Which of these best describes a 'phylogenetic tree'?

A graphical representation of evolutionary relationships among taxa
A map of genomic regions on a chromosome
A list of protein domains in a gene
A record of sequencing instrument performance
Explanation - Phylogenetic trees illustrate common ancestry and divergence.
Correct answer is: A graphical representation of evolutionary relationships among taxa

Q.93 What is the purpose of an 'index file' (e.g., .bai) for a BAM file?

To compress the BAM file
To speed up random access to alignments in the BAM file
To store annotation data
To record quality scores
Explanation - The index allows rapid retrieval of reads covering specific genomic regions.
Correct answer is: To speed up random access to alignments in the BAM file

Q.94 Which of these is a measure of the predictive accuracy of a classification model?

Precision
Recall
Accuracy
All of the above
Explanation - Accuracy, precision, and recall are common classification metrics.
Correct answer is: All of the above
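
The three metrics can be computed from the counts of a binary confusion matrix. The sketch below is illustrative; the function name and the example counts are made up for this purpose.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return {
        "accuracy": (tp + tn) / total,
        # Precision: of the predicted positives, how many are correct.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Recall: of the actual positives, how many were found.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can mislead on imbalanced data sets, which is why precision and recall are usually reported alongside it.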

Q.95 What is the role of a 'phasing algorithm' in population genetics?

To determine the parental origin of alleles
To identify gene function
To align sequences to a reference genome
To visualize gene expression data
Explanation - Phasing reconstructs haplotypes from genotype data.
Correct answer is: To determine the parental origin of alleles

Q.96 Which of the following is a typical use of a 'cloud service' for genomics?

Local storage of raw sequencing data
On‑demand compute resources for large‑scale analyses
Hardware manufacturing
Data backup to external hard drives
Explanation - Cloud platforms provide scalable compute for big data tasks.
Correct answer is: On‑demand compute resources for large‑scale analyses

Q.97 Which of the following best describes the 'Markov model' used in gene prediction?

A model that predicts gene start sites based on codon usage probabilities
A method for aligning sequences
A way to visualize genetic data
A database of known genes
Explanation - Hidden Markov models use transition probabilities between states such as exons and introns, together with emission probabilities for the observed bases.
Correct answer is: A model that predicts gene start sites based on codon usage probabilities

Q.98 What is a 'de‑novo assembly'?

Assembly that uses a reference genome for alignment
Assembly that constructs genomes from scratch without a reference
Assembly of transcriptomes only
Assembly that focuses on mitochondrial DNA
Explanation - De‑novo assembly builds the genome using only reads, no reference.
Correct answer is: Assembly that constructs genomes from scratch without a reference

Q.99 Which of the following tools is specifically designed for aligning short reads to a genome?

Bowtie
MAFFT
BLASTN
BLASTX
Explanation - Bowtie is a fast, memory‑efficient short‑read aligner.
Correct answer is: Bowtie

Q.100 What does the 'Hamming distance' measure in sequence comparison?

The number of mismatches between two sequences of equal length
The number of insertions and deletions
The number of identical bases
The number of gaps
Explanation - Hamming distance counts differing positions in equal‑length strings.
Correct answer is: The number of mismatches between two sequences of equal length
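
A minimal sketch of the definition, with an illustrative function name and example sequences:

```python
def hamming_distance(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


print(hamming_distance("GATTACA", "GACTATA"))  # 2
```

Because only substitutions are counted, sequences that differ by insertions or deletions need a different measure, such as edit distance.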

Q.101 Which of the following best describes a 'gene fusion event'?

Two genes combine to produce a hybrid protein
A gene becomes duplicated
A gene is lost during evolution
A gene is transcribed into RNA only
Explanation - Gene fusions merge exons from distinct genes, often implicated in cancer.
Correct answer is: Two genes combine to produce a hybrid protein

Q.102 What is the main advantage of using 'deep learning' for protein structure prediction?

It can model long‑range interactions more accurately than physics‑based methods
It requires no training data
It runs only on CPUs
It provides exact solutions to the Schrödinger equation
Explanation - DL models capture complex patterns in protein folding from large datasets.
Correct answer is: It can model long‑range interactions more accurately than physics‑based methods

Q.103 Which of these is a common metric for evaluating a clustering algorithm on gene expression data?

Silhouette score
E-value
Read depth
Gap penalty
Explanation - Silhouette measures cohesion and separation of clusters.
Correct answer is: Silhouette score

Q.104 In a computational biology pipeline, what is the purpose of the 'trimming' step?

To remove adapter sequences and low‑quality bases from reads
To align reads to the reference genome
To assemble the genome
To annotate genes
Explanation - Trimming improves the quality of downstream analyses.
Correct answer is: To remove adapter sequences and low‑quality bases from reads

Q.105 Which of these best describes a 'mass spectrometry' technique used in proteomics?

Sequencing DNA directly
Measuring the mass-to-charge ratio of ionized peptides
Aligning RNA sequences
Predicting gene expression levels
Explanation - MS identifies peptides by their mass spectra.
Correct answer is: Measuring the mass-to-charge ratio of ionized peptides

Q.106 What is the main purpose of a 'variant filtration' step in a VCF pipeline?

To remove low‑quality or likely false variant calls
To translate variants into proteins
To align reads to the genome
To compress the VCF file
Explanation - Filtering enhances the reliability of variant sets.
Correct answer is: To remove low‑quality or likely false variant calls

Q.107 Which of these is a type of 'short‑read' sequencing technology?

PacBio RS II
Oxford Nanopore MinION
Illumina NovaSeq
Sanger sequencing
Explanation - NovaSeq provides high‑throughput short reads (~150 bp).
Correct answer is: Illumina NovaSeq

Q.108 In the context of RNA‑Seq, what is a 'junction read'?

A read that aligns to a single exon only
A read that spans across an exon‑exon splice junction
A read that does not align to the genome
A read that contains sequencing adapters
Explanation - Junction reads provide evidence of splicing events.
Correct answer is: A read that spans across an exon‑exon splice junction

Q.109 Which of the following best describes a 'single‑cell atlas' project?

A database of all possible protein structures
A comprehensive map of cell types and states at single‑cell resolution
A tool for aligning DNA sequences
A method for sequencing entire genomes in one go
Explanation - Cell atlases catalog cellular heterogeneity across tissues.
Correct answer is: A comprehensive map of cell types and states at single‑cell resolution

Q.110 Which of these is an example of a 'conserved non‑coding sequence' (CNS) in genomics?

A highly conserved protein‑coding region
A repetitive DNA element
A regulatory DNA region that remains unchanged across species
A mitochondrial DNA segment
Explanation - CNSs often indicate functional regulatory elements.
Correct answer is: A regulatory DNA region that remains unchanged across species

Q.111 Which of the following best explains the concept of 'gene ontology (GO) enrichment'?

Identifying over‑represented functional categories in a gene set
Aligning genes to a reference genome
Predicting protein‑protein interactions
Normalizing gene expression levels
Explanation - GO enrichment highlights biological themes within a dataset.
Correct answer is: Identifying over‑represented functional categories in a gene set
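
Over‑representation is commonly tested with a one‑sided hypergeometric test. The sketch below assumes that approach (the question itself does not name a test), and the function name and toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, sample, overlap):
    """P(X >= overlap) when `sample` genes are drawn without replacement
    from `population` genes, of which `annotated` carry the GO term."""
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie within the population size")
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability of every outcome at least as extreme as observed.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20 genes in total, 5 annotated, 5 in the gene set, 3 overlapping.
p_value = hypergeom_enrichment_pvalue(population=20, annotated=5, sample=5, overlap=3)
print(f"p = {p_value:.4f}")
```

In a real analysis the p‑values for all tested terms would then be corrected for multiple testing.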

Q.112 Which of these is a common method for visualizing genomic variants across multiple samples?

Genome browser tracks
Heat map of gene expression
Protein‑structure model
Phylogenetic tree
Explanation - Browsers display variants as annotated tracks for comparative view.
Correct answer is: Genome browser tracks

Q.113 What does the acronym 'RNA‑seq' stand for?

RNA sequencing
Rapid nucleotide analysis
Reversible nucleotide assembly
Random nucleotide sequencing
Explanation - RNA‑seq quantifies transcriptomes by sequencing cDNA.
Correct answer is: RNA sequencing

Q.114 Which of the following best describes a 'de‑novo transcriptome assembly' pipeline?

Aligning transcripts to a reference genome
Assembling transcripts from short reads without a reference genome
Annotating gene functions
Predicting protein secondary structure
Explanation - De‑novo assembly constructs transcript sequences solely from reads.
Correct answer is: Assembling transcripts from short reads without a reference genome

Q.115 Which of these metrics is used to assess the accuracy of a variant caller?

True positive rate (sensitivity)
False discovery rate
Precision
All of the above
Explanation - Sensitivity, FDR, and precision evaluate different aspects of accuracy.
Correct answer is: All of the above

Q.116 What is the function of a 'read aligner' in a next‑generation sequencing pipeline?

To convert raw reads into FASTQ format
To map sequencing reads to a reference genome or transcriptome
To predict gene function
To perform de‑novo assembly of genomes
Explanation - Aligners match reads to the reference, generating coordinate information.
Correct answer is: To map sequencing reads to a reference genome or transcriptome

Q.117 Which of the following is a key advantage of using 'k‑mer hashing' in assembly?

It reduces memory usage
It improves alignment accuracy
It speeds up alignment of short reads
It compresses the final assembly
Explanation - K‑mer hashing allows efficient representation of k‑mers in memory.
Correct answer is: It reduces memory usage

Q.118 Which of these is an example of a 'long‑read sequencing' technology?

Illumina MiSeq
PacBio Sequel II
Sanger sequencing
Ion Torrent PGM
Explanation - PacBio generates reads up to tens of kilobases long.
Correct answer is: PacBio Sequel II

Q.119 Which of these is a typical output of a gene prediction algorithm?

Annotated gene coordinates and predicted gene models
A phylogenetic tree
Protein 3‑D structure
A heat map of expression levels
Explanation - Gene prediction provides start/stop positions and exon‑intron structure.
Correct answer is: Annotated gene coordinates and predicted gene models

Q.120 What is a 'feature vector' in the context of machine learning for genomics?

A set of numeric values describing characteristics of a sample or sequence
A database of gene annotations
A file format for sequencing reads
A tool for aligning proteins
Explanation - Feature vectors represent data in a format usable by ML algorithms.
Correct answer is: A set of numeric values describing characteristics of a sample or sequence

Q.121 Which of the following best describes a 'k‑mer count histogram'?

A plot showing the frequency of each k‑mer in a dataset
A chart of sequencing instrument run times
A table of gene expression values
A list of variant calls
Explanation - The histogram of k‑mer multiplicities helps estimate sequencing error rate, coverage, and genome size.
Correct answer is: A plot showing the frequency of each k‑mer in a dataset
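
As an illustration, the sketch below counts k‑mers in a couple of toy reads and then tabulates how many distinct k‑mers occur once, twice, and so on. The function names and reads are hypothetical, and reverse complements are ignored for simplicity.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return dict(sorted(Counter(counts.values()).items()))


reads = ["ACGTAC", "CGTACC"]
hist = kmer_histogram(kmer_counts(reads, k=3))
print(hist)  # {1: 2, 2: 3}
```

In real data, k‑mers seen only once or twice are often sequencing errors, while the main peak sits near the k‑mer coverage of the genome.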

Q.122 Which of these is a common challenge when performing metagenomic assembly?

High diversity of species leading to complex assembly graphs
Uniform sequencing depth across all organisms
The absence of repetitive elements
Very low read error rates
Explanation - Metagenomes contain many species, complicating assembly.
Correct answer is: High diversity of species leading to complex assembly graphs

Q.123 What is the main benefit of using an 'ensemble' of models in predictive genomics?

It reduces training time
It improves predictive accuracy by combining multiple models
It eliminates the need for cross‑validation
It simplifies data preprocessing
Explanation - Ensemble methods combine predictions from several models (for example by averaging or voting), which typically reduces variance.
Correct answer is: It improves predictive accuracy by combining multiple models

Q.124 Which of these tools is specifically designed for visualizing gene regulatory networks?

Cytoscape
IGV
BLAST
MAFFT
Explanation - Cytoscape is a network visualization platform.
Correct answer is: Cytoscape

Q.125 What does the term 'batch effect' refer to in high‑throughput biology experiments?

Systematic technical differences between groups of samples processed at different times
A biological variation among cell types
An error in sequence alignment
The effect of a single sequencing run
Explanation - Batch effects can bias downstream analyses if not corrected.
Correct answer is: Systematic technical differences between groups of samples processed at different times

Q.126 Which of these is a commonly used metric for evaluating alignment quality?

Alignment score
E‑value
Read depth
GC content
Explanation - The score reflects how well sequences align, considering mismatches and gaps.
Correct answer is: Alignment score

Q.127 Which of the following best describes a 'motif discovery' algorithm?

An algorithm that predicts 3‑D protein structures
An algorithm that finds recurring short sequences in a set of DNA or protein sequences
An algorithm that aligns sequences to a reference genome
An algorithm that compresses genomic data
Explanation - Motif discovery identifies functional motifs such as binding sites.
Correct answer is: An algorithm that finds recurring short sequences in a set of DNA or protein sequences

Q.128 Which of these best describes the 'pseudocount' concept in hidden Markov models?

A small value added to avoid zero probabilities during training
A measure of read quality
A parameter controlling gap penalties
A statistic used for variant filtering
Explanation - Pseudocounts regularize the probability estimates in HMMs.
Correct answer is: A small value added to avoid zero probabilities during training
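
A minimal sketch of the idea for a single HMM state's emission probabilities. The function name, the default pseudocount of 1 (Laplace smoothing), and the toy observations are assumptions made for illustration.

```python
from collections import Counter


def emission_probabilities(observed, alphabet, pseudocount=1.0):
    """Estimate emission probabilities with a pseudocount added to every symbol."""
    if any(symbol not in alphabet for symbol in observed):
        raise ValueError("observed contains a symbol outside the alphabet")
    counts = Counter(observed)
    # Each symbol gets `pseudocount` extra observations, so the total grows too.
    total = len(observed) + pseudocount * len(alphabet)
    return {symbol: (counts[symbol] + pseudocount) / total for symbol in alphabet}


probs = emission_probabilities("AACA", "ACGT")
print(probs)  # {'A': 0.5, 'C': 0.25, 'G': 0.125, 'T': 0.125}
```

Without the pseudocount, G and T would get probability zero, and any sequence containing them would be assigned zero likelihood by the model.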

Q.129 Which of these is a common file format for storing sequence alignment files in binary form?

SAM
BAM
FASTA
GTF
Explanation - BAM is the binary version of SAM, used for efficient storage and retrieval.
Correct answer is: BAM

Q.130 What is the purpose of an 'adapter trimming' step in RNA‑Seq data processing?

To remove artificial sequences added during library preparation
To map reads to the reference genome
To assemble the transcriptome
To predict gene function
Explanation - Adapter contamination can interfere with accurate mapping and quantification.
Correct answer is: To remove artificial sequences added during library preparation

Q.131 Which of these best describes a 'copy‑number variation' (CNV)?

A variation in the number of copies of a particular genomic region between individuals
A mutation in a single nucleotide
An insertion of a transposable element
A variation in gene expression levels
Explanation - CNVs involve deletions or duplications of large DNA segments.
Correct answer is: A variation in the number of copies of a particular genomic region between individuals

Q.132 Which of these is an example of a 'deep learning' model applied to genomics?

Convolutional neural network for chromatin accessibility prediction
BLAST for sequence alignment
Bowtie for read alignment
MAFFT for multiple sequence alignment
Explanation - CNNs learn hierarchical patterns from raw DNA sequences.
Correct answer is: Convolutional neural network for chromatin accessibility prediction

Q.133 Which of the following best describes the 'phred score' conversion?

Q = -10 × log10(P)
Q = -10 × log10(1-P)
Q = log10(P)
Q = 10 × log10(P)
Explanation - The Phred score is a logarithmic encoding of P, the estimated probability that a base call is incorrect; higher scores mean lower error probability.
Correct answer is: Q = -10 × log10(P)
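
The formula and its inverse can be checked numerically. A minimal sketch with hypothetical function names:

```python
import math


def error_prob_to_phred(p):
    """Q = -10 * log10(P), where P is the probability of a wrong base call."""
    if not 0 < p <= 1:
        raise ValueError("P must be in the interval (0, 1]")
    return -10 * math.log10(p)


def phred_to_error_prob(q):
    """Inverse conversion: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


print(round(error_prob_to_phred(0.001)))  # 30
print(f"{phred_to_error_prob(20):.3f}")   # 0.010
```

Each increase of 10 in Q corresponds to a tenfold drop in error probability.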

Q.134 Which of the following best describes 'variant calling' in the context of DNA sequencing?

Identifying positions in the genome where an individual's sequence differs from a reference
Predicting gene expression from read counts
Aligning reads to a reference genome
Assembling a genome from scratch
Explanation - Variant calling detects SNPs, indels, and structural variants.
Correct answer is: Identifying positions in the genome where an individual's sequence differs from a reference

Q.135 Which of the following best describes a 'k‑mer spectrum plot' used in genome size estimation?

A plot of the number of distinct k‑mers versus their multiplicity (how often each occurs)
A plot of read length distribution
A plot of GC content across the genome
A plot of read mapping quality
Explanation - The main peak in the spectrum reflects k‑mer coverage, from which genome size can be estimated.
Correct answer is: A plot of the number of distinct k‑mers versus their multiplicity (how often each occurs)

Q.136 Which of these is a commonly used metric for evaluating differential gene expression?

Adjusted p‑value
Fold change
Both of the above
None of the above
Explanation - Significance is assessed by adjusted p‑values; magnitude by fold change.
Correct answer is: Both of the above
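
Both quantities are simple to compute once per‑gene statistics exist. The sketch below assumes the Benjamini‑Hochberg procedure for the adjustment, which is one common choice and is not named in the question; the function names, the pseudocount, and the numbers are illustrative.

```python
import math


def log2_fold_change(mean_treated, mean_control, pseudo=1.0):
    """log2 ratio of mean expression; `pseudo` avoids division by zero."""
    return math.log2((mean_treated + pseudo) / (mean_control + pseudo))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


print(log2_fold_change(7, 1))  # 2.0
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print([round(p, 4) for p in adj])
```

A gene is typically reported as differentially expressed only when it passes thresholds on both the adjusted p‑value and the absolute fold change.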

Q.137 Which of the following tools is designed for visualizing high‑dimensional single‑cell data?

UMAP
BLAST
MAFFT
Cytoscape
Explanation - UMAP embeds high‑dimensional data in two or three dimensions so that cell populations can be visualized.
Correct answer is: UMAP

Q.138 What does a 'phasing graph' represent in population genetics?

The relationship between different phasing methods
The network of possible haplotype combinations
The sequence of a single chromosome
The alignment of reads to the reference
Explanation - The graph captures relationships among possible phased haplotypes.
Correct answer is: The network of possible haplotype combinations

Q.139 Which of the following best describes an 'ensemble of phylogenetic trees'?

A single tree that combines all phylogenetic information
A collection of trees used to estimate posterior probability of clades
A tree built from only one gene
A tree that ignores bootstrap values
Explanation - Bayesian phylogenetics samples tree space to produce a distribution of trees.
Correct answer is: A collection of trees used to estimate posterior probability of clades

Q.140 Which of these is a typical output format for a gene prediction tool?

GFF3
FASTA
VCF
WIG
Explanation - GFF3 files describe gene models with coordinates and attributes.
Correct answer is: GFF3

Q.141 Which of these is a common strategy to mitigate sequencing bias due to GC content?

Use of PCR‑free library preparation
Increasing adapter concentration
Using only short reads
Discarding low‑quality reads
Explanation - PCR amplification can skew GC representation; PCR‑free methods reduce bias.
Correct answer is: Use of PCR‑free library preparation

Q.142 Which of the following best describes 'tissue‑specific expression profiling'?

Measuring gene expression in all tissues at once
Quantifying expression levels of genes across different tissues
Identifying variants specific to a tissue
Aligning reads to a reference genome
Explanation - Tissue‑specific profiling reveals where genes are active.
Correct answer is: Quantifying expression levels of genes across different tissues

Q.143 In a computational biology workflow, what is the purpose of 'normalization' in RNA‑Seq data?

To adjust for differences in sequencing depth between samples
To map reads to the reference genome
To assemble the transcriptome
To predict gene function
Explanation - Normalization ensures comparability across samples.
Correct answer is: To adjust for differences in sequencing depth between samples
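
Counts per million (CPM) is one of the simplest depth normalizations and serves here only as an example; the question does not prescribe a method. The function name and the toy counts are hypothetical.

```python
def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Sample B was sequenced twice as deeply as sample A, with the same composition.
sample_a = {"geneA": 100, "geneB": 300, "geneC": 600}
sample_b = {"geneA": 200, "geneB": 600, "geneC": 1200}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)
print(cpm_a["geneA"], cpm_b["geneA"])  # 100000.0 100000.0
```

After scaling, the twofold difference in depth no longer looks like a twofold difference in expression.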

Q.144 Which of the following best describes a 'variant annotation' step?

Adding functional context to identified variants
Aligning reads to the reference genome
Compressing raw sequencing data
Predicting protein 3‑D structure
Explanation - Annotation links variants to genes, transcripts, and functional impacts.
Correct answer is: Adding functional context to identified variants

Q.145 Which of these is a common type of error introduced during sequencing?

Substitution errors
Large chromosomal rearrangements
Protein folding errors
Gene annotation errors
Explanation - Substitutions (A↔G, etc.) are frequent in sequencing platforms.
Correct answer is: Substitution errors

Q.146 Which of these tools is specifically designed for de‑novo assembly of small genomes?

Velvet
TopHat
Bowtie2
MAFFT
Explanation - Velvet is a de‑novo assembler using de Bruijn graphs.
Correct answer is: Velvet

Q.147 Which of the following best describes a 'genotype‑by‑environment interaction' (GxE) study?

Examining how the effect of genotype on a phenotype varies across environments
Testing the effect of genotype on protein structure
Measuring gene expression in different tissues
Analyzing genome assembly quality
Explanation - In a GxE interaction, the same genotype can have different phenotypic effects depending on the environment.
Correct answer is: Examining how the effect of genotype on a phenotype varies across environments

Q.148 Which of the following best describes a 'long‑read assembly graph'?

A graph where nodes represent long reads and edges represent overlaps
A graph where nodes represent k‑mers
A phylogenetic tree of long‑read sequences
A representation of gene expression levels
Explanation - Long‑read assemblers often use overlap graphs to resolve repeats.
Correct answer is: A graph where nodes represent long reads and edges represent overlaps

Q.149 Which of these is a commonly used metric to evaluate the quality of a genome assembly?

N50
Fold change
Read depth
P‑value
Explanation - N50 is a standard statistic reflecting assembly contiguity.
Correct answer is: N50
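
N50 is the contig length L such that contigs of length L or longer together contain at least half of the assembled bases. A minimal sketch, with an illustrative function name and made‑up contig lengths:

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches half the assembly."""
    if not contig_lengths:
        raise ValueError("at least one contig is required")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Add contigs from longest to shortest until half the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

A higher N50 indicates a more contiguous assembly, but it says nothing about correctness, so it is usually reported with other quality measures.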

Q.150 What does the 'phred score' of 30 represent?

A 1 in 1000 chance of incorrect base call
A 1 in 10 chance of incorrect base call
A 1 in 100 chance of incorrect base call
A 1 in 10,000 chance of incorrect base call
Explanation - Q30 corresponds to 99.9% accuracy (10^-3 error probability).
Correct answer is: A 1 in 1000 chance of incorrect base call

Q.151 Which of these best describes the function of a 'reference genome'?

A complete, well‑annotated sequence used as a baseline for comparisons
A raw sequencing run
A set of gene expression values
A database of protein structures
Explanation - The reference provides coordinates for mapping and variant calling.
Correct answer is: A complete, well‑annotated sequence used as a baseline for comparisons