Programming for Bioinformatics: MCQs Practice Set

Q.1 Which Python library is commonly used to parse FASTA files in bioinformatics?

NumPy

Biopython

Pandas

Matplotlib

Explanation - Biopython provides tools such as SeqIO for reading and writing FASTA files.

Correct answer is: Biopython

Q.2 What does the BLAST algorithm primarily compare?

Protein structures

DNA sequences

Gene expression levels

Metabolic pathways

Explanation - BLAST (Basic Local Alignment Search Tool) compares nucleotide or protein sequences to find regions of local similarity. Of the options listed, 'DNA sequences' is the closest match.

Correct answer is: DNA sequences

Q.3 In a phylogenetic tree, what does a longer branch typically indicate?

Higher mutation rate

Shorter evolutionary distance

Lower sequence similarity

More recent common ancestor

Explanation - Longer branches represent more evolutionary change (typically more substitutions per site), which can reflect a higher mutation rate or a longer elapsed time.

Correct answer is: Higher mutation rate

Q.4 Which of the following is NOT a commonly used sequence alignment metric?

Percent identity

E-value

GC content

Alignment score

Explanation - GC content refers to the proportion of G and C bases in a sequence, not a metric of alignment quality.

Correct answer is: GC content

Q.5 In Python, which function is used to shuffle a list randomly?

random.shuffle()

random.sample()

list.shuffle()

shuffle()

Explanation - The random.shuffle() function shuffles a list in place.

Correct answer is: random.shuffle()
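
Because Python strings are immutable, a sequence has to be turned into a list before random.shuffle() can reorder it in place. The sketch below shows one way to do that; the helper name shuffle_sequence and the seed parameter are illustrative assumptions, not part of the question.

```python
import random


def shuffle_sequence(seq: str, seed=None) -> str:
    """Return a shuffled copy of seq with the same base composition."""
    rng = random.Random(seed)   # private generator, so a seed gives repeatable output
    bases = list(seq)           # strings are immutable; shuffle a list instead
    rng.shuffle(bases)          # reorders the list in place and returns None
    return "".join(bases)


if __name__ == "__main__":
    print(shuffle_sequence("AACCGGTT", seed=42))
```

Shuffled sequences of this kind are often used as a simple null model, since they keep the base composition but destroy the order.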

Q.6 What type of data structure is a FASTA file?

Binary tree

Linked list

Text file format

Matrix

Explanation - FASTA is a plain text format with header lines starting with '>' followed by sequence lines.

Correct answer is: Text file format
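
To make the format concrete, here is a minimal standard-library sketch of a FASTA reader. It is not Biopython's SeqIO; the function name parse_fasta and the choice to use the first word of the header as the record ID are assumptions made for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID (first word after '>') to its joined sequence."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[current_id] += line
    return records


if __name__ == "__main__":
    example = [">seq1 demo record", "ACGT", "GGCC", ">seq2", "TTAA"]
    print(parse_fasta(example))
```

For real data, a tested parser such as Biopython's SeqIO is usually the safer choice.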

Q.7 Which Python package is ideal for manipulating biological data frames?

NumPy

SciPy

Pandas

Plotly

Explanation - Pandas provides DataFrame structures useful for handling tabular bioinformatics data.

Correct answer is: Pandas

Q.8 The 'E-value' in BLAST indicates:

The evolutionary distance between sequences

The number of mismatches

The expected number of hits by chance

The alignment length

Explanation - The E-value estimates how many matches of equal or better score would be expected by chance; lower values indicate more significant hits.

Correct answer is: The expected number of hits by chance

Q.9 Which command in the Linux shell lists all files, including hidden ones?

ls -a

ls -l

ls -h

ls -R

Explanation - The '-a' flag includes hidden files starting with a dot.

Correct answer is: ls -a

Q.10 What is the main purpose of using a 'quality score' in next-generation sequencing data?

To determine the read length

To indicate the confidence of each base call

To assign a color code to sequences

To measure GC content

Explanation - Quality scores reflect the probability that a base call is incorrect.

Correct answer is: To indicate the confidence of each base call
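
The relationship between a Phred quality score Q and the error probability P is Q = -10 * log10(P). The sketch below decodes a FASTQ quality string and converts scores to probabilities, assuming the common Phred+33 ASCII encoding; the function names are illustrative.

```python
from typing import List


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for q in decode_quality("I5!"):
        print(q, phred_to_error_prob(q))
```

With this encoding, 'I' decodes to Q40 (about 1 error in 10,000) and '5' to Q20 (about 1 in 100).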

Q.11 Which Python library would you use for machine learning in genomics?

Scikit-learn

OpenCV

PyTorch

TensorFlow

Explanation - Scikit-learn offers a variety of ML algorithms suitable for bioinformatics tasks.

Correct answer is: Scikit-learn

Q.12 In a multiple sequence alignment, a column with no gaps or mismatches is called a:

Consensus column

Anchor column

Gap column

Polymorphic column

Explanation - A consensus column shows identical residues across all sequences.

Correct answer is: Consensus column

Q.13 Which data type in R is used to store a sequence of DNA nucleotides?

integer

factor

DNAString

matrix

Explanation - DNAString from the Biostrings package represents nucleotide sequences in R.

Correct answer is: DNAString

Q.14 What does the acronym 'RNA-Seq' stand for?

Random Nucleotide Analysis Sequencing

Rapid Nucleotide Amplification Sequencing

RNA Sequencing

Rescue Nucleic Acid Sequencing

Explanation - RNA-Seq refers to high-throughput sequencing of RNA transcripts.

Correct answer is: RNA Sequencing

Q.15 Which algorithm is most suitable for constructing phylogenies based on distance matrices?

Maximum likelihood

Neighbor-Joining

BLAST

Hidden Markov Models

Explanation - Neighbor-Joining is a distance‑based method for phylogenetic tree reconstruction.

Correct answer is: Neighbor-Joining

Q.16 In the context of gene expression analysis, what is a 'heatmap' used for?

To display sequence alignment

To visualize differential expression across samples

To plot GC content

To show phylogenetic trees

Explanation - Heatmaps represent expression levels with color gradients, aiding in pattern recognition.

Correct answer is: To visualize differential expression across samples

Q.17 Which command is used to convert a FASTQ file to FASTA format using seqtk?

seqtk seq -a input.fastq > output.fasta

seqtk convert -f input.fastq -t fasta

seqtk fq2fa input.fastq output.fasta

seqtk format -fasta input.fastq

Explanation - The '-a' flag tells seqtk to output in FASTA format.

Correct answer is: seqtk seq -a input.fastq > output.fasta

Q.18 Which of the following best describes the 'p-value' in differential gene expression?

Probability of observing the data if the null hypothesis is true

Proportion of reads mapped to a gene

Length of a transcript

Number of genes expressed

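Explanation - The p-value is the probability of observing data at least as extreme as those seen, assuming the null hypothesis (no differential expression) is true. It is not the probability that the null hypothesis itself is true.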

Correct answer is: Probability of observing the data if the null hypothesis is true

Q.19 Which programming language is NOT typically used in bioinformatics pipelines?

Python

Java

Bash

MATLAB

Explanation - While MATLAB can be used, it is less common than Python, Java, or Bash in bioinformatics.

Correct answer is: MATLAB

Q.20 What does 'GC content' refer to in a DNA sequence?

The number of G and C bases divided by total bases

The number of G bases only

The number of C bases only

The ratio of A to T bases

Explanation - GC content is calculated as (G+C)/total nucleotides, expressed as a percentage.

Correct answer is: The number of G and C bases divided by total bases
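
The formula translates directly into code. This is a minimal sketch; the function name gc_content and the decision to return 0.0 for an empty sequence are assumptions.

```python
def gc_content(seq: str) -> float:
    """Return GC content as a percentage of all bases in seq."""
    seq = seq.upper()
    if not seq:
        return 0.0  # avoid dividing by zero on an empty sequence
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


if __name__ == "__main__":
    print(gc_content("ATGCGC"))
```

Note that ambiguous bases such as N are counted in the denominator here; some tools exclude them.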

Q.21 Which algorithm is commonly used for motif discovery in DNA sequences?

K-means clustering

MEME

Gaussian Mixture Models

Random Forest

Explanation - MEME (Multiple EM for Motif Elicitation) identifies statistically significant motifs.

Correct answer is: MEME

Q.22 Which of the following is a typical input for a Hidden Markov Model in protein family classification?

RNA structure files

Protein sequence alignments

DNA methylation data

Chromosome conformation capture data

Explanation - HMMs model sequence patterns in alignments to predict protein families.

Correct answer is: Protein sequence alignments

Q.23 What is the primary function of the 'samtools view' command?

Compress BAM files

Convert SAM to BAM

Filter alignments by mapping quality

Generate coverage plots

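Explanation - samtools view prints and converts alignments (SAM, BAM, CRAM) and can filter reads by flags, mapping quality, region, and other criteria, so 'Convert SAM to BAM' also describes a common use.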

Correct answer is: Filter alignments by mapping quality

Q.24 In R, which function from the Bioconductor package 'DESeq2' is used to normalize count data?

estimateSizeFactors

normalizeCounts

preprocessInput

scaleData

Explanation - estimateSizeFactors() calculates normalization factors for sequencing depth.

Correct answer is: estimateSizeFactors

Q.25 Which of the following best describes a 'single‑cell RNA‑seq' experiment?

Sequencing DNA from a single organism

Sequencing RNA from individual cells

Sequencing proteins in a bulk sample

Sequencing a single gene

Explanation - Single‑cell RNA‑seq captures transcriptomes at the resolution of single cells.

Correct answer is: Sequencing RNA from individual cells

Q.26 In Python, how do you open a file for reading?

open('file.txt', 'w')

open('file.txt', 'r')

open('file.txt', 'x')

open('file.txt', 'a')

Explanation - The 'r' mode opens a file for reading.

Correct answer is: open('file.txt', 'r')

Q.27 What does 'ORF' stand for in genetics?

Open Reading Frame

Oligonucleotide Receptor Factor

Overall Ribonucleotide Frequency

Optimized Reverse Function

Explanation - ORF refers to a continuous sequence of codons that could encode a protein.

Correct answer is: Open Reading Frame

Q.28 Which of the following is an example of a k‑mer?

AGTC

ATGCG

GATC

TAA

Explanation - A k-mer is a substring of length k; AGTC is a 4-mer. Strictly speaking, every option is a k-mer for some k (ATGCG is a 5-mer, TAA a 3-mer).

Correct answer is: AGTC
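
As an illustration of the definition, the following sketch enumerates and counts every overlapping k-mer in a sequence. The helper name count_kmers is an assumption.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count all overlapping substrings of length k in seq."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    print(count_kmers("AGTCAGTC", 4))
```

If the sequence is shorter than k, the result is simply an empty counter.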

Q.29 In a phylogenetic tree, what does a bootstrap value represent?

Confidence level of a branch

Number of species in the tree

Length of the branch

Mutation rate

Explanation - Bootstrap values estimate statistical support for tree branches.

Correct answer is: Confidence level of a branch

Q.30 Which command is used to extract reads mapped to chromosome 12 from a BAM file?

samtools view -h input.bam chr12 > chr12.bam

samtools index input.bam chr12

samtools filter -r 12 input.bam

samtools extract -c 12 input.bam

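Explanation - The 'view' command followed by a region name selects reads overlapping that region, which requires a coordinate-sorted, indexed BAM. As written, -h emits SAM text with the header; adding -b would produce true BAM output.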

Correct answer is: samtools view -h input.bam chr12 > chr12.bam

Q.31 What is the purpose of the 'gzip' command in a bioinformatics pipeline?

To create a backup archive

To compress files for storage

To decompress FASTQ files only

To convert text to binary

Explanation - gzip reduces file size, commonly used for large sequencing files.

Correct answer is: To compress files for storage

Q.32 Which of the following is NOT a type of variant called by GATK?

SNP

Insertion

Deletion

Chromosome translocation

Explanation - GATK's short-variant callers detect SNPs and indels, but not large structural variants such as translocations.

Correct answer is: Chromosome translocation

Q.33 In Python, which library provides tools for working with genomic intervals?

Biopython

pandas

pysam

pybedtools

Explanation - pybedtools interfaces with BEDTools for genomic interval operations.

Correct answer is: pybedtools

Q.34 Which of these metrics is commonly used to assess clustering quality in unsupervised gene expression analysis?

Silhouette score

p-value

GC content

E-value

Explanation - Silhouette score measures how similar an object is to its own cluster compared to other clusters.

Correct answer is: Silhouette score

Q.35 What does the 'MAFFT' program primarily do?

Align multiple protein or nucleotide sequences

Perform phylogenetic tree reconstruction

Predict secondary structure

Cluster gene expression data

Explanation - MAFFT is a multiple sequence alignment tool.

Correct answer is: Align multiple protein or nucleotide sequences

Q.36 Which of the following best describes a 'contig' in genome assembly?

A single DNA fragment from a plasmid

An assembled sequence from overlapping reads

A region of low coverage

A gap between scaffolds

Explanation - Contigs are contiguous sequences assembled from overlapping reads.

Correct answer is: An assembled sequence from overlapping reads

Q.37 In R, what function from the 'ggplot2' package is used to create a scatter plot?

geom_bar()

geom_line()

geom_point()

geom_histogram()

Explanation - geom_point() plots points for scatter plots.

Correct answer is: geom_point()

Q.38 Which type of filter is commonly used to remove low‑quality reads based on quality scores?

Median filter

Low‑pass filter

Quality score filter

High‑pass filter

Explanation - Reads are filtered by minimum per‑base or average quality thresholds.

Correct answer is: Quality score filter

Q.39 What does the 'trim_galore' tool do?

Trims adapters and low‑quality ends from sequencing reads

Aligns reads to a reference genome

Compresses FASTQ files

Converts FASTQ to BAM

Explanation - Trim Galore is a wrapper around Cutadapt for adapter trimming.

Correct answer is: Trims adapters and low‑quality ends from sequencing reads

Q.40 Which of the following is a property of a 'protein motif'?

A specific DNA sequence

A conserved pattern of amino acids

A gene regulatory network

A chromatin state

Explanation - Protein motifs are short, conserved sequences that often indicate functional domains.

Correct answer is: A conserved pattern of amino acids

Q.41 What is the main function of the 'cutadapt' program?

Assemble genomes

Trim adapters from sequencing reads

Align reads to a reference

Perform differential expression analysis

Explanation - Cutadapt removes adapter sequences and low‑quality bases.

Correct answer is: Trim adapters from sequencing reads

Q.42 Which of these is a commonly used file format for storing gene annotations?

FASTA

SAM

GTF

BED

Explanation - GTF (Gene Transfer Format) contains gene feature annotations.

Correct answer is: GTF

Q.43 Which command is used to count the number of reads in a FASTQ file using awk?

awk '{print NR}' file.fastq | wc -l

awk 'NR % 4 == 0' file.fastq | wc -l

awk '{print $1}' file.fastq | wc -l

awk '/^@/{print $0}' file.fastq | wc -l

Explanation - Each FASTQ record spans four lines, so printing every fourth line (NR % 4 == 0) and counting the output gives the number of reads.

Correct answer is: awk 'NR % 4 == 0' file.fastq | wc -l
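
The same count can be sketched in Python by dividing the line count by four. This assumes standard, unwrapped four-line FASTQ records; the function name count_fastq_reads is illustrative.

```python
import io


def count_fastq_reads(handle) -> int:
    """Count records in a four-line-per-record FASTQ stream."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    return n_lines // 4


if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
    print(count_fastq_reads(demo))
```

For a gzip-compressed file, the handle could come from gzip.open(path, "rt") instead.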

Q.44 What is the purpose of a 'phylogenetic bootstrap analysis'?

To estimate mutation rates

To test the robustness of tree branches

To align sequences

To find conserved motifs

Explanation - Bootstrapping resamples data to assess confidence in tree topology.

Correct answer is: To test the robustness of tree branches

Q.45 Which Python package is useful for visualizing genomic data tracks?

Matplotlib

pyGenomeTracks

NumPy

SciPy

Explanation - pyGenomeTracks renders genome browser‑style tracks programmatically.

Correct answer is: pyGenomeTracks

Q.46 In the context of next‑generation sequencing, what does 'paired‑end' refer to?

Two independent samples

Sequencing reads from both ends of a DNA fragment

Two types of base calling

Read pairing with adapter sequences

Explanation - Paired‑end sequencing generates two reads per fragment, one from each end.

Correct answer is: Sequencing reads from both ends of a DNA fragment

Q.47 Which of the following commands converts a SAM file to BAM and sorts it?

samtools sort -o output.bam input.sam

samtools view -bS input.sam | samtools sort -o output.bam

samtools convert input.sam output.bam

samtools index input.sam output.bam

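Explanation - This pipeline first converts SAM to BAM, then sorts the BAM file. Recent samtools releases detect the input format automatically, so -S is accepted but no longer needed.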

Correct answer is: samtools view -bS input.sam | samtools sort -o output.bam

Q.48 Which R function is used to read a FASTQ file into a Biostrings object?

readDNAStringSet()

readFastq()

readDNAFile()

readSequence()

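Explanation - readFastq() comes from the ShortRead package rather than Biostrings; it returns a ShortReadQ object whose reads are available as a DNAStringSet via sread(). Biostrings' own readDNAStringSet() can also read FASTQ when called with format="fastq".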

Correct answer is: readFastq()

Q.49 In a variant call format (VCF) file, which field stores the genotype of an individual?

REF

ALT

INFO

FORMAT

Explanation - The FORMAT column lists the keys (such as GT, AD, DP) whose values appear in each per-sample column; the genotype itself is the GT value.

Correct answer is: FORMAT

Q.50 Which command is used to generate a de novo assembly using SPAdes?

spades.py -1 reads_1.fq -2 reads_2.fq -o assembly

spades -assemble -reads reads.fq -output assembly

spades --assemble -i reads.fq -o assembly

spades assembly -i reads.fq -o assembly

Explanation - This command passes paired-end reads to SPAdes via '-1' and '-2' and sets the output directory with '-o'.

Correct answer is: spades.py -1 reads_1.fq -2 reads_2.fq -o assembly

Q.51 Which of the following best describes a 'coverage depth' metric?

Number of unique sequences in a dataset

Average number of times a base is read

Length of the longest read

Percentage of reads mapping to the reference

Explanation - Coverage depth is the mean read depth across a genomic region.

Correct answer is: Average number of times a base is read

Q.52 What is the purpose of using 'indel realignment' during variant calling?

To remove duplicate reads

To correct mis‑aligned reads around insertions/deletions

To convert BAM to FASTQ

To filter by quality score

Explanation - Indel realignment reduces false SNV calls near indels.

Correct answer is: To correct mis‑aligned reads around insertions/deletions

Q.53 Which Python library is used for working with graph data structures in bioinformatics?

NetworkX

Pillow

OpenCV

PyTorch

Explanation - NetworkX provides graph algorithms useful for network biology.

Correct answer is: NetworkX

Q.54 Which command extracts the header lines from a FASTQ file?

awk '/^@/{print}' file.fastq

grep '^@' file.fastq

awk 'NR%4==1' file.fastq

All of the above

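Explanation - All three commands print header lines beginning with '@', but the pattern-based ones can also match quality lines that happen to start with '@'. For standard four-line records, awk 'NR%4==1' is the robust choice.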

Correct answer is: All of the above
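
One caveat worth checking in practice: '@' is also a valid quality character (Phred 31 in Phred+33 encoding), so a quality line can begin with it. The sketch below contrasts position-based and prefix-based header extraction on such a record; both function names are illustrative.

```python
from typing import List


def headers_by_position(lines: List[str]) -> List[str]:
    """Take the first line of every four-line record."""
    return [line for i, line in enumerate(lines) if i % 4 == 0]


def headers_by_prefix(lines: List[str]) -> List[str]:
    """Take every line starting with '@' (may include quality lines)."""
    return [line for line in lines if line.startswith("@")]


if __name__ == "__main__":
    record = ["@read1", "ACGT", "+", "@III"]  # quality string starts with '@'
    print(headers_by_position(record))
    print(headers_by_prefix(record))
```

On this record the prefix-based version returns two lines, one of which is actually the quality string.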

Q.55 What does the 'MACS2' software do in ChIP‑seq data analysis?

Call peaks of enriched DNA regions

Align reads to reference genome

Normalize read counts

Predict transcription factor binding sites

Explanation - MACS2 identifies statistically significant enrichment peaks.

Correct answer is: Call peaks of enriched DNA regions

Q.56 In Python, which method of a pandas DataFrame returns the mean of numeric columns?

sum()

mean()

average()

count()

Explanation - DataFrame.mean() computes column‑wise arithmetic mean.

Correct answer is: mean()

Q.57 Which of the following is NOT a valid base in RNA sequencing reads?

Explanation - RNA uses uracil (U) instead of thymine (T).

Correct answer is: T

Q.58 What is the primary advantage of using 'single‑molecule real‑time (SMRT)' sequencing?

Short read length

Long read length

Lower cost

Higher error rate only

Explanation - SMRT sequencing can generate reads exceeding 10 kb.

Correct answer is: Long read length

Q.59 Which R package provides tools for differential expression analysis of RNA‑seq data?

DESeq2

ggplot2

dplyr

tidyr

Explanation - DESeq2 models count data to test for differential expression.

Correct answer is: DESeq2

Q.60 What does the 'samtools flagstat' command output?

Alignment quality scores

Statistics of reads (mapped, unmapped)

Base composition

Coverage histogram

Explanation - Flagstat provides a quick summary of alignment statistics.

Correct answer is: Statistics of reads (mapped, unmapped)

Q.61 Which of the following describes a 'motif discovery algorithm' in DNA sequences?

A tool that predicts gene structure

A method to find statistically over‑represented patterns

A software that aligns reads

A pipeline for assembly

Explanation - Motif discovery seeks common motifs within a set of sequences.

Correct answer is: A method to find statistically over‑represented patterns

Q.62 What is the role of 'hash tables' in bioinformatics data processing?

Storing large matrices efficiently

Facilitating quick look‑ups of sequence identifiers

Plotting gene expression heatmaps

Compressing genomic data

Explanation - Hash tables provide constant‑time access to keys like sequence IDs.

Correct answer is: Facilitating quick look‑ups of sequence identifiers

Q.63 Which of these file extensions is commonly used for compressed FASTQ files?

.fq.gz

.sam.gz

.bam.gz

.vcf.gz

Explanation - Gzip-compressed FASTQ files typically use the .fq.gz or .fastq.gz extension.

Correct answer is: .fq.gz

Q.64 In a VCF file, what does the 'AF' field represent?

Allele frequency in the sample population

Alignment score

Average coverage

Alternate allele count

Explanation - AF stands for allele frequency, indicating variant prevalence.

Correct answer is: Allele frequency in the sample population

Q.65 Which command line utility is used to merge multiple BAM files?

samtools merge

samtools cat

samtools combine

samtools concat

Explanation - samtools merge combines BAM files into a single sorted BAM.

Correct answer is: samtools merge

Q.66 What is a 'kmer count table' used for in metagenomics?

Estimating genome size

Comparing read quality

Building phylogenetic trees

Storing gene annotations

Explanation - Kmer frequency distributions help infer genome size and complexity.

Correct answer is: Estimating genome size

Q.67 Which Python module provides a 'deque' data structure useful for sliding windows?

collections

numpy

pandas

Explanation - collections.deque is a double‑ended queue ideal for windowed operations.

Correct answer is: collections
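
A minimal sketch of the sliding-window idea: a deque with a fixed maxlen holds the current window, and a running count is updated as bases enter and leave. The function name sliding_gc and the choice of GC fraction as the windowed statistic are assumptions for illustration.

```python
from collections import deque
from typing import List


def sliding_gc(seq: str, window: int) -> List[float]:
    """GC fraction of every full window of the given size, stepping by one base."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    buf = deque(maxlen=window)  # appending to a full deque drops the oldest item
    gc = 0
    fractions = []
    for base in seq.upper():
        if len(buf) == window and buf[0] in "GC":
            gc -= 1  # the oldest base is about to be evicted
        buf.append(base)
        if base in "GC":
            gc += 1
        if len(buf) == window:
            fractions.append(gc / window)
    return fractions


if __name__ == "__main__":
    print(sliding_gc("GGATCC", 2))
```

Updating the count incrementally avoids rescanning the whole window at every position.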

Q.68 What does the 'fastqc' tool evaluate in sequencing data?

Assembly quality

Read quality metrics such as per‑base sequence quality

Variant calling accuracy

Phylogenetic tree reliability

Explanation - FastQC reports on many metrics, including per‑base quality and GC bias.

Correct answer is: Read quality metrics such as per‑base sequence quality

Q.69 Which of the following is NOT a type of alignment score in sequence alignment?

Bit score

E-value

Percent identity

Alignment length

Explanation - Alignment length is a parameter, not a score metric.

Correct answer is: Alignment length

Q.70 In a gene‑expression heatmap, what does a darker color typically represent?

Low expression

High expression

Average expression

No expression

Explanation - Heatmaps often use a color gradient where darker shades indicate higher values.

Correct answer is: High expression

Q.71 What is the main output of the 'bwa mem' command?

FASTA file

SAM file

VCF file

BAM file

Explanation - BWA mem produces a SAM alignment file, which can be converted to BAM.

Correct answer is: SAM file

Q.72 Which R function is used to perform a principal component analysis (PCA) on expression data?

prcomp()

pca()

princomp()

pca_analysis()

Explanation - prcomp() is the base R function for PCA.

Correct answer is: prcomp()

Q.73 What does a 'strand‑specific' RNA‑seq library preserve?

Sense strands

Both sense and antisense strands equally

Only antisense strands

No strand information

Explanation - Strand‑specific protocols retain the original transcript strand information.

Correct answer is: Sense strands

Q.74 Which of these is a key metric for evaluating a de novo assembly?

GC content

N50 value

E-value

Alignment score

Explanation - N50 indicates the contig length where half the assembly is in contigs of that size or larger.

Correct answer is: N50 value
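
The definition can be sketched as follows: sort contig lengths from longest to shortest and accumulate until at least half of the total assembly length is covered. The function name n50 is an assumption.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L hold at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return ordered[-1]  # not reached for positive lengths


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

For the lengths above the total is 54, and the three longest contigs (10 + 9 + 8 = 27) reach half of it, so the N50 is 8.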

Q.75 Which Python function calculates the Levenshtein distance between two strings?

difflib.SequenceMatcher.distance()

python-Levenshtein.distance()

distance()

levenshtein()

Explanation - The python-Levenshtein library provides an efficient distance calculation. The package is imported as Levenshtein, so the actual call is Levenshtein.distance(a, b).

Correct answer is: python-Levenshtein.distance()
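
For readers without the third-party package, the distance can be sketched in pure Python with the classic dynamic-programming recurrence. This is an illustrative implementation, not the library's code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("ACGT", "AGT"))
```

Keeping only two rows of the table gives memory proportional to the length of the second string.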

Q.76 In a phylogenetic tree, what does a 'polytomy' indicate?

A node with multiple descendant branches

A node with no branches

A node with exactly two branches

A node with a single descendant

Explanation - A polytomy is a node with more than two descendant branches; it usually represents an unresolved branching order.

Correct answer is: A node with multiple descendant branches

Q.77 Which command in Linux lists the number of lines in a file?

ls -l

wc -l

cat file | wc -l

Both b and c

Explanation - wc -l counts lines; cat file | wc -l is a common pipeline.

Correct answer is: Both b and c

Q.78 Which of these is NOT a common output of a metagenomic assembler?

Scaffold sequences

Contig sequences

Reference genomes

Binning assignments

Explanation - Assemblers produce contigs/scaffolds; reference genomes are not directly output.

Correct answer is: Reference genomes

Q.79 In Python, which data type is most suitable for storing a sequence of nucleotides?

list

tuple

str

dict

Explanation - Strings efficiently hold DNA/RNA sequences.

Correct answer is: str

Q.80 What does the 'awk NF' command do when applied to a FASTQ file?

Prints only lines with a field count greater than zero

Prints header lines

Prints lines containing 'N'

Prints lines with even number of fields

Explanation - NF is the number of fields; awk NF prints non‑empty lines.

Correct answer is: Prints only lines with a field count greater than zero

Q.81 Which R function is used to write a VCF file from a VariantAnnotation object?

writeVcf()

vcfWrite()

saveVcf()

exportVcf()

Explanation - writeVcf() writes VariantAnnotation objects to disk.

Correct answer is: writeVcf()

Q.82 What is the primary purpose of a 'reference genome' in alignment?

To serve as a target for mapping reads

To generate sequencing adapters

To store variant calls

To provide annotation data

Explanation - Reads are aligned against a reference genome to determine their genomic positions.

Correct answer is: To serve as a target for mapping reads

Q.83 Which of the following commands removes duplicate reads in a BAM file using Picard?

picard MarkDuplicates I=input.bam O=dedup.bam

picard MarkDuplicates I=input.bam O=dedup.bam REMOVE_DUPLICATES=true

picard MarkDuplicates INPUT=input.bam OUTPUT=dedup.bam

All of the above

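Explanation - All three are valid Picard argument styles, but by default MarkDuplicates only flags duplicates; only the variant with REMOVE_DUPLICATES=true actually removes them from the output. In practice Picard also expects a metrics file argument (M= or METRICS_FILE=).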

Correct answer is: All of the above

Q.84 In bioinformatics, what does 'GC skew' measure?

The ratio of GC to AT content

The imbalance of G versus C along the genome

The GC content across different species

The GC content in a single read

Explanation - GC skew (G-C)/(G+C) indicates strand asymmetry.

Correct answer is: The imbalance of G versus C along the genome
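
The (G-C)/(G+C) formula can be sketched in a few lines, along with a windowed version that is commonly plotted along a genome. The function names and the convention of returning 0.0 for a window without G or C are assumptions.

```python
from typing import List


def gc_skew(seq: str) -> float:
    """(G - C) / (G + C) for one sequence; 0.0 if it has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq: str, window: int, step: int) -> List[float]:
    """GC skew of each full window, moving by step bases (step must be positive)."""
    return [gc_skew(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


if __name__ == "__main__":
    print(windowed_gc_skew("GGCCGG", 2, 2))
```

In many bacterial genomes the sign of the skew changes near the replication origin and terminus, which is why it is usually computed in windows.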

Q.85 Which of the following is a common step in a RNA‑seq differential expression pipeline?

Read trimming

Quality filtering

Alignment to reference

All of the above

Explanation - RNA‑seq workflows typically involve trimming, quality filtering, and alignment.

Correct answer is: All of the above

Q.86 What is the output format of the 'samtools mpileup' command?

VCF

SAM

BAM

BED

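Explanation - By default samtools mpileup writes a text pileup summarizing the base calls at each position. Older samtools versions could emit VCF/BCF from mpileup, a role now handled by bcftools mpileup, so VCF is the closest of the listed options.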

Correct answer is: VCF

Q.87 Which algorithm is used by HMMER to detect protein domains?

Hidden Markov Models

Dynamic programming

Smith‑Waterman

BLASTP

Explanation - HMMER employs HMMs to model sequence families.

Correct answer is: Hidden Markov Models

Q.88 In Python, how do you open a file and read all lines into a list?

open('file.txt').readlines()

open('file.txt', 'r').readlines()

readlines('file.txt')

both a and b

Explanation - Both syntaxes read all lines; specifying 'r' is optional.

Correct answer is: both a and b

Q.89 Which of the following best describes a 'pseudogene'?

An active gene producing functional proteins

A non‑coding RNA gene

A gene that has lost its function due to mutations

A gene with multiple splice variants

Explanation - Pseudogenes are remnants of genes that no longer produce functional products.

Correct answer is: A gene that has lost its function due to mutations

Q.90 What is a 'transcriptome'?

The complete set of proteins in a cell

The complete set of DNA in a cell

The complete set of RNA transcripts in a cell

The set of all metabolites

Explanation - Transcriptome refers to all RNA molecules transcribed from the genome.

Correct answer is: The complete set of RNA transcripts in a cell

Q.91 Which of these commands extracts reads with a MAPQ score above 30?

samtools view -q 30 input.bam > highq.bam

samtools view -q 30 input.bam | samtools sort -o highq.bam

samtools view -h input.bam | awk '$5>=30' > highq.bam

All of the above

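Explanation - All three approaches filter on MAPQ >= 30, with caveats: without -b, samtools view writes SAM text even if the file is named .bam, and the awk variant also emits SAM and does not reliably preserve header lines.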

Correct answer is: All of the above

Q.92 What is the purpose of using 'multi‑qc' in a sequencing pipeline?

Generate a single QC report from multiple samples

Compress data files

Align reads to multiple references

Call variants

Explanation - multi‑qc aggregates QC metrics from FastQC and other tools into one report.

Correct answer is: Generate a single QC report from multiple samples

Q.93 Which R package is commonly used to plot genomic tracks like coverage or SNP density?

ggplot2

Gviz

tidyverse

data.table

Explanation - Gviz creates genome browsers‑style plots in R.

Correct answer is: Gviz

Q.94 In a FASTQ file, what does the '+' line signify?

Quality string delimiter

Sequence identifier repeat

Start of next record

End of file

Explanation - The '+' line separates sequence from its quality string.

Correct answer is: Quality string delimiter

Q.95 Which command extracts only unique reads from a BAM file?

samtools rmdup

samtools markdup -r

samtools dedup

samtools dedupe

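Explanation - samtools rmdup removes duplicate reads based on alignment coordinates. It is considered obsolete in current samtools releases, where samtools markdup -r is the recommended replacement.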

Correct answer is: samtools rmdup

Q.96 Which of the following best describes a 'gene ontology (GO)' term?

A type of DNA sequencing technology

A standardized description of gene functions

A tool for sequence alignment

A file format for variant calls

Explanation - GO terms categorize gene functions into biological processes, molecular functions, and cellular components.

Correct answer is: A standardized description of gene functions

Q.97 In Python, what does the 'pandas.read_csv()' function return?

A list

A DataFrame

A Series

A dictionary

Explanation - read_csv() reads tabular data into a pandas DataFrame.

Correct answer is: A DataFrame

Q.98 Which of these is a common method for normalizing RNA‑seq read counts?

RPKM

TPM

FPKM

All of the above

Explanation - All are normalization methods adjusting for gene length and sequencing depth.

Correct answer is: All of the above
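
As a worked example of one of these methods, TPM can be sketched in two steps: divide each gene's count by its length to get a per-kilobase rate, then rescale the rates so they sum to one million. The function name tpm and the toy numbers are assumptions for illustration.

```python
from typing import List, Sequence


def tpm(counts: Sequence[float], lengths_kb: Sequence[float]) -> List[float]:
    """Transcripts per million from raw counts and gene lengths in kilobases."""
    if len(counts) != len(lengths_kb):
        raise ValueError("counts and lengths_kb must have the same length")
    # Step 1: length-normalize (reads per kilobase).
    rates = [c / length for c, length in zip(counts, lengths_kb)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Step 2: depth-normalize so the values sum to one million.
    return [r / total * 1e6 for r in rates]


if __name__ == "__main__":
    print(tpm([10, 20], [1.0, 2.0]))
```

In the toy input the second gene has twice the reads but is twice as long, so both genes end up with the same TPM.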

Q.99 What does the 'GATK HaplotypeCaller' do?

Calls SNPs and indels from aligned reads

Aligns reads to the reference genome

Creates a reference assembly

Generates phylogenetic trees

Explanation - HaplotypeCaller performs local re‑assembly for accurate variant calling.

Correct answer is: Calls SNPs and indels from aligned reads

Q.100 Which of the following best describes 'phasing' in genetics?

Determining the sequence of nucleotides

Assigning variants to their parental origin

Calculating GC content

Sorting reads by quality

Explanation - Phasing reconstructs which variants co‑occur on the same chromosome.

Correct answer is: Assigning variants to their parental origin

Q.101 In a BLAST search, which parameter directly affects the length of the query region used in the alignment?

E-value

Word size

Gap penalty

Scoring matrix

Explanation - Word size defines the length of exact matches that seed alignments.

Correct answer is: Word size

Q.102 Which of the following commands would you use to convert a SAM file to a sorted BAM file using samtools?

samtools view -bS input.sam | samtools sort -o sorted.bam

samtools view input.sam | samtools sort -o sorted.bam

samtools sort input.sam -o sorted.bam

samtools convert -b input.sam -o sorted.bam

Explanation - The pipeline first converts to BAM and then sorts it.

Correct answer is: samtools view -bS input.sam | samtools sort -o sorted.bam

Q.103 What is the purpose of a 'reference panel' in population genetics?

To provide a set of known variants for imputation

To store RNA‑seq reads

To align sequencing data

To visualize phylogenies

Explanation - Reference panels contain variant data used for genotype imputation.

Correct answer is: To provide a set of known variants for imputation

Q.104 Which R package is used for clustering analysis of gene expression data?

cluster

clusterProfiler

stats

gplots

Explanation - The cluster package provides hierarchical clustering utilities.

Correct answer is: cluster

Q.105 Which of the following best describes a 'scoring matrix' in sequence alignment?

A file containing base frequencies

A table assigning scores to residue pairs

A list of alignment scores

A graphical representation of alignments

Explanation - Scoring matrices like BLOSUM or PAM assign scores to substitutions.

Correct answer is: A table assigning scores to residue pairs

Q.106 In Python, how would you import the 'pandas' library?

import pandas

include pandas

use pandas

require pandas

Explanation - The import statement loads the pandas module.

Correct answer is: import pandas

Q.107 Which command calculates the GC skew across a genome in a sliding window?

skewfinder -g genome.fasta

skew -g genome.fasta

skewfinder genome.fasta

skew genome.fasta

Explanation - skewfinder is a tool that computes GC and AT skew in windows.

Correct answer is: skewfinder -g genome.fasta

Q.108 What does the 'GFF3' file format contain?

Gene expression data

Genomic feature annotations

Variant calls

Sequencing quality scores

Explanation - GFF3 files list genomic features such as exons, transcripts, and genes.

Correct answer is: Genomic feature annotations

Q.109 Which of the following best describes a 'transposon'?

A protein-coding gene

A mobile genetic element

A type of RNA polymerase

A DNA methylation marker

Explanation - Transposons can move within the genome, affecting structure and function.

Correct answer is: A mobile genetic element

Q.110 Which command counts the number of occurrences of a pattern in a file using grep?

grep -c 'pattern' file.txt

grep 'pattern' file.txt | wc -l

both a and b

none of the above

Explanation - Both commands return the number of lines that match the pattern; a line containing the pattern several times is still counted once.

Correct answer is: both a and b

Q.111 In a phylogenetic tree, what is a 'branch length' typically proportional to?

Mutation rate

Genome size

Sequence length

Number of taxa

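Explanation - Branch length usually represents the amount of evolutionary change (for example, expected substitutions per site), which depends on both mutation rate and elapsed time. Of the options listed, mutation rate is the closest.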

Correct answer is: Mutation rate

Q.112 What does the 'BWA MEM' algorithm use for alignment?

Suffix arrays

Burrows–Wheeler transform

Huffman coding

Dynamic programming

Explanation - BWA MEM uses the BWT index for efficient read mapping.

Correct answer is: Burrows–Wheeler transform

Q.113 Which of the following is a key component of a 'genome annotation pipeline'?

Read mapping

Protein structure prediction

Variant calling

Metabolic modeling

Explanation - Annotation pipelines often begin with aligning reads to a reference.

Correct answer is: Read mapping

Q.114 Which R function extracts the mean expression for each gene across samples?

rowMeans()

colMeans()

mean()

median()

Explanation - rowMeans() computes the mean of each row (gene) in a matrix.

Correct answer is: rowMeans()

Q.115 Which command is used to sort a BAM file by coordinate using samtools?

samtools sort -o sorted.bam input.bam

samtools order input.bam -o sorted.bam

samtools coordinate input.bam -o sorted.bam

samtools index input.bam

Explanation - samtools sort orders reads by genomic coordinates.

Correct answer is: samtools sort -o sorted.bam input.bam

Q.116 In a 'de novo' assembly, what is the 'k' in a 'k‑mer' strategy?

Number of contigs produced

Size of the k‑mer subsequence

Length of reads

Number of iterations

Explanation - The 'k' denotes the length of substrings used for overlap detection.

Correct answer is: Size of the k‑mer subsequence
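To make the role of k concrete, here is a minimal Python sketch that counts the overlapping substrings of length k in a sequence. The function name `count_kmers` is a hypothetical helper for illustration; real assemblers use far more compact data structures.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping substring of length k in seq (illustrative helper)."""
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = count_kmers("ATGATGC", 3)
print(counts.most_common())
```

A larger k makes k-mers more specific (fewer spurious overlaps) but requires higher coverage, which is the usual trade-off when choosing k.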

Q.117 Which command removes adapter sequences from paired‑end reads using Trimmomatic?

trimmomatic PE input_R1.fastq input_R2.fastq output_forward_paired.fq output_forward_unpaired.fq output_reverse_paired.fq output_reverse_unpaired.fq ILLUMINACLIP:adapters.fa:2:30:10

trim_galore --paired input_R1.fastq input_R2.fastq

cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGTT

both a and b

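Explanation - Only the first command runs Trimmomatic, whose ILLUMINACLIP step removes adapters from paired‑end reads. Trim Galore can do the same job, but it is a separate tool, and the cutadapt command names no input or output files.

Correct answer is: trimmomatic PE input_R1.fastq input_R2.fastq output_forward_paired.fq output_forward_unpaired.fq output_reverse_paired.fq output_reverse_unpaired.fq ILLUMINACLIP:adapters.fa:2:30:10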

Q.118 Which of the following best describes the 'Read Depth' metric?

Average read length

Number of reads covering a region

Error rate of sequencing

GC content variation

Explanation - Read depth indicates how many reads map to a genomic position.

Correct answer is: Number of reads covering a region

Q.119 What is the main output of the 'FastQC' tool?

Alignment files

Quality reports in HTML and text

Variant call files

Phylogenetic trees

Explanation - FastQC generates a multi‑panel HTML report summarizing read quality.

Correct answer is: Quality reports in HTML and text

Q.120 Which of the following commands converts a VCF file to a BED file containing only variant positions?

vcftools --vcf input.vcf --positions --bed output.bed

awk 'NR>1 {print $1":"$4"-"$4}' input.vcf > output.bed

sed -n '2,$p' input.vcf | cut -f1,4 > output.bed

All of the above

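Explanation - Each command is meant to extract variant positions, but none is reliable as written. In a VCF, POS is column 2 (column 4 is REF), header lines begin with '#', and BED expects tab‑separated chromosome, 0‑based start, and end columns. The vcftools --positions and --bed options take input files used for filtering sites; they do not write BED output.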

Correct answer is: All of the above
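The coordinate handling is the part that is easy to get wrong: VCF positions are 1‑based, while BED intervals are 0‑based and end‑exclusive. A minimal Python sketch of the conversion follows, assuming a plain tab‑separated VCF; `vcf_to_bed_lines` is a hypothetical helper, and the choice to span the REF allele is an assumption made here for illustration.

```python
def vcf_to_bed_lines(vcf_lines):
    """Yield BED lines (chrom, 0-based start, end) for each VCF record.

    Hypothetical helper; the interval spans the REF allele (an assumption).
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the column header, and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        start = pos - 1          # VCF POS is 1-based, BED start is 0-based
        end = start + len(ref)   # BED end is exclusive
        yield f"{chrom}\t{start}\t{end}"


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr2\t200\t.\tAT\tA\t60\tPASS\t.\n",
]
for bed_line in vcf_to_bed_lines(example):
    print(bed_line)
```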

Q.121 Which of the following is an advantage of long‑read sequencing?

Higher per‑base accuracy

Lower error rates

Better assembly of repetitive regions

Smaller library preparation time

Explanation - Long reads span repeats, improving assembly contiguity.

Correct answer is: Better assembly of repetitive regions

Q.122 In Python, which function generates a random integer between 1 and 10?

random.randint(1,10)

random.random(1,10)

random.randrange(1,10)

both a and c

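Explanation - random.randint(1, 10) returns an integer from 1 to 10 inclusive. random.randrange(1, 10) excludes the upper bound and returns 1 to 9, and random.random() takes no arguments.

Correct answer is: random.randint(1,10)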

Q.123 Which command extracts the first 100 lines of a file?

head -n 100 file.txt

head 100 file.txt

tail -n 100 file.txt

sed -n '1,100p' file.txt

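Explanation - head -n 100 prints the first 100 lines. sed -n '1,100p' produces the same output, but head is the idiomatic choice; 'head 100 file.txt' treats 100 as a file name.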

Correct answer is: head -n 100 file.txt

Q.124 Which of these is a typical input for the 'MAFFT' alignment program?

Protein FASTA files

RNA‑seq FASTQ files

Variant call files

Chromosome conformation capture data

Explanation - MAFFT aligns protein or nucleotide sequences given in FASTA format.

Correct answer is: Protein FASTA files

Q.125 In a BAM file, which flag value indicates a properly paired read?

0x1

0x2

0x4

0x8

Explanation - 0x2 means the read is properly paired.

Correct answer is: 0x2
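The FLAG field is a bitwise combination of values such as 0x2, so a single number can carry several properties at once. A minimal Python sketch of decoding it follows; the short labels in `FLAG_BITS` are paraphrases chosen for this example, and only the first eight bits are covered.

```python
# Bit values follow the SAM FLAG field; the labels are informal paraphrases.
FLAG_BITS = {
    0x1: "paired",
    0x2: "properly paired",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value (illustrative helper)."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(99))  # 99 = 0x1 + 0x2 + 0x20 + 0x40
```

Testing a single property is the same bitwise AND, for example `flag & 0x2` for "properly paired".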

Q.126 Which R function calculates the variance of a numeric vector?

var()

sd()

mean()

median()

Explanation - var() returns the sample variance of a numeric vector.

Correct answer is: var()

Q.127 What is the output format of the 'samtools mpileup' command when used with the '--vcf' flag?

VCF

SAM

BAM

BED

Explanation - In older samtools releases the -v/--VCF option made mpileup emit VCF. Current releases leave VCF/BCF output to bcftools mpileup.

Correct answer is: VCF

Q.128 Which command in Linux creates a compressed version of a file?

tar -czvf archive.tar.gz folder/

gzip file.txt

compress file.txt

both a and b

Explanation - Both tar with gzip and gzip directly compress files.

Correct answer is: both a and b

Q.129 Which of the following best describes a 'transcript isoform'?

A different gene variant

An alternative splicing product of the same gene

A type of DNA methylation

A protein domain

Explanation - Isoforms result from alternative splicing producing distinct transcripts.

Correct answer is: An alternative splicing product of the same gene

Q.130 In Python, how do you iterate over the keys of a dictionary?

for key in dict:

for key in dict.keys():

both a and b

foreach key in dict

Explanation - Both syntaxes iterate over dictionary keys.

Correct answer is: both a and b

Q.131 Which command creates a FASTQ file containing only reads with a Phred quality score above 30?

awk 'NR%4==0 && $1>=30' input.fastq > highq.fastq

seqtk seq -q 30 input.fastq > highq.fastq

sed -n '4~4p' input.fastq | awk '$1>=30' > highq.fastq

All of the above

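Explanation - Of the options listed, only the seqtk command interprets Phred scores at all. Note that seqtk seq -q masks bases below the threshold rather than discarding reads, so strict read‑level filtering needs a dedicated quality filter. The awk and sed commands compare the ASCII quality string with a number, which does not decode the scores, and they would emit only quality lines.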

Correct answer is: seqtk seq -q 30 input.fastq > highq.fastq
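Read‑level quality filtering can be sketched directly in Python. This example assumes Phred+33 encoding, strict four‑line records, and "mean Phred score of at least 30" as the keep criterion; `mean_phred` and `filter_fastq` are hypothetical helpers, and other criteria (such as a minimum per‑base score) are equally plausible readings of the question.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_fastq(lines, min_mean_q=30):
    """Yield 4-line FASTQ records whose mean Phred score is >= min_mean_q."""
    lines = [ln.rstrip("\n") for ln in lines]
    # Step through the input one record (4 lines) at a time.
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if qual and mean_phred(qual) >= min_mean_q:
            yield header, seq, plus, qual


records = [
    "@read1", "ACGT", "+", "IIII",   # 'I' encodes Phred 40
    "@read2", "ACGT", "+", "####",   # '#' encodes Phred 2
]
kept = list(filter_fastq(records))
print([rec[0] for rec in kept])
```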

Q.132 Which of the following is a typical output of the 'RNA‑seq differential expression' analysis?

Variant call files

Differentially expressed gene list with fold change

Phylogenetic tree

Metabolic network diagram

Explanation - DE analyses produce gene lists with statistics and fold changes.

Correct answer is: Differentially expressed gene list with fold change

Q.133 What does the 'CIGAR' string in a BAM file describe?

Read length

Alignment operations (match/mismatch/indel)

Sequencing platform

Quality scores

Explanation - CIGAR encodes how reads align to the reference genome.

Correct answer is: Alignment operations (match/mismatch/indel)
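A CIGAR string is a run of length/operation pairs, which makes it easy to parse. The sketch below is a minimal Python example; `parse_cigar` and `reference_span` are hypothetical helper names, and the example string is made up.

```python
import re

# One or more digits followed by a single SAM CIGAR operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


print(parse_cigar("5S30M2I10M3D8M"))
print(reference_span("5S30M2I10M3D8M"))
```

Soft clips (S) and insertions (I) consume read bases only, so they do not contribute to the span on the reference.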

Q.134 Which of the following commands performs a de novo assembly using SPAdes for single‑end reads?

spades.py -s reads.fq -o assembly

spades -single reads.fq -output assembly

spades -s reads.fq --output assembly

All of the above

Explanation - spades.py accepts single‑end input with '-s' flag.

Correct answer is: spades.py -s reads.fq -o assembly

Q.135 Which command in R calculates the Pearson correlation between two vectors?

cor(x, y, method='pearson')

pearson(x, y)

correlation(x, y)

corr(x, y, type='pearson')

Explanation - The cor() function with method='pearson' computes Pearson correlation.

Correct answer is: cor(x, y, method='pearson')

Q.136 Which of the following best describes a 'methylome'?

The set of all genes in a genome

The set of all DNA methylation marks across the genome

The set of all RNA transcripts

The set of all protein domains

Explanation - Methylome refers to genome‑wide DNA methylation patterns.

Correct answer is: The set of all DNA methylation marks across the genome

Q.137 In a variant call, what does 'QUAL' represent?

Quality score of the genotype

Number of supporting reads

Allele frequency

Position of the variant

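Explanation - 'QUAL' is the Phred‑scaled quality of the variant call at the site, that is, the confidence that a variant is present. Per‑sample genotype quality is reported separately in the GQ field; of the options listed, the quality score is the closest.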

Correct answer is: Quality score of the genotype

Q.138 Which of the following commands displays the first 10 lines of a file in reverse order?

head -n 10 file.txt | tac

tac file.txt | head -n 10

tail -n 10 file.txt | rev

both a and b

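Explanation - head -n 10 selects the first 10 lines and tac then prints them in reverse order. 'tac file.txt | head -n 10' reverses the whole file first, so it shows the last 10 lines instead; rev reverses the characters within each line.

Correct answer is: head -n 10 file.txt | tac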

Q.139 Which of the following best describes an 'indel'?

Insertion or deletion of nucleotides

Substitution of a single nucleotide

A translocation event

A chromosomal inversion

Explanation - Indels are insertions or deletions relative to the reference.

Correct answer is: Insertion or deletion of nucleotides

Q.140 Which R function plots a heatmap of a gene expression matrix?

heatmap()

plot()

ggplot()

corrplot()

Explanation - heatmap() from base R generates a simple heatmap.

Correct answer is: heatmap()

Q.141 What does 'FDR' stand for in differential expression analysis?

False Discovery Rate

Fold Difference Ratio

Full Data Range

Fast Distribution Ratio

Explanation - FDR is the expected proportion of false positives among significant results.

Correct answer is: False Discovery Rate

Q.142 Which of these tools is used for rapid read alignment against a reference genome?

BWA MEM

MAFFT

HMMER

BLASTP

Explanation - BWA MEM is designed for fast mapping of short reads.

Correct answer is: BWA MEM

Q.143 In Python, which library is best for plotting genomic tracks similar to UCSC Genome Browser?

pyGenomeTracks

Matplotlib

Plotly

Bokeh

Explanation - pyGenomeTracks generates genome browser‑style plots programmatically.

Correct answer is: pyGenomeTracks

Q.144 Which command extracts the mean depth of coverage from a depth file generated by samtools depth?

awk '{sum+=$3} END{print sum/NR}' depth.txt

awk '{print $3}' depth.txt | paste -sd+ - | bc / NR

both a and b

none of the above

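Explanation - The awk one‑liner sums column 3 and divides by the number of records (NR). The second pipeline is not valid: NR is an awk variable that does not exist in the shell, and 'bc / NR' passes stray arguments to bc instead of dividing.

Correct answer is: awk '{sum+=$3} END{print sum/NR}' depth.txt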
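The same mean can be computed in Python when the depth table is already being processed in a script. This is a minimal sketch assuming the three‑column chromosome, position, depth layout; `mean_depth` is a hypothetical helper name.

```python
def mean_depth(lines):
    """Average the third column (depth) of chrom/pos/depth rows."""
    total = 0
    count = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        total += int(fields[2])
        count += 1
    return total / count if count else 0.0


rows = ["chr1\t1\t10", "chr1\t2\t20", "chr1\t3\t30"]
print(mean_depth(rows))
```

Note that the mean is taken over the positions present in the file; by default samtools depth omits positions with zero coverage unless -a is given, which raises the apparent mean.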

Q.145 What is the purpose of a 'barcode' in multiplexed sequencing libraries?

To identify sample origin within a pooled run

To increase read length

To mark quality of reads

To sort reads by GC content

Explanation - Barcodes tag reads from different samples, enabling demultiplexing.

Correct answer is: To identify sample origin within a pooled run

Q.146 Which of these commands creates an index for a BAM file?

samtools index input.bam

samtools mkindex input.bam

samtools index -b input.bam

samtools makeindex input.bam

Explanation - samtools index builds a coordinate index for efficient access.

Correct answer is: samtools index input.bam

Q.147 Which of the following best describes a 'de novo' assembly?

Assembly using a known reference genome

Assembly without a reference, using reads alone

Assembly of protein sequences

Assembly of transcriptomes only

Explanation - De novo assembly constructs sequences from scratch.

Correct answer is: Assembly without a reference, using reads alone

Q.148 Which R function is used to write a CSV file from a DataFrame?

write.csv()

write_csv()

csv.write()

write.table()

Explanation - write.csv() outputs a DataFrame to a CSV file.

Correct answer is: write.csv()

Q.149 What does the command 'wget https://example.com/file.fasta' do?

Uploads file.fasta to the server

Downloads file.fasta from the URL

Deletes file.fasta from the server

Copies file.fasta to local directory

Explanation - wget retrieves files from the web via HTTP/FTP.

Correct answer is: Downloads file.fasta from the URL

Q.150 Which of the following best describes a 'gene ontology (GO) enrichment' analysis?

Assessing over‑representation of GO terms in a gene set

Mapping genes to their chromosomal positions

Identifying sequence motifs in promoters

Predicting 3D protein structures

Explanation - GO enrichment identifies biological functions over‑represented in a list.

Correct answer is: Assessing over‑representation of GO terms in a gene set

Q.151 In a FASTQ file, what does the '+' line contain when it is followed by an identical header?

Quality string placeholder

Sequence identifier repeat

Sequence itself

No data

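Explanation - The '+' line separates the sequence from the quality string. It may contain only '+', or '+' followed by a repeat of the sequence identifier from the header line; the quality string itself is on the next line.

Correct answer is: Sequence identifier repeat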

Q.152 What does the 'FASTQC' tool highlight in its per‑sequence GC content plot?

GC content distribution across reads

Read length distribution

Quality score trends

Adapter contamination

Explanation - This plot shows the GC distribution for each read, indicating bias.

Correct answer is: GC content distribution across reads

Q.153 Which command in Linux counts the total number of characters in a file?

wc -c file.txt

cat file.txt | wc -c

both a and b

none of the above

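Explanation - wc -c counts bytes, which equals the character count for single‑byte encodings such as ASCII; piping through cat gives the same number. For multibyte text, wc -m counts characters.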

Correct answer is: both a and b

Q.154 In the context of gene regulation, what is a 'promoter'?

A coding region of a gene

A regulatory DNA sequence upstream of a gene

A type of RNA polymerase

A protein domain

Explanation - A promoter is the DNA region, usually just upstream of the transcription start site, where RNA polymerase and transcription factors bind to initiate transcription.

Correct answer is: A regulatory DNA sequence upstream of a gene

Q.155 Which R function returns the standard deviation of a vector?

sd()

var()

mean()

sum()

Explanation - sd() computes the sample standard deviation.

Correct answer is: sd()

Q.156 What does the 'SAM' format store that the 'BAM' format does not?

Sequence data

Alignment data

Quality scores

Nothing; SAM is a plain‑text format and BAM stores the same data in binary form

Explanation - BAM is a compressed binary representation of SAM, so both hold the same records.

Correct answer is: Nothing; SAM is a plain‑text format and BAM stores the same data in binary form

Q.157 Which of the following commands is used to generate a FASTA file containing only coding sequences from a GTF annotation?

gffread -g genome.fa -y coding.fasta annotation.gtf

awk '$3==CDS' annotation.gtf > coding.gtf

sed -n '/CDS/p' annotation.gtf > coding.gtf

both a and b

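Explanation - gffread extracts sequences for the features in a GTF/GFF file. Note that -y writes the protein translation of each CDS, while -x writes the spliced CDS nucleotide sequences. The awk and sed commands only filter annotation lines and produce no FASTA.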

Correct answer is: gffread -g genome.fa -y coding.fasta annotation.gtf

Q.158 In a gene‑expression heatmap, what is the typical purpose of a dendrogram?

Shows the phylogenetic tree

Groups similar expression profiles

Indicates GC content

Displays sequence alignment

Explanation - Dendrograms cluster genes or samples with similar patterns.

Correct answer is: Groups similar expression profiles

Q.159 Which of the following commands calculates the mean of a column in a tabular file using awk?

awk '{sum+=$2} END{print sum/NR}' file.txt

awk '{print $2}' file.txt | paste -sd+ - | bc / NR

both a and b

none of the above

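Explanation - The awk one‑liner sums the second column and divides by the number of records (NR). The second pipeline is not valid: NR is an awk variable that does not exist in the shell, and 'bc / NR' passes stray arguments to bc instead of dividing.

Correct answer is: awk '{sum+=$2} END{print sum/NR}' file.txt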

Q.160 What does a 'coverage plot' display?

Number of reads per base across the genome

Expression levels across samples

GC content variation

Phylogenetic distances

Explanation - Coverage plots show read depth across genomic coordinates.

Correct answer is: Number of reads per base across the genome

Q.161 In a FASTQ file, how many lines correspond to one read?

Explanation - A FASTQ record consists of 4 lines: header, sequence, '+', and quality.

Correct answer is: 4

Q.162 Which R package is commonly used for functional enrichment analysis of gene sets?

clusterProfiler

ggplot2

data.table

dplyr

Explanation - clusterProfiler performs GO and pathway enrichment analyses.

Correct answer is: clusterProfiler

Q.163 Which of the following commands creates a gzipped FASTQ file from an uncompressed FASTQ?

gzip input.fastq

bgzip input.fastq

pigz input.fastq

All of the above

Explanation - All three tools can compress FASTQ files to .gz.

Correct answer is: All of the above

Q.164 What is the purpose of a 'masker' in genome annotation?

To identify and annotate repeats

To filter low‑quality reads

To compress the genome

To predict transcription factor binding sites

Explanation - RepeatMasker identifies repetitive elements in the genome.

Correct answer is: To identify and annotate repeats

Q.165 Which of the following best describes a 'haplotype'?

A set of DNA bases forming a gene

A combination of alleles at multiple loci on the same chromosome

A type of protein

A statistical measure of read depth

Explanation - Haplotypes represent linked genetic variants on one chromosome.

Correct answer is: A combination of alleles at multiple loci on the same chromosome

Q.166 In Python, how do you split a string by commas?

string.split(',')

string.split()

string.split(',')

Explanation - The split() method splits on the delimiter provided.

Correct answer is: string.split(',')

Q.167 Which of the following commands calculates the GC content of a FASTA file using awk?

awk 'NR>1{g+=gsub(/G|C/,"")} END{print (g/len)*100}' file.fasta

awk '/[GC]/{g++} END{print g/NR}' file.fasta

both a and b

none of the above

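Explanation - Neither snippet works as written. The first never defines len, so the division fails, and the second counts lines that contain a G or C rather than counting bases. A correct calculation divides the number of G and C bases by the total number of bases, skipping header lines.

Correct answer is: none of the above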
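The calculation itself, G and C bases divided by total bases with header lines skipped, can be sketched in a few lines of Python. `gc_content` is a hypothetical helper; it pools all records in the input and counts every non‑header character (including N) in the denominator, which is an assumption of this sketch.

```python
def gc_content(fasta_lines):
    """Return the GC percentage over all sequence lines, ignoring '>' headers."""
    gc = total = 0
    for line in fasta_lines:
        line = line.strip().upper()
        if not line or line.startswith(">"):
            continue  # skip blank lines and record headers
        gc += line.count("G") + line.count("C")
        total += len(line)
    return 100.0 * gc / total if total else 0.0


print(gc_content([">seq1", "ATGC", "GGCC", ">seq2", "AATT"]))
```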

Q.168 What does the 'SAM flag 0x10' indicate?

Read is mapped in the forward direction

Read is mapped in the reverse direction

Read is unmapped

Read is part of a paired‑end alignment

Explanation - 0x10 denotes the reverse complement strand mapping.

Correct answer is: Read is mapped in the reverse direction

Q.169 Which command lists the contents of a directory sorted by modification time?

ls -t

ls -l

ls -h

ls -s

Explanation - The '-t' flag sorts by modification time, newest first.

Correct answer is: ls -t

Q.170 Which of the following best describes 'metagenomics'?

Sequencing of individual genomes

Sequencing of mixed microbial communities

Sequencing of the human transcriptome

Sequencing of the human genome only

Explanation - Metagenomics analyzes DNA from environmental samples containing many species.

Correct answer is: Sequencing of mixed microbial communities

Q.171 What does the 'RPKM' normalization formula stand for?

Reads Per Kilobase per Million mapped reads

Reads per Kilobase of mRNA

RNA per Kilobase of genome

Random Per Kilobase per Million

Explanation - RPKM corrects for gene length and sequencing depth.

Correct answer is: Reads Per Kilobase per Million mapped reads
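Written out, the name translates directly into a formula: reads divided by gene length in kilobases and by total mapped reads in millions. A minimal Python sketch follows; `rpkm` is a hypothetical helper, and the numbers are invented for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads.

    Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# 500 reads on a 2 kb gene in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

Here 500 reads over 2 kb is 250 reads per kilobase, and dividing by 10 (million mapped reads) gives 25.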

Q.172 Which of the following commands generates a FASTA file containing only sequences longer than 1000 bases?

awk 'NR%4==2 && length($0)>1000' file.fasta > long.fasta

seqtk subseq file.fasta -m 1000 > long.fasta

sed -n '/^>/p' file.fasta > headers.txt && grep -A1 -B1 '^>.*$' file.fasta | awk 'NF>1000' > long.fasta

All of the above

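Explanation - Each command is meant to filter by sequence length, but none is reliable as written. The first assumes four‑line FASTQ records and drops the headers, seqtk subseq extracts sequences by name or region rather than by length (seqtk seq -L sets a minimum length), and the third tests the number of fields (NF) instead of the sequence length.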

Correct answer is: All of the above
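A length filter has to cope with sequences wrapped over several lines, which is where one‑liners tend to break. The following minimal Python sketch joins wrapped lines before measuring length; `read_fasta` and `longer_than` are hypothetical helpers, and the demo records are synthetic.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


def longer_than(lines, min_len=1000):
    """Keep records whose full sequence is strictly longer than min_len."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) > min_len]


demo = [">short", "ACGT", ">long", "A" * 600, "C" * 600]
print([h for h, _ in longer_than(demo)])
```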

Q.173 What does the 'GC skew' plot help identify in a bacterial genome?

Strand replication origin and terminus

Gene expression levels

Phylogenetic relationships

Methylation patterns

Explanation - GC skew changes around the origin and terminus of replication.

Correct answer is: Strand replication origin and terminus

Q.174 Which of the following is a typical input for the 'BWA aln' algorithm?

Paired‑end reads

Single‑end reads

Protein sequences

RNA‑seq FASTQ files

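Explanation - bwa aln targets short reads and aligns each read file separately; SAM output is then produced with bwa samse for single‑end data (or bwa sampe for paired‑end data).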

Correct answer is: Single‑end reads

Q.175 Which R function merges two data frames by a common column?

merge()

cbind()

rbind()

join()

Explanation - merge() combines data frames on key columns.

Correct answer is: merge()

Q.176 Which of the following best describes a 'pseudogene'?

An active protein‑coding gene

A non‑coding RNA gene

A gene that has lost its function due to mutations

A gene with multiple splice variants

Explanation - Pseudogenes are non‑functional remnants of once‑active genes.

Correct answer is: A gene that has lost its function due to mutations

Q.177 What does a 'scaffold' represent in genome assembly?

A single continuous contig

A set of contigs linked with estimated distances

A read from the sequencing library

A protein domain annotation

Explanation - Scaffolds arrange contigs with gap estimates.

Correct answer is: A set of contigs linked with estimated distances

Q.178 Which command displays the number of distinct sequences in a FASTA file using awk?

awk '/^>/ {count++} END{print count}' file.fasta

grep -c '^>' file.fasta

both a and b

none of the above

Explanation - Both count the number of header lines, indicating sequences.

Correct answer is: both a and b

Q.179 In a phylogenetic tree, what is the 'root'?

The most recent common ancestor of all taxa

The oldest species in the tree

The node with the longest branch

The leaf node

Explanation - The root represents the ancestral point from which all branches diverge.

Correct answer is: The most recent common ancestor of all taxa

Q.180 Which of the following is a valid Python list comprehension that squares numbers 1–5?

[x**2 for x in range(1,6)]

[x*x for x in 1..5]

(x**2 for x in range(1,6))

[x**2 for x in range(6)]

Explanation - This comprehension correctly iterates 1–5 and squares each element.

Correct answer is: [x**2 for x in range(1,6)]

Q.181 What does the command 'samtools flagstat input.bam' output?

The number of mapped and unmapped reads

The base composition of the reference genome

The alignment score distribution

The GC content of the reads

Explanation - flagstat provides a quick summary of mapping statistics.

Correct answer is: The number of mapped and unmapped reads

Q.182 Which R function calculates the median of a numeric vector?

median()

mean()

mode()

median()

Explanation - median() returns the middle value of a sorted vector.

Correct answer is: median()

Q.183 What does the 'N' character represent in a DNA sequence?

A single nucleotide

An ambiguous nucleotide (any base)

A gap

A stop codon

Explanation - N indicates any base (A, T, C, or G).

Correct answer is: An ambiguous nucleotide (any base)

Q.184 Which command in R merges two data frames on a shared key using dplyr?

left_join(df1, df2, by='id')

merge(df1, df2, by='id')

join(df1, df2)

all_join(df1, df2)

Explanation - left_join() from dplyr performs a left merge.

Correct answer is: left_join(df1, df2, by='id')

Q.185 Which of the following commands performs a de novo assembly using Canu for long reads?

canu -p asm -d out genomeSize=3g -pacbio-raw reads.fq

canu -assembly asm -input reads.fq

canu -run asm -reads reads.fq

canu -p asm -d out -genomeSize 3g reads.fq

Explanation - This syntax specifies output directory, genome size, and PacBio raw reads.

Correct answer is: canu -p asm -d out genomeSize=3g -pacbio-raw reads.fq

Q.186 Which of the following best describes a 'pseudogene'?

A gene that has lost its function due to mutations

A non‑coding RNA gene

A gene with multiple splice variants

An active protein‑coding gene

Explanation - Pseudogenes are non‑functional remnants of once‑active genes.

Correct answer is: A gene that has lost its function due to mutations

Q.187 Which command lists all files and directories in the current directory, including hidden ones, sorted alphabetically?

ls -a

ls -al

ls -alh

All of the above

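Explanation - The -a flag includes hidden entries, and ls sorts alphabetically by default. The -l and -h flags only add long‑format details and human‑readable sizes, so all three commands qualify.

Correct answer is: All of the above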

Q.188 Which R function computes the log2 fold change between two conditions?

log2(x/y)

log2(x)/log2(y)

log2(x)+log2(y)

log2(x*y)

Explanation - Log2 fold change is the log base 2 of the ratio of expression values.

Correct answer is: log2(x/y)
