Q.1 What is the main goal of reinforcement learning?
To label data
To maximize cumulative reward
To minimize classification error
To cluster data
Explanation - Reinforcement learning aims to find an optimal policy that maximizes the long-term cumulative reward an agent can achieve through interactions with the environment.
Correct answer is: To maximize cumulative reward
Q.2 Which of the following is NOT a component of reinforcement learning?
Agent
Environment
Reward Signal
Loss Function
Explanation - Reinforcement learning involves an agent, environment, states, actions, and rewards. Loss functions are common in supervised learning, not core RL terminology.
Correct answer is: Loss Function
Q.3 In reinforcement learning, what does a policy define?
How states are clustered
How actions are chosen in states
How rewards are assigned
How data is labeled
Explanation - A policy in RL is a mapping from states to actions, defining the agent’s behavior.
Correct answer is: How actions are chosen in states
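As a rough Python illustration of a policy as a mapping from states to actions (the state and action names below are invented for the example, not taken from any specific environment):
import random

# A deterministic policy: each state maps to exactly one action.
deterministic_policy = {
    "low_battery": "recharge",
    "high_battery": "explore",
}

# A stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "low_battery": {"recharge": 0.9, "explore": 0.1},
    "high_battery": {"recharge": 0.2, "explore": 0.8},
}

def act(state):
    """Sample an action from the stochastic policy for the given state."""
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["low_battery"])  # -> "recharge"
print(act("high_battery"))                  # -> "explore" most of the time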
Q.4 What is a value function used for in RL?
To estimate expected cumulative reward
To assign labels to data
To count the number of states
To measure error rates
Explanation - Value functions estimate the expected long-term cumulative reward for states or state-action pairs.
Correct answer is: To estimate expected cumulative reward
Q.5 Which of the following best describes the Markov property?
Future depends only on present state
Future depends on entire history
Future is random
Future is independent of present
Explanation - The Markov property states that the future state depends only on the current state and action, not on past states.
Correct answer is: Future depends only on present state
Q.6 Which RL method directly uses rewards without learning a model of the environment?
Model-based RL
Supervised learning
Model-free RL
Clustering
Explanation - Model-free RL does not learn the environment’s dynamics; it learns policies or value functions directly from experiences.
Correct answer is: Model-free RL
Q.7 What is an episode in reinforcement learning?
A sequence of actions only
A sequence of states, actions, and rewards until termination
A single reward observation
A group of states without actions
Explanation - An episode represents one complete trajectory from a start state to a terminal state in RL.
Correct answer is: A sequence of states, actions, and rewards until termination
Q.8 Q-learning is an example of which type of RL method?
Policy-based
Model-based
Value-based
Unsupervised
Explanation - Q-learning is a value-based method that learns the action-value function.
Correct answer is: Value-based
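A minimal Python sketch of the tabular Q-learning update rule; the states, actions, and step values are made up purely to show one application of the update:
# One tabular Q-learning update:
# Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
from collections import defaultdict

Q = defaultdict(float)          # action-value table, keyed by (state, action)
alpha, gamma = 0.1, 0.9         # learning rate and discount factor

def q_learning_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a2)] for a2 in actions)   # greedy bootstrap target
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Example transition: in state "s0", action "right" gave reward 1.0 and led to "s1".
q_learning_update("s0", "right", 1.0, "s1", actions=["left", "right"])
print(Q[("s0", "right")])   # 0.1 after one update from a zero-initialized table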
Q.9 Which algorithm improves policy using gradient ascent?
Policy Gradient
Q-learning
SARSA
K-means
Explanation - Policy gradient algorithms directly optimize the policy parameters using gradient ascent on expected rewards.
Correct answer is: Policy Gradient
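A compact REINFORCE-style sketch of gradient ascent on expected reward, using a softmax policy over a toy 2-armed bandit; the reward means and step size are illustrative assumptions, not part of the text:
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # one preference parameter per action
true_means = np.array([0.2, 1.0])   # hidden reward means of the toy bandit
alpha = 0.1                         # gradient-ascent step size

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = rng.normal(true_means[a], 1.0)   # sampled reward acts as the return here
    # REINFORCE: theta <- theta + alpha * r * grad log pi(a)
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi

print(softmax(theta))   # probability mass concentrates on the better arm (index 1)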
Q.10 What does SARSA stand for?
State-Action-Reward-State-Action
Supervised Adaptive Reinforcement Strategy Algorithm
Sequential Action Reward Sample Analysis
State-Action-Reward-Sequence Algorithm
Explanation - SARSA is an on-policy RL algorithm named after the tuple of experience it uses: State, Action, Reward, next State, next Action.
Correct answer is: State-Action-Reward-State-Action
Q.11 What is the difference between Q-learning and SARSA?
Q-learning is on-policy, SARSA is off-policy
Q-learning is off-policy, SARSA is on-policy
Both are off-policy
Both are on-policy
Explanation - Q-learning updates values assuming greedy policy actions, while SARSA updates values based on the actual action taken by the policy.
Correct answer is: Q-learning is off-policy, SARSA is on-policy
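The off-policy vs on-policy difference shows up directly in the update target; a Python sketch with hypothetical table entries for Q(s', .):
# Both update Q(s, a) toward a one-step target, but the target differs:
#   Q-learning (off-policy): target = r + gamma * max_a' Q(s', a')
#   SARSA      (on-policy) : target = r + gamma * Q(s', a'), where a' is the
#                            action the behaviour policy actually takes in s'.
gamma = 0.9
Q_next = {"left": 0.0, "right": 0.5}   # made-up values of Q(s', .)
r = 1.0
a_actual = "left"                      # action actually chosen next (e.g. by epsilon-greedy)

q_learning_target = r + gamma * max(Q_next.values())   # 1.45, assumes greedy next action
sarsa_target      = r + gamma * Q_next[a_actual]       # 1.0, follows the actual next action
print(q_learning_target, sarsa_target)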
Q.12 Which discount factor γ makes the agent value immediate rewards only?
γ = 0
γ = 0.5
γ = 1
γ > 1
Explanation - When γ=0, the agent only considers immediate rewards, ignoring future ones.
Correct answer is: γ = 0
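The effect of γ can be seen by plugging the same reward sequence into the discounted return G = sum over t of gamma^t * r_t; the rewards below are made up:
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 10.0]          # a large reward arrives only at the end
print(discounted_return(rewards, 0.0))   # 1.0  -> only the immediate reward counts
print(discounted_return(rewards, 0.9))   # 1.0 + 0.9 + 0.81 + 7.29 = 10.0
print(discounted_return(rewards, 1.0))   # 13.0 -> all rewards count equally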
Q.13 Which RL algorithm combines policy-based and value-based methods?
SARSA
Actor-Critic
Q-learning
Monte Carlo
Explanation - Actor-Critic methods combine policy gradient updates (actor) with value function estimation (critic).
Correct answer is: Actor-Critic
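A minimal one-step actor-critic sketch with a tabular critic V and a softmax actor, applied to one invented transition; all sizes and step values are illustrative:
import numpy as np

n_states, n_actions = 3, 2
V = np.zeros(n_states)                    # critic: state-value estimates
prefs = np.zeros((n_states, n_actions))   # actor: action preferences per state
alpha_actor, alpha_critic, gamma = 0.1, 0.2, 0.9

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(s, a, r, s_next):
    # Critic evaluates: TD error = r + gamma * V(s') - V(s)
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * td_error
    # Actor improves: shift probability of a up or down in proportion to the TD error
    probs = softmax(prefs[s])
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    prefs[s] += alpha_actor * td_error * grad_log_pi

actor_critic_step(s=0, a=1, r=1.0, s_next=2)   # one hypothetical transition
print(V[0], softmax(prefs[0]))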
Q.14 Exploration in RL refers to:
Exploiting known actions
Trying new actions to gather information
Maximizing immediate reward
Ignoring rewards
Explanation - Exploration allows agents to try different actions to discover potentially better long-term rewards.
Correct answer is: Trying new actions to gather information
Q.15 The ε-greedy strategy balances:
Accuracy vs Precision
Speed vs Accuracy
Exploration vs Exploitation
Policy vs Reward
Explanation - ε-greedy selects the best-known action most of the time but explores randomly with probability ε.
Correct answer is: Exploration vs Exploitation
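A short Python sketch of ε-greedy action selection; the Q-values below are placeholders:
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore a random action, otherwise exploit the best one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

q_values = [0.2, 0.8, 0.5]       # made-up action-value estimates for one state
print(epsilon_greedy(q_values))  # usually 1, occasionally a random action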
Q.16 Which of the following best characterizes Monte Carlo methods in RL?
Policy Gradient
Temporal-Difference Learning
Q-learning
Episode-based value estimation
Explanation - Monte Carlo RL estimates value functions by averaging returns from complete episodes.
Correct answer is: Episode-based value estimation
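An every-visit Monte Carlo sketch in Python: values are estimated by averaging full-episode returns, so nothing is learned until an episode terminates (the episodes and rewards are invented):
from collections import defaultdict

gamma = 0.9
returns = defaultdict(list)   # all observed returns per state
V = {}                        # Monte Carlo value estimates

# Each episode is a list of (state, reward) pairs observed until termination.
episodes = [
    [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)],
    [("s0", 0.0), ("s2", 1.0)],
]

for episode in episodes:
    G = 0.0
    # Walk backwards so G accumulates the discounted return from each state onward.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        returns[state].append(G)

for state, rs in returns.items():
    V[state] = sum(rs) / len(rs)   # average observed return = Monte Carlo estimate

print(V)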
Q.17 What is a reward signal?
Information about the environment’s structure
Numerical feedback indicating success or failure
The probability of next state
A policy definition
Explanation - The reward signal provides feedback to the agent about how good or bad an action was.
Correct answer is: Numerical feedback indicating success or failure
Q.18 Which learning method updates after every step rather than after full episodes?
Monte Carlo
Temporal-Difference
Batch Learning
Supervised
Explanation - TD methods update estimates after every step using bootstrapped one-step targets, unlike Monte Carlo methods, which wait for complete episodes.
Correct answer is: Temporal-Difference
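A TD(0) sketch in Python making the per-step update explicit; contrast it with the Monte Carlo example above, which had to wait for the episode to end (states and rewards are made up):
V = {"s0": 0.0, "s1": 0.0}
alpha, gamma = 0.1, 0.9

def td0_update(s, r, s_next):
    # Update immediately after each step using the bootstrapped target r + gamma * V(s').
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

td0_update("s0", 0.0, "s1")   # no need to wait for the episode to terminate
td0_update("s1", 1.0, "s0")
print(V)   # {'s0': 0.0, 's1': 0.1}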
Q.19 Which term describes the process of improving a policy using a value function?
Policy Evaluation
Policy Improvement
Exploration
Reward Maximization
Explanation - Policy improvement is the process of updating a policy to take actions that yield higher expected returns.
Correct answer is: Policy Improvement
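Greedy policy improvement in one line per state: pick the action with the highest estimated value; the Q table below is hypothetical:
# Hypothetical action-value estimates Q[state][action] from policy evaluation.
Q = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}

# Policy improvement: act greedily with respect to the current value estimates.
improved_policy = {s: max(actions, key=actions.get) for s, actions in Q.items()}
print(improved_policy)   # {'s0': 'right', 's1': 'left'}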
Q.20 What is the Bellman equation used for?
Defining neural networks
Relating state values recursively
Computing accuracy
Defining clustering rules
Explanation - The Bellman equation expresses value functions in terms of immediate rewards and future state values.
Correct answer is: Relating state values recursively
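A Python sketch of the Bellman expectation backup V(s) = sum over a of pi(a|s) * sum over s' of P(s'|s,a) * [r + gamma * V(s')], evaluated on a tiny invented MDP:
gamma = 0.9
V = {"s0": 0.0, "s1": 2.0}   # current value estimates (made up)

# policy[s][a] = probability of taking action a in state s
policy = {"s0": {"stay": 0.5, "go": 0.5}}
# transitions[(s, a)] = list of (probability, reward, next_state)
transitions = {
    ("s0", "stay"): [(1.0, 0.0, "s0")],
    ("s0", "go"):   [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
}

def bellman_backup(s):
    """One application of the Bellman expectation equation for state s."""
    total = 0.0
    for a, pi in policy[s].items():
        for p, r, s_next in transitions[(s, a)]:
            total += pi * p * (r + gamma * V[s_next])
    return total

print(bellman_backup("s0"))   # 1.12 for these made-up numbers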
Q.21 Which of the following is a continuous action RL algorithm?
Q-learning
SARSA
Deep Deterministic Policy Gradient (DDPG)
Monte Carlo
Explanation - DDPG handles continuous action spaces by combining actor-critic with deterministic policies.
Correct answer is: Deep Deterministic Policy Gradient (DDPG)
Q.22 In reinforcement learning, the agent learns through:
Supervision
Trial and error
Data labels
Random sampling only
Explanation - Reinforcement learning is based on trial-and-error interactions with the environment to maximize rewards.
Correct answer is: Trial and error
Q.23 Which of these is an application of RL?
Image classification
Game playing
Clustering customers
Regression analysis
Explanation - RL has been successfully applied in game playing (e.g., AlphaGo, Atari games).
Correct answer is: Game playing
Q.24 What is a state in RL?
A label for data
A representation of the environment at a given time
A numerical reward
A policy parameter
Explanation - A state is the agent’s perception of the environment at a specific time.
Correct answer is: A representation of the environment at a given time
Q.25 Which approach combines deep learning with RL?
Supervised CNNs
Deep Reinforcement Learning
Unsupervised clustering
Neural regression
Explanation - Deep RL uses neural networks to approximate value functions or policies in complex environments.
Correct answer is: Deep Reinforcement Learning
