Q.1 What is the main goal of reinforcement learning?
To label data
To maximize cumulative reward
To minimize classification error
To cluster data
Explanation - Reinforcement learning aims to find an optimal policy that maximizes the long-term cumulative reward an agent can achieve through interactions with the environment.
Correct answer is: To maximize cumulative reward
Q.2 Which of the following is NOT a component of reinforcement learning?
Agent
Environment
Reward Signal
Loss Function
Explanation - Reinforcement learning involves an agent, environment, states, actions, and rewards. Loss functions are common in supervised learning, not core RL terminology.
Correct answer is: Loss Function
Q.3 In reinforcement learning, what does a policy define?
How states are clustered
How actions are chosen in states
How rewards are assigned
How data is labeled
Explanation - A policy in RL is a mapping from states to actions, defining the agent’s behavior.
Correct answer is: How actions are chosen in states
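As a rough Python illustration of a policy as a mapping from states to actions (the state and action names below are invented for the example, not taken from any specific environment):
import random

# A deterministic policy: each state maps to exactly one action.
deterministic_policy = {
    "low_battery": "recharge",
    "high_battery": "explore",
}

# A stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "low_battery": {"recharge": 0.9, "explore": 0.1},
    "high_battery": {"recharge": 0.2, "explore": 0.8},
}

def act(state):
    """Sample an action from the stochastic policy for the given state."""
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["low_battery"])  # -> "recharge"
print(act("high_battery"))                  # -> "explore" most of the time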
Q.4 What is a value function used for in RL?
To estimate expected cumulative reward
To assign labels to data
To count the number of states
To measure error rates
Explanation - Value functions estimate the expected long-term cumulative reward for states or state-action pairs.
Correct answer is: To estimate expected cumulative reward
Q.5 Which of the following best describes the Markov property?
Future depends only on present state
Future depends on entire history
Future is random
Future is independent of present
Explanation - The Markov property states that the future state depends only on the current state and action, not on past states.
Correct answer is: Future depends only on present state
Q.6 Which RL method directly uses rewards without learning a model of the environment?
Model-based RL
Supervised learning
Model-free RL
Clustering
Explanation - Model-free RL does not learn the environment’s dynamics; it learns policies or value functions directly from experiences.
Correct answer is: Model-free RL
Q.7 What is an episode in reinforcement learning?
A sequence of actions only
A sequence of states, actions, and rewards until termination
A single reward observation
A group of states without actions
Explanation - An episode represents one complete trajectory from a start state to a terminal state in RL.
Correct answer is: A sequence of states, actions, and rewards until termination
Q.8 Q-learning is an example of which type of RL method?
Policy-based
Model-based
Value-based
Unsupervised
Explanation - Q-learning is a value-based method that learns the action-value function.
Correct answer is: Value-based
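A minimal Python sketch of the tabular Q-learning update rule; the states, actions, and step values are made up purely to show one application of the update:
# One tabular Q-learning update:
# Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
from collections import defaultdict

Q = defaultdict(float)          # action-value table, keyed by (state, action)
alpha, gamma = 0.1, 0.9         # learning rate and discount factor

def q_learning_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a2)] for a2 in actions)   # greedy bootstrap target
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Example transition: in state "s0", action "right" gave reward 1.0 and led to "s1".
q_learning_update("s0", "right", 1.0, "s1", actions=["left", "right"])
print(Q[("s0", "right")])   # 0.1 after one update from a zero-initialized table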
Q.9 Which algorithm improves policy using gradient ascent?
Policy Gradient
Q-learning
SARSA
K-means
Explanation - Policy gradient algorithms directly optimize the policy parameters using gradient ascent on expected rewards.
Correct answer is: Policy Gradient
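A compact REINFORCE-style sketch of gradient ascent on expected reward, using a softmax policy over a toy 2-armed bandit; the reward means and step size are illustrative assumptions, not part of the text:
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # one preference parameter per action
true_means = np.array([0.2, 1.0])   # hidden reward means of the toy bandit
alpha = 0.1                         # gradient-ascent step size

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = rng.normal(true_means[a], 1.0)   # sampled reward acts as the return here
    # REINFORCE: theta <- theta + alpha * r * grad log pi(a)
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi

print(softmax(theta))   # probability mass concentrates on the better arm (index 1)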
Q.10 What does SARSA stand for?
State-Action-Reward-State-Action
Supervised Adaptive Reinforcement Strategy Algorithm
Sequential Action Reward Sample Analysis
State-Action-Reward-Sequence Algorithm
Explanation - SARSA is an on-policy RL algorithm named after the tuple of experience it uses: State, Action, Reward, next State, next Action.
Correct answer is: State-Action-Reward-State-Action
Q.11 What is the difference between Q-learning and SARSA?
Q-learning is on-policy, SARSA is off-policy
Q-learning is off-policy, SARSA is on-policy
Both are off-policy
Both are on-policy
Explanation - Q-learning updates values assuming greedy policy actions, while SARSA updates values based on the actual action taken by the policy.
Correct answer is: Q-learning is off-policy, SARSA is on-policy
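The off-policy vs on-policy difference shows up directly in the update target; a Python sketch with hypothetical table entries for Q(s', .):
# Both update Q(s, a) toward a one-step target, but the target differs:
#   Q-learning (off-policy): target = r + gamma * max_a' Q(s', a')
#   SARSA      (on-policy) : target = r + gamma * Q(s', a'), where a' is the
#                            action the behaviour policy actually takes in s'.
gamma = 0.9
Q_next = {"left": 0.0, "right": 0.5}   # made-up values of Q(s', .)
r = 1.0
a_actual = "left"                      # action actually chosen next (e.g. by epsilon-greedy)

q_learning_target = r + gamma * max(Q_next.values())   # 1.45, assumes greedy next action
sarsa_target      = r + gamma * Q_next[a_actual]       # 1.0, follows the actual next action
print(q_learning_target, sarsa_target)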
Q.12 Which discount factor γ makes the agent value immediate rewards only?
γ = 0
γ = 0.5
γ = 1
γ > 1
Explanation - When γ=0, the agent only considers immediate rewards, ignoring future ones.
Correct answer is: γ = 0
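The effect of γ can be seen by plugging the same reward sequence into the discounted return G = sum over t of gamma^t * r_t; the rewards below are made up:
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 10.0]          # a large reward arrives only at the end
print(discounted_return(rewards, 0.0))   # 1.0  -> only the immediate reward counts
print(discounted_return(rewards, 0.9))   # 1.0 + 0.9 + 0.81 + 7.29 = 10.0
print(discounted_return(rewards, 1.0))   # 13.0 -> all rewards count equally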
Q.13 Which RL algorithm combines policy-based and value-based methods?
SARSA
Actor-Critic
Q-learning
Monte Carlo
Explanation - Actor-Critic methods combine policy gradient updates (actor) with value function estimation (critic).
Correct answer is: Actor-Critic
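A minimal one-step actor-critic sketch with a tabular critic V and a softmax actor, applied to one invented transition; all sizes and step values are illustrative:
import numpy as np

n_states, n_actions = 3, 2
V = np.zeros(n_states)                    # critic: state-value estimates
prefs = np.zeros((n_states, n_actions))   # actor: action preferences per state
alpha_actor, alpha_critic, gamma = 0.1, 0.2, 0.9

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(s, a, r, s_next):
    # Critic evaluates: TD error = r + gamma * V(s') - V(s)
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * td_error
    # Actor improves: shift probability of a up or down in proportion to the TD error
    probs = softmax(prefs[s])
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    prefs[s] += alpha_actor * td_error * grad_log_pi

actor_critic_step(s=0, a=1, r=1.0, s_next=2)   # one hypothetical transition
print(V[0], softmax(prefs[0]))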
Q.14 Exploration in RL refers to:
Exploiting known actions
Trying new actions to gather information
Maximizing immediate reward
Ignoring rewards
Explanation - Exploration allows agents to try different actions to discover potentially better long-term rewards.
Correct answer is: Trying new actions to gather information
Q.15 The ε-greedy strategy balances:
Accuracy vs Precision
Speed vs Accuracy
Exploration vs Exploitation
Policy vs Reward
Explanation - ε-greedy selects the best-known action most of the time but explores randomly with probability ε.
Correct answer is: Exploration vs Exploitation
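A short Python sketch of ε-greedy action selection; the Q-values below are placeholders:
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore a random action, otherwise exploit the best one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

q_values = [0.2, 0.8, 0.5]       # made-up action-value estimates for one state
print(epsilon_greedy(q_values))  # usually 1, occasionally a random action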
Q.16 Which of the following best characterizes Monte Carlo methods in RL?
Policy Gradient
Temporal-Difference Learning
Q-learning
Episode-based value estimation
Explanation - Monte Carlo RL estimates value functions by averaging returns from complete episodes.
Correct answer is: Episode-based value estimation
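An every-visit Monte Carlo sketch in Python: values are estimated by averaging full-episode returns, so nothing is learned until an episode terminates (the episodes and rewards are invented):
from collections import defaultdict

gamma = 0.9
returns = defaultdict(list)   # all observed returns per state
V = {}                        # Monte Carlo value estimates

# Each episode is a list of (state, reward) pairs observed until termination.
episodes = [
    [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)],
    [("s0", 0.0), ("s2", 1.0)],
]

for episode in episodes:
    G = 0.0
    # Walk backwards so G accumulates the discounted return from each state onward.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        returns[state].append(G)

for state, rs in returns.items():
    V[state] = sum(rs) / len(rs)   # average observed return = Monte Carlo estimate

print(V)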
Q.17 What is a reward signal?
Information about the environment’s structure
Numerical feedback indicating success or failure
The probability of next state
A policy definition
Explanation - The reward signal provides feedback to the agent about how good or bad an action was.
Correct answer is: Numerical feedback indicating success or failure
Q.18 Which learning method updates after every step rather than after full episodes?
Monte Carlo
Temporal-Difference
Batch Learning
Supervised
Explanation - TD methods update estimates after every step using bootstrapped one-step targets, unlike Monte Carlo methods, which wait for complete episodes.
Correct answer is: Temporal-Difference
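A TD(0) sketch in Python making the per-step update explicit; contrast it with the Monte Carlo example above, which had to wait for the episode to end (states and rewards are made up):
V = {"s0": 0.0, "s1": 0.0}
alpha, gamma = 0.1, 0.9

def td0_update(s, r, s_next):
    # Update immediately after each step using the bootstrapped target r + gamma * V(s').
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

td0_update("s0", 0.0, "s1")   # no need to wait for the episode to terminate
td0_update("s1", 1.0, "s0")
print(V)   # {'s0': 0.0, 's1': 0.1}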
Q.19 Which term describes the process of improving a policy using a value function?
Policy Evaluation
Policy Improvement
Exploration
Reward Maximization
Explanation - Policy improvement is the process of updating a policy to take actions that yield higher expected returns.
Correct answer is: Policy Improvement
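Greedy policy improvement in one line per state: pick the action with the highest estimated value; the Q table below is hypothetical:
# Hypothetical action-value estimates Q[state][action] from policy evaluation.
Q = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}

# Policy improvement: act greedily with respect to the current value estimates.
improved_policy = {s: max(actions, key=actions.get) for s, actions in Q.items()}
print(improved_policy)   # {'s0': 'right', 's1': 'left'}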
Q.20 What is the Bellman equation used for?
Defining neural networks
Relating state values recursively
Computing accuracy
Defining clustering rules
Explanation - The Bellman equation expresses value functions in terms of immediate rewards and future state values.
Correct answer is: Relating state values recursively
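A Python sketch of the Bellman expectation backup V(s) = sum over a of pi(a|s) * sum over s' of P(s'|s,a) * [r + gamma * V(s')], evaluated on a tiny invented MDP:
gamma = 0.9
V = {"s0": 0.0, "s1": 2.0}   # current value estimates (made up)

# policy[s][a] = probability of taking action a in state s
policy = {"s0": {"stay": 0.5, "go": 0.5}}
# transitions[(s, a)] = list of (probability, reward, next_state)
transitions = {
    ("s0", "stay"): [(1.0, 0.0, "s0")],
    ("s0", "go"):   [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
}

def bellman_backup(s):
    """One application of the Bellman expectation equation for state s."""
    total = 0.0
    for a, pi in policy[s].items():
        for p, r, s_next in transitions[(s, a)]:
            total += pi * p * (r + gamma * V[s_next])
    return total

print(bellman_backup("s0"))   # 1.12 for these made-up numbers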
Q.21 Which of the following is a continuous action RL algorithm?
Q-learning
SARSA
Deep Deterministic Policy Gradient (DDPG)
Monte Carlo
Explanation - DDPG handles continuous action spaces by combining actor-critic with deterministic policies.
Correct answer is: Deep Deterministic Policy Gradient (DDPG)
Q.22 In reinforcement learning, the agent learns through:
Supervision
Trial and error
Data labels
Random sampling only
Explanation - Reinforcement learning is based on trial-and-error interactions with the environment to maximize rewards.
Correct answer is: Trial and error
Q.23 Which of these is an application of RL?
Image classification
Game playing
Clustering customers
Regression analysis
Explanation - RL has been successfully applied in game playing (e.g., AlphaGo, Atari games).
Correct answer is: Game playing
Q.24 What is a state in RL?
A label for data
A representation of the environment at a given time
A numerical reward
A policy parameter
Explanation - A state is the agent’s perception of the environment at a specific time.
Correct answer is: A representation of the environment at a given time
Q.25 Which approach combines deep learning with RL?
Supervised CNNs
Deep Reinforcement Learning
Unsupervised clustering
Neural regression
Explanation - Deep RL uses neural networks to approximate value functions or policies in complex environments.
Correct answer is: Deep Reinforcement Learning
