Reinforcement Learning

Artificial intelligence is growing by leaps and bounds. AI techniques like deep learning and reinforcement learning have the potential to influence business to a large extent. Machine learning is often spoken of as a single monolith within AI. In reality, however, it is divided into several sub-types, and reinforcement learning is one of the most prominent among them.


What is Reinforcement learning?

Reinforcement learning is a way of training machine learning models to make a sequence of decisions. The agent learns to achieve a goal in a potentially complex and uncertain environment. Reinforcement learning presents a game-like situation: the computer applies trial and error to come up with a solution to the problem. To get the machine to do what the programmer wants, the artificial intelligence receives either rewards or penalties for the actions it performs. Its goal is to maximize the total reward.
Here the designer sets the reward policy, that is, the rules of the game, but gives no hints or suggestions on how to solve it. It is the model’s job to figure out how to perform the task so as to maximize the reward, starting from totally random trials and finishing with sophisticated tactics and even superhuman skill. Leveraging the power of many trials, reinforcement learning is currently one of the most effective ways to tap into a machine’s creativity. When run on sufficiently powerful computer infrastructure, reinforcement learning can gather experience from numerous parallel gameplays.
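To make this loop concrete, here is a minimal sketch of the trial-and-error interaction described above. It assumes a simplified, Gym-like environment whose reset() returns a state and whose step() returns (next_state, reward, done); choose_action is a placeholder policy for illustration, not part of any particular library.

```python
import random

def choose_action(state, actions):
    # Placeholder policy: pure trial and error (pick a random action).
    return random.choice(actions)

def run_episode(env, actions):
    """Run one episode and return the total reward the agent collected."""
    state = env.reset()          # assumed to return the initial state
    total_reward = 0.0
    done = False
    while not done:
        action = choose_action(state, actions)
        # The environment hands back the next state, a reward (penalties
        # are simply negative rewards), and whether the episode is over.
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward          # the quantity the agent tries to maximize
```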



How does Reinforcement learning work?


Reinforcement learning is based on the reward hypothesis: the idea that any goal can be described by the maximization of expected cumulative reward. The formal framework for reinforcement learning borrows from the problem of optimal control of Markov Decision Processes (MDPs).
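Written out, the "expected cumulative reward" is usually a discounted return; in the standard notation (with rewards R and a discount factor γ between 0 and 1) it reads:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad 0 \le \gamma \le 1
```

The agent's objective is to act so that the expected value of this return is as large as possible.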
The main elements of a Reinforcement learning system are:

  • The agent or the learner
  • The environment the agent interacts with
  • The policy that the agent follows to take actions
  • The reward signal that the agent observes upon taking actions

Using these elements, the agent explores an initially unknown environment through trial and error to achieve its goal.

The value function abstracts the reward signal to capture the ‘goodness’ of a state. The reward signal captures the immediate benefit of being in a certain state; in contrast, the value function captures the cumulative reward that is expected to be collected from that state going into the future (a small sketch of this idea follows the list below). The objective of a reinforcement learning algorithm is to discover the action policy that maximizes the average value it can extract from every state of the system. Reinforcement learning algorithms fall into two broad categories –

  • Model-free algorithm
  • Model-based algorithm
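Before looking at the two families, here is the value-function idea from above in code: a rough, every-visit Monte Carlo sketch that estimates V(s) by averaging the discounted returns observed after visiting each state. The episode format (a list of (state, reward) pairs) and the default discount factor are assumptions made for illustration.

```python
from collections import defaultdict

def estimate_values(episodes, gamma=0.99):
    """Every-visit Monte Carlo estimate of the state-value function V(s)."""
    returns = defaultdict(list)
    for episode in episodes:              # episode: list of (state, reward) pairs
        g = 0.0
        # Walk the episode backwards so that g is always the discounted
        # cumulative reward collected from this state onward.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    # The value of a state is the average return observed from it,
    # in contrast to the immediate reward received in it.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```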

Model-free algorithms

Model-free algorithms do not build an explicit model of the environment (or, more rigorously, of the underlying MDP). They instead rely on trial and error: the agent runs experiments in the environment by taking actions and derives the optimal policy from that experience directly. These algorithms are either value-based or policy-based. Value-based algorithms consider an optimal policy to be a direct result of accurately estimating the value function of every state. Using the recursive relation described by the Bellman equation, the agent interacts with the environment and samples trajectories of states and rewards. Given enough trajectories, the value function of the MDP can be estimated. Once the value function is known, the optimal policy acts greedily with respect to that value function at every state of the process. Popular value-based algorithms include SARSA and Q-learning.
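As a concrete (if simplified) example of the value-based approach, the sketch below implements tabular Q-learning with epsilon-greedy exploration. It reuses the simplified reset()/step() environment interface from earlier, and the hyperparameters alpha, gamma and epsilon are illustrative defaults rather than recommended settings.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: learn Q(s, a) from trial-and-error interaction."""
    q = defaultdict(float)                     # Q-values keyed by (state, action)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: mostly exploit the current estimates,
            # occasionally explore a random action.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            # Bellman-style update toward reward + discounted best next value.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q   # acting greedily with respect to q gives the learned policy
```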
Policy-based algorithms, on the other hand, do not model the value function. Instead, they estimate the optimal policy directly. As with value-based algorithms, the agent samples trajectories of states and rewards; however, this information is used to explicitly improve the policy by maximizing the average value function across all states. Popular policy-based RL algorithms include Monte Carlo policy gradient (REINFORCE) and the deterministic policy gradient (DPG).
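For contrast, here is a bare-bones Monte Carlo policy gradient (REINFORCE) update for a linear softmax policy over discrete actions. The feature representation, episode format and learning rate are placeholders chosen for illustration; real implementations typically add a baseline to reduce variance.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, gamma=0.99, lr=0.01):
    """One REINFORCE gradient-ascent step for a linear-softmax policy.

    theta:   (n_actions, n_features) parameter matrix
    episode: list of (features, action, reward) tuples sampled with the
             current policy
    """
    grad = np.zeros_like(theta)
    g = 0.0
    # Walk backwards so g is the discounted return from each step onward.
    for features, action, reward in reversed(episode):
        g = reward + gamma * g
        probs = softmax(theta @ features)
        # Gradient of log pi(action | state) for a linear-softmax policy.
        step_grad = -np.outer(probs, features)
        step_grad[action] += features
        grad += g * step_grad
    return theta + lr * grad           # one step up the policy gradient
```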

Model-based algorithms

Model-based RL algorithms build a model of the environment by sampling states, taking actions, and observing the rewards. For every state and possible action, the model predicts the expected reward and the expected next state. The former is a regression problem, while the latter is a density-estimation problem.
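A toy, tabular version of this idea is sketched below: the "model" is just the mean observed reward (the regression part) and the empirical next-state distribution (the density-estimation part) for every (state, action) pair, fitted from logged transitions. A planning algorithm such as value iteration could then be run on this learned model; the transition format here is an assumption made for illustration.

```python
from collections import defaultdict, Counter

def fit_tabular_model(transitions):
    """Fit a simple environment model from (state, action, reward, next_state)
    tuples: the expected reward and an empirical next-state distribution
    for every (state, action) pair."""
    reward_sums = defaultdict(float)
    counts = defaultdict(int)
    next_counts = defaultdict(Counter)
    for state, action, reward, next_state in transitions:
        key = (state, action)
        reward_sums[key] += reward
        counts[key] += 1
        next_counts[key][next_state] += 1
    model = {}
    for key, n in counts.items():
        expected_reward = reward_sums[key] / n                              # regression estimate
        next_state_probs = {s: c / n for s, c in next_counts[key].items()}  # density estimate
        model[key] = (expected_reward, next_state_probs)
    return model
```
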
Examples of Reinforcement learning

  • Robotics
  • Autonomous driving
  • AlphaGo


Benefits of Reinforcement Learning

Reinforcement learning can handle a wide range of complex problems that other machine learning algorithms cannot tackle. It is closer to artificial general intelligence (AGI), as it possesses the ability to seek a long-term goal while exploring various possibilities autonomously. RL comes with many benefits, and some of them include:
  • It considers problems as a whole. Conventional machine learning algorithms are designed to excel at specific subtasks without a notion of the big picture. RL, on the other hand, doesn’t divide the problem into subproblems; it works directly to maximize the long-term reward. It has an obvious purpose, understands the goal, and is capable of trading off short-term rewards for long-term benefits.
  • Does not need a separate data collection step. In RL, training data is obtained via the direct interaction of the agent with the environment. Training data is the learning agent’s experience, not a separate collection of data that has to be fed to the algorithm. This significantly reduces the burden on the supervisor in charge of the training process.
  • Works in dynamic, uncertain environments. RL algorithms are inherently adaptive and built to respond to changes in the environment. In RL, time matters, and the experience that the agent collects is not independently and identically distributed (i.i.d.), unlike conventional machine learning algorithms. Since the dimension of time is deeply buried in the mechanics of RL, learning is inherently adaptive.

Challenges with Reinforcement Learning

Though RL algorithms have successfully solved complex problems in diverse simulated environments, their adoption in the real world remains slow.

  • The RL agent needs extensive experience. Scaling and tweaking the neural network that controls the agent is a challenge: there is no way to communicate with the network other than through the system of rewards and penalties. The training data used by RL algorithms is generated autonomously, so the rate of data collection is limited by the dynamics of the environment. In environments with high latency, the learning curve is slow. Furthermore, extensive exploration is needed in complex environments with high-dimensional state spaces before a good solution can be found.
  • Delayed rewards. The learning agent can trade off short-term rewards for long-term gains. Although this foundational principle of RL is useful, it also makes it difficult for the agent to discover the optimal policy. This is especially true in environments where the outcome is unknown until a large number of sequential actions have been taken. In this scenario, assigning credit to a previous action for the final outcome is challenging and can introduce large variance during training. The game of chess is a relevant example: the outcome of the game is unknown until both players have made all their moves.
  • Lack of interpretability. Once an RL agent has learned the optimal policy and is deployed in the environment, it takes actions based on its experience. To an external observer, the reason for these actions might not be obvious, and this lack of interpretability interferes with the development of trust between the agent and the observer. If an observer could explain the actions that the RL agent takes, it would help them understand the problem better and discover the model’s limitations, especially in high-risk environments.
