Reinforcement Learning

Reinforcement learning (RL) is the training of machine learning models to make a sequence of decisions. The agent learns to achieve a goal in a potentially complex, uncertain environment. Reinforcement learning resembles a game: the computer uses trial and error to arrive at a solution to the problem. To get the machine to do what the programmer wants, the artificial intelligence receives either rewards or penalties for the actions it performs, and its goal is to maximize the total reward.
The designer sets the reward policy, that is, the rules of the game, but gives no hints or suggestions on how to play it. It is the model's job to figure out how to perform the task so as to maximize the reward, starting from totally random trials and finishing with sophisticated tactics and sometimes superhuman skill. By leveraging the power of search across many trials, reinforcement learning is one of the most effective ways to tap into a machine's creativity. When run on sufficiently powerful computer infrastructure, it can gather experience from numerous parallel gameplays.
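To make this trial-and-error loop concrete, here is a minimal sketch of an agent learning purely from rewards in a toy environment. The `GridEnv` class and the tabular Q-learning update are illustrative assumptions made for this post, not any particular library's API.

```python
import random

class GridEnv:
    """Toy 1-D world: start at position 0, the goal is position 4."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):              # action: -1 (left) or +1 (right)
        self.pos = max(0, min(4, self.pos + action))
        done = self.pos == 4
        reward = 1.0 if done else -0.1   # small penalty per step, reward at the goal
        return self.pos, reward, done

env = GridEnv()
q = {(s, a): 0.0 for s in range(5) for a in (-1, 1)}  # value of each (state, action)

for episode in range(500):
    state, done = env.reset(), False
    while not done:
        # Trial and error: occasionally explore at random, otherwise exploit.
        if random.random() < 0.1:
            action = random.choice((-1, 1))
        else:
            action = max((-1, 1), key=lambda a: q[(state, a)])
        next_state, reward, done = env.step(action)
        # Q-learning update: move the estimate toward reward + discounted future value.
        target = reward + (0.0 if done else 0.9 * max(q[(next_state, a)] for a in (-1, 1)))
        q[(state, action)] += 0.1 * (target - q[(state, action)])
        state = next_state
```

The agent starts with no knowledge of the environment at all; the reward signal alone is what shapes its behavior over the 500 episodes.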

Although RL is a powerful approach to AI, it is not suitable for every problem, and there are several types of RL to choose from.


6 questions you should ask yourself before choosing a type of RL

1. Does my algorithm need to make a sequence of decisions?

RL is a good fit for problems that require sequential decision-making, that is, a series of decisions that all affect one another. Suppose you are developing an AI program to win a game: it is not enough for the algorithm to make one good decision; it has to make a whole series of good decisions. Because the reward reflects the outcome of the entire sequence, RL suppresses strategies that lead to low rewards and reinforces those built from a full sequence of good decisions.
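To see why the whole sequence matters, here is a small sketch of how a single end-of-game reward is spread back over every earlier decision via discounting; the numbers are illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go at each step: G_t = r_t + gamma * G_{t+1}."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# A five-move game where only the final move is rewarded:
print(discounted_returns([0, 0, 0, 0, 1.0]))
# -> [0.9606, 0.9703, 0.9801, 0.99, 1.0] (approximately)
# Every earlier move receives credit, so the full sequence of good
# decisions is reinforced, not just the last one.
```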

2. How much data do I have? What will happen if the wrong decision is made?

The amount of data you already have and the cost of making wrong decisions are the two key factors in deciding whether to use RL online or offline.

For example, suppose you run a video platform and want to train an algorithm to provide recommendations to users. Without any data, you have no choice but to interact with users through recommendation decisions in real time; this is the online setting. Such exploration comes at a cost: a few bad recommendations made while the system is learning can disappoint users. However, if you already have large amounts of logged data, you can develop a good policy without interacting with actual users at all. This is offline RL training.
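The contrast can be sketched in a few lines. Everything here, the three candidate videos, the click model, and the update rule, is an illustrative assumption, not a real recommender.

```python
import random

# Hidden ground truth for this toy example: users like video 2 best.
def user_click(video):
    return 1.0 if video == 2 else 0.0

estimates, counts = [0.0] * 3, [0] * 3

def update(video, reward):
    counts[video] += 1
    estimates[video] += (reward - estimates[video]) / counts[video]

# Online: explore on live users; every bad recommendation costs a real click.
for _ in range(200):
    video = random.randrange(3) if random.random() < 0.2 else estimates.index(max(estimates))
    update(video, user_click(video))

# Offline: the same update rule applied to a pre-collected log, so no
# live user is ever exposed to exploratory choices during training.
estimates, counts = [0.0] * 3, [0] * 3
logged = [(v, user_click(v)) for v in (random.randrange(3) for _ in range(200))]
for video, reward in logged:
    update(video, reward)
```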

3. Do I have an existing model?

You may want to write a program for a robot that picks up a physical object; here you can use the laws of physics to inform your model. If, instead, the program is meant to maximize stock market returns, there is no existing model to draw on, and heuristics with manual tuning are used instead. Interestingly, these heuristics can be suboptimal. Generally, RL is a good choice when there is no existing model to build on, or when you want to improve on an existing decision-making strategy.
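Here is a tiny sketch of what "using the laws of physics to inform your model" can look like: with known dynamics, the program can plan by simulating each candidate action instead of learning everything from scratch. The toy 1-D dynamics and force values are illustrative assumptions.

```python
def dynamics(position, velocity, force, dt=0.1):
    """Known physics model: F = ma with unit mass, integrated one step."""
    velocity += force * dt
    position += velocity * dt
    return position, velocity

def plan_step(position, velocity, target, forces=(-1.0, 0.0, 1.0)):
    """Model-based control: simulate each candidate force with the known
    model and pick the one whose predicted state is closest to the target."""
    def predicted_error(f):
        p, _ = dynamics(position, velocity, f)
        return abs(target - p)
    return min(forces, key=predicted_error)

# One planned move toward an object at x = 1.0:
print(plan_step(position=0.0, velocity=0.0, target=1.0))  # -> 1.0
```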

4. Does my goal change?

Sometimes in AI your target never changes: with stocks, you will always want to maximize your returns. Such a problem is not goal-conditioned, because you are always solving for the same goal. In other cases, though, your goal may be a moving target. Consider Loon, Google's recently shut down effort to build giant balloons to bring the internet to rural areas: the optimal position for each balloon keeps changing. For such cases, goal-conditioned RL is more suitable.
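A goal-conditioned policy simply takes the current goal as part of its input, so the same policy can chase a moving target. This toy version is an illustrative assumption, with the goal value standing in for a balloon's desired position.

```python
def goal_conditioned_policy(state, goal):
    """One policy for any goal: move toward whatever target is supplied."""
    return 1 if goal > state else -1 if goal < state else 0

state = 0
for goal in (5, 5, 2, 2, 8):          # the optimal target moves over time
    state += goal_conditioned_policy(state, goal)
    print(f"goal={goal}, state={state}")
```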

5. How long is my time horizon?

In other words, how many decisions does my algorithm have to make before arriving at a solution?

The answer can help you determine whether to use hierarchical or non-hierarchical RL. Consider writing a program that makes a robot pick up an object: the robot must approach the object and close its grippers to lift it. For problems like this, with a small number of decisions, non-hierarchical RL is often adequate. Now imagine that the same robot has to locate nails, place them on a board, then pick up a hammer and strike each nail with it. At the abstract level there are only three or four stages, but a program that controls the position of the robot's hands must issue a long sequence of low-level actions. In such cases, with longer time horizons, hierarchical RL is often useful.
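The hammer-and-nails example can be sketched as two levels of policy: a high-level one that makes the three or four abstract decisions, and a low-level one that turns each of them into a long sequence of primitive hand movements. The subgoal names and positions are illustrative assumptions.

```python
SUBGOAL_POSITION = {"fetch_nail": 3, "place_nail": 7, "fetch_hammer": 1, "strike_nail": 7}

def high_level_policy(stage):
    """A handful of abstract decisions for the whole task."""
    return ("fetch_nail", "place_nail", "fetch_hammer", "strike_nail")[stage]

def low_level_policy(hand, subgoal):
    """Many primitive decisions: step the hand toward the subgoal position."""
    target = SUBGOAL_POSITION[subgoal]
    steps = []
    while hand != target:
        hand += 1 if target > hand else -1
        steps.append(hand)               # one primitive action per step
    return steps

hand = 0
for stage in range(4):
    subgoal = high_level_policy(stage)
    trajectory = low_level_policy(hand, subgoal)
    hand = trajectory[-1] if trajectory else hand
    print(f"{subgoal}: {len(trajectory)} primitive actions, hand at {hand}")
```

Four abstract decisions expand into roughly twenty primitive ones here, which is exactly the gap hierarchical RL is designed to bridge.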

6. Is my task really sequential decision-making? What information do I have about my users?

Say you are looking to optimize the design of a website to sell a particular product. In some cases, a user may never return to your website, and perhaps the background color of the site is what decides whether they make a purchase. You could show users three backgrounds in different colors at random and see which one works best. But if you have additional information about your users, such as their gender or location, you can incorporate that context to better shape your AI program. Contextual bandits are a decision-making approach suited to exactly these situations: the algorithm tests different actions and learns which one is the most rewarding for a given context, and it comes with theoretical guarantees on performance. However, if users come back multiple times, so that decisions really do affect one another, go ahead and use RL in its most general form, alas at the cost of those theoretical guarantees.
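Here is a minimal contextual-bandit sketch for the background-color example, using per-context epsilon-greedy estimates. The contexts, colors, and click probabilities are illustrative assumptions; a production system would more likely use something like LinUCB or Thompson sampling.

```python
import random

# Hidden ground truth: the best background color differs by location.
CLICK_PROB = {("us", 0): 0.10, ("us", 1): 0.30, ("us", 2): 0.20,
              ("eu", 0): 0.40, ("eu", 1): 0.10, ("eu", 2): 0.20}

stats = {(c, a): [0, 0.0] for c in ("us", "eu") for a in range(3)}  # (count, mean)

def choose(context, epsilon=0.1):
    """Epsilon-greedy: usually show the best-known color for this context."""
    if random.random() < epsilon:
        return random.randrange(3)
    return max(range(3), key=lambda a: stats[(context, a)][1])

for _ in range(5000):
    ctx = random.choice(("us", "eu"))
    color = choose(ctx)
    reward = 1.0 if random.random() < CLICK_PROB[(ctx, color)] else 0.0
    n, mean = stats[(ctx, color)]
    stats[(ctx, color)] = [n + 1, mean + (reward - mean) / (n + 1)]

# The learned best color differs by context, which a context-free
# A/B test of three random backgrounds could never express:
for ctx in ("us", "eu"):
    print(ctx, max(range(3), key=lambda a: stats[(ctx, a)][1]))
```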
