ABOUT THIS EPISODE

Introduction to reinforcement learning concepts.

ocdevel.com/mlg/29 for notes and resources.


TRANSCRIPT

00:00:01 Welcome to the OCDevel Machine Learning Guide podcast. This podcast series is a how-to tutorial on machine learning. It's sequential, so if you haven't yet, start at episode 1. This is episode 29: Reinforcement Learning Introduction.
00:00:19 Finally, my friends, we are at reinforcement learning: the beginning of the end of your artificial intelligence quest. The beginning of the end, because reinforcement learning is not artificial intelligence, but as you'll see in a bit, it is such a core component of AI, it's kind of the heart of AI, and as we start driving into RL you'll really feel like you've made a huge step towards that goal. You'll feel the magic of RL and start to feel like you're making a dent in the grand picture. Let's define RL first and then we'll come back to AI. Reinforcement learning, or RL, is the third pillar of machine learning. We have unsupervised, supervised, and reinforcement learning (there's sometimes semi-supervised, but we don't talk about that in this series). Unsupervised learning, as you'll recall, is a machine learning model learning without instruction to sort data into piles, to find patterns in the data and put the triangles over here and the circles over there.
00:01:19 Unsupervised learning is less common, in my experience, in industry or practical application. So most of what we've been dealing with in this podcast series is supervised learning. Supervised learning is a student and a teacher: you're the teacher and your neural network is the student. You're training your model to recognize patterns with flashcards. Okay, the flashcards are your data. You hand your neural network a pile of flashcards, and on the front of each flashcard are the features, and on the back of each flashcard is the label, or the target, that your neural network is trying to learn how to predict. So supervised learning is like neural network school: you are the trainer and you're training your model to recognize patterns so that when you release it into the wild it can continue to recognize that pattern on data it's never seen before. Vision, natural language processing, recommender systems: most of the practical business applications of machine learning fall into this category.
00:02:17 Reinforcement learning is the learning model, or the agent (we call it the agent), training itself. You don't give it labeled data to train on; instead you give it a system whereby it knows whether an action it took is good or bad, and that's it. From there it learns all on its own how to navigate the cruel world. So supervised learning is handing your student a deck of flashcards; reinforcement learning is handing a kid a sword and a shield and a scoreboard and sending it out into the world, and your agent will learn all by itself how to swing the sword, how to walk around this map, which bad guys are too strong for it to fight, and eventually how to beat the game. It's an action-based machine learning system; it's all about taking actions in an environment to achieve an eventual goal. So it's a goal-based, action-based machine learning system, and the way that it learns what actions
00:03:17 to take and how to accomplish its goal is by being rewarded and punished, and that's it. So classic applications and use cases of reinforcement learning tend to be geared around games: normal human games like chess and Go, and I'm sure you've heard about AlphaGo beating the world champion Lee Sedol in recent times, that's deep reinforcement learning; as well as video games like Atari and Doom. Anything that can be framed as trying to achieve some eventual goal by taking actions through time and receiving reward or punishment, that's a reinforcement learning scenario. So it's not just video games. We've got self-driving cars, whose goal is to get the human from A to B safely, where rewards might center around staying in the lines and getting the person there in a timely fashion, and punishment might be braking too hard, these kinds of things. You can apply it to robots walking around an environment; I'm sure you've seen those mule or dog robots by Boston
00:04:17 Dynamics that are walking around through the forest, so I'm sure those have some sort of reward and punishment system baked into them for learning how to walk and navigate an environment. And of course, for our purposes, it can apply to stock trading, day trading, the eventual goal being maximizing your portfolio value, and the rewards and punishments being your gains and losses as you trade. So there are lots of applications of reinforcement learning. Now let's get back to AI. In a very early episode of the series we defined AI as sort of a list of checkboxes, like we'll have achieved AI once we've combined a bucket list of features. We talked about perception, like vision and speech; well, those we've tackled with conv nets and RNNs. AI requires learning; well, this whole thing's been about learning. You probably want your AI to have a body, that's robotics; it's not necessary, but it's icing on the cake, and it's out of the jurisdiction of this podcast. It should have the ability to act,
00:05:17 actuators, the ability to open doors or walk around or make decisions. Well, action is what reinforcement learning is all about, so RL will be covering that stuff. And it should be able to plan; planning versus action, we're going to split those two bits apart in this episode. We're going to talk about the difference between action and planning, model-free reinforcement learning versus model-based reinforcement learning. We'll get to that in a bit. But it looks like all the stuff we've covered so far, and the stuff we're going to be covering now, pretty much checks all of our boxes, so it looks like we're getting really close to the grand goal of artificial intelligence. Now, there are three more checkboxes that I commonly see that may not yet be checked: those are reasoning, knowledge representation, and memory. Now, in my opinion, a case can be made for both knowledge representation and memory being baked into a neural network's neurons, that the weights of a neuron are a sort of
00:06:16 memory or representation of the pattern it's trying to predict, or of the history of actions that have resulted in high rewards that it's now going to act upon in a reinforcement learning agent. So a case can be made that knowledge representation and memory might be considered to be baked into the neural network's neurons already; in other words, that we've already checked those boxes inadvertently. As for reasoning, that's a tougher one; that's one that maybe indeed is an unchecked checkbox. But another case could be made here that planning and reasoning might go hand in hand, that being able to plan your way through an action sequence of an environment is the act of reasoning about your situation, and since we're going to be covering planning in this sequence of episodes, we're also covering reasoning. Maybe not a very strong case, so there's potentially some more work that still needs doing around the reasoning, knowledge representation, and memory
00:07:16 aspects of artificial intelligence. There's also a research project by Google DeepMind called the differentiable neural computer, which purports to solve these exact things: knowledge representation, memory, and reasoning. So that's something worth looking into, but we'll put those on the back burner for now, and I'll leave it to you listeners to decide whether those aspects of AI are yet to be accomplished. Other than those bits, we're damn close, damn close to the end goal, and reinforcement learning, especially model-based reinforcement learning, which introduces planning to the mix, takes us one giant step towards that vision.
00:07:56 Now let's spend a little bit more time on supervised versus reinforcement learning. There are cases where supervised learning is an obvious fit. Vision, for example: you're training your conv net on a bunch of images that are tagged as cat, dog, or tree, and it learns, by flipping over the flashcards, slapping its forehead, and fixing its mistakes over time, how to make that distinction on its own, how to recognize the pattern. Supervised learning, clear case. A clear case of reinforcement learning is playing a video game: you have a scoreboard in the top left, that's your reward, and you're taking actions in an environment over time trying to accomplish your goal. Very clear case of reinforcement learning. What about trading, what about Bitcoin trading, which is our podcast project? That is a case that can go both ways, and that's partially why I chose that project for us, as it transitions us from supervised to reinforcement. How can we frame it as a supervised learning scenario? Well, we would feed into our model, whether it's an
00:08:56 LSTM RNN or a conv net, a window of time steps; that's the front of the flashcard. And it's trying to predict the very next time step's price; that's the back of the flashcard. So it learns over time how to predict the next price action based on a time window. Now what do you do with that? What do you do now that you know the next price action? Well, if you're an expert trader, you'll program a bunch of trading rules manually into your program, a bunch of if-else statements, basically saying if the price is going this way and the simple moving average is such and such and the price 20 steps ago was this, then do that. That's the supervised learning approach to acting on a predicted price action, and for many trading firms that's exactly what they want. They know how to trade, they're the experts in the subject; all they want is a predicted next price action from their supervised learning model, and they'll take it from there, thank you very much.
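To make that supervised framing concrete, here's a minimal sketch (an illustration only, with a made-up window size and toy random-walk data, not the project's actual code) of turning a price series into flashcards: a sliding window of past prices as the features, the next price as the label.

```python
import numpy as np

def make_windows(prices, window=10):
    """Turn a 1-D price series into (features, label) pairs:
    each row of X is a window of past prices, y is the next price."""
    X, y = [], []
    for i in range(len(prices) - window):
        X.append(prices[i:i + window])   # front of the flashcard
        y.append(prices[i + window])     # back of the flashcard
    return np.array(X), np.array(y)

prices = np.cumsum(np.random.randn(1000))   # fake random-walk "price" data
X, y = make_windows(prices, window=10)
print(X.shape, y.shape)                      # (990, 10) (990,)
```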
00:09:56 Okay, another approach would be that the front of your flashcard is the time step window like before, but the back of the flashcard is whether to buy or sell or hold. So it kind of looks like reinforcement learning in that we're predicting an action, but the agent isn't learning what actions to take on its own; you are teaching it which actions to take given a time step window. You are manually teaching: you are the teacher and it is the student, and you are teaching it, if the window looks like this, you buy; if the window looks like that, you sell. And in order to do that you have to have an expert on hand able to label, very precisely and accurately, the best trade signal for a time step window, and that may be very difficult to get accurate and that may be very time-consuming. When we switch to reinforcement learning, our agent learns how to buy and sell through trial and error, through losing money and
00:10:56 gaining money, towards an ultimate goal of having a very high-value portfolio. And you don't have to tell it how to trade, you don't have to tell it what's a good trade, what's a bad trade; it learns those all by itself. And the hope of reinforcement learning, as has been shown with Atari game playing and AlphaGo versus Lee Sedol, is that this bot can learn to trade with superhuman performance, that it will learn the proper buy and sell signals given a time window much better than a human could trade on their own. Now, we haven't seen that to be the case yet in deep reinforcement learning applied to trading; that's my disclaimer for you. I actually haven't made any money yet with our Bitcoin trading bot, and through lots of conversations with people and companies, there's still a lot of work and research in the space that needs to be done before these bots achieve superhuman trading power. But that's the goal. Now, one thing that a good reinforcement learning agent would need to consider when
00:11:56 it's performing its actions is the consequences of its actions on future rewards. This is called the credit assignment problem, because your RL agent is going to be experiencing delayed rewards. It might buy some amount now and take a small penalty based on the commissions, but that purchase will grow in value over time, so its downstream reward is greater than its present penalty. And this is something that's factored into any reinforcement learning agent: delayed rewards, the credit assignment problem. This is solved by something called a discount factor, a discount factor which we'll describe in the technical details of reinforcement learning in the next episode.
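As a rough illustration of that discount factor idea in the meantime (my own sketch, not from the episode): a reward received k steps in the future gets weighted by gamma to the power k, so near-term rewards count more, but a big delayed reward can still outweigh a small immediate penalty like a commission.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards weighted by gamma**k, where k is steps into the future."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Small commission penalty now, bigger payoff a few steps later:
print(discounted_return([-0.1, 0.0, 0.0, 5.0]))  # ~4.75, still worth taking
```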
00:12:56 So that's the lay of the land of reinforcement learning, its definition and how it compares to supervised learning. Now let's start cracking open reinforcement learning from a high-level perspective, let's start looking at its insides. We won't get too technical in this episode since it's an introduction; we'll get a little bit more technical in the next episode. The first high-level distinction we make in reinforcement learning is model-free reinforcement learning agents versus model-based reinforcement learning agents, model-free versus model-based. The simpler reinforcement learning agent, the model-free RL agent, is something I like to consider a reactionary agent. It is a gut-reaction agent, an instinct-based agent: if you tap it with a hammer, its leg comes up; if you put food in front of its mouth, it bites. This is almost like its reptilian brain, and this has a special name, this reactionary component of our agent: it's called a policy. The policy determines what action the agent takes given what it is experiencing right now, what it sees or what it hears or what it tastes or smells. If it sees food right in front of its face, that
00:13:56 goes into the neural network as inputs, inputs to a neural network; that neural network is called your policy, and out comes an action, which is to bite the food. So a model-free RL agent is a gut-reaction, instinct agent, and a model-based RL agent is a much more sophisticated agent which has planning built into its system. It can look ahead many steps, it can start to think about the problem and weigh the pros and cons of specific actions. So model-free is reactionary, model-based is planning-based. That's the high-level separation of our agents in this world of reinforcement learning that we're about to embark on. We won't be getting into the model-based agents for quite some time; they're much more sophisticated and complex. So we're going to start with the reactionary agents, the policy-based model-free agents, and you'd be surprised at how far you can go with these model-free agents.
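To make "policy" concrete, here's a minimal sketch (my own illustration, with made-up sizes, not any particular agent's architecture) of a policy as a neural network: observations in, a probability for each possible action out.

```python
import tensorflow as tf

n_observations = 4      # whatever the agent senses at this instant (assumed size)
n_actions = 2           # e.g. bite / don't bite, or left / right

# The policy: observations in, a probability for each possible action out.
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(n_observations,)),
    tf.keras.layers.Dense(n_actions, activation="softmax"),
])

obs = tf.random.uniform((1, n_observations))   # what the agent sees right now
action_probs = policy(obs)                      # e.g. [[0.7, 0.3]]
```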
00:14:56 If you've seen those YouTube videos of an orange ragdoll man in an environment, he's running around, and it has that Benny Hill theme song playing, and he's like running really funny and swinging his arms around and trying to jump over gaps and stuff: this is a physics-based environment called MuJoCo, M-u-J-o-C-o, which stands for Multi-Joint dynamics with Contact, and it's a very complex environment. I mean, this ragdoll guy has to learn how to move all of his limbs, with joints throughout the limbs, so he has to learn how to move quite a number of parts on his body and learn how to jump over obstacles and all these things. You would think that couldn't be performed by an unsophisticated model, but indeed, usually that is showcasing a model-free reinforcement learning agent like the proximal policy optimization agent or the deep Q-network. So you can make surprising progress with a model-free agent,
00:15:56 and in fact we're using a model-free agent for our Bitcoin trading bot. Now, before I move past model-based agents, because again we're not going to be really diving into those for quite some time, I want to tell you what these models are. First off, the word "model": I really dislike this; it took me a long time to understand what was being said here, model-free versus model-based. I don't like the use of the word model because there are models everywhere, everywhere, in RL agents, whether they're model-based or model-free. Your agent has the reinforcement learning agent at a high level, like a proximal policy optimization agent, that's a model; and it has inside of that a conv net for parsing the screenshot of the video game it's playing, or an LSTM RNN if it's parsing some time series like stock data, that's a model. So right from the get-go we have two models, and yet this is a "model-free" agent. Very confusing. When they say model in reinforcement
00:16:56 learning, in the context of model-free versus model-based, they're referring to something very, very specific: it's the planning component of the RL agent, namely a system which learns how the world works, how the world around it works, so that it can plan based on that knowledge. These are called transition dynamics, and we'll get to that in the next episode. In a very simple reinforcement learning environment they don't need to be learned per se; you'll see some of this stuff in the chapter ones of the resources that I'll be recommending, where the transition dynamics are baked into the system. For much more complex environments like Atari games or the MuJoCo environment, the transition dynamics need to be learnt. So let's have an example. If we're talking about the game chess, played on a computer, maybe it's the computer versus you, we could take a number of approaches. The computer could have built
00:17:56 into its system all the rules of chess and sort of what are the best actions to take under what circumstances, and this would be like those Windows 95 and Windows 98 chess-playing algorithms. This is an AI, this is an algorithm, but it is planning; it is a planning-based model using either these tree search algorithms, Monte Carlo tree search is a popular one, or any number of other planning-based algorithms. This is what planning is: if the chess pieces are in this position, then what is the optimal sequence of steps to take given the opponent's configuration? And so it goes down the tree simulating how things might play out, and it prunes branches that look like they'll be a dead end, until eventually it comes to a high-scoring possible move to take, and it takes that move. That would be the case when the dynamics of the system are baked into the model, are programmed into the Windows 95 chess algorithm. That stuff doesn't fly in the real world.
00:18:56 In modern systems, in complex video games and robots walking around the world, we couldn't possibly program the physics of the universe into a robot, and so instead you give it the option to learn how the world around it works; it can learn the dynamics of the world. These are called the transition dynamics. So when we say model-free versus model-based, the word model refers to a model that is mapping the dynamics of the world, the transition dynamics. And so in a model-based reinforcement learning algorithm we'll have a model for our reinforcement learning agent, and for the conv net within it and whatever else, over here on the left, and we'll have a model for learning how the world around it works, so that it can plan to make wiser decisions, over here on the right. And that's the "model" they mean when they say model-based.
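Here's a toy sketch of what having such a model buys you (my own illustration, not any real framework): if the agent has a transition function it can call, whether hand-coded or learned, it can simulate a few steps ahead and pick the action whose imagined future looks best.

```python
# Toy planning with a known (or learned) model: simulate a few steps ahead
# and pick the action whose imagined future scores best.

def transition(state, action):
    """Stand-in for the transition dynamics: returns next state and reward."""
    next_state = state + (1 if action == "right" else -1)
    reward = 1.0 if next_state == 3 else 0.0      # pretend the goal is at state 3
    return next_state, reward

def plan_value(state, depth):
    """Best total reward reachable from `state` within `depth` imagined steps."""
    if depth == 0:
        return 0.0
    values = []
    for action in ("left", "right"):
        next_state, reward = transition(state, action)
        values.append(reward + plan_value(next_state, depth - 1))
    return max(values)

def plan_action(state, depth=3):
    """Compare the immediate actions by their imagined returns and pick the best."""
    scores = {}
    for action in ("left", "right"):
        next_state, reward = transition(state, action)
        scores[action] = reward + plan_value(next_state, depth - 1)
    return max(scores, key=scores.get)

print(plan_action(0))   # "right": looking ahead finds the reward waiting at state 3
```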
00:19:55 Now, these planning or searching algorithms are the stuff of classical AI. If you crack open a textbook pre-2000 about AI, like one that I will recommend at the end of the show called AI: A Modern Approach, it's these algorithms that those books will teach you, these planning and searching algorithms for deciding what action to take given a specific configuration. And the newfangled stuff of reinforcement learning takes that planning component, pops it in as a module, and now it can both react, it can bite the food or kick its leg, and it can plan, and it can learn to do both of those in a very sophisticated deep learning framework using conv nets, LSTMs, or dense layers. So this is kind of why I think deep model-based reinforcement learning is getting towards the crux of AI, because we're combining all the powers of everything
00:20:55 we've learned thus far all into one robot. So that's the high-level breakdown of RL agents that we'll be exploring in our education: model-free versus model-based. We'll start with model-free, and now we will make another division over here on the left, in the model-free RL agents; we're going to split that into two new branches. These are the policy gradient agents and the value-based agents. We'll get into the technicals of those in the next episode; we'll talk high level here. Policy gradient agents, they're simple; it's the 101 RL agent you'll learn in any book, and all it is is deep learning applied to actions. It's just performing an action, assessing the consequence, the reward, considering actions in a time horizon so we can handle delayed rewards and the credit assignment problem, and then from there taking a gradient step, just traditional machine learning, in a direction that optimizes the policy, in a direction that helps the agent make better decisions in the future. So it's really the classical machine learning strategies that you've seen up to this point applied to an actions-and-rewards framework. It's really vanilla stuff, and you'll see this early on in your education, the policy gradient methods.
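Here's a bare-bones sketch of that gradient step, REINFORCE-style (my own illustration with made-up shapes and fake episode data; real implementations add baselines, batching over many episodes, and so on): actions that led to high discounted returns get nudged to become more probable.

```python
import tensorflow as tf

# A tiny policy network: state in, probability of each action out.
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
optimizer = tf.keras.optimizers.Adam(1e-2)

def policy_gradient_step(states, actions, returns):
    """One REINFORCE-style update: raise the log-probability of actions in
    proportion to the (discounted) return that followed them."""
    with tf.GradientTape() as tape:
        probs = policy(states)                                   # (T, 2)
        chosen = tf.reduce_sum(probs * tf.one_hot(actions, 2), axis=1)
        loss = -tf.reduce_mean(tf.math.log(chosen) * returns)
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))

# Fake one tiny "episode" just to show the shapes involved.
states = tf.random.uniform((5, 4))
actions = tf.constant([0, 1, 1, 0, 1])
returns = tf.constant([2.0, 1.5, 1.0, 0.5, 0.1])   # discounted return per step
policy_gradient_step(states, actions, returns)
```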
00:21:55 Then over here we have the value-based methods, and these take a different approach to the problem setting and use all this Bellman stuff that we'll get into later, the Bellman optimality equation and value iteration and all these things, in order to pose the problem with a different spin. Very, very similar, but a different spin: what we learn is not the policy directly, but we learn to predict the value of the current state we're in and the value of each possible action we can take. So it's a slightly more sophisticated spin; it's a little bit more like a one-step lookahead than a gut reaction.
00:22:55 So with policy gradient you're training the neurons directly to fire in a certain way when the network is in a specific state; that's really the hammer on the knee kicking the leg, the food in front of the mouth taking a bite. Whereas the value-based methods are able to sort of look down and see the value of the state we're currently standing on, and look forward and see the value of each of the options we can take, A, B, or C, and reach out and grab one of those options, whichever one has the highest value. It's a little bit more sophisticated, a little bit less reactionary, and in fact it turns out it has lower variance, as we'll see later, than the policy gradient methods, so there are a lot of advantages. It's certainly not a planning agent, don't get me wrong; when I say it's looking forward one step at the actions it could take, it's not planning. So it's more like a judgment call than a gut reaction.
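A tiny sketch of that value-based judgment call (again just my own toy illustration): something scores each available action from the current state, which in DQN would be a neural network's output, and the agent grabs the highest-scoring one, occasionally exploring instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a learned Q-network's output: the estimated value of each
# option (A, B, C) from the state the agent is currently standing on.
q_of_state = np.array([0.3, -1.2, 0.9])

def choose_action(q, epsilon=0.1):
    """Greedy 'judgment call' with a little exploration mixed in."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))   # occasionally try something random
    return int(np.argmax(q))               # usually grab the highest-valued option

print("take action", "ABC"[choose_action(q_of_state)])
```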
00:23:55 There are other pros and cons to the policy gradient methods versus the value-based methods, so it's not cut and dry. I made it out to sound like the value-based methods are the better approach; that's not true, it's a pros-and-cons setup. For example, and this isn't necessarily the case but it is very, very often the case, policy gradient methods allow you to use continuous actions, where value-based methods require you to use discrete actions. Not necessarily the case, but very, very commonly the case. A continuous action, for example, is picking a number between 0 and 100. So in our Bitcoin trading bot we want to be able to sell some arbitrary amount of Bitcoin or dollars, so we want to have the ability to take an action on a continuous scale, whereas if we were limited to using a discrete action, we'd have to hard-code some predetermined amount that the trading bot could buy or sell, which isn't very slick.
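For what that distinction looks like in code, here's a small sketch using OpenAI Gym's space types (which come up again later in the episode); the trading interpretation of the continuous range is just an assumed example.

```python
from gym import spaces
import numpy as np

# Discrete: one of a fixed set of choices, e.g. buy / hold / sell.
discrete_actions = spaces.Discrete(3)

# Continuous: any value in a range, e.g. "trade this fraction of my holdings",
# from -1 (sell everything) to +1 (buy with everything).
continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

print(discrete_actions.sample())    # e.g. 2
print(continuous_actions.sample())  # e.g. [0.37]
```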
00:24:55 On the other hand, the value-based methods have a lot less variance, variance being a huge problem in reinforcement learning which we'll get to later. So, pros and cons of policy gradient methods versus value-based methods, and per the prior episode on hyper search, what you really want to do is try them all, try both approaches. We'll get into the specific policy gradient models and value-based models in future episodes, but right here I want to name-drop the most popular from each camp just so you have an idea. The reigning champion from the value-based camp is the most popular RL agent in the world, the most spoken of in all the literature and blog posts, and showcased all over the internet: deep Q-networks, DQN. That is a value-based agent, and I'm quite sure you've heard of DQNs. And the current reigning
00:25:55 champion of the policy gradient approaches is the proximal policy optimization model, or PPO. So PPO versus DQN is sort of a showdown you'll see a lot of. Now, just to name-drop a handful of other popular RL agents out there, we have the actor-critic agents, the ACER agent, DDPG or deep deterministic policy gradient, and TRPO or trust region policy optimization, and we'll compare a lot of these in the future. Now, something I've observed, and this is my own observation, I don't know whether it's true and I might get into trouble for saying it, is this: Google DeepMind and OpenAI are the two biggest research outfits for deep RL. I've observed that Google tends to be a champion of the value learning approaches, the Q-networks, where OpenAI, it seems, tends to be a champion of the policy gradient approaches like PPO
00:26:55 and TRPO. It's just something interesting that I've observed, that Google seems to champion the deep Q-networks and OpenAI tends to champion the policy gradient approaches. Whether or not that's true, it goes to show that there are pros and cons to both camps; there is no clear winner on PG versus value-based. We'll break apart the PG versus value stuff more in the future, but that's just a lay of the land. Now let's talk about technology: libraries and frameworks and code that you can use. Now, the Hands-On Machine Learning book that I've been recommending over and over has a fantastic chapter on deep reinforcement learning, the last chapter of the book, and it gives you a lay of the land like this episode and the next. It walks you through the core concepts of reinforcement learning, it has you hand-code from scratch a policy gradient approach and a deep Q-network, and it also throws a little something called actor-critic into the mix, which we'll talk about later.
00:27:55 So I highly recommend, if you haven't already, reading that chapter in the Hands-On ML book so that you can hand-code your own deep reinforcement learning agent, but that'll be as simple as they come. Moving forward, if you want to keep up with the times, follow the latest and greatest improvements and modifications on the reinforcement learning agents from cutting-edge research, then you're going to want to use a framework, and these frameworks will bake in all the very complex math and theory behind these agents, which can be quite a task to wrestle with if you want to hand-code these things. There's a handful of popular frameworks out there for reinforcement learning; all of them are built on top of TensorFlow, so you'll get that GPU optimization, and all the knowledge that you've gained thus far will come in handy. The first one I want to mention is OpenAI Baselines. Baselines is a repository of the code that accompanies all of OpenAI's publications.
00:28:55 Each of their papers, like the PPO paper for example, has accompanying open-source code so that you can follow along, and also so that you can verify their benchmarks versus your own on your computer, or try different twists on these agents, et cetera. It's not really a framework, it's more of a dumping ground for code that accompanies their research. So OpenAI Baselines is more intended for research, not intended for developers, not intended as a plug-and-play framework. And in fact I tried to use OpenAI Baselines for our trading bot in the early days and I just couldn't adapt their code to our circumstances, because their code was too tightly coupled to their environment setup. So, Baselines: intended for research, not intended for developers. On the flip side we have reinforce.io's TensorForce, which is the framework that we're using for the Bitcoin trading bot. TensorForce is a developer framework
00:29:55 intended to be plug-and-play and easy to use, and in my opinion it is the easiest to use of the frameworks out there. It just has a really slick interface where you can pop in an environment, like our Bitcoin trading environment, and in 50 lines of code you can choose which model-free reinforcement learning agent you want to use, like the PPO agent or the DQN agent, what your network architecture is, whether it's a CNN or an LSTM RNN, and a handful of hyperparameters, and hit go, and it will abstract all the really hairy math behind the scenes for you. And the cool part about it is its modular architecture, which allows you to, on a whim, decide to switch from PPO to DQN without very much effort at all; it could take maybe 10 minutes and you've switched from your PPO to a DQN, and that way you can easily benchmark the relative performance of reinforcement learning agents for your environment.
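As a rough sketch of what that looks like, in the spirit of TensorForce's quickstart (argument names and the create-style API vary quite a bit between TensorForce versions, so treat the specifics here as assumptions to check against the docs for whatever version you install):

```python
from tensorforce import Agent, Environment

# Wrap a Gym environment (our custom trading env would plug in the same way).
environment = Environment.create(environment='gym', level='CartPole-v1',
                                 max_episode_timesteps=500)

# Swap 'ppo' for 'dqn' here to benchmark a value-based agent instead.
agent = Agent.create(agent='ppo', environment=environment, batch_size=10)

for episode in range(100):
    states = environment.reset()
    terminal = False
    while not terminal:
        actions = agent.act(states=states)
        states, terminal, reward = environment.execute(actions=actions)
        agent.observe(terminal=terminal, reward=reward)
```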
00:30:55 I talked previously about using hyper search to determine which reinforcement learning agent to use, and you'll also want to use hyper search on the hypers that correspond to specific reinforcement learning agents, so TensorForce being a developer framework makes that process very, very painless. Now, the downside of TensorForce is that it doesn't really have a big company backing it; there are two main developers behind it, both from the University of Oxford, so it doesn't have the big name behind it like Baselines has OpenAI, which could be a problem for some people who want to gauge their trust in the future success of the framework. Another framework that might meet you halfway between those two is Nervana Systems' Coach. Coach is built by Intel; now that's a big name as far as backers are concerned. Intel has a sort of labs department called Nervana, where they are actually developing their own deep learning framework comparable
00:31:55 to TensorFlow, and on top of it they've been building this deep reinforcement learning framework called Coach. Now, they're smart, they know that most people out there are using TensorFlow, not Nervana, and so they've made their framework also compatible with TensorFlow as a first-class citizen. And so Coach runs on TensorFlow, and it is intended to be a developer framework just like TensorForce is; it's not like Baselines, a dumping ground for research code, it's intended to be used by developers in a plug-and-play fashion. But it is not as successful, in my experience, at being so plug-and-play as TensorForce. So, pros and cons for all of these frameworks: Coach has a bigger backing and thereby prospectively a brighter future, but TensorForce at present, in my experience, is the most well-oiled machine and lends itself best to developers like you and me. And then finally we have RLlab, and this is an older framework; this one was
00:32:55 more popular before these other contenders came on the scene. I actually cannot speak to this framework at all, I don't have any experience with RLlab, but I'll just drop it in the show notes so you can take a look at it and do your own comparison. Now, the way these frameworks work is you're going to specify which reinforcement learning agent you're going to use, one from either the policy gradient camp, like PPO, or one from the value-based camp, like DQN, and you're going to specify your network, so you'll build some fashion of CNN or an LSTM with some dense layers in there, and you'll provide it with an environment. An environment is a class that you build; it is a subclass of OpenAI's Gym package, G-y-m, Gym. OpenAI Gym is a whole suite of environments that you can train your RL agent within, things like the Atari video games, or moving a mouse through a maze,
00:33:55 or trying to drive a car up a hill, all these little experimental environments that have pre-built into them the actions you can take, the reward system, the transition dynamics, the environment physics, all these things, and they range from very simple to very complex. Now, Gym does not have MuJoCo, which I described earlier; that is actually a proprietary environment that you have to purchase a license for. So Gym is an open-source suite of environments that you can download via pip; MuJoCo is something you'll have to go to a website and purchase a license to use. These environments range from simple to complex, and the most simple of all, the most commonly used, the hello world of reinforcement learning environments, is called CartPole. You're balancing a pole on a cart. It's kind of like if you've ever held a broom vertically in the palm of your hand and you're moving your hand around trying to balance the broom so that it doesn't fall;
00:34:55 you kind of do this little shuffle dance trying to balance the broom. That's what CartPole is: you can move this cart either left or right, and the goal is to keep the broom balanced in the air, and the reward system is a plus one for every time step that you don't drop the broom; in other words, the longer you keep the broom balanced, the more reward you get. The actions are either left or right, so it's a single discrete action, so you can use a DQN here, easy peasy. And then built into the environment is the physics of how this thing works, the angular velocity and the direction of the cart and all that stuff, so that the pole balancing follows sort of a physical system.
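Here's roughly what poking at CartPole looks like with Gym, using a random agent just to show the reset/step loop and the plus-one-per-step reward (the exact return signature of reset and step differs between older and newer Gym versions; this follows the older 4-tuple API):

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()                 # cart position/velocity, pole angle/velocity
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()          # random left (0) or right (1)
    obs, reward, done, info = env.step(action)  # +1 for every step the pole stays up
    total_reward += reward
print("episode reward:", total_reward)
```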
00:35:55 Well, if you want to build your own environment, as we did in our Bitcoin trading bot, we wanted a Bitcoin environment with price action history and the physics of what happens when you buy and sell; well, if you want that, you subclass the environment class from OpenAI's Gym package. Gym is sort of a package of standards that is respected by all the reinforcement learning frameworks, so a Gym environment superclass is sort of the specification that Baselines, TensorForce, Coach, and RLlab all respect. So as long as you subclass a Gym environment, your environment is bound to work across the different frameworks, which makes evaluating the different frameworks a lot simpler.
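A skeleton of what that subclass looks like (my own minimal sketch of the Gym interface with a made-up observation space, action space, and toy reward, not the actual trading bot's code):

```python
import gym
import numpy as np
from gym import spaces

class TradingEnv(gym.Env):
    """Minimal custom environment: define the spaces, reset(), and step()."""

    def __init__(self, prices, window=10):
        self.prices = prices
        self.window = window
        self.action_space = spaces.Discrete(3)            # buy / hold / sell
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(window,),
                                            dtype=np.float32)

    def reset(self):
        self.t = self.window
        return self.prices[self.t - self.window:self.t].astype(np.float32)

    def step(self, action):
        price_change = self.prices[self.t] - self.prices[self.t - 1]
        reward = {0: 1, 1: 0, 2: -1}[int(action)] * price_change   # toy P&L
        self.t += 1
        done = self.t >= len(self.prices)
        obs = self.prices[self.t - self.window:self.t].astype(np.float32)
        return obs, float(reward), done, {}

env = TradingEnv(np.cumsum(np.random.randn(500)))
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
```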
00:36:55 So my recommendation is try your hand at all these frameworks. We're using TensorForce; I like that one the best. Coach is one that I want to dive into, I haven't had time to take a look at it, and Baselines is more for you researchers out there rather than developers. That is reinforcement learning in a nutshell. It is goal-oriented machine learning, taking actions in an environment and being rewarded, and it teaches itself how to act in an environment, which is what differentiates it from supervised learning, where you teach it how to act; reinforcement learning teaches itself. We split it into model-free versus model-based reinforcement learning agents. Model-free agents are reactionary agents: you hit it with a hammer and it kicks, you put food in front of its mouth and it bites. Model-based RL algorithms model the environment around them in a sufficiently sophisticated way that they can use that model to plan actions, not just react but plan, and really it's the model-based deep RL that's getting us into the very depths of AI proper. We will be discussing model-free RL for the next few episodes. We'll break model-free RL down into policy gradient methods, like proximal policy optimization or PPO, and value-based methods, like deep Q-networks or DQNs. There are pros and cons to both approaches. And we discussed some technology, namely
00:37:55 subclassing an OpenAI Gym environment to create an environment that you can work in, or using one of their prefabs like CartPole, the hello world of RL environments, and then using those environments within a deep reinforcement learning framework like OpenAI Baselines, or reinforce.io's TensorForce, or Nervana Systems' Coach. Now let's talk about the resources. This is going to be a very heavy resources section, and the reason for this is that the next episodes are going to be coming out slowly. I'm new to reinforcement learning myself, so I'm going to be releasing episodes as I understand the concepts, so it's going to take some time, and I want you to be able to get a deep-dive head start. I want you to be able to read these resources now so you don't have to wait for the next episodes. So I'm going to dump all the deep reinforcement learning resources that we're going to be covering in this whole sequence of episodes right
00:38:55 here and now in the resources section. The first thing, like I mentioned, is to read the last chapter in the Hands-On Machine Learning book, which is a chapter on deep RL; it's a real quickie lay of the land, and it has you programming a very simple vanilla policy gradient method and deep Q-network. Next up, and I want you to consume these resources in sequential order, the next thing you should read is Reinforcement Learning: An Introduction by Sutton and Barto. This is the single most recommended resource on reinforcement learning out there; it is the base textbook for your introductory reinforcement learning university course, and they just released a free, completed second-edition draft PDF, which is a 2018 release, so it's really fresh. The original first edition was way back in 1998 or something, and this edition is brand spanking new, so you're going to get a lot of the latest and greatest. And it introduces reinforcement learning,
00:39:55 primarily model-free reinforcement learning. When you finish that book, move on to AI: A Modern Approach. This is the classic AI introduction textbook. When you embark on your machine learning master's or PhD, probably one of the first classes they have you take is going to be an introduction to artificial intelligence, and they'll assign this textbook, A Modern Approach. It introduces all the classical approaches to artificial intelligence, especially in the domain of searching and planning, and it's these searching and planning algorithms that you're going to package up and pop into your model-based reinforcement learning agents. So a combination of the Sutton and Barto book and the AI: A Modern Approach book will get you all the knowledge you need to go forward with model-based reinforcement learning. Next, it's time to move on to deep reinforcement learning and to combine those two together,
00:40:55 and the resource here is a Berkeley course, CS294: Deep Reinforcement Learning, and all the videos are available on YouTube. This is going to be a very heavy deep reinforcement learning course that combines model-free and model-based approaches and describes all the latest and greatest state-of-the-art research in RL, like the PPO algorithm. So those are the three primary resources: Sutton and Barto, AI: A Modern Approach, and CS294. There's also a popular video series, RL Course by David Silver, on YouTube, but here's how I'll recommend this one: me personally, I have converted it to audio and I put it on my iPod while I'm doing chores or commuting. I found the other three resources to be richer educational material, so I want you to save your vision time for those three resources and use the RL Course by David Silver as audio-time supplementary material while
00:41:55 you're at the gym or cleaning the house. That's it for the resources. Those resources will take you quite some time to consume, maybe the better part of a year, so this will keep you busy for a while, and I'm unlikely to be recommending other resources in the coming few episodes, so don't feel overwhelmed. Now, it may take me some time to release the next episode, because I'm going to want to really intuitively understand the technical details of this stuff so that I can boil it down for you. So just a heads up, it could be a bit of time, but that's it for the introduction to RL, and I'll see you next time.
