Convnets or CNNs. Filters, feature maps, window/stride/padding, max-pooling.

ocdevel.com/mlg/25 for notes and resources

00:00:01 Welcome to the OCDevel Machine Learning Guide podcast. I'm your host, Tyler Renelle. This series aims to teach you high-level fundamentals of machine learning from A to Z. Audio may be an inferior medium to task, but with all our exercise, commute, and chores hours of the day, not
00:00:17 having an audio supplementary education would be a missed opportunity. And where other resources provide you with the machine learning trees, I will provide the forest. Additionally, consider me your syllabus. At the end of every episode I'll provide the best-of-the-best resources curated from
00:00:32 around the web for you to learn each episode's details. This series is sequential, so if you haven't yet, start with episode one. This is episode twenty-five, Convolutional Neural Networks. Today we're gonna be talking about convnets, convolutional neural networks, or CNNs. Before we do
00:00:59 that, a little bit of admin. I've been told by a handful of listeners that they want to donate to the show, except that Patreon charges monthly and they only want to donate once. So I've created a one-time PayPal donate button, as well as posting my Bitcoin
00:01:14 wallet address for you crypto junkies, if anybody is willing to donate to the show. If you're not willing to donate, please do leave a review on iTunes; that brings more listeners to the show, which helps keep this alive and well. So, convolutional neural networks. For one reason or
00:01:30 another, convnets tend to be the most popular topic discussed by new and aspiring machine learning engineers. I don't know why specifically convnets are so popular. I mean, I understand that vision is essential, a key component to robots and AI and all that stuff, but no
00:01:47 less so than natural language processing by way of recurrent neural networks and the like. But anyway, convnets are super popular in the deep learning space. Convnets are the thing of vision in machine learning in the same way that recurrent neural networks are the thing of NLP, natural
00:02:04 language processing, as well as any sort of time series problems such as stock markets and weather prediction. Convnets are for images: image classification, image recognition, computer vision. And convnets to me are a really clear case of the machine learning hostile takeover of artificial intelligence. I've said this
00:02:24 in a prior episode, that I think that the crux of AI is ML, that ML is fast subsuming AI in a significant way, so much so that the terms are almost becoming synonymous. That's definitely the case with NLP. Machine learning came in and made a heavy dent with
00:02:41 recurrent neural networks on all of the various aspects of NLP. That's not to say that NLP was entirely conquered by machine learning, but that machine learning has contributed very heavily to the space. In the case of computer vision, I think we see that even more so.
00:02:56 Convnets really, truly dominate the space of computer vision, and so we're going to be talking about that today with respect to image classification, image recognition, and the like. Now, for those of you who have good memory, you'll recall from a prior episode that I was talking about facial recognition
00:03:12 and I was using a multilayer perceptron, right, the vanilla neural network, an MLP, as an example of an algorithm for image recognition. I said that the first hidden layer might be for detecting things like lines and edges, the second hidden layer for shapes and objects, and the third
00:03:30 hidden layer for things like eyes, ears, mouth, and nose, and then the final layer being a sigmoid function if you're just concerned with detecting whether or not there's a face in the picture, or a softmax if we're trying to classify it as tree, dog, cat, or human.
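
The two kinds of final layer mentioned here can be sketched in a few lines of plain Python. This is only an illustration: the raw scores are made-up numbers standing in for whatever the last hidden layer would produce, and the names `face_score` and `class_scores` are hypothetical.

```python
import math

def sigmoid(z):
    """Squash one raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of the last layer (made-up numbers).
face_score = 2.0
class_scores = {"tree": 0.5, "dog": 1.0, "cat": 0.2, "human": 3.0}

# One output neuron: "is there a face in the picture?"
p_face = sigmoid(face_score)

# Several output neurons: "which of these classes is it?"
probs = softmax(list(class_scores.values()))
best = max(zip(class_scores, probs), key=lambda pair: pair[1])[0]

print(f"P(face) = {p_face:.3f}")
print(f"most likely class: {best}")
```

The sigmoid answers a single yes/no question, while the softmax spreads one unit of probability across all the candidate classes.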
00:03:45 So I was using an MLP for an example of image classification. I lied to you, my dear listeners. Nobody uses MLPs for image classification; they use convnets. But an MLP sort of lends well to a pedagogical mental picture of the situation, and we encounter MLPs earlier on
00:04:05 in our machine learning learning, so I thought it made sense to give you a picture. But you don't use MLPs for images, you use convnets, and here's why. An MLP for image recognition is like using a bag of words for spam detection. Now you may be
00:04:21 thinking, hey, I thought that you said bag-of-words algorithms like naive Bayes work well for spam classification. You just take all the words in an email and you just cut them all up and just throw them in a bag, you shake up the bag, and you spill it
00:04:34 out on the table. And in the case of spam detection in natural language processing, maybe you're looking for the word Viagra. Okay, you're just kind of pushing all the words around and, oh, there it is, Viagra. Bam, this is spam. Easy peasy. Yes, bag of words works fantastically for
00:04:49 natural language processing in certain problems, but using a bag-of-words kind of idea in image classification doesn't make sense. What you would be doing is cutting the picture up into all of its pixels. Okay, if you have a five-by-five picture you'd have twenty-five pixels,
00:05:06 and then you throw all those pixels into a bag and you shake it up and you dump the pixels on the table, and now what? How the heck are you supposed to detect whether or not there's something that you're looking for in that picture? It's just a bag
00:05:19 of pixels. That's what an MLP would be giving you. An MLP, remember, is the other word for a regular neural network. DNN is another word for it, deep neural network, or an ANN, an artificial neural network. We're just going to be calling them MLPs
00:05:33 from now on. An MLP consists of dense layers, dense layers meaning that all of the neurons from the prior layer are connected to the next layer. So all of the pixels of the input are connected to the first hidden layer; all of the pixels are connected to every
00:05:51 neuron of the first hidden layer, so everything is combined with everything else. And then all of those neurons are connected to all of the neurons of the next hidden layer; everything is combined with everything else. It really is like a bag of words. You're just throwing all the pixels
00:06:05 in and you're combining them every which possible way. But that's not how pictures work. When you're looking for something specific in a picture, you're generally looking for a type of object regionally located in one little window, one square. Let's say that we're looking for Waldo. We're gonna be
00:06:22 using Where's Waldo as the example of this episode. We're looking for Waldo in a picture. Now, there's not going to be a little piece of Waldo in the left and a tiny piece of Waldo in the bottom right and maybe his hat in the centre of the picture
00:06:34 and his foot over here on the top right. That's not how it works. It's all going to be clumped together in one object, and that object can be anywhere in the picture. So that's why MLPs don't work for image classification. Instead we want a neural network that works
00:06:50 with patches, windows of pixels all at once, little chunks in the picture. And even within one window in a picture, a window that may be a box around Waldo in the picture, even within that window we still don't want to just combine every pixel in that window every which
00:07:10 way with each other. That still won't be very helpful for detecting whether Waldo is in this window. Instead we really want to look for a specific shape or a specific sort of color pattern in this window. And so what we're going to design is something called a filter.
00:07:26 A filter is the crux of convnets; it's the core component. What a filter is is an object detector. Imagine that you have a five-by-five piece of paper and you take scissors and you cut out the shape of Waldo in that piece of paper so that there's
00:07:44 a hole in the center of the piece of paper that's the shape of Waldo. And then you take that piece of paper and you put it on top of your picture, your fifty-by-fifty image. And now you take that piece of paper, your filter, and you use
00:07:57 your finger to slide it to the right. You slide it over the picture; you slide it from the top left all the way to the right, and then you bring it down one row and start back on the left, kind of like a typewriter, right? You type all the
00:08:08 way to the right and then enter, and the piece of paper goes up and you start the next line at the left. And then you start sliding your filter to the right again, and the moment that there's actually a Waldo in the picture it will be very
00:08:20 apparent to you, because he'll sort of fill in that cutout in the center of your piece of paper. Up until that point, until your piece of paper was over a Waldo, nothing sort of obviously filled that hole in the piece of paper that was cut out like
00:08:36 the shape of Waldo. Nothing very apparently filled it; it was all just a bunch of sort of pixel gibberish until you got over Waldo and he fit just so perfectly right into that cutout, and it made him pop, made him really stand out. So it's not really an
00:08:50 object detector. There is no activation function or output of this neuron that gives you a yes or a no, necessarily. Instead it's a thing that sort of makes the object pop, makes him stand out in the location where he is. So that's what a filter is. It's almost
00:09:08 like a separate image, a smaller image, maybe a five-by-five filter that you're going to be using to search for an object in a fifty-by-fifty image. And the filter is designed in a way that makes what it is you're looking for pop, makes it pop
00:09:24 out of the picture. Now, having your filter sort of be the shape of Waldo is a bit of an oversimplification. A filter usually doesn't work that way. A filter is usually a little bit more simplistic than that. In the case of Waldo detection, for example, one actual filter
00:09:40 we might use is going to be horizontal stripes, because Waldo's shirt has red horizontal stripes on it. So a simple filter that would make him pop out of the image is a filter that has these stripes on it horizontally. And what I mean by that is it's a
00:09:57 five-by-five filter, a five-by-five sort of picture square of pixels where every odd row is filled with ones and every even row is filled with zeros. That's kind of like the cutouts. What that does is when it is applied to a patch in
00:10:15 the picture, all of the even rows of that patch are disabled because they're multiplied by zero, all the pixels are multiplied by zero, and all of the odd rows of that patch are enabled because they're multiplied by one. And so when we hover over a Waldo we'll see
00:10:32 a bunch of red stripes pop out at us, but when we're hovering over anything else in the picture it sort of looks like striped nonsense. So a filter can't learn something quite so complex as the cutout of a human shape, but it can learn something simple enough
00:10:48 that could still give us a good insight as to what we're looking at, and then we would combine multiple filters together to really increase our confidence. So Waldo has glasses, he has a beanie, he has a red striped shirt, and he's something of a beanpole. He's a very skinny guy,
00:11:05 kind of occupies vertically a very small section in the centre of a window. So you might imagine designing four or five different filters, each of which is looking for different patterns in the patch of pixels, and all of them combined sort of making something pop out in the
00:11:23 picture that will give us confidence as to whether or not we're looking at an object we're looking for. So let's put these all together in real convolutional neural network terms. When I say a patch of pixels, what this is called is a window. A window is a
00:11:38 square chunk in the picture, and a filter is a filter. Again, we have a filter, or any number of filters, that we're going to be starting in the top left of our picture and sliding to the right, and any time it sort of hovers over something that
00:11:53 we're looking for, that thing kind of pops out through the filters and makes it stand out in the image. When I think about filters I like to imagine them as kind of like an old-school lens, sort of a cylinder that you hold in between your fingers,
00:12:06 and you know, you kind of put it on the picture on the top left and you look through it with your eye, close your other eye, and you're looking through the lens with your eye and you're sliding it to the right, and most of the time all you
00:12:18 see is sort of blur. But whenever you are over the thing you're looking for in the picture, then it becomes very clear. The Waldo becomes very clear, whereas the other windows are sort of blurry. Now, inside of that cylinder, inside of that lens, there are multiple filters. Imagine
00:12:35 you go to an eye doctor and he puts some lenses in front of your eyes and you're looking at some letters and he says, better or worse? He puts an additional layer of lenses in front of your eyes and you say better, and then he keeps that there,
00:12:46 and then he puts an additional layer of lenses in front of your eyes, a third lens in front of the other two, and says, better or worse? And you say worse. So he's trying to find sort of this right sequence of layers of lenses that really makes the letter
00:13:01 A on the board pop out at you, be very crystal clear. And that's what you're doing in designing these filters. You have a lens, a cylinder object, that has multiple layers of filters inside of it, and this is what the machine learning model is going to learn. It's going
00:13:17 to learn the design of each of these filters, each filter layer in your sort of cylindrical lens. Now, we have a filter, but that is not actually a layer. We're talking about deep learning here, neural networks. We didn't design a layer, a hidden layer in our neural network.
00:13:36 Instead we have a little tool in our tool belt, this lens, this filter. In order to construct the first hidden layer of our convolutional neural network, what's called a convolutional layer, what we're going to do is, like I said, we start at the top left with this
00:13:52 filter and we apply it to the picture all the way to the right. We slide all the way to the right, and then like a typewriter, ching, we start at the next row on the left and we slide it all the way to the right again. Ching, start at the
00:14:03 next row, slide all the way to the right again, until we've covered the entire picture and applied this filter throughout the picture. And what we have now is a new picture, an entirely new image, where all the Waldos in the image pop. All the Waldos are now
00:14:22 crystal clear and everything in between is blurry or gibberish-y, pixelly. This is called a feature map. A feature map. A filter is the tool we use for making an object pop in a picture, and a feature map is that picture transformed with the filter in every window;
00:14:44 every square of the picture is transformed with the filter, and now we have a new image, and it's called a feature map. Now, like I said, this lens that we're using to slide over the picture has multiple filters inside of it, multiple layers of filters, each filter trying
00:15:02 to detect a different type of pattern: stripes, glasses shapes, some tall thing in the center of the filter, et cetera. Multiple layers of filters. And so what is output in our convolutional layer, that next layer, is actually multiple feature maps, one feature map with each filter applied
00:15:22 to the picture. So what we have in our first hidden layer of our neural network is a 3D box of pixels. A 3D box of pixels: width by height, okay, and it's the same width and height as our original picture, except that instead of being the picture it's
00:15:41 our filter applied to the picture for every window, for every patch of pixels in the image. Width by height, and then depth. Depth is the number of filters. So each convolutional layer is a width-by-height feature map, a feature map being applying one filter to your
00:16:02 entire image, and depth being the number of filters you have. Okay, kind of confusing, so let's start from the top. We have a picture that comes in as your input in 2D, width by height. We don't flatten it. Now, what our neural network is going to
00:16:18 learn is filters. Filters are these masks that make certain patterns in a pixel patch, a window, pop, pop out of that window. This is what the convolutional neural network is gonna learn, these filters. And we're gonna have multiple of them. We're gonna have one filter for stripes, one
00:16:37 filter for glasses shapes, one filter for a skinny object in the centre of a window. And we're going to stack these on top of each other. We're going to take that stack of filters, we're going to apply it from left to right, top to bottom in our picture, and
00:16:51 that's going to output a new picture, width-by-height pixels, but the depth is the number of filters. So what we have is a box now. That's our first hidden layer, our convolutional layer, a layer of feature maps. If we want additional hidden layers of our neural
00:17:09 network, we would do this again. We will learn new filters and we will apply those new filters to that first hidden layer, because that first hidden layer is kind of a picture of its own. It may not be a picture that makes a lot of sense to humans, but
00:17:23 it will make sense to the machine learning algorithm. We will learn these new filters and we will apply them window by window by window to that first hidden layer, and what we will get out of it is a new picture, a new convolutional layer, which is width
00:17:37 by height pixels and feature maps depth, the third dimension being feature maps. And a feature map is when you apply your filter to every window of a picture. And that will be your second hidden layer, your second convolutional layer. And then finally, to sort of cap off
00:17:54 your convolutional neural network, what you'll usually do is then pipe the result of all that through dense layers. You've made certain patterns in your picture pop and stand out, and now you can sort of latch onto those with your dense layers to determine whether something is in
00:18:12 your image or not by piping that through a softmax or a logistic function or the like. Very good, that is a convolutional neural network. We're going to talk about the additional details like stride and padding, window sizes, and max pooling in a bit, but I just
00:18:26 want you to know that that's the essence of convolutional neural networks. Each layer is called a convolutional layer, and what a layer is is a stack of feature maps, and those feature maps come from applying a filter across your picture. And it's these filters that we learn;
00:18:43 it's the filters that the convnet learns in the backpropagation step. Now, oftentimes in deep learning part of the process is sort of boiling things down step by step as we go through the neural network. It kind of looks like a funnel: every layer of neurons
00:19:00 gets smaller and smaller and smaller until our final output is either one neuron in the case of logistic regression, or one of multiple neurons, I mean let's say ten or twenty, in the case of softmax regression. If we were to have, like, a layer of five
00:19:15 hundred twelve neurons, and then the next layer is five twelve, and then the next layer is five twelve, we're not actually boiling things down, we're just kind of mixing and matching, and then the last layer is that one sigmoid function, right? Well, first off, that final layer would
00:19:30 have way too much work to do. We would be depending on it too much to sort of boil down this universe of combined features into one point. We would be overworking this neuron. So it would be better if we could boil them down bit by bit by
00:19:44 bit until finally, when it's the last neuron's turn, he only has maybe twenty-eight employees that have to report to him. But in addition to that, part of the magic of neural networks is that they break things down hierarchically, so that they get smaller and smaller
00:19:59 as you go along. So in a picture, for example, if you started with a fifty-by-fifty picture, well, that would be twenty-five hundred pixels, twenty-five hundred units in your input layer. And ideally you would boil that down into, let's say, ten or twenty different
00:20:14 types of lines and objects, and then you would boil that down into eyes, ears, mouth, and nose, four objects, and then you'd boil that down to one. That's kind of the way that deep learning generally works, not always, but generally we like to go from very, very big
00:20:30 to very small, gradually, hierarchically. Now, the way I've been describing convolutional layers is that each feature map is the same size as the picture they're applied to. We take our filter and we move it window by window over the picture and what comes out
00:20:46 is a feature map, exact same size. And if we have multiple convolutional layers like this, then it doesn't feel like we're sort of boiling our picture down to its essence over time. So the way we do this, the way we boil images down into their essence step
00:21:01 by step, is by a combination of window size, stride, and padding. Okay: window, stride, and padding. Now, window we've already talked about. Window is the size of a patch of pixels that you're looking at at any one given time in your picture. So a window of five by five means
00:21:21 you're looking at twenty-five pixels at once. Stride is how much you move that window over at a time. If we had a stride of one, we would move that window over one pixel at a time, meaning that when our filter maps that to the feature map in
00:21:37 the convolutional layer there will be a lot of overlap between each window. If we had a stride of five, that would mean the window would skip completely to the next patch. So our filter would look at a five-by-five window and then it would slide over
00:21:53 five pixels, all the way past the last pixel seen in the first observation, so the filter is now looking at a new patch with no overlap with the prior patch. How do we reason about this, stride and window size combined? You always think about them in combination. Try
00:22:11 to think of them as some sort of ratio, like two over five, five being the window size and two being the stride size, or something like this. Window and stride always get considered together. In the previous example, where the picture gets mapped directly to a feature map and
00:22:26 they're the same dimensions, that's a stride of one. If we were to use a stride of five, what that would do is take your windows, your five-by-five windows, and boil them down into one pixel each. So you would take a five-by-five window and that
00:22:44 would turn into one pixel in the downstream feature map. If you had a stride of one, you would slide right one and that would turn into one pixel as well in the downstream feature map. Essentially we looked at two pixels in our original image and it has become
00:23:00 two pixels in our new image, in our feature map. So that didn't actually do any sort of compression, it just did transformation. If we wanted to compress the image into a smaller feature map with that bigger stride of five, what you would do is you take a window of five
00:23:16 by five, that would become one pixel in the feature map, and you'd move over five whole pixels and that new five-by-five window would become a new pixel in the feature map, and every window position in between would be left out. So all the pixels will still have been considered
00:23:32 because we didn't skip any pixels, but they'll have been boiled down substantially: twenty-five pixels will become one. That's how you do sort of compression in this process: you have a higher stride and a higher window size. Now, that's not always beneficial. Let's say, for example, that Waldo
00:23:52 sort of straddled in between those two windows. We have a window five by five and then we stride five, so the window now moves to an entirely new set of pixels, but Waldo's right in the middle there. Half of his body is on the right side of
00:24:08 the first window and the other half of his body is on the left side of the second window. The filters wouldn't pick up Waldo in either window. Okay, we've got the stripe detector filter, maybe that would ding ding ding, but what about that sort of skinny object in
00:24:23 the centre of the window filter? That filter's not going to make anything pop in the window. So even though a higher stride will give us good sort of compression or boiling down of our windows, it may result in poor detection of objects. So a good middle ground is
00:24:39 generally preferred, maybe a stride of two or a stride of three, so there's always a decent amount of overlap. So we'll see Waldo, because at some point he will be in the center of a window, and because the stride is greater than one these windows of five
00:24:55 by five will still be boiled down into smaller patches in the downstream feature map. So some combination of window size and stride is how you achieve boiling things down into smaller layers. And like I said, window and stride, they always go hand in hand. It took me
00:25:13 a while to understand convnets because there are so many terms. We're talking about filters and feature maps, convolutional layers, window, stride, padding, and max pooling. These are all terms we're going to talk about in this episode. So many terms. It helps when you realize that many of these
00:25:30 terms are combined with each other; they're different pieces of the same thing. So window and stride, they always go hand in hand. Feature map and filter are basically the same thing. A filter is a small section, your little paper cutout, your five by five, the size of
00:25:47 a window, and when you apply it to the whole image you get a feature map. So a feature map is an applied filter. So feature map and filter, they go hand in hand. All your feature maps stacked is a convolutional layer. So all those three things go hand in hand:
00:26:00 feature map, filter, convolutional layer. Okay, and then over here we have window and stride; those things kind of go hand in hand for image compression. And then the other thing that goes along with image compression is something called padding. Padding is very simple to understand. Padding is: we
00:26:14 have our five-by-five window and we're sliding it to the right. Okay, let's say our picture is not fifty by fifty but fifty-two by fifty-two, some number that's not divisible by five. Well, our window will slide all the way to the right until it gets
00:26:29 to those sort of last two pixels. Now we can decide one of two things to do. We can either stop there and move to the next row, or we can move our window five pixels to the right anyway. There's only two pixels left in the picture, so what
00:26:45 we'll do is we'll create three fake pixels, basically zeros, so that the remaining two pixels are considered. They're sort of in the left part of our window, and then the excess is just these fake pixels. And presumably the convnet will learn that this excess on the right side
00:27:03 of the picture can be ignored. We call it padding is same, padding equals same, when we include the fake pixels, and we call it valid when we exclude the excess pixels. I don't have a great way for remembering same versus valid. I always have to look it up, personally.
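
To make the window, stride, and padding discussion concrete, here is a minimal pure-Python sketch of sliding one filter over a picture. It rests on a few assumptions that are not from the episode: the function names `pad_amount` and `feature_map` are hypothetical, each window is reduced to one output pixel by multiplying it element-wise with the filter and summing (the usual convnet operation), and the fake zero pixels are added only on the right and bottom, as in the description above. Real libraries often split the padding between both sides.

```python
import math

def pad_amount(size, window, stride):
    """Zero pixels needed so stride-sized steps cover every real pixel."""
    steps = math.ceil(size / stride)
    return max((steps - 1) * stride + window - size, 0)

def feature_map(image, filt, stride, padding):
    """Slide `filt` over `image`, typewriter-style, and return the feature map."""
    window = len(filt)
    height, width = len(image), len(image[0])
    if padding == "same":
        # Append fake zero pixels on the right and bottom only.
        extra_h = pad_amount(height, window, stride)
        extra_w = pad_amount(width, window, stride)
        image = [row + [0.0] * extra_w for row in image]
        image += [[0.0] * (width + extra_w) for _ in range(extra_h)]
        height, width = height + extra_h, width + extra_w
    elif padding != "valid":
        raise ValueError(f"unknown padding: {padding}")
    out = []
    for top in range(0, height - window + 1, stride):
        out_row = []
        for left in range(0, width - window + 1, stride):
            # One window becomes one pixel: multiply by the filter and sum.
            total = 0.0
            for r in range(window):
                for c in range(window):
                    total += image[top + r][left + c] * filt[r][c]
            out_row.append(total)
        out.append(out_row)
    return out

# A 5-by-5 horizontal-stripes filter: alternating rows of ones and zeros.
stripes = [[1.0] * 5 if r % 2 == 0 else [0.0] * 5 for r in range(5)]

# A blank 52-by-52 "picture" in which every pixel is 1.0.
image = [[1.0] * 52 for _ in range(52)]

for mode in ("valid", "same"):
    fmap = feature_map(image, stripes, stride=5, padding=mode)
    print(f"{mode}: feature map is {len(fmap)} by {len(fmap[0])}")
```

With a window of five and a stride of five, valid drops the last two columns and rows and gives a 10-by-10 feature map, while same adds three fake pixels per side and gives 11 by 11. Calling the same function with `stride=1` and `padding="same"` gives back a 52-by-52 map, the "transformation without compression" case.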
00:27:21 So they're just two separate ways of handling the excess pixels. Now you might think same, that is, always including the excess pixels, seems like the smarter way to always go. Shouldn't we always include every pixel? Well, not necessarily. In a lot of pictures, sort of the borders of
00:27:39 the image are kind of cruft. I mean, we do cropping as a preprocessing step anyway many times. So in many cases excluding the small amount of border pixels is not a big loss. In other cases you do want to include every single pixel, especially in cases where
00:27:59 it's not actually image recognition we're working with. I will talk in a subsequent episode about how you can use convolutional neural networks for stock markets, stock market prices. You're not looking at an image whatsoever; totally different space than computer vision. Convnets and recurrent neural networks, you can
00:28:18 use these things sort of in very surprising domains. You have to think outside of the box. But in those cases where you're working with features that aren't really pixels, you want to include all those features in the process, and so padding equals same is the right way to
00:28:33 go. So it just depends on your situation. Okay, so we talked about window and stride and to some extent padding, those three being used as sort of an image compression technique, a way of boiling down your picture at one layer into a smaller convolutional layer, and then doing the
00:28:53 same thing to that convolutional layer to the next convolutional layer, until everything gets smaller and smaller and smaller, and then you hit your dense layers and you're working with a small amount of features. We talked about that as one way of doing image compression, and that's compression in
00:29:06 the machine learning sense. It's compressing features into a smaller feature that sort of represents all the other features. It's hierarchically boiling information down into smaller and smaller bits. There's another method of image compression in convolutional neural networks called max pooling, and this is sort of the
00:29:26 traditional sense of image compression, which is simply making an image smaller, not actually doing any sort of machine learning, just scaling it down. Lossy compression in the truest sense. Max pooling, or there are other types of pooling layers; we call them pooling layers. You can use max pooling
00:29:43 or you can use mean pooling. So let's talk about max pooling because that's the most common. What max pooling does is it takes your picture and it just makes it smaller; that is, it just compresses it down. Now, it's different than using a complex convolutional layer of
00:29:59 filters and stride and padding and blah blah. All it does is, let's say you're going to boil a two-by-two window into one pixel. Okay, so you're dividing it by four; you're making every patch of four pixels become one pixel. We're just downscaling it substantially. All you're doing
00:30:15 is you're taking that patch of four pixels, that window of four pixels, and you're taking the max pixel. That's it, the maximum pixel. By that I mean in a grayscale image every pixel is represented by a number between zero and one, where zero is black and one is white.
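
The two-by-two max pooling just described can be sketched in a few lines of plain Python. The function name `max_pool` and the 4-by-4 image values are made up for illustration; the patches do not overlap, so each block of four pixels becomes one.

```python
def max_pool(image, size=2):
    """Replace each non-overlapping size-by-size patch with its largest value."""
    pooled = []
    for top in range(0, len(image) - size + 1, size):
        row = []
        for left in range(0, len(image[0]) - size + 1, size):
            patch = [image[top + r][left + c]
                     for r in range(size) for c in range(size)]
            row.append(max(patch))  # keep the max pixel, discard the rest
        pooled.append(row)
    return pooled

# A made-up 4-by-4 grayscale image with values between 0 and 1.
image = [
    [0.1, 0.3, 0.0, 0.2],
    [0.4, 0.2, 0.9, 0.1],
    [0.5, 0.6, 0.3, 0.3],
    [0.0, 0.7, 0.2, 0.8],
]

pooled = max_pool(image)
print(pooled)  # [[0.4, 0.9], [0.7, 0.8]]
```

Sixteen pixels go in and four come out. Nothing is learned here: there are no filters or weights, only a fixed rule for shrinking the picture.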
00:30:33 So we take the maximum of those pixels and we just use that and throw away the other pixels. This is just true compression. This is compressing images like if you were trying to upload a photo to Facebook and it said your picture's too big. You know, you try
00:30:47 to upload a picture that's ten twenty-four by ten twenty-four, and pretend that Facebook says we only accept images one twenty-eight by one twenty-eight. Okay, they don't do the compression on their side; they expect you to have smaller images to upload to their website. Well,
00:31:01 what you might do is open up Preview or Photoshop or something and just click edit picture dimensions and just make it smaller. That's all that's going on with max pooling. It's just making a picture smaller, and it's doing so in a very destructive way. You know
00:31:15 from experience when you make pictures smaller or bigger it's lossy. If you make it smaller it's called lossy compression. It looks kind of pixelated; something looks a little bit off about it. If you squint your eyes you can tell that there was some damage done in the process,
00:31:29 but you have to squint your eyes. And that's the idea here with max pooling: you can apply lossy compression to your pictures to make them smaller without doing too much damage in the process. Now, why would you want to do this? We had the option of using
00:31:44a big stride and big window in a convolution yl layer for boiling a picture down sort of boiling the essence down We're not just throwing stuff away were boiling it down to its essence why would we want to use max pooling were used max pulling for a totally
00:32:00different reason. That reason is to save compute time. It turns out that convolutional neural networks are the most expensive neural networks in all the land, more than recurrent neural networks, more than MLPs, more than anything. Why? Well, we've taken an image that has width by height, and
00:32:21you're kind of multiplying those two as far as number of features is concerned, and you're piping that into a convolutional layer that has width by height as well as depth, and sometimes very, very deep depth, maybe 64 feature maps or 96 feature maps. And you might
00:32:39have ten, twenty hidden convolutional layers. When you start looking at the ImageNet competition convnet architectures, these things are massive. This is really where you see your GPU shine. If you're working with an MLP or recurrent neural network, you know, you'll probably see a five
00:33:01to ten x performance gain by using your GPU instead of your CPU. But when you're using convnets, you'll see your GPU utilization spike up to 99 percent, and you will be screaming fast running your convnets on your GPU by comparison to your CPU. Convnet architectures
00:33:21are really where your GPU performance shines, and not just in computational speed but the amount of memory that's used by your architecture. Your 1080 Ti, for example, has about eleven gigs of RAM separate from your system's RAM. Well, when you're doing a whole bunch of image
00:33:38processing you're going to be consuming a lot of that RAM. So convnets are heavy, very heavy beasts, and the easiest way to slim them down, to make them less heavy, is just image compression. Just make your images smaller, and that's what max pooling is for. Using a
00:33:55combination of stride and window size to boil your images down is something of a machine learning technique; that's boiling down the essence contained in the pixels of your image. But max pooling is just for making things smaller so that they'll run faster. And you can apply max
00:34:13pooling to your image directly, right after the first layer. You could also apply it after every convolutional layer, because each convolutional layer, while they may be becoming smaller in width and height, they're probably becoming deeper in depth of feature maps. So using max pooling will reduce the
00:34:32dimensionality of your process, making your convnet run faster. By the way, there's something I forgot to mention earlier in this episode. We think of a convolutional layer as width by height by depth, okay, width and height pixels, and depth being feature maps. Well, the input layer image also has
00:34:50depth: it is RGB, red, green, blue values. We call those channels. So your input image is width by height pixels and channels deep. Every one pixel will have three channels, being RGB values. So your input picture is also a box, and then every subsequent convolutional layer is a
00:35:13box. So really every layer is kind of a picture in its own right. Okay, and that's it. That's convolutional neural networks. They're not easy, but they're not complex, I'd say, architecturally. You just have to read a chapter on them and, you know, and maybe you'll have to
00:35:28read it twice to come to grips with what all the moving parts are here. But unlike something like natural language processing, where maybe understanding how a recurrent neural network works is fine and dandy, but to understand the lay of the land of NLP there's a whole lot
00:35:43of problems you have to solve, with convnets the one problem you're solving here is image recognition or object detection in an image. So there's not a whole lot you have to know. This isn't a three-part series like with NLP. But just to make sure you understand all
00:35:57the parts, I'm sorry I'm so redundant, we're going to start from the top and work our way here. You start with an image. It is a width by height pixels image. Incidentally it's also three channels deep, that is RGB values. You've got a box; that's your image. You
00:36:14pipe that into your convnet. Your first hidden layer is called a convolutional layer, and the way this layer functions is this: the convolutional layer has width by height pixels as well, and feature maps depth, any number of feature maps deep. These feature maps are derived by applying
00:36:36a filter, one filter per feature map. You apply this filter to the image: put it in the top left corner of the image and you slide it to the right, and it generates sort of a new image where the objects that that filter is designed to make pop,
00:36:54well, it's a new image where all those objects in the image pop. So in a Waldo-detecting filter, your feature map is going to be a new version of your original picture where everything is blurry except all the Waldos, who are very clear. So your filter is some
00:37:13small window of pixels that has something sort of out of it, and when you apply it to your whole image what you get out is a feature map. You do that with every filter in your convolutional layer and you get the feature maps of the convolutional layer.
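A rough sketch of sliding one filter across an image to get one feature map, in plain Python. The function name `apply_filter`, the hand-written `edge_filter`, and the tiny image are all made up for illustration; a real convnet learns the filter numbers and would use a library routine instead of nested loops. The sketch adds no padding, so window positions that would hang off the edge are skipped.

```python
def apply_filter(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the resulting feature map.

    Both arguments are lists of rows of numbers. No padding is added,
    so only windows that fit entirely inside the image are used.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    feature_map = []
    for top in range(0, len(image) - k_h + 1, stride):
        row = []
        for left in range(0, len(image[0]) - k_w + 1, stride):
            # Weighted sum of the window's pixels: large where the patch
            # looks like the pattern the filter encodes.
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hand-picked 2x2 filter that responds to a low-to-high vertical edge.
# In a real convnet these numbers are learned, not written by hand.
edge_filter = [[-1, 1],
               [-1, 1]]

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
feature_map = apply_filter(image, edge_filter)
print(feature_map)  # [[0, 2, 0], [0, 2, 0]]
```

The feature map is large only where the edge sits, which is the "pop" effect. Running this once per filter and stacking the results gives the depth of the layer.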
00:37:29Each convolutional layer is width by height pixels and feature maps deep. It is the filters that your neural network is trying to learn. You specify up front, actually, the number of filters you're going to be using, or the number of feature maps in your convolutional layer. You
00:37:47specify the width and height of your convolutional layer as well as the number of feature maps being used, the depth. It's the job of the neural network to learn the design of the filters, not the number of them. So that's the essence of it. The details are that
00:38:02the filter has a window size. The size of the filter is called the window, width and height. Maybe it's a five by five; it's some small square. The stride determines how much that window moves at any given time. If you're using a stride of one, then that window
00:38:19moves over one pixel at a time and the resulting feature map is roughly the same size as your original image. If your stride is five, then your window moves over an entire window at a time so that there's no pixel overlap, and each five-by-five window becomes one
00:38:40pixel in the feature map. In other words, your image gets compressed to a smaller image. Generally a good strategy is to use something in between, maybe a window of five by five and a stride of two, so there's some amount of overlap, which improves the likelihood of detecting objects,
00:38:59and yet some amount of skipping, which results in image compression. Additionally, a technical detail is what you should do when you've slid your window all the way to the right and there's excess pixels. Do you include them or do you skip them? If you skip them, we call
00:39:16this valid padding, and if you include them we call this same padding, and the way we make that work is by adding extra dummy pixels, zero pixels, so that our window of five will fit over five pixels, some of which are going to be dummy pixels. A combination
00:39:35of window size, padding and stride will result in image compression in each convolutional layer until your final layer, which is generally a good strategy. But another way to achieve image compression is called max pooling. Max pooling is lossy, simple image compression used primarily to save system resources.
00:39:58If your images are too big or your convolutional layers are too big and it's just hurting your RAM or your GPU performance, you'll use max pooling. Okay, so that's the general architecture of a convolutional neural network. Now, you probably
00:40:18won't have a great deal of success trying to freelance your way through designing a convnet. Understanding the general principles, how to design a convolutional layer, and then building an image detector with that, that'll probably give you, you know, a decent image detector, maybe ten percent error
00:40:37rate or twenty percent error rate. It turns out that the number of convolutional layers and where you put the max pooling layers and the window size and stride size and all those things, these are the hyperparameters, right? Selecting these things is choosing your
00:40:54hyperparameters. Convnet hyperparameter selection is very sensitive as far as error rate is concerned. If you want a very, very well tuned convnet, then you're going to spend a hell of a lot of time tuning your hyperparameters. And so one thing you can do instead, and
00:41:12especially if the image detector you're building is a classic type of image detector, you're actually trying to detect people and dogs and common objects in common photos, is use one of these off-the-shelf convnet architectures. So there's this competition called ILSVRC, the ImageNet
00:41:32challenge, and it's a challenge for people to be able to detect specific objects in a database of photos. And they hold this every year, and every year people come to this competition and they beat last year's convnet architecture with a new architecture, a new combination
00:41:52of max pooling and stride and window and number of feature maps and all those things. And so one of the classic ones is called LeNet-5, as in Yann LeCun, the "Le" for LeCun, and then a subsequent year's winner was called AlexNet, so that would
00:42:09have beaten LeNet, would have decreased the error rate. And then a subsequent one is GoogLeNet, and then a next one is Inception, and then the next one is ResNet, and so on and so on. You may have heard of these different net architectures. I heard
00:42:24these things floating around for the longest time, ResNet and AlexNet. I didn't realize that they're all convnets. None of them are RNNs, none of them are MLPs. So when you hear something-net, you're probably dealing with a convnet. And these convnets, these architectures, are
00:42:40enormous. These are some big, big convnets, very complex, with very sensitively tuned hyperparameters. So if you just want an image detector for some project you're doing, a robot with vision, you can use one of these off-the-shelf networks, and a good rule of thumb is just
00:42:58use the winner from the most recent year; use 2017's winner, for example. It will have defeated all the prior architectures. But if what you're building is in the domain of computer vision but is maybe a little bit less common than common object detection in common pictures, then
00:43:17what you could do is sort of study the architectures and see what makes good hyperparameters and good layer styles, and then use that to drive designing your own convnet. So what I described to you in this episode is the core components of designing a convnet architecture, but
00:43:35very likely if you actually plan on using convnets in the wild, especially for image recognition, you'll want to look at one of these prefab architectures that came out of the ImageNet challenge and probably use the most recent winner. Cool, cool, that's it for this episode. In the
00:43:55resources section I'm going to post a link to a YouTube recorded series of CS231n, a course by Stanford specifically on convnets, and of course the standard deep learning resources I've always been recommending. I'll post those in the show notes. And the Hands-On
00:44:12Machine Learning with Scikit-Learn and TensorFlow book that I have been recommending has a very good chapter on convnets as well. I'll see you in the next episode.

Transcribed by algorithms.