On this episode we invite Mark Harris, Chief Technologist at NVIDIA, to talk about programming for the GPU. Show notes: http://www.programmingthrowdown.com/2016/05/episode-54-programming-for-gpu.html


00:00:11 Episode 54: Programming for the GPU. Take it away, Patrick. We're here today with Mark Harris from NVIDIA. Mark, are you going to introduce yourself and tell us a little bit about what you do at NVIDIA? Okay, yes. My title at NVIDIA is Chief Technologist for GPU Computing Software, and my role is kind of twofold: one is inward facing and one is outward facing. My inward-facing role is to help define our software strategy for GPU computing at NVIDIA and to push for the best features in things like CUDA programming, which we'll talk about more, hopefully. The other aspect is the external-facing role, which is a little bit of evangelism: giving talks at conferences such as the GPU Technology Conference. I also run a blog called Parallel Forall, which is on NVIDIA's developer site.
00:01:11 We'll drop a link in the show notes after the recording. That blog is a developer blog written by developers, for developers, and it's a deeply technical blog about parallel programming and GPU programming.
00:01:26 So can you give a little bit of background on what NVIDIA is and what they do as a company? Most people have heard of them, though I guess you don't go to Best Buy anymore, you go to Newegg. But what is NVIDIA? NVIDIA is a visual computing company, and what that means is that we focus on building solutions for all aspects of visual computing. We call ourselves the inventors of the GPU, the graphics processing unit. I think NVIDIA coined that term back in 1999 with the first GeForce product, and most consumers, especially gamers, would be familiar
00:02:25 with our GeForce GPUs, which are graphics cards for making your games' graphics look amazing and run really fast. But GPUs are used in a variety of computations, and so we have four focus business areas: gaming, professional visualization, data center, and automotive. Visual computing and parallel computing are really important in all four of those, and it turns out that GPUs are very useful and good at accelerating those computations. The obvious ones are gaming, because the GPU was designed for computer graphics, and also professional visualization, but the area I work in is data center and parallel computing. So, you know, it turns out GPUs are great at
00:03:24 parallel computing, because graphics is parallel. I don't know how much detail you want me to go into about the history of that, but with graphics you're trying to do two things: you're trying to figure out where a bunch of triangles are in space, and then you're also trying to draw a bunch of pixels on the screen. In both cases it's kind of embarrassingly parallel: you have many triangles, and they can all be transformed and located independently, and you have many pixels that can all be processed more or less independently. So that makes GPUs kind of ideal for doing a lot of these things in parallel, right?
00:04:11 Absolutely, yeah. You have millions of pixels to shade in every frame, and you're running, you know, 60 frames a second or whatever. And triangles: modern games probably have thousands to millions of triangles per frame too, so you're getting to the point of having pixel-sized triangles. And yes, back in the day, back when I was in grad school, and actually way before that, people kind of recognized this about graphics hardware, and they started hacking around on using graphics APIs and GPUs to do computing that the GPUs weren't really necessarily designed for. This is something I focused on in grad school, and I called it GPGPU, which stands for general-purpose computation on GPUs.
00:05:11 And since then that's become something that isn't just grad students mucking around with graphics APIs. Back in 2006 or 2007 NVIDIA launched CUDA, which is a set of extensions to C and C++ that allow you to program GPUs for parallel computing in a traditional programming language rather than using a graphics API. When people first started doing the GPGPU stuff, and you might be able to fill me in where I'm misremembering or incorrect, Mark, at first you were writing essentially a shader for the GPU: you had to frame whatever problem you were doing in terms of telling the graphics pipeline what to do. Then CUDA came out. What did it do, what was that process like, and what did NVIDIA really see that made
00:06:11 them say, hey, there's something here?
00:06:14 Yeah. So my personal story for this was that I was an intern at NVIDIA in 2001, and that's when I sort of learned about this. My PhD was on cloud simulation and rendering, and this was before everyone was talking about "the cloud," so I mean real clouds, the ones made of moisture. I had learned from some of the engineers there about some of the things they'd done with shaders: one of the guys wrote some shallow-water-equation solvers, and the Game of Life, running in pixel shaders. And this was before what they called Shader Model 2.0 or something like that, so there was no floating point on these GPUs at that point, so you
00:07:14 had to kind of pack everything into fixed point; you had basically 10-bit precision in the pixel shaders, and that was fun. But I got the idea, and I learned that NVIDIA was going to be coming out with GPUs with floating-point pixel shaders in the near future. So I went back to grad school and I thought, well, what if I did all of the simulation of the clouds on the GPU in addition to the rendering? So I basically started doing fluid simulation on the GPU, and I wasn't alone: a bunch of other grad students and researchers were doing stuff in similar areas, and there was ray tracing going on in shaders. I was also into the GPGPU stuff very early, and there was this thing I could download that would do edge detection, so yeah,
00:08:14 whatever it was, I rendered the edges and thought, this is pretty cool. But it was so hard. It took like a week to figure out how to get it to compile, and then if you don't have the right GPU you have to go out and buy another one. It could have been made so much easier, right? So I think, while I was back in grad school and not at NVIDIA, before I came on full-time, obviously people at NVIDIA really saw this opportunity, and I believe it was Jensen, our CEO, ultimately who was behind it. By the time I got back in 2003 there was already an effort to build NV50, which became G80, which is the first CUDA-capable GPU. It was the first GPU with a dedicated computing mode, with byte-addressable memory, so random access to memory from the shader units instead of just pixel shaders.
00:09:14 You're right, it was hard, and it was kind of fun. You felt like a hacker getting this stuff to work, and when you got something to work it kind of had this feeling of magic, right? Oh wow, my fluid simulation actually works, or this reaction-diffusion simulation that I wrote, which really needs way more precision than I have in these fixed-point pixel shaders, actually works. It felt like magic, which is really not a sustainable feeling in software development. So what NVIDIA did was to build hardware that was dedicated to computing as well as graphics, and then build software on top of that. And we saw early on, from talking to potential customers, that we would have to build something using languages they were familiar with.
00:10:14 When we went around to the customers, we were talking to people in areas ranging from defense to oil and gas to fluid simulation, companies like Cadence and Fluent (I guess that was the company at the time; it's ANSYS now), and they all said, well, it's got to be either Fortran or it's got to be C or C++. We were kind of afraid of Fortran at the time, so we decided we had to build something based on C, and that became CUDA. It's basically C with some extensions, and it took away that magic feeling, you know: it did what you thought it should do, rather than "maybe if I hack this this way I'll get it to work." And then the CUDA-capable GPUs saw widespread adoption. NVIDIA's investment in putting in the extra
00:11:14 hardware, building a compiler toolchain, and all that: did it take a while to really pay off? Well, it certainly has paid off, but it also took a while. There was a lot of initial interest, and adoption started immediately. People were using CUDA 1.0, and I still talk to customers who are like, yeah, I've been using CUDA since it first came out in the beta or whatever, so they started building software with it right away. But to really call it successful, and actually see real applications that people could go and buy or download that were accelerated by GPUs, probably took a couple of years.
00:11:58 And now, at this point, I couldn't say CUDA is mainstream, but it's definitely something that real products and labs and researchers all use. And GPUs are running some amazing graphics. Is it the case that there are also CUDA programs doing general processing in the same pipeline, or is it typically that you run some specific scientific application that would use it? So, early on we made efforts to get CUDA into some games, and there are some ways that CUDA is used in games. For example, NVIDIA has a physics simulation library for games called PhysX, and it uses CUDA for cloth simulation, particle simulation,
00:12:58 rigid-body simulation, things like that. But most games that are doing computing (and a lot of games do do general-purpose computing) use compute shaders within the graphics API. After CUDA came out, DirectX and OpenGL both introduced their own flavors of compute shaders that are basically able to do similar things to CUDA programs, but within the graphics API, so they don't have to juggle two different APIs. But they have largely the same programming model. Within the kernel (a kernel is what we call a parallel region of your program), whether it's in the graphics API's compute shader language or in CUDA C++, the programming model is basically the same, with a few minor differences.
00:13:54 That seems like the perfect transition to go into: obviously this is audio only, but can you describe what the programming model is for writing these programs? So if you want to write a program for a GPU, you want to take advantage of all of the parallelism. GPUs now have thousands of parallel cores, and if you're coming from a graphics background you can think of these as pixel shader cores, but really they're unified cores that do everything from transforming vertices for the vertex shader, to shading pixels, to just running compute instructions. The way you can think about it is that you look in your program for regions that have parallelism, and what that means is you typically have loops where the iterations of the loop are not dependent on each other,
00:14:54 right, so they could be run at the same time. You can think of flattening out that loop and then running each iteration at the same time, or many of the iterations at the same time, on separate processors. That's what CUDA lets you do: you write a kernel program where, within that function, the whole kernel is being executed by many threads simultaneously. So the code is single-thread code, but it's running in parallel across many threads. And when someone buys a computer, they buy, you know, a quad-core computer, or they buy an i7 that has six cores, right? And you're saying the GPU has a thousand cores. So, to ask a very naive question: why not use a GPU for everything? It has a thousand cores, my CPU has four cores, so why would I ever use the CPU?
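The kernel model Mark describes (flattening an independent loop so that each iteration becomes its own thread) can be sketched in plain Python. This is only an illustration of the idea, not real CUDA; the function names and the serial `launch` stand-in are made up for the example.

```python
# Conceptual sketch (not real CUDA) of turning an independent loop
# into a per-thread kernel.

def saxpy_loop(a, x, y):
    # Serial version: the iterations are independent of each other.
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

def saxpy_kernel(i, a, x, y, out):
    # CUDA-style kernel body: single-thread code for one iteration.
    # On a GPU, 'i' would come from the thread and block indices.
    out[i] = a * x[i] + y[i]

def launch(kernel, n, *args):
    # Stand-in for a kernel launch: every "thread" i runs the same body.
    # A real GPU runs these simultaneously across thousands of cores.
    for i in range(n):
        kernel(i, *args)

x, y = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
out = [0.0] * 3
launch(saxpy_kernel, 3, 2.0, x, y, out)
print(out)  # → [12.0, 24.0, 36.0]
```

The kernel body contains no loop over the data; the parallelism lives in the launch, which is exactly the mental shift Mark describes.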
00:15:54 Well, I will say that the GPU is becoming increasingly important. In fact, I was just looking at a die shot of a Broadwell CPU with a GPU on it (not a Xeon, the Xeons mostly don't have a GPU, but a regular part like an i5 or something like that, with four cores), and the four CPU cores take up maybe a quarter of the die area or something like that. But why not use a GPU for everything? Well, the cores are different, as you're hinting at. We call them CUDA cores, but that's sort of a marketing name; really they are individual processing elements that process instructions. But they have a parallel execution model that's called SIMT. You may have heard of
00:16:54 SIMD, which stands for single instruction, multiple data. What that means is that you have a single instruction, but it gets executed on multiple data elements simultaneously. So you can think of that as having a vector of data elements, and you apply the same instruction to all the elements in that vector simultaneously. Yeah, like SSE, that kind of thing? That's right, and there you would basically have a bunch of ALUs. SIMT was an NVIDIA-coined term, but it's been used more broadly since then, I think. It stands for single instruction, multiple threads. So now, instead of just having a vector of data elements, you actually have multiple threads that execute the same instruction, and the important difference here is that each of these threads has its own program counter,
00:17:54 which means that they can branch to different instructions separately, whereas with SIMD the branch has to be wrapped around the whole vector, effectively. So if you need to make a decision, it has to be at the granularity of your vector size, right? If you need to make a decision in CUDA, or in SIMT, then it can happen at single-thread granularity. Of course, there's a cost to that, because although we have all these little CUDA cores, they do share instruction fetch and decode logic, and so you may end up with the overhead of replays or predication of instructions. I'm getting pretty technical, but that's kind of the difference between GPU cores and vector units or SSE units. And the real difference, in terms of your original question, is that the cores on a GPU are very lightweight cores:
00:18:54 they don't have very good single-thread performance; they really get their performance in aggregate, from running many threads in parallel, usually doing the same thing, possibly branching and diverging some, but usually doing the same thing. CPU cores, on the other hand, have a lot of things like branch prediction and big caches. They're optimized for latency; in other words, they're optimized to reduce latency, which means that if you only have one thing to do, you can do it really fast. On the GPU, if you only have one thing to do, you're leaving, you know, 999 cores idle, right?
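A toy model of the warp-divergence cost described above: because threads in a warp share fetch and decode logic, both sides of a branch run one after the other when the threads disagree. The 4-cycle cost per path is an invented number purely for illustration.

```python
# Toy model (illustration only, not a hardware simulation) of SIMT
# divergence: a warp that splits across a branch executes each taken
# path serially, with threads masked off on the path they didn't take.

WARP = 32

def warp_cycles(branch_taken):
    # branch_taken: one boolean per thread in the warp.
    # Assume each branch side costs 4 cycles when any thread needs it.
    paths = set(branch_taken)
    return 4 * len(paths)  # 4 if uniform, 8 if the warp diverges

uniform = [True] * WARP                        # all threads, same path
diverged = [i % 2 == 0 for i in range(WARP)]   # threads alternate paths

print(warp_cycles(uniform))   # → 4: one pass through one path
print(warp_cycles(diverged))  # → 8: both paths run serially
```

This is the cost Mark alludes to with "replays or predication": divergence doesn't break correctness, it just serializes the paths within the warp.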
00:19:37 The way we talk about it is that GPUs are optimized for throughput and CPUs are optimized for latency. There's a big gray area there, because CPUs have AVX and they can do things in parallel too; it's just that the scale of parallelism is lower on a CPU versus a GPU. Being optimized for throughput means that instead of trying to reduce latency, we try to hide it. We always talk about latency hiding, and GPUs are really good at hiding latency by executing other work while we're waiting. So if there's a memory access that we have to wait for, for the data to come into the cache from off chip, then we do work in other groups of threads, possibly even the same instruction in other groups of threads, so that we always have other instructions to issue to hide that latency.
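The latency-hiding idea can be put in rough numbers. This toy model uses invented cycle counts, not real hardware figures, to show why having many warps in flight keeps the GPU busy while loads are outstanding.

```python
# Toy model (made-up numbers, illustration only) of latency hiding:
# while one warp waits on a memory load, the scheduler issues others.

MEM_LATENCY = 10   # cycles for a load to come back (invented)
COMPUTE = 2        # cycles of arithmetic each warp does per load

def total_cycles(num_warps):
    # One warp alone stalls for the full memory latency.
    if num_warps == 1:
        return MEM_LATENCY + COMPUTE
    # With enough warps, compute from other warps fills the wait:
    # whichever is longer, the latency or the aggregate compute,
    # sets the pace, plus the last warp's compute to drain.
    busy = num_warps * COMPUTE
    return max(MEM_LATENCY, busy) + COMPUTE

print(total_cycles(1))  # → 12 cycles for one warp's worth of work
print(total_cycles(8))  # → 18 cycles for 8 warps' worth (vs 8*12=96 serially)
```

The point of the model is the ratio: eight times the work finishes in well under twice the time, because the loads overlap with other warps' compute.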
00:20:37 Is that like, once the triangles have all been positioned, it says, okay, this part of the screen is good to go, let's start rendering pixels, and meanwhile the next frame of triangles is already trying to get pushed to the screen at the same time? Is that the kind of pipelining you're talking about?
00:20:53 Yeah, there is pipelining involved, but it's also just about having lots of work in flight. If you think about your pixels, you have some large triangle that covers, say, hundreds of pixels, and they're all shaded with the same pixel shader. That pixel shader has to go and compute: it has to fetch from textures, it has to blend colors, it can do arbitrary computation now. Those pixels are grouped together into groups in the hardware, and in CUDA we refer to these as warps. The term comes from weaving, actually, because you had parallel threads in weaving, and so there are warps. A warp is a group of 32 threads, or 32 pixels, and while one warp is waiting on a texture fetch, for example, or a memory load, then we can switch to
00:21:53 instructions from another warp within what we call a multiprocessor, and the multiprocessor can issue instructions from multiple warps simultaneously, or while one warp is waiting. Is that similar to hyperthreading? The way I understand hyperthreading is that on your CPU you have your floating-point unit, you have something that does integer arithmetic, many of these little mini-modules, and you can sort of fake out having two or more threads: if one of them needs the floating-point unit and the other one needs the integer unit at the same time, it's as if you're executing them both in parallel.
00:22:47 Right, and on the CPU, I believe, what that ultimately requires is duplicating resources like the register file. On the CPU the register file is relatively small; at least on x86 CPUs the register files are fairly small. On a GPU the register file is quite big. It's almost the size of a small cache, except that it's a register file, so it's directly accessible by instructions. And on the GPU, the cores I talked about are grouped together in things called multiprocessors. So, for example, a multiprocessor on Pascal, the latest architecture we just announced, has 64 CUDA cores, and it has, I think, a 128 or 256 kilobyte register file on that SM.
00:23:47 Wow, that's a lot of registers. So in a multiprocessor, you said each core needs to be at a similar instruction; basically you want them to be executing the same thing as much as possible. This is really interesting, because really understanding how your program gets executed helps you design really good software, at least for efficiency. So what is it that's actually different, that causes you not to be able to get too far off? You have a multiprocessor, you have all these cores in it, and what is it that actually, like you said, keeps them in lock-step on some things and not others? What actually prevents you from having code running very different parts of the program?
00:24:26 So what's preventing you from having that? Well, each thread basically isn't being run on a full-blown core that has its own separate instruction fetch and decode and issue logic, right? That logic is shared across basically 32 cores, and so we group threads into groups of 32; that's what we call a warp. I'm not sure if I'm answering your question. You said the instructions are executed in a batch, and you want all of the cores ultimately executing that same thing, so if they get too far off, if some need instructions that another processor doesn't yet need, when you get out of sync, you'd have to add extra hardware to handle that. That gets you close.
00:25:26 When you write code for a GPU, you want to be aware of kind of the branchiness of that code. So if you have a loop over a lot of data where each iteration through the loop is checking conditions, and what it does is really data dependent, if every iteration is completely data dependent in what it does, then performance may potentially suffer. But if you can do some work ahead of time to maybe reorder your data, sorting or binning or something like that, so that threads that are contiguous in terms of their thread ID are accessing memory that's contiguous, and also making decisions that are contiguous, then you're going to get much better performance. So if you do operation A on the odd-numbered indexes of the array and you do operation B
00:26:26 on the even ones, instead of running linearly through the array you would want to maybe process all the even indexes first and then all the odd ones? In that case, with even and odd, I would actually just use math on the indices instead of saying "if even, do this; if odd, do that," so that you change what the threads are doing rather than which threads are doing it. If you have all of these threads, all doing the same thing but on different pieces of the data, how do you debug this? I imagine you don't step through a debugger like you do with GDB, going line by line; that would probably be bad. Well, we do have tools, and very good tools now, in fact.
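The even/odd example from the discussion, sketched in Python (an illustration of the idea, not CUDA): branching on the index inside the kernel makes adjacent threads diverge, while reorganizing the work so each pass does a single operation keeps every thread on the same path.

```python
# Sketch of the even/odd example: avoid branching on the index inside
# a "kernel" (which would split a warp) by using index math instead.

def divergent_kernel(i, data, out):
    # Branchy version: adjacent threads take different paths.
    if i % 2 == 1:
        out[i] = data[i] * 2      # operation A on odd indices
    else:
        out[i] = data[i] + 100    # operation B on even indices

def uniform_kernels(n, data, out):
    # Reorganized: run the odd and even halves as separate passes, so
    # every "thread" in each pass executes the same operation.
    for j in range(1, n, 2):      # all odd indices: operation A
        out[j] = data[j] * 2
    for j in range(0, n, 2):      # all even indices: operation B
        out[j] = data[j] + 100

data = [1, 2, 3, 4]
a = [0] * 4
for i in range(4):
    divergent_kernel(i, data, a)
b = [0] * 4
uniform_kernels(4, data, b)
print(a == b)  # → True: same result, but no per-thread branch
```

Mark's point is the second form: change what the threads compute via index arithmetic, rather than which threads take which branch.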
00:27:26 Early on we had a debugger and a profiler, but they were very basic. Today you can step through the instructions, and people do. When you're really trying to figure out a difficult bug, just like on a CPU, it really helps to have a debugger that lets you step in and inspect memory locations and variable values and things like that, and you can do that. But there are a couple of different modes in which you can step. We have a couple of tools: on the Linux side we have cuda-gdb, which is basically a modified GDB that supports debugging CUDA programs, and on Windows we have something called Nsight Visual Studio Edition, which is a plug-in for Visual Studio
00:28:22 that is used for debugging and profiling GPU programs inside the IDE. Nsight also has graphics profiling and debugging features. There's also an Eclipse-based IDE called Nsight Eclipse Edition, for Linux and Mac, that wraps that stuff as well as the profiler. Anyway, if you're stepping through a program running on the GPU (I talked about warps), one way to step is to actually look at one thread and step each instruction, or each line of code, for that thread. Another way is to step a warp, or to step all the threads in what's called a thread block, which is a CUDA construct. There are different reasons you might want each: you might want to actually look at the values held in variables for a number of threads
00:29:22 at once, and you can do that in the debugger, so you're kind of doing parallel debugging. Or you might want to just focus on one thread to try to understand the logic a bit better. And in Nsight Visual Studio Edition at least (I'm not sure about cuda-gdb, but probably there too) you can toggle which way you want the debugger to step. The hardware is always going to run things a warp at a time, but it can let you focus in on the values of a single thread or warp if you want. And the other warps are just kind of frozen? Or I guess they could be running; it doesn't matter, because they're not dependent on each other. Well, if you hit a breakpoint, you do need to freeze the program, and that actually requires hardware support, and it's something that we've gradually improved. It used to be that you
00:30:19 couldn't have a display attached to the GPU you were debugging, and if you think about it, the GPU has modes, a graphics mode and a compute mode, and if you freeze it in compute mode then it can't service the display, which means if you're running Windows, suddenly Windows freezes, right? So previously you had to have a second GPU in order to debug. Now we can do single-GPU debugging, and with Pascal, which we just announced at GTC in March, we have compute preemption. It's just what it sounds like: as with traditional preemption, you can basically store the state of the program, kick it out, and switch to something else, some other application. That allows the debugger to step through programs and hit breakpoints while making sure the operating system stays interactive, on a single-
00:31:19 GPU system.
00:31:21 So, one of the things: obviously your perspective is NVIDIA and CUDA, but people will know, and you mentioned before, looking at the die shot of some of the Broadwell chips, that there are GPUs on the same die as the CPU. What are some advantages and disadvantages? Can you speak to the differences between a GPU integrated with the processor versus a discrete GPU?
00:31:45 Sure. So, yeah, the majority of the products that NVIDIA sells, the GPUs that we sell, are discrete GPUs. In other words, they're on a board that plugs into a PCI Express socket, and they're separate from the CPU. And so,
00:32:08 just a little bit on that one: when you're writing a program that uses the GPU, for example in CUDA, you're running a heterogeneous program. The program still needs to use the CPU, right? Most programs have at least control flow on the CPU, if not significant computation there also. And so you have to take care: if the GPU has separate memory, there are transfers that have to happen between the GPU and the CPU. I can come back to that later (and remind me to talk about unified memory). But there are also processors that are integrated, as you mentioned. NVIDIA has a line of processors called Tegra, which are systems-on-a-chip, and as I mentioned, Broadwell Core CPUs have Iris graphics on board, so they have a GPU integrated with the CPU on die. These are
00:33:08 similar in some ways; the system-on-a-chip approach is a bit broader. Tegra basically has a few ARM cores, a GPU, and also a bunch of the other things you need to build a whole small system. Tegra is used in things like tablet computers (the Google Pixel C, I think, has Tegra in it), and it's also used in something we call Jetson, which is an embedded development kit aimed at people who are developing things like robots, drones, and other embedded systems. So, to your question about the difference between these and the trade-offs: if you have a certain die size and you can dedicate it all to the GPU, obviously you're going to have a more powerful GPU, but if you have to split it between the GPU and the
00:34:08 CPU and other stuff, then the computational capability of each of those things goes down. So it's a balancing act: what do you want to do? If you want to do high-end supercomputing (NVIDIA Tesla GPUs are used in supercomputers like Titan at Oak Ridge National Laboratory), then the one-chip approach probably isn't the right way for you, because you need the most powerful GPU with the highest memory bandwidth and the highest computational throughput. If you want to build a robot, where you need CPUs and GPUs and sensors and data inputs and all this kind of stuff, then an integrated processor that's really low power obviously makes a lot of sense. So we build things for the whole spectrum, from very low-power embedded, to places where
00:35:08 we need power efficiency but the actual total system power is not as much of an issue. Transferring data over, let's say, PCI Express: obviously past a certain data size it makes sense, like the supercomputing case I just mentioned. But for a programmer, is there a way to guide someone and say, hey, you could do this on the CPU, or you could take the time to transmit to the GPU and then transmit back? How do they build that threshold in their mind about which one to do? Or is there even a way to write a single program, and then at runtime or at compile time it determines, hey, based on this data size, we're going to execute this on one versus the other?
00:35:57 Yeah, okay, there's a lot in that. All right, so
00:36:03 I'll go back to what I was talking about. If you have a system with a GPU and a CPU that are separate, and they have separate memories, then up until CUDA 6, which we launched a couple of years ago, you always had to explicitly manage all memory. So, as you were saying: let's say the data comes from a file. Your CPU loads that data from the file into CPU memory; you then have to allocate GPU memory for that data and do an explicit copy between the CPU memory and the GPU memory. CUDA has an API for that, cudaMemcpy, which works just like memcpy except it allows you to copy from the CPU to the GPU or in the other direction, from one memory pointer to another. And there's a cost to that, because PCI Express has a certain bandwidth, so given the data size and the bandwidth of PCI Express, you can estimate
00:37:03 how long that's going to take. So if you have a huge amount of data and a small amount of computation, if you're only going to do a small amount of computation on the GPU before you need to do something on the CPU, like send it over the network or write it back to disk or whatever, then the overhead of transferring it might be higher than the computation cost of actually performing the task on the GPU. So there are trade-offs, as you intuited: you need to decide whether it's worthwhile, on the current hardware, to transfer data to the GPU for processing.
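The back-of-the-envelope decision Mark describes can be written down directly. The PCIe bandwidth figure and the timings below are illustrative assumptions, not measurements of any particular system.

```python
# Back-of-envelope estimate of the transfer-vs-compute tradeoff.
# The 12 GB/s effective PCIe bandwidth is an assumption for illustration.

def transfer_seconds(nbytes, pcie_gb_per_s=12.0):
    # Time to move data over PCIe at an assumed effective bandwidth.
    return nbytes / (pcie_gb_per_s * 1e9)

def worth_offloading(nbytes, gpu_compute_s, cpu_compute_s):
    # Offload only if GPU compute plus the round trip beats the CPU.
    round_trip = 2 * transfer_seconds(nbytes)  # to the GPU and back
    return gpu_compute_s + round_trip < cpu_compute_s

one_gb = 1_000_000_000
# Heavy computation: the GPU wins even with the transfer overhead.
print(worth_offloading(one_gb, gpu_compute_s=0.05, cpu_compute_s=1.0))   # → True
# Trivial computation: the PCIe round trip dominates; stay on the CPU.
print(worth_offloading(one_gb, gpu_compute_s=0.001, cpu_compute_s=0.1))  # → False
```

The threshold the hosts ask about falls out of exactly this arithmetic: compare the round-trip transfer time against the compute time saved.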
00:37:46 There are many applications where it's obviously beneficial, but there are some applications where that trade-off is trickier. And so there are a lot of things you can do, like trying to overlap the communication with computation, pipelining. We have facilities in CUDA for streaming: basically you can associate computations and copies with separate streams that they get issued in, so that if they're independent, they can be overlapped. What you could do is chunk your data up so that you transfer a little bit of it, start processing on it, and then do another transfer on a separate stream simultaneously, things like that. But yeah, there is a bit of a balancing act there, and it depends on the application; sometimes it's trickier than others. I think if you have Tegra, then you're sharing the memory, so then, I guess, that copy doesn't happen there? Something else must happen, or something?
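The chunk-and-overlap streaming idea can be modeled with simple arithmetic. The times below are invented purely for illustration; with ideal overlap, only the first chunk's transfer is left exposed.

```python
# Toy timing model (illustration only) of chunking plus streams:
# split the data so transfers and compute can overlap.

def no_overlap_time(transfer_s, compute_s):
    # One big transfer, then one big compute: the costs add up.
    return transfer_s + compute_s

def overlapped_time(transfer_s, compute_s, chunks):
    # Chunk i+1 transfers while chunk i computes (ideal overlap).
    t_chunk = transfer_s / chunks
    c_chunk = compute_s / chunks
    # The first chunk's transfer can't be hidden; after that, the
    # slower stage sets the pace; the last chunk's compute drains.
    return t_chunk + max(t_chunk, c_chunk) * (chunks - 1) + c_chunk

print(no_overlap_time(8.0, 8.0))     # → 16.0
print(overlapped_time(8.0, 8.0, 4))  # → 10.0
```

With more chunks the exposed transfer shrinks further, which is why CUDA streams plus chunked copies are the standard way to soften the PCIe cost.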
00:38:46 Oh yes, on Tegra you have one memory, so it's shared between all the processors on Tegra. So you can allocate a pointer and then just share it between CPU and GPU code. There are a couple of gotchas on current Tegras, I believe: on the Tegra K1 (I'm not sure if it's true on the Tegra X1) the caches were not coherent between the CPU and GPU, and so sometimes what seems like it should be free actually has a cost, because of having to invalidate caches. So, there's a wide variety of people who want to leverage GPUs. There's people like Patrick, who builds robots and underwater
vehicles in his garage, and I once worked at a company — one Patrick also worked for — where they kicked me off the project almost immediately; I had no idea what I was doing. So I'm more of, like, a MATLAB or Python person. How does the CUDA ecosystem cater to all of these different people with different backgrounds? Apparently they don't do Fortran, so those people are SOL, but for everybody else —
00:40:28 Sorry, what was the last part? You told us earlier that it doesn't support Fortran, so the Fortran people are SOL. Ah, but they're not! That's a great question. So, you said the word ecosystem, and we do talk about the ecosystem at NVIDIA a lot — a lot of companies do — but when you're building a platform you care about the ecosystem. And you're right: there are a lot of programmers and there are many programming languages, and we would like to enable them all. Anybody that has a parallel program or a lot of data to process, anybody that needs high bandwidth and throughput — we would like them to be using GPUs. And so we try to enable as many ways of programming GPUs as possible, to cater to those different needs. So when we talk about the CUDA platform,
00:41:22 we talk about three ways of programming. There are directives, which are basically hints to the compiler that you can add to loops in C or Fortran that allow the compiler to try to automatically parallelize them. So the OpenMP standard enables you to specify "this loop is parallel — please parallelize it for me." That started on CPUs, and there's work ongoing in OpenMP to support accelerators like GPUs, and we're involved in that. There's also another standard called OpenACC, which is another way to program, and there are compilers for Fortran as well as
00:42:13 C and C++ for OpenACC. So the second way is with libraries. If you have a fairly standard computation, or if you use an industry-standard library for those computations, there's a good chance that there's already a drop-in replacement that targets GPUs. For example, there's a popular linear algebra library — it's actually just a standard that many libraries implement — called BLAS, the Basic Linear Algebra Subroutines, and there's a cuBLAS that NVIDIA provides. There's cuFFT, which does fast Fourier transforms — if you use FFTW on the CPU, for example, or MKL on Intel processors, you can drop in cuFFT and accelerate those on the GPU. And then there are
00:43:13 a number of other, more domain-specific libraries. There's cuSOLVER for solvers, cuSPARSE for sparse linear algebra, and basically whole bunches of them. And there's one that's getting a lot of interest now, called cuDNN, which we can talk about more — it's for deep learning, deep neural networks. So that's the second way, libraries, and the third way is with programming languages. I've talked a lot about CUDA, and what I really meant by that was CUDA C++, or CUDA C, which basically is what NVIDIA's compiler, NVCC, compiles: C or C++ with extensions for parallelism. But there's also CUDA Fortran, which was created by a company called PGI, the Portland Group, which is now owned by NVIDIA — but they started CUDA Fortran when they were an independent company. And CUDA Fortran basically takes that CUDA
00:44:13 programming model that I talked about and introduces it to Fortran with extensions. And then for Python, there's a company called Continuum Analytics in Austin, and they make a product called Conda — Anaconda. Anaconda is awesome: it's basically a Python package manager, kind of like using apt-get on Linux, or RubyGems if you're a Ruby programmer, and it lets you manage packages. But they've also made a bunch of their own Python packages, one of which is called
00:45:00 Numba, which is an open-source compiler for Python. And you might say, "But wait — Python? How can you compile it? It's interpreted." Well, what they've done is they allow you to put a little annotation on a function — you basically put @jit in front of a function — and that then uses LLVM to compile it so it'll run faster on your CPU. And they also have a CUDA JIT, and a number of syntax things to expose the CUDA programming model in Python. I've used Theano, which is pretty good — it's a Python-based, kind of MATLAB-like environment, but it runs on the GPU. And then there's also TensorFlow, which is a new one that I've only done a little testing with, so I haven't played much with TensorFlow, but it's also
00:46:00 this MATLAB-like environment, but under the hood it's all running on the GPU. I think they're both using cuBLAS, I believe, and cuDNN. Absolutely, yes, definitely they are. So — I mean, you talked about this in the scientific Python episode, which I listened to — they are tensor libraries, but really they're being driven by deep neural network work, right? And that's where we focus with cuDNN, but also cuBLAS, the linear algebra, because a lot of the computations on these tensors are basically just matrix-vector multiplies or matrix-matrix multiplies, things like that, and that's where GPUs really excel. If you want to get peak performance on a GPU, then just do large matrix-matrix multiplies, right?
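Those matrix-matrix multiplies are the classic triple loop. As a reference sketch in plain Python — cuBLAS's GEMM routines compute exactly this, just massively in parallel and heavily optimized:

```python
def matmul(A, B):
    """Naive GEMM: C[i][j] = sum_k A[i][k] * B[k][j].
    This triple loop is the computation GPUs (via cuBLAS) excel at."""
    n, k, m = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

On the GPU, each output element C[i][j] can be computed by its own thread, which is why this workload maps so well to the hardware.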
00:46:59 We want to enable all these things, and so we've worked in a few areas. One is the directives I talked about; one is building libraries where it makes sense, where there is demand; and the other is the compiler. We wanted to enable other compiler writers and developers to build compilers that could target GPUs, and so we started using LLVM — which I believe you guys have talked about on this show. It's an open-source compiler toolchain, and it has basically become really popular and is being used in the back end of a lot of different compilers for various languages. And by basically providing some extensions to LLVM — because LLVM is extensible — we were able to
00:47:59 do it all entirely within that. And so our extensions — NVVM IR — are actually a subset of LLVM IR with additions that allow you, in the low-level representation of LLVM, to express the parallelism just as you would in CUDA kernels, and that enables an LLVM-based compiler to target GPUs. So we have a library called libNVVM that will generate assembly code for the GPU from this extended LLVM IR, and we also open-sourced
00:48:38 a version of this, and it's included in LLVM. And so that has enabled a number of developers — such as Continuum Analytics, such as PGI, such as others, even Google — to target GPUs much more easily than having to build language tools from scratch. That's really cool, I like that. If you wanted, in a day or a week, to go from an intro to having something kind of really cool that you could show your friends, what would you recommend? Is there a cool demo that you recommend, or a site? Like, for Ruby there is Rails for Zombies, where you end up with this Twitter-like website you can show off. Is there something like that for CUDA?
00:49:34 Yes — we should do, like, CUDA for Zombies or something like that! What we recommend for people who specifically want to learn CUDA programming — CUDA C++ — is to check out the Udacity course. It's actually almost a couple of years old now, but it's still relevant, and I think it's called "Programming Massively
00:49:58 —" no — "Intro to Parallel Programming," I believe.
00:50:11 It's taught by Dave Luebke, who's at NVIDIA, and John Owens. Cool — that would definitely be Udacity, and if you don't know what that is, it's a great platform for learning almost anything technical, and now they're getting into other areas too. So there's an NVIDIA course on CUDA there that all of you should check out.
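Before diving into a course, it helps to see what the CUDA model boils down to: you write a kernel — essentially a loop body that runs once per thread — and launch it over many threads at once. A toy Python sketch of that idea; `launch` here is a serial stand-in invented for illustration, not a real API:

```python
def saxpy_kernel(i, a, x, y, out):
    # the body of what would be a CUDA kernel:
    # thread i handles exactly element i
    out[i] = a * x[i] + y[i]

def launch(kernel, n, *args):
    """Toy 'launch': a real GPU would run all n instances concurrently
    across its threads; here we just loop over them serially."""
    for i in range(n):
        kernel(i, *args)

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
out = [0.0] * 3
launch(saxpy_kernel, 3, 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0]
```

In CUDA C++ the same thing is a `__global__` function launched with a grid of thread blocks, and the thread computes its own index from the block and thread IDs.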
00:50:57 And if you're not up to C — for example, if you want to use GPUs but you're not a C programmer — you guys already mentioned some of the tools that are available for Python, Theano and TensorFlow, so I would say Python programmers should check those out, and there are a number of tutorials for those. Theano has a bit of a learning curve, I think, but TensorFlow has a lot of awesome documentation — I haven't played with it enough yet, but it looks very solid. And then Numba is the other one. But one thing I want to mention is the SDK — the ComputeWorks SDK, I think we're calling it now, because we have a whole bunch of other "-Works" SDKs at NVIDIA. It includes the CUDA Toolkit, and the CUDA Toolkit has a whole bunch of
00:51:57 samples — like, tons of samples — and they're nicely grouped into categories. One of the categories is called "Simple" — not because they're easy, but because they do simple things — so you can start with those. And there are two in there that are fun. In the simulation category there's one that I co-wrote with a guy called Lars Nyland; it's an N-body simulation that gets used to demonstrate GPUs a lot. Basically it's simulating the gravitational interactions of stars, effectively — it does this all-pairs computation of gravity between the stars, and it runs really fast on GPUs, and you can get really cool visuals out of it with its OpenGL renderer.
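The all-pairs gravity step at the heart of that N-body sample can be sketched on the CPU. This is a simplified toy with Euler integration and a softening term, not the actual sample code; on the GPU, each body's inner loop would run in its own thread:

```python
def nbody_step(pos, vel, mass, dt=0.01, g=1.0, soft=1e-3):
    """One Euler step of the all-pairs O(n^2) gravity computation.
    'soft' is a softening term so coincident bodies don't divide by zero."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):            # for each body...
        for j in range(n):        # ...sum the pull of every other body
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] ** 2 + d[1] ** 2 + d[2] ** 2 + soft
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += g * mass[j] * d[k] * inv_r3
    new_vel = [[vel[i][k] + dt * acc[i][k] for k in range(3)] for i in range(n)]
    new_pos = [[pos[i][k] + dt * new_vel[i][k] for k in range(3)] for i in range(n)]
    return new_pos, new_vel

# two equal masses should attract each other symmetrically
p, v = nbody_step([[0, 0, 0], [1, 0, 0]], [[0, 0, 0], [0, 0, 0]], [1.0, 1.0])
print(v[0][0] > 0 and v[1][0] < 0)  # True
```

Because every pair is independent, the GPU version keeps thousands of threads busy, which is why this demo scales so well.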
00:52:57 Another one I used to play with — I think it's called Particles — a former colleague of mine wrote, and it's a really cool demo with all these balls in a box that you can turn around, and they collide, you know, they don't pass through each other, and that's all done using CUDA. I think there's a Smoke Particles one that does smoke simulation — actually, I'm not sure; it may just be doing particle simulation but rendering it to look like smoke, with light scattering and stuff.
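The core test in a collision demo like Particles is cheap and embarrassingly parallel: for each pair of balls, check whether the centers are closer than two radii. A toy sketch, assuming equal radii (which the real demo need not):

```python
def colliding(p1, p2, radius):
    """Two equal-radius spheres overlap when their centers are closer
    than 2 * radius. Compare squared distances to avoid a sqrt."""
    d2 = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return d2 < (2 * radius) ** 2

print(colliding((0, 0, 0), (1, 0, 0), 0.6))  # True: centers 1 apart, diameter 1.2
print(colliding((0, 0, 0), (2, 0, 0), 0.6))  # False: too far apart
```

The real sample additionally uses a uniform spatial grid so each thread only tests nearby particles instead of all pairs.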
00:53:30 Cool. So let's transition to kind of what a day at NVIDIA is like, because everybody out there is probably wondering about this. Oh — but first, you wanted us to remind you to talk about unified memory.
00:53:51 Right — I should have talked about that when I was talking about heterogeneous processors, but I didn't want to cut anyone off. Unified memory is a feature of the CUDA programming model that we introduced in CUDA 6 a couple of years ago, and with the Pascal architecture we're enhancing it a lot. So the idea is — I mentioned that you have to explicitly transfer data between the GPU and CPU — it would be nice if that weren't the case. It would be nice if you could just allocate data, and then the GPU and the CPU could both use it, and behind the scenes it would get migrated on demand to the processor that needs it. And that's what unified memory is. Unified memory in CUDA 6 was basically software which does page migration between the GPU and the CPU. So if you're familiar with virtual memory, you have pages, and when the CPU needs to
00:54:51 access data that's in a page that's not currently in memory — it's on disk — then it takes a page fault, faults that data into memory, and then proceeds with accessing it. Well, before Pascal, GPUs didn't have the ability to page fault, but — kind of looking forward to GPUs that did — we built unified memory so that you can still access memory from both the CPU and GPU and have it migrated at the page level automatically. With Pascal, because we can page fault, that means you can just allocate data with cudaMallocManaged, as it's called, and then when the CPU touches a page, that page will get faulted back to the CPU, and when the GPU touches a page, it'll get faulted to the GPU. And while that may sound expensive, page faults
00:55:51 are hideable — you know all about latency hiding — you can just hide that with other work. And hardware support enables other things too. Pascal GPUs have the ability to access a 49-bit virtual address space, which is one bit larger than the CPU's virtual address space, which means that the GPU can address all of its own memory, as well as all of the CPU memory, and all the memory of any other GPU in the system — it has enough space for all of that. And so that means you can have a single virtual memory space, and the hardware just takes care of migrating the pages when and where they're needed. And with operating system support, that means you can potentially support accessing memory even if it's just allocated with the system allocator — in other words, malloc in C or new in C++. You can just allocate memory with malloc or new,
00:56:51 pass that pointer to the GPU and access it, pass the pointer to another GPU and access it, use it on the CPU, et cetera — even accessing more memory than the physical GPU memory, because the operating system handles virtual memory and paging out to disk and things like that. So it's a big step for heterogeneous computing in terms of making it easier, but it also enables you to process data sizes that you possibly couldn't before, because they wouldn't fit in GPU memory. That's kind of magic. It kind of ties into the whole LLVM thing, where someone might just annotate a for loop and you want to send that to the GPU, and you want that process to be as painless as possible — you don't want to have to inject a bunch of copy-to-the-GPU commands into their code. Absolutely, absolutely —
00:57:51 like OpenACC, right. And PGI actually added a mode to the OpenACC compiler about a year ago that will automatically use unified memory behind the scenes. Ordinarily, what you do is you first annotate your loop — you say, "parallelize this loop" — but then you find it's slower, because the compiler probably doesn't know which data the loop needs, so it just copies all the data over for that loop, even if some of it is read-only, for example, or isn't accessed on the GPU at all. And so you can go deeper in OpenACC and use these data directives to annotate: "copy this now," or "this is read-only on the GPU," things like that. But with unified memory you shouldn't have to do that. I mean, you always know more about your program than the compiler does, so
00:58:51 you can always help with performance by adding more information like that — but that becomes an optimization rather than a requirement. Right. And yeah, so unified memory really ties into that kind of automatic offloading approach.
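The migrate-on-fault behavior described above can be modeled with a toy. Everything here — `ManagedMemory` and its methods — is an invented illustration; the real mechanism is hardware page faults handled by the CUDA driver (and, with OS support, the operating system):

```python
class ManagedMemory:
    """Toy model of unified memory: each page lives on whichever
    processor touched it last, and a 'fault' migrates it on demand."""
    def __init__(self, n_pages):
        self.data = {p: 0 for p in range(n_pages)}
        self.location = {p: "cpu" for p in range(n_pages)}
        self.migrations = 0

    def access(self, processor, page):
        if self.location[page] != processor:  # page fault:
            self.migrations += 1              # migrate just this page,
            self.location[page] = processor   # not the whole allocation
        return self.data[page]

mem = ManagedMemory(4)
mem.access("gpu", 0)   # faults page 0 over to the GPU
mem.access("gpu", 0)   # already resident: no migration
mem.access("cpu", 0)   # faults it back to the CPU
print(mem.migrations)  # 2
```

The contrast with explicit copies is the point: only touched pages move, and only when touched, instead of copying the whole buffer up front.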
00:59:08 Gotcha — that makes sense. So, I know you work remotely, but in general, what is your day-to-day like? We have a lot of listeners who are in high school and college, and they want to know a lot about industry and what it's like to work in different industries. So what's a typical day like? I get out of bed, I sit down at my desk — because I work remotely! We ask this question of everyone we interview, but here, actually, let's ask: what's it like to work remotely? Because that's something a lot of people probably aren't familiar with. So, working remotely is good, I think, for me,
01:00:04 because, well, it allows me to live where I want. I don't know if I mentioned it on the recording, but I live in Australia — I have an Australian family now, and at least for now we're living here, in a beautiful place up in the mountains that I love, which I couldn't do in Silicon Valley — certainly couldn't afford to. There is a downside, though: the downside of being remote is that you don't get to go into the office and work with your team directly every day. So if I were to give advice to young people starting out — because I think that's what you were aiming towards — work in the office. Go to the headquarters if you're going to a big company, unless you're working for a company that's distributed and that's the culture, and then you just have meetups and travel to meetings.
01:01:04 You really want to experience the company culture, and I did that at first. I actually worked in the UK — I'd already been an intern in the home office, but I worked in the UK for a while in an office, and it makes a difference in terms of building your team and getting to know people. So if you work remotely, you really have to work to overcome the barriers of being remote and make sure people know you're there. So I have a lot of one-on-one meetings on the phone with people, just so I'm staying in touch and staying in the loop, and so I can do my work. And then I travel several times a year. But, you know, I get up in the morning and it's afternoon in California, so I have all my meetings early in the day — which is kind of a pain; they say you shouldn't start with meetings — but
01:02:04 after I'm done with my meetings, my whole afternoon is free to just focus on work. Whereas if you're in the office and you're involved in a lot of different things, you end up getting called into tons of meetings, and it's hard to get large blocks of time — and I think it's important for an engineer to have large blocks of thinking time. For people just starting out, I wouldn't recommend remote either, but once you know the team — there are a lot of companies that have a work-from-home day, maybe on Wednesdays everyone works from home or something like that, and in that case it's okay, you can work around that. Or if you've been with a certain team for four, five, six years and then you move off-site, you've built those relationships and you have those bonds, and then working from home
01:03:04 can give you a lot of benefit. Thank you, Mark, so much for coming on the show — this was fascinating. I think all of us have benefited greatly from the work that you and other people at NVIDIA have done, both in giving us ways to relax and play video games and also in our day-to-day work, making it so we can accelerate our programs — things have gone from taking days to taking hours, and that's just amazing. And I know Patrick's done a lot of sort of high-performance computing and things like that, and so —
01:03:48 Thanks a lot. I will post the link to your blog — your blog is Parallel Forall — so, listeners, go to programmingthrowdown.com and we'll have a link to that. Thank you again. And I'll wrap it up by thanking you guys in the audience for supporting us on Patreon, and for the reviews and the comments and feedback on social media — we really appreciate all of that. As you can tell, we've changed the format when we do interviews; you probably know this by now because we've done a few of them at this point. We'll do a programming-language show next month, but we just had some absolutely amazing people like Mark reach out to us, and so we definitely wanted to do this
01:04:48 interview, and I hope you guys appreciate it. So, bye-bye! Programming Throwdown is distributed under a Creative Commons Attribution-ShareAlike 2.0 license. You're free to share — copy, distribute, and transmit the work — and to remix — adapt the work — but you must provide an attribution to Patrick and I, and share alike in kind.

Transcribed by algorithms.


By Patrick Wheeler and Jason Gauci
Programming Throwdown attempts to educate Computer Scientists and Software Engineers on a cavalcade of programming and tech topics. Every show covers a new programming language, so listeners will be able to speak intelligently about any programming language. Look for our Podcast in the iTunes Store.