Wikipedia:Reference desk/Archives/Computing/2017 November 8

Computing desk
Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is an archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


November 8

Fast Combining Of Bitmaps In Memory

I am helping a friend with a project of his and have hit a snag, and I was hoping someone might have some helpful insight. I have a pointer to a buffer of bytes describing an image (the bytes come in quads: Alpha/Red/Green/Blue); this pointer is pulled out of memory and must be the final destination for the result. What I need to do is blend a bunch of other images into the first one so that, if you were laying them on top of each other, the transparencies would reinforce (if you had 3 slightly transparent images, blending them all with the image from memory would make that image more transparent than any of the inputs, in proportion to how transparent they were) and the colours would combine based upon their intensity. Doing this once is quick enough, but the end goal is for the set of blended images to change every frame of the application (which runs at roughly 40 FPS), using a fresh version of the base image each time; at that rate this method is nowhere near fast enough and causes everything to lurch to a halt. So, is there any way to blend the images that lets me do it quickly and, if the data structures have to change, quickly dump everything back into the original pointer format (which must remain as is)? 24.3.61.185 (talk) 00:17, 8 November 2017 (UTC)[reply]

Really, if you have a recent CPU you should be able to do millions of operations like this in 25 ms. So if it grinds to a halt, maybe you are using something that does not run on the hardware. Perhaps you will need to change languages to something that actually compiles to machine code. Modern Intel/AMD processors can use the "PADDB xmm1, xmm2/m128" instruction to add sets of bytes, to go even faster. The transparency would also add. It may also be possible to get your video card to overlay transparencies. Graeme Bartlett (talk) 01:37, 8 November 2017 (UTC)[reply]
I'm using C++, it does compile to machine code, and it is on a modern CPU. The problem, probably, is that it is not using the GPU, since this is a bit of a hack. That is something I was hoping to get around, or to find a library I could use, since I'm not overly acquainted with this area; I was hoping for some pointing toward what someone might try using (which would be rather helpful, but that's on me for not saying more, I suppose). At any rate, you are making some assumptions about the number of operations; it is easy to get into millions of operations when dealing with images (and this processing isn't the only thing my CPU is doing per frame). In short, thank you for the suggestions, which would apply if your assumptions about my project panned out and I were doing this in a non-compiled language on a twenty-year-old computer with really small images. 24.3.61.185 (talk) 03:03, 8 November 2017 (UTC)[reply]
You may have to turn optimization on, and tell the compiler which CPU you are compiling for. You may be able to inspect the code output by the compiler to see if it is efficient. Do you know the exact formula you want to use to combine each byte value, and how the transparency interacts? I suspect you may want to multiply the RGB code by (255−α)/255 and sum over all layers. But you could just add, and that could be quick if that's what you want to do. What should happen if the intensity saturates (i.e. ≥255)? In assembler, you can use the SSE instruction PADDUSB to add an SSE register byte by byte into memory, stopping at 255. [1] Graeme Bartlett (talk) 03:42, 8 November 2017 (UTC)[reply]
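For illustration, a minimal sketch of that saturating byte addition using the SSE2 intrinsic for PADDUSB (my code, not from the thread; it assumes the buffer length is a multiple of 16):

    #include <emmintrin.h>  // SSE2 intrinsics
    #include <cstdint>
    #include <cstddef>

    // Add src into dst byte by byte, clamping each sum at 255
    // (this is what PADDUSB / _mm_adds_epu8 does in hardware).
    void add_saturate(uint8_t* dst, const uint8_t* src, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 16) {
            __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst + i));
            __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i),
                             _mm_adds_epu8(a, b));
        }
    }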
Are you doing the complete job one element at a time, or are you blending in one picture at a time? It may be worth writing a loop that does the job a thousand elements at a time, which would use the first-level cache efficiently. Dmcq (talk) 13:33, 8 November 2017 (UTC)[reply]
I have a set of nested loops that, in order, loop through the images, then the y coordinates, then the x; what would be a better way to structure this? The innermost loop does the blend on each of the four channels for the pixel. The blending equation takes two bytes (the current value of the image in memory, and the current value of the image being blended into it) and gives out a byte (the new value for the image in memory). Since the blend equation gets called a lot, I made a pointer p such that, given bytes b1 and b2, p[256 * b1 + b2] holds the value of the blend equation for those bytes. Doing a few calculations, I realized my ideal situation would require processing several billion pixels a second, which seems unlikely to happen without framerate issues; but currently I'm noticing severe frame-rate issues with only a few images being blended in. For example, 1 image blended keeps me at 40 frames, 2 images at 38, 3 at around 27, 4 at 20, and 5 at 11; this seems far too steep a fall-off (I expected a more linear drop in frames), but I don't have access to all of the source for what I am working with (which is why I am stuck doing it this way). If I rewrite the for loop to use OpenMP to parallelize it over multiple cores, the frame-rate drop-off is roughly that of blending one less image (which, again, isn't what I expected). From the lag I'm seeing with a small number of images, I feel like there has to be a way to get more performance that doesn't require learning CUDA/OpenCL for a single use. (I can work at a resolution of 640x480; at 5 images that's around 70 million pixels, and ideally I would be happy if I could get 10-15 images blended without lag...) Out of curiosity: since most of the images to be blended can be assumed to be radial gradients fixed at various points of the memory-sourced image, and since such images don't lose much from rescaling, do you think that if I downscaled all the images by a factor of four on each axis (so to 1/16th the pixels), blended them, then rescaled the memory image back up to full size before sending it on its way, I would see a decent increase in performance? (I'm guessing so, but I'm also guessing this could still hit some walls.) 24.3.61.185 (talk) 18:03, 8 November 2017 (UTC)[reply]
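For readers following along, a sketch of the structure described above; the identifiers and memory layout are my guesses, with p being the 64K table the poster mentions:

    // base and images[img] are ARGB byte buffers, stride bytes per row.
    for (int img = 0; img < numImages; ++img)
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width * 4; ++x) {   // 4 channel bytes per pixel
                uint8_t* d = base + y * stride + x;
                const uint8_t* s = images[img] + y * stride + x;
                *d = p[256 * (*d) + *s];            // table lookup does the blend
            }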
Assuming that pixels with increasing x are contiguous, the easiest (!) change might be: loop over y; within that, loop over the images; within that, loop over x and do a pixel. If you can do a number of pixels at a time in that inner loop over x, so much the better. Another option would be: loop over y; loop over x; loop over the images; do a pixel. Dmcq (talk) 18:22, 8 November 2017 (UTC)[reply]
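A sketch of the first reordering, using the same hypothetical identifiers as above; each destination row stays hot in cache while every image is blended into it, and the outer loop is also a natural place for an OpenMP pragma:

    // #pragma omp parallel for   // optional: one band of rows per thread
    for (int y = 0; y < height; ++y)
        for (int img = 0; img < numImages; ++img) {
            uint8_t* d = base + y * stride;
            const uint8_t* s = images[img] + y * stride;
            for (int x = 0; x < width * 4; ++x)
                d[x] = p[256 * d[x] + s[x]];
        }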
Thank you :-) Just for my own learning, why would you loop over y then the images, rather than the images then y? 24.3.61.185 (talk) 18:47, 8 November 2017 (UTC)[reply]
Because doing it that way you won't have lines of your source and target being thrown out of the cache and then reloaded for the next image. If you have a few images, doing it an image at a time triples the L3 cache or DRAM accesses. Dmcq (talk) 19:04, 8 November 2017 (UTC)[reply]
You say that this is a one-time thing and that you are "blending" images. I would write an ImageMagick command that loads both images, resizes them to the same size (resize, crop, whatever), and then "blends" them. There are many options for blending. Finally, output the result. You are unlikely to write a process that is faster than what ImageMagick already does. As a test, I loaded two extremely large photos and used the ImageMagick utility composite: composite -blend 50 DC0142.jpg DC0143.jpg blended.jpg. I got the two photos blended together in far less than a second - not even half a second. 209.149.113.5 (talk) 18:41, 8 November 2017 (UTC)[reply]
That would be a really good suggestion, but by "one-time thing" I mean that I will only ever need to do what I am doing this one time (so learning GPU parallel stuff would be a real bummer), not that I only need the images combined once. For those curious: my friend coded a game years ago but lost the source code. He used static lights that were all blended into an overlay, but he always wanted to add dynamic, moving lights. I found that I could get the game to execute a call from a DLL each frame and pass the overlay into it (and that I could inject the data for the dynamic lights into each scene: coordinates, what they attach to, etc.). So all that was left was to blend them in, which, sadly, caused a massive performance hit; we are stuck doing everything on the CPU without any ability to tweak the major structure of the game code, hence this hack solution. 24.3.61.185 (talk) 18:47, 8 November 2017 (UTC)[reply]
It appears that you want to dynamically shade the display. You need to do that using vector operations, not for loops. That is why you need to do it on the video card, not the CPU. 209.149.113.5 (talk) 18:55, 8 November 2017 (UTC)[reply]
Yes, but I don't know how to take over the game's display; I can only hijack a bitmap from the game and work with that, so I'm stuck with the CPU at the moment. 24.3.61.185 (talk) 19:19, 8 November 2017 (UTC)[reply]
What is the blending equation? Your array of bytes p[256 * b1 + b2] will take up 64K and so cause problems in the L1 data cache. Dmcq (talk) 18:50, 8 November 2017 (UTC)[reply]
The blending equation is (a + b) / (1 + ab), with a and b being the byte values divided by 255.0. I was doing the division in the loop, but it went smoother with it removed; it works out to 3 casts, 2 adds, a multiply, and a divide per channel, with 4 channels. (Since the overlay is just solid colour with constant alpha, a quick operation on the alpha channel of the final output puts the alpha where it needs to be, while letting me do the repeated blending without a different equation; it ran smoother that way.) 24.3.61.185 (talk) 19:23, 8 November 2017 (UTC)[reply]
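For reference, the 64K table mentioned earlier could be built along these lines (a sketch, with round-to-nearest added; note (a + b) / (1 + ab) always lands in [0, 1] for inputs in [0, 1]):

    #include <cstdint>

    static uint8_t p[256 * 256];   // p[256*b1 + b2] = blend of bytes b1, b2

    void build_table() {
        for (int b1 = 0; b1 < 256; ++b1)
            for (int b2 = 0; b2 < 256; ++b2) {
                double a = b1 / 255.0, b = b2 / 255.0;
                double v = (a + b) / (1.0 + a * b);
                p[256 * b1 + b2] = static_cast<uint8_t>(v * 255.0 + 0.5);
            }
    }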
That's interesting; that's the same formula as for the tanh of a sum. If one converted with arctanh, added, and then converted back, that would give the same result. Doing that in full of course wouldn't be sensible, but a lookup table of 256 approximations to arctanh could be used: you just add the values, and then some sort of scale-and-lookup at the end gets you the tanh of the sum over the various images. Dmcq (talk) 23:46, 8 November 2017 (UTC)[reply]
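A sketch of that idea (my code, not from the thread; arctanh(1) is infinite, so the entry for 255 is clamped to a value whose tanh still rounds back to 255):

    #include <cmath>
    #include <cstdint>

    static double t[256];   // t[v] = arctanh(v / 255), one entry per byte value

    void build_arctanh_table() {
        for (int v = 0; v < 255; ++v)
            t[v] = std::atanh(v / 255.0);
        t[255] = 20.0;       // stand-in for infinity; tanh(20) rounds to 1.0
    }

    // Blend one channel of one pixel across n layers in a single pass:
    // tanh(arctanh a + arctanh b) = (a + b) / (1 + ab), applied repeatedly.
    uint8_t blend_layers(const uint8_t* vals, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; ++i)
            sum += t[vals[i]];
        return static_cast<uint8_t>(std::tanh(sum) * 255.0 + 0.5);
    }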
What I said there would, for a single use, involve three memory accesses, but wouldn't hurt the L1 cache much. It might be better to simply do the arithmetic and be done with it. If you want to avoid going to float and back again, it can be multiplied out as ((0xFF*0xFF*0x100)*(a+b)) / ((0xFF*0xFF) + a*b); the compiler can figure out what those constants are. In fact, it might give better rounding to have 0xFE8000 as the constant on top, to round up slightly but not go over 0xFF; 0xFF*0xFF is 0xFE01. Dmcq (talk) 16:07, 9 November 2017 (UTC)[reply]
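In code, a minimal integer-only sketch of the same idea, using an explicit round-to-nearest term rather than the adjusted top constant and high-byte trick:

    #include <cstdint>

    // (a + b) / (1 + a*b) on bytes, multiplied through by 255*255 = 0xFE01
    // so everything stays in integers; +den/2 gives round-to-nearest.
    inline uint8_t blend(uint8_t x, uint8_t y) {
        uint32_t num = 0xFE01u * (uint32_t(x) + uint32_t(y));
        uint32_t den = 0xFE01u + uint32_t(x) * uint32_t(y);
        return static_cast<uint8_t>((num + den / 2) / den);
    }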
I'm going to run some tests today and see what gives the best results. I have bytes coming in that I have to work with, so I'm wondering if it might not be faster to do a lookup with tanh values (I can pre-apply the atanh to all of the images that are going to be used for the dynamic lights; the static lights are generated when the scene loads, though, not pulled from files, which is a little unfortunate). Thank you for all of the help; I'll reply back with how everything went :-) Thank you so very much, this has been a very helpful experience :-) 24.3.61.185 (talk) 18:50, 9 November 2017 (UTC)[reply]
This seems to be a question for a special RAMDAC that supports sprites; see Sprite (computer graphics). --Hans Haase (有问题吗) 19:49, 11 November 2017 (UTC)[reply]

Having problem implementing equation 7 in https://en.wikipedia.org/wiki/Multilateration

[Question moved to WP:RD/Math.] Tevildo (talk) 19:54, 8 November 2017 (UTC)[reply]