Graphics Processor, PC, Playstation, Xbox One

AMD’s RDNA 2 performance per watt contribution

From  https://www.anandtech.com/show/15589/amd-clarifies-comments-on-7nm-7nm-for-future-products-euv-not-specified
Title: AMD Clarifies Comments on 7nm / 7nm+ for Future Products

TSMC has three high-level versions of its 7nm process:

  • N7, which is the basic initial version using ‘DUV’ only tools (so no EUV),
  • N7P, which is the second generation version of N7 which is also only DUV
  • N7+, which is an EUV version of N7 for a number of layers in the metal stack

This nomenclature has been finalized within the past year or so.

N7+ (EUV) is the superior process node when compared to N7P (DUV), and the smaller process nodes that follow are EUV-based.

EUV = Extreme ultraviolet lithography

DUV = Deep ultraviolet lithography
———

From https://fuse.wikichip.org/news/3398/tsmc-details-5-nm/

TSMC emphasized the extensive use of EUV with this process. It’s worth pointing out that this is really TSMC’s first ‘main’ EUV-based process. TSMC’s N7 and N7P nodes are DUV-based. TSMC’s first production EUV process is N7+, but that node is really an orphan – not compatible with the prior nodes and with no clear migration path forward other than going back to this node.

On the other hand, N5 is designed as the main migration path from N7 for most customers. TSMC says that more than 10 EUV layers are used to replace at least 4 times more immersion layers at cut, contact, via and metal line steps. This is comparing their EUV-based N5 node to a hypothetical N5 node that utilizes multi-patterning.

N7P yields a 7% performance increase or 10% lower power consumption when compared to N7.

———

From https://www.techspot.com/news/80237-tsmc-7nm-production-improves-performance-10.html

TSMC’s N7+ (EUV) has 10% perf/watt and 20% density improvements over N7, and N7P’s improvement is smaller than N7+’s. LOL

We know AMD didn’t use EUV-based N7+, which leaves the second-generation N7P.

Compare the following AMD PR slides:

Note AMD credited 7nm process gains for RDNA 1

VS

Note AMD did NOT credit 7nm process gains for RDNA 2

In other words, AMD did NOT attribute the majority of RDNA 2’s perf/watt improvements to just N7P.

Xbox One

Intel, TSMC and Samsung process node comparison

These are 10nm-class nodes. Intel is nearly double the density of the other 8-11nm fabs. Who is lying here? Hint: it’s not Intel. Their competitors are claiming 8nm when it’s more like 12nm.

And Intel’s 10nm is MORE dense than Samsung’s and TSMC’s 1st/2nd gen 7nm:

Here’s the 12-16nm class. Intel’s 14nm is more dense than any of the competitors’ “12nm-16nm” nodes. It’s *a lot* closer to Samsung’s “10nm” node than Samsung’s 10nm is to Intel’s 10nm node.

Point here being these node names are utter marketing garbage. Meaningless marketing BS.

Graphics Processor, PC, Xbox One

Variable Rate Shading

Regarding VRS and what it can do on PC:

Gears Tactics running on a Ryzen 9 3900X and an RTX 2080 Ti, at 4K and with all graphical settings at maximum:

VRS OFF: average 47 fps
VRS ON: average 62 fps (~32% faster)

From https://hothardware.com/news/3dmark-variable-rate-shading-test-performance-gains-gpus

“…benchmark run with a GeForce RTX 2080 Ti installed into a testbed alongside AMD’s Ryzen 9 3900X CPU. As you can see, flipping the VRS toggle netted us a greater than 46 percent performance gain.”
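A minimal sketch of the uplift arithmetic, using the Gears Tactics averages above (the >46 percent figure for 3DMark’s VRS feature test is taken straight from the HotHardware article, not recomputed here):

    # Frame-rate gain from toggling Variable Rate Shading, using the averages quoted above
    def vrs_gain(fps_off: float, fps_on: float) -> float:
        """Return the percentage frame-rate gain from enabling VRS."""
        return (fps_on / fps_off - 1.0) * 100.0

    # Gears Tactics, 4K max settings, RTX 2080 Ti + Ryzen 9 3900X (figures quoted above)
    print(f"Gears Tactics: +{vrs_gain(47, 62):.0f}%")  # ~+32%
    # 3DMark's VRS feature test on the same hardware reported a >46% gain (HotHardware)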

Graphics Processor, PC, Playstation, Xbox One

Improving PC SSD performance, its evolution, and PS5’s Unreal Engine 5 demo.


For a PC running PS5’s UE5 demo, see https://www.pcgamer.com/unreal-engine-5-tech-demo/

Would this demo run on my PC with a RTX 2070 Super? Yes, according to Libreri, and I should get “pretty good” performance.

The above quote is from PC Gamer, paraphrasing Epic Games chief technical officer Kim Libreri.

John Carmack responds to the statement by Tim Sweeney of Epic.

PC, Xbox One

Gears 5 with PC Ultra settings at 4K: PC vs Xbox Series X

From https://www.eurogamer.net/articles/digitalfoundry-2020-inside-xbox-series-x-full-specs

“…but there was one startling takeaway – we were shown benchmark results that, on this two-week-old, unoptimised port, already deliver very, very similar performance to an RTX 2080.”

https://youtu.be/oNZibJazWTo?t=726

[Image: Gears 5 benchmark comparison]

51 / 39 = 1.30 or 30%

https://www.guru3d.com/articles_pages/gears_of_war_5_pc_graphics_performance_benchmark_review,6.html

Gears 5 with PC Ultra settings at 4K

[Image: Gears 5 4K Ultra benchmark chart]

Scale the RX 5700 XT’s 9.66 TFLOPS** up to the XSX’s 12.147 TFLOPS and it lands at RTX 2080 class level, i.e. ~50 fps.

**Based on the real-world average clock speed for the RX 5700 XT from https://www.techpowerup.com/review/amd-radeon-rx-5700-xt/34.html

12.147 / 9.66 = 1.257, hence the XSX GPU has ~25.7% more TFLOPS than the RX 5700 XT.

Gears 5’s relative performance ranking follows:

[Image: Gears 5 relative performance chart]

Apply 1.25 to the RX 5700 XT’s 122% and it would land on 152.5%, which is between the RTX 2080 and RTX 2080 Super.
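A small sketch of the scaling arithmetic used above (the TFLOPS figures and the 122% ranking entry are as quoted; assuming performance scales linearly with TFLOPS, which is the premise of the estimate rather than a measured fact):

    # Linear TFLOPS scaling from RX 5700 XT up to Xbox Series X, figures as quoted above
    rx_5700_xt_tflops = 9.66   # real-world average clocks (TechPowerUp)
    xsx_tflops = 12.147        # Xbox Series X paper spec
    scale = xsx_tflops / rx_5700_xt_tflops
    print(f"scale factor: {scale:.3f}")  # ~1.257; the post rounds this to 1.25

    rx_5700_xt_ranking = 122.0           # Gears 5 4K relative-performance entry quoted above (%)
    print(f"scaled ranking: {rx_5700_xt_ranking * scale:.1f}%")  # ~153%, between RTX 2080 and 2080 Super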

Xbox One

AMD GCN’s Xbox 360 origins



PC GCN 1.0 has two ACE units with a total of 16 queued contexts.

PC GCN 1.1 has up to eight ACE units with a total of 64 queued contexts.

Game console GCNs have an extra Graphics CP (Command Processor).


ATI’s Fusion concept is the genesis of AMD fusing GCN with CPU capability.


AMD GCN era


 

Graphics Processor

GPU memory compression vs effective physical memory bandwidth

NVIDIA has superior memory compression and effective physical memory bandwidth, e.g.:


Fury X has 512 GB/s of raw memory bandwidth but only 322 GB/s of effective physical memory bandwidth, with only a small recovery from memory compression.
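One way to read those figures is as a bandwidth efficiency ratio; a quick sketch (only the 512 GB/s and 322 GB/s numbers come from the chart referenced above):

    # Fury X: measured effective bandwidth vs raw (paper) bandwidth, figures as quoted above
    raw_gb_s = 512.0
    effective_gb_s = 322.0
    print(f"Fury X bandwidth efficiency: {effective_gb_s / raw_gb_s:.0%}")  # ~63% of raw bandwidth is usable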


Polaris’s memory compression has improvements over Tonga/Fiji (Fury) that recover the inefficiencies in the memory subsystem.

Titan X Pascal’s superior effective memory bandwidth


NVIDIA made sure memory bandwidth is not a major bottleneck for its former flagship Titan X Pascal GPU.

AMD Vega’s memory compression is garbage since it couldn’t recover the inefficiencies in its memory subsystem.

Turing has further memory compression improvements over Pascal.

Graphics Processor

GDDR5, DDR3, DDR4 and HBM latencies

[Image: memory access latency comparison chart]

For a given generation, AMD’s memory access times are the same for DDR3 (A6-3650) and GDDR5 (HD 6850).

Intel’s memory controller access times are lower, hence superior to AMD’s.


From https://www.computer.org/csdl/journal/si/2019/08/08674801/18IluD0rWjS

GDDR5 and DDR4 have similar latency.

GDDR6 has lower latency when compared to GDDR5 (a claim made by Rambus).

https://www.rambus.com/blog_category/hbm-and-gddr6/

AI-specific hardware has been a catalyst for this tremendous growth, but there are always bottlenecks that must be addressed. A poll of the audience participants found that memory bandwidth was their #1 area for needed focus. Steve and Bill agreed and explored how HBM2E and GDDR6 memory could help advance AI/ML to the next level.

Steve discussed how HBM2E provides unsurpassed bandwidth and capacity, in a very compact footprint, that is a great fit for AI/ML training with deployments in heat- and space-constrained data centers. At the same time, the excellent performance and low latency of GDDR6 memory, built on time-tested manufacturing processes, make it an ideal choice for AI/ML inference which is increasingly implemented in powerful “IoT” devices such as ADAS in cars and trucks.

Xbox One

Gamespot’s interview with Super Lucky’s Tale developer Paul Bettner on X1X’s GPU relative to NVIDIA Pascal GPUs


Question: As it pertains to the PC space, what GPU do you think it most closely aligns with?

Answer:

It’s definitely within the realm of [Nvidia’s 10-series cards], ranging from the 1060 to 1080. It’s somewhere in that range, looking at the comparisons that we’ve done, but it’s hard to make an apples-to-apples comparison of exactly where it’s at as it really depends on the games and the CPU and things like that…it’s like current generation of Nvidia [hardware]

 

On related information, from the ARK: Survival Evolved developer: Xbox One X’s GPU being like a GTX 1070 (covered in a later section below).

Xbox One

Xbox One X’s GPU and CPU balance

From a World of Tanks developer on Xbox One X’s CPU and GPU balance:

He replied that, “We’ve actually found the CPU and GPU improvements to complement each other quite well. Increasing the resolution from 1080p to 4K uses much of the additional power of the GPU but has basically no effect on the CPU.”

From http://gamingbolt.com/xbox-one-xs-4k-resolution-has-no-impact-on-cpu-gpu-allowed-increased-lod-and-more-objects-dev

[Image: Frostbite API efficiency comparison]

From EA DICE: Xbox One’s API efficiency is better than PS4’s and PC’s DirectX 12.

Xbox One

Fallout 4 Xbox One X build has native 4K

From http://www.eurogamer.net/articles/digitalfoundry-2017-fallout-4-ps4-pro-patch-analysis

PS4 Pro’s Fallout 4 pushes the game to a native 1440p (3,686,400 pixels).

From http://www.tweaktown.com/news/58024/bethesda-xbox-one-support/index.html

“We’re looking for all of our studios to add a level of support for Xbox One X. We Tweeted out last night that we’re working right now to get Skyrim SE and Fallout 4 supported on the X,”

Hines said in a recent interview with Geoff Keighley. “We’re working right now to get both of those titles supported with higher resolution, True 4K, higher frame rates, etc. The games will take advantage of the hardware and Microsoft’s been grateful and Phil Spencer came out last year to tell us what they’re doing and walk us through a tech demo to let all of our guys get up to speed on what Xbox One X is capable of doing and how we want to embrace it and incorporate it into our games.”

X1X’s 4K has 8,294,400 pixels.

Xbox One X renders 2.25 times the pixels of PS4 Pro’s 1440p, i.e. a 125 percent pixel-count advantage for Xbox One X.
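As a sketch, the pixel-count arithmetic:

    # Native render resolutions quoted above
    x1x_pixels = 3840 * 2160      # 8,294,400 (native 4K, Xbox One X)
    ps4_pro_pixels = 2560 * 1440  # 3,686,400 (native 1440p, PS4 Pro)
    print(f"X1X renders {x1x_pixels / ps4_pro_pixels:.2f}x the pixels of PS4 Pro")  # 2.25x, i.e. +125%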

For comparison with PC GPUs from http://www.pcgamer.com/fallout-4-graphics-revisited-patch-13/ 

[Image: Fallout 4 4K PC GPU benchmark chart]

R9-390 with 5.1 TFLOPS reached just under 30 fps average.

Xbox One X seems to land in the R9-390X to GTX 980 Ti range for Fallout 4 at 4K.

Xbox One

Peak Rasterization Rate with Xbox One X GPU, GTX 1070 and RX-Vega 56


http://techreport.com/review/32391/amd-radeon-rx-vega-64-and-rx-vega-56-graphics-cards-reviewed

R9-290X has 4.2 G triangles/sec

R9 Fury X has 4.2 G triangles/sec <——— Hawaii-type GPU hardware despite AMD adding more CUs.

GTX 1070 has 5.0 G triangles/sec <———- Not a quantum leap above X1X’s 4.69 G.

Radeon RX Vega 56 has 5.9 G triangles/sec

Radeon RX Vega 64 (liquid) has 6.7 G triangles/sec

GTX 1080 has 6.9 G triangles/sec

http://www.anandtech.com/show/11740/hot-chips-microsoft-xbox-one-x-scorpio-engine-live-blog-930am-pt-430pm-utc

X1X has 4.69 G primitives/sec <——– faster than Fury X’s 4.2 G triangles/sec and close to GTX 1070’s 5.0 G triangles/sec.
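As a quick ratio sketch of the rates above (values as quoted; “G” = billions per second):

    # Peak geometry/rasterization rates quoted above (G triangles or primitives per second)
    rates = {
        "R9-290X": 4.2, "R9 Fury X": 4.2, "Xbox One X": 4.69,
        "GTX 1070": 5.0, "RX Vega 56": 5.9, "RX Vega 64 (liquid)": 6.7, "GTX 1080": 6.9,
    }
    x1x = rates["Xbox One X"]
    print(f"X1X vs Fury X: +{(x1x / rates['R9 Fury X'] - 1):.0%}")   # ~+12%
    print(f"GTX 1070 vs X1X: +{(rates['GTX 1070'] / x1x - 1):.0%}")  # ~+7%, not a quantum leap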

RX-580 is missing the X1X/Vega 10/Maxwell/Pascal-style ROPS-with-multi-MB-cache improvements, and that’s coming with Vega 11 (the Polaris 10/20 replacement). For information on this topic head to https://gpucuriosity.wordpress.com/2017/09/10/xbox-one-xs-render-backend-2mb-render-cache-size-advantage-over-the-older-gcns/

 

“Vega 10” denotes RX-Vega 56 and RX-Vega 64 GPU cards.

 

Xbox One

From the ARK: Survival Evolved developer: Xbox One X’s GPU being like a GTX 1070

From https://www.gamespot.com/articles/ark-dev-talks-xbox-one-x-and-says-sony-wont-allow-/1100-6452662/

On the subject of the Xbox One X’s horsepower, Stieglitz said Ark can run at the equivalent of “Medium” or “High” settings on PC. It can run at 1080p/60fps (Medium) or 1440p/30fps (High), and it sounds like developer Studio Wildcard may offer an option to switch between them.

For the GTX 1070, Medium settings at 1080p with a 60 fps target starts at https://youtu.be/nIbiUd3l4PQ?t=20

 

http://www.tweaktown.com/news/58011/ark-dev-xbox-one-pc-gtx-1070-16gb-ram/index.html

As for the comparisons between the PC and Xbox One X, he said: “If you think about it, it’s kind of equivalent to a GTX 1070 maybe and the Xbox One X actually has 12GB of GDDR5 memory. It’s kind of like having a pretty high-end PC minus a lot of overhead due to the operating system on PC. So I would say it’s equivalent to a 16GB 1070 PC, and that’s a pretty good deal for $499″.

 

 

Xbox One

Xbox One X’s GPU advantages over AMD’s older Graphics Core Next (GCN).

From http://www.anandtech.com/show/11740/hot-chips-microsoft-xbox-one-x-scorpio-engine-live-blog-930am-pt-430pm-utc#post0821123606

12:36PM EDT – 8x 256KB render caches

12:37PM EDT – 2MB L2 cache with bypass and index buffer access

12:38PM EDT – out of order rasterization, 1MB parameter cache, delta color compression, depth compression, compressed texture access

Xbox One X GPU’s Render Back-Ends (RBEs) have 256 KB of cache each and there are 8 of them, hence 2 MB of render cache.

Xbox One X’s 2 MB L2 cache, 1 MB parameter cache and 2 MB render cache total 5 MB of cache.

[Image: Xbox One X Hot Chips slide]

https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

The old GCN’s Render Back-End (RBE) cache, from page 13 of 18:

Once the pixel fragments in a tile have been shaded, they flow to the Render Back-Ends (RBEs). The RBEs apply depth, stencil and alpha tests to determine whether pixel fragments are visible in the final frame. The visible pixel fragments are then sampled for coverage and color to construct the final output pixels. The RBEs in GCN can access up to 8 color samples (i.e. 8x MSAA) from the 16KB color caches and 16 coverage samples (i.e. for up to 16x EQAA) from the 4KB depth caches per pixel. The color samples are blended using weights determined by the coverage samples to generate a final anti-aliased pixel color. The results are written out to the frame buffer, through the memory controllers.

GCN 1.0’s per-RBE cache size is just 20 KB (16 KB color + 4 KB depth). 8x RBE = 160 KB of render cache for Radeon HD 7970.

Radeon R9-290X/R9-390X’s aging RBE (which contains the ROPS), for comparison: https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah

[Image: Hawaii GCN RBE diagram]

As a side note, the Render Back-Ends include their own math operations, i.e. “Logic Operations” and blending.

Hawaii has 16 RBE units, with each RBE unit containing 4 ROPS units. Each RBE unit has 24 KB of cache.

24 KB x 16 = 384 KB.

Xbox One X’s RBEs have 256 KB each; 8x RBE = 2048 KB (or 2 MB) of render cache. Xbox One X can therefore hold more rendering data on the chip when compared to Radeon HD 7970 and R9-290X/R9-390X.
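A quick sketch of the render-cache arithmetic above (per-RBE cache sizes and RBE counts are the ones quoted in this section):

    # Total render back-end cache = per-RBE cache (KB) * number of RBEs, figures as quoted above
    gpus = {
        "Radeon HD 7970 (GCN 1.0)": (20, 8),   # 16 KB color + 4 KB depth per RBE
        "R9-290X/390X (Hawaii)":    (24, 16),
        "Xbox One X (Scorpio)":     (256, 8),
    }
    for name, (kb_per_rbe, rbe_count) in gpus.items():
        print(f"{name}: {kb_per_rbe * rbe_count} KB render cache")
    # 160 KB vs 384 KB vs 2048 KB (2 MB)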

http://www.eurogamer.net/articles/digitalfoundry-2017-project-scorpio-tech-revealed

“We quadrupled the GPU L2 cache size, again for targeting the 4K performance.”

X1X GPU’s 2MB L2 cache can be used for rendering in addition to X1X’s 2 MB render cache.

RX-480’s RBE (Render Back Ends)

[Image: RX-480 RBE block diagram]

It seems RX-480’s RBEs don’t have any improvements.

The following block diagram shows RX-480/RX-580’s RBE (Pixel Engine path) bottleneck!

[Image: Polaris/Vega pixel engine block diagram]

X1X’s RBE/Pixel Engine has a 2 MB render cache and it’s located next to the 2 MB L2 cache!

When X1X’s L2 cache (for the TMUs) and render cache (for the ROPS) are combined, the total cache size is 4 MB, which is similar to Vega’s shared 4 MB L2 cache for the RBE/ROPS and TMUs.

Transistor count difference between X1X’s GPU and RX-480/RX-580

http://www.anandtech.com/show/10446/the-amd-radeon-rx-480-preview

RX-480/RX-580 has 5.7 billion transistors.

R9-290X/R9-390X has about 6.2 billion transistors.

https://www.allaboutcircuits.com/news/microsofts-scorpio-system-on-chip/

 

Xbox One X’s Scorpio Engine SoC has 7 billion transistors, which points to the GPU design not simply being RX-480/RX-580.

Performance implications

R9-290X/R9-390X’s Compute Unit (CU) TMU path has 1 MB of L2 cache before spilling over to external memory, at which point it is memory bandwidth bound. The RBEs have a tiny 384 KB cache before spilling over to external memory, which is a known bottleneck and hence one of the reasons for the compute shader optimization push from AMD. This is one of many reasons the older AMD GPUs can’t convert their GpGPU performance into graphics performance.

RX-480’s CU TMU path has 2 MB of L2 cache before spilling over to external memory, i.e. becoming memory bandwidth bound, but its RBEs still have only a tiny cache before spilling over to external memory, which remains the same known bottleneck.

Comparisons with Xbox One X’s GPU

Xbox One X GPU’s CU TMU path has 2 MB of L2 cache while the RBE path has 2 MB of render cache before spilling over to external memory. This advantage could have contributed to Xbox One X’s good performance on ForzaTech’s wet track with heavy alpha effects usage, which rivaled NVIDIA’s GeForce GTX 1070 (1).

A larger cache means the GPU doesn’t have to access larger, slower memory pools as often, which reduces the load on the VRAM subsystem (freeing memory bandwidth for other tasks) whilst simultaneously accelerating rendering.

Comparisons with other GPUs

GTX 1060’s SM/TMU and RBE/ROPS paths share 1.5 MB of L2 cache before spilling over to external memory, i.e. becoming memory bandwidth bound. Both the TMU and RBE/ROPS read/write paths have similar performance.

GTX 1070’s SM/TMU and RBE/ROPS paths share 2 MB of L2 cache before spilling over to external memory, i.e. becoming memory bandwidth bound. Both the TMU and RBE/ROPS read/write paths have similar performance.

[Image: NVIDIA GP102 diagram]

The Streaming Multiprocessor (SM) is NVIDIA’s rough equivalent of AMD’s Compute Unit (CU).
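Pulling the figures from this section together, a small summary sketch (only the cache sizes explicitly quoted above are included; Polaris’s RBE cache is described above only as “tiny”, so no number is listed for it):

    # On-chip cache (KB) in front of external memory per read/write path, figures as quoted above
    cache_before_spill_kb = {
        "R9-290X/390X (Hawaii)": {"TMU path (L2)": 1024, "RBE/ROPS path (render cache)": 384},
        "Xbox One X":            {"TMU path (L2)": 2048, "RBE/ROPS path (render cache)": 2048},
        "GTX 1060":              {"SM/TMU + ROPS (shared L2)": 1536},
        "GTX 1070":              {"SM/TMU + ROPS (shared L2)": 2048},
    }
    for gpu, paths in cache_before_spill_kb.items():
        print(f"{gpu}: {paths} (total {sum(paths.values())} KB)")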

 

References

  1. http://www.eurogamer.net/articles/digitalfoundry-2017-forza-motorsport-on-project-scorpio-the-full-story