For Leela zero (in terms of chess speed)

Approximately how much faster is one Tesla v100 over two 1080s in a 2-way sli?

And approximately how much faster are two Tesla v100 on a 2 way over a single Tesla v100?

> Approximately how much faster is one Tesla v100 over two 1080s in a 2-way sli?
>
> And approximately how much faster are two Tesla v100 on a 2 way over a single Tesla v100?

It would be better to ask this on the Leela forum; I'm not sure what metrics they use exactly. By one cryptomining metric the V100 was over twice as fast (2.02 kh/s vs. 0.76 kh/s), so in that case a single V100 would be faster than two 1080 Ti cards.

There is a Leela benchmark site:

https://docs.google.com/spreadsheets/d/1lGFf6PLGmBUSMan-YP7Vul4DpRNfn6K8oeCjBILe6uA/edit#gid=0

https://docs.google.com/spreadsheets/d/1lGFf6PLGmBUSMan-YP7Vul4DpRNfn6K8oeCjBILe6uA/edit#gid=857482380

https://docs.google.com/spreadsheets/d/1lGFf6PLGmBUSMan-YP7Vul4DpRNfn6K8oeCjBILe6uA/edit#gid=1508569046

2x 1080 Ti, 15x192: 20174 kn/s

Tesla P100, 15x192: 17070 kn/s

V100 vs. P100 :

https://www.xcelerit.com/computing-benchmarks/insights/benchmarks-deep-learning-nvidia-p100-vs-v100-gpu/

also this : https://lc0bench.netlify.com/


Thanks

Why is there a big difference?

Titan v 31k nps 20x256 id10048

Titan v 51k nps 15x192 id476

> Why is there a big difference?
>
> Titan v 31k nps 20x256 id10048
>
> Titan v 51k nps 15x192 id476

I think this is about how large these networks are: 15x192 means 15 residual blocks with 192 filters each, versus 20 blocks with 256 filters each. Evaluating the smaller network takes less time than the bigger one, so its relative speed (nps) is higher, but the output is less exact. This is just my understanding of how those nets work from watching some YouTube videos! :) If I'm wrong, please correct me.
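The rough cost difference between the two sizes can be sketched numerically: most of the work in each residual block is 3x3 convolutions whose cost grows with the square of the filter count. This is only a hedged back-of-envelope model (it ignores the policy/value heads and assumes the conv cost dominates):

```python
# Back-of-envelope: relative compute cost of two Leela network sizes.
# Assumes cost ~ blocks * filters^2 (dominant term of the 3x3 convolutions);
# real networks also have heads and other layers, so this is only an estimate.

def relative_cost(blocks, filters):
    return blocks * filters ** 2

small = relative_cost(15, 192)   # 15 blocks x 192 filters
large = relative_cost(20, 256)   # 20 blocks x 256 filters

print(f"{large / small:.2f}x")   # -> 2.37x: the 20x256 net needs ~2.4x the compute
```

The observed nps ratio (51k vs. 31k, about 1.6x) is smaller than this 2.4x estimate, plausibly because the larger net keeps the GPU better utilized per evaluation.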

> For Leela zero (in terms of chess speed)

>

> Approximately how much faster is one Tesla v100 over two 1080s in a 2-way sli?

>

> And approximately how much faster are two Tesla v100 on a 2 way over a single Tesla v100?

It would help to understand:

a) The speed figure you care about for a graphics card is its computational throughput, measured in TFLOPS.

b) Lc0 scales to 2 cards but not more at the moment

c) Lc0 can use FP16 which is supported by some cards and not by others. The 1080 is a fast FP32 card (The 1080 Ti is much better by the way but still only FP32) but it doesn't do FP16 (half precision) very well. The V100 does.

The 1080 Ti is about 11 TFLOPS (FP32) and the V100 is about 30 TFLOPS (FP16)

I think the speedup from 1 to 2 cards is about 1.8x, so:

1.8 x 11 = about 20

So a V100 will be 50% quicker. However, if LC0 can use Tensor cores on the V100 then the V100 would be about 7x quicker.
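The arithmetic above can be written out explicitly. The throughput figures are the rough values quoted in this thread, not official specs, and the 1.8x multi-GPU scaling is an assumption:

```python
# Rough speed comparison from the quoted throughput numbers (assumptions,
# not official specs): 1080 Ti ~11 TFLOPS FP32, V100 ~30 TFLOPS FP16,
# and ~1.8x scaling from one card to two.

tflops_1080ti_fp32 = 11.0
tflops_v100_fp16 = 30.0
two_card_scaling = 1.8

two_1080ti = two_card_scaling * tflops_1080ti_fp32   # ~19.8 effective TFLOPS
speedup = tflops_v100_fp16 / two_1080ti
print(f"{speedup:.2f}x")  # -> 1.52x: one V100 vs. two 1080 Tis, i.e. ~50% quicker
```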

But really the big question you should be asking is whether the new gaming cards due to be released very soon support FP16. That's the big question. I think they might, but we'll see on launch.

I've read it goes further down to 8-bit integer multiplications.

TOPS (TeraOPS), not FLOPS.

int8, not FP16.

{card, price ($), watts, 32-bit TFLOPS, 16-bit TFLOPS, 8-bit TOPS}

1080ti          600     250    11     22?    -
2080ti          1200    250    13     26?    78
v100            7000    300    15     32     100   (other source: 63 / 125 / 63)
jetson xavier   2500    30     13     26?    30
teslaT4         3000?   75     8      16     130
google1,
google2,

TPU: 92 TOPS 8-bit @ 75 W

Tesla V100: 120 TFLOPS @ 300 W

Tesla P100: 13 TFLOPS @ 300 W

Leela cannot use 8-bit int yet, but they will probably add that soon.

I have to correct this.

> Posts may only be edited a limited time after their original submission. This time limit has expired.

{card, price ($), watts, 32-bit TFLOPS, 16-bit TFLOPS, 8-bit TOPS}

1080ti          600     250    11     22?    -
2080ti          1200    250    13     26?    78
v100            7000    300    15     32     100   (other source: 63 / 125 / 63)
jetson xavier   2500    30     13     26?    30
teslaT4         3000?   75     32?    65     130
google1,
google2,

https://www.tomshardware.com/news/nvidia-tesla-t4-turing-gpu,37788.html

>v100,7000,300,15,32,100 other source : 7000,300,63,125,63

Per Nvidia's own datasheet the PCI-E version does 14 Teraflops 32-bit (single precision). The 'NVLink' version for custom mainboards can do 15.7, but that's basically just for the HPC (supercomputer) crowd.

The Tesla T4 does 8.1 Tflops single precision (32-bit). Keep in mind that it's only a 75 watt device, V100 is 250 W+.

The 65 Tflops value is for FP16, but Nvidia also lists it as "mixed precision" between 16 and 32-bit. So not sure what that means exactly. It also has a special 4-bit integer mode where it can pull 260 TOPS, and I heard they were experimenting with a 1-bit integer mode (weird).

> The Tesla T4 does 8.1 Tflops single precision (32-bit). Keep in mind that it's only a 75 watt device, V100 is 250 W+.

Big step up in performance per watt.

> The 65 Tflops value is for FP16, but Nvidia also lists it as "mixed precision" between 16 and 32-bit. So not sure what that means exactly.

It means the Tensor cores take FP16 inputs and accumulate the results in FP32, giving FP16 speed with close to FP32 accuracy.

> Leela is not yet capable to use 8bit-Int, but they will probably add that soon

Maybe. Hopefully. Presumably there is a (small) strength penalty for dropping precision, because the weights cannot be represented as exactly, but the increase in nps should be more than worth it.
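The kind of precision loss at stake can be illustrated with a toy symmetric int8 quantization of random weights. This is a generic scheme for illustration only, not anything Lc0 actually implements:

```python
import numpy as np

# Toy symmetric int8 quantization of a weight tensor: scale so the largest
# magnitude maps to 127, round, then dequantize. The round-trip error is the
# accuracy cost paid for the extra int8 throughput.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=1000).astype(np.float32)

scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_back = w_int8.astype(np.float32) * scale

max_err = np.abs(w - w_back).max()
# Rounding error is bounded by half a quantization step (scale / 2)
print(max_err <= scale / 2 + 1e-6)
```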

At the moment FP16 performance is the decisive factor in choosing a card. The latest cards can run FP16 two ways: on the regular CUDA cores or on the Tensor cores. The Tensor cores are faster despite there being fewer of them.

UGH.

Looks like the 2080 Ti has been deliberately crippled in FP16 performance. Very sad.

I now know much more about this than I did a week ago. In summary:

- fp16 is indeed the crucial factor when buying a card. Forget int8 or anything below: it loses too much accuracy, and the gains in nps aren't worth it.

- fp16 on the new Nvidia cards is great, despite my comments above. If you really want to know why, I can point you to the technical answer.

Rohan Ryan

This is a pinned post on Discord by Ankan

with cuDNN 7.3 and the 411.63 driver available at nvidia.com

minibatch-size=512, network id: 11250, go nodes 1000000

              fp32    fp16

GTX 1080Ti:   8996    -----

Titan V:     13295    29379

RTX 2080:     9708    26678

RTX 2080Ti:  12208    32472

It appears that as far as Lc0 goes, the RTX 2080 is almost three times faster than the GTX 1080 Ti. I look forward to benchmarks for the RTX 2070; it should offer great value to Lc0 fans looking to invest in a graphics card.

Note: Lc0 currently uses fp16 where supported, and the GTX 1080 Ti only runs fp32 at full speed. So the fp16 performance of the Titan V, RTX 2080, and RTX 2080 Ti should be compared against the fp32 performance of the GTX 1080 Ti.
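The speedups implied by Ankan's pinned numbers can be computed directly, using the 1080 Ti's fp32 figure as the baseline:

```python
# Speedups implied by the pinned benchmark (nps, network id 11250,
# minibatch-size 512), relative to the GTX 1080 Ti at fp32.
nps = {
    "GTX 1080Ti fp32": 8996,
    "Titan V fp16": 29379,
    "RTX 2080 fp16": 26678,
    "RTX 2080Ti fp16": 32472,
}
baseline = nps["GTX 1080Ti fp32"]
for card, n in nps.items():
    print(f"{card}: {n / baseline:.2f}x")
# RTX 2080 fp16 comes out ~2.97x the 1080 Ti, i.e. "almost three times faster"
```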

The first person with a 2080 Ti reports 30,000 games per day on the Lc0 forum.

Powered by mwForum 2.27.4 © 1999-2012 Markus Wichitill