[Attila GPU] Attila OGL2/D3D9 GPU C Model Simulator

2024-02-22 04:40

This post introduces the [Attila GPU] Attila OGL2/D3D9 GPU C Model Simulator; hopefully it offers a useful reference for developers interested in the topic.


http://www.opengpu.org/forum.php?mod=viewthread&tid=1094&highlight=Attila


Posted by ic.expert (administrator) on 2009-10-19 01:29:41. Views: 4979, replies: 14.

C Model  Implementation of

Attila OpenGL2.x / Direct3D 9 GPU Simulator








The 3D Rendering Pipeline 

GPUs are designed as special-purpose processors implementing a specific 3D rendering algorithm. The algorithm takes as input a stream of vertices that define the geometry of the scene. The input vertex stream passes through a computation stage that transforms and computes some of the vertex attributes, generating a stream of transformed vertices. The stream of transformed vertices is assembled into a stream of triangles, each triangle keeping the attributes of its three vertices. The stream of triangles may pass through a stage that performs a clipping test. Each triangle then passes through a rasterizer that generates a stream of fragments: discrete portions of the triangle surface that correspond to the pixels of the rendered image. Fragment attributes are derived from the triangle vertex attributes.

This stream of fragments may pass through stages performing a number of visibility tests (stencil, depth, alpha and scissor) that remove non-visible fragments, and then through a second computation stage. The fragment computation stage may modify the fragment attributes using additional information from n-dimensional arrays stored in memory (textures). Textures are not accessed as a stream. The stream of shaded fragments finally updates the framebuffer. Figure 1 shows a high-level abstraction of the rendering pipeline for the described rendering algorithm.
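The vertex-to-triangle step of the stream model above can be sketched in a few lines. This is a minimal illustration only; all type and function names are hypothetical and not taken from the ATTILA sources.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative types for the stream pipeline described above.
struct Vertex   { std::array<float, 4> attr; };   // one 4-component attribute
struct Triangle { std::array<Vertex, 3> v; };     // keeps its three vertices' attributes

// Primitive assembly for an unindexed triangle list: every three
// consecutive transformed vertices become one triangle of the output stream.
std::vector<Triangle> assembleTriangles(const std::vector<Vertex>& in) {
    std::vector<Triangle> out;
    for (std::size_t i = 0; i + 2 < in.size(); i += 3) {
        Triangle t;
        t.v = { in[i], in[i + 1], in[i + 2] };
        out.push_back(t);
    }
    return out;
}
```

Each stage of the pipeline consumes one stream and produces the next in the same fashion, which is what makes the algorithm amenable to deep pipelining and replication.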



Figure 1. Polygon Rasterization Pipeline



Modern GPUs implement the two described computation stages as programmable stages named vertex shading and fragment shading. The programmability of these stages and the streaming nature of the rendering algorithm allow the implementation of other stream-based algorithms on modern GPUs [12], though those implementations may not be optimal. The non-programmable stages are configurable using a limited, predefined set of parameters.

The shading stages are programmed using a shader, or shader program: a relatively small program, written in an assembly-like (legacy) or high-level C-like language for graphics, that describes how the input attributes of a processing element (a vertex or a fragment) are used to compute its output attributes.

Graphics applications use software APIs (OpenGL or Direct3D) that present an interface for the described rendering algorithm and map the algorithm to the modern GPU hardware capabilities. 

The 3D rendering algorithm is embarrassingly parallel and shows parallelism at multiple levels. The largest source of parallelism comes from the data and control independence of the processing elements: vertices are independent of each other, triangles are mostly independent (except for transparent surfaces) and fragments from the same triangle are independent.

GPUs exploit three forms of parallelism: the pipeline is divided into hundreds of single-cycle stages to increase the throughput and the GPU clock frequency (pipeline parallelism); the pipeline stages are replicated to process multiple vertices, triangles and fragments in parallel (data parallelism); and independent instructions in a shader program may be executed in parallel (instruction-level parallelism).

We will now briefly describe the ATTILA implementation of the 3D rendering pipeline. We have blended techniques and ideas from different vendors and publications [3] and made educated guesses in those areas where information was especially scarce. Our implementation correlates in most aspects with current real GPUs.



Attila Architecture (Unified Shader Model)



Figure 2. Attila Architecture


The ATTILA architecture supports both hard partitioning of vertex and fragment shaders (the norm in current GPUs) and a unified shader model. Figure 2 shows the ATTILA GPU graphics pipeline for the unified shader model. The input and output processing elements, the bandwidth and the latency of the different ATTILA stages can be found in Table 1. Table 2 shows the sizes of some of the input queues in those stages and the number of threads supported in the vertex and fragment/unified shader units. The diagram and the table data correspond to a reference architecture implementing 4 vertex shaders (non-unified), 2 shader units (fragment or unified), 2 ROPs and 4 64-bit DDR channels.

Two GPU units are not shown in Figure 2: the Command Processor, which controls the whole pipeline and processes the commands received from the system's main processor, and the DAC unit, which consumes bandwidth for screen refreshes and outputs the rendered frames into a file. The Streamer unit reads streams of vertex input attributes from GPU or system memory and feeds them to a pool of vertex or unified shader units (Figure 2). The Streamer also supports an indexed mode that allows reusing vertices already shaded and stored in a small post-shading cache. After shading, the Primitive Assembly stage converts the shaded vertices into triangles and the Clipper stage performs a trivial triangle rejection test.

The rasterizer stages generate fragments from the input triangles. The rasterization algorithm is based on the 2D homogeneous rasterization algorithm, which allows unclipped triangles to be rasterized. The Triangle Setup stage calculates the triangle edge equations and a depth interpolation equation, while the Fragment Generator stage traverses the whole triangle generating tiles of fragments. ATTILA supports two fragment generation algorithms: a tile-based fragment scanner and a recursive algorithm.
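The edge-equation test at the core of Triangle Setup and Fragment Generation can be sketched as follows, here in plain 2D screen space for brevity. The real 2D homogeneous algorithm evaluates analogous equations on unprojected clip-space coordinates and also derives the depth interpolation equation; all names below are illustrative.

```cpp
#include <cassert>

struct Edge { float a, b, c; };                   // E(x, y) = a*x + b*y + c

// Edge through (x0,y0)-(x1,y1); E >= 0 on the interior side of a
// counter-clockwise triangle.
Edge makeEdge(float x0, float y0, float x1, float y1) {
    return Edge{ y0 - y1, x1 - x0, x0 * y1 - x1 * y0 };
}

bool inside(const Edge& e, float x, float y) {
    return e.a * x + e.b * y + e.c >= 0.0f;
}

// A sample position is covered by the triangle when it lies inside all
// three edges; the Fragment Generator applies this test tile by tile.
bool covered(const Edge e[3], float x, float y) {
    return inside(e[0], x, y) && inside(e[1], x, y) && inside(e[2], x, y);
}
```

Because the edge functions are linear, a tile-based generator can evaluate them at tile corners to trivially accept or reject whole tiles at once.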



Table 1. Inputs, outputs and latencies in the reference ATTILA architecture



After fragment generation, a Hierarchical Z buffer is used to remove non-visible fragment tiles at a fast rate without accessing GPU memory. The HZ buffer is stored in on-chip memory and supports resolutions up to 4096x4096 (256 KB). The processing element for the next stages is the fragment quad, a tile of 2x2 fragments. Most modern GPUs use this working unit for memory locality and for the computation of the texture lod in the Texture Unit.
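The Hierarchical Z test amounts to one conservative "farthest depth" value per fragment tile, kept on chip. The tile size and update policy below are assumptions for illustration; the point is that an entire tile of fragments is culled with a single comparison, without reading the per-pixel depth buffer in GPU memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

class HierarchicalZ {
    std::vector<float> farthest;    // per-tile upper bound on stored depth
    int tilesX;
public:
    HierarchicalZ(int tx, int ty)
        : farthest(static_cast<std::size_t>(tx) * ty, 1.0f), tilesX(tx) {}

    // Conservative update when a tile has been fully overwritten by
    // fragments whose maximum depth is tileMax.
    void update(int tx, int ty, float tileMax) {
        float& d = farthest[static_cast<std::size_t>(ty) * tilesX + tx];
        d = std::min(d, tileMax);
    }

    // Reject when even the nearest incoming fragment of the tile lies
    // behind everything already stored there.
    bool reject(int tx, int ty, float tileMinIncoming) const {
        return tileMinIncoming > farthest[static_cast<std::size_t>(ty) * tilesX + tx];
    }
};
```

Because the stored value is only an upper bound, the test can never cull a visible fragment; tiles that survive still go through the exact per-fragment Z test later.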

The Z and stencil test stage removes non-visible fragments as early as possible, thereby reducing the computational load on the fragment shaders. Figure 2 shows the datapath for early fragment rejection; another path exists to perform the tests after fragment shading. ATTILA supports a single depth and stencil buffer mode: 8 bits for stencil and 24 bits for depth. The Z and Stencil test unit implements a 16 KB, 64-line, 4-way set-associative cache. The cache supports fast depth/stencil buffer clear and depth compression. The architecture is derived from the methods described for ATI GPUs.



Table 2. Queue sizes and number of threads in the ATTILA reference architecture



The Interpolator unit uses perspective-corrected linear interpolation to generate the fragment attributes from the triangle attributes, although other implementations may interpolate the fragment attributes in the Fragment Shader. The interpolated fragment quads are fed into the fragment or unified shader pool. The Texture Unit attached to each fragment or unified shader supports n-dimensional and cubemap textures, mipmapping, and bilinear, trilinear and anisotropic filtering. The Texture Cache is configured as a 16 KB, 64-line, 4-way set-associative cache; relatively small texture caches are known to work well. Compressed textures are also supported.
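Perspective-corrected interpolation rests on the fact that attribute/w and 1/w are linear in screen space: interpolate both with screen-space barycentric weights and the quotient recovers the perspective-correct attribute. A minimal sketch of how an Interpolator stage might compute it (names are illustrative, not ATTILA's):

```cpp
#include <cassert>
#include <cmath>

float perspectiveInterp(const float a[3],       // attribute at each vertex
                        const float w[3],       // clip-space w at each vertex
                        const float bary[3]) {  // screen-space barycentrics
    float aOverW = 0.0f, oneOverW = 0.0f;
    for (int i = 0; i < 3; ++i) {
        aOverW   += bary[i] * a[i] / w[i];      // a/w is linear in screen space
        oneOverW += bary[i] / w[i];             // so is 1/w
    }
    return aOverW / oneOverW;                   // divide to undo the projection
}
```

When all three w values are equal this degenerates to plain linear interpolation; with unequal w the result is correctly biased toward the nearer vertices.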

The basic architecture of the Color Write stage is similar to the Z and Stencil test stage, but color compression is not supported.

The Memory Controller interfaces with the ATTILA memory and the main computer memory system. The ATTILA memory interface simulates a simplified (G)DDR memory, but banks are not simulated. The memory access unit is a 64-byte transaction: an 8-cycle transfer over a 64-bit channel. The number of channels and the channel interleaving are configurable. Read-to-write and write-to-read penalties are implemented. A number of queues and dedicated buses form a complex crossbar that services the memory requests of the different GPU stages.




Shader Architecture 

Our shader architecture is based on the OpenGL ARB specifications for vertex and fragment shader programs.

The ARB vertex and fragment program specifications define assembly-like instructions that program how the vertex and fragment output registers are calculated from per-vertex and per-fragment input registers and a set of per-batch constant parameters. There are four defined register banks (as shown in Figure 3): the input register bank, read-only, stores the vertex and fragment input attributes; the output register bank, write-only, stores the vertex and fragment output attributes; the temporary register bank, readable and writable, is used to store intermediate values; and the constant parameter bank stores parameters that are constant for a whole batch. A shader register is a 4-component 32-bit floating-point vector, limiting the ARB shader program models to floating-point data. The programming model doesn't support any kind of execution flow control. The ARB shader program models are quite limited; only when our OpenGL library implements support for a glSlang (a HLSL, or high-level shading language) compiler will our architecture be able to go beyond them.
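The four register banks might be held per shader thread roughly as below. The bank sizes here are illustrative assumptions; what the model does fix is that every register is a 4-component 32-bit floating-point vector.

```cpp
#include <array>
#include <cassert>

using ShaderReg = std::array<float, 4>;   // 4-component 32-bit float vector

struct ShaderThreadState {
    std::array<ShaderReg, 16> input;      // read-only:  per-vertex/fragment attributes
    std::array<ShaderReg, 16> output;     // write-only: computed attributes
    std::array<ShaderReg, 32> temp;       // read/write: intermediate values
};

// Constant parameters are shared by every element of a batch, so the
// constant bank lives outside the per-thread state.
using ConstantBank = std::array<ShaderReg, 256>;
```

Keeping the constant bank shared is what makes per-thread state small enough to replicate across hundreds of threads.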




Figure 3. Shader Architecture




The glSlang programming language virtualizes all the hardware resources available to the shader, tasking the compiler and optimizer with mapping the required resources onto those available in the target architecture. Current glSlang implementations for modern GPUs, such as those of ATI and NVidia, are allowed to fail when programs require resources beyond the available hardware. The glSlang shading language is loosely based on C syntax, with additional data types and operations (for example SIMD data types and operations) better suited to shader processing. Loops, subroutine calls and conditional statements are supported, as expected, but architectural support may be missing in current GPUs; our current GPU architecture, for instance, only supports 'static' (constant-based) branching and code replication for constant loops. We plan to add support for true branching in the next iteration of our shader architecture.

An ARB instruction is defined as an opcode, a destination operand and up to three source operands. The source operands support full swizzling of their 4 components, plus negation and absolute-value modifiers. The destination operand supports full swizzling and masking of the instruction result. There are two main types of operations, performing scalar or vectorial (SIMD) computations. The vectorial operations supported are: addition (ADD), compare (CMP), dot product (DP3, DP4, DPH), distance vector (DST), floor (FLR), fraction (FRC), compute light coefficients (LIT), linear interpolation (LRP), multiply and add (MAD), maximum (MAX), minimum (MIN), move (MOV), multiplication (MUL), set greater or equal (SGE), set less than (SLT) and subtract (SUB). The scalar operations supported are: cosine (COS), exponential base 2 (EX2), logarithm base 2 (LG2), exponentiate (POW), reciprocal (RCP), reciprocal square root (RSQ) and sine (SIN). All can be implemented with a 4-component SIMD ALU and a special ALU for some of the scalar operations such as the RCP and RSQ instructions.
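Source operand evaluation in this style can be sketched as a full 4-component swizzle plus an optional negation modifier, applied before the value reaches the 4-wide SIMD ALU (DP4 shown as an example consumer). Encodings and names below are illustrative, not ATTILA's.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;

// Apply an ARB-style swizzle string such as "wzyx" plus optional negation.
Vec4 readOperand(const Vec4& reg, const char* swz, bool negate) {
    auto comp = [&](char c) {
        switch (c) {
            case 'x': return reg[0];
            case 'y': return reg[1];
            case 'z': return reg[2];
            default:  return reg[3];   // 'w'
        }
    };
    Vec4 out;
    for (int i = 0; i < 4; ++i)
        out[i] = negate ? -comp(swz[i]) : comp(swz[i]);
    return out;
}

// DP4: one of the vectorial dot-product instructions listed above.
float dp4(const Vec4& a, const Vec4& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
```

In hardware the swizzle costs only wiring on the register read path, which is why the specifications can afford to allow it on every source operand.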

There are a few differences between the vertex and fragment program specifications. Fragments can access texture data with the TEX, TXB and TXP instructions while vertices can't. Texture instructions, in our architecture, use the SIMD ALU for the texture address computation; the texture request is then issued to the Texture Unit, which accesses the Texture Cache and memory and performs the filtering of the sampled texels (as described in section 2). For fragment programs a KILL instruction is defined, used to 'stop' the processing of a fragment (marking it to be culled). Texture and KILL instructions use vectorial operands. An additional instruction modifier, _SAT, is defined only for fragment programs to inexpensively implement the required clamping of color result values to the [0, 1] range.

Our unified shader architecture implements the superset of both the vertex and fragment program models; however, we are currently limited to the ARB vertex and fragment program capabilities of our current OpenGL library. Future support for glSlang programs will enable all our additional shader capabilities (for example vertex texturing) to be used. The unification of the vertex and fragment programming models is a target for future APIs (for example Shader Model 4.0 [4] in Direct3D and OpenGL glSlang) and GPU architectures. Our current legacy support for a non-unified shader pipeline is implemented by capping a unified shader unit to work as a vertex shader unit of a current GPU. Shader unification not only creates a coherent programming model for both fragment and vertex processing, but also simplifies the architecture design and allows better use of the shading hardware, as more shader units can be allocated to process vertices or fragments as the workload balance changes from batch to batch.

The Shader unit works on groups of four threads (each thread corresponding to a vertex or fragment input) because of a requirement of fragment processing (texture lod derivative computation). The same instruction is fetched for the 4 threads in a group and sent to the decode stage. A group of threads may be ready (fetch allowed), blocked (no fetch allowed) or finished (waiting for the thread results to be sent to the next rendering pipeline stage).
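The reason fragments travel in 2x2 quads can be made concrete: the texture lod needs screen-space derivatives, which a quad supplies by differencing the texture coordinates of neighbouring fragments. The lod formula below is the standard mipmap selection rule; the names and quad layout are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Quad {
    // texcoords at (x,y), (x+1,y), (x,y+1), (x+1,y+1)
    float u[4], v[4];
};

float textureLod(const Quad& q) {
    float dudx = q.u[1] - q.u[0], dvdx = q.v[1] - q.v[0];   // horizontal difference
    float dudy = q.u[2] - q.u[0], dvdy = q.v[2] - q.v[0];   // vertical difference
    float rho = std::max(std::sqrt(dudx * dudx + dvdx * dvdx),
                         std::sqrt(dudy * dudy + dvdy * dvdy));
    return std::log2(rho);          // mip level, before clamping to a valid range
}
```

A lone fragment has no neighbours to difference against, which is why whole quads are shaded (and partially covered quads carry helper fragments).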

Our shader architecture supports the fetch and execution of a configurable number (the instruction way) of instructions per cycle and shader execution thread. The current implementation doesn't discriminate between the SIMD and special-operation ALUs, and both are considered replicated for an n-way configuration. Texture instructions are only supported at one per execution thread and cycle, and the shader thread is always blocked after the texture request is issued to the Texture Unit, anticipating a large memory latency. The shader instruction decoder detects dependencies and conflicts in accessing the register bank ports and requests that the shader fetch unit refetch instructions that stall the pipeline.

The shader execution pipeline consists of the following single-cycle stages: a fetch stage, a decode stage, a register read stage, a variable number of execution stages (instruction-dependent, ranging from 1 to 9) and a register write stage. Instructions are always fetched in order. Separate hardware pipelines are implemented to receive the shader inputs (vertex and fragment input attributes) and to send the shader results to the next rendering stages (vertex and fragment output attributes). The instructions are fetched from a small (at most 512 instructions) instruction memory into which shader programs are explicitly loaded before batch rendering starts. Shader program length limitations will be removed in the future by implementing an instruction cache that transparently reads the instructions from memory.

The high level of parallelism inherent to shader processing (all processing elements are always independent) is exploited by implementing multithreading to hide texture (memory) access latency, with up to 256 threads currently configured in our architecture (non-unified vertex shaders only implement a few threads, for hiding instruction execution latency, as they don't support texture access). In the future we will implement per-batch (static) or even dynamic (register-renaming) allocation of temporary registers to each shader thread from a single physical register file; the number of threads in execution will then change with the shader program's requirements for live temporary registers. In the experiments presented in the next section, most fragment shaders don't require more than four live registers, rather than the full 12+ temporary registers required by the ARB specification, keeping the hardware requirements (2048 registers) in line with what could be implemented.

Setting aside the fact that they implement a similar set of instructions, the architectures of the shader units in current GPUs vary widely. Fragment and vertex shaders can be quite different in the number and arrangement of ALUs, the number of supported threads, the support for branching or loops and the access to memory (textures). Shaders from different companies are also quite different, and their true architectures and limitations are never fully disclosed. One characteristic that they share, and that our architecture doesn't fully support yet, is the capability of launching multiple instructions per cycle and execution thread to different ALUs, similar to how a VLIW processor would. Available information suggests that as many as 5 or 6 different ARB-like instructions can be launched at a time. That is possible thanks to ALUs arranged in cascade, multiple paths for texture and scalar instructions, and special SIMD ALUs that support split vector inputs for two different operations.



References

  • General-Purpose Computation Using Graphics Hardware, http://www.gpgpu.org/ [gpgpu]
  • K. Fatahalian, J. Sugerman, P. Hanrahan. Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication. Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, August 29-30, 2004, Grenoble, France. [fatahalian04]
  • Kurt Akeley, Pat Hanrahan. Stanford University CS448a, Fall 2001: Real-Time Graphics Architecture. [akeley01]
  • Microsoft Meltdown 2003, DirectX Next slides. [directx03]





  • Comment:

    This GPU simulator was the original prototype reference for our GPU design. I think it is far more valuable to graphics engineers than to RTL engineers; overall, it lets you understand how a programmable graphics pipeline actually runs. You need a deep understanding of the OpenGL API before reading the source code, otherwise it is impenetrable. So if you don't know computer graphics, don't waste your time on this; study graphics first. :)

    A few impressive diagrams below:



Attila GPU Architecture Diagram



Memory Bus Architecture Diagram




Unified Shader Graphics Pipeline Dataflow Diagram 1





Unified Shader Graphics Pipeline Dataflow Diagram 2




Paper and simulator downloads:

ATTILA SIMULATOR.pdf (920.54 KB)

Project Evaluating ATTILA cycle-accuracy GPU simulator.pdf (867.19 KB)

ATILA-rei-source-17-01-2007.7z (6.57 MB)

ATILA-rei-binaries-17-01-2007.7z (4.23 MB)



This is said to have been written by a single Spanish developer over four years (with funding from Intel along the way), although several names ended up on the final result. I strongly recommend that graphics engineers read this code from beginning to end; once you have, you will have a complete picture of the internals of a traditional GPU. However, the graphics algorithms used in this C model are quite far from the algorithms used in real hardware, and the Shader Unit carries essentially no architectural information, so it is still a very long way from RTL. An RTL engineer who doesn't understand graphics algorithms and vector multithreaded DSP architectures will not be able to produce a GPU RTL implementation from this simulator alone. Still, it is a decent C model and can be used for verification.






From: http://attila.ac.upc.edu/wiki/index.php/Main_Page


2# efeijiang, 2009-11-22 21:48:
Have you read through all of the Attila code? Where should one start reading?

   
3# ic.expert, 2009-11-23 05:14:
I don't know whether he has finished it; I haven't read it all. For those with a weaker graphics background, you can start from the bus part. The Attila GPU bus and the memory controller are written together.

   
4# efeijiang, 2009-11-23 10:56:
Thanks for the pointer!

   
5# agentjones, 2009-11-23 23:12:
A project from the CS department of the Universitat Politècnica de Catalunya, apparently funded by Intel, perhaps also as technology groundwork for Larrabee.

   
6# efeijiang, 2009-11-25 11:09:
I'm getting ready to read this code now. I've downloaded the source; can it run under Windows, or does it have to run under Linux? Any pointers appreciated.

   
7# klonlat, 2009-11-25 11:30:
On Windows, compile with VS2005; there are project files in the win32 directory and it builds without problems. On Linux, build with the makefile; I haven't tried that yet.

   
8# hwdavr, 2010-3-28 20:08:
Compiling with VS2008 has problems. How can they be solved?

   
9# cxp2760, 2010-10-28 16:49:
It compiles but won't run. Does a driver need to be installed?

   
10# ctang112, 2010-11-25 21:31:
Impressive. I didn't know something this good existed!

   
11# hotdog, 2011-4-10 22:53:
This thing is great fun. The question is how I can hook it up to a DirectX game.

   
12# hotdog, 2011-4-16 16:41:
I'm planning to start studying this C model. Anyone else interested?

   
13# endlesswings, 2012-7-20 20:28:
Very useful material!

   
14# cudac, 2012-7-23 13:45:
Thanks for sharing.

   
15# puma8888, 2012-8-9 02:44:
Thanks for sharing.

   


1.查看显卡基本信息 lspci | grep -i nvidia 2.查看显卡驱动版本 nvidia-smi -a 3.查看gpu使用情况 nvidia-smi (spam) [dongli@dt-gpu-1 train]$ nvidia-smi Fri Sep 27 16:42:33 2019 +----------------------------------------