

* Always topical questions: what should you pay attention to when choosing a processor so as not to make a mistake?

Our goal in this article is to describe all the factors that affect processor performance and other operational characteristics.

Surely it is no secret that the processor is the main computing unit of a computer - you could even say its most important part.

It handles almost all of the processes and tasks that run on the computer.

Be it watching video, listening to music, surfing the Internet, reading and writing to memory, processing 3D and video, games - and much more.

Therefore, the choice of the central processor (CPU) should be approached very carefully. It may turn out that you decide to install a powerful video card with a processor that does not match its level. In that case the processor will not unlock the video card's potential, which will slow it down. The processor will be fully loaded and literally boiling, while the video card waits its turn, working at 60-70% of its capabilities.

That is why, when putting together a balanced computer, you should not neglect the processor in favor of a powerful video card. The processor must be powerful enough to unleash the video card's potential, otherwise it is simply wasted money.

Intel vs. AMD

* forever playing catch-up

Intel has enormous human resources and almost inexhaustible finances. Many innovations in the semiconductor industry and many new technologies come from this company. Intel's processors and developments are, on average, 1-1.5 years ahead of the work of AMD's engineers. But, as you know, you have to pay for the opportunity to own the most modern technologies.

Intel's processor pricing is based not only on the number of cores and the amount of cache, but also on the "freshness" of the architecture, performance per clock and per watt, and the process technology. The importance of cache memory, the "subtleties of the process technology," and other key processor characteristics are discussed below. You will also have to pay extra for technologies such as an unlocked frequency multiplier.

AMD, unlike Intel, strives to keep its processors affordable for the end user and to maintain a sensible pricing policy.

You could even say that AMD is a "people's brand": in its price list you will find what you need at a very attractive price. Usually, about a year after a new technology appears at Intel, an analogous technology appears from AMD. If you are not chasing the very highest performance and pay more attention to the price tag than to having the latest technologies, then AMD's products are just for you.

AMD's pricing is based mostly on the number of cores and only a little on the amount of cache memory or the presence of architectural improvements. In some cases you will have to pay a bit extra for a third level of cache memory (Phenom has an L3 cache; Athlon makes do with only two levels). But sometimes AMD pampers its fans with the ability to unlock cheaper processors into more expensive ones: you can unlock cores or cache memory and turn an Athlon into a Phenom. This is possible thanks to the modular architecture: some cheaper models are not separate chips at all - AMD simply disables (programmatically) some of the on-chip blocks of the more expensive ones.

The cores themselves remain practically unchanged; only their number differs (true for processors from 2006-2011). Thanks to the modularity of its processors, the company does an excellent job of selling rejected chips: with some blocks disabled, they become processors from a less productive line.

For many years the company worked on a completely new architecture code-named Bulldozer, but at its launch in 2011 the new processors did not show their best performance. AMD blamed operating systems for not understanding the architectural features of its dual cores and "other multithreading."

According to company representatives, special fixes and patches were to be expected before the full performance of these processors could be experienced. However, at the beginning of 2012 company representatives postponed the release of the update supporting the Bulldozer architecture to the second half of the year.

Processor frequency, number of cores, multithreading.

In the days of the Pentium 4 and earlier, clock frequency was the main measure of processor performance when choosing a processor.

This is not surprising, since processor architectures were designed specifically to reach high frequencies; this was especially evident in the Pentium 4 with its NetBurst architecture. A high frequency was not effective with the long pipeline used in that architecture: even an Athlon XP at 2 GHz outperformed a Pentium 4 at 2.4 GHz. So it was pure marketing. After this mistake, Intel acknowledged its errors and returned to the right path, working not on the frequency component but on performance per clock cycle. The NetBurst architecture had to be abandoned.

What does multi-core give us?

In multi-threaded applications, a quad-core processor at 2.4 GHz is, in theory, roughly equivalent to a single-core processor at 9.6 GHz or a dual-core processor at 4.8 GHz. But that is only in theory. In practice, two dual-core processors on a two-socket motherboard will be faster than one quad-core processor at the same clock frequency: bus bandwidth and memory latency limitations make themselves felt.

* assuming the same architecture and amount of cache memory

Multiple cores make it possible to execute instructions and calculations in parts. For example, suppose you need to perform three arithmetic operations. The first two are executed on separate processor cores and the results are written to cache memory, where any free core can perform the next operation on them. The system is very flexible, but without proper optimization it may not bring any benefit. That is why optimizing for multi-core processor architectures within the OS environment is so important.
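To make the idea concrete, here is a minimal sketch in C++ (illustrative only, not tied to any particular processor): two threads compute partial sums on separate cores, and the combining step runs once both results are ready.

    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> data(1000000, 1);
        long long left = 0, right = 0;
        auto mid = data.begin() + data.size() / 2;

        // the first two operations run in parallel on different cores
        std::thread t1([&] { left  = std::accumulate(data.begin(), mid, 0LL); });
        std::thread t2([&] { right = std::accumulate(mid, data.end(), 0LL); });
        t1.join();
        t2.join();

        // the third operation combines the partial results once both are ready
        long long total = left + right;
        return total == 1000000 ? 0 : 1;
    }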

Applications that "love" and use multithreading: archivers, video players and encoders, antiviruses, defragmenters, graphics editors, browsers, Flash.

Among the "lovers" of multithreading you can also count operating systems such as Windows 7 and Windows Vista, as well as many Linux-based OSes, which run noticeably faster on a multi-core processor.

For most games a dual-core processor at a high frequency is quite enough. Now, however, more and more games are being released that are "tuned" for multithreading. Take sandbox games such as GTA 4 or Prototype: on a dual-core processor clocked below 2.6 GHz you will not feel comfortable, as the frame rate drops below 30 frames per second. Although in this case the reason is most likely the games' "weak" optimization, lack of time, or the clumsy hands of those who ported the games from consoles to PC.

When buying a new processor for games, you should now look at processors with 4 or more cores. Still, do not dismiss dual-core processors from the "top category": in some games they sometimes feel better than some multi-core ones.

Processor cache memory.

Cache memory is a dedicated area of the processor die in which intermediate data passed between the processor cores, RAM, and other buses is processed and stored.

It operates at a very high clock speed (usually at the frequency of the processor itself), has very high bandwidth, and the processor cores work with it directly (L1).

When there is too little of it, the processor can sit idle during time-consuming tasks, waiting for new data to arrive in the cache for processing. Cache memory also stores frequently repeated data which, when needed, can be restored quickly without unnecessary calculations, so the processor does not waste time on them again.

Performance also benefits from the cache memory being unified, so that all cores can use its data equally. This provides additional opportunities for multi-threaded optimization.

This approach is now used for the third-level cache. Earlier, Intel had processors with a shared L2 cache (C2D E7***, E8***), which is how this way of increasing multithreaded performance first appeared.

When overclocking the processor, the cache memory can become a weak point, preventing the processor from being overclocked beyond the maximum frequency at which the cache still operates without errors. However, the upside is that it runs at the same frequency as the overclocked processor.

In general, the more cache memory, the faster the CPU. In which applications exactly?

Cache memory is actively used in all applications that work with a lot of floating-point data, instructions, and threads. Archivers, video encoders, antiviruses, graphics editors, and the like love cache memory very much.

Games also benefit from a large cache - especially strategies, driving simulators, RPGs, sandbox games, and any game with many small details, particles, geometry elements, information flows, and physics effects.

Cache memory plays a very significant role in unlocking the potential of systems with two or more video cards. After all, some share of the load falls on the interaction of the processor cores with each other and on feeding the streams of several video chips. This is where the organization of the cache memory matters, and a large L3 cache is very useful.

Cache memory is always protected against possible errors (ECC); when an error is detected, it is corrected. This is very important, because a small error in the cache can, during processing, turn into a huge, cascading error that brings down the entire system.

Proprietary technologies.

Hyper-Threading (HT) –

The technology was first used in Pentium 4 processors, but it did not always work correctly and often slowed the processor down more than it sped it up. The reason was the overly long pipeline and an imperfect branch prediction system. The technology is used by Intel and has no analogues yet, unless you count what AMD's engineers implemented in the Bulldozer architecture.

The principle is that each physical core presents two computational threads instead of one. That is, if you have a 4-core processor with HT (Core i7), you have 8 virtual threads.

The performance gain is achieved because data can enter the pipeline in the middle of it, not necessarily at the beginning. If any processor units capable of performing the required action are idle, they receive the task for execution. The gain is not the same as with real physical cores, but it is comparable (~50-75%, depending on the type of application). It occasionally happens that HT affects performance negatively in some applications. This is due to poor optimization of those applications for the technology: they cannot tell that the extra threads are "virtual," and there is nothing to balance the load across the threads evenly.

Turbo Boost is a very useful technology that raises the operating frequency of the most heavily used processor cores, depending on their workload. It is very useful when an application cannot use all 4 cores and loads only one or two of them; their operating frequency then increases, which partially compensates for the performance. AMD's analogue of this technology is Turbo Core.

SSE, 3DNow! instructions. These are designed to speed up the processor in multimedia calculations (video, music, 2D/3D graphics, etc.), as well as the work of programs such as archivers and programs for working with images and video (provided those programs support the instructions).

3DNow! is a fairly old AMD technology that contains additional instructions for processing multimedia content beyond the first version of SSE.

* Namely, streaming processing of single-precision floating-point numbers.

Support for the newest version is a big plus: with proper software optimization, the processor performs certain tasks more efficiently. AMD's processors have similarly named, but slightly different, instruction sets.

* Example: SSE 4.1 (Intel) vs. SSE4A (AMD).

In addition, these instruction sets are not identical. These are analogs in which there are slight differences.

Cool'n'Quiet, SpeedStep, CoolCore, Enhanced Halt State (C1E), etc.

At low load, these technologies reduce the processor frequency by lowering the multiplier and core voltage, disabling part of the cache, and so on. This lets the processor heat up much less, consume less power, and make less noise. If power is needed, the processor returns to its normal state in a split second. In the default BIOS settings they are almost always enabled; if desired, they can be disabled to reduce possible "stutters" when switching states in 3D games.

Some of these technologies control the speed of the fans in the system. For example, if the processor does not need increased heat removal and is not loaded, the processor fan speed is reduced (AMD Cool'n'Quiet, Intel SpeedStep).

Intel Virtualization Technology and AMD Virtualization.

These hardware technologies allow special programs to run several operating systems at once without any significant loss in performance. They are also used for the correct operation of servers, which often have more than one OS installed.

Execute Disable Bit and No eXecute Bit are technologies designed to protect the computer from virus attacks and software errors that can crash the system through a buffer overflow.

Intel 64, AMD64, EM64T: this technology allows the processor to work both in an OS with a 32-bit architecture and in one with a 64-bit architecture. For an ordinary user, the benefit of a 64-bit system is that it can use more than about 3.25 GB of RAM. On 32-bit systems a larger amount of RAM cannot be used because of the limited amount of addressable memory*.

Most 32-bit applications can be run on a 64-bit OS.

* What can you do - back in 1985 no one could even imagine volumes of RAM that were gigantic by the standards of the time.

Additionally.

A few words about the manufacturing process (process technology).

This point deserves close attention. The finer the manufacturing process, the less energy the processor consumes and, as a result, the less it heats up. Among other things, it also has more headroom for overclocking.

The finer the process, the more can be "packed" into the chip (and not only that), increasing the processor's capabilities. Heat dissipation and power consumption also decrease proportionally, thanks to lower current leakage and a smaller core area. You might notice a tendency for power consumption to grow with each new generation of the same architecture on a new process, but that is not quite so: manufacturers simply push for even greater performance and step over the heat-dissipation line of the previous generation of processors, because the growth in transistor count is not proportional to the shrink of the process.

The video core built into the processor.

If you do not need an integrated video core, you should not buy a processor with one. You will only get worse heat removal, extra heat (not always), worse overclocking potential (not always), and overpaid money.

In addition, the video cores built into processors are suitable only for loading the OS, browsing the Internet, and watching video (and even then not video of just any quality).

Market trends are changing, though, and the chance to buy a high-performance Intel processor without a video core comes up less and less often. The policy of forcibly imposing an integrated video core appeared with the Intel processors code-named Sandy Bridge, whose main innovation was a video core built on the same process technology. The video core sits on the same die as the processor, rather than separately as in previous generations of Intel processors. For those who do not use it, the disadvantages are a certain overpayment for the processor and the displacement of the heat source relative to the center of the heat-spreader cover. There are also pluses, however: even an otherwise unused video core can be used for very fast video encoding with Quick Sync technology, together with special software that supports it. In the future, Intel promises to expand the use of the integrated video core for parallel computing.

Processor sockets. Platform lifespan.


Intel has a harsh policy for its platforms. The lifespan of each one (the period from the start to the end of processor sales for it) usually does not exceed 1.5-2 years. In addition, the company has several platforms developing in parallel.

AMD has the opposite, compatibility-oriented policy. Its AM3 platform accepts all future generations of processors that support DDR3. Even when the platform moves on to AM3+ and later, either new processors will still be released for AM3, or the new processors will be compatible with old motherboards, and it will be possible to make an upgrade that is painless for the wallet by changing only the processor (without changing the motherboard, RAM, and so on) and flashing the motherboard. The only compatibility nuances arise when the memory type changes, since a different memory controller built into the processor is required, so compatibility is limited and not supported by all motherboards. But in general, for an economical user, or one who is not used to replacing the platform completely every two years, the choice of processor manufacturer is clear: AMD.

Cooling the processor.

As standard, a boxed processor comes with a BOX cooler that will simply cope with its task. It is a piece of aluminum with a not very large dissipation area. Efficient coolers based on heat pipes with fins attached to them are designed for highly efficient heat removal. If you do not want to hear unnecessary fan noise, you should buy an alternative, more efficient cooler with heat pipes, or a closed- or open-loop liquid cooling system. Such cooling systems will also give the processor headroom for overclocking.

Conclusion.

We have covered all the important aspects affecting processor performance and operation. To recap, here is what you should pay attention to:

  • Manufacturer
  • Processor architecture
  • Process technology
  • Clock frequency
  • Number of cores
  • Cache size and type
  • Technology and instruction support
  • Cooling quality

We hope this material will help you understand and decide on the choice of a processor that meets your expectations.

saul September 9, 2015 at 01:38 PM

Implementing a multi-threaded game engine architecture


With the advent of multi-core processors, it became necessary to build game engines on a parallel architecture. Using all the processors in the system - both the graphics processor (GPU) and the central processor (CPU) - opens up far more possibilities than a single-threaded, GPU-only engine. For example, by using more CPU cores you can improve visual effects by increasing the number of physical objects in the game, and achieve more realistic character behavior through advanced artificial intelligence (AI).
Let's consider the features of implementing a multi-threaded game engine architecture.

1. Introduction

1.1. Overview

The multi-threaded architecture of the game engine allows the capabilities of all the processors on the platform to be used to the fullest. It assumes parallel execution of different functional blocks on all available processors. However, such a scheme turns out not to be so easy to implement. Individual elements of a game engine often interact with each other, which can lead to errors when they run simultaneously. To handle such scenarios, the engine provides special data-synchronization mechanisms that avoid possible locks. It also implements methods for synchronizing data in parallel, so that execution time is kept to a minimum.

To follow the material presented here, you need to be well versed in modern methods of creating computer games and multithreading support in game engines, or simply want to improve application performance in general.

2. Parallel execution state

Concurrent execution state is a key concept in multithreading. Only by dividing the game engine into separate systems, each operating in its own mode and practically not interacting with the rest of the engine, can we achieve the greatest efficiency of parallel computing and reduce the time spent on synchronization. It is not possible to isolate individual parts of the engine completely by excluding all shared resources. However, for operations such as retrieving the position or orientation of objects, individual systems can use local copies of the data rather than the shared resources themselves. This minimizes data dependencies between different parts of the engine. Changes to shared data made by an individual system are reported to a state manager, which queues them. This is called messaging mode. This mode assumes that, upon completing their tasks, engine systems receive change notifications and update their internal data accordingly. This mechanism can significantly reduce synchronization time and the systems' dependence on each other.

2.1 Execution states

For the execution state manager to work efficiently, it is recommended that operations be synchronized to a specific clock. This allows all systems to run simultaneously. The clock rate does not have to correspond to the frame rate, and the duration of a clock cycle need not depend on the frequency; it can be chosen so that one clock cycle corresponds to the time required to process one frame (regardless of its size). In other words, the frequency or duration of the clock ticks is determined by the specific implementation of the state manager. Figure 1 shows a "free" step mode of operation, which does not require all systems to complete an operation in the same clock cycle. The mode in which all systems complete execution within one clock cycle is called "hard" step mode; it is shown schematically in Figure 2.


Figure 1. Execution state in free step-by-step mode

2.1.1. Free step-by-step mode
In a free step-by-step mode, all systems operate continuously for a predetermined period of time required to complete the next portion of calculations. However, the name “free” should not be taken literally: systems are not synchronized at an arbitrary moment in time, they are only “free” in choosing the number of clock cycles required to complete the next stage.
Generally, in this mode, it is not enough to send a simple notification of a state change to the state manager. The updated data must also be sent. This is because the system that changed the shared data might be in a running state while another system waiting for that data is about to update. In this case, more memory is required as more copies of the data have to be made. Therefore, the "free" regime cannot be considered a universal solution for all occasions.
2.1.2. Hard step mode
In this mode, the execution of tasks on all systems is completed in one cycle. This mechanism is simpler to implement and does not require the transfer of updated data along with the notification. Indeed, if necessary, one system can simply request new values ​​from another system (of course, at the end of the run cycle).
In hard mode, you can implement a pseudo-free stepping mode of operation by distributing computations between different steps. In particular, this may be required for AI calculations, where an initial "common goal" is calculated in the first clock cycle, which is gradually refined in the following stages.


Figure 2. Execution state in hard step-by-step mode

2.2. Data sync

Changes to shared data by multiple systems can lead to conflicting changes. In this case, the messaging system needs an algorithm for choosing the correct final value. There are two main approaches, based on the following criteria.
  • Time: the final value is the last change made.
  • Priority: the final value is the change made by the system with the highest priority. If the systems have equal priority, the time of the change can also be taken into account.
All data made obsolete (by either criterion) can simply be overwritten or dropped from the notification queue.
Because the final value can vary depending on the order in which changes are made, it can be very difficult to use relative values for shared data. In such cases absolute values should be used. Then, when updating local data, systems simply replace the old values with new ones. The optimal solution is to choose absolute or relative values depending on the specific situation. For example, shared data such as position and orientation should be absolute, because the order in which changes are applied matters. Relative values can be used, for example, for a particle generation system, since all the information about the particles is stored only inside it.
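A minimal sketch of such a merge rule (the structure and field names are illustrative, not taken from a real engine) might look like this:

    #include <cstdint>

    // One pending change to a piece of shared data (absolute value).
    struct Change {
        std::uint64_t tick;      // when the change was made
        int           priority;  // priority of the system that made it
        float         value;     // the new absolute value (e.g. a position component)
    };

    // Choose the winning change: higher priority wins, and on equal
    // priority the later change wins, as described above.
    inline const Change& Resolve(const Change& a, const Change& b) {
        if (a.priority != b.priority)
            return (a.priority > b.priority) ? a : b;
        return (a.tick >= b.tick) ? a : b;
    }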

3. Engine

When developing the engine, the main focus is on the flexibility required to further expand its functionality. This will optimize it for use under certain constraints (eg memory).
The engine can be roughly divided into two parts: framework and managers. The framework (see section 3.1) includes parts of the game that are replicated at runtime, that is, they exist in several instances. It also includes the elements involved in the execution of the main game loop. Managers (see section 3.2) are Singleton objects that are responsible for executing the logic of the game.
Below is a diagram of the game engine.


Figure 3. General architecture of the engine

Please note that functional game modules, or systems, are not part of the engine. The engine only unites them together, acting as a connecting element. This modular organization makes it possible to load and unload systems as needed.

The interaction of the engine and systems is carried out using interfaces. They are implemented in such a way as to provide the engine with access to system functions, and systems - to the engine managers.
A detailed engine diagram is provided in Appendix A, Engine Diagram.

Virtually all systems are independent of each other (see Section 2, "Concurrent Execution State"), that is, they can perform actions in parallel without affecting the operation of other systems. However, any data change will entail certain difficulties, since the systems will have to interact with each other. The exchange of information between systems is necessary in the following cases:

  • to inform another system about a change in general data (for example, the position or orientation of objects);
  • to perform functions that are not available in the given system (for example, the AI system calls the system that computes the geometric or physical properties of an object to perform a ray intersection test).
In the first case, the state manager described in the previous section can be used to control the exchange of information. (For more information on the state manager, see Section 3.2.2, "The state manager".)
In the second case, a special mechanism is needed that makes the services of one system available to another. A full description of this mechanism is given in Section 3.2.3, "Service Manager".

3.1. Framework

The framework is used to combine all the elements of the engine. This is where the engine is initialized, with the exception of managers, which are instantiated globally. It also stores information about the scene. To achieve greater flexibility, the scene is implemented as a so-called generic scene that contains generic objects. They are containers that combine the various functional parts of the scene. For details, see section 3.1.2.
The main game loop is also implemented in the framework. It can be schematically represented as follows.


Figure 4. The main loop of the game

The engine runs in a windowed environment, so the first step in the game loop needs to process any pending messages from the OS windows. If this is not done, the engine will not respond to OS messages. In the second step, the scheduler assigns tasks using the task manager. This process is detailed in section 3.1.1 below. After that, the state manager (see section 3.2.2) sends information about the changes made to the engine systems, the work of which it can affect. At the last step, depending on the execution status, the framework determines whether to terminate or continue the engine, for example, to move to the next scene. Information about the state of the engine is stored by the environment manager. For more details see section 3.2.4.

3.1.1. Scheduler
The scheduler generates a reference execution clock at a specified rate. If the execution mode requires the next operation to start immediately after the previous one completes, without waiting for the end of the clock cycle, the frequency can be unlimited.
On a clock signal, the scheduler uses the task manager to put the systems in run mode. In free stepping mode (Section 2.1.1), the scheduler polls the systems to determine how many clock cycles they will need to complete the task. Based on the results of the survey, the scheduler determines which systems are ready for execution and which ones will complete their work at a specific time step. The scheduler can change the number of ticks if a system takes longer to execute. In hard stepping mode (Section 2.1.2), all systems start and end at the same clock cycle, so the scheduler waits for all systems to finish executing.
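A hard-step tick can be sketched roughly as follows (illustrative C++ only; the real scheduler also handles the free step mode and tick budgeting described above): every system task is dispatched at the start of the clock cycle, and the scheduler waits for all of them to finish before the tick ends.

    #include <functional>
    #include <future>
    #include <vector>

    void RunHardStepTick(const std::vector<std::function<void()>>& systemTasks) {
        std::vector<std::future<void>> pending;
        pending.reserve(systemTasks.size());
        for (const auto& task : systemTasks)
            pending.push_back(std::async(std::launch::async, task));  // hand tasks to worker threads
        for (auto& f : pending)
            f.wait();                                                 // all systems end on the same tick
    }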
3.1.2. Universal scene and objects
Generic scene and objects are containers for functionality implemented in other systems. They are intended solely to interact with the engine and do not perform any other function. However, they can be extended to take advantage of functions available to other systems. This allows for loose coupling. Indeed, a universal scene and objects can use the properties of other systems without being bound to them. It is this property that eliminates the dependence of systems on each other and enables them to work simultaneously.
The diagram below shows an extension of a universal scene and object.


Figure 5. Expansion of the universal scene and object

Let's look at how extensions work with an example. Suppose a universal scene is extended to use graphic, physical, and other properties. In that case, the "graphic" part of the extension will be responsible for initializing the display, and the "physical" part for implementing physical laws for rigid bodies, for example gravity. Scenes contain objects, so a universal scene will also include multiple universal objects. Universal objects can likewise be extended to use graphical, physical, and other properties. For example, drawing an object on the screen will be implemented by the graphics extension functions, and the calculation of rigid-body interactions by the physical ones.

A detailed diagram of the interaction of the engine and systems is given in Appendix B, "Scheme of the interaction of the engine and systems."
It should be noted that the universal scene and universal object are responsible for registering all of their "extensions" with the state manager, so that every extension can be notified of changes made by other extensions (that is, by other systems). An example is a graphics extension registered to receive notifications of position and orientation changes made by the physics extension.
For details on system components, see Section 5.2, System Components.

3.2. Managers

Managers control the work of the engine. They are Singleton objects, that is, each type of manager is available in only one instance. This is necessary because duplicating manager resources will inevitably lead to redundancy and negatively impact performance. In addition, managers are responsible for implementing common functions for all systems.
3.2.1. Task manager
The task manager is responsible for managing system tasks in the thread pool. The thread pool creates one thread for each processor to ensure optimal n-fold scaling and to prevent unnecessary threads from being assigned, eliminating unnecessary task switching overhead in the operating system.

The scheduler provides the task manager with a list of tasks to execute, as well as information about which tasks to wait for. It receives this data from the various systems. Each system gets only one task to execute; this approach is called functional decomposition. However, for data processing, each such task can be split into an arbitrary number of subtasks (data decomposition).
Below is an example of distributing tasks between threads for a quad-core system.


Figure 6. Example of a thread pool used by a task manager

In addition to processing scheduler requests for access to main tasks, the task manager can work in the initialization mode. It sequentially polls the systems from each thread so that they can initialize the local data stores required for operation.
Tips for implementing a task manager are given in Appendix D, Tips for Implementing Tasks.
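As an illustration of data decomposition (a sketch with made-up names, not the actual task manager), a single system task over N objects can be split into one subtask per logical processor; a real task manager would keep the worker threads alive in a pool instead of recreating them each call.

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    template <typename Fn>
    void ForEachObject(std::size_t objectCount, Fn perObject) {
        const std::size_t workers =
            std::max<std::size_t>(1, std::thread::hardware_concurrency());
        std::vector<std::thread> pool;
        for (std::size_t w = 0; w < workers; ++w) {
            pool.emplace_back([=] {
                // each worker processes an interleaved slice of the objects
                for (std::size_t i = w; i < objectCount; i += workers)
                    perObject(i);
            });
        }
        for (auto& t : pool)
            t.join();
    }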

3.2.2. State manager
The state manager is part of the messaging mechanism. It tracks changes and sends notifications about them to all systems that may be affected. In order not to send unnecessary notifications, the state manager stores information about which systems to notify in each particular case. This mechanism is implemented using the Observer pattern (see Appendix C, "Observer (Design Pattern)"). In short, the pattern uses an "observer" that watches for any changes to a subject, with a change controller acting as an intermediary between them.

The mechanism works as follows:
  1. The observer tells the change controller (or state manager) which subjects it wants to track for changes.
  2. The subject notifies the controller of all its changes.
  3. On a signal from the framework, the controller notifies the observer about the subject's changes.
  4. The observer sends a request to the subject to receive the updated data.
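In code, the mechanism can be sketched like this (class and method names are illustrative, not from the engine described in this article):

    #include <vector>

    class ISubject {
    public:
        virtual ~ISubject() = default;
    };

    class IObserver {
    public:
        virtual ~IObserver() = default;
        virtual void OnChange(ISubject& subject) = 0;   // step 4: pull the updated data from the subject
    };

    // The change controller sits between subjects and observers: it records who
    // watches whom (step 1), queues reported changes (step 2), and distributes
    // notifications on a signal from the framework (step 3).
    class ChangeController {
    public:
        void Register(ISubject* subject, IObserver* observer) {
            registrations_.push_back({subject, observer});
        }
        void PostChange(ISubject* subject) {
            queue_.push_back(subject);
        }
        void Distribute() {
            for (ISubject* changed : queue_)
                for (const Registration& r : registrations_)
                    if (r.subject == changed)
                        r.observer->OnChange(*changed);
            queue_.clear();
        }
    private:
        struct Registration { ISubject* subject; IObserver* observer; };
        std::vector<Registration> registrations_;
        std::vector<ISubject*>    queue_;
    };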

In free stepping mode (see Section 2.1.1), the implementation of this mechanism becomes somewhat more complicated. First, the updated data will have to be sent along with the change notification. In this mode, send on demand is not applicable. Indeed, if at the time of receiving the request, the system responsible for the changes is not yet finished executing, it will not be able to provide updated data. Second, if a system is not yet ready to receive changes at the end of a clock cycle, the state manager will have to hold on to the changed data until all systems registered to receive it are ready.

The framework provides two state managers for this: for processing changes at the scene level and at the object level. Typically, messages regarding scenes and objects are independent of each other, so using two separate managers eliminates the need to process unnecessary data. But if the scene needs to take into account the state of an object, you can register it to receive notifications about its changes.

To avoid unnecessary synchronization, the state manager generates a change notification queue separately for each thread created by the task manager. Therefore, no synchronization is required when accessing the queue. Section 2.2 describes a method that can be used to merge queues after execution.
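A sketch of such per-thread queues (the types are illustrative only): each worker appends to its own queue without locking, and the queues are merged on a single thread once the tick is over.

    #include <cstddef>
    #include <vector>

    struct Notification { int subjectId; unsigned changeType; };

    class PerThreadQueues {
    public:
        explicit PerThreadQueues(std::size_t threadCount) : queues_(threadCount) {}

        // Called only from the worker thread that owns the queue: no locking needed.
        void Post(std::size_t threadIndex, const Notification& n) {
            queues_[threadIndex].push_back(n);
        }

        // Called after all tasks have finished; see section 2.2 for resolving conflicts.
        std::vector<Notification> Merge() {
            std::vector<Notification> all;
            for (auto& q : queues_) {
                all.insert(all.end(), q.begin(), q.end());
                q.clear();
            }
            return all;
        }

    private:
        std::vector<std::vector<Notification>> queues_;
    };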


Figure 7. Notification of internal changes to a generic object

Change notifications do not need to be sent sequentially. There is a way to send them in parallel. When performing a task, the system works with all of its objects. For example, as physical objects interact with each other, the physical system controls their movement, the calculation of collisions, new acting forces, etc. When receiving notifications, the system object does not interact with other objects in its system. It interacts with its associated generic object extensions. This means that generic objects are now independent of each other and can be updated at the same time. This approach does not exclude edge cases that should be considered during the synchronization process. However, it allows the use of parallel execution mode when it seemed that you can only act sequentially.

3.2.3. Service manager
The service manager provides systems with access to features on other systems that would otherwise be unavailable to them. It is important to understand that functions are accessed through interfaces, not directly. Information about the system interfaces is also stored in the service manager.
To eliminate the dependence of systems on each other, each of them has only a small set of services. In addition, the ability to use a particular service is not determined by the system itself, but by the service manager.


Figure 8. Example of a service manager

The service manager has another function as well: it provides systems with access to the properties of other systems. Properties are system-specific values that are not passed through the messaging system - for example, the screen resolution in the graphics system or the magnitude of gravity in the physics system. The service manager gives systems access to this data but does not control it directly. It puts property changes into a special queue and publishes them only after sequential execution. Note that access to another system's properties is rarely needed and should not be abused. For example, you might need it to turn wireframe mode on and off in the graphics system from the console window, or to change the screen resolution at the player's request from the user interface. This capability is mainly used to set parameters that do not change from frame to frame.

3.2.4. Environment manager
The environment manager provides the engine's runtime environment. Its functions can be roughly divided into the following groups.
  • Variables: the names and values of common variables used by all parts of the engine. Usually the values of the variables are set when a scene is loaded or by certain user settings. The engine and the various systems can access them by sending a request.
  • Execution: Execution data, such as the completion of a scene or program execution. These parameters can be set and requested by both the systems themselves and the engine.
3.2.5. Platform manager
The platform manager implements an abstraction for operating system calls and also provides additional functionality beyond the simple abstraction. The advantage of this approach is that it encapsulates several typical functions within a single call. That is, they do not have to be implemented separately for each caller, overloading it with details about OS calls.
Consider, as an example, calling the platform manager to load a system dynamic library. It not only loads the system, but also gets the function entry points and calls the library initialization function. The manager also stores the library descriptor and unloads it after the engine is finished.

The platform manager is also responsible for providing information about the processor, such as the supported SIMD instructions, and for initializing a specific operating mode for processes. Systems cannot use any other functions to make such queries.

4. Interfaces

Interfaces are the means of communication between the framework, the managers, and the systems. The framework and the managers are part of the engine, so they can interact with each other directly. Systems do not belong to the engine; moreover, they all perform different functions, which makes a single method of interacting with them necessary. Since systems cannot interact with the managers directly, they must be given a different means of access. However, not all manager functions should be exposed to systems; some are available only to the framework.

Interfaces define the set of functions required for a standard method of access. They eliminate the need for the framework to know the implementation details of specific systems, since it can interact with them only through a specific set of calls.

4.1. Subject and Observer Interfaces

The main purpose of the subject and observer interfaces is to register which observers should receive notifications and to send those notifications. Registering an observer and breaking the connection with it are standard functions for all subjects and are part of their interface implementation.

4.2. Manager interfaces

Managers, although they are Singleton objects, are directly accessible only to the framework. Other systems can only access managers through interfaces that represent only a fraction of their overall functionality. After initialization, the interface is passed to the system, which uses it to work with certain manager functions.
There is no single interface for all managers. Each of them has its own separate interface.

4.3. System interfaces

For the framework to be able to access the components of a system, it needs interfaces. Without them, support for each new system would have to be implemented in the engine separately.
Each system includes four components, so there are four interfaces: system, scene, object, and task. For a detailed description, see Section 5, "Systems". Interfaces are the means of accessing the components. The system interfaces allow you to create and delete scenes. The scene interfaces, in turn, allow you to create and destroy objects, as well as request information about the system's main task. The task interface is mainly used by the task manager when submitting tasks to the thread pool.
Since the scene and the object, as part of the system, must interact with each other and with the universal scene and the object to which they are attached, their interfaces are also created based on the interfaces of the subject and the observer.
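The four interfaces can be sketched as follows (the names and signatures are illustrative; the actual engine described here may define them differently):

    class ISystemScene;
    class ISystemObject;
    class ISystemTask;

    class ISystem {                               // created when the system module is loaded
    public:
        virtual ~ISystem() = default;
        virtual const char*   GetName() const = 0;            // e.g. "Graphics", "Physics"
        virtual ISystemScene* CreateScene() = 0;
        virtual void          DestroyScene(ISystemScene*) = 0;
    };

    class ISystemScene {                          // per-scene resources of one system
    public:
        virtual ~ISystemScene() = default;
        virtual ISystemObject* CreateObject(const char* name) = 0;
        virtual void           DestroyObject(ISystemObject*) = 0;
        virtual ISystemTask*   GetSystemTask() = 0;            // the system's main task
    };

    class ISystemObject {                         // one system's view of a scene object
    public:
        virtual ~ISystemObject() = default;
    };

    class ISystemTask {                           // invoked by the task manager each tick
    public:
        virtual ~ISystemTask() = default;
        virtual void Update(float deltaTime) = 0;
    };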

4.4. Change interfaces

These interfaces are used to transfer data between systems. All systems making changes of a particular type must implement such an interface. Geometry is an example. The geometry interface includes methods for determining the position, orientation, and scale of an element. Any system that makes changes to geometry must implement such an interface so that information about other systems is not required to access the changed data.
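A geometry change interface of this kind could be sketched as follows (a hypothetical example; the types and method names are not from the original article):

    struct Vector3    { float x, y, z; };
    struct Quaternion { float x, y, z, w; };

    // Implemented by any system that changes an object's geometry, so that other
    // systems can read the new transform without knowing who produced it.
    class IGeometryChange {
    public:
        virtual ~IGeometryChange() = default;
        virtual Vector3    GetPosition()    const = 0;
        virtual Quaternion GetOrientation() const = 0;
        virtual Vector3    GetScale()       const = 0;
    };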

5. Systems

Systems are the part of the engine that is responsible for implementing game functionality. They perform all the basic tasks without which the engine would not make sense. Interaction between the engine and systems is carried out using interfaces (see Section 4.3, “System Interfaces”). This is necessary in order not to overload the engine with information about various types of systems. The interfaces make it much easier to add a new system because the engine does not need to consider all the implementation details.

5.1. Types

Engine systems can be roughly divided into several predefined categories corresponding to standard game components, for example: geometry, graphics, physics (rigid-body collisions), sound, input processing, AI, and animation.
Systems with non-standard functions fall into a separate category. It is important to understand that any system that modifies data for a particular category must be aware of the interface of that category, since the engine does not provide such information.

5.2. System components

Several components need to be implemented for each system. Some of them are: system, scene, object, and task. All of these components are used to interact with different parts of the engine.
The diagram below depicts the interactions between the various components.


Figure 9. System components

A detailed diagram of the connections between the engine systems is given in Appendix B, "Scheme of the interaction between the engine and systems."

5.2.1. System
The “system” component, or simply the system, is responsible for initializing system resources, which will practically not change during the engine's operation. For example, the graphics system analyzes resource addresses to determine where they are located and to speed up loading when using the resource. It also sets the screen resolution.
The system is the main entry point for the framework. It provides information about itself (for example, the type of system), as well as methods for creating and deleting scenes.
5.2.2. Scene
The scene component, or system scene, is responsible for managing the resources associated with the current scene. The universal scene uses system scenes to extend its own functionality with theirs. An example is a physics scene that is used to create a new game world and, when the scene is initialized, defines the forces of gravity in it.
Scenes provide methods for creating and destroying objects, as well as a “task” component for processing the scene and a method for accessing it.
5.2.3. Object
The object component, or system object, belongs to the scene and is usually associated with what the user sees on the screen. A generic object uses a system object to extend functionality by exposing its properties as its own.
An example is the geometric, graphic, and physical extension of a universal object to display a wooden beam on the screen. The geometric properties will include the object's position, orientation, and scale. The graphics system will use a mesh to display it. And the physics system will give it rigid-body properties for calculating interactions with other bodies and the acting forces of gravity.

In certain cases, a system object needs to account for changes to a generic object or one of its extensions. For this purpose, you can create a special relationship that will track the changes made.

5.2.4. Task
The task component, or system task, is used to process a scene. The task receives a command to update the scene from the task manager. This is a signal to run system functions on scene objects.
The execution of a task can be divided into subtasks, which are also distributed via the task manager across an even larger number of threads. This is a convenient way to scale the engine across multiple processors. This technique is called data decomposition.
Information about object changes in the process of updating the scene tasks is passed to the state manager. For details on the state manager, see section 3.2.2.

6. Combining all the components

All the elements described above are related to each other and are part of one whole. The engine's work can be roughly divided into several stages, described in the following sections.

6.1. Initialization phase

The engine starts with the initialization of the managers and the framework.
  • The framework calls the scene loader.
  • After determining which systems the scene will use, the loader calls the platform manager to load the appropriate modules.
  • The platform manager loads the appropriate modules and passes them to the interface manager, then calls them to create a new system.
  • The module returns a pointer to the system instance that implements the system interface to the loader.
  • The service manager registers all the services that the system module provides.


Figure 10. Initialization of managers and engine systems

6.2. Scene loading stage

Control is returned to the loader, which loads the scene.
  • The loader creates a generic scene. To instantiate system scenes, it calls the system interfaces, extending the functionality of the generic scene.
  • The universal scene defines what data each system scene can change and notifications about what changes it should receive.
  • By matching scenes that make certain changes and want to be notified about them, the generic scene passes this information to the state manager.
  • For each object in the scene, the loader creates a generic object, then determines which systems will extend the generic object. The correspondence between system objects is determined in the same way that is used for scenes. It is also passed to the state manager.
  • The loader uses the resulting scene interfaces to instantiate system objects and use them to extend generic objects.
  • The scheduler queries the scene interfaces for their primary tasks so that this information can be passed on to the task manager at runtime.


Figure 11. Initialization of the universal scene and object

6.3. Game Cycle Stage

  • The platform manager is used to process window messages and other elements necessary for the current platform to work.
  • Then control passes to the scheduler, which waits for the end of the clock to continue working.
  • At the end of a clock cycle in free step mode, the scheduler checks which tasks have completed. All completed tasks (that is, tasks that are ready to run again) are transferred to the task manager.
  • The scheduler determines which tasks will be completed in the current tick and waits for them to complete.
  • In hard step mode, these operations are repeated every clock cycle. The scheduler hands all tasks to the task manager and waits for them to be completed.
6.3.1. Completing the task
Control is transferred to the task manager.
  • It forms a queue of all received tasks, then, as free threads appear, starts their execution. (The process of executing tasks differs depending on the systems. Systems can work with only one task or process several tasks from the queue at the same time, thus realizing parallel execution.)
  • In the course of execution, tasks can work with the entire scene or only with certain objects, changing their internal data.
  • Systems should be notified of any changes to general data (eg position or orientation). Therefore, when the task is executed, the system scene or object informs the observer of any changes. In this case, the observer actually acts as a change controller, which is part of the state manager.
  • The change controller generates a queue of change notifications for further processing. It ignores changes that do not affect the given observer.
  • To use certain services, a task contacts the service manager. The service manager also allows a task to change properties of other systems that are not available for transmission through the messaging mechanism (for example, the input system changes the screen resolution, a property of the graphics system).
  • Tasks can also contact the environment manager to get environment variables and to change the execution state (pause execution, go to the next scene, etc.).


Figure 12. Task manager and tasks

6.3.2. Updating data
After completing all the tasks for the current tick, the main game loop calls the state manager to start the data update phase.
  • The state manager calls each of its change controllers in turn to send the accumulated notifications. The controller checks which observers to send change notifications to for each subject.
  • It then calls the desired observer and informs him of the change (the notification also includes a pointer to the subject interface). In free-stepping mode, the observer receives modified data from the change controller, but in hard-stepping mode, it must request it from the subject itself.
  • Typically, observers interested in receiving system object change notifications are other system objects associated with the same generic object. This allows you to divide the process of making changes into several tasks that can be performed in parallel. To simplify the synchronization process, you can combine all related Generic Object extensions in a single task.
6.3.3. Check progress and exit
The final step in the game loop is to check the state of the runtime. There are several such states: running, paused, next scene, etc. If the "running" state is selected, the next iteration of the loop will start. The "exit" state means the end of the cycle, the release of resources, and exit from the application. Other states can be implemented, for example, "pause", "next scene", etc.

7. Conclusion

The main idea of ​​this article is outlined in Section 2, "Parallel Execution State". Thanks to functional decomposition and data decomposition, it is possible to implement not only the multithreading of the engine, but also its scalability to an even larger number of cores in the future. To eliminate the overhead of synchronization while keeping your data up to date, use state managers in addition to messaging.

The Observer pattern is a function of the messaging engine. It is important to understand well how it works in order to choose the best way to implement it for the engine. In fact, it is a mechanism for interaction between different systems, which ensures the synchronization of common data.

Task management plays an important role in load balancing. Appendix D provides tips for creating an effective task manager for a game engine.

As you can see, game engine multithreading is possible thanks to a well-defined structure and messaging mechanism. It can significantly improve the performance of current and future processors.

Appendix A. Engine schematic

Processing is started from the main game loop (see Fig. 4, "Main game loop").


Appendix B. Scheme of interaction between the engine and systems


Appendix C. Observer (Design Pattern)

The Observer pattern is described in detail in the book "Design Patterns: Elements of Reusable Object-Oriented Software" by E. Gamma, R. Helm, R. Johnson, and J. Vlissides, first published in English in 1995 by Addison-Wesley.

The main idea of this model is as follows: if certain elements need to be notified about changes in other elements, they do not have to look through the list of all possible changes trying to find the data they need. The model assumes a subject and an observer, which are used to send change notifications. The observer monitors any changes to the subject, while the change controller acts as an intermediary between the two. The following diagram illustrates this relationship.


Figure 13. Observer template

The process of using this model is described below.

  1. The observer registers with the change controller the subject whose changes it wants to track.
  2. The change controller is actually an observer itself: instead of the real observer, it registers itself with the subject. The change controller also maintains its own list of observers and the subjects they are registered with.
  3. The subject adds the observer (that is, the change controller) to its list of observers that want to be notified of its changes. Sometimes the type of change is also indicated, specifying which changes the observer is interested in; this streamlines the sending of change notifications.
  4. When its data or state changes, the subject notifies the observer through a callback mechanism and passes information about the changed types.
  5. The change controller builds a queue of change notifications and waits for a signal to distribute them to objects and systems.
  6. During distribution, the change controller contacts the real observers.
  7. The observers request the changed data or state from the subject (or receive it together with the notification).
  8. Before an observer is deleted, or when it no longer needs notifications about a subject, it removes its subscription to that subject in the change controller.
There are many different ways to implement task distribution. However, it is best to keep the number of worker threads equal to the number of logical processors available on the platform. Try not to tie tasks to a specific thread: the execution times of tasks from different systems do not always coincide, which can lead to uneven load distribution among worker threads and reduce efficiency. To simplify this process, use task management libraries such as

Having dealt with the theory of multithreading, let us consider a practical example: the Pentium 4. Already at the development stage of this processor, Intel engineers continued working on increasing its performance without introducing changes to the programming interface. Five of the simplest approaches were considered:
1. Increase the clock frequency.
2. Placing two processors on one microcircuit.
3. Introduction of new functional blocks.
1. Extension of the conveyor.
2. Using multithreading.
The most obvious way to improve performance is to raise the clock speed without changing anything else. As a rule, each subsequent processor model has a slightly higher clock speed than the previous one. Unfortunately, with a straightforward increase in clock speed, developers run into two problems: increased power consumption (which matters for laptops and other battery-powered computing devices) and overheating (which requires more efficient heat sinks).
The second method, placing two processors on one chip, is relatively simple, but it roughly doubles the area occupied by the chip. If each processor is given its own cache memory, the number of chips per wafer is halved, which also doubles the production cost. Providing a shared cache for both processors avoids a significant increase in area, but another problem arises: the amount of cache per processor is halved, and this inevitably affects performance. In addition, while professional server applications can fully utilize the resources of multiple processors, internal parallelism in ordinary desktop programs is much less developed.
Introducing new functional units is also not difficult, but it is important to strike a balance here. What is the point of a dozen ALUs if the chip cannot issue instructions to the pipeline fast enough to keep them all busy?
A pipeline with more stages, which divides tasks into smaller segments and processes them in shorter clock periods, increases performance on the one hand, but on the other hand amplifies the negative consequences of mispredicted branches, cache misses, interrupts and other events that disrupt the normal flow of instruction processing in the processor. In addition, to fully realize the capabilities of the lengthened pipeline, the clock frequency has to be raised, and this, as we know, leads to increased power consumption and heat dissipation.
Finally, multithreading can be implemented. The advantage of this technology is that the additional program thread puts to work hardware resources that would otherwise sit idle. Experimental studies by Intel developers showed that a 5% increase in chip area spent on multithreading gives a performance gain of 25% for many applications. The first Intel processor to support multithreading was the Xeon of 2002. Subsequently, starting with the 3.06 GHz model, multithreading was introduced into the Pentium 4 line. Intel calls its implementation of multithreading in the Pentium 4 hyperthreading.
The basic principle of hyperthreading is the simultaneous execution of two program threads (or processes - the processor does not distinguish between processes and program threads). The operating system treats a hyperthreaded Pentium 4 processor as a dual-processor complex with shared caches and main memory, and schedules each program thread separately, so two applications can run at the same time. For example, a mail program can send or receive messages in the background while the user interacts with an interactive application - the daemon and the user program run concurrently, as if two processors were available to the system.
Application programs that can be executed as multiple program threads can use both "virtual processors". For example, video editing programs usually allow users to apply filters to all frames. These filters adjust brightness, contrast, color balance and other properties of the frames. In such a situation, the program can assign one virtual processor to process the even frames and the other to process the odd frames. The two processors then work completely independently of each other.
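
A hedged C++ sketch of this even/odd split is shown below; the Frame structure and the applyBrightness filter are invented for illustration, and a real video editor would of course operate on full images.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct Frame { std::vector<float> pixels; };  // illustrative stand-in for a video frame

// Illustrative filter: adjust the brightness of a single frame.
void applyBrightness(Frame& frame, float gain) {
    for (float& p : frame.pixels) p = std::min(1.0f, p * gain);
}

// One thread processes the even frames, the other the odd ones,
// so the two (virtual) processors work independently of each other.
void filterAllFrames(std::vector<Frame>& frames, float gain) {
    auto worker = [&frames, gain](std::size_t first) {
        for (std::size_t i = first; i < frames.size(); i += 2)
            applyBrightness(frames[i], gain);
    };
    std::thread even(worker, 0);  // frames 0, 2, 4, ...
    std::thread odd(worker, 1);   // frames 1, 3, 5, ...
    even.join();
    odd.join();
}
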
Since the program threads access the same hardware resources, coordination between them is necessary. In the context of hyperthreading, Intel identified four useful strategies for managing resource sharing: resource duplication, and hard, threshold and full resource sharing. Let us take a look at these strategies.
Let's start with resource duplication. As you know, some resources are duplicated specifically for the sake of program threads. For example, since each program thread requires its own control flow, a second program counter is needed. In addition, a second table for mapping the architectural registers (EAX, EBX, etc.) to physical registers is required; likewise, the interrupt controller is duplicated, since interrupts are handled individually for each thread.
Next comes hard (partitioned) resource sharing between program threads. For example, if the processor provides a queue between two functional stages of the pipeline, half of the slots can be given to thread 1 and the other half to thread 2. Partitioned sharing is easy to implement, does not lead to imbalance and ensures complete independence of the program threads from each other. With complete partitioning of all resources, one processor effectively turns into two. On the other hand, a situation may arise in which one program thread does not use resources that could be useful to the second thread but to which the latter has no access rights. As a result, resources that could otherwise be used sit idle.
The opposite of hard partitioning is full resource sharing. In this scheme, any program thread can access the required resources, and requests are serviced in the order in which they arrive. Consider a situation in which a fast thread, consisting primarily of addition and subtraction operations, coexists with a slow thread that performs multiplications and divisions. If instructions are fetched from memory faster than the multiplications and divisions are executed, the number of instructions fetched for the slow thread and queued for the pipeline will gradually grow. Eventually these instructions will fill the queue, and the fast thread will stall for lack of space. Full resource sharing solves the problem of suboptimal use of shared resources, but creates an imbalance in their consumption - one thread can slow down or stop another.
The intermediate scheme is threshold resource sharing, in which any program thread can dynamically acquire a certain (limited) share of a resource. When applied to replicated resources, this approach provides flexibility without the threat of one program thread stalling because it cannot obtain resources. If, for example, each thread is forbidden to occupy more than 3/4 of the instruction queue, the increased resource consumption of a slow thread will not interfere with the execution of a fast one.
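
The same threshold idea can be modeled in software. The sketch below is only an analogy of the hardware mechanism: a queue of fixed capacity in which neither of two threads may occupy more than 3/4 of the slots. The class and method names are invented for illustration.

#include <cstddef>

// Threshold sharing of a common queue between two threads (purely illustrative bookkeeping).
class ThresholdQueue {
public:
    explicit ThresholdQueue(std::size_t capacity) : capacity_(capacity) {}

    std::size_t limit() const { return capacity_ * 3 / 4; }  // per-thread cap: 3/4 of the queue

    // Returns true if the given thread (0 or 1) may take one more slot.
    bool tryEnqueue(int thread) {
        if (used_[thread] >= limit()) return false;          // this thread hit its threshold
        if (used_[0] + used_[1] >= capacity_) return false;  // the queue is completely full
        ++used_[thread];
        return true;
    }
    void dequeue(int thread) {
        if (used_[thread] > 0) --used_[thread];
    }
private:
    std::size_t capacity_;
    std::size_t used_[2] = {0, 0};
};

tryEnqueue() fails either when the queue is full or when the calling thread has reached its threshold; in the latter case the other thread can still make progress, which is exactly the property threshold sharing is meant to provide.
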
The Pentium 4 hyperthreading model combines different resource sharing strategies in an attempt to avoid the problems associated with each of them. Duplication is applied to resources that both program threads need at all times (in particular, the program counter, the register mapping table and the interrupt controller); duplicating these resources increases the chip area by only about 5% - a reasonable price for multithreading. Resources that exist in such volume that it is virtually impossible for one thread to capture them all (for example, cache lines) are allocated dynamically. Access to the resources that control the operation of the pipeline (in particular, its numerous queues) is partitioned - half of the slots are assigned to each program thread. The main pipeline of the Pentium 4 NetBurst architecture is shown in Fig. 8.7; the white and gray areas in that illustration represent the resource allocation mechanism between the white and gray program threads.
As you can see, all the queues in this illustration are partitioned - each program thread is allocated half of the slots, so neither thread can restrict the work of the other. The allocation and renaming unit is partitioned in the same way. The scheduler resources are shared dynamically, but with a threshold, so neither thread can occupy all the slots in the queue. The remaining pipeline stages are fully shared.
However, multithreading is not that simple; even this progressive technique has drawbacks. Hard partitioning of resources incurs no significant overhead, but dynamic partitioning, especially the threshold variety, requires tracking resource consumption at run time. In addition, in some cases programs run significantly better without multithreading than with it. Suppose, for example, that there are two threads, each of which needs 3/4 of the cache to function properly. Executed in turn, each would perform well, with a small number of cache misses (which, as we know, carry additional costs). Executed in parallel, each would suffer significantly more cache misses, and the end result would be worse than without multithreading.
More information about the Pentium 4 multithreading mechanism can be found in the literature.

Introduction. Computer technology is developing rapidly. Computing devices are becoming more powerful, more compact and more convenient, but recently improving device performance has become a major problem. In 1965, Gordon Moore (one of the founders of Intel) concluded that "the number of transistors placed on an integrated circuit chip doubles every 24 months."

The first developments in the field of multiprocessor systems began in the 1970s. For a long time, the performance of conventional single-core processors was raised by increasing the clock frequency (up to 80% of performance was determined by the clock frequency alone) together with an increase in the number of transistors on the chip. The fundamental laws of physics stopped this process: chips began to overheat, and the process technology began to approach the size of silicon atoms. All these factors led to the following:

  • leakage currents increased, and with them heat generation and power consumption;
  • the processor became much "faster" than memory, and performance dropped because of the latency of accessing RAM and loading data into the cache;
  • the so-called "von Neumann bottleneck" appeared, meaning the inefficiency of the processor architecture when executing a program.

Multiprocessor systems (one way of solving the problem) did not gain wide use, since they required expensive and difficult-to-manufacture multiprocessor motherboards. Performance therefore had to be raised in other ways. The concept of multithreading - the simultaneous processing of several instruction streams - proved effective.

Hyper-Threading Technology (HTT) allows a processor to execute multiple threads on a single core. According to many experts, it was HTT that became a prerequisite for the creation of multi-core processors. The ability of a processor to execute multiple threads at the same time is called thread-level parallelism (TLP).

To unleash the potential of a multicore processor, an executable program must use all of its computational cores, which is not always achievable. Old sequential programs that can use only one core will no longer run faster on a new generation of processors, so the task of raising performance increasingly falls to programmers as well as to the developers of new microprocessors.

1. General concepts

Architecture in the broadest sense is a description of a complex system consisting of many elements.

In the course of their development, semiconductor structures (microcircuits) evolve, so the principles of processor design, the number of elements they contain and the way their interaction is organized change constantly. CPUs that share the same basic design principles are said to belong to the same architecture, and those principles themselves are called the processor architecture (or microarchitecture).

A microprocessor (or processor) is the main component of a computer. It processes information, executes programs, and controls other devices in the system. How fast programs will run depends on the processor's power.

The core is the backbone of any microprocessor. It consists of millions of transistors located on a silicon chip. The processor contains special cells called general-purpose registers (GPRs). The work of the processor essentially consists of fetching instructions and data from memory in a certain sequence and executing them. In addition, to raise the speed of the PC, the microprocessor is equipped with internal cache memory. Cache memory is the processor's internal memory, used as a buffer (to smooth out delays in communication with RAM).

Intel processors used in IBM-compatible PCs have more than a thousand instructions and belong to the class of processors with a complex instruction set - CISC processors (CISC - Complex Instruction Set Computing).

1.1 High performance computing. Parallelism

The pace of development of computing technology is easy to trace: from ENIAC (the first general-purpose electronic digital computer), with a performance of several thousand operations per second, to the supercomputer Tianhe-2 (1000 trillion floating-point operations per second). This means that computing speed has increased a trillion-fold in 60 years. Creating high-performance computing systems is one of the most difficult scientific and technical problems. While the speed of the underlying hardware has grown only a few million times, the overall speed of computing has grown trillions of times. This effect is achieved through the use of parallelism at all stages of computation. Parallel computing requires a rational allocation of memory, reliable methods of transferring information and coordination of the computational processes.

1.2 Symmetric multiprocessing

Symmetric multiprocessing (SMP) is an architecture of multiprocessor systems in which several processors have access to a shared memory. It is a very common architecture and has been used quite widely in recent years.

With SMP, several processors work in the computer at once, each on its own task. An SMP system with a high-quality operating system distributes tasks rationally between the processors, keeping the load on each of them even. A memory access problem arises, however, because even uniprocessor systems need a relatively long time for memory access; in SMP, access to RAM occurs sequentially - first one processor, then the next.

Because of the features listed above, SMP systems are used almost exclusively in science, industry and business, and extremely rarely in ordinary offices. In addition to the high cost of the hardware, such systems require very expensive, high-quality software that supports multithreaded execution of tasks. Ordinary programs (games, text editors) do not work effectively on SMP systems, since they do not provide that degree of parallelism. If a program is adapted for an SMP system, it becomes extremely inefficient on uniprocessor systems, which leads to the need to create several versions of the same program for different systems. An exception is, for example, Ableton Live (a program for creating music and preparing DJ sets), which supports multiprocessor systems. If you run an ordinary program on a multiprocessor system, it will still run slightly faster than on a uniprocessor one. This is due to hardware interrupts (the program is suspended for kernel processing), which can be handled on another, free processor.

An SMP system (like any other system based on parallel computing) places increased demands on memory bus bandwidth. This often limits the number of processors in a system (modern SMP systems work efficiently with up to 16 processors).

Since the processors share memory, it must be used rationally and the data kept coordinated. In a multiprocessor system, several caches work with a shared memory resource. Cache coherence is the property that ensures the integrity of data stored in the individual caches of a shared resource. It is a special case of memory coherence, in which several cores have access to shared memory (ubiquitous in modern multicore systems). In general terms, the picture is as follows: the same data block can be loaded into different caches, where the data is processed in different ways.

If no notification of such data changes is used, errors result. Cache coherence is designed to resolve such conflicts and maintain data consistency across the caches.
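
Cache coherence itself is maintained by hardware, but the same consistency problem is visible at the program level as a data race. A small C++ illustration: two threads increment a shared counter; declaring it std::atomic guarantees that every update propagates between the cores' caches and the final value is always 2000000, whereas with a plain variable the result would vary from run to run.

#include <atomic>
#include <iostream>
#include <thread>

int main() {
    std::atomic<long> shared{0};  // with a plain long, increments would be lost
    auto work = [&shared] {
        for (int i = 0; i < 1000000; ++i)
            shared.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << shared.load() << "\n";  // always prints 2000000
}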

SMP systems are a subgroup of MIMD (Multiple Instruction, Multiple Data) in Flynn's classification of computing systems (Michael Flynn, a Stanford University professor and co-founder of Palyn Associates). According to this classification, almost all types of parallel systems can be classified as MIMD.

The division of multiprocessor systems into types is based on how memory is used. This approach makes it possible to distinguish the following important types of multiprocessor systems: multiprocessors (systems with shared memory) and multicomputers (systems with separate memory). Shared data used in parallel computations require synchronization. The task of data synchronization is one of the most important problems, and solving it in the development of multiprocessor and multicore systems, and of the corresponding software, is a priority for engineers and programmers. Data sharing can also be organized by physically distributing the memory. This approach is called non-uniform memory access (NUMA).

These systems include:

  • Systems in which only the individual processor caches are used to hold data (cache-only memory architecture, COMA).
  • Systems that maintain coherence of the local caches of different processors (cache-coherent NUMA, ccNUMA).
  • Systems that provide shared access to individual processors' memory without hardware cache coherence (non-cache-coherent NUMA).

Using distributed shared memory simplifies the creation of multiprocessor systems, but this method noticeably increases the complexity of parallel programming.

1.3 Concurrent multithreading

Given all of the above disadvantages of symmetric multiprocessing, it makes sense to develop other ways of improving performance. If you analyze the work of each individual transistor in the processor, you notice a very interesting fact: when most computational operations are performed, far from all of the processor's components are involved (according to recent studies, about 30% of all transistors). Thus, if the processor is performing, say, a simple arithmetic operation, most of the processor is idle and could be used for other calculations. So, if the processor is currently performing floating-point operations, an integer arithmetic operation can be loaded into the idle part. To increase processor utilization, one can introduce speculative (anticipatory) execution of operations, which requires considerable complexity in the processor's hardware logic. If instead the threads (sequences of instructions) that can be executed independently of each other are identified in the program in advance, the task becomes much simpler (this approach is easy to implement in hardware). This idea, due to Dean Tullsen (who developed it in 1995 at the University of Washington), is called simultaneous multithreading (SMT). It was later developed by Intel under the name hyper-threading. A single processor executing multiple threads is thus perceived by the Windows operating system as multiple processors. Using this technology again requires an appropriate level of software support. The maximum effect of using multithreading technology is about 30%.

1.4 Multicore

Multithreading technology is a software-level counterpart of multicore operation. A further increase in performance, as always, requires changes in the processor hardware. Complicating systems and architectures is not always effective; there is an opposite view: "everything ingenious is simple!" Indeed, to increase a processor's performance it is not at all necessary to raise its clock frequency or complicate the logic and hardware - it is enough to rationalize and refine the existing technology. This approach is very advantageous: there is no need to solve the problem of increased processor heat dissipation or to develop new, expensive equipment for chip production. The approach was implemented in multicore technology - placing several computational cores on a single die. If we take the original processor and compare the performance gains from the various methods of increasing performance, it is clear that multicore technology is the best option.

If we compare the architectures of a symmetric multiprocessor and a multicore processor, they turn out to be almost identical. A core's cache can be multilevel (local and shared, and data from RAM can be loaded directly into the L2 cache). Because of the advantages of the multicore architecture, manufacturers focus on it. The technology turned out to be quite cheap to implement and universal, which made it possible to bring it to the mass market. In addition, this architecture has made its own adjustment to Moore's law: "the number of computing cores in a processor will double every 18 months."

If you look at the modern computer hardware market, you can see that devices with four- and eight-core processors dominate, and processor manufacturers claim that processors with hundreds of processing cores will soon reach the market. As has been said many times before, the full potential of a multicore architecture is revealed only with high-quality software. Thus, the production of computer hardware and the development of software are very closely related.
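
As a simple illustration of what "using all computational cores" means in practice, here is a C++ sketch that splits a summation across as many threads as the platform reports logical processors; the parallelSum function is invented for illustration.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum a large array using every available core: each thread handles its own chunk.
long long parallelSum(const std::vector<int>& data) {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 2;  // the call may return 0 when the count is unknown
    std::vector<long long> partial(cores, 0);
    std::vector<std::thread> threads;
    const std::size_t chunk = data.size() / cores + 1;
    for (unsigned c = 0; c < cores; ++c) {
        threads.emplace_back([&data, &partial, chunk, c] {
            const std::size_t begin = c * chunk;
            const std::size_t end = std::min(data.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                partial[c] += data[i];  // each core sums only its own chunk
        });
    }
    for (auto& t : threads) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}

A sequential program gains nothing from extra cores; only when the work is split in this way (or by a library or compiler that does it automatically) does a multicore processor show its potential.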