Мануал по flat assembler - FASM - Страница 2

@Mikl___ · 21.08.2014, 09:03 **[ТС]**

Студворк — интернет-сервис помощи студентам

Assembler
1
    ficom word [bx]  ; сравнивает st0 с 16-битным целым числом

"fcomi", "fcomip", "fucomi", "fucomip" сравнивают ST0 с другим регистром FPU и ставят, в зависимости от результатов, флаги ZF, PF и CF. "fcomip" и "fucomip" ещё выдвигают стек регистров после завершения сравнения. Инструкции, образованные добавлением условного мнемоника FPU (смотрите таблицу 3.2) к мнемонику "fcmov", переводят указанный регистр FPU в регистр ST0, если данное условие выполняется. Эти инструкции поддерживают два разных синтаксиса, первый с одним операндом, определяющим регистр FPU, второй с двумя операндами, где операнд-адресат - регистр ST0, а операнд-источник, идущий вторым, - нужный регистр FPU.

Assembler
1
2
    fcomi st2        ; сравнивает st0 с st2 и устанавливает флаги
    fcmovb st0,st2   ; переводит st2 в st0 если меньше

Таблица 3.2 Условия FPU

Мнемоник	Тестируемое условие	Описание
b	CF = 1	меньше
e	ZF = 1	равно
be	CF or ZF = 1	меньше или равно
u	PF = 1	неномализованное
nb	CF = 0	не меньше
ne	ZF = 0	не равно
nbe	CF and ZF = 0	не меньше и не равно
nu	PF = 0	нормализованное

"ftst" сравнивает значение в ST0 с 0.0 и в зависимости от результатов устанавливает флаги в слове статуса FPU. "fxam" проверяет содержимое регистра ST0 и устанавливает флаги в слове статуса FPU, показывая класс значения в регистре. Эти инструкции не имеют операндов.
"fstsw" и "fnstsw" сохраняют текущее значение слова статуса FPU в указанном месте. Операндом-адресатом может быть либо 16 бит в памяти, либо регистр AX. "fstsw" перед сохранением слова проверяет на подвешенные немаскируемые численные исключения, "fnstsw" этого не делает.
"fstcw" и "fnstcw" сохраняют текущее значение управляющего слова FPU в указанном месте в памяти. "fstcw" перед сохранением слова проверяет на подвешенные немаскируемые численные исключения, "fnstcw" этого не делает. "fldcw" загружает операнд в управляющее слово FPU. Операндом должно быть 16-битное расположение в памяти.
"fstenv" и "fnstenv" сохраняют текущий контекст FPU в расположении в памяти, указанном в операнде-адресате, и далее маскируют все исключения операций с плавающей точкой. "fstenv" перед совершением операции проверяет на подвешенные немаскируемые численные исключения, "fnstenv" этого не делает. "fldenv" загружает полный контекст FPU из памяти в FPU. "fsave" и "fnsave" сохраняют текущий статус FPU (контекст и регистры стека) в указанном месте в памяти и затем ре-инициализируют FPU. "fsave" перед совершением операции проверяет на подвешенные немаскируемые численные исключения, "fnsave" этого не делает. "frstor" загружает статус FPU из указанного места в памяти. Все эти инструкции в качестве операнда требуют расположение в памяти.
"finit" и "fninit" устанавливают контекст FPU в его значение по умолчанию. "finit" перед совершением операции проверяет на подвешенные немаскируемые численные исключения, "fninit" этого не делает. "fclex" and "fnclex" очищают флаги исключений FPU в слове статуса FPU."fclex" перед совершением операции проверяет на подвешенные немаскируемые численные исключения, "fnclex" этого не делает. "wait" и "fwait" - это синонимы одной и той же инструкции, которая указывает процессор проверить наличие подвешенных немаскируемых численных исключений и разобраться с ними до продолжения работы. Эти инструкции не имеют операндов.
"ffree" помечает тег, ассоциированный с указанным регистром FPU, как пустой. Операндом должен служить регистр FPU.
"fincstp" и "fdecstp" вращают стек FPU на единицу, прибавляя или отнимая единицу от поля TOP слова статуса FPU. У этих инструкций нет операндов.

@Mikl___ · 21.08.2014, 09:03 **[ТС]**

2.1.14 Инструкции MMX

Инструкции MMX оперируют со сжатыми целочисленными типами и используют регистры MMX, которыми являются нижние 64-битные части 80-битных регистров FPU. Поэтому инструкции MMX не могут использоваться в одно и то же время с инструкциями FPU. Они могут оперировать со сжатыми байтами (восемь 8-битных целых чисел), сжатыми словами (четыре 16-битных целых чисел) или сжатыми двойными словами (два 32-битных целых числа). Использование сжатых форматов позволяет совершать операции одновременно над многими данными.

"movq" копирует четверное слово из операнда-источника в операнд-адресат. По крайней мере одним из операндов должен являться регистр MMX, вторым может быть либо регистр MMX, либо 64-битное расположение в памяти.

Assembler
1
2
    movq mm0,mm1     ; копирует четверное слово из регистра в регистр
    movq mm2,[ebx]   ; копирует четверное слово из памяти в регистр

"movd" копирует двойное слово из операнда-источника в операнд-адресат. Одним из операндов должен быть регистр MMX, вторым может быть регистр общего назначения либо 32-битное расположение в памяти. Используется только нижнее двойное слово регистра MMX.

Все основные операции MMX имеют два операнда, где операндом-адресатом должен быть регистр MMX, а операндом-источником может быть либо регистр MMX, либо 64-битное расположение в памяти. Операция совершается на соответствующих элементах данных источника и адресата и сохраняется элементах данных адресата. "paddb", "paddw" и "paddd" совершают сложение сжатых байтов, сжатых слов и сжатых двойных слов. "psubb", "psubw" и "psubd" совершают вычитание соответствующих типов. "paddsb","paddsw", "psubsb" и "psubsw" совершают сложение или вычитание сжатых байтов или сжатых слов со знаковым насыщением. "paddusb", "paddusw", "psubusb", "psubusw" - это аналоги, но без знакового насыщения. "pmulhw" и "pmullw" совершают знаковое умножение сжатых слов и сохраняют верхние или нижние слова результатов в операнде-адресате. "pmaddwd" совершает умножение сжатых слов и складывает четыре промежуточных продукта в виде двойных слов в парах, чтобы получить результат в виде сжатых двойных слов. "pand", "por" и "pxor" совершают логические операции над четверными словами, "pandn" также производит логическое отрицание перед операцией "and". "pcmpeqb", "pcmpeqw" и "pcmpeqd" сравнивают на эквивалентность сжатые байты, сжатые слова или сжатые двойные слова. Если пара элементов данных эквивалентна, то соответствующий элемент данных операнда-адресата покрывается единичными битам, иначе нулевыми. "pcmpgtb", "pcmpgtw" и "pcmpgtd" совершают похожую операцию, но они проверяют, больше ли элементы данных в операнде-адресате, чем соответствующие элементы данных в операнде-источнике. "packsswb" конвертирует сжатые знаковые слова в сжатые знаковые байты, "packssdw" конвертирует сжатые знаковые двойные слова в сжатые знаковые слова, используя насыщение, чтобы удовлетворить условиям переполнения. "packuswb" конвертирует сжатые знаковые слова в сжатые беззнаковые байты. Конвертированные элементы данных из операнда-источника сохраняются в нижней части операнда-адресата, тогда как конвертированные элементы данных операнда-адресата сохраняются в его верхней части. "punpckhbw", "punpckhwd" и "punpckhdq" чередуют элементы данных из верхних частей источника и адресата и сохраняют результат в операнд-адресат. "punpcklbw", "punpcklwd" и "punpckldq" совершают те же операции, но с нижними частями операндов.

Assembler
1
2
    paddsb mm0,[esi] ; складывает сжатые байты со знаковым насыщением
    pcmpeqw mm3,mm7  ; проверяет сжатые слова на эквивалентность

"psllw", "pslld" и "psllq" совершают логический сдвиг влево сжатых слов, сжатых двойных слов или одиночных четверных слов в операнде-адресате, на число битов, указанное в операнде-источнике. "psrlw", "psrld" и "psrlq" совершают логический сдвиг вправо сжатых слов, сжатых двойных слов или одиночных четверных слов. "psraw" и "psrad" совершают арифметический сдвиг сжатых слов или двойных слов. Операндом-адресатом должен быть регистр MMX, а операндом-источником может быть регистр MMX, 64-битное расположение в памяти или 8-битное непосредственное значение.

Assembler
1
2
    psllw mm2,mm4    ; сдвигает слова влево логически
    psrad mm4,[ebx]  ; сдвигает двойные слова вправо арифметически

"emms" делает регистры FPU используемыми для инструкций FPU. Эта инструкция должна быть применена перед использованием инструкций FPU, если в ход пускались инструкции MMX.

@Mikl___ · 21.08.2014, 09:05 **[ТС]**

2.1.15 Инструкции SSE

Расширение SSE добавляет больше инструкций MMX, а также представляет операции со сжатыми значениями одинарной точности с плавающей точкой. 128-битный сжатый формат одинарной точности содержит четыре значения одинарной точности с плавающей точкой. 128-битные регистры SSE созданы для поддержки операций этого типа данных.

"movaps" и "movups" переводят операнд размером в двойное четверное слово, содержащий значения одинарной точности из операнда-источника в операнд-адресат. По крайней мере одним из операндов должен быть регистр SSE, вторым может быть либо тоже регистр SSE, либо 128-битное расположение в памяти. Операнды в памяти для "movaps" должны быть выровнены по 16-битной границе, для "movups" этого не требуется.

Assembler
1
    movups xmm0,[ebx]  ; переводит невыровненное двойное четверное слово

"movlps" переводит два сжатых значения одинарной точности из нижнего четверного слова регистра-источника в верхнее четверное слово регистра-адресата. "movhlps" переводит два сжатых значения одинарной точности из верхнего четверного слова регистра-источника в нижнее четверное слово регистра-адресата. Обоими операндами должны быть регистры SSE.
"movmskps" переводит знаковые биты всех значений одинарной точности в регистре SSE в нижние четыре бита регистра общего назначения. Операндом-источником должен быть регистр SSE, операндом-адресатом должен быть регистр общего назначения.
"movss" переводит значение одинарной точности между источником и адресатом (переводится только нижнее двойное слово). По крайней мере одним из операндов должен быть регистр SSE, вторым может быть либо тоже регистр SSE, либо 32-битное расположение в памяти.

Assembler
1
    movss [edi],xmm3   ; переводит нижнее двойное слово из xmm3 в память

Каждая арифметическая операция SSE имеет два варианта. Если мнемоник заканчивается на "ps", операндом-источником может быть 128-битное расположение в памяти или регистр SSE, операндом-адресатом должен быть регистр SSE, и операция производится над четырьмя сжатыми значениями одинарной точности, для каждой пары соответствующих элементов данных отдельно, и результат сохраняется в регистре-адресате. Если мнемоник заканчивается на "ss", то операндом-источником может быть 32-битное расположение в памяти или регистр SSE, операндом-адресатом должен быть регистр SSE, и операция производится над одним значением одинарной точности, используются только нижние двойные слова регистров SSE, и результат сохраняется в нижнем двойном слове регистра-адресата. "addps" и "addss" складывают значения, "subps" и "subss" вычитают источник из адресата, "mulps" и "mulss" перемножают значения, "divps" и "divss" делят адресат на источник, "rcpps" и "rcpss" вычисляют аппроксимированную обратную величину источника, "sqrtps" и "sqrtss" вычисляют квадратный корень источника, "rsqrtps" и "rsqrtss" вычисляют аппроксимированную обратную величину квадратного корня источника, "maxps" и "maxss" сравнивают источник и адресат и возвращают большее значение, "minps" и "minss" сравнивают источник и адресат и возвращают меньшее значение.

@Mikl___ · Код Мнемоник Описание

Assembler
1
2
    mulss xmm0,[ebx]   ; перемножает значения одинарной точности
    addps xmm3,xmm7    ; складывает сжатые значения одинарной точности

"andps", "andnps", "orps" и "xorps" производят логические операции над сжатыми значениями одинарной точности. Операндом-источником может быть 128-битное расположение в памяти или регистр SSE, операндом-адресатом должен быть регистр SSE.
"cmpps" сравнивает сжатые значения одинарной точности и возвращают маскируемый результат в операнд-адресат, которым должен быть регистр SSE. Операндом-источником может быть либо регистр SSE, либо 128-битное расположение в памяти, третьим операндом должно быть непосредственное значение, выбирающее код одного из восьми условий сравнения (таблица 3.3). "cmpss" совершает ту же над значениями одинарной точности, изменяется только нижнее двойное слово регистра-адресата, таким образом операндом-источником должен быть либо регистр SSE, либо 32-битное расположение в памяти. Эти две инструкции имеют также варианты с двумя операндами и условие, закодированным в мнемоник. Их мнемоники образуются путем добавления мнемоников из таблицы 3.3 к "cmp" и после добавления к ним в конце "ps" или "ss".

Assembler
1
2
    cmpps xmm2,xmm4,0  ; сравнивает сжатые значения одинарной точности
    cmpltss xmm0,[ebx] ; сравнивает значения одинарной точности

Таблица 3.3 Условия SSE

Код	Мнемоник	Описание
0	eq	равно
1	lt	меньше
2	le	меньше или равно
3	unord	ненормализованное
4	neq	не равно
5	nlt	не меньше
6	nle	не меньше и не равно
7	ord	нормализованное

"comiss" и "ucomiss" сравнивают значения одинарной точности и ставят в зависимости от результата флаги ZF, PF и CF. Операндом-адресатом должен быть регистр SSE, операндом-источником может быть 32-битное расположение в памяти или регистр SSE.
"shufps" переводит некоторые два из четырех значений одинарной точности из операнда-адресата в нижнее четверное слово операнда-адресата и некоторые два из четырех значений одинарной точности из операнда-источника в верхнее четверное слово операнда-адресата. Операндом-адресатом должен быть регистр SSE, операндом-источником может быть 128-битное расположение в памяти или регистр SSE, а третьим операндом должно быть 8-битное непосредственное значение, какие конкретно значения будут задействованы. Биты 0 и 1 указывают значение из адресата, которое должно быть в нижнем двойном слове результата, биты 2 и 3 указывают значение из адресата, которое должно быть во втором двойном слове результата, биты 4 и 5 указывают значение из источника, которое должно быть в третьем двойном слове результата, биты 6 и 7 указывают значение из источника, которое должно быть в нижнем верхнем слове результата.

Assembler
1
    shufps xmm0,xmm0,10010011b ; перемешивает двойные слова

"unpckhps" совершает перемежающуюся распаковку значений из верхних частей источника и адресата и сохраняет результат в адресат, которым должен быть регистр SSE. Операндом-источником может быть 128-битное расположение в памяти или регистр SSE. "unpcklps" совершает перемежающуюся распаковку значений из нижних частей источника и адресата и сохраняет результат в адресат, правила для операндов такие же.
"cvtpi2ps" конвертирует два сжатых целых числа размером в двойное слово в два сжатых значения с плавающей точкой одинарной точности и сохраняет результат в нижнем четверном слове адресата, которым должен быть регистр SSE. Операндом-источником может быть 64-битное расположение в памяти или регистр MMX.

Assembler
1
    cvtpi2ps xmm0,mm0  ; конвертирует целые числа в значения одинарной точности

"cvtsi2ss" конвертирует целое число размером в двойное слово в сжатое значение с плавающей точкой одинарной точности и сохраняет результат в нижнем двойном слове адресата, которым должен быть регистр SSE. Операндом-источником может быть 32-битное расположение в памяти или 32-битный регистр общего назначения.

Assembler
1
    cvtsi2ss xmm0,eax  ; конвертирует целое число в значение одинарной точности

"cvtps2pi" конвертирует два сжатых значения с плавающей точкой одинарной точности в два сжатых целых числа размером в двойное слово и сохраняет результат в адресате, которым должен быть регистр MMX. Операндом-источником может быть 64-битное расположение в памяти либо регистр SSE, в котором будет использовано только нижнее четверное слово. "cvttps2pi" совершает похожую операцию, но для округления здесь используется отбрасывание дробной части, правила для операндов у этой инструкции такие же.

Assembler
1
    cvtps2pi mm0,xmm0  ; конвертирует значения одинарной точности в целые числа

"cvtss2si" конвертирует сжатое значение с плавающей точкой одинарной точности в сжатое целое число размером в двойное слово и сохраняет результат в адресате, которым должен быть 32-битный регистр общего назначения. Операндом-источником может быть 32-битное расположение в памяти либо регистр SSE, в котором будет использовано только нижнее двойное слово. "cvttss2si" совершает похожую операцию, но для округления здесь используется отбрасывание дробной части, правила для операндов у этой инструкции такие же.

Assembler
1
    cvtss2si eax,xmm0  ; конвертирует значение одинарной точности в целое число

"pextrw" копирует слово, указанное третьим операндом, из источника в адресат. Операндом-источником должен быть регистр MMX, операндом-адресатом должен быть 32-битный регистр общего назначения (но используется только нижнее его слово), третьим операндом должно быть 8-битное непосредственное значение.

Assembler
1
    pextrw eax,mm0,1   ; извлекает слово в eax

"pinsrw" вставляет слово из источника в место в адресате, указанное третьим операндом, которым должно быть 8-битное непосредственное значение. Операндом-адресатом должен быть регистр MMX, операндом-источником должен быть 32-битный регистр общего назначения (но используется только нижнее его слово).

Assembler
1
    pinsrw mm1,ebx,2   ; вставляет слово из ebx

"pavgb" and "pavgw" вычисляют среднее сжатых байтов или слов. "pmaxub" возвращает максимум сжатых беззнаковых байтов, "pminub" возвращает минимум сжатых беззнаковых байтов, "pmaxsw" возвращает максимум сжатых знаковых слов, "pminsw" возвращает минимум сжатых знаковых слов. "pmulhuw" совершает беззнаковое умножение сжатых слов и сохраняет верхние слова результатов в операнд-адресат. "psadbw" вычисляет абсолютные разности сжатых беззнаковых байтов, суммирует эти разности и сохраняет результат в нижнее слово операнда-адресата. Все эти инструкции следуют тем же правилам для операндов, что и основные операции MMX, описанные в предыдущем параграфе.
"pmovmskb" создает маску из знаковых битов всех байтов в источнике и сохраняет результат в нижнем байте адресата. Операндом-источником должен быть регистр MMX, операндом-адресатом должен быть 32-битный регистр общего назначения.
"pshufw" копирует слова, указанные третьим операндом, из источника в адресат. Операндом-адресатом должен быть регистр MMХ, операндом-источником может быть 64-битное расположение в памяти или регистр MMX, третьим операндом должно быть 8-битное непосредственное значение, выбирающее, какие значения будут помещены в адресат, таким же образом как третий операнд в инструкции "shufps".
"movntq" переводит четверное слово из операнда-источника в память, используя "не-временное малое количество" (non-temporal hint), чтобы минимизировать загрязнение кэша. Операндом-источником должен быть регистр MMX, операндом-адресатом должно быть 64-битное расположение в памяти. "movntps" сохраняет сжатые значения одинарной точности из регистра SSE в память, используя "не-временное малое количество". Операндом-источником должен быть регистр SSE, операндом-адресатом должно быть 128-битное расположение в памяти. "maskmovq" сохраняет выбранные байты из первого операнда в 64-битное расположение в памяти, используя "не-временное малое количество". Обоими операндами должны служить регистры MMX, второй операнд указывает, какие байты из первого операнда должны быть записаны в память. Расположение в памяти указывается регистром DI (или EDI) в сегменте, определенном в DS.
"prefetcht0", "prefetcht1", "prefetcht2" and "prefetchnta" помещает строку данных из памяти, которая содержит байт, указанный в операнде, в определенное место в иерархии кеша. Операндом должно быть 8-битное расположение в памяти.
"sfence" переводит в последовательный режим все предыдущие команды, совершающие запись в память. У этой инструкции нет операндов.
"ldmxcsr" загружает 32-битный операнд в памяти в регистр MXCSR. "stmxsr" сохраняет содержимое MXCSR в 32-битный операнд в памяти.
"fxsave" сохраняет текущий статус FPU, регистр MXCSR и все регистры FPU и SSE в 512-байтное расположение в памяти, указанное в операнде-адресате. "fxstor" перезагружает данные, ранее сохраненные инструкцией "fxsave" из 512-байтного расположения в памяти. Операнд для обеих этих инструкций должен быть выровнен по 16 байтам, нужно объявить операнд неопределенного размера.

@Mikl___ · 21.08.2014, 09:06 **[ТС]**

2.1.16 SSE2 instructions

SSE2 (англ. Streaming SIMD Extensions 2, потоковое SIMD-расширение процессора) — это SIMD (англ. Single Instruction, Multiple Data, Одна инструкция — множество данных) набор инструкций, разработанный Intel и впервые представленный в процессорах серии Pentium 4. SSE2 расширяет набор инструкций SSE с целью полного вытеснения MMX. Набор SSE2 добавил 144 новые команды к SSE, в котором было только 70 команд.

SSE2 использует восемь 128-битных регистров (xmm0 до xmm7), включённых в архитектуру x86 с вводом расширения SSE, каждый из которых трактуется как 2 последовательных значения с плавающей точкой двойной точности.

SSE2 включает в себя набор инструкций, который производит операции со скалярными и упакованными типами данных.

SSE2 содержит инструкции для потоковой обработки целочисленных данных в тех же 128-битных xmm регистрах, что делает это расширение более предпочтительным для целочисленных вычислений, нежели использование набора инструкций MMX, появившегося гораздо раньше.

SSE2 включает в себя две части – расширение SSE и расширение MMX.

расширение SSE работает с вещественными числами.
расширение MMX работает с целыми.

В SSE2 регистры по сравнению с MMX удвоились (64 бита -> 128 битов). Т.к. скорость выполнения инструкций не изменилась, при оптимизации под SSE2 программа получает двукратный прирост производительности. Если программа уже была оптимизирована под MMX, то оптимизация под SSE2 дается сравнительно легко в силу сходности системы команд.

SSE2 включает в себя ряд команд управления кэшем, предназначенных для минимизации загрязнения кэша при обработке объёмных потоков данных.

SSE2 включает в себя сложные дополнения к командам преобразования чисел
Технология SSE2 повышает эффективность

The SSE2 extension introduces the operations on packed double precision floating point values, extends the syntax of MMX instructions, and adds also some new instructions.
movapd and movupd transfer a double quad word operand containing packed double precision values from source operand to destination operand. These instructions are analogous to movaps and movups and have the same rules for operands.
movlpd moves double precision value between the memory and the low quad word of SSE register. movhpd moved double precision value between the memory and the high quad word of SSE register. These instructions are analogous to movlps and movhps and have the same rules for operands.
movmskpd transfers the most signicant bit of each of the two double precision values in the SSE register into low two bits of a general register. This instruction is analogous to movmskps and has the same rules for operands.
movsd transfers a double precision value between source and destination operand (only the low quad word is trasferred). At least one of the operands have to be a SSE register, the second one can be also a SSE register or 64-bit memory location.
Arithmetic operations on double precision values are: addpd, addsd, subpd, subsd, mulpd, mulsd, divpd, divsd, sqrtpd, sqrtsd, maxpd, maxsd, minpd, minsd, and they are analoguous to arithmetic operations on single precision values described in previous section. When the mnemonic ends with pd instead of ps, the operation is performed on packed two double precision values, but rules for operands are the same. When the mnemonic ends with sd instead of ss, the source operand can be a 64-bit memory location or a SSE register, the destination operand must be a SSE register and the operation is performed on double precision values, only low quad words of SSE registers are used in this case.
andpd, andnpd, orpd and xorpd perform the logical operations on packed double precision values. They are analoguous to SSE logical operations on single prevision values and have the same rules for operands.
cmppd compares packed double precision values and returns and returns a mask result into the destination operand. This instruction is analoguous to cmpps and has the same rules for operands. cmpsd performs the same operation on double precision values, only low quad word of destination register is aected, in this case source operand can be a 64-bit memory or SSE register. Variant with only two operands are obtained by attaching the condition mnemonic from table 2.3 to the cmp mnemonic and then attaching the pd or sd at the end.
comisd and ucomisd compare the double precision values and set the ZF, PF and CF flags to show the result. The destination operand must be a SSE register, the source operand can be a 128-bit memory location or SSE register.
shufpd moves any of the two double precision values from the destination operand into the low quad word of the destination operand, and any of the two values from the source operand into the high quad word of the destination operand. This instruction is analoguous to shufps and has the same rules for operand. Bit 0 of the third operand selects the value to be moved from the destination operand, bit 1 selects the value to be moved from the source operand, the rest of bits are reserved and must be zeroed.
unpckhpd performs an unpack of the high quad words from the source and destination operands, unpcklpd performs an unpack of the low quad words from the source and destination operands. They are analoguous to unpckhps and unpcklps, and have the same rules for operands.
cvtps2pd converts the packed two single precision floating point values to two packed double precision floating point values, the destination operand must be a SSE register, the source operand can be a 64-bit memory location or SSE register. cvtpd2ps converts the packed two double precision floating point values to packed two single precision floating point values, the destination operand must be a SSE register, the source operand can be a 128-bit memory location or SSE register. cvtss2sd converts the single precision floating point value to double precision floating point value, the destination operand must be a SSE register, the source operand can be a 32-bit memory location or SSE register. cvtsd2ss converts the double precision floating point value to single precision floating point value, the destination operand must be a SSE register, the source operand can be 64-bit memory location or SSE register.
cvtpi2pd converts packed two double word integers into the the packed double precision floating point values, the destination operand must be a SSE register, the source operand can be a 64-bit memory location or MMX register. cvtsi2sd converts a double word integer into a double precision floating point value, the destination operand must be a SSE register, the source operand can be a 32-bit memory location or 32-bit general register. cvtpd2pi converts packed double precision floating point values into packed two double word integers, the destination operand should be a MMX register, the source operand can be a 128-bit memory location or SSE register. cvttpd2pi performs the similar operation, except that truncation is used to round a source values to integers, rules for operands are the same. cvtsd2si converts a double precision floating point value into a double word integer, the destination operand should be a 32-bit general register, the source operand can be a 64-bit memory location or SSE register. cvttsd2si performs the similar operation, except that truncation is used to round a source value to integer, rules for operands are the same.

@Mikl___ · 21.08.2014, 10:18 **[ТС]**

cvtps2dq and cvttps2dq convert packed single precision floating point values to packed four double word integers, storing them in the destination operand. cvtpd2dq and cvttpd2dq convert packed double precision floating point values to packed two double word integers, storing the result in the low quad word of the destination operand. cvtdq2ps converts packed four double word integers to packed single precision floating point values.
For all these instructions destination operand must be a SSE register, the source operand can be a 128-bit memory location or SSE register.
cvtdq2pd converts packed two double word integers from the low quad word of the source operand to packed double precision floating point values, the source can be a 64-bit memory location or SSE register, destination has to be SSE register.
movdqa and movdqu transfer a double quad word operand containing packed integers from source operand to destination operand. At least one of the operands have to be a SSE register, the second one can be also a SSE register or 128-bit memory location. Memory operands for movdqa instruction must be aligned on boundary of 16 bytes, operands for movdqu instruction don't have to be aligned.
movq2dq moves the contents of the MMX source register to the low quad word of destination SSE register. movdq2q moves the low quad word from the source SSE register to the destination MMX register.

Assembler
1
2
movq2dq xmm0,mm1 ; move from MMX register to SSE register
movdq2q mm0,xmm1 ; move from SSE register to MMX register

All MMX instructions operating on the 64-bit packed integers (those with mnemonics starting with p) are extended to operate on 128-bit packed integers located in SSE registers. Additional syntax for these instructions needs an SSE register where MMX register was needed, and the 128-bit memory location or SSE register where 64-bit memory location or MMX register were needed. The exception is pshufw instruction, which doesn't allow extended syntax, but has two new variants: pshufhw and pshuflw, which allow only the extended syntax, and perform the same operation as pshufw on the high or low quad words of operands respectively. Also the new instruction pshufd is introduced, which performs the same operation as pshufw, but on the double words instead of words, it allows only the extended syntax.

Assembler
1
2
psubb xmm0,[esi] ; substract 16 packed bytes
pextrw eax,xmm0,7 ; extract highest word into eax

paddq performs the addition of packed quad words, psubq performs the substraction of packed quad words, pmuludq performs an unsigned multiplication of low double words from each corresponding quad words and returns the results in packed quad words.
These instructions follow the same rules for operands as the general MMX operations described in 2.1.14.
pslldq and psrldq perform logical shift left or right of the double quad word in the destination operand by the amount of bytes specied in the source operand. The destination operand should be a SSE register, source operand should be an 8-bit immediate value.
punpckhqdq interleaves the high quad word of the source operand and the high quad word of the destination operand and writes them to the destination SSE register. punpcklqdq interleaves the low quad word of the source operand and the low quad word of the destination operand and writes them to the destination SSE register. The source operand can be a 128-bit memory location or SSE register.
movntdq stores packed integer data from the SSE register to memory using non-temporal hint. The source operand should be a SSE register, the destination operand should be a 128-bit memory location. movntpd stores packed double precision values from the SSE register to memory using a non-temporal hint. Rules for operand are the same. movnti stores integer from a general register to memory using a non-temporal hint. The source operand should be a 32-bit general register, the destination operand should be a 32-bit memory location. maskmovdqu stores selected bytes from the rst operand into a 128-bit memory location using a non-temporal hint. Both operands should be a SSE registers, the second operand selects wich bytes from the source operand are written to memory. The memory location is pointed by DI (or EDI) register in the segment selected by DS and does not need to be aligned.
clflush writes and invalidates the cache line associated with the address of byte specied with the operand, which should be a 8-bit memory location.
lfence performs a serializing operation on all instruction loading from memory that were issued prior to it. mfence performs a serializing operation on all instruction accesing memory that were issued prior to it, and so it combines the functions of sfence (described in previous section) and lfence instructions. These instructions have no operands.

@Mikl___ · 21.08.2014, 10:18 **[ТС]**

2.1.17 Инструкции SSE3

SSE3 (PNI — Prescott New Instruction) — третья версия SIMD-расширения Intel, потомок SSE, SSE2 и MMX. Впервые представлено в ядре Prescott процессора Pentium 4. AMD предложила свою реализацию SSE3 для процессоров Athlon 64 (ядра Venice, San Diego и Newark).

Набор SSE3 содержит 13 инструкций:

FISTTP (x87) — преобразование вещественного числа в целое с сохранением целочисленного значения и округлением в сторону нуля
MOVSLDUP (SSE)
MOVSHDUP (SSE)
MOVDDUP (SSE2)
LDDQU (SSE/SSE2) — загрузка 128bit невыровненных данных из памяти в регистр xmm, с предотвращением пересечения границы строки кеша
ADDSUBPD (SSE) Add Subtract Packed Double
ADDSUBPS (SSE2) Add Subtract Packed Single
HADDPS (SSE) Horizontal Add Packed Single
HSUBPS (SSE) Horizontal Subtract Packed Single
HADDPD (SSE2) Horizontal Add Packed Double
HSUBPD (SSE2) Horizontal Subtract Packed Double
MONITOR (нет аналога в SSE3 для AMD)
MWAIT (нет аналога в SSE3 для AMD)

Наиболее заметное изменение - возможность горизонтальной работы с регистрами. Если говорить более конкретно, добавлены команды сложения и вычитания нескольких значений, хранящихся в одном регистре. Эти команды упростили ряд DSP и 3D-операций. Существует также новая команда для преобразования значений с плавающей точкой в целые без необходимости вносить изменения в глобальном режиме округления.
Prescott technology introduced some new instructions to improve the performance of SSE and SSE2 - this extension is called SSE3.
fisttp behaves like the fistp instruction and accepts the same operands, the only dierence is that it always used truncation, irrespective of the rounding mode.
movshdup loads into destination operand the 128-bit value obtained from the source value of the same size by lling the each quad word with the two duplicates of the value in its high double word.
movsldup performs the same action, except it duplicates the values of low double words. The destination operand should be SSE register, the source operand can be SSE register or 128-bit memory location.
movddup loads the 64-bit source value and duplicates it into high and low quad word of the destination operand. The destination operand should be SSE register, the source operand can be SSE register or 64-bit memory location.
lddqu is functionally equivalent to movdqu with memory as source operand, but it may improve performance when the source operand crosses a cacheline boundary. The destination operand has to be SSE register, the source operand must be 128-bit memory location.
addsubps performs single precision addition of second and fourth pairs and single precision substracion of the rst and third pairs of floating point values in the operands.
addsubpd performs double precision addition of the second pair and double precision substraction of the rst pair of floating point values in the operand.
haddps performs the addition of two single precision values within the each quad word of source and destination operands, and stores the results of such horizontal addition of values from destination operand into low quad word of destination operand, and the results from the source operand into high quad word of destination operand.
haddpd performs the addition of two double precision values within each operand, and stores the result from destination operand into low quad word of destination operand, and the result from source operand into high quad word of destination operand. All these instructions need the destination operand to be SSE register, source operand can be SSE register or 128-bit memory location.
monitor sets up an address range for monitoring of write-back stores. It need its three operands to be EAX, ECX and EDX register in that order.
mwait waits for a write-back store to the address range set up by the monitor instruction. It uses two operands with additional parameters, rst being the EAX and second the ECX register.
The functionality of SSE3 is further extended by the set of Supplemental SSE3 instructions (SSSE3). They generally follow the same rules for operands as all the MMX operations extended by SSE.
phaddw and phaddd perform the horizontal additional of the pairs of adjacent values from both the source and destination operand, and stores the sums into the destination (sums from the source operand go into lower part of destination register). They operate on 16-bit or 32-bit chunks, respectively. phaddsw performs the same operation on signed 16-bit packed values, but the result of each addition is saturated. phsubw and phsubd analogously perform the horizontal substraction of 16-bit or 32-bit packed value, and phsubsw performs the horizontal substraction of signed 16-bit packed values with saturation.
pabsb, pabsw and pabsd calculate the absolute value of each signed packed signed value in source operand and stores them into the destination register. They operator on 8-bit, 16-bit and 32-bit elements respectively.
pmaddubsw multiplies signed 8-bit values from the source operand with the corresponding unsigned 8-bit values from the destination operand to produce intermediate 16-bit values, and every adjacent pair of those intermediate values is then added horizontally and those 16-bit sums are stored into the destination operand.
pmulhrsw multiplies corresponding 16-bit integers from the source and destination operand to produce intermediate 32-bit values, and the 16 bits next to the highest bit of each of those values are then rounded and packed into the destination operand.
pshufb shues the bytes in the destination operand according to the mask provided by source operand - each of the bytes in source operand is an index of the target position for the corresponding byte in the destination.
psignb, psignw and psignd perform the operation on 8-bit, 16-bit or 32-bit integers in destination operand, depending on the signs of the values in the source. If the value in source is negative, the corresponding value in the destination register is negated, if the value in source is positive, no operation is performed on the corresponding value is performed, and if the value in source is zero, the value in destination is zeroed, too.
palignr appends the source operand to the destination operand to form the intermediate value of twice the size, and then extracts into the destination register the 64 or 128 bits that are right-aligned to the byte oset specied by the third operand, which should be an 8-bit immediate value. This is the only SSSE3 instruction that takes three arguments.

@Mikl___ · 22.08.2014, 08:41 **[ТС]**

2.1.18 AMD 3DNow!-инструкции

3DNow! — дополнительное расширение MMX. Причиной создания 3DNow! послужило стремление завоевать превосходство над процессорами производства компании Intel в области обработки мультимедийных данных.

Технология 3DNow! ввела 21 новую команду процессора и возможность оперировать 32-битными вещественными типами в стандартных MMX-регистрах. Также были добавлены специальные инструкции, оптимизирующие переключение в режим MMX/3DNow! (femms, которая заменяла стандартную инструкцию emms) и работу с кэшем процессора. Технология 3DNow! расширяла возможности технологии MMX, не требуя введения новых режимов работы процессора и новых регистров.

Начиная с микроархитектуры Bulldozer расширение не поддерживается (за исключением команды prefetch)

Инструкции 3DNow!

PAVGUSB — вычисление среднего 8-битовых целых значений
PI2FD — перевод 32-битных целых в вещественные числа
PF2ID — перевод вещественных в 32-битные целые числа
PFCMPGE — сравнение вещественных чисел, больше или равно
PFCMPGT — сравнение вещественных чисел, больше
PFCMPEQ — сравнение вещественных чисел, равно
PFACC — накопление суммы вещественных чисел
PFADD — сложение вещественных чисел
PFSUB — вычитание вещественных чисел
PFSUBR — обратное вычитание вещественных чисел
PFMIN — нахождение минимума вещественных чисел
PFMAX — нахождение максимума вещественных чисел
PFMUL — умножение вещественных чисел
PFRCP — нахождение приближённого значения обратного (1/x) вещественных чисел
PFRSQRT — нахождение приближённого значения обратного от квадратного корня (1/sqrt(x)) вещественных чисел
PFRCPIT1 — первый шаг вычисления значения обратного (1/x) вещественных чисел
PFRSQIT1 — первый шаг вычисления значения обратного от квадратного корня (1/sqrt(x)) вещественных чисел
PFRCPIT2 — второй шаr вычисления значения обратного или обратного от квадратного корня вещественных чисел
PMULHRW — умножение 16-битных целых чисел с округлением
FEMMS — быстрое переключение состояния FPU/MMX
PREFETCH/PREFETCHW — предвыборка строки кэша процессора из памяти

The 3DNow! extension adds a new MMX instructions to those described in 2.1.14, and introduces operation on the 64–bit packed floating point values, each consisting of two single precision floating point values.

These instructions follow the same rules as the general MMX operations, the destination operand should be a MMX register, the source operand can be a MMX register or 64–bit memory location. pavgusb computes the rounded averages of packed unsigned bytes. pmulhrw performs a signed multiply of the packed words, round the high word of each double word results and stores them in the destination operand.
pi2fd converts packed double word integers into packed floating point values.
pf2id converts packed floating point values into packed double word integers using truncation.
pi2fw converts packed word integers into packed floating point values, only low words of each double word in source operand are used.
pf2iw converts packed floating point values to packed word integers, results are extended to double words using the sign extension.
pfadd adds packed floating point values. pfsub and pfsubr substracts packed floating point values, the first one substracts source values from destination values, the second one substracts destination values from the source values.
pfmul multiplies packed floating point values.
pfacc adds the low and high floating point values of the destination operand, storing the result in the low double word of destination, and adds the low and high floating point values of the source operand, storing the result in the high double word of destination.
pfnacc substracts the high floating point value of the destination operand from the low, storing the result in the low double word of destination, and substracts the high floating point value of the source operand from the low, storing the result in the high double word of destination.
pfpnacc substracts the high floating point value of the destination operand from the low, storing the result in the low double word of destination, and adds the low and high floating point values of the source operand, storing the result in the high double word of destination.
pfmax and pfmin compute the maximum and minimum of floating point values.
pswapd reverses the high and low double word of the source operand.

pfrcp returns an estimates of the reciprocals of floating point values from the source operand, pfrsqrt returns an estimates of the reciprocal square roots of floating point values from the source operand, pfrcpit1 performs the first step in the Newton–Raphson iteration to refine the reciprocal approximation produced by pfrcp instruction, pfrsqit1 performs the first step in the Newton–Raphson iteration to refine the reciprocal square root approximation produced by pfrsqrt instruction, pfrcpit2 performs the second final step in the Newton–Raphson iteration to refine the reciprocal approximation or the reciprocal square root approximation.
pfcmpeq, pfcmpge and pfcmpgt compare the packed floating point values and sets all bits or zeroes all bits of the correspoding data element in the destination operand according to the result of comparison, first checks whether values are equal, second checks whether destination value is greater or equal to source value, third checks whether destination value is greater than source value.

prefetch and prefetchw load the line of data from memory that contains byte specified with the operand into the data cache, prefetchw instruction should be used when the data in the cache line is expected to be modified, otherwise the prefetch instruction should be used. The operand should be an 8–bit memory location.

femms performs a fast clear of MMX state. It has no operands.

@Mikl___ · 22.08.2014, 08:42 **[ТС]**

2.1.19 64-разрядные инструкции

The AMD64 and EM64T architectures (we will use the common name x86–64 for them both) extend the x86 instruction set for the 64–bit processing. While legacy and compatibility modes use the same set of registers and instructions, the new long mode extends the x86 operations to 64 bits and introduces several new registers. You can turn on generating the code for this mode with the use64 directive.

Each of the general purpose registers is extended to 64 bits and the eight whole new general purpose registers and также eight new SSE registers are added. See table 2.4 for the summary of new registers (only the ones that was not listed in table 1.2). The general purpose registers of smallers sizes are the low order portions of the larger ones. You can still access the ah, bh, ch and dh registers in long mode, but you cannot use them in the same instruction with any of the new registers.

In general any instruction from x86 architecture, which allowed 16–bit or 32–bit operand sizes, in long mode allows также the 64–bit operands. The 64– bit registers should be used for addressing in long mode, the 32–bit addressing is также allowed, but it’s not possible to use the addresses based on 16–bit registers. Below are the samples of new operations possible in long mode on the example of mov instruction:

Assembler
1
2
mov rax,r8 ; transfer 64-bit general register
mov al,[rbx] ; transfer memory addressed by 64-bit register

The long mode uses также the instruction pointer based addresses, you can specify it manually with the special RIP register symbol, but such addressing is также automatically generated by FASM, since there is no 64–bit absolute addressing in long mode. You can still force the assembler to use the 32–bit absolute addressing by putting the dword size override for address inside the square brackets. There is также one exception, where the 64–bit absolute addressing is possible, it’s the mov instruction with one of the operand being accumulator register, and second being the memory operand. To force the assembler to use the 64–bit absolute addressing there, use the qword size

Таблица 2.4: Новые регистры в режиме "long mode"

Тип регистра	общего назначения				SSE	AVX	AVX2
Разрядность	8	16	32	64	128	256	512
Названия				rax			zmm0
регистров				rcx			zmm1
				rdx			zmm2
				rbx			zmm3
	spl			rsp			zmm4
	bpl			rbp			zmm5
	sil			rsi			zmm6
	dil			rdi			zmm7
	r8b	r8w	r8d	r8	xmm8	ymm8	zmm8
	r9b	r9w	r9d	r9	xmm9	ymm9	zmm9
	r10b	r10w	r10d	r10	xmm10	ymm10	zmm10
	r11b	r11w	r11d	r11	xmm11	ymm11	zmm11
	r12b	r12w	r12d	r12	xmm12	ymm12	zmm12
	r13b	r13w	r13d	r13	xmm13	ymm13	zmm13
	r14b	r14w	r14d	r14	xmm14	ymm14	zmm14
	r15b	r15w	r15d	r15	xmm15	ymm15	zmm15

operator for address inside the square brackets. When no size operator is applied to address, assembler generates the optimal form automatically.

Assembler
1
2
3
4
mov [qword 0],rax ; absolute 64-bit addressing
mov [dword 0],r15d ; absolute 32-bit addressing
mov [0],rsi ; automatic RIP-relative addressing
mov [rip+3],sil ; manual RIP-relative addressing

также as the immediate operands for 64–bit operations only the signed 32–bit values are possible, with the only exception being the mov instruction with destination operand being 64–bit general purpose register. Trying to force the 64–bit immediate with any other instruction will cause an error. If any operation is performed on the 32–bit general registers in long mode, the upper 32 bits of the 64–bit registers containing them are filled with zeros. This is unlike the operations on 16–bit or 8–bit portions of those registers, which preserve the upper bits.

Three new type conversion instructions are available. The cdqe sign extends the double word in EAX into quad word and stores the result in RAX register. cqo sign extends the quad word in RAX into double quad word and stores the extra bits in the RDX register. These instructions have no operands. movsxd sign extends the double word source operand, being either the 32–bit register or memory, into 64–bit destination operand, which has to be register. No analogous instruction is needed for the zero extension, since it is done automatically by any operations on 32–bit registers, as noted in previous paragraph. And the movzx and movsx instructions, conforming to the general rule, can be used with 64–bit destination operand, allowing extension of byte or word values into quad words.

All the binary arithmetic and logical instruction are promoted to allow 64–bit operands in long mode. The use of decimal arithmetic instructions in long mode is prohibited.

The stack operations, like push and pop in long mode default to 64–bit operands and it’s not possible to use 32–bit operands with them. The pusha and popa are disallowed in long mode.

The indirect near jumps and calls in long mode default to 64–bit operands and it’s not possible to use the 32–bit operands with them. On the other hand, the indirect far jumps and calls allow any operands that were allowed by the x86 architecture and также 80–bit memory operand is allowed (though only EM64T seems to implement such variant), with the first eight bytes defining the offset and two last bytes specifying the selector. The direct far jumps and calls are not allowed in long mode.

The I/O instructions, in, out, ins and outs are the exceptional instructions that are not extended to accept quad word operands in long mode. But all other string operations are, and there are new short forms movsq, cmpsq, scasq, lodsq and stosq introduced for the variants of string operations for 64–bit string elements. The RSI and RDI registers are used by default to address the string elements.

The lfs, lgs and lss instructions are extended to accept 80–bit source memory operand with 64–bit destination register (though only EM64T seems to implement such variant). The lds and les are disallowed in long mode. The system instructions like lgdt which required the 48–bit memory operand, in long mode require the 80–bit memory operand.

The cmpxchg16b is the 64–bit equivalent of cmpxchg8b instruction, it uses the double quad word memory operand and 64–bit registers to perform the analoguous operation.

swapgs is the new instruction, which swaps the contents of GS register and the KernelGSbase model–specific register (MSR address 0C0000102h). syscall and sysret is the pair of new instructions that provide the functionality similar to sysenter and sysexit in long mode, where the latter pair is disallowed.

@Mikl___ · 22.08.2014, 08:43 **[ТС]**

2.1.20 Инструкции SSE4

SSE4 — новый набор команд микроархитектуры Intel Core, впервые реализованный в процессорах серии Penryn

SSE4 состоит из 54 инструкций, 47 из них относят к SSE4.1 (они есть в процессорах Penryn). Полный набор команд (SSE4.1 и SSE4.2, то есть 47 + оставшиеся 7 команд) доступен в процессорах Intel с микроархитектурой Nehalem, которые были выпущены в середине ноября 2008 года и более поздних редакциях. Ни одна из SSE4 инструкций не работает с 64-х битными mmx регистрами (только с 128-ми битными xmm0-15).

Добавлены инструкции, ускоряющие компенсацию движения в видеокодеках, быстрое чтение из USWC памяти, множество инструкций для упрощения векторизации программ компиляторами. Кроме того, в SSE4.2 добавлены инструкции обработки строк 8/16 битных символов, вычисления CRC32, popcnt. Впервые в SSE4 регистр xmm0 стал использоваться как неявный аргумент для некоторых инструкций.

There are actually three dierent sets of instructions under the name SSE4. Intel de- signed two of them, SSE4.1 and SSE4.2, with latter extending the former into the full Intel's SSE4 set. On the other hand, the implementation by AMD includes only a few instructions from this set, but также contains some additional instructions, that are called the SSE4a set.
The SSE4.1 instructions mostly follow the same rules for operands, as the basic SSE operations, so they require destination operand to be SSE register and source operand to be 128-bit memory location or SSE register, and some operations require a third operand, the 8-bit immediate value.
pmulld performs a signed multiplication of the packed double words and stores the low double words of the results in the destination operand.
pmuldq performs a two signed multiplications of the corresponding double words in the lower quad words of operands, and stores the results as packed quad words into the destination register.
pminsb and pmaxsb return the minimum or maximum values of packed signed bytes, pminuw and pmaxuw return the minimum and maximum values of packed unsigned words, pminud, pmaxud, pminsd and pmaxsd return minimum or maximum values of packed unsigned or signed words. These instructions complement the instructions computing packed minimum or maximum introduced by SSE.
ptest sets the ZF ag to one when the result of bitwise AND of the both operands is zero, and zeroes the ZF otherwise. It также sets CF ag to one, when the result of bitwise AND of the destination operand with the bitwise NOT of the source operand is zero, and zeroes the CF otherwise.
pcmpeqq compares packed quad words for equality, and fills the corresponding elements of destination operand with either ones or zeros, depending on the result of comparison. packusdw converts packed signed double words from both the source and destination operand into the unsigned words using saturation, and stores the eight resulting word values into the destination register.
phminposuw finds the minimum unsigned word value in source operand and places it into the lowest word of destination operand, setting the remaining upper bits of destination to zero.
roundps, roundss, roundpd and roundsd perform the rounding of packed or individual oating point value of single or double precision, using the rounding mode specified by the third operand.

Assembler
1
roundsd xmm0,xmm1,0011b ; round toward zero

dpps calculates dot product of packed single precision oating point values, that is it multiplies the corresponding pairs of values from source and destination operand and then sums the products up. The high four bits of the 8-bit immediate third operand control which products are calculated and taken to the sum, and the low four bits control, into which elements of destination the resulting dot product is copied (the other elements are filled with zero). dppd calculates dot product of packed double precision oating point values. The bits 4 and 5 of third operand control, which products are calculated and added, and bits 0 and 1 of this value control, which elements in destination register should get filled with the result.
mpsadbw calculates multiple sums of absolute dierences of unsigned bytes. The third operand controls, with value in bits 0-1, which of the four-byte blocks in source operand is taken to calculate the absolute dierencies, and with value in bit 2, at which of the two first four-byte block in destination operand start calculating multiple sums. The sum is calculated from four absolute dierencies between the corresponding unsigned bytes in the source and destination block, and each next sum is calculated in the same way, but taking the four bytes from destination at the position one byte after the position of previous block. The four bytes from the source stay the same each time. This way eight sums of absolute dierencies are calculated and stored as packed word values into the destination operand. The instructions described in this paragraph follow the same rules for operands, as roundps instruction. blendps, blendvps, blendpd and blendvpd conditionally copy the values from source operand into the destination operand, depending on the bits of the mask provided by third operand. If a mask bit is set, the corresponding element of source is copied into the same place in destination, otherwise this position is destination is left unchanged. The rules for the first two operands are the same, as for general SSE instructions. blendps and blendpd need third operand to be 8-bit immediate, and they operate on single or double precision values, respectively. blendvps and blendvpd require third operand to be the XMM0 register.

Assembler
1
blendvps xmm3,xmm7,xmm0 ; blend according to mask

pblendw conditionally copies word elements from the source operand into the destination, depending on the bits of mask provided by third operand, which needs to be 8-bit immediate value. pblendvb conditionally copies byte elements from the source operands into destination, depending on mask defined by the third operand, which has to be XMM0 register. These instructions follow the same rules for operands as blendps and blendvps instructions, respectively. insertps inserts a single precision oating point value taken from the position in source operand specified by bits 6-7 of third operand into location in destination register selected by bits 4-5 of third operand. Additionally, the low four bits of third operand control, which elements in destination register will be set to zero. The first two operands follow the same rules as for the general SSE operation, the third operand should be 8-bit immediate. extractps extracts a single precision oating point value taken from the location in source operand specified by low two bits of third operand, and stores it into the destination operand. The destination can be a 32-bit memory value or general purpose register, the source operand must be SSE register, and the third operand should be 8-bit immediate value.

Assembler
1
extractps edx,xmm3,3 ; extract the highest value

pinsrb, pinsrd and pinsrq copy a byte, double word or quad word from the source operand into the location of destination operand determined by the third operand. The destination operand has to be SSE register, the source operand can be a memory location of appropriate size, or the 32-bit general purpose register (but 64-bit general purpose register for pinsrq, which is only available in long mode), and the third operand has to be 8-bit immediate value. These instructions complement the pinsrw instruction operating on SSE register destination, which was introduced by SSE2. pinsrd xmm4,eax,1 ; insert double word into second position pextrb, pextrw, pextrd and pextrq copy a byte, word, double word or quad word from the location in source operand specified by third operand, into the destination. The source operand should be SSE register, the third operand should be 8-bit immediate, and the destination operand can be memory location of appropriate size, or the 32-bit general purpose register (but 64-bit general purpose register for pextrq, which is only available in long mode). The pextrw instruction with SSE register as source was already introduced by SSE2, but SSE4 extends it to allow memory operand as destination. pextrw [ebx],xmm3,7 ; extract highest word into memory pmovsxbw and pmovzxbw perform sign extension or zero extension of eight byte values from the source operand into packed word values in destination operand, which has to be SSE register. The source can be 64-bit memory or SSE register ― when it is register, only its low portion is used. pmovsxbd and pmovzxbd perform sign extension or zero extension of the four byte values from the source operand into packed double word values in destination operand, the source can be 32-bit memory or SSE register. pmovsxbq and pmovzxbq perform sign extension or zero extension of the two byte values from the source operand into packed quad word values in destination operand, the source can be 16-bit memory or SSE register. pmovsxwd and pmovzxwd perform sign extension or zero extension of the four word values from the source operand into packed double words in destination operand, the source can be 64-bit memory or SSE register. pmovsxwq and pmovzxwq perform sign extension or zero extension of the two word values from the source operand into packed quad words in destination operand, the source can be 32-bit memory or SSE register. pmovsxdq and pmovzxdq perform sign extension or zero extension of the two double word values from the source operand into packed quad words in destination operand, the source can be 64-bit memory or SSE register.

@Mikl___ · Инструкция Описание

2.1.21 Инструкции AVX

Advanced Vector Extensions (AVX) — расширение системы команд. AVX предоставляет различные улучшения, новые инструкции и новую схему кодирования машинных кодов.

Новая схема кодирования инструкций VEX
Ширина векторных регистров SIMD увеличивается со 128 (XMM) до 256 бит (регистры YMM0 — YMM15). Существующие 128-битные SSE инструкции будут использовать младшую половину новых YMM регистров, не изменяя старшую часть. Для работы с YMM регистрами добавлены новые 256-битные AVX инструкции. В будущем возможно расширение векторных регистров SIMD до 512 или 1024 бит. Например, процессоры с архитектурой Larrabee уже имеют векторные регистры (ZMM) шириной в 512 бит, и используют для работы с ними SIMD команды с MVEX и VEX префиксами, но при этом они не поддерживают AVX.

Неразрушающие операции
Набор AVX инструкций использует трехоперандный синтаксис. Например, вместо a = a + b можно использовать c = a + b, при этом регистр a остается неизмененным. В случаях, когда значение a используется дальше в вычислениях, это повышает производительность, так как избавляет от необходимости сохранять перед вычислением и восстанавливать после вычисления регистр, содержавший a, из другого регистра или памяти.

Для большинства новых инструкций отсутствуют требования к выравниванию операндов в памяти. Однако рекомендуется следить за выравниванием на размер операнда, во избежание значительного снижения производительности.

Набор инструкций AVX содержит в себе аналоги 128-битных SSE инструкций для вещественных чисел. При этом, в отличие от оригиналов, сохранение 128-битного результата будет обнулять старшую половину YMM регистра. 128-битные AVX инструкции сохраняют прочие преимущества AVX, такие как новая схема кодирования, трехоперандный синтаксис и невыровненный доступ к памяти. Рекомендуется отказаться от старых SSE инструкций в пользу новых 128-битных AVX инструкций, даже если достаточно двух операндов

Новая схема кодирования
Новая схема кодирования инструкций VEX использует VEX префикс. В настоящий момент существуют два VEX префикса, длиной 2 и 3 байта. Для 2-х байтного VEX префикса первый байт равен 0xC5, для 3-х байтного 0xC4. В 64-битном режиме первый байт VEX префикса уникален. В 32-битном режиме возникает конфликт с инструкциями LES и LDS, который разрешается старшим битом второго байта, он имеет значение только в 64-битном режиме, через неподдерживаемые формы инструкций LES и LDS. Длина существующих AVX инструкций, вместе с VEX префиксом, не превышает 11 байт. В следующих версиях ожидается появление более длинных инструкций.

Новые инструкции

Инструкция	Описание
VBROADCASTSS, VBROADCASTSD, VBROADCASTF128	Копирует 32-х, 64-х или 128-ми битный операнд из памяти во все элементы векторного регистра XMM или YMM.
VINSERTF128	Замещает младшую или старшую половину 256-ти битного регистра YMM значением 128-ми битного операнда. Другая часть регистра-получателя не изменяется.
VEXTRACTF128	Извлекает младшую или старшую половину 256-ти битного регистра YMM и копирует в 128-ми битный операнд-назначение.
VMASKMOVPS, VMASKMOVPD	Условно считывает любое количество элементов из векторного операнда из памяти в регистр-получатель, оставляя остальные элементы несчитанными и обнуляя соответствующие им элементы регистра-получателя. Также может условно записывать любое количество элементов из векторного регистра в векторный операнд в памяти, оставляя остальные элементы операнда памяти неизменёнными
VPERMILPS, VPERMILPD	Переставляет 32-х или 64-х битные элементы вектора согласно операнду-селектору (из памяти или из регистра).
VPERM2F128	Переставляет 4 128-ми битных элемента двух 256-ти битных регистров в 256-ти битный операнд-назначение с использованием непосредственной константы (imm) в качестве селектора.
VZEROALL	Обнуляет все YMM регистры и помечает их как неиспользуемые. Используется при переключении между 128-ми битным режимом и 256-ти битным.
VZEROUPPER	Обнуляет старшие половины всех регистров YMM. Используется при переключении между 128-ми битным режимом и 256-ти битным.

Также в спецификации AVX описана группа инструкций PCLMUL (Parallel Carry-Less Multiplication, Parallel CLMUL)

Assembler
1
2
3
4
5
PCLMULLQLQDQ xmmreg,xmmrm [rm: 66 0f 3a 44 /r 00]
PCLMULHQLQDQ xmmreg,xmmrm [rm: 66 0f 3a 44 /r 01]
PCLMULLQHQDQ xmmreg,xmmrm [rm: 66 0f 3a 44 /r 02]
PCLMULHQHQDQ xmmreg,xmmrm [rm: 66 0f 3a 44 /r 03]
PCLMULQDQ xmmreg,xmmrm,imm [rmi: 66 0f 3a 44 /r ib]

Применение
Подходит для интенсивных вычислений с плавающей точкой в мультимедиа программах и научных задачах. Там, где возможна более высокая степень параллелизма, увеличивает производительность с вещественными числами.

The Advanced Vector Extensions introduce instructions that are new variants of SSE instructions, with new scheme of encoding that allows extended syntax having a destination operand separate from all the source operands. It также introduces 256-bit AVX registers, which extend up the old 128-bit SSE registers. Any AVX instruction that puts some result into SSE register, puts zero bits into high portion of the AVX register containing it.
The AVX version of SSE instruction has the mnemonic obtained by prepending SSE instruction name with v. For any SSE arithmetic instruction which had a destination operand also being used as one of the source values, the AVX variant has a new syntax with three operands - the destination and two sources. The destination and first source can be SSE registers, and second source can be SSE register or memory. If the operation is performed on single pair of values, the remaining bits of first source SSE register are copied into the the destination register.

Assembler
1
2
vsubss xmm0,xmm2,xmm3 ; substract two 32-bit floats
vmulsd xmm0,xmm7,qword [esi] ; multiply two 64-bit floats

In case of packed operations, each instruction can also operate on the 256-bit data size when the AVX registers are specified instead of SSE registers, and the size of memory operand is also doubled then.

Assembler
1
vaddps ymm1,ymm5,yword [esi] ; eight sums of 32-bit float pairs

The instructions that operate on packed integer types (in particular the ones that earlier had been promoted from MMX to SSE) also acquired the new syntax with three operands, however they are only allowed to operate on 128-bit packed types and thus cannot use the whole AVX registers.

Assembler
1
2
vpavgw xmm3,xmm0,xmm2 ; average of 16-bit integers
vpslld xmm1,xmm0,1 ; shift double words left

If the SSE version of instruction had a syntax with three operands, the third one being an immediate value, the AVX version of such instruction takes four operands, with immediate remaining the last one.

Assembler
1
2
vshufpd ymm0,ymm1,ymm2,10010011b ; shuffle 64-bit floats
vpalignr xmm0,xmm4,xmm2,3 ; extract byte aligned value

The promotion to new syntax according to the rules described above has been applied to all the instructions from SSE extensions up to SSE4, with the exceptions described below.
vdppd instruction has syntax extended to four operans, but it does not have a 256-bit version.

The are a few instructions, namely vsqrtpd, vsqrtps, vrcpps and vrsqrtps, which can operate on 256-bit data size, but retained the syntax with only two operands, because they use data from only one source:

Assembler
1
vsqrtpd ymm1,ymm0 ; put square roots into other register

In a similar way vroundpd and vroundps retained the syntax with three operands, the last one being immediate value.

Assembler
1
vroundps ymm0,ymm1,0011b ; round toward zero

Also some of the operations on packed integers kept their two-operand or three- operand syntax while being promoted to AVX version. In such case these instruc- tions follow exactly the same rules for operands as their SSE counterparts (since op- erations on packed integers do not have 256-bit variants in AVX extension). These include vpcmpestri, vpcmpestrm, vpcmpistri, vpcmpistrm, vphminposuw, vpshufd, vpshufhw, vpshuflw. And there are more instructions that in AVX versions keep exactly the same syntax for operands as the one from SSE, without any additional options: vcomiss, vcomisd, vcvtss2si, vcvtsd2si, vcvttss2si, vcvttsd2si, vextractps, vpextrb, vpextrw, vpextrd, vpextrq, vmovd, vmovq, vmovntdqa, vmaskmovdqu, vpmovmskb, vpmovsxbw, vpmovsxbd, vpmovsxbq, vpmovsxwd, vpmovsxwq, vpmovsxdq, vpmovzxbw, vpmovzxbd, vpmovzxbq, vpmovzxwd, vpmovzxwq and vpmovzxdq.

@Mikl___ · 22.08.2014, 08:44 **[ТС]**

Assembler
1
2
pmovzxbq xmm0,word [si] ; zero-extend bytes to quad words
pmovsxwq xmm0,xmm1 ; sign-extend words to quad words

movntdqa loads double quad word from the source operand to the destination using a non-temporal hint. The destination operand should be SSE register, and the source operand should be 128-bit memory location. The SSE4.2, described below, adds not only some new operations on SSE registers, but также introduces some completely new instructions operating on general purpose registers only. pcmpistri compares two zero-ended (implicit length) strings provided in its source and destination operand and generates an index stored to ECX; pcmpistrm performs the same comparison and generates a mask stored to XMM0. pcmpestri compares two strings of explicit lengths, with length provided in EAX for the destination operand and in EDX for the source operand, and generates an index stored to ECX; pcmpestrm performs the same comparision and generates a mask stored to XMM0. The source and destination operand follow the same rules as for general SSE instructions, the third operand should be 8-bit immediate value determining the details of performed operation - refer to Intel documentation for information on those details. pcmpgtq compares packed quad words, and fills the corresponding elements of desti- nation operand with either ones or zeros, depending on whether the value in destination is greater than the one in source, or not. This instruction follows the same rules for operands as pcmpeqq. crc32 accumulates a CRC32 value for the source operand starting with initial value provided by destination operand, and stores the result in destination. Unless in long mode, the destination operand should be a 32-bit general purpose register, and the source operand can be a byte, word, or double word register or memory location. In long mode the destination operand can также be a 64-bit general purpose register, and the source operand in such case can be a byte or quad word register or memory location.

Assembler
1
2
3
crc32 eax,dl ; accumulate CRC32 on byte value
crc32 eax,word [ebx] ; accumulate CRC32 on word value
crc32 rax,qword [rbx] ; accumulate CRC32 on quad word value

popcnt calculates the number of bits set in the source operand, which can be 16-bit, 32-bit, or 64-bit general purpose register or memory location, and stores this count in the destination operand, which has to be register of the same size as source operand. The 64-bit variant is available only in long mode.

Assembler
1
popcnt ecx,eax ; count bits set to 1

The SSE4a extension, which также includes the popcnt instruction introduced by SSE4.2, at the same time adds the lzcnt instruction, which follows the same syntax, and calculates the count of leading zero bits in source operand (if the source operand is all zero bits, the total number of bits in source operand is stored in destination). extrq extract the sequence of bits from the low quad word of SSE register provided as first operand and stores them at the low end of this register, filling the remaining bits in the low quad word with zeros. The position of bit string and its length can either be provided with two 8-bit immediate values as second and third operand, or by SSE register as second operand (and there is no third operand in such case), which should contain position value in bits 8-13 and length of bit string in bits 0-5.

Assembler
1
2
extrq xmm0,8,7 ; extract 8 bits from position 7
extrq xmm0,xmm5 ; extract bits defined by register

insertq writes the sequence of bits from the low quad word of the source operand into specified position in low quad word of the destination operand, leaving the other bits in low quad word of destination intact. The position where bits should be written and the length of bit string can either be provided with two 8-bit immediate values as third and fourth operand, or by the bit fields in source operand (and there are only two operands in such case), which should contain position value in bits 72-77 and length of bit string in bits 64-69.

Assembler
1
2
insertq xmm1,xmm0,4,2 ; insert 4 bits at position 2
insertq xmm1,xmm0 ; insert bits defined by register

movntss and movntsd store single or double precision oating point value from the source SSE register into 32-bit or 64-bit destination memory location respectively, using non-temporal hint.

@Mikl___ · 22.08.2014, 12:14 **[ТС]**

The move and conversion instructions have mostly been promoted to allow 256-bit size operands in addition to the 128-bit variant with syntax identical to that from SSE version of the same instruction. Each of the vcvtdq2ps, vcvtps2dq and vcvttps2dq, vmovaps, vmovapd, vmovups, vmovupd, vmovdqa, vmovdqu, vlddqu, vmovntps, vmovntpd, vmovntdq, vmovsldup, vmovshdup, vmovmskps and vmovmskpd inherits the 128-bit syntax from SSE without any changes, and also allows a new form with 256-bit operands in place of 128-bit ones.

Assembler
1
vmovups [edi],ymm6 ; store unaligned 256-bit data

vmovddup has the identical 128-bit syntax as its SSE version, and it also has a 256-bit version, which stores the duplicates of the lowest quad word from the source operand in the lower half of destination operand, and in the upper half of destination the duplicates of the low quad word from the upper half of source. Both source and destination operands need then to be 256-bit values.

vmovlhps and vmovhlps have only 128-bit versions, and each takes three operands, which all must be SSE registers. vmovlhps copies two single precision values from the low quad word of second source register to the high quad word of destination register, and copies the low quad word of first source register into the low quad word of destination register. vmovhlps copies two single precision values from the high quad word of second source register to the low quad word of destination register, and copies the high quad word of first source register into the high quad word of destination register. vmovlps, vmovhps, vmovlpd and vmovhpd have only 128-bit versions and their syntax varies depending on whether memory operand is a destination or source. When memory is destination, the syntax is identical to the one of equivalent SSE instruction, and when memory is source, the instruction requires three operands, first two being SSE registers and the third one 64-bit memory. The value put into destination is then the value copied from first source with either low or high quad word replaced with value from second source (the memory operand).

Assembler
1
2
vmovhps [esi],xmm7 ; store upper half to memory
vmovlps xmm0,xmm7,[ebx] ; low from memory, rest from register

vmovss and vmovsd have syntax identical to their SSE equivalents as long as one of the operands is memory, while the versions that operate purely on registers require three operands (each being SSE register). The value stored in destination is then the value copied from first source with lowest data element replaced with the lowest value from second source.

Assembler
1
2
vmovss xmm3,[edi] ; low from memory, rest zeroed
vmovss xmm0,xmm1,xmm2 ; one value from xmm2, three from xmm1

vcvtss2sd, vcvtsd2ss, vcvtsi2ss and vcvtsi2d use the three-operand syntax, where destination and first source are always SSE registers, and the second source follows the same rules and the source in syntax of equivalent SSE instruction. The value stored in destination is then the value copied from first source with lowest data element replaced with the result of conversion.

Assembler
1
2
vcvtsi2sd xmm4,xmm4,ecx ; 32-bit integer to 64-bit float
vcvtsi2ss xmm0,xmm0,rax ; 64-bit integer to 32-bit float

vcvtdq2pd and vcvtps2pd allow the same syntax as their SSE equivalents, plus the new variants with AVX register as destination and SSE register or 128-bit memory as source. Analogously vcvtpd2dq, vcvttpd2dq and vcvtpd2ps, in addition to variant with syntax identical to SSE version, allow a variant with SSE register as destination and AVX register or 256-bit memory as source.

vinsertps, vpinsrb, vpinsrw, vpinsrd, vpinsrq and vpblendw use a syntax with four operands, where destination and first source have to be SSE registers, and the third and fourth operand follow the same rules as second and third operand in the syntax of equivalent SSE instruction. Value stored in destination is the the value copied from first source with some data elements replaced with values extracted from the second source, analogously to the operation of corresponding SSE instruction.

Assembler
1
vpinsrd xmm0,xmm0,eax,3 ; insert double word

vblendvps, vblendvpd and vpblendvb use a new syntax with four register operands: destination, two sources and a mask, where second source can also be a memory operand. vblendvps and vblendvpd have 256-bit variant, where operands are AVX registers or 256-bit memory, as well as 128-bit variant, which has operands being SSE registers or 128-bit memory. vpblendvb has only a 128-bit variant. Value stored in destination is the value copied from the first source with some data elements replaced, according to mask, by values from the second source.

Assembler
1
vblendvps ymm3,ymm1,ymm2,ymm7 ; blend according to mask

vptest allows the same syntax as its SSE version and также has a 256-bit version, with both operands doubled in size. There are также two new instructions, vtestps and vtestpd, which perform analogous tests, but only of the sign bits of corresponding single precision or double precision values, and set the ZF and CF accordingly. They follow the same syntax rules as vptest.

Assembler
1
2
vptest ymm0,yword [ebx] ; test 256-bit values
vtestpd xmm0,xmm1 ; test sign bits of 64-bit floats

vbroadcastss, vbroadcastsd and vbroadcastf128 are new instructions, which broadcast the data element defined by source operand into all elements of corresponing size in the destination register. vbroadcastss needs source to be 32-bit memory and destination to be either SSE or AVX register. vbroadcastsd requires 64-bit memory as source, and AVX register as destination.

vbroadcastf128 requires 128-bit memory as source, and AVX register as destination.

Assembler
1
vbroadcastss ymm0,dword [eax] ; get eight copies of value

vinsertf128 is the new instruction, which takes four operands. The destination and first source have to be AVX registers, second source can be SSE register or 128-bit memory location, and fourth operand should be an immediate value. It stores in destination the value obtained by taking contents of first source and replacing one of its 128-bit units with value of the second source. The lowest bit of fourth operand specifies at which position that replacement is done (either 0 or 1).
vextractf128 is the new instruction with three operands. The destination needs to be SSE register or 128-bit memory location, the source must be AVX register, and the third operand should be an immediate value. It extracts into destination one of the 128-bit units from source. The lowest bit of third operand specifies, which unit is extracted.
vmaskmovps and vmaskmovpd are the new instructions with three operands that selectively store in destination the elements from second source depending on the sign bits of corresponding elements from first source. These instructions can operate on either 128-bit data (SSE registers) or 256-bit data (AVX registers). Either destination or second source has to be a memory location of appropriate size, the two other operands should be registers.

Assembler
1
2
vmaskmovps [edi],xmm0,xmm5 ; conditionally store
vmaskmovpd ymm5,ymm0,[esi] ; conditionally load

vpermilpd and vpermilps are the new instructions with three operands that permute the values from first source according to the control fields from second source and put the result into destination operand. It allows to use either three SSE registers or three AVX registers as its operands, the second source can be a memory of size equal to the registers used. In alternative form the second source can be immediate value and then the first source can be a memory location of the size equal to destination register.
vperm2f128 is the new instruction with four operands, which selects 128-bit blocks of oating point data from first and second source according to the bit fields from fourth operand, and stores them in destination. Destination and first source need to be AVX registers, second source can be AVX register or 256-bit memory area, and fourth operand should be an immediate value.

Assembler
1
vperm2f128 ymm0,ymm6,ymm7,12h ; permute 128-bit blocks

vzeroall instruction sets all the AVX registers to zero. vzeroupper sets the upper 128-bit portions of all AVX registers to zero, leaving the SSE registers intact. These new instructions take no operands.
vldmxcsr and vstmxcsr are the AVX versions of ldmxcsr and stmxcsr instructions. The rules for their operands remain unchanged.

@Mikl___ · 22.08.2014, 12:31 **[ТС]**

2.1.22 Инструкции AVX2

AVX-512 расширяет AVX до векторов длиной 512-бит при помощи кодировки с префиксом EVEX. Расширение AVX-512 вводит 32 векторных регистра (ZMM), каждый по 512 бит, 8 регистров масок, 512-разрядные упакованные форматы для целых и дробных числе и операции над ними, тонкое управление режимами округления (позволяет переопределить глобальные настройки), операции broadcast, подавление ошибок в операциях с дробными числами, операции gather/scatter, быстрые математические операции, компактное кодирование больших смещений. AVX-512 предлагает совместимость с AVX, в том смысле, что программа может использовать инструкции как AVX, так и AVX-512 без снижения производительности. Регистры AVX (YMM0–YMM15) отображаются на младшие части регистров AVX-512 (ZMM0–ZMM15), по аналогии с SSE и AVX регистрами.

The AVX2 extension allows all the AVX instructions operating on packed integers to use 256-bit data types, and introduces some new instructions as well. The AVX instructions that operate on packed integers and had only a 128-bit vari- ants, have been supplemented with 256-bit variants, and thus their syntax rules became analogous to AVX instructions operating on packed oating point types.

Assembler
1
2
vpsubb ymm0,ymm0,[esi] ; substract 32 packed bytes
vpavgw ymm3,ymm0,ymm2 ; average of 16-bit integers

However there are some instructions that have not been equipped with the 256-bit vari- ants. vpcmpestri, vpcmpestrm, vpcmpistri, vpcmpistrm, vpextrb, vpextrw, vpextrd, vpextrq, vpinsrb, vpinsrw, vpinsrd, vpinsrq and vphminposuw are not aected by AVX2 and allow only the 128-bit operands. The packed shift instructions, which allowed the third operand specifying amount to be SSE register or 128-bit memory location, use the same rules for the third operand in their 256-bit variant.

Assembler
1
2
vpsllw ymm2,ymm2,xmm4 ; shift words left
vpsrad ymm0,ymm3,xword [ebx] ; shift double words right

There are также new packed shift instructions with standard three-operand AVX syn- tax, which shift each element from first source by the amount specified in corresponding element of second source, and store the results in destination. vpsllvd shifts 32-bit elements left, vpsllvq shifts 64-bit elements left, vpsrlvd shifts 32-bit elements right logically, vpsrlvq shifts 64-bit elements right logically and vpsravd shifts 32-bit ele- ments right arithmetically. The sign-extend and zero-extend instructions, which in AVX versions allowed source operand to be SSE register or a memory of specific size, in the new 256-bit variant need memory of that size doubled or SSE register as source and AVX register as destination. vpmovzxbq ymm0,dword [esi] ; bytes to quad words также vmovntdqa has been upgraded with 256-bit variant, so it allows to transfer 256-bit value from memory to AVX register, it needs memory address to be aligned to 32 bytes. vpmaskmovd and vpmaskmovq are the new instructions with syntax identical to vmaskmovps or vmaskmovpd, and they performs analogous operation on packed 32-bit or 64-bit values. vinserti128, vextracti128, vbroadcasti128 and vperm2i128 are the new in- structions with syntax identical to vinsertf128, vextractf128, vbroadcastf128 and vperm2f128 respectively, and they perform analogous operations on 128-bit blocks of integer data. vbroadcastss and vbroadcastsd instructions have been extended to allow SSE register as a source operand (which in AVX could only be a memory). vpbroadcastb, vpbroadcastw, vpbroadcastd and vpbroadcastq are the new in- structions which broadcast the byte, word, double word or quad word from the source operand into all elements of corresponing size in the destination register. The destina- tion operand can be either SSE or AVX register, and the source operand can be SSE register or memory of size equal to the size of data element. vpbroadcastb ymm0,byte [ebx] ; get 32 identical bytes vpermd and vpermps are new three-operand instructions, which use each 32-bit element from first source as an index of element in second source which is copied into destination at position corresponding to element containing index. The destination and first source have to be AVX registers, and the second source can be AVX register or 256-bit memory. vpermq and vpermpd are new three-operand instructions, which use 2-bit indexes from the immediate value specified as third operand to determine which element from source store at given position in destination. The destination has to be AVX register, source can be AVX register or 256-bit memory, and the third operand must be 8-bit immediate value. The family of new instructions performing gather operation have special syntax, as in their memory operand they use addressing mode that is unique to them. The base of address can be a 32-bit or 64-bit general purpose register (the latter only in long mode), and the index (possibly multiplied by scale value, as in standard addressing) is specified by SSE or AVX register. It is possible to use only index without base and any numerical displacement can be added to the address. Each of those instructions takes three operands. First operand is the destination register, second operand is memory addressed with a vector index, and third operand is register containing a mask. The most significant bit of each element of mask determines whether a value will be loaded from memory into corresponding element in destination. The address of each element to load is determined by using the corresponding element from index register in memory operand to calculate final address with given base and displacement. When the index register contains less elements than the destination and mask registers, the higher elements of destination are zeroed. After the value is successfuly loaded, the corresponding element in mask register is set to zero. The destination, index and mask should all be distinct registers, it is not allowed to use the same register in two dierent roles. vgatherdps loads single precision oating point values addressed by 32-bit indexes. The destination, index and mask should all be registers of the same type, either SSE or AVX. The data addressed by memory operand is 32-bit in size.

Assembler
1
2
vgatherdps xmm0,[eax+xmm1],xmm3 ; gather four floats
vgatherdps ymm0,[ebx+ymm7*4],ymm3 ; gather eight floats

vgatherqps loads single precision oating point values addressed by 64-bit indexes. The destination and mask should always be SSE registers, while index register can be either SSE or AVX register. The data addressed by memory operand is 32-bit in size.

Assembler
1
2
vgatherqps xmm0,[xmm2],xmm3 ; gather two floats
vgatherqps xmm0,[ymm2+64],xmm3 ; gather four floats

vgatherdpd loads double precision oating point values addressed by 32-bit indexes. The index register should always be SSE register, the destination and mask should be two registers of the same type, either SSE or AVX. The data addressed by memory operand is 64-bit in size.

Assembler
1
2
vgatherdpd xmm0,[ebp+xmm1],xmm3 ; gather two doubles
vgatherdpd ymm0,[xmm3*8],ymm5 ; gather four doubles

vgatherqpd loads double precision oating point values addressed by 64-bit indexes. The destination, index and mask should all be registers of the same type, either SSE or AVX. The data addressed by memory operand is 64-bit in size. vpgatherdd and vpgatherqd load 32-bit values addressed by either 32-bit or 64-bit indexes. They follow the same rules as vgatherdps and vgatherqps respectively. vpgatherdq and vpgatherqq load 64-bit values addressed by either 32-bit or 64-bit indexes. They follow the same rules as vgatherdpd and vgatherqpd respectively.

@Mikl___ · 24.08.2014, 06:07 **[ТС]**

2.1.23 Auxiliary sets of computational instructions

There is a number of additional instruction set extensions related to AVX. They introduce new vector instructions (and sometimes также their SSE equivalents that use classic instruction encoding), and even some new instructions operating on general registers that use the AVX-like encoding allowing the extended syntax with separate destination and source operands. The CPU support for each of these instructions sets needs to be determined separately. The AES extension provides a specialized set of instructions for the purpose of cryptographic computations defined by Advanced Encryption Standard. Each of these instructions has two versions: the AVX one and the one with SSE-like syntax that uses classic encoding. Refer to the Intel manuals for the details of operation of these instructions. aesenc and aesenclast perform a single round of AES encryption on data from first source with a round key from second source, and store result in destination. The destination and first source are SSE registers, and the second source can be SSE register or 128-bit memory. The AVX versions of these instructions, vaesenc and vaesenclast, use the syntax with three operands, while the SSE-like version has only two operands, with first operand being both the destination and first source. aesdec and aesdeclast perform a single round of AES decryption on data from first source with a round key from second source. The syntax rules for them and their AVX versions are the same as for aesenc. aesimc performs the InvMixColumns transformation of source operand and store the result in destination. Both aesimc and vaesimc use only two operands, destination being SSE register, and source being SSE register or 128-bit memory location. aeskeygenassist is a helper instruction for generating the round key. It needs three operands: destination being SSE register, source being SSE register or 128-bit memory, and third operand being 8-bit immediate value. The AVX version of this instruction uses the same syntax. The CLMUL extension introduces just one instruction, pclmulqdq, and its AVX version as well. This instruction performs a carryless multiplication of two 64-bit values selected from first and second source according to the bit fields in immediate value. The destination and first source are SSE registers, second source is SSE register or 128- bit memory, and immediate value is provided as last operand. vpclmulqdq takes four operands, while pclmulqdq takes only three operands, with the first one serving both the role of destination and first source. The FMA (Fused Multiply-Add) extension introduces additional AVX instructions which perform multiplication and summation as single operation. Each one takes three operands, first one serving both the role of destination and first source, and the following ones being the second and third source. The mnemonic of FMA instruction is obtained by appending to vf prefix: first either m or nm to select whether result of multiplication should be taken as-is or negated, then either add or sub to select whether third value will be added to the product or substracted from the product, then either 132, 213 or 231 to select which source operands are multiplied and which one is added or substracted, and finally the type of data on which the instruction operates, either ps, pd, ss or sd. As it was with SSE instructions promoted to AVX, instructions operating on packed oating point values allow 128-bit or 256-bit syntax, in former all the operands are SSE registers, but the third one can также be a 128-bit memory, in latter the operands are AVX registers and the third one can также be a 256-bit memory. Instructions that compute just one oating point result need operands to be SSE registers, and the third operand can также be a memory, either 32-bit for single precision or 64-bit for double precision.

Assembler
1
2
vfmsub231ps ymm1,ymm2,ymm3 ; multiply and substract
vfnmadd132sd xmm0,xmm5,[ebx] ; multiply, negate and add

In addition to the instructions created by the rule described above, there are families of instructions with mnemonics starting with either vfmaddsub or vfmsubadd, followed by either 132, 213 or 231 and then either ps or pd (the operation must always be on packed values in this case). They add to the result of multiplication or substract from it depending on the position of value in packed data ― instructions from the vfmaddsub group add when the position is odd and substract when the position is even, instructions from the vfmsubadd group add when the position is even and subtstract when the position is odd. The rules for operands are the same as for other FMA instructions. The FMA4 instructions are similar to FMA, but use syntax with four operands and thus allow destination to be dierent than all the sources. Their mnemonics are identical to FMA instructions with the 132, 213 or 231 cut out, as having separate destination operand makes such selection of operands super uous. The multiplication is always performed on values from the first and second source, and then the value from third source is added or substracted. Either second or third source can be a memory operand, and the rules for the sizes of operands are the same as for FMA instructions.

Assembler
1
2
vfmaddpd ymm0,ymm1,[esi],ymm2 ; multiply and add
vfmsubss xmm0,xmm1,xmm2,[ebx] ; multiply and substract

The F16C extension consists of two instructions, vcvtps2ph and vcvtph2ps, which convert oating point values between single precision and half precision (the 16-bit oat- ing point format). vcvtps2ph takes three operands: destination, source, and rounding controls. The third operand is always an immediate, the source is either SSE or AVX register containing single precision values, and the destination is SSE register or mem- ory, the size of memory is 64 bits when the source is SSE register and 128 bits when the source is AVX register. vcvtph2ps takes two operands, the destination that can be SSE or AVX register, and the source that is SSE register or memory with size of the half of destination operand's size. The AMD XOP extension introduces a number of new vector instructions with en- coding and syntax analogous to AVX instructions. vfrczps, vfrczss, vfrczpd and vfrczsd extract fractional portions of single or double precision values, they all take two operands. The packed operations allow either SSE or AVX register as destination, for the other two it has to be SSE register. Source can be register of the same type as destination, or memory of appropriate size (256-bit for destination being AVX register, 128-bit for packed operation with destination being SSE register, 64-bit for operation on a solitary double precision value and 32-bit for operation on a solitary single precision value).

Assembler
1
vfrczps ymm0,[esi] ; load fractional parts

@Mikl___ · Код Мнемоника Описание

vpcmov copies bits from either first or second source into destination depending on the values of corresponding bits in the fourth operand (the selector). If the bit in selector is set, the corresponding bit from first source is copied into the same position in destination, otherwise the bit from second source is copied. Either second source or selector can be memory location, 128-bit or 256-bit depending on whether SSE registers or AVX registers are specified as the other operands.

Assembler
1
2
vpcmov xmm0,xmm1,xmm2,[ebx] ; selector in memory
vpcmov ymm0,ymm5,[esi],ymm2 ; source in memory

The family of packed comparison instructions take four operands, the destination and first source being SSE register, second source being SSE register or 128-bit memory and the fourth operand being immediate value defining the type of comparison. The mnemonic or instruction is created by appending to vpcom prefix either b or ub to com- pare signed or unsigned bytes, w or uw to compare signed or unsigned words, d or ud to compare signed or unsigned double words, q or uq to compare signed or unsigned quad words. The respective values from the first and second source are compared and the corresponding data element in destination is set to either all ones or all zeros de- pending on the result of comparison. The fourth operand has to specify one of the eight comparison types (table 2.5). All these instructions have также variants with only three operands and the type of comparison encoded within the instruction name by inserting the comparison mnemonic after vpcom.

Assembler
1
2
vpcomb xmm0,xmm1,xmm2,4 ; test for equal bytes
vpcomgew xmm0,xmm1,[ebx] ; compare signed words

Таблица 2.5: XOP comparisons.

Код	Мнемоника	Описание
0	lt	меньше
1	le	меньше или равно
2	gt	больше
3	ge	больше или равно
4	eq	равно
5	neq	не равно
6	false	ложь
7	true	истина

vpermil2ps and vpermil2pd set the elements in destination register to zero or to a value selected from first or second source depending on the corresponding bit fields from the fourth operand (the selector) and the immediate value provided in fifth operand. Refer to the AMD manuals for the detailed explanation of the operation performed by these instructions. Each of the first four operands can be a register, and either second source or selector can be memory location, 128-bit or 256-bit depending on whether SSE registers or AVX registers are used for the other operands.

Assembler
1
vpermil2ps ymm0,ymm3,ymm7,ymm2,0 ; permute from two sources

vphaddbw adds pairs of adjacent signed bytes to form 16-bit values and stores them at the same positions in destination. vphaddubw does the same but treats the bytes as unsigned. vphaddbd and vphaddubd sum all bytes (either signed or unsigned) in each four-byte block to 32-bit results, vphaddbq and vphaddubq sum all bytes in each eight-byte block to 64-bit results, vphaddwd and vphadduwd add pairs of words to 32- bit results, vphaddwq and vphadduwq sum all words in each four-word block to 64-bit results, vphadddq and vphaddudq add pairs of double words to 64-bit results. vphsubbw substracts in each two-byte block the byte at higher position from the one at lower position, and stores the result as a signed 16-bit value at the corresponding position in destination, vphsubwd substracts in each two-word block the word at higher position from the one at lower position and makes signed 32-bit results, vphsubdq substract in each block of two double word the one at higher position from the one at lower position and makes signed 64-bit results. Each of these instructions takes two operands, the destination being SSE register, and the source being SSE register or 128-bit memory.

Assembler
1
vphadduwq xmm0,xmm1 ; sum quadruplets of words

vpmacsww and vpmacssww multiply the corresponding signed 16-bit values from the first and second source and then add the products to the parallel values from the third source, then vpmacsww takes the lowest 16 bits of the result and vpmacssww saturates the result down to 16-bit value, and they store the final 16-bit results in the desti- nation. vpmacsdd and vpmacssdd perform the analogous operation on 32-bit values. vpmacswd and vpmacsswd do the same calculation only on the low 16-bit values from each 32-bit block and form the 32-bit results. vpmacsdql and vpmacssdql perform such operation on the low 32-bit values from each 64-bit block and form the 64-bit results, while vpmacsdqh and vpmacssdqh do the same on the high 32-bit values from each 64-bit block, также forming the 64-bit results. vpmadcswd and vpmadcsswd multiply the corresponding signed 16-bit value from the first and second source, then sum all the four products and add this sum to each 16-bit element from third source, storing the truncated or saturated result in destination. All these instructions take four operands, the second source can be 128-bit memory or SSE register, all the other operands have to be SSE registers.

Assembler
1
vpmacsdd xmm6,xmm1,[ebx],xmm6 ; accumulate product

vpperm selects bytes from first and second source, optionally applies a separate transformation to each of them, and stores them in the destination. The bit fields in fourth operand (the selector) specify for each position in destination what byte from which source is taken and what operation is applied to it before it is stored there. Refer to the AMD manuals for the detailed information about these bit fields. This instruction takes four operands, either second source or selector can be a 128-bit memory (or they can be SSE registers both), all the other operands have to be SSE registers. vpshlb, vpshlw, vpshld and vpshlq shift logically bytes, words, double words or quad words respectively. The amount of bits to shift by is specified for each element separately by the signed byte placed at the corresponding position in the third operand. The source containing elements to shift is provided as second operand. Either second or third operand can be 128-bit memory (or they can be SSE registers both) and the other operands have to be SSE registers.

Assembler
1
vpshld xmm3,xmm1,[ebx] ; shift bytes from xmm1

vpshab, vpshaw, vpshad and vpshaq arithmetically shift bytes, words, double words or quad words. These instructions follow the same rules as the logical shifts described above. vprotb, vprotw, vprotd and vprotq rotate bytes, word, double words or quad words. They follow the same rules as shifts, but additionally allow third operand to be immediate value, in which case the same amount of rotation is specified for all the elements in source.

Assembler
1
vprotb xmm0,[esi],3 ; rotate bytes to the left

The MOVBE extension introduces just one new instruction, movbe, which swaps bytes in value from source before storing it in destination, so can be used to load and store big endian values. It takes two operands, either the destination or source should be a 16-bit, 32-bit or 64-bit memory (the last one being only allowed in long mode), and the other operand should be a general register of the same size. The BMI extension, consisting of two subsets ― BMI1 and BMI2, introduces new instructions operating on general registers, which use the same encoding as AVX in- structions and so allow the extended syntax. All these instructions use 32-bit operands, and in long mode they также allow the forms with 64-bit operands. andn calculates the bitwise AND of second source with the inverted bits of first source and stores the result in destination. The destination and the first source have to be general registers, the second source can be general register or memory. andn edx,eax,[ebx] ; bit-multiply inverted eax with memory bextr extracts from the first source the sequence of bits using an index and length specified by bit fields in the second source operand and stores it into destination. The lowest 8 bits of second source specify the position of bit sequence to extract and the next 8 bits of second source specify the length of sequence. The first source can be a general register or memory, the other two operands have to be general registers. bextr eax,[esi],ecx ; extract bit field from memory blsi extracts the lowest set bit from the source, setting all the other bits in destination to zero. The destination must be a general register, the source can be general register or memory.

Assembler
1
blsi rax,r11 ; isolate the lowest set bit

blsmsk sets all the bits in the destination up to the lowest set bit in the source, including this bit. blsr copies all the bits from the source to destination except for the lowest set bit, which is replaced by zero. These instructions follow the same rules for operands as blsi. tzcnt counts the number of trailing zero bits, that is the zero bits up to the lowest set bit of source value. This instruction is analogous to lzcnt and follows the same rules for operands, so it также has a 16-bit version, unlike the other BMI instructions. bzhi is BMI2 instruction, which copies the bits from first source to destination, zeroing all the bits up from the position specified by second source. It follows the same rules for operands as bextr. pext uses a mask in second source operand to select bits from first operands and puts the selected bits as a continuous sequence into destination. pdep performs the reverse operation ― it takes sequence of bits from the first source and puts them consecutively at the positions where the bits in second source are set, setting all the other bits in destination to zero.

@Mikl___ · 27.08.2014, 10:24 **[ТС]**

These BMI2 instructions follow the same rules for operands as andn. mulx is a BMI2 instruction which performs an unsigned multiplication of value from EDX or RDX register (depending on the size of specified operands) by the value from third operand, and stores the low half of result in the second operand, and the high half of result in the first operand, and it does it without aecting the ags. The third operand can be general register or memory, and both the destination operands have to be general registers.

Assembler
1
mulx edx,eax,ecx ; multiply edx by ecx into edx:eax

shlx, shrx and sarx are BMI2 instructions, which perform logical or arithmetical shifts of value from first source by the amount specified by second source, and store the result in destination without aecting the ags. The have the same rules for operands as bzhi instruction. rorx is a BMI2 instruction which rotates right the value from source operand by the constant amount specified in third operand and stores the result in destination without aecting the ags. The destination operand has to be general register, the source operand can be general register or memory, and the third operand has to be an immediate value.

Assembler
1
rorx eax,edx,7 ; rotate without affecting flags

The TBM is an extension designed by AMD to supplement the BMI set. The bextr instruction is extended with a new form, in which second source is a 32-bit immediate value. blsic is a new instruction which performs the same operation as blsi, but with the bits of result reversed. It uses the same rules for operands as blsi. blsfill is a new instruction, which takes the value from source, sets all the bits below the lowest set bit and store the result in destination, it также uses the same rules for operands as blsi. blci, blcic, blcs, blcmsk and blcfill are instructions analogous to blsi, blsic, blsr, blsmsk and blsfill respectively, but they perform the bit-inverted versions of the same operations. They follow the same rules for operands as the instructions they re ect. tzmsk finds the lowest set bit in value from source operand, sets all bits below it to 1 and all the rest of bits to zero, then writes the result to destination. t1mskc finds the least significant zero bit in the value from source operand, sets the bits below it to zero and all the other bits to 1, and writes the result to destination. These instructions have the same rules for operands as blsi.

@Mikl___ · 28.08.2014, 10:03 **[ТС]**

2.1.24 Other extensions of instruction set

There is a number of additional instruction set extensions recognized by at assembler, and the general syntax of the instructions introduced by those extensions is provided here. For a detailed information on the operations performed by them, check out the manuals from Intel (for the VMX, SMX, XSAVE, RDRAND, FSGSBASE, INVPCID, HLE and RTM extensions) or AMD (for the SVM extension). The Virtual-Machine Extensions (VMX) provide a set of instructions for the man- agement of virtual machines. The vmxon instruction, which enters the VMX operation, requires a single 64-bit memory operand, which should be a physical address of memory region, which the logical processor may use to support VMX operation. The vmxoff instruction, which leaves the VMX operation, has no operands. The vmlaunch and vmresume, which launch or resume the virtual machines, and vmcall, which allows guest software to call the VM monitor, use no operands either. The vmptrld loads the physical address of current Virtual Machine Control Structure (VMCS) from its memory operand, vmptrst stores the pointer to current VMCS into address specified by its memory operand, and vmclear sets the launch state of the VMCS referenced by its memory operand to clear. These three instruction all require single 64-bit memory operand. The vmread reads from VCMS a field specified by the source operand and stores it into the destination operand. The source operand should be a general purpose register, and the destination operand can be a register of memory. The vmwrite writes into a VMCS field specified by the destination operand the value provided by source operand. The source operand can be a general purpose register or memory, and the destination operand must be a register. The size of operands for those instructions should be 64-bit when in long mode, and 32-bit otherwise. The invept and invvpid invalidate the translation lookaside buers (TLBs) and paging-structure caches, either derived from extended page tables (EPT), or based on the virtual processor identifier (VPID). These instructions require two operands, the first one being the general purpose register specifying the type of invalidation, and the second one being a 128-bit memory operand providing the invalidation descriptor. The first operand should be a 64-bit register when in long mode, and 32-bit register otherwise. The Safer Mode Extensions (SMX) provide the functionalities available throught the getsec instruction. This instruction takes no operands, and the function that is executed is determined by the contents of EAX register upon executing this instruction. The Secure Virtual Machine (SVM) is a variant of virtual machine extension used by AMD. The skinit instruction securely reinitializes the processor allowing the startup of trusted software, such as the virtual machine monitor (VMM). This instruction takes a single operand, which must be EAX, and provides a physical address of the secure loader block (SLB). The vmrun instruction is used to start a guest virtual machine, its only operand should be an accumulator register (AX, EAX or RAX, the last one available only in long mode) providing the physical address of the virtual machine control block (VMCB). The vmsave stores a subset of processor state into VMCB specified by its operand, and vmload loads the same subset of processor state from a specified VMCB. The same operand rules as for the vmrun apply to those two instructions. vmmcall allows the guest software to call the VMM. This instruction takes no operands. stgi set the global interrupt ag to 1, and clgi zeroes it. These instructions take no operands. invlpga invalidates the TLB mapping for a virtual page specified by the first operand (which has to be accumulator register) and address space identifier specified by the second operand (which must be ECX register). The XSAVE set of instructions allows to save and restore processor state components. xsave and xsaveopt store the components of processor state defined by bit mask in EDX and EAX registers into area defined by memory operand. xrstor restores from the area specified by memory operand the components of processor state defined by mask in EDX and EAX. The xsave64, xsaveopt64 and xrstor64 are 64-bit versions of these instructions, allowed only in long mode. xgetbv read the contents of 64-bit XCR (extended control register) specified in ECX register into EDX and EAX registers. xsetbv writes the contents of EDX and EAX into the 64-bit XCR specified by ECX register. These instructions have no operands. The RDRAND extension introduces one new instruction, rdrand, which loads the hardware-generated random value into general register. It takes one operand, which can be 16-bit, 32-bit or 64-bit register (with the last one being allowed only in long mode). The FSGSBASE extension adds long mode instructions that allow to read and write the segment base registers for FS and GS segments. rdfsbase and rdgsbase read the corresponding segment base registers into operand, while wrfsbase and wrgsbase write the value of operand into those register. All these instructions take one operand, which can be 32-bit or 64-bit general register.

The INVPCID extension adds invpcid instruction, which invalidates mapping in the TLBs and paging caches based on the invalidation type specified in first operand and PCID invalidate descriptor specified in second operand. The first operands should be 32-bit general register when not in long mode, or 64-bit general register when in long mode. The second operand should be 128-bit memory location.

The HLE and RTM extensions provide set of instructions for the transactional management. The xacquire and xrelease are new prefixes that can be used with some of the instructions to start or end lock elision on the memory address specified by prefixed instruction. The xbegin instruction starts the transactional execution, its operand is the address a fallback routine that gets executes in case of transaction abort, specified like the operand for near jump instruction. xend marks the end of transcational execution region, it takes no operands. xabort forces the transaction abort, it takes an 8-bit immediate value as its only operand, this value is passed in the highest bits of EAX to the fallback routine. xtest checks whether there is transactional execution in progress, this instruction takes no operands.

@Mikl___ · 29.08.2014, 06:21 **[ТС]**

2.1 Директивы управления

Этот параграф описывает директивы, которые управляют процессом ассемблирования. Эти директивы выполняются во время ассемблирования и могут делать так, чтобы некоторые блоки инструкций ассемблировались по-разному или не ассемблировались вовсе.

2.2.1 Числовые константы

Численные константы определяются с помощью директивы =. Сначала пишется имя константы, затем директива = и выражение, определяющее значение константы. Значением может быть число или адрес, но в отличие от меток, в численных константах нельзя использовать регистровую адресацию. В целом же численные константы ведут себя, так же как и метки. Вы даже можете обратиться к константе до того, как она была определена (forward-reference).

Hо есть и другой вид численных констант. Если определить константу с именем, которое уже использовалось ранее для определения численной константы, то ассемблер будет рассматривать данную константу как переменную времени транслирования (assembly-time variable). Такой переменной можно присваивать новые значения, но, по вполне понятной причине, на нее нельзя ссылаться до ее определения. Давайте рассмотрим два варианта численных констант в одном примере:

Assembler
1
2
3
4
dd sum
x = 1
x = x+2
sum = x

Здесь x - переменная времени транслирования. Всякий раз при обращении к ней используется последнее присвоенное этой переменной значение. Если бы мы попытались использовать x до его определения, например, написав dd x вместо dd sum, это вызвало бы ошибку. Когда мы определяем x = x+2, предыдущее значение x используется для вычисления нового. Когда определяется константа sum, значение x равно 3. Это значение и присваивается sum. Так как константа sum определяется один раз, она является стандартной численной константой, и ее можно использовать до определения. Таким образом, dd sum ассемблируется как dd 3. Чтобы узнать больше смотрите раздел 2.2.6

Значению численной константы может предшествовать оператор размера, гарантирующий, что значение находится в допустимом для заданного размера диапазоне. Оператор размера может также повлиять на некоторые вычисления с константами. Следующий пример определяет две разные константы: первую размером 8 бит, вторую - 32 бита

Assembler
1
2
c8 = byte -1
c32 = dword -1

Если вам надо определить константу со значением адреса, вы можете использовать расширенный синтаксис директивы label (описывалась ранее в разделе 1.2.3). Например,

Assembler
1
label myaddr at ebp+4

определяет метку по адресу ebp+4. Метки, в отличие от численных констант, могут определять адрес на основе регистров, но не могут становиться переменными времени транслирования.

2.2.2 Условное ассемблирование

С помощью директивы "if" можно ассемблировать или не ассемблировать блок инструкций в зависимости от выполнения условия. За ней должно следовать логическое выражение, определяющее условие. Инструкции на следующих строках ассемблируются, только если это условие выполняется, иначе они пропускаются. Опциональная директива "elseif" со следующим за ней логическим выражением, определяющим дополнительное условие, начинает следующий блок инструкций, который ассемблируется, если предыдущие условия не выполняются, а данное дополнительное условие выполняется. Опциональная директива "else" начинает блок инструкций, которые ассемблируются, если не выполняется ни одно из условий. "end if" заканчивает последний блок инструкций.
Вы должны помнить, что директива "if" обрабатывается на стадии ассемблирования и поэтому не влияет на директивы препроцессора, такие как определения символьных констант и макроинструкции - когда ассемблер распознает директиву "if", весь препроцессинг уже закончен.
Логическое выражение состоит из логических значений и логических операторов. Логические операторы выглядят так: "~" для логического отрицания, "&" для логического И, "|" для логического ИЛИ. Отрицание имеет высший приоритет. Логическое значение может быть числовым выражением, оно будет считаться ложным в случае равенства нулю, иначе оно будет истинным. Для создания логического значения можно сравнить два числовых выражения, используя один из следующих операторов: "=" (равно), "<" (меньше), ">" (больше), "<=" (меньше или равно), ">=" (больше или равно), "<>" (не равно).
"used" со следующим за ним символом имени, это логическое значение, которое проверяет, использовался ли где-нибудь данный символ (он возвращает правильный результат даже если символ используется только после этой проверки). За оператором "defined" может следовать любое выражение, обычно это только одно символьное имя; этот оператор проверяет, содержит ли данное выражение исключительно символы, определенные в коде, и доступные из текущей позиции.
Следующий простой пример использует константу "count" которая должна быть определена где-то в коде:

Assembler
1
2
3
4
    if count>0
        mov cx,count
        rep movsb
    end if

Эти две инструкции будут ассемблированы только если константа "count" больше нуля. Следующий пример показывает более комплексную условную структуру:

Assembler
1
2
3
4
5
6
7
8
9
10
11
12
    if count & ~ count mod 4
        mov cx,count/4
        rep movsd
    else if count>4
        mov cx,count/4
        rep movsd
        mov cx,count mod 4
        rep movsb
    else
        mov cx,count
        rep movsb
    end if

Первый блок инструкций ассеблируется, если константа "count" не равна нулю и кратна четырем, если это условие не выполняется, оценивается второе логическое условие, следующее за "else if", и если оно верно, ассемблируется второй блок инструкций, иначе ассемблируется последний блок, который следует за строкой, содержащей только "else".
Также есть операторы, которые позволяют сравнивать значения, которые представляют собой последовательности символов. "eq" проверяет такие значения на тождественность. Оператор "in" проверяет, принадлежит ли данное значение к списку значений, следующему за оператором. Список должен быть заключен между символами "<" и ">", а его члены должны быть разделены запятыми. Символы считаются одинаковыми, если они имеют одно и то же значение для ассемблера - например, "pword" и "fword" для ассемблера одинаковы поэтому не различаются вышеуказанными операторами. Так же "16 eq 10h" является истиной, однако "16 eq 10+4" нет.
Оператор "eqtype" имеют ли сравниваемые значения одинаковую структуру, и принадлежат ли структурные элементы одному типу. Различаемые типы включают в себя числовые выражения, строки, заключенные в кавычки, значения с плавающей точкой, адресные выражения (выражения в квадратных скобках или предваренные оператором "ptr"), мнемоники инструкций, регистры, операторы размера, операторы перехода и операторы типа кода. И каждый из специальных символов, действующих как разделители, такой как запятая или двоеточие, это отдельный тип сам по себе. Например, два значения, каждое из которых состоит из имени регистра и числового выражения, разделенных запятой, будут распознаны как один тип, независимо от вида регистра и сложности числового выражения; за исключением строк, заключенных в кавычки и значений с плавающей точкой, которые относятся к специальным видом числовых выражений и распознаются как разные типы. Поэтому условие "eax,16 eqtype fs,3+7" является истиной, но "eax,16 eqtype eax,1.6" - ложь.

@Mikl___ · 29.08.2014, 06:26 **[ТС]**

2.2.3 Повторение блоков инструкций

"times" повторяет одну инструкцию указанное количество раз. За ней должно следовать числовое выражение, определяющее количество повторений, и инструкция, которую нужно повторять (опционально для того, чтобы отделить число и инструкцию, можно использовать двоеточие). Специальный символ "%", использующийся внутри инструкции, эквивалентен номеру текущего повтора. Например, "times 5 db %" определит пять байтов со значениями 1, 2, 3, 4, 5. Поддерживается также рекурсивное использование директивы "times", например, "times 3 times % db %" определит шесть байтов со значениями 1, 1, 2, 1, 2, 3.
"repeat" повторяет целый блок инструкций. За ней должно следовать числовое выражение, определяющее количество повторений. Инструкции для повторения предполагаются на следующих строках, а заканчиваться блок должен директивой "end repeat", например:

Assembler
1
2
3
4
    repeat 8
        mov byte [bx],%
        inc bx
    end repeat

Сгенерированный код сохраняет байты со значениями от одного до восьми в памяти, адресованной регистром BX.
Количество повторений может быть равным нулю, и в таком случае инструкции не будут ассемболироваться вовсе.
"break" позволяет остановить повторение раньше и продолжить ассемблирование с первой строки после "end repeat". В сочетании с директивой "if" она позволяет остановить повторение при выполнении некоторого особого условия, например:

Assembler
1
2
3
4
5
6
7
    s = x/2
    repeat 100
        if x/s = s
            break
        end if
        s = (s+x/s)/2
    end repeat

"while" повторяет блок инструкций, пока выполняется следующее за ней условие, определенное логическим выражением. Блок инструкций для повторения должен заканчиваться директивой "end while". Перед каждым повторением логическое выражение вычисляется и если его значение ложь, ассемблирование продолжается, начиная с первой строки после "end while". Также в этом случае символ "%" содержит номер текущего повторения. Директива "break" может быть использована для остановки этого типа цикла так же , как с директивой "repeat". Предыдущий пример может быть переписан с использованием "while" вместо "repeat" таким образом:

Assembler
1
2
3
4
5
6
7
    s = x/2
    while x/s <> s
        s = (s+x/s)/2
        if % = 100
            break
        end if
    end while

Блоки, определенные с использованием "if", "repeat" и "while" могут быть вложены в любом порядке, однако и закрыты в обратном. Директива "break" всегда останавливает обработку бока, который был начат последним либо директивой "repeat", либо "while".

Добавлено через 5 минут
2.2.4 Адресные пространства

"org" устанавливает адрес, по которому следующий за ней код должен появиться в памяти. За ней должно следовать числовое выражение, указывающее адрес. Эта директива начинает новое адресное пространство, следующий код сам по себе никуда не двигается, но все метки, определенные в нем и значение символа "$" изменяются как если бы он был бы помещен по этому адресу. Тем не менее обязанность поместить во время выполнения код по правильному адресу лежит на программисте.
"load" позволяет определить константу двоичным значением, загруженным из уже сассемблированного кода. За директивой должно следовать имя константы, затем опционально оператор размера, затем оператор "from" и числовое выражение, определяющее валидный адрес в текущем адресном пространстве. Оператор размера здесь имеет необычное значение - он определяет, сколько байтов (до 8) должно быть загружено из двоичного значения константы. Если оператор размера не определен, загружается один байт (таким образом значение оказывается в пределах от 0 до 255). Загруженные данные не могут превосходить текущее смещение.
"store" может модифицировать уже сгенерированный код заменой некоторых ранее сгенерированных байтов значением, задаваемым следующим за инструкцией числовым выражением. Перед этим выражением может идти оператор размера, определяющий, насколько длинное значение оно задает, то есть сколько будет сохранено байт. Если оператор размера не задан, подразумевается длина в один байт. Далее должен следовать оператор "at" и числовое выражение, указывающее валидный адрес в текущем адресном пространстве кода. По этому адресу будет сохранено задаваемое значение. Это директива для продвинутого применения и её следует использовать осторожно.
Обе директивы "load" и "store" ограничены оперированием только в пределах текущего адресного пространства. Символ "$$" всегда равен базовому адресу в текущем адресном пространстве, а символ "$" - это адрес текущей позиции в нём, то есть эти два значения определяют границы действия директив "load" и "store".
Сочетая директивы "load" и "store" можно делать вещи, такие как шифрование некоторого из уже сгенерированного кода. Например, для шифрования всего кода, сгенерированного в текущем адресном пространстве вы можете использовать такой блок директив:

Assembler
1
2
3
4
    repeat $-$$
        load a byte from $$+%-1
        store byte a xor c at $$+%-1
    end repeat

и каждый байт коза будет проксорен со значением, определенным константой "c".
"virtual" определяет виртуальные данные по указанному адресу. Эти данные не будут включены в файл вывода, но но метки, определенные здесь, могут использоваться в других частях кода. За этой директивой может следовать оператор "at" и числовое выражение, определяющее адрес виртуальных данных, иначе будет использован текущий адрес, что равносильно директиве "virtual at $". Инструкции определяемых данных должны быть расположены на следующих строках и заканчиваться директивой "end virtual". Блок виртуальных инструкций сам по себе независимое адресное пространство, и после того, как оно заканчивается, восстанавливается контекст предыдущего адресного пространства.
Директива "virtual" может быть использована для создания объединения нескольких переменных, например:

Assembler
1
2
3
4
5
    GDTR dp ?
    virtual at GDTR
        GDT_limit dw ?
        GDT_address dd ?
    end virtual

Здесь определяются две части 48-битной переменной по адресу "GDTR".
Директива также может быть использована для определения меток некоторых структур, адресованных регистром, например:

Assembler
1
2
3
4
    virtual at bx
        LDT_limit dw ?
        LDT_address dd ?
    end virtual

С таким определением инструкция "mov ax,[LDT_limit]" будет сассемблирована в "mov ax,[bx]".
Также может быть полезно объявление инструкций и значений данных внутри виртуально блока, так как директиву "load" можно использовать для загрузки в константы значений из виртуально сгенерированного кода. Эта директива должна быть использована после загружаемого кода, но до окончания виртуального блока, так как она может загружать значения только из того же адресного пространства. Например:

Assembler
1
2
3
4
5
    virtual at 0
        xor eax,eax
        and edx,eax
        load zeroq dword from 0
    end virtual

Этот кусок кода определяет константу "zeroq", которая будет содержать четыре байта машинного кода инструкций, указанных внутри виртуального блока. Этот метод также может быть использован для загрузки некоторых бинарных значений из внешнего файла. Например этот код:

Assembler
1
2
3
4
    virtual at 0
        file 'a.txt':10h,1
        load char from 0
    end virtual

загружает один байт со смещением 10h из файла "a.txt" в константу "char".
Все директивы "section", описанные в 2.4, также начинают новое адресное пространство.

2.2.5 Другие директивы

"align" выравнивает код или данные по указанной границе. За ней должно следовать числовое выражение, определяющее количество байтов, на кратность которому должен быть выровнен текущий адрес. Значение границы должно быть степенью двойки.
Директива "align" заполняет байты, которые должны быть пропущены, чтобы совершить выравнивание, инструкциями "nop", и в это же время маркирует эту область как неинициализированные данные, то есть если её поместить среди других неинициализированных данных, это не займет места в файле вывода, выравнивание байтов происходит таким же образом. Если вам нужно заполнить область выравнивания какими-то другими значениями, вы можете сочетать "align" и "virtual", чтобы получить требуемый размер выравнивания и далее создать выравнивание самостоятельно, например:

Assembler
1
2
3
4
5
    virtual
        align 16
        a = $ - $$
    end virtual
    db a dup 0

Константа "a" определяется как разница между адресом после выравнивания и адресом блока "virtual" (смотрите предыдущий параграф), то есть она равна размеру требуемого пространства выравнивания.
"display" во время ассемблирования показывает сообщение. За ней должны следовать строка в кавычках или значения байтов, разделенные запятыми. Директива может быть использована для показа значений некоторых констант, например:

Assembler
1
2
3
4
5
6
7
8
9
10
    bits = 16
    display 'Current offset is 0x'
    repeat bits/4
        d = '0' + $ shr (bits-%*4) and 0Fh
        if d > '9'
            d = d + 'A'-'9'-1
        end if
        display d
    end repeat
    display 13,10

Этот блок директив рассчитывает четыре цифры 16-битного значения и конвертирует их в знаки для показа. Помните что это не будет работать, если адреса в текущем адресном пространстве перемещаемы (как это может быть с объектным форматом вывода и форматом PE), так как таким образом могут быть использованы только абсолютные значения. Абсолютное значение может быть получено вычислением относительного адреса, например "$-$$" или "rva $" в случае формата PE.

Новые блоги и статьи Все статьи Все блоги /
Новый ноутбук volvo 07.12.2025 Всем привет. По скидке в "черную пятницу" взял себе новый ноутбук Lenovo ThinkBook 16 G7 на Амазоне: Ryzen 5 7533HS 64 Gb DDR5 1Tb NVMe 16" Full HD Display Win11 Pro	Музыка, написанная Искусственным Интеллектом volvo 04.12.2025 Всем привет. Некоторое время назад меня заинтересовало, что уже умеет ИИ в плане написания музыки для песен, и, собственно, исполнения этих самых песен. Стихов у нас много, уже вышли 4 книги, еще 3. . .	От async/await к виртуальным потокам в Python IndentationError 23.11.2025 Армин Ронахер поставил под сомнение async/ await. Создатель Flask заявляет: цветные функции - провал, виртуальные потоки - решение. Не threading-динозавры, а новое поколение лёгких потоков. Откат?. . .	Поиск "дружественных имён" СОМ портов Argus19 22.11.2025 Поиск "дружественных имён" СОМ портов На странице: https:/ / norseev. ru/ 2018/ 01/ 04/ comportlist_windows/ нашёл схожую тему. Там приведён код на С++, который показывает только имена СОМ портов, типа,. . .	Сколько Государство потратило денег на меня, обеспечивая инсулином. Programma_Boinc 20.11.2025 Сколько Государство потратило денег на меня, обеспечивая инсулином. Вот решила сделать интересный приблизительный подсчет, сколько государство потратило на меня денег на покупку инсулинов. . . .
Ломающие изменения в C#.NStar Alpha Etyuhibosecyu 20.11.2025 Уже можно не только тестировать, но и пользоваться C#. NStar - писать оконные приложения, содержащие надписи, кнопки, текстовые поля и даже изображения, например, моя игра "Три в ряд" написана на этом. . .	Мысли в слух kumehtar 18.11.2025 Кстати, совсем недавно имел разговор на тему медитаций с людьми. И обнаружил, что они вообще не понимают что такое медитация и зачем она нужна. Самые базовые вещи. Для них это - когда просто люди. . .	Создание Single Page Application на фреймах krapotkin 16.11.2025 Статья исключительно для начинающих. Подходы оригинальностью не блещут. В век Веб все очень привыкли к дизайну Single-Page-Application . Быстренько разберем подход "на фреймах". Мы делаем одну. . .	Фото: Daniel Greenwood kumehtar 13.11.2025	Расскажи мне о Мире, бродяга kumehtar 12.11.2025 — Расскажи мне о Мире, бродяга, Ты же видел моря и метели. Как сменялись короны и стяги, Как эпохи стрелою летели. - Этот мир — это крылья и горы, Снег и пламя, любовь и тревоги, И бескрайние. . .

@Mikl___ Ушел с форума 16371 / 7683 / 1080 Регистрация: 11.11.2010 Сообщений: 13,757
	21.08.2014, 09:06 [ТС]
	2.1.16 SSE2 instructions SSE2 (англ. Streaming SIMD Extensions 2, потоковое SIMD-расширение процессора) — это SIMD (англ. Single Instruction, Multiple Data, Одна инструкция — множество данных) набор инструкций, разработанный Intel и впервые представленный в процессорах серии Pentium 4. SSE2 расширяет набор инструкций SSE с целью полного вытеснения MMX. Набор SSE2 добавил 144 новые команды к SSE, в котором было только 70 команд. SSE2 использует восемь 128-битных регистров (xmm0 до xmm7), включённых в архитектуру x86 с вводом расширения SSE, каждый из которых трактуется как 2 последовательных значения с плавающей точкой двойной точности. SSE2 включает в себя набор инструкций, который производит операции со скалярными и упакованными типами данных. SSE2 содержит инструкции для потоковой обработки целочисленных данных в тех же 128-битных xmm регистрах, что делает это расширение более предпочтительным для целочисленных вычислений, нежели использование набора инструкций MMX, появившегося гораздо раньше. SSE2 включает в себя две части – расширение SSE и расширение MMX. расширение SSE работает с вещественными числами. расширение MMX работает с целыми. В SSE2 регистры по сравнению с MMX удвоились (64 бита -> 128 битов). Т.к. скорость выполнения инструкций не изменилась, при оптимизации под SSE2 программа получает двукратный прирост производительности. Если программа уже была оптимизирована под MMX, то оптимизация под SSE2 дается сравнительно легко в силу сходности системы команд. SSE2 включает в себя ряд команд управления кэшем, предназначенных для минимизации загрязнения кэша при обработке объёмных потоков данных. SSE2 включает в себя сложные дополнения к командам преобразования чисел Технология SSE2 повышает эффективность The SSE2 extension introduces the operations on packed double precision floating point values, extends the syntax of MMX instructions, and adds also some new instructions. movapd and movupd transfer a double quad word operand containing packed double precision values from source operand to destination operand. These instructions are analogous to movaps and movups and have the same rules for operands. movlpd moves double precision value between the memory and the low quad word of SSE register. movhpd moved double precision value between the memory and the high quad word of SSE register. These instructions are analogous to movlps and movhps and have the same rules for operands. movmskpd transfers the most signicant bit of each of the two double precision values in the SSE register into low two bits of a general register. This instruction is analogous to movmskps and has the same rules for operands. movsd transfers a double precision value between source and destination operand (only the low quad word is trasferred). At least one of the operands have to be a SSE register, the second one can be also a SSE register or 64-bit memory location. Arithmetic operations on double precision values are: addpd, addsd, subpd, subsd, mulpd, mulsd, divpd, divsd, sqrtpd, sqrtsd, maxpd, maxsd, minpd, minsd, and they are analoguous to arithmetic operations on single precision values described in previous section. When the mnemonic ends with pd instead of ps, the operation is performed on packed two double precision values, but rules for operands are the same. When the mnemonic ends with sd instead of ss, the source operand can be a 64-bit memory location or a SSE register, the destination operand must be a SSE register and the operation is performed on double precision values, only low quad words of SSE registers are used in this case. andpd, andnpd, orpd and xorpd perform the logical operations on packed double precision values. They are analoguous to SSE logical operations on single prevision values and have the same rules for operands. cmppd compares packed double precision values and returns and returns a mask result into the destination operand. This instruction is analoguous to cmpps and has the same rules for operands. cmpsd performs the same operation on double precision values, only low quad word of destination register is aected, in this case source operand can be a 64-bit memory or SSE register. Variant with only two operands are obtained by attaching the condition mnemonic from table 2.3 to the cmp mnemonic and then attaching the pd or sd at the end. comisd and ucomisd compare the double precision values and set the ZF, PF and CF flags to show the result. The destination operand must be a SSE register, the source operand can be a 128-bit memory location or SSE register. shufpd moves any of the two double precision values from the destination operand into the low quad word of the destination operand, and any of the two values from the source operand into the high quad word of the destination operand. This instruction is analoguous to shufps and has the same rules for operand. Bit 0 of the third operand selects the value to be moved from the destination operand, bit 1 selects the value to be moved from the source operand, the rest of bits are reserved and must be zeroed. unpckhpd performs an unpack of the high quad words from the source and destination operands, unpcklpd performs an unpack of the low quad words from the source and destination operands. They are analoguous to unpckhps and unpcklps, and have the same rules for operands. cvtps2pd converts the packed two single precision floating point values to two packed double precision floating point values, the destination operand must be a SSE register, the source operand can be a 64-bit memory location or SSE register. cvtpd2ps converts the packed two double precision floating point values to packed two single precision floating point values, the destination operand must be a SSE register, the source operand can be a 128-bit memory location or SSE register. cvtss2sd converts the single precision floating point value to double precision floating point value, the destination operand must be a SSE register, the source operand can be a 32-bit memory location or SSE register. cvtsd2ss converts the double precision floating point value to single precision floating point value, the destination operand must be a SSE register, the source operand can be 64-bit memory location or SSE register. cvtpi2pd converts packed two double word integers into the the packed double precision floating point values, the destination operand must be a SSE register, the source operand can be a 64-bit memory location or MMX register. cvtsi2sd converts a double word integer into a double precision floating point value, the destination operand must be a SSE register, the source operand can be a 32-bit memory location or 32-bit general register. cvtpd2pi converts packed double precision floating point values into packed two double word integers, the destination operand should be a MMX register, the source operand can be a 128-bit memory location or SSE register. cvttpd2pi performs the similar operation, except that truncation is used to round a source values to integers, rules for operands are the same. cvtsd2si converts a double precision floating point value into a double word integer, the destination operand should be a 32-bit general register, the source operand can be a 64-bit memory location or SSE register. cvttsd2si performs the similar operation, except that truncation is used to round a source value to integer, rules for operands are the same. 1

@Mikl___ Ушел с форума 16371 / 7683 / 1080 Регистрация: 11.11.2010 Сообщений: 13,757
	21.08.2014, 10:18 [ТС]
	2.1.17 Инструкции SSE3 SSE3 (PNI — Prescott New Instruction) — третья версия SIMD-расширения Intel, потомок SSE, SSE2 и MMX. Впервые представлено в ядре Prescott процессора Pentium 4. AMD предложила свою реализацию SSE3 для процессоров Athlon 64 (ядра Venice, San Diego и Newark). Набор SSE3 содержит 13 инструкций: FISTTP (x87) — преобразование вещественного числа в целое с сохранением целочисленного значения и округлением в сторону нуля MOVSLDUP (SSE) MOVSHDUP (SSE) MOVDDUP (SSE2) LDDQU (SSE/SSE2) — загрузка 128bit невыровненных данных из памяти в регистр xmm, с предотвращением пересечения границы строки кеша ADDSUBPD (SSE) Add Subtract Packed Double ADDSUBPS (SSE2) Add Subtract Packed Single HADDPS (SSE) Horizontal Add Packed Single HSUBPS (SSE) Horizontal Subtract Packed Single HADDPD (SSE2) Horizontal Add Packed Double HSUBPD (SSE2) Horizontal Subtract Packed Double MONITOR (нет аналога в SSE3 для AMD) MWAIT (нет аналога в SSE3 для AMD) Наиболее заметное изменение - возможность горизонтальной работы с регистрами. Если говорить более конкретно, добавлены команды сложения и вычитания нескольких значений, хранящихся в одном регистре. Эти команды упростили ряд DSP и 3D-операций. Существует также новая команда для преобразования значений с плавающей точкой в целые без необходимости вносить изменения в глобальном режиме округления. Prescott technology introduced some new instructions to improve the performance of SSE and SSE2 - this extension is called SSE3. fisttp behaves like the fistp instruction and accepts the same operands, the only dierence is that it always used truncation, irrespective of the rounding mode. movshdup loads into destination operand the 128-bit value obtained from the source value of the same size by lling the each quad word with the two duplicates of the value in its high double word. movsldup performs the same action, except it duplicates the values of low double words. The destination operand should be SSE register, the source operand can be SSE register or 128-bit memory location. movddup loads the 64-bit source value and duplicates it into high and low quad word of the destination operand. The destination operand should be SSE register, the source operand can be SSE register or 64-bit memory location. lddqu is functionally equivalent to movdqu with memory as source operand, but it may improve performance when the source operand crosses a cacheline boundary. The destination operand has to be SSE register, the source operand must be 128-bit memory location. addsubps performs single precision addition of second and fourth pairs and single precision substracion of the rst and third pairs of floating point values in the operands. addsubpd performs double precision addition of the second pair and double precision substraction of the rst pair of floating point values in the operand. haddps performs the addition of two single precision values within the each quad word of source and destination operands, and stores the results of such horizontal addition of values from destination operand into low quad word of destination operand, and the results from the source operand into high quad word of destination operand. haddpd performs the addition of two double precision values within each operand, and stores the result from destination operand into low quad word of destination operand, and the result from source operand into high quad word of destination operand. All these instructions need the destination operand to be SSE register, source operand can be SSE register or 128-bit memory location. monitor sets up an address range for monitoring of write-back stores. It need its three operands to be EAX, ECX and EDX register in that order. mwait waits for a write-back store to the address range set up by the monitor instruction. It uses two operands with additional parameters, rst being the EAX and second the ECX register. The functionality of SSE3 is further extended by the set of Supplemental SSE3 instructions (SSSE3). They generally follow the same rules for operands as all the MMX operations extended by SSE. phaddw and phaddd perform the horizontal additional of the pairs of adjacent values from both the source and destination operand, and stores the sums into the destination (sums from the source operand go into lower part of destination register). They operate on 16-bit or 32-bit chunks, respectively. phaddsw performs the same operation on signed 16-bit packed values, but the result of each addition is saturated. phsubw and phsubd analogously perform the horizontal substraction of 16-bit or 32-bit packed value, and phsubsw performs the horizontal substraction of signed 16-bit packed values with saturation. pabsb, pabsw and pabsd calculate the absolute value of each signed packed signed value in source operand and stores them into the destination register. They operator on 8-bit, 16-bit and 32-bit elements respectively. pmaddubsw multiplies signed 8-bit values from the source operand with the corresponding unsigned 8-bit values from the destination operand to produce intermediate 16-bit values, and every adjacent pair of those intermediate values is then added horizontally and those 16-bit sums are stored into the destination operand. pmulhrsw multiplies corresponding 16-bit integers from the source and destination operand to produce intermediate 32-bit values, and the 16 bits next to the highest bit of each of those values are then rounded and packed into the destination operand. pshufb shues the bytes in the destination operand according to the mask provided by source operand - each of the bytes in source operand is an index of the target position for the corresponding byte in the destination. psignb, psignw and psignd perform the operation on 8-bit, 16-bit or 32-bit integers in destination operand, depending on the signs of the values in the source. If the value in source is negative, the corresponding value in the destination register is negated, if the value in source is positive, no operation is performed on the corresponding value is performed, and if the value in source is zero, the value in destination is zeroed, too. palignr appends the source operand to the destination operand to form the intermediate value of twice the size, and then extracts into the destination register the 64 or 128 bits that are right-aligned to the byte oset specied by the third operand, which should be an 8-bit immediate value. This is the only SSSE3 instruction that takes three arguments. 0

@Mikl___ Ушел с форума 16371 / 7683 / 1080 Регистрация: 11.11.2010 Сообщений: 13,757
	22.08.2014, 08:41 [ТС]
	2.1.18 AMD 3DNow!-инструкции 3DNow! — дополнительное расширение MMX. Причиной создания 3DNow! послужило стремление завоевать превосходство над процессорами производства компании Intel в области обработки мультимедийных данных. Технология 3DNow! ввела 21 новую команду процессора и возможность оперировать 32-битными вещественными типами в стандартных MMX-регистрах. Также были добавлены специальные инструкции, оптимизирующие переключение в режим MMX/3DNow! (femms, которая заменяла стандартную инструкцию emms) и работу с кэшем процессора. Технология 3DNow! расширяла возможности технологии MMX, не требуя введения новых режимов работы процессора и новых регистров. Начиная с микроархитектуры Bulldozer расширение не поддерживается (за исключением команды prefetch) Инструкции 3DNow! PAVGUSB — вычисление среднего 8-битовых целых значений PI2FD — перевод 32-битных целых в вещественные числа PF2ID — перевод вещественных в 32-битные целые числа PFCMPGE — сравнение вещественных чисел, больше или равно PFCMPGT — сравнение вещественных чисел, больше PFCMPEQ — сравнение вещественных чисел, равно PFACC — накопление суммы вещественных чисел PFADD — сложение вещественных чисел PFSUB — вычитание вещественных чисел PFSUBR — обратное вычитание вещественных чисел PFMIN — нахождение минимума вещественных чисел PFMAX — нахождение максимума вещественных чисел PFMUL — умножение вещественных чисел PFRCP — нахождение приближённого значения обратного (1/x) вещественных чисел PFRSQRT — нахождение приближённого значения обратного от квадратного корня (1/sqrt(x)) вещественных чисел PFRCPIT1 — первый шаг вычисления значения обратного (1/x) вещественных чисел PFRSQIT1 — первый шаг вычисления значения обратного от квадратного корня (1/sqrt(x)) вещественных чисел PFRCPIT2 — второй шаr вычисления значения обратного или обратного от квадратного корня вещественных чисел PMULHRW — умножение 16-битных целых чисел с округлением FEMMS — быстрое переключение состояния FPU/MMX PREFETCH/PREFETCHW — предвыборка строки кэша процессора из памяти The 3DNow! extension adds a new MMX instructions to those described in 2.1.14, and introduces operation on the 64–bit packed floating point values, each consisting of two single precision floating point values. These instructions follow the same rules as the general MMX operations, the destination operand should be a MMX register, the source operand can be a MMX register or 64–bit memory location. pavgusb computes the rounded averages of packed unsigned bytes. pmulhrw performs a signed multiply of the packed words, round the high word of each double word results and stores them in the destination operand. pi2fd converts packed double word integers into packed floating point values. pf2id converts packed floating point values into packed double word integers using truncation. pi2fw converts packed word integers into packed floating point values, only low words of each double word in source operand are used. pf2iw converts packed floating point values to packed word integers, results are extended to double words using the sign extension. pfadd adds packed floating point values. pfsub and pfsubr substracts packed floating point values, the first one substracts source values from destination values, the second one substracts destination values from the source values. pfmul multiplies packed floating point values. pfacc adds the low and high floating point values of the destination operand, storing the result in the low double word of destination, and adds the low and high floating point values of the source operand, storing the result in the high double word of destination. pfnacc substracts the high floating point value of the destination operand from the low, storing the result in the low double word of destination, and substracts the high floating point value of the source operand from the low, storing the result in the high double word of destination. pfpnacc substracts the high floating point value of the destination operand from the low, storing the result in the low double word of destination, and adds the low and high floating point values of the source operand, storing the result in the high double word of destination. pfmax and pfmin compute the maximum and minimum of floating point values. pswapd reverses the high and low double word of the source operand. pfrcp returns an estimates of the reciprocals of floating point values from the source operand, pfrsqrt returns an estimates of the reciprocal square roots of floating point values from the source operand, pfrcpit1 performs the first step in the Newton–Raphson iteration to refine the reciprocal approximation produced by pfrcp instruction, pfrsqit1 performs the first step in the Newton–Raphson iteration to refine the reciprocal square root approximation produced by pfrsqrt instruction, pfrcpit2 performs the second final step in the Newton–Raphson iteration to refine the reciprocal approximation or the reciprocal square root approximation. pfcmpeq, pfcmpge and pfcmpgt compare the packed floating point values and sets all bits or zeroes all bits of the correspoding data element in the destination operand according to the result of comparison, first checks whether values are equal, second checks whether destination value is greater or equal to source value, third checks whether destination value is greater than source value. prefetch and prefetchw load the line of data from memory that contains byte specified with the operand into the data cache, prefetchw instruction should be used when the data in the cache line is expected to be modified, otherwise the prefetch instruction should be used. The operand should be an 8–bit memory location. femms performs a fast clear of MMX state. It has no operands. 0

@Mikl___ Ушел с форума 16371 / 7683 / 1080 Регистрация: 11.11.2010 Сообщений: 13,757
	28.08.2014, 10:03 [ТС]
	2.1.24 Other extensions of instruction set There is a number of additional instruction set extensions recognized by at assembler, and the general syntax of the instructions introduced by those extensions is provided here. For a detailed information on the operations performed by them, check out the manuals from Intel (for the VMX, SMX, XSAVE, RDRAND, FSGSBASE, INVPCID, HLE and RTM extensions) or AMD (for the SVM extension). The Virtual-Machine Extensions (VMX) provide a set of instructions for the man- agement of virtual machines. The vmxon instruction, which enters the VMX operation, requires a single 64-bit memory operand, which should be a physical address of memory region, which the logical processor may use to support VMX operation. The vmxoff instruction, which leaves the VMX operation, has no operands. The vmlaunch and vmresume, which launch or resume the virtual machines, and vmcall, which allows guest software to call the VM monitor, use no operands either. The vmptrld loads the physical address of current Virtual Machine Control Structure (VMCS) from its memory operand, vmptrst stores the pointer to current VMCS into address specified by its memory operand, and vmclear sets the launch state of the VMCS referenced by its memory operand to clear. These three instruction all require single 64-bit memory operand. The vmread reads from VCMS a field specified by the source operand and stores it into the destination operand. The source operand should be a general purpose register, and the destination operand can be a register of memory. The vmwrite writes into a VMCS field specified by the destination operand the value provided by source operand. The source operand can be a general purpose register or memory, and the destination operand must be a register. The size of operands for those instructions should be 64-bit when in long mode, and 32-bit otherwise. The invept and invvpid invalidate the translation lookaside buers (TLBs) and paging-structure caches, either derived from extended page tables (EPT), or based on the virtual processor identifier (VPID). These instructions require two operands, the first one being the general purpose register specifying the type of invalidation, and the second one being a 128-bit memory operand providing the invalidation descriptor. The first operand should be a 64-bit register when in long mode, and 32-bit register otherwise. The Safer Mode Extensions (SMX) provide the functionalities available throught the getsec instruction. This instruction takes no operands, and the function that is executed is determined by the contents of EAX register upon executing this instruction. The Secure Virtual Machine (SVM) is a variant of virtual machine extension used by AMD. The skinit instruction securely reinitializes the processor allowing the startup of trusted software, such as the virtual machine monitor (VMM). This instruction takes a single operand, which must be EAX, and provides a physical address of the secure loader block (SLB). The vmrun instruction is used to start a guest virtual machine, its only operand should be an accumulator register (AX, EAX or RAX, the last one available only in long mode) providing the physical address of the virtual machine control block (VMCB). The vmsave stores a subset of processor state into VMCB specified by its operand, and vmload loads the same subset of processor state from a specified VMCB. The same operand rules as for the vmrun apply to those two instructions. vmmcall allows the guest software to call the VMM. This instruction takes no operands. stgi set the global interrupt ag to 1, and clgi zeroes it. These instructions take no operands. invlpga invalidates the TLB mapping for a virtual page specified by the first operand (which has to be accumulator register) and address space identifier specified by the second operand (which must be ECX register). The XSAVE set of instructions allows to save and restore processor state components. xsave and xsaveopt store the components of processor state defined by bit mask in EDX and EAX registers into area defined by memory operand. xrstor restores from the area specified by memory operand the components of processor state defined by mask in EDX and EAX. The xsave64, xsaveopt64 and xrstor64 are 64-bit versions of these instructions, allowed only in long mode. xgetbv read the contents of 64-bit XCR (extended control register) specified in ECX register into EDX and EAX registers. xsetbv writes the contents of EDX and EAX into the 64-bit XCR specified by ECX register. These instructions have no operands. The RDRAND extension introduces one new instruction, rdrand, which loads the hardware-generated random value into general register. It takes one operand, which can be 16-bit, 32-bit or 64-bit register (with the last one being allowed only in long mode). The FSGSBASE extension adds long mode instructions that allow to read and write the segment base registers for FS and GS segments. rdfsbase and rdgsbase read the corresponding segment base registers into operand, while wrfsbase and wrgsbase write the value of operand into those register. All these instructions take one operand, which can be 32-bit or 64-bit general register. The INVPCID extension adds invpcid instruction, which invalidates mapping in the TLBs and paging caches based on the invalidation type specified in first operand and PCID invalidate descriptor specified in second operand. The first operands should be 32-bit general register when not in long mode, or 64-bit general register when in long mode. The second operand should be 128-bit memory location. The HLE and RTM extensions provide set of instructions for the transactional management. The xacquire and xrelease are new prefixes that can be used with some of the instructions to start or end lock elision on the memory address specified by prefixed instruction. The xbegin instruction starts the transactional execution, its operand is the address a fallback routine that gets executes in case of transaction abort, specified like the operand for near jump instruction. xend marks the end of transcational execution region, it takes no operands. xabort forces the transaction abort, it takes an 8-bit immediate value as its only operand, this value is passed in the highest bits of EAX to the fallback routine. xtest checks whether there is transactional execution in progress, this instruction takes no operands. 0