NumReps is, of course, the maximum number of replacements to make.
Well, the asm is about as primitive as it gets, but it runs pretty fast (10 MB of zeroes -> ones in less than 100 ms on my old
[email protected]). I can only assume that P4+ CPUs have optimizations for such trivial memory load/store patterns.
Here's the FASM code, without the compiler definitions for 'segments' and stuff.
BTW, I've only tested the code once.

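proc ReplaceByte stdcall uses ebx ecx edx esi edi, hayStack, hayStackSize, ByteFrom:WORD, ByteTo:WORD, StartOffset, NumReps
pushfd ;save flags (incl. the direction flag changed by cld)
mov ecx,[hayStackSize]
mov eax,[StartOffset]
xor edx,edx ;edx=number of replacements made so far
sub ecx,eax ;ecx=bytes left to scan from StartOffset
jle .done ;nothing to scan (signed compare)
cmp [NumReps],0
jz .done ;a limit of 0 means nothing to do
mov edi,[hayStack]
add edi,eax ;edi=&(hayStack[StartOffset])
movzx eax,byte [ByteFrom] ;al=byte to search for
movzx ebx,byte [ByteTo] ;bl=replacement byte
cld ;scan forward
.rep:
repne scasb ;advance until [edi]==al or ecx==0
jne .done ;no match in the remaining bytes
mov [edi-1],bl ;edi already points past the match
inc edx
dec [NumReps]
jz .done ;replacement limit reached
or ecx,ecx ;any bytes left?
jnz .rep
.done:
popfd ;restore the caller's flags
mov eax,edx ;return the number of replacements
ret
endp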
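
For anyone who wants to try it, here is a minimal sketch of the scaffolding I left out. It assumes a 32-bit Windows PE console target and the standard win32a.inc macro package that ships with FASM (it provides proc, stdcall and invoke). The names buf, BUF_SIZE and replaced and the sample data are made up for illustration, and I haven't assembled this exact listing, so treat it as a starting point rather than tested code.

```
format PE console
entry start
include 'win32a.inc'            ; assumes FASM's INCLUDE directory is on the include path

section '.data' data readable writeable
  buf       db 0,0,5,0,7,0      ; hypothetical sample buffer
  BUF_SIZE  = $ - buf           ; 6 bytes
  replaced  dd ?                ; receives the return value

section '.code' code readable executable
start:
  ; replace 0 with 1, starting at offset 0, at most 2 replacements
  stdcall ReplaceByte, buf, BUF_SIZE, 0, 1, 0, 2
  mov [replaced], eax           ; eax = 2, buf is now 1,1,5,0,7,0
  invoke ExitProcess, 0

  ; paste the ReplaceByte proc from above here

section '.idata' import data readable
  library kernel32,'KERNEL32.DLL'
  import kernel32, ExitProcess,'ExitProcess'
```

Passing -1 as NumReps works as "no limit" in practice, since the counter is only compared against zero and decremented once per replacement.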