* Previously when a conversation was forked this would result in both the parent and the child sharing exactly the same logits. Since sampling is allowed to modify logits this could lead to issues in sampling (e.g. one conversation is sampled and overwrites logits to be all zero, second conversation is sampled and generates nonsense). Fixed this by setting a "forked" flag, logits are copied if this flag is set. Flag is cleared next time the conversation is prompted so this extra copying only happens once after a fork occurs.
* Removed finalizer from `BatchedExecutor`. This class does not directly own any unmanaged resources so it is not necessary.
Replaced `BatchedExecutor.Prompt(string)` method with `BatchedExecutor.Create()` method. This improves the API in two ways:
- A conversation can be created, without immediately prompting it
- Other prompting overloads (e.g. prompt with token list) can be used without duplicating all the overloads onto `BatchedExecutor`
Added `BatchSize` property to `LLamaContext`
- Updated LLamaSharpChatCompletion class in LLama.SemanticKernel/ChatCompletion/LLamaSharpChatCompletion.cs
- Changed the type of the "_model" field from "StatelessExecutor" to "ILLamaExecutor"
- Updated the constructor to accept an "ILLamaExecutor" parameter instead of a "StatelessExecutor" parameter
- Updated LLamaSharpChatCompletion class in LLama.SemanticKernel/LLamaSharp.SemanticKernel.csproj
- Updated LLama.Unittest project in LLama.Unittest/LLama.Unittest.csproj
- Added a "PackageReference" for "Moq" version 4.20.70
- Added ExtensionMethodsTests class in LLama.Unittest/SemanticKernel/ExtensionMethodsTests.cs
- Added tests for the "ToLLamaSharpChatHistory" and "ToLLamaSharpInferenceParams" extension methods
- Added LLamaSharpChatCompletionTests class in LLama.Unittest/SemanticKernel/LLamaSharpChatCompletionTests.cs
- Added tests for the LLamaSharpChatCompletion class
ℹ️ The LLamaSharpChatCompletion class in the LLama.SemanticKernel project has been updated to use the ILLamaExecutor interface instead of the StatelessExecutor class. This change allows for better abstraction and flexibility in the implementation of the LLamaSharpChatCompletion class. The LLamaSharpChatCompletion class is responsible for providing chat completion functionality in the LLamaSharp project. The LLama.Unittest project has also been updated to include tests for the LLamaSharpChatCompletion class and the extension methods used by the class.
- Modified library loading to be based on `SetDllImportResolver`. This replaces the built in loading system and ensures there can't be two libraries loaded at once.
- llava and llama are loaded separately, as needed.
- All the previous loading logic is still used, within the `SetDllImportResolver`
- Split out CUDA, AVX and MacOS paths to separate helper methods.
- `Description` now specifies if it is for `llama` or `llava`
- Update Microsoft.KernelMemory.Core to version 0.34.240313.1
- Update Microsoft.SemanticKernel to version 1.6.2
- Update Microsoft.SemanticKernel.Plugins.Memory to version 1.6.2-alpha
- Update Microsoft.KernelMemory.Abstractions to version 0.34.240313.1
- Update Microsoft.SemanticKernel.Abstractions to version 1.6.2
* Add llava_binaries, update all binaries to make the test
* Llava API + LlavaTest
Preliminary
* First prototype of Load + Unit Test
* Temporary run test con branch LlavaAPI
* Disable Embed test to review the rest of the test
* Restore Embedding test
* Use BatchThread to eval image embeddings
Test Threads default value to ensure it doesn´t produce problems.
* Rename test file
* Update action versions
* Test only one method, no release embeddings
* Revert "Test only one method, no release embeddings"
This reverts commit 264e176dccc9cd0be318b800ae5e102a4635d01c.
* Correct API call
* Only test llava related functionality
* Cuda and Cblast binaries
* Restore build policy
* Changes related with code review
* Add SafeHandles
* Set overwrite to upload-artifact@v4
* Revert to upload-artifact@v3
* revert to upload-artifact@v3
* Added a lock object into `SafeLlamaModelHandle` which all calls to `llama_decode` (in the `SafeLLamaContextHandle`) lock first. This prevents two contexts from running inference on the same model at the same time, which seems to be unsafe in llama.cpp.
* Modified the lock to be global over _all_ inferences. This seems to be necessary (at least with the CUDA backend).
Modified LLamaBatch to not share tokens with other sequences if logits is true. This ensures that the logit span at the end in used by exactly one sequence - therefore it's safe to mutate. This removes the need for copying _very_ large arrays (vocab size) and simplifies sampling pipelines.
* Added a `Guidance` method to `LLamaTokenDataArray` which applies classifier free guidance
* Factored out a safer `llama_sample_apply_guidance` method based on spans
* Created a guided sampling demo using the batched executor
* fixed comment, "classifier free" not "context free"
* Rebased onto master and fixed breakage due to changes in `BaseSamplingPipeline`
* Asking user for guidance weight
* Progress bar in batched fork demo
* Improved fork example (using tree display)
* Added proper disposal of resources in batched examples
* Added some more comments in BatchedExecutorGuidance