ggml_new_tensor_impl: not enough space in the context's memory pool #29

Closed
NotNite opened this issue Mar 12, 2023 · 16 comments
Labels
wontfix This will not be worked on

Comments

@NotNite

NotNite commented Mar 12, 2023

Heya! A friend showed this to me and I'm trying to get it to work myself on Windows 10. I've applied the changes as seen in #22 to get it to build (more specifically, I pulled in the new commits from etra0's fork), but the actual executable fails to run, printing this before segfaulting:

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 458853944, available 454395136)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 458870468, available 454395136)

I'm trying to use 7B on an i9-13900K (and I have about 30 gigs of memory free right now), and I've verified my hashes with a friend. Any ideas? Thanks!
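For context: this message comes from ggml's fixed-size context pool. The caller estimates the total tensor memory up front and passes it to ggml_init; every tensor is then carved out of that single pool, and the error fires when the estimate comes up short. Note the needed/available numbers above differ by only a few MB, so this is about the pool-size estimate rather than physical RAM. A minimal sketch of the allocation pattern, assuming the ggml_init_params fields as they appear in the ggml headers (the sizes below are made up, and this is not the actual llama.cpp code):

#include <stdio.h>
#include "ggml.h"

int main(void) {
    // Reserve the whole pool for this context once, up front.
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,  // 16 MB pool (illustrative size)
        /*.mem_buffer =*/ NULL,              // let ggml allocate the buffer itself
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Every tensor is carved out of that pool; once the running total exceeds
    // mem_size, ggml_new_tensor_impl reports "not enough space in the context's
    // memory pool (needed N, available M)".
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024 * 1024);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4 * 1024 * 1024); // may not fit

    printf("a = %p, b = %p\n", (void *) a, (void *) b);

    ggml_free(ctx);
    return 0;
}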

@NotNite
Author

NotNite commented Mar 12, 2023

Tried out #31 - it, uh, got farther: GGML_ASSERT: D:\code\c++\llama.cpp\ggml.c:9349: false

@etra0
Collaborator

etra0 commented Mar 12, 2023

Ok, I made an oopsie in that PR; initializing it that way apparently didn't zero out the rest of the fields. I updated the branch, please test it again now!
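For readers landing here later, a hedged sketch of the pitfall being described (an illustration only, not the actual diff in that PR; field names follow the ggml headers):

#include <stddef.h>
#include "ggml.h"

// Pitfall: assigning only one field of an uninitialized stack struct leaves
// the remaining fields holding whatever garbage was already on the stack.
struct ggml_context * init_risky(size_t ctx_size) {
    struct ggml_init_params params;
    params.mem_size = ctx_size;        // mem_buffer / no_alloc are never set
    return ggml_init(params);
}

// Safer: a designated initializer (or a memset) gives every field a
// well-defined value.
struct ggml_context * init_explicit(size_t ctx_size) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ ctx_size,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    return ggml_init(params);
}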

@NotNite
Author

NotNite commented Mar 12, 2023

> Ok, I made an oopsie in that PR; initializing it that way apparently didn't zero out the rest of the fields. I updated the branch, please test it again now!

It started to expand the prompt, but with seemingly garbage data: Building a website can be done in 10 simple steps: ╨Ñ╤Ç╨╛╨╜╨╛╨╗╨╛╨│╨╕╤ÿ╨

@ggerganov
Owner

Should be good on latest master - reopen if issue persists.
Make sure to rebuild and regenerate the models after updating.

@eshaanagarwal

Hey, I was trying to run this on a RHEL 8 server with 32 CPU cores, and I am getting the same error on my second query.

I am using GPT4All-J v1.3-groovy.

ggml_new_tensor_impl: not enough space in the context's memory pool

@eshaanagarwal

Hi @ggerganov @gjmulder, I would appreciate some direction on this, please.

@superbsky

Getting the same issue on an Apple M1 Pro with 16GB RAM when trying the example from:

https://github.com/curiousily/Get-Things-Done-with-Prompt-Engineering-and-LangChain/blob/master/06.private-gpt4all-qa-pdf.ipynb

I am using a relatively large PDF with ~200 pages.

Stack trace:

gpt_tokenize: unknown token '?'
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16118890208, available 16072355200)
[1] 62734 segmentation fault python3
/opt/homebrew/Cellar/python@3.11/3.11.4/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

@dzupin

dzupin commented Jul 18, 2023

Same issue when running on Win11 with 64GB RAM (25 GB utilized):

ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 450887680, available 446693376)
Traceback (most recent call last):
File "C:\AI\oobabooga_windows_GPU\text-generation-webui\modules\callbacks.py", line 55, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
File "C:\AI\oobabooga_windows_GPU\text-generation-webui\modules\llamacpp_model.py", line 92, in generate
for completion_chunk in completion_chunks:
File "C:\AI\oobabooga_windows_GPU\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 891, in _create_completion
for token in self.generate(
File "C:\AI\oobabooga_windows_GPU\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 713, in generate
self.eval(tokens)
File "C:\AI\oobabooga_windows_GPU\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 453, in eval
return_code = llama_cpp.llama_eval(
File "C:\AI\oobabooga_windows_GPU\installer_files\env\lib\site-packages\llama_cpp\llama_cpp.py", line 612, in llama_eval
return _lib.llama_eval(ctx, tokens, n_tokens, n_past, n_threads)
OSError: exception: access violation reading 0x0000000000000028
Output generated in 39.00 seconds (0.00 tokens/s, 0 tokens, context 5200, seed 1177762893)

@LoganDark
Contributor

> Same issue when running on Win11 with 64GB RAM (25 GB utilized): [snip]

Oh hey, exact same error:

ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 452859040, available 446693376)
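Side note on the two reports above: these complain about the scratch memory pool, a separate fixed-size buffer that llama.cpp of that era used for intermediate tensors, distinct from the context pool in the original report. It can overflow in the same needed-vs-available way when prompts/contexts get large. A hedged sketch of the mechanism, assuming the ggml_set_scratch API that ggml exposed at the time (the helper function and sizing are illustrative, not llama.cpp's actual code):

#include <stddef.h>
#include "ggml.h"

// Route temporary tensors through a caller-provided "scratch" buffer instead
// of the main context pool. If scratch_size is smaller than what the graph
// actually needs, ggml reports "not enough space in the scratch memory pool
// (needed N, available M)" -- the message seen above.
void build_with_scratch(struct ggml_context * ctx, void * scratch_data, size_t scratch_size) {
    struct ggml_scratch scratch = {
        /*.offs =*/ 0,
        /*.size =*/ scratch_size,
        /*.data =*/ scratch_data,   // must outlive any tensors placed in it
    };
    ggml_set_scratch(ctx, scratch); // tensors created after this land in the scratch buffer

    // ... build the intermediate/temporary parts of the graph here ...

    // Switch back to the main context pool for tensors that must persist.
    struct ggml_scratch none = { /*.offs =*/ 0, /*.size =*/ 0, /*.data =*/ NULL };
    ggml_set_scratch(ctx, none);
}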

@omarelanis

Same issue here; I've tried a combination of settings but just keep getting the memory error, even though both RAM and GPU RAM are at less than 50% utilization.

I had to follow the guide here to build llama-cpp with GPU support as it wasn't working previously, but even before that it was giving the same error (side note: GPU support does work natively in oobabooga on Windows!?):
abetlen/llama-cpp-python#182

Anyone have any ideas?

HW:
Intel i9-10900K OC @5.3GHz
64GB DDR4-2400 / PC4-19200
12GB Nvidia GeForce RTX 3060

Using embedded DuckDB with persistence: data will be stored in: db
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
llama.cpp: loading model from models/llama7b/llama-deus-7b-v3.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 2927.79 MB (+ 1024.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1470 MB
llama_new_context_with_model: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

What would you like to know about the policies?

test

ggml_new_object: not enough space in the context's memory pool (needed 10882896, available 10650320)
Traceback (most recent call last):
File "H:\AI_Projects\Indexer_Plus_GPT\chat.py", line 84, in
main()
File "H:\AI_Projects\Indexer_Plus_GPT\chat.py", line 55, in main
res = qa(query)
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 243, in call
raise e
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 237, in call
self._call(inputs, run_manager=run_manager)
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\retrieval_qa\base.py", line 133, in _call
answer = self.combine_documents_chain.run(
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 441, in run
return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 243, in call
raise e
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 237, in call
self._call(inputs, run_manager=run_manager)
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\combine_documents\base.py", line 106, in _call
output, extra_return_dict = self.combine_docs(
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\combine_documents\stuff.py", line 165, in combine_docs
return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\llm.py", line 252, in predict
return self(kwargs, callbacks=callbacks)[self.output_key]
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 243, in call
raise e
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 237, in call
self._call(inputs, run_manager=run_manager)
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\llm.py", line 92, in _call
response = self.generate([inputs], run_manager=run_manager)
File "C:\Program Files\Python310\lib\site-packages\langchain\chains\llm.py", line 102, in generate
return self.llm.generate_prompt(
File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 188, in generate_prompt
return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)
File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 281, in generate
output = self._generate_helper(
File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 225, in _generate_helper
raise e
File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 212, in _generate_helper
self._generate(
File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 604, in _generate
self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
File "C:\Program Files\Python310\lib\site-packages\langchain\llms\llamacpp.py", line 229, in _call
for token in self.stream(prompt=prompt, stop=stop, run_manager=run_manager):
File "C:\Program Files\Python310\lib\site-packages\langchain\llms\llamacpp.py", line 279, in stream
for chunk in result:
File "C:\Program Files\Python310\lib\site-packages\llama_cpp\llama.py", line 899, in _create_completion
for token in self.generate(
File "C:\Program Files\Python310\lib\site-packages\llama_cpp\llama.py", line 721, in generate
self.eval(tokens)
File "C:\Program Files\Python310\lib\site-packages\llama_cpp\llama.py", line 461, in eval
return_code = llama_cpp.llama_eval(
File "C:\Program Files\Python310\lib\site-packages\llama_cpp\llama_cpp.py", line 678, in llama_eval
return _lib.llama_eval(ctx, tokens, n_tokens, n_past, n_threads)
OSError: exception: access violation reading 0x0000000000000000

@jiapei100

Same here... any solutions yet?

@sherrmann

Solved this by going back to llama-cpp-python version 0.1.74

@LoganDark
Contributor

> Solved this by going back to llama-cpp-python version 0.1.74

Well, this has nothing to do with Python.

@dereklll

Same here... any solutions yet?

@sozforex
Contributor

@dereklll This issue was closed 6 months ago; I'd suggest creating a new one.

@dillfrescott

Same issue on a RunPod GPU machine; I tried 2 different GPUs.
