• empireOfLove2@lemmy.dbzer0.com
    9 hours ago

    There’s no way that I know of to see the per-prompt usage for commercially available models. They obviously hide that. I admit I don’t research them much, but I am assuming each chip processes prompts one at a time.

    It’s pretty simple arithmetic: if the model runs exclusively on a single-GPU system and a prompt takes X seconds to generate on that GPU, you multiply the GPU’s power draw by X seconds, then add whatever fraction of the datacenter’s overhead power is attributable to that GPU over the same X seconds. For locally run models on your own hardware this is also trivial to calculate.
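As a rough sketch of that arithmetic (the function name and all numbers are made up for illustration, and it assumes the GPU handles one prompt at a time at a constant power draw):

```python
def prompt_energy_wh(gpu_power_w, gen_seconds, overhead_power_w=0.0):
    """Estimate the energy for one prompt, in watt-hours.

    gpu_power_w:      average GPU power draw while generating (watts)
    gen_seconds:      time the prompt takes to generate (the "X seconds")
    overhead_power_w: share of datacenter overhead power (cooling, etc.)
                      attributed to this GPU (watts); an assumed input
    """
    total_power_w = gpu_power_w + overhead_power_w
    joules = total_power_w * gen_seconds  # energy = power * time
    return joules / 3600.0                # 1 Wh = 3600 J


# Hypothetical numbers: 300 W GPU, 60 W overhead share, 10 s per prompt.
print(prompt_energy_wh(300, 10, 60))  # 1.0
```

For a local setup you can leave the overhead at zero, or plug in the rest of the machine's draw instead.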

    Alternatively, GPUs run at a certain number of “tokens” per second, and each prompt is a certain number of tokens fed into the model, generally scaling with the length of the prompt.
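The token-based version is the same estimate with the generation time derived from throughput. This is only a sketch: the names and figures are hypothetical, and it treats every token as costing the same at a constant tokens-per-second rate, which is a simplification.

```python
def prompt_energy_from_tokens_wh(total_tokens, tokens_per_second,
                                 gpu_power_w, overhead_power_w=0.0):
    """Estimate energy for one prompt (watt-hours) from a token count.

    total_tokens:      tokens processed for the prompt (assumed known)
    tokens_per_second: throughput of the GPU for this model (assumed constant)
    gpu_power_w:       average GPU power draw (watts)
    overhead_power_w:  overhead power attributed to this GPU (watts)
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    gen_seconds = total_tokens / tokens_per_second  # time = tokens / rate
    joules = (gpu_power_w + overhead_power_w) * gen_seconds
    return joules / 3600.0  # convert joules to watt-hours


# Hypothetical numbers: 500 tokens at 50 tokens/s on a 300 W GPU
# with a 60 W overhead share.
print(prompt_energy_from_tokens_wh(500, 50, 300, 60))  # 1.0
```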