PS: Many paid tools have appeared on the market, similar to Jihu Manjian and SuTui, but their core functionality is essentially the same; what still needs validating is how well GPT performs in this pipeline. This year Sora emerged as an evolved form of this field, and it is more likely to impact the film and television production sector (UE4).
Function Design#

- Extract storyboard scenes: segment the novel text into sentences, generate images with SD, and synthesize speech with TTS.
- Novel content → derived prompt words (for SD image generation).
- Merge the images and audio into a video.

Models: TTS (edge), SD image model (cetusMix_Whalefall2 here), GPT (Gemini here).

Project address: story-vision
Core Code#
Novel Storyboard Extraction GPT#
```python
prompt = """I want you to create a storyboard based on the novel content, inferring scenes from the original text description.
Infer and supplement missing or implied information, including but not limited to: character clothing, hairstyle, hair color, complexion, facial features, posture, emotions, and body movements; style description (including but not limited to: era, space, time of day, geographical environment, weather); item description (including but not limited to: animals, plants, food, fruits, toys); visual perspective (including but not limited to: character proportions, camera depth, observation angle). Do not overdo it.
Describe richer character emotions and emotional states through camera language, and generate new descriptive content in full sentences once you understand this requirement.
Change the output format to:
Illustration 1:
Original description: the corresponding original sentences;
Scene description: the corresponding scene plot content;
Scene characters: names of the characters appearing in the scene;
Clothing: the protagonist is dressed casually;
Location: sitting in front of the bar;
Expression: facial lines are gentle, expression is pleasant;
Behavior: gently swaying the wine glass in hand;
Environment: the bar background is dark-toned, candlelight flickers in the background, giving a dreamy feeling.
If you understand this requirement, please confirm these points, and return results containing only these points.
The novel content is as follows:"""
```
```python
def split_text_into_chunks(text, max_length=ai_max_length):
    """
    Split text into chunks of at most max_length characters,
    ensuring that splits only occur at line breaks.
    (ai_max_length comes from the project configuration.)
    """
    lines = text.splitlines()
    chunks = []
    current_chunk = ''
    for line in lines:
        # Join with a newline so the original line breaks survive.
        candidate = line if not current_chunk else current_chunk + '\n' + line
        if len(candidate) <= max_length:
            current_chunk = candidate
        else:
            chunks.append(current_chunk)
            current_chunk = line
    chunks.append(current_chunk)
    return chunks
```
```python
def rewrite_text_with_genai(text, prompt="Please rewrite this text:"):
    chunks = split_text_into_chunks(text)
    rewritten_text = ''
    # pbar = tqdm(total=len(chunks), ncols=150)
    genai.configure(api_key=cfg['genai_api_key'])
    model = genai.GenerativeModel('gemini-pro')
    for chunk in chunks:
        # No trailing comma here: it would silently turn _prompt into a tuple.
        _prompt = f"{prompt}\n{chunk}"
        response = model.generate_content(
            contents=_prompt,
            generation_config=genai.GenerationConfig(
                temperature=0.1,
            ),
            stream=True,
            safety_settings=[
                {"category": "HARM_CATEGORY_DANGEROUS", "threshold": "BLOCK_NONE"},
                {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
                {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
                {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
                {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
            ],
        )
        # Streamed response: append each text chunk as it arrives.
        for _chunk in response:
            if _chunk.text is not None:
                rewritten_text += _chunk.text.strip()
        # pbar.update(1)
    # pbar.close()
    return rewritten_text
```
Storyboard Output
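The storyboard text that comes back follows the "Illustration N:" format requested in the prompt, so it can be split into one block per scene before feeding SD and TTS. A minimal parsing sketch (the format and field handling are assumptions based on the prompt above, not code from the project):

```python
import re

def parse_storyboard(text):
    """Split GPT storyboard output into one entry per illustration.

    Assumes the "Illustration N:" headers requested in the prompt;
    everything after each header, up to the next one, is kept as the
    scene body.
    """
    entries = []
    # re.split with a capture group keeps the illustration numbers.
    parts = re.split(r'Illustration\s*(\d+)\s*:', text)[1:]
    for num, body in zip(parts[0::2], parts[1::2]):
        entries.append({'index': int(num), 'body': body.strip()})
    return entries
```

Each entry's body still contains the labeled fields (scene description, clothing, location, and so on), which can be parsed further or handed to GPT to derive SD prompt words.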
SD Text-to-Image#
The prompt for SD is generated by GPT based on the storyboard text output above.
```python
from diffusers import StableDiffusionPipeline
import torch

model_path = "./models/cetusMix_Whalefall2.safetensors"
pipeline = StableDiffusionPipeline.from_single_file(
    model_path,
    torch_dtype=torch.float16,
    variant="fp16"
).to("mps")
# Fixed seed so repeated runs produce the same image for the same prompt.
generator = torch.Generator("mps").manual_seed(31)

def sd_cetus(save_name, prompt):
    image = pipeline(prompt, generator=generator).images[0]
    image.save('data/img/' + save_name + '.jpg')
```
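The derivation of prompt words from a storyboard entry (which the project delegates to GPT) can be sketched roughly like this; the field names and the leading quality tags are assumptions, not the project's actual mapping:

```python
def storyboard_to_sd_prompt(scene):
    """Flatten a parsed storyboard entry into comma-separated SD prompt tags.

    `scene` is a dict of the fields the storyboard prompt asks for;
    the quality tags at the front are a common SD convention.
    """
    quality = "masterpiece, best quality"
    fields = ['clothing', 'location', 'expression', 'behavior', 'environment']
    parts = [scene[f] for f in fields if scene.get(f)]
    return ", ".join([quality] + parts)
```

The resulting string can be passed straight to `sd_cetus` as its `prompt` argument.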
Image Effect
TTS Audio Generation#
There are many TTS options available online; here we use the one provided by edge.
```python
import edge_tts
import asyncio

voice = 'zh-CN-YunxiNeural'
output = 'data/voice/'
rate = '-4%'      # slightly slower than the default speed
volume = '+0%'

async def tts_function(text, save_name):
    tts = edge_tts.Communicate(
        text,
        voice=voice,
        rate=rate,
        volume=volume
    )
    await tts.save(output + save_name + '.wav')
```
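The remaining step, merging each image with its narration into a video clip, is not shown above; one common approach is to loop the still image over the audio with ffmpeg. A sketch of the command construction (the paths and encoder flags are assumptions, not the project's actual merge step):

```python
def merge_cmd(img_path, audio_path, out_path):
    """Build an ffmpeg command that loops a still image for the
    duration of the narration audio and muxes the two together."""
    return [
        'ffmpeg', '-y',
        '-loop', '1', '-i', img_path,   # still image as the video stream
        '-i', audio_path,               # TTS narration as the audio stream
        '-c:v', 'libx264', '-tune', 'stillimage',
        '-c:a', 'aac',
        '-pix_fmt', 'yuv420p',
        '-shortest',                    # stop when the audio ends
        out_path,
    ]
```

The returned list can be run with `subprocess.run`, and the per-scene clips can then be joined with ffmpeg's concat demuxer to produce the final video.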
Video Effect#
Video: "Chapter 1: Entering the Station_out" (https://live.csdn.net/v/embed/379613)