Overall Goal
I am building a city model deformation system based on the semantic features of six movies.
The core idea is:
Use a SOM (Self-Organizing Map) to map the semantic structure of each movie into parameters, and use those parameters to drive the geometric animation of 17 city models.
The entire pipeline is currently running, and I have completed the deformable animation structure of the models (Geometry Nodes + point-cloud buildings).
The next step is to use the generated control CSV to drive all of this.
Training Data Logic (Data Source → SOM)
(1) Training data comes from the original subtitles (.srt) of six movies.
Read the .srt files of the six movies.
Clean the text: remove timestamps, subtitle sequence numbers, and punctuation, keeping only English words.
Collect the set of unique words (several thousand in total).
Use the CLIP ViT-B/32 text encoder to encode each word into a 512-dimensional vector.
Because the SOM is trained on the same vocabulary that is later mapped onto it, its input is fully consistent with the text content, which gives a more accurate semantic space.
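The cleaning and encoding steps above could be sketched roughly as follows. This is a minimal illustration, not the actual script: the function names are hypothetical, stripping markup tags such as <i> is an added assumption, and the encoding function assumes the OpenAI `clip` package with PyTorch (the notes do not say which CLIP implementation is used).

```python
import re

# SRT timestamp lines look like "00:01:02,500 --> 00:01:04,000".
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")
TAG = re.compile(r"<[^>]+>")   # assumption: drop markup such as <i>...</i>
WORD = re.compile(r"[a-z]+")   # English letters only; "don't" becomes "don", "t"


def srt_to_lines(srt_text):
    """Return dialogue lines, dropping sequence numbers and timestamps."""
    lines = []
    for raw in srt_text.splitlines():
        text = TAG.sub("", raw).strip()
        # Note: a dialogue line consisting only of digits is dropped too.
        if not text or text.isdigit() or TIMESTAMP.search(text):
            continue
        lines.append(text)
    return lines


def tokenize(line):
    """Lowercase a line and keep only alphabetic words."""
    return WORD.findall(line.lower())


def build_vocabulary(srt_texts):
    """Collect the sorted set of unique words across all subtitle files."""
    vocab = set()
    for srt_text in srt_texts:
        for line in srt_to_lines(srt_text):
            vocab.update(tokenize(line))
    return sorted(vocab)


def encode_words(words, batch_size=256):
    """Encode words with CLIP ViT-B/32 and return a list of 512-D vectors.

    Assumes the OpenAI `clip` package and PyTorch are installed.
    """
    import torch
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)
    vectors = []
    with torch.no_grad():
        for i in range(0, len(words), batch_size):
            tokens = clip.tokenize(words[i:i + batch_size]).to(device)
            vectors.extend(model.encode_text(tokens).float().cpu().tolist())
    return vectors
```

Reading the .srt files with encoding="utf-8-sig" avoids a byte-order mark sticking to the first sequence number.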
SOM Training Logic: Semantic Map
(1) The SOM is my own implementation and uses a 20×30 grid.
Input: CLIP 512D vectors of all words
Output: A 20×30 semantic topology space (word_som)
Each SOM cell stores a semantic "center vector"
(2) The meaning of SOM
Each word has a semantic position (som_x, som_y)
Similar words are close together (e.g., "memory", "dream", "past"), and words with different semantics are far apart.
This map is the basis for driving the animation later.
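For illustration, the standard SOM update rule (find the best-matching cell, then pull it and its grid neighbours toward the input) can be sketched as below. This is not the author's implementation: the linear decay schedules, the Gaussian neighbourhood, the initialisation, and the convention som_x = column, som_y = row are all assumptions. It is written in plain Python for readability; at 512 dimensions and several thousand words a vectorised version would be needed.

```python
import math
import random


def best_matching_unit(weights, v):
    """Return (row, col) of the cell whose center vector is closest to v."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(w, v))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(vectors, rows=20, cols=30, epochs=10, lr0=0.5, sigma0=None, seed=0):
    """Train a rectangular SOM; weights[r][c] is that cell's center vector."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Initialise every cell from a randomly chosen training vector.
    weights = [[list(rng.choice(vectors)) for _ in range(cols)] for _ in range(rows)]
    sigma0 = sigma0 or max(rows, cols) / 2.0
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        order = list(range(len(vectors)))
        rng.shuffle(order)
        for idx in order:
            v = vectors[idx]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays to 0
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood radius shrinks
            br, bc = best_matching_unit(weights, v)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * sigma * sigma))
                    if h < 1e-3:
                        continue  # negligible influence this far from the BMU
                    w = weights[r][c]
                    g = lr * h
                    for k in range(dim):
                        w[k] += g * (v[k] - w[k])
            step += 1
    return weights


def map_words(words, vectors, weights):
    """Build word_som: word -> (som_x, som_y), taking x = column and y = row."""
    word_som = {}
    for word, v in zip(words, vectors):
        r, c = best_matching_unit(weights, v)
        word_som[word] = (c, r)
    return word_som
```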
Extending SOM to "Sentence-level Semantics"
I needed to find the SOM position for "each line of dialogue", so I did the following:
(1) Parse each line of subtitles
Remove the subtitle number and timestamp
Extract the dialogue text line by line
This yields several thousand lines of dialogue across the six-film corpus.
(2) For each line: split it into words, then look up their positions on the SOM
Clean the line and split it into words
Find each word's position on the SOM (som_x, som_y)
Take the average of these word positions as the semantic center of the sentence (fractional coordinates are expected)
(3) Obtain the final sentence-level SOM annotation table
The final generated data is:
movie, line_id, line_text, som_x, som_y, emotion
This CSV is the "semantic feature library driving the animation".
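The averaging step and the resulting table can be sketched as follows. The names `sentence_position`, `annotate_lines`, and `save_annotations` are hypothetical, `word_som` is assumed to be a dict from word to (som_x, som_y), and the emotion value is taken as given because the notes do not say how it is computed. Lines with no word on the map are skipped here, which is one possible choice.

```python
import csv
import re

WORD = re.compile(r"[a-z]+")
COLUMNS = ["movie", "line_id", "line_text", "som_x", "som_y", "emotion"]


def sentence_position(line_text, word_som):
    """Average the SOM positions of the words in a line that are on the map.

    Returns (som_x, som_y) as floats, or None if no word is on the map.
    """
    positions = [word_som[w] for w in WORD.findall(line_text.lower()) if w in word_som]
    if not positions:
        return None
    n = len(positions)
    return (sum(x for x, _ in positions) / n, sum(y for _, y in positions) / n)


def annotate_lines(lines, word_som):
    """lines: iterable of (movie, line_id, line_text, emotion) tuples."""
    rows = []
    for movie, line_id, line_text, emotion in lines:
        pos = sentence_position(line_text, word_som)
        if pos is None:
            continue  # no mappable word in this line
        rows.append({
            "movie": movie, "line_id": line_id, "line_text": line_text,
            "som_x": pos[0], "som_y": pos[1], "emotion": emotion,
        })
    return rows


def save_annotations(path, rows):
    """Write the annotation table with the column order used in these notes."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```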
City Model Deformation Animation (Completed)
I have completed this part:
I built 17 city modules in Blender (buildings represented as point clouds)
I created a deformation system with Geometry Nodes
I introduced various animation control parameters such as noise, scaling, rotation, and translation
All models can be uniformly controlled on the timeline
It is now possible to generate dynamic effects of city morphology changing with parameters
In other words, the city models can deform, shift, rotate, and "breathe", but they are not yet bound to any semantic data.
The core next step: Driving animation with CSV
The next task is clear:
I need to map the previously generated SOM+emotion table (movie semantics) into animation control parameters.
This includes:
scale (zoom in/out)
loc_x / loc_y / loc_z (displacement)
rot_z (rotation)
noise_val (Geometry Nodes noise intensity)
The logic is as follows:
Normalize som_x, som_y, and emotion
Set different parameter ranges for each movie (determining style)
Generate a movie_driving_params.csv file
Read the CSV in batch from Blender and insert keyframes automatically
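A minimal sketch of the first three steps is shown below. Several things in it are assumptions rather than decisions from these notes: emotion is treated as a numeric score, normalization is min-max over the whole corpus, the movie names and ranges are placeholders, and the choice of which feature drives which parameter (for example emotion driving scale) is purely illustrative.

```python
import csv

PARAMS = ["scale", "loc_x", "loc_y", "loc_z", "rot_z", "noise_val"]

# Placeholder per-movie ranges (min, max); real values define each film's style.
MOVIE_RANGES = {
    "film_a": {"scale": (0.8, 1.5), "loc_x": (-2.0, 2.0), "loc_y": (-2.0, 2.0),
               "loc_z": (0.0, 1.0), "rot_z": (-30.0, 30.0), "noise_val": (0.0, 0.5)},
    "film_b": {"scale": (0.5, 3.0), "loc_x": (-5.0, 5.0), "loc_y": (-5.0, 5.0),
               "loc_z": (0.0, 4.0), "rot_z": (-180.0, 180.0), "noise_val": (0.2, 2.0)},
}


def min_max(values):
    """Rescale values to [0, 1]; a constant column maps to 0.5."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.5 if span == 0 else (v - lo) / span for v in values]


def lerp(lo_hi, t):
    """Linear interpolation inside a (min, max) range for t in [0, 1]."""
    lo, hi = lo_hi
    return lo + (hi - lo) * t


def build_driving_params(rows, movie_ranges):
    """rows: dicts with movie, line_id, som_x, som_y, emotion (numeric)."""
    nx = min_max([float(r["som_x"]) for r in rows])
    ny = min_max([float(r["som_y"]) for r in rows])
    ne = min_max([float(r["emotion"]) for r in rows])
    out = []
    for r, x, y, e in zip(rows, nx, ny, ne):
        rng = movie_ranges[r["movie"]]
        out.append({
            "movie": r["movie"], "line_id": r["line_id"],
            "scale": lerp(rng["scale"], e),       # illustrative feature choices
            "loc_x": lerp(rng["loc_x"], x),
            "loc_y": lerp(rng["loc_y"], y),
            "loc_z": lerp(rng["loc_z"], e),
            "rot_z": lerp(rng["rot_z"], x),
            "noise_val": lerp(rng["noise_val"], e),
        })
    return out


def save_driving_params(path, params):
    """Write movie_driving_params.csv with one row per dialogue line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["movie", "line_id"] + PARAMS)
        writer.writeheader()
        writer.writerows(params)
```

Normalizing per movie instead of over the whole corpus would make each film use its full parameter range, at the cost of losing the differences in position between films.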
This achieves:
Different movies → different semantics → different deformation methods for city models
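The keyframing step could look roughly like this inside Blender. The sketch makes several assumptions: one CSV row becomes one keyframe spaced a fixed number of frames apart, rot_z is stored in degrees, and noise_val is written to a custom property on the object that the Geometry Nodes setup would read through a driver (that wiring is not shown). The functions only rely on attributes that Blender objects provide (`scale`, `location`, `rotation_euler`, custom properties, `keyframe_insert`).

```python
import csv
import math


def apply_params(obj, params, frame):
    """Set transform values on an object and keyframe them at `frame`."""
    s = float(params["scale"])
    obj.scale = (s, s, s)
    obj.location = (
        float(params["loc_x"]),
        float(params["loc_y"]),
        float(params["loc_z"]),
    )
    # Assumption: rot_z in the CSV is in degrees; Blender expects radians.
    obj.rotation_euler[2] = math.radians(float(params["rot_z"]))
    # Assumption: noise intensity lives in a custom property on the object.
    obj["noise_val"] = float(params["noise_val"])
    for path in ("scale", "location", "rotation_euler", '["noise_val"]'):
        obj.keyframe_insert(data_path=path, frame=frame)


def animate_from_csv(obj, csv_lines, movie=None, start_frame=1, frames_per_line=12):
    """Keyframe `obj` from driving-parameter rows.

    csv_lines: an open file or any iterable of CSV text lines.
    movie: if given, only rows for that movie are used.
    Returns the last frame that received a keyframe, or None if no row matched.
    """
    frame = start_frame
    last = None
    for row in csv.DictReader(csv_lines):
        if movie is not None and row["movie"] != movie:
            continue
        apply_params(obj, row, frame)
        last = frame
        frame += frames_per_line
    return last
```

In Blender's Python console this would be called with something like bpy.data.objects["City_01"] (a hypothetical object name) and an open movie_driving_params.csv file, repeated for each of the 17 models.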