Merge pull request #248 from aminajavaid30/ai_speech_trainer

Added new Project: AI Speech Trainer Agent Team
Shubham Saboo 2025-06-20 18:35:15 -05:00 committed by GitHub
commit 2cfb953a65
19 changed files with 1128 additions and 0 deletions

@ -0,0 +1,163 @@
# AI Speech Trainer: Your Multimodal Public Speaking Coach
## Overview of the Idea
AI Speech Trainer is an AI-powered multi-agent, multimodal public speaking coach that listens to how you speak, watches how you express yourself, and evaluates what you say, helping you become a confident public speaker.
Whether you're preparing for a TED talk, interview, or school presentation, AI Speech Trainer gives you personalized feedback: it highlights your strengths and weaknesses and suggests how to speak better, more clearly, and more confidently.
This project has been built as part of the **Global Agent Hackathon (May 2025)**. It leverages the power of multi-agent collaboration, real-time feedback, and multimodal analysis to help anyone become a confident and effective speaker.
## Features
### Core Features
- **Facial Expression Analysis**: Emotion recognition and eye contact estimation
- **Audio Analysis**: Pace, pitch, clarity, and filler words
- **Content Evaluation**: GPT-based feedback on structure, tone, and clarity
- **Personalized Feedback**: Average score, overall assessment, strengths, weaknesses, and suggestions for improvement
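The pace and filler-word metrics listed above can be illustrated with a few lines of plain Python. This is a minimal sketch, not the project's implementation: the words-per-minute formula mirrors the one in `voice_analysis_tool.py`, while `FILLER_WORDS`, `speech_rate_wpm`, and `count_filler_words` are hypothetical names introduced here (in the project itself, filler words are identified by the LLM-based Content Agent).

```python
import re
from collections import Counter

# Hypothetical filler list for illustration; the project leaves this to the LLM.
FILLER_WORDS = ("um", "uh", "like", "basically", "actually")


def speech_rate_wpm(transcript: str, duration_seconds: float) -> float:
    """Words per minute; returns 0.0 when the duration is not positive."""
    if duration_seconds <= 0:
        return 0.0
    return len(transcript.split()) / (duration_seconds / 60.0)


def count_filler_words(transcript: str) -> dict[str, int]:
    """Count case-insensitive whole-word occurrences of each filler word."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return dict(Counter(t for t in tokens if t in FILLER_WORDS))


transcript = "So, um, I was thinking that, like, we could actually start soon."
print(speech_rate_wpm(transcript, 6.0))
print(count_filler_words(transcript))
```

Matching whole tokens rather than substrings avoids counting "like" inside words such as "likely".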
### Agents
- **Facial Agent**: Analyzes expression, engagement, and eye contact
- **Vocal Agent**: Detects speech issues (speed, filler words, pitch)
- **Content Agent**: Uses LLMs to assess and improve content clarity
- **Feedback Agent**: Uses the responses from other agents to evaluate the speaker based on a scoring rubric
- **Coordinator Agent**: Runs the other agents as a team and orchestrates all analysis and feedback generation
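To make the Feedback Agent's rubric concrete, here is a minimal sketch of how five 1-5 scores could be combined into a total, an average, and a label. The criterion keys and the five labels follow the JSON format in `feedback_agent.py`; the band boundaries in `BANDS` are an assumption made for this example, since the project lets the LLM choose the label, and `summarize_scores` is a hypothetical helper.

```python
CRITERIA = (
    "content_organization",
    "delivery_vocal_quality",
    "body_language_eye_contact",
    "audience_engagement",
    "language_clarity",
)

# Assumed lower bounds for each label (total score ranges from 5 to 25).
BANDS = (
    (21, "Outstanding speaker"),
    (17, "Proficient speaker"),
    (13, "Competent speaker"),
    (9, "Developing skills"),
    (5, "Needs significant improvement"),
)


def summarize_scores(scores: dict[str, int]) -> dict:
    """Validate the five rubric scores and derive total, average, and label."""
    for criterion in CRITERIA:
        value = scores.get(criterion)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{criterion} must be an integer from 1 to 5")
    total = sum(scores[c] for c in CRITERIA)
    label = next(name for floor, name in BANDS if total >= floor)
    return {
        "total_score": total,
        "average_score": total / len(CRITERIA),
        "interpretation": label,
    }


example = dict(zip(CRITERIA, (4, 3, 4, 5, 4)))
print(summarize_scores(example))
```

Computing the total in code rather than trusting the model's arithmetic is one way to keep the reported score consistent with the individual criteria.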
## How It Works
### **User Flow**:
1. User opens the Streamlit app and uploads a video of themselves practicing a speech or presentation.
2. Multiple agents get into action:
   - Facial Agent analyzes expressions and eye contact.
   - Vocal Agent transcribes the speech and detects voice attributes.
   - Content Agent evaluates grammar, structure, and coherence.
   - Feedback Agent provides feedback on the overall effectiveness of the speech.
   - Coordinator Agent aggregates all agent insights.
3. AI Speech Trainer presents a detailed feedback report, including rubric-based scores and a summary of the feedback.
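The order of operations in the user flow can be sketched as a plain function. This is a deterministic stand-in under stated assumptions, not the project's code: in the project an Agno `Team` in `coordinate` mode lets the LLM delegate to members, whereas `run_pipeline` and the `fake_*` stubs below are hypothetical and only show which output feeds which step. The result keys match the first four fields of `CoordinatorResponse`.

```python
from typing import Any, Callable


def run_pipeline(
    video_path: str,
    facial: Callable[[str], dict],
    vocal: Callable[[str], dict],
    content: Callable[[str], dict],
    feedback: Callable[[dict, dict, dict], dict],
) -> dict[str, Any]:
    """Run the four analyses in dependency order and collect their results."""
    facial_result = facial(video_path)
    voice_result = vocal(video_path)
    # The content step depends on the transcript produced by the vocal step.
    content_result = content(voice_result["transcription"])
    feedback_result = feedback(facial_result, voice_result, content_result)
    return {
        "facial_expression_response": facial_result,
        "voice_analysis_response": voice_result,
        "content_analysis_response": content_result,
        "feedback_response": feedback_result,
    }


# Stub agents with canned outputs, for illustration only.
def fake_facial(path: str) -> dict:
    return {"engagement_metrics": {"eye_contact_frequency": 0.8}}


def fake_vocal(path: str) -> dict:
    return {"transcription": "Hello everyone", "speech_rate_wpm": "120.0"}


def fake_content(transcript: str) -> dict:
    return {"filler_words": {}, "word_count": len(transcript.split())}


def fake_feedback(facial: dict, voice: dict, content: dict) -> dict:
    return {"total_score": 20}


report = run_pipeline("my_video.mp4", fake_facial, fake_vocal, fake_content, fake_feedback)
print(sorted(report))
```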
### **Core Functionality**:
- Facial emotion recognition using OpenCV, DeepFace, and Mediapipe landmarks.
- Voice transcription and analysis.
- Content analysis using GPT-based feedback.
- Aggregated evaluation score and feedback summary.
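The eye contact estimate in `facial_expression_tool.py` is a simple heuristic: measure the eyelid opening of each eye in pixels, average the two, and count the frame as eye contact when the average exceeds a threshold. The sketch below reproduces that idea without OpenCV or Mediapipe, assuming the four eyelid points are already available as pixel coordinates. The function names are hypothetical, and unlike the tool this version divides by the number of frames passed in rather than by every sampled frame.

```python
import math

# Threshold in pixels, the value used in facial_expression_tool.py.
EYE_OPENING_THRESHOLD_PX = 5

Point = tuple[int, int]


def has_eye_contact(
    left_upper: Point,
    left_lower: Point,
    right_upper: Point,
    right_lower: Point,
    threshold: float = EYE_OPENING_THRESHOLD_PX,
) -> bool:
    """True when the average eyelid opening of both eyes exceeds the threshold."""
    left_opening = math.dist(left_upper, left_lower)
    right_opening = math.dist(right_upper, right_lower)
    return (left_opening + right_opening) / 2 > threshold


def eye_contact_frequency(frames: list[tuple[Point, Point, Point, Point]]) -> float:
    """Fraction of the given frames classified as eye contact."""
    if not frames:
        return 0.0
    hits = sum(has_eye_contact(*frame) for frame in frames)
    return hits / len(frames)


frames = [
    ((100, 200), (100, 208), (160, 200), (160, 207)),  # open eyes
    ((100, 200), (100, 203), (160, 200), (160, 202)),  # nearly closed eyes
]
print(eye_contact_frequency(frames))
```

Because the threshold is in pixels, it depends on the frame size; the tool resizes every frame to 640x480 before measuring, which keeps the value comparable across videos.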
### **Multimodal Elements**:
- **Audio**: Speech input & voice quality analysis.
- **Video**: Facial expression tracking and feedback.
- **Text**: GPT-based feedback on structure, clarity, and tone.
## Tech Stack
### AI/ML Tools
- **Agno**: For building multi-agent collaboration and coordination.
- **Facial Expression Tool**: Facial emotion analysis - New customized tool.
- **Voice Analysis Tool**: Voice transcription and analysis - New customized tool.
- **Together API (Llama-3.3-70B-Instruct-Turbo-Free)**: LLM - Content analysis and feedback generation.
### Application Framework
- **Streamlit**: Frontend for user interface.
- **FastAPI**: For backend API endpoints.
### Languages & Packages
- **Python**: Core language for backend logic and agent implementation.
- **OpenCV + DeepFace + Mediapipe**: For facial expression analysis
- **Moviepy + Faster-Whisper + Librosa**: For voice analysis
## UI Approach
Built with Streamlit, the UI includes:
- Home page with Video Upload section, buttons, and a space for displaying the Transcript.
- Feedback page to display evaluation scores, detailed feedback, strengths, weaknesses, suggestions for improvement, and a performance chart.
## Visuals
### High Level Architecture
<img src="visuals/ai_speech_trainer.drawio.png">
### Home Page
<img src="visuals/home.png">
### Feedback Page
<img src="visuals/feedback.png">
## Setup Instructions
### 1. Clone the repo
```sh
git clone https://github.com/aminajavaid30/ai_speech_trainer.git
cd ai_speech_trainer
```
### 2. Install dependencies
```sh
pip install -r requirements.txt
```
### 3. **Add your API keys** - Create a .env file with:
```sh
TOGETHER_API_KEY=...
```
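A missing key is easier to diagnose at startup than in the middle of an analysis. As a minimal sketch, assuming the `.env` file has already been loaded into the environment (the project does this with `load_dotenv()`), a hypothetical helper such as `require_env` could fail fast with a clear message:

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of an environment variable or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# A plain dict stands in for os.environ so the example runs anywhere.
demo_env = {"TOGETHER_API_KEY": "example-key"}
print(require_env("TOGETHER_API_KEY", demo_env))
```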
### 4. Initialize the backend
Navigate to the **backend** folder and run the following command:
```sh
uvicorn main:app --reload
```
### 5. Run the app
Navigate to the **frontend** folder and run the following command:
```sh
streamlit run Home.py
```
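Once the backend is running, the `/analyze` endpoint can also be called without the Streamlit frontend. The sketch below uses only the standard library and assumes uvicorn's default address, `http://localhost:8000`, which is also what `Home.py` uses. The helper names are hypothetical. Note that `video_url` carries a local file path, so the backend must be able to read that path on its own machine.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/analyze"  # same URL as in Home.py


def build_analyze_request(video_path: str, api_url: str = API_URL) -> urllib.request.Request:
    """Build the POST request expected by the /analyze endpoint."""
    body = json.dumps({"video_url": video_path}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(video_path: str, timeout: float = 600.0) -> dict:
    """Send the request and decode the JSON report; needs a running backend."""
    request = build_analyze_request(video_path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_analyze_request("/tmp/my_video.mp4")
print(request.get_method(), request.full_url)
```

The generous timeout is a guess; analysis time depends on clip length and on the LLM provider.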
## Team Information
- **Team Lead**: https://github.com/aminajavaid30 - Agentic System Designer and Developer
- **Team Members**: https://github.com/aminajavaid30 - Individual Project
- **Background/Experience**: Data Scientist with a background in Software Engineering and Web Development. Experienced in building AI products and agentic workflows.
## Demo Video Link
https://youtu.be/Sb0JPUpJTGQ
## Folder Structure
```sh
/backend
    /agents
        /tools
            - facial_expression_tool.py
            - voice_analysis_tool.py
        - content_analysis_agent.py
        - coordinator_agent.py
        - facial_expression_agent.py
        - feedback_agent.py
        - voice_analysis_agent.py
    main.py (FastAPI App)
/frontend
    /pages
        - 1 - Feedback.py
    Home.py
    page_config.py
    sidebar.py
    style.css
.env
LICENSE
README.md
requirements.txt
```
## License
MIT License
## Additional Notes
- This project is designed to use **Agno** as an AI agent development platform. It shows how a team of collaborative Agno agents can address a real-world challenge: analyzing users' speech presentations and giving them comprehensive evaluation and feedback to improve their public speaking skills. Each agent has a clear-cut goal, and together they coordinate as a team to solve a complex multimodal problem.
- This project has considerable potential for further enhancement and could be a starting point for a more comprehensive and useful agentic system. Proposed additional functionality includes:
  1. Incorporating real-time video recording and conversational capabilities through different role scenarios.
  2. Playing back the user's speech through an AI avatar to help users learn and understand best speaking practices.
  3. Keeping a record of user sessions.
  4. Including a performance dashboard to track user performance over time.

  Each of these could be added by implementing specific goal-oriented agents in the system.
## Limitations
- **Together API** with **meta-llama/Llama-3.3-70B-Instruct-Turbo-Free** as the LLM has a small token limit, so the app works with short video clips (15-30 seconds).
- For longer video clips, use another LLM and add its API key to the *.env* file.
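One plausible reason clip length matters is that the facial expression tool samples every fifth frame and can add one emotion timeline entry per sampled frame, so the data passed to the LLM grows linearly with duration. The sketch below estimates that growth. The sampling interval of 5 comes from `facial_expression_tool.py`; the 30 fps default is an assumption, and `estimated_timeline_entries` is a hypothetical helper.

```python
def estimated_timeline_entries(
    duration_seconds: float,
    fps: float = 30.0,       # assumed frame rate; check your own video
    frame_interval: int = 5,  # sampling interval used by the tool
) -> int:
    """Upper bound on emotion timeline entries for a clip of this length."""
    total_frames = int(duration_seconds * fps)
    return total_frames // frame_interval


for seconds in (15, 30, 120):
    print(seconds, estimated_timeline_entries(seconds))
```

This is an upper bound: frames without a detected face add no entry.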
## Acknowledgements
Built for the **#GlobalAgentHackathonMay2025** using Agno, Streamlit, Together API, and FastAPI.

@ -0,0 +1,52 @@
from agno.agent import Agent
from agno.models.together import Together
from dotenv import load_dotenv
import os
load_dotenv()
# Define the content analysis agent
content_analysis_agent = Agent(
    name="content-analysis-agent",
    model=Together(
        id="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
        api_key=os.getenv("TOGETHER_API_KEY")
    ),
    description="""
        You are a content analysis agent that evaluates transcribed speech for structure, clarity, and filler words.
        You will return grammar corrections, identified filler words, and suggestions for content improvement.
    """,
    instructions=[
        "You will be provided with a transcript of spoken content.",
        "Your task is to analyze the transcript and identify:",
        "- Grammar and syntax corrections.",
        "- Filler words and their frequency.",
        "- Suggestions for improving clarity and structure.",
        "The response MUST be in the following JSON format:",
        "{",
        '"grammar_corrections": [list of corrections],',
        '"filler_words": { "word": count, ... },',
        '"suggestions": [list of suggestions]',
        "}",
        "Ensure the response is in proper JSON format with keys and values in double quotes.",
        "Do not include any additional text outside the JSON response."
    ],
    markdown=True,
    show_tool_calls=True,
    debug_mode=True
)
# # Example usage
# if __name__ == "__main__":
# # Sample transcript from the Voice Analysis Agent
# transcript = (
# "So, um, I was thinking that, like, we could actually start the project soon. "
# "You know, it's basically ready, and, uh, we just need to finalize some details."
# )
# prompt = f"Analyze the following transcript:\n\n{transcript}"
# content_analysis_agent.print_response(prompt, stream=True)
# # Run agent and return the response as a variable
# response: RunResponse = content_analysis_agent.run(prompt)
# # Print the response in markdown format
# pprint_run_response(response, markdown=True)

@ -0,0 +1,74 @@
from agno.team.team import Team
from agno.agent import Agent, RunResponse
from agno.models.together import Together
from agents.facial_expression_agent import facial_expression_agent
from agents.voice_analysis_agent import voice_analysis_agent
from agents.content_analysis_agent import content_analysis_agent
from agents.feedback_agent import feedback_agent
import os
from pydantic import BaseModel, Field
# Structured response
class CoordinatorResponse(BaseModel):
    facial_expression_response: str = Field(..., description="Response from facial expression agent")
    voice_analysis_response: str = Field(..., description="Response from voice analysis agent")
    content_analysis_response: str = Field(..., description="Response from content analysis agent")
    feedback_response: str = Field(..., description="Response from feedback agent")
    strengths: list[str] = Field(..., description="List of speaker's strengths")
    weaknesses: list[str] = Field(..., description="List of speaker's weaknesses")
    suggestions: list[str] = Field(..., description="List of suggestions for speaker to improve")

# Initialize a team of agents
coordinator_agent = Team(
    name="coordinator-agent",
    mode="coordinate",
    model=Together(id="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free", api_key=os.getenv("TOGETHER_API_KEY")),
    members=[facial_expression_agent, voice_analysis_agent, content_analysis_agent, feedback_agent],
    description="You are a public speaking coach who helps individuals improve their presentation skills through feedback and analysis.",
    instructions=[
        "You will be provided with a video file of a person speaking in a public setting.",
        "You will analyze the speaker's facial expressions, voice modulation, and content delivery to provide constructive feedback.",
        "Ask the facial expression agent to analyze the video file to detect emotions and engagement.",
        "Ask the voice analysis agent to analyze the audio file to detect speech rate, pitch variation, and volume consistency.",
        "Ask the content analysis agent to analyze the transcript given by voice analysis agent for structure, clarity, and filler words.",
        "Ask the feedback agent to evaluate the analysis results from the facial expression agent, voice analysis agent, and content analysis agent to provide feedback on the overall effectiveness of the presentation, highlighting strengths and areas for improvement.",
        "Your response MUST include the exact responses from the facial expression agent, voice analysis agent, content analysis agent, and feedback agent.",
        "Additionally, your response MUST provide lists of the speaker's strengths, weaknesses, and suggestions for improvement based on all the responses and feedback provided by the feedback agent.",
        "Important: You MUST directly address the speaker while providing strengths, weaknesses, and suggestions for improvement in a clear and constructive manner.",
        "The response MUST be in the following strict JSON format:",
        "{",
        '"facial_expression_response": [facial_expression_agent_response],',
        '"voice_analysis_response": [voice_analysis_agent_response],',
        '"content_analysis_response": [content_analysis_agent_response],',
        '"feedback_response": [feedback_agent_response],',
        '"strengths": [speaker_strengths_list],',
        '"weaknesses": [speaker_weaknesses_list],',
        '"suggestions": [suggestions_for_improvement_list]',
        "}",
        "The response MUST be in strict JSON format with keys and values in double quotes.",
        "The values in the JSON response MUST NOT be null or empty.",
        "The final response MUST not include any other text or anything else other than the JSON response.",
        "The final response MUST not include any backslashes in the JSON response.",
        "The final response MUST be a valid JSON object and MUST not have any unterminated strings in the JSON response."
    ],
    add_datetime_to_instructions=True,
    add_member_tools_to_system_message=False,  # This can be tried to make the agent more consistently get the transfer tool call correct
    enable_agentic_context=True,  # Allow the agent to maintain a shared context and send that to members.
    share_member_interactions=True,  # Share all member responses with subsequent member requests.
    show_members_responses=True,
    response_model=CoordinatorResponse,
    use_json_mode=True,
    markdown=True,
    show_tool_calls=True,
    debug_mode=True
)
# # Example usage
# video = "../../videos/my_video.mp4"
# prompt = f"Analyze facial expressions, voice modulation, and content delivery in the following video: {video} and provide constructive feedback."
# coordinator_agent.print_response(prompt, stream=True)
# # Run agent and return the response as a variable
# response: RunResponse = coordinator_agent.run(prompt)
# # Print the response in markdown format
# pprint_run_response(response, markdown=True)

@ -0,0 +1,50 @@
from agno.agent import Agent, RunResponse
from agno.models.together import Together
from agents.tools.facial_expression_tool import analyze_facial_expressions as facial_expression_tool
from agno.utils.pprint import pprint_run_response
from dotenv import load_dotenv
import os
load_dotenv()
# Define the facial expression agent
facial_expression_agent = Agent(
    name="facial-expression-agent",
    model=Together(id="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free", api_key=os.getenv("TOGETHER_API_KEY")),
    tools=[facial_expression_tool],
    description='''
        You are a facial expression agent that will analyze facial expressions in videos to detect emotions and engagement.
        You will return the emotion timeline and engagement metrics.
    ''',
    instructions=[
        "You will be provided with a video file of a person speaking in a public setting.",
        "Your task is to analyze the facial expressions in the video to detect emotions and engagement.",
        "You will analyze the video frame by frame to detect and classify facial expressions such as happiness, sadness, anger, surprise, and neutrality.",
        "You will also analyze the engagement metrics such as eye contact frequency and smile frequency.",
        "The response MUST be in the following JSON format:",
        "{",
        '"emotion_timeline": [emotion_timeline],',
        '"engagement_metrics": {',
        '"eye_contact_frequency": [eye_contact_frequency],',
        '"smile_frequency": [smile_frequency]',
        "}",
        "}",
        "The response MUST be in proper JSON format with keys and values in double quotes.",
        "The final response MUST not include any other text or anything else other than the JSON response.",
        "The final response MUST not include any backslashes in the JSON response.",
        "The final response MUST be a valid JSON object and MUST not have any unterminated strings in the JSON response."
    ],
    markdown=True,
    show_tool_calls=True,
    debug_mode=True
)
# video = "../../videos/my_video.mp4"
# prompt = f"Analyze facial expressions in the video file to detect emotions and engagement in the following video: {video}"
# facial_expression_agent.print_response(prompt, stream=True)
# # Run agent and return the response as a variable
# response: RunResponse = facial_expression_agent.run(prompt)
# # Print the response in markdown format
# pprint_run_response(response, markdown=True)

@ -0,0 +1,67 @@
from agno.agent import Agent
from agno.models.together import Together
from dotenv import load_dotenv
import os
load_dotenv()
# Define the feedback agent
feedback_agent = Agent(
    name="feedback-agent",
    model=Together(
        id="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
        api_key=os.getenv("TOGETHER_API_KEY")
    ),
    description="""
        You are a feedback agent that evaluates a presentation based on the analysis results from all agents.
        You will provide feedback on the overall effectiveness of the presentation.
    """,
    instructions=[
        "You are a public speaking coach that evaluates a speaker's performance based on a detailed scoring rubric.",
        "You will be provided with analysis results from the facial expression agent, voice analysis agent, and content analysis agent.",
        "You will be given criteria to evaluate the performance of the speaker based on the analysis results.",
        "Your task is to evaluate the speaker on the following 5 criteria, scoring each from 1 (Poor) to 5 (Excellent):",
        "1. **Content & Organization** - Evaluate how logically structured and well-organized the speech content is.",
        "2. **Delivery & Vocal Quality** - Assess clarity, articulation, vocal variety, and use of filler words.",
        "3. **Body Language & Eye Contact** - Consider posture, gestures, and level of eye contact.",
        "4. **Audience Engagement** - Evaluate enthusiasm and ability to hold the audience's attention.",
        "5. **Language & Clarity** - Check for grammar, clarity, appropriateness, and impact of language.",
        "You will then provide a summary of the speaker's strengths and areas for improvement based on the rubric.",
        "Important: You MUST directly address the speaker while providing constructive feedback.",
        "Provide your response in the following strict JSON format:",
        "{",
        '"scores": {',
        ' "content_organization": [1-5],',
        ' "delivery_vocal_quality": [1-5],',
        ' "body_language_eye_contact": [1-5],',
        ' "audience_engagement": [1-5],',
        ' "language_clarity": [1-5]',
        '},',
        '"total_score": [sum of all scores from 5 to 25],',
        '"interpretation": "[One of: \'Needs significant improvement\', \'Developing skills\', \'Competent speaker\', \'Proficient speaker\', \'Outstanding speaker\']",',
        '"feedback_summary": "[Concise feedback summarizing the speaker\'s strengths and areas for improvement based on rubric.]"',
        "}",
        "DO NOT include anything outside the JSON response.",
        "Ensure all keys and values are properly quoted and formatted.",
        "The response MUST be in proper JSON format with keys and values in double quotes.",
        "The final response MUST not include any other text or anything else other than the JSON response."
    ],
    markdown=True,
    show_tool_calls=True,
    debug_mode=True
)
# # Example usage
# if __name__ == "__main__":
# # Sample transcript from the Voice Analysis Agent
# transcript = (
# "So, um, I was thinking that, like, we could actually start the project soon. "
# "You know, it's basically ready, and, uh, we just need to finalize some details."
# )
# prompt = f"Analyze the following transcript:\n\n{transcript}"
# content_analysis_agent.print_response(prompt, stream=True)
# # Run agent and return the response as a variable
# response: RunResponse = content_analysis_agent.run(prompt)
# # Print the response in markdown format
# pprint_run_response(response, markdown=True)

@ -0,0 +1,117 @@
import cv2
import numpy as np
import mediapipe as mp
from deepface import DeepFace
from agno.tools import tool
import json
def log_before_call(fc):
    """Pre-hook function that runs before the tool execution"""
    print(f"About to call function with arguments: {fc.arguments}")


def log_after_call(fc):
    """Post-hook function that runs after the tool execution"""
    print(f"Function call completed with result: {fc.result}")


@tool(
    name="analyze_facial_expressions",  # Custom name for the tool (otherwise the function name is used)
    description="Analyzes facial expressions to detect emotions and engagement.",  # Custom description (otherwise the function docstring is used)
    show_result=True,  # Show result after function call
    stop_after_tool_call=True,  # Return the result immediately after the tool call and stop the agent
    pre_hook=log_before_call,  # Hook to run before execution
    post_hook=log_after_call,  # Hook to run after execution
    cache_results=False,  # Caching is disabled; set to True to enable it
    cache_dir="/tmp/agno_cache",  # Custom cache directory
    cache_ttl=3600  # Cache TTL in seconds (1 hour)
)
def analyze_facial_expressions(video_path: str) -> str:
    """
    Analyzes facial expressions in a video to detect emotions and engagement.

    Args:
        video_path: The path to the video file.

    Returns:
        A JSON string containing the emotion timeline and engagement metrics.
    """
    mp_face_mesh = mp.solutions.face_mesh
    face_mesh = mp_face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)

    cap = cv2.VideoCapture(video_path)
    emotion_timeline = []
    eye_contact_count = 0
    smile_count = 0
    frame_count = 0
    fps = cap.get(cv2.CAP_PROP_FPS)

    # Process every nth frame for performance optimization
    frame_interval = 5

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        frame_count += 1
        if frame_count % frame_interval != 0:
            continue

        # Resize frame for faster processing
        frame = cv2.resize(frame, (640, 480))
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = face_mesh.process(rgb_frame)

        if results.multi_face_landmarks:
            for face_landmarks in results.multi_face_landmarks:
                # Extract landmarks
                landmarks = face_landmarks.landmark
                # Convert landmarks to pixel coordinates
                h, w, _ = frame.shape
                landmark_coords = [(int(lm.x * w), int(lm.y * h)) for lm in landmarks]

                # Emotion Detection using DeepFace & Smile Detection
                try:
                    analysis = DeepFace.analyze(frame, actions=['emotion'], enforce_detection=False)
                    emotion = analysis[0]['dominant_emotion']
                    if emotion == "happy":
                        smile_count += 1
                    # Timestamp of this frame in seconds, rounded to two decimals
                    timestamp = frame_count / fps
                    timestamp = round(timestamp, 2)
                    emotion_timeline.append({"timestamp": timestamp, "emotion": emotion})
                except Exception as e:
                    print(f"Error analyzing frame: {e}")
                    continue

                # Engagement Metric: Eye contact estimation
                # Using eye landmarks: 159 (left eye upper lid), 145 (left eye lower lid), 386 (right eye upper lid), 374 (right eye lower lid)
                left_eye_upper_lid = landmark_coords[159]
                left_eye_lower_lid = landmark_coords[145]
                right_eye_upper_lid = landmark_coords[386]
                right_eye_lower_lid = landmark_coords[374]
                left_eye_opening = np.linalg.norm(np.array(left_eye_upper_lid) - np.array(left_eye_lower_lid))
                right_eye_opening = np.linalg.norm(np.array(right_eye_upper_lid) - np.array(right_eye_lower_lid))
                eye_opening_avg = (left_eye_opening + right_eye_opening) / 2

                # Simple heuristic: if eyes are wide open, assume eye contact
                if eye_opening_avg > 5:  # Threshold (in pixels) chosen through experimentation
                    eye_contact_count += 1

    cap.release()
    face_mesh.close()

    total_processed_frames = frame_count // frame_interval
    if total_processed_frames == 0:
        total_processed_frames = 1  # Avoid division by zero

    return json.dumps({
        "emotion_timeline": emotion_timeline,
        "engagement_metrics": {
            "eye_contact_frequency": eye_contact_count / total_processed_frames,
            "smile_frequency": smile_count / total_processed_frames
        }
    })

@ -0,0 +1,135 @@
import os
import json
import tempfile
import numpy as np
import librosa
from moviepy import VideoFileClip
from faster_whisper import WhisperModel
from agno.tools import tool
from dotenv import load_dotenv
load_dotenv()
def extract_audio_from_video(video_path: str, output_audio_path: str) -> str:
    """
    Extracts audio from a video file and saves it as an audio file.

    Args:
        video_path: Path to the input video file.
        output_audio_path: Path to save the extracted audio file.

    Returns:
        Path to the extracted audio file.
    """
    video_clip = VideoFileClip(video_path)
    audio_clip = video_clip.audio
    audio_clip.write_audiofile(output_audio_path)
    audio_clip.close()
    video_clip.close()
    return output_audio_path


def load_whisper_model():
    try:
        model = WhisperModel("small", device="cpu", compute_type="int8")
        return model
    except Exception as e:
        print(f"Error loading Whisper model: {e}")
        return None


def transcribe_audio(audio_file):
    """
    Transcribe the audio file using faster-whisper.

    Returns:
        str: Transcribed text or error/fallback message.
    """
    if not audio_file or not os.path.exists(audio_file):
        return "No audio file exists at the specified path."

    model = load_whisper_model()
    if not model:
        return "Model failed to load. Please check system resources or model path."

    try:
        print("Model loaded successfully. Transcribing audio...")
        segments, _ = model.transcribe(audio_file)
        full_text = " ".join(segment.text for segment in segments)
        return full_text.strip() if full_text else "I couldn't understand the audio. Please try again."
    except Exception as e:
        print(f"Error transcribing audio with faster-whisper: {e}")
        return "I'm having trouble transcribing your audio. Please try again or speak more clearly."


def log_before_call(fc):
    """Pre-hook function that runs before the tool execution"""
    print(f"About to call function with arguments: {fc.arguments}")


def log_after_call(fc):
    """Post-hook function that runs after the tool execution"""
    print(f"Function call completed with result: {fc.result}")


@tool(
    name="analyze_voice_attributes",  # Custom name for the tool (otherwise the function name is used)
    description="Analyzes vocal attributes like clarity, intonation, and pace.",  # Custom description (otherwise the function docstring is used)
    show_result=True,  # Show result after function call
    stop_after_tool_call=True,  # Return the result immediately after the tool call and stop the agent
    pre_hook=log_before_call,  # Hook to run before execution
    post_hook=log_after_call,  # Hook to run after execution
    cache_results=False,  # Caching is disabled; set to True to enable it
    cache_dir="/tmp/agno_cache",  # Custom cache directory
    cache_ttl=3600  # Cache TTL in seconds (1 hour)
)
def analyze_voice_attributes(file_path: str) -> str:
    """
    Analyzes vocal attributes in an audio file, or in the audio track of an .mp4 video.

    Args:
        file_path: The path to the audio or video file.

    Returns:
        A JSON string containing the transcribed text, speech rate, pitch variation, and volume consistency.
    """
    # Determine file extension
    _, ext = os.path.splitext(file_path)
    ext = ext.lower()

    # If the file is a video, extract audio
    if ext in ['.mp4']:
        with tempfile.NamedTemporaryFile(suffix='.mp3', delete=False) as temp_audio_file:
            audio_path = extract_audio_from_video(file_path, temp_audio_file.name)
    else:
        audio_path = file_path

    # Transcribe audio
    transcription = transcribe_audio(audio_path)

    # Proceed with analysis using the audio_path
    # Load audio
    y, sr = librosa.load(audio_path, sr=16000)
    words = transcription.split()

    # Calculate speech rate
    duration = librosa.get_duration(y=y, sr=sr)
    speech_rate = len(words) / (duration / 60.0)  # words per minute

    # Pitch variation
    pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
    pitch_values = pitches[magnitudes > np.median(magnitudes)]
    pitch_variation = np.std(pitch_values) if pitch_values.size > 0 else 0

    # Volume consistency
    rms = librosa.feature.rms(y=y)[0]
    volume_consistency = np.std(rms)

    # Clean up temporary audio file if created
    if ext in ['.mp4']:
        os.remove(audio_path)

    return json.dumps({
        "transcription": transcription,
        "speech_rate_wpm": str(round(speech_rate, 2)),
        "pitch_variation": str(round(pitch_variation, 2)),
        "volume_consistency": str(round(volume_consistency, 4))
    })

@ -0,0 +1,45 @@
from agno.agent import Agent, RunResponse
from agno.models.together import Together
from agents.tools.voice_analysis_tool import analyze_voice_attributes as voice_analysis_tool
from agno.utils.pprint import pprint_run_response
from dotenv import load_dotenv
import os
load_dotenv()
# Define the voice analysis agent
voice_analysis_agent = Agent(
    name="voice-analysis-agent",
    model=Together(id="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free", api_key=os.getenv("TOGETHER_API_KEY")),
    tools=[voice_analysis_tool],
    description="""
        You are a voice analysis agent that evaluates vocal attributes like clarity, intonation, and pace.
        You will return the transcribed text, speech rate, pitch variation, and volume consistency.
    """,
    instructions=[
        "You will be provided with an audio file of a person speaking.",
        "Your task is to analyze the vocal attributes in the audio to detect speech rate, pitch variation, and volume consistency.",
        "The response MUST be in the following JSON format:",
        "{",
        '"transcription": [transcription],',
        '"speech_rate_wpm": [speech_rate_wpm],',
        '"pitch_variation": [pitch_variation],',
        '"volume_consistency": [volume_consistency]',
        "}",
        "The response MUST be in proper JSON format with keys and values in double quotes.",
        "The final response MUST not include any other text or anything else other than the JSON response."
    ],
    markdown=True,
    show_tool_calls=True,
    debug_mode=True
)
# audio = "../../videos/my_video.mp4"
# prompt = f"Analyze vocal attributes in the audio file to detect speech rate, pitch variation, and volume consistency in the following audio: {audio}"
# voice_analysis_agent.print_response(prompt, stream=True)
# # Run agent and return the response as a variable
# response: RunResponse = voice_analysis_agent.run(prompt)
# # Print the response in markdown format
# pprint_run_response(response, markdown=True)

@ -0,0 +1,43 @@
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.encoders import jsonable_encoder
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from agents.coordinator_agent import coordinator_agent
from agno.agent import RunResponse
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
# Initialize FastAPI app
app = FastAPI()
# Configure CORS to allow requests from your frontend
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # To be replaced with the frontend's origin in production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Define the request body model
class AnalysisRequest(BaseModel):
    video_url: str

# Define the entry point
@app.get("/")
async def root():
    return {"message": "Welcome to the video analysis API!"}

# Define the analysis endpoint
@app.post("/analyze")
async def analyze(request: AnalysisRequest):
    video_url = request.video_url
    prompt = f"Analyze the following video: {video_url}"
    response: RunResponse = coordinator_agent.run(prompt)

    # Assuming response.content is a Pydantic model or a dictionary
    json_compatible_response = jsonable_encoder(response.content)
    return JSONResponse(content=json_compatible_response)

@ -0,0 +1,6 @@
[theme]
primaryColor = "#4B8BBE"
backgroundColor = "#F5F5F5"
secondaryBackgroundColor = "#E0E0E0"
textColor = "#262730"
font = "sans serif"

@ -0,0 +1,143 @@
import streamlit as st
import requests
import tempfile
import os
import json
import numpy as np
from page_config import render_page_config
render_page_config()
# Initialize session state variables
if "begin" not in st.session_state:
    st.session_state.begin = False
if "video_path" not in st.session_state:
    st.session_state.video_path = None
if "upload_file" not in st.session_state:
    st.session_state.upload_file = False
if "response" not in st.session_state:
    st.session_state.response = None
if "facial_expression_response" not in st.session_state:
    st.session_state.facial_expression_response = None
if "voice_analysis_response" not in st.session_state:
    st.session_state.voice_analysis_response = None
if "content_analysis_response" not in st.session_state:
    st.session_state.content_analysis_response = None
if "feedback_response" not in st.session_state:
    st.session_state.feedback_response = None

def clear_session_response():
    st.session_state.response = None
    st.session_state.facial_expression_response = None
    st.session_state.voice_analysis_response = None
    st.session_state.content_analysis_response = None
    st.session_state.feedback_response = None
# Create two columns with a 70:30 width ratio
col1, col2 = st.columns([0.7, 0.3])

# Left column: Video area and buttons
with col1:
    spacer1, btn_col = st.columns([0.8, 0.2])

    if st.session_state.begin:
        with spacer1:
            st.markdown("<h4>📽️ Video</h4>", unsafe_allow_html=True)
        with btn_col:
            if st.button("📤 Upload Video"):
                if st.session_state.video_path:
                    os.remove(st.session_state.video_path)
                    st.session_state.video_path = None
                    clear_session_response()
                st.session_state.upload_file = True
                st.rerun()  # Force rerun to fully reset uploader

    if st.session_state.get("upload_file"):
        uploaded_file = st.file_uploader("📤 Upload Video", type=["mp4"])
        if uploaded_file is not None:
            temp_dir = tempfile.gettempdir()
            # Use a random name to avoid reuse
            unique_name = f"{int(np.random.rand()*1e8)}_{uploaded_file.name}"
            file_path = os.path.join(temp_dir, unique_name)
            if not os.path.exists(file_path):
                with open(file_path, "wb") as f:
                    f.write(uploaded_file.read())
            st.session_state.video_path = file_path
            st.session_state.upload_file = False
            st.rerun()

    # elif not st.session_state.get("video_path"):
    if not st.session_state.begin:
        st.success("""
        **Welcome to AI Speech Trainer!**
        Your ultimate companion to help improve your public speaking skills.
        """)
        st.info("""
        🚀 To get started:
        \n\t1. Record a video of yourself practicing a speech or presentation - use any video recording app.
        \n\t2. Upload the recorded video.
        \n\t3. Analyze the video to get personalized feedback.
        """)
        if st.button("👉 Let's begin!"):
            st.session_state.begin = True
            st.rerun()

    if st.session_state.video_path:
        st.video(st.session_state.video_path, autoplay=False)

        if not st.session_state.response:
            if st.button("▶️ Analyze Video"):
                with st.spinner("Analyzing video..."):
                    st.warning("⚠️ This process may take some time, so please be patient and wait for the analysis to complete.")
                    API_URL = "http://localhost:8000/analyze"
                    response = requests.post(API_URL, json={"video_url": st.session_state.video_path})
                    if response.status_code == 200:
                        st.success("Video analysis completed successfully.")
                        response = response.json()
                        st.session_state.response = response
                        st.session_state.facial_expression_response = response.get("facial_expression_response")
                        st.session_state.voice_analysis_response = response.get("voice_analysis_response")
                        st.session_state.content_analysis_response = response.get("content_analysis_response")
                        st.session_state.feedback_response = response.get("feedback_response")
                        st.rerun()
                    else:
                        st.error("🚨 Error during video analysis. Please try again.")

# Right column: Transcript and feedback
with col2:
    st.markdown("<h4>📝 Transcript</h4>", unsafe_allow_html=True)
    transcript_text = "Your transcript will be displayed here."
    if st.session_state.response:
        voice_analysis_response = st.session_state.voice_analysis_response
        transcript = json.loads(voice_analysis_response).get("transcription")
    else:
        transcript = None

    st.markdown(
        f"""
        <div style="background-color:#f0f2f6; padding: 1.5rem; border-radius: 10px;
        border: 1px solid #ccc; font-family: 'Segoe UI', sans-serif;
        line-height: 1.6; color: #333; height: 400px; max-height: 400px; overflow-y: auto;">
        {transcript if transcript else transcript_text}
        </div>
        <br>
        """,
        unsafe_allow_html=True
    )

    if st.button("📝 Get Feedback"):
        st.switch_page("pages/1 - Feedback.py")
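
The transcript lookup above calls `json.loads` on `voice_analysis_response` directly, which raises if the backend omits that field or returns something other than a JSON string. A more defensive lookup could be sketched as follows. This is an illustration under assumptions, not the project's code: `extract_transcript` is a hypothetical helper, and it assumes the field is either a JSON string or an already-decoded dictionary with a `transcription` key.

```python
import json
from typing import Any, Optional


def extract_transcript(voice_analysis_response: Any) -> Optional[str]:
    """Return the "transcription" field, or None if it cannot be read."""
    data = voice_analysis_response
    if data is None:
        return None
    if isinstance(data, str):
        try:
            data = json.loads(data)
        except json.JSONDecodeError:
            # Not valid JSON: treat as "no transcript available"
            return None
    if isinstance(data, dict):
        return data.get("transcription")
    return None
```

With a helper like this, the page would fall back to the placeholder text whenever `None` is returned instead of failing during rendering.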


@ -0,0 +1,31 @@
import streamlit as st
from sidebar import render_sidebar

def render_page_config():
    # Set page configuration
    st.set_page_config(
        page_icon="🎙️",
        page_title="AI Speech Trainer",
        initial_sidebar_state="auto",
        layout="wide")

    # Load external CSS
    with open("style.css") as f:
        st.markdown(f"<style>{f.read()}</style>", unsafe_allow_html=True)

    # Sidebar
    render_sidebar()

    # Main title with an icon
    st.markdown(
        """
        <div class="custom-header">
            <span>🗣 AI Speech Trainer</span><br>
            <span>Your personal coach for public speaking</span>
        </div>
        """,
        unsafe_allow_html=True
    )

    # Horizontal line
    st.markdown("<hr class='custom-hr'>", unsafe_allow_html=True)
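
`render_page_config` opens `style.css` with a relative path, so it only works when Streamlit is launched from the directory that contains the stylesheet. One way to make the lookup independent of the working directory is to resolve the path relative to the module file. The sketch below assumes a hypothetical helper named `build_style_tag`; the `base_dir` parameter exists only to make the function easy to exercise in isolation.

```python
from pathlib import Path
from typing import Optional


def build_style_tag(css_name: str = "style.css",
                    base_dir: Optional[Path] = None) -> str:
    """Read a CSS file and wrap its contents in a <style> tag."""
    if base_dir is None:
        # Directory containing this module, regardless of the working directory
        base_dir = Path(__file__).resolve().parent
    css_text = (base_dir / css_name).read_text(encoding="utf-8")
    return f"<style>{css_text}</style>"
```

The returned string could then be passed to `st.markdown(..., unsafe_allow_html=True)` in the same way the existing code does.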


@ -0,0 +1,127 @@
import streamlit as st
import plotly.graph_objects as go
import json
from page_congif import render_page_config
render_page_config()

# Get feedback response from session state
# (.get avoids an AttributeError if this page is opened before the Home page has run)
if st.session_state.get("feedback_response"):
    feedback_response = json.loads(st.session_state.feedback_response)
    feedback_scores = feedback_response.get("scores")

    # Evaluation scores based on the public speaking rubric
    scores = {
        "Content & Organization": feedback_scores.get("content_organization"),
        "Delivery & Vocal Quality": feedback_scores.get("delivery_vocal_quality"),
        "Body Language & Eye Contact": feedback_scores.get("body_language_eye_contact"),
        "Audience Engagement": feedback_scores.get("audience_engagement"),
        "Language & Clarity": feedback_scores.get("language_clarity")
    }
    total_score = feedback_response.get("total_score")
    interpretation = feedback_response.get("interpretation")
    feedback_summary = feedback_response.get("feedback_summary")
else:
    st.warning("No feedback available! Please upload a video and analyze it first.")
    scores = {
        "Content & Organization": 0,
        "Delivery & Vocal Quality": 0,
        "Body Language & Eye Contact": 0,
        "Audience Engagement": 0,
        "Language & Clarity": 0
    }
    total_score = 0
    interpretation = ""
    feedback_summary = ""

# Calculate average score
average_score = sum(scores.values()) / len(scores)

# Determine strengths, weaknesses, and suggestions for improvement
if st.session_state.get("response"):
    strengths = st.session_state.response.get("strengths")
    weaknesses = st.session_state.response.get("weaknesses")
    suggestions = st.session_state.response.get("suggestions")
else:
    strengths = []
    weaknesses = []
    suggestions = []

# Create three columns with a 30:40:30 width ratio
col1, col2, col3 = st.columns([0.3, 0.4, 0.3])

# Left Column: Evaluation Summary
with col1:
    st.subheader("🧾 Evaluation Summary")
    st.markdown("<br>", unsafe_allow_html=True)

    for criterion, score in scores.items():
        label_col, progress_col, score_col = st.columns([2, 3, 1])  # Adjust the ratio as needed
        with label_col:
            st.markdown(f"**{criterion}**")
        with progress_col:
            st.progress(score / 5)
        with score_col:
            st.markdown(f"<span><b>{score}/5</b></span>", unsafe_allow_html=True)

    st.markdown("<br>", unsafe_allow_html=True)

    # Display total score
    st.markdown(f"#### 🏆 Total Score: {total_score} / 25")

    # Display average score
    st.markdown(f"#### 🎯 Average Score: {average_score:.2f} / 5")

    st.markdown("""---""")

    st.markdown("##### 🗣️ Feedback Summary:")

    # Display interpretation
    st.markdown(f"📝 **Overall Assessment**: {interpretation}")

    # Display feedback summary
    st.info(f"{feedback_summary}")

# Middle Column: Strengths, Weaknesses, and Suggestions
with col2:
    # Display strengths
    st.markdown("##### 🦾 Strengths:")
    strengths_text = '\n'.join(f"- {item}" for item in strengths)
    st.success(strengths_text)

    # Display weaknesses
    st.markdown("##### ⚠️ Weaknesses:")
    weaknesses_text = '\n'.join(f"- {item}" for item in weaknesses)
    st.error(weaknesses_text)

    # Display suggestions
    st.markdown("##### 💡 Suggestions for Improvement:")
    suggestions_text = '\n'.join(f"- {item}" for item in suggestions)
    st.warning(suggestions_text)

# Right Column: Performance Chart
with col3:
    st.subheader("📊 Performance Chart")

    # Radar Chart
    radar_fig = go.Figure()
    radar_fig.add_trace(go.Scatterpolar(
        r=list(scores.values()),
        theta=list(scores.keys()),
        fill='toself',
        name='Scores'
    ))
    radar_fig.update_layout(
        polar=dict(
            radialaxis=dict(visible=True, range=[0, 5])
        ),
        showlegend=False,
        margin=dict(t=50, b=50, l=50, r=50),  # Reduced margins
        width=350,
        height=350
    )
    st.plotly_chart(radar_fig, use_container_width=True)

st.markdown("""---""")
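
The average on this page is computed as `sum(scores.values()) / len(scores)`, which fails if the feedback agent leaves a rubric key out and `.get()` returns `None`. A tolerant version might look like the sketch below. The helper name `average_score` (as a function) is hypothetical, and skipping missing criteria rather than counting them as zero is an assumption about the desired behavior, not something the source specifies.

```python
from typing import Mapping, Optional


def average_score(scores: Mapping[str, Optional[float]]) -> float:
    """Mean of the scores that are present; 0.0 when none are present."""
    present = [value for value in scores.values() if value is not None]
    if not present:
        return 0.0
    return sum(present) / len(present)
```

Counting a missing criterion as zero instead would lower the average for incomplete responses, so the choice depends on whether a missing score should penalize the speaker.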


@ -0,0 +1,24 @@
# Sidebar with About section
import streamlit as st

def render_sidebar():
    st.sidebar.header("About")
    # Raw string so each trailing backslash reaches Markdown as a hard line break
    # instead of being consumed by Python as a line continuation
    st.sidebar.info(
        r"""
        **AI Speech Trainer** helps users improve their public speaking skills through:\
        📽 Video Analysis\
        🗣 Voice Analysis\
        📊 Content Analysis & Feedback

        - Upload your video to receive detailed feedback.
        - Improve your public speaking skills with AI-powered analysis.
        - Get personalized suggestions to enhance your performance.
        """
    )


@ -0,0 +1,35 @@
.main > div:first-child {
    padding-top: 0.9rem !important;
}

.custom-header {
    text-align: center;
}

.custom-header > span:first-child {
    font-size: 2.5rem;
    font-weight: 600;
    color: #4B8BBE;
}

.custom-header > span:nth-child(2) {
    font-size: 1.2rem;
    color: gray;
}

video {
    width: 640px !important;
    height: 360px !important;
    max-width: none !important;
    border-radius: 12px !important;
    overflow: hidden;
}

.custom-hr {
    margin-top: 0.5rem;
    margin-bottom: 1rem;
}

.stFileUploaderFile {
    display: none;
}


@ -0,0 +1,16 @@
streamlit
pandas
plotly
opencv-python
tf-keras
deepface
mediapipe
agno
openai
requests
librosa
python-dotenv
moviepy
faster-whisper
fastapi
uvicorn

Binary file not shown. (Size: 118 KiB)

Binary file not shown. (Size: 168 KiB)

Binary file not shown. (Size: 129 KiB)