diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/README.md b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/README.md new file mode 100644 index 0000000..f155fde --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/README.md @@ -0,0 +1,163 @@ +# AI Speech Trainer: Your Multimodal Public Speaking Coach + +## Overview of the Idea +AI Speech Trainer is an AI-powered multi-agent, multimodal public speaking coach that listens to how you speak, watches how you express, and evaluates what you say - helping you become a confident public speaker. + +Whether you're preparing for a TED talk, interview, or school presentation, AI Speech Trainer provides you with personalized feedback, helping you improve your public speaking skills - highlighting your strengths and weaknesses and giving you valuable suggestions to speak better, clearer, and more confidently. + +This project has been built as part of the **Global Agent Hackathon (May 2025)**. It leverages the power of multi-agent collaboration, real-time feedback, and multimodal analysis to help anyone become a confident and effective speaker. + +## Features +### Core Features +- **Facial Expression Analysis**: Emotion recognition and eye contact estimation +- **Audio Analysis**: Pace, pitch, clarity, and filler words +- **Content Evaluation**: GPT-based feedback on structure, tone, and clarity +- **Personalized Feedback**: Average score, overall assessment, strengths, weaknesses, and suggestions for improvement + +### Agents +- **Facial Agent**: Analyzes expression, engagement, and eye contact +- **Vocal Agent**: Detects speech issues (speed, filler words, pitch) +- **Content Agent**: Uses LLMs to assess and improve content clarity +- **Feedback Agent**: Uses the responses from other agents to evaluate the speaker based on a scoring rubric +- **Coordinator Agent**: A team of agents - Orchestrates all analysis and feedback generation + +## How It Works +### **User Flow**: +1. 
User opens the Streamlit app and uploads a video of themselves practicing a speech or presentation. + +2. Multiple agents get into action: + +- Facial Agent analyzes expressions and eye contact. +- Vocal Agent transcribes the speech and detects voice attributes. +- Content Agent evaluates grammar, structure, and coherence. +- Feedback agent provides feedback on the overall effectiveness of the speech. +- A Coordinator Agent aggregates all agent insights. + +AI Speech Trainer presents a detailed feedback report including scores based on a rubric and summary of the feedback. + +### **Core Functionality**: +- Facial emotion recognition using OpenCV, DeepFace, and Mediapipe landmarks. +- Voice transcription and analysis. +- Content analysis using GPT-based feedback. +- Aggregated evaluation score and feedback summary. + +### **Multimodal Elements**: +- **Audio**: Speech input & voice quality analysis. +- **Video**: Facial expression tracking and feedback. +- **Text**: GPT-based feedback on structure, clarity, and tone. + +## Tech Stack +### AI/ML Tools +- **Agno**: For building multi-agent collaboration and coordination. +- **Facial Expression Tool**: Facial emotion analysis - New customized tool. +- **Voice Analysis Tool**: Voice transcription and analysis - New customized tool. +- **Together API (Llama-3.3-70B-Instruct-Turbo-Free)**: LLM - Content analysis and feedback generation. + +### Application Framework +- **Streamlit**: Frontend for user interface. +- **FastAPI**: For backend API endpoints. + +### Languages & Packages +- **Python**: Core language for backend logic and agent implementation. +- **OpenCV + DeepFace + Mediapipe**: For facial expression analysis +- **Moviepy + Faster-Whisper + Librosa**: For voice analysis + +## UI Approach +Built with Streamlit, the UI includes: + +- Home page with Video Upload section, buttons, and a space for displaying the Transcript. 
+- Feedback page to display evaluation scores, detailed feedback, strengths, weaknesses, suggestions for improvement, and a performance chart. + +## Visuals +### High Level Architecture + + +### Home Page + + +### Feedback Page + + +## Setup Instructions +### 1. Clone the repo +```sh +git clone https://github.com/aminajavaid30/ai_speech_trainer.git +cd ai_speech_trainer +``` + +### 2. Install dependencies +```sh +pip install -r requirements.txt +``` + +### 3. **Add your API keys** - Create a .env file with: +```sh +TOGETHER_API_KEY=... +``` + +### 4. Initialize the backend +Navigate to the **backend** folder and run the following command: +```sh +uvicorn main:app --reload +``` + +### 5. Run the app +Navigate to the **frontend** folder and run the following command: +```sh +streamlit run Home.py +``` + +## Team Information +- **Team Lead**: https://github.com/aminajavaid30 - Agentic System Designer and Developer +- **Team Members**: https://github.com/aminajavaid30 - Individual Project +- **Background/Experience**: Data Scientist with a background in Software Engineering and Web Development. Experienced in building AI products and agentic workflows. + +## Demo Video Link +https://youtu.be/Sb0JPUpJTGQ + +## Folder Structure +```sh +/backend + /agents + /tools + - facial_expression_tool.py + - voice_analysis_tool.py + - content_analysis_agent.py + - coordinator_agent.py + - facial_expression_agent.py + - feedback_agent.py + - voice_analysis_agent.py + main.py (FastAPI App) +/frontend + /pages + - 1 - Feedback.py + Home.py + page_config.py + sidebar.py + style.css +.env +LICENSE +README.md +requirements.txt +``` + +## License +MIT License + +## Additional Notes +- This project has been designed to utilize the capabilities of **Agno** as an AI agent development platform. 
It demonstrates how Agno can coordinate a team of collaborative agents to address a real-world challenge: analyzing users' speech presentations and providing comprehensive evaluation and feedback to improve their public speaking skills. Each individual agent has a clear-cut goal to follow, and together they coordinate as a team to solve a complex multimodal problem. + +- This project has considerable room for further enhancement. It could be a starting point for a more comprehensive and useful agentic system. Some of the proposed additional functionalities could be: + 1. Incorporating real-time video recording and conversational capabilities through different role scenarios. + 2. Playing back the user speech through an AI avatar to help users learn and understand best speaking practices. + 3. Keeping a record of user sessions. + 4. Including a performance dashboard to track user performance over time. + + Each of these additional functionalities could be added by implementing specific goal-oriented agents in the system. + +## Limitations +- **Together API** with **meta-llama/Llama-3.3-70B-Instruct-Turbo-Free** as LLM has a small token limit, so it works only with short video clips (15-30 seconds). +- Use other LLM options for longer video clips. Don't forget to add their API keys in the *.env* file. + +## Acknowledgements +Built for the **#GlobalAgentHackathonMay2025** using Agno, Streamlit, Together API, and FastAPI. 
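The backend started in the setup steps can also be exercised without the Streamlit frontend. The sketch below is not part of the project: it assumes uvicorn's default address (`http://127.0.0.1:8000`) and relies on the `/analyze` route and `video_url` field defined in `backend/main.py`. The helper names `build_analyze_request` and `analyze` are hypothetical, and only the standard library is used so nothing extra needs installing.

```python
import json
import urllib.request

# Assumption: uvicorn is running with its default host and port.
BACKEND_URL = "http://127.0.0.1:8000"


def build_analyze_request(video_path: str, base_url: str = BACKEND_URL) -> urllib.request.Request:
    """Build a POST request for the /analyze route (hypothetical helper)."""
    # The request body mirrors the AnalysisRequest model: a single video_url field.
    body = json.dumps({"video_url": video_path}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(video_path: str, timeout: float = 600.0) -> dict:
    """Send the request and decode the JSON body returned by the backend."""
    request = build_analyze_request(video_path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `analyze("videos/my_video.mp4")` (a hypothetical path, resolved on the backend's filesystem) should return a dictionary shaped like `CoordinatorResponse`, provided the model produced valid JSON. The long timeout is a guess that allows for the agents' processing time.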
\ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/content_analysis_agent.py b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/content_analysis_agent.py new file mode 100644 index 0000000..c4043b0 --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/content_analysis_agent.py @@ -0,0 +1,52 @@ +from agno.agent import Agent +from agno.models.together import Together +from dotenv import load_dotenv +import os + +load_dotenv() + +# Define the content analysis agent +content_analysis_agent = Agent( + name="content-analysis-agent", + model=Together( + id="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free", + api_key=os.getenv("TOGETHER_API_KEY") + ), + description=""" + You are a content analysis agent that evaluates transcribed speech for structure, clarity, and filler words. + You will return grammar corrections, identified filler words, and suggestions for content improvement. + """, + instructions=[ + "You will be provided with a transcript of spoken content.", + "Your task is to analyze the transcript and identify:", + "- Grammar and syntax corrections.", + "- Filler words and their frequency.", + "- Suggestions for improving clarity and structure.", + "The response MUST be in the following JSON format:", + "{", + '"grammar_corrections": [list of corrections],', + '"filler_words": { "word": count, ... },', + '"suggestions": [list of suggestions]', + "}", + "Ensure the response is in proper JSON format with keys and values in double quotes.", + "Do not include any additional text outside the JSON response." + ], + markdown=True, + show_tool_calls=True, + debug_mode=True +) + +# # Example usage +# if __name__ == "__main__": +# # Sample transcript from the Voice Analysis Agent +# transcript = ( +# "So, um, I was thinking that, like, we could actually start the project soon. " +# "You know, it's basically ready, and, uh, we just need to finalize some details." 
+# ) +# prompt = f"Analyze the following transcript:\n\n{transcript}" +# content_analysis_agent.print_response(prompt, stream=True) + + # # Run agent and return the response as a variable + # response: RunResponse = content_analysis_agent.run(prompt) + # # Print the response in markdown format + # pprint_run_response(response, markdown=True) \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/coordinator_agent.py b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/coordinator_agent.py new file mode 100644 index 0000000..ebd95f0 --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/coordinator_agent.py @@ -0,0 +1,74 @@ +from agno.team.team import Team +from agno.agent import Agent, RunResponse +from agno.models.together import Together +from agents.facial_expression_agent import facial_expression_agent +from agents.voice_analysis_agent import voice_analysis_agent +from agents.content_analysis_agent import content_analysis_agent +from agents.feedback_agent import feedback_agent +import os +from pydantic import BaseModel, Field + +# Structured response +class CoordinatorResponse(BaseModel): + facial_expression_response: str = Field(..., description="Response from facial expression agent") + voice_analysis_response: str = Field(..., description="Response from voice analysis agent") + content_analysis_response: str = Field(..., description="Response from content analysis agent") + feedback_response: str = Field(..., description="Response from feedback agent") + strengths: list[str] = Field(..., description="List of speaker's strengths") + weaknesses: list[str] = Field(..., description="List of speaker's weaknesses") + suggestions: list[str] = Field(..., description="List of suggestions for speaker to improve") + +# Initialize a team of agents +coordinator_agent = Team( + name="coordinator-agent", + mode="coordinate", + 
model=Together(id="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free", api_key=os.getenv("TOGETHER_API_KEY")), + members=[facial_expression_agent, voice_analysis_agent, content_analysis_agent, feedback_agent], + description="You are a public speaking coach who helps individuals improve their presentation skills through feedback and analysis.", + instructions=[ + "You will be provided with a video file of a person speaking in a public setting.", + "You will analyze the speaker's facial expressions, voice modulation, and content delivery to provide constructive feedback.", + "Ask the facial expression agent to analyze the video file to detect emotions and engagement.", + "Ask the voice analysis agent to analyze the audio file to detect speech rate, pitch variation, and volume consistency.", + "Ask the content analysis agent to analyze the transcript given by voice analysis agent for structure, clarity, and filler words.", + "Ask the feedback agent to evaluate the analysis results from the facial expression agent, voice analysis agent, and content analysis agent to provide feedback on the overall effectiveness of the presentation, highlighting strengths and areas for improvement.", + "Your response MUST include the exact responses from the facial expression agent, voice analysis agent, content analysis agent, and feedback agent.", + "Additionally, your response MUST provide lists of the speaker's strengths, weaknesses, and suggestions for improvement based on all the responses and feedback provided by the feedback agent.", + "Important: You MUST directly address the speaker while providing strengths, weaknesses, and suggestions for improvement in a clear and constructive manner.", + "The response MUST be in the following strict JSON format:", + "{", + '"facial_expression_response": [facial_expression_agent_response],', + '"voice_analysis_response": [voice_analysis_agent_response],', + '"content_analysis_response": [content_analysis_agent_response],', + 
'"feedback_response": [feedback_agent_response],', + '"strengths": [speaker_strengths_list],', + '"weaknesses": [speaker_weaknesses_list],', + '"suggestions": [suggestions_for_improvement_list]', + "}", + "The response MUST be in strict JSON format with keys and values in double quotes.", + "The values in the JSON response MUST NOT be null or empty.", + "The final response MUST not include any other text or anything else other than the JSON response.", + "The final response MUST not include any backslashes in the JSON response.", + "The final response MUST be a valid JSON object and MUST not have any unterminated strings in the JSON response." + ], + add_datetime_to_instructions=True, + add_member_tools_to_system_message=False, # Setting this to True may help the coordinator get transfer tool calls right more consistently + enable_agentic_context=True, # Allow the agent to maintain a shared context and send that to members. + share_member_interactions=True, # Share all member responses with subsequent member requests. + show_members_responses=True, + response_model=CoordinatorResponse, + use_json_mode=True, + markdown=True, + show_tool_calls=True, + debug_mode=True +) + +# # Example usage +# video = "../../videos/my_video.mp4" +# prompt = f"Analyze facial expressions, voice modulation, and content delivery in the following video: {video} and provide constructive feedback." 
+# coordinator_agent.print_response(prompt, stream=True) + +# # Run agent and return the response as a variable +# response: RunResponse = coordinator_agent.run(prompt) +# # Print the response in markdown format +# pprint_run_response(response, markdown=True) \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/facial_expression_agent.py b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/facial_expression_agent.py new file mode 100644 index 0000000..f104fb3 --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/facial_expression_agent.py @@ -0,0 +1,50 @@ +from agno.agent import Agent, RunResponse +from agno.models.together import Together +from agents.tools.facial_expression_tool import analyze_facial_expressions as facial_expression_tool +from agno.utils.pprint import pprint_run_response +from dotenv import load_dotenv +import os + +load_dotenv() + +# Define the facial expression agent +facial_expression_agent = Agent( + name="facial-expression-agent", + model=Together(id="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free", api_key=os.getenv("TOGETHER_API_KEY")), + tools=[facial_expression_tool], + description= + ''' + You are a facial expression agent that will analyze facial expressions in videos to detect emotions and engagement. + You will return the emotion timeline and engagement metrics. 
+ ''', + instructions=[ + "You will be provided with a video file of a person speaking in a public setting.", + "Your task is to analyze the facial expressions in the video to detect emotions and engagement.", + "You will analyze the video frame by frame to detect and classify facial expressions such as happiness, sadness, anger, surprise, and neutrality.", + "You will also analyze the engagement metrics such as eye contact frequency and smile frequency.", + "The response MUST be in the following JSON format:", + "{", + '"emotion_timeline": [emotion_timeline],', + '"engagement_metrics": {', + '"eye_contact_frequency": [eye_contact_frequency],', + '"smile_frequency": [smile_frequency]', + "}", + "}", + "The response MUST be in proper JSON format with keys and values in double quotes.", + "The final response MUST not include any other text or anything else other than the JSON response.", + "The final response MUST not include any backslashes in the JSON response.", + "The final response MUST be a valid JSON object and MUST not have any unterminated strings in the JSON response." 
+ ], + markdown=True, + show_tool_calls=True, + debug_mode=True +) + +# video = "../../videos/my_video.mp4" +# prompt = f"Analyze facial expressions in the video file to detect emotions and engagement in the following video: {video}" +# facial_expression_agent.print_response(prompt, stream=True) + +# # Run agent and return the response as a variable +# response: RunResponse = facial_expression_agent.run(prompt) +# # Print the response in markdown format +# pprint_run_response(response, markdown=True) \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/feedback_agent.py b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/feedback_agent.py new file mode 100644 index 0000000..7010127 --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/feedback_agent.py @@ -0,0 +1,67 @@ +from agno.agent import Agent +from agno.models.together import Together +from dotenv import load_dotenv +import os + +load_dotenv() + +# Define the feedback agent +feedback_agent = Agent( + name="feedback-agent", + model=Together( + id="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free", + api_key=os.getenv("TOGETHER_API_KEY") + ), + description=""" + You are a feedback agent that evaluates a presentation based on the analysis results from all agents. + You will provide feedback on the overall effectiveness of the presentation. + """, + instructions=[ + "You are a public speaking coach who evaluates a speaker's performance based on a detailed scoring rubric.", + "You will be provided with analysis results from the facial expression agent, voice analysis agent, and content analysis agent.", + "You will be given criteria to evaluate the performance of the speaker based on the analysis results.", + "Your task is to evaluate the speaker on the following 5 criteria, scoring each from 1 (Poor) to 5 (Excellent):", + "1. 
**Content & Organization** - Evaluate how logically structured and well-organized the speech content is.", + "2. **Delivery & Vocal Quality** - Assess clarity, articulation, vocal variety, and use of filler words.", + "3. **Body Language & Eye Contact** - Consider posture, gestures, and level of eye contact.", + "4. **Audience Engagement** - Evaluate enthusiasm and ability to hold the audience's attention.", + "5. **Language & Clarity** - Check for grammar, clarity, appropriateness, and impact of language.", + "You will then provide a summary of the speaker's strengths and areas for improvement based on the rubric.", + "Important: You MUST directly address the speaker while providing constructive feedback.", + "Provide your response in the following strict JSON format:", + "{", + '"scores": {', + ' "content_organization": [1-5],', + ' "delivery_vocal_quality": [1-5],', + ' "body_language_eye_contact": [1-5],', + ' "audience_engagement": [1-5],', + ' "language_clarity": [1-5]', + '},', + '"total_score": [sum of all scores from 5 to 25],', + '"interpretation": "[One of: \'Needs significant improvement\', \'Developing skills\', \'Competent speaker\', \'Proficient speaker\', \'Outstanding speaker\']",', + '"feedback_summary": "[Concise feedback summarizing the speaker\'s strengths and areas for improvement based on rubric.]"', + "}", + "DO NOT include anything outside the JSON response.", + "Ensure all keys and values are properly quoted and formatted.", + "The response MUST be in proper JSON format with keys and values in double quotes.", + "The final response MUST not include any other text or anything else other than the JSON response." + ], + markdown=True, + show_tool_calls=True, + debug_mode=True +) + +# # Example usage +# if __name__ == "__main__": +# # Sample transcript from the Voice Analysis Agent +# transcript = ( +# "So, um, I was thinking that, like, we could actually start the project soon. 
" +# "You know, it's basically ready, and, uh, we just need to finalize some details." +# ) +# prompt = f"Analyze the following transcript:\n\n{transcript}" +# content_analysis_agent.print_response(prompt, stream=True) + + # # Run agent and return the response as a variable + # response: RunResponse = content_analysis_agent.run(prompt) + # # Print the response in markdown format + # pprint_run_response(response, markdown=True) \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/tools/facial_expression_tool.py b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/tools/facial_expression_tool.py new file mode 100644 index 0000000..a2b0350 --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/tools/facial_expression_tool.py @@ -0,0 +1,117 @@ +import cv2 +import numpy as np +import mediapipe as mp +from deepface import DeepFace +from agno.tools import tool +import json + +def log_before_call(fc): + """Pre-hook function that runs before the tool execution""" + print(f"About to call function with arguments: {fc.arguments}") + +def log_after_call(fc): + """Post-hook function that runs after the tool execution""" + print(f"Function call completed with result: {fc.result}") + +@tool( + name="analyze_facial_expressions", # Custom name for the tool (otherwise the function name is used) + description="Analyzes facial expressions to detect emotions and engagement.", # Custom description (otherwise the function docstring is used) + show_result=True, # Show result after function call + stop_after_tool_call=True, # Return the result immediately after the tool call and stop the agent + pre_hook=log_before_call, # Hook to run before execution + post_hook=log_after_call, # Hook to run after execution + cache_results=False, # Enable caching of results + cache_dir="/tmp/agno_cache", # Custom cache directory + cache_ttl=3600 # Cache TTL in seconds (1 hour) +) +def 
analyze_facial_expressions(video_path: str) -> str: + """ + Analyzes facial expressions in a video to detect emotions and engagement. + + Args: + video_path: The path to the video file. + + Returns: + A JSON string containing the emotion timeline and engagement metrics. + """ + mp_face_mesh = mp.solutions.face_mesh + face_mesh = mp_face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1) + cap = cv2.VideoCapture(video_path) + + emotion_timeline = [] + eye_contact_count = 0 + smile_count = 0 + frame_count = 0 + fps = cap.get(cv2.CAP_PROP_FPS) + + # Process every nth frame for performance optimization + frame_interval = 5 + + while cap.isOpened(): + ret, frame = cap.read() + if not ret: + break + + frame_count += 1 + if frame_count % frame_interval != 0: + continue + + # Resize frame for faster processing + frame = cv2.resize(frame, (640, 480)) + rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) + results = face_mesh.process(rgb_frame) + + if results.multi_face_landmarks: + for face_landmarks in results.multi_face_landmarks: + # Extract landmarks + landmarks = face_landmarks.landmark + + # Convert landmarks to pixel coordinates + h, w, _ = frame.shape + landmark_coords = [(int(lm.x * w), int(lm.y * h)) for lm in landmarks] + + # Emotion Detection using DeepFace & Smile Detection + try: + analysis = DeepFace.analyze(frame, actions=['emotion'], enforce_detection=False) + emotion = analysis[0]['dominant_emotion'] + if emotion == "happy": + smile_count += 1 + + timestamp = frame_count / fps + # round the timestamp (in seconds) to two decimals + timestamp = round(timestamp, 2) + emotion_timeline.append({"timestamp": timestamp, "emotion": emotion}) + except Exception as e: + print(f"Error analyzing frame: {e}") + continue + + # Engagement Metric: Eye contact estimation + # Using eye landmarks: 159 (left eye upper lid), 145 (left eye lower lid), 386 (right eye upper lid), 374 (right eye lower lid) + left_eye_upper_lid = landmark_coords[159] + left_eye_lower_lid = landmark_coords[145] + 
right_eye_upper_lid = landmark_coords[386] + right_eye_lower_lid = landmark_coords[374] + + left_eye_opening = np.linalg.norm(np.array(left_eye_upper_lid) - np.array(left_eye_lower_lid)) + right_eye_opening = np.linalg.norm(np.array(right_eye_upper_lid) - np.array(right_eye_lower_lid)) + + eye_opening_avg = (left_eye_opening + right_eye_opening) / 2 + + # Simple heuristic: if eyes are wide open, assume eye contact + if eye_opening_avg > 5: # Threshold adjustment through experimentation + eye_contact_count += 1 + + cap.release() + face_mesh.close() + + total_processed_frames = frame_count // frame_interval + if total_processed_frames == 0: + total_processed_frames = 1 # Avoid division by zero + + return json.dumps({ + "emotion_timeline": emotion_timeline, + "engagement_metrics": { + "eye_contact_frequency": eye_contact_count / total_processed_frames, + "smile_frequency": smile_count / total_processed_frames + } + }) \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/tools/voice_analysis_tool.py b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/tools/voice_analysis_tool.py new file mode 100644 index 0000000..63e71a7 --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/tools/voice_analysis_tool.py @@ -0,0 +1,135 @@ +import os +import json +import tempfile +import numpy as np +import librosa +from moviepy import VideoFileClip +from faster_whisper import WhisperModel +from agno.tools import tool +from dotenv import load_dotenv + +load_dotenv() + +def extract_audio_from_video(video_path: str, output_audio_path: str) -> str: + """ + Extracts audio from a video file and saves it as an audio file. + + Args: + video_path: Path to the input video file. + output_audio_path: Path to save the extracted audio file. + + Returns: + Path to the extracted audio file. 
+ """ + video_clip = VideoFileClip(video_path) + audio_clip = video_clip.audio + audio_clip.write_audiofile(output_audio_path) + audio_clip.close() + video_clip.close() + return output_audio_path + +def load_whisper_model(): + try: + model = WhisperModel("small", device="cpu", compute_type="int8") + return model + except Exception as e: + print(f"Error loading Whisper model: {e}") + return None + +def transcribe_audio(audio_file): + """ + Transcribe the audio file using faster-whisper. + + Returns: + str: Transcribed text or error/fallback message. + """ + if not audio_file or not os.path.exists(audio_file): + return "No audio file exists at the specified path." + + model = load_whisper_model() + if not model: + return "Model failed to load. Please check system resources or model path." + + try: + print("Model loaded successfully. Transcribing audio...") + segments, _ = model.transcribe(audio_file) + full_text = " ".join(segment.text for segment in segments) + return full_text.strip() if full_text else "I couldn't understand the audio. Please try again." + + except Exception as e: + print(f"Error transcribing audio with faster-whisper: {e}") + return "I'm having trouble transcribing your audio. Please try again or speak more clearly." 
+ +def log_before_call(fc): + """Pre-hook function that runs before the tool execution""" + print(f"About to call function with arguments: {fc.arguments}") + +def log_after_call(fc): + """Post-hook function that runs after the tool execution""" + print(f"Function call completed with result: {fc.result}") + +@tool( + name="analyze_voice_attributes", # Custom name for the tool (otherwise the function name is used) + description="Analyzes vocal attributes like clarity, intonation, and pace.", # Custom description (otherwise the function docstring is used) + show_result=True, # Show result after function call + stop_after_tool_call=True, # Return the result immediately after the tool call and stop the agent + pre_hook=log_before_call, # Hook to run before execution + post_hook=log_after_call, # Hook to run after execution + cache_results=False, # Caching is disabled; set to True to cache results + cache_dir="/tmp/agno_cache", # Custom cache directory + cache_ttl=3600 # Cache TTL in seconds (1 hour) +) +def analyze_voice_attributes(file_path: str) -> str: + """ + Analyzes vocal attributes in an audio file, or in the audio track of an .mp4 video. + + Args: + file_path: The path to the audio or .mp4 video file. + + Returns: + A JSON string containing the transcribed text, speech rate, pitch variation, and volume consistency. 
+ """ + + # Determine file extension + _, ext = os.path.splitext(file_path) + ext = ext.lower() + + # If the file is a video, extract audio + if ext in ['.mp4']: + with tempfile.NamedTemporaryFile(suffix='.mp3', delete=False) as temp_audio_file: + audio_path = extract_audio_from_video(file_path, temp_audio_file.name) + else: + audio_path = file_path + + # Transcribe audio + transcription = transcribe_audio(audio_path) + + # Proceed with analysis using the audio_path + # Load audio + y, sr = librosa.load(audio_path, sr=16000) + + words = transcription.split() + + # Calculate speech rate + duration = librosa.get_duration(y=y, sr=sr) + speech_rate = len(words) / (duration / 60.0) # words per minute + + # Pitch variation + pitches, magnitudes = librosa.piptrack(y=y, sr=sr) + pitch_values = pitches[magnitudes > np.median(magnitudes)] + pitch_variation = np.std(pitch_values) if pitch_values.size > 0 else 0 + + # Volume consistency + rms = librosa.feature.rms(y=y)[0] + volume_consistency = np.std(rms) + + # Clean up temporary audio file if created + if ext in ['.mp4']: + os.remove(audio_path) + + return json.dumps({ + "transcription": transcription, + "speech_rate_wpm": str(round(speech_rate, 2)), + "pitch_variation": str(round(pitch_variation, 2)), + "volume_consistency": str(round(volume_consistency, 4)) + }) \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/voice_analysis_agent.py b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/voice_analysis_agent.py new file mode 100644 index 0000000..5460658 --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/agents/voice_analysis_agent.py @@ -0,0 +1,45 @@ +from agno.agent import Agent, RunResponse +from agno.agent import Agent +from agno.models.together import Together +from agents.tools.voice_analysis_tool import analyze_voice_attributes as voice_analysis_tool +from agno.utils.pprint import pprint_run_response +from 
dotenv import load_dotenv +import os + +load_dotenv() + +# Define the voice analysis agent +voice_analysis_agent = Agent( + name="voice-analysis-agent", + model=Together(id="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free", api_key=os.getenv("TOGETHER_API_KEY")), + tools=[voice_analysis_tool], + description=""" + You are a voice analysis agent that evaluates vocal attributes like clarity, intonation, and pace. + You will return the transcribed text, speech rate, pitch variation, and volume consistency. + """, + instructions=[ + "You will be provided with an audio file of a person speaking.", + "Your task is to analyze the vocal attributes in the audio to detect speech rate, pitch variation, and volume consistency.", + "The response MUST be in the following JSON format:", + "{", + '"transcription": [transcription],', + '"speech_rate_wpm": [speech_rate_wpm],', + '"pitch_variation": [pitch_variation],', + '"volume_consistency": [volume_consistency]', + "}", + "The response MUST be in proper JSON format with keys and values in double quotes.", + "The final response MUST not include any other text or anything else other than the JSON response." 
+ ], + markdown=True, + show_tool_calls=True, + debug_mode=True +) + +# audio = "../../videos/my_video.mp4" +# prompt = f"Analyze vocal attributes in the audio file to detect speech rate, pitch variation, and volume consistency in the following audio: {audio}" +# voice_analysis_agent.print_response(prompt, stream=True) + +# # Run agent and return the response as a variable +# response: RunResponse = voice_analysis_agent.run(prompt) +# # Print the response in markdown format +# pprint_run_response(response, markdown=True) \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/main.py b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/main.py new file mode 100644 index 0000000..ee357b1 --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/backend/main.py @@ -0,0 +1,43 @@ +from fastapi import FastAPI +from fastapi.middleware.cors import CORSMiddleware +from fastapi.encoders import jsonable_encoder +from fastapi.responses import JSONResponse +from pydantic import BaseModel +from agents.coordinator_agent import coordinator_agent +from agno.agent import RunResponse +from dotenv import load_dotenv + +# Load environment variables +load_dotenv() + +# Initialize FastAPI app +app = FastAPI() + +# Configure CORS to allow requests from your frontend +app.add_middleware( + CORSMiddleware, + allow_origins=["*"], # To be replaced with the frontend's origin in production + allow_credentials=True, + allow_methods=["*"], + allow_headers=["*"], +) + +# Define the request body model +class AnalysisRequest(BaseModel): + video_url: str + +# Define the entry point +@app.get("/") +async def root(): + return {"message": "Welcome to the video analysis API!"} + +# Define the analysis endpoint +@app.post("/analyze") +async def analyze(request: AnalysisRequest): + video_url = request.video_url + prompt = f"Analyze the following video: {video_url}" + response: RunResponse = coordinator_agent.run(prompt) + + # 
Assuming response.content is a Pydantic model or a dictionary + json_compatible_response = jsonable_encoder(response.content) + return JSONResponse(content=json_compatible_response) \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/.streamlit/config.toml b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/.streamlit/config.toml new file mode 100644 index 0000000..cec67f7 --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/.streamlit/config.toml @@ -0,0 +1,6 @@ +[theme] +primaryColor = "#4B8BBE" +backgroundColor = "#F5F5F5" +secondaryBackgroundColor = "#E0E0E0" +textColor = "#262730" +font = "sans serif" \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/Home.py b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/Home.py new file mode 100644 index 0000000..56e8074 --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/Home.py @@ -0,0 +1,143 @@ +import streamlit as st +import requests +import tempfile +import os +import json +import numpy as np +from page_congif import render_page_config + +render_page_config() + +# Initialize session state variables +if "begin" not in st.session_state: + st.session_state.begin = False + +if "video_path" not in st.session_state: + st.session_state.video_path = None + +if "upload_file" not in st.session_state: + st.session_state.upload_file = False + +if "response" not in st.session_state: + st.session_state.response = None + +if "facial_expression_response" not in st.session_state: + st.session_state.facial_expression_response = None + +if "voice_analysis_response" not in st.session_state: + st.session_state.voice_analysis_response = None + +if "content_analysis_response" not in st.session_state: + st.session_state.content_analysis_response = None + +if "feedback_response" not in st.session_state: + st.session_state.feedback_response = None + + 
+def clear_session_response(): + st.session_state.response = None + st.session_state.facial_expression_response = None + st.session_state.voice_analysis_response = None + st.session_state.content_analysis_response = None + st.session_state.feedback_response = None + + +# Create two columns with a 70:30 width ratio +col1, col2 = st.columns([0.7, 0.3]) + +# Left column: Video area and buttons +with col1: + spacer1, btn_col = st.columns([0.8, 0.2]) + + if st.session_state.begin: + with spacer1: + st.markdown("

📽️ Video

", unsafe_allow_html=True) + + with btn_col: + if st.button("📤 Upload Video"): + if st.session_state.video_path: + os.remove(st.session_state.video_path) + st.session_state.video_path = None + clear_session_response() + st.session_state.upload_file = True + st.rerun() # Force rerun to fully reset uploader + + + if st.session_state.get("upload_file"): + uploaded_file = st.file_uploader("📤 Upload Video", type=["mp4"]) + + if uploaded_file is not None: + temp_dir = tempfile.gettempdir() + # Use a random name to avoid reuse + unique_name = f"{int(np.random.rand()*1e8)}_{uploaded_file.name}" + file_path = os.path.join(temp_dir, unique_name) + + if not os.path.exists(file_path): + with open(file_path, "wb") as f: + f.write(uploaded_file.read()) + + st.session_state.video_path = file_path + st.session_state.upload_file = False + st.rerun() + # elif not st.session_state.get("video_path"): + if not st.session_state.begin: + st.success(""" + **Welcome to AI Speech Trainer!** + Your ultimate companion to help improve your public speaking skills. + """) + st.info(""" + 🚀 To get started: + \n\t1. Record a video of yourself practicing a speech or presentation - use any video recording app. + \n\t2. Upload the recorded video. + \n\t3. Analyze the video to get personalized feedback.
+ """) + if st.button("👉 Let's begin!"): + st.session_state.begin = True + st.rerun() + + if st.session_state.video_path: + st.video(st.session_state.video_path, autoplay=False) + + if not st.session_state.response: + if st.button("▶️ Analyze Video"): + with st.spinner("Analyzing video..."): + st.warning("⚠️ This process may take some time, so please be patient and wait for the analysis to complete.") + API_URL = "http://localhost:8000/analyze" + response = requests.post(API_URL, json={"video_url": st.session_state.video_path}) + + if response.status_code == 200: + st.success("Video analysis completed successfully.") + response = response.json() + st.session_state.response = response + st.session_state.facial_expression_response = response.get("facial_expression_response") + st.session_state.voice_analysis_response = response.get("voice_analysis_response") + st.session_state.content_analysis_response = response.get("content_analysis_response") + st.session_state.feedback_response = response.get("feedback_response") + st.rerun() + else: + st.error("🚨 Error during video analysis. Please try again.") + +# Right column: Transcript and feedback +with col2: + st.markdown("

📝 Transcript

", unsafe_allow_html=True) + transcript_text = "Your transcript will be displayed here." + if st.session_state.voice_analysis_response: + voice_analysis_response = st.session_state.voice_analysis_response + transcript = json.loads(voice_analysis_response).get("transcription") + else: + transcript = None + + st.markdown( + f""" +
+ {transcript if transcript else transcript_text} +
+
+ """, + unsafe_allow_html=True + ) + + if st.button("📝 Get Feedback"): + st.switch_page("pages/1 - Feedback.py") \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/page_congif.py b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/page_congif.py new file mode 100644 index 0000000..491c452 --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/page_congif.py @@ -0,0 +1,31 @@ +import streamlit as st +from sidebar import render_sidebar + +def render_page_config(): + # Set page configuration + st.set_page_config( + page_icon="🎙️", + page_title="AI Speech Trainer", + initial_sidebar_state="auto", + layout="wide") + + # Load external CSS + with open("style.css") as f: + st.markdown(f"<style>{f.read()}</style>", unsafe_allow_html=True) + + # Sidebar + render_sidebar() + + # Main title with an icon + st.markdown( + """ +
+ ๐Ÿ—ฃ๏ธ AI Speech Trainer
+ Your personal coach for public speaking +
+ """, + unsafe_allow_html=True + ) + + # Horizontal line + st.markdown("
", unsafe_allow_html=True) \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/pages/1 - Feedback.py b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/pages/1 - Feedback.py new file mode 100644 index 0000000..f992547 --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/pages/1 - Feedback.py @@ -0,0 +1,127 @@ +import streamlit as st +import plotly.graph_objects as go +import json +from page_congif import render_page_config + +render_page_config() + +# Get feedback response from session state +if st.session_state.feedback_response: + feedback_response = json.loads(st.session_state.feedback_response) + feedback_scores = feedback_response.get("scores") + + # Evaluation scores based on the public speaking rubric + scores = { + "Content & Organization": feedback_scores.get("content_organization"), + "Delivery & Vocal Quality": feedback_scores.get("delivery_vocal_quality"), + "Body Language & Eye Contact": feedback_scores.get("body_language_eye_contact"), + "Audience Engagement": feedback_scores.get("audience_engagement"), + "Language & Clarity": feedback_scores.get("language_clarity") + } + + total_score = feedback_response.get("total_score") + interpretation = feedback_response.get("interpretation") + feedback_summary = feedback_response.get("feedback_summary") +else: + st.warning("No feedback available! 
Please upload a video and analyze it first.") + scores = { + "Content & Organization": 0, + "Delivery & Vocal Quality": 0, + "Body Language & Eye Contact": 0, + "Audience Engagement": 0, + "Language & Clarity": 0 + } + + total_score = 0 + interpretation = "" + feedback_summary = "" + +# Calculate average score +average_score = sum(scores.values()) / len(scores) + +# Determine strengths, weaknesses, and suggestions for improvement +if st.session_state.response: + strengths = st.session_state.response.get("strengths") or [] + weaknesses = st.session_state.response.get("weaknesses") or [] + suggestions = st.session_state.response.get("suggestions") or [] +else: + strengths = [] + weaknesses = [] + suggestions = [] + +# Create three columns with a 30:40:30 width ratio +col1, col2, col3 = st.columns([0.3, 0.4, 0.3]) + +# Left Column: Evaluation Summary +with col1: + st.subheader("🧾 Evaluation Summary") + + st.markdown("
", unsafe_allow_html=True) + + for criterion, score in scores.items(): + label_col, progress_col, score_col = st.columns([2, 3, 1]) # Adjust the ratio as needed + with label_col: + st.markdown(f"**{criterion}**") + with progress_col: + st.progress(score / 5) + with score_col: + st.markdown(f"{score}/5", unsafe_allow_html=True) + + st.markdown("
", unsafe_allow_html=True) + + # Display total score + st.markdown(f"#### 🏆 Total Score: {total_score} / 25") + # Display average score + st.markdown(f"#### 🎯 Average Score: {average_score:.2f} / 5") + + st.markdown("""---""") + + st.markdown("##### 🗣️ Feedback Summary:") + # Display interpretation + st.markdown(f"📝 **Overall Assessment**: {interpretation}") + # Display feedback summary + st.info(f"{feedback_summary}") + + +# Middle Column: Strengths, Weaknesses, and Suggestions +with col2: + # Display strengths + st.markdown("##### 🦾 Strengths:") + strengths_text = '\n'.join(f"- {item}" for item in strengths) + st.success(strengths_text) + + # Display weaknesses + st.markdown("##### ⚠️ Weaknesses:") + weaknesses_text = '\n'.join(f"- {item}" for item in weaknesses) + st.error(weaknesses_text) + + # Display suggestions + st.markdown("##### 💡 Suggestions for Improvement:") + suggestions_text = '\n'.join(f"- {item}" for item in suggestions) + st.warning(suggestions_text) + + +# Right Column: Performance Chart +with col3: + st.subheader("📊 Performance Chart") + + # Radar Chart + radar_fig = go.Figure() + radar_fig.add_trace(go.Scatterpolar( + r=list(scores.values()), + theta=list(scores.keys()), + fill='toself', + name='Scores' + )) + radar_fig.update_layout( + polar=dict( + radialaxis=dict(visible=True, range=[0, 5]) + ), + showlegend=False, + margin=dict(t=50, b=50, l=50, r=50), # Reduced margins + width=350, + height=350 + ) + st.plotly_chart(radar_fig, use_container_width=True) + + st.markdown("""---""") \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/sidebar.py b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/sidebar.py new file mode 100644 index 0000000..3b4489b --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/sidebar.py @@ -0,0 +1,24 @@ +# Sidebar with About section +import streamlit as st + +def render_sidebar(): + 
st.sidebar.header("About") + st.sidebar.info( + """ + **AI Speech Trainer** helps users improve their public speaking skills through:\ + + + 📽️ Video Analysis\ + + 🗣️ Voice Analysis\ + + 📊 Content Analysis & Feedback\ + + + - Upload your video to receive detailed feedback. + + - Improve your public speaking skills with AI-powered analysis. + + - Get personalized suggestions to enhance your performance. + """ + ) \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/style.css b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/style.css new file mode 100644 index 0000000..f83ca02 --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/frontend/style.css @@ -0,0 +1,35 @@ +.main > div:first-child { + padding-top: 0.9rem !important; +} + +.custom-header { + text-align: center; +} + +.custom-header > span:first-child { + font-size: 2.5rem; + font-weight: 600; + color: #4B8BBE; +} + +.custom-header > span:nth-child(2) { + font-size: 1.2rem; + color: gray; +} + +video { + width: 640px !important; + height: 360px !important; + max-width: none !important; + border-radius: 12px !important; + overflow: hidden; +} + +.custom-hr { + margin-top: 0.5rem; + margin-bottom: 1rem; +} + +.stFileUploaderFile { + display: none; +} \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/requirements.txt b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/requirements.txt new file mode 100644 index 0000000..0b9e729 --- /dev/null +++ b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/requirements.txt @@ -0,0 +1,16 @@ +streamlit +pandas +plotly +opencv-python +tf-keras +deepface +mediapipe +agno +openai +requests +librosa +python-dotenv +moviepy +faster-whisper +fastapi +uvicorn \ No newline at end of file diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/visuals/ai_speech_trainer.drawio.png 
b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/visuals/ai_speech_trainer.drawio.png new file mode 100644 index 0000000..973b3be Binary files /dev/null and b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/visuals/ai_speech_trainer.drawio.png differ diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/visuals/feedback.png b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/visuals/feedback.png new file mode 100644 index 0000000..1bd0963 Binary files /dev/null and b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/visuals/feedback.png differ diff --git a/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/visuals/home.png b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/visuals/home.png new file mode 100644 index 0000000..449e29d Binary files /dev/null and b/advanced_ai_agents/multi_agent_apps/ai_speech_trainer/visuals/home.png differ
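
Note on the speech-rate step in `voice_analysis_tool.py`: the tool divides the word count of the transcription by the clip duration in minutes. The sketch below isolates that arithmetic so it can be checked without loading audio. The helper name `words_per_minute` is hypothetical and not part of this PR, and returning zero for a non-positive duration is an assumption about how empty or unreadable audio should be handled.

```python
def words_per_minute(transcription: str, duration_seconds: float) -> float:
    """Return the speaking rate in words per minute, rounded to 2 decimals.

    Hypothetical helper for illustration only; it mirrors the expression
    len(words) / (duration / 60.0) used in the voice analysis tool.
    """
    words = transcription.split()
    if duration_seconds <= 0:
        # Assumption: report a rate of zero instead of dividing by zero
        return 0.0
    return round(len(words) / (duration_seconds / 60.0), 2)


if __name__ == "__main__":
    # Six words spoken over three seconds
    print(words_per_minute("thank you all for coming today", 3.0))
```

Keeping the calculation in a pure function like this would let the rate be unit-tested separately from `librosa` and the transcription model, though whether that split suits the project is a design choice for the author.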