
About the Dataset
The dataset used in this solution consists of approximately 4,000 medical transcriptions, which were obtained from Kaggle. The original dataset was in CSV format, and it was transformed into individual text files using a Python script for further processing. These text files were then stored in an AWS S3 bucket for easy access and synchronization with Amazon Kendra.
Architecture Diagram

Advantages
1. Comprehensive Medical Transcription Search
- Users can search through a vast repository of medical transcriptions to find relevant information quickly.
- The solution facilitates efficient information retrieval from a large and unstructured dataset.
2. Natural Language Understanding
- Utilizes Amazon Kendra, a managed service that provides natural language search capabilities.
- Provides users with meaningful and contextually relevant results for their queries.
3. Data Cleaning and Inference
- Integrates AWS Bedrock Large Language Models (LLMs) to clean and structure unstructured data.
- Offers valuable insights and inferences from medical transcriptions, improving data usability.
4. Scalability and Cloud Integration
- Hosted on AWS, the solution can scale to handle large datasets and high query volumes.
- Utilizes cloud resources efficiently, reducing infrastructure management overhead.
Disadvantages
1. Query Precision
- While the solution provides powerful search capabilities, users may need to refine their queries to achieve precise results, as the accuracy of search results can vary based on query complexity and dataset quality.
2. Privacy and Accuracy Concerns
- Given that these are medical documents, users should exercise caution and consider reviewing the retrieved information for accuracy and privacy compliance before making any critical decisions based on the search results. Medical data accuracy is crucial in this domain, and users should verify the information obtained through the application.
How it Works
Data Preparation:
- The initial dataset of medical transcriptions is obtained from Kaggle and transformed into individual text files. These files are stored in an AWS S3 bucket.
Amazon Kendra Integration:
- Amazon Kendra is used to create an index and store the text files. Kendra provides advanced natural language search capabilities.
Lambda Query Processing:
- When a user submits a query, a Lambda function is triggered. This function interacts with Kendra to fetch relevant snippets and S3 URLs based on the query.
Data Cleaning and Inference:
- The fetched data is often unstructured. AWS Bedrock Large Language Models (LLMs) are employed to clean and structure this data, providing meaningful insights.
Flask Frontend:
- The cleaned data and inferences are presented to the user through a Flask web application. Users can easily search for and access relevant information from medical transcriptions.
Conclusion
The Kendra-based Medical Transcription Search Application is a powerful solution for efficiently searching and gaining insights from a large dataset of medical transcriptions. It leverages AWS services, Kendra’s natural language search capabilities, and LLMs to provide users with relevant and structured information. While it offers many advantages, users should be aware of its AWS dependency and potential learning curve.