Text and Code Extractor for LLM Context
Fuel Your LLM: Extract Text and Code from Any File
This script is a recursive text miner that simplifies the process of gathering text data from various file types within a directory, making it ready for use with Large Language Models (LLMs).
For a given directory path, the script recursively scans every file, regardless of type, and extracts both English prose and code snippets using a hybrid approach:
- Heuristics: Rules to identify common patterns in English text and code.
- Regular Expressions: Precise pattern matching to capture specific language structures.
- Syntax Highlighting: Provides a creative, implicit code identification signal.
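As a rough illustration of how the first two signals can work together, the sketch below scores a text by the fraction of its lines that match a keyword or structural pattern. The pattern lists and the function name `code_line_ratio` are assumptions made for this example; they are not taken from the script, which uses longer lists and a simple yes/no check.

```python
import re

# Hypothetical, shortened pattern lists chosen for illustration only.
KEYWORD_PATTERNS = [r'\bdef\b', r'\bclass\b', r'\bimport\b', r'\breturn\b', r'\bfunction\b']
STRUCTURE_PATTERNS = [r'[{}();]', r'==|!=|<=|>=|->|=>', r'^\s*(#|//)']

# One combined regex: a line counts as code-like if any alternative matches.
CODE_LINE_RE = re.compile('|'.join(KEYWORD_PATTERNS + STRUCTURE_PATTERNS))


def code_line_ratio(text):
    """Return the fraction of non-empty lines that match a code-like pattern."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if CODE_LINE_RE.search(line))
    return hits / len(lines)


prose = "The quick brown fox jumps over the lazy dog.\nIt was a sunny day."
code = "def add(a, b):\n    return a + b"
print(code_line_ratio(prose))  # no line matches
print(code_line_ratio(code))   # every line matches
```

A ratio is easier to threshold than a boolean, but words such as "import" and "return" also occur in ordinary English, so any cutoff would need tuning on real files.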
Usage
- Extract and combine English prose and code into a single text file.
- Upload the text file into your preferred LLM’s context window.
- Craft prompts that require quick analysis across a large body of text.
Code
import os
import re

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException
from pygments import highlight
from pygments.formatters import TerminalFormatter
from pygments.lexers import guess_lexer
from pygments.util import ClassNotFound
from tika import parser


def is_english_or_code(text):
    """Detects if the given text is primarily English or contains code-like structures."""
    try:
        language = detect(text)
        return language == 'en' or contains_code(text)
    except LangDetectException:  # Raised for empty or feature-less text
        return False


def contains_code(text):
    """Uses basic heuristics to check for code-like patterns."""
    # Keywords and comment markers common to several languages
    code_keywords = [
        r'\bimport\b', r'\bdef\b', r'\bclass\b', r'#include\b', r'\bpublic\b',
        r'\bprivate\b', r'\bvoid\b', r'\bint\b', r'\breturn\b', r'\bsystem\b',
        r'\bfunction\b', r'console\.log\b', r'<\?php\b', r'\bvar\b',
        r'\$\w+', r'^\s*#', r'^\s*//', r'^\s*/\*', r'\bif\b', r'\belse\b',
        r'\bfi\b', r'\bwhile\b', r'\bfor\b', r'\bdone\b', r'\becho\b', r'\blet\b'
    ]
    # Structural patterns: numbers, brackets, operators, quoted strings, tags.
    # These are broad on purpose and also match ordinary prose punctuation.
    code_patterns = [
        r'\b\d+\b', r'[{}\[\]();,:]', r'==|!=|<=|>=|->|=>',
        r'[/\\\+\-\*%\^]', r'"[^"]*"|\'[^\']*\'', r'<[^>]+>'
    ]
    for pattern in code_keywords + code_patterns:
        if re.search(pattern, text, re.MULTILINE):
            return True
    return False


def extract_text_from_file(filepath):
    """Extracts text from various file types using Tika."""
    parsed = parser.from_file(filepath)
    # 'content' is None when Tika could not extract any text
    return parsed.get('content') or ""


def process_directory(directory, output_file):
    """Recursively processes files in the given directory."""
    for root, dirs, files in os.walk(directory):
        for file in files:
            filepath = os.path.join(root, file)
            try:
                text = extract_text_from_file(filepath)
                # Clean up potential MS Word jargon (if needed)
                text = re.sub(r'^.+?\n\n', '', text)
                if is_english_or_code(text):
                    # Attempt syntax highlighting as an additional code detection mechanism
                    try:
                        # guess_lexer picks a Pygments lexer from the content itself
                        lexer = guess_lexer(text, stripall=True)
                        # TerminalFormatter adds ANSI color codes to the text
                        highlighted_text = highlight(text, lexer, TerminalFormatter())
                    except ClassNotFound:
                        highlighted_text = None  # No lexer matched, likely not code
                    with open(output_file, 'a', encoding='utf-8') as f:
                        if highlighted_text:  # If highlighting succeeded, use it
                            f.write(highlighted_text + '\n')
                        else:  # Otherwise, use the original text
                            f.write(text + '\n')
            except Exception as e:
                print(f"Error processing {filepath}: {e}")


if __name__ == "__main__":
    directory_to_scan = input("Enter the directory to scan: ")
    output_file = 'output_text.txt'
    # Clear the output file if it exists
    with open(output_file, 'w', encoding='utf-8') as f:
        pass
    process_directory(directory_to_scan, output_file)
    print(f"All English text and code content has been combined into {output_file}")
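The script's process_directory function hands every file to Tika, including images such as .png. One possible refinement, which is not part of the original script, is to skip extensions that are unlikely to contain useful text before extraction. The helper name `iter_candidate_files` and the extension list below are assumptions for illustration.

```python
import os

# Assumed set of extensions to skip; adjust it for your own project.
SKIP_EXTENSIONS = {'.png', '.jpg', '.jpeg', '.gif', '.ico', '.zip'}


def iter_candidate_files(directory, skip_extensions=SKIP_EXTENSIONS):
    """Yield file paths under `directory` whose extension is not skipped."""
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            # Compare extensions case-insensitively (.PNG is treated like .png)
            ext = os.path.splitext(name)[1].lower()
            if ext not in skip_extensions:
                yield os.path.join(root, name)
```

A loop such as `for filepath in iter_candidate_files(directory)` could then stand in for the nested os.walk loops, at the cost of missing any text Tika might have recovered from the skipped files.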
Example
Input
A single folder containing 219 files in 21 folders, with file types .js, .mjs, .md, .png, .scss, .txt, .ts, and .tsx, totaling 2.27 MB
Output
output_text.txt: 534 KB, 58,067 words, and 507,061 characters.
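Figures like these can be reproduced with a few lines of Python. The sketch below is one way to do it; the function name `summarize_text_file` is hypothetical, and counting words by splitting on whitespace is an assumption, so the totals may differ slightly from those of the tool used for the numbers above.

```python
import os


def summarize_text_file(path):
    """Return size in KB, word count, and character count for a UTF-8 text file."""
    with open(path, encoding='utf-8') as f:
        text = f.read()
    return {
        'size_kb': os.path.getsize(path) / 1024,  # size on disk
        'words': len(text.split()),               # whitespace-separated tokens
        'characters': len(text),                  # decoded characters, not bytes
    }
```

Calling `summarize_text_file('output_text.txt')` after a run returns a dictionary with the three values, which helps when checking whether the file is likely to fit in a given context window.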
Prompt
Text prompt, with output_text.txt uploaded as context
Provide instructions for removing %string% from my site. Use the manual/instructions uploaded. Identify the source files such that future compilations the static site generator no longer contain %string% in the footer.
LLM Response
Successful instructions generated
The footer you want to remove is generated by the Footer component, which is defined in the file /components/Footer.tsx. The text content of the footer, including the copyright notice and the links, are customizable through the Options interface that the Footer component constructor accepts. To remove the %string% text and the links from the footer, you need to modify the layout.ts file. In this file, locate the instantiation of the Footer component within the layout definition. You’ll likely find it within the sharedPageComponents object. Modify the Footer component instantiation to provide an empty links object to its constructor. This will effectively remove the “GitHub” and “Discord Community” links from the footer. Additionally, you can customize the text content of the footer by providing a modified string to the i18n(cfg.locale).components.footer.createdWith property within the Footer component. If you want to remove the text completely, you can set it to an empty string. The Footer component uses internationalization (i18n) to display the text “Created with”. If you have translations set up for different locales, make sure to update the corresponding translation files to remove or modify this text as well. If you upgrade, .layout.ts may get overwritten, and you may need to reapply these changes. Consider keeping a backup of your modified layout file or using a version control system like Git to track your changes.