Text and Code Extractor for LLM Context

Fuel Your LLM: Extract Text and Code from Any File

This script is a recursive text miner that simplifies the process of gathering text data from various file types within a directory, making it ready for use with Large Language Models (LLMs).

Given a directory path, the script recursively scans every file in it, regardless of type, and extracts both English prose and code snippets using a hybrid approach:

  • Heuristics: Rules that identify common patterns in English text and code.
  • Regular Expressions: Precise pattern matching to capture specific language structures.
  • Syntax Highlighting: Provides a creative, implicit code identification signal.
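
To illustrate how heuristics and regular expressions can work together, here is a minimal sketch that counts how many distinct code-like patterns appear in a text and compares the count to a threshold. It is not the script's own detection logic: the pattern lists, the function names code_score and looks_like_code, and the threshold of 3 are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern lists, chosen only for illustration.
KEYWORD_PATTERNS = [
    r'\bdef\b', r'\bimport\b', r'\bclass\b', r'\breturn\b', r'\bfunction\b',
]
STRUCTURE_PATTERNS = [
    r'[{}\[\]();]',        # brackets, parentheses, semicolons
    r'==|!=|<=|>=|->|=>',  # comparison and arrow operators
    r'^\s*(#|//)',         # a line that starts with a comment marker
]

def code_score(text):
    """Count how many distinct code-like patterns occur in the text."""
    patterns = KEYWORD_PATTERNS + STRUCTURE_PATTERNS
    return sum(1 for p in patterns if re.search(p, text, re.MULTILINE))

def looks_like_code(text, threshold=3):
    """Treat text as code only when several independent signals agree."""
    return code_score(text) >= threshold
```

Requiring several signals to agree is one way to keep ordinary prose, which also contains words like "return" and the occasional parenthesis, from being classified as code.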

Usage

  • Extract and combine English prose and code into a single text file.
  • Upload the text file into your preferred LLM’s context window.
  • Craft prompts that require quick analysis across a large body of text.
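
If the combined text turns out to be larger than your LLM accepts in one upload, one option is to split it on line boundaries. The sketch below is an assumption-laden illustration, not part of the script: split_into_chunks is a hypothetical helper, and the 200,000-character default is an arbitrary budget that you would replace with whatever your model's limit allows.

```python
def split_into_chunks(text, max_chars=200_000):
    """Split text on line boundaries into chunks of at most max_chars characters.

    A single line longer than max_chars is kept whole, so that one chunk
    may exceed the budget.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        # Start a new chunk when adding this line would exceed the budget.
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Joining the returned chunks reproduces the original text exactly, so nothing is lost by splitting.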

Code

import os
import tika
from tika import parser
from langdetect import detect
import re
from pygments import highlight
from pygments.lexers import get_lexer_by_name
from pygments.formatters import TerminalFormatter
 
def is_english_or_code(text):
    """Detects if the given text is primarily English or contains code-like structures."""
    try:
        language = detect(text)
        return language == 'en' or contains_code(text)
    except Exception:  # langdetect raises on empty or ambiguous text
        return False
 
def contains_code(text):
    """Uses basic heuristics to check for code-like patterns."""
 
    # Incorporate code keywords
    code_keywords = [
        r'\bimport\b', r'\bdef\b', r'\bclass\b', r'#include\b', r'\bpublic\b',
        r'\bprivate\b', r'\bvoid\b', r'\bint\b', r'\breturn\b', r'\bsystem\b',
        r'\bfunction\b', r'console\.log\b', r'<\?php\b', r'\becho\b', r'\bvar\b',
        r'\$\w+', r'^\s*#', r'^\s*//', r'^\s*/\*', r'\bif\b', r'\belse\b',
        r'\bfi\b', r'\bwhile\b', r'\bfor\b', r'\bdone\b', r'\becho\b', r'\blet\b'
    ]
 
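    # Incorporate selected code patterns. These are deliberately broad: digits,
    # commas, colons, and quotes also occur in ordinary prose, so expect false positives.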
    code_patterns = [
        r'\b\d+\b', r'[{}\[\]();,:]', r'==|!=|<=|>=|->|=>',
        r'[/\\\+\-\*%\^]', r'"[^"]*"|\'[^\']*\'', r'\$\w+', r'<[^>]+>'
    ]
 
    for pattern in code_keywords + code_patterns:
        if re.search(pattern, text, re.MULTILINE):
            return True
    return False
 
def extract_text_from_file(filepath):
    """Extracts text from various file types using Tika."""
    parsed = parser.from_file(filepath)
    # Tika reports no content as None (e.g. for images), so fall back to an empty string
    return parsed.get('content') or ""
 
def process_directory(directory, output_file):
    """Recursively processes files in the given directory."""
    for root, dirs, files in os.walk(directory):
        for file in files:
            filepath = os.path.join(root, file)
            try:
                text = extract_text_from_file(filepath)
 
                # Drop the first line when a blank line follows it
                # (intended to strip stray header text from MS Word documents)
                text = re.sub(r'^.+?\n\n', '', text)
 
                if is_english_or_code(text):
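                    # Attempt syntax highlighting as an additional code detection mechanism.
                    # detect() returns a natural-language code such as 'en', which is not a
                    # Pygments lexer name, so let Pygments guess the lexer from the text instead.
                    try:
                        from pygments.lexers import guess_lexer
                        lexer = guess_lexer(text, stripall=True)
                        highlighted_text = highlight(text, lexer, TerminalFormatter())
                    except Exception:
                        highlighted_text = None  # Highlighting failed, likely not code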
 
                    with open(output_file, 'a', encoding='utf-8') as f:
                        if highlighted_text:  # If highlighting succeeded, use it
                            f.write(highlighted_text + '\n')
                        else:  # Otherwise, use the original text
                            f.write(text + '\n')
 
            except Exception as e:
                print(f"Error processing {filepath}: {e}")
 
if __name__ == "__main__":
    directory_to_scan = input("Enter the directory to scan: ")
    output_file = 'output_text.txt'
 
    # Clear the output file if it exists
    with open(output_file, 'w', encoding='utf-8') as f:
        pass
 
    process_directory(directory_to_scan, output_file)
    print(f"All English text and code content has been combined into {output_file}")
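
The script appends each file's text to the output with nothing to mark where one file ends and the next begins. When a prompt asks the LLM to identify source files, a path header in front of each file's text may help. The sketch below shows one possible way to do that; format_section and the "===== FILE: ... =====" delimiter are hypothetical and are not part of the script above.

```python
import os

def format_section(filepath, text, base_dir):
    """Prefix a file's extracted text with a header naming its path relative to base_dir."""
    rel_path = os.path.relpath(filepath, base_dir)
    # The delimiter format is an arbitrary choice for illustration.
    return f"===== FILE: {rel_path} =====\n{text.strip()}\n"
```

In process_directory, the value written to the output file could then be format_section(filepath, text, directory) instead of the bare text.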

Example

Input

A single folder containing 219 files in 21 folders, totaling 2.27 MB. File types: .js, .mjs, .md, .png, .scss, .txt, .ts, .tsx

Output

output_text.txt: 534 KB, 58,067 words, and 507,061 characters.
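
Figures like these can be measured for your own run with a few lines of Python. This is a sketch under assumptions: summarize_output is a hypothetical helper, and it counts words by splitting on whitespace, which may differ from the tool used to produce the numbers above.

```python
import os

def summarize_output(path):
    """Return the size in KB, word count, and character count of a text file."""
    size_kb = os.path.getsize(path) / 1024
    with open(path, encoding='utf-8') as f:
        content = f.read()
    return {
        "size_kb": round(size_kb),
        "words": len(content.split()),  # whitespace-separated tokens
        "characters": len(content),
    }
```

Calling summarize_output('output_text.txt') after the script finishes returns the three figures as a dictionary.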

Prompt

Text prompt, submitted with output_text.txt uploaded as context:

Provide instructions for removing %string% from my site. Use the manual/instructions uploaded. Identify the source files such that future compilations the static site generator no longer contain %string% in the footer.

LLM Response

Successful instructions generated

The footer you want to remove is generated by the Footer component, which is defined in the file /components/Footer.tsx. The text content of the footer, including the copyright notice and the links, are customizable through the Options interface that the Footer component constructor accepts. To remove the %string% text and the links from the footer, you need to modify the layout.ts file. In this file, locate the instantiation of the Footer component within the layout definition. You’ll likely find it within the sharedPageComponents object. Modify the Footer component instantiation to provide an empty links object to its constructor. This will effectively remove the “GitHub” and “Discord Community” links from the footer. Additionally, you can customize the text content of the footer by providing a modified string to the i18n(cfg.locale).components.footer.createdWith property within the Footer component. If you want to remove the text completely, you can set it to an empty string. The Footer component uses internationalization (i18n) to display the text “Created with”. If you have translations set up for different locales, make sure to update the corresponding translation files to remove or modify this text as well. If you upgrade, .layout.ts may get overwritten, and you may need to reapply these changes. Consider keeping a backup of your modified layout file or using a version control system like Git to track your changes.