Extract content from URLs, documents, videos, and audio files using intelligent auto-engine selection. Supports web pages, PDFs, Word docs, YouTube transcripts, and more with structured JSON responses.
Extract, process, and summarize content from URLs, files, and text through a unified async Python API, CLI, or MCP server.
| Category | Formats |
|---|---|
| Web | URLs, HTML pages, YouTube videos, Reddit posts |
| Documents | PDF, DOCX, PPTX, XLSX, EPUB, Markdown, plain text |
| Media | MP3, WAV, M4A, FLAC, OGG (audio); MP4, AVI, MOV, MKV (video) |
pip install content-core
import content_core
result = await content_core.extract_content(url="https://example.com")
print(result.content)
Or with zero install:
uvx content-core extract "https://example.com"
Content Core provides a single content-core command with subcommands for extraction, summarization, configuration, and running the MCP server.
# From a URL
content-core extract "https://example.com"
# From a file
content-core extract document.pdf
# With JSON output
content-core extract document.pdf --format json
# With a specific engine
content-core extract "https://example.com" --engine firecrawl
# From stdin
echo "some text" | content-core extract
# Summarize text
content-core summarize "Long article text here..."
# With context
content-core summarize "Long text" --context "bullet points"
# From stdin
cat article.txt | content-core summarize --context "explain to a child"
content-core mcp
# Set persistent config
content-core config set llm_provider anthropic
content-core config set llm_model claude-sonnet-4-20250514
# List current config
content-core config list
# Delete a config value
content-core config delete llm_provider
Config is stored in ~/.content-core/config.toml. Priority: command flags > env vars > config file > defaults.
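The priority order above can be pictured as a first-match lookup across four sources. The following is a minimal sketch of that idea, not Content Core's actual implementation: `resolve_setting` and its arguments are hypothetical names, and the `CCORE_` prefix is taken from the environment variable names documented for this project.

```python
from typing import Any, Mapping


def resolve_setting(
    name: str,
    flags: Mapping[str, Any],
    env: Mapping[str, str],
    file_config: Mapping[str, Any],
    defaults: Mapping[str, Any],
    env_prefix: str = "CCORE_",
) -> Any:
    """Return the first value found, checking sources from highest to lowest priority."""
    env_key = env_prefix + name.upper()  # e.g. llm_provider -> CCORE_LLM_PROVIDER
    lookups = (
        (flags, name),        # 1. command flags
        (env, env_key),       # 2. environment variables
        (file_config, name),  # 3. config file
        (defaults, name),     # 4. built-in defaults
    )
    for source, key in lookups:
        value = source.get(key)
        if value is not None:
            return value
    return None


if __name__ == "__main__":
    # Illustrative values only: the env var wins over the config file here.
    print(
        resolve_setting(
            "llm_provider",
            flags={},
            env={"CCORE_LLM_PROVIDER": "groq"},
            file_config={"llm_provider": "anthropic"},
            defaults={},
        )
    )
```

A source that does not define the setting is skipped, so a value in the config file only takes effect when neither a flag nor an environment variable overrides it.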
All commands work without installation using uvx:
uvx content-core extract "https://example.com"
uvx content-core summarize "text" --context "one sentence"
uvx content-core mcp
import content_core
# From a URL
result = await content_core.extract_content(url="https://example.com")
# From a file
result = await content_core.extract_content(file_path="document.pdf")
# From text
result = await content_core.extract_content(content="some text")
# With engine override
from content_core import ContentCoreConfig
config = ContentCoreConfig(url_engine="firecrawl")
result = await content_core.extract_content(url="https://example.com", config=config)
import content_core
summary = await content_core.summarize("long article text", context="bullet points")
from content_core import ContentCoreConfig
config = ContentCoreConfig(
    url_engine="firecrawl",
    document_engine="docling",
    audio_concurrency=5,
)
result = await content_core.extract_content(url="https://example.com", config=config)
Content Core includes a Model Context Protocol (MCP) server for use with Claude Desktop and other MCP-compatible applications.
Add to your claude_desktop_config.json:
{
  "mcpServers": {
    "content-core": {
      "command": "uvx",
      "args": ["content-core", "mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
The MCP server exposes two tools: extract_content and summarize_content. Both return plain text.
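If claude_desktop_config.json already lists other servers, the content-core entry needs to be merged in rather than pasted over the file. The sketch below builds the same entry shown above and adds it to an existing config dictionary. `add_content_core_server` is a hypothetical helper and `other-server` is a made-up placeholder; reading and writing the actual file is left out because its location depends on the platform.

```python
import json
from typing import Dict, Optional


def add_content_core_server(config: dict, env: Optional[Dict[str, str]] = None) -> dict:
    """Return a copy of `config` with a content-core entry under mcpServers."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))

    # Same command and args as the JSON snippet in this section.
    entry = {"command": "uvx", "args": ["content-core", "mcp"]}
    if env:
        entry["env"] = dict(env)

    servers["content-core"] = entry
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    # "other-server" is a placeholder for whatever is already configured.
    existing = {"mcpServers": {"other-server": {"command": "example"}}}
    updated = add_content_core_server(existing, {"OPENAI_API_KEY": "sk-..."})
    print(json.dumps(updated, indent=2))
```

The helper copies the top-level dictionaries instead of mutating its input, so the original config can be kept as a backup before anything is written to disk.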
For detailed setup, see the MCP documentation.
Content Core includes a SKILL.md that teaches AI agents how to use it for extracting content from external sources. To make it available in your Claude Code project, copy it to your skills directory:
# Download the skill
curl -o .claude/skills/content-core/SKILL.md --create-dirs \
https://raw.githubusercontent.com/lfnovo/content-core/main/SKILL.md
Once installed, Claude Code can use content-core to extract content from URLs, documents, and media files — either via CLI (uvx content-core) or MCP if configured.
Content Core uses Esperanto to support multiple LLM and STT providers. Switch providers by changing the config — no code changes needed:
# Use Anthropic for summarization
content-core config set llm_provider anthropic
content-core config set llm_model claude-sonnet-4-20250514
# Use Groq for transcription
content-core config set stt_provider groq
content-core config set stt_model whisper-large-v3
Supported providers include OpenAI, Anthropic, Google, Groq, DeepSeek, Ollama, and more. See the Esperanto documentation for the full list.
Content Core uses ContentCoreConfig powered by pydantic-settings. Settings are resolved in priority order: constructor args > env vars (CCORE_*) > config file (~/.content-core/config.toml) > defaults.
| Variable | Description | Default |
|---|---|---|
| CCORE_URL_ENGINE | URL extraction engine (auto, simple, firecrawl, jina, crawl4ai) | auto |
| CCORE_DOCUMENT_ENGINE | Document extraction engine (auto, simple, docling) | auto |
| CCORE_AUDIO_CONCURRENCY | Concurrent audio transcriptions (1-10) | 3 |
| CRAWL4AI_API_URL | Crawl4AI Docker API URL (omit for local browser mode) | - |
| FIRECRAWL_API_URL | Custom Firecrawl API URL for self-hosted instances | - |
| CCORE_FIRECRAWL_PROXY | Firecrawl proxy mode (auto, basic, stealth) | auto |
| CCORE_FIRECRAWL_WAIT_FOR | Wait time in ms before extraction | 3000 |
| CCORE_LLM_PROVIDER | LLM provider for summarization | - |
| CCORE_LLM_MODEL | LLM model for summarization | - |
| CCORE_STT_PROVIDER | Speech-to-text provider | - |
| CCORE_STT_MODEL | Speech-to-text model | - |
| CCORE_STT_TIMEOUT | Speech-to-text timeout in seconds | - |
| CCORE_YOUTUBE_LANGUAGES | Preferred YouTube transcript languages | - |
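When these variables are set from a script (for example before launching the CLI in a subprocess), it can help to check the values against the documented options first. The sketch below does that for three of the variables listed above. `build_ccore_env` is a hypothetical helper, and the validation is this example's own; Content Core may validate or reject values differently.

```python
from typing import Dict

# Allowed values copied from the variable descriptions above.
URL_ENGINES = {"auto", "simple", "firecrawl", "jina", "crawl4ai"}
DOCUMENT_ENGINES = {"auto", "simple", "docling"}


def build_ccore_env(
    url_engine: str = "auto",
    document_engine: str = "auto",
    audio_concurrency: int = 3,
) -> Dict[str, str]:
    """Return CCORE_* environment variables after checking the documented ranges."""
    if url_engine not in URL_ENGINES:
        raise ValueError(f"unknown URL engine: {url_engine!r}")
    if document_engine not in DOCUMENT_ENGINES:
        raise ValueError(f"unknown document engine: {document_engine!r}")
    if not 1 <= audio_concurrency <= 10:
        raise ValueError("audio_concurrency must be between 1 and 10")

    # Environment variable values are always strings.
    return {
        "CCORE_URL_ENGINE": url_engine,
        "CCORE_DOCUMENT_ENGINE": document_engine,
        "CCORE_AUDIO_CONCURRENCY": str(audio_concurrency),
    }


if __name__ == "__main__":
    print(build_ccore_env(url_engine="firecrawl", audio_concurrency=5))
```

The returned dictionary can be merged into `os.environ` or passed as the `env` argument of `subprocess.run`, depending on how the tool is being launched.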
API keys for external services are set via their standard environment variables (e.g., OPENAI_API_KEY, FIRECRAWL_API_KEY, JINA_API_KEY).
Content Core reads standard HTTP_PROXY / HTTPS_PROXY / NO_PROXY environment variables automatically. No additional configuration is needed.
# Docling for advanced document parsing (PDF, DOCX, PPTX, XLSX)
pip install "content-core[docling]"
# Crawl4AI for local browser-based URL extraction
pip install "content-core[crawl4ai]"
python -m playwright install --with-deps
# LangChain tool wrappers
pip install "content-core[langchain]"
# All optional features (quotes keep shells such as zsh from expanding the brackets)
pip install "content-core[docling,crawl4ai,langchain]"
When installed with the langchain extra, Content Core provides LangChain-compatible tool wrappers:
from content_core.tools import extract_content_tool, summarize_content_tool
tools = [extract_content_tool, summarize_content_tool]
git clone https://github.com/lfnovo/content-core
cd content-core
uv sync --group dev
# Run tests
make test
# Lint
make ruff
This project is licensed under the MIT License.
Contributions are welcome! Please see our Contributing Guide for details.