Migrating File_formats.py: A Foundation For Code Generation
Overview
Hey guys! This is the FOUNDATION issue for the constants migration project. Think of it as setting the stage and laying the groundwork for Issues #2-5. Basically, this issue sets up the infrastructure and pattern that the other migration issues will follow. So, it's super important that we nail this one first before moving on. Let's dive in!
Context
Okay, so right now, le_utils/constants/file_formats.py is using what we can call the "old school" approach. It's like this:
- It loads
resources/formatlookup.jsonat runtime usingpkgutil.get_data(). Think of it as grabbing a file on the fly. - We have to manually keep Python constants (
MP4 = "mp4",PDF = "pdf", etc.) in sync. Imagine writing the same thing twice – not fun! - There's a manual
_FORMATLOOKUPdictionary and agetformat()helper function. It's like having to build our own tools from scratch. - No JavaScript export available. So, our JavaScript friends are left out in the cold.
- Tests verify Python/JSON sync. We're making sure things match, but it's a bit of a manual process.
This issue is all about migrating to a more modern approach – using a spec and code generation, just like 8 other modules already do. This means we define the formats in a structured way and then automatically generate the Python and JavaScript code. How cool is that?
Scope
So, what are we actually going to do in this issue? Here’s the breakdown:
- Enhance
generate_from_specs.py: This is the key part! We need to make our code generation script capable of handling namedtuple-based constants. This is the major infrastructure work. - Create
spec/constants-file_formats.json:** We'll create a JSON file following the new format. This will be the single source of truth for our file formats. - Generate Python and JavaScript files via
make build:** We'll use our updated script to automatically create the necessary code. - Update tests to verify against the spec:** No more manual syncing! Our tests will now check against the JSON spec.
- Delete
resources/formatlookup.json:** We're getting rid of the old way of doing things. - Document the spec format for Issues #2-5 to follow:** We'll create a guide so that others can easily follow this pattern.
That's a lot, but it's all about setting us up for success in the future.
Current Structure
Let's take a look at what we're working with right now.
File: le_utils/resources/formatlookup.json (currently only has 20 formats)
{
"mp4": {"mimetype": "video/mp4"},
"webm": {"mimetype": "video/webm"},
"vtt": {"mimetype": ".vtt"},
"pdf": {"mimetype": "application/pdf"},
...
}
Python module (file_formats.py) currently has 40+ manual constants, including:
- Formats in JSON:
MP4,WEBM,VTT,PDF,EPUB,MP3,JPG,JPEG,PNG,GIF,JSON,SVG,GRAPHIE,PERSEUS,H5P,ZIM,HTML5(zip),BLOOMPUB,BLOOMD,HTML5_ARTICLE(kpub) - Formats NOT in JSON (these need to be added to the spec):
AVI,MOV,MPG,WMV,MKV,FLV,OGV,M4V,SRT,TTML,SAMI,SCC,DFXP - Namedtuple:
class Format(namedtuple("Format", ["id", "mimetype"])): pass LIST, choices tuple, helper functiongetformat()
See all those formats not in the JSON? We're going to fix that!
Target Spec Format
We're going to create spec/constants-file_formats.json and include ALL formats, even the ones missing from the current JSON. This will be our single source of truth.
{
"namedtuple": {
"name": "Format",
"fields": ["id", "mimetype"]
},
"constants": {
"mp4": {"mimetype": "video/mp4"},
"webm": {"mimetype": "video/webm"},
"avi": {"mimetype": "video/x-msvideo"},
"mov": {"mimetype": "video/quicktime"},
"mpg": {"mimetype": "video/mpeg"},
"wmv": {"mimetype": "video/x-ms-wmv"},
"mkv": {"mimetype": "video/x-matroska"},
"flv": {"mimetype": "video/x-flv"},
"ogv": {"mimetype": "video/ogg"},
"m4v": {"mimetype": "video/x-m4v"},
"vtt": {"mimetype": "text/vtt"},
"srt": {"mimetype": "application/x-subrip"},
"ttml": {"mimetype": "application/ttml+xml"},
"sami": {"mimetype": "application/x-sami"},
"scc": {"mimetype": "text/x-scc"},
"dfxp": {"mimetype": "application/ttaf+xml"},
"mp3": {"mimetype": "audio/mpeg"},
"pdf": {"mimetype": "application/pdf"},
"epub": {"mimetype": "application/epub+zip"},
"jpg": {"mimetype": "image/jpeg"},
"jpeg": {"mimetype": "image/jpeg"},
"png": {"mimetype": "image/png"},
"gif": {"mimetype": "image/gif"},
"json": {"mimetype": "application/json"},
"svg": {"mimetype": "image/svg+xml"},
"graphie": {"mimetype": "application/graphie"},
"perseus": {"mimetype": "application/perseus+zip"},
"h5p": {"mimetype": "application/zip"},
"zim": {"mimetype": "application/zim"},
"zip": {"mimetype": "application/zip"},
"bloompub": {"mimetype": "application/bloompub+zip"},
"bloomd": {"mimetype": "application/bloompub+zip"},
"kpub": {"mimetype": "application/kpub+zip"}
}
}
How to determine mimetypes for missing formats:
- Check MDN Web Docs for standard mimetypes: https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types
- For video/subtitle formats, use common IANA registered types or conventional
x-prefixes. Think of it as following the standards or making our own if we have to. - For custom formats (graphie, perseus, kpub, etc.), use
application/{format}orapplication/{format}+zippattern. It's like creating a consistent naming scheme. - When uncertain, search "mime type for {extension}" or check existing file type databases. Google is your friend!
Generation Script Enhancement
We need to update scripts/generate_from_specs.py to handle the namedtuple format. This is where the magic happens!
- Modify
read_constants_specs()to detect and handle the namedtuple format:- Check if the spec has a
namedtuplekey. It's like looking for a specific ingredient in a recipe. - If yes, extract the namedtuple definition and constants. We're grabbing the important bits.
- If no, use the existing simple constant handling. We're being flexible and handling different cases.
- Check if the spec has a
- Update
write_python_file()to support namedtuples:- Add
from collections import namedtupleimport when needed. We need to import the right tools for the job. - Generate the namedtuple class definition. We're creating the structure for our data.
- Generate
{MODULE}LISTwith namedtuple instances. We're creating a list of our formats. - Generate uppercase constants from keys (e.g.,
MP4 = "mp4"). This makes our code easier to read and use. - Generate
_MIMETYPEconstants (e.g.,MP4_MIMETYPE = "video/mp4") for each format. We're providing extra information about each format. - Generate choices tuple with custom display names (from spec or title-cased). We're making it easy to display the formats in a user-friendly way.
- Generate lookup dict:
_{MODULE}LOOKUP = {item.id: item for item in {MODULE}LIST}. This allows us to quickly find a format by its ID. - Generate helper function (e.g.,
getformat()). We're making it easy to access the formats.
- Add
- Update
write_js_file()to export rich namedtuple data with PascalCase:- Export constant name → id mapping (default export, e.g.,
MP4: "mp4"). This is the basic way to use the formats in JavaScript. - Export
FormatsList- full namedtuple data as an array. This gives us access to all the information about each format. - Export
FormatsMap- Map for efficient lookups. This allows us to quickly find a format by its ID in JavaScript.
- Export constant name → id mapping (default export, e.g.,
Generated Output Example
Let's see what the generated code will look like.
Python (le_utils/constants/file_formats.py):
# -*- coding: utf-8 -*-
# Generated by scripts/generate_from_specs.py
from __future__ import unicode_literals
from collections import namedtuple
# FileFormats
class Format(namedtuple("Format", ["id", "mimetype"])): pass
# Format constants
MP4 = "mp4"
WEBM = "webm"
AVI = "avi"
PDF = "pdf"
# ... (all formats)
# Mimetype constants
MP4_MIMETYPE = "video/mp4"
WEBM_MIMETYPE = "video/webm"
AVI_MIMETYPE = "video/x-msvideo"
PDF_MIMETYPE = "application/pdf"
# ...
choices = (
(MP4, "Mp4"),
(WEBM, "Webm"),
(AVI, "Avi"),
(PDF, "Pdf"),
# ...
)
FORMATLIST = [
Format(id="mp4", mimetype="video/mp4"),
Format(id="webm", mimetype="video/webm"),
Format(id="avi", mimetype="video/x-msvideo"),
# ...
]
_FORMATLOOKUP = {f.id: f for f in FORMATLIST}
def getformat(id, default=None):
"""
Try to lookup a file format object for its `id` in internal representation.
Returns None if lookup by internal representation fails.
"""
return _FORMATLOOKUP.get(id) or None
JavaScript (js/FileFormats.js):
// Generated by scripts/generate_from_specs.py
// Format constants
export default {
MP4: "mp4",
WEBM: "webm",
AVI: "avi",
PDF: "pdf",
// ...
};
// Full format data with mimetypes
export const FormatsList = [
{ id: "mp4", mimetype: "video/mp4" },
{ id: "webm", mimetype: "video/webm" },
{ id: "avi", mimetype: "video/x-msvideo" },
{ id: "pdf", mimetype: "application/pdf" },
# ...
];
// Lookup Map
export const FormatsMap = new Map(
FormatsList.map(format => [format.id, format])
);
This is how JavaScript code can use the generated constants:
- Use constants:
import FileFormats from './FileFormats'; if (ext === FileFormats.MP4) ... - Access full data:
import { FormatsList } from './FileFormats'; - Look up by id:
import { FormatsMap } from './FileFormats'; const format = FormatsMap.get('pdf');
Testing Updates
File: tests/test_formats.py
We need to update our tests to check against the spec instead of the old JSON. This is how we make sure everything is working correctly.
import os
import json
spec_path = os.path.join(os.path.dirname(__file__), "..", "spec", "constants-file_formats.json")
with open(spec_path) as f:
spec = json.load(f)
formatlookup = spec["constants"]
# Verify all constants in Python module match spec
# Verify FORMATLIST namedtuples match spec data
# Test getformat() helper
# Verify _MIMETYPE constants
How to Run Tests
Here's how to run the tests:
# Run file formats tests
pytest tests/test_formats.py -v
# Run all tests to ensure nothing broke
pytest tests/ -v
Acceptance Criteria
What needs to be done for this issue to be considered complete?
- [ ]
scripts/generate_from_specs.pyenhanced to support namedtuple specs - [ ]
spec/constants-file_formats.jsoncreated with ALL formats (including AVI, MOV, SRT, etc. currently missing) - [ ] Mimetypes determined for all missing formats (using MDN/IANA resources)
- [ ]
make buildsuccessfully generates Python and JavaScript files - [ ] Generated
le_utils/constants/file_formats.pyhas:- [ ] Namedtuple class definition
- [ ] Uppercase format constants for ALL formats
- [ ]
_MIMETYPEconstants for each format - [ ]
choicestuple - [ ]
FORMATLISTwith namedtuple instances - [ ]
_FORMATLOOKUPdict - [ ]
getformat()helper function
- [ ] Generated
js/FileFormats.jshas:- [ ] Default export with constant name mappings
- [ ]
FormatsListexport (PascalCase) with full data - [ ]
FormatsMapexport (PascalCase) as Map
- [ ]
tests/test_formats.pyupdated to test against spec - [ ] All tests pass:
pytest tests/ -v - [ ]
resources/formatlookup.jsondeleted - [ ] Auto-generated comment in code
Notes for Issues #2-5
Once this is done, we'll have a clear pattern for the other issues:
- Create spec with
namedtupleandconstantskeys - Include ALL constants (even if not in the old JSON)
- Run
make build - Update tests to reference spec
- Delete old JSON resource
The generation script will handle everything automatically, including PascalCase JS exports. This is going to make our lives so much easier!
Related Issues
This is part of the tracking issue #181.